|
| 1 | +# String Matching algorithm |
| 2 | + |
| 3 | + |
| 4 | + |
| 5 | +## Rabin-Karp |
| 6 | +We can view a string of k characters (digits) as a length-k decimal number. E.g., the string “31425” corresponds to the decimal number 31,425. |
| 7 | +- Given a pattern P [1..m], let p denote the corresponding decimal value. |
| 8 | +- Given a text T [1..n], let $t_s$ denote the decimal value of the length-m substring T [(s+1)..(s+m)] for s=0,1,…,(n-m). |
| 9 | +- let `d` be the radix of num, thus $d = len(set(s))$ |
| 10 | +- $t_s$ = p iff T [(s+1)..(s+m)] = P [1..m]. |
| 11 | +- p can be computed in O(m) time. p = P[m] + d\*(P[m-1] + d\*(P[m-2]+…)). |
| 12 | +- t0 can similarly be computed in O(m) time. |
| 13 | +- Other $t_1,\ldots,t_{n-m}$ can be computed in O(n-m) time since $t_{s+1} can be computed from ts in constant time. |
| 14 | +Namely, |
| 15 | + |
| 16 | +$$ |
| 17 | +t_{s+1} = d*(t_s-d^{m-1} * T[s+1])+T[s+m+1] |
| 18 | +$$ |
| 19 | +However, it's no need to calculate $t_{s+1}$ directly. We can use modulus operation to reduce the work of caculation. |
| 20 | + |
| 21 | +We choose a small prime number. Eg 13 for radix( noted as d) 10. |
| 22 | +Generally, d\*q should fit within one computer word. |
| 23 | + |
| 24 | +We firstly caculate t0 mod q. |
| 25 | +Then, for every $t_i (i>1)$ |
| 26 | +assume |
| 27 | +$$ |
| 28 | + t_{i-1} = T[i+m-1] + 10*T[i+m-2]+\ldots+10^{m-1}*T[i-1] |
| 29 | +$$ |
| 30 | +denote $ d' = d^{m-1}\ mod\ q$ |
| 31 | +thus, |
| 32 | +$$ |
| 33 | +\begin{aligned} |
| 34 | +t_i &= (t_{i-1} - d^{m-1}*T[i-1]) * d + T[i+m]\\ |
| 35 | +&\equiv (t_{i-1} - d^{m-1}*T[i-1]) * d + T[i+m] (mod\ q)\\ |
| 36 | +&\equiv (t_{i-1}- ( d^{m-1} mod \ q) *T[i-1]) * d + T[i+m] (mod\ q)\\ |
| 37 | +&\equiv (t_{i-1}- d'*T[i-1]) * d + T[i+m] (mod\ q) |
| 38 | +\end{aligned} |
| 39 | +$$ |
| 40 | + |
| 41 | +So we can compare the modular value of each ti with p's. |
| 42 | +Only if they are the same, then we compare the origin chracter, namely $T[i],T[i+1],\ldots,T[i+m-1]$ and the pattern. |
| 43 | +Gernerally, this algorithm's time approximation is O(n+m), and the worst case is O((n-m+1)\*m) |
| 44 | + |
| 45 | +**Problem: this is assuming p and ts are small numbers. They may be too large to work with easily.** |
| 46 | + |
| 47 | +## FSM |
| 48 | +A FSM can be represented as (Q,q0,A,S,C), where |
| 49 | +- Q is the set of all states |
| 50 | +- q0 is the start state |
| 51 | +- $A\in Q$ is a set of accepting states. |
| 52 | +- S is a finite input alphabet. |
| 53 | +- C is the set of transition functions: namely $q_j = c(s,q_i)$. |
| 54 | + |
| 55 | +Given a pattern string S, we can build a FSM for string matching. |
| 56 | +Assume S has m chars, and there should be m+1 states. One is for the begin state, and the others are for matching state of each position of S. |
| 57 | + |
| 58 | +Once we have built the FSM, we can run it on any input string. |
| 59 | +## KMP |
| 60 | +>Knuth-Morris-Pratt method |
| 61 | +
|
| 62 | +The idea is inspired by FSM. We can avoid computing the transition functions. Instead, we compute a prefix functi`Next` on P in O(m) time, and Next has only m entries. |
| 63 | +> Prefix funtion stores info about how the pattern matches against shifts of itself. |
| 64 | +
|
| 65 | +- String w is a prefix of string x, if x=wy for some string y |
| 66 | +- String w is a suffix of string x, if x=yw for some string y |
| 67 | +- The k-character prefix of the pattern P [1..m] denoted by Pk. |
| 68 | +- Given that pattern prefix P [1..q] matches text characters T [(s+1)..(s+q)], what is the least shift s'> s such that P [1..k] = T [(s'+1)..(s'+k)] where s'+k=s+q? |
| 69 | +- At the new shift s', no need to compare the first k characters of P with corresponding characters of T. |
| 70 | +Method: For prefix pi, find the longest proper prefix of pi that is also a suffix of pi. |
| 71 | +next[q] = max{k|k\<q and pk is a suffix of pq} |
| 72 | + |
| 73 | +For example: p = ababaca, for p5 = ababa, Next[5] = 3. Namely p3=aba is the longest prefix of p that is also a suffix of p5. |
| 74 | + |
| 75 | +Time approximation: finding prefix function `next` take O(m), matching takes O(m+n) |
| 76 | + |
| 77 | +## Boyer-Moore |
| 78 | +- The longer the pattern is, the faster it works. |
| 79 | +- Starts from the end of pattern, while KMP starts from the beginning. |
| 80 | +- Works best for character string, while KMP works best for binary string. |
| 81 | +- KMP and Boyer-Moore |
| 82 | + - Preprocessing existing patterns. |
| 83 | + - Searching patterns in input strings. |
| 84 | +## Sunday |
| 85 | +### features |
| 86 | +- simplification of the Boyer-Moore algorithm; |
| 87 | +- uses only the bad-character shift; |
| 88 | +- easy to implement; |
| 89 | +- preprocessing phase in O(m+sigma) time and O(sigma) space complexity; |
| 90 | +- searching phase in O(mn) time complexity; |
| 91 | +- very fast in practice for short patterns and large alphabets. |
| 92 | +### description |
| 93 | +The Quick Search algorithm uses only the bad-character shift table (see chapter Boyer-Moore algorithm). After an attempt where the window is positioned on the text factor y[j .. j+m-1], the length of the shift is at least equal to one. So, the character y[j+m] is necessarily involved in the next attempt, and thus can be used for the bad-character shift of the current attempt. |
| 94 | + |
| 95 | +The bad-character shift of the present algorithm is slightly modified to take into account the last character of x as follows: for c in Sigma, qsBc[c]=min{i : 0 < i leq m and x[m-i]=c} if c occurs in x, m+1 otherwise (thanks to Darko Brljak). |
| 96 | + |
| 97 | +The preprocessing phase is in O(m+sigma) time and O(sigma) space complexity. |
| 98 | + |
| 99 | +During the searching phase the comparisons between pattern and text characters during each attempt can be done in any order. The searching phase has a quadratic worst case time complexity but it has a good practical behaviour. |
| 100 | + |
| 101 | +For instance, |
| 102 | + |
| 103 | + |
| 104 | +In this example, t0, ..., t4 = a b c a b is the current text window that is compared with the pattern. Its suffix a b has matched, but the comparison c-a causes a mismatch. The bad-character heuristics of the Boyer-Moore algorithm (a) uses the "bad" text character c to determine the shift distance. The Horspool algorithm (b) uses the rightmost character b of the current text window. The Sunday algorithm (c) uses the character directly right of the text window, namely d in this example. Since d does not occur in the pattern at all, the pattern can be shifted past this position. |
| 105 | + |
| 106 | + |
| 107 | +# Reference: |
| 108 | +1. Xuyun, ppt, String matching |
| 109 | +2. [Sunday-algorithm](http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/sunday.htm) |
| 110 | +3. GeeksforGeeks, [KMP Algorithm](https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/) |
0 commit comments