Knuth-Morris-Pratt algorithm
The Knuth-Morris-Pratt string searching algorithm locates a pattern of characters within a string. It uses the fact that after finding the first mismatch between pattern and string we already know the characters compared before to improve the efficiency of the search. We therefore will have to compare the pattern to itself to determine how many positions we have to move to the left.The algorithm was invented by Knuth and Pratt and independently by J. H. Morris in 1976.
| Table of contents |
|
2 Example 3 Notes on the Algorithm |
Overview of Knuth-Morris-Pratt Algorithm
Example
String: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD
Building 'next' table
| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| char | A | B | C | D | A | B | D | - |
| pos | -1 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
This is the precomputed table of offsets generated by the algorithm. If we look at the table, we can see that these numbers are the maximum number of characters matchable before the nth position on the string.
For example take position 6
ABCDABD
ABCDABD
^The pointer is at position 6. As you can see above, the maximum number of characters that can matched before position 6 is 2. The array entry 6 is 2.
We now have the table so we can use the algorithm to check the string for matches:
We check the last piece of the array (index 7) since we have had a match, it contains position zero. So we check the last 'E' against 'A', it is not a match and we have finished searching the string.
This algorithm builds the next array as a finite state machine and uses it to dictate to the algorithm what to do at each point.
The complexity of processing the pattern is o(m), where m is the number of characters in the string.
The complexity of searching the string for the given pattern is o(n) where n is the length of the main string.
Processing text
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
We find that the first three match, but the fourth does not, so we look up the fourth character (or index 3) in the table, this gives us zero. We should now try matching position zero on the pattern with the fourth character on the string:ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
This time the match fails on the first character (or index 0) on the pattern, which means we go to the lookup table, the table is -1 therefore we now try the fifth character on the main string:ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
We continue to increment in this way, sometimes matching, sometimes not, until we reach the end of 'ABCDAB'.ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
This last 'D' does not match, so we look up in the table, find 2 so we start from position 2:ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
So we match the remaining characters and when we reach the end of the pattern we know it is a match.Notes on the Algorithm