String Matching detecting the occurrence of a particular substring (pattern) in another string (text) • A straightforward Solution • The Knuth-Morris-Pratt Algorithm • The Boyer-Moore Algor

String Matching detecting the occurrence of a particular substring (pattern) in another string (text) • A straightforward Solution • The Knuth-Morris-Pratt Algorithm • The Boyer-Moore Algorithm TECH Computer Science

Straightforward solution • Algorithm: Simple string matching • Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty. • Output: The return value is the index in T where a copy of P begins, or -1 if no match for P is found.

Finite Automata • Terminologies : the alphabet *: the set of all finite-length strings formed using characters from . xy: concatenation of two strings x and y. Prefix: a string w is a prefix of a string x if x=wy for some string y *. Suffix: a string w is a suffix of a string x if x= yw for some string y *.

Strategy • In general, if there is a partial match of j chars starting at i, then we know what is in position T[i]…T[i+j-1]. So we can save by 1. 2. Skip outer iterations (for which no match possible) Skip inner iterations (when no need to test know matches). • When a mismatch occurs, we want to slide P forward, but maintain the longest overlap of a prefix of P with a suffix of the part of the text that has matched the pattern so far. • KMP algorithm achieves linear time performance by capitalizing on the observation above, via building a simplified finite automaton: each node has only two links, success and fail.

The Knuth-Morris-Pratt Flowchart • Character labels are inside the nodes • Each node has two arrows out to other nodes: success link, or fail link • next character is read only after a success link • A special node, node 0, called “get next char” which read in next text character. e.g. P = “ABABCB”

Algorithm: KMP flowchart construction • Input: P,a string of characters;m,the length of P. • Output: fail,the array of failure links,defined for indexes 1,...,m.The array is passed in and the algorithm fills it. • Step: • void kmpSetup(char[] P, int m, int[] fail) • int k,s • 1. fail[1]=0; • 2. for(k=2;k<=m;k++) • 3. s=fail[k-1]; • 4. while(s>=1) • 5. if(ps==pk-1) • 6. break; • 7. s=fail[s]; • 8. fail[k]=s+1;

Analysis • KMP Flowchart Construction require 2m – 3 character comparisons in the worst case • The scan algorithm requires 2n character comparisons in the worst case • Overall: Worst case complexity is (n+m)

Algorithm:Computing Jumps for the Boyer-Morre Algorithm • Input:Pattern string P:m the length of P;alphabet size alpha=|| • Output:Array charJump,defined on indexes 0,....,alpha-1.The array is passed in and the algorithm fills it. • void computeJumps(char[] P,int m,int alpha,int[] charJump) • char ch; int k; • for (ch=0;ch

Summary • Straightforward algorithm: O(nm) • Finite-automata algorithm: O(n) • KMP algorithm: O(n+m) Relatively easier to implement Do not require random access to the text • BM algorithm: O(n+m), worst, sublinear average Fewer character comparison The algorithm of choice in practice for string matcing