20: string matching #1

Brute-Force | Hancart | Knuth-Morris-Pratt | Boyer-Moore | Bitvector

Brute-Force String Matching

Algorithm:
1:  
algorithm
 
bruteForceStringMatching(
pattern 
P, 
text 
t)

2:  
 
 
0;

3:  
 
 
while
 
<= 
|T| 
|P|

4:  
 
 
 
 
0;

5:  
 
 
 
 
while
 
T[
i]
 
== 
P[
j]
 
and 
|P|

6:  
 
 
 
 
 
 
++i;

7:  
 
 
 
 
 
 
++j;

8:  
 
 
 
 
if
 
== 
|P|

9:  
 
 
 
 
 
 
return
 
match 
at 
|P|

10: 
 
 
 
 
1;

11: 
 
 
return
 
no 
match;

Example:
1:  
text 
ababcdabbabababad

2:  
pattern 
abababa

3:  

4:  
ababcdabbabababad

5:  
abababa

6:  
 
abababa

7:  
 
 
abababa

8:  
 
 
 
abababa

9:  
 
 
 
 
abababa

10: 
 
 
 
 
 
abababa

11: 
 
 
 
 
 
 
abababa

12: 
 
 
 
 
 
 
 
abababa

13: 
 
 
 
 
 
 
 
 
abababa 
 
 
 
 
 
 
 

14: 
 
 
 
 
 
 
 
 
 
abababa

  • Worst Case Time Complexity: O(T*P)

Hancart String Matching

Hancart also have worst case time complexity of O(T*P) but on average it performs better than the more complicated algorithms in the rest of this lecture.
  • The order of characters involved is: P[1], P[2], ... P[size-1], P[0]
1:  
algorithm
 
hancart(
pattern 
P, 
text 
T)

2:  
 
 
if
 
P[
0]
 
== 
P[
1]

3:  
 
 
 
 
sEqual 
1;

4:  
 
 
 
 
sDiff 
2;

5:  
 
 
else

6:  
 
 
 
 
sEqual 
2;

7:  
 
 
 
 
sDiff 
1;

8:  
 
 
0;

9:  
 
 
while
 
<= 
|T| 
|P|

10: 
 
 
 
 
if
 
T[
i+1]
 
!= 
P[
1]

11: 
 
 
 
 
 
 
sDiff;

12: 
 
 
 
 
else

13: 
 
 
 
 
 
 
1;

14: 
 
 
 
 
 
 
while
 
|P| 
and 
T[
i+j]
 
== 
P[
j]

15: 
 
 
 
 
 
 
 
 
++j;

16: 
 
 
 
 
 
 
if
 
== 
|P| 
and 
P[
0]
 
== 
T[
i]

17: 
 
 
 
 
 
 
 
 
return
 
match 
at 
i;

18: 
 
 
 
 
 
 
sEqual;

19: 
 
 
return
 
no 
match;

  • If the first two characters of the pattern are the same, after a mismatch with T[i+1], i is shifted over two
  • But if the first two characters of the pattern are the same and there is a mismatch in the inner while, the pattern is shifted one position
  • But if the first two characters are different (a more common case), a mismatch at T[i+1] is shifted over one and a mismatch in the inner while loop is shifted over two
Example #1 (First two characters of pattern are the same):
Pattern: aabcda
Text:    aabaabcda

First P[1] matches T[1]. Then there is a mismatch later at c and a. Shift by one.

Pattern:  aabcda
Text:    aabaabcda

P[1] does not match T[1]. Since P[0]==P[1], T[1] will never match P[0]. Shift by two.

Pattern:    aabcda
Text:    aabaabcda

P[1] through P[5] match. Check P[0]. There is a match so the pattern matches the text
at location 3.
Example #2 (First two characters of the pattern are different):
Pattern: gabcda
Text:    fgagabcda

First P[1] and T[1] does not match. Shift by one because T[1] could match P[0].

Pattern:  gabcda
Text:    fgagabcda

P[1] and T[1] match. Then there is a mismatch at P[2] and T[1]. Shift by two because
T[1] will never match P[0].

Pattern:    gabcda
Text:    fgagabcda

P[1] through P[5] match. Check P[0]. There is a match so the pattern matches the text
at location 3.

Knuth-Morris-Pratt [4-5]

In this algorithm, the pattern is pre-processed to find the longest repeating suffix for each element in the pattern. This is like generalizing Hancart to have more than two character prefix match.
1:  
algorithm
 
findNext(
pattern 
P, 
table 
next)
{

2:  
 
 
next[
0]
 
-1;

3:  
 
 
0;

4:  
 
 
-1;

5:  
 
 
while
(
<
 
|P|)
{

6:  
 
 
 
 
while
(
== 
-1 
or 
<
 
|P| 
and 
P[
i]
 
== 
P[
j]
)
{

7:  
 
 
 
 
 
 
i++;

8:  
 
 
 
 
 
 
j++;

9:  
 
 
 
 
 
 
next[
i]
 
j;

10: 
 
 
 
 
}

11: 
 
 
 
 
next[
j]
;

12: 
 
 
}

13: 
}

14: 

15: 
algorithm
 
knuthMorrisPratt(
pattern 
p, 
text 
T)
{

16: 
 
 
findNext(
p, 
next)
;

17: 
 
 
0;

18: 
 
 
while
(
<
|T| 
|P|)
{

19: 
 
 
 
 
while
(
== 
-1 
or 
|P| 
and 
T[
i]
 
== 
P[
j]
)
{

20: 
 
 
 
 
 
 
++i;

21: 
 
 
 
 
 
 
++j;

22: 
 
 
 
 
}

23: 
 
 
 
 
if
(
== 
|P|)
{

24: 
 
 
 
 
 
 
return
 
match 
at 
|P|;

25: 
 
 
 
 
}

26: 
 
 
 
 
next[
j]
;

27: 
 
 
}

28: 
 
 
return
 
no 
match;

29: 
}

Example:
Pattern: ababacdd
Next:   -10012300 
Text:    ababcabbababacdd

P[0-3] matches T[0-3]. mismatch at j=4. next[4]=2. This means align P[2] with T[4] 
because there is a repeated pattern of size 2.

Pattern:   ababacdd
Text:    ababcabbababacdd

P[0-1] matches T[0-1]. mismatch at j=2. next[2]=0. This means align P[0] with T[4]
because there is no repeat.

Pattern:     ababacdd
Text:    ababcabbababacdd

P[0] does not match T[0]. j=0. next[0]=-1; When j=-1, the algorithm increaments
both i and j. This shifts the pattern over by one.

Pattern:      ababacdd
Text:    ababcabbababacdd

P[0-1] matches T[0-1]. mismatch at j=2. next[2]=0. This means align P[0] with T[8]

Pattern:         ababacdd
Text:    ababcabbababacdd

Match at text index 8.

Next Explained:

Next specifies the place where the longest prefix and suffix match.
letter: a b a b a c d d
next:  -1 0 0 1 2 3 0 0
index:  1 2 3 4 5 6 7 8

Index 1:
Position 0 contains the longest suffix/prefix match

Index 2:
Position 0 contains the longest suffix/prefix match

Index 3:
Position 0 contains the longest suffix/prefix match

Index 4:
a b a b
a b a 
    a b a

The longest suffix/prefix match is to shift the pattern to have position 1 
match position 3.

Index 5:
a b a b a
a b a b 
    a b a b

The longest suffix/prefix match is to shift the pattern to have position 2 
match position 4.

Index 6: 
a b a b a c
a b a b a 
    a b a b a

The longest suffix/prefix match is to shift the pattern to have position 3 
match position 5.

Index 7: 
a b a b a c d
a b a b a c
            a b a b a c

The longest suffix/prefix match is to shift the pattern to have position 0 
match position 6.

Index 8: 
a b a b a c d d
a b a b a c d
              a b a b a c d

The longest suffix/prefix match is to shift the pattern to have position 0 
match position 7.

Building next:

i j next pattern
0 -1 [-1] ababacdd
1 0 [-1 0] ababacdd
1 -1 [-1 0]
2 0 [-1 0 0] ababacdd
3 1 [-1 0 0 1] ababacdd
4 2 [-1 0 0 1 2] ababacdd
5 3 [-1 0 0 1 2 3] ababacdd
5 1 [-1 0 0 1 2 3] ababacdd
5 0 [-1 0 0 1 2 3] ababacdd
5 -1 [-1 0 0 1 2 3]
6 0 [-1 0 0 1 2 3 0] ababacdd
6 -1 [-1 0 0 1 2 3 0]
7 0 [-1 0 0 1 2 3 0 0] ababacdd
7 -1 [-1 0 0 1 2 3 0 0]
8 0 [-1 0 0 1 2 3 0 0]
Another next:
pattern: a b c a b c a b d
next:   -1 0 0 0 1 2 3 4 5
index:   0 1 2 3 4 5 6 7 8 

index 4
abcab
   abcab

index 5
abcabc
   abcabc

index 6
abcabca
   abcabca

index 7
abcabcab
   abcabcab

index 8
abcabcabd
   abcabcabd
  • Time complexity: O(T)

Boyer-Moore Algorithm

In Knuth-Morris-Pratt, the search is done in the pattern from left to right. In Boyer-Moore the search is done from right to left. This makes it so when there is a mismatch, more of the pattern can often be skipped.

Rules:
  1. No occurance rule: if the mismatched character T[i] appears nowhere in P, align P[0] with T[i+1]
    1: 
    aaaaebdaabadbda

    2: 
    dabacbd

    3: 
     
     
     
     
     
    dabacbd 
     
     
     
     
     
     
    <
    -- 
    there 
    is 
    no 
    in 
    the 
    pattern, 
    so 
    it 
    is 
    shifted 
    one 
    past 
    e

  2. Left side occurance rule: if there is an occurance of character ch equal to T[i] only to the left of P[j], align T[i] with P[k] = ch closest to P[j]
    1: 
    aaaaebdaabadbda

    2: 
     
     
     
     
     
    dabacbd 
     
     
     
     
     

    3: 
     
     
     
     
     
     
     
    dabacbd 
     
     
     
     
     
     
    <
    -- 
    pattern 
    is 
    shifted 
    over 
    until 
    the 
    a's 
    match.

  3. Right side occurance rule: if there is an occurance of character ch equal to T[i] to the right of P[j], shift P by one position
    1: 
    aaaaebdaabadbda

    2: 
     
     
     
     
     
     
     
    dabacbd

    3: 
     
     
     
     
     
     
     
     
    dabacbd

Algorithm:
1:  
algorithm
 
boyerMooreSimple(
pattern 
P, 
text 
T)

2:  
 
 
for
 
to 
|P| 
1

3:  
 
 
 
 
delta1[
P[
j]
]
 
|P| 
1;

4:  
 
 
|P| 
1;

5:  
 
 
while
 
|T|

6:  
 
 
 
 
|P| 
1;

7:  
 
 
 
 
while
 
>
and 
P[
j]
 
== 
T[
i]

8:  
 
 
 
 
 
 
--i;

9:  
 
 
 
 
 
 
--j;

10: 
 
 
 
 
if
 
== 
-1

11: 
 
 
 
 
 
 
return
 
match 
at 
1;

12: 
 
 
 
 
if
 
delta1 
contains 
T[
i]

13: 
 
 
 
 
 
 
+= 
delta1[
T[
i]
]
;

14: 
 
 
 
 
else

15: 
 
 
 
 
 
 
+= 
|P|;

16: 
 
 
return
 
no 
match;

  • The worst case time complexity is 3T or O(T). This was determined in 1991. [2]
Building delta1:
Index J=0: 
Pattern: d a b a c b d
delta1[d] = 7 - 0 - 1 = 6
delta1[a]: 0
delta1[b]: 0
delta1[c]: 0
delta1[d]: 6

Index J=1: 
Pattern: d a b a c b d
delta1[a] = 7 - 1 - 1 = 5
delta1[a]: 5
delta1[b]: 0
delta1[c]: 0
delta1[d]: 6

Index J=2: 
Pattern: d a b a c b d
delta1[b] = 7 - 2 - 1 = 4
delta1[a]: 5
delta1[b]: 4
delta1[c]: 0
delta1[d]: 6

Index J=3: 
Pattern: d a b a c b d
delta1[a] = 7 - 3 - 1 = 3
delta1[a]: 3
delta1[b]: 4
delta1[c]: 0
delta1[d]: 6

Index J=4: 
Pattern: d a b a c b d
delta1[c] = 7 - 4 - 1 = 2
delta1[a]: 3
delta1[b]: 4
delta1[c]: 2
delta1[d]: 6

Index J=5: 
Pattern: d a b a c b d
delta1[b] = 7 - 5 - 1 = 1
delta1[a]: 3
delta1[b]: 1
delta1[c]: 2
delta1[d]: 6

Index J=6: 
Pattern: d a b a c b d
delta1[d] = 7 - 6 - 1 = 0
delta1[a]: 3
delta1[b]: 1
delta1[c]: 2
delta1[d]: 0

Rabin-Karp Algorithm

Rabin-Karp uses a hash function to determine if the strings match.
1:  
algorithm
 
rabinKarp(
pattern 
P, 
text 
T)

2:  
 
 
hash_pattern 
hash(
P)
;

3:  
 
 
hash_text 
hash(
T)
;

4:  
 
 
for
 
to 
|T|

5:  
 
 
 
 
if
 
hash_pattern 
== 
hash_text

6:  
 
 
 
 
 
 
if
 
verify 
works

7:  
 
 
 
 
 
 
 
 
return
 
i;

8:  
 
 
 
 
hash_text 
rolling_hash(
hash_text, 
T, 
i)
;

9:  
 
 
return
 
not_found;

Rolling hash functions:
  • hash_next = hash_prev - character_being_removed + character_being_added
Time complexity:
  • Worst case running time: O(T*P)
  • Average case running time: O(T)
Example:
Letters:
========
A 1
B 2
C 3
D 4
E 5

Pattern: ace
Text:    abaeace

Hash(ace) = 1+3+5 = 9
Hash(aba) = 1+2+1 = 4

Hashes do not match, roll text hash by one.
Hash(bae) = Hash(aba) - a + e = 4 - 1 + 5 = 8
 
Pattern:  ace
Text:    abaeace

Hashes do not match, roll text hash by one.
Hash(aea) = Hash(bae) - b + a = 8 - 2 + 1 = 7
 
Pattern:   ace
Text:    abaeace

Hashes do not match, roll text hash by one.
Hash(eac) = Hash(aea) - a + c = 7 - 1 + 3 = 9
 
Pattern:    ace
Text:    abaeace

Hashes match, check full pattern. Pattern does not match text, roll text hash by one.
Hash(ace) = Hash(eac) - e + e = 9 - 5 + 5 = 9
 
Pattern:     ace
Text:    abaeace

Hashes match, check full pattern. A match is output at index 4 in the text

Bitvector String Matching

This approach keeps a state that gets reset when there is not a match.

Code:
1:  
// shiftAnd.cpp - download here

2:  

3:  
#include
 
<
iostream>

4:  
#include
 
<
string>

5:  
#include
 
<
vector>

6:  

7:  
void
 
shiftAnd(
std:
:
string 
pattern, 
std:
:
string 
text)
{

8:  

9:  
 
 
int
 
state 
=
 
0;

10: 
 
 
int
 
matchBit 
=
 
1;

11: 
 
 
for
(
size_t
 
=
 
1;
 
<
 
pattern.length(
)
;
 
++i)
{

12: 
 
 
 
 
matchBit 
<
<
=
 
1;

13: 
 
 
}

14: 
 
 
std:
:
vector<
int
>
 
charactersInP;

15: 
 
 
for
(
int
 
=
 
0;
 
<
=
 
255;
 
++i)
{

16: 
 
 
 
 
charactersInP.push_back(
0)
;

17: 
 
 
}

18: 
 
 
for
(
size_t
 
=
 
0, 
=
 
1;
 
<
 
pattern.length(
)
;
 
++i, 
<
<
=
 
1)
{

19: 
 
 
 
 
charactersInP[
pattern[
i]
]
 
|=
 
j;

20: 
 
 
 
 
std:
:
cout 
<
<
 
"charactersInP["
 
<
<
 
pattern[
i]
;

21: 
 
 
 
 
std:
:
cout 
<
<
 
"]="
 
<
<
 
charactersInP[
pattern[
i]
]
 
<
<
 
std:
:
endl;

22: 
 
 
}

23: 
 
 
std:
:
cout 
<
<
 
"search: "
 
<
<
 
std:
:
endl;

24: 
 
 
for
(
int
 
=
 
0;
 
<
 
text.length(
)
;
 
++i)
{

25: 
 
 
 
 
std:
:
cout 
<
<
 
"charactersInP["
 
<
<
 
text[
i]
 
<
<
 
"]="
;
 

26: 
 
 
 
 
std:
:
cout 
<
<
 
charactersInP[
text[
i]
]
 
<
<
 
std:
:
endl;

27: 
 
 
 
 
state 
=
 
(
(
state 
<
<
 
1)
 
1)
 
&
 
charactersInP[
text[
i]
]
;

28: 
 
 
 
 
std:
:
cout 
<
<
 
"state: "
 
<
<
 
state 
<
<
 
std:
:
endl;

29: 
 
 
 
 
if
(
(
matchBit 
&
 
state)
 
!=
 
0)
{

30: 
 
 
 
 
 
 
std:
:
cout 
<
<
 
"match at: "
 
<
<
 
(
int
)
 
pattern.length(
)
 
1;

31: 
 
 
 
 
 
 
std:
:
cout 
<
<
 
std:
:
endl;
;

32: 
 
 
 
 
}

33: 
 
 
}

34: 
}

35: 

36: 
int
 
main(
int
 
argc, 
char
 
argv[
]
)
{

37: 

38: 
 
 
shiftAnd(
"ababab"
"ababxabababababa"
)
;

39: 
 
 
return
 
0;

40: 
}

Example:
matchBit = 10000
charactersInP[a] shows positions of a's as bits
state starts at zero

charactersInP[a] = 00101
charactersInP[b] = 01010
charactersInP[c] = 10000

Pattern: ababc
Text:    ababxababc
State:   00000

Match at T[0] and P[0]

Pattern: ababc
Text:    ababxababc
State:   00001

Match at T[1] and P[1]

Pattern: ababc
Text:    ababxababc
State:   00010

Match at T[2] and P[2]

Pattern: ababc
Text:    ababxababc
State:   00101

Match at T[3] and P[3]

Pattern: ababc
Text:    ababxababc
State:   01010

Mismatch at T[4] and P[4]

Pattern:      ababc
Text:    ababxababc
State:   00000

Match at T[5] and P[0]

Pattern:      ababc
Text:    ababxababc
State:   00001

Match at T[6] and P[1]

Pattern:      ababc
Text:    ababxababc
State:   00010

Match at T[7] and P[2]

Pattern:      ababc
Text:    ababxababc
State:   00101

Match at T[8] and P[3]

Pattern:      ababc
Text:    ababxababc
State:   01010

Match at T[9] and P[4]

Pattern:      ababc
Text:    ababxababc
State:   10000

Matchbit == State. Match found at index 5 in the text
  • Time complexity: O(|T|)
  • Max pattern size is the number of bits in a long int (64).
  • An array of long ints can be used and then the time complexity becomes: O(|P|/64 * |T|) or O(|P| * |T|)

Summary of String Matching Algorithms #1

Algorithm Description Preprocessing Time Complexity Worst Case Search Time Complexity Average Case Search Time Complexity
Brute-Force Double Nested For Loops O(1) O(T*P) O(T*P)
Hancart Can jump at most 2 characters O(1) O(T*P) O(T)
Knuth-Morris Pratt Can jump any subpattern length of characters O(P) O(T) (2T) O(T) (2T)
Boyer-Moore Searches for matches in reverse O(P) O(T) (3T) O(T) (3T)
Rabin-Karp Uses a hash of the pattern O(P) O(T*P) O(T)
Bitvector Uses a bitvector, can only use for patterns less than 64 characters O(P) O(T) (T) O(T) (T)

References

  1. http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
  2. http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
  3. http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
  4. Cormen, Leiserson, Rivest, Stein, "Introduction to Algorithms", Second Edition. Section 32.4
  5. Drozdek, "Data Structures and Algorithms", Fourth Edition. Section 13.1.2