21: string matching #2

Spell Check | Matching Multiple Words | Regular Expression Matching | Ternary Search Tries
Suffix Arrays | String Alignment | Burrows-Wheeler Transform | String Sorting

Trie Based Spell Check [4]

Dictionary contains words:
  • jewel
  • jeweler
  • jewels
  • jello
Seaching for corrections to word:
  • jewele
Build a trie data structure:
SpellCheck

Search algorithm:
1:  
algorithm
 
findCorrections(
TrieNode 
root, 
string 
word)
{

2:  
 
 
doFind(
root, 
word, 
1, 
word.size(
)
)
;

3:  
}

4:  

5:  
algorithm
 
doFind(
TrieNode 
curr, 
string 
word, 
int
 
depth, 

6:  
 
 
int
 
match_length, 
string 
path)
{

7:  
 

8:  
 
 
//if we found a match, add it to the output 

9:  
 
 
if
(
curr 
has 
word 
and 
depth 
>
 
0)
{

10: 
 
 
 
 
if
(
curr 
word 
is 
within 
+/- 
length 
of 
match_length)
{

11: 
 
 
 
 
 
 
add 
curr 
word 
to 
output;

12: 
 
 
 
 
}

13: 
 
 
}

14: 

15: 
 
 
//search with no change

16: 
 
 
TrieNode 
child 
curr->getChild(
first 
character 
of 
word)
;

17: 
 
 
if
(
child 
!= 
NULL)
{

18: 
 
 
 
 
doFind(
child, 
word 
without 
first 
character, 
depth, 
match_length, 
path)

19: 
 
 
}

20: 

21: 
 
 
//we have room to make some changes

22: 
 
 
if
(
depth 
>= 
1)
{

23: 
 
 
 
 
//deletion

24: 
 
 
 
 
string 
deleted 
word 
without 
first 
character;

25: 
 
 
 
 
TrieNode 
child 
curr->getChild(
first 
character 
of 
deleted)
;

26: 
 
 
 
 
if
(
child 
!= 
NULL)
{

27: 
 
 
 
 
 
 
doFind(
child, 
deleted, 
depth 
1, 
match_length 
1, 
path)

28: 
 
 
 
 
}

29: 

30: 
 
 
 
 
for
(
each 
child 
in 
curr 
children)
{

31: 
 
 
 
 
 
 
//insertion

32: 
 
 
 
 
 
 
doFind(
child, 
word, 
depth 
1, 
match_length 
1, 
path)
;

33: 

34: 
 
 
 
 
 
 
//substitution

35: 
 
 
 
 
 
 
doFind(
child, 
deleted, 
depth 
1, 
match_length 
1, 
path)
;

36: 
 
 
 
 
}

37: 
 
 
}

38: 

39: 
}

Example:
curr word depth prefix description
j jewele 1   searching with no change
e ewele 1 j searching with no change
w wele 1 je searching with no change
e ele 1 jew searching with no change
l le 1 jewe found jewel. searching with no change
e e 1 jewel searching with no change
1 jewele inserting r
r r 0 jewele found jeweler. returing from recursion.
e e 1 jewel inserting s
s s 0 jewel found jewels

Matching Sets of Words

A trie is formed for the words:
  • inner
  • input
  • tint
SetsOfWords

AhoChorasick Algorithm:
1:  
algorithm
 
AhoCorasick(
set 
keywords, 
text 
T)

2:  
 
 
computeGotoFunction(
keywords, 
g, 
output)
;

3:  
 
 
computeFailureFunction(
g, 
output, 
f)
;

4:  
 
 
state 
0;

5:  
 
 
for
(
to 
|T| 
1)

6:  
 
 
 
 
while
 
g(
state, 
T[
i]
)
 
== 
fail

7:  
 
 
 
 
 
 
state 
f(
state)
;

8:  
 
 
 
 
state 
g(
state, 
T[
i]
)
;

9:  
 
 
 
 
if
 
output(
state)
 
is 
not 
empty

10: 
 
 
 
 
 
 
print 
that 
the 
strings 
in 
output(
state)
 
match 
at 
location 
i;

11: 

12: 
algorithm
 
computeGotoFunction(
set 
keywords, 
function 
g, 
function 
output)

13: 
 
 
newstate 
0;

14: 
 
 
for
(
to 
|keywords| 
1)

15: 
 
 
 
 
state 
0;

16: 
 
 
 
 
0;

17: 
 
 
 
 
keyword[
i]
;

18: 
 
 
 
 
//descend down an existing path

19: 
 
 
 
 
while
 
g(
state, 
P[
j]
)
 
!= 
fail

20: 
 
 
 
 
 
 
state 
g(
state, 
P[
j]
)
;

21: 
 
 
 
 
 
 
++j;

22: 
 
 
 
 
//create a path for suffix

23: 
 
 
 
 
for
(
to 
|P| 
1)

24: 
 
 
 
 
 
 
newstate++;

25: 
 
 
 
 
 
 
g(
state, 
P[
p]
)
 
newstate;

26: 
 
 
 
 
 
 
state 
newstate;

27: 
 
 
 
 
add 
to 
the 
set 
output(
state)
;

28: 
 
 
for
 
all 
characters 
ch 
of 
the 
alphabet

29: 
 
 
 
 
if
 
g(
0, 
ch)
 
== 
fail

30: 
 
 
 
 
 
 
g(
0, 
ch)
 
0;

31: 

32: 
algorithm
 
computeFailureFunction(
function 
g, 
function 
output, 
function 
f)

33: 
 
 
for
 
each 
character 
ch 
of 
the 
alphabet

34: 
 
 
 
 
if
 
g(
0, 
ch)
 
!= 
0

35: 
 
 
 
 
 
 
enqueue(
g(
0, 
ch)
)
;

36: 
 
 
 
 
 
 
f(
g(
0, 
ch)
)
 
0;

37: 
 
 

38: 
 
 
//do a breadth first search on the trie

39: 
 
 
while
 
queue 
is 
not 
empty

40: 
 
 
 
 
dequeue 
state 
r;

41: 
 
 
 
 
for
 
each 
character 
ch 
of 
the 
alphabet

42: 
 
 
 
 
 
 
if
 
g(
r, 
ch)
 
!= 
fail

43: 
 
 
 
 
 
 
 
 
enqueue(
g(
r, 
ch)
)
;

44: 
 
 
 
 
 
 
 
 
state 
f(
r)
;

45: 
 
 
 
 
 
 
 
 
//follow the failure links for ch until a goto is found

46: 
 
 
 
 
 
 
 
 
while
 
g(
state, 
ch)
 
== 
fail

47: 
 
 
 
 
 
 
 
 
 
 
state 
f(
state)
;

48: 
 
 
 
 
 
 
 
 
//make the failure that goto

49: 
 
 
 
 
 
 
 
 
f(
g(
r, 
ch)
)
 
g(
state, 
ch)
;

50: 
 
 
 
 
 
 
 
 
add 
keywords 
from 
output(
g(
state, 
ch)
)
 
to 
output(
r)
;

51: 

Regular Expression Matching

Simplified regular expression language:
  • All letters are regular expressions
  • If r and s are regular expressions
    • r|s = r or s
    • r+ = one or more occurances: r, rr, rrr, ...
    • rs = concatenation of r and s
    • (r) represents the regular expression r
Example regular expressions:
  • chi(r)+up: matches chirup, chirrup, chirrrup, etc
  • (b|c)at: matches bat or cat
Nondeterministic finite automatas (NFA) are used to match regular expressions:

RegularExpressions

The longest possible match is used in these regular expressions. To do this, a set of states is kept at each position in the string.
  • When chir is matched in the first NFA, the next set of states is 3, 4.
  • When a u is finally matched, the next set of states just becomes 5.

Perl 5.0 Regular Expression Syntax is probably the most complete and consistent standard there is [2]. While I would never write a new program today in Perl, many programming languages base their regular expresion syntax on Perl 5.0.
  • Java
  • JavaScript
  • Python
  • Ruby
  • C# (and the rest of .Net)
  • C++ via Boost
  • PHP
  • antlr

Perl 5.0 Regular Expression Syntax [3]

Metacharacters
  1. \ - quote the next metacharacter
  2. ^ - match the beginning of the line
  3. . - match any character
  4. $ - match the end of the line
  5. | - choice
  6. () - grouping
  7. [] - bracketed character class

Quantifiers
  1. * - match 0 or more times
  2. + - match 1 or more times
  3. ? - match 1 or 0 times
  4. {n} - match exactly n times
  5. {n,} - match at least n times
  6. {n,m} - match at least n times but no more than m times

Character Classes
  1. [a-zA-Z0-9] - match alphanumerics
  2. etc

There are many many more details of Perl 5.0 regular expresions. But these I use routinely.

Ternary Search Tries (TSTs)

In Ternary Search Tries there are three links for each node: 1) nodes less than, 2) nodes equal to and 3) nodes greater than
  • One digit of a key is stored in the internal node
  • A search hit only occurs when we are traversing middle nodes
TernarySearchTrie
  • Add "time"
TernarySearchTrie
Strengths of TSTs
  • They adapt to irregularities in search keys that appear in practical applications
  • (With TSTs you can use 1 or 2-byte character encodings without worring about excessive costs of nodes as in Multiway Tries)
  • In practice search misses only use a few comparisons even when the strings are very long
  • TST support: Print all keys that match a prefix of the search key
  • TST support: Visit all keys keys that differ the search key by one digit position

Suffix Arrays

The suffixes of a word are sorted in lexographic order and put into an array.

Word T=proposition. Suffix array:

Suffix Array Index Original Array Index Suffix
0 8 ion
1 6 ition
2 10 n
3 9 on
4 2 oposition
5 4 osition
6 3 position
7 0 proposition
8 1 roposition
9 5 sition
10 7 tion

A suffix array can be made in O(|T|lg(|T|)) time. Then a binary search can be done in lg(|T|) time to find patterns. The suffix array is built from the text to be searched.

Approximate String Matching

In all the previous string matching algorithms, only exact matches were considered.

String Similarity

Given the words: "apple" and "capital" they can be aligned:
-app--le
capital-

Genome String Similarity

TGGAGCTGGAAAAGGAATTCCACTTCAATCGCTACCTGACCCGGCGGCGACCGTATCGAGA
|||||||||| ||||| ||||||||||| || ||  ||  ||||| ||| ||  |||||||
TGGAGCTGGAGAAGGAGTTCCACTTCAACCGTTATTTGTGCCGGCCGCGCCCTGATCGAGA

Basic operations in approximate string matching:
  • Deletion
  • Insertion
  • Substitution
  • Match
String Similarity, D(Q,R) is the measure of how similar strings Q and R are.
  • Deletion: When Q[i] is deleted from Q(0...i) then D(i-1,j-1) = D(i-2,j-1) + 1;
  • Insertion: When R[j] is inserted into R(0...j-1) then D(i-1,j-1) = D(i-1,j-2) + 1;
  • Substitution: When R[j] is substituted in R(0...j) then D(i-1,j-1) = D(i-2,j-2) + 1;
  • Match: When Q[i] == R[j], then D(i-1,j-1) = D(i-2,j-2)
All of the conditions can be summarized:
  • D(i,j) = min(D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1) + c(i,j))
    • where c(i,j)=0 if Q[i] == R[j] and c(i,j)=1 otherwise
Wagner Fischer Algorithm
1:  
algorithm
 
wagnerFischer(
table 
dist, 
string 
Q, 
string 
R)

2:  
 
 
//initialize first row  

3:  
 
 
for
 
to 
|Q|

4:  
 
 
 
 
dist[
i,0]
 
i;

5:  
 
 
//initialize first column

6:  
 
 
for
 
to 
|R|

7:  
 
 
 
 
dist[
0,j]
 
j;

8:  
 
 
for
 
to 
|Q|

9:  
 
 
 
 
for
 
to 
|R|

10: 
 
 
 
 
 
 
dist[
i-1,j 
 
]
+1;
 
//upper

11: 
 
 
 
 
 
 
dist[
 
,j-1]
+1;
 
//left

12: 
 
 
 
 
 
 
dist[
i-1,j-1]
+0;
 
//diagonal

13: 
 
 
 
 
 
 
if
(
Q[
i-1]
 
!= 
R[
j-1]
)

14: 
 
 
 
 
 
 
 
 
++z;

15: 
 
 
 
 
 
 
dist[
i,j]
 
min(
x,y,z)
;

a p p l e
0 1 2 3 4 5
c 1 1 2 3 4 5
a 2 1 2 3 4 5
p 3 2 1 2 3 4
i 4 3 2 2 3 4
t 5 4 3 3 3 4
a 6 5 4 4 4 4
l 7 6 5 5 4 5

The number in the lower right of the table (5) is the minimum distance between the string.

To print out the alignment:
1:  
algorithm
 
wagnerFischerPrint(
table 
dist, 
string 
Q, 
string 
R)

2:  
 
 
|Q|;

3:  
 
 
|R|;

4:  
 
 
while
 
!= 
or 
!= 
0

5:  
 
 
 
 
if
 
and 
dist[
i-1,j]
 
<
 
dist[
i,j]
 
 
 
//up

6:  
 
 
 
 
 
 
stackQ.push(
Q[
i]
)
;

7:  
 
 
 
 
 
 
stackR.push(
'-')
;

8:  
 
 
 
 
 
 
--i;

9:  
 
 
 
 
if
 
and 
dist[
i,j-1]
 
<
 
dist[
i,j]
 
 
 
//left

10: 
 
 
 
 
 
 
stackQ.push(
'-')
;

11: 
 
 
 
 
 
 
stackR.push(
R[
j]
)
;

12: 
 
 
 
 
 
 
--j;

13: 
 
 
 
 
else

14: 
 
 
 
 
 
 
stackQ.push(
Q[
i]
)
;

15: 
 
 
 
 
 
 
stackR.push(
R[
j]
)
;

16: 
 
 
 
 
 
 
--i;

17: 
 
 
 
 
 
 
--j;

18: 
 
 
print 
stackQ;

19: 
 
 
print 
stackR;


Prints:
capital (Q)
-apple- (R)

The intiution is that we are seeking for the place where the min was selected from while building the table. This path will produce the minimum cost alignment.

Wagner Fischer:
  • Time Complexity: O(|Q||R|)
  • Space Complexity: O(|Q||R|)
Wagner Fischer Print:
  • Time Complexity: O(max(|Q|,|R|))

Burrows-Wheeler Transform [5-6]

The Burrows-Wheeler Transform is a reversible transformation on strings that allows for better run-length encoding.

Original String: ^juju|
Resultant String: ^ujj|u

In longer strings there will be larger runs. The intuition is that words like the and there in the suffix array will be close and thus the t's will be compressable with run-length encoding.

Rotations:
^ j u j u |
| ^ j u j u
u | ^ j u j
j u | ^ j u
u j u | ^ j
j u j u | ^
Sorted:
j u j u | ^
j u | ^ j u
u j u | ^ j
u | ^ j u j
^ j u j u |
| ^ j u j u

(Take the last column from sorted and that is the result).

To reconstruct from final result keep calling add and sort:
Add
^
u
j
j
|
u
Sort
j
j
u
u
^
|
Add
^ j
u j
j u
j u
| ^
u |
Sort
j u
j u
u j
u |
^ j
| ^
Add
^ j u
u j u
j u j
j u |
| ^ j
u | ^
Sort
j u j
j u |
u j u
u | ^
^ j u
| ^ j
Add
^ j u j
u j u |
j u j u
j u | ^
| ^ j u
u | ^ j
Sort
j u j u
j u | ^
u j u |
u | ^ j
^ j u j
| ^ j u
Add
^ j u j u
u j u | ^
j u j u |
j u | ^ j
| ^ j u j
u | ^ j u
Sort
j u j u |
j u | ^ j
u j u | ^
u | ^ j u
^ j u j u
| ^ j u j
Add
^ j u j u |
u j u | ^ j
j u j u | ^
j u | ^ j u
| ^ j u j u
u | ^ j u j
Sort
j u j u | ^
j u | ^ j u
u j u | ^ j
u | ^ j u j
^ j u j u |
| ^ j u j u

String Sorting Algorithms

  • Quicksort
  • Three-way quicksort: Pick the first character of a string as the pivot. Partition strings into subsets with less than, equal and greater. Then three-way quicksort the partitions on a second character from each of the partitions.
  • MSD Radix Sort
  • Burst Sort
MSD Radix Sort
1:  
algorithm
 
sortQueue(
queue 
q)

2:  
 
 
allocate 
radix 
number 
of 
queues;

3:  
 
 
for
 
each 
string 
in 
the 
q

4:  
 
 
 
 
if
 
the 
string 
has 
more 
characters

5:  
 
 
 
 
 
 
put 
the 
string 
in 
queue 
according 
to 
the 
current 
character;

6:  
 
 
 
 
else

7:  
 
 
 
 
 
 
output 
the 
string 
as 
result;

8:  
 
 
for
 
each 
queue 
with 
elements

9:  
 
 
 
 
call 
sortQueue;

10: 

11: 
algorithm
 
stringRadix(
array 
strings, 
int
 
radix)

12: 
 
 
allocate 
'radix' 
number 
of 
queues;

13: 
 
 
put 
the 
strings 
in 
queues 
according 
to 
the 
first 
character;

14: 
 
 
for
 
each 
queue

15: 
 
 
 
 
call 
sortQueue;

Burst Sort
algorithm burstInsert(string key)
  find the trie node bucket to insert into (of size d)
  if(the size of the bucket is greater than L)
    burst the trie node bucket to create a new set of nodes of d+1
  insert the key into the bucket
Once all of the keys have been inserted, sort the buckets and then do an inorder traversal.

References

  1. http://en.wikipedia.org/wiki/Suffix_array
  2. http://en.wikipedia.org/wiki/Regular_expression#Perl-derived_regular_expressions
  3. http://perldoc.perl.org/perlre.html
  4. http://blog.afterthedeadline.com/2010/01/29/how-i-trie-to-make-spelling-suggestions/
  5. http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
  6. http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf
  7. http://goanna.cs.rmit.edu.au/~jz/fulltext/alenex03.pdf