String Searching
Literature references and annotations by Dick Grune, dick@dickgrune.com.
Last update: Wed Sep 04 17:27:21 2024.
These references and annotations were originally intended
for personal use and are presented here only in the hope
that they may be useful to others.
There is no claim to completeness or even correctness.
Each annotation represents my understanding of the text
at the moment I wrote the annotation.
No guarantees given; comments and content criticism welcome.
Mads Sig Ager,
Olivier Danvy,
Henning Korsholm Rohde,
Fast partial evaluation of pattern matching in strings,
ACM SIGPLAN Notices,
vol. 38,
#10,
Oct 2003,
pp. 243-249.
It is possible to derive the linear-time Knuth, Morris and Pratt string
matching algorithm from a naive quadratic one by partial evaluation, but this
specialization process still takes quadratic time.
This paper shows how to derive the matcher in linear time using partial
evaluation.
It has many literature references on the subject.
Charles L.A. Clarke,
Gordon V. Cormack,
On the use of regular expressions for searching text,
ACM TOPLAS,
vol. 19,
#3,
May 1997,
pp. 413-426.
The authors argue that for text searching the disambiguation rule should
be to return the shortest matching string, rather than the longest, as is
usual in lexical analysers.
This would indeed obviate the need for regular expressions like ([^()]*)
to match a non-nesting parenthesized segment, since under the proposed
regime the regular expression (.*) would do the job.
They also propose a containment operator ≥, where r ≥ s matches what
s matches, provided the match lies inside a segment matching r.
Algorithms for implementing the above are provided.
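As an illustration of the proposed rule (not of the paper's algorithms),
Python's non-greedy quantifier approximates shortest-match disambiguation:

    import re

    text = "f(a)(b)"
    # longest match, the lexical-analyser convention, overshoots:
    re.search(r"\((.*)\)", text).group(1)   # yields 'a)(b'
    # shortest match does what one usually wants here:
    re.search(r"\((.*?)\)", text).group(1)  # yields 'a'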
Ricardo Baeza-Yates,
Gaston H. Gonnet,
A new approach to text searching,
Commun. ACM,
vol. 35,
#10,
Oct. 1992,
pp. 74-82.
A pattern consisting of m components is checked for by m string
comparators running in parallel.
The states are combined and passed on à la Dömölki.
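A minimal sketch of the bit-parallel mechanism, nowadays usually called
shift-or; the code and naming are mine, and the pattern is assumed to fit
in a machine word:

    def shift_or(pattern, text):
        # masks[c] has bit i cleared iff pattern[i] == c
        m = len(pattern)
        masks = {}
        for i, c in enumerate(pattern):
            masks[c] = masks.get(c, ~0) & ~(1 << i)
        state = ~0  # bit i is 0 iff pattern[0..i] matches ending here
        for j, c in enumerate(text):
            state = (state << 1) | masks.get(c, ~0)
            if (state & (1 << (m - 1))) == 0:
                return j - m + 1  # leftmost match
        return -1

Each bit position acts as one of the m comparators.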
Andrew Hume,
Daniel Sunday,
Fast string searching,
Softw. Pract. Exper.,
vol. 21,
#11,
Nov. 1991,
pp. 1221-1248.
The authors do three things at the same time: give a taxonomy of
single-keyword string searching algorithms, compare selected entries in this
taxonomy experimentally, and report on their own work; they switch between
these modes at unexpected moments.
A general framework for single-keyword string searching algorithms is set up
as follows:
FOR i OVER text:
    // skip to interesting character:
    WHILE text[i] /= interesting_char:
        ADD skip_distance[text[i]] TO i
    // try a guarded match:
    IF pattern COULD BE AT i:    // must be a cheap test
        MATCH text AT i TO pattern OR FAIL AT p
    ELSE:
        SET p TO i;
    // shift over non-matching part:
    ADD shift_distance(i, p) TO i;
This decomposes the algorithm into four components: the choice of the
interesting character (and its skip_distance[]), the cheap guard test, the
match algorithm, and the shift to the first promising position.
Although these components are not at all independent, the paper treats them
more or less as if they were; the authors even give speed figures for some
separate components.
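To make the framework concrete, here is a minimal Horspool-like
instantiation in Python (my sketch, not code from the paper; it assumes
a non-empty pattern): the interesting character is the last pattern
character, the guard and the match coincide, and one table serves as
both skip_distance and shift_distance:

    def horspool(pattern, text):
        m, n = len(pattern), len(text)
        # distance table, indexed by the text character currently
        # aligned with the last pattern position
        skip = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
        i = m - 1
        while i < n:
            if text[i] == pattern[-1]:                # cheap guard
                if text[i - m + 1:i + 1] == pattern:  # full match
                    return i - m + 1
            i += skip.get(text[i], m)                 # skip/shift
        return -1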
Between 6 and 11 algorithms are described for each of the components; of the
possible combinations, 9 had been described in the literature.
Some of the conclusions are:
1. The Knuth-Morris-Pratt algorithms are found to be consistently much slower
than the Boyer-Moore algorithms.
2. The guard test helps a bit, usually.
3. Using statistics on the characters helps, but not more than a few percent.
4. Using two shift heuristics and choosing the best at run time loses with
respect to just sticking to one.
5. There is little or no relationship between the number of character
comparisons and the speed of the algorithm.
6. Boyer-Moore with a few trimmings is usually the fastest.
7. Finding the optimum for a specific application (natural language, DNA
searching, etc.) requires experimentation, for which the authors have
provided a toolkit.
P.D. Smith,
Experiments with a very fast substring search algorithm,
Softw. Pract. Exper.,
vol. 21,
#10,
Oct. 1991,
pp. 1065-1074.
The idea of matching a string to a given position in a text by comparing
chars in a self-adapting order is tested experimentally.
Two adaptation methods are described: move the position that causes a
mismatch one place up in the list or move it to the top of the list.
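A minimal sketch of the move-to-front variant (my formulation):

    def match_at(pattern, text, pos, order):
        # order: list of pattern positions, tried front to back;
        # the position that causes a mismatch is moved to the top,
        # so positions that fail often are tried early
        for k, j in enumerate(order):
            if text[pos + j] != pattern[j]:
                order.insert(0, order.pop(k))
                return False
        return True

The list order starts out in one of the initial orderings discussed
next.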
The influence of the initial ordering was studied.
There were 4 initial orderings:
character frequency in English,
character frequency in C,
right-to-left, and
random.
Each was tested with and without self-adaptation (which one?).
Conclusions:
1. Self-adaptation often helps a little (-1 to +8% improvement).
2. A generic initial distribution (averaged over English, French and C)
is as good (within 2%) as specific English, French or C orders, and
better than right-to-left, or random.
The specific frequency distribution of the actual input text (if known)
allows you to do about 20% better, though.
A.V. Aho,
Algorithms for finding patterns in strings,
in Handbook of Theoretical Computer Science -- Algorithms and Complexity, Vol. A,
ed. by J. van Leeuwen,
Elsevier,
Amsterdam, The Netherlands,
1990,
pp. 255-300.
Chapter 5 of the handbook.
Encyclopedic article on the subject, covering the state of the art in:
single string matching:
    brute-force
    Karp-Rabin, caterpillar hash function
    Knuth-Morris-Pratt, automaton, forward
    Boyer-Moore, backward
multiple string matching:
    Aho-Corasick
    Commentz-Walter, p. 278, best description around
regular expression matching:
    Thompson NFSA construction
regular expression matching with variables:
    proved NP-complete
longest common substring location:
    longest path in matrix
    McCreight suffix trees
giving a very readable account of each of them, often with proof and
complexity analysis.
Draws amazing conclusions from Cook's Theorem:
"Every 2-way deterministic push-down automaton (2DPDA) language can be
recognized in linear time on a random-access machine".
The paper ends with 139 literature references.
Daniel M. Sunday,
A very fast substring search algorithm,
Commun. ACM,
vol. 33,
#8,
Aug. 1990,
pp. 132-142.
To see if a string $S[1..M]$ can be found at a given position p in a text
$T[1..N]$, the characters in S are compared in arbitrary order to the
corresponding characters in T. If a mismatch is found, a table supplies
the leftmost position at which the next try may still succeed. Three ways
of making this otherwise arbitrary order concrete are given: right to left,
maximal shift, and most probable maximal shift.
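A sketch of the shift mechanism in Python (my code); the table is
indexed by the text character just past the current window, which is
what allows shifts of up to M+1:

    def sunday_search(pattern, text):
        m, n = len(pattern), len(text)
        # shift[c]: smallest shift that aligns the text character c,
        # sitting just past the window, with an occurrence of c in
        # the pattern; m+1 if c does not occur in the pattern
        shift = {c: m - i for i, c in enumerate(pattern)}
        pos = 0
        while pos + m <= n:
            if text[pos:pos + m] == pattern:  # any comparison order
                return pos
            if pos + m == n:
                return -1
            pos += shift.get(text[pos + m], m + 1)
        return -1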
Gaston H. Gonnet,
Ricardo A. Baeza-Yates,
An analysis of the Karp-Rabin string matching algorithm,
Inform. Process. Lett.,
vol. 34,
#5,
May 1990,
pp. 271-274.
Explicit cost functions for the Karp-Rabin algorithm, derived by simple
math.
The results are compared with actual runs and found to agree with them.
Ricardo A. Baeza-Yates,
Improved string searching,
Softw. Pract. Exper.,
vol. 19,
#3,
March 1989,
pp. 257-271.
1. Use only the occurrence-of-mismatching-character shift heuristic of
Boyer-Moore; the "next" heuristic of Boyer-Moore costs more than it gains.
2. Combine two consecutive characters into a single character.
This gains 50% at the expense of a table that is 128 times larger.
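A sketch of the second idea (my formulation), as a Horspool-like search
whose shift table is indexed by the last two characters of the window;
it assumes a pattern of at least two characters:

    def pair_search(pattern, text):
        m, n = len(pattern), len(text)
        # shift table indexed by two consecutive characters; for
        # 7-bit text this takes 128 * 128 entries instead of 128
        shift = {pattern[i:i + 2]: m - 2 - i for i in range(m - 2)}
        pos = 0
        while pos + m <= n:
            if text[pos:pos + m] == pattern:
                return pos
            pos += shift.get(text[pos + m - 2:pos + m], m - 1)
        return -1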
Andrew Hume,
A tale of two greps,
Softw. Pract. Exper.,
vol. 18,
#11,
Nov. 1988,
pp. 1063-1072.
Description of the tuning of grep (Ken Thompson) and egrep (A.V. Aho)
by the author (his part in the so-called "grep wars").
Grep: use the optimized IO package fio.h rather than stdio.h.
Egrep, which already used lazy automaton building:
1. Use fio.h.
2. Use Boyer-Moore for non-RE strings.
3. Use Boyer-Moore for the non-RE parts of regular expressions.
All this results in an egrep that is between 8 and 40 times faster.
Alberto Apostolico,
Raffaele Giancarlo,
The Boyer-Moore string searching strategies revisited,
SIAM J. Comput.,
vol. 15,
#1,
1986,
pp. 98-105.
Incorporates Galil's (1979) treatment of periodicity of the pattern into
the original Boyer-Moore algorithm.
This results in an algorithm that performs at most $ 2n - m - 1 $
character comparisons (n: length of input; m: length of pattern).
Wojciech Rytter,
A correct preprocessing algorithm for Boyer-Moore string searching,
SIAM J. Comput.,
vol. 9,
#3,
1980,
pp. 509-512.
First shows that Knuth's algorithm for the computation of the shift
table D[] ("Fast pattern matching in strings", 1977) is wrong, by
comparing its results for two test patterns with those of the definition
of D[]; these differ, and Knuth's are clearly the incorrect ones.
(The paper does not show, however, what input will make the over-all
algorithm fail.
After some experimenting I found: pattern[] = "a"*10 (pattern 1 in the
paper) and input[] = "a"*12 "bbb" "a"*12.
DG.)
The author then shows that Knuth's algorithm covers only 3 of 4 possible
cases, and produces a correct complicated algorithm deriving from the
failure function of the Knuth-Morris-Pratt pattern matcher.
I don't pretend to understand it yet, but it works.
(The assignment q1 := q+1 in the penultimate line is superfluous.)
(A naive implementation of the definition is $ O ( n sup 3 ) $, but that
turns out not to be a problem on modern machines, even for patterns
several thousand characters long.)
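For reference, a naive transcription of (my reading of) the definition
of D[] into Python, cubic in the pattern length as noted above:

    def good_suffix_table(pat):
        # D[j]: least shift that realigns the matched suffix
        # pat[j+1:] over the pattern, with a different character
        # (or the left end of the pattern) in front of it
        m = len(pat)
        D = [0] * m
        for j in range(m):
            for s in range(1, m + 1):
                if any(k - s >= 0 and pat[k - s] != pat[k]
                       for k in range(j + 1, m)):
                    continue  # suffix does not recur at shift s
                if j - s >= 0 and pat[j - s] == pat[j]:
                    continue  # same character would mismatch again
                D[j] = s
                break
        return D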
Beate Commentz-Walter,
A string matching algorithm fast on the average,
in Automata, Languages and Programming,
Lecture Notes in Computer Science #71,
ed. by H.A. Maurer,
Springer-Verlag,
Berlin,
1979,
pp. 118-132.
Does string set matching in a way comparable to Aho-Corasick, but, like
Boyer-Moore, it matches the strings last character first.
A backward trie of the strings is constructed.
At a given position in the text, a path is looked up in the trie that spells
the text characters in backwards order; the trie can be visualized below the
text, with the root to the right (although the paper avoids all
visualization).
When the search stops (a match may be found or not), three function values
are computed:
1. the minimal right shift of the trie such that the matched text matches
a segment in a path in the trie (= the whole matched text could be a
substring of a keyword),
2. the minimal right shift of the trie such that the right end of the matched
text matches the left end of a path in the trie (= a suffix of the matched
text could be a prefix of a keyword),
3. the minimal right shift of the trie such that the mismatch character
matches any position in the trie.
Values (1) and (2) can be computed from the position in the trie where the
search ended, and (3) can be computed from the mismatch character.
The actual shift taken is then min(max((1), (3)), (2)).
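In code, the backward trie is easily built; the hard part, computing the
three shift functions, is exactly what the paper formalizes (the sketch
and naming are mine):

    def backward_trie(keywords):
        # each keyword is entered back to front, so a lookup spells
        # text characters in backwards order
        trie = {}
        for w in keywords:
            node = trie
            for c in reversed(w):
                node = node.setdefault(c, {})
        return trie

    def shift(s1, s2, s3):
        # the three precomputed values (1), (2), (3), combined as
        # described above
        return min(max(s1, s3), s2)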
The paper is high on formalism and low on explanation, unfortunately.
Much is claimed to be "obvious", but little of it is.
Zvi Galil,
On improving the worst case running time of the Boyer-Moore string matching algorithm,
Commun. ACM,
vol. 22,
#9,
Sept. 1979,
pp. 505-508.
Boyer-Moore string matching usually is $ O ( n / m ) $, where m is the
length of the pattern and n is the length of the input, but
occasionally it is $ O ( n times m ) $.
The simplest example of this undesirable behavior is a pattern $ a sup m $
and an input $ a sup n $ with $ n >> m $:
the pattern matches at $ n - m + 1 $ positions, and m comparisons are
required for confirmation in each case.
The paper aims to remove the $ times m $ component.
A string is called periodic if it has the form $ u sup n v $ where v is
a prefix of u, with $ n >= 1 $; this means that the pattern will match
itself if shifted over |u| positions.
We assume u to be minimal, so n to be maximal; this is important if
u itself is again periodic.
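In code, u corresponds to the shortest period of the pattern; a naive
quadratic sketch (mine):

    def shortest_period(pattern):
        # least p such that the pattern matches itself shifted over
        # p positions; the pattern then is u^n v with |u| == p,
        # u = pattern[:p], and v a prefix of u
        m = len(pattern)
        for p in range(1, m):
            if all(pattern[i + p] == pattern[i] for i in range(m - p)):
                return p
        return m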
The author proves that if the pattern in the Boyer-Moore algorithm is
not periodic, the $ times m $ component cannot materialize.
So the algorithm needs to be modified for periodic patterns only.
As long as the pattern does not match, the behavior will be
$ O ( n / m ) $.
When the pattern matches, we report so, shift it over |u| and start
comparing leftwards from the right, as usual, but once we have confirmed
|u| characters, we have found another match, and we do not need to
confirm the other m - |u| characters in the input, since they
contain $ u sup { n - 1 } v $, as we know from the previous match.
All these |u| characters had never been tested before, so again the
$ times m $ component cannot materialize.
Formal proofs and code are given.
Note that the algorithm ignores the $ "delta" sub 1 $ and
$ "delta" sub 2 $ heuristics of the original Boyer-Moore algorithm.
Robert S. Boyer,
J. Strother Moore,
A fast string searching algorithm,
Commun. ACM,
vol. 20,
#10,
Oct. 1977,
pp. 762-772.
One keyword $pat[1..n]$ is searched for in a text string $string$ by
comparing the characters in $pat$ with those in $string$ by starting
from the right end of $pat$ and working backwards. So we first compare
$ string [ i ] $ to $ pat [ n ] $, etc.
When we find a mismatch,
($ string [ i - m ] = char != pat [ n - m ] $),
we know that the present section of $string$ matches the pattern
$ P = [ ^ char ] pat [ n - m + 1 .. n ] $, where $ [ ^ char ] $
matches any character except $char$.
Now we look for the first position from the right in $pat$ that matches
P, where the left end of P may "fall off" the left end of $pat$.
We can shift the pattern to the right over $string$ at least so far as
to make the position of P in $pat$ coincide with the position of P
in $string$ (or we can shift over the full length of $pat$ if there is
no such position); any smaller shift could not possibly lead to a match.
This shift depends on $pat$ and m only, and can be precomputed; it is
called $ "delta" sub 2 ( m ) $.
Before doing this shift, however, we look more closely at $char$, and
find the rightmost position of $char$ in $pat$.
Again, we can shift the pattern to the right over $string$ at least so
far as to make the position of $char$ in $pat$ coincide with the
position of $char$ in $string$ (or we can shift over the full length of
$pat$ if there is no such position); any smaller shift could not
possibly lead to a match.
This shift depends on $pat$ and $char$ only, and can be precomputed; it is
called $ "delta" sub 1 ( char ) $.
The shift we finally do is the maximum of
$ "delta" sub 1 $ and $ "delta" sub 2 $.
The result is that for natural language text usually only $ 1 / n $ of
the characters in $string$ have to be touched, and the algorithm is
blindingly fast.
The paper gives empirical evidence, a statistical-theoretical treatment,
implementation hints concerning buffered I/O, and a history of the
algorithm.
The paper is surprisingly difficult to read, for two reasons.
First, the algorithm is derived in a fairly circuitous way, using 4
considerations, of which 1 and 2 are actually the same (required
presence of $char$ in $pat$) and so are 3a and 3b (required presence of
P in $pat$).
Second, everything is computed from the position of the mismatching
character, rather than from the position of the pattern.
This saves one variable in the programs but requires much confusing
arithmetic on the indexes.
Donald E. Knuth,
James H. Morris Jr.,
Vaughan R. Pratt,
Fast pattern matching in strings,
SIAM J. Comput.,
vol. 6,
#2,
June 1977,
pp. 323-350.
Very readable paper in the inimitable Knuth style: very entertaining and
informative but a bit unstructured.
The first 7 pages are taken up by a quest for a backtrack-free algorithm
for single string pattern matching.
The basic idea is that when j characters of the pattern have been
matched and a mismatch occurs, we can shift the pattern forward over the
text to the first position at which the j text characters already seen
can still match a prefix of the pattern.
Working out the details leads to an FSA for scanning the text.
Experiments showed that the automaton almost always fails right at the
first character, so the loop of the FSA is made to contain a subloop
that skips until a character is found that will properly start the FSA:
$ j == 1 $ is made a special case.
This is the basis of the "skip loop" in many string search algorithms.
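A sketch of the resulting matcher (my code, 0-based, so the special case
reads j == 0):

    def kmp_search(pattern, text):
        m, n = len(pattern), len(text)
        if m == 0:
            return 0
        # f[j]: length of the longest proper prefix of pattern[:j+1]
        # that is also a suffix of it
        f = [0] * m
        k = 0
        for j in range(1, m):
            while k > 0 and pattern[j] != pattern[k]:
                k = f[k - 1]
            if pattern[j] == pattern[k]:
                k += 1
            f[j] = k
        i = j = 0  # i: text position, j: pattern chars matched
        while i < n:
            if j == 0:
                # the skip loop: race to a plausible first character
                i = text.find(pattern[0], i)
                if i < 0:
                    return -1
            if text[i] == pattern[j]:
                i += 1
                j += 1
                if j == m:
                    return i - m
            else:
                j = f[j - 1]
        return -1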
The other 20 pages are dedicated to various other subjects.
1. A theory of palindromes is developed, including an algorithm that can
recognize a sequence of palindromes in linear time.
2. The history of the KMP algorithm is given, including the story of
Morris' 1967 implementation of his algorithm in a text editor being
removed by a successor programmer who did not understand it.
3. A description and analysis of the Boyer-Moore algorithm, which had
just been invented, is given.
A correction to the proposed implementation is given by Rytter (1980).
Alfred V. Aho,
Margaret J. Corasick,
Efficient string matching: an aid to bibliographic search,
Commun. ACM,
vol. 18,
#6,
June 1975,
pp. 333-340.
How to find members of a set of strings ("keywords") in a long text,
"long" meaning that it is worth while to do some preprocessing on the
set.
The basic idea is to put the keywords in a trie; then all paths from the
root to a leaf and perhaps some paths from the root to inner nodes spell
keywords.
We use the positions in the trie as states of a kind of automaton:
there is a goto function which follows the trie on the next input
character, a failure function that tells where in the trie to continue
when the trie cannot be followed, and an output function which for each
position tells which keyword(s) have been recognized.
The goto function is trivial to construct: it just follows the trie.
The output function contains for each node the zero or more keywords
recognized at that node (= spelled on the path from the root to that
node), and is easy to construct.
The failure function is harder, and basically moves the trie to the
right over the input by one or more positions, to the next possible
match.
When we are at a node N, we have already recognized the word "spelled"
by that node.
Call that word v.
We now remove the first character from v yielding the string w and
try to find w in the trie, starting at the root.
If we succeed, the node we end up at is the failure node of node N
(and we have moved the trie one position to the right).
If we don't, we repeat the process with w, and so on, until we either
obtain a suffix that is again an entry in the trie and which then
identifies a failure node, or we exhaust the already recognized word and
then the root is the failure node of node N.
The above algorithms are depth-first.
The paper produces more efficient breadth-first versions of them, and
proves their correctness by showing that they are equivalent to the
depth-first algorithms.
Since the failure moves are ε moves, they can be eliminated, and it
is shown how the goto/failure automaton can be replaced by a DFA.
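A compact rendering of the goto/failure/output construction in Python
(my code; the failure function is computed breadth-first, and the
failure moves are kept rather than eliminated into a DFA):

    from collections import deque

    def build(keywords):
        goto, out = [{}], [[]]             # trie and output function
        for w in keywords:
            s = 0
            for c in w:
                if c not in goto[s]:
                    goto.append({})
                    out.append([])
                    goto[s][c] = len(goto) - 1
                s = goto[s][c]
            out[s].append(w)
        fail = [0] * len(goto)
        queue = deque(goto[0].values())    # depth-1 nodes fail to root
        while queue:
            s = queue.popleft()
            for c, t in goto[s].items():
                queue.append(t)
                f = fail[s]
                while f and c not in goto[f]:
                    f = fail[f]            # shorten the suffix further
                fail[t] = goto[f].get(c, 0)
                out[t] += out[fail[t]]     # keywords ending here too
        return goto, fail, out

    def search(text, goto, fail, out):
        s, hits = 0, []
        for i, c in enumerate(text):
            while s and c not in goto[s]:
                s = fail[s]                # the failure moves
            s = goto[s].get(c, 0)
            hits += [(i - len(w) + 1, w) for w in out[s]]
        return hits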
Ken Thompson,
Regular expression search algorithm,
Commun. ACM,
vol. 11,
#6,
June 1968,
pp. 419-422.
The regular expression is turned into a transition diagram, which is then
interpreted in parallel.
Remarkably, each step generates (IBM 7094) machine code to execute the
next step.
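The code generation is machine-specific, but the parallel interpretation
itself is easy to sketch (hypothetical data representation and naming,
mine): the set of all states the automaton could be in is carried along,
so no backtracking is ever needed:

    # nfa: state -> list of (label, next_state); label None is an
    # epsilon move, taken for free via the closure
    def run_parallel(nfa, start, accept, text):
        def closure(states):
            todo, seen = list(states), set(states)
            while todo:
                s = todo.pop()
                for label, t in nfa.get(s, []):
                    if label is None and t not in seen:
                        seen.add(t)
                        todo.append(t)
            return seen
        states = closure({start})
        for i, c in enumerate(text):
            states = closure({t for s in states
                              for label, t in nfa.get(s, [])
                              if label == c})
            if accept in states:
                return i  # position where a match ends
        return -1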
Herbert T. Glantz,
On the recognition of information with a digital computer,
Journal of the ACM,
vol. 4,
#2,
April 1957,
pp. 178-188.
Hesitant exploratory paper on the possibility of imitating by computer
the fuzzy string matching so easily done by humans: even in "NE
WYORKXYZ" any human will immediately find the name "NEW YORK".
The problem is defined as finding the best match for a given string (the
"subject") in a set of strings (the "target set"), although the word
"string" is not yet used.
The author envisions a distance function between two strings and a
threshold, and explores the effect of the threshold on possible
outcomes (failure to recognize, ambiguous recognition, incorrect failure
to recognize, incorrect recognition).
(Actually the first outcome is not an error: the matching string just is
not there.)
The author gives just one distance function, the percentage of
characters that do not match their counterparts in the same positions,
and suggests, but does not formulate, several other distance functions:
one which takes transposed characters into account,
one in which one hand in typing has slipped down one row on the
keyboard, and one based on keying and recognition errors in Morse code.
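The one given distance function in Python (padding the shorter string is
my choice; the paper does not say):

    def mismatch_fraction(subject, target):
        # fraction of characters that do not match their
        # counterparts in the same positions
        n = max(len(subject), len(target))
        if n == 0:
            return 0.0
        return sum(a != b for a, b in
                   zip(subject.ljust(n), target.ljust(n))) / n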
On a higher level, a kind of harmonic analysis (comparing Fourier
transforms) is suggested.
Comparing all possible error permutations is mentioned but considered
computationally infeasible.
The author also explains binary search in one paragraph (p. 183).
No literature references.