String Searching

> Literature references and annotations by Dick Grune,
Last update: Mon Sep 14 13:33:32 2009.

These references and annotations were originally intended for personal use and are presented here only in the hope that they may be useful to others. There is no claim to completeness or even correctness. Each annotation represents my understanding of the text at the moment I wrote the annotation.
> No guarantees given; comments and content criticism welcome.

* Mads Sig Ager, Olivier Danvy, Henning Korsholm Rohde, Fast partial evaluation of pattern matching in strings, ACM SIGPLAN Notices, vol. 38, #10, pp. 243-249. Oct 2003,
It is possible to derive the linear-time Knuth, Morris and Pratt string matching algorithm from a naive quadratic one by partial evaluation, but this specialization process still takes quadratic time. This paper shows how to derive the matcher in linear time using partial evaluation. It has many literature references on the subject.

* Charles L.A. Clarke, Gordon V. Cormack, On the use of regular expressions for searching text, ACM TOPLAS, vol. 19, #3, May 1997, pp. 413-426.
The authors argue that for text searching the disambiguation rule should be to return the shortest matching string, rather than the longest, as is usual in lexical analysers. This would indeed obviate the need for regular expressions like ([^()]*) to match a non-nesting parenthesized segment, since under the proposed regime the regular expression (.*) would do the job. They also propose a containment operator ≥, with r ≥ s meaning s provided it is contained in a segment matching r.
     Algorithms for implementing the above are provided.

* Ricardo Baeza-Yates, Gaston H. Gonnet, A new approach to text searching, Commun. ACM, vol. 35, #10, pp. 74-82. Oct. 1992,
A pattern consisting of m components is checked for by m string comparators running in parallel. The states are combined and passed on à la Dömölki.

* Andrew Hume, Daniel Sunday, Fast string searching, Softw. Pract. Exper., vol. 21, #11, Nov. 1991, pp. 1221-1248.
The authors do three things at the same time: give a taxonomy of single-keyword string searching algorithms, compare selected entries in this taxonomy experimentally, and report on their own work; they switch between these modes in unexpected moments.
     A general framework for single-keyword string searching algorithms is set up as follows:

FOR i OVER text:
  // skip to interesting character:
  WHILE text[i] /= interesting_char:
    ADD skip_distance[text[i]] TO i
  // try a guarded match:
  IF pattern COULD BE AT i:           // must be a cheap test
    MATCH text AT i TO pattern OR FAIL AT p
    SET p TO i;
  // shift over non-matching part:
  ADD shift_distance(i, p) TO i;

     This decomposes the algorithm into four components: the choice of the interesting character (and its skip_distance[]), the cheap guard test, the match algorithm, and the shift to the first promising position. Although these components are not at all independent, the paper treats them more or less as if they were; the authors even give speed figures for some separate components.
     Between 6 and 11 algorithms are described for each of the components; of the possible combinations, 9 had been described in the literature.
     Some of the conclusions are: 1. The Knuth-Morris-Pratt algorithms are found to be consistently much slower than the Boyer-Moore algorithms. 2. The guard test helps a bit, usually. 3. Using statistics on the characters helps, but not more than a few percents. 4. Using two shift heuristics and choosing the best at run time loses with respect to just sticking to one. 5. There is little or no relationship between the number of character comparisons and the speed of the algorithm. 6. Boyer-Moore with a few trimmings is usually the fastest. 7. Finding the optimum for a specific application (natural language, DNA searching, etc.) requires experimentation, for the authors have provided a toolkit.

* P.D. Smith, Experiments with a very fast substring search algorithm, Softw. Pract. Exper., vol. 21, #10, Oct. 1991, pp. 1065-1074.
The idea of matching a string to a given position in a text by comparing chars in a self-adapting order is tested experimentally. Two adaptation methods are described: move the position that causes a mismatch one place up in the list or move it to the top of the list. The influence of the initial ordering was studied. There were 4 initial orderings: character frequency in English, character frequency in C, right-to-left, and random. Each was tested with and without self-adaptation (which one?).
     Conclusions: 1. Self-adaptation often helps a little (-1 to +8% improvement). 2. A generic initial distribution (averaged over English, French and C) is as good (within 2%) as specific English, French or C orders, and better than right-to-left, or random. The specific frequency distribution of the actual input text (if known) allows you to do about 20% better, though.

* A.V. Aho, Algorithms for finding patterns in strings, in Handbook of Theoretical Computer Science -- Algorithms and Complexity, Vol. A, ed. by J. van Leeuwen, Elsevier, Amsterdam, The Netherlands, 1990, pp. 255-300.
Chapter 5 of the handbook. Encyclopedic article on the subject, covering the state of the art in:

single string matching:
	Karp-Rabin, caterpillar hash function
	Knuth-Morris-Pratt, automaton, forward
	Boyer-Moore, backward
multiple string matching:
	Commentz-Walter, p. 278, best description around
regular expression matching:
	Thompson NFSA construction
regular expression matching with variables:
	proved NP-complete
longest common substring location:
	longest path in matrix
	McCreight suffix trees
giving a very readable account of each of them, often with proof and complexity analysis. Draws amazing conclusions from Cook's Theorem: "Every 2-way deterministic push-down automaton (2DPDA) language can be recognized in linear time on a random-access machine".
     The paper ends with 139 literature references.

* Daniel M. Sunday, A very fast substring search algorithm, Commun. ACM, vol. 33, #8, Aug. 1990, pp. 132-142.
To see if a string $S[1..M]$ can be found at a given position p in a text $T[1..N]$, the characters in S are compared in arbitrary order to the corresponding characters in T. If a mismatch is found, a table supplies the leftmost position in which the next try may still succeed. Three choices for precising the arbitrary order are given: right to left, maximal shift and most probable maximal shift.

* Gaston H. Gonnet, Ricardo A. Baezo-Yates, An analysis of the Karp-Rabin string matching algorithm, Inform. Process. Lett., vol. 34, #5, May 1990, pp. 271-274.
Explicit cost functions for the Karp-Rabin algorithm, derived by simple math. The results are compared with actual runs and found to agree with them.

* Ricardo A. Baeza-Yates, Improved string searching, SP&E, vol. 19, #3, pp. 257-271. March 1989,
1.Use the occurrence-of-mismatching-character shift heuristic of Boyer-Moore only. The "next" heuristic of Boyer-Moore costs more than it gains. 2. Combine 2 subsequent characters into a single character. Gains 50% at the expense of a table that is 128 times larger.

* Andrew Hume, A tale of two greps, SP&E, vol. 18, #11, Nov. 1988, pp. 1063-1072.
Description of the tuning of grep (Ken Thompson) and egrep (A.V. Aho) by the author (his part in the so-called "grep wars").
     Grep: use the optimized IO package fio.h rather than stdio.h .
     Egrep, which already used lazy automaton building: 1. Use fio.h . 2. Use Boyer-Moore for non-RE strings. 3. Use Boyer-Moore for the non-RE parts of regular expressions. All this results in an egrep that is between 8 and 40 times faster.

* Alberto Apostolico, Raffaele Giancarlo, The Boyer-Moore string searching strategies revisited, SIAM J. Comput., vol. 15, #1, 1986, pp. 98-105.
Incorporates Galil's (1979) treatment of periodicity of the pattern into the original Boyer-Moore algorithm. This results in an algorithm that performs at most $ 2n - m - 1 $ character comparisons (n: length of input; m: length of pattern).

* Wojchiech Rytter, A correct preprocessing algorithm for Boyer-Moore string searching, SIAM J. Comput., vol. 9, #3, pp. 509-512. 1980,
First shows that Knuth's algorithm for the computation of the shift table D[] ("Fast pattern matching in strings", 1977) is wrong, by comparing its results for two test pattern with those of the definition of D[]; these differ, and Knuth's are clearly the incorrect ones. (The paper does not show, however, what input will make the over-all algorithm fail. After some experimenting I found: pattern[] = "a"*10 (pattern 1 in the paper) and input[] = "a"*12 "bbb" "a"*12. DG.)
     The author then shows that Knuth's algorithm covers only 3 of 4 possible cases, and produces a correct complicated algorithm deriving from the failure function of the Knuth-Morris-Pratt pattern matcher. I don't pretend to understand it yet, but it works. (The assignment q1 := q+1 in the one but last line is superfluous.)
     (A naive implementation of the definition is $ O ( n ^ 3 ) $, but that turns out not to be a problem on modern machines, even for pattern of several thousands of characters long.)

* Beate Commentz-Walter, A string matching algorithm fast on the average, in Automata, Languages and Programming, Lecture Notes in Computer Science #71, ed. by H.A. Maurer, Springer-Verlag, Berlin, 1979, pp. 118-132.
Does string set matching in a way comparable to Aho-Corasick, but, like Boyer-Moore it matches the strings last character first. A backward trie of the strings is constructed. At a given position in the text, a path is looked up in the trie that spells the text characters in backwards order; the trie can be visualized below the text, with the root to the right (although the paper avoids all visualization).
     When the search stops (a match may be found or not), three function values are computed: 1. the minimal right shift of the trie such that the matched text matches a segment in a path in the trie (= the whole matched text could be a substring of a keyword), 2. the minimal right shift of the trie such that the right end of the matched text matches the left end of a path in the trie (= a suffix of the matched text could be a prefix of a keyword), 3. the minimal right shift of the trie such that the mismatch character matches any position in the trie. Values (1) and (2) can be computed from the position in the trie where the search ended, and (3) can be computed from the mismatch character. The actual shift taken is then min(max((1), (3)), (2)).
     The paper is high on formalism and low on explanation, unfortunately. Much is claimed to be "obvious", but little of it is.

* Zvi Galil, On improving the worst case running time of the Boyer-Moore string matching algorithm, Commun. ACM, vol. 22, #9, Sept. 1979, pp. 505-508.
Boyer-Moore string matching usually is $ O ( n / m ) $, where m is the length of the pattern and n is the length of the input, but occasionally it is $ O ( n times m ) $. The simplest example of this undesirable behavior is a pattern $ a sup m $ and an input $ a sup n $ with $ n >> m $: the pattern matches at $ n - m + 1 $ positions, and m comparisons are required for confirmation in each case. The paper aims to remove the $ times m $ component.
     A string is called periodic if it has the form $ u sup n v $ where v is a prefix of u, with $ n >= 1 $; this means that the pattern will match itself if shifted over |u| positions. We assume u to be minimal, so n to be maximal; this is important if u itself is again periodic. The author proves that if the pattern in the Boyer-Moore algorithm is not periodic, the $ times m $ component cannot materialize. So the algorithm needs to be modified for periodic patterns only.
     As long as the pattern does not match, the behavior will be $ O ( n / m ) $. When the pattern matches, we report so, shift it over |u| and start comparing leftwards from the right, as usual, but once we have confirmed |u| characters, we have found another match, and we do not need to confirm the other |m| - |u| character in the input, since they contain $ u sup { n - 1 } v $, as we know from the previous match. All these |u| character had never been tested before, so again the $ times m $ component cannot materialize.
     Formal proofs and code are given. Note that the algorithm ignores the $ "delta" sub 1 $ and $ "delta" sub 2 $ heuristics of the original Boyer-Moore algorithm.

* Robert S. Boyer, J. Strother Moore, A fast string searching algorithm, Commun. ACM, vol. 20, #10, Oct. 1977, pp. 762-772.
One keyword $pat[1..n]$ is searched for in a text string $string$ by comparing the characters in $pat$ with those in $string$ by starting from the right end of $pat$ and working backwards. So we first compare $ string [ i ] $ to $ pat [ n ] $, etc.
     When we find a mismatch, ( $ string [ i - m ] = char != pat [ n - m ] ), we know that the present section of $string$ matches the pattern $ P = [ ^ char ] pat [ n - m + 1 .. n ] $, where $ [ ^ char ] $ matches any character except $char$. Now we look for the first position from the right in $pat$ that matches P, where the left end of P may "fall off" the left end of $pat$. We can shift the pattern to the right over $string$ at least so far as to make the position of P in $pat$ coincide with the position of P in $string$ (or we can shift over the full length of $pat$ if there is no such position); any smaller shift could not possibly lead to a match. This shift depends on $pat$ and m only, and can be precomputed; it is called $ "delta" sub 2 ( m ) $.
     Before doing this shift, however, we look more closely at $char$, and find the rightmost position of $char$ in $pat$. Again, we can shift the pattern to the right over $string$ at least so far as to make the position of $char$ in $pat$ coincide with the position of $char$ in $string$ (or we can shift over the full length of $pat$ if there is no such position); any smaller shift could not possibly lead to a match. This shift depends on $pat$ and $char$ only, and can be precomputed; it is called $ "delta" sub 1 ( char ) $.
     The shift we finally do is the maximum of $ "delta" sub 1 $ and $ "delta" sub 2 $. The result is that for natural language text usually only $ 1 / m $ of the character in $string$ have to be touched, and the algorithm is blindingly fast.
     The paper gives empirical evidence, a statistical-theoretical treatment, implementation hint concerned with buffered I/O, and a history of the algorithm.
     The paper is surprisingly difficult to read, for two reasons. First, the algorithm is derived in a fairly circuitous way, using 4 considerations, of which 1 and 2 are actually the same (required presence of $char$ in $pat$) and so are 3a and 3b (required presence of P in $pat$). Second, everything is computed from the position of the mismatching character, rather than from the position of the pattern. This saves one variable in the programs but requires much confusing arithmetic on the indexes.

* Donald E. Knuth, James H. Morris Jr., Vaughan R. Pratt, Fast pattern matching in strings, SIAM J. Comput., vol. 6, #2, June 1977, pp. 323-350.
Very readable paper in the inimitable Knuth style: very entertaining and informative but a bit unstructured. The first 7 pages are taken up by a quest for a backtrack-free algorithm for single string pattern matching. The basic idea is that when j characters of the n-character string S match, we can shift the pattern forward to the first position in S where the same sequence of j characters occur in S. Working out the details leads to a FSA for scanning the text. Experiments showed that the automaton fails almost always right on the first character, so the loop of the FSA is made to contain a subloop that skips until a character is found that will properly start the FSA: $ j == 1 $ is made a special case. This is the basis of the "skip loop" in many string search algorithms.
     The other 20 pages are dedicated to various other subjects. 1. A theory of palindromes is developed, including an algorithm that can recognize a sequence of palindromes in linear time. 2. The history of the KMP algorithm is given, including the story of Morris' 1967 implementation of his algorithm in a text editor being removed by a successor programmer who did not understand it. 3. A description and analysis of the Boyer-Moore algorithm, which had just been invented, is given. A correction to the proposed implementation is given by Rytter (1980).

* Alfred V. Aho, Margaret J. Corasick, Efficient string matching: an aid to bibliographic search, Commun. ACM, vol. 18, #6, June 1975, pp. 333-340.
How to find members of a set of strings ("keywords") in a long text, "long" meaning that it is worth while to do some preprocessing on the set.
     The basic idea is to put the keywords in a trie; then all paths from the root to a leaf and perhaps some paths from the root to inner nodes spell keywords. We use the positions in the trie as states of a kind of automaton: there is a goto function which follows the trie on the next input character, a failure function that tells where in the trie to continue when the trie cannot be followed, and an output function which for each position tells which keywords(s) have been recognized.
     The goto function is trivial to construct: it just follows the trie. The output function contains for each node the zero or more keywords recognized at that node (= spelled on the path from the root to that node), and is easy to construct. The failure function is harder, and basically moves the trie to the right over the input by one or more positions, to the next possible match. When we are at a node N, we have already recognized the word "spelled" by that node. Call that word v. We now remove the first character from v yielding the string w and try to find w in the trie, starting at the root. If we succeed, the node we end up at is the failure node of node N (and we have moved the trie one position to the right). If we don't, we repeat the process with W, and so on, until we either obtain a suffix that is again an entry in the trie and which then identifies a failure node, or we exhaust the already recognized word and then the root is the failure node of node N.
     The above algorithms are depth-first. The paper produces more efficient depth-first versions of them, and proves their correctness by showing that they are equivalent to the depth-first algorithms.
     Since the failure moves are ε moves, they can be eliminated, and it is shown how the goto/failure automaton can be replaced by a DFA.

* Ken Thompson, Regular expression search algorithm, Commun. ACM, vol. 11, #6, June 1968, pp. 419-422.
The regular expression is turned into a transition diagram, which is then interpreted in parallel. Remarkably, each step generates (IBM 7094) machine code to execute the next step.

* Herbert T. Glantz, On the recognition of information with a digital computer, Journal of the ACM, vol. 4, #2, pp. 178-188. April 1957,
Hesitant exploratory paper on the possibility of imitating by computer the fuzzy string matching so easily done by humans: even in "NE WYORKXYZ" any human will immediately find the name "NEW YORK". The problem is defined as finding the best match for a given string (the "subject") in a set of strings (the "target set"), although the word "string" is not yet used. The author envisions a distance function between two strings and a threshold, and explores the effect of the threshold on possible outcomes (failure to recognize, ambiguous recognition, incorrect failure to recognize, incorrect recognition). (Actually the first outcome is not an error: the matching string just is not there.)
     The author gives just one distance function, the percentage of characters that do not match their counterparts in the same positions, but suggests but does not formulate several other distance functions: one which takes transposed characters into account, one in which one hand in typing has slipped down one row on the keyboard, and one based on keying and recognition errors in Morse code. On a higher level, a kind of harmonic analysis (comparing Fourier transforms) is suggested. Comparing all possible error permutations is mentioned but considered computationally infeasible.
     The author also explains binary search in one paragraph (pg. 183). No literature references.