Computational Linguistics and Text Processing

Literature references and annotations by Dick Grune, dick@dickgrune.com.
Last update: Fri Nov 12 17:03:01 2021.

These references and annotations were originally intended for personal use and are presented here only in the hope that they may be useful to others. There is no claim to completeness or even correctness. Each annotation represents my understanding of the text at the moment I wrote the annotation.
No guarantees given; comments and content criticism welcome.

* Jan Hendrik Harmsen, These are the words -- Ele haDevarim: procedures for computer-assisted syntactical parsing and actant analysis of biblical Hebrew texts, PhD Thesis, Vrije Universiteit Amsterdam, Amsterdam, 1998, pp. 243.
The parsing process is divided into two parts (lexical analysis having already been done by tagging each word in the input, the Bible, by other means): a bottom-up process, which constructs the lower regions of the parse trees, and a top-down process, which constructs the top layers. The top-down process is combined with and followed by scans that compute attributes in the nodes that carry reference information: a semantic process. Each of the processes is driven by a rule base containing rewrite rules; it is unclear to me whether the driver(s) and the rule bases are uniform over the three processes.
     The bottom-up process identifies phrases: noun, verb, preposition and a few other phrases, starting from the tags (attributes) attached to the separate words and doing bottom-up pattern matching. The possible findings are stored again as attributes (tags) in higher nodes. Possible phrases are: "dibber" = "he spoke", "melekh ha'emori" = "king of the Amorites", "b`ever hayarden" = "at the border of the Jordan".
     The top-down process is guided by a relatively small set of CF grammar rules; a rule base, which sometimes has detailed knowledge of specific Hebrew verbs, is used to reduce the number of possible matches. This scan identifies subjects and objects of verbs. Again the results are stored as tags in nodes. These two processes work on sentences.
     The actant analysis (reference analysis) starts by tagging those nodes that (syntactically) can represent actants, those that refer to actants, and those that do neither. References to actants are numerous in Hebrew and can consist of suffixes to words: "horash<t><am>" = "<you> will drive <them> out", where the parts in angle brackets correspond. Next, direct speech information is distributed over the nodes. Noun phrases can also be unified by hand, when they are known to refer to the same person, for example "your god" and "the lord". A rule base (= set of attribute evaluation rules) is then used to propagate identity information. The actant analysis works on paragraphs, unlike the other processes. About 90% of the references are resolved automatically; some cases, like "he buried him", are difficult and require real-world information.
     Although some details of the implementation are given (mainly data representations), the three processes themselves are described only very informally.
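     As an illustration of the bottom-up phase only, here is a minimal Python sketch of rule-driven phrase identification over pre-tagged words; the tags, rules and data structures are invented for the example and are not Harmsen's actual rule base:

        # Minimal sketch of bottom-up phrase identification over tagged words.
        # Tags, rules and node representation are hypothetical illustrations.
        RULES = [
            (("PREP", "NOUN"), "PP"),   # e.g. "b`ever hayarden"
            (("NOUN", "NOUN"), "NP"),   # construct state: "melekh ha'emori"
            (("VERB",),        "VP"),   # e.g. "dibber"
        ]

        def identify_phrases(tagged_words):
            """Repeatedly replace the leftmost, longest matching tag pattern
            by a phrase node, until no rule applies any more."""
            nodes = [{"tag": tag, "children": [word]} for word, tag in tagged_words]
            changed = True
            while changed:
                changed = False
                # try longer patterns first, leftmost match first
                for pattern, label in sorted(RULES, key=lambda r: -len(r[0])):
                    for i in range(len(nodes) - len(pattern) + 1):
                        window = tuple(n["tag"] for n in nodes[i:i + len(pattern)])
                        if window == pattern:
                            nodes[i:i + len(pattern)] = [
                                {"tag": label, "children": nodes[i:i + len(pattern)]}]
                            changed = True
                            break
                    if changed:
                        break
            return nodes

        print(identify_phrases([("b`ever", "PREP"), ("hayarden", "NOUN")]))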

* Laurens Wilhelmus Maria Bod, Enriching Linguistics with Statistics: Performance Models of Natural Language, PhD Thesis, University of Amsterdam, Amsterdam, 1995, pp. ???.
Using one or more corpora, statistics are gathered on the frequency of structures larger than a single grammar rule production. These statistics are then used to reduce the ambiguity in natural language parsing.
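     A toy Python sketch of the idea (the treebank and scoring are invented illustrations, not Bod's actual Data-Oriented Parsing model): count tree fragments larger than a single rule in a corpus and use the counts to rank competing parses.

        # Toy sketch: prefer parses built from frequently seen tree fragments.
        from collections import Counter

        def subtrees(tree):
            """Yield every subtree of a tree written as (label, child, ...)."""
            if isinstance(tree, tuple):
                yield tree
                for child in tree[1:]:
                    yield from subtrees(child)

        def fragment_counts(treebank):
            return Counter(t for tree in treebank for t in subtrees(tree))

        def score(parse, counts):
            """Rank a candidate parse by the corpus frequency of its subtrees."""
            return sum(counts[t] for t in subtrees(parse))

        treebank = [("S", ("NP", "time"), ("VP", ("V", "flies")))] * 2
        counts = fragment_counts(treebank)
        parse_a = ("S", ("NP", "time"), ("VP", ("V", "flies")))    # seen structure
        parse_b = ("S", ("NP", ("N", "time"), ("N", "flies")))     # unseen structure
        print(score(parse_a, counts), score(parse_b, counts))      # 8 0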

* Bob Carpenter, Two-level finite state morpho-phonology -- A rational reconstruction of KIMMO, 1995, pp. 7.
An automaton-theory foundation for KIMMO, a set of linguistic finite-state transducers from a lexical (underlying) to a phonetic (surface) representation. (What happened to the phonemic layer? It was removed, hence the "two-level"!?!?)
     KIMMO consists of a set of KIMMO transducers. These transducers are special in that their transitions are labeled with sets of symbols rather than with single symbols. A transduction performed on an incoming symbol s uses the most specific transition available: the arc labeled with the smallest set that includes s.
     The KIMMO transducers are all combined into one finite-state transducer called The Big Machine; this machine is a normal transducer, in which the arcs are labeled with single symbols. The paper gives the construction for this Big Machine in terms of automata theory. Basically the Big Machine is just the product of all the KIMMO automata, but the preference for the most specific transition complicates things. The author is surprised that the Big Machine can be non-deterministic, but since it operates both ways and can also convert surface representations into lexical representations, non-determinism is to be expected; after all, the French surface form [don] can represent the lexical forms donne, donnes, and donnent.
     Usually the KIMMO automata contain many null transitions, transitions of the form $ s → s $; the presence of such a transition in an automaton means that that automaton is not interested in s. This feature (the Null Hypothesis) is incorporated into the formalism.
     KIMMO also features a dictionary -- a set of lexicons, which filter the possible lexical sequences. The paper indicates how the dictionary could be incorporated into the formalism but the result is not given.
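     As a minimal Python sketch (with invented states, label sets and output symbols -- not Carpenter's formalism or the real KIMMO rules), this is what "take the arc labeled with the smallest set containing the input symbol" amounts to:

        # One KIMMO-style transducer with set-labeled arcs: on input symbol s,
        # take the transition whose label is the smallest set containing s.
        TRANSITIONS = {
            # state -> list of (label_set, output_symbol, next_state)
            "q0": [
                (frozenset("aeiou"), "V", "q0"),                  # any vowel
                (frozenset("e"), "E", "q0"),                      # more specific: just 'e'
                (frozenset("bcdfghjklmnpqrstvwxyz"), "C", "q0"),  # any consonant
            ],
        }

        def step(state, symbol):
            """Return (output, next_state) via the most specific applicable arc."""
            candidates = [arc for arc in TRANSITIONS[state] if symbol in arc[0]]
            if not candidates:
                raise ValueError("no transition for %r in state %s" % (symbol, state))
            label, out, nxt = min(candidates, key=lambda arc: len(arc[0]))
            return out, nxt

        def transduce(word, state="q0"):
            output = []
            for symbol in word:
                out, state = step(state, symbol)
                output.append(out)
            return "".join(output)

        print(transduce("bed"))   # 'CEC': the 'e' arc wins over the general vowel arc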

* Michael Maxwell, Parsing using linearly ordered phonological rules, in Computational Phonology -- First Meeting of the ACL Special Interest Group in Computational Phonology, Waxhaw, NC, Summer Institute of Linguistics, July 1994, pp. 59-70.
This is parsing in the sense of phonological analysis: how to analyze a word like 'engagingly' into its components: 'engage-ing-ly'. This is either trivial (above) or impossible (what are the components of 'brought'?) in English and most other Indo-European languages, but a big thing in many other languages. For example, the Hebrew word 'efronot' (pencils) is to be analyzed as 'ipar-on-ot' (lead-thing-plural). The paper discusses phonological rules only; morphological rules (which tell you that it is 'engage-ing-ly' rather than 'engage-ly-ing') are not considered.
     The phonological rules are formulated as rewrite rules from a more lexical form to a more surface form; the initial form is the lexically constructed form ('ipar-on-ot'), the final form is the pronounced (surface) form ('efronot'). The rules are context-sensitive, but the context is always finite. They are applied in a fixed order, but some rules may be applied repeatedly. Many rules are deleting or neutralizing: they delete segments (like the 'a' in 'ipar'). The rules work on segments, which are the smallest meaningful units (the English word 'church' consists of three segments: 'ch', 'ur', and 'ch'). Segments are characterized by a set of attributes, which are variables of enumeration types, and which can have values like 'voiced', 'guttural', etc. The task of the algorithm is to reconstruct the lexical form from the pronounced form.
     The basic idea is to run the rules backwards, but arbitrary segments may have to be undeleted, and many attributes cannot be reconstructed reliably. Previous algorithms just recorded all combinations, leading to exponential growth. When the algorithm presented here cannot reliably reconstruct an attribute, the attribute is just uninstantiated (set to 'unknown'); when it may have to undelete a segment, it inserts the segment marked 'optional'. This keeps the exponential growth away, but may yield a very under-defined semi-final lexical form.
     This semi-final lexical form is now matched against all actual lexical forms that are compatible with it (through an algorithm not explained in the paper), and for each such match the rules are applied forwards. If the result is the surface form we started from, the semi-final lexical form is accepted as final; otherwise it is rejected. Note that multiple final lexical forms may be found.
     The paper is relatively vague, both on linguistic and on computer-science details. I am puzzled by the remark in the abstract that "its speed compares favorably with that of a KIMMO-type parser", where the text shows that it is three times slower.
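     A minimal Python sketch of the overall analyse-then-verify loop, with one invented context-sensitive rule (lexical 'ia' becomes surface 'e' after 'r') and an invented lexicon -- not Maxwell's rules, data or attribute machinery:

        # Backward application is deliberately sloppy: every surface 'e' is
        # allowed to have come from 'e' or from 'ia' (the "uninstantiated"
        # attributes of the paper); the over-generation is repaired by
        # running the rule forwards again over each lexicon match.
        import re

        def apply_forwards(lexical):
            return lexical.replace("ria", "re")        # 'ia' -> 'e' after 'r'

        def backward_pattern(surface):
            return re.compile("".join("(?:e|ia)" if c == "e" else re.escape(c)
                                      for c in surface) + "$")

        LEXICON = ["keriat", "kiaret", "keret"]        # invented lexical forms

        def analyse(surface):
            pattern = backward_pattern(surface)
            return [entry for entry in LEXICON
                    if pattern.match(entry) and apply_forwards(entry) == surface]

        print(analyse("keret"))   # ['keriat', 'keret']: 'kiaret' matches the
                                  # under-specified pattern but fails verification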

* Christ Dekkers, Mark-Jan Nederhof, Janos J. Sarbo, Coping with ambiguity in decorated parse forests, in Coping with Linguistic Ambiguity in Typed Feature Formalisms, ICAI '92, Aug. 1992, pp. 11-19.
In this paper, the attributes consist of finite sets of enumeration values, for example { NOUN, VERB } for "miss". Expanding these sets would generate huge parse forests, so we keep the original parse forest and set up propagation equations. Sections ("cells") of the parse forest are isolated, somewhat similar to basic blocks. Inside these cells the equations are equalities; between the cells they are inclusions, somewhat similar to the dataflow equations between basic blocks. Additional user information may be needed to achieve uniqueness.
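     A minimal Python sketch of the propagation idea (the constraints below are invented illustrations, not the paper's rule system): set-valued attributes are repeatedly intersected with what the constraints allow, until a fixed point is reached.

        # Shrink set-valued attributes such as {NOUN, VERB} for "miss" by
        # iterating invented constraints to a fixed point, instead of
        # expanding the parse forest per attribute value.
        def propagate(attrs, constraints):
            """attrs: node -> set of possible values; each constraint returns
            a (node, allowed_set) pair given the current attrs."""
            changed = True
            while changed:
                changed = False
                for constraint in constraints:
                    node, allowed = constraint(attrs)
                    new = attrs[node] & allowed
                    if new != attrs[node]:
                        attrs[node] = new
                        changed = True
            return attrs

        attrs = {"the": {"DET"}, "miss": {"NOUN", "VERB"}, "smiles": {"NOUN", "VERB"}}
        constraints = [
            # a word directly after a determiner is not a finite verb
            lambda a: ("miss", {"NOUN"} if "DET" in a["the"] else {"NOUN", "VERB"}),
            # the sentence needs a finite verb; if "miss" cannot be one, "smiles" must be
            lambda a: ("smiles", {"VERB"} if a["miss"] == {"NOUN"} else {"NOUN", "VERB"}),
        ]
        print(propagate(attrs, constraints))
        # {'the': {'DET'}, 'miss': {'NOUN'}, 'smiles': {'VERB'}}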

* Christopher W. Fraser, Balachander Krishnamurthy, Live text, Software -- Practice and Experience, vol. 20, #8, Aug. 1990, pp. 851-858.
The idea of live text editing is as follows: correcting the text in an error message automatically corrects the original text. For example, correcting a misspelled word in a spelling checker window corrects its occurrence(s) in the text. Implemented in emacs. (The idea seems to have been superseded by WYSIWYG).

* James K. Mullin, Daniel J. Margoliash, A tale of three spelling checkers, Software -- Practice and Experience, vol. 20, #6, June 1990, pp. 625-630.
English-oriented spelling checker using Bloom filters (for which see Bloom, CACM 1970, 13(7)422-426) on three sets of English words: words that accept -ed and -s, words that accept -s, and others. Input words are looked up in all three sets in all combinations (at most 6). The 3 * 15 = 45 filter hash functions were obtained by dividing the word as a variable-length base 256 integer by the 45 highest primes smaller than the bit vector size of the filter ($ 2 sup 20 $). Note that this is heuristic testing: an incorrect word may be accepted but a correct word will not be rejected.
     Corrections are suggested by creating all single-error substitution, interchange, insertion, and deletion variants of the word and throwing them at the Bloom filter. The incorrect-suggestion rate was found to be too high, and trigrams were used to filter out obvious nonsense words.
     The PC version left too much time between successive interactively reported errors. The process was rearranged to first find all errors, then show them to the user to have them corrected, and then incorporate the corrections. The total time was still the same, but the user interaction was much more localized in time.
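     A minimal Python sketch of the lookup and correction machinery described above (toy word list; the real checker uses three word sets with 15 hash functions each, a $ 2 sup 20 $-bit vector, and trigram filtering of the suggestions):

        # Bloom-filter spelling check with the hash scheme described above:
        # the word is treated as a variable-length base-256 integer and
        # reduced modulo the highest primes below the bit-vector size.
        import string

        BITS = 1 << 20                     # size of the bit vector
        NUM_HASHES = 15                    # hash functions per word set

        def is_prime(n):
            return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

        def highest_primes(limit, k):
            """The k highest primes below limit, used as hash moduli."""
            primes, n = [], limit - 1
            while len(primes) < k:
                if is_prime(n):
                    primes.append(n)
                n -= 1
            return primes

        PRIMES = highest_primes(BITS, NUM_HASHES)

        def hashes(word):
            n = int.from_bytes(word.encode("ascii"), "big")
            return [n % p for p in PRIMES]

        class BloomFilter:
            def __init__(self, words):
                self.bits = bytearray(BITS // 8)
                for w in words:
                    for h in hashes(w):
                        self.bits[h // 8] |= 1 << (h % 8)

            def maybe_contains(self, word):    # may err on the "yes" side only
                return all(self.bits[h // 8] & (1 << (h % 8)) for h in hashes(word))

        def single_error_variants(word):
            """All substitution, interchange, insertion and deletion variants."""
            letters = string.ascii_lowercase
            subs  = [word[:i] + c + word[i+1:] for i in range(len(word)) for c in letters]
            swaps = [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(len(word)-1)]
            ins   = [word[:i] + c + word[i:] for i in range(len(word)+1) for c in letters]
            dels  = [word[:i] + word[i+1:] for i in range(len(word))]
            return set(subs + swaps + ins + dels)

        bf = BloomFilter(["spell", "checker", "filter"])
        print(bf.maybe_contains("spell"))                        # True
        print(sorted(v for v in single_error_variants("spel")
                     if bf.maybe_contains(v)))                   # almost certainly ['spell']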

* Evan L. Antworth, PC-KIMMO -- A two-level processor for morphological analysis, Occasional Publications in Computing 16, Summer Inst. of Ling., Dallas, Tx., 1990.
WWW???

* J. Carroll, J. Abaitua, A morphological parser for Basque verbs' inflection, in Perspectives in Artificial Intelligence, Volume II: Machine Translation, ed. by J.A. Campbell & J. Cuena, Ellis Horwood, Chichester, 1989, pp. 77-85.
A generator for the forms of the Basque verb (of which there may be 2000 per verb!) is written in Prolog. The first stage generates all combinations of morphemes orthogonally. This set is filtered through "feature co-occurrence restrictions", which specify interrelational restrictions between the morphemes. A third stage checks the combined morphemes for compatibility with the stem of the verb. Prolog resolution is then applied to use this generator as an analyser.
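     A minimal sketch of the three-stage generate-and-filter pipeline, transposed from Prolog to Python; the morpheme slots, co-occurrence restriction and stem check are invented illustrations, not the actual Basque data:

        # Stage 1 generates all morpheme combinations orthogonally, stage 2
        # filters them through co-occurrence restrictions, stage 3 checks
        # compatibility with the verb stem.  All data here is invented.
        from itertools import product

        SLOTS = [
            ["n-", "z-", "d-"],         # person prefixes (invented)
            ["aki", "io"],              # root allomorphs (invented)
            ["-t", "-zu", "-te"],       # person/number suffixes (invented)
        ]

        def cooccurrence_ok(form):
            prefix, root, suffix = form
            # invented restriction: the 'n-' prefix never combines with '-te'
            return not (prefix == "n-" and suffix == "-te")

        def compatible_with_stem(form, stem):
            # invented restriction: the root allomorph must belong to the stem
            return form[1] in {"jakin": ["aki"], "joan": ["io"]}[stem]

        def forms(stem):
            candidates = product(*SLOTS)                       # stage 1: generate
            candidates = filter(cooccurrence_ok, candidates)   # stage 2: filter
            return ["".join(f) for f in candidates             # stage 3: stem check
                    if compatible_with_stem(f, stem)]

        print(forms("jakin"))   # ['n-aki-t', 'n-aki-zu', 'z-aki-t', ...]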

* J.J. Schoorl, De computer als vertaler, (in Dutch: The computer as a translator), Boom, Meppel, 1986, pp. 238.
The first half of the book is a well-argued analysis of the possibilities and difficulties in automatic natural-language translation. The second half treats a number of commercial and research systems.

* Geoffrey K. Pullum, On two recent attempts to show that English is not a CFL, Computational Linguistics, vol. 10, #3-4, 1984, pp. 182-186.
Counters two papers, one by Postal and Langendoen (1984) and one by Higginbotham (1984), on the grounds that the non-grammatical sentences excluded by these papers are not actually non-grammatical. In a postscript the author acknowledges that new evidence by Shieber on Swiss German (1985) and by Culy on Bambara (????) may convince him that at least some languages are not CF; so why not English?

* Joan Bresnan, Ronald M. Kaplan, Stanley Peters, Annie Zaenen, Cross-serial dependencies in Dutch, Linguistic Inquiry, vol. 13, #4, 1982, pp. 613-635.
Cross-serial dependency is exhibited by such Dutch sentences as "... dat Jan Piet Marie de kinderen zag helpen leren zwemmen." Such sentences do not prove that Dutch cannot be weakly context-free (since the surface structure is $ a sup n b sup n $) but they do prove that Dutch is not strongly context-free, since the structural description is roughly $ omega omega $ where $ omega in SIGMA sup * $. In fact the paper supplies a much more precise characterization of the parse trees of cross-serially dependent sentences. The structure can be described rather naturally using lexical-functional grammars (Kaplan and Bresnan [NatLang, 1982]).

dat [Jan [Marie [de oppasser [de olifanten [zag helpen voeren]]]]]

* Kenneth Church, On parsing strategies and closure, in 18th Annual Meeting of the Association for Computational Linguistics, 1980, pp. 107-111.
Extensive arguments why finite-state processing suffices for natural languages, mainly based on the idea that deep nesting exceeds people's abilities. Note DG: Some of his examples of unacceptable complexity, all of which are in English, are OK in, for example, Turkish; but there they are regular, not nesting.

* Yorick Wilks, An intelligent analyzer and understander of English, Commun. ACM, vol. 18, #5, May 1975, pp. 264-274.
Produces incredibly impressive sample translations, using lots of semantic information. If this system was already so good in 1975, why haven't we heard of it since?