Computational Linguistics
and
Text Processing
Literature references and annotations by Dick Grune, dick@dickgrune.com.
Last update: Fri Nov 12 17:03:01 2021.
These references and annotations were originally intended
for personal use and are presented here only in the hope
that they may be useful to others.
There is no claim to completeness or even correctness.
Each annotation represents my understanding of the text
at the moment I wrote the annotation.
No guarantees given; comments and content criticism welcome.
Jan Hendrik Harmsen,
These are the words -- Ele haDevarim: procedures for computer-assisted syntactical parsing and actant analysis of biblical Hebrew texts,
(PhD Thesis),
Vrije Universiteit Amsterdam,
Amsterdam,
1998,
pp. 243.
The parsing process is divided into two parts (lexical analysis has
already been done by tagging each word in the input, the Bible, by other
means): a bottom-up process, which constructs the lower regions of the
parse tree, and a top-down process, which constructs the top layers.
The top-down process is combined with and followed by scans that compute
attributes in the nodes that carry reference information: a semantic
process.
Each of the processes is driven by a rule base which contains rewrite
rules; it is unclear to me if the driver(s) and the rule bases are
uniform over the three processes.
The bottom-up process identifies phrases (noun, verb, prepositional,
and a few other kinds), starting from the tags (attributes) attached to
the individual words and doing bottom-up pattern matching.
The possible findings are stored again as attributes (tags) in higher
nodes.
Possible phrases are:
"dibber" = "he spoke",
"melekh ha'emori" = "king of the Amorites",
"b`ever hayarden" = "at the border of the Jordan".
The top-down process is guided by a relatively small set of CF grammar
rules; a rule base, which sometimes has detailed knowledge of specific
Hebrew verbs, is used to reduce the number of possible matches.
This scan identifies subjects and objects of verbs.
Again the results are stored as tags in nodes.
These two processes work on sentences.
The actant analysis (reference analysis) starts by tagging those nodes
that (syntactically) can represent actants, those that refer to actants,
and those that do neither.
These references are numerous in Hebrew and can consist of suffixes to
words: "horash<t><am>" = "<you> will drive <them> out", where the parts
in angle brackets correspond.
Next, direct speech information is distributed over the nodes.
Noun phrases can also be unified by hand, when it is known that both
refer to the same person, for example "your god" and "the lord".
A rule base (= set of attribute evaluation rules) is then used to
propagate identity information.
The actant analysis works on paragraphs, unlike the other processes.
About 90% of the references are resolved automatically; some cases,
like "he buried him", are difficult and require real-world information.
Although some details of the implementation are given (mainly data
representations), the three processes themselves are described only
rather informally.
Laurens Wilhelmus Maria Bod,
Enriching Linguistics with Statistics: Performance Models of Natural Language,
PhD Thesis,
University of Amsterdam,
Amsterdam,
1995,
pp. ???.
Using one or more corpora, statistics are gathered on the frequency of
structures larger than a single grammar rule production.
These statistics are then used to reduce the ambiguity in natural
language parsing.
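A toy rendering (my own construction, not Bod's actual model) of the
idea: gather counts of tree fragments larger than a single rule, here a
node together with its children and grandchildren, and prefer the
reading built from the more frequent fragments.

    # Toy illustration of scoring parses with corpus statistics on
    # structures larger than a single rule: two-level tree fragments.
    # Data structures and scoring are illustrative only.
    from collections import Counter

    def fragments(tree):
        """Yield (root, child labels, grandchild labels per child)."""
        label, children = tree
        if not children:
            return
        kids = tuple(c[0] for c in children)
        grandkids = tuple(tuple(g[0] for g in c[1]) for c in children)
        yield (label, kids, grandkids)
        for c in children:
            yield from fragments(c)

    def score(tree, counts):
        """Higher score = built from more frequently attested fragments."""
        return sum(counts[f] for f in fragments(tree))

    if __name__ == "__main__":
        # A tree is (label, [subtrees]); leaves have an empty subtree list.
        treebank = [
            ("S", [("NP", [("she", [])]),
                   ("VP", [("saw", []), ("NP", [("the", []), ("man", [])])])]),
        ]
        counts = Counter(f for t in treebank for f in fragments(t))
        # two competing analyses of an ambiguous sentence (hypothetical):
        reading1 = ("S", [("NP", [("she", [])]),
                          ("VP", [("saw", []), ("NP", [("the", []), ("dog", [])])])])
        reading2 = ("S", [("NP", [("she", []), ("saw", [])]),
                          ("VP", [("the", []), ("dog", [])])])
        print(score(reading1, counts), score(reading2, counts))
        # reading1, which reuses treebank fragments, scores higher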
Bob Carpenter,
Two-level finite state morpho-phonology -- A rational reconstruction of KIMMO,
1995,
pp. 7.
An automaton-theory foundation for KIMMO, a set of linguistic
finite-state transducers from a lexical (underlying) to a phonetic
(surface) representation. (What happened to the phonemic layer? It was
removed, hence the "two-level"!?!?)
KIMMO consists of a set of KIMMO transducers.
These transducers are special in that the transitions in them are
labeled with sets of symbols, rather than with single symbols.
A transduction on an incoming symbol s takes the most specific
transition available, i.e. it follows the arc labeled with the smallest
set that includes s.
The KIMMO transducers are all combined into one finite-state transducer
called The Big Machine; this machine is a normal transducer, in which the
arcs are labeled with single symbols.
The paper gives the construction of this Big Machine in terms of
automata theory.
Basically the Big Machine is just the product of all the KIMMO automata,
but the preference of the most specific transition complicates things.
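A rough sketch (my own simplification: acceptors instead of transducers,
no null-transition handling) of how such a product might be computed: a
product state is a tuple of component states, and on each input symbol
every component takes the arc with the smallest label set containing
that symbol.

    # Product of several set-labeled automata, where each component takes
    # its most specific (smallest) matching arc.  Names are mine, not KIMMO's.
    def most_specific(arcs, state, symbol):
        """Among arcs (src, label_set, target) leaving state, pick the one
        with the smallest label set containing symbol; None if none applies."""
        candidates = [(len(lab), tgt) for (src, lab, tgt) in arcs
                      if src == state and symbol in lab]
        return min(candidates)[1] if candidates else None

    def big_machine(machines, alphabet):
        """machines: list of (start_state, arcs).  Returns the single-symbol
        arcs of the product automaton reachable from the combined start."""
        start = tuple(m[0] for m in machines)
        arcs, todo, seen = [], [start], {start}
        while todo:
            state = todo.pop()
            for sym in alphabet:
                nxt = tuple(most_specific(m[1], s, sym)
                            for s, m in zip(state, machines))
                if None in nxt:
                    continue                    # some component blocks
                arcs.append((state, sym, nxt))  # ordinary single-symbol arc
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return arcs

    if __name__ == "__main__":
        vowels = set("aeiou")
        letters = set("abcdefghijklmnopqrstuvwxyz")
        m1 = (0, [(0, vowels, 1), (0, letters, 0), (1, letters, 0)])
        m2 = (0, [(0, letters, 0)])             # not interested in anything
        for arc in sorted(big_machine([m1, m2], "ab")):
            print(arc)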
The author is surprised that the Big Machine can be non-deterministic,
but since it operates in both directions and can also convert surface
representations into lexical representations, non-determinism is to be
expected; after all, the French surface form [don] can represent the
lexical forms donne, donnes, and donnent.
Usually the KIMMO automata contain many null transitions, transitions
of the form $ s → s $; the presence of such a transition in an
automaton means that that automaton is not interested in s.
This feature (the Null Hypothesis) is incorporated into the formalism.
KIMMO also features a dictionary -- a set of lexicons, which filter the
possible lexical sequences.
The paper indicates how the dictionary could be incorporated into the
formalism but the result is not given.
Michael Maxwell,
Parsing using linearly ordered phonological rules,
in Computational Phonology -- First Meeting of the ACL Special Interest Group in Computational Phonology,
Waxhaw, NC,
Summer Institute of Linguistics,
July 1994,
pp. 59-70.
This is parsing in the sense of phonological analysis: how to analyze a
word like 'engagingly' into its components: 'engage-ing-ly'.
This is either trivial (above) or impossible (what are the components of
'brought'?) in English and most other Indo-European languages, but a big
thing in many other languages.
For example, the Hebrew word 'efronot' (pencils) is to be analyzed as
'ipar-on-ot' (lead-thing-plural).
The paper discusses phonological rules only; morphological rules (which
tell you that it is 'engage-ing-ly' rather than 'engage-ly-ing') are not
considered.
The phonological rules are formulated as rewrite rules from a more
lexical form to a more surface form; the initial form is the lexically
constructed form ('ipar-on-ot'), the final form is the pronounced
(surface) form ('efronot').
The rules are context-sensitive, but the context is always finite.
They are applied in a fixed order, but some rules may be applied
repeatedly.
Many rules are deleting or neutralizing: they delete segments (like
the 'a' in 'ipar').
The rules work on segments, which are the smallest meaningful units (the
English word 'church' consists of three segments: 'ch', 'ur', and 'ch').
Segments are characterized by a set of attributes, which are variables
of enumeration types, and which can have values like 'voiced',
'guttural', etc.
The task of the algorithm is to reconstruct the lexical form from the
pronounced form.
The basic idea is to run the rules backwards, but arbitrary segments
may have to be undeleted, and many attributes cannot be reconstructed
reliably.
Previous algorithms just recorded all combinations, leading to
exponential growth.
When the algorithm presented here cannot reliably reconstruct an
attribute, the attribute is just uninstantiated (set to 'unknown'); when
it may have to undelete a segment, it inserts the segment marked
'optional'.
This keeps the exponential growth away, but may yield a very
under-defined semi-final lexical form.
This semi-final lexical form is now paired to all actual lexical forms
that will match it (through an algorithm not explained in the paper) and
for each such match, the rules are now applied forwards.
If the result is the surface form we started from, the semi-final
lexical form is accepted as final, otherwise it is rejected.
Note that multiple final lexical forms may be found.
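A schematic rendering, in my own terms, of this generate-and-test cycle;
the toy rule (drop a word-final 'e' before the suffix '-ing'), the
optionality marker and the lexicon are placeholders, not Maxwell's.

    # Run the rule backwards to an underspecified form, match it against
    # the lexicon, then verify each candidate by running the rule forwards.
    import re

    def forward(form):
        """Lexical -> surface: drop 'e' before the morpheme boundary '-ing'."""
        return re.sub(r"e-ing$", "ing", form).replace("-", "")

    def backward(surface):
        """Surface -> underspecified lexical form: the deleted 'e' may or
        may not have been there, so insert it marked optional as '(e)'."""
        if surface.endswith("ing"):
            return surface[:-3] + "(e)-ing"
        return surface

    def matches(underspecified, lexical):
        """Treat '(e)' as an optional 'e' when comparing to a lexical form."""
        pattern = re.escape(underspecified).replace(r"\(e\)", "e?")
        return re.fullmatch(pattern, lexical) is not None

    def analyse(surface, lexicon):
        under = backward(surface)
        return [lex for lex in lexicon
                if matches(under, lex) and forward(lex) == surface]

    if __name__ == "__main__":
        lexicon = ["engage-ing", "sing", "sin-ing"]
        print(analyse("engaging", lexicon))    # ['engage-ing']
        print(analyse("sining", lexicon))      # ['sin-ing']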
The paper is relatively vague, both on linguistic and on
computer-science details.
I am puzzled by the remark in the abstract that "its speed compares
favorably with that of a KIMMO-type parser", whereas the text shows that
it is three times slower.
Christ Dekkers,
Mark-Jan Nederhof,
Janos J. Sarbo,
Coping with ambiguity in decorated parse forests,
in Coping with Linguistic Ambiguity in Typed Feature Formalisms, ICAI '92,
Aug. 1992,
pp. 11-19.
In this paper, the attributes consist of finite sets of enumeration
values, for example { NOUN, VERB } for "miss".
Expanding these properties would generate huge parse forests, so we keep
the original parse forest and set up propagation equations.
Sections ("cells") of the parse forest are isolated somewhat similar to
basic blocks.
Inside these cells, the equations are equalities; between the cells they
are inclusions, somewhat similar to the dataflow equations between basic
blocks.
Additional user information may be needed to achieve uniqueness.
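My guess at the flavour of the propagation, as a sketch; the constraint
semantics (intersection for the equalities, restriction of the included
side for the inclusions) and the example are mine, not the paper's.

    # Attributes are finite sets of enumeration values; equality constraints
    # hold within a cell, inclusion constraints between cells, and everything
    # is iterated to a fixed point.
    def propagate(attrs, equalities, inclusions):
        """attrs: node -> set of values; iterate until nothing changes."""
        changed = True
        while changed:
            changed = False
            for a, b in equalities:            # within a cell: attrs[a] = attrs[b]
                common = attrs[a] & attrs[b]
                if common != attrs[a] or common != attrs[b]:
                    attrs[a], attrs[b] = set(common), set(common)
                    changed = True
            for a, b in inclusions:            # between cells: attrs[a] <= attrs[b]
                restricted = attrs[a] & attrs[b]
                if restricted != attrs[a]:
                    attrs[a] = restricted
                    changed = True
        return attrs

    if __name__ == "__main__":
        attrs = {
            "miss":    {"NOUN", "VERB"},       # the ambiguous word
            "subject": {"NOUN", "PRON"},
            "head":    {"NOUN", "VERB", "ADJ"},
        }
        equalities = [("miss", "subject")]     # same cell: must agree
        inclusions = [("head", "miss")]        # other cell: only restricted
        print(propagate(attrs, equalities, inclusions))
        # all three attributes reduce to {'NOUN'}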
Christopher W. Fraser,
Balachander Krishnamurthy,
Live text,
Software -- Practice and Experience,
vol. 20,
#8,
Aug. 1990,
pp. 851-858.
The idea of live text editing is as follows:
correcting the text in an error message automatically corrects the
original text.
For example, correcting a misspelled word in a spelling checker window
corrects its occurrence(s) in the text.
Implemented in emacs.
(The idea seems to have been superseded by WYSIWYG).
James K. Mullin,
Daniel J. Margoliash,
A tale of three spelling checkers,
Software -- Practice and Experience,
vol. 20,
#6,
June 1990,
pp. 625-630.
English-oriented spelling checker using Bloom filters (for which see
Bloom, CACM 1970, 13(7)422-426) on three sets of English words:
words that accept -ed and -s, words that accept -s, and others.
Input words are looked up in all three sets in all combinations (at most
6).
The 3 * 15 = 45 filter hash functions were obtained by dividing the
word as a variable-length base 256 integer by the 45 highest primes
smaller than the bit vector size of the filter ($ 2 sup 20 $).
Note that this is heuristic testing: an incorrect word may be accepted
but a correct word will not be rejected.
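A sketch of the hashing scheme as I understand it: the word, read as a
variable-length base-256 integer, is reduced modulo primes just below
$ 2 sup 20 $; one filter with 15 hash functions is shown, and the word
list is invented.

    # One Bloom filter with 15 hash functions; the paper uses three such
    # filters (45 primes in all), one per word class.
    FILTER_SIZE = 2 ** 20
    NUM_HASHES = 15

    def is_prime(n):
        if n < 2 or n % 2 == 0:
            return n == 2
        d = 3
        while d * d <= n:
            if n % d == 0:
                return False
            d += 2
        return True

    def largest_primes_below(limit, count):
        primes, n = [], limit - 1
        while len(primes) < count:
            if is_prime(n):
                primes.append(n)
            n -= 1
        return primes

    PRIMES = largest_primes_below(FILTER_SIZE, NUM_HASHES)

    def hashes(word):
        value = int.from_bytes(word.encode("ascii"), "big")   # base-256 integer
        return [value % p for p in PRIMES]

    class BloomFilter:
        def __init__(self):
            self.bits = bytearray(FILTER_SIZE // 8)
        def add(self, word):
            for h in hashes(word):
                self.bits[h // 8] |= 1 << (h % 8)
        def __contains__(self, word):
            return all(self.bits[h // 8] & (1 << (h % 8)) for h in hashes(word))

    if __name__ == "__main__":
        ed_s_filter = BloomFilter()            # words that accept -ed and -s
        for w in ("walk", "talk", "jump"):
            ed_s_filter.add(w)
        print("walk" in ed_s_filter)           # True
        print("walked"[:-2] in ed_s_filter)    # True: strip -ed, look up base
        print("xyzzy" in ed_s_filter)          # almost certainly False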
Corrections are suggested by creating all single-error substitutions,
interchanges, insertion and deletion variants of the word and throwing
them at the Bloom filter.
The incorrect-suggestion rate was found to be too high, and trigrams
were used to filter out obvious nonsense words.
The PC version showed too long intervals between consecutive
interactive error reports.
The process was rearranged to first find all errors, then show them to
the user to have them corrected, and then incorporate the corrections.
The total time was still the same, but user interaction was much more
localized in time.
Evan L. Antworth,
PC-KIMMO -- A two-level processor for morphological analysis,
Occasional Publications in Computing 16,
Summer Inst. of Ling.,
Dallas, Tx.,
1990,
WWW???
J. Carroll,
J. Abaitua,
A morphological parser for Basque verbs' inflection,
in Perspectives in Artificial Intelligence, Volume II: Machine Translation,
ed. by J.A. Campbell & J. Cuena,
Ellis Horwood,
Chichester,
1989,
pp. 77-85.
A generator for the forms of the Basque verb (of which there may be 2000
per verb!) is written in Prolog.
The first stage generates all combinations of morphemes orthogonally.
This set is filtered through "feature co-occurrence restrictions",
which specify interrelational restrictions between the morphemes.
A third stage checks the combined morphemes for compatibility with the
stem of the verb.
Prolog resolution is then applied to use this generator as an analyser.
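The program itself is in Prolog; a sketch of the same generate-and-filter
pipeline in Python, with invented morphemes and restrictions, might look
as follows.

    # Stage 1: generate all morpheme combinations; stage 2: filter them by
    # feature co-occurrence restrictions; stage 3: check them against the
    # stem; analysis is then done by searching the same generator backwards
    # (the paper uses Prolog resolution for that).  All data is invented.
    from itertools import product

    SLOTS = {                                  # hypothetical morpheme slots
        "absolutive": ["n-", "z-", "d-"],
        "root":       ["-akar-", "-abil-"],
        "ergative":   ["-t", "-zu", ""],
    }

    def generate():
        for combo in product(*SLOTS.values()):
            yield dict(zip(SLOTS.keys(), combo))

    def cooccurrence_ok(c):
        # invented restriction: root '-abil-' takes no ergative marker
        return not (c["root"] == "-abil-" and c["ergative"] != "")

    def stem_ok(c, stem):
        return stem in c["root"]

    def realize(c):
        return "".join(c.values()).replace("-", "")

    def forms(stem):
        return sorted(realize(c) for c in generate()
                      if cooccurrence_ok(c) and stem_ok(c, stem))

    def analyse(form, stem):
        return [c for c in generate()
                if cooccurrence_ok(c) and stem_ok(c, stem) and realize(c) == form]

    if __name__ == "__main__":
        print(forms("akar"))
        print(analyse("dakart", "akar"))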
J.J. Schoorl,
De computer als vertaler,
(in Dutch: The computer as a translator),
Boom,
Meppel,
1986,
pp. 238.
The first half of the book is a well-argued analysis of the
possibilities and difficulties in automatic natural-language
translation.
The second half treats a number of commercial and research systems.
Geoffrey K. Pullum,
On two recent attempts to show that English is not a CFL,
Computational Linguistics,
vol. 10,
#3-4,
1984,
pp. 182-186.
Counters two papers, one by Postal and Langendoen (1984) and one by
Higginbotham (1984), on the grounds that the non-grammatical sentences
excluded by these papers are not actually non-grammatical.
In a postscript the author acknowledges that new evidence by Shieber on
Swiss German (1985) and by Culy on Bambara (????) may convince him that
at least some languages are not CF; so why not English?
Joan Bresnan,
Ronald M. Kaplan,
Stanley Peters,
Annie Zaenen,
Cross-serial dependencies in Dutch,
Linguistic Inquiry,
vol. 13,
#4,
1982,
pp. 613-635.
Cross-serial dependency is exhibited by such Dutch sentences as
"... dat Jan Piet Marie de kinderen zag helpen leren zwemmen."
("... that Jan saw Piet help Marie teach the children to swim").
Such sentences do not prove that Dutch cannot be weakly context-free
(since the surface structure is
$ a sup n b sup n $) but they do prove that Dutch is not strongly
context-free, since the structural description is roughly
$ omega omega $ where $ omega = SIGMA sup * $.
In fact the paper supplies a much more precise characterization of the
parse trees of cross-serially dependent sentences.
The structure can be described rather naturally using
lexical-functional grammars (Kaplan and Bresnan [NatLang, 1982]), as in
dat [Jan [Marie [de oppasser [de olifanten [zag helpen voeren]]]]]
("... that Jan saw Marie help the keeper feed the elephants").
Kenneth Church,
On parsing strategies and closure,
in 18th Annual Meeting of the Association for Computational Linguistics,
1980,
pp. 107-111.
Extensive arguments why finite-state processing suffices for natural
languages, mainly based on the idea that deep nesting exceeds people's
abilities.
Note DG:
Some of his examples of unacceptable complexity, all of which are in English,
are OK in, for example, Turkish; but there they are regular, not nesting.
Yorick Wilks,
An intelligent analyzer and understander of English,
Commun. ACM,
vol. 18,
#5,
May 1975,
pp. 264-274.
Produces incredibly impressive sample translations, using lots of
semantic information.
If this system was already so good in 1975, why haven't we heard of it
since?