Overview
Jurafsky and Martin (2023, Ch. 2.5) on edit distance and string alignment.
In the previous module, I said that we would investigate two complementary ways of dealing with the fact that we don’t always know which strings are in a language. The first was to model this uncertainty probabilistically: we defined language models that assign probabilities to strings, and we showed that the probabilities assigned by \(N\)-gram models trained on a lexicon correlate with human judgments of phonotactic acceptability. Under this view, a language is characterized by a probability distribution over \(\Sigma^*\), and whether a string is “in” the language is a matter of degree—specifically, a matter of how much probability the model assigns to it.
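As a reminder of what that first perspective looks like in practice, here is a minimal sketch of a character-level bigram model scoring a string against a toy lexicon. The three-word lexicon and the `#` boundary symbol are illustrative only, not the setup from the previous module, and the model is unsmoothed, so any unseen bigram drives the probability to zero:

```python
import math
from collections import Counter

def bigram_logprob(word, lexicon):
    """Score a string with an unsmoothed character-bigram model trained
    on a lexicon, padding each word with '#' as a boundary symbol.
    Returns a log probability, or None if any bigram is unseen."""
    bigrams = Counter()
    contexts = Counter()
    for w in lexicon:
        padded = "#" + w + "#"
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    total = 0.0
    padded = "#" + word + "#"
    for a, b in zip(padded, padded[1:]):
        if bigrams[(a, b)] == 0:
            return None  # unseen bigram: probability zero under this model
        total += math.log(bigrams[(a, b)] / contexts[a])
    return total
```

Trained on a toy lexicon like `["black", "brick", "blink"]`, the model assigns a finite log probability to blick (all of its bigrams occur in the lexicon) but none at all to bnick (the bigram bn is unattested), which is the sense in which membership becomes a matter of degree.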
The second perspective, which we develop in this module, starts from a different place. Rather than defining a probability distribution over all of \(\Sigma^*\), we start with a finite set of known strings—a lexicon—and ask: what other strings are similar to the ones we already know? Under this view, a language is not characterized by a grammar or a probability distribution but by a region around a set of known strings in some metric space on \(\Sigma^*\). A novel string is judged as wordlike to the extent that it falls within this region—that is, to the extent that it is close to many strings (or to frequent strings) in the lexicon.
This perspective requires us to develop two things. First, we need a notion of distance between strings: a way of saying that blick is closer to black than bnick is to black. This is what the next two sections develop, starting from the general concept of a metric space and then specializing to strings via edit distance. Second, we need a way of aggregating the distances between a novel string and the entire lexicon into a single score—a model of how the shape of the “neighborhood” around a string in the lexicon determines its acceptability. This is the Generalized Neighborhood Model of Bailey and Hahn (2001), which builds on Nosofsky’s (1986) Generalized Context Model and on the neighborhood-activation work of Luce and Pisoni (1998); we develop it in the final section of this module.
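To make the blick/black/bnick example concrete, here is a sketch of the standard dynamic-programming computation of Levenshtein edit distance, previewing the sections that follow. Unit costs for insertion, deletion, and substitution are assumed:

```python
def levenshtein(s, t):
    """Levenshtein edit distance: the minimum number of insertions,
    deletions, and substitutions needed to turn s into t."""
    m, n = len(s), len(t)
    # d[i][j] holds the distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]
```

On this metric, `levenshtein("blick", "black")` is 1 (a single substitution of i for a), while `levenshtein("bnick", "black")` is 2, which matches the intuition above.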
One consequence of this perspective is that it sidesteps the question of what grammar generates the language: a lexicon and a distance function are sufficient to define the region, without any independent specification of phonotactic constraints. Bailey and Hahn (2001) and Vitevitch and Luce (1999) found evidence that the probabilistic and similarity-based perspectives make at least partially independent contributions to acceptability judgments, and we’ll be able to test this directly at the end of the module, using the same Daland et al. (2011) data we used in the previous module.
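To illustrate the claim that a lexicon and a distance function suffice, here is a minimal similarity-based score: the sum, over all lexical items, of an exponentially decaying function of edit distance. The decay parameter `s`, the function names, and the toy lexicon are all illustrative; this is a deliberate simplification of the model developed at the end of the module, not its actual parameterization (which also weights lexical items by frequency):

```python
import math
from functools import lru_cache

def edit_distance(s, t):
    """Levenshtein distance via memoized recursion (unit costs)."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if s[i - 1] == t[j - 1] else 1
        return min(d(i - 1, j) + 1,        # deletion
                   d(i, j - 1) + 1,        # insertion
                   d(i - 1, j - 1) + cost)  # substitution (or match)
    return d(len(s), len(t))

def neighborhood_score(word, lexicon, s=0.5):
    """Sum an exponentially decaying similarity to every lexical item.
    Larger scores mean the word sits in a denser lexical neighborhood."""
    return sum(math.exp(-edit_distance(word, w) / s) for w in lexicon)
```

Against a toy lexicon like `["black", "brick", "blink", "clock"]`, blick (at distance 1 from three of the four items) receives a higher score than bnick, with no phonotactic constraint stated anywhere.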