```python
# Define Rule and ContextFreeGrammar
import sys
sys.path.insert(0, '_code')
from grammar import Rule, ContextFreeGrammar
```

In the phonological-patterns module, we saw that phonological constraints on stringsets—the kinds of sound sequences a language allows—tend to fall into the lower levels of the subregular hierarchy: Strictly Local, Strictly Piecewise, or Tier-based Strictly Local. These are proper subclasses of the regular languages, and the finite state automata that recognize them have correspondingly restricted structure.
Chandlee (2014) showed that a similar picture holds for morphological maps—the input-output transformations that relate underlying forms to surface forms in morphology. Most morphological processes—prefixation, suffixation, and local infixation—turn out to be subsequential functions, a proper subclass of the regular relations (the relations computable by finite state transducers). In other words, morphological transformations are, like phonological constraints, computationally simple in a precise sense.
But there is an important exception. Total reduplication—the process by which a word \(w\) is mapped to \(ww\), as in Indonesian orang → orang-orang (people)—is not a regular relation. We proved a version of this in the section on the pumping lemma: the language \(\{ww \mid w \in \Sigma^+\}\) is not regular, which means no finite state automaton can recognize it, and no finite state transducer can compute the map \(w \mapsto ww\).
This is one of the motivations for moving beyond finite state methods. In this module, we’ll use two complementary families of models for morphology. For segmentation—finding where the morpheme boundaries are—we’ll use sequence labeling models like HMMs and CRFs, which operate on flat character sequences (just as HMMs operate on phone sequences for phonotactics). For parsing—determining the hierarchical structure of how morphemes compose—we’ll use context-free grammars and their probabilistic extensions. Segmentation feeds into parsing: once we know the morpheme boundaries, we can ask how those morphemes are structured.
Morphological structure exhibits hierarchical nesting that regular grammars cannot capture. Consider the word unlockable: it has two possible structures—\([\text{un}[\text{lock-able}]]\) (not able to be locked) and \([[\text{un-lock}]\text{-able}]\) (able to be unlocked)—and distinguishing between them requires tracking nested constituency, not just sequences of symbols.
The formalism we’ll use for this is the context-free grammar (CFG). A context-free grammar is a 4-tuple \(G = (V, \Sigma, R, S)\) where:

- \(V\) is a finite set of variables (nonterminals),
- \(\Sigma\) is a finite alphabet of terminal symbols, disjoint from \(V\),
- \(R\) is a finite set of rules of the form \(A \rightarrow \alpha\), with \(A \in V\) and \(\alpha \in (V \cup \Sigma)^*\), and
- \(S \in V\) is the start variable.
The key property of context-free grammars is that each rule’s application depends only on the variable being expanded, not on what surrounds it. This is what “context-free” means: the rule \(A \rightarrow \alpha\) applies wherever \(A\) appears, regardless of context.
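To see context-freeness in action: a derivation is nothing more than repeatedly rewriting a variable using one of its rules, ignoring everything around it. The sketch below uses a hypothetical `apply_rule` helper over plain Python lists (not part of the `_code/grammar` module); the rules it applies come from the small morphological grammar defined just below.

```python
def apply_rule(sentential_form, lhs, rhs):
    """Replace the first occurrence of variable `lhs` with `rhs`,
    regardless of what surrounds it: that is what context-free means."""
    i = sentential_form.index(lhs)
    return sentential_form[:i] + list(rhs) + sentential_form[i + 1:]

form = ['Word']
form = apply_rule(form, 'Word', ['Adv'])            # Word -> Adv
form = apply_rule(form, 'Adv', ['Adj', 'Suffix'])   # Adv -> Adj Suffix
form = apply_rule(form, 'Adj', ['Prefix', 'Adj'])   # Adj -> Prefix Adj
form = apply_rule(form, 'Prefix', ['un'])           # Prefix -> un
form = apply_rule(form, 'Adj', ['kind'])            # Adj -> kind
form = apply_rule(form, 'Suffix', ['ly'])           # Suffix -> ly
print(form)   # the morpheme sequence for "unkindly"
```

Note that `Adj` gets rewritten the same way whether it is the whole sentential form or embedded under a prefix; no rule ever inspects its neighbors.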
To make this concrete, here is a simple morphological grammar for a fragment of English derivational morphology:
```python
grammar = ContextFreeGrammar(
    alphabet={'happy', 'un', 'ness', 'kind', 'ly', 'help', 'ful'},
    variables={'Word', 'Adj', 'Adv', 'N', 'Prefix', 'Suffix'},
    rules={
        Rule('Adj', 'happy'),
        Rule('Adj', 'kind'),
        Rule('Adj', 'help', 'ful'),
        Rule('Word', 'Adj'),
        Rule('Word', 'Adv'),
        Rule('Word', 'N'),
        Rule('Adj', 'Prefix', 'Adj'),    # un + kind → unkind
        Rule('N', 'Adj', 'Suffix'),      # kind + ness → kindness
        Rule('Adv', 'Adj', 'Suffix'),    # kind + ly → kindly
        Rule('Prefix', 'un'),
        Rule('Suffix', 'ness'),
        Rule('Suffix', 'ly'),
        Rule('Suffix', 'ful'),
    },
    start_variable='Word'
)

for rule in sorted(grammar.rules(), key=lambda r: str(r)):
    print(rule)
```

```
Adj -> Prefix Adj
Adj -> happy
Adj -> help ful
Adj -> kind
Adv -> Adj Suffix
N -> Adj Suffix
Prefix -> un
Suffix -> ful
Suffix -> ly
Suffix -> ness
Word -> Adj
Word -> Adv
Word -> N
```
This grammar generates words like unkind, kindness, unkindly, and unhappiness (as the morpheme sequence un + happy + ness; the grammar operates on morphemes and ignores spelling changes), each with a hierarchical structure that reflects its morphological composition.
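We can check such claims mechanically. The following is a minimal, self-contained sketch that re-encodes the rules as a plain dict and enumerates every string derivable within a bounded number of rule applications; it deliberately avoids the `Rule`/`ContextFreeGrammar` classes so it runs on its own.

```python
from itertools import product

# The grammar above, as a dict from variable to a list of right-hand sides.
RULES = {
    'Word':   [('Adj',), ('Adv',), ('N',)],
    'Adj':    [('happy',), ('kind',), ('help', 'ful'), ('Prefix', 'Adj')],
    'Adv':    [('Adj', 'Suffix')],
    'N':      [('Adj', 'Suffix')],
    'Prefix': [('un',)],
    'Suffix': [('ness',), ('ly',), ('ful',)],
}

def generate(symbol, depth):
    """Yield terminal strings derivable from `symbol` in at most
    `depth` nested rule applications (terminals yield themselves)."""
    if symbol not in RULES:
        yield symbol
        return
    if depth == 0:
        return
    for rhs in RULES[symbol]:
        # Expand each RHS symbol independently, then concatenate the pieces.
        for parts in product(*(generate(s, depth - 1) for s in rhs)):
            yield ''.join(parts)

words = set(generate('Word', 4))
for w in ['unkind', 'kindness', 'unkindly', 'helpful']:
    print(w, w in words)   # each prints True
```

The depth bound is what keeps the recursive rule Adj → Prefix Adj from looping forever; raising it yields longer derivations like un-un-kind-ly.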
In the following sections, we’ll develop the tools needed to work with these grammars. We’ll start with the formal machinery: operations on CFGs (union, concatenation, Kleene closure), normal forms (especially Chomsky Normal Form, which enables efficient algorithms), and two parsing algorithms—CKY (bottom-up) and Earley (top-down)—for recovering the structure of a word given a grammar. With that machinery in hand, we’ll turn to the empirical question of morphological well-formedness: we’ll look at gradient acceptability judgments for morphologically complex nonwords, explore methods for segmenting words into morphemes, and then train probabilistic context-free grammars—both from labeled data and from scratch—to predict how acceptable a given morphological structure is.
The table below summarizes when each model type is appropriate. The key distinctions are whether the task requires flat (sequential) or hierarchical structure, and whether labeled training data is available.
| Model | Structure | Training data | Best for |
|---|---|---|---|
| \(N\)-gram | Sequential | Unlabeled corpus | Phonotactic well-formedness; quick baseline |
| HMM | Sequential (latent states) | Labeled sequences | Segmentation when features are simple |
| CRF | Sequential (discriminative) | Labeled sequences | Segmentation when rich features matter |
| BPE | Sequential (compression) | Unlabeled corpus | Subword tokenization; morpheme-like units without supervision |
| CFG/PCFG (supervised) | Hierarchical | Labeled parse trees | Parsing when gold-standard trees are available |
| PCFG (unsupervised, EM) | Hierarchical | Segmented strings only | Learning structure when no labeled trees exist |