Weighted finite state automata and hidden Markov models
Up to this point, the finite state automata and transducers we’ve worked with have been unweighted: a transition is either present or absent, and a string is either accepted or rejected. But in many applications—including the language models we discussed in the section on uncertainty about languages—we want to assign probabilities (or more general weights) to strings.
To see why this matters, consider the phonotactic constraint we’ve been using as a running example: English doesn’t allow [lb] as a consonant cluster. An unweighted FSA can tell us that [lb] is forbidden—it either accepts or rejects a string. But it can’t tell us that [bl] (as in black) is more common than [bɹ] (as in brown), which is in turn more common than [bw] (as in buoy). All three are accepted; none is forbidden. The unweighted FSA has nothing more to say. A weighted FSA, by contrast, can assign different weights to different transitions, capturing the fact that some licit sequences are more probable than others.
Weighted finite state automata
A weighted finite state automaton (WFSA) is an FSA in which every transition carries a weight, and the weight of a path through the automaton is obtained by combining the weights of its transitions. Formally, a WFSA is a tuple \(\langle Q, \Sigma, \delta, q_0, F, \lambda, \rho \rangle\), where:
- \(Q\), \(\Sigma\), \(q_0\), and \(F\) are as before (states, alphabet, initial state, final states)
- \(\delta: Q \times \Sigma \rightarrow Q\) is the transition function (which may be partial)
- \(\lambda: Q \times \Sigma \times Q \rightarrow \mathbb{R}\) assigns a weight to each transition
- \(\rho: F \rightarrow \mathbb{R}\) assigns a weight to each final state
The weight of a path is obtained by combining the transition weights along it using the semiring's multiplication operation (ordinary multiplication in the probability semiring, addition in the tropical semiring), and then combining in the final-state weight of the state where the path ends. The weight of a string is the weight of its accepting path; in the nondeterministic case, it is the semiring sum of the weights of all accepting paths.
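As a concrete illustration, here is a minimal sketch of a deterministic WFSA in the probability semiring, where a string's weight is the product of its transition weights times the final-state weight. The states, symbols, and weights are invented for illustration; they loosely echo the idea that [bl] is weighted higher than [bɹ].

```python
# A toy deterministic WFSA: delta is the transition function, lam the
# transition-weight function, rho the final-state weights. All values
# here are made up for illustration.

delta = {          # transition function: (state, symbol) -> next state
    (0, "b"): 1,
    (1, "l"): 2,
    (1, "r"): 2,
}
lam = {            # weight function: (state, symbol, next state) -> weight
    (0, "b", 1): 1.0,
    (1, "l", 2): 0.6,
    (1, "r", 2): 0.4,
}
rho = {2: 1.0}     # final-state weights

def string_weight(s, q0=0):
    """Product of the transition weights along s's path, times the final
    weight; 0.0 if the path dies or ends in a non-final state."""
    q, w = q0, 1.0
    for sym in s:
        nxt = delta.get((q, sym))
        if nxt is None:
            return 0.0          # no transition: the string gets weight 0
        w *= lam[(q, sym, nxt)]
        q = nxt
    return w * rho.get(q, 0.0)

print(string_weight("bl"))   # 0.6
print(string_weight("br"))   # 0.4
print(string_weight("bw"))   # 0.0 (no path)
```

Note that rejection falls out of the weighted setting for free: a string with no accepting path simply gets weight zero, the semiring's additive identity.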
When the weights are probabilities, and specifically when the weights on the transitions leaving each state, plus the probability of stopping there if the state is final, sum to 1, we get a probabilistic finite state automaton (PFSA). A PFSA defines a probability distribution over \(\Sigma^*\): the probability of a string is the weight of its accepting path, or zero if no such path exists.
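The normalization condition that turns a WFSA into a PFSA can be checked mechanically. The sketch below uses a hypothetical machine represented as adjacency lists; the names `arcs` and `stop` and all the numbers are invented, chosen so that the condition holds.

```python
# Checking the PFSA normalization condition: at every state, the weights
# on outgoing transitions plus the stopping weight must sum to 1.

arcs = {                 # state -> list of (symbol, next state, weight)
    0: [("a", 0, 0.3), ("b", 1, 0.5)],
    1: [("a", 0, 0.4)],
}
stop = {0: 0.2, 1: 0.6}  # rho: probability of halting at each state

def is_normalized(arcs, stop, tol=1e-9):
    """True iff every state's outgoing mass plus stopping mass is 1."""
    states = set(arcs) | set(stop)
    return all(
        abs(sum(w for _, _, w in arcs.get(q, [])) + stop.get(q, 0.0) - 1.0) < tol
        for q in states
    )

print(is_normalized(arcs, stop))   # True
```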
I’m introducing this here because the connection between weighted automata and the probabilistic models we’ve been working with is tighter than it might appear. In fact, the hidden Markov model—which we first encountered in the section on language models beyond \(N\)-grams—is exactly a weighted finite state automaton, viewed from a particular angle.
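To make the identification concrete, the following sketch computes an HMM's string probability the WFSA way: as a sum, over every hidden-state path, of the product of the weights along that path. The parameters use standard HMM notation not defined in this section (`pi` for the initial distribution, `a` for transitions, `b` for emissions), and all the numbers are invented; a practical implementation would use the forward algorithm rather than this exponential enumeration.

```python
from itertools import product

# Toy HMM parameters (invented for illustration). Note that stop plays
# the role of the WFSA's final-state weights rho, and each state's
# outgoing transition mass plus stopping mass sums to 1.
states = [0, 1]
pi   = {0: 1.0, 1: 0.0}
a    = {(0, 0): 0.6, (0, 1): 0.3,
        (1, 0): 0.4, (1, 1): 0.5}
stop = {0: 0.1, 1: 0.1}
b    = {(0, "x"): 0.8, (0, "y"): 0.2,
        (1, "x"): 0.1, (1, "y"): 0.9}

def brute_force_prob(obs):
    """Probability of obs, computed the WFSA way: a sum over all
    hidden-state paths of the product of the weights along each path.
    Assumes obs is non-empty."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        w, prev = 1.0, None
        for q, o in zip(path, obs):
            w *= (pi[q] if prev is None else a[(prev, q)]) * b[(q, o)]
            prev = q
        total += w * stop[prev]   # multiply in the final-state weight
    return total
```

Each hidden-state path is exactly one accepting path of the corresponding WFSA, and the sum over paths is the semiring sum from the definition above.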
Consequences
Because HMMs are WFSAs, all of the algorithms we’ve developed for finite state automata—composition, intersection, determinization, minimization—have weighted analogues that apply to them. For instance:
- Composing two weighted finite state transducers (WFSTs, the weighted analogue of the FSTs we built earlier) can be used to apply an HMM’s emission model in series with a phonological rule, producing a combined model that jointly captures phonological constraints and probabilistic predictions.
- Determinizing a WFSA (when possible) can speed up inference, since the resulting automaton has a unique path for each input.
- Intersecting a WFSA with an unweighted FSA can be used to restrict the model to strings that satisfy some structural constraint—for instance, constraining an HMM to only produce sequences that obey a particular phonotactic pattern.
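The last of these can be sketched as a product construction, in which the unweighted FSA acts as a weight-preserving filter: states of the result are pairs, the WFSA supplies the weights, and an arc survives only when both machines allow the symbol. Both toy machines below are invented (the FSA accepts only strings of a's, so the WFSA's b-arc is pruned away), and a full implementation would also discard unreachable pair states, which this sketch skips.

```python
# Product construction for WFSA x unweighted-FSA intersection.
# Toy machines, invented for illustration.

wfsa = {(0, "a"): (0, 0.5), (0, "b"): (1, 0.5)}   # (q, sym) -> (q', weight)
wfsa_final = {0: 0.3, 1: 1.0}                     # q -> final weight
fsa = {(0, "a"): 0}                               # unweighted: (s, sym) -> s'
fsa_final = {0}

def intersect(wfsa_arcs, wfsa_final, fsa_arcs, fsa_final):
    """Pair the states; keep an arc only when both machines have one for
    the symbol; the weight comes from the WFSA alone."""
    fsa_states = {s for (s, _) in fsa_arcs} | set(fsa_final)
    arcs = {
        ((q, s), sym): ((q2, fsa_arcs[(s, sym)]), w)
        for (q, sym), (q2, w) in wfsa_arcs.items()
        for s in fsa_states
        if (s, sym) in fsa_arcs
    }
    # A pair state is final iff both components are; the weight is the WFSA's.
    # (Unreachable pair states are not pruned in this sketch.)
    final = {(q, s): w for q, w in wfsa_final.items() for s in fsa_final}
    return arcs, final

arcs, final = intersect(wfsa, wfsa_final, fsa, fsa_final)
print(sorted(arcs))   # only the "a" arc survives; the "b" arc is filtered out
```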
We will make use of this connection in the section on morphological segmentation, where HMMs are used to predict morpheme boundaries. Because the HMM is a WFSA, the segmentation model operates within the same formal framework as the phonological grammars and transducers we’ve been building throughout this module.
More broadly, many of the probabilistic models used in computational linguistics—\(N\)-gram models, HMMs, probabilistic finite state transducers—can be understood as weighted versions of the formal machines we’ve already studied. The algebraic properties of those machines (closure under operations, decidability of various questions) carry over to the probabilistic setting, which gives us formal guarantees about what can and cannot be computed.