# Morphological segmentation as structured prediction

In the previous section, we used BPE to segment words into morpheme-like units. BPE makes segmentation decisions based solely on the frequency of character pairs, without considering the broader context in which those characters appear. An alternative is to treat segmentation as a *structured prediction* problem, in which the model predicts a sequence of labels (one per character) and can condition each label on features of the surrounding characters.
The idea is straightforward. Given a word like unhappiness, represented as a sequence of characters \(\langle \text{u}, \text{n}, \text{h}, \text{a}, \text{p}, \text{p}, \text{i}, \text{n}, \text{e}, \text{s}, \text{s} \rangle\), we want to predict for each character whether it is the beginning of a morpheme, inside a morpheme, or (less commonly) outside any morpheme. This is known as BIO tagging, after the labels B (beginning), I (inside), and O (outside).
For unhappiness with morphemes un, happi, ness, the BIO tags would be:
| u | n | h | a | p | p | i | n | e | s | s |
|---|---|---|---|---|---|---|---|---|---|---|
| B | I | B | I | I | I | I | B | I | I | I |
The first character of each morpheme gets a B tag, and the remaining characters get I tags. Converting back from tags to a segmentation is trivial: every B tag marks the start of a new morpheme.
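Converting tags back to morphemes can be sketched in a few lines. The helper name `bio_to_morphemes` below is our own, not part of the code used later in this section:

```python
def bio_to_morphemes(word: str, tags: list[str]) -> list[str]:
    """Recover morpheme segments from per-character BIO tags.

    Every B starts a new segment; I extends the current one; O (rare
    in this setting) is skipped.
    """
    morphemes: list[str] = []
    for char, tag in zip(word, tags):
        if tag == 'B':
            morphemes.append(char)        # start a new morpheme
        elif tag == 'I' and morphemes:
            morphemes[-1] += char         # extend the current morpheme
    return morphemes

tags = ['B', 'I', 'B', 'I', 'I', 'I', 'I', 'B', 'I', 'I', 'I']
print(bio_to_morphemes('unhappiness', tags))  # ['un', 'happi', 'ness']
```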
This formulation turns segmentation into a sequence labeling problem, for which there are well-studied models. We’ll look at two: hidden Markov models and conditional random fields.
## Hidden Markov Models for segmentation

We encountered hidden Markov models in the section on language models beyond \(N\)-grams, and saw in the section on weighted automata and HMMs that an HMM is exactly a nondeterministic weighted finite-state automaton. Here we'll use them for a different purpose: predicting the BIO tag at each position in a character sequence.

The idea is to model the joint distribution of the character sequence \(\mathbf{x} = x_1 x_2 \ldots x_n\) and the tag sequence \(\mathbf{y} = y_1 y_2 \ldots y_n\) as:

\[
p(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^n p(x_i \mid y_i) \cdot p(y_i \mid y_{i-1})
\]
where \(p(x_i \mid y_i)\) is the emission probability (how likely is character \(x_i\) given tag \(y_i\)?) and \(p(y_i \mid y_{i-1})\) is the transition probability (how likely is tag \(y_i\) given the previous tag \(y_{i-1}\)?). Given an observed character sequence \(\mathbf{x}\), we want to find the tag sequence \(\mathbf{y}^*\) that maximizes \(p(\mathbf{y} \mid \mathbf{x})\). This is done using the Viterbi algorithm (Viterbi 1967), which finds the most probable tag sequence in \(O(n \cdot |Y|^2)\) time, where \(|Y|\) is the number of possible tags (here, 2 or 3).
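To make the \(O(n \cdot |Y|^2)\) dynamic program concrete, here is a minimal Viterbi sketch over B/I tags in log space. The transition and emission tables below are toy values chosen for illustration, not trained parameters:

```python
import math

def viterbi(obs: list[str], tags: list[str],
            start: dict, trans: dict, emit: dict) -> list[str]:
    """Most probable tag sequence under an HMM, computed in log space.

    O(n * |Y|^2): each position maximizes over the previous tag.
    """
    # delta[t] = best log probability of any path ending in tag t
    delta = {t: math.log(start[t]) + math.log(emit[t].get(obs[0], 1e-6))
             for t in tags}
    back = []  # one dict of back-pointers per position after the first
    for x in obs[1:]:
        prev = delta
        delta, pointers = {}, {}
        for t in tags:
            best = max(tags, key=lambda s: prev[s] + math.log(trans[s][t]))
            delta[t] = (prev[best] + math.log(trans[best][t])
                        + math.log(emit[t].get(x, 1e-6)))
            pointers[t] = best
        back.append(pointers)
    # Trace back from the best final tag
    path = [max(tags, key=delta.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy parameters (illustrative only, not estimated from data)
start = {'B': 0.9, 'I': 0.1}
trans = {'B': {'B': 0.2, 'I': 0.8}, 'I': {'B': 0.3, 'I': 0.7}}
emit = {'B': {'u': 0.3, 'n': 0.2, 'h': 0.2},
        'I': {'n': 0.3, 'a': 0.2, 'p': 0.3}}
print(viterbi(list('unhap'), ['B', 'I'], start, trans, emit))
# ['B', 'I', 'B', 'I', 'I']
```

Each step keeps, for every tag, the best path ending in that tag; the back-pointers then recover the winning sequence in a single reverse pass.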
We can train this model on the CELEX data, where we know the gold-standard morpheme boundaries and can therefore derive the correct BIO tags for each character.
Train an HMM for morpheme segmentation
```python
import numpy as np
from hmmlearn.hmm import MultinomialHMM

# Assume celex_segmented is a list of (word, morphemes) pairs from CELEX
# We convert each to a character sequence with BIO tags

def word_to_bio(word: str, morphemes: list[str]) -> list[str]:
    """Convert a word and its morphemes to BIO tags.

    Parameters
    ----------
    word : str
        The surface form (unused but kept for interface consistency).
    morphemes : list[str]
        The morpheme segments.

    Returns
    -------
    list[str]
        A BIO tag for each character.
    """
    tags = []
    for morph in morphemes:
        tags.append('B')
        tags.extend(['I'] * (len(morph) - 1))
    return tags

def chars_to_features(word: str) -> list[int]:
    """Convert characters to integer features for the HMM.

    Parameters
    ----------
    word : str
        The word to convert.

    Returns
    -------
    list[int]
        Integer feature for each alphabetic character.
    """
    return [ord(c) - ord('a') for c in word.lower() if c.isalpha()]

# Training would look like:
# tag_map = {'B': 0, 'I': 1}
# hmm = MultinomialHMM(n_components=2, n_iter=100)
# hmm.fit(X_train, lengths_train)
```
The HMM is simple and fast, but it has a limitation that follows directly from its generative structure: the emission probability \(p(x_i \mid y_i)\) depends only on the tag at position \(i\), not on the surrounding characters. This means the model cannot learn that, say, the character n is more likely to start a new morpheme when preceded by u (as in un-) than when preceded by i (as in the middle of happiness). The context is carried entirely through the transition probabilities, which only look at the previous tag.
## Conditional Random Fields

### Generative vs. discriminative models
The HMM is a generative model: it specifies a full joint distribution \(p(\mathbf{x}, \mathbf{y})\) over both the observations (characters) and the labels (tags). To make predictions, we use Bayes’ theorem to compute \(p(\mathbf{y} \mid \mathbf{x}) \propto p(\mathbf{x}, \mathbf{y})\). This means the model must explain how the observations are generated—via the emission probabilities \(p(x_i \mid y_i)\)—even though our actual goal is just to predict the tags given the observations.
A *discriminative* model takes a more direct approach: it models the conditional distribution \(p(\mathbf{y} \mid \mathbf{x})\) directly, without ever specifying how the observations themselves are distributed. This has a concrete advantage. Because the model doesn’t need to “explain” the observations, it can condition on arbitrary features of the input—including features that would be difficult or impossible to model generatively. The tradeoff is that a discriminative model cannot generate new data (it has no model of \(p(\mathbf{x})\)), but for prediction tasks like segmentation, that’s not a loss.¹
### The CRF model
Conditional random fields (CRFs) are the discriminative counterpart of HMMs for sequence labeling (Lafferty et al. 2001). Rather than modeling the joint distribution \(p(\mathbf{x}, \mathbf{y})\), a CRF directly models the conditional distribution \(p(\mathbf{y} \mid \mathbf{x})\)—the probability of a tag sequence given the entire observed character sequence.
The model uses a log-linear (or exponential) parameterization. The idea is to define a set of features that measure properties of a candidate tag sequence in context, assign each feature a learned weight, and convert the weighted sum into a probability via the exponential function. Formally:
\[
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left(\sum_{i=1}^n \sum_k \lambda_k f_k(y_{i-1}, y_i, \mathbf{x}, i)\right)
\]

where \(f_k\) are feature functions that can look at the current tag \(y_i\), the previous tag \(y_{i-1}\), and any property of the entire input sequence \(\mathbf{x}\) at position \(i\). Each \(\lambda_k\) is a learned weight: positive weights encourage tag sequences where the corresponding feature fires, and negative weights discourage them. The normalizing constant \(Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left(\sum_{i=1}^n \sum_k \lambda_k f_k(y'_{i-1}, y'_i, \mathbf{x}, i)\right)\) ensures the distribution sums to 1 over all possible tag sequences.²
Compare this to the HMM. In the HMM, the emission probability \(p(x_i \mid y_i)\) depends only on the tag at position \(i\) and the single character at that position. In the CRF, the feature function \(f_k(y_{i-1}, y_i, \mathbf{x}, i)\) can look at the entire input sequence. This is the key payoff of the discriminative approach: features can capture context that the HMM’s generative structure rules out.
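As a toy illustration of the scoring formula (not a trained model: the two feature functions and their weights below are invented for the example), here is the unnormalized log score \(\sum_i \sum_k \lambda_k f_k(y_{i-1}, y_i, \mathbf{x}, i)\) for two candidate tag sequences:

```python
# Two hand-written feature functions f_k(y_prev, y, x, i) -- illustrative only
def f_after_un(y_prev, y, x, i):
    """Fires when a morpheme starts right after the prefix un-."""
    return 1.0 if y == 'B' and x[:i] == 'un' else 0.0

def f_b_then_i(y_prev, y, x, i):
    """Fires on a B -> I transition."""
    return 1.0 if y_prev == 'B' and y == 'I' else 0.0

features = [f_after_un, f_b_then_i]
weights = [2.0, 0.5]   # made-up lambda_k values

def score(tags, x):
    """Unnormalized log score: sum over positions and features."""
    total = 0.0
    for i in range(len(x)):
        y_prev = tags[i - 1] if i > 0 else '<START>'
        for lam, f in zip(weights, features):
            total += lam * f(y_prev, tags[i], x, i)
    return total

x = 'unhappiness'
good = ['B', 'I', 'B', 'I', 'I', 'I', 'I', 'B', 'I', 'I', 'I']
bad = ['B', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I']
print(score(good, x), score(bad, x))  # 3.5 0.5
```

The correct segmentation collects the contextual feature (a B right after *un*) plus three B-to-I transitions, so it outscores the unsegmented alternative. Note that the first feature inspects characters far from the current tag, which is exactly what the HMM's emission model cannot do. Turning these scores into probabilities would additionally require \(Z(\mathbf{x})\), which here sums over all \(2^{11}\) tag sequences.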
### Feature functions
The power of CRFs comes from the feature functions. For morphological segmentation, we might define features like:
- The character at position \(i\)
- The characters at positions \(i - 1\) and \(i + 1\)
- Whether position \(i\) is at the beginning or end of the word
- The character \(n\)-grams surrounding position \(i\) (e.g. the trigram \(x_{i-1} x_i x_{i+1}\))
- Whether the character is a vowel or consonant
These features give the model much more information than the HMM’s emission probabilities, which can only look at the character at the current position.
Character-level feature extraction for CRF
```python
def char_features(word: str, i: int) -> dict[str, str | bool]:
    """Extract features for position i in word.

    Parameters
    ----------
    word : str
        The word to extract features from.
    i : int
        The character position.

    Returns
    -------
    dict[str, str | bool]
        Feature name to value mapping.
    """
    features = {
        'char': word[i],
        'is_vowel': word[i] in 'aeiou',
        'is_first': i == 0,
        'is_last': i == len(word) - 1,
    }
    if i > 0:
        features['prev_char'] = word[i-1]
        features['bigram_left'] = word[i-1] + word[i]
    else:
        features['prev_char'] = '<START>'
    if i < len(word) - 1:
        features['next_char'] = word[i+1]
        features['bigram_right'] = word[i] + word[i+1]
    else:
        features['next_char'] = '<END>'
    if i > 1:
        features['trigram_left'] = word[i-2:i+1]
    if i < len(word) - 2:
        features['trigram_right'] = word[i:i+3]
    return features

# Example
word = 'unhappiness'
for i, c in enumerate(word):
    feats = char_features(word, i)
    print(f"{c}: {feats}")
```
Training a CRF with these features is straightforward using `sklearn-crfsuite`:
Train a CRF for morpheme segmentation
```python
import sklearn_crfsuite

def word_to_features(word: str) -> list[dict[str, str | bool]]:
    """Extract character-level features for every position in a word.

    Parameters
    ----------
    word : str
        The word to extract features from.

    Returns
    -------
    list[dict[str, str | bool]]
        One feature dictionary per character position.
    """
    return [char_features(word, i) for i in range(len(word))]

# Training would look like:
# X_train = [word_to_features(w) for w in train_words]
# y_train = [word_to_bio(w, morphs) for w, morphs in train_data]
# crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
# crf.fit(X_train, y_train)
```
## Comparing HMMs and CRFs
The fundamental difference between the two models is directional. The HMM is generative: it models how characters and tags are jointly generated, and predicts tags by inverting that generative process. The CRF is discriminative: it directly models the distribution of tags given characters, without committing to a story about how the characters were generated.
In practice, the CRF tends to outperform the HMM on segmentation tasks because it can exploit richer features of the input. The HMM’s emission model treats each character independently (given the tag), while the CRF can look at the entire character context. The tradeoff is that CRFs are more expensive to train—the feature space is larger and the normalization constant \(Z(\mathbf{x})\) must be computed for every training example.
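The normalization cost is worth making concrete. A naive computation of \(Z(\mathbf{x})\) sums over \(|Y|^n\) tag sequences, but because the features only link adjacent tags, a forward-style dynamic program brings this down to \(O(n \cdot |Y|^2)\). A minimal sketch, with a made-up local potential `score_step` standing in for \(\sum_k \lambda_k f_k\):

```python
import itertools
import math

TAGS = ['B', 'I']

def score_step(y_prev, y, x, i):
    """Toy local potential: sum_k lambda_k f_k(y_prev, y, x, i)."""
    s = 0.0
    if y == 'B' and (i == 0 or x[i - 1] in 'aeiou'):
        s += 1.0   # illustrative: morphemes tend to start after a vowel
    if y_prev == 'B' and y == 'I':
        s += 0.5
    return s

def log_z_forward(x):
    """log Z(x) via the forward algorithm in O(n * |Y|^2)."""
    alpha = {t: score_step('<START>', t, x, 0) for t in TAGS}
    for i in range(1, len(x)):
        # Sum (in log space) over the previous tag at each step
        alpha = {t: math.log(sum(math.exp(alpha[s] + score_step(s, t, x, i))
                                 for s in TAGS))
                 for t in TAGS}
    return math.log(sum(math.exp(v) for v in alpha.values()))

def log_z_brute(x):
    """log Z(x) by brute-force enumeration of all |Y|^n tag sequences."""
    total = 0.0
    for ys in itertools.product(TAGS, repeat=len(x)):
        s = sum(score_step(ys[i - 1] if i else '<START>', ys[i], x, i)
                for i in range(len(x)))
        total += math.exp(s)
    return math.log(total)

print(abs(log_z_forward('happy') - log_z_brute('happy')) < 1e-9)  # True
```

The two computations agree, but the brute-force version enumerates \(2^n\) sequences while the forward pass touches only \(n \cdot |Y|^2\) cells; the same recursion, run once per training example at every gradient step, is what makes CRF training more expensive than HMM training.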
Both models, however, share a crucial limitation with BPE: they produce a flat segmentation. They tell you where the morpheme boundaries are, but not how the morphemes combine hierarchically. For that, we need to return to the context-free grammars we developed earlier in this module—which is exactly what we’ll do in the next two sections.
## References
Jurafsky, Daniel, and James H. Martin. 2023. *Speech and Language Processing*.

Lafferty, John, Andrew McCallum, and Fernando C. N. Pereira. 2001. “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” In *Proceedings of the 18th International Conference on Machine Learning*, 282–89.

Rabiner, Lawrence R. 1989. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” *Proceedings of the IEEE* 77 (2): 257–86. https://doi.org/10.1109/5.18626.

Viterbi, Andrew. 1967. “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm.” *IEEE Transactions on Information Theory* 13 (2): 260–69. https://doi.org/10.1109/TIT.1967.1054010.
## Footnotes
¹ The distinction between generative and discriminative models runs through much of machine learning. Rabiner (1989) is the classic treatment of generative sequence models (HMMs); Lafferty et al. (2001) introduced CRFs as their discriminative counterpart. See Jurafsky and Martin (2023, Ch. 8) for a textbook discussion of the distinction in the context of sequence labeling.

² The exponential function maps any real-valued score to a positive number, and then \(Z(\mathbf{x})\) normalizes these positive numbers to a probability distribution. This is the same idea as the softmax function used in neural networks, extended to sequences.