# Morphological segmentation as structured prediction

In the previous section, we used BPE to segment words into morpheme-like units. BPE makes segmentation decisions based solely on the frequency of character pairs, without considering the broader context in which those characters appear. An alternative is to treat segmentation as a *structured prediction* problem, in which the model predicts a sequence of labels (one per character) and can condition each label on features of the surrounding characters.
The idea is straightforward. Given a word like unhappiness, represented as a sequence of characters \(\langle \text{u}, \text{n}, \text{h}, \text{a}, \text{p}, \text{p}, \text{i}, \text{n}, \text{e}, \text{s}, \text{s} \rangle\), we want to predict for each character whether it is the beginning of a morpheme, inside a morpheme, or (less commonly) outside any morpheme. This is known as BIO tagging, after the labels B (beginning), I (inside), and O (outside).
For unhappiness with morphemes un, happi, ness, the BIO tags would be:
| u | n | h | a | p | p | i | n | e | s | s |
|---|---|---|---|---|---|---|---|---|---|---|
| B | I | B | I | I | I | I | B | I | I | I |
The first character of each morpheme gets a B tag, and the remaining characters get I tags. Converting back from tags to a segmentation is trivial: every B tag marks the start of a new morpheme.
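Converting tags back to morphemes can be sketched in a few lines. The helper name `bio_to_morphemes` below is our own, not part of the code used later in this section:

```python
def bio_to_morphemes(word: str, tags: list[str]) -> list[str]:
    """Recover morpheme segments from per-character BIO tags.

    Every B starts a new segment; I extends the current one; O (rare
    in this setting) is skipped.
    """
    morphemes: list[str] = []
    for char, tag in zip(word, tags):
        if tag == 'B':
            morphemes.append(char)        # start a new morpheme
        elif tag == 'I' and morphemes:
            morphemes[-1] += char         # extend the current morpheme
    return morphemes

tags = ['B', 'I', 'B', 'I', 'I', 'I', 'I', 'B', 'I', 'I', 'I']
print(bio_to_morphemes('unhappiness', tags))  # ['un', 'happi', 'ness']
```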
This formulation turns segmentation into a sequence labeling problem, for which there are well-studied models. We’ll look at two: hidden Markov models and conditional random fields.
## Hidden Markov Models for segmentation

We encountered hidden Markov models in the section on language models beyond \(N\)-grams, and saw in the section on weighted automata and HMMs that an HMM is exactly a nondeterministic weighted finite-state automaton. Here we'll use them for a different purpose: predicting the BIO tag at each position in a character sequence.

The idea is to model the joint distribution of the character sequence \(\mathbf{x} = x_1 x_2 \ldots x_n\) and the tag sequence \(\mathbf{y} = y_1 y_2 \ldots y_n\) as:

\[
p(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^n p(x_i \mid y_i) \cdot p(y_i \mid y_{i-1})
\]
where \(p(x_i \mid y_i)\) is the emission probability (how likely is character \(x_i\) given tag \(y_i\)?) and \(p(y_i \mid y_{i-1})\) is the transition probability (how likely is tag \(y_i\) given the previous tag \(y_{i-1}\)?). Given an observed character sequence \(\mathbf{x}\), we want to find the tag sequence \(\mathbf{y}^*\) that maximizes \(p(\mathbf{y} \mid \mathbf{x})\). This is done using the Viterbi algorithm (Viterbi 1967), which finds the most probable tag sequence in \(O(n \cdot |Y|^2)\) time, where \(|Y|\) is the number of possible tags (here, 2 or 3).
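To make the \(O(n \cdot |Y|^2)\) dynamic program concrete, here is a minimal Viterbi sketch over B/I tags in log space. The transition and emission tables below are toy values chosen for illustration, not trained parameters:

```python
import math

def viterbi(obs: list[str], tags: list[str],
            start: dict, trans: dict, emit: dict) -> list[str]:
    """Most probable tag sequence under an HMM, computed in log space.

    O(n * |Y|^2): each position maximizes over the previous tag.
    """
    # delta[t] = best log probability of any path ending in tag t
    delta = {t: math.log(start[t]) + math.log(emit[t].get(obs[0], 1e-6))
             for t in tags}
    back = []  # one dict of back-pointers per position after the first
    for x in obs[1:]:
        prev = delta
        delta, pointers = {}, {}
        for t in tags:
            best = max(tags, key=lambda s: prev[s] + math.log(trans[s][t]))
            delta[t] = (prev[best] + math.log(trans[best][t])
                        + math.log(emit[t].get(x, 1e-6)))
            pointers[t] = best
        back.append(pointers)
    # Trace back from the best final tag
    path = [max(tags, key=delta.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy parameters (illustrative only, not estimated from data)
start = {'B': 0.9, 'I': 0.1}
trans = {'B': {'B': 0.2, 'I': 0.8}, 'I': {'B': 0.3, 'I': 0.7}}
emit = {'B': {'u': 0.3, 'n': 0.2, 'h': 0.2},
        'I': {'n': 0.3, 'a': 0.2, 'p': 0.3}}
print(viterbi(list('unhap'), ['B', 'I'], start, trans, emit))
# ['B', 'I', 'B', 'I', 'I']
```

Each step keeps, for every tag, the best path ending in that tag; the back-pointers then recover the winning sequence in a single reverse pass.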
We can train this model on the CELEX data, where we know the gold-standard morpheme boundaries and can therefore derive the correct BIO tags for each character.
Train an HMM for morpheme segmentation
```python
import numpy as np
from hmmlearn.hmm import MultinomialHMM

# Assume celex_segmented is a list of (word, morphemes) pairs from CELEX
# We convert each to a character sequence with BIO tags

def word_to_bio(word: str, morphemes: list[str]) -> list[str]:
    """Convert a word and its morphemes to BIO tags.

    Parameters
    ----------
    word : str
        The surface form (unused but kept for interface consistency).
    morphemes : list[str]
        The morpheme segments.

    Returns
    -------
    list[str]
        A BIO tag for each character.
    """
    tags = []
    for morph in morphemes:
        tags.append('B')
        tags.extend(['I'] * (len(morph) - 1))
    return tags

def chars_to_features(word: str) -> list[int]:
    """Convert characters to integer features for the HMM.

    Parameters
    ----------
    word : str
        The word to convert.

    Returns
    -------
    list[int]
        Integer feature for each alphabetic character.
    """
    return [ord(c) - ord('a') for c in word.lower() if c.isalpha()]

# Training would look like:
# tag_map = {'B': 0, 'I': 1}
# hmm = MultinomialHMM(n_components=2, n_iter=100)
# hmm.fit(X_train, lengths_train)
```
The HMM is simple and fast, but it has a limitation that follows directly from its generative structure: the emission probability \(p(x_i \mid y_i)\) depends only on the tag at position \(i\), not on the surrounding characters. This means the model cannot learn that, say, the character n is more likely to start a new morpheme when preceded by u (as in un-) than when preceded by i (as in the middle of happiness). The context is carried entirely through the transition probabilities, which only look at the previous tag.
## Conditional Random Fields

### Generative vs. discriminative models
The HMM is a generative model: it specifies a full joint distribution \(p(\mathbf{x}, \mathbf{y})\) over both the observations (characters) and the labels (tags). To make predictions, we use Bayes’ theorem to compute \(p(\mathbf{y} \mid \mathbf{x}) \propto p(\mathbf{x}, \mathbf{y})\). This means the model must explain how the observations are generated—via the emission probabilities \(p(x_i \mid y_i)\)—even though our actual goal is just to predict the tags given the observations.
A *discriminative* model takes a more direct approach: it models the conditional distribution \(p(\mathbf{y} \mid \mathbf{x})\) directly, without ever specifying how the observations themselves are distributed. This has a concrete advantage. Because the model doesn’t need to “explain” the observations, it can condition on arbitrary features of the input—including features that would be difficult or impossible to model generatively. The tradeoff is that a discriminative model cannot generate new data (it has no model of \(p(\mathbf{x})\)), but for prediction tasks like segmentation, that’s not a loss.¹
### The CRF model
Conditional random fields (CRFs) are the discriminative counterpart of HMMs for sequence labeling (Lafferty et al. 2001). Rather than modeling the joint distribution \(p(\mathbf{x}, \mathbf{y})\), a CRF directly models the conditional distribution \(p(\mathbf{y} \mid \mathbf{x})\)—the probability of a tag sequence given the entire observed character sequence.
The model uses a log-linear (or exponential) parameterization. The idea is to define a set of features that measure properties of a candidate tag sequence in context, assign each feature a learned weight, and convert the weighted sum into a probability via the exponential function. Formally:
\[
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left(\sum_{i=1}^n \sum_k \lambda_k f_k(y_{i-1}, y_i, \mathbf{x}, i)\right)
\]

where \(f_k\) are feature functions that can look at the current tag \(y_i\), the previous tag \(y_{i-1}\), and any property of the entire input sequence \(\mathbf{x}\) at position \(i\). Each \(\lambda_k\) is a learned weight: positive weights encourage tag sequences where the corresponding feature fires, and negative weights discourage them. The normalizing constant \(Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left(\sum_{i=1}^n \sum_k \lambda_k f_k(y'_{i-1}, y'_i, \mathbf{x}, i)\right)\) ensures the distribution sums to 1 over all possible tag sequences.²
Compare this to the HMM. In the HMM, the emission probability \(p(x_i \mid y_i)\) depends only on the tag at position \(i\) and the single character at that position. In the CRF, the feature function \(f_k(y_{i-1}, y_i, \mathbf{x}, i)\) can look at the entire input sequence. This is the key payoff of the discriminative approach: features can capture context that the HMM’s generative structure rules out.
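As a toy illustration of the scoring formula (not a trained model: the two feature functions and their weights below are invented for the example), here is the unnormalized log score \(\sum_i \sum_k \lambda_k f_k(y_{i-1}, y_i, \mathbf{x}, i)\) for two candidate tag sequences:

```python
# Two hand-written feature functions f_k(y_prev, y, x, i) -- illustrative only
def f_after_un(y_prev, y, x, i):
    """Fires when a morpheme starts right after the prefix un-."""
    return 1.0 if y == 'B' and x[:i] == 'un' else 0.0

def f_b_then_i(y_prev, y, x, i):
    """Fires on a B -> I transition."""
    return 1.0 if y_prev == 'B' and y == 'I' else 0.0

features = [f_after_un, f_b_then_i]
weights = [2.0, 0.5]   # made-up lambda_k values

def score(tags, x):
    """Unnormalized log score: sum over positions and features."""
    total = 0.0
    for i in range(len(x)):
        y_prev = tags[i - 1] if i > 0 else '<START>'
        for lam, f in zip(weights, features):
            total += lam * f(y_prev, tags[i], x, i)
    return total

x = 'unhappiness'
good = ['B', 'I', 'B', 'I', 'I', 'I', 'I', 'B', 'I', 'I', 'I']
bad = ['B', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I', 'I']
print(score(good, x), score(bad, x))  # 3.5 0.5
```

The correct segmentation collects the contextual feature (a B right after *un*) plus three B-to-I transitions, so it outscores the unsegmented alternative. Note that the first feature inspects characters far from the current tag, which is exactly what the HMM's emission model cannot do. Turning these scores into probabilities would additionally require \(Z(\mathbf{x})\), which here sums over all \(2^{11}\) tag sequences.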
### Feature functions
The power of CRFs comes from the feature functions. For morphological segmentation, we might define features like:
- The character at position \(i\)
- The characters at positions \(i - 1\) and \(i + 1\)
- Whether position \(i\) is at the beginning or end of the word
- The character \(n\)-grams surrounding position \(i\) (e.g. the trigram \(x_{i-1} x_i x_{i+1}\))
- Whether the character is a vowel or consonant
These features give the model much more information than the HMM’s emission probabilities, which can only look at the character at the current position.
Character-level feature extraction for CRF
```python
def char_features(word: str, i: int) -> dict[str, str | bool]:
    """Extract features for position i in word.

    Parameters
    ----------
    word : str
        The word to extract features from.
    i : int
        The character position.

    Returns
    -------
    dict[str, str | bool]
        Feature name to value mapping.
    """
    features = {
        'char': word[i],
        'is_vowel': word[i] in 'aeiou',
        'is_first': i == 0,
        'is_last': i == len(word) - 1,
    }
    if i > 0:
        features['prev_char'] = word[i-1]
        features['bigram_left'] = word[i-1] + word[i]
    else:
        features['prev_char'] = '<START>'
    if i < len(word) - 1:
        features['next_char'] = word[i+1]
        features['bigram_right'] = word[i] + word[i+1]
    else:
        features['next_char'] = '<END>'
    if i > 1:
        features['trigram_left'] = word[i-2:i+1]
    if i < len(word) - 2:
        features['trigram_right'] = word[i:i+3]
    return features

# Example
word = 'unhappiness'
for i, c in enumerate(word):
    feats = char_features(word, i)
    print(f"{c}: {feats}")
```
Training a CRF with these features is straightforward using `sklearn-crfsuite`:
Train a CRF for morpheme segmentation
```python
import sklearn_crfsuite

def word_to_features(word: str) -> list[dict[str, str | bool]]:
    """Extract character-level features for every position in a word.

    Parameters
    ----------
    word : str
        The word to extract features from.

    Returns
    -------
    list[dict[str, str | bool]]
        One feature dictionary per character position.
    """
    return [char_features(word, i) for i in range(len(word))]

# Training would look like:
# X_train = [word_to_features(w) for w in train_words]
# y_train = [word_to_bio(w, morphs) for w, morphs in train_data]
# crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
# crf.fit(X_train, y_train)
```
## Comparing HMMs and CRFs
The fundamental difference between the two models is directional. The HMM is generative: it models how characters and tags are jointly generated, and predicts tags by inverting that generative process. The CRF is discriminative: it directly models the distribution of tags given characters, without committing to a story about how the characters were generated.
In practice, the CRF tends to outperform the HMM on segmentation tasks because it can exploit richer features of the input. The HMM’s emission model treats each character independently (given the tag), while the CRF can look at the entire character context. The tradeoff is that CRFs are more expensive to train—the feature space is larger and the normalization constant \(Z(\mathbf{x})\) must be computed for every training example.
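The normalization cost is worth making concrete. A naive computation of \(Z(\mathbf{x})\) sums over \(|Y|^n\) tag sequences, but because the features only link adjacent tags, a forward-style dynamic program brings this down to \(O(n \cdot |Y|^2)\). A minimal sketch, with a made-up local potential `score_step` standing in for \(\sum_k \lambda_k f_k\):

```python
import itertools
import math

TAGS = ['B', 'I']

def score_step(y_prev, y, x, i):
    """Toy local potential: sum_k lambda_k f_k(y_prev, y, x, i)."""
    s = 0.0
    if y == 'B' and (i == 0 or x[i - 1] in 'aeiou'):
        s += 1.0   # illustrative: morphemes tend to start after a vowel
    if y_prev == 'B' and y == 'I':
        s += 0.5
    return s

def log_z_forward(x):
    """log Z(x) via the forward algorithm in O(n * |Y|^2)."""
    alpha = {t: score_step('<START>', t, x, 0) for t in TAGS}
    for i in range(1, len(x)):
        # Sum (in log space) over the previous tag at each step
        alpha = {t: math.log(sum(math.exp(alpha[s] + score_step(s, t, x, i))
                                 for s in TAGS))
                 for t in TAGS}
    return math.log(sum(math.exp(v) for v in alpha.values()))

def log_z_brute(x):
    """log Z(x) by brute-force enumeration of all |Y|^n tag sequences."""
    total = 0.0
    for ys in itertools.product(TAGS, repeat=len(x)):
        s = sum(score_step(ys[i - 1] if i else '<START>', ys[i], x, i)
                for i in range(len(x)))
        total += math.exp(s)
    return math.log(total)

print(abs(log_z_forward('happy') - log_z_brute('happy')) < 1e-9)  # True
```

The two computations agree, but the brute-force version enumerates \(2^n\) sequences while the forward pass touches only \(n \cdot |Y|^2\) cells; the same recursion, run once per training example at every gradient step, is what makes CRF training more expensive than HMM training.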
Both models, however, share a crucial limitation with BPE: they produce a flat segmentation. They tell you where the morpheme boundaries are, but not how the morphemes combine hierarchically. For that, we need to return to the context-free grammars we developed earlier in this module—which is exactly what we’ll do in the next two sections.
## References
Jurafsky, Daniel, and James H. Martin. 2023. *Speech and Language Processing*.

Lafferty, John, Andrew McCallum, and Fernando C. N. Pereira. 2001. “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” In *Proceedings of the 18th International Conference on Machine Learning*, 282–89.

Rabiner, Lawrence R. 1989. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” *Proceedings of the IEEE* 77 (2): 257–86. https://doi.org/10.1109/5.18626.

Viterbi, Andrew. 1967. “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm.” *IEEE Transactions on Information Theory* 13 (2): 260–69. https://doi.org/10.1109/TIT.1967.1054010.
## Footnotes
¹ The distinction between generative and discriminative models runs through much of machine learning. Rabiner (1989) is the classic treatment of generative sequence models (HMMs); Lafferty et al. (2001) introduced CRFs as their discriminative counterpart. See Jurafsky and Martin (2023, Ch. 8) for a textbook discussion of the distinction in the context of sequence labeling.

² The exponential function maps any real-valued score to a positive number, and then \(Z(\mathbf{x})\) normalizes these positive numbers to a probability distribution. This is the same idea as the softmax function used in neural networks, extended to sequences.