Morphological segmentation with BPE

Before we can parse the morphological structure of a word, we need to know where the morpheme boundaries are. This is the segmentation problem: given a word like unhappiness, identify that it decomposes into un, happi, and ness.1

In this section, we’ll look at byte pair encoding (BPE), which was originally developed for data compression and later adapted for subword tokenization in NLP by Sennrich et al. (2016). Most modern large language models use BPE or a close relative to tokenize their input.

The BPE algorithm

The intuition behind BPE is that frequently co-occurring characters should be merged into a single unit. If the characters i, n, and g appear together very often in the training data, BPE will eventually merge them into the single token ing—which, not coincidentally, is a productive English suffix. The algorithm doesn’t know anything about morphology; it just tracks co-occurrence frequencies. But because morphemes are, by definition, recurrent meaningful units, frequency-based merging tends to recover morpheme-like chunks.

More precisely, BPE works by iteratively merging the most frequent pair of adjacent symbols. Starting from a vocabulary of individual characters, the algorithm repeatedly:

  1. Counts all pairs of adjacent symbols in the training data
  2. Merges the most frequent pair into a new symbol
  3. Repeats until a target vocabulary size is reached

When applied to a morphological lexicon, the merges tend to recover morpheme-like units: frequent morphemes get merged early, while rare or compositional forms are left as sequences of smaller units.
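To make step 1 concrete, here is the pair count for a toy vocabulary. The words and their frequencies are made up for illustration; each word is represented as a tuple of symbols, as in the implementation below.

```python
from collections import Counter

# Toy word frequencies (hypothetical, for illustration only).
vocab = {('l', 'o', 'w'): 5,
         ('l', 'o', 'w', 'e', 'r'): 2,
         ('n', 'e', 'w', 'e', 's', 't'): 6,
         ('w', 'i', 'd', 'e', 's', 't'): 3}

# Count every adjacent symbol pair, weighted by word frequency.
pairs = Counter()
for symbols, freq in vocab.items():
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

# ('e', 's') and ('s', 't') each occur 9 times, more than any other pair,
# so one of them would be merged first.
best = max(pairs, key=pairs.get)
```

After this merge, the counts would be recomputed over the new symbol sequences and the process would repeat.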

Learn BPE from a word list
import re
from collections import Counter

def learn_bpe(words, num_merges=1000):
    """Learn BPE merges from a word list."""
    # Initialize vocabulary as character sequences
    vocab = {}
    for word in words:
        chars = ' '.join(list(word)) + ' </w>'
        vocab[chars] = vocab.get(chars, 0) + 1

    merges = []

    for i in range(num_merges):
        # Count pairs
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for j in range(len(symbols) - 1):
                pairs[(symbols[j], symbols[j+1])] += freq

        if not pairs:
            break

        # Find most frequent pair
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Merge
        # Merge the best pair wherever it occurs as two adjacent symbols.
        # The lookarounds restrict the match to whole symbols, so that a
        # pair like ('e', 's') cannot match inside a longer symbol
        # sequence such as 'ne s'.
        new_vocab = {}
        bigram = re.escape(' '.join(best))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        replacement = ''.join(best)

        for word, freq in vocab.items():
            new_word = pattern.sub(replacement, word)
            new_vocab[new_word] = freq

        vocab = new_vocab

    return merges


def apply_bpe(word, merges):
    """Segment a word using learned BPE merges."""
    symbols = list(word) + ['</w>']

    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i+1] == b:
                symbols[i:i+2] = [a + b]
            else:
                i += 1

    # Remove end-of-word marker
    if symbols[-1] == '</w>':
        symbols = symbols[:-1]
    elif symbols[-1].endswith('</w>'):
        symbols[-1] = symbols[-1][:-4]

    return symbols
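As a sanity check on the replay loop, here is the same inner loop traced on the word lowest with a hand-written merge list (the merges are hypothetical, and the end-of-word marker is omitted for brevity):

```python
# Hypothetical merge list, in the order the merges were learned.
merges = [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]

symbols = list('lowest')  # ['l', 'o', 'w', 'e', 's', 't']

# Replay each merge in order, collapsing matching adjacent symbols.
for a, b in merges:
    i = 0
    while i < len(symbols) - 1:
        if symbols[i] == a and symbols[i+1] == b:
            symbols[i:i+2] = [a + b]
        else:
            i += 1

# symbols is now ['low', 'est']
```

Note that the merges must be replayed in the order they were learned: `('es', 't')` can only fire after `('e', 's')` has produced the symbol `es`.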

Training on a lexicon

We train BPE on the CELEX lexicon (or any word list) to learn morpheme-like segmentation rules.

import pandas as pd

# Load CELEX word list (or use a fallback)
# Note: some CELEX .cd files have a variable number of fields per line;
# on_bad_lines='skip' (pandas >= 1.3) drops rows pandas cannot align.
celex = pd.read_csv(
    'celex/eml/eml.cd', sep='\\', header=None,
    names=['IdNum', 'Head', 'MorphStatus', 'MorphCnt', 'StrucLab',
           'MorphOp', 'StrucAllo', 'StrucLabAdj'],
    encoding='latin-1', on_bad_lines='skip'
)

words = celex.Head.dropna().unique().tolist()
merges = learn_bpe(words, num_merges=5000)

# Segment some test words
test_words = ['unhappiness', 'unlockable', 'decoratorship', 'depublicize']

for w in test_words:
    segments = apply_bpe(w, merges)
    print(f"{w:20s}{' + '.join(segments)}")

Evaluation

To evaluate segmentation quality, we can compare the BPE output against the gold-standard morpheme boundaries in CELEX using the Jaccard similarity between the predicted and gold boundary sets.

def boundary_jaccard(predicted, gold):
    """Jaccard similarity between two sets of boundary positions."""
    pred_set = set(predicted)
    gold_set = set(gold)
    if not pred_set and not gold_set:
        return 1.0
    return len(pred_set & gold_set) / len(pred_set | gold_set)
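To use boundary_jaccard, each segmentation must first be converted to a set of boundary positions. A minimal sketch, using a hypothetical helper segments_to_boundaries (cumulative character offsets at internal segment breaks) and an invented gold segmentation for illustration:

```python
def segments_to_boundaries(segments):
    """Internal boundary positions as cumulative character offsets."""
    positions, total = set(), 0
    for seg in segments[:-1]:  # no boundary after the last segment
        total += len(seg)
        positions.add(total)
    return positions

# Predicted vs. (invented) gold segmentation of "unhappiness".
pred = segments_to_boundaries(['un', 'happi', 'ness'])       # {2, 7}
gold = segments_to_boundaries(['un', 'happ', 'i', 'ness'])   # {2, 6, 7}

# Jaccard similarity: |intersection| / |union| = 2/3 here.
jaccard = len(pred & gold) / len(pred | gold)
```

A score of 1.0 means the predicted and gold boundaries coincide exactly; partial credit is given for boundaries that agree even when others are missed or spurious.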

Limitations

BPE is frequency-based and greedy. It cannot capture hierarchical structure—it produces a flat sequence of subword units, not a parse tree. It also makes no use of linguistic knowledge: it treats morpheme boundaries the same as any other character boundary. For these reasons, BPE is best understood as a practical baseline. In the next section, we’ll look at structured prediction methods that can incorporate richer features, and in the sections after that, we’ll return to the context-free grammars developed earlier in this module to model hierarchical structure directly.

References

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1715–25.

Footnotes

  1. The alternation between happy and happi is a regular phonological process—one of the cases where morphology and phonology interact.↩︎