Before we can parse the morphological structure of a word, we need to know where the morpheme boundaries are. This is the *segmentation* problem: given a word like *unhappiness*, identify that it decomposes into *un*, *happi*, and *ness*.[^1]
In this section, we’ll look at [byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE), which was originally developed for data compression and later adapted for subword tokenization in NLP by Sennrich et al. (2016). Most modern large language models use BPE or a close relative to tokenize their input.
## The BPE algorithm
The intuition behind BPE is that frequently co-occurring characters should be merged into a single unit. If the characters i, n, and g appear together very often in the training data, BPE will eventually merge them into the single token ing—which, not coincidentally, is a productive English suffix. The algorithm doesn’t know anything about morphology; it just tracks co-occurrence frequencies. But because morphemes are, by definition, recurrent meaningful units, frequency-based merging tends to recover morpheme-like chunks.
More precisely, BPE works by iteratively merging the most frequent pair of adjacent symbols. Starting from a vocabulary of individual characters, the algorithm repeatedly:
1. Counts all pairs of adjacent symbols in the training data
2. Merges the most frequent pair into a new symbol
3. Repeats until a target vocabulary size is reached
When applied to a morphological lexicon, the merges tend to recover morpheme-like units: frequent morphemes get merged early, while rare or compositional forms are left as sequences of smaller units.
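The pair-counting step (step 1 above) can be sketched on a toy vocabulary; the words and frequencies below are hypothetical, chosen only to make the counts easy to check by hand:

```{python}
from collections import Counter

# Hypothetical toy vocabulary: each word as a symbol sequence, with a corpus frequency.
vocab = {('l', 'o', 'w'): 5,
         ('l', 'o', 'w', 'e', 'r'): 2,
         ('l', 'o', 'w', 'e', 's', 't'): 2}

# Count all adjacent symbol pairs, weighted by word frequency.
pairs = Counter()
for symbols, freq in vocab.items():
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

# The most frequent pair becomes the next merge.
best = max(pairs, key=pairs.get)
print(best, pairs[best])  # ('l', 'o') and ('o', 'w') tie at 9; max returns the first seen
```

Both *lo* and *ow* occur 9 times (every word contains them), so either is a valid first merge; after a few more iterations the shared stem *low* emerges as a single symbol.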
```{python}
#| code-fold: true
#| code-summary: Learn BPE from a word list
import re
from collections import Counter

def learn_bpe(words, num_merges=1000):
    """Learn BPE merges from a word list."""
    # Initialize vocabulary as character sequences
    vocab = {}
    for word in words:
        chars = ' '.join(list(word)) + ' </w>'
        vocab[chars] = vocab.get(chars, 0) + 1
    merges = []
    for _ in range(num_merges):
        # Count pairs
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for j in range(len(symbols) - 1):
                pairs[(symbols[j], symbols[j+1])] += freq
        if not pairs:
            break
        # Find most frequent pair
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the pair everywhere it occurs, matching only at symbol
        # boundaries (the lookarounds prevent e.g. 'a b' from matching
        # the tail of a longer symbol, as in 'ba b')
        pattern = r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)'
        replacement = ''.join(best)
        new_vocab = {}
        for word, freq in vocab.items():
            new_word = re.sub(pattern, replacement, word)
            new_vocab[new_word] = freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Segment a word using learned BPE merges."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i+1] == b:
                symbols[i:i+2] = [a + b]
            else:
                i += 1
    # Remove end-of-word marker
    if symbols[-1] == '</w>':
        symbols = symbols[:-1]
    elif symbols[-1].endswith('</w>'):
        symbols[-1] = symbols[-1][:-4]
    return symbols
```
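To make the merge-application loop concrete, here is a stripped-down version of the apply step (without the end-of-word marker), run on a hand-written merge list; both the merges and the example word are illustrative, not learned from data:

```{python}
def apply_merges(word, merges):
    """Apply merges to a word, in the order they were learned."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge in place; recheck the same position
            else:
                i += 1
    return symbols

# A hypothetical merge list, in the order a BPE learner might produce it.
merges = [('i', 'n'), ('in', 'g'), ('u', 'n')]
print(apply_merges('unending', merges))  # ['un', 'e', 'n', 'd', 'ing']
```

Note that merge order matters: *ing* can only form because the earlier merge *i*+*n* already produced the symbol *in*.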
## Training on a lexicon
We train BPE on the CELEX lexicon (or any word list) to learn morpheme-like segmentation rules.
```{python}
import pandas as pd

# Load CELEX word list (or use a fallback)
celex = pd.read_csv('celex/eml/eml.cd', sep='\\', header=None,
                    names=['IdNum', 'Head', 'MorphStatus', 'MorphCnt', 'StrucLab',
                           'MorphOp', 'StrucAllo', 'StrucLabAdj'],
                    encoding='latin-1')
words = celex.Head.dropna().unique().tolist()
merges = learn_bpe(words, num_merges=5000)

# Segment some test words
test_words = ['unhappiness', 'unlockable', 'decoratorship', 'depublicize']
for w in test_words:
    segments = apply_bpe(w, merges)
    print(f"{w:20s} → {' + '.join(segments)}")
```
## Evaluation
To evaluate segmentation quality, we can compare the BPE output against the gold-standard morpheme boundaries in CELEX using the Jaccard similarity between the predicted and gold boundary sets.
```{python}
def boundary_jaccard(predicted, gold):
    """Jaccard similarity between two sets of boundary positions."""
    pred_set = set(predicted)
    gold_set = set(gold)
    if not pred_set and not gold_set:
        return 1.0
    return len(pred_set & gold_set) / len(pred_set | gold_set)
```
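Boundary positions can be derived from a segmentation by taking cumulative segment lengths. The `segments_to_boundaries` helper below is introduced here for illustration (it is not part of CELEX or the code above), as is the made-up prediction for *unhappiness*:

```{python}
def segments_to_boundaries(segments):
    """Internal boundary positions implied by a segmentation."""
    positions, offset = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is not an internal boundary
        offset += len(seg)
        positions.add(offset)
    return positions

pred = segments_to_boundaries(['un', 'happin', 'ess'])  # {2, 8}
gold = segments_to_boundaries(['un', 'happi', 'ness'])  # {2, 7}

# One shared boundary (position 2) out of three distinct boundaries: Jaccard = 1/3
print(len(pred & gold) / len(pred | gold))
```

Feeding these two sets to `boundary_jaccard` gives the same value; the special case in that function (two empty sets scoring 1.0) covers monomorphemic words, which have no internal boundaries at all.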
## Limitations
BPE is frequency-based and greedy. It cannot capture hierarchical structure—it produces a flat sequence of subword units, not a parse tree. It also makes no use of linguistic knowledge: it treats morpheme boundaries the same as any other character boundary. For these reasons, BPE is best understood as a practical baseline. In the next section, we’ll look at structured prediction methods that can incorporate richer features, and in the sections after that, we’ll return to the context-free grammars developed earlier in this module to model hierarchical structure directly.
## References
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics*, 1715–25.
[^1]: The alternation between *happy* and *happi* is a regular phonological process—one of the cases where morphology and phonology interact.