Unsupervised morphological grammar learning

In the previous section on probabilistic context-free grammars, we estimated rule probabilities from the CELEX database—a supervised approach that requires gold-standard morphological parses. But gold-standard parses are expensive to produce and are not available for most languages. This raises a natural question: can we learn a morphological grammar from raw data, without labeled parses?

It can be done, at least in principle, using the expectation-maximization (EM) algorithm (Dempster et al. 1977) applied to probabilistic context-free grammars. The idea is to alternate between two steps: (i) use the current grammar to compute the expected counts of each rule across all possible parses of the training data (the E step); and (ii) use those expected counts to re-estimate the rule probabilities (the M step). This is the same basic strategy as EM for Gaussian mixture models or HMMs, but applied to the more complex structure of context-free grammars.

The inside-outside algorithm

The inside-outside algorithm (Baker 1979; Lari and Young 1990) provides the machinery for the E step. To perform it, we need to compute, for each rule \(A \rightarrow B\; C\) and each span \((i, j)\) in a training string, the expected number of times that rule is used to derive the substring from position \(i\) to position \(j\). This requires two quantities:

  • The inside probability \(\alpha_A(i, j)\): the probability that nonterminal \(A\) generates the substring \(w_i \ldots w_j\) under the current grammar. We already computed this in the previous section—it’s the quantity filled in by the probabilistic CKY algorithm.

  • The outside probability \(\beta_A(i, j)\): the probability of everything except the substring \(w_i \ldots w_j\), given that \(A\) is responsible for that substring. This is the new quantity we need.

The outside probability is defined recursively. The base case is:

\[\beta_S(1, n) = 1\]

That is, the outside probability of the start symbol spanning the entire string is 1 (there is nothing “outside” it). For every other nonterminal \(A \neq S\), \(\beta_A(1, n) = 0\), since only the start symbol can be responsible for the whole string.

For all other cases, the outside probability of \(A\) spanning \((i, j)\) is computed by considering every way that \(A\) could be a child of some parent nonterminal \(B\):

\[\beta_A(i, j) = \sum_{B \rightarrow A\; C} \sum_{k=j+1}^{n} p(B \rightarrow A\; C) \cdot \beta_B(i, k) \cdot \alpha_C(j+1, k) + \sum_{B \rightarrow C\; A} \sum_{k=1}^{i-1} p(B \rightarrow C\; A) \cdot \beta_B(k, j) \cdot \alpha_C(k, i-1)\]

The first sum covers cases where \(A\) is the left child of \(B\), and the second covers cases where \(A\) is the right child. In each case, the outside probability of the parent \(B\) is multiplied by the inside probability of the sibling \(C\).

Once we have both inside and outside probabilities, the expected count of a rule \(A \rightarrow B\; C\) applied at span \((i, j)\) with the split at position \(k\) is:

\[\text{count}(A \rightarrow B\; C, i, k, j) = \frac{\beta_A(i, j) \cdot p(A \rightarrow B\; C) \cdot \alpha_B(i, k) \cdot \alpha_C(k+1, j)}{\alpha_S(1, n)}\]

The denominator \(\alpha_S(1, n)\) is the total probability of the string under the current grammar—the inside probability of the start symbol spanning the whole string. Dividing by it normalizes the expected count so that it represents a posterior expectation.

A worked example

To make this concrete, consider a tiny grammar with start symbol \(W\), nonterminals \(\{W, M\}\), terminals \(\{un, lock, able\}\), and rules:

  • \(W \rightarrow M\; M\): 0.6
  • \(W \rightarrow M\; W\): 0.4
  • \(M \rightarrow \text{un}\): 0.3
  • \(M \rightarrow \text{lock}\): 0.5
  • \(M \rightarrow \text{able}\): 0.2

We want to compute inside and outside probabilities for the string \(\text{un lock able}\) (3 terminals, positions 1–3).

Inside probabilities (bottom-up). Start with the terminals. For each single-position span:

  • \(\alpha_M(1,1) = p(M \rightarrow \text{un}) = 0.3\)
  • \(\alpha_M(2,2) = p(M \rightarrow \text{lock}) = 0.5\)
  • \(\alpha_M(3,3) = p(M \rightarrow \text{able}) = 0.2\)

For length-2 spans, we look for rules \(A \rightarrow B\;C\) where \(B\) covers one position and \(C\) covers the other:

  • \(\alpha_W(1,2) = p(W \rightarrow M\;M) \cdot \alpha_M(1,1) \cdot \alpha_M(2,2) = 0.6 \times 0.3 \times 0.5 = 0.09\)
  • \(\alpha_W(2,3) = p(W \rightarrow M\;M) \cdot \alpha_M(2,2) \cdot \alpha_M(3,3) = 0.6 \times 0.5 \times 0.2 = 0.06\)

For the full span (1,3), there are two possible split points (\(k = 1\) and \(k = 2\)) and two rules:

\[\alpha_W(1,3) = \underbrace{p(W \rightarrow M\;M) \cdot \alpha_M(1,1) \cdot \alpha_M(2,3)}_{\text{split at }k=1} + \underbrace{p(W \rightarrow M\;W) \cdot \alpha_M(1,1) \cdot \alpha_W(2,3)}_{\text{split at }k=1} + \underbrace{p(W \rightarrow M\;M) \cdot \alpha_M(1,2) \cdot \alpha_M(3,3)}_{\text{split at }k=2} + \underbrace{p(W \rightarrow M\;W) \cdot \alpha_M(1,2) \cdot \alpha_W(3,3)}_{\text{split at }k=2}\]

Only the second term survives. The others vanish because \(M\) has no binary rules and so cannot span more than one position (\(\alpha_M(2,3) = \alpha_M(1,2) = 0\)), and \(W\) has no terminal rules and so cannot span a single position (\(\alpha_W(3,3) = 0\)). Hence:

\[\alpha_W(1,3) = p(W \rightarrow M\;W) \cdot \alpha_M(1,1) \cdot \alpha_W(2,3) = 0.4 \times 0.3 \times 0.06 = 0.0072\]

Outside probabilities (top-down). Start with the root:

  • \(\beta_W(1,3) = 1\) (base case)

For the right child at span (2,3) under rule \(W \rightarrow M\;W\) with parent spanning (1,3) and left sibling \(M\) at (1,1):

\[\beta_W(2,3) = p(W \rightarrow M\;W) \cdot \beta_W(1,3) \cdot \alpha_M(1,1) = 0.4 \times 1 \times 0.3 = 0.12\]

This says: the “context” for \(W\) spanning positions 2–3 (i.e., everything outside \(\text{lock able}\)) has probability 0.12 under the current grammar.

Expected counts. The expected count of the rule \(W \rightarrow M\;W\) at span (1,3) with split at \(k=1\) is:

\[\frac{\beta_W(1,3) \cdot p(W \rightarrow M\;W) \cdot \alpha_M(1,1) \cdot \alpha_W(2,3)}{\alpha_W(1,3)} = \frac{1 \times 0.4 \times 0.3 \times 0.06}{0.0072} = 1.0\]

In this tiny example, the expected count is 1 because there is only one parse of the string, and it uses \(W \rightarrow M\;W\) exactly once. With a more ambiguous grammar, the expected counts would be fractional, distributed across the competing parses.
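The arithmetic above can be checked mechanically. Here is a small standalone sketch, using plain dictionaries and 0-indexed spans (matching the implementation later in this section), that recomputes the inside chart, the outside probability \(\beta_W(2,3)\), and the expected count:

```python
# Toy grammar from the worked example.  Spans here are 0-indexed,
# so the 1-indexed span (2, 3) in the text corresponds to (1, 2).
binary = {('W', 'M', 'M'): 0.6, ('W', 'M', 'W'): 0.4}
unary = {('M', 'un'): 0.3, ('M', 'lock'): 0.5, ('M', 'able'): 0.2}
words = ['un', 'lock', 'able']
n = len(words)

# Inside probabilities, bottom-up.
alpha = {}
for i, w in enumerate(words):
    for (A, t), p in unary.items():
        if t == w:
            alpha[(A, i, i)] = p
for span in range(1, n):
    for i in range(n - span):
        j = i + span
        for (A, B, C), p in binary.items():
            for k in range(i, j):
                contrib = (p * alpha.get((B, i, k), 0.0)
                             * alpha.get((C, k + 1, j), 0.0))
                alpha[(A, i, j)] = alpha.get((A, i, j), 0.0) + contrib

# Outside probability of W over positions 2-3: the only context is
# being the right child of the root under W -> M W, with the left
# sibling M covering position 1.
beta_W_23 = binary[('W', 'M', 'W')] * 1.0 * alpha[('M', 0, 0)]

# Expected count of W -> M W at the full span, split after 'un'.
count = (1.0 * binary[('W', 'M', 'W')]
         * alpha[('M', 0, 0)] * alpha[('W', 1, 2)]) / alpha[('W', 0, 2)]

print(alpha[('W', 0, 2)], beta_W_23, count)
```

The printed values match the hand computation up to floating-point rounding: \(\alpha_W(1,3) = 0.0072\), \(\beta_W(2,3) = 0.12\), and an expected count of 1.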

The M step

Given the expected counts from the E step, the M step is straightforward. For each nonterminal \(A\), we re-estimate the probability of each rule \(A \rightarrow \alpha\) as the proportion of expected uses of that rule among all rules with \(A\) on the left-hand side:

\[p(A \rightarrow \alpha) = \frac{\sum_{\text{spans}} \text{count}(A \rightarrow \alpha, \text{span})}{\sum_{\alpha'} \sum_{\text{spans}} \text{count}(A \rightarrow \alpha', \text{span})}\]

This is the maximum likelihood estimate given the expected counts—the same formula we used for supervised estimation, but with fractional (expected) counts rather than hard counts.
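As a tiny illustration, with made-up fractional counts (the rule names follow the toy grammar above; the numbers are hypothetical), the M step is just per-left-hand-side normalization:

```python
# Hypothetical expected counts from an E step: fractional uses of
# each rule, keyed by (left-hand side, right-hand side).
expected = {
    ('W', ('M', 'M')): 1.4,
    ('W', ('M', 'W')): 0.6,
    ('M', ('un',)):    0.9,
    ('M', ('lock',)):  1.8,
    ('M', ('able',)):  0.3,
}

# Re-estimate: normalize counts within each left-hand side.
totals = {}
for (lhs, _), c in expected.items():
    totals[lhs] = totals.get(lhs, 0.0) + c

probs = {rule: c / totals[rule[0]] for rule, c in expected.items()}

print(probs[('W', ('M', 'M'))])   # 1.4 / 2.0
print(probs[('M', ('lock',))])    # 1.8 / 3.0
```

After this step the probabilities for each left-hand side again sum to 1, so the result is a valid PCFG.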

Implementation

The full EM algorithm for PCFGs alternates E and M steps until convergence (measured by the change in total log-likelihood of the training data). Here is an outline of the implementation:

Unsupervised PCFG learning via EM
import numpy as np
from collections import defaultdict

import sys
sys.path.insert(0, '_code')
from grammar import Rule


class UnsupervisedPCFG:
    """A PCFG trained by expectation-maximization.

    Parameters
    ----------
    variables : set[str]
        The nonterminal symbols.
    alphabet : set[str]
        The terminal symbols.
    rules : list[Rule]
        The production rules.
    start_variable : str
        The start symbol.
    max_iter : int, default 50
        Maximum number of EM iterations.
    tol : float, default 1e-4
        Convergence tolerance for log-likelihood improvement.
    """

    def __init__(self, variables: set[str], alphabet: set[str],
                 rules: list[Rule], start_variable: str,
                 max_iter: int = 50, tol: float = 1e-4):
        self._variables = variables
        self._alphabet = alphabet
        self._rules = rules
        self._start = start_variable
        self._max_iter = max_iter
        self._tol = tol

        # Initialize rule probabilities uniformly
        self._probs: dict[tuple, float] = {}
        for var in variables:
            var_rules = [r for r in rules if r.left_side == var]
            for r in var_rules:
                self._probs[r.to_tuple()] = 1.0 / len(var_rules)

    def _inside(self, words: list[str]) -> dict[tuple, float]:
        """Compute inside probabilities for a sequence of terminals.

        Parameters
        ----------
        words : list[str]
            The terminal sequence.

        Returns
        -------
        dict[tuple, float]
            Mapping from (nonterminal, start, end) to probability.
        """
        n = len(words)
        alpha = defaultdict(float)

        # Base case: unary rules
        for i in range(n):
            for r in self._rules:
                if r.is_unary and r.right_side[0] == words[i]:
                    alpha[(r.left_side, i, i)] = self._probs[r.to_tuple()]

        # Fill chart bottom-up
        for span in range(1, n):
            for i in range(n - span):
                j = i + span
                for k in range(i, j):
                    for r in self._rules:
                        if r.is_binary:
                            B, C = r.right_side
                            prob = (self._probs[r.to_tuple()]
                                    * alpha.get((B, i, k), 0)
                                    * alpha.get((C, k+1, j), 0))
                            alpha[(r.left_side, i, j)] += prob

        return alpha

    def _outside(self, words: list[str], alpha: dict[tuple, float]) -> dict[tuple, float]:
        """Compute outside probabilities.

        Parameters
        ----------
        words : list[str]
            The terminal sequence.
        alpha : dict[tuple, float]
            The inside probabilities.

        Returns
        -------
        dict[tuple, float]
            Mapping from (nonterminal, start, end) to outside probability.
        """
        n = len(words)
        beta = defaultdict(float)

        # Base case
        beta[(self._start, 0, n-1)] = 1.0

        # Fill chart top-down
        for span in range(n-1, -1, -1):
            for i in range(n - span):
                j = i + span
                for r in self._rules:
                    if r.is_binary:
                        A = r.left_side
                        B, C = r.right_side

                        # B is left child of A
                        for k in range(j+1, n):
                            beta[(B, i, j)] += (
                                beta.get((A, i, k), 0)
                                * self._probs[r.to_tuple()]
                                * alpha.get((C, j+1, k), 0)
                            )

                        # C is right child of A
                        for k in range(0, i):
                            beta[(C, i, j)] += (
                                beta.get((A, k, j), 0)
                                * self._probs[r.to_tuple()]
                                * alpha.get((B, k, i-1), 0)
                            )

        return beta

    def fit(self, data: list[list[str]]) -> list[float]:
        """Train the grammar on a list of word sequences using EM.

        Parameters
        ----------
        data : list[list[str]]
            Training data as lists of terminal sequences.

        Returns
        -------
        list[float]
            The log-likelihood after each iteration.
        """
        log_likelihoods = []

        for iteration in range(self._max_iter):
            expected_counts = defaultdict(float)
            total_ll = 0.0

            # E step
            for words in data:
                n = len(words)
                alpha = self._inside(words)
                beta = self._outside(words, alpha)

                string_prob = alpha.get((self._start, 0, n-1), 0)
                if string_prob <= 0:
                    continue

                total_ll += np.log(string_prob)

                # Accumulate expected counts
                for r in self._rules:
                    if r.is_unary:
                        for i in range(n):
                            if r.right_side[0] == words[i]:
                                count = (beta.get((r.left_side, i, i), 0)
                                         * self._probs[r.to_tuple()]
                                         / string_prob)
                                expected_counts[r.to_tuple()] += count

                    elif r.is_binary:
                        B, C = r.right_side
                        for i in range(n):
                            for j in range(i, n):
                                for k in range(i, j):
                                    count = (
                                        beta.get((r.left_side, i, j), 0)
                                        * self._probs[r.to_tuple()]
                                        * alpha.get((B, i, k), 0)
                                        * alpha.get((C, k+1, j), 0)
                                        / string_prob
                                    )
                                    expected_counts[r.to_tuple()] += count

            # M step
            for var in self._variables:
                var_rules = [r for r in self._rules if r.left_side == var]
                total = sum(expected_counts[r.to_tuple()] for r in var_rules)
                if total > 0:
                    for r in var_rules:
                        self._probs[r.to_tuple()] = (
                            expected_counts[r.to_tuple()] / total
                        )

            log_likelihoods.append(total_ll)

            if len(log_likelihoods) > 1:
                improvement = log_likelihoods[-1] - log_likelihoods[-2]
                if abs(improvement) < self._tol:
                    break

        return log_likelihoods
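The class above relies on a Rule type imported from _code/grammar, which is not shown in this section. As a point of reference, a minimal stand-in exposing the interface the class assumes (left_side, right_side, is_unary, is_binary, to_tuple) could look like the following sketch; this illustrates the assumed interface, not the module's actual code:

```python
class Rule:
    """Minimal production rule (hypothetical stand-in):
    Rule('W', 'M', 'W') is binary; Rule('M', 'un') is a unary
    terminal-rewriting rule."""

    def __init__(self, left_side, *right_side):
        self.left_side = left_side
        self.right_side = tuple(right_side)

    @property
    def is_unary(self):
        return len(self.right_side) == 1

    @property
    def is_binary(self):
        return len(self.right_side) == 2

    def to_tuple(self):
        # Hashable key used for the probability and count dictionaries.
        return (self.left_side, self.right_side)

r1 = Rule('W', 'M', 'W')
r2 = Rule('M', 'un')
print(r1.is_binary, r2.is_unary, r2.to_tuple())
```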

Segmenting the trial words

To train the unsupervised PCFG, we need segmented words—sequences of morpheme-like units—but we don’t need (and, in the unsupervised setting, don’t have) the hierarchical parse trees that CELEX provides. The question is where the segmentations come from. We could use BPE, but we already have something better: in the structured prediction section, we trained a CRF segmenter on CELEX’s gold-standard morpheme boundaries. That segmenter learned to predict B/I tags based on character-level features, and it does a much better job of finding linguistically meaningful boundaries than BPE’s frequency-based merging.

I want to be precise about what information each model has access to, because that’s the whole point of the comparison we’re building toward. The supervised PCFG (from the previous section) has access to the full CELEX parse trees—both the segmentation and the hierarchical structure. The unsupervised PCFG we’re training here has access to segmentations from the CRF (which was trained on CELEX boundaries) but not the tree structure. The EM algorithm has to discover the tree structure on its own. So the gap between the two models tells us how much of what makes a morphological structure well-formed is captured by the segmentation alone versus how much depends on knowing which morphemes compose with which, and in what order.

import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn_crfsuite
import edlib

# --- CRF segmenter (from the structured prediction section) ---

def word_to_bio(word, morphemes):
    """Convert a word and its morphemes to BIO tags."""
    tags = []
    for morph in morphemes:
        tags.append('B')
        tags.extend(['I'] * (len(morph) - 1))
    return tags

def char_features(word, i):
    """Extract character-level features for position i."""
    features = {
        'char': word[i],
        'is_vowel': word[i] in 'aeiou',
        'is_first': i == 0,
        'is_last': i == len(word) - 1,
        'position': i / len(word),
    }

    # Character context window of ±3
    for offset in range(-3, 4):
        j = i + offset
        if 0 <= j < len(word):
            features[f'char[{offset}]'] = word[j]
        else:
            features[f'char[{offset}]'] = '<PAD>'

    # Bigrams and trigrams centered on this position
    if i > 0:
        features['bigram[-1,0]'] = word[i-1:i+1]
    if i < len(word) - 1:
        features['bigram[0,+1]'] = word[i:i+2]
    if i > 0 and i < len(word) - 1:
        features['trigram'] = word[i-1:i+2]
    if i > 1:
        features['trigram_left'] = word[i-2:i+1]
    if i < len(word) - 2:
        features['trigram_right'] = word[i:i+3]

    # Suffix of remaining string and prefix of consumed string,
    # at multiple lengths—these help the CRF learn that common
    # suffixes like -tion, -ness, -able start at boundaries
    remaining = word[i:]
    for k in [2, 3, 4, 5]:
        if len(remaining) >= k:
            features[f'suffix_{k}'] = remaining[:k]

    consumed = word[:i]
    for k in [2, 3, 4]:
        if len(consumed) >= k:
            features[f'prefix_{k}'] = consumed[-k:]

    return features

def word_to_features(word):
    return [char_features(word, i) for i in range(len(word))]

def tags_to_segments(word, tags):
    """Convert BIO tags back to a list of morpheme strings."""
    segments = []
    current = ''
    for ch, tag in zip(word, tags):
        if tag == 'B' and current:
            segments.append(current)
            current = ch
        else:
            current += ch
    if current:
        segments.append(current)
    return segments

def align_morphemes_to_surface(word, morphemes):
    """Find morpheme boundary positions in the surface word.

    The underlying morphemes (from StrucLab) may not concatenate
    to the surface form due to allomorphic alternations—e.g.
    'happy' surfaces as 'happi' before '-ness'.  We use edit
    distance alignment to map each morpheme's start position in
    the underlying string to a position in the surface string.

    Parameters
    ----------
    word : str
        The surface form.
    morphemes : list[str]
        The underlying morpheme strings.

    Returns
    -------
    list[int] | None
        Boundary character indices (0-indexed), or None on failure.
    """
    underlying = ''.join(morphemes)

    if underlying == word:
        boundaries, pos = [], 0
        for m in morphemes:
            boundaries.append(pos)
            pos += len(m)
        return boundaries

    result = edlib.align(underlying, word, mode='NW', task='path')
    cigar = result['cigar']

    # Map underlying positions to surface positions via CIGAR
    u_pos, s_pos = 0, 0
    u_to_s = {}

    for length, op in re.findall(r'(\d+)([=XDIS])', cigar):
        for _ in range(int(length)):
            if op in ('=', 'X'):
                u_to_s[u_pos] = s_pos
                u_pos += 1
                s_pos += 1
            elif op == 'D':
                u_pos += 1
            elif op == 'I':
                s_pos += 1

    boundaries, u_boundary = [], 0
    for m in morphemes:
        if u_boundary in u_to_s:
            boundaries.append(u_to_s[u_boundary])
        u_boundary += len(m)

    return boundaries if len(boundaries) == len(morphemes) else None
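Before training, it helps to sanity-check the round trip from boundary positions to B/I tags and back to morpheme strings. The snippet below is self-contained: it inlines the same tagging scheme used in the training loop and the same logic as tags_to_segments, with illustrative boundaries for 'unlockable':

```python
word = 'unlockable'
boundaries = [0, 2, 6]   # illustrative output of the alignment step

# Boundary positions -> B/I tags (same scheme as the training loop).
tags = ['I'] * len(word)
for b in boundaries:
    tags[b] = 'B'

# B/I tags -> morpheme strings (same logic as tags_to_segments).
segments, current = [], ''
for ch, tag in zip(word, tags):
    if tag == 'B' and current:
        segments.append(current)
        current = ch
    else:
        current += ch
if current:
    segments.append(current)

print(tags)
print(segments)   # ['un', 'lock', 'able']
```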

Now we train the CRF on CELEX’s gold-standard segmentations and use it to segment the trial words from Oseki and Marantz (2020). The StrucLab field gives underlying morpheme forms (happy, beauty), which don’t always match the surface spelling (happi-ness, beauti-ful). We use edit-distance alignment to find where each morpheme’s boundary falls in the actual word, which gives us correct B/I tags even in the presence of allomorphic alternations.

import re
import edlib

# --- Load CELEX and extract gold segmentations ---
#
# The eml.cd file has 25 backslash-separated fields per line.
# Field 1 is the word; field 21 is StrucLab.

celex_words = []
celex_struclabs = []

with open('celex/english/eml/eml.cd', encoding='latin-1') as f:
    for line in f:
        fields = line.strip().split('\\')
        if len(fields) >= 22:
            celex_words.append(fields[1])
            celex_struclabs.append(fields[21])

celex = pd.DataFrame({'Head': celex_words, 'StrucLab': celex_struclabs})
print(f"Loaded {len(celex)} CELEX entries")

def extract_morphemes(struclab):
    """Pull leaf morpheme strings out of a CELEX StrucLab field.

    The StrucLab notation nests constituents in parentheses with
    category labels in brackets: ((un)[A],(happy)[A])[A].  We
    extract only the leaves—the innermost (morpheme)[Category]
    pairs—where the morpheme contains only alphabetic characters.
    """
    if not struclab or struclab == 'N':
        return None
    morphs = re.findall(r'\(([a-zA-Z]+)\)\[[^\]]+\]', struclab)
    if morphs:
        return [m.lower() for m in morphs]
    return None

celex['morphemes'] = celex.StrucLab.apply(extract_morphemes)
celex_seg = celex[celex.morphemes.notna()].copy()

# Build training data for the CRF by aligning underlying
# morphemes to surface forms
X_train, y_train = [], []

for _, row in celex_seg.iterrows():
    word = row.Head.lower()
    morphemes = row.morphemes
    boundaries = align_morphemes_to_surface(word, morphemes)

    if boundaries is None:
        continue

    # Convert boundary positions to BIO tags
    tags = ['I'] * len(word)
    for b in boundaries:
        if b < len(word):
            tags[b] = 'B'

    X_train.append(word_to_features(word))
    y_train.append(tags)

print(f"CRF training examples: {len(X_train)}")
# Train the CRF
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=200,
                           c1=0.01, c2=0.01)
crf.fit(X_train, y_train)

# --- Segment the trial words ---

trials = pd.read_csv('data/trials.csv')
trial_words = trials.word.unique().tolist()

segmented = {}

for w in trial_words:
    features = word_to_features(w)
    tags = crf.predict_single(features)
    segmented[w] = tags_to_segments(w, tags)

for w in list(segmented)[:10]:
    print(f"{w:25s}{' + '.join(segmented[w])}")

The CRF produces segmentations that look like real morphological decompositions—prefixes, stems, and suffixes—because it was trained on CELEX boundaries. But it doesn’t know anything about how those morphemes compose hierarchically. That’s what the unsupervised PCFG will learn.

Setting up the grammar

The grammar we give to EM is deliberately uninformative about English morphology. It has a start symbol Word that can expand into a single morpheme or a binary split into Left and Right, each of which can itself split further or terminate as a Morph. Every morpheme in the segmented data gets a unary rule. This is the minimal binary-branching grammar—it doesn’t encode any linguistic knowledge about which morphemes are prefixes, which are suffixes, or which categories combine with which. All of that structure has to be discovered by EM from the distribution of morphemes across words.

# Collect the alphabet from the segmented data
alphabet = {seg for segs in segmented.values() for seg in segs}

variables = {'Word', 'Left', 'Right', 'Morph'}

rules = [
    Rule('Word', 'Left', 'Right'),
    Rule('Left', 'Morph', 'Left'),
    Rule('Right', 'Morph', 'Right'),
    Rule('Left', 'Morph', 'Morph'),
    Rule('Right', 'Morph', 'Morph'),
]

for t in alphabet:
    rules.append(Rule('Morph', t))
    rules.append(Rule('Word', t))
    # Left and Right also need terminal rules so that each can cover
    # a single morpheme; without them, Word -> Left Right could only
    # derive words of four or more morphemes.
    rules.append(Rule('Left', t))
    rules.append(Rule('Right', t))

print(f"Alphabet size: {len(alphabet)}")
print(f"Number of rules: {len(rules)}")
print(f"Number of words to train on: {len(segmented)}")

Training

unsupervised = UnsupervisedPCFG(variables, alphabet, rules, 'Word',
                                max_iter=30, tol=1e-6)
log_likelihoods = unsupervised.fit(list(segmented.values()))

print(f"Converged after {len(log_likelihoods)} iterations")
print(f"Final log-likelihood: {log_likelihoods[-1]:.2f}")

EM guarantees that the log-likelihood increases (or stays the same) at every iteration. Let’s verify that.

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(range(1, len(log_likelihoods) + 1), log_likelihoods,
        marker='o', markersize=4)
ax.set_xlabel('EM iteration')
ax.set_ylabel('Log-likelihood')
fig.tight_layout()
plt.show()

Scoring the trial words

We can now score each trial word under the learned grammar by computing its inside probability \(\alpha_{\text{Word}}(0, n-1)\)—the probability that the start symbol Word generates the full segmented form. Words whose morpheme combinations the grammar has seen frequently (or whose combinations resemble frequent ones) will get higher scores; words with unusual combinations will get lower ones.

def score_words(pcfg, segmented_words):
    """Score each word under a PCFG using the inside algorithm."""
    scores = {}
    for w, segs in segmented_words.items():
        alpha = pcfg._inside(segs)
        n = len(segs)
        prob = alpha.get(('Word', 0, n - 1), 0)
        scores[w] = np.log(prob) if prob > 0 else float('-inf')
    return scores

unsupervised_scores = score_words(unsupervised, segmented)

Comparison with human judgments

The question we’ve been building toward since the morphological well-formedness section is whether a grammar’s scores can predict the gradient acceptability judgments that speakers give to morphologically complex nonwords. We now have the pieces to answer it—and, more interestingly, to measure how much the answer depends on having supervised parse trees versus discovering structure from scratch.

We’ll compare two models against the human data: the supervised PCFG, whose rule probabilities were estimated from the CELEX parse trees (where both the segmentation and the hierarchical structure are given), and the unsupervised PCFG we just trained, which had access only to segmentations (from the CRF) and had to learn the tree structure via EM. If the supervised model does much better, that tells us the hierarchical structure matters and can’t easily be recovered from morpheme sequences alone. If the two are comparable, it suggests that most of the relevant information is already in the segmentation.

Measuring association with rank correlation

In the earlier evaluation of phonotactic models, we used cross-validated linear regression and \(R^2\) to measure predictive accuracy. That approach works well when the dataset is large enough to split into folds and the relationship between predictor and response is plausibly linear. Here the situation is different: the trial set is small, and the log-probabilities from a PCFG can be very unevenly spaced—two words with similar acceptability may receive very different raw scores. We need a measure of association that is robust to monotonic but nonlinear relationships.

Spearman’s rank correlation \(\rho_s\) provides exactly this (Spearman 1904). Recall that we defined the Pearson correlation \(\text{corr}(X, Y) = \text{cov}(X,Y) / \sqrt{\mathbb{V}[X]\mathbb{V}[Y]}\), which measures linear association. Spearman’s correlation is simply the Pearson correlation applied not to the raw values but to their ranks. Given \(n\) paired observations \((x_i, y_i)\), let \(R(x_i)\) be the rank of \(x_i\) among \(\{x_1, \ldots, x_n\}\) (i.e. 1 for the smallest, \(n\) for the largest) and similarly for \(R(y_i)\). Then:

\[\rho_s = \text{corr}\big(R(X), R(Y)\big)\]

Because only the ranks matter, \(\rho_s\) is invariant to any monotonic transformation of either variable. If we replaced log-probabilities with their exponentials, multiplied them by a positive constant, or applied any other order-preserving function, \(\rho_s\) would not change. This is desirable here because we care about whether higher grammar scores correspond to higher human ratings, not about the exact functional form of the relationship.
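The invariance claim is easy to verify numerically. The sketch below computes \(\rho_s\) directly as the Pearson correlation of ranks (numpy only; the simulated values are continuous, so there are no ties to handle) and checks that an order-preserving transform leaves it unchanged:

```python
import numpy as np

def ranks(x):
    # Rank 1 for the smallest value, n for the largest (no ties here).
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the rank vectors.
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

rng = np.random.default_rng(0)
log_probs = rng.normal(size=50)               # stand-in grammar scores
judgments = log_probs + rng.normal(size=50)   # noisy response variable

rho_raw = spearman(log_probs, judgments)
rho_exp = spearman(np.exp(log_probs), judgments)  # exp preserves order

print(rho_raw, rho_exp)
```

Because exponentiation preserves the ordering of the scores, the two printed correlations are identical.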

Interpreting p-values

The code below also reports a p-value alongside each \(\rho_s\). The p-value answers a specific question: if there were truly no monotonic association between grammar scores and human judgments (\(\rho_s = 0\) in the population), how often would we observe a sample correlation at least as extreme as the one we computed? A small p-value (conventionally \(< 0.05\)) means the observed association is unlikely to have arisen by chance under the null hypothesis of no relationship.

It is important to be clear about what the p-value does not tell us. It does not measure the strength or importance of the association—\(\rho_s\) does that. Nor does it tell us the probability that the null hypothesis is true. It only tells us whether we can distinguish the observed effect from zero given our sample size. With a large enough sample, even a tiny and uninteresting correlation will yield a small p-value; with a small sample, a substantial correlation may not reach conventional significance. The ASA’s statement on p-values (Wasserstein and Lazar 2016) discusses these limitations at length.

# --- Supervised PCFG scores ---
# We estimate a supervised PCFG from the CELEX parse trees and
# score the trial words under it.  This requires the parse-tree
# extraction from the PCFG section; we reuse that machinery here.

from collections import Counter as Ctr

# Extract rules from CELEX parses (simplified: count morpheme
# category bigrams as binary rules, morphemes as unary rules)
sup_rule_counts = Ctr()

for _, row in celex_seg.iterrows():
    morphemes = row.morphemes
    if morphemes and len(morphemes) >= 2:
        # Treat as right-branching: Word -> m1 (Word -> m2 (... mn))
        for m in morphemes:
            sup_rule_counts[('Morph', (m,))] += 1
            sup_rule_counts[('Word', (m,))] += 1
        for i in range(len(morphemes) - 1):
            sup_rule_counts[('Word', ('Left', 'Right'))] += 1
            sup_rule_counts[('Left', ('Morph', 'Left'))] += 1
            sup_rule_counts[('Right', ('Morph', 'Right'))] += 1
            sup_rule_counts[('Left', ('Morph', 'Morph'))] += 1
            sup_rule_counts[('Right', ('Morph', 'Morph'))] += 1
    elif morphemes and len(morphemes) == 1:
        sup_rule_counts[('Morph', (morphemes[0],))] += 1
        sup_rule_counts[('Word', (morphemes[0],))] += 1

# Build a supervised PCFG using the same grammar skeleton, with
# probabilities estimated from the CELEX counts.  max_iter=0 because
# we never run EM on this model; we only reuse its chart machinery.
supervised = UnsupervisedPCFG(variables, alphabet, rules, 'Word',
                              max_iter=0)

# Override uniform probabilities with supervised counts
for var in variables:
    var_rules = [r for r in rules if r.left_side == var]
    total = sum(sup_rule_counts.get(r.to_tuple(), 1) for r in var_rules)
    for r in var_rules:
        count = sup_rule_counts.get(r.to_tuple(), 1)  # add-1 smoothing
        supervised._probs[r.to_tuple()] = count / total

supervised_scores = score_words(supervised, segmented)
from scipy.stats import spearmanr

word_means = trials.groupby('word').judgment_z.mean()

# Build a comparison dataframe
score_df = pd.DataFrame({
    'word': trial_words,
    'unsupervised': [unsupervised_scores.get(w, float('-inf'))
                     for w in trial_words],
    'supervised': [supervised_scores.get(w, float('-inf'))
                   for w in trial_words],
})
score_df = score_df.merge(
    word_means.reset_index().rename(columns={'judgment_z': 'human'}),
    on='word'
)

# Drop words neither model could parse
score_df = score_df[
    (score_df.unsupervised > float('-inf')) &
    (score_df.supervised > float('-inf'))
]

rho_unsup, p_unsup = spearmanr(score_df.unsupervised, score_df.human)
rho_sup, p_sup = spearmanr(score_df.supervised, score_df.human)

print(f"Unsupervised PCFG vs. human:  ρ = {rho_unsup:.3f}  (p = {p_unsup:.2e})")
print(f"Supervised PCFG vs. human:    ρ = {rho_sup:.3f}  (p = {p_sup:.2e})")
print(f"Words in comparison: {len(score_df)}")
fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

axes[0].scatter(score_df.unsupervised, score_df.human, alpha=0.4, s=15)
axes[0].set_xlabel('Unsupervised log-probability')
axes[0].set_ylabel('Mean z-scored human judgment')
axes[0].set_title(f'Unsupervised PCFG  (ρ = {rho_unsup:.3f})')

axes[1].scatter(score_df.supervised, score_df.human, alpha=0.4, s=15)
axes[1].set_xlabel('Supervised log-probability')
axes[1].set_title(f'Supervised PCFG  (ρ = {rho_sup:.3f})')

fig.tight_layout()
plt.show()

The supervised model should correlate more strongly with human judgments than the unsupervised one—it has access to the hierarchical parse structure that determines, for instance, that un- attaches to adjectives (producing un-happy, un-lock-able) rather than to nouns or suffixes. The unsupervised model, working only from flat segmentations, has to infer these attachment preferences from the distribution of morphemes across training words, and it’s learning from a relatively small set of trial words rather than the full CELEX lexicon.

The size of the gap between the two correlations is the interesting quantity. A large gap means that hierarchical structure—knowing not just what the morphemes are but how they compose—carries information about well-formedness that can’t be recovered from morpheme sequences alone. A small gap would suggest the opposite: that the segmentation carries most of the signal and the tree structure adds little. My expectation is that the gap will be nontrivial, since many of the distinctions in the Oseki and Marantz (2020) data involve attachment ambiguities (like whether -ize can attach to the output of -tion) that are visible only in the tree.
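Whether an observed gap is more than sampling noise can be probed with a paired bootstrap over words: resample the rows of `score_df` with replacement, recompute both correlations on each resample, and examine the distribution of their difference. A minimal sketch of this procedure, run on synthetic data standing in for the real `score_df` (the helper name `bootstrap_rho_gap` is my own, not part of the analysis above):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def bootstrap_rho_gap(df, n_boot=2000, seed=0):
    """Resample rows (words) with replacement; return the bootstrap
    distribution of rho_supervised - rho_unsupervised."""
    rng = np.random.default_rng(seed)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(len(df), size=len(df))
        sample = df.iloc[idx]
        rho_u, _ = spearmanr(sample.unsupervised, sample.human)
        rho_s, _ = spearmanr(sample.supervised, sample.human)
        gaps[b] = rho_s - rho_u
    return gaps

# Synthetic demonstration: 'supervised' tracks 'human' more closely.
rng = np.random.default_rng(1)
human = rng.normal(size=200)
demo = pd.DataFrame({
    'human': human,
    'supervised': human + rng.normal(scale=0.5, size=200),
    'unsupervised': human + rng.normal(scale=2.0, size=200),
})
gaps = bootstrap_rho_gap(demo, n_boot=500)
lo, hi = np.percentile(gaps, [2.5, 97.5])
print(f"95% bootstrap interval for the gap: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the supervised advantage is unlikely to be an artifact of which words happened to end up in the comparison set. The resampling is paired (both correlations are recomputed on the same resampled words), which is the appropriate design here since the two models are scored on identical items.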

Looking ahead

The unsupervised PCFG is the most ambitious model we’ve built in this module: it takes raw segmented strings and discovers hierarchical morphological structure without any labeled examples. Whether it does this well enough to be practically useful is a separate question, and the answer depends heavily on the quality of the segmentation it receives as input and the expressiveness of the grammar it’s working with.

But the formal tools we’ve developed here—context-free grammars, parsing algorithms, probabilistic estimation via both supervised counting and unsupervised EM—are not specific to morphology. They apply equally well to syntax, where the strings are longer, the grammars are larger, and the ambiguity is more severe. That is where we’ll turn next.

References

Baker, James K. 1979. “Trainable Grammars for Speech Recognition.” The Journal of the Acoustical Society of America 65 (S1): S132. https://doi.org/10.1121/1.382260.
Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. 1977. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 1–38. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.
Fisher, Ronald A. 1925. Statistical Methods for Research Workers. Oliver & Boyd.
Kendall, Maurice G. 1948. Rank Correlation Methods. Charles Griffin & Company.
Lari, Karim, and Steve J. Young. 1990. “The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm.” Computer Speech & Language 4 (1): 35–56. https://doi.org/10.1016/0885-2308(90)90022-X.
Oseki, Yohei, and Alec Marantz. 2020. “Modeling Morphological Well-Formedness.” Proceedings of the Society for Computation in Linguistics 3: 1–10.
Spearman, Charles. 1904. “The Proof and Measurement of Association Between Two Things.” The American Journal of Psychology 15 (1): 72–101. https://doi.org/10.2307/1412159.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on \(p\)-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.

Footnotes

  1. Spearman (1904) introduced rank correlation in the context of measuring the association between psychological traits—another case where the raw measurements are on arbitrary scales and only the ordering is meaningful, much like our grammar scores.↩︎

  2. One could also use Kendall’s \(\tau\), another rank correlation measure with somewhat different statistical properties. In practice the two tend to agree on the ordering of models, though \(\tau\) is typically smaller in magnitude than \(\rho_s\) for the same data. See Kendall (1948) for the original treatment.↩︎

  3. The ritual threshold of \(p < 0.05\) has a long and contested history going back to Fisher (1925). The ASA’s statement on p-values (Wasserstein and Lazar 2016) cautions against treating any fixed threshold as a bright line between “real” and “not real” effects. In our case, the correlation coefficient \(\rho_s\) is the more informative quantity; the p-value serves mainly as a sanity check that we are not reading signal into noise.↩︎