Working with CELEX morphological data

To train and evaluate the morphological grammars we’ve been building, we need a dataset that provides gold-standard morphological parses—not just a list of morpheme boundaries but full hierarchical structure showing how morphemes compose. The CELEX lexical database provides exactly this. It’s one of the most widely used resources in computational morphology, and we’ll use it throughout the rest of this module as both a training set (for estimating grammar parameters) and an evaluation set (for checking our models against known parses).

Warning: Data availability

CELEX (LDC96L14) is distributed by the Linguistic Data Consortium and requires an LDC license. If you have access, place the extracted database in a celex/ directory. The code below assumes this directory exists. If it does not, the code blocks on this page will not execute, but I’ve included representative output so you can follow along.
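
If you want the page to fail gracefully when the database is missing, a small guard along the following lines is one option. This is only a sketch: the helper name find_celex_file is hypothetical, and the default path simply mirrors the celex/eml/eml.cd location used in the loading code further down.

```python
from pathlib import Path


def find_celex_file(root="celex", relative="eml/eml.cd"):
    """Return the path to the CELEX file if it exists, otherwise None.

    `root` and `relative` are assumptions that mirror the layout used on
    this page; adjust them to match your own extraction.
    """
    path = Path(root) / relative
    return path if path.is_file() else None


path = find_celex_file()
if path is None:
    print("CELEX not found; skipping the code that needs it.")
else:
    print(f"Using CELEX file at {path}")
```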

What is CELEX?

CELEX is a lexical database for English, Dutch, and German that includes phonological, morphological, syntactic, and frequency information for a large number of word forms. It was compiled in the early 1990s at the Centre for Lexical Information in Nijmegen, and it remains one of the standard resources in psycholinguistics and computational linguistics despite its age.

For our purposes, the relevant field is StrucLab—the morphological structure label—which provides a serialized parse tree for each entry. The notation looks a bit cryptic at first, but it encodes exactly the kind of hierarchical structure that our context-free grammars generate.

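For example, the word unhappiness has the StrucLab representation (((un)[A],(happy)[A])[A],(ness)[N|A.])[N]. Reading from the inside out: the prefix un- combines with the adjective happy to form an adjective (the inner bracketing), which then combines with the suffix -ness to form a noun (the outer bracketing). The category labels in square brackets correspond to what we’ve been calling nonterminals in our CFG formalism. A label such as N|A. marks an affix: the part before the bar is the category it produces (N), and the part after the bar shows what it attaches to (A), with the dot standing for the affix's own position.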
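
To make the bracket notation concrete, here is a minimal standard-library sketch that reads a StrucLab string into nested (category, body) tuples. It is not the parser used on the rest of this page; the function name parse_struclab and the tuple layout are choices made for this illustration, and the input is the unhappiness bracketing written with balanced parentheses.

```python
def parse_struclab(s):
    """Parse a StrucLab string into nested (category, body) tuples.

    A leaf has a morpheme string as its body; an internal node has a
    list of child tuples as its body.
    """
    pos = 0

    def expect(ch):
        nonlocal pos
        if pos >= len(s) or s[pos] != ch:
            raise ValueError(f"expected {ch!r} at position {pos} in {s!r}")
        pos += 1

    def constituent():
        nonlocal pos
        expect('(')
        if s.startswith('(', pos):
            # Internal node: one or more comma-separated daughters.
            body = [constituent()]
            while s.startswith(',', pos):
                pos += 1
                body.append(constituent())
        else:
            # Leaf: the morpheme runs up to the closing parenthesis.
            end = s.find(')', pos)
            if end == -1:
                raise ValueError(f"unclosed '(' in {s!r}")
            body = s[pos:end]
            pos = end
        expect(')')
        expect('[')
        end = s.find(']', pos)
        if end == -1:
            raise ValueError(f"unclosed '[' in {s!r}")
        label = s[pos:end]
        pos = end + 1
        return (label, body)

    tree = constituent()
    if pos != len(s):
        raise ValueError(f"trailing material at position {pos} in {s!r}")
    return tree


tree = parse_struclab("(((un)[A],(happy)[A])[A],(ness)[N|A.])[N]")
print(tree)
# ('N', [('A', [('A', 'un'), ('A', 'happy')]), ('N|A.', 'ness')])
```

The root is a noun with two daughters: an adjective built from un and happy, and the affix ness.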

Parsing CELEX entries

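The ParsedWord class wraps a single CELEX entry: it parses the serialized StrucLab string with a small pyparsing grammar and exposes the resulting tree. Because StrucLab lists morphemes in their citation forms (happy rather than happi), the parse also has to be aligned with the actual surface string, a job for the edlib sequence-alignment library. The listing below covers only the parsing step; the alignment is not shown, and rule extraction is left as a stub.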

Define ParsedWord
import re
import edlib
import pyparsing

import sys
sys.path.insert(0, '_code')
from grammar import Rule


# Parse the CELEX StrucLab format
LPAR = pyparsing.Suppress('(')
RPAR = pyparsing.Suppress(')')
LBRACK = pyparsing.Suppress('[')
RBRACK = pyparsing.Suppress(']')

morpheme = pyparsing.Regex(r'[a-zA-Z]+')
category = LBRACK + pyparsing.Regex(r'[A-Z|.]+') + RBRACK

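# A constituent is a parenthesized body followed by a category label. The body
# is either a bare morpheme or a comma-separated sequence of constituents.
exp = pyparsing.Forward()
constituent = pyparsing.Group(LPAR + (exp | morpheme) + RPAR + category)
# Consume the whole comma-separated sequence. Listing a lone constituent first
# in an alternation would stop after the first daughter, because `|` commits
# to the first alternative that matches.
exp <<= constituent + pyparsing.ZeroOrMore(pyparsing.Suppress(',') + constituent)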


class ParsedWord:
    """A morphologically parsed word from CELEX."""

    def __init__(self, word, struclab):
        self._word = word
        self._struclab = struclab
        self._parse = self._parse_struclab(struclab)

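    def _parse_struclab(self, struclab):
        """Return the root of the parse tree, or None if the label is malformed."""
        try:
            # parseAll=True rejects labels with unparsed trailing material
            # instead of silently returning a partial parse.
            return exp.parseString(struclab, parseAll=True)[0]
        except pyparsing.ParseException:
            return None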

    @property
    def word(self):
        return self._word

    @property
    def parse(self):
        return self._parse

    def extract_rules(self):
        """Extract CFG rules from this parse."""
        if self._parse is None:
            return set()
        return self._extract_rules_recursive(self._parse)

    def _extract_rules_recursive(self, node):
        rules = set()
        if isinstance(node, str):
            return rules
        # Stub: the traversal depends on the parse format and on the Rule
        # constructor in grammar.py, so this currently returns an empty set.
        return rules
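
The stub above leaves the traversal open. As a rough sketch of what it could look like, the function below walks a nested parse and collects one production per node. Two assumptions are baked in: each node is a sequence whose last element is the category label (the shape the pyparsing groups above appear to produce), and rules are plain (lhs, rhs) tuples standing in for the Rule class, whose constructor is not shown on this page.

```python
def extract_rules(node):
    """Return the set of (lhs, rhs) productions found in a parse node.

    Assumes `node` is a sequence whose last element is the category label
    and whose other elements are either a morpheme string or child nodes.
    """
    *children, label = node
    # A child contributes its own label to the right-hand side, unless it
    # is a bare morpheme, in which case it contributes the string itself.
    rhs = tuple(c if isinstance(c, str) else c[-1] for c in children)
    rules = {(label, rhs)}
    for child in children:
        if not isinstance(child, str):
            rules |= extract_rules(child)
    return rules


# Hand-built stand-in for the parse of "unhappiness".
parse = [[['un', 'A'], ['happy', 'A'], 'A'], ['ness', 'N|A.'], 'N']

for lhs, rhs in sorted(extract_rules(parse)):
    print(f"{lhs} -> {' '.join(rhs)}")
```

On this input the function yields five productions: two binary rules (N -> A N|A. and A -> A A) and three lexical rules, one per morpheme.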

Loading the database

import pandas as pd

# Load the English morphological lemma file
celex = pd.read_csv(
    'celex/eml/eml.cd',
    sep='\\',
    header=None,
    names=['IdNum', 'Head', 'MorphStatus', 'MorphCnt', 'StrucLab',
           'MorphOp', 'StrucAllo', 'StrucLabAdj'],
    encoding='latin-1'
)

# Filter to entries with structure labels
celex_parsed = celex[celex.StrucLab.notna()].copy()

print(f"Total entries: {len(celex)}")
print(f"Entries with structure: {len(celex_parsed)}")
celex_parsed[['Head', 'StrucLab']].head(10)

Extracting a grammar

From the parsed entries, we can extract a context-free grammar by walking each parse tree and collecting the production rules.

from grammar import ContextFreeGrammar

all_rules = set()

for _, row in celex_parsed.iterrows():
    pw = ParsedWord(row.Head, row.StrucLab)
    all_rules |= pw.extract_rules()

print(f"Extracted {len(all_rules)} unique rules")
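
The loop above collects rules into a set, which records which rules occur but not how often. If the goal is to estimate rule probabilities, one common option is relative-frequency estimation over rule tokens. The sketch below assumes rules are hashable (lhs, rhs) tuples and uses a handful of made-up rule tokens rather than real CELEX counts; the function name relative_frequencies is hypothetical.

```python
from collections import Counter


def relative_frequencies(rule_tokens):
    """Estimate P(rhs | lhs) as count(lhs -> rhs) / count(lhs)."""
    counts = Counter(rule_tokens)
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Made-up rule tokens for illustration only.
toy_tokens = [
    ("A", ("A", "A")),
    ("A", ("happy",)),
    ("A", ("happy",)),
    ("A", ("sad",)),
    ("N", ("A", "N|A.")),
]

probs = relative_frequencies(toy_tokens)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.2f}")
```

To use this with the loop above, you would accumulate rule tokens in a list or Counter instead of a set, so that repeated rules are not collapsed.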

This grammar—extracted directly from a morphological database—gives us a supervised baseline for morphological analysis. In the following sections, we’ll explore ways to learn such grammars from less annotated data.