Probabilistic Context-Free Grammars for Morphological Analysis
Introduction to Probabilistic Context-Free Grammars
So far, we’ve explored context-free grammars (CFGs) as powerful tools for modeling hierarchical structures in language. In this section, we’ll extend our framework to probabilistic context-free grammars (PCFGs), which augment CFGs with probabilities associated with each rule (Baker 1979). This probabilistic extension allows us to:
- Model the likelihood of different grammatical structures
- Disambiguate between multiple possible analyses
- Learn grammars from data in an unsupervised or semi-supervised manner
- Quantify the well-formedness of novel word formations
PCFGs are particularly useful for morphological analysis, where we often want to determine not just whether a word can be derived according to a grammar, but also how likely that derivation is compared to alternatives.
Formal Definition of PCFGs
A probabilistic context-free grammar is a 5-tuple \(G = (V, \Sigma, R, S, P)\) where:
- \(V\) is a finite set of variables (non-terminals)
- \(\Sigma\) is a finite set of terminals (the alphabet)
- \(R\) is a finite set of rules of the form \(A \rightarrow \alpha\) where \(A \in V\) and \(\alpha \in (V \cup \Sigma)^*\)
- \(S \in V\) is the start variable
- \(P\) is a probability function that assigns a probability \(P(A \rightarrow \alpha)\) to each rule, such that for each \(A \in V\): \[\sum_{\alpha : (A \rightarrow \alpha) \in R} P(A \rightarrow \alpha) = 1\]
In other words, the probabilities of all rules with the same left-hand side must sum to 1, forming a conditional probability distribution \(P(\alpha | A)\).
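To make the normalization constraint concrete, here is a small sanity check on a toy morphological rule set (the rules and probabilities are invented for illustration):

```python
import math

# A toy rule set: (left_side, right_side, probability). The probabilities
# here are illustrative, not estimated from data.
rules = [
    ('Word', ('Stem', 'Suffix'), 0.7),
    ('Word', ('Prefix', 'Word'), 0.3),
    ('Stem', ('happy',), 1.0),
    ('Prefix', ('un',), 1.0),
    ('Suffix', ('ness',), 1.0),
]

# Group probabilities by left-hand side ...
totals: dict[str, float] = {}
for lhs, _, p in rules:
    totals[lhs] = totals.get(lhs, 0.0) + p

# ... and check that each group forms a probability distribution
for lhs, total in totals.items():
    assert math.isclose(total, 1.0), f'{lhs} rules sum to {total}'
```

Each left-hand side defines its own conditional distribution, so the two `Word` rules must sum to 1 on their own, independently of the other nonterminals.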
The Probabilistic Rule Class
Let’s extend our existing Rule class to incorporate probabilities. For numerical stability, we’ll work with log probabilities rather than raw probabilities: multiplying many small rule probabilities quickly underflows floating-point arithmetic, while summing their logarithms does not:
class ProbabilisticRule(Rule):
    """A probabilistic context-free grammar rule.

    Parameters
    ----------
    left_side : str
        The left-hand side variable.
    *right_side : str
        The right-hand side sequence of variables and terminals.
    logprob : float | None, default None
        The log probability of the rule.

    Attributes
    ----------
    left_side : str
        The left-hand side variable.
    right_side : tuple[str, ...]
        The right-hand side symbols.
    logprob : float | None
        The log probability of the rule.
    """

    def __init__(self, left_side: str, *right_side: str, logprob: float | None = None):
        super().__init__(left_side, *right_side)
        self._logprob = logprob

    def __repr__(self) -> str:
        """Return a string representation including the log probability."""
        rule = self._left_side + ' -> ' + ' '.join(self._right_side)
        if self._logprob is None:
            return rule
        else:
            return rule + f' [{self._logprob:.4f}]'

    @property
    def logprob(self) -> float | None:
        """The log probability of the rule."""
        return self._logprob

    @logprob.setter
    def logprob(self, value: float) -> None:
        self._logprob = value

The Probabilistic Context-Free Grammar Class
Now, let’s define our ProbabilisticContextFreeGrammar class, extending the basic ContextFreeGrammar:
import numpy as np
from scipy.special import logsumexp

class ProbabilisticContextFreeGrammar:
    """A probabilistic context-free grammar in Chomsky Normal Form.

    Parameters
    ----------
    alphabet : set[str]
        The set of terminal symbols.
    variables : set[str]
        The set of non-terminal symbols.
    rules : set[ProbabilisticRule]
        The set of production rules with associated probabilities.
    start_variable : str
        The start symbol of the grammar.

    Attributes
    ----------
    alphabet : set[str]
        The set of terminal symbols.
    variables : set[str]
        The set of non-terminal symbols.
    rules : set[ProbabilisticRule]
        The production rules.
    start_variable : str
        The start symbol.
    """

    def __init__(self, alphabet: set[str], variables: set[str],
                 rules: set[ProbabilisticRule], start_variable: str):
        self._alphabet = alphabet
        self._variables = variables
        self._rules = rules
        self._start_variable = start_variable
        self._validate_variables()
        self._validate_rules()
        self._parser = CKYParser(self)

    def __call__(self, string: list[str] | str, mode: str = "recognize"):
        """Parse or recognize a string.

        Parameters
        ----------
        string : list[str] | str
            The input to process.
        mode : str, default "recognize"
            Either "recognize" or "parse".

        Returns
        -------
        float | list[tuple]
            Log probability (recognize) or list of parse trees (parse).
        """
        return self._parser(string, mode)

    def _validate_variables(self) -> None:
        """Check that alphabet and variables are disjoint and start variable is valid."""
        if self._alphabet & self._variables:
            raise ValueError('alphabet and variables must not share elements')
        if self._start_variable not in self._variables:
            raise ValueError('start variable must be in set of variables')

    def _validate_rules(self) -> None:
        """Validate rule well-formedness and probability distributions."""
        for r in self._rules:
            r.validate(self._alphabet, self._variables)
            if r.logprob is None:
                raise ValueError(f'rule {r!r} has no log probability')
        # Group log probabilities by left-hand side
        log_probs: dict[str, list[float]] = {}
        for r in self._rules:
            log_probs.setdefault(r.left_side, []).append(r.logprob)
        # Each nonterminal's rule probabilities must sum to 1
        for nt, lps in log_probs.items():
            sum_prob = np.exp(logsumexp(lps))
            if not np.isclose(sum_prob, 1.0, rtol=1e-5):
                raise ValueError(f'Rules for {nt} sum to {sum_prob}, not 1.0')

    @property
    def alphabet(self) -> set[str]:
        """The set of terminal symbols."""
        return self._alphabet

    @property
    def variables(self) -> set[str]:
        """The set of non-terminal symbols."""
        return self._variables

    @property
    def rules(self) -> set[ProbabilisticRule]:
        """The production rules."""
        return self._rules

    @property
    def start_variable(self) -> str:
        """The start symbol."""
        return self._start_variable

    def reduce(self, *right_side: str) -> set[tuple[str, float]]:
        """Find nonterminals that rewrite as the given right side, with log probabilities.

        Parameters
        ----------
        *right_side : str
            The right-hand side symbols to match.

        Returns
        -------
        set[tuple[str, float]]
            Pairs of (nonterminal, log probability).
        """
        return {(r.left_side, r.logprob) for r in self._rules
                if r.right_side == tuple(right_side)}

The Inside Algorithm for Computing String Probabilities
In a PCFG, each parse tree has a probability, which is the product of the probabilities of all rules used in the derivation. The probability of a string in the language is the sum of the probabilities of all possible parse trees for that string.
To compute this efficiently, we use the inside algorithm (Baker 1979; Lari and Young 1990), which is a dynamic programming algorithm similar to the CKY algorithm. The inside algorithm computes what are called “inside probabilities” \(\alpha_{i,j}(A)\), which represent the probability that non-terminal \(A\) derives the substring \(w_i...w_{j-1}\).
For a grammar in Chomsky Normal Form, the inside probabilities are defined recursively:
\[ \alpha_{i,j}(A) = \begin{cases} P(A \rightarrow w_i) & \text{if } j = i + 1 \\ \sum_{k=i+1}^{j-1} \sum_{B,C \in V} P(A \rightarrow B C) \cdot \alpha_{i,k}(B) \cdot \alpha_{k,j}(C) & \text{if } j > i + 1 \end{cases} \]
The probability of the whole string \(w\) according to the grammar is then \(\alpha_{0,n}(S)\), where \(n\) is the length of \(w\) and \(S\) is the start symbol.
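Before folding the recursion into our parser, it is worth seeing it in isolation. The sketch below implements the inside algorithm for a tiny hand-built CNF grammar, using ordinary probabilities for readability (the parser implementation below works in log space instead); the grammar and its probabilities are invented for illustration:

```python
# Toy CNF grammar with invented probabilities.
# Binary rules A -> B C:
binary = {('Word', ('Adj', 'Suffix')): 1.0,
          ('Adj', ('Prefix', 'AdjStem')): 0.4}
# Lexical rules A -> w ('Adj' also rewrites directly as a stem, so its
# two rules sum to 0.6 + 0.4 = 1.0):
lexical = {('Adj', 'happy'): 0.6,
           ('AdjStem', 'happy'): 1.0,
           ('Prefix', 'un'): 1.0,
           ('Suffix', 'ness'): 1.0}

def inside(tokens, start='Word'):
    n = len(tokens)
    # alpha[i][j] maps nonterminal A to the probability that A derives tokens[i:j]
    alpha = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    # Base case: spans of length 1 covered by lexical rules
    for i, w in enumerate(tokens):
        for (A, word), p in lexical.items():
            if word == w:
                alpha[i][i + 1][A] = alpha[i][i + 1].get(A, 0.0) + p
    # Recursive case: sum over split points k and binary rules A -> B C
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, (B, C)), p in binary.items():
                    if B in alpha[i][k] and C in alpha[k][j]:
                        contrib = p * alpha[i][k][B] * alpha[k][j][C]
                        alpha[i][j][A] = alpha[i][j].get(A, 0.0) + contrib
    return alpha[0][n].get(start, 0.0)

print(inside(['un', 'happy', 'ness']))  # 0.4
```

The value 0.4 is the product of the rules in the single parse of un+happy+ness; when a string has several parses, the additions in the recursive case sum their probabilities together.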
Implementation of the Inside Algorithm
Let’s implement the inside algorithm as part of our CKY parser:
class CKYChart:
    """A chart for a CKY parser.

    Parameters
    ----------
    input_size : int
        The length of the input string.
    """

    def __init__(self, input_size: int):
        self._input_size = input_size
        self._chart: list[list[set]] = [
            [set() for _ in range(input_size + 1)]
            for _ in range(input_size + 1)
        ]

    def __getitem__(self, index: tuple[int, int]) -> set:
        """Retrieve the chart cell at the given span."""
        i, j = index
        self._validate_index(i, j)
        return self._chart[i][j]

    def __setitem__(self, key: tuple[int, int], item: set) -> None:
        """Set the chart cell at the given span."""
        i, j = key
        self._validate_index(i, j)
        self._chart[i][j] = item

    def _validate_index(self, i: int, j: int) -> None:
        """Raise ValueError if the index is out of bounds or in the lower triangle."""
        if i >= j:
            raise ValueError("cannot index into the lower triangle of the chart")
        try:
            self._chart[i]
        except IndexError:
            raise ValueError("row index is too large")
        try:
            self._chart[i][j]
        except IndexError:
            raise ValueError("column index is too large")

    @property
    def input_size(self) -> int:
        """The length of the input string."""
        return self._input_size
class CKYChartEntry:
    """A chart entry for a CKY chart.

    Parameters
    ----------
    symbol : str
        The non-terminal symbol.
    logprob : float
        The log probability.
    *backpointers : CKYChartEntry
        Pointers to child entries (for parsing).
    """

    def __init__(self, symbol: str, logprob: float, *backpointers: 'CKYChartEntry'):
        self._symbol = symbol
        self._logprob = logprob
        self._backpointers = backpointers

    def __hash__(self) -> int:
        return hash((self._symbol, self._backpointers))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, CKYChartEntry):
            return NotImplemented
        return (self._symbol == other._symbol and
                self._backpointers == other._backpointers)

    def __repr__(self) -> str:
        """Return a string representation with symbol, probability, and children."""
        if self._backpointers:
            return (f"{self._symbol}[{self._logprob:.4f}] -> "
                    f"{' '.join(bp.symbol for bp in self._backpointers)}")
        else:
            return f"{self._symbol}[{self._logprob:.4f}]"

    @property
    def symbol(self) -> str:
        """The non-terminal symbol."""
        return self._symbol

    @property
    def logprob(self) -> float:
        """The log probability."""
        return self._logprob

    @property
    def backpointers(self) -> tuple['CKYChartEntry', ...]:
        """Pointers to child entries."""
        return self._backpointers
class CKYParser:
    """A CKY parser for probabilistic context-free grammars.

    Parameters
    ----------
    grammar : ProbabilisticContextFreeGrammar
        The grammar to parse with.
    """

    def __init__(self, grammar: ProbabilisticContextFreeGrammar):
        self._grammar = grammar

    def __call__(self, string: list[str] | str, mode: str = "recognize"):
        """Parse or recognize a string.

        Parameters
        ----------
        string : list[str] | str
            The input to process.
        mode : str, default "recognize"
            Either "recognize" or "parse".

        Returns
        -------
        float | list[tuple]
            Log probability (recognize) or list of parse trees (parse).
        """
        if mode == "recognize":
            return self._recognize(string)
        elif mode == "parse":
            return self._parse(string)
        else:
            raise ValueError('mode must be "parse" or "recognize"')

    def _fill_chart(self, string: list[str], keep_backtraces: bool = True) -> CKYChart:
        """Fill the CKY chart using the inside algorithm.

        Parameters
        ----------
        string : list[str]
            The input string.
        keep_backtraces : bool, default True
            Whether to keep backtraces for parsing.

        Returns
        -------
        CKYChart
            The filled chart.
        """
        input_size = len(string)
        chart = CKYChart(input_size)
        # Base case: fill in chart[i, i+1] for each terminal
        for j in range(1, input_size + 1):
            i = j - 1
            # Get all non-terminals that can derive this terminal
            for symbol, logprob in self._grammar.reduce(string[i]):
                chart[i, j].add(CKYChartEntry(symbol, logprob, CKYChartEntry(string[i], 0.0)))
        # Recursive case: fill in chart[i, j] for j - i > 1
        for length in range(2, input_size + 1):
            for i in range(input_size - length + 1):
                j = i + length
                # For each possible split point k
                for k in range(i + 1, j):
                    # For each pair of non-terminals B, C where B derives w[i:k] and C derives w[k:j]
                    for left_entry in chart[i, k]:
                        for right_entry in chart[k, j]:
                            # Find all non-terminals A such that A -> B C
                            for parent, rule_logprob in self._grammar.reduce(left_entry.symbol, right_entry.symbol):
                                # Compute the probability of this derivation
                                entry_logprob = rule_logprob + left_entry.logprob + right_entry.logprob
                                if keep_backtraces:
                                    # Create a new chart entry with backpointers
                                    new_entry = CKYChartEntry(parent, entry_logprob, left_entry, right_entry)
                                    chart[i, j].add(new_entry)
                                else:
                                    # Update the aggregate probability for this non-terminal
                                    existing_entries = [e for e in chart[i, j] if e.symbol == parent]
                                    if existing_entries:
                                        # Combine probabilities using logsumexp for numerical stability
                                        old_logprob = existing_entries[0].logprob
                                        new_logprob = logsumexp([old_logprob, entry_logprob])
                                        chart[i, j].remove(existing_entries[0])
                                        chart[i, j].add(CKYChartEntry(parent, new_logprob))
                                    else:
                                        chart[i, j].add(CKYChartEntry(parent, entry_logprob))
        return chart

    def _parse(self, string: list[str] | str) -> list[tuple]:
        """Parse a string to find all possible parse trees.

        Parameters
        ----------
        string : list[str] | str
            The string to parse.

        Returns
        -------
        list[tuple[Tree, float]]
            Parse trees paired with their log probabilities.
        """
        if isinstance(string, str):
            string = string.split()
        chart = self._fill_chart(string)
        parses = []
        for entry in chart[0, len(string)]:
            if entry.symbol == self._grammar.start_variable:
                parse_tree = self._build_tree(entry)
                parses.append((parse_tree, entry.logprob))
        return parses

    def _build_tree(self, entry: CKYChartEntry):
        """Build a parse tree from a chart entry.

        Parameters
        ----------
        entry : CKYChartEntry
            The chart entry to convert.

        Returns
        -------
        Tree
            The parse tree rooted at this entry.
        """
        if not entry.backpointers:
            return Tree(entry.symbol, [])
        else:
            children = [self._build_tree(bp) for bp in entry.backpointers]
            return Tree(entry.symbol, children)

    def _recognize(self, string: list[str] | str) -> float:
        """Compute the log probability of a string under the grammar.

        Parameters
        ----------
        string : list[str] | str
            The string to recognize.

        Returns
        -------
        float
            The log probability, or -inf if the string is not in the language.
        """
        # Ensure input is a list of tokens
        if isinstance(string, str):
            string = string.split()
        chart = self._fill_chart(string, keep_backtraces=False)
        # The string's probability is the start symbol's inside probability
        # over the whole span
        logprobs = [entry.logprob for entry in chart[0, len(string)]
                    if entry.symbol == self._grammar.start_variable]
        if logprobs:
            # If there are multiple parse trees, sum their probabilities
            return logsumexp(logprobs)
        else:
            # String is not in the language
            return float('-inf')

Application to Morphological Analysis with CELEX
Now let’s apply our PCFG framework to morphological analysis using the CELEX database. CELEX provides morphological parses for thousands of English words, which we can use to estimate a PCFG for morphological analysis.
Example: Analyzing “unhappiness”
Let’s work through a detailed example of how to use our PCFG framework to analyze the morphological structure of a word like “unhappiness”:
# First, let's build a parser for CELEX's bracketed morphological parses
import pyparsing

LPAR = pyparsing.Suppress('(')
RPAR = pyparsing.Suppress(')')
LBRAC = pyparsing.Suppress('[')
RBRAC = pyparsing.Suppress(']')
tag = pyparsing.Regex(r'[^\(\)\[\]]+')
morpheme = pyparsing.Regex(r'[a-z]+')
exp = pyparsing.Forward()
# A tree is one or more space-separated subexpressions in parentheses,
# followed by a bracketed category tag
tree = pyparsing.Group(LPAR + pyparsing.OneOrMore(exp) + RPAR + LBRAC + tag + RBRAC)
exp <<= morpheme | tree
class ParsedWord:
    """A morphological parse tree node.

    Parameters
    ----------
    data : str
        The category label at this node.
    word : str | None, default None
        The surface word (only present at the root).
    children : list[ParsedWord] | None, default None
        The child nodes.

    Attributes
    ----------
    data : str
        The category label.
    word : str | None
        The surface word.
    children : list[ParsedWord]
        The child nodes.
    """

    PARSER = exp

    def __init__(self, data: str, word: str | None = None,
                 children: list['ParsedWord'] | None = None):
        self.word = word
        self.data = data
        self.children = children or []

    @classmethod
    def from_string(cls, word: str, tree: str) -> 'ParsedWord':
        """Parse a CELEX tree string into a ParsedWord.

        Parameters
        ----------
        word : str
            The surface form.
        tree : str
            The CELEX bracket notation.

        Returns
        -------
        ParsedWord
            The parsed tree.
        """
        treelist = cls.PARSER.parseString(tree)[0]
        return cls.from_list(treelist, word)

    @classmethod
    def from_list(cls, treelist, word: str | None = None) -> 'ParsedWord':
        """Build a ParsedWord from a nested list representation.

        Parameters
        ----------
        treelist : list
            The nested list from the parser.
        word : str | None, default None
            The surface form (only at the root).

        Returns
        -------
        ParsedWord
            The parsed tree.
        """
        if all(isinstance(x, str) for x in treelist):
            return cls(treelist[1], word, [cls(treelist[0])])
        else:
            return cls(treelist[-1], word,
                       [cls.from_list(l) for l in treelist[:-1]])

    def flatten(self) -> list[tuple[str, str]]:
        """Return a flat list of (category, morpheme) pairs.

        Returns
        -------
        list[tuple[str, str]]
            The morphemes with their categories.
        """
        if len(self.children) > 1:
            return [morph for child in self.children for morph in child.flatten()]
        else:
            return [(self.data, self.children[0].data)]

    @property
    def rules(self) -> list:
        """Extract CFG rules from this parsed word.

        Returns
        -------
        list[ProbabilisticRule]
            The production rules implied by the tree (log probabilities unset).
        """
        if self.children:
            rule = ProbabilisticRule(self.data, *[c.data for c in self.children])
            child_rules = [r for c in self.children for r in c.rules]
            return [rule] + child_rules
        else:
            return []
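The usage example below relies on a PCFGEstimator, which we assume implements relative-frequency (maximum-likelihood) estimation: count how often each rule occurs in the treebank and normalize the counts per left-hand side. The core computation can be sketched in isolation, on plain (left_side, right_side) tuples rather than our rule objects (the occurrence counts here are invented):

```python
import math
from collections import Counter

def estimate_logprobs(rule_occurrences):
    """Relative-frequency estimation: log P(A -> alpha) = log(count(A -> alpha) / count(A))."""
    rule_counts = Counter(rule_occurrences)            # count each distinct rule
    lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)  # count each left-hand side
    return {(lhs, rhs): math.log(c / lhs_counts[lhs])
            for (lhs, rhs), c in rule_counts.items()}

# Rule occurrences as they might be extracted from CELEX-style parses
# (one tuple per rule token in the treebank; the data is illustrative)
occurrences = [
    ('N', ('Adj', 'Suffix')),
    ('N', ('Adj', 'Suffix')),
    ('N', ('V', 'Suffix')),
    ('Adj', ('Prefix', 'Adj')),
]
logprobs = estimate_logprobs(occurrences)
print(math.exp(logprobs[('N', ('Adj', 'Suffix'))]))  # 2/3: two of the three N rules
```

Because every left-hand side is normalized by its own total, the resulting log probabilities automatically satisfy the PCFG normalization constraint.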
# Let's parse a specific example from CELEX
unhappiness_tree = "(((un)[Prefix] (happy)[Adj])[Adj] (ness)[Suffix])[N]"
unhappiness = ParsedWord.from_string('unhappiness', unhappiness_tree)

# Extract the rules from this parse
rules = unhappiness.rules
for rule in rules:
    print(rule)

# Now let's estimate a PCFG from a treebank of CELEX parses
# Assuming we have a list of parsed words:
parsed_words = [unhappiness]  # In reality, this would be many more examples

# Estimate the grammar
pcfg_estimator = PCFGEstimator()
pcfg = pcfg_estimator.fit(parsed_words)

# Let's compute the probability of "unhappiness"
morphemes = ['un', 'happy', 'ness']
log_prob = pcfg(morphemes)
print(f"Log probability of 'unhappiness': {log_prob}")

# We can also get the most likely parse
parses = pcfg(morphemes, mode="parse")
best_parse, best_logprob = max(parses, key=lambda x: x[1])
print(f"Best parse: {best_parse}")
print(f"Best parse log probability: {best_logprob}")

Evaluating Novel Morphological Forms
One of the key applications of PCFGs in morphology is evaluating the well-formedness of novel word formations. For example, we can compare the probabilities of attested and unattested forms:
# Attested forms
attested = [
['un', 'happy', 'ness'],
['re', 'consider', 'ation'],
['dis', 'agree', 'ment']
]
# Unattested but well-formed forms
well_formed_unattested = [
['un', 'belief', 'able', 'ness'],
['re', 'present', 'ative', 'ness'],
['anti', 'establish', 'ment', 'arian', 'ism']
]
# Ill-formed examples
ill_formed = [
['ness', 'un', 'happy'], # Wrong order
['un', 'ness', 'happy'], # Wrong order
['un', 'ly', 'happy'] # Incompatible affixes
]
# Compute log probabilities for each group
attested_probs = [pcfg(word) for word in attested]
well_formed_probs = [pcfg(word) for word in well_formed_unattested]
# Forms the grammar cannot derive receive log probability -inf, so the
# average over the ill-formed group will typically be -inf as well
ill_formed_probs = [pcfg(word) for word in ill_formed]
# Print average log probabilities
print(f"Average log probability of attested forms: {np.mean(attested_probs)}")
print(f"Average log probability of well-formed but unattested forms: {np.mean(well_formed_probs)}")
print(f"Average log probability of ill-formed examples: {np.mean(ill_formed_probs)}")
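One caveat when comparing these averages: longer words accumulate more rule applications, so their log probabilities are systematically lower regardless of well-formedness. A common remedy is to length-normalize, dividing each log probability by the number of morphemes. A sketch with hypothetical scores (the values are invented, not computed from a real grammar):

```python
# Hypothetical (log probability, morpheme count) pairs; the numbers are
# illustrative placeholders, not outputs of the PCFG above
scored = {
    'un+happy+ness': (-8.1, 3),
    'anti+establish+ment+arian+ism': (-16.0, 5),
}

# Dividing by length yields a per-morpheme score comparable across lengths
per_morpheme = {w: lp / n for w, (lp, n) in scored.items()}
for word, score in per_morpheme.items():
    print(f'{word}: {score:.2f} log probability per morpheme')
```

On raw log probability the five-morpheme word looks far worse (-16.0 vs. -8.1); per morpheme the gap is much smaller (-3.2 vs. -2.7), which is usually the fairer comparison.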