Bottom-Up Parsing with CKY
In the previous sections, we introduced context-free grammars, operations on grammars, and normal forms. Now, we’ll explore how to use these grammars for parsing.
We’ll focus on the Cocke-Kasami-Younger (CKY) algorithm, which is a bottom-up parsing algorithm for context-free grammars in Chomsky Normal Form (Kasami 1965; Younger 1967). CKY is particularly well-suited for morphological analysis because of its efficiency and its ability to handle ambiguity.
Parsing for Morphological Analysis
In morphological analysis, parsing serves two primary purposes:
- Recognition: Determining whether a word is well-formed according to the morphological rules of a language
- Parsing/Analysis: Identifying the morphological structure of a word, including the hierarchical relationships between morphemes
For instance, when parsing “unhappiness”, we want to not only verify that it’s a valid English word but also determine its structure: ((un+happy)+ness) rather than (un+(happy+ness)), as these different structures might have different semantic interpretations.
The CKY Algorithm
The CKY algorithm is a dynamic programming algorithm that fills a parse table bottom-up. It’s particularly efficient because it avoids redundant computations by storing and reusing intermediate results. See Jurafsky and Martin (2023, Ch. 18) and Hopcroft and Ullman (1979) for textbook treatments.
The algorithm runs in \(O(n^3 \cdot |R|)\) time, where \(n\) is the length of the input and \(|R|\) is the number of grammar rules. The cubic factor comes from three nested loops: there are \(O(n^2)\) possible spans \((i, j)\) in the input, and for each span, we consider \(O(n)\) possible split points \(k\) where the span could be divided into two smaller spans. For each split point, we check which grammar rules could combine the non-terminals from the two sub-spans. This is efficient enough for morphological analysis, where inputs are typically short (a single word with a handful of morphemes).1
Here’s the core idea of the algorithm:
- For a word with \(n\) morphemes, create an \(n \times n\) table where each cell \((i,j)\) represents the possible non-terminals that can derive the span of morphemes from position \(i\) to position \(j\).
- Fill the table diagonally, starting with single morphemes (spans of length 1), then spans of length 2, and so on.
- For each span \((i,j)\), consider all possible ways to split it into two smaller spans \((i,k)\) and \((k,j)\) for \(i < k < j\).
- Use the grammar rules to determine which non-terminals can derive the combination of the non-terminals in the smaller spans.
Consider the grammar \(W \rightarrow M\;M\), \(W \rightarrow M\;W\), \(M \rightarrow \text{un}\), \(M \rightarrow \text{lock}\), \(M \rightarrow \text{able}\) (in CNF) and the input string \(\langle \text{un}, \text{lock}, \text{able} \rangle\). Fill in the CKY table by hand. Which cells can contain \(W\)? Which can contain \(M\)? How many complete parses does the string have?
The table (lower-triangular, with cell \((i,j)\) representing span from position \(i\) to \(j\)):
| 1 | 2 | 3 | |
|---|---|---|---|
| 1 | {\(M\)} | {\(W\)} | {\(W\)} |
| 2 | {\(M\)} | {\(W\)} | |
| 3 | {\(M\)} |
- Diagonal: \(M\) in each cell (the terminal rules).
- Cell \((1,2)\): \(W\) via \(W \rightarrow M\;M\) (split at \(k=1\)).
- Cell \((2,3)\): \(W\) via \(W \rightarrow M\;M\) (split at \(k=2\)).
- Cell \((1,3)\): \(W\) via \(W \rightarrow M\;W\) at split \(k=1\) (using \(M\) from cell \((1,1)\) and \(W\) from cell \((2,3)\)). Also \(W\) via \(W \rightarrow W\;M\)? No—that rule doesn’t exist. So there is exactly one parse: \([_W [_M \text{un}]\; [_W [_M \text{lock}]\; [_M \text{able}]]]\).
CKY Algorithm Pseudocode
Here’s a formal pseudocode for the CKY algorithm:
function CKY(grammar, input):
n = length(input)
# Initialize the chart (an n×n table)
chart = new Table[n+1, n+1]
# Fill in the diagonal (spans of length 1)
for j = 1 to n:
for each rule A → input[j-1] in grammar:
add A to chart[j-1, j]
# Fill in the rest of the chart (spans of length > 1)
for length = 2 to n:
for i = 0 to n - length:
j = i + length
for k = i + 1 to j - 1:
for each non-terminal B in chart[i, k]:
for each non-terminal C in chart[k, j]:
for each rule A → BC in grammar:
add A to chart[i, j]
# Check if the start symbol is in the top-right corner
return (start_symbol in chart[0, n])
For parsing (as opposed to just recognition), we also need to keep track of backpointers to reconstruct the parse tree.
Let’s define the core classes needed for CKY parsing:
type SpanIndices = tuple[int, int]
type CKYBackPointer = tuple[str, SpanIndices]
class Chart:
"""Abstract base for parser charts."""
@property
def parses(self):
"""Return the set of parse trees."""
raise NotImplementedError
class ChartEntry:
"""Abstract base for chart entries."""
def __hash__(self) -> int:
raise NotImplementedError
@property
def backpointers(self):
"""Return the backpointers for this entry."""
raise NotImplementedError
class CKYChartEntry(ChartEntry):
"""A chart entry for a CKY chart.
Parameters
----------
symbol : str
The nonterminal symbol.
*backpointers : CKYBackPointer
Pairs of (symbol, span) pointing to child entries.
Attributes
----------
symbol : str
The nonterminal symbol.
backpointers : tuple[CKYBackPointer, ...]
The child backpointers.
"""
def __init__(self, symbol: str, *backpointers: CKYBackPointer):
self._symbol = symbol
self._backpointers = backpointers
def to_tuple(self) -> tuple[str, tuple[CKYBackPointer, ...]]:
"""Return a hashable tuple representation.
Returns
-------
tuple[str, tuple[CKYBackPointer, ...]]
The symbol and backpointers.
"""
return (self._symbol, self._backpointers)
def __hash__(self) -> int:
return hash(self.to_tuple())
def __eq__(self, other: 'CKYChartEntry') -> bool:
return self.to_tuple() == other.to_tuple()
def __repr__(self) -> str:
"""Return a string showing the symbol and its backpointer spans."""
return self._symbol + ' -> ' + ' '.join(
f"{bp[0]}({bp[1][0]}, {bp[1][1]})"
for bp in self.backpointers
)
def __str__(self) -> str:
return self.__repr__()
@property
def symbol(self) -> str:
"""The nonterminal symbol."""
return self._symbol
@property
def backpointers(self) -> tuple[CKYBackPointer, ...]:
"""The child backpointers."""
return self._backpointers
class CKYChart(Chart):
"""A chart for a CKY parser.
Jurafsky & Martin call this a table.
Parameters
----------
input_size : int
The number of tokens in the input.
Attributes
----------
parses : set[Tree]
The parse trees extracted from the chart.
"""
def __init__(self, input_size: int):
self._input_size = input_size
self._chart: list[list[set[CKYChartEntry]]] = [
[set() for _ in range(input_size + 1)]
for _ in range(input_size + 1)
]
def __getitem__(self, index: SpanIndices) -> set[CKYChartEntry]:
"""Retrieve the chart cell at the given span."""
i, j = index
self._validate_index(i, j)
return self._chart[i][j]
def __setitem__(self, key: SpanIndices, item: set[CKYChartEntry]) -> None:
"""Set the chart cell at the given span."""
i, j = key
self._chart[i][j] = item
def _validate_index(self, i: int, j: int) -> None:
"""Raise ValueError if the index is out of bounds or in the lower triangle."""
if i >= j:
msg = ("cannot index into the lower "
"triangle of the chart")
raise ValueError(msg)
try:
self._chart[i]
except IndexError:
msg = "row index is too large"
raise ValueError(msg)
try:
self._chart[i][j]
except IndexError:
msg = "column index is too large"
raise ValueError(msg)
@property
def parses(self) -> set:
"""The parse trees extracted from the chart."""
try:
return self._parses
except AttributeError:
self._parses = self._construct_parses()
return self._parses
def _construct_parses(self, entry: 'CKYChartEntry | None' = None) -> set:
"""Construct parse trees from chart backpointers.
Parameters
----------
entry : CKYChartEntry | None, default None
The entry to start from; if None, starts from the root span.
Returns
-------
set[Tree]
The parse trees reachable from this entry.
"""
if entry is None:
return {parse for entry in self[0, self._input_size]
for parse in self._construct_parses(entry)}
if not entry.backpointers:
return {Tree(entry.symbol, [])}
parses = set()
for left_bp, right_bp in zip(entry.backpointers[0::2], entry.backpointers[1::2]):
left_symbol, left_span = left_bp
right_symbol, right_span = right_bp
for left_entry in (e for e in self[left_span] if e.symbol == left_symbol):
left_parses = self._construct_parses(left_entry)
for right_entry in (e for e in self[right_span] if e.symbol == right_symbol):
right_parses = self._construct_parses(right_entry)
for left_parse in left_parses:
for right_parse in right_parses:
parses.add(Tree(entry.symbol, [left_parse, right_parse]))
return parsesNow let’s define the CKYParser class that will use the chart to parse input strings:
class ContextFreeGrammarParser:
"""Base class for CFG parsers.
Parameters
----------
grammar : ContextFreeGrammar
The grammar to parse with.
"""
def __init__(self, grammar: ContextFreeGrammar):
self._grammar = grammar
def __call__(self, string: list[str] | str, mode: str = "recognize"):
"""Parse or recognize a string.
Parameters
----------
string : list[str] | str
The input to process.
mode : str, default "recognize"
Either "recognize" or "parse".
Returns
-------
set[Tree] | bool
Parse trees (parse mode) or boolean (recognize mode).
"""
if mode == "recognize":
return self._recognize(string)
elif mode == "parse":
return self._parse(string)
else:
msg = 'mode must be "parse" or "recognize"'
raise ValueError(msg)
class CKYParser(ContextFreeGrammarParser):
"""A CKY parser for grammars in Chomsky Normal Form.
Parameters
----------
grammar : ContextFreeGrammar
The grammar to parse with.
"""
normal_form = NormalForm.CNF
def _fill_chart(self, string: list[str]) -> CKYChart:
"""Fill the chart using the CKY algorithm.
Parameters
----------
string : list[str]
The input string to parse.
Returns
-------
CKYChart
The filled chart.
"""
# you'll implement this in your Assignment 9
raise NotImplementedError
def _parse(self, string: list[str] | str) -> set:
"""Parse a string to find all possible parse trees.
Parameters
----------
string : list[str] | str
The string to parse.
Returns
-------
set[Tree]
The set of all possible parse trees.
"""
if isinstance(string, str):
string = string.split()
chart = self._fill_chart(string)
return chart.parses
def _recognize(self, string: list[str] | str) -> bool:
"""Recognize whether a string is in the language.
Parameters
----------
string : list[str] | str
The string to recognize.
Returns
-------
bool
True if the string is in the language.
"""
if isinstance(string, str):
string = string.split()
chart = self._fill_chart(string)
return any(entry.symbol == self._grammar.start_variable
for entry in chart[0, len(string)])Detailed Example: Parsing “unhappiness”
Let’s walk through a detailed example of the CKY algorithm parsing the word “unhappiness” with the segmentation ["un", "happy", "ness"] using a simplified grammar:
grammar = ContextFreeGrammar(
alphabet={'un', 'happy', 'ness'},
variables={'Word', 'Adj', 'N', 'Prefix', 'Suffix'},
rules={
Rule('Adj', 'happy'),
Rule('Prefix', 'un'),
Rule('Suffix', 'ness'),
Rule('Adj', 'Prefix', 'Adj'),
Rule('N', 'Adj', 'Suffix'),
Rule('Word', 'N')
},
start_variable='Word'
)
# Convert to CNF (already in CNF in this case)
cnf_grammar = grammar
# Set the parser class
ContextFreeGrammar.parser_class = CKYParser
# Parse the word
parses = cnf_grammar(['un', 'happy', 'ness'], mode="parse")Step-by-Step Chart Filling
Let’s trace the CKY algorithm as it fills the chart for “unhappiness”:
Step 1: Initialize the chart
We create a 3×3 chart (for 3 morphemes):
| 0 | 1 | 2 | 3 |
--------|--------|--------|--------|--------|
0 | | ? | ? | ? |
--------|--------|--------|--------|--------|
1 | | | ? | ? |
--------|--------|--------|--------|--------|
2 | | | | ? |
--------|--------|--------|--------|--------|
Step 2: Fill the diagonal (spans of length 1)
We look at each morpheme and find all non-terminals that can generate it:
For “un” (span 0-1): - Prefix -> un
For “happy” (span 1-2): - Adj -> happy
For “ness” (span 2-3): - Suffix -> ness
The chart now looks like:
| 0 | 1 | 2 | 3 |
--------|--------|--------|--------|--------|
0 | | Prefix | ? | ? |
--------|--------|--------|--------|--------|
1 | | | Adj | ? |
--------|--------|--------|--------|--------|
2 | | | | Suffix |
--------|--------|--------|--------|--------|
Step 3: Fill spans of length 2
Now we consider spans of length 2:
For span (0-2), we consider all ways to split it: - Split at k=1: Combine chart[0,1] and chart[1,2] - Prefix (from chart[0,1]) + Adj (from chart[1,2]) - Rule Adj -> Prefix Adj applies - Add Adj to chart[0,2]
For span (1-3), we consider all ways to split it: - Split at k=2: Combine chart[1,2] and chart[2,3] - Adj (from chart[1,2]) + Suffix (from chart[2,3]) - Rule N -> Adj Suffix applies - Add N to chart[1,3]
The chart now looks like:
| 0 | 1 | 2 | 3 |
--------|--------|--------|--------|--------|
0 | | Prefix | Adj | ? |
--------|--------|--------|--------|--------|
1 | | | Adj | N |
--------|--------|--------|--------|--------|
2 | | | | Suffix |
--------|--------|--------|--------|--------|
Step 4: Fill the full span (0-3)
Finally, we consider the full span (0-3):
For span (0-3), we consider all ways to split it: - Split at k=1: Combine chart[0,1] and chart[1,3] - Prefix (from chart[0,1]) + N (from chart[1,3]) - No rule applies - Split at k=2: Combine chart[0,2] and chart[2,3] - Adj (from chart[0,2]) + Suffix (from chart[2,3]) - Rule N -> Adj Suffix applies - Add N to chart[0,3]
Since N is in chart[0,3], we also check if we can derive the start symbol: - Rule Word -> N applies - Add Word to chart[0,3]
The final chart looks like:
| 0 | 1 | 2 | 3 |
--------|--------|--------|--------|--------|
0 | | Prefix | Adj | N,Word |
--------|--------|--------|--------|--------|
1 | | | Adj | N |
--------|--------|--------|--------|--------|
2 | | | | Suffix |
--------|--------|--------|--------|--------|
Since the start symbol Word appears in chart[0,3], the word “unhappiness” is in the language defined by our grammar.
Visualizing the Parse Tree
The backpointers in our chart entries allow us to reconstruct the parse tree:
Word
|
N
/ \
Adj Suffix
/ \ |
Prefix Adj ness
| |
un happy
This tree correctly captures the structure ((un+happy)+ness) as opposed to (un+(happy+ness)).
Handling Ambiguity in Morphological Analysis
One of the strengths of the CKY algorithm is its ability to handle ambiguity. Consider the word “unlockable”, which has two possible interpretations:
un + (lock + able)= “not able to be locked”(un + lock) + able= “able to be unlocked”
Let’s create a grammar that captures this ambiguity:
ambiguous_grammar = ContextFreeGrammar(
alphabet={'un', 'lock', 'able'},
variables={'Word', 'V', 'Adj', 'Prefix', 'Suffix'},
rules={
Rule('V', 'lock'),
Rule('Prefix', 'un'),
Rule('Suffix', 'able'),
Rule('V', 'Prefix', 'V'), # un + lock -> unlock
Rule('Adj', 'V', 'Suffix'), # lock + able -> lockable OR unlock + able -> unlockable
Rule('Adj', 'Prefix', 'Adj'), # un + lockable -> unlockable
Rule('Word', 'Adj')
},
start_variable='Word'
)When we parse “unlockable” with CKY, the chart will contain multiple entries in chart[0,3], representing the different ways to analyze the word:
- Chart[0,3] contains
AdjfromAdj -> Prefix Adj, wherePrefixis “un” andAdjis “lockable” - Chart[0,3] also contains
AdjfromAdj -> V Suffix, whereVis “unlock” andSuffixis “able”
Both of these analyses derive Word, resulting in two different parse trees:
Parse Tree 1 (un + lockable):
Word
|
Adj
/ \
Prefix Adj
| / \
un V Suffix
| |
lock able
Parse Tree 2 (unlock + able):
Word
|
Adj
/ \
V Suffix
/ \ |
Prefix V able
| |
un lock
The CKY algorithm finds all possible parses, which is essential for accurately analyzing structurally ambiguous words.
Real-World Example: Analyzing CELEX Data
Let’s apply the CKY algorithm to analyze a more complex example from the CELEX database. The word “unbelievably” has this structure in CELEX:
((((un)[Prefix] (believe)[V])[V] (able)[Suffix])[Adj] (ly)[Suffix])[Adv]
The challenge with this example is that it involves multiple levels of derivation, which the CKY algorithm handles efficiently by building up larger constituents from smaller ones.
Let’s walk through how CKY would parse this word:
- First, it identifies the parts of speech for each morpheme:
- “un” -> Prefix
- “believe” -> V
- “able” -> Suffix
- “ly” -> Suffix
- Then, it combines these to form larger constituents:
- “un” + “believe” -> V (“unbelieve”)
- “unbelieve” + “able” -> Adj (“unbelievable”)
- “unbelievable” + “ly” -> Adv (“unbelievably”)
- Finally, it identifies the full word as an adverb.
References
Footnotes
For comparison, a naive approach that enumerates all possible parse trees would take exponential time, since the number of binary trees over \(n\) leaves is the Catalan number \(C_n\), which grows as \(\Theta(4^n / n^{3/2})\).↩︎