Bottom-Up Parsing with CKY

In the previous sections, we introduced context-free grammars, operations on grammars, and normal forms. Now, we’ll explore how to use these grammars for parsing.

We’ll focus on the Cocke-Kasami-Younger (CKY) algorithm, which is a bottom-up parsing algorithm for context-free grammars in Chomsky Normal Form (Kasami 1965; Younger 1967). CKY is particularly well-suited for morphological analysis because of its efficiency and its ability to handle ambiguity.

Parsing for Morphological Analysis

In morphological analysis, parsing serves two primary purposes:

  1. Recognition: Determining whether a word is well-formed according to the morphological rules of a language
  2. Parsing/Analysis: Identifying the morphological structure of a word, including the hierarchical relationships between morphemes

For instance, when parsing “unhappiness”, we want to not only verify that it’s a valid English word but also determine its structure: ((un+happy)+ness) rather than (un+(happy+ness)), as these different structures might have different semantic interpretations.

The CKY Algorithm

The CKY algorithm is a dynamic programming algorithm that fills a parse table bottom-up. It’s particularly efficient because it avoids redundant computations by storing and reusing intermediate results. See Jurafsky and Martin (2023, Ch. 18) and Hopcroft and Ullman (1979) for textbook treatments.

The algorithm runs in \(O(n^3 \cdot |R|)\) time, where \(n\) is the length of the input and \(|R|\) is the number of grammar rules. The cubic factor comes from three nested loops: there are \(O(n^2)\) possible spans \((i, j)\) in the input, and for each span, we consider \(O(n)\) possible split points \(k\) where the span could be divided into two smaller spans. For each split point, we check which grammar rules could combine the non-terminals from the two sub-spans. This is efficient enough for morphological analysis, where inputs are typically short (a single word with a handful of morphemes).1

Here’s the core idea of the algorithm:

  1. For a word with \(n\) morphemes, create an \(n \times n\) table where each cell \((i,j)\) represents the possible non-terminals that can derive the span of morphemes from position \(i\) to position \(j\).
  2. Fill the table diagonally, starting with single morphemes (spans of length 1), then spans of length 2, and so on.
  3. For each span \((i,j)\), consider all possible ways to split it into two smaller spans \((i,k)\) and \((k,j)\) for \(i < k < j\).
  4. Use the grammar rules to determine which non-terminals can derive the combination of the non-terminals in the smaller spans.
CautionExercise: Hand-trace CKY

Consider the grammar \(W \rightarrow M\;M\), \(W \rightarrow M\;W\), \(M \rightarrow \text{un}\), \(M \rightarrow \text{lock}\), \(M \rightarrow \text{able}\) (in CNF) and the input string \(\langle \text{un}, \text{lock}, \text{able} \rangle\). Fill in the CKY table by hand. Which cells can contain \(W\)? Which can contain \(M\)? How many complete parses does the string have?

The table (lower-triangular, with cell \((i,j)\) representing span from position \(i\) to \(j\)):

1 2 3
1 {\(M\)} {\(W\)} {\(W\)}
2 {\(M\)} {\(W\)}
3 {\(M\)}
  • Diagonal: \(M\) in each cell (the terminal rules).
  • Cell \((1,2)\): \(W\) via \(W \rightarrow M\;M\) (split at \(k=1\)).
  • Cell \((2,3)\): \(W\) via \(W \rightarrow M\;M\) (split at \(k=2\)).
  • Cell \((1,3)\): \(W\) via \(W \rightarrow M\;W\) at split \(k=1\) (using \(M\) from cell \((1,1)\) and \(W\) from cell \((2,3)\)). Also \(W\) via \(W \rightarrow W\;M\)? No—that rule doesn’t exist. So there is exactly one parse: \([_W [_M \text{un}]\; [_W [_M \text{lock}]\; [_M \text{able}]]]\).

CKY Algorithm Pseudocode

Here’s a formal pseudocode for the CKY algorithm:

function CKY(grammar, input):
    n = length(input)
    
    # Initialize the chart (an n×n table)
    chart = new Table[n+1, n+1]
    
    # Fill in the diagonal (spans of length 1)
    for j = 1 to n:
        for each rule A → input[j-1] in grammar:
            add A to chart[j-1, j]
    
    # Fill in the rest of the chart (spans of length > 1)
    for length = 2 to n:
        for i = 0 to n - length:
            j = i + length
            for k = i + 1 to j - 1:
                for each non-terminal B in chart[i, k]:
                    for each non-terminal C in chart[k, j]:
                        for each rule A → BC in grammar:
                            add A to chart[i, j]
    
    # Check if the start symbol is in the top-right corner
    return (start_symbol in chart[0, n])

For parsing (as opposed to just recognition), we also need to keep track of backpointers to reconstruct the parse tree.

Let’s define the core classes needed for CKY parsing:

type SpanIndices = tuple[int, int]
type CKYBackPointer = tuple[str, SpanIndices]

class Chart:
    """Abstract base for parser charts."""

    @property
    def parses(self):
        """Return the set of parse trees."""
        raise NotImplementedError

class ChartEntry:
    """Abstract base for chart entries."""

    def __hash__(self) -> int:
        raise NotImplementedError

    @property
    def backpointers(self):
        """Return the backpointers for this entry."""
        raise NotImplementedError

class CKYChartEntry(ChartEntry):
    """A chart entry for a CKY chart.

    Parameters
    ----------
    symbol : str
        The nonterminal symbol.
    *backpointers : CKYBackPointer
        Pairs of (symbol, span) pointing to child entries.

    Attributes
    ----------
    symbol : str
        The nonterminal symbol.
    backpointers : tuple[CKYBackPointer, ...]
        The child backpointers.
    """

    def __init__(self, symbol: str, *backpointers: CKYBackPointer):
        self._symbol = symbol
        self._backpointers = backpointers

    def to_tuple(self) -> tuple[str, tuple[CKYBackPointer, ...]]:
        """Return a hashable tuple representation.

        Returns
        -------
        tuple[str, tuple[CKYBackPointer, ...]]
            The symbol and backpointers.
        """
        return (self._symbol, self._backpointers)

    def __hash__(self) -> int:
        return hash(self.to_tuple())

    def __eq__(self, other: 'CKYChartEntry') -> bool:
        return self.to_tuple() == other.to_tuple()

    def __repr__(self) -> str:
        """Return a string showing the symbol and its backpointer spans."""
        return self._symbol + ' -> ' + ' '.join(
            f"{bp[0]}({bp[1][0]}, {bp[1][1]})"
            for bp in self.backpointers
        )

    def __str__(self) -> str:
        return self.__repr__()

    @property
    def symbol(self) -> str:
        """The nonterminal symbol."""
        return self._symbol

    @property
    def backpointers(self) -> tuple[CKYBackPointer, ...]:
        """The child backpointers."""
        return self._backpointers

class CKYChart(Chart):
    """A chart for a CKY parser.

    Jurafsky & Martin call this a table.

    Parameters
    ----------
    input_size : int
        The number of tokens in the input.

    Attributes
    ----------
    parses : set[Tree]
        The parse trees extracted from the chart.
    """

    def __init__(self, input_size: int):
        self._input_size = input_size

        self._chart: list[list[set[CKYChartEntry]]] = [
            [set() for _ in range(input_size + 1)]
            for _ in range(input_size + 1)
        ]

    def __getitem__(self, index: SpanIndices) -> set[CKYChartEntry]:
        """Retrieve the chart cell at the given span."""
        i, j = index
        self._validate_index(i, j)
        return self._chart[i][j]

    def __setitem__(self, key: SpanIndices, item: set[CKYChartEntry]) -> None:
        """Set the chart cell at the given span."""
        i, j = key
        self._chart[i][j] = item

    def _validate_index(self, i: int, j: int) -> None:
        """Raise ValueError if the index is out of bounds or in the lower triangle."""
        if i >= j:
            msg = ("cannot index into the lower "
                   "triangle of the chart")
            raise ValueError(msg)

        try:
            self._chart[i]
        except IndexError:
            msg = "row index is too large"
            raise ValueError(msg)

        try:
            self._chart[i][j]
        except IndexError:
            msg = "column index is too large"
            raise ValueError(msg)

    @property
    def parses(self) -> set:
        """The parse trees extracted from the chart."""
        try:
            return self._parses
        except AttributeError:
            self._parses = self._construct_parses()
            return self._parses

    def _construct_parses(self, entry: 'CKYChartEntry | None' = None) -> set:
        """Construct parse trees from chart backpointers.

        Parameters
        ----------
        entry : CKYChartEntry | None, default None
            The entry to start from; if None, starts from the root span.

        Returns
        -------
        set[Tree]
            The parse trees reachable from this entry.
        """
        if entry is None:
            return {parse for entry in self[0, self._input_size] 
                   for parse in self._construct_parses(entry)}
        
        if not entry.backpointers:
            return {Tree(entry.symbol, [])}
        
        parses = set()
        
        for left_bp, right_bp in zip(entry.backpointers[0::2], entry.backpointers[1::2]):
            left_symbol, left_span = left_bp
            right_symbol, right_span = right_bp

            for left_entry in (e for e in self[left_span] if e.symbol == left_symbol):
                left_parses = self._construct_parses(left_entry)

                for right_entry in (e for e in self[right_span] if e.symbol == right_symbol):
                    right_parses = self._construct_parses(right_entry)

                    for left_parse in left_parses:
                        for right_parse in right_parses:
                            parses.add(Tree(entry.symbol, [left_parse, right_parse]))
        
        return parses

Now let’s define the CKYParser class that will use the chart to parse input strings:

class ContextFreeGrammarParser:
    """Base class for CFG parsers.

    Parameters
    ----------
    grammar : ContextFreeGrammar
        The grammar to parse with.
    """

    def __init__(self, grammar: ContextFreeGrammar):
        self._grammar = grammar

    def __call__(self, string: list[str] | str, mode: str = "recognize"):
        """Parse or recognize a string.

        Parameters
        ----------
        string : list[str] | str
            The input to process.
        mode : str, default "recognize"
            Either "recognize" or "parse".

        Returns
        -------
        set[Tree] | bool
            Parse trees (parse mode) or boolean (recognize mode).
        """
        if mode == "recognize":
            return self._recognize(string)
        elif mode == "parse":
            return self._parse(string)
        else:
            msg = 'mode must be "parse" or "recognize"'
            raise ValueError(msg)

class CKYParser(ContextFreeGrammarParser):
    """A CKY parser for grammars in Chomsky Normal Form.

    Parameters
    ----------
    grammar : ContextFreeGrammar
        The grammar to parse with.
    """

    normal_form = NormalForm.CNF

    def _fill_chart(self, string: list[str]) -> CKYChart:
        """Fill the chart using the CKY algorithm.

        Parameters
        ----------
        string : list[str]
            The input string to parse.

        Returns
        -------
        CKYChart
            The filled chart.
        """
        # you'll implement this in your Assignment 9
        raise NotImplementedError

    def _parse(self, string: list[str] | str) -> set:
        """Parse a string to find all possible parse trees.

        Parameters
        ----------
        string : list[str] | str
            The string to parse.

        Returns
        -------
        set[Tree]
            The set of all possible parse trees.
        """
        if isinstance(string, str):
            string = string.split()

        chart = self._fill_chart(string)

        return chart.parses

    def _recognize(self, string: list[str] | str) -> bool:
        """Recognize whether a string is in the language.

        Parameters
        ----------
        string : list[str] | str
            The string to recognize.

        Returns
        -------
        bool
            True if the string is in the language.
        """
        if isinstance(string, str):
            string = string.split()

        chart = self._fill_chart(string)

        return any(entry.symbol == self._grammar.start_variable
                   for entry in chart[0, len(string)])

Detailed Example: Parsing “unhappiness”

Let’s walk through a detailed example of the CKY algorithm parsing the word “unhappiness” with the segmentation ["un", "happy", "ness"] using a simplified grammar:

grammar = ContextFreeGrammar(
    alphabet={'un', 'happy', 'ness'},
    variables={'Word', 'Adj', 'N', 'Prefix', 'Suffix'},
    rules={
        Rule('Adj', 'happy'),
        Rule('Prefix', 'un'),
        Rule('Suffix', 'ness'),
        Rule('Adj', 'Prefix', 'Adj'),
        Rule('N', 'Adj', 'Suffix'),
        Rule('Word', 'N')
    },
    start_variable='Word'
)

# Convert to CNF (already in CNF in this case)
cnf_grammar = grammar

# Set the parser class
ContextFreeGrammar.parser_class = CKYParser

# Parse the word
parses = cnf_grammar(['un', 'happy', 'ness'], mode="parse")

Step-by-Step Chart Filling

Let’s trace the CKY algorithm as it fills the chart for “unhappiness”:

Step 1: Initialize the chart

We create a 3×3 chart (for 3 morphemes):

        | 0      | 1      | 2      | 3      |
--------|--------|--------|--------|--------|
0       |        | ?      | ?      | ?      |
--------|--------|--------|--------|--------|
1       |        |        | ?      | ?      |
--------|--------|--------|--------|--------|
2       |        |        |        | ?      |
--------|--------|--------|--------|--------|

Step 2: Fill the diagonal (spans of length 1)

We look at each morpheme and find all non-terminals that can generate it:

For “un” (span 0-1): - Prefix -> un

For “happy” (span 1-2): - Adj -> happy

For “ness” (span 2-3): - Suffix -> ness

The chart now looks like:

        | 0      | 1      | 2      | 3      |
--------|--------|--------|--------|--------|
0       |        | Prefix | ?      | ?      |
--------|--------|--------|--------|--------|
1       |        |        | Adj    | ?      |
--------|--------|--------|--------|--------|
2       |        |        |        | Suffix |
--------|--------|--------|--------|--------|

Step 3: Fill spans of length 2

Now we consider spans of length 2:

For span (0-2), we consider all ways to split it: - Split at k=1: Combine chart[0,1] and chart[1,2] - Prefix (from chart[0,1]) + Adj (from chart[1,2]) - Rule Adj -> Prefix Adj applies - Add Adj to chart[0,2]

For span (1-3), we consider all ways to split it: - Split at k=2: Combine chart[1,2] and chart[2,3] - Adj (from chart[1,2]) + Suffix (from chart[2,3]) - Rule N -> Adj Suffix applies - Add N to chart[1,3]

The chart now looks like:

        | 0      | 1      | 2      | 3      |
--------|--------|--------|--------|--------|
0       |        | Prefix | Adj    | ?      |
--------|--------|--------|--------|--------|
1       |        |        | Adj    | N      |
--------|--------|--------|--------|--------|
2       |        |        |        | Suffix |
--------|--------|--------|--------|--------|

Step 4: Fill the full span (0-3)

Finally, we consider the full span (0-3):

For span (0-3), we consider all ways to split it: - Split at k=1: Combine chart[0,1] and chart[1,3] - Prefix (from chart[0,1]) + N (from chart[1,3]) - No rule applies - Split at k=2: Combine chart[0,2] and chart[2,3] - Adj (from chart[0,2]) + Suffix (from chart[2,3]) - Rule N -> Adj Suffix applies - Add N to chart[0,3]

Since N is in chart[0,3], we also check if we can derive the start symbol: - Rule Word -> N applies - Add Word to chart[0,3]

The final chart looks like:

        | 0      | 1      | 2      | 3      |
--------|--------|--------|--------|--------|
0       |        | Prefix | Adj    | N,Word |
--------|--------|--------|--------|--------|
1       |        |        | Adj    | N      |
--------|--------|--------|--------|--------|
2       |        |        |        | Suffix |
--------|--------|--------|--------|--------|

Since the start symbol Word appears in chart[0,3], the word “unhappiness” is in the language defined by our grammar.

Visualizing the Parse Tree

The backpointers in our chart entries allow us to reconstruct the parse tree:

       Word
        |
        N
       / \
     Adj  Suffix
    /  \    |
Prefix Adj  ness
   |    |
  un  happy

This tree correctly captures the structure ((un+happy)+ness) as opposed to (un+(happy+ness)).

Handling Ambiguity in Morphological Analysis

One of the strengths of the CKY algorithm is its ability to handle ambiguity. Consider the word “unlockable”, which has two possible interpretations:

  1. un + (lock + able) = “not able to be locked”
  2. (un + lock) + able = “able to be unlocked”

Let’s create a grammar that captures this ambiguity:

ambiguous_grammar = ContextFreeGrammar(
    alphabet={'un', 'lock', 'able'},
    variables={'Word', 'V', 'Adj', 'Prefix', 'Suffix'},
    rules={
        Rule('V', 'lock'),
        Rule('Prefix', 'un'),
        Rule('Suffix', 'able'),
        Rule('V', 'Prefix', 'V'),    # un + lock -> unlock
        Rule('Adj', 'V', 'Suffix'),  # lock + able -> lockable OR unlock + able -> unlockable
        Rule('Adj', 'Prefix', 'Adj'), # un + lockable -> unlockable
        Rule('Word', 'Adj')
    },
    start_variable='Word'
)

When we parse “unlockable” with CKY, the chart will contain multiple entries in chart[0,3], representing the different ways to analyze the word:

  1. Chart[0,3] contains Adj from Adj -> Prefix Adj, where Prefix is “un” and Adj is “lockable”
  2. Chart[0,3] also contains Adj from Adj -> V Suffix, where V is “unlock” and Suffix is “able”

Both of these analyses derive Word, resulting in two different parse trees:

Parse Tree 1 (un + lockable):

    Word
     |
    Adj
   /   \
Prefix  Adj
  |    /  \
 un    V  Suffix
       |    |
     lock  able

Parse Tree 2 (unlock + able):

    Word
     |
    Adj
   /   \
  V    Suffix
 / \     |
Prefix V able
  |   |
 un lock

The CKY algorithm finds all possible parses, which is essential for accurately analyzing structurally ambiguous words.

Real-World Example: Analyzing CELEX Data

Let’s apply the CKY algorithm to analyze a more complex example from the CELEX database. The word “unbelievably” has this structure in CELEX:

((((un)[Prefix] (believe)[V])[V] (able)[Suffix])[Adj] (ly)[Suffix])[Adv]

The challenge with this example is that it involves multiple levels of derivation, which the CKY algorithm handles efficiently by building up larger constituents from smaller ones.

Let’s walk through how CKY would parse this word:

  1. First, it identifies the parts of speech for each morpheme:
    • “un” -> Prefix
    • “believe” -> V
    • “able” -> Suffix
    • “ly” -> Suffix
  2. Then, it combines these to form larger constituents:
    • “un” + “believe” -> V (“unbelieve”)
    • “unbelieve” + “able” -> Adj (“unbelievable”)
    • “unbelievable” + “ly” -> Adv (“unbelievably”)
  3. Finally, it identifies the full word as an adverb.

References

Hopcroft, John E., and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.
Jurafsky, Daniel, and James H. Martin. 2023. Speech and Language Processing.
Kasami, Tadao. 1965. An Efficient Recognition and Syntax-Analysis Algorithm for Context-Free Languages. AFCRL-65-758. Air Force Cambridge Research Labs.
Younger, Daniel H. 1967. “Recognition and Parsing of Context-Free Languages in Time \(n^3\).” Information and Control 10 (2): 189–208. https://doi.org/10.1016/S0019-9958(67)80007-X.

Footnotes

  1. For comparison, a naive approach that enumerates all possible parse trees would take exponential time, since the number of binary trees over \(n\) leaves is the Catalan number \(C_n\), which grows as \(\Theta(4^n / n^{3/2})\).↩︎