Bottom-Up Parsing with CKY

In the previous sections, we introduced context-free grammars, operations on grammars, and normal forms. Now, we’ll explore how to use these grammars for parsing.

We’ll focus on the Cocke-Kasami-Younger (CKY) algorithm, which is a bottom-up parsing algorithm for context-free grammars in Chomsky Normal Form. CKY is particularly well-suited for morphological analysis because of its efficiency and its ability to handle ambiguity.

Parsing for Morphological Analysis

In morphological analysis, parsing serves two primary purposes:

  1. Recognition: Determining whether a word is well-formed according to the morphological rules of a language
  2. Parsing/Analysis: Identifying the morphological structure of a word, including the hierarchical relationships between morphemes

For instance, when parsing “unhappiness”, we want to not only verify that it’s a valid English word but also determine its structure: ((un+happy)+ness) rather than (un+(happy+ness)), as these different structures might have different semantic interpretations.

The CKY Algorithm

The CKY algorithm is a dynamic programming algorithm that fills a parse table bottom-up. It’s particularly efficient because it avoids redundant computations by storing and reusing intermediate results.

Here’s the core idea of the algorithm:

  1. For a word with \(n\) morphemes, create an \(n \times n\) table where each cell \((i,j)\) represents the possible non-terminals that can derive the span of morphemes from position \(i\) to position \(j\).
  2. Fill the table diagonally, starting with single morphemes (spans of length 1), then spans of length 2, and so on.
  3. For each span \((i,j)\), consider all possible ways to split it into two smaller spans \((i,k)\) and \((k,j)\) for \(i < k < j\).
  4. Use the grammar rules to determine which non-terminals can derive the combination of the non-terminals in the smaller spans.

CKY Algorithm Pseudocode

Here’s a formal pseudocode for the CKY algorithm:

function CKY(grammar, input):
    n = length(input)
    
    # Initialize the chart (an n×n table)
    chart = new Table[n+1, n+1]
    
    # Fill in the diagonal (spans of length 1)
    for j = 1 to n:
        for each rule A → input[j-1] in grammar:
            add A to chart[j-1, j]
    
    # Fill in the rest of the chart (spans of length > 1)
    for length = 2 to n:
        for i = 0 to n - length:
            j = i + length
            for k = i + 1 to j - 1:
                for each non-terminal B in chart[i, k]:
                    for each non-terminal C in chart[k, j]:
                        for each rule A → BC in grammar:
                            add A to chart[i, j]
    
    # Check if the start symbol is in the top-right corner
    return (start_symbol in chart[0, n])

For parsing (as opposed to just recognition), we also need to keep track of backpointers to reconstruct the parse tree.

Let’s define the core classes needed for CKY parsing:

from typing import Union

SpanIndices = tuple[int, int]
CKYBackPointer = tuple[str, SpanIndices]

class Chart:
    @property
    def parses(self):
        raise NotImplementedError

class ChartEntry:
    def __hash__(self) -> int:
        raise NotImplementedError

    @property
    def backpointers(self):
        raise NotImplementedError

class CKYChartEntry(ChartEntry):
    """
    A chart entry for a CKY chart

    Parameters
    ----------
    symbol
    backpointers

    Attributes
    ----------
    symbol
    backpointers
    """

    def __init__(self, symbol: str, *backpointers: CKYBackPointer):
        self._symbol = symbol
        self._backpointers = backpointers

    def to_tuple(self):
        return (self._symbol, self._backpointers)
        
    def __hash__(self) -> int:
        return hash(self.to_tuple())

    def __eq__(self, other: 'CKYChartEntry') -> bool:
        return self.to_tuple() == other.to_tuple()
    
    def __repr__(self) -> str:
        return self._symbol + ' -> ' + ' '.join(
            f"{bp[0]}({bp[1][0]}, {bp[1][1]})" 
            for bp in self.backpointers
        )

    def __str__(self) -> str:
        return self.__repr__()
    
    @property
    def symbol(self) -> str:
        return self._symbol

    @property
    def backpointers(self) -> tuple[CKYBackPointer, ...]:
        return self._backpointers

class CKYChart(Chart):
    """
    A chart for a CKY parser

    Jurafsky & Martin call this a table

    Parameters
    ----------
    input_size

    Attributes
    ----------
    parses
    """

    def __init__(self, input_size: int):
        self._input_size = input_size
        
        self._chart: list[list[set[CKYChartEntry]]] = [
            [set({})
             for _ in range(input_size+1)]
            for _ in range(input_size+1)
        ]
        
    def __getitem__(self, index: SpanIndices) -> set[CKYChartEntry]:
        i, j = index

        self._validate_index(i, j)
        
        return self._chart[i][j]

    def __setitem__(self, key: SpanIndices, item: set[CKYChartEntry]):
        i, j = key

        self._chart[i][j] = item
        
    def _validate_index(self, i, j):
        try:
            assert i < j
        except AssertionError:
            msg = "cannot index into the lower " +\
                  "triangle of the chart"
            raise ValueError(msg)

        try:
            self._chart[i]
        except IndexError:
            msg = "row index is too large"
            raise ValueError(msg)

        try:
            self._chart[i][j]
        except IndexError:
            msg = "column index is too large"
            raise ValueError(msg)

    @property
    def parses(self) -> set[Tree]:
        try:
            return self._parses
        except AttributeError:
            self._parses = self._construct_parses()
            return self._parses

    def _construct_parses(self, entry: Union['CKYChartEntry', None] = None) -> set[Tree]:
        """Construct the parses implied by the chart

        Parameters
        ----------
        entry
        """
        if entry is None:
            return {parse for entry in self[0, self._input_size] 
                   for parse in self._construct_parses(entry)}
        
        if not entry.backpointers:
            return {Tree(entry.symbol, [])}
        
        parses = set()
        
        for left_bp, right_bp in zip(entry.backpointers[0::2], entry.backpointers[1::2]):
            left_symbol, left_span = left_bp
            right_symbol, right_span = right_bp
            
            left_parses = self._construct_parses(
                next(e for e in self[left_span] if e.symbol == left_symbol)
            )
            
            right_parses = self._construct_parses(
                next(e for e in self[right_span] if e.symbol == right_symbol)
            )
            
            for left_parse in left_parses:
                for right_parse in right_parses:
                    parses.add(Tree(entry.symbol, [left_parse, right_parse]))
        
        return parses

Now let’s define the CKYParser class that will use the chart to parse input strings:

class ContextFreeGrammarParser:
    
    def __init__(self, grammar: ContextFreeGrammar):
        self._grammar = grammar

    def __call__(self, string, mode="recognize"):
        if mode == "recognize":
            return self._recognize(string)
        elif mode == "parse":
            return self._parse(string)            
        else:
            msg = 'mode must be "parse" or "recognize"'
            raise ValueError(msg)

class CKYParser(ContextFreeGrammarParser):
    """
    A CKY parser

    Parameters
    ----------
    grammar : ContextFreeGrammar
    """
    
    normal_form = NormalForm.CNF
    
    def _fill_chart(self, string: list[str]) -> CKYChart:
        """
        Fill the chart using the CKY algorithm
        
        Parameters
        ----------
        string : list[str]
            The input string to parse
            
        Returns
        -------
        CKYChart
            The filled chart
        """
        # you'll implement this in your Assignment 9
        raise NotImplementedError

    def _parse(self, string):
        """
        Parse a string to find all possible parse trees
        
        Parameters
        ----------
        string : list[str] or str
            The string to parse, as a list of tokens
            
        Returns
        -------
        set[Tree]
            The set of all possible parse trees
        """
        # Ensure the input is a list of tokens
        if isinstance(string, str):
            string = string.split()
        
        # Fill the chart
        chart = self._fill_chart(string)
        
        # Return all possible parses
        return chart.parses
        
    def _recognize(self, string):
        """
        Recognize whether a string is in the language
        
        Parameters
        ----------
        string : list[str] or str
            The string to recognize, as a list of tokens
            
        Returns
        -------
        bool
            True if the string is in the language, False otherwise
        """
        # Ensure the input is a list of tokens
        if isinstance(string, str):
            string = string.split()
        
        # Fill the chart
        chart = self._fill_chart(string)
        
        # Check if the start variable appears in the top-right corner of the chart
        return any(entry.symbol == self._grammar.start_variable
                  for entry in chart[0, len(string)])

Detailed Example: Parsing “unhappiness”

Let’s walk through a detailed example of the CKY algorithm parsing the word “unhappiness” with the segmentation ["un", "happy", "ness"] using a simplified grammar:

grammar = ContextFreeGrammar(
    alphabet={'un', 'happy', 'ness'},
    variables={'Word', 'Adj', 'N', 'Prefix', 'Suffix'},
    rules={
        Rule('Adj', 'happy'),
        Rule('Prefix', 'un'),
        Rule('Suffix', 'ness'),
        Rule('Adj', 'Prefix', 'Adj'),
        Rule('N', 'Adj', 'Suffix'),
        Rule('Word', 'N')
    },
    start_variable='Word'
)

# Convert to CNF (already in CNF in this case)
cnf_grammar = grammar

# Set the parser class
ContextFreeGrammar.parser_class = CKYParser

# Parse the word
parses = cnf_grammar(['un', 'happy', 'ness'], mode="parse")

Step-by-Step Chart Filling

Let’s trace the CKY algorithm as it fills the chart for “unhappiness”:

Step 1: Initialize the chart

We create a 3×3 chart (for 3 morphemes):

        | 0      | 1      | 2      | 3      |
--------|--------|--------|--------|--------|
0       |        | ?      | ?      | ?      |
--------|--------|--------|--------|--------|
1       |        |        | ?      | ?      |
--------|--------|--------|--------|--------|
2       |        |        |        | ?      |
--------|--------|--------|--------|--------|

Step 2: Fill the diagonal (spans of length 1)

We look at each morpheme and find all non-terminals that can generate it:

For “un” (span 0-1): - Prefix -> un

For “happy” (span 1-2): - Adj -> happy

For “ness” (span 2-3): - Suffix -> ness

The chart now looks like:

        | 0      | 1      | 2      | 3      |
--------|--------|--------|--------|--------|
0       |        | Prefix | ?      | ?      |
--------|--------|--------|--------|--------|
1       |        |        | Adj    | ?      |
--------|--------|--------|--------|--------|
2       |        |        |        | Suffix |
--------|--------|--------|--------|--------|

Step 3: Fill spans of length 2

Now we consider spans of length 2:

For span (0-2), we consider all ways to split it: - Split at k=1: Combine chart[0,1] and chart[1,2] - Prefix (from chart[0,1]) + Adj (from chart[1,2]) - Rule Adj -> Prefix Adj applies - Add Adj to chart[0,2]

For span (1-3), we consider all ways to split it: - Split at k=2: Combine chart[1,2] and chart[2,3] - Adj (from chart[1,2]) + Suffix (from chart[2,3]) - Rule N -> Adj Suffix applies - Add N to chart[1,3]

The chart now looks like:

        | 0      | 1      | 2      | 3      |
--------|--------|--------|--------|--------|
0       |        | Prefix | Adj    | ?      |
--------|--------|--------|--------|--------|
1       |        |        | Adj    | N      |
--------|--------|--------|--------|--------|
2       |        |        |        | Suffix |
--------|--------|--------|--------|--------|

Step 4: Fill the full span (0-3)

Finally, we consider the full span (0-3):

For span (0-3), we consider all ways to split it: - Split at k=1: Combine chart[0,1] and chart[1,3] - Prefix (from chart[0,1]) + N (from chart[1,3]) - No rule applies - Split at k=2: Combine chart[0,2] and chart[2,3] - Adj (from chart[0,2]) + Suffix (from chart[2,3]) - Rule N -> Adj Suffix applies - Add N to chart[0,3]

Since N is in chart[0,3], we also check if we can derive the start symbol: - Rule Word -> N applies - Add Word to chart[0,3]

The final chart looks like:

        | 0      | 1      | 2      | 3      |
--------|--------|--------|--------|--------|
0       |        | Prefix | Adj    | N,Word |
--------|--------|--------|--------|--------|
1       |        |        | Adj    | N      |
--------|--------|--------|--------|--------|
2       |        |        |        | Suffix |
--------|--------|--------|--------|--------|

Since the start symbol Word appears in chart[0,3], the word “unhappiness” is in the language defined by our grammar.

Visualizing the Parse Tree

The backpointers in our chart entries allow us to reconstruct the parse tree:

       Word
        |
        N
       / \
     Adj  Suffix
    /  \    |
Prefix Adj  ness
   |    |
  un  happy

This tree correctly captures the structure ((un+happy)+ness) as opposed to (un+(happy+ness)).

Handling Ambiguity in Morphological Analysis

One of the strengths of the CKY algorithm is its ability to handle ambiguity. Consider the word “unlockable”, which has two possible interpretations:

  1. un + (lock + able) = “not able to be locked”
  2. (un + lock) + able = “able to be unlocked”

Let’s create a grammar that captures this ambiguity:

ambiguous_grammar = ContextFreeGrammar(
    alphabet={'un', 'lock', 'able'},
    variables={'Word', 'V', 'Adj', 'Prefix', 'Suffix'},
    rules={
        Rule('V', 'lock'),
        Rule('Prefix', 'un'),
        Rule('Suffix', 'able'),
        Rule('V', 'Prefix', 'V'),    # un + lock -> unlock
        Rule('Adj', 'V', 'Suffix'),  # lock + able -> lockable OR unlock + able -> unlockable
        Rule('Adj', 'Prefix', 'Adj'), # un + lockable -> unlockable
        Rule('Word', 'Adj')
    },
    start_variable='Word'
)

When we parse “unlockable” with CKY, the chart will contain multiple entries in chart[0,3], representing the different ways to analyze the word:

  1. Chart[0,3] contains Adj from Adj -> Prefix Adj, where Prefix is “un” and Adj is “lockable”
  2. Chart[0,3] also contains Adj from Adj -> V Suffix, where V is “unlock” and Suffix is “able”

Both of these analyses derive Word, resulting in two different parse trees:

Parse Tree 1 (un + lockable):

    Word
     |
    Adj
   /   \
Prefix  Adj
  |    /  \
 un    V  Suffix
       |    |
     lock  able

Parse Tree 2 (unlock + able):

    Word
     |
    Adj
   /   \
  V    Suffix
 / \     |
Prefix V able
  |   |
 un lock

The CKY algorithm finds all possible parses, which is essential for accurately analyzing structurally ambiguous words.

Real-World Example: Analyzing CELEX Data

Let’s apply the CKY algorithm to analyze a more complex example from the CELEX database. The word “unbelievably” has this structure in CELEX:

((((un)[Prefix] (believe)[V])[V] (able)[Suffix])[Adj] (ly)[Suffix])[Adv]

The challenge with this example is that it involves multiple levels of derivation, which the CKY algorithm handles efficiently by building up larger constituents from smaller ones.

Let’s walk through how CKY would parse this word:

  1. First, it identifies the parts of speech for each morpheme:
    • “un” -> Prefix
    • “believe” -> V
    • “able” -> Suffix
    • “ly” -> Suffix
  2. Then, it combines these to form larger constituents:
    • “un” + “believe” -> V (“unbelieve”)
    • “unbelieve” + “able” -> Adj (“unbelievable”)
    • “unbelievable” + “ly” -> Adv (“unbelievably”)
  3. Finally, it identifies the full word as an adverb.