Top-Down Parsing with Earley

In the previous section, we explored the CKY algorithm, a bottom-up parsing approach for context-free grammars. Now, we’ll examine the Earley algorithm, which is a top-down parsing method that combines the advantages of both top-down and bottom-up approaches.

Unlike CKY, which requires the grammar to be in Chomsky Normal Form, the Earley algorithm can work with any context-free grammar. This flexibility makes it particularly useful for morphological analysis, where the natural expression of morphological rules might not conform to CNF.

Top-Down vs. Bottom-Up Parsing

Before diving into the Earley algorithm, let’s briefly compare top-down and bottom-up parsing approaches:

Bottom-Up Parsing (e.g., CKY): - Starts with the input and builds larger constituents - Works well with grammars in normal forms - Typically more efficient for highly ambiguous grammars - May do unnecessary work on strings not in the language

Top-Down Parsing: - Starts with the start symbol and tries to derive the input - More intuitive for tracing the derivation process - Can incorporate contextual constraints more easily - May get stuck in loops with left-recursive grammars

The Earley algorithm combines the strengths of both approaches by using a top-down prediction strategy but avoiding redundant work through dynamic programming techniques similar to those used in CKY.

Dotted Rules

The key concept in the Earley algorithm is the “dotted rule” (also called an “Earley item”). A dotted rule is a context-free rule with a dot (•) inserted somewhere on the right-hand side to indicate how much of the rule has been recognized so far.

For example, if we have the rule N -> Adj Suffix, we can have the following dotted rules: - N -> • Adj Suffix (nothing recognized yet) - N -> Adj • Suffix (Adj recognized, Suffix expected) - N -> Adj Suffix • (everything recognized)

Dotted rules also track their span (the portion of the input they cover) and any backpointers needed for reconstructing the parse tree.

Earley Algorithm Pseudocode

function Earley(grammar, input):
    n = length(input)
    
    # Initialize the chart (n+1 sets of dotted rules)
    chart = new Array[n+1]
    for i = 0 to n:
        chart[i] = new Set()
    
    # Initialize chart[0] with dotted rules for the start symbol
    for each rule S -> γ in grammar (where S is the start symbol):
        add (S -> • γ, 0) to chart[0]
    
    # Fill in the chart
    for i = 0 to n:
        for each state (A -> α • β, j) in chart[i]:
            if β is not empty:
                if β starts with a terminal symbol a:
                    if i < n and input[i] = a:  # Scan
                        add (A -> α a • β', j) to chart[i+1]
                else if β starts with a non-terminal B:  # Predict
                    for each rule B -> γ in grammar:
                        add (B -> • γ, i) to chart[i]
            else:  # Complete
                for each state (C -> δ • A ζ, k) in chart[j]:
                    add (C -> δ A • ζ, k) to chart[i]
        
        # Continue processing chart[i] until no more items can be added
        repeat until chart[i] doesn't change:
            apply Predict and Complete operations to all items in chart[i]
    
    # Check if the start symbol can derive the entire input
    return any item (S -> γ •, 0) in chart[n] where S is the start symbol

Implementation Classes

Let’s define the key classes for our Earley parser:

class DottedRule(Rule):
    
    def __init__(self, rule: Rule, span: SpanIndices, dot: int = 0):
        self._rule = rule
        self._left_side = rule.left_side
        self._right_side = rule.right_side
        
        self._span = span
        self._dot = dot
    
    def to_tuple(self) -> tuple[Rule, SpanIndices, int]:
        return self._rule, self._span, self._dot
    
    def __hash__(self) -> int:
        return hash(self.to_tuple())
    
    def __eq__(self, other) -> bool:
        return self.to_tuple() == other.to_tuple()
    
    def __repr__(self) -> str:
        return self._left_side + ' -> ' +\
               ' '.join(self._right_side[:self._dot]) +\
               ' . ' +\
               ' '.join(self._right_side[self._dot:]) +\
               ' [' + str(self._span[0]) + ', ' + str(self._span[1]) + ']'
    
    def complete(self, new_left_edge: int) -> 'DottedRule':
        """Complete the next symbol in this rule
        
        Parameters
        ----------
        new_left_edge : int
            The new left edge of the span
        
        Returns
        -------
        DottedRule
            A new dotted rule with the dot advanced
        """
        dot = self._dot + 1
        span = (self._span[0], new_left_edge)
        
        return DottedRule(self._rule, span, dot)

    @property
    def next_symbol(self) -> str:
        if self.is_complete:
            raise AttributeError('dotted rule is already complete')
        else:
            return self._right_side[self._dot]
        
    @property
    def dot(self) -> int:
        return self._dot
    
    @property
    def span(self) -> SpanIndices:
        return self._span
    
    @property
    def is_complete(self) -> bool:
        return self._dot == len(self._right_side)
    
    @property
    def left_side(self) -> str:
        return self._rule.left_side
EarleyBackPointer = tuple[str, int]

class EarleyChartEntry(ChartEntry):
    """A chart entry for a Earley chart

    Parameters
    ----------
    dotted_rule
    backpointers
    """

    def __init__(self, dotted_rule: DottedRule, *backpointers: EarleyBackPointer):
        self._dotted_rule = dotted_rule
        self._backpointers = backpointers

    def to_tuple(self) -> tuple[DottedRule, tuple[EarleyBackPointer, ...]]:
        return self._dotted_rule, self._backpointers
        
    def __hash__(self) -> int:
        return hash(self.to_tuple())

    def __eq__(self, other) -> bool:
        return self.to_tuple() == other.to_tuple()
    
    def __repr__(self) -> str:
        return self._dotted_rule.__repr__()

    def __str__(self) -> str:
        return self.__repr__()

    @property
    def backpointers(self) -> tuple[EarleyBackPointer, ...]:
        return self._backpointers
    
    @property
    def dotted_rule(self) -> DottedRule:
        return self._dotted_rule
    
    @property
    def next_symbol(self) -> str:
        return self._dotted_rule.next_symbol
    
    @property
    def span(self) -> SpanIndices:
        return self._dotted_rule.span
    
    @property
    def is_complete(self) -> bool:
        return self._dotted_rule.is_complete
    
    def complete(self, entry: 'EarleyChartEntry', new_left_edge: int) -> 'EarleyChartEntry':
        """Complete this entry with another entry
        
        Parameters
        ----------
        entry : EarleyChartEntry
            The entry that completes this one
        new_left_edge : int
            The new left edge of the span
            
        Returns
        -------
        EarleyChartEntry
            A new entry with the dot advanced
        """
        new_dotted_rule = self._dotted_rule.complete(new_left_edge)
        
        bp = (self._dotted_rule.next_symbol, self._dotted_rule.span[1])
        backpointers = self._backpointers + (bp,)
        
        return EarleyChartEntry(new_dotted_rule, *backpointers)
    
    def is_completion_of(self, other: 'EarleyChartEntry') -> bool:
        """Check if this entry completes another entry
        
        Parameters
        ----------
        other : EarleyChartEntry
            The entry to check
            
        Returns
        -------
        bool
            True if this entry completes the other entry
        """
        return self._dotted_rule.left_side == other.dotted_rule.next_symbol
class EarleyChart(Chart):
    """A chart for an Earley parser

    Parameters
    ----------
    input_size
    """
    def __init__(self, input_size: int, start_variable: str = 'S'):
        self._start_variable = start_variable
        
        self._chart: list[set[EarleyChartEntry]] = [
            set() for _ in range(input_size+1)
        ]
        
    def __getitem__(self, index) -> set[EarleyChartEntry]:
        return self._chart[index]

    def __setitem__(self, key, item) -> None:
        self._chart[key] = item

    @property
    def parses(self) -> set[Tree]:
        try:
            return self._parses
        except AttributeError:
            self._parses = set()
            
            for entry in self._chart[-1]:
                is_start = entry.dotted_rule.left_side == self._start_variable
                covers_string = entry.dotted_rule.span == (0, self.input_size)
                is_complete = entry.dotted_rule.is_complete
                
                if is_start and covers_string and is_complete:
                    self._parses.add(self._construct_parses(entry))
            
            return self._parses
    
    @property
    def input_size(self) -> int:
        return len(self._chart) - 1
class EarleyParser(ContextFreeGrammarParser):
    """
    An Earley parser

    Parameters
    ----------
    grammar : ContextFreeGrammar
    """
    normal_form = None
                    
    def _predict(self, chart: EarleyChart, entry: EarleyChartEntry, position: int):
        """
        Predict operation: add new dotted rules for the non-terminal
        
        Parameters
        ----------
        chart : EarleyChart
            The chart to update
        entry : EarleyChartEntry
            The entry to predict from
        position : int
            The current position in the input
        """
        for rule in self._grammar.rules(entry.next_symbol):
            span = (position, position)
            dotted_rule = DottedRule(rule, span)
            new_entry = EarleyChartEntry(dotted_rule)
            
            chart[position].add(new_entry)
     
    def _scan(self, chart: EarleyChart, entry: EarleyChartEntry, position: int):
        """
        Scan operation: advance the dot by matching the next terminal
        
        Parameters
        ----------
        chart : EarleyChart
            The chart to update
        entry : EarleyChartEntry
            The entry to scan from
        position : int
            The current position in the input
        """
        word = self._string[position]
        expected = entry.next_symbol
        
        if word == expected:
            # Create a rule for the terminal
            rule = Rule(expected, word)
            span = (position, position+1)
            dotted_rule = DottedRule(rule, span, dot=1)
            
            # Add a completed entry for the terminal
            terminal_entry = EarleyChartEntry(dotted_rule)
            chart[position+1].add(terminal_entry)
            
            # Add a completed entry for the entry that was waiting for this terminal
            completed_entry = entry.complete(terminal_entry, position+1)
            chart[position+1].add(completed_entry)
        
    def _complete(self, chart: EarleyChart, entry: EarleyChartEntry, position: int):
        """
        Complete operation: advance the dot for entries waiting for this one
        
        Parameters
        ----------
        chart : EarleyChart
            The chart to update
        entry : EarleyChartEntry
            The completed entry
        position : int
            The current position in the input
        """
        start, end = entry.span
        
        for prev_entry in chart[start]:
            if not prev_entry.is_complete and entry.is_completion_of(prev_entry):
                completed_entry = prev_entry.complete(entry, end)
                
                chart[position].add(completed_entry)
        
    def _parse(self, string: str | list[str]):
        """
        Parse a string to find all possible parse trees
        
        Parameters
        ----------
        string : str | list[str]
            The string to parse
            
        Returns
        -------
        set[Tree]
            The set of all possible parse trees
        """
        # Ensure the input is a list of tokens
        if isinstance(string, str):
            string = string.split()
        
        # Fill the chart
        chart = self._fill_chart(string)
        
        # Return all possible parses
        return chart.parses

    def _recognize(self, string: str | list[str]):
        """
        Recognize whether a string is in the language
        
        Parameters
        ----------
        string : str | list[str]
            The string to recognize
            
        Returns
        -------
        bool
            True if the string is in the language, False otherwise
        """
        # Ensure the input is a list of tokens
        if isinstance(string, str):
            string = string.split()
        
        # Fill the chart
        chart = self._fill_chart(string)
        
        # Check if the start symbol appears in the final position with a complete rule
        # that spans the entire input
        for entry in chart[-1]:
            is_start = entry.dotted_rule.left_side == self._grammar.start_variable
            covers_string = entry.dotted_rule.span == (0, len(string))
            is_complete = entry.dotted_rule.is_complete
            
            if is_start and covers_string and is_complete:
                return True
            
        return False

Implementations to Complete (Assignment)

The following three methods need to be implemented as part of your assignment:

  1. The _fill_chart method:
function _fill_chart(string):
    # Store the string for use in the scan operation
    self._string = string
    
    # Initialize the chart
    chart = new EarleyChart(length(string), self._grammar.start_variable)
    
    # Initialize chart[0] with dotted rules for the start symbol
    for each rule S → γ where S is the start symbol:
        add a dotted rule S → • γ at position 0 to chart[0]
    
    # Fill each position in the chart
    for i = 0 to length(string):
        # Process all entries at this position until no new entries are added
        repeat until chart[i] doesn't change:
            for each entry in chart[i]:
                if entry is not complete:
                    if next symbol is a non-terminal:
                        apply Predict operation
                    else if next symbol is a terminal and i < length(string):
                        apply Scan operation
                else:
                    apply Complete operation
    
    return chart
  1. The _construct_parses method:
function _construct_parses(entry):
    # Create a node for this entry
    node = new Tree(entry.dotted_rule.left_side)
    
    # If the entry has backpointers, recursively construct the children
    if entry has backpointers:
        for each backpointer (symbol, position) in entry.backpointers:
            # Find the completed entry in the chart at position that matches the symbol
            for each potential_child in chart[position]:
                if potential_child.left_side == symbol and potential_child.is_complete:
                    # Recursively construct the child's parse tree
                    child = _construct_parses(potential_child)
                    add child to node.children
                    break
    
    return node
  1. The _predict_next_word method:
function _predict_next_word(prefix):
    # Fill the chart for the prefix
    chart = _fill_chart(prefix)
    
    # Find all incomplete entries in the final position
    incomplete_entries = {all entries in chart[length(prefix)] where entry is not complete}
    
    # Group by next expected symbol
    next_symbols = {}
    for each entry in incomplete_entries:
        next_symbol = entry.next_symbol
        
        # If the next symbol is a terminal, add it directly
        if next_symbol is a terminal:
            add next_symbol to next_symbols[next_symbol]
        
        # If the next symbol is a non-terminal, find terminals it can derive
        else if next_symbol is a non-terminal:
            for each rule with left_side = next_symbol:
                if rule directly derives a terminal:
                    add the terminal to next_symbols[next_symbol]
    
    return next_symbols

Detailed Example: Parsing “unhappiness”

Let’s walk through a detailed example of the Earley algorithm parsing the word “unhappiness” with the segmentation ["un", "happy", "ness"] using our simplified grammar:

grammar = ContextFreeGrammar(
    alphabet={'un', 'happy', 'ness'},
    variables={'Word', 'Adj', 'N', 'Prefix', 'Suffix'},
    rules={
        Rule('Adj', 'happy'),
        Rule('Prefix', 'un'),
        Rule('Suffix', 'ness'),
        Rule('Adj', 'Prefix', 'Adj'),
        Rule('N', 'Adj', 'Suffix'),
        Rule('Word', 'N')
    },
    start_variable='Word'
)

# Set the parser class
ContextFreeGrammar.parser_class = EarleyParser

# Parse the word
parses = grammar(['un', 'happy', 'ness'], mode="parse")

Step-by-Step Chart Filling

Let’s trace the Earley algorithm as it fills the chart for “unhappiness”:

Chart[0]: Initial State

First, we initialize chart[0] with a dotted rule for the start symbol Word:

  1. Add Word -> • N [0,0] (from start symbol rule)
  2. Apply predict:
    • For Word -> • N:
      • Add N -> • Adj Suffix [0,0]
  3. Apply predict again:
    • For N -> • Adj Suffix:
      • Add Adj -> • happy [0,0]
      • Add Adj -> • Prefix Adj [0,0]
  4. Apply predict again:
    • For Adj -> • Prefix Adj:
      • Add Prefix -> • un [0,0]

Chart[0] now looks like:

Word -> • N [0,0]
N -> • Adj Suffix [0,0]
Adj -> • happy [0,0]
Adj -> • Prefix Adj [0,0]
Prefix -> • un [0,0]

Chart[1]: After Processing “un”

Now we process the first input symbol “un”:

  1. Apply scan:
    • For Prefix -> • un [0,0]:
      • Add Prefix -> un • [0,1] (complete)
  2. Apply complete:
    • For Prefix -> un • [0,1] (complete):
      • Find Adj -> • Prefix Adj [0,0] waiting for Prefix
      • Add Adj -> Prefix • Adj [0,1]
  3. Apply predict:
    • For Adj -> Prefix • Adj [0,1]:
      • Add Adj -> • happy [1,1]
      • Add Adj -> • Prefix Adj [1,1]
  4. Apply predict again:
    • For Adj -> • Prefix Adj [1,1]:
      • Add Prefix -> • un [1,1]

Chart[1] now looks like:

Prefix -> un • [0,1]
Adj -> Prefix • Adj [0,1]
Adj -> • happy [1,1]
Adj -> • Prefix Adj [1,1]
Prefix -> • un [1,1]

Chart[2]: After Processing “happy”

Now we process the second input symbol “happy”:

  1. Apply scan:
    • For Adj -> • happy [1,1]:
      • Add Adj -> happy • [1,2] (complete)
  2. Apply complete:
    • For Adj -> happy • [1,2] (complete):
      • Find Adj -> Prefix • Adj [0,1] waiting for Adj
      • Add Adj -> Prefix Adj • [0,2] (complete)
  3. Apply complete again:
    • For Adj -> Prefix Adj • [0,2] (complete):
      • Find N -> • Adj Suffix [0,0] waiting for Adj
      • Add N -> Adj • Suffix [0,2]
  4. Apply predict:
    • For N -> Adj • Suffix [0,2]:
      • Add Suffix -> • ness [2,2]

Chart[2] now looks like:

Adj -> happy • [1,2]
Adj -> Prefix Adj • [0,2]
N -> Adj • Suffix [0,2]
Suffix -> • ness [2,2]

Chart[3]: After Processing “ness”

Finally, we process the third input symbol “ness”:

  1. Apply scan:
    • For Suffix -> • ness [2,2]:
      • Add Suffix -> ness • [2,3] (complete)
  2. Apply complete:
    • For Suffix -> ness • [2,3] (complete):
      • Find N -> Adj • Suffix [0,2] waiting for Suffix
      • Add N -> Adj Suffix • [0,3] (complete)
  3. Apply complete again:
    • For N -> Adj Suffix • [0,3] (complete):
      • Find Word -> • N [0,0] waiting for N
      • Add Word -> N • [0,3] (complete)

Chart[3] now looks like:

Suffix -> ness • [2,3]
N -> Adj Suffix • [0,3]
Word -> N • [0,3]

Since there’s a completed dotted rule for the start symbol Word that spans the entire input (0 to 3), the word “unhappiness” is in the language defined by our grammar.

Visualizing the Chart State

We can visualize the entire chart as follows:

|-----------------------|--------------------------|--------------------------|--------------------------|
|      Chart[0]         |        Chart[1]          |        Chart[2]          |        Chart[3]          |
|-----------------------|--------------------------|--------------------------|--------------------------|
| Word -> • N [0,0]     | Prefix -> un • [0,1]     | Adj -> happy • [1,2]     | Suffix -> ness • [2,3]   |
| N -> • Adj Suffix [0,0]| Adj -> Prefix • Adj [0,1]| Adj -> Prefix Adj • [0,2]| N -> Adj Suffix • [0,3]  |
| Adj -> • happy [0,0]  | Adj -> • happy [1,1]     | N -> Adj • Suffix [0,2]  | Word -> N • [0,3]        |
| Adj -> • Prefix Adj [0,0]| Adj -> • Prefix Adj [1,1]| Suffix -> • ness [2,2]   |                        |
| Prefix -> • un [0,0]  | Prefix -> • un [1,1]     |                         |                        |
|-----------------------|--------------------------|--------------------------|--------------------------|

Reconstructing the Parse Tree

From the completed dotted rule Word -> N • [0,3] in Chart[3], we can reconstruct the parse tree by following the backpointers:

  1. Start with Word -> N • [0,3] in Chart[3]
  2. Follow the backpointer to N -> Adj Suffix • [0,3] in Chart[3]
  3. Follow the backpointer to Adj -> Prefix Adj • [0,2] in Chart[2]
  4. Follow the backpointer to Prefix -> un • [0,1] in Chart[1]
  5. Follow the backpointer to Adj -> happy • [1,2] in Chart[2]
  6. Follow the backpointer to Suffix -> ness • [2,3] in Chart[3]

This gives us the parse tree:

       Word
        |
        N
       / \
     Adj  Suffix
    /  \    |
Prefix Adj  ness
   |    |
  un  happy

This tree correctly captures the structure ((un+happy)+ness).

Handling Ambiguity: The “unlockable” Example

Let’s revisit the ambiguous word “unlockable”, which has two possible interpretations:

  1. un + (lock + able) = “not able to be locked”
  2. (un + lock) + able = “able to be unlocked”

With the Earley algorithm, we can handle this ambiguity naturally. Let’s create a grammar that captures both interpretations:

ambiguous_grammar = ContextFreeGrammar(
    alphabet={'un', 'lock', 'able'},
    variables={'Word', 'V', 'Adj', 'Prefix', 'Suffix'},
    rules={
        Rule('V', 'lock'),
        Rule('Prefix', 'un'),
        Rule('Suffix', 'able'),
        Rule('V', 'Prefix', 'V'),    # un + lock -> unlock
        Rule('Adj', 'V', 'Suffix'),  # lock + able -> lockable OR unlock + able -> unlockable
        Rule('Adj', 'Prefix', 'Adj'), # un + lockable -> unlockable
        Rule('Word', 'Adj')
    },
    start_variable='Word'
)

When we parse “unlockable” with Earley, we’ll end up with two completed dotted rules in the final chart:

  1. Word -> Adj • [0,3] where Adj comes from Adj -> Prefix Adj • [0,3] (un + lockable)
  2. Word -> Adj • [0,3] where Adj comes from Adj -> V Suffix • [0,3] (unlock + able)

Both of these analyses derive Word, resulting in two different parse trees:

Parse Tree 1 (un + lockable):

    Word
     |
    Adj
   /   \
Prefix  Adj
  |    /  \
 un    V  Suffix
       |    |
     lock  able

Parse Tree 2 (unlock + able):

    Word
     |
    Adj
   /   \
  V    Suffix
 / \     |
Prefix V able
  |   |
 un lock

The Earley algorithm finds all possible parses, which is essential for accurately analyzing structurally ambiguous words.

The Three Operations of Earley

The Earley algorithm consists of three main operations:

  1. Predict: For each dotted rule where the dot precedes a non-terminal, add new dotted rules for all possible expansions of that non-terminal. This operation looks ahead to determine what might come next.

  2. Scan: For each dotted rule where the dot precedes a terminal that matches the current input symbol, advance the dot past that terminal. This operation represents recognizing an input symbol.

  3. Complete: For each dotted rule where the dot is at the end (indicating a completed constituent), go back to the chart position where the constituent began and advance the dot in all dotted rules that were waiting for this constituent.

These operations work together to efficiently parse the input while avoiding redundant work.

Advantages of Earley for Morphological Analysis

The Earley algorithm offers several advantages for morphological analysis:

  1. Any CFG: Unlike CKY, Earley works with any context-free grammar, including those with ε-rules (empty productions), unary rules, and arbitrary rule lengths. This allows for more natural expression of morphological patterns.

  2. Incremental processing: Earley processes the input left-to-right, which can be useful for interactive applications or streaming inputs.

  3. Left recursion: Earley handles left-recursive rules without modification, which is important for certain morphological patterns.

  4. Explicit predictions: The predict operation explicitly lists all the possible continuations at each position, which can be useful for generating suggestions in applications like spelling correction or auto-completion.

  5. Efficiency: While the worst-case complexity is O(n³), for many practical grammars Earley performs better, especially for unambiguous or nearly unambiguous grammars.

Real-World Example: CELEX Data

Let’s revisit our “unbelievably” example from CELEX:

((((un)[Prefix] (believe)[V])[V] (able)[Suffix])[Adj] (ly)[Suffix])[Adv]

With the Earley algorithm, we can parse this complex structure without converting to CNF:

# Extract grammar from CELEX
celex_grammar = ContextFreeGrammar.from_treebank(celex_parses)

# Parse a specific word
parses = celex_grammar(['un', 'believe', 'able', 'ly'], mode="parse")
for parse in parses:
    print(parse)

We can also predict what might follow partial words:

# Predict what could follow "un believe"
predictions = celex_grammar.parser._predict_next_word(['un', 'believe'])
print(predictions)  # Might output {'Suffix': {'able', 'er', 'ing', ...}}