Top-Down Parsing with Earley
In the previous section, we explored the CKY algorithm, a bottom-up parsing approach for context-free grammars. Now, we’ll examine the Earley algorithm, which is a top-down parsing method that combines the advantages of both top-down and bottom-up approaches.
Unlike CKY, which requires the grammar to be in Chomsky Normal Form, the Earley algorithm can work with any context-free grammar. This flexibility makes it particularly useful for morphological analysis, where the natural expression of morphological rules might not conform to CNF.
Top-Down vs. Bottom-Up Parsing
Before diving into the Earley algorithm, let’s briefly compare top-down and bottom-up parsing approaches:
Bottom-Up Parsing (e.g., CKY): - Starts with the input and builds larger constituents - Works well with grammars in normal forms - Typically more efficient for highly ambiguous grammars - May do unnecessary work on strings not in the language
Top-Down Parsing: - Starts with the start symbol and tries to derive the input - More intuitive for tracing the derivation process - Can incorporate contextual constraints more easily - May get stuck in loops with left-recursive grammars
The Earley algorithm combines the strengths of both approaches by using a top-down prediction strategy but avoiding redundant work through dynamic programming techniques similar to those used in CKY.
Dotted Rules
The key concept in the Earley algorithm is the “dotted rule” (also called an “Earley item”). A dotted rule is a context-free rule with a dot (•) inserted somewhere on the right-hand side to indicate how much of the rule has been recognized so far.
For example, if we have the rule N -> Adj Suffix
, we can have the following dotted rules: - N -> • Adj Suffix
(nothing recognized yet) - N -> Adj • Suffix
(Adj recognized, Suffix expected) - N -> Adj Suffix •
(everything recognized)
Dotted rules also track their span (the portion of the input they cover) and any backpointers needed for reconstructing the parse tree.
Earley Algorithm Pseudocode
function Earley(grammar, input):
n = length(input)
# Initialize the chart (n+1 sets of dotted rules)
chart = new Array[n+1]
for i = 0 to n:
chart[i] = new Set()
# Initialize chart[0] with dotted rules for the start symbol
for each rule S -> γ in grammar (where S is the start symbol):
add (S -> • γ, 0) to chart[0]
# Fill in the chart
for i = 0 to n:
for each state (A -> α • β, j) in chart[i]:
if β is not empty:
if β starts with a terminal symbol a:
if i < n and input[i] = a: # Scan
add (A -> α a • β', j) to chart[i+1]
else if β starts with a non-terminal B: # Predict
for each rule B -> γ in grammar:
add (B -> • γ, i) to chart[i]
else: # Complete
for each state (C -> δ • A ζ, k) in chart[j]:
add (C -> δ A • ζ, k) to chart[i]
# Continue processing chart[i] until no more items can be added
repeat until chart[i] doesn't change:
apply Predict and Complete operations to all items in chart[i]
# Check if the start symbol can derive the entire input
return any item (S -> γ •, 0) in chart[n] where S is the start symbol
Implementation Classes
Let’s define the key classes for our Earley parser:
class DottedRule(Rule):
def __init__(self, rule: Rule, span: SpanIndices, dot: int = 0):
self._rule = rule
self._left_side = rule.left_side
self._right_side = rule.right_side
self._span = span
self._dot = dot
def to_tuple(self) -> tuple[Rule, SpanIndices, int]:
return self._rule, self._span, self._dot
def __hash__(self) -> int:
return hash(self.to_tuple())
def __eq__(self, other) -> bool:
return self.to_tuple() == other.to_tuple()
def __repr__(self) -> str:
return self._left_side + ' -> ' +\
' '.join(self._right_side[:self._dot]) +\
' . ' +\
' '.join(self._right_side[self._dot:]) +\
' [' + str(self._span[0]) + ', ' + str(self._span[1]) + ']'
def complete(self, new_left_edge: int) -> 'DottedRule':
"""Complete the next symbol in this rule
Parameters
----------
new_left_edge : int
The new left edge of the span
Returns
-------
DottedRule
A new dotted rule with the dot advanced
"""
= self._dot + 1
dot = (self._span[0], new_left_edge)
span
return DottedRule(self._rule, span, dot)
@property
def next_symbol(self) -> str:
if self.is_complete:
raise AttributeError('dotted rule is already complete')
else:
return self._right_side[self._dot]
@property
def dot(self) -> int:
return self._dot
@property
def span(self) -> SpanIndices:
return self._span
@property
def is_complete(self) -> bool:
return self._dot == len(self._right_side)
@property
def left_side(self) -> str:
return self._rule.left_side
= tuple[str, int]
EarleyBackPointer
class EarleyChartEntry(ChartEntry):
"""A chart entry for a Earley chart
Parameters
----------
dotted_rule
backpointers
"""
def __init__(self, dotted_rule: DottedRule, *backpointers: EarleyBackPointer):
self._dotted_rule = dotted_rule
self._backpointers = backpointers
def to_tuple(self) -> tuple[DottedRule, tuple[EarleyBackPointer, ...]]:
return self._dotted_rule, self._backpointers
def __hash__(self) -> int:
return hash(self.to_tuple())
def __eq__(self, other) -> bool:
return self.to_tuple() == other.to_tuple()
def __repr__(self) -> str:
return self._dotted_rule.__repr__()
def __str__(self) -> str:
return self.__repr__()
@property
def backpointers(self) -> tuple[EarleyBackPointer, ...]:
return self._backpointers
@property
def dotted_rule(self) -> DottedRule:
return self._dotted_rule
@property
def next_symbol(self) -> str:
return self._dotted_rule.next_symbol
@property
def span(self) -> SpanIndices:
return self._dotted_rule.span
@property
def is_complete(self) -> bool:
return self._dotted_rule.is_complete
def complete(self, entry: 'EarleyChartEntry', new_left_edge: int) -> 'EarleyChartEntry':
"""Complete this entry with another entry
Parameters
----------
entry : EarleyChartEntry
The entry that completes this one
new_left_edge : int
The new left edge of the span
Returns
-------
EarleyChartEntry
A new entry with the dot advanced
"""
= self._dotted_rule.complete(new_left_edge)
new_dotted_rule
= (self._dotted_rule.next_symbol, self._dotted_rule.span[1])
bp = self._backpointers + (bp,)
backpointers
return EarleyChartEntry(new_dotted_rule, *backpointers)
def is_completion_of(self, other: 'EarleyChartEntry') -> bool:
"""Check if this entry completes another entry
Parameters
----------
other : EarleyChartEntry
The entry to check
Returns
-------
bool
True if this entry completes the other entry
"""
return self._dotted_rule.left_side == other.dotted_rule.next_symbol
class EarleyChart(Chart):
"""A chart for an Earley parser
Parameters
----------
input_size
"""
def __init__(self, input_size: int, start_variable: str = 'S'):
self._start_variable = start_variable
self._chart: list[set[EarleyChartEntry]] = [
set() for _ in range(input_size+1)
]
def __getitem__(self, index) -> set[EarleyChartEntry]:
return self._chart[index]
def __setitem__(self, key, item) -> None:
self._chart[key] = item
@property
def parses(self) -> set[Tree]:
try:
return self._parses
except AttributeError:
self._parses = set()
for entry in self._chart[-1]:
= entry.dotted_rule.left_side == self._start_variable
is_start = entry.dotted_rule.span == (0, self.input_size)
covers_string = entry.dotted_rule.is_complete
is_complete
if is_start and covers_string and is_complete:
self._parses.add(self._construct_parses(entry))
return self._parses
@property
def input_size(self) -> int:
return len(self._chart) - 1
class EarleyParser(ContextFreeGrammarParser):
"""
An Earley parser
Parameters
----------
grammar : ContextFreeGrammar
"""
= None
normal_form
def _predict(self, chart: EarleyChart, entry: EarleyChartEntry, position: int):
"""
Predict operation: add new dotted rules for the non-terminal
Parameters
----------
chart : EarleyChart
The chart to update
entry : EarleyChartEntry
The entry to predict from
position : int
The current position in the input
"""
for rule in self._grammar.rules(entry.next_symbol):
= (position, position)
span = DottedRule(rule, span)
dotted_rule = EarleyChartEntry(dotted_rule)
new_entry
chart[position].add(new_entry)
def _scan(self, chart: EarleyChart, entry: EarleyChartEntry, position: int):
"""
Scan operation: advance the dot by matching the next terminal
Parameters
----------
chart : EarleyChart
The chart to update
entry : EarleyChartEntry
The entry to scan from
position : int
The current position in the input
"""
= self._string[position]
word = entry.next_symbol
expected
if word == expected:
# Create a rule for the terminal
= Rule(expected, word)
rule = (position, position+1)
span = DottedRule(rule, span, dot=1)
dotted_rule
# Add a completed entry for the terminal
= EarleyChartEntry(dotted_rule)
terminal_entry +1].add(terminal_entry)
chart[position
# Add a completed entry for the entry that was waiting for this terminal
= entry.complete(terminal_entry, position+1)
completed_entry +1].add(completed_entry)
chart[position
def _complete(self, chart: EarleyChart, entry: EarleyChartEntry, position: int):
"""
Complete operation: advance the dot for entries waiting for this one
Parameters
----------
chart : EarleyChart
The chart to update
entry : EarleyChartEntry
The completed entry
position : int
The current position in the input
"""
= entry.span
start, end
for prev_entry in chart[start]:
if not prev_entry.is_complete and entry.is_completion_of(prev_entry):
= prev_entry.complete(entry, end)
completed_entry
chart[position].add(completed_entry)
def _parse(self, string: str | list[str]):
"""
Parse a string to find all possible parse trees
Parameters
----------
string : str | list[str]
The string to parse
Returns
-------
set[Tree]
The set of all possible parse trees
"""
# Ensure the input is a list of tokens
if isinstance(string, str):
= string.split()
string
# Fill the chart
= self._fill_chart(string)
chart
# Return all possible parses
return chart.parses
def _recognize(self, string: str | list[str]):
"""
Recognize whether a string is in the language
Parameters
----------
string : str | list[str]
The string to recognize
Returns
-------
bool
True if the string is in the language, False otherwise
"""
# Ensure the input is a list of tokens
if isinstance(string, str):
= string.split()
string
# Fill the chart
= self._fill_chart(string)
chart
# Check if the start symbol appears in the final position with a complete rule
# that spans the entire input
for entry in chart[-1]:
= entry.dotted_rule.left_side == self._grammar.start_variable
is_start = entry.dotted_rule.span == (0, len(string))
covers_string = entry.dotted_rule.is_complete
is_complete
if is_start and covers_string and is_complete:
return True
return False
Implementations to Complete (Assignment)
The following three methods need to be implemented as part of your assignment:
- The
_fill_chart
method:
function _fill_chart(string):
# Store the string for use in the scan operation
self._string = string
# Initialize the chart
chart = new EarleyChart(length(string), self._grammar.start_variable)
# Initialize chart[0] with dotted rules for the start symbol
for each rule S → γ where S is the start symbol:
add a dotted rule S → • γ at position 0 to chart[0]
# Fill each position in the chart
for i = 0 to length(string):
# Process all entries at this position until no new entries are added
repeat until chart[i] doesn't change:
for each entry in chart[i]:
if entry is not complete:
if next symbol is a non-terminal:
apply Predict operation
else if next symbol is a terminal and i < length(string):
apply Scan operation
else:
apply Complete operation
return chart
- The
_construct_parses
method:
function _construct_parses(entry):
# Create a node for this entry
node = new Tree(entry.dotted_rule.left_side)
# If the entry has backpointers, recursively construct the children
if entry has backpointers:
for each backpointer (symbol, position) in entry.backpointers:
# Find the completed entry in the chart at position that matches the symbol
for each potential_child in chart[position]:
if potential_child.left_side == symbol and potential_child.is_complete:
# Recursively construct the child's parse tree
child = _construct_parses(potential_child)
add child to node.children
break
return node
- The
_predict_next_word
method:
function _predict_next_word(prefix):
# Fill the chart for the prefix
chart = _fill_chart(prefix)
# Find all incomplete entries in the final position
incomplete_entries = {all entries in chart[length(prefix)] where entry is not complete}
# Group by next expected symbol
next_symbols = {}
for each entry in incomplete_entries:
next_symbol = entry.next_symbol
# If the next symbol is a terminal, add it directly
if next_symbol is a terminal:
add next_symbol to next_symbols[next_symbol]
# If the next symbol is a non-terminal, find terminals it can derive
else if next_symbol is a non-terminal:
for each rule with left_side = next_symbol:
if rule directly derives a terminal:
add the terminal to next_symbols[next_symbol]
return next_symbols
Detailed Example: Parsing “unhappiness”
Let’s walk through a detailed example of the Earley algorithm parsing the word “unhappiness” with the segmentation ["un", "happy", "ness"]
using our simplified grammar:
= ContextFreeGrammar(
grammar ={'un', 'happy', 'ness'},
alphabet={'Word', 'Adj', 'N', 'Prefix', 'Suffix'},
variables={
rules'Adj', 'happy'),
Rule('Prefix', 'un'),
Rule('Suffix', 'ness'),
Rule('Adj', 'Prefix', 'Adj'),
Rule('N', 'Adj', 'Suffix'),
Rule('Word', 'N')
Rule(
},='Word'
start_variable
)
# Set the parser class
= EarleyParser
ContextFreeGrammar.parser_class
# Parse the word
= grammar(['un', 'happy', 'ness'], mode="parse") parses
Step-by-Step Chart Filling
Let’s trace the Earley algorithm as it fills the chart for “unhappiness”:
Chart[0]: Initial State
First, we initialize chart[0] with a dotted rule for the start symbol Word
:
- Add
Word -> • N [0,0]
(from start symbol rule) - Apply predict:
- For
Word -> • N
:- Add
N -> • Adj Suffix [0,0]
- Add
- For
- Apply predict again:
- For
N -> • Adj Suffix
:- Add
Adj -> • happy [0,0]
- Add
Adj -> • Prefix Adj [0,0]
- Add
- For
- Apply predict again:
- For
Adj -> • Prefix Adj
:- Add
Prefix -> • un [0,0]
- Add
- For
Chart[0] now looks like:
Word -> • N [0,0]
N -> • Adj Suffix [0,0]
Adj -> • happy [0,0]
Adj -> • Prefix Adj [0,0]
Prefix -> • un [0,0]
Chart[1]: After Processing “un”
Now we process the first input symbol “un”:
- Apply scan:
- For
Prefix -> • un [0,0]
:- Add
Prefix -> un • [0,1]
(complete)
- Add
- For
- Apply complete:
- For
Prefix -> un • [0,1]
(complete):- Find
Adj -> • Prefix Adj [0,0]
waiting for Prefix - Add
Adj -> Prefix • Adj [0,1]
- Find
- For
- Apply predict:
- For
Adj -> Prefix • Adj [0,1]
:- Add
Adj -> • happy [1,1]
- Add
Adj -> • Prefix Adj [1,1]
- Add
- For
- Apply predict again:
- For
Adj -> • Prefix Adj [1,1]
:- Add
Prefix -> • un [1,1]
- Add
- For
Chart[1] now looks like:
Prefix -> un • [0,1]
Adj -> Prefix • Adj [0,1]
Adj -> • happy [1,1]
Adj -> • Prefix Adj [1,1]
Prefix -> • un [1,1]
Chart[2]: After Processing “happy”
Now we process the second input symbol “happy”:
- Apply scan:
- For
Adj -> • happy [1,1]
:- Add
Adj -> happy • [1,2]
(complete)
- Add
- For
- Apply complete:
- For
Adj -> happy • [1,2]
(complete):- Find
Adj -> Prefix • Adj [0,1]
waiting for Adj - Add
Adj -> Prefix Adj • [0,2]
(complete)
- Find
- For
- Apply complete again:
- For
Adj -> Prefix Adj • [0,2]
(complete):- Find
N -> • Adj Suffix [0,0]
waiting for Adj - Add
N -> Adj • Suffix [0,2]
- Find
- For
- Apply predict:
- For
N -> Adj • Suffix [0,2]
:- Add
Suffix -> • ness [2,2]
- Add
- For
Chart[2] now looks like:
Adj -> happy • [1,2]
Adj -> Prefix Adj • [0,2]
N -> Adj • Suffix [0,2]
Suffix -> • ness [2,2]
Chart[3]: After Processing “ness”
Finally, we process the third input symbol “ness”:
- Apply scan:
- For
Suffix -> • ness [2,2]
:- Add
Suffix -> ness • [2,3]
(complete)
- Add
- For
- Apply complete:
- For
Suffix -> ness • [2,3]
(complete):- Find
N -> Adj • Suffix [0,2]
waiting for Suffix - Add
N -> Adj Suffix • [0,3]
(complete)
- Find
- For
- Apply complete again:
- For
N -> Adj Suffix • [0,3]
(complete):- Find
Word -> • N [0,0]
waiting for N - Add
Word -> N • [0,3]
(complete)
- Find
- For
Chart[3] now looks like:
Suffix -> ness • [2,3]
N -> Adj Suffix • [0,3]
Word -> N • [0,3]
Since there’s a completed dotted rule for the start symbol Word
that spans the entire input (0 to 3), the word “unhappiness” is in the language defined by our grammar.
Visualizing the Chart State
We can visualize the entire chart as follows:
|-----------------------|--------------------------|--------------------------|--------------------------|
| Chart[0] | Chart[1] | Chart[2] | Chart[3] |
|-----------------------|--------------------------|--------------------------|--------------------------|
| Word -> • N [0,0] | Prefix -> un • [0,1] | Adj -> happy • [1,2] | Suffix -> ness • [2,3] |
| N -> • Adj Suffix [0,0]| Adj -> Prefix • Adj [0,1]| Adj -> Prefix Adj • [0,2]| N -> Adj Suffix • [0,3] |
| Adj -> • happy [0,0] | Adj -> • happy [1,1] | N -> Adj • Suffix [0,2] | Word -> N • [0,3] |
| Adj -> • Prefix Adj [0,0]| Adj -> • Prefix Adj [1,1]| Suffix -> • ness [2,2] | |
| Prefix -> • un [0,0] | Prefix -> • un [1,1] | | |
|-----------------------|--------------------------|--------------------------|--------------------------|
Reconstructing the Parse Tree
From the completed dotted rule Word -> N • [0,3]
in Chart[3], we can reconstruct the parse tree by following the backpointers:
- Start with
Word -> N • [0,3]
in Chart[3] - Follow the backpointer to
N -> Adj Suffix • [0,3]
in Chart[3] - Follow the backpointer to
Adj -> Prefix Adj • [0,2]
in Chart[2] - Follow the backpointer to
Prefix -> un • [0,1]
in Chart[1] - Follow the backpointer to
Adj -> happy • [1,2]
in Chart[2] - Follow the backpointer to
Suffix -> ness • [2,3]
in Chart[3]
This gives us the parse tree:
Word
|
N
/ \
Adj Suffix
/ \ |
Prefix Adj ness
| |
un happy
This tree correctly captures the structure ((un+happy)+ness)
.
Handling Ambiguity: The “unlockable” Example
Let’s revisit the ambiguous word “unlockable”, which has two possible interpretations:
un + (lock + able)
= “not able to be locked”(un + lock) + able
= “able to be unlocked”
With the Earley algorithm, we can handle this ambiguity naturally. Let’s create a grammar that captures both interpretations:
= ContextFreeGrammar(
ambiguous_grammar ={'un', 'lock', 'able'},
alphabet={'Word', 'V', 'Adj', 'Prefix', 'Suffix'},
variables={
rules'V', 'lock'),
Rule('Prefix', 'un'),
Rule('Suffix', 'able'),
Rule('V', 'Prefix', 'V'), # un + lock -> unlock
Rule('Adj', 'V', 'Suffix'), # lock + able -> lockable OR unlock + able -> unlockable
Rule('Adj', 'Prefix', 'Adj'), # un + lockable -> unlockable
Rule('Word', 'Adj')
Rule(
},='Word'
start_variable )
When we parse “unlockable” with Earley, we’ll end up with two completed dotted rules in the final chart:
Word -> Adj • [0,3]
where Adj comes fromAdj -> Prefix Adj • [0,3]
(un + lockable)Word -> Adj • [0,3]
where Adj comes fromAdj -> V Suffix • [0,3]
(unlock + able)
Both of these analyses derive Word
, resulting in two different parse trees:
Parse Tree 1 (un + lockable):
Word
|
Adj
/ \
Prefix Adj
| / \
un V Suffix
| |
lock able
Parse Tree 2 (unlock + able):
Word
|
Adj
/ \
V Suffix
/ \ |
Prefix V able
| |
un lock
The Earley algorithm finds all possible parses, which is essential for accurately analyzing structurally ambiguous words.
The Three Operations of Earley
The Earley algorithm consists of three main operations:
Predict: For each dotted rule where the dot precedes a non-terminal, add new dotted rules for all possible expansions of that non-terminal. This operation looks ahead to determine what might come next.
Scan: For each dotted rule where the dot precedes a terminal that matches the current input symbol, advance the dot past that terminal. This operation represents recognizing an input symbol.
Complete: For each dotted rule where the dot is at the end (indicating a completed constituent), go back to the chart position where the constituent began and advance the dot in all dotted rules that were waiting for this constituent.
These operations work together to efficiently parse the input while avoiding redundant work.
Advantages of Earley for Morphological Analysis
The Earley algorithm offers several advantages for morphological analysis:
Any CFG: Unlike CKY, Earley works with any context-free grammar, including those with ε-rules (empty productions), unary rules, and arbitrary rule lengths. This allows for more natural expression of morphological patterns.
Incremental processing: Earley processes the input left-to-right, which can be useful for interactive applications or streaming inputs.
Left recursion: Earley handles left-recursive rules without modification, which is important for certain morphological patterns.
Explicit predictions: The predict operation explicitly lists all the possible continuations at each position, which can be useful for generating suggestions in applications like spelling correction or auto-completion.
Efficiency: While the worst-case complexity is O(n³), for many practical grammars Earley performs better, especially for unambiguous or nearly unambiguous grammars.
Real-World Example: CELEX Data
Let’s revisit our “unbelievably” example from CELEX:
((((un)[Prefix] (believe)[V])[V] (able)[Suffix])[Adj] (ly)[Suffix])[Adv]
With the Earley algorithm, we can parse this complex structure without converting to CNF:
# Extract grammar from CELEX
= ContextFreeGrammar.from_treebank(celex_parses)
celex_grammar
# Parse a specific word
= celex_grammar(['un', 'believe', 'able', 'ly'], mode="parse")
parses for parse in parses:
print(parse)
We can also predict what might follow partial words:
# Predict what could follow "un believe"
= celex_grammar.parser._predict_next_word(['un', 'believe'])
predictions print(predictions) # Might output {'Suffix': {'able', 'er', 'ing', ...}}