Normal Forms for Context-Free Grammars
In the previous sections, we introduced context-free grammars and operations on them. While CFGs are powerful tools for linguistic analysis, their general form can make certain computational tasks, such as parsing, inefficient. To address this, we can transform CFGs into equivalent but more constrained forms called normal forms.
In this section, we’ll focus on Chomsky Normal Form (CNF), which is particularly important for parsing algorithms like the CKY algorithm that we’ll see in the next section.
Normal Forms
A normal form for a grammar is a standardized format that restricts the form of the production rules while preserving the language generated by the grammar. The most common normal forms for CFGs are:
- Chomsky Normal Form (CNF): Each production rule must be in one of two forms:
- \(A \rightarrow BC\) where \(B, C \in V\) (non-terminals)
- \(A \rightarrow a\) where \(a \in \Sigma\) (terminals)
- Additionally, there can be a rule \(S \rightarrow \epsilon\) if the language contains the empty string, but \(S\) cannot appear on the right side of any rule.
- Greibach Normal Form (GNF): Each production rule must be in the form:
- \(A \rightarrow aB_1B_2...B_n\) where \(a \in \Sigma\) and \(B_1, B_2, ..., B_n \in V\)
- Binary Normal Form (BNF): Each production rule must be in one of three forms:
- \(A \rightarrow BC\) where \(B, C \in V\)
- \(A \rightarrow a\) where \(a \in \Sigma\)
- \(A \rightarrow \epsilon\)
For morphological analysis, CNF is particularly useful as it simplifies the parsing algorithms while maintaining the hierarchical structure of morphological analyses.
Chomsky Normal Form
Let’s now focus on transforming a CFG into Chomsky Normal Form. The transformation process involves several steps:
- Eliminate ε-rules (rules of the form \(A \rightarrow \epsilon\))
- Eliminate unit rules (rules of the form \(A \rightarrow B\) where \(B \in V\))
- Convert long right-hand sides (rules of the form \(A \rightarrow X_1 X_2 ... X_n\) where \(n > 2\))
- Handle terminal symbols in mixed rules (rules containing both terminals and non-terminals on the right-hand side)
Let’s implement this transformation as a method in our ContextFreeGrammar
class:
from enum import Enum
from copy import deepcopy
class NormalForm(Enum):
= 0
CNF = 1
BNF = 2
GNF
class Rule(Rule):
def validate(self, alphabet: set[str], variables: set[str], normal_form: NormalForm = None):
"""validate the rule
Parameters
----------
alphabet : set(str)
variables : set(str)
normal_form : NormalForm, optional
The normal form to validate against, by default None
"""
if self._left_side not in variables:
= "left side of rule must be a variable"
msg raise ValueError(msg)
= alphabet | variables | {''}
acceptable
if not all([s in acceptable for s in self._right_side]):
= "right side of rule must contain only " + \
msg "a variable, a symbol, or the empty string"
raise ValueError(msg)
if normal_form == NormalForm.CNF:
try:
if len(self.right_side) == 1:
assert self.right_side[0] in alphabet
elif len(self.right_side) == 2:
assert all([s in variables for s in self.right_side])
else:
raise AssertionError
except AssertionError:
raise ValueError(f"{self} is not in CNF")
class ContextFreeGrammar(ContextFreeGrammar):
def to_cnf(self) -> 'ContextFreeGrammar':
"""
Convert the grammar to Chomsky Normal Form
Returns
-------
ContextFreeGrammar
A new grammar in Chomsky Normal Form that generates the same language
"""
# Create deep copies to avoid modifying the original grammar
= self.alphabet.copy()
alphabet = self.variables.copy()
variables = self.rules().copy()
rules = self.start_variable
start_variable
# Step 1: Add a new start variable if needed
= start_variable + "'"
new_start while new_start in variables:
+= "'"
new_start
variables.add(new_start)
rules.add(Rule(new_start, start_variable))= new_start
start_variable
# Step 2: Eliminate ε-rules
= self._find_nullable_variables(rules)
nullable_vars = self._eliminate_epsilon_rules(rules, nullable_vars, variables)
rules
# Step 3: Eliminate unit rules
= self._eliminate_unit_rules(rules, variables)
rules
# Step 4: Convert rules with terminals in the RHS
= self._convert_terminal_rules(rules, variables, alphabet)
rules, variables
# Step 5: Break long rules
= self._break_long_rules(rules, variables)
rules, variables
return ContextFreeGrammar(alphabet, variables, rules, start_variable)
def _find_nullable_variables(self, rules: set[Rule]) -> set[str]:
"""
Find all variables that can derive the empty string
Parameters
----------
rules : set(Rule)
The grammar rules
Returns
-------
set(str)
The set of nullable variables
"""
= set()
nullable
# First pass: find variables that directly derive ε
for rule in rules:
if len(rule.right_side) == 0: # A -> ε
nullable.add(rule.left_side)
# Fixed-point computation: find variables that indirectly derive ε
while True:
= len(nullable)
size_before
for rule in rules:
if rule.left_side not in nullable:
# If all variables in the right-hand side are nullable,
# then the left-hand side is also nullable
if all(symbol in nullable for symbol in rule.right_side):
nullable.add(rule.left_side)
if len(nullable) == size_before:
break
return nullable
def _eliminate_epsilon_rules(self, rules: set[Rule], nullable: set[str],
set[str]) -> set[Rule]:
variables: """
Eliminate ε-rules by adding new rules that account for nullable variables
Parameters
----------
rules : set(Rule)
The grammar rules
nullable : set(str)
The set of nullable variables
variables : set(str)
The set of variables
Returns
-------
set(Rule)
The new set of rules without ε-rules
"""
= set()
new_rules
# Remove all ε-rules
for rule in rules:
if len(rule.right_side) > 0: # Keep non-ε rules
new_rules.add(rule)
# For each rule, add versions that omit nullable variables
for rule in rules:
if len(rule.right_side) > 1: # Rules with at least 2 symbols on RHS
# Find positions of nullable variables in the right-hand side
= [i for i, symbol in enumerate(rule.right_side)
nullable_positions if symbol in nullable]
# Generate all subsets of these positions
for k in range(1, len(nullable_positions) + 1):
for positions_to_remove in itertools.combinations(nullable_positions, k):
# Create a new right side by omitting the nullable variables
= [rule.right_side[i] for i in range(len(rule.right_side))
new_right_side if i not in positions_to_remove]
if new_right_side: # Don't create new ε-rules
*new_right_side))
new_rules.add(Rule(rule.left_side,
return new_rules
def _eliminate_unit_rules(self, rules: set[Rule], variables: set[str]) -> set[Rule]:
"""
Eliminate unit rules of the form A -> B
Parameters
----------
rules : set(Rule)
The grammar rules
variables : set(str)
The set of variables
Returns
-------
set(Rule)
The new set of rules without unit rules
"""
# Compute unit pairs (A, B) such that A ->* B
= {(var, var) for var in variables} # Reflexive closure
unit_pairs
# Find all direct unit pairs
= {(rule.left_side, rule.right_side[0])
direct_pairs for rule in rules
if len(rule.right_side) == 1 and rule.right_side[0] in variables}
# Compute the transitive closure
while True:
= len(unit_pairs)
size_before
for A, B in list(unit_pairs):
for B2, C in direct_pairs:
if B == B2:
unit_pairs.add((A, C))
if len(unit_pairs) == size_before:
break
# Replace unit rules
= set()
new_rules
for rule in rules:
# Keep non-unit rules
if not (len(rule.right_side) == 1 and rule.right_side[0] in variables):
new_rules.add(rule)
# For each unit pair (A, B), add rules A -> γ for each rule B -> γ
for A, B in unit_pairs:
if A != B: # Avoid trivial unit pairs
for rule in rules:
if rule.left_side == B and (len(rule.right_side) != 1 or rule.right_side[0] not in variables):
*rule.right_side))
new_rules.add(Rule(A,
return new_rules
def _convert_terminal_rules(self, rules: set[Rule], variables: set[str],
set[str]) -> tuple[set[Rule], set[str]]:
alphabet: """
Convert rules with terminals in the right-hand side to CNF
Parameters
----------
rules : set(Rule)
The grammar rules
variables : set(str)
The set of variables
alphabet : set(str)
The terminal alphabet
Returns
-------
tuple
(new_rules, new_variables)
"""
= set()
new_rules = variables.copy()
new_variables
# Map terminals to their corresponding variables
= {}
terminal_vars
for rule in rules:
if len(rule.right_side) == 1 and rule.right_side[0] in alphabet:
# Keep terminal rules (A -> a)
new_rules.add(rule)else:
# For each terminal in a mixed or long rule
= []
new_right_side for symbol in rule.right_side:
if symbol in alphabet:
# Create a new variable for this terminal if needed
if symbol not in terminal_vars:
= f"T_{symbol}"
term_var while term_var in new_variables:
+= "'"
term_var = term_var
terminal_vars[symbol]
new_variables.add(term_var)
new_rules.add(Rule(term_var, symbol))
# Replace the terminal with its variable
new_right_side.append(terminal_vars[symbol])else:
new_right_side.append(symbol)
*new_right_side))
new_rules.add(Rule(rule.left_side,
return new_rules, new_variables
def _break_long_rules(self, rules: set[Rule], variables: set[str]) -> tuple[set[Rule], set[str]]:
"""
Break rules with more than 2 symbols on the right-hand side
Parameters
----------
rules : set(Rule)
The grammar rules
variables : set(str)
The set of variables
Returns
-------
tuple
(new_rules, new_variables)
"""
= set()
new_rules = variables.copy()
new_variables
for rule in rules:
if len(rule.right_side) <= 2:
# Keep rules that are already in CNF
new_rules.add(rule)else:
# Break the rule A -> X1 X2 ... Xn
# into A -> X1 A1, A1 -> X2 A2, ..., An-2 -> Xn-1 Xn
= list(rule.right_side)
symbols = rule.left_side
left_side
for i in range(len(symbols) - 2):
= f"{left_side}_{i+1}"
new_var while new_var in new_variables:
+= "'"
new_var
new_variables.add(new_var)
0], new_var))
new_rules.add(Rule(left_side, symbols[= new_var
left_side = symbols[1:]
symbols
# Add the final rule An-2 -> Xn-1 Xn
0], symbols[1]))
new_rules.add(Rule(left_side, symbols[
return new_rules, new_variables
Example: Converting a Grammar to CNF
Let’s see this transformation in action with a simple morphological grammar. Consider a grammar that models the formation of English nouns with prefixes and suffixes:
= ContextFreeGrammar(
grammar ={'un', 'happy', 'ness', 'kind', 'ful', 'help'},
alphabet={'Word', 'Adj', 'N', 'Prefix', 'Suffix'},
variables={
rules# Base words
'Adj', 'happy'),
Rule('Adj', 'kind'),
Rule('Adj', 'help', 'ful'), # A ternary rule
Rule(
# Word formation rules
'Word', 'N'),
Rule(
# Derivational rules
'Adj', 'Prefix', 'Adj'), # un + kind → unkind
Rule('N', 'Adj', 'Suffix'), # kind + ness → kindness
Rule('N', 'Prefix', 'Adj', 'Suffix'), # un + kind + ness → unkindness
Rule(
# Affixes
'Prefix', 'un'),
Rule('Suffix', 'ness'),
Rule('Suffix', 'ful')
Rule(
},='Word'
start_variable
)
= grammar.to_cnf()
cnf_grammar print(cnf_grammar.rules())
In this example, we have a ternary rule Adj -> help ful
and a rule with three symbols on the right-hand side: N -> Prefix Adj Suffix
. Let’s see how these are transformed in CNF:
Before transformation:
Adj -> help ful
N -> Prefix Adj Suffix
After transformation:
T_help -> help
T_ful -> ful
Adj -> T_help T_ful
N -> Prefix N_1
N_1 -> Adj Suffix
The ternary rule Adj -> help ful
is replaced by two binary rules: T_help -> help
, T_ful -> ful
, and Adj -> T_help T_ful
. The long rule N -> Prefix Adj Suffix
is replaced by two binary rules: N -> Prefix N_1
and N_1 -> Adj Suffix
.
Real-World Example: CELEX Data
Let’s look at a real example from the CELEX database. One common pattern in English is the derivation of nouns from verbs, often involving several affixes. For instance, consider the word “unhappiness” which has the structure:
un + happy + ness
This structure can be represented in CELEX as:
(((un)[Prefix] (happy)[Adj])[Adj] (ness)[Suffix])[N]
Here’s how this rule might appear in our grammar before converting to CNF:
N -> Prefix Adj Suffix
After conversion to CNF, we would have:
N -> Prefix N_1
N_1 -> Adj Suffix
If we examine the CELEX database, we can find examples of words with even more complex morphological structures. For instance, the word “unbelievably” has the structure:
un + believe + able + ly
This would be represented in CELEX as:
((((un)[Prefix] (believe)[V])[V] (able)[Suffix])[Adj] (ly)[Suffix])[Adv]
This complex structure involves multiple levels of derivation, which can be represented by a set of rules in our grammar:
# Extract a rule from CELEX
= celex_parses.iloc[50859].parsed_word # "unbelievably"
celex_word print(celex_word.rules)
This would output rules like:
Adv -> Adj Suffix
Adj -> V Suffix
V -> Prefix V
Prefix -> un
V -> believe
Suffix -> able
Suffix -> ly
Converting this grammar to CNF would result in a set of binary rules that preserve the hierarchical structure while conforming to the CNF constraints.
Advantages of CNF for Morphological Analysis
There are several advantages to using CNF for morphological analysis:
- Simplified parsing algorithms: Algorithms like CKY require the grammar to be in CNF
- Uniform structure: All rules have at most two symbols on the right-hand side
- Preservation of hierarchical structure: The transformation preserves the hierarchical structure of morphological analyses
- Efficient recognition: CNF enables more efficient recognition of complex morphological structures
Implementing the Validation for CNF
To ensure that a grammar is in CNF, we need to validate that each rule conforms to the CNF constraints. Let’s add a validation method to our ContextFreeGrammar
class:
def _validate_cnf(self, rules: set[Rule], variables: set[str], alphabet: set[str]) -> bool:
"""
Validate that a grammar is in Chomsky Normal Form
Parameters
----------
rules : set(Rule)
The grammar rules
variables : set(str)
The set of variables
alphabet : set(str)
The terminal alphabet
Returns
-------
bool
True if the grammar is in CNF, False otherwise
"""
for rule in rules:
# Check that the left side is a variable
if rule.left_side not in variables:
return False
# Check that each rule is in one of the two CNF forms
if len(rule.right_side) == 1:
# A -> a
if rule.right_side[0] not in alphabet:
return False
elif len(rule.right_side) == 2:
# A -> BC
if not all(symbol in variables for symbol in rule.right_side):
return False
else:
# Rules with 0 or more than 2 symbols are not in CNF
return False
return True
In the next section, we’ll explore how to use CFGs in CNF with the CKY parsing algorithm to efficiently parse and analyze morphological structures.