Normal Forms for Context-Free Grammars

In the previous sections, we introduced context-free grammars and operations on them. While CFGs are powerful tools for linguistic analysis, their general form can make certain computational tasks, such as parsing, inefficient. To address this, we can transform CFGs into equivalent but more constrained forms called normal forms.

In this section, we’ll focus on Chomsky Normal Form (CNF), which is particularly important for parsing algorithms like the CKY algorithm that we’ll see in the next section.

Normal Forms

A normal form for a grammar is a standardized format that restricts the form of the production rules while preserving the language generated by the grammar. The most common normal forms for CFGs are:

  1. Chomsky Normal Form (CNF): Each production rule must be in one of two forms:
    • \(A \rightarrow BC\) where \(B, C \in V\) (non-terminals)
    • \(A \rightarrow a\) where \(a \in \Sigma\) (terminals)
    • Additionally, there can be a rule \(S \rightarrow \epsilon\) if the language contains the empty string, but \(S\) cannot appear on the right side of any rule.
  2. Greibach Normal Form (GNF): Each production rule must be in the form:
    • \(A \rightarrow aB_1B_2...B_n\) where \(a \in \Sigma\) and \(B_1, B_2, ..., B_n \in V\)
  3. Binary Normal Form (BNF): Each production rule must be in one of three forms:
    • \(A \rightarrow BC\) where \(B, C \in V\)
    • \(A \rightarrow a\) where \(a \in \Sigma\)
    • \(A \rightarrow \epsilon\)

For morphological analysis, CNF is particularly useful as it simplifies the parsing algorithms while maintaining the hierarchical structure of morphological analyses.

Chomsky Normal Form

Let’s now focus on transforming a CFG into Chomsky Normal Form. The transformation process involves several steps:

  1. Eliminate ε-rules (rules of the form \(A \rightarrow \epsilon\))
  2. Eliminate unit rules (rules of the form \(A \rightarrow B\) where \(B \in V\))
  3. Convert long right-hand sides (rules of the form \(A \rightarrow X_1 X_2 ... X_n\) where \(n > 2\))
  4. Handle terminal symbols in mixed rules (rules containing both terminals and non-terminals on the right-hand side)

Let’s implement this transformation as a method in our ContextFreeGrammar class:

from enum import Enum
from copy import deepcopy

class NormalForm(Enum):
    CNF = 0
    BNF = 1
    GNF = 2

class Rule(Rule):
    def validate(self, alphabet: set[str], variables: set[str], normal_form: NormalForm = None):
        """validate the rule

        Parameters
        ----------
        alphabet : set(str)
        variables : set(str)
        normal_form : NormalForm, optional
            The normal form to validate against, by default None
        """
        if self._left_side not in variables:
            msg = "left side of rule must be a variable"
            raise ValueError(msg)

        acceptable = alphabet | variables | {''}
            
        if not all([s in acceptable for s in self._right_side]):
            msg = "right side of rule must contain only " + \
                  "a variable, a symbol, or the empty string"
            raise ValueError(msg)

        if normal_form == NormalForm.CNF:
            try:
                if len(self.right_side) == 1:
                    assert self.right_side[0] in alphabet
                elif len(self.right_side) == 2:
                    assert all([s in variables for s in self.right_side])
                else:
                    raise AssertionError
            except AssertionError:
                raise ValueError(f"{self} is not in CNF")

class ContextFreeGrammar(ContextFreeGrammar):
    
    def to_cnf(self) -> 'ContextFreeGrammar':
        """
        Convert the grammar to Chomsky Normal Form
        
        Returns
        -------
        ContextFreeGrammar
            A new grammar in Chomsky Normal Form that generates the same language
        """
        # Create deep copies to avoid modifying the original grammar
        alphabet = self.alphabet.copy()
        variables = self.variables.copy()
        rules = self.rules().copy()
        start_variable = self.start_variable
        
        # Step 1: Add a new start variable if needed
        new_start = start_variable + "'"
        while new_start in variables:
            new_start += "'"
            
        variables.add(new_start)
        rules.add(Rule(new_start, start_variable))
        start_variable = new_start
        
        # Step 2: Eliminate ε-rules
        nullable_vars = self._find_nullable_variables(rules)
        rules = self._eliminate_epsilon_rules(rules, nullable_vars, variables)
        
        # Step 3: Eliminate unit rules
        rules = self._eliminate_unit_rules(rules, variables)
        
        # Step 4: Convert rules with terminals in the RHS
        rules, variables = self._convert_terminal_rules(rules, variables, alphabet)
        
        # Step 5: Break long rules
        rules, variables = self._break_long_rules(rules, variables)
        
        return ContextFreeGrammar(alphabet, variables, rules, start_variable)
    
    def _find_nullable_variables(self, rules: set[Rule]) -> set[str]:
        """
        Find all variables that can derive the empty string
        
        Parameters
        ----------
        rules : set(Rule)
            The grammar rules
            
        Returns
        -------
        set(str)
            The set of nullable variables
        """
        nullable = set()
        
        # First pass: find variables that directly derive ε
        for rule in rules:
            if len(rule.right_side) == 0:  # A -> ε
                nullable.add(rule.left_side)
        
        # Fixed-point computation: find variables that indirectly derive ε
        while True:
            size_before = len(nullable)
            
            for rule in rules:
                if rule.left_side not in nullable:
                    # If all variables in the right-hand side are nullable,
                    # then the left-hand side is also nullable
                    if all(symbol in nullable for symbol in rule.right_side):
                        nullable.add(rule.left_side)
            
            if len(nullable) == size_before:
                break
        
        return nullable
    
    def _eliminate_epsilon_rules(self, rules: set[Rule], nullable: set[str], 
                                 variables: set[str]) -> set[Rule]:
        """
        Eliminate ε-rules by adding new rules that account for nullable variables
        
        Parameters
        ----------
        rules : set(Rule)
            The grammar rules
        nullable : set(str)
            The set of nullable variables
        variables : set(str)
            The set of variables
            
        Returns
        -------
        set(Rule)
            The new set of rules without ε-rules
        """
        new_rules = set()
        
        # Remove all ε-rules
        for rule in rules:
            if len(rule.right_side) > 0:  # Keep non-ε rules
                new_rules.add(rule)
        
        # For each rule, add versions that omit nullable variables
        for rule in rules:
            if len(rule.right_side) > 1:  # Rules with at least 2 symbols on RHS
                # Find positions of nullable variables in the right-hand side
                nullable_positions = [i for i, symbol in enumerate(rule.right_side) 
                                     if symbol in nullable]
                
                # Generate all subsets of these positions
                for k in range(1, len(nullable_positions) + 1):
                    for positions_to_remove in itertools.combinations(nullable_positions, k):
                        # Create a new right side by omitting the nullable variables
                        new_right_side = [rule.right_side[i] for i in range(len(rule.right_side)) 
                                         if i not in positions_to_remove]
                        
                        if new_right_side:  # Don't create new ε-rules
                            new_rules.add(Rule(rule.left_side, *new_right_side))
        
        return new_rules
    
    def _eliminate_unit_rules(self, rules: set[Rule], variables: set[str]) -> set[Rule]:
        """
        Eliminate unit rules of the form A -> B
        
        Parameters
        ----------
        rules : set(Rule)
            The grammar rules
        variables : set(str)
            The set of variables
            
        Returns
        -------
        set(Rule)
            The new set of rules without unit rules
        """
        # Compute unit pairs (A, B) such that A ->* B
        unit_pairs = {(var, var) for var in variables}  # Reflexive closure
        
        # Find all direct unit pairs
        direct_pairs = {(rule.left_side, rule.right_side[0]) 
                        for rule in rules 
                        if len(rule.right_side) == 1 and rule.right_side[0] in variables}
        
        # Compute the transitive closure
        while True:
            size_before = len(unit_pairs)
            
            for A, B in list(unit_pairs):
                for B2, C in direct_pairs:
                    if B == B2:
                        unit_pairs.add((A, C))
            
            if len(unit_pairs) == size_before:
                break
        
        # Replace unit rules
        new_rules = set()
        
        for rule in rules:
            # Keep non-unit rules
            if not (len(rule.right_side) == 1 and rule.right_side[0] in variables):
                new_rules.add(rule)
        
        # For each unit pair (A, B), add rules A -> γ for each rule B -> γ
        for A, B in unit_pairs:
            if A != B:  # Avoid trivial unit pairs
                for rule in rules:
                    if rule.left_side == B and (len(rule.right_side) != 1 or rule.right_side[0] not in variables):
                        new_rules.add(Rule(A, *rule.right_side))
        
        return new_rules
    
    def _convert_terminal_rules(self, rules: set[Rule], variables: set[str], 
                               alphabet: set[str]) -> tuple[set[Rule], set[str]]:
        """
        Convert rules with terminals in the right-hand side to CNF
        
        Parameters
        ----------
        rules : set(Rule)
            The grammar rules
        variables : set(str)
            The set of variables
        alphabet : set(str)
            The terminal alphabet
            
        Returns
        -------
        tuple
            (new_rules, new_variables)
        """
        new_rules = set()
        new_variables = variables.copy()
        
        # Map terminals to their corresponding variables
        terminal_vars = {}
        
        for rule in rules:
            if len(rule.right_side) == 1 and rule.right_side[0] in alphabet:
                # Keep terminal rules (A -> a)
                new_rules.add(rule)
            else:
                # For each terminal in a mixed or long rule
                new_right_side = []
                for symbol in rule.right_side:
                    if symbol in alphabet:
                        # Create a new variable for this terminal if needed
                        if symbol not in terminal_vars:
                            term_var = f"T_{symbol}"
                            while term_var in new_variables:
                                term_var += "'"
                            terminal_vars[symbol] = term_var
                            new_variables.add(term_var)
                            new_rules.add(Rule(term_var, symbol))
                        
                        # Replace the terminal with its variable
                        new_right_side.append(terminal_vars[symbol])
                    else:
                        new_right_side.append(symbol)
                
                new_rules.add(Rule(rule.left_side, *new_right_side))
        
        return new_rules, new_variables
    
    def _break_long_rules(self, rules: set[Rule], variables: set[str]) -> tuple[set[Rule], set[str]]:
        """
        Break rules with more than 2 symbols on the right-hand side
        
        Parameters
        ----------
        rules : set(Rule)
            The grammar rules
        variables : set(str)
            The set of variables
            
        Returns
        -------
        tuple
            (new_rules, new_variables)
        """
        new_rules = set()
        new_variables = variables.copy()
        
        for rule in rules:
            if len(rule.right_side) <= 2:
                # Keep rules that are already in CNF
                new_rules.add(rule)
            else:
                # Break the rule A -> X1 X2 ... Xn
                # into A -> X1 A1, A1 -> X2 A2, ..., An-2 -> Xn-1 Xn
                symbols = list(rule.right_side)
                left_side = rule.left_side
                
                for i in range(len(symbols) - 2):
                    new_var = f"{left_side}_{i+1}"
                    while new_var in new_variables:
                        new_var += "'"
                    new_variables.add(new_var)
                    
                    new_rules.add(Rule(left_side, symbols[0], new_var))
                    left_side = new_var
                    symbols = symbols[1:]
                
                # Add the final rule An-2 -> Xn-1 Xn
                new_rules.add(Rule(left_side, symbols[0], symbols[1]))
        
        return new_rules, new_variables

Example: Converting a Grammar to CNF

Let’s see this transformation in action with a simple morphological grammar. Consider a grammar that models the formation of English nouns with prefixes and suffixes:

grammar = ContextFreeGrammar(
    alphabet={'un', 'happy', 'ness', 'kind', 'ful', 'help'},
    variables={'Word', 'Adj', 'N', 'Prefix', 'Suffix'},
    rules={
        # Base words
        Rule('Adj', 'happy'),
        Rule('Adj', 'kind'),
        Rule('Adj', 'help', 'ful'),  # A ternary rule
        
        # Word formation rules
        Rule('Word', 'N'),
        
        # Derivational rules
        Rule('Adj', 'Prefix', 'Adj'),  # un + kind → unkind
        Rule('N', 'Adj', 'Suffix'),    # kind + ness → kindness
        Rule('N', 'Prefix', 'Adj', 'Suffix'),  # un + kind + ness → unkindness
        
        # Affixes
        Rule('Prefix', 'un'),
        Rule('Suffix', 'ness'),
        Rule('Suffix', 'ful')
    },
    start_variable='Word'
)

cnf_grammar = grammar.to_cnf()
print(cnf_grammar.rules())

In this example, we have a ternary rule Adj -> help ful and a rule with three symbols on the right-hand side: N -> Prefix Adj Suffix. Let’s see how these are transformed in CNF:

Before transformation:

Adj -> help ful
N -> Prefix Adj Suffix

After transformation:

T_help -> help
T_ful -> ful
Adj -> T_help T_ful
N -> Prefix N_1
N_1 -> Adj Suffix

The ternary rule Adj -> help ful is replaced by two binary rules: T_help -> help, T_ful -> ful, and Adj -> T_help T_ful. The long rule N -> Prefix Adj Suffix is replaced by two binary rules: N -> Prefix N_1 and N_1 -> Adj Suffix.

Real-World Example: CELEX Data

Let’s look at a real example from the CELEX database. One common pattern in English is the derivation of nouns from verbs, often involving several affixes. For instance, consider the word “unhappiness” which has the structure:

un + happy + ness

This structure can be represented in CELEX as:

(((un)[Prefix] (happy)[Adj])[Adj] (ness)[Suffix])[N]

Here’s how this rule might appear in our grammar before converting to CNF:

N -> Prefix Adj Suffix

After conversion to CNF, we would have:

N -> Prefix N_1
N_1 -> Adj Suffix

If we examine the CELEX database, we can find examples of words with even more complex morphological structures. For instance, the word “unbelievably” has the structure:

un + believe + able + ly

This would be represented in CELEX as:

((((un)[Prefix] (believe)[V])[V] (able)[Suffix])[Adj] (ly)[Suffix])[Adv]

This complex structure involves multiple levels of derivation, which can be represented by a set of rules in our grammar:

# Extract a rule from CELEX
celex_word = celex_parses.iloc[50859].parsed_word  # "unbelievably"
print(celex_word.rules)

This would output rules like:

Adv -> Adj Suffix
Adj -> V Suffix
V -> Prefix V
Prefix -> un
V -> believe
Suffix -> able
Suffix -> ly

Converting this grammar to CNF would result in a set of binary rules that preserve the hierarchical structure while conforming to the CNF constraints.

Advantages of CNF for Morphological Analysis

There are several advantages to using CNF for morphological analysis:

  1. Simplified parsing algorithms: Algorithms like CKY require the grammar to be in CNF
  2. Uniform structure: All rules have at most two symbols on the right-hand side
  3. Preservation of hierarchical structure: The transformation preserves the hierarchical structure of morphological analyses
  4. Efficient recognition: CNF enables more efficient recognition of complex morphological structures

Implementing the Validation for CNF

To ensure that a grammar is in CNF, we need to validate that each rule conforms to the CNF constraints. Let’s add a validation method to our ContextFreeGrammar class:

def _validate_cnf(self, rules: set[Rule], variables: set[str], alphabet: set[str]) -> bool:
    """
    Validate that a grammar is in Chomsky Normal Form
    
    Parameters
    ----------
    rules : set(Rule)
        The grammar rules
    variables : set(str)
        The set of variables
    alphabet : set(str)
        The terminal alphabet
        
    Returns
    -------
    bool
        True if the grammar is in CNF, False otherwise
    """
    for rule in rules:
        # Check that the left side is a variable
        if rule.left_side not in variables:
            return False
        
        # Check that each rule is in one of the two CNF forms
        if len(rule.right_side) == 1:
            # A -> a
            if rule.right_side[0] not in alphabet:
                return False
        elif len(rule.right_side) == 2:
            # A -> BC
            if not all(symbol in variables for symbol in rule.right_side):
                return False
        else:
            # Rules with 0 or more than 2 symbols are not in CNF
            return False
    
    return True

In the next section, we’ll explore how to use CFGs in CNF with the CKY parsing algorithm to efficiently parse and analyze morphological structures.