Normal Forms for Context-Free Grammars

In the previous sections, we introduced context-free grammars and operations on them. While CFGs are powerful tools for linguistic analysis, their general form can make certain computational tasks, such as parsing, inefficient. To address this, we can transform CFGs into equivalent but more constrained forms called normal forms.

In this section, we’ll focus on Chomsky Normal Form (CNF), which is particularly important for parsing algorithms like the CKY algorithm that we’ll see in the next section. Hopcroft and Ullman (1979) provides a standard reference for normal form transformations and their correctness proofs.

Normal Forms

A normal form for a grammar is a standardized format that restricts the form of the production rules while preserving the language generated by the grammar. The most common normal forms for CFGs are:

  1. Chomsky Normal Form (CNF) (Chomsky 1963): Each production rule must be in one of two forms:
    • \(A \rightarrow BC\) where \(B, C \in V\) (non-terminals)
    • \(A \rightarrow a\) where \(a \in \Sigma\) (terminals)
    • Additionally, there can be a rule \(S \rightarrow \epsilon\) if the language contains the empty string, but \(S\) cannot appear on the right side of any rule.
  2. Greibach Normal Form (GNF) (Greibach 1965): Each production rule must be in the form:
    • \(A \rightarrow aB_1B_2...B_n\) where \(a \in \Sigma\) and \(B_1, B_2, ..., B_n \in V\)
  3. Binary Normal Form (BNF): Each production rule must be in one of three forms:
    • \(A \rightarrow BC\) where \(B, C \in V\)
    • \(A \rightarrow a\) where \(a \in \Sigma\)
    • \(A \rightarrow \epsilon\)

For morphological analysis, CNF is particularly useful as it simplifies the parsing algorithms while maintaining the hierarchical structure of morphological analyses.

Chomsky Normal Form

Let’s now focus on transforming a CFG into Chomsky Normal Form. The transformation process involves several steps:

  1. Eliminate ε-rules (rules of the form \(A \rightarrow \epsilon\))
  2. Eliminate unit rules (rules of the form \(A \rightarrow B\) where \(B \in V\))
  3. Convert long right-hand sides (rules of the form \(A \rightarrow X_1 X_2 ... X_n\) where \(n > 2\))
  4. Handle terminal symbols in mixed rules (rules containing both terminals and non-terminals on the right-hand side)

Let’s implement this transformation as a method in our ContextFreeGrammar class:

from enum import Enum
from copy import deepcopy

class NormalForm(Enum):
    CNF = 0
    BNF = 1
    GNF = 2

class Rule(Rule):
    def validate(self, alphabet: set[str], variables: set[str],
                 normal_form: NormalForm | None = None) -> None:
        """Validate the rule against the grammar's alphabet and variables.

        Parameters
        ----------
        alphabet : set[str]
            The terminal symbols.
        variables : set[str]
            The nonterminal symbols.
        normal_form : NormalForm | None, default None
            The normal form to validate against.
        """
        if self._left_side not in variables:
            msg = "left side of rule must be a variable"
            raise ValueError(msg)

        acceptable = alphabet | variables | {''}
            
        if not all([s in acceptable for s in self._right_side]):
            msg = "right side of rule must contain only " + \
                  "a variable, a symbol, or the empty string"
            raise ValueError(msg)

        if normal_form == NormalForm.CNF:
            if len(self.right_side) == 1:
                if self.right_side[0] not in alphabet:
                    raise ValueError(f"{self} is not in CNF")
            elif len(self.right_side) == 2:
                if not all(s in variables for s in self.right_side):
                    raise ValueError(f"{self} is not in CNF")
            else:
                raise ValueError(f"{self} is not in CNF")

class ContextFreeGrammar(ContextFreeGrammar):

    def to_cnf(self) -> 'ContextFreeGrammar':
        """Convert the grammar to Chomsky Normal Form.

        Returns
        -------
        ContextFreeGrammar
            A new grammar in CNF that generates the same language.
        """
        # Create deep copies to avoid modifying the original grammar
        alphabet = self.alphabet.copy()
        variables = self.variables.copy()
        rules = self.rules().copy()
        start_variable = self.start_variable
        
        # Step 1: Add a new start variable if needed
        new_start = start_variable + "'"
        while new_start in variables:
            new_start += "'"
            
        variables.add(new_start)
        rules.add(Rule(new_start, start_variable))
        start_variable = new_start
        
        # Step 2: Eliminate ε-rules
        nullable_vars = self._find_nullable_variables(rules)
        rules = self._eliminate_epsilon_rules(rules, nullable_vars, variables)

        # If the original start variable was nullable, the language contains ε,
        # and we need to preserve that by adding S' → ε to the new grammar
        # (since _eliminate_epsilon_rules strips all ε-rules indiscriminately).
        if self.start_variable in nullable_vars:
            rules.add(Rule(start_variable))
        
        # Step 3: Eliminate unit rules
        rules = self._eliminate_unit_rules(rules, variables)
        
        # Step 4: Convert rules with terminals in the RHS
        rules, variables = self._convert_terminal_rules(rules, variables, alphabet)
        
        # Step 5: Break long rules
        rules, variables = self._break_long_rules(rules, variables)
        
        return ContextFreeGrammar(alphabet, variables, rules, start_variable)
    
    def _find_nullable_variables(self, rules: set[Rule]) -> set[str]:
        """Find all variables that can derive the empty string.

        Parameters
        ----------
        rules : set[Rule]
            The grammar rules.

        Returns
        -------
        set[str]
            The nullable variables.
        """
        nullable = set()
        
        # First pass: find variables that directly derive ε
        for rule in rules:
            if len(rule.right_side) == 0:  # A -> ε
                nullable.add(rule.left_side)
        
        # Fixed-point computation: find variables that indirectly derive ε
        while True:
            size_before = len(nullable)
            
            for rule in rules:
                if rule.left_side not in nullable:
                    # If all variables in the right-hand side are nullable,
                    # then the left-hand side is also nullable
                    if all(symbol in nullable for symbol in rule.right_side):
                        nullable.add(rule.left_side)
            
            if len(nullable) == size_before:
                break
        
        return nullable
    
    def _eliminate_epsilon_rules(self, rules: set[Rule], nullable: set[str],
                                 variables: set[str]) -> set[Rule]:
        """Eliminate epsilon-rules by adding new rules for nullable variables.

        Parameters
        ----------
        rules : set[Rule]
            The grammar rules.
        nullable : set[str]
            The nullable variables.
        variables : set[str]
            The set of variables.

        Returns
        -------
        set[Rule]
            The new rules without epsilon-rules.
        """
        new_rules = set()
        
        # Remove all ε-rules
        for rule in rules:
            if len(rule.right_side) > 0:  # Keep non-ε rules
                new_rules.add(rule)
        
        # For each rule, add versions that omit nullable variables
        for rule in rules:
            if len(rule.right_side) > 1:  # Rules with at least 2 symbols on RHS
                # Find positions of nullable variables in the right-hand side
                nullable_positions = [i for i, symbol in enumerate(rule.right_side) 
                                     if symbol in nullable]
                
                # Generate all subsets of these positions
                for k in range(1, len(nullable_positions) + 1):
                    for positions_to_remove in itertools.combinations(nullable_positions, k):
                        # Create a new right side by omitting the nullable variables
                        new_right_side = [rule.right_side[i] for i in range(len(rule.right_side)) 
                                         if i not in positions_to_remove]
                        
                        if new_right_side:  # Don't create new ε-rules
                            new_rules.add(Rule(rule.left_side, *new_right_side))
        
        return new_rules
    
    def _eliminate_unit_rules(self, rules: set[Rule], variables: set[str]) -> set[Rule]:
        """Eliminate unit rules of the form A -> B.

        Parameters
        ----------
        rules : set[Rule]
            The grammar rules.
        variables : set[str]
            The set of variables.

        Returns
        -------
        set[Rule]
            The new rules without unit rules.
        """
        # Compute unit pairs (A, B) such that A ->* B
        unit_pairs = {(var, var) for var in variables}  # Reflexive closure
        
        # Find all direct unit pairs
        direct_pairs = {(rule.left_side, rule.right_side[0]) 
                        for rule in rules 
                        if len(rule.right_side) == 1 and rule.right_side[0] in variables}
        
        # Compute the transitive closure
        while True:
            size_before = len(unit_pairs)
            
            for A, B in list(unit_pairs):
                for B2, C in direct_pairs:
                    if B == B2:
                        unit_pairs.add((A, C))
            
            if len(unit_pairs) == size_before:
                break
        
        # Replace unit rules
        new_rules = set()
        
        for rule in rules:
            # Keep non-unit rules
            if not (len(rule.right_side) == 1 and rule.right_side[0] in variables):
                new_rules.add(rule)
        
        # For each unit pair (A, B), add rules A -> γ for each rule B -> γ
        for A, B in unit_pairs:
            if A != B:  # Avoid trivial unit pairs
                for rule in rules:
                    if rule.left_side == B and (len(rule.right_side) != 1 or rule.right_side[0] not in variables):
                        new_rules.add(Rule(A, *rule.right_side))
        
        return new_rules
    
    def _convert_terminal_rules(self, rules: set[Rule], variables: set[str],
                               alphabet: set[str]) -> tuple[set[Rule], set[str]]:
        """Replace terminals in mixed rules with dedicated nonterminals.

        Parameters
        ----------
        rules : set[Rule]
            The grammar rules.
        variables : set[str]
            The set of variables.
        alphabet : set[str]
            The terminal alphabet.

        Returns
        -------
        tuple[set[Rule], set[str]]
            The updated rules and variables.
        """
        new_rules = set()
        new_variables = variables.copy()
        
        # Map terminals to their corresponding variables
        terminal_vars = {}
        
        for rule in rules:
            if len(rule.right_side) == 1 and rule.right_side[0] in alphabet:
                # Keep terminal rules (A -> a)
                new_rules.add(rule)
            else:
                # For each terminal in a mixed or long rule
                new_right_side = []
                for symbol in rule.right_side:
                    if symbol in alphabet:
                        # Create a new variable for this terminal if needed
                        if symbol not in terminal_vars:
                            term_var = f"T_{symbol}"
                            while term_var in new_variables:
                                term_var += "'"
                            terminal_vars[symbol] = term_var
                            new_variables.add(term_var)
                            new_rules.add(Rule(term_var, symbol))
                        
                        # Replace the terminal with its variable
                        new_right_side.append(terminal_vars[symbol])
                    else:
                        new_right_side.append(symbol)
                
                new_rules.add(Rule(rule.left_side, *new_right_side))
        
        return new_rules, new_variables
    
    def _break_long_rules(self, rules: set[Rule], variables: set[str]) -> tuple[set[Rule], set[str]]:
        """Break rules with more than 2 right-hand side symbols into binary rules.

        Parameters
        ----------
        rules : set[Rule]
            The grammar rules.
        variables : set[str]
            The set of variables.

        Returns
        -------
        tuple[set[Rule], set[str]]
            The updated rules and variables.
        """
        new_rules = set()
        new_variables = variables.copy()
        
        for rule in rules:
            if len(rule.right_side) <= 2:
                # Keep rules that are already in CNF
                new_rules.add(rule)
            else:
                # Break the rule A -> X1 X2 ... Xn
                # into A -> X1 A1, A1 -> X2 A2, ..., An-2 -> Xn-1 Xn
                symbols = list(rule.right_side)
                left_side = rule.left_side
                
                for i in range(len(symbols) - 2):
                    new_var = f"{left_side}_{i+1}"
                    while new_var in new_variables:
                        new_var += "'"
                    new_variables.add(new_var)
                    
                    new_rules.add(Rule(left_side, symbols[0], new_var))
                    left_side = new_var
                    symbols = symbols[1:]
                
                # Add the final rule An-2 -> Xn-1 Xn
                new_rules.add(Rule(left_side, symbols[0], symbols[1]))
        
        return new_rules, new_variables

Example: Converting a Grammar to CNF

Let’s see this transformation in action with a simple morphological grammar. Consider a grammar that models the formation of English nouns with prefixes and suffixes:

grammar = ContextFreeGrammar(
    alphabet={'un', 'happy', 'ness', 'kind', 'ful', 'help'},
    variables={'Word', 'Adj', 'N', 'Prefix', 'Suffix'},
    rules={
        # Base words
        Rule('Adj', 'happy'),
        Rule('Adj', 'kind'),
        Rule('Adj', 'help', 'ful'),  # A ternary rule
        
        # Word formation rules
        Rule('Word', 'N'),
        
        # Derivational rules
        Rule('Adj', 'Prefix', 'Adj'),  # un + kind → unkind
        Rule('N', 'Adj', 'Suffix'),    # kind + ness → kindness
        Rule('N', 'Prefix', 'Adj', 'Suffix'),  # un + kind + ness → unkindness
        
        # Affixes
        Rule('Prefix', 'un'),
        Rule('Suffix', 'ness'),
        Rule('Suffix', 'ful')
    },
    start_variable='Word'
)

cnf_grammar = grammar.to_cnf()
print(cnf_grammar.rules())

In this example, we have a ternary rule Adj -> help ful and a rule with three symbols on the right-hand side: N -> Prefix Adj Suffix. Let’s see how these are transformed in CNF:

Before transformation:

Adj -> help ful
N -> Prefix Adj Suffix

After transformation:

T_help -> help
T_ful -> ful
Adj -> T_help T_ful
N -> Prefix N_1
N_1 -> Adj Suffix

The ternary rule Adj -> help ful is replaced by two binary rules: T_help -> help, T_ful -> ful, and Adj -> T_help T_ful. The long rule N -> Prefix Adj Suffix is replaced by two binary rules: N -> Prefix N_1 and N_1 -> Adj Suffix.

Real-World Example: CELEX Data

Let’s look at a real example from the CELEX database. One common pattern in English is the derivation of nouns from verbs, often involving several affixes. For instance, consider the word “unhappiness” which has the structure:

un + happy + ness

This structure can be represented in CELEX as:

(((un)[Prefix] (happy)[Adj])[Adj] (ness)[Suffix])[N]

Here’s how this rule might appear in our grammar before converting to CNF:

N -> Prefix Adj Suffix

After conversion to CNF, we would have:

N -> Prefix N_1
N_1 -> Adj Suffix

If we examine the CELEX database, we can find examples of words with even more complex morphological structures. For instance, the word “unbelievably” has the structure:

un + believe + able + ly

This would be represented in CELEX as:

((((un)[Prefix] (believe)[V])[V] (able)[Suffix])[Adj] (ly)[Suffix])[Adv]

This complex structure involves multiple levels of derivation, which can be represented by a set of rules in our grammar:

# Extract a rule from CELEX
celex_word = celex_parses.iloc[50859].parsed_word  # "unbelievably"
print(celex_word.rules)

This would output rules like:

Adv -> Adj Suffix
Adj -> V Suffix
V -> Prefix V
Prefix -> un
V -> believe
Suffix -> able
Suffix -> ly

Converting this grammar to CNF would result in a set of binary rules that preserve the hierarchical structure while conforming to the CNF constraints.

Advantages of CNF for Morphological Analysis

There are several advantages to using CNF for morphological analysis:

  1. Simplified parsing algorithms: Algorithms like CKY require the grammar to be in CNF
  2. Uniform structure: All rules have at most two symbols on the right-hand side
  3. Preservation of hierarchical structure: The transformation preserves the hierarchical structure of morphological analyses
  4. Efficient recognition: CNF enables more efficient recognition of complex morphological structures

Implementing the Validation for CNF

To ensure that a grammar is in CNF, we need to validate that each rule conforms to the CNF constraints. Let’s add a validation method to our ContextFreeGrammar class:

def _validate_cnf(self, rules: set[Rule], variables: set[str], alphabet: set[str]) -> bool:
    """Validate that a grammar is in Chomsky Normal Form.

    Parameters
    ----------
    rules : set[Rule]
        The grammar rules.
    variables : set[str]
        The set of variables.
    alphabet : set[str]
        The terminal alphabet.

    Returns
    -------
    bool
        True if the grammar is in CNF.
    """
    for rule in rules:
        # Check that the left side is a variable
        if rule.left_side not in variables:
            return False
        
        # Check that each rule is in one of the two CNF forms
        if len(rule.right_side) == 1:
            # A -> a
            if rule.right_side[0] not in alphabet:
                return False
        elif len(rule.right_side) == 2:
            # A -> BC
            if not all(symbol in variables for symbol in rule.right_side):
                return False
        else:
            # Rules with 0 or more than 2 symbols are not in CNF
            return False
    
    return True

In the next section, we’ll explore how to use CFGs in CNF with the CKY parsing algorithm to efficiently parse and analyze morphological structures.

References

Chomsky, Noam. 1963. “Formal Properties of Grammars.” In Handbook of Mathematical Psychology, edited by R. Duncan Luce, Robert R. Bush, and Eugene Galanter, vol. 2. John Wiley & Sons.
Greibach, Sheila A. 1965. “A New Normal-Form Theorem for Context-Free Phrase Structure Grammars.” Journal of the ACM 12 (1): 42–52. https://doi.org/10.1145/321250.321254.
Hopcroft, John E., and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.