Operations on Context-Free Grammars
In the previous section, we introduced the formal definition of context-free grammars. Now we’ll explore operations that can be performed on CFGs—union, concatenation, and Kleene closure—which allow us to build more complex grammars from simpler ones. This is useful for modeling morphological processes, where we often want to describe a complex word-formation system by composing smaller, more manageable grammars.
Union
Given two context-free grammars \(G_1 = (V_1, \Sigma_1, R_1, S_1)\) and \(G_2 = (V_2, \Sigma_2, R_2, S_2)\), the union grammar \(G = G_1 \cup G_2\) generates the language \(L(G) = L(G_1) \cup L(G_2)\).
We can construct \(G = (V, \Sigma, R, S)\) as follows:
- Create a new start symbol \(S\) not in \(V_1\) or \(V_2\)
- Set \(V = V_1 \cup V_2 \cup \{S\}\)
- Set \(\Sigma = \Sigma_1 \cup \Sigma_2\)
- Set \(R = R_1 \cup R_2 \cup \{S \rightarrow S_1, S \rightarrow S_2\}\)
(We assume throughout this section that \(V_1\) and \(V_2\) are disjoint—i.e. that the two grammars share no variable names. If they do, we would need to relabel one grammar’s variables before combining the rule sets, since otherwise a rule from \(G_1\) could interact with a variable from \(G_2\) in unintended ways.)
def union(self, other: 'ContextFreeGrammar') -> 'ContextFreeGrammar':
"""Construct the union of this grammar with another.
Parameters
----------
other : ContextFreeGrammar
The other grammar (must have disjoint variables).
Returns
-------
ContextFreeGrammar
A grammar generating L(self) | L(other).
"""
new_start = "S_union"
while new_start in self.variables or new_start in other.variables:
new_start += "_"
variables = self.variables | other.variables | {new_start}
alphabet = self.alphabet | other.alphabet
rules = self.rules() | other.rules()
rules |= {
Rule(new_start, self.start_variable),
Rule(new_start, other.start_variable)
}
return ContextFreeGrammar(alphabet, variables, rules, new_start)In morphology, union is useful for modeling alternative word formation processes. We might have one grammar for noun formation and another for adjective formation, and we can combine them into a single grammar that generates both.
Concatenation
Given two context-free grammars \(G_1 = (V_1, \Sigma_1, R_1, S_1)\) and \(G_2 = (V_2, \Sigma_2, R_2, S_2)\), the concatenation grammar \(G = G_1 \circ G_2\) generates the language \(L(G) = L(G_1) \circ L(G_2)\)—all strings \(w_1w_2\) where \(w_1 \in L(G_1)\) and \(w_2 \in L(G_2)\).
We can construct \(G = (V, \Sigma, R, S)\) as follows:
- Create a new start symbol \(S\) not in \(V_1\) or \(V_2\)
- Set \(V = V_1 \cup V_2 \cup \{S\}\)
- Set \(\Sigma = \Sigma_1 \cup \Sigma_2\)
- Set \(R = R_1 \cup R_2 \cup \{S \rightarrow S_1 S_2\}\)
As with union, we assume \(V_1\) and \(V_2\) are disjoint.
def concatenate(self, other: 'ContextFreeGrammar') -> 'ContextFreeGrammar':
"""Construct the concatenation of this grammar with another.
Parameters
----------
other : ContextFreeGrammar
The other grammar (must have disjoint variables).
Returns
-------
ContextFreeGrammar
A grammar generating L(self) . L(other).
"""
new_start = "S_concat"
while new_start in self.variables or new_start in other.variables:
new_start += "_"
variables = self.variables | other.variables | {new_start}
alphabet = self.alphabet | other.alphabet
rules = self.rules() | other.rules()
rules.add(Rule(new_start, self.start_variable, other.start_variable))
return ContextFreeGrammar(alphabet, variables, rules, new_start)Concatenation is the bread and butter of morphological modeling—it captures how morphemes combine. We might have a grammar for prefixes and another for stems, and concatenation lets us describe how they compose to form complex words.
Kleene Closure
Given a context-free grammar \(G = (V, \Sigma, R, S)\), the Kleene closure grammar \(G^* = (V', \Sigma, R', S')\) generates the language \(L(G^*) = L(G)^*\)—all strings formed by concatenating zero or more strings from \(L(G)\).
We can construct \(G^* = (V', \Sigma, R', S')\) as follows:
- Create a new start symbol \(S'\) not in \(V\)
- Set \(V' = V \cup \{S'\}\)
- Set \(R' = R \cup \{S' \rightarrow \epsilon, S' \rightarrow S S'\}\)
def kleene_star(self) -> 'ContextFreeGrammar':
"""Construct the Kleene closure of this grammar.
Returns
-------
ContextFreeGrammar
A grammar generating L(self)*.
"""
new_start = "S_star"
while new_start in self.variables:
new_start += "_"
variables = self.variables | {new_start}
alphabet = self.alphabet
rules = self.rules()
rules.add(Rule(new_start, self.start_variable, new_start)) # S' -> S S' (right-recursive)
rules.add(Rule(new_start)) # S' -> epsilon
return ContextFreeGrammar(alphabet, variables, rules, new_start)In morphology, Kleene closure is useful for modeling repetitive processes like reduplication or iterative affixation.
Intersection
Context-free languages are not closed under intersection: if \(L_1\) and \(L_2\) are both context-free, \(L_1 \cap L_2\) may not be. There is, however, an important special case—the intersection of a context-free language with a regular language is always context-free. This means we can use a finite state automaton to filter or constrain a CFG’s output, which is useful when we want a grammar to generate only strings matching some phonological pattern (described by a regular language) while retaining the hierarchical structure the CFG provides. We won’t implement intersection here, but the property is worth keeping in mind.
Complementation
Context-free languages are also not closed under complementation: given a context-free language \(L\), its complement \(\overline{L} = \Sigma^* - L\) is not necessarily context-free. This follows from the failure of closure under intersection, since if CFLs were closed under both union and complementation, we could construct intersections via De Morgan’s laws.
Example: Morphological Operations
We can demonstrate these operations with a small morphological example. We’ll create one grammar for adjectival morphology and another for nominal morphology, then combine them.
adj_grammar = ContextFreeGrammar(
alphabet={'happy', 'kind', 'un', 'ly'},
variables={'Word', 'Adj', 'Adv', 'Prefix'},
rules={
Rule('Adj', 'happy'),
Rule('Adj', 'kind'),
Rule('Word', 'Adj'),
Rule('Word', 'Adv'),
Rule('Adj', 'Prefix', 'Adj'), # un + happy -> unhappy
Rule('Adv', 'Adj', 'ly'), # happy + ly -> happily
Rule('Prefix', 'un')
},
start_variable='Word'
)
noun_grammar = ContextFreeGrammar(
alphabet={'happy', 'kind', 'ness', 'ity'},
variables={'Word', 'Adj', 'N', 'Suffix'},
rules={
Rule('Adj', 'happy'),
Rule('Adj', 'kind'),
Rule('Word', 'N'),
Rule('N', 'Adj', 'Suffix'), # happy + ness -> happiness
Rule('Suffix', 'ness'),
Rule('Suffix', 'ity')
},
start_variable='Word'
)
combined_grammar = adj_grammar.union(noun_grammar)The combined grammar can generate bare adjectives like “happy,” prefixed adjectives like “unhappy,” adverbs like “happily,” and derived nominals like “happiness” or “unhappiness”—all from the union of two relatively simple component grammars.
Implementing the Operations in the ContextFreeGrammar Class
We can add these operations as methods (along with operator overloads for convenience) to our ContextFreeGrammar class.
class ContextFreeGrammar(ContextFreeGrammar):
"""Extended grammar with closure operations."""
def __or__(self, other: 'ContextFreeGrammar') -> 'ContextFreeGrammar':
"""Union via the | operator."""
return self.union(other)
def __add__(self, other: 'ContextFreeGrammar') -> 'ContextFreeGrammar':
"""Concatenation via the + operator."""
return self.concatenate(other)
def union(self, other: 'ContextFreeGrammar') -> 'ContextFreeGrammar':
"""Construct the union of this grammar with another.
Parameters
----------
other : ContextFreeGrammar
The other grammar (must have disjoint variables).
Returns
-------
ContextFreeGrammar
A grammar generating L(self) | L(other).
"""
new_start = "S_union"
while new_start in self.variables or new_start in other.variables:
new_start += "_"
variables = self.variables | other.variables | {new_start}
alphabet = self.alphabet | other.alphabet
rules = self.rules() | other.rules()
rules |= {
Rule(new_start, self.start_variable),
Rule(new_start, other.start_variable)
}
return ContextFreeGrammar(alphabet, variables, rules, new_start)
def concatenate(self, other: 'ContextFreeGrammar') -> 'ContextFreeGrammar':
"""Construct the concatenation of this grammar with another.
Parameters
----------
other : ContextFreeGrammar
The other grammar (must have disjoint variables).
Returns
-------
ContextFreeGrammar
A grammar generating L(self) . L(other).
"""
new_start = "S_concat"
while new_start in self.variables or new_start in other.variables:
new_start += "_"
variables = self.variables | other.variables | {new_start}
alphabet = self.alphabet | other.alphabet
rules = self.rules() | other.rules()
rules.add(Rule(new_start, self.start_variable, other.start_variable))
return ContextFreeGrammar(alphabet, variables, rules, new_start)
def kleene_star(self) -> 'ContextFreeGrammar':
"""Construct the Kleene closure of this grammar.
Returns
-------
ContextFreeGrammar
A grammar generating L(self)*.
"""
new_start = "S_star"
while new_start in self.variables:
new_start += "_"
variables = self.variables | {new_start}
alphabet = self.alphabet
rules = self.rules()
rules.add(Rule(new_start, self.start_variable, new_start)) # S' -> S S' (right-recursive)
rules.add(Rule(new_start)) # S' -> epsilon
return ContextFreeGrammar(alphabet, variables, rules, new_start)Real-World Example: CELEX Morphological Data
The CELEX database provides morphological analyses of words in English, German, and Dutch. We can extract CFG rules directly from these analyses—treating each morphological parse as a derivation tree—and then use the operations above to compose grammars for particular affixation patterns. For instance, we might extract all rules involving the prefix “un-” into one grammar and all rules involving the suffix “-ness” into another, then take their union to get a grammar that generates both types of derived forms.
def extract_rules_from_celex(parsed_words) -> tuple[set[str], set[str], set[Rule]]:
"""Extract a CFG from a collection of CELEX parsed words.
Parameters
----------
parsed_words : Iterable[ParsedWord]
The parsed word objects.
Returns
-------
tuple[set[str], set[str], set[Rule]]
The alphabet, variables, and rules.
"""
alphabet = set()
variables = set()
rules = set()
for word in parsed_words:
for _, morpheme in word.flatten():
alphabet.add(morpheme)
for tag, _ in word.flatten():
variables.add(tag)
for rule in word.rules:
rules.add(rule)
return alphabet, variables, rules
alphabet, variables, rules = extract_rules_from_celex(celex_parses.parsed_word.values)
celex_grammar = ContextFreeGrammar(alphabet, variables, rules, "WORD")In the next section, we’ll explore how to convert context-free grammars to normal forms—particularly Chomsky Normal Form—which simplifies parsing algorithms.