Context-free grammars for syntax

The context-free grammars we developed for morphology apply directly to syntax. Where a morphological grammar has rules like \(\text{N} \rightarrow \text{Adj}\;\text{Suffix}\) (with morphemes as terminals), a syntactic grammar has rules like \(\text{S} \rightarrow \text{NP}\;\text{VP}\) (with words as terminals). The formalism is the same 4-tuple \(G = (V, \Sigma, R, S)\); only the alphabet and the intended interpretation change.

A syntactic CFG

Here is a small syntactic grammar for a fragment of English:

\[\begin{align*} \text{S} &\rightarrow \text{NP}\;\text{VP} \\ \text{NP} &\rightarrow \text{D}\;\text{N} \\ \text{NP} &\rightarrow \text{D}\;\text{N}\;\text{PP} \\ \text{VP} &\rightarrow \text{V}\;\text{NP} \\ \text{VP} &\rightarrow \text{V}\;\text{NP}\;\text{PP} \\ \text{PP} &\rightarrow \text{P}\;\text{NP} \\ \text{D} &\rightarrow \text{the} \mid \text{a} \\ \text{N} &\rightarrow \text{greyhound} \mid \text{man} \mid \text{telescope} \\ \text{V} &\rightarrow \text{saw} \mid \text{loves} \\ \text{P} &\rightarrow \text{with} \mid \text{in} \end{align*}\]

This grammar generates sentences like *the greyhound saw a man* and *a man loves the greyhound*. The CKY and Earley algorithms we developed in the morphological module work on this grammar without modification.
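To make this concrete, here is a minimal sketch of CKY recognition over the grammar above. CKY requires Chomsky normal form, so the two ternary rules are binarized first; the helper nonterminals `X1` and `X2` are hypothetical symbols introduced only for binarization, not part of the grammar in the text.

```python
# The toy grammar above, binarized into Chomsky normal form so that
# plain CKY applies. X1 and X2 exist only for binarization.
LEXICON = {  # word -> categories
    "the": {"D"}, "a": {"D"},
    "greyhound": {"N"}, "man": {"N"}, "telescope": {"N"},
    "saw": {"V"}, "loves": {"V"},
    "with": {"P"}, "in": {"P"},
}
BINARY = [  # (parent, left child, right child)
    ("S", "NP", "VP"),
    ("NP", "D", "N"),
    ("NP", "D", "X1"), ("X1", "N", "PP"),   # NP -> D N PP
    ("VP", "V", "NP"),
    ("VP", "V", "X2"), ("X2", "NP", "PP"),  # VP -> V NP PP
    ("PP", "P", "NP"),
]

def cky_recognize(words, start="S"):
    n = len(words)
    # chart[i][j] holds every nonterminal that derives words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] |= LEXICON.get(w, set())
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY:
                    if left in chart[i][k] and right in chart[k][j]:
                        chart[i][j].add(parent)
    return start in chart[0][n]

print(cky_recognize("the greyhound saw a man".split()))    # True
print(cky_recognize("a man loves the greyhound".split()))  # True
print(cky_recognize("greyhound the saw man a".split()))    # False
```

The binarization is invisible from outside: any string the original grammar generates, the binarized grammar generates, and vice versa.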

Treebanks

Just as CELEX provides gold-standard morphological parses, treebanks provide gold-standard syntactic parses. The most widely used English treebank is the Penn Treebank (Marcus et al. 1993), which contains syntactic annotations for roughly 50,000 sentences from the Wall Street Journal. Penn Treebank annotations use a CFG-like bracketing format—very similar to the CELEX StrucLab format we saw in the morphological module—with nonterminal labels like S, NP, VP, PP and functional tags for subjects, objects, and other grammatical relations.
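Since PTB-style bracketings are parenthesized expressions, a small recursive reader suffices to load them. This is a sketch; the example tree uses our toy grammar's categories, not actual Penn Treebank tags.

```python
import re

def parse_bracketed(s):
    """Read a Penn-Treebank-style bracketing into (label, children) tuples.
    Leaves come out as (pos_tag, [word])."""
    tokens = re.findall(r"\(|\)|[^()\s]+", s)
    pos = 0
    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)
    return node()

# Toy-grammar categories, not actual PTB tags:
tree = parse_bracketed(
    "(S (NP (D the) (N greyhound)) (VP (V saw) (NP (D a) (N man))))"
)
print(tree[0])     # S
print(tree[1][0])  # ('NP', [('D', ['the']), ('N', ['greyhound'])])
```

Counting rule occurrences over such trees is exactly how rule probabilities are estimated from a treebank, which is where the next section's probabilistic grammars get their numbers.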

Treebanks serve the same role for syntax that CELEX served for morphology: they provide training data for probabilistic grammars and evaluation data for parsers.

Ambiguity

Syntactic ambiguity is far more pervasive than morphological ambiguity. Consider the sentence *I saw the man with the telescope*. This sentence has two parses:

  1. \([\text{S}\;[\text{NP}\;\text{I}]\;[\text{VP}\;[\text{V}\;\text{saw}]\;[\text{NP}\;[\text{D}\;\text{the}]\;[\text{N}\;\text{man}]\;[\text{PP}\;\text{with the telescope}]]]]\) — the man has the telescope
  2. \([\text{S}\;[\text{NP}\;\text{I}]\;[\text{VP}\;[\text{V}\;\text{saw}]\;[\text{NP}\;[\text{D}\;\text{the}]\;[\text{N}\;\text{man}]]\;[\text{PP}\;\text{with the telescope}]]]\) — I used the telescope to see

The number of parses grows exponentially with sentence length in the worst case. Probabilistic CFGs—the same formalism we developed in the morphological module—are one way to handle this: we assign probabilities to rules, and the most probable parse (or the \(k\) most probable parses) can be found efficiently.
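A Viterbi variant of CKY finds the most probable parse in the same cubic time. The sketch below extends the toy grammar with hypothetical rule probabilities (normalized per left-hand side, including a made-up lexical rule for the pronoun); `X1` and `X2` are helper nonterminals from binarizing the two ternary rules. The specific numbers are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical rule probabilities, normalized per left-hand side.
LEXICAL = {
    "i": [("NP", 0.1)],  # made-up lexical NP rule for the pronoun
    "the": [("D", 0.6)], "a": [("D", 0.4)],
    "man": [("N", 0.5)], "telescope": [("N", 0.5)],
    "saw": [("V", 1.0)],
    "with": [("P", 1.0)],
}
BINARY = [  # (parent, left, right, probability)
    ("S", "NP", "VP", 1.0),
    ("NP", "D", "N", 0.7),
    ("NP", "D", "X1", 0.2), ("X1", "N", "PP", 1.0),   # NP -> D N PP
    ("VP", "V", "NP", 0.7),
    ("VP", "V", "X2", 0.3), ("X2", "NP", "PP", 1.0),  # VP -> V NP PP
    ("PP", "P", "NP", 1.0),
]

def viterbi_cky(words, start="S"):
    n = len(words)
    best = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for cat, p in LEXICAL.get(w, ()):
            best[i][i + 1][cat] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right, p in BINARY:
                    score = p * best[i][k][left] * best[k][j][right]
                    if score > best[i][j][parent]:
                        best[i][j][parent] = score
                        back[i][j][parent] = (k, left, right)
    return best[0][n][start], back

prob, back = viterbi_cky("i saw the man with the telescope".split())
print(prob)
print(back[1][7]["VP"])  # (2, 'V', 'X2'): the VP -> V NP PP reading wins
```

With these probabilities the instrumental (VP-attachment) reading beats the NP-attachment reading; changing the relative weight of NP → D N PP versus VP → V NP PP flips the preference, which is exactly the kind of attachment decision a treebank-estimated PCFG makes from data.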

Limitations of CFGs for syntax

Despite their usefulness, CFGs cannot capture certain syntactic phenomena that occur in natural languages. The most famous example is due to Shieber (1985), who showed that the cross-serial dependencies in Swiss German subordinate clauses cannot be generated by any context-free grammar.

In Swiss German, a subordinate clause can contain a sequence of noun phrases followed by a sequence of verbs, where the \(i^{\text{th}}\) noun phrase is the argument of the \(i^{\text{th}}\) verb:

\[\text{...dass } \underbrace{\text{d'chind}}_{\text{NP}_1} \underbrace{\text{em Hans}}_{\text{NP}_2} \underbrace{\text{es huus}}_{\text{NP}_3} \underbrace{\text{lönd}}_{\text{V}_1} \underbrace{\text{hälfe}}_{\text{V}_2} \underbrace{\text{aastriiche}}_{\text{V}_3}\]

The dependency between \(\text{NP}_i\) and \(\text{V}_i\) crosses over the dependencies between \(\text{NP}_j\) and \(\text{V}_j\) for \(j > i\). The string language \(\{a^n b^m c^n d^m \mid n, m \geq 1\}\) captures the essential structure of this pattern, and it can be shown (using the pumping lemma for context-free languages) that this language is not context-free.
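To sketch the standard argument (assuming the pumping lemma for context-free languages with pumping length \(p\)): suppose the language were context-free and consider

\[z = a^p b^p c^p d^p = uvwxy, \qquad |vwx| \leq p, \quad |vx| \geq 1.\]

Because \(|vwx| \leq p\), the substring \(vwx\) touches at most two adjacent blocks of \(z\). Pumping to \(uv^2wx^2y\) therefore increases the number of \(a\)'s without matching the \(c\)'s, or the number of \(b\)'s without matching the \(d\)'s, so the pumped string falls outside the language, a contradiction.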

The Swiss German case is the classic example, but it is far from the only one. Kobele (2006) argues that the fundamental issue is copying: many syntactic constructions involve a dependency between two positions in a sentence—a “filler” and a “gap”—where the same structural content must be tracked at both positions, even though only one is phonologically realized. Wh-movement is a familiar case: in *what did you see __?*, the wh-phrase *what* appears at the front but is interpreted as the object of *see*. The gap and the filler must agree in category and selectional properties, and the distance between them is unbounded—you can embed arbitrarily many clauses between them (*what did you think that Mary said that John believes that she saw __?*).

Now, to be precise, a CFG can handle unbounded movement—but only by encoding the information about what has been extracted into the nonterminal labels themselves. You could, for instance, have nonterminals like \(\text{S/NP}\) (a sentence missing an NP), \(\text{VP/NP}\) (a VP missing an NP), and so on, with rules that thread the “slash” category through the tree until it reaches the gap site. This is exactly what the GPSG tradition did. The problem is that this is a brute-force encoding: from the perspective of the CFG formalism, the nonterminals \(\text{S}\), \(\text{S/NP}\), and \(\text{S/PP}\) are unrelated atomic symbols, just as the states of a finite state transducer are atomic even when we as analysts name them suggestively. The internal structure of the nonterminal label is invisible to the grammar; the CFG does not “know” that \(\text{S/NP}\) is an \(\text{S}\) with a missing \(\text{NP}\). It simply has more rules. And the number of rules you need grows multiplicatively with the number of extraction types, because you need a separate slash variant for every nonterminal that the extraction might pass through. This is a sign that the formalism is not expressing the generalization directly—it is simulating it.
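To see the multiplicative blow-up concretely, here is a naive, hypothetical generator that mechanically derives slashed variants of each base rule for a given extraction category: one variant per position where the gap could be realized or threaded through. It does no filtering, so it also emits variants a real GPSG grammar would constrain.

```python
# Toy base grammar; rule shapes follow the fragment above.
BASE_RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("D", "N")),
    ("NP", ("D", "N", "PP")),
    ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "PP")),
    ("PP", ("P", "NP")),
]
PHRASAL = {"S", "NP", "VP", "PP"}  # categories a gap can sit inside

def slash_rules(rules, gap):
    """Derive parent/gap variants: realize the gap as silence at a matching
    child, or thread the slash down through any phrasal child."""
    out = []
    for parent, children in rules:
        for i, child in enumerate(children):
            if child == gap:  # the gap is realized (as nothing) right here
                out.append((f"{parent}/{gap}", children[:i] + children[i + 1:]))
            if child in PHRASAL:  # the gap is somewhere inside this child
                slashed = children[:i] + (f"{child}/{gap}",) + children[i + 1:]
                out.append((f"{parent}/{gap}", slashed))
    return out

for gap in ("NP", "PP"):
    print(gap, len(slash_rules(BASE_RULES, gap)))
# NP 11
# PP 9
```

Six base rules become eleven more for NP extraction and nine more for PP extraction, and every further extraction category multiplies the grammar again—while nothing in the formalism relates `S/NP` back to `S`.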

Reduplication, which we already saw in the morphological module, is another instance of the copying problem: the language \(\{ww \mid w \in \Sigma^+\}\) requires reproducing a string exactly, which no finite state device can do (as we proved with the pumping lemma). Kobele’s point is that syntactic movement and morphological reduplication are both manifestations of copying, and that the mildly context-sensitive formalisms provide the right way to handle them—not by encoding dependencies into atomic labels but by building the dependency-tracking mechanism into the operations of the grammar itself.

This situation should be contrasted with the cross-serial dependencies in Swiss German, which I described above. Unbounded filler-gap dependencies can at least be simulated by a CFG, even if the simulation is clumsy. Unbounded cross-serial dependencies simply cannot be handled at all—no amount of category proliferation will help, because the pumping lemma rules out the relevant string languages entirely. A CFG with slash categories can track one unbounded dependency (by threading a slash feature through the tree), but it cannot simultaneously track two dependencies whose paths cross, which is what cross-serial dependencies require.

So we have two problems with different characters. For filler-gap dependencies, the issue is not that CFGs lack the generative capacity—they don’t—but that they express the generalization indirectly, through category encodings that are opaque to the formalism. For cross-serial dependencies, the issue is a limitation of generative capacity: the relevant patterns lie outside the context-free languages entirely. The mildly context-sensitive formalisms we’ll study next address both problems: they handle cross-serial dependencies because they are more powerful than CFGs, and they handle filler-gap dependencies natively—through operations like adjunction in TAGs, Move in Minimalist Grammars, or multi-component yields in MCFGs—rather than through category proliferation.

References

Kobele, Gregory M. 2006. “Generating Copies: An Investigation into Structural Identity in Language and Grammar.” PhD thesis, UCLA.
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. “Building a Large Annotated Corpus of English: The Penn Treebank.” Computational Linguistics 19 (2): 313–30. https://aclanthology.org/J93-2004.
Shieber, Stuart M. 1985. “Evidence Against the Context-Freeness of Natural Language.” Linguistics and Philosophy 8 (3): 333–43.