The Generativist Conceit

Why would we want to do all this? In the first lecture, I discussed what I called the Generativist Conceit: that we can understand natural language by studying the class of grammars that generate all and only the natural languages. Though certain strong ideas associated with the Generativist Conceit are controversial, as I have stated it (relatively weakly) here, it is relatively uncontroversial: if our aim is to provide a scientific theory that characterizes natural language (whatever that might be: a collection of actual utterances, an aspect of human psychology, or some abstract object with an existence independent of human minds), we at least need some method for encoding that theory in a way that makes testable predictions. Grammars, as I have described them, are by definition such a method.

Another relatively uncontroversial observation that drives the Generativist Conceit is that children must be able to start without a language and come to know one, and the one they come to know is determined in some important way by their environment (rather than their genetics) in a finite amount of time on the basis of a finite number of observations of actual language use. This observation is important because it means that natural languages must be learnable from a finite number of examples. Grammars are usually finite objects that (may) describe countably infinite languages, and so they might be a useful tool for describing the process by which a child comes to know a language.

More formally, we may say that children have some method \(c\) for mapping sequences of strings (or texts) \(\mathbf{t} \in \Sigma^{**}\) along with some extralinguistic experience \(E \in \mathcal{E}\) into a grammar \(G \in \mathcal{G}\): \[c: \Sigma^{**} \times \mathcal{E} \rightarrow \mathcal{G}\]. A child’s extralinguistic experience is nontrivial to represent, so to make the problem more tractable, we will often ignore that part.

To demonstrate that the problem of learning a grammar from a text is far from trivial, consider the following formal model, known as language identification in the limit, due to Gold (1967).

We start from the idea that a learner should be modeled as a function \(f: \Sigma^{**} \rightarrow 2^{\Sigma^*}\) that maps finite prefixes of an environment \(E = \langle \boldsymbol\sigma_1, \boldsymbol\sigma_2, \ldots \rangle \in \Sigma^{**}\) to a language \(L\), where an environment (or text) for \(L\) is a sequence of strings in \(L\) in which every string of \(L\) eventually appears. We will say that a learner \(f\) learns a language \(L\) from an environment \(\mathbf{e} \in L^*\) if there is some finite point after which \(f\)'s output on every prefix of \(\mathbf{e}\) is \(L\), and that it learns \(L\) simpliciter if it learns \(L\) from every environment for \(L\). That is, \(f\) learns \(L\) if it doesn’t matter which sequence of examples you give it: \(f \text{ learns } L \leftrightarrow \forall \mathbf{e} \in L^*: f \text{ learns } L \text{ from } \mathbf{e}\). A language family \(\mathcal{L}\) is learnable if there exists a single learner that learns every language in the family: \(f \text{ learns } \mathcal{L} \leftrightarrow \forall L \in \mathcal{L}: f \text{ learns } L\).
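To make the definition concrete, here is a minimal Python sketch (function and variable names are my own) of the classic positive example: a learner that identifies the class of all finite languages in the limit by conjecturing exactly the set of strings it has seen so far.

```python
def finite_language_learner(text_prefix):
    """Conjecture a language from a finite prefix of an environment.

    Strategy: guess exactly the set of strings observed so far. For any
    finite language L and any environment for L (a sequence in which every
    string of L eventually appears), the guess equals L from some finite
    point on -- so this learner identifies every finite language in the limit.
    """
    return frozenset(text_prefix)

# A text for the finite language {"a", "ab", "abb"}; after the third
# string the conjecture is correct and never changes again.
environment = ["a", "ab", "abb", "a", "ab", "abb"]
guesses = [finite_language_learner(environment[:i + 1])
           for i in range(len(environment))]
```

Note how convergence, not any single output, is what matters: early guesses are wrong, but the sequence of guesses stabilizes on the target language.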

Gold (1967) showed that if a language family \(\mathcal{L}\) contains languages \(L_1, L_2, \ldots, L_\infty\) such that \(L_1 \subset L_2 \subset \ldots \subset L_\infty = \bigcup_{i=1}^\infty L_i\), then it is not learnable. Here’s the idea behind the proof: suppose \(f\) is a learner that can learn each of \(L_1, L_2, \ldots\); we’ll show that it cannot also learn \(L_\infty\), by constructing an environment for \(L_\infty\) that “tricks” \(f\).

First, since \(f\) learns each \(L_i\), we can choose environments \(E_1, E_2, \ldots\) such that \(f\) converges to \(L_i\) on \(E_i\). Next, we construct an environment \(E_\infty\) for \(L_\infty\) inductively as follows:

  1. Present \(f\) with \(E_1\) until it outputs \(L_{1}\).
  2. Switch to presenting \(f\) with an environment that alternates the rest of \(E_1\) with the entirety of \(E_2\). Since \(L_1 \subset L_2\), this interleaved environment is still an environment for \(L_2\), so \(f\) must eventually output \(L_2\).
  3. Switch to alternately presenting the rest of \(E_1 \circ E_2\) and the entirety of \(E_3\), and so on for all \(i\).

By construction, the resulting environment \(E\) contains the entirety of \(E_{1}, E_{2}, \ldots\), and thus every string in \(L_\infty\), so it is an environment for \(L_\infty\). But since the learner always switches its conjecture to \(L_{i}\) for some finite \(i\), it changes its mind infinitely often and never converges to \(L_{\infty}\).

The key insight is that the problem arises from the structure of the hypothesis space, not from the complexity of any individual language. Any single language \(L_i\) in the chain is perfectly learnable on its own. But as soon as the learner must also consider the larger languages \(L_{i+1}, L_{i+2}, \ldots\) as possible hypotheses, it faces an underdetermination problem: every finite sample from \(L_\infty\) is also consistent with some smaller \(L_i\), and the learner has no positive evidence—no string it could observe—that would force it to move from \(L_i\) to \(L_\infty\).1

This has a direct consequence for phonology. The class of regular languages—exactly the languages recognizable by finite state automata—contains infinite containment chains. For instance, the set of languages “all strings over \(\Sigma\) of length at most \(n\)” forms a chain \(L_1 \subset L_2 \subset \ldots \subset \Sigma^*\). So the full class of regular languages is not learnable in Gold’s sense.
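A small simulation makes the failure vivid for this chain. The learner below (a hypothetical one of my own design, not the arbitrary learner of Gold’s proof) conservatively conjectures \(L_n\) for the largest string length \(n\) it has seen. It identifies each individual \(L_n\) in the limit, but on any environment for \(\Sigma^*\), which must eventually contain strings of every length, its conjecture grows without bound and never stabilizes.

```python
def max_length_learner(text_prefix):
    """Conjecture L_n = "all strings of length <= n", represented by the
    bound n, where n is the length of the longest string seen so far."""
    return max(len(s) for s in text_prefix)

# A prefix of an environment for Sigma^* over the alphabet {"a"}: any
# such environment must contain strings of unboundedly many lengths,
# so the conjectured bound keeps growing.
environment = ["a" * n for n in range(1, 8)]
guesses = [max_length_learner(environment[:i + 1])
           for i in range(len(environment))]
# The sequence of conjectures 1, 2, 3, ... never converges.
```

Gold’s theorem says something stronger than this demo: no learner, however clever, can do better on a family containing such a chain together with its union.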

The hope, and the motivation for the subregular hierarchy we’ll study next, is that the phonological patterns of natural languages don’t range over the full class of regular languages. They cluster in much smaller subclasses—Strictly Local, Strictly Piecewise, Tier-based Strictly Local—that may not have the problematic containment structure. If natural languages come from a sufficiently restricted subclass, the learnability problem may dissolve: the learner doesn’t need to search all of \(\text{REG}\), only the much smaller set of languages in the subclass.2 This is what makes the subregular hierarchy linguistically interesting: it’s not just a classification of formal languages, but a hypothesis about what makes phonological learning tractable.
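To illustrate why restriction helps, here is a sketch of the standard learner for Strictly 2-Local languages (following the idea in Heinz 2010, though the code and names here are my own): collect the set of attested bigrams, with word boundaries marked, and conjecture the language of all strings whose bigrams are attested. Because there are only finitely many possible bigrams over a fixed alphabet, the conjectured grammar can grow only a bounded number of times, so the learner converges on any text.

```python
def observed_bigrams(text_prefix):
    """Conjecture an SL-2 grammar: the set of attested 2-factors,
    with '>' and '<' marking word start and end."""
    grammar = set()
    for w in text_prefix:
        padded = ">" + w + "<"
        for i in range(len(padded) - 1):
            grammar.add(padded[i:i + 2])
    return grammar

def generates(grammar, w):
    """A string is in the conjectured language iff every one of its
    2-factors (including boundary bigrams) is licensed by the grammar."""
    padded = ">" + w + "<"
    return all(padded[i:i + 2] in grammar for i in range(len(padded) - 1))

# Toy data from a language with no "bb" substring. The learner never
# sees the bigram "bb", so its conjecture excludes, e.g., "abba".
grammar = observed_bigrams(["ab", "aba", "ba", "a"])
```

The contrast with the Gold construction is that the hypothesis space here is the finite lattice of bigram sets, which contains no infinite ascending chain for an adversarial environment to exploit.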

References

Gold, E. Mark. 1967. “Language Identification in the Limit.” Information and Control 10 (5): 447–74.
Heinz, Jeffrey. 2010. “Learning Long-Distance Phonotactics.” Linguistic Inquiry 41 (4): 623–61. https://doi.org/10.1162/LING_a_00015.
Heinz, Jeffrey, and James Rogers. 2013. “Learning Subregular Classes of Languages with Factored Deterministic Automata.” Proceedings of the 13th Meeting on Mathematics of Language (MOL), 64–71.
Jardine, Adam, and Jeffrey Heinz. 2016. “Learning Tier-Based Strictly 2-Local Languages.” Transactions of the Association for Computational Linguistics 4: 87–98. https://doi.org/10.1162/tacl_a_00085.

Footnotes

  1. Gold’s result assumes learning from positive examples only—the learner sees strings that are in the language but never sees explicit evidence that a string is not in the language. This is usually taken as a reasonable model of the child’s input, since children are rarely told “this is not a sentence.” If we allow negative examples, the picture changes considerably.↩︎

  2. Heinz (2010) shows that the Strictly Local and Strictly Piecewise classes are indeed identifiable in the limit from positive data, vindicating this hope for the lowest levels of the hierarchy. See Heinz and Rogers (2013) and Jardine and Heinz (2016) for extensions to other subregular classes.↩︎