Overview

In the last submodule, we explored one way that we can compactly describe (a subset of) the languages on an alphabet $\Sigma$: the regular expressions $R(\Sigma) \subset [\Sigma \cup \{\epsilon, \emptyset, \cup, \circ, (, ), *\}]^*$, which are themselves strings over an extended alphabet. Regular expressions $\mathbf{r} \in R(\Sigma)$ evaluate to languages $\text{eval}(\mathbf{r}) \in 2^{\Sigma^*}$, and we refer to the image of $R(\Sigma)$ under $\text{eval}$–i.e. all languages describable by a regular expression on $\Sigma$–as the regular languages on $\Sigma$.

One hypothesis we will talk about a bit later in the course is that all phonological systems are subregular. Effectively, this hypothesis posits that all phonological grammars are describable by a strict subset of the regular expressions on the phones of a language.

One challenge to evaluating such hypotheses is that we don’t always know which strings are in a language. To see what I mean, let’s contrast two languages we might be interested in: (i) the set of all actual phonological forms you know for English morphemes; and (ii) the set of all possible phonological forms for English morphemes.

Because you presumably know a finite number of morphemes, the first language must be finite. Any finite language is describable by a regular expression: we can list out all the strings in the language; then, for each string, we can construct the regular expression that evaluates to the language containing only that string; and finally, we can combine those regular expressions using $\cup$.
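The construction above can be sketched in a few lines of Python. This is only an illustration under some assumptions: it uses Python's `re` syntax rather than the formal notation from the last submodule, with `|` standing in for $\cup$, and both the function name `finite_language_to_regex` and the toy lexicon are hypothetical.

```python
import re


def finite_language_to_regex(strings):
    """Compile a pattern that matches exactly the given finite set of strings."""
    # Step 1: list out the strings in the language (deduplicated, in a fixed order).
    listed = sorted(set(strings))
    # The empty language contains no strings; "(?!)" is a lookahead that never
    # succeeds, so it plays the role of the empty-set expression here.
    if not listed:
        return re.compile(r"(?!)")
    # Step 2: each string becomes an expression for the singleton language
    # containing only that string; re.escape keeps characters such as "*" literal.
    singletons = [f"(?:{re.escape(s)})" for s in listed]
    # Step 3: combine the singleton expressions with union, written "|" in `re`.
    return re.compile("|".join(singletons))


# A made-up three-form lexicon, used purely for illustration.
toy_lexicon = ["kæt", "dɔg", "z"]
pattern = finite_language_to_regex(toy_lexicon)

print(bool(pattern.fullmatch("kæt")))  # a listed form is in the language
print(bool(pattern.fullmatch("kæ")))   # an unlisted form is not
```

Using `fullmatch` rather than `search` matters here: membership in the language means the entire string is matched, not just some substring of it.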

The second language is almost certainly infinite. We can’t list out all the possible phonological forms for English morphemes, but we may still be able to construct a regular expression that evaluates to the relevant language. The challenge is that strings fall into three groups. For some strings, it’s clear they should be in this language: e.g. any of the strings in the set of all actual phonological forms you know for English morphemes has got to be in there. For other strings, it’s clear they shouldn’t be in this language: e.g. any string starting with a voiced stop followed by a voiceless sibilant, like $\text{bs}$ or $\text{gs}$. But for still other strings, it’s not clear whether they should be in the language or not: e.g. a string starting with a voiced stop followed by a nasal, like $\text{bn}$ or $\text{gn}$, seems better than the analogous string starting with a voiced stop followed by a voiceless sibilant.

How do we capture the graded nature of judgments about $\text{bs}$ and $\text{gs}$ in comparison to $\text{bn}$ and $\text{gn}$? In this module and the next, we’ll investigate this question from two different perspectives. One perspective models this gradation as a kind of uncertainty about what is in the set of all possible phonological forms for English morphemes, for which we will use the tools of probability theory. The other perspective, which is not mutually exclusive with the first, models this uncertainty as a function of a string’s similarity to known words. We will investigate this second perspective in the next module.
