Overview

We’ve now seen that the regular languages on an alphabet $\Sigma$ are closed under union, concatenation, Kleene star, complement, and intersection. This means that if we start with any finite collection of regular languages and combine them using these operations, we always get another regular language. It’s a powerful class—and as I mentioned in the section on generation, it seems likely that all phonological grammars of natural languages fall within it.

But recall the point I made in the section on the Generativist Conceit: Gold’s theorem tells us that the full class of regular languages is not learnable in the limit from positive data, because it contains infinite chains of languages $L_1 \subset L_2 \subset \ldots$ that allow the learner to be tricked. If phonological grammars are regular, and if children learn phonological grammars from positive data, then the class of phonological grammars cannot be the entire class of regular languages. It must be a proper subclass—one that avoids these infinite containment chains.

This observation raises a natural question: which subclass? Is there a principled way to identify the region of the regular languages where phonological patterns live?

Over the past several decades, a body of work in formal language theory and computational phonology has identified a collection of subclasses of the regular languages—known collectively as the subregular hierarchy—that have two properties relevant to this question: (i) they are learnable in the limit from positive data; and (ii) the phonological patterns of natural languages overwhelmingly fall into them (Heinz 2018; Rogers and Pullum 2011; Rogers et al. 2013; Lambert et al. 2021).

The hierarchy looks roughly like this:

graph BT
    SL["<b>SL</b><br/>Strictly Local"] --> LT["<b>LT</b><br/>Locally Testable"]
    SP["<b>SP</b><br/>Strictly Piecewise"] --> PT["<b>PT</b><br/>Piecewise Testable"]
    LT --> SF["<b>Star-Free</b>"]
    PT --> SF
    SF --> REG["<b>Regular</b>"]
    SL -.-> TSL["<b>TSL</b><br/>Tier-based<br/>Strictly Local"]
    LT -.-> TSL

    style SL fill:#d4edda,stroke:#28a745,stroke-width:2px
    style SP fill:#d4edda,stroke:#28a745,stroke-width:2px
    style TSL fill:#fff3cd,stroke:#ffc107,stroke-width:2px,stroke-dasharray: 5 5
    style LT fill:#d1ecf1,stroke:#17a2b8
    style PT fill:#d1ecf1,stroke:#17a2b8
    style SF fill:#e2e3e5,stroke:#6c757d
    style REG fill:#f8d7da,stroke:#dc3545

Solid arrows indicate strict containment ($\subset$); the dashed lines to TSL indicate that it crosscuts the local branch, containing languages that are not SL but are contained in LT. The green classes at the bottom are where most attested phonological patterns live.

SL (Strictly Local) and SP (Strictly Piecewise) are the smallest classes, capturing constraints on adjacent segments and on subsequences respectively.
LT (Locally Testable) and PT (Piecewise Testable) are somewhat larger: they can count local or piecewise patterns up to a threshold.
The Star-Free languages sit just below the full regular languages—they are exactly the languages definable without the Kleene star (using only union, concatenation, and complement).
TSL (Tier-based Strictly Local) crosscuts this picture: it projects the string onto a tier (a subset of the alphabet) and then applies SL constraints on the projected string. This allows it to handle long-distance consonant harmony and similar patterns that are not SL on the full alphabet but become SL once irrelevant segments are ignored (Heinz et al. 2011; Jardine and Heinz 2016).

Attested phonological patterns cluster at the bottom of this hierarchy, primarily in SL, SP, and TSL. Very few phonological patterns require even LT or PT, and essentially none require the full power of the regular languages. The convergence between the formal hierarchy and the typological distribution of phonological patterns suggests that the learnability constraint is doing real work in shaping what phonological systems look like.

In the following sections, I’ll define each of these classes, ground them in phonological examples, and show how they relate to each other and to the finite state automata we’ve been studying.

References

Heinz, Jeffrey. 2018. “The Computational Nature of Phonological Generalizations.” In Phonological Typology, edited by Larry M. Hyman and Frans Plank. De Gruyter Mouton. https://doi.org/doi:10.1515/9783110451931-005.

Heinz, Jeffrey, Chetan Rawal, and Herbert G. Tanner. 2011. “Tier-Based Strictly Local Constraints for Phonology.” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 58–64.

Jardine, Adam, and Jeffrey Heinz. 2016. “Learning Tier-Based Strictly 2-Local Languages.” Transactions of the Association for Computational Linguistics 4: 87–98. https://doi.org/10.1162/tacl\_a\_00085.

Lambert, Dakotah, Jonathan Rawski, and Jeffrey Heinz. 2021. “Typology Emerges from Simplicity in Representations and Learning.” J. Language Modelling 9 (1): 151–94. https://doi.org/10.15398/jlm.v9i1.262.

Rogers, James, Jeffrey Heinz, Margaret Fero, Jeremy Hurst, Dakotah Lambert, and Sean Wibel. 2013. “Cognitive and Sub-Regular Complexity.” Formal Grammar, Lecture notes in computer science, vol. 8036: 90–108. https://doi.org/10.1007/978-3-642-39998-5\_6.

Rogers, James, and Geoffrey K. Pullum. 2011. “Aural Pattern Recognition Experiments and the Subregular Hierarchy.” J. Logic, Language and Information 20 (3): 329–42. https://doi.org/10.1007/s10849-011-9140-2.

--- title: Overview bibliography: ../../references.bib --- We've now seen that the regular languages on an alphabet $\Sigma$ are closed under union, concatenation, Kleene star, complement, and intersection. This means that if we start with any finite collection of regular languages and combine them using these operations, we always get another regular language. It's a powerful class—and as I mentioned in the [section on generation](../generation.qmd), it seems likely that all phonological grammars of natural languages fall within it. But recall the point I made in the [section on the Generativist Conceit](../the-generativist-conceit.qmd): Gold's theorem tells us that the full class of regular languages is not learnable in the limit from positive data, because it contains infinite chains of languages $L_1 \subset L_2 \subset \ldots$ that allow the learner to be tricked. If phonological grammars are regular, and if children learn phonological grammars from positive data, then the class of phonological grammars cannot be the *entire* class of regular languages. It must be a proper subclass—one that avoids these infinite containment chains. This observation raises a natural question: which subclass? Is there a principled way to identify the region of the regular languages where phonological patterns live? Over the past several decades, a body of work in formal language theory and computational phonology has identified a collection of subclasses of the regular languages—known collectively as the *subregular hierarchy*—that have two properties relevant to this question: (i) they are learnable in the limit from positive data; and (ii) the phonological patterns of natural languages overwhelmingly fall into them [@heinz_computational_2018; @rogers2011aural; @rogers2013cognitive; @lambert2021typology]. The hierarchy looks roughly like this: ```{mermaid} %%| fig-width: 7 graph BT SL["SL Strictly Local"] --> LT["LT Locally Testable"] SP["SP Strictly Piecewise"] --> PT["PT Piecewise Testable"] LT --> SF["Star-Free"] PT --> SF SF --> REG["Regular"] SL -.-> TSL["TSL Tier-based Strictly Local"] LT -.-> TSL style SL fill:#d4edda,stroke:#28a745,stroke-width:2px style SP fill:#d4edda,stroke:#28a745,stroke-width:2px style TSL fill:#fff3cd,stroke:#ffc107,stroke-width:2px,stroke-dasharray: 5 5 style LT fill:#d1ecf1,stroke:#17a2b8 style PT fill:#d1ecf1,stroke:#17a2b8 style SF fill:#e2e3e5,stroke:#6c757d style REG fill:#f8d7da,stroke:#dc3545 ``` Solid arrows indicate strict containment ($\subset$); the dashed lines to TSL indicate that it crosscuts the local branch, containing languages that are not SL but are contained in LT. The green classes at the bottom are where most attested phonological patterns live. - **SL** ([Strictly Local](local-languages.qmd)) and **SP** ([Strictly Piecewise](piecewise-languages.qmd)) are the smallest classes, capturing constraints on adjacent segments and on subsequences respectively. - **LT** (Locally Testable) and **PT** (Piecewise Testable) are somewhat larger: they can count local or piecewise patterns up to a threshold. - The **[Star-Free](star-free-and-the-full-picture.qmd)** languages sit just below the full regular languages—they are exactly the languages definable without the Kleene star (using only union, concatenation, and complement). - **TSL** (Tier-based Strictly Local) crosscuts this picture: it projects the string onto a *tier* (a subset of the alphabet) and then applies SL constraints on the projected string. This allows it to handle long-distance consonant harmony and similar patterns that are not SL on the full alphabet but become SL once irrelevant segments are ignored [@heinz2011tier; @jardine2016learning]. Attested phonological patterns cluster at the bottom of this hierarchy, primarily in SL, SP, and TSL. Very few phonological patterns require even LT or PT, and essentially none require the full power of the regular languages. The convergence between the formal hierarchy and the typological distribution of phonological patterns suggests that the learnability constraint is doing real work in shaping what phonological systems look like. In the following sections, I'll define each of these classes, ground them in phonological examples, and show how they relate to each other and to the finite state automata we've been studying.