---
title: Star-free languages and the full picture
bibliography: ../../references.bib
jupyter: python3
---
We've now seen two branches of the subregular hierarchy: the local side (SL $\subset$ LT) and the piecewise side (SP $\subset$ PT), along with TSL crosscutting both. These branches converge at a class called the *star-free languages*, which sits just below the full regular languages and just above both LT and PT.
## Star-free languages
A language is *star-free* if it can be described by a generalized regular expression that uses union, concatenation, and *complement* but not Kleene star [@mcnaughton1971counter; @schutzenberger1965finite]. That is, we replace the $*$ operation with $\overline{\cdot}$ (complement with respect to $\Sigma^*$).
This might sound like a minor change, but it turns out to carve out a very specific subclass of the regular languages. The reason is that Kleene star is what gives regular expressions the ability to count modulo a number. For instance, the language $\{w \in \{a, b\}^* \mid |w|_a \equiv 0 \pmod{2}\}$—strings with an even number of $a$'s—is regular (you can build a two-state DFA for it) but not star-free.
We can build star-free expressions using `arcweight` by constructing automata with `union`, `concat`, and `difference` (complement) only.
```{python}
#| code-fold: true
#| code-summary: Build a star-free language using only union, concatenation, and complement
import arcweight
# Alphabet: {a, b}
syms = arcweight.SymbolTable()
a = syms.add_symbol('a')
b = syms.add_symbol('b')
# Sigma-star: accepts all strings over {a, b}
sigma_star = arcweight.VectorFst()
s = sigma_star.add_state()
sigma_star.set_start(s)
sigma_star.set_final(s, 0.0)
sigma_star.add_arc(s, a, a, 0.0, s)
sigma_star.add_arc(s, b, b, 0.0, s)
# Single-symbol FSAs
fst_a = arcweight.VectorFst()
s0 = fst_a.add_state()
s1 = fst_a.add_state()
fst_a.set_start(s0)
fst_a.set_final(s1, 0.0)
fst_a.add_arc(s0, a, a, 0.0, s1)
fst_b = arcweight.VectorFst()
s0 = fst_b.add_state()
s1 = fst_b.add_state()
fst_b.set_start(s0)
fst_b.set_final(s1, 0.0)
fst_b.add_arc(s0, b, b, 0.0, s1)
# Goal: a star-free expression for "strings that do NOT contain 'ab'".
# Star-free expressions may use union, concatenation, and complement,
# but never Kleene star. Σ* itself is star-free, since Σ* = complement(∅):
empty = arcweight.VectorFst()
s = empty.add_state()
empty.set_start(s)  # no final state, so this FSA accepts nothing
# We already built the automaton for Σ* above; conceptually it stands for
# complement(∅), and we implement complement(L) as difference(sigma_star, L).
# "Strings that contain 'ab' as a substring" = Σ* · {a} · {b} · Σ*:
# concatenation only, with the star-free Σ* on both sides.
ab_substr = arcweight.concat(arcweight.concat(sigma_star, fst_a), arcweight.concat(fst_b, sigma_star))
# "Strings that do NOT contain [ab]" (complement of the above)
no_ab = arcweight.difference(sigma_star, ab_substr)
no_ab = arcweight.minimize(no_ab)
print(f"no_ab states: {no_ab.num_states()}")
```
### The Schützenberger and McNaughton-Papert theorems
The star-free languages have a beautiful algebraic characterization due to @schutzenberger1965finite: a regular language is star-free if and only if its [syntactic monoid](https://en.wikipedia.org/wiki/Syntactic_monoid) is *aperiodic*—meaning it contains no nontrivial cyclic groups. The equivalent automaton-theoretic statement, due to @mcnaughton1971counter, is that a regular language is star-free if and only if it is recognizable by a [*counter-free* automaton](https://en.wikipedia.org/wiki/Aperiodic_finite_automaton): a DFA in which there is no state $q$, nonempty string $w$, and integer $n > 1$ such that $\delta(q, w^n) = q$ but $\delta(q, w) \neq q$. In other words, if iterating some string $w$ eventually cycles a state back to itself, then a *single* application of $w$ must already return to that state—the automaton cannot count repetitions of $w$.
There is also a logical characterization: the star-free languages are exactly the languages definable in first-order logic with the linear order and successor predicates on string positions. This connects the subregular hierarchy to a broader program of relating language classes to logical expressiveness.
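For instance, over $\{a, b\}$ the star-free language "strings containing $ab$ as a substring" is defined by the first-order sentence

$$\exists x \, \exists y \, \big( S(x, y) \wedge a(x) \wedge b(y) \big),$$

where $S$ is the successor relation on positions and $a(x)$, $b(y)$ say that those positions carry the labels $a$ and $b$. By contrast, "an even number of $a$'s" is not definable over these predicates with first-order quantifiers alone.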
I won't prove these equivalences here, but they are worth knowing about because they illuminate *why* star-free languages are a natural boundary: the languages above them (but still regular) are exactly those that require modular counting, and there is no known phonological pattern that requires counting modulo a number.
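This counting boundary can be made concrete. The sketch below (plain Python, independent of `arcweight`; the transformation encoding and function names are ours) builds the transition monoid of a DFA from its per-symbol state transformations and tests aperiodicity, contrasting the even-$a$ counter with the star-free "contains $ab$" language:

```{python}
def compose(f, g):
    """Apply transformation f, then g (states are 0..n-1; f is a tuple)."""
    return tuple(g[q] for q in f)

def transition_monoid(generators, n_states):
    """Close the per-symbol state transformations under composition."""
    identity = tuple(range(n_states))
    monoid = {identity}
    frontier = [identity]
    while frontier:
        f = frontier.pop()
        for g in generators.values():
            h = compose(f, g)
            if h not in monoid:
                monoid.add(h)
                frontier.append(h)
    return monoid

def is_aperiodic(monoid):
    """Aperiodic iff every element f eventually satisfies f^k = f^(k+1)."""
    for f in monoid:
        seen, power = [], f
        while power not in seen:
            seen.append(power)
            power = compose(power, f)
        if len(seen) - seen.index(power) > 1:  # eventual cycle longer than 1
            return False
    return True

# "Even number of a's": states 0 (even) and 1 (odd); reading 'a' swaps them.
even_a = {'a': (1, 0), 'b': (0, 1)}
# "Contains ab": states 0 (start), 1 (just saw 'a'), 2 (accepting sink).
contains_ab = {'a': (1, 1, 2), 'b': (0, 2, 2)}

print(is_aperiodic(transition_monoid(even_a, 2)))       # False: 'a' is a 2-cycle
print(is_aperiodic(transition_monoid(contains_ab, 3)))  # True
```

The even-$a$ monoid contains the swap $(1, 0)$, whose powers cycle with period 2: a copy of $\mathbb{Z}/2\mathbb{Z}$, exactly the nontrivial cyclic group that aperiodicity forbids.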
## The complete hierarchy
We can now assemble the full picture. The following containments are all proper:
$$\text{SL}_k \subset \text{SL}_{k+1} \subset \ldots \subset \text{SL} \subset \text{LT} \subset \text{Star-Free} \subset \text{Regular}$$
$$\text{SP}_k \subset \text{SP}_{k+1} \subset \ldots \subset \text{SP} \subset \text{PT} \subset \text{Star-Free} \subset \text{Regular}$$
where SL $= \bigcup_k \text{SL}_k$ and SP $= \bigcup_k \text{SP}_k$. TSL sits to the side: it is not contained in SL or SP, but every TSL language is star-free. The relationships among SL, SP, LT, PT, and TSL are not simple containments—there are languages in each that are not in the others—but all of these classes live within the star-free languages, well below the full regular class.
::: {.callout-caution}
## Question
The language "strings over $\{a, b\}$ with an even number of $a$'s" is regular. Is it star-free?
:::
::: {.callout-tip collapse=true}
## Answer
No. This language requires counting $a$'s modulo 2, which requires a cyclic group in the syntactic monoid. The minimal DFA has two states that cycle with each $a$, making it a counter. Since the syntactic monoid is not aperiodic, the language is not star-free.
:::
## The empirical picture
As a large body of work documents [@heinz_computational_2018; @rogers2013cognitive; @lambert2021typology; @graf2017power], attested phonological patterns overwhelmingly fall into the lowest levels of this hierarchy:
- Most phonotactic constraints (restrictions on what sequences of sounds can appear) are SL—they depend on short contiguous contexts.
- Long-distance phonological processes like harmony are typically SP or TSL.
- Very few phonological patterns require even LT or PT.
- No known phonological pattern requires the full power of the regular languages.
There is nothing in the definition of a DFA that prevents a phonological pattern from using modular counting or other non-star-free capabilities. That phonological patterns don't use them is an empirical observation about natural language, not a logical necessity.
## Learnability revisited
One candidate explanation for why phonological patterns cluster in the subregular hierarchy comes from learnability. Recall [Gold's result](../the-generativist-conceit.qmd): the full class of regular languages is not learnable in the limit from positive data, because it contains infinite chains $L_1 \subset L_2 \subset \ldots$ that allow the learner to be tricked. If the subregular classes avoid this problem—if they *are* learnable from positive data—then the clustering of phonological patterns in these classes would have a principled explanation: natural languages live where they do because those are the languages that children can learn.
This is not just a hope. @heinz2010learning showed that SL and SP languages are learnable in the limit from positive data, and the algorithms that do it are remarkably simple.
### Learning SL languages
Recall from the [section on SL languages](local-languages.qmd) that an SL-$k$ language is determined by a set of forbidden $k$-grams $\mathcal{F}$: a string is in the language if and only if none of its $k$-grams (in the boundary-augmented string) appear in $\mathcal{F}$. The learning algorithm exploits this directly.
Given a parameter $k$ and a stream of positive examples $w_1, w_2, \ldots$ from an unknown SL-$k$ language $L$:
1. Initialize $\mathcal{A} = \emptyset$ (the set of *attested* $k$-grams).
2. For each new example $w_i$, extract all $k$-grams from the augmented string $\rtimes^{k-1} w_i \ltimes^{k-1}$ and add them to $\mathcal{A}$.
3. At any point, the learner's current hypothesis is the SL-$k$ language with forbidden factors $\mathcal{F} = (\Sigma \cup \{\rtimes, \ltimes\})^k \setminus \mathcal{A}$—that is, every $k$-gram that has *not yet been observed* is assumed forbidden.
The learner starts maximally restrictive (everything is forbidden) and relaxes constraints as it sees evidence. Since the alphabet is finite, there are only finitely many possible $k$-grams. After enough examples, every $k$-gram that is licit in $L$ will have been observed at least once, and the learner's hypothesis will stabilize at $L$.^[More precisely: since $L$ is SL-$k$, its set of permitted $k$-grams is exactly the set of $k$-grams that appear in strings in $L$. Any [text](../the-generativist-conceit.qmd) for $L$—any sequence that eventually includes every string in $L$—will therefore eventually present every permitted $k$-gram. Once all permitted $k$-grams have been seen, $\mathcal{A}$ equals the true set of permitted $k$-grams, $\mathcal{F}$ equals the true set of forbidden $k$-grams, and the hypothesis is correct. It will never change again, because no future example can introduce a $k$-gram not already in $\mathcal{A}$.]
This algorithm converges to the correct language in finite time from any text, which is exactly what [identification in the limit](https://en.wikipedia.org/wiki/Language_identification_in_the_limit) requires. It never overgeneralizes beyond the target language (since it only permits $k$-grams it has evidence for), and it never gets permanently stuck on a sublanguage (since the finite set of permitted $k$-grams will eventually all be observed).
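The three steps above fit in a few lines of Python. This is an illustrative rendering, not the exact presentation in @heinz2010learning; the class name and the boundary symbols `>`/`<` (standing in for $\rtimes$/$\ltimes$) are ours:

```{python}
def kgrams(word, k, left='>', right='<'):
    """All k-grams of the boundary-augmented string."""
    s = left * (k - 1) + word + right * (k - 1)
    return {s[i:i + k] for i in range(len(s) - k + 1)}

class SLLearner:
    def __init__(self, k):
        self.k = k
        self.attested = set()  # the set A of observed k-grams

    def observe(self, word):
        self.attested |= kgrams(word, self.k)

    def accepts(self, word):
        # Hypothesis: a string is licit iff every one of its k-grams is attested.
        return kgrams(word, self.k) <= self.attested

learner = SLLearner(2)
for example in ['ab', 'abab', 'abba']:  # positive data from some SL-2 language
    learner.observe(example)
print(learner.accepts('aba'))  # True: every bigram has been attested
print(learner.accepts('aa'))   # False: the bigram 'aa' was never observed
```

Note that `accepts` can only return `True` for strings built entirely from attested $k$-grams, which is the formal sense in which the learner never overgeneralizes.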
Why doesn't Gold's impossibility result apply? Because the SL-$k$ class (for a fixed $k$) does *not* contain an infinite ascending chain. The number of possible SL-$k$ languages over a finite alphabet is finite (there are only finitely many subsets of the finite set of $k$-grams), so the condition Gold's theorem requires—an infinite chain $L_1 \subset L_2 \subset \ldots$—simply cannot arise.^[The union $\text{SL} = \bigcup_k \text{SL}_k$ over all $k$ *does* contain infinite chains (since SL-$k \subset$ SL-$(k+1)$), and indeed $\text{SL}$ as a whole is not learnable from positive data without knowing $k$ in advance. But for any fixed $k$, the class is finite and learnable. In practice, this means the learner must either know $k$ or have a method for selecting it—an issue @heinz2010learning discusses in detail.]
### Learning SP languages
The algorithm for SP languages is entirely parallel. An SP-$k$ language is determined by a set of forbidden $k$-*subsequences* $\mathcal{F}$: a string is in the language if and only if none of the forbidden subsequences appear as subsequences of the string.
The learning algorithm:
1. Initialize $\mathcal{A} = \emptyset$ (attested $k$-subsequences).
2. For each new example $w_i$, extract all length-$k$ subsequences of $w_i$ and add them to $\mathcal{A}$.
3. The current hypothesis is the SP-$k$ language with forbidden subsequences $\mathcal{F} = \Sigma^k \setminus \mathcal{A}$.
The convergence argument is the same: there are finitely many $k$-subsequences over a finite alphabet, so after enough examples all permitted subsequences will have been observed and the hypothesis stabilizes.^[Extracting all $k$-subsequences from a string of length $n$ takes $O(\binom{n}{k})$ time, which for fixed $k$ is polynomial in $n$. For the values of $k$ relevant to phonology (typically 2 or 3), this is fast.]
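The SP learner differs from the SL learner only in its extraction step: length-$k$ subsequences instead of contiguous $k$-grams. A sketch with the same caveats (illustrative names, toy data loosely modeled on sibilant harmony):

```{python}
from itertools import combinations

def subsequences(word, k):
    """All length-k subsequences of word, as strings."""
    return {''.join(c) for c in combinations(word, k)}

class SPLearner:
    def __init__(self, k):
        self.k = k
        self.attested = set()  # observed k-subsequences

    def observe(self, word):
        self.attested |= subsequences(word, self.k)

    def accepts(self, word):
        return subsequences(word, self.k) <= self.attested

# Toy harmony data: 's' and 'S' never co-occur, in either order.
learner = SPLearner(2)
for example in ['sasa', 'SaSa', 'aa']:
    learner.observe(example)
print(learner.accepts('saas'))  # True: needs only 'ss', 'sa', 'as', 'aa'
print(learner.accepts('saS'))   # False: the subsequence 'sS' was never attested
```

No boundary augmentation is needed here: subsequences are insensitive to string edges in a way that contiguous $k$-grams are not.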
### Learning TSL languages
@jardine2016learning extended the approach to TSL languages. Recall that a TSL-$k$ language with tier $\tau$ is one where the projection $\pi_\tau(L)$ is SL-$k$. The learning algorithm must discover both the tier $\tau$ and the forbidden $k$-grams on that tier.
The key insight is that the tier can be identified from the data: a symbol belongs to the tier if and only if it participates in a long-distance dependency. @jardine2016learning formalize this by looking for symbols whose distribution is not independent of their non-adjacent context—if a symbol's distribution depends on what appeared several positions earlier (beyond the SL window), it must be on the tier. Once the tier is identified, the SL learning algorithm is applied to the projected strings.
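The reduction itself is easy to sketch once the tier is in hand. The following is a minimal illustration that *assumes* the tier is already known and so omits the tier-discovery step of @jardine2016learning entirely; `kgrams` is redefined so the cell is self-contained, and the sibilant-harmony-style data is ours:

```{python}
def project(word, tier):
    """Erase every symbol not on the tier."""
    return ''.join(c for c in word if c in tier)

def kgrams(word, k, left='>', right='<'):
    """All k-grams of the boundary-augmented string."""
    s = left * (k - 1) + word + right * (k - 1)
    return {s[i:i + k] for i in range(len(s) - k + 1)}

# With the tier given, TSL-2 learning is just SL-2 learning on projections.
tier = {'s', 'S'}
attested = set()
for example in ['sasas', 'SataS', 'tata']:  # harmony holds on the tier
    attested |= kgrams(project(example, tier), 2)

def accepts(word):
    return kgrams(project(word, tier), 2) <= attested

print(accepts('satasa'))  # True: tier projection 'ss' uses only attested bigrams
print(accepts('sataSa'))  # False: the tier bigram 'sS' was never attested
```

The second string is rejected however far apart the disagreeing sibilants are, which is exactly the long-distance behavior that motivates the tier in the first place.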
### The broader picture
@heinz2013learning provided a general framework that unifies these algorithms using *factored deterministic automata*: the idea is that each subregular class corresponds to a particular way of factoring the state space of a DFA, and the learning algorithm for each class exploits the corresponding factorization to learn from positive data. @lambert2021typology argued that the convergence between learnability and typology is not a coincidence—that the distributional properties of subregular classes predict which patterns should be common cross-linguistically.
The picture that emerges is this: the constraint that phonological systems must be learnable from the kind of data children have access to pushes the class of possible phonological grammars into exactly the region of the regular languages where we find them. The subregular hierarchy provides a precise characterization of that region, and the learning algorithms provide a mechanistic account of *why* that region is special.