Star-free languages and the full picture

We’ve now seen two branches of the subregular hierarchy: the local side (SL \(\subset\) LT) and the piecewise side (SP \(\subset\) PT), along with TSL crosscutting both. These branches converge at a class called the star-free languages, which sits just below the full regular languages and just above both LT and PT.

Star-free languages

A language is star-free if it can be described by a generalized regular expression that uses union, concatenation, and complement but not Kleene star (McNaughton and Papert 1971; Schützenberger 1965). That is, we replace the \(*\) operation with \(\overline{\cdot}\) (complement with respect to \(\Sigma^*\)).

This might sound like a minor change, but it turns out to carve out a very specific subclass of the regular languages. The reason is that Kleene star is what gives regular expressions the ability to count modulo a number. For instance, the language \(\{w \in \{a, b\}^* \mid |w|_a \equiv 0 \pmod{2}\}\)—strings with an even number of \(a\)’s—is regular (you can build a two-state DFA for it) but not star-free.
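To see concretely that the even-\(a\) pattern is regular, here is a minimal plain-Python simulation of the two-state DFA just mentioned (the function name is my own):

```python
def even_as(w: str) -> bool:
    """Simulate the two-state DFA: the state is the parity of a's seen so far."""
    state = 0  # 0 = even number of a's (accepting), 1 = odd
    for c in w:
        if c == 'a':
            state = 1 - state  # each 'a' toggles parity: modular counting
        # 'b' leaves the state unchanged
    return state == 0

print(even_as("abab"))  # True: two a's
print(even_as("ab"))    # False: one a
```

The toggle `state = 1 - state` is exactly the modular counting that no star-free expression can express.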

We can build star-free expressions with arcweight by combining automata using only union, concatenation, and difference (which implements complement relative to \(\Sigma^*\)).

Build a star-free language using only union, concatenation, and complement
import arcweight

# Alphabet: {a, b}
syms = arcweight.SymbolTable()
a = syms.add_symbol('a')
b = syms.add_symbol('b')

# Sigma-star: accepts all strings over {a, b}
sigma_star = arcweight.VectorFst()
s = sigma_star.add_state()
sigma_star.set_start(s)
sigma_star.set_final(s, 0.0)
sigma_star.add_arc(s, a, a, 0.0, s)
sigma_star.add_arc(s, b, b, 0.0, s)

# Single-symbol FSAs
fst_a = arcweight.VectorFst()
s0 = fst_a.add_state()
s1 = fst_a.add_state()
fst_a.set_start(s0)
fst_a.set_final(s1, 0.0)
fst_a.add_arc(s0, a, a, 0.0, s1)

fst_b = arcweight.VectorFst()
s0 = fst_b.add_state()
s1 = fst_b.add_state()
fst_b.set_start(s0)
fst_b.set_final(s1, 0.0)
fst_b.add_arc(s0, b, b, 0.0, s1)

# Kleene star is banned in star-free expressions, but Σ* itself is star-free:
# Σ* = complement(∅), and the empty language ∅ needs no star to define.
# The sigma_star automaton above is therefore just a convenient realization
# of complement(∅), and difference against it implements complement.

# The empty language ∅: a single non-final state, no arcs
empty = arcweight.VectorFst()
s = empty.add_state()
empty.set_start(s)
# No final state, so it accepts nothing

# "Strings that contain 'ab' as a substring" is star-free:
# Σ* ∘ {a} ∘ {b} ∘ Σ*, built from concatenation alone (with Σ* = complement(∅))
ab_substr = arcweight.concat(arcweight.concat(sigma_star, fst_a), arcweight.concat(fst_b, sigma_star))

# "Strings that do NOT contain [ab]" (complement of the above)
no_ab = arcweight.difference(sigma_star, ab_substr)
no_ab = arcweight.minimize(no_ab)

print(f"no_ab states: {no_ab.num_states()}")
no_ab states: 3

The McNaughton-Papert theorem

The star-free languages have a beautiful algebraic characterization due to McNaughton and Papert (1971) and Schützenberger (1965): a regular language is star-free if and only if its syntactic monoid is aperiodic—meaning it contains no nontrivial cyclic groups. The equivalent automaton-theoretic statement is that a regular language is star-free if and only if it is recognizable by a counter-free automaton: a DFA in which there is no state \(q\), nonempty string \(w\), and integer \(n > 1\) such that \(\delta(q, w^n) = q\) but \(\delta(q, w) \neq q\). In other words, if iterating some string \(w\) eventually cycles a state back to itself, then a single application of \(w\) must already return to that state—the automaton has no nontrivial cycles and therefore cannot count repetitions.
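The counter-free condition can be checked mechanically by enumerating the transformations that strings induce on states (the transition monoid) and verifying that iterating each one settles into a cycle of length one. A small pure-Python sketch (the function names and DFA encoding are my own):

```python
def transition_monoid(delta, n, alphabet):
    """All state transformations induced by strings, for a complete DFA
    with states 0..n-1 and transition table delta[state][symbol]."""
    gens = {a: tuple(delta[q][a] for q in range(n)) for a in alphabet}
    monoid = set(gens.values())
    frontier = list(monoid)
    while frontier:
        f = frontier.pop()
        for g in gens.values():
            h = tuple(g[f[q]] for q in range(n))  # transformation for w·a
            if h not in monoid:
                monoid.add(h)
                frontier.append(h)
    return monoid

def is_counter_free(delta, n, alphabet):
    """Counter-free iff every transformation is aperiodic: its powers
    settle into a cycle of length exactly 1 (no nontrivial w-cycle)."""
    for f in transition_monoid(delta, n, alphabet):
        seen, g = [], f
        while g not in seen:
            seen.append(g)
            g = tuple(f[g[q]] for q in range(len(f)))  # next power of f
        if len(seen) - seen.index(g) > 1:  # cycle longer than 1: a counter
            return False
    return True

# Two-state parity DFA for "even number of a's": a genuine counter
even = {0: {'a': 1, 'b': 0}, 1: {'a': 0, 'b': 1}}
# Three-state DFA for "no 'ab' substring" (b*a* plus a sink): counter-free
no_ab = {0: {'a': 1, 'b': 0}, 1: {'a': 1, 'b': 2}, 2: {'a': 2, 'b': 2}}

print(is_counter_free(even, 2, 'ab'))   # False
print(is_counter_free(no_ab, 3, 'ab'))  # True
```

The parity DFA fails because the transformation for 'a' cycles the two states with period 2, which is exactly the nontrivial cyclic group the aperiodicity condition rules out.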

There is also a logical characterization: the star-free languages are exactly the languages definable in first-order logic with the linear order and successor predicates on string positions. This connects the subregular hierarchy to a broader program of relating language classes to logical expressiveness.
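As a concrete illustration (my own example), the star-free language "strings containing the substring \(ab\)" is defined by the first-order sentence

\[\exists x \, \exists y \; \big( y = x + 1 \wedge a(x) \wedge b(y) \big)\]

where the variables range over string positions, \(y = x + 1\) is the successor predicate, and \(a(x)\) says that position \(x\) carries an \(a\). By contrast, "even number of \(a\)'s" has no such sentence: counting modulo 2 is not first-order expressible in this signature.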

I won’t prove these equivalences here, but they are worth knowing about because they illuminate why star-free languages are a natural boundary: the languages above them (but still regular) are exactly those that require modular counting, and there is no known phonological pattern that requires counting modulo a number.

The complete hierarchy

We can now assemble the full picture. The following containments are all proper:

\[\text{SL}_k \subset \text{SL}_{k+1} \subset \ldots \subset \text{SL} \subset \text{LT} \subset \text{Star-Free} \subset \text{Regular}\] \[\text{SP}_k \subset \text{SP}_{k+1} \subset \ldots \subset \text{SP} \subset \text{PT} \subset \text{Star-Free} \subset \text{Regular}\]

where SL \(= \bigcup_k \text{SL}_k\) and SP \(= \bigcup_k \text{SP}_k\). TSL sits to the side: it is not contained in SL or SP, but every TSL language is star-free. The relationships among SL, SP, LT, PT, and TSL are not simple containment—there are languages in each that are not in the others—but they all live within the star-free languages, and all are well below the full regular class.

Question

The language “strings over \(\{a, b\}\) with an even number of \(a\)’s” is regular. Is it star-free?

No. This language requires counting \(a\)’s modulo 2, which requires a cyclic group in the syntactic monoid. The minimal DFA has two states that cycle with each \(a\), making it a counter. Since the syntactic monoid is not aperiodic, the language is not star-free.

The empirical picture

As documented across a large body of work (Heinz 2018; Rogers et al. 2013; Lambert et al. 2021; Graf 2017), attested phonological patterns overwhelmingly fall into the lowest levels of this hierarchy.

  • Most phonotactic constraints (restrictions on what sequences of sounds can appear) are SL—they depend on short contiguous contexts.
  • Long-distance phonological processes like harmony are typically SP or TSL.
  • Very few phonological patterns require even LT or PT.
  • No known phonological pattern requires the full power of the regular languages.

There is nothing in the definition of a DFA that prevents a phonological pattern from using modular counting or other non-star-free capabilities. That phonological patterns don’t use them is an empirical observation about natural language, not a logical necessity.

Learnability revisited

One candidate explanation for why phonological patterns cluster in the subregular hierarchy comes from learnability. Recall Gold’s result: the full class of regular languages is not learnable in the limit from positive data, because it contains infinite chains \(L_1 \subset L_2 \subset \ldots\) that allow the learner to be tricked. If the subregular classes avoid this problem—if they are learnable from positive data—then the clustering of phonological patterns in these classes would have a principled explanation: natural languages live where they do because those are the languages that children can learn.

This is not just a hope. Heinz (2010) showed that SL and SP languages are learnable in the limit from positive data, and the algorithms that do it are remarkably simple.

Learning SL languages

Recall from the section on SL languages that an SL-\(k\) language is determined by a set of forbidden \(k\)-grams \(\mathcal{F}\): a string is in the language if and only if none of its \(k\)-grams (in the boundary-augmented string) appear in \(\mathcal{F}\). The learning algorithm exploits this directly.

Given a parameter \(k\) and a stream of positive examples \(w_1, w_2, \ldots\) from an unknown SL-\(k\) language \(L\):

  1. Initialize \(\mathcal{A} = \emptyset\) (the set of attested \(k\)-grams).
  2. For each new example \(w_i\), extract all \(k\)-grams from the augmented string \(\rtimes^{k-1} w_i \ltimes^{k-1}\) and add them to \(\mathcal{A}\).
  3. At any point, the learner’s current hypothesis is the SL-\(k\) language with forbidden factors \(\mathcal{F} = (\Sigma \cup \{\rtimes, \ltimes\})^k \setminus \mathcal{A}\)—that is, every \(k\)-gram that has not yet been observed is assumed forbidden.

The learner starts maximally restrictive (everything is forbidden) and relaxes constraints as it sees evidence. Since the alphabet is finite, there are only finitely many possible \(k\)-grams. After enough examples, every \(k\)-gram that is licit in \(L\) will have been observed at least once, and the learner’s hypothesis will stabilize at \(L\).1

This algorithm converges to the correct language in finite time from any text, which is exactly what identification in the limit requires. It never overgeneralizes beyond the target language (since it only permits \(k\)-grams it has evidence for), and it never gets permanently stuck on a sublanguage (since the finite set of permitted \(k\)-grams will eventually all be observed).
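The algorithm is short enough to sketch in full. A minimal plain-Python version (the class name and the ASCII boundary stand-ins '>' and '<' for \(\rtimes\) and \(\ltimes\) are my own):

```python
BOUND_L, BOUND_R = '>', '<'  # ASCII stand-ins for the boundary symbols

def kgrams(w: str, k: int):
    """All k-grams of w after padding with k-1 boundary symbols per side."""
    padded = BOUND_L * (k - 1) + w + BOUND_R * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

class SLLearner:
    """Identify an SL-k language in the limit from positive data:
    permit exactly the k-grams attested so far, forbid everything else."""
    def __init__(self, k: int):
        self.k = k
        self.attested = set()

    def observe(self, w: str):
        self.attested |= kgrams(w, self.k)

    def hypothesis_accepts(self, w: str) -> bool:
        # Accept iff every k-gram of w has been attested in the data.
        return kgrams(w, self.k) <= self.attested

# Positive data from the SL-2 language "no 'aa' substring"
learner = SLLearner(2)
for w in ["ab", "ba", "bab", "b", ""]:
    learner.observe(w)

print(learner.hypothesis_accepts("abab"))  # True: all 2-grams attested
print(learner.hypothesis_accepts("aab"))   # False: 'aa' never attested
```

Note that the hypothesis only ever grows: each new example can add permitted \(k\)-grams but never remove them, which is why the learner cannot overgeneralize.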

Why doesn’t Gold’s impossibility result apply? Because the SL-\(k\) class (for a fixed \(k\)) does not contain an infinite ascending chain. The number of possible SL-\(k\) languages over a finite alphabet is finite (there are only finitely many subsets of the finite set of \(k\)-grams), so the condition Gold’s theorem requires—an infinite chain \(L_1 \subset L_2 \subset \ldots\)—simply cannot arise.2

Learning SP languages

The algorithm for SP languages is entirely parallel. An SP-\(k\) language is determined by a set of forbidden \(k\)-subsequences \(\mathcal{F}\): a string is in the language if and only if none of the forbidden subsequences appear as subsequences of the string.

The learning algorithm:

  1. Initialize \(\mathcal{A} = \emptyset\) (attested \(k\)-subsequences).
  2. For each new example \(w_i\), extract all length-\(k\) subsequences of \(w_i\) and add them to \(\mathcal{A}\).
  3. The current hypothesis is the SP-\(k\) language with forbidden subsequences \(\mathcal{F} = \Sigma^k \setminus \mathcal{A}\).

The convergence argument is the same: there are finitely many \(k\)-subsequences over a finite alphabet, so after enough examples all permitted subsequences will have been observed and the hypothesis stabilizes.3
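A parallel plain-Python sketch of the SP-\(k\) learner (class and function names are my own), with subsequences extracted via `itertools.combinations`:

```python
from itertools import combinations

def subseqs(w: str, k: int):
    """All length-k subsequences (not necessarily contiguous) of w."""
    return {''.join(c) for c in combinations(w, k)}

class SPLearner:
    """Identify an SP-k language in the limit from positive data:
    permit exactly the k-subsequences attested so far."""
    def __init__(self, k: int):
        self.k = k
        self.attested = set()

    def observe(self, w: str):
        self.attested |= subseqs(w, self.k)

    def hypothesis_accepts(self, w: str) -> bool:
        return subseqs(w, self.k) <= self.attested

# Toy long-distance pattern: 's' may never be followed, at any distance, by 'z'
learner = SPLearner(2)
for w in ["sasa", "zaza", "aas", "zaa"]:
    learner.observe(w)

print(learner.hypothesis_accepts("saasaa"))  # True
print(learner.hypothesis_accepts("saazaa"))  # False: subsequence 'sz' unattested
```

Unlike the SL learner, distance is irrelevant here: the forbidden configuration 'sz' is detected whether the two symbols are adjacent or many positions apart.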

Learning TSL languages

Jardine and Heinz (2016) extended the approach to TSL languages. Recall that a TSL-\(k\) language with tier \(\tau\) is one where the projection \(\pi_\tau(L)\) is SL-\(k\). The learning algorithm must discover both the tier \(\tau\) and the forbidden \(k\)-grams on that tier.

The key insight is that the tier can be identified from the data: a symbol belongs to the tier if and only if it participates in a long-distance dependency. Jardine and Heinz (2016) formalize this by looking for symbols whose distribution is not independent of their non-adjacent context—if a symbol’s distribution depends on what appeared several positions earlier (beyond the SL window), it must be on the tier. Once the tier is identified, the SL learning algorithm is applied to the projected strings.
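The second step is easy to sketch once a tier is in hand. The following toy (my own construction; the tier-discovery step of Jardine and Heinz 2016 is omitted, and the tier is simply assumed known) projects each example onto the tier and runs the SL-2 learner on the projections:

```python
def project(w: str, tier: set) -> str:
    """Erase all non-tier symbols, keeping tier symbols in order."""
    return ''.join(c for c in w if c in tier)

def tier_kgrams(w: str, k: int, tier: set):
    """k-grams of the boundary-padded tier projection of w."""
    padded = '>' * (k - 1) + project(w, tier) + '<' * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

# Toy sibilant-harmony pattern over {s, z, a}: tier {s, z}; the positive
# data is harmonic, so only 'ss' and 'zz' ever appear as tier 2-grams.
tier = {'s', 'z'}
attested = set()
for w in ['sasa', 'zaza', 'aaa', 'sa', 'za']:
    attested |= tier_kgrams(w, 2, tier)

def accepts(w: str) -> bool:
    return tier_kgrams(w, 2, tier) <= attested

print(accepts('saasaa'))  # True: harmonic at any distance
print(accepts('saazaa'))  # False: disharmonic tier 2-gram 'sz'
```

Because the intervening 'a's are erased before \(k\)-grams are taken, the learner enforces the dependency at unbounded distance while only ever reasoning about adjacent tier symbols.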

The broader picture

Heinz and Rogers (2013) provided a general framework that unifies these algorithms using factored deterministic automata: the idea is that each subregular class corresponds to a particular way of factoring the state space of a DFA, and the learning algorithm for each class exploits the corresponding factorization to learn from positive data. Lambert et al. (2021) argued that the convergence between learnability and typology is not a coincidence—that the distributional properties of subregular classes predict which patterns should be common cross-linguistically.

The picture that emerges is this: the constraint that phonological systems must be learnable from the kind of data children have access to pushes the class of possible phonological grammars into exactly the region of the regular languages where we find them. The subregular hierarchy provides a precise characterization of that region, and the learning algorithms provide a mechanistic account of why that region is special.

References

Graf, Thomas. 2017. “The Power of Locality Domains in Phonology.” Phonology 34 (2): 385–405. https://doi.org/10.1017/S0952675717000197.
Heinz, Jeffrey. 2010. “Learning Long-Distance Phonotactics.” Linguistic Inquiry 41 (4): 623–61. https://doi.org/10.1162/LING_a_00015.
Heinz, Jeffrey. 2018. “The Computational Nature of Phonological Generalizations.” In Phonological Typology, edited by Larry M. Hyman and Frans Plank. De Gruyter Mouton. https://doi.org/10.1515/9783110451931-005.
Heinz, Jeffrey, and James Rogers. 2013. “Learning Subregular Classes of Languages with Factored Deterministic Automata.” Proceedings of the 13th Meeting on Mathematics of Language (MOL), 64–71.
Jardine, Adam, and Jeffrey Heinz. 2016. “Learning Tier-Based Strictly 2-Local Languages.” Transactions of the Association for Computational Linguistics 4: 87–98. https://doi.org/10.1162/tacl_a_00085.
Lambert, Dakotah, Jonathan Rawski, and Jeffrey Heinz. 2021. “Typology Emerges from Simplicity in Representations and Learning.” Journal of Language Modelling 9 (1): 151–94. https://doi.org/10.15398/jlm.v9i1.262.
McNaughton, Robert, and Seymour A. Papert. 1971. Counter-Free Automata. MIT Research Monograph 65. The MIT Press.
Rogers, James, Jeffrey Heinz, Margaret Fero, Jeremy Hurst, Dakotah Lambert, and Sean Wibel. 2013. “Cognitive and Sub-Regular Complexity.” In Formal Grammar, Lecture Notes in Computer Science, vol. 8036, 90–108. https://doi.org/10.1007/978-3-642-39998-5_6.
Schützenberger, Marcel-Paul. 1965. “On Finite Monoids Having Only Trivial Subgroups.” Information and Control 8 (2): 190–94. https://doi.org/10.1016/S0019-9958(65)90108-7.

Footnotes

  1. More precisely: since \(L\) is SL-\(k\), its set of permitted \(k\)-grams is exactly the set of \(k\)-grams that appear in strings in \(L\). Any text for \(L\)—any sequence that eventually includes every string in \(L\)—will therefore eventually present every permitted \(k\)-gram. Once all permitted \(k\)-grams have been seen, \(\mathcal{A}\) equals the true set of permitted \(k\)-grams, \(\mathcal{F}\) equals the true set of forbidden \(k\)-grams, and the hypothesis is correct. It will never change again, because no future example can introduce a \(k\)-gram not already in \(\mathcal{A}\).↩︎

  2. The union \(\text{SL} = \bigcup_k \text{SL}_k\) over all \(k\) does contain infinite chains (since SL-\(k \subset\) SL-\((k+1)\)), and indeed \(\text{SL}\) as a whole is not learnable from positive data without knowing \(k\) in advance. But for any fixed \(k\), the class is finite and learnable. In practice, this means the learner must either know \(k\) or have a method for selecting it—an issue Heinz (2010) discusses in detail.↩︎

  3. Extracting all \(k\)-subsequences from a string of length \(n\) takes \(O(\binom{n}{k})\) time, which for fixed \(k\) is polynomial in \(n\). For the values of \(k\) relevant to phonology (typically 2 or 3), this is fast.↩︎