Consider the English onset cluster [bl], as in black. This is a perfectly fine way to start a word. The reverse, [lb], is not—there is no English word that begins with [lb]. This constraint depends only on pairs of adjacent symbols: [lb] is bad regardless of what comes before or after it. In the terminology we’ll develop here, this makes it a local constraint.

Strictly Local languages

A Strictly Local language of order \(k\), or SL-\(k\) language, is one whose membership can be determined by checking whether any forbidden sequence of \(k\) consecutive symbols appears in the string (Heinz 2010; Rogers and Pullum 2011). More precisely, we augment the string with boundary markers \(\rtimes\) (left edge) and \(\ltimes\) (right edge) and check all the \(k\)-grams of the augmented string against a finite set of forbidden \(k\)-grams.

A language \(L\) is SL-\(k\) if and only if there exists a finite set \(\mathcal{F} \subset (\Sigma \cup \{\rtimes, \ltimes\})^k\) such that:

\[w \in L \iff \text{no } k\text{-gram of } \rtimes^{k-1} w \ltimes^{k-1} \text{ is in } \mathcal{F}\]

The set \(\mathcal{F}\) is called the set of forbidden factors. Notice that the boundary markers let us express constraints on how strings begin and end: a forbidden \(k\)-gram starting with \(\rtimes\) is a constraint on the first symbols of the string.
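
The definition above translates directly into a few lines of plain Python (no FST machinery needed). This is a minimal sketch, using `>` and `<` as ASCII stand-ins for \(\rtimes\) and \(\ltimes\):

```python
def sl_check(w, forbidden, k, left=">", right="<"):
    """Return True iff no k-gram of the boundary-padded string is in `forbidden`."""
    padded = left * (k - 1) + w + right * (k - 1)
    kgrams = {padded[i:i + k] for i in range(len(padded) - k + 1)}
    return kgrams.isdisjoint(forbidden)

# SL-2 grammar forbidding the bigram [lb]:
F = {"lb"}
print(sl_check("blæk", F, 2))      # True: no lb anywhere
print(sl_check("ælbə", F, 2))      # False: contains lb
# Boundary markers let us forbid word-initial [l] as well:
print(sl_check("lɪp", {">l"}, 2))  # False: starts with l
```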

For instance, the constraint against [lb] onsets in English could be expressed with the forbidden factor \(\rtimes\text{lb}\), but that is a 3-gram, so it would be an SL-3 constraint. And in any case, [lb] is bad not just at the beginning of a word. It’s bad everywhere: English doesn’t allow [lb] word-internally either (at least within a morpheme). So a simpler SL-2 constraint forbids the bigram \(\text{lb}\) outright: \(\text{lb} \in \mathcal{F}\).
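
A quick scan-based check makes the difference concrete: a factor anchored with the left boundary only matches at the word edge, while the bare bigram matches anywhere. A plain-Python sketch, again using `>` and `<` as stand-ins for the boundary markers:

```python
def violates(w, factor, k, left=">", right="<"):
    """Return True iff `factor` occurs as a k-gram of the padded string."""
    padded = left * (k - 1) + w + right * (k - 1)
    return any(padded[i:i + k] == factor for i in range(len(padded) - k + 1))

print(violates("lbæk", ">lb", 3))  # True: lb at the left edge
print(violates("ælbə", ">lb", 3))  # False: this lb is word-internal
print(violates("ælbə", "lb", 2))   # True: the SL-2 version bans it anywhere
```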

We can build an FSA that enforces this constraint using arcweight. The idea is to construct an automaton that accepts precisely the strings containing the forbidden factor, and then subtract it from \(\Sigma^*\): the difference is the language of strings that avoid the factor.

Build an SL-2 constraint FSA forbidding [lb]
import arcweight

# Define the alphabet
phones = ['l', 'b', 'k', 'æ', 'ɪ']
syms = arcweight.SymbolTable()
sym_ids = {}
for p in phones:
    sym_ids[p] = syms.add_symbol(p)

# Build sigma-star: an FSA accepting all strings over the alphabet
sigma_star = arcweight.VectorFst()
s = sigma_star.add_state()
sigma_star.set_start(s)
sigma_star.set_final(s, 0.0)
for p in phones:
    sigma_star.add_arc(s, sym_ids[p], sym_ids[p], 0.0, s)

# Build an FSA that accepts strings containing the bigram [lb]
# States: 0 = start/no l seen, 1 = just saw l, 2 = saw lb (accept sink)
contains_lb = arcweight.VectorFst()
s0 = contains_lb.add_state()
s1 = contains_lb.add_state()
s2 = contains_lb.add_state()
contains_lb.set_start(s0)
contains_lb.set_final(s2, 0.0)

# From s0: seeing 'l' goes to s1, anything else stays at s0
for p in phones:
    if p == 'l':
        contains_lb.add_arc(s0, sym_ids[p], sym_ids[p], 0.0, s1)
    else:
        contains_lb.add_arc(s0, sym_ids[p], sym_ids[p], 0.0, s0)

# From s1: seeing 'b' goes to s2 (found lb), seeing 'l' stays at s1, else back to s0
for p in phones:
    if p == 'b':
        contains_lb.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s2)
    elif p == 'l':
        contains_lb.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s1)
    else:
        contains_lb.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s0)

# From s2: accept sink — everything stays
for p in phones:
    contains_lb.add_arc(s2, sym_ids[p], sym_ids[p], 0.0, s2)

# The SL-2 constraint: accept strings that do NOT contain [lb]
no_lb = arcweight.difference(sigma_star, contains_lb)

print(f"sigma_star states: {sigma_star.num_states()}")
print(f"contains_lb states: {contains_lb.num_states()}")
print(f"no_lb states: {no_lb.num_states()}")
sigma_star states: 1
contains_lb states: 3
no_lb states: 3

Notice that the states of an SL-\(k\) constraint automaton are determined by the last \(k - 1\) symbols seen. In our SL-2 example, the state tracks only the previous symbol: whether it was an [l] (in which case a following [b] is forbidden) or something else. This is a very restricted kind of DFA—one where the state space is essentially \(\Sigma^{k-1}\) (or a subset of it).
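
That state space can also be built directly, without going through complement and difference. Here is a plain-Python sketch (independent of arcweight, and handling only plain bigram constraints) in which each state is simply the last symbol read, with `"#"` as a made-up label for the start state:

```python
def make_sl2_dfa(alphabet, forbidden_bigrams):
    """Transition table for an SL-2 DFA: states are '#' (nothing seen yet)
    or the last symbol read; transitions completing a forbidden bigram
    are simply omitted."""
    delta = {}
    for prev in set(alphabet) | {"#"}:
        for a in alphabet:
            if prev + a not in forbidden_bigrams:
                delta[(prev, a)] = a  # the new state remembers the last symbol
    return delta

def accepts(delta, w):
    q = "#"
    for a in w:
        if (q, a) not in delta:  # would complete a forbidden bigram
            return False
        q = delta[(q, a)]
    return True                  # every surviving state is accepting

delta = make_sl2_dfa(["l", "b", "k", "æ", "ɪ"], {"lb"})
print(accepts(delta, "blæk"))  # True
print(accepts(delta, "ælb"))   # False
```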

This is the same information that an \(N\)-gram model conditions on: the preceding \(N - 1\) symbols. An SL-\(k\) automaton and a \(k\)-gram model are tracking the same context; the difference is that the SL automaton makes a binary decision about whether the context is licit, while the \(N\)-gram model assigns it a probability.
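
A toy sketch of the contrast, with bigram probabilities invented purely for illustration: the SL grammar returns a Boolean, the model a gradient score over the same adjacent-pair contexts.

```python
import math

# Toy bigram model (counts/probabilities made up for illustration).
bigram_logprob = {("b", "l"): math.log(0.3), ("l", "æ"): math.log(0.5),
                  ("æ", "k"): math.log(0.2)}
forbidden = {("l", "b")}

def sl_judgment(w):
    """Binary: is every adjacent pair licit?"""
    return all((p, q) not in forbidden for p, q in zip(w, w[1:]))

def lm_score(w):
    """Gradient: sum of bigram log-probabilities (-inf for unseen pairs)."""
    return sum(bigram_logprob.get((p, q), float("-inf"))
               for p, q in zip(w, w[1:]))

print(sl_judgment("blæk"), lm_score("blæk"))  # True, plus a finite score
print(sl_judgment("ælbə"))                    # False
```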

Combining SL constraints

Because SL languages are regular, they are closed under intersection. This means we can specify a phonotactic grammar as a collection of forbidden \(k\)-grams and intersect the corresponding constraints to get the language that satisfies all of them simultaneously.
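
For SL languages specifically, intersection is even simpler than automaton intersection: if \(L_1\) is SL-\(k\) with forbidden set \(\mathcal{F}_1\) and \(L_2\) is SL-\(k\) with \(\mathcal{F}_2\), then \(L_1 \cap L_2\) is SL-\(k\) with \(\mathcal{F}_1 \cup \mathcal{F}_2\). A plain-Python sanity check of that equivalence (reusing the same `>`/`<` boundary convention as above):

```python
def sl_check(w, forbidden, k, left=">", right="<"):
    padded = left * (k - 1) + w + right * (k - 1)
    return all(padded[i:i + k] not in forbidden
               for i in range(len(padded) - k + 1))

F_lb, F_bk = {"lb"}, {"bk"}
F_both = F_lb | F_bk  # union of factor sets = intersection of languages

for w in ["blæk", "ælbə", "æbkæ"]:
    same = (sl_check(w, F_both, 2)
            == (sl_check(w, F_lb, 2) and sl_check(w, F_bk, 2)))
    print(w, same)
```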

Combine multiple SL-2 constraints
# A second SL-2 constraint: forbid the bigram [bk] everywhere
# ([bk] is not a licit English onset)
contains_bk = arcweight.VectorFst()
s0 = contains_bk.add_state()
s1 = contains_bk.add_state()
s2 = contains_bk.add_state()
contains_bk.set_start(s0)
contains_bk.set_final(s2, 0.0)

for p in phones:
    if p == 'b':
        contains_bk.add_arc(s0, sym_ids[p], sym_ids[p], 0.0, s1)
    else:
        contains_bk.add_arc(s0, sym_ids[p], sym_ids[p], 0.0, s0)

for p in phones:
    if p == 'k':
        contains_bk.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s2)
    elif p == 'b':
        contains_bk.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s1)
    else:
        contains_bk.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s0)

for p in phones:
    contains_bk.add_arc(s2, sym_ids[p], sym_ids[p], 0.0, s2)

no_bk = arcweight.difference(sigma_star, contains_bk)

# Combine: strings with neither [lb] nor [bk]
combined = arcweight.intersect(no_lb, no_bk)
combined = arcweight.minimize(combined)

print(f"Combined constraint states: {combined.num_states()}")
Combined constraint states: 5

This is the sense in which SL grammars are constraint-based: we specify what is forbidden, and the language is whatever remains.

Locally Testable languages

Strictly Local languages can only express negative constraints: certain factors are forbidden. But what if we want to say that a factor is required? For instance, suppose we want to express the constraint that every word must contain at least one vowel. This is not an SL constraint, because SL can only forbid substrings—it cannot require them.1

Locally Testable (LT) languages generalize SL by allowing Boolean combinations of factor constraints—not just forbidden factors but required factors and arbitrary combinations thereof (Brzozowski and Simon 1973; Rogers et al. 2013). A language is LT-\(k\) if membership can be determined by a Boolean formula over statements of the form “the \(k\)-gram \(u\) appears in \(\rtimes^{k-1} w \ltimes^{k-1}\).”

So “every word must contain at least one vowel” is LT-1: it is the disjunction, over the vowel symbols, of the statements “this 1-gram appears.”
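
A plain-Python sketch of the LT idea: membership depends only on *which* \(k\)-grams occur, combined by an arbitrary Boolean formula, not just a conjunction of prohibitions. The vowel set below is just the vowels of our toy alphabet.

```python
def kgrams(w, k, left=">", right="<"):
    padded = left * (k - 1) + w + right * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

VOWELS = {"æ", "ɪ"}

def has_vowel(w):
    # LT-1: a positive Boolean statement over which 1-grams occur.
    return bool(kgrams(w, 1) & VOWELS)

print(has_vowel("blæk"))  # True
print(has_vowel("blk"))   # False
```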

Question

Is the constraint “a word must contain at least one vowel” SL? Is it LT?

It is not SL (for any \(k\)), because SL can only forbid substrings; it cannot require them. But it is LT-1: the constraint is that the set of 1-grams occurring in the word must include at least one vowel, which is a positive Boolean statement about factor occurrence.

LT properly contains SL: every SL language is LT (a forbidden-factor constraint is just the negation of the statement “this factor appears”), but not every LT language is SL (because SL cannot express positive requirements). In practice, most phonotactic constraints are SL rather than LT—they tend to be prohibitions against certain sequences rather than requirements for certain sequences to appear. But the distinction is important for understanding the full hierarchy.
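
The SL-to-LT direction can be made concrete: an SL-\(k\) grammar with forbidden set \(\mathcal{F}\) is the LT-\(k\) formula that conjoins one negated occurrence statement per forbidden factor. A small sketch for our one-factor [lb] grammar:

```python
def kgrams(w, k, left=">", right="<"):
    padded = left * (k - 1) + w + right * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def lt_no_lb(grams):
    # One negated occurrence statement; an SL grammar with more forbidden
    # factors would be a conjunction of such statements.
    return "lb" not in grams

print(lt_no_lb(kgrams("blæk", 2)))  # True
print(lt_no_lb(kgrams("ælbə", 2)))  # False
```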

References

Brzozowski, Janusz A., and Imre Simon. 1973. “Characterizations of Locally Testable Events.” Discrete Mathematics 4 (3): 243–71. https://doi.org/10.1016/S0012-365X(73)80005-6.
Heinz, Jeffrey. 2010. “Learning Long-Distance Phonotactics.” Linguistic Inquiry 41 (4): 623–61. https://doi.org/10.1162/LING_a_00015.
Rogers, James, Jeffrey Heinz, Margaret Fero, Jeremy Hurst, Dakotah Lambert, and Sean Wibel. 2013. “Cognitive and Sub-Regular Complexity.” In Formal Grammar, Lecture Notes in Computer Science, vol. 8036, 90–108. https://doi.org/10.1007/978-3-642-39998-5_6.
Rogers, James, and Geoffrey K. Pullum. 2011. “Aural Pattern Recognition Experiments and the Subregular Hierarchy.” Journal of Logic, Language and Information 20 (3): 329–42. https://doi.org/10.1007/s10849-011-9140-2.

Footnotes

  1. To see why: an SL language is defined by a set of forbidden \(k\)-grams. If a string avoids all forbidden \(k\)-grams, it’s in the language. There is no mechanism for requiring that a particular \(k\)-gram be present.↩︎