---
title: Local languages
bibliography: ../../references.bib
jupyter: python3
---
Consider the English onset cluster [bl], as in *black*. This is a perfectly fine way to start a word. The reverse, [lb], is not—there is no English word that begins with [lb]. This constraint depends only on a short sequence of symbols at a particular position in the word: the beginning. In the terminology we'll develop here, this makes it a *local* constraint.
## Strictly Local languages
A *Strictly Local* language of order $k$, or SL-$k$ language, is one whose membership can be determined by checking whether any forbidden sequence of $k$ consecutive symbols appears in the string [@heinz2010learning; @rogers2011aural]. More precisely, we augment the string with boundary markers $\rtimes$ (left edge) and $\ltimes$ (right edge) and check all the $k$-grams of the augmented string against a finite set of forbidden $k$-grams.
A language $L$ is SL-$k$ if and only if there exists a finite set $\mathcal{F} \subset (\Sigma \cup \{\rtimes, \ltimes\})^k$ such that:
$$w \in L \iff \text{no } k\text{-gram of } \rtimes^{k-1} w \ltimes^{k-1} \text{ is in } \mathcal{F}$$
The set $\mathcal{F}$ is called the set of *forbidden factors*. Notice that the boundary markers let us express constraints on how strings begin and end: a forbidden $k$-gram starting with $\rtimes$ is a constraint on the first symbols of the string.
For instance, the constraint against [lb] onsets in English can be expressed as an SL-3 constraint with $\rtimes\text{lb} \in \mathcal{F}$: the trigram consisting of the left boundary marker followed by [l] followed by [b] is forbidden.^[Why SL-3 and not SL-2? Because the forbidden factor involves three symbols: the boundary marker $\rtimes$ plus the two consonants. An SL-2 constraint with $\text{lb} \in \mathcal{F}$ would forbid [lb] *everywhere* in the string—but English allows [lb] elsewhere: across syllable boundaries, as in *elbow* [ɛlboʊ] and *album* [ælbəm], and as a word-final coda cluster, as in *bulb* [bʌlb]. The boundary marker is what lets us restrict the constraint to word-initial position.]
This illustrates an important point: boundary markers are not just a technical convenience. They are what allow SL grammars to distinguish between positional and non-positional constraints. Without them, every SL constraint would apply uniformly across the string. With them, we can express constraints like "word-initial [lb] is forbidden" separately from "word-medial [lb] is fine."
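The definition translates directly into code. Here is a minimal plain-Python sketch of the membership test, independent of any FST machinery (`⋊` and `⋉` are ordinary characters standing in for the boundary markers, and `sl_accepts` is a name invented for illustration), applied to the SL-3 grammar with $\mathcal{F} = \{\rtimes\text{lb}\}$:

```python
def sl_accepts(w, forbidden, k, left="⋊", right="⋉"):
    """w is in the SL-k language iff no k-gram of the
    boundary-augmented string ⋊^(k-1) w ⋉^(k-1) is forbidden."""
    aug = left * (k - 1) + w + right * (k - 1)
    kgrams = {aug[i:i + k] for i in range(len(aug) - k + 1)}
    return kgrams.isdisjoint(forbidden)

F = {"⋊lb"}  # forbid word-initial [lb] only
print(sl_accepts("lbæk", F, 3))   # False: begins with [lb]
print(sl_accepts("ælbəm", F, 3))  # True: [lb] occurs only word-medially
```

Because the forbidden trigram starts with `⋊`, it can only ever match at the left edge of the augmented string, which is exactly the positional restriction described above.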
For a constraint that *does* apply uniformly, consider the sequence [ŋm]: English does not allow a velar nasal [ŋ] followed immediately by a bilabial nasal [m] anywhere within a morpheme (compounds like *hangman* aside). This is an SL-2 constraint with $\text{ŋm} \in \mathcal{F}$—no boundary marker needed.
We can build FSAs that enforce these constraints using `arcweight`. The idea is to construct an automaton that accepts precisely the strings containing the forbidden factor, and then take its complement.
```{python}
#| code-fold: true
#| code-summary: Build an SL-2 constraint FSA forbidding [ŋm]
import arcweight
# Define the alphabet
phones = ['l', 'b', 'k', 'æ', 'ɪ', 'ŋ', 'm']
syms = arcweight.SymbolTable()
sym_ids = {}
for p in phones:
    sym_ids[p] = syms.add_symbol(p)
# Build sigma-star: an FSA accepting all strings over the alphabet
sigma_star = arcweight.VectorFst()
s = sigma_star.add_state()
sigma_star.set_start(s)
sigma_star.set_final(s, 0.0)
for p in phones:
    sigma_star.add_arc(s, sym_ids[p], sym_ids[p], 0.0, s)
# Build an FSA that accepts strings containing the bigram [ŋm]
# States: 0 = last symbol was not ŋ, 1 = last symbol was ŋ, 2 = saw ŋm (accepting sink)
contains_nm = arcweight.VectorFst()
s0 = contains_nm.add_state()
s1 = contains_nm.add_state()
s2 = contains_nm.add_state()
contains_nm.set_start(s0)
contains_nm.set_final(s2, 0.0)
# From s0: seeing 'ŋ' goes to s1, anything else stays at s0
for p in phones:
    if p == 'ŋ':
        contains_nm.add_arc(s0, sym_ids[p], sym_ids[p], 0.0, s1)
    else:
        contains_nm.add_arc(s0, sym_ids[p], sym_ids[p], 0.0, s0)
# From s1: seeing 'm' goes to s2 (found ŋm), seeing 'ŋ' stays at s1, else back to s0
for p in phones:
    if p == 'm':
        contains_nm.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s2)
    elif p == 'ŋ':
        contains_nm.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s1)
    else:
        contains_nm.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s0)
# From s2: accept sink — everything stays
for p in phones:
    contains_nm.add_arc(s2, sym_ids[p], sym_ids[p], 0.0, s2)
# The SL-2 constraint: accept strings that do NOT contain [ŋm]
no_nm = arcweight.difference(sigma_star, contains_nm)
print(f"sigma_star states: {sigma_star.num_states()}")
print(f"contains_nm states: {contains_nm.num_states()}")
print(f"no_nm states: {no_nm.num_states()}")
```
Notice that the states of an SL-$k$ constraint automaton are determined by the last $k - 1$ symbols seen. In our SL-2 example, the state tracks only the previous symbol: whether it was an [ŋ] (in which case a following [m] is forbidden) or something else. This is a very restricted kind of DFA—one where the state space is essentially $\Sigma^{k-1}$ (or a subset of it).
This is the same information that an [$N$-gram model](../../uncertainty-about-languages/ngram-models.qmd) conditions on: the preceding $N - 1$ symbols. An SL-$k$ automaton and a $k$-gram model are tracking the same context; the difference is that the SL automaton makes a binary decision about whether the context is licit, while the $N$-gram model assigns it a probability.
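This restricted state space also yields a generic construction: for any SL-$k$ grammar, build a DFA whose states are the $(k-1)$-symbol contexts and whose transitions slide the context window, blocking any transition that would complete a forbidden $k$-gram. A plain-Python sketch (the helper names `sl_dfa` and `accepts` are invented for illustration; for simplicity it handles internal and left-edge forbidden factors, treating every state as accepting):

```python
from itertools import product

def sl_dfa(alphabet, forbidden, k, left="⋊"):
    """Build a DFA for an SL-k grammar: a state is the last k-1
    symbols seen (initially k-1 copies of the left boundary).
    A transition on symbol a from context q exists iff the
    k-gram q+a is not forbidden."""
    start = left * (k - 1)
    symbols = sorted(alphabet | {left})
    contexts = ["".join(p) for p in product(symbols, repeat=k - 1)]
    delta = {}
    for q in contexts:
        for a in sorted(alphabet):
            if q + a not in forbidden:
                delta[(q, a)] = (q + a)[1:]  # slide the context window
    return start, delta

def accepts(w, start, delta):
    q = start
    for a in w:
        if (q, a) not in delta:  # blocked: would complete a forbidden k-gram
            return False
        q = delta[(q, a)]
    return True

start, delta = sl_dfa({"ŋ", "m", "æ", "n"}, {"ŋm"}, 2)
print(accepts("mæŋ", start, delta))  # True
print(accepts("ŋmæ", start, delta))  # False: contains [ŋm]
```

Note that the state set is exactly the set of $(k-1)$-symbol contexts, matching the $\Sigma^{k-1}$ bound above; handling right-edge forbidden factors would additionally require marking some states non-accepting, which this sketch omits.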
### Combining SL constraints
A key property of the SL class is that it is closed under intersection—and this fact does *not* follow from the regularity of SL languages. The regular languages are closed under intersection, but that only tells us the intersection of two SL languages is regular; it doesn't tell us it's still SL. We need a direct argument.
The argument is simple. Suppose $L_1$ is SL-$k$ with forbidden factors $\mathcal{F}_1$ and $L_2$ is SL-$k$ with forbidden factors $\mathcal{F}_2$.^[If $L_1$ is SL-$k_1$ and $L_2$ is SL-$k_2$ with $k_1 \neq k_2$, take $k = \max(k_1, k_2)$ and extend the shorter forbidden factors. Every $k'$-gram $u$ in $\mathcal{F}_1$ (with $k' < k$) can be replaced by the set of all $k$-grams that contain $u$ as a substring—these are the $k$-grams that are forbidden because they contain the shorter forbidden factor. After this extension, both sets of forbidden factors have the same order $k$.] A string $w$ is in $L_1 \cap L_2$ if and only if it contains no forbidden factor from $\mathcal{F}_1$ *and* no forbidden factor from $\mathcal{F}_2$—which is exactly the condition for $w$ to be in the SL-$k$ language with forbidden factors $\mathcal{F}_1 \cup \mathcal{F}_2$. So $L_1 \cap L_2$ is SL-$k$ with $\mathcal{F} = \mathcal{F}_1 \cup \mathcal{F}_2$.
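The closure fact can be checked mechanically. A minimal sketch in plain Python, re-implementing the membership test from the definition (independent of `arcweight`): membership under the pooled forbidden-factor set $\mathcal{F}_1 \cup \mathcal{F}_2$ coincides with joint membership under $\mathcal{F}_1$ and $\mathcal{F}_2$ separately.

```python
def sl_accepts(w, forbidden, k, left="⋊", right="⋉"):
    """w is in the SL-k language iff no k-gram of the
    boundary-augmented string is forbidden."""
    aug = left * (k - 1) + w + right * (k - 1)
    return all(aug[i:i + k] not in forbidden
               for i in range(len(aug) - k + 1))

F1 = {"ŋm"}  # forbid [ŋm]
F2 = {"mŋ"}  # forbid [mŋ]

for w in ["siŋ", "siŋmə", "simŋa", "mæn"]:
    in_both = sl_accepts(w, F1, 2) and sl_accepts(w, F2, 2)
    in_union = sl_accepts(w, F1 | F2, 2)
    assert in_both == in_union  # the two conditions agree on every string
```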
This is why we can specify a phonotactic grammar as a *collection* of forbidden $k$-grams: the language satisfying all constraints simultaneously is the intersection of the individual SL constraints, and that intersection is itself SL. Each forbidden $k$-gram contributes to a single unified forbidden-factor set.
```{python}
#| code-fold: true
#| code-summary: Combine multiple SL-2 constraints
# Also forbid [mŋ] (bilabial nasal followed by velar nasal — doesn't occur in English)
contains_mn = arcweight.VectorFst()
s0 = contains_mn.add_state()
s1 = contains_mn.add_state()
s2 = contains_mn.add_state()
contains_mn.set_start(s0)
contains_mn.set_final(s2, 0.0)
for p in phones:
    if p == 'm':
        contains_mn.add_arc(s0, sym_ids[p], sym_ids[p], 0.0, s1)
    else:
        contains_mn.add_arc(s0, sym_ids[p], sym_ids[p], 0.0, s0)
for p in phones:
    if p == 'ŋ':
        contains_mn.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s2)
    elif p == 'm':
        contains_mn.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s1)
    else:
        contains_mn.add_arc(s1, sym_ids[p], sym_ids[p], 0.0, s0)
for p in phones:
    contains_mn.add_arc(s2, sym_ids[p], sym_ids[p], 0.0, s2)
no_mn = arcweight.difference(sigma_star, contains_mn)
# Combine: strings with neither [ŋm] nor [mŋ]
combined = arcweight.intersect(no_nm, no_mn)
combined = arcweight.minimize(combined)
print(f"Combined constraint states: {combined.num_states()}")
```
This is the sense in which SL grammars are *constraint-based*: we specify what is forbidden, and the language is whatever remains.
## Locally Testable languages
Strictly Local languages can only express *negative* constraints: certain factors are forbidden. But what if we want to say that a factor is *required*? For instance, suppose we want to express the constraint that every word must contain at least one vowel. This is not an SL constraint, because SL can only forbid substrings—it cannot require them.^[To see why: an SL language is defined by a set of forbidden $k$-grams. If a string avoids all forbidden $k$-grams, it's in the language. There is no mechanism for requiring that a particular $k$-gram be present.]
*Locally Testable* (LT) languages generalize SL by allowing *Boolean combinations* of factor constraints—not just forbidden factors but required factors and arbitrary combinations thereof [@brzozowski1973characterizations; @rogers2013cognitive]. A language is LT-$k$ if membership can be determined by a Boolean formula over statements of the form "the $k$-gram $u$ appears in $\rtimes^{k-1} w \ltimes^{k-1}$."
So "every word must contain at least one vowel" is LT-1: it is the disjunction, over the vowels $v$, of the statements "the 1-gram $v$ appears."
::: {.callout-caution}
## Question
Is the constraint "a word must contain at least one vowel" SL? Is it LT?
:::
::: {.callout-tip collapse=true}
## Answer
It is not SL (for any $k$), because SL can only forbid substrings; it cannot require them. But it is LT-1: the constraint is that the set of 1-grams occurring in the word must include at least one vowel, which is a positive Boolean statement about factor occurrence.
:::
LT properly contains SL: every SL language is LT (forbidding a factor is just the negation of "this factor appears"), but not every LT language is SL (because SL cannot express positive requirements). In practice, most phonotactic constraints are SL rather than LT—they tend to be prohibitions against certain sequences rather than requirements for certain sequences to appear. But the distinction is important for understanding the full hierarchy.
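The LT membership test can also be sketched in a few lines of plain Python: compute the set of occurring factors, then evaluate a Boolean combination of occurrence statements over it (the helper names below are invented for illustration, and the vowel inventory is a toy subset):

```python
def factors(w, k, left="⋊", right="⋉"):
    """The set of k-grams of the boundary-augmented string."""
    aug = left * (k - 1) + w + right * (k - 1)
    return {aug[i:i + k] for i in range(len(aug) - k + 1)}

VOWELS = {"æ", "ɪ", "ʌ", "ə"}

def has_vowel(w):
    # LT-1, a positive requirement: some vowel 1-gram occurs
    return bool(factors(w, 1) & VOWELS)

def no_velar_bilabial(w):
    # SL-2 recast as an LT-2 formula: the negation of "[ŋm] occurs"
    return "ŋm" not in factors(w, 2)

print(has_vowel("blæk"))           # True
print(has_vowel("lbk"))            # False
print(no_velar_bilabial("siŋmə"))  # False
```

The two predicates illustrate the containment: `no_velar_bilabial` uses only a negated occurrence statement, so it is SL, while `has_vowel` needs a positive one, which SL cannot express.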