---
title: Lookarounds
jupyter: python3
---

In the previous section, we used groups and greediness to extract stems from words ending in /ʃən/—like extracting /æbstɹæk/ from /æbstɹækʃən/. We also saw the anchors `^` and `$`, which assert something about the *position* in the string (beginning or end) without consuming any characters. Lookarounds are a generalization of the same idea to arbitrary patterns: you can assert that a particular pattern does or does not occur at a given position, without that assertion consuming any of the string.
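As a warm-up sketch with plain letters rather than IPA: a consuming pattern includes the asserted material in the match, while a lookahead checks it without including it.

```{python}
import re

m_consume = re.search(r'ab', 'abc')
m_assert = re.search(r'a(?=b)', 'abc')
print(m_consume.span())  # (0, 2): the 'b' is part of the match
print(m_assert.span())   # (0, 1): the 'b' was checked but not consumed
```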
```{python}
#| code-fold: true
#| code-summary: Load IPA representation of CMU Pronouncing Dictionary
with open("cmudict-ipa") as f:
    pairs = [l.strip().split(",") for l in f]
entries: dict[str, list[str]] = {w: ipa.split() for w, ipa in pairs}
```
## Positive lookahead
A *positive lookahead* `(?=...)` asserts that what follows the current position matches a given pattern. Recall our /ʃən/ extraction from the previous section: we captured the stem with `(.+?)(?:eɪ)?ʃən$`, which required careful management of greediness to avoid swallowing the /eɪ/. With a lookahead, we can instead find the position just before /ʃən/ (or /eɪʃən/) without consuming the suffix at all.
```{python}
import re

regex_stem = r'(.+?)(?=(?:eɪ)?ʃən$)'

n_matches = 0
for w, ipa in entries.items():
    joined = "".join(ipa)
    m = re.search(regex_stem, joined)
    if m and joined.endswith('ʃən'):
        if n_matches < 15:
            n_matches += 1
            print(f"{joined:20s} stem: {m.group(1):10s} ({w})")
        else:
            break
```
The `(?=(?:eɪ)?ʃən$)` checks that what follows the current position is (optionally /eɪ/ and then) /ʃən/ at the end of the string—but it doesn't consume those characters. The capturing group `(.+?)` therefore gets exactly the stem, with none of the suffix leaking in. Compare this to the greediness-based approach from the previous section, where we had to think carefully about whether `.+` would swallow /eɪ/.
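A sketch on individual forms (/steɪʃən/ is my own illustrative transcription of *station*; it is not taken from the dictionary above):

```{python}
import re

pattern = r'(.+?)(?=(?:eɪ)?ʃən$)'
print(re.search(pattern, 'æbstɹækʃən').group(1))  # æbstɹæk: plain /ʃən/
print(re.search(pattern, 'steɪʃən').group(1))     # st: the /eɪ/ stays with the suffix
```

Because `(.+?)` is lazy, it stops at the first position where the lookahead succeeds, which for /steɪʃən/ is before the /eɪ/, not after it.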
## Negative lookahead
A *negative lookahead* `(?!...)` asserts that what follows does *not* match a given pattern. This is useful for expressing phonotactic constraints—restrictions on what sequences of sounds a language allows.
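A minimal sketch before turning to the dictionary (/ʃip/ and /mɪʃən/ are my own illustrative transcriptions of *sheep* and *mission*):

```{python}
import re

neg = r'ʃ(?!ən)'
print(bool(re.search(neg, 'ʃip')))    # True: /ʃ/ is followed by /ip/
print(bool(re.search(neg, 'mɪʃən')))  # False: the only /ʃ/ precedes /ən/
```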
Let's find all words in the CMU dictionary where /ʃ/ appears but is *not* followed by /ən/—i.e. cases where /ʃ/ is not part of the /-ʃən/ suffix we've been studying. This gives us a sense of what other contexts /ʃ/ appears in.
```{python}
regex = r'ʃ(?!ən)'

n_matches = 0
for w, ipa in entries.items():
    joined = "".join(ipa)
    m = re.search(regex, joined)
    if m:
        if n_matches < 15:
            n_matches += 1
            pos = m.start()
            print(f"{joined:20s} /ʃ/ at position {pos} ({w})")
        else:
            break
```
```
əbæʃ        /ʃ/ at position 3 (abash)
əbæʃt       /ʃ/ at position 3 (abashed)
əbɑlɪʃ      /ʃ/ at position 5 (abolish)
əbɑlɪʃt     /ʃ/ at position 5 (abolished)
əbɑlɪʃɪz    /ʃ/ at position 5 (abolishes)
əbɑlɪʃɪŋ    /ʃ/ at position 5 (abolishing)
ɑbɹəmtʃɪk   /ʃ/ at position 6 (abramczyk)
əbɹɑməvɪtʃ  /ʃ/ at position 9 (abramowicz)
æbsɛnʃə     /ʃ/ at position 5 (absentia)
æbʃɝ        /ʃ/ at position 2 (absher)
æbʃiɝ       /ʃ/ at position 2 (abshier)
æbʃaɪɹ      /ʃ/ at position 2 (abshire)
əkeɪʃə      /ʃ/ at position 4 (acacia)
æksɛntʃueɪt /ʃ/ at position 6 (accentuate)
æksɛntʃueɪtɪd /ʃ/ at position 6 (accentuated)
```
These are words where /ʃ/ appears in contexts other than the /-ʃən/ suffix: initial /ʃ/ (as in *she*), medial /ʃ/ before other vowels (as in *machine*, /məʃin/), word-final /ʃ/ (as in *abash* above), etc. The negative lookahead expresses the constraint "not followed by /ən/" directly. Without it, we would need to enumerate all the things /ʃ/ *can* be followed by, which is tedious and error-prone.
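To make the "tedious and error-prone" point concrete, here is a sketch of a lookahead-free equivalent, which has to enumerate the cases by hand: /ʃ/ string-finally, /ʃ/ before a non-/ə/, /ʃə/ string-finally, or /ʃə/ before a non-/n/. (/mɪʃən/ is my own illustrative transcription of *mission*.)

```{python}
import re

tidy = r'ʃ(?!ən)'
# The same constraint spelled out case by case, without a lookahead:
clumsy = r'ʃ(?:$|[^ə]|ə$|ə[^n])'
for s in ['əbæʃ', 'æbʃɝ', 'mɪʃən']:
    assert bool(re.search(tidy, s)) == bool(re.search(clumsy, s))
print("the two patterns agree on all three strings")
```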
## Lookbehind
A *lookbehind* asserts something about what *precedes* the current position: `(?<=...)` for positive and `(?<!...)` for negative. In Python's `re` module, the pattern inside a lookbehind must have a fixed length—you can't use `*` or `+` inside it.[^1]
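For instance, a minimal sketch: putting `+` inside a lookbehind is rejected at compile time with `re.error`.

```{python}
import re

# re rejects a variable-length lookbehind when the pattern is compiled:
try:
    re.compile(r'(?<=eɪ+)ʃən')
except re.error as err:
    print(err)
```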
Let's use a lookbehind to separate the two allomorphs of the /-ʃən/ suffix that we identified in the previous section: /eɪʃən/ (as in *accreditation*) versus plain /ʃən/ (as in *abstraction*).
```{python}
regex_ation = r'(?<=eɪ)ʃən$'
regex_tion = r'(?<!eɪ)ʃən$'

ation_words = []
tion_words = []
for w, ipa in entries.items():
    joined = "".join(ipa)
    if re.search(regex_ation, joined):
        ation_words.append((w, joined))
    elif re.search(regex_tion, joined):
        tion_words.append((w, joined))

print("/eɪʃən/ words:")
for w, ipa in ation_words[:10]:
    print(f" {ipa:20s} ({w})")
print()
print("/ʃən/ (without /eɪ/) words:")
for w, ipa in tion_words[:10]:
    print(f" {ipa:20s} ({w})")
```
The lookbehind checks the preceding context without consuming it, so we don’t need to worry about greediness at all—the two allomorphs fall out of the two patterns directly.
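As a quick check, a sketch on two forms (/steɪʃən/ is my own illustrative transcription of *station*): every /ʃən/-final string matches exactly one of the two patterns, so they partition the suffix-bearing words.

```{python}
import re

for s in ['æbstɹækʃən', 'steɪʃən']:
    has_ei = bool(re.search(r'(?<=eɪ)ʃən$', s))
    no_ei = bool(re.search(r'(?<!eɪ)ʃən$', s))
    assert has_ei != no_ei  # exactly one pattern matches
print("each form matches exactly one of the two patterns")
```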
## Combining lookarounds for phonotactic queries
You can stack multiple lookarounds at the same position to express conjunctive constraints. Suppose we want to find all VCCV sequences in the lexicon—a vowel followed by two consonants and then another vowel. These sequences are relevant for syllabification, since the question of where the syllable boundary falls between the two consonants depends on properties like their relative sonority.
```{python}
vowels = 'æəɑɔɛɝɪiʊu'
consonants = 'bcdfghjklmnŋpɹstvwzðθʃʒ'
regex_vccv = f'[{vowels}][{consonants}]{{2}}[{vowels}]'

n_matches = 0
for w, ipa in entries.items():
    joined = "".join(ipa)
    matches = list(re.finditer(regex_vccv, joined))
    if matches:
        if n_matches < 10:
            n_matches += 1
            for m in matches:
                cluster = m.group()
                print(f"{joined:20s} VCCV: {cluster} ({w})")
        else:
            break
```
## Formal perspective

I'm spending time on lookarounds partly because they're useful and partly because they're formally interesting: they don't increase the expressive power of regular expressions. A positive lookahead `(?=ρ)` at some position is equivalent to requiring that the remaining substring is in $\text{eval}(\rho) \circ \Sigma^*$—that is, it starts with something in $\text{eval}(\rho)$. A negative lookahead `(?!ρ)` requires the opposite. Since the regular languages are closed under intersection and complement (as we'll see when we study finite state automata), these positional assertions can always be expressed—much less compactly—using just the basic operations we defined at the beginning of this module, together with intersection and complement. Lookarounds, like quantifiers and character classes before them, are notational conveniences that don't extend the class of languages we can describe.
[^1]: The engine would need to try all possible lengths for the lookbehind at each position, which is expensive. The third-party `regex` module relaxes this restriction.