---
title: Substitutions
jupyter: python3
---

So far, we've used regular expressions to *recognize* and *extract from* strings—asking whether a string is in the language that a regular expression evaluates to, or pulling out substrings that match a pattern. Regular expressions can also be used to *transform* strings, and this is where the connection to phonological and morphological theory becomes most concrete: the transformations we'll write here are, formally, the same kind of object as the phonological rules we'll study in the next module.
```{python}
#| code-fold: true
#| code-summary: Load IPA representation of CMU Pronouncing Dictionary
with open("cmudict-ipa") as f:
    entries: list[tuple[str, str]] = [
        l.strip().split(",") for l in f
    ]

entries: dict[str, list[str]] = {
    w: ipa.split() for w, ipa in entries
}
```
## Basic substitution with `re.sub`
The function `re.sub(pattern, replacement, string)` finds all occurrences of `pattern` in `string` and replaces them with `replacement`. The pattern can be any regular expression, and the replacement can refer back to captured groups using backreferences (`\1`, `\2`, etc.).
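Before applying this to the pronunciation data, here is a minimal illustration of how backreferences work in a replacement string (the words here are just placeholders):

```{python}
import re

# Swap the order of two words: \1 and \2 refer back to the first
# and second capturing groups in the pattern.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'surface underlying')
print(swapped)  # underlying surface
```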
In the previous sections, we worked out that many words ending in /ʃən/ are derived from stems that end in /t/—the /t/ gets deleted before the suffix. We can model this derivation in reverse using `re.sub`: given the surface form /æbstɹækʃən/, we reconstruct the underlying stem by reinserting the /t/ before /ʃən/. We use a negative lookbehind to make sure we only do this when the suffix is plain /ʃən/, not /eɪʃən/ (where no /t/-deletion occurred).
```{python}
import re

regex = r'(.+?)(?<!eɪ)(ʃən)$'

test_words = ['abstraction', 'attraction', 'accreditation', 'fashion', 'passion']

for w in test_words:
    if w in entries:
        joined = "".join(entries[w])
        underlying = re.sub(regex, r'\1t\2', joined)
        if underlying != joined:
            print(f"{joined:20s} → {underlying:20s} ({w})")
        else:
            print(f"{joined:20s} → (no /t/ restored) ({w})")
```
The backreference `\1` preserves whatever the stem was, and `\2` preserves the suffix /ʃən/. The lookbehind correctly leaves *accreditation* (where the suffix is /eɪʃən/) alone. Notice, though, that the pattern has no way of knowing that /ʃən/ in *fashion* and *passion* isn't a suffix at all, so it restores a spurious /t/ there; a purely phonological rule like this one overapplies without morphological information.
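As an aside, the same substitution can be written with named groups, which some find easier to read than numbered backreferences. This is a stylistic variant of the rule above, not a change to it:

```{python}
import re

# (?P<name>...) names a capturing group; \g<name> refers to it
# in the replacement string.
regex_named = r'(?P<stem>.+?)(?<!eɪ)(?P<suffix>ʃən)$'

print(re.sub(regex_named, r'\g<stem>t\g<suffix>', 'æbstɹækʃən'))    # /t/ restored
print(re.sub(regex_named, r'\g<stem>t\g<suffix>', 'ækɹɛdɪteɪʃən'))  # left alone
```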
## Going forward: verb → noun
We can also use `re.sub` to go in the forward direction. Many of the /ʃən/ nouns in our lexicon are derived from verbs ending in /t/—*abstract* → *abstraction*, *attract* → *attraction*, and so on. The morphophonological rule is: delete the final /t/ of the verb and append /ʃən/. Let's implement that rule and check which verb-noun pairs it correctly predicts.
```{python}
regex_nominalize = r'(.+)t$'

noun_forms = {
    w: "".join(ipa) for w, ipa in entries.items()
    if "".join(ipa).endswith('ʃən')
}

n_matches = 0
for w, ipa in entries.items():
    joined = "".join(ipa)
    if re.fullmatch(regex_nominalize, joined):
        predicted_noun = re.sub(regex_nominalize, r'\1ʃən', joined)
        for noun_w, noun_ipa in noun_forms.items():
            if noun_ipa == predicted_noun:
                if n_matches < 15:
                    n_matches += 1
                    print(f"{joined:15s} ({w:15s}) → {predicted_noun:15s} ({noun_w})")
                break
    if n_matches >= 15:
        break
```
The substitution `re.sub(r'(.+)t$', r'\1ʃən', stem)` is doing exactly what the morphophonological rule says: it captures everything before the final /t/, drops the /t/, and appends /ʃən/. And because we check whether the predicted noun form actually exists in the CMU dictionary, we're verifying that the rule is producing real words.
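Stripped of the dictionary lookup, the core substitution is easy to check on a single form (using the transcription of *abstract* from earlier):

```{python}
import re

# Delete the final /t/ and append /ʃən/: abstract → abstraction.
print(re.sub(r'(.+)t$', r'\1ʃən', 'æbstɹækt'))  # æbstɹækʃən
```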
## Substitution with functions
For more complex transformations, we can pass a function as the replacement argument to `re.sub`. The function receives the `Match` object and returns the replacement string.
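Here is a minimal sketch of the mechanics, independent of the linguistic application: the function is called once per match, and whatever it returns replaces that match.

```{python}
import re

def double(m: re.Match) -> str:
    # m.group(0) is the full text of the match.
    return m.group(0) * 2

# Double every vowel letter in the string.
print(re.sub(r'[aeiou]', double, 'banana'))  # baanaanaa
```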
We noted in the previous section that some verbs form their /-tion/ noun with /eɪʃən/ rather than /ʃən/—*accredit* → *accreditation*, for instance. The conditioning factor is partly phonological: /eɪʃən/ tends to occur when the verb ends in a consonant cluster rather than a bare /t/. We can model this with a function that inspects the final cluster and chooses the appropriate allomorph.
```{python}
consonants = 'bcdfghjklmnŋpɹstvwzðθʃʒ'

def nominalize(m: re.Match) -> str:
    """Derive a -tion noun from a verb stem.

    Parameters
    ----------
    m : re.Match
        The match object from the verb pattern.

    Returns
    -------
    str
        The predicted noun form.
    """
    stem = m.group(1)
    final_cluster = m.group(2)

    if len(final_cluster) > 1:
        return stem + 'eɪʃən'
    else:
        return stem + 'ʃən'

regex_verb = f'(.+?)([{consonants}]?t)$'

n_matches = 0
for w, ipa in entries.items():
    joined = "".join(ipa)
    m = re.fullmatch(regex_verb, joined)
    if m:
        predicted = re.sub(regex_verb, nominalize, joined)
        for noun_w, noun_ipa in entries.items():
            if "".join(noun_ipa) == predicted:
                if n_matches < 15:
                    n_matches += 1
                    print(f"{joined:15s} ({w:15s}) → {predicted:20s} ({noun_w})")
                break
    if n_matches >= 15:
        break
```
Passing a function as the replacement lets us implement conditional logic that depends on the phonological shape of the match—here, whether the verb ends in a single /t/ or a consonant-plus-/t/ cluster.
## Connection to phonological rules and transducers
The substitutions we've been writing here are closely related to the phonological rule notation introduced by @chomsky1968sound in *The Sound Pattern of English* (SPE). The SPE notation for a phonological rule is:
$$A \rightarrow B / C\_D$$
which reads: "$A$ becomes $B$ when preceded by $C$ and followed by $D$." Our /t/-deletion rule could be written as:
$$\text{t} \rightarrow \emptyset / C\_\#$$
where $C$ is any consonant and $\#$ is a word boundary. In `re.sub` terms, this is `re.sub(r'([consonants])t$', r'\1', string)`.
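Filling in the consonant class from earlier, the rule can be packaged as a one-line function. This is just a sketch to make the rule-as-substitution idea concrete; the consonant inventory is the same string used above:

```{python}
import re

CONSONANTS = 'bcdfghjklmnŋpɹstvwzðθʃʒ'

def t_deletion(string: str) -> str:
    """Apply the SPE-style rule t → ∅ / C_# as a regex substitution."""
    return re.sub(f'([{CONSONANTS}])t$', r'\1', string)

print(t_deletion('lɪst'))  # /t/ deleted after a consonant
print(t_deletion('sɪt'))   # unchanged: /t/ follows a vowel
```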
As we'll see in the [phonological patterns module](../phonological-patterns/phonological-rules-as-fsas/finite-state-transducers.qmd), every SPE-style rule of this form can be compiled into a *finite state transducer*—a machine that reads an input string and produces an output string. The pattern in `re.sub` defines when the rule applies; the replacement defines what it does. When both the pattern and the replacement are regular, the resulting transformation is a *regular relation*—exactly the class of relations computable by finite state transducers. We'll formalize this connection when we get to transducers, but I want to flag it now because `re.sub` is where most people first encounter this idea, even if they don't realize it at the time.
## `re.subn`
One practical note: `re.subn` works like `re.sub` but also returns the number of substitutions made, which is useful when you want to know whether a rule applied.
```{python}
result, n = re.subn(f'([{consonants}])t$', r'\1', 'wɛst')
print(f"Result: {result}, substitutions made: {n}")

result, n = re.subn(f'([{consonants}])t$', r'\1', 'kæt')
print(f"Result: {result}, substitutions made: {n}")
```
The first call deleted the /t/ in /wɛst/ (one substitution); the second left /kæt/ alone (the /t/ is preceded by a vowel, not a consonant, so the rule didn’t apply).
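Relatedly, both `re.sub` and `re.subn` accept a `count` argument that caps the number of substitutions, which is handy for modeling rules that apply only once rather than iterating across the whole string:

```{python}
import re

# With count=1, only the leftmost match is replaced.
print(re.sub(r'a', 'ə', 'banana', count=1))  # bənana
print(re.sub(r'a', 'ə', 'banana'))           # bənənə
```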