Groups and Greediness

One of the major uses for regular expressions is for extracting substrings from a string. This can be done with groups. For instance, suppose I want all of stems that have the morpheme with the form /ʃən/.

Load IPA representation of CMU Pronouncing Dictionary

with open("cmudict-ipa") as f:
    entries: list[tuple[str, str]] = [
        l.strip().split(",") for l in f
    ]
    entries: dict[str, list[str]] = {
        w: ipa.split() for w, ipa in entries
    }

import re

regex_ʃən = '(.+)ʃən'

n_matches = 0

for w, ipa in entries.items():
    if re.fullmatch(regex_ʃən, "".join(ipa)):
        if n_matches < 30:
            n_matches += 1
            print("".join(ipa), re.findall(regex_ʃən, "".join(ipa)), f"({w})")
        else:
            print("...")
            break

əbɹivieɪʃən ['əbɹivieɪ'] (abbreviation)
æbdɪkeɪʃən ['æbdɪkeɪ'] (abdication)
æbdəkʃən ['æbdək'] (abduction)
əbdəkʃən ['əbdək'] (abduction(1))
æbɝeɪʃən ['æbɝeɪ'] (aberration)
əbleɪʃən ['əbleɪ'] (ablation)
əbluʃən ['əblu'] (ablution)
æbnɛgeɪʃən ['æbnɛgeɪ'] (abnegation)
æbəlɪʃən ['æbəlɪ'] (abolition)
əbɑməneɪʃən ['əbɑməneɪ'] (abomination)
əbɔɹʃən ['əbɔɹ'] (abortion)
æbɹəgeɪʃən ['æbɹəgeɪ'] (abrogation)
æbsəluʃən ['æbsəlu'] (absolution)
əbzɔɹpʃən ['əbzɔɹp'] (absorption)
əbsɔɹpʃən ['əbsɔɹp'] (absorption(1))
əbstɛntʃən ['əbstɛnt'] (abstention)
æbstɛntʃən ['æbstɛnt'] (abstention(1))
æbstɹækʃən ['æbstɹæk'] (abstraction)
ækədəmɪʃən ['ækədəmɪ'] (academician)
æksɛlɝeɪʃən ['æksɛlɝeɪ'] (acceleration)
əksɛʃən ['əksɛ'] (accession)
ækləmeɪʃən ['ækləmeɪ'] (acclamation)
ækləmeɪʃən ['ækləmeɪ'] (acclimation)
əkɑmədeɪʃən ['əkɑmədeɪ'] (accommodation)
əkɹɛdəteɪʃən ['əkɹɛdəteɪ'] (accreditation)
əkɹiʃən ['əkɹi'] (accretion)
əkjumjəleɪʃən ['əkjumjəleɪ'] (accumulation)
ækjəzeɪʃən ['ækjəzeɪ'] (accusation)
ækjuzeɪʃən ['ækjuzeɪ'] (accusation(1))
əsɪdəfəkeɪʃən ['əsɪdəfəkeɪ'] (acidification)
...

This works to some extent, but notice that it will capture cases where /ʃən/ is not a morpheme. For instance, the word passion will get matched. It will also return the wrong stem when the morpheme is realized as /eɪʃən/, such as accreditation.

re.findall(regex_ʃən, "".join(entries["passion"])), re.findall(regex_ʃən, "".join(entries["accreditation"]))

(['pæ'], ['əkɹɛdəteɪ'])

To handle the second, we might look for /eɪʃən/ and /ʃən/. We can use the quantifier ? to say that /eɪ/ is optional. Because it is a digraph, we need to surround it with parentheses.

regex_ʃən = '(.+)(eɪ)?ʃən'

n_matches = 0

for w, ipa in entries.items():
    if re.fullmatch(regex_ʃən, "".join(ipa)):
        if n_matches < 30:
            n_matches += 1
            print("".join(ipa), re.findall(regex_ʃən, "".join(ipa)), f"({w})")
        else:
            print("...")
            break

əbɹivieɪʃən [('əbɹivieɪ', '')] (abbreviation)
æbdɪkeɪʃən [('æbdɪkeɪ', '')] (abdication)
æbdəkʃən [('æbdək', '')] (abduction)
əbdəkʃən [('əbdək', '')] (abduction(1))
æbɝeɪʃən [('æbɝeɪ', '')] (aberration)
əbleɪʃən [('əbleɪ', '')] (ablation)
əbluʃən [('əblu', '')] (ablution)
æbnɛgeɪʃən [('æbnɛgeɪ', '')] (abnegation)
æbəlɪʃən [('æbəlɪ', '')] (abolition)
əbɑməneɪʃən [('əbɑməneɪ', '')] (abomination)
əbɔɹʃən [('əbɔɹ', '')] (abortion)
æbɹəgeɪʃən [('æbɹəgeɪ', '')] (abrogation)
æbsəluʃən [('æbsəlu', '')] (absolution)
əbzɔɹpʃən [('əbzɔɹp', '')] (absorption)
əbsɔɹpʃən [('əbsɔɹp', '')] (absorption(1))
əbstɛntʃən [('əbstɛnt', '')] (abstention)
æbstɛntʃən [('æbstɛnt', '')] (abstention(1))
æbstɹækʃən [('æbstɹæk', '')] (abstraction)
ækədəmɪʃən [('ækədəmɪ', '')] (academician)
æksɛlɝeɪʃən [('æksɛlɝeɪ', '')] (acceleration)
əksɛʃən [('əksɛ', '')] (accession)
ækləmeɪʃən [('ækləmeɪ', '')] (acclamation)
ækləmeɪʃən [('ækləmeɪ', '')] (acclimation)
əkɑmədeɪʃən [('əkɑmədeɪ', '')] (accommodation)
əkɹɛdəteɪʃən [('əkɹɛdəteɪ', '')] (accreditation)
əkɹiʃən [('əkɹi', '')] (accretion)
əkjumjəleɪʃən [('əkjumjəleɪ', '')] (accumulation)
ækjəzeɪʃən [('ækjəzeɪ', '')] (accusation)
ækjuzeɪʃən [('ækjuzeɪ', '')] (accusation(1))
əsɪdəfəkeɪʃən [('əsɪdəfəkeɪ', '')] (acidification)
...

The problem is that this makes Python think we want to capture it. So what we need is a non-capturing group, which we get by putting ?: after the open parenthesis.

regex_ʃən = '(.+)(?:eɪ)?ʃən'

n_matches = 0

for w, ipa in entries.items():
    if re.fullmatch(regex_ʃən, "".join(ipa)):
        if n_matches < 30:
            n_matches += 1
            print("".join(ipa), re.findall(regex_ʃən, "".join(ipa)), f"({w})")
        else:
            print("...")
            break

əbɹivieɪʃən ['əbɹivieɪ'] (abbreviation)
æbdɪkeɪʃən ['æbdɪkeɪ'] (abdication)
æbdəkʃən ['æbdək'] (abduction)
əbdəkʃən ['əbdək'] (abduction(1))
æbɝeɪʃən ['æbɝeɪ'] (aberration)
əbleɪʃən ['əbleɪ'] (ablation)
əbluʃən ['əblu'] (ablution)
æbnɛgeɪʃən ['æbnɛgeɪ'] (abnegation)
æbəlɪʃən ['æbəlɪ'] (abolition)
əbɑməneɪʃən ['əbɑməneɪ'] (abomination)
əbɔɹʃən ['əbɔɹ'] (abortion)
æbɹəgeɪʃən ['æbɹəgeɪ'] (abrogation)
æbsəluʃən ['æbsəlu'] (absolution)
əbzɔɹpʃən ['əbzɔɹp'] (absorption)
əbsɔɹpʃən ['əbsɔɹp'] (absorption(1))
əbstɛntʃən ['əbstɛnt'] (abstention)
æbstɛntʃən ['æbstɛnt'] (abstention(1))
æbstɹækʃən ['æbstɹæk'] (abstraction)
ækədəmɪʃən ['ækədəmɪ'] (academician)
æksɛlɝeɪʃən ['æksɛlɝeɪ'] (acceleration)
əksɛʃən ['əksɛ'] (accession)
ækləmeɪʃən ['ækləmeɪ'] (acclamation)
ækləmeɪʃən ['ækləmeɪ'] (acclimation)
əkɑmədeɪʃən ['əkɑmədeɪ'] (accommodation)
əkɹɛdəteɪʃən ['əkɹɛdəteɪ'] (accreditation)
əkɹiʃən ['əkɹi'] (accretion)
əkjumjəleɪʃən ['əkjumjəleɪ'] (accumulation)
ækjəzeɪʃən ['ækjəzeɪ'] (accusation)
ækjuzeɪʃən ['ækjuzeɪ'] (accusation(1))
əsɪdəfəkeɪʃən ['əsɪdəfəkeɪ'] (acidification)
...

It still seems to be capturing /eɪ/ in accreditation. What gives? The reason this is happening is that quantifiers like + are greedy by default. That means they will match as much as they can. And because /eɪ/ is optional, (.+) can match it.

To make sure it doesn’t match it if it doesn’t need to, we can make the quantifier non-greedy by appending a ?.

regex_ʃən = '(.+?)(?:eɪ)?ʃən'

n_matches = 0

for w, ipa in entries.items():
    if re.fullmatch(regex_ʃən, "".join(ipa)):
        if n_matches < 30:
            n_matches += 1
            print("".join(ipa), re.findall(regex_ʃən, "".join(ipa)), f"({w})")
        else:
            print("...")
            break

əbɹivieɪʃən ['əbɹivi'] (abbreviation)
æbdɪkeɪʃən ['æbdɪk'] (abdication)
æbdəkʃən ['æbdək'] (abduction)
əbdəkʃən ['əbdək'] (abduction(1))
æbɝeɪʃən ['æbɝ'] (aberration)
əbleɪʃən ['əbl'] (ablation)
əbluʃən ['əblu'] (ablution)
æbnɛgeɪʃən ['æbnɛg'] (abnegation)
æbəlɪʃən ['æbəlɪ'] (abolition)
əbɑməneɪʃən ['əbɑmən'] (abomination)
əbɔɹʃən ['əbɔɹ'] (abortion)
æbɹəgeɪʃən ['æbɹəg'] (abrogation)
æbsəluʃən ['æbsəlu'] (absolution)
əbzɔɹpʃən ['əbzɔɹp'] (absorption)
əbsɔɹpʃən ['əbsɔɹp'] (absorption(1))
əbstɛntʃən ['əbstɛnt'] (abstention)
æbstɛntʃən ['æbstɛnt'] (abstention(1))
æbstɹækʃən ['æbstɹæk'] (abstraction)
ækədəmɪʃən ['ækədəmɪ'] (academician)
æksɛlɝeɪʃən ['æksɛlɝ'] (acceleration)
əksɛʃən ['əksɛ'] (accession)
ækləmeɪʃən ['ækləm'] (acclamation)
ækləmeɪʃən ['ækləm'] (acclimation)
əkɑmədeɪʃən ['əkɑməd'] (accommodation)
əkɹɛdəteɪʃən ['əkɹɛdət'] (accreditation)
əkɹiʃən ['əkɹi'] (accretion)
əkjumjəleɪʃən ['əkjumjəl'] (accumulation)
ækjəzeɪʃən ['ækjəz'] (accusation)
ækjuzeɪʃən ['ækjuz'] (accusation(1))
əsɪdəfəkeɪʃən ['əsɪdəfək'] (acidification)
...

Okay. So how do we deal with cases where /ʃən/ is not a morpheme? One thing we can do is to look for stems that show up without /ʃən/. This will exclude passion, since /pæ/ is not a word.

regex_ʃən = '(.+?)(?:eɪ)?ʃən'

n_matches = 0
seen = set()

for w1, ipa1 in entries.items():
    possible_morpheme = re.findall(regex_ʃən, "".join(ipa1))
    if possible_morpheme:
        for w2, ipa2 in entries.items():
            if re.fullmatch(possible_morpheme[0], "".join(ipa2)):
                if n_matches < 20 and "".join(ipa2) not in seen:
                    n_matches += 1
                    seen |= {"".join(ipa2)}
                    print("".join(ipa2), f"({w2})", "+", "ʃən", "=", "".join(ipa1), f"({w1})")
                else:
                    break

    if n_matches >= 20:   
        print("...")
        break

əbɔɹ (abor) + ʃən = əbɔɹʃən (abortion)
əkɹi (acree) + ʃən = əkɹiʃən (accretion)
æk (ack) + ʃən = ækʃən (action)
ædɝ (adder) + ʃən = ædɝeɪʃən (adoration)
ædʒəl (agile) + ʃən = ædʒəleɪʃən (adulation)
eɪliən (alien) + ʃən = eɪliəneɪʃən (alienation)
ɔltɝ (altar) + ʃən = ɔltɝeɪʃən (alteration)
əmælgəm (amalgam) + ʃən = əmælgəmeɪʃən (amalgamation)
eɪnt (ain't) + ʃən = eɪntʃənt (ancient)
eɪn (aine) + ʃən = eɪnʃənt (ancient(1))
ænəm (annum) + ʃən = ænəmeɪʃən (animation)
ænɛks (annex) + ʃən = ænɛkseɪʃən (annexation)
æn (ahn) + ʃən = ænʃən (anshan)
æpəl (appel) + ʃən = æpəleɪʃən (appalachian(1))
æspɝ (asper) + ʃən = æspɝeɪʃən (aspiration)
əsæsən (assassin) + ʃən = əsæsəneɪʃən (assassination)
ɑk (och) + ʃən = ɑkʃən (auction)
ɔtəm (autumn) + ʃən = ɔtəmeɪʃən (automation)
eɪvi (av) + ʃən = eɪvieɪʃən (aviacion)
bæt (bat) + ʃən = bætʃənd (bachand)
...

An issue here is that /ʃən/ doesn’t simply get appended to a stem. There is an additional phonological process that deletes a portion of that stem–e.g. /æbstɹækt/ + /ʃən/ is /æbstɹækʃən/, not /æbstɹæktʃən/. So we need to consider cases where a final consonant–usually a t–was deleted. But we need to make sure we do so only when the morpheme wasn’t realized as /eɪʃən/, so we need to go back to capturing it.

regex_ʃən = '(.+?)(eɪ)?ʃən'

n_matches = 0

seen = set()

for w1, ipa1 in entries.items():
    possible_morpheme = re.findall(regex_ʃən, "".join(ipa1))
    if possible_morpheme:
        for w2, ipa2 in entries.items():
            if possible_morpheme[0][1]:
                regex_stem = possible_morpheme[0][0]
            else:
                regex_stem = possible_morpheme[0][0] + "t"
            
            if re.fullmatch(regex_stem, "".join(ipa2)):
                if n_matches < 20 and "".join(ipa2) not in seen:
                    n_matches += 1
                    seen |= {"".join(ipa2)}
                    print("".join(ipa2), f"({w2})", "+", "ʃən", "=", "".join(ipa1), f"({w1})")
                else:
                    break

    if n_matches >= 20:   
        print("...")
        break

æbdəkt (abduct) + ʃən = æbdəkʃən (abduction)
əbɔɹt (abort) + ʃən = əbɔɹʃən (abortion)
æbsəlut (absolut) + ʃən = æbsəluʃən (absolution)
æbstɹækt (abstract) + ʃən = æbstɹækʃən (abstraction)
ækt (act) + ʃən = ækʃən (action)
ədɪkt (addict) + ʃən = ədɪkʃən (addiction)
ədmɪt (admit) + ʃən = ədmɪʃən (admission(1))
ədɑpt (adopt) + ʃən = ədɑpʃən (adoption)
ædɝ (adder) + ʃən = ædɝeɪʃən (adoration)
ædʒəl (agile) + ʃən = ædʒəleɪʃən (adulation)
əfɛkt (affect) + ʃən = əfɛkʃən (affection)
əflɪkt (afflict) + ʃən = əflɪkʃən (affliction)
eɪliən (alien) + ʃən = eɪliəneɪʃən (alienation)
ɔltɝ (altar) + ʃən = ɔltɝeɪʃən (alteration)
əmælgəm (amalgam) + ʃən = əmælgəmeɪʃən (amalgamation)
eɪnt (ain't) + ʃən = eɪnʃənt (ancient(1))
ænəm (annum) + ʃən = ænəmeɪʃən (animation)
ænɛks (annex) + ʃən = ænɛkseɪʃən (annexation)
ænt (ant) + ʃən = ænʃən (anshan)
æpəl (appel) + ʃən = æpəleɪʃən (appalachian(1))
...

There’s still some wonky stuff in here–e.g. ancient coming from ain’t and ashen coming from at–but we’re getting closer. We can’t really deal with cases like ashen coming from at, but we can deal with ancient coming from ain’t, which reveals a behavior of re.findall: it functions like re.match, rather than re.fullmatch, in that it matches the beginning of a string. If we want it to match the entire string, we have to explicitly specify that in the regular expression. To do this, we can use a $, which means “end of string”.¹

regex_ʃən = '(.+?)(eɪ)?ʃən$'

n_matches = 0

seen = set()

for w1, ipa1 in entries.items():
    possible_morpheme = re.findall(regex_ʃən, "".join(ipa1))
    if possible_morpheme:
        for w2, ipa2 in entries.items():
            if possible_morpheme[0][1]:
                regex_stem = possible_morpheme[0][0]
            else:
                regex_stem = possible_morpheme[0][0] + "t"
            
            if re.fullmatch(regex_stem, "".join(ipa2)):
                if n_matches < 20 and "".join(ipa2) not in seen:
                    n_matches += 1
                    seen |= {"".join(ipa2)}
                    print("".join(ipa2), f"({w2})", "+", "ʃən", "=", "".join(ipa1), f"({w1})")
                else:
                    break

    if n_matches >= 20:   
        print("...")
        break

æbdəkt (abduct) + ʃən = æbdəkʃən (abduction)
əbɔɹt (abort) + ʃən = əbɔɹʃən (abortion)
æbsəlut (absolut) + ʃən = æbsəluʃən (absolution)
æbstɹækt (abstract) + ʃən = æbstɹækʃən (abstraction)
ækt (act) + ʃən = ækʃən (action)
ədɪkt (addict) + ʃən = ədɪkʃən (addiction)
ədmɪt (admit) + ʃən = ədmɪʃən (admission(1))
ədɑpt (adopt) + ʃən = ədɑpʃən (adoption)
ædɝ (adder) + ʃən = ædɝeɪʃən (adoration)
ædʒəl (agile) + ʃən = ædʒəleɪʃən (adulation)
əfɛkt (affect) + ʃən = əfɛkʃən (affection)
əflɪkt (afflict) + ʃən = əflɪkʃən (affliction)
eɪliən (alien) + ʃən = eɪliəneɪʃən (alienation)
ɔltɝ (altar) + ʃən = ɔltɝeɪʃən (alteration)
əmælgəm (amalgam) + ʃən = əmælgəmeɪʃən (amalgamation)
ænəm (annum) + ʃən = ænəmeɪʃən (animation)
ænɛks (annex) + ʃən = ænɛkseɪʃən (annexation)
ænt (ant) + ʃən = ænʃən (anshan)
æpəl (appel) + ʃən = æpəleɪʃən (appalachian(1))
əsɛnt (ascent) + ʃən = əsɛnʃən (ascension)
...

To get much better than this, we’d need to start matching on the orthographic representation as well–e.g. matching the ion at the end of the orthographic representation of the word, thus filtering things like /æt/ + /ʃən/ = /æʃən/. One thing we’ll still miss are cases like adoration, where there is a vowel quality change (which is consequently why we get adder + ion = adoration currently). To handle those cases, we would need to account for the conditions under which vowel quality changes, which we could do in principle using regular expressions but which I won’t do here.

Footnotes

The dual of $ for the beginning of the string is ^. Note that this looks like the negation symbol we saw earlier. It is different in that that symbol must be preceded by [ to be interpreted as negation. Anywhere else ^ means the beginning of a string. That in turn means that putting a bare ^ anywhere besides the beginning of a regular expression is going to result in a regular expression that evaluates to the empty set.↩︎