Quantifiers

Note that ., the character ranges, and the escape characters each match only a single character, so to match more than one character we need to repeat whichever of them we are interested in.

Load IPA representation of CMU Pronouncing Dictionary

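with open("cmudict-ipa") as f:
    # each line is split at the comma into a word and its transcription
    rows: list[list[str]] = [
        line.strip().split(",") for line in f
    ]
    # map each word to its list of IPA symbols
    entries: dict[str, list[str]] = {
        w: ipa.split() for w, ipa in rows
    }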

import re

regex_dot_stɹ_dot_kʃən = '.bstɹ.kʃən'
regex_dot_dot_tɹ_dot_kʃən = '..stɹ.kʃən'

print(regex_dot_stɹ_dot_kʃən, "matches:")
print()

for w, ipa in entries.items():    
    if re.fullmatch(regex_dot_stɹ_dot_kʃən, "".join(ipa)):
        print(regex_dot_stɹ_dot_kʃən, "matches", "".join(ipa), f"({w})")

print()

print(regex_dot_dot_tɹ_dot_kʃən, "matches:")
print()

for w, ipa in entries.items():
    if re.fullmatch(regex_dot_dot_tɹ_dot_kʃən, "".join(ipa)):
        print("".join(ipa), f"({w})")

.bstɹ.kʃən matches:

.bstɹ.kʃən matches æbstɹækʃən (abstraction)
.bstɹ.kʃən matches əbstɹəkʃən (obstruction)

..stɹ.kʃən matches:

æbstɹækʃən (abstraction)
dɪstɹəkʃən (destruction)
dɪstɹækʃən (distraction)
ɛkstɹækʃən (extraction)
ɪnstɹəkʃən (instruction)
əbstɹəkʃən (obstruction)
ɹistɹɪkʃən (restriction)

We can avoid writing out multiple copies of the same expression by using a quantifier. There are a few different quantifiers. For instance, if you have an exact number of repetitions in mind:

regex_dot2_tɹ_dot_kʃən = '.{2}stɹ.kʃən'

for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹ_dot_kʃən, "".join(ipa)):
        print("".join(ipa), f"({w})")

æbstɹækʃən (abstraction)
dɪstɹəkʃən (destruction)
dɪstɹækʃən (distraction)
ɛkstɹækʃən (extraction)
ɪnstɹəkʃən (instruction)
əbstɹəkʃən (obstruction)
ɹistɹɪkʃən (restriction)
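
As a quick sanity check that {2} is just shorthand for writing the expression twice, here is a minimal sketch that compares the two spellings. The strings in candidates are a small hand-picked list standing in for the joined transcriptions, not the dictionary itself, and the names explicit and quantified are only illustrative.

```python
import re

# Toy strings standing in for joined IPA transcriptions (illustrative only).
candidates = ["æbstɹækʃən", "dɪstɹəkʃən", "stɹəkʃən", "kənstɹəkʃən"]

explicit = re.compile("..stɹ.kʃən")      # the two dots written out
quantified = re.compile(".{2}stɹ.kʃən")  # the same thing using {2}

# Record, for each string, whether each spelling matches the whole string.
verdicts = {
    s: (explicit.fullmatch(s) is not None, quantified.fullmatch(s) is not None)
    for s in candidates
}

for s, (with_dots, with_braces) in verdicts.items():
    print(s, with_dots, with_braces)
```

The two spellings should give the same verdict on every string: only the strings with exactly two characters before stɹ are accepted.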

Or if you have a range of numbers in mind:

regex_dot2_tɹæk_dot13 = '.{2}stɹək.{1,3}'

for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹæk_dot13, "".join(ipa)):
        print("".join(ipa), f"({w})")

dɪstɹəkt (destruct)
dɪstɹəktɪd (destructed)
dɪstɹəktɪŋ (destructing)
dɪstɹəkʃən (destruction)
dɪstɹəktɪv (destructive)
dɪstɹəkts (destructs)
ɛkstɹəkeɪt (extricate)
ɪnstɹəkt (instruct)
ɪnstɹəktəd (instructed)
ɪnstɹəktɪd (instructed(1))
ɪnstɹəktɪŋ (instructing)
ɪnstɹəkʃən (instruction)
ɪnstɹəktɪv (instructive)
ɪnstɹəktɝ (instructor)
ɪnstɹəktɝz (instructors)
ɪnstɹəkts (instructs)
əbstɹəkt (obstruct)
əbstɹəktɪd (obstructed)
əbstɹəktɪŋ (obstructing)
əbstɹəkʃən (obstruction)
əbstɹəktɪv (obstructive)
əbstɹəkts (obstructs)
ɹistɹəktʃɝ (restructure)
ənstɹəkʃɝd (unstructured)

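You can also leave off one bound. Omitting the lower bound is the same as writing 0, and omitting the upper bound places no limit on the number of repetitions: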

regex_dot2_tɹæk_dot03 = '.{2}stɹək.{,3}'
regex_dot2_tɹæk_dot1inf = '.{2}stɹək.{1,}'

print(regex_dot2_tɹæk_dot03, "matches:")
print()

n_matches = 0

for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹæk_dot03, "".join(ipa)):
        if n_matches < 10:
            n_matches += 1
            print("".join(ipa), f"({w})")
        else:
            print("...")
            break

print()
print(regex_dot2_tɹæk_dot1inf, "matches:")
print()

n_matches = 0

for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹæk_dot1inf, "".join(ipa)):
        if n_matches < 10:
            n_matches += 1
            print("".join(ipa), f"({w})")
        else:
            print("...")
            break

.{2}stɹək.{,3} matches:

dɪstɹəkt (destruct)
dɪstɹəktɪd (destructed)
dɪstɹəktɪŋ (destructing)
dɪstɹəkʃən (destruction)
dɪstɹəktɪv (destructive)
dɪstɹəkts (destructs)
ɛkstɹəkeɪt (extricate)
ɪnstɹəkt (instruct)
ɪnstɹəktəd (instructed)
ɪnstɹəktɪd (instructed(1))
...

.{2}stɹək.{1,} matches:

dɪstɹəkt (destruct)
dɪstɹəktəbəl (destructable)
dɪstɹəktɪd (destructed)
dɪstɹəktɪŋ (destructing)
dɪstɹəkʃən (destruction)
dɪstɹəktɪv (destructive)
dɪstɹəktɪvnɪs (destructiveness)
dɪstɹəkts (destructs)
ɛkstɹəkɝɪkjəlɝ (extracurricular)
ɛkstɹəkeɪt (extricate)
...
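
To see what the open bounds do in isolation, the following sketch applies them to runs of a single character rather than to the dictionary. The helper lengths_matched and the list atom_runs are made up for this illustration.

```python
import re

# Runs of the letter "a" of length 0 through 4.
atom_runs = ["", "a", "aa", "aaa", "aaaa"]

def lengths_matched(pattern: str) -> list[int]:
    """Return the lengths of the runs that the pattern matches in full."""
    return [len(s) for s in atom_runs if re.fullmatch(pattern, s)]

no_lower = lengths_matched("a{,3}")     # lower bound omitted
zero_lower = lengths_matched("a{0,3}")  # lower bound written as 0
no_upper = lengths_matched("a{1,}")     # upper bound omitted

print(no_lower)
print(zero_lower)
print(no_upper)
```

With the lower bound omitted, the empty string is accepted; with the upper bound omitted, every run of length one or more is accepted.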

Note that {,} is equivalent to *: both allow zero or more repetitions. There is also a special quantifier symbol for {1,}: +.

So if you want at least one character to come after stɹək, but don't care what follows it, you can use +.

regex_dot2_tɹæk_dotplus = '.{2}stɹək.+'

n_matches = 0

for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹæk_dotplus, "".join(ipa)):
        if n_matches < 10:
            n_matches += 1
            print("".join(ipa), f"({w})")
        else:
            print("...")
            break

dɪstɹəkt (destruct)
dɪstɹəktəbəl (destructable)
dɪstɹəktɪd (destructed)
dɪstɹəktɪŋ (destructing)
dɪstɹəkʃən (destruction)
dɪstɹəktɪv (destructive)
dɪstɹəktɪvnɪs (destructiveness)
dɪstɹəkts (destructs)
ɛkstɹəkɝɪkjəlɝ (extracurricular)
ɛkstɹəkeɪt (extricate)
...
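
The correspondence between the symbols and the brace notation can be checked directly. This is a small sketch on three toy strings (the first is a made-up string that ends right at stɹək, included so that * and + come apart); the helper matched is hypothetical.

```python
import re

# Toy strings; "dɪstɹək" is made up so that nothing follows stɹək.
words = ["dɪstɹək", "dɪstɹəkt", "dɪstɹəktɪv"]

def matched(pattern: str) -> list[str]:
    """Return the toy strings that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = matched(".{2}stɹək.*")            # zero or more trailing characters
star_braces = matched(".{2}stɹək.{0,}")  # the same, in brace notation
plus = matched(".{2}stɹək.+")            # one or more trailing characters
plus_braces = matched(".{2}stɹək.{1,}")  # the same, in brace notation

print(star)
print(plus)
```

Only the version with * accepts the string that stops at stɹək; the version with + requires at least one more character.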

Note that none of these quantifiers increase the expressive power of the regular expressions. We can always write their equivalents as a vanilla regular expression (in the sense of the formal definition we gave above); it would just be tedious in many cases.
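
To make the point about expressive power concrete, here is one possible way to spell out a bounded quantifier using only concatenation, union, and the empty string. The function expand_bounded is a hypothetical helper written for this illustration, and it assumes the atom is a single-character expression such as a or . (a longer expression would need its own grouping). Python's (?:...) plays the role of plain parentheses here.

```python
import re

def expand_bounded(atom: str, lo: int, hi: int) -> str:
    """Rewrite atom{lo,hi} with only concatenation, union (|), and the empty string."""
    # The first lo copies are obligatory, so we simply concatenate them.
    required = atom * lo
    # Each further copy is optional: either (atom followed by the rest) or nothing.
    optional = ""
    for _ in range(hi - lo):
        optional = f"(?:{atom}{optional}|)"
    return required + optional

expanded = expand_bounded("a", 1, 3)
print(expanded)

# Compare the expansion against the quantified form on runs of "a".
for n in range(6):
    s = "a" * n
    with_quantifier = re.fullmatch("a{1,3}", s) is not None
    without_quantifier = re.fullmatch(expanded, s) is not None
    print(n, with_quantifier, without_quantifier)
```

The unbounded quantifiers need the Kleene star instead of a finite union: x+ is xx*, and x{2,} is xxx*.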

Set complement

For any of these cases where we escape a lowercase alphabetic character to get a character set, the set complement can generally be obtained by using the uppercase version, e.g. \w goes to \W.

# raw string, so that the backslash in \W reaches the regex engine unchanged
regex_notw_bstɹækt = r'\Wbstɹəkt'

(re.fullmatch(regex_notw_bstɹækt, "".join(entries["obstruct"])),
 re.fullmatch(regex_notw_bstɹækt, '\n'+"".join(entries["obstruct"])[1:]))

(None, <re.Match object; span=(0, 8), match='\nbstɹəkt'>)
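
The complement relationship can also be checked one character at a time. The sketch below uses a handful of sample characters chosen for illustration and assumes the default (Unicode) matching mode for str patterns; it checks that each character is matched by exactly one member of each lowercase/uppercase pair.

```python
import re

# A few sample characters: letters, a digit, an underscore, whitespace, punctuation.
chars = ["ə", "b", "7", "_", "\n", " ", "-"]

# Raw strings keep the backslashes intact for the regex engine.
word = {c for c in chars if re.fullmatch(r"\w", c)}
non_word = {c for c in chars if re.fullmatch(r"\W", c)}

print(sorted(word))
print(sorted(non_word))

# Each character should be matched by exactly one member of each pair.
pairs = [(r"\w", r"\W"), (r"\d", r"\D"), (r"\s", r"\S")]
is_complement = all(
    (re.fullmatch(lower, c) is None) != (re.fullmatch(upper, c) is None)
    for lower, upper in pairs
    for c in chars
)
print(is_complement)
```

Note that \W accepts the newline here, which is why the second match in the example above succeeds.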

Sometimes you want the complement of a set that doesn’t have an associated escaped alphabetic character. For that you can use the same square bracket set notation but put a ^ immediately after the opening bracket.

regex_notæ_bstɹ_notæ_kt = '[^æ][^b]stɹ[^æ]kt'

for w, ipa in entries.items():    
    if re.fullmatch(regex_notæ_bstɹ_notæ_kt, "".join(ipa)):
        print("".join(ipa), f"({w})")

dɪstɹəkt (destruct)
dɪstɹɪkt (district)
ɪnstɹəkt (instruct)
mɑstɹɪkt (maastricht)
ɹistɹɪkt (restrict)

The placement of this ^ is really important, since it only has the negation interpretation directly after [. Anywhere else inside the brackets it is just an ordinary ^ character.
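
A small sketch of the difference that the position of ^ makes inside the brackets. The two patterns and their names are made up for this illustration.

```python
import re

negated = re.compile(r"[^æb]")  # ^ directly after [: any one character except æ or b
literal = re.compile(r"[æ^b]")  # ^ elsewhere in the set: one of æ, ^, or b

samples = ["æ", "b", "^", "ə"]

negated_hits = [c for c in samples if negated.fullmatch(c)]
literal_hits = [c for c in samples if literal.fullmatch(c)]

print(negated_hits)
print(literal_hits)
```

The first pattern rejects æ and b and accepts everything else in the sample, including the caret itself; the second accepts exactly æ, b, and the caret.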
