Wild cards and character ranges

We can represent all simple regular expressions according to our formal definition, but in certain cases, doing so would be tedious. For instance, suppose we want to represent the set of all English phonemes \(\Sigma\). Using our formal definition, we would need to list every phoneme, joined by \(\cup\): \((\text{i} \cup (\text{ɝ} \cup (\text{a} \cup (\text{ɪ} \cup \ldots))))\).

To make this less tedious, Python (like many other languages) introduces additional special characters into the definition of \(R(\Sigma)\). The most basic is the wildcard ., which by default matches any single character (alphanumeric or otherwise) except the newline \n.

Load IPA representation of CMU Pronouncing Dictionary

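with open("cmudict-ipa") as f:
    # Each line is assumed to have the form "word,ipa",
    # with the IPA phonemes separated by spaces
    pairs: list[list[str]] = [
        l.strip().split(",") for l in f
    ]
    # Map each word to its list of phonemes
    entries: dict[str, list[str]] = {
        w: ipa.split() for w, ipa in pairs
    }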

import re

regex_dot_bstɹ_dot_kʃən = '.bstɹ.kʃən'

for w, ipa in entries.items():
    # Print the word followed by its transcription when the whole transcription matches
    if re.fullmatch(regex_dot_bstɹ_dot_kʃən, "".join(ipa)):
        print(w, "".join(ipa))

abstraction æbstɹækʃən
obstruction əbstɹəkʃən
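The behavior of . can also be checked without the dictionary file. The following is a minimal sketch using a few hand-picked test strings (the names regex_dot, candidates, dot_results, and dotall_newline are introduced here purely for illustration): it checks that . consumes exactly one character of any kind, and that the newline is excluded unless the re.DOTALL flag is passed.

```python
import re

regex_dot = '.bstɹækʃən'

# Hand-picked strings: one phoneme, one punctuation mark, nothing,
# two characters, and a newline in the wildcard position
candidates = ['æbstɹækʃən', '!bstɹækʃən', 'bstɹækʃən', 'æəbstɹækʃən', '\nbstɹækʃən']

# True where the whole string matches the pattern
dot_results = {s: re.fullmatch(regex_dot, s) is not None for s in candidates}

# With re.DOTALL, '.' is allowed to match the newline as well
dotall_newline = re.fullmatch(regex_dot, '\nbstɹækʃən', flags=re.DOTALL) is not None

for s, matched in dot_results.items():
    print(repr(s), matched)
print('newline with re.DOTALL:', dotall_newline)
```

Only the first two candidates match under the default flags: the third has no character for . to consume, the fourth has one too many, and the fifth puts a newline in the wildcard position.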

If you want to match the . itself (or any special character we introduce below), you need to escape it with a backslash.

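# Raw string, so that Python passes the backslash through to the regex engine
regex_period_bstɹækʃən = r'\.bstɹækʃən'
string_æbstɹækʃən = 'æbstɹækʃən'

re.fullmatch(regex_period_bstɹækʃən, '.' + string_æbstɹækʃən[1:])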

<re.Match object; span=(0, 10), match='.bstɹækʃən'>
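When the literal text comes from somewhere else (user input, say), escaping it by hand is error-prone. One option is the standard library's re.escape, which adds the backslashes for us. The sketch below uses illustrative variable names (literal, escaped, and the three result variables) that do not appear elsewhere in these notes.

```python
import re

literal = '.bstɹækʃən'

# re.escape returns a pattern in which regex special characters are escaped,
# so the '.' is treated as a literal period rather than a wildcard
escaped = re.escape(literal)

# The escaped pattern matches the literal string itself...
escaped_matches_literal = re.fullmatch(escaped, literal) is not None
# ...but not a string with some other character in first position
escaped_matches_other = re.fullmatch(escaped, 'æbstɹækʃən') is not None
# The unescaped pattern treats '.' as a wildcard, so it does match
unescaped_matches_other = re.fullmatch(literal, 'æbstɹækʃən') is not None

print(escaped_matches_literal, escaped_matches_other, unescaped_matches_other)
```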

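Besides ., we can also use character ranges to target more specific sets, like upper- and lower-case alphabetic characters ([A-Za-z]), lower-case alphabetic characters ([a-z]), and numerals ([0-9]). Note that [A-z] is not a safe shorthand for the first of these: it also covers the six punctuation characters that fall between Z and a in ASCII order.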

regex_numeric_bstɹækʃən = '[0-9]bstɹækʃən'
string_4bstɹækʃən = '4bstɹækʃən'

re.fullmatch(regex_numeric_bstɹækʃən, string_4bstɹækʃən)

<re.Match object; span=(0, 10), match='4bstɹækʃən'>
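A range is interpreted over character code points, so its endpoints matter. As a small sketch (the variable names between, matched_by_A_to_z, and matched_by_letters are illustrative), we can list the characters that sit between Z and a and check which of two candidate "alphabetic" ranges accepts them.

```python
import re

# Characters whose code points lie strictly between 'Z' and 'a'
between = [chr(c) for c in range(ord('Z') + 1, ord('a'))]

# The single range A-z spans these characters as well as the letters
matched_by_A_to_z = [ch for ch in between if re.fullmatch('[A-z]', ch)]

# Two separate ranges, A-Z and a-z, cover only the letters
matched_by_letters = [ch for ch in between if re.fullmatch('[A-Za-z]', ch)]

print(between)
print(matched_by_A_to_z)
print(matched_by_letters)
```

The first two lists are identical and the third is empty, which is why a set restricted to alphabetic characters is usually written [A-Za-z] rather than [A-z].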

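In addition to ranges, there are even more compact escape sequences. For instance, alphanumeric characters plus _ can be matched with \w. For ASCII text this is the same as the set [A-Za-z0-9_]; for Python 3 strings, \w also matches non-ASCII letters and digits such as æ, unless the re.ASCII flag is set.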

regex_alphanumeric_bstɹækʃən = r'\wbstɹækʃən'

re.fullmatch(regex_alphanumeric_bstɹækʃən, string_æbstɹækʃən)

<re.Match object; span=(0, 10), match='æbstɹækʃən'>
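The match above succeeds because, for Python 3 string patterns, \w is Unicode-aware by default. A minimal sketch of the difference, using the standard re.ASCII flag (the names pattern, first_chars, unicode_results, and ascii_results are introduced only for this illustration):

```python
import re

pattern = r'\wbstɹækʃən'

# First characters to try in the \w position
first_chars = ['æ', 'a', '4', '_', '.']

# Default behavior for str patterns: \w matches Unicode word characters
unicode_results = {
    c: re.fullmatch(pattern, c + 'bstɹækʃən') is not None for c in first_chars
}

# With re.ASCII, \w is restricted to [a-zA-Z0-9_];
# the literal IPA characters in the pattern are unaffected
ascii_results = {
    c: re.fullmatch(pattern, c + 'bstɹækʃən', flags=re.ASCII) is not None
    for c in first_chars
}

print(unicode_results)
print(ascii_results)
```

The two dictionaries differ only for æ, which counts as a word character by default but not under re.ASCII. The period is rejected in both cases.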