We can represent all simple regular expressions according to our formal definition, but in certain cases, doing so would be tedious. For instance, suppose we want to represent the set of all English phonemes \(\Sigma\). Using our formal definition, we would need to list out every phoneme, joined by \(\cup\): \((\text{i} \cup (\text{ɝ} \cup (\text{a} \cup (\text{ɪ} \cup \ldots))))\).
To make this less tedious, Python (like many other languages) introduces additional special characters into the definition of \(R(\Sigma)\). The most basic is the wildcard ., which matches any single character (alphanumeric or otherwise) except the newline \n.
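To make the wildcard's behavior concrete, here is a minimal sketch using only the standard re module. The list of candidate strings is made up for illustration, and the re.DOTALL flag is shown only as the standard way to lift the newline exception.

```python
import re

# Hypothetical test strings: single characters, a newline, and two
# strings of the wrong length.
candidates = ["a", "ʃ", "7", " ", "\n", "ab", ""]

# By default, "." matches any single character except the newline.
default_hits = [s for s in candidates if re.fullmatch(".", s)]

# With re.DOTALL, "." matches the newline as well.
dotall_hits = [s for s in candidates if re.fullmatch(".", s, flags=re.DOTALL)]

print(default_hits)
print(dotall_hits)
```

Because re.fullmatch requires the whole string to match, the empty string and the two-character string are rejected in both cases.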
Load IPA representation of CMU Pronouncing Dictionary
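with open("cmudict-ipa") as f:
    # Each line is assumed to have the form "word,ipa", with phonemes separated by spaces
    pairs: list[list[str]] = [l.strip().split(",") for l in f]

# Map each word to its list of IPA phonemes
entries: dict[str, list[str]] = {w: ipa.split() for w, ipa in pairs}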
import re

regex_dot_bstɹ_dot_kʃən = '.bstɹ.kʃən'

for w, ipa in entries.items():
    # Join the phonemes into one string and require the whole string to match
    if re.fullmatch(regex_dot_bstɹ_dot_kʃən, "".join(ipa)):
        print("".join(ipa), f"({w})")
æbstɹækʃən (abstraction)
əbstɹəkʃən (obstruction)
If you want to match the . itself (or any special character we introduce below), you need to escape it with a preceding backslash (\.).
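As a small sketch of the difference between the wildcard and an escaped period, the example below runs both patterns over a few made-up strings. It also uses re.escape, a standard-library helper that escapes the special characters in a string.

```python
import re

# Hypothetical strings to test against.
strings = ["a.b", "axb", "a\nb"]

# Unescaped: "." is the wildcard, so it matches "." and "x" (but not "\n").
wildcard_hits = [s for s in strings if re.fullmatch(r"a.b", s)]

# Escaped: r"\." matches only a literal period.
literal_hits = [s for s in strings if re.fullmatch(r"a\.b", s)]

# re.escape builds the escaped pattern from a literal string.
escaped = re.escape("a.b")

print(wildcard_hits)
print(literal_hits)
print(escaped)
```

Writing patterns as raw strings (the r"..." prefix) keeps Python from interpreting the backslash before the regular expression engine sees it.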
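Besides ., we can also use character ranges to target more specific sets, like upper- and lower-case alphabetic characters ([A-Za-z]), lower-case alphabetic characters ([a-z]), and numerals ([0-9]). Ranges follow character code order, so [A-z] is not a shorthand for [A-Za-z]: it also matches the six characters that fall between Z and a, including [ and _.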
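The following sketch illustrates how ranges behave, including why [A-z] covers more than the letters. The token lists are invented for illustration.

```python
import re

# The six characters that sit between "Z" and "a" in code-point order.
between = "[\\]^_`"

# [A-z] spans that whole stretch, so it matches all six of them...
stray = [c for c in between if re.fullmatch(r"[A-z]", c)]

# ...whereas [A-Za-z] matches none of them.
letters_only = [c for c in between if re.fullmatch(r"[A-Za-z]", c)]

# Hypothetical tokens for comparing the three ranges from the text.
tokens = ["Q", "q", "7", "_"]
upper_or_lower = [t for t in tokens if re.fullmatch(r"[A-Za-z]", t)]
lower = [t for t in tokens if re.fullmatch(r"[a-z]", t)]
digits = [t for t in tokens if re.fullmatch(r"[0-9]", t)]

print(stray, letters_only)
print(upper_or_lower, lower, digits)
```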
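In addition to ranges, there are even more compact escape sequences. For instance, alphanumeric characters plus _ can be matched with \w. When matching is restricted to ASCII (the re.ASCII flag), \w is the same as the set [A-Za-z0-9_]; by default, Python's \w on string patterns also matches non-ASCII letters and digits, including IPA symbols such as ʃ.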
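To see how \w relates to an explicit set of ranges, the sketch below compares the two on a handful of made-up tokens, with and without the re.ASCII flag.

```python
import re

# Hypothetical tokens: ASCII letters, a digit, the underscore, two IPA
# symbols, and two non-word characters.
tokens = ["a", "Z", "7", "_", "ʃ", "ə", "-", " "]

# Default for str patterns: \w is Unicode-aware, so IPA letters count.
unicode_word = [t for t in tokens if re.fullmatch(r"\w", t)]

# With re.ASCII, \w is restricted to ASCII letters, digits, and "_".
ascii_word = [t for t in tokens if re.fullmatch(r"\w", t, flags=re.ASCII)]

# The explicit set of ranges, for comparison.
explicit = [t for t in tokens if re.fullmatch(r"[A-Za-z0-9_]", t)]

print(unicode_word)
print(ascii_word)
print(explicit)
```

For the phonemic transcriptions used here, the default behavior is usually the convenient one, since most IPA symbols are letters outside the ASCII range.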