Note that ., the character ranges, and the escape characters each match only a single character, so to match more than one character we need to repeat whichever pattern we are interested in matching.
Load IPA representation of CMU Pronouncing Dictionary
with open("cmudict-ipa") as f:
    entries: list[tuple[str, str]] = [
        l.strip().split(",") for l in f
    ]

entries: dict[str, list[str]] = {
    w: ipa.split() for w, ipa in entries
}
import re

regex_dot_stɹ_dot_kʃən = '.bstɹ.kʃən'
regex_dot_dot_tɹ_dot_kʃən = '..stɹ.kʃən'

print(regex_dot_stɹ_dot_kʃən, "matches:")
print()

for w, ipa in entries.items():
    if re.fullmatch(regex_dot_stɹ_dot_kʃən, "".join(ipa)):
        print(regex_dot_stɹ_dot_kʃən, "matches", "".join(ipa), f"({w})")

print()
print(regex_dot_dot_tɹ_dot_kʃən, "matches:")
print()

for w, ipa in entries.items():
    if re.fullmatch(regex_dot_dot_tɹ_dot_kʃən, "".join(ipa)):
        print("".join(ipa), f"({w})")
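For a check that does not depend on the dictionary file, here is a minimal sketch using ordinary English spellings. The toy word list is an assumption made for illustration and is not drawn from the CMU Pronouncing Dictionary. It shows that each . consumes exactly one character when the whole string must match.

```python
import re

# Toy strings chosen for illustration; they are not dictionary entries.
words = ["at", "cat", "chat", "splat"]

# Each "." stands for exactly one character, so ".at" can only match
# 3-character strings and "..at" can only match 4-character strings.
one_dot = [w for w in words if re.fullmatch(".at", w)]
two_dots = [w for w in words if re.fullmatch("..at", w)]

print(one_dot)   # ['cat']
print(two_dots)  # ['chat']
```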
Note that none of these quantifiers increases the expressive power of regular expressions. We can always write an equivalent vanilla regular expression (in the sense of the formal definition we gave above); it would just be tedious in many cases.
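As a rough illustration, the sketch below pairs a few quantified patterns with hand-written equivalents that use only concatenation, alternation, and the star. This assumes those are the operations allowed by the formal definition. The patterns and test strings are made up for the example, and agreement on a finite list of strings is only a spot check, not a proof of equivalence.

```python
import re

# (quantified pattern, equivalent built from concatenation, "|", and "*")
equivalents = [
    ("ab?", "a(b|)"),          # ?     : b or the empty string
    ("ab+", "abb*"),           # +     : one b, then zero or more b's
    ("ab{2,3}", "a(bb|bbb)"),  # {2,3} : every allowed count spelled out
]

# Hypothetical test strings for the spot check.
tests = ["", "a", "b", "ab", "abb", "abbb", "abbbb"]


def agree(p: str, q: str, strings: list[str]) -> bool:
    """True if p and q accept exactly the same strings from the list."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in strings
    )


results = {
    quantified: agree(quantified, vanilla, tests)
    for quantified, vanilla in equivalents
}
print(results)
```

The {2,3} case hints at why the expansion gets tedious: a bound like {2,30} would need twenty-nine alternatives written out.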
Set complement
For any of these cases where we escape a lowercase alphabetic character to get a character set, the set complement can generally be obtained with the uppercase version: e.g. \w goes to \W.
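A small sketch of this, using Python's re module and an arbitrary sample string (the sample is an assumption for illustration): for each lowercase/uppercase pair, every character should land in exactly one of the two sets.

```python
import re

# Arbitrary sample characters: letters, underscore, digit, space, "?", tab.
sample = "ab_1 ?\t"

partitions = {}
for lower, upper in [(r"\w", r"\W"), (r"\d", r"\D"), (r"\s", r"\S")]:
    in_set = {c for c in sample if re.fullmatch(lower, c)}
    in_complement = {c for c in sample if re.fullmatch(upper, c)}
    # The two sets should cover the sample without overlapping.
    assert in_set | in_complement == set(sample)
    assert not (in_set & in_complement)
    partitions[lower] = (in_set, in_complement)

print(sorted(partitions[r"\w"][0]))  # characters matched by \w
print(sorted(partitions[r"\w"][1]))  # characters matched by \W
```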
Sometimes you want the complement of a set that doesn’t have an associated escaped alphabetic character. For that you can use the same square bracket set notation but put a ^ immediately after the opening bracket.
regex_notæ_bstɹ_notæ_kt = '[^æ][^b]stɹ[^æ]kt'

for w, ipa in entries.items():
    if re.fullmatch(regex_notæ_bstɹ_notæ_kt, "".join(ipa)):
        print("".join(ipa), f"({w})")
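The same notation can be tried without the dictionary file. The sketch below uses a made-up word list (an assumption for illustration) and contrasts a ^ placed directly after the opening bracket, which negates the set, with a ^ placed later in the brackets, which Python's re module treats as an ordinary character.

```python
import re

# Toy strings chosen for illustration.
words = ["bat", "bet", "bit", "b^t"]

# "^" directly after "[" negates the set: any one character except a or e.
negated = [w for w in words if re.fullmatch("b[^ae]t", w)]

# "^" elsewhere inside the brackets is just a member of the set.
literal = [w for w in words if re.fullmatch("b[a^]t", w)]

print(negated)  # ['bit', 'b^t']
print(literal)  # ['bat', 'b^t']
```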