Implementing the Regular Operations

So far, we’ve seen only a trivial regular expression: one containing a single character æ, which evaluates to the language {æ} \(\in 2^{\Sigma^*}\). How do we represent other kinds of regular expressions?

Concatenation

The operation of concatenation, which we represented using \(\circ\), is implicit in putting two characters next to each other. For instance, to represent the regular expression \((\text{æ} \circ (\text{b} \circ (\text{s} \circ (\text{t} \circ (\text{ɹ} \circ (\text{æ} \circ (\text{k} \circ (\text{ʃ} \circ (\text{ə} \circ \text{n})))))))))\), we can simply write æbstɹækʃən.

Load IPA representation of CMU Pronouncing Dictionary
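# Each line of the file is assumed to look like "word,phone phone ...".
with open("cmudict-ipa", encoding="utf-8") as f:
    pairs: list[list[str]] = [
        line.strip().split(",") for line in f
    ]

# Map each word to its list of IPA phones.
entries: dict[str, list[str]] = {
    w: ipa.split() for w, ipa in pairs
}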
import re

regex_æbstɹækʃən = "æbstɹækʃən"

string_æbstɹækʃən = "".join(entries["abstraction"])

re.fullmatch(regex_æbstɹækʃən, string_æbstɹækʃən)
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
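
To see why juxtaposition counts as concatenation, here is a minimal sketch that builds the same pattern by joining ten single-symbol regular expressions. The names `symbols`, `regex`, and `match` are introduced only for this illustration, and the target string is written out literally rather than read from the dictionary.

```python
import re

# Hypothetical list of single-symbol regular expressions, one per phone.
symbols = ["æ", "b", "s", "t", "ɹ", "æ", "k", "ʃ", "ə", "n"]

# Concatenating the expressions amounts to joining their string forms.
regex = "".join(symbols)

# The joined pattern should match exactly the ten-symbol string.
match = re.fullmatch(regex, "æbstɹækʃən")
print(regex, match)
```

This works because none of these symbols is a regex metacharacter; a symbol such as `.` or `*` would need escaping first.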

Union

In contrast, to represent the regular expression \(((\text{æ} \cup \text{ə}) \circ (\text{b} \circ (\text{s} \circ (\text{t} \circ (\text{ɹ} \circ ((\text{æ} \cup \text{ə}) \circ (\text{k} \circ (\text{ʃ} \circ (\text{ə} \circ \text{n})))))))))\), which evaluates to {æbstɹækʃən, əbstɹækʃən, æbstɹəkʃən, əbstɹəkʃən}, we can either list the alternative characters inside []…

regex_æəbstɹæəkʃən = "[æə]bstɹ[æə]kʃən"

string_əbstɹəkʃən = "".join(entries["obstruction"])
string_æbstɹəkʃən = "æbstɹəkʃən"
string_əbstɹækʃən = "əbstɹækʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən), 
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))
(<re.Match object; span=(0, 10), match='æbstɹækʃən'>,
 <re.Match object; span=(0, 10), match='æbstɹəkʃən'>,
 <re.Match object; span=(0, 10), match='əbstɹækʃən'>,
 <re.Match object; span=(0, 10), match='əbstɹəkʃən'>)

…or an explicit |.

regex_æəbstɹæəkʃən = "(æ|ə)bstɹ(æ|ə)kʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən), 
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))
(<re.Match object; span=(0, 10), match='æbstɹækʃən'>,
 <re.Match object; span=(0, 10), match='æbstɹəkʃən'>,
 <re.Match object; span=(0, 10), match='əbstɹækʃən'>,
 <re.Match object; span=(0, 10), match='əbstɹəkʃən'>)
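
As a rough check that the two notations describe the same language, the sketch below enumerates the four strings directly and tests both patterns against them. The names `positions`, `language`, `bracket_regex`, `pipe_regex`, and `rejected` are illustrative, and the rejected string ibstɹækʃən is an arbitrary string outside the language chosen for the example.

```python
import re
from itertools import product

# Each position lists the symbols allowed there (illustrative structure).
positions = [["æ", "ə"], ["b"], ["s"], ["t"], ["ɹ"],
             ["æ", "ə"], ["k"], ["ʃ"], ["ə"], ["n"]]

# The language is every way of picking one symbol per position.
language = {"".join(choice) for choice in product(*positions)}

bracket_regex = "[æə]bstɹ[æə]kʃən"
pipe_regex = "(æ|ə)bstɹ(æ|ə)kʃən"

# Both patterns should accept every string in the language ...
bracket_ok = all(re.fullmatch(bracket_regex, s) for s in language)
pipe_ok = all(re.fullmatch(pipe_regex, s) for s in language)

# ... and reject a string that lies outside it.
rejected = re.fullmatch(bracket_regex, "ibstɹækʃən")
print(len(language), bracket_ok, pipe_ok, rejected)
```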

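Note that the () are important in the latter case! Because | binds more loosely than concatenation, the pattern without () is read as the union of three whole alternatives (æ, əbstɹæ, and əkʃən), so none of our four strings matches it.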

regex_æəbstɹæəkʃən = "æ|əbstɹæ|əkʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən), 
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))
(None, None, None, None)
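
To make the precedence issue concrete, here is a small sketch that tests the unparenthesized pattern against the three alternatives it is actually read as. The variable names are introduced only for this illustration.

```python
import re

unparenthesized = "æ|əbstɹæ|əkʃən"

# Since | has the lowest precedence, the pattern is a union of three
# whole alternatives rather than a word with two optional vowels.
alternatives = ["æ", "əbstɹæ", "əkʃən"]
accepted = [s for s in alternatives if re.fullmatch(unparenthesized, s)]

# The intended word is not one of those alternatives.
word_match = re.fullmatch(unparenthesized, "æbstɹækʃən")
print(accepted, word_match)
```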

Kleene star

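Finally, the Kleene star is written as a postfix *, which applies to the expression immediately before it. Here that expression is the single character æ, which may be repeated zero or more times.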

regex_ææææbstɹækʃən = "æ*bstɹækʃən"

for i in range(10):
    print(re.fullmatch(regex_ææææbstɹækʃən, "æ"*i + string_æbstɹækʃən[1:]))
<re.Match object; span=(0, 9), match='bstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 11), match='ææbstɹækʃən'>
<re.Match object; span=(0, 12), match='æææbstɹækʃən'>
<re.Match object; span=(0, 13), match='ææææbstɹækʃən'>
<re.Match object; span=(0, 14), match='æææææbstɹækʃən'>
<re.Match object; span=(0, 15), match='ææææææbstɹækʃən'>
<re.Match object; span=(0, 16), match='æææææææbstɹækʃən'>
<re.Match object; span=(0, 17), match='ææææææææbstɹækʃən'>
<re.Match object; span=(0, 18), match='æææææææææbstɹækʃən'>

To apply the Kleene star to a complex regular expression, rather than to a single character, we need to wrap that expression in ().

regex_reæbstɹækʃən = "(ɹi|di)*æbstɹækʃən"

for i in range(3):
    print(re.fullmatch(regex_reæbstɹækʃən, "ɹi"*i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "di"*i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "ɹidi"*i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "diɹi"*i + string_æbstɹækʃən))
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 12), match='ɹiæbstɹækʃən'>
<re.Match object; span=(0, 12), match='diæbstɹækʃən'>
<re.Match object; span=(0, 14), match='ɹidiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='diɹiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='ɹiɹiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='didiæbstɹækʃən'>
<re.Match object; span=(0, 18), match='ɹidiɹidiæbstɹækʃən'>
<re.Match object; span=(0, 18), match='diɹidiɹiæbstɹækʃən'>
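
The effect of the () on the scope of the star can be sketched by comparing two smaller patterns. The patterns `narrow` and `wide` and the list `candidates` are hypothetical examples, not part of the dictionary data.

```python
import re

# Without parentheses, * applies only to the preceding symbol i.
narrow = "ɹi*"
# With parentheses, * applies to the whole sequence ɹi.
wide = "(ɹi)*"

candidates = ["", "ɹ", "ɹi", "ɹii", "ɹiɹi"]

# Collect the candidates each pattern matches in full.
narrow_hits = [s for s in candidates if re.fullmatch(narrow, s)]
wide_hits = [s for s in candidates if re.fullmatch(wide, s)]
print(narrow_hits)
print(wide_hits)
```

Only the parenthesized pattern accepts the empty string and the repeated sequence ɹiɹi, while only the unparenthesized one accepts a bare ɹ or ɹii.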