So far, we’ve seen only a trivial regular expression: one containing a single character æ, which evaluates to the language {æ} \(\in 2^{\Sigma^*}\). How do we represent other kinds of regular expressions?
Concatenation
The operation of concatenation, which we represented using \(\circ\), is implicit in putting two characters next to each other. For instance, to represent the regular expression \((\text{æ} \circ (\text{b} \circ (\text{s} \circ (\text{t} \circ (\text{ɹ} \circ (\text{æ} \circ (\text{k} \circ (\text{ʃ} \circ (\text{ə} \circ \text{n})))))))))\), we can simply write æbstɹækʃən.
Load IPA representation of CMU Pronouncing Dictionary
with open("cmudict-ipa") as f:
    # Each line is split at the comma into a word and its transcription.
    pairs: list[list[str]] = [
        l.strip().split(",") for l in f
    ]

# Map each word to its list of IPA phones.
entries: dict[str, list[str]] = {
    w: ipa.split() for w, ipa in pairs
}
import re

regex_æbstɹækʃən = "æbstɹækʃən"
string_æbstɹækʃən = "".join(entries["abstraction"])

re.fullmatch(regex_æbstɹækʃən, string_æbstɹækʃən)
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
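As a small sanity check on the claim that juxtaposition is concatenation, the sketch below builds the same pattern in two ways: by joining the individual symbols, and by spelling out the right-nested grouping with non-capturing groups `(?:...)`. The names `phones`, `flat_regex`, and `nested_regex` are illustrative, and the target string is written out literally rather than read from the dictionary file.

```python
import re

# The ten symbols of the transcription, written out by hand.
phones = ["æ", "b", "s", "t", "ɹ", "æ", "k", "ʃ", "ə", "n"]

# Juxtaposition is concatenation: joining the symbols yields the pattern.
flat_regex = "".join(phones)

# Build the right-nested form æ(?:b(?:s(?:...n))) to mirror the
# fully parenthesized expression.
nested_regex = phones[-1]
for p in reversed(phones[:-1]):
    nested_regex = f"{p}(?:{nested_regex})"

target = "æbstɹækʃən"
flat_match = re.fullmatch(flat_regex, target)
nested_match = re.fullmatch(nested_regex, target)

print(flat_regex)
print(nested_regex)
print(flat_match is not None, nested_match is not None)
```

Both patterns accept the same string, which reflects the fact that concatenation is associative: the explicit grouping adds nothing.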
Union
In contrast, to represent the regular expression \(((\text{æ} \cup \text{ə}) \circ (\text{b} \circ (\text{s} \circ (\text{t} \circ (\text{ɹ} \circ ((\text{æ} \cup \text{ə}) \circ (\text{k} \circ (\text{ʃ} \circ (\text{ə} \circ \text{n})))))))))\), which evaluates to {æbstɹækʃən, əbstɹækʃən, æbstɹəkʃən, əbstɹəkʃən}, we can use either []…
regex_æəbstɹæəkʃən = "[æə]bstɹ[æə]kʃən"

string_əbstɹəkʃən = "".join(entries["obstruction"])
string_æbstɹəkʃən = "æbstɹəkʃən"
string_əbstɹækʃən = "əbstɹækʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))
(<re.Match object; span=(0, 10), match='æbstɹækʃən'>,
<re.Match object; span=(0, 10), match='æbstɹəkʃən'>,
<re.Match object; span=(0, 10), match='əbstɹækʃən'>,
<re.Match object; span=(0, 10), match='əbstɹəkʃən'>)
…or an explicit |.
regex_æəbstɹæəkʃən = "(æ|ə)bstɹ(æ|ə)kʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))
(<re.Match object; span=(0, 10), match='æbstɹækʃən'>,
<re.Match object; span=(0, 10), match='æbstɹəkʃən'>,
<re.Match object; span=(0, 10), match='əbstɹækʃən'>,
<re.Match object; span=(0, 10), match='əbstɹəkʃən'>)
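To check that the two notations describe the same language, one option is to enumerate the four strings directly and test each against both patterns. This is a minimal sketch; the names `class_regex`, `alt_regex`, `language`, and `outside` are illustrative, and the strings are built by hand rather than taken from the dictionary.

```python
import itertools
import re

class_regex = "[æə]bstɹ[æə]kʃən"
alt_regex = "(æ|ə)bstɹ(æ|ə)kʃən"

# Enumerate the language by choosing a vowel for each of the two slots.
language = {
    f"{v1}bstɹ{v2}kʃən" for v1, v2 in itertools.product("æə", repeat=2)
}

# Every string in the language should be accepted by both patterns.
both_accept = all(
    re.fullmatch(class_regex, s) and re.fullmatch(alt_regex, s)
    for s in language
)

# A string with a different first vowel should be rejected by both.
outside = "ɪbstɹækʃən"
both_reject = (
    re.fullmatch(class_regex, outside) is None
    and re.fullmatch(alt_regex, outside) is None
)

print(sorted(language))
print(both_accept, both_reject)
```

One difference worth keeping in mind: [] chooses among single characters only, whereas | can take the union of arbitrary subexpressions, so the two are interchangeable here only because each alternative is one character long.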
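Note that the () are important in the latter case! The | operator has lower precedence than concatenation, so without parentheses the pattern below is read as \((\text{æ} \cup \text{əbstɹæ} \cup \text{əkʃən})\), and none of our four strings matches it in full.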
regex_æəbstɹæəkʃən = "æ|əbstɹæ|əkʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))
(None, None, None, None)
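To see how the unparenthesized pattern is actually parsed, the following sketch tests it against each of its three top-level alternatives as well as against the full word. The variable names `unparenthesized`, `candidates`, and `results` are illustrative.

```python
import re

# Without parentheses, | splits the whole pattern into three alternatives:
# "æ", "əbstɹæ", and "əkʃən".
unparenthesized = "æ|əbstɹæ|əkʃən"

candidates = ["æ", "əbstɹæ", "əkʃən", "æbstɹækʃən"]

# Record whether each candidate is matched in full.
results = {
    s: re.fullmatch(unparenthesized, s) is not None for s in candidates
}

for s, matched in results.items():
    print(s, matched)
```

The three fragments are accepted and the full word is rejected, which is the opposite of what the parenthesized pattern does.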
Kleene star
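Finally, the Kleene star is written as a postfix *. It applies to the character immediately before it and matches zero or more repetitions of that character.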
regex_ææææbstɹækʃən = "æ*bstɹækʃən"

for i in range(10):
    print(re.fullmatch(regex_ææææbstɹækʃən, "æ" * i + string_æbstɹækʃən[1:]))
<re.Match object; span=(0, 9), match='bstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 11), match='ææbstɹækʃən'>
<re.Match object; span=(0, 12), match='æææbstɹækʃən'>
<re.Match object; span=(0, 13), match='ææææbstɹækʃən'>
<re.Match object; span=(0, 14), match='æææææbstɹækʃən'>
<re.Match object; span=(0, 15), match='ææææææbstɹækʃən'>
<re.Match object; span=(0, 16), match='æææææææbstɹækʃən'>
<re.Match object; span=(0, 17), match='ææææææææbstɹækʃən'>
<re.Match object; span=(0, 18), match='æææææææææbstɹækʃən'>
To apply the Kleene star to a complex regular expression rather than to a single character, we need to wrap that expression in ().
regex_reæbstɹækʃən = "(ɹi|di)*æbstɹækʃən"

for i in range(3):
    print(re.fullmatch(regex_reæbstɹækʃən, "ɹi" * i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "di" * i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "ɹidi" * i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "diɹi" * i + string_æbstɹækʃən))
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 12), match='ɹiæbstɹækʃən'>
<re.Match object; span=(0, 12), match='diæbstɹækʃən'>
<re.Match object; span=(0, 14), match='ɹidiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='diɹiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='ɹiɹiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='didiæbstɹækʃən'>
<re.Match object; span=(0, 18), match='ɹidiɹidiæbstɹækʃən'>
<re.Match object; span=(0, 18), match='diɹidiɹiæbstɹækʃən'>
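For contrast, here is a sketch of what happens when the parentheses around the starred expression are dropped. The pattern `ungrouped` and the list `tests` are made up for illustration. Because * binds only to the preceding character and | has the lowest precedence, the ungrouped pattern is read as ɹi ∪ (d ∘ i* ∘ æbstɹækʃən).

```python
import re

grouped = "(ɹi|di)*æbstɹækʃən"
# Parsed as: "ɹi"  OR  "d" followed by zero or more "i", then "æbstɹækʃən".
ungrouped = "ɹi|di*æbstɹækʃən"

tests = ["ɹiæbstɹækʃən", "ɹi", "diiiæbstɹækʃən", "dæbstɹækʃən"]

# For each test string, record whether each pattern matches it in full.
grouped_results = [re.fullmatch(grouped, s) is not None for s in tests]
ungrouped_results = [re.fullmatch(ungrouped, s) is not None for s in tests]

for s, g, u in zip(tests, grouped_results, ungrouped_results):
    print(s, g, u)
```

The grouped pattern accepts only the first test string, while the ungrouped one rejects it and accepts the other three.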