Defining Regular Expressions

Regular expressions themselves are strings. We formally define these strings recursively, starting with an alphabet \(\Sigma\), a symbol for the empty string \(\epsilon\), and a symbol for the empty set \(\emptyset\). Let’s assume the alphabet is the set of English phonemes:

\[\Sigma \equiv \{\text{ɑ, æ, ʌ, ɔ, aʊ, aɪ, b, tʃ, d, ð,} \ldots\}\]

\(\rho\) is a regular expression if and only if:

  1. \(\rho \in \Sigma \cup \{\epsilon, \emptyset\}\)
  2. \(\rho\) is \((\rho_1 \cup \rho_2)\) for some regular expressions \(\rho_1\) and \(\rho_2\)
  3. \(\rho\) is \((\rho_1 \circ \rho_2)\) for some regular expressions \(\rho_1\) and \(\rho_2\)
  4. \(\rho\) is \(\rho_1^*\) for some regular expression \(\rho_1\)

Thus, given the \(\Sigma\) above, the following are all regular expressions.

  1. æ
  2. b
  3. \(\cup\) b)
  4. \(\circ\) b\(^*\))
  5. \(\circ\) b)\(^*\)
  6. ((æ \(\circ\) b) \(\cup\) b)
  7. \(\circ\) (b \(\cup\) b))

The following are not regular expressions.

  1. æ \(\cup\) b
  2. æ \(\circ\) b\(^*\)
  3. æ \(\circ\) b \(\cup\) b

Another way of saying this is that the regular expressions on \(\Sigma\) are a language on \(\Sigma \cup\{\epsilon, \emptyset, \cup, \circ, (, ), *\}\). Note that, based on the recursive definition above, the set of regular expressions is infinite. Thus, the regular expressions on \(\Sigma\) are the first infinite language we’ve described (besides \(\Sigma^*\) itself).

We can generate all the regular expressions given an alphabet. Note that because the set of regular expressions is infinite, we need to use a generator.

from collections.abc import Iterable

def regular_expressions(sigma: set[str]) -> Iterable[str]:
    old_regex_set = frozenset(sigma | {'∅', '𝜖'})
    
    for rho in old_regex_set:
        yield rho
    
    while True:
        new_regex_set = set(old_regex_set)
            
        for rho in old_regex_set:
            elem = rho+'*'
            new_regex_set |= {elem}
            yield elem
            
        for rho1 in old_regex_set:
            for rho2 in old_regex_set:
                elem = '('+rho1+' ∪ '+rho2+')'
                new_regex_set |= {elem}
                yield elem
                
        for rho1 in old_regex_set:
            for rho2 in old_regex_set:
                elem = '('+rho1+' ∘ '+rho2+')'
                new_regex_set |= {elem}
                yield elem
                
        old_regex_set = frozenset(new_regex_set)
english_phonemes = {"ɑ", "æ", "ə", "ʌ", "ɔ", "aʊ", "aɪ", "b", "tʃ", "d", "ð", "ɛ", 
                    "ɝ", "eɪ", "f", "g", "h", "ɪ", "i", "dʒ", "k", "l", "m", 
                    "n", "ŋ", "oʊ", "ɔɪ", "p", "ɹ", "s", "ʃ", "t", "θ", "ʊ", 
                    "u", "v", "w", "j", "z", "ʒ"}

for i, r in enumerate(regular_expressions(english_phonemes)):
    print(r)
    
    if i > 100:
      break
aʊ
j
ŋ
ʊ
θ
d
oʊ
dʒ
ʒ
ɹ
ɑ
ɔ
h
eɪ
v
n
k
g
ə
f
l
u
ɛ
ɝ
æ
aɪ
ʃ
p
ɪ
z
ð
s
t
b
∅
𝜖
ʌ
tʃ
i
m
w
ɔɪ
aʊ*
j*
ŋ*
ʊ*
θ*
d*
oʊ*
dʒ*
ʒ*
ɹ*
ɑ*
ɔ*
h*
eɪ*
v*
n*
k*
g*
ə*
f*
l*
u*
ɛ*
ɝ*
æ*
aɪ*
ʃ*
p*
ɪ*
z*
ð*
s*
t*
b*
∅*
𝜖*
ʌ*
tʃ*
i*
m*
w*
ɔɪ*
(aʊ ∪ aʊ)
(aʊ ∪ j)
(aʊ ∪ ŋ)
(aʊ ∪ ʊ)
(aʊ ∪ θ)
(aʊ ∪ d)
(aʊ ∪ oʊ)
(aʊ ∪ dʒ)
(aʊ ∪ ʒ)
(aʊ ∪ ɹ)
(aʊ ∪ ɑ)
(aʊ ∪ ɔ)
(aʊ ∪ h)
(aʊ ∪ eɪ)
(aʊ ∪ v)
(aʊ ∪ n)
(aʊ ∪ k)
(aʊ ∪ g)