Strings

We build strings from some alphabet/lexicon \(\Sigma\).

Strings of length \(N\) are given by \(\Sigma^N\) and the Kleene closure of \(\Sigma\)—notated \(\Sigma^*\)—gives us the set of all such sets.

\[\Sigma^* \equiv \bigcup_{i\in\mathbb{N}} \Sigma^i\]

As an infinite set, we need to implement \(\Sigma^*\) using a generator. We’ll start by defining a function that produces \(\Sigma^i\).

from itertools import product

def sigma_i(sigma: set[str], i: int) -> product:
    sigma_repeated = [sigma]*i
    return product(*sigma_repeated)

sigma = ["ɹ", "d", "u"]

for s in sigma_i(sigma, 3):
    print(''.join(s))

ɹɹɹ
ɹɹd
ɹɹu
ɹdɹ
ɹdd
ɹdu
ɹuɹ
ɹud
ɹuu
dɹɹ
dɹd
dɹu
ddɹ
ddd
ddu
duɹ
dud
duu
uɹɹ
uɹd
uɹu
udɹ
udd
udu
uuɹ
uud
uuu

We can then define \(\Sigma^*\) using a generator comprehension.

Generator for natural numbers

def natural_numbers() -> int:
    """Initialize a generator for the natural numbers"""
    i = 0
    while True:
        yield i
        i += 1

sigma_star = (''.join(s) 
              for i in natural_numbers() 
              for s in sigma_n(sigma, i))

for s in sigma_star:
    if len(s) < 4:
        print(s)
    else:
        break


ɹ
d
u
ɹɹ
ɹd
ɹu
dɹ
dd
du
uɹ
ud
uu
ɹɹɹ
ɹɹd
ɹɹu
ɹdɹ
ɹdd
ɹdu
ɹuɹ
ɹud
ɹuu
dɹɹ
dɹd
dɹu
ddɹ
ddd
ddu
duɹ
dud
duu
uɹɹ
uɹd
uɹu
udɹ
udd
udu
uuɹ
uud
uuu

Question

How many strings are there in \(\Sigma^*\) (assuming that \(\Sigma\) is finite)? That is, what is \(|\Sigma^*| = |\bigcup_{i\in\mathbb{N}} \Sigma^i|\)?

It must be at least at least as big as \(|\mathbb{N}|\), since we have a nonempty set \(\Sigma^i\) corresponding to each natural number \(i\).

Surprisingly, \(|\Sigma^*|\) turns out to be exactly as big as \(|\mathbb{N}|\), which we can show by demonstrating that there is a bijection from \(\mathbb{N}\) to \(\Sigma^*\): for each \(i \in \mathbb{N}\) we can map \(i\) to a unique string in \(\Sigma^*\) and for each string in \(\Sigma^*\), we can map that string to a unique natural number \(i\). This bijection is a total function from \(\mathbb{N}\) to \(\Sigma^*\).

The trick is to notice that each \(\Sigma^i\) is itself of finite cardinality. This means that we can always break off a chunk of the natural numbers to allocate for building a sequence of all strings in \(\Sigma^i\). The idea is then that we can then stitch those sequences together to get a sequence of all \(\Sigma^*\) that never repeats strings.

You can get an idea for how this works by enumerating the strings we generate from sigma_star.

N = natural_numbers()

sigma_star = (''.join(s) 
              for i in N 
              for s in sigma_n(sigma, i))

for j, s in enumerate(sigma_star):
    if len(s) < 5:
        print(j, s)
    else:
        break

0 
1 ɹ
2 d
3 u
4 ɹɹ
5 ɹd
6 ɹu
7 dɹ
8 dd
9 du
10 uɹ
11 ud
12 uu
13 ɹɹɹ
14 ɹɹd
15 ɹɹu
16 ɹdɹ
17 ɹdd
18 ɹdu
19 ɹuɹ
20 ɹud
21 ɹuu
22 dɹɹ
23 dɹd
24 dɹu
25 ddɹ
26 ddd
27 ddu
28 duɹ
29 dud
30 duu
31 uɹɹ
32 uɹd
33 uɹu
34 udɹ
35 udd
36 udu
37 uuɹ
38 uud
39 uuu
40 ɹɹɹɹ
41 ɹɹɹd
42 ɹɹɹu
43 ɹɹdɹ
44 ɹɹdd
45 ɹɹdu
46 ɹɹuɹ
47 ɹɹud
48 ɹɹuu
49 ɹdɹɹ
50 ɹdɹd
51 ɹdɹu
52 ɹddɹ
53 ɹddd
54 ɹddu
55 ɹduɹ
56 ɹdud
57 ɹduu
58 ɹuɹɹ
59 ɹuɹd
60 ɹuɹu
61 ɹudɹ
62 ɹudd
63 ɹudu
64 ɹuuɹ
65 ɹuud
66 ɹuuu
67 dɹɹɹ
68 dɹɹd
69 dɹɹu
70 dɹdɹ
71 dɹdd
72 dɹdu
73 dɹuɹ
74 dɹud
75 dɹuu
76 ddɹɹ
77 ddɹd
78 ddɹu
79 dddɹ
80 dddd
81 dddu
82 dduɹ
83 ddud
84 dduu
85 duɹɹ
86 duɹd
87 duɹu
88 dudɹ
89 dudd
90 dudu
91 duuɹ
92 duud
93 duuu
94 uɹɹɹ
95 uɹɹd
96 uɹɹu
97 uɹdɹ
98 uɹdd
99 uɹdu
100 uɹuɹ
101 uɹud
102 uɹuu
103 udɹɹ
104 udɹd
105 udɹu
106 uddɹ
107 uddd
108 uddu
109 uduɹ
110 udud
111 uduu
112 uuɹɹ
113 uuɹd
114 uuɹu
115 uudɹ
116 uudd
117 uudu
118 uuuɹ
119 uuud
120 uuuu

So for instance, in the above, you can think of the enumeration as setting aside \(\{1, 2, 3\}\) for the strings of length 1, \(\{4, \ldots, 12\}\) for the strings of length 2, and so on.

More formally, we’ll set aside a chunk of natural numbers \(N_i \equiv \left\{\left[\max N_{i-1}\right] + 1, ..., [\max N_{i-1}] + |\Sigma^i| + 1\right\}\)–with \(N_0 \equiv \{|\Sigma^0| - 1\} = \{0\}\)–for each \(\Sigma^i\). Note that this definition ensures that \(N_i \cap N_j = \emptyset\) for all \(i \neq j\). That is, the chunk of natural numbers \(N_i\) we set aside for \(\Sigma^i\) is disjoint from the chunk of natural numbers \(N_j\) we set aside for \(\Sigma^j\) when \(i \neq j\), and \(|N_i| = |\Sigma^i|\).

We then use \(N_i\) to construct a sequence \(S_i: N_i \rightarrow \Sigma ^i\). Finally, we can stitch those sequences together by defining \(S_*: \mathbb{N} \rightarrow \Sigma^*\) in terms of those \(S_i\) along with a function \(k: \mathbb{N} \rightarrow \mathbb{N}\) that satisfies \(k^{-1}(\{i\}) = N_i\) for all \(i \in \mathbb{N}\)–i.e. the preimage of \(\{i\}\) under \(k\) is \(N_i\).

\[S_*(n) = S_{k(n)}(n)\]

One question you might have is how do we know how to build the sequence \(S_i\) for an arbitrary set of strings \(\Sigma^i\) of length \(i\). As long as the alphabet/lexicon \(\Sigma\) is finite, we can do this by imposing an order on \(\Sigma\) itself, then use that to order sort the elements of \(\Sigma^i\) in lexicographic order.