from itertools import product
#| code-fold: true
#| code-summary: Generator for natural numbers
def natural_numbers() -> int:
"""Initialize a generator for the natural numbers"""
= 0
i while True:
yield i
+= 1 i
Languages
We build languages from some alphabet/lexicon \(\Sigma\) by defining a language \(L\) on \(\Sigma\) as a subset of the strings \(\Sigma^*\) that can be built from \(\Sigma\).
\[L \subseteq \Sigma^*\]
This definition implies that the set of all languages on \(\Sigma\) is the power set of \(\Sigma^*\)
\[L \in 2^{\Sigma^*}\]
This terminology arises from the fact that, if \(\Sigma\) were, say, all of the phonemes of English, at least one element of \(2^{\Sigma^*}\) would be all and only the words of English (or at least one persons English idiolect). If \(\Sigma\) were all English words (and assuming that grammaticality is a coherent binary concept), at least one element of \(2^{\Sigma^*}\) would be all the grammatical sentences of English (or at least one persons English idiolect). Of course, many of the sets in \(2^{\Sigma^*}\) won’t look anything like English or any other languages, and a big part of this class is going to be figuring out how to find subsets of \(2^{\Sigma^*}\) that look like possible languages.
To generate the set of all languages, we can compose our generator for \(\Sigma^*\) with our power set generator.
Generator for power set
from typing import TypeVar
from collections.abc import Iterable
= TypeVar("T")
T
= frozenset()
emptyset
def powerset(iterable: Iterable[T]) -> set[T]:
yield emptyset
= {emptyset}
seen
for r in iterable:
= {s | frozenset({r}) for s in seen}
new for n in new:
yield n
seen.add(n)
Generator for Σ*
def sigma_i(sigma: set[str], i: int) -> product:
= [sigma]*i
sigma_repeated return product(*sigma_repeated)
def sigma_star(sigma: set[str]) -> str:
for i in natural_numbers():
for s in sigma_i(sigma, i):
yield ''.join(s)
from collections.abc import Generator
= {"ɑ", "æ", "ə", "ʌ", "ɔ", "aʊ", "aɪ", "b", "tʃ", "d", "ð", "ɛ",
english_phonemes "ɝ", "eɪ", "f", "g", "h", "ɪ", "i", "dʒ", "k", "l", "m",
"n", "ŋ", "oʊ", "ɔɪ", "p", "ɹ", "s", "ʃ", "t", "θ", "ʊ",
"u", "v", "w", "j", "z", "ʒ"}
frozenset[str]] = powerset(sigma_star(english_phonemes))
languages: Generator[
for i, l in enumerate(languages):
if not i % 100000:
print(l)
if i > 1000000:
break
frozenset()
frozenset({'', 'ʃ', 'ə', 'dʒ', 'k', 'ɹ', 'f', 'ʌ', 'ɔ', 'u'})
frozenset({'', 'ʃ', 'dʒ', 'w', 'k', 'b', 'z', 'f', 'ʌ', 'ð'})
frozenset({'', 'ʃ', 'dʒ', 'i', 'ɝ', 'z', 'f', 'ɔ', 'ɔɪ', 'tʃ'})
frozenset({'', 'i', 'k', 'ɝ', 'ɹ', 'f', 'h', 'ɔ', 'u', 'ɔɪ', 'tʃ'})
frozenset({'', 'i', 'ɔɪ', 'ɝ', 'ɹ', 'z', 'f', 'h', 'u', 'ð', 'tʃ'})
frozenset({'i', 'ð', 'ɔ', 'ə', 'k', 'eɪ', 'ɹ'})
frozenset({'ʃ', 'ə', 'dʒ', 'ɝ', 'f', 'ʌ', 'h', 'ɔ', 'u', 'ð', 'tʃ', 'eɪ'})
frozenset({'', 'ʃ', 'dʒ', 'i', 'ɹ', 'b', 'ʌ', 'h', 'ə', 'ɔɪ', 'eɪ'})
frozenset({'', 'ʃ', 'ə', 'w', 'ɝ', 'b', 'z', 'ʌ', 'u', 'ð', 'tʃ', 'eɪ'})
frozenset({'', 'ʃ', 'dʒ', 'i', 'k', 'ɹ', 'f', 'ʌ', 'ɔ', 'ə', 'ð', 'tʃ', 'eɪ'})
Notice that even if we compute a million of these languages using our power set generator, we are still ending up with small languages consisting of only singleton strings.
How many languages are there in \(2^{\Sigma^*}\)? That is, what is \(|2^{\Sigma^*}|\)?
Because \(\{s\} \in 2^{\Sigma^*}\) for all \(s \in \Sigma^*\), we know that \(2^{\Sigma^*}\) must be at least as large as \(\mathbb{\Sigma}^*\), which is the same size as \(\mathbb{N}\). But unlike \(\mathbb{\Sigma}^*\) in comparison to \(\mathbb{N}\), it turns out that \(2^{\Sigma^*}\) is larger than either.
The trick to seeing this is to try to sequence all languages in \(2^{\Sigma^*}\). Suppose there were such a sequence of languages \(L_*: \mathbb{N} \rightarrow 2^{\Sigma^*}\) such that \(L\) is bijective: if \(L_*(i) = L_*(j)\), then \(i = j\) (\(L_*\) is injective), and \(L_*(\mathbb{N}) = 2^{\Sigma^*}\) (\(L_*\) is surjective). Given a sequence \(S_*\) on the strings, which we know we can construct, define a language \(\bar{L}\) such that \(S(i) \in \bar{L}\) if \(S(i) \not\in L_*(i)\), otherwise \(S(i) \not\in \bar{L}\) for all \(i \in \mathbb{N}\).
By definition, \(\bar{L} \neq L_*(i)\) for any \(i\), since \(\bar{L}\) and \(L_*(i)\) differ on at least \(S(i)\): either \(S(i) \in \bar{L}\) and \(S(i) \not\in L_*(i)\) or \(S(i) \in L_*(i)\) and \(S(i) \not\in \bar{L}\). But if \(\bar{L} \neq L_*(i)\) for any \(i\), that means that \(\bar{L} \not\in L_*(\mathbb{N})\) and therefore \(L_*\) is not bijective (because it is not surjective): there will always be some \(\bar{L}\) not in a particular sequence \(L_*\). Therefore, \(|2^{\Sigma^*}| > |\Sigma^*| = |\mathbb{N}|\).