We build languages from some alphabet (or lexicon) \(\Sigma\) by defining a language \(L\) on \(\Sigma\) as a subset of \(\Sigma^*\), the set of all strings that can be built from \(\Sigma\).

\[L \subseteq \Sigma^*\]

This definition implies that the set of all languages on \(\Sigma\) is the power set of \(\Sigma^*\):

\[L \in 2^{\Sigma^*}\]

This terminology arises from the fact that, if \(\Sigma\) were, say, all of the phonemes of English, at least one element of \(2^{\Sigma^*}\) would be all and only the words of English (or at least one person's English idiolect). If \(\Sigma\) were all English words (and assuming that grammaticality is a coherent binary concept), at least one element of \(2^{\Sigma^*}\) would be all the grammatical sentences of English (or at least one person's English idiolect). Of course, many of the sets in \(2^{\Sigma^*}\) won't look anything like English or any other language, and a big part of this class is going to be figuring out how to find subsets of \(2^{\Sigma^*}\) that look like possible languages.
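To make the definition concrete, here is a small standalone illustration with a toy two-symbol alphabet (the particular language chosen is arbitrary):

```python
# A toy alphabet and a handful of strings from Sigma*.
sigma = {"a", "b"}
some_strings = {"", "a", "b", "aa", "ab", "ba", "bb"}

# One of the many languages in 2^(Sigma*): an arbitrary finite subset.
L = {"a", "ab", "bb"}

# L is a language over sigma: every string in L is built from sigma's symbols.
assert L <= some_strings
assert all(set(s) <= sigma for s in L)
```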

To generate the set of all languages, we can compose our generator for \(\Sigma^*\) with our power set generator.

from itertools import product

#| code-fold: true
#| code-summary: Generator for natural numbers

from collections.abc import Iterator

def natural_numbers() -> Iterator[int]:
    """Yield the natural numbers starting from 0.

    Yields
    ------
    int
        The next natural number.
    """
    i = 0
    while True:
        yield i
        i += 1
Generator for power set
from collections.abc import Iterable, Iterator

emptyset = frozenset()

def powerset[T](iterable: Iterable[T]) -> Iterator[frozenset[T]]:
    """Yield all finite subsets of an iterable.

    Parameters
    ----------
    iterable : Iterable[T]
        The elements to draw subsets from.

    Yields
    ------
    frozenset[T]
        The next finite subset.
    """
    yield emptyset

    seen = {emptyset}

    for r in iterable:
        new = {s | frozenset({r}) for s in seen}
        for n in new:
            yield n
            seen.add(n)
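As a quick sanity check (with a standalone copy of the generator so the snippet runs on its own), the subsets of a tiny alphabet appear incrementally: the empty set first, then the subsets that use each new element as it is consumed.

```python
from itertools import islice

emptyset = frozenset()

# Standalone copy of the incremental powerset generator above.
def powerset(iterable):
    yield emptyset
    seen = {emptyset}
    for r in iterable:
        new = {s | frozenset({r}) for s in seen}
        for n in new:
            yield n
            seen.add(n)

# The first yield is always the empty set; after consuming "a" we get {"a"},
# and after "b" we get {"b"} and {"a", "b"} (in an arbitrary order).
subsets = list(islice(powerset("ab"), 4))
print(subsets)
```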
Generator for Σ*
from collections.abc import Iterator

def sigma_i(sigma: set[str], i: int) -> Iterator[tuple[str, ...]]:
    """Generate all strings of length i over an alphabet.

    Parameters
    ----------
    sigma : set[str]
        The alphabet.
    i : int
        The string length.

    Returns
    -------
    Iterator[tuple[str, ...]]
        An iterator over all length-``i`` tuples of symbols.
    """
    sigma_repeated = [sigma]*i
    return product(*sigma_repeated)

def sigma_star(sigma: set[str]) -> Iterator[str]:
    """Generate all strings over an alphabet in length order.

    Parameters
    ----------
    sigma : set[str]
        The alphabet.

    Yields
    ------
    str
        The next string in the enumeration.
    """
    for i in natural_numbers():
        for s in sigma_i(sigma, i):
            yield ''.join(s)
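As a quick check (with standalone re-definitions so the snippet runs on its own, and a list rather than a set so the order is deterministic), the enumeration over a toy alphabet comes out in length order:

```python
from itertools import islice, product

def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

def sigma_star(sigma):
    for i in natural_numbers():
        for s in product(*([sigma] * i)):
            yield ''.join(s)

# The empty string first, then all length-1 strings, then all length-2
# strings, and so on.
first_seven = list(islice(sigma_star(["a", "b"]), 7))
print(first_seven)
```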
from collections.abc import Generator

english_phonemes = {"ɑ", "æ", "ə", "ʌ", "ɔ", "aʊ", "aɪ", "b", "tʃ", "d", "ð", "ɛ", 
                    "ɝ", "eɪ", "f", "g", "h", "ɪ", "i", "dʒ", "k", "l", "m", 
                    "n", "ŋ", "oʊ", "ɔɪ", "p", "ɹ", "s", "ʃ", "t", "θ", "ʊ", 
                    "u", "v", "w", "j", "z", "ʒ"}

languages: Generator[frozenset[str]] = powerset(sigma_star(english_phonemes))

for i, l in enumerate(languages):
    if not i % 100000:
        print(l)
    
    if i > 1000000:
        break
frozenset()
frozenset({'l', 'w', 'oʊ', 'f', 'k', 'ɛ', 'j'})
frozenset({'m', 't', 'l', 'w', 'ŋ', 'θ', 'ɛ'})
frozenset({'', 'ʃ', 'w', 'eɪ', 'θ', 't', 'l', 'ʊ', 'ŋ', 'ɹ'})
frozenset({'m', 'l', 'd', 'ʊ', 'ŋ', 'oʊ', 'ɛ'})
frozenset({'m', 'w', 'θ', 't', 'l', 'ʊ', 'v', 'k', 'ɛ', 'ɹ'})
frozenset({'', 'ʃ', 'w', 'oʊ', 'f', 'ʊ', 'i', 'k', 'ɛ', 'ɹ'})
frozenset({'', 'd', 'θ', 'eɪ', 'f', 't', 'l', 'ʊ', 'i', 'ŋ', 'v', 'k', 'ɛ', 'j'})
frozenset({'', 'm', 'ʃ', 'oʊ', 'ɹ', 'θ', 'eɪ', 'f', 't', 'l', 'ʊ', 'i', 'v', 's', 'j'})
frozenset({'m', 'd', 'ɹ', 'θ', 'eɪ', 'f', 't', 'l', 'ʊ', 'i', 'ŋ', 's', 'ɛ', 'j'})
frozenset({'d', 'w', 't', 'l', 'ʊ', 'i', 's', 'ɛ', 'j'})

Notice that even after computing a million of these languages using our power set generator, we still end up with small languages whose members are all strings at most one phoneme long.

Question

How many languages are there in \(2^{\Sigma^*}\)? That is, what is \(|2^{\Sigma^*}|\)?

Because \(\{s\} \in 2^{\Sigma^*}\) for all \(s \in \Sigma^*\), we know that \(2^{\Sigma^*}\) must be at least as large as \(\Sigma^*\), which is the same size as \(\mathbb{N}\). But unlike \(\Sigma^*\), which is exactly the size of \(\mathbb{N}\), it turns out that \(2^{\Sigma^*}\) is strictly larger than both.

The trick to seeing this is to try to sequence all languages in \(2^{\Sigma^*}\). Suppose there were a sequence of languages \(L_*: \mathbb{N} \rightarrow 2^{\Sigma^*}\) such that \(L_*\) is bijective: if \(L_*(i) = L_*(j)\), then \(i = j\) (\(L_*\) is injective), and \(L_*(\mathbb{N}) = 2^{\Sigma^*}\) (\(L_*\) is surjective). Given a sequence \(S\) on the strings, which we know we can construct, define a language \(\bar{L}\) such that, for all \(i \in \mathbb{N}\), \(S(i) \in \bar{L}\) if \(S(i) \not\in L_*(i)\), and \(S(i) \not\in \bar{L}\) otherwise.

By definition, \(\bar{L} \neq L_*(i)\) for every \(i\), since \(\bar{L}\) and \(L_*(i)\) differ on at least \(S(i)\): either \(S(i) \in \bar{L}\) and \(S(i) \not\in L_*(i)\), or \(S(i) \in L_*(i)\) and \(S(i) \not\in \bar{L}\). But if \(\bar{L} \neq L_*(i)\) for every \(i\), then \(\bar{L} \not\in L_*(\mathbb{N})\), and therefore \(L_*\) is not bijective (because it is not surjective): there will always be some \(\bar{L}\) missing from any particular sequence \(L_*\). Therefore, \(|2^{\Sigma^*}| > |\Sigma^*| = |\mathbb{N}|\).
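The diagonal construction can be sketched concretely. The snippet below is a standalone illustration, not part of the development above: it fixes a small table of languages standing in for the first few values of a hypothetical \(L_*\), builds the corresponding finite fragment of \(\bar{L}\), and checks that the fragment differs from each \(L_*(i)\) on the string \(S(i)\).

```python
# S is a prefix of a fixed enumeration of strings; langs stands in for the
# first few languages L_*(0), L_*(1), ... in a hypothetical sequence.
S = ["", "a", "b", "aa", "ab"]
langs = [
    {"a", "b"},        # L_*(0): does not contain S(0) = ""
    {"", "aa"},        # L_*(1): does not contain S(1) = "a"
    {"b", "ab"},       # L_*(2): contains S(2) = "b"
    {"a"},             # L_*(3): does not contain S(3) = "aa"
    {"", "a", "ab"},   # L_*(4): contains S(4) = "ab"
]

# The diagonal language: S(i) is in L_bar exactly when S(i) is NOT in L_*(i).
L_bar = {S[i] for i in range(len(S)) if S[i] not in langs[i]}

# L_bar differs from every listed language on its diagonal string, so it
# cannot equal any of them.
for i in range(len(S)):
    assert (S[i] in L_bar) != (S[i] in langs[i])

print(sorted(L_bar))
```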