= frozenset()
emptyset = frozenset({'e', 'i', 'o', 'u', 'æ', 'ɑ', 'ɔ', 'ə', 'ɛ', 'ɪ', 'ʊ'})
vowels
# high v. nonhigh
= frozenset({'i', 'u', 'ɪ', 'ʊ'})
high = vowels - high
nonhigh
= frozenset({
f_highness frozenset(emptyset),
frozenset(high), frozenset(nonhigh),
frozenset(vowels)
})
# back v. nonback
= frozenset({'u', 'ʊ', 'o', 'ɔ'})
back = vowels - back
nonback
= frozenset({
f_backness frozenset(emptyset),
frozenset(back), frozenset(nonback),
frozenset(vowels)
})
How we model possibilities
The Kolmogorov axioms start by specifying a set \(\Omega\) that contains all and only the things that can possibly happen. This set is known as the sample space. So what it means to be a possibility is a brute fact: it’s all and only the things in \(\Omega\).
That’s very abstract, so let’s consider a few examples:
- \(\Omega\) could be the set of all vowel types in a language–e.g. for English \(\Omega = \{\text{e, i, o, u, æ, ɑ, ɔ, ə, ɛ, ɪ, ʊ}\}\).
- \(\Omega\) could be the set of all strings of phonemes in a language–e.g. if \(\Sigma\) is the set of phonemes, then \(\Omega = \Sigma^* = \bigcup_{i=0}^\infty \Sigma^i\).
- \(\Omega\) could be the language \(\text{eval}(r)\) that a regular expression \(r \in R(\Sigma)\) evaluates to.
- \(\Omega\) could be the set of all regular expressions in a language–e.g. if \(\Sigma\) is the set of phonemes, then \(\Omega = R(\Sigma)\). That is, \(\Omega\) could be possible grammars, which in turn correspond to possible languages.
The axioms then move forward by defining a way of classifying possibilities \(\mathcal{F} \subseteq 2^\Omega\). These classes of possibilities are known as events and the set of said classes is known as the event space. It is events, which can contain just a single possibility, that we measure the probability of.1
The event space is where interesting linguistic structure enters the picture. Let’s look at a few examples of event spaces that assume our first example of a sample space above: \(\Omega = \{\text{e, i, o, u, æ, ɑ, ɔ, ə, ɛ, ɪ, ʊ}\}\).
- One possible event space distinguishes vowels with respect to highness: \(\mathcal{F}_\text{highness} = \{H, L, \Omega, \emptyset\}\), with \(H = \{\text{i, u, ɪ, ʊ}\}\) and \(L = \Omega - H\).
- Another possible event space distinguishes vowels with respect to backness: \(\mathcal{F}_\text{backness} = \{B, F, \Omega, \emptyset\}\), with \(B = \{\text{u, ʊ, o, ɔ}\}\) and \(F = \Omega - B\).
You’ll notice that beyond having just the set of high v. non-high vowels, or the set of back v. non-back vowels in these event spaces, we also have the entire set of vowels itself alongside the empty set. The reasons for this are technical: to make certain aspects of the formalization of what it means to measure probabilties work out nicely, we need the event space \(\mathcal{F}\) to form what is known as a \(\sigma\)-algebra on the sample space \(\Omega\). All this means is that:
- \(\mathcal{F} \subseteq 2^\Omega\)
- \(E \in \mathcal{F}\) iff \(\Omega - E \in \mathcal{F}\) (closure under complement)
- \(\bigcup \mathcal{E} \in \mathcal{F}\) for all countable \(\mathcal{E} \subseteq \mathcal{F}\) (closure under countable union)
- \(\bigcap \mathcal{E} \in \mathcal{F}\) for all countable \(\mathcal{E} \subseteq \mathcal{F}\) (closure under countable intersection)
from typing import Set, FrozenSet, Iterable
from itertools import chain, combinations
from functools import reduce
= FrozenSet[str]
SampleSpace = FrozenSet[str]
Event = FrozenSet[Event]
SigmaAlgebra
def powerset(iterable: Iterable) -> Iterable:
"""The power set of a set
See https://docs.python.org/3/library/itertools.html#itertools-recipes
Parameters
----------
iterable
The set to take the power set of
"""
= list(iterable)
s return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
class FiniteMeasurableSpace:
"""A finite measurable space
Parameters
----------
atoms
The atoms of the space
sigma_algebra
The σ-algebra of the space
"""
def __init__(self, atoms: SampleSpace, sigma_algebra: SigmaAlgebra):
self._atoms = atoms
self._sigma_algebra = sigma_algebra
self._validate()
def _validate(self):
for subset in self._sigma_algebra:
# check powerset condition
if not subset <= self._atoms:
raise ValueError("All events must be a subset of the atoms")
# check closure under complement
if not (self._atoms - subset) in self._sigma_algebra:
raise ValueError("The σ-algebra must be closed under complements")
for subsets in powerset(self._sigma_algebra):
= list(subsets)
subsets
# python doesn't like to reduce empty iterables
if not subsets:
continue
# check closure under finite union
= frozenset(reduce(frozenset.union, subsets))
union if union not in self._sigma_algebra:
raise ValueError(
"The σ-algebra must be closed under countable union"
)
# check closure under finite intersection
= frozenset(reduce(frozenset.intersection, subsets))
intersection if intersection not in self._sigma_algebra:
raise ValueError(
"The σ-algebra must be closed under countable intersection"
)
@property
def atoms(self) -> SampleSpace:
return self._atoms
@property
def sigma_algebra(self) -> SigmaAlgebra:
return self._sigma_algebra
You can check that all of these conditions are satisfied for our two examples above as long as \(\Omega\) and \(\emptyset\) are both in \(\mathcal{F}\). When \(\mathcal{F} \subseteq 2^\Omega\) is a \(\sigma\)-algebra, the pair \(\langle \Omega, \mathcal{F} \rangle\) is referred to as a measurable space.
= FiniteMeasurableSpace(vowels, f_highness)
highness_space = FiniteMeasurableSpace(vowels, f_backness) backness_space
The two examples above are close to trivial in the sense that the only “interesting” events are complements of each other. But what if we put both together, distinguishing vowels with respect to both highness and backness? Both have the same sample space, so nothing needs to change there.
Can we simply define \(\mathcal{F}_\text{highness-backness} = \mathcal{F}_\text{highness} \cup \mathcal{F}_\text{backness}\)?
No. While Condition 1 above would be satisfied (that’s easy), we would be missing quite a few sets that Conditions 2-4 require: e.g. the high back vowels \(H \cap B\) and the high and/or back vowels \(H \cup B\).
try:
= FiniteMeasurableSpace(vowels, f_highness.union(f_backness))
highness_space except ValueError as e:
print(f"ValueError: {e}")
ValueError: The σ-algebra must be closed under countable union
This point demonstrates an important fact about \(\sigma\)-algebras: if you design a classification based on some (countable) set of features like highness and backness, the constraint that \(\mathcal{F}\) be a \(\sigma\)-algebra on \(\Omega\) implies that \(\mathcal{F}\) contains events corresponding to all possible conjunctions (e.g. high and back) and disjunctions (e.g. high and/or back) of those features. So we need to extend \(\mathcal{F}_\text{highness} \cup \mathcal{F}_\text{backness}\) with additional sets. We call this extension the \(\sigma\)-algebra generated by the family of sets \(\mathcal{F}_\text{highness} \cup \mathcal{F}_\text{backness}\), denoted \(\sigma\left(\mathcal{F}_\text{highness} \cup \mathcal{F}_\text{backness}\right)\).
def generate_sigma_algebra(family: SigmaAlgebra) -> SigmaAlgebra:
"""Generate a σ-algebra from a family of sets
Parameters
----------
family
The family of sets from which to generate the σ-algebra
"""
= set(family)
sigma_algebra = set(family)
old_sigma_algebra
= False
complete
while not complete:
for subsets in powerset(old_sigma_algebra):
= list(subsets)
subsets
if not subsets:
continue
= reduce(frozenset.union, subsets)
union
sigma_algebra.add(union)
= reduce(frozenset.intersection, subsets)
intersection
sigma_algebra.add(intersection)
= sigma_algebra == old_sigma_algebra
complete = set(sigma_algebra)
old_sigma_algebra
return frozenset(sigma_algebra)
= generate_sigma_algebra(f_highness | f_backness)
f_highness_backness
f_highness_backness
frozenset({frozenset(),
frozenset({'u', 'ʊ'}),
frozenset({'i', 'o', 'ɔ', 'ɪ'}),
frozenset({'i', 'o', 'u', 'ɔ', 'ɪ', 'ʊ'}),
frozenset({'e', 'o', 'æ', 'ɑ', 'ɔ', 'ə', 'ɛ'}),
frozenset({'e', 'i', 'o', 'æ', 'ɑ', 'ɔ', 'ə', 'ɛ', 'ɪ'}),
frozenset({'e', 'u', 'æ', 'ɑ', 'ə', 'ɛ', 'ʊ'}),
frozenset({'i', 'ɪ'}),
frozenset({'e', 'i', 'u', 'æ', 'ɑ', 'ə', 'ɛ', 'ɪ', 'ʊ'}),
frozenset({'o', 'ɔ'}),
frozenset({'o', 'u', 'ɔ', 'ʊ'}),
frozenset({'i', 'u', 'ɪ', 'ʊ'}),
frozenset({'e', 'æ', 'ɑ', 'ə', 'ɛ'}),
frozenset({'e', 'o', 'u', 'æ', 'ɑ', 'ɔ', 'ə', 'ɛ', 'ʊ'}),
frozenset({'e', 'i', 'æ', 'ɑ', 'ə', 'ɛ', 'ɪ'}),
frozenset({'e', 'i', 'o', 'u', 'æ', 'ɑ', 'ɔ', 'ə', 'ɛ', 'ɪ', 'ʊ'})})
= FiniteMeasurableSpace(vowels, f_highness_backness) highness_backness_space
So far, we’ve seen a case where the sample space is finite. How about when the sample space is infinite? For example, what if we define \(\Omega\) to be the set of all strings \(\Sigma^*\)? In that case, our event space will be a subset of \(2^{\Sigma^*}\)–i.e. it will consist of languages on \(\Sigma\). This pairing of a sample space (the strings on \(\Sigma\)) and an event space (the languages on \(\Sigma\)) is what underlies all language models.
Because an event space on \(\Sigma^*\) is simply a set of languages, one natural way to define the event space is using a grammar.
Do the regular languages on \(\Sigma\)–i.e. the image of \(R(\Sigma)\) under \(\text{eval}\)–form a \(\sigma\)-algebra?
No. And the reason has to do with the fact that the set of regular languages on \(\Sigma\) are not closed under countable union. We will prove this later in the course when we discuss the pumping lemma for regular languages.
An alternative way to do it is to assume that the event space is simply all possible languages on \(\Sigma\). In that case, the event space gets a bit trickier to define: the powerset is uncountable for even a countably infinite sample space–something that we need to consider in the context of working with strings and derivations. This property can be a problem for reasons I’ll gesture at in a second. So in general, we won’t work with event spaces that are power sets of their corresponding sample space in this context. We’ll instead work with what are called Borel \(\sigma\)-algebras. It’s not important to understand the intricacies of what a Borel \(\sigma\)-algebra is; I’ll try to give you an intuition in a second.
Footnotes
Don’t ask me why, but \(\mathcal{F}\) is standard notation for the event space. Why we don’t use \(\mathcal{E}\) is beyond me. It might be some convention from measure theory I’m not aware of; or it might have to do with not confusing the event space with the expectation \(\mathbb{E}\), which we’ll review below.↩︎