Overview

Reading

Jurafsky and Martin (2023, Ch. 2.1) on regular expressions.

In the last submodule, we defined the set of strings on an alphabet \(\Sigma\) as:

\[\Sigma^* = \bigcup_{i=0}^\infty \Sigma^i\]

We then defined a language \(L\) on \(\Sigma\) to be some (possibly improper) subset of the set of strings:

\[L \subseteq \Sigma^*\]

The set of all languages on \(\Sigma\) is thus the powerset \(2^{\Sigma^*}\) of the set of strings \(\Sigma^*\).

In this submodule, we’re going to discuss one way that we can compactly describe (a subset of) the languages in \(2^{\Sigma^*}\): regular expressions.

Regular expressions are foundational both from an applied and from a scientific perspective. From an applied perspective, they allow us to effectively query and modify text in a compact programmatic way. From a scientific perspective, they provide a way of stating phonological grammars in an efficient way.

References

Jurafsky, Daniel, and James H. Martin. 2023. Speech and Language Processing.