Basic Matching

We mainly use regular expressions for searching, extracting from, and modifying strings. Each such use is underwritten by a string being in the set of strings that a regular expression evaluates to.

One core use of regular expressions is basic matching, which checks whether a string falls into the language that a regular expression evaluates to. We can describe matching formally in a couple ways. One is to view it as a function from a regular expression and a string to a boolean.

\[\text{match}: R(\Sigma)\;\times\;\Sigma^* \rightarrow \{\top, \bot\}\]

\[\text{match}(\rho, \sigma) = \begin{cases}\top & \text{if } \sigma \in \text{eval}(\rho) \\ \bot & \text{otherwise}\end{cases}\]

We can compute this function with fullmatch from the re module.

import re

regex_æ = 'æ'
string_æ = 'æ'

re.fullmatch(regex_æ, string_æ)
<re.Match object; span=(0, 1), match='æ'>

Another is to view it as a function from strings to booleans parameterized by a regular expression \(\rho\).

\[\text{match}_\rho: \Sigma^* \rightarrow \{\top, \bot\}\]

\[\text{match}_\rho(\sigma) = \begin{cases}\top & \text{if } \sigma \in \text{eval}(\rho) \\ \bot & \text{otherwise}\end{cases}\]

I’m pointing out this seemingly trivial distinction because it will be important when we discuss the concept of a grammar recognizing a string, where we will define a recognition algorithm for a particular class of grammars that we parameterize by grammars in that class.

We can compute this function by compiling the regular expression to a Pattern object…

match_æ = re.compile(regex_æ)

type(match_æ)
re.Pattern

…then call the fullmatch instance method on this object.

match_æ.fullmatch(string_æ)
<re.Match object; span=(0, 1), match='æ'>

Both return a Match object (so its not a perfect implementation of the formal definition).

type(re.fullmatch(regex_æ, string_æ)), type(match_æ.fullmatch(string_æ))
(re.Match, re.Match)

But if we try to cast this object to a bool we get True.

bool(re.fullmatch(regex_æ, string_æ))
True

In contrast, match returns None if there is no match.

string_æbstɹækʃən = 'æbstɹækʃən'

type(re.fullmatch(regex_æ, string_æbstɹækʃən))
NoneType

This means that casting a non-match to a bool returns False.

bool(re.fullmatch(regex_æ, string_æbstɹækʃən))
False

I point this out because you might expect that fullmatch would return a bool, but it doesn’t. This can be a problem if you’re not calling match within some if-else statement. Nonetheless, this is probably the most straightforward way to check whether a particularly pattern is matched by a string.

You might have noticed that there is another function match. It might seem like this functon would be simpler, but it has some odd behavior: it matches characters at the beginning of the string and ignores the rest of the string.

re.match(regex_æ, string_æbstɹækʃən)
<re.Match object; span=(0, 1), match='æ'>

The reason this is annoying is that this “beginning of the string” matching behavior is already built into regular expressions, which we’ll see in a second. It also means that re.match is not an implementation of the formal definition above.