import re
regex_æ = 'æ'
string_æ = 'æ'
re.fullmatch(regex_æ, string_æ)<re.Match object; span=(0, 1), match='æ'>
We mainly use regular expressions for searching, extracting from, and modifying strings. Each such use is underwritten by a string being in the set of strings that a regular expression evaluates to.
One core use of regular expressions is basic matching, which checks whether a string falls into the language that a regular expression evaluates to. We can describe matching formally in a couple ways. One is to view it as a function from a regular expression and a string to a boolean.
\[\text{match}: R(\Sigma)\;\times\;\Sigma^* \rightarrow \{\top, \bot\}\]
\[\text{match}(\rho, \sigma) = \begin{cases}\top & \text{if } \sigma \in \text{eval}(\rho) \\ \bot & \text{otherwise}\end{cases}\]
We can compute this function with fullmatch from the re module.
import re
regex_æ = 'æ'
string_æ = 'æ'
re.fullmatch(regex_æ, string_æ)<re.Match object; span=(0, 1), match='æ'>
Another is to view it as a function from strings to booleans parameterized by a regular expression \(\rho\).
\[\text{match}_\rho: \Sigma^* \rightarrow \{\top, \bot\}\]
\[\text{match}_\rho(\sigma) = \begin{cases}\top & \text{if } \sigma \in \text{eval}(\rho) \\ \bot & \text{otherwise}\end{cases}\]
I’m pointing out this seemingly trivial distinction because it will be important when we discuss the concept of a grammar recognizing a string, where we will define a recognition algorithm for a particular class of grammars that we parameterize by grammars in that class.
We can compute this function by compiling the regular expression to a Pattern object…
match_æ = re.compile(regex_æ)
type(match_æ)re.Pattern
…then call the fullmatch instance method on this object.
match_æ.fullmatch(string_æ)<re.Match object; span=(0, 1), match='æ'>
Both return a Match object (so its not a perfect implementation of the formal definition).
type(re.fullmatch(regex_æ, string_æ)), type(match_æ.fullmatch(string_æ))(re.Match, re.Match)
But if we try to cast this object to a bool we get True.
bool(re.fullmatch(regex_æ, string_æ))True
In contrast, match returns None if there is no match.
string_æbstɹækʃən = 'æbstɹækʃən'
type(re.fullmatch(regex_æ, string_æbstɹækʃən))NoneType
This means that casting a non-match to a bool returns False.
bool(re.fullmatch(regex_æ, string_æbstɹækʃən))False
I point this out because you might expect that fullmatch would return a bool, but it doesn’t. This can be a problem if you’re not calling match within some if-else statement. Nonetheless, this is probably the most straightforward way to check whether a particularly pattern is matched by a string.
You might have noticed that there is another function match. It might seem like this functon would be simpler, but it has some odd behavior: it matches characters at the beginning of the string and ignores the rest of the string.
re.match(regex_æ, string_æbstɹækʃən)<re.Match object; span=(0, 1), match='æ'>
The reason this is annoying is that this “beginning of the string” matching behavior is already built into regular expressions, which we’ll see in a second. It also means that re.match is not an implementation of the formal definition above.