import re
= 'æ'
regex_æ = 'æ'
string_æ
re.fullmatch(regex_æ, string_æ)
<re.Match object; span=(0, 1), match='æ'>
We mainly use regular expressions for searching, extracting from, and modifying strings. Each such use is underwritten by a string being in the set of strings that a regular expression evaluates to.
One core use of regular expressions is basic matching, which checks whether a string falls into the language that a regular expression evaluates to. We can describe matching formally in a couple ways. One is to view it as a function from a regular expression and a string to a boolean.
\[\text{match}: R(\Sigma)\;\times\;\Sigma^* \rightarrow \{\top, \bot\}\]
\[\text{match}(\rho, \sigma) = \begin{cases}\top & \text{if } \sigma \in \text{eval}(\rho) \\ \bot & \text{otherwise}\end{cases}\]
We can compute this function with fullmatch
from the re
module.
import re
= 'æ'
regex_æ = 'æ'
string_æ
re.fullmatch(regex_æ, string_æ)
<re.Match object; span=(0, 1), match='æ'>
Another is to view it as a function from strings to booleans parameterized by a regular expression \(\rho\).
\[\text{match}_\rho: \Sigma^* \rightarrow \{\top, \bot\}\]
\[\text{match}_\rho(\sigma) = \begin{cases}\top & \text{if } \sigma \in \text{eval}(\rho) \\ \bot & \text{otherwise}\end{cases}\]
I’m pointing out this seemingly trivial distinction because it will be important when we discuss the concept of a grammar recognizing a string, where we will define a recognition algorithm for a particular class of grammars that we parameterize by grammars in that class.
We can compute this function by compiling the regular expression to a Pattern
object…
= re.compile(regex_æ)
match_æ
type(match_æ)
re.Pattern
…then call the fullmatch
instance method on this object.
match_æ.fullmatch(string_æ)
<re.Match object; span=(0, 1), match='æ'>
Both return a Match
object (so its not a perfect implementation of the formal definition).
type(re.fullmatch(regex_æ, string_æ)), type(match_æ.fullmatch(string_æ))
(re.Match, re.Match)
But if we try to cast this object to a bool
we get True
.
bool(re.fullmatch(regex_æ, string_æ))
True
In contrast, match
returns None
if there is no match.
= 'æbstɹækʃən'
string_æbstɹækʃən
type(re.fullmatch(regex_æ, string_æbstɹækʃən))
NoneType
This means that casting a non-match to a bool
returns False
.
bool(re.fullmatch(regex_æ, string_æbstɹækʃən))
False
I point this out because you might expect that fullmatch
would return a bool
, but it doesn’t. This can be a problem if you’re not calling match
within some if-else
statement. Nonetheless, this is probably the most straightforward way to check whether a particularly pattern is matched by a string.
You might have noticed that there is another function match
. It might seem like this functon would be simpler, but it has some odd behavior: it matches characters at the beginning of the string and ignores the rest of the string.
re.match(regex_æ, string_æbstɹækʃən)
<re.Match object; span=(0, 1), match='æ'>
The reason this is annoying is that this “beginning of the string” matching behavior is already built into regular expressions, which we’ll see in a second. It also means that re.match
is not an implementation of the formal definition above.