---
title: Quantifiers
jupyter: python3
---
Note that `.`, character ranges, and escape characters each match exactly one character, so to match more than one character we need to repeat whichever pattern we are interested in.
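
As a quick check of this single-character behavior, the sketch below uses a made-up two-character string (an assumption for illustration, not a dictionary entry) and shows that one `.` or one character set cannot cover it, while two can.

```python
import re

# Hypothetical two-character string, chosen only for illustration.
s = "ɑb"

one_dot = re.fullmatch(".", s)          # one `.` covers exactly one character
two_dots = re.fullmatch("..", s)        # two `.` cover exactly two characters
one_set = re.fullmatch("[ɑb]", s)       # a set also matches a single character
two_sets = re.fullmatch("[ɑb][ɑb]", s)  # so two sets are needed here

print(one_dot is None, two_dots is not None)
print(one_set is None, two_sets is not None)
```
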
```{python}
#| code-fold: true
#| code-summary: Load IPA representation of CMU Pronouncing Dictionary
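with open("cmudict-ipa") as f:
    # Each line is assumed to have the form "word,ipa".
    pairs: list[list[str]] = [
        l.strip().split(",") for l in f
    ]
# Map each word to its list of whitespace-separated IPA symbols.
entries: dict[str, list[str]] = {
    w: ipa.split() for w, ipa in pairs
}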
```
```{python}
#| colab: {base_uri: 'https://localhost:8080/'}
#| executionInfo: {elapsed: 5, status: ok, timestamp: 1675099877492, user: {displayName: Aaron Steven White, userId: 06256629009318567325}, user_tz: 300}
#| outputId: df0dd6ca-d641-431a-b8f1-3b1703c7f188
import re
regex_dot_stɹ_dot_kʃən = '.bstɹ.kʃən'
regex_dot_dot_tɹ_dot_kʃən = '..stɹ.kʃən'
print(regex_dot_stɹ_dot_kʃən, "matches:")
print()
for w, ipa in entries.items():
    if re.fullmatch(regex_dot_stɹ_dot_kʃən, "".join(ipa)):
        print(regex_dot_stɹ_dot_kʃən, "matches", "".join(ipa), f"({w})")
print()
print(regex_dot_dot_tɹ_dot_kʃən, "matches:")
print()
for w, ipa in entries.items():
    if re.fullmatch(regex_dot_dot_tɹ_dot_kʃən, "".join(ipa)):
        print("".join(ipa), f"({w})")
```
We can avoid writing out multiple copies of the same pattern by using a quantifier. There are a few different quantifiers. For instance, if you have an exact number of repetitions in mind, you can put it in curly braces:
```{python}
#| colab: {base_uri: 'https://localhost:8080/'}
#| executionInfo: {elapsed: 5, status: ok, timestamp: 1675099877493, user: {displayName: Aaron Steven White, userId: 06256629009318567325}, user_tz: 300}
#| outputId: 91dfd965-f326-4001-ea2d-4cc6f69c1160
regex_dot2_tɹ_dot_kʃən = '.{2}stɹ.kʃən'
for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹ_dot_kʃən, "".join(ipa)):
        print("".join(ipa), f"({w})")
```
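To see that `{2}` is only shorthand for repetition, here is a small sketch on invented strings (assumptions for illustration, not dictionary entries) that compares the spelled-out pattern with the quantified one.

```python
import re

# Toy strings with 2, 3, and 0 characters before "stɹ" (illustrative only).
toy = ["ɑbstɹəkʃən", "kənstɹəkʃən", "stɹəkʃən"]

explicit = "..stɹ.kʃən"
quantified = ".{2}stɹ.kʃən"

explicit_hits = [s for s in toy if re.fullmatch(explicit, s)]
quantified_hits = [s for s in toy if re.fullmatch(quantified, s)]

# Both patterns should accept only the string with two leading characters.
print(explicit_hits == quantified_hits)
print(quantified_hits)
```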
Or if you had a range of numbers in mind:
```{python}
#| colab: {base_uri: 'https://localhost:8080/'}
#| executionInfo: {elapsed: 3, status: ok, timestamp: 1675099877493, user: {displayName: Aaron Steven White, userId: 06256629009318567325}, user_tz: 300}
#| outputId: 84389b5a-30c5-4c2c-8f49-3b69f388ad61
regex_dot2_tɹæk_dot13 = '.{2}stɹək.{1,3}'
for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹæk_dot13, "".join(ipa)):
        print("".join(ipa), f"({w})")
```
You can also leave off one bound:
```{python}
#| colab: {base_uri: 'https://localhost:8080/'}
#| executionInfo: {elapsed: 125, status: ok, timestamp: 1675099877616, user: {displayName: Aaron Steven White, userId: 06256629009318567325}, user_tz: 300}
#| outputId: 74eb7177-645f-4b1c-97e7-406b77e8546f
regex_dot2_tɹæk_dot03 = '.{2}stɹək.{,3}'
regex_dot2_tɹæk_dot1inf = '.{2}stɹək.{1,}'
print(regex_dot2_tɹæk_dot03, "matches:")
print()
n_matches = 0
for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹæk_dot03, "".join(ipa)):
        if n_matches < 10:
            n_matches += 1
            print("".join(ipa), f"({w})")
        else:
            print("...")
            break
print()
print(regex_dot2_tɹæk_dot1inf, "matches:")
print()
n_matches = 0
for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹæk_dot1inf, "".join(ipa)):
        if n_matches < 10:
            n_matches += 1
            print("".join(ipa), f"({w})")
        else:
            print("...")
            break
```
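One way to see what each bound means is to vary the number of trailing characters and record which lengths each pattern accepts. The sketch below does this with a made-up stem and filler character (both are assumptions for illustration, not dictionary data).

```python
import re

# Hypothetical stem: two arbitrary characters followed by "stɹək".
stem = "ɑbstɹək"

patterns = {
    "{1,3}": ".{2}stɹək.{1,3}",  # between one and three, inclusive
    "{,3}": ".{2}stɹək.{,3}",    # lower bound omitted: zero to three
    "{1,}": ".{2}stɹək.{1,}",    # upper bound omitted: one or more
}

# For each pattern, collect the suffix lengths (0 through 5) it accepts.
accepted = {
    name: [n for n in range(6) if re.fullmatch(pat, stem + "t" * n)]
    for name, pat in patterns.items()
}

for name, lengths in accepted.items():
    print(name, lengths)
```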
Note that `{,}` is equivalent to `*`. There is also a special quantifier symbol for `{1,}`: `+`.
So if you want at least one character to come after `stɹək`, but don't care how many, you can use `+`.
```{python}
#| colab: {base_uri: 'https://localhost:8080/'}
#| executionInfo: {elapsed: 6, status: ok, timestamp: 1675099877617, user: {displayName: Aaron Steven White, userId: 06256629009318567325}, user_tz: 300}
#| outputId: a871e772-659b-453e-f2b6-59ed3532bc14
regex_dot2_tɹæk_dotplus = '.{2}stɹək.+'
n_matches = 0
for w, ipa in entries.items():
    if re.fullmatch(regex_dot2_tɹæk_dotplus, "".join(ipa)):
        if n_matches < 10:
            n_matches += 1
            print("".join(ipa), f"({w})")
        else:
            print("...")
            break
```
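The shorthand equivalences between `{,}` and `*` and between `{1,}` and `+` can be spot-checked on a few invented strings. This is a check on toy data under those assumptions, not a proof.

```python
import re

# Toy strings with zero to four characters after "stɹək" (illustrative only).
toy = ["ɑbstɹək" + "t" * n for n in range(5)]

def hits(pattern: str) -> list[str]:
    """Return the toy strings that the pattern matches in full."""
    return [s for s in toy if re.fullmatch(pattern, s)]

star_same = hits(".{2}stɹək.{,}") == hits(".{2}stɹək.*")
plus_same = hits(".{2}stɹək.{1,}") == hits(".{2}stɹək.+")

print(star_same, plus_same)
# `*` also accepts the bare stem; `+` requires at least one more character.
print(len(hits(".{2}stɹək.*")), len(hits(".{2}stɹək.+")))
```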
Note that none of these quantifiers increase the expressive power of the regular expressions. We can always write their equivalents as a vanilla regular expression (in the sense of the formal definition we gave above); it would just be tedious in many cases.
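
To make the rewriting concrete, here is a minimal sketch of a hypothetical helper, `expand_bounded`, that turns `atom{lo,hi}` into a pattern built only from concatenation, union, and the empty string. The helper name and the use of Python's non-capturing groups `(?:...)` for bracketing are choices made here for illustration.

```python
import re

def expand_bounded(atom: str, lo: int, hi: int) -> str:
    """Rewrite atom{lo,hi} without a quantifier.

    Assumes `atom` is a single-character pattern such as "a" or ".".
    """
    optional = ""
    # Build hi - lo nested optional copies: (?:|atom(?:|atom ...)).
    for _ in range(hi - lo):
        optional = f"(?:|{atom}{optional})"
    # lo mandatory copies followed by the optional tail.
    return atom * lo + optional

expanded = expand_bounded(".", 1, 3)
print(expanded)

# Compare against the quantified form on strings of length 0 through 5.
agree = all(
    bool(re.fullmatch(expanded, "t" * n)) == bool(re.fullmatch(".{1,3}", "t" * n))
    for n in range(6)
)
print(agree)
```

The open-ended cases need no extra machinery: `a+` can be written as `aa*`, and `*` is already one of the basic operations.
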
### Set complement
For any of these cases where we escape a lowercase alphabetic character to get a character set, the set complement can generally be obtained by using the uppercase version: for example, the complement of `\w` is `\W`.
```{python}
#| colab: {base_uri: 'https://localhost:8080/'}
#| executionInfo: {elapsed: 4, status: ok, timestamp: 1675099877617, user: {displayName: Aaron Steven White, userId: 06256629009318567325}, user_tz: 300}
#| outputId: f0da836e-2572-4963-bf0e-68e635f9905c
# A raw string keeps the backslash in `\W` from being read as a string escape.
regex_notw_bstɹækt = r'\Wbstɹəkt'
(re.fullmatch(regex_notw_bstɹækt, "".join(entries["obstruct"])),
 re.fullmatch(regex_notw_bstɹækt, '\n'+"".join(entries["obstruct"])[1:]))
```
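As a quick check on a handful of hand-picked characters (an arbitrary sample, not drawn from the dictionary), the sketch below confirms that each character is matched by exactly one member of each lowercase/uppercase pair in Python's `re`.

```python
import re

# Arbitrary sample: letters, a digit, an underscore, and non-word characters.
sample = ["ɑ", "b", "7", "_", " ", "\n", "-"]

# Lowercase escape and its uppercase complement.
pairs = [(r"\w", r"\W"), (r"\d", r"\D"), (r"\s", r"\S")]

# Every character should match exactly one member of each pair.
exactly_one = all(
    bool(re.fullmatch(lower, c)) != bool(re.fullmatch(upper, c))
    for lower, upper in pairs
    for c in sample
)

word = [c for c in sample if re.fullmatch(r"\w", c)]
non_word = [c for c in sample if re.fullmatch(r"\W", c)]

print(exactly_one)
print(word, non_word)
```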
Sometimes you want the complement of a set that doesn't have an associated escaped alphabetic character. For that you can use the same square bracket set notation but put a `^` after the first bracket.
```{python}
#| colab: {base_uri: 'https://localhost:8080/'}
#| executionInfo: {elapsed: 3, status: ok, timestamp: 1675099877617, user: {displayName: Aaron Steven White, userId: 06256629009318567325}, user_tz: 300}
#| outputId: f0dac4b6-8502-431a-8fe9-3c7347c76bd4
regex_notæ_bstɹ_notæ_kt = '[^æ][^b]stɹ[^æ]kt'
for w, ipa in entries.items():
    if re.fullmatch(regex_notæ_bstɹ_notæ_kt, "".join(ipa)):
        print("".join(ipa), f"({w})")
```
The placement of this `^` is really important, since it only has the negation interpretation directly after `[`.
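
A short sketch of the difference, using invented one-character inputs: directly after `[`, the `^` negates the set, while later inside the brackets Python's `re` treats it as a literal caret.

```python
import re

negated = "[^æ]"  # any single character except æ
literal = "[æ^]"  # either æ or a literal ^

inputs = ["æ", "ə", "^"]
negated_results = {c: bool(re.fullmatch(negated, c)) for c in inputs}
literal_results = {c: bool(re.fullmatch(literal, c)) for c in inputs}

print(negated_results)
print(literal_results)
```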