One of the major uses for regular expressions is extracting substrings from a string. This can be done with groups. For instance, suppose I want all of the stems that combine with the morpheme of the form /ʃən/.
Load IPA representation of CMU Pronouncing Dictionary
with open("cmudict-ipa") as f:
    # Each line is assumed to have the form "word,ipa segments".
    pairs: list[list[str]] = [l.strip().split(",") for l in f]

# Map each word to its list of IPA segments.
entries: dict[str, list[str]] = {w: ipa.split() for w, ipa in pairs}
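A minimal sketch of the kind of first attempt discussed next, assuming each word's segments have been joined into a single string. The `toy` dictionary and its transcriptions are hypothetical stand-ins for the real entries, not values read from the file.

```python
import re

# Hypothetical toy data: word -> IPA string (segments already joined).
toy = {
    "abstraction": "æbstɹækʃən",
    "passion": "pæʃən",
    "accreditation": "əkɹɛdəteɪʃən",
}

# (.+) is a capturing group: it captures whatever precedes /ʃən/.
pattern = re.compile(r"(.+)ʃən")

# With a single group, findall returns the captured substrings.
stems = {word: pattern.findall(ipa) for word, ipa in toy.items()}

for word, found in stems.items():
    print(word, found)
# abstraction ['æbstɹæk']
# passion ['pæ']
# accreditation ['əkɹɛdəteɪ']
```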
This works to some extent, but notice that it will capture cases where /ʃən/ is not a morpheme. For instance, the word passion will get matched. It will also return the wrong stem when the morpheme is realized as /eɪʃən/, as in accreditation, because the /eɪ/ ends up inside the captured stem.
To handle the second problem, we might look for both /eɪʃən/ and /ʃən/. We can use the quantifier ? to say that /eɪ/ is optional. Because it is a digraph, we need to surround it with parentheses; otherwise ? would apply only to the ɪ.
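To see why the parentheses matter, here is a small sketch comparing the two ways of writing the optional /eɪ/. The test strings are made up for illustration.

```python
import re

# Without parentheses, ? applies only to the single character before it (ɪ),
# so the pattern accepts "eʃən", which is not what we want.
unparenthesized = re.fullmatch(r"eɪ?ʃən", "eʃən")

# With parentheses, ? makes the whole sequence eɪ optional,
# so "eʃən" is rejected but both "ʃən" and "eɪʃən" are accepted.
rejected = re.fullmatch(r"(eɪ)?ʃən", "eʃən")
bare = re.fullmatch(r"(eɪ)?ʃən", "ʃən")
full = re.fullmatch(r"(eɪ)?ʃən", "eɪʃən")

print(unparenthesized is not None)  # True
print(rejected is None)             # True
print(bare is not None, full is not None)  # True True
```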
The problem is that this makes Python think we want to capture it. So what we need is a non-capturing group, which we get by putting ?: after the open parenthesis.
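The difference shows up directly in what `re.findall` returns. This is a sketch on one hypothetical transcription of accreditation.

```python
import re

ipa = "əkɹɛdəteɪʃən"  # hypothetical transcription of "accreditation"

# Two capturing groups: findall returns one tuple per match,
# with an empty string for a group that did not participate.
capturing = re.findall(r"(.+)(eɪ)?ʃən", ipa)

# (?:...) groups without capturing: only (.+) is returned.
non_capturing = re.findall(r"(.+)(?:eɪ)?ʃən", ipa)

print(capturing)      # [('əkɹɛdəteɪ', '')]
print(non_capturing)  # ['əkɹɛdəteɪ']
```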
It still seems to be capturing /eɪ/ in accreditation. What gives? The reason this is happening is that quantifiers like + are greedy by default. That means they will match as much as they can. And because /eɪ/ is optional, (.+) can match it.
To make sure it doesn’t match it if it doesn’t need to, we can make the quantifier non-greedy by appending a ?.
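A quick side-by-side of the greedy and non-greedy quantifiers, again on a hypothetical transcription of accreditation:

```python
import re

ipa = "əkɹɛdəteɪʃən"  # hypothetical transcription of "accreditation"

# Greedy: (.+) grabs as much as it can, including the optional eɪ.
greedy = re.findall(r"(.+)(?:eɪ)?ʃən", ipa)

# Non-greedy: (.+?) grabs as little as it can, leaving eɪ to the optional group.
lazy = re.findall(r"(.+?)(?:eɪ)?ʃən", ipa)

print(greedy)  # ['əkɹɛdəteɪ']
print(lazy)    # ['əkɹɛdət']
```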
Okay. So how do we deal with cases where /ʃən/ is not a morpheme? One thing we can do is to look for stems that show up without /ʃən/. This will exclude passion, since /pæ/ is not a word.
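One way to sketch this filter is to collect every pronunciation into a set and keep a candidate stem only if it is itself in that set. The `toy` dictionary and the names `pronunciations` and `derived` are hypothetical; a real run would use the full dictionary.

```python
import re

# Hypothetical toy data: word -> IPA string (segments already joined).
toy = {
    "accredit": "əkɹɛdət",
    "accreditation": "əkɹɛdəteɪʃən",
    "passion": "pæʃən",
}

# Every pronunciation that exists as a word in its own right.
pronunciations = set(toy.values())

pattern = re.compile(r"(.+?)(?:eɪ)?ʃən")

# Keep a word only if its candidate stem is attested as a word.
derived = {}
for word, ipa in toy.items():
    for stem in pattern.findall(ipa):
        if stem in pronunciations:
            derived[word] = stem

print(derived)  # {'accreditation': 'əkɹɛdət'}
```

Here passion drops out because its candidate stem /pæ/ is not in the set.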
An issue here is that /ʃən/ doesn’t simply get appended to a stem. There is an additional phonological process that deletes a portion of that stem–e.g. /æbstɹækt/ + /ʃən/ is /æbstɹækʃən/, not /æbstɹæktʃən/. So we need to consider cases where a final consonant–usually a t–was deleted. But we need to make sure we do so only when the morpheme wasn’t realized as /eɪʃən/, so we need to go back to capturing it.
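A sketch of that idea, under the simplifying assumption that the only deleted consonant is /t/: capture /eɪ/ again and, only when it is absent, also try the stem with a /t/ restored. The toy data and the helper `candidate_stems` are hypothetical.

```python
import re

# Hypothetical toy data: word -> IPA string (segments already joined).
toy = {
    "abstract": "æbstɹækt",
    "abstraction": "æbstɹækʃən",
    "accredit": "əkɹɛdət",
    "accreditation": "əkɹɛdəteɪʃən",
    "passion": "pæʃən",
}
pronunciations = set(toy.values())

# Two capturing groups: the stem and the (optional) eɪ.
pattern = re.compile(r"(.+?)(eɪ)?ʃən")

def candidate_stems(ipa: str) -> list[str]:
    """Return possible stems, restoring a deleted t when eɪ is absent."""
    out = []
    for stem, ation in pattern.findall(ipa):
        out.append(stem)
        if not ation:  # eɪ did not match, so a final t may have been deleted
            out.append(stem + "t")
    return out

derived = {}
for word, ipa in toy.items():
    attested = [s for s in candidate_stems(ipa) if s in pronunciations]
    if attested:
        derived[word] = attested

print(derived)
# {'abstraction': ['æbstɹækt'], 'accreditation': ['əkɹɛdət']}
```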
There’s still some wonky stuff in here (e.g. ancient coming from ain’t and ashen coming from at), but we’re getting closer. We can’t really deal with cases like ashen coming from at, but we can deal with ancient coming from ain’t, which reveals a behavior of re.findall: it functions like re.search, rather than re.fullmatch, in that a match only has to cover some part of the string, not all of it. So the pattern matches the /eɪnʃən/ at the start of ancient and simply ignores the final /t/. If we want the match to run to the end of the string, we have to explicitly specify that in the regular expression. To do this, we can use a $, which means “end of string”.1
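The effect of the anchor can be checked on a single form. The transcriptions below are hypothetical stand-ins for ancient and abstraction.

```python
import re

ancient = "eɪnʃənt"        # hypothetical transcription of "ancient"
abstraction = "æbstɹækʃən"  # hypothetical transcription of "abstraction"

# Without $, the match may stop before the end of the string.
unanchored = re.findall(r"(.+?)(eɪ)?ʃən", ancient)

# With $, ʃən has to be the last thing in the string.
anchored = re.findall(r"(.+?)(eɪ)?ʃən$", ancient)
still_ok = re.findall(r"(.+?)(eɪ)?ʃən$", abstraction)

print(unanchored)  # [('eɪn', '')]
print(anchored)    # []
print(still_ok)    # [('æbstɹæk', '')]
```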
To get much better than this, we’d need to start matching on the orthographic representation as well, e.g. matching the ion at the end of the orthographic representation of the word, thus filtering out things like /æt/ + /ʃən/ = /æʃən/. One thing we’ll still miss is cases like adoration, where there is a vowel quality change (which is, incidentally, why we currently get adder + ion = adoration). To handle those cases, we would need to account for the conditions under which vowel quality changes, which we could do in principle using regular expressions but which I won’t do here.
Footnotes
The dual of $ for the beginning of the string is ^. Note that this looks like the negation symbol we saw earlier. It is different in that that symbol must come immediately after [ to be interpreted as negation. Anywhere else ^ means the beginning of a string. That in turn means that, by default, putting a bare ^ after anything that has to match at least one character is going to result in a regular expression that evaluates to the empty set.
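The three roles of ^ can be seen on a made-up string, using Python's default (non-multiline) matching:

```python
import re

s = "æt"

# Inside brackets, right after [, ^ negates the character class.
negated = re.findall(r"[^t]", s)

# Outside brackets, ^ anchors the match to the beginning of the string.
at_start = re.findall(r"^æ", s)
not_at_start = re.findall(r"^t", s)

# A bare ^ after a character that must be consumed can never be satisfied.
never = re.findall(r"æ^t", s)

print(negated)       # ['æ']
print(at_start)      # ['æ']
print(not_at_start)  # []
print(never)         # []
```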