Implementing the Regular Operations

So far, we’ve seen only a trivial regular expression: one containing a single character æ, which evaluates to the language {æ} \(\in 2^{\Sigma^*}\). How do we represent other kinds of regular expressions?

Concatenation

The operation of concatenation, which we represented using \(\circ\), is implicit in putting two characters next to each other. For instance, to represent the regular expression \((\text{æ} \circ (\text{b} \circ (\text{s} \circ (\text{t} \circ (\text{ɹ} \circ (\text{æ} \circ (\text{k} \circ (\text{ʃ} \circ (\text{ə} \circ \text{n})))))))))\), we can simply write æbstɹækʃən.

Load IPA representation of CMU Pronouncing Dictionary

with open("cmudict-ipa") as f:
    entries: list[tuple[str, str]] = [
        l.strip().split(",") for l in f
    ]
    entries: dict[str, list[str]] = {
        w: ipa.split() for w, ipa in entries
    }

import re

regex_æbstɹækʃən = "æbstɹækʃən"

string_æbstɹækʃən = "".join(entries["abstraction"])

re.fullmatch(regex_æbstɹækʃən, string_æbstɹækʃən)

<re.Match object; span=(0, 10), match='æbstɹækʃən'>

Union

In contrast, to represent the regular expression \(((\text{æ} \cup \text{ə}) \circ (\text{b} \circ (\text{s} \circ (\text{t} \circ (\text{ɹ} \circ ((\text{æ} \cup \text{ə}) \circ (\text{k} \circ (\text{ʃ} \circ (\text{ə} \circ \text{n})))))))))\), which evaluates to {æbstɹækʃən, əbstɹækʃən, æbstɹəkʃən, əbstɹəkʃən}, we either use []…

regex_æəbstɹæəkʃən = "[æə]bstɹ[æə]kʃən"

string_əbstɹəkʃən = "".join(entries["obstruction"])
string_æbstɹəkʃən = "æbstɹəkʃən"
string_əbstɹækʃən = "əbstɹækʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən), 
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))

(<re.Match object; span=(0, 10), match='æbstɹækʃən'>,
 <re.Match object; span=(0, 10), match='æbstɹəkʃən'>,
 <re.Match object; span=(0, 10), match='əbstɹækʃən'>,
 <re.Match object; span=(0, 10), match='əbstɹəkʃən'>)

…or an explicit |.

regex_æəbstɹæəkʃən = "(æ|ə)bstɹ(æ|ə)kʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən), 
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))

(<re.Match object; span=(0, 10), match='æbstɹækʃən'>,
 <re.Match object; span=(0, 10), match='æbstɹəkʃən'>,
 <re.Match object; span=(0, 10), match='əbstɹækʃən'>,
 <re.Match object; span=(0, 10), match='əbstɹəkʃən'>)

Note that the () are important in the latter case!

regex_æəbstɹæəkʃən = "æ|əbstɹæ|əkʃən"

(re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹækʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_æbstɹəkʃən),
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹækʃən), 
 re.fullmatch(regex_æəbstɹæəkʃən, string_əbstɹəkʃən))

(None, None, None, None)

Kleene star

Finally, the Kleene star works the way you would expect.

regex_ææææbstɹækʃən = "æ*bstɹækʃən"

for i in range(10):
    print(re.fullmatch(regex_ææææbstɹækʃən, "æ"*i + string_æbstɹækʃən[1:]))

<re.Match object; span=(0, 9), match='bstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 11), match='ææbstɹækʃən'>
<re.Match object; span=(0, 12), match='æææbstɹækʃən'>
<re.Match object; span=(0, 13), match='ææææbstɹækʃən'>
<re.Match object; span=(0, 14), match='æææææbstɹækʃən'>
<re.Match object; span=(0, 15), match='ææææææbstɹækʃən'>
<re.Match object; span=(0, 16), match='æææææææbstɹækʃən'>
<re.Match object; span=(0, 17), match='ææææææææbstɹækʃən'>
<re.Match object; span=(0, 18), match='æææææææææbstɹækʃən'>

To apply the Kleene star to a complex regular expression, we need ().

regex_reæbstɹækʃən = "(ɹi|di)*æbstɹækʃən"

for i in range(3):
    print(re.fullmatch(regex_reæbstɹækʃən, "ɹi"*i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "di"*i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "ɹidi"*i + string_æbstɹækʃən))
    print(re.fullmatch(regex_reæbstɹækʃən, "diɹi"*i + string_æbstɹækʃən))

<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 10), match='æbstɹækʃən'>
<re.Match object; span=(0, 12), match='ɹiæbstɹækʃən'>
<re.Match object; span=(0, 12), match='diæbstɹækʃən'>
<re.Match object; span=(0, 14), match='ɹidiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='diɹiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='ɹiɹiæbstɹækʃən'>
<re.Match object; span=(0, 14), match='didiæbstɹækʃən'>
<re.Match object; span=(0, 18), match='ɹidiɹidiæbstɹækʃən'>
<re.Match object; span=(0, 18), match='diɹidiɹiæbstɹækʃən'>

Wild cards and character ranges

We can represent all simple regular expressions according to our formal definition, but in certain cases, doing so would be tedious. For instance, suppose I want to represent the set of all English phonemes \(\Sigma\). Using our formal definition, we would need to list out all of the phonemes joined by \(\cup\): \((\text{i} \cup (\text{ɝ} \cup (\text{a} \cup (\text{ɪ} \cup \ldots))))\).

To make this less tedious, Python (and many other languages) introduce additional special characters into the definition of \(R(\Sigma)\). The most basic is the wildcard ., which matches any single character (alphanumeric or otherwise) except the newline \n.

regex_dot_bstɹækʃən = '.bstɹækʃən'

re.fullmatch(regex_dot_bstɹækʃən, string_æbstɹækʃən)

<re.Match object; span=(0, 10), match='æbstɹækʃən'>

If you want to match the . itself (or any special character we introduce below), you need to escape it.

regex_period_bstɹækʃən = '\.bstɹækʃən'

re.fullmatch(regex_period_bstɹækʃən, '.' + string_æbstɹækʃən[1:])

<re.Match object; span=(0, 10), match='.bstɹækʃən'>

Besides ., we can also use character ranges to target more specific sets, like upper- and lower-case alphabetic characters ([A-z]), lower-case alphabetic character ([a-z]), numerals [0-9].

regex_numeric_bstɹækʃən = '[0-9]bstɹækʃən'
string_4bstɹækʃən = '4bstɹækʃən'

re.fullmatch(regex_numeric_bstɹækʃən, string_4bstɹækʃən)

<re.Match object; span=(0, 10), match='4bstɹækʃən'>

In addition to ranges, there are even more compact escape characters. For instance, alphanumeric characters plus _ can be gotten with \w (the same as the range [A-z0-9_]).

regex_alphanumeric_bstɹækʃən = '\wbstɹækʃən'

re.fullmatch(regex_alphanumeric_bstɹækʃən, string_æbstɹækʃən)

<re.Match object; span=(0, 10), match='æbstɹækʃən'>

Quantifiers

Note that ., the character ranges, and the escape characters match only a single character, and so to match more than one, we need more than one of whichever we are interested in matching.

regex_dot_stɹækʃən = '.stɹækʃən'
regex_dot_dot_tɹækʃən = '..stɹækʃən'

(re.fullmatch(regex_dot_stɹækʃən, string_æbstɹækʃən),
 re.fullmatch(regex_dot_dot_tɹækʃən, string_æbstɹækʃən))

(None, <re.Match object; span=(0, 10), match='æbstɹækʃən'>)

We can avoid writing out multiple by using a quantifier. There are a few different quantifiers. For instance, if you have an exact number in mind:

regex_dot2_tɹækʃən = '.{2}stɹækʃən'

re.fullmatch(regex_dot2_tɹækʃən, string_æbstɹækʃən)

<re.Match object; span=(0, 10), match='æbstɹækʃən'>

Or if you had a range of numbers in mind:

regex_dot2_tɹæk_dot13 = '.{2}stɹək.{1,3}'
string_əbstɹəkt = string_əbstɹəkʃən[:-3] + 't'

(re.fullmatch(regex_dot2_tɹæk_dot13, string_əbstɹəkʃən),
 re.fullmatch(regex_dot2_tɹæk_dot13, string_əbstɹəkt))

(<re.Match object; span=(0, 10), match='əbstɹəkʃən'>,
 <re.Match object; span=(0, 8), match='əbstɹəkt'>)

You can also leave off one bound:

regex_dot2_tɹæk_dot03 = '.{2}stɹək.{,3}'
regex_dot2_tɹæk_dot1inf = '.{2}stɹək.{1,}'

(re.fullmatch(regex_dot2_tɹæk_dot03, string_əbstɹəkt),
 re.fullmatch(regex_dot2_tɹæk_dot1inf, string_əbstɹəkt))

(<re.Match object; span=(0, 8), match='əbstɹəkt'>,
 <re.Match object; span=(0, 8), match='əbstɹəkt'>)

Note that {,} is equivalent to *. There is also a special quantifier symbol for {1,}: +

And if you wanted at least one character ot come after Aaron, but didn’t care after that you could use +.

regex_dot2_tɹæk_dotplus = '.{2}stɹək.+'

re.fullmatch(regex_dot2_tɹæk_dotplus, string_əbstɹəkt)

<re.Match object; span=(0, 8), match='əbstɹəkt'>

Note that none of these quantifiers increase the expressive power of the regular expressions. We can always write their equivalents as a vanilla regular expression (in the sense of the formal definition we gave above); it would just be tedious in many cases.

Set complement

For any of these cases where we escape a lowercase alphabetic character to get a character set, the set complement can generally be gotten with by the uppercase version—e.g. \w goes to \W.

regex_notw_bstɹækt = '\Wbstɹəkt'

(re.fullmatch(regex_notw_bstɹækt, string_əbstɹəkt),
 re.fullmatch(regex_notw_bstɹækt, '\n'+string_əbstɹəkt[1:]))

(None, <re.Match object; span=(0, 8), match='\nbstɹəkt'>)

Sometimes you want the complement of a set that doesn’t have an associated escaped alphabetic character. For that you can use the same square bracket set notation but put a ^ after the first bracket.

regex_notə_bstɹækt = '[^æ]bstɹəkt'

(re.fullmatch(regex_notə_bstɹækt, string_əbstɹəkt),
 re.fullmatch(regex_notə_bstɹækt, 'æ' + string_əbstɹəkt[1:]))

(<re.Match object; span=(0, 8), match='əbstɹəkt'>, None)

The placement of this ^ is really important, since it only has the negation interpretation directly after [.

Groups and greediness

One of the major uses for regular expressions is for extracting substrings from a string. This can be done with groups. For instance, suppose I want all of the last names of people named Aaron in some text (if such a last name is mentioned). I would put parentheses around what I think is the last name and use the findall function.

regex = 'Aaron ([A-Z][a-z]*)'
string = 'Aaron White'

#re.fullmatch(regex, string)
re.findall(regex, string)

['White']

This particular pattern could be problematic if a middle name is listed.

regex = 'Aaron ([A-Z][a-z]*)'
string = 'Aaron Steven White'

re.findall(regex, string)

['Steven']

To handle this, we could explicitly assume that there is always a middle name.

regex = 'Aaron [A-Z][a-z]* ([A-Z][a-z]*)'
string = 'Aaron Steven White'

re.findall(regex, string)

['White']

But then what about the cases where there’s not?

string = 'Aaron Steven White'

re.findall(regex, string)

['White']

In this case, we can use the quantifier ?, which matches zero or one instances of a character before it.

string = 'Aaron Steven White'
regex = 'Aaron ([A-Z]?[a-z]+ ?)+'

re.findall(regex, string)

['White']

We don’t always want to have to list out all the possible matches we might find. For instance, suppose we’re parsing a CSV.

csv = 'c1,c2,c3,c4,c5,c6,c7,c8\nd1,d2,d3,d4,d5,d6,d7,d8'

print(csv)

c1,c2,c3,c4,c5,c6,c7,c8
d1,d2,d3,d4,d5,d6,d7,d8

For this, we want to be able to repeat a pattern an arbitrary number of times.

You’d think you might be able to use the following:

regex = '((.*),)*'

re.findall(regex, csv)

[('c1,c2,c3,c4,c5,c6,c7,', 'c1,c2,c3,c4,c5,c6,c7'),
 ('', ''),
 ('', ''),
 ('', ''),
 ('d1,d2,d3,d4,d5,d6,d7,', 'd1,d2,d3,d4,d5,d6,d7'),
 ('', ''),
 ('', ''),
 ('', '')]

Why doesn’t this work? The problem is that quantifiers are greedy by default: thy will match as much as they can before stopping. We can use ? to ensure that the quantifier is not intepreted greedily.

regex = '((.*?),)+?'

re.findall(regex, csv)

[('c1,', 'c1'),
 ('c2,', 'c2'),
 ('c3,', 'c3'),
 ('c4,', 'c4'),
 ('c5,', 'c5'),
 ('c6,', 'c6'),
 ('c7,', 'c7'),
 ('d1,', 'd1'),
 ('d2,', 'd2'),
 ('d3,', 'd3'),
 ('d4,', 'd4'),
 ('d5,', 'd5'),
 ('d6,', 'd6'),
 ('d7,', 'd7')]

That’s almost right, but what’s the problem? We’re capturing more than we want, and we’re missing the last column. To partially handle the first problem, we can use non-capturing groups, which place a ?: right after the left (.

regex = r'(?:(.+?)(?:,|\n))+?'

re.findall(regex, csv)

['c1',
 'c2',
 'c3',
 'c4',
 'c5',
 'c6',
 'c7',
 'c8',
 'd1',
 'd2',
 'd3',
 'd4',
 'd5',
 'd6',
 'd7']

How do we make sure we capture the last column?

regex = r'(?:(.+?)(?:,|(?=\n)|(?=$)))+?'

re.findall(regex, csv)

['c1',
 'c2',
 'c3',
 'c4',
 'c5',
 'c6',
 'c7',
 'c8',
 'd1',
 'd2',
 'd3',
 'd4',
 'd5',
 'd6',
 'd7',
 'd8']

To get a format that represents each row in the CSV as its own list, we might then restructure this result into a list of lists. In actual practice, you could do this in an number of ways—e.g. first matching each row up to a newline, then matching the above regex with (?=\n) part against the list or eschewing regexes and using Python’s str.split() methods or the csv library (which, by the way, would be very useful for Assignment 1).

A note on raw (`r`) strings

One thing I haven’t addressed here is that you’ll often seen an odd-looking r directly before the first quote in the string in a lot of Python regular expressions. This is because, as explained here…

Regular expressions use the backslash character (‘\’) to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write ‘\\\\’ as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with ‘r’. So r”” is a two-character string containing ‘\’ and ‘n’, while “” is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.

So it only really matters when you’re using special escape characters, but you should just get in the habit of using it more generally.

Querying a lexicon

One use case for regular expressions in the context of computational linguistics is querying a lexicon. To look at this use case, we will use the lexicon of word forms found in the CMU Pronouncing Dictionary.

!wget https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b

--2024-01-29 12:13:08--  https://raw.githubusercontent.com/Alexir/CMUdict/master/cmudict-0.7b
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3865710 (3.7M) [text/plain]
Saving to: ‘cmudict-0.7b.6’

cmudict-0.7b.6      100%[===================>]   3.69M  12.2MB/s    in 0.3s    

2024-01-29 12:13:09 (12.2 MB/s) - ‘cmudict-0.7b.6’ saved [3865710/3865710]

!head -n 200 cmudict-0.7b

;;; # CMUdict  --  Major Version: 0.07
;;; 
;;; # $HeadURL: https://svn.code.sf.net/p/cmusphinx/code/branches/cmudict/cmudict-0.7b $
;;; # $Date:: 2015-07-15 12:34:30 -0400 (Wed, 15 Jul 2015)      $:
;;; # $Id:: cmudict-0.7b 13083 2015-07-15 16:34:30Z air         $:
;;; # $Rev:: 13083                                              $: 
;;; # $Author:: air                                             $:
;;;
;;; #
;;; # ========================================================================
;;; # Copyright (C) 1993-2015 Carnegie Mellon University. All rights reserved.
;;; #
;;; # Redistribution and use in source and binary forms, with or without
;;; # modification, are permitted provided that the following conditions
;;; # are met:
;;; #
;;; # 1. Redistributions of source code must retain the above copyright
;;; #    notice, this list of conditions and the following disclaimer.
;;; #    The contents of this file are deemed to be source code.
;;; #
;;; # 2. Redistributions in binary form must reproduce the above copyright
;;; #    notice, this list of conditions and the following disclaimer in
;;; #    the documentation and/or other materials provided with the
;;; #    distribution.
;;; #
;;; # This work was supported in part by funding from the Defense Advanced
;;; # Research Projects Agency, the Office of Naval Research and the National
;;; # Science Foundation of the United States of America, and by member
;;; # companies of the Carnegie Mellon Sphinx Speech Consortium. We acknowledge
;;; # the contributions of many volunteers to the expansion and improvement of
;;; # this dictionary.
;;; #
;;; # THIS SOFTWARE IS PROVIDED BY CARNEGIE MELLON UNIVERSITY ``AS IS'' AND
;;; # ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
;;; # THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
;;; # PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY
;;; # NOR ITS EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
;;; # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
;;; # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
;;; # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
;;; # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
;;; # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
;;; # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
;;; #
;;; # ========================================================================
;;; #
;;;
;;;  NOTES  ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;; 
;;;  [20080401] (air)  New dict file format introduced 
;;;   - comments (like this section) are allowed
;;;   - file name is major version; vers/rev information is now in the header
;;;
;;; 
;;; ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;; 
!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE  K L OW1 Z K W OW1 T
"DOUBLE-QUOTE  D AH1 B AH0 L K W OW1 T
"END-OF-QUOTE  EH1 N D AH0 V K W OW1 T
"END-QUOTE  EH1 N D K W OW1 T
"IN-QUOTES  IH1 N K W OW1 T S
"QUOTE  K W OW1 T
"UNQUOTE  AH1 N K W OW1 T
#HASH-MARK  HH AE1 M AA2 R K
#POUND-SIGN  P AW1 N D S AY2 N
#SHARP-SIGN  SH AA1 R P S AY2 N
%PERCENT  P ER0 S EH1 N T
&AMPERSAND  AE1 M P ER0 S AE2 N D
'ALLO  AA2 L OW1
'APOSTROPHE  AH0 P AA1 S T R AH0 F IY0
'BOUT  B AW1 T
'CAUSE  K AH0 Z
'COURSE  K AO1 R S
'CUSE  K Y UW1 Z
'EM  AH0 M
'END-INNER-QUOTE  EH1 N D IH1 N ER0 K W OW1 T
'END-QUOTE  EH1 N D K W OW1 T
'FRISCO  F R IH1 S K OW0
'GAIN  G EH1 N
'INNER-QUOTE  IH1 N ER0 K W OW1 T
'KAY  K EY1
'M  AH0 M
'N  AH0 N
'QUOTE  K W OW1 T
'ROUND  R AW1 N D
'S  EH1 S
'SINGLE-QUOTE  S IH1 NG G AH0 L K W OW1 T
'TIL  T IH1 L
'TIS  T IH1 Z
'TWAS  T W AH1 Z
(BEGIN-PARENS  B IH0 G IH1 N P ER0 EH1 N Z
(IN-PARENTHESES  IH1 N P ER0 EH1 N TH AH0 S IY2 Z
(LEFT-PAREN  L EH1 F T P ER0 EH1 N
(OPEN-PARENTHESES  OW1 P AH0 N P ER0 EH1 N TH AH0 S IY2 Z
(PAREN  P ER0 EH1 N
(PARENS  P ER0 EH1 N Z
(PARENTHESES  P ER0 EH1 N TH AH0 S IY2 Z
)CLOSE-PAREN  K L OW1 Z P ER0 EH1 N
)CLOSE-PARENTHESES  K L OW1 Z P ER0 EH1 N TH AH0 S IY2 Z
)END-PAREN  EH1 N D P ER0 EH1 N
)END-PARENS  EH1 N D P ER0 EH1 N Z
)END-PARENTHESES  EH1 N D P ER0 EH1 N TH AH0 S IY2 Z
)END-THE-PAREN  EH1 N D DH AH0 P ER0 EH1 N
)PAREN  P ER0 EH1 N
)PARENS  P ER0 EH1 N Z
)RIGHT-PAREN  R AY1 T P EH2 R AH0 N
)UN-PARENTHESES  AH1 N P ER0 EH1 N TH AH0 S IY1 Z
+PLUS  P L UH1 S
,COMMA  K AA1 M AH0
--DASH  D AE1 SH
-DASH  D AE1 SH
-HYPHEN  HH AY1 F AH0 N
...ELLIPSIS  IH2 L IH1 P S IH0 S
.DECIMAL  D EH1 S AH0 M AH0 L
.DOT  D AA1 T
.FULL-STOP  F UH1 L S T AA1 P
.PERIOD  P IH1 R IY0 AH0 D
.POINT  P OY1 N T
/SLASH  S L AE1 SH
3-D  TH R IY1 D IY2
3D  TH R IY1 D IY2
:COLON  K OW1 L AH0 N
;SEMI-COLON  S EH1 M IY0 K OW1 L AH0 N
;SEMI-COLON(1)  S EH1 M IH0 K OW2 L AH0 N
?QUESTION-MARK  K W EH1 S CH AH0 N M AA1 R K
A  AH0
A(1)  EY1
A'S  EY1 Z
A.  EY1
A.'S  EY1 Z
A.D.  EY2 D IY1
A.M.  EY2 EH1 M
A.S  EY1 Z
A42128  EY1 F AO1 R T UW1 W AH1 N T UW1 EY1 T
AA  EY2 EY1
AAA  T R IH2 P AH0 L EY1
AAAI  T R IH2 P AH0 L EY2 AY1
AABERG  AA1 B ER0 G
AACHEN  AA1 K AH0 N
AACHENER  AA1 K AH0 N ER0
AAH  AA1
AAKER  AA1 K ER0
AALIYAH  AA2 L IY1 AA2
AALSETH  AA1 L S EH0 TH
AAMODT  AA1 M AH0 T
AANCOR  AA1 N K AO2 R
AARDEMA  AA0 R D EH1 M AH0
AARDVARK  AA1 R D V AA2 R K
AARDVARKS  AA1 R D V AA2 R K S
AARGH  AA1 R G
AARHUS  AA2 HH UW1 S
AARON  EH1 R AH0 N
AARON'S  EH1 R AH0 N Z
AARONS  EH1 R AH0 N Z
AARONSON  EH1 R AH0 N S AH0 N
AARONSON(1)  AA1 R AH0 N S AH0 N
AARONSON'S  EH1 R AH0 N S AH0 N Z
AARONSON'S(1)  AA1 R AH0 N S AH0 N Z
AARTI  AA1 R T IY2
AASE  AA1 S
AASEN  AA1 S AH0 N
AB  AE1 B
ABA  EY2 B IY2 EY1
ABABA  AH0 B AA1 B AH0
ABABA(1)  AA1 B AH0 B AH0
ABACHA  AE1 B AH0 K AH0
ABACK  AH0 B AE1 K
ABACO  AE1 B AH0 K OW2
ABACUS  AE1 B AH0 K AH0 S
ABAD  AH0 B AA1 D
ABADAKA  AH0 B AE1 D AH0 K AH0
ABADI  AH0 B AE1 D IY0
ABADIE  AH0 B AE1 D IY0
ABAIR  AH0 B EH1 R
ABALKIN  AH0 B AA1 L K IH0 N
ABALONE  AE2 B AH0 L OW1 N IY0
ABALONES  AE2 B AH0 L OW1 N IY0 Z
ABALOS  AA0 B AA1 L OW0 Z
ABANDON  AH0 B AE1 N D AH0 N
ABANDONED  AH0 B AE1 N D AH0 N D
ABANDONING  AH0 B AE1 N D AH0 N IH0 NG
ABANDONMENT  AH0 B AE1 N D AH0 N M AH0 N T
ABANDONMENTS  AH0 B AE1 N D AH0 N M AH0 N T S
ABANDONS  AH0 B AE1 N D AH0 N Z
ABANTO  AH0 B AE1 N T OW0
ABARCA  AH0 B AA1 R K AH0
ABARE  AA0 B AA1 R IY0
ABASCAL  AE1 B AH0 S K AH0 L
ABASH  AH0 B AE1 SH
ABASHED  AH0 B AE1 SH T
ABASIA  AH0 B EY1 ZH Y AH0
ABATE  AH0 B EY1 T
ABATED  AH0 B EY1 T IH0 D
ABATEMENT  AH0 B EY1 T M AH0 N T
ABATEMENTS  AH0 B EY1 T M AH0 N T S
ABATES  AH0 B EY1 T S
ABATING  AH0 B EY1 T IH0 NG
ABATTOIR  AE2 B AH0 T W AA1 R
ABBA  AE1 B AH0

with open('cmudict-0.7b', encoding = "ISO-8859-1") as f:
    cmudict = [l.split() for l in f if l[:3] != ";;;"]
    cmudict = {w[0].lower(): w[1:] for w in cmudict}
    
cmudict["abstraction"]

['AE0', 'B', 'S', 'T', 'R', 'AE1', 'K', 'SH', 'AH0', 'N']

This dictionary uses what’s known as the ARPABET and represents stress using numeric indicators: 0 for no stress, 1 for primary stress, and 2 for secondary stress. The ARPABET maps to more recognizable phonemes.

arpabet_to_phoneme = {'AA': 'ɑ', 
                      'AE': 'æ', 
                      'AH': 'ʌ', 
                      'AO': 'ɔ', 
                      'AW': 'aʊ', 
                      'AY': 'aɪ', 
                      'B': 'b', 
                      'CH': 'tʃ', 
                      'D': 'd', 
                      'DH': 'ð', 
                      'EH': 'ɛ',
                      'ER': 'ɝ', 
                      'EY': 'eɪ', 
                      'F': 'f', 
                      'G': 'g', 
                      'HH': 'h', 
                      'IH': 'ɪ', 
                      'IY': 'i', 
                      'JH': 'dʒ', 
                      'K': 'k', 
                      'L': 'l', 
                      'M': 'm', 
                      'N': 'n',
                      'NG': 'ŋ', 
                      'OW': 'oʊ', 
                      'OY': 'ɔɪ', 
                      'P': 'p', 
                      'R': 'ɹ', 
                      'S': 's', 
                      'SH': 'ʃ', 
                      'T': 't', 
                      'TH': 'θ', 
                      'UH': 'ʊ', 
                      'UW': 'u', 
                      'V': 'v',
                      'W': 'w', 
                      'Y': 'j', 
                      'Z': 'z', 
                      'ZH': 'ʒ'}

import re

entries: dict[str, list[str]] = {
    w: [arpabet_to_phoneme[''.join(re.findall('[A-z]', phoneme))]
        for phoneme in transcription]
        for w, transcription in cmudict.items()
        if len({c for c in w})>1
        if len(transcription)>1 
        if not re.findall('[^A-z]', w[0])
}

entries["abstraction"]

['æ', 'b', 's', 't', 'ɹ', 'æ', 'k', 'ʃ', 'ʌ', 'n']

For instance, one thing we might be interested in is investigating ablaut patterns—e.g. cases where the verb inflects for some tense or aspect by a vowel change rather than a regular “-ed” ending. For instance, the past tense of sing is sang, rather than singed.

Exercise: how would we find all cases of this sort of ablaut using English UniMorph? Does your solution handle cases like fight to fought.

# implement regex-based solution here

Exercise: how might we discover cases that are similar to think, whose past is thought? Note that this case is intuitively different in nature from verbs that are fully irregular in their inflection, like be.

# implement regex-based solution here