Assignments 1 and 2

Assignment 1 will consist of Tasks 1-3 and Assignment 2 will consist of Tasks 4-8.

In these assignments, you will be implementing and testing a vowel harmony rule system for Turkish. Vowel harmony rule systems are intended to explain the fact that, in some languages, vowels in a word must have the same value on certain phonological features. Your job in these assignments will not be to derive the rule system itself. Rather, I’m going to give you a rule system to implement that works reasonably well, and we’ll ask where it fails.

Mathematical objects

Throughout the assignments, I will be asking you to say what kind of mathematical object you are implementing in a particular task. The kinds of answers you might give here are relation and function. If your response is function, it should be as specific as possible: say whether the function is partial or total, and also whether it is injective and/or surjective. An injective function is one where, for all \(x\) and \(y\), if \(f(x) = f(y)\), then \(x = y\). A surjective function \(f: X \rightarrow Y\) is one where \(f(X) = Y\), i.e. the range of \(f\) is the same as its codomain; or said another way, the image of \(X\) under \(f\) is \(Y\).
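To make these definitions concrete, here is a minimal sketch that checks them for a finite function represented as a Python dict. The helper names is_total, is_injective, and is_surjective are my own for illustration and are not used elsewhere in the assignment.

```python
def is_total(f: dict, domain: set) -> bool:
    """True if f assigns a value to every element of the domain."""
    return set(f) == domain

def is_injective(f: dict) -> bool:
    """True if no two inputs are mapped to the same output."""
    outputs = list(f.values())
    return len(outputs) == len(set(outputs))

def is_surjective(f: dict, codomain: set) -> bool:
    """True if every element of the codomain is the output of some input."""
    return set(f.values()) == codomain

# a toy function from {1, 2, 3} to {'a', 'b'}
f = {1: 'a', 2: 'b', 3: 'a'}

print(is_total(f, {1, 2, 3}))        # True
print(is_injective(f))               # False: 1 and 3 both map to 'a'
print(is_surjective(f, {'a', 'b'}))  # True
```

A dict that is missing some element of the domain as a key would correspond to a partial function on that domain.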

Data

This assignment uses Bruce Hayes’ phonological features spreadsheet—his FeaturesDoulosSIL.xls sheet, which I have converted into a UTF-8 encoded CSV for easier processing in Python. This file contains the equivalent of the IPA charts familiar to you from LIN110.

You do not need the full chart for this assignment, since we will only need access to four features–SYLLABIC, HIGH, FRONT, and ROUND–and the phones that Turkish has. We’ll work with the slightly altered version of the chart below, which only contains the features for these phones and maps 0 to -.

features = '''phone,syllabic,high,front,round
ɑ,+,-,-,-
b,-,-,-,-
d͡ʒ,-,-,-,-
t͡ʃ,-,-,-,-
d,-,-,-,-
e,+,-,+,-
f,-,-,-,-
ɟ,-,+,+,-
j,-,+,+,-
h,-,-,-,-
ɯ,+,+,-,-
i,+,+,+,-
ʒ,-,-,-,-
c,-,+,+,-
l,-,-,-,-
m,-,-,-,-
n,-,-,-,-
o,+,-,-,+
ø,+,-,+,+
p,-,-,-,-
ɾ,-,-,-,-
s,-,-,-,-
ʃ,-,-,-,-
t,-,-,-,-
u,+,+,-,+
y,+,+,+,+
v,-,-,-,-
j,-,+,+,-
z,-,-,-,-'''

with open('features.csv', 'w', encoding='utf-8') as fout:
    fout.write(features)

%%bash
cat features.csv

phone,syllabic,high,front,round
ɑ,+,-,-,-
b,-,-,-,-
d͡ʒ,-,-,-,-
t͡ʃ,-,-,-,-
d,-,-,-,-
e,+,-,+,-
f,-,-,-,-
ɟ,-,+,+,-
j,-,+,+,-
h,-,-,-,-
ɯ,+,+,-,-
i,+,+,+,-
ʒ,-,-,-,-
c,-,+,+,-
l,-,-,-,-
m,-,-,-,-
n,-,-,-,-
o,+,-,-,+
ø,+,-,+,+
p,-,-,-,-
ɾ,-,-,-,-
s,-,-,-,-
ʃ,-,-,-,-
t,-,-,-,-
u,+,+,-,+
y,+,+,+,+
v,-,-,-,-
j,-,+,+,-
z,-,-,-,-

If you are interested in doing further work in computational phonology, you might also check out the panphon package, which provides various tools for working with featurizations of phones.

Definition

To represent feature names (e.g. FRONT, ROUND, etc.) and feature values (+, -), we will use two Enum classes: Feature and FeatureValue. Using Enums here allows us to define the set of possible feature names and feature values and thereby constrain the values that can appear in feature valuations. This functionality is useful as an additional check on the correctness of our code, e.g. in the case that we get invalid feature names or feature values.

from enum import Enum

class Feature(Enum):
    SYLLABIC = "syllabic"
    HIGH = "high"
    FRONT = "front"
    ROUND = "round"

    def __repr__(self):
        return self.value

    def __str__(self):
        return self.__repr__()

class FeatureValue(Enum):
    PLUS = "+"
    MINUS = "-"

    def __repr__(self):
        return self.value

    def __str__(self):
        return self.__repr__()
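As a quick illustration of the kind of check an Enum provides, the sketch below redefines a bare version of Feature (without the __repr__ and __str__ methods, just so the cell runs on its own) and shows that looking up an unknown feature name fails.

```python
from enum import Enum

class Feature(Enum):
    SYLLABIC = "syllabic"
    HIGH = "high"
    FRONT = "front"
    ROUND = "round"

# looking up a known feature name by value returns the Enum member
front = Feature("front")
print(front is Feature.FRONT)  # True

# looking up an unknown feature name raises a ValueError
try:
    Feature("low")
    lookup_failed = False
except ValueError:
    lookup_failed = True

print(lookup_failed)  # True
```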

To represent the relationship between feature names and feature values—encoded in the rows of features.csv—we’ll be using FeatureValuation objects, which are just thin wrappers around a dictionary with feature names (e.g. FRONT, ROUND, etc.) as keys and feature values (+, -) as values.

Importantly, note that, unlike dictionaries, FeatureValuations are hashable, since they implement the __hash__ magic method. Usually, we want hashables to be immutable (e.g. lists and sets are mutable and not hashable, while tuples and frozensets are immutable and hashable), though Python does not enforce this. In this case, I mark the core data of the feature valuation as a private instance attribute FeatureValuation._valuation by prepending an underscore to the attribute name: by convention, an underscore prepended like this means that you should not modify the attribute’s value from outside the object it belongs to. If you need to access the raw dictionary (and you will need to), you should use the FeatureValuation.valuation property, which returns a copy.

The __hash__ magic method more specifically determines what Python’s built-in hash function outputs when applied to a FeatureValuation object. This output is an integer that is used, together with __eq__, to determine when two instances of the class count as the same for the purposes of uniquely identifying them within a collection, e.g. as an element of a set or as a dict key.
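Here is a small sketch of hashability in general, separate from FeatureValuation. The Point class is a hypothetical example of my own: it implements __eq__ and __hash__ consistently, so two equal instances find the same dictionary entry.

```python
# tuples and frozensets are immutable and hashable
print(isinstance(hash((1, 2)), int))  # True

# equal frozensets hash equally, regardless of construction order
print(hash(frozenset({1, 2})) == hash(frozenset({2, 1})))  # True

# lists are mutable and not hashable
try:
    hash([1, 2])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(list_is_hashable)  # False

class Point:
    '''A hypothetical hashable wrapper around a pair of coordinates'''

    def __init__(self, x, y):
        self._coords = (x, y)

    def __eq__(self, other):
        return isinstance(other, Point) and self._coords == other._coords

    def __hash__(self):
        return hash(self._coords)

# two distinct but equal Point objects identify the same dict key
lookup = {Point(0, 1): 'a'}
print(lookup[Point(0, 1)])  # a
```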

The upshot for our purposes is that, if a class implements __hash__, its objects can be used as dictionary keys. The class also implements comparison between feature valuations: == (__eq__), > (__gt__), < (__lt__), >= (__ge__), and <= (__le__). This behavior will be very useful for some tasks.

class FeatureValuation:
    '''A mapping from feature names to feature values
    
    Parameters
    ----------
    valuation
        the feature valuation as a dictionary
    '''
    
    def __init__(self, valuation: dict[str, str]):
        self._valuation = {
            Feature(f): FeatureValue(v) 
            for f, v in valuation.items()
        }
    
    def __hash__(self) -> int:
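        # hash an order-independent view of the items, so that valuations
        # that compare equal also hash equally
        return hash(frozenset(self._valuation.items()))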
    
    def __getitem__(self, key: Feature) -> FeatureValue:
        return self._valuation[key]
    
    def __eq__(self, other: 'FeatureValuation') -> bool:
        self.__class__._check_type(other)
        
        return self._valuation == other._valuation
    
    def __lt__(self, other: 'FeatureValuation') -> bool:
        self.__class__._check_type(other)
        
        if set(self._valuation) < set(other._valuation):
            return all(other._valuation[k] == v 
                       for k, v in self._valuation.items())
        else:
            return False
    
    def __gt__(self, other: 'FeatureValuation') -> bool:        
        return other < self

    def __le__(self, other: 'FeatureValuation') -> bool:
        return self == other or self < other
    
    def __ge__(self, other: 'FeatureValuation') -> bool:
        return self == other or self > other

    def __repr__(self):
        return self._valuation.__repr__()

    def __str__(self):
        return self._valuation.__str__()
    
    @property
    def valuation(self) -> dict[Feature, FeatureValue]:
        return dict(self._valuation) # makes a copy

    @classmethod
    def _check_type(cls, obj):
        # raise explicitly rather than via assert, which is skipped
        # when Python runs with the -O flag
        if not isinstance(obj, cls):
            raise ValueError(
                'can only compare two FeatureValuation objects'
            )

We can construct a FeatureValuation by calling the class on a dict[str, str], which invokes its __init__ magic method.

fv1 = FeatureValuation({'syllabic': '+', 'round': '+'})
fv2 = FeatureValuation({'syllabic': '+', 'round': '+', 'high': '+'})

And note that because FeatureValuations are hashable, we can use them as dictionary keys.

v1 = {fv1: {'o', 'ø', 'u', 'y'}}
v2 = {fv2: {'u', 'y'}}

And because we have defined __eq__, __lt__, and __gt__, we can compare FeatureValuations. Make sure you understand what each comparison does. You will need at least one of these operations for the tasks below.

fv1 == fv1, fv1 < fv2, fv1 > fv2

(True, True, False)

Finally, to show you that hash works and returns an integer:

hash(fv2)

-2436770590250344338
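One thing worth noticing about these comparisons is that not every pair of valuations is ordered. The sketch below mirrors the logic of __lt__ on plain dicts, using a helper is_proper_subvaluation that I have named only for this illustration.

```python
def is_proper_subvaluation(small: dict, big: dict) -> bool:
    '''Whether small specifies strictly fewer features than big
    and agrees with big on every feature that small specifies'''
    return set(small) < set(big) and all(
        big[k] == v for k, v in small.items()
    )

a = {'syllabic': '+', 'round': '+'}
b = {'syllabic': '+', 'round': '+', 'high': '+'}
c = {'syllabic': '+', 'round': '-', 'high': '+'}

print(is_proper_subvaluation(a, b))  # True: b extends a
print(is_proper_subvaluation(a, c))  # False: they disagree on round
print(is_proper_subvaluation(c, a))  # False: c specifies more features
print(is_proper_subvaluation(a, a))  # False: the ordering is strict
```

Here a and c are incomparable: neither is smaller than the other, and they are not equal.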

Task 1

Lines: 5

Define a class method from_csv in the PhonologicalFeatureChart1 class defined below. This method should take as input a string representation of the path fpath to features.csv and return a PhonologicalFeatureChart1 object. This object should have a dictionary-valued private attribute _phone_to_features with phones as keys and FeatureValuation objects as values.

(Note: I’m calling this class PhonologicalFeatureChart1 so that we can subclass it later without a bunch of copying and pasting. This isn’t strictly necessary for subclassing purposes, since you could simply subclass a new version of PhonologicalFeatureChart with an old version; but it’s useful here so that, if you run the cells out of order, you know exactly which version of the class you’re working with. I’ll do this for other classes below without comment.)

class PhonologicalFeatureChart1:
    '''The phonological features of different phones'''

    def __init__(self, phone_to_features: dict[str, FeatureValuation]):
        self._phone_to_features = phone_to_features

    def __repr__(self):
        return self._phone_to_features.__repr__()

    def __str__(self):
        return self._phone_to_features.__str__()

    @classmethod
    def from_csv(cls, fpath: str='features.csv') -> 'PhonologicalFeatureChart1':
        '''Load Hayes' phonological feature chart

        Parameters
        ----------
        fpath
            path to phonological feature chart as a csv
        '''

        # remove after implementing
        raise NotImplementedError

    def phone_to_features(self, phone: str) -> FeatureValuation:
        return self._phone_to_features[phone]

Write a test that checks for the correctness of from_csv by calling phone_to_features on some phone and making sure that it returns the correct feature valuation. (The fact that feature valuations implement __eq__ will be useful for this.) This test (and all future tests) should use standard Python exception handling facilities (try-except).

try:
    phonological_feature_chart = PhonologicalFeatureChart1.from_csv()
except NotImplementedError:
    print("You still need to implement PhonologicalFeatureChart1.from_csv.")

# WRITE TESTS HERE

You still need to implement PhonologicalFeatureChart1.from_csv.

Referring to the set of features as \(F = \{\text{FRONT}, \text{ROUND}, \text{HIGH}, \text{SYLLABIC}\}\) and the set of feature values as \(V = \{+, -\}\), explain what kind of mathematical object the feature valuations you just constructed are. If they are functions, say whether they are injective and/or surjective. Note that I am not asking about all possible feature valuations, just the ones constructed in from_csv.

WRITE YOUR ANSWER HERE

Task 2

Lines: 2

Define an instance method phone_from_features in the PhonologicalFeatureChart2 class that takes as input a FeatureValuation object and returns the set of phones that match that feature valuation. Assume that feature valuations need not specify a feature value for all feature names—e.g. the following should still return something (namely, all the high vowels).

try:
    chart = PhonologicalFeatureChart2.from_csv('features.csv')
    valuation = FeatureValuation({'syllabic': '+', 'high': '+'})
    chart.phone_from_features(valuation)
except NameError:
    print("You still need to define PhonologicalFeatureChart2.")

You still need to define PhonologicalFeatureChart2.

We will refer to valuations like this as partial feature valuations.

Note that you need to return a set because some phones are not uniquely determined by the features in features.csv: e.g. all consonants other than the palatals /ɟ/, /c/, and /j/ are - on all four of these features. Further, the method may return an empty set, since some feature combinations do not show up in features.csv, e.g. [-SYLLABIC, +ROUND].

class PhonologicalFeatureChart2(PhonologicalFeatureChart1):
    '''The phonological features of different phones'''

    def phone_from_features(self, features: FeatureValuation) -> set[str]:
        '''The phones that have a particular feature valuation

        Parameters
        ----------
        features
            the feature valuation
        '''

        # remove after implementing
        raise NotImplementedError

Write a test that checks for the correctness of phone_from_features. This test should check at least five cases: (i) one where a singleton set should be returned when a total feature valuation is input; (ii) one where an empty set should be returned when a total feature valuation is input; (iii) one where a non-empty, non-singleton set should be returned when a total feature valuation is input; (iv) one where an empty set should be returned when a partial feature valuation is input; and (v) one where a non-empty, non-singleton set should be returned when a partial feature valuation is input.

# WRITE TESTS HERE

Explain what kind of mathematical object phone_from_features implements and what kind of object a partial feature valuation is, referring to the set of phones as \(P\). There are two possible answers here depending on what you take the right side of the relation/function to be.

WRITE YOUR ANSWER HERE

Task 3

Lines: 2

Using your phone_from_features method, define an instance method alter_features_of_phone in PhonologicalFeatureChart (our final version, so no number) that takes as input a phone and a (partial) feature valuation like valuation above. This method should return the set of phones that correspond to setting that phone’s features to the values listed in the feature valuation. For instance, if I passed this method the phone /u/ and the (partial) feature valuation [-ROUND], it should return {/ɯ/}, but if I passed it /u/ and the feature valuation [-SYLLABIC, -HIGH, -FRONT, -ROUND], it should return the set of consonants that are - on all four features.

class PhonologicalFeatureChart(PhonologicalFeatureChart2):
    '''The phonological features of different phones'''

    def alter_features_of_phone(
        self, phone: str, 
        features: FeatureValuation
    ) -> set[str]:
        '''The phones with features altered

        Parameters
        ----------
        phone
            the phone whose features we want to alter
        features
            the (partial) feature valuation whose values should be set
        '''

        # remove after implementing
        raise NotImplementedError

Write a test that checks for the correctness of alter_features_of_phone. This test should check the same five kinds of cases that your test for Task 2 checked.

# WRITE TESTS HERE

Explain what kind of mathematical object alter_features_of_phone implements. There are two possible answers here depending on what you take the right side of the relation/function to be. Note that the left side of the relation is a tuple.

WRITE YOUR ANSWER HERE

Data

The remainder of this assignment is based on data from the UniMorph project – specifically, Turkish UniMorph. The UniMorph project provides a schema for annotating word forms with their root form and the morphological features they express across languages, as well as annotated data for (currently) 168 languages. Take a look at the Turkish dataset. You’ll notice that it consists of three columns.

    hamsi          hamsiler          N;NOM;PL
    hamsi          hamsilere         N;DAT;PL
    hamsi          hamsilerden       N;ABL;PL
    hamsi          hamsinin          N;GEN;SG
    hamsi          hamsiye           N;DAT;SG
    hamsi          hamsiyi           N;ACC;SG
    hamsi          hamsilerin        N;GEN;PL
    hamsi          hamsileri         N;ACC;PL
    hamsi          hamsiden          N;ABL;SG
    hamsi          hamsilerde        N;LOC;PL
    hamsi          hamside           N;LOC;SG
    hamsi          hamsi             N;NOM;SG

The second column contains word forms; the first contains the root corresponding to that form; and the third contains the part of speech of that form followed by the morphological features it expresses, separated by ;.
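As a small sketch of this format, here is how a single line could be split into its parts, assuming the columns are tab-separated (which is what the loading code below assumes as well). The variable names are just for illustration.

```python
# one line of the dataset, with tab-separated columns
line = "hamsi\thamsilerden\tN;ABL;PL\n"

root, word, pos_features = line.strip().split('\t')

# the first ;-separated element is the part of speech;
# the remaining elements are the morphological features
pos, *feature_list = pos_features.split(';')
features = set(feature_list)

print(root)      # hamsi
print(word)      # hamsilerden
print(pos)       # N
print(features == {'ABL', 'PL'})  # True
```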

I have included some code below that should make working with these data easier by loading Turkish Unimorph as an instance of my custom Unimorph class, defined below. Before moving forward, read through this code to make sure you understand what turkish_unimorph is.

from collections import defaultdict

class Unimorph:

    def __init__(self, fpath, pos_filter=lambda x: True, root_filter=lambda x: True,
                 word_filter=lambda x: True, feature_filter=lambda x: True,
                 graph_to_phone_map=None):

        self._graph_to_phone_map = graph_to_phone_map

        self._pos_filter = pos_filter
        self._root_filter = root_filter
        self._word_filter = word_filter
        self._feature_filter = feature_filter
        
        self._load_unimorph(fpath)

    def __getitem__(self, key):
        return self._pos_to_word_to_features[key]

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._gen)
        except StopIteration:
            self._initialize_gen()
            raise

    def _load_unimorph(self, fpath):
        '''Load a UniMorph file and convert graphs to IPA

        The result is stored in self._pos_to_word_to_features, which maps
        part of speech to root to word to morphological features.

        Parameters
        ----------
        fpath : str
            path to unimorph data
        '''

        pos_to_word_to_features = defaultdict(lambda:
                                              defaultdict(lambda:
                                                          defaultdict(set)))

        with open(fpath, encoding='utf-8') as f:
            for line in f:
                line_split = line.strip().split('\t')

                if len(line_split) != 3:
                    continue

                root, word, pos_features = line_split

                pos_features_split = pos_features.split(';')

                pos = pos_features_split[0]
                features = set(pos_features_split[1:])

                if self._graph_to_phone_map is not None:
                    try:
                        root = self._convert_graph_to_phone(root)
                        word = self._convert_graph_to_phone(word)
                    except KeyError:
                        continue
                else:
                    root = tuple(root)
                    word = tuple(word)
                        

                keep = self._pos_filter(pos)
                keep &= self._root_filter(root)
                keep &= self._word_filter(word)
                keep &= self._feature_filter(features)

                if keep:
                    pos_to_word_to_features[pos][root][word] = features

        # convert the outer defaultdict to a plain dict, so that looking up
        # a missing part of speech raises a KeyError
        self._pos_to_word_to_features = dict(pos_to_word_to_features)

        self._initialize_gen()

    def _initialize_gen(self):
        self._gen = ((pos, root, word, features)
                     for pos, d1 in self._pos_to_word_to_features.items()
                     for root, d2 in d1.items()
                     for word, features in d2.items())
        
    def _convert_graph_to_phone(self, word):
        '''Map graphs to phones

        Parameters
        ----------
        word : str
            the word as a string of graphs

        Returns
        -------
        tuple(str)
            the word as a tuple of phones
        '''

        # this takes the last phone in the list
        # it should maybe create a set of possible words
        return tuple([self._graph_to_phone_map[graph][-1]
                      for graph in word])

graph_to_phone_map = {'a': ['ɑ'],
                      'b': ['b'],
                      'c': ['d͡ʒ'],
                      'ç': ['t͡ʃ'],
                      'd': ['d'],
                      'e': ['e'],
                      'f': ['f'],
                      'g': ['ɡ̟', 'ɟ'],
                      'ğ': ['ː', '‿', 'j'],
                      'h': ['h'],
                      'ı': ['ɯ'],
                      'i': ['i'],
                      'j': ['ʒ'],
                      'k': ['k', 'c'],
                      'l': ['ɫ', 'l'],
                      'm': ['m'],
                      'n': ['n'],
                      'o': ['o'],
                      'ö': ['ø'],
                      'p': ['p'],
                      'r': ['ɾ'],
                      's': ['s'],
                      'ş': ['ʃ'],
                      't': ['t'],
                      'u': ['u'],
                      'ü': ['y'],
                      'v': ['v'],
                      'y': ['j'],
                      'z': ['z'],
                      ' ': [' ']}

import requests
from io import BytesIO
from zipfile import ZipFile

turkish_unimorph_url = 'https://github.com/unimorph/tur/archive/master.zip'
turkish_unimorph_zip = requests.get(turkish_unimorph_url).content

with ZipFile(BytesIO(turkish_unimorph_zip)) as zf:
    with zf.open('tur-master/tur') as f_in:
        with open('tur.txt', 'w', encoding='utf-8') as f_out:
            f_out.write(f_in.read().decode())

turkish_unimorph = Unimorph('tur.txt',
                            pos_filter=lambda x: x == 'N',
                            root_filter=lambda x: ' ' not in x,
                            word_filter=lambda x: ' ' not in x,
                            feature_filter=lambda x: x.issubset({'PL', 'GEN'}),
                            graph_to_phone_map=graph_to_phone_map)

There are two important things to notice. First, words and roots are represented as tuples of strings, instead of strings. The reason for this is that (i) I map each root and word in Turkish Unimorph to a phonetic/phonemic representation using a fixed mapping from graphs to phones; and (ii) some phones are represented by more than one character in Unicode (e.g. t͡ʃ), so if we mapped from strings of graphs to strings of phones, it would be difficult to recover which characters in a string are a single phone and which are part of a phone that Unicode represents with multiple symbols. Second, my Unimorph class allows the user to pass filters to the constructor __init__. In the current case, I have set these filters so our Unimorph instance only contains plural and/or genitive nouns.
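To see the first point concretely, the sketch below writes the affricate t͡ʃ with explicit Unicode escapes and compares a tuple of phones with the string obtained by joining them. The three-phone tuple is just an illustrative value.

```python
# t͡ʃ is three Unicode characters: t, a combining tie bar, and ʃ
affricate = "t\u0361\u0283"
print(len(affricate))  # 3

# as a tuple, each element is exactly one phone
phones = (affricate, "o", "c")
print(len(phones))  # 3

# as a single string, the phone boundaries are lost
joined = "".join(phones)
print(len(joined))  # 5
```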

Task 4

Lines: 24

In standard descriptions of Turkish, the vowel harmony rule system plays out on three features: height [+/-HIGH], frontness [+/-FRONT], and roundedness [+/-ROUND]. Roughly, if a vowel is high, it must match with the immediately previous vowel on both frontness and roundedness; and if it is not high and not round, it must match with the immediately previous vowel on frontness.

Using your alter_features_of_phone method, define a class TurkishVowelHarmony1 whose instances take as input a word and apply the vowel harmony rule system to it (implemented using the __call__ magic method). Pay special attention to the fact that this system only looks at the immediately previous vowel.

String = tuple[str, ...]

class TurkishVowelHarmony1:
    '''The Turkish vowel harmony system'''
    
    def __call__(self, word: String) -> String:
        '''Apply the vowel harmony rule
        
        Parameters
        ----------
        word
            the word to apply the vowel harmony rule to
        '''
        
        # remove after implementing
        raise NotImplementedError

Write a test that checks for the correctness of __call__. It should check at least six cases: (i) three randomly selected words found in Turkish Unimorph where the result of applying a TurkishVowelHarmony1 object to those words returns the same word back; and (ii) three randomly selected words found in Turkish Unimorph where it doesn’t.

# WRITE TESTS HERE

Explain what kind of mathematical object a TurkishVowelHarmony1 instance implements, referring to the set of Turkish phones as \(\Sigma\) and the set of strings over those phones as \(\Sigma^*\). (Remember that \(\Sigma^* \equiv \bigcup_{i=0}^\infty \Sigma^i\).)
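As a reminder of what \(\Sigma^i\) and \(\Sigma^*\) look like, here is a sketch that enumerates the strings of bounded length over a toy two-symbol alphabet, representing strings as tuples as we do for words. The helpers strings_of_length and strings_up_to are names I am introducing only for this illustration; \(\Sigma^*\) itself is infinite, so we can only ever enumerate a finite part of it.

```python
from itertools import product

def strings_of_length(sigma: set[str], i: int) -> set[tuple[str, ...]]:
    '''All strings of exactly length i over sigma'''
    return set(product(sorted(sigma), repeat=i))

def strings_up_to(sigma: set[str], n: int) -> set[tuple[str, ...]]:
    '''The union of the strings of length 0 through n over sigma'''
    return set().union(*(strings_of_length(sigma, i) for i in range(n + 1)))

sigma = {'a', 'b'}

# the only string of length 0 is the empty string
print(strings_of_length(sigma, 0) == {()})  # True

# there are 1 + 2 + 4 = 7 strings of length at most 2
print(len(strings_up_to(sigma, 2)))  # 7
```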

WRITE YOUR ANSWER HERE

Task 5

Lines: 1

A disharmonic form is a root/word that does not obey the vowel harmony rule. Write an instance method disharmonic in TurkishVowelHarmony2 that maps a root or word to a boolean indicating whether or not that root or word is disharmonic.

class TurkishVowelHarmony2(TurkishVowelHarmony1):
    '''The Turkish vowel harmony system'''
    
    def disharmonic(self, word: String) -> bool:
        '''Whether the word is disharmonic
        
        Parameters
        ----------
        word
            the word to check for disharmony
        '''
        
        # remove after implementing
        raise NotImplementedError

Write a test that checks for the correctness of disharmonic. It should check the same six cases you used to test __call__.

# WRITE TESTS HERE

Task 6

Lines: 2

Using your disharmonic method, write another instance method proportion_disharmonic_roots to compute the proportion of roots that are disharmonic in Turkish Unimorph.

class TurkishVowelHarmony3(TurkishVowelHarmony2):
    '''The Turkish vowel harmony system'''
    
    def proportion_disharmonic_roots(self, lexicon: Unimorph) -> float:
        '''The proportion of roots that are disharmonic in the lexicon
        
        Parameters
        ----------
        lexicon
            the Unimorph lexicon to check for disharmony
        '''
        
        # remove after implementing
        raise NotImplementedError

Task 7

Lines: 7

Using your disharmonic method, write an instance method xtab_root_word_harmony to cross-tabulate the proportion of words that are disharmonic against whether those words’ roots are disharmonic. The method should print that cross-tabulation as a \(2 \times 2\) table with root (dis)harmony along the rows and word (dis)harmony along the columns.

class TurkishVowelHarmony4(TurkishVowelHarmony3):
    '''The Turkish vowel harmony system'''
    
    def xtab_root_word_harmony(self, lexicon: Unimorph) -> None:
        '''Cross-tabulate word disharmony against root disharmony
        
        This should print (not return) a table represented as a list of lists:
        
                         | harmonic word | disharmonic word |
                         ------------------------------------
           harmonic root |               |                  |
        disharmonic root |               |                  |
        
        Parameters
        ----------
        lexicon
            the Unimorph lexicon to check for disharmony
        '''
        
        # remove after implementing
        raise NotImplementedError

Explain the pattern that you see in this table.

WRITE YOUR ANSWER HERE

Task 8

Lines: 1

Using your disharmonic method, write an instance method get_disharmonic to find all of the words of some category (e.g. N, V, etc.) with a particular set of features (e.g. {plural, genitive}, etc.). Use that method to find all the plural and/or genitive nouns with disharmonic roots. Note that I’ve prefiltered Turkish Unimorph to just the plural and genitive nouns, but this method should still work for arbitrary categories and morphological features.

class TurkishVowelHarmony(TurkishVowelHarmony4):
    '''The Turkish vowel harmony system'''
    
    def get_disharmonic(self, 
                        lexicon: Unimorph, 
                        category: str,
                        features: set[str]) -> set[String]:
        '''Find all of the words of some category with a particular set of features
        
        Parameters
        ----------
        lexicon
            the Unimorph lexicon to check for disharmony
        category
            some category (e.g. "N", "V", etc.)
        features
            some set of features (e.g. {"PL", "GEN"}, etc.)
        '''
        
        # remove after implementing
        raise NotImplementedError

Explain what pattern you see in the vowels of the plural and genitive affixes. (A prerequisite for answering this question is figuring out what the plural and genitive affixes are.)

WRITE YOUR ANSWER HERE