Assignments 5 and 6

Like Assignments 1-4, Assignments 5 and 6 are bundled together. You only need to do Tasks 1 and 2 for Assignment 5 and Task 3 for Assignment 6. There is additionally a Task 4 that you are not required to even attempt because of its difficulty. I have left that task in in case you’d like to take a crack at it. Whether or not you actually attempt Task 4, please at least read through the entire notebook to see what the task entails and how you might handle it.

These assignments focus on implementing a natural language inference system. In natural language inference, we receive a premise sentence and a hypothesis sentence and we must say whether we can infer the premise from the hypothesis. For instance, if (1) were our premise and (2) were our hypothesis, our system should respond yes.

  1. Every firm polled saw costs grow more than expected, even after adjusting for inflation.
  2. Every big company in the poll reported cost increases.

In MacCartney & Manning 2009 (henceforth, M&M), you read about one sort of system for doing this: a natural logic system. This system works by (i) obtaining an edit path from the premise and the hypothesis; (ii) mapping that edit path into an inference path; (iii) computing the join of the inferences in this path to obtain a relation between the premise and the hypothesis; and (iv) checking whether there is a forward entailment relation between the premise and the hypothesis.

The definition of the relations is given in Table 2 of the paper.

Symbol Names Example Set theoretic definition
\(x \equiv y\) equivalence couch \(\equiv\) sofa \(x = y\)
\(x \sqsubset y\) forward entailment crow \(\sqsubset\) bird \(x \subset y\)
\(x \sqsupset y\) reverse entailment European \(\sqsupset\) French \(x \supset y\)
\(x \land y\) negation human \(\land\) nonhuman \(x \cap y = \emptyset\) & \(x \cup y = U\)
\(x \mid y\) alternation cat \(\mid\) dog \(x \cap y = \emptyset\) & \(x \cup y \neq U\)
\(x \smile y\) cover animal \(\smile\) nonhuman \(x \cap y \neq \emptyset\) & \(x \cup y = U\)
\(x\;\#\;y\) independence animal \(\;\#\;\) nonhuman otherwise

The table of joins is given below.

relations = ['≡', '[', ']', '^', '|', 'u', '#']

join_table = {('≡', '≡'): {'≡'},
              ('≡', '['): {'['},
              ('≡', ']'): {']'},
              ('≡', '^'): {'^'},
              ('≡', '|'): {'|'},
              ('≡', 'u'): {'u'},
              ('≡', '#'): {'#'},
              ('[', '≡'): {'['},
              ('[', '['): {'['},
              ('[', ']'): {'#', '|', '≡', '[', ']'},
              ('[', '^'): {'|'},
              ('[', '|'): {'|'},
              ('[', 'u'): {'#', '^', 'u', '|', '['},
              ('[', '#'): {'#', '|', '['},
              (']', '≡'): {']'},
              (']', '['): {'#', 'u', '≡', '[', ']'},
              (']', ']'): {']'},
              (']', '^'): {'u'},
              (']', '|'): {'#', '^', 'u', '|', ']'},
              (']', 'u'): {'u'},
              (']', '#'): {'#', 'u', ']'},
              ('^', '≡'): {'^'},
              ('^', '['): {'u'},
              ('^', ']'): {'|'},
              ('^', '^'): {'≡'},
              ('^', '|'): {']'},
              ('^', 'u'): {'['},
              ('^', '#'): {'#'},
              ('|', '≡'): {'|'},
              ('|', '['): {'[', '^', '|', 'u', '#'},
              ('|', ']'): {'|'},
              ('|', '^'): {'['},
              ('|', '|'): {'#', '|', '≡', '[', ']'},
              ('|', 'u'): {'['},
              ('|', '#'): {'#', '|', '['},
              ('u', '≡'): {'u'},
              ('u', '['): {'u'},
              ('u', ']'): {'#', '^', 'u', '|', ']'},
              ('u', '^'): {']'},
              ('u', '|'): {']'},
              ('u', 'u'): {'#', 'u', '≡', '[', ']'},
              ('u', '#'): {'#', 'u', ']'},
              ('#', '≡'): {'#'},
              ('#', '['): {'#', 'u', '['},
              ('#', ']'): {']', '|', '#'},
              ('#', '^'): {'#'},
              ('#', '|'): {'#', '|', ']'},
              ('#', 'u'): {'#', 'u', '['},
              ('#', '#'): set()}

print('\t'.join(['']+relations))
for r1 in relations:
    row = '\t'.join(''.join([r3 for r3 in relations 
                             if r3 in join_table[r1, r2]]) 
                    for r2 in relations)
    print(f'{r1}\t{row}\n')
    ≡   [   ]   ^   |   u   #
≡   ≡   [   ]   ^   |   u   #

[   [   [   ≡[]|#   |   |   [^|u#   [|#

]   ]   ≡[]u#   ]   u   ]^|u#   u   ]u#

^   ^   u   |   ≡   ]   [   #

|   |   [^|u#   |   [   ≡[]|#   [   [|#

u   u   u   ]^|u#   ]   ]   ≡[]u#   ]u#

#   #   [u# ]|# #   ]|# [u# 

In Tasks 1 and 2, you will be developing the core of this system, using the minimum edit distance-based edit paths we developed in class and assuming default inference relations associated with each atomic edit operation (as discussed in Section 4 of M&M). In Tasks 3 and 4, you will enrich this system with lexical relation information from WordNet (Task 3) and with more intelligent handling of inferences associated with certain environments (as discussed in Section 5 of M&M). You will then test the system on the classic FraCaS dataset.

Task 1

Lines: 4

Define the __add__ magic method for the Inference class below. This method should use join_table (defined above) to produce a set of Inferences by joining two inferences—e.g. animal \(\sqsupset\) dog \(\bowtie\) dog \(\sqsupset\) greyhound = {animal \(\sqsupset\) greyhound}. __add__ must return a set because, as M&M discussed, the result of joining two relations can result in indeterminacy. (In their implementation, M&M actually treat all such indetrminate joins as #. We will not do that here, since it is useful to see why they do that.)

Importantly, note that __add__ should not be symmetric for the same reason joins are not: animal \(\sqsupset\) dog \(\bowtie\) dog \(\sqsubset\) mammal = {animal \(\equiv\) mammal, animal \(\sqsupset\) mammal, animal \(\sqsubset\) mammal, animal \(\smile\) mammal, animal \(\#\) mammal}, but dog \(\sqsubset\) mammal \(\bowtie\) animal \(\sqsupset\) dog isn’t even a licit join.

class Inference:
    '''An inference from one linguistic expression to another
    
    Parameters
    ----------
    premise
        The premise in the relation
    hypothesis
        The hypothesis in the relation
    relation
        The relation
    '''
    
    def __init__(self, premise: list[str], hypothesis: list[str], relation: str):
        if relation not in relations:
            raise ValueError(f'relation must be in {relations}')
        
        self.premise = premise
        self.hypothesis = hypothesis
        self.relation = relation
    
    def __repr__(self):
        return ' '.join(self.premise) + ' ' + self.relation + ' ' + ' '.join(self.hypothesis)
    
    def __hash__(self):
        return hash((tuple(self.premise), tuple(self.hypothesis), self.relation))
        
    def __add__(self, other: 'Inference') -> set['Inference']:
        raise NotImplementedError
    
    def __eq__(self, other: 'Inference') -> bool:
        return self.premise == other.premise &\
               self.hypothesis == other.hypothesis &\
               self.relation == other.relation

Test your implementation of Inference.__add__ using the Editor subclasses below.

from abc import ABC
from typing import Tuple, Optional

class Editor(ABC):
    
    def __init__(self, *args):
        raise NotImplementedError

    def __call__(self, input: list[str], idx: int) -> Inference:
        raise NotImplementedError
        
    @property
    def input(self):
        return self._input
    
    @property
    def output(self):
        return self._output
        

class Substitution(Editor):
    """A substitution editor
    
    Parameters
    ----------
    input
        The string in the input to replace
    output
        The string to replace the input string with
    relation
        The inference relation that results
    """
    
    default_relation = None
    
    def __init__(self, input: str, output: str, relation: str):
        self._input = input
        self._output = output
        self._relation = relation
        
    def __repr__(self):
        return f'<SUB "{self._output}" for "{self._input}" resulting in {self._relation}>'
    
    def __call__(self, input: list[str], idx: int) -> Inference:
        """Substitute input for output at location idx"""
        if input[idx] != self._input:
            raise ValueError(f'SUB "{self._input}" -> "{self._output}" at {idx} '
                             f'cannot be applied to {input}')
        
        output = input[:idx] + [self._output] + input[(idx+1):]
        
        return Inference(input, output, self._relation)
        
class Deletion(Editor):
    """A deletion editor
    
    Parameters
    ----------
    input
        The string in the input to delete
    relation
        The inference relation that results
    """
    
    def __init__(self, input: str, relation: str='['):
        self._input = input
        self._relation = relation
        
    def __repr__(self):
        return f'<DEL "{self._input}" resulting in {self._relation}>'
        
    def __call__(self, input: list[str], idx: int) -> Inference:
        """Substitute input for output at location idx"""
        if input[idx] != self._input:
            raise ValueError(f'DEL "{self._input}" at {idx} '
                             f'cannot be applied to {input}')
        
        output = input[:idx] + input[(idx+1):]
        
        return Inference(input, output, self._relation)
        
class Insertion(Editor):
    """An insertion editor
    
    Parameters
    ----------
    input
        The string to insert into the output
    relation
        The inference relation that results
    """
    
    def __init__(self,  output: str, relation: str=']'):
        self._output = output
        self._relation = relation
      
    def __repr__(self):
        return f'<INS "{self._output}" resulting in {self._relation}>'
    
    def __call__(self, input: list[str], idx: int) -> Inference:
        """Substitute input for output at location idx"""
        output = input[:idx] + [self._output] + input[idx:]
        
        return Inference(input, output, self._relation)

These subclasses are initialized with input and/or output strings and a relation. For instance, “brindle” and “fawn” are two different colorings of greyhounds—no greyhound is both brindle and fawn—and so they are in the | relation. Each is at least a subsective modifier (all brindle greyhounds are greyhounds), so if we delete one, we obtain a \(\sqsubset\) relation, and if we insert one, we get a \(\sqsupset\) relation (the default relations for deletion and insertion, as discussed in M&M).

substitute_fawn_for_brindle = Substitution('brindle', 'fawn', '|')
delete_brindle = Deletion('brindle')
insert_brindle = Insertion('brindle')

substitute_fawn_for_brindle, delete_brindle, insert_brindle
(<SUB "fawn" for "brindle" resulting in |>,
 <DEL "brindle" resulting in [>,
 <INS "brindle" resulting in ]>)

Note that not all insertions or deletions of adjectives will be associated with \(\sqsubset\) or \(\sqsupset\): privative adjectives like “fake” will introduce a \(|\): fake greyhounds are not greyhounds (fake greyhound \(|\) greyhound) and greyhounds are not fake greyhounds (greyhound \(|\) fake greyhound).

delete_fake = Deletion('fake', relation='|')
insert_fake = Insertion('fake', relation='|')

Indeed, most substitutions involving “fake” will also yield a \(|\) relation.

substitute_fake_for_virtuosic = Substitution('virtuosic', 'fake', '|')
substitute_virtuosic_for_fake = Substitution('fake', 'virtuosic', '|')

But insertion and deletion edits involving “virtuosic” should act like “brindle”.

delete_virtuosic = Deletion('virtuosic')
insert_virtuosic = Insertion('virtuosic')

Use the following four sentences to write your tests. These tests should involve applying an edit \(e_1\) to sentence \(s_i\) to yield sentence \(e_1(s_i)\), then applying an edit \(e_2\) to \(e_1(s_i)\) to yield sentence. You should then combine the inferences associated with \(e_1\) and \(e_2\) using your Inference.__add__ and check that it is correct. Make sure to test at least one case where the result should be a non-singleton set of inferences.

test_sentence1 = ['a', 'virtuosic', 'synthesist', 'loves', 'a', 'happy', 'brindle', 'greyhound']
test_sentence2 = ['a', 'synthesist', 'loves', 'a', 'happy', 'greyhound']
test_sentence3 = ['a', 'fake', 'synthesist', 'loves', 'a', 'happy', 'brindle', 'greyhound']
test_sentence4 = ['a', 'synthesist', 'loves', 'a', 'happy', 'brindle', 'greyhound']
# write tests here  

Task 2

Lines: 20

We don’t want to have to hand-compute the edits that are required to convert one sentence into another. Instead, we will use a modified form of the StringEdit class we developed in class. What we need in particular are the edit paths that that class produces.

import numpy as np
from typing import Union

EditorType = str
EditLocation = int
EditorInputOutput = Union[str, tuple[str, str]]
EditorParameters = tuple[EditorInputOutput, EditLocation]

EditPath = list[tuple[EditorType, EditorParameters]]
Alignment = list[tuple[int, int]]

def shift_edit_path(edit_path: EditPath):
    edit_path_shifted = []
    shifts = []

    for edit_type, (edit, idx) in edit_path:
        
        if edit_type == 'substitute':
            idx -= 1
            
            if edit[0] == edit[1]:
                continue

        if edit_type == 'delete':
            idx -= 1
            shifts.append((idx, -1))

        if edit_type == 'insert':
            shifts.append((idx, 1))

        for j, s in shifts[:-1]:
            idx = idx + s if idx >= j else idx     

        edit_path_shifted.append((edit_type, (edit, idx)))

    return edit_path_shifted


class StringEdit:
    '''distance, alignment, and edit paths between strings


    Parameters
    ----------
    insertion_cost
    deletion_cost
    substitution_cost
    '''
    
    def __init__(self, insertion_cost: float = 1., deletion_cost: float = 1., substitution_cost: float | None = None):
        self._insertion_cost = insertion_cost
        self._deletion_cost = deletion_cost

        if substitution_cost is None:
            self._substitution_cost = insertion_cost + deletion_cost
        else:
            self._substitution_cost = substitution_cost

    def __call__(self, source: list[str], target: list[str], only_distance: bool = False) ->  float | tuple[float, Alignment, EditPath]:
        return self._wagner_fisher(source, target, only_distance)
            
    def _wagner_fisher(self, source: list[str], target: list[str], only_distance: bool) ->  float | tuple[float, Alignment, EditPath]:
        '''compute minimum edit distance, alignment, and edit sequence'''

        n, m = len(source), len(target)
        source, target = self._add_sentinel(source, target)

        distance = np.zeros([n+1, m+1], dtype=float)
        pointers = np.zeros([n+1, m+1], dtype=list)
        edits = np.zeros([n+1, m+1], dtype=list)

        pointers[0,0] = []
        edits[0,0] = []
        
        for i in range(1,n+1):
            distance[i,0] = distance[i-1,0]+self._deletion_cost
            pointers[i,0] = [(i-1,0)]
            edits[i,0] = [('delete', (source[i], i))]

        for j in range(1,m+1):
            distance[0,j] = distance[0,j-1]+self._insertion_cost
            pointers[0,j] = [(0,j-1)]
            edits[0,j] = [('insert', (target[j], j))]
            
        for i in range(1,n+1):
            for j in range(1,m+1):
                if source[i] == target[j]:
                    substitution_cost = 0.
                else:
                    substitution_cost = self._substitution_cost
                    
                costs = np.array([distance[i-1,j]+self._deletion_cost,
                                  distance[i-1,j-1]+substitution_cost,
                                  distance[i,j-1]+self._insertion_cost])
                    
                distance[i,j] = costs.min()

                best_edits = np.where(costs==distance[i,j])[0]

                indices = [(i-1,j), (i-1,j-1), (i,j-1)]
                pointers[i,j] = [indices[k] for k in best_edits]
 
                edit_types = list(zip(["delete", "substitute", "insert"],
                                      [(source[i], i), 
                                       ((source[i], target[j]), i), 
                                       (target[j], i)]))
                edits[i,j] = [edit_types[k] for k in best_edits]

        if only_distance:
            return distance[n,m]

        pointer_backtrace, edit_backtrace = self._construct_backtrace(pointers, edits)
        
        return distance[n,m], pointer_backtrace, [shift_edit_path(bt) for bt in edit_backtrace]

    def _construct_backtrace(self, pointers, edits):
        last_idx = (
            pointers.shape[0] - 1, 
            pointers.shape[1] - 1
        )
        
        incomplete_pointer_backtraces = [([last_idx], [])]
        complete_pointer_backtraces = []
        
        complete_edit_backtraces = []

        while incomplete_pointer_backtraces:
            new_backtraces = [
                ([ptr] + ptr_bt, [edit] + edit_bt)
                for ptr_bt, edit_bt in incomplete_pointer_backtraces
                for ptr, edit in zip(pointers[ptr_bt[0]], edits[ptr_bt[0]])
            ]

            complete_pointer_backtraces += [
                bt for bt, _ in new_backtraces
                if bt[0] == (0, 0)
            ]
            complete_edit_backtraces += [
                edit_bt for ptr_bt, edit_bt in new_backtraces
                if ptr_bt[0] == (0, 0)
            ]

            incomplete_pointer_backtraces = [
                (ptr_bt, edit_bt) 
                for ptr_bt, edit_bt in new_backtraces
                if ptr_bt[0] != (0, 0)
            ]
            
        return complete_pointer_backtraces, complete_edit_backtraces
        
    def _add_sentinel(self, source, target):
        if isinstance(source, str):
            source = '#'+source
        elif isinstance(source, list):
            source = ['#'] + source
        elif isinstance(source, tuple):
            source = ('#',) + source
        else:
            raise ValueError('source must be str, list, or tuple')
            
        if isinstance(target, str):
            target = '#' + target
        elif isinstance(target, list):
            target = ['#'] + target
        elif isinstance(target, tuple):
            target = ('#',) + target
        else:
            raise ValueError('target must be str, list, or tuple')
            
        return source, target

In the original implementation, the edit path indexed into the source string. This made sense at the time because we wanted to know which words, relative to their original position in the string, are operated on by an edit. It’s problematic for current purposes, because once we compute insertions and deletions, the position of later insertions or deletions change. The implementation below now corrects for this, but just make sure you’re taking into account that the order of edits matters for this reason.

editdist = StringEdit(1, 1, 1)

print('Source:   ', test_sentence1)
print('Target:   ', test_sentence2)
print('Pointer path:', editdist(test_sentence1, test_sentence2)[1])
print('Edit path:', editdist(test_sentence1, test_sentence2)[2])
Source:    ['a', 'virtuosic', 'synthesist', 'loves', 'a', 'happy', 'brindle', 'greyhound']
Target:    ['a', 'synthesist', 'loves', 'a', 'happy', 'greyhound']
Pointer path: [[(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 5), (8, 6)]]
Edit path: [[('delete', ('virtuosic', 1)), ('delete', ('brindle', 5))]]
print('Source:   ', test_sentence2)
print('Target:   ', test_sentence1)
print('Edit path:', editdist(test_sentence2, test_sentence1)[2])
Source:    ['a', 'synthesist', 'loves', 'a', 'happy', 'greyhound']
Target:    ['a', 'virtuosic', 'synthesist', 'loves', 'a', 'happy', 'brindle', 'greyhound']
Edit path: [[('insert', ('virtuosic', 1)), ('insert', ('brindle', 6))]]

Implement the __call__ method for the NaturalLogic class. This should take a premise sentence and a hypothesis sentence, and it should produce two things: (i) the path of inferences (computed from the path of edits) that take you from premise to hypothesis; and (ii) the set of inferences that results from joining the inferences in that path.

EditorLibrary = dict[str, dict[EditorInputOutput, Editor]]
InferencePath = list[Inference]
NaturalLogicResult = tuple[InferencePath, set[Inference]]

empty_library = {'substitute': {}, 'delete': {}, 'insert': {}}

class NaturalLogic:
    
    EDIT = StringEdit(1, 1, 1)
    
    def __init__(self, editor_library: EditorLibrary=empty_library):
        self._editor_library = editor_library
    
    def __getitem__(self, key: tuple[EditorType, EditorInputOutput]):
        editor_type, edit = key
        
        if edit not in self._editor_library[editor_type]:
            self._add_default_editor(editor_type, edit)
            
        return self._editor_library[editor_type][edit]
    
    def __call__(self, premise: list[str], hypothesis: list[str]) -> list[NaturalLogicResult]:
        raise NotImplementedError
    
    def add_editor(self, editor: Editor):
        if isinstance(editor, Substitution):
            self._editor_library['substitute'][(editor.input, editor.output)] = editor
            
        if isinstance(editor, Insertion):
            self._editor_library['insert'][editor.output] = editor
            
        if isinstance(editor, Deletion):
            self._editor_library['delete'][editor.input] = editor
            
    def _add_default_editor(self, editor_type: str, edit: EditorInputOutput):
        if editor_type == 'substitute':
            self.add_editor(Substitution(input=edit[0], output=edit[1], relation='#'))
        
        elif editor_type == 'insert':
            self.add_editor(Insertion(output=edit))
            
        elif editor_type == 'delete':
            self.add_editor(Deletion(input=edit))
            
        else:
            raise ValueError(f'{editor_type} is not a recognized edit type')
            

Implement tests using the four test sentences above. For now, you can just assume that the editor library contains the editors defined for Task 1. (We don’t need to explicitly specify any insertions that result in \(\sqsupset\) or deletions that result in \(\sqsubset\), since those are added by default by NaturalLogic.add_editor.) In Task 3, we will expand the library using WordNet.

library = {'substitute': {('virtuosic', 'fake'): substitute_fake_for_virtuosic,
                          ('fake', 'virtuosic'): substitute_virtuosic_for_fake,
                          ('brindle', 'fawn'): substitute_fawn_for_brindle}, 
           'delete': {"fake": delete_fake}, 
           'insert': {"fake": insert_fake}}
# write tests here

Evaluating against FraCaS

For the remainder of the assignment (Tasks 3 and 4), we will evaluate our NaturalLogic implementation using the FraCaS textual inference test suite. FraCaS is shipped as XML.

%%bash

wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
cat fracas.xml

I’ve included a simple corpus reader below.

!pip install beautifulsoup4
!pip install lxml
from bs4 import BeautifulSoup, Tag

class Fracas:
    """Corpus reader for the FraCaS textual inference problem set"""
    
    def __init__(self, root: str="fracas.xml"):
        with open(root) as fp:
            self._data = BeautifulSoup(fp, 'lxml')
            
        self._construct_problem_generator()
            
    def __iter__(self):
        return self
    
    def __next__(self):
        return next(self._problem_generator)
    
    def __repr__(self):
        return self._data.comment.string
     
    def _construct_problem_generator(self):
        for problem in self.problems:
            yield problem
    
    @property
    def problems(self):
        return [FracasProblem(problem) 
                for problem in self._data.find_all('problem')]

class FracasProblem:
    """A FraCaS problem"""
    
    problem_type_map = {'001'}
    
    def __init__(self, problem: Tag):
        self.id = problem.get('id')
        self.answer = problem.get('fracas_answer')
        
        self.premise = problem.p.string.strip()
        self.question = problem.q.string.strip()
        self.hypothesis = problem.h.string.strip()
        
    def __repr__(self):
        return (f"id: {self.id}"
                f"\n\npremise: {self.premise}"
                f"\nquestion: {self.question}"
                f"\nhypothesis: {self.hypothesis}"
                f"\n\nanswer: {self.answer}")
fracas = Fracas()

fracas

Since the sentences are just raw strings, to get them in the form of a list of strings, you will need a tokenizer. I would suggest using the one available in the stanza package. For our purposes, it is also simpler to use the lemma, rather than the token itself, because your WordNet editor library won’t handle inflectional morphology (unless you explicitly engineered it to).

!pip install stanza

import stanza

stanza.download('en')
lemmatizer = stanza.Pipeline('en', processors='tokenize, mwt, pos, lemma')

lemmatizer('Every virtuosic synthesist loves some greyhounds.')

To use this dataset to test your NaturalLogic implementation, you will need to convert the inference produced by __call__ into a “yes”, “no”, or “don’t know” answer. (Don’t worry about any items not labeled with one of these three. This will require you to define a mapping from inference types to answers. You should then compute the accuracy, precision, recall, and F1 of your system.

Each of these metrics can be defined in terms of…

  1. The true positive count for class \(c\): \[\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) = |\{i\;:\;y^\mathrm{test}_i = \hat{y}^\mathrm{test}_i = c\}|\]
  2. The true negative count for class \(c\): \[\mathrm{tn}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) = |\{i\;:\;y^\mathrm{test}_i \neq c \land \hat{y}^\mathrm{test}_i \neq c\}|\]
  3. The false positve count for class \(c\): \[\mathrm{fp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) = |\{i\;:\;y^\mathrm{test}_i \neq c \land \hat{y}^\mathrm{test}_i = c\}|\]
  4. The false negative count for class \(c\): \[\mathrm{fn}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) = |\{i\;:\;y^\mathrm{test}_i = c \land \hat{y}^\mathrm{test}_i \neq c\}|\]

…where the class is “yes”, “no”, or “unknown”; \(y^\mathrm{test}_i\) is the true label for item \(i\) (found in FraCaS) and \(\hat{y}^\mathrm{test}_i\) is your system’s prediction for the class of item \(i\). (Ignore cases where the class is not one of these three.)

Accuracy

For what proportion of the test data \(\{(x^\mathrm{test}_{1}, y^\mathrm{test}_1), ..., (x^\mathrm{test}_N, y^\mathrm{test}_N)\}\) does the model’s predicted class \(f(x^\mathrm{test}_i) = \hat{y}^\mathrm{test}_i\) for an item match the ground truth class for that item?

\[\mathrm{accuracy}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}\right) = \frac{\sum_{c \in \mathcal{Y}}\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) + \mathrm{tn}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}{N}\]

sklearn.metrics technically provides an accuracy_score function, but generally it’s just as straightforward to compute it yourself.

!pip install sklearn
from sklearn.metrics import accuracy_score

Precision

For a particular class \(c\), what proportion of the test items that the model said have that class actually have that class?

\[\mathrm{precision}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right) = \frac{\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}{\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) + \mathrm{fp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}\]

For giving an aggregate precision across classes, it’s common to distinguish micro-average precision and macro-average precision.

\[\mathrm{microprecision}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}\right) = \frac{\sum_{c \in \mathcal{Y}} \mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}{\sum_{c \in \mathcal{Y}} \mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) + \mathrm{fp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}\]

\[\mathrm{macroprecision}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}\right) = \frac{1}{|\mathcal{Y}|}\sum_{c \in \mathcal{Y}} \mathrm{precision}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right)\]

from sklearn.metrics import precision_score

Recall

For a particular class \(c\), what proportion of the test items that have that class did the model correctly predict to have that class?

\[\mathrm{recall}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right) = \frac{\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}{\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) + \mathrm{fn}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}\]

Similar definitions for micro- and macro-average recall can be given.

from sklearn.metrics import recall_score

F1

For a class \(c\), what is the harmonic mean of precision and recall?

\[F_1\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right) = \frac{2}{\frac{1}{\mathrm{precision}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right)} + \frac{1}{\mathrm{recall}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right)}} = 2\frac{\mathrm{precision}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right)\;\cdot\;\mathrm{recall}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right)}{\mathrm{precision}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right) + \mathrm{recall}\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right)}\]

To define micro- and macro-average \(F_1\) it can be useful to reexpress it.

\[F_1\left(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c\right) = \frac{2\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}{2\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) + \mathrm{fp}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c) + \mathrm{fn}(\hat{\mathbf{y}}^\mathrm{test}_i, \mathbf{y}^\mathrm{test}, c)}\]

Definitions similar to those for precision can be given for micro- and macro-average \(F_1\).

from sklearn.metrics import f1_score

Task 3

Define an instance method NaturalLogic.load_wordnet that constructs an editor library from WordNet hypernymy, hyponymy, and antonymy relations.

!pip install nltk

import nltk

nltk.download('wordnet')

from nltk.corpus import wordnet
class NaturalLogic(NaturalLogic):

    def load_wordnet(self):
        raise NotImplementedError
    
    @classmethod
    def from_wordnet(cls):
        natlog = cls()
        natlog.load_wordnet()
        
        return natlog

Test your new library by writing examples that require knowledge of hypernymy, hyponymy, and antonymy to correctly handle.

# write tests here

Evaluate your new library on FraCaS by computing precision, recall, and F1 for the items that are either labeled “yes”, “no”, or “don’t know”. Remember that this is going to require you to define a way of mapping inference types to answers.

# write evaluation here

These numbers will be bad. The point is to see that handling even the apparently simple cases in FraCaS is very difficult, even with a fairly extensive edit library. PArt of the reason for this is that we are not handling quantification or negation at all.

Task 4

Update your implementation of NaturalLogic.__call__ to correctly handle negation and the quantifiers discussed in Section 5 of M&M. Assume that “a” behaves as “some”; that “all” and “each” behave like “every”; that “not all” behaves like “not every”; and that “none” behaves like “no”. (There are many quantifiers this won’t cover—e.g. “most”, “many”, etc.—don’t worry about trying to figure out what the projectivity signatures for these looks like.)

To do this, you will need to identify the first and second arguments of the quantifier. For instance, for (3), the first argument of “every” is “virtuosic synthesist” and the second argument is “loves a grehound”. (If you’ve taken semantics, you know I’m fudging a little here.)

  1. Every virtuosic synthesist loves some grehyound.

I have provided an implementation below. This implementation will only work for simple cases, like (3). Finding the arguments of quantifiers in general can be highly nontrivial for reason you’ll need to take a formal semantics course to truly appreciate.

from collections import defaultdict
from functools import lru_cache

class DependencyParse:
    
    parser = stanza.Pipeline('en')
    
    def __init__(self, sentence: str):
        self.parsed_sentence = self.parser(sentence).sentences[0].words
    
    def parent_of(self, idx: int) -> dict[str, Union[int, str]]:
        paridx = self.parsed_sentence[idx].head - 1
        return self.parsed_sentence[paridx]
    
    @lru_cache(256)
    def children_of(self, idx: int, closure: bool=False):
        immediate_children = [word 
                              for word in self.parsed_sentence 
                              if word.head == (idx+1)]
        
        if closure:
            return immediate_children +\
                   [word for child in immediate_children 
                    for word in self.children_of(child.id-1, closure)]
        
        else:
            return immediate_children
        
    def find_quantifier_arguments(self):
        """Find the first and second arguments of each quantifier in the sentence."""

        first_argument = defaultdict(list)
        second_argument = defaultdict(list)
        
        for quantifier in self.parsed_sentence:
            if quantifier.lemma in ['every', 'all', 'some', 'a', 'no', 'none']:
                parent = self.parent_of(quantifier.id-1)
                first_argument[quantifier.id-1].append(parent)

                for dependent in self.children_of(parent.id-1):
                    if quantifier.id != dependent.id:
                        first_argument[quantifier.id-1].append(dependent)

                predicate = self.parent_of(parent.id-1)

                second_argument[quantifier.id-1] = [predicate] +\
                                                   [word 
                                                    for word in self.children_of(predicate.id-1,
                                                                                 closure=True) 
                                                    if word not in first_argument[quantifier.id-1] 
                                                    if quantifier.id != word.id]

        return first_argument, second_argument
            

Note that this implementation produces second arguments for one quantifier that can overlap with the first arguments of another.

quantifier_arguments = DependencyParse('Every virtuosic synthesist loves a greyhound.').find_quantifier_arguments()
    
print('first arguments\n\n', quantifier_arguments[0].values())
print()
print()
print('second arguments\n\n', quantifier_arguments[1])

This overlap means that you are going to need to choose an order in which to project the inference relations through the quantifiers—e.g. in (3), do you first consider “every”, then “a”; or vice versa. This order will necessarily be heuristic. A reasonable order might be the reverse of the linear (or “surface”) order. Other options might be based on depth in the tree.

class NaturalLogic(NaturalLogic):

    def __call__(self, premise: list[str], hypothesis: list[str]) -> tuple[list[InferencePath], 
                                                                           set[Inference]]:
        raise NotImplementedError

Apply the same evaluation you developed for Task 3 to your new implementation of NaturalLogic.__call__.

# write evaluation here