Like Assignments 1-4, Assignments 5 and 6 are bundled together. You only need to do Tasks 1 and 2 for Assignment 5 and Task 3 for Assignment 6. There is additionally a Task 4 that you are not required to attempt because of its difficulty. I have left that task in, in case you’d like to take a crack at it. Whether or not you actually attempt Task 4, please at least read through the entire notebook to see what the task entails and how you might handle it.
These assignments focus on implementing a natural language inference system. In natural language inference, we receive a premise sentence and a hypothesis sentence, and we must say whether we can infer the hypothesis from the premise. For instance, if (1) were our premise and (2) were our hypothesis, our system should respond yes.
Every firm polled saw costs grow more than expected, even after adjusting for inflation.
Every big company in the poll reported cost increases.
In MacCartney & Manning 2009 (henceforth, M&M), you read about one sort of system for doing this: a natural logic system. This system works by (i) obtaining an edit path from the premise to the hypothesis; (ii) mapping that edit path into an inference path; (iii) computing the join of the inferences in this path to obtain a relation between the premise and the hypothesis; and (iv) checking whether there is a forward entailment relation between the premise and the hypothesis.
The definition of the relations is given in Table 2 of the paper.
| ⋈ | ≡ | [ | ] | ^ | \| | u | # |
|---|---|---|---|---|----|---|---|
| ≡ | ≡ | [ | ] | ^ | \| | u | # |
| [ | [ | [ | ≡[]\|# | \| | \| | [^\|u# | [\|# |
| ] | ] | ≡[]u# | ] | u | ]^\|u# | u | ]u# |
| ^ | ^ | u | \| | ≡ | ] | [ | # |
| \| | \| | [^\|u# | \| | [ | ≡[]\|# | [ | [\|# |
| u | u | u | ]^\|u# | ] | ] | ≡[]u# | ]u# |
| # | # | [u# | ]\|# | # | ]\|# | [u# | ≡[]^\|u# |
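The notebook's `join_table` is defined above (outside this excerpt), but to make the table concrete, one natural encoding is a dict from relation pairs to sets of possible joined relations, using the same ASCII symbols as the table (`[` for forward entailment, `]` for reverse entailment, `u` for cover). The fragment below is a hypothetical sketch of that encoding, not necessarily how `join_table` is actually defined:

```python
# A hypothetical encoding of part of the join table as a dict from
# (row, column) relation pairs to sets of possible resulting relations.
join_table_fragment = {
    ('[', '['): {'['},                       # [ joined with [ is determinate
    ('[', ']'): {'≡', '[', ']', '|', '#'},   # joins can be indeterminate
    ('[', '^'): {'|'},
    (']', '['): {'≡', '[', ']', 'u', '#'},   # note: differs from ('[', ']')
}

def join(r1: str, r2: str) -> set[str]:
    """Look up the join R1 ⋈ R2 in the (partial) table."""
    return join_table_fragment[(r1, r2)]
```

Note that the table is not symmetric: `join('[', ']')` and `join(']', '[')` yield different sets, which is exactly why `Inference.__add__` (Task 1) must not be symmetric either.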
In Tasks 1 and 2, you will be developing the core of this system, using the minimum edit distance-based edit paths we developed in class and assuming default inference relations associated with each atomic edit operation (as discussed in Section 4 of M&M). In Tasks 3 and 4, you will enrich this system with lexical relation information from WordNet (Task 3) and with more intelligent handling of inferences associated with certain environments (as discussed in Section 5 of M&M). You will then test the system on the classic FraCaS dataset.
Task 1
Lines: 4
Define the __add__ magic method for the Inference class below. This method should use join_table (defined above) to produce a set of Inferences by joining two inferences—e.g. animal \(\sqsupset\) dog \(\bowtie\) dog \(\sqsupset\) greyhound = {animal \(\sqsupset\) greyhound}. __add__ must return a set because, as M&M discuss, joining two relations can be indeterminate. (In their implementation, M&M actually treat all such indeterminate joins as #. We will not do that here, since it is useful to see why they do that.)
Importantly, note that __add__ should not be symmetric for the same reason joins are not: animal \(\sqsupset\) dog \(\bowtie\) dog \(\sqsubset\) mammal = {animal \(\equiv\) mammal, animal \(\sqsupset\) mammal, animal \(\sqsubset\) mammal, animal \(\smile\) mammal, animal \(\#\) mammal}, but dog \(\sqsubset\) mammal \(\bowtie\) animal \(\sqsupset\) dog isn’t even a licit join.
```python
class Inference:
    '''An inference from one linguistic expression to another

    Parameters
    ----------
    premise
        The premise in the relation
    hypothesis
        The hypothesis in the relation
    relation
        The relation
    '''
    def __init__(self, premise: list[str], hypothesis: list[str], relation: str):
        if relation not in relations:
            raise ValueError(f'relation must be in {relations}')

        self.premise = premise
        self.hypothesis = hypothesis
        self.relation = relation

    def __repr__(self):
        return ' '.join(self.premise) + ' ' + self.relation + ' ' + ' '.join(self.hypothesis)

    def __hash__(self):
        return hash((tuple(self.premise), tuple(self.hypothesis), self.relation))

    def __add__(self, other: 'Inference') -> set['Inference']:
        raise NotImplementedError

    def __eq__(self, other: 'Inference') -> bool:
        return self.premise == other.premise and \
               self.hypothesis == other.hypothesis and \
               self.relation == other.relation
```
Test your implementation of Inference.__add__ using the Editor subclasses below.
```python
from abc import ABC

class Editor(ABC):
    def __init__(self, *args):
        raise NotImplementedError

    def __call__(self, input: list[str], idx: int) -> Inference:
        raise NotImplementedError

    @property
    def input(self):
        return self._input

    @property
    def output(self):
        return self._output


class Substitution(Editor):
    """A substitution editor

    Parameters
    ----------
    input
        The string in the input to replace
    output
        The string to replace the input string with
    relation
        The inference relation that results
    """
    default_relation = None

    def __init__(self, input: str, output: str, relation: str):
        self._input = input
        self._output = output
        self._relation = relation

    def __repr__(self):
        return f'<SUB "{self._output}" for "{self._input}" resulting in {self._relation}>'

    def __call__(self, input: list[str], idx: int) -> Inference:
        """Substitute output for input at location idx"""
        if input[idx] != self._input:
            raise ValueError(f'SUB "{self._input}" -> "{self._output}" at {idx} '
                             f'cannot be applied to {input}')

        output = input[:idx] + [self._output] + input[(idx+1):]

        return Inference(input, output, self._relation)


class Deletion(Editor):
    """A deletion editor

    Parameters
    ----------
    input
        The string in the input to delete
    relation
        The inference relation that results
    """
    def __init__(self, input: str, relation: str = '['):
        self._input = input
        self._relation = relation

    def __repr__(self):
        return f'<DEL "{self._input}" resulting in {self._relation}>'

    def __call__(self, input: list[str], idx: int) -> Inference:
        """Delete input at location idx"""
        if input[idx] != self._input:
            raise ValueError(f'DEL "{self._input}" at {idx} '
                             f'cannot be applied to {input}')

        output = input[:idx] + input[(idx+1):]

        return Inference(input, output, self._relation)


class Insertion(Editor):
    """An insertion editor

    Parameters
    ----------
    output
        The string to insert into the output
    relation
        The inference relation that results
    """
    def __init__(self, output: str, relation: str = ']'):
        self._output = output
        self._relation = relation

    def __repr__(self):
        return f'<INS "{self._output}" resulting in {self._relation}>'

    def __call__(self, input: list[str], idx: int) -> Inference:
        """Insert output at location idx"""
        output = input[:idx] + [self._output] + input[idx:]

        return Inference(input, output, self._relation)
```
These subclasses are initialized with input and/or output strings and a relation. For instance, “brindle” and “fawn” are two different colorings of greyhounds—no greyhound is both brindle and fawn—and so they are in the | relation. Each is at least a subsective modifier (all brindle greyhounds are greyhounds), so if we delete one, we obtain a \(\sqsubset\) relation, and if we insert one, we get a \(\sqsupset\) relation (the default relations for deletion and insertion, as discussed in M&M).
(<SUB "fawn" for "brindle" resulting in |>,
<DEL "brindle" resulting in [>,
<INS "brindle" resulting in ]>)
Note that not all insertions or deletions of adjectives will be associated with \(\sqsubset\) or \(\sqsupset\): privative adjectives like “fake” will introduce a \(|\): fake greyhounds are not greyhounds (fake greyhound \(|\) greyhound) and greyhounds are not fake greyhounds (greyhound \(|\) fake greyhound).
Use the following four sentences to write your tests. These tests should involve applying an edit \(e_1\) to sentence \(s_i\) to yield sentence \(e_1(s_i)\), then applying an edit \(e_2\) to \(e_1(s_i)\) to yield sentence \(e_2(e_1(s_i))\). You should then combine the inferences associated with \(e_1\) and \(e_2\) using your Inference.__add__ and check that the result is correct. Make sure to test at least one case where the result should be a non-singleton set of inferences.
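To make the shape of such a test concrete, here is a self-contained sketch of chaining two edits, using plain list operations rather than the Editor classes above (the sentence and indices are made up for illustration):

```python
# A toy sentence as a token list
s = ['the', 'brindle', 'greyhound', 'ran']

# Edit e1: substitute 'fawn' for 'brindle' at index 1 (relation |,
# since no greyhound is both brindle and fawn)
e1_out = s[:1] + ['fawn'] + s[2:]

# Edit e2: delete 'fawn' at index 1 *in the edited sentence* (relation [)
e2_out = e1_out[:1] + e1_out[2:]
```

Per the join table above, \(|\) joined with \([\) is a five-way indeterminate set, so a pair of edits like this is a good candidate for your non-singleton test case.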
We don’t want to have to hand-compute the edits that are required to convert one sentence into another. Instead, we will use a modified form of the StringEdit class we developed in class. What we need in particular are the edit paths that that class produces.
In the original implementation, the edit path indexed into the source string. That made sense at the time, because we wanted to know which words, relative to their original positions in the string, an edit operates on. It is problematic for current purposes because, once we apply insertions and deletions, the positions at which later insertions and deletions apply change. The implementation below corrects for this, but make sure you take into account that the order of edits matters for this reason.
Implement the __call__ method for the NaturalLogic class. This should take a premise sentence and a hypothesis sentence, and it should produce two things: (i) the path of inferences (computed from the path of edits) that take you from premise to hypothesis; and (ii) the set of inferences that results from joining the inferences in that path.
Implement tests using the four test sentences above. For now, you can just assume that the editor library contains the editors defined for Task 1. (We don’t need to explicitly specify any insertions that result in \(\sqsupset\) or deletions that result in \(\sqsubset\), since those are added by default by NaturalLogic.add_editor.) In Task 3, we will expand the library using WordNet.
For the remainder of the assignment (Tasks 3 and 4), we will evaluate our NaturalLogic implementation using the FraCaS textual inference test suite. FraCaS is shipped as XML.
```python
from bs4 import BeautifulSoup, Tag

class Fracas:
    """Corpus reader for the FraCaS textual inference problem set"""
    def __init__(self, root: str = "fracas.xml"):
        with open(root) as fp:
            self._data = BeautifulSoup(fp, 'lxml')

        self._problem_generator = self._construct_problem_generator()

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._problem_generator)

    def __repr__(self):
        return self._data.comment.string

    def _construct_problem_generator(self):
        for problem in self.problems:
            yield problem

    @property
    def problems(self):
        return [FracasProblem(problem) for problem in self._data.find_all('problem')]


class FracasProblem:
    """A FraCaS problem"""
    problem_type_map = {'001'}

    def __init__(self, problem: Tag):
        self.id = problem.get('id')
        self.answer = problem.get('fracas_answer')
        self.premise = problem.p.string.strip()
        self.question = problem.q.string.strip()
        self.hypothesis = problem.h.string.strip()

    def __repr__(self):
        return (f"id: {self.id}"
                f"\n\npremise: {self.premise}"
                f"\nquestion: {self.question}"
                f"\nhypothesis: {self.hypothesis}"
                f"\n\nanswer: {self.answer}")
```
```python
fracas = Fracas()
fracas
```
Since the sentences are just raw strings, to get them in the form of a list of strings, you will need a tokenizer. I would suggest using the one available in the stanza package. For our purposes, it is also simpler to use the lemma, rather than the token itself, because your WordNet editor library won’t handle inflectional morphology (unless you explicitly engineered it to).
To use this dataset to test your NaturalLogic implementation, you will need to convert the inference produced by __call__ into a “yes”, “no”, or “don’t know” answer. (Don’t worry about any items not labeled with one of these three.) This will require you to define a mapping from inference types to answers. You should then compute the accuracy, precision, recall, and F1 of your system.
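One reasonable mapping, roughly in line with how M&M evaluate on FraCaS (treat the exact entries as assumptions to verify against the paper), sends equivalence and forward entailment to “yes”, the contradiction relations to “no”, and everything else to “don’t know”:

```python
# Sketch of one possible relation-to-answer mapping (symbols as in the
# join table above: '[' = forward entailment, '^' = negation,
# '|' = alternation, ']' = reverse entailment, 'u' = cover, '#' = independence).
def relation_to_answer(relation: str) -> str:
    if relation in {'≡', '['}:   # premise entails hypothesis
        return 'yes'
    if relation in {'^', '|'}:   # premise contradicts hypothesis
        return 'no'
    return "don't know"          # ']', 'u', '#': no definite answer
```

Since __call__ returns a *set* of inferences, you will also need to decide how to aggregate when the set licenses more than one answer.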
Each of these metrics can be defined in terms of…
The true positive count for class \(c\): \[\mathrm{tp}(\hat{\mathbf{y}}^\mathrm{test}, \mathbf{y}^\mathrm{test}, c) = |\{i\;:\;y^\mathrm{test}_i = \hat{y}^\mathrm{test}_i = c\}|\]
The true negative count for class \(c\): \[\mathrm{tn}(\hat{\mathbf{y}}^\mathrm{test}, \mathbf{y}^\mathrm{test}, c) = |\{i\;:\;y^\mathrm{test}_i \neq c \land \hat{y}^\mathrm{test}_i \neq c\}|\]
The false positive count for class \(c\): \[\mathrm{fp}(\hat{\mathbf{y}}^\mathrm{test}, \mathbf{y}^\mathrm{test}, c) = |\{i\;:\;y^\mathrm{test}_i \neq c \land \hat{y}^\mathrm{test}_i = c\}|\]
The false negative count for class \(c\): \[\mathrm{fn}(\hat{\mathbf{y}}^\mathrm{test}, \mathbf{y}^\mathrm{test}, c) = |\{i\;:\;y^\mathrm{test}_i = c \land \hat{y}^\mathrm{test}_i \neq c\}|\]
…where the class is “yes”, “no”, or “don’t know”; \(y^\mathrm{test}_i\) is the true label for item \(i\) (found in FraCaS) and \(\hat{y}^\mathrm{test}_i\) is your system’s prediction for the class of item \(i\). (Ignore cases where the class is not one of these three.)
Accuracy
For what proportion of the test data \(\{(x^\mathrm{test}_{1}, y^\mathrm{test}_1), ..., (x^\mathrm{test}_N, y^\mathrm{test}_N)\}\) does the model’s predicted class \(f(x^\mathrm{test}_i) = \hat{y}^\mathrm{test}_i\) for an item match the ground truth class for that item?
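Putting the definitions above into code, here is a self-contained sketch of the per-class counts and the derived metrics, exercised on made-up toy labels (not the FraCaS evaluation itself):

```python
def per_class_counts(y_true, y_pred, c):
    """tp, fp, fn for class c, following the set-cardinality definitions above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    return tp, fp, fn

def precision_recall_f1(y_true, y_pred, c):
    tp, fp, fn = per_class_counts(y_true, y_pred, c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Proportion of items whose predicted class matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example: one 'no' item misclassified as 'yes'
y_true = ['yes', 'no', 'yes', "don't know"]
y_pred = ['yes', 'yes', 'yes', "don't know"]
```

For multi-class evaluation like this, you would compute precision, recall, and F1 per class and then average (macro- or micro-averaging is your choice, but say which you use).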
```python
class NaturalLogic(NaturalLogic):
    def load_wordnet(self):
        raise NotImplementedError

    @classmethod
    def from_wordnet(cls):
        natlog = cls()
        natlog.load_wordnet()
        return natlog
```
Test your new library by writing examples that require knowledge of hypernymy, hyponymy, and antonymy to correctly handle.
# write tests here
Evaluate your new library on FraCaS by computing precision, recall, and F1 for the items labeled “yes”, “no”, or “don’t know”. Remember that this is going to require you to define a way of mapping inference types to answers.
# write evaluation here
These numbers will be bad. The point is to see that handling even the apparently simple cases in FraCaS is very difficult, even with a fairly extensive edit library. Part of the reason for this is that we are not handling quantification or negation at all.
Task 4
Update your implementation of NaturalLogic.__call__ to correctly handle negation and the quantifiers discussed in Section 5 of M&M. Assume that “a” behaves as “some”; that “all” and “each” behave like “every”; that “not all” behaves like “not every”; and that “none” behaves like “no”. (There are many quantifiers this won’t cover, e.g. “most”, “many”, etc.; don’t worry about trying to figure out what the projectivity signatures for these look like.)
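One way to organize this is a projectivity table mapping a (quantifier, argument position) pair to a function on relations. The sketch below encodes only the monotonicity entries, which follow from Section 5 of M&M (“every” is downward monotone in its first argument, so it swaps \(\sqsubset\) and \(\sqsupset\), and upward monotone in its second); the remaining entries are left for you to fill in, and the exact encoding is an assumption, not part of the assignment scaffolding:

```python
# Partial, hypothetical projectivity table: (quantifier, argument index)
# -> dict mapping an atomic-edit relation to its projection through that
# environment. Fill in the rows for '^', '|', 'u', '#' from M&M Section 5.
PROJECTIVITY = {
    ('every', 1): {'≡': '≡', '[': ']', ']': '['},  # downward monotone: swap
    ('every', 2): {'≡': '≡', '[': '[', ']': ']'},  # upward monotone: unchanged
    ('some', 1):  {'≡': '≡', '[': '[', ']': ']'},  # upward monotone
    ('some', 2):  {'≡': '≡', '[': '[', ']': ']'},  # upward monotone
    ('no', 1):    {'≡': '≡', '[': ']', ']': '['},  # downward monotone
    ('no', 2):    {'≡': '≡', '[': ']', ']': '['},  # downward monotone
}

def project(quantifier: str, arg: int, relation: str) -> str:
    """Project a relation through a quantifier's argument position.

    Unlisted relations fall back to '#' (a conservative default)."""
    return PROJECTIVITY[(quantifier, arg)].get(relation, '#')
```

For example, since dog \(\sqsubset\) animal, projecting through the first argument of “every” gives “every dog barks” \(\sqsupset\) “every animal barks”.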
To do this, you will need to identify the first and second arguments of the quantifier. For instance, for (3), the first argument of “every” is “virtuosic synthesist” and the second argument is “loves a greyhound”. (If you’ve taken semantics, you know I’m fudging a little here.)
Every virtuosic synthesist loves some greyhound.
I have provided an implementation below. This implementation will only work for simple cases, like (3). Finding the arguments of quantifiers in general can be highly nontrivial for reasons you’ll need to take a formal semantics course to truly appreciate.
```python
from collections import defaultdict
from functools import lru_cache
from typing import Union

import stanza

class DependencyParse:
    parser = stanza.Pipeline('en')

    def __init__(self, sentence: str):
        self.parsed_sentence = self.parser(sentence).sentences[0].words

    def parent_of(self, idx: int) -> dict[str, Union[int, str]]:
        paridx = self.parsed_sentence[idx].head - 1
        return self.parsed_sentence[paridx]

    @lru_cache(256)
    def children_of(self, idx: int, closure: bool = False):
        immediate_children = [word for word in self.parsed_sentence
                              if word.head == (idx + 1)]

        if closure:
            return immediate_children + \
                [word for child in immediate_children
                 for word in self.children_of(child.id - 1, closure)]
        else:
            return immediate_children

    def find_quantifier_arguments(self):
        """Find the first and second arguments of each quantifier in the sentence."""
        first_argument = defaultdict(list)
        second_argument = defaultdict(list)

        for quantifier in self.parsed_sentence:
            if quantifier.lemma in ['every', 'all', 'some', 'a', 'no', 'none']:
                parent = self.parent_of(quantifier.id - 1)
                first_argument[quantifier.id - 1].append(parent)

                for dependent in self.children_of(parent.id - 1):
                    if quantifier.id != dependent.id:
                        first_argument[quantifier.id - 1].append(dependent)

                predicate = self.parent_of(parent.id - 1)
                second_argument[quantifier.id - 1] = [predicate] + \
                    [word for word in self.children_of(predicate.id - 1, closure=True)
                     if word not in first_argument[quantifier.id - 1]
                     if quantifier.id != word.id]

        return first_argument, second_argument
```
Note that this implementation produces second arguments for one quantifier that can overlap with the first arguments of another.
This overlap means that you are going to need to choose an order in which to project the inference relations through the quantifiers—e.g. in (3), do you first consider “every”, then “a”, or vice versa? This order will necessarily be heuristic. A reasonable order might be the reverse of the linear (or “surface”) order. Other options might be based on depth in the tree.
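The reverse-surface-order heuristic is trivial to implement once you know the quantifiers' token positions. The positions below are hypothetical, standing in for what find_quantifier_arguments would give you for (3):

```python
# Assumed heuristic: project through quantifiers in reverse surface order.
# Hypothetical token indices for the quantifiers in (3):
quantifier_indices = [0, 4]  # "every" at 0, "some" at 4

# Rightmost (most deeply embedded, in simple cases) quantifier first
processing_order = sorted(quantifier_indices, reverse=True)
```

A depth-based alternative would instead sort by each quantifier's distance from the root of the dependency tree.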
```python
class NaturalLogic(NaturalLogic):
    def __call__(self, premise: list[str], hypothesis: list[str]) -> tuple[list[InferencePath], set[Inference]]:
        raise NotImplementedError
```
Apply the same evaluation you developed for Task 3 to your new implementation of NaturalLogic.__call__.