Final Project

In the final project for the course, you will develop a package for parsing with multiple context free grammars (MCFGs). This package must implement an agenda-based parser as described in Shieber et al. 1995. You should use the inference rules laid out in Kallmeyer 2013.

In addition to a working agenda-based parser, the package must use the standard directory structure for a python package, including a full test suite implemented in pytest. If you have not written a python package before, I encourage you to read this tutorial and use the directory structure they discuss:

PACKAGE_NAME/
├── LICENSE
├── pyproject.toml
├── README.md
├── src/
│   └── PACKAGE_NAME/
│       ├── __init__.py
│       └── example.py
└── tests/

Test Grammar

In writing tests, it will be useful to have a grammar to test your parser against. Rather than have you write your own, you might find it useful to use the one below.

S(uv) -> NP(u) VP(v)
S(uv) -> NPwh(u) VP(v)
S(vuw) -> Aux(u) Swhmain(v, w)
S(uwv) -> NPdisloc(u, v) VP(w)
S(uwv) -> NPwhdisloc(u, v) VP(w)
Sbar(uv) -> C(u) S(v)
Sbarwh(v, uw) -> C(u) Swhemb(v, w)
Sbarwh(u, v) -> NPwh(u) VP(v)
Swhmain(v, uw) -> NP(u) VPwhmain(v, w)
Swhmain(w, uxv) -> NPdisloc(u, v) VPwhmain(w, x)
Swhemb(v, uw) -> NP(u) VPwhemb(v, w)
Swhemb(w, uxv) -> NPdisloc(u, v) VPwhemb(w, x)
Src(v, uw) -> NP(u) VPrc(v, w)
Src(w, uxv) -> NPdisloc(u, v) VPrc(w, x)
Src(u, v) -> N(u) VP(v)
Swhrc(u, v) -> Nwh(u) VP(v)
Swhrc(v, uw) -> NP(u) VPwhrc(v, w)
Sbarwhrc(v, uw) -> C(u) Swhrc(v, w)
VP(uv) -> Vpres(u) NP(v)
VP(uv) -> Vpres(u) Sbar(v)
VPwhmain(u, v) -> NPwh(u) Vroot(v)
VPwhmain(u, wv) -> NPwhdisloc(u, v) Vroot(w)
VPwhmain(v, uw) -> Vroot(u) Sbarwh(v, w)
VPwhemb(u, v) -> NPwh(u) Vpres(v)
VPwhemb(u, wv) -> NPwhdisloc(u, v) Vpres(w)
VPwhemb(v, uw) -> Vpres(u) Sbarwh(v, w)
VPrc(u, v) -> N(u) Vpres(v)
VPrc(v, uw) -> Vpres(u) Nrc(v, w)
VPwhrc(u, v) -> Nwh(u) Vpres(v)
VPwhrc(v, uw) -> Vpres(u) Sbarwhrc(v, w)
NP(uv) -> D(u) N(v)
NP(uvw) -> D(u) Nrc(v, w)
NPdisloc(uv, w) -> D(u) Nrc(v, w)
NPwh(uv) -> Dwh(u) N(v)
NPwh(uvw) -> Dwh(u) Nrc(v, w)
NPwhdisloc(uv, w) -> Dwh(u) Nrc(v, w)
Nrc(v, uw) -> C(u) Src(v, w)
Nrc(u, vw) -> N(u) Swhrc(v, w)
Nrc(u, vwx) -> Nrc(u, v) Swhrc(w, x)
Dwh(which)
Nwh(who)
D(the)
D(a)
N(greyhound)
N(human)
Vpres(believes)
Vroot(believe)
Aux(does)
C(that)

Whatever grammar you use, you’ll want to load this grammar as a pytest.fixture within your tests.

Rules

To get you off the ground, I’ve provided some tools for representing rules in an MCFG. Like the context free grammar Rules you worked with in Assignments 9 and 10, MCFGRules have a left side and a right side, but rather than simply being composed of a right side of type str and a left side of type tuple[str, ...], they are composed of a right side of type MCFGRuleElement and a left side of type tuple[MCFGRuleElement, ...]. The reason for this is that we need to track not only the variable—e.g. S, NP, VP, etc.—but also the string variable(s) associated with that variable.

import retype StringVariables = tuple[int, ...]class MCFGRuleElement:    """A multiple context free grammar rule element.    Parameters    ----------    variable : str        The nonterminal variable name.    string_variables : StringVariables        Variable number of string variable tuples.    Attributes    ----------    variable : str        The nonterminal variable name.    string_variables : tuple[StringVariables, ...]        The string variable tuples.    """    def __init__(self, variable: str, *string_variables: StringVariables):        self._variable = variable        self._string_variables = string_variables    def __str__(self) -> str:        strvars = ', '.join(            ''.join(str(v) for v in vtup)            for vtup in self._string_variables        )                return f"{self._variable}({strvars})"    def __eq__(self, other) -> bool:        vareq = self._variable == other._variable        strvareq = self._string_variables == other._string_variables                return vareq and strvareq            def to_tuple(self) -> tuple[str, tuple[StringVariables, ...]]:        """Convert to a hashable tuple representation.        Returns        -------        tuple[str, tuple[StringVariables, ...]]            The (variable, string_variables) tuple.        """        return (self._variable, self._string_variables)    def __hash__(self) -> int:        return hash(self.to_tuple())            @property    def variable(self) -> str:        """The nonterminal variable name."""        return self._variable    @property    def string_variables(self) -> tuple[StringVariables, ...]:        """The string variable tuples."""        return self._string_variables    @property        def unique_string_variables(self) -> set[int]:        """The unique string variable indices across all tuples."""        return {            i            for tup in self.string_variables            for i in tup        }

An MCFGRuleElement is thus basically a wrapper around a variable, represented as a str, and a (possibly singleton or empty) tuple of variables, represented as tuple[int, ...]. For instance, one rule in the grammar is VPwhemb(u, v) -> NPwh(u) Vpres(v), where u and v might be represented as tuple[int]s (0,) and (1,), respectively.

print(
    MCFGRuleElement('VPwhemb', (0,), (1,)),
    "->", 
    MCFGRuleElement('NPwh', (0,)), 
    MCFGRuleElement('Vpres', (1,))
)
VPwhemb(0, 1) -> NPwh(0) Vpres(1)

The reason string variables must be represented as tuple[int, ...]s, rather than simply int, is to ensure that we can correctly represent left sides of rules that concatenate string variables, like VPwhemb(u, wv) -> NPwhdisloc(u, v) Vpres(w).

print(
    MCFGRuleElement('VPwhemb', (0,), (2, 1)),
    "->", 
    MCFGRuleElement('NPwhdisloc', (0,), (1,)),
    MCFGRuleElement('Vpres', (2,))
)
VPwhemb(0, 21) -> NPwhdisloc(0, 1) Vpres(2)

One complexity that MCFGs add on top of CFGs is that we now need to explicitly track which string spans correspond to which variables. For instance, in who does the greyhound believe, u and v in VPwhmain(u, v) -> NPwh(u) Vroot(v) will be instantiated by who at \((0, 1)\) and believe at \((4, 5)\), respectively.

And if we want to put that VPwhmain constituent together with the greyhound at \((2, 4),\) which instantiates NP(uv) -> D(u) N(v), we need to know that the span at \((2, 4)\) instantiates uv to ensure that, e.g., the spans satisfy the constraints implied by some other rule. For instance, to know that Swhmain(v, uw) -> NP(u) VPwhmain(v, w) can be instantiated by who and the greyhound believe, we need to know that the span instantiating u, the greyound at \((2, 4)\), has a right edge matching the left edge of the span instantiating w, believe at \((4, 5)\)—though there are no similar constraints for the left edge of the span instantiating u or the left or right edge of the span instantiating v.

Because it is non-trivial to implement, I have already done the work of calculating satisfaction of those constraints in methods defined in MCFGRule below. You will write docstrings for these methods in Task 1. To keep track of this sort of information, we will use MCFGRuleElementInstances, which pair a variable with a span (represented as a tuple[int, ...]).

type SpanIndices = tuple[int, ...]class MCFGRuleElementInstance:    """An instantiated multiple context free grammar rule element.    Parameters    ----------    variable : str        The nonterminal variable name.    string_spans : SpanIndices        Variable number of span index tuples.    Attributes    ----------    variable : str        The nonterminal variable name.    string_spans : tuple[SpanIndices, ...]        The span index tuples.    """    def __init__(self, variable: str, *string_spans: SpanIndices):        self._variable = variable        self._string_spans = string_spans    def __eq__(self, other: 'MCFGRuleElementInstance') -> bool:        vareq = self._variable == other._variable        strspaneq = self._string_spans == other._string_spans                return vareq and strspaneq            def to_tuple(self) -> tuple[str, tuple[SpanIndices, ...]]:        """Convert to a hashable tuple representation.        Returns        -------        tuple[str, tuple[SpanIndices, ...]]            The (variable, string_spans) tuple.        """        return (self._variable, self._string_spans)    def __hash__(self) -> int:        return hash(self.to_tuple())    def __str__(self):        strspans = ', '.join(            str(list(stup))            for stup in self._string_spans        )                return f"{self._variable}({strspans})"    def __repr__(self) -> str:        return self.__str__()        @property    def variable(self) -> str:        """The nonterminal variable name."""        return self._variable    @property    def string_spans(self) -> tuple[SpanIndices, ...]:        """The span index tuples."""        return self._string_spans

The full MCFGRule implementation is given below.

type SpanMap = dict[int, SpanIndices]class MCFGRule:    """A linear multiple context free grammar rule.    Parameters    ----------    left_side : MCFGRuleElement        The left side of the rule.    right_side : MCFGRuleElement        Variable number of right side elements.    Attributes    ----------    left_side : MCFGRuleElement        The left side of the rule.    right_side : tuple[MCFGRuleElement, ...]        The right side elements.    """    def __init__(self, left_side: MCFGRuleElement, *right_side: MCFGRuleElement):        self._left_side = left_side        self._right_side = right_side        self._validate()    def to_tuple(self) -> tuple[MCFGRuleElement, tuple[MCFGRuleElement, ...]]:        """Convert to a hashable tuple representation.        Returns        -------        tuple[MCFGRuleElement, tuple[MCFGRuleElement, ...]]            The (left_side, right_side) tuple.        """        return (self._left_side, self._right_side)    def __hash__(self) -> int:        return hash(self.to_tuple())        def __repr__(self) -> str:        return '<Rule: '+str(self)+'>'            def __str__(self) -> str:        if self.is_epsilon:            return str(self._left_side)                        else:            return str(self._left_side) +\                ' -> ' +\                ' '.join(str(el) for el in self._right_side)    def __eq__(self, other: 'MCFGRule') -> bool:        left_side_equal = self._left_side == other._left_side        right_side_equal = self._right_side == other._right_side        return left_side_equal and right_side_equal    def _validate(self):        vs = [            el.unique_string_variables            for el in self.right_side        ]        sharing = any(            vs1.intersection(vs2)            for i, vs1 in enumerate(vs)            for j, vs2 in enumerate(vs)            if i < j        )        if sharing:            raise ValueError(                'right side variables cannot share '                'string variables'            )        if not self.is_epsilon:            left_vars = self.left_side.unique_string_variables            right_vars = {                var for el in self.right_side                for var in el.unique_string_variables            }            if left_vars != right_vars:                raise ValueError(                    'number of arguments to instantiate must '                    'be equal to number of unique string_variables'                )            @property    def left_side(self) -> MCFGRuleElement:        """The left side of the rule."""        return self._left_side    @property    def right_side(self) -> tuple[MCFGRuleElement, ...]:        """The right side elements."""        return self._right_side    @property    def is_epsilon(self) -> bool:        """Whether this is an epsilon (terminal) rule."""        return len(self._right_side) == 0    @property    def unique_variables(self) -> set[str]:        """The set of unique variable names across both sides."""        return {            el.variable            for el in [self._left_side]+list(self._right_side)        }    def instantiate_left_side(self, *right_side: MCFGRuleElementInstance) -> MCFGRuleElementInstance:        """Instantiate the left side of the rule given an instantiated right side.        Parameters        ----------        right_side : MCFGRuleElementInstance            The instantiated right side elements.        Returns        -------        MCFGRuleElementInstance            The instantiated left side element.        Raises        ------        ValueError            If spans are not adjacent as required by the rule.        """                if self.is_epsilon:            strvars = tuple(v[0] for v in self._left_side.string_variables)            strconst = tuple(el.variable for el in right_side)                        if strconst == strvars:                return MCFGRuleElementInstance(                    self._left_side.variable,                    *[s for el in right_side for s in el.string_spans]                )        new_spans = []        span_map = self._build_span_map(right_side)                for vs in self._left_side.string_variables:            for i in range(1,len(vs)):                end_prev = span_map[vs[i-1]][1]                begin_curr = span_map[vs[i]][0]                if end_prev != begin_curr:                    raise ValueError(                        f"Spans {span_map[vs[i-1]]} and {span_map[vs[i]]} "                        f"must be adjacent according to {self} but they "                        "are not."                    )                            begin_span = span_map[vs[0]][0]            end_span = span_map[vs[-1]][1]            new_spans.append((begin_span, end_span))        return MCFGRuleElementInstance(            self._left_side.variable, *new_spans        )        def _build_span_map(self, right_side: tuple[MCFGRuleElementInstance, ...]) -> SpanMap:        """Construct a mapping from string variables to string spans.        Parameters        ----------        right_side : tuple[MCFGRuleElementInstance, ...]            The instantiated right side elements.        Returns        -------        SpanMap            Mapping from string variable indices to span tuples.        Raises        ------        ValueError            If the instantiated right side does not align with the rule.        """                if self._right_side_aligns(right_side):            return {                strvar[0]: strspan                for elem, eleminst in zip(                    self._right_side,                    right_side                )                for strvar, strspan in zip(                    elem.string_variables,                    eleminst.string_spans                )            }        else:            raise ValueError(                f"Instantiated right side {right_side} do not "                f"align with rule's right side {self._right_side}"            )    def _right_side_aligns(self, right_side: tuple[MCFGRuleElementInstance, ...]) -> bool:        """Check whether the instantiated right side aligns with the rule.        Parameters        ----------        right_side : tuple[MCFGRuleElementInstance, ...]            The instantiated right side elements.        Returns        -------        bool            Whether the right side aligns.        """        if len(right_side) == len(self._right_side):            vars_match = all(                elem.variable == eleminst.variable                for elem, eleminst in zip(self._right_side, right_side)            )            strvars_match = all(                len(elem.string_variables) == len(eleminst.string_spans)                for elem, eleminst in zip(self._right_side, right_side)            )            return vars_match and strvars_match        else:            return False     @classmethod    def from_string(cls, rule_string) -> 'MCFGRule':        """Parse an MCFG rule from a string representation.        Parameters        ----------        rule_string : str            The rule string to parse, e.g. ``'A(uv) -> B(u) C(v)'``.        Returns        -------        MCFGRule            The parsed rule.        """        elem_strs = re.findall('(\w+)\(((?:\w+,? ?)+?)\)', rule_string)        elem_tuples = [(var, [v.strip()                              for v in svs.split(',')])                       for var, svs in elem_strs]        if len(elem_tuples) == 1:            return cls(MCFGRuleElement(elem_tuples[0][0],                                   tuple(w for w in elem_tuples[0][1])))        else:            strvars = [v for _, sv in elem_tuples[1:] for v in sv]            # no duplicate string variables            try:                assert len(strvars) == len(set(strvars))            except AssertionError:                msg = 'variables duplicated on right side of '+rule_string                raise ValueError(msg)                        elem_left = MCFGRuleElement(elem_tuples[0][0],                                    *[tuple([strvars.index(v)                                             for v in re.findall('('+'|'.join(strvars)+')', vs)])                                      for vs in elem_tuples[0][1]])            elems_right = [MCFGRuleElement(var, *[(strvars.index(sv),)                                              for sv in svs])                           for var, svs in elem_tuples[1:]]            return cls(elem_left, *elems_right)            def string_yield(self):        """Get the string yield of an epsilon rule.        Returns        -------        str            The variable name of the left side.        Raises        ------        ValueError            If this is not an epsilon rule.        """        if self.is_epsilon:            return self._left_side.variable        else:            raise ValueError(                'string_yield is only implemented for epsilon rules'            )

To make grammar loading easier, I have provided a class method for reading rules from strings.

rule = MCFGRule.from_string('A(w1u, x1v) -> B(w1, x1) C(u, v)')

rule
<Rule: A(02, 13) -> B(0, 1) C(2, 3)>

One very important instance method to note in this implementation is MCFGRule.instantiate_left_side. This method will be crucial in your implementations–particularly, in implementing the inference rules of the parser. Before you do anything else, you should make sure to understand how it works. An example usage can be found below.

rule.instantiate_left_side(
    MCFGRuleElementInstance("B", (1, 2), (5, 7)),
    MCFGRuleElementInstance("C", (2, 4), (7, 8))
)
A([1, 4], [5, 8])

This functionality is the main reason I’m providing you an implementation of MCFGRule in the first place: it is nontrivial but mostly because it is tedious, rather than difficult to understand conceptually. I’m more concerned with you demonstrating understanding of deductive parsing in general and agenda-based parsing specifically.

Structuring your package

I suggest structuring your package into three separate modules:

  1. grammar: classes for representing MCFGs, including the classes for representing rules given above and a class for representing a grammar as a collection of such rules that provides an interface to your agenda-based parser.
  2. parser: classes for representing your parser and any data structures that your parser manipulates.
  3. tree: a class for representing parse trees produced by your parser.

You may draw on any code I have provided you in the course. The code from Assignments 9 and 10 may be particularly useful for you.

The structure of your test suite should mirror the structure of your package–with each module having its own test module and each test module being structured analogously to the module it tests. For instance, if you are testing a class, you should have a corresponding test class in your test module that bundles together all of the tests for methods in that class.

You must have tests every class and every method within that class, even the most trivial. This requirement includes the code I give you above: you must provide tests of MCFGRuleElement, MCFGRuleElementInstance, and MCFGRule. Indeed, I would suggest that the first thing you do is to create a grammar module that contains these classes and begin writing your test suite with a test_grammar module in tests/.

Finally, all of your code must contain docstrings in numpy style for every module, every class, and every public method implemented in a class. (Ideally, you would provide docstrings for the private methods as well, but I will not require that.) I suggest you also type-hint every method.