import retype StringVariables = tuple[int, ...]class MCFGRuleElement: """A multiple context free grammar rule element. Parameters ---------- variable : str The nonterminal variable name. string_variables : StringVariables Variable number of string variable tuples. Attributes ---------- variable : str The nonterminal variable name. string_variables : tuple[StringVariables, ...] The string variable tuples. """ def __init__(self, variable: str, *string_variables: StringVariables): self._variable = variable self._string_variables = string_variables def __str__(self) -> str: strvars = ', '.join( ''.join(str(v) for v in vtup) for vtup in self._string_variables ) return f"{self._variable}({strvars})" def __eq__(self, other) -> bool: vareq = self._variable == other._variable strvareq = self._string_variables == other._string_variables return vareq and strvareq def to_tuple(self) -> tuple[str, tuple[StringVariables, ...]]: """Convert to a hashable tuple representation. Returns ------- tuple[str, tuple[StringVariables, ...]] The (variable, string_variables) tuple. """ return (self._variable, self._string_variables) def __hash__(self) -> int: return hash(self.to_tuple()) @property def variable(self) -> str: """The nonterminal variable name.""" return self._variable @property def string_variables(self) -> tuple[StringVariables, ...]: """The string variable tuples.""" return self._string_variables @property def unique_string_variables(self) -> set[int]: """The unique string variable indices across all tuples.""" return { i for tup in self.string_variables for i in tup }Final Project
In the final project for the course, you will develop a package for parsing with multiple context free grammars (MCFGs). This package must implement an agenda-based parser as described in Shieber et al. 1995. You should use the inference rules laid out in Kallmeyer 2013.
In addition to a working agenda-based parser, the package must use the standard directory structure for a python package, including a full test suite implemented in pytest. If you have not written a python package before, I encourage you to read this tutorial and use the directory structure they discuss:
PACKAGE_NAME/
├── LICENSE
├── pyproject.toml
├── README.md
├── src/
│ └── PACKAGE_NAME/
│ ├── __init__.py
│ └── example.py
└── tests/
Test Grammar
In writing tests, it will be useful to have a grammar to test your parser against. Rather than have you write your own, you might find it useful to use the one below.
S(uv) -> NP(u) VP(v)
S(uv) -> NPwh(u) VP(v)
S(vuw) -> Aux(u) Swhmain(v, w)
S(uwv) -> NPdisloc(u, v) VP(w)
S(uwv) -> NPwhdisloc(u, v) VP(w)
Sbar(uv) -> C(u) S(v)
Sbarwh(v, uw) -> C(u) Swhemb(v, w)
Sbarwh(u, v) -> NPwh(u) VP(v)
Swhmain(v, uw) -> NP(u) VPwhmain(v, w)
Swhmain(w, uxv) -> NPdisloc(u, v) VPwhmain(w, x)
Swhemb(v, uw) -> NP(u) VPwhemb(v, w)
Swhemb(w, uxv) -> NPdisloc(u, v) VPwhemb(w, x)
Src(v, uw) -> NP(u) VPrc(v, w)
Src(w, uxv) -> NPdisloc(u, v) VPrc(w, x)
Src(u, v) -> N(u) VP(v)
Swhrc(u, v) -> Nwh(u) VP(v)
Swhrc(v, uw) -> NP(u) VPwhrc(v, w)
Sbarwhrc(v, uw) -> C(u) Swhrc(v, w)
VP(uv) -> Vpres(u) NP(v)
VP(uv) -> Vpres(u) Sbar(v)
VPwhmain(u, v) -> NPwh(u) Vroot(v)
VPwhmain(u, wv) -> NPwhdisloc(u, v) Vroot(w)
VPwhmain(v, uw) -> Vroot(u) Sbarwh(v, w)
VPwhemb(u, v) -> NPwh(u) Vpres(v)
VPwhemb(u, wv) -> NPwhdisloc(u, v) Vpres(w)
VPwhemb(v, uw) -> Vpres(u) Sbarwh(v, w)
VPrc(u, v) -> N(u) Vpres(v)
VPrc(v, uw) -> Vpres(u) Nrc(v, w)
VPwhrc(u, v) -> Nwh(u) Vpres(v)
VPwhrc(v, uw) -> Vpres(u) Sbarwhrc(v, w)
NP(uv) -> D(u) N(v)
NP(uvw) -> D(u) Nrc(v, w)
NPdisloc(uv, w) -> D(u) Nrc(v, w)
NPwh(uv) -> Dwh(u) N(v)
NPwh(uvw) -> Dwh(u) Nrc(v, w)
NPwhdisloc(uv, w) -> Dwh(u) Nrc(v, w)
Nrc(v, uw) -> C(u) Src(v, w)
Nrc(u, vw) -> N(u) Swhrc(v, w)
Nrc(u, vwx) -> Nrc(u, v) Swhrc(w, x)
Dwh(which)
Nwh(who)
D(the)
D(a)
N(greyhound)
N(human)
Vpres(believes)
Vroot(believe)
Aux(does)
C(that)
Whatever grammar you use, you’ll want to load this grammar as a pytest.fixture within your tests.
Rules
To get you off the ground, I’ve provided some tools for representing rules in an MCFG. Like the context free grammar Rules you worked with in Assignments 9 and 10, MCFGRules have a left side and a right side, but rather than simply being composed of a right side of type str and a left side of type tuple[str, ...], they are composed of a right side of type MCFGRuleElement and a left side of type tuple[MCFGRuleElement, ...]. The reason for this is that we need to track not only the variable—e.g. S, NP, VP, etc.—but also the string variable(s) associated with that variable.
An MCFGRuleElement is thus basically a wrapper around a variable, represented as a str, and a (possibly singleton or empty) tuple of variables, represented as tuple[int, ...]. For instance, one rule in the grammar is VPwhemb(u, v) -> NPwh(u) Vpres(v), where u and v might be represented as tuple[int]s (0,) and (1,), respectively.
print(
MCFGRuleElement('VPwhemb', (0,), (1,)),
"->",
MCFGRuleElement('NPwh', (0,)),
MCFGRuleElement('Vpres', (1,))
)VPwhemb(0, 1) -> NPwh(0) Vpres(1)
The reason string variables must be represented as tuple[int, ...]s, rather than simply int, is to ensure that we can correctly represent left sides of rules that concatenate string variables, like VPwhemb(u, wv) -> NPwhdisloc(u, v) Vpres(w).
print(
MCFGRuleElement('VPwhemb', (0,), (2, 1)),
"->",
MCFGRuleElement('NPwhdisloc', (0,), (1,)),
MCFGRuleElement('Vpres', (2,))
)VPwhemb(0, 21) -> NPwhdisloc(0, 1) Vpres(2)
One complexity that MCFGs add on top of CFGs is that we now need to explicitly track which string spans correspond to which variables. For instance, in who does the greyhound believe, u and v in VPwhmain(u, v) -> NPwh(u) Vroot(v) will be instantiated by who at \((0, 1)\) and believe at \((4, 5)\), respectively.
And if we want to put that VPwhmain constituent together with the greyhound at \((2, 4),\) which instantiates NP(uv) -> D(u) N(v), we need to know that the span at \((2, 4)\) instantiates uv to ensure that, e.g., the spans satisfy the constraints implied by some other rule. For instance, to know that Swhmain(v, uw) -> NP(u) VPwhmain(v, w) can be instantiated by who and the greyhound believe, we need to know that the span instantiating u, the greyound at \((2, 4)\), has a right edge matching the left edge of the span instantiating w, believe at \((4, 5)\)—though there are no similar constraints for the left edge of the span instantiating u or the left or right edge of the span instantiating v.
Because it is non-trivial to implement, I have already done the work of calculating satisfaction of those constraints in methods defined in MCFGRule below. You will write docstrings for these methods in Task 1. To keep track of this sort of information, we will use MCFGRuleElementInstances, which pair a variable with a span (represented as a tuple[int, ...]).
type SpanIndices = tuple[int, ...]class MCFGRuleElementInstance: """An instantiated multiple context free grammar rule element. Parameters ---------- variable : str The nonterminal variable name. string_spans : SpanIndices Variable number of span index tuples. Attributes ---------- variable : str The nonterminal variable name. string_spans : tuple[SpanIndices, ...] The span index tuples. """ def __init__(self, variable: str, *string_spans: SpanIndices): self._variable = variable self._string_spans = string_spans def __eq__(self, other: 'MCFGRuleElementInstance') -> bool: vareq = self._variable == other._variable strspaneq = self._string_spans == other._string_spans return vareq and strspaneq def to_tuple(self) -> tuple[str, tuple[SpanIndices, ...]]: """Convert to a hashable tuple representation. Returns ------- tuple[str, tuple[SpanIndices, ...]] The (variable, string_spans) tuple. """ return (self._variable, self._string_spans) def __hash__(self) -> int: return hash(self.to_tuple()) def __str__(self): strspans = ', '.join( str(list(stup)) for stup in self._string_spans ) return f"{self._variable}({strspans})" def __repr__(self) -> str: return self.__str__() @property def variable(self) -> str: """The nonterminal variable name.""" return self._variable @property def string_spans(self) -> tuple[SpanIndices, ...]: """The span index tuples.""" return self._string_spansThe full MCFGRule implementation is given below.
type SpanMap = dict[int, SpanIndices]class MCFGRule: """A linear multiple context free grammar rule. Parameters ---------- left_side : MCFGRuleElement The left side of the rule. right_side : MCFGRuleElement Variable number of right side elements. Attributes ---------- left_side : MCFGRuleElement The left side of the rule. right_side : tuple[MCFGRuleElement, ...] The right side elements. """ def __init__(self, left_side: MCFGRuleElement, *right_side: MCFGRuleElement): self._left_side = left_side self._right_side = right_side self._validate() def to_tuple(self) -> tuple[MCFGRuleElement, tuple[MCFGRuleElement, ...]]: """Convert to a hashable tuple representation. Returns ------- tuple[MCFGRuleElement, tuple[MCFGRuleElement, ...]] The (left_side, right_side) tuple. """ return (self._left_side, self._right_side) def __hash__(self) -> int: return hash(self.to_tuple()) def __repr__(self) -> str: return '<Rule: '+str(self)+'>' def __str__(self) -> str: if self.is_epsilon: return str(self._left_side) else: return str(self._left_side) +\ ' -> ' +\ ' '.join(str(el) for el in self._right_side) def __eq__(self, other: 'MCFGRule') -> bool: left_side_equal = self._left_side == other._left_side right_side_equal = self._right_side == other._right_side return left_side_equal and right_side_equal def _validate(self): vs = [ el.unique_string_variables for el in self.right_side ] sharing = any( vs1.intersection(vs2) for i, vs1 in enumerate(vs) for j, vs2 in enumerate(vs) if i < j ) if sharing: raise ValueError( 'right side variables cannot share ' 'string variables' ) if not self.is_epsilon: left_vars = self.left_side.unique_string_variables right_vars = { var for el in self.right_side for var in el.unique_string_variables } if left_vars != right_vars: raise ValueError( 'number of arguments to instantiate must ' 'be equal to number of unique string_variables' ) @property def left_side(self) -> MCFGRuleElement: """The left side of the rule.""" return self._left_side @property def right_side(self) -> tuple[MCFGRuleElement, ...]: """The right side elements.""" return self._right_side @property def is_epsilon(self) -> bool: """Whether this is an epsilon (terminal) rule.""" return len(self._right_side) == 0 @property def unique_variables(self) -> set[str]: """The set of unique variable names across both sides.""" return { el.variable for el in [self._left_side]+list(self._right_side) } def instantiate_left_side(self, *right_side: MCFGRuleElementInstance) -> MCFGRuleElementInstance: """Instantiate the left side of the rule given an instantiated right side. Parameters ---------- right_side : MCFGRuleElementInstance The instantiated right side elements. Returns ------- MCFGRuleElementInstance The instantiated left side element. Raises ------ ValueError If spans are not adjacent as required by the rule. """ if self.is_epsilon: strvars = tuple(v[0] for v in self._left_side.string_variables) strconst = tuple(el.variable for el in right_side) if strconst == strvars: return MCFGRuleElementInstance( self._left_side.variable, *[s for el in right_side for s in el.string_spans] ) new_spans = [] span_map = self._build_span_map(right_side) for vs in self._left_side.string_variables: for i in range(1,len(vs)): end_prev = span_map[vs[i-1]][1] begin_curr = span_map[vs[i]][0] if end_prev != begin_curr: raise ValueError( f"Spans {span_map[vs[i-1]]} and {span_map[vs[i]]} " f"must be adjacent according to {self} but they " "are not." ) begin_span = span_map[vs[0]][0] end_span = span_map[vs[-1]][1] new_spans.append((begin_span, end_span)) return MCFGRuleElementInstance( self._left_side.variable, *new_spans ) def _build_span_map(self, right_side: tuple[MCFGRuleElementInstance, ...]) -> SpanMap: """Construct a mapping from string variables to string spans. Parameters ---------- right_side : tuple[MCFGRuleElementInstance, ...] The instantiated right side elements. Returns ------- SpanMap Mapping from string variable indices to span tuples. Raises ------ ValueError If the instantiated right side does not align with the rule. """ if self._right_side_aligns(right_side): return { strvar[0]: strspan for elem, eleminst in zip( self._right_side, right_side ) for strvar, strspan in zip( elem.string_variables, eleminst.string_spans ) } else: raise ValueError( f"Instantiated right side {right_side} do not " f"align with rule's right side {self._right_side}" ) def _right_side_aligns(self, right_side: tuple[MCFGRuleElementInstance, ...]) -> bool: """Check whether the instantiated right side aligns with the rule. Parameters ---------- right_side : tuple[MCFGRuleElementInstance, ...] The instantiated right side elements. Returns ------- bool Whether the right side aligns. """ if len(right_side) == len(self._right_side): vars_match = all( elem.variable == eleminst.variable for elem, eleminst in zip(self._right_side, right_side) ) strvars_match = all( len(elem.string_variables) == len(eleminst.string_spans) for elem, eleminst in zip(self._right_side, right_side) ) return vars_match and strvars_match else: return False @classmethod def from_string(cls, rule_string) -> 'MCFGRule': """Parse an MCFG rule from a string representation. Parameters ---------- rule_string : str The rule string to parse, e.g. ``'A(uv) -> B(u) C(v)'``. Returns ------- MCFGRule The parsed rule. """ elem_strs = re.findall('(\w+)\(((?:\w+,? ?)+?)\)', rule_string) elem_tuples = [(var, [v.strip() for v in svs.split(',')]) for var, svs in elem_strs] if len(elem_tuples) == 1: return cls(MCFGRuleElement(elem_tuples[0][0], tuple(w for w in elem_tuples[0][1]))) else: strvars = [v for _, sv in elem_tuples[1:] for v in sv] # no duplicate string variables try: assert len(strvars) == len(set(strvars)) except AssertionError: msg = 'variables duplicated on right side of '+rule_string raise ValueError(msg) elem_left = MCFGRuleElement(elem_tuples[0][0], *[tuple([strvars.index(v) for v in re.findall('('+'|'.join(strvars)+')', vs)]) for vs in elem_tuples[0][1]]) elems_right = [MCFGRuleElement(var, *[(strvars.index(sv),) for sv in svs]) for var, svs in elem_tuples[1:]] return cls(elem_left, *elems_right) def string_yield(self): """Get the string yield of an epsilon rule. Returns ------- str The variable name of the left side. Raises ------ ValueError If this is not an epsilon rule. """ if self.is_epsilon: return self._left_side.variable else: raise ValueError( 'string_yield is only implemented for epsilon rules' )To make grammar loading easier, I have provided a class method for reading rules from strings.
rule = MCFGRule.from_string('A(w1u, x1v) -> B(w1, x1) C(u, v)')
rule<Rule: A(02, 13) -> B(0, 1) C(2, 3)>
One very important instance method to note in this implementation is MCFGRule.instantiate_left_side. This method will be crucial in your implementations–particularly, in implementing the inference rules of the parser. Before you do anything else, you should make sure to understand how it works. An example usage can be found below.
rule.instantiate_left_side(
MCFGRuleElementInstance("B", (1, 2), (5, 7)),
MCFGRuleElementInstance("C", (2, 4), (7, 8))
)A([1, 4], [5, 8])
This functionality is the main reason I’m providing you an implementation of MCFGRule in the first place: it is nontrivial but mostly because it is tedious, rather than difficult to understand conceptually. I’m more concerned with you demonstrating understanding of deductive parsing in general and agenda-based parsing specifically.
Structuring your package
I suggest structuring your package into three separate modules:
grammar: classes for representing MCFGs, including the classes for representing rules given above and a class for representing a grammar as a collection of such rules that provides an interface to your agenda-based parser.parser: classes for representing your parser and any data structures that your parser manipulates.tree: a class for representing parse trees produced by your parser.
You may draw on any code I have provided you in the course. The code from Assignments 9 and 10 may be particularly useful for you.
The structure of your test suite should mirror the structure of your package–with each module having its own test module and each test module being structured analogously to the module it tests. For instance, if you are testing a class, you should have a corresponding test class in your test module that bundles together all of the tests for methods in that class.
You must have tests every class and every method within that class, even the most trivial. This requirement includes the code I give you above: you must provide tests of MCFGRuleElement, MCFGRuleElementInstance, and MCFGRule. Indeed, I would suggest that the first thing you do is to create a grammar module that contains these classes and begin writing your test suite with a test_grammar module in tests/.
Finally, all of your code must contain docstrings in numpy style for every module, every class, and every public method implemented in a class. (Ideally, you would provide docstrings for the private methods as well, but I will not require that.) I suggest you also type-hint every method.