import re
= tuple[int, ...]
StringVariables
class MCFGRuleElement:
"""A multiple context free grammar rule element
Parameters
----------
variable
string_variables
Attributes
----------
symbol
string_variables
"""
def __init__(self, variable: str, *string_variables: StringVariables):
self._variable = variable
self._string_variables = string_variables
def __str__(self) -> str:
= ', '.join(
strvars ''.join(str(v) for v in vtup)
for vtup in self._string_variables
)
return f"{self._variable}({strvars})"
def __eq__(self, other) -> bool:
= self._variable == other._variable
vareq = self._string_variables == other._string_variables
strvareq
return vareq and strvareq
def to_tuple(self) -> tuple[str, tuple[StringVariables, ...]]:
return (self._variable, self._string_variables)
def __hash__(self) -> int:
return hash(self.to_tuple())
@property
def variable(self) -> str:
return self._variable
@property
def string_variables(self) -> tuple[StringVariables, ...]:
return self._string_variables
@property
def unique_string_variables(self) -> set[int]:
return {
ifor tup in self.string_variables
for i in tup
}
Final Project
In the final project for the course, you will develop a package for parsing with multiple context free grammars (MCFGs). This package must implement an agenda-based parser as described in Shieber et al. 1995. You should use the inference rules laid out in Kallmeyer 2013.
In addition to a working agenda-based parser, the package must use the standard directory structure for a python package, including a full test suite implemented in pytest
. If you have not written a python package before, I encourage you to read this tutorial and use the directory structure they discuss:
PACKAGE_NAME/
├── LICENSE
├── pyproject.toml
├── README.md
├── src/
│ └── PACKAGE_NAME/
│ ├── __init__.py
│ └── example.py
└── tests/
Test Grammar
In writing tests, it will be useful to have a grammar to test your parser against. Rather than have you write your own, you might find it useful to use the one below.
S(uv) -> NP(u) VP(v)
S(uv) -> NPwh(u) VP(v)
S(vuw) -> Aux(u) Swhmain(v, w)
S(uwv) -> NPdisloc(u, v) VP(w)
S(uwv) -> NPwhdisloc(u, v) VP(w)
Sbar(uv) -> C(u) S(v)
Sbarwh(v, uw) -> C(u) Swhemb(v, w)
Sbarwh(u, v) -> NPwh(u) VP(v)
Swhmain(v, uw) -> NP(u) VPwhmain(v, w)
Swhmain(w, uxv) -> NPdisloc(u, v) VPwhmain(w, x)
Swhemb(v, uw) -> NP(u) VPwhemb(v, w)
Swhemb(w, uxv) -> NPdisloc(u, v) VPwhemb(w, x)
Src(v, uw) -> NP(u) VPrc(v, w)
Src(w, uxv) -> NPdisloc(u, v) VPrc(w, x)
Src(u, v) -> N(u) VP(v)
Swhrc(u, v) -> Nwh(u) VP(v)
Swhrc(v, uw) -> NP(u) VPwhrc(v, w)
Sbarwhrc(v, uw) -> C(u) Swhrc(v, w)
VP(uv) -> Vpres(u) NP(v)
VP(uv) -> Vpres(u) Sbar(v)
VPwhmain(u, v) -> NPwh(u) Vroot(v)
VPwhmain(u, wv) -> NPwhdisloc(u, v) Vroot(w)
VPwhmain(v, uw) -> Vroot(u) Sbarwh(v, w)
VPwhemb(u, v) -> NPwh(u) Vpres(v)
VPwhemb(u, wv) -> NPwhdisloc(u, v) Vpres(w)
VPwhemb(v, uw) -> Vpres(u) Sbarwh(v, w)
VPrc(u, v) -> N(u) Vpres(v)
VPrc(v, uw) -> Vpres(u) Nrc(v, w)
VPwhrc(u, v) -> Nwh(u) Vpres(v)
VPwhrc(v, uw) -> Vpres(u) Sbarwhrc(v, w)
NP(uv) -> D(u) N(v)
NP(uvw) -> D(u) Nrc(v, w)
NPdisloc(uv, w) -> D(u) Nrc(v, w)
NPwh(uv) -> Dwh(u) N(v)
NPwh(uvw) -> Dwh(u) Nrc(v, w)
NPwhdisloc(uv, w) -> Dwh(u) Nrc(v, w)
Nrc(v, uw) -> C(u) Src(v, w)
Nrc(u, vw) -> N(u) Swhrc(v, w)
Nrc(u, vwx) -> Nrc(u, v) Swhrc(w, x)
Dwh(which)
Nwh(who)
D(the)
D(a)
N(greyhound)
N(human)
Vpres(believes)
Vroot(believe)
Aux(does)
C(that)
Whatever grammar you use, you’ll want to load this grammar as a pytest.fixture
within your tests.
Rules
To get you off the ground, I’ve provided some tools for representing rules in an MCFG. Like the context free grammar Rule
s you worked with in Assignments 9 and 10, MCFGRule
s have a left side and a right side, but rather than simply being composed of a right side of type str
and a left side of type tuple[str, ...]
, they are composed of a right side of type MCFGRuleElement
and a left side of type tuple[MCFGRuleElement, ...]
. The reason for this is that we need to track not only the variable—e.g. S
, NP
, VP
, etc.—but also the string variable(s) associated with that variable.
An MCFGRuleElement
is thus basically a wrapper around a variable, represented as a str
, and a (possibly singleton or empty) tuple of variables, represented as tuple[int, ...]
. For instance, one rule in the grammar is VPwhemb(u, v) -> NPwh(u) Vpres(v)
, where u
and v
might be represented as Tuple[int]
s (0,)
and (1,)
, respectively.
print(
'VPwhemb', (0,), (1,)),
MCFGRuleElement("->",
'NPwh', (0,)),
MCFGRuleElement('Vpres', (1,))
MCFGRuleElement( )
VPwhemb(0, 1) -> NPwh(0) Vpres(1)
The reason string variables must be represented as tuple[int, ...]
s, rather than simply int
, is to ensure that we can correctly represent left sides of rules that concatenate string variables, like VPwhemb(u, wv) -> NPwhdisloc(u, v) Vpres(w)
.
print(
'VPwhemb', (0,), (2, 1)),
MCFGRuleElement("->",
'NPwhdisloc', (0,), (1,)),
MCFGRuleElement('Vpres', (2,))
MCFGRuleElement( )
VPwhemb(0, 21) -> NPwhdisloc(0, 1) Vpres(2)
One complexity that MCFGs add on top of CFGs is that we now need to explicitly track which string spans correspond to which variables. For instance, in who does the greyhound believe, u
and v
in VPwhmain(u, v) -> NPwh(u) Vroot(v)
will be instantiated by who at \((0, 1)\) and believe at \((4, 5)\), respectively.
And if we want to put that VPwhmain
constituent together with the greyhound at \((2, 4),\) which instantiates NP(uv) -> D(u) N(v)
, we need to know that the span at \((2, 4)\) instantiates uv
to ensure that, e.g., the spans satisfy the constraints implied by some other rule. For instance, to know that Swhmain(v, uw) -> NP(u) VPwhmain(v, w)
can be instantiated by who and the greyhound believe, we need to know that the span instantiating u
, the greyound at \((2, 4)\), has a right edge matching the left edge of the span instantiating w
, believe at \((4, 5)\)—though there are no similar constraints for the left edge of the span instantiating u
or the left or right edge of the span instantiating v
.
Because it is non-trivial to implement, I have already done the work of calculating satisfaction of those constraints in methods defined in MCFGRule
below. You will write docstrings for these methods in Task 1. To keep track of this sort of information, we will use MCFGRuleElementInstance
s, which pair a variable with a span (represented as a tuple[int, ...]
).
= tuple[int, ...]
SpanIndices
class MCFGRuleElementInstance:
"""An instantiated multiple context free grammar rule element
Parameters
----------
symbol
string_spans
Attributes
----------
symbol
string_spans
"""
def __init__(self, variable: str, *string_spans: SpanIndices):
self._variable = variable
self._string_spans = string_spans
def __eq__(self, other: 'MCFGRuleElementInstance') -> bool:
= self._variable == other._variable
vareq = self._string_spans == other._string_spans
strspaneq
return vareq and strspaneq
def to_tuple(self) -> tuple[str, tuple[SpanIndices, ...]]:
return (self._variable, self._string_spans)
def __hash__(self) -> int:
return hash(self.to_tuple())
def __str__(self):
= ', '.join(
strspans str(list(stup))
for stup in self._string_spans
)
return f"{self._variable}({strspans})"
def __repr__(self) -> str:
return self.__str__()
@property
def variable(self) -> str:
return self._variable
@property
def string_spans(self) -> tuple[SpanIndices, ...]:
return self._string_spans
The full MCFGRule
implementation is given below.
= dict[int, SpanIndices]
SpanMap
class MCFGRule:
"""A linear multiple context free grammar rule
Parameters
----------
left_side
right_side
Attributes
----------
left_side
right_side
"""
def __init__(self, left_side: MCFGRuleElement, *right_side: MCFGRuleElement):
self._left_side = left_side
self._right_side = right_side
self._validate()
def to_tuple(self) -> tuple[MCFGRuleElement, tuple[MCFGRuleElement, ...]]:
return (self._left_side, self._right_side)
def __hash__(self) -> int:
return hash(self.to_tuple())
def __repr__(self) -> str:
return '<Rule: '+str(self)+'>'
def __str__(self) -> str:
if self.is_epsilon:
return str(self._left_side)
else:
return str(self._left_side) +\
' -> ' +\
' '.join(str(el) for el in self._right_side)
def __eq__(self, other: 'MCFGRule') -> bool:
= self._left_side == other._left_side
left_side_equal = self._right_side == other._right_side
right_side_equal
return left_side_equal and right_side_equal
def _validate(self):
= [
vs
el.unique_string_variablesfor el in self.right_side
]= any(
sharing
vs1.intersection(vs2)for i, vs1 in enumerate(vs)
for j, vs2 in enumerate(vs)
if i < j
)
if sharing:
raise ValueError(
'right side variables cannot share '
'string variables'
)
if not self.is_epsilon:
= self.left_side.unique_string_variables
left_vars = {
right_vars for el in self.right_side
var for var in el.unique_string_variables
}if left_vars != right_vars:
raise ValueError(
'number of arguments to instantiate must '
'be equal to number of unique string_variables'
)
@property
def left_side(self) -> MCFGRuleElement:
return self._left_side
@property
def right_side(self) -> tuple[MCFGRuleElement, ...]:
return self._right_side
@property
def is_epsilon(self) -> bool:
return len(self._right_side) == 0
@property
def unique_variables(self) -> set[str]:
return {
el.variablefor el in [self._left_side]+list(self._right_side)
}
def instantiate_left_side(self, *right_side: MCFGRuleElementInstance) -> MCFGRuleElementInstance:
"""Instantiate the left side of the rule given an instantiated right side
Parameters
----------
right_side
The instantiated right side of the rule.
"""
if self.is_epsilon:
= tuple(v[0] for v in self._left_side.string_variables)
strvars = tuple(el.variable for el in right_side)
strconst
if strconst == strvars:
return MCFGRuleElementInstance(
self._left_side.variable,
*[s for el in right_side for s in el.string_spans]
)
= []
new_spans = self._build_span_map(right_side)
span_map
for vs in self._left_side.string_variables:
for i in range(1,len(vs)):
= span_map[vs[i-1]][1]
end_prev = span_map[vs[i]][0]
begin_curr
if end_prev != begin_curr:
raise ValueError(
f"Spans {span_map[vs[i-1]]} and {span_map[vs[i]]} "
f"must be adjacent according to {self} but they "
"are not."
)
= span_map[vs[0]][0]
begin_span = span_map[vs[-1]][1]
end_span
new_spans.append((begin_span, end_span))
return MCFGRuleElementInstance(
self._left_side.variable, *new_spans
)
def _build_span_map(self, right_side: tuple[MCFGRuleElementInstance, ...]) -> SpanMap:
"""Construct a mapping from string variables to string spans"""
if self._right_side_aligns(right_side):
return {
0]: strspan
strvar[for elem, eleminst in zip(
self._right_side,
right_side
)for strvar, strspan in zip(
elem.string_variables,
eleminst.string_spans
)
}else:
raise ValueError(
f"Instantiated right side {right_side} do not "
f"align with rule's right side {self._right_side}"
)
def _right_side_aligns(self, right_side: tuple[MCFGRuleElementInstance, ...]) -> bool:
"""Check whether the right side aligns"""
if len(right_side) == len(self._right_side):
= all(
vars_match == eleminst.variable
elem.variable for elem, eleminst in zip(self._right_side, right_side)
)= all(
strvars_match len(elem.string_variables) == len(eleminst.string_spans)
for elem, eleminst in zip(self._right_side, right_side)
)
return vars_match and strvars_match
else:
return False
@classmethod
def from_string(cls, rule_string) -> 'MCFGRule':
= re.findall('(\w+)\(((?:\w+,? ?)+?)\)', rule_string)
elem_strs
= [(var, [v.strip()
elem_tuples for v in svs.split(',')])
for var, svs in elem_strs]
if len(elem_tuples) == 1:
return cls(MCFGRuleElement(elem_tuples[0][0],
tuple(w for w in elem_tuples[0][1])))
else:
= [v for _, sv in elem_tuples[1:] for v in sv]
strvars
# no duplicate string variables
try:
assert len(strvars) == len(set(strvars))
except AssertionError:
= 'variables duplicated on right side of '+rule_string
msg raise ValueError(msg)
= MCFGRuleElement(elem_tuples[0][0],
elem_left *[tuple([strvars.index(v)
for v in re.findall('('+'|'.join(strvars)+')', vs)])
for vs in elem_tuples[0][1]])
= [MCFGRuleElement(var, *[(strvars.index(sv),)
elems_right for sv in svs])
for var, svs in elem_tuples[1:]]
return cls(elem_left, *elems_right)
def string_yield(self):
if self.is_epsilon:
return self._left_side.variable
else:
raise ValueError(
'string_yield is only implemented for epsilon rules'
)
To make grammar loading easier, I have provided a class method for reading rules from strings.
= MCFGRule.from_string('A(w1u, x1v) -> B(w1, x1) C(u, v)')
rule
rule
<Rule: A(02, 13) -> B(0, 1) C(2, 3)>
One very important instance method to note in this implementation is MCFGRule.instantiate_left_side
. This method will be crucial in your implementations–particularly, in implementing the inference rules of the parser. Before you do anything else, you should make sure to understand how it works. An example usage can be found below.
rule.instantiate_left_side("B", (1, 2), (5, 7)),
MCFGRuleElementInstance("C", (2, 4), (7, 8))
MCFGRuleElementInstance( )
A([1, 4], [5, 8])
This functionality is the main reason I’m providing you an implementation of MCFGRule
in the first place: it is nontrivial but mostly because it is tedious, rather than difficult to understand conceptually. I’m more concerned with you demonstrating understanding of deductive parsing in general and agenda-based parsing specifically.
Structuring your package
I suggest structuring your package into three separate modules:
grammar
: classes for representing MCFGs, including the classes for representing rules given above and a class for representing a grammar as a collection of such rules that provides an interface to your agenda-based parser.parser
: classes for representing your parser and any data structures that your parser manipulates.tree
: a class for representing parse trees produced by your parser.
You may draw on any code I have provided you in the course. The code from Assignments 9 and 10 may be particularly useful for you.
The structure of your test suite should mirror the structure of your package–with each module having its own test module and each test module being structured analogously to the module it tests. For instance, if you are testing a class, you should have a corresponding test class in your test module that bundles together all of the tests for methods in that class.
You must have tests every class and every method within that class, even the most trivial. This requirement includes the code I give you above: you must provide tests of MCFGRuleElement
, MCFGRuleElementInstance
, and MCFGRule
. Indeed, I would suggest that the first thing you do is to create a grammar
module that contains these classes and begin writing your test suite with a test_grammar
module in tests/
.
Finally, all of your code must contain docstrings in numpy
style for every module, every class, and every public method implemented in a class. (Ideally, you would provide docstrings for the private methods as well, but I will not require that.) I suggest you also type-hint every method.