We’re going to be working through how to load a treebank into memory, and the first thing we need to know is how to deal with the objects contained in a treebank: trees. To structure this discussion, we’ll use a motivating example: suppose I’m interested in finding all sentences with a definite determiner in a subject.
Initial design
The first question we need to ask is: what are trees in the abstract?
An initial approximation is that a tree is something that is…
…empty (base case)
…a nonempty sequence of trees
class Tree:"""A tree Parameters ---------- children The subtrees of this tree """def__init__(self, children: list['Tree']=[]):self._children = childrenTree([Tree(), Tree()])
<__main__.Tree at 0x1138ed510>
One problem is that these sorts of abstract trees aren’t super useful. So we can augment our definition.
A tree is something that is…
…empty (base case)
…a piece of data paired with a nonempty sequence of trees
from typing import TypeVarDataType = TypeVar("DataType")class Tree:"""A tree Parameters ---------- data The data contained in this tree children The subtrees of this tree """def__init__(self, data: DataType, children: list['Tree']=[]):self._data = dataself._children = children
By convention, we shouldn’t access the private attributes _data and _children, so a common thing to do is to build read-only accessors using the @property decorators.
class Tree(Tree):@propertydef data(self) -> DataType:returnself._data @propertydef children(self) ->list['Tree']:returnself._children
t = Tree('S', [Tree('NP', ['the', Tree('children')]), Tree('VP')])t.children[0].data
'NP'
Our class doesn’t currently enforce that the children be Trees. To enforce this, we can build a validator private method into the intialization.
class Tree:"""A tree Parameters ---------- data The data contained in this tree children The subtrees of this tree """def__init__(self, data: DataType, children: list['Tree']=[]):self._data = dataself._children = childrenself._validate()def _validate(self) ->None:try:assertall(isinstance(c, Tree)for c inself._children)exceptAssertionError: msg ='all children must be trees'raiseTypeError(msg)
If we try to look at the tree, the result isn’t very informative.
tree1
<__main__.Tree at 0x7f81e0d481c0>
This is because we need to tell python how to display objects of our class. There are two obvious things to do: print the yield of the tree or print some representation of the tree itself. We implement both using the __str__ (what is shown when we call print()) and __repr__ (what is shown when we evaluate) magic methods.
We’ll have __str__ return the yield and __repr__ return the tree representation.
class Tree(Tree):def__str__(self):ifself._children:return' '.join(c.__str__() for c inself._children)else:returnstr(self._data)def__repr__(self):returnself.to_string(0)def to_string(self, depth: int) ->str: s = (depth -1) *' '+\int(depth >0) *'--'+\self._data +'\n' s +=''.join(c.to_string(depth+1)for c inself._children)return s
So if we print a Tree, we get the sentence it corresponds to.