from typing import TypeVarDataType = TypeVar("DataType")class Tree:"""A tree Parameters ---------- data The data contained in this tree children The subtrees of this tree """def__init__(self, data: DataType, children: list['Tree']=[]):self._data = dataself._children = childrenself._validate()def _validate(self) ->None:try:assertall(isinstance(c, Tree)for c inself._children)exceptAssertionError: msg ='all children must be trees'raiseTypeError(msg)@propertydef data(self) -> DataType:returnself._data @propertydef children(self) ->list['Tree']:returnself._childrendef__str__(self):ifself._children:return' '.join(c.__str__() for c inself._children)else:returnstr(self._data)def__repr__(self):returnself.to_string(0)def to_string(self, depth: int) ->str: s = (depth -1) *' '+\int(depth >0) *'--'+\self._data +'\n' s +=''.join(c.to_string(depth+1)for c inself._children)return sdef__contains__(self, data: DataType) ->bool:# pre-order depth-first searchifself._data == data:returnTrueelse:for child inself._children:if data in child:returnTruereturnFalse
Let’s return to the motivating example from the last section: suppose I’m interested in finding all sentences with a definite determiner in a subject.
We know how to compute containment. The next question is: how do we find particular subtrees? The basic idea is going to be that, when searching, we want to return a subtree instead of a boolean. But it’s also going to be important to know where that subtree is with respect to other subtrees–e.g. to figure out whether it is a subject. This is going to require a way of indexing trees.
Indexation by search traversal
One way to index trees is analogous to lists: an int. But what does that int represent? The idea is that it will represent when a particular search algorithm visits a node. One way to do this is to flatten or linearize the tree according to the order in which, say, pre-order depth-first search visits subtrees, then to index into the flattened version of the tree.
class Tree(Tree):def__getitem__(self, idx):returnself.flattened[idx]def__len__(self):returnlen(self.flattened)@propertydef flattened(self):try:returnself._flattenedexceptAttributeError:# pre-order depth-first searchself._flattened = [self] +\ [elem for c inself._childrenfor elem in c.flattened]returnself._flattened
0 S
1 NP
2 D
3 a
4 N
5 greyhound
6 VP
7 V
8 loves
9 NP
10 D
11 a
12 N
13 greyhound
Indexation by path to root
One issue with this indexation scheme is that it makes it a bit hard to represent relations like parenthood or sisterhood in a tree. One way to deal with this issue is to instead index using tuples representing the index path to the root.
class Tree(Tree):def__getitem__(self, idx: tuple[int]) ->'Tree': idx = (idx,) ifisinstance(idx, int) else idxtry:assertall(isinstance(i, int) for i in idx)assertall(i >=0for i in idx)exceptAssertionError: errmsg ='index must be a positive int or tuple of positive ints'raiseIndexError(errmsg)ifnot idx:returnselfeliflen(idx) ==1:returnself._children[idx[0]]else:returnself._children[idx[0]][idx[1:]]