from typing import TypeVarDataType = TypeVar("DataType")class Tree:"""A tree Parameters ---------- data The data contained in this tree children The subtrees of this tree """def__init__(self, data: DataType, children: list['Tree']=[]):self._data = dataself._children = childrenself._validate()def _validate(self) ->None:try:assertall(isinstance(c, Tree)for c inself._children)exceptAssertionError: msg ='all children must be trees'raiseTypeError(msg)@propertydef data(self) -> DataType:returnself._data @propertydef children(self) ->list['Tree']:returnself._childrendef__str__(self):ifself._children:return' '.join(c.__str__() for c inself._children)else:returnstr(self._data)def__repr__(self):returnself.to_string(0)def to_string(self, depth: int) ->str: s = (depth -1) *' '+\int(depth >0) *'--'+\self._data +'\n' s +=''.join(c.to_string(depth+1)for c inself._children)return sdef__contains__(self, data: DataType) ->bool:# pre-order depth-first searchifself._data == data:returnTrueelse:for child inself._children:if data in child:returnTruereturnFalsedef__getitem__(self, idx: tuple[int]) ->'Tree': idx = (idx,) ifisinstance(idx, int) else idxtry:assertall(isinstance(i, int) for i in idx)assertall(i >=0for i in idx)exceptAssertionError: errmsg ='index must be a positive int or tuple of positive ints'raiseIndexError(errmsg)ifnot idx:returnselfeliflen(idx) ==1:returnself._children[idx[0]]else:returnself._children[idx[0]][idx[1:]]
We can get from indices to trees, but how would we go from data to indices? Similar to a list, we can implement an index() method.
class Tree(Tree):def index(self, data, index_path=tuple()): indices = [index_path] ifself._data==data else [] root_path = [] if index_path ==-1else index_path indices += [j for i, c inenumerate(self._children) for j in c.index(data, root_path+(i,))]return indices
class Tree(Tree):def find(self, pattern: 'Tree', subtree_idx: tuple=tuple()) ->list[tuple]:'''The subtrees matching the pattern Parameters ---------- pattern the tree pattern to match against subtree_idx the index of the subtree within the tree pattern to return defaults to the entire match '''#raise NotImplementedError match_indices = [i + subtree_idxfor i inself.index(pattern.data) ifself[i].match(pattern)]return match_indicesdef match(self, pattern: 'Tree') ->bool:ifself._data != pattern.data:returnFalsefor child1, child2 inzip(self._children, pattern.children):ifnot child1.match(child2):returnFalsereturnTrue
This sort of treelet-based matching is somewhat weak as it stands. What if we wanted:
…nodes to be allowed to have some value from a set?
…arbitrary distance between the nodes we are matching on?
…arbitrary boolean conditions on node matches?
Expanding pattern-based search with SPARQL
To handle this, we need both a domain-specific language (DSL) for specifying such queries and an interpeter for that language. We can use SPARQL for our DSL. To intepret SPARQL, we will use the existing interpreter in rdflib.
To use rdflib’s interpreter, we need to map our Tree objects into an in-memory format for which a SPARQL interpreter is already implemented. We will use Resource Description Format as implemented in rdflib.
from rdflib import Graph, URIRefclass Tree(Tree): RDF_TYPES = {} RDF_EDGES = {'is': URIRef('is-a'),'parent': URIRef('is-the-parent-of'),'child': URIRef('is-a-child-of'),'sister': URIRef('is-a-sister-of')}def to_rdf(self, graph=None, nodes={}, idx=tuple()) -> Graph: graph = Graph() if graph isNoneelse graph idxstr ='_'.join(str(i) for i in idx) nodes[idx] = URIRef(idxstr)ifself._data notin Tree.RDF_TYPES: Tree.RDF_TYPES[self._data] = URIRef(self._data) typetriple = (nodes[idx], Tree.RDF_EDGES['is'], Tree.RDF_TYPES[self.data]) graph.add(typetriple)for i, child inenumerate(self._children): childidx = idx+(i,) child.to_rdf(graph, nodes, childidx) partriple = (nodes[idx], Tree.RDF_EDGES['parent'], nodes[childidx]) chitriple = (nodes[childidx], Tree.RDF_EDGES['child'], nodes[idx]) graph.add(partriple) graph.add(chitriple)for i, child1 inenumerate(self._children):for j, child2 inenumerate(self._children): child1idx = idx+(i,) child2idx = idx+(j,) sistriple = (nodes[child1idx], Tree.RDF_EDGES['sister'], nodes[child2idx]) graph.add(sistriple)self._rdf_nodes = nodesreturn graph@propertydef rdf(self) -> Graph:ifnothasattr(self, "_rdf"):self._rdf =self.to_rdf()returnself._rdfdef find(self, query: str) ->list[tuple[int]]:return [tuple([int(i) for i instr(res[0]).split('_')]) for res inself.rdf.query(query)]