Searching XIST trees

Searching, tree traversal, xfind and CSS selectors


The walk method

There are three related methods for iterating through an XML tree and finding nodes in it: walk, walknodes and walkpaths.

The method walk is a generator. You pass a WalkFilter to walk, which uses it to determine which parts of the tree should be searched and which nodes should be returned. The objects produced by walk are lists containing the path from the root of the tree to the node in question. (Actually it's always the same list object; if you want distinct objects, use the walkpaths method.) The method walknodes produces the nodes themselves instead of the paths to them.
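The difference between a reused path list and distinct path lists can be illustrated with a small standalone sketch. It does not use ll.xist at all: the tuple-based tree and the two generator functions below are hypothetical stand-ins that only mimic the behaviour described above.

```python
# Toy model (not the XIST implementation): a node is a (name, children) tuple.

def walk(node, path=None):
    """Yield the path to every node, always reusing the SAME list object."""
    if path is None:
        path = []
    name, children = node
    path.append(name)          # extend the shared path on the way down
    yield path
    for child in children:
        for p in walk(child, path):
            yield p
    path.pop()                 # shrink the shared path on the way up

def walkpaths(node):
    """Yield a fresh copy of the path for every node."""
    for path in walk(node):
        yield list(path)

tree = ("html", [("body", [("p", []), ("ul", [])])])

shared = list(walk(tree))       # four references to one (now empty) list
copies = list(walkpaths(tree))  # four independent snapshots
```

In this model, collecting the results of walk into a list is useless, because every entry refers to the one list that keeps changing during the traversal. Copying the path at the moment it is produced, as walkpaths does here, avoids that.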

When walk iterates through the tree it calls the walkfilter's filterpath method with a list containing the path to the node in question as the only argument. (It's also possible to implement the method filternode in your own walkfilters instead of filterpath. Instead of the complete path it only gets the node itself as an argument.) filterpath (or filternode) must return a sequence of “node handling options”. A node handling option is one of the following:

True
This tells walk to yield this node from the generator;
False
Don't yield this node from the generator;
enterattrs
This is a global constant in ll.xist.xfind and tells walk to traverse the attributes of this node (if it's an Element; otherwise this option is ignored);
entercontent
This is a global constant in ll.xist.xfind and tells walk to traverse the child nodes of this node (if it's an Element; otherwise this option is ignored).

These options will be executed in the order they are specified in the sequence, so by changing the order of the options in the sequence returned you can switch between top-down and bottom-up traversal. To get a top-down traversal of a tree and produce all table elements, the following code could be used:

from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html

class IsTable(xfind.WalkFilter):
·  def filternode(self, node):
·  ·  if isinstance(node, html.table):
·  ·  ·  return (True, xfind.entercontent)
·  ·  else:
·  ·  ·  return (xfind.entercontent,)

doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

for node in doc.walknodes(IsTable()):
·  ...
Using the walk method
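How the order of the node handling options switches between top-down and bottom-up traversal can be sketched without ll.xist. Everything below is a simplified model under stated assumptions: Node, ENTERCONTENT and this walknodes function are hypothetical stand-ins, not the library's classes or its actual traversal code.

```python
# Toy model of option-driven traversal (not the XIST implementation).

ENTERCONTENT = object()  # stand-in for the "enter content" option

class Node(object):
    def __init__(self, name, *children):
        self.name = name
        self.children = children

def walknodes(node, filternode):
    """Execute the options returned by filternode in the order given."""
    for option in filternode(node):
        if option is True:
            yield node                      # produce this node
        elif option is ENTERCONTENT:
            for child in node.children:     # descend into the children
                for found in walknodes(child, filternode):
                    yield found
        # False (or any other value) produces nothing

def topdown(node):
    return (True, ENTERCONTENT)   # node first, then its children

def bottomup(node):
    return (ENTERCONTENT, True)   # children first, then the node

tree = Node("a", Node("b", Node("c")), Node("d"))

topdown_names = [n.name for n in walknodes(tree, topdown)]
bottomup_names = [n.name for n in walknodes(tree, bottomup)]
```

With this tree, topdown_names is a, b, c, d and bottomup_names is c, b, d, a: the only difference between the two filters is the position of True relative to the "enter content" option.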

ll.xist.xfind provides several useful walkfilter classes for specifying what should be returned from walk. FindType searches only the first level of the tree and returns any node that is an instance of one of the classes passed to the constructor. So if you have an instance of ll.xist.ns.html.ul named node, you could search for all ll.xist.ns.html.li elements inside it with the following code:

for li in node.content.walknodes(xfind.FindType(html.li)):
·  ...
Searching for li inside ul with walk

FindTypeAll can be used when you want to search the complete tree. The following example extracts all the links on the Python home page:

from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html

doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

for node in doc.walknodes(xfind.FindTypeAll(html.a)):
·  print node.attrs.href
Finding all links on the Python home page

This gives the output:

http://www.python.org/
http://www.python.org/#left%2dhand%2dnavigation
http://www.python.org/#content%2dbody
http://www.python.org/search
http://www.python.org/about/
http://www.python.org/news/
...

The following example will find all external links on the Python home page:

from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html

class IsExtLink(xfind.WalkFilter):
·  def filternode(self, node):
·  ·  if isinstance(node, html.a) and not unicode(node.attrs.href).startswith(u"http://www.python.org"):
·  ·  ·  return (True, xfind.entercontent)
·  ·  return (xfind.entercontent,)

doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

for node in doc.walknodes(IsExtLink()):
·  print node.attrs.href
Finding external links on the Python home page

This gives the output:

http://docs.python.org/
http://pypi.python.org/pypi
http://wiki.python.org/moin/PythonWebsiteCreatingNewTickets
http://youtube.com/
http://wiki.python.org/moin/WebProgramming
http://wiki.python.org/moin/CgiScripts
...

xfind expressions

A second method exists for iterating through a tree: xfind expressions. xfind expressions are special walkfilters that look somewhat like XPath expressions, but are implemented as pure Python expressions (overloading various Python operators).

Every subclass of ll.xist.xsc.Node can be used as an xfind operator and combined with other xfind operators to get xfind expressions. For example searching for links that contain images works as follows:

for path in node.walk(html.a/html.img):
·  print path[-2].attrs.href, path[-1].attrs.src
Searching for img inside a with an xfind expression

The output looks like this:

http://www.python.org/ http://www.python.org/images/python-logo.gif
http://www.python.org/#left%2dhand%2dnavigation http://www.python.org/images/trans.gif
http://www.python.org/#content%2dbody http://www.python.org/images/trans.gif
http://www.python.org/about/success/usa http://www.python.org/images/success/nasa.jpg

If the img elements are not immediate children of the a elements, the xfind expression above won't output them. In this case you can use a “descendant selector” instead of a “child selector”. To do this, simply replace html.a/html.img with html.a//html.img.
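The difference between the child operator / and the descendant operator // can be modelled in a few lines of plain Python. This is only a sketch of the general operator-overloading technique: the classes Selector, IsName, Child and Descendant are hypothetical, paths are plain lists of element names, and none of this is the xfind module's actual code.

```python
# Toy selectors built with operator overloading (not the xfind implementation).

class Selector(object):
    def match(self, path):
        raise NotImplementedError

    def __truediv__(self, other):        # self / other
        return Child(self, other)
    __div__ = __truediv__                # name used for / by Python 2

    def __floordiv__(self, other):       # self // other
        return Descendant(self, other)

class IsName(Selector):
    """Matches if the last node in the path has the given name."""
    def __init__(self, name):
        self.name = name

    def match(self, path):
        return bool(path) and path[-1] == self.name

class Child(Selector):
    """right must match the path, left must match the parent's path."""
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def match(self, path):
        return self.right.match(path) and self.left.match(path[:-1])

class Descendant(Child):
    """right must match the path, left must match ANY ancestor's path."""
    def match(self, path):
        if not self.right.match(path):
            return False
        return any(self.left.match(path[:i]) for i in range(1, len(path)))

a = IsName("a")
img = IsName("img")

direct = ["html", "body", "a", "img"]          # img is a child of a
nested = ["html", "body", "a", "span", "img"]  # img is a deeper descendant

child_hits = [(a / img).match(direct), (a / img).match(nested)]
desc_hits = [(a // img).match(direct), (a // img).match(nested)]
```

Here child_hits is [True, False] and desc_hits is [True, True]: the child selector only accepts the path where img sits directly below a, while the descendant selector accepts both.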

Apart from the / and // operators you can also use the | and & operators to combine xfind expressions:

from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html

doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

for node in doc.walknodes((html.a | html.area) & xfind.hasattr("href")):
·  print node.attrs.href

Here's another example that finds all elements that have an id attribute:

from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html

doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

for node in doc.walknodes(xfind.hasattr("id")):
·  print node.attrs.id

The output looks like this:

screen-switcher-stylesheet
logoheader
logolink
logo
skiptonav
skiptocontent
...

For more examples refer to the documentation of the xfind module.

CSS selectors

It's also possible to use CSS selectors as walk filters. The module ll.xist.css provides a function selector that turns a CSS selector expression into a walk filter:

from ll.xist import xsc, parse, css
from ll.xist.ns import xml, html

doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

for node in doc.walknodes(css.selector("div#menu ul.level-one li > a")):
·  print node.attrs.href
Using CSS selectors as walk filters

This outputs all the first level links in the navigation:

http://www.python.org/about/
http://www.python.org/news/
http://www.python.org/doc/
http://www.python.org/download/
http://www.python.org/community/
http://www.python.org/psf/
http://www.python.org/dev/
http://www.python.org/links/

Most of the CSS 3 selectors are supported.

For more examples see the documentation of the css module.