The walk method
There are three related methods available for iterating through an XML
tree and finding nodes in the tree: walk,
walknodes and walkpaths.
The method walk
is a generator. You pass a WalkFilter
to walk, which uses it to determine which parts of the tree
should be searched and which nodes should be returned. The objects produced by
the walk method are lists containing the path from the root of the
tree to the node in question. (Actually it's always the same list object; if you
want distinct objects, use the walkpaths method.) The method
walknodes produces the nodes instead of the paths to the
nodes.
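The remark about walk reusing one list object matters as soon as you store the results. The following is a minimal, library-free sketch of that behaviour; toy_walk and the nested-tuple tree are hypothetical stand-ins and not XIST code. It shows a generator that mutates and re-yields a single path list, and the copying needed to keep each path:

```python
def toy_walk(tree, path=None):
    """Yield the path to every node, reusing ONE list object for all paths."""
    if path is None:
        path = []
    name, children = tree
    path.append(name)          # extend the shared path on the way down
    yield path
    for child in children:
        for p in toy_walk(child, path):
            yield p
    path.pop()                 # shrink the shared path on the way back up


# Hypothetical toy tree: (name, [children])
tree = ("html", [("head", []), ("body", [("p", [])])])

# Storing the yielded objects stores the same list four times.
stored = list(toy_walk(tree))

# Copying each path while iterating gives distinct snapshots.
copied = [list(p) for p in toy_walk(tree)]
```

In this sketch every entry of stored is the same (by then empty) list, while copied holds the four distinct paths. This is the situation the walkpaths method is described as avoiding.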
When walk iterates through the tree it calls the
walkfilter's filterpath method with a list containing the path
to the node in question as the only argument. (It's also possible to implement
the method filternode in your own walkfilters instead of
filterpath. Instead of the complete path it only gets the node
itself as an argument.) filterpath (or filternode)
must return a sequence of “node handling options”. A node handling option
is one of the following:
True: This tells walk to yield this node from the generator;
False: Don't yield this node from the generator;
enterattrs: This is a global constant in ll.xist.xsc and tells walk to traverse the attributes of this node (if it's an Element, otherwise this option will be ignored);
entercontent: This is a global constant in ll.xist.xsc and tells walk to traverse the child nodes of this node (if it's an Element, otherwise this option will be ignored).
These options will be executed in the order they are specified in the sequence, so by changing the order of the options in the sequence returned you can switch between top-down and bottom-up traversal. To get a top-down traversal of a tree and produce all table elements, the following code could be used:
from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html

class IsTable(xfind.WalkFilter):
    def filternode(self, node):
        if isinstance(node, html.table):
            # Yield the table first, then descend into it (top-down)
            return (True, xfind.entercontent)
        else:
            # Not a table: don't yield it, but keep searching below it
            return (xfind.entercontent,)

doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))
for node in doc.walknodes(IsTable()):
    ...
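How the order of the options switches between top-down and bottom-up traversal can be illustrated without the library. The following is only a sketch under stated assumptions: ENTERCONTENT, toy_walknodes and the nested-tuple tree are hypothetical stand-ins, not the XIST implementation. The walker simply executes the returned options from left to right:

```python
ENTERCONTENT = object()  # hypothetical stand-in for the entercontent constant


def toy_walknodes(tree, filternode):
    """Execute the options returned by filternode in the order given."""
    name, children = tree
    for option in filternode(tree):
        if option is ENTERCONTENT:
            # Descend into the children at this point in the sequence
            for child in children:
                for node in toy_walknodes(child, filternode):
                    yield node
        elif option is True:
            # Yield the node itself at this point in the sequence
            yield tree
        # False: don't yield this node


def top_down(node):
    return (True, ENTERCONTENT)    # node first, then its children


def bottom_up(node):
    return (ENTERCONTENT, True)    # children first, then the node


tree = ("table", [("tr", [("td", [])]), ("tr", [])])
top = [n[0] for n in toy_walknodes(tree, top_down)]
bottom = [n[0] for n in toy_walknodes(tree, bottom_up)]
```

With this toy tree, top is table, tr, td, tr, and bottom is td, tr, tr, table: the same nodes, with parents before or after their children depending only on where True sits relative to the enter-content option.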
ll.xist.xsc provides several useful walkfilter classes
for specifying what should be returned from walk:
FindType
will search only the first level of the tree and will return any node that is an
instance of one of the classes passed to the constructor. So if you have an
instance of ll.xist.ns.html.ul named node you could
search for all ll.xist.ns.html.li elements inside with the
following code:
for li in node.content.walknodes(xfind.FindType(html.li)):
    ...
FindTypeAll
can be used when you want to search the complete tree. The following example
extracts all the links on the
Python home page:
from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html
doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))
for node in doc.walknodes(xfind.FindTypeAll(html.a)):
    print node.attrs.href
This gives the output:
http://www.python.org/
http://www.python.org/#left%2dhand%2dnavigation
http://www.python.org/#content%2dbody
http://www.python.org/search
http://www.python.org/about/
http://www.python.org/news/
...
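The difference between searching only the first level and searching the complete tree can be pictured with the node handling options described earlier. This is a library-free sketch; find_first_level, find_all_levels and ENTERCONTENT are hypothetical names, and how FindType and FindTypeAll are actually implemented is an assumption here. A first-level filter never asks the walker to enter a node's content, while a whole-tree filter always does:

```python
ENTERCONTENT = object()  # hypothetical stand-in for the entercontent constant


def toy_walknodes(nodes, filternode):
    """Walk a list of (name, children) nodes, obeying the filter's options."""
    for node in nodes:
        name, children = node
        for option in filternode(node):
            if option is ENTERCONTENT:
                for found in toy_walknodes(children, filternode):
                    yield found
            elif option is True:
                yield node


def find_first_level(name):
    def filternode(node):
        return (node[0] == name,)                # never descend
    return filternode


def find_all_levels(name):
    def filternode(node):
        return (node[0] == name, ENTERCONTENT)   # always descend
    return filternode


# Toy content of a list: two items, the second containing a nested list
content = [
    ("li", []),
    ("li", [("ul", [("li", [])])]),
    ("p", []),
]
first = list(toy_walknodes(content, find_first_level("li")))
everything = list(toy_walknodes(content, find_all_levels("li")))
```

Here first contains the two top-level li nodes only, while everything also includes the li nested inside the inner ul.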
The following example will find all external links on the Python home page:
from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html

class IsExtLink(xfind.WalkFilter):
    def filternode(self, node):
        # A link is external if its href doesn't start with the site's own URL
        if isinstance(node, html.a) and not unicode(node.attrs.href).startswith(u"http://www.python.org"):
            return (True, xfind.entercontent)
        return (xfind.entercontent,)

doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))
for node in doc.walknodes(IsExtLink()):
    print node.attrs.href
This gives the output:
http://docs.python.org/
http://pypi.python.org/pypi
http://wiki.python.org/moin/PythonWebsiteCreatingNewTickets
http://youtube.com/
http://wiki.python.org/moin/WebProgramming
http://wiki.python.org/moin/CgiScripts
...
xfind expressions
A second method exists for iterating through a tree: xfind expressions. xfind expressions are special walkfilters that look somewhat like XPath expressions, but are implemented as pure Python expressions (overloading various Python operators).
Every subclass of
ll.xist.xsc.Node
can be used as an xfind operator and combined with other xfind operators to get
xfind expressions. For example searching for links that contain images works as
follows:
for path in node.walk(html.a/html.img):
    print path[-2].attrs.href, path[-1].attrs.src
The output looks like this:
http://www.python.org/ http://www.python.org/images/python-logo.gif
http://www.python.org/#left%2dhand%2dnavigation http://www.python.org/images/trans.gif
http://www.python.org/#content%2dbody http://www.python.org/images/trans.gif
http://www.python.org/about/success/usa http://www.python.org/images/success/nasa.jpg
If the img elements are not immediate children of the
a elements, the xfind expression above won't output them. In this
case you can use a “descendant selector” instead of a “child selector”.
To do this simply replace html.a/html.img with html.a//html.img.
Apart from the / and // operators you can also use
the | and & operators to combine xfind expressions:
from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html
doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))
for node in doc.walknodes((html.a | html.area) & xfind.hasattr("href")):
    print node.attrs.href
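To see how plain Python operators can compose such expressions, here is a minimal, library-free sketch. The Sel class and the tag helper are hypothetical and are not how ll.xist.xfind is implemented; they only show /, //, | and & combining predicates over a path, here modelled as a list of element names from the root to the node:

```python
class Sel(object):
    """A predicate over a path (list of names from the root to the node)."""

    def __init__(self, match):
        self.match = match  # callable: path -> bool

    def __or__(self, other):
        # a | b: either selector matches the path
        return Sel(lambda path: self.match(path) or other.match(path))

    def __and__(self, other):
        # a & b: both selectors match the path
        return Sel(lambda path: self.match(path) and other.match(path))

    def __truediv__(self, other):
        # a / b: b matches the node and a matches its parent (child selector)
        return Sel(lambda path: len(path) > 1
                   and other.match(path)
                   and self.match(path[:-1]))

    __div__ = __truediv__  # the name Python 2 uses for the / operator

    def __floordiv__(self, other):
        # a // b: b matches the node and a matches some ancestor
        return Sel(lambda path: other.match(path)
                   and any(self.match(path[:i]) for i in range(1, len(path))))


def tag(name):
    """Selector matching paths whose last element has the given name."""
    return Sel(lambda path: bool(path) and path[-1] == name)


child = tag("a") / tag("img")
descendant = tag("a") // tag("img")
either = tag("a") | tag("area")

direct = ["html", "a", "img"]
nested = ["html", "a", "span", "img"]
```

In this sketch child matches direct but not nested, while descendant matches both, mirroring the difference between the two selectors described above.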
Here's another example that finds all elements that have an id
attribute:
from ll.xist import xsc, parse, xfind
from ll.xist.ns import xml, html
doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))
for node in doc.walknodes(xfind.hasattr("id")):
    print node.attrs.id
The output looks like this:
screen-switcher-stylesheet
logoheader
logolink
logo
skiptonav
skiptocontent
...
For more examples refer to the documentation of the xfind
module.
CSS selectors
It's also possible to use CSS selectors as walk filters. The module
ll.xist.css provides a
function selector that turns a CSS selector expression
into a walk filter:
from ll.xist import xsc, parse, css
from ll.xist.ns import xml, html
doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))
for node in doc.walknodes(css.selector("div#menu ul.level-one li > a")):
    print node.attrs.href
This outputs all the first level links in the navigation:
http://www.python.org/about/
http://www.python.org/news/
http://www.python.org/doc/
http://www.python.org/download/
http://www.python.org/community/
http://www.python.org/psf/
http://www.python.org/dev/
http://www.python.org/links/
Most of the CSS 3 selectors are supported.
For more examples see the documentation of the
css module.