Searching XIST trees
====================

Searching, tree traversal, xfind and CSS selectors
==================================================

The walk method
===============

There are three related methods for iterating through an XML tree and finding nodes in it: walk, walknodes and walkpaths.

The method walk is a generator. You pass a WalkFilter to walk, which determines which parts of the tree are searched and which nodes are returned. The objects produced by walk are lists containing the path from the root of the tree to the node in question. (Actually it's always the same list object; if you want distinct objects, use the walkpaths method.) The method walknodes produces the nodes themselves instead of the paths to them.

When walk iterates through the tree it calls the walkfilter's filterpath method with a list containing the path to the node in question as the only argument. (It's also possible to implement the method filternode in your own walkfilters instead of filterpath. Instead of the complete path it only gets the node itself as an argument.) filterpath (or filternode) must return a sequence of "node handling options". A node handling option is one of the following:

True
    This tells walk to yield this node from the generator;

False
    Don't yield this node from the generator;

enterattrs
    This is a global constant in ll.xist.xfind and tells walk to traverse the attributes of this node (if it's an Element, otherwise this option will be ignored);

entercontent
    This is a global constant in ll.xist.xfind and tells walk to traverse the child nodes of this node (if it's an Element, otherwise this option will be ignored).

These options are executed in the order in which they appear in the sequence, so by changing their order in the returned sequence you can switch between top-down and bottom-up traversal.
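
The effect of the option order can be illustrated with a small toy model that does not use XIST at all. It is only a sketch under stated assumptions: the Node class, the ENTERCONTENT constant and the walknodes function below are hypothetical stand-ins written for this illustration, not XIST's implementation, and attributes are left out entirely.

```python
# Toy model of the "node handling options" protocol; NOT XIST's implementation.
ENTERCONTENT = object()  # hypothetical stand-in for the entercontent constant


class Node(object):
    """Minimal tree node with a name and a list of child nodes."""
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)


def walknodes(node, filternode):
    """Yield nodes, executing the options returned by filternode in order."""
    for option in filternode(node):
        if option is ENTERCONTENT:
            # Traverse the child nodes at this point in the sequence
            for child in node.children:
                for found in walknodes(child, filternode):
                    yield found
        elif option is True:
            # Yield the node itself at this point in the sequence
            yield node
        # False (or anything else) means: do nothing


def topdown(node):
    # Yield the node first, then descend into its children
    return (True, ENTERCONTENT)


def bottomup(node):
    # Descend into the children first, then yield the node
    return (ENTERCONTENT, True)


tree = Node("ul", Node("li", Node("a")), Node("li"))

topdown_names = [n.name for n in walknodes(tree, topdown)]
bottomup_names = [n.name for n in walknodes(tree, bottomup)]
print(topdown_names)   # ['ul', 'li', 'a', 'li']
print(bottomup_names)  # ['a', 'li', 'li', 'ul']
```

Swapping the two options in the returned tuple is the only difference between the two filters, which mirrors how the order of the node handling options selects the traversal direction.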

To get a top-down traversal of a tree and produce all table elements, the following code could be used:

    from ll.xist import xsc, parse, xfind
    from ll.xist.ns import xml, html

    class IsTable(xfind.WalkFilter):
        def filternode(self, node):
            if isinstance(node, html.table):
                # Yield the table first, then search its content (top-down)
                return (True, xfind.entercontent)
            else:
                # Not a table: don't yield it, but keep searching below it
                return (xfind.entercontent,)

    doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

    for node in doc.walknodes(IsTable()):
        ...

Using the walk method

ll.xist.xfind provides several useful walkfilter classes for specifying what should be returned from walk:

FindType will search only the first level of the tree and will return any node that is an instance of one of the classes passed to the constructor. So if you have an instance of ll.xist.ns.html.ul named node you could search for all ll.xist.ns.html.li elements inside with the following code:

    for li in node.content.walknodes(xfind.FindType(html.li)):
        ...

Searching for li inside ul with walk

FindTypeAll can be used when you want to search the complete tree. The following example extracts all the links on the Python home page:

    from ll.xist import xsc, parse, xfind
    from ll.xist.ns import xml, html

    doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

    for node in doc.walknodes(xfind.FindTypeAll(html.a)):
        print node.attrs.href

Finding all links on the Python home page

This gives the output:

    http://www.python.org/
    http://www.python.org/#left%2dhand%2dnavigation
    http://www.python.org/#content%2dbody
    http://www.python.org/search
    http://www.python.org/about/
    http://www.python.org/news/
    ...
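
To make the difference between a first-level search and a search of the complete tree concrete, here is a toy model that is independent of XIST. The Node class and the two helper functions are hypothetical names invented for this sketch; in particular, the assumption that the full search also descends into nodes that already matched is made for the illustration and is not taken from the FindTypeAll documentation.

```python
# Toy comparison of a first-level search and a full-tree search; not XIST code.
class Node(object):
    """Minimal tree node with a name and a list of child nodes."""
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)


def find_first_level(nodes, name):
    """Return the matching nodes among `nodes` only, without descending."""
    return [node for node in nodes if node.name == name]


def find_all(nodes, name):
    """Return matching nodes at any depth, in document order."""
    result = []
    for node in nodes:
        if node.name == name:
            result.append(node)
        # Assumption of this sketch: keep searching below matches too
        result.extend(find_all(node.children, name))
    return result


# Models <ul><li/><li><ul><li/></ul></li></ul>
inner = Node("ul", Node("li"))
outer = Node("ul", Node("li"), Node("li", inner))

first_level = find_first_level(outer.children, "li")
everywhere = find_all(outer.children, "li")
print(len(first_level))  # 2
print(len(everywhere))   # 3
```

The first-level search sees only the two li children of the outer list, while the full search also finds the li inside the nested list.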

The following example will find all external links on the Python home page:

    from ll.xist import xsc, parse, xfind
    from ll.xist.ns import xml, html

    class IsExtLink(xfind.WalkFilter):
        def filternode(self, node):
            # Yield links whose target lies outside www.python.org
            if isinstance(node, html.a) and not unicode(node.attrs.href).startswith(u"http://www.python.org"):
                return (True, xfind.entercontent)
            return (xfind.entercontent,)

    doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

    for node in doc.walknodes(IsExtLink()):
        print node.attrs.href

Finding external links on the Python home page

This gives the output:

    http://docs.python.org/
    http://pypi.python.org/pypi
    http://wiki.python.org/moin/PythonWebsiteCreatingNewTickets
    http://youtube.com/
    http://wiki.python.org/moin/WebProgramming
    http://wiki.python.org/moin/CgiScripts
    ...

xfind expressions
=================

A second method exists for iterating through a tree: xfind expressions. xfind expressions are special walkfilters that look somewhat like XPath expressions, but are implemented as pure Python expressions (overloading various Python operators).

Every subclass of ll.xist.xsc.Node can be used as an xfind operator and combined with other xfind operators to get xfind expressions. For example, searching for links that contain images works as follows:

    for path in doc.walk(html.a/html.img):
        # path[-1] is the img element, path[-2] its parent a element
        print path[-2].attrs.href, path[-1].attrs.src

Searching for img inside a with an xfind expression

The output looks like this:

    http://www.python.org/ http://www.python.org/images/python-logo.gif
    http://www.python.org/#left%2dhand%2dnavigation http://www.python.org/images/trans.gif
    http://www.python.org/#content%2dbody http://www.python.org/images/trans.gif
    http://www.python.org/about/success/usa http://www.python.org/images/success/nasa.jpg

If the img elements are not immediate children of the a elements, the xfind expression above won't output them.
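
How an expression such as html.a/html.img can be built from ordinary Python operators is sketched below with a toy model. Everything in it is hypothetical and written only for this illustration (the Expr, Name and Child classes and the matches method are not XIST's classes), and paths are modeled as plain lists of element names instead of node objects.

```python
# Toy child selector built with operator overloading; NOT XIST's xfind module.
class Expr(object):
    """Base class for toy path expressions."""
    def __truediv__(self, other):
        # self / other builds a child selector
        return Child(self, other)
    __div__ = __truediv__  # Python 2 spelling of the / operator


class Name(Expr):
    """Matches a path whose last element has the given name."""
    def __init__(self, name):
        self.name = name

    def matches(self, path):
        return bool(path) and path[-1] == self.name


class Child(Expr):
    """Matches if the last element matches `right` and its parent matches `left`."""
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def matches(self, path):
        return self.right.matches(path) and self.left.matches(path[:-1])


selector = Name("a") / Name("img")

direct = ["html", "body", "a", "img"]          # img directly inside a
nested = ["html", "body", "a", "span", "img"]  # img wrapped in a span

print(selector.matches(direct))  # True
print(selector.matches(nested))  # False
```

The nested path is not matched because its img element is not an immediate child of the a element.
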
In this case you can use a "descendant selector" instead of a "child selector". To do this simply replace html.a/html.img with html.a//html.img.

Apart from the / and // operators you can also use the | and & operators to combine xfind expressions:

    from ll.xist import xsc, parse, xfind
    from ll.xist.ns import xml, html

    doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

    # a or area elements that have an href attribute
    for node in doc.walknodes((html.a | html.area) & xfind.hasattr("href")):
        print node.attrs.href

Here's another example that finds all elements that have an id attribute:

    from ll.xist import xsc, parse, xfind
    from ll.xist.ns import xml, html

    doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

    for node in doc.walknodes(xfind.hasattr("id")):
        print node.attrs.id

The output looks like this:

    screen-switcher-stylesheet
    logoheader
    logolink
    logo
    skiptonav
    skiptocontent
    ...

For more examples refer to the documentation of the xfind module.

CSS selectors
=============

It's also possible to use CSS selectors as walk filters. The module ll.xist.css provides a function selector that turns a CSS selector expression into a walk filter:

    from ll.xist import xsc, parse, css
    from ll.xist.ns import xml, html

    doc = parse.tree(parse.URL("http://www.python.org"), parse.Tidy(), parse.NS(html), parse.Node(pool=xsc.Pool(xml, html)))

    for node in doc.walknodes(css.selector("div#menu ul.level-one li > a")):
        print node.attrs.href

Using CSS selectors as walk filters

This outputs all the first level links in the navigation:

    http://www.python.org/about/
    http://www.python.org/news/
    http://www.python.org/doc/
    http://www.python.org/download/
    http://www.python.org/community/
    http://www.python.org/psf/
    http://www.python.org/dev/
    http://www.python.org/links/

Most of the CSS 3 selectors are supported. For more examples see the documentation of the css module.