XIST.parse
==========
Tools for parsing XML from various sources
==========================================
This module contains everything you need to create XIST objects by
parsing files, strings, URLs etc.
Parsing XML is done with a pipelined approach. The first step in
the pipeline is a source object that provides the input for the
rest of the pipeline. The next step is the XML parser. It turns
the input source into an iterator over parsing events (an "event
stream"). Further steps in the pipeline might resolve namespace
prefixes (NS), and instantiate XIST classes (Node). The final step
in the pipeline is either building an XML tree via tree or an
iterative parsing step (similar to ElementTree's iterparse
function) via itertree.
Parsing a simple HTML string might look like this:
>>> from ll.xist import xsc, parse
>>> from ll.xist.ns import html
>>> source = "<a href='http://www.python.org/'>Python</a>"
>>> doc = parse.tree(
...     parse.String(source),
...     parse.Expat(),
...     parse.NS(html),
...     parse.Node(pool=xsc.Pool(html))
... )
>>> doc.bytes()
'<a href="http://www.python.org/">Python</a>'
A source object is an iterable object that produces the input byte
string for the parser (possibly in multiple chunks), together with
information about the URL of the input:
>>> from ll.xist import parse
>>> list(parse.String("<a href='http://www.python.org/'>Python</a>"))
[('url', URL('STRING')),
('bytes', "<a href='http://www.python.org/'>Python</a>")]
All subsequent objects in the pipeline are callable objects, get
the input iterator as an argument and return an iterator over
events themselves. The following code shows an example of an event
stream:
>>> from ll.xist import parse
>>> source = "<a href='http://www.python.org/'>Python</a>"
>>> list(parse.events(parse.String(source), parse.Expat()))
[('url', URL('STRING')),
('position', (0, 0)),
('enterstarttag', u'a'),
('enterattr', u'href'),
('text', u'http://www.python.org/'),
('leaveattr', u'href'),
('leavestarttag', u'a'),
('position', (0, 39)),
('text', u'Python'),
('endtag', u'a')]
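The chaining described above can be illustrated without XIST. The
following is a minimal sketch in plain Python: string_source,
upper_bytes and run_pipeline are hypothetical names invented for
this illustration and are not part of ll.xist.parse. It only shows
the general shape: a source is an iterable of events, and every later
stage is a callable that takes an event iterator and returns one.

```python
# Illustrative sketch only: these names are hypothetical, not ll.xist classes.

def string_source(data, url="STRING"):
    """Source stage: yield one "url" event, then the data as a "bytes" event."""
    yield ("url", url)
    yield ("bytes", data)


def upper_bytes(events):
    """Pipeline stage: takes an event iterator, returns an event iterator."""
    for (evtype, evdata) in events:
        if evtype == "bytes":
            evdata = evdata.upper()
        yield (evtype, evdata)


def run_pipeline(source, *stages):
    """Feed the source through each stage in order; return the final iterator."""
    stream = iter(source)
    for stage in stages:
        stream = stage(stream)
    return stream


pipeline_result = list(run_pipeline(string_source(b"python"), upper_bytes))
```

Because every stage only sees an iterator, stages can be added or
removed without changing the others.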
An event is a tuple consisting of the event type and the event
data. Different stages in the pipeline produce different event
types. The following event types can be produced by source
objects:
"url"
The event data is the URL of the source. Usually such an
event is produced only once at the start of the event
stream. For sources that have no natural URL (like strings
or streams) the URL can be specified when creating the
source object.
"bytes"
This event is produced by source objects (and Transcoder
objects). The event data is a byte string.
"unicode"
The event data is a unicode string. This event is produced
by Decoder objects. Note that the only predefined pipeline
objects that can handle "unicode" events are Encoder
objects.
The following types of events are produced by parsers (in addition
to the "url" event from above):
"position"
The event data is a tuple containing the line and column
number in the source (both starting with 0). All the
following events should use this position information
until the next position event.
"xmldecl"
The XML declaration. The event data is a dictionary
containing the keys "version", "encoding" and
"standalone". Parsers may omit this event.
"begindoctype"
The beginning of the doctype. The event data is a dictionary
containing the keys "name", "publicid" and "systemid".
Parsers may omit this event.
"enddoctype"
The end of the doctype. The event data is None. (If there
is no internal subset, the "enddoctype" event immediately
follows the "begindoctype" event). Parsers may omit this
event.
"comment"
A comment. The event data is the content of the comment.
"text"
Text data. The event data is the text content. Parsers
should try to avoid outputting multiple text events in
sequence.
"cdata"
A CDATA section. The event data is the content of the
CDATA section. Parsers may report CDATA sections as "text"
events instead of "cdata" events.
"enterstarttag"
The beginning of an element start tag. The event data is
the element name.
"leavestarttag"
The end of an element start tag. The event data is the
element name. The parser will output events for the
attributes between the "enterstarttag" and the
"leavestarttag" event.
"enterattr"
The beginning of an attribute. The event data is the
attribute name.
"leaveattr"
The end of an attribute. The event data is the attribute
name. The parser will output events for the attribute
value between the "enterattr" and the "leaveattr" event.
(In almost all cases this is one text event).
"endtag"
An element end tag. The event data is the element name.
"procinst"
A processing instruction. The event data is a tuple
consisting of the processing instruction target and the
data.
"entity"
An entity reference. The event data is the entity name.
The following events are produced for elements and attributes in
namespace mode (instead of those without the ns suffix). They are
produced by NS objects or by Expat objects when ns is true (i.e.
the expat parser does the namespace resolution):
"enterstarttagns"
The beginning of an element start tag in namespace mode.
The event data is an (element name, namespace name) tuple.
"leavestarttagns"
The end of an element start tag in namespace mode. The
event data is an (element name, namespace name) tuple.
"enterattrns"
The beginning of an attribute in namespace mode. The event
data is an (attribute name, namespace name) tuple.
"leaveattrns"
The end of an attribute in namespace mode. The event data
is an (attribute name, namespace name) tuple.
"endtagns"
An element end tag in namespace mode. The event data is an
(element name, namespace name) tuple.
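How plain events map to their namespaced counterparts can be sketched
as follows. This is a simplified illustration, not the NS
implementation: resolve_namespaces is a hypothetical helper, the
prefix mapping is fixed up front, and xmlns declarations inside the
document are ignored. Unprefixed attributes get the namespace None,
unprefixed elements get the namespace registered for the empty prefix
(stored under the key None in this sketch).

```python
# Illustrative sketch only: resolve_namespaces is hypothetical, not ll.xist API.

XHTML = "http://www.w3.org/1999/xhtml"

ELEMENT_EVENTS = ("enterstarttag", "leavestarttag", "endtag")
ATTR_EVENTS = ("enterattr", "leaveattr")


def resolve_namespaces(events, prefixes):
    """Replace element/attribute events by their "...ns" variants."""
    for (evtype, evdata) in events:
        if evtype in ELEMENT_EVENTS or evtype in ATTR_EVENTS:
            (prefix, sep, local) = evdata.rpartition(":")
            if not sep and evtype in ATTR_EVENTS:
                # Unprefixed attributes are in no namespace.
                namespace = None
            else:
                # Unprefixed elements use the default (key None) namespace.
                namespace = prefixes.get(prefix if sep else None)
            yield (evtype + "ns", (local, namespace))
        else:
            # All other events pass through unchanged.
            yield (evtype, evdata)


plain_events = [
    ("url", "STRING"),
    ("enterstarttag", "a"),
    ("enterattr", "href"),
    ("text", "http://www.python.org/"),
    ("leaveattr", "href"),
    ("leavestarttag", "a"),
    ("text", "Python"),
    ("endtag", "a"),
]
ns_events = list(resolve_namespaces(plain_events, {None: XHTML}))
```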
Once XIST nodes have been instantiated (by Node objects) the
following events are used:
"xmldeclnode"
The XML declaration. The event data is an instance of
ll.xist.xml.XML.
"doctypenode"
The doctype. The event data is an instance of
ll.xist.xsc.DocType.
"commentnode"
A comment. The event data is an instance of
ll.xist.xsc.Comment.
"textnode"
Text data. The event data is an instance of
ll.xist.xsc.Text.
"startelementnode"
The beginning of an element. The event data is an instance
of ll.xist.xsc.Element (or rather one of its subclasses).
The attributes of the element object are set, but the
element has no content.
"endelementnode"
The end of an element. The event data is an instance of
ll.xist.xsc.Element.
"procinstnode"
A processing instruction. The event data is an instance of
ll.xist.xsc.ProcInst.
"entitynode"
An entity reference. The event data is an instance of
ll.xist.xsc.Entity.
For consuming event streams there are three functions:
events
This generator simply outputs the events.
tree
This function builds an XML tree from the events and
returns it.
itertree
This generator builds a tree like tree, but returns events
during certain steps in the parsing process.
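To make the idea of building a tree from an event stream concrete,
here is a minimal sketch using a stack. It is not the tree function
itself: build_tree is a hypothetical helper, nodes are plain
dictionaries instead of XIST nodes, and the event names
"startelement", "text" and "endelement" are simplified stand-ins for
the node events listed above.

```python
# Illustrative sketch only: plain dicts stand in for XIST nodes.

def build_tree(events):
    """Build a nested structure from start/text/end events using a stack."""
    root = {"name": None, "content": []}  # plays the role of the root fragment
    stack = [root]
    for (evtype, evdata) in events:
        if evtype == "startelement":
            node = {"name": evdata, "content": []}
            stack[-1]["content"].append(node)  # attach to the current parent
            stack.append(node)                 # the new node becomes current
        elif evtype == "text":
            stack[-1]["content"].append(evdata)
        elif evtype == "endelement":
            stack.pop()                        # back to the parent
    return root


tree_result = build_tree([
    ("startelement", "ul"),
    ("startelement", "li"),
    ("text", "Python"),
    ("endelement", "li"),
    ("endelement", "ul"),
])
```

An iterative variant would yield the current stack (the path) at
selected events instead of only returning the finished root.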
class UnknownEventError(TypeError):
====================================
This exception is raised when a pipeline object doesn't know how
to handle an event.
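One way such an error can arise is sketched below. This is an
assumption about the general pattern, not the XIST source: the method
lists of NS and Node further down suggest one handler method per event
type, so this sketch dispatches on the event type with getattr and
raises when no handler exists. UnknownEvent and UpperText are
hypothetical names.

```python
# Illustrative sketch only: UnknownEvent and UpperText are hypothetical.

class UnknownEvent(TypeError):
    """Raised when a pipeline object has no handler for an event type."""

    def __init__(self, pipe, event):
        TypeError.__init__(self, pipe, event)
        self.pipe = pipe
        self.event = event

    def __str__(self):
        return "{!r} can't handle event type {!r}".format(self.pipe, self.event[0])


class UpperText(object):
    """Pipeline object with one method per supported event type."""

    def url(self, data):
        return ("url", data)

    def text(self, data):
        return ("text", data.upper())

    def __call__(self, events):
        for (evtype, evdata) in events:
            handler = getattr(self, evtype, None)
            if handler is None:
                raise UnknownEvent(self, (evtype, evdata))
            yield handler(evdata)


handled = list(UpperText()([("url", "STRING"), ("text", "python")]))
```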
def __init__(self, pipe, event):
=================================
def __str__(self):
===================
class String(object):
======================
Provides parser input from a string.
def __init__(self, data, url=None):
====================================
Create a String object. data must be a byte or unicode string. url
specifies the URL for the source (defaulting to "STRING").
def __iter__(self):
====================
Produces an event stream of one "url" event and one "bytes" or
"unicode" event for the data.
class Iter(object):
====================
Provides parser input from an iterator over strings.
def __init__(self, iterable, url=None):
========================================
Create an Iter object. iterable must be an iterable object
producing byte or unicode strings. url specifies the URL for the
source (defaulting to "ITER").
def __iter__(self):
====================
Produces an event stream of one "url" event followed by the
"bytes"/"unicode" events for the data from the iterable.
class Stream(object):
======================
Provides parser input from a stream (i.e. an object that provides
a read method).
def __init__(self, stream, url=None, bufsize=8192):
====================================================
Create a Stream object. stream must have a read method (with a
size argument). url specifies the URL for the source (defaulting
to "STREAM"). bufsize specifies the chunksize for reads from the
stream.
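The effect of bufsize can be sketched with a small generator. This is
an illustration of chunked reading in general, not the Stream class:
stream_events is a hypothetical helper that reads at most bufsize
bytes per call and emits each chunk as a "bytes" event.

```python
# Illustrative sketch only: stream_events is hypothetical, not ll.xist API.
import io


def stream_events(stream, url="STREAM", bufsize=8192):
    """Yield one "url" event, then one "bytes" event per chunk read."""
    yield ("url", url)
    while True:
        data = stream.read(bufsize)
        if not data:  # an empty result signals the end of the stream
            break
        yield ("bytes", data)


chunk_events = list(stream_events(io.BytesIO(b"<a>Python</a>"), bufsize=4))
```

A small bufsize produces many small events, a large one fewer and
bigger events; the concatenated data is the same either way.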
def __iter__(self):
====================
Produces an event stream of one "url" event followed by the
"bytes"/"unicode" events for the data from the stream.
class File(object):
====================
Provides parser input from a file.
def __init__(self, filename, bufsize=8192):
============================================
Create a File object. filename is the name of the file and may
start with ~ or ~user for the home directory of the current or the
specified user. bufsize specifies the chunksize for reads from the
file.
def __iter__(self):
====================
Produces an event stream of one "url" event followed by the
"bytes" events for the data from the file.
class URL(object):
===================
Provides parser input from a URL.
def __init__(self, name, bufsize=8192, *args, **kwargs):
=========================================================
Create a URL object. name is the URL. bufsize specifies the
chunksize for reads from the URL. args and kwargs will be passed
on to the open method of the URL object.
The URL for the input will be the final URL for the resource (i.e.
it will include redirects).
def __iter__(self):
====================
Produces an event stream of one "url" event followed by the
"bytes" events for the data from the URL.
class ETree(object):
=====================
Produces a (namespaced) event stream from an object that supports
the ElementTree API.
def __init__(self, data, url=None, defaultxmlns=None):
=======================================================
Create an ETree object. Arguments have the following meaning:
data
An object that supports the ElementTree API.
url
The URL of the source. Defaults to "ETREE".
defaultxmlns
The namespace name (or a namespace module containing a
namespace name) that will be used for all elements that
don't have a namespace.
def _asxist(self, node):
=========================
def __iter__(self):
====================
Produces an event stream of namespaced parsing events for the
ElementTree object passed as data to the constructor.
class Decoder(object):
=======================
Decode the byte strings produced by the previous object in the
pipeline to unicode strings.
This input object can be a source object or any other pipeline
object that produces byte strings.
def __init__(self, encoding=None):
===================================
Create a Decoder object. encoding is the encoding of the input. If
encoding is None it will be automatically detected from the XML
data.
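Decoding chunked input has one pitfall: a multi-byte character may be
split across two "bytes" events. The sketch below shows how an
incremental decoder from the standard codecs module handles this. It
is not the Decoder class: decode_events is a hypothetical helper, and
the encoding must be passed explicitly (automatic detection from the
XML data is not attempted here).

```python
# Illustrative sketch only: decode_events is hypothetical, not ll.xist API.
import codecs


def decode_events(events, encoding="utf-8"):
    """Turn "bytes" events into "unicode" events, keeping other events."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for (evtype, evdata) in events:
        if evtype == "bytes":
            text = decoder.decode(evdata)  # may hold back an incomplete character
            if text:
                yield ("unicode", text)
        else:
            yield (evtype, evdata)
    tail = decoder.decode(b"", final=True)  # flush; raises on truncated input
    if tail:
        yield ("unicode", tail)


decoded = list(decode_events([
    ("url", "STRING"),
    ("bytes", b"caf\xc3"),  # first half of a two-byte UTF-8 character
    ("bytes", b"\xa9"),     # second half
]))
```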
def __call__(self, input):
===========================
def __repr__(self):
====================
class Encoder(object):
=======================
Encode the unicode strings produced by the previous object in the
pipeline to byte strings.
This input object must be a pipeline object that produces unicode
output (e.g. a Decoder object).
def __init__(self, encoding=None):
===================================
Create an Encoder object. encoding will be the encoding of the
output. If encoding is None it will be automatically detected from
the XML declaration in the data.
def __call__(self, input):
===========================
def __repr__(self):
====================
class Transcoder(object):
==========================
Transcode the byte strings of the input object into another
encoding.
This input object can be a source object or any other pipeline
object that produces byte strings.
def __init__(self, fromencoding=None, toencoding=None):
========================================================
Create a Transcoder object. fromencoding is the encoding of the
input. toencoding is the encoding of the output. If either of them is
None, the encoding will be detected from the data.
def __call__(self, input):
===========================
def __repr__(self):
====================
class Parser(object):
======================
Basic parser interface.
class Expat(Parser):
=====================
A parser using Python's built-in expat parser.
def __init__(self, encoding=None, xmldecl=False, doctype=False,
loc=True, cdata=False, ns=False):
==================================================================================================
Create an Expat parser. Arguments have the following meaning:
encoding (string or None)
Forces the parser to use the specified encoding. The
default None results in the encoding being detected from
the XML itself.
xmldecl (bool)
Should the parser produce events for the XML declaration?
doctype (bool)
Should the parser produce events for the document type?
loc (bool)
Should the parser produce "position" events?
cdata (bool)
Should the parser output CDATA sections as "cdata" events?
(If cdata is false output "text" events instead.)
ns (bool)
If ns is true, the parser does its own namespace
processing, i.e. it will emit "enterstarttagns",
"leavestarttagns", "endtagns", "enterattrns" and
"leaveattrns" events instead of "enterstarttag",
"leavestarttag", "endtag", "enterattr" and "leaveattr"
events.
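For readers unfamiliar with expat, the following sketch shows how the
standard library module xml.parsers.expat can be driven to produce
event tuples of the kind described on this page. It is not the Expat
class: expat_events is a hypothetical helper, it collects events in a
list instead of yielding them lazily, and it does not buffer text, so
the reported positions may differ from those in the examples above.

```python
# Illustrative sketch only: expat_events is hypothetical, not ll.xist API.
from xml.parsers import expat


def expat_events(data):
    """Parse a complete byte string and return a list of event tuples."""
    events = []
    parser = expat.ParserCreate()

    def position():
        # expat counts lines from 1; the events described here count from 0.
        line = parser.CurrentLineNumber - 1
        events.append(("position", (line, parser.CurrentColumnNumber)))

    def start(name, attrs):
        position()
        events.append(("enterstarttag", name))
        for (attrname, attrvalue) in attrs.items():
            events.append(("enterattr", attrname))
            events.append(("text", attrvalue))
            events.append(("leaveattr", attrname))
        events.append(("leavestarttag", name))

    def end(name):
        events.append(("endtag", name))

    def text(content):
        position()
        events.append(("text", content))

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(data, True)  # True: this is the final chunk of input
    return events


sketch_events = expat_events(b"<a href='http://www.python.org/'>Python</a>")
```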
def __repr__(self):
====================
def __call__(self, input):
===========================
Return an iterator over the events produced by input.
def _event(self, evtype, evdata):
==================================
def _flush(self, force):
=========================
def _getname(self, name):
==========================
def _handle_startcdata(self):
==============================
def _handle_endcdata(self):
============================
def _handle_xmldecl(self, version, encoding, standalone):
==========================================================
def _handle_begindoctype(self, doctypename, systemid, publicid,
has_internal_subset):
======================================================================================
def _handle_enddoctype(self):
==============================
def _handle_default(self, data):
=================================
def _handle_comment(self, data):
=================================
def _handle_text(self, data):
==============================
def _handle_startelement(self, name, attrs):
=============================================
def _handle_endelement(self, name):
====================================
def _handle_procinst(self, target, data):
==========================================
class SGMLOP(Parser):
======================
A parser based on sgmlop.
def __init__(self, encoding=None, cdata=False):
================================================
Create a SGMLOP parser. Arguments have the following meaning:
encoding (string or None)
Forces the parser to use the specified encoding. The
default None results in the encoding being detected from
the XML itself.
cdata (bool)
Should the parser output CDATA sections as "cdata" events?
(If cdata is false output "text" events instead.)
def __repr__(self):
====================
def __call__(self, input):
===========================
Return an iterator over the events produced by input.
def _event(self, evtype, evdata):
==================================
def _flush(self, force):
=========================
def handle_comment(self, data):
================================
def handle_data(self, data):
=============================
def handle_cdata(self, data):
==============================
def handle_proc(self, target, data):
=====================================
def handle_entityref(self, name):
==================================
def handle_enterstarttag(self, name):
======================================
def handle_leavestarttag(self, name):
======================================
def handle_enterattr(self, name):
==================================
def handle_leaveattr(self, name):
==================================
def handle_endtag(self, name):
===============================
class NS(object):
==================
An NS object is used in a parsing pipeline to add support for XML
namespaces. It replaces the "enterstarttag", "leavestarttag",
"endtag", "enterattr" and "leaveattr" events with the appropriate
namespace version of the events (i.e. "enterstarttagns" etc.)
where the event data is a (name, namespace) tuple.
The output of an NS object in the stream looks like this:
>>> from ll.xist import parse
>>> from ll.xist.ns import html
>>> list(parse.events(
... parse.String("<a href='http://www.python.org/'>Python</a>"),
... parse.Expat(),
... parse.NS(html)
... ))
[('url', URL('STRING')),
('position', (0, 0)),
('enterstarttagns', (u'a', 'http://www.w3.org/1999/xhtml')),
('enterattrns', (u'href', None)),
('text', u'http://www.python.org/'),
('leaveattrns', (u'href', None)),
('leavestarttagns', (u'a', 'http://www.w3.org/1999/xhtml')),
('position', (0, 39)),
('text', u'Python'),
('endtagns', (u'a', 'http://www.w3.org/1999/xhtml'))]
def __init__(self, prefixes=None, **kwargs):
=============================================
Create an NS object. prefixes (if not None) can be a namespace
name (or module), which will be used for the empty prefix, or a
dictionary that maps prefixes to namespace names (or modules).
kwargs maps prefixes to namespace names too. If a prefix is in
both prefixes and kwargs, kwargs wins.
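The precedence rule can be sketched as follows. build_prefix_map is a
hypothetical helper that only mirrors the rules stated above (it does
not handle namespace modules, and it stores the empty prefix under
the key None, which is an assumption of this sketch).

```python
# Illustrative sketch only: build_prefix_map is hypothetical, not ll.xist API.

def build_prefix_map(prefixes=None, **kwargs):
    """Merge a default namespace or prefix dict with keyword overrides."""
    mapping = {}
    if isinstance(prefixes, dict):
        mapping.update(prefixes)
    elif prefixes is not None:
        mapping[None] = prefixes  # a single name is used for the empty prefix
    mapping.update(kwargs)        # keyword arguments win over prefixes
    return mapping


default_only = build_prefix_map("http://www.w3.org/1999/xhtml")
overridden = build_prefix_map({"x": "urn:first"}, x="urn:second")
```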
def __call__(self, input):
===========================
def url(self, data):
=====================
def xmldecl(self, data):
=========================
def begindoctype(self, data):
==============================
def enddoctype(self, data):
============================
def comment(self, data):
=========================
def text(self, data):
======================
def cdata(self, data):
=======================
def procinst(self, data):
==========================
def entity(self, data):
========================
def position(self, data):
==========================
def enterstarttag(self, data):
===============================
def enterattr(self, data):
===========================
def leaveattr(self, data):
===========================
def leavestarttag(self, data):
===============================
def endtag(self, data):
========================
class Node(object):
====================
A Node object is used in a parsing pipeline to instantiate XIST
nodes. It consumes a namespaced event stream:
>>> from ll.xist import xsc, parse
>>> from ll.xist.ns import html
>>> list(parse.events(
... parse.String("<a href='http://www.python.org/'>Python</a>"),
... parse.Expat(),
... parse.NS(html),
... parse.Node(pool=xsc.Pool(html))
... ))
[(u'startelementnode',
),
(u'textnode',
),
(u'endelementnode',
)]
The event data of all events are XIST nodes. The element node from
the "startelementnode" event already has all attributes set. There
will be no events for attributes.
def __init__(self, pool=None, base=None, loc=True):
====================================================
property base:
==============
def __get__(self):
-------------------
def __call__(self, input):
===========================
def url(self, data):
=====================
def xmldecl(self, data):
=========================
def begindoctype(self, data):
==============================
def enddoctype(self, data):
============================
def entity(self, data):
========================
def comment(self, data):
=========================
def cdata(self, data):
=======================
def text(self, data):
======================
def enterstarttagns(self, data):
=================================
def enterattrns(self, data):
=============================
def leaveattrns(self, data):
=============================
def leavestarttagns(self, data):
=================================
def endtagns(self, data):
==========================
def procinst(self, data):
==========================
def position(self, data):
==========================
class Tidy(object):
====================
A Tidy object parses (potentially ill-formed) HTML from a source
into a (unnamespaced) event stream by using libxml2's HTML parser:
>>> from ll.xist import parse
>>> list(parse.events(parse.URL("http://www.yahoo.com/"), parse.Tidy()))
[('url', URL('http://de.yahoo.com/?p=us')),
('position', (3, None)),
('enterstarttag', u'html'),
('enterattr', u'lang'),
('text', u'de-DE'),
('leaveattr', u'lang'),
('enterattr', u'class'),
('text', u'y-fp-bg y-fp-pg-grad bkt708'),
('leaveattr', u'class'),
('leavestarttag', u'html')
...
def __init__(self, encoding=None, skipbad=False, loc=True):
============================================================
Create a new Tidy object. Parameters have the following meaning:
encoding (string or None)
The encoding of the input. If encoding is None it will be
automatically detected by the HTML parser.
skipbad (bool)
If skipbad is true, unknown elements (i.e. those not in
the ll.xist.ns.html namespace) will be skipped (i.e.
instead of the element its content will be output).
Unknown attributes will be skipped completely.
loc (bool)
If loc is true, "position" events will be generated else
they will be skipped.
def __repr__(self):
====================
def _handle_pos(self, node):
=============================
def _asxist(self, node):
=========================
def __call__(self, input):
===========================
def events(*pipeline):
=======================
Return an iterator over the events produced by the pipeline
objects in pipeline.
def tree(*pipeline, **kwargs):
===============================
Return a tree of XIST nodes from the event stream pipeline.
pipeline must output only events that contain XIST nodes, i.e. the
event types "xmldeclnode", "doctypenode", "commentnode",
"textnode", "startelementnode", "endelementnode", "procinstnode"
and "entitynode".
kwargs supports one keyword argument: validate. If validate is
true, the tree is validated, i.e. it is checked if the structure
of the tree is valid (according to the model attribute of each
element node), if all required attributes are specified and all
attributes have allowed values.
The node returned from tree will always be a Frag object.
Example:
>>> from ll.xist import xsc, parse
>>> from ll.xist.ns import xml, html, chars
>>> doc = parse.tree(
... parse.URL("http://www.python.org/"),
... parse.Expat(ns=True),
... parse.Node(pool=xsc.Pool(xml, html, chars))
... )
>>> doc[0]
def itertree(*pipeline, **kwargs):
===================================
Parse the event stream pipeline iteratively.
itertree still builds a tree, but it returns an iterator of (event
type, path) tuples that track changes to the tree as it is built.
path is a list containing the path from the root Frag object to
the node being worked on.
Which events and paths are produced depends on the keyword
arguments events and filter. events specifies which events you
want to see (possible event types are "xmldeclnode",
"doctypenode", "commentnode", "textnode", "startelementnode",
"endelementnode", "procinstnode" and "entitynode"). The default is
to only produce "endelementnode" events. (Note that for
"startelementnode" events, the attributes of the element have been
set, but the element is still empty). filter specifies an XIST
walk filter (see the ll.xist.xfind module for more info on walk
filters) to filter which paths are output. The default is to
output all paths.
Example:
>>> from ll.xist import xsc, parse
>>> from ll.xist.ns import xml, html, chars
>>> for (evtype, path) in parse.itertree(
... parse.URL("http://www.python.org/"),
... parse.Expat(ns=True),
... parse.Node(pool=xsc.Pool(xml, html, chars)),
... filter=html.a/html.img
... ):
... print path[-1].attrs.src, "-->", path[-2].attrs.href
http://www.python.org/images/python-logo.gif --> http://www.python.org/
http://www.python.org/images/trans.gif --> http://www.python.org/#left%2Dhand%2Dnavigation
http://www.python.org/images/trans.gif --> http://www.python.org/#content%2Dbody
http://www.python.org/images/donate.png --> http://www.python.org/psf/donations/
http://www.python.org/images/worldmap.jpg --> http://wiki.python.org/moin/Languages
http://www.python.org/images/success/tribon.jpg --> http://www.python.org/about/success/tribon/
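Since itertree is described as similar to ElementTree's iterparse
function, a standard library comparison may help. The sketch below
uses xml.etree.ElementTree.iterparse (not XIST) on a small inline
document invented for this illustration. Like the default of
itertree, it only reacts to element end events, at which point the
element and its content are complete.

```python
# Illustrative comparison using the standard library, not ll.xist.
import io
import xml.etree.ElementTree as ET

sample = b"<ul><li><a href='http://www.python.org/'>Python</a></li><li>plain</li></ul>"

hrefs = []
for (event, elem) in ET.iterparse(io.BytesIO(sample), events=("end",)):
    # "end" fires once the element has been parsed completely.
    if elem.tag == "a":
        hrefs.append(elem.get("href"))
```

The main difference in what the loop receives is that iterparse hands
over the element only, while itertree, as described above, produces
the whole path from the root to the node.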