This module contains everything you need to create XIST objects by parsing files, strings, URLs etc.
Parsing XML is done with a pipelined approach. The first step in the pipeline
is a source object that provides the input for the rest of the pipeline.
The next step is the XML parser. It turns the input source into an iterator over
parsing events (an "event stream"). Further steps in the pipeline might resolve
namespace prefixes (NS) and instantiate XIST classes
(Node). The final step in the pipeline either builds an
XML tree via tree or parses iteratively (similar to ElementTree's
iterparse function) via itertree.
For example, parsing a simple HTML string might look like this:
>>> from ll.xist import xsc, parse
>>> from ll.xist.ns import html
>>> source = "<a href='http://www.python.org/'>Python</a>"
>>> doc = parse.tree(
... parse.String(source),
... parse.Expat(),
... parse.NS(html),
... parse.Node(pool=xsc.Pool(html))
... )
>>> doc.bytes()
'<a href="http://www.python.org/">Python</a>'
A source object is an iterable object that produces the input byte string for the parser (possibly in multiple chunks), together with information about the URL of the input:
>>> from ll.xist import parse
>>> list(parse.String("<a href='http://www.python.org/'>Python</a>"))
[('url', URL('STRING')),
('bytes', "<a href='http://www.python.org/'>Python</a>")]
All subsequent objects in the pipeline are callable: each one takes the input iterator as an argument and returns an iterator over events. The following code shows an example of an event stream:
>>> from ll.xist import parse
>>> source = "<a href='http://www.python.org/'>Python</a>"
>>> list(parse.events(parse.String(source), parse.Expat()))
[('url', URL('STRING')),
('position', (0, 0)),
('enterstarttag', u'a'),
('enterattr', u'href'),
('text', u'http://www.python.org/'),
('leaveattr', u'href'),
('leavestarttag', u'a'),
('position', (0, 39)),
('text', u'Python'),
('endtag', u'a')]
An event is a tuple consisting of the event type and the event data. Different stages in the pipeline produce different event types. The following event types can be produced by source objects:
"url": The event data is the URL of the source. Usually such an event is produced only once at the start of the event stream. For sources that have no natural URL (like strings or streams) the URL can be specified when creating the source object.
"bytes": This event is produced by source objects (and Transcoder objects). The event data is a byte string.
"unicode": The event data is a unicode string. This event is produced by Decoder objects. Note that the only predefined pipeline objects that can handle "unicode" events are Encoder objects.
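As a rough illustration of the source protocol described above, the following sketch defines a hypothetical ChunkedString class that yields one "url" event followed by several "bytes" events. The class is not part of ll.xist.parse; its name, the chunksize parameter and the use of a plain string for the URL are assumptions made for this example.

```python
class ChunkedString:
    """Hypothetical source object: yields a "url" event, then "bytes" events."""

    def __init__(self, data, url="STRING", chunksize=16):
        self.data = data            # byte string to feed into the pipeline
        self.url = url              # plain string here; XIST uses a URL object
        self.chunksize = chunksize  # assumed parameter, only for this sketch

    def __iter__(self):
        yield ("url", self.url)
        for start in range(0, len(self.data), self.chunksize):
            yield ("bytes", self.data[start:start + self.chunksize])


events = list(ChunkedString(b"<a href='http://www.python.org/'>Python</a>"))
# Joining the "bytes" payloads gives back the original input
payload = b"".join(data for (evtype, data) in events if evtype == "bytes")
```

Because later pipeline stages only see the event tuples, they do not need to know whether the input arrived in one chunk or in many.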
The following types of events are produced by parsers (in addition to the
"url" event from above):
"position": The event data is a tuple containing the line and column number in the source (both starting with 0). All following events should use this position information until the next "position" event.
"xmldecl": The XML declaration. The event data is a dictionary containing the keys "version", "encoding" and "standalone". Parsers may omit this event.
"begindoctype": The beginning of the doctype. The event data is a dictionary containing the keys "name", "publicid" and "systemid". Parsers may omit this event.
"enddoctype": The end of the doctype. The event data is None. (If there is no internal subset, the "enddoctype" event immediately follows the "begindoctype" event.) Parsers may omit this event.
"comment": A comment. The event data is the content of the comment.
"text": Text data. The event data is the text content. Parsers should try to avoid outputting multiple text events in sequence.
"cdata": A CDATA section. The event data is the content of the CDATA section. Parsers may report CDATA sections as "text" events instead of "cdata" events.
"enterstarttag": The beginning of an element start tag. The event data is the element name.
"leavestarttag": The end of an element start tag. The event data is the element name. The parser will output events for the attributes between the "enterstarttag" and the "leavestarttag" event.
"enterattr": The beginning of an attribute. The event data is the attribute name.
"leaveattr": The end of an attribute. The event data is the attribute name. The parser will output events for the attribute value between the "enterattr" and the "leaveattr" event. (In almost all cases this is one text event.)
"endtag": An element end tag. The event data is the element name.
"procinst": A processing instruction. The event data is a tuple consisting of the processing instruction target and the data.
"entity": An entity reference. The event data is the entity name.
The following events are produced for elements and attributes in namespace mode
(instead of those without the ns suffix). They are produced by NS
objects or by Expat objects when ns is true (i.e. the expat
parser does the namespace resolution):
"enterstarttagns": The beginning of an element start tag in namespace mode. The event data is an (element name, namespace name) tuple.
"leavestarttagns": The end of an element start tag in namespace mode. The event data is an (element name, namespace name) tuple.
"enterattrns": The beginning of an attribute in namespace mode. The event data is an (attribute name, namespace name) tuple.
"leaveattrns": The end of an attribute in namespace mode. The event data is an (attribute name, namespace name) tuple.
"endtagns": An element end tag in namespace mode. The event data is an (element name, namespace name) tuple.
Once XIST nodes have been instantiated (by Node objects) the
following events are used:
"xmldeclnode": The XML declaration. The event data is an instance of ll.xist.xml.XML.
"doctypenode": The doctype. The event data is an instance of ll.xist.xsc.DocType.
"commentnode": A comment. The event data is an instance of ll.xist.xsc.Comment.
"textnode": Text data. The event data is an instance of ll.xist.xsc.Text.
"startelementnode": The beginning of an element. The event data is an instance of ll.xist.xsc.Element (or rather one of its subclasses). The attributes of the element object are set, but the element has no content.
"endelementnode": The end of an element. The event data is an instance of ll.xist.xsc.Element.
"procinstnode": A processing instruction. The event data is an instance of ll.xist.xsc.ProcInst.
"entitynode": An entity reference. The event data is an instance of ll.xist.xsc.Entity.
For consuming event streams there are three functions:
events: This generator simply outputs the events.
tree: This function builds an XML tree from the events and returns it.
itertree: This generator builds a tree like tree, but returns events during certain steps in the parsing process.
class UnknownEventError(TypeError):
This exception is raised when a pipeline object doesn't know how to handle an event.
def __init__(self, pipe, event):
def __str__(self):
class String(object):
Provides parser input from a string.
def __init__(self, data, url=None):
Create a String object. data must be a byte or
unicode string. url specifies the URL for the source (defaulting
to "STRING").
def __iter__(self):
Produces an event stream of one "url" event and one "bytes" or
"unicode" event for the data.
class Iter(object):
Provides parser input from an iterator over strings.
def __init__(self, iterable, url=None):
Create an Iter object. iterable must be an iterable object
producing byte or unicode strings. url specifies the URL for the
source (defaulting to "ITER").
def __iter__(self):
Produces an event stream of one "url" event followed by the
"bytes"/"unicode" events for the data from the iterable.
class Stream(object):
Provides parser input from a stream (i.e. an object that provides a
read method).
def __init__(self, stream, url=None, bufsize=8192):
Create a Stream object. stream must have a read
method (with a size argument). url specifies the URL for the
source (defaulting to "STREAM"). bufsize specifies the
chunk size for reads from the stream.
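The chunked reading described above can be sketched with a plain generator. This is a minimal illustration, not the actual Stream implementation: the function name stream_events is hypothetical, and the URL is a plain string rather than a URL object.

```python
import io


def stream_events(stream, url="STREAM", bufsize=8192):
    """Yield a "url" event, then one "bytes" event per chunk read from stream."""
    yield ("url", url)
    while True:
        data = stream.read(bufsize)
        if not data:  # an empty result signals the end of the stream
            break
        yield ("bytes", data)


# A tiny buffer size makes the chunking visible
chunks = list(stream_events(io.BytesIO(b"<a>Python</a>"), bufsize=4))
```

With bufsize=4 the 13 input bytes arrive as four "bytes" events; a realistic buffer size such as the default 8192 would deliver this input in a single event.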
def __iter__(self):
Produces an event stream of one "url" event followed by the
"bytes"/"unicode" events for the data from the stream.
class File(object):
Provides parser input from a file.
def __init__(self, filename, bufsize=8192):
Create a File object. filename is the name of the file
and may start with ~ or ~user for the home directory of the
current or the specified user. bufsize specifies the chunk size
for reads from the file.
def __iter__(self):
Produces an event stream of one "url" event followed by the
"bytes" events for the data from the file.
class URL(object):
Provides parser input from a URL.
def __init__(self, name, bufsize=8192, *args, **kwargs):
Create a URL object. name is the URL. bufsize
specifies the chunk size for reads from the URL. args and
kwargs will be passed on to the open method of the URL
object.
The URL for the input will be the final URL for the resource (i.e. it will include redirects).
def __iter__(self):
Produces an event stream of one "url" event followed by the
"bytes" events for the data from the URL.
class ETree(object):
Produces a (namespaced) event stream from an object that supports the ElementTree API.
def __init__(self, data, url=None, defaultxmlns=None):
Create an ETree object. Arguments have the following meaning:
data: An object that supports the ElementTree API.
url: The URL of the source. Defaults to "ETREE".
defaultxmlns: The namespace name (or a namespace module containing a namespace name) that will be used for all elements that don't have a namespace.
def _asxist(self, node):
def __iter__(self):
Produces an event stream of namespaced parsing events for the ElementTree
object passed as data to the constructor.
class Decoder(object):
Decode the byte strings produced by the previous object in the pipeline to unicode strings.
This input object can be a source object or any other pipeline object that produces byte strings.
def __init__(self, encoding=None):
Create a Decoder object. encoding is the encoding of the
input. If encoding is None it will be automatically detected
from the XML data.
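A decoding stage has to cope with multi-byte characters that are split across two "bytes" events. The following sketch shows one way to do that with an incremental decoder from the standard library. It is not the actual Decoder implementation: the function name decode_events is hypothetical, and the sketch assumes an explicitly given encoding instead of detecting it from the XML data.

```python
import codecs


def decode_events(input_events, encoding="utf-8"):
    """Turn "bytes" events into "unicode" events; pass other events through."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for (evtype, data) in input_events:
        if evtype == "bytes":
            text = decoder.decode(data)
            if text:  # skip chunks that only held part of a character
                yield ("unicode", text)
        else:
            yield (evtype, data)
    tail = decoder.decode(b"", final=True)  # raises on a truncated sequence
    if tail:
        yield ("unicode", tail)


# "ü" is two bytes in UTF-8; the chunks below split it in the middle
source = [("url", "STRING"), ("bytes", b"<a>f\xc3"), ("bytes", b"\xbcr</a>")]
decoded = list(decode_events(source))
```

The incremental decoder buffers the incomplete byte from the first chunk and emits the complete character with the second one.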
def __call__(self, input):
def __repr__(self):
class Encoder(object):
Encode the unicode strings produced by the previous object in the pipeline to byte strings.
This input object must be a pipeline object that produces unicode output
(e.g. a Decoder object).
def __init__(self, encoding=None):
Create an Encoder object. encoding will be the encoding of
the output. If encoding is None it will be automatically
detected from the XML declaration in the data.
def __call__(self, input):
def __repr__(self):
class Transcoder(object):
Transcode the byte strings of the input object into another encoding.
This input object can be a source object or any other pipeline object that produces byte strings.
def __init__(self, fromencoding=None, toencoding=None):
Create a Transcoder object. fromencoding is the encoding
of the input. toencoding is the encoding of the output. If either of
them is None, that encoding will be detected from the data.
def __call__(self, input):
def __repr__(self):
class Parser(object):
Basic parser interface.
class Expat(Parser):
A parser using Python's built-in expat parser.
def __init__(self, encoding=None, xmldecl=False, doctype=False, loc=True, cdata=False, ns=False):
Create an Expat parser. Arguments have the following meaning:
encoding (string or None): Forces the parser to use the specified encoding. The default None results in the encoding being detected from the XML itself.
xmldecl (bool): Should the parser produce events for the XML declaration?
doctype (bool): Should the parser produce events for the document type?
loc (bool): Should the parser produce "position" events?
cdata (bool): Should the parser output CDATA sections as "cdata" events? (If cdata is false, output "text" events instead.)
ns (bool): If ns is true, the parser does its own namespace processing, i.e. it will emit "enterstarttagns", "leavestarttagns", "endtagns", "enterattrns" and "leaveattrns" events instead of "enterstarttag", "leavestarttag", "endtag", "enterattr" and "leaveattr" events.
def __repr__(self):
def __call__(self, input):
Return an iterator over the events produced by input.
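To give a feeling for how expat callbacks map onto the event types listed earlier, here is a simplified sketch built directly on the standard library's xml.parsers.expat module. It is not the Expat class from this module: the function name simple_events is hypothetical, the input is a single byte string rather than an event stream, and "url" and "position" events are left out.

```python
import xml.parsers.expat


def simple_events(data):
    """Return a list of simplified parsing events for the byte string data."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True         # merge adjacent character data into one callback
    parser.ordered_attributes = True  # attributes arrive as [name, value, name, value, ...]

    def start(name, attrs):
        events.append(("enterstarttag", name))
        for i in range(0, len(attrs), 2):
            events.append(("enterattr", attrs[i]))
            events.append(("text", attrs[i + 1]))
            events.append(("leaveattr", attrs[i]))
        events.append(("leavestarttag", name))

    def end(name):
        events.append(("endtag", name))

    def text(data):
        events.append(("text", data))

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(data, True)  # True marks this as the final chunk
    return events


result = simple_events(b"<a href='http://www.python.org/'>Python</a>")
```

Setting buffer_text follows the recommendation above that parsers should avoid emitting several "text" events in a row.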
def _event(self, evtype, evdata):
def _flush(self, force):
def _getname(self, name):
def _handle_startcdata(self):
def _handle_endcdata(self):
def _handle_xmldecl(self, version, encoding, standalone):
def _handle_begindoctype(self, doctypename, systemid, publicid, has_internal_subset):
def _handle_enddoctype(self):
def _handle_default(self, data):
def _handle_comment(self, data):
def _handle_text(self, data):
def _handle_startelement(self, name, attrs):
def _handle_endelement(self, name):
def _handle_procinst(self, target, data):
class SGMLOP(Parser):
A parser based on sgmlop.
def __init__(self, encoding=None, cdata=False):
Create an SGMLOP parser. Arguments have the following meaning:
encoding (string or None): Forces the parser to use the specified encoding. The default None results in the encoding being detected from the XML itself.
cdata (bool): Should the parser output CDATA sections as "cdata" events? (If cdata is false, output "text" events instead.)
def __repr__(self):
def __call__(self, input):
Return an iterator over the events produced by input.
def _event(self, evtype, evdata):
def _flush(self, force):
def handle_comment(self, data):
def handle_data(self, data):
def handle_cdata(self, data):
def handle_proc(self, target, data):
def handle_entityref(self, name):
def handle_enterstarttag(self, name):
def handle_leavestarttag(self, name):
def handle_enterattr(self, name):
def handle_leaveattr(self, name):
def handle_endtag(self, name):
class NS(object):
An NS object is used in a parsing pipeline to add support for XML
namespaces. It replaces the "enterstarttag", "leavestarttag",
"endtag", "enterattr" and "leaveattr" events with the appropriate
namespace version of the events (i.e. "enterstarttagns" etc.) where the
event data is a (name, namespace) tuple.
The output of an NS object in the stream looks like this:
>>> from ll.xist import parse
>>> from ll.xist.ns import html
>>> list(parse.events(
... parse.String("<a href='http://www.python.org/'>Python</a>"),
... parse.Expat(),
... parse.NS(html)
... ))
[('url', URL('STRING')),
('position', (0, 0)),
('enterstarttagns', (u'a', 'http://www.w3.org/1999/xhtml')),
('enterattrns', (u'href', None)),
('text', u'http://www.python.org/'),
('leaveattrns', (u'href', None)),
('leavestarttagns', (u'a', 'http://www.w3.org/1999/xhtml')),
('position', (0, 39)),
('text', u'Python'),
('endtagns', (u'a', 'http://www.w3.org/1999/xhtml'))]
def __init__(self, prefixes=None, **kwargs):
Create an NS object. prefixes (if not None) can be a
namespace name (or module), which will be used for the empty prefix,
or a dictionary that maps prefixes to namespace names (or modules).
kwargs maps prefixes to namespace names too. If a prefix is in both
prefixes and kwargs, kwargs wins.
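The precedence rule for prefixes and kwargs, and the way a qualified name is turned into a (name, namespace) tuple, can be sketched as follows. This is only an illustration under several assumptions: merge_prefixes and resolve are hypothetical helpers, the empty prefix is stored under the key None, namespace modules are not supported, and xmlns declarations inside the document are ignored.

```python
def merge_prefixes(prefixes=None, **kwargs):
    """Build a prefix -> namespace name mapping; keyword arguments win."""
    mapping = {}
    if isinstance(prefixes, dict):
        mapping.update(prefixes)
    elif prefixes is not None:
        mapping[None] = prefixes  # a single namespace name is used for the empty prefix
    mapping.update(kwargs)
    return mapping


def resolve(name, mapping, isattr=False):
    """Split a qualified name and look up the namespace of its prefix."""
    (prefix, sep, local) = name.rpartition(":")
    if not sep:
        # Unprefixed attributes have no namespace; elements use the empty prefix
        return (name, None if isattr else mapping.get(None))
    return (local, mapping[prefix])  # KeyError for an undeclared prefix


xhtml = merge_prefixes("http://www.w3.org/1999/xhtml")
mapping = merge_prefixes({"svg": "http://example.org/old"}, svg="http://www.w3.org/2000/svg")
```

With the xhtml mapping, resolve("a", xhtml) gives the element the XHTML namespace, while resolve("href", xhtml, isattr=True) leaves the namespace as None, matching the event data in the example output above.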
def __call__(self, input):
def url(self, data):
def xmldecl(self, data):
def begindoctype(self, data):
def enddoctype(self, data):
def comment(self, data):
def text(self, data):
def cdata(self, data):
def procinst(self, data):
def entity(self, data):
def position(self, data):
def enterstarttag(self, data):
def enterattr(self, data):
def leaveattr(self, data):
def leavestarttag(self, data):
def endtag(self, data):
class Node(object):
A Node object is used in a parsing pipeline to instantiate XIST
nodes. It consumes a namespaced event stream:
>>> from ll.xist import xsc, parse
>>> from ll.xist.ns import html
>>> list(parse.events(
... parse.String("<a href='http://www.python.org/'>Python</a>"),
... parse.Expat(),
... parse.NS(html),
... parse.Node(pool=xsc.Pool(html))
... ))
[(u'startelementnode',
<ll.xist.ns.html.a element object (no children/1 attr) (from STRING:0:0) at 0x1026e6a10>),
(u'textnode',
<ll.xist.xsc.Text content=u'Python' (from STRING:0:39) at 0x102566b48>),
(u'endelementnode',
<ll.xist.ns.html.a element object (no children/1 attr) (from STRING:0:0) at 0x1026e6a10>)]
The event data of every event is an XIST node. The element node from the
"startelementnode" event already has all attributes set. There will be
no events for attributes.
def __init__(self, pool=None, base=None, loc=True):
property base:
def __get__(self):
def __call__(self, input):
def url(self, data):
def xmldecl(self, data):
def begindoctype(self, data):
def enddoctype(self, data):
def entity(self, data):
def comment(self, data):
def cdata(self, data):
def text(self, data):
def enterstarttagns(self, data):
def enterattrns(self, data):
def leaveattrns(self, data):
def leavestarttagns(self, data):
def endtagns(self, data):
def procinst(self, data):
def position(self, data):
class Tidy(object):
A Tidy object parses (potentially ill-formed) HTML from a source
into an (unnamespaced) event stream by using libxml2's HTML parser:
>>> from ll.xist import parse
>>> list(parse.events(parse.URL("http://www.yahoo.com/"), parse.Tidy()))
[('url', URL('http://de.yahoo.com/?p=us')),
('position', (3, None)),
('enterstarttag', u'html'),
('enterattr', u'lang'),
('text', u'de-DE'),
('leaveattr', u'lang'),
('enterattr', u'class'),
('text', u'y-fp-bg y-fp-pg-grad bkt708'),
('leaveattr', u'class'),
('leavestarttag', u'html')
...
def __init__(self, encoding=None, loc=True):
def __repr__(self):
def _handle_pos(self, node):
def _asxist(self, node):
def __call__(self, input):
def events(*pipeline):
Return an iterator over the events produced by the pipeline objects in
pipeline.
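Conceptually, chaining the pipeline objects amounts to passing the iterator of each stage to the next one. The sketch below shows that idea in isolation; run_pipeline and uppercase_text are hypothetical names, and the real events function may differ in its details.

```python
def run_pipeline(*pipeline):
    """Chain a source (first item) through the remaining callable stages."""
    stream = iter(pipeline[0])
    for stage in pipeline[1:]:
        stream = stage(stream)  # each stage wraps the iterator of the previous one
    return stream


def uppercase_text(input_events):
    """Hypothetical stage: upper-case the data of "text" events."""
    for (evtype, data) in input_events:
        yield (evtype, data.upper() if evtype == "text" else data)


source = [("url", "STRING"), ("text", "Python")]
output = list(run_pipeline(source, uppercase_text))
```

Because every stage is a generator, no event is produced until the consumer asks for it, so large inputs can be processed without holding the whole event stream in memory.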
def tree(*pipeline, **kwargs):
Return a tree of XIST nodes from the event stream pipeline.
pipeline must output only events that contain XIST nodes, i.e. the
event types "xmldeclnode", "doctypenode", "commentnode",
"textnode", "startelementnode", "endelementnode",
"procinstnode" and "entitynode".
kwargs supports one keyword argument: validate.
If validate is true, the tree is validated, i.e. it is checked if
the structure of the tree is valid (according to the model attribute
of each element node), if all required attributes are specified and all
attributes have allowed values.
The node returned from tree will always be a Frag object.
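Building a tree from node events is commonly done with a stack of open containers. The following sketch shows that technique with stand-in objects; Elem and build_tree are hypothetical names, a plain list plays the role of the Frag object, and validation is not covered.

```python
class Elem:
    """Hypothetical stand-in for an XIST element node."""

    def __init__(self, name):
        self.name = name
        self.content = []


def build_tree(events):
    """Collect node events into a list that plays the role of the root Frag."""
    root = []
    stack = [root]  # the innermost open container is at the end
    for (evtype, node) in events:
        if evtype == "startelementnode":
            stack[-1].append(node)
            stack.append(node.content)  # following nodes become children
        elif evtype == "endelementnode":
            stack.pop()
        else:
            stack[-1].append(node)  # text, comments etc. are leaf nodes
    return root


a = Elem("a")
tree = build_tree([
    ("startelementnode", a),
    ("textnode", "Python"),
    ("endelementnode", a),
])
```

After the call, tree contains the single element a, and a.content holds the text node.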
Example:
>>> from ll.xist import xsc, parse
>>> from ll.xist.ns import xml, html, chars
>>> doc = parse.tree(
... parse.URL("http://www.python.org/"),
... parse.Expat(ns=True),
... parse.Node(pool=xsc.Pool(xml, html, chars))
... )
>>> doc[0]
<ll.xist.ns.html.html element object (5 children/2 attrs) (from http://www.python.org/:3:0) at 0x1028eb3d0>
def itertree(*pipeline, **kwargs):
Parse the event stream pipeline iteratively.
itertree still builds a tree, but it returns an iterator of
(event type, path) tuples that track changes to the tree as it is built.
path is a list containing the path from the root Frag object to the
node being worked on.
Which events and paths are produced depends on the keyword arguments
events and filter. events specifies which events you
want to see (possible event types are "xmldeclnode", "doctypenode",
"commentnode", "textnode", "startelementnode",
"endelementnode", "procinstnode" and "entitynode"). The default
is to only produce "endelementnode" events. (Note that for
"startelementnode" events, the attributes of the element have been set,
but the element is still empty). filter specifies an XIST walk filter
(see the ll.xist.xfind module for more info on walk filters) to filter
which paths are output. The default is to output all paths.
Example:
>>> from ll.xist import xsc, parse
>>> from ll.xist.ns import xml, html, chars
>>> for (evtype, path) in parse.itertree(
... parse.URL("http://www.python.org/"),
... parse.Expat(ns=True),
... parse.Node(pool=xsc.Pool(xml, html, chars)),
... filter=html.a/html.img
... ):
... print path[-1].attrs.src, "-->", path[-2].attrs.href
http://www.python.org/images/python-logo.gif --> http://www.python.org/
http://www.python.org/images/trans.gif --> http://www.python.org/#left%2Dhand%2Dnavigation
http://www.python.org/images/trans.gif --> http://www.python.org/#content%2Dbody
http://www.python.org/images/donate.png --> http://www.python.org/psf/donations/
http://www.python.org/images/worldmap.jpg --> http://wiki.python.org/moin/Languages
http://www.python.org/images/success/tribon.jpg --> http://www.python.org/about/success/tribon/
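For comparison, a similar link/image listing can be sketched with ElementTree's iterparse function from the standard library, which itertree resembles. The input document and the file name logo.gif are made up for this example. Unlike itertree, iterparse yields the element itself rather than a path, so the parent link is found here by waiting for the end of each a element.

```python
import io
import xml.etree.ElementTree as ET

data = b"<p><a href='http://www.python.org/'><img src='logo.gif'/></a></p>"

pairs = []
# "end" events fire once an element is complete, like "endelementnode"
for (event, elem) in ET.iterparse(io.BytesIO(data), events=("end",)):
    if elem.tag == "a":
        for img in elem.findall("img"):
            pairs.append((img.get("src"), elem.get("href")))

for (src, href) in pairs:
    print(src, "-->", href)
```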