Parsing XML streams

Peter Scott

Sep 11, 2003, 7:30:18 PM

I have a program that listens on an IRC channel and logs everything to
XML on standard output. The format of the XML is pretty
straightforward, looking like this:

<channel name='#sandbox'>
<message user='PeterScott'>Hello, my bot</message>
<message user='PeterScott'>This is a message</message>
<nickchange>
<oldnick>PeterScott</oldnick>
<newnick>PeterSc</newnick>
</nickchange>
</channel>

I'm writing another program that should parse that sort of XML on its
stdin, printing out a more user-friendly representation. For this, I
need to parse the XML as it comes in, not all at once.

I wrote a parser using xml.sax, and it works well---provided that I
read in the whole document. However, I want to be able to just read
the XML piece by piece, calling event handlers whenever something
happens and waiting for more to happen.

Is there some way to do this with the standard python xml parsers?
Will I need to use PyXML? Or what?

Thanks,
-Peter

Jeremy Bowers

Sep 11, 2003, 10:24:27 PM

On Thu, 11 Sep 2003 16:30:18 -0700, Peter Scott wrote:

> Is there some way to do this with the standard python xml parsers?
> Will I need to use PyXML? Or what?

xml.parsers.expat can parse things in pieces. It shouldn't be *too* much
work to convert over.
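
A minimal sketch of what that could look like, assuming handlers that just record what they see (the `events` list and the handler function names are made up for illustration). The second argument to `Parse()` tells expat whether the chunk is the last one:

```python
import xml.parsers.expat

events = []  # illustrative only: collects callbacks so they can be inspected

def start_element(name, attrs):
    events.append(("start", name, dict(attrs)))

def end_element(name):
    events.append(("end", name))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element

# Chunks may split tags and attribute values anywhere.
chunks = ["<channel name='#sand", "box'><message user='Pe", "terScott'>Hi</message>"]
for chunk in chunks:
    parser.Parse(chunk, False)   # False: more data will follow

parser.Parse("</channel>", True)  # True: this is the final chunk
```

The handlers are called from inside each `Parse()` call as soon as a complete token is available, so there is no need to wait for the end of the document.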

Alan Kennedy

Sep 12, 2003, 5:58:49 AM

Peter Scott wrote:
> I'm writing another program that should parse that sort of XML on its
> stdin, printing out a more user-friendly representation. For this, I
> need to parse the XML as it comes in, not all at once.

Peter,

Check out the IncrementalParser class in the library module

Lib/xml/sax/xmlreader.py

This extension of the standard XMLReader class acts just like a SAX
parser, in that it delivers SAX2 events to your ContentHandler as it
processes the tokens from the source XML document.

But rather than the parser itself controlling when and how it gets its
input, you control that through the use of the .feed() method. So you
can "drip feed" the parser with input if you wish.

Not all XML parsers support an IncrementalParser interface. In order
for an XML parser to support incremental parsing, it must have been
coded specifically to do so. Fortunately, the expat wrapper supplied
with the base distribution of python does support incremental parsing.
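
One way to check this at run time, as a rough sketch: ask whether the parser you were handed is an instance of the IncrementalParser interface. This assumes the default parser returned by make_parser() is the bundled expat driver; the variable name `supports_feed` is just for illustration.

```python
import xml.sax
from xml.sax.xmlreader import IncrementalParser

# make_parser() with no arguments returns the default driver
# (assumed here to be the expat wrapper).
parser = xml.sax.make_parser()

# True if the parser implements feed()/close()/reset().
supports_feed = isinstance(parser, IncrementalParser)
print(supports_feed)
```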

Which I think should solve your problem quite nicely. When you start
up your process for the first time, feed() the IncrementalParser a
document element (every XML document must have exactly one document
element). Then simply feed the output of your logging stream directly
to the IncrementalParser, as and when you receive it.

You should not have any problems with XML tokens being split over two
different .feed() calls either. For example, this should work just
fine:

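import xml.sax

# IncrementalParser itself is only an interface (its feed() is not
# implemented), so get a concrete parser from make_parser() instead.
ip = xml.sax.make_parser()
ip.feed('<docu')
ip.feed('ment')
ip.feed('/>')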

When your logging stream is closing, simply feed a close tag for your
document element to your IncrementalParser, and everything will clean
up nicely.

Here is some sample code:

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.handler import ContentHandler

logentry = """


<channel name='#sandbox'>
<message user='PeterScott'>Hello, my bot</message>
<message user='PeterScott'>This is a message</message>
<nickchange>
<oldnick>PeterScott</oldnick>
<newnick>PeterSc</newnick>
</nickchange>
</channel>
"""

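# make_parser() expects a list of parser module names, not a bare string
incr_parser = xml.sax.make_parser(['xml.sax.expatreader'])
incr_parser.setContentHandler(ContentHandler())
incr_parser.feed('<mylogstream>')
incr_parser.feed(logentry)
incr_parser.feed('</mylogstream>')
# close() tells the parser that no more input is coming
incr_parser.close()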
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
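
For the "user-friendly representation" part, here is a rough sketch of a handler that does something with the events instead of ignoring them. The class name `LogPrinter`, the function `pump`, and the "user: text" output format are all invented for illustration, and it only handles message elements:

```python
import sys
import xml.sax
from xml.sax.handler import ContentHandler

class LogPrinter(ContentHandler):
    """Hypothetical handler: writes each <message> as 'user: text'."""

    def __init__(self, out=sys.stdout):
        super().__init__()
        self.out = out
        self.user = None   # set while inside a <message> element
        self.text = []

    def startElement(self, name, attrs):
        if name == "message":
            self.user = attrs.get("user", "?")
            self.text = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect it.
        if self.user is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "message":
            self.out.write("%s: %s\n" % (self.user, "".join(self.text)))
            self.user = None

def pump(stream, handler):
    """Feed a text stream to an incremental SAX parser line by line."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for line in stream:
        parser.feed(line)   # handlers fire as soon as tokens complete
    parser.close()          # raises if the document is incomplete
```

Calling pump(sys.stdin, LogPrinter()) would then print each message as it arrives on standard input. Note that close() raises a parse error if the stream ends before the document element is closed, so a real program would probably want to catch that.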

regards,

--
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/mailto/alan
