All I'm doing boils down to:
response = rf.nextResponse()
dom = parseString(response)
in a loop. Am I doing something wrong? Is there a faster way when all I
need is a traversable tree structure as the result?
For our requirements we can't use anything under 10 accts/sec without
impacting our SLAs...
-- bjorn
> All I'm doing boils down to:
>
> response = rf.nextResponse()
> dom = parseString(response)
>
> in a loop. Am I doing something wrong?
You have to give more details. What Python version? PyXML or stock
Python? One traditional cause of slowness is that people have
unknowingly used PyXML xmlproc, which is a pure-Python parser, instead
of Expat.
PyXML 0.8.x has a number of speed improvements for minidom-with-expat
(such as eliminating the SAX driver), and memory usage improvements
(such as interning element and attribute names).
> Is there a faster way when all I need is a traversable tree
> structure as the result?
"All I need" reads quite funny in this context, as producing a
traversable tree is one of the more expensive ways for XML
processing. There are certainly faster ways if you *don't* need a
traversable tree.
Regards,
Martin
> All I'm doing boils down to:
>
> response = rf.nextResponse()
> dom = parseString(response)
>
> in a loop. Am I doing something wrong? Is there a faster way when all I
> need is a traversable tree structure as the result?
as a general rule, XML toolkits that try to implement the DOM specification
in pure Python are incredibly slow and bloated.
on random XML data, minidom can easily gobble up a kilobyte or two for
each element. in one of my benchmarks, it used about 50 bytes of object
memory for each input character:
http://online.effbot.org/2002_12_01_archive.htm#dom-bloat
creating all those objects takes time...
toolkits that use a more pythonic api also tend to be more efficient; for
example, the pure python version of my elementtree module is typically
3-5 times faster than minidom, and uses less than half the memory:
http://effbot.org/zone/element-index.htm
you may be able to reach 10x with SAX-style custom code using pyexpat
(or sgmlop) directly...
http://www.python.org/doc/current/lib/module-xml.parsers.expat.html
http://effbot.org/zone/sgmlop-index.htm
...but to be on the safe side, I'd go for a C parser/tree builder. the following
two are about as fast as anything can be:
http://xmlsoft.org/python.html
http://www.reportlab.com/xml/pyrxp.html
(unfortunately, the C version of elementtree isn't yet ready for public
consumption...)
</F>
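
For illustration, a minimal sketch of the elementtree approach mentioned
above. It assumes a current Python, where the module ships in the standard
library as xml.etree.ElementTree (it was a separate download at the time of
this thread). The SAMPLE document and the summarize() helper are made up
for the example and are not from the original posts.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one response document.
SAMPLE = """<account id="42">
  <owner>bjorn</owner>
  <balance currency="USD">10.50</balance>
</account>"""


def summarize(xml_text):
    """Parse one document; return (tag, attributes, text) per element."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() visits the root and all its descendants in document order.
    for elem in root.iter():
        text = (elem.text or "").strip()
        rows.append((elem.tag, dict(elem.attrib), text))
    return rows


if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

The tree is still fully traversable, but each element is a single
lightweight object rather than a full DOM node.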
I would also, perhaps, recommend my [gnosis.xml.objectify] module. It
creates "Pythonic" objects based on XML documents. If you use it, be
sure to use the EXPAT parser, rather than going through DOM (which is
the default, but it only takes one argument to use EXPAT).
I don't have a precise timing, but certainly much faster than DOM. My
goal with the module was not really speed, but rather obtaining a more
natural Python interface. Nonetheless, it goes pretty fast.
Take a look at <http://gnosis.cx/download/Gnosis_Utils-current.tar.gz>.
Yours, David...
--
mertz@ | The specter of free information is haunting the `Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
.cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------
Python 2.2.1 without PyXML. The full code looks like:
import sys
import time

def test():
    from xml.dom.minidom import parseString
    # ResponseFile is the poster's own class (not shown in this thread).
    rf = ResponseFile('c:/data/Testoutput.xml')
    count = 0
    start = time.time()
    try:
        while 1:
            # nextResponse() returns a complete xml
            # document as a string (throws at eof).
            response = rf.nextResponse()
            dom = parseString(response)
            count += 1
            sys.stdout.write('.')
    except:
        # Ends the loop at eof; note that a bare except also
        # hides parse errors.
        pass
    stop = time.time()
    return count, stop-start
If I'm reading the minidom/pulldom files correctly this should use Expat(?)
> PyXML 0.8.x has a number of speed improvements for
> minidom-with-expat (such as eliminating the SAX driver), and
> memory usage improvements (such as interning element and
> attribute names).
As a test, I tried building my own tree directly from the Expat events. This was about 4 times faster (2.89 accts/sec), but still far from fast enough... I'm starting to think a custom C++ parser might be the way to go (and here I was having such a nice day <sigh>).
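
The poster's own tree builder is not shown, so the following is only a
rough sketch of what "building a tree directly from the Expat events" can
look like. It assumes a current Python; build_tree() and the
[tag, attrs, children, text] node layout are invented for illustration.

```python
import xml.parsers.expat


def build_tree(xml_text):
    """Build nested [tag, attrs, children, text] lists from Expat events."""
    roots = []   # receives the single root node
    stack = []   # currently open elements, innermost last

    def start(tag, attrs):
        # Text is collected as a list of chunks and joined on close.
        node = [tag, attrs, [], []]
        if stack:
            stack[-1][2].append(node)
        else:
            roots.append(node)
        stack.append(node)

    def end(tag):
        node = stack.pop()
        node[3] = "".join(node[3])

    def chars(data):
        # Expat may deliver one text run in several chunks.
        if stack:
            stack[-1][3].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)   # True: this is the final chunk
    return roots[0]


if __name__ == "__main__":
    print(build_tree('<a x="1"><b>hi</b><c/></a>'))
```

Plain lists keep the per-node overhead low, but every handler is still a
Python call made from C for each event.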
> > Is there a faster way when all I need is a traversable tree
> > structure as the result?
>
> "All I need" reads quite funny in this context, as producing
> a traversable tree is one of the more expensive ways for XML
> processing. There are certainly faster ways if you *don't*
> need a traversable tree.
:-) Unfortunately they're not my requirements. (They go something like: "we will eventually need all the data, so put them in a form that the next step can traverse to put into a DB".) If you think a different approach is better I'm all ears :-)
Thanks for the interest.
-- bjorn
[lots of good references...]
Thanks Fredrik! I'll look into the various options and post any
interesting findings :-)
-- bjorn
> If I'm reading the minidom/pulldom files correctly this should use
> Expat(?)
Yes, that is the only possible interpretation if no other parsers are
available.
> As a test, I tried building my own tree directly from the Expat
> events. This was about 4 times faster (2.89 accts/sec), but still
> far from fast enough... I'm starting to think a custom C++ parser
> might be the way to go (and here I was having such a nice day
> <sigh>).
I see. Then I would suggest that the mere parsing speed is not the
issue - this uses roughly all tricks we can think of. It still would
be interesting to find out where the computation time is spent. If
these are complicated documents (i.e. many elements and attributes,
short PCDATA), then surely memory allocation is an issue - you could
try Python 2.3a1 also, as a test (pymalloc should give some
improvements when there are many memory allocations).
I doubt that a custom parser can do much better, unless it allows you
to drop data you are not interested in.
What *has* been demonstrated to be a speed-up over minidom is to use
4Suite's cDomlette. It is faster, because:
- it allocates fewer objects: many things are stored in the elements
themselves, instead of in dictionaries, as Python classic classes
do.
- object creation is through C, with no need to look up Python methods
over and over again.
When completed, it still gives you a Python-conforming DOM tree. That
DOM tree misses some of the DOM functionality, though, that's why they
call it a Domlette.
> :-) Unfortunately they're not my requirements. (They go something
> like: "we will eventually need all the data, so put them in a form
> that the next step can traverse to put into a DB".) If you think a
> different approach is better I'm all ears :-)
The stream-processing approaches are *much* faster, in all
languages. They don't create intermediate objects, but present you
with just the strings that the parser had to extract from the
document, anyway.
In order of increasing speed, decreasing standards conformance:
- SAX: depending on how you design the content handler, you can be
much faster than a DOM builder already. As a test, you might want to
plug in an empty ContentHandler, and see how many documents you
can parse without processing in a certain time.
- Expat raw interface: parsing is XML-conforming, but the API of
Expat is proprietary. This saves indirections, and is again faster.
You can apply the same benchmark with little effort.
- PyXML's (or /F's) sgmlop: to my knowledge, the fastest for-Python XML
parser, but it misses a number of XML features (e.g. it won't
do entity expansion).
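
The "empty handler" benchmark suggested for the first two options could
be sketched roughly as follows, assuming a current Python. DOC is a made-up
document, and time_sax() and time_expat() are hypothetical helper names;
real response documents should be substituted before drawing conclusions.

```python
import time
import xml.parsers.expat
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical sample document (bytes, UTF-8).
DOC = b"<accounts>" + b"<acct id='1'><name>x</name></acct>" * 100 + b"</accounts>"


def time_sax(doc, repeat):
    """Seconds to parse doc `repeat` times with a do-nothing SAX handler."""
    handler = ContentHandler()   # base class: every callback is a no-op
    start = time.time()
    for _ in range(repeat):
        xml.sax.parseString(doc, handler)
    return time.time() - start


def time_expat(doc, repeat):
    """Seconds to parse doc `repeat` times with raw Expat, no handlers set."""
    start = time.time()
    for _ in range(repeat):
        # An Expat parser object can only parse one document.
        parser = xml.parsers.expat.ParserCreate()
        parser.Parse(doc, True)
    return time.time() - start


if __name__ == "__main__":
    n = 200
    print("SAX:   %.3f s for %d documents" % (time_sax(DOC, n), n))
    print("Expat: %.3f s for %d documents" % (time_expat(DOC, n), n))
```

Both figures are an upper bound on throughput: any real handler work only
adds to them.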
In any case, please report what your findings are and what technology
you eventually use.
Regards,
Martin
PyExpat is pretty darn fast. I would be quite surprised if you could do
several times better with a custom C++ parser. You might take a look at
RXP/PyRXP, which brags of its speed. But even then, it claims a few
percent better than expat, not several times (albeit, with validation
added in).
Yours, Lulu...
--
Keeping medicines from the bloodstreams of the sick; food from the bellies
of the hungry; books from the hands of the uneducated; technology from the
underdeveloped; and putting advocates of freedom in prisons. Intellectual
property is to the 21st century what the slave trade was to the 16th.
moving the tree building from Python to C++ can easily buy you 10x,
no matter what parser you're using.
(if you think otherwise, you're seriously underestimating the cost of a
C++ -> Python call...)
</F>
> moving the tree building from Python to C++ can easily buy you 10x,
> no matter what parser you're using.
Indeed; this is what makes 4Suite's cDomlette so fast.
Of course, if you use a C++ parser (say, Xerces), and then create
wrapper objects around the tree that Xerces produces, you lose some of
the advantages by creating more objects than necessary.
Regards,
Martin
I can definitely agree with this. While the XML work I've been doing
has also involved DOM operations (parsing isn't interesting on its
own, after all), I've seen the performance of different packages in
order of decreasing speed to be: cDomlette, minidom, 4DOM. However,
I'm deliberately not using the extra functionality of 4DOM.
[...]
> When completed, it still gives you a Python-conforming DOM tree. That
> DOM tree misses some of the DOM functionality, though, that's why they
> call it a Domlette.
Still, cDomlette does have things like importNode which are missing
from minidom (or at least were missing from PyXML's minidom a couple
of releases ago).
Paul
And cDomlette produces document representations that can be
manipulated using standards-compliant APIs, too, for those of us that
care about that kind of thing. Hats off to Fourthought! :-)
Paul
Quick question... I've got benchmark results for most of the options, but I haven't been able to track down cDomlette yet (google didn't turn up any obvious links). Does anyone know where I can download it (and the documentation :-) I'll post my benchmarks in a couple of days...
-- bjorn
> Quick question... I've got benchmark results for most of the
> options, but I haven't been able to track down cDomlette yet (google
> didn't turn up any obvious links). Does anyone know where I can
> download it (and the documentation :-) I'll post my benchmarks in a
> couple of days...
It's part of 4Suite, www.4Suite.org. If you download the 4Suite beta
(which you should), it is in Ft.Lib.cDomlette. You can find usage
examples in
http://uche.ogbuji.net:8080/uche.ogbuji.net/tech/akara/pyxml/domlettes
Regards,
Martin
What Martin said.
I just want to point out another friendly introduction.
http://www.xml.com/pub/a/2002/10/16/py-xml.html
And if you use Python 2.2 you may also want to see my tips on using
generators with DOM, which often give a speed advantage.
http://www.xml.com/pub/a/2003/01/08/py-xml.html
--Uche
http://uche.ogbuji.net
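
As a rough idea of what using generators with DOM can look like (this is
not the code from the article linked above, just a minimal sketch with a
made-up iter_elements() helper, written for a current Python and minidom):

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def iter_elements(node):
    """Lazily yield every element at or below `node`, in document order."""
    if node.nodeType == Node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        # Recurse without building an intermediate list of nodes.
        yield from iter_elements(child)


if __name__ == "__main__":
    doc = parseString("<a><b/><c><d/></c></a>")
    for elem in iter_elements(doc):
        print(elem.tagName)
```

Because the walk is lazy, a caller that stops early never visits the rest
of the tree.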