Why is xml.dom.minidom so slow?

Bjorn Pettersen

Jan 2, 2003, 3:36:31 PM

Background: I've got 750 accounts in xml format in a 50 Mb file. I can
extract the string for an individual account at a rate of >290
accts/sec. When I add a call to parseString() afterwards however, the
speed drops to 0.75 accts/sec.

All I'm doing boils down to:

response = rf.nextResponse()
dom = parseString(response)

in a loop. Am I doing something wrong? Is there a faster way when all I
need is a traversable tree structure as the result?

For our requirements we can't use anything under 10 accts/sec without
impacting our SLAs...

-- bjorn

Martin v. Löwis

Jan 2, 2003, 4:28:05 PM

"Bjorn Pettersen" <BPett...@NAREX.com> writes:

> All I'm doing boils down to:
>
> response = rf.nextResponse()
> dom = parseString(response)
>
> in a loop. Am I doing something wrong?

You have to give more details. What Python version? PyXML or stock
Python? One traditional reason is that people, not knowingly, have
used PyXML xmlproc, which is a pure-Python parser, instead of Expat.

PyXML 0.8.x has a number of speed improvements for minidom-with-expat
(such as eliminating the SAX driver), and memory usage improvements
(such as interning element and attribute names).

> Is there a faster way when all I need is a traversable tree
> structure as the result?

"All I need" reads quite funny in this context, as producing a
traversable tree is one of the more expensive ways for XML
processing. There are certainly faster ways if you *don't* need a
traversable tree.

Regards,
Martin

Fredrik Lundh

Jan 2, 2003, 4:50:13 PM

Bjorn Pettersen wrote:

> All I'm doing boils down to:
>
> response = rf.nextResponse()
> dom = parseString(response)
>
> in a loop. Am I doing something wrong? Is there a faster way when all I
> need is a traversable tree structure as the result?

as a general rule, XML toolkits that try to implement the DOM specification
in pure Python are incredibly slow and bloated.

on random XML data, minidom can easily gobble up a kilobyte or two for
each element. in one of my benchmarks, it used about 50 bytes of object
memory for each input character:

http://online.effbot.org/2002_12_01_archive.htm#dom-bloat

creating all those objects takes time...

toolkits that use a more pythonic api also tend to be more efficient; for
example, the pure python version of my elementtree module is typically
3-5 times faster than minidom, and uses less than half the memory:

http://effbot.org/zone/element-index.htm
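
(For illustration: the elementtree module mentioned here later entered the Python standard library as `xml.etree.ElementTree`. The following is only a minimal sketch of parsing one account string with it. The sample document, its element names, and the `parse_account` helper are assumptions made up for this example; the real account layout is not shown in the thread.)

```python
import xml.etree.ElementTree as ET

# Hypothetical account document; the actual data format is not given in the thread.
SAMPLE = """<account id="42">
  <owner>Jane Doe</owner>
  <balance currency="USD">100.50</balance>
</account>"""


def parse_account(xml_text):
    """Parse one account string into an element tree and read a few fields."""
    root = ET.fromstring(xml_text)      # root is the <account> element
    balance = root.find("balance")      # first direct child named <balance>
    return {
        "id": root.get("id"),           # attribute lookup
        "owner": root.findtext("owner"),
        "balance": float(balance.text),
        "currency": balance.get("currency"),
    }


account = parse_account(SAMPLE)
print(account)
```

The tree stays traversable (`for child in root: ...`), which is the property the original question asks for, without the full DOM node model.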

you may be able to reach 10x with SAX-style custom code using pyexpat
(or sgmlop) directly...

http://www.python.org/doc/current/lib/module-xml.parsers.expat.html
http://effbot.org/zone/sgmlop-index.htm
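
(A rough sketch of what "SAX-style custom code using pyexpat directly" might look like: the handlers pick out only the fields of interest and no tree is built at all. The sample document, the element names, and the `extract_fields` helper are hypothetical. No timing claim is made for this code.)

```python
import xml.parsers.expat

# Hypothetical account document, for illustration only.
SAMPLE = ('<account id="42"><owner>Jane Doe</owner>'
          '<balance>100.50</balance></account>')


def extract_fields(xml_text, wanted):
    """Return {element name: text} for the element names listed in `wanted`."""
    result = {}
    open_elements = []   # stack of currently open element names
    chunks = []          # text pieces of the wanted element being read

    def start(name, attrs):
        open_elements.append(name)
        if name in wanted:
            del chunks[:]            # start collecting text afresh

    def end(name):
        if name in wanted:
            result[name] = "".join(chunks)
        open_elements.pop()

    def chars(data):
        # Expat may deliver text in several pieces, so collect and join later.
        if open_elements and open_elements[-1] in wanted:
            chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)     # True marks this as the final chunk
    return result


fields = extract_fields(SAMPLE, {"owner", "balance"})
print(fields)
```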

...but to be on the safe side, I'd go for a C parser/tree builder. the following
two are about as fast as anything can be:

http://xmlsoft.org/python.html
http://www.reportlab.com/xml/pyrxp.html

(unfortunately, the C version of elementtree isn't yet ready for public
consumption...)

</F>

David Mertz

Jan 2, 2003, 5:14:20 PM

"Fredrik Lundh" <fre...@pythonware.com> wrote previously:

|as a general rule, XML toolkits that try to implement the DOM
|specification in pure Python are incredibly slow and bloated.

|toolkits that use a more pythonic api also tend to be more efficient;

| http://effbot.org/zone/element-index.htm

I would also, perhaps, recommend my [gnosis.xml.objectify] module. It
creates "Pythonic" objects based on XML documents. If you use it, be
sure to use the EXPAT parser, rather than going through DOM (which is
the default, but it only takes one argument to use EXPAT).

I don't have a precise timing, but certainly much faster than DOM. My
goal with the module was not really speed, but rather obtaining a more
natural Python interface. Nonetheless, it goes pretty fast.

Take a look at <http://gnosis.cx/download/Gnosis_Utils-current.tar.gz>.

Yours, David...

--
mertz@ | The specter of free information is haunting the `Net! All the
gnosis | powers of IP- and crypto-tyranny have entered into an unholy
.cx | alliance...ideas have nothing to lose but their chains. Unite
| against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------

Bjorn Pettersen

Jan 2, 2003, 5:29:21 PM

> From: Martin v. Löwis [mailto:mar...@v.loewis.de]

>
> "Bjorn Pettersen" <BPett...@NAREX.com> writes:
>
> > All I'm doing boils down to:
> >
> > response = rf.nextResponse()
> > dom = parseString(response)
> >
> > in a loop. Am I doing something wrong?
>

> You have to give more details. What Python version? PyXML or
> stock Python? One traditional reason is that people, not
> knowingly, have used PyXML xmlproc, which is a pure-Python
> parser, instead of Expat.

Python 2.2.1 without PyXML. The full code looks like:

import sys
import time

def test():
    from xml.dom.minidom import parseString
    # ResponseFile is the poster's own class (not shown in the thread).
    rf = ResponseFile('c:/data/Testoutput.xml')
    count = 0

    start = time.time()
    try:
        while 1:
            # nextResponse() returns a complete xml
            # document as a string (throws at eof).
            response = rf.nextResponse()
            dom = parseString(response)
            count += 1
            sys.stdout.write('.')
    except:
        # End of file; note that a bare except also hides parse errors.
        pass
    stop = time.time()
    return count, stop-start

If I'm reading the minidom/pulldom files correctly this should use Expat(?)

> PyXML 0.8.x has a number of speed improvements for
> minidom-with-expat (such as eliminating the SAX driver), and
> memory usage improvements (such as interning element and
> attribute names).

As a test, I tried building my own tree directly from the Expat events.
This was about 4 times faster (2.89 accts/sec), but still far from fast
enough... I'm starting to think a custom C++ parser might be the way to
go (and here I was having such a nice day <sigh>).
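
(The tree-building code itself was not posted. As a hedged sketch only, one way to build a lightweight tree directly from Expat events is to use nested `[tag, attrs, children]` lists instead of DOM node objects. The `build_tree` helper and the node layout are assumptions made for this example, not the poster's implementation.)

```python
import xml.parsers.expat


def build_tree(xml_text):
    """Return the document as nested [tag, attrs, children] lists.

    Each child is either a text string or another [tag, attrs, children] node.
    """
    holder = [None, {}, []]          # dummy parent that receives the root
    stack = [holder]                 # stack[-1] is the node being filled

    def start(name, attrs):
        node = [name, attrs, []]
        stack[-1][2].append(node)    # attach to the current parent
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        stack[-1][2].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True        # merge adjacent text into one callback
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return holder[2][0]              # the single root node


tree = build_tree('<account id="42"><owner>Jane Doe</owner></account>')
print(tree)
```

Plain lists and dicts avoid creating one class instance per node, which is the kind of per-object overhead discussed elsewhere in this thread.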

> > Is there a faster way when all I need is a traversable tree
> > structure as the result?
>

> "All I need" reads quite funny in this context, as producing
> a traversable tree is one of the more expensive ways for XML
> processing. There are certainly faster ways if you *don't*
> need a traversable tree.

:-) Unfortunately they're not my requirements. (They go something like:
"we will eventually need all the data, so put them in a form that the
next step can traverse to put into a DB".) If you think a different
approach is better I'm all ears :-)

Thanks for the interest.

-- bjorn

Bjorn Pettersen

Jan 2, 2003, 5:40:15 PM

> From: Fredrik Lundh [mailto:fre...@pythonware.com]
>
[...]

>
> as a general rule, XML toolkits that try to implement the DOM
> specification in pure Python are incredibly slow and bloated.
>

> on random XML data, minidom can easily gobble up a kilobyte
> or two for each element. in one of my benchmarks, it used
> about 50 bytes of object memory for each input character:

[lots of good references...]

Thanks Fredrik! I'll look into the various options and post any
interesting findings :-)

-- bjorn

Martin v. Löwis

Jan 2, 2003, 6:01:10 PM

"Bjorn Pettersen" <BPett...@NAREX.com> writes:

> If I'm reading the minidom/pulldom files correctly this should use
> Expat(?)

Yes, that is the only possible interpretation if no other parsers are
available.

> As a test, I tried building my own tree directly from the Expat
> events. This was about 4 times faster (2.89 accts/sec), but still
> far from fast enough... I'm starting to think a custom C++ parser
> might be the way to go (and here I was having such a nice day
> <sigh>).

I see. Then I would suggest that the mere parsing speed is not the
issue - this uses roughly all tricks we can think of. It still would
be interesting to find out where the computation time is spent. If
these are complicated documents (i.e. many elements and attributes,
short PCDATA), then surely memory allocation is an issue - you could
try Python 2.3a1 also, as a test (pymalloc should give some
improvements when there are many memory allocations).

I doubt that a custom parser can do much better, unless it allows you
to drop data you are not interested in.

What *has* been demonstrated to be a speed-up over minidom is to use
4Suite's cDomlette. It is faster, because:
- it allocates fewer objects: many things are stored in the elements
themselves, instead of in dictionaries, as Python classic classes
do.
- object creation is through C, with no need to lookup Python methods
over and over again.

When completed, it still gives you a Python-conforming DOM tree. That
DOM tree misses some of the DOM functionality, though, that's why they
call it a Domlette.

> :-) Unfortunately they're not my requirements. (They go something
> like: "we will eventually need all the data, so put them in a form
> that the next step can traverse to put into a DB".) If you think a
> different approach is better I'm all ears :-)

The stream-processing approaches are *much* faster, in all
languages. They don't create intermediate objects, but present you
with just the strings that the parser had to extract from the
document, anyway.

In order of increasing speed, decreasing standards conformance:
- SAX: depending on how you design the content handler, you can be
much faster than a DOM builder already. As a test, you might want to
plug in an empty ContentHandler, and see how many documents you
can parse without processing in a certain time.
- Expat raw interface: parsing is XML-conforming, but the API of
Expat is proprietary. This saves indirections, and is again faster.
You can apply the same benchmark with little effort.
- PyXML//F sgmlop: to my knowledge, the fastest for-Python XML
parser, but it misses a number of XML features (e.g. it won't
do entity expansion).
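
(The benchmark suggested in the first two list items could be sketched roughly as below: parse every document with a do-nothing SAX `ContentHandler`, then again with a raw Expat parser that has no handlers set, and compare the elapsed times. The generated sample documents and both function names are made up for this example; real account documents would replace `docs`.)

```python
import time
import xml.parsers.expat
import xml.sax
from xml.sax.handler import ContentHandler


def time_sax_parse(documents):
    """Parse each document with an empty handler; return (count, seconds)."""
    handler = ContentHandler()       # every callback in the base class is a no-op
    start = time.time()
    count = 0
    for doc in documents:
        xml.sax.parseString(doc, handler)
        count += 1
    return count, time.time() - start


def time_expat_parse(documents):
    """Same measurement with the raw Expat interface and no handlers set."""
    start = time.time()
    count = 0
    for doc in documents:
        # An Expat parser object handles exactly one document.
        parser = xml.parsers.expat.ParserCreate()
        parser.Parse(doc, True)
        count += 1
    return count, time.time() - start


# Hypothetical stand-in documents, encoded to bytes.
docs = [('<account id="%d"><owner>x</owner></account>' % i).encode("utf-8")
        for i in range(100)]

sax_count, sax_seconds = time_sax_parse(docs)
expat_count, expat_seconds = time_expat_parse(docs)
print("SAX:   %d docs in %.4f s" % (sax_count, sax_seconds))
print("Expat: %d docs in %.4f s" % (expat_count, expat_seconds))
```

Both numbers are an upper bound on throughput: they show how fast parsing alone can go before any tree building or other processing is added.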

In any case, please report what your findings are and what technology
you eventually use.

Regards,
Martin

Lulu of the Lotus-Eaters

Jan 2, 2003, 10:01:28 PM

"Bjorn Pettersen" <BPett...@NAREX.com> wrote previously:

|As a test, I tried building my own tree directly from the Expat events.
|This was about 4 times faster (2.89 accts/sec), but still far from fast

|enough... I'm starting to think a custom C++ parser might be...

PyExpat is pretty darn fast. I would be quite surprised if you could do
several times better with a custom C++ parser. You might take a look at
RXP/PyRXP, which brags of its speed. But even then, it claims a few
percent better than expat, not several times (albeit, with validation
added in).

Yours, Lulu...

--
Keeping medicines from the bloodstreams of the sick; food from the bellies
of the hungry; books from the hands of the uneducated; technology from the
underdeveloped; and putting advocates of freedom in prisons. Intellectual
property is to the 21st century what the slave trade was to the 16th.

Fredrik Lundh

Jan 3, 2003, 3:20:40 AM

Lulu of the Lotus-Eaters wrote:
> PyExpat is pretty darn fast. I would be quite surprised if you could do
> several times better with a custom C++ parser.

moving the tree building from Python to C++ can easily buy you 10x,
no matter what parser you're using.

(if you think otherwise, you're seriously underestimating the cost of a
C++ -> Python call...)

</F>

Martin v. Löwis

Jan 3, 2003, 3:48:48 AM

"Fredrik Lundh" <fre...@pythonware.com> writes:

> moving the tree building from Python to C++ can easily buy you 10x,
> no matter what parser you're using.

Indeed; this is what makes 4Suite's cDomlette so fast.

Of course, if you use a C++ parser (say, Xerces), and then create
wrapper objects around the tree that Xerces produces, you lose some of
the advantages by creating more objects than necessary.

Regards,
Martin

Paul Boddie

Jan 3, 2003, 7:39:40 AM

mar...@v.loewis.de (Martin v. Löwis) wrote in message news:<mailman.104154852...@python.org>...

>
> What *has* been demonstrated to be a speed-up over minidom is to use
> 4Suite's cDomlette.

I can definitely agree with this. While the XML work I've been doing
has also involved DOM operations (parsing isn't interesting on its
own, after all), I've seen the performance of different packages in
order of decreasing speed to be: cDomlette, minidom, 4DOM. However,
I'm deliberately not using the extra functionality of 4DOM.

[...]

> When completed, it still gives you a Python-conforming DOM tree. That
> DOM tree misses some of the DOM functionality, though, that's why they
> call it a Domlette.

Still, cDomlette does have things like importNode which are missing
from minidom (which at least was missing from PyXML a couple of
releases ago).

Paul

Paul Boddie

Jan 6, 2003, 6:46:04 AM

mar...@v.loewis.de (Martin v. Löwis) wrote in message news:<m3n0mim...@mira.informatik.hu-berlin.de>...

> "Fredrik Lundh" <fre...@pythonware.com> writes:
>
> > moving the tree building from Python to C++ can easily buy you 10x,
> > no matter what parser you're using.
>
> Indeed; this is what makes 4Suite's cDomlette so fast.

And cDomlette produces document representations that can be
manipulated using standards-compliant APIs, too, for those of us that
care about that kind of thing. Hats off to Fourthought! :-)

Paul

Bjorn Pettersen

Jan 6, 2003, 12:41:33 PM

> From: Paul Boddie [mailto:pa...@boddie.net]

Quick question... I've got benchmark results for most of the options,
but I haven't been able to track down cDomlette yet (google didn't turn
up any obvious links). Does anyone know where I can download it (and the
documentation :-) I'll post my benchmarks in a couple of days...

-- bjorn

Martin v. Löwis

Jan 6, 2003, 1:31:09 PM

"Bjorn Pettersen" <BPett...@NAREX.com> writes:

> Quick question... I've got benchmark results for most of the
> options, but I haven't been able to track down cDomlette yet (google
> didn't turn up any obvious links). Does anyone know where I can
> download it (and the documentation :-) I'll post my benchmarks in a
> couple of days...

It's part of 4Suite, www.4Suite.org. If you download the 4Suite beta
(which you should), it is in Ft.Lib.cDomlette. You can find usage
examples in

http://uche.ogbuji.net:8080/uche.ogbuji.net/tech/akara/pyxml/domlettes

Regards,
Martin

Uche Ogbuji

Jan 11, 2003, 1:40:06 PM

mar...@v.loewis.de (Martin v. Löwis) wrote in message news:<m3hecmw...@mira.informatik.hu-berlin.de>...

What Martin said.

I just want to point out another friendly introduction.

http://www.xml.com/pub/a/2002/10/16/py-xml.html

And if you use Python 2.2 you may also want to see my tips on using
generators with DOM, which often give a speed advantage.

http://www.xml.com/pub/a/2003/01/08/py-xml.html
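
(As a general illustration of the generator idea, and not code taken from the linked article: a generator can walk a DOM tree lazily, so the caller can stop early and no intermediate node list is built. The `iter_elements` helper and the sample document are assumptions made for this example, shown here with minidom.)

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def iter_elements(node):
    """Yield every element below `node` in document order, one at a time."""
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            yield child
            # Recurse lazily into the subtree of this element.
            for descendant in iter_elements(child):
                yield descendant


# Hypothetical sample document.
dom = parseString('<accounts><account id="1"><owner>a</owner></account>'
                  '<account id="2"/></accounts>')
names = [element.tagName for element in iter_elements(dom)]
print(names)
```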

--Uche
http://uche.ogbuji.net