
10GB XML Blows out Memory, Suggestions?


axw...@gmail.com

Jun 6, 2006, 7:48:39 AM
I wrote a program that takes an XML file into memory using Minidom. I
found out that the XML document is 10gb.

I clearly need SAX or something else?

Any suggestions on what that something else is? Is it hard to convert
the code from DOM to SAX?

Rene Pijlman

Jun 6, 2006, 8:01:50 AM
axw...@gmail.com:

>I wrote a program that takes an XML file into memory using Minidom. I
>found out that the XML document is 10gb.
>
>I clearly need SAX or something else?
>
>Any suggestions on what that something else is?

PullDOM.
http://www-128.ibm.com/developerworks/xml/library/x-tipulldom.html
http://www.prescod.net/python/pulldom.html
http://docs.python.org/lib/module-xml.dom.pulldom.html (not much)

--
René Pijlman

Mathias Waack

Jun 6, 2006, 8:03:58 AM
axw...@gmail.com wrote:

> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?

More memory ;)
Maybe you should have a look at pulldom, a combination of SAX and DOM: it
reads your document in a SAX-like manner and expands only selected
sub-trees.
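A rough sketch of that idiom (untested, and the "contact" tag below is just a
placeholder for whatever the real top-level record element is called):

from xml.dom import pulldom

events = pulldom.parse("data.xml")
for event, node in events:
    # expand only the small subtree we care about, one record at a time
    if event == pulldom.START_ELEMENT and node.tagName == "contact":
        events.expandNode(node)   # node is now a complete minidom element
        # ... normal DOM processing of this one record goes here ...

Each expanded record can be discarded before the next one is read, so memory
use stays roughly constant instead of growing with the file.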

> Any suggestions on what that something else is? Is it hard to convert
> the code from DOM to SAX?

Assuming a good design, of course not. Especially if you only need selected
parts of the document, SAX should be your choice.

Mathias

Diez B. Roggisch

Jun 6, 2006, 8:05:11 AM
axw...@gmail.com wrote:

Yes.

You could use ElementTree's iterparse - that should be the easiest solution.

http://effbot.org/zone/element-iterparse.htm

Diez

K.S.Sreeram

Jun 6, 2006, 8:30:31 AM
axw...@gmail.com wrote:
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.

With a 10GB file, your best bet might be to just use Expat and C!!

Regards
Sreeram



Nicola Musatti

Jun 6, 2006, 9:07:41 AM

axw...@gmail.com wrote:
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?

What you clearly need is a better suited file format, but I suspect
you're not in a position to change it, are you?

Cheers,
Nicola Musatti

Diez B. Roggisch

Jun 6, 2006, 9:33:25 AM
K.S.Sreeram wrote:

No. What exactly makes C grok a 10GB file where Python will fail to do so?

What the OP needs is a different approach to XML documents that won't
parse the whole file into one giant tree - but I'm pretty sure that
(c)ElementTree will do the job as well as expat. And I don't recall the
OP musing about performance woes, btw.

Diez

Paul McGuire

Jun 6, 2006, 9:56:14 AM
<axw...@gmail.com> wrote in message
news:1149594519....@u72g2000cwu.googlegroups.com...

> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?
>

You clearly need something instead of XML.

This sounds like a case where a prototype, which worked for the developer's
simple test data set, blows up in the face of real user/production data.
XML adds lots of overhead for nested structures, when in fact, the actual
meat of the data can be relatively small. Note also that this XML overhead
is directly related to the verbosity of the XML designer's choice of tag
names, and whether the designer was predisposed to using XML elements over
attributes. Imagine a record structure for a 3D coordinate point (described
here in no particular coding language):

struct ThreeDimPoint:
    xValue : integer,
    yValue : integer,
    zValue : integer

Directly translated to XML gives:

<ThreeDimPoint>
    <xValue>4</xValue>
    <yValue>5</yValue>
    <zValue>6</zValue>
</ThreeDimPoint>

This expands 3 integers to a whopping 101 characters. Throw in namespaces
for good measure, and you inflate the data even more.

Many Java folks treat XML attributes as anathema, but look how this cuts
down the data inflation:

<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>

This is only 50 characters, or *only* 4 times the size of the contained data
(assuming 4-byte integers).

Try zipping your 10Gb file, and see what kind of compression you get - I'll
bet it's close to 30:1. If so, convert the data to a real data storage
medium. Even a SQLite database table should do better, and you can ship it
around just like a file (just can't open it up like a text file).
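As a rough illustration of that kind of conversion (a sketch only: it assumes the
sqlite3 module is available and reuses the toy ThreeDimPoint/xValue element names
from above - substitute the real record structure):

import sqlite3
from xml.etree import cElementTree as ET

conn = sqlite3.connect("points.db")
conn.execute("CREATE TABLE IF NOT EXISTS points (x INTEGER, y INTEGER, z INTEGER)")

# stream the XML and insert one row per record, so nothing large stays in memory
for event, elem in ET.iterparse("data.xml"):
    if elem.tag == "ThreeDimPoint":
        conn.execute("INSERT INTO points VALUES (?, ?, ?)",
                     (int(elem.findtext("xValue")),
                      int(elem.findtext("yValue")),
                      int(elem.findtext("zValue"))))
        elem.clear()    # throw the element away once its data is stored

conn.commit()
conn.close()

Once the rows are in SQLite you can query them through an index instead of
re-reading 10GB of text every time.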

-- Paul


sk...@pobox.com

Jun 6, 2006, 10:27:32 AM

Paul> You clearly need something instead of XML.

Amen, brother...

+1 QOTW.

Skip

Kay Schluehr

Jun 6, 2006, 12:02:59 PM

If your XML files grow so large, you might rethink the representation
model. Maybe give eXist a try?

http://exist.sourceforge.net/

Regards,
Kay

Felipe Almeida Lessa

Jun 6, 2006, 1:43:24 PM
On Tue, 2006-06-06 at 13:56 +0000, Paul McGuire wrote:
> (just can't open it up like a text file)

Who'll open a 10 GiB file anyway?

--
Felipe.

K.S.Sreeram

Jun 6, 2006, 2:23:21 PM
Diez B. Roggisch wrote:
> What the OP needs is a different approach to XML-documents that won't
> parse the whole file into one giant tree - but I'm pretty sure that
> (c)ElementTree will do the job as well as expat. And I don't recall the
> OP musing about performances woes, btw.


There's just NO WAY that the 10gb xml file can be loaded into memory as
a tree on any normal machine, irrespective of whether we use C or
Python. So the *only* way is to perform some kind of 'stream' processing
on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
out for this.

Diez B. Roggisch wrote:
> No what exactly makes C grok a 10Gb file where python will fail to do so?

In most typical cases where there's any kind of significant Python code,
it's possible to achieve a *minimum* of a 10x speedup by using C. In most
cases, the speedup is not worth it and we just trade it for the
increased flexibility/power of the Python language. But in this situation
using a bit of tight C code could make the difference between the
process taking just 15 minutes or taking a few hours!

Of course I'm not asking him to write the entire application in C. It
makes sense to just write the performance-critical sections in C, wrap
them in Python, and write the rest of the application in Python.


Fredrik Lundh

Jun 6, 2006, 2:37:32 PM
K.S.Sreeram wrote:

> There's just NO WAY that the 10gb xml file can be loaded into memory as
> a tree on any normal machine, irrespective of whether we use C or
> Python. So the *only* way is to perform some kind of 'stream' processing
> on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
> out for this.

both ElementTree and cElementTree support "sax-style" event generation
(through XMLTreeBuilder/XMLParser) and incremental parsing (through
iterparse). the cElementTree versions of these are even faster than
pyexpat.

the iterparse interface is described here:

http://effbot.org/zone/element-iterparse.htm

</F>

K.S.Sreeram

Jun 6, 2006, 3:05:50 PM
Fredrik Lundh wrote:
> both ElementTree and cElementTree support "sax-style" event generation
> (through XMLTreeBuilder/XMLParser) and incremental parsing (through
> iterparse). the cElementTree versions of these are even faster than
> pyexpat.
>
> the iterparse interface is described here:
>
> http://effbot.org/zone/element-iterparse.htm
>
That's cool! Thanks for the info!

For a multi-gigabyte file, I would still recommend C/C++, because the
processing code which sits on top of the XML library needs to be Python,
and that could turn out to be a significant overhead in such extreme cases.

Of course, the exact strategy to follow would depend on the specifics of
the case, and all this speculation may not really apply! :)

Regards
Sreeram


gregarican

Jun 6, 2006, 3:28:12 PM
10 gigs? Wow, even using SAX I would imagine that you would be pushing
the limits of reasonable performance. Any way you can depart from the
XML requirement? That's not really what XML was intended for in terms
of passing along information IMHO...

axw...@gmail.com

Jun 6, 2006, 3:48:27 PM
The file is an XML dump from Goldmine. I have built a document parser
that allows for the population of data from Goldmine into SugarCRM. The
client's data set is 10GB.

Fredrik Lundh

Jun 6, 2006, 3:52:44 PM
gregarican wrote:

> 10 gigs? Wow, even using SAX I would imagine that you would be pushing
> the limits of reasonable performance.

depends on how you define "reasonable", of course. modern computers are
quite fast:

> dir data.xml

2006-06-06 21:35 1 002 000 015 data.xml
1 File(s) 1 002 000 015 bytes

> more test.py
from xml.etree import cElementTree as ET
import time

t0 = time.time()

for event, elem in ET.iterparse("data.xml"):
    if elem.tag == "item":
        elem.clear()

print time.time() - t0

gives me timings between 27.1 and 49.1 seconds over 5 runs.

(Intel Dual Core T2300, slow laptop disks, 1000000 XML "item" elements
averaging 1000 bytes each, bundled cElementTree, peak memory usage 33 MB.
your mileage may vary.)

</F>

axw...@gmail.com

Jun 6, 2006, 3:53:58 PM
Paul,

This is interesting. Unfortunately, I have no control over the XML
output. The file is from Goldmine. However, you have given me an
idea...

Is it possible to read an XML document in compressed format?

gregarican

Jun 6, 2006, 4:01:43 PM
That's a good-sized Goldmine database. In past lives I have supported
that app and recall that you could match the Goldmine front end against
an SQL backend. If you can get to the underlying data using SQL, you
can selectively port over sections of the database and might be able to
attack things more methodically than parsing through a mongo XML file.
Instead you could bulk insert portions of the Goldmine data into
SugarCRM. Know what I mean?
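Purely as a sketch of that route - every connection string, table and column name
below is a guess and would have to be mapped to the real Goldmine and SugarCRM
schemas, and the pyodbc/MySQLdb modules are assumptions about the environment:

import pyodbc    # hypothetical ODBC route to the Goldmine backend
import MySQLdb   # SugarCRM typically sits on MySQL

gm = pyodbc.connect("DSN=goldmine")        # made-up DSN
sugar = MySQLdb.connect(db="sugarcrm")     # made-up connection details

src = gm.cursor()
dst = sugar.cursor()

src.execute("SELECT contact, phone1, address1 FROM contact1")   # guessed table/columns
while True:
    rows = src.fetchmany(1000)             # work in batches, never the whole data set
    if not rows:
        break
    dst.executemany("INSERT INTO contacts (last_name, phone_work, "
                    "primary_address_street) VALUES (%s, %s, %s)",
                    [tuple(r) for r in rows])

sugar.commit()

Batching the inserts keeps memory flat and lets you restart a failed import partway
through.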

John J. Lee

Jun 6, 2006, 4:11:42 PM
"K.S.Sreeram" <sre...@tachyontech.net> writes:
[...]

> There's just NO WAY that the 10gb xml file can be loaded into memory as
> a tree on any normal machine, irrespective of whether we use C or
> Python.

Yes.

> So the *only* way is to perform some kind of 'stream' processing
> on the file. Perhaps using a SAX like API. So (c)ElementTree is ruled
> out for this.

No, that's not true. I guess you didn't read the other posts:

http://effbot.org/zone/element-iterparse.htm


> Diez B. Roggisch wrote:
> > No what exactly makes C grok a 10Gb file where python will fail to do so?
>
> In most typical cases where there's any kind of significant python code,
> its possible to achieve a *minimum* of a 10x speedup by using C. In most

[...]

I don't know where you got that from. And in this particular case, of
course, cElementTree *is* written in C, there's presumably plenty of
"significant python code" around since, one assumes, *all* of the OP's
code is written in Python (does that count as "any kind" of Python
code?), and yet rewriting something in C here may not make much
difference.


John

fuzzylollipop

Jun 6, 2006, 10:43:07 PM


you got no idea what you are talking about, anyone knows that something
like this is IO bound.
CPU is the least of his worries. And for IO bound applications Python
is just as fast as any other language.

fuzzylollipop

Jun 6, 2006, 10:51:50 PM

axw...@gmail.com wrote:
> Paul,
>
> This is interesting. Unfortunately, I have no control over the XML
> output. The file is from Goldmine. However, you have given me an
> idea...
>
> Is it possible to read an XML document in compressed format?

Compressing the footprint on disk won't matter; you still have 10GB of
data that you need to process, and it can only be processed
uncompressed.

I would just export the data in smaller batches; there should not be
any reason you can't export subsets and process them that way.

Fredrik Lundh

Jun 7, 2006, 2:24:23 AM
fuzzylollipop wrote:

> you got no idea what you are talking about, anyone knows that something
> like this is IO bound.

which of course explains why some XML parsers for Python are a 100 times
faster than other XML parsers for Python...

</F>

Fredrik Lundh

Jun 7, 2006, 2:27:47 AM
fuzzylollipop wrote:

>> Is it possible to read an XML document in compressed format?
>
> compressing the footprint on disk won't matter, you still have 10GB of
> data that you need to process and it can only be processed uncompressed.

didn't you just claim that this was an I/O bound problem?

</F>

Fredrik Lundh

Jun 7, 2006, 3:06:03 AM
axw...@gmail.com wrote:
> Paul,
>
> This is interesting. Unfortunately, I have no control over the XML
> output. The file is from Goldmine. However, you have given me an
> idea...
>
> Is it possible to read an XML document in compressed format?

sure. you can e.g. use gzip.open to create a file object that
decompresses on the way in.

import gzip
from xml.etree import cElementTree as ET

file = gzip.open("data.xml.gz")

for event, elem in ET.iterparse(file):
    if elem.tag == "item":
        elem.clear()

I tried compressing my 1 GB example, but all 1000-byte records in that
file are identical, so I got a 500x compression, which is a bit higher
than you can reasonably expect ;-) however, with that example, I get a
stable parsing time of 26 seconds, so it looks as if gzip can produce
data about as fast as a preloaded disk cache...

</F>

gregarican

Jun 7, 2006, 11:00:37 AM
Am I missing something? I don't read where the poster mentioned the
operation as being CPU intensive. He does mention that the entirety of
a 10 GB file cannot be loaded into memory. If you discount physical
swapfile paging and base this assumption on a "normal" PC that might
have maybe 1 or 2 GB of RAM, is his assumption that out of line?

And I don't doubt that Python is as efficient as possible for I/O
operations. But since it is an interpreted scripting language, how could
it be "just as fast as any language" as you claim? C would have to be
faster. Machine language would have to be faster. And even other
interpreted languages *could* be faster, given certain conditions. A
generalization like the claim kind of invalidates the remainder of your
assertion.

fuzzylollipop

Jun 7, 2006, 11:27:11 AM

Depends on the CODE and the SIZE of the file. In this case -
processing 10GB of file - unless that file is heavily encrypted or
compressed, the process will be IO bound, PERIOD!

And in the case of XML, unless the PARSER is extremely inefficient (and
I assume that would be an edge case), the parser is NOT the bottleneck
here.

The relative performance of Python XML parsers is irrelevant in
relation to this being an IO bound process; even the slowest parser
could only process the data as fast as it can be read off the disk.

Anyone saying that using C instead of Python will be faster, when 99% of
the time in this case is just waiting on the disk to feed a buffer, has
no idea what they are talking about.

I work with terabytes of files, and all our Python code is just as fast
as equivalent C code for IO bound processes.

axw...@gmail.com

Jun 7, 2006, 12:01:19 PM
Thanks guys for all your posts...

So I am a bit confused....Fuzzy, the code I saw looks like it
decompresses as a stream (i.e. per byte). Is this the case or are you
just compressing for file storage but the actual data set has to be
exploded in memory?

Diez B. Roggisch

Jun 7, 2006, 12:11:13 PM
fuzzylollipop wrote:

>
> Fredrik Lundh wrote:
>> fuzzylollipop wrote:
>>
>> > you got no idea what you are talking about, anyone knows that something
>> > like this is IO bound.
>>
>> which of course explains why some XML parsers for Python are a 100 times
>> faster than other XML parsers for Python...
>>
>
> dependes on the CODE and the SIZE of the file, in this case
>
> processing 10GB of file, unless that file is heavly encrypted or
> compressed will, the process will be IO bound PERIOD!

Why so? IO bounds will be hit when the processing of the fetched data is
faster than the fetching itself. So if I decide to read 10GB at one 4KB block
per second, I'm possibly a very patient fella, but no IO bounds are hit. So
no PERIOD here - not without talking about _what_ actually happens.

> Anyone saying that using C instead of Python will be faster when 99% of
> the time in this case is just waiting on the disk to feed a buffer, has
> no idea what they are talking about.

Which is true - but the chances of C doing whatever I want in that 1% of
the time are a few times better than of doing it in Python.

Mind you: I don't argue that the statements of Mr. Sreeram are true, either.
This discussion can only be held with respect to the actual use case (which
is certainly more than just parsing XML, but also processing it).

> I work with TeraBytes of files, and all our Python code is just as fast
> as equivelent C code for IO bound processes.

Care to share what kind of processing you perform on these files?

Regards,

Diez

Fredrik Lundh

Jun 7, 2006, 12:30:07 PM
fuzzylollipop wrote:

> dependes on the CODE and the SIZE of the file, in this case
> processing 10GB of file, unless that file is heavly encrypted or
> compressed will, the process will be IO bound PERIOD!

so the fact that

for token, node in pulldom.parse(file):
    pass

is 50-200% slower than

for event, elem in ET.iterparse(file):
    if elem.tag == "item":
        elem.clear()

when reading a gigabyte-sized XML file, is due to an unexpected slowdown
in the I/O subsystem after importing xml.dom?

> I work with TeraBytes of files, and all our Python code is just as fast
> as equivelent C code for IO bound processes.

so how large are the things that you're actually *processing* in your
Python code? megabyte blobs or 100-1000 byte records? or even smaller
things?

</F>

gregarican

Jun 7, 2006, 12:59:48 PM
Point for Fredrik. If someone doesn't recognize the inherent
performance differences between different XML parsers, they haven't
experienced the pain (and eventual victory) of trying to optimize their
techniques for working with the albatross that XML can be :-)

Paul Boddie

Jun 7, 2006, 1:50:21 PM
gregarican wrote:
> Am I missing something? I don't read where the poster mentioned the
> operation as being CPU intensive. He does mention that the entirety of
> a 10 GB file cannot be loaded into memory. If you discount physical
> swapfile paging and base this assumption on a "normal" PC that might
> have maybe 1 or 2 GB of RAM is his assumption that out of line?

Indeed. The complaint is fairly obvious from the title of the thread.
Now, if the complaint was specifically about the size of the minidom
representation in memory, perhaps a more efficient representation could
be chosen by using another library. Even so, the size of the file being
processed is still likely to be pretty big, considering various
observations and making vague estimates:

http://effbot.org/zone/celementtree.htm

For many people, an XML file of, say, 600MB would still be quite a load
on their "home/small business edition" computer if you had to load the
whole file in and then work on it, even just as a text file. Of course,
approaches where you can avoid keeping a representation of the whole
thing around would be beneficial, and as mentioned previously in a
thread on large XML files, there's always the argument that some kind
of database system should be employed to make querying more efficient
if you can't perform some kind of sequential processing.

Paul

Ralf Muschall

Jun 7, 2006, 2:10:48 PM
Paul McGuire wrote:

> meat of the data can be relatively small. Note also that this XML overhead
> is directly related to the verbosity of the XML designer's choice of tag
> names, and whether the designer was predisposed to using XML elements over
> attributes. Imagine a record structure for a 3D coordinate point (described
> here in no particular coding language):

> struct ThreeDimPoint:
>     xValue : integer,
>     yValue : integer,
>     zValue : integer

> Directly translated to XML gives:

> <ThreeDimPoint>
>     <xValue>4</xValue>
>     <yValue>5</yValue>
>     <zValue>6</zValue>
> </ThreeDimPoint>

This is essentially true, but should not cause the OP's problem.
After parsing, the overhead of XML is gone, and long tag names
are nothing but pointers to a string which happens to be long
(unless *all* tags in the XML are differently named, which would
cause a huge DTD/XSD as well).

> This expands 3 integers to a whopping 101 characters. Throw in namespaces
> for good measure, and you inflate the data even more.

In the DOM, it contracts to 3 integers and a few pointers -
essentially the same as needed in a reasonably written
data structure.

> Try zipping your 10Gb file, and see what kind of compression you get - I'll
> bet it's close to 30:1. If so, convert the data to a real data storage

In this case, his DOM (or whatever equivalent data structure, i.e.
whatever he *must* process) would be 300 MB + pointers.
I'd even go as far as to say that the best thing that can happen to
him is a huge overhead - this would mean he has a little data
in a rather spongy file (which collapses on parsing).

> medium. Even a SQLite database table should do better, and you can ship it
> around just like a file (just can't open it up like a text file).

A table helps only if the data is tabular (i.e. a single relation),
i.e. probably never (otherwise the sending side would have shipped
something like CSV).

Ralf

Fredrik Lundh

Jun 7, 2006, 2:19:05 PM
Ralf Muschall wrote:

> In the DOM, it contracts to 3 integers and a few pointers -
> essentially the same as needed in a reasonably written
> data structure.

what kind of magic DWIM DOM is this?

</F>

Thomas Ganss

Jun 8, 2006, 3:56:35 AM
>>medium. Even a SQLite database table should do better, and you can ship it
>>around just like a file (just can't open it up like a text file).
>
>
> A table helps only if the data is tabular (i.e. a single relation),
> i.e. probably never (otherwise the sending side would have shipped
> something like CSV).

Perhaps the previous poster meant "database file", which for some
systems describes the "container" of the whole database. If the XML has
redundancies represented in "linked" data, data normalization can cut
down on the needed space.

my 0.02 EUR

thomas

fuzzylollipop

Jun 8, 2006, 9:29:59 AM

axw...@gmail.com wrote:
> Thanks guys for all your posts...
>
> So I am a bit confused....Fuzzy, the code I saw looks like it
> decompresses as a stream (i.e. per byte). Is this the case or are you
> just compressing for file storage but the actual data set has to be
> exploded in memory?
>

it wasn't my code.

if you zip the 10GB and read from the zip into a DOM style tree, you
haven't gained anything, except adding additional CPU requirements to
do the decompression. You still have to load the entire thing into
memory.

There are differences in XML parsers, but IN EVERY LANGUAGE a poorly
written parser is a poorly written parser. Using the wrong IDIOM is
more of a problem than anything else. DOM parsers are good when you
need to read and process every element and attribute and the data is
"small". Granted, "small" is relative, but nobody will consider 10GB
"small".

SAX style or a pull-parser has to be used when the data is "large" or
when you don't really need to process every element and attribute.

This problem looks like it is just a data export / import problem. In
that case you will either have to use a SAX-style parser and parse the
10GB file, or, as I suggested in another reply, export the data in
smaller chunks and process them separately, which in almost EVERY case
is a better way to do batch processing.

You should always break processing up into as many discrete steps as
possible. That makes for easier debugging, and you can start over in the
middle much more easily.

Even if you just write a simple SAX-style parser that breaks the file
up into smaller pieces before you actually process it, you will be ahead
of the game.
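For example, a bare-bones splitter along those lines might look like this (a sketch
only: the "record" tag, the "chunk" wrapper element and the 10000-records-per-file
figure are placeholders, and namespaces are ignored):

import xml.sax
from xml.sax.saxutils import XMLGenerator

class Splitter(xml.sax.ContentHandler):
    """Copy each record subtree into numbered chunk files."""

    def __init__(self, record_tag="record", per_file=10000):
        xml.sax.ContentHandler.__init__(self)
        self.record_tag = record_tag
        self.per_file = per_file
        self.records = 0
        self.depth = 0              # > 0 while inside a record
        self.writer = None
        self.outfile = None

    def _close_chunk(self):
        if self.writer is not None:
            self.writer.endElement("chunk")
            self.writer.endDocument()
            self.outfile.close()
            self.writer = None

    def _open_chunk(self):
        self._close_chunk()
        self.outfile = open("chunk%04d.xml" % (self.records // self.per_file), "w")
        self.writer = XMLGenerator(self.outfile, "utf-8")
        self.writer.startDocument()
        self.writer.startElement("chunk", {})

    def startElement(self, name, attrs):
        if name == self.record_tag and self.depth == 0:
            if self.records % self.per_file == 0:
                self._open_chunk()      # start a new output file every N records
            self.records += 1
        if self.depth or name == self.record_tag:
            self.depth += 1
            self.writer.startElement(name, dict(attrs))

    def characters(self, content):
        if self.depth:
            self.writer.characters(content)

    def endElement(self, name):
        if self.depth:
            self.writer.endElement(name)
            self.depth -= 1

    def endDocument(self):
        self._close_chunk()

xml.sax.parse("data.xml", Splitter())

The chunk files can then be fed one at a time to whatever DOM-based code already
exists.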

We have systems that process streaming data coming from sockets in XML
format, that run in Java with very little memory footprint and very
little CPU usage. At 50 megabit a sec, that is about 4TB a day. C
wouldn't read from a socket any faster than the NBIO, actually it would
be harder to get the same performance in C because we would have to
duplicate all the SEDA style NBIO.

Fredrik Lundh

Jun 8, 2006, 9:50:49 AM
fuzzylollipop wrote:

> SAX style or a pull-parser has to be used when the data is "large" or
> when you don't really need to process every element and attribute.
>
> This problem looks like it is just a data export / import problem. In
> that case you will either have to use a sax style parser and parse the
> 10GB file. Or as I suggested in another reply, export the data in
> smaller chunks

or use a parser that can do the chunking for you, on the way in...

in Python, incremental parsers like cET's iterparse and the one in Amara
give you *better* performance than SAX (including "raw" pyexpat) in
many cases, and offer a much simpler programming model.

</F>

fuzzylollipop

Jun 8, 2006, 12:30:37 PM

That's good to know. I haven't worked with cET yet - haven't had time to
get it installed :-(

uche....@gmail.com

Jun 11, 2006, 11:02:50 AM

Honestly, I think that legitimate use-cases for multi-gigabyte XML are
very rare. Many people abuse XML as some sort of DBMS replacement.
This abuse is part of the reason why so many developers are hostile to
XML. XML is best for documents, and documents can get to the
multi-gigabyte range, but rarely do. Usually, when they do, there is a
logical way to decompose them, process them, and re-compose them,
whereas with XML used as a DBMS replacement, relations and datatyping
complicate such natural divide-and-conquer techniques.

I always say that if you're dealing with gigabyte XML, it's well worth
considering whether you're not using a hammer to screw in a bolt.

If monster XML is inevitable, then I'll extend Fredrik's earlier mention
of Amara to say that pushdom allows you to pre-declare the chunks of
XML you're interested in, and then it processes the XML in streaming
mode, only instantiating the chunks of interest one at a time. This
allows for handling of huge files with a very simple programming idiom.

http://uche.ogbuji.net/tech/4suite/amara/

--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://fourthought.com
http://copia.ogbuji.net http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

cavallo71

Jun 22, 2006, 5:10:29 AM
> > I wrote a program that takes an XML file into memory using Minidom. I
> > found out that the XML document is 10gb.
> >
> > I clearly need SAX or something else?

If the data is composed of a large number of records,
like a database dump of some sort,
then you could probably have a look at a StAX-style processor
for Python, like pulldom.

In this way you could process each single record one at a time,
without loading the entire document.

Regards,
Antonio
