This is a rather long post, but I wanted to include all the details & everything I have tried so far myself, so please bear with me & read the entire boringly long post.
I am trying to parse a ginormous (~1 GB) XML file.
0. I am a python & xml n00b, and have been relying on the excellent beginner book DIP (Dive Into Python 3 by Mark Pilgrim.... Mark, if you are reading this, you are AWESOME & so is your witty & humorous writing style).
1. Almost all examples of parsing XML in Python I have seen start off with these 4 lines of code.
import xml.etree.ElementTree as etree
tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot() #my huge xml has 1 root at the top level
print root
2. In the 2nd line of code above, as Mark explains in DIP, the parse function builds & returns a tree object, in memory (RAM), which represents the entire document.
I tried this code, which works fine for a small file (~1 MB), but when I run this simple 4-line py code in a terminal for my HUGE target file (1 GB), nothing happens.
In a separate terminal, I run the top command, & I can see a python process, with memory (the VIRT column) increasing from 100 MB all the way up to 2100 MB.
I am guessing, as this happens (over the course of 20-30 mins), the tree representing the document is being slowly built in memory, but even after 30-40 mins, nothing happens.
I don't get an error, seg fault or out-of-memory exception.
My hardware setup: I have a Win7 Pro box with 8 GB of RAM & an Intel Core 2 Quad Q9400 CPU.
On this I am running Sun VirtualBox (3.2.12), with Ubuntu 10.10 as the guest OS, with 23 GB disk space & 2 GB (2048 MB) RAM assigned to the guest Ubuntu OS.
3. I also tried using lxml, but an lxml tree is much more expensive, as it retains more info about a node's context, including references to its parent.
[http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]
When I ran the same 4-line code above, but with lxml's ElementTree (using the import below in line 1 of the code above)
import lxml.etree as lxml_etree
I can see the memory consumption of the python process (which is running the code) shoot up to ~2700 MB & then python (or the OS?) kills the process as it nears the total system memory (2 GB).
I ran the code from 1 terminal window (screenshot: http://imgur.com/ozLkB.png)
& ran top from another terminal (http://imgur.com/HAoHA.png)
4. I then investigated some streaming libraries, but am confused - there is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] and the iterparse interface [http://effbot.org/zone/element-iterparse.htm].
Which one is best for my situation?
Any & all code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments of the c.l.p community would be greatly appreciated.
Please feel free to email me directly too.
thanks a ton
cheers
ashish
email :
ashish.makani
domain:gmail.com
p.s.
Other useful links on xml parsing in python
0. http://diveintopython3.org/xml.html
1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
2. http://codespeak.net/lxml/tutorial.html
3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
5. http://effbot.org/zone/element-index.htm
http://effbot.org/zone/element-iterparse.htm
6. SAX : http://en.wikipedia.org/wiki/Simple_API_for_XML
I have made extensive use of SAX and it will certainly work for low-memory
parsing of XML. I have never used "iterparse"; so, I cannot make
an informed comparison between them.
> Which one is the best for my situation ?
Your post was long but it failed to tell us the most important piece
of information: What does your data look like and what are you trying
to do with it?
SAX is a low-level API that provides a callback interface allowing you to
process various elements as they are encountered. You can therefore
do anything you want to the information, as you encounter it, including
outputting and discarding small chunks as you process it; ignoring
most of it and saving only what you want to in-memory data structures;
or saving all of it to a more random-access database or on-disk data
structure that you can load and process as required.
What you need to do will depend on what you are actually trying to
accomplish. Without knowing that, I can only affirm that SAX will work
for your needs without providing any information about how you should
be using it.
I do that hundreds of times a day.
> 0. I am a python & xml n00b, s& have been relying on the excellent
> beginner book DIP(Dive_Into_Python3 by MP(Mark Pilgrim).... Mark , if
> u are readng this, you are AWESOME & so is your witty & humorous
> writing style)
> 1. Almost all exmaples pf parsing xml in python, i have seen, start off with these 4 lines of code.
> import xml.etree.ElementTree as etree
> tree = etree.parse('*path_to_ginormous_xml*')
> root = tree.getroot() #my huge xml has 1 root at the top level
> print root
Yes, this is a terrible technique; most examples are crap.
> 2. In the 2nd line of code above, as Mark explains in DIP, the parse
> function builds & returns a tree object, in-memory(RAM), which
> represents the entire document.
> I tried this code, which works fine for a small ( ~ 1MB), but when i
> run this simple 4 line py code in a terminal for my HUGE target file
> (1GB), nothing happens.
> In a separate terminal, i run the top command, & i can see a python
> process, with memory (the VIRT column) increasing from 100MB , all the
> way upto 2100MB.
Yes, this is using DOM. DOM is evil and the enemy, full-stop.
> I am guessing, as this happens (over the course of 20-30 mins), the
> tree representing is being slowly built in memory, but even after
> 30-40 mins, nothing happens.
> I dont get an error, seg fault or out_of_memory exception.
You need to process the document as a stream of elements; aka SAX.
> 3. I also tried using lxml, but an lxml tree is much more expensive,
> as it retains more info about a node's context, including references
> to it's parent.
> [http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]
> When i ran the same 4line code above, but with lxml's elementree
> ( using the import below in line1of the code above)
> import lxml.etree as lxml_etree
You're still using DOM; DOM is evil.
> Which one is the best for my situation ?
> Any & all
> code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of
> the c.l.p community would be greatly appreciated.
> Plz feel free to email me directly too.
First up, thanks for your prompt reply.
I will make sure I read RFC 1855 before posting again, but right now I am chasing a hard deadline :)
I am sorry I left out what exactly I am trying to do.
0. Goal: I am looking for a specific element. There are several tens/hundreds of occurrences of that element in the 1 GB XML file.
The contents of the XML are just a dump of config parameters from a packet switch (although imho, the contents of the XML don't matter).
I need to detect them & then, for each one, I need to copy all the content between the element's start & end tags & create a smaller XML file.
1. Can you point me to some examples/samples of using SAX, especially ones dealing with really large XML files?
2. This brings me to another question which I forgot to ask in my OP (original post).
Is simply opening the file & using regex to look for the element I need a *good* approach?
While researching my problem, some articles seemed to advise against this, especially since it's known a priori that the file is XML & since regex code gets complicated very quickly & is not very readable.
But is that just a "style"/"elegance" issue, & for my particular problem (detecting a certain element & then creating (writing) a smaller XML file corresponding to each pair of start & end tags of said element), is the open-file-&-regex approach something you would recommend?
Thanks again for your super-prompt response :)
cheers
ashish
Yep, do that a lot; via iterparse.
> 1. Can you point me to some examples/samples of using SAX,
> especially , ones dealing with really large XML files.
SAX is equivalent to iterparse (iterparse is a way to, essentially, do
SAX-like processing).
I provided an iterparse example already. See the Read_Rows method in
<http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>
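In outline, the iterparse approach boils down to something like this untested sketch (the file names and the tag name 'target' are invented placeholders for whatever your element is actually called):

import xml.etree.cElementTree as etree

# 'huge.xml' and the tag name 'target' are placeholders for illustration.
count = 0
for event, elem in etree.iterparse('huge.xml', events=('end',)):
    if elem.tag == 'target':
        count += 1
        # write just this element and its subtree to its own small file
        etree.ElementTree(elem).write('target_%05d.xml' % count)
        elem.clear()   # drop the subtree we no longer need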
> 2.This brings me to another q. which i forgot to ask in my OP(original post).
> Is simply opening the file, & using reg ex to look for the element i need, a *good* approach ?
No.
> Yes, this is using DOM. DOM is evil and the enemy, full-stop.
> You're still using DOM; DOM is evil.
For serial processing, DOM is superfluous superstructure.
For random access processing, some might disagree.
>
>> Which one is the best for my situation ?
>> Any& all
>> code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of
>> the c.l.p community would be greatly appreciated.
>> Plz feel free to email me directly too.
>
> <http://docs.python.org/library/xml.sax.html>
>
> <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>
For Python (unlike Java), wrapping module functions as class static
methods is superfluous superstructure that only slows things down.
raise Exception(...) # should be something specific like
raise ValueError(...)
--
Terry Jan Reedy
Then you need:
1. To detect whenever you move inside the type of element you are
seeking and whenever you move out of it. As long as these
elements cannot be nested inside of each other, this is an
easy binary task. If they can be nested, then you will
need to maintain some kind of level count or recursively
decompose each level.
2. Once you have obtained a complete element (from its start tag to
its end tag) you will need to test whether you have the
single correct element that you are looking for.
Something like this (untested) will work if the target tag cannot be nested
in another target tag:
import xml.sax

targetName = 'your-target-tag'   # the element name you are searching for
inputXML = 'huge.xml'            # path (or file object) of your XML source

class tagSearcher(xml.sax.ContentHandler):

    def startDocument(self):
        self.inTarget = False

    def startElement(self, name, attrs):
        if name == targetName:
            self.inTarget = True
        elif self.inTarget:
            pass   # save element information

    def endElement(self, name):
        if name == targetName:
            self.inTarget = False
            # test the saved information to see if you have the
            # one you want:
            #
            # if it's the piece you are looking for, then
            # you can process the information you have saved
            #
            # if not, discard the accumulated
            # information and move on

    def characters(self, content):
        if self.inTarget:
            pass   # save the content

yourHandler = tagSearcher()
yourParser = xml.sax.make_parser()
yourParser.parse(inputXML, yourHandler)
Then you just walk through the document picking up and discarding each
target element type until you have the one that you are looking for.
> I need to detect them & then for each 1, i need to copy all the content
> b/w the element's start & end tags & create a smaller xml file.
Easy enough; but, with SAX you will have to recreate the tags from
the information that they contain because they will be skipped by the
characters() events; so you will need to save the information from each tag
as you come across it. This could probably be done more automatically
using saxutils.XMLGenerator; but, I haven't actually worked with it
before. xml.dom.pulldom also looks interesting.
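Untested, but the XMLGenerator idea would look roughly like the sketch below; the tag name 'target', the output file names and the input path are invented placeholders:

import xml.sax
from xml.sax.saxutils import XMLGenerator

class TargetExtractor(xml.sax.ContentHandler):
    """Copy each <target> element (tag name invented) into its own file."""

    def __init__(self, target='target'):
        xml.sax.ContentHandler.__init__(self)
        self.target = target
        self.depth = 0     # > 0 while we are inside a target element
        self.count = 0
        self.writer = None

    def startElement(self, name, attrs):
        if self.depth == 0 and name == self.target:
            self.count += 1
            self.out = open('extract_%05d.xml' % self.count, 'w')
            self.writer = XMLGenerator(self.out, 'utf-8')
            self.writer.startDocument()
        if self.depth > 0 or name == self.target:
            self.depth += 1
            self.writer.startElement(name, attrs)   # re-emit the tag verbatim

    def characters(self, content):
        if self.depth > 0:
            self.writer.characters(content)

    def endElement(self, name):
        if self.depth > 0:
            self.writer.endElement(name)
            self.depth -= 1
            if self.depth == 0:          # closed the outer target element
                self.writer.endDocument()
                self.out.close()
                self.writer = None

xml.sax.parse('huge.xml', TargetExtractor())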
> 1. Can you point me to some examples/samples of using SAX, especially ,
> ones dealing with really large XML files.
There is nothing special about large files with SAX. SAX is very simple.
It walks through the document and calls the functions that you
give it for each event as it reaches various elements. Your callback
functions (methods of a handler) do everything with the information.
SAX does nothing more than call your functions. There are events for
reaching a starting tag, an end tag, and characters between tags,
as well as some for beginning and ending a document.
> 2.This brings me to another q. which i forgot to ask in my OP(original
> post). Is simply opening the file, & using reg ex to look for the element
> i need, a *good* approach ? While researching my problem, some article
> seemed to advise against this, especially since its known apriori, that
> the file is an xml & since regex code gets complicated very quickly &
> is not very readable.
>
> But is that just a "style"/"elegance" issue, & for my particular problem
> (detecting a certain element, & then creating(writing) a smaller xml
> file corresponding to, each pair of start & end tags of said element),
> is the open file & regex approach, something you would recommend ?
It isn't an invalid approach if it works for your situation. I have
used it before for very simple problems. The thing is, XML is a context-free
data format which makes it difficult to generate precise regular
expressions, especially where tags of the same type can be nested.
It can be very error prone. It's really easy to have a regex work for
your tests and fail, either by matching too much or failing to match,
because you didn't anticipate a given piece of data. I wouldn't consider
it a robust solution.
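For instance, the obvious non-greedy pattern below (tag name invented) silently returns a mangled fragment as soon as the element nests:

import re

# Tag name 'record' is invented; non-greedy matching looks plausible...
pattern = re.compile(r'<record\b[^>]*>(.*?)</record>', re.DOTALL)

sample = '<record><record>inner</record>tail</record>'
print pattern.findall(sample)
# prints ['<record>inner'] -- the match stops at the *first* closing tag,
# quietly handing back a broken fragment once records nest.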
Try
import xml.etree.cElementTree as etree
instead. Note the leading "c", which hints at the C implementation of
ElementTree. It's much faster and much more memory-friendly than the Python
implementation.
>> tree = etree.parse('*path_to_ginormous_xml*')
>> root = tree.getroot() #my huge xml has 1 root at the top level
>> print root
>
> Yes, this is a terrible technique; most examples are crap.
>
>> 2. In the 2nd line of code above, as Mark explains in DIP, the parse
>> function builds& returns a tree object, in-memory(RAM), which
>> represents the entire document.
>> I tried this code, which works fine for a small ( ~ 1MB), but when i
>> run this simple 4 line py code in a terminal for my HUGE target file
>> (1GB), nothing happens.
>> In a separate terminal, i run the top command,& i can see a python
>> process, with memory (the VIRT column) increasing from 100MB , all the
>> way upto 2100MB.
>
> Yes, this is using DOM. DOM is evil and the enemy, full-stop.
Actually, ElementTree is not "DOM", it's modelled after the XML Infoset.
While I agree that DOM is, well, maybe not "the enemy", but not exactly
beautiful either, ElementTree is really a good thing, likely also in this case.
>> I am guessing, as this happens (over the course of 20-30 mins), the
>> tree representing is being slowly built in memory, but even after
>> 30-40 mins, nothing happens.
>> I dont get an error, seg fault or out_of_memory exception.
>
> You need to process the document as a stream of elements; aka SAX.
IMHO, this is the worst advice you can give.
Stefan
Then cElementTree's iterparse() is your friend. It allows you to basically
iterate over the XML tags while it's building an in-memory tree from them.
That way, you can either remove subtrees from the tree if you don't need
them (to save memory) or otherwise handle them in any way you like, such as
serialising them into a new file (and then deleting them).
Also note that the iterparse implementation in lxml.etree allows you to
specify a tag name to restrict the iterator to these tags. That's usually a
lot faster, but it also means that you need to take more care to clean up
the parts of the tree that the iterator stepped over. Depending on your
requirements and the amount of manual code optimisation that you want to
invest, either cElementTree or lxml.etree may perform better for you.
It seems that you already found the article by Liza Daly about high
performance XML processing with Python. Give it another read, it has a
couple of good hints and examples that will help you here.
Stefan
I've just subclassed HTMLParser for this. It's slow, but
100% Python. Using the SAX parser is essentially equivalent.
I'm processing multi-gigabyte XML files and updating a MySQL
database, so I do need to look at all the entries, but don't
need a parse tree of the XML.
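The skeleton of that approach is roughly the following (a rough sketch, not my production code; the tag name 'entry' and the file name are invented, and note that HTMLParser lower-cases tag names):

from HTMLParser import HTMLParser   # Python 2 module name

class EntryScanner(HTMLParser):
    """Collect the text of each <entry> element (tag name invented)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_entry = False
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag == 'entry':
            self.in_entry = True
            self.buf = []

    def handle_data(self, data):
        if self.in_entry:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag == 'entry':
            self.in_entry = False
            print ''.join(self.buf).strip()   # or update the database here

scanner = EntryScanner()
f = open('huge.xml')
for chunk in iter(lambda: f.read(64 * 1024), ''):
    scanner.feed(chunk)          # feed the file incrementally, never all at once
scanner.close()
f.close()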
> SAX is equivalent to iterparse (iterparse is a way to, essentially, do
> SAX-like processing).
Iterparse does try to build a tree, although you can discard the
parts you don't want. If you can't decide whether a part of the XML
is of interest until you're deep into it, an "iterparse" approach
may result in a big junk tree. You have to keep clearing the "root"
element to discard that.
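The usual clearing idiom looks something like this (untested; 'huge.xml' and the tag name 'record' are invented):

import xml.etree.cElementTree as etree

context = etree.iterparse('huge.xml', events=('start', 'end'))
event, root = next(context)          # the first 'start' event hands us the root

for event, elem in context:
    if event == 'end' and elem.tag == 'record':
        # ... process elem ...
        root.clear()                 # throw away everything parsed so far,
                                     # otherwise the junk tree keeps growing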
> I provided an iterparse example already. See the Read_Rows method in
> <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>
I don't quite see the point of creating a class with only static
methods. That's basically a verbose way to create a module.
>
>> 2.This brings me to another q. which i forgot to ask in my OP(original post).
>> Is simply opening the file,& using reg ex to look for the element i need, a *good* approach ?
>
> No.
If the XML file has a very predictable structure, that may not be
a bad idea. It's not very general, but if you have some XML file
that's basically fixed format records using XML to delimit the
fields, pounding on the thing with a regular expression is simple
and fast.
John Nagle
And take a look at xmlsh.org; they offer tools for the command line,
like xml2csv. (Needs Java, btw).
Cheers
> Normally (what is normal, anyway?) such files are auto-generated,
> and are something that has a apparent similarity with a database query
> result, encapsuled in xml.
> Most of the time the structure is same for every "row" thats in there.
> So, a very unpythonic but fast, way would be to let awk resemble the
> records and write them in csv format to stdout.
awk works well if the input is formatted such that each line is a record;
it's not so good otherwise. XML isn't a line-oriented format; in
particular, there are many places where both newlines and spaces are just
whitespace. A number of XML generators will "word wrap" the resulting XML
to make it more human readable, so line-oriented tools aren't a good idea.
For large datasets I always have huge question marks if one says "xml".
But I don't want to start a flame war.
I agree people abuse the "spirit of XML" using it to transfer gigabytes
of data, but what else are they to use?
regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
PyCon 2011 Atlanta March 9-17 http://us.pycon.org/
See Python Video! http://python.mirocommunity.org/
Holden Web LLC http://www.holdenweb.com/
I keep reading people say that (and *much* worse). XML may not be the
tightly tailored solution for data of that size, but it's not inherently
wrong to store gigabytes of data in XML. I mean, XML is a reasonably fast,
versatile, widely used, well-compressing and safe data format with an
extremely ubiquitous and well optimised set of tools available for all
sorts of environments. So as soon as the data is at all complex or the
environments require portable data exchange, I consider XML a reasonable
choice, even for large data sets (which usually implies that it's
machine-generated output anyway).
Stefan
"Steve Holden" <st...@holdenweb.com> wrote:
>On 12/23/2010 4:34 PM, Stefan Sonnenberg-Carstens wrote:
>> For large datasets I always have huge question marks if one says
>"xml".
>> But I don't want to start a flame war.
>I agree people abuse the "spirit of XML" using it to transfer gigabytes
>of data,
How so? I think this assertion is bogus. XML works extremely well for large datasets.
>but what else are they to use?
If you are sending me data - please use XML . I've gotten 22GB XML files in the past - worked without issue and pretty quickly too.
Sure better than trying to figure out whatever goofy document format someone cooks up on their own. XML toolkits are proven and documented.
I would agree; but, you don't always have the choice over the data format
that you have to work with. You just have to do the best you can with what
they give you.
> I agree people abuse the "spirit of XML" using it to transfer gigabytes
> of data, but what else are they to use?
Something with an index so that you don't have to parse the entire file
would be nice. SQLite comes to mind. It is not standardized; but, the
implementation is free with bindings for most languages.
> XML works extremely well for large datasets.
Barf. I'll agree that there are some nice points to XML. It is
portable. It is (to a certain extent) human readable, and in a pinch
you can use standard text tools to do ad-hoc queries (i.e. grep for a
particular entry). And, yes, there are plenty of toolsets for dealing
with XML files.
On the other hand, the verbosity is unbelievable. I'm currently working
with a data feed we get from a supplier in XML. Every day we get
incremental updates of about 10-50 MB each. The total data set at this
point is 61 GB. It's got stuff like this in it:
<Parental-Advisory>FALSE</Parental-Advisory>
That's 54 bytes to store a single bit of information. I'm all for
human-readable formats, but bloating the data by a factor of 432 is
rather excessive. Of course, that's an extreme example. A more
efficient example would be:
<Id>1173722</Id>
which is 26 bytes to store an integer. That's only a bloat factor of
6-1/2.
Of course, one advantage of XML is that with so much redundant text, it
compresses well. We typically see gzip compression ratios of 20:1.
But, that just means you can archive them efficiently; you can't do
anything useful until you unzip them.
>> XML works extremely well for large datasets.
One advantage it has over many legacy formats is that there are no
inherent 2^31/2^32 limitations. Many binary formats inherently cannot
support files larger than 2 GiB or 4 GiB due to the use of 32-bit offsets in
indices.
> Of course, one advantage of XML is that with so much redundant text, it
> compresses well. We typically see gzip compression ratios of 20:1.
> But, that just means you can archive them efficiently; you can't do
> anything useful until you unzip them.
XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.
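For example, something along these lines (file name invented) parses straight out of the compressed stream:

import gzip
import xml.etree.cElementTree as etree

# iterparse() only needs a file-like object, so it can read directly
# from the decompression stream - no temporary uncompressed copy needed.
stream = gzip.open('dump.xml.gz', 'rb')
for event, elem in etree.iterparse(stream):
    # ... look at elem ...
    elem.clear()
stream.close()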
If file size is that much of an issue, eventually we'll see a standard for
compressing XML. This could easily result in smaller files than using a
dedicated format compressed with general-purpose compression algorithms,
as a widely-used format such as XML merits more effort than any
application-specific format.
And what legacy format has support for code pages, namespaces, schema
verification, or comments? None.
> > Of course, one advantage of XML is that with so much redundant text, it
> > compresses well. We typically see gzip compression ratios of 20:1.
> > But, that just means you can archive them efficiently; you can't do
> > anything useful until you unzip them.
> XML is typically processed sequentially, so you don't need to create a
> decompressed copy of the file before you start processing it.
Yep.
> If file size is that much of an issue,
Which it isn't.
> eventually we'll see a standard for
> compressing XML. This could easily result in smaller files than using a
> dedicated format compressed with general-purpose compression algorithms,
> as a widely-used format such as XML merits more effort than any
> application-specific format.
Agree; and there actually already is a standard compression scheme -
HTTP compression [supported by every modern web-server]; so the data is
compressed at the only point where it matters [during transfer].
Again: "XML works extremely well for large datasets".
Only if you're prepared to squander resources that could be put to better
use.
XML is so redundant, anyone (even me :-) could probably spend an afternoon
coming up with a compression scheme to reduce it to a fraction of its size.
It can even be a custom format, provided you also send along the few dozen
lines of Python (or whatever language) needed to decompress. Although if
it's done properly, it might be possible to create an XML library that works
directly on the compressed format, and as a plug-in replacement for a
conventional library.
That will likely save time and memory.
Anyway there seem to be existing schemes for binary XML, indicating some
people do think it is an issue.
I'm just concerned at the waste of computer power (I used to think HTML was
bad, for example repeating the same long-winded font name hundreds of times
over in the same document. And PDF: years ago I was sent a 1MB document for
a modem; perhaps some substantial user manual for it? No, just a simple
diagram showing how to plug it into the phone socket!).
--
Bartc
That is probably true of many older and binary formats; but, XML
is certainly not the only format that supports arbitrary size.
It certainly doesn't prohibit another format with better handling of
large data sets from being developed. XML's primary benefit is its
ubiquity. While it is an excellent format for a number of uses, I don't
accept ubiquity as the only or preeminent metric when choosing a data
format.
>> Of course, one advantage of XML is that with so much redundant text, it
>> compresses well. We typically see gzip compression ratios of 20:1.
>> But, that just means you can archive them efficiently; you can't do
>> anything useful until you unzip them.
>
> XML is typically processed sequentially, so you don't need to create a
> decompressed copy of the file before you start processing it.
Sometimes XML is processed sequentially. When the markup footprint is
large enough it must be. Quite often, as in the case of the OP, you only
want to extract a small piece out of the total data. In those cases, being
forced to read all of the data sequentially is both inconvenient and a
performance penalty unless there is some way to address the data you want
directly.
Sometimes that is true and sometimes it isn't. There are many situations
where you want to access the data nonsequentially or address just a small
subset of it. Just because you never want to access data randomly doesn't
mean others might not. Certainly the OP would be happier using something
like XPath to get just the piece of data that he is looking for.
>> XML is typically processed sequentially, so you don't need to create a
>> decompressed copy of the file before you start processing it.
>
> Sometimes XML is processed sequentially. When the markup footprint is
> large enough it must be. Quite often, as in the case of the OP, you only
> want to extract a small piece out of the total data. In those cases,
> being forced to read all of the data sequentially is both inconvenient and
> a performance penalty unless there is some way to address the data you
> want directly.
Actually, I should have said "must be processed sequentially". Even if you
only care about a small portion of the data, you have to read it
sequentially to locate that portion. IOW, anything you can do with
uncompressed XML can be done with compressed XML; you can't do random
access with either.
If XML has a drawback over application-specific formats, it's the
sequential nature of XML rather than its (uncompressed) size.
OTOH, formats designed for random access tend to be more limited in their
utility. You can only perform random access based upon criteria which
match the format's indexing. Once you step outside that, you often have to
walk the entire file anyhow.
So what? If you only have to do that once, it doesn't matter if you have to
read the whole file or just a part of it. Should make a difference of a
couple of minutes.
If you do it a lot, you will have to find a way to make the access
efficient for your specific use case. So the file format doesn't matter
either, because the data will most likely end up in a fast data base after
reading it in sequentially *once*, just as in the case above.
I really don't think there are many important use cases where you need fast
random access to large data sets and cannot afford to adapt the storage
layout before hand.
Stefan
That may be true and it may not. Even assuming that you have to walk
through a large number of top level elements there may be an advantage
to being able to directly access the next element as opposed to having
to parse through the entire current element once you have determined it
isn't one which you are looking for. To be fair, this may be invalid
preoptimization without taking into account how the hard drive buffers;
but, I would suspect that there is a threshold where the amount of
data skipped starts to outweigh the penalty of overreaching the hard
drive's buffers.
Much agreed. I assume that the process needs to be repeated or it
probably would be simpler just to rip out what I wanted using regular
expressions with shell utilities.
> If you do it a lot, you will have to find a way to make the access
> efficient for your specific use case. So the file format doesn't matter
> either, because the data will most likely end up in a fast data base after
> reading it in sequentially *once*, just as in the case above.
If the data is just going to end up in a database anyway; then why not
send it as a database to begin with and save the trouble of having to
convert it?
I don't think anyone would object to using a native format when copying
data from one database 1:1 to another one. But if the database formats are
different on both sides, it's a lot easier to map XML formatted data to a
given schema than to map a SQL dump, for example. Matter of use cases, not
of data size.
Stefan
Your assumption keeps hinging on the fact that I should want to dump
the data into a database in the first place. Very often I don't.
I just want to rip out the small portion of information that happens to
be important to me. I may not even want to archive my little piece of
the information once I have processed it.
Even assuming that I want to dump all the data into a database,
walking through a bunch of database records to translate them into the
schema for another database is no more difficult than walking through a
bunch of XML elements. In fact, it is even easier since I can use the
relational model to reconstruct the information in an organization that
better fits how the data is actually structured in my database instead
of being constrained by how somebody else wanted to organize their XML.
There is no need to "map a[sic] SQL dump."
XML is great when the data set is small enough that parsing the
whole tree has negligible cost. I can choose whether I want to parse
it sequentially or use XPath/DOM/Etree etc to make it appear as though
I am making random accesses. When the data set grows so that parsing
it is expensive I lose that choice even if my use case would otherwise
prefer a random access paradigm. When that happens, there are better
ways of communicating that data that don't force me into using a high
overhead method of extracting my data.
The problem is that XML has become such a de facto standard that it is
used automatically, without thought, even when there are much better
alternatives available.
Why do you say that? I would have thought that using SAX in this
application is an excellent idea.
I agree that for applications for which performance is not a problem,
and for which we need to examine more than one or a few element types, a
tree implementation is more functional, less programmer intensive, and
provides an easier to understand approach to the data. But with huge
amounts of data where performance is a problem SAX will be far more
practical. In the special case where only a few elements are of
interest in a complex tree, SAX can sometimes also be more natural and
easy to use.
SAX might also be more natural for this application. The O.P. could
tell us for sure, but I wonder if perhaps his 1 GB XML file is NOT a
true single record. You can store an entire text encyclopedia in less
than one GB. What he may have is a large number of logically distinct
individual records of some kind, each stored as a node in an
all-encompassing element wrapper. Building a tree for each record could
make sense but, if I'm right about the nature of the data, building a
tree for the wrapper gives very little return for the high cost.
If that's so, then I'd recommend one of two approaches:
1. Use SAX, or
2. Parse out individual logical records using string manipulation on an
input stream, then build a tree for one individual record in memory
using one of the DOM or ElementTree implementations. After each record
is processed, discard its tree and start on the next record.
Alan
I agree with you but, as you say, it has become a de facto standard. As
a result, we often need to use it unless there is some strong reason to
use something else.
The same thing can be said about relational databases. There are
applications for which a hierarchical database makes more sense, is more
efficient, and is easier to understand. But anyone who recommends a
database that is not relational had better be prepared to defend his
choice with some powerful reasoning because his management, his
customers, and the other programmers on his team are probably going to
need a LOT of convincing.
And of course there are many applications where XML really is the best.
It excels at representing complex textual documents while still
allowing programmatic access to individual items of information.
Alan
From my experience, SAX is only practical for very simple cases where
little state is involved when extracting information from the parse events.
A typical example is gathering statistics based on single tags - not a very
common use case. Anything that involves knowing where in the XML tree you
are to figure out what to do with the event is already too complicated. The
main drawback of SAX is that the callbacks are spread over separate method
calls, so you have to do all the state keeping manually through fields of
the SAX handler instance.
My serious advice is: don't waste your time learning SAX. It's simply too
frustrating to debug SAX extraction code into existence. Given how simple
and fast it is to extract data with ElementTree's iterparse() in a memory
efficient way, there is really no reason to write complicated SAX code instead.
Stefan
I've found that using a stack-model makes traversing complex documents
with SAX quite manageable. For example, I parse BPML files with SAX.
If the document is nested and context sensitive then I really don't see
how iterparse differs all that much.
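Concretely, the stack model is just something like this rough sketch (the element names and file name are invented):

import xml.sax

class StackingHandler(xml.sax.ContentHandler):
    """Keeps the current element path on a stack so every callback knows
    its context without ad-hoc boolean flags."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.path = []

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        self.path.pop()

    def characters(self, content):
        # React only at a known location in the tree (names invented).
        if self.path[-3:] == ['catalog', 'product', 'price']:
            print content.strip()

xml.sax.parse('catalog.xml', StackingHandler())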
XML should be used where it makes sense to do so. As always, use the
proper tool for the proper job. XML became such a de facto standard, in
part, because it was abused for many uses in the first place, so using it
because it is a de facto standard is just piling more and more mistakes
on top of each other.
> The same thing can be said about relational databases. There are
> applications for which a hierarchical database makes more sense, is more
> efficient, and is easier to understand. But anyone who recommends a
> database that is not relational had better be prepared to defend his
> choice with some powerful reasoning because his management, his
> customers, and the other programmers on his team are probably going to
> need a LOT of convincing.
I have no particular problem with using other database models in
theory. In practice, at least until recently, there were few decent
implementations for alternative model databases. That is starting to
change with the advent of the so-called NoSQL databases. There are a few
models that I really do like; but, there are also a lot of failed models.
A large part of the problem was the push towards object databases which
is one of the failed models IMNSHO. Its failure tended to give some of
the other database models a bad name.
> And of course there are many applications where XML really is the best.
> It excels at representing complex textual documents while still
> allowing programmatic access to individual items of information.
Much agreed. There are many things that XML does very well. It works
great for XML-RPC style interfaces. I prefer it over binary formats
for documents. It is suitable for exporting discrete amounts of
information.
There are however a number of things that it does poorly. I don't
condone its use for configuration files. I don't condone its use as a
data store, and when you have data approaching gigabytes, that is exactly
how you are using it.
> On 12/26/2010 3:15 PM, Tim Harig wrote:
> I agree with you but, as you say, it has become a de facto standard. As
> a result, we often need to use it unless there is some strong reason to
> use something else.
This is certainly true. In the rarified world of usenet, we can all
bash XML (and I'm certainly front and center of the XML bashing crowd).
In the real world, however, it's a necessary evil. Knowing how to work
with it (at least to some extent) should be in every software engineer's
bag of tricks.
> The same thing can be said about relational databases. There are
> applications for which a hierarchical database makes more sense, is more
> efficient, and is easier to understand. But anyone who recommends a
> database that is not relational had better be prepared to defend his
> choice with some powerful reasoning because his management, his
> customers, and the other programmers on his team are probably going to
> need a LOT of convincing.
This is also true. In the old days, they used to say, "Nobody ever got
fired for buying IBM". Relational databases have pretty much gotten to
that point. Suits are comfortable with Oracle and MS SqlServer, and
even MySQL. If you want to go NoSQL, the onus will be on you to
demonstrate that it's the right choice.
Sometimes, even when it is the right choice, it's the wrong choice. You
typically have a limited amount of influence capital to spend, and many
battles to fight. Sometimes it's right to go along with SQL, even if
you know it's wrong from a technology point of view, simply because
taking the easy way out on that battle may let you devote the energy you
need to win more important battles.
And, anyway, when your SQL database becomes the bottleneck, you can
always go back and say, "I told you so". Trust me, if you're ever
involved in an "I told you so" moment, you really want to be on the
transmitting end.
> And of course there are many applications where XML really is the best.
> It excels at representing complex textual documents while still
> allowing programmatic access to individual items of information.
Yup. For stuff like that, there really is no better alternative. To go
back to my earlier example of
<Parental-Advisory>FALSE</Parental-Advisory>
using 432 bits to store 1 bit of information, stuff like that doesn't
happen in marked-up text documents. Most of the file is CDATA (do they
still use that term in XML, or was that an SGML-ism only?). The markup
is a relatively small fraction of the data. I'm happy to pay a factor
of 2 or 3 to get structured text that can be machine processed in useful
ways. I'm not willing to pay a factor of 432 to get tabular data when
there's plenty of other much more reasonable ways to encode it.
I confess that I hadn't been thinking about iterparse(). I presume that
clear() is required with iterparse() if we're going to process files of
arbitrary length.
I should think that this approach provides an intermediate solution.
It's more work than building the full tree in memory because the
programmer has to do some additional housekeeping to call clear() at the
right time and place. But it's less housekeeping than SAX.
I guess I've done enough SAX, in enough different languages, that I
don't find it that onerous to use. When I need an element stack to keep
track of things I can usually re-use code I've written for other
applications. But for a programmer that doesn't do a lot of this stuff,
I agree, the learning curve with lxml will be shorter and the
programming and debugging can be faster.
Alan
> ... In the old days, they used to say, "Nobody ever got
> fired for buying IBM". Relational databases have pretty much gotten to
> that point....
That's _exactly_ the comparison I had in mind too.
I once worked for a company that made a pitch to a big potential client
(the BBC) and I made the mistake of telling the client that I didn't
think a relational database was the best for his particular application.
We didn't win that contract and I never made that mistake again!
Alan
I've written a lot of code on a lot of projects in my 35 year career but
I don't think I've written anything anywhere near as useful to anywhere
near as many people as lxml.
Thank you very much for writing lxml and contributing it to the community.
Alan
If the above only appears once in a large document, I don't care how much
space it takes. If it appears all over the place, it will compress down to
a couple of bits, so I don't care about the space, either.
It's readability that counts here. Try to reverse engineer a binary format
that stores the above information in 1 bit.
Stefan
I don't. After all, this discussion is more about the general data format
than the specific tools.
> I use lxml more and more in my work. It's fast, functional and pretty elegant.
>
> I've written a lot of code on a lot of projects in my 35 year career but I
> don't think I've written anything anywhere near as useful to anywhere near
> as many people as lxml.
>
> Thank you very much for writing lxml and contributing it to the community.
Thanks, I'm happy to read that. You're welcome.
Note that lxml also owes a lot to Fredrik Lundh for designing ElementTree
and to Martijn Faassen for starting to reimplement it on top of libxml2
(and choosing the name :).
Stefan
The iterparse() implementation in lxml.etree allows you to intercept on a
specific tag name, which is especially useful for large XML documents that
are basically an endless sequence of (however deeply structured) top-level
elements - arguably the most common format for gigabyte sized XML files. So
what I usually do here is to intercept on the top level tag name, clear()
that tag after use and leave it dangling around, like this:
for _, element in ET.iterparse(source, tag='toptagname'):
    # ... work on the element and its subtree
    element.clear()
That allows you to write simple in-memory tree handling code (iteration,
XPath, XSLT, whatever), while pushing the performance up (compared to ET's
iterparse that returns all elements) and keeping the total amount of memory
usage reasonably low. Even a series of several hundred thousand empty top
level tags don't add up to anything that would truly hurt a decent machine.
In many cases where I know that the XML file easily fits into memory
anyway, I don't even do any housekeeping at all. And the true advantage is:
if you ever find that it's needed because the file sizes grow beyond your
initial expectations, you don't have to touch your tested and readily
debugged data extraction code, just add a suitable bit of cleanup code, or
even switch from the initial all-in-memory parse() solution to an
event-driven iterparse()+cleanup solution.
> I guess I've done enough SAX, in enough different languages, that I don't
> find it that onerous to use. When I need an element stack to keep track of
> things I can usually re-use code I've written for other applications. But
> for a programmer that doesn't do a lot of this stuff, I agree, the learning
> curve with lxml will be shorter and the programming and debugging can be
> faster.
I'm aware that SAX has the advantage of being available for more languages.
But if you are in the lucky position to use Python for XML processing, why
not just use the tools that it makes available?
Stefan
+1
> It's readability that counts here. Try to reverse engineer a binary format
> that stores the above information in 1 bit.
I think a point many of the arguments against XML miss is the HR cost of
custom solutions. Every time you come up with a cool super-efficient
solution it has to be weighed against the increase in the tool-stack
[whereas XML is, essentially, built-in] and
nobody-else-knows-about-your-super-cool-solution [1]. IMO, tool-stack
bloat is a *big* problem in shops with an Open Source tendency. Always
tossing the new and shiny thing [it's free!] into the bucket for some
theoretical benefit. [This is an unrecognized benefit to expensive
software - it creates focus]. Soon the bucket is huge and maintaining
it becomes a burden.
[1] The odds you sufficiently documented your super-cool-solution is
probably nil.
So I'm one of those you'd have to make a *really* good argument *not* to
use XML. XML is known, the tools are good, the knotty problems are
solved [thanks to the likes of SAX, lxml / ElementTree, and
ElementFlow]. If the premise argument is "bloat" I'd probably dismiss
it out of hand since removing that bloat will necessitate adding bloat
somewhere else; that somewhere else almost certainly being more
expensive.
"Stefan Behnel" <stef...@behnel.de> wrote in message
news:mailman.335.1293516...@python.org...
The above typically won't get much below 2 bytes (as one character plus a
separator, e.g. in comma-delimited format). So it's more like 27:1, if you're
going to stay with a text format.
Still, that's 27 times as much as it need be. Readability is fine, but why
does the full, expanded, human-readable textual format have to be stored on
disk too, and for every single instance?
What if the 'Parental-Advisory' tag was even longer? Just how long do these
things have to get before even the advocates here admit that it's getting
ridiculous?
Isn't it possible for XML to define a shorter alias for these tags? Isn't
there a shortcut available for </Parental-Advisory> in simple examples like
this (I seem to remember something like this)?
And why not use 1 and 0 for TRUE and FALSE? Even the consumer appliances in
my house have 1 and 0 on their power switches! With the advantage that they
are internationally recognised.
--
Bartc
>> Roy Smith, 28.12.2010 00:21:
>>> To go back to my earlier example of
>>>
>>> <Parental-Advisory>FALSE</Parental-Advisory>
>>>
>
> Isn't it possible for XML to define a shorter alias for these tags? Isn't
> there a shortcut available for </Parental-Advisory> in simple examples like
> this (I seem to remember something like this)?
Yes, you can define your own entities in a DTD:
<!ENTITY paf "<Parental-Advisory>FALSE</Parental-Advisory>">
<!ENTITY pat "<Parental-Advisory>TRUE</Parental-Advisory>">
Later, in your document:
&paf;
&pat;
Although, this is a bit of a contrived example - if space is such a
major concern, one wouldn't be so wasteful of it to begin with, but
might instead use a short tag form whose value attribute defaults to
"FALSE".
<!ELEMENT advisory EMPTY>
<!ATTLIST advisory value (TRUE | FALSE) "FALSE">
Later, in your document:
<movie title="Bambi"><advisory/></movie>
<movie title="Scarface"><advisory value="TRUE"/></movie>
To save even more space, one could instead define a "pa" attribute as
part of the "movie" element, with a default value that would then take
no space at all:
<!ATTLIST movie pa (TRUE | FALSE) "FALSE">
Later, in your document:
<movie name="Bambi"/>
<movie name="Scarface" pa="TRUE"/>
When you see someone doing stupid things with a tool, it's usually not
the tool's fault. Far more often, it's someone using the wrong tool for
the task at hand, or using the right tool the wrong way.
> And why not use 1 and 0 for TRUE and FALSE?
Sounds reasonable in general, although a parental advisory would more
often be a range of possible values (G, PG, R, MA, etc.) rather than a
boolean.
sherm--
--
Sherm Pendley
<http://camelbones.sourceforge.net>
Cocoa Developer
> Still, that's 27 times as much as it need be. Readability is fine, but why
> does the full, expanded, human-readable textual format have to be stored on
> disk too, and for every single instance?
Well, I know the answer to that one. The particular XML feed I'm
working with is a dump from an SQL database. The element names in the
XML are exactly the same as the column names in the SQL database.
The difference being that in the database, the string
"Parental-Advisory" appears in exactly one place, in some schema
metadata table. In the XML, it appears (doubled!) once per row.
It's still obscene. That fact that I understand the cause of the
obscenity doesn't make it any less so.
Another problem with XML is that some people don't use real XML tools to
write their XML files. DTD? What's that? So you end up with tag soup
that the real XML tools can't parse on the other end.
On Dec 20, 11:34 am, spaceman-spiff <ashish.mak...@gmail.com> wrote:
> [original post quoted in full; snipped]
Thanks! I updated our codebase this afternoon...
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/
"The volume of a pizza of thickness 'a' and radius 'z' is
given by pi*z*z*a"