What's the best way to do it in Haskell? It looks like there's no
xml module in the standard library. I see there are several external
ones including HaXml, HXML, and HXT. I tried installing HaXml but
the build script crashed, never a good sign.
Is there a particular one of these packages that's clearly the best
choice? Is this stuff considered ready for prime time yet? Do I have
to understand arrows to use HXT? I currently kind of sort of barely
understand monads, I think. Arrows sound like yet more mind expansion,
always a good thing but it might take a while.
Thanks.
HXT is newer and more actively maintained. It was designed despite HaXml's
existence, but it may be less stable than HaXml.
HTH Christian
The other reason is that HXT's suite of functions is complete: Whatever
you do, you never need to know the data structure representing
documents. (The data structure is still exposed, if you really want, but
you don't get as much support.) This also means there are more functions
to learn. HaXml's suite of functions is incomplete: Most interesting
tasks are unsupported by the functions and you have to process the data
structure directly, e.g.,
http://thread.gmane.org/gmane.comp.lang.haskell.cafe/14466
So you end up feeling that its API is irrelevant anyway and you just need
to learn three things: how to read a file into a tree, how to write a tree
to a file, and what the tree structure in between is. That is a kind of
"much less to learn", but more code to write.
I have a preliminary HXT tutorial at
http://www.vex.net/~trebla/haskell/hxt-arrow/
But I haven't covered modifying or producing new trees.
The Haskell Wiki has another tutorial at
http://www.haskell.org/haskellwiki/HXT
with longer examples. But IMO it spends too much time on the now
obsolete "filter".
Cale Gibbard has another example at
http://cale.yi.org/index.php/HRSS
http://www.haskell.org/haskellwiki/HXT/Practical
is a (small, but hopefully growing) collection of some "real" examples
using HXT.
> I want to read in some (large) XML documents, crunch them around a
> little, and write out new documents. I'm currently doing it with a
> Python program using xml.etree.cElementTree but it's too darn slow.
>
It's slow because it has to build a document tree in memory.
You should consider using a SAX API, if possible.
By the way, I'm very new to Haskell.
I'm interested in how well it performs in the task of parsing XML files
(both speed and memory usage).
Regards Manlio Perillo
There are three libraries capable of XML.
HXT reads the whole file and keeps the whole tree in memory. (Its
advantage over the following two, however, is its API is richer, you can
write less code to do more.)
HaXml lets you choose the granularity of how much reading and building
per step. With the simplest coding, you still read the whole file and
build the whole tree. With more coding, you can read, build, discard
one subtree at a time.
TagSoup is the most fine-grained and low-level. It lazily reads the
file and returns a lazy list of "here is an open tag, here is some text,
oh here is another open tag, oh now here is a close tag..." It doesn't
verify open tags and close tags to match, however. This is probably
closest to SAX.
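
To make the "lazy list of tags" idea concrete, here is a rough sketch
using only the base library. It is NOT TagSoup's API: the Event type and
the tokenize and balanced functions are hypothetical names made up for
illustration, and the tokenizer is deliberately naive (no attributes,
entities, comments, CDATA or self-closing tags). With TagSoup itself,
parseTags gives you a similar list built from TagOpen, TagText and
TagClose values. The sketch also shows the kind of open/close matching
check you would have to write yourself on top of such a stream.

```haskell
module Main where

-- Hypothetical event type modelled on the "open tag, text, close tag"
-- stream described above. This is not TagSoup's Tag type.
data Event
  = Open String   -- an opening tag such as <name>
  | Close String  -- a closing tag such as </name>
  | Text String   -- character data between tags
  deriving (Eq, Show)

-- Turn a string into a lazy list of events. Each event is produced
-- before the rest of the input is examined, so a consumer can work
-- through a large input without the whole list being built first.
-- Assumption: very simplified syntax; attributes are dropped.
tokenize :: String -> [Event]
tokenize [] = []
tokenize ('<' : '/' : rest) =
  let (name, after) = break (== '>') rest
  in Close name : tokenize (drop 1 after)
tokenize ('<' : rest) =
  let (inside, after) = break (== '>') rest
      name = takeWhile (/= ' ') inside
  in Open name : tokenize (drop 1 after)
tokenize s =
  let (txt, after) = break (== '<') s
  in Text txt : tokenize after

-- Check that open and close tags nest properly, keeping a stack of
-- the names of the elements that are currently open.
balanced :: [Event] -> Bool
balanced = go []
  where
    go stack (Open n : es)          = go (n : stack) es
    go (top : stack) (Close n : es) = n == top && go stack es
    go [] (Close _ : _)             = False
    go stack (Text _ : es)          = go stack es
    go stack []                     = null stack

-- Collect only the text events from the stream.
texts :: [Event] -> [String]
texts es = [t | Text t <- es]

main :: IO ()
main = do
  let events = tokenize "<a><b>hi</b>there</a>"
  mapM_ print events
  print (texts events)
  print (balanced events)                       -- True
  print (balanced (tokenize "<a><b>oops</a>"))  -- False
```

The point of the design is that nothing here ever holds a document tree:
balanced only keeps the stack of currently open element names, which is
roughly what you get with a SAX-style handler.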
Funny this thread should revive after so long. I'm still interested
in the subject. I was just chatting about it a couple nights ago
online, and someone mentioned that there's a Haskell binding to expat,
so that should be on the list too.
> Manlio Perillo wrote:
>> By the way, I'm very new to Haskell.
>> I'm interested in how well it performs in the task of parsing XML files
>> (both speed and memory usage).
>
> There are three libraries capable of XML.
>
> HXT reads the whole file and keeps the whole tree in memory. (Its
> advantage over the following two, however, is its API is richer, you can
> write less code to do more.)
>
How is the performance with very large XML files?
> HaXml lets you choose the granularity of how much reading and building
> per step. With the simplest coding, you still read the whole file and
> build the whole tree. With more coding, you can read, build, discard
> one subtree at a time.
>
Very interesting.
But what about HXML?
> TagSoup is the most fine-grained and low-level. It lazily reads the
> file and returns a lazy list of "here is an open tag, here is some text,
> oh here is another open tag, oh now here is a close tag..." It doesn't
> verify open tags and close tags to match, however. This is probably
> closest to SAX.
After a Google search I have also found this:
http://www.haskell.org/pipermail/haskell/2006-March/017656.html
Manlio Perillo