XML?

Paul Rubin

Oct 16, 2007, 12:38:03 AM

I want to read in some (large) XML documents, crunch them around a
little, and write out new documents. I'm currently doing it with
a Python program using xml.etree.cElementTree but it's too darn slow.

What's the best way to do it in Haskell? It looks like there's no
xml module in the standard library. I see there are several external
ones, including HaXml, HXML, and HXT. I tried installing HaXml but
the build script crashed, never a good sign.

Is there a particular one of these packages that's clearly the best
choice? Is this stuff considered ready for prime time yet? Do I have
to understand arrows to use HXT? I currently kind of sort of barely
understand monads, I think. Arrows sound like yet more mind expansion,
always a good thing but it might take a while.

Thanks.

Christian Maeder

Oct 16, 2007, 5:42:00 AM

to Paul Rubin

We have both packages HaXml-1.13.2 and hxt-7.1 working with ghc-6.6.1
(and ghc-6.6) on several architectures. I cannot help you with your
choice, though.

HXT is newer and more actively maintained. It was designed despite HaXml
already existing, but it may be less stable than HaXml.

HTH Christian

Albert Y. C. Lai

Oct 17, 2007, 6:36:04 PM

HaXml may require less learning. HXT requires more learning, and one
reason (out of two) is that it involves arrows. But I don't see arrows
as a prerequisite of HXT; rather, I see HXT as a prerequisite of arrows,
i.e., HXT is an excellent example and motivation for arrows.

The other reason is that HXT's suite of functions is complete: Whatever
you do, you never need to know the data structure representing
documents. (The data structure is still exposed, if you really want, but
you don't get as much support.) This also means there are more functions
to learn. HaXml's suite of functions is incomplete: Most interesting
tasks are unsupported by the functions and you have to process the data
structure directly, e.g.,

http://thread.gmane.org/gmane.comp.lang.haskell.cafe/14466

So you end up feeling that its API is irrelevant anyway, and that you
just need to learn three things: how to read a file into a tree, how to
write a tree to a file, and what the tree structure in between is. It is
a kind of "much less to learn", but more code to write.

I have a preliminary HXT tutorial at
http://www.vex.net/~trebla/haskell/hxt-arrow/
But I haven't covered modifying or producing new trees.

The Haskell Wiki has another tutorial at
http://www.haskell.org/haskellwiki/HXT
with longer examples. But IMO it spends too much time on the now
obsolete "filter".

Cale Gibbard has another example at
http://cale.yi.org/index.php/HRSS

Matthew Danish

Oct 21, 2007, 11:34:22 AM

On Oct 17, 6:36 pm, "Albert Y. C. Lai" <tre...@vex.net> wrote:
> I have a preliminary HXT tutorial at http://www.vex.net/~trebla/haskell/hxt-arrow/

> But I haven't covered modifying or producing new trees.

http://www.haskell.org/haskellwiki/HXT/Practical

is a (small, but hopefully growing) collection of some "real" examples
using HXT.

Manlio Perillo

Mar 13, 2008, 8:49:47 AM

On Mon, 15 Oct 2007 21:38:03 -0700, Paul Rubin wrote:

> I want to read in some (large) XML documents, crunch them around a
> little, and write out new documents. I'm currently doing it with a
> Python program using xml.etree.cElementTree but it's too darn slow.
>

It's slow because it has to build a document tree in memory.
You should consider using the SAX API, if possible.
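
As a rough illustration of the SAX suggestion, here is a minimal sketch using Python's standard xml.sax module. The element names (feed, item, title) and the ItemCounter class are made up for the example and are not from the original poster's data. The handler only reacts to events as the parser emits them, so no document tree is kept in memory.

```python
import io
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements and collects <title> text, event by event."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1
        elif name == "title":
            self._in_title = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it until the end tag.
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._in_title = False


def scan(source):
    """Parse a filename or file-like object and return the filled handler."""
    handler = ItemCounter()
    xml.sax.parse(source, handler)
    return handler


# Hypothetical sample input, standing in for a large file on disk.
SAMPLE = b"<feed><item><title>first</title></item><item><title>second</title></item></feed>"
handler = scan(io.BytesIO(SAMPLE))
print(handler.count, handler.titles)
```

Whether this is actually faster than cElementTree depends on the workload; the callbacks run in Python, so the main gain is usually memory rather than raw speed.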

By the way, I'm very new to Haskell.
I'm interested in how well it performs in the task of parsing XML files
(both speed and memory usage).

Regards Manlio Perillo

Albert Y. C. Lai

Mar 13, 2008, 12:59:41 PM

Manlio Perillo wrote:
> By the way, I'm very new to Haskell.
> I'm interested in how well it performs in the task of parsing XML files
> (both speed and memory usage).

There are three libraries capable of XML.

HXT reads the whole file and keeps the whole tree in memory. (Its
advantage over the following two, however, is its API is richer, you can
write less code to do more.)

HaXml lets you choose the granularity of how much reading and building
per step. With the simplest coding, you still read the whole file and
build the whole tree. With more coding, you can read, build, discard
one subtree at a time.
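
For comparison, the same "read, build, discard one subtree at a time" idea can be sketched in Python with the standard library's ElementTree.iterparse. This is an analogue in the original poster's language, not HaXml's API. The process_records function and the person/name element names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def process_records(source, tag):
    """Yield the <name> text of each <tag> subtree, then discard the subtree."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.findtext("name")
            # Drop the subtree's children, text and attributes. The emptied
            # element itself stays attached to its parent.
            elem.clear()


# Hypothetical sample input, standing in for a large file on disk.
SAMPLE = b"<people><person><name>Ann</name></person><person><name>Bob</name></person></people>"
names = list(process_records(io.BytesIO(SAMPLE), "person"))
print(names)
```

Only one record's subtree is fully populated at any time, which is the point of the finer granularity; the cost is the extra bookkeeping code.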

TagSoup is the most fine-grained and low-level. It lazily reads the
file and returns a lazy list of "here is an open tag, here is some text,
oh here is another open tag, oh now here is a close tag..." It doesn't
verify that open tags and close tags match, however. This is probably
closest to SAX.
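
To make the "stream of tags" shape concrete, here is a toy sketch in Python. It is not TagSoup and does not use its API. The tag_events generator and its regular expression are invented for illustration, and it ignores attributes, comments, CDATA and self-closing syntax. It produces events on demand and, like the behaviour described above, never checks that open and close tags match.

```python
import re

# An opening or closing tag (attributes skipped), or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)[^>]*>|([^<]+)")


def tag_events(text):
    """Yield ("open", name), ("close", name) or ("text", chars) tuples."""
    for match in TOKEN.finditer(text):
        closing, name, chars = match.groups()
        if chars is not None:
            yield ("text", chars)
        elif closing:
            yield ("close", name)
        else:
            yield ("open", name)


# Note the unclosed <b>: the stream is produced anyway, with no error.
events = list(tag_events("<a><b>hi</a>"))
print(events)
```

Here the whole input string is already in memory; only the event list is lazy. A real streaming reader would also read the file incrementally.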

Paul Rubin

Mar 13, 2008, 2:43:28 PM

"Albert Y. C. Lai" <tre...@vex.net> writes:
> There are three libraries capable of XML.

> HXT reads the whole file and keeps the whole tree in memory...
> HaXml lets you choose the granularity ...
> TagSoup is the most fine-grained and low-level...

Funny this thread should revive after so long. I'm still interested
in the subject. I was just chatting about it a couple nights ago
online, and someone mentioned that there's a Haskell binding to expat,
so that should be on the list too.

Manlio Perillo

Mar 13, 2008, 4:15:40 PM

On Thu, 13 Mar 2008 12:59:41 -0400, Albert Y. C. Lai wrote:

> Manlio Perillo wrote:
>> By the way, I'm very new to Haskell.
>> I'm interested in how well it performs in the task of parsing XML files
>> (both speed and memory usage).
>
> There are three libraries capable of XML.
>
> HXT reads the whole file and keeps the whole tree in memory. (Its
> advantage over the following two, however, is its API is richer, you can
> write less code to do more.)
>

How is performance with very large XML files?

> HaXml lets you choose the granularity of how much reading and building
> per step. With the simplest coding, you still read the whole file and
> build the whole tree. With more coding, you can read, build, discard
> one subtree at a time.
>

Very interesting.
But what about HXML?

> TagSoup is the most fine-grained and low-level. It lazily reads the
> file and returns a lazy list of "here is an open tag, here is some text,
> oh here is another open tag, oh now here is a close tag..." It doesn't
> verify that open tags and close tags match, however. This is probably
> closest to SAX.

After a Google search I also found this:
http://www.haskell.org/pipermail/haskell/2006-March/017656.html

Manlio Perillo