Processing XML-formatted documents in CL

Ron Parker

Jan 13, 2009, 10:57:47 PM

I need to analyze and manipulate several multi-megabyte documents that
are stored in various XML formats. There appears to be a fairly long
list of XML parsers and tools on the CLiki but I have no frame of
reference to judge them.

Some of the documents I need to access will probably contain Unicode,
although I am not positive of this.

Half of me wants to hack it from scratch, but this is my first CL
project so I don't know if this would be wise even though I've hacked
Elisp for a couple decades off and on.

Any recommendations in this area would be appreciated.

Xah Lee

Jan 14, 2009, 1:43:51 AM

thought i'd mention that there's nxml mode written by the xml expert
James Clark. It features an xml parser and xml validation as you type.

since it contains a complete parser, i think it can be used for your
project, but am not sure how the code is structured to be used like
that. (the code is over 10k lines)

(James is also the one who wrote the widely used xml parser expat in
c, which happened to be part of our app server software written in
perl back in 1999; i only realized who he was in about 2007.)

Xah
∑ http://xahlee.org/

☄

Volkan YAZICI

Jan 14, 2009, 3:26:41 AM

On Jan 14, 5:57 am, Ron Parker <rdpar...@gmail.com> wrote:
> I need to analyze and manipulate several multi-megabyte documents that
> are stored in various XML formats. There appears to be a fairly long
> list of XML parsers and tools on the CLiki but I have no frame of
> reference to judge them.
>
> Some of the documents I need to access will probably contain Unicode,
> although I am not positive of this.

I think Closure XML[1] is pretty good; it has good support and a good
community.

You didn't say much about what you mean by "multi-megabyte", but
the size could be a problem. Anyway, just try it and see. If it
fails for memory-related reasons, you can first convert your XML
files into s-expressions with XSLT and then easily parse those
s-expressions from Lisp.
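
A minimal sketch of the reading half of that idea, assuming the XSLT step
has already produced one s-expression per document. The function names and
the sample data are made up for illustration; only standard Common Lisp is
used.

```lisp
;; Hypothetical helpers; neither name comes from any library.
(defun read-sexp-document (stream)
  "Read one s-expression from STREAM with the reader in a known, safe state."
  (with-standard-io-syntax
    ;; WITH-STANDARD-IO-SYNTAX binds *READ-EVAL* to T, so rebind it to NIL
    ;; to stop "#." in the input from evaluating arbitrary code.
    (let ((*read-eval* nil))
      (read stream))))

(defun read-sexp-file (pathname)
  "Read one s-expression from the file named by PATHNAME."
  (with-open-file (in pathname :direction :input)
    (read-sexp-document in)))

;; Usage, with a string standing in for a converted file:
(with-input-from-string (in "(catalog (book (title \"SICP\")))")
  (read-sexp-document in))
```

Note that the reader upcases unescaped symbol names, so if element-name case
matters the XSLT step would need to emit strings or |escaped| symbols.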

Regards.

[1] http://common-lisp.net/project/cxml/

Andy Chambers

Jan 14, 2009, 5:26:14 AM

On Jan 14, 8:26 am, Volkan YAZICI <volkan.yaz...@gmail.com> wrote:
> On Jan 14, 5:57 am, Ron Parker <rdpar...@gmail.com> wrote:
>
> > I need to analyze and manipulate several multi-megabyte documents that
> > are stored in various XML formats. There appears to be a fairly long
> > list of XML parsers and tools on the CLiki but I have no frame of
> > reference to judge them.
>
> > Some of the documents I need to access will probably contain Unicode,
> > although I am not positive of this.
>
> I think Closure XML[1] is pretty good; it has good support and a good
> community.
>
> You didn't say much about what you mean by "multi-megabyte", but
> the size could be a problem.

Not really. Closure XML can process documents as streams or as trees,
and its stream processor (klacks) has a really nice interface.
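
For what it's worth, a streaming pass with klacks looks roughly like the
sketch below. It assumes cxml is already loaded (for example through
Quicklisp); COUNT-ELEMENTS is a made-up name, and counting elements by
local name is just an illustrative task.

```lisp
;; Assumes cxml is already loaded, e.g. (ql:quickload :cxml).
;; COUNT-ELEMENTS is illustrative and not part of cxml.
(defun count-elements (input)
  "Return a hash table mapping element local names to occurrence counts.
INPUT is anything CXML:MAKE-SOURCE accepts: a pathname object names a file,
while a string is treated as the XML text itself."
  ;; Assumption: element names are Lisp strings (true on Unicode-capable
  ;; Lisps), so an EQUAL hash table can key on them.
  (let ((counts (make-hash-table :test #'equal)))
    (klacks:with-open-source (source (cxml:make-source input))
      ;; PEEK returns the next event type without consuming it, or NIL at
      ;; the end of the document; CONSUME advances by one event.
      (loop for event = (klacks:peek source)
            while event
            do (when (eq event :start-element)
                 (incf (gethash (klacks:current-lname source) counts 0)))
               (klacks:consume source)))
    counts))

;; Usage: (count-elements #p"big.xml")   ; "big.xml" is a placeholder path
```

Because only the current event is held at any time, memory use should not
grow with document size the way a full tree does.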

--
Andy

GP lisper

Jan 14, 2009, 5:52:38 AM

On Tue, 13 Jan 2009 19:57:47 -0800 (PST), <rdpa...@gmail.com> wrote:
> I need to analyze and manipulate several multi-megabyte documents that
> are stored in various XML formats. There appears to be a fairly long
> list of XML parsers and tools on the CLiki but I have no frame of
> reference to judge them.

If it is standards-compliant XML, any of the fancy code will work.

When I faced this problem about 3 years ago, "s-xml" solved my
non-standard XML problems nicely. I still use it for everything,
since I didn't need to learn any buzzwords and specs to apply it.

You'll probably try a few parsers anyway; it sounds like speed will be
an issue.

--
"Most programmers use this on-line documentation nearly all of the
time, and thereby avoid the need to handle bulky manuals and perform
the translation from barbarous tongues." CMU CL User Manual

Zach Beane

Jan 14, 2009, 8:26:22 AM

Ron Parker <rdpa...@gmail.com> writes:

I've been a happy user of Closure XML for some time now. It's very
capable and the documentation is good.

Zach

game_designer

Jan 15, 2009, 10:03:39 AM

On Jan 13, 8:57 pm, Ron Parker <rdpar...@gmail.com> wrote:
> Any recommendations in this area would be appreciated.

If the goal is to read the file and to map XML elements onto similarly
structured CLOS objects, you may want to explore XMLisp. A new version
was released recently:

http://www.agentsheets.com/lisp/XMLisp/

If it works, great! If not, let me know why not.

Alex

Chaitanya Gupta

Jan 20, 2009, 3:20:12 AM

Ron Parker wrote:
> I need to analyze and manipulate several multi-megabyte documents that
> are stored in various XML formats. There appears to be a fairly long
> list of XML parsers and tools on the CLiki but I have no frame of
> reference to judge them.
>

Others have mentioned CXML. It is quite good for most of your XML needs,
but since you mention multi-megabyte XML documents, its performance
might be an issue[1].

If that is the case, consider using S-XML:
http://common-lisp.net/project/s-xml/

I don't know if S-XML will satisfy your Unicode needs or not, but its
DOM parser is faster than CXML's.
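
If you want to check this on your own files, a rough timing sketch is
below. It assumes both cxml and s-xml are loaded; COMPARE-DOM-PARSERS is a
made-up name. The two results are not equivalent structures (a CXML DOM
document versus S-XML's list-based LXML), so treat the numbers as a hint
rather than a strict benchmark.

```lisp
;; Assumes both libraries are loaded, e.g. (ql:quickload '(:cxml :s-xml)).
;; COMPARE-DOM-PARSERS is illustrative and not part of either library.
(defun compare-dom-parsers (pathname)
  "Parse PATHNAME into a tree with each library, printing TIME output for both."
  (format t "~&CXML, DOM builder:~%")
  (time (cxml:parse-file pathname (cxml-dom:make-dom-builder)))
  (format t "~&S-XML, LXML output:~%")
  (time (s-xml:parse-xml-file pathname :output-type :lxml))
  ;; Return no values; the interesting output is what TIME prints.
  (values))

;; Usage: (compare-dom-parsers #p"big.xml")   ; placeholder path
```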

Chaitanya

1. http://common-lisp.net/pipermail/cxml-devel/2008-September/000444.html

David Lichteblau

Jan 20, 2009, 3:57:28 AM

On 2009-01-20, Chaitanya Gupta <ma...@chaitanyagupta.com> wrote:
> Others have mentioned CXML. It is quite good for most of your XML needs,
> but since you mention multi-megabyte XML documents, its performance
> might be an issue[1].

[...]
> 1. http://common-lisp.net/pipermail/cxml-devel/2008-September/000444.html

Hey, I don't claim that cxml is the fastest XML implementation around.

But in that mailing list post above, you had issues with Allegro CL's
default scheduler configuration, not cxml speed.

So MP:*DEFAULT-PROCESS-QUANTUM* is very high.
That's not cxml's fault.

d.

Chaitanya Gupta

Jan 20, 2009, 6:55:04 AM

David Lichteblau wrote:
> Hey, I don't claim that cxml is the fastest XML implementation around.
>

Right, you don't.

> But in that mailing list post above, you had issues with Allegro CL's
> default scheduler configuration, not cxml speed.
>
> So MP:*DEFAULT-PROCESS-QUANTUM* is very high.
> That's not cxml's fault.
>

We did play around with mp:*default-process-quantum*, but it didn't help
much.

And we discovered that S-XML's DOM parser was faster and, IIRC, had
lower memory needs. It also didn't cause any of the image-hanging issues
that we faced with CXML. In the end, we decided to switch to S-XML for
that particular service, and (so I've heard, since I had left by then)
there haven't been any sleepless nights since. ;)

Mind you, CXML is still my tool of choice for XML parsing needs, and I
am particularly grateful for the extensions built on top of it
(cxml-rng, plexippus-xpath, etc.), but its performance did bite us once.
So I am just letting the OP know of a good alternative in case he feels
the performance pinch too.

Chaitanya