I need to analyze and manipulate several multi-megabyte documents that are stored in various XML formats. There appears to be a fairly long list of XML parsers and tools on the CLiki but I have no frame of reference to judge them.
Some of the documents I need to access will probably contain Unicode, although I am not positive of this.
Half of me wants to hack it from scratch, but this is my first CL project so I don't know if this would be wise even though I've hacked Elisp for a couple decades off and on.
Any recommendations in this area would be appreciated.
On Jan 13, 7:57 pm, Ron Parker <rdpar...@gmail.com> wrote:
> I need to analyze and manipulate several multi-megabyte documents that are stored in various XML formats. There appears to be a fairly long list of XML parsers and tools on the CLiki but I have no frame of reference to judge them.
> Some of the documents I need to access will probably contain Unicode, although I am not positive of this.
> Half of me wants to hack it from scratch, but this is my first CL project so I don't know if this would be wise even though I've hacked Elisp for a couple decades off and on.
> Any recommendations in this area would be appreciated.
Thought I'd mention that there's nxml mode, written by the XML expert James Clark. It features an XML parser and XML validation as you type.
Since it contains a complete parser, I think it can be used for your project, but I am not sure how the code is structured to be used like that. (The code is over 10k lines.)
(James is also the one who wrote the widely used XML parser expat in C. It happened to be part of our app server software, written in Perl, back in 1999, though I didn't realize who he was until about 2007.)
On Jan 14, 5:57 am, Ron Parker <rdpar...@gmail.com> wrote:
> I need to analyze and manipulate several multi-megabyte documents that are stored in various XML formats. There appears to be a fairly long list of XML parsers and tools on the CLiki but I have no frame of reference to judge them.
> Some of the documents I need to access will probably contain Unicode, although I am not positive of this.
I think Closure XML[1] is pretty good; it has good support and a good community.
You didn't specify much about what you mean by "multi-megabyte", but the size could be a problem. Anyway, just try it and see. If it fails for memory-related reasons, you can first convert your XML files into s-expression form using some sort of XSLT and then easily parse these s-expressions from Lisp.
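As a rough illustration of the conversion step suggested above, here is a minimal sketch. It swaps XSLT for a small Python script (an assumption, not what the post proposes). The output shape, ("tag" (attributes) children...), and the names lisp_string, to_sexp, and xml_to_sexp are hypothetical choices made for this example. Note that it builds the whole tree in memory and drops whitespace-only text, so it does not by itself solve a memory problem on the conversion side.

```python
import xml.etree.ElementTree as ET


def lisp_string(text):
    """Quote text as a Lisp string literal, escaping backslash and double quote."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_sexp(elem):
    """Render an element as ("tag" (("attr" . "value") ...) child ...)."""
    attrs = " ".join(
        "(%s . %s)" % (lisp_string(k), lisp_string(v))
        for k, v in elem.attrib.items()
    )
    parts = [lisp_string(elem.tag), "(" + attrs + ")"]
    # Text that appears before the first child element.
    if elem.text and elem.text.strip():
        parts.append(lisp_string(elem.text.strip()))
    for child in elem:
        parts.append(to_sexp(child))
        # Text that follows a child element (mixed content).
        if child.tail and child.tail.strip():
            parts.append(lisp_string(child.tail.strip()))
    return "(" + " ".join(parts) + ")"


def xml_to_sexp(xml_text):
    """Parse an XML string and return its s-expression rendering."""
    return to_sexp(ET.fromstring(xml_text))


if __name__ == "__main__":
    print(xml_to_sexp('<doc id="1"><p>Hello "world"</p></doc>'))
```

Because the output contains only lists, dotted pairs, and strings, it should be readable on the Lisp side with the standard READ function; element and attribute names are kept as strings to avoid any symbol-case or package issues.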
On Jan 14, 8:26 am, Volkan YAZICI <volkan.yaz...@gmail.com> wrote:
> On Jan 14, 5:57 am, Ron Parker <rdpar...@gmail.com> wrote:
> > I need to analyze and manipulate several multi-megabyte documents that are stored in various XML formats. There appears to be a fairly long list of XML parsers and tools on the CLiki but I have no frame of reference to judge them.
> > Some of the documents I need to access will probably contain Unicode, although I am not positive of this.
> I think Closure XML[1] is pretty good; have a good support and community.
> You didn't specify much about what you mean with "multi-megabyte", but this issue could be a problem.
Not really. Closure can process documents as streams or trees, and its stream processor has a really nice interface (klacks).
On Tue, 13 Jan 2009 19:57:47 -0800 (PST), <rdpar...@gmail.com> wrote:
> I need to analyze and manipulate several multi-megabyte documents that are stored in various XML formats. There appears to be a fairly long list of XML parsers and tools on the CLiki but I have no frame of reference to judge them.
If it is standards compliant XML, any of the fancy code will work.
When I faced this problem about 3 years ago, "s-xml" solved my non-standard XML problems nicely. I still use it for everything, since I didn't need to learn any buzzwords and specs to apply it.
You'll probably try a few parsers anyway; it sounds like speed will be an issue.
-- "Most programmers use this on-line documentation nearly all of the time, and thereby avoid the need to handle bulky manuals and perform the translation from barbarous tongues." CMU CL User Manual
Ron Parker <rdpar...@gmail.com> writes:
> I need to analyze and manipulate several multi-megabyte documents that are stored in various XML formats. There appears to be a fairly long list of XML parsers and tools on the CLiki but I have no frame of reference to judge them.
> Some of the documents I need to access will probably contain Unicode, although I am not positive of this.
> Half of me wants to hack it from scratch, but this is my first CL project so I don't know if this would be wise even though I've hacked Elisp for a couple decades off and on.
> Any recommendations in this area would be appreciated.
I've been a happy user of Closure XML for some time now. It's very capable and the documentation is good.
On Jan 13, 8:57 pm, Ron Parker <rdpar...@gmail.com> wrote:
> Any recommendations in this area would be appreciated.
If the goal is to read the file and to map XML elements into similarly structured CLOS objects, you may want to explore XMLisp. A new version was released recently:
Ron Parker wrote:
> I need to analyze and manipulate several multi-megabyte documents that are stored in various XML formats. There appears to be a fairly long list of XML parsers and tools on the CLiki but I have no frame of reference to judge them.
Others have mentioned CXML. It is quite good for most of your XML needs, but since you mention multi-megabyte XML documents, its performance might be an issue[1].
David Lichteblau wrote:
> Hey, I don't claim that cxml is the fastest XML implementation around.
Right, you don't.
> But in that mailing list post above, you had issues with Allegro CL's default scheduler configuration, not cxml speed.
> So MP:*DEFAULT-PROCESS-QUANTUM* is very high. That's not cxml's fault.
We did play around with mp:*default-process-quantum*, but it didn't help much.
And we discovered that S-XML's DOM parser was faster and, IIRC, had lower memory needs. It also didn't cause any of the image-hanging issues that we faced with CXML. In the end, we decided to switch to S-XML for that particular service, and (so I've heard, since I had left by then) there haven't been any sleepless nights since. ;)
Mind you, CXML is still my tool of choice for XML parsing needs, and I am particularly grateful for the extensions built on top of it (cxml-rng, plexippus-xpath, etc.), but its performance did bite us once. So I am just letting the OP know of a good alternative in case he feels the performance pinch too.