
[Caml-list] ANNOUNCE: Xmlm


Daniel Bünzli

Feb 26, 2007, 8:17:23 PM
to caml...@yquem.inria.fr
Xmlm is an OCaml module providing sequential XML input/output and
a persistent cursor. It aims to make non-validating XML processing
robust and painless.

The sequential interface can be used to process documents without
building an in-memory representation. It also lets programmers
translate their own data structures to an XML representation and
vice versa.
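
To make the idea of sequential processing concrete, here is a minimal sketch. The `event` type and the `max_depth` function are hypothetical names chosen for illustration and are not Xmlm's actual interface; the point is only that a document can be consumed event by event while keeping a small amount of state.

```ocaml
(* Hypothetical event type for illustration; not Xmlm's actual interface. *)
type event =
  | Start of string  (* start tag, with the element name *)
  | End              (* matching end tag *)
  | Text of string   (* character data *)

(* Consume events one at a time, keeping only the current depth and the
   deepest level seen so far. No tree is ever built. *)
let max_depth (events : event Seq.t) : int =
  let step (depth, deepest) = function
    | Start _ -> (depth + 1, max deepest (depth + 1))
    | End -> (depth - 1, deepest)
    | Text _ -> (depth, deepest)
  in
  snd (Seq.fold_left step (0, 0) events)

(* Events for <a><b>hi</b><c/></a> *)
let sample =
  List.to_seq [ Start "a"; Start "b"; Text "hi"; End; Start "c"; End; End ]

let () = Printf.printf "max depth: %d\n" (max_depth sample)
```

Memory use here depends on the nesting depth being tracked, not on the size of the document.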

The cursor lets you navigate and update a simple in-memory tree
representation of XML documents. Updates performed by the cursor
are persistent (non-destructive).
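
As an illustration of what a persistent cursor means, here is a minimal zipper sketch. The `tree`, `path` and `cursor` types and all function names are assumptions made for this example, not Xmlm's actual representation or API. An update returns a new cursor and leaves the old cursor and the old tree untouched.

```ocaml
(* Hypothetical tree type for illustration; not Xmlm's representation. *)
type tree =
  | Data of string
  | El of string * tree list  (* element name, children *)

(* Path from the focus back to the root. Each step stores the parent's
   name, the left siblings (in reverse order), the parent's own path,
   and the right siblings. *)
type path =
  | Top
  | Node of string * tree list * path * tree list

type cursor = { focus : tree; path : path }

let of_tree t = { focus = t; path = Top }

(* Move to the first child, if any. *)
let down c = match c.focus with
  | El (name, child :: rest) ->
      Some { focus = child; path = Node (name, [], c.path, rest) }
  | El (_, []) | Data _ -> None

(* Move to the next sibling, if any. *)
let right c = match c.path with
  | Node (name, lefts, parent, r :: rights) ->
      Some { focus = r; path = Node (name, c.focus :: lefts, parent, rights) }
  | Node (_, _, _, []) | Top -> None

(* Move to the parent, rebuilding it from the stored siblings. *)
let up c = match c.path with
  | Top -> None
  | Node (name, lefts, parent, rights) ->
      let children = List.rev_append lefts (c.focus :: rights) in
      Some { focus = El (name, children); path = parent }

(* Replace the focused subtree; the cursor passed in is not modified. *)
let set t c = { c with focus = t }

(* Rebuild the whole tree by walking up to the root. *)
let rec to_tree c = match up c with
  | None -> c.focus
  | Some parent -> to_tree parent

let doc = El ("a", [ El ("b", [ Data "old" ]); El ("c", []) ])
let at_b = Option.get (down (of_tree doc))
let at_data = Option.get (down at_b)
let doc' = to_tree (set (Data "new") at_data)
```

After this runs, `doc'` contains "new" while `doc` and `at_data` still see "old", which is the non-destructive behaviour described above.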

To facilitate direct integration into projects, Xmlm is made of a
single module and distributed under a BSD license.

Project home page: <http://code.google.com/p/xmlm>

Your feedback is welcome,

Daniel

P.S.

Why another XML parser?

Dissatisfaction with existing solutions, which are either too complete
and complex or too brittle and restrictive. Besides, it seems all
existing parsers force you to read the whole document into
memory. Here are some points that motivated the design of Xmlm.

1. Easy to integrate into projects without introducing external
dependencies. A single module provides everything including
documentation (ocamldoc) and the license.

2. Well documented. Features and limitations of the parser are precisely
documented.

3. Easy-to-use yet flexible API.
- Choice between sequential (SAX-like) or tree (DOM-like) processing.
- Construction/deconstruction of user data structures from/to XML
documents.
- Tree processing with persistent cursor (zipper).
- Simple white space handling options for character data.
- Character encodings are translated to UTF-8.
UTF-8 is the only encoding the programmer needs to handle.
- Character references and predefined entities are resolved.
Other entity references can be resolved via a user-provided
callback.
- Early access to data to allow parse-time data transformations.
- Parse-time element pruning.

4. Robust parsing. Does not assume an XML subset.
- Supports major encodings: ASCII, UTF-8, UTF-16 (LE and BE),
ISO-8859-1.
- Parses qualified names (namespaces).
- Tail-recursive.

5. Limitations. If you need one of these things, use PXP.
- Comments, processing instructions and the standalone declaration are
dropped by the parser (this is a feature).
- No DTD support (but it can be extracted and written as a raw
string).
- No validity support.
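
Point 3 above says that character encodings are translated to UTF-8. As a rough illustration of what such a translation involves for the simplest case, ISO-8859-1, here is a minimal sketch. The function name `utf8_of_latin1` is an assumption for this example and is not taken from Xmlm.

```ocaml
(* Transcode an ISO-8859-1 string to UTF-8. Each Latin-1 byte is the
   Unicode code point of the same value, so bytes below 0x80 are copied
   unchanged and all other bytes become a two-byte UTF-8 sequence. *)
let utf8_of_latin1 (s : string) : string =
  let b = Buffer.create (String.length s * 2) in
  String.iter
    (fun c ->
      let u = Char.code c in
      if u < 0x80 then Buffer.add_char b c
      else begin
        (* 110xxxxx 10xxxxxx: top 2 bits, then low 6 bits of the code point *)
        Buffer.add_char b (Char.chr (0xC0 lor (u lsr 6)));
        Buffer.add_char b (Char.chr (0x80 lor (u land 0x3F)))
      end)
    s;
  Buffer.contents b

(* "B\xFCnzli" is "Bünzli" in ISO-8859-1; the result is its UTF-8 form. *)
let example = utf8_of_latin1 "B\xFCnzli"
```

UTF-16 needs more work (byte order, surrogate pairs), which this sketch does not attempt.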

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Stefano Zacchiroli

Feb 27, 2007, 3:31:45 AM
to caml...@yquem.inria.fr
On Tue, Feb 27, 2007 at 02:16:43AM +0100, Daniel Bünzli wrote:
> Why another XML parser?

What about the performance? How does Xmlm compare with Pxp,
Ocaml-Expat, or Ocaml Xmlreader (the bindings to libxml's reader
interface)?

Nonetheless, many thanks for your new parser!

Cheers.

--
Stefano Zacchiroli -*- Computer Science PhD student @ Uny Bologna, Italy
zack@{cs.unibo.it,debian.org,bononia.it} -%- http://www.bononia.it/zack/
(15:56:48) Zack: e la demo dema ? /\ All one has to do is hit the
(15:57:15) Bac: no, la demo scema \/ right keys at the right time

Daniel Bünzli

Feb 27, 2007, 6:31:52 AM
to Stefano Zacchiroli

On 27 Feb 2007, at 09:28, Stefano Zacchiroli wrote:

> What about the performance?

I don't know; I didn't invest a lot of time in profiling, so maybe it
can be improved.

But if you want an unscientific benchmark, here is one:

I compare two programs, xmllint (which ships with my system) and
xmltrip (compiled without any special options), distributed with
libxml and xmlm respectively. Note that xmllint is a C program, so we
are not comparing against libxml's OCaml interface. Besides, I have no
idea how xmllint is written or what it does internally; maybe it does
more than xmlm does, so we may well be comparing the incomparable.
The files are <http://www.ximpleware.com/xmls.zip>; uncompressed, this
is 144 MB of XML files.

On Mac OS X 10.4.8, G4 1 GHz, 512 MB RAM.

Parse only, without building the tree.

> > time ./xmltrip.opt -p -ename ~/tmp/xmls/*.xml
>
> real 0m53.567s
> user 0m51.562s
> sys 0m1.043s
>
> > time xmllint --noent --nocdata --noout --nonet --stream ~/tmp/xmls/*.xml
>
> real 0m25.264s
> user 0m24.314s
> sys 0m0.725s

Parse only, building an in-memory tree.

> > time ./xmltrip.opt -t -p -ename ~/tmp/xmls/*.xml
>
> real 2m2.099s
> user 1m44.821s
> sys 0m8.215s
>
> > time xmllint --noblanks --noent --nocdata --noout --nonet ~/tmp/xmls/*.xml
>
> real 1m4.590s
> user 0m47.561s
> sys 0m3.193s
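
For a quick reading of the wall-clock ("real") figures above, the small sketch below computes the xmltrip/xmllint ratio for each run. The helper names are made up for this example. Under the caveats already stated, xmltrip takes roughly 2.1 times as long as xmllint when streaming and roughly 1.9 times as long when building a tree.

```ocaml
(* Convert the "XmY.YYYs" format printed by time into seconds. *)
let seconds ~minutes ~sec = (float_of_int minutes *. 60.) +. sec

(* How many times longer xmltrip took than xmllint. *)
let ratio ~xmltrip ~xmllint = xmltrip /. xmllint

(* "real" times copied from the two runs above. *)
let stream =
  ratio
    ~xmltrip:(seconds ~minutes:0 ~sec:53.567)
    ~xmllint:(seconds ~minutes:0 ~sec:25.264)

let tree =
  ratio
    ~xmltrip:(seconds ~minutes:2 ~sec:2.099)
    ~xmllint:(seconds ~minutes:1 ~sec:4.590)

let () = Printf.printf "streaming: %.2fx, tree: %.2fx\n" stream tree
```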

Best,

Daniel
