[Caml-list] xpath or alternatives

Richard Jones

Sep 28, 2009, 8:17:56 AM
to caml...@inria.fr

I need to do some relatively simple extraction of fields from an XML
document. In Perl I would use xpath, very specifically if $xml was an
XML document[1] stored as a string, then:

my $p = XML::XPath->new (xml => $xml);
my @disks = $p->findnodes ('//devices/disk/source/@dev');
push (@disks, $p->findnodes ('//devices/disk/source/@file'));

This isn't type safe or pretty, but it is very easy to use for quick
and dirty extraction.

What is the OCaml equivalent for this sort of code?

Alain Frisch has a library called Xpath
(http://alain.frisch.fr/soft.html#xpath), but unfortunately this
relies on the now obsolete wlex program.

Is there a completely alternative way to do this? Better still, in 3
lines of code??

Rich.

[1] for XML doc, see: http://libvirt.org/formatdomain.html

--
Richard Jones
Red Hat

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Yaron Minsky

Sep 28, 2009, 8:49:16 AM
to Richard Jones, caml...@inria.fr
I don't have the code in front of me, but I've done something like
this using the list monad. i.e., using bind (= concat-map) and map
chained together, along with a couple operators I wrote for lifting
bits of XML documents into lists, by say returning the subnodes of the
present node as a list.

It was quite effective. I got the inspiration from a similar tool we
have for navigating s-expressions, which we should release at some
point...
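
A minimal sketch of what such list-monad navigation could look like, assuming a hypothetical `xml` tree type of the kind most lightweight parsers return. The names `>>=`, `child`, `attr` and `disks` are illustrative only and are not taken from the code mentioned above:

```ocaml
(* Hypothetical minimal XML tree, assumed for illustration. *)
type xml =
  | Element of string * (string * string) list * xml list
  | PCData of string

(* bind for the list monad, i.e. concat-map *)
let ( >>= ) xs f = List.concat (List.map f xs)

(* Lift a node into the list of its child elements named [tag]. *)
let child tag = function
  | Element (_, _, kids) ->
    List.filter (function Element (t, _, _) -> t = tag | PCData _ -> false) kids
  | PCData _ -> []

(* Lift a node into a list holding the value of attribute [name], if any. *)
let attr name = function
  | Element (_, attrs, _) ->
    (try [ List.assoc name attrs ] with Not_found -> [])
  | PCData _ -> []

(* Roughly /domain/devices/disk/source/@dev followed by .../@file,
   assuming [root] is the <domain> element itself. *)
let disks (root : xml) : string list =
  let sources =
    [ root ] >>= child "devices" >>= child "disk" >>= child "source" in
  (sources >>= attr "dev") @ (sources >>= attr "file")
```

Each step maps one node to zero or more nodes, so a missing element simply yields an empty list instead of an exception.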

Yaron Minsky

Till Varoquaux

Sep 28, 2009, 11:06:37 AM
to Yaron Minsky, caml...@inria.fr, Richard Jones
There are a few projects out there:

xtisp
http://www.xtisp.org

xstream
http://yquem.inria.fr/~frisch/xstream/

and of course the good old cduce/xduce/ocamlduce. All in all, naive
querying is not hard, and tree automata (see e.g.
http://www.grappa.univ-lille3.fr/~filiot/tata/) can provide a good
middle ground between efficiency and simplicity. The problem you might
run into is that XML is a tricky format to deal with, and some of these
tools will choke on complex files (namespaces, switching character
encodings, weird entities in the DTD, etc.).

Till

P.S.: Alain has a good paper on how to compile queries (as done in
cduce). I am just too lazy to look for it.

Mikkel Fahnøe Jørgensen

Sep 29, 2009, 7:00:34 PM
to Till Varoquaux, Yaron Minsky, caml...@inria.fr, Richard Jones
In line with what Yaron suggests, you can use a combinator parser.

I do this to parse json, and this parser could be adapted to xml by
focusing on basic syntax and ignoring the details, or you could
prefilter xml and use the json parser directly.

See the Fleece parser embedded here:

There is also the object abstraction that dives into an object
hierarchy after parsing, see the Objects module. The combination of
these two makes it quite easy to work on structured data, but 3 lines
only come after some xml adaptation work - but you can see many
one-liner json access in the last part of the file.

http://git.dvide.com/pub/symbiosis/tree/myocamlbuild_config.ml

Otherwise there is xmlm, which is self-contained in a single ml file
and, as I recall, has some sort of zipper navigator. (I initially
intended to use it before deciding on the json format):

http://erratique.ch/software/xmlm

Richard Jones

Sep 30, 2009, 6:17:00 AM
to Mikkel Fahnøe Jørgensen, Yaron Minsky, Till Varoquaux, caml...@inria.fr
On Wed, Sep 30, 2009 at 01:00:15AM +0200, Mikkel Fahnøe Jørgensen wrote:
> In line with what Yaron suggests, you can use a combinator parser.
>
> I do this to parse json, and this parser could be adapted to xml by
> focusing on basic syntax and ignoring the details, or you could
> prefilter xml and use the json parser directly.
>
> See the Fleece parser embedded here:
>
> There is also the object abstraction that dives into an object
> hierarchy after parsing, see the Objects module. The combination of
> these two makes it quite easy to work on structured data, but 3 lines
> only come after some xml adaptation work - but you can see many
> one-liner json access in the last part of the file.
>
> http://git.dvide.com/pub/symbiosis/tree/myocamlbuild_config.ml
>
> Otherwise there is xmlm which is self-contained in a single ml file,
> and as I recall, has some sort of zipper navigator. (I initially
> intended to use it before deciding on the json format):
>
> http://erratique.ch/software/xmlm

It's interesting you mention xmlm, because I couldn't write
the code using xmlm at all.

The discussion here has got quite theoretical, but it's not helping
me to write the original 3 lines of Perl in OCaml.

my $p = XML::XPath->new (xml => $xml);
my @disks = $p->findnodes ('//devices/disk/source/@dev');
push (@disks, $p->findnodes ('//devices/disk/source/@file'));

My best effort, using xml-light, is around 40 lines:

http://git.et.redhat.com/?p=libguestfs.git;a=blob;f=ocaml/examples/viewer.ml;h=ef6627b1b92a4fff7d4fa1fa4aca63eeffc05ece;hb=HEAD#l322

Rich.

--
Richard Jones
Red Hat


Sebastien Mondet

Sep 30, 2009, 6:36:53 AM
to Richard Jones, caml...@inria.fr
> The discussion here has got quite theoretical, but it's not helping
> me to write the original 3 lines of Perl in OCaml.
>
>    my $p = XML::XPath->new (xml => $xml);
>    my @disks = $p->findnodes ('//devices/disk/source/@dev');
>    push (@disks, $p->findnodes ('//devices/disk/source/@file'));
>
> My best effort, using xml-light, is around 40 lines:
>
> http://git.et.redhat.com/?p=libguestfs.git;a=blob;f=ocaml/examples/viewer.ml;h=ef6627b1b92a4fff7d4fa1fa4aca63eeffc05ece;hb=HEAD#l322
>

Galax is (or was ??) an XQuery implementation in ocaml
and XPath 2.0 is included in XQuery... so maybe you can use it...

the site does not seem to respond now:
http://www.galaxquery.org/

but there is a debian package:
http://upsilon.cc/~zack/blog/posts/2008/02/galax_in_debian/

Mikkel Fahnøe Jørgensen

Sep 30, 2009, 6:49:24 AM
to Richard Jones, Yaron Minsky, Till Varoquaux, caml...@inria.fr
2009/9/30 Richard Jones <ri...@annexia.org>:

> On Wed, Sep 30, 2009 at 01:00:15AM +0200, Mikkel Fahnøe Jørgensen wrote:
>> In line with what Yaron suggests, you can use a combinator parser.
> It's interesting you mention xmlm, because I couldn't write
> the code using xmlm at all.

If you can manage to convert an xml document into a json like tagged
tree structure,
then a simple solution like

module Value = struct
  type value_type =
      Object of (string * value_type) list
    | Array of value_type list
    | String of string
    | Int of int
    | Float of float
    | Bool of bool
    | Null
end

.
let get_object v = match v with Object x -> x
  | _ -> fail "json object expected"
.
let pattern_path value names =
  let rec again value = function
    | "*" :: names -> List.iter (fun (n, v) -> try again v names
        with Invalid_argument _ | Not_found -> ()) (get_object value)
    | name :: names -> again (List.assoc name (get_object value)) names
    | [] -> raise (Found value)
  in try again value names; raise Not_found with Found value -> value

combined with a path split function

let split c s =
  let n = String.length s in
  let rec again i lst =
    begin try let k = String.rindex_from s i c in
      again (k - 1)
        ((if i = k then "" else (String.sub s (k + 1) (i - k))) :: lst)
    with _ -> (String.sub s 0 (i + 1)) :: lst
    end
  in again (n - 1) []

will do almost exactly what you are asking for - notice the "*"
searches broadly in all subtrees. You can add your own xpath like
functions as you discover a need for them.
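
To try the idea in isolation, here is a self-contained sketch. The `Found` exception and the use of `invalid_arg` in place of `fail` are assumptions, since their definitions are elided above; the tree type is cut down to two constructors, and the standard library's `String.split_on_char` stands in for the hand-written `split`:

```ocaml
type value = Object of (string * value) list | String of string

(* Assumed definition; the original is not shown in the excerpt. *)
exception Found of value

let get_object = function
  | Object x -> x
  | _ -> invalid_arg "object expected"  (* stands in for the elided [fail] *)

(* split '/' "a/b" = ["a"; "b"] *)
let split c s = String.split_on_char c s

(* Follow [names] down from [value]; "*" tries every child in order.
   Returns the first match, raises Not_found if there is none. *)
let pattern_path value names =
  let rec again value = function
    | "*" :: names ->
      List.iter
        (fun (_, v) ->
           try again v names with Invalid_argument _ | Not_found -> ())
        (get_object value)
    | name :: names -> again (List.assoc name (get_object value)) names
    | [] -> raise (Found value)
  in
  try again value names; raise Not_found with Found v -> v
```

Note that this returns only the first match; collecting all matches, as the Perl `findnodes` call does, would need an accumulator instead of the `Found` exception.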

I believe that the xmlm examples have a tree transformation operation
that could easily be adapted to produce a json-like tree, if modified
a little.

let out_tree o t =
  let frag = function
  | E (tag, childs) -> `El (tag, childs)
  | D d -> `Data d
  in
  Xmlm.output_doc_tree frag o t


> My best effort, using xml-light, is around 40 lines:

If you spend those 40 lines on a layer on top of a lightweight xml
parser, you might get away with 3 lines the next time.

Dario Teixeira

Sep 30, 2009, 7:05:15 AM
to Richard Jones, caml...@inria.fr
Hi,

Ocamlduce has been mentioned before in this thread, but I didn't catch
the reason why it has been discarded as a solution. Is it because you
don't want to carry the extra (large) dependency, or is there some other
reason?

And on the subject of simple XML parsers for Ocaml, there's also the
aptly named Simplexmlparser from the Ocsigen project [1]. It's about
as spartan as one can conceive, yet sufficient for a large subset of
XML extraction tasks.

Cheers,
Dario Teixeira

[1] http://ocsigen.org/docu/1.2.0/Simplexmlparser.html

Richard Jones

Sep 30, 2009, 7:57:33 AM
to Dario Teixeira, caml...@inria.fr
On Wed, Sep 30, 2009 at 04:05:03AM -0700, Dario Teixeira wrote:
> Hi,
>
> Ocamlduce has been mentioned before in this thread, but I didn't catch
> the reason why it has been discarded as a solution. Is it because you
> don't want to carry the extra (large) dependency, or is there some other
> reason?

Actually the reason is that I thought it wasn't available for 3.11.1,
but I just checked the website and it is, and ocamlduce does seem to
be the obvious solution for this problem. (However I'll need to try
and see if I can come up with the equivalent code).

> And on the subject of simple XML parsers for Ocaml, there's also the
> aptly named Simplexmlparser from the Ocsigen project [1]. It's about
> as spartan as one can conceive, yet sufficient for a large subset of
> XML extraction tasks.
>

> [1] http://ocsigen.org/docu/1.2.0/Simplexmlparser.html

Thanks - but if I understand that page correctly, then isn't it
just parsing XML into a tree?

Rich.

--
Richard Jones
Red Hat


Richard Jones

Sep 30, 2009, 8:59:14 AM
to caml...@inria.fr
On Wed, Sep 30, 2009 at 12:57:23PM +0100, Richard Jones wrote:
> On Wed, Sep 30, 2009 at 04:05:03AM -0700, Dario Teixeira wrote:
> > Hi,
> >
> > Ocamlduce has been mentioned before in this thread, but I didn't catch
> > the reason why it has been discarded as a solution. Is it because you
> > don't want to carry the extra (large) dependency, or is there some other
> > reason?
>
> Actually the reason is that I thought it wasn't available for 3.11.1,
> but I just checked the website and it is, and ocamlduce does seem to
> be the obvious solution for this problem. (However I'll need to try
> and see if I can come up with the equivalent code).

Do any cduce developers want to give me a clue here? It would seem
like I need something along these lines:

let devs = match xml with
| {{ <domain>[<devices>[<source dev=(String & dev) ..>[]]] }} -> dev
| {{ <domain>[<devices>[<source file=(String & file) ..>[]]] }} -> file in

However according to the compiler, devs has type <XML>. In any case,
I think I may need either the map or map* operator, since I want to
match all, not just the first one.

Till Varoquaux

Sep 30, 2009, 9:33:19 AM
to Richard Jones, caml...@inria.fr
OCamlduce (Alain correct me if I am wrong) basically maintains two
separate type systems side by side (the Xduce one and the Ocaml one).
This is done in order to make Ocamlduce maintainable by keeping a
clear separation. As a result you have to explicitly convert values
between type systems using {:...:}. These casts are type safe but do
lead to some work at runtime.

Also note that OCaml's strings are Latin1, not String, in the XML world. So:

let devs = match xml with
  | {{ <domain>[<devices>[<source dev=(Latin1 & dev) ..>[]]] }} -> {:dev:}
  | {{ <domain>[<devices>[<source file=(Latin1 & file) ..>[]]] }} -> {:file:} in

Should work (I'm rusty and have nothing to check handy).

Till

Stefano Zacchiroli

Sep 30, 2009, 9:40:29 AM
to caml...@inria.fr, PXP Users ML
On Mon, Sep 28, 2009 at 01:17:45PM +0100, Richard Jones wrote:
> I need to do some relatively simple extraction of fields from an XML
> document. In Perl I would use xpath, very specifically if $xml was an
> XML document[1] stored as a string, then:
>
> my $p = XML::XPath->new (xml => $xml);
> my @disks = $p->findnodes ('//devices/disk/source/@dev');
> push (@disks, $p->findnodes ('//devices/disk/source/@file'));

I've just realized that this thread can look a bit ridiculous, at least
for people used to other languages where XPath implementations can even
be found in the language standard library (the best solutions we have
thus far are: a 40-line xml-light solution, the need to use a modified
version of the OCaml compiler [yes, I know, it is compatible, but still
..], Galax with unreachable homepage, ...).

So, I was wondering, has anybody ever tried to develop an XPath
implementation on top of, say, PXP? The original announcement page of
PXP (now archived) mentions "rumors" about people who, back then, were
developing it. Has anything ever been released?

At first glance, there doesn't seem to be any specific typing problem,
at least with XPath 1.0, since the PXP node interface is already common
to all node types. Sure, XPath 2.0, when static typing is in use, can be
better integrated with the language, but that's probably already
happening in Galax.

[ Cc-ing the PXP mailing list ]

Cheers.

--
Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7
zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/
Dietro un grande uomo c'è ..| . |. Et ne m'en veux pas si je te tutoie
sempre uno zaino ...........| ..: |.... Je dis tu à tous ceux que j'aime

Richard Jones

Sep 30, 2009, 10:01:24 AM
to caml...@inria.fr
On Wed, Sep 30, 2009 at 09:33:07AM -0400, Till Varoquaux wrote:
> OCamlduce (Alain correct me if I am wrong) basically maintains two
> separate type systems side by side (the Xduce one and the Ocaml one).
> This is done in order to make Ocamlduce maintainable by keeping a
> clear separation. As a result you have to explicitly convert values
> between type systems using {:...:}. These casts are type safe but do
> lead to some work at runtime.
>
> Also note that ocaml's string are Latin1 and not String in the XML world. So:
>
> let devs = match xml with
> | {{ <domain>[<devices>[<source dev=(Latin1 & dev) ..>[]]] }} -> {:dev:}
> | {{ <domain>[<devices>[<source file=(Latin1 & file) ..>[]]] }} ->
> {:file:} in
>
> Should work (I'm rusty and have nothing to check handy).

I tried variations on the above, but couldn't get it to work.
ocamlduce is very fond of a mysterious error called "Error: Subtyping
failed", which is very difficult for me to understand, and therefore
must be absolutely impossible for someone not used to strong typing.

This is where I'm heading at the moment (sorry, my previous
example missed a <disk> level inside <devices>), so:

let xml = from_string xml in
prerr_endline (Ocamlduce.to_string xml);

let devs = {{ map [xml] with
| <domain..>[<devices..>[<disk..>[<source dev=(Latin1 & s) ..>_]]]
| <domain..>[<devices..>[<disk..>[<source file=(Latin1 & s) ..>_]]] -> [s]
| _ -> [] }} in
prerr_endline (Ocamlduce.to_string devs);

+1 : this compiles
-1 : it doesn't work, devs is empty

This is what the first prerr_endline prints:

<domain
  type="kvm"
  id="2">[
  <name>[ 'CentOS5x32' ]
  <uuid>[ '2ce397d9-1931-feb1-8ad8-15f22c4f18af' ]
  <memory>[ '524288' ]
  <currentMemory>[ '524288' ]
  <vcpu>[ '1' ]
  <os>[ <type arch="x86_64" machine="pc-0.11">[ 'hvm' ] <boot dev="hd">[ ] ]
  <features>[ <acpi>[ ] <apic>[ ] <pae>[ ] ]
  <clock offset="utc">[ ]
  <on_poweroff>[ 'destroy' ]
  <on_reboot>[ 'restart' ]
  <on_crash>[ 'restart' ]
  <devices>[
    <emulator>[ '/usr/bin/qemu-kvm' ]
    <disk
      type="block"
      device="disk">[
      <source dev="/dev/vg_trick/CentOS5x32">[ ]
      <target bus="ide" dev="hda">[ ]
      ]
    <interface
      type="network">[
      <mac address="54:52:00:3c:76:11">[ ]
      <source network="default">[ ]
      <target dev="vnet0">[ ]
      ]
    <serial type="pty">[ <source path="/dev/pts/7">[ ] <target port="0">[ ] ]
    <console
      type="pty"
      tty="/dev/pts/7">[
      <source path="/dev/pts/7">[ ]
      <target port="0">[ ]
      ]
    <input type="mouse" bus="ps2">[ ]
    <graphics autoport="yes" port="5900" type="vnc">[ ]
    <video>[ <model type="cirrus" vram="9216" heads="1">[ ] ]
    ]
  ]

and what the second prerr_endline prints:

Till Varoquaux

Sep 30, 2009, 10:28:26 AM
to Richard Jones, caml...@inria.fr
If I am not mistaken, you are selecting a domain whose first child is a
devices node whose only child is a disk node ...

instead of:


<domain..>[<devices..>[<disk..>[<source dev=(Latin1 & s) ..>_]]]

you should aim for something in the vein of:

<domain ..> [_* (<devices..> (<disk..>(<source dev=(Latin1 & s)>|
<source file = (Latin1 &s)>_)* |_)* _*]

Till

Gerd Stolpmann

Sep 30, 2009, 10:45:17 AM
to Stefano Zacchiroli, PXP Users ML, caml...@inria.fr

Am Mittwoch, den 30.09.2009, 15:39 +0200 schrieb Stefano Zacchiroli:
> On Mon, Sep 28, 2009 at 01:17:45PM +0100, Richard Jones wrote:
> > I need to do some relatively simple extraction of fields from an XML
> > document. In Perl I would use xpath, very specifically if $xml was an
> > XML document[1] stored as a string, then:
> >
> > my $p = XML::XPath->new (xml => $xml);
> > my @disks = $p->findnodes ('//devices/disk/source/@dev');
> > push (@disks, $p->findnodes ('//devices/disk/source/@file'));
>
> I've just realized that this thread can look a bit ridiculous, at least
> for people used to other languages where XPath implementations can even
> be found in the language standard library (the best solutions we have
> thus far are: a 40-line xml-light solution, the need to use a modified
> version of the OCaml compiler [yes, I know, it is compatible, but still
> ...], Galax with unreachable homepage, ...).

>
> So, I was wondering, has anybody ever tried to develop an XPath
> implementation on top of, say, PXP? The original announcement page of
> PXP (now archived) mentions "rumors" about people which, back then, were
> developing it. Has anything ever been released?

No. However, there is a little XPath evaluator in SVN:

https://godirepo.camlcity.org/svn/lib-pxp/trunk/src/pxp-engine/pxp_xpath.ml

I have never found the time to complete it, and to add some syntax
extension for painless use. But maybe somebody wants to take this over?

Gerd

> At first glance, it doesn't seem to exist any specific typing problem,
> at least with XPath 1.0, since the PXP node interface is already common
> for all node types. Sure XPath 2.0, when static typing is in use, can be
> better integrated with the language, but that's probably already
> happening in Galax.
>
> [ Cc-ing the PXP mailing list ]
>
> Cheers.
>
--

------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany
ge...@gerd-stolpmann.de http://www.gerd-stolpmann.de
Phone: +49-6151-153855 Fax: +49-6151-997714
------------------------------------------------------------

Alain Frisch

Sep 30, 2009, 10:51:17 AM
to Richard Jones, caml...@inria.fr
Richard Jones wrote:
> let devs = {{ map [xml] with
> | <domain..>[<devices..>[<disk..>[<source dev=(Latin1 & s) ..>_]]]
> | <domain..>[<devices..>[<disk..>[<source file=(Latin1 & s) ..>_]]] -> [s]
> | _ -> [] }} in

The following should work:

  let l = {{ [xml] }} in
  let l = {{ map l with <domain..>l -> l | _ -> [] }} in
  let l = {{ map l with <devices..>l -> l | _ -> [] }} in
  let l = {{ map l with <disk..>l -> l | _ -> [] }} in
  let l = {{ map l with <source dev=(Latin1 & s) ..>_
                      | <source file=(Latin1 & s) ..>_ -> s
                      | _ -> [] }} in
  ...

let () =
  let l = {{ [xml] }} in
  let l = {{ (((l.(<domain..>_)) / .(<devices..>_)) / .(<disk..>_)) / }} in
  let l = {{ map l with <source dev=(Latin1 & s) ..>_
                      | <source file=(Latin1 & s) ..>_ -> s
                      | _ -> [] }} in
  ..


This uses the constructions e/ and e.(t) as described in the manual.

That said, using OCamlDuce for this kind of XML data-extraction seems
just crazy to me.


Cheers,

Alain

Richard Jones

Sep 30, 2009, 11:09:29 AM
to Alain Frisch, caml...@inria.fr
On Wed, Sep 30, 2009 at 04:51:01PM +0200, Alain Frisch wrote:
> Richard Jones wrote:
> > let devs = {{ map [xml] with
> > | <domain..>[<devices..>[<disk..>[<source dev=(Latin1 & s) ..>_]]]
> > | <domain..>[<devices..>[<disk..>[<source file=(Latin1 & s) ..>_]]] ->
> > [s]
> > | _ -> [] }} in
>
> The following should work:
>
> let l = {{ [xml] }} in
> let l = {{ map l with <domain..>l -> l | _ -> [] }} in
> let l = {{ map l with <devices..>l -> l | _ -> [] }} in
> let l = {{ map l with <disk..>l -> l | _ -> [] }} in
> let l = {{ map l with <source dev=(Latin1 & s) ..>_
> | <source file=(Latin1 & s) ..>_-> s
> | _ -> [] }} in
> ...
>
> let () =
> let l = {{ [xml] }} in
> let l = {{ (((l.(<domain..>_)) / .(<devices..>_)) / .(<disk..>_)) / }} in
> let l = {{ map l with <source dev=(Latin1 & s) ..>_
> | <source file=(Latin1 & s) ..>_ -> s
> | _ -> [] }} in
> ..

Thanks Alain. My latest attempt was similar to your version 1 above,
and it works :-)

Now my code looks like your version 2:

  let xml = from_string xml in

  let xs = {{ [xml] }} in
  let xs = {{ (((xs.(<domain..>_)) / .(<devices..>_)) / .(<disk..>_)) / }} in
  let xs = {{ map xs with
              | <source dev=(Latin1 & s) ..>_
              | <source file=(Latin1 & s) ..>_ -> [s]
              | _ -> [] }} in

  {: xs :}

(plus the boilerplate for interfacing xml-light and CDuce).

We're getting close to the xpath/perl solution (8 lines vs 3 lines),
with some added type safety and the possibility of validating the XML.

On the other hand, the code is hard to understand. It's not clear to
me what the .( ) syntax means, nor why there is an apparently trailing
/ character.

> This uses the constructions e/ and e.(t) as described in the manual.
>
> That said, using OCamlDuce for this kind of XML data-extraction seems
> just crazy to me.

I have some comments:

(A) "Subtyping failed" is a very common error, but is only mentioned
briefly in the manual. I have no idea what these errors mean, so they
should have more explanation. Here is a simple one which was caused
by me using a value instead of a list (but that is not at all obvious
from the error message):

Error: Subtyping failed Latin1 <= [ Latin1* ]
Sample:
[ Latin1Char ]

(B) I think the interfacing code here:

http://yquem.inria.fr/~frisch/ocamlcduce/samples/expat/
http://yquem.inria.fr/~frisch/ocamlcduce/samples/pxp/
http://yquem.inria.fr/~frisch/ocamlcduce/samples/xmllight/

should be distributed along with ocamlduce.

Rich.

--
Richard Jones
Red Hat


Stefano Zacchiroli

Sep 30, 2009, 11:13:25 AM
to caml...@inria.fr, PXP Users ML
On Wed, Sep 30, 2009 at 04:49:37PM +0200, Gerd Stolpmann wrote:
> No. However, there is a little XPath evaluator in SVN:
> https://godirepo.camlcity.org/svn/lib-pxp/trunk/src/pxp-engine/pxp_xpath.ml

Cool, and you have even already implemented all of the XPath 1.0
standard library!

> I have never found the time to complete it, and to add some syntax
> extension for painless use. But maybe somebody wants to take this
> over?

If I'm not mistaken, more than a syntax extension, that evaluator needs a
parser from concrete syntax to the abstract syntax you've already
implemented. Once you have that, I don't think there is really a need for
any syntax extension; what would be wrong with using it as follows:

let nodes = xpath_eval ~xpath:(xpath "/foo/bar[2]/@baz") tree in
let nodes2 = xpath_eval ~expr:"/foo/bar[2]/@baz" in
...

We already use regexps this way and it is more than handy. Or am I missing
something here?
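
As a rough illustration of the "concrete syntax to abstract syntax" step for the simplest paths only, here is a sketch. The `step` type and `parse_path` are hypothetical and are not the abstract syntax used in pxp_xpath.ml; predicates such as `[2]`, axes and `//` are deliberately left out:

```ocaml
(* Hypothetical abstract syntax for a tiny XPath subset. *)
type step =
  | Child of string      (* element name, e.g. foo *)
  | Attribute of string  (* attribute name, written @baz *)

(* Parse an absolute path such as "/foo/bar/@baz" into a list of steps.
   Raises Invalid_argument on relative paths and on empty steps
   (which also rejects "//"). *)
let parse_path (s : string) : step list =
  match String.split_on_char '/' s with
  | "" :: parts when parts <> [] ->
    List.map
      (fun p ->
         if p = "" then invalid_arg "parse_path: empty step"
         else if p.[0] = '@' then
           Attribute (String.sub p 1 (String.length p - 1))
         else Child p)
      parts
  | _ -> invalid_arg "parse_path: expected an absolute path"
```

An evaluator would then walk the step list against the document tree, in the same way a compiled regexp is applied to a string.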

I don't have energy to volunteer myself, but I duly note that Alain's
old XPath implementation already contains a parser that can be reused
(whereas the lexer should be changed, as already observed; most likely
the lexer should be ported to Ulex).

All in all, it is probably just a matter of integration work (modulo the
limitations of the current evaluator, of course).

Any volunteer? :-)

Cheers.

--
Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7
zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/
Dietro un grande uomo c'è ..| . |. Et ne m'en veux pas si je te tutoie
sempre uno zaino ...........| ..: |.... Je dis tu à tous ceux que j'aime


Alain Frisch

Sep 30, 2009, 11:19:09 AM
to Richard Jones, caml...@inria.fr
Richard Jones wrote:
> On the other hand, the code is hard to understand. It's not clear to
> me what the .( ) syntax means, nor why there is an apparently trailing
> / character.

From the manual:

If the x-expression e evaluates to an x-sequence, the construction e/
will result in a new x-sequence obtained by taking in order all the
children of the XML elements from the sequence e. For instance, the
x-expression [<a>[ 1 2 3 ] 4 5 <b>[ 6 7 8 ] ]/ evaluates to the x-value
[ 1 2 3 6 7 8 ].

If the x-expression e evaluates to an x-sequence, the construction e.(t)
(where t is an x-type) will result in a new x-sequence obtained by
filtering e to keep only the elements of type t. For instance, the
x-expression [<a>[ 1 2 3 ] 4 5 <b>[ 6 7 8 ] ].(Int) evaluates to the
x-value [ 4 5 ].
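
For readers without OCamlDuce at hand, the two constructions can be modelled on ordinary OCaml lists. This is only an analogy under an assumed `item` type, with predicates standing in for x-types; it is not how OCamlDuce implements them:

```ocaml
(* Hypothetical stand-in for x-values: integers and elements. *)
type item =
  | Int of int
  | Elem of string * item list  (* tag, children *)

(* Analogue of e/ : the children of every element in the sequence,
   in order. Non-element items contribute nothing. *)
let children (e : item list) : item list =
  List.concat (List.map (function Elem (_, kids) -> kids | Int _ -> []) e)

(* Analogue of e.(t) : keep only the items accepted by [t]. *)
let keep (t : item -> bool) (e : item list) : item list = List.filter t e

let is_int = function Int _ -> true | Elem _ -> false
let is_elem tag = function Elem (t, _) -> t = tag | Int _ -> false

(* The sequence [<a>[ 1 2 3 ] 4 5 <b>[ 6 7 8 ] ] from the manual. *)
let example =
  [ Elem ("a", [ Int 1; Int 2; Int 3 ]); Int 4; Int 5;
    Elem ("b", [ Int 6; Int 7; Int 8 ]) ]
```

With these definitions, `children example` mirrors the first manual example and `keep is_int example` the second. A chain like `l.(<domain..>_) / .(<devices..>_) /` corresponds to alternating `keep` and `children`.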

> I have some comments:
>
> (A) "Subtyping failed" is a very common error, but is only mentioned
> briefly in the manual. I have no idea what these errors mean, so they
> should have more explanation. Here is a simple one which was caused
> by me using a value instead of a list (but that is not at all obvious
> from the error message):
>
> Error: Subtyping failed Latin1 <= [ Latin1* ]
> Sample:
> [ Latin1Char ]

The error tells you that Latin1 is not a subtype of [ Latin1* ].
It probably means that you are trying to use a value of type Latin1
where a value of type [ Latin1* ] is expected.

Regarding (B): there was a GODI package that included these files. It
would be OK to put them in the distribution without compiling them
(otherwise it would create a dependency on more OCaml packages). It's up
to Stéphane Glondu, the new maintainer of OCamlDuce.


Cheers,

Alain

Jordan Schatz

Sep 30, 2009, 11:22:49 AM
to caml...@inria.fr, PXP Users ML
I hope this is germane, I am very new to Ocaml.

Do these help at all?
http://packages.debian.org/sid/libxml-light-ocaml-dev
http://tech.motion-twin.com/xmllight.html

I expect it wouldn't be too difficult to write a wrapper around libxml
http://xmlsoft.org/index.html

-Jordan

Daniel Bünzli

Oct 27, 2009, 10:22:33 PM
to Richard Jones, Mikkel Fahnøe Jørgensen, caml...@inria.fr
Sorry for the late reply.

On Wed, Sep 30, 2009 at 01:00:15AM +0200, Mikkel Fahnøe Jørgensen wrote:

> Otherwise there is xmlm which is self-contained in a single ml file,
> and as I recall, has some sort of zipper navigator. (I initially
> intended to use it before deciding on the json format):

The cursor api was removed from the library in 1.0.0.


On Wed, Sep 30, 2009 at 6:16 PM, Richard Jones <ri...@annexia.org> wrote:

> It's interesting you mention xmlm, because I couldn't write
> the code using xmlm at all.

Why? That doesn't feel like an insurmountable task.

Below is a function that extracts from a (sub)tree's sequence of
signals the attributes' data of an absolute path (i.e. the particular
xpath pattern you're after if I understand correctly). Each
attribute's data is stored in a separate list. The function is simpler
than it looks, in essence it's just a recursive case analysis on
signals. In the function [aux], [pos] maintains the current path in
the parse tree. [mismatch] counts the level of mismatch w.r.t. the
[path] we are looking for.

let absolute_path_atts i path atts =
  let rec aux i pos mismatch path accs = match Xmlm.input i with
  | `El_start (tag, atts) ->
      if mismatch > 0 then aux i (tag :: pos) (mismatch + 1) path accs else
      begin match path with
      | n :: path' when n = tag ->
          if path' <> [] then aux i (tag :: pos) 0 path' accs else
          let update_acc ((att, acc) as v) =
            try att, (List.assoc att atts) :: acc with Not_found -> v
          in
          aux i (tag :: pos) 0 [] (List.map update_acc accs)
      | _ -> aux i (tag :: pos) (mismatch + 1) path accs
      end
  | `El_end ->
      begin match pos with
      | _ :: [] -> List.rev_map (fun (att, acc) -> List.rev acc) accs
      | tag :: pos' ->
          if mismatch > 0 then aux i pos' (mismatch - 1) path accs else
          aux i pos' 0 (tag :: path) accs
      | [] -> assert false
      end
  | `Data _ -> aux i pos mismatch path accs
  | `Dtd _ -> assert false
  in
  let accs = List.rev_map (fun att -> att, []) atts in
  begin match Xmlm.peek i with
  | `El_start _ -> aux i [] 0 path accs
  | `Dtd _ | `El_end | `Data _ -> invalid_arg "no subtree here"
  end

Now your function becomes something like this :

let get_devices_from_xml xml =
  try
    let i = Xmlm.make_input (`String (0, xml)) in
    ignore (Xmlm.input i); (* `Dtd signal *)
    let path = ["", "domain"; "","devices"; "", "disk"; "", "source"] in
    match absolute_path_atts i path ["", "dev"; "", "file"] with
    | [devs; files] when Xmlm.eoi i -> devs @ files
    | _ -> failwith "xml document not well-formed"
  with
  | Xmlm.Error ((l,c), e) ->
      failwith (Printf.sprintf "%d:%d: %s" l c (Xmlm.error_message e))

I know this is still more effort than you'd like, but
Xmlm is purposely low-level and will remain so. It provides only a
robust xml parser, convenient (I believe) for developing higher-level
abstractions to process the insane uses of this standard. It would be
nice to develop a module using xmlm to provide a (non-camlp4) dsl for
xml queries. Unfortunately I do not have the time for that at the
moment (unless someone wants to fund me to do that...).

Best,

Daniel