"walk over," and XPath-based substitutions?

Ivan Shmakov

unread,

Apr 6, 2013, 7:32:11 AM4/6/13

to

[Cross-posting to news:comp.text.xml, yet omitting it from
Followup-To:, for I'm primarily interested in Perl-based
solutions.]

Is there an easy way to invoke a particular code for each of XML
nodes that satisfies an XPath expression out of a certain list?

A simple-minded approach (based on XML::LibXML) could be like:

require XML::LibXML;

my %xpath_sub = {
q {//node ()[@foo = "bar"]} => \&foo_bar,
q {//node ()[@baz = "qux"]} => sub { baz ("qux", @_); }
};

foreach my $xpath (keys (%xpath_sub)) {
my $sub
= $xpath_sub{$xpath};
foreach my $node ($context->findnodes ($xpath)) {
$sub->($node);
}
}

However, AIUI, the code above implies that the XML tree is to be
traversed multiple times. Which could probably be avoided by
traversing the tree explicitly, as in:

sub traverse {
my ($node, $xsubs) = @_;
foreach my $xpath (keys (%$xsubs)) {
next
unless ($node->find ($xpath));
## FIXME: check if the result is a boolean?
$xsubs->{$xpath}->($node);
## FIXME: there, one may wish for a recursion; or not
}
## recurse over the children
foreach my $child ($node->childNodes ()) {
traverse ($child, $xsubs);
}
## .
}

Still, it may repeatedly traverse the children of $node while
computing ->find () for each of the XPath expressions. (Unlike
the way an "optimized," or "compiled," regular expression would
be handled, IIUC.)

The question is: does LibXML (or some other library) provide a
way to make such a task both simpler to code and more efficient
on execution?

... Or do I "optimize" all the XPath expressions themselves into
a single one somehow?

TIA.

--
FSF associate member #7257 http://hfday.org/

Dennis

unread,

Apr 6, 2013, 11:01:55 AM4/6/13

to

On 4/6/2013 5:32 AM, Ivan Shmakov wrote:
> [Cross-posting to news:comp.text.xml, yet omitting it from
> Followup-To:, for I'm primarily interested in Perl-based
> solutions.]
>
> Is there an easy way to invoke a particular code for each of XML
> nodes that satisfies an XPath expression out of a certain list?
>
> A simple-minded approach (based on XML::LibXML) could be like:
>
> require XML::LibXML;

XML::Twig http://search.cpan.org/~mirod/XML-Twig-3.42/Twig.pm
may be useful.

Dennis

Joe Kesselman

unread,

Apr 6, 2013, 12:11:40 PM4/6/13

to

On 4/6/2013 7:32 AM, Ivan Shmakov wrote:
...

> However, AIUI, the code above implies that the XML tree is to be
> traversed multiple times.

First off, I'd suggest that you consider XSLT or XQuery, which are
specifically designed for this kind of find-and-process operation.

What you're looking for is a "streaming processor" -- one which rewrites
the complete set of operations into a state machine which can produce
its results in a single pass over the nodes. There are XPath/XSLT/XQuery
systems which attempt to do this for a subset of the query language -- I
think Xerces and the IBM XML parser have streaming-subset XPath
evaluators, and I know the DataPower "xml appliance" machines have some
limited XSLT streaming capability -- but even as subsets, those are
fairly rare, and while they may be able to reduce storage by not keeping
the entire document model in memory they may not reduce computational
load. If you're looking for something off-the-shelf, that's where I'd start.

A _good_ general solution for matching multiple paths in a single pass
over the document is NOT easy to create. You need to create a state
machine which tracks what has been seen so far and detects which nodes
match which expression, and at the same time you want to constrain the
tree walk so you don't waste time exploring trees which provably can't
contribute nodes to those results. Getting all those details right even
for the subset approach can be complicated. Reassembling the individual
results in the correct order to produce the intended result document
further complicates the process.

(I'm one of the authors of a patent on that topic, actually -- US
8,120,789 B2 -- but unfortunately our group didn't get the funding to
finish a product-quality implementation of that logic so it isn't
available for use. If someone wants to license the patent, I'm sure IBM
would be delighted to talk to you...)

--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."

Ivan Shmakov

unread,

Apr 7, 2013, 5:22:58 AM4/7/13

to

>>>>> Joe Kesselman <keshlam.c...@verizon.net> writes:
>>>>> On 4/6/2013 7:32 AM, Ivan Shmakov wrote:

[Given that there were little Perl-specific matter in this
subthread, cross-posting back to news:comp.text.xml, and setting
Followup-To: there.]

>> However, AIUI, the code above implies that the XML tree is to be
>> traversed multiple times.

> First off, I'd suggest that you consider XSLT or XQuery, which are
> specifically designed for this kind of find-and-process operation.

I see little advantage in using XSLT for my task (and I'm not
familiar with XQuery), as XML is not the only data source I need
to interface. (E. g., I'm also accessing an SQLite database.)
The usual benefits of XSLT -- the existence of browser-based
implementations and its "Lisp-like" nature (in that it uses the
same syntax for both the code and data) -- do not seem to apply.

> What you're looking for is a "streaming processor" -- one which
> rewrites the complete set of operations into a state machine which
> can produce its results in a single pass over the nodes.

Indeed, thanks for clarification!

> There are XPath/XSLT/XQuery systems which attempt to do this for a
> subset of the query language -- I think Xerces

Is it Apache Xerces [1]? It doesn't seem to include either XSLT
or XQuery.

[1] https://xerces.apache.org/

> and the IBM XML parser

Which is?

> have streaming-subset XPath evaluators, and I know the DataPower "xml
> appliance" machines have some limited XSLT streaming capability --
> but even as subsets, those are fairly rare, and while they may be
> able to reduce storage by not keeping the entire document model in
> memory they may not reduce computational load. If you're looking for
> something off-the-shelf, that's where I'd start.

ACK, thanks. My XMLs are rather small, so I'm more interested
in reducing computational load than memory usage. But even that
is not a priority right now. Rather, I'm looking for the ways
to avoid total code rewrite at some later point.

I guess I should check XML::Twig. Or, given that the conditions
that I currently need to consider are rather simple, a
straight-forward ->childNodes ()-based, no-XPath implementation
may be possible.

[...]

> (I'm one of the authors of a patent on that topic, actually -- US
> 8,120,789 B2 -- but unfortunately our group didn't get the funding to
> finish a product-quality implementation of that logic so it isn't
> available for use. If someone wants to license the patent, I'm sure
> IBM would be delighted to talk to you...)

I believe that I may be under a jurisdiction which has no notion
of software patents. (Subject to the reading of TRIPS, though.)