doc.xpath() abysmally slow

fearless_fool

Jan 18, 2013, 8:47:26 AM
to nokogi...@googlegroups.com
Is there something I can do to speed up my .xpath() call?

Here are the two lines in question:

  doc = File.open(fname) {|f| Nokogiri::XML(f) }
  recs = doc.xpath('//meter:IntervalBlock//meter:IntervalReading', 'meter' => "http://naesb.org/espi")

The call to Nokogiri::XML(f) takes about 80 ms.
The call doc.xpath(...) takes over 132 (!!) seconds.

Granted, it's a large XML file (~66K lines), but I'm surprised xpath() is taking so long -- it's been plenty fast in all the other cases where I've used it. Am I passing a stupid search path?

TIA.

- ff

Robert Poor

Jan 18, 2013, 9:16:22 AM
to nokogi...@googlegroups.com
Update:

Sometimes remove_namespaces! can be used for good, not for evil:

  doc = File.open(fname) {|f| Nokogiri::XML(f) }
  doc.remove_namespaces!
  recs = doc.xpath('//IntervalBlock//IntervalReading')

This goes blazingly fast! But my original question holds: Is there a
way to speed up the search without nuking the namespaces?

Mike Dalessio

Jan 18, 2013, 11:47:42 AM
to nokogiri-talk
Greetings!

Can you provide a working example so perhaps some intrepid nokogiri-talk member can profile what's going on that takes two minutes?

Thanks for using Nokogiri!

Robert Poor

Jan 18, 2013, 12:11:18 PM
to nokogi...@googlegroups.com
On Fri, Jan 18, 2013 at 8:47 AM, Mike Dalessio <mike.d...@gmail.com> wrote:
> Greetings!
>
> Can you provide a working example so perhaps some intrepid nokogiri-talk
> member can profile what's going on that takes two minutes?
>
> Thanks for using Nokogiri!

With pleasure. I've posted the xml file as a gist at:

https://gist.github.com/4566179

Let me know if you need more info (or the xslt file). Looking forward
to what you find!

Mike Dalessio

Jan 22, 2013, 8:43:21 AM
to nokogiri-talk
Howdy,

Running perftools, it's easy to see that libxml2 is spending the entire time evaluating the XPath query. It's not immediately clear to me why (I have yet to get a debug build of libxml2 on my new laptop), but it appears to be the combination of namespaces and wildcards that is causing libxml2 to run slowly.

For example, if you changed your original query:

    doc.xpath('//meter:IntervalBlock//meter:IntervalReading', 'meter' => "http://naesb.org/espi")

to this:

    doc.xpath('//meter:IntervalBlock/meter:IntervalReading', 'meter' => "http://naesb.org/espi")

that is, if you don't use a double slash between IntervalBlock and IntervalReading, the query runs quickly. As you pointed out, if you remove namespaces, then the query also runs quickly.

So, as a workaround, use the most precise XPath query you can in order to improve performance in libxml2 XPath queries.

HTH,
-mike




Robert Poor

Jan 22, 2013, 11:40:03 AM
to nokogi...@googlegroups.com
Mike:

On Tue, Jan 22, 2013 at 5:43 AM, Mike Dalessio <mike.d...@gmail.com> wrote:
> Running perftools...it appears to be the combination of namespaces
> and wildcards that is causing libxml2 to run slowly.
>
> For example, if you changed your original query ...to this:
>
> doc.xpath('//meter:IntervalBlock/meter:IntervalReading', 'meter' =>
> "http://naesb.org/espi")
>
> the query runs quickly. ... So, as a workaround, use the most precise
> XPath query you can in order to improve performance in libxml2 XPath
> queries.

Thanks for the sleuth work! Removing the namespaces made me nervous
(namespaces are there for a reason!), so your workaround makes me much
happier. Finding an actual solution to the problem will be a nice bit
of debugging work for someone.

Robert Poor

Jan 22, 2013, 11:45:23 AM
to nokogi...@googlegroups.com
On Tue, Jan 22, 2013 at 5:43 AM, Mike Dalessio <mike.d...@gmail.com> wrote:
> Running perftools...it appears to be the combination of namespaces
> and wildcards that is causing libxml2 to run slowly.

By the way, what is the proper channel for reporting libxml2 bugs? I
can file a bug report unless you already have.

Mike Dalessio

Jan 22, 2013, 12:18:47 PM
to nokogiri-talk
I'm not sure it's a "bug" ... why don't I take the responsibility to figure out what's going on in libxml2 internals, and if necessary report a performance problem. Sound OK with you?



Robert Poor

Jan 22, 2013, 12:26:28 PM
to nokogi...@googlegroups.com
On Tue, Jan 22, 2013 at 9:18 AM, Mike Dalessio <mike.d...@gmail.com> wrote:
> I'm not sure it's a "bug" ... why don't I take the responsibility to figure
> out what's going on in libxml2 internals, and if necessary report a
> performance problem. Sound OK with you?

A 1500x slowdown? It would surprise me if this wasn't a bug. ;) I'd
be honored to leave the issue in your capable hands. Thanks!