XML Parsing Speed - ruby libxml & REXML
Newsgroups: comp.lang.ruby
From: "subimage" <subim...@gmail.com>
Date: 30 May 2006 02:33:25 -0700
Local: Tues, May 30 2006 5:33 am
Subject: XML Parsing Speed - ruby libxml & REXML
Hey all...

I'm working on a massive Rails site that does heavy data import daily.
A lot of this data is in XML files of various sizes ranging from 100k
to 400mb, and totaling around 2gb for all sources. I'd like to keep the
entire project using Ruby.

At first, I wrote my parsers using REXML, but found that to be DOG
SLOW, especially for the large files. I tried REXML::parse_stream but
couldn't find any good documentation for handling parsing that way. It
was taking around 30 minutes to an hour to even _open_ the larger files
on a p4 1.8ghz test machine.

After that exercise I switched to libxml, which is a lot speedier, but
still slow (no numbers to back it up yet, just can tell by the speed of
data insert in my DB)

I'm wondering if there's some other lib out there that I'm missing? Can
someone point me in the right direction? Is there anything faster I'm
missing out on?

Are there any "gotchas" with using libxml that I should be aware of
speed-wise?

Any and all help is much appreciated...thanks!


Newsgroups: comp.lang.ruby
From: Robert Klemme <bob.n...@gmx.net>
Date: Tue, 30 May 2006 13:09:41 +0200
Local: Tues, May 30 2006 7:09 am
Subject: Re: XML Parsing Speed - ruby libxml & REXML

Since you insert data into a DB: are you absolutely positive about the
fact that it's the XML parsing part that's slow?  Here's what I'd do:
use two threads connected with a bounded queue, one thread for reading
XML with REXML's stream parser and one thread for inserting into the DB.
That way you can utilize CPU for parsing XML while your process waits
for the DB call to return.  If possible use bulk insertions.
Alternatively, write out a CSV file and use the DB's bulk loader to pump
the data into the DB.  HTH

Kind regards

        robert
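
A minimal sketch of the CSV route suggested above, assuming Ruby's `csv` library is available. The column names and sample rows are hypothetical; in practice the rows would come from the XML parser, and the resulting file would be handed to whatever bulk loader the database provides.

```ruby
require 'csv'

# Hypothetical parsed records: [unique id, name, price]
rows = [
  ['1', 'Widget', '9.99'],
  ['2', 'Gadget, large', '19.99']
]

# CSV.generate builds the text in memory and takes care of quoting
# (note the comma inside the second product name).
csv_text = CSV.generate do |csv|
  csv << %w[data_unique_id name price]
  rows.each { |row| csv << row }
end

puts csv_text
```

For real imports, `CSV.open(path, 'w')` writes straight to disk instead of building a string, which matters when the source files are hundreds of megabytes.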


Newsgroups: comp.lang.ruby
From: "subimage" <subim...@gmail.com>
Date: 30 May 2006 13:20:46 -0700
Local: Tues, May 30 2006 4:20 pm
Subject: Re: XML Parsing Speed - ruby libxml & REXML
Robert thanks for the response...

I definitely _know_ it's the XML parsing that's slow. As mentioned,
even opening the file with REXML or libxml takes some time, then
finding all of my nodes (and nodes within) is even longer. Could it be
because I'm using doc.root.element.find("path") inside of my loop?
Anyone know a better way to go about grabbing specific nodes within a
document using libxml?

Insertion to the db is simple and quick - although your idea of a
bounded queue with 2 threads is interesting. I'll have to look into
that (have any example code I might start from?)

Also - I was unable to get stream parsing working properly for REXML so
I just gave up and moved to libxml. Do you have any resources on REXML
stream parsing you can share? A tutorial or reference? Anything would
be helpful.

Everything I've read online says libxml is much faster than REXML, so I
thought I made the best choice available.


Newsgroups: comp.lang.ruby
From: Robert Klemme <bob.n...@gmx.net>
Date: Tue, 30 May 2006 22:38:46 +0200
Local: Tues, May 30 2006 4:38 pm
Subject: Re: XML Parsing Speed - ruby libxml & REXML

subimage wrote:
> Robert thanks for the response...

> I definitely _know_ it's the XML parsing that's slow. As mentioned,
> even opening the file with REXML or libxml takes some time, then
> finding all of my nodes (and nodes within) is even longer. Could it be
> because I'm using doc.root.element.find("path") inside of my loop?

We would have to see the code.  Normally you would use find once on the
top level and have an XPath expression in place that selects all the
nodes that you need.  At the moment I'm not sure whether it's any of the
XML libs or the way you use them.
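
A small sketch of the "find once at the top level" pattern. It uses REXML rather than libxml so that it runs with the libraries bundled with Ruby; the same shape should carry over to libxml's `find`. The document layout (`catalog`, `product`, `price/sale`, `price/retail`) is a made-up example.

```ruby
require 'rexml/document'

xml = <<~XML
  <catalog>
    <header><merchantName>Acme</merchantName></header>
    <product product_id="1" name="Widget"><price><sale>9.99</sale><retail>12.00</retail></price></product>
    <product product_id="2" name="Gadget"><price><sale>19.99</sale><retail>25.00</retail></price></product>
  </catalog>
XML

doc = REXML::Document.new(xml)

# One top-level query selects every product node.
products = REXML::XPath.match(doc, '/catalog/product')

rows = products.map do |product|
  # Paths are relative to the product node (no leading "//"),
  # so each lookup stays inside that product's subtree.
  {
    id:     product.attributes['product_id'],
    sale:   product.elements['price/sale'].text,
    retail: product.elements['price/retail'].text
  }
end

rows.each { |row| puts row.inspect }
```

The point to watch is the leading `//`: an expression such as `//price/retail` starts from the document root regardless of which node it is called on, so using it inside a per-product loop rescans the whole document each time.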

> Anyone know a better way to go about grabbing specific nodes within a
> document using libxml?

> Insertion to the db is simple and quick - although your idea of a
> bounded queue with 2 threads is interesting. I'll have to look into
> that (have any example code I might start from?)

Not handy.  But it's fairly simple: you create the queue (see SizedQueue
in the 'thread' library) and then create two threads, one for reading
and one for writing.

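require 'thread'
Q = SizedQueue.new 100

# Reader thread: parses the XML and queues one item per record
Thread.new do
   # open file
   # read XML
   # loop
     Q.enq "something"
   # end loop
   # close file
   # signal finish by queuing the queue itself as a sentinel:
   Q.enq Q
end

# open DB
# Writer (main thread): drain the queue until the sentinel arrives
until Q == (task = Q.deq)
   # insert task into DB
end
# commit TX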
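
A runnable version of the outline above, as a sketch only: the "parser" is a counting loop and the "database" is an array, both stand-ins. `SizedQueue#enq` blocks once 100 items are waiting, which is what keeps the reader from running far ahead of the writer.

```ruby
require 'thread'

QUEUE = SizedQueue.new(100)  # bounded: the producer blocks when it is full
DONE  = Object.new           # sentinel object marking end of input

inserted = []                # stands in for the database

# Producer: in real code this would be the XML stream parser.
producer = Thread.new do
  1.upto(250) { |i| QUEUE.enq("record #{i}") }
  QUEUE.enq(DONE)
end

# Consumer: in real code this would perform the (bulk) inserts.
until DONE.equal?(task = QUEUE.deq)
  inserted << task
end
producer.join

puts "inserted #{inserted.size} records"
```

A dedicated sentinel object is used here instead of the queue itself; either works, as long as it can never be mistaken for a real record.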

> Also - I was unable to get stream parsing working properly for REXML so
> I just gave up and moved to libxml. Do you have any resources on REXML
> stream parsing you can share? A tutorial or reference? Anything would
> be helpful.

http://www.germane-software.com/software/rexml/doc/classes/REXML/Stre...

If you want to see what happens you can use this class as callback:

class Dummy
   def method_missing(s,*a,&b)
     print s, " ", a.inspect, b, "\n"
   end
end

You'll see on the console which methods are called with which arguments.
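
A sketch of how such a catch-all listener can be handed to the stream parser through `REXML::Document.parse_stream`. The class name `RecordingListener` and the one-line document are made up for illustration; this variant stores the callbacks in an array as well as printing them.

```ruby
require 'rexml/document'

# Records every callback the stream parser makes, whatever its name.
class RecordingListener
  attr_reader :events

  def initialize
    @events = []
  end

  def method_missing(name, *args, &block)
    @events << [name, args]
  end
end

listener = RecordingListener.new
REXML::Document.parse_stream('<a id="1">hi</a>', listener)

listener.events.each { |name, args| puts "#{name} #{args.inspect}" }
```

For this input the callbacks of interest are `tag_start`, `text` and `tag_end`; a real listener would define just those methods.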

> Everything I've read online says libxml is much faster than REXML, so I
> thought I made the best choice available.

I've never used libxml myself.  I'd rather start with REXML because it
usually comes pre installed.  If documents are large I'd use the stream API.

Kind regards

        robert


Newsgroups: comp.lang.ruby
From: "rcoder" <rco...@gmail.com>
Date: 30 May 2006 13:44:34 -0700
Local: Tues, May 30 2006 4:44 pm
Subject: Re: XML Parsing Speed - ruby libxml & REXML
libxml will definitely be faster, but with either parser you'll want to
avoid loading the entire file into RAM -- even using the fastest C++ or
Java parsers, wrapping every bit of the XML tree structure in object
instances is going to involve a huge amount of overhead. Using XPath to
traverse the entire document tree will further slow things, as most
XPath implementations (including incomplete ones like REXML's) are
horribly inefficient.

Keep in mind that *every* object value in Ruby uses something like 12
bytes of RAM, so your 400MB XML document is probably also ending up
having a larger footprint than your system RAM and hitting swap, at
which point nothing can save you from, as you put it, "dog-slow"
performance.

Can you be a little more specific about the problems you had when you
were "unable to get stream parsing working"? Event-driven parsing can
be somewhat more complex to implement, but especially with large
datasets offers *huge* performance gains, because it can help avoid the
memory footprint issues I mentioned above.


Newsgroups: comp.lang.ruby
From: "subimage" <subim...@gmail.com>
Date: 30 May 2006 13:55:26 -0700
Local: Tues, May 30 2006 4:55 pm
Subject: Re: XML Parsing Speed - ruby libxml & REXML
I guess that's where I'm going wrong - loading everything into ram.

RE: Stream parsing

I didn't know even where to begin to get it working. It was very
confusing for me, so I just stuck with what worked...I guess I'm more
of a "learn from existing code or tutorial" kind of person.

If this is the way to go I guess I need to spend some more time working
on my parser. For reference, here's my entire parse method using libxml
as it's currently working...

def parse_files
  files = Dir['*.xml']
  # Loop through all files
  for file in files
    puts "Parsing #{file}"
    # Open XML file (this loads the whole document into memory)
    doc = XML::Document.file(file)
    puts "...file opened"
    doc_root = doc.root
    # Get Merchant information
    merchant_name = doc_root.find("//header/merchantName").to_a[0].to_s
    puts "Merchant: #{merchant_name}"
    # Loop through each product in the document
    puts "...finding products"
    doc_root.find("product").each do |product|
      unique_id = product['product_id']

      # Find by unique product id and data source id
      p = Product.find(:first,
                       :conditions => ["data_unique_id = ? AND data_source_id = ?",
                                       unique_id, DATA_SOURCE_ID])
      # If we didn't find a product that matches create a new one
      p = Product.new if !p

      price = product.find("price/sale").to_a[0]

      # Set all object properties
      p.data_unique_id    = unique_id
      p.name              = product['name']
      p.data_source_id    = DATA_SOURCE_ID
      p.link_url          = product.find("URL/product").to_a[0].to_s
      p.image_url         = product.find("URL/productImage").to_a[0].to_s
      p.short_description = product.find("description/short").to_a[0].to_s
      p.long_description  = product.find("description/long").to_a[0].to_s
      p.price             = price.content
      # Relative path: a leading "//" would search the whole document
      # instead of this product
      p.msrp              = product.find("price/retail").to_a[0].to_s
      p.start_date        = price['begin_date']
      p.end_date          = price['end_date']
      # Make sure dates are null if we get nothing
      p.start_date = nil if p.start_date.blank?
      p.end_date = nil if p.end_date.blank?

      puts p.name

      # Set the merchant up
      # Create a new merchant if this one doesn't exist
      merchant = Merchant.find_or_create_by_name(merchant_name)
      p.merchant_id = merchant.id

      puts p.inspect

      begin
        p.save!
      rescue ActiveRecord::RecordInvalid => err
        puts "!!!ERROR - #{err} : #{p.errors.full_messages}"
        puts p.inspect
        next
      end

      # Add categories to the product
      category_1 = product.find("category/primary").to_a[0].to_s
      p.add_category_by_name(category_1.strip) if !category_1.blank?
      category_2 = product.find("category/secondary").to_a[0].to_s
      p.add_category_by_name(category_2.strip) if !category_2.blank?

      puts category_1
      puts category_2

    end # end each product
  end
end


Newsgroups: comp.lang.ruby
From: "subimage" <subim...@gmail.com>
Date: 30 May 2006 22:20:16 -0700
Local: Wed, May 31 2006 1:20 am
Subject: Re: XML Parsing Speed - ruby libxml & REXML
WHOAH!

Ok so I finally dug into the stream parser and this is lightning fast!

Thanks everyone for the advice...this is really sweet.

PS: I learned a lot from the tutorial available here:

http://www.rubyxml.com/articles/REXML/Stream_Parsing_with_REXML

I wrote a BasicStreamListener that throws each item into a hash
complete with pseudo xPaths...

Let me know if anyone would be interested in it, or a tutorial. Might
write something up for my blog as well on the subject, since there
doesn't seem to be a wealth of information out there on the subject.
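
The listener mentioned above was not posted, so the following is only a guess at the general idea: a REXML stream listener that collects each record into a hash keyed by pseudo-XPaths and hands it to a block as soon as the record's closing tag is seen. The class name `RecordListener`, the `@attr` key convention and the sample document are all assumptions made for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects each <record_tag> element into a flat hash and yields it.
class RecordListener
  include REXML::StreamListener

  def initialize(record_tag, &on_record)
    @record_tag = record_tag
    @on_record  = on_record
    @record     = nil   # hash for the record being read, or nil outside one
    @path       = []    # element names below the record element
  end

  def tag_start(name, attrs)
    if @record.nil?
      return unless name == @record_tag
      @record = {}
      @path = []
      attrs.each { |k, v| @record["@#{k}"] = v }
    else
      @path << name
      attrs.each { |k, v| @record["#{@path.join('/')}/@#{k}"] = v }
    end
  end

  def text(data)
    return if @record.nil? || @path.empty? || data.strip.empty?
    key = @path.join('/')
    @record[key] = (@record[key] || '') + data
  end

  def tag_end(name)
    return if @record.nil?
    if @path.empty?
      @on_record.call(@record)   # closing tag of the record itself
      @record = nil
    else
      @path.pop
    end
  end
end

xml = <<~XML
  <catalog>
    <product product_id="1" name="Widget">
      <price><sale begin_date="2006-05-01">9.99</sale><retail>12.00</retail></price>
    </product>
    <product product_id="2" name="Gadget">
      <price><sale>19.99</sale></price>
    </product>
  </catalog>
XML

records = []
listener = RecordListener.new('product') { |record| records << record }
REXML::Document.parse_stream(xml, listener)

records.each { |record| puts record.inspect }
```

Only one record is held at a time, so memory use does not grow with file size; the `records` array exists only for the demonstration, and a real importer would save each hash in the block instead.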


Newsgroups: comp.lang.ruby
From: "Kenosis" <keno...@gmail.com>
Date: 31 May 2006 12:22:27 -0700
Local: Wed, May 31 2006 3:22 pm
Subject: Re: XML Parsing Speed - ruby libxml & REXML
The book "Enterprise Integration with Ruby" has some information on the
use of rexml and stream parsing, as I recall (don't have my copy at
hand at the moment.)

Ken


Newsgroups: comp.lang.ruby
From: "Adam Sanderson" <netgh...@gmail.com>
Date: 5 Jun 2006 09:44:00 -0700
Subject: Re: XML Parsing Speed - ruby libxml & REXML
Yeah, I ran into similar problems earlier using xpath.  Streaming the
xml and plucking out what you need is a little more complicated but it
is more efficient on three counts:

1) It will probably consume less memory since you will likely only
store a small subset of the data
2) You either don't need to build a full DOM tree, or you can build a
very light weight one
3) Parsing and executing xpath expressions takes some time, if you're
doing a ton of them it might have a noticeable effect.

It might be best for people to test it out using xpath and such; if that
works keep it, but if not, you can always fall back on building a
stream parser.  Wish I saw your post earlier ;)

  .adam


Newsgroups: comp.lang.ruby
From: "Mathieu Blondel" <mblon...@gmail.com>
Date: 7 Jun 2006 08:21:16 -0700
Local: Wed, Jun 7 2006 11:21 am
Subject: Re: XML Parsing Speed - ruby libxml & REXML
For large files, stream parsers are faster and have a smaller memory
footprint.

A few months ago, I also tested the expat bindings for ruby which
turned out to be up to 20 times faster than the stream parser provided
by REXML.



Newsgroups: comp.lang.ruby
From: "subimage" <subim...@gmail.com>
Date: 7 Jun 2006 15:53:37 -0700
Local: Wed, Jun 7 2006 6:53 pm
Subject: Re: XML Parsing Speed - ruby libxml & REXML
Got a URL or more info on these bindings? Test code? How to get it
running?


Newsgroups: comp.lang.ruby
From: "Mathieu Blondel" <mblon...@gmail.com>
Date: 8 Jun 2006 04:02:13 -0700
Local: Thurs, Jun 8 2006 7:02 am
Subject: Re: XML Parsing Speed - ruby libxml & REXML
http://www.yoshidam.net/Ruby.html#xmlparser

You will most certainly have to compile the binding yourself.

Look at samples/xmlevent.rb which is shipped with the source code. It
shows how to use the event-based (a la sax) xmlparser.

HTH
Mathieu


