I'm working on a massive Rails site that does heavy data import daily. A lot of this data is in XML files of various sizes, ranging from 100 KB to 400 MB and totaling around 2 GB for all sources. I'd like to keep the entire project in Ruby.
At first, I wrote my parsers using REXML, but found that to be DOG SLOW, especially for the large files. I tried REXML::parse_stream but couldn't find any good documentation for handling parsing that way. It was taking around 30 minutes to an hour to even _open_ the larger files on a P4 1.8 GHz test machine.
After that exercise I switched to libxml, which is a lot speedier, but still slow (no numbers to back it up yet, I can just tell by the speed of data inserts into my DB).
I'm wondering if there's some other lib out there that I'm missing? Can someone point me in the right direction? Is there anything faster I'm missing out on?
Are there any "gotchas" with using libxml that I should be aware of speed-wise?
> Any and all help is much appreciated...thanks!
Since you insert data into a DB: are you absolutely positive that it's the XML parsing part that's slow? Here's what I'd do: use two threads connected with a bounded queue, one thread for reading XML with REXML's stream parser and one thread for inserting into the DB. That way you can use the CPU for parsing XML while your process waits for the DB call to return. If possible, use bulk insertions. Alternatively, write out a CSV file and use the DB's bulk loader to pump the data into the DB. HTH
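As a rough illustration of the CSV route, here is a minimal sketch using Ruby's standard csv library. The rows and column names are invented for the example, and a StringIO stands in for the output file; the resulting file would then be handed to whatever bulk loader the database provides (for example MySQL's LOAD DATA INFILE or PostgreSQL's COPY).

```ruby
require 'csv'
require 'stringio'

# Invented rows standing in for parsed products (assumption, not real data).
rows = [
  ['1', 'Widget', '9.99'],
  ['2', 'Gadget, large', '19.50']
]

# A StringIO keeps the sketch self-contained; for a real import this
# would be something like File.open('products.csv', 'w').
out = StringIO.new
csv = CSV.new(out)

# Header line with hypothetical column names, then one line per row.
csv << %w[data_unique_id name price]
rows.each { |row| csv << row }

puts out.string
```

The CSV library takes care of quoting values that contain commas, which is the main thing that goes wrong with hand-rolled output.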
I definitely _know_ it's the XML parsing that's slow. As mentioned, even opening the file with REXML or libxml takes some time, then finding all of my nodes (and nodes within) is even longer. Could it be because I'm using doc.root.element.find("path") inside of my loop? Anyone know a better way to go about grabbing specific nodes within a document using libxml?
Insertion to the db is simple and quick - although your idea of a bounded queue with 2 threads is interesting. I'll have to look into that (have any example code I might start from?)
Also - I was unable to get stream parsing working properly for REXML so I just gave up and moved to libxml. Do you have any resources on REXML stream parsing you can share? A tutorial or reference? Anything would be helpful.
Everything I've read online says libxml is much faster than REXML, so I thought I made the best choice available.
subimage wrote: > Robert thanks for the response...
> I definitely _know_ it's the XML parsing that's slow. As mentioned, > even opening the file with REXML or libxml takes some time, then > finding all of my nodes (and nodes within) is even longer. Could it be > because I'm using doc.root.element.find("path") inside of my loop?
We would have to see the code. Normally you would use find once on the top level and have an XPath expression in place that selects all the nodes that you need. At the moment I'm not sure whether it's any of the XML libs or the way you use them.
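To make the "find once at the top level" idea concrete, here is a small sketch. It uses REXML rather than libxml only so that it runs without extra gems; the sample document and its element names are invented for illustration.

```ruby
require 'rexml/document'

# Invented sample feed; element and attribute names are assumptions.
xml = <<~XML
  <feed>
    <product product_id="1" name="Widget"><price><sale>9.99</sale></price></product>
    <product product_id="2" name="Gadget"><price><sale>19.50</sale></price></product>
  </feed>
XML

doc = REXML::Document.new(xml)

# One query at the top level selects every product node...
products = REXML::XPath.match(doc, '/feed/product')

# ...and per-product lookups use short relative paths instead of '//'
# searches, which start again from the document root every time.
rows = products.map do |product|
  sale = REXML::XPath.first(product, 'price/sale')
  [product.attributes['product_id'], product.attributes['name'], sale.text]
end

rows.each { |row| puts row.join(' | ') }
```

This still builds the whole tree in memory, so it only addresses the cost of repeated XPath evaluation, not the footprint of a 400 MB document.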
> Anyone know a better way to go about grabbing specific nodes within a > document using libxml?
> Insertion to the db is simple and quick - although your idea of a > bounded queue with 2 threads is interesting. I'll have to look into > that (have any example code I might start from?)
Not handy. But it's fairly simple: you create the queue (see in thread) and then create two threads, one for reading and one for writing.
  require 'thread'

  Q = SizedQueue.new 100

  Thread.new do
    # open file
    # read XML
    # loop
    Q.enq "something"
    # end loop
    # close file
    # signal finish by enqueuing the queue itself as a sentinel:
    Q.enq Q
  end

  # open DB
  until Q == (task = Q.deq)
    # insert task into DB
  end
  # commit TX
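For anyone who wants something to run, here is a self-contained sketch of the same reader/writer arrangement. It is only an illustration under stated assumptions: an array stands in for the database, generated hashes stand in for parsed XML records, and the batch size of 100 is arbitrary.

```ruby
require 'thread'

queue    = SizedQueue.new(100)  # producer blocks once 100 items are waiting
done     = Object.new           # unique sentinel marking end of input
inserted = []                   # stand-in for the database (assumption)

# Producer: in real use this would be the XML reader.
producer = Thread.new do
  1.upto(250) { |i| queue.enq({ id: i, name: "product #{i}" }) }
  queue.enq(done)
end

# Consumer: collects rows into batches so the DB would see bulk inserts.
consumer = Thread.new do
  batch = []
  until (task = queue.deq).equal?(done)
    batch << task
    if batch.size == 100
      inserted.concat(batch)    # replace with a bulk INSERT
      batch = []
    end
  end
  inserted.concat(batch) unless batch.empty?  # flush the final partial batch
end

[producer, consumer].each(&:join)
puts "inserted #{inserted.size} rows"
```

Because the queue is bounded, a fast parser cannot run arbitrarily far ahead of a slow database, so memory use stays flat no matter how large the input file is.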
> Also - I was unable to get stream parsing working properly for REXML so > I just gave up and moved to libxml. Do you have any resources on REXML > stream parsing you can share? A tutorial or reference? Anything would > be helpful.
libxml will definitely be faster, but with either parser you'll want to avoid loading the entire file into RAM -- even using the fastest C++ or Java parsers, wrapping every bit of the XML tree structure in object instances is going to involve a huge amount of overhead. Using XPath to traverse the entire document tree will further slow things, as most XPath implementations (including incomplete ones like REXML's) are horribly inefficient.
Keep in mind that *every* object value in Ruby uses something like 12 bytes of RAM, so your 400MB XML document is probably also ending up having a larger footprint than your system RAM and hitting swap, at which point nothing can save you from, as you put it, "dog-slow" performance.
Can you be a little more specific about the problems you had when you were "unable to get stream parsing working"? Event-driven parsing can be somewhat more complex to implement, but especially with large datasets it offers *huge* performance gains, because it helps avoid the memory footprint issues I mentioned above.
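As a starting point for event-driven parsing with REXML, here is a minimal sketch. The listener class name and the sample XML are invented; REXML::StreamListener and REXML::Document.parse_stream are the library pieces it relies on.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Receives parser events one at a time; no document tree is ever built.
class ProductNameListener
  include REXML::StreamListener  # supplies no-op defaults for unused events

  attr_reader :names

  def initialize
    @names = []
  end

  # Called for each opening tag; attrs is a Hash of attribute name => value.
  def tag_start(name, attrs)
    @names << attrs['name'] if name == 'product'
  end
end

# Invented sample feed (assumption).
xml = <<~XML
  <feed>
    <product product_id="1" name="Widget"/>
    <product product_id="2" name="Gadget"/>
  </feed>
XML

listener = ProductNameListener.new
# For a real import, pass File.open(path) instead of a StringIO.
REXML::Document.parse_stream(StringIO.new(xml), listener)
puts listener.names.inspect
```

The listener only keeps what it explicitly stores, so memory use depends on the data you retain rather than on the size of the file.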
I guess that's where I'm going wrong - loading everything into ram.
RE: Stream parsing
I didn't know even where to begin to get it working. It was very confusing for me, so I just stuck with what worked...I guess I'm more of a "learn from existing code or tutorial" kind of person.
If this is the way to go I guess I need to spend some more time working on my parser. For reference, here's my entire parse method using libxml as it's currently working...
  def parse_files
    files = Dir['*.xml']
    # Loop through all files
    for file in files
      puts "Parsing #{file}"
      # Open XML file
      doc = XML::Document.file(file)
      puts "...file opened"
      doc_root = doc.root
      # Get Merchant information
      merchant_name = doc_root.find("//header/merchantName").to_a[0].to_s
      puts "Merchant: #{merchant_name}"
      # Loop through each product in the document
      puts "...finding products"
      doc_root.find("product").each do |product|
        unique_id = product['product_id']

        # Find by unique product id and data source id
        p = Product.find(:first, :conditions => ["data_unique_id = ? AND data_source_id = ?", unique_id, DATA_SOURCE_ID])
        # If we didn't find a product that matches create a new one
        p = Product.new if !p

        price = product.find("price/sale").to_a[0]

        # Set all object properties
        p.data_unique_id = unique_id
        p.name = product['name']
        p.data_source_id = DATA_SOURCE_ID
        p.link_url = product.find("URL/product").to_a[0].to_s
        p.image_url = product.find("URL/productImage").to_a[0].to_s
        p.short_description = product.find("description/short").to_a[0].to_s
        p.long_description = product.find("description/long").to_a[0].to_s
        p.price = price.content
        p.msrp = product.find("//price/retail").to_a[0].to_s
        p.start_date = price['begin_date']
        p.end_date = price['end_date']
        # Make sure dates are null if we get nothing
        p.start_date = nil if p.start_date.blank?
        p.end_date = nil if p.end_date.blank?

        puts p.name

        # Set the merchant up
        # Create a new merchant if this one doesn't exist
        merchant = Merchant.find_or_create_by_name(merchant_name)
        p.merchant_id = merchant.id

        puts p.inspect

        begin
          p.save!
        rescue ActiveRecord::RecordInvalid => err
          puts "!!!ERROR - #{err} : #{p.errors.full_messages}"
          puts p.inspect
          next
        end

        # Add categories to the product
        category_1 = product.find("category/primary").to_a.to_s
        p.add_category_by_name(category_1.strip) if !category_1.blank?
        category_2 = product.find("category/secondary").to_a.to_s
        p.add_category_by_name(category_2.strip) if !category_2.blank?
      end
    end
  end
I wrote a BasicStreamListener that throws each item into a hash complete with pseudo xPaths...
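The BasicStreamListener itself was not posted, so the following is not that code. It is a hypothetical sketch of one way a listener could collect each record into a hash keyed by pseudo XPaths; the class name RecordListener, the key format, and the sample XML are all assumptions made for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: builds one Hash per record element, keyed by the
# path of each value relative to that element, then hands it to a block.
class RecordListener
  include REXML::StreamListener

  def initialize(record_tag, &on_record)
    @record_tag = record_tag
    @on_record  = on_record
    @path       = []    # element names from the record down to here
    @record     = nil   # nil while outside a record
  end

  def tag_start(name, attrs)
    if @record.nil?
      return unless name == @record_tag
      @record = {}
    else
      @path.push(name)
    end
    # Attributes are stored as "path/@attr" (or "@attr" on the record itself).
    prefix = @path.join('/')
    attrs.each do |key, value|
      @record[prefix.empty? ? "@#{key}" : "#{prefix}/@#{key}"] = value
    end
  end

  def text(value)
    # Ignore text outside records and whitespace between elements.
    return if @record.nil? || @path.empty? || value.strip.empty?
    key = @path.join('/')
    @record[key] = (@record[key] || '') + value
  end

  def tag_end(name)
    return if @record.nil?
    if @path.empty?
      @on_record.call(@record)  # record complete: emit it and forget it
      @record = nil
    else
      @path.pop
    end
  end
end

# Invented sample feed (assumption).
xml = <<~XML
  <feed>
    <product product_id="1" name="Widget">
      <price><sale begin_date="2006-01-01">9.99</sale></price>
      <URL><product>http://example.com/widget</product></URL>
    </product>
  </feed>
XML

records  = []
listener = RecordListener.new('product') { |record| records << record }
REXML::Document.parse_stream(xml, listener)
puts records.first.inspect
```

In a real import the block would save or enqueue each record instead of appending it to an array, so only one product's worth of data is held at a time.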
Let me know if anyone would be interested in it, or a tutorial. Might write something up for my blog as well on the subject, since there doesn't seem to be a wealth of information out there on the subject.
The book "Enterprise Integration with Ruby" has some information on the use of rexml and stream parsing, as I recall (don't have my copy at hand at the moment.)
Yeah, I ran into similar problems earlier using XPath. Streaming the XML and plucking out what you need is a little more complicated, but it is more efficient on three counts:
1) It will probably consume less memory, since you will likely only store a small subset of the data.
2) You either don't need to build a full DOM tree, or you can build a very lightweight one.
3) Parsing and executing XPath expressions takes some time; if you're doing a ton of them it might have a noticeable effect.
It might be best for people to test it out using XPath first: if that works, keep it; if not, you can always fall back on building a stream parser. Wish I saw your post earlier ;)
> I wrote a BasicStreamListener that throws each item into a hash > complete with pseudo xPaths...
> Let me know if anyone would be interested in it, or a tutorial. Might > write something up for my blog as well on the subject, since there > doesn't seem to be a wealth of information out there on the subject.
Mathieu Blondel wrote: > For large file, stream parsers are faster and have a smaller memory > footprint.
> A few months ago, I also tested the expat bindings for ruby which > turned out to be up to 20 times faster than the stream parser provided > by REXML.
> subimage wrote:
> > WHOAH!
> > Ok so I finally dug into the stream parser and this is lightning fast!
> > Thanks everyone for the advice...this is really sweet.
> > PS: I learned a lot from the tutorial available here:
> > I wrote a BasicStreamListener that throws each item into a hash > > complete with pseudo xPaths...
> > Let me know if anyone would be interested in it, or a tutorial. Might > > write something up for my blog as well on the subject, since there > > doesn't seem to be a wealth of information out there on the subject.
> Got a URL or more info on these bindings? Test code? How to get it > running?