RE: [VuFind-Tech] solrmarc import speed


Demian Katz

Feb 24, 2012, 7:29:47 AM
to Tod Olson, vufin...@lists.sourceforge.net, solrma...@googlegroups.com
I'm copying this message to solrmarc-tech, since you'll probably get additional suggestions from there.

Also, you might want to look at this thread -- it's a few years old but probably still relevant:

http://sourceforge.net/mailarchive/message.php?msg_id=21044664

...and here's another one:

http://groups.google.com/group/solrmarc-tech/browse_thread/thread/fe329385bb1dc953

One thing that's particularly worth experimenting with (if you haven't already) is comparing performance between direct index writing and writing over HTTP. If you edit import.properties and change your solr.path value to "REMOTE", then SolrMarc will post updates to the solr.hosturl URL rather than writing them directly to the index. If you can split up your MARC file into chunks, you can run multiple instances of SolrMarc in parallel using the HTTP writing method, and that might help speed things up.
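
For the splitting step, here is a minimal sketch in Python. It assumes the input is binary MARC (ISO 2709), where every record ends with the 0x1D record terminator byte. The function names are hypothetical and are not part of SolrMarc or VuFind; treat this as one possible approach rather than a tested tool.

```python
import io
from typing import BinaryIO, Iterator, List

# End-of-record byte in binary MARC (ISO 2709).
RECORD_TERMINATOR = b"\x1d"


def iter_marc_records(stream: BinaryIO, block_size: int = 1 << 16) -> Iterator[bytes]:
    """Yield raw MARC records (terminator included) from a binary stream."""
    buffer = b""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        buffer += block
        # Everything before the last terminator is a complete record;
        # the remainder stays in the buffer until more data arrives.
        *complete, buffer = buffer.split(RECORD_TERMINATOR)
        for record in complete:
            yield record + RECORD_TERMINATOR
    if buffer.strip():
        # Trailing bytes without a terminator are passed through unchanged.
        yield buffer


def split_marc(stream: BinaryIO, records_per_chunk: int) -> Iterator[List[bytes]]:
    """Group records into lists of at most records_per_chunk records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk: List[bytes] = []
    for record in iter_marc_records(stream):
        chunk.append(record)
        if len(chunk) == records_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


if __name__ == "__main__":
    # Tiny in-memory demonstration with placeholder bytes, not real MARC data.
    sample = io.BytesIO(b"rec1\x1drec2\x1drec3\x1d")
    for number, chunk in enumerate(split_marc(sample, 2), start=1):
        print(number, len(chunk), "records")
```

Each chunk could then be written to its own file and handed to a separate SolrMarc instance configured with solr.path set to "REMOTE". How many instances are worth running at once will depend on the CPUs available.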

- Demian
________________________________________
From: Tod Olson [t...@uchicago.edu]
Sent: Thursday, February 23, 2012 10:08 PM
To: vufin...@lists.sourceforge.net
Subject: [VuFind-Tech] solrmarc import speed

Well, the inevitable question of how to speed up solrmarc imports is coming up. Some guidance about what to look for would be welcome.

The test system is a VM with 2 CPUs and 49GB RAM (~4GB free), running Ubuntu 10. In our first full import (6 million records), one of our later files of about a million records took 9 hours to add. Not production speed, but enough to test. Now that we have a full index and are re-importing the records, we only imported about 370K records in the first 6 hours. It looks to me like we are CPU bound; there may be a single thread in solrmarc that is the bottleneck. solrconfig.xml is the default from the VuFind distro: mergeFactor=10, that sort of thing.

Behavior-wise, we also notice that records will chug along for a while, and then there will be a big pause with no feedback. I assume this is when Solr is merging segments.

I know a few of you are indexing several million records, so I figure I'll start here. What were your first steps in speeding up indexing, and what kinds of metrics were useful to you?

Thanks for any advice or pointers.

-Tod


Tod Olson <t...@uchicago.edu>
Systems Librarian
University of Chicago Library

Tod Olson

Feb 24, 2012, 11:16:34 AM
to Demian Katz, Tod Olson, vufin...@lists.sourceforge.net, solrma...@googlegroups.com
Thanks! We'll look at parallel indexing and the Solr merge settings. And also the GC behavior.

-Tod

Tod Olson

Feb 29, 2012, 4:40:56 PM
to Demian Katz, Tod Olson, vufin...@lists.sourceforge.net, solrma...@googlegroups.com
What we discovered in this instance was that the VM apparently had too much RAM for its own good. Reducing the RAM allocated to the VM has sped things up to a decent rate, about 1 million records per hour.

We will be looking further at the various suggestions (parallelizing the input, tweaking ramBufferSizeMB and such) to see how far we can reduce the total time, but at least now the import times are sane.
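
For a sense of scale, the rates reported in this thread can be compared with a few lines of Python. This is only back-of-the-envelope arithmetic: the record counts are the approximate figures quoted above, and projecting each rate across all 6 million records assumes the rate stays constant, which the earlier observations suggest it may not.

```python
def records_per_hour(records: int, hours: float) -> float:
    """Average throughput over an observed interval."""
    return records / hours


def hours_for(total_records: int, rate_per_hour: float) -> float:
    """Projected duration at a constant rate (an assumption, see above)."""
    return total_records / rate_per_hour


TOTAL = 6_000_000  # size of the full import mentioned in the thread

observations = {
    "initial import (1M in 9 h)": records_per_hour(1_000_000, 9),
    "re-import (370K in 6 h)": records_per_hour(370_000, 6),
    "after reducing VM RAM": 1_000_000.0,
}

for label, rate in observations.items():
    print(f"{label}: {rate:,.0f} records/h, "
          f"~{hours_for(TOTAL, rate):.0f} h for {TOTAL:,} records")
```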

Thanks everyone for the responses.

-Tod

