Different configs for indexing millions of docs (and rsolr-direct vs. normal rsolr/jetty setup)

chrisfinne

Jun 3, 2010, 8:29:10 PM
to rsolr

I've been doing a lot of importing recently as I'm trying to find the
fastest way to get my 22 million records into Solr. I've used
DataImportHandler and it is of course by far the fastest, but I need
to massage some of the data in Ruby first, so DIH is out.
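
A minimal sketch of that Ruby-side massaging, assuming an
ActiveRecord-style source (Record, find_in_batches) and made-up
field names:

require 'rubygems'
require 'rsolr'

solr = RSolr.connect :url => 'http://localhost:8983/solr'

# Pull records in batches, massage them in Ruby, then hand the
# resulting hashes to rsolr.
Record.find_in_batches(:batch_size => 1000) do |batch|
  docs = batch.map do |r|
    { :id      => r.id,
      :title_t => r.title.to_s.strip,             # example cleanup step
      :body_t  => r.body.to_s.gsub(/\s+/, ' ') }  # example cleanup step
  end
  solr.add docs   # rsolr accepts an array of document hashes
end
solr.commit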

These are rough numbers, but I thought you might be interested.

On Snow Leopard, JRuby 1.5.0rc3 vs. REE 1.8.7:

- For basic Ruby work (generating the hashes to send to Solr), JRuby
is roughly 20-25% faster.
- rsolr-direct vs. normal rsolr-ext against a locally running Jetty:
direct is 5x to 10x faster.

But this is just a single-process, single-threaded test, so not very
real-world in most cases.

There are so many factors that determine where the bottleneck is and
what the final performance will be:
- jruby vs ree (haven't tried ruby 1.9)
- rsolr-direct vs. rsolr=>jetty
- single solr core vs. multi-core
- single CPU core vs multiple processors/cores
- threads vs processes
- size of the doc being indexed
- fields stored vs. not stored

Here are just a few of the permutations that I've tried:
- Single Solr Core, JRuby, rsolr-direct, multiple JRuby threads with
a mutex around rsolr-direct writes (see the sketch after this list)
- Single Solr Core, multiple REE processes, rsolr-ext, solr on jetty
- Multiple Solr Cores, multiple JRuby processes using rsolr-direct,
then doing a merge of the Solr cores
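
A rough sketch of the first permutation, assuming rsolr-direct's
:direct connection style under JRuby (the :solr_home path and the
build_doc helper are placeholders):

require 'rubygems'
require 'rsolr-direct'   # JRuby only: talks to an embedded Solr core
require 'thread'

solr  = RSolr.connect :direct, :solr_home => '/path/to/solr/home'
mutex = Mutex.new
queue = Queue.new
records.each { |r| queue << r }   # 'records' is a placeholder source

threads = Array.new(4) do
  Thread.new do
    while (record = (queue.pop(true) rescue nil))
      doc = build_doc(record)               # hypothetical massaging helper
      mutex.synchronize { solr.add doc }    # serialize embedded-core writes
    end
  end
end
threads.each(&:join)
solr.commit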

The last permutation (multiple Solr cores, then a merge) is what I'm
starting to lean towards, because it makes the best use of the
multiple cores on my dev box and will likely do the best on a super
big EC2 instance. My Mac has 8GB of DDR3 RAM and a 2.9GHz quad-core
processor, and CPU seems to be the limiting factor: with 3 JRuby
processes the CPU was only at about 50%; with 7 JRuby processes it's
running at about 90%.
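
For the merge step, one option is the CoreAdmin mergeindexes command
(available since Solr 1.4); the core names and index paths below are
placeholders:

require 'net/http'
require 'cgi'

# Merge the per-process cores' indexes into core0. The target core
# must not receive writes while the merge runs, and needs a commit
# afterwards to make the merged docs visible.
dirs  = %w[core1 core2 core3].map { |c| "/path/to/solr/#{c}/data/index" }
query = 'action=mergeindexes&core=core0' +
        dirs.map { |d| "&indexDir=#{CGI.escape(d)}" }.join
res = Net::HTTP.get_response('localhost', "/solr/admin/cores?#{query}", 8983)
puts res.body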

Shairon Toledo

Jun 4, 2010, 8:09:52 AM
to rs...@googlegroups.com
There are some things you can do on the Solr side to improve indexing
speed. One of them is to set mergeFactor to ~30, which avoids
unnecessary segment merges in Lucene; after the indexing you can roll
it back to 10 and run the optimize command on your core.
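
In solrconfig.xml that would look something like this (in Solr 1.4
the setting lives under <indexDefaults>/<mainIndex>; 30 is just the
bulk-load value suggested above):

<indexDefaults>
  <!-- higher mergeFactor = fewer segment merges during bulk load;
       roll back to 10 and run optimize when the import is done -->
  <mergeFactor>30</mergeFactor>
</indexDefaults>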

Experience shows that it's better to index a batch of documents per
request than a single document per request. People generally send
something like:

<add>
<doc>...</doc>
</add>

I recommend sending batches of more than 10 documents in the same
request, with the parameter commit=true in the URL (don't optimize
the core yet):

<add>
<doc>...</doc>
<doc>...</doc>
<doc>...</doc>
...
</add>

That reduces the cost of opening the index files for writing, XML
parsing, and so on.
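
In rsolr terms, batching just means passing an array of document
hashes to add, so one HTTP request carries the whole multi-<doc>
<add> block (field names made up):

require 'rubygems'
require 'rsolr'

solr = RSolr.connect :url => 'http://localhost:8983/solr'
docs = (1..500).map { |i| { :id => i, :title_t => "doc #{i}" } }
solr.add docs   # one request, many <doc> elements
solr.commit
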
On the JVM side, increase the heap using the familiar -Xms/-Xmx
parameters.

Good luck,


--
[ ]'s
Shairon Toledo
http://www.google.com/profiles/shairon.toledo

chrisfinne

Jun 4, 2010, 9:58:49 AM
to rsolr
mergeFactor looks like a great tip. I'll have to implement that one.

When doing a bulk import, you shouldn't commit until all the
documents are imported, as each commit carries a lot of overhead.
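
A minimal sketch of that shape, assuming the same rsolr setup as
earlier in the thread (the batch size is arbitrary):

require 'rubygems'
require 'rsolr'

solr = RSolr.connect :url => 'http://localhost:8983/solr'

all_docs.each_slice(1000) do |batch|   # 'all_docs' is a placeholder
  solr.add batch   # no commit inside the loop
end
solr.commit        # a single commit once everything is in
solr.optimize      # then optimize, per the mergeFactor tip above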