I've done a little profiling of solrmarc lately, tracking down a stupid bug in one of my own custom methods. Once I fixed that, I found that after a few hundred-thousand records any processing being done by either me or solrmarc core code is totally swallowed up by the cost of adding to and merging indexes. More than 90% of the time on the profile I had on my desk today was spent just letting lucene do its thing. I'm pretty sure the bottleneck is merging, since otherwise we wouldn't see differences in throughput (records/second) based on the size of the input file, right? Or am I thinking about this wrong?
Optimization of the underlying lucene processing (which I know absolutely nothing about) is, I think, where to focus. The time taken by everything else is just noise as near as I can tell. I'd love to find out if anyone else has done any profiling, because our 6-million record indexing process right now takes about 10 hours on kickass hardware and I'd love to shorten that so I can do a full index more often.
-Bill-
--
Bill Dueber
Library Systems Programmer
University of Michigan Library
> I've done a little profiling of solrmarc lately, tracking down a stupid
> bug in one of my own custom methods. Once I fixed that, I found that
> after a few hundred-thousand records any processing being done by either
> me or solrmarc core code is totally swallowed up by the cost of adding
> to and merging indexes.
I noticed the same. I split large record files into smaller pieces and
did <optimize/> after 1 million records each. That helped a bit.
There are several options in solrconfig.xml that affect merging. I
played with them (without really understanding what I am doing), but
without traceable effect. Someone who has optimized these values and can
explain there effects?
> Optimization of the underlying lucene processing (which I know
> absolutely nothing about) is, I think, where to focus.
Let's see how that's going to be with Solr 1.4 final and Lucene 2.9.0/1
final (whatever Solr will use...). I used randomly selected Solr nightly
builds in the last months, and feel(!) the bottleneck wasn't constant
(bit I did no profiling or other systematic evaluation).
Till
--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kins...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
One thing that we found had a huge affect on indexing time was our
decision to really make our index as lean as possible. Eric Larson and I
took our original Solr schema from our first prototype and cleaned up
all the sloppy places. I think we went from about 6 hours to about 3.5
hours.
Most of the trimming involved taking a hard look at what needs to be
indexed vs. what needs to be stored or whether end users need all the
nuances of MARC fields. E.g., does any patron ever search by title added
entry or should a 'smart' title search translate that our our end?
Single record display is another story, but we store the full raw MARC.
Our first version had 78 fields and the SOLR_HOME/data/index directory
had about 25GB. The new numbers are 34 fields and about 16GB.
Our challenge is to see whether the scaled back index can do everything
our librarians expect it to do when people start taking a closer look at
it...
-Steve
Bill Dueber wrote:I've done a little profiling of solrmarc lately, tracking down a stupidbug in one of my own custom methods. Once I fixed that, I found thatafter a few hundred-thousand records any processing being done by eitherme or solrmarc core code is totally swallowed up by the cost of adding
to and merging indexes. More than 90% of the time on the profile I had
on my desk today was spent just letting lucene do its thing. I'm prettysure the bottleneck is merging, since otherwise we wouldn't seedifferences in throughput (records/second) based on the size of theinput file, right? Or am I thinking about this wrong?
Optimization of the underlying lucene processing (which I know
absolutely nothing about) is, I think, where to focus. The time taken by
everything else is just noise as near as I can tell. I'd love to findout if anyone else has done any profiling, because our 6-million recordindexing process right now takes about 10 hours on kickass hardware andI'd love to shorten that so I can do a full index more often.-Bill-On Thu, Oct 29, 2009 at 5:23 PM, Naomi Dushay <ndu...@stanford.edu
608-265-2844 (ph)"Just don't let the human factor fail to be a factor at all."- Andrew Bird, "Tables and Chairs"
--Bill DueberLibrary Systems ProgrammerUniversity of Michigan Library
> Another quick data point: When indexing ~50K records with solrmarc2 on
> really good hardware, I get about 200 records/second throughput.
Hmmm, that's slow compared to my values, though indexing speed depends
on data and index structure, of course.
I started reindexing our records Friday evening and now have some data
to add, details see below. Our data is in collection specific files
holding between 1000 and about 2000000 records each, altogether about
21000000 records. Data below is in the order the record files were
indexed. Times include a <commit/> at the end.
I am using a self build solrmarc with some customized Indexer methods
based on a SVN checkout of last Friday. And a Solr nightly build of
August, 25th (1.3.0.2009.08.25.08.05.53). So all below may be outdated
by newer Solr/Lucene versions. Let's see what Solr 1.4.0 final will bring...
You can see that indexing speed drops significantly at some point,
recovers a bit but never again to values I had with a small index.
Yesterday I started indexing the "slow package" holding 2 million
records again on top of the existing index (holding some 18 million
documents at that time). This time with jrat as Java profiler in the
chain. I am not experienced in Java profiling, so the following is just
my naive observation:
Indexing went painfully slow (Jrat itself slows it down significantly)
at about 2 records/s, so i stopped after
INFO [main] (MarcImporter.java:504) - Finished indexing in 2404:51,00
INFO [main] (MarcImporter.java:513) - Indexed 334995 at a rate of
about 2.0 per sec
The classes that "wasted" most time were:
org.apache.index.lucene.IndexInput: 21.4% (methods readVInt() 11,7% and
ReadVLong() 9,5%)
org.apache.index.lucene.BufferedIndexInput: 15.1% (method readByte() 11,3%)
org.apache.index.lucene.IndexWriter: 9.9% (method doWait() 9,9%).
I haven't looked at Lucene Sources, but doWait() in
org.apache.index.lucene.IndexWriter, that got 9,9% of total processing
time sounds like a place to look at. That spends a lot of time waiting
for something...
I will rerun that when Solr 1.4 final is there.
Till
Solrmarc's log output summarizing indexing speed in consecutive order
with some notes:
INFO [main] (MarcImporter.java:504) - Finished indexing in 1:42,00
INFO [main] (MarcImporter.java:513) - Indexed 59196 at a rate of about
575.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 5:18,00
INFO [main] (MarcImporter.java:513) - Indexed 154265 at a rate of
about 483.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:32,00
INFO [main] (MarcImporter.java:513) - Indexed 21811 at a rate of about
669.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 3:13,00
INFO [main] (MarcImporter.java:513) - Indexed 96234 at a rate of about
498.0 per sec
[this is special, because for every record there is an XML document
pulled in from the file system that contains a fulltext of about 10kB to
20kB and that is indexed together with the metadata record]
INFO [main] (MarcImporter.java:504) - Finished indexing in 103:13,00
INFO [main] (MarcImporter.java:513) - Indexed 387375 at a rate of
about 62.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:04,00
INFO [main] (MarcImporter.java:513) - Indexed 1001 at a rate of about
246.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:56,00
INFO [main] (MarcImporter.java:513) - Indexed 29940 at a rate of about
532.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:04,00
INFO [main] (MarcImporter.java:513) - Indexed 2380 at a rate of about
510.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 4:30,00
INFO [main] (MarcImporter.java:513) - Indexed 136209 at a rate of
about 503.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:02,00
INFO [main] (MarcImporter.java:513) - Indexed 890 at a rate of about
319.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 1:08,00
INFO [main] (MarcImporter.java:513) - Indexed 36305 at a rate of about
527.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:16,00
INFO [main] (MarcImporter.java:513) - Indexed 4092 at a rate of about
250.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:03,00
INFO [main] (MarcImporter.java:513) - Indexed 910 at a rate of about
235.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:01,00
INFO [main] (MarcImporter.java:513) - Indexed 157 at a rate of about
121.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 16:38,00
INFO [main] (MarcImporter.java:513) - Indexed 494822 at a rate of
about 495.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 36:13,00
INFO [main] (MarcImporter.java:513) - Indexed 2000000 at a rate of
about 919.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 29:33,00
INFO [main] (MarcImporter.java:513) - Indexed 2000000 at a rate of
about 1127.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 31:05,00
INFO [main] (MarcImporter.java:513) - Indexed 1999999 at a rate of
about 1071.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 59:50,00
INFO [main] (MarcImporter.java:513) - Indexed 2000000 at a rate of
about 557.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 31:53,00
INFO [main] (MarcImporter.java:513) - Indexed 1999998 at a rate of
about 1044.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 101:10,00
INFO [main] (MarcImporter.java:513) - Indexed 1907040 at a rate of
about 314.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 92:57,00
INFO [main] (MarcImporter.java:513) - Indexed 2000000 at a rate of
about 358.0 per sec
[Hmmm, what's that? Data is about the same as in the 2000000 records
packages above, but indexing went painfully slow, so I finally
interrupted (there are 1,3 million records more in that file...)]
INFO [main] (MarcImporter.java:504) - Finished indexing in 431:14,00
INFO [main] (MarcImporter.java:513) - Indexed 695932 at a rate of
about 26.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 4:56,00
INFO [main] (MarcImporter.java:513) - Indexed 44814 at a rate of about
151.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 5:45,00
INFO [main] (MarcImporter.java:513) - Indexed 51994 at a rate of about
150.0 per sec
[again: interrupted manually because indexing went painfully slow]
INFO [main] (MarcImporter.java:504) - Finished indexing in 351:55,00
INFO [main] (MarcImporter.java:513) - Indexed 1044175 at a rate of
about 49.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 198:55,00
INFO [main] (MarcImporter.java:513) - Indexed 1584169 at a rate of
about 132.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 0:33,00
INFO [main] (MarcImporter.java:513) - Indexed 917 at a rate of about
27.0 per sec
INFO [main] (MarcImporter.java:504) - Finished indexing in 8:49,00
INFO [main] (MarcImporter.java:513) - Indexed 23889 at a rate of about
45.0 per sec