Re: [Blacklight-development] MARC record indexing benchmarks


Naomi Dushay

Oct 29, 2009, 5:23:27 PM
to blacklight-...@googlegroups.com, solrma...@googlegroups.com
Steve,

This is great!

1. Did you do any profiling? Neither Bob nor I have done this yet.

2. What machine and what JVM settings for solrmarc?

3. Our export from Sirsi takes forever for 6 million records (8-12
hours ... I've forgotten ... done by a colleague). What is your ILS?

I've been pondering the benefit/cost ratio of solrmarc staying neutral
as to Solr version. That bloats the code and probably increases
execution time, but I haven't tested it.

And yes, we'd love you to be more vocal! Don't neglect the
solrmarc-tech list -- that way VuFind folks and others can benefit.

It would be great if you just posted a couple-sentence "we did" or
"we're about to do" email if you don't want to do a huge write-up.

- Naomi


On Oct 29, 2009, at 2:09 PM, Stephen Meyer wrote:

>
> We have been too quiet about what we are up to regarding our
> Blacklight
> project here at UW-Madison. We are trying to move our prototype
> towards
> a beta release and our ears perked up during the indexing discussions
> about a month ago. We happened to be working through the same issue at
> the time.
>
> We ran a couple of trials to look at various approaches and are trying
> to create some internal documentation that could be shared more
> broadly:
>
> http://sdg.library.wisc.edu/blog/2009/10/29/marc-record-indexing-benchmarks/
>
> -Steve
> --
> Stephen Meyer
> Library Application Developer
> UW-Madison Libraries
> 312F Memorial Library
> 728 State St.
> Madison, WI 53706
>
> sme...@library.wisc.edu
> 608-265-2844 (ph)
>
>
> "Just don't let the human factor fail to be a factor at all."
> - Andrew Bird, "Tables and Chairs"
>

Bill Dueber

Oct 29, 2009, 10:51:46 PM
to solrma...@googlegroups.com, blacklight-...@googlegroups.com
Another quick data point: When indexing ~50K records with solrmarc2 on really good hardware, I get about 200 records/second throughput. Indexing a file with those records plus 6.3 million others with the exact same setup gives me a final throughput of 69 records/second.

Again, it seems to me that the lesson is to not worry so much about the speed of anything but the actual indexing/merging. All the screwing around with the MARC record (or whatever your input) isn't likely to have much of an effect, percentage-wise, unless you're doing something incredibly intensive.

  -Bill-


On Thu, Oct 29, 2009 at 8:38 PM, Bill Dueber <bi...@dueber.com> wrote:
I've done a little profiling of solrmarc lately, tracking down a stupid bug in one of my own custom methods.  Once I fixed that, I found that after a few hundred-thousand records any processing being done by either me or solrmarc core code is totally swallowed up by the cost of adding to and merging indexes. More than 90% of the time on the profile I had on my desk today was spent just letting lucene do its thing. I'm pretty sure the bottleneck is merging, since otherwise we wouldn't see differences in throughput (records/second) based on the size of the input file, right? Or am I thinking about this wrong?

Optimization of the underlying lucene processing (which I know absolutely nothing about) is, I think, where to focus. The time taken by everything else is just noise as near as I can tell. I'd love to find out if anyone else has done any profiling, because our 6-million record indexing process right now takes about 10 hours on kickass hardware and I'd love to shorten that so I can do a full index more often.

 -Bill-
--
Bill Dueber
Library Systems Programmer
University of Michigan Library




Till Kinstler

Oct 30, 2009, 11:02:01 AM
to blacklight-...@googlegroups.com, solrma...@googlegroups.com
Bill Dueber wrote:

> I've done a little profiling of solrmarc lately, tracking down a stupid
> bug in one of my own custom methods. Once I fixed that, I found that
> after a few hundred-thousand records any processing being done by either
> me or solrmarc core code is totally swallowed up by the cost of adding
> to and merging indexes.

I noticed the same. I split large record files into smaller pieces and
did an <optimize/> after each 1 million records. That helped a bit.
There are several options in solrconfig.xml that affect merging. I
played with them (without really understanding what I was doing), but
without any traceable effect. Has anyone tuned these values and can
explain their effects?
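
If anyone wants to script the same chunked approach, something like this
minimal sketch works between chunks (plain Java, assuming a default
local Solr 1.x install; the URL is an assumption, adjust for your setup):

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  // Sketch: POST <optimize/> to the Solr update handler after a chunk
  // of records has been indexed. Solr blocks until the optimize is done.
  public class SolrOptimize {
      public static void main(String[] args) throws Exception {
          URL url = new URL("http://localhost:8983/solr/update");
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestMethod("POST");
          conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
          conn.setDoOutput(true);
          OutputStream out = conn.getOutputStream();
          out.write("<optimize/>".getBytes("UTF-8"));
          out.close();
          System.out.println("HTTP " + conn.getResponseCode());
      }
  }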

> Optimization of the underlying lucene processing (which I know
> absolutely nothing about) is, I think, where to focus.

Let's see how that goes with Solr 1.4 final and Lucene 2.9.0/2.9.1
final (whichever Solr ends up using...). I have used randomly selected
Solr nightly builds over the last few months, and my feeling(!) is that
the bottleneck wasn't constant (but I did no profiling or other
systematic evaluation).

Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kins...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de

Naomi Dushay

Oct 30, 2009, 1:13:45 PM
to blacklight-...@googlegroups.com, solrma...@googlegroups.com
Steve,

I'd love to see your schema.xml and solrconfig.xml.

I'm parsimonious about what we store in our index, and draw the same hard line you do between what is for search result display and what is for item display (except when the item display field requires a very complex algorithm).



From Bill's note, I suspect we could bump up our ramBufferSizeMB:



  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>20</mergeFactor>
    <!--
      If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first.
    -->
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
    <!-- Tell Lucene when to flush documents to disk.
      Giving Lucene more memory for indexing means faster indexing at the cost of more RAM
    -->
    <ramBufferSizeMB>10240</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>   
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>10240</ramBufferSizeMB>
    <mergeFactor>20</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
    <deletionPolicy class="solr.SolrDeletionPolicy">
      <str name="keepOptimizedOnly">false</str>
      <str name="maxCommitsToKeep">1</str>
    </deletionPolicy>
  </mainIndex>

- Naomi

On Oct 30, 2009, at 7:08 AM, Stephen Meyer wrote:


One thing that we found had a huge effect on indexing time was our
decision to make our index as lean as possible. Eric Larson and I
took our original Solr schema from our first prototype and cleaned up
all the sloppy places. I think we went from about 6 hours to about 3.5
hours.

Most of the trimming involved taking a hard look at what needs to be
indexed vs. what needs to be stored, and whether end users need all the
nuances of MARC fields. E.g., does any patron ever search by title added
entry, or should a 'smart' title search handle that on our end?
Single-record display is another story, but there we store the full raw
MARC.
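
In schema.xml terms the split looks roughly like this (a sketch with
made-up field names, not our actual schema):

  <!-- searched but never displayed: indexed, not stored -->
  <field name="title_added_entry" type="text" indexed="true" stored="false"/>
  <!-- shown only on the single-record page: stored, not indexed -->
  <field name="marc_display" type="string" indexed="false" stored="true"/>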

Our first version had 78 fields and the SOLR_HOME/data/index directory
was about 25GB. The new numbers are 34 fields and about 16GB.

Our challenge is to see whether the scaled-back index can do everything
our librarians expect it to do when people start taking a closer look at
it...

-Steve


Bill Dueber wrote:
I've done a little profiling of solrmarc lately, tracking down a stupid
bug in one of my own custom methods.  Once I fixed that, I found that
after a few hundred-thousand records any processing being done by either
me or solrmarc core code is totally swallowed up by the cost of adding
to and merging indexes. More than 90% of the time on the profile I had
on my desk today was spent just letting lucene do its thing. I'm pretty
sure the bottleneck is merging, since otherwise we wouldn't see
differences in throughput (records/second) based on the size of the
input file, right? Or am I thinking about this wrong?

Optimization of the underlying lucene processing (which I know
absolutely nothing about) is, I think, where to focus. The time taken by
everything else is just noise as near as I can tell. I'd love to find
out if anyone else has done any profiling, because our 6-million record
indexing process right now takes about 10 hours on kickass hardware and
I'd love to shorten that so I can do a full index more often.

-Bill-






Stephen Meyer

Oct 30, 2009, 4:59:06 PM
to solrma...@googlegroups.com, blacklight-...@googlegroups.com
Our schema is here:

http://sdg.library.wisc.edu/projects/forward/trunk/vendor/apache-solr/solr/conf/schema.xml

Our solrconfig is out of the box, plus the Blacklight document handler.

Thanks for the Lucene wiki article; we had (lazily) poked around for
that info without success.

-sm

Till Kinstler

Nov 2, 2009, 10:26:23 AM
to solrma...@googlegroups.com
Bill Dueber wrote:

> Another quick data point: When indexing ~50K records with solrmarc2 on
> really good hardware, I get about 200 records/second throughput.

Hmmm, that's slow compared to my numbers, though indexing speed of
course depends on the data and the index structure.

I started reindexing our records Friday evening and now have some data
to add; details below. Our data is in collection-specific files holding
between 1,000 and about 2,000,000 records each, about 21,000,000 records
altogether. The data below is in the order the record files were
indexed. Times include a <commit/> at the end.

I am using a self-built solrmarc with some customized Indexer methods,
based on an SVN checkout from last Friday, and a Solr nightly build from
August 25th (1.3.0.2009.08.25.08.05.53). So everything below may be
outdated by newer Solr/Lucene versions. Let's see what Solr 1.4.0 final
will bring...

You can see that indexing speed drops significantly at some point and
recovers a bit, but never returns to the values I had with a small index.

Yesterday I started indexing the "slow package" holding 2 million
records again, on top of the existing index (holding some 18 million
documents at that time), this time with jrat as a Java profiler in the
chain. I am not experienced in Java profiling, so the following is just
my naive observation:
Indexing went painfully slow (jrat itself slows things down
significantly), at about 2 records/s, so I stopped after

INFO [main] (MarcImporter.java:504) - Finished indexing in 2404:51,00
INFO [main] (MarcImporter.java:513) - Indexed 334995 at a rate of
about 2.0 per sec

The classes that "wasted" the most time were:
org.apache.lucene.store.IndexInput: 21.4% (methods readVInt() 11.7% and
readVLong() 9.5%)
org.apache.lucene.store.BufferedIndexInput: 15.1% (method readByte() 11.3%)
org.apache.lucene.index.IndexWriter: 9.9% (method doWait() 9.9%).

I haven't looked at the Lucene sources, but doWait() in
org.apache.lucene.index.IndexWriter, which got 9.9% of total processing
time, sounds like a place to look. It spends a lot of time waiting for
something...
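
For anyone who wants to experiment with the merge-related knobs outside
of Solr, here is a minimal standalone sketch against the Lucene 2.9 API
(plain Lucene, not what solrmarc actually does; the values are just
guesses to play with):

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.ConcurrentMergeScheduler;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class MergeTuning {
      public static void main(String[] args) throws Exception {
          IndexWriter writer = new IndexWriter(
              FSDirectory.open(new File("/tmp/test-index")),
              new StandardAnalyzer(Version.LUCENE_29),
              IndexWriter.MaxFieldLength.UNLIMITED);
          writer.setRAMBufferSizeMB(256);  // flush to disk less often
          writer.setMergeFactor(20);       // tolerate more segments before merging
          ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
          cms.setMaxThreadCount(3);        // run more merges concurrently
          writer.setMergeScheduler(cms);
          // ... addDocument() calls would go here ...
          writer.close();
      }
  }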

I will rerun that when Solr 1.4 final is there.

Till

Solrmarc's log output summarizing indexing speed in consecutive order
with some notes:

INFO [main] (MarcImporter.java:504) - Finished indexing in 1:42,00
INFO [main] (MarcImporter.java:513) - Indexed 59196 at a rate of about
575.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 5:18,00
INFO [main] (MarcImporter.java:513) - Indexed 154265 at a rate of
about 483.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:32,00
INFO [main] (MarcImporter.java:513) - Indexed 21811 at a rate of about
669.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 3:13,00
INFO [main] (MarcImporter.java:513) - Indexed 96234 at a rate of about
498.0 per sec

[this is special, because for every record an XML document containing
about 10-20kB of full text is pulled in from the file system and indexed
together with the metadata record]
INFO [main] (MarcImporter.java:504) - Finished indexing in 103:13,00
INFO [main] (MarcImporter.java:513) - Indexed 387375 at a rate of
about 62.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:04,00
INFO [main] (MarcImporter.java:513) - Indexed 1001 at a rate of about
246.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:56,00
INFO [main] (MarcImporter.java:513) - Indexed 29940 at a rate of about
532.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:04,00
INFO [main] (MarcImporter.java:513) - Indexed 2380 at a rate of about
510.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 4:30,00
INFO [main] (MarcImporter.java:513) - Indexed 136209 at a rate of
about 503.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:02,00
INFO [main] (MarcImporter.java:513) - Indexed 890 at a rate of about
319.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 1:08,00
INFO [main] (MarcImporter.java:513) - Indexed 36305 at a rate of about
527.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:16,00
INFO [main] (MarcImporter.java:513) - Indexed 4092 at a rate of about
250.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:03,00
INFO [main] (MarcImporter.java:513) - Indexed 910 at a rate of about
235.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:01,00
INFO [main] (MarcImporter.java:513) - Indexed 157 at a rate of about
121.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 16:38,00
INFO [main] (MarcImporter.java:513) - Indexed 494822 at a rate of
about 495.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 36:13,00
INFO [main] (MarcImporter.java:513) - Indexed 2000000 at a rate of
about 919.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 29:33,00
INFO [main] (MarcImporter.java:513) - Indexed 2000000 at a rate of
about 1127.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 31:05,00
INFO [main] (MarcImporter.java:513) - Indexed 1999999 at a rate of
about 1071.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 59:50,00
INFO [main] (MarcImporter.java:513) - Indexed 2000000 at a rate of
about 557.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 31:53,00
INFO [main] (MarcImporter.java:513) - Indexed 1999998 at a rate of
about 1044.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 101:10,00
INFO [main] (MarcImporter.java:513) - Indexed 1907040 at a rate of
about 314.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 92:57,00
INFO [main] (MarcImporter.java:513) - Indexed 2000000 at a rate of
about 358.0 per sec

[Hmmm, what's that? The data is about the same as in the 2,000,000-record
packages above, but indexing went painfully slow, so I finally
interrupted it (there are 1.3 million more records in that file...)]
INFO [main] (MarcImporter.java:504) - Finished indexing in 431:14,00
INFO [main] (MarcImporter.java:513) - Indexed 695932 at a rate of
about 26.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 4:56,00
INFO [main] (MarcImporter.java:513) - Indexed 44814 at a rate of about
151.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 5:45,00
INFO [main] (MarcImporter.java:513) - Indexed 51994 at a rate of about
150.0 per sec

[again: interrupted manually because indexing went painfully slow]
INFO [main] (MarcImporter.java:504) - Finished indexing in 351:55,00
INFO [main] (MarcImporter.java:513) - Indexed 1044175 at a rate of
about 49.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 198:55,00
INFO [main] (MarcImporter.java:513) - Indexed 1584169 at a rate of
about 132.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 0:33,00
INFO [main] (MarcImporter.java:513) - Indexed 917 at a rate of about
27.0 per sec

INFO [main] (MarcImporter.java:504) - Finished indexing in 8:49,00
INFO [main] (MarcImporter.java:513) - Indexed 23889 at a rate of about
45.0 per sec
