Indexing Speed Addressed - new approach

Erik Hatcher

Jan 7, 2010, 8:49:33 AM
to solrmarc-tech
You guys make me manic... up all night because I couldn't sleep until
this problem was well under control. And I think I pretty much kicked
its butt.

Test data set, so you folks can play along at home (thanks Ross!):
http://www.archive.org/details/talis_openlibrary_contribution using
the talis-openlibrary-contribution.mrc

Looks to be 5,786,920 records (reported from my DIH-based indexer, see
later). Good enough size for testing here.

First I gave the 2.1 branch of SolrMarc a try. I chatted with Bob,
got the exact scoop to index a MARC file into Solr over HTTP. I'm not
even gonna fiddle with the embedded indexer, as I don't believe it is
the way to go.

I kicked it off at 5pm last night. At 2am, it's still going (9
HOURS!!!!). I ctrl-c'd it as I was tired of my computer churning so
hard. It was mostly done, I found out, as it reported:

...
INFO [main] (MarcImporter.java:256) - Added record 5051841 read from
file: 8f89ebfab7654a4b8e923d3d893481be
INFO [Thread-1] (MarcImporter.java:455) - Starting Shutdown hook
INFO [Thread-1] (MarcImporter.java:459) - Stopping main loop
INFO [main] (MarcImporter.java:256) - Added record 5051842 read from
file: ddd9ca40d8a24cea865bdafa588877c9
INFO [main] (MarcImporter.java:506) - Adding 5051842 of 5051842
documents to index
INFO [main] (MarcImporter.java:507) - Deleting 0 documents from
index
INFO [main] (MarcImporter.java:381) - Calling commit
INFO [main] (MarcImporter.java:392) - Done with the commit, closing
Solr
INFO [main] (MarcImporter.java:395) - Setting Solr closed flag
INFO [Thread-1] (MarcImporter.java:474) - Finished Shutdown hook
INFO [main] (MarcImporter.java:516) - Finished indexing in 573:02.00
INFO [main] (MarcImporter.java:525) - Indexed 5051842 at a rate of
about 22.0 per sec
INFO [main] (MarcImporter.java:526) - Deleted 0 records

Confirmed the number of documents via SolrMarc's cool tools:
~/dev/solrmarc/dist/bin: printrecord /Users/erikhatcher/Downloads/talis-openlibrary-contribution.mrc | egrep '^001' | wc -l
5786920

Thanks Bob for the help to get this rolling!

As I was kicking off the indexer, I perused the code and was a bit
aghast at the SolrProxy stuff, and all the reflection going on in
there. SolrServer is the abstraction I recommend using, in the SolrJ
library. And yes, I understand the rationale (class loader, Solr
version, etc), but let's go ahead and jump to the real win here....
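
For reference, the SolrJ route is about this simple. A minimal sketch
(nothing SolrMarc-specific; the URL and field names are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrJSketch {
  public static void main(String[] args) throws Exception {
    // SolrServer is the abstraction; CommonsHttpSolrServer is the
    // HTTP implementation in Solr 1.4's SolrJ library.
    SolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "test-001");            // placeholder values
    doc.addField("text", "full record text");
    server.add(doc);
    server.commit();
  }
}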

So rather than digging into the code in depth, I wanted to prove out a
simpler way of indexing MARC. I had done this earlier (1.5 years ago
or so?) using a custom update request handler, very simple loop using
plain ol' MARC4J. I just wanted to see how fast the simplest possible
thing that could work runs. But rather than the update handler
approach, I figured I'd go for the gusto and frame it within the
DataImportHandler framework as a custom EntityProcessor.

So I've done that. My results:

Indexed that entire set in 55 minutes, 15 seconds. Yes, you read
that right.

What's the difference? Ok, some shortcuts here - I'm not using the
SolrMarc mappings, just MARC4J to index the 001 id value, and took the
Record object and toString'd it into an indexed text field. Makes the
whole thing searchable at least. That's all. So no heavy lifting on
the mappings, just pure MARC -> Solr. With my SolrMarc indexing try,
I pointed at the UvaBlacklight Solr config and mappings, and with my
"DIHmarc" (die MARC? dim-mak?) implementation I used Solr's example
schema. So there's some analysis work to do on the differences, too.
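
To give a feel for the shape of it, here is a stripped-down sketch of
such an EntityProcessor. This is illustrative rather than my exact
code; the file path and field names are placeholders:

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EntityProcessorBase;
import org.marc4j.MarcStreamReader;
import org.marc4j.marc.Record;

public class MarcEntityProcessor extends EntityProcessorBase {
  private MarcStreamReader reader;

  @Override
  public void init(Context context) {
    super.init(context);
    try {
      // hardcoded path for now (see the README points below)
      reader = new MarcStreamReader(
          new FileInputStream("/path/to/marc-records.mrc"));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public Map<String, Object> nextRow() {
    if (reader == null || !reader.hasNext()) return null; // end of data
    Record record = reader.next();
    Map<String, Object> row = new HashMap<String, Object>();
    row.put("id", record.getControlNumber()); // the 001 value
    row.put("text", record.toString());       // whole record, searchable
    // Insert Record => Map SolrMarc magic API call here
    return row;
  }
}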

I'm posting a .zip file to the files section shortly with the full
(incredibly tiny) deal and a README.txt showing how easy this is to
use with Solr 1.4.

What's left? The README.txt has these points:
* The MARC file path is currently hardcoded into
MarcEntityProcessor.
* SolrMarc mappings aren't wired in; see the placeholder in
MarcEntityProcessor
* Parameterization is needed, and integration to allow
MarcEntityProcessor to work on a list of files or input streams
* Consider whether SolrMarc mappings are the way to go for
straightforward things, or use DIH Transformers

What now? Maybe we commit this and iterate on it on a branch or
somewhere where I can commit to it also? And we make it
parameterizable and then wire in SolrMarc where this placeholder
resides in my prototype:

// Insert Record => Map SolrMarc magic API call here
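
If SolrMarc exposes a record-to-map call, I'd expect the wiring to end
up looking roughly like this (names are stand-ins, not a verified
SolrMarc API):

// Hypothetical wiring with assumed names -- not a verified API:
//   Map<String, Object> mapped = solrMarcIndexer.map(record);
//   row.putAll(mapped);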

I imagine there is some MARC4J iteration work that might need to be
tweaked - I just used the stock marc4j.jar, an old one I had lying
around when I did the update handler a couple of years ago (oh the
pains of that collecting dust!).

Why is my implementation so much faster? I don't know yet, and
frankly I don't think it matters. :) It surely isn't simply the
SolrMarc mappings, it's got to be in the indexing code, maybe with the
reflection code? Maybe with schema.xml (that version="0.4" is way
wrong). Maybe other complexities? Mainly I think DIH is a better way
anyway, so I started my work with that.

The difference in speeds:

SolrMarc: 22 docs / s
DIHmarc: 1,745 docs / s

Furthermore, having this in the DataImportHandler framework allows
blending with other data sources. So one next step could be to build
a config to iterate through a directory of files and index them
linearly. Or hit a relational database for lookup values to
incorporate (the SQL result can be cached, so don't worry about
performance here!), or to fetch a URL that might be pointed to from
the MARC record and incorporate its text into a field, or.... well the
possibilities are pretty wide open. Sure, SolrMarc's flexible config
made some of this possible too, but DIH is more heavily used simply by
having a broader user base. So, keep the SolrMarc mapping magic, but
consider how it fits into the DIH framework. DIH has Transformers.
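
For a flavor of that, a bare-bones Transformer looks something like
this (a made-up example; the field names are invented):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class RecordLengthTransformer extends Transformer {
  // DIH calls this once per row; we can add or rewrite columns here.
  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object text = row.get("text");            // invented field name
    if (text != null) {
      row.put("text_length", text.toString().length());
    }
    return row;
  }
}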

And yes, DIH is not without faults - its API is awkward, and many
others feel the same way. We'll see DIH evolve even further soon.
But we're finding at Lucid engagements that DIH solves the problem,
and while it may look a little uglier, it's way quicker and easier to
use than writing a bunch of custom stuff for every project.

Ok, so, who's up for testing this out next on their data sets? Let's
see some other numbers here.

Erik

Erik Hatcher

Jan 7, 2010, 8:52:01 AM
to solrmarc-tech
DIHmarc.zip uploaded here, follow README.txt, note the notes.
http://groups.google.com/group/solrmarc-tech/web/DIHmarc.zip

Jonathan Rochkind

Jan 7, 2010, 10:57:10 AM
to solrma...@googlegroups.com
This is very interesting. But I'm wondering... for most of our actual
cases, it's not feasible to give up on all the SolrMarc mappings and
just map everything as full text. It's an interesting demonstration,
but.... we actually need the power of solrmarc to do sophisticated
mappings.

So... I'm left unsure how to proceed in a way that would be useful for me.
Simply using your solution with no mappings is not a solution. So
theoretically we can add features to your custom Solr update handler to
allow just as sophisticated mappings as SolrMarc... but will that just
make it slow again, depending on how they're added?

Comparing: SolrMarc with sophisticated mappings; TO: Request handler
with no mappings. We have (at least) two dimensions; without profiling
we can't really be sure that the speed gain is due to the 'custom
update handler' rather than just abandoning the sophisticated
mappings. But
we actually need the sophisticated mappings for our applications. So
I'm not following why "it doesn't matter" why your code is so much
faster; because if someone spent lots of time adding the ability for
sophisticated locally defined mappings in your request handler (when
that ability is already implemented in SolrMarc), only to find that they
just wound up with the same performance problems... that would be kind
of mis-spent development time, no?

Does what I'm saying make sense? The thing is that the ability to
locally define sophisticated mappings is kind of crucial to us.

Jonathan

Ross Singer

Jan 7, 2010, 11:36:40 AM
to solrma...@googlegroups.com
Jonathan, I think you're missing Erik's point.

He's not talking about removing the mappings, he just didn't include
them in his proof-of-concept. The idea would be to swap out MARC4J
with SolrMarc and have it create the mappings -- the difference is
that it wouldn't do all the other stuff it does, just take the MARC
and make it a Solr doc.

Let SolrMarc do its job (take a MARC record, analyze, turn it into a
Hash -- basically) and let Solr do its job
(add/delete/commit/optimize).
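
In sketch form, something like this -- where mapRecord(...) is a
stand-in for whatever SolrMarc's mapping entry point turns out to be,
and reader/server are the obvious MARC4J and SolrJ pieces:

// Division of labor, sketched with stand-in names; mapRecord is
// hypothetical, reader is a MARC4J MarcReader, server a SolrJ SolrServer.
while (reader.hasNext()) {
  Record record = reader.next();
  Map<String, Object> fields = mapRecord(record);  // SolrMarc's job
  SolrInputDocument doc = new SolrInputDocument();
  for (Map.Entry<String, Object> e : fields.entrySet()) {
    doc.addField(e.getKey(), e.getValue());
  }
  server.add(doc);                                 // Solr's job
}
server.commit();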

-Ross.

Jonathan Rochkind

Jan 7, 2010, 11:38:29 AM
to solrma...@googlegroups.com
Aha, I see, you're right, I did misunderstand. If the idea is that it's
potentially possible to use actually existing SolrMarc code with this
new approach too, that sounds promising. It would be interesting to
see if once you do that... your bottleneck comes back. But it's
definitely worth investigating; if anyone does it, they'll definitely
have my gratitude.

Jonathan

Till Kinstler

Jan 7, 2010, 12:01:36 PM
to solrma...@googlegroups.com
Jonathan Rochkind wrote:

> Aha, I see, you're right, I did misunderstand. If the idea is that it's
> potentially possible to use actually existing SolrMarc code with this
> new approach too, that sounds promising. It would be interesting to
> see if once you do that... your bottleneck comes back.

I wouldn't expect that. My stupid little profiling efforts (I am in no
way a proficient Java profiler) seem to indicate that the increasing
slowness of solrmarc indexing comes from the actual indexing in its
embedded Solr/Lucene part, not from the record processing in solrmarc.
Of course record processing in solrmarc eats additional processing time
(depending on what you do there), but that should be pretty independent
of index size. And currently indexing speed seems to depend on index
size (at least for my data, index configuration and hardware).
Of course we shouldn't throw away the sophisticated options solrmarc
offers to map MARCish records to Solr indexes. That's really useful!

Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kins...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de

Jonathan Rochkind

Jan 7, 2010, 12:03:39 PM
to solrma...@googlegroups.com
Till Kinstler wrote:
>
> I wouldn't expect that. My stupid little profiling efforts (I am in no
> way a proficient Java profiler) seem to indicate that the increasing
> slowness of solrmarc indexing comes from the actual indexing in its
> embedded Solr/Lucene part, not from the record processing in solrmarc.
>

Has anyone tried benchmarking/profiling SolrMarc 2.1 using its ability
to simply http post to solr instead of using the embedded solr/lucene?
Some have always been suspicious of the embedded solr/lucene.

Jonathan

Jonathan Rochkind

Jan 7, 2010, 12:05:03 PM
to solrma...@googlegroups.com
Oops, never mind: Erik's test _was_ of SolrMarc using http post indexing,
not embedded solr/lucene. So the slowness of SolrMarc in Erik's test
definitely wasn't related to the embedded solr/lucene. Hmm.

Erik Hatcher

Jan 7, 2010, 12:26:07 PM
to solrma...@googlegroups.com

On Jan 7, 2010, at 12:03 PM, Jonathan Rochkind wrote:
> Has anyone tried benchmarking/profiling SolrMarc 2.1 using its
> ability to simply http post to solr instead of using the embedded
> solr/lucene? Some have always been suspicious of the embedded solr/
> lucene.

My 9 hour aborted run on the 5.7M .mrc file last night was with
SolrMarc 2.1 over HTTP.

Erik

Cicer0

Jan 11, 2010, 3:39:23 PM
to solrmarc-tech
I posted a possibly related question on the "indexing time -RAM"
thread, but for completeness could you confirm that your 9 hour
aborted run did not have an excessive number of un-merged segments
(which I am currently speculating is behind the relationship of index
size to indexing time)?

Till Kinstler

Jan 14, 2010, 9:39:23 AM
to solrma...@googlegroups.com
Jonathan Rochkind wrote:

> Has anyone tried benchmarking/profiling SolrMarc 2.1 using its ability
> to simply http post to solr instead of using the embedded solr/lucene?

I finally found some minutes to upgrade my solrmarc zoo to a checkout of
the solrmarc-2.1 branch (following the instructions by Bob and Willem).
I just did a short test using both methods: I added 132835 MARC records
to an index holding 21504573 records.

Using embedded Solr:
INFO [main] (MarcImporter.java:516) - Finished indexing in 5:00,00
INFO [main] (MarcImporter.java:525) - Indexed 132835 at a rate of
about 441.0 per sec

Using HTTP POST (by unsetting the solr.path property):
INFO [main] (MarcImporter.java:516) - Finished indexing in 10:15,00
INFO [main] (MarcImporter.java:525) - Indexed 132835 at a rate of
about 215.0 per sec

As expected, memory consumption is much lower using HTTP post, because
solrmarc doesn't load a second Solr instance on the machine.
