marc --> solr, and performance testing


Naomi Dushay

Feb 26, 2010, 2:05:09 PM
to vufin...@lists.sourceforge.net, blacklight-...@googlegroups.com
Hi folks,

I am curious about efforts beyond SolrMarc that folks are using to get MARC
data into Solr. I'm wondering if we can leverage each other's knowledge
and work.

Could folks please reply to solrma...@googlegroups.com with non-
SolrMarc efforts they are aware of, or with "non-standard" uses of
SolrMarc (anything beyond writing local customizations that don't
affect the core code)?

I know of:

UWisc - Stephen Meyer - a heavily locally modified, stripped-down SolrMarc.
Much zippier. Raw MARC stored in a relational DB and retrieved from it at
display time.
UMich - Bill Dueber - working on multithreaded processing, using
JRuby and some Ruby wrappers.


I'm also interested in figuring out how to do reasonable performance
testing of indexing as painlessly as possible.
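
Even something as simple as wrapping an indexing run in Ruby's Benchmark would be a
start (a minimal sketch; index_file is a made-up stand-in for whatever kicks off
your indexing run):

  require 'benchmark'

  elapsed = Benchmark.realtime do
    index_file('records.mrc')   # stand-in for whatever kicks off the indexing run
  end
  puts "indexed in #{elapsed.round} seconds"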


- Naomi

Jonathan Rochkind

Mar 1, 2010, 12:47:35 PM
to blacklight-...@googlegroups.com
I am pretty optimistic about Bill's approach, and at some point I hope to
find the time to flesh it out into something as flexible and easy to set
up as SolrMarc, while keeping the code simple and elegant.

Jonathan

Bill Dueber

Mar 1, 2010, 11:05:35 PM
to blacklight-...@googlegroups.com
The latest update is that...well, I'm having threading problems. Running with a single thread, the JRuby code is roughly 90% as fast as SolrMarc (which some of us would accept just to be able to work in Ruby). I'm going to try to get a better understanding of Ruby threads and how they interact with / are implemented by Java native threads later this week.
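
The multithreaded part is, very roughly, shaped like this (a simplified sketch,
not the actual code; reader and process_record are stand-ins):

  require 'thread'

  queue   = Queue.new                # thread-safe FIFO from the Ruby stdlib
  workers = (1..4).map do
    Thread.new do                    # under JRuby these map onto Java native threads
      while (record = queue.pop) != :done
        process_record(record)       # stand-in for the field-extraction logic
      end
    end
  end

  reader.each { |record| queue << record }   # reader: stand-in for a MARC reader
  workers.size.times { queue << :done }
  workers.each(&:join)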


--
Bill Dueber
Library Systems Programmer
University of Michigan Library

Stephen Meyer

Mar 2, 2010, 9:49:07 AM
to blacklight-...@googlegroups.com
Bill, are you using the StreamingUpdateSolrServer? I had problems with
thread locks when using this class; they seem to be known and are being
worked out by the Solr developers:

* https://issues.apache.org/jira/browse/SOLR-1543
* https://issues.apache.org/jira/browse/SOLR-1711

I have not tried recently with a nightly version of this SolrJ client. I
have yet to resolve the problem myself, but plan on returning to it in
the future.
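
For anyone who hasn't looked at the class: it takes a Solr URL, a queue size,
and a thread count, and the thread count is the part implicated in the
lock-ups. From JRuby it looks roughly like this (a sketch; it assumes the
SolrJ 1.4 jars are on the classpath, and the URL and sizes are made up):

  require 'java'
  java_import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer

  # (url, queue_size, thread_count) -- the last argument is the number of
  # background threads draining the queue over HTTP
  suss = StreamingUpdateSolrServer.new('http://localhost:8983/solr', 100, 2)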

-Steve


--
Stephen Meyer
Library Application Developer
UW-Madison Libraries
312F Memorial Library
728 State St.
Madison, WI 53706

sme...@library.wisc.edu
608-265-2844 (ph)


"Just don't let the human factor fail to be a factor at all."
- Andrew Bird, "Tables and Chairs"

matt mitchell

Mar 3, 2010, 1:33:32 PM
to Blacklight Development
I didn't realize this was on the blacklight list -- on solrmarc too.
Just gonna copy what I put over there...

--------

Thought I'd mention the great work that Mike Perham is doing with
rsolr-async... This is a Ruby 1.9 connection driver that uses
EventMachine (and Fibers) to concurrently send updates to Solr via
http:

http://github.com/mwmitchell/rsolr-async

Combine this with the Ruby marc/enhanced gem and I'd imagine you'd
have a nice little combo happening.

-------

Here's an example of mapping with ruby marc and rsolr-direct:

http://github.com/mwmitchell/sifter/tree/master/example/
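
The basic shape of that kind of mapping looks roughly like this (a sketch using
the plain rsolr HTTP client for illustration rather than rsolr-direct; the field
choices are just examples):

  require 'marc'
  require 'rsolr'

  solr = RSolr.connect(:url => 'http://localhost:8983/solr')

  MARC::Reader.new('records.mrc').each do |record|
    doc = {
      :id            => record['001'].value,
      :title_display => record['245'] && record['245']['a'],
      :subject_facet => record.fields('650').map { |f| f['a'] }.compact
    }
    solr.add(doc)
  end
  solr.commit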

Jonathan Rochkind

Mar 3, 2010, 1:46:32 PM
to blacklight-...@googlegroups.com
I don't particularly want to make the jump to Ruby 1.9 for my Blacklight
Rails app yet myself (too many useful gems aren't on 1.9 yet, 1.9 in general
is still 'beta', and in some cases it can be hard to write code that
works under both 1.8 and 1.9) -- but running just a separate indexer
process under Ruby 1.9 might be more feasible, if there's a good reason
to, like Mike Perham's code.

It seems from talking to Erik, though, that using SolrJ (under Java or
JRuby) is going to give better performance than HTTP calls no matter how
you make them. Plus there are escaping issues with trying to put binary
MARC into Solr via an XML HTTP call that are taken care of with SolrJ.

But, so many options! I'm starting to get choice-fatigue in my indexing
attempts.

Jonathan

Bill Dueber

Mar 3, 2010, 1:56:30 PM
to blacklight-...@googlegroups.com
I'd encourage everyone to not focus too too too much on the speed of sending stuff to solr until you do some benchmarking. Pulling all the crap we want out of MARC is expensive, and the push to solr may be just noise depending on how you're doing it. 

In general, of course, we want fast pushes to Solr, but for doing what SolrMarc does, I don't think that's the bottleneck anymore.


Stephen Meyer

Mar 3, 2010, 4:31:34 PM
to blacklight-...@googlegroups.com
I told some folks at code4lib that I would send out an overview of what
exactly we are doing at the University of WI:

http://sdg.library.wisc.edu/blog/2010/03/03/solr-marc-indexing-based-on-diffs/

Check out the diagram, since it is intended to provide a quick overview
of how we get away with doing one of the most expensive things
(MARC parsing, as Bill points out) as few times as possible. We
only do MARC indexing/processing ~50k times for the adds and updates,
rather than 8M times for our whole MARC record set.
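
The gist of the diff step, very roughly (an illustrative sketch only, not our
actual code; it assumes a per-record fingerprint stored somewhere, e.g. alongside
the raw MARC in the database, and stored_fingerprint_for is a made-up lookup):

  require 'digest/md5'
  require 'marc'

  changed = []
  MARC::Reader.new('full_dump.mrc').each do |record|
    id          = record['001'].value
    fingerprint = Digest::MD5.hexdigest(record.to_marc)   # raw MARC bytes as the fingerprint
    changed << record if fingerprint != stored_fingerprint_for(id)
  end
  # only `changed` (the ~50k adds/updates) goes through the expensive MARC -> Solr processing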

-sm


Robert Haschart

Mar 3, 2010, 4:47:41 PM
to blacklight-...@googlegroups.com
In some initial benchmarking tests, SolrMarc running on a 25,000-record MARC file produced the following timings:

simply reading the marc records, translating them from marc8 to utf-8 encoding, and writing them out in text format to /dev/null:   12 secs
reading the records, translating them, processing them via UVA's rather complex indexing specification, NOT sending them to solr:   46 secs
reading the records, translating them, creating the index records, and sending them to solr, to an empty index:                     1 min 44 secs
reading the records, translating them, creating the index records, and sending them to solr, to an index w/ ~4M records:            2 min  4 secs

So pulling all the crap we want out of MARC is expensive (roughly 34 of those seconds), but the push to solr is not just noise (roughly another 58-78 seconds).

-Bob Haschart

Bill Dueber

Mar 3, 2010, 6:15:00 PM
to blacklight-...@googlegroups.com
Thanks for running these, Bob! I'm a little out-of-date -- what mechanism are you using to push them to solr these days? I'll do a similar check with jruby/StreamingUpdateSolrServer and post results here.

Jonathan Rochkind

Mar 3, 2010, 6:25:54 PM
to blacklight-...@googlegroups.com
2:04 for 25,000 records to a full index is still about 200 records a second, which is faster than the ~140 records/second I'm getting on my emptyish index.

Could be that you have faster hardware. Could be that my logic rules are even more complicated (I do use a lot of .bsh scripts with SolrMarc, which shouldn't in itself be a problem, assuming SolrMarc compiles the beanshell once rather than recompiling it on every record!).

We're both storing the full MARC in the record, right? So it isn't that.

Odd. Maybe your hardware is just a lot faster.

Ross Singer

Mar 3, 2010, 7:53:16 PM
to blacklight-...@googlegroups.com
On Wed, Mar 3, 2010 at 6:25 PM, Jonathan Rochkind <roch...@jhu.edu> wrote:
> Could be that you have faster hardware. Could be that my logic rules are even more complicated (I do use a lot of .bsh scripts with SolrMarc, which shouldn't in itself be a problem, assuming SolrMarc compiles the beanshell once rather than recompiling it on every record!).

Beanshell on the JVM is not Java -- it will still only be as fast as the
Beanshell interpreter (just like JRuby and all of the other
dynamically typed JVM languages):

http://ikayzo.org/confluence/pages/viewpage.action?pageId=16

So it's quite likely that at least *some* of your performance
differential is .bsh-related (which needs to be weighed against the extra
development time of doing in Java what you're currently doing in Beanshell).

-Ross.

Bill Dueber

Mar 4, 2010, 12:07:21 PM
to blacklight-...@googlegroups.com
I ran some tests of my own using JRuby with the StreamingUpdateSolrServer. I've got a whole lot more on my blog, which seemed too long to post here.

If you're interested, there's a lot more there, including some explanation of what I'm doing and a look at how multi-threading the processing stage helps. It's at http://robotlibrarian.billdueber.com/pushing-marc-to-solr-processing-times-and-threading-and-such/

The basics are, though:

* 18,881 records in marc-binary format
* Times are in seconds, run on my desktop
* Remember, you can't compare these numbers to Bob's because we're doing
different things to different data. 

 19  Just read the records with marc4j and do nothing.
 85  Read, process 35 "normal" fields (no custom)
104  Read, 35 normal, 15 custom fields
110  Read, normal, custom, allfields
129  Read, normal, custom, allfields, to_xml
136  Read, normal, custom, allfields, to_xml, 2-threaded SUSS, 
     commit every 5K docs
142  Read, normal, custom, allfields, to_xml, 1-threaded SUSS, 
     commit every 5k docs

# Add a processing thread
124  Read, normal, custom, allfields, to_xml, 1-threaded SUSS,
     commit every 5k docs, 2 threads doing processing

That breaks down to:

 129 do all the reading and processing
  13 send to solr with one thread

There are a lot of reasons why my submit-to-solr might seem like less of a
burden. The ones I can think of off the top of my head are:

* SUSS is just faster than whatever SolrMarc does.
* My processing stage is so much slower than SolrMarc's (due to algorithms or jruby-vs-java, I don't know) that the "push to solr" portion gets swallowed up by the slowness of the overall code.
* The Solr server is so much faster than my desktop that my poor little
  desktop can't send it data fast enough to keep it busy.
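
For reference, the push stage in these runs is shaped roughly like this (a
simplified sketch, not the real script; solr_doc_for stands in for the
normal/custom/allfields processing, reader stands in for the marc4j-based
record reader, and the URL and queue size are made up):

  require 'java'
  java_import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer
  java_import org.apache.solr.common.SolrInputDocument

  # url, queue size, and number of SUSS threads are made up
  suss  = StreamingUpdateSolrServer.new('http://solr-host:8983/solr', 500, 2)
  count = 0

  reader.each do |record|                 # reader: stand-in for the marc4j-based reader
    fields = solr_doc_for(record)         # stand-in for the field processing
    doc = SolrInputDocument.new
    fields.each { |name, values| Array(values).each { |v| doc.addField(name.to_s, v) } }
    suss.add(doc)
    suss.commit if (count += 1) % 5_000 == 0   # commit every 5K docs, as in the timings above
  end
  suss.commit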
