Storing raw MARC outside of Solr


Stephen Meyer

Mar 4, 2010, 11:41:31 AM
to blacklight-...@googlegroups.com
At UW, we made the decision not to store our raw MARC in Solr. Instead
we are storing the raw MARC in an RDBMS. We have a "records" table and a
corresponding ActiveRecord model that also has a btree-indexed column
for the Solr identifier. We use the raw MARC heavily for our single
record display page.
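For the curious, the shape of that setup is roughly this (a sketch with hypothetical table/model names, assuming Rails/ActiveRecord; not our exact code):

```ruby
# Hypothetical migration and model for external raw-MARC storage.
# The btree index on solr_id makes the per-document lookup from the
# single record display page fast.
class CreateRecords < ActiveRecord::Migration
  def self.up
    create_table :records do |t|
      t.string :solr_id, :null => false
      t.binary :raw_marc
      t.timestamps
    end
    # Most RDBMSes use a btree index by default
    add_index :records, :solr_id, :unique => true
  end

  def self.down
    drop_table :records
  end
end

class Record < ActiveRecord::Base
  # Look up the raw MARC for a given Solr document id
  def self.for_solr_id(id)
    find_by_solr_id(id)
  end
end
```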

We decided to keep the raw MARC out of Solr to

1) keep the index small

2) reduce the indexing time for full builds. When we run a full index
regeneration we cut the build time by anywhere from a third to a half.
My suspicion is that there is a lot of overhead on the indexing side
when rebalancing the index as it grows. Without storing 8M MARC streams
we drastically reduce the amount of data pushed in, and consequently the
reshuffling that periodically takes place.

3) More importantly, though, we want a smaller index so that more of it
can fit in memory and searches will be faster.

I spoke to Jonathan, Naomi and Jessie about this a little bit at
code4lib. They recommended emailing the group to float an idea about how
the core Blacklight code might enable a configuration for where the raw
MARC is stored. We want to be able to keep hooking into things like the
citation, emailing, SMS, etc. utilities that use the Blacklight MARC
objects, but right now we are kind of hacking it in our own model class
that calls

autoload :Citation, 'blacklight/marc/citation.rb'

and it doesn't smell entirely right. Jonathan indicated he has ideas
about this.

What do folks think for a future version of blacklight?

-Steve
--
Stephen Meyer
Library Application Developer
UW-Madison Libraries
312F Memorial Library
728 State St.
Madison, WI 53706

sme...@library.wisc.edu
608-265-2844 (ph)


"Just don't let the human factor fail to be a factor at all."
- Andrew Bird, "Tables and Chairs"

Jonathan Rochkind

Mar 4, 2010, 2:22:06 PM
to blacklight-...@googlegroups.com
I believe our plans for SolrDocument 'extensions' will cover this. My
plans for SolrDocument extensions are that a [ SolrDocument#to_marc =>
RubyMarcRecord ] method would be added to SolrDocument by an extension,
not be in the base SolrDocument class itself.

Then all code that wanted to see if a MARC representation was available
would simply call document.respond_to?(:to_marc) [instead of the
current !document.marc.marc.nil? ], and would simply call
document.to_marc to get it (instead of the current document.marc.marc).

Then for Stephen's use case, you'd just add a local (or from a shared
plugin) SolrDocument extension that added a #to_marc method which pulled
the MARC from the db instead of a Solr field, and only added that
method to documents that had look-up-able MARC. (You have to use a bit
of Ruby trickery to conditionally add the method based on conditions
of the instance, but it's doable.)

And then it would Just Work. [It would be slightly harder if you wanted
to use marc documents on search results screen, rather than just item
detail, because you'd want to fetch all of the marc in one SQL call from
your rdbms, not N, for efficiency. But still do-able].
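A rough sketch of what that could look like (all names here are hypothetical; the eventual extension API may differ). The database lookup is stubbed with an in-memory hash so the sketch is self-contained:

```ruby
# MarcFromDatabase adds #to_marc only to documents whose MARC is
# available externally, so callers can feature-test with
# document.respond_to?(:to_marc).
module MarcFromDatabase
  def to_marc
    # In a real app this would be something like:
    #   Record.find_by_solr_id(self[:id]).raw_marc
    # Here an in-memory hash stands in for the database.
    self.class.marc_store[self[:id]]
  end
end

class SolrDocument
  def self.marc_store
    @marc_store ||= {}
  end

  def initialize(fields)
    @fields = fields
  end

  def [](key)
    @fields[key]
  end

  # The "bit of Ruby trickery": extend the instance, not the class,
  # so only documents with look-up-able MARC gain the method.
  def self.build(fields)
    doc = new(fields)
    doc.extend(MarcFromDatabase) if marc_store.key?(fields[:id])
    doc
  end
end
```

Building a document whose id is present in the store yields one that responds to #to_marc; other documents simply don't, and callers degrade gracefully.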

I hope to code up a draft proposed patch implementing this kind of
SolrDocument extension API "soon". I will need it for my Umlaut
integration too.

Jonathan

Stephen Meyer

Mar 4, 2010, 3:48:54 PM
to blacklight-...@googlegroups.com
We don't use the raw MARC in our search results because of the
efficiency concerns.

Jonathan Rochkind

Mar 4, 2010, 3:52:45 PM
to blacklight-...@googlegroups.com
I'm curious whether you experimented with or measured the performance
impact of using the raw MARC in your search results, or just figured it
better not to try?

In my own (very limited and not precise) profiling, it _appeared_ that
the only place a performance slowdown showed up when using MARC in search
results (in my case from my Solr index) was not in parsing the MARC, and
not necessarily (apparently, at my index size) in Solr's processing --
but actually in the extra time to transfer the MARC records over the
HTTP wire from Solr.

So I'm curious if grabbing it from a db instead might reduce even that
performance impact (which for my purposes is so far tolerable, even as
it is).

Jonathan

Bill Dueber

Mar 4, 2010, 4:08:44 PM
to blacklight-...@googlegroups.com
Granted, we have great hardware, but for our VuFind installs we retrieve MARC-XML from Solr and parse it into MARC objects for every single search result (20 on a page) and don't have a problem. I'd benchmark it before you do too much (premature?) optimization...

--
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To post to this group, send email to blacklight-...@googlegroups.com.
To unsubscribe from this group, send email to blacklight-develo...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.




--
Bill Dueber
Library Systems Programmer
University of Michigan Library

Stephen Meyer

Mar 4, 2010, 4:12:34 PM
to blacklight-...@googlegroups.com
We didn't do any formal measurements; we have just been on constant
lookout for all possible time we can shave. It has been more of a frame
of mind for us, driven by the union index.

Take another case that worries me - this search result set has only 4
results, but 123 item holdings:

http://forward.library.wisconsin.edu/catalog?q=%22in+defense+of+food%22

So we are doing 123 circulation availability lookups for a single
results page.

(This is because the book was used in our university's common book
program this year.)

In a lot of cases, addressing the usability concerns may refactor away
these performance concerns, but we are still mindful of our milliseconds.

-steve

--

Ross Singer

Mar 5, 2010, 2:59:56 PM
to blacklight-...@googlegroups.com
On Thu, Mar 4, 2010 at 4:12 PM, Stephen Meyer <sme...@library.wisc.edu> wrote:

> Take another case that worries me - this search result set has only 4
> results, but 123 item holdings:
>
> http://forward.library.wisconsin.edu/catalog?q=%22in+defense+of+food%22
>
> So we are doing 123 circulation availability lookups for a single results
> page.

I'm a little confused about this argument, really - I would think
you'd cut that frightening holdings display --way-- before you'd cut
the searching safety net that keeping the MARC record searchable would
bring.

Anyway, it seems your holdings lookup could be done asynchronously via
AJAX or, if you're really concerned about usability/performance,
harvested into Solr as well so you could actually search on things
that are available.

If anything, this particular example shows me that you've got
performance to *spare* if you're able to synchronously look up 123
items and I barely noticed.

-Ross.

Stephen Meyer

Mar 8, 2010, 11:48:30 AM
to blacklight-...@googlegroups.com
Yeah, sorry for being less than clear on this... addressing those
frightening displays will make this moot.

Also, we are preharvesting the item circ status data and loading it into
a db table. We do a daily extract and load of this data for the entire
set of item level holdings and put a btree index on the two columns
needed to look up an item. This set includes every copy of every
circulating item we have from all UW System schools. So this table has
circ status records for 15M things. Not a small table.

Given our history with Voyager [1], our sysadmins are not comfortable
with us making live circstatus lookups against the production tables. It
is also further complicated by the fact that we would need to make these
lookups against 14 separate instances of Voyager so it is not just a
single sysadmin team we would be dealing with. Instead we decided to
preharvest. (A JANGLE-fied ILS API would be fantastic here.)

The reason I brought up the example is that it illustrates (in my mind)
one of a few scale issues we are trying to be cognizant of. While we
aren't dealing with Hathi levels of data, doing the consortial index
does present a set of scale (and usability) issues that we weren't
previously thinking about in managing our OPAC, and so we are on the
lookout for all the places where we might keep things simple and fast.

-sm

[1] Last fall we tried to set up the newest version of the Voyager OPAC
at UW-Madison. It is a big switch from the old CGI-based version to a
newer Tomcat webapp. When classes started and production load resumed
its normal levels, the new version of the OPAC slowed to a crawl and the
app fell over. We ended up resolving the issue after a few weeks by
going back to the old version of the OPAC. This has been a bit of
background context for our Forward project and we are trying to walk a
fine line of not prematurely optimizing, but also ensuring that when we
get to a point where we get real load, we are ready for it.

--

Eric Larson

Mar 8, 2010, 5:16:07 PM
to blacklight-...@googlegroups.com
I don't believe rendering a search result page requires parsing full objects.  Good for you if you choose to do that, but there's just not a lot of info you can squeeze into a 500 x 150px result row div (your measurements may differ, slightly!).  Title, contributor, year, format, location, call number, image and you're essentially broke.  Honestly, it's a lovely design limitation.

Our index goals again are:
1) Small index
2) Fast index build time
3) Fast search results

Another reason it makes sense for us to simplify our result page and index is because we're not just MARC.  We've got a million-plus METS records waiting in fedora that hope to join the project soon.  Ultimately, our Blacklight app will provide discovery for all sorts of metadata schemes, with data coming from many differing repositories.

Cross-walking MARC / METS / DC / MODS / TEI / IEEE LOM / OME-XML... etc, you've got to cut to the point on the result page and push people out to views (or external apps) that intimately know our OME-XML collection.  Views that specialize in object parsing and presentation.

- - - -

Now, touching the database for 123 circ holdings *is silly* -- we're gonna polish that soon.

Cheers,
- Eric



--
Eric Larson
Digital Library Consultant
UW Digital Collections Center


Jonathan Rochkind

Mar 8, 2010, 5:46:17 PM
to blacklight-...@googlegroups.com
For me, it made a lot of sense to parse the MARC in order to render the
search results page -- while there are only a handful of things on that
page (although they kind of add up, esp. when you include non-roman
displays), I wanted the flexibility of pulling them out of MARC at
display time, and using the same Ruby code to pull them out of MARC that
I'm using on item detail pages.

For me, it doesn't seem to add unacceptable time to display, but I
haven't benchmarked fully yet, or profiled with care to see exactly
where the added time is coming from, if there is significant added time.

If it becomes a problem, I'd first try storing the marc in a database
like you guys, but still rendering at display time even for search
results page. And if that was STILL a problem, I might have to resort to
NOT parsing marc for search results page at display time. But I hope it
doesn't get to that, it's just SO convenient.

But to each their own! BL should (and kind of does, but getting better
all the time) support any of these implementation choices by the local
implementer.

Jonathan


Ross Singer

Mar 8, 2010, 11:27:33 PM
to blacklight-...@googlegroups.com
On Mon, Mar 8, 2010 at 5:16 PM, Eric Larson <ela...@library.wisc.edu> wrote:
> I don't believe rendering a search result page requires parsing full
> objects.

Hmm, I don't either and maybe I'm misunderstanding what the argument
is about (certainly wouldn't be the first time).

I thought you were storing the marc record in the rdbms because you
weren't putting it in Solr, is that wrong? If so, disregard
everything after the following word: sorry.

If not, you don't have to use the MARC record in Solr for any aspect
of display, but is there no value in having the whole thing there just
for searching purposes?

-Ross.

Stephen Meyer

Mar 9, 2010, 10:07:07 AM
to blacklight-...@googlegroups.com
I think the confusing part of this thread is that I responded to an
incidental comment that Jonathan made about parsing MARC records in a
search results page.

As to your point below - correct me if I am missing it - about storing
everything in a raw MARC record for searching, our thought is that we
will eventually shape our MARC parsing/indexing rules to extract all
keywords into some part of the index. However, there is also a lot of
extra stuff that we don't want to store in Solr because it would just be
noise. A good example would be the leader and directory for a given record:

00641nam a2200229Ia 45x0001001200000005001700012008004100029010001700070035
0016000870400013001030490009001160900026001251000028001
5124500520017925000120023126000630024330000210030650000
1100327546001500338852004600353994001200399

On a small scale, those 230 chars don't seem like much, but stored for
8 million records they amount to 1.7G of data (if my math is right) in
the index. We just don't need that data stored in Solr and would like
to keep it out so we can try to run Solr with as much of our index in
memory as possible.
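The arithmetic does check out; a quick back-of-the-envelope calculation:

```ruby
# ~230 bytes of leader + directory per record, times 8 million records.
bytes = 230 * 8_000_000
puts "%.2f GB" % (bytes / 1024.0**3)  # prints "1.71 GB"
```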

-sm

Ross Singer wrote:
> If not, you don't have to use the MARC record in Solr for any aspect
> of display, but is there no value in having the whole thing there just
> for searching purposes?
>
> -Ross.
>

--

Eric Larson

Mar 9, 2010, 10:07:42 AM
to blacklight-...@googlegroups.com
@Ross: We're not arguing, so no need to apologize. :)

It's what you actually need to display from the index that we're debating.  It's neat you can store large chunks of raw data in a Solr index, but it's wisest to only store what you need to search on or immediately display.  Small index = happier, sprier index.

We're storing a few MARC fields and values in our Solr index.  At other institutions, they store some MARC fields and values, *plus* the entire MARC record in a Solr text field.  It's the plus part that we've moved into an RDBMS, to fetch only on a detail/show page.

BL probably shouldn't care where whole objects come from, or assume they'll be stored in the Solr index.  Ultimately, we'd love to see BL move towards configurable / plugin-able raw object storage.  An RDBMS, a Fedora install, or an Amazon S3 account all make some sense for storing raw objects.

Cheers!
- Eric




Jonathan Rochkind

Mar 9, 2010, 10:39:53 AM
to blacklight-...@googlegroups.com
It definitely seems to be solr "best practice" to reduce/eliminate
'stored fields' in solr as much as possible, and store (relatively)
'large' objects in an external store like an rdbms instead of solr. So I
think you are definitely correct and on the right track.

Jonathan

Bill Dueber

Mar 9, 2010, 10:47:30 AM
to blacklight-...@googlegroups.com
Like all of these time/space tradeoffs, obviously different folks are going to have different needs and resources. Pulling display data from a raw MARC record (wherever you store it) is nice because you don't need to rebuild your index if you decide to add an extra hunk of data to the display. Storing exactly what you need is good because, well, you're only storing exactly what you need :-)

If you're going to pull some of your display data from raw records, then the question is where should you store it (assuming your current storage in your ILS is insufficient).

The salient (benchmarkable) questions seem to be:

1. What's the penalty on search/retrieval time for stored (unindexed) fields? For example, if I store raw MARC in my Solr instance *but never request that field* (it's never in 'fl'), how does it affect search and retrieval speed? This addresses the "well, we only use it on the individual record view" issue.

2. What's the penalty if you *do* request the large stored field...

3. ...vs. pulling it in via an RDBMS.
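Those questions could be tried with a minimal harness along these lines (the fetchers here are stand-in lambdas; real ones would hit Solr with and without the MARC field in 'fl', and the RDBMS):

```ruby
require 'benchmark'

# Stand-in fetchers: real ones would query Solr (varying 'fl') or the
# RDBMS. Each returns one page (20 docs) of result hashes; the second
# simulates dragging a large stored MARC field back with each doc.
fetch_small = lambda { |ids| ids.map { |id| { :id => id } } }
fetch_marc  = lambda { |ids| ids.map { |id| { :id => id, :marc => 'x' * 2000 } } }

ids = (1..20).to_a  # one page of search results
Benchmark.bm(18) do |b|
  b.report('without raw MARC') { 1_000.times { fetch_small.call(ids) } }
  b.report('with raw MARC')    { 1_000.times { fetch_marc.call(ids) } }
end
```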

I admit to not having done any testing because Solr returns stuff to me faster than I can process it as it is. Anyone else?

Erik Hatcher

Mar 9, 2010, 10:47:43 AM
to blacklight-...@googlegroups.com
I don't think there's any such "best practice" like this. It all
*depends*. It seems fine to store MARC for rendering a detail page.
But it's wisest, as several have already said on this thread, to
store what you want rendered in the search results list as separate
fields in Solr. And that simply makes sense when merging other data
sources and funneling attributes down to "Dublin Core"-like basics
that all objects will have, like id and at least a title.

Too much discussion isn't really needed here, other than to make
Blacklight pluggable for stored data sources, allowing flexibility for
where the data comes from.

Erik

Erik Hatcher

Mar 9, 2010, 10:51:50 AM
to blacklight-...@googlegroups.com

On Mar 9, 2010, at 10:47 AM, Bill Dueber wrote:
> The salient (benchmarkable) questions seem to be:
>
> 1. What's the penalty on search/retrieval time for stored
> (unindexed) fields? For example, if I store raw MARC in my Solr
> instance *but never request that field* (it's never in 'fl'), how
> does it affect search and retrieval speed? This addresses the "well,
> we only use it on the individual record view" issue.

No effect whatsoever.

> 2. What's the penalty if you *do* request the large stored field...

Definite I/O to retrieve that data, and if you're pulling that field
for a page of search results then it could be bouncing around randomly
through the stored field data to gather it all. Whether this is a
prohibitive hit or not will depend on several factors (index cached by
operating system? network speed, speed of client and server, time to
parse on the client...)

> 3. ...vs. pulling it in via an RDBMS.

Probably not a lot different from pulling it from a Solr stored
field. But the point that index growth is an issue makes an RDBMS
(or CouchDB, etc.) a great option for storing this stuff.

> I admit to not having done any testing because Solr returns stuff to
> me faster than I can process it as it is.

:)
+1

Erik
