We decided to keep the raw MARC out of Solr to:
1) keep the index small
2) reduce the indexing time for full builds. When we run a full index
regeneration we cut the build time by anywhere from a third to a half.
My suspicion is that there is a lot of overhead on the indexing side
when rebalancing the index as it grows. Without storing 8M MARC streams
we drastically reduce the amount of data pushed in and, consequently,
the reshuffling that periodically takes place.
3) more importantly, though, we want a smaller index so that more of it
can fit in memory and the index will be searched faster.
I spoke to Jonathan, Naomi and Jessie about this a little bit at
code4lib. They recommended emailing the group to float an idea about how
the core Blacklight code might enable a configuration for where the raw
MARC is stored. We want to be able to keep hooking into the citation,
emailing, SMS, etc. utilities that use the Blacklight MARC objects, but
right now we are kind of hacking it in our own model class, which calls

  autoload :Citation, 'blacklight/marc/citation.rb'

and it doesn't smell entirely right. Jonathan indicated he has ideas
about this.
What do folks think for a future version of blacklight?
-Steve
--
Stephen Meyer
Library Application Developer
UW-Madison Libraries
312F Memorial Library
728 State St.
Madison, WI 53706
sme...@library.wisc.edu
608-265-2844 (ph)
"Just don't let the human factor fail to be a factor at all."
- Andrew Bird, "Tables and Chairs"
Then all code that wanted to see if a MARC representation was available
would simply call document.respond_to?(:to_marc) [instead of the
current !document.marc.marc.nil?], and would simply call
document.to_marc to get it (instead of the current document.marc.marc).
Then for Stephen's use case, you'd just add a local (or shared-plugin)
SolrDocument extension that added a #to_marc method which pulled the
MARC from the db instead of a Solr field, and only added that method to
documents that had look-up-able MARC. (You have to use a bit of Ruby
trickery to conditionally add the method based on conditions of the
instance, but it's doable.)
And then it would Just Work. [It would be slightly harder if you wanted
to use MARC documents on the search results screen, rather than just
item detail, because you'd want to fetch all of the MARC in one SQL call
from your rdbms, not N, for efficiency. But still doable.]
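The conditional-extension trick can be sketched roughly like this (all names here are hypothetical illustrations, not actual Blacklight API):

```ruby
# Hypothetical sketch: MARC_STORE stands in for whatever db-backed
# lookup holds the raw MARC, keyed by document id.
MARC_STORE = {
  'bib1' => '00641nam a2200229Ia 4500...'
}

module DbBackedMarc
  def to_marc
    MARC_STORE[self['id']]
  end
end

class SolrDocument
  def initialize(fields)
    @fields = fields
    # the "Ruby trickery": extend only instances with look-up-able MARC,
    # so respond_to?(:to_marc) is false for everything else
    extend(DbBackedMarc) if MARC_STORE.key?(fields['id'])
  end

  def [](key)
    @fields[key]
  end
end

with_marc    = SolrDocument.new('id' => 'bib1')
without_marc = SolrDocument.new('id' => 'bib2')
with_marc.respond_to?(:to_marc)    # true
without_marc.respond_to?(:to_marc) # false
```

Because the extension happens per instance, callers never need a nil check; the method simply isn't there when no MARC is available.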
I hope to code up a draft proposed patch implementing this kind of
SolrDocument extension API "soon". I will need it for my Umlaut
integration too.
Jonathan
In my own (very limited and imprecise) profiling, it _appeared_ that
the only place a performance slowdown showed up when using MARC in
search results (in my case, from my Solr index) was not in parsing the
MARC, and not necessarily (apparently, at my index size) in Solr's
processing -- but actually in the extra time to transfer the MARC
records over the HTTP wire from Solr.
So I'm curious if grabbing it from a db instead might reduce even that
performance impact (which for my purposes is so far tolerable, even as
it is).
Jonathan
--
You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
To post to this group, send email to blacklight-...@googlegroups.com.
To unsubscribe from this group, send email to blacklight-develo...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.
Take another case that worries me - this search result set has only 4
results, but 123 item holdings:
http://forward.library.wisconsin.edu/catalog?q=%22in+defense+of+food%22
So we are doing 123 circulation availability lookups for a single
results page.
(This is because the book was used in our university's common book
program this year.)
In a lot of cases, usability work may refactor these concerns away, but
we are still mindful of our milliseconds.
-steve
--
> Take another case that worries me - this search result set has only 4
> results, but 123 item holdings:
>
> http://forward.library.wisconsin.edu/catalog?q=%22in+defense+of+food%22
>
> So we are doing 123 circulation availability lookups for a single results
> page.
I'm a little confused about this argument, really - I would think
you'd cut that frightening holdings display --way-- before giving up
the searching safety net that keeping the MARC record searchable
brings.
Anyway, it seems your holdings lookup could be done asynchronously via
AJAX or, if you're really concerned about usability/performance,
harvested into Solr as well so you could actually search on things
that are available.
If anything, this particular example shows me that you've got
performance to *spare* if you're able to synchronously look up 123
items and I barely noticed.
-Ross.
Also, we are preharvesting the item circ status data and loading it into
a db table. We do a daily extract and load of this data for the entire
set of item level holdings and put a btree index on the two columns
needed to look up an item. This set includes every copy of every
circulating item we have from all UW System schools. So this table has
circ status records for 15M things. Not a small table.
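Conceptually, the lookup side of that harvest works like this (an in-memory stand-in for illustration only; the real thing is a db table with a btree index, and bib_id/item_id are assumed names for the two key columns):

```ruby
# In-memory stand-in for the preharvested circ status table.
ROWS = [
  { bib_id: 12, item_id: 1, status: 'Checked out' },
  { bib_id: 12, item_id: 2, status: 'Available'   }
]

# A Hash keyed on [bib_id, item_id] plays the role of the btree index:
# one cheap lookup per item instead of a scan.
CIRC_INDEX = ROWS.each_with_object({}) do |row, index|
  index[[row[:bib_id], row[:item_id]]] = row[:status]
end

def circ_status(bib_id, item_id)
  CIRC_INDEX[[bib_id, item_id]]
end

circ_status(12, 2) # => "Available"
```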
Given our history with Voyager [1], our sysadmins are not comfortable
with us making live circ status lookups against the production tables. It
is also further complicated by the fact that we would need to make these
lookups against 14 separate instances of Voyager so it is not just a
single sysadmin team we would be dealing with. Instead we decided to
preharvest. (A JANGLE-fied ILS API would be fantastic here.)
The reason I brought up the example is that it illustrates (in my mind)
one of a few scale issues we are trying to be cognizant of. While we
aren't dealing with Hathi levels of data, doing the consortial index
does present a set of scale (and usability) issues that we weren't
previously thinking about in managing our OPAC, and so we are on the
lookout for all the places where we might keep things simple and fast.
-sm
[1] Last fall we tried to set up the newest version of the Voyager OPAC
at UW-Madison. It is a big switch from the old CGI-based version to a
newer Tomcat webapp. When classes started and production load returned
to its normal levels, the new version of the OPAC slowed to a crawl and
the app fell over. We ended up resolving the issue after a few weeks by
going back to the old version of the OPAC. This is a bit of background
context for our Forward project: we are trying to walk the fine line of
not prematurely optimizing while ensuring that when we do get real load,
we are ready for it.
For me, it doesn't seem to add unacceptable time to display, but I
haven't benchmarked fully yet, or profiled with care to see exactly
where the added time is coming from, if there is significant added time.
If it becomes a problem, I'd first try storing the MARC in a database
like you guys, but still rendering at display time even for the search
results page. And if that was STILL a problem, I might have to resort to
NOT parsing MARC for the search results page at display time. But I hope
it doesn't get to that; it's just SO convenient.
But to each their own! BL should (and kind of does, but getting better
all the time) support any of these implementation choices by the local
implementer.
Jonathan
Eric Larson wrote:
> I don't believe rendering a search result page requires parsing full objects. Good for you if you choose to do that, but there's just not a lot of info you can squeeze into a 500 x 150px result row div (your measurements may differ, slightly!). Title, contributor, year, format, location, call number, image -- and you're essentially out of room. Honestly, it's a lovely design limitation.
>
> Our index goals again are:
> 1) Small index
> 2) Fast index build time
> 3) Fast search results
>
> Another reason it makes sense for us to simplify our result page and index is because we're not just MARC. We've got a million-plus METS records waiting in fedora that hope to join the project soon. Ultimately, our Blacklight app will provide discovery for all sorts of metadata schemes, with data coming from many differing repositories.
>
> Cross-walking MARC / METS / DC / MODS / TEI / IEEE LOM / OME-XML... etc, you've got to cut to the point on the result page and push people out to views (or external apps) that intimately know our OME-XML collection. Views that specialize in object parsing and presentation.
>
> - - - -
>
> Now, touching the database for 123 circ holdings *is silly* -- we're gonna polish that soon.
>
> Cheers,
> - Eric
>
> --
> Eric Larson
> Digital Library Consultant
> UW Digital Collections Center
> ela...@library.wisc.edu
>
>
> Connect with us on:
> The Web: http://uwdc.library.wisc.edu
> Facebook: http://digital.library.wisc.edu/1711.dl/uwdc-fb
> Twitter: http://twitter.com/UWdigiCollec
> RSS: http://uwdc.library.wisc.edu/News/NewsFeed/UWDCNews.xml
Hmm, I don't either and maybe I'm misunderstanding what the argument
is about (certainly wouldn't be the first time).
I thought you were storing the marc record in the rdbms because you
weren't putting it in Solr, is that wrong? If so, disregard
everything after the following word: sorry.
If not, you don't have to use the MARC record in Solr for any aspect
of display, but is there no value in having the whole thing there just
for searching purposes?
-Ross.
As to your point below - correct me if I am missing it - about storing
everything in a raw MARC record for searching: our thought is that we
will eventually shape our MARC parsing/indexing rules to extract all
keywords into some part of the index. However, there is also a lot of
extra stuff that we don't want to store in Solr because it would just be
noise. A good example is the leader and directory for a given record:
00641nam a2200229Ia 45x0001001200000005001700012008004100029010001700070035
0016000870400013001030490009001160900026001251000028001
5124500520017925000120023126000630024330000210030650000
1100327546001500338852004600353994001200399
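For reference, the directory portion is pure bookkeeping that any MARC parser recomputes on serialization; each 12-character entry is a 3-char tag, a 4-char field length, and a 5-char starting offset. Pulling apart the first entry from the example above:

```ruby
# One MARC directory entry: tag (3 chars), field length (4), offset (5).
entry  = '001001200000'
tag    = entry[0, 3]      # "001"
length = entry[3, 4].to_i # 12
offset = entry[7, 5].to_i # 0
```

None of those numbers carry searchable meaning, which is why storing them in the index buys nothing.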
On a small scale, those 230 chars don't seem like much, but when they
are stored for 8 million records it amounts to 1.7G of data (if my math
is right) in the index. We just don't need that data stored in Solr and
would like to keep it out so we can try to run Solr with as much of our
index in memory as possible.
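A quick sanity check of that arithmetic:

```ruby
# 230 bytes of leader + directory per record, times 8 million records
overhead_gib = 230 * 8_000_000 / 1024.0**3
overhead_gib.round(2) # => 1.71
```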
-sm
Ross Singer wrote:
> If not, you don't have to use the MARC record in Solr for any aspect
> of display, but is there no value in having the whole thing there just
> for searching purposes?
>
> -Ross.
>
Jonathan,
Too much discussion isn't really needed here, other than to make
Blacklight pluggable for stored data sources, allowing flexibility in
where the data comes from.
Erik
No effect whatsoever.
> 2. What's the penalty if you *do* request the large stored field...
Definite I/O to retrieve that data, and if you're pulling that field
for a page of search results then Solr could be bouncing around randomly
through the stored field data to gather it all. Whether this is a
prohibitive hit will depend on several factors (is the index cached by
the operating system? network speed, speed of client and server, time to
parse on the client...).
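One easy way to avoid that hit on a results page is to list only the fields you need in Solr's standard fl parameter, so large stored fields never come back over the wire. A sketch (host, core path and field names are assumptions):

```ruby
require 'uri'

# Build a /select URL whose fl list excludes the big stored MARC field;
# only the named fields are returned in the response.
params = { q: 'in defense of food', fl: 'id,title_display,author_display' }
uri = URI('http://localhost:8983/solr/select')
uri.query = URI.encode_www_form(params)
uri.to_s # request this URL from your client of choice
```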
> 3. ...vs. pulling it in via an RDBMS.
Probably not a lot different from pulling it from a Solr stored
field. But the point that index growth is an issue makes an RDBMS
(or CouchDB, etc.) a great option for storing this stuff.
> I admit to not having done any testing because Solr returns stuff to
> me faster than I can process it as it is.
:)
+1
Erik