http://nwda.projectblacklight.org/ (The main project website)
http://oregonstate.projectblacklight.org/ (The same website with different styling, triggered by the URL used to access it)
http://nwda.projectblacklight.org/?f[format_facet][]=Archival+Collection+Guide (A list of the archival collection guides)
http://nwda.projectblacklight.org/catalog/bcc_1-summary (The Bing Crosby Historical Society EAD Guide... note the navigation options for the guide in the left column)
http://nwda.projectblacklight.org/catalog/bcc_1-v (A list of individual items in the collection)
http://nwda.projectblacklight.org/catalog/bcc_1-v-7 (An independently discoverable archival item, which links to the guide of which it is a part)
Please note: display behavior is quite flexible, limited only by the data that is available in the documents being shown. If display for an item seems sparse, it is likely due to sparse metadata in the description, not a limitation on what can be done with the software.
This project is running under a version of Blacklight from last July, so some newer Blacklight features may be missing. I also set up Capistrano for server deployment, and the NWDA folks tested the deploy script successfully, but your mileage may vary.
The full code for the NWDA demo site is available at http://github.com/bess/northwest-digital-archives
I'd be happy to answer any questions, and I'd be particularly interested in discussing whether there is enough community interest to devote effort to folding EAD functionality back into core Blacklight.
Cheers,
Bess
I am particularly interested in hearing more about how you've, if I
remember/understand right, set things up so an individual Solr Document
will have both Marc _and_ EAD attached to it. I'm interested in how you
manage to do this at the _indexing_ stage, without it being incredibly
slow. I've got a big file of MARC, and a big file of EAD, and _some_ of
the marc records correspond to _some_ of the EAD records.... is this
your situation too? And if so, how do you get your indexer to figure out
at indexing time that an individual MARC record should be combined with
an individual EAD record?
Jonathan
+++++1
I think this has already come up about 5 or 6 times.
- Naomi
I wasn't happy with the way the NWDA implementation breaks an EAD over
several Solr documents.
So I've worked on this some myself, but was not happy enough with my
own implementation to commit it to Blacklight. The initial problem I
ran into was that Nokogiri was segfaulting when applying stylesheets.
I was also unhappy that Nokogiri could only use XSLT 1 where my
existing stylesheets rely on XSLT 2.
I ended up doing all of my EAD display with Nokogiri parsing and Ruby
in place of XSLT. It is very slow, especially for long documents, but
I've implemented partial caching so that the EAD is only parsed the
first time the resource is requested. Also I do not display a lot of
the links and other content that is found within our EAD, so the
display is stripped down some. Since we refer to our collection guide
application for the full view of the finding aid we can get away with
that for now.
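The "parse once, cache the rendering" approach Jason describes can be sketched roughly like this. This is a hypothetical standalone illustration (the class name, cache layout, and `render` block are all made up, not NCSU's actual code; a Rails app would more likely use fragment caching via `Rails.cache.fetch`): the expensive Nokogiri/Ruby transform runs only on the first request, and later requests read the cached HTML fragment from disk.

```ruby
require "digest"
require "tmpdir"

# Illustrative sketch of caching a rendered EAD fragment so the slow
# parse/transform only happens on the first request for a finding aid.
class EadFragmentCache
  def initialize(dir)
    @dir = dir
  end

  def fetch(ead_id, raw_xml)
    key  = Digest::SHA1.hexdigest(ead_id + raw_xml)
    path = File.join(@dir, "#{key}.html")
    return File.read(path) if File.exist?(path) # cache hit: skip parsing
    html = yield(raw_xml)                       # slow parse + transform
    File.write(path, html)
    html
  end
end

calls = 0
cache = EadFragmentCache.new(Dir.mktmpdir)
2.times do
  cache.fetch("bcc_1", "<ead/>") { |xml| calls += 1; "<div>#{xml}</div>" }
end
# the expensive transform block ran only once; the second request hit the cache
```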
All of the EAD, which we call Collection Guides:
http://historicalstate.lib.ncsu.edu/catalog?commit=Search&f[type_facet][]=Collection+Guide&per_page=20&q=&qt=search
Here you can see a collection guide that includes images to give a
flavor of what is within that container:
http://historicalstate.lib.ncsu.edu/catalog/ua023_004
This is really the piece that made it worthwhile to do outside of XSLT.
I'm happy to work on this more for Blacklight, but I think it is a
matter of coming to some sort of agreement as to how we should
proceed. The one problem we're likely to run into is that everyone's
EAD is going to be different.
Jason
I think maybe ideally someone's "EAD stuff" could be an extra-Blacklight
plugin, so agreement on how to proceed doesn't necessarily need to be
made. Jason can have an EAD plugin doing things how he wants, Bess can
have one doing things how she wants, and both can be shareable with others.
To do THAT, to make a plugin possible, there may be certain 'hooks' in
Blacklight that need to be made more cleanly
over-rideable/customizable/hookable. The document extension stuff is
one such hook that hopefully will help. But the only real way to be sure
is to start trying to do it, see where you can't cleanly hook into
Blacklight core from an external plugin, and then refactor/patch
Blacklight to provide those hooks cleanly. It's THAT part that will
definitely need agreement, and hopefully hooks necessary for one plugin
will make others possible too, if done right.
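For what the "document extension" hook pattern looks like in miniature: a plugin registers a module plus a condition, and the module gets mixed into only those documents that match. The sketch below is a self-contained imitation of that pattern (the `Document` class, `EadBehavior` module, and field names are all invented for illustration; it is not Blacklight's actual `SolrDocument.use_extension` implementation):

```ruby
# A plugin-provided module with EAD-specific behavior.
module EadBehavior
  def collection_title
    self[:ead_title_display]
  end
end

# A minimal document class with an extension registry, mimicking the
# hook style discussed above: core code never mentions EAD directly.
class Document
  @extensions = []

  class << self
    attr_reader :extensions

    def use_extension(mod, &condition)
      @extensions << [mod, condition || ->(_) { true }]
    end
  end

  def initialize(fields)
    @fields = fields
    # Mix in each registered extension whose condition matches this doc.
    self.class.extensions.each do |mod, condition|
      extend(mod) if condition.call(self)
    end
  end

  def [](key)
    @fields[key]
  end
end

# An external EAD plugin registers itself without patching core:
Document.use_extension(EadBehavior) { |doc| doc[:format] == "ead" }

ead  = Document.new(format: "ead", ead_title_display: "Bing Crosby Papers")
marc = Document.new(format: "marc")
# ead gains collection_title; marc is untouched
```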
Does that make sense? What do you think?
Jonathan
There aren't really any "best practices" out there in terms of
indexing EAD, especially in Solr. I'd recommend possibly talking to
some others who are doing this as they take a couple of different
approaches. Off the top of my head, Yale, Duke, Columbia, and NYU are
all using Solr to index their EAD, but each of them does it somewhat
differently.
Jason said:
> I wasn't happy with the way the NWDA implementation breaks an EAD over
> several Solr documents.
Can you explain why?
Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library
Mostly for some implementation details.
- Each section involves a separate page load, sometimes for very
little content. Ajax could help with that.
http://nwda.projectblacklight.org/catalog/bcc_1-vii
- Take a look at the search results here:
http://nwda.projectblacklight.org/catalog?q=bing+crosby&qt=search&per_page=10&commit=search
Click on Format and see the facet values. Archival collection guide
lists 16 hits but clicking on it only displays 1. My guess is this has
to do with the document collapsing that gets done.
- It takes away the ability to scan the page and Ctrl-F through the
containers while visually seeing the structure of the collection.
Jason
Indeed. I hope to high heaven I never have to actively work on a real
EAD-based project again.
That being said, a Ruby library for EAD might make this a lot less
painful (well, once it's done -- writing it sounds pretty awful,
honestly). I think Matt Zumwalt pitched this idea at Code4lib? Mark,
is there any prior art in another language (I'm thinking Python would
probably be the only analog, although I suppose if there's a Java
implementation that might be worth looking at)?
Jason, it's interesting that Nokogiri was such a performance dog for
you. I assume this has a lot to do with using a DOM parser? Would
Nokogiri's pull parsers bring any sort of improvement? I suppose it's
a balance, of course, since then you get into the hairy issue of
dealing with EAD's nonsense /AS IT'S COMING STRAIGHT AT YOU/, but it
made huge improvements in ruby-marc.
I'm also a little disappointed about XSLT. Did ruby-libxslt also have
these problems? (Obviously they'd also only be XSLT 1) Did you
consider JRuby? In a pinch you could also always use an external
transformation service (the Platform offers one, for example -- there
are others) and CACHE THAT BABY.
Obviously, the EAD... it lingers with you.
-Ross.
There were two separate problems.
1. Nokogiri and XSLT resulted in a segfault. Actually I had it working
at one point but when I made a necessary update to my Nokogiri gem or
made some other change it stopped working and I wasn't able or willing
to trace what was happening.
2. Once I went to Nokogiri + Ruby one bottleneck was Ruby. I was using
Ruby to iterate down through the nested structure and looping is slow.
Some of these EADs are rather long as well. I think looping down
through was the culprit. I also relied on a few different little
partials, which may have affected things as well. Haven't done
benchmarks; these are all guesses. I am using xpaths for node
selection, so the pull parser would likely be faster and that's worth
consideration. At the time I wanted something simple without having to
worry about state and "EAD's nonsense."
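For comparison, here is roughly what the pull-parser approach looks like: walk the EAD as a stream of events and collect `<unittitle>` text without ever building a DOM, which is where the memory and time cost of long finding aids comes from. This sketch uses stdlib REXML so it is self-contained; `Nokogiri::XML::Reader` offers the same event-style interface with much better performance. The trade-off Ross and Jason mention is visible even here: you have to track state (`inside`) by hand instead of writing an XPath.

```ruby
require "rexml/parsers/pullparser"

# Stream through an EAD and collect container titles without a DOM.
def unit_titles(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  titles, inside = [], false
  while parser.has_next?
    event = parser.pull
    if event.start_element? && event[0] == "unittitle"
      inside = true
    elsif event.end_element? && event[0] == "unittitle"
      inside = false
    elsif event.text? && inside
      titles << event[0].strip
    end
  end
  titles
end

ead = <<~XML
  <ead><archdesc><dsc>
    <c01><did><unittitle>Correspondence</unittitle></did></c01>
    <c01><did><unittitle>Scrapbooks</unittitle></did></c01>
  </dsc></archdesc></ead>
XML
unit_titles(ead)  # => ["Correspondence", "Scrapbooks"]
```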
> I'm also a little disappointed about XSLT. Did ruby-libxslt also have
> these problems? (Obviously they'd also only be XSLT 1) Did you
> consider JRuby? In a pinch you could also always use an external
> transformation service (the Platform offers one, for example -- there
> are others) and CACHE THAT BABY.
Once I realized there were things I wanted to do that were so much
easier to do in Ruby than XSLT (and it seemed nearly impossible to do
in XSLT), XSLT was no longer an option for using alone. For instance I
wanted to be able to call out to Solr or a relational db to bring in
materials related to a particular container--didn't know how to do
that with XSLT. I have considered using a transformation service. We
have XTF set up for finding aids right now, but I haven't gotten back
to creating a stylesheet for just the page partial I need. Even then
I'd have to do some post-processing in Ruby.
And, yes, Rails' flexible caching saved the day. I've thought about
how I might create the finding aid page partial and cache it during
index time, so that even the first request is fast, but haven't gotten
around to that.
Jason
This is going to remain a problem with rather unsatisfying solutions
as long as "library" developers are the ones tasked with building
these systems: the domain is too strange and the overall library
priority of special collections and archives is way too low to give it
the time and energy it needs to be done "right". Early enthusiasm and
good intentions give way pretty quickly to "well, let's just get it
working any way possible because we've got a bunch of higher priority
projects that are waiting on the completion of this".
The flip-side of this is that archivists and special collections
libraries tend to not have the expertise in house to do it themselves.
Cycle continues.
I'm not exactly sure how to break it, unfortunately.
-Ross.
OK, I'll admit these all make sense, and I think they're all
interrelated. The indexing and presentation implementations are
somewhat intertwined. Perhaps it makes sense to separate them out? I
think we need to index the individual components in some sort of
fashion, but I'm not sure what the best strategy would be.
To my knowledge, NYU splits the finding aid into multiple documents -
one for the overall collection, and one for each component. The search
results "in context" are loaded by AJAX in the search result (see
http://dlib.nyu.edu/findingaids/search/?q=test for example). However,
the presentation of each finding aid is ultimately handled by pointing
to a static HTML version that's been transformed using XSLT.
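The multiple-documents strategy Mark describes can be sketched as follows: one Solr document for the collection plus one per component, with each child carrying its parent's id so results can link back "in context". This is a hedged illustration only; the field names (`id`, `parent_id_s`, `type_s`, `title_t`) and the flat `c01`-level split are invented for the example and are not NYU's actual schema, which would also need to handle nested `c02`/`c03` components.

```ruby
require "rexml/document"

# Split one EAD into a collection-level Solr doc plus one doc per
# top-level component, each pointing back at its parent collection.
def ead_to_solr_docs(xml, collection_id)
  ead   = REXML::Document.new(xml)
  title = REXML::XPath.first(ead, "//titleproper")&.text
  docs  = [{ id: collection_id, type_s: "collection", title_t: title }]
  REXML::XPath.match(ead, "//c01").each_with_index do |c, i|
    docs << { id: "#{collection_id}-c#{i + 1}",
              parent_id_s: collection_id,
              type_s: "component",
              title_t: REXML::XPath.first(c, ".//unittitle")&.text }
  end
  docs
end

sample = <<~XML
  <ead>
    <eadheader><filedesc><titlestmt>
      <titleproper>Bing Crosby Papers</titleproper>
    </titlestmt></filedesc></eadheader>
    <archdesc><dsc>
      <c01><did><unittitle>Correspondence</unittitle></did></c01>
      <c01><did><unittitle>Scrapbooks</unittitle></did></c01>
    </dsc></archdesc>
  </ead>
XML
docs = ead_to_solr_docs(sample, "bcc_1")
# => one collection doc and two component docs linked via parent_id_s
```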
Another, more complicated option would be to use XML payloads in Solr.
I don't know much about this but Tricia Williams was using this
specifically to deal with providing search results in context for
digitized texts in TEI. See this issue on Apache's JIRA:
https://issues.apache.org/jira/browse/SOLR-380
Looking at Matt Mitchell's Raven ( http://github.com/mwmitchell/raven/
) has come up in conversation and has been on my todo list for a
while; perhaps we need to hack on it and see?
I don't mean to dwell on this, but it isn't an issue unique to library
developers dealing with archives or special collections projects. It's
a common issue for developers in general: the hard part is making
things abstract enough and getting people to collaborate (or to commit
to collaborating).
> I'm not exactly sure how to break it, unfortunately.
There are two ways to start, both of which are larger in scope than
just Blacklight. I've been intending to have some sort of meeting or
teleconference to discuss Solr indexing strategies for EAD. I
initially was thinking that I wanted to cover both Solr and (non-Solr)
Lucene implementations but that had the potential to derail things a
bit. I think starting to survey the community and getting this
discussion going would be a good start.
Also, EAD is getting ready to go through another revision process (see
http://listserv.loc.gov/cgi-bin/wa?A2=ind1003&L=ead&T=0&P=12317 for
more info). Part of the work for the revision will be evaluating and
revising its structure, potentially with a "loose" and "strict"
version; the latter would be for more programmatic access. Full
disclosure: I'm on one of the groups involved in the EAD revision (the
Schema Development Team).
Not so much about Blacklight anymore, but I predict if there's a loose
and strict version, then almost all the EAD we actually encounter is
going to end up being the 'loose' version. Because if the people
generating EAD understood the benefit of programmatic access, EAD
wouldn't look like it does in the first place.
Why not just revise its structure so anything that's EAD is necessarily
more programmatically accessible than the current version? What is the
benefit of non-programmatically-accessible EAD over just writing HTML
in the first place?
Jonathan
Mark
I've talked a little to folks here about how a constrained
implementation of EAD would benefit us. The problem is we have so many
EAD encoded finding aids that are all over the place, don't conform to
current practice and are probably low priority for rearranging. So
whatever I do for the foreseeable future, I'll still be confronted
with "loose" EADs.
Jason
It's not so much of an issue of rearranging them as it is in terms of
normalizing the XML. Here at Yale we have seven different repositories
contributing EAD, and anything that makes it into our search and
presentation system has to be validated against a locally-constrained
Relax-NG schema. I know some other places are taking a similar
approach, like Indiana (which I think is using Schematron, fwiw).
Mark