Blacklight EAD examples

Bess Sadler

Apr 14, 2010, 12:03:08 PM

to Blacklight Developers List

I've had a few questions off list about the Blacklight EAD examples. Most of the work that's been done with Blacklight for EADs was done by Matt Mitchell and me for the Northwest Digital Archives. Example URLs are:

http://nwda.projectblacklight.org/ (The main project website)
http://oregonstate.projectblacklight.org/ (The same website with different styling, triggered by the URL used to access it)
http://nwda.projectblacklight.org/?f[format_facet][]=Archival+Collection+Guide (A list of the archival collection guides)
http://nwda.projectblacklight.org/catalog/bcc_1-summary (The Bing Crosby Historical Society EAD Guide... note the navigation options for the guide in the left column)
http://nwda.projectblacklight.org/catalog/bcc_1-v (A list of individual items in the collection)
http://nwda.projectblacklight.org/catalog/bcc_1-v-7 (An independently discoverable archival item, which links to the guide of which it is a part)

Please note: display behavior is quite flexible, limited only by the data that is available in the documents being shown. If display for an item seems sparse, it is likely due to sparse metadata in the description, not a limitation on what can be done with the software.

This project is running under a version of Blacklight from last July, so some newer Blacklight features may be missing. I also set up Capistrano for server deployment, and the NWDA folks tested the deploy script successfully, but your mileage may vary.

The full code for the NWDA demo site is available at http://github.com/bess/northwest-digital-archives

I'd be happy to answer any questions, and I'd be particularly interested in discussing whether there is enough community interest to devote effort to folding EAD functionality back into the core blacklight.

Cheers,
Bess

Jonathan Rochkind

Apr 14, 2010, 12:18:56 PM

to blacklight-...@googlegroups.com

Bess Sadler wrote:
> I'd be happy to answer any questions, and I'd be particularly interested in discussing whether there is enough community interest to devote effort to folding EAD functionality back into the core blacklight.
>

Which the new document extension architecture should make it a lot
easier to do in a clean way! Hooray. I suspect there will be interest.
I think we are probably interested, eventually, in EAD display.

I am particularly interested in hearing more about how you've, if I
remember/understand right, set things up so an individual Solr Document
will have both MARC _and_ EAD attached to it. I'm interested in how you
manage to do this at the _indexing_ stage, without it being incredibly
slow. I've got a big file of MARC, and a big file of EAD, and _some_ of
the MARC records correspond to _some_ of the EAD records.... is this
your situation too? And if so, how do you get your indexer to figure out
at indexing time that an individual MARC record should be combined with
an individual EAD record?
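
One possible approach to the matching problem described above, sketched under assumptions the thread does not confirm: if each MARC record and each EAD file carries a shared identifier, the EAD set can be loaded into a hash once and each MARC record looked up in constant time while indexing. The `collection_id` key, the `title_display` and `ead_display` field names, and the sample records are all hypothetical stand-ins, not Blacklight's or NWDA's actual indexing code.

```ruby
# Hypothetical, pre-parsed inputs. In practice these would come from a MARC
# reader and an EAD parser; the :collection_id key is an assumption.
marc_records = [
  { collection_id: 'bcc_1', title: 'Bing Crosby collection' },
  { collection_id: 'xyz_9', title: 'A record with no finding aid' }
]
ead_records = [
  { collection_id: 'bcc_1', ead_xml: '<ead/>' }
]

# Build the lookup table once, so matching does not rescan the EAD set
# for every MARC record.
def index_by_id(records)
  records.each_with_object({}) { |rec, table| table[rec[:collection_id]] = rec }
end

# Produce one Solr-style document hash per MARC record, attaching the EAD
# only where a matching identifier exists.
def build_solr_docs(marc_records, ead_records)
  ead_by_id = index_by_id(ead_records)
  marc_records.map do |marc|
    doc = { id: marc[:collection_id], title_display: marc[:title] }
    ead = ead_by_id[marc[:collection_id]]
    doc[:ead_display] = ead[:ead_xml] if ead
    doc
  end
end

solr_docs = build_solr_docs(marc_records, ead_records)
```

The cost is one pass over each file plus memory for the lookup table. Whether a reliable shared identifier exists in the real data is the open question.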

Jonathan

Naomi Dushay

Apr 14, 2010, 1:34:05 PM

to blacklight-...@googlegroups.com

> I'd be particularly interested in discussing whether there is
> enough community interest to devote effort to folding EAD
> functionality back into the core blacklight.

+++++1

I think this has already come up about 5 or 6 times.

- Naomi

Jason Ronallo

Apr 14, 2010, 3:11:29 PM

to blacklight-...@googlegroups.com

>> I'd be particularly interested in discussing whether there is enough
>> community interest to devote effort to folding EAD functionality back into
>> the core blacklight.

I wasn't happy with the way the NWDA implementation breaks an EAD over
several Solr documents.

So I've worked on this some myself, but was not happy enough with my
own implementation to commit it to Blacklight. The initial problem I
ran into was that Nokogiri was segfaulting when applying stylesheets.
I was also unhappy that Nokogiri could only use XSLT 1 where my
existing stylesheets rely on XSLT 2.

I ended up doing all of my EAD display with Nokogiri parsing and Ruby
in place of XSLT. It is very slow, especially for long documents, but
I've implemented partial caching so that the EAD is only parsed the
first time the resource is requested. Also I do not display a lot of
the links and other content that is found within our EAD, so the
display is stripped down some. Since we refer to our collection guide
application for the full view of the finding aid we can get away with
that for now.
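
A minimal sketch of the parse-once idea described above, assuming a plain in-memory hash as a stand-in for a Rails fragment cache and Ruby's bundled REXML in place of Nokogiri (so it runs without extra gems). The `FindingAidRenderer` class and the sample EAD are illustrative only, not the code behind the NCSU site.

```ruby
require 'rexml/document'

# Stand-in for a fragment cache: maps a document id to rendered HTML.
# In a Rails application a real cache store would play this role.
class FindingAidRenderer
  attr_reader :parse_count

  def initialize
    @cache = {}
    @parse_count = 0
  end

  # Returns cached HTML when available; parses and renders otherwise.
  def render(id, ead_xml)
    @cache[id] ||= render_uncached(ead_xml)
  end

  private

  # The slow path: parse the EAD and build a deliberately minimal list of
  # unit titles. No HTML escaping is done here; real code would need it.
  def render_uncached(ead_xml)
    @parse_count += 1
    doc = REXML::Document.new(ead_xml)
    items = REXML::XPath.match(doc, '//unittitle').map { |t| "<li>#{t.text}</li>" }
    "<ul>#{items.join}</ul>"
  end
end

ead = '<ead><archdesc><did><unittitle>Papers</unittitle></did></archdesc></ead>'
renderer = FindingAidRenderer.new
first  = renderer.render('ua023_004', ead)
second = renderer.render('ua023_004', ead)
```

The second call returns the stored fragment without touching the parser, so only the first request for a given id pays the parsing cost.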

All of the EAD, which we call Collection Guides:
http://historicalstate.lib.ncsu.edu/catalog?commit=Search&f[type_facet][]=Collection+Guide&per_page=20&q=&qt=search

Here you can see a collection guide that includes images to give a
flavor of what is within that container:
http://historicalstate.lib.ncsu.edu/catalog/ua023_004
This is really the piece that made it worthwhile to do outside of XSLT.

I'm happy to work on this more for Blacklight, but I think it is a
matter of coming to some sort of agreement as to how we should
proceed. The one problem we're likely to run into is that everyone's
EAD is going to be different.

Jason

Jonathan Rochkind

Apr 14, 2010, 3:35:56 PM

to blacklight-...@googlegroups.com

Jason Ronallo wrote:
>
> I'm happy to work on this more for Blacklight, but I think it is a
> matter of coming to some sort of agreement as to how we should
> proceed. The one problem we're likely to run into is that everyone's
> EAD is going to be different.
>

I think maybe ideally someone's "EAD stuff" could be an extra-Blacklight
plugin, so agreement on how to proceed doesn't necessarily need to be
made. Jason can have an EAD plugin doing things how he wants, Bess can
have one doing things how she wants, both can be shareable with others.

To do THAT, to make a plugin possible, there may be certain 'hooks' in
Blacklight that need to be made more cleanly
over-rideable/customizable/hookable. The document extension stuff is
one such hook that hopefully will help. But the only real way to be sure
is to start trying to do it, and see where you can't cleanly hook into
Blacklight core from an external plugin, and then refactor/patch
Blacklight to provide those hooks cleanly. It's THAT part that will
definitely need agreement, and hopefully hooks necessary for one plugin
will make others possible too, if done right.

Does that make sense? What do you think?

Jonathan

Mark A. Matienzo

Apr 14, 2010, 3:44:51 PM

to blacklight-...@googlegroups.com

I guess it depends on what you mean by "EAD stuff." EAD is a lot more
flexible in its structure than something like MARC.
Both indexing and presentation code could be wildly different between
implementations.

There aren't really any "best practices" out there in terms of
indexing EAD, especially in Solr. I'd recommend possibly talking to
some others who are doing this as they take a couple of different
approaches. Off the top of my head, Yale, Duke, Columbia, and NYU are
all using Solr to index their EAD, but each of them does it somewhat
differently.

Jason said:
> I wasn't happy with the way the NWDA implementation breaks an EAD over
> several Solr documents.

Can you explain why?

Mark A. Matienzo
Digital Archivist, Manuscripts and Archives
Yale University Library


Jonathan Rochkind

Apr 14, 2010, 5:00:07 PM

to blacklight-...@googlegroups.com

I guess that's what I was thinking, and why there shouldn't need to be
any agreement about how to "do EAD", but instead the possibility of
sharing code for handling EAD as an extra-Blacklight plugin, so there
can be different ones doing things different ways. No?

Jason Ronallo

Apr 14, 2010, 5:10:38 PM

to blacklight-...@googlegroups.com

> Jason said:
>> I wasn't happy with the way the NWDA implementation breaks an EAD over
>> several Solr documents.
>
> Can you explain why?

Mostly for some implementation details.
- Each section involves a separate page load, sometimes for very
little content. Ajax could help with that.
http://nwda.projectblacklight.org/catalog/bcc_1-vii
- Take a look at the search results here:
http://nwda.projectblacklight.org/catalog?q=bing+crosby&qt=search&per_page=10&commit=search
Click on Format and see the facet values. Archival collection guide
lists 16 hits but clicking on it only displays 1. My guess is that this
has to do with the document collapsing that gets done.
- It takes away the ability to scan the page and use Ctrl-F to search
through containers and visually see the structure of the collection.

Jason

Ross Singer

Apr 14, 2010, 9:35:26 PM

to blacklight-...@googlegroups.com

On Wed, Apr 14, 2010 at 3:44 PM, Mark A. Matienzo
<mark.m...@gmail.com> wrote:
> I guess it depends on what you mean by "EAD stuff." EAD's a lot more
> flexible in terms of its structure as opposed to something like MARC.
> Both indexing and presentation code could be wildly different between
> implementations.
>

Indeed. I hope to high heaven I never have to actively work on a real
EAD-based project again.

That being said, a Ruby library for EAD might make this a lot less
painful (well, once it's done -- writing it sounds pretty awful,
honestly). I think Matt Zumwalt pitched this idea at Code4lib? Mark,
is there any prior art in another language (I'm thinking Python would
probably be the only analog, although I suppose if there's a Java
implementation that might be worth looking at)?

Jason, it's interesting that Nokogiri was such a performance dog for
you. I assume this has a lot to do with using a DOM parser? Would
Nokogiri's pull parsers bring any sort of improvement? I suppose it's
a balance, of course, since then you get into the hairy issue of
dealing with EAD's nonsense /AS IT'S COMING STRAIGHT AT YOU/, but it
made huge improvements in ruby-marc.
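
To illustrate the trade-off raised above, here is a minimal sketch of event-driven parsing, where the code has to track state itself instead of querying a finished tree. It uses REXML's stream listener from Ruby's bundled REXML as a stand-in; Nokogiri has its own streaming interfaces (its Reader and SAX parsers), which are not shown here. The `UnitTitleListener` class and the sample EAD are hypothetical.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects <unittitle> text as parse events arrive, without building a tree.
# The @in_title flag is the "state" a streaming approach forces you to manage.
class UnitTitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  # Called for every opening tag; start a new title buffer on <unittitle>.
  def tag_start(name, _attrs)
    return unless name == 'unittitle'
    @in_title = true
    @titles << +''
  end

  # Called for every run of character data; keep it only inside a title.
  def text(data)
    @titles.last << data if @in_title
  end

  # Called for every closing tag; stop collecting at </unittitle>.
  def tag_end(name)
    @in_title = false if name == 'unittitle'
  end
end

ead = <<~XML
  <ead><archdesc><dsc>
    <c01><did><unittitle>Correspondence</unittitle></did>
      <c02><did><unittitle>Letters, 1940</unittitle></did></c02>
    </c01>
  </dsc></archdesc></ead>
XML

listener = UnitTitleListener.new
REXML::Document.parse_stream(ead, listener)
```

After parsing, `listener.titles` holds the titles in document order. Anything beyond flat extraction, such as knowing which component a title belongs to, needs more bookkeeping of this kind.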

I'm also a little disappointed about XSLT. Did ruby-libxslt also have
these problems? (Obviously they'd also only be XSLT 1) Did you
consider JRuby? In a pinch you could also always use an external
transformation service (the Platform offers one, for example -- there
are others) and CACHE THAT BABY.

Obviously, the EAD... it lingers with you.

-Ross.

Jason Ronallo

Apr 15, 2010, 9:07:50 AM

to blacklight-...@googlegroups.com

On Wed, Apr 14, 2010 at 9:35 PM, Ross Singer <rossf...@gmail.com> wrote:
> Jason, it's interesting that Nokogiri was such a performance dog for
> you. I assume this has a lot to do with using a DOM parser? Would
> Nokogiri's pull parsers bring any sort of improvement? I suppose it's
> a balance, of course, since then you get into the hairy issue of
> dealing with EAD's nonsense /AS IT'S COMING STRAIGHT AT YOU/, but it
> made huge improvements in ruby-marc.

There were two separate problems.
1. Nokogiri and XSLT resulted in a segfault. Actually I had it working
at one point but when I made a necessary update to my Nokogiri gem or
made some other change it stopped working and I wasn't able or willing
to trace what was happening.
2. Once I went to Nokogiri + Ruby one bottleneck was Ruby. I was using
Ruby to iterate down through the nested structure and looping is slow.
Some of these EADs are rather long as well. I think looping down
through was the culprit. I also relied on a few different little
partials which may have affected things as well. Haven't done
benchmarks; these are all guesses. I am using xpaths for node
selection, so the pull parser would likely be faster and that's worth
consideration. At the time I wanted something simple without having to
worry about state and "EAD's nonsense."

> I'm also a little disappointed about XSLT. Did ruby-libxslt also have
> these problems? (Obviously they'd also only be XSLT 1) Did you
> consider JRuby? In a pinch you could also always use an external
> transformation service (the Platform offers one, for example -- there
> are others) and CACHE THAT BABY.

Once I realized there were things I wanted to do that were so much
easier to do in Ruby than XSLT (and seemed nearly impossible to do
in XSLT), XSLT alone was no longer an option. For instance I
wanted to be able to call out to Solr or a relational db to bring in
materials related to a particular container--didn't know how to do
that with XSLT. I have considered using a transformation service. We
have XTF set up for finding aids right now, but I haven't gotten back
to creating a stylesheet for just the page partial I need. Even then
I'd have to do some post-processing in Ruby.

And, yes, Rails' flexible caching saved the day. I've thought about
how I might create the finding aid page partial and cache it during
index time, so that even the first request is fast, but haven't gotten
around to that.

Jason

Ross Singer

Apr 15, 2010, 9:29:41 AM

to blacklight-...@googlegroups.com

Thanks, Jason, I think this is pretty informative. It also reminds me
a lot of conversations I had with Mark on this stuff 4 years ago.

This is going to remain a problem with rather unsatisfying solutions
as long as "library" developers are the ones tasked with building
these systems: the domain is too strange and the overall library
priority of special collections and archives is way too low to give it
the time and energy it needs to be done "right". Early enthusiasm and
good intentions give way pretty quickly to "well, let's just get it
working any way possible because we've got a bunch of higher priority
projects that are waiting on the completion of this".

The flip-side of this is that archivists and special collections
libraries tend to not have the expertise in house to do it themselves.

Cycle continues.

I'm not exactly sure how to break it, unfortunately.

-Ross.

Mark A. Matienzo

Apr 15, 2010, 10:35:55 AM

to blacklight-...@googlegroups.com

On Wed, Apr 14, 2010 at 5:10 PM, Jason Ronallo <jron...@gmail.com> wrote:
> - Each section involves a separate page load, sometimes for very
> little content. Ajax could help with that.
> http://nwda.projectblacklight.org/catalog/bcc_1-vii
> - Take a look at the search results here:
> http://nwda.projectblacklight.org/catalog?q=bing+crosby&qt=search&per_page=10&commit=search
> Click on Format and see the facet values. Archival collection guide
> lists 16 hits but clicking on it only displays 1. My guess this has to
> do with document collapsing that gets done.
> - It takes away scanning the page and Ctrl-F to be able to search
> through containers and visually see the structure of the collection.

OK, I'll admit these all make sense, and I think they're all
interrelated. The indexing and presentation implementations are
somewhat intertwined. Perhaps it makes sense to separate them out? I
think we need to index the individual components in some sort of
fashion, but I'm not sure what the best strategy would be.

To my knowledge, NYU splits the finding aid into multiple documents -
one for the overall collection, and one for each component. The search
results "in context" are loaded by AJAX in the search result (see
http://dlib.nyu.edu/findingaids/search/?q=test for example). However,
the presentation of each finding aid is ultimately handled by pointing
to a static HTML version that's been transformed using XSLT.
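
A rough sketch of the splitting strategy described above: one Solr-style hash for the collection and one per component, each component pointing back to its parent. It uses Ruby's bundled REXML, and the field names (`format_facet`, `parent_id`, `title_display`), the id scheme, and the sample finding aid are assumptions for illustration, not NYU's or NWDA's actual schema.

```ruby
require 'rexml/document'

# Splits one EAD finding aid into Solr-style hashes: one for the collection
# and one per component (<c>, <c01>, <c02>, ...), in document order.
def ead_to_solr_docs(ead_xml, collection_id)
  doc = REXML::Document.new(ead_xml)
  title_el = doc.root.elements['archdesc/did/unittitle']
  docs = [{ id: collection_id, format_facet: 'Collection',
            title_display: title_el && title_el.text }]

  n = 0
  doc.root.each_recursive do |el|
    # Component elements are named c or c01..c12 in EAD.
    next unless el.name.match?(/\Ac(\d\d)?\z/)
    n += 1
    unit = el.elements['did/unittitle']
    docs << { id: "#{collection_id}-#{n}", format_facet: 'Component',
              parent_id: collection_id, title_display: unit && unit.text }
  end
  docs
end

ead = <<~XML
  <ead><archdesc>
    <did><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c01><did><unittitle>Correspondence</unittitle></did>
        <c02><did><unittitle>Letters, 1940</unittitle></did></c02>
      </c01>
      <c01><did><unittitle>Photographs</unittitle></did></c01>
    </dsc>
  </archdesc></ead>
XML

docs = ead_to_solr_docs(ead, 'ex_1')
```

Keeping a `parent_id` on every component is what would let a search hit on a single item link back to its guide. It does not by itself address the facet-count and page-load concerns raised earlier in the thread.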

Another, more complicated option would be to use XML payloads in Solr.
I don't know much about this but Tricia Williams was using this
specifically to deal with providing search results in context for
digitized texts in TEI. See this issue on Apache's JIRA:
https://issues.apache.org/jira/browse/SOLR-380

Looking at Matt Mitchell's Raven ( http://github.com/mwmitchell/raven/
) has come up in conversation and has been on my todo list for a
while; perhaps we need to hack on it and see?

Mark A. Matienzo

Apr 15, 2010, 10:57:45 AM

to blacklight-...@googlegroups.com

On Thu, Apr 15, 2010 at 9:29 AM, Ross Singer <rossf...@gmail.com> wrote:
> Early enthusiasm and
> good intentions give way pretty quickly to "well, let's just get it
> working any way possible because we've got a bunch of higher priority
> projects that are waiting on the completion of this".

I don't mean to dwell on this, but this issue isn't unique to library
developers dealing with archives or special collections projects.
Competing priorities are a common issue for developers in general. The
hard part is making stuff abstract enough and getting people to
collaborate (or to commit to collaborating).

> I'm not exactly sure how to break it, unfortunately.

There are two ways to start, both of which are larger in scope than
just Blacklight. I've been intending to have some sort of meeting or
teleconference to discuss Solr indexing strategies for EAD. I
initially was thinking that I wanted to cover both Solr and (non-Solr)
Lucene implementations but that had the potential to derail things a
bit. I think starting to survey the community and getting this
discussion going would be a good start.

Also, EAD is getting ready to go through another revision process (see
http://listserv.loc.gov/cgi-bin/wa?A2=ind1003&L=ead&T=0&P=12317 for
more info). Part of the work for the revision will be evaluating and
revising its structure, potentially with a "loose" and "strict"
version; the latter would be for more programmatic access. Full
disclosure: I'm on one of the groups involved in the EAD revision (the
Schema Development Team).

Jonathan Rochkind

Apr 15, 2010, 11:06:18 AM

to blacklight-...@googlegroups.com

Mark A. Matienzo wrote:
>
> Also, EAD is getting ready to go through another revision process (see
> http://listserv.loc.gov/cgi-bin/wa?A2=ind1003&L=ead&T=0&P=12317 for
> more info). Part of the work for the revision will be evaluating and
> revising its structure, potentially with a "loose" and "strict"
> version; the latter would be for more programmatic access.

Not so much about Blacklight anymore, but I predict if there's a loose
and strict version, then almost all the EAD we actually encounter is
going to end up being the 'loose' version. Because if the people
generating EAD understood the benefit of programmatic access, EAD
wouldn't look like it does in the first place.

Why not just revise its structure so anything that's EAD is necessarily
more programmatically accessible than the current version? What is the
benefit of a non-programmatically-accessible EAD, over just writing HTML
in the first place?

Jonathan

Mark A. Matienzo

Apr 15, 2010, 11:45:35 AM

to blacklight-...@googlegroups.com

I've brought this issue up on the EAD listserv:
http://listserv.loc.gov/cgi-bin/wa?A2=ind1004&L=ead&D=1&T=0&O=D&P=5035

Mark

Jason Ronallo

Apr 15, 2010, 11:55:27 AM

to blacklight-...@googlegroups.com

On Thu, Apr 15, 2010 at 10:57 AM, Mark A. Matienzo
<mark.m...@gmail.com> wrote:
> Also, EAD is getting ready to go through another revision process (see
> http://listserv.loc.gov/cgi-bin/wa?A2=ind1003&L=ead&T=0&P=12317 for
> more info). Part of the work for the revision will be evaluating and
> revising its structure, potentially with a "loose" and "strict"
> version; the latter would be for more programmatic access. Full
> disclosure: I'm on one of the groups involved the EAD revision (the
> Schema Development Team).

I've talked a little to folks here about how a constrained
implementation of EAD would benefit us. The problem is we have so many
EAD encoded finding aids that are all over the place, don't conform to
current practice and are probably low priority for rearranging. So
whatever I do for the foreseeable future, I'll still be confronted
with "loose" EADs.

Jason

Mark A. Matienzo

Apr 15, 2010, 12:15:25 PM

to blacklight-...@googlegroups.com

On Thu, Apr 15, 2010 at 11:55 AM, Jason Ronallo <jron...@gmail.com> wrote:
> I've talked a little to folks here about how a constrained
> implementation of EAD would benefit us. The problem is we have so many
> EAD encoded finding aids that are all over the place, don't conform to
> current practice and are probably low priority for rearranging. So
> whatever I do for the foreseeable future, I'll still be confronted
> with "loose" EADs.

It's not so much an issue of rearranging them as of normalizing
the XML. Here at Yale we have seven different repositories
contributing EAD, and anything that makes it into our search and
presentation system has to be validated against a locally-constrained
Relax-NG schema. I know some other places are taking a similar
approach, like Indiana (which I think is using Schematron, fwiw).

Mark
