TEI Indexing

Eric

Oct 21, 2010, 1:49:25 PM

to Blacklight Development

Hello,

We are trying to index a very simple Text Encoding Initiative (TEI)
document.

rake solr:tei:index TEI_FILE=data/sampletei.xml SOLR_WAR_PATH=jetty/webapps/solr.war

The result is:

rake aborted!
Don't know how to build task 'solr:tei:index'

I was wondering if there is any information out there about how to
handle TEI files. The Blacklight documentation indicates that it can
handle TEI, but I have not been able to find any details concerning
how. Does it require a special configuration of solr?

Thanks,

--Eric

Jonathan Rochkind

Oct 25, 2010, 12:47:48 PM

to blacklight-...@googlegroups.com

I don't believe there is anything built into core Blacklight to handle
TEI. If you let me know where in the BL documentation it suggested it
"could handle TEI files", I'll remove it or make it more clear.

It may or may not require special configuration of solr, depending on
how you implement it. It will definitely require a tool that can index
TEI files to Solr -- or one of the existing Solr XML indexer tools (I
think one ships with Solr maybe?) configured for TEI and how you'd like
to implement it. Blacklight itself doesn't ship with any special tool
for indexing TEI; as you found out, there is no "solr:tei:index" rake task.
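
As a rough illustration of what such a tool might do, here is a minimal Ruby sketch that pulls a few fields out of a TEI file with REXML and builds a hash shaped like a Solr document. The helper name `tei_to_solr_doc` and the field names (`title_t`, `author_t`, `text`) are assumptions made for illustration, not anything Blacklight or Solr defines. A real indexer would follow your own Solr schema and then send the hash to Solr with whatever client you use.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of Blacklight): extract a few fields from a
# TEI document and return a hash shaped like a Solr document.
# The field names below are assumptions; match them to your own schema.
def tei_to_solr_doc(xml, id)
  doc = REXML::Document.new(xml)

  # Match on local-name() so the lookups work whether or not the file
  # declares the TEI namespace.
  title  = REXML::XPath.first(doc, "//*[local-name()='titleStmt']/*[local-name()='title']")
  author = REXML::XPath.first(doc, "//*[local-name()='titleStmt']/*[local-name()='author']")
  body   = REXML::XPath.first(doc, "//*[local-name()='text']")

  # Collect every text node under <text>; whitespace is normalized below.
  full_text = body ? REXML::XPath.match(body, './/text()').map(&:value).join(' ') : ''

  {
    :id       => id,
    :title_t  => title  ? title.text.to_s.strip  : nil,
    :author_t => author ? author.text.to_s.strip : nil,
    :text     => full_text.gsub(/\s+/, ' ').strip
  }
end
```

Something like `tei_to_solr_doc(File.read('data/sampletei.xml'), 'sampletei')` would then produce one document hash per TEI file. Because TEI is so free-form, the XPath expressions are the part most likely to need changing for a given collection.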

There are other people who have been doing TEI in BL; perhaps some of
them will see this and give you some ideas. You could also try searching
the listserv archives. I'm not sure whether this has come up before (at
first I thought it had, but I think I was confusing it with something
else).

Jonathan

Bess Sadler

Oct 25, 2010, 1:13:11 PM

to blacklight-...@googlegroups.com

Hi, Eric.

When I was working at UVA we indexed TEI into Blacklight, so that might be where you saw a reference to it. There isn't anything pre-written, because TEI tends to be pretty free-form. One of the fields we indexed for TEI was a URL, so that the Blacklight entry could give the user a link to our actual TEI presentation application (XTF).

These, for example, are all TEI documents in Blacklight: http://search.lib.virginia.edu/?f%5Bdigital_collection_facet%5D%5B%5D=UVa+Text+Collection&sort=date_received_facet+desc

I can't find a code example of indexing TEI documents, but you might want to take a look at some examples of indexing other arbitrary XML. Take a look at the code for the Northwest Digital Archives, for example:

http://github.com/bess/northwest-digital-archives/

especially in lib/nwda-lib

In summary, it's quite possible, but you have to write it yourself. Other people have done it, though, and I'm sure they would be happy to answer questions.

Bess

> --
> You received this message because you are subscribed to the Google Groups "Blacklight Development" group.
> To post to this group, send email to blacklight-...@googlegroups.com.
> To unsubscribe from this group, send email to blacklight-develo...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/blacklight-development?hl=en.
>

Bill Parod

Oct 25, 2010, 5:58:03 PM

to blacklight-...@googlegroups.com

We'd like to have a Google sitemap in place when we launch our Blacklight site. I noticed some posts about generating one in blacklight-development, but it seems like there isn't an implementation available. So I took a crack at it and it seems to work. Since I'm new to Rails I thought I'd post it here for review in case it's in terribly bad form in some way. Since there's not much to it (route added, sitemap_controller.rb, and index.erb and _document_list.erb views), I'll just drop it in this email for a quick read:

Thanks for any tips that come to mind. Please be blunt and indulge any desire to pick nits!

Thanks,

Bill

------------------------------------------------------------------------

blacklight-app/config/routes.rb:

added:

map.resources :sitemaps

------------------------------------------------------------------------

blacklight-app/app/controllers/sitemap_controller.rb:

class SitemapController < ApplicationController
  include Blacklight::SolrHelper

  def index
    all_query = { 'id:' => '*', 'rows' => '50000'}
    all_docs = Blacklight.solr.find(all_query)
    document_list = all_docs.docs.collect {|doc| SolrDocument.new(doc)}
    (@response, @document_list) = [all_docs, document_list]

    respond_to do |format|
      format.xml { render :layout => false }
    end
  end
end

------------------------------------------------------------------------

blacklight-app/app/views/sitemap/index.erb:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <%= render :partial=>'document_list' %>
</urlset>

------------------------------------------------------------------------

blacklight-app/app/views/sitemap/_document_list.erb:

<% @document_list.each_with_index do |document,counter| %>
  <url>
    <loc>http://hostname/catalog/<%= document[:id] %></loc>
  </url>
<% end %>

------------------------------------------------------------------------

Bill Parod

Library Technology Division - Enterprise Systems

Northwestern University Library

bill-...@northwestern.edu

847 491 5368

Jonathan Rochkind

Oct 25, 2010, 6:13:42 PM

to blacklight-...@googlegroups.com

Cool, thanks for sharing Bill. Once we're happy with it, I'd love it if
you sent this as a patch to BL. I'm imagining that one would run a rake
task 'rake generate_google_sitemap' or something to generate the sitemap.

Some people thought that you might run into performance problems paging
through ALL the rows in the db with an ordinary Solr find -- Solr isn't
really very good at this kind of 'deep paging', it's been kind of
optimized against it. But you haven't run into problems? How many docs
do you have in your solr? How long does it take to generate? It was
suggested that if there IS a problem with this, then instead of using an
ordinary solr query, you should use the Solr 'terms' component on the
'id' field.
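
A minimal sketch of that idea, assuming the `id` field is an indexed string field and that the Solr instance exposes the TermsComponent through some request handler. The parameters `terms.fl`, `terms.limit`, `terms.sort`, `terms.lower` and `terms.lower.incl` are TermsComponent parameters; the `fetch` callable, which would perform the HTTP request and pull the ids out of the response, is hypothetical and left to the caller.

```ruby
require 'uri'

# Build the query string for one "page" of TermsComponent results.
# last_seen is the final id from the previous page (nil for the first page).
def terms_query(field, last_seen = nil, page_size = 1000)
  params = {
    'terms'       => 'true',
    'terms.fl'    => field,
    'terms.sort'  => 'index',   # index (lexical) order, so terms.lower can resume
    'terms.limit' => page_size
  }
  if last_seen
    params['terms.lower']      = last_seen
    params['terms.lower.incl'] = 'false'   # do not repeat the last id seen
  end
  URI.encode_www_form(params)
end

# Walk every id, one page at a time. `fetch` is a caller-supplied callable
# (hypothetical) that takes a query string and returns an array of ids.
def each_id(fetch, field = 'id', page_size = 1000)
  last = nil
  loop do
    ids = fetch.call(terms_query(field, last, page_size))
    break if ids.empty?
    ids.each { |id| yield id }
    last = ids.last
  end
end
```

Each request resumes from the last id seen rather than from a row offset, which is what sidesteps the deep-paging cost of an ordinary query.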

You are hard-coding "http://hostname/catalog/<%= document[:id] %>" as
your URL. First, shouldn't that be your actual hostname instead of the
hard-coded string 'hostname'? Second, you should ideally be using Rails
routing methods to generate this URL, so it'll be right even if the
local app changes the routing. One way to do that is with:
<%= catalog_index_url( document[:id] ) %>

I think having the sitemap/_document_list partial is overkill in this
case; it'll be simpler and easier to understand with just ONE template.
However, I wonder about using the Rails controller system for this at
all. What you've done will generate the sitemap dynamically every
time a certain URL is hit. But this is a very expensive operation, so it
makes more sense to write it to disk as a static file every once in a
while (say nightly). So I wouldn't use a Rails controller at all; I'd
just write some Ruby code that gets triggered by a rake task to generate
the static file(s). (Another reason is that you need to check that
you're not over the maximum number of URLs or bytes for a
sitemap.xml file, and if you are, split it into multiple files with a
master one referencing them all, as per the sitemap spec.) You _could_
use ERB templates in a non-Rails-controller ordinary Ruby file like
this, if you needed to, but this is simple enough that you might not
even need an ERB template; just generating strings might be enough.

Outside of a Rails controller, my earlier advice to use Rails routing
might be harder to follow. If necessary, the URL could be built by hand
instead of using Rails routing, but then the "prefix"
(http://something/catalog/) should be in a parameter, rather than
hard-coded, so it's easy to change, and so the code can be shared.
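
A minimal sketch of the static-file approach, assuming the document ids have already been collected into an array. The method name `write_sitemaps`, the file names, and the example.org prefixes are all made up for illustration. The 50,000 default is the per-file URL limit from the sitemaps.org protocol; the byte-size limit is not checked here.

```ruby
require 'erb'
require 'fileutils'

# Hypothetical generator: writes sitemap1.xml, sitemap2.xml, ... plus a
# sitemap_index.xml that references them, and returns the sitemap file names.
# catalog_prefix and site_prefix are passed in rather than hard-coded.
def write_sitemaps(ids, catalog_prefix, site_prefix, out_dir, per_file = 50_000)
  FileUtils.mkdir_p(out_dir)
  names = []

  # Split the ids into chunks that each fit in one sitemap file.
  ids.each_slice(per_file).with_index(1) do |chunk, n|
    name = "sitemap#{n}.xml"
    File.open(File.join(out_dir, name), 'w') do |f|
      f.puts '<?xml version="1.0" encoding="UTF-8"?>'
      f.puts '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
      chunk.each do |id|
        url = catalog_prefix + ERB::Util.url_encode(id.to_s)
        f.puts "  <url><loc>#{ERB::Util.html_escape(url)}</loc></url>"
      end
      f.puts '</urlset>'
    end
    names << name
  end

  # The index file lists every sitemap file written above.
  File.open(File.join(out_dir, 'sitemap_index.xml'), 'w') do |f|
    f.puts '<?xml version="1.0" encoding="UTF-8"?>'
    f.puts '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    names.each do |name|
      f.puts "  <sitemap><loc>#{ERB::Util.html_escape(site_prefix + name)}</loc></sitemap>"
    end
    f.puts '</sitemapindex>'
  end

  names
end
```

A rake task could call something like `write_sitemaps(ids, 'http://example.org/catalog/', 'http://example.org/', 'public')` nightly. Since nothing in it depends on Rails, the same code could serve a non-Rails, Solr-based site.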

Hmm, I guess that was a whole bunch of advice/critique, in the end, I
think you probably ought to be doing things somewhat differently. Hope
it makes sense and is helpful.

Jonathan


Jonathan Rochkind

Oct 25, 2010, 6:16:41 PM

to blacklight-...@googlegroups.com

Actually, some googling found this page with tips on how to use Rails
Routing (like catalog_index_url) from a rake task, rather than a Rails
controller:

http://www.treyconnell.com/rails-restful-helpers-rake-tasks

Jonathan Rochkind

Oct 25, 2010, 6:25:36 PM

to blacklight-...@googlegroups.com

And you _could_ keep the Rails controller and use Rails template caching,
so that the sitemap is served from a cache instead of being generated
dynamically on every single call.

But where you're going to run into trouble there is with the need to
split a sitemap into multiple parts if it's too large, I think the Rails
controller/view system is going to make that needlessly complex if
possible at all, and you'll be better off just writing ruby code in a
rake task, not using rails controller.

Bill Parod

Oct 25, 2010, 6:32:10 PM

to blacklight-...@googlegroups.com

Hi Jonathan,

This is all very helpful. I've wondered too about generating a static file instead of creating a Rails controller for this purpose. The site we're launching is a finding aid site with only ~ 300 documents indexed. So this approach works ok in this case but perhaps is not a good pattern as it belies the issues you raise that would be found at more typical scale.

The hostname/routing tip is a good one. I knew that didn't seem right but wasn't sure how to generate full urls. I believe the sitemap schema requires full urls in the <loc> element.

Likewise, use of the partial seemed a little fussy to me too, but I'm still learning these idioms and so am being newbie doctrinaire. Thanks for suggesting not to bother with it.

In the end, especially after getting your feedback, I think I too am inclined to generate a static file. I have another site that needs a sitemap and isn't in Rails, though it is Solr-based. So it will be useful to have a single solution for both.

Thanks again for your quick review,

Bill

Bill Parod

847 491 5368

Jonathan Rochkind

Oct 25, 2010, 7:01:39 PM

to blacklight-...@googlegroups.com

That 'hacky' solution may well work fine for your 300 document site. But
if you decide to work on a more general purpose solution that works fine
for millions of documents too (performance wise, and worrying about
sitemap size limits), then please please share it back with the core BL
project! Keep us updated.


Michael Slone

Oct 25, 2010, 7:19:57 PM

to blacklight-...@googlegroups.com

On Mon, Oct 25, 2010 at 12:47 PM, Jonathan Rochkind <roch...@jhu.edu> wrote:
> I don't believe there is anything built into core Blacklight to handle TEI.
> If you let me know where in the BL documentation it suggested it "could
> handle TEI files", I'll remove it or make it more clear.

The FAQ page at http://projectblacklight.org/?page_id=3 states:

"Currently, Blacklight can index, search, and provide faceted browsing
for MaRC records and several kinds of XML documents, including TEI,
EAD, and GDMS."

This is perhaps ambiguous but appears to suggest that Blacklight
directly handles TEI, among other formats.

--
Michael Slone

Jonathan Rochkind

Oct 25, 2010, 8:37:16 PM

to blacklight-...@googlegroups.com

Michael Slone wrote:
> The FAQ page at http://projectblacklight.org/?page_id=3 states:
> "Currently, Blacklight can index, search, and provide faceted browsing
> for MaRC records and several kinds of XML documents, including TEI,
> EAD, and GDMS."
>
> This is perhaps ambiguous but appears to suggest that Blacklight
> directly handles TEI, among other formats.
>

Yes, I think that's misleading.

But I still don't have any access to editing the WP.

I think there's actually a buncha misleading stuff in the FAQ.

Bess, how's the project to transfer this to GitHub Pages going? In the
meantime, can someone give me access to edit the WP? I'm going to edit
the heck out of it though, so it might be better to have the revision
control we'll have when it's in git, but it seems like that could be a
while.

I don't think we do ourselves or anyone else any favors by
over-promising what BL currently easily provides out of the box, and I
think the WP pages do that in several places.

Jonathan
