IIIF Search Implementation

Ashma Shrestha

Dec 30, 2016, 10:27:47 AM

to Universal Viewer

Hi all,

I am new to Universal Viewer and am evaluating it for our new project. I have generated the presentation manifest and have the basic features working. However, I want to evaluate the search feature of Universal Viewer as well. I have the ALTO-XML OCR output of the images available. I am not sure where to start, so a little guidance would be helpful. Is there a working search service which I can reference? Are you indexing the OCR and the coordinates in Solr? Is there any tool available for extracting the annotations from ALTO-XML?

Thank you and hope you have a great new year.

Edward Silverton

Jan 2, 2017, 7:15:37 AM

to Universal Viewer

Hi Ashma,

I have posted this for comment on the UV #general Slack channel as it's not really my area of expertise :-)

Here's a link to join (if you wish to): https://universalviewerinvite.herokuapp.com

We need to improve documentation around this subject on the wiki. Will post suggestions here for reference.

Regards,

Ed

Edward Silverton

Jan 2, 2017, 9:17:36 AM

to Universal Viewer

Andy Irving at the BL recommends this:

https://github.com/NCSU-Libraries/ocracoke/blob/master/README.md

--
You received this message because you are subscribed to the Google Groups "Universal Viewer" group.
To unsubscribe from this group and stop receiving emails from it, send an email to universalviewer+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jason Ronallo

Jan 18, 2017, 5:08:57 PM

to ashma.s...@gmail.com, Universal Viewer

Ashma,

I think most folks now are implementing their own thing in a way that's not easily sharable. I'd like to see more implementations of the Search API that can be reused.

If you're interested in trying Ocracoke, let me know. There's nothing yet that will take ALTO (I'm using hOCR) and extract out the information that's needed to get search to work, but that ought to be easy enough. If you get on the IIIF Slack and ask in the #newspapers channel you might find others who are extracting data out of ALTO. With Ocracoke at least you could look at the types of files that it creates and how it indexes in Solr for an idea of how you could implement something. I used a basic technique that the Library of Congress had used. It ought to be possible (and if it isn't I should correct it) to use just the search piece of Ocracoke without using the whole OCR pipeline as well when you have your own OCR already.

You can check here for other implementations of the Search API, but currently Ocracoke is the only one I know of:
https://github.com/IIIF/awesome-iiif#content-search-api

Hope that helps.

Best,

Jason

Christopher Johnson

Jan 19, 2017, 2:57:45 AM

to Universal Viewer, ashma.s...@gmail.com

Hi,

I have developed a process, using Fedora and SPARQL, for generating the annotation sets that the Search API requires in the UV. I presented it at SWIB16 in November and you can see the video here.

It also depends on hOCR, but I plan on creating a module in the Pandora Modeller with a CLI for ALTO in the near future.

Cheers,

Christopher

Ashma Shrestha

Jan 19, 2017, 10:23:17 AM

to Universal Viewer, ashma.s...@gmail.com

Thanks Jason. Yes, Ocracoke is a great starting point for understanding the workflow and getting some ideas for Solr configuration. It seems like a lot of implementations are based on hOCR rather than ALTO-XML. I would be curious to know if anyone here is using ALTO-XML. I thought the Wellcome Library's implementation was based on ALTO-XML. However, I guess that if I can get the ALTO-XML to annotation step covered, I should be able to leverage a lot from Ocracoke.

Ashma

Ashma Shrestha

Jan 19, 2017, 10:26:33 AM

to Universal Viewer, ashma.s...@gmail.com

Interesting work, Christopher. I will keep an eye out for your module. However, for some reason I couldn't load the YouTube video. It might be an internal network issue; I will check on it later, but thanks for the good work.

Tom Crane

Jan 19, 2017, 12:03:09 PM

to Universal Viewer, ashma.s...@gmail.com

Dear Ashma,

The Wellcome search within implementation is powered by ALTO, but not by Elasticsearch or Solr. It is fairly simplistic and builds a map of word coordinates in ALTO to generate annotations dynamically. The code is in C# and not separated out into a clean module, but I could send it to you if you are interested.
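To give a rough idea of the word-map technique, here is a minimal Python sketch. It is not the Wellcome C# code; it only assumes that each ALTO String element carries CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes, that the coordinates are already in pixels, and that one ALTO file corresponds to one canvas. The function names, the sample ALTO and the example.org URIs are made up for illustration, and the response is modelled loosely on the simple AnnotationList form of the IIIF Content Search API 1.0.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import quote

# Tiny hypothetical ALTO document used only for this illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Universal" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
    <String CONTENT="Viewer" HPOS="420" VPOS="200" WIDTH="180" HEIGHT="40"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def build_word_map(alto_xml):
    """Map each lowercased word to the (x, y, w, h) boxes where it occurs."""
    word_map = defaultdict(list)
    root = ET.fromstring(alto_xml)
    for el in root.iter():
        # Compare local names so any ALTO namespace version is accepted.
        if el.tag.rsplit("}", 1)[-1] != "String":
            continue
        # Assumption: coordinates are pixel values (possibly written as floats).
        box = tuple(int(float(el.get(k)))
                    for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        word_map[el.get("CONTENT").lower()].append(box)
    return word_map


def search(word_map, query, canvas_uri, search_uri):
    """Build an annotation list for exact, case-insensitive single-word matches."""
    resources = []
    for i, (x, y, w, h) in enumerate(word_map.get(query.lower(), [])):
        resources.append({
            "@id": f"{search_uri}/anno/{i}",  # hypothetical annotation URI scheme
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {"@type": "cnt:ContentAsText", "chars": query},
            # The xywh fragment tells the viewer which region to highlight.
            "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{search_uri}?q={quote(query)}",
        "@type": "sc:AnnotationList",
        "resources": resources,
    }


word_map = build_word_map(SAMPLE_ALTO)
result = search(word_map, "viewer",
                "https://example.org/canvas/1",
                "https://example.org/search")
```

A real service would also need to handle multi-word queries, hyphenated words split across lines, and ALTO files whose measurement unit is not pixels; none of that is covered here.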

Another approach is to convert the ALTO to hOCR so that it can be used with Ocracoke:

https://github.com/UB-Mannheim/ocr-fileformat (and other similar tools)
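For a sense of what such a conversion involves at the word level, here is a small Python sketch. It is not how ocr-fileformat works internally and does not replace it; it only illustrates the coordinate change, assuming ALTO gives a top-left corner plus width and height in pixels while hOCR expects a "bbox x0 y0 x1 y1" in its title attribute. The function name is hypothetical.

```python
from html import escape


def alto_string_to_hocr_word(attrs):
    """Turn the attributes of one ALTO String element into an hOCR word span."""
    # ALTO: top-left corner (HPOS, VPOS) plus WIDTH and HEIGHT.
    x0 = int(float(attrs["HPOS"]))
    y0 = int(float(attrs["VPOS"]))
    # hOCR: top-left and bottom-right corners.
    x1 = x0 + int(float(attrs["WIDTH"]))
    y1 = y0 + int(float(attrs["HEIGHT"]))
    text = escape(attrs["CONTENT"])  # keep the span valid if the word has & or <
    return f"<span class='ocrx_word' title='bbox {x0} {y0} {x1} {y1}'>{text}</span>"


span = alto_string_to_hocr_word(
    {"CONTENT": "Viewer", "HPOS": "420", "VPOS": "200", "WIDTH": "180", "HEIGHT": "40"}
)
```

A full conversion also has to carry over the page, block and line structure, which is why an existing tool is the safer route.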

We are actually working right now on an Elasticsearch-powered IIIF Content Search API implementation (written in Python this time). I'll ask one of my colleagues to chip in here with some details.

Tom

Ashma Shrestha

Jan 19, 2017, 3:20:28 PM

to Universal Viewer, ashma.s...@gmail.com

Hi Tom,

Thanks for your response. I am very much interested in your implementation, and it would be great if you could share some code. At this point we are still in the evaluation process, so I want to be aware of all the possible options and which of them have been proven.

Ashma
