Content Search API implementations?

Jason Ronallo

Jun 29, 2016, 2:48:51 PM
to iiif-d...@googlegroups.com
What current implementations exist of the Content Search API? Are
there demonstrations or production sites where we can see it in
action? I'm interested in both client-side and server-side
implementations. What version of the spec do they follow?

Knowing about the client-side applications will help me determine
which viewer I might be able to use. I suspect that UV and Mirador
have or are interested in implementing the search API on the
client-side, right? Are there others?

What are folks doing on the server side? Is there already a
server-side application anyone has developed which is just responsible
for indexing text and providing results and autocomplete?

If I've got OCR and transcripts for many different types of content
(books, newspapers, video) is there a good application for indexing
this data and providing a simple search inside service? I'm
considering whether to build this functionality into an existing
application or break this functionality out into its own service that
begins with just the minimum necessary to work with a client-side
implementation. Of course, I don't want to build if I can adopt something
that would work for us.

I'm also interested in collecting any useful links on Content Search
API implementations or tutorials for
https://github.com/IIIF/awesome-iiif

Thank you,

Jason

Jeffrey C. Witt

Jun 29, 2016, 5:30:24 PM
to IIIF Discuss
Hi Jason,

Server side, I'm using eXist-db to transform TEI-encoded texts into annotation lists that meet the IIIF search spec.

UV supports search, and there is a Mirador branch that supports basic search (but not autocomplete) that I think will eventually be worked into the 2.2 release.

Here's a live version of Mirador with the search panel working: http://reactmirador.herokuapp.com/#/plaoulcommentary. Any of these manuscripts should show a "search tab" when loaded, and search results should load in the side panel.

We may get a chance to look at this briefly on the Mirador call tomorrow. 

Let me know if you have any questions or if the link doesn't work.

Best,
jw

Tom Crane

Jun 29, 2016, 5:30:27 PM
to IIIF Discuss
Hi Jason,

The UV is a client of the search API, but currently only for the "q" parameter.
Likewise, the Wellcome Library provides a search service for all of its OCRed content, but only honours the "q" parameter:



we explicitly assert that we ignore params that are mentioned in the spec but not supported by this server:


This isn't quite the 1.0 spec; we need to bring it up to date. But there's not much difference.
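
As a rough illustration of a "q"-only service, here is a minimal Python sketch of how a server might detect the spec-defined parameters it does not honour and report them in the response. The four parameter names come from the 1.0 spec; the `within`/`ignored` layout and the `@context` values follow one reading of that spec and should be checked against it. The function names and example URLs are hypothetical, not taken from the Wellcome service.

```python
from urllib.parse import urlsplit, parse_qs

# Query parameters defined by the Content Search API 1.0 spec.
SPEC_PARAMS = {"q", "motivation", "date", "user"}
# Assumption: this hypothetical server only honours "q".
SUPPORTED_PARAMS = {"q"}


def ignored_params(request_url):
    """Return the spec-defined parameters in the request that this server ignores."""
    query = parse_qs(urlsplit(request_url).query, keep_blank_values=True)
    return sorted((set(query) & SPEC_PARAMS) - SUPPORTED_PARAMS)


def response_skeleton(request_url):
    """Build an empty annotation list that reports any ignored parameters."""
    response = {
        # Assumption: context values as used in the 1.0 spec examples.
        "@context": [
            "http://iiif.io/api/presentation/2/context.json",
            "http://iiif.io/api/search/1/context.json",
        ],
        "@id": request_url,
        "@type": "sc:AnnotationList",
        "resources": [],
    }
    ignored = ignored_params(request_url)
    if ignored:
        response["within"] = {"@type": "sc:Layer", "ignored": ignored}
    return response
```

Parameters the spec does not define (for example a stray `foo=1`) are simply left out of the `ignored` list in this sketch.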

There are a few server side implementations of the Search API besides Wellcome. I know Jeff Witt has one, the BL are working on one, the National Library of Wales have one and I'm sure others can chip in here. Jeff also had search for Mirador: https://groups.google.com/forum/#!msg/iiif-discuss/hg2XGG0vaIM/xhY7O4nhBwAJ.

Your other questions raise interesting points. The Content Search API is about searching "within" a IIIF resource (typically a manifest). The results are annotations that match the search params (and are returned as an annotation list, with decoration for clients that can show more UI for search results). A matching annotation might be a page-level transcription, or a comment, or a tag (search for all annos made by user x between date1 and date2 that have the "identifying" motivation). But in the examples above, the annotations returned aren't like this - they are ContentAsText annotations that paint the canvas, but they didn't exist as annotations on the server in advance. They are dynamically created at runtime so that single terms and phrases can be highlighted, rather than paragraphs or whole pages. 

Implementations like this might have METS-ALTO OCR data in the background, with word-level rectangle information. The METS-ALTO-derived text might be indexed in Solr (possibly storing the xywh information in the index). The annotations are created dynamically when the server returns a page of search results. 
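
The dynamic approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit dictionaries (with `canvas`, `word` and `xywh` keys) stand in for whatever an index such as Solr would return, and the annotation `@id` scheme and function names are hypothetical.

```python
def word_annotation(search_url, index, canvas_id, word, box):
    """Create a painting annotation for one matched word.

    box is (x, y, w, h) in canvas pixel coordinates.
    """
    x, y, w, h = box
    return {
        "@id": f"{search_url}#hit-{index}",  # hypothetical id scheme
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {"@type": "cnt:ContentAsText", "chars": word},
        # Target only the region of the canvas covered by the word.
        "on": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


def build_annotation_list(search_url, hits):
    """Wrap per-word annotations in an annotation list for one page of results.

    hits is a list of dicts with hypothetical keys "canvas", "word", "xywh",
    as they might be stored alongside the indexed text.
    """
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": search_url,
        "@type": "sc:AnnotationList",
        "resources": [
            word_annotation(search_url, i, hit["canvas"], hit["word"], hit["xywh"])
            for i, hit in enumerate(hits)
        ],
    }
```

Nothing here is stored in advance: the annotations exist only in the response, which is what allows single terms to be highlighted rather than whole pages.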

Contrast this with a Search API implementation that indexes the content of an annotation server. Maybe this uses Solr too, but we're always going to return an existing annotation as a result - an annotation that some person or machine made earlier and stored in the anno server.

Ideally you want to deliver both kinds of results from the same service - maybe using the motivation param to retrieve different kinds of annos. We have a task on our roadmap to develop/implement a "unified" service like this (and include the deferred "box" param that was in the 0.9 spec, or a variant of it), and I'd like to do that as a self-contained server solution that's easy to roll out if you have a pile of METS-ALTO over here and an annotation server over there. Let me know if you'd like to know more, maybe we can compare requirements.

Here's a recent slide deck about the Search API and what it's searching:


Tom

Jeffrey C. Witt

Jun 29, 2016, 5:38:18 PM
to IIIF Discuss
Mirador search-within branch is here: https://github.com/jeffreycwitt/mirador/tree/feature/2.1-search-within

It currently supports all query parameters including paging. (But my search service only supports the "q" and "page" parameters.)

jw

Jason Ronallo

Jun 29, 2016, 9:01:50 PM
to iiif-d...@googlegroups.com
Tom,

Yes, right now I'm only interested in dynamic ContentAsText
annotations from indexed OCR.

We currently have search inside, but it only gets you to the page and
shows snippets. In order to replace this with a IIIF-compliant
service, I only need to get that far at first. I want the minimal
viable search response, then from there we could add other features
and annotation types.

Try "albee":
http://d.lib.ncsu.edu/collections/catalog/technician-v59n50-1979-01-26

Yes, the pile-of-OCR search server is what I'm after. The furthest I've
gotten in migrating to IIIF search inside is using an existing API,
our IIIF image server, and tesseract to OCR pages and output a text
file, hOCR, and a PDF. [1] The next step is to build a Solr or
Elasticsearch index from this pile of OCR.
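
hOCR output of the kind mentioned above carries word-level boxes, so one possible first step toward such an index is pulling out each word with its coordinates. The following is a minimal sketch using only the Python standard library. It assumes words are marked up as `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute, that no spans are nested inside a word span, and that the OCRed image has the same pixel dimensions as the canvas; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Turn an hOCR title such as 'bbox 36 92 96 116; x_wconf 90' into (x, y, w, h)."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            x0, y0, x1, y1 = map(int, parts[1:])
            return (x0, y0, x1 - x0, y1 - y0)
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, (x, y, w, h)) for every ocrx_word span in an hOCR page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Return a list of (word, (x, y, w, h)) tuples from an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Each tuple could then be stored in the index next to the page identifier, which is enough to build `xywh` targets for highlighting later on.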

This made me think that serving up search inside for OCR could be a
standalone service--and possibly something useful enough for others to
use for simple use cases. I know I can't get away with just static
content like a level 0 image server, but what's the closest I can get
to a "level 0" IIIF content search server for OCR? Just returning OCR
text hits doesn't seem like it needs to know anything else about the
resource or content so it could stand on its own.

Beyond the basics, highlighting words on the image would be nice.
Eventually we may get to wanting to search other types of annotations,
but that can wait until we've migrated.

So that's what I'm thinking of right now.

Jason

[1] Hope is that as open OCR engines improve (or we license a better
scriptable OCR engine) we can automate updating our OCR.

Tom Crane

Jun 30, 2016, 5:49:01 AM
to IIIF Discuss
Hi Jason,

You might have enough data there to shim a search API implementation on top, with the result annotations targeting the whole canvas (no xywh fragment) and the result context (snippet) returned as an oa:TextQuoteSelector:


However, if you can only translate to page-level contentAsText annotations at present, this will be a bit unwieldy, and I'm not sure what client support there is for oa:TextQuoteSelector (none in the UV yet). 
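
A minimal Python sketch of that page-level shim follows, assuming the only data available is the plain OCR text of each page. The hit structure (`search:Hit` with `annotations` and `selectors`) follows one reading of the 1.0 spec; the function names, the annotation id, and the 30-character snippet width are hypothetical choices. Only the first match on the page is reported, and the case-insensitive match assumes lowercasing does not change the length of the text.

```python
def text_quote_selector(page_text, term, context=30):
    """Describe the first match of term in page_text as an oa:TextQuoteSelector."""
    start = page_text.lower().find(term.lower())
    if start == -1:
        return None
    end = start + len(term)
    return {
        "@type": "oa:TextQuoteSelector",
        "exact": page_text[start:end],
        "prefix": page_text[max(0, start - context):start],
        "suffix": page_text[end:end + context],
    }


def page_level_hit(anno_id, canvas_id, page_text, term):
    """Return (annotation, hit) for a page, or None if the term is absent.

    The annotation targets the whole canvas (no xywh fragment); the hit
    carries the snippet that locates the term within the page text.
    """
    selector = text_quote_selector(page_text, term)
    if selector is None:
        return None
    annotation = {
        "@id": anno_id,
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {"@type": "cnt:ContentAsText", "chars": page_text},
        "on": canvas_id,
    }
    hit = {
        "@type": "search:Hit",
        "annotations": [anno_id],
        "selectors": [selector],
    }
    return annotation, hit
```

Because the annotation body is the full page text, responses grow quickly for text-heavy pages, which is part of why this approach is unwieldy.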

Tom

