text granularity: more examples available from NCSU Libraries

Jason Ronallo

Oct 7, 2017, 12:49:27 PM
to iiif-d...@googlegroups.com
NCSU Libraries now publishes annotation lists at the word, line, and
paragraph levels for each page/canvas of our resources that make
available a content search service.

You can see all the resources here that provide content search:
https://d.lib.ncsu.edu/collections/catalog?f%5Bfulltext_bs%5D%5B%5D=true
Take the URL for any of the show views and append "/manifest". Then
look for the "otherContent" property on a canvas in the JSON.
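The canvas-to-annotation-list linkage can be sketched in a few lines of Python, assuming a Presentation API 2.x manifest. The URIs and labels below are illustrative placeholders, not real NCSU identifiers:

```python
# Minimal excerpt of a IIIF Presentation 2.x manifest (structure only;
# the URIs here are made up for illustration).
manifest = {
    "@type": "sc:Manifest",
    "sequences": [{
        "canvases": [{
            "@id": "https://example.org/canvas/1",
            "otherContent": [
                {"@id": "https://example.org/annotation-list/1-word",
                 "@type": "sc:AnnotationList", "label": "Word-level annotations"},
                {"@id": "https://example.org/annotation-list/1-line",
                 "@type": "sc:AnnotationList", "label": "Line-level annotations"},
            ],
        }],
    }],
}

# Walk every canvas and collect the linked annotation lists.
for sequence in manifest.get("sequences", []):
    for canvas in sequence.get("canvases", []):
        for anno_list in canvas.get("otherContent", []):
            print(canvas["@id"], anno_list["@id"])
```

In a real client you would fetch the manifest over HTTP first; the traversal is the same.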

Or if you want a Collection of manifests:
https://d.lib.ncsu.edu/collections/catalog/manifest?f[fulltext_bs][]=true

This data is based on OCR created by Tesseract and extracted from the
resulting hOCR. Our manifests also link to the hOCR and plain text via
seeAlso to allow for further exploration and reuse of the OCR. A
searchable PDF rendering is also available for each resource.
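The seeAlso linkage looks roughly like the sketch below; the URIs, labels, and "format" values are assumptions for illustration, not our actual values:

```python
# Hypothetical seeAlso entries on a canvas, linking OCR derivatives.
# The URIs and "format" values are illustrative assumptions.
see_also = [
    {"@id": "https://example.org/ocr/0001.hocr", "format": "text/html",
     "label": "hOCR"},
    {"@id": "https://example.org/ocr/0001.txt", "format": "text/plain",
     "label": "Plain text OCR"},
]

# A client can pick a derivative by media type rather than
# guessing from human-readable labels.
plain_text = [e["@id"] for e in see_also if e["format"] == "text/plain"]
print(plain_text)
```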

The ability to create multiple annotation lists at these granularities
has now been added to Ocracoke:
https://github.com/NCSU-Libraries/ocracoke
Thanks to Alison Blaine here at NCSU Libraries for adding this
functionality to Ocracoke.

One of the issues when viewing the annotations in Mirador is that, for
pages with a lot of text, it can be rather resource intensive. CPU
utilization spikes, and my relatively fast, new machine spins the fans
all the way up. Is this just me? You can try this manifest, for
instance, on some of the pages with the most text:
https://d.lib.ncsu.edu/collections/catalog/technician-2002-04-10/manifest

It would be nice to have some way in Mirador to select which
annotation lists to turn on or off. Or what's the right performance
optimization when there are lots of annotations, regardless of whether
they are segmented into lists by text granularity or otherwise?

The big question left here is how to note the granularity of an
annotation list. Each of the annotation lists we've published has a
label. Should we be adding this information somewhere else as well?
The labels would work fine for humans selecting a list, but a machine
should not make assumptions based on labels, right? And how can a
client keep a certain kind of annotation turned on when the page is
turned? Should a client group annotation lists by label for a toggle,
so that lists with an enabled label are fetched again after each page
turn? That would avoid downloading all of the lists. Should a property
noting text granularity live at the annotation list level? Or is
putting a property at the annotation level a sufficient optimization?
That would require downloading and searching through every annotation
in every list.
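To make the trade-off concrete, here is a sketch of the list-level option. The property name "textGranularity" and the URIs are hypothetical; the point is only that a machine-readable property would let a client keep one granularity enabled across page turns without fetching the other lists:

```python
# Hypothetical otherContent references carrying a list-level
# "textGranularity" property (property name is an assumption).
other_content = [
    {"@id": "https://example.org/list/p1-word", "label": "Word-level",
     "textGranularity": "word"},
    {"@id": "https://example.org/list/p1-line", "label": "Line-level",
     "textGranularity": "line"},
    {"@id": "https://example.org/list/p1-para", "label": "Paragraph-level",
     "textGranularity": "paragraph"},
]

def lists_to_fetch(refs, wanted="line"):
    """Return only the annotation lists the user has toggled on,
    so a viewer can skip downloading the rest after a page turn."""
    return [ref["@id"] for ref in refs
            if ref.get("textGranularity") == wanted]

print(lists_to_fetch(other_content))
```

With a property only at the annotation level, the filter above would first have to download and scan every list, which is exactly the cost we'd like to avoid.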

Please let me know if you have any questions. I won't be in Toronto,
but we wanted to make this available now to help with the discussion
there.
https://docs.google.com/document/d/1mtSoA8r0wofi4xtFYjM-gCEaj7aypG0NgoMFCCFq_RY/edit

Jason