Hello

Wulin Teo

Jun 18, 2014, 8:24:39 PM

to wellcom...@googlegroups.com

Hi

First of all, I am so excited to learn about the Wellcome Player. I would like to use it for the presentation of my thesis.

I am even more excited about the search-within function and the Seadragon/Deep Zoom feature.

I have installed the Wellcome Player on my computer and built it with Grunt, and it works locally. Like others, I would like to add the search-within function to the Wellcome Player. I have read the thread, but I am not sure I understood it.

These are my questions:

1. I have to export the OCR document as an ALTO XML file. How do I convert it from PDF to ALTO?

2. About generating the packages: do you have any recommendations on how to make creating them easier? (Is there a JSON editor you would recommend for such a task?)

Thanks

Tom Crane

Jun 19, 2014, 1:30:03 PM

to wellcom...@googlegroups.com

Hi Wulin Teo,

We're glad you like it.

As you can see from http://player.digirati.co.uk/digitising.html, digitising a book by hand and producing the tile and metadata assets (package) is a laborious process, but it is possible to produce a completely static site featuring a book viewer. However, a book digitised this way isn't searchable.

In the Wellcome Digital Library, everything the player consumes is produced dynamically, by a server-side application. The Library's digitisation workflow produces the following assets:

JPEG 2000 images
METS files (http://www.loc.gov/standards/mets/) that describe the digital objects
METS-ALTO files (http://www.loc.gov/standards/alto/) that record the OCR data derived from the images that comprise the digital object

These three types of object, along with information from the library catalogue, are the raw materials that are used by the Wellcome Library's server-side application (the "Digital Delivery System" or DDS) to produce the tiles and package data, and also to provide a search service.

Take this example:

http://wellcomelibrary.org/player/b1803469x

The package file, http://wellcomelibrary.org/package/b1803469x, was produced dynamically from a METS file
Each tile was produced dynamically, on-the-fly, from an Image Server - http://wellcomelibrary.org/dz/b1803469x/0/bc1f87f6-ecd0-4dc2-a316-7bbd89f39f27.jp2_files/12/4_3.jpg
When you do a search within the book, the player calls a web service that uses text position data derived from the METS-ALTO files produced for this book - http://wellcomelibrary.org/service/search/b1803469x/0?t=tangled - this is used to place markers and highlight the text.
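The tile URL in the example appears to follow the Deep Zoom layout, `<image>_files/<level>/<column>_<row>.jpg`. As a rough illustration only (not the DDS code, which has not been released), the sketch below rebuilds that URL and estimates how many tiles a level contains. The image dimensions and the 256-pixel tile size are assumptions, and tile overlap is ignored.

```python
import math


def max_level(width: int, height: int) -> int:
    """Highest Deep Zoom level, i.e. the level holding the full-size image."""
    return math.ceil(math.log2(max(width, height)))


def tile_url(base: str, level: int, column: int, row: int, ext: str = "jpg") -> str:
    """Build a tile URL in the '<image>_files/<level>/<column>_<row>.<ext>' layout."""
    return f"{base}_files/{level}/{column}_{row}.{ext}"


def tiles_at_level(width: int, height: int, level: int, tile_size: int = 256):
    """Return (columns, rows) of tiles at a level. tile_size is an assumption."""
    # Each level below the maximum halves the image dimensions.
    scale = 2 ** (max_level(width, height) - level)
    level_width = math.ceil(width / scale)
    level_height = math.ceil(height / scale)
    return math.ceil(level_width / tile_size), math.ceil(level_height / tile_size)


if __name__ == "__main__":
    base = ("http://wellcomelibrary.org/dz/b1803469x/0/"
            "bc1f87f6-ecd0-4dc2-a316-7bbd89f39f27.jp2")
    # Level 12, column 4, row 3, as in the tile URL quoted above.
    print(tile_url(base, 12, 4, 3))
    # Hypothetical 4096 x 3000 pixel page image.
    print(tiles_at_level(4096, 3000, 12))
```

For a static site, tiles in this layout are cut in advance and saved as files; an image server produces the same paths on request.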

If you produce an entirely static version, e.g., by following the process described at http://player.digirati.co.uk/digitising.html, you won't be able to offer search. Even if you could produce ALTO files you still need a server-side process to query them and generate search results.
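To give an idea of what such a server-side process has to do, here is a minimal sketch, not the Wellcome Library's implementation. It reads one ALTO page and returns the position of every word matching a search term. The `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes come from the ALTO schema; the sample page, the function names and the shape of the result are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO page, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="A" HPOS="100" VPOS="200" WIDTH="30" HEIGHT="40"/>
            <String CONTENT="tangled" HPOS="140" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <String CONTENT="web." HPOS="330" VPOS="200" WIDTH="90" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix so any ALTO schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def search_alto(alto_xml: str, term: str) -> list:
    """Return text and bounding box for each word equal to term (case-insensitive)."""
    needle = term.casefold()
    hits = []
    for element in ET.fromstring(alto_xml).iter():
        if local_name(element.tag) != "String":
            continue
        word = element.get("CONTENT", "")
        # Ignore surrounding punctuation so "web." still matches "web".
        if word.casefold().strip(".,;:!?\"'()") == needle:
            hits.append({
                "text": word,
                "x": float(element.get("HPOS", 0)),
                "y": float(element.get("VPOS", 0)),
                "w": float(element.get("WIDTH", 0)),
                "h": float(element.get("HEIGHT", 0)),
            })
    return hits


if __name__ == "__main__":
    print(search_alto(SAMPLE_ALTO, "tangled"))
```

A real service would index the words of every page once rather than re-parse the XML on each request, and would convert the coordinates into whatever units the viewer expects.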

The Player on GitHub is an entirely client-side application; we haven't yet released any of the server-side components that a library might use. At present, getting this up and running would be a big overhead for casual use.

We would like to do some more work on a more user friendly server implementation that doesn't rely on the infrastructure resources of a large library.

So to come back to your questions:

1) You need OCR software that can output ALTO format XML (e.g., http://content-conversion.com/ or http://www.abbyy.com/, both commercial)

2) Our packages are generated dynamically from the source METS files; they are never edited by hand.

We hope to be able to do some work on a better package editor, hopefully a visual one. ... However, without server-side support you still won't be able to offer search.

But we hope you can still use the Player without the search within feature for now.

Tom

PS the following links go into more detail about the Wellcome Library's systems:

http://www.ariadne.ac.uk/issue71/henshaw-kiley

http://www.slideshare.net/goobi_org/goobi-in-the-wellcome-library

http://player.digirati.co.uk/digital-delivery.html

Wulin Teo

Jun 20, 2014, 5:08:36 AM

to wellcom...@googlegroups.com

Hi Tom,

Thank you for explaining how the data are processed from beginning to end at the Wellcome Library. It is very informative.

Thank you for bringing digital documents back to life.

Wulin

Klaus E. Werner

Nov 15, 2019, 3:16:14 AM

to Universal Viewer

Hello Tom,

I'm putting up our resources using our own viewer and - in parallel - the UV via IIIF manifests and it's working fine for now (http://dlib.biblhertz.it).

I thought to implement OCR SEARCH, too, and after some tinkering found out that providing:
1. the manifest file
2. the annoservices json (more or less a JSON group for each line with text string and canvas coordinates)
is, unfortunately, not enough.

This experience is more or less in line with what's been said by you here ... I found out the hard way.
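
A rough sketch of the missing piece, assuming line-level JSON of the kind described above: something still has to take the query, filter the lines, and return the matching regions. The record shape (`canvas`, `text`, `xywh`), the example URIs and the function name are hypothetical. Only the `#xywh=` fragment is standard, being the usual way to point at a region of a IIIF canvas.

```python
# Hypothetical line records: one entry per OCR line, with canvas coordinates.
LINES = [
    {"canvas": "https://example.org/iiif/book1/canvas/p1",
     "text": "Oh what a tangled web we weave",
     "xywh": (120, 340, 900, 42)},
    {"canvas": "https://example.org/iiif/book1/canvas/p2",
     "text": "When first we practise to deceive",
     "xywh": (118, 390, 880, 40)},
]


def search_lines(lines, query):
    """Return the lines containing query, each with a region-targeted canvas URI."""
    needle = query.casefold()
    hits = []
    for line in lines:
        if needle in line["text"].casefold():
            x, y, w, h = line["xywh"]
            hits.append({
                "text": line["text"],
                # Target the region of the canvas covered by the matching line.
                "on": f'{line["canvas"]}#xywh={x},{y},{w},{h}',
            })
    return hits


if __name__ == "__main__":
    for hit in search_lines(LINES, "tangled"):
        print(hit["on"], hit["text"])
```

This logic has to run behind an HTTP endpoint that the viewer can call with the search term, which is the server-side part that static files alone cannot provide.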