PDF.js text under image

meljt...@burntlands.ca

May 19, 2013, 2:19:01 PM
I am building an online historical newspaper archive for a local museum. The archive is composed of many PDF files prepared by FineReader as text under image. The subject newspaper is the Almonte Gazette, 1865 to 1987. A typical issue is 8 broadsheet pages, each with a high-res image and underlying text produced by OCR.

At the server end I have a keyword index (MySQL) and the ability to extract a single-page PDF and serve it to the browser. This can be viewed at http://mvtm.ca/museum under the 'collections' tab.

I want to use PDF.js to provide a more rewarding user experience than the current one: first, to highlight the keywords that were used to initiate the online search, and second, to provide an option to pull out an excerpt of text.
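
To make the two goals concrete, here is a minimal sketch of the text side only. It assumes the page's OCR text has already been obtained as one plain string (how that string is pulled out of the PDF is not shown), and the names `findKeywordMatches` and `excerptAround` are hypothetical helpers, not part of PDF.js.

```typescript
// Hypothetical helpers; none of these names come from PDF.js.
interface Match {
  start: number; // index of the first matched character
  end: number;   // index just past the last matched character
}

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Find every case-insensitive occurrence of any search keyword in the page text.
function findKeywordMatches(text: string, keywords: string[]): Match[] {
  const terms = keywords
    .map((k) => k.trim())
    .filter((k) => k.length > 0)
    .map(escapeRegExp);
  if (terms.length === 0) return [];

  const pattern = new RegExp(terms.join("|"), "gi");
  const matches: Match[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    matches.push({ start: m.index, end: m.index + m[0].length });
  }
  return matches;
}

// Return the match plus up to `radius` characters of context on each side,
// marking truncated ends with "...".
function excerptAround(text: string, match: Match, radius: number): string {
  const from = Math.max(0, match.start - radius);
  const to = Math.min(text.length, match.end + radius);
  const prefix = from > 0 ? "..." : "";
  const suffix = to < text.length ? "..." : "";
  return prefix + text.slice(from, to).trim() + suffix;
}
```

The match offsets could drive highlighting, and the same offsets feed the excerpt, so both features share one search pass. OCR text from old newsprint will contain recognition errors, so exact matching like this may miss some hits that the server-side index finds.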

In initial experimentation I am finding that the image is not rendered at high quality (to read the text, the image has to be displayed at about 400%). At that zoom the image is sharp in Safari/Preview but not in Firefox 20 with PDF.js. I'm not sure why, but the cause appears to be the rendering rather than the image resolution.

Ideally, I would like to do a custom rendering of just a portion of a page image surrounding a keyword and provide a brief excerpt from the underlying text. If anyone has accomplished something similar I would really appreciate some pointers on how to approach this problem.
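
One possible approach to the cropping part, sketched here under assumptions: render the page to a canvas at some scale, then copy out only the rectangle around the keyword. The sketch below covers just the geometry. It assumes the keyword's bounding box is already known in PDF points with the origin at the bottom-left of the page (the usual PDF convention), and that the canvas has its origin at the top-left. `cropRectForKeyword` and its parameters are hypothetical names, not PDF.js API.

```typescript
// Hypothetical geometry helper; not part of PDF.js.
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Convert a keyword box (PDF points, origin bottom-left) into a crop rectangle
// in canvas pixels (origin top-left), padded by `margin` points and clamped
// to the page. `scale` is canvas pixels per PDF point.
function cropRectForKeyword(
  box: Rect,
  pageWidth: number,
  pageHeight: number,
  scale: number,
  margin: number
): Rect {
  // Expand the box by the margin, staying inside the page.
  const left = Math.max(0, box.x - margin);
  const bottom = Math.max(0, box.y - margin);
  const right = Math.min(pageWidth, box.x + box.width + margin);
  const top = Math.min(pageHeight, box.y + box.height + margin);

  // Flip the vertical axis and scale to pixels.
  return {
    x: Math.floor(left * scale),
    y: Math.floor((pageHeight - top) * scale),
    width: Math.ceil((right - left) * scale),
    height: Math.ceil((top - bottom) * scale),
  };
}

// Example: a 50 x 10 point keyword box on a 612 x 792 point page,
// rendered at 2 pixels per point with a 20 point margin.
const crop = cropRectForKeyword(
  { x: 100, y: 700, width: 50, height: 10 },
  612,
  792,
  2,
  20
);
// crop is { x: 160, y: 124, width: 180, height: 100 }
```

The resulting rectangle could be passed to the canvas 2D `drawImage` call that takes a source rectangle, copying the region from the full-page canvas into a smaller one. Whether rendering at a larger scale also addresses the sharpness problem in Firefox is something I have not verified.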