Improving accuracy of printed pages from 1934


corey....@flinders.edu.au

Mar 29, 2016, 2:29:27 AM
to tesseract-ocr
Hi All,

I've been experimenting with Tesseract and have been impressed with the accuracy of the software. I'm looking to use Tesseract to process around 200 pages of printed material from around 1934. I've attached a sample of the PDF I need to work with.

I'm looking to improve the accuracy of the OCR process as much as possible. I believe that, within the vast and admittedly intimidating list of available options, there are ways to improve the accuracy. Speed of recognition isn't as high a priority as accuracy for this project.

The following steps are what I've found work best so far:

1. Convert the PDF to TIFF

convert -density 350 input.pdf -type Grayscale -background white +matte -depth 32 input.tif


2. Clean the TIFF file using the text cleaner script [1]

textcleaner -t 25 -s 1 -g input.tif cleaned.tif


3. OCR the cleaned TIFF file.

tesseract cleaned.tif ./test-ocr
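The three steps above could be wrapped in one small script. Here's a minimal Python sketch, assuming ImageMagick's `convert`, Fred Weinhaus's `textcleaner` script, and `tesseract` are all on PATH, with the same flag values as the commands above (the function names are just illustrative):

```python
#!/usr/bin/env python3
"""Sketch of the three-step pipeline: PDF -> TIFF -> cleaned TIFF -> OCR."""
import subprocess

def pipeline_commands(pdf, tif="input.tif", cleaned="cleaned.tif", out="./test-ocr"):
    """Return the three commands as argv lists (not yet executed)."""
    return [
        # 1. Convert the PDF to a grayscale TIFF
        ["convert", "-density", "350", pdf, "-type", "Grayscale",
         "-background", "white", "+matte", "-depth", "32", tif],
        # 2. Clean the TIFF with the textcleaner script
        ["textcleaner", "-t", "25", "-s", "1", "-g", tif, cleaned],
        # 3. OCR the cleaned TIFF
        ["tesseract", cleaned, out],
    ]

def run_pipeline(pdf):
    for cmd in pipeline_commands(pdf):
        subprocess.run(cmd, check=True)  # stop on the first failing step
```

`check=True` makes the script abort as soon as any step fails, so a bad conversion doesn't silently feed garbage into the OCR step.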


Any thoughts on ways to improve the accuracy will be gratefully received. 


With thanks. 


-Corey


[1] http://www.fmwconcepts.com/imagemagick/textcleaner/

Pages from 1934 filmdailyyearboo00film_4.pdf

Libo Huang

Mar 29, 2016, 5:53:36 AM
to tesseract-ocr
I think that you should split each page's text block into multiple columns, then rows, using Leptonica or OpenCV. That makes it much easier to OCR.
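A minimal sketch of that idea, using a plain-Python vertical projection profile in place of Leptonica/OpenCV (the function name and the toy 0/1 image representation are illustrative, not from any library): sum the ink in each pixel column and split wherever a sufficiently wide blank gutter appears.

```python
# `img` is a binarized page as a 2-D list of rows, where 1 = ink, 0 = background.
# A real pipeline would get this from Leptonica/OpenCV thresholding instead.

def split_columns(img, min_gap=2):
    """Return (start, end) x-ranges of text columns, split at gutters
    that are at least `min_gap` pixels wide (end is exclusive)."""
    width = len(img[0])
    # Vertical projection profile: total ink per pixel column.
    ink = [sum(row[x] for row in img) for x in range(width)]
    columns, start, gap = [], None, 0
    for x, v in enumerate(ink):
        if v:                       # ink here: we're inside a text column
            if start is None:
                start = x
            gap = 0
        elif start is not None:     # blank pixel column inside/after text
            gap += 1
            if gap >= min_gap:      # gutter wide enough: close the column
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:           # column runs to the right edge
        columns.append((start, width))
    return columns
```

Each returned range can then be cropped out and fed to Tesseract as its own single-column image.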

On Tuesday, March 29, 2016 at 2:29:27 PM UTC+8, corey....@flinders.edu.au wrote:

Tom Morris

Mar 29, 2016, 12:21:51 PM
to tesseract-ocr
Great to see someone using Tesseract to preserve a little history! 

The first thing you should do is start with as close to the original as possible.  Since you're working with this scan: https://archive.org/details/filmdailyyearboo00film_4
that would be the zip containing the original JPEG2000 images: https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_jp2.zip

Note that the Internet Archive runs all uploads through ABBYY FineReader, and the output from that is available here: https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_abbyy.gz
Similar to Tesseract's hOCR output, it includes coordinates for all text blocks, so if the page segmentation was messed up it should be possible to post-process the output to reconstruct the correct reading order.  You can find an ABBYY parser that I wrote for another purpose here: https://github.com/tfmorris/oed/blob/master/oedabby.py

If you want to run things through Tesseract to compare for better quality (or just for the fun of it), you should be able to do that directly if your copy of Tesseract was built against a version of Leptonica with JPEG2000 support (mine was). I used this command to produce the attached output.

$ tesseract filmdailyyearboo00film_4_0742.jp2 pg738 hocr


Not surprisingly, Tesseract doesn't get the page segmentation correct.  You could either preprocess to cut the image into four columns that you OCR separately or post-process the hOCR output to put all the words in the correct order.
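The post-processing route might look roughly like this: a hedged sketch that pulls word boxes out of the hOCR with a regex and re-sorts them into column order. The function name and column edges are assumptions, and real code should use a proper HTML parser rather than a regex, but it shows the shape of the fix.

```python
# Reorder hOCR words into reading order: column by column, top to bottom.
# hOCR marks each word as <span class='ocrx_word' title='bbox x0 y0 x1 y1; ...'>.
import re

WORD = re.compile(
    r"""<span class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]+)</span>"""
)

def words_in_column_order(hocr, column_edges):
    """Return word texts sorted by (column index, top edge, left edge).

    `column_edges` lists the x-coordinates where each new column starts,
    e.g. three edges for a four-column page layout.
    """
    words = []
    for m in WORD.finditer(hocr):
        x0, y0 = int(m.group(1)), int(m.group(2))
        # Column index = how many column edges lie at or left of this word.
        col = sum(1 for edge in column_edges if x0 >= edge)
        words.append((col, y0, x0, m.group(5)))
    return [w[3] for w in sorted(words)]
```

For a page like this one, the column edges could come from a projection profile or simply from dividing the page width into four equal slices.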

When I manually crop to just the first column, I get pretty reasonable (to my eye) results. Files attached.

Tom
pg738.hocr
pg738.txt
pg738_col1.txt
pg738_col1.html

corey....@flinders.edu.au

Apr 3, 2016, 12:16:17 AM
to tesseract-ocr
Hi All,

Many thanks to those who have replied to my question here on the group, and privately.

It has given us some avenues to explore in extracting and preserving this information. 

I remain impressed by everyone who has contributed to the project and its capabilities. 

With thanks. 

-Corey

Tom Morris

Apr 3, 2016, 1:06:55 AM
to tesseract-ocr
On Sunday, April 3, 2016 at 12:16:17 AM UTC-4, corey....@flinders.edu.au wrote:

> Many thanks to those who have replied to my question here on the group, and privately.
>
> It has given us some avenues to explore in extracting and preserving this information.

Glad we were able to help!

I bet future visitors will appreciate it when you follow up here with more details about what worked and what didn't for your particular use case.  

It's a virtuous circle of paying it forward and paying it back ...