OCR and tips from Tesseract gurus!


Luis Zertuche

Aug 18, 2016, 4:18:40 PM
to tesseract-ocr
Hello Tesseract Gurus!

I'm working on pdf2text extraction for legal documents. I've done some searching and found tips to improve quality, but I was wondering if someone here can provide info beyond the basics. Image-processing-wise, I've been resizing to 600 dpi, correcting for skew angle, and [denoising with a median filter, contrast stretching, dilating with a small structuring element, and Otsu thresholding]. All of those improved the results for only a subset of the documents, and given how widely the documents vary in acquisition quality, noise level, and contrast, I've realized the imaging pipeline is not one-size-fits-all. I'm considering creating parallel image processing pipelines, running OCR on all of them, and just picking the best result.

1. Can anyone comment on some good variations of an imaging pipeline for 'high-variance dirty text'? Or alternatively, can anyone think of an imaging pipeline that would cover a wider range of document quality?
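To make the "run several pipelines and keep the best" idea concrete, here is a rough sketch of the selection step only. Everything in it is a placeholder for illustration: `pick_best_pipeline`, the `pipelines` mapping, and the `ocr` and `score` callables are hypothetical names, and the actual preprocessing and OCR calls would have to be plugged in.

```python
from typing import Callable, Dict, Tuple, TypeVar

Image = TypeVar("Image")  # whatever image type the preprocessing code uses


def pick_best_pipeline(
    image: Image,
    pipelines: Dict[str, Callable[[Image], Image]],
    ocr: Callable[[Image], str],
    score: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Run every preprocessing pipeline, OCR each result, keep the best one.

    Returns (pipeline_name, recognized_text, score).
    """
    if not pipelines:
        raise ValueError("at least one pipeline is required")
    best = None
    for name, preprocess in pipelines.items():
        text = ocr(preprocess(image))
        current = score(text)
        # Strict '>' means the earliest pipeline wins when scores tie.
        if best is None or current > best[2]:
            best = (name, text, current)
    return best
```

Each entry in `pipelines` would be one variation (for example, with or without dilation), and `score` would be a text-quality metric such as the dictionary-word one described below.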

As far as Tesseract parameters go, I've put together a small parameter exploration loop that evaluates each iteration with a text-quality metric (M2, based on the number of dictionary words, using pyenchant). Here is an example for the linesize value parameter for a single document:

+++For linesize value, 1.25, M2 value is 0.661157024793

+++For linesize value, 1.35, M2 value is 0.661157024793

+++For linesize value, 1.45, M2 value is 0.644628099174

+++For linesize value, 1.55, M2 value is 0.611111111111

+++For linesize value, 1.65, M2 value is 0.0

+++For linesize value, 1.75, M2 value is 0.693693693694

+++For linesize value, 1.85, M2 value is 0.672413793103

+++For linesize value, 1.95, M2 value is 0.631578947368

+++For linesize value, 2.05, M2 value is 0.0


Here the best linesize value is 1.75 for that document, which yields good results (the ). From this info,

2. Can anyone recommend some good parameters to apply this method with? Any other tips on combining parameters into something more general, or any other exploration tips?
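For anyone who wants to try the same kind of exploration, a minimal sketch of the loop might look like the following. It rests on several assumptions: the exact definition of M2 is not given above, so `dictionary_word_ratio` is just one plausible reading (share of alphabetic tokens accepted by a dictionary); the "linesize value" is assumed to map to Tesseract's `textord_min_linesize` variable; and `tesseract_command` and `sweep` are hypothetical helper names.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple


def dictionary_word_ratio(text: str, is_word: Callable[[str], bool]) -> float:
    """Fraction of alphabetic tokens the dictionary accepts (0.0 if no tokens).

    Assumption: this is only a stand-in for the M2 metric described above.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if is_word(token)) / len(tokens)


def tesseract_command(image_path: str, variable: str, value: float) -> List[str]:
    """Build a CLI call that sets one config variable with -c and writes to stdout."""
    return ["tesseract", image_path, "stdout", "-c", f"{variable}={value}"]


def sweep(
    values: Iterable[float],
    run_ocr: Callable[[float], str],
    score: Callable[[str], float],
) -> Tuple[float, Dict[float, float]]:
    """Score the OCR output for each value; return the best value and all scores."""
    scores = {value: score(run_ocr(value)) for value in values}
    if not scores:
        raise ValueError("no parameter values to explore")
    # On ties, max() returns the first value that reached the top score.
    best = max(scores, key=scores.get)
    return best, scores
```

In practice `is_word` could be `enchant.Dict("en_US").check` from pyenchant, and `run_ocr` could call `subprocess.run(tesseract_command("page.png", "textord_min_linesize", v), capture_output=True, text=True).stdout`, where `page.png` is a placeholder file name.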

Thanks for reading! Any other potentially useful tips or info would be greatly appreciated, whether it's on the image processing or on the Tesseract parameters.

Best, Luis.


Luis Zertuche

Aug 22, 2016, 2:30:32 PM
to tesseract-ocr
No takers huh? :(

Tom Morris

Aug 23, 2016, 6:33:05 PM
to tesseract-ocr
If there were a "one size fits all" answer, it'd probably already be done automatically by Tesseract, but you might have a look at some of the work the eMOP project did to deal with OCRing challenging texts at scale to see if you can reuse some of their tooling or learnings (although a lot of what they were doing was focused on custom training, as opposed to other types of image processing).

Tom