General strategies for dealing with problem images

gl0...@gmail.com

Mar 18, 2019, 1:59:05 PM
to tesseract-ocr
I would like some advice concerning the general use of tesseract, because my experience with it tends to two extremes: either tesseract performs flawlessly, with no prior modification of the image necessary except cropping to the text and (most significant) enlarging the image by a factor of 2 or 4; or tesseract's output is riddled with errors.

Following advice to improve the quality of the image (Fred's textcleaner script, or applying the ImageMagick functions it uses individually) usually produces a significant improvement in the human readability of the image, but as far as tesseract is concerned it usually produces no improvement, and most often an actual deterioration in its performance.
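Since a cleanup that looks better to a human can still hurt recognition, it may help to score each preprocessing variant by tesseract's own word confidences rather than by eye. The sketch below assumes the TSV output format (a `conf` column that is -1 for non-word rows, and a `text` column); the function name `mean_word_confidence` is hypothetical, and confidence is only a rough proxy for accuracy.

```python
import csv
import io


def mean_word_confidence(tsv_text: str) -> float:
    """Average the confidence of recognised words in tesseract TSV output.

    Rows with conf < 0 (page, block, paragraph, line rows) and rows with
    empty text are ignored. Returns 0.0 when no word was recognised.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    confidences = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue  # skip malformed rows instead of failing
        if conf >= 0 and text:
            confidences.append(conf)
    return sum(confidences) / len(confidences) if confidences else 0.0
```

The TSV text can be produced with something like `tesseract original.png stdout tsv` and `tesseract cleaned.png stdout tsv`; comparing the two scores gives a first hint as to whether the cleanup helped or hurt.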

So I am looking for another reason to explain tesseract's difficulty with certain images. I thought perhaps its performance may be dependent on its trying to identify the particular font used, but https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf seems to say not.

The only other possibility I can think of is that either the size or the aspect ratio of the text in the image has been subtly deformed. If so, it is not apparent to my eye, but tesseract is certainly very sensitive to size changes, because, when it works, resizing the image makes such a dramatic improvement.
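One hedged reading of why a 2x or 4x enlargement helps: screenshots are often rendered at roughly 96 dpi (an assumption about the source images), while tesseract is usually said to prefer text at around 300 dpi. A minimal sketch of the arithmetic, with hypothetical helper names, and an optional resize step that assumes the Pillow library is installed:

```python
def upscale_factor(source_dpi: float, target_dpi: float = 300.0) -> float:
    """Factor needed to bring source_dpi up to target_dpi (never below 1)."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return max(1.0, target_dpi / source_dpi)


def upscaled_size(width: int, height: int, factor: float) -> tuple:
    """New pixel size, scaling both axes equally to keep the aspect ratio."""
    return round(width * factor), round(height * factor)


def enlarge(path_in: str, path_out: str, factor: float) -> None:
    """Resize an image file with Pillow (imported lazily, optional)."""
    from PIL import Image  # assumption: Pillow is available

    with Image.open(path_in) as img:
        new_size = upscaled_size(img.width, img.height, factor)
        img.resize(new_size, Image.LANCZOS).save(path_out)
```

For a 96 dpi screenshot, `upscale_factor(96)` gives 3.125, which sits between the factors of 2 and 4 mentioned above.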

Does anyone have other suggestions as to the nature of the problem? I'm not asking for detailed advice here, which is why I've given no image samples, but for general lines of attack, strategy rather than tactics. Thank you.


Jonathan Muller

Mar 19, 2019, 1:03:18 AM
to tesser...@googlegroups.com
I don't really agree with your statement. There are a lot of things we had to consider in image processing before tesseract finally gave us accurate results, but it all makes sense. Here is our actual pipeline:

 1 - Clean up the image: remove any artifacts of the camera or scanning device, crop to the paper accurately, remove noise, binarize
 2 - Deskew the image: make text lines as horizontal as possible
 3 - Cut out the zone of interest: take the text zone of interest in the document, using a DNN to recognize the zones
 4 - Clean the text zone: remove any irrelevant parts of the image (like lines, tables, stamps)
 5 - Create a whitelist of probable characters based on the zone (this one improves accuracy a lot!)
 6 - Submit to tesseract with appropriate settings for the language

1: it is understandable how noise or image quality could affect recognition
2: tesseract expects lines of text to be straight
3: this reduces processing time and allows us to focus on the zone for further cleaning (next steps) or custom parameters before submitting
4: lines, tables, and other things can alter recognition, because a piece of a line is sometimes recognised as |, -, _, l, or 1. It can also affect nearby characters, especially when working with Chinese-based characters
5: whitelisting based on the content helps recognition a lot. A simple example: if you are looking for numbers, whitelist "1234567890", since 0 is close to O. Even humans make that mistake, that's why we banned O from Wifi passwords :laugh:
6: tesseract settings can improve recognition a lot when working with non-English scripts or when the image is not perfect (tesseract works best at 300 dpi)
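Steps 5 and 6 above could be wired together roughly as follows. This is a minimal sketch, not the pipeline described in this post: the helper names `tesseract_command` and `run_ocr` are hypothetical, the flags are those of the tesseract 4.x command line, and whether `tessedit_char_whitelist` is honoured depends on the tesseract version and engine in use.

```python
import subprocess


def tesseract_command(image_path, lang="eng", psm=6, dpi=300, whitelist=None):
    """Build the argument list for one tesseract call on a cropped zone."""
    cmd = [
        "tesseract", image_path, "stdout",
        "-l", lang,            # language / script model
        "--psm", str(psm),     # page segmentation mode (7 = single text line)
        "--dpi", str(dpi),     # resolution hint for the input image
    ]
    if whitelist:
        # Restrict output to the characters expected in this zone.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def run_ocr(image_path, **options):
    """Run tesseract and return the recognised text (needs tesseract on PATH)."""
    result = subprocess.run(
        tesseract_command(image_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For a zone expected to hold only a number, a call such as `run_ocr("zone.png", psm=7, whitelist="0123456789")` would apply the digit whitelist from point 5.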

We have gone from 10% accuracy to nearly 95% now. Each image is different and each may require different processing or parameters. Making a solution that fits all is very complex, but I still think it is possible if the application is specific enough. I guess that is why it is not included in tesseract: making it work very well for a specific use case would break others.

I guess you just have to find the right pre-processing for your kind of image

Hope it helps

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/15dcee7c-0815-47c3-9c74-29f8e90a7ca2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Jonathan
06.49.32.74.55

gl0...@gmail.com

Mar 19, 2019, 4:23:02 PM
to tesseract-ocr
Thank you for your response. My experience with OCR is limited to the conversion of screenshots I take online; yours is far more extensive, I think.

And thank you particularly for items 2 and 5. Slight skewing of the image may better account for the distortions in size and/or aspect ratio that I had been thinking of as the problem, because skewing can be localized to only parts of the image, which better describes tesseract's behavior with such images (good in parts, garbled in others).
Item 5 in particular seems very promising, in that I could use it to ASCII-ize the text that tesseract produces. I do a lot of post-editing with vim scripts that often require ASCII text to work properly. I can't really cut out numerals though; 0 and O aren't the only problem there: 5 and S, 1 and I or l, J and ].

Lorenzo Bolzani

Mar 23, 2019, 5:28:24 AM
to tesser...@googlegroups.com
On Tue, Mar 19, 2019 at 06:03, Jonathan Muller <jmu...@pukogames.com> wrote:
 5 - Create a whitelist based on the zone of probable characters (this one improves accuracy a lot !)

How do you do whitelisting with tesseract 4.x? As far as I know it is not yet supported.

I do the same with simple letter replacements like B/8, I/1, Z/2, etc. Maybe there is a better/simpler way.
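The replacement approach can be sketched as a per-field translation table. This is only an illustration with hypothetical names; the pairs are the confusable characters mentioned in this thread, and it assumes you already know whether a given field should hold digits or letters.

```python
# Look-alike pairs mentioned in this thread: O/0, I/1, l/1, Z/2, S/5, B/8.
TO_DIGIT = str.maketrans(
    {"O": "0", "I": "1", "l": "1", "Z": "2", "S": "5", "B": "8"}
)
TO_LETTER = str.maketrans(
    {"0": "O", "1": "I", "2": "Z", "5": "S", "8": "B"}
)


def as_digits(field: str) -> str:
    """Fix letters that were misread in a field known to be numeric."""
    return field.translate(TO_DIGIT)


def as_letters(field: str) -> str:
    """Fix digits that were misread in a field known to be alphabetic."""
    return field.translate(TO_LETTER)
```

Applying it blindly to mixed text would introduce new errors, so it only makes sense on zones whose expected content is known, much like the whitelist itself.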


Lorenzo

Shree Devi Kumar

Mar 23, 2019, 6:44:25 AM
to tesser...@googlegroups.com
https://github.com/tesseract-ocr/tesseract/pull/2294 by @bertsky adds the whitelist/blacklist functionality for Tesseract4. It has not been merged yet. 



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com