Tesseract makes different predictions on seemingly equal images. How to make it more robust?

MysteriousGuy

Jul 14, 2020, 10:04:49 AM
to tesseract-ocr
I am using Tesseract to extract text from the attached images. Even though the images are nearly identical, Tesseract makes a mistake on one of them: for 'bad.png' the output is ELHADIJ, whereas for 'good.png' it is ELHADJ.

Here are the commands I ran:

tesseract good.png output1 -l eng --psm 8
tesseract bad.png output2 -l eng --psm 8

and similarly for other PSMs (page segmentation modes).


My question is: how do I make tesseract more robust? Why does it make a mistake in one case but not in the other?
good.png
bad.png

Zdenko Podobny

Jul 14, 2020, 10:13:40 AM
to tesser...@googlegroups.com
Try to use the latest version of tesseract.

Zdenko


On Tue, Jul 14, 2020 at 16:04, MysteriousGuy <gyt...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/81a83479-b266-4686-a2d8-fae2d5916831o%40googlegroups.com.

MysteriousGuy

Jul 15, 2020, 3:01:14 AM
to tesseract-ocr
Hi. The latest stable version (4.1.1) produces the same error.



Tuan Ardouin

Jul 15, 2020, 5:08:33 AM
to tesseract-ocr
You need to apply some pre-processing to your image.

MysteriousGuy

Jul 15, 2020, 2:55:09 PM
to tesseract-ocr
This seems like an ad-hoc approach. I am already converting the images to grayscale. If I apply blurring, binarisation, etc., I will solve this case but cause another case to fail as a result. Something in Tesseract fails to generalize across clearly near-identical images, and I am interested in what it is.

Lorenzo Bolzani

Jul 15, 2020, 3:31:15 PM
to tesser...@googlegroups.com
I think the reason is that your input is bad, so the model is confused and a few pixels are enough for it to see an extra letter.

Your input is "bad" because it is different from the data used to train the neural network. The difference between the two images is small, but the difference between either of them and the training data is big.

If you improve your image (zero borders, less noise, much stronger contrast, and perhaps straightened text), this kind of problem should become much less common.
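
As a rough illustration of two of these steps (stronger contrast and removing borders), here is a minimal sketch in plain Python. It assumes the image is already loaded as a list of rows of grayscale values (0 = black, 255 = white) with dark text on a light background. The function names and the threshold of 128 are hypothetical choices for this example, not part of Tesseract; in practice you would likely do the same thing with an image library.

```python
def stretch_contrast(img):
    """Linearly rescale pixels so the darkest becomes 0 and the lightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def crop_to_ink(img, threshold=128):
    """Crop to the bounding box of pixels darker than `threshold`.

    Assumes dark text on a light background; `threshold` is an arbitrary
    value picked for this sketch.
    """
    rows = [y for y, row in enumerate(img) if any(p < threshold for p in row)]
    cols = [x for x in range(len(img[0])) if any(row[x] < threshold for row in img)]
    if not rows or not cols:
        # No dark pixels found: return a copy unchanged.
        return [row[:] for row in img]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def preprocess(img):
    """Stretch contrast first so the fixed threshold is meaningful, then crop."""
    return crop_to_ink(stretch_contrast(img))


# Tiny low-contrast example: light gray border around two darker pixels.
sample = [
    [200, 200, 200, 200],
    [200, 100, 120, 200],
    [200, 200, 200, 200],
]
cleaned = preprocess(sample)
```

Stretching before cropping matters here: in the low-contrast sample every pixel is above or near the threshold until the values are rescaled.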

If you want to understand a little more about why this is possible, read about how LSTM OCR works. This is likely something in the step that decides the letters from the neural network output (beam search, CTC). It is not a bug, just how it works.
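
To make the CTC point concrete, here is a toy sketch of greedy CTC decoding. It is a simplification for illustration only: Tesseract's actual decoder is a beam search, and the per-frame scores below are invented, not taken from the real model. The idea is that the network emits one score distribution per horizontal slice of the image; the decoder picks a label per slice, merges repeats, and drops the "blank" label. If a single slice tips from "blank" to "I", an extra letter appears.

```python
BLANK = "-"  # symbol used here for the CTC blank label


def ctc_greedy_decode(frames):
    """Greedy CTC decoding.

    `frames` is a list of dicts mapping label -> score, one per time step.
    Take the best label per frame, collapse consecutive repeats, drop blanks.
    """
    best = [max(frame, key=frame.get) for frame in frames]
    out = []
    prev = None
    for label in best:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)


def certain(label):
    """A frame where the (hypothetical) network is fully sure of one label."""
    return {label: 1.0}


# Invented frame sequence for the confident part of the word.
prefix = [certain(c) for c in "EELL-HHAADD"]
suffix = [certain("J")]

# The two inputs differ only in one ambiguous frame between D and J.
good_frames = prefix + [{BLANK: 0.51, "I": 0.49}] + suffix
bad_frames = prefix + [{BLANK: 0.49, "I": 0.51}] + suffix

good_text = ctc_greedy_decode(good_frames)  # "ELHADJ"
bad_text = ctc_greedy_decode(bad_frames)    # "ELHADIJ"
```

A shift of 0.02 in one frame is enough to change the decoded string, which is the kind of sensitivity a few different pixels can trigger when the input is far from the training data.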

I do not think there is much you can do with parameters and so on, other than improve your image or Tesseract itself. Sometimes this happens even with fine-tuned models.



Lorenzo



