Tesseract confuses look-alike letters and digits

Yash Mistry

Jun 7, 2022, 3:50:44 AM

to tesseract-ocr

I am having trouble extracting the correct character when a word contains letters and digits that look alike, e.g. 5 and S, I and 1, 8 and S.

I applied image pre-processing (blurring, erosion, dilation, noise normalisation, removal of unnecessary components) and text detection to the input image, but even after all this pre-processing Tesseract OCR does not give the correct result.

Please check the attached images.

Original Image

Pre-processed Image

image (1).png

Detected Text

Tesseract Configuration

-l eng --oem 1 --psm 7 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n" load_system_dawg=false load_freq_dawg=false

Result of OCR: TITLENUMBER 81003716

As we can see, the OCR reads the S as 8 even after pre-processing and text detection.

Is there any way to overcome this problem?

Tesseract Version: tesseract 5.1.0-32-gf36c0

Note: I asked the same question in the pytesseract GitHub repo and was advised to post it here.

Lorenzo Bolzani

Jun 7, 2022, 4:15:48 AM

to tesser...@googlegroups.com

Hi Yash,

in my experience you are going to see a lot of these errors on similar characters.

Given only the pre-processed text, I might make the same mistake myself.

What I do is fix these letters according to a pattern, in this case WDDDDDDD (one letter followed by seven digits),

and I replace:

S <-> 8

O <-> 0

I <-> 1

i <-> 1

l <-> 1

z <-> 2

Z <-> 2

etc.
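The pattern-based fix above can be sketched in a few lines of Python. This is only an illustration: it assumes that W in the pattern marks a letter position and D a digit position, the function name fix_by_pattern is made up, and the reverse mapping (which letter a misread digit becomes) is a guess that you would adapt to your own data.

```python
# Look-alike substitutions taken from the list above.
TO_DIGIT = {"S": "8", "O": "0", "I": "1", "i": "1", "l": "1", "z": "2", "Z": "2"}
# Reverse direction; choosing "I" for "1" and "Z" for "2" is an assumption.
TO_LETTER = {"8": "S", "0": "O", "1": "I", "2": "Z"}


def fix_by_pattern(text, pattern):
    """Swap look-alike characters so that `text` fits `pattern`.

    `pattern` uses 'W' for a letter position and 'D' for a digit position.
    If the lengths differ, the pattern does not apply and the text is
    returned unchanged.
    """
    if len(text) != len(pattern):
        return text
    fixed = []
    for ch, kind in zip(text, pattern):
        if kind == "D" and not ch.isdigit():
            ch = TO_DIGIT.get(ch, ch)   # letter found where a digit is expected
        elif kind == "W" and ch.isdigit():
            ch = TO_LETTER.get(ch, ch)  # digit found where a letter is expected
        fixed.append(ch)
    return "".join(fixed)


if __name__ == "__main__":
    # Second token of the OCR result "TITLENUMBER 81003716".
    print(fix_by_pattern("81003716", "WDDDDDDD"))
```

Characters that have no entry in the tables are left alone, so the function never invents a character it has no rule for.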

Another option, though I'm not 100% sure it is possible with the latest version, is to ask Tesseract for the whole list of predictions for each character, with confidences. For the first character you'd get something like:

S: 0.6839

8: 0.2123

B: 0.1445

...

and, again according to a pattern, you select the best-matching one (you need to use the ChoiceIterator on the result object, iterating at level SYMBOL). This second approach is more elegant, but I do not think it gives you much more than the simpler approach.

Of course a little model fine-tuning helps, but it will not fix these problems 100% and it takes a lot of time to do properly.

I recommend using tesserocr, which is a real API/library wrapper (not a command-line wrapper...); it gives you access to the whole API and, if used properly, it is a lot faster.
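A rough sketch of this second approach with tesserocr could look like the following. The calls follow the choice-iterator example in the tesserocr README. Setting lstm_choice_mode is an assumption about what the LSTM engine needs before it reports alternatives, and the helper names symbol_choices and pick_by_pattern are made up for illustration.

```python
def symbol_choices(image_path):
    """Return, for each recognised symbol, a list of (text, confidence) alternatives."""
    # Imported here so the selection helper below works without tesserocr installed.
    from tesserocr import PyTessBaseAPI, PSM, RIL, iterate_level

    with PyTessBaseAPI(lang="eng", psm=PSM.SINGLE_LINE) as api:
        # Assumption: the LSTM engine only exposes alternatives when this is set.
        api.SetVariable("lstm_choice_mode", "2")
        api.SetImageFile(image_path)
        api.Recognize()
        result = []
        for symbol in iterate_level(api.GetIterator(), RIL.SYMBOL):
            alternatives = [
                (choice.GetUTF8Text(), choice.Confidence())
                for choice in symbol.GetChoiceIterator()
            ]
            result.append(alternatives)
        return result


def pick_by_pattern(choices, kind):
    """Pick the most confident choice that fits `kind` ('W' letter, 'D' digit).

    Falls back to the overall most confident choice if none fits.
    """
    wanted = str.isalpha if kind == "W" else str.isdigit
    matching = [c for c in choices if c[0] and wanted(c[0])]
    pool = matching or choices
    return max(pool, key=lambda c: c[1])[0]


if __name__ == "__main__":
    first = [("S", 0.6839), ("8", 0.2123), ("B", 0.1445)]
    print(pick_by_pattern(first, "W"), pick_by_pattern(first, "D"))
```

The selection step is kept separate from the OCR call so it can be tried on hand-written lists like the one above before wiring it to real output.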

Bye

Lorenzo

Yash Mistry

Jun 24, 2022, 3:22:50 AM

to tesseract-ocr

Hi Lorenzo,

Thank you for the suggestions.

The first approach you suggested is not feasible for me, because there is no guarantee that a specific type of character will appear at a particular position.

I am interested in the second approach. I have been looking for a Tesseract feature that returns all possible predictions for a given character, but I haven't found one yet.

Could you please tell me where you found this functionality, and in which version of Tesseract?

Thank you

Lorenzo Bolzani

Jun 24, 2022, 5:15:15 AM

to tesser...@googlegroups.com

Hi Yash,

please see the example at the bottom of this page:

https://github.com/sirfz/tesserocr

and this issue about the versions (I think you need version 5.x):

https://github.com/sirfz/tesserocr/issues/166

If you have problems with tesserocr make sure it matches the tesseract version it was compiled for:

https://github.com/sirfz/tesserocr/releases/tag/v2.5.2

The alternative choices should also be available in the XML output, if I remember correctly.

Your input image is very tiny (the text is 9 pixels tall) and there are a lot of compression artifacts. If possible, acquire a higher-resolution image with less compression.
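If a better source image cannot be obtained, one fallback is to upscale before OCR. The sketch below uses Pillow; the 30-pixel target text height is an assumption rather than a Tesseract requirement, and the function names are made up. Upscaling does not bring back detail lost to compression, so it is no substitute for a higher-resolution acquisition.

```python
def scale_for_text_height(current_px, target_px=30):
    """Return the zoom factor that brings text of `current_px` up to `target_px`.

    Never returns less than 1.0, so text that is already large is not shrunk.
    """
    if current_px <= 0:
        raise ValueError("text height must be positive")
    return max(1.0, target_px / current_px)


def upscale(image_path, out_path, text_height_px, target_px=30):
    """Resize the image so its text is roughly `target_px` tall and save it."""
    from PIL import Image  # imported here so the helper above has no dependency

    factor = scale_for_text_height(text_height_px, target_px)
    with Image.open(image_path) as img:
        new_size = (round(img.width * factor), round(img.height * factor))
        img.resize(new_size, Image.LANCZOS).save(out_path)
    return factor


if __name__ == "__main__":
    # Text that is 9 pixels tall, as in the image discussed above.
    print(scale_for_text_height(9))
```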

Also try to MANUALLY clean the text more (with Gimp for example) to remove the black fragments of the border or the dot on the left to see IF this gives you better results. Also try to MANUALLY remove almost all of the white borders.

IF any of these gives you better results you can think about how to improve your automated pre-processing step with a clear target, like the attached images (I did not test them).

Your image uses two background colors; you can cut the top and bottom parts apart and process each fragment on its own (so adaptive thresholding does not get confused).
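To illustrate why splitting helps, here is a small pure-Python sketch that thresholds each band separately with Otsu's method. Otsu stands in here for whatever thresholding you actually use, the image is a plain list of rows of 0..255 gray values, and the row where the background changes (split_row) is assumed to be known.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 0..255 gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize_bands(gray, split_row):
    """Binarize rows above and below `split_row` with independent thresholds."""
    out = []
    for band in (gray[:split_row], gray[split_row:]):
        if not band:
            continue
        t = otsu_threshold(p for row in band for p in row)
        out.extend([[255 if p > t else 0 for p in row] for row in band])
    return out


if __name__ == "__main__":
    # Top band: light background (200), text 120. Bottom band: background 90, text 20.
    gray = [[200, 120, 200], [200, 200, 120], [90, 20, 90], [20, 90, 90]]
    for row in binarize_bands(gray, 2):
        print(row)
```

In this toy image the text in the top band (120) is brighter than the background of the bottom band (90), so no single global threshold separates text from background in both bands, while the per-band thresholds do.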

Bye,

Lorenzo
