
OpenCV Python preprocessing strategies for OCR (pytesseract) character recognition


Jokūbas Žižiūnas

Jan 3, 2025, 6:18:13 PM
to tesseract-ocr
I wanted to ask what the most effective pre-processing techniques would be for my case, for the characters I would like to read. I am using pytesseract for character recognition, but sometimes my characters are not recognized properly.

I have added a couple of samples of the images I am using, but I am using more like them.


The most common issues are:
- 5 gets recognized as S (but not vice versa)
- S gets recognized as O (but not vice versa)
- / gets recognized as I

I have tried multiple techniques, but when one technique fixes an issue, another issue pops up. The character recognition works most of the time, but it is not consistent; I would say ~80%. I can take a picture, do the processing, and recognition works; then I take a new picture in the same conditions and recognition does not work. It seems like recognition sits right at the tolerance of the noise.

I believe that a large part of the issue is that the font is bold. For example, I noticed that the wider / is, the more likely it is to be recognized as I. I have tried cv2.resize(fx=2, fy=2) + cv2.erode(), but then, for some reason, the thicker the 5 is, the less likely it is to be recognized as S. At the same time, if the characters are thicker, or if I reduce the threshold in binarization, the hole in 4 gets filled in and causes problems.

I cannot change the font. I have tried taking pictures at various exposures; nothing seems to fix the core of the issue. This is the best focus I am able to obtain. I cannot whitelist certain symbols, because both letters and numbers are possible. I do not want to do .replace('SX', '5X'), because the point of the check is to validate that the label has been printed correctly.

Techniques I have tried:
- Regular binarization
- OTSU binarization
- Adaptive thresholding
- Resize + erode()
- Upscaling with cv2.dnn_superres: somewhat better, but too slow, because I have a lot of images to process
- Histogram equalization before any of the above

NOTE: I am able to get a solution for the sample images; I am unable to get a consistent solution if the images vary slightly. I cannot get it to work 100% of the time.

Can someone provide info on how you would go about cleaning up these images?
image_save1.png
image_save2.png

Zdenko Podobny

Jan 3, 2025, 6:27:52 PM
to tesser...@googlegroups.com
There is no such thing as 100% OCR.

Please provide an example of an image that causes a problem. The ones you provided work out of the box:
tesseract image_save1.png -
Estimating resolution as 445
S/N: 0112182
DATE: DECEMBER 2024

tesseract image_save2.png -
Estimating resolution as 450
5X

Zdenko


On Sat, Jan 4, 2025 at 0:18, Jokūbas Žižiūnas <jokub...@gmail.com> wrote:

Ger Hobbelt

Jan 4, 2025, 7:27:49 AM
to tesseract-ocr
Since you state: "because the point of the check is to validate that the label has been printed correctly", it sounds like you have deeper control of the chain than is usual in processes where OCR is chosen as the primary part of the solution:

Your line suggests that you at least know what's coming your way (a printed label with known text/content): "check" + "validate".
This immediately suggests another approach: get your proverbial claws on the label "image" as it is sent to the printer, i.e. hook into the label production/printing process and obtain a "Sollwert" image for reference. (German Sollwert / Istwert: nouns originating from control engineering; English is not my mother tongue, but it is something like "reference" vs. "observation". I like the German jargon better and think in those terms; anyway...)

Once you have a reference image, your validation problem (still not simple!) becomes one of pattern matching and comparison scoring: take the Sollwert/reference image obtained from the label-producing software and find a decent convolution-or-similar pattern-matching approach in OpenCV or the like. The task is to find/locate the Sollwert image in the Istwert/observation, i.e. in your camera picture. With a picture-A-in-picture-B localization algorithm, you also get some sort of pattern-match quality ranking (a measure of fitness) for the find. That is your first and main indicator of label production quality.

Once you have that locate-A-in-B working (with the focus, lens, and lighting issues), you can improve the quality of your observation process before the image even enters the computer: a fixed camera position/rig kills variation in the observation subprocess, tuned lighting, etc.
Lastly, you can do what you originally asked about: tune your digital image preprocessing: contrast, sharpness, etc.

Tesseract OCR can definitely be part of your label quality-assurance process, but IMO it would be a second (inner) stage, where you use the Tesseract hOCR output to add additional statistical ranking numbers to your evaluation subprocess. (Tesseract's hOCR, TSV, etc. output can include ranking statistics per character, if told to do so.) Your issue sounds more like the challenge "did we get what we expected?" than the usual OCR task "can we read this at all? Let's see what we have got (I don't know yet)!", so you might be interested in evaluating those Tesseract-internal ranking numbers as well as the actual characters produced.
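To sketch what consuming those ranking numbers looks like: in a live run the dict below would come from pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT); here it is hard-coded in the same shape, with invented confidence values:

```python
# Hard-coded stand-in for pytesseract's image_to_data() DICT output;
# the words and confidences are invented for illustration.
data = {
    "text": ["S/N:", "0112182", "", "5X"],
    "conf": [96, 91, -1, 62],  # -1 marks non-word (layout) entries
}

# Flag words whose recognition confidence falls below a threshold;
# in a validation setup these are the labels to re-check or reject.
THRESHOLD = 80
suspect = [
    (word, conf)
    for word, conf in zip(data["text"], data["conf"])
    if conf != -1 and conf < THRESHOLD
]
```

Even when the recognized text happens to match, a low per-word confidence is a useful extra signal that the print (or the capture) is degrading.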

Since you are talking about "validation" rather than the usual "recognition", you also have a text Sollwert available: the "ground truth" text you wanted printed on that label. This allows you to add dedicated (= custom) text-comparison ranking adjustments, which are unavailable to folks using OCR to scan a new page image that does not originate from another computer/machine. In your case, 5 vs. S or s is a very high similarity: your (modern) printing process can be assumed not to make mistakes like that (you already have the reference text, so you 'know' this must be the OCR engine slipping up, i.e. noise injected into the observation/feedback loop). So you can create adjustment tables, or maybe a small neural net (to be trained), as you would be interested in symbol shapes rather than the precise text coming out of the OCR engine block. While (for example) S vs. 5 would be ranked pretty bad in a *recognition* task, yours is a *validation* task with 100%-quality ground-truth text available straight out of the production process machinery, so in your case S ranks pretty darn close to 5, the way I'm looking at it.
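A toy sketch of such an adjustment table (the confusion pairs and weights are invented for illustration; a real table would be tuned on your own font and camera data):

```python
# OCR confusions known to be harmless on this font/process: treat
# them as near-matches instead of hard mismatches. Invented values.
NEAR_MISSES = {("5", "S"), ("S", "5"), ("S", "O"), ("O", "S"),
               ("/", "I"), ("I", "/")}

def validation_score(expected: str, observed: str) -> float:
    """Score observed OCR text against the known ground-truth text.

    Exact character = 1.0, known confusion pair = 0.9, anything
    else = 0.0; the result is normalized to [0, 1].
    """
    if len(expected) != len(observed):
        return 0.0
    total = 0.0
    for e, o in zip(expected, observed):
        if e == o:
            total += 1.0
        elif (e, o) in NEAR_MISSES:
            total += 0.9
    return total / len(expected)
```

With this, an "SX" reading against ground truth "5X" scores high (it is almost certainly observation noise, not a misprint), while a genuinely different string scores near zero.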

Plenty more interesting challenges along the way, I'm sure. :-)

HTH,

Ger Hobbelt



