Text extraction failure after preprocessing.

uday kaipa

Jun 27, 2024, 12:24:36 PM

to tesseract-ocr

Hi,

I have an image containing the number 96 (in general, the images may contain any number between 0 and 100). Please find it attached.

I have tried Tesseract with PSM values 6 through 13. The image size, the font, and everything else look fine to me, but the text is recognized as 36.

When I adjust the padding or other preprocessing, it works for this image, but then some other images are recognized incorrectly.

Can anyone recommend any other preprocessing that might improve the recognition?

tesseract --oem 1 --psm 7 -c tessedit_char_whitelist=0123456789.: C:/Users/xxx/Desktop/test_folder/IMG_2303_2cfac/subboxes/Image_BHU32_1_PREPROCESSED_27-06-2024_17h39m53s.JPG new hocr
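For anyone who wants to repeat the PSM 6 to 13 comparison, here is a minimal Python sketch that only builds the command lines and does not run them. It reuses the flags from the command above; the function names, the `out_psm<N>` output names, and the `subbox.png` file name are placeholders of mine, not anything from my actual setup.

```python
import shlex

def build_tesseract_cmd(image_path, out_base, psm, oem=1,
                        whitelist="0123456789.:", output_format="hocr"):
    """Return the argument list for one Tesseract run (nothing is executed)."""
    return [
        "tesseract",
        "--oem", str(oem),
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
        image_path,
        out_base,
        output_format,
    ]

def commands_for_psm_range(image_path, first_psm=6, last_psm=13):
    """One command per PSM value, each writing to its own output base name."""
    return {
        psm: build_tesseract_cmd(image_path, f"out_psm{psm}", psm)
        for psm in range(first_psm, last_psm + 1)
    }

if __name__ == "__main__":
    # "subbox.png" is a placeholder file name, not a file from this thread.
    for psm, cmd in commands_for_psm_range("subbox.png").items():
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where Tesseract is installed, which makes it easier to compare the outputs of all modes side by side.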

Many thanks in advance.

Regards

Uday

Image_BHU32_1_PREPROCESSED_27-06-2024_17h39m53s.JPG

uday kaipa

Jun 28, 2024, 8:09:11 AM

to tesseract-ocr

I resized the image so that the text height is around 30 px, and I added a 10 px border, as recommended in some threads here.

I also converted the image to binary and tried all PSM modes.
I am not sure why it is still not OCR'ed properly.
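To make the steps concrete, this is roughly the arithmetic behind them, written as a small pure-Python sketch on nested lists rather than with a real imaging library. The fixed threshold of 128, the white fill value, and the function names are assumptions for illustration only; they are not the exact settings I used.

```python
def scale_factor(text_height_px, target_height_px=30):
    """Factor by which to resize the image so the text is about target_height_px tall."""
    if text_height_px <= 0:
        raise ValueError("text height must be positive")
    return target_height_px / text_height_px

def binarize(gray_rows, threshold=128):
    """Map each gray value (0-255) to 0 (black) or 255 (white) with a fixed threshold."""
    return [[255 if v >= threshold else 0 for v in row] for row in gray_rows]

def add_border(rows, border=10, fill=255):
    """Surround the image with `border` pixels of `fill` (white by default)."""
    width = len(rows[0]) + 2 * border
    blank = [fill] * width
    padded = [blank[:] for _ in range(border)]
    padded += [[fill] * border + list(row) + [fill] * border for row in rows]
    padded += [blank[:] for _ in range(border)]
    return padded

if __name__ == "__main__":
    gray = [[200, 40, 210], [190, 35, 220]]  # toy 3x2 image with a dark stroke in the middle
    out = add_border(binarize(gray), border=2)
    print(len(out[0]), len(out))  # width and height after padding
    print(scale_factor(45))       # factor for text that is currently 45 px tall
```

In practice the same three steps would be done with an image library; the sketch only shows the order (resize, binarize, pad) and the sizes involved.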

Any help is appreciated. :)

resized.jpg

bin_border.jpg

bin.jpg

Zdenko Podobny

Jun 28, 2024, 9:31:15 AM

to tesser...@googlegroups.com

First of all, using JPG as a format for image processing and OCR is not a good idea: its lossy compression introduces artifacts around character edges, so a lossless format such as PNG is a safer choice.

Next, this does not look like a very standard font, so you may need to train Tesseract for it.

To me, it looks like a heavily preprocessed 7-segment font, so I tried this:

tesseract 14.png - --psm 7 --oem 0 -l letsgodigital
14

Zdenko

On Fri, Jun 28, 2024 at 14:09, 'uday kaipa' via tesseract-ocr <tesser...@googlegroups.com> wrote:

14.png

uday kaipa

Jun 28, 2024, 11:07:24 AM

to tesseract-ocr

Hi Zdenko,

Thanks for your recommendations about the image format and the letsgodigital traineddata. Yes, you are right: I got the digits from a segment display.
I will try the training process, but before that I wanted to try other options.

I suppose you used lets.traineddata after renaming it. When I tried the same command with the same PSM on the PNG image, I got .4 instead.

By the way, did you apply any processing to the image? The edges look slightly different.

tesseract 14.png out -l lets --oem 0 --psm 7

.4
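Since my readings should always be a number between 0 and 100, an output like .4 can at least be flagged automatically. Below is a minimal sketch of such a check; `parse_reading` is a hypothetical helper of mine, and the pattern assumes a reading is one to three digits with an optional decimal part. It cannot catch an in-range misread such as 36 for 96.

```python
import re

# One to three digits, optionally followed by a decimal part (an assumed format).
_READING = re.compile(r"\d{1,3}(?:\.\d+)?")

def parse_reading(ocr_text, low=0.0, high=100.0):
    """Return the reading as a float, or None if the text is not a plausible value."""
    text = ocr_text.strip()
    if not _READING.fullmatch(text):
        return None  # rejects strings such as ".4", "" or "9:6"
    value = float(text)
    return value if low <= value <= high else None

if __name__ == "__main__":
    for sample in [".4", "14", "96", "140"]:
        print(repr(sample), parse_reading(sample))
```

Rejected images could then be retried with a different PSM or traineddata file instead of silently storing a wrong value.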

Thanks for your time.

Zdenko Podobny

Jun 28, 2024, 11:28:50 AM

to tesser...@googlegroups.com

As far as I remember, the traineddata are from https://github.com/arturaugusto/display_ocr/blob/master/letsgodigital/letsgodigital.traineddata

Also, check https://github.com/Shreeshrii/tessdata_ssd for Seven Segment Display recognition.

Zdenko

On Fri, Jun 28, 2024 at 17:07, 'uday kaipa' via tesseract-ocr <tesser...@googlegroups.com> wrote:
