tesseract returns random and spurious characters

Z. Jay

Jun 21, 2022, 1:25:33 PM

to tesseract-ocr

We have been using a competing OCR tool and are now evaluating a switch to tesseract. However, when converting a PNG, tesseract occasionally (rarely, but unpredictably) returns characters where there is only white space. For example, it will return a comma or an equals sign where the image is blank. Scrutinizing the PNG, I do not see anything, such as dirt or a speck, that looks like anything other than white space. While this is rare and random, it happens often enough to be a problem. Note that it does not occur with our current OCR tool. I suspect someone has encountered this issue before and already posted a solution on this list or elsewhere.

For reference, here is a comparison of the actual text and the text returned by tesseract:
Actual:
10/17 10/17, 0000 PAYMENT THANK YOU $64.79CR

Returned:
10/17, 10/17, 0000 =PAYMENT THANK YOU $64.79CR
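To pinpoint exactly which characters were added, the two lines above can be compared character by character. The following is a minimal sketch using only Python's standard `difflib` module; the function name `spurious_insertions` is made up for illustration and is not part of tesseract.

```python
import difflib


def spurious_insertions(actual: str, returned: str) -> list[tuple[int, str]]:
    """List (offset in `returned`, inserted text) for text that is absent from `actual`."""
    # autojunk=False keeps difflib from treating frequent characters as noise.
    matcher = difflib.SequenceMatcher(None, actual, returned, autojunk=False)
    return [
        (j1, returned[j1:j2])
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag == "insert"
    ]


# The two lines quoted in this post.
actual = "10/17 10/17, 0000 PAYMENT THANK YOU $64.79CR"
returned = "10/17, 10/17, 0000 =PAYMENT THANK YOU $64.79CR"

for offset, text in spurious_insertions(actual, returned):
    print(f"offset {offset}: unexpected {text!r}")
```

For this pair it reports a comma at offset 5 and an equals sign at offset 19, which matches what is visible by eye. Run over many lines, the same check gives a rough count of how often the problem occurs.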

Any pointers appreciated.

Thanks,

--zj

Terry Hardie

Mar 15, 2023, 1:28:27 PM

to tesseract-ocr

I'm having the same issue, although I see it when calling tesseract programmatically. If I take the same image (a PERFECT source, a PNG rendered from a machine-generated PDF) and run it through tesseract on the command line, the equals sign does not show up.
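One way to narrow down a difference like this is to state the language and page segmentation mode explicitly on the command line, use the same values in the programmatic call, and then compare the two outputs line by line. The sketch below is only an assumption about how such a check could look; it does not claim that settings are the cause here. The helper names (`cli_command`, `ocr_via_cli`, `differing_lines`) are hypothetical, and `api_text` stands for whatever text your own binding returned.

```python
import subprocess
from itertools import zip_longest


def cli_command(image_path, lang="eng", psm=3, binary="tesseract"):
    """Build a tesseract command line with language and page segmentation mode spelled out."""
    # "stdout" as the output base makes tesseract print the recognized text.
    return [binary, image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_via_cli(image_path, **settings):
    """Run the command-line tool and return its text output (requires tesseract on PATH)."""
    proc = subprocess.run(
        cli_command(image_path, **settings),
        capture_output=True, text=True, check=True,
    )
    return proc.stdout


def differing_lines(cli_text, api_text):
    """Return (line number, CLI line, API line) wherever the two outputs disagree."""
    pairs = zip_longest(cli_text.splitlines(), api_text.splitlines(), fillvalue="")
    return [
        (number, cli_line.rstrip(), api_line.rstrip())
        for number, (cli_line, api_line) in enumerate(pairs, start=1)
        if cli_line.rstrip() != api_line.rstrip()
    ]
```

Typical use would be `differing_lines(ocr_via_cli("page.png", psm=6), api_text)`, where `page.png` and the mode `6` are placeholders. If the outputs still differ with identical settings, the image handed to the library (resolution, color mode, scaling) is the next thing to compare.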

I hope you managed to find a solution and just haven't updated this thread?

Thanks!

Zdenko Podobny

Mar 24, 2023, 3:53:27 AM

to tesser...@googlegroups.com

Hello,

Unless you provide a test case that reproduces the problem (plus information about your tesseract version, language data, platform, etc.), nobody can help you.
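The environment details requested here can be gathered with a short script and pasted next to the test image. This is a rough sketch, assuming the `tesseract` binary is on `PATH`; the function name `environment_report` is hypothetical, while `--version` and `--list-langs` are existing tesseract command-line options.

```python
import platform
import shutil
import subprocess


def _run(cmd):
    """Run a command and return its combined output, or a short failure note."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed: {exc}>"
    # Both streams are kept because some builds print this information to stderr.
    return (proc.stdout + proc.stderr).strip()


def environment_report(binary="tesseract"):
    """Collect platform, tesseract version, and installed language data."""
    path = shutil.which(binary)
    report = {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "binary": path or "<not found on PATH>",
    }
    if path:
        report["version"] = _run([path, "--version"])
        report["languages"] = _run([path, "--list-langs"])
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}:\n{value}\n")
```

Attaching this output together with the exact image and the exact command or API call used gives others what they need to try reproducing the result.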

Zdenko
