Want advice on how to proceed with Tesseract and reducing recognition errors

Ralph Cook

May 23, 2023, 7:05:22 PM
to tesseract-ocr
I have a (Java) application that uses Tesseract on English-language docs that are often scanned poorly. I cannot change the quality of the scanning. However, the significant data in the documents is all in a single fixed-width font, all in capital letters. I don't know the name of the font; I just call it "old line printer" when I have to refer to it. I've attached an example. (Unfortunately the information in the documents is confidential, so I cannot post a complete example without major work redacting things.)

Besides the understandable substitutions of 0 for O and confusions between 1 and I, some of the scans have started having Tesseract mistake D for B, 0 for G, and other similar errors. It only happens, it seems to me, when the quality of the scan is poor.

I am hoping that the very capable Tesseract engine can somehow be configured or trained or something to reduce these errors. I've seen references to "cleaning", to "training", and other things, but don't know what would be most appropriate here. I started looking at the documentation for training, but realized it was too much work to do on spec; I'm willing to do that if it's the best way to improve the tool for my situation, but would rather not do it before there's an informed opinion about whether it is.

What should I be looking at doing?
FontExample.png

Zdenko Podobny

May 25, 2023, 1:09:26 AM
to tesser...@googlegroups.com
I tried your example image with tesseract executable:

> tesseract FontExample.png - -c preserve_interword_spaces=1
#*%% DRIVER LICENSE STATUS: CLS C SUSPENDED *xx

LIC                                     LMT COND
CLASS GRP TYP ISSUE DT EXPIR DT CDL DISQ PROB PRIV RESTR    STATUS
I   D 06-16-22 03-22-29 N    N     N     N    N    ID CARD

ENDORS:


As far as I can see, the only problem is with "***" (Tesseract has trouble with repeated symbols); there is no problem with D vs B or 0 vs G.
Try checking which PSM and traineddata (tessdata, best, fast) your app uses (my test used https://github.com/tesseract-ocr/tessdata).
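
One way to run that check from Java is to call the tesseract executable with the PSM and tessdata directory spelled out, so the result can be compared with what the app gets through its own wrapper. This is only a rough sketch: the class name TesseractCli, its helper methods, and the /opt/tessdata path are made up for illustration, and it assumes tesseract is on the PATH and Java 9 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TesseractCli {

    /** Builds the argument list for one tesseract run that writes text to stdout. */
    static List<String> buildCommand(String image, String tessdataDir, String lang, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add("-");                      // output base "-" sends the text to stdout
        if (tessdataDir != null) {         // null means: use the default tessdata location
            cmd.add("--tessdata-dir");
            cmd.add(tessdataDir);
        }
        cmd.add("-l");
        cmd.add(lang);
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        cmd.add("-c");
        cmd.add("preserve_interword_spaces=1");
        return cmd;
    }

    /** Runs the command and returns the recognized text; needs tesseract installed. */
    static String run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show tesseract warnings
                .start();
        byte[] out = p.getInputStream().readAllBytes();
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with status " + exit);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // PSM 3 = automatic page segmentation (the default), 4 = single column,
        // 6 = single uniform block of text. "/opt/tessdata" is a placeholder path.
        for (int psm : new int[] {3, 4, 6}) {
            List<String> cmd = buildCommand("FontExample.png", "/opt/tessdata", "eng", psm);
            System.out.println(String.join(" ", cmd));
            // To actually run it: System.out.println(run(cmd));
        }
    }
}
```

Pointing --tessdata-dir at a checkout of tessdata, tessdata_best, or tessdata_fast in turn, and varying the PSM, should show which combination your app is effectively using and whether the D/B and 0/G errors follow the model or the segmentation mode.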

Zdenko


On Wed, May 24, 2023 at 1:05, Ralph Cook <rcja...@gmail.com> wrote: