Suggestions wanted on how to improve recognition


Ralph Cook

Jul 1, 2024, 12:21:25 AM
to tesseract-ocr
I have an application using Tesseract on documents that are all in English and all in one font. Everything I want to recognize is capital letters, digits, and punctuation.

The quality of the scans is often poor, and I have no control over that. It's sometimes about what you would expect with pages that are scanned, printed, then scanned again; lots of noise, characters not distinct, etc.

I don't know what the font is, I call it "Old Line Printer". Here's a sample:

Sample text anonymized.png

I have erased some identifying information and scratched some lines where it went.

I am not familiar with OCR technology in general, nor with neural networks. I've read in the documentation about how to improve the image, some things about training, some things about how training is likely not necessary, etc. I'm looking for someone to recommend an overall strategy: what should I try first, what is the best 2nd plan, is there likely to be a 3rd, etc. I'm trying not to spend weeks studying the wrong things.

Ger Hobbelt

Jul 1, 2024, 4:18:31 AM
to tesseract-ocr
Hi, 

More on this later (I seem to still have issues posting with attachments here, plus running into a few surprises while doing bulk testing, so this is preliminary):

1. Don't use lossy image file formats if you can avoid it, so PNG is better than JPEG. From what I see, if you need lossy due to storage limitations, WebP seems to be better than JPEG. This has to do with the type of noise JPEG introduces as "JPEG artifacts".

2. Scale (resize; use ImageMagick or another tool to do this in bulk) the input image so that capital letters are approximately 30px high on each line. That's the ballpark, so do try a couple of scales near that measure, e.g. test results with a set of images scaled in 5% steps to see which scale is 'optimal' for you. It can help to then run an additional test set with scales in a 1-2% geometric range (i.e. the next scale to try is 102% of the previous, smaller test size).
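
A minimal Python sketch of that scale sweep, assuming you have measured the capital letter height in the original scan yourself. The function name `candidate_scales` and its parameters are illustrative, not part of Tesseract or ImageMagick; the factors it returns would be passed to whatever resize tool you use.

```python
def candidate_scales(measured_cap_height_px, target_px=30.0, step=1.05, steps_each_side=2):
    """Return resize factors spaced geometrically around the factor that
    brings capital letters to roughly target_px tall."""
    if measured_cap_height_px <= 0:
        raise ValueError("cap height must be positive")
    base = target_px / measured_cap_height_px
    return [base * step ** k for k in range(-steps_each_side, steps_each_side + 1)]


# Coarse pass: 5% steps around the 30px ballpark for a scan whose
# capitals are (hypothetically) 20px tall.
coarse = candidate_scales(20, step=1.05)

# Fine pass: 2% steps around whichever coarse scale scored best.
# The middle entry is only a placeholder; choose it from measured results.
best_coarse = coarse[2]
fine = [best_coarse * 1.02 ** k for k in range(-2, 3)]
```

Each factor is a multiplier on the original image size, so the middle coarse entry is simply 30 divided by the measured height.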

How to check: produce both hOCR and TSV output with character confidence reporting turned on (Tesseract's hOCR output for character confidence is broken, so those numbers only show in the TSV), then read those files and check both the character and word confidence values that Tesseract reports. Pick the scaling plus other preprocessing that gives the highest average confidence across your test set.
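
One way to compare test runs is to average the confidence column of each TSV file. The sketch below assumes the usual Tesseract TSV layout (a header row with `conf` and `text` columns, and `conf = -1` on rows that are not words) and only looks at word rows; `mean_word_confidence` and `SAMPLE` are made-up names for illustration.

```python
import csv
import io


def mean_word_confidence(tsv_text):
    """Average the 'conf' column over the word rows of a Tesseract TSV dump.

    Page, block, paragraph and line rows carry conf = -1 and are skipped,
    as are rows with an empty text cell.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confs = []
    for row in reader:
        try:
            conf = float(row["conf"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed or short row
        if conf >= 0 and (row.get("text") or "").strip():
            confs.append(conf)
    return sum(confs) / len(confs) if confs else 0.0


# Tiny hand-written sample in the assumed layout: one page row, two word rows.
SAMPLE = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t60\t30\t90\tSUSP\n"
    "5\t1\t1\t1\t1\t2\t80\t10\t40\t30\t70\tISS\n"
)
average = mean_word_confidence(SAMPLE)
```

Running this over the TSV of every scaled variant and keeping the variant with the highest average gives a simple, repeatable comparison.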


After that, it depends...

BTW: to my eye your image isn't noisy and you mention noise, hence: you got a few rotten ones for us?  ;-)


Re noise, preprocessing: what I find helps is killing (masking) all noise that is more than a few pixels away from any character, particularly when you are processing low-dpi / JPEG input. This must be done before feeding the image to Tesseract: current Tesseract does thresholding, etc. to detect the spots where the text (words) is, but the latest engine (LSTM) is fed the raw input pixels, so any useless noise ends up in there and degrades the output.
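
A rough pure-Python illustration of that masking idea on an already binarized image (1 = ink, 0 = background). It rests on an assumption not stated above: "characters" are taken to be connected components of at least `min_char_pixels` pixels, and smaller specks are kept only when they lie within `margin` pixels of one. The names and thresholds are hypothetical, and a real pipeline would more likely use an image library than nested lists.

```python
from collections import deque


def components(grid):
    """Return the 8-connected components of ink pixels as lists of (row, col)."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for r in range(h):
        for c in range(w):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            found.append(pixels)
    return found


def mask_far_noise(grid, min_char_pixels=4, margin=2):
    """Erase small specks lying more than `margin` pixels from every large component."""
    if not grid or not grid[0]:
        return [row[:] for row in grid]
    comps = components(grid)
    char_pixels = [px for p in comps if len(p) >= min_char_pixels for px in p]
    out = [[0] * len(grid[0]) for _ in grid]
    for pixels in comps:
        # Large components are kept; small ones only if close to a character.
        keep = len(pixels) >= min_char_pixels or any(
            max(abs(y - cy), abs(x - cx)) <= margin
            for (y, x) in pixels for (cy, cx) in char_pixels
        )
        if keep:
            for y, x in pixels:
                out[y][x] = 1
    return out


# A 6-pixel "character" on the left, one speck close to it, one far away.
page = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = mask_far_noise(page)
```

In this toy page the speck two pixels from the character survives and the one at the right edge is erased. The distance check is quadratic in the number of pixels, so this is only meant to show the logic.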


TLDR:

- scale
- denoise
- enhance contrast (not necessary in your case)
- ... other means to make the image more legible, anything goes ...
- dictionary, etc. for tesseract or post-processing: I see you've got jargon in there (susp, iss, ...) which are not regular English dictionary words, so it might help to use a custom dict (but I don't have hard data on that one yet myself)





Ralph Cook

Jul 1, 2024, 7:29:33 AM
to tesseract-ocr
I'll look into the scaling and denoising.

I have no control over the input format. If you mean to take the TIFF image I've got and convert it before OCR, please say that.

Yes, the example I gave was not one of the noisy inputs. I've looked through the ones I have handy, and none of them seem to be that bad -- I'll look up some of poor quality and post those as well.

Thanks.

Ger Hobbelt

Jul 1, 2024, 4:27:15 PM
to tesseract-ocr
TIFF should be okay (IIRC it's usually not a lossy compression format).

The advice re image formats is most relevant when you preprocess your scanned TIFF images: always use a lossless format, e.g. PNG, as the intermediate output format. So when using ImageMagick, for example, do

      magick input.tiff   -resize WxH     image.png
      tesseract ........ image.png

instead of 

      magick input.tiff   -resize WxH     image.jpg
      tesseract ........ image.jpg


Cheers,

Ger