How to prevent Tesseract from interpreting noise as characters


Iain Downs

Jul 16, 2024, 12:38:02 PM
to tesseract-ocr
I'm working on processing scanned paperback books with Tesseract (C++ API at the moment).  One issue I've found is that when a page has little or no text, Tesseract gets over-keen and interprets noise as text.

The image below is the raw page.  In this case it's the inside front cover of a book.
HookRawPage.jpg
This is the image after tesseract has processed it (binarization) and before the character recognition.
HookPostProcessed.jpg

Tesseract suggests that there are 160 or so words (by some definition of "word"!) on this page, as per the attached (Hook02Small.txt).

This also happens on pages which DO contain text, but only a small amount.  I suspect that the binarization (possibly Otsu?) is to blame.  I can probably do something to detect entirely blank pages, but I'm less sure what to do with mainly blank pages.

Any suggestions most welcome!

Iain


Hook02Small.txt

Iain Downs

Aug 4, 2024, 7:22:17 AM
to tesseract-ocr
In the event that anyone else has a similar issue, this is how I approached it.

Firstly, make a histogram of the number of pixels with each intensity (so an array of 256 numbers).

When you inspect this you get results like the below.

Finding empty pages.png

This is after a little smoothing and taking the log of the values.

You can see that the properly blank pages show few or no very dark (black) pixels, whereas the pages with some text, even if only a small amount, have a fair number.

I simply set a cutoff level (in this case 1) and a cutoff intensity (in my case 80): provided the log-smoothed histogram first reaches the cutoff level of 1 at an intensity below 80, the page is text; otherwise it is blank.

You can also see the problem which Tesseract has (with default binarisation) in that the intensity distribution is distinctly bimodal.  I think this is due to bleed-through from the reverse of the page.  Of course, that is essentially what Otsu's method uses to pick out 'black' from 'white'.
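For anyone wanting to try this, the steps above can be sketched roughly as follows (a minimal stdlib-only C++ sketch; the smoothing window is an assumption, and the cutoff level of 1 and cutoff intensity of 80 are the values from this post, which you would tune for your own scans):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Decide whether a greyscale page contains text, using the histogram
// heuristic described above: build a 256-bin intensity histogram,
// smooth it a little, take logs, and check whether the curve first
// reaches the cutoff level at a dark-enough intensity.
bool pageHasText(const std::vector<std::uint8_t>& pixels,
                 double cutoffLevel = 1.0, int cutoffIntensity = 80)
{
    // 1. Histogram of pixel intensities (an array of 256 counts).
    std::vector<double> hist(256, 0.0);
    for (std::uint8_t p : pixels) hist[p] += 1.0;

    // 2. A little smoothing (moving average, window of 5 here).
    std::vector<double> smooth(256, 0.0);
    for (int i = 0; i < 256; ++i) {
        double sum = 0.0;
        int n = 0;
        for (int j = i - 2; j <= i + 2; ++j)
            if (j >= 0 && j < 256) { sum += hist[j]; ++n; }
        smooth[i] = sum / n;
    }

    // 3. Take the log of the values (log1p avoids log(0) on empty bins).
    for (double& v : smooth) v = std::log1p(v);

    // 4. Find the first (darkest) intensity where the curve reaches the
    //    cutoff level; if it is darker than cutoffIntensity, the page
    //    has a real population of black pixels, i.e. text.
    for (int i = 0; i < 256; ++i)
        if (smooth[i] >= cutoffLevel)
            return i < cutoffIntensity;

    return false;  // the curve never reached the level: blank page
}
```

A mostly-white page only puts counts in the bright bins, so the curve first crosses the cutoff well above intensity 80 and the page is classed as blank; a page with even a little ink populates the dark bins and crosses early.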

Iain

Zdenko Podobny

Aug 4, 2024, 7:44:41 AM
to tesser...@googlegroups.com
tesseract unnamed.jpg -
Estimating resolution as 182

...and no recognized words. So the problem could be in the parameters you used for OCR...

Before OCR I suggest image preprocessing and maybe the detection of empty pages.
Have a look at the Leptonica example for normalizing uneven illumination (pixBackgroundNorm in https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_adapt.c) and then binarize the image.
I think with some more "aggressive" parameters you can get a clean empty page, so you will not need to modify your OCR parameters...
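For intuition, pixBackgroundNorm estimates a smooth map of the local background brightness and rescales the image so that the background becomes uniformly near-white, which pushes uneven illumination and bleed-through towards white while leaving ink dark. A minimal stdlib-only C++ sketch of that idea (the per-tile maximum is a crude stand-in for Leptonica's filtered background map, and the tile size and target value are illustrative, not Leptonica's defaults):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of tile-based background normalization in the spirit of
// Leptonica's pixBackgroundNorm: estimate the local background
// brightness in each tile, then rescale the tile so its background
// maps to a uniform target value (bgval).
std::vector<std::uint8_t> normalizeBackground(
    const std::vector<std::uint8_t>& img, int w, int h,
    int tile = 16, int bgval = 250)
{
    std::vector<std::uint8_t> out(img.size());
    for (int ty = 0; ty < h; ty += tile) {
        for (int tx = 0; tx < w; tx += tile) {
            int yEnd = std::min(ty + tile, h);
            int xEnd = std::min(tx + tile, w);

            // Background estimate: brightest pixel in the tile.
            int bg = 1;
            for (int y = ty; y < yEnd; ++y)
                for (int x = tx; x < xEnd; ++x)
                    bg = std::max<int>(bg, img[y * w + x]);

            // Rescale the tile so its background becomes bgval;
            // ink, being much darker than bg, stays dark.
            for (int y = ty; y < yEnd; ++y)
                for (int x = tx; x < xEnd; ++x) {
                    int v = img[y * w + x] * bgval / bg;
                    out[y * w + x] =
                        static_cast<std::uint8_t>(std::min(v, 255));
                }
        }
    }
    return out;
}
```

After this, a global threshold (or Otsu) has a much easier job, because the grey bleed-through mode has been pushed up next to the white background mode instead of sitting halfway to black.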

Zdenko


On Sun, 4 Aug 2024 at 13:22, Iain Downs <ia...@idcl.co.uk> wrote:

Iain Downs

Aug 6, 2024, 10:56:08 AM
to tesseract-ocr
Thanks for this, Zdenko. I've had a look at some resources on 'greyscale closing' and kind of get it.  However, my app is currently in C# and the library I'm using does all the pix functions.  I will try to build the sample in C++ and see what it does.

Iain

Iain Downs

Aug 10, 2024, 8:44:04 AM
to tesseract-ocr
Zdenko - I've had a look at the sample code (in C++!) and tried it out on my files.  It clearly works well at cleaning the pages up, but does no better on my 'empty' pages than my histogram approach.  Also, and unfortunately, I get slightly worse accuracy recognising the text than with the default Tesseract processing, so for the moment I don't think I will take this approach.  However, many thanks for the input.

Iain
