Trying to understand why Tesseract-ocr fails on some images

astro

Jul 26, 2023, 9:05:04 AM
to tesseract-ocr
Hi All,
    As I mentioned in an earlier message, I've got tesseract to
properly identify dates and times at a rate of about 84%. However, what
puzzles me is why the program reads the time stamp from one image
properly and fails on another. All the images are similar, and
for all of them I crop out the date/time area to isolate it. I have attached an
example.

tempImg.jpg is the full image, outpx.jpg is the cropped image, and
outpx.txt is the OCR result produced from the cropped image.

If anyone has any idea why OCR fails on this I would love to hear from you.

Thanks for your help.

Cheers
 Nor
tempImg.jpg
outpx.jpg
outpx.txt

nor s

Jul 26, 2023, 9:21:56 AM
to tesseract-ocr
To show an example of an OCR that properly extracted the date/time, here are the files I used.
showPix2.jpg is the full image, outpx2.jpg is the cropped image, and outpx2.txt is the result of the OCR.

As you can see, the image that failed and the one that worked are very similar.

Cheers
 Nor
outpx2.jpg
showPix2.jpg
outpx2.txt

nor s

Jul 26, 2023, 12:24:26 PM
to tesseract-ocr
Just to add a bit more information. I have found that changing the vertical position of the crop box by a few pixels seems to make a difference.
One image that had a crop location of +930+1015 was not reading the date/time. However, changing the vertical position to +1000 resulted in 105 out of 133 correct readings. Again, not being familiar with the internal workings of OCR, I am having difficulty understanding why OCR behaves this way.

Still digging! :)

Cheers
 Nor

nor s

Jul 26, 2023, 3:09:53 PM
to tesseract-ocr
OK, I think I found the sweet spot. Setting the location for the crop rectangle to +933+1013 from the top left corner of the image gives me an amazing result of 98.8% on average over 670 images. I think that's pretty good!
I still don't know why moving the box around a few pixels makes such a difference.
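
The manual search for a good crop offset can be automated. The sketch below is only an illustration: it builds ImageMagick-style crop geometries (WxH+X+Y) around a starting offset and scores a batch of OCR results against known-good timestamps. The 600x40 crop size, the search radius, and both helper names are assumptions, not values from this thread.

```python
from itertools import product


def crop_geometries(base_x, base_y, radius, width=600, height=40):
    """Return ImageMagick geometry strings for every offset within `radius` pixels.

    width and height are placeholder values; use the real size of the stamp area.
    """
    offsets = range(-radius, radius + 1)
    return [
        f"{width}x{height}+{base_x + dx}+{base_y + dy}"
        for dx, dy in product(offsets, offsets)
    ]


def accuracy(predicted, expected):
    """Percentage of OCR results that exactly match the expected text."""
    if not expected:
        return 0.0
    hits = sum(p.strip() == e.strip() for p, e in zip(predicted, expected))
    return 100.0 * hits / len(expected)


geometries = crop_geometries(933, 1013, radius=2)
print(len(geometries), geometries[0])  # 25 600x40+931+1011
```

Each geometry string can be passed to ImageMagick's -crop option, and the offset with the highest accuracy over the whole image set is the one to keep.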

I think I'm where I want to be. If anyone has any ideas or suggestions about what's happening, I'd love to hear from you.

Cheers
 Nor

Lorenzo Bolzani

Jul 27, 2023, 4:35:48 AM
to tesser...@googlegroups.com

Hi Nor,

I would crop the text as tightly as possible; this way you control exactly the text region (see the attached image). Also try adding a white border of 1 or 2 pixels afterwards, and see if this works best.

The image you sent is not pure black and white, so maybe the automatic cropping gets confused. At the bottom of the image there is a gray line that probably causes the problem. If you do not want to crop it yourself, threshold the image, but you will need to find a reasonable threshold value (experiment with Gimp). Cropping seems easier.
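
As a rough illustration of the tight crop plus white border idea, here is a minimal sketch that works on a grayscale image held as a plain list of rows (values 0 to 255). The function name, the dark-pixel cutoff of 128, and the border width are assumptions to be tuned; a real script would load the pixels with an imaging library first.

```python
def tight_crop_with_border(rows, dark_below=128, border=2, white=255):
    """Crop a grayscale image to the bounding box of its dark pixels,
    then pad the result with a white border on all four sides."""
    ys = [y for y, row in enumerate(rows) if any(p < dark_below for p in row)]
    xs = [x for row in rows for x, p in enumerate(row) if p < dark_below]
    if not ys:
        return rows  # nothing dark found; leave the image untouched
    top, bottom, left, right = min(ys), max(ys), min(xs), max(xs)
    cropped = [row[left:right + 1] for row in rows[top:bottom + 1]]
    blank = [white] * (len(cropped[0]) + 2 * border)
    padded = [blank[:] for _ in range(border)]
    padded += [[white] * border + row + [white] * border for row in cropped]
    padded += [blank[:] for _ in range(border)]
    return padded
```

Pixels lighter than the cutoff, such as a faint gray line along the bottom edge, do not extend the bounding box, so they are cropped away together with the rest of the background.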

Use psm 7, or 6, (see tesseract --help-extra).

With the tightly cropped images, try rescaling to a few fixed heights, like "original size", 30, 35, 40, 45, 50 px, and see what works best. Do a second pass on the best "height region" with a finer grid.

As you have a reasonable amount of test images, I would run a script to test all these combinations of preprocessing, a few hundreds, to find the sweet spot even if it may take a couple of hours.
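
A coarse-then-fine search like this could be organised as follows. This is only a sketch: the parameter names, the default value lists, and the `evaluate` callback (which would run the preprocessing plus OCR over the test images and return an accuracy) are all hypothetical.

```python
from itertools import product


def coarse_grid(heights=(None, 30, 35, 40, 45, 50), psms=(6, 7), borders=(0, 1, 2)):
    """All combinations of target height (None = original size),
    page segmentation mode, and white border width in pixels."""
    return [
        {"height": h, "psm": p, "border": b}
        for h, p, b in product(heights, psms, borders)
    ]


def fine_heights(best_height, span=4, step=1):
    """Heights for the second pass, centred on the best coarse height."""
    return list(range(best_height - span, best_height + span + 1, step))


def best_config(configs, evaluate):
    """Return the config with the highest score; evaluate maps a config to an accuracy."""
    return max(configs, key=evaluate)
```

The default grid has 36 combinations; adding more heights, thresholds, or crop offsets quickly brings it to the few hundred runs mentioned above.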

You can also use the whitelist to limit the valid characters, depending on the type of errors you are seeing.
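
For reference, a command line combining the page segmentation mode with a character whitelist might be assembled like this. The whitelist characters are a guess at what a date/time stamp contains and should be adjusted to the actual stamp format; the helper names are made up for this sketch.

```python
import subprocess


def tesseract_command(image_path, psm=7, whitelist="0123456789:/-"):
    """Build a tesseract call that prints the recognised text to stdout."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),  # 7 = treat the image as a single text line
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def read_text(image_path, **options):
    """Run tesseract (must be installed and on the PATH) and return its output."""
    result = subprocess.run(
        tesseract_command(image_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```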

The image looks very compressed; if possible, reduce the compression or use PNG.

I do not know which tool/language you are using but, if you are programming, see if you can find real API bindings (like tesserocr for Python) rather than a command line wrapper.


Bye

Lorenzo


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/631ff8fd-660e-4bb2-b558-013bcc00218cn%40googlegroups.com.
outpx2.jpg

Tom Morris

Jul 27, 2023, 12:19:49 PM
to tesseract-ocr
Instead of playing with OCR, why not just extract the date and time from the EXIF data in the JPEG using exiftool or a similar utility? The Bushnell camera traps apparently don't encode the other data, like the temperature, in the EXIF, but other manufacturers like Reconyx do.
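
For images that still carry their metadata, reading the capture time could look roughly like the sketch below. It assumes exiftool is installed and that the camera wrote the standard DateTimeOriginal tag; the function names are illustrative.

```python
import subprocess
from datetime import datetime


def parse_exif_datetime(value):
    """Parse the usual EXIF 'YYYY:MM:DD HH:MM:SS' form; return None if it does not match."""
    try:
        return datetime.strptime(value.strip(), "%Y:%m:%d %H:%M:%S")
    except ValueError:
        return None


def read_capture_time(image_path):
    """Ask exiftool for DateTimeOriginal (-s3 prints the bare value only)."""
    result = subprocess.run(
        ["exiftool", "-s3", "-DateTimeOriginal", image_path],
        capture_output=True, text=True, check=True,
    )
    return parse_exif_datetime(result.stdout)
```

When the tag is missing, exiftool prints nothing and `read_capture_time` returns None, which is the cue to fall back to OCR.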

Tom

astro

Jul 27, 2023, 12:24:45 PM
to tesseract-ocr
Hi Tom.
    Yup, I'm aware of the EXIF tools. Unfortunately, these images are not directly off the ssd card and have been modified somehow, losing the time stamp for the images.

Cheers
 Nor


nor s

Jul 27, 2023, 12:47:23 PM
to tesseract-ocr
I'm back and still at it. I've found that if I crop the images so that 1 or 2 pixels of the surrounding image are included in the box, Tesseract seems to do better at reading the date/time values. See attached.
I find this whole exercise a great learning experience.

Cheers
 Nor
outpx.jpg

astro

Jul 28, 2023, 12:35:30 PM
to tesseract-ocr
Still playing around with improving Tesseract-OCR's results.

 One more data point. As mentioned in my previous post, I found that if
there is a dark border at the top of the cropped image, the OCR works
much better. With that in mind, I decided to add my own 25-pixel black
border to the top of the cropped image by adding the draw command to the
command line input for ImageMagick (see attached). With this simple
addition I'm able to get 100% conversion in most cases.
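
The exact draw arguments were not posted, so the following is a different way to get a similar result rather than the command used above: ImageMagick's -splice operator inserts new rows filled with the background colour, which extends the image upward instead of painting over existing pixels. The helper name is made up, and `magick` assumes ImageMagick 7 (version 6 uses `convert`).

```python
def add_top_bar_command(src, dst, bar_height=25, color="black"):
    """Build an ImageMagick call that adds a solid bar above the image."""
    return [
        "magick", src,
        "-gravity", "north",    # splice at the top edge
        "-background", color,   # colour of the inserted rows
        "-splice", f"0x{bar_height}",
        dst,
    ]
```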

Cheers
 Nor

outpx.png

Ger Hobbelt

Jul 30, 2023, 6:14:51 PM
to tesser...@googlegroups.com
I had a bit of time to run a sample of yours through my (customized) tesseract rig and the OCR error (reading "Tas" instead of "11") is reproducible on my rig (5.3.2 + local patches).
This is what comes out as part of the diagnostics report:

brave_1Yfwkzyjrg.png

The red hashed areas designate the surroundings of the "word bounding box" currently processed in tesseract.

As can be seen, for some very curious reason, the "11" gets lopped off at the top, resulting in some weird OCR results (high confidence "Tas").

I don't know WHY this happens exactly -- that requires further investigation -- but this looks like a mishap in the segmentation code.

(For others who are interested: this is HTML generated from my custom tesseract; the text lines in the snapshot are tprintf() output, while the images have been added as part of the debug code, where the hashing, clipping, etc. is done via leptonica.)

BTW: note that the noise line at the bottom of the cropped image also affects the segmentation, as the boxes all reach all the way to the bottom. The bottom line noise is reported as "found some diacritics" and thus influences the line/segmentation code as well. But this DOES NOT explain why the "11"s get lopped off, while the other digits are not: see the screenshot.

Food for thought (and debugging).

Binarized image resulting from tesseract default Otsu thresholding/binarization is attached as well: here the bottom line noise is clearly visible.

nor-bushnell-decoded-debug.n0004.img0029.Setup.Page.Seg.And.Detect.Orientation.png

(This image is the binarized b&w image used internally by tesseract as the source image for segmentation/ocr/etc., which is *blended* with the original source image (as a subdued rose background); what matters here are the pure black pixels as those are what tesseract sees once we get at the segmentation + ocr stage.)



That's it for now; AFK for a while again.



Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


astro

Jul 30, 2023, 6:41:03 PM
to tesseract-ocr
Hi Ger
   Since a black stripe at the top of the image helps, do you think putting a similar stripe at the bottom of the image would help?

Nor

Ger Hobbelt

Jul 30, 2023, 6:54:47 PM
to tesseract-ocr
I haven't looked at the effect of the black stripe yet; I only had time to investigate your first image where you reported an OCR error (Tas <-> 11).
Frankly, I have no idea why that happens exactly; I've found where things go wrong *visually* but digging up which precise bit of the code decides to lop off the tops there is still an open question -- most of my time went into working on my diagnostics code, which is a work in progress (and benefits from your error reports!)
Hence I'm loath to say anything pro or contra another black bar. 

THEORY (and old practice) would suggest another approach, which is to delete that bottom layer of pixels, so that "noise" is not picked up by leptonica as "diacritics" and causing the segmentation code to incorrectly dimension the bounding boxes around the text as you can see (infer) from my partial screenshot. I ASSUME the black bar pushes the default Otsu threshold code to choose a lower (darker) cutoff, which would be a round-about way of "pushing those bottom noise pixels into the white" and thus "hiding" them from leptonica, but that is, right now, pure conjecture as I haven't checked yet what your black bar does re diagnostics output for tesseract.
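
To make the conjecture concrete, here is a small, self-contained Otsu implementation run on made-up pixel counts. It is a toy illustration that the mechanism is possible, not a measurement on the actual images and not tesseract's or leptonica's own code: the grey levels (30 for text, 120 for noise, 200 for background, 0 for the bar) and the pixel counts are invented.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises the between-class variance,
    where pixels <= t count as dark and pixels > t as light."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0 or n_light == 0:
            continue
        mean_dark = sum_dark / n_dark
        mean_light = (total_sum - sum_dark) / n_light
        score = n_dark * n_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Synthetic crop: dark text, mid-grey bottom noise, light background.
plain = [30] * 100 + [120] * 100 + [200] * 800
with_bar = plain + [0] * 500  # the same crop plus a solid black bar

print(otsu_threshold(plain))     # 120: the noise lands on the dark side
print(otsu_threshold(with_bar))  # 30: the noise is now classified as light
```

In this toy case the extra black pixels pull the cutoff down far enough that the mid-grey noise ends up on the white side, which matches the direction of the effect assumed above.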

To be addressed later this week I hope; next few days will be loaded with other (non IT) stuff here, so it'll take some time to reply to this.



Ger Hobbelt

Aug 4, 2023, 8:57:57 PM
to tesseract-ocr
L.S.,

Finally took the time to debug this as the '11'->'Tas' image-to-OCR-text conversion was a very curious one.

Turns out tesseract has a bug relatively deep inside its innards, where the actual code DOES NOT take the binarized pixel data (as one would expect it would use for OCR as those black&white pixels represent the *cleaned-up* source image) but grabs the (noisy!) *original image pixels* instead and feeds those straight into the LSTM engine, resulting in surprising OCR failures. 

See https://github.com/tesseract-ocr/tesseract/pull/4111 for a submitted bugfix and an extended description/analysis.

I expect this to impact more folks (including *myself*) who have/had WTF trouble with color image inputs and other non-black&white and/or noisy image sources (old book scans, etc.), but I haven't had time to check more images, apart from the first sample reported by astro/Nor. 

BTW: thanks to astro/Nor, the OP, for solid reporting; this enabled this evening's debug session and root cause analysis to happen at all!

Closing in on 3AM here so sleep is overdue; I hope others can reproduce my findings and not discover screw-ups on my part! 😅😅

Best regards,

Ger

P.S.: I haven't dug further into the "lopped off" effect as previously observed by me (see screenshot of augmented diagnostics output earlier in this message chain) as this is reasonably explicable by this bug + fix, which was discovered while taking out the BestPix() API, after which I ran a `git bisect` driven by tesseract OCR test runs to dig up the actual commit where the fix occurred *by happenstance*. All that means is that I'm a bit hand-wavey about that "lopped-off top of '11'" bounded-box as seen before: time/effort restrictions apply so there *might* be more lurking in that section of the tesseract codebase still... I'm not 100% sure, 's all I'm sayin'. 😅


