tesseract and "empty page" issue?

Valent

Jul 10, 2012, 4:40:23 AM
to tesseract-ocr
Hi,
I'm trying to OCR my gas meter [1] readings, and I've run into an issue:
tesseract doesn't recognize anything in some TIFF images and just
reports "Empty page".

Has anybody had this issue?

Here are cropped numbers:
https://dl.dropbox.com/u/184632/ocr-gas-cropped.tif

$ tesseract ocr-gas-cropped.tif output
Tesseract Open Source OCR Engine with LibTiff
Empty page

As you can see, it fails to recognize anything, so I edited the image
in GIMP (converting it to grayscale) and got this one:
https://dl.dropbox.com/u/184632/ocr-gas-cropped-grayscale.tif

and that one works:
$ tesseract ocr-gas-cropped-grayscale.tif output
Tesseract Open Source OCR Engine with LibTiff

$ cat output.txt
O 1 5 1 1@3» »'4*?5


But when I went one step further and cleaned up the image to make it
easier to recognize, it failed again:
https://dl.dropbox.com/u/184632/ocr-gas-cropped-grayscale-clean.tif

$ tesseract ocr-gas-cropped-grayscale-clean.tif output
Tesseract Open Source OCR Engine with LibTiff
Empty page


Any idea why this is happening?

I would like to use tesseract to read my gas meter automatically. Is
this even possible? And can tesseract be forced to recognize only
numbers and ignore letters?

[1] https://dl.dropbox.com/u/184632/ocr-gas-meter.jpg

Cheers,
Valent.

Nick White

Jul 10, 2012, 7:31:01 AM
to tesser...@googlegroups.com
Hi Valent,

Just to answer the easy bit of your email ;)

On Tue, Jul 10, 2012 at 01:40:23AM -0700, Valent wrote:
> Is it possible to force
> tesseract to recognize only numbers and to ignore letters?

Yes, and easy:
http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_only_digits?
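
For anyone scripting this, here is a minimal Python sketch of the digits-only approach. It assumes the `tesseract` binary is on your PATH and that your install ships a config file named `digits` under tessdata/configs. The helper names (`digits_command`, `keep_digits`, `read_digits`) are made up for illustration. The post-filter is an extra safety net of my own, not something Tesseract does.

```python
import subprocess

DIGITS = "0123456789"


def digits_command(image, outbase):
    """Build a tesseract command line that loads the `digits` config."""
    # Assumption: a config file called `digits` exists in tessdata/configs.
    return ["tesseract", image, outbase, "digits"]


def keep_digits(text):
    """Drop every character that is not an ASCII digit."""
    return "".join(ch for ch in text if ch in DIGITS)


def read_digits(image, outbase="output"):
    """Run tesseract on `image` and return only the digits it reported."""
    subprocess.run(digits_command(image, outbase), check=True)
    # tesseract writes its result to <outbase>.txt
    with open(outbase + ".txt", encoding="utf-8") as fh:
        return keep_digits(fh.read())
```

Calling `read_digits("ocr-gas-cropped-grayscale.tif")` would then return whatever digits tesseract found, with stray punctuation removed.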

Valent

Jul 11, 2012, 3:14:13 AM
to tesseract-ocr
> Yes, and easy: http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_on...

Thanks for this info!

Has anybody tested my image files? Are they the problem or is this
some tesseract bug that I should report?

Valent

Jul 12, 2012, 3:49:03 AM
to tesseract-ocr
> Has anybody tested my image files? Are they the problem or is this
> some tesseract bug that I should report?

Is this Google group the right place to ask these questions? Or is
there some other developers' mailing list?

Nick White

Jul 12, 2012, 4:19:44 AM
to tesser...@googlegroups.com
Hi Valent,

Sorry for taking a while to reply properly. This is the right place
for your questions. There are just rather more people asking questions
than answering them here at the moment.

I'll reply to you inline.

On Tue, Jul 10, 2012 at 01:40:23AM -0700, Valent wrote:
> I'm trying to OCR my gas meter [1] usage, and I stumbled upon issue
> that tesseract doesn't recognize anything in some tif images, just
> gives "empty page".
>
> Has anybody had this issue?

I presume you're using Tesseract 2? I have Tesseract 2.04 installed
on my Debian Squeeze box, and ran it on the three images you link.
They all returned text, with your third, grayscale-clean, coming out
best.

I would guess that perhaps your tesseract isn't reading the TIFFs
properly. TIFF is a pretty diverse file format, and tesseract only
handles some variants of it. Is your Tesseract compiled with compressed TIFF
support? If you can, I recommend using Tesseract 3.01, linked to
the Leptonica library. That way you can use PNGs, which are much
easier to deal with and more reliable. Failing that, see if you can
get ImageMagick to produce something that your Tesseract will read
reliably. Something like this definitely ought to work:

convert in.png -monochrome -density 600 -compress none out.tif
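
If you want to check whether a given TIFF is compressed before blaming tesseract, a rough Python sketch like the one below reads the Compression tag (tag 259 in the TIFF 6.0 spec) from the first image directory. This is only an illustration, not anything tesseract itself does. The function name `tiff_compression` is made up, and it handles classic TIFF only, not BigTIFF. A value of 1 means uncompressed, 5 means LZW and 32773 means PackBits.

```python
import struct

COMPRESSION_TAG = 259  # TIFF 6.0 "Compression" field


def tiff_compression(data):
    """Return the Compression value of the first image in a classic TIFF."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (n_entries,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(n_entries):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, _ftype, _count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            # A single SHORT value sits in the first 2 bytes of the value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return 1  # the spec's default when the tag is absent: no compression
```

Usage would be something like `tiff_compression(open("out.tif", "rb").read())`.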

> If possible I would like to use tesseract for automatically reading my
> gas meter usage, is this even possible?

Yes, and it looks like you're close. Good project, I like it :) Let
us know how you get on.

Best of luck,

Nick

Valent

Jul 13, 2012, 2:39:53 AM
to tesseract-ocr
Thanks for your answers, and sorry for being a bit impatient.
I'm using tesseract 3.0 (the latest version on Fedora 17). It looks
like OpenWrt has tesseract 3.031 packaged as well, so I'll try to do
the OCR on the embedded machine if it can handle it; if not, I'll do it
on my home server.

I used uncompressed TIFFs saved with the latest GIMP (2.8). I'll also
try command-line ImageMagick in case there is some issue with GIMP. But
since you managed to read them with tesseract 2.04, maybe the issue is
that tesseract 3.0 doesn't like these TIFFs for some reason?

I'll keep you all updated on my progress.

Cheers,
Valent.

Matic Odar

Oct 3, 2013, 7:23:39 PM
to tesser...@googlegroups.com
In case anyone else encounters this problem: I've noticed that it can happen when there are not enough pixels above and below the text. Judging by the pictures above, that might be the case here. Keeping about 10 extra pixels above and below the text might fix it. Or possibly just above; I'm not sure.
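
A minimal Python sketch of adding that margin with ImageMagick, in case it helps. It assumes the `convert` tool is on your PATH (newer ImageMagick releases call it `magick`) and that a white border matches the image background; pick another colour if it doesn't. The helper names `pad_command` and `pad_image` are made up, and for simplicity the same margin is added on all four sides rather than just above and below.

```python
import subprocess


def pad_command(src, dst, margin=10, color="white"):
    """Build an ImageMagick command that adds a border of `margin` pixels."""
    # Assumption: `color` should match the page background of the image.
    return ["convert", src,
            "-bordercolor", color,
            "-border", "{0}x{0}".format(margin),
            dst]


def pad_image(src, dst, margin=10, color="white"):
    """Run ImageMagick to write a padded copy of `src` to `dst`."""
    subprocess.run(pad_command(src, dst, margin, color), check=True)
```

For example, `pad_image("ocr-gas-cropped.tif", "ocr-gas-padded.tif")` writes a copy with 10 extra pixels on every side, which can then be passed to tesseract as usual. The output filename here is just a placeholder.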


Valent

Nov 3, 2013, 6:31:43 PM
to tesser...@googlegroups.com
I tested your hypothesis and it is right: enlarging the canvas so that the numbers have lots of "free" space around them and aren't too close to the border helped, and now I get output from tesseract!

Thanks all of you for helping out!