"Empty Page" and incomplete text recognition


Daniel Kraft

Oct 27, 2015, 3:32:40 AM
to tesseract-ocr
Hi all!

I've just started to experiment with tesseract (and OCR in general).  I would like to use it for reading sequences of numbers from pictures taken off an old screen.  I've trained tesseract for my situation, using the particular font shown on the screen and only digits as characters.  After training, recognition usually works very well, without a single mistake (e.g., confusing 0 with 8 or 1 with 7).

However, sometimes tesseract simply refuses to recognise *any* content at all, or only recognises text starting at some line halfway through the picture.  I found [1], which seems to be related.  However, resizing the image canvas does not help in my situation (see attachments and below).

  [1] https://groups.google.com/forum/#!topic/tesseract-ocr/eM7vClhtgw8

I've attached two images including the resulting text output (which cannot be reproduced in this quality without training).  The pictures are based on photographs but have been preprocessed already to improve contrast.  I don't really see much of a difference in the visual quality between the "Failing" and "Working" image, which makes me wonder why tesseract only outputs the last lines of Failing while it gives perfect results (except for spurious line breaks) in Working.  Any ideas what the issue could be?  Both images have been created in the same way, with the same preprocessing parameters and so on.

Thanks a lot!  Yours,
Daniel
Failing.png
Failing.txt
Working.png
Working.txt
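
A minimal sketch of how the spurious line breaks mentioned above could be filtered out after OCR. It assumes the breaks show up as empty lines in the output text and that every real line should contain only digits (possibly separated by spaces); the function names are made up for illustration and the contents of the attached .txt files are not known here.

```python
def clean_ocr_lines(text):
    """Strip surrounding whitespace and drop empty lines from OCR output."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def non_numeric_lines(lines):
    """Return the lines that contain anything other than digits and spaces."""
    return [line for line in lines if not line.replace(" ", "").isdigit()]


raw_output = "123 456\n\n 789 \n\n4S6\n"
lines = clean_ocr_lines(raw_output)
suspicious = non_numeric_lines(lines)
```

Lines flagged by `non_numeric_lines` would be candidates for a second look rather than proof of an OCR error.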

Allistair C

Oct 27, 2015, 3:57:20 AM
to tesser...@googlegroups.com
I think your whole document needs enough surrounding margin - I found the empty page issue when my text was too close to the page edges. In your first image you have this but not your second.
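
A rough sketch of the margin idea, assuming the image is already a grayscale grid (a list of rows of 0..255 values) with a white background. The function name, the margin size and the fill value are illustrative assumptions; an image library would normally do this in one call.

```python
def add_margin(pixels, margin, fill=255):
    """Return a copy of a grayscale image (list of rows) with a uniform border."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    width = len(pixels[0]) if pixels else 0
    blank_row = [fill] * (width + 2 * margin)
    # Blank rows on top, then each original row padded left and right,
    # then blank rows at the bottom.
    padded = [list(blank_row) for _ in range(margin)]
    for row in pixels:
        padded.append([fill] * margin + list(row) + [fill] * margin)
    padded.extend(list(blank_row) for _ in range(margin))
    return padded


page = [[0, 0], [0, 255]]
padded = add_margin(page, margin=1)
```

How much margin is enough is something to find by experiment; nothing here is specific to Tesseract.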

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/73d8219e-933d-478a-bc71-40394f612e37%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Daniel Kraft

Oct 27, 2015, 10:13:17 AM
to tesser...@googlegroups.com
Hi!

On 2015-10-27 08:57, Allistair C wrote:
> I think your whole document needs enough surrounding margin - I found
> the empty page issue when my text was too close to the page edges. In
> your first image you have this but not your second.

Yes, that's also what I read. Note, however, that the first document
fails while the second works (not the other way round). In fact, I
actually added the margin in the failing example to ensure this is not
the reason why it fails -- but this does not help.

Yours,
Daniel

--
http://www.domob.eu/
OpenPGP: 1142 850E 6DFF 65BA 63D6 88A8 B249 2AC4 A733 0737
Namecoin: id/domob -> https://nameid.org/?name=domob
--
Done: Arc-Bar-Cav-Hea-Kni-Ran-Rog-Sam-Tou-Val-Wiz
To go: Mon-Pri


Allistair

Oct 27, 2015, 11:11:11 AM
to tesser...@googlegroups.com
Ah OK.

Firstly I do not get Empty Page with Tesseract 3 on Mac. It reads a couple of lines then gives up.

I was able to get it to read everything by cropping it by the same amount as Working and then rotating it anticlockwise by just a few degrees. I tried this because I noticed the text was rotated. Tesseract is meant to handle this, but sometimes you just need to try things out.

It does mean though that you will need to do preprocessing before handing off to Tesseract to get whatever you're doing working.

Cheers
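
Instead of finding the rotation by hand, the skew can be estimated automatically. The sketch below uses a projection-profile search, which is a swapped-in standard technique and not something from this thread: for each candidate angle it projects the ink pixels onto the vertical axis and keeps the angle that gives the sharpest row peaks. It assumes a binary image as a list of rows with 1 for ink; `estimate_skew`, the angle range and the step are illustrative choices.

```python
import math
from collections import Counter


def estimate_skew(binary, max_angle=5.0, step=0.5):
    """Estimate text-line skew in degrees from a binary image (1 = ink).

    A positive result means the lines drop toward the right on screen
    (image y coordinates grow downward).
    """
    ink = [(x, y) for y, row in enumerate(binary)
           for x, value in enumerate(row) if value]
    if not ink:
        return 0.0
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        sin_a = math.sin(math.radians(angle))
        cos_a = math.cos(math.radians(angle))
        # Project every ink pixel onto the axis perpendicular to the
        # candidate line direction and count pixels per projected row.
        profile = Counter(round(y * cos_a - x * sin_a) for x, y in ink)
        # Level lines give a few tall peaks, so the sum of squares is large.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic check: three thin "text lines" tilted by 3 degrees.
width, height = 200, 100
image = [[0] * width for _ in range(height)]
for y0 in (20, 40, 60):
    for x in range(width):
        image[round(y0 + x * math.tan(math.radians(3.0)))][x] = 1
skew = estimate_skew(image)
```

The estimate is only as fine as `step`, and short lines give poor angular resolution, so the image would then be rotated back by the estimated angle with whatever image tool is already in the preprocessing chain.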

failing-a.txt
Failing-a.png

Daniel Kraft

Oct 27, 2015, 4:49:11 PM
to tesser...@googlegroups.com
Hi!

On 2015-10-27 16:10, Allistair wrote:
> Firstly I do not get Empty Page with Tesseract 3 on Mac. It reads a
> couple of lines then gives up.

Yes, that's true -- this particular example gives a few lines (actually
the *later* ones, not the first and then giving up). But with a
slightly different example, I also get the "Empty Page" sometimes.

> I was able to get it reading everything by cropping it to the same
> amount as Working but then rotating it anti clockwise by just a few
> degrees - I tried this because I noticed the text was rotated -
> Tesseract is meant to handle this but you just need to try stuff out
> sometimes.

Ah ok, that's a good hint! I'll try rotating my other samples and see
if it helps!

> It does mean though that you will need to do preprocessing before
> handing off to Tesseract to get whatever you're doing working.

I already do preprocessing to get the pictures as posted. The originals
are colour photos of yellow print on a black monitor. ;)

It should be fine to also add some rotation to the preprocessing as needed.
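
As an illustration of the kind of preprocessing described here, this is a minimal sketch that turns yellow-on-black pixels into dark text on a light background. It is not the actual preprocessing used for the posted images: the RGB-tuple input format, the function name and the threshold of 128 are all assumptions.

```python
def yellow_on_black_to_binary(rgb_rows, threshold=128):
    """Map an RGB image (rows of (r, g, b) tuples) to black text on white.

    Yellow is strong in both red and green, so min(r, g) serves as the
    "text brightness"; the blue channel is ignored.
    """
    return [
        [0 if min(r, g) >= threshold else 255 for (r, g, b) in row]
        for row in rgb_rows
    ]


photo = [[(250, 240, 30), (5, 5, 5), (20, 10, 200)]]
binary = yellow_on_black_to_binary(photo)
```

A fixed threshold is the simplest choice; uneven lighting in a photo may call for something adaptive.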

Tom Morris

Oct 28, 2015, 1:15:42 PM
to tesseract-ocr


On Tuesday, October 27, 2015 at 4:49:11 PM UTC-4, Daniel Kraft wrote:
> On 2015-10-27 16:10, Allistair wrote:
>> I was able to get it reading everything by cropping it to the same
>> amount as Working but then rotating it anti clockwise by just a few
>> degrees - I tried this because I noticed the text was rotated -
>> Tesseract is meant to handle this but you just need to try stuff out
>> sometimes.
>
> Ah ok, that's a good hint!  I'll try rotating my other samples and see
> if it helps!

In addition to the skew, which I didn't notice until Allistair mentioned it, closer examination also reveals that the images are warped, almost as if the text were displayed on the face of a curved CRT from the olden days.  You might try de-warping the image to remove the effects of the curvature so that you have level AND straight lines of text to feed Tess.

Tom 
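
One simple way to approximate the de-warping suggested above is a one-parameter radial model, sketched below. This is an assumption about the distortion, not a measured property of the posted images: a real curved screen photographed at an angle may need a richer model. The function name and the coefficient `k` are hypothetical, and both the sign and the size of `k` would have to be found by trial.

```python
import math


def radial_dewarp(pixels, k, fill=255):
    """Resample a grayscale image (list of rows) with a radial model.

    Each output pixel at normalised offset (u, v) from the centre is read
    from the source at offset (u, v) * (1 + k * r^2), nearest neighbour.
    """
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    # Normalise by the half-diagonal so k does not depend on image size.
    norm = math.hypot(cx, cy) or 1.0
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            u, v = (x - cx) / norm, (y - cy) / norm
            scale = 1.0 + k * (u * u + v * v)
            sx = round(cx + u * scale * norm)
            sy = round(cy + v * scale * norm)
            if 0 <= sx < width and 0 <= sy < height:
                row.append(pixels[sy][sx])
            else:
                row.append(fill)  # source falls outside the image
        out.append(row)
    return out


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
unchanged = radial_dewarp(grid, 0.0)
```

With `k = 0` the image comes back unchanged, which makes a convenient sanity check before experimenting with other values.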

Daniel Kraft

Oct 28, 2015, 3:03:01 PM
to tesser...@googlegroups.com
Hi!

On 2015-10-28 18:15, Tom Morris wrote:
> In addition to the skew, which I didn't notice until Alistair mentioned
> it, closer examination also reveals that the images are warped, almost
> as if the text was displayed on the face of a curved CRT from the olden
> days. You might try de-warping the image to remove the effects of the
> curvature so that you have level AND straight lines of text to feed Tess.

I'm always amazed at the (for me tiny) details that image-processing
people spot -- in fact, the text *is* displayed on an old-style curved
screen.

With rotation, everything seems to work perfectly for me, though. I'll
try de-warping if I run into more problems in the future.