Empty page result. Bug?

S.J. Becker

Apr 21, 2016, 2:19:11 AM
to tesseract-ocr

I've attached two files.

The first file is my original one. It returns an empty page (with eng.traineddata).

I noticed that there was no margin at the top and little at the bottom.
So I used GIMP to add about 4 pixels at the top and bottom. The result
is the second attached file.

This OCRed properly.

Command line:
tesseract -c tessedit_create_tsv=1 tess_1_1b.tif tess

Output:
level  page_num  block_num  par_num  line_num  word_num  left  top  width  height  conf  text
1      1         0          0        0         0         0     0    336    110     -1
2      1         1          0        0         0         28    7    270    98      -1
3      1         1          1        0         0         28    7    270    98      -1
4      1         1          1        1         0         28    7    270    98      -1
5      1         1          1        1         1         28    7    270    98      91    A1.01


A1.01  with a confidence of 91
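
For anyone who wants to check this kind of output from a program, here is a minimal Python sketch that pulls the word-level rows (level 5) out of a TSV table like the one above. The function name `words_from_tsv` and the sample string are made up for illustration, and it assumes the table is tab-separated with the header row shown in this post.

```python
import csv
import io


def words_from_tsv(tsv_text):
    """Return (text, confidence) pairs for word-level rows (level 5)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        # Rows above word level carry conf -1 and no text, so skip them.
        text = (row.get("text") or "").strip()
        if row.get("level") == "5" and text:
            words.append((text, float(row["conf"])))
    return words


# Hypothetical sample modelled on the output in this post.
HEADER = ("level page_num block_num par_num line_num word_num "
          "left top width height conf text").split()
ROWS = [
    "1 1 0 0 0 0 0 0 336 110 -1".split() + [""],
    "5 1 1 1 1 1 28 7 270 98 91 A1.01".split(),
]
SAMPLE = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"

print(words_from_tsv(SAMPLE))  # [('A1.01', 91.0)]
```

An empty list back from this function would be one way to notice the "empty page" case without reading the console output.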

Should I file a bug? Or always pad my images with whitespace?

thanks

tess_1_1.tif
tess_1_1b.tif

ShreeDevi Kumar

Apr 21, 2016, 2:46:00 AM
to tesser...@googlegroups.com

Please file an issue on the GitHub repo with these files so that it can be looked at by the developers.

However, for your app, add the whitespace margin to your images as part of preprocessing, since any fix may take a while.

- sent from my phone. excuse the brevity.
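
A rough sketch of the whitespace-margin preprocessing suggested above, in plain Python so it runs without any imaging library. It treats a grayscale image as a list of pixel rows; the function name `pad_with_white` and the default margin of 4 pixels (taken from the roughly 4 pixels mentioned in the first post) are assumptions for illustration, not a recommended value.

```python
def pad_with_white(rows, margin=4, white=255):
    """Return a copy of a grayscale image (list of pixel rows) with a white border."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    width = len(rows[0]) if rows else 0
    blank = [white] * (width + 2 * margin)
    # Blank rows on top, then each original row with side margins, then blank rows below.
    padded = [list(blank) for _ in range(margin)]
    for row in rows:
        padded.append([white] * margin + list(row) + [white] * margin)
    padded.extend(list(blank) for _ in range(margin))
    return padded


tiny = [[0, 0, 0], [0, 255, 0]]  # 3x2 toy image
padded = pad_with_white(tiny, margin=4)
print(len(padded), len(padded[0]))  # 10 11
```

In a real application the same effect is usually obtained from an imaging library; with Pillow, for example, `ImageOps.expand(img, border=4, fill=255)` adds a border of that kind to a grayscale image.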

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/40a4828d-9a46-4e36-9b22-8b925f39a046%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Zdenko Podobný

Apr 21, 2016, 7:21:47 AM
to tesser...@googlegroups.com

Zdenko Podobný

Apr 21, 2016, 7:28:04 AM
to tesser...@googlegroups.com
On Thu, Apr 21, 2016 at 8:45 AM, ShreeDevi Kumar <shree...@gmail.com> wrote:

> Please file an issue on the GitHub repo with these files so that it can be looked at by the developers.

Why? To waste their time??? E.g. the presented command does not work ('tesseract -c tessedit_create_tsv=1 tess_1_1b.tif tess'). If you really want to help the user, then point him/her to the correct wiki page (using the correct psm).

S.J. Becker

Apr 22, 2016, 7:12:32 AM
to tesseract-ocr

I just did more testing.

My one word or single character image works with
-psm 7
-psm 8

my two or three lines of text image works with the default of
-psm 3
as well as
-psm 4

They both seem to work with
-psm 6

I may have to go with 6 even though my three line test with different
font sizes should be done with 4 based on its description.

I feel it's a bug that 3 and 4 can't reliably handle simpler content.
To get the most out of Tesseract, I must analyze the segmentation?!

That is why I had to go through the trouble of compiling leptonica;
so that tesseract is smart enough that I don't have to re-invent the wheel.


It seems that it's failing at the segmentation stage. If it finds nothing
it could try again automatically with a more primitive setting. That is
way more efficient than my process spawning tesseract twice as often.

    thanks
    scott

S.J. Becker

Apr 22, 2016, 7:12:33 AM
to tesseract-ocr

This page only shows the same list I've seen many times before without
any explanation:

What does it mean when it says "script detection"?
I tried OSD and it did not automatically correct incorrect rotation (90 degrees off)

I think I understand what "Automatic page segmentation" may mean but with / without OSD?
Kinda need a full explanation.

"vertically aligned text"???

I guess I'll try #4: "Assume a single column of text of variable sizes"
That best describes what I have but the default seemed to work
in limited testing of my one and two liners.

The wiki also has a waybackmachine link to a bug saying that adding
whitespace helps. (Is that a current bug?, etc.)

    thanks
    scott


On Thursday, April 21, 2016 at 4:21:47 AM UTC-7, zdenop wrote:

Zdenko Podobný

Apr 23, 2016, 11:53:05 AM
to tesser...@googlegroups.com
On Thu, Apr 21, 2016 at 11:53 PM, S.J. Becker <scottb...@gmail.com> wrote:

> This page only shows the same list I've seen many times before without
> any explanation:
>
> What does it mean when it says "script detection"?
> I tried OSD and it did not automatically correct incorrect rotation (90 degrees off)

Detection = correction???

> I think I understand what "Automatic page segmentation" may mean but with / without OSD?
> Kinda need a full explanation.

OSD = Orientation and script detection

> "vertically aligned text"???
>
> I guess I'll try #4: "Assume a single column of text of variable sizes"
> That best describes what I have but the default seemed to work
> in limited testing of my one and two liners.
>
> The wiki also has a waybackmachine link to a bug saying that adding
> whitespace helps. (Is that a current bug?, etc.)

It is not a bug. It is a feature: if you use the correct psm and still cannot get the correct result, the problem may be that the border is not sufficient.

>     thanks
>     scott
>
> On Thursday, April 21, 2016 at 4:21:47 AM UTC-7, zdenop wrote:

Zdenko



Zdenko Podobný

Apr 23, 2016, 12:02:24 PM
to tesser...@googlegroups.com
On Fri, Apr 22, 2016 at 12:27 AM, S.J. Becker <scottb...@gmail.com> wrote:

> I just did more testing.
>
> My one word or single character image works with
> -psm 7
> -psm 8
>
> my two or three lines of text image works with the default of
> -psm 3
> as well as
> -psm 4
>
> They both seem to work with
> -psm 6
>
> I may have to go with 6 even though my three line test with different
> font sizes should be done with 4 based on its description.
>
> I feel it's a bug that 3 and 4 can't reliably handle simpler content.
> To get the most out of Tesseract, I must analyze the segmentation?!

Why analyze? Don't you know in advance whether you are asking to OCR a page or just a paragraph, line or word???

> That is why I had to go through the trouble of compiling leptonica;
> so that tesseract is smart enough that I don't have to re-invent the wheel.

Tesseract uses leptonica as a dependency so that it does not need to re-invent the wheel.

> It seems that it's failing at the segmentation stage. If it finds nothing
> it could try again automatically with a more primitive setting. That is
> way more efficient than my process spawning tesseract twice as often.
>
>     thanks
>     scott
>
> On Thursday, April 21, 2016 at 4:21:47 AM UTC-7, zdenop wrote:

Zdenko



S.J. Becker

Apr 23, 2016, 2:12:21 PM
to tesseract-ocr

On Saturday, April 23, 2016 at 9:02:24 AM UTC-7, zdenop wrote:

> Why analyze? Don't you know in advance whether you are asking to OCR a page or just a paragraph, line or word???

 
No.

My user is viewing an image of a large construction blueprint. They select "Copy Text"
and draw a rectangle around part of the image which contains text. I need my program
to ocr any text in that sub-image and copy it to the clipboard.

I have no idea if they select a character, a word, a single line sentence or a multi-line
sentence.

I was tracing down a non-fatal error message which was printed to the console when
running tesseract. I found out tesseract was calling leptonica to segment the page
and that leptonica was emitting an error and returning failure because the image was below
a certain height. It was not trying to segment the image.

The leptonica developer made the arbitrary decision that it didn't make sense to
segment the page because it was too small. If leptonica makes such judgements,
then tesseract has to deal with it intelligently. If tesseract does not want to deal with
it, then I must deal with it. If I refuse to deal with it then I can ask my user to describe
what they selected and make them deal with it.

If I asked my user if they selected a single character, a single word, a single line of
words or multiple lines of words, they would conclude that my software is a steaming
pile of crap. So that leaves me to solve the problem.

It's my opinion that it is crazy for an OCR program to return "Empty Page!" when I feed
it an image with "A2.12" on it because it is below a certain size or because it lacks
white space or because I told it to expect multiple lines of text with varying heights
instead of "Expect a single word".

It's returning "Empty Page!" without even trying to ocr the image!

The last 6 psm options are in a nice hierarchy. If you don't think it makes sense
to fall back to a more primitive setting when the advanced setting fails, then I
will have to create a patched version which does that.

It makes no sense for me to launch tesseract two or three times to ocr "A2.12".

   TIA
   scott
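
For reference, the fallback idea can be sketched on the caller's side in Python. This is the external workaround, not the in-process patch described above: it still launches tesseract again, but only when a mode returns nothing. The names `run_tesseract` and `ocr_with_fallback` and the mode order (3, 6, 7, 8) are hypothetical choices based on the modes tried in this thread, and the command line assumes a build that accepts `-psm N` and the special output name `stdout` (newer releases spell the option `--psm`).

```python
import subprocess


def run_tesseract(image_path, psm):
    """Run the tesseract CLI once and return the recognized text (may be empty)."""
    # Assumption: "-psm N" as used in this thread; "stdout" as the output base.
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-psm", str(psm)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


def ocr_with_fallback(image_path, recognize=run_tesseract, psm_order=(3, 6, 7, 8)):
    """Try each page segmentation mode in turn; return (psm, text) for the first hit."""
    for psm in psm_order:
        text = recognize(image_path, psm)
        if text:
            return psm, text
    return None, ""


# Demo with a stand-in recognizer, so the sketch runs without tesseract installed.
def fake_recognize(image_path, psm):
    return "A2.12" if psm == 7 else ""


print(ocr_with_fallback("selection.tif", recognize=fake_recognize))  # (7, 'A2.12')
```

Passing the recognizer in as a parameter keeps the retry logic separate from how tesseract is invoked, so the same loop could wrap a library binding instead of a spawned process.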

