Empty page result. Bug?

S.J. Becker

Apr 21, 2016, 2:19:11 AM

to tesseract-ocr

I've attached two files.

The first file is my original one. It returns an empty page (with eng.traineddata).

I noticed that there was no margin at the top and little at the bottom.

So I used GIMP to add about 4 pixels at the top and bottom. The result is the second attached file.

This OCRed properly.

Command line:

tesseract -c tessedit_create_tsv=1 tess_1_1b.tif tess

Output:

level page_num block_num par_num line_num word_num left top width height conf text
1     1        0         0       0        0        0    0   336   110    -1<>
2     1        1         0       0        0        28   7   270   98     -1<>
3     1        1         1       0        0        28   7   270   98     -1<>
4     1        1         1       1        0        28   7   270   98     -1<>
5     1        1         1       1        1        28   7   270   98     91   A1.01

A1.01 with a confidence of 91
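
For anyone reading that output programmatically, here is a minimal Python sketch of pulling the recognized words and their confidences out of it. It assumes the file Tesseract writes is tab-separated with the header row shown above, and that level 5 rows are the word rows (levels 1 to 4 being page, block, paragraph and line). The function name `words_from_tsv` and the `min_conf` parameter are made up for illustration.

```python
import csv
import io


def words_from_tsv(tsv_text, min_conf=0):
    """Yield (text, confidence) for the word-level rows of TSV output.

    tsv_text -- the full contents of the .tsv file as one string
    min_conf -- hypothetical cut-off; rows below it are skipped
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Only level 5 rows carry recognized words; the container rows
        # (page, block, paragraph, line) have conf -1 and no text.
        if row["level"] != "5":
            continue
        conf = float(row["conf"])
        text = (row["text"] or "").strip()
        if text and conf >= min_conf:
            yield text, conf
```

An empty result from `list(words_from_tsv(...))` is then the programmatic equivalent of the "empty page" case.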

Should I file a bug? Or always pad my images with whitespace?

thanks

tess_1_1.tif

tess_1_1b.tif

ShreeDevi Kumar

Apr 21, 2016, 2:46:00 AM

to tesser...@googlegroups.com

Please file an issue on the GitHub repo with these files so that the developers can look at it.

However, for your app, add the whitespace margin to your images as part of preprocessing, since any fix may take a while.
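
As a rough illustration of that preprocessing step, the Python sketch below adds a white border to an image that has already been decoded into rows of 8-bit grayscale values. The in-memory layout, the function name `pad_image` and the default margin of 10 pixels are all assumptions for the example; the thread only reports that about 4 pixels at the top and bottom were enough for one image. A real app would more likely do this with whatever imaging library it already uses.

```python
def pad_image(rows, margin=10, fill=255):
    """Return a copy of a grayscale raster with a uniform border added.

    rows   -- list of equal-length lists of 0..255 pixel values
              (hypothetical layout, not something Tesseract reads directly)
    margin -- border width in pixels on every side (a guess, tune as needed)
    fill   -- border value; 255 is white for 8-bit grayscale
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    width = len(rows[0]) if rows else 0
    blank = [fill] * (width + 2 * margin)   # one full-width border row
    side = [fill] * margin                  # left/right border for each row
    padded = [list(blank) for _ in range(margin)]
    padded += [side + list(row) + side for row in rows]
    padded += [list(blank) for _ in range(margin)]
    return padded
```

The padded raster would then be written back out as an image before being handed to tesseract.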

- sent from my phone. excuse the brevity.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/40a4828d-9a46-4e36-9b22-8b925f39a046%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Zdenko Podobný

Apr 21, 2016, 7:21:47 AM

to tesser...@googlegroups.com

Please read the wiki https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method

Zdenko

Zdenko Podobný

Apr 21, 2016, 7:28:04 AM

to tesser...@googlegroups.com

On Thu, Apr 21, 2016 at 8:45 AM, ShreeDevi Kumar <shree...@gmail.com> wrote:

> Please file an issue on GitHub repo with these files so that it can be looked at by the developers.

Why? To waste their time??? For example, the command as presented does not work ('tesseract -c tessedit_create_tsv=1 tess_1_1b.tif tess'). If you really want to help the user, then point him/her to the correct wiki page (using the correct psm).

S.J. Becker

Apr 22, 2016, 7:12:32 AM

to tesseract-ocr

I just did more testing.

My one-word or single-character image works with
-psm 7
-psm 8

My image with two or three lines of text works with the default of
-psm 3
as well as
-psm 4

They both seem to work with
-psm 6

I may have to go with 6, even though my three-line test with different font sizes should be done with 4 based on its description.

I feel it's a bug that 3 and 4 can't reliably handle simpler content. To get the most out of Tesseract, I must analyze the segmentation?! That is why I had to go through the trouble of compiling leptonica: so that tesseract is smart enough that I don't have to re-invent the wheel.

It seems that it's failing at the segmentation stage. If it finds nothing, it could try again automatically with a more primitive setting. That is way more efficient than my process spawning tesseract twice as often.

thanks

scott

S.J. Becker

Apr 22, 2016, 7:12:33 AM

to tesseract-ocr

This page only shows the same list I've seen many times before, without any explanation:

What does it mean when it says "script detection"?

I tried OSD and it did not automatically correct incorrect rotation (90 degrees off).

I think I understand what "Automatic page segmentation" may mean, but with / without OSD? Kinda need a full explanation.

"vertically aligned text"???

I guess I'll try #4: "Assume a single column of text of variable sizes". That best describes what I have, but the default seemed to work in limited testing of my one- and two-liners.

The wiki also has a Wayback Machine link to a bug saying that adding whitespace helps. (Is that a current bug?, etc.)

thanks

scott

On Thursday, April 21, 2016 at 4:21:47 AM UTC-7, zdenop wrote:

Zdenko Podobný

Apr 23, 2016, 11:53:05 AM

to tesser...@googlegroups.com

On Thu, Apr 21, 2016 at 11:53 PM, S.J. Becker <scottb...@gmail.com> wrote:

> This page only shows the same list I've seen many times before without
> any explanation:
>
> What does it mean when it says "script detection"

See https://en.wikipedia.org/wiki/List_of_writing_systems and
https://github.com/tesseract-ocr/tesseract/blob/master/ccmain/osdetect.cpp#L50

> I tried OSD and it did not automatically correct incorrect rotation (90 degrees off)

Detection is not the same as correction.

> I think I understand what "Automatic page segmentation" may mean but with / without OSD?
> Kinda need a full explanation.

OSD = Orientation and script detection

> "vertically aligned text"???
>
> I guess I'll try #4: "Assume a single column of text of variable sizes"
> That best describes what I have but the default seemed to work
> in limited testing of my one and two liners.
>
> The wiki also has a waybackmachine link to a bug saying that adding
> whitespace helps. (Is that a current bug?, etc.)

It is not a bug, it is a feature: if you use the correct psm and still cannot get a correct result, the problem may be that the image does not have a sufficient border.

> thanks
> scott
>
> On Thursday, April 21, 2016 at 4:21:47 AM UTC-7, zdenop wrote:
> > Please read the wiki https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
> >
> > Zdenko

Zdenko Podobný

Apr 23, 2016, 12:02:24 PM

to tesser...@googlegroups.com

On Fri, Apr 22, 2016 at 12:27 AM, S.J. Becker <scottb...@gmail.com> wrote:

> I just did more testing.
>
> My one word or single character image works with
> -psm 7
> -psm 8
>
> my two or three lines of text image works with the default of
> -psm 3
> as well as
> -psm 4
>
> They both seem to work with
> -psm 6
>
> I may have to go with 6 even though my three line test with different
> font sizes should be done with 4 based on its description.
>
> I feel it's a bug that 3 and 4 can't reliably handle simpler content.
> To get the most out of Tesseract, I must analyze the segmentation?!

Why analyze? Don't you know in advance whether you are asking to OCR a page or just a paragraph, line or word???

> That is why I had to go through the trouble of compiling leptonica;
> so that tesseract is smart enough that I don't have to re-invent the wheel.

Tesseract uses leptonica as a dependency so that it does not need to re-invent the wheel.

> It seems that it's failing at the segmentation stage. If it finds nothing
> it could try again automatically with a more primitive setting. That is
> way more efficient than my process spawning tesseract twice as often.
>
> thanks
> scott
>
> On Thursday, April 21, 2016 at 4:21:47 AM UTC-7, zdenop wrote:
> > Please read the wiki https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method
> >
> > Zdenko

S.J. Becker

Apr 23, 2016, 2:12:21 PM

to tesseract-ocr

On Saturday, April 23, 2016 at 9:02:24 AM UTC-7, zdenop wrote:

> Why analyze? Don't you know in advance whether you are asking to OCR a page or just a paragraph, line or word???

No.

My user is viewing an image of a large construction blueprint. They select "Copy Text" and draw a rectangle around part of the image which contains text. I need my program to OCR any text in that sub-image and copy it to the clipboard.

I have no idea if they select a character, a word, a single-line sentence or a multi-line sentence.

I was tracing down a non-fatal error message which was printed to the console when running tesseract. I found out tesseract was calling leptonica to segment the page and that leptonica was emitting an error and returning failure because the image was below a certain height. It was not trying to segment the image.

The leptonica developer made the arbitrary decision that it didn't make sense to segment the page because it was too small. If leptonica makes such judgements, then tesseract has to deal with them intelligently. If tesseract does not want to deal with it, then I must deal with it. If I refuse to deal with it, then I can ask my user to describe what they selected and make them deal with it.

If I asked my user if they selected a single character, a single word, a single line of words or multiple lines of words, they would conclude that my software is a steaming pile of crap. So that leaves me to solve the problem.

It's my opinion that it is crazy for an OCR program to return "Empty Page!" when I feed it an image with "A2.12" on it because it is below a certain size, or because it lacks white space, or because I told it to expect multiple lines of text with varying heights instead of "Expect a single word".

It's returning "Empty Page!" without even trying to OCR the image!

The last 6 psm options are in a nice hierarchy. If you don't think it makes sense to fall back to a more primitive setting when the advanced setting fails, then I will have to create a patched version which does that.

It makes no sense for me to launch tesseract two or three times to ocr "A2.12".

TIA

scott
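
The fallback idea from this thread can at least be sketched on the caller's side, with the obvious cost that tesseract is launched once per attempt, which is exactly what the post above wants to avoid. The Python below is only a sketch under assumptions: the order of modes in `FALLBACK_MODES` is a guess, the helper names are invented, and the command line uses the `-psm` spelling seen in this thread with `stdout` as the output base (newer releases spell the option `--psm`).

```python
import subprocess

# Modes to try, most automatic first. Numbers follow Tesseract's own list:
# 3 = fully automatic page segmentation (default), 6 = single uniform block
# of text, 7 = single text line, 8 = single word. The order is an assumption.
FALLBACK_MODES = (3, 6, 7, 8)


def run_tesseract(image_path, psm):
    """Run the tesseract CLI once and return whatever text it printed."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-psm", str(psm)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout


def ocr_with_fallback(image_path, run=run_tesseract, modes=FALLBACK_MODES):
    """Return (text, psm) from the first mode that yields non-blank text.

    run is injectable so the retry logic can be exercised without
    tesseract installed; it must accept (image_path, psm) and return a str.
    """
    for psm in modes:
        text = run(image_path, psm).strip()
        if text:
            return text, psm
    # Every mode came back empty.
    return "", None
```

Calling `ocr_with_fallback("selection.tif")` (a hypothetical file name) would try each mode in turn and stop at the first one that returns text. Doing the same retry inside one process through the library API would avoid the repeated launches, but that is beyond this sketch.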
