Tesseract Reading Issue

KAH

Jul 18, 2010, 11:40:53 PM
to tesseract-ocr
I have two files....

http://dl.dropbox.com/u/1531272/pg1-CROP.jpg
and
http://dl.dropbox.com/u/1531272/pg1-CROP-Lines.jpg

Note on the "Lines" file there are dark lines on the left and right
side of this image.
I am trying to understand why the tessnet dll would render such
different readings for this image.

Can anyone help me understand what about the way this product reads an
image would cause this? Additionally, if there are any variables I
could set that would help, I would love some direction on them.

Thank you for your help.
KAH

patrickq

Jul 19, 2010, 8:20:35 AM
to tesseract-ocr
This is a great example of a serious problem with Tesseract when
analyzing any image with fonts of variable sizes such as a street
sign, flyer, business card etc. What happens is that Tesseract's
adaptive classifier makes assumptions about letter heights and uses
that knowledge when recognizing the next characters. This is right and
useful when parsing a word or (to a lesser degree but still) a
sentence with words separated by spaces because in that case it makes
sense to assume uniformity. However it is dead wrong when dealing with
different blocks. In your case, the tall bar is separated by enough
space that it should be treated as a different block and that letter
should NOT cause Tesseract to assume ANYTHING about letter height when
it tackles the next block with the phone number.

The good news is that the fix required in Tesseract is really not that
hard, it's essentially about resetting the adaptive classifier between
blocks (separated by space larger than a blank vertically or like your
example, horizontally). Even better news: Jimmy is working on it ...

Jimmy O'Regan

Jul 19, 2010, 8:30:39 AM
to tesser...@googlegroups.com
On 19 July 2010 13:20, patrickq <patrick.q...@gmail.com> wrote:
> This is a great example of a serious problem with Tesseract when
> analyzing any image with fonts of variable sizes such as a street
> sign, flyer, business card etc. What happens is that Tesseract's
> adaptive classifier makes assumptions about letter heights and uses
> that knowledge when recognizing the next characters. This is right and
> useful when parsing a word or (to a lesser degree but still) a
> sentence with words separated by spaces because in that case it makes
> sense to assume uniformity. However it is dead wrong when dealing with
> different blocks. In your case, the tall bar is separated by enough
> space that it should be treated as a different block and that letter
> should NOT cause Tesseract to assume ANYTHING about letter height when
> it tackles the next block with the phone number.
>
> The good news is that the fix required in Tesseract is really not that
> hard, it's essentially about resetting the adaptive classifier between
> blocks (separated by space larger than a blank vertically or like your
> example, horizontally). Even better news: Jimmy is working on it ...

Well, it won't do him any good because he's using tessnet2, so he
won't get the fix if/when I find it.

Actually, my current thought is that setting segmentation to line mode
might be enough to solve this problem, but I haven't gotten around to
checking. I'm a little too wrapped up in internationalising Tesseract
(which is an issue a little closer to my own interests).

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

Jimmy O'Regan

Jul 19, 2010, 8:31:50 AM
to tesser...@googlegroups.com
On 19 July 2010 13:30, Jimmy O'Regan <jor...@gmail.com> wrote:
> Well, it won't do him any good because he's using tessnet2, so he
> won't get the fix if/when I find it.

My apologies; I assumed 'he', which was quite a sexist assumption to make.

patrickq

Jul 19, 2010, 10:00:56 AM
to tesseract-ocr
Setting the segmentation mode to PSM_SINGLE_LINE doesn't help (I
checked).

Here is an even more striking example: "John Doe" and
"jo...@widgets.com": http://www.scanbizcards.com/johndoe.jpg
Just because the email address uses a smaller font, Tesseract 3.0
stubbornly insists on interpreting all the letters of "John Doe" as
tall lowercase or uppercase letters/digits, yielding something like
"JO11fl DO9".
What's even more bizarre here is that Tesseract should "see" that the
'n' in "John" is much smaller than the 'J' and 'h' so even within that
word the assumption that the 'n' is a tall letter makes no sense!

Tesseract is a great piece of software yet basic issues like that make
us (Tesseract) look like a retarded person BEFORE his morning
coffee :-). Yes, Tesseract was meant for uniform pages of text but the
reality is that lots and lots and lots of people use it for non-
uniform texts.

Austin Henderson

Jul 19, 2010, 10:34:38 AM
to tesseract-ocr
Thank you for your feedback.
I am working with some automated image pre-processing to try to remove the
lines before reading and having better results.
I just wanted to make sure I didn’t miss an optional setting that would
allow it to differentiate better between these blocks.

This is the same issue in reality that I posted earlier about handwriting
above or below the text being grouped in with the same text when read that
caused bad reads.
It is helpful to have a bit better understanding of what is happening under
the hood that is causing this problem.

I suppose I don’t understand why the space before/after the word is not
"enough" for it to see those as different objects?
Do you think tosp_table_xht_sp_ratio could have any impact on this if I
tweak it?
I am not really sure I understand the significance of the values passed for
this option though.

Thanks
Austin

--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en.

patrickq

Jul 19, 2010, 10:54:28 AM
to tesseract-ocr
Hi Austin,

Tesseract makes that unwanted assumption about height even if the
blocks are well separated, so tweaking the block size won't help. This
bad problem is just about fixing Tesseract to accept the reality that
not all text has the same height for all letters because not
everything is a book.

You could perform layout analysis to find blocks and rows within these
blocks, then make sub-images out of each row, but that's a ton of
coding, it will double or triple your processing time, and it doesn't
always work. I tried that approach and it was not fun + didn't fully
work + it is intellectually vexing to jump through hoops instead of
just fixing at the source.
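
A minimal sketch of the row-splitting workaround described above, assuming character bounding boxes in (left, top, right, bottom) form are already available from some earlier connected-component step. The names group_rows and crop_rect are hypothetical and belong to neither Tesseract nor tessnet2, and this is not Tesseract's own layout analysis. It only shows the idea of cutting one sub-image rectangle per text row.

```python
def group_rows(boxes):
    """Group (left, top, right, bottom) boxes into rows by vertical overlap."""
    rows = []
    # Visit boxes from top to bottom so rows are built in reading order.
    for box in sorted(boxes, key=lambda b: b[1]):
        for row in rows:
            row_top = min(b[1] for b in row)
            row_bottom = max(b[3] for b in row)
            # The box joins a row if their vertical extents overlap.
            if box[1] < row_bottom and box[3] > row_top:
                row.append(box)
                break
        else:
            rows.append([box])
    return rows


def crop_rect(row):
    """Return the rectangle enclosing every box in a row."""
    return (
        min(b[0] for b in row),
        min(b[1] for b in row),
        max(b[2] for b in row),
        max(b[3] for b in row),
    )


if __name__ == "__main__":
    # Made-up boxes: two characters on one row, two on a lower row.
    sample = [(0, 0, 10, 20), (12, 5, 22, 20), (0, 40, 8, 50), (10, 42, 18, 50)]
    for row in group_rows(sample):
        print(crop_rect(row))
```

Each rectangle would then be cropped out and passed to the OCR engine as its own image. As the post says, this costs extra processing time and does not help when the mixed sizes sit on the same row.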

Patrick

Jimmy O'Regan

Jul 19, 2010, 10:56:14 AM
to tesser...@googlegroups.com
On 19 July 2010 15:34, Austin Henderson <henderso...@gmail.com> wrote:
> Thank you for your feedback.
> I am working with some automated image pre-processing to try to remove the
> lines before reading and having better results.
> I just wanted to make sure I didn’t miss an optional setting that would
> allow it to differentiate better between these blocks.
>

Nah. Most of the open source OCR guis use unpaper for this, though.

> This is the same issue in reality that I posted earlier about handwriting
> above or below the text being grouped in with the same text when read that
> caused bad reads.
> It is helpful to have a bit better understanding of what is happening under
> the hood that is causing this problem.
>
> I suppose I don’t understand why the space before/after the word is not
> "enough" for it to see those as different objects?
> Do you think tosp_table_xht_sp_ratio could have any impact on this if I
> tweak it?

No; that's the ratio used to determine the space between words (1/3rd
of the height of 'x').
You would set that ratio to something else if you get too many words
being output without spaces between them (needs to be lower), or if
you get spaces between letters (needs to be higher).
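
A toy illustration of the idea behind such a ratio, not Tesseract's actual spacing code: a gap between two characters counts as a word space when it exceeds ratio times the x-height. The function split_words and its (left, right) input format are assumptions made up for this sketch.

```python
def split_words(chars, x_height, ratio=1 / 3):
    """Split (left, right) character extents, sorted left to right, into words."""
    if not chars:
        return []
    threshold = ratio * x_height
    words = [[chars[0]]]
    for prev, cur in zip(chars, chars[1:]):
        gap = cur[0] - prev[1]
        if gap > threshold:
            # Gap is wide enough to count as a space: start a new word.
            words.append([cur])
        else:
            words[-1].append(cur)
    return words


if __name__ == "__main__":
    # Made-up extents: a 2-pixel gap, then an 8-pixel gap.
    sample = [(0, 10), (12, 22), (30, 40)]
    print(split_words(sample, x_height=12))
    print(split_words(sample, x_height=12, ratio=0.1))
    print(split_words(sample, x_height=12, ratio=1.0))
```

In this sketch, lowering the ratio makes more gaps count as spaces and raising it merges more characters into one word, which matches the direction of the advice above.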

Austin Henderson

Jul 19, 2010, 1:34:31 PM
to tesser...@googlegroups.com
Ok so safe to say for now my options are..

1- Live with it
2- Figure out how to get the lines off the page before I read them...

Right?

Thanks

patrickq

Jul 19, 2010, 2:01:41 PM
to tesseract-ocr
Wrong ... option 2 won't really work unless you want to cut out
individual words. This image, where everything is on one line, still
fails with the same insane forcing of the letters in "John" to be
interpreted as tall letters:
http://www.scanbizcards.com/johndoeoneline.jpg

I think option 2 should be for all of us together now to beg Jimmy to
spend the 3-4 hours required to just tell Tesseract to quit this
persistent folly of pretending that all blocks are of the same
height. This issue is arguably the most damaging Tesseract flaw
for mixed text material (which is almost everything except books).

Austin Henderson

Jul 19, 2010, 9:52:53 PM
to tesser...@googlegroups.com

As a developer I am cautious about estimating the amount of time a code change will take. I am thrilled to have the code and look forward to enhancements as they are ported to .NET environments. For now I am cleaning up the image in preprocessing steps to remove blobs that are inconsistent with the others. This is not a problem in my use case and gets around this Tesseract issue just fine.
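
A rough sketch of what such a cleanup step could look like, assuming blob bounding boxes in (left, top, right, bottom) form have already been extracted. The function filter_outlier_blobs and the tolerance value are assumptions for illustration only. The post does not say how the inconsistent blobs were actually detected.

```python
from statistics import median


def filter_outlier_blobs(boxes, tolerance=2.0):
    """Keep boxes whose height is within a factor of `tolerance` of the median."""
    if not boxes:
        return []
    heights = [bottom - top for (_, top, _, bottom) in boxes]
    typical = median(heights)
    return [
        box
        for box, height in zip(boxes, heights)
        if typical / tolerance <= height <= typical * tolerance
    ]


if __name__ == "__main__":
    # Made-up blobs: three text-sized boxes and one tall vertical bar.
    sample = [(0, 0, 8, 10), (10, 0, 18, 12), (20, 0, 28, 11), (40, 0, 44, 60)]
    print(filter_outlier_blobs(sample))
```

The pixels of the dropped blobs would then be painted over with the background colour before the image is handed to the OCR engine. A median is used here so that one very tall bar does not shift the notion of a typical height.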

Thanks to the group for clarifying what the issue was. It helped me solve my problem.

Jimmy O'Regan

Jul 20, 2010, 8:20:15 AM
to tesser...@googlegroups.com
On 20 July 2010 02:52, Austin Henderson <henderso...@gmail.com> wrote:
> As a developer I am cautious to estimate the amount of time a code change
> will take.

:D I like you a lot right now.

> I am thrilled to have the code and look forward to enhancements
> as they are ported to .net environments.

Nobody has mentioned any plans to write a .net wrapper for Tesseract
3, and the developer of tessnet2 has mentioned that he would rather
pay for someone to reimplement Tesseract than touch it again, so I
wouldn't hold my breath, if I were you.

(On a related note, I spent a little while yesterday looking at some
truly horrifically written spaghetti code[1], so I'm a little less
unsympathetic than before, but I think he's seriously underestimating
the magnitude of such a reimplementation).

[1] Reminded me of this: http://www.ioccc.org/

Taxman

Jul 20, 2010, 11:01:54 AM
to tesseract-ocr
"This bad problem is just about fixing Tesseract to accept the reality
that not all text has the same height for all letters because not
everything is a book."

Only some books have uniform text sizes. Textbooks have a large degree
of variability in text size within the same page and probably cause
the same problems.

patrickq

Jul 20, 2010, 11:34:19 AM
to tesseract-ocr
As I said, we just need Jimmy to find 4-5 hours of his free time to
knock this one out :-)!

rthomas

Jul 21, 2010, 4:23:23 AM
to tesseract-ocr
>
> Nobody has mentioned any plans to write a .net wrapper for Tesseract
> 3, and the developer of tessnet2 has mentioned that he would rather
> pay for someone to reimplement Tesseract than touch it again, so I
> wouldn't hold my breath, if I were you.
>

Yes, but the main reason is that I had to make a few modifications
in Tesseract and I can't commit the code.
Can we commit again now?
I just got the latest alpha 3 version.
Can you explain this syntax in imgtiff.cpp to me?
tprintf (_("Resolution=%d\n"), *res);
What does this underscore mean?
Is this not ISO C++?

> (On a related note, I spent a little while yesterday looking at some
> truly horrifically written spaghetti code[1], so I'm a little less
> unsympathetic than before, but I think he's seriously underestimating
> the magnitude of such a reimplementation).
>

I'm not underestimating it. And that's why a university or engineering
student will help us for 3 months.
I have written C/C++ since I was 18 (I'm 41 now), I have been
self-employed for 10 years, and I have handled so many projects that I
know exactly how much work it needs.
I also know we can't get a good solution by tweaking small pieces of
Tesseract code; we need to get the big picture and rewrite it.

Remi

rthomas

Jul 21, 2010, 4:50:19 AM
to tesseract-ocr


On Jul 21, 10:23 am, rthomas <remi.tho...@gmail.com> wrote:
> > Nobody has mentioned any plans to write a .net wrapper for Tesseract
> > 3, and the developer of tessnet2 has mentioned that he would rather
> > pay for someone to reimplement Tesseract than touch it again, so I
> > wouldn't hold my breath, if I were you.
>
> Yes, but the main reason is because I had to do very few modification
> in tesseract and I can't commit the code.
> Can we commit again now?
> I just get last alpha 3 version.
> Can you explain me this syntax in imgtiff.cpp
> tprintf (_("Resolution=%d\n"), *res);
> What this underscore mean?

Ok, found it, and now you are linking with the Leptonica library.
I really don't understand.
90% of users need 10% of the functionality.
If you need image processing before calling OCR, then in YOUR code you
do the image processing and then you call OCR. You don't include image
processing in the OCR, because 90% of people don't need it (or
want it).
And if somebody needs HD Photo support, will you add it too? And camera
RAW, can you add that as well?

So I confirm, I'll never write the tessnet3 wrapper.

Jimmy O'Regan

Jul 21, 2010, 5:46:15 AM
to tesser...@googlegroups.com
On 21 July 2010 09:23, rthomas <remi....@gmail.com> wrote:
>>
>> Nobody has mentioned any plans to write a .net wrapper for Tesseract
>> 3, and the developer of tessnet2 has mentioned that he would rather
>> pay for someone to reimplement Tesseract than touch it again, so I
>> wouldn't hold my breath, if I were you.
>>
>
> Yes, but the main reason is because I had to do very few modification
> in tesseract and I can't commit the code.
> Can we commit again now?

Eh? I'm not aware of anyone's commit rights having been taken away. If
you had commit rights before, you should still have them.

You don't automatically get commit rights just by joining a mailing
list, not on any open source project. If you want to commit, you have
to ask the project owner to add you. In this case, that's Ray Smith.
His email address should be easy to find.

> I just get last alpha 3 version.
> Can you explain me this syntax in imgtiff.cpp
> tprintf (_("Resolution=%d\n"), *res);
> What this underscore mean?
> This is not C++ ISO?
>

It's a gettext convenience macro, for localisation. It's a small first
step towards making Tesseract translatable, mostly made to see what
broke (thanks to Zdenko, btw, for finding the breakage).

>> (On a related note, I spent a little while yesterday looking at some
>> truly horrifically written spaghetti code[1], so I'm a little less
>> unsympathetic than before, but I think he's seriously underestimating
>> the magnitude of such a reimplementation).
>>
>
> I don't underestimating. And that's why a university or engineer
> student will help us during 3 month.
> I write in C/C++ since I'm 18 (I'm 41 now), I'm self employed since 10
> years, I handled so many projects I know exactly how much work it
> needs.

Hey, by all means, prove me wrong.

> I also know we can't get a good solution triking small pieces of
> tesseract code, we need to get the big picture and rewrite it.

That's your opinion; it's also your time and your money, so use them as
you see fit.

Jimmy O'Regan

Jul 21, 2010, 6:03:45 AM
to tesser...@googlegroups.com
On 21 July 2010 09:50, rthomas <remi....@gmail.com> wrote:
>
>
> On Jul 21, 10:23 am, rthomas <remi.tho...@gmail.com> wrote:
>> > Nobody has mentioned any plans to write a .net wrapper for Tesseract
>> > 3, and the developer of tessnet2 has mentioned that he would rather
>> > pay for someone to reimplement Tesseract than touch it again, so I
>> > wouldn't hold my breath, if I were you.
>>
>> Yes, but the main reason is because I had to do very few modification
>> in tesseract and I can't commit the code.
>> Can we commit again now?
>> I just get last alpha 3 version.
>> Can you explain me this syntax in imgtiff.cpp
>> tprintf (_("Resolution=%d\n"), *res);
>> What this underscore mean?
>
> Ok, found, and now your are linking with Leptonica Library
> I really don't understand.
> 90% of users need 10% of functionalities.
> If you need image processing before calling OCR then in YOUR code you
> do image processing and then you call OCR. You don't include image
> processing in the OCR, because 90% of the people doesn't need it (or
> want it).

The usual convention on mailing lists is, when you want to comment on
a particular statement, you respond to *that* email, instead of
finding something completely unrelated and inserting your two cents
there.

Clearly, you didn't understand. Maybe you should read it again instead
of trying to reconstruct it from memory.

> And if somebody need HD Photo support you add it also? And camera RAW
> also, can you add it?
>

It looks like you're trying to take your misunderstanding to the level
of absurdity.

> So I confirm, I'll never write the tessnet3 wrapper.

Yeah, that was a really long-winded way of eventually meandering to
the point, wasn't it?
