Different results on subimages


Francesco

Feb 2, 2010, 1:40:37 PM
to tesseract-ocr
Hi everybody,
I'm writing an application to automatically scan tons of postal
orders, using TessNet2 library from C#. Tesseract is great and
recognizes about everything on the postal order. But, because some
fields contain only numbers and some others only letters, I want to
process single subimages from the whole picture, adjusting
tessedit_char_blacklist and tessedit_char_whitelist variables for each
of these.
But while processing the entire picture gives great results (still
with some letters recognized as digits, like '0' instead of 'O'),
processing a single subimage, particularly this one, gives no results
at all: http://www.francescovannini.com/pub/importo.jpg
The library detects only a tilde in this image, strangely with a
confidence of 100/255. Unfortunately this is the only part of the
postal order image that I can publish, because of sensitive-data
concerns.
Is there something that I can tune? Presumably processing the entire
picture gives Tesseract more information about font features than
processing this subimage does. That is the only explanation I can
think of. But how can I process a subimage with a particular whitelist
while achieving the same accuracy that processing the entire picture
gives?
Thank you in advance.

patrickq

Feb 3, 2010, 10:46:59 AM
to tesseract-ocr
Hi Francesco,

Tesseract 3.0 actually recognizes all the digits in your sample image
just great. I have processed your image using the ScanBizCards iPhone
application (which uses Tesseract 3.0) and you can see screenshots on:
http://www.scanbizcards.com/boxes.jpg
http://www.scanbizcards.com/results.jpg

The first screenshot is taken during processing and shows you in red
the boxes found by Tesseract during the layout analysis, the 2nd
screenshot is the text result where you can see that all digits were
recognized properly.
We convert the image to grayscale (using non-equal weights for the 3
RGB components) before submitting the image to Tesseract so it's
possible that this makes the difference (but I doubt it). Note by the
way how Tesseract returns several imaginative matches for many of the
'*' characters - not sure why - but you should be able to ignore these
in your code, for example by searching for consecutive sequences of
digits.
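
The "consecutive sequences of digits" filter could be sketched roughly as follows. This is only an illustration: the function name digit_runs, the min_len threshold, and the sample string are made up here and do not come from Tesseract or TessNet2.

```python
import re

def digit_runs(ocr_text, min_len=2):
    """Return every run of consecutive digits that is at least min_len long.

    min_len is a hypothetical knob used to drop stray single digits that
    may come from misread '*' characters.
    """
    return [run for run in re.findall(r"\d+", ocr_text) if len(run) >= min_len]

# Hypothetical noisy OCR output: filler characters around the real digits.
sample = "~*'* 12345 *^* 67"
print(digit_runs(sample))
```

Raising min_len makes the filter stricter, at the cost of losing genuinely short numbers.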

Regarding your issue in general: you are right that Tesseract may do a
better job when processing an entire image, because it can draw
conclusions on text size (for example) but in some cases, that
algorithm is actually a bad thing, for example where each line is in a
totally different font and size! This is the case for what I scan and
I have asked this forum for the Tesseract variable to turn such
adaptive learning OFF - but got no replies. Anyone out there with the
answer? I just want Tesseract to analyze each line separately,
"forgetting" anything it may have learned from other lines. I think
that means disabling the adaptive classifier, but I am not certain.

If you are having better luck scanning the entire image, I suggest
that instead of using blacklist / whitelist on sub-images, you may
want to do this:
- use a regular expression describing, for example, a number
- in that regular expression, don't just look for a sequence of
digits; use something like "[\\dIlOZ&]*", which means "accept a digit,
uppercase I, lowercase l, uppercase O, uppercase Z or &" (because
these characters look similar to digits)
- then, in the string matched by the regexp, just replace occurrences
of I and l with 1, O with 0, Z with 2, & with 8, etc.
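
A minimal sketch of the steps above, under some assumptions: the helper name fix_numbers and the substitution table are illustrative, and the rule "only rewrite a match that already contains at least one real digit" is an extra safeguard added here (so that ordinary words such as "Il" are left alone), not something stated in the list.

```python
import re

# Look-alike characters mapped to the digit they are assumed to stand for.
CONFUSABLE = str.maketrans({"I": "1", "l": "1", "O": "0", "Z": "2", "&": "8"})

# One or more digits or digit look-alikes ('+' instead of '*' avoids empty matches).
CANDIDATE = re.compile(r"[\dIlOZ&]+")

def fix_numbers(text):
    """Replace digit look-alikes inside tokens that contain a real digit."""
    def repl(match):
        token = match.group(0)
        # Assumed safeguard: leave tokens without any real digit untouched.
        if any(ch.isdigit() for ch in token):
            return token.translate(CONFUSABLE)
        return token
    return CANDIDATE.sub(repl, text)

print(fix_numbers("Total 1O5Z&"))
print(fix_numbers("Il Ponte 4O"))
```

A look-alike letter at the end of a word that is directly followed by a digit would still be rewritten, so in practice the pattern may need word boundaries or per-field handling.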

Patrick

Faisal Shafait

Feb 4, 2010, 6:25:44 AM
to tesser...@googlegroups.com
Hi,
Tesseract 3.0 has an individual text-line recognition mode. When running in that mode, I think the adaptive classifier does not adapt to other text lines on the page.

Cheers,
Faisal



patrickq

Feb 4, 2010, 7:46:12 AM
to tesseract-ocr
Hi Faisal,

Here is the image by the way: http://scanbizcards.com/twolines.jpg
Could not be simpler ... two lines:
John Doe
jo...@widgets.com

and yet, scanned as an entire image, John Doe is recognized with 4
mistakes out of 7 letters ("JOhfl DO6")! See http://scanbizcards.com/resultstwolines.jpg

No problems if I split as two images.
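
For reference, splitting into one image per line can be automated with a simple horizontal projection. The sketch below is a simplified illustration, not how Tesseract segments lines: it assumes the image is already binarized into rows of 0/1 values (1 = ink), and the function name split_lines is made up here.

```python
def split_lines(binary_rows):
    """Return (top, bottom) row ranges, bottom exclusive, for each band of ink.

    binary_rows is a list of rows, each a list of 0/1 values where 1 is ink.
    A text line is taken to be a maximal run of rows that contain any ink.
    """
    bands = []
    start = None
    for y, row in enumerate(binary_rows):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # first inked row of a new band
        elif not has_ink and start is not None:
            bands.append((start, y))       # blank row closes the band
            start = None
    if start is not None:                  # band that runs to the last row
        bands.append((start, len(binary_rows)))
    return bands

# Tiny hypothetical image: two inked bands separated by a blank row.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
print(split_lines(image))  # [(1, 3), (4, 5)]
```

Each (top, bottom) range can then be cropped and passed to the recognizer on its own. Skewed or noisy scans would need more than this.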

Truly bizarre ... what's even weirder is that the line that's messed
up is the first line, as if this was a result of scanning the 2nd line
(even though that line comes after and I think Tesseract recognizes
top down).

I think you mean the SetPageSegMode API, which takes one of:
PSM_AUTO, // Fully automatic page segmentation.
PSM_SINGLE_COLUMN, // Assume a single column of text of variable sizes.
PSM_SINGLE_BLOCK, // Assume a single uniform block of text. (Default.)
PSM_SINGLE_LINE, // Treat the image as a single text line.
PSM_SINGLE_WORD, // Treat the image as a single word.
PSM_SINGLE_CHAR, // Treat the image as a single character.

PSM_SINGLE_COLUMN would seem to be the mode I need. Unfortunately, I
tried it and it doesn't seem to help in the case in question.

Patrick


Ilya Mezhirov

Feb 4, 2010, 10:23:13 AM
to tesseract-ocr
Hi Patrick,

This: JOhfl DO6 - looks like a result of a bad line height estimation.
Tesseract rescales all the lines to some standard height and when this
step goes wrong, the classifier is helpless. This might lead to
letters being recognized as taller letters (o->O, n->fl, ...). I think
that the line height is estimated during the layout analysis stage,
before the actual recognition starts. It is quite possible (I'm not
sure) that Tesseract enforces the same line height onto all lines in a
column.
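
A toy model of this effect, with made-up numbers: the function looks_tall, the 1.3 ratio, and the pixel heights are all hypothetical and are not taken from Tesseract's code. It only shows how a shared height estimate borrowed from a smaller line could make a lowercase letter look like a tall one.

```python
def looks_tall(glyph_height, xheight_estimate, ratio=1.3):
    """Toy rule: a glyph clearly taller than the x-height estimate is
    treated as a tall letter (capital or ascender)."""
    return glyph_height > ratio * xheight_estimate

# Hypothetical measurements in pixels.
line1_xheight = 20   # larger font on the first line
line2_xheight = 12   # smaller font on the second line
o_height = 20        # a lowercase 'o' on the first line

# With the first line's own estimate, 'o' is an ordinary x-height letter.
print(looks_tall(o_height, line1_xheight))   # False

# With an estimate forced from the second line, the same 'o' looks tall,
# which would favour 'O' over 'o'.
print(looks_tall(o_height, line2_xheight))   # True
```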

Ilya

