improving tesseract accuracy

andrew

Jun 7, 2008, 3:39:43 PM

to tesseract-ocr

Hi!

I am thinking about using tesseract in a project where a user scans a
document through a Java applet, which uploads the image to the server.
Tesseract then performs OCR on a predetermined (small) part of the
document, the user is asked for confirmation (important, because there
must be no errors), and the image is saved.

The problem is that the OCR is not nearly as accurate as I thought it
would be... The characters are ASCII, clearly separated, visible and
readable to a human reader. (I would upload an image but Google
Groups doesn't seem to allow it :( )

1) Is there a way to improve the accuracy of the process? Would
building a dictionary with my own scans help?

2) Can tesseract learn with time? The system will know of any mistakes
tesseract makes, so ideally we could use those chars to improve the
accuracy.

Thank you!

andrew

Jun 7, 2008, 4:00:21 PM

to tesseract-ocr

Ok, here is the image... :)

Any help would be appreciated.

b.tif

andrew

Jun 7, 2008, 4:12:54 PM

to tesseract-ocr

Forgot to write - the text it recognizes is:
Apafah 35B6430].2/424 NOKIA 6234
which is not very helpful... :)

The tesseract version is the one from Debian testing, tesseract-ocr_2.03-1_i386.deb.
The dictionary is tesseract-ocr-eng_2.00-1_all.deb.

Sorry it took 3 posts to ask this question. ;)

rthomas

Jun 7, 2008, 11:50:36 PM

to tesseract-ocr

Hi,

Try with a 150 DPI greyscale scan.

Remi

andrew

Jun 9, 2008, 2:10:54 PM

to tesseract-ocr

> Try with a 150 DPI greyscale scan.

Remi, thank you for the suggestion!

The image was already 150 DPI. I have now tried the grayscale version
(attached), but tesseract doesn't find any characters in the image (it outputs
just a single space char).

However, when I converted it to BW again, tesseract correctly identified all of the
chars! (see bw_test.tif) If you compare bw_test.tif with b.tif (from the previous
post) there is not much difference, at least to the human eye... Interesting. :)
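
For anyone reproducing the BW conversion: a minimal sketch of a fixed global threshold in plain Python, assuming 8-bit grayscale pixels held as a list of rows. The function name to_black_and_white and the cutoff of 128 are illustrative assumptions; the post does not say which tool or threshold produced bw_test.tif.

```python
def to_black_and_white(rows, cutoff=128):
    """Map 8-bit grayscale pixels to pure black (0) or white (255).

    rows   -- list of rows, each a list of ints in 0..255
    cutoff -- hypothetical threshold; pixels below it become black
    """
    return [[0 if pixel < cutoff else 255 for pixel in row] for row in rows]


# Tiny example: dark ink on a light, slightly noisy background.
gray = [
    [250, 240, 30, 245],
    [235, 20, 25, 251],
]
bw = to_black_and_white(gray)
print(bw)
```

A single cutoff is the simplest choice; it may need tuning per scanner, and unevenly lit scans might call for an adaptive threshold instead.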

But I know, tesseract should work on the grayscale image. Does anybody know
why it doesn't?

Another thing, tesseract seems to be able to read much more if I run it over
the whole A4 document - the read-out is far from perfect ('5' instead of '6'
for instance), but at least it reads something...

Any idea what could be done to make it better?

Thank you!

grayscale_test.tif

bw_test.tif

andrew

Jun 9, 2008, 6:36:51 PM

to tesseract-ocr

> Any idea what could be done to make it better?

Let me answer my own question: by scaling the image by 200% (enlarging it). It
looks like the characters have some "ideal" height that has a great impact on
OCR accuracy.

Could anybody please comment on that?

What would the ideal font size be for the default data set?

Thanks! :)

andrew

Jun 9, 2008, 7:21:16 PM

to tesseract-ocr

Another thing I discovered - turning the dictionary _off_ helps a lot in my case.

Write these lines to /usr/share/tesseract-ocr/tessdata/configs/nodict :
ok_word 0
good_word 0
non_word 0

Then run tess like this:
tesseract b.tif output /usr/share/tesseract-ocr/tessdata/configs/nodict
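
If you want to script this, here is a rough Python sketch that writes the same three settings to a config file and builds the same command line. The helper names write_config and nodict_command are made up for illustration, and the demo writes to a temporary directory rather than to tessdata/configs; it only builds the command and does not run tesseract.

```python
import os
import tempfile

# The three settings quoted in the post above.
NODICT_SETTINGS = ["ok_word 0", "good_word 0", "non_word 0"]


def write_config(path, settings):
    """Write one setting per line to a Tesseract config file."""
    with open(path, "w") as handle:
        handle.write("\n".join(settings) + "\n")


def nodict_command(image, output_base, config_path):
    """Build the argument list in the same order as the command above."""
    return ["tesseract", image, output_base, config_path]


# Demo in a temporary directory (hypothetical location, not tessdata).
config_path = os.path.join(tempfile.mkdtemp(), "nodict")
write_config(config_path, NODICT_SETTINGS)
command = nodict_command("b.tif", "output", config_path)
print(" ".join(command))
```

The resulting list can be handed to subprocess.run once the config file sits where your tesseract build expects it.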

This is fun... :)

Any other ideas? I still can't get anything from this grayscale tif...

Thanks!

grayscale_test.tif

VictorF

Jun 10, 2008, 5:35:50 AM

to tesseract-ocr

Hi,

I can't help you with your question on the ideal size, but there have been a
number of posts that confirm your finding that tesseract works
better after enlarging.

I rescale all my characters to 30x30 (respecting the aspect ratio, of course),
and anything bigger than this doesn't seem to make any observable
improvement. I have no idea why, it was just trial and error :)
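
A rough Python sketch of that rescaling, assuming the 30x30 figure means pixels and that you already know (or have measured) the current character height. The function names, the default of 30 and the use of Pillow in enlarge() are assumptions for illustration, not what anyone in this thread actually ran; Pillow is imported lazily so the size calculation works without it.

```python
def scaled_size(width, height, glyph_height, wanted_glyph_height=30):
    """Return (new_width, new_height) so glyphs reach the wanted height.

    The same factor is applied to both axes, so the aspect ratio is kept.
    """
    if glyph_height <= 0:
        raise ValueError("glyph_height must be positive")
    factor = wanted_glyph_height / glyph_height
    return max(1, round(width * factor)), max(1, round(height * factor))


def enlarge(src_path, dst_path, glyph_height, wanted_glyph_height=30):
    """Resize an image file with Pillow (assumed to be installed)."""
    from PIL import Image  # imported here so scaled_size works without Pillow

    with Image.open(src_path) as img:
        size = scaled_size(img.width, img.height, glyph_height, wanted_glyph_height)
        img.resize(size).save(dst_path)


# 15-pixel-high characters need a 200% enlargement to reach 30 pixels.
print(scaled_size(400, 60, glyph_height=15))  # (800, 120)
```

Trying a few values of wanted_glyph_height and comparing the OCR output is probably the quickest way to find what works for a given font.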

Another thing you can try is restricting the output character
set, if you know that there are only letters and numbers. There's a
guide in the FAQ on this matter... "how to recognise digits only?" or
something like that...

hope this helps :)

Best Regards,
Victor

andrew

Jun 10, 2008, 5:54:32 AM

to tesser...@googlegroups.com, VictorF

> I rescale all my characters to 30x30 (respecting the aspect ratio, of course),
> and anything bigger than this doesn't seem to make any observable
> improvement. I have no idea why, it was just trial and error :)

Thanks, that helps - I'll just try different sizes and decide on the best
one. :)

> Another thing you can try is restricting the output character
> set, if you know that there are only letters and numbers. There's a
> guide in the FAQ on this matter... "how to recognise digits only?" or
> something like that...

I have tried:
http://code.google.com/p/tesseract-ocr/wiki/FAQ

But if I use this:
tessedit_char_whitelist 0123456789

I get this error:
error: Could not find variable 'tessedit_char_whitelist'

My guess is that the documentation is outdated?

If nothing else I could train tess, though I would rather not... it seems
labour-intensive. :)

Thanks again, I appreciate it!

andrew

Jun 10, 2008, 6:13:56 AM

to tesser...@googlegroups.com, Victor Fung

Hope you don't mind if I post this back to the group... :)

> > > Another thing you can try is restricting the output character
> > > set, if you know that there are only letters and numbers. There's a
> > > guide in the FAQ on this matter... "how to recognise digits only?" or
> > > something like that...

> > ...

> > I get this error:
> > error: Could not find variable 'tessedit_char_whitelist'

> ...
> I ran into the same problem
>
> You need to use tesseract 2.03.

I use tesseract 2.03-1 (debian)... Does it work for you in 2.03? Which OS do
you use?

Thanks!

Ray Smith

Jun 11, 2008, 9:35:05 PM

to tesser...@googlegroups.com, VictorF

The documentation is brand-new, but it ONLY works on 2.03 and above. The symptom of "Could not find variable" implies an older version or a typo in your string. Try pasting the name from tessedit.cpp, line 81. If it is not there, you have the wrong version.
Ray.

andrew

Jun 12, 2008, 4:01:24 AM

to tesser...@googlegroups.com

Ray, thank you for the answer! I checked the sources and the line was there.

The problem was in the way I was calling the tesseract executable:
*****
$ tesseract test.tif out /usr/share/tesseract-ocr/tessdata/configs/digits

error: Could not find variable 'tessedit_char_whitelist'

$ tesseract test.tif out digits
Could not open file, digits

$ tesseract test.tif out nobatch digits
Tesseract Open Source OCR Engine
*****

It might be obvious, but it wasn't to me - it looks like you need to
use the "nobatch" parameter. It would be nice to have command-line help...
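
For scripted use, a small Python sketch of the working invocation, with "nobatch" placed before the config name as in the third command above. The names ocr_command and run_ocr are hypothetical; the argument order is taken only from this thread and the 2.03 Debian package, so other versions may behave differently.

```python
import shutil
import subprocess


def ocr_command(image, output_base, config_name, nobatch=True):
    """Build the argument list, with "nobatch" before the config name."""
    command = ["tesseract", image, output_base]
    if nobatch:
        command.append("nobatch")
    command.append(config_name)
    return command


def run_ocr(image, output_base, config_name):
    """Run tesseract if it is on PATH and return the text, else None."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(ocr_command(image, output_base, config_name), check=True)
    # tesseract writes its result to <output_base>.txt
    with open(output_base + ".txt") as handle:
        return handle.read()


print(ocr_command("test.tif", "out", "digits"))
```

Calling run_ocr("test.tif", "out", "digits") would then return the contents of out.txt, or None on a machine without tesseract installed.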

Thank you again, it helps a lot!

Best,

Andrew

Ray Smith

Jun 12, 2008, 8:44:11 PM

to tesser...@googlegroups.com

Thanks for the update on the source of the confusion. I have improved the FAQ entry.
Ray.
