Tesseract vs. Commercial OCR

Scott Oom

May 23, 2012, 1:03:31 PM

to tesseract-ocr

We are working on automated testing tools for applications and games.

We want to be able to verify various text in the UIs in different
languages and have been experimenting with Tesseract OCR and having a
lot of fun with it.

In 2007, Ray Smith mentioned that "Tesseract is now behind the leading
commercial engines in terms of its accuracy."

What commercial engines are more accurate than Tesseract and in what
ways? Can Tesseract OCR approach the commercial engines with training
and adjusting of parameters or is it still behind?

Thanks,
-Scott
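
For the UI-verification use case described above, one common way to tolerate small OCR mistakes is to compare the recognized string to the expected label with a similarity score rather than exact equality. The following is only a minimal sketch under stated assumptions: the names `normalize` and `ui_text_matches` and the `threshold` value are hypothetical, and how the OCR string is obtained from Tesseract is deliberately left out.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """Normalize the Unicode form and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def ui_text_matches(expected: str, ocr_output: str, threshold: float = 0.9) -> bool:
    """Return True when the OCR output is close enough to the expected label.

    `threshold` is a hypothetical tolerance (assumption, not from the thread);
    it would need tuning per font, language, and screen resolution.
    """
    ratio = difflib.SequenceMatcher(
        None, normalize(expected), normalize(ocr_output)
    ).ratio()
    return ratio >= threshold
```

A looser threshold accepts more OCR noise but also raises the chance of passing a genuinely wrong label, so short strings such as button captions may need a stricter value than long paragraphs.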

Sven Pedersen

May 23, 2012, 1:34:49 PM

to tesser...@googlegroups.com

It is clear that, out of the box, ABBYY FineReader is more accurate.
It may well still be more accurate even after Tesseract is trained,
possibly due to post-processing. Many people who produce effective solutions on this
list use pre- and post-processing scripts to deal with various common
issues. With all that, Tesseract accuracy may be over 96% for normal
text (mostly letters, not numbers and punctuation), judging by
self-evaluations...
--Sven
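
To make the post-processing and self-evaluation mentioned above concrete, here is a minimal sketch. It assumes that "accuracy" means character-level accuracy, computed as one minus the edit distance divided by the length of a hand-checked reference text; the thread does not say how the 96% figure was measured. The function names and the specific clean-up rules are hypothetical illustrations, not anything from Tesseract itself.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Example post-processing step (assumed rules, for illustration only).

    Rejoins words hyphenated across a line break and collapses whitespace.
    Note: this will also join genuinely hyphenated words split at a line end.
    """
    text = re.sub(r"-\n(?=\w)", "", text)
    return " ".join(text.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters recovered, clamped at 0.0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Running `character_accuracy` on the same page before and after `clean_ocr_text` gives a rough idea of how much a given post-processing rule actually helps.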

nikolaykhl

May 24, 2012, 2:08:58 AM

to tesser...@googlegroups.com

I agree that ABBYY will do the job more accurately out of the box and is easier to get started with.
You may also want to have a look at this article: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

zdenko podobny

May 24, 2012, 3:27:48 PM

to tesser...@googlegroups.com

On Thu, May 24, 2012 at 8:08 AM, nikolaykhl <koli...@gmail.com> wrote:

> I agree that ABBYY will do the job more accurately out of the box and is easier to get started with.
> You may also want to have a look at this article: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

This comparison is from 2010 and used tesseract-ocr svn r402. The current revision is r725, so I guess there have been some improvements since that test ;-)

On Wednesday, May 23, 2012 9:03:31 PM UTC+4, Scott Oom wrote:
> We are working on automated testing tools for applications and games.
>
> We want to be able to verify various text in the UIs in different
> languages and have been experimenting with Tesseract OCR and having a
> lot of fun with it.
>
> In 2007, Ray Smith mentioned that "Tesseract is now behind the leading
> commercial engines in terms of its accuracy."
>
> What commercial engines are more accurate than Tesseract and in what
> ways? Can Tesseract OCR approach the commercial engines with training
> and adjusting of parameters or is it still behind?

I would say it depends on your tasks and budget. For example, in our local Gutenberg project, FineReader is used for standard text, but for Fraktur text we used tesseract-ocr (I did custom training for it). The project leader did not want to buy the special version of FineReader [1]...

On the other hand, I have not had good experience using Tesseract to identify bold and italic text...

[1] http://www.frakturschrift.de/en:start

--
Zdenko
