Tesseract vs Commercial Products

Jason Funk

Feb 18, 2012, 2:43:26 PM
to tesseract-ocr
I am testing Tesseract against some other commercial products, and the
commercial products seem to blow Tesseract out of the water in terms
of quality and accuracy. Is this because Tesseract is just not as good
as the other products? Or is Tesseract perhaps designed for a specific
purpose other than what I am testing it for?

Maybe a different question would be, for what applications are people
using tesseract successfully?

Sven Pedersen

Feb 18, 2012, 4:07:16 PM
to tesser...@googlegroups.com
Tesseract is especially good for custom training for a particular type of text. Accuracy can increase to over 98% for a given font. Also, it can be trained for foreign languages.
--Sven



La Monte H. P. Yarroll

Feb 18, 2012, 4:53:11 PM
to tesser...@googlegroups.com
A good example is Fraktur (old German black-letter fonts). The only commercial option costs over $10,000 for a single copy. There are some languages for which Tesseract is the only option.

Jason Funk

Feb 18, 2012, 4:53:31 PM
to tesseract-ocr
If I am understanding you right, it does not work very well without
being trained?

Jason

Jason Funk

Feb 18, 2012, 5:46:00 PM
to tesseract-ocr
But what if I am simply trying to do OCR on images that use standard,
normal English fonts? Why isn't it working as well as the commercial
options, which do beautifully? Does the default English language data
file not cover a lot of the typical fonts?

On Feb 18, 3:53 pm, "La Monte H. P. Yarroll" <piggy.yarr...@gmail.com>
wrote:
> A good example is fraktur (old German black-letter fonts). The only
> commercial option is over $10,000 for a single copy. There are some
> languages for which tesseract is the only option.
>
> On Sat, Feb 18, 2012 at 4:07 PM, Sven Pedersen <sven.peder...@gmail.com> wrote:
>
> > Tesseract is especially good for custom training for a particular type of
> > text. Accuracy can increase to over 98% for a given font. Also, it can be
> > trained for foreign languages.
> > --Sven
>

Sven Pedersen

Feb 18, 2012, 7:02:20 PM
to tesser...@googlegroups.com
Commercial options have lots of built-in image processing. You can do that with free software, but it does not just happen automatically. Post some examples and you'll get feedback about how to do it with Tesseract.
--Sven


Jason Funk

Feb 18, 2012, 7:58:04 PM
to tesseract-ocr
My specific examples are screen captures of PowerPoint slides. For
example, what would need to be done to this image?

http://jasonfunk.net/example2.jpeg

Derek Dohler

Feb 19, 2012, 12:21:47 AM
to tesser...@googlegroups.com
I applied some of the image processing that I commonly use to the image you sent.
Before image processing, Tesseract outputs:
The Evolving Student
@

After processing, it outputs:
The Evolving Student
0 Children and Email
Classroom Requirem nts
Online Coursework Dependency
v Learning a Vital Social Skill

(The missing "e" is due to the pre-processing, not Tesseract.)
The main thing I notice about the image you sent is that most of the letters have very low contrast with their surroundings. If you add some pre-processing to intelligently convert the image to black and white, I expect your results will improve significantly.

Derek

Dmitri Silaev

Feb 19, 2012, 12:32:00 AM
to tesser...@googlegroups.com
Consider this metaphor: a gasoline engine versus a car. For a mainstream
user, an engine alone is of almost no use, but for custom car shops, car
factories, and enthusiasts it can be valuable. A car is an end-user
product: you can start driving it at once with no do-it-yourself work.
The same holds for Tesseract and commercial OCRs. Tesseract is hard to
use and poorly documented, yet free and powerful in the hands of a
professional or a dedicated enthusiast. Commercial OCR systems let you
proceed to recognition right away, providing a convenient user interface
and automating the process as much as possible. The UI and automation
are the car body, transmission, stereo, and so on. Incidentally, these
"extras" are what shape users' impressions of how good the final product
is. An awkward or glitchy product is usually dismissed as "bad", even
though the engine (the OCR engine) behind the scenes may be the most
refined and powerful in its class.

Besides the unusual fonts and rare languages that other forum members
mention, Tesseract is used in custom OCR-related software and web
services, or as the OCR engine inside industrial-scale text recognition
systems. Many users trade their own effort and time with free Tesseract
against the cost of a commercial OCR system. Google is currently working
on making Tesseract more user-friendly, though.

Out of the box, Tesseract works best with black-and-white scans of paper
pages that have a simple layout. Most of the image processing work Sven
mentions is aimed at bringing source images into that form.

So, regarding your image, you'll need to convert it to monochrome and
make the text characters stand out from the background. This can be done
in any image editor by converting the image to grayscale (perhaps
selecting just one of the R, G, or B channels) and then applying a
manually chosen threshold. I think this is roughly what Derek did, and
as you can see, the results are quite decent. If you have many such
images, you can use ImageMagick to automate these image processing
operations and then feed the resulting images to Tesseract, all in a
single script.

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com
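
A minimal sketch of the grayscale-plus-threshold pipeline Dmitri describes, written here with the Leptonica library from C++ instead of ImageMagick (an editorial substitution; the filename and the threshold of 180 are illustrative guesses to tune per image):

    #include <leptonica/allheaders.h>

    int main(int argc, char** argv)
    {
        PIX* pixs = pixRead(argc > 1 ? argv[1] : "slide.jpg");  // illustrative filename
        if (!pixs) return 1;

        PIX* pixg = pixConvertTo8(pixs, 0);            // 8 bpp grayscale, no colormap
        // Alternatively, pixGetRGBComponent(pixs, COLOR_RED) picks a single channel,
        // which sometimes separates colored text from its background better.
        PIX* pixb = pixThresholdToBinary(pixg, 180);   // pixels darker than 180 become black

        pixWrite("slide-bw.png", pixb, IFF_PNG);       // then run: tesseract slide-bw.png out

        pixDestroy(&pixb);
        pixDestroy(&pixg);
        pixDestroy(&pixs);
        return 0;
    }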

TP

Feb 19, 2012, 2:41:52 AM
to tesser...@googlegroups.com
On Sat, Feb 18, 2012 at 9:32 PM, Dmitri Silaev <daemo...@gmail.com> wrote:
> If you have many
> such images you can use ImageMagick to automate the above image
> processing operations and then feed resulting images to Tesseract, all
> in a single script.

Or, since tesseract-ocr already links with the Leptonica C Image
Processing Library
(http://tpgit.github.com/UnOfficialLeptDocs/leptonica/index.html), you
could use its many powerful functions to process your PIX directly in
memory. This of course requires changing tesseractmain.cpp and
rebuilding tesseract, but we are trying to make using libtesseract
3.02 easier on Windows (it's already pretty easy on Linux).
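
A minimal sketch of that in-memory route, assuming you link against libtesseract's C++ API directly rather than patching tesseractmain.cpp (the filename, language, and Otsu tile parameters are illustrative guesses):

    #include <cstdio>
    #include <leptonica/allheaders.h>
    #include <tesseract/baseapi.h>

    int main()
    {
        PIX* pixs = pixRead("slide.png");               // illustrative filename
        if (!pixs) return 1;

        PIX* pixg = pixConvertTo8(pixs, 0);             // 8 bpp grayscale, no colormap
        PIX* pixb = NULL;
        // Tile-based Otsu thresholding; 200x200 px tiles and scorefract 0.1 are starting guesses.
        pixOtsuAdaptiveThreshold(pixg, 200, 200, 0, 0, 0.1f, NULL, &pixb);

        tesseract::TessBaseAPI api;
        if (api.Init(NULL, "eng") != 0) return 1;       // default tessdata location
        api.SetImage(pixb);                             // hand the binarized PIX to Tesseract in memory
        char* text = api.GetUTF8Text();
        printf("%s", text ? text : "");

        delete [] text;
        api.End();
        pixDestroy(&pixb);
        pixDestroy(&pixg);
        pixDestroy(&pixs);
        return 0;
    }

On Linux this should build with something like `g++ ocr.cpp -ltesseract -llept`, though the exact flags depend on how the libraries were installed.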

TP

Feb 19, 2012, 2:57:48 AM
to tesser...@googlegroups.com
On Sat, Feb 18, 2012 at 4:58 PM, Jason Funk <jason...@gmail.com> wrote:
> My specific examples are screen captures of powerpoint slides. For
> example, what would need to be done to this image?
>
> http://jasonfunk.net/example2.jpeg

Remember, it's *always* a bad idea to save an image in JPEG format if
it will later be processed by other programs. Notice all the noise
that now surrounds your characters?

Use TIFF or PNG instead.

Dmitri Silaev

Feb 19, 2012, 4:36:15 AM
to tesser...@googlegroups.com
Jason doesn't seem to be a developer, so I think these aren't options
for him. Otherwise the choices are limitless, including third-party
image processing libraries and, of course, self-written custom algorithms.

Warm regards,
Dmitri Silaev
www.CustomOCR.com

Wil Hadden

Feb 19, 2012, 6:20:56 PM
to tesseract-ocr
Having recently used Leptonica for pre-processing, I have to ask: why is
Leptonica not used more in Tesseract?

Even pretty basic things, like assuming text is black on white and
boosting images accordingly with pixGammaTRC and pixContrastTRC, would
make the default processing more accurate, and I think that would be a
fair assumption to make. I could be wrong, but I don't see these being
used in Tesseract.

I also seem to be getting better results using pixDeskew, which I see is
used in Cube processing but nowhere else. Is this the enhanced alignment
functionality in 3.x? If so, why not use it everywhere?

As usual, I stand to be corrected!

Wil
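
For reference, a minimal sketch of the kind of boosting and deskewing Wil describes (the function names are Leptonica's; the gamma, range, and contrast values are guesses to tune, and the helper name is hypothetical):

    #include <leptonica/allheaders.h>

    // pixg: an 8 bpp grayscale page image (hypothetical input)
    PIX* boost_and_deskew(PIX* pixg)
    {
        PIX* pixd = pixDeskew(pixg, 0);                      // level the text lines; 0 = default search reduction
        PIX* pixe = pixGammaTRC(NULL, pixd, 1.0f, 30, 230);  // assume dark text on a light ground:
                                                             // stretch 30..230 to the full 0..255 range
        pixContrastTRC(pixe, pixe, 0.5f);                    // mild contrast boost, in place
        pixDestroy(&pixd);
        return pixe;                                         // caller frees with pixDestroy()
    }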



Jason Funk

Feb 20, 2012, 6:35:42 AM
to tesseract-ocr
Actually, I am a developer, but I am new to the OCR world. The piece of
the equation I was missing was the image pre-processing. I will
investigate it further. Thanks for your help.

Roast

Feb 20, 2012, 5:30:00 AM
to tesser...@googlegroups.com
Hi Derek Dohler, could you tell me the details of how you processed the image to get the better result?

Thanks.

Derek Dohler

Feb 20, 2012, 7:03:08 AM
to tesser...@googlegroups.com
Hi Roast,

It is locally adaptive binarization; see here for more details: http://www.leptonica.com/binarization.html
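
For the curious, a minimal sketch of locally adaptive (Sauvola) binarization with Leptonica, roughly in the spirit of that page (the filename, window half-width, and factor are assumptions to tune, not necessarily what Derek used):

    #include <leptonica/allheaders.h>

    int main()
    {
        PIX* pixs = pixRead("slide.png");            // illustrative filename
        if (!pixs) return 1;

        PIX* pixg = pixConvertTo8(pixs, 0);          // Sauvola expects 8 bpp grayscale
        PIX* pixb = NULL;
        // Window half-width 25 px and factor 0.40 are starting guesses worth tuning.
        pixSauvolaBinarizeTiled(pixg, 25, 0.40f, 1, 1, NULL, &pixb);

        pixWrite("slide-bin.png", pixb, IFF_PNG);    // then: tesseract slide-bin.png out

        pixDestroy(&pixb);
        pixDestroy(&pixg);
        pixDestroy(&pixs);
        return 0;
    }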

Wil Hadden

Feb 20, 2012, 9:28:31 AM
to tesseract-ocr
As a quick aside to my own post, I found from trawling the Leptonica
examples that initially calling pixBackgroundNormSimple gives images
that look worse to the eye but seem to produce better OCR results.

Just so you know, it's worth experimenting.

Wil
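
A tiny sketch of that idea, assuming an 8 bpp grayscale input and a guessed global threshold (the helper name is hypothetical):

    #include <leptonica/allheaders.h>

    // pixg: an 8 bpp grayscale page image (hypothetical input)
    PIX* normalize_then_binarize(PIX* pixg)
    {
        // Map the uneven background toward white; NULLs accept the default masks.
        PIX* pixn = pixBackgroundNormSimple(pixg, NULL, NULL);
        // After normalization a simple global threshold is often good enough.
        PIX* pixb = pixThresholdToBinary(pixn, 180);
        pixDestroy(&pixn);
        return pixb;   // 1 bpp image for Tesseract; caller frees with pixDestroy()
    }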