Tesseract vs ABBYY

mw18888

Jul 3, 2011, 10:10:09 AM
to tesseract-ocr
Can anyone comment on the accuracy of Tesseract vs ABBYY?

Regards,

mw18888

patrickq

Jul 3, 2011, 11:40:51 AM
to tesseract-ocr
The answer is (of course) "it depends":
1. If you compare Tesseract and ABBYY on the same image, without applying
any preprocessing to it, ABBYY wins (because Tesseract's image processing
is very rudimentary at best). Of course, if your test images are
produced by (for example) a flatbed scanner, the lack of image
processing is not an issue; refer to case 2 below.
2. If you compare Tesseract and ABBYY on a clean (processed) image,
without applying any post-Tesseract heuristics, ABBYY may have an
advantage.
3. However, if you compare Tesseract + image processing + heuristics &
corrections against ABBYY, Tesseract actually beats ABBYY hands down.

ScanBizCards is an example of case #3, built around Tesseract 3.01. If
you want to test this combo, please do this:
- go to http://www.scanbizcards.com/webdemo
- upload an image (under Batch Actions). Warning: ScanBizCards is
geared towards recognizing text on business cards so it would be best
if you tested on something *like* a business card (sparse text), not a
full page with lots of text
- click that image then "Image Editor" on top and OCR it
- when done testing please delete the test images from this demo
account (or get your own online account) ...

You can also test on your Android or iPhone mobile device instead by
installing the free version of ScanBizCards. ABBYY powers two iPhone
apps made by German companies - Business Card Reader (by Shape Services)
and Card Reader (by xRoot Software) - and of course ABBYY's own
iPhone / Android business card reader app.

Patrick

mw18888

Jul 3, 2011, 3:55:00 PM
to tesseract-ocr
Thank you for your comment.

Best regards,

mw18888.


Andres

Jul 4, 2011, 12:56:39 AM
to tesser...@googlegroups.com
Hello Patrick,

Could you expand a little on what you mean by Tesseract heuristics?

Thanks,

Andres

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

Patrick Collins

Jul 4, 2011, 3:26:56 AM
to tesser...@googlegroups.com
I'm currently working on a project where I am scanning standard 150dpi English-language Arial-font documents. I'm comparing Tesseract against ABBYY in this example. ABBYY wins hands down on all fronts. But ABBYY is expensive for what I'm trying to do.

I love the hOCR output feature of Tesseract; it makes for easy post-processing. I wish ABBYY had a similar command-line tool, but instead I have to write custom C++ programs just to get the bounding boxes around words.
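
As an illustration of the kind of post-processing hOCR allows, here is a minimal sketch that pulls word bounding boxes out of hOCR text using only the Python standard library. The class names `ocr_word` / `ocrx_word` and the `title='bbox x0 y0 x1 y1'` layout are assumptions about what the engine emits, and `HocrWordParser` and `extract_words` are hypothetical names, not part of Tesseract.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each word element in hOCR markup."""

    # Assumed word-level class names; check what your Tesseract build emits.
    WORD_CLASSES = {"ocr_word", "ocrx_word"}
    BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._depth = 0     # nesting depth inside the current word element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # Nested element inside a word (e.g. <strong>): just track depth.
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        match = self.BBOX_RE.search(attrs.get("title") or "")
        if classes & self.WORD_CLASSES and match:
            self._bbox = tuple(int(v) for v in match.groups())
            self._depth = 1
            self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def extract_words(hocr_text):
    """Return a list of (word, bbox) tuples found in the hOCR text."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 10 10 200 40'>"
        "<span class='ocrx_word' title='bbox 10 12 60 38; x_wconf 91'>Hello</span> "
        "<span class='ocrx_word' title='bbox 70 12 130 38'>world</span></span>"
    )
    for word, bbox in extract_words(sample):
        print(word, bbox)
```

The sketch does not handle unclosed void tags (such as a bare `<br>`) inside a word element; real hOCR files may need a more forgiving parser.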

Patrick.

nikolaykhl

Jul 4, 2011, 7:24:13 AM
to tesseract-ocr
Hello, mw18888.

My name is Nikolay, and I work at ABBYY. You may want to refer to one of
the OCR engine comparison articles; for example, check out
http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
Or you can test the ABBYY OCR engine yourself using a trial version of
ABBYY OCR CLI for Linux, which can be downloaded here: http://ocr4linux.com/en:download

In addition, I would like to comment on patrickq's message above. In my
opinion, it is more reasonable to compare OCR engines directly:
• Business card reading apps apply a lot of heuristics on top of the OCR
results, which makes it difficult to compare the OCR technologies used
inside.
• I guess the apps mentioned above are based on the ABBYY Mobile OCR
engine, which is a simpler and smaller version and may not have the full
power of the desktop ABBYY OCR SDK.
Feel free to contact me if you have more questions.

WBR, Nikolay Khlebinskiy
Message has been deleted

A P Rajam

Jul 4, 2011, 7:54:11 AM
to tesser...@googlegroups.com
Hi

We are working on a product that requires an OCR engine. Our product currently runs on Linux but will eventually move to Android.

I am not interested in the UI of the OCR engine; what I need is a backend engine with API support that I can call from my application. We need support for English, Spanish, Italian and French.

Can you please let me know the pricing details of the OCR engine?

Also, can you please let me know whether the OCR engine can recognize handwritten words as well (only English required)?

Thanks and regards

Rajam
Qmax Systems

mw18888

Jul 4, 2011, 10:39:13 AM
to tesseract-ocr
Nikolay,

Thank you for your comment.

Enjoy the holiday weekend.

mw18888


Patrick Questembert

Jul 6, 2011, 6:53:33 AM
to tesser...@googlegroups.com
It's really a long list of approaches, including:
- spacing: we don't trust any spacing determination by Tesseract. We reevaluate every space it reports for possible elimination, and consider every pair of adjacent letters for a possible space insertion.
- obvious mistakes: this is by far the largest category of corrections we make. For example, VV is usually corrected back to W, but there are hundreds more cases.
- ambiguous letters such as i versus l: surprisingly, Tesseract makes a ton of incongruous mistakes that lead me to believe there is no feature analysis whatsoever. For example, a 'y' may get mapped to 'g', even though there is 0% chance of that given the wide open gap on top. For these types of mistakes we go back to the source image and apply our own OCR of sorts.
- dictionaries: another big disappointment. In our testing we found that Tesseract applies the dictionary in less than 5% of the cases where it should (i.e. where the letter mistake is one listed in the ambigs files and the correct spelling is in the user dictionary), so we implemented our own dictionaries.
- pattern matching: the regular expressions we use include wide tolerance for mistakes. Under the "protection" of a regular expression for a specific pattern, we have the flexibility to include hundreds of ambiguities, because these trigger only when they help complete a match, which makes a substitution more likely to be valid.
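
A rough sketch of three of the ideas in the list above (obvious-mistake substitution, dictionary lookup through an ambiguity table, and ambiguity substitution guarded by a tolerant regular expression). This is not the ScanBizCards code: apart from VV -> W, every table entry, the phone-number pattern, and all function names are hypothetical and chosen only for illustration.

```python
import re

# Only VV -> W comes from the list above; a real table would be far larger.
OBVIOUS_FIXES = {"VV": "W"}

# Hypothetical ambiguity pairs (misread -> intended).
AMBIGS = [("rn", "m"), ("cl", "d"), ("1", "l")]

# Hypothetical look-alikes: letters OCR may output where a digit was printed.
DIGIT_AMBIGUITIES = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

# Tolerant phone-number pattern: each position accepts a digit or a look-alike.
PHONE_RE = re.compile(
    r"\b[\dOolISB]{3}[-. ][\dOolISB]{3}[-. ][\dOolISB]{4}\b"
)


def fix_obvious(text):
    """Apply unconditional substitutions for mistakes that are almost never valid."""
    for wrong, right in OBVIOUS_FIXES.items():
        text = text.replace(wrong, right)
    return text


def fix_with_dictionary(word, dictionary):
    """Accept an ambiguity substitution only if it yields a dictionary word."""
    if word.lower() in dictionary:
        return word
    for wrong, right in AMBIGS:
        candidate = word.replace(wrong, right)
        if candidate != word and candidate.lower() in dictionary:
            return candidate
    return word


def fix_phone_numbers(text):
    """Map look-alike letters to digits, but only inside a phone-shaped match."""
    def repair(match):
        return "".join(DIGIT_AMBIGUITIES.get(ch, ch) for ch in match.group())
    return PHONE_RE.sub(repair, text)


if __name__ == "__main__":
    print(fix_obvious("VVilliam"))
    print(fix_with_dictionary("rnanager", {"manager"}))
    print(fix_phone_numbers("Call 2l2-555-O1S9 or ask Bob"))
```

Because the digit substitutions run only inside a match of the phone pattern, a word such as "Bob" elsewhere in the text is left alone, which is the "protection" idea: an aggressive ambiguity table is safe when the surrounding pattern has already made the substitution plausible.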

Patrick

Lutz, Michael

Jul 6, 2011, 7:07:23 AM
to tesser...@googlegroups.com
If you are referring to http://www.abbyyusa.com/, then I think the biggest difference is that Tesseract is open source and ABBYY is not :).

So with ABBYY you pay for the image preprocessing, and with Tesseract you do not.

I totally agree with Patrick: if I do the preprocessing well, I always get perfect results with Tesseract, but I have never tried ABBYY.

Mike



This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postm...@nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes.
To protect the environment please do not print this e-mail unless necessary.

An NDS Group Limited company. www.nds.com

mw18888

Jul 6, 2011, 9:56:36 AM
to tesseract-ocr
Mike and Patrick,

Thank you for the comment.

Mike, can you clarify what you mean by "preprocessing well"?

Regards,

mw18888

Lutz, Michael

Jul 7, 2011, 4:49:42 AM
to tesser...@googlegroups.com
Well, basically what I do is this: if I have a gradient background, I create the black-and-white image myself using a fixed threshold; if the input is blurry, I sharpen it.
If the input is too small, I zoom in and then sharpen. So nothing special, but it helps me get good results for my purpose.
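
The three steps described above can be sketched in plain Python on a grayscale image stored as a list of rows of 0..255 values. The threshold of 128, the 2x zoom factor, the nearest-neighbor scaling, and the 3x3 sharpening kernel are all assumptions for illustration, since no specific values or tools are named here; in practice an imaging library would do the same work much faster.

```python
def binarize(gray, threshold=128):
    """Fixed-threshold binarization: pixels >= threshold become 255, the rest 0."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def upscale(gray, factor=2):
    """Nearest-neighbor zoom: repeat every pixel `factor` times in both directions."""
    out = []
    for row in gray:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def sharpen(gray):
    """Apply the 3x3 kernel [[0,-1,0],[-1,5,-1],[0,-1,0]] with replicated edges."""
    h, w = len(gray), len(gray[0])

    def at(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbor.
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [
        [
            min(255, max(0, 5 * at(y, x)
                         - at(y - 1, x) - at(y + 1, x)
                         - at(y, x - 1) - at(y, x + 1)))
            for x in range(w)
        ]
        for y in range(h)
    ]


if __name__ == "__main__":
    # Tiny synthetic image: dark "ink" on a background that brightens left to right.
    image = [
        [150, 170, 190, 210],
        [150,  40,  60, 210],
        [150, 170, 190, 210],
    ]
    # Small, gradient-background input: zoom, sharpen, then threshold.
    for row in binarize(sharpen(upscale(image))):
        print(row)
```

A single fixed threshold only works when the text is darker than every part of the background, so the value usually has to be tuned per image source.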

Mike

mw18888

Jul 7, 2011, 8:28:52 AM
to tesseract-ocr
Mike,

Thank you for your comment.



cyber

Aug 29, 2011, 2:40:05 PM
to tesseract-ocr
Hello Patrick,

Have you considered selling your OCR post-processing program(s), that
perform the spacing, character substitution, and other post-OCR
enhancements?

Which language and OS are these written for?

Jim



