Empirical DangAmbigs generator


Michael Reimer

Apr 8, 2009, 1:16:22 PM
to tesseract-ocr
As part of my attempt to improve Tesseract's accuracy with a language
model, I needed a better DangAmbigs file. To get one I made a utility
that generates them empirically from any collection of correct and
generated texts (such as the UNLV test set and Tesseract's output from
the tests). I believe this should be useful to other Tesseract
users. You can find it here, with a more detailed description:

http://www.cs.toronto.edu/~mreimer/tesseract.html
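
For readers who want a feel for what "generating ambiguities empirically" can look like, here is a minimal sketch. It is not the utility's actual code, and the approach is an assumption: it aligns each correct text with the corresponding OCR output using Python's `difflib`, then counts short character substitutions. The function names `count_confusions` and `frequent_confusions`, and the `max_len` and `min_count` thresholds, are hypothetical. Writing the result out in the DangAmbigs file format is deliberately left out.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_confusions(truth: str, ocr: str, max_len: int = 2) -> Counter:
    """Count (ocr_chars, truth_chars) substitutions between two texts.

    Only short substitutions are kept (assumption: an ambiguity involves
    at most `max_len` characters on either side).
    """
    counts = Counter()
    # autojunk=False keeps difflib from ignoring frequent characters
    # in long texts, which would distort the alignment.
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, t0, t1, o0, o1 in matcher.get_opcodes():
        if tag != "replace":
            continue  # skip matches, pure insertions and pure deletions
        wrong, right = ocr[o0:o1], truth[t0:t1]
        if len(wrong) <= max_len and len(right) <= max_len:
            counts[(wrong, right)] += 1
    return counts


def frequent_confusions(pairs, min_count: int = 2):
    """Aggregate confusions over (truth, ocr) text pairs.

    Returns (ocr_chars, truth_chars, count) tuples seen at least
    `min_count` times, most frequent first.
    """
    total = Counter()
    for truth, ocr in pairs:
        total.update(count_confusions(truth, ocr))
    return [(w, r, n) for (w, r), n in total.most_common() if n >= min_count]


if __name__ == "__main__":
    # "rn" misread as "m" twice in one short line.
    print(frequent_confusions([("modern corner", "modem comer")]))
    # prints [('m', 'rn', 2)]
```

A real generator would probably also need to handle insertions, deletions and line-level alignment between the two texts; this sketch ignores all of that.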

To the project owners: I'm willing to support this long-term, and
would be pleased to see it put on the external add-ons page or
included in the training files.

Michael Reimer

Apr 8, 2009, 3:27:00 PM
to tesseract-ocr
Also, I've run the UNLV tests with the default DangAmbigs from the
English language pack, with my own generated one, and with an empty
one. The empty one gives the best performance on my system. Is that
normal?

74yrs old

Apr 9, 2009, 4:43:30 AM
to Michael Reimer, Ray Smith
Dear Michael Reimer,
I would like to know whether your "DangAmbigs Generator" will work for the Kannada language.
Once I hear from you, I shall forward sample Kannada text (correct and generated) for testing and evaluation at your end.
With regards,
-sriranga(76yrsold)

Michael Reimer

Apr 9, 2009, 9:26:18 AM
to tesseract-ocr
Hello. To my knowledge there's nothing language-specific in it, so it
ought to work. I'd be happy to test it for you; you can send your files
to my e-mail address.


Ray Smith

Apr 9, 2009, 1:22:08 PM
to tesser...@googlegroups.com
Interesting result. The problem is that the value of DangAmbigs varies according to the size of the document being OCRed. 

Very small documents don't benefit from the adaptive classifier at all, so DangAmbigs has very little effect.
Very large (e.g. multi-page) documents benefit greatly from the adaptive classifier, but mis-adaption also has the greatest cost there, so adaption has to be carefully controlled, which makes DangAmbigs very important.
On medium-sized documents, adaption has a strong effect, but the cost (and danger) of mis-adaption is lower, so it pays to make riskier adaptions. Hence an empty DangAmbigs can lead to higher accuracy.

Ray.

74yrs old

Apr 16, 2009, 5:18:08 AM
to tesser...@googlegroups.com
Hi Ray,
I seek your valuable guidance on whether the DangAmbigs generator would be helpful for the Kannada project.
-with regards,
-sriranga(76yrsold)

74yrs old

Jun 24, 2009, 2:34:45 AM
to tesser...@googlegroups.com
Michael,
Are you using the latest version, Analytical tool 5.1? If you have generated the exe files, could you please forward them to me? I tried to generate them myself but failed.
Sorry for disturbing you.
-With regards,
-sriranga(76yrsold)

74yrs old

Jul 2, 2009, 5:51:39 AM
to tesser...@googlegroups.com
Michael,
Have you perhaps solved the problem of the generator for Kannada by now?
Awaiting good news.
With regards,
-sriranga(76yrsold)

YH Tan

Apr 15, 2016, 1:21:34 PM
to tesseract-ocr
Hi,

I wonder if your software is still available for sharing? It is no longer hosted at the URL you mentioned. Thanks!

YH

On Thursday, April 9, 2009 at 1:16:22 AM UTC+8, Michael Reimer wrote: