Tesseract fails to recognize simple, isolated digits. How can I train Tesseract to recognize digits in an unknown font?

V.Lorz

Mar 26, 2014, 2:10:56 PM

to tesser...@googlegroups.com

Hi All,

I started integrating Tesseract (version 3.2, via Emgu CV) into a project that recognizes short texts in scanned images. Using some very simple image processing, I extract the area of interest to speed up the process.

The errors I get are in the recognition results: Tesseract sometimes confuses the digits '6' and '5'. The image below is recognized as "4436695" instead of "4436696". I'm using the default eng.traineddata file bundled with the library. With some other trained data files found around the Internet I got the same results, with the same two digits (5 and 6). Before processing the image I configure Tesseract to process only digits.
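For reference, a minimal sketch of one way to restrict Tesseract to digits from the command line. The helper name `build_digits_command` and the file names are hypothetical. The sketch assumes the Tesseract 3.x command-line syntax, where `-psm 7` treats the image as a single text line and `digits` is the stock config file that whitelists digit characters; newer releases spell the option `--psm`.

```python
def build_digits_command(image_path, output_base, lang="eng"):
    """Build a Tesseract 3.x command line for one line of digits.

    Assumption: the stock `digits` config file is installed in
    tessdata/configs, as it is in a default installation.
    """
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "-psm", "7",  # page segmentation mode 7: single text line
        "digits",     # config file restricting output to digit characters
    ]


# Hypothetical file names, for illustration only.
cmd = build_digits_command("label.png", "label_out")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run` as is; the recognized text then lands in `label_out.txt`.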

Does anyone know what could be causing this error? How could I solve it?

I started reading the guide for training the engine (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3) as suggested in some other threads, but it is of next to no help to me. Is there any other guide around for 'dummies' like [presumably :(] me? In this case I want to train it using one image that I created from 40 sampled documents (attached here). Using jTessBoxEditor-1.0 I was able to generate and correct the box file. What should I do next?

Thanks a lot in advance, V.Lorz

eng-labelfont.rar

Nick White

Mar 26, 2014, 2:53:39 PM

to tesser...@googlegroups.com

Hi V.Lorz,

Firstly, it's Tesseract 3.02.02, not 3.2. We may release version 3.2
someday, but not for a long time yet ;)

Doing training is not going to help you, I'm afraid. The font is
quite standard, so you aren't going to be able to do a better job at
training Tesseract for it than the eng.traineddata provides.

Out of curiosity, why did you think that training would help you
here? I ask as it's a very common misconception, but (AFAIK) our
documentation doesn't imply it anywhere.

You may just have to accept that the accuracy from Tesseract won't
be 100%, I'm afraid. Maybe someone else here has suggestions, but
the image looks alright to me, so the general advice of "more
preprocessing" may not be helpful.

Nick

> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email
> to tesseract-oc...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

V.Lorz

Mar 26, 2014, 4:45:53 PM

to tesser...@googlegroups.com

Hi Nick, thanks for taking the time to write.

>> Firstly, it's Tesseract 3.02.02, not 3.2

The Emgu wrapper around Tesseract returns a System.Version instance, which exposes integer values for its Major, Minor, Build and Revision properties, so 3.02 is reported as 3.2.
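To illustrate why the wrapper shows 3.2: once each dotted component is stored as an integer, the leading zero in "02" is lost. A minimal Python sketch of the same effect (the helper `parse_version` is hypothetical and is not part of Emgu or Tesseract):

```python
def parse_version(text):
    """Split a dotted version string into integer components."""
    return tuple(int(part) for part in text.split("."))


print(parse_version("3.02.02"))                            # (3, 2, 2)
print(parse_version("3.02.02") == parse_version("3.2.2"))  # True
```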

>> Out of curiosity, why did you think that training would help you here?

I asked myself one simple question after seeing this behaviour with several images, always with the same two characters, '5' and '6': why would the engine return two different character codes for two almost identical blobs?
The most reasonable conclusion for me was that it has something to do with training.

>> (...) but (AFAIK) our documentation doesn't imply it anywhere.

You're right about that; I just followed what seemed like common sense to me.

>> You may just have to accept that the accuracy from Tesseract won't be 100%, I'm afraid.

Each batch has 40 images with 7 digits per image. Every batch yields 2 to 3 erroneous results (one wrong digit each), never fewer than 2, and always with the same two digits. Counted per digit, the success rate is high, but counted per figure (7 digits) it ranges from 92.5% to 95.0%. That is too low for the client.
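The arithmetic behind those figures, as a small Python sketch. It assumes each erroneous image contains exactly one wrong digit, as stated above; the function name `batch_accuracy` is only illustrative.

```python
IMAGES_PER_BATCH = 40
DIGITS_PER_IMAGE = 7


def batch_accuracy(wrong_images, wrong_digits):
    """Return (per-figure accuracy, per-digit accuracy) for one batch."""
    figure_acc = (IMAGES_PER_BATCH - wrong_images) / IMAGES_PER_BATCH
    digit_acc = 1 - wrong_digits / (IMAGES_PER_BATCH * DIGITS_PER_IMAGE)
    return figure_acc, digit_acc


for errors in (2, 3):
    # One wrong digit per erroneous image, so both counts are equal.
    fig, dig = batch_accuracy(errors, errors)
    print(f"{errors} errors: per figure {fig:.1%}, per digit {dig:.2%}")
```

With 2 errors this gives 95.0% per figure and about 99.29% per digit; with 3 errors, 92.5% per figure and about 98.93% per digit.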

Any clues on how to improve this?

V.Lorz

Quan Nguyen

Mar 26, 2014, 8:53:27 PM

to tesser...@googlegroups.com

I defined a ROI around each number and it seemed to produce better results.
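A minimal sketch of what defining such an ROI could look like, assuming the image has already been binarized into lists of 0/1 with 1 meaning ink. The helpers `ink_bounding_box` and `crop` are hypothetical and only illustrate the idea; with the Tesseract API the resulting rectangle would typically be handed to `TessBaseAPI::SetRectangle` rather than cropped by hand.

```python
def ink_bounding_box(image, margin=1):
    """Return (top, left, bottom, right) around all ink pixels.

    bottom and right are exclusive; returns None for a blank image.
    """
    height, width = len(image), len(image[0])
    rows = [r for r in range(height) if any(image[r])]
    cols = [c for c in range(width) if any(row[c] for row in image)]
    if not rows:
        return None
    top = max(rows[0] - margin, 0)
    left = max(cols[0] - margin, 0)
    bottom = min(rows[-1] + 1 + margin, height)
    right = min(cols[-1] + 1 + margin, width)
    return top, left, bottom, right


def crop(image, box):
    """Cut the region described by box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Tiny synthetic page: ink only in rows 2-3, columns 3-5.
page = [[0] * 8 for _ in range(6)]
page[2][3] = page[2][4] = page[3][5] = 1

box = ink_bounding_box(page)
print(box)                   # (1, 2, 5, 7)
print(len(crop(page, box)))  # 4
```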

V.Lorz

Mar 27, 2014, 5:23:14 AM

to tesser...@googlegroups.com

Hi Nguyen,

Thanks for the suggestion. I've tried the ROI and also isolating the digits as independent images, but the results did not improve. For some images I got better results by resizing the image by a scale factor of 2.5; other images required DILATE/ERODE operations to close 1-pixel holes.
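The DILATE/ERODE step is a morphological closing: a dilation followed by an erosion. Below is a minimal pure-Python sketch on a binary image stored as lists of 0/1, assuming a 3x3 square structuring element and ignoring neighbours that fall outside the image. The function names are illustrative; in practice an imaging library such as OpenCV (which Emgu CV wraps) would do this.

```python
def _morph(image, combine):
    """Apply `combine` (max or min) over each pixel's 3x3 neighbourhood."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            neighbours = [
                image[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            out_row.append(combine(neighbours))
        out.append(out_row)
    return out


def dilate(image):
    """A pixel becomes ink if any neighbour is ink."""
    return _morph(image, max)


def erode(image):
    """A pixel stays ink only if all in-bounds neighbours are ink."""
    return _morph(image, min)


def close_holes(image):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(image))


# A one-pixel-thick stroke with a 1-pixel gap at column 4.
stroke = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
closed = close_holes(stroke)
print(closed[2])  # [0, 0, 1, 1, 1, 1, 1, 0, 0]
```

The gap is filled while the stroke keeps its original thickness. Gaps wider than the structuring element can bridge stay open, which fits the 1-pixel case described here.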

V.Lorz
