limiting characters recognized by tesseract


retsyx

Apr 9, 2008, 12:33:02 AM
to tesseract-ocr
I have a very limited OCR application where I want to recognize only
digits [0-9]. Is there a way to restrict tesseract to attempt to
recognize only digits (with the implicit hope that this improves
accuracy)?

ttutuncu

Apr 9, 2008, 10:00:40 AM
to tesseract-ocr
Well, if you train tesseract only on digits, it will output only
digits.
The problem is that it will then try to match every character in your
image to a digit, so anything that is not a digit comes out as
meaningless digits.

There is a way to slightly increase the accuracy of reading digits.
You can write "number_depth 4" into a file and pass that file on the
command line.
Say the file you created is "tess.config". Then the command line
would be:

    tesseract test.tif output -l eng tess.config
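
For anyone scripting this, the two steps above (write the variable into a config file, then pass that file as the last argument) might look roughly like the Python sketch below. The helper names `write_config`, `build_command`, and `run_tesseract` are made up for illustration, and the sketch assumes a `tesseract` binary is on the PATH. Only "number_depth 4", "tess.config", and the argument order come from the post.

```python
import subprocess
from pathlib import Path


def write_config(path, settings):
    """Write one 'name value' pair per line, as in the post's tess.config."""
    lines = [f"{name} {value}" for name, value in settings.items()]
    Path(path).write_text("\n".join(lines) + "\n")
    return Path(path)


def build_command(image, output_base, config_path, lang="eng"):
    """Assemble the command from the post: image, output base, -l lang, config."""
    return ["tesseract", str(image), str(output_base), "-l", lang, str(config_path)]


def run_tesseract(image, output_base, config_path):
    """Invoke tesseract; raises FileNotFoundError if the binary is not installed."""
    return subprocess.run(build_command(image, output_base, config_path), check=True)
```

Calling `write_config("tess.config", {"number_depth": 4})` followed by `run_tesseract("test.tif", "output", "tess.config")` would reproduce the command line given above.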

Try this and share your results with us.

Scan...@gmail.com

Apr 9, 2008, 10:33:54 AM
to tesseract-ocr
There is also a variable that can be set for recognizing digits only.

Your best bet is to preprocess and crop the areas that contain the
numbers you want to recognize.

You also want to train from actual images of your digits.
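
As a rough illustration of the cropping step, here is a minimal Python sketch that trims a binarized image down to the bounding box of its ink. It assumes the image is already a list of rows holding 0 (background) and 1 (ink), which is an assumption made for the example and not something stated in the thread. `crop_to_ink` is a hypothetical helper; a real pipeline would more likely use an imaging library.

```python
def crop_to_ink(image):
    """Return the sub-image spanning all ink pixels (value 1), or [] if none."""
    # Rows that contain at least one ink pixel.
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        return []
    # Column indices of every ink pixel, across all rows.
    cols = [c for row in image for c, px in enumerate(row) if px]
    top, bottom = rows[0], rows[-1]
    left, right = min(cols), max(cols)
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

Cropping this tightly removes empty margins but not noise specks; any stray ink pixel still widens the box.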

VictorF

Apr 9, 2008, 10:51:34 AM
to tesseract-ocr
Hello everyone,

By coincidence, I'm doing exactly the same thing as retsyx.

I tried ttutuncu's suggestion of setting "number_depth 4", and it did
increase the chance that I get a number as output.

Glen: could you please point out which variable it is? And can it be
set from the command line through a config file?

Thanks!

Victor

retsyx

Apr 9, 2008, 12:12:51 PM
to tesseract-ocr
Thank you for the reply.
I'm OK with matching every character as a digit, since digits are all
I am providing. I've tried 'number_depth 4' and it hasn't improved
anything. I've also tried Glen's suggestion of setting the
'numeric_mode' switch, but this still appears to recognize parens,
brackets, 'X', 'i', and some others, which I guess can be considered
numeric symbols.

Some background: the major problem I have is the relatively low
quality of the input image. The input image is from a webcam. This is
a constraint I'm stuck with. I do quite a bit of pre-processing and
provide tesseract only a cleaned digit at a time. However, there are
still elements of noise that are proving difficult to get rid of. I
was hoping that if I could limit tesseract to only recognizing digits
it would compensate for the noise.
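
One workaround, independent of any tesseract setting, is to filter the recognized text afterwards. The sketch below is plain post-processing, not a tesseract feature: `keep_digits` is a hypothetical helper that drops everything outside 0-9 from whatever string the OCR returned. It discards stray symbols instead of correcting them, so a digit misread as 'i' is lost, not recovered.

```python
def keep_digits(ocr_text):
    """Drop every character that is not an ASCII digit 0-9."""
    return "".join(ch for ch in ocr_text if ch in "0123456789")
```

Because each call here handles a single cleaned digit, an empty result is a usable signal that the recognition failed and the frame should be retried.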

Any suggestions are welcome.

str...@yahoo.com

Apr 9, 2008, 5:52:45 PM
to tesseract-ocr
Hi

I just tried the 'number_depth 4' suggestion as well, and there was
indeed an improvement in accuracy!

Thanks for the advice!

Stronky

dabada83

Apr 10, 2008, 12:19:59 AM
to tesseract-ocr
// If numeric_mode is true, only possible digits and roman numbers are
// returned.

See line 61 of ccmain/baseapi.cpp.