I ran into a similar problem when I only needed it to look at
hexadecimal values. What I ended up doing was creating a separate
"language" that only contained the specified characters. So you could
create a language of numbers and a language of letters, and use
tesseract to read each part of your image with the appropriate
language.
The web address below shows you how to train tesseract for a specific
language. Hope this helps.
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
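
To make the per-region idea concrete, here is a rough Python sketch that runs tesseract once per cropped image, each with its own language via the -l option. The language names ("hexonly", "letters") and the file names are made up for illustration; they assume you have already trained those languages and cropped the image yourself.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def read_regions(regions):
    """Run tesseract once per (image_path, output_base, lang) tuple.

    Tesseract writes the recognised text to <output_base>.txt.
    """
    for image_path, output_base, lang in regions:
        command = build_tesseract_command(image_path, output_base, lang)
        subprocess.run(command, check=True)


# Hypothetical usage (file and language names are assumptions):
# read_regions([
#     ("serial_part.tif", "serial_out", "hexonly"),
#     ("label_part.tif", "label_out", "letters"),
# ])
```

Building the argument list in its own function keeps it easy to check what will be run before actually calling tesseract.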
An easier way to restrict recognition to digits only or letters only
is to use this tip from the FAQ
(http://code.google.com/p/tesseract-ocr/wiki/FAQ):
----------------------------------------------------------------------------------------------------------------------------
How do I recognize only digits?
In 2.03 and above:
Use
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
BEFORE calling an Init function or put this in a text file called
tessdata/configs/digits:
tessedit_char_whitelist 0123456789
and then your command line becomes:
tesseract image.tif outputbase nobatch digits
Warning: Until the old and new config variables get merged, you must
have the nobatch parameter too.
----------------------------------------------------------------------------------------------------------------------------
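
If it helps, the config-file route from the FAQ can be scripted. This is only a sketch: it writes a one-line whitelist config under tessdata/configs and builds the matching command line with the nobatch parameter the FAQ asks for. The function names and the "hex" config name are my own; the tessedit_char_whitelist variable and the directory layout are the ones quoted above.

```python
from pathlib import Path


def write_whitelist_config(tessdata_dir, config_name, allowed_chars):
    """Write tessdata/configs/<config_name> containing a character whitelist."""
    config_dir = Path(tessdata_dir) / "configs"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / config_name
    config_path.write_text(
        f"tessedit_char_whitelist {allowed_chars}\n", encoding="utf-8"
    )
    return config_path


def whitelist_command(image_path, output_base, config_name):
    """Return: tesseract <image> <outputbase> nobatch <config_name>."""
    return ["tesseract", image_path, output_base, "nobatch", config_name]


# Hypothetical usage for hexadecimal-only text (paths are assumptions):
# write_whitelist_config("tessdata", "hex", "0123456789ABCDEF")
# print(" ".join(whitelist_command("image.tif", "outputbase", "hex")))
```

The same helper covers the digits case from the FAQ by passing "digits" and "0123456789".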
For the second part, "I'm willing to review the recognised letters
with the possible words so we can improve the accuracy":
If you are using a 2.0X version, you could use eng.user-words (a
user dictionary), as suggested in the FAQ
(http://code.google.com/p/tesseract-ocr/wiki/FAQ):
----------------------------------------------------------------------------------------------------------------------------
How do I provide my own dictionary?
Easy: Replace tessdata/eng.user-words with your own word list, in the
same format - UTF8 text, one word per line.
More difficult, but better for a large dictionary: Replace
tessdata/eng.word-dawg with one created from your own word list, using
wordlist2dawg. See the TrainingTesseract wiki page for details.
----------------------------------------------------------------------------------------------------------------------------
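
For the easy option, a small Python sketch like the one below could generate the replacement word list in the format the FAQ describes (UTF-8 text, one word per line). The function name and the example words are hypothetical, and dropping blanks and duplicates is my own choice, not something the FAQ requires.

```python
from pathlib import Path


def write_user_words(path, words):
    """Write words to path as UTF-8 text, one word per line.

    Blank entries and repeated words are skipped; order is preserved.
    Returns the list of words actually written.
    """
    cleaned = []
    seen = set()
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    Path(path).write_text("".join(w + "\n" for w in cleaned), encoding="utf-8")
    return cleaned


# Hypothetical usage (back up the original file first):
# write_user_words("tessdata/eng.user-words", ["invoice", "total", "invoice"])
```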
Topic: word review
Joe K <joekar...@gmail.com> Mar 08 11:02AM -0800
Hey Thilanka,
I ran into a similar problem when I only needed it to look at
hexadecimal values. What I ended up doing was creating a separate
"language" that only contained the specified characters. So you could
create a language of numbers and a language of letters, and use
tesseract to read each part of your image with the appropriate
language.
The web address below shows you how to train tesseract for a specific
language. Hope this helps.
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
Moffette <omof...@gmail.com> Mar 08 12:26PM -0800