word review


Thilanka Kaushalya

Mar 6, 2010, 11:46:42 PM
to tesser...@googlegroups.com
Hi,

       I'm using Tesseract for my letter recognition project, and currently the recognition is quite good.
The letters are handwritten. But there are some problems when I use it to recognise the letter "O" and
the number "0". These characters appear in form fields such as name fields, and a name cannot contain any
digits. Likewise, a field such as date of birth contains only digits. So I would like to restrict the
recognition system by telling it that a given field contains only digits or only letters.
       I would also like to check the recognised letters against a list of possible words so that we can
improve the accuracy of the data. But I don't have any idea how to do that.

Can someone help me? Thank you.

Regards,
Thilanka.
--
http://coders-view.blogspot.com/
http://thilankagekawuluwa.blogspot.com/
http://twitter.com/thilanka_k

Joe K

Mar 8, 2010, 2:02:02 PM
to tesseract-ocr
Hey Thilanka,

I ran into a similar problem when I only needed it to look at
hexadecimal values. What I ended up doing was creating a separate
"language" that only contained the specified characters. So you could
create a language of numbers and a language of letters, and use
Tesseract to read each part of your image using the appropriate
language.

The web address below shows you how to train tesseract for a specific
language. Hope this helps.

http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

Moffette

Mar 8, 2010, 3:26:54 PM
to tesseract-ocr
Hi,

An easier way to handle digits-only or letters-only fields is to use this
from the FAQ (http://code.google.com/p/tesseract-ocr/wiki/FAQ):
----------------------------------------------------------------------------------------------------------------------------
How do I recognize only digits?

In 2.03 and above:

Use

TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");

BEFORE calling an Init function or put this in a text file called
tessdata/configs/digits:

tessedit_char_whitelist 0123456789

and then your command line becomes:

tesseract image.tif outputbase nobatch digits

Warning: Until the old and new config variables get merged, you must
have the nobatch parameter too.
----------------------------------------------------------------------------------------------------------------------------
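As a complement to the whitelist above, here is a minimal C++ sketch of a post-processing step that forces a recognised string into the character class a field allows. This is not part of Tesseract: `FieldKind`, `normalize_field`, and the confusion pairs (such as O/0) are hypothetical and chosen only for illustration, so adjust them to the mistakes you actually see in your forms.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical field types for a form; not a Tesseract type.
enum class FieldKind { Digits, Letters };

// Replace characters that are easily confused in handwriting with the
// look-alike from the class the field allows. Characters without an entry
// in the table are copied unchanged. The pairs below are assumptions.
std::string normalize_field(const std::string& text, FieldKind kind) {
    static const std::unordered_map<char, char> to_digit = {
        {'O', '0'}, {'o', '0'}, {'I', '1'}, {'l', '1'}, {'S', '5'}, {'B', '8'}};
    static const std::unordered_map<char, char> to_letter = {
        {'0', 'O'}, {'1', 'I'}, {'5', 'S'}, {'8', 'B'}};
    const auto& table = (kind == FieldKind::Digits) ? to_digit : to_letter;

    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        auto it = table.find(c);
        out.push_back(it != table.end() ? it->second : c);
    }
    return out;
}
```

For example, calling `normalize_field("2O1O", FieldKind::Digits)` maps each "O" to "0", and `normalize_field("J0E", FieldKind::Letters)` maps the "0" to "O". The whitelist is still the first thing to try; a table like this only helps with output that was produced without one.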

For the second part: "I'm willing to review the recognised letters
with the possible words so we can improve the accuracy"

If you are using a 2.0x version, you could use eng.user-words (a
user dictionary), as suggested in the FAQ
(http://code.google.com/p/tesseract-ocr/wiki/FAQ):

----------------------------------------------------------------------------------------------------------------------------
How do I provide my own dictionary?

Easy: Replace tessdata/eng.user-words with your own word list, in the
same format - UTF8 text, one word per line.

More difficult, but better for a large dictionary: Replace
tessdata/eng.word-dawg with one created from your own word list, using
wordlist2dawg. See the TrainingTesseract wiki page for details.
----------------------------------------------------------------------------------------------------------------------------


Thilanka Kaushalya

Mar 11, 2010, 10:59:55 AM
to tesser...@googlegroups.com

Hi Joe and Moffette,

             Thanks for the tips you provided. They are very helpful to me. I'm currently
trying out your instructions. Thanks again.

regards thilanka

     


    Thilanka Kaushalya

    Mar 21, 2010, 1:25:30 AM
    to tesser...@googlegroups.com
    Hi Joe and Moffette,


             I'm recognising the data from a handwritten form, and the approach is to extract the
    letters one by one and send each letter to Tesseract separately. So the recognition
    is done letter by letter, and I can't use the dictionary file for word review in that case.
    ****************

    How do I provide my own dictionary?
     
    Easy: Replace tessdata/eng.user-words with your own word list, in the
    same format - UTF8 text, one word per line.
     
    More difficult, but better for a large dictionary: Replace
    tessdata/eng.word-dawg with one created from your own word list, using
    wordlist2dawg. See the TrainingTesseract wiki page for details.


    ***********************
              Does Tesseract output only words that are included in the dictionaries mentioned above?
    If so, can I send the set of recognised letters back to Tesseract as an image, so that it is
    checked against a defined set of predefined words?

              Or else, can you suggest a method for doing this?

    Thanks and regards,
    Thilanka.
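
For letter-by-letter recognition as described above, one possible approach (outside Tesseract, and not something the FAQ describes) is to join the recognised letters into a string and pick the nearest entry from your own word list by edit distance. The C++ sketch below assumes a small in-memory list; `edit_distance`, `closest_word`, and the `max_distance` threshold are hypothetical names introduced here for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein edit distance, computed with two rolling rows.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Return the dictionary word closest to `recognised`. If no word is within
// `max_distance` edits, return `recognised` unchanged. On ties, the word
// that appears first in the dictionary wins.
std::string closest_word(const std::string& recognised,
                         const std::vector<std::string>& dictionary,
                         std::size_t max_distance) {
    std::string best = recognised;
    std::size_t best_dist = max_distance + 1;
    for (const std::string& word : dictionary) {
        std::size_t d = edit_distance(recognised, word);
        if (d < best_dist) {
            best_dist = d;
            best = word;
        }
    }
    return best;
}
```

With a list such as `{"COLOMBO", "KANDY", "GALLE"}`, the input `"C0LOMB0"` is two substitutions away from `"COLOMBO"`, so a threshold of 2 selects that word. The comparison is case-sensitive and the scan is linear, which is fine for a short list of expected names but would need an index for a large dictionary.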