URGENT HELP NEEDED: False recognition due to Dictionary usage in Sanskrit

rohit saluja

Jun 24, 2016, 7:41:20 PM
to tesseract-ocr
Hi,

I generated images in the Sanskrit 2003 font using text2image with its default configuration.
I then trained Tesseract using my own box files and compared the results with and without the dictionary DAWG.

Interestingly, using the dictionary DAWG increases the word-level accuracy, but for certain words it produces wrong words that were recognized correctly when the dictionary was not used.

Example: using the internal state debugger, I found that if I give an image of अब्ज, I get
अज (R=66.5974, C=-4.88797) as output when I use the dictionary, and अब्ज (Rating=33.2893, Conf=-2.93596) when I do not use the dictionary.
It is important to note that the non-dictionary word has the better rating and confidence (a lower rating and a confidence closer to zero).

Clearly, Tesseract stops at the point in the dictionary where it finds अज and does not go on to try अब्ज. (I saw the same behaviour with other such examples.)

What I want to do is the following:

I want Tesseract to give me, for each word, whichever output has the better rating: the dictionary-based recognition or the non-dictionary recognition. I want this process to be automated for the whole book. Any help in this regard will be deeply appreciated.
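
To make the request concrete, here is a minimal sketch in Python of the selection step I have in mind. It is not Tesseract code: `WordResult` and `pick_best` are hypothetical names of my own, and it assumes that I already have two per-word result lists for the same page (one from a run with the dictionary, one from a run without it) and that both runs segment the page into the same words in the same order. It also assumes, as in the debugger output above, that a lower rating is better. How the two lists are obtained from Tesseract is outside this sketch.

```python
from typing import List, NamedTuple


class WordResult(NamedTuple):
    """One recognized word (hypothetical container, not a Tesseract type)."""
    text: str
    rating: float      # assumed: lower is better, as in the debugger output
    confidence: float  # assumed: closer to zero is better


def pick_best(with_dict: List[WordResult],
              without_dict: List[WordResult]) -> List[WordResult]:
    """For each word position, keep the candidate with the lower rating.

    Assumes both lists describe the same words in the same order.
    On a tie, the dictionary-based result is kept.
    """
    if len(with_dict) != len(without_dict):
        raise ValueError("the two runs produced different word counts")
    return [d if d.rating <= n.rating else n
            for d, n in zip(with_dict, without_dict)]


# The example from this post: the non-dictionary result has the lower rating.
dictionary_run = [WordResult("अज", 66.5974, -4.88797)]
plain_run = [WordResult("अब्ज", 33.2893, -2.93596)]
best = pick_best(dictionary_run, plain_run)
print(best[0].text)
```

With the values from my example, this keeps अब्ज. Repeating it for every page would cover the whole book, provided the word segmentation of the two runs really does match.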

Thanks in advance
Rohit