I gather that Tesseract 3.0 now works well for Chinese script. The
hallmarks of Chinese script are that it is unconnected (unlike, say, Hindi,
which has a headline connecting the characters of each word) and that it
has a very large character set. In this light, I think it should
also work well with unconnected Indic scripts such as Kannada,
Malayalam, Punjabi etc.
Does anyone know if this works?
--
Debayan Banerjee
http://hacking-tesseract.blogspot.com/
The biggest problem with unconnected Indic scripts seems to be the aspect ratio and the amount of horizontal detail. Hindi seems to work quite well as it doesn't seem to have very big ligatures. The best fix for the unconnected scripts may be to break them into sub-akshara glyphs and recognize those separately.
Ray.
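
To make the idea of sub-akshara units a little more concrete, here is a minimal text-level sketch. It only splits one Kannada akshara into its Unicode code points, which is a rough proxy: the glyph pieces an OCR engine would actually segment do not map one-to-one onto code points. The names AKSHARA and code_point_parts are made up for this illustration and are not part of Tesseract.

```python
import unicodedata

# Example Kannada conjunct akshara, written with escapes so it survives any encoding:
# KA + VIRAMA + SSA + VOWEL SIGN I (hypothetical sample chosen for illustration).
AKSHARA = "\u0c95\u0ccd\u0cb7\u0cbf"


def code_point_parts(akshara):
    """Return a (code point label, Unicode name) pair for each code point."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in akshara]


# Print the components of the sample akshara, one per line.
for code, name in code_point_parts(AKSHARA):
    print(code, name)
```

A recognizer working on sub-akshara units would need far fewer classes than one working on whole aksharas, at the cost of a recombination step afterwards.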
I wrote a blog post about a possible strategy to handle descender vowel
signs: http://hacking-tesseract.blogspot.com/2011/04/horizontal-histogram-profiles-of.html
--
Debayan Banerjee
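
For readers unfamiliar with the term, a horizontal histogram profile counts the ink pixels in each pixel row of a text-line image. The sketch below is illustrative only and is not the code from the blog post. The sample bitmap, the function names, and the 0.5 ratio are all assumptions, and the heuristic (treat everything below the last dense row as the descender zone) is just one simple way such a profile might be used.

```python
# Tiny binary image of one word: 1 = ink, 0 = background.
# Row 0 plays the role of the headline; rows 4-5 hold a sparse descender sign.
SAMPLE = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
]


def horizontal_profile(image):
    """Return the number of ink pixels in each row of a binary image."""
    return [sum(row) for row in image]


def lower_zone_start(profile, ratio=0.5):
    """Return the index of the first row of the sparse zone at the bottom.

    Scans upward from the last row and stops at the first row whose ink
    count reaches ratio * peak; every row below that one is treated as
    the lower (descender) zone. Returns len(profile) if no ink is found.
    """
    if not profile or max(profile) == 0:
        return len(profile)
    threshold = ratio * max(profile)
    for y in range(len(profile) - 1, -1, -1):
        if profile[y] >= threshold:
            return y + 1
    return len(profile)


profile = horizontal_profile(SAMPLE)
print(profile)                    # ink count per row
print(lower_zone_start(profile))  # first row of the descender zone
```

On real scans the ratio would need tuning, and noise or skew would likely call for smoothing the profile first.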
The approach from the blog post will work for Bengali and Hindi. I am not
working on South Indian languages for now.
When you say it seems to work well for Hindi, have you tested Tesseract 3.0 on it?
--
Debayan Banerjee