Tesseract 3.0 with unconnected Indic script


Debayan Banerjee

Mar 30, 2011, 2:17:42 AM
to tesser...@googlegroups.com
Hi,

I gather that Tesseract 3.0 works well for Chinese script now. The
hallmarks of Chinese script are that it is unconnected (unlike, say,
Hindi, which has a line connecting all its characters) and that it has
a large number of characters. In this light, I think it should
also work well with unconnected Indic scripts such as Kannada,
Malayalam, Punjabi, etc.

Anyone know if this works?

--
Debayan Banerjee
http://hacking-tesseract.blogspot.com/

Sriranga(78yrsold)

Mar 30, 2011, 2:46:35 AM
to tesser...@googlegroups.com
If we succeeded with Sanskrit (Devanagari script), which is the mother language of the Indic languages,
then no doubt Tesseract 3.01 should work. I have tested with Tamil, which has dependent vowels identical to Bengali as well as Kannada and Telugu. The only problem is output accuracy, which can be solved for the time being with the help of a post-processor.
The latest version of FreeOCR, 4.1 (July10), has a post-processor developed by Ralph, and I have found that the latest tesseract.exe works with FreeOCR to this day.
Even VietOCR, developed by Quan, has a post-processing feature that works for Indic languages as well as Vietnamese.

With regards,
-sriranga(78yrs)


Ray Smith

Apr 2, 2011, 11:43:35 AM
to tesser...@googlegroups.com

The biggest problem with unconnected Indic scripts seems to be the aspect ratio and the amount of horizontal detail. Hindi seems to work quite well as it doesn't seem to have very big ligatures. The best fix for the unconnected scripts may be to break them into sub-akshara glyphs and recognize those separately.

Ray.
Sent from my Nexus1 Android phone.

mns_rao

Apr 3, 2011, 2:50:54 AM
to tesseract-ocr
>The best fix for the unconnected
> scripts may be to break them into sub-akshara glyphs and recognize those
> separately.

After correct recognition, is there a method to put the output in
the accepted form of the language?

MNS Rao

On Apr 2, 8:43 pm, Ray Smith <theraysm...@gmail.com> wrote:
> The biggest problem with unconnected Indic scripts seems to be the aspect
> ratio and the amount of horizontal detail. Hindi seems to work quite well as
> it doesn't seem to have very big ligatures. The best fix for the unconnected
> scripts may be to break them into sub-akshara glyphs and recognize those
> separately.
>
> Ray.
> Sent from my Nexus1 Android phone.
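
One possible answer to the question above, as a minimal sketch rather than anything Tesseract itself does: if the recognizer emits sub-akshara glyphs in left-to-right visual order, a post-processing pass can move each pre-base vowel sign behind the glyph it belongs to and then apply Unicode NFC normalization. The function name `visual_to_logical`, the `PRE_BASE_SIGNS` table, and the assumption that every recognized glyph arrives as its own string are hypothetical choices made for illustration. The table covers only a few Devanagari, Bengali, and Tamil signs.

```python
import unicodedata

# Vowel signs that are drawn to the LEFT of the consonant but are stored
# AFTER it in Unicode logical order. Illustrative subset only.
PRE_BASE_SIGNS = {
    "\u093F",  # DEVANAGARI VOWEL SIGN I
    "\u09BF",  # BENGALI VOWEL SIGN I
    "\u09C7",  # BENGALI VOWEL SIGN E
    "\u09C8",  # BENGALI VOWEL SIGN AI
    "\u0BC6",  # TAMIL VOWEL SIGN E
    "\u0BC7",  # TAMIL VOWEL SIGN EE
    "\u0BC8",  # TAMIL VOWEL SIGN AI
}


def visual_to_logical(glyphs):
    """Reorder glyph strings from visual order to Unicode logical order.

    `glyphs` is a list of strings, one per recognized glyph, in the order
    they appear on the page (assumption: a conjunct is one string).
    """
    out = []
    pending = None  # a pre-base sign waiting for its base glyph
    for glyph in glyphs:
        if glyph in PRE_BASE_SIGNS:
            if pending is not None:
                out.append(pending)  # stray sign with no base: keep as-is
            pending = glyph
        else:
            out.append(glyph)
            if pending is not None:
                out.append(pending)  # sign goes right after its base
                pending = None
    if pending is not None:
        out.append(pending)
    # NFC merges two-part vowel signs into their single code point.
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    # Visual order: VOWEL SIGN I, then KA. Logical order: KA, then the sign.
    word = visual_to_logical(["\u093F", "\u0915"])
    print([unicodedata.name(ch) for ch in word])
```

A fuller version would also need to handle reph, conjuncts that the recognizer splits across several glyphs, and signs that are followed by something other than a consonant.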

Sriranga(78yrsold)

Apr 3, 2011, 7:27:22 AM
to tesser...@googlegroups.com
Ray,
With reference to breaking the script into sub-akshara glyphs: I have attached a PDF file (kindly go to page 2 for the English version) issued by the Govt. of Karnataka, India, for your perusal. It gives an ASCII code, not a Unicode code point, for each glyph. I hope it will be useful to you. I am preparing the Unicode number for the ASCII code of each glyph for ready reference.
With regards,
-sriranga(78yrs)
draft-kannada-bi-lingual-code.pdf

Debayan Banerjee

Apr 7, 2011, 3:52:21 PM
to tesser...@googlegroups.com
On 2 April 2011 21:13, Ray Smith <thera...@gmail.com> wrote:
> The biggest problem with unconnected Indic scripts seems to be the aspect
> ratio and the amount of horizontal detail. Hindi seems to work quite well as
> it doesn't seem to have very big ligatures. The best fix for the unconnected
> scripts may be to break them into sub-akshara glyphs and recognize those
> separately.
>

I wrote a blog post about a possible strategy for handling descender vowel
signs: http://hacking-tesseract.blogspot.com/2011/04/horizontal-histogram-profiles-of.html
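
As a rough illustration of what a horizontal histogram (projection) profile is, and not necessarily the method described in the linked post: count the ink pixels in each row of a binarized glyph image, then look for the thinnest row in the lower part of the glyph as a candidate cut between the main body and a descender sign. The toy bitmap `GLYPH`, the helper names, and the choice to search only the lower half are all assumptions made for this sketch.

```python
def horizontal_profile(image):
    """Return the number of ink pixels (value 1) in each row of the image."""
    return [sum(row) for row in image]


def find_valley_row(profile, search_from):
    """Return the index of the first row with the least ink at or below search_from."""
    return min(range(search_from, len(profile)), key=lambda r: profile[r])


def split_at_row(image, row):
    """Split the image into rows above `row` and rows from `row` down."""
    return image[:row], image[row:]


# Toy 8x6 binary glyph: a headline, a body, a thin join, and a descender.
GLYPH = [
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0],
]

if __name__ == "__main__":
    profile = horizontal_profile(GLYPH)
    # Only search the lower half, where a descender sign would sit.
    cut = find_valley_row(profile, len(profile) // 2)
    body, descender = split_at_row(GLYPH, cut)
    print("profile:", profile)
    print("cut row:", cut, "body rows:", len(body), "descender rows:", len(descender))
```

On real scans the profile is noisy, so a practical version would likely smooth it and require the valley to be well below the neighbouring peaks before cutting.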


--
Debayan Banerjee

Debayan Banerjee

Apr 7, 2011, 3:59:46 PM
to tesser...@googlegroups.com

This will work for Bengali and Hindi. I am not working on South Indian
languages for now.
When you say it seems to work well for Hindi, have you tested 3.0 with this?


--
Debayan Banerjee
