Indian language computer applications: an appeal


Hariharan Ramamurthy

Dec 13, 2013, 3:41:42 AM
to parich...@googlegroups.com

 


 

Dear everyone who cares,

It is sad that, despite its large software workforce, India still lags in improving the representation of Indian regional languages in software that deals with language in general.

I am a physician from Andhra Pradesh, presently in Dallas, TX. I am interested in Telugu language computer applications.

 

I raised the topic of creating Telugu OCR software on a discussion list called "Racchabanda" in 1996 and again in 1998. As recently as 2010 we were still discussing this in a Google group called "telugu sanganana".

Someone may tell me that this OCR problem has already been solved by software from TDIL, IIIT, or the University of Hyderabad.

Unfortunately, after their authors claim to have produced working software with 95% accuracy, these programs disappear.

There is not a single freeware or commercial program that does the following for Telugu with ease and accuracy (as MS Word, OmniPage, and Dragon NaturallySpeaking do for English):

spell checking, OCR, speech recognition, handwriting recognition, machine translation (Google Translate is a joke), and speech synthesis (this is solved to a certain extent, but much can be done to improve it).

 

The status is similar for most other Indian languages.

 

I had almost given up on this, but the recent success of the Aam Aadmi Party has energized me, and I think a handful of people can really make a difference if they concentrate and strive.

 

So I am inviting all of you to come forward and spend a few hours each week developing the following for Telugu. This will in turn encourage others to develop similar projects for all the Indian languages.

 

I made a beginning by sending an email to the professor who developed the search engine Yioop, and got the reply quoted below. This is encouraging.

 

First, we need a large public corpus.

Second, we need to tag the corpus.

Third, we need to complete the bootstrapping of Tesseract for OCR.

After succeeding at these, we can go on to new projects.

 

Who wants to volunteer?

Me

To chrisatpollett.org

 

Dec 12 at 9:43 PM

Hi,

I am a physician from Andhra Pradesh, presently in Dallas, TX. I am interested in Telugu language computer applications.

 

I am interested in OCR, MT, and speech and handwriting recognition.

 

I am not very good at computer programming, but I have a broad understanding of the field.

There is no large corpus publicly available for this Indic language.

I was trying to create one and started using web-as-corpus software, but unfortunately, due to the loss of the Bing API, this no longer works.

Can the crawler you use be used to create such a corpus?

I would like to discuss some ideas I have with you.

Are you willing to correspond with me by email?

 

Chris Pollett

To Me

 

Dec 12 at 10:57 PM

Sure. I don't know Telugu, but I could put you in contact with some former students who worked on Yioop and who do. Looking at the most recent crawl I did, I did get a fair number of Telugu documents, but they were mainly Wikipedia related:

http://www.yioop.com/index.php?q=lang%3Ate&YIOOP_TOKEN=xlhhSDmlPjI|1386909727&its=1375222073&limit=0

Restricting to both the Telugu language and the .in domain yields only a small number of results in my last crawl:

http://www.yioop.com/index.php?YIOOP_TOKEN=gTXbgBRIWgI|1386909940&its=1375222073&q=lang%3Ate+site%3Ain

all of which seem to be mainly English. My crawl on the machines in our guest room is only about 1/100th the size of Bing or Google's index and is mainly of American sites, but you can configure Yioop to crawl whatever you like. If you downloaded the Telugu Wikipedia (available free), Yioop could index it. If you had a good start list of URLs, it would be possible to have Yioop crawl and only index the sites it found containing Telugu.

 

Best,

Chris 
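
The idea of keeping only pages that contain Telugu can be sketched roughly as follows. This is not how Yioop detects language; it is a minimal Python illustration that assumes a page counts as Telugu when most of its letters fall in the Unicode Telugu block (U+0C00 to U+0C7F). The function names and the 0.5 threshold are made up for this example.

```python
import unicodedata

# Unicode Telugu block boundaries.
TELUGU_START = 0x0C00
TELUGU_END = 0x0C7F


def telugu_ratio(text: str) -> float:
    """Return the share of letters and combining marks that are Telugu.

    Digits, punctuation, and whitespace are ignored so that markup
    residue and numbers do not dilute the ratio.
    """
    counted = 0
    telugu = 0
    for ch in text:
        # Keep only letters (L*) and marks (M*); Telugu vowel signs are marks.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        counted += 1
        if TELUGU_START <= ord(ch) <= TELUGU_END:
            telugu += 1
    return telugu / counted if counted else 0.0


def is_mostly_telugu(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: keep a page if at least `threshold` of it is Telugu."""
    return telugu_ratio(text) >= threshold
```

A crawler output could be passed through `is_mostly_telugu` page by page, keeping only the pages that return True. The threshold would need tuning on real pages, since many Telugu sites mix in English navigation text.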

 

Dhaval Patel

Jan 21, 2014, 7:59:46 PM
to parich...@googlegroups.com

Dear all,

I am a faculty member in the CSE department at IIT Roorkee. I can try to help you develop Telugu OCR.

With regards,
Dhaval

RKVS Raman

Jan 21, 2014, 9:36:47 PM
to parich...@googlegroups.com, dhava...@gmail.com
Thank you very much for your interest in OCR.

Tesseract, as you might know, depends on training data for accurate recognition. We develop this data by mapping the glyphs of a font in a language to their Unicode characters.

It would be great if you could help us identify the major fonts used for printed Telugu text (Unicode and non-Unicode). For non-Unicode fonts we would also need the mapping between the glyphs and Unicode text.

This will be useful in developing the training data for Tesseract. 

Let us know your inputs.



Best Regards
-Raman

-----------------------------------------------
RKVS Raman
http://sites.google.com/site/rkvsraman
------------------------------------------------
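
The glyph-to-Unicode mapping for a non-Unicode font, as asked for above, can be pictured as a lookup table from the font's codes to Unicode strings. The Python sketch below is only an illustration: the table `SAMPLE_MAPPING` is invented and does not describe any real Telugu font, and `convert_legacy` is a hypothetical helper, not part of Tesseract. It assumes a simple longest-match substitution is enough, which real fonts with reordered or split glyphs may not satisfy.

```python
from typing import Mapping

# Invented example table: legacy font code sequence -> Unicode text.
# These codes do NOT correspond to any real font.
SAMPLE_MAPPING = {
    "A": "\u0c05",         # TELUGU LETTER A
    "k": "\u0c15",         # TELUGU LETTER KA
    "kA": "\u0c15\u0c3e",  # KA + TELUGU VOWEL SIGN AA
}


def convert_legacy(text: str, mapping: Mapping[str, str]) -> str:
    """Convert legacy-encoded text to Unicode by longest-match lookup.

    Characters with no entry in the mapping are copied through unchanged,
    so gaps in the table stay visible in the output.
    """
    # Try longer codes first so "kA" wins over "k" followed by "A".
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No code matched at this position: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With a real table for a given font, text typed in that font could be converted this way and then used as ground truth alongside the rendered glyph images.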




--
You received this message because you are subscribed to the Google Groups "Parichit-OCR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to parichit-ocr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
