Indian language computer applications: an appeal

Hariharan Ramamurthy

Dec 13, 2013, 3:41:42 AM

to parich...@googlegroups.com

Dear everyone who cares,

It is sad that, despite its large software workforce, India still lags behind in supporting Indian regional languages in software that deals with language.

I am a physician from Andhra Pradesh, presently in Dallas, TX. I am interested in Telugu language computer applications.

I raised the topic of creating Telugu OCR software on a discussion list called "Racchabanda" in 1996 and again in 1998. As recently as 2010 we were still discussing it in a Google group called "telugu sanganana".

Someone might tell me that the OCR problem is already solved by software from TDIL, IIIT or the University of Hyderabad.

Unfortunately, after their authors announce them and claim 95% accuracy, these programs disappear.

There is not a single freeware or commercial program that does the following for Telugu with ease and accuracy (as MS Word, OmniPage and Dragon NaturallySpeaking do for English):

spellcheck, OCR, speech recognition, handwriting recognition, machine translation (Google Translate is a joke), and speech synthesis (this is solved to a certain extent, but much can be done to improve it).

The status is similar for most other Indian languages.

I had almost given up on this, but the recent success of the Aam Aadmi Party has energized me, and I think a handful of people can really make a difference if they concentrate and strive.

So I am inviting all of you to come forward and spend a few hours each week developing the following for Telugu. This will in turn encourage others to develop similar projects for all the Indian languages.

I made a beginning by emailing the professor who developed the Yioop search engine. His reply, quoted below, is encouraging.

First, we need a large public corpus.

Second, we need to tag the corpus.

Third, we need to complete the bootstrapping of Tesseract for OCR.

After succeeding at these, we can go on to new projects.
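
For the first step, one simple way to decide which collected pages belong in a Telugu corpus is to measure how much of their text falls in the Unicode Telugu block (U+0C00 to U+0C7F). The Python sketch below is only an illustration. The function names and the 0.5 threshold are assumptions, and it is not part of Yioop or any existing tool.

```python
import unicodedata

# Unicode Telugu block boundaries.
TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F


def telugu_ratio(text: str) -> float:
    """Return the fraction of letters and combining marks that are Telugu.

    Digits, punctuation and whitespace are ignored so that URLs, dates and
    markup residue do not dominate the score.
    """
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    telugu = sum(1 for ch in relevant if TELUGU_START <= ord(ch) <= TELUGU_END)
    return telugu / len(relevant)


def is_mostly_telugu(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if at least `threshold` of its letters are Telugu.

    The default of 0.5 is an arbitrary starting point, not a tuned value.
    """
    return telugu_ratio(text) >= threshold
```

A page that is labelled Telugu but is mainly English would score low and could be dropped before it reaches the corpus.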

Who wants to volunteer?

Me

To chrisatpollett.org

Dec 12 at 9:43 PM

Hi,

I am a physician from AP, presently in Dallas, TX. I am interested in Telugu language computer applications.

I am interested in OCR, MT, and speech and handwriting recognition.

I am not very good at computer programming, but I have a broad understanding of the field.

There is no large corpus publicly available for this Indic language.

I was trying to create one and started using web-as-corpus software, but unfortunately, due to the loss of the Bing API, this no longer works.

Can the crawler you use be used to create such a corpus?

I would like to discuss some ideas I have with you.

Are you willing to correspond with me by email?

Chris Pollett

To Me

Dec 12 at 10:57 PM

Sure. I don't know Telugu, but I could put you in contact with some former students who worked on Yioop that do. Looking at the most recent crawl I did, I did get a fair number of Telugu documents, but they were mainly Wikipedia related:

http://www.yioop.com/index.php?q=lang%3Ate&YIOOP_TOKEN=xlhhSDmlPjI|1386909727&its=1375222073&limit=0

Restricting to both the Telugu language and the .in domain yields only a small number of results in my last crawl:

http://www.yioop.com/index.php?YIOOP_TOKEN=gTXbgBRIWgI|1386909940&its=1375222073&q=lang%3Ate+site%3Ain

all of which seem to be mainly English. My crawl on the machines in our guest room is only about 1/100th the size of Bing or Google's index and is mainly of American sites, but you can configure Yioop to crawl whatever you like. If you downloaded the Telugu Wikipedia (available free), Yioop could index it. If you had a good start list of URLs, it would be possible to have Yioop crawl and only index the sites it found containing Telugu.

Best,

Chris

Dhaval Patel

Jan 21, 2014, 7:59:46 PM

to parich...@googlegroups.com

Dear All,

I am a faculty member in the CSE department at IIT-Roorkee. I can try to help you develop Telugu OCR.
With regards
Dhaval

RKVS Raman

Jan 21, 2014, 9:36:47 PM

to parich...@googlegroups.com, dhava...@gmail.com

Thank you very much for your interest in OCR.

Tesseract, as you might know, depends on training data for accurate recognition. We develop this data by mapping the glyphs of a font in a given language to their Unicode characters.

It would be great if you could help us identify the major fonts used for printed Telugu text (both Unicode and non-Unicode). For non-Unicode fonts we would also need the mapping between the glyphs and Unicode text.

This will be useful in developing the training data for Tesseract.
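
To illustrate the kind of glyph-to-Unicode mapping meant here, below is a minimal Python sketch that converts text typed in a non-Unicode font into Unicode with a lookup table. The legacy codes in the table are invented for illustration and do not correspond to any real Telugu font. A real table would have to be built for each font, and may also need rules for reordering glyphs that this sketch does not attempt.

```python
# HYPOTHETICAL mapping from legacy font codes to Unicode Telugu.
# The keys are made up for this example and match no real font.
LEGACY_TO_UNICODE = {
    "A": "\u0C05",         # TELUGU LETTER A
    "k": "\u0C15",         # TELUGU LETTER KA
    "ka": "\u0C15\u0C3E",  # two legacy codes -> KA + VOWEL SIGN AA
}


def convert_legacy(text: str, mapping: dict) -> str:
    """Convert legacy-encoded text to Unicode using longest-match lookup.

    Characters with no entry in the mapping are copied through unchanged,
    which makes gaps in the table easy to spot in the output.
    """
    keys = sorted(mapping, key=len, reverse=True)  # try longest codes first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # unmapped character: keep as-is
            i += 1
    return "".join(out)
```

Longest-match lookup is used because one Unicode sequence is often represented by several glyph codes in a row in such fonts.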

Let us know your inputs.

Best Regards
-Raman

-----------------------------------------------
RKVS Raman
http://sites.google.com/site/rkvsraman
------------------------------------------------

--
You received this message because you are subscribed to the Google Groups "Parichit-OCR" group.
