Indian language computer applications: an appeal
Dear everyone who cares,
It is sad that, despite its large software workforce, India still lags behind in supporting Indian regional languages in software that deals with language.
I am a physician from Andhra Pradesh, presently in Dallas, TX. I am interested in Telugu language computer applications.
I raised the topic of creating Telugu OCR software on a discussion list called "Racchabanda" in 1996 and again in 1998. As recently as 2010 we were still discussing it in a Google group called "telugu sanganana".
Someone may tell me that this OCR problem has already been solved by software from TDIL, IIIT, or the University of Hyderabad.
Unfortunately, after their developers announce the software and claim 95% accuracy, these programs disappear.
There is not a single freeware or commercial program that does the following for Telugu with ease and accuracy (as MS Word, OmniPage, and Dragon NaturallySpeaking do for English):
spellcheck, OCR, speech recognition, handwriting recognition, machine translation (Google Translate is a joke), and speech synthesis (this is solved to a certain extent, but much can be done to improve it).
The status is similar for most other Indian languages.
I had almost given up on this, but the recent success of the Aam Aadmi Party has energized me, and I think a handful of people can really make a difference if they concentrate and strive.
So I am inviting all of you to come forward and spend a few hours each week developing the following for Telugu. This will in turn encourage others to develop similar projects for all the Indian languages.
I made a beginning by sending an email to the professor who developed the search engine Yioop, and I got the reply quoted below, which is encouraging.
First, we need a large public corpus.
Second, we need to tag the corpus.
Third, we need to complete the bootstrapping of Tesseract for OCR.
After succeeding at these, we can go on to new projects.
Who wants to volunteer?
Me
Dec 12 at 9:43 PM
Hi,
I am a physician from AP, presently in Dallas, TX. I am interested in Telugu language computer applications.
I am interested in OCR, MT, speech recognition, and handwriting recognition.
I am not very good at computer programming, but I have a broad understanding of the field.
There is no large corpus publicly available for this Indic language.
I was trying to create one and started using web-as-corpus software, but unfortunately, due to the loss of the Bing API, it no longer works.
Can the crawler you use be used to create such a corpus?
I would like to discuss some ideas I have with you.
Are you willing to correspond with me by email?
Chris Pollett
To Me
Dec 12 at 10:57 PM
Sure. I don't know Telugu, but I could put you in contact with some former students who worked on Yioop
that do. Looking at the most recent crawl I did, I did get a fair number of Telugu documents
but they were mainly Wikipedia related:
http://www.yioop.com/index.php?q=lang%3Ate&YIOOP_TOKEN=xlhhSDmlPjI|1386909727&its=1375222073&limit=0
Restricting to both the telugu language and .in domain yields only a small number of
results in my last crawl:
all of which seem to be mainly English. My crawl on the machines in our guest room is only about 1/100th the size of Bing
or Google's index and is mainly of American sites, but you can configure Yioop to crawl whatever you like. If you
downloaded the Telugu Wikipedia (available free), Yioop could index it. If you had a good start list of URLs it would be possible to have Yioop
crawl and only index the sites it found containing Telugu.
Best,
Chris
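
To make the idea of keeping only pages that contain Telugu more concrete, here is a minimal sketch in Python. It is not part of Yioop and does not use any Yioop API. It only relies on the fact that Telugu characters occupy the Unicode block U+0C00 to U+0C7F. The function names and the 0.5 threshold are my own assumptions for illustration; a real corpus builder would need to tune the threshold and strip HTML markup first.

```python
# Hypothetical helper for corpus building: decide whether a piece of text
# is mostly Telugu by counting characters in the Telugu Unicode block.
TELUGU_START = 0x0C00  # first code point of the Telugu block
TELUGU_END = 0x0C7F    # last code point of the Telugu block


def is_telugu_char(ch: str) -> bool:
    """Return True if the character lies in the Telugu Unicode block."""
    return TELUGU_START <= ord(ch) <= TELUGU_END


def telugu_ratio(text: str) -> float:
    """Fraction of script characters in the text that are Telugu.

    Telugu characters (including vowel signs) and alphabetic characters
    of other scripts are counted; spaces, digits, and punctuation are ignored.
    """
    telugu = 0
    total = 0
    for ch in text:
        if is_telugu_char(ch):
            telugu += 1
            total += 1
        elif ch.isalpha():
            total += 1
    return telugu / total if total else 0.0


def is_mostly_telugu(text: str, threshold: float = 0.5) -> bool:
    """Assumed rule of thumb: keep a page if at least half its letters are Telugu."""
    return telugu_ratio(text) >= threshold


if __name__ == "__main__":
    samples = ["తెలుగు", "తెలుగు ok", "This page is in English"]
    for sample in samples:
        print(sample, is_mostly_telugu(sample))
```

A filter like this could be run over crawled pages, or over a Telugu Wikipedia download, to discard the mainly English documents before they are added to the corpus.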