Artificial Neural Network for Sanskrit programming


dhaval patel

Jul 15, 2015, 2:33:41 PM
to sanskrit-p...@googlegroups.com, samskrita, indo...@list.indology.info, bvpar...@googlegroups.com
Respected scholars,
Recently I adapted an artificial neural network code to identify samAsas in the Sanskrit language.
https://github.com/drdhaval2785/SamaasaClassification is the code location.

The results are very encouraging. Without feeding any rules to the computer, the classification results are:

1. Major 5 samAsa types - 70 %
2. Minor 55 samAsa subtypes - 55 %
3. Major 5 samAsa types, counting the top two predictions - 85 %

For comparison, the probability of matching these by pure chance would be: (1) 20 %, (2) 0.2 %, (3) 40 %.

So the machine learning result is statistically significant, even if not yet good enough for practical use.
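The chance baselines for the major-type figures follow from uniform random guessing over the 5 classes; a quick sanity check (my illustration, not part of the repository; the subtype baseline depends on the class distribution, so it is not recomputed here):

```python
# Chance baselines under uniform random guessing over 5 major samAsa types.
major_types = 5
p_top1 = 1 / major_types   # one guess out of 5 classes  -> 20 %
p_top2 = 2 / major_types   # top-two guesses out of 5    -> 40 %
print(f"top-1 chance: {p_top1:.0%}, top-2 chance: {p_top2:.0%}")
```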


The database was scraped from http://sanskrit.uohyd.ernet.in/Corpus/SHMT/Samaas-Tagging/ and randomly shuffled to homogenize the dataset.

The tool was developed initially for samAsa classification, but it has now been generalized to any string classification problem.
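For a string classifier of this kind, the input word must first be turned into a fixed-length numeric vector that a feed-forward network can consume. A minimal sketch of one common scheme, one-hot encoding of characters with padding (my illustration; the encoding actually used in the repository may differ):

```python
# Sketch: encode a string as a fixed-length one-hot vector of characters,
# padded/truncated to max_len, suitable as input to a feed-forward network.
# (Hypothetical encoding for illustration; not taken from the repository.)
def encode(word, alphabet, max_len=20):
    vec = [0.0] * (max_len * len(alphabet))
    for i, ch in enumerate(word[:max_len]):
        if ch in alphabet:
            # set the one-hot slot for character ch at position i
            vec[i * len(alphabet) + alphabet.index(ch)] = 1.0
    return vec

alphabet = list("abcdefghijklmnopqrstuvwxyz")
v = encode("rAjapuruSa".lower(), alphabet)
print(len(v))  # 20 positions * 26 letters = 520 input features
```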

I hope scholars will find the tool useful.

For those interested in Artificial neural networks, the link is http://neuralnetworksanddeeplearning.com/



--
Dr. Dhaval Patel, I.A.S
Collector and District Magistrate, Anand

Mārcis Gasūns

Jul 15, 2015, 4:07:27 PM
to bvpar...@googlegroups.com, sanskrit-p...@googlegroups.com, sams...@googlegroups.com, indo...@list.indology.info
http://sanskrit.uohyd.ernet.in/Corpus/SHMT/Samaas-Tagging/UOHYD/gita-samaas.txt is not in Unicode, but I suppose that is a minor issue. I was unaware of it.
I never knew it had any real practical use in Sanskrit NLP.

dhaval patel

Jul 16, 2015, 2:49:33 AM
to samskrita, bvparishat, sanskrit-p...@googlegroups.com, indo...@list.indology.info
Respected Prof. Kulkarni,

Here are the statistics on the data used.

1. All two-member compounds were scraped from http://sanskrit.uohyd.ernet.in/Corpus/SHMT/Samaas-Tagging/.
2. A total of 19378 such compounds were extracted.
3. The set was randomly shuffled for homogenization, because the data comes from prose, poetry, and different genres of literature.
4. 80 percent of this data was used for training.
5. 20 percent of this data was used for evaluation. (The training and evaluation sets are disjoint.)
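The shuffle-and-split procedure in steps 3-5 can be sketched as follows (my illustration; the record format and seed are hypothetical, not from the repository):

```python
# Sketch of steps 3-5: shuffle the tagged compounds, then split 80/20
# into disjoint training and evaluation sets.
import random

def train_eval_split(records, train_frac=0.8, seed=0):
    rng = random.Random(seed)      # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)          # homogenize prose/poetry/genre ordering
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]   # disjoint train / eval

# With the 19378 compounds mentioned above:
train, evaluation = train_eval_split(list(range(19378)))
print(len(train), len(evaluation))  # 15502 3876
```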

Classification accuracy on the training data is about 1-2 % higher than on the evaluation data.

