Artificial Neural Network for Sanskrit programming


dhaval patel

Jul 15, 2015, 2:33:41 PM
to sanskrit-p...@googlegroups.com, samskrita, indo...@list.indology.info, bvpar...@googlegroups.com
Respected scholars,
Recently I adapted an artificial neural network code to identify samAsas in the Sanskrit language.
https://github.com/drdhaval2785/SamaasaClassification is the code location.

The results are very encouraging. Without feeding any rules to the computer, the classification results are:

1. Major 5 samAsa types - 70 %
2. Minor 55 samAsa subtypes - 55 %
3. Major 5 samAsa types, counting the top two predictions - 85 %

For comparison, the probability of matching these by pure chance would be: (1) 20 %, (2) 0.2 %, (3) 40 %.

So the machine learning result is statistically significant, even if not yet good enough for practical use.
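The chance baselines for the major-type figures follow from uniform random guessing over the 5 classes; a quick sanity check (my illustration, not part of the repository; the subtype baseline depends on the class distribution, so it is not recomputed here):

```python
# Chance baselines under uniform random guessing over 5 major samAsa types.
major_types = 5
p_top1 = 1 / major_types   # one guess out of 5 classes  -> 20 %
p_top2 = 2 / major_types   # top-two guesses out of 5    -> 40 %
print(f"top-1 chance: {p_top1:.0%}, top-2 chance: {p_top2:.0%}")
```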


The database was scraped from http://sanskrit.uohyd.ernet.in/Corpus/SHMT/Samaas-Tagging/ and randomly shuffled to homogenize the dataset.

The tool was developed initially for samAsa classification, but it has now been generalized to any string classification problem.
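For a string classifier of this kind, the input word must first be turned into a fixed-length numeric vector that a feed-forward network can consume. A minimal sketch of one common scheme, one-hot encoding of characters with padding (my illustration; the encoding actually used in the repository may differ):

```python
# Sketch: encode a string as a fixed-length one-hot vector of characters,
# padded/truncated to max_len, suitable as input to a feed-forward network.
# (Hypothetical encoding for illustration; not taken from the repository.)
def encode(word, alphabet, max_len=20):
    vec = [0.0] * (max_len * len(alphabet))
    for i, ch in enumerate(word[:max_len]):
        if ch in alphabet:
            # set the one-hot slot for character ch at position i
            vec[i * len(alphabet) + alphabet.index(ch)] = 1.0
    return vec

alphabet = list("abcdefghijklmnopqrstuvwxyz")
v = encode("rAjapuruSa".lower(), alphabet)
print(len(v))  # 20 positions * 26 letters = 520 input features
```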

I hope scholars will find the tool useful.

For those interested in Artificial neural networks, the link is http://neuralnetworksanddeeplearning.com/



--
Dr. Dhaval Patel, I.A.S
Collector and District Magistrate, Anand

Mārcis Gasūns

Jul 15, 2015, 4:07:27 PM
to bvpar...@googlegroups.com, sanskrit-p...@googlegroups.com, sams...@googlegroups.com, indo...@list.indology.info
http://sanskrit.uohyd.ernet.in/Corpus/SHMT/Samaas-Tagging/UOHYD/gita-samaas.txt is not in Unicode, but I suppose that is a minor issue. I was unaware of it.
I never knew it had any real practical use in Sanskrit NLP.

dhaval patel

Jul 16, 2015, 2:49:33 AM
to samskrita, bvparishat, sanskrit-p...@googlegroups.com, indo...@list.indology.info
Respected Prof. Kulkarni,

Here are the statistics on the data used.

1. All two-member compounds were scraped from http://sanskrit.uohyd.ernet.in/Corpus/SHMT/Samaas-Tagging/.
2. A total of 19378 such compounds were extracted.
3. The set was randomly shuffled for homogenization, because the data comes from prose, poetry, and different genres of literature.
4. 80 percent of this data was used for training.
5. 20 percent of this data was used for evaluation. (The training and evaluation sets are disjoint.)
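The shuffle-and-split procedure in steps 3-5 can be sketched as follows (my illustration; the record format and seed are hypothetical, not from the repository):

```python
# Sketch of steps 3-5: shuffle the tagged compounds, then split 80/20
# into disjoint training and evaluation sets.
import random

def train_eval_split(records, train_frac=0.8, seed=0):
    rng = random.Random(seed)      # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)          # homogenize prose/poetry/genre ordering
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]   # disjoint train / eval

# With the 19378 compounds mentioned above:
train, evaluation = train_eval_split(list(range(19378)))
print(len(train), len(evaluation))  # 15502 3876
```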

Classification accuracy on the training data is about 1-2 % higher than on the evaluation data.

