Tools for research on semantic Sanskrit NLP


pankajashree

unread,
May 26, 2017, 7:34:01 AM5/26/17
to sanskrit-p...@googlegroups.com
Namaste,

I have always been interested in NLP related to Sanskrit. After much wandering online I found this group, and I am very happy that there is research going on in this field. There is an immense need for open-source tools and corpora for Sanskrit NLP, just as OpenNLP, NLTK, and Stanford NLP exist for English.

Also, I must mention that I am a beginner in this field. 

I have been thinking about a new project: a program that checks the semantic compatibility of words in sentence generation.

I found some academic tools for grammatically analyzing Sanskrit words, but I couldn't find any tool that checks noun-verb or noun-adjective-verb compatibility. And the source code for these academic tools is not available.

For example, in my program the inputs अश्वः डयति or गृहं चलति should be identified as invalid, whereas अश्वः धावति is valid. I have cited only simple sentences here, but complications arise when objects, instruments, etc. are added.

As a first step I started making a list of verbs and nouns and tagging them with semantic categories (inanimate object, living being, animal, movable inanimate object, etc.). The idea is to group nouns and verbs of similar meaning, so that for a particular category of noun only certain verbs can apply.
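As a rough illustration of this category-tagging idea, a compatibility check could look like the following sketch. All words, categories, and pairings here are invented toy data in SLP1 transliteration, not drawn from any real lexicon:

```python
# Sketch of noun/verb semantic-category compatibility checking.
# All categories and word lists here are illustrative toy data (SLP1).

NOUN_CATEGORY = {
    "aSvaH": "animal",      # horse
    "gfham": "inanimate",   # house
}

# For each verb, the set of noun categories it accepts as subject.
VERB_ACCEPTS = {
    "DAvati": {"animal", "human"},   # runs
    "calati": {"animal", "human"},   # moves
    "qayati": {"bird"},              # flies
}

def compatible(noun, verb):
    """Return True if the noun's semantic category is licensed by the verb."""
    cat = NOUN_CATEGORY.get(noun)
    accepted = VERB_ACCEPTS.get(verb)
    if cat is None or accepted is None:
        raise KeyError("word not in lexicon")
    return cat in accepted

print(compatible("aSvaH", "DAvati"))   # horse runs  -> True
print(compatible("gfham", "calati"))   # house moves -> False
print(compatible("aSvaH", "qayati"))   # horse flies -> False
```

The real work, as noted below, is in building and maintaining the category inventory, not in the check itself.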

But the categorization itself is a huge task. I started with nouns, and there are so many categories; "animal" alone has lots of subcategories. I started listing some words, but it's a much more complicated process. So I wanted to know whether tools or datasets are available for English. Can you please tell me if there are English corpora with words tagged with such semantic categories, rather than grammatical categories like N, PN, ADJ, etc.?

And before doing it, I would like to know whether any work has already been done in this regard.

For English I could find tokenizers and POS-tagging tools, but nothing related to semantics or word compatibility. I have looked into NLTK, Stanford NLP, and OpenNLP.

Also, I wonder whether what I am doing is the same as named entity recognition. If yes, how do I do it for Sanskrit?

If adopting the English methods is not a good idea, how should I go about it? Manual tagging is a long process. Is there no other way? 


--
Regards,
Pankajashree R 





dhaval patel

unread,
May 26, 2017, 8:05:54 AM5/26/17
to sanskrit-p...@googlegroups.com
If you want to build automatic or semi-automatic data for आकांक्षा (expectancy), a sentence may be taken as input and the words occurring in it treated as valid pairs. This will produce some absurd pairs, but statistically they will be rare in the literature, so a weighting based on occurrence should weed out abnormal pairs in the long run.
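This co-occurrence-weighting suggestion can be sketched as below. The corpus here is toy data in SLP1; a real system would count pairs over a large literature corpus:

```python
# Sketch of the suggestion above: treat word pairs co-occurring in a
# sentence as (noisy) valid pairs, and let occurrence counts weed out
# abnormal ones.  The corpus is toy data in SLP1.
from collections import Counter
from itertools import combinations

corpus = [
    ["aSvaH", "DAvati"],
    ["aSvaH", "calati"],
    ["aSvaH", "DAvati"],
    ["bAlaH", "DAvati"],
]

pair_counts = Counter()
for sentence in corpus:
    for a, b in combinations(sentence, 2):
        pair_counts[(a, b)] += 1

total = sum(pair_counts.values())

def pair_weight(a, b):
    """Relative frequency of the pair; unseen pairs get weight 0."""
    return pair_counts[(a, b)] / total

print(pair_weight("aSvaH", "DAvati"))  # 0.5 -- frequent pair, high weight
print(pair_weight("gfham", "DAvati"))  # 0.0 -- never attested
```

A threshold on such weights would then separate plausible pairs from absurd ones, at the cost of never truly "understanding" the pair.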

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
May 26, 2017, 9:14:15 AM5/26/17
to sanskrit-programmers
Welcome, Pankajashree!


2017-05-26 4:33 GMT-07:00 pankajashree <pankaj...@gmail.com>:
For example, What I am thinking of in my program is, the input अश्वः डयति or गृहं चलति - should be identified as invalid whereas
अश्वः धावति is valid . I have cited only simple sentences here, but complications arise if objects, instruments, etc are added. 

You should make use of the concept of an "ontology".

For example, in this edition of the Amarakosha one sees the entry "पदार्थ-विभागः : , द्रव्यम्, पृथ्वी, अचलनिर्जीवः, स्थानम्, मानवनिर्मितिः" (category: substance, earth, immovable-inanimate, place, man-made). From that it is clear that a house does not move. About अश्व it says: "पदार्थ-विभागः : , द्रव्यम्, पृथ्वी, चलसजीवः, मनुष्येतरः, जन्तुः, स्तनपायी" (substance, earth, movable-animate, non-human, animal, mammal); from that it is clear that it does move, and on the ground. You should be able to find the same sort of thing for English.
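The ontology-based check suggested here might be sketched like this. The category paths paraphrase the Amarakosha classifications quoted above, transliterated to SLP1; treat everything as illustrative toy data:

```python
# Sketch of the ontology idea: an entry's Amarakosha-style category path
# tells us whether it is movable/animate, and motion verbs require that.
# The two paths below paraphrase the quoted classifications (toy data).

ONTOLOGY = {
    "gfha": ["dravyam", "pfTvI", "acala-nirjIvaH", "sTAnam", "mAnava-nirmitiH"],
    "aSva": ["dravyam", "pfTvI", "cala-sajIvaH", "manuzyetaraH", "jantuH", "stanapAyI"],
}

MOTION_VERBS = {"calati", "DAvati", "gacCati"}

def can_move(stem):
    """A thing can move iff its category path contains a 'cala-' (movable) node."""
    return any(node.startswith("cala") for node in ONTOLOGY[stem])

def plausible(stem, verb):
    if verb in MOTION_VERBS:
        return can_move(stem)
    return True  # no constraint known for other verbs in this sketch

print(plausible("aSva", "DAvati"))   # True:  the horse's path has cala-sajIvaH
print(plausible("gfha", "calati"))   # False: the house's path has acala-nirjIvaH
```

The advantage over a flat category list is that the hierarchy answers many verb constraints (motion, animacy, and so on) from a single path per noun.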



--
--
Vishvas /विश्वासः

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
May 26, 2017, 9:22:24 AM5/26/17
to sanskrit-programmers
There is also this:

[inline image]

It is easily available from http://www.cfilt.iitb.ac.in/wordnet/webswn/downloaderInfo.php.

Pankajashree R

unread,
May 27, 2017, 7:18:21 AM5/27/17
to sanskrit-programmers
Thank you, sir!

My conversational Sanskrit is not fluent; please excuse me.

I think the Amarakosha will be very useful for dividing words into semantic categories.
Does the Amarakosha also have groupings of verbs of similar meaning (similar verb groups), e.g. चलति, गच्छति, सरति?

I saw the link you posted. Excellent and extensive work; verbs, nouns, and adjectives are all covered there. I am looking into how to download the dataset and put it to use.

Pankajashree R

unread,
May 27, 2017, 7:20:37 AM5/27/17
to sanskrit-programmers
That's one way to go about it: the statistical method. I want to apply semantic structure rather than statistical ML methods. The statistical approach might well be faster or easier to implement, but the computer won't 'understand' the sentence; it will just compute the probability of the pair occurrence.

Moreover, if we solve this semantically in Sanskrit, I think the same approach will be applicable to all other languages too.

Your thoughts are welcome in this regard. 

Pankajashree R

unread,
May 27, 2017, 7:23:36 AM5/27/17
to sanskrit-programmers
Sanskrit has an advantage over other languages for implementing this semantic structure because there is a lot of literature, such as the Amarakosha and Dhatuvritti, that describes this in detail.

dhaval patel

unread,
May 27, 2017, 8:26:26 AM5/27/17
to sanskrit-p...@googlegroups.com

 but the computer won't 'understand' the sentence. It will just compute the probability of the pair occurrence. 

Why is it mandatory that the computer understand the sentence? Statistics will also do, and statistics is language-agnostic: it can be applied to all languages without much change. A precise and unambiguous description of the grammar, syntax, and semantics of a language is a tedious job.

I am talking about a bag-of-words approach, similar to that used in word2vec and doc2vec. And I must say they capture the semantics fairly well.
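The claim that co-occurrence statistics capture semantics can be illustrated without any ML library: words that share contexts end up with similar count vectors, and word2vec/doc2vec learn dense, compressed versions of this same signal. A toy sketch (all data invented, SLP1):

```python
# Toy illustration of distributional semantics: build context-count vectors
# and compare them with cosine similarity.  Real systems (word2vec/doc2vec)
# learn dense versions of this signal from large corpora.
import math
from collections import defaultdict

corpus = [
    "aSvaH DAvati", "aSvaH calati", "gajaH DAvati", "gajaH calati",
    "gfham tizWati", "prAsAdaH tizWati",
]

# vectors[word][context_word] = co-occurrence count within a sentence
vectors = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for w in words:
        for ctx in words:
            if ctx != w:
                vectors[w][ctx] += 1

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# aSva and gaja share contexts (both move); gfha does not:
print(cosine(vectors["aSvaH"], vectors["gajaH"]))  # ~1.0 (identical contexts)
print(cosine(vectors["aSvaH"], vectors["gfham"]))  # 0.0 (no shared contexts)
```

With enough text, such similarities group "things that run" together without any hand-built ontology, which is exactly the trade-off being debated here.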


And moreover, if we solve this semantic wise in sanskrit, I think the same will be applicable for all languages too. 

Not at all. The way we extract semantics in Sanskrit is based on the particular language, so it is not useful for other languages.

dhaval patel

unread,
May 27, 2017, 8:50:01 AM5/27/17
to sanskrit-p...@googlegroups.com
Despite all that has been said about statistics, conjugation and declension, sandhi, and samAsa parsing will still have to be handled for better understanding. E.g., गच्छामि and गच्छामः will be treated as separate words in a statistical method, whereas after morphological analysis they boil down to the single root गम्.
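This normalization step can be sketched as follows; the lemma table is a toy stand-in for a real morphological analyser:

```python
# Sketch of the point above: without morphological analysis, gacCAmi and
# gacCAmaH are distinct tokens; after lemmatisation both count toward gam.
# The lookup table is a toy stand-in for a real analyser (SLP1).
from collections import Counter

LEMMA = {
    "gacCAmi": "gam",
    "gacCAmaH": "gam",
    "gacCati": "gam",
    "DAvati": "DAv",
}

tokens = ["gacCAmi", "gacCAmaH", "gacCati", "DAvati"]

surface_counts = Counter(tokens)
lemma_counts = Counter(LEMMA.get(t, t) for t in tokens)

print(surface_counts["gacCAmi"])  # 1 -- each surface form counted separately
print(lemma_counts["gam"])        # 3 -- forms pooled under one root
```

Pooling counts by root makes the statistics far less sparse, which matters for a highly inflected language.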

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
May 27, 2017, 7:13:22 PM5/27/17
to sanskrit-programmers

2017-05-27 4:18 GMT-07:00 Pankajashree R <pankaj...@gmail.com>:
Does the Amarakosha also have groupings of verbs of similar meaning (similar verb groups), e.g. चलति, गच्छति, सरति?

Not really. Such a classification was made in the Nyāya literature (probably in the Tarkasaṅgraha). Also see the Ākhyātacandrikā (AkhyAtachandrikA.babylon_final).

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
May 28, 2017, 12:27:17 AM5/28/17
to sanskrit-programmers, Sai सायिः साङ्गणकविद्वान् Susarla
Adding Sai mahāśaya.

I recall that a colleague of Sai's was trying to build a simple Sanskrit sentence generation engine. Only those in the know can say which approach she took there.

Pankajashree R

unread,
May 28, 2017, 2:08:43 AM5/28/17
to sanskrit-programmers
The Tarkasaṅgraha has the classification of padārthas along with definitions (lakṣaṇas).

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
May 28, 2017, 10:54:54 AM5/28/17
to sanskrit-programmers

2017-05-27 23:08 GMT-07:00 Pankajashree R <pankaj...@gmail.com>:
The Tarkasaṅgraha has the classification of padārthas along with definitions (lakṣaṇas).

Indeed, what is said there is along the lines of "कर्माणि पञ्च सन्ति-(१) उत्क्षेपणम् (२) अपक्षेपणम् (३) आकुञ्चनम् (४) प्रसारणम् (५) गमनम् चेति ।" (there are five kinds of motion: throwing up, throwing down, contraction, expansion, and going). I imagined it would be similar elsewhere too. https://drive.mindmup.com/map/0B1_QBT-hoqqVUE1MQnJ5MlZxVjQ# may be of interest.

Pankajashree R

unread,
May 29, 2017, 1:54:02 AM5/29/17
to sanskrit-programmers
Where did you get this mindmap? We ourselves created this mindmap at our research institute. :)

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
May 29, 2017, 1:44:57 PM5/29/17
to sanskrit-programmers

2017-05-28 22:54 GMT-07:00 Pankajashree R <pankaj...@gmail.com>:
Where did you get this mindmap? We ourselves created this mindmap at our research institute. :)

Why, it was shown by Sai mahāśaya, your co-worker. :-)

Karthikeyan Madathil

unread,
Jun 5, 2017, 12:26:00 AM6/5/17
to sanskrit-programmers

Has anyone tried a neural network for splitting a pada into upasarga/dhAtu/prAtipadika/pratyaya? Since humans can do this quite easily by looking at a word, training an NN should be feasible.

If we have a corpus of associations between padas and upasarga/dhAtu/prAtipadika/pratyayas - perhaps Dr. Dhaval, your SK program has generated some such corpus - I could try hacking something together.

samAsa/sandhi-vigraha will be hard to begin with, but can be tackled later.

I see this as the lowest layer of a stack of inference which can be built up over time (if the bottom layer works, of course!):

  0) Decompose all padas in a vAkya probabilistically
  1) Use the information in multiple padas to infer the most likely candidates for each pada (based on the candidates the pada decomposition throws up and their joint probability)
  2) Use the information in the decomposition of all padas in a vAkya to infer vAkya semantics. For example, we could back-infer kAraka from sup, vachana/purusha from pratyayas, and temporal information from tiN/kRt, then semantics from them, and so on. Again, probabilistic inference is OK. 

At a higher level, we'll have to do sandhi and samAsa vigraha. Again, a probabilistic decomposition yielding multiple results can be used to drive lower layers, and final selection(s) can be done by comparing likelihoods at various levels.
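Steps 0) and 1) above might be sketched like this: each pada gets scored candidate analyses, and a joint score plus an agreement constraint picks the best combination. All candidates, tags, and probabilities here are invented toy data:

```python
# Sketch of joint candidate selection across padas: each pada has scored
# candidate analyses (invented here), and we pick the combination with the
# highest joint probability subject to a simple agreement constraint.
from itertools import product

# candidate analyses per pada: (analysis tag, probability)
candidates = {
    "vanam": [({"case": "nom"}, 0.5), ({"case": "acc"}, 0.5)],
    "gacCati": [({"needs": "acc"}, 0.9), ({"needs": "nom"}, 0.1)],
}

def consistent(noun_tag, verb_tag):
    """Toy agreement check: the verb's required case must match the noun's."""
    return noun_tag["case"] == verb_tag["needs"]

def best_joint(noun, verb):
    best, best_p = None, -1.0
    for (nt, n_p), (vt, v_p) in product(candidates[noun], candidates[verb]):
        p = n_p * v_p
        if consistent(nt, vt) and p > best_p:
            best, best_p = (nt, vt), p
    return best, best_p

analysis, prob = best_joint("vanam", "gacCati")
print(analysis[0]["case"], prob)  # the accusative reading wins (0.45)
```

The point of the sketch: vanam alone is ambiguous between nominative and accusative, but the verb's candidate scores disambiguate it through the joint probability.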

This is a hybrid ML/semantic approach. Has anyone looked at something similar? Is there any reason why this would not work at all? 

I realize this departs from the state of the art in ML-based translation, which simply feeds language A into an RNN (recurrent neural net) for encoding and feeds the output into a different RNN for decoding into another language, training both on standard corpora.

Regards,
Karthik

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
Jun 5, 2017, 12:37:47 AM6/5/17
to sanskrit-programmers

2017-06-04 21:25 GMT-07:00 Karthikeyan Madathil <kmad...@gmail.com>:
This is a hybrid ML/semantic approach. Has anyone looked at something similar? Is there any reason why this would not work at all? 

I realize this departs from the state-of-the-art in ML based translation, which simply feeds language a into an RNN (recurrent Neural net) for encoding and feeds the output into a different RNN for decoding into another language and trains them using standard corpuses. 

Karthikeyan Madathil

unread,
Jun 5, 2017, 2:07:45 AM6/5/17
to sanskrit-programmers
It definitely is. Let me dig through the references and understand what the state of the art here is.

Pankajashree R

unread,
Jun 7, 2017, 2:24:49 AM6/7/17
to sanskrit-programmers
Can you please post your findings about the state of the art later? :) I went through the thread; it has lots of references to papers. I'm a newbie in NNs.

Karthikeyan Madathil

unread,
Jun 7, 2017, 11:26:32 PM6/7/17
to sanskrit-programmers
Sure, I will do that. I have dug up a few interesting ones, hopefully I'll be able to dig through all that this weekend.

dhaval patel

unread,
Jun 8, 2017, 3:39:58 AM6/8/17
to sanskrit-p...@googlegroups.com

Has anyone tried a neural network for splitting up a pada into upasarga/dhAtu/prAtipadika/pratyaya ? Since humans can do that quite easily by looking at a word, training an NN should be feasible.

I tried classifying samAsas via an NN. There was some training data available on the University of Hyderabad site, but the main constraint was that the data is naturally highly skewed in favour of tatpurusha, so the resulting NN didn't learn much. Breaking a pada into its constituents was also tried. There was an algorithm that allowed splitting in linear time (I need to look at old code to find the reference to this algorithm). The main constraint was that sandhi splitting for long words gave very many options. Michael Bykov also dabbled a bit with this.


If we have a corpus of associations between padas and upasarga/dhAtu/prAtipadika/pratyayas - perhaps Dr. Dhaval, your SK program has generated some such corpus - I could try hacking something together.

upasargas, dhAtus, pratyayas, and verb forms are finite. 
prAtipadikas are virtually infinite because of samAsa. 
A possible decomposition is 
(upasargaCombinations)*(nounCombinations)*(nounForm)
or
(upasargaCombinations)*(verbForm)


This is a hybrid ML/semantic approach. Has anyone looked at something similar? Is there any reason why this would not work at all? 

No reason except paucity of data.

Karthikeyan Madathil

unread,
Jun 8, 2017, 11:07:21 PM6/8/17
to sanskrit-programmers
In my mind, sandhi/samasa splitting are the hardest problems!

>upasargas, dhAtus, pratyayas, verb forms are finite. 
>prAtipadikas are virtually infinite because of samAsa. 
>Possible way is 
>(upasargaCombinations)*(nounCombinations)*(nounForm)
>or
>(upasargaCombinations)*(verbForm)

Indeed! What I was hoping to do was to take a finite basic set of (unsandhied) padas tagged with the relevant upasarga, pratyaya, dhAtu, and prAtipadika, and train an NN to learn the decomposition. How well this would generalize to other prAtipadikas (kRt, taddhita, and samAsa forms) is not evident; I was hoping it would, from the structure of the solution. This would form a building block. (I was hoping your SanskritVerb/Subanta programs would be useful for generating this tagged set.)

The idea was to use this limited network as a building block for higher-layer solutions (including sandhi). The first higher level could then use these tags to extract semantic tags (kAraka, linga, vachana, kAla) and globally evaluate a split to see whether it is semantically consistent (in a limited sense). This would let us do sandhi/samAsa splits while evaluating a vAkya (somewhat like a knapsack problem: dynamically program all possible splits, compare them based on semantics using the tags and the extraction quality from the lower-level NNs, and pick a global optimum).

While I have some NN background in an unrelated field (Signal Processing), I have no knowledge of linguistics, and don't have an intuition of the problems this would run into. 

I'm reading the papers Vishvas pointed to, hopefully that'll get me a bit better grounded. 

Karthikeyan Madathil

unread,
Jun 14, 2017, 12:08:25 PM6/14/17
to sanskrit-programmers
Two major papers with two differing approaches, by the same author

1) "Using Recurrent Neural Networks for joint compound splitting and Sandhi resolution in Sanskrit", Oliver Hellwig

This uses a "shallow" approach, using Recurrent Neural Networks (RNN). A corpus of sentences from SanskritTagger, developed by the author is used as the training/test sets. An RNN is trained on a golden sandhi/samasa split from the corpus (a sequence of input phonemes and a sequence of output phonemes), and learns to generate a "split" sequence of phonemes from an "unsplit" input. No additional tag information seems to be used.

Three goodness measures are used: Precision (the fraction of predicted splits that are correct), Recall (the fraction of true splits that are recovered), and F score (the harmonic mean of Precision and Recall). Sandhis are divided into 5 classes (one class being a null operation), and these measures are computed for the entire dataset after training. The more the data, the higher the accuracy: up to 93.24 (F score, presumably) for the entire corpus.
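For concreteness, here is how the three measures combine on made-up counts (these numbers are not from the paper):

```python
# How the three goodness measures combine, on invented counts: suppose the
# splitter proposed 90 split points, of which 80 were correct, and missed
# 20 real split points.
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)   # fraction of proposed splits that are right
recall = tp / (tp + fn)      # fraction of true splits that were found
f_score = 2 * precision * recall / (precision + recall)

print(round(precision, 4))  # 0.8889
print(round(recall, 4))     # 0.8
print(round(f_score, 4))    # 0.8421
```

The F score penalizes a splitter that optimizes either measure at the expense of the other, which is why it is the headline number.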

Intuitively, this would tend to learn word splits from common usage, splitting commonly found phoneme sequences that are found as a result of sandhi, using contextual information (as encoded by the forward and backward parts of the RNN).  No morphological information is used. 

Strengths - simple! Does not need cumbersome tagging beyond basic split training data. Intuitively, this seems to mimic the first level of splitting that humans do, which is based purely on relatively local context and the set of words encountered in the past.
Weaknesses - the lack of morphological information can lead to errors, like the example in the paper where bhujaagra is split as bhujaa-agra instead of bhuja-agra. Both are lexically correct, but the former is semantically incorrect in the sentence being decoded, despite being a common form in the corpus.


2) "Morphological Disambiguation of Classical Sanskrit", Oliver Hellwig

This one takes a linguistically "deep" approach. It relies on some major inputs
1) A lexical database with lemmata (lexical items), semantic information, and inflected verb forms (ti~Nantas)
2) A corpus of Sanskrit texts, tagged with lexicographic, morphological and word semantic "gold" annotations
3) Linguistic models for sandhi split, declension (sup addition), verb conjugation (ti~N addition) 
4) tag sets for indeclinables (avyaya), nouns and verbal forms
5) A linguistic processor that uses 1-4 to analyze a sentence

Each string is scanned from left to right, with possible sandhi splits tried at each phonemic position. If the left part after a split is a valid lexical form, it is added to a hypothesis "lattice" and the right part is recursively split. The Viterbi algorithm (which I've seen used in hard disk drives and wireless communication!) is used to traverse the lattice and pick the best split. Goodness measures for each split are derived from bigram probabilities estimated from the annotated corpus. An accuracy of 94% is claimed on random sets of 10,000 sentences from the corpus (the rest of which was used to train the algorithm).

Once a lexical split is chosen, morphological disambiguation (say, distinguishing between the prathamA and dvitIyA readings of vanaM) is done by a trained machine-learning model.

Advantages: uses more morphological information, and hence can get better splits (at least intuitively). Seems closer to the way humans split sentences, using all available lexical and morphological information about potential splits.
Disadvantages: cumbersome; requires more tagging.

I'm not convinced that Viterbi is a good solution for this problem. I may need to dig deeper, though, because clearly these folks have thought about it longer and deeper than I have. As far as I know, Viterbi works where problems can be decomposed into prefix-suffix splits: for example, the optimal decoding D(S) of a string S split as S1_S2 can (in many applications) be proven to be D(S) = D(S1)_D(S2). That is not true in this case! I would have thought a dynamic-programming (knapsack-style) approach would be required.

I need to understand n-gram methods better. 

Approach 2) has similarities with the idea I've floated here, but has major differences which require more thought. I will write this up in more detail for discussion. 


Karthikeyan Madathil

unread,
Jun 15, 2017, 2:48:21 AM6/15/17
to sanskrit-programmers
I'm putting my comments on these and related papers in an online doc that can be viewed and commented on. Please feel free to leave comments there. This will avoid too many emails every time I look at one more related paper. :-)

Shreevatsa R

unread,
Jun 15, 2017, 9:55:31 AM6/15/17
to sanskrit-programmers
Thanks, that's helpful!

As I said in the other message (https://groups.google.com/d/msg/sanskrit-programmers/Ms3Fdv-axMw/3-0-5jEdDQAJ) (or maybe in the notes.txt attached there, I don't remember now), IMO if we had the data/annotations from SanskritTagger available, we could experiment more. But it's not available AFAIK, so we may have to generate the equivalent first, or else explore different approaches that need a smaller or publicly available training set.

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-programmers+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Karthikeyan Madathil

unread,
Jun 15, 2017, 10:50:23 AM6/15/17
to sanskrit-programmers
Indeed, if we have the tagged corpus, it's doable to replicate their approach, or try other ones.

How would we generate a training set? One possibility is to bootstrap like they seem to have done (in the SanskritTagger paper): implement a split/lexeme/morpho tagger, run it on a corpus (we have enough of a corpus, just not a tagged one), and manually correct the output into a "gold" tagging. This can then be used for further refinements (as in the Morphological Disambiguation paper).

Any other ideas?

Avinash L Varna

unread,
Jun 15, 2017, 12:56:49 PM6/15/17
to sanskrit-programmers
नमांसि,

Some quick ideas for datasets:

The e-reader section of the UoHyd SCL (http://sanskrit.uohyd.ac.in/scl/) website might be a good starting point, as it provides detailed information for 1000+ shlokas from संक्षेपरामायणम्, श्रीमद्भगवद्गीता, शिशुपालवधम् complete with morphological analysis. We could check with Dr. Amba Kulkarni as to the license under which this data is released, and whether it could be used for research purposes. Other tagged corpora available for research purposes are also mentioned on the website, and might be interesting for what you have in mind. In general, it might be good to go through the extensive research and publications from her group to see if they have already tried some of the ideas mentioned in this discussion.

There are other websites that provide relevant data that could be built upon. E.g. valmikiramayan.net appears to have the anvaya for all shlokas in the ramayana. This could be leveraged to build a tagged dataset (would require some effort of course).

भवदीयः
अविनाशः


Karthikeyan Madathil

unread,
Jun 16, 2017, 5:40:13 AM6/16/17
to sanskrit-programmers
That's an interesting thought. Do you happen to know whether that group is generally amenable to working with amateurs? I will look through their publications.

1) The base dataset we need is a set of tagged verb, noun, and indeclinable forms: at a minimum, a basic set of tagged (basic) prAtipadikas, subantas, and ti~Nantas. For better results, we'd need a database of tagged kRt and taddhita forms. 

2) Based on 1, we could train a neural net to decompose a form into an upasarga, dhatu/prAtipadika and pratyaya(s). We can then use words from textual data to improve this, at the same time enriching our tagged database.

3) Based on 2), we could start working on a text tagger that takes text, performs speculative sandhi splits, and picks a good set of splits based on the tags our neural net outputs for each split. "Better" choices can be selected using a set of constraints on the tag set.

I suppose we'll be able to attract more contributions, interest, and even collaboration once we've demonstrated even a limited feature-set.

Shreevatsa R

unread,
Jun 16, 2017, 11:00:11 AM6/16/17
to sanskrit-programmers
I am actually independently interested in learning from Dr. Amba Kulkarni what license the e-readers are available under. It seems there is an opportunity to take that data and experiment with different ways of presenting it (no computational linguistics involved; just tweaking the visual appearance with HTML / CSS / JavaScript), towards developing better or different e-readers for Sanskrit.

Does anyone here know Dr. Kulkarni well and can ask nicely? :-)

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
Jun 16, 2017, 11:51:01 AM6/16/17
to sanskrit-programmers
My suggestion is to write to her directly - ambap...@gmail.com . She's been quite responsive.

For whatever reason, following the typical pattern, she has shown no interest in joining this list or keeping her code repository on GitHub (part of it, I guess, is a bad cost-benefit analysis).

Another option to get data is to scrape kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php . There's been no real movement towards releasing the data.

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
Jun 16, 2017, 8:39:55 PM6/16/17
to sanskrit-programmers

2017-06-16 8:50 GMT-07:00 विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com>:

Another option to get data is to scrape kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php . There's been no real movement towards releasing the data.

This is underway. Once it is fully done, I'll announce it in a new thread.

The database (which will also be available as a file) can be accessed as below:
  • Get analysis for a sentence: api, with this output.
So you're unblocked - go for it!

Karthikeyan Madathil

unread,
Jun 20, 2017, 4:15:54 AM6/20/17
to sanskrit-programmers
I've added comments on three more papers: two from 2010-11 using a maximum a posteriori Bayesian approach, and a 2016 paper from IIT-KGP using a graph-based approach, all for sandhi splitting.

Karthikeyan Madathil

unread,
Jun 20, 2017, 4:28:14 AM6/20/17
to sanskrit-programmers
www.sanskritlibrary.org has texts and splits under a Creative Commons license. Not sure why the data isn't downloadable; will try asking. 



dhaval patel

unread,
Jun 20, 2017, 7:45:33 AM6/20/17
to sanskrit-p...@googlegroups.com
https://stackoverflow.com/questions/8870261/how-to-split-text-without-spaces-into-list-of-words 
This has an algorithm for splitting spaceless text into words in a small, manageable time frame; it may prove useful for sandhi splits.

The dictionary may be kept small initially. Once we get some encouraging results, we can expand it further.
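The linked answer's technique is a dynamic-programming segmentation: scan the string left to right, and at each position keep the cheapest split of the prefix, where a word's cost is the negative log of its relative frequency. A sketch adapted to SLP1 strings follows; the lexicon is toy data, and note this handles only plain concatenation, not actual sandhi changes:

```python
# Dynamic-programming word segmentation, in the spirit of the linked
# Stack Overflow answer.  Toy SLP1 lexicon; cost(word) = -log(rel. freq.).
import math

WORD_FREQ = {"asti": 50, "uttarasyAm": 5, "diSi": 20, "uttas": 1, "asyAm": 3}
TOTAL = sum(WORD_FREQ.values())
MAXLEN = max(len(w) for w in WORD_FREQ)

def cost(word):
    freq = WORD_FREQ.get(word)
    return -math.log(freq / TOTAL) if freq else float("inf")

def segment(text):
    # best[i] = (total cost, split) for the prefix text[:i]
    best = [(0.0, [])] + [(float("inf"), None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - MAXLEN), i):
            c = best[j][0] + cost(text[j:i])
            if c < best[i][0]:
                best[i] = (c, best[j][1] + [text[j:i]])
    return best[-1][1]

print(segment("astiuttarasyAmdiSi"))  # ['asti', 'uttarasyAm', 'diSi']
```

Because only the last MAXLEN positions are examined at each step, the running time is roughly linear in the text length, which is what makes the approach attractive for long compounds.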

Karthikeyan Madathil

unread,
Jun 27, 2017, 1:50:06 AM6/27/17
to sanskrit-programmers
Updated with comments on Vinyals, Kaiser et al. 2014 ("Grammar as a Foreign Language"). Not directly Sanskrit-related, but I believe this is close to the state of the art in parsing.

I will now write up a summary of the various ideas I wish to pursue and forward here. 

Shreevatsa R

unread,
Jun 27, 2017, 2:02:30 PM6/27/17
to sanskrit-programmers
Thank you! This is very helpful. 

Karthikeyan Madathil

unread,
Jun 28, 2017, 6:43:16 AM6/28/17
to sanskrit-programmers

Karthikeyan Madathil

unread,
Jun 30, 2017, 2:56:48 AM6/30/17
to sanskrit-programmers

https://github.com/kmadathil/sanskrit_parser

Begun with a simple Maheshvara Sutra utility

Feel free to fork or join. 
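As an illustration of what a Maheshvara Sutra utility computes, here is a minimal sketch (independent of the repo's actual code) that expands a pratyAhAra such as "ik" or "ac" into its phoneme list. The sutras are given in SLP1, with the last sound of each sutra as its it-marker; duplicated markers and sounds are handled by a simple first-occurrence rule, which suffices for common pratyAhAras:

```python
# Sketch of pratyAhAra expansion over the 14 Maheshvara Sutras (SLP1).
# Each entry is (sounds, it-marker); e.g. ("aiu", "R") is a-i-u-N[it].
SUTRAS = [
    ("aiu", "R"), ("fx", "k"), ("eo", "N"), ("EO", "c"),
    ("hyvr", "w"), ("l", "R"), ("YmNRn", "m"), ("JB", "Y"),
    ("GQD", "z"), ("jbgqd", "S"), ("KPCWTcwt", "v"), ("kp", "y"),
    ("Szs", "r"), ("h", "l"),
]

def pratyahara(p):
    """Expand a pratyAhAra: sounds from the first occurrence of its start
    letter up to the first subsequent sutra ending in its marker letter."""
    start, marker = p[:-1], p[-1]
    sounds, collecting = [], False
    for letters, it in SUTRAS:
        for ch in letters:
            if not collecting and ch == start:
                collecting = True
            if collecting:
                sounds.append(ch)
        if collecting and it == marker:
            return sounds
    raise ValueError("invalid pratyAhAra: " + p)

print(pratyahara("ik"))  # ['i', 'u', 'f', 'x'] -- the simple vowels i u f x
print(pratyahara("ac"))  # all nine vowels: a i u f x e o E O
```

Utilities like this become the alphabet layer on which sandhi rules (which Panini states in terms of pratyAhAras) can be mechanized.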

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
Jun 30, 2017, 11:43:43 AM6/30/17
to sanskrit-programmers

2017-06-29 23:56 GMT-07:00 Karthikeyan Madathil <kmad...@gmail.com>:

https://github.com/kmadathil/sanskrit_parser

Begun with a simple Maheshvara Sutra utility

Feel free to fork or join. 

Something to keep in mind: one thing we learned from past experience is that it is best to separate out self-contained modules and publish them on pip (which is very simple; you've got the indic transliteration module as an example). This encourages reuse like nothing else.

Karthikeyan Madathil

unread,
Jul 29, 2017, 1:39:15 AM7/29/17
to sanskrit-programmers
Brief update on this project: we have made decent progress in a month. Details can be found at the link.

In summary, we have managed to use Dr. Dhaval Patel's excellent inriaxmlwrapper to bootstrap our L1 (form identification). We have a working L0 (sandhi) and L2 (finding all legitimate splits in a sentence). Our L2 can split fairly decent-sized sentences in reasonable time. 

For example:

(master)*$ python SanskritLexicalAnalyzer.py astyuttarasyAMdishidevatAtmA --split
Parsing of XMLs started at 2017-07-29 11:07:22.265422
666994 forms cached for quick search
Parsing of XMLs completed at 2017-07-29 11:07:27.737687
Input String: astyuttarasyAMdishidevatAtmA
Input String in SLP1: astyuttarasyAMdiSidevatAtmA
Start Split: 2017-07-29 11:07:35.879303
End DAG generation: 2017-07-29 11:07:35.913304
End pathfinding: 2017-07-29 11:07:35.921328
Splits:
[u'asti', u'uttarasyAm', u'diSi', u'devatA', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devat', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devata', u'AtmA']
[u'asti', u'uttas', u'asyAm', u'diSi', u'devatA', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'devatA', u'at', u'mA']
[u'asti', u'uttarasyAm', u'diSi', u'de', u'vatAt', u'mA']
[u'asti', u'uttarasyAm', u'diSi', u'devata', u'at', u'mA']
[u'asti', u'uttara', u'syAm', u'diSi', u'devatA', u'AtmA']
[u'asti', u'uttas', u'asyAm', u'diSi', u'devat', u'AtmA']
[u'asti', u'uttarasyAm', u'diSi', u'de', u'avat', u'AtmA']


We've begun to think about L3, which adds morphological constraints.

Basic test infrastructure is in place (based on the UOHyd corpus). 

Thanks to Avinash Varna for joining the project and doing a great job on the Sandhi module. Thanks also to Dr. Dhaval Patel for his insightful suggestions.

If you'd like to join this project, please feel free to fork, play around, and drop me a note if you'd like to contribute! We could use more coders, or even folks who can look through test failures and help us triage and file bugs for them. 

विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
Jul 29, 2017, 12:44:20 PM7/29/17
to sanskrit-programmers
Thanks for the update! Well done!

Where has inriaxmlwrapper been published? http://vedavaapi.org:9090/assets/lib/swagger-ui/index.html?url=%2Fswagger.json is also an option.

I will post further suggestions at https://github.com/kmadathil/sanskrit_parser/issues. 

If you would like to create or publish a REST API front-end for your tool, that can be done on the vedavaapi.org machine. Let me know as needed.



Avinash L Varna

unread,
Jul 29, 2017, 3:36:53 PM7/29/17
to sanskrit-programmers
We are grateful for everyone's encouragement here. With your help we will take on the next tasks as well. Please do send feedback to help improve this software.

inriaxmlwrapper was published by Dr. Dhaval here: https://github.com/drdhaval2785/inriaxmlwrapper

Before building a REST API or publishing on pip, the software probably needs refinement. Right now many alternative splits are suggested, many of which would be meaningless in the context of the sentence. We are thinking about ways to prune them: https://github.com/kmadathil/sanskrit_parser/issues/28




विश्वासो वासुकिजः (Vishvas Vasuki)

unread,
Jul 29, 2017, 9:08:38 PM7/29/17
to sanskrit-programmers
2017-07-29 12:36 GMT-07:00 Avinash L Varna <avinas...@gmail.com>:
inriaxmlwrapper was published by Dr. Dhaval here: https://github.com/drdhaval2785/inriaxmlwrapper

I am obliged. Ah, had this too been on pip, I would have known about it! It's simple, and then anyone could easily install the tool and use it, as in "s = sanskritmark.analyser(ot,split=False)". Anyway, I will follow up in the issues area.


Before building a REST API or publishing on pip, the software probably needs refinement. Right now many alternative splits are suggested, many of which would be meaningless in the context of the sentence. We are thinking about ways to prune them: https://github.com/kmadathil/sanskrit_parser/issues/28

Regarding REST what you say is right, but not regarding pip publication. Even work in progress, even incomplete work, can serve others' use and testing, so continuous pip publication is appropriate. There is versioning ("versioning") for that; or even without it, direct installation like sudo pip2 install git+https://github.com/sanskrit-coders/sanskrit_data@master -U is possible. Only the module layout needs to be made clear.

Karthikeyan Madathil

unread,
Jul 30, 2017, 5:52:09 AM7/30/17
to sanskrit-programmers


Many thanks! We will arrange the module layout as you described and try to publish it with pip.





Karthikeyan Madathil

unread,
Aug 1, 2017, 5:10:34 AM8/1/17
to sanskrit-programmers

This software can now be installed on your machine with `pip install sanskrit_parser`. Please try it out and send comments. If you're interested in collaborating on this project, please drop us a note in the group or by email.

https://github.com/kmadathil/sanskrit_parser