Sandhi splitting tool


Prashant Tiwari

Apr 13, 2016, 9:44:26 AM
to sanskrit-p...@googlegroups.com
Namaste,

I have been a member on this group for quite some time but this is my first post.

Does anybody know of an open source implementation of a sandhi-splitting tool? I'm aware of a few online services (available here, here and here) with varying levels of accuracy, but would like to use either an API-based service or a self-deployable engine for a project I'm working on.

Thanks,
Prashant

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 13, 2016, 11:07:50 AM
to sanskrit-programmers, Amba Kulkarni
+ Prof. Amba, who once upon a time provided me with code for the below.

https://github.com/sanskrit-coders/uohyd/tree/master/sandhi only covers creation of sandhi-s, not splitting.

Respected Amba, have all of your institution's software components been open-sourced and placed on github? We are waiting eagerly for that. If any help is needed there, please let us know.


--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Vishvas /विश्वासः

Martin मरुतिन्

Apr 14, 2016, 2:35:25 AM
to sanskrit-programmers
Dear Prashant,

You will find this function in the amazing work of Gérard Huet:


The entire project is available to install locally.

Splitting sandhi is complex, as there can be any number of possible splits, so some underlying lexical intelligence needs to be there to make a tool useful. Theoretically, with some AI and an understanding of the context of the phrase, you could reduce the options.

Kindest Wishes,

Martin

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 14, 2016, 11:38:24 AM
to sanskrit-programmers, Gérard Huet
+ Prof. Gérard Huet

Respected Gérard,

I request that you publish your software on github (there are many examples there: https://github.com/drdhaval2785 github.com/sanskrit-coders/ https://github.com/vedicsociety/ ).

[Could you publish your software on github (a popular open source collaborative coding platform), or can we publish a copy we get from you there? That way we can examine and discuss the source code and use portions of it within our tools more easily. In many ways it would complement or surpass the "write an email for code" procedure described in http://sanskrit.inria.fr/manual.html#installation . ]



विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 16, 2016, 6:56:13 PM
to Gérard Huet, sanskrit-programmers
sanskrit-p...@googlegroups.com again (Please don't be upset, stay calm; I will explain this below. If you must, skip down and see my explanation.)

Dear Prof. Gerard!

Thank you for many things in the letter below, especially:
  • publicly available resources your efforts have linked on our unrepresentative, incomplete, badly out-of-date website.
  • explaining your position about publishing your code on github.
  • your evaluation, based on what you see in our old webpages, of our progress (or lack of it) so far (which I think is fair, given that we are part-timers).
  • your evaluation of our current abilities
  • your expectations about proper etiquette regarding contacting you.
It is always good to see an explicit note confirming what one suspects in the corner of one's mind (in this case, your evaluation of where we stand) without taking enough time to think about it deeply. I always vaguely thought academics consider themselves too good for the likes of us; this confirms it.

Now to explain my cc-ing the sanskrit-programmers mailing list in the previous and current mail:
  • I did not seek to "intimidate" you by cc-ing the mailing list. The question and proposal I put forth are of common interest - and it is desirable to avoid duplication. You have yourself indicated you have spoken to shrI dhaval paTel about a similar topic, and we could have avoided the inquiry if we had known where you stand.
    • As an aside, it is fairly clear to me that you don't care much for how we "nobody"-s perceive your approach to developing sanskrit software.
  • You said: "​Don't you have any notion of common civility ? Couldn't you have first discussed the matter with me privately, instead of sneaking in a semi-public cc for putting pressure on me ? Is this a proper way to introduce yourself to me, I have no recollection we had earlier contact ?" -
    • To this I only say, sir, with greatest respect, that your notions of common civility are different from mine and most other people I've interacted with in the past. Believe it or not, you've been the first to have objected so vociferously. Had I known your views, I would have done things differently. But now that you're angry with me anyway, I see no harm in cc-ing others so that they can be more circumspect with you.
  • You also say: "​Do you only have any sense of professional ethics ? What would you say if I wrote to the legal services of your employer, forwarding your email, emitted from an IP number recorded under Google's juridiction, to inquire whether it is recommended practice for Google engineers to bully independent developers into using the GitHub platform, so that "we can examine and discuss the source code and use portions of it within our tools more easily" ? With a cc to Inria's intellectual property cell, to boot."
    • As to this threat, I truly feel that it's inconsequential. Just as my mental model of you was wrong, your mental model of me is completely wrong. Please go ahead and complain to Google and Inria's intellectual property cell and whatever if you feel the need to. I don't live in fear of my job and (to reluctantly return your tone) be bullied into adapting your customs and manners. In other words, I am not the thrall you seem to imagine me to be.
All said and done, I close with a few clarifications:
  • I appreciate the work you've done, and your great experience, competence and knowledge. I am sure it serves some of the intended consumers of your work well (though not others such as us non-Indologists, which explains our existence and work so far). Furthermore, I bear no ill-will whatsoever towards cranky old academics.
  • We are a diverse set of people, and "one does not know basic computer science algorithms such as sorting or has only a romantic idea of Sanskrit grammar. PHP veneer infatuation" certainly does not apply to all of us.
    • People such as you are certainly very welcome to join us and enrich us (certainly, in a way markedly contrasting with us joining that Indology list). You can think of this as connecting to another user-base.
  • As meager as our accomplishments are, I am definitely hopeful that this effort will progress far. The dictionaries I use on a daily basis and the metre recognizer I use slightly less frequently were developed not by full-time academics but by part-time volunteers.
--
Wishing you well, 
Vishvas

2016-04-16 12:58 GMT-07:00 Gérard Huet <Gerar...@inria.fr>:
Oh, Vishvas, really, you have discovered that I existed ! I am deeply honored !

The short answer to "Could you publish your software on github" is NO, as I explained to Mr Dhaval Patel at the last WSC in Bangkok, after I handed him a key with my system,
ready to install, with all sources:  "I am resolutely keeping out of hectic activities such as social networking, skype, and cloud development systems.
In particular, I have no intention to develop my software on the GitHub platform, and in any case would need the agreement of my partners in this endeavor. 
I am of course interested in receiving comments and corrections from users and colleagues, but at a pace that is consistent with my feeble resources."

Let me add that I do not believe in crowd sourcing of software or linguistic data by non-professionals, however high their goodwill may be.
Professional software demands careful design, competence in the application area, requirements analysis, tool coordination, coding discipline, testing, documenting, packaging, etc etc. 
"Love, deep sentiments, emotional attachment" are just not enough if one does not know basic computer science algorithms such as sorting or has only a romantic idea of Sanskrit grammar. PHP veneer infatuation is just not sufficient. 
So pushing in half-baked code snippets written in random programming languages for surface processing of Unicode devanagari is simply a loss of time. 
I had a look at  https://sites.google.com/site/sanskritcode/home/plans and I am just not much impressed by what has been jointly accomplished in 5 years' time. 
Lots of ranting about "non-community based, closed source software and restricted data", that's for sure, but that's not a substitute for hard work, is it?
Survey of software is indeed available. It appears however to be just a collection of stale links. Item "sandhi analysis", besides a broken link, points to an obsolete
mirror site of my system copied at UoH 5 years ago. Is my original site so hard to locate with Google ? Where is the crowd supposed to source this information page ?

As the promoter of the Ocaml and Coq efforts, I do not have lessons about free software to receive from anyone, especially from professionals from private companies that make profit with proprietary software.
I released my computational linguistics toolkit Zen in 2002, complete with all sources under LGPL and a full literate programming documentation as tutorial
I had a very hard time with the Xerox company, that considered this area as their monopoly, and they tried hard, through thinly disguised referees on their payroll, to prevent me from publishing its concepts, let alone its source code.
At the same time I had to intimidate Wikipedia administrators into better control of their servers, in order to eradicate stolen copies of my lexicon hypertext data, repainted under a different CSS style sheet, and with careful removal of my author and copyright annotations, well hidden in some deep well in a Prague Wikipedia server by some copyleft crackpot. By chance, Google was peeking down the well, and I could spot it :-)

I first released my Sanskrit Engine set of Web services in August 2003, and advertised it publicly in the Indology list on September 8th, 2004.
I released my morphology XML data banks under a free license at the same time, and many sites worldwide could start work on Sanskrit processing using these resources, as witnessed by the Heritage goldbook.
I organized a scientific meeting in Paris on the topic in 2007, to which I invited many scholars and pandits to join a collaborative effort. It has had 5 occurrences worldwide since.
The joint software effort, with my partners at University of Hyderabad and IIT Kharagpur, representing 40000 lines of code in 120 modules of dense functional programming specifications, is now distributed on demand as a stand-alone set of Web services, under LGPL license. Its reader is available as a sub-service for analyzing the texts of the Sanskrit Library, and as a segmentation plug-in in Amba Kulkarni's Sanskrit parser. 
A full literate programming documentation of 763 pages is available for whoever wishes to learn and join this effort. 
This is apparently not enough for free software ayatollahs, so now wannabe experts want to impose us crowd sourcing development processes, which just do not make any sense in our context.  

How dare you send a letter to me, with a public cc to a newsgroup I do not subscribe to, as an attempt to bully me into obedience to your schemes ?
Don't you have any notion of common civility ? Couldn't you have first discussed the matter with me privately, instead of sneaking in a semi-public cc for putting pressure on me ? 
Is this a proper way to introduce yourself to me, I have no recollection we had earlier contact ?

Do you only have any sense of professional ethics ? What would you say if I wrote to the legal services of your employer, forwarding your email, emitted from an IP number recorded under Google's juridiction, to inquire whether it is recommended practice for Google engineers to bully independent developers into using the GitHub platform, so that "we can examine and discuss the source code and use portions of it within our tools more easily" ? With a cc to Inria's intellectual property cell, to boot. 

Sometimes it is useful to think before shooting one's mouth off.
Please, Vishvas, come back to your senses.

GH

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 16, 2016, 6:59:27 PM
to Gérard Huet, sanskrit-programmers

2016-04-16 15:55 GMT-07:00 विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com>:
  • publicly available resources your efforts have linked on our unrepresentative, incomplete, badly out-of-date website.
Correction: publicly available resources showing the fruits of your efforts produced so far.

Pradyumna Achar

Apr 17, 2016, 3:04:44 AM
to sanskrit-p...@googlegroups.com
Dear friends,

Does anybody know of an open source implementation of a sandhi-splitting tool?

It appears that there isn't any. How about brainstorming a design for this sandhi-splitting tool? Implementation could then follow.
Maybe an editable, versioned shared-document or a wiki page could be the starting point. We could write down and elaborate the problem statement with its inputs and constraints there, and then design and discuss algorithms for sandhi-splitting and pick one that can be implemented.



--

Prashant Tiwari

Apr 17, 2016, 3:12:24 AM
to sanskrit-p...@googlegroups.com
Dear all,

Thanks for your replies on this subject. It appears to me too that such a tool either doesn't exist or isn't readily available, and I had arrived at the same conclusion: we should build it as a community effort. I'm ready to contribute with programming, but it's still very early days for my knowledge of grammar, so I will defer to those here more knowledgeable than me for the grammar bits and also consult some scholars I know personally. Let's do this!

Regards,
Prashant

dhaval patel

Apr 17, 2016, 4:04:22 AM
to sanskrit-p...@googlegroups.com
I have some code for sandhi splitting. It needs testing. It works rapidly on small words, say 10-odd letters. Will share a link ASAP.
Dr. Dhaval Patel, I.A.S
Collector and District Magistrate, Anand

Pradyumna Achar

Apr 17, 2016, 5:21:54 AM
to sanskrit-programmers
A point: if we write it in JavaScript, it'll be trivial to deploy and test, as we won't need anything more than a simple service like github.io to make it available to users (as opposed to Java or PHP, which would need a runtime and a container).

Here's a quick-and-dirty demo page on github.io  --> http://kpachar.github.io/sandhisplit/index.html

The neat thing is that the GitHub Pages HTTP server reads directly off the git master branch of the associated repository and publishes automatically on push.
Here's the associated repository --> https://github.com/kpachar/kpachar.github.io

As far as embeddability into a Java app goes (i.e., if someone wanted to use it as a component in another Java app), JSR 223/Rhino will enable that.
(There might be similar embedded JavaScript engines for other languages; I found v8js for PHP, though I haven't used PHP.)


Prashant Tiwari

Apr 17, 2016, 6:08:13 AM
to sanskrit-programmers
Hi Pradyumna,

JavaScript by itself should be able to do a static analysis all right, but given the lack of precision in the existing solutions, my idea is to introduce some sort of AI to this problem: one that is aware of an entire corpus of Sanskrit documents, can make more accurate and trustworthy suggestions, and learns from continuously fed training and real-world data to improve itself over time. Even for that, I'm all for a Node.js solution to serve as the backend.

For this purpose I'll be evaluating a few machine learning tools over the next few weeks. I'd love to hear what others think of this.

Pradyumna Achar

Apr 17, 2016, 9:08:13 AM
to sanskrit-programmers
It's a good idea.

However, we'd need training data, and I think it'd be tough to find such data under Creative Commons or any open source license.
For this, how'd it be if we crowd-sourced it, something like:
a) Create a simple android app that asks its user what the सन्धिविच्छेद of a particular word is.
    -- That app should get its list of words-to-ask from a central repository (possibly a github repository), which is seeded from well known literary texts.
b) When the user of that app splits the सन्धि, it uploads the data to a central database (via cloud endpoints?)
c) Have a little program that orchestrates all this and aggregates the data into a clean corpus.

Prashant Tiwari

Apr 17, 2016, 10:01:50 AM
to sanskrit-programmers
That was exactly my plan. The project I'm working on requires just this ability to present the user with Sanskrit texts to start analysing and assigning the tokens of text their individual meanings through a convenient and intuitive UI, until the entire text is so analysed and translated. (This, by the way, is itself going to be an open source effort and contributions would be very welcome.)

--

Shreevatsa R

Apr 17, 2016, 10:03:02 AM
to sanskrit-programmers
I think the point that has not been stated clearly so far (perhaps it's obvious) is that to split sandhi you essentially need to know all possible forms of all possible words, so it's work that can be done only after the program has a lot of morphological understanding.

For example, consider these three: रामेति (rāmeti) रामेपि (rāmepi) रामेण (rāmeṇa)
When we as humans split
रामेति — राम+इति
रामेपि — रामे+अपि 
रामेण — रामेण
we do so because we know the forms of the word Rāma and that iti and api are common words. Otherwise, as far as purely phonetic possibilities go, there are all sorts of combinations like breaking रामेण into रा+अम+इण which we cannot discard without knowing the forms of words.

So I think the task of splitting sandhi should not be underestimated; in Sanskrit it comes rather towards the end of a lot of ability to recognize fully inflected words (implementing it definitely requires a lexicon (word list) and recognizing words, and with that ability a declension engine might be the first thing to implement), so directly aiming to split sandhi would be jumping the gun IMO. (The counterpoint, reflecting actual learners' experience, is that being able to recognize inflected word forms may be an easier task than being able to accurately generate valid forms.)
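To make the above concrete, here is a minimal sketch of the brute-force approach. The two-entry reverse-sandhi table and the five-word lexicon are toy stand-ins invented for illustration; a real splitter needs the complete rule set and a full lexicon of inflected forms, which is exactly the point.

```python
# Toy splitter: enumerate candidate splits by reversing a few vowel
# sandhi rules, then keep only candidates whose parts are in the lexicon.

# surface junction character -> possible (end-of-left-word, start-of-right-word) pairs
REVERSE_SANDHI = {
    "e": [("a", "i"), ("a", "ī"), ("ā", "i"), ("ā", "ī"), ("e", "a")],
    "o": [("a", "u"), ("ā", "u")],
}

LEXICON = {"rāma", "iti", "api", "rāme", "rāmeṇa"}  # toy list of inflected forms

def split_candidates(text):
    """Return (left, right) splits licensed by the toy rules and lexicon."""
    results = []
    if text in LEXICON:          # the whole string may be a single word
        results.append((text, ""))
    for i, ch in enumerate(text):
        for left_end, right_start in REVERSE_SANDHI.get(ch, []):
            left = text[:i] + left_end
            right = right_start + text[i + 1:]
            if left in LEXICON and right in LEXICON:
                results.append((left, right))
    return results

print(split_candidates("rāmeti"))   # [('rāma', 'iti')]
print(split_candidates("rāmepi"))   # [('rāme', 'api')]
print(split_candidates("rāmeṇa"))   # [('rāmeṇa', '')]
```

Without the lexicon filter, the same loop would happily emit phonetically possible nonsense like rā+ameṇa, which is why the morphology has to come first.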

But if anyone wants to try then please feel free; it will be very useful.


However, I agree with Pradyumna's suggested first step, which I think is a useful tool in itself and have been thinking of for a few days now: some code in Javascript to assist splitting sandhi manually. For example, you can click/hover over the "e" in "ramepi" and it would pop up options like rām{a,ā}+{i,ī}pi, rāme+api. The results can be saved or uploaded. For reaching the most number of users this would need to be a webpage, that works fine on mobile as well, which is why I suggested Javascript. (It would work offline too, though of course the uploading can only be done online.)

The computer scientist and programmer Dan Ingalls Jr (famous for his work on Smalltalk) already made a tool that does this (and OCR!) for his father D.H.H. Ingalls (famous Indologist and author of a pleasing translation of Vidyākara's Subhāṣita-ratna-kośa) in 1980 (https://vimeo.com/4714623), so this is one thing we can be sure is doable today. :-)

Doing it with Unicode Devanagari, where the "e" in रामेपि is part of the single glyph मे, seems an interesting challenge. I'd be curious to know what solutions people come up with (the approach I would favour is the simplest, to just use Latin script).
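One mitigating point worth checking: the matra is its own Unicode code point even though it renders fused into the consonant's glyph, so code-point-level processing is possible; the challenge is in the UI, where the user perceives glyphs, not code points.

```python
# The dependent vowel sign ("matra") is a separate Unicode code point,
# even though it renders as part of the consonant's glyph.

word = "रामेपि"
print([f"U+{ord(c):04X}" for c in word])
# ['U+0930', 'U+093E', 'U+092E', 'U+0947', 'U+092A', 'U+093F']
print("े" in word)  # True: the "e" matra (U+0947) is individually addressable
```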



--

Martin Gluckman

Apr 17, 2016, 10:15:10 AM
to sanskrit-p...@googlegroups.com

Please read these two papers by Oliver Hellwig (attached). They should be useful for this pursuit. Kindest Wishes. Martin


ltc-004-hellwig.pdf
paper.pdf

Prashant Tiwari

Apr 17, 2016, 10:45:52 AM
to sanskrit-p...@googlegroups.com
@Martin: Thanks for the papers, they seem very helpful.

> So I think the task of splitting sandhi should not be underestimated; in Sanskrit it comes rather towards the end of a lot of ability to recognize fully inflected words (implementing it definitely requires a lexicon (word list) and recognizing words, and with that ability a declension engine might be the first thing to implement), so directly aiming to split sandhi would be jumping the gun IMO.

@Shreevatsa: This I think is exactly the case for bringing in machine learning. I'm becoming increasingly hopeful that a good deal of problems in Sanskrit comp. ling. could be much better solved by using ML, of course aided on the side with traditional tools. However, it is for the moment just a hope since I'm a rank newbie to AI. :)

>For example, you can click/hover over the "e" in "ramepi" and it would pop up options like rām{a,ā}+{i,ī}pi, rāme+api.

We're on the same wavelength — that's exactly how I'm thinking of doing this. :)

> Doing it with Unicode Devanagari, where the "e" in रामेपि is part of the single glyph मे, seems an interesting challenge. I'd be curious to know what solutions people come up with (the approach I would favour is the simplest, to just use Latin script).

Absolutely. In my experience so far, ITRANS has been working quite well for text processing, and it also makes things like alphabetically sorting Devanagari quite trivial. The point is to use ITX for storage and processing, and only convert to Devanagari at the last minute for display purposes.
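As a sketch of the sorting point (the rank table below is illustrative only: it works directly on Devanagari code points, and ignores conjunct handling and the implicit 'a'; an equivalent key can be built from ITRANS):

```python
# Sort Devanagari words in traditional alphabetical order by mapping each
# code point to its rank in the varnamala.

VOWELS = "अआइईउऊऋॠऌएऐओऔ"
SIGNS = "ंः"  # anusvara, visarga
CONSONANTS = "कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह"
MATRAS = "ािीुूृॄॢेैोौ"  # dependent vowel signs, in the order of VOWELS[1:]

RANK = {ch: i for i, ch in enumerate(VOWELS + SIGNS + CONSONANTS)}
RANK.update({ch: i + 1 for i, ch in enumerate(MATRAS)})  # ा sorts like आ, etc.

def collation_key(word):
    # Unknown characters (e.g. the virama) sort last; good enough for a demo.
    return [RANK.get(ch, len(RANK)) for ch in word]

print(sorted(["गज", "कमल", "अश्व"], key=collation_key))  # ['अश्व', 'कमल', 'गज']
```

A plain `sorted()` on the raw strings happens to work for many cases because the Unicode Devanagari block is roughly in varnamala order, but the explicit table makes the collation intent visible and fixable.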


dhaval patel

Apr 17, 2016, 12:26:37 PM
to sanskrit-p...@googlegroups.com
https://github.com/drdhaval2785/samasasplitter
This is the code I was referring to. It works on a mathematical model. I will document some problems it encountered soon. For dictionary word splits, it works fine. For real-world case endings and verb forms, its results may be suboptimal. The longer the word, the longer it takes to analyse. For a simple sandhi of two words, it should be OK.



--

dhaval patel

Apr 17, 2016, 9:25:27 PM
to sanskrit-p...@googlegroups.com
For JavaScript, http://sa.diglossa.org/ has some splitting features. It is very rough and the algorithm needs refining, but it splits words and has some linking to dictionaries. The developer is quite responsive.

Prashant Tiwari

Apr 17, 2016, 9:45:10 PM
to sanskrit-p...@googlegroups.com
Thank you Dhaval, will check these out.

Shreevatsa R

Apr 18, 2016, 7:20:05 PM
to sanskrit-programmers, Arun Prasad
Thanks Martin.

Here is a brief summary of the two papers by Dr. Hellwig, for the benefit of anyone else who may be interested.

1. Using Recurrent Neural Networks for joint compound splitting and Sandhi resolution in Sanskrit

The goal is to split sandhis and compounds, e.g. from uttamādhamamadhyānāṃ produce uttama-adhama-madhyānām. He uses his extensive SanskritTagger corpus, which already contains an analysis of many texts: 2.6 million strings of various lengths (34% are ≤ 5 in length, 6.5% are > 15).

For each string (sequence of phonemes), he creates an (input, output) pair -- or (observed, target) pair -- after adding the first phoneme of the next string into the input.

Examples:

   observed    r    a    t    n    a    ṃ    c
   target      r    a    t    n    a    m   BOW

   observed    t    ā    ṃ    ś    c    a    g
   target      t    ā    x    n    c    a   BOW

   [I guess]
   observed    u    t    t    a    m    ā    dh   a    m    a    m    a    dh   y    ā    n    ā    ṃ    g
   target      u    t    t    a    m   a-a   dh   a    m    a-   m    a    dh   y    ā    n    ā    m   BOW


This is used to train a recurrent neural network, containing: an input layer, a hidden forward and backward layer, and an output layer. The size of the input layer is the number of distinct input phonemes. At time t, input layer receives the phoneme observed at position t in a string. The hidden forward and backward layers capture the left and right context. LSTM cells are used in this layer. "The output layer receives the individual outputs from both hidden layers and performs a softmax regression for the desired target values. The network is trained with stochastic gradient descent"
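Assuming my reading of the paper is right, assembling one such (observed, target) pair might look like the sketch below. The helper name is mine, not the paper's; multi-phoneme correspondences (as in tāṃś → tān) additionally need alignment placeholders like the "x" in the second example above.

```python
# Sketch: build one (observed, target) training pair from a surface
# string and its unsandhied form, per the scheme described above.

def make_pair(surface, unsandhied, next_first_phoneme):
    """Append the first phoneme of the following string to the input and
    a BOW (beginning-of-word) marker to the target."""
    observed = surface + [next_first_phoneme]
    target = unsandhied + ["BOW"]
    assert len(observed) == len(target), "phonemes must align 1:1"
    return observed, target

obs, tgt = make_pair(list("ratna") + ["ṃ"], list("ratna") + ["m"], "c")
print(obs)  # ['r', 'a', 't', 'n', 'a', 'ṃ', 'c']
print(tgt)  # ['r', 'a', 't', 'n', 'a', 'm', 'BOW']
```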

This gets to an overall accuracy of 93.24% with the full dataset; with 10000 and 500000 samples the accuracy is 77.92% and 91.03% respectively. Note that all this uses only a known analysis of texts, no linguistic information or lexical or morphological resources (no language models, no dictionary or list of inflected forms).

2. Morphological disambiguation of Classical Sanskrit

This one uses the full power and workings of his SanskritTagger system, which includes: a large annotated corpus, a lexical database (words, their grammatical categories and meanings, inflected verbal forms), linguistic models (both hard-coded and trained on the corpus), and a linguistic processor that uses all this to analyze a sentence.

Takes a string and tries breaking it at every position and doing a morphological analysis of the left and right parts. All such potential morpho-lexical analyses are inserted into a "hypothesis lattice". Then Viterbi decoding to find the most probable analysis, using bigram probabilities learned from the annotated corpus. Gets to about 94.4% accuracy.
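For intuition about the decoding step, here is a generic Viterbi-over-a-lattice sketch. The lattice, the candidate analyses, and the log-probabilities are invented toys (positions stand in for abstract word slots rather than real surface spans); this is not Hellwig's implementation.

```python
# Generic Viterbi decode over a segmentation lattice: pick the
# highest-scoring sequence of analyses under a bigram model.

def viterbi(lattice, length, bigram_logp, start="<s>"):
    """lattice: {position: [(analysis, end_position), ...]}.
    Returns the best-scoring path of analyses covering [0, length)."""
    best = {0: (0.0, [], start)}  # position -> (score, path, last analysis)
    for pos in range(length):
        if pos not in best:
            continue
        score, path, prev = best[pos]
        for word, end in lattice.get(pos, []):
            s = score + bigram_logp(prev, word)
            if end not in best or s > best[end][0]:
                best[end] = (s, path + [word], word)
    return best.get(length, (float("-inf"), None, None))[1]

# Two competing analyses of "rāmeti"; the bigram model prefers rāma+iti.
lattice = {0: [("rāma", 1), ("rā", 1)], 1: [("iti", 2), ("eti", 2)]}
LOGP = {("<s>", "rāma"): -1.0, ("<s>", "rā"): -3.0, ("rāma", "iti"): -1.0}
def bigram_logp(prev, word):
    return LOGP.get((prev, word), -10.0)  # invented smoothing floor

print(viterbi(lattice, 2, bigram_logp))  # ['rāma', 'iti']
```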

(This is a longer paper but I have a shorter summary as it's less easy to reproduce without his system.)

—————————————

I've also attached the notes I took while reading the papers, for what it's worth.

(+Arun who has examined some similar work in the past, and may have some thoughts about it.)

I think there are a couple of directions in which our effort can be usefully expended, depending on our interests:

1. The work of actually performing automated analysis of sentences, splitting sandhi, etc.

2. Given such an analysis (whether created manually or automatically), tools and UIs to display the results of such annotation to a reader, in a form that is easily accessible (will run in the browser, looks good on mobile phones, can work offline if the data is available, adapts to the user's preferences, etc).
notes.txt

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 18, 2016, 7:57:56 PM
to sanskrit-programmers, Arun Prasad
Thanks for the excellent summaries, shrIvatsa!

2016-04-18 16:19 GMT-07:00 Shreevatsa R <shree...@gmail.com>:
This is used to train a recurrent neural network, containing: an input layer, a hidden forward and backward layer, and an output layer. The size of the input layer is the number of distinct input phonemes. At time t, input layer receives the phoneme observed at position t in a string. The hidden forward and backward layers capture the left and right context. LSTM cells are used in this layer. "The output layer receives the individual outputs from both hidden layers and performs a softmax regression for the desired target values. The network is trained with stochastic gradient descent"

Since I have zero experience with neural nets, these questions spring to mind:
Suppose we wanted to train such a network. Would we need "cloud power", or would a single modern computer do? If the former, are people able to use something like Google's https://en.wikipedia.org/wiki/TensorFlow on some cloud provider's computers?

 
This gets to an overall accuracy of 93.24% with the full dataset; with 10000 and 500000 samples the accuracy is 77.92% and 91.03% respectively. Note that all this uses only a known analysis of texts, no linguistic information or lexical or morphological resources (no language models, no dictionary or list of inflected forms).

So they're not using the entire 2.6 million strings, or even close. Is this corpus available online or by request?
  
Then Viterbi decoding to find the most probable analysis, using bigram probabilities learned from the annotated corpus. Gets to about 94.4% accuracy.
Not that impressive a gain, as I'd anticipated :-)
 
 
(+Arun who has examined some similar work in the past, and may have some thoughts about it.)

I think there are a couple of directions in which our effort can be usefully expended, depending on our interests:

1. The work of actually performing automated analysis of sentences, splitting sandhi, etc.

2. Given such an analysis (whether created manually or automatically), tools and UIs to display the results of such annotation to a reader, in a form that is easily accessible (will run in the browser, looks good on mobile phones, can work offline if the data is available, adapts to the user's preferences, etc).

Nice separation of concerns!

Shreevatsa R

Apr 18, 2016, 8:40:33 PM
to sanskrit-programmers, Arun Prasad
On Mon, Apr 18, 2016 at 4:57 PM, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:
Since I have zero experience with neural nets, these questions spring to mind:
Suppose we wanted to train such a network. Would we need "cloud power", or would a single modern computer do? If the former, are people able to use something like Google's https://en.wikipedia.org/wiki/TensorFlow on some cloud provider's computers?

I'm not sure either, but my impression is that a normal laptop is sufficient to train many kinds of neural networks. Also, I think these results should be taken as an indicator of roughly what levels of accuracy are possible with machine learning, rather than us trying to reproduce this work exactly. Other ML methods may perform even better, etc. Needs someone with more experience in machine learning to choose wisely. :-) Different methods vary widely in computational effort needed for training, efficiency after training, etc.


This gets to an overall accuracy of 93.24% with the full dataset; with 10000 and 500000 samples the accuracy is 77.92% and 91.03% respectively. Note that all this uses only a known analysis of texts, no linguistic information or lexical or morphological resources (no language models, no dictionary or list of inflected forms).

So they're not using the entire 2.6 million strings, or even close. Is this corpus available online or by request?

It is indeed using the entire corpus. To clarify: according to the paper,
- Training on the corpus of 2.6 million strings gave 93.24% accuracy.
- Training on a sample of 500000 strings gave 91.03% accuracy.
- Training on a sample of 10000 strings gave 77.92% accuracy.
[Table 7 of the paper.]

I don't know about the corpus being available; probably we have to ask Dr. Hellwig nicely in whatever way is most likely to elicit a positive response. :-) Martin has obtained some data from DCS for http://sanskritdictionary.com/frequency/, so his experience/contacts may be valuable.

Anyway, I think we need not ask for the full data until we know what to do with it.

Note that the DCS site (http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/) says it "offers free internet access to a part of the database of the linguistic program SanskritTagger, which has been under constant development since 1999" and (http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=intro_sanskrit_tagger) "DCS is an extract from a larger database that stores the linguistic data created using the program SanskritTagger". So if it's possible to get some data from the DCS site (not the full corpus) in a way that is compliant with whatever licence/terms are on the site, that should be enough to start. Unfortunately it looks like DCS website only provides a query interface rather than a data dump, though it does have a view that gives an entire annotated text, and says it's available under a CC-BY license.

Note that SanskritTagger itself is available as a freeware download, but it is heavily tied to Windows:  http://www.indsenz.com/int/index.php?content=sanskrit_tagger

Arun can say more about his plans in this direction if he's interested.


Then Viterbi decoding to find the most probable analysis, using bigram probabilities learned from the annotated corpus. Gets to about 94.4% accuracy.
Not that impressive a gain, as I'd anticipated :-)

Frankly, even the final accuracy of 94.4% isn't very satisfactory: note that according to the paper's data, about 90.96% of the phonemes are not part of any sandhi or compound boundaries. So the world's most trivial program, which simply takes the input and echoes it back, would achieve ~91% accuracy! Of course with machine learning every additional percentage point of accuracy is hard-won, but still. These are the numbers from the paper for different kinds of transformations (R1 is the "do-nothing" transformation) along with their frequency of occurrence:

   Rule type [Frequency] [What]   Precision   Recall   F score
   R1         90.96      p → p      99.59     99.44     99.51
   R5          4.05      p → a      98.28     98.63     98.46
   R3          2.62      p → p-     89.53     91.67     90.59
   R2          1.49      p → a-b    92.96     93.23     93.09
   R4          0.88      p → a-     89.35     95.55     92.35

(The do-nothing program would have 90% precision and 100% recall in the first row R1, and 0% precision and 0% recall in the other rows. Plus it would get most strings wrong (i.e. when string contains at least one sandhi), whereas the algorithm in the paper makes 0 errors for 93.22% of strings and has over 90% accuracy for all types of sandhis. So the paper's results are definitely impressive and in no way comparable to a trivial program! Still, not satisfactory. 94.4% means there will still be a lot of sandhis that need to be corrected manually. So the manual sandhi-breaking UI needs to be built anyway.)
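As a sanity check of the table above, the F score column is just the harmonic mean of the precision and recall columns, F1 = 2PR/(P+R):

```python
# Recompute the F score column of the table from precision and recall.

def f1(p, r):
    return 2 * p * r / (p + r)

rows = {  # rule: (precision, recall), from the table
    "R1": (99.59, 99.44), "R5": (98.28, 98.63), "R3": (89.53, 91.67),
    "R2": (92.96, 93.23), "R4": (89.35, 95.55),
}
for rule, (p, r) in rows.items():
    print(rule, round(f1(p, r), 2))  # matches the F score column
```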

JAGANADH GOPINADHAN

Apr 18, 2016, 8:45:37 PM
to sanskrit-p...@googlegroups.com, Arun Prasad
Hi Vishwas
A pretty decent modern computer will do. Do we have the training data available openly?
Jaganadh
Sr Applied Data Scientist
Redmond,WA

From: Shreevatsa R
Sent: 4/18/2016 5:40 PM
To: sanskrit-programmers
Cc: Arun Prasad

Subject: Re: Sandhi splitting tool

--