Sandhi function in Python for an IAST string

Martin मरुतिन्

Apr 19, 2016, 8:00:10 AM
to sanskrit-programmers
Greetings,

Is anyone aware of a stable Python function for applying Sandhi to an IAST string?

We have the same in other languages but need it currently in Python.

I noted this but it seems to work with SLP1 currently:


Kindest Wishes,

Martin

Shreevatsa R

Apr 19, 2016, 11:10:59 AM
to sanskrit-programmers
If that program works with SLP1 then it should be easy to transliterate IAST to SLP1 and feed it to that program. 

Transliteration (especially from SLP1) is a relatively easy task; there is Arun's Sanscript (JavaScript version here: https://github.com/sanskrit/sanscript.js), or I can extract the transliteration part of my messy code into a separate library if that will help.
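
To make the "transliterate first, then feed it to the SLP1 program" idea concrete, here is a minimal Python sketch of an IAST to SLP1 converter. It is not taken from Sanscript or any other library mentioned here; the function name and the table are illustrative assumptions based on the published SLP1 scheme, and it ignores accents and the hiatus case where "a" followed by "i" or "u" is not a diphthong.

```python
import unicodedata

# IAST -> SLP1 table (an assumption based on the published SLP1 scheme;
# check it against a reference before relying on it).
_TABLE = {
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "a": "a", "ā": "A", "i": "i", "ī": "I", "u": "u", "ū": "U",
    "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X", "e": "e", "o": "o",
    "ṃ": "M", "ḥ": "H",
    "k": "k", "g": "g", "ṅ": "N", "c": "c", "j": "j", "ñ": "Y",
    "ṭ": "w", "ḍ": "q", "ṇ": "R", "t": "t", "d": "d", "n": "n",
    "p": "p", "b": "b", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "ś": "S", "ṣ": "z", "s": "s", "h": "h",
}
# Normalize the keys so that precomposed and combining input both match.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _TABLE.items()}


def iast_to_slp1(text):
    """Greedy longest-match transliteration (hypothetical helper)."""
    text = unicodedata.normalize("NFC", text).lower()
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in IAST_TO_SLP1:          # digraphs such as "bh" or "ai"
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:                             # single letter, or pass through
            out.append(IAST_TO_SLP1.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(iast_to_slp1("saṃskṛtam"))
```

The output of such a function could then be handed to whatever SLP1-based sandhi routine is being used, and the result mapped back to IAST with the inverse table.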

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Prashant Tiwari

Apr 19, 2016, 11:26:56 AM
to sanskrit-p...@googlegroups.com
Slightly off-topic: for practical purposes are there any good reasons for choosing one transliteration scheme over another or is it just a matter of taste?

Martin Gluckman

Apr 19, 2016, 11:49:53 AM
to sanskrit-p...@googlegroups.com
The topic is discussed at length in Peter Scharf's excellent book:


For all our projects we have standardised on IAST internally; it is much easier and more intuitive to work with.

Kindest Wishes,

Martin

--
You received this message because you are subscribed to a topic in the Google Groups "sanskrit-programmers" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/sanskrit-programmers/ybDO8l3dw6w/unsubscribe.
To unsubscribe from this group and all its topics, send an email to sanskrit-program...@googlegroups.com.

Shreevatsa R

Apr 19, 2016, 11:53:19 AM
to sanskrit-programmers
There are some technical reasons why SLP1 makes a good internal representation. Basically, it uses a single character for each phoneme (vowel or consonant), at least for the ones that commonly occur in classical Sanskrit (though it uses e.g. e1 for short e), which makes it easier to work with. Scharf and Hyman go into some detail in their book "Linguistic Issues in Encoding Sanskrit", where they fully describe SLP1/SLP2/SLP3: http://sanskritlibrary.org/Sanskrit/pub/lies_sl.pdf
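
As a small illustration of why one character per phoneme is convenient, the Python sketch below classifies each character of an SLP1 string as vowel or consonant with a plain per-character loop, with no tokenizing step. The helper name is hypothetical, and for simplicity it lumps anusvara (M) and visarga (H) in with the consonants.

```python
# SLP1 vowels of classical Sanskrit, one character each.
SLP1_VOWELS = set("aAiIuUfFxXeEoO")


def cv_pattern(slp1_word):
    """Map each SLP1 character (one phoneme each) to V or C."""
    return "".join("V" if ch in SLP1_VOWELS else "C" for ch in slp1_word)


word = "Darma"  # IAST "dharma": six letters but five phonemes
print(len(word), cv_pattern(word))
```

Doing the same directly on IAST would first require grouping digraphs such as "dh" or "ai" into single units.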

For displaying to the user, I think it's a matter of taste. SLP is awful to read and IMO should never be displayed to the user. I personally prefer IAST as it's a standard used in many books, looks professional, aesthetically pleasing, can distinguish case if you want (e.g. can write Rāma to indicate that it's a name/proper noun), I find it easier to type in than in Devanagari, etc. Some people prefer ITRANS because that's what they are familiar with and diacritics are scary to some people.

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 19, 2016, 11:58:31 AM
to sanskrit-programmers

2016-04-19 8:52 GMT-07:00 Shreevatsa R <shree...@gmail.com>:
For displaying to the user, I think it's a matter of taste. SLP is awful to read and IMO should never be displayed to the user. I personally prefer IAST as it's a standard used in many books, looks professional, aesthetically pleasing, can distinguish case if you want (e.g. can write Rāma to indicate that it's a name/proper noun), I find it easier to type in than in Devanagari, etc. Some people prefer ITRANS because that's what they are familiar with and diacritics are scary to some people.

The above is, of course, IF you want to display transliteration to the user for some reason. (Relative to, say, devanAgarI) I find it torture to read verses in IAST; I think this holds true for most Indian users.



--
Vishvas /विश्वासः

Shreevatsa R

Apr 19, 2016, 12:31:32 PM
to sanskrit-programmers
Yes, that's what I meant by matter of taste. I personally find it quite comfortable to read verses in IAST; I have read entire books and texts (e.g. from GRETIL) directly in IAST and can read it as fast as Devanagari (though of course for reading (good) Sanskrit the main bottleneck is reflecting on the text, not simply visual-to-aural translation :) ). 

Representing Sanskrit text in Latin script always requires a "new" user to spend some time essentially learning a new script (e.g. how dīrgha vowels and ṭa-varga consonants are represented), and after the new script is learned and fully internalized, it doesn't make much of a difference beyond aesthetics, I guess.

So again, I urge website makers to provide users a preference to view Sanskrit text in whatever script they like, be it Devanagari or IAST or HK or ITRANS or Kannada script or Bengali script or whatever. (I am myself not following this advice, though I plan to.)

Martin Gluckman

Apr 19, 2016, 12:44:20 PM
to sanskrit-p...@googlegroups.com
Dear Vishvas,

IAST and ISO 15919 have the advantage of being easy to represent on a Roman-lettered (QWERTY) keyboard, which has become the de facto keyboard standard for rapid global communication, so if Sanskrit is to be communicated rapidly back and forth there are great advantages here.

देवनागरी, as beautiful as it appears, is a relatively late representation of Sanskrit; countless scripts precede it that are now mostly forgotten. My personal favorite is 𑀩𑁆𑀭𑀸𑀳𑁆𑀫𑀻.

I feel that romanized Sanskrit (IAST, and ISO 15919 for some Vedic accents) should be taught more in India, as I have noticed the same aversion you have in others in India. All students should be equally comfortable in Devanagari and transliterated Sanskrit, just as students of Japanese have Hiragana and Katakana along with the more elaborate Kanji.

The main importance, particularly with Sanskrit, is for the student to sound the language correctly; if reading is learnt more easily and the sounds can be mapped unambiguously and more quickly from the visual forms, then it is a good system.

In terms of indexing data there are many advantages to a system such as IAST: a search system will more easily relate āyurvedaḥ to ayurveda than आयुर्वेदः to ayurveda. So if the knowledge encoded in Sanskrit literature is to be globally discoverable, having a parallel romanized representation will catalyse this process.
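
One way to see the indexing point is a simple diacritic-folding step, sketched here in Python with the standard unicodedata module. The helper name is hypothetical and the approach is only an assumption about how a search index might build its keys; it is meant for Latin-script input, since dropping combining marks from Devanagari would remove vowel signs.

```python
import unicodedata


def fold_for_search(text):
    """Lowercase and drop combining marks (hypothetical index-key helper)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


key = fold_for_search("āyurvedaḥ")
# The folded key is plain ASCII, so a prefix match against "ayurveda" works.
print(key, key.startswith("ayurveda"))
```

Relating आयुर्वेदः to the same query needs a full transliteration step first, which is the extra work being pointed out.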

Ironically, according to some schools of thought Brahmi influenced "western" alphabets (others in this debate say it was the other way around), so it comes full circle: for example, 𑀥 (dha) in 𑀩𑁆𑀭𑀸𑀳𑁆𑀫𑀻 is similar to the D of today.


Kindest Wishes,

Martin

PS: If the 𑀩𑁆𑀭𑀸𑀳𑁆𑀫𑀻 in the above is not rendering for you, check that you have suitable fonts installed. Windows 10 and more recent versions of OS X support 𑀩𑁆𑀭𑀸𑀳𑁆𑀫𑀻 well.


विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 19, 2016, 12:49:40 PM
to sanskrit-programmers
Also, from the point of view of *accepting* user input, we've found https://sites.google.com/site/sanskritcode/optitrans useful in our dictionary work. So people look up shankara rather than sha~Nkara (ITRANS), zankara (HK), etc. Sometimes people find it convenient to type in the plain Latin alphabet, which is not the case for diacritics or even devanAgarI.

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 19, 2016, 1:00:08 PM
to sanskrit-programmers
Dear Martin, I agree with your view that people should be taught multiple scripts. I also agree with shrIvatsa's view that this artificial barrier of scripts should be overcome with better technology (e.g. http://stotrasamhita.net/wiki/Main_Page).

However:

2016-04-19 9:43 GMT-07:00 Martin Gluckman <mar...@vedicsociety.org>:
 
IAST and ISO 15919 have the advantage of being easy to represent on a Roman-lettered (QWERTY) keyboard, which has become the de facto keyboard standard for rapid global communication, so if Sanskrit is to be communicated rapidly back and forth there are great advantages here.

From my daily interactions on WhatsApp and elsewhere, I have a feeling that this technological drawback is on the way out.

 

देवनागरी as beautiful as it appears is a relatively late representation of Sanskrit, countless scripts precede it that are now mostly forgotten. My personal favorite is 𑀩𑁆𑀭𑀸𑀳𑁆𑀫𑀻

True, but age is actually beside the point; popularity is what matters (and some may add Indianness). We are put in a spot by the force of history.
 
In terms of indexing data there are many advantages to a system such as IAST: a search system will more easily relate āyurvedaḥ to ayurveda than आयुर्वेदः to ayurveda. So if the knowledge encoded in Sanskrit literature is to be globally discoverable, having a parallel romanized representation will catalyse this process.

Of course, the computer program should in parallel use the best representation from the computational POV (and the Unicode encoding of devanAgarI is clumsy, to say the least; again, we're in a way damned by history).

 

Prashant Tiwari

Apr 19, 2016, 1:36:03 PM
to sanskrit-programmers
So it appears it's mostly a matter of taste.

I agree with some others that IAST is the most intuitive and aesthetically pleasing for display (other of course than Devanagari itself), even though it utilises some non-standard diacritics not found in most Latin-only fonts (e.g. the d, n, and s underdots, also the n overdot). This presents problems for most web fonts. On one occasion I had to draw the missing glyphs for these characters myself and supply them in a fallback font for the browser to pick up.

I also believe that for computation Devanagari isn't the most suitable, although I was recently amazed to see Chrome (and consequently Node.js) natively support perfect dictionary sorting of Devanagari characters in its JavaScript engine. That's probably obvious in 2016 and applies to other runtimes/languages too, but I was quite pleased to see that.

Among the other schemes, I personally find ITRANS to be the most readable. SLP1 in particular introduces characters that don't read intuitively.

Talking of aversion to diacritics and symbols, my first favourite dictionary while growing up was "The Little Oxford". I was pained to see later editions do away with the IPA for pronunciation, only to replace it with their custom re-spelling system. I'm not sure if it was a dumbing down only for Indian students but I thought it was a bad move.


विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 19, 2016, 2:01:49 PM
to sanskrit-programmers

2016-04-19 10:35 GMT-07:00 Prashant Tiwari <prash...@gmail.com>:
So it appears it's mostly a matter of taste.

I was considering what other non-subjective non-background factors inherent within the script itself might make (say) devanAgarI more pleasing than IAST (as far as reading is concerned), and the following came to mind:

इतस् ततश् च वैदेहीम् अन्वेष्टुम् भर्तृ-चोदिताः ।

कपयश् चेरुर् आर्तस्य रामस्येव मनोरथाः ॥ 12.59

itaḥ tataḥ ca vaidehīm anveṣṭum bhartṛ-coditāḥ |

kapayaḥ ceruḥ ārtasya rāmasya iva manorathāḥ ||


* Greater compactness: you can take in more phonetic information with fewer "letters" and less space (e.g. see the example above from the excellent https://groups.google.com/forum/#!topic/sadaswada/4sbgFrc8Lpo journal). But I think this is a bit of a red herring, as it does not apply to other luxurious scripts like kannaDa, and because it is less important than the next point.

* It is easier to read a syllable in Indian scripts than in IAST etc. The consonant and vowel are "stuck" together in a group you take in at one go. With Latin-based scripts, your mind bears a greater burden of separating out the syllables.

Question to shrIvatsa: do you find it as easy to apply the metrically right tune while reading IAST as you do with devanAgarI or kannaDa?

Shreevatsa R

Apr 19, 2016, 3:02:55 PM
to sanskrit-programmers
On Tue, Apr 19, 2016 at 11:01 AM, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:
* It is easier to read a syllable in Indian scripts than in IAST etc. The consonant and vowel are "stuck" together in a group you take in at one go. With Latin-based scripts, your mind bears a greater burden of separating out the syllables.

I think this sort of explanation only shows our human capacity for rationalization. :-)

If this were true it would mean that speakers of languages that conventionally use Latin scripts inherently are burdened by their scripts, and that English/French/Spanish/etc. speakers would benefit from switching to Devanagari.

Our brains do not parse one component of a shape at a time, but perform "chunking". E.g. when I see राम it is such a frequently occurring shape that my brain doesn't have to read it as a sequence of र glyph, the  ा glyph for long a, and then the म glyph, but I instantly see that it is राम, its meaning springs to mind, possibly even my conception of the person Rāma and my emotional response, etc. With less common shapes -- something like चोदिताः in the above, which I have seen less frequently -- my brain may chunk it as say (1) चोदित which I have seen somewhat more frequently, along with ताः or ाः at the end, (non-consciously) merging into the proper sound चोदिताः or (2) syllable-by-syllable, as चो and दि and ताः again merging into the proper sound चोदिताः. Either way, the "visual marks on the page becoming sounds/phonemes in the brain" happens in some mostly non-conscious way, with some chunking.

Similarly, when reading in IAST, "rāma" or "coditāḥ" are chunked to various extents based on familiarity; they are not read one letter at a time.

You can look into the psychology of reading and find a rich literature already (though it's only beginning to scratch the surface), and there are also anecdotal examples like covering up the top or bottom half of letters and being able to read a fair bit, or this one that has been floating around the internet for over a decade:

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

(See http://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/ for useful discussion of the above, and links to studies, especially the section beginning with "4) Tihs".)

Another important point is that what we seek when we read is the sound of the language (a stream of phonemes), and ultimately the meaning. It is an idiosyncrasy of Devanagari that it breaks, say, अर्जुन into अ and र्जु and न, but our brain need not pass through that representation to read, i.e. we don't have to see "arjuna" and break it into "a" and "rju" and "na" in order to comprehend the actual sound of the word <अर्जुन/arjuna> -- all that matters is that we somehow pass from the visual representation to the accurate sound.
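
For anyone who wants to see the Devanagari-style grouping spelled out, here is a rough Python sketch that splits an IAST word into the same orthographic units (a, rju, na). The regular expression and the function name are illustrative assumptions, not an established library API; it ignores accents and cannot tell a true diphthong from "a" followed by "i" in hiatus.

```python
import re
import unicodedata


def _nfc(s):
    return unicodedata.normalize("NFC", s)


# Digraphs are listed before single letters so that "dh" wins over "d" + "h".
VOWEL = _nfc("ai|au|[aāiīuūṛṝḷḹeo]")
CONS = _nfc("kh|gh|ch|jh|ṭh|ḍh|th|dh|ph|bh|[kgṅcjñṭḍṇtdnpbmyrlvśṣsh]")
FINAL = _nfc("[ṃḥ]")  # anusvara / visarga stay with the preceding vowel

# One unit = any consonant cluster + a vowel (+ anusvara or visarga);
# a trailing cluster with no vowel (e.g. a final "s") forms its own unit.
AKSHARA = re.compile(f"(?:{CONS})*(?:{VOWEL}){FINAL}?|(?:{CONS})+")


def aksharas(text):
    """Split IAST text into Devanagari-style orthographic units (sketch)."""
    return AKSHARA.findall(_nfc(text).lower())


print(aksharas("arjuna"))
print(aksharas("coditāḥ"))
```

The grouping is purely a property of the writing system; the phoneme sequence underneath is the same in both scripts.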

 
Question to shrIvatsa: do you find it as easy to apply the metrically right tune while reading IAST as you do with devanAgarI or kannaDa?

Yes. What my brain registers is (as far as I can tell) exactly the same (the same "sound", though it's debatable whether it's actually sound; see subvocalization), and this is a step even before registering the metre/rhythm/tune.

Shreevatsa R

Apr 19, 2016, 3:28:35 PM
to sanskrit-programmers
On Tue, Apr 19, 2016 at 12:02 PM, Shreevatsa R <shree...@gmail.com> wrote:

Question to shrIvatsa: do you find it as easy to apply the metrically right tune while reading IAST as you do with devanAgarI or kannaDa?

Yes. What my brain registers is (as far as I can tell) exactly the same

Note that this was under the assumption that the Devanagari or IAST encode the same thing. The extract from Sadāsvāda is not an apples-to-apples comparison in this context, as there it was consciously decided to use the opportunity provided by writing every verse twice, to split sandhi in one of them. (To understand the sandhied version one has to mentally break sandhi, and to recite the unsandhied version one has to mentally form sandhi.) 

That is, just as in Devanagari we can write one of:
इतस्ततश्चवैदेहीमन्वेष्टुम्भर्तृचोदिताः ।  कपयश्चेरुरार्तस्यरामस्येवमनोरथाः ॥
इतस् ततश् च वैदेहीम् अन्वेष्टुम् भर्तृ-चोदिताः ।  कपयश् चेरुर् आर्तस्य रामस्येव मनोरथाः ॥
इतः ततः च वैदेहीम् अन्वेष्टुम् भर्तृ-चोदिताः । कपयः चेरुः आर्तस्य रामस्य इव मनोरथाः ॥

in IAST we can write one of:
itastataścavaidehīmanveṣṭumbhartṛcoditāḥ |  kapayaścerurārtasyarāmasyevamanorathāḥ ||
itas tataś ca vaidehīm anveṣṭum bhartṛ-coditāḥ |  kapayaś cerur ārtasya rāmasyeva manorathāḥ ||
itaḥ tataḥ ca vaidehīm anveṣṭum bhartṛ-coditāḥ | kapayaḥ ceruḥ ārtasya rāmasya iva manorathāḥ ||

and I assume we're comparing the same one of the three.
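
Since the thread started with a request for a sandhi function on IAST, here is a deliberately tiny Python sketch that turns the third (fully split) IAST form above into the second one. It implements only the handful of rules this verse needs (visarga before t, c and ṭ; visarga after a vowel other than a/ā before a vowel; a/ā followed by i/ī) and is in no way a complete or stable sandhi implementation; all names are hypothetical.

```python
import unicodedata


def _nfc(s):
    return unicodedata.normalize("NFC", s)


VISARGA = _nfc("ḥ")
VOWELS = set(_nfc("aāiīuūṛṝḷḹeo"))
# Visarga becomes the sibilant of the following voiceless stop's class.
SIBILANT_BEFORE = {"t": "s", "c": _nfc("ś"), _nfc("ṭ"): _nfc("ṣ")}
A_VOWELS = set(_nfc("aā"))
I_VOWELS = set(_nfc("iī"))


def join_pair(a, b):
    """Apply a tiny subset of sandhi rules to two adjacent tokens."""
    if a.endswith(VISARGA):
        stem = a[:-1]
        if b[0] in SIBILANT_BEFORE:                 # itaḥ + tataḥ
            return [stem + SIBILANT_BEFORE[b[0]], b]
        if stem and stem[-1] not in A_VOWELS and b[0] in VOWELS:
            return [stem + "r", b]                  # ceruḥ + ārtasya
    elif a[-1] in A_VOWELS and b[0] in I_VOWELS:    # rāmasya + iva
        return [a[:-1] + "e" + b[1:]]
    return [a, b]                                   # no rule in this subset


def sandhi_line(text):
    """Join whitespace-separated IAST tokens left to right (sketch only)."""
    tokens = _nfc(text).split()
    if not tokens:
        return ""
    out = [tokens[0]]
    for token in tokens[1:]:
        out.extend(join_pair(out.pop(), token))
    return " ".join(out)


print(sandhi_line("itaḥ tataḥ ca vaidehīm anveṣṭum bhartṛ-coditāḥ |"))
print(sandhi_line("kapayaḥ ceruḥ ārtasya rāmasya iva manorathāḥ ||"))
```

Everything outside these rules (for example aḥ before a vowel or a voiced consonant, or final m before a consonant) is left untouched, so a real implementation would need a far larger rule table.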

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 19, 2016, 3:56:06 PM
to sanskrit-programmers
2016-04-19 12:02 GMT-07:00 Shreevatsa R <shree...@gmail.com>:


I think this sort of explanation only shows our human capacity for rationalization. :-)
That's a very good point :-) But it is also possible that there is something to this hypothesis (see below).
 

If this were true it would mean that speakers of languages that conventionally use Latin scripts inherently are burdened by their scripts, and that English/French/Spanish/etc. speakers would benefit from switching to Devanagari.

That may well be the case! Consider the analogy with Chinese being able to remember longer numbers because their words for the numerals are shorter - http://www.npr.org/sections/krulwich/2011/07/01/137527742/china-s-unnatural-math-advantage-their-words

 
 
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

(See http://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/ for useful discussion of the above, and links to studies, especially the section beginning with "4) Tihs".)

But note that there is no word above that you do not already know, that each word is separated by a space, and that there are no compounds! These don't apply as much to Sanskrit.
  
Question to shrIvatsa: do you find it as easy to apply the metrically right tune while reading IAST as you do with devanAgarI or kannaDa?

Yes. What my brain registers is (as far as I can tell) exactly the same (the same "sound", though it's debatable whether it's actually sound; see subvocalization), and this is a step even before registering the metre/rhythm/tune.
That's interesting! I think I somehow find it easier to make out yati-sthAna-s with Indian scripts (but I'm not 100% sure, because I quickly go and seek a devanAgarI version when faced with the Latin alphabet).

Shreevatsa R

Apr 19, 2016, 8:02:42 PM
to sanskrit-programmers
On Tue, Apr 19, 2016 at 12:55 PM, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote: 
Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

(See http://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/ for useful discussion of the above, and links to studies, especially the section beginning with "4) Tihs".)

But note that there is no word above that you do not already know, that each word is separated by a space, and that there are no compounds! These don't apply as much to Sanskrit.

Just to be clear, I wasn't claiming that you can reorder letters in Sanskrit and have it be understandable. That's definitely not true for Sanskrit and it's not true even for English, as explained at the link. I was only using the quotation as an example (not as a true statement) that chunking happens when reading. When we see a longer word like "harmlessness" I don't think we recognize the word as a whole (unless it's very familiar in our context), but there is some chunking nevertheless. Similarly, I find that chunking happens when reading Sanskrit written in IAST. For example, one can see the shape uvāca and immediately register it as "said" (the meaning, the mental idea of someone having spoken), without even needing to pass through the sound if it's not in verse.

There is an interesting question: do primary readers of Devanagari and other syllabic/abugida scripts perform chunking to the same extent and in the same ways as readers of Latin and other alphabetic scripts? I don't know (someone may have done the research), but I'm pretty sure there is chunking.

Shreevatsa R

Apr 19, 2016, 8:18:10 PM
to sanskrit-programmers
On Tue, Apr 19, 2016 at 9:31 AM, Shreevatsa R <shree...@gmail.com> wrote:
So again, I urge website makers to provide users a preference to view Sanskrit text in whatever script they like, be it Devanagari or IAST or HK or ITRANS or Kannada script or Bengali script or whatever. (I am myself not following this advice, though I plan to.)

Let me elaborate on what my plans are/were, as they are only vapourware at this moment (haven't written a single line of code, and will be glad if someone else does!).

I imagine these components:
1. Some code that can transliterate between Indic/Brahmic scripts and a few common Latin schemes, e.g. Bengali to Gurmukhi, Devanagari to Kannada, Tamil to ISO 15919, etc. (I think sanscript.js already does this.)
2. Some code that can scan through the text elements on a webpage and detect runs of Brahmic text (and which script they are in). (I was planning to look at the MathJax preprocessors for how they do this.)
3. Some code to save user preferences in localStorage, and the associated CSS and JS stuff to make it sit as an unobtrusive button in a corner of the page that, when clicked, pops up a box asking for the user's preferred scripts, etc.
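
The real thing would be JavaScript running in the browser, but the detection idea in item 2 can be sketched in a few lines of Python for illustration. This is only an assumption about one possible approach (classifying characters by Unicode block); the function names are hypothetical, and it does not handle shared punctuation such as the danda, which sits in the Devanagari block but is used with several scripts.

```python
from itertools import groupby

# Unicode block ranges for some Brahmic scripts (inclusive code points).
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch):
    """Return the script name for a character, or None if not listed."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def brahmic_runs(text):
    """List (script, run) pairs for maximal runs of one Brahmic script."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)
            if script is not None]


print(brahmic_runs("Rāma is राम in Devanagari and ರಾಮ in Kannada"))
```

Because spaces belong to no block, each word comes out as its own run; merging neighbouring runs of the same script would be a natural next step before handing them to the transliterator from item 1.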

All very doable right now. With these, we could have the following:

(1) A browser extension that users can install, which will let them read, on any webpage, text detected as Indic (or marked up as such) transliterated into the script of their choice.
(2) A JavaScript library such that any website developer can, by simply adding a single line to their page, make the contents available to readers in whatever script they prefer (as if they had installed the extension).

(I think I remember someone on this list, Anunad maybe, saying that the extension (1) already exists.)
