Sanskrit Data Standard Format

śrīnivāsa kaśyap munukutla

Jul 27, 2021, 12:50:55 PM

to sanskrit-programmers

Hi all,

Every time I want to build a better reader, it’s an absolute chore trying to find a good text to work with and a format that captures all the detail we’d need to produce something nice.

I think we need to standardize how we store Sanskrit data; that way we can build more useful things on top of it. There’s a reason why Sanskrit readers are such trash: in my experience, I start building one and give up a week later because getting a text ready to visualize is a nightmare.

Best,

kaśyap

My proposal is to create a single format for Sanskrit data on various texts that

1. is easy for somebody non-technical to update,

2. uses a standard encoding (including Vedic accents; SLP1 perhaps),

3. is fairly easy to map to a table and store in a database, and

4. makes it easy for us programmers to produce a template to be filled in.

A simple example of the template might be as follows using GRETIL’s Gita:

[01.001]

[0]

text = "dhṛtarāṣṭra uvāca"

[0.data]

[0.data."dhṛtarāṣṭra"]

grammar = []

meaning = ""

[0.data."uvāca"]

grammar = []

meaning = ""

[1]

text = "dharmakṣetre kurukṣetre samavetā yuyutsavaḥ"

[1.data]

[1.data."dharmakṣetre"]

grammar = []

meaning = ""

[1.data."kurukṣetre"]

grammar = []

meaning = ""

[1.data."samavetā"]

grammar = []

meaning = ""

[1.data."yuyutsavaḥ"]

grammar = []

meaning = ""

[2]

text = "māmakāḥ pāṇḍavāś caiva kim akurvata saṃjaya"

[2.data]

[2.data."māmakāḥ"]

grammar = []

meaning = ""

[2.data."pāṇḍavāś"]

grammar = []

meaning = ""

[2.data.caiva]

grammar = []

meaning = ""

[2.data.kim]

grammar = []

meaning = ""

[2.data.akurvata]

grammar = []

meaning = ""

[2.data."saṃjaya"]

grammar = []

meaning = ""
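A template like the one above could be generated rather than typed by hand. The following is a minimal sketch, not a settled design: the names `toml_string` and `make_template` are hypothetical, the layout simply mirrors the example, and every word key is quoted because TOML bare keys only allow ASCII letters, digits, underscores, and hyphens. A word that occurs twice in one line is emitted only once, since a TOML table cannot be defined twice.

```python
import json


def toml_string(value: str) -> str:
    """Quote a string for TOML.

    For ordinary text, JSON string escaping is also valid inside a TOML
    basic string, so json.dumps is used here as a shortcut.
    """
    return json.dumps(value, ensure_ascii=False)


def make_template(verse_id: str, lines: list[str]) -> str:
    """Build an empty annotation template mirroring the layout shown above."""
    out = [f"[{verse_id}]"]
    for i, text in enumerate(lines):
        out.append(f"[{i}]")
        out.append(f"text = {toml_string(text)}")
        out.append(f"[{i}.data]")
        # dict.fromkeys drops repeated words while keeping first-seen order.
        for word in dict.fromkeys(text.split()):
            out.append(f"[{i}.data.{toml_string(word)}]")
            out.append("grammar = []")
            out.append('meaning = ""')
    return "\n".join(out) + "\n"


template = make_template(
    "01.001",
    ["dhṛtarāṣṭra uvāca", "dharmakṣetre kurukṣetre samavetā yuyutsavaḥ"],
)
print(template)
```

Splitting on whitespace is the simplest possible tokenizer and will not undo sandhi, so the person filling in the template would still need to adjust the word list by hand.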

A user might fill this in using abbreviations (like these from Ruppel's book):

[Attached image: Screen Shot 2021-07-27 at 11.21.03 AM.png]

Andrew Ollett

Jul 27, 2021, 1:02:59 PM

to sanskrit-p...@googlegroups.com

Hi Kaśyap,

This is a good idea, but I doubt there will be a single format that meets everyone's needs.

For the verses I give to my students, the data is stored in a JSON format (see below) that has proven both easy to generate from Google Documents and easy to modify manually, after the fact, if required. Some of the tagging is a bit idiosyncratic and not well standardized, and the glossing of compounds leaves something to be desired, but the grammatical information is stored in such a way as to make (automated) dictionary lookups and the use of Dhaval's Prakriyā generator straightforward.

Andrew

{
  "data": {
    "metadata": {
      "title": "Śakuntalā",
      "passage": "5.4",
      "meter": "vasantatilakam",
      "description": "Inexplicable nostalgia.",
      "slug": "abhijna-5-4",
      "tags": {
        "syntax": [
          "relative-correlative"
        ],
        "nmorphology": [
          "aD-antam",
          "s-antam",
          "u-antam"
        ],
        "vmorphology": [
          "LyaP",
          "bhvādi",
          "laṭ",
          "parasmaipadam",
          "cvi"
        ],
        "compounds": [
          "karmadhārayaḥ",
          "vibhaktitatpuruṣaḥ"
        ]
      },
      "author": "Kālidāsaḥ"
    },
    "unanalyzed": {
      "pada_a": "ramyāṇi vīkṣya madhurāṁś ca niśamya śabdān",
      "pada_b": "paryutsukībhavati yat sukhitō ’pi jantuḥ",
      "pada_c": "tac cētasā smarati nūnam abōdhapūrvaṁ",
      "pada_d": "bhāvasthirāṇi jananāntarasauhr̥dāni"
    },
    "analyzed": [
      {
        "word": "ramyāṇi",
        "meaning": {
          "def": "pleasing, lovely"
        },
        "morphology": {
          "class": "adj",
          "stem": "ramya-",
          "gender": "n",
          "number": "pl",
          "case": "2"
        }
      },
      {
        "word": "vīkṣya",
        "meaning": {
          "def": "having seen",
          "note": "lyap suffix (converb/absolutive/gerund)"
        },
        "morphology": {
          "class": "indecl"
        }
      },
      {
        "word": "madhurān",
        "meaning": {
          "def": "pleasant, charming"
        },
        "morphology": {
          "class": "adj",
          "stem": "madhura-",
          "gender": "m",
          "number": "pl",
          "case": "2"
        }
      },
      {
        "word": "ca",
        "meaning": {
          "def": "and"
        },
        "morphology": {
          "class": "particle"
        }
      },
      {
        "word": "niśamya",
        "meaning": {
          "def": "having heard",
          "note": "lyap suffix (converb/absolutive/gerund)"
        },
        "morphology": {
          "class": "indecl"
        }
      },
      {
        "word": "śabdān",
        "meaning": {
          "def": "sounds, words"
        },
        "morphology": {
          "class": "noun",
          "stem": "śabda-",
          "gender": "m",
          "number": "pl",
          "case": "2"
        }
      },
      {
        "punct": "odd_pada"
      },
      {
        "word": "paryutsukībhavati",
        "meaning": {
          "def": "he becomes sorrowful or regretful",
          "note": "cvi suffix"
        },
        "morphology": {
          "class": "verb",
          "root": "bhū",
          "gana": "bhvadi",
          "person": "3rd",
          "number": "sg",
          "padam": "parasmai",
          "l": "laṭ",
          "preverb": "paryutsuka"
        }
      },
      {
        "word": "yat",
        "meaning": {
          "def": "which"
        },
        "morphology": {
          "class": "pron",
          "stem": "yad-",
          "gender": "n",
          "number": "sg",
          "case": "1"
        }
      },
      {
        "word": "sukhitaḥ",
        "meaning": {
          "def": "happy"
        },
        "morphology": {
          "class": "adj",
          "stem": "sukhita-",
          "gender": "m",
          "number": "sg",
          "case": "1"
        }
      },
      {
        "word": "api",
        "meaning": {
          "def": "even"
        },
        "morphology": {
          "class": "particle"
        }
      },
      {
        "word": "jantuḥ",
        "meaning": {
          "def": "a person"
        },
        "morphology": {
          "class": "noun",
          "stem": "jantu-",
          "gender": "m",
          "number": "sg",
          "case": "1"
        }
      },
      {
        "punct": "even_pada"
      },
      {
        "word": "tat",
        "meaning": {
          "def": "that"
        },
        "morphology": {
          "class": "pron",
          "stem": "tad-",
          "gender": "n",
          "number": "sg",
          "case": "1"
        }
      },
      {
        "word": "cētasā",
        "meaning": {
          "def": "by the mind"
        },
        "morphology": {
          "class": "noun",
          "stem": "cētas-",
          "gender": "n",
          "number": "sg",
          "case": "3"
        }
      },
      {
        "word": "smarati",
        "meaning": {
          "def": "he remembers, recollects"
        },
        "morphology": {
          "class": "verb",
          "root": "smṛ",
          "gana": "bhvadi",
          "person": "3rd",
          "number": "sg",
          "padam": "parasmai",
          "l": "laṭ",
          "preverb": ""
        }
      },
      {
        "word": "nūnam",
        "meaning": {
          "def": "certainly"
        },
        "morphology": {
          "class": "particle"
        }
      },
      {
        "word": "abōdha-pūrvam",
        "meaning": {
          "def": "previously unknowingly, in a manner that was unknown before",
          "note": "pūrvam na abōdhyata"
        },
        "morphology": {
          "class": "adv"
        },
        "compound": {
          "type": "k",
          "head": "pūrvam",
          "dep": "abōdham"
        }
      },
      {
        "punct": "odd_pada"
      },
      {
        "word": "bhāva-sthirāṇi",
        "meaning": {
          "def": "fixed in the heart"
        },
        "morphology": {
          "class": "adj",
          "stem": "bhāvasthira-",
          "gender": "n",
          "number": "pl",
          "case": "2"
        },
        "compound": {
          "type": "t7",
          "head": "sthirāṇi",
          "dep": "bhāve"
        }
      },
      {
        "word": "janana-antara-sauhr̥dāni",
        "meaning": {
          "def": "friendships from [previous] births"
        },
        "morphology": {
          "class": "noun",
          "stem": "sauhr̥da-",
          "gender": "n",
          "number": "pl",
          "case": "2"
        },
        "compound": {
          "type": "t5",
          "head": "sauhr̥dāni",
          "dep": "jananānām antarāt"
        }
      }
    ]
  }
}
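To illustrate how the "analyzed" list above might be consumed, here is a small sketch. It assumes only what is visible in the sample: entries with a "punct" key mark pāda boundaries, and the "stem" or "root" field is what one would pass to a dictionary lookup. The function names `split_padas` and `lookup_key` are hypothetical, and the `sample` list is an abridged excerpt, not the full verse.

```python
def split_padas(analyzed):
    """Group word entries into pādas, using {"punct": ...} entries as separators."""
    padas, current = [], []
    for entry in analyzed:
        if "punct" in entry:
            padas.append(current)
            current = []
        else:
            current.append(entry)
    if current:
        padas.append(current)
    return padas


def lookup_key(entry):
    """Return the stem or root for a dictionary lookup, else the word itself."""
    morph = entry.get("morphology", {})
    key = morph.get("stem") or morph.get("root") or entry["word"]
    # Drop stray whitespace and the trailing hyphen used to mark stems.
    return key.strip().rstrip("-")


# Abridged excerpt of the structure above, for demonstration only.
sample = [
    {"word": "ramyāṇi", "morphology": {"class": "adj", "stem": "ramya-"}},
    {"word": "vīkṣya", "morphology": {"class": "indecl"}},
    {"punct": "odd_pada"},
    {"word": "smarati", "morphology": {"class": "verb", "root": "smṛ"}},
]

for pada in split_padas(sample):
    print([(entry["word"], lookup_key(entry)) for entry in pada])
```

With the full file, `json.load` would supply the list via `data["data"]["analyzed"]`.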

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/6209335a-174a-4922-b8b7-eecddc340af1n%40googlegroups.com.

विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 28, 2021, 12:28:56 AM

to sanskrit-programmers

Nice website there, Andrew!

Transliteration for Dravidian scripts seems off: I see ಖಲನಂರತಾಮ್ from खलनम्रताम्। If this is a sanscript bug, please report it on GitHub.

--
Vishvas /विश्वासः

śrīnivāsa kaśyap munukutla

Jul 28, 2021, 12:34:07 AM

to sanskrit-programmers

Andrew-- Thank you for your response! I think that JSON snippet is quite helpful. It gives me a good start on understanding how to structure my own data. I am building some tools as a way to convince some vedic scholars to input the taittiriya śakha, and as you might expect, a complex template will certainly not convince them!

Where I have settled thus far is the necessity for 3 levels: verses, phrases, words. But the phrases level must be recursive! That's the challenge you allude to in your note. Your JSON is a good indication of which features need to be captured at each level, but the tricky part is that recursion at the phrase level.

Let me give that some additional thought, unless somebody else has ideas!
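One way the three levels with a recursive phrase level could be sketched is below. This is only an illustration under stated assumptions: the class names `Word`, `Phrase`, and `Verse` and their fields are hypothetical, and the example nesting of janana-antara-sauhr̥dāni is a reading of the compound entry in Andrew's JSON, not something taken from an existing dataset.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Word:
    text: str
    grammar: list[str] = field(default_factory=list)
    meaning: str = ""


@dataclass
class Phrase:
    kind: str  # e.g. "compound" or "sandhi"
    # A phrase may contain words or further phrases: this is the recursion.
    parts: list[Union[Word, "Phrase"]] = field(default_factory=list)
    meaning: str = ""


@dataclass
class Verse:
    text: str
    parts: list[Union[Word, Phrase]] = field(default_factory=list)


def words_in(node) -> list[Word]:
    """Flatten a Verse or Phrase into its words, left to right."""
    if isinstance(node, Word):
        return [node]
    result = []
    for part in node.parts:
        result.extend(words_in(part))
    return result


# A compound whose dependent member is itself a two-word group.
compound = Phrase(
    kind="compound",
    meaning="friendships from [previous] births",
    parts=[
        Phrase(kind="compound", parts=[Word("janana"), Word("antara")]),
        Word("sauhr̥dāni"),
    ],
)
print([w.text for w in words_in(compound)])
```

Because `Phrase.parts` accepts both words and phrases, nesting can go as deep as a long compound requires while the verse and word levels stay flat.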

Irene Galstian

Jul 28, 2021, 1:30:28 AM

to sanskrit-p...@googlegroups.com

Are there samples for syllable level? E.g. to mark and query things like p/u-pada ādyudātta in samāsas, set default svara for a prātipadika etc.

विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 28, 2021, 1:46:52 AM

to sanskrit-programmers

On Wed, Jul 28, 2021 at 10:04 AM śrīnivāsa kaśyap munukutla <skm...@gmail.com> wrote:

Andrew-- Thank you for your response! I think that JSON snippet is quite helpful. It gives me a good start on understanding how to structure my own data. I am building some tools as a way to convince some vedic scholars to input the taittiriya śakha, and as you might expect, a complex template will certainly not convince them!

I am curious to know the details of what you intend with "convince some vedic scholars to input the taittiriya śakha". I know of a few institutions which have already input the saMhitA, brAhmaNa, AraNyaka, padapATha and probably even bhAShya.

Irene Galstian

Jul 28, 2021, 1:51:40 AM

to sanskrit-p...@googlegroups.com

Vishvas-

Have you come across Bhāskara’s comms on TS and TB in text form? I mean available for others to use, not just the inputting institution.

विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 28, 2021, 7:24:10 AM

to sanskrit-programmers

On Wed, Jul 28, 2021 at 11:21 AM Irene Galstian <gnos...@gmail.com> wrote:

Vishvas-
Have you come across Bhāskara’s comms on TS and TB in text form? I mean available for others to use, not just the inputting institution.

No. The ones I know consider passing on such texts to strangers a grave sin. With the advent of high-accuracy OCR, though, I expect such texts to become generally available in the not too distant future.

Irene Galstian

Jul 28, 2021, 7:45:34 AM

to sanskrit-p...@googlegroups.com

I agree, and I have already started OCRing and proofreading Bhāskara. Thank you for confirming that no publicly available version exists yet.

śrīnivāsa kaśyap munukutla

Jul 28, 2021, 11:41:39 AM

to sanskrit-programmers

@Vishvas There are several gaps in each of the current projects: ghanam isn't ready; vedavms' work is intractable and very hard to port to a machine-readable format, as I found when I tried last summer (it's doable with a good deal of manual effort, but hardly worth the time given all the changes they keep publishing); and parankusa prefers that I not use their data (which is only dravida patha). My hope is to fill these gaps, though how to minimize the necessary effort isn't clear to me yet.

This site is an inspiration: http://rigveda.sanatana.in/. I'd like to see something similar for the other veda śakhas.

śrīnivāsa kaśyap munukutla

Jul 28, 2021, 12:11:14 PM

to sanskrit-programmers

Hadn't even thought of that! Perhaps we extend the levels I mention to 4?

Text
1. Encoding
2. Genre
3. Author
Verses
1. Metre
2. Meaning
Phrases
1. Type: Compound, Sandhi
2. Meaning
Words
1. Grammar
  1. We can enumerate this further.
    1. Nouns: Case, Person
2. Root
  1. Root Form
  2. Root Class
Syllables
1. What are the features required here? I'm completely ignorant of this.

Maybe we can start with what data we'd need for each level above? Can any scholars add to the list above?
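To make the levels above concrete, and to check that they map onto tables, here is a rough sketch using Python's built-in sqlite3 module. Everything about the schema is an assumption for discussion: the table and column names are hypothetical, the "line" phrase kind is added only for the demo, and the syllables table stores just position and form because its features are still an open question. The phrase level is made recursive with a `parent_id` column pointing back at the same table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE texts (id INTEGER PRIMARY KEY, title TEXT, encoding TEXT,
                    genre TEXT, author TEXT);
CREATE TABLE verses (id INTEGER PRIMARY KEY, text_id INTEGER REFERENCES texts(id),
                     ref TEXT, metre TEXT, meaning TEXT);
-- parent_id points at another phrase, which makes this level recursive
CREATE TABLE phrases (id INTEGER PRIMARY KEY, verse_id INTEGER REFERENCES verses(id),
                      parent_id INTEGER REFERENCES phrases(id), kind TEXT, meaning TEXT);
CREATE TABLE words (id INTEGER PRIMARY KEY, phrase_id INTEGER REFERENCES phrases(id),
                    position INTEGER, form TEXT, grammar TEXT, root TEXT, root_class TEXT);
CREATE TABLE syllables (id INTEGER PRIMARY KEY, word_id INTEGER REFERENCES words(id),
                        position INTEGER, form TEXT);
"""


def add(conn, table, **cols):
    """Insert one row and return its id (table and column names are trusted constants)."""
    names = ", ".join(cols)
    marks = ", ".join("?" for _ in cols)
    cur = conn.execute(
        f"INSERT INTO {table} ({names}) VALUES ({marks})", tuple(cols.values())
    )
    return cur.lastrowid


def words_under(conn, phrase_id):
    """Return the forms of all words in a phrase and its nested sub-phrases, in order."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id) AS (
            SELECT ?
            UNION ALL
            SELECT p.id FROM phrases p JOIN sub ON p.parent_id = sub.id
        )
        SELECT w.form FROM words w JOIN sub ON w.phrase_id = sub.id
        ORDER BY w.position
        """,
        (phrase_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
text_id = add(conn, "texts", title="Gita", encoding="IAST")
verse_id = add(conn, "verses", text_id=text_id, ref="01.001")
line_id = add(conn, "phrases", verse_id=verse_id, kind="line")
# "caiva" is stored as a sandhi phrase nested inside the line, holding ca + eva.
sandhi_id = add(conn, "phrases", verse_id=verse_id, parent_id=line_id, kind="sandhi")
add(conn, "words", phrase_id=sandhi_id, position=3, form="ca")
add(conn, "words", phrase_id=sandhi_id, position=4, form="eva")
add(conn, "words", phrase_id=line_id, position=5, form="kim")
print(words_under(conn, line_id))
```

The recursive query is the price of the self-referencing phrase table; if that proves too awkward, storing each phrase's full path of ancestors would be an alternative worth weighing.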

śrīnivāsa kaśyap munukutla

Jul 29, 2021, 5:03:29 PM

to sanskrit-programmers

I was looking through the Digital Corpus of Sanskrit (DCS), and they represent annotation data as shown below. This is in CoNLL-U format; I much prefer JSON/YAML/TOML for structure, but there's something to be learned from how sandhi is presented.

For reference (because it's new to me, though perhaps not you all!):

1. A broad description of CoNLL-U: https://universaldependencies.org/format.html#syntactic-annotation

2. POS Tags: https://universaldependencies.org/u/pos/index.html

3. Additional language features not covered by POS tags: https://universaldependencies.org/u/feat/index.html

4. A fascinating list of which features occur in which languages within the sample set of texts they selected: https://universaldependencies.org/ext-feat-index.html

The sample texts and their annotations themselves live here: https://github.com/UniversalDependencies?q=Sanskrit&type=&language=&sort=

From DCS's data posted here: https://github.com/OliverHellwig/sanskrit/tree/master/dcs/data/conllu/files/Mah%C4%81bh%C4%81rata:
[Attached image: Screen Shot 2021-07-29 at 4.40.36 PM.png]
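For anyone who wants to pull CoNLL-U into JSON-like structures, a minimal reader might look like the sketch below. It relies only on the general CoNLL-U layout (ten tab-separated columns, `#` comment lines, blank lines between sentences, and multiword-token ranges such as `2-3`). The sample rows are hand-written for illustration and are not copied from the DCS files; whether DCS encodes sandhi with exactly this range device should be checked against the files themselves. The function names are hypothetical.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_feats(feats):
    """Turn 'Case=Nom|Number=Plur' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments are skipped in this sketch
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(FIELDS, cols))
        token["feats"] = parse_feats(token["feats"])
        token["is_range"] = "-" in token["id"]
        tokens.append(token)
    if tokens:
        sentences.append(tokens)
    return sentences


def multiword_map(tokens):
    """Map each multiword surface form to the forms of the tokens it spans."""
    by_id = {t["id"]: t for t in tokens}
    result = {}
    for t in tokens:
        if t["is_range"]:
            start, end = (int(n) for n in t["id"].split("-"))
            result[t["form"]] = [by_id[str(i)]["form"] for i in range(start, end + 1)]
    return result


# Hand-written illustrative rows (not DCS data); columns are joined with tabs.
rows = [
    ["1", "pāṇḍavāś", "pāṇḍava", "NOUN", "_",
     "Case=Nom|Gender=Masc|Number=Plur", "_", "_", "_", "_"],
    ["2-3", "caiva", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["2", "ca", "ca", "CCONJ", "_", "_", "_", "_", "_", "_"],
    ["3", "eva", "eva", "PART", "_", "_", "_", "_", "_", "_"],
]
sample = "# text = pāṇḍavāś caiva\n" + "\n".join("\t".join(r) for r in rows) + "\n"

sentence = parse_conllu(sample)[0]
print(multiword_map(sentence))
```

The resulting token dicts can be dumped straight to JSON, which would allow the DCS annotations to be compared with the JSON and TOML layouts discussed earlier in the thread.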
