Sanskrit Data Standard Format


śrīnivāsa kaśyap munukutla

Jul 27, 2021, 12:50:55 PM
to sanskrit-programmers
Hi all,

Every time I want to build a better reader, it’s an absolute chore to find a good text to work with and a format that captures all the detail we’d need to produce something nice.

I think we need to standardize how we store Sanskrit data; that way we can build more useful things on top of it. There’s a reason why Sanskrit readers are such trash: whenever I start building one, I give up a week later because getting a text ready to visualize is a nightmare.

Best,

kaśyap


My proposal is to create a single format for Sanskrit data on various texts that
1. is easy to update for somebody non-technical,
2. uses a standard encoding (including Vedic accents; SLP1, perhaps),
3. is fairly easy to map to a table and store in a database, and
4. is easy for us programmers to produce as a template to be filled in.

A simple example of the template, using GRETIL’s Gita, might look as follows:

[01.001]
[0]
text = "dhṛtarāṣṭra uvāca"

[0.data]
[0.data."dhṛtarāṣṭra"]
grammar = []
meaning = ""

[0.data."uvāca"]
grammar = []
meaning = ""

[1]
text = "dharmakṣetre kurukṣetre samavetā yuyutsavaḥ"

[1.data]
[1.data."dharmakṣetre"]
grammar = []
meaning = ""

[1.data."kurukṣetre"]
grammar = []
meaning = ""

[1.data."samavetā"]
grammar = []
meaning = ""

[1.data."yuyutsavaḥ"]
grammar = []
meaning = ""

[2]
text = "māmakāḥ pāṇḍavāś caiva kim akurvata saṃjaya"

[2.data]
[2.data."māmakāḥ"]
grammar = []
meaning = ""

[2.data."pāṇḍavāś"]
grammar = []
meaning = ""

[2.data.caiva]
grammar = []
meaning = ""

[2.data.kim]
grammar = []
meaning = ""

[2.data.akurvata]
grammar = []
meaning = ""

[2.data."saṃjaya"]
grammar = []
meaning = ""
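
A template like the one above could be generated rather than typed by hand. The following is only a minimal sketch, assuming the layout shown in this post: `toml_key` and `make_template` are hypothetical names, not part of any existing tool. It quotes any word that is not a valid TOML bare key (anything with diacritics) and skips a word that repeats within a line, since a repeated table name would be invalid TOML. It does not escape quotation marks inside the text itself.

```python
def toml_key(word):
    """Return the word as a TOML key, quoting it unless it is a valid bare key."""
    bare = bool(word) and all(
        c.isascii() and (c.isalnum() or c in "_-") for c in word
    )
    if bare:
        return word
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def make_template(verse_id, lines):
    """Build an empty annotation template: one table per line, one per word."""
    out = [f"[{verse_id}]"]
    for i, line in enumerate(lines):
        out += [f"[{i}]", f'text = "{line}"', "", f"[{i}.data]"]
        seen = set()
        for word in line.split():
            if word in seen:  # a duplicate table name would be invalid TOML
                continue
            seen.add(word)
            out += [
                f"[{i}.data.{toml_key(word)}]",
                "grammar = []",
                'meaning = ""',
                "",
            ]
    return "\n".join(out)


print(make_template("01.001", ["dhṛtarāṣṭra uvāca"]))
```

Repeated words would need a different keying scheme (for example, position in the line) if each occurrence is to be annotated separately.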

A user might fill this in using abbreviations (like these from Ruppel's book):
[Attached image: Screen Shot 2021-07-27 at 11.21.03 AM.png]

Andrew Ollett

Jul 27, 2021, 1:02:59 PM
to sanskrit-p...@googlegroups.com
Hi Kaśyap,

This is a good idea, but I doubt there will be a single format that meets everyone's needs.

For the verses I give to my students, the data is stored in a JSON format (see below) that has proven both easy to generate from Google Documents and easy to modify manually, after the fact, if required. Some of the tagging is a bit idiosyncratic and not well standardized, and the glossing of compounds leaves something to be desired, but the grammatical information is stored in such a way as to make (automated) dictionary lookups and the use of Dhaval's Prakriyā generator straightforward.

Andrew

{
    "data": {
        "metadata": {
            "title": "Śakuntalā",
            "passage": "5.4",
            "meter": "vasantatilakam",
            "description": "Inexplicable nostalgia.",
            "slug": "abhijna-5-4",
            "tags": {
                "syntax": [
                    "relative-correlative"
                ],
                "nmorphology": [
                    "aD-antam",
                    "s-antam",
                    "u-antam"
                ],
                "vmorphology": [
                    "LyaP",
                    "bhvādi",
                    "laṭ",
                    "parasmaipadam",
                    "cvi"
                ],
                "compounds": [
                    "karmadhārayaḥ",
                    "vibhaktitatpuruṣaḥ"
                ]
            },
            "author": "Kālidāsaḥ"
        },
        "unanalyzed": {
            "pada_a": "ramyāṇi vīkṣya madhurāṁś ca niśamya śabdān",
            "pada_b": "paryutsukībhavati yat sukhitō ’pi jantuḥ",
            "pada_c": "tac cētasā smarati nūnam abōdhapūrvaṁ",
            "pada_d": "bhāvasthirāṇi jananāntarasauhr̥dāni"
        },
        "analyzed": [
            {
                "word": "ramyāṇi",
                "meaning": {
                    "def": "pleasing, lovely"
                },
                "morphology": {
                    "class": "adj",
                    "stem": "ramya-",
                    "gender": "n",
                    "number": "pl",
                    "case": "2"
                }
            },
            {
                "word": "vīkṣya",
                "meaning": {
                    "def": "having seen",
                    "note": "lyap suffix (converb/absolutive/gerund)"
                },
                "morphology": {
                    "class": "indecl"
                }
            },
            {
                "word": "madhurān",
                "meaning": {
                    "def": "pleasant, charming"
                },
                "morphology": {
                    "class": "adj",
                    "stem": "madhura-",
                    "gender": "m",
                    "number": "pl",
                    "case": "2"
                }
            },
            {
                "word": "ca",
                "meaning": {
                    "def": "and"
                },
                "morphology": {
                    "class": "particle"
                }
            },
            {
                "word": "niśamya",
                "meaning": {
                    "def": "having heard",
                    "note": "lyap suffix (converb/absolutive/gerund)"
                },
                "morphology": {
                    "class": "indecl"
                }
            },
            {
                "word": "śabdān",
                "meaning": {
                    "def": "sounds, words"
                },
                "morphology": {
                    "class": "noun",
                    "stem": "śabda-",
                    "gender": "m",
                    "number": "pl",
                    "case": "2"
                }
            },
            {
                "punct": "odd_pada"
            },
            {
                "word": "paryutsukībhavati",
                "meaning": {
                    "def": "he becomes sorrowful or regretful",
                    "note": "cvi suffix"
                },
                "morphology": {
                    "class": "verb",
                    "root": "bhū",
                    "gana": "bhvādi",
                    "person": "3rd",
                    "number": "sg",
                    "padam": "parasmai",
                    "l": "laṭ",
                    "preverb": "paryutsuka"
                }
            },
            {
                "word": "yat",
                "meaning": {
                    "def": "which"
                },
                "morphology": {
                    "class": "pron",
                    "stem": "yad-",
                    "gender": "n",
                    "number": "sg",
                    "case": "1"
                }
            },
            {
                "word": "sukhitaḥ",
                "meaning": {
                    "def": "happy"
                },
                "morphology": {
                    "class": "adj",
                    "stem": "sukhita-",
                    "gender": "m",
                    "number": "sg",
                    "case": "1"
                }
            },
            {
                "word": "api",
                "meaning": {
                    "def": "even"
                },
                "morphology": {
                    "class": "particle"
                }
            },
            {
                "word": "jantuḥ",
                "meaning": {
                    "def": "a person"
                },
                "morphology": {
                    "class": "noun",
                    "stem": "jantu-",
                    "gender": "m",
                    "number": "sg",
                    "case": "1"
                }
            },
            {
                "punct": "even_pada"
            },
            {
                "word": "tat",
                "meaning": {
                    "def": "that"
                },
                "morphology": {
                    "class": "pron",
                    "stem": "tad-",
                    "gender": "n",
                    "number": "sg",
                    "case": "1"
                }
            },
            {
                "word": "cētasā",
                "meaning": {
                    "def": "by the mind"
                },
                "morphology": {
                    "class": "noun",
                    "stem": "cētas-",
                    "gender": "n",
                    "number": "sg",
                    "case": "3"
                }
            },
            {
                "word": "smarati",
                "meaning": {
                    "def": "he remembers, recollects"
                },
                "morphology": {
                    "class": "verb",
                    "root": "smr̥",
                    "gana": "bhvādi",
                    "person": "3rd",
                    "number": "sg",
                    "padam": "parasmai",
                    "l": "laṭ",
                    "preverb": ""
                }
            },
            {
                "word": "nūnam",
                "meaning": {
                    "def": "certainly"
                },
                "morphology": {
                    "class": "particle"
                }
            },
            {
                "word": "abōdha-pūrvam",
                "meaning": {
                    "def": "previously unknowingly, in a manner that was unknown before",
                    "note": "pūrvam na abōdhyata"
                },
                "morphology": {
                    "class": "adv"
                },
                "compound": {
                    "type": "k",
                    "head": "pūrvam",
                    "dep": "abōdham"
                }
            },
            {
                "punct": "odd_pada"
            },
            {
                "word": "bhāva-sthirāṇi",
                "meaning": {
                    "def": "fixed in the heart"
                },
                "morphology": {
                    "class": "adj",
                    "stem": "bhāvasthira-",
                    "gender": "n",
                    "number": "pl",
                    "case": "2"
                },
                "compound": {
                    "type": "t7",
                    "head": "sthirāṇi",
                    "dep": "bhāve"
                }
            },
            {
                "word": "janana-antara-sauhr̥dāni",
                "meaning": {
                    "def": "friendships from [previous] births"
                },
                "morphology": {
                    "class": "noun",
                    "stem": "sauhr̥da-",
                    "gender": "n",
                    "number": "pl",
                    "case": "2"
                },
                "compound": {
                    "type": "t5",
                    "head": "sauhr̥dāni",
                    "dep": "jananānām antarāt"
                }
            }
        ]
    }
}
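
To illustrate how a structure like this maps onto a table for lookups, here is a minimal sketch that flattens the `analyzed` list into rows and loads them into an in-memory SQLite database. The embedded sample is a three-entry excerpt of the JSON above; the function name `to_rows`, the table name and the column names are assumptions made for the example, not part of the format itself.

```python
import json
import sqlite3

# Excerpt of the verse data above: two words and one pada marker.
sample = json.loads("""
{"data": {"analyzed": [
  {"word": "ramyāṇi", "meaning": {"def": "pleasing, lovely"},
   "morphology": {"class": "adj", "stem": "ramya-", "gender": "n",
                  "number": "pl", "case": "2"}},
  {"word": "vīkṣya", "meaning": {"def": "having seen"},
   "morphology": {"class": "indecl"}},
  {"punct": "odd_pada"}
]}}
""")


def to_rows(doc):
    """Flatten analyzed entries into tuples, skipping pada markers."""
    rows = []
    for position, entry in enumerate(doc["data"]["analyzed"]):
        if "word" not in entry:  # entries such as {"punct": "odd_pada"}
            continue
        m = entry.get("morphology", {})
        rows.append((
            position,
            entry["word"],
            entry.get("meaning", {}).get("def"),
            m.get("class"),
            m.get("stem"),
            m.get("gender"),
            m.get("number"),
            m.get("case"),
        ))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (position INTEGER, word TEXT, definition TEXT, "
    "word_class TEXT, stem TEXT, gender TEXT, num TEXT, case_no TEXT)"
)
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?, ?, ?)", to_rows(sample))

# Example query: every word tagged with case "2".
accusatives = conn.execute(
    "SELECT word FROM words WHERE case_no = ?", ("2",)
).fetchall()
print(accusatives)
```

Fields that an entry does not carry (such as `stem` on an indeclinable) simply become NULL, so the sparse JSON and the fixed-width table can coexist.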


विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 28, 2021, 12:28:56 AM
to sanskrit-programmers
Nice website there, Andrew!

Transliteration for Dravidian scripts seems off: I see ಖಲನಂರತಾಮ್ from खलनम्रताम्. If this is a sanscript bug, please report it on GitHub.



--
Vishvas /विश्वासः

śrīnivāsa kaśyap munukutla

Jul 28, 2021, 12:34:07 AM
to sanskrit-programmers
Andrew-- Thank you for your response! I think that JSON snippet is quite helpful. It gives me a good start on understanding how to structure my own data. I am building some tools as a way to convince some Vedic scholars to input the Taittirīya śākhā, and as you might expect, a complex template will certainly not convince them!

Where I have settled thus far is the necessity for three levels: verses, phrases, words. But the phrase level must be recursive! That's the challenge you allude to in your note. Your JSON below is a good indication of what features need to be captured at each level, but the trick is that recursion at the phrase level.

Let me give that some additional thought, unless somebody else has ideas!
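
One possible shape for that recursion, as a rough sketch only: a phrase holds a list of parts, and each part is either a word or another phrase. The class and function names (`Word`, `Phrase`, `words_in`, `depth`) are hypothetical, and the nesting shown for the compound from Andrew's verse is just an illustration, not a settled analysis.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Word:
    form: str
    meaning: str = ""


@dataclass
class Phrase:
    kind: str  # e.g. "compound"; the labels here are illustrative
    parts: List[Union["Phrase", Word]] = field(default_factory=list)
    meaning: str = ""


def words_in(node):
    """Collect word forms left to right, however deeply phrases are nested."""
    if isinstance(node, Word):
        return [node.form]
    forms = []
    for part in node.parts:
        forms.extend(words_in(part))
    return forms


def depth(node):
    """Nesting depth: 0 for a word, 1 for a flat phrase, and so on."""
    if isinstance(node, Word):
        return 0
    return 1 + max((depth(part) for part in node.parts), default=0)


# A compound whose first member is itself a compound.
tree = Phrase("compound", [
    Phrase("compound", [Word("janana"), Word("antara")]),
    Word("sauhr̥dāni"),
])
print(words_in(tree), depth(tree))
```

Because a `Phrase` serializes naturally as a nested list, the same shape would carry over to JSON, YAML or TOML without a fixed depth limit.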

Irene Galstian

Jul 28, 2021, 1:30:28 AM
to sanskrit-p...@googlegroups.com
Are there samples for the syllable level? E.g., to mark and query things like p/u-pada ādyudātta in samāsas, set a default svara for a prātipadika, etc.


विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 28, 2021, 1:46:52 AM
to sanskrit-programmers

I am curious about what exactly you intend by "convince some Vedic scholars to input the Taittirīya śākhā". I know of a few institutions which have already input the saMhitA, brAhmaNa, araNyaka, padapATha and probably even bhAShya.

 

Irene Galstian

Jul 28, 2021, 1:51:40 AM
to sanskrit-p...@googlegroups.com
Vishvas-
Have you come across Bhāskara’s commentaries on TS and TB in text form? I mean available for others to use, not just the inputting institution.




विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 28, 2021, 7:24:10 AM
to sanskrit-programmers

No. The ones I know consider passing on such texts to strangers a grave sin. With the advent of high-accuracy OCR, though, I expect such texts to be generally available in the not too distant future.

Irene Galstian

Jul 28, 2021, 7:45:34 AM
to sanskrit-p...@googlegroups.com
I agree and have already started OCRing and proofreading Bhāskara. Thank you for confirming that no publicly available version exists yet.




śrīnivāsa kaśyap munukutla

Jul 28, 2021, 11:41:39 AM
to sanskrit-programmers
@vishwas There are several gaps in each of the current projects: ghanam isn't ready; vedavms' work is utterly intractable and very hard to port to a machine-readable format (I tried last summer; it's doable with a good deal of manual effort, but difficult and a waste of time given all the changes they're publishing); and parankusa prefers that I not use their data (which is only dravida patha). My hope is to fill these gaps, though it isn't yet clear to me how to minimize the effort required.

This site is an inspiration: http://rigveda.sanatana.in/. I'd like to see something similar for the other Veda śākhās.

śrīnivāsa kaśyap munukutla

Jul 28, 2021, 12:11:14 PM
to sanskrit-programmers
Hadn't even thought of that! Perhaps we extend the levels I mentioned to five, adding text above verses and syllables below words?
  1. Text
    1. Encoding
    2. Genre
    3. Author
  2. Verses
    1. Metre
    2. Meaning
  3. Phrases
    1. Type: Compound, Sandhi
    2. Meaning
  4. Words
    1. Grammar
      1. We can enumerate this further.
        1. Nouns: Case, Number, Gender
    2. Root
      1. Root Form
      2. Root Class
  5. Syllables
    1. What are the features required here? I'm completely ignorant of this. 
Maybe we can start with what data we'd need for each level above? Can any scholars add to the list above? 
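
As a starting point for that discussion, the outline above can be written down as data and used to check a partially filled record. This is only a sketch: the field names are taken directly from the list, the syllable level is left empty because its features are still an open question, and `LEVEL_FIELDS` and `missing_fields` are hypothetical names.

```python
# Fields per level, copied from the outline above.
LEVEL_FIELDS = {
    "text": ["encoding", "genre", "author"],
    "verse": ["metre", "meaning"],
    "phrase": ["type", "meaning"],
    "word": ["grammar", "root"],
    "syllable": [],  # features not yet decided in this thread
}


def missing_fields(level, record):
    """List the fields of a level that are absent or still empty in a record."""
    empty_values = (None, "", [], {})
    return [
        name for name in LEVEL_FIELDS[level]
        if record.get(name) in empty_values
    ]


verse = {"metre": "vasantatilakam", "meaning": ""}
print(missing_fields("verse", verse))
```

A check like this could tell a contributor which blanks remain in a template without forcing them to fill every level at once.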

śrīnivāsa kaśyap munukutla

Jul 29, 2021, 5:03:29 PM
to sanskrit-programmers
I was looking through the Digital Corpus of Sanskrit, which represents its annotation data in "CoNLL-U" format. I much prefer JSON/YAML/TOML for structure, but there's something to be learned from how sandhi is presented.

For reference (because it's new to me, though perhaps not to all of you!):
3. Additional language features not covered by POS tags: https://universaldependencies.org/u/feat/index.html
4. A fascinating list of which features occur in which languages within the sample set of texts they selected: https://universaldependencies.org/ext-feat-index.html
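
For anyone who wants to move CoNLL-U data into JSON, here is a minimal parsing sketch. The ten tab-separated columns and the `Name=Value|Name=Value` feature syntax come from the CoNLL-U specification; the sample sentence is made up for illustration and is not taken from the DCS files. CoNLL-U marks a surface string that splits into several words with an ID range line (such as `2-3`); whether DCS encodes sandhi in exactly this way should be checked against their data. The names `parse_feats`, `parse_conllu` and `is_range` are my own.

```python
import json

FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_feats(value):
    """Turn 'Case=Nom|Number=Plur' into a dict; '_' means no features."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # comment / metadata line
            continue
        token = dict(zip(FIELDS, line.split("\t")))
        token["feats"] = parse_feats(token["feats"])
        token["is_range"] = "-" in token["id"]  # multiword-token line
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def row(*columns):
    return "\t".join(columns)


# Made-up sample: "caiva" as a surface form covering "ca" and "eva".
SAMPLE = "\n".join([
    "# text = pāṇḍavāś caiva",
    row("1", "pāṇḍavāś", "pāṇḍava", "NOUN", "_",
        "Case=Nom|Gender=Masc|Number=Plur", "_", "_", "_", "_"),
    row("2-3", "caiva", "_", "_", "_", "_", "_", "_", "_", "_"),
    row("2", "ca", "ca", "CCONJ", "_", "_", "_", "_", "_", "_"),
    row("3", "eva", "eva", "PART", "_", "_", "_", "_", "_", "_"),
])

sentences = parse_conllu(SAMPLE)
print(json.dumps(sentences, ensure_ascii=False, indent=2))
```

The range line keeps the sandhied surface form while the numbered lines that follow carry the separated words, which is the part of the design that seems worth borrowing even in a JSON-based format.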