Wondering if any or all of these are available online somewhere under a copyleft or similar licence (free / open source / public domain), in the original Devanagari script in Sanskrit, in Unicode/text format (i.e. not in a PDF), such that you can download the entire thing. If none can be found in Unicode/text format, a PDF would work as an alternative, but text would be best.
Your enquiry is about the scriptures of Hinduism (Vedas, Upanishads, Puranas) in a text format that you can download or copy-paste. Yes, some of these works are available in Sanskrit, in Devanagari script.
Wikisource is the best website for the Puranas in Devanagari. There you can find the Sanskrit verses of all 18 Puranas plus some Upapuranas, which you can either copy-paste or simply download.
Upanishad mantras in Devanagari are also available on the website developed by IIT Kanpur. Interestingly, you will also find English translations of the Upanishads there, along with commentaries. It covers the 11 principal Upanishads.
There are many Sutras in Hindu scripture; I am providing links for just two, the Yoga-Sutras of Maharshi Patanjali and the Brahma-Sutras of Shree Veda Vyasa. Here are the Patanjali Yoga-Sutras in Devanagari. The Sutras are denoted by numbers such as 1.1, with the Sanskrit commentary given below each one.
On the Sanskrit Documents website you can find the Rig-Veda Samhita (core texts) and the Sama-Veda Samhita (Kauthuma Shakha, i.e. the Kauthuma branch) in Devanagari script, the content of which you can copy-paste. Regarding your enquiry about Kanda and Sukta: in Sanskrit, sukta means "well said" or "well told", and the Vedic hymns are called Suktas. There are various Suktas (hymns), such as the Purusha-Sukta and the Shree-Sukta.
Dharma-Shastra (धर्मशास्त्र, dharma-shastra) is a category of Hindu literature containing important instructions regarding religious law, ethics, economics, jurisprudence and more. It is categorised as smṛti, an important and authoritative body of texts dealing with the Hindu way of life.
Shastra means science, but the word is not limited to the Dharma-Shastras: Shastra is a general term covering all the scriptures of Hinduism.
This paper describes the first data-driven parser for Vedic Sanskrit, an ancient Indo-Aryan language in which a corpus of important religious and philosophical texts has been composed. We report and critically discuss experiments with the input feature representations, paying special attention to the performance of contextualized word embeddings and to the influence of morpho-syntactic representations on the parsing quality. In addition, we provide an in-depth discussion of the parsing errors that covers structural traits of the predicted trees as well as linguistic and extra-textual influence factors. In its optimal configuration, the proposed model achieves 87.61 unlabeled and 81.84 labeled attachment score on a held-out set of test sentences, demonstrating good performance for an under-resourced language.
The results discussed in this paper are also relevant for the wider field of parsing morphologically rich languages (MRLs) because Vedic has a rich system of fusional morpho-syntactic features. Parsing MRLs has encountered increasing interest in the NLP community over the last decade (see Tsarfaty et al., 2013 and esp. the survey in Tsarfaty et al., 2020). Many MRLs, including Vedic, share a number of basic syntactic traits that set them apart from languages using a reduced morphology, such as English or Chinese. Most importantly, they have a rich repertoire of morpho-syntactic markers that indicate the syntactic relations in a sentence and thereby can support a dependency labeler in finding the correct parse. The morpho-syntactic expressiveness, however, often comes along with a low degree of configurationality, implying, among others, free word order and the use of discontinuous constituents (see Sect. 4.4.2).
Dependency parsing of VS involves several domain- and annotator-related issues. Firstly, the Vedic corpus was composed over a period of at least one millennium and contains texts from different literary genres. VS is therefore a good test case for studying domain effects along a diachronic linguistic axis and with regard to genres (see Sect. 4.4.8). Secondly, the number of potential annotators for a Vedic Treebank (VTB) is small, so that the standard approach, which involves adjudicating multiple annotations of the same sentence, is practically not viable (for an example of good practice in this regard see Berzak et al., 2016). In addition, there are no active speakers of VS, and many syntactic phenomena and content-related issues are far less well understood than for modern languages. As individual annotators tend to form idiosyncratic annotation decisions (see Biagetti et al., 2021 for a study of VS), a parser of VS must be able to learn from partly idiosyncratic annotation schemes. Thirdly, the extant Vedic corpus contains around 3 million words (see Sect. 3.1) and may therefore not be large enough for pretraining contextualized word embeddings, which have boosted the performance of many downstream NLP tasks (see e.g. Kulmizev et al., 2019). Adding data from the corpus of CS mitigates the issue of data sparsity, but these later texts come from different cultural and linguistic domains (think of the difference between Middle and Modern English). As similar issues are encountered with other premodern languages (see e.g. Passarotti, 2019, Sect. 4.2, for Latin), we perform a systematic evaluation of how state-of-the-art contextualized word embeddings influence the performance of the parser. Certain linguistic characteristics of (Vedic) Sanskrit, such as sandhi (see Hellwig & Nehrdich, 2018) and its high morphological complexity, complicate the annotation process.
It is therefore even more important to use high-quality morpho-syntactic input data when parsing VS; we obtain this data from the gold annotations in the Digital Corpus of Sanskrit (DCS, see Sect. 3.3). This situation is contrary to that of many other languages, where predicted (silver) morpho-syntactic data is typically used as input for the parsing process (see Sect. 4.3 for a comparative evaluation). Our experiments suggest that parsers trained with lexical and morpho-syntactic gold annotations are at least competitive with contextualized models when only limited text corpora are available.
The experiments described in Sects. 4.2 and 4.3 provide quantitative evidence that the lack of large corpora needed for pretraining contextualized models can be counterbalanced by the use of gold input data.
After an overview of related research (Sect. 2) and the available data (Sect. 3), Sect. 4 describes the experimental setup and presents the evaluation of contextual embeddings. Individual types of parsing errors are discussed in Sect. 4.4. Section 5 summarizes the paper. In addition, we publish a new, significantly extended version of the VTB as compared to its state described in Hellwig et al. (2020). This treebank and the code of the parser are available under a Creative Commons license at
Modern dependency parsing methods can be broadly categorized into transition- (Nivre, 2003) and graph-based parsers (McDonald, 2006). Transition-based parsers build the dependency tree incrementally by a series of actions. A simple classifier is trained on local parser configurations and guides the parsing process by scoring the possible actions at each step. This approach is very efficient since the time-complexity is usually linear. Graph-based parsers on the other hand maximize a particular score by searching through the space of possible trees, given a sentence. The search-space is encoded as a directed graph and the score of a possible tree is calculated by a linear combination of the scores of local sub-graphs. Methods from graph theory such as the maximum spanning tree (MST) are then used to find the highest scoring among all possible trees. Recently, the application of neural networks and continuous representations has led to a substantial performance gain for transition-based (Chen & Manning, 2014; Ballesteros et al., 2015; Weiss et al., 2015; Kiperwasser & Goldberg, 2016) as well as graph-based parsers (Kiperwasser & Goldberg, 2016; Dozat & Manning, 2017). These current state-of-the-art parsers are still either transition- or graph-based, but the differences in their error distributions decrease constantly due to the convergence of neural architectures and feature representations.
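The graph-based idea described above can be illustrated with a toy sketch (not the authors' implementation): every arc head → dependent receives a score, the score of a candidate tree is the sum of its arc scores, and the parser returns the highest-scoring tree. For brevity this sketch simply enumerates all head assignments of a tiny sentence; real parsers replace the brute-force search with the Chu-Liu/Edmonds maximum spanning tree algorithm. The score matrix is arbitrary illustrative data.

```python
import itertools
import numpy as np

def is_tree(heads):
    """Check that a head assignment forms a tree rooted at node 0.

    heads[i] is the head of token i+1 (tokens are 1-based; 0 is the root).
    """
    for dep in range(1, len(heads) + 1):
        seen, node = set(), dep
        # follow head pointers; a valid tree reaches the root without cycling
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(scores):
    """Exhaustively score every head assignment and return the best tree.

    scores[h][d] is the score of attaching dependent d to head h.
    Brute force is exponential and only feasible for toy sentences;
    graph-based parsers use a maximum spanning tree algorithm instead.
    """
    n = scores.shape[0] - 1  # number of real tokens (node 0 is the root)
    best, best_score = None, float("-inf")
    for heads in itertools.product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        s = sum(scores[h][d] for d, h in enumerate(heads, start=1))
        if s > best_score:
            best, best_score = heads, s
    return best, best_score

# Toy 3-token sentence: arcs root->2, 2->1, 2->3 score highest.
scores = np.full((4, 4), 1.0)
scores[0][2] = 10.0
scores[2][1] = 8.0
scores[2][3] = 7.0
heads, total = best_tree(scores)  # heads = (2, 0, 2), total = 25.0
```

The search space here contains 4^3 = 64 head assignments, of which only the valid trees are scored; a neural graph-based parser differs mainly in how the entries of `scores` are produced.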
The comparative experiments in McDonald and Nivre (2011) and Kulmizev et al. (2019) show systematic differences between transition- and graph-based models. In our analysis of their results we noticed that, according to Table 1 in McDonald and Nivre (2011), their graph-based model performs better than a transition-based one for morphologically rich IE languages (mean labeled attachment score, LAS, +1.127), whereas for IE languages with a comparatively low amount of inflection it even performs worse (mean LAS \(-0.47\)). In addition, the graph-based model of Kulmizev et al. (2019) yields much better results than the transition-based one for languages of the SOV type (mean LAS +1.82). Since Vedic is both a morphologically rich IE language and has a preference for SOV, we decided to use a graph-based architecture.
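The attachment scores compared above follow the standard per-token definitions; the following hypothetical helper (not from the paper) makes them explicit, assuming 1-based head indices with 0 denoting the root:

```python
def attachment_scores(gold_heads, pred_heads, gold_labels, pred_labels):
    """Compute unlabeled (UAS) and labeled (LAS) attachment scores in percent.

    UAS: fraction of tokens whose predicted head is correct.
    LAS: fraction of tokens whose head AND dependency label are both correct.
    """
    assert len(gold_heads) == len(pred_heads) == len(gold_labels) == len(pred_labels)
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(gh == ph and gl == pl
              for gh, ph, gl, pl in zip(gold_heads, pred_heads,
                                        gold_labels, pred_labels)) / n
    return 100 * uas, 100 * las
```

Since LAS additionally requires the correct relation label, it is bounded above by UAS, which matches the gap between the 87.61 UAS and 81.84 LAS reported in the abstract.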
Syntactic parsing of Sanskrit has met with increasing interest in recent years. Kulkarni (2021) describes a rule-based dependency parser that uses semantic and structural principles of the Pāṇinian system of grammar and Śābdabodha for pruning the arcs of an initially fully connected graph. While the use of the Pāṇinian grammar is an appealing solution, Vedic texts contain phenomena that are not (fully) compatible with this system (e.g. sentences without finite verbs) so that an application to VS does not seem to be promising. In addition, this parser requires information about the case frames of verbs. While case frames are available for several frequent verbs in CS (Sanka, 2015), such a resource does not exist for the highly variegated verbal system of Vedic which abounds in hapax legomena with unclear meaning. Applying this parser to Vedic would therefore require a substantial amount of data collection and validation. More closely related to the present paper is the survey of data-driven dependency parsing of CS given by Krishna et al. (2020). The authors compare the performance of YAP (More et al., 2019), biaffine (Dozat & Manning, 2017), DCST (Rotman & Reichart, 2019) and L2S (Chang et al., 2016) on the Sanskrit Treebank Corpus (STBC, see Kulkarni, 2013), a small treebank of works composed mainly in the 20th century CE. On this treebank, the graph-based neural models (biaffine, DCST) perform significantly better than YAP or L2S, and the pretraining step of DCST gives another 2% advance as compared to the biaffine model (UAS 80.95; LAS 72.86). Notably, the authors report substantially lower scores when they apply the models trained on the STBC to the Śiśupālavadha, a metrical text composed in the 7th or 8th century CE. In this cross-domain application, DCST, being the best model, only achieves 40.02 UAS and 35.7 LAS.
While the authors explain this drop primarily by the metrical form of the Śiśupālavadha, one should also consider that the texts in the STBC are composed, so to say, in Neo-Sanskrit, a regulated form of the language which follows a strict SOV word order that is not found in the majority of texts composed before the 19th century, and whose vocabulary differs from that used in CS texts. Similar considerations apply to the results reported for the EBM model (Krishna et al., 2021) because EBM uses largely the same linguistic rules for pruning the search space that also regulate Neo-Sanskrit. The good performance that EBM shows on the STBC (UAS 85.32, LAS 83.93) is therefore not surprising.