Sanskrit NLP open source project


vishvAs vAsuki

Aug 19, 2011, 9:15:20 PM
to saMskRRita-sandesha-shreNiH, Srinivasa Murthy
Dear friends,

0. Following the discussion in another thread, and some positive responses, I have created a Mercurial repository on Bitbucket for an open-source Sanskrit NLP project.

1. Contributors may use any language compatible with the Java Virtual Machine, as long as they try their best to write correct code. (I personally prefer Scala, as it combines the speed of Java with the conciseness of Python, though people may be more familiar with Java.)

2. It uses sbt as the build system. All you need to do to build or run your code is invoke bin/sktnlp with the appropriate command (please take a look at that code for details). After getting the code from the repository for the first time, run bin/sktnlp build.

3. As proposed in the earlier thread, we could first focus on developing a Wiktionary bot. However, there is currently no code in the project.

4. The license is undecided, but we should go for one with the fewest restrictions possible. Note that, if necessary to take advantage of others' contributions, we can have different licenses for subcomponents.

5. Others are welcome to contribute code which runs on the Java Virtual Machine. All contributions will be gratefully acknowledged. Anyone who would like check-in access, please let me know.

--
vishvAs
[yasya dviradAdyAH pAriShadyAH parashshataM vighnaM nighnanti satataM vishvaksEnaM tamAshrayE.
vakratuNDa mahAkAya sUrya-kOTi-samaprabha, nirvighnaM kuru mE dEva sarva kAryEShu sarvadA.]

hnbhat B.R.

Aug 19, 2011, 10:39:02 PM
to sams...@googlegroups.com
I welcome your effort. I am not at all familiar with computer languages, only with traditional Sanskrit morphology. All my best wishes for its success.

Here is one resource offering a different analysis of the synonyms found in the Amarakosha:


I don't know whether it would help you. There are two searchable databases on CD for the Indian lexicons: वाचस्पत्य of Taranatha Bhattacharya and शब्दकल्पद्रुम of Radhakantha Debh, released by Rashtriya Samskrita Samsthan and Sanskrit Academy respectively. Both give the traditional morphological analysis of the words they include, with the sources quoted.

Both are also available in electronic versions in PDF format. Both can be used for data collection.

--
Dr. Hari Narayana Bhat B.R. M.A., Ph.D.,
Research Scholar,
Ecole française d'Extrême-Orient
Centre de Pondichéry
16 & 19, Rue Dumas
Pondichéry - 605 001


vishvAs vAsuki

Aug 20, 2011, 5:09:00 PM
to sams...@googlegroups.com, Sanskrit Team_member, Sanskrit Questions
My gratitude for your blessings, respected Sri Harinarayana Bhat. It astonishes me that there is not a single computer program in the world by which Panini's sutras are applied completely and correctly! This can happen only with the guidance of grammarians! I, however, do not know the science of grammar.

Potential contributors are welcome to join the sanskrit-programmers mailing list I just created for this sort of work (currently it is empty). Below is my assessment of the current state of Sanskrit processing work. Please give me any feedback or corrections you may have.

Natural Language Processing in general is a thriving field, with open-source projects such as OpenNLP.

Several academics have done valuable work in Sanskrit NLP. From separate conversations, I gather the following impressions. Their aims so far have been to develop tools and algorithms that help a reader comprehend Sanskrit text by doing the following:

  1. Digitize dictionaries (1), sUtras and thesauruses (1), and enable online search (1, 2). Some online dictionaries enable collaborative editing. They have the following limitations:
    • Collaboratively updated dictionaries are not publicly available for download.
    • They don't currently provide an online API (application programming interface) to build on them easily.
  2. Develop tools which model and illustrate the application of various sandhi (1, 2) and inflection (1, 2, 3, 4, 5, 6) rules. These can in turn be used to analyze inflected words (1, 2, 3, 4), do sandhi analysis (1, 2) and produce dictionaries of inflected words (1).
    • Inflected word generation is usually based on the 'word and paradigm' model, close to works such as ruupa chandrikaa, which gives the naamaruupaavalii for 'typical' words ending in different var.nas in different lingas. This is found to be very useful and accurate in the analysis of classical Sanskrit texts.
    • Limitation: as a generative model, however, the above is not perfect. Because it is not based firmly on pANini's rules (which separate saMskR^ita from apabhraMShA), it may generate wrong inflections.
  3. Mechanically parse (1) Sanskrit text and do part-of-speech tagging (1).
  4. Translate Sanskrit into a more familiar language (1).
  5. Tools to identify metre (1, 2).
  6. Tools to help understand grammar sUtras (1, 2, 3, 4, 5, 6).
  7. Transliteration tools (1, 2, 3, 4, 5 ...) and formal attempts at encoding Indian scripts in Unicode (1).
  8. Sanskrit optical character recognition (OCR) tools (1).
In rare cases, source code for the Sanskrit tools above is available, but most of them are not open source, and there is quite a bit of duplication of effort; a culture of boundless sharing is mostly absent. Besides the limitations noted above, what is conspicuously missing are tools directed at meeting important needs of the popular spoken Sanskrit movement, especially as we increasingly interact with information through computers and the internet:
  1. Consuming documents and webpages written in other languages in saMskRRita (there is no Google Translate-like tool at present, nor will there be one in the near future).
  2. Sanskrit UI versions of commonly used software don't exist (unlike Arabic, Hebrew ...).
  3. There are no good Sanskrit browser scripts or extensions to do common things like looking up word meanings with a click or a mouse-over.
  4. There is no effort at generating Sanskrit content easily. E.g., the Sanskrit Wikipedia is nowhere close to the English version. The same goes for the Wiktionary.
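To make the 'word and paradigm' limitation mentioned above concrete, here is a minimal sketch. It is written in Python only for brevity (the project itself targets the JVM), and it is not taken from any of the tools surveyed. The table name `MASC_A_SINGULAR` and the function `decline_like_deva` are hypothetical, and the transliteration assumes A = long a, H = visarga, N = retroflex n.

```python
# Singular endings for masculine nouns ending in short "a", copied from the
# paradigm word "deva" (devaH, devam, devena, devAya, devAt, devasya, deve, deva).
# Hypothetical table for illustration; a real tool would cover all numbers.
MASC_A_SINGULAR = {
    "nominative": "aH",
    "accusative": "am",
    "instrumental": "ena",
    "dative": "Aya",
    "ablative": "At",
    "genitive": "asya",
    "locative": "e",
    "vocative": "a",
}


def decline_like_deva(word):
    """Return the singular forms of `word` by reusing the endings of 'deva'."""
    if not word.endswith("a"):
        raise ValueError("this sketch only handles words ending in short 'a'")
    stem = word[:-1]  # drop the final short "a" before attaching an ending
    return {case: stem + ending for case, ending in MASC_A_SINGULAR.items()}


forms = decline_like_deva("rAma")
print(forms["genitive"])      # rAmasya
print(forms["instrumental"])  # rAmena
```

The genitive comes out right, but the instrumental does not: the attested form is rAmeNa, because the preceding r retroflexes the n. A table copied from 'deva' knows nothing about that rule, which is the kind of wrong inflection a purely paradigm-based generator can produce.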

--
vishvAs

--
You received this message because you are subscribed to the Google Groups "samskrita" group.
To post to this group, send email to sams...@googlegroups.com.
To unsubscribe from this group, send email to samskrita+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/samskrita?hl=en.

Siddhi Barve

Sep 4, 2014, 6:27:58 AM
to sams...@googlegroups.com, murt...@gmail.com
Hello,
I am doing a project on text summarization in Sanskrit.
Has anybody worked with Sanskrit text in Java?

prabhat kumar singh

Oct 6, 2015, 6:57:25 PM
to samskrita, murt...@gmail.com
Namaste.

The repository seems to be dead now, going by the URL https://bitbucket.org/vvasuki/sanskritnlp/

Has it been moved?

Best Regards

Anil Srivastava

May 8, 2022, 3:19:48 PM
to samskrita
We are very interested in open-source Sanskrit NLP and would therefore like to connect with this group. We are pursuing an International Ayurveda Developmental Therapeutics Program (ADTP), which would greatly benefit from extracting knowledge from the Sanskrit corpus. Fortunately, there is a digitized corpus of Sanskrit texts on Ayurveda, and we are working with a group of international scholars who are familiar with the texts. Please write to anil[dot]srivastava[at]ohsl[dot]us.

Anunad Singh

May 9, 2022, 12:37:40 AM
to sams...@googlegroups.com
As far as I know, about 10 to 12 Ayurvaidika granthas are available as machine-readable text. I guess there are more than 30 well-known Ayurvaidika granthas. So we still need to digitize many Ayurvaidika works or convert them to machine-readable text.

I was wondering what exactly you want to do using NLP.

-- anunAda


Anunad Singh

May 9, 2022, 7:15:51 AM
to sams...@googlegroups.com
I would also like to say that, as I understand it, we are mostly interested in data mining rather than NLP of the Sanskrit texts. NLP extracts the meaning and analyzes the grammatical structure of the text the user inputs; text mining extracts features of the documents, such as word frequency, average word length, average sentence length, etc.
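As a rough illustration of the text-mining features just listed, here is a minimal Python sketch. The function name `text_features` is hypothetical, and the sketch assumes that words are separated by whitespace and that sentences end at a danda (।), a double danda (॥) or Western punctuation. It does no sandhi splitting, and word length is counted in characters (code points), not aksharas.

```python
import re
from collections import Counter


def text_features(text):
    """Compute simple surface features of a text (no linguistic analysis)."""
    # Split into sentences at danda, double danda, or Western end punctuation.
    sentences = [s.strip() for s in re.split(r"[।॥.!?]+", text) if s.strip()]
    # Treat whitespace-separated tokens as words.
    words = [w for s in sentences for w in s.split()]
    return {
        "word_frequency": Counter(words),
        "average_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "average_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


features = text_features("rAmaH vanaM gacchati. sItA vanaM gacchati.")
print(features["word_frequency"].most_common(2))
print(features["average_sentence_length"])  # 3.0
```

Note that splitting on "." would break transliteration schemes that use a dot inside words (such as "var.na"), so the sentence pattern would need adjusting for such input.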

One good example of Sanskrit text mining I read about: about 4000 examples and counterexamples were collected from four commentaries on the ashTAdhyAyI.

-- anunAda

Anunad Singh

May 9, 2022, 7:45:40 AM
to sams...@googlegroups.com
Sorry, they collected 40 thousand examples and counterexamples (not four thousand).
