Help regarding Sanskrit LLM

yash datta

Mar 23, 2024, 12:47:19 PM

to bvpar...@googlegroups.com

Respected Scholars,

Pranam

With the recent advances in open large language models, I have been working towards creating a Sanskrit large language model that I think will be immensely useful for anyone interested in learning Sanskrit or conducting research in it.

To that end, I have the technical know-how to create these models but lack a grounding in Sanskrit and the clean, diverse data required to bring this about. I am therefore seeking the help of this group of scholars in assembling a Sanskrit text corpus that can be used to create such a tool.

To validate what little I know, I went ahead and created one trial model (which I named Rudra) using a GitHub repo that I came across here: https://github.com/sanskrit/raw_etexts

Many thanks to the repo contributors for this effort. I also included a corpus of ~24K Sanskrit Wikipedia articles from here: https://www.kaggle.com/datasets/disisbig/sanskrit-wikipedia-articles

The model is uploaded to Hugging Face with directions to test it: https://huggingface.co/saucam/Rudra-7b

The dataset used to train the model is available here: https://huggingface.co/datasets/saucam/sans_data/viewer

Everything is Apache 2.0 licensed.

Trial runs of this model are encouraging and hence I summoned the courage to write to this group.

What I want help with:

I am looking to collect an even larger (and cleaner) Sanskrit text corpus to further train text generation models. So please reach out if you could contribute such data.

I also want to create an instruction-following model, for which we need data in question-and-answer or chat format, something like this:
[
  {
    "content": "Could you please tell me in Hindi about Edward Johnston who won the Medal of Honor?",
    "role": "user"
  },
  {
    "content": "एडवर्ड जॉनस्टन का जन्म 8 फरवरी 1844 को, पेन यान, न्यूयॉर्क में हुआ। उन्होंने अमेरिकी सेना में शामिल होकर इंडियन वार्स के दौरान 5वें यूएस कवलरी में सेवा की। 21 अक्टूबर 1876 से 8 जनवरी 1877 तक, उन्होंने मोंटाना में सिउक्स के खिलाफ एक महत्वपूर्ण मिशन का हिस्सा बनाया। इस मिशन के दौरान अपने बहादुरीपूर्ण कार्यों के लिए, एडवर्ड जॉनस्टन उन 31 व्यक्तियों में से एक थे जिन्हें मेडल ऑफ हॉनर से पुरस्कृत किया गया। अपनी सैनिक सेवा के बाद, एडवर्ड जॉनस्टन अपने घर लौटे और 20 जनवरी 1920 को 75 वर्ष की उम्र में उनकी मृत्यु हो गई। उन्हें लेकव्यू समाधि स्थल में गाड़ा गया। ",
    "role": "assistant"
  }
]

Of course, I am looking for such data in Sanskrit (though the instructions themselves can be in Sanskrit, in English, or a mix of both).
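For anyone preparing such data, here is a minimal sketch of how contributed conversations could be checked before being added to a dataset. The function names, the set of allowed roles, and the rule that a conversation should end with an assistant reply are my own assumptions for illustration, not a fixed requirement of any training library.

```python
import json

# Assumed role names; adjust to whatever the training setup expects.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_conversation(messages):
    """Return a list of problems found in one conversation (empty if none)."""
    if not isinstance(messages, list) or not messages:
        return ["conversation must be a non-empty list of messages"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or missing content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("conversation should end with an assistant reply")
    return problems


def to_jsonl(conversations):
    """Serialize conversations as JSON Lines, keeping Devanagari human-readable."""
    return "\n".join(json.dumps(c, ensure_ascii=False) for c in conversations)


# Hypothetical sample record, only to show the shape of the data.
sample = [
    {"role": "user", "content": "Please greet me in Sanskrit."},
    {"role": "assistant", "content": "नमस्ते।"},
]
print(validate_conversation(sample))
print(to_jsonl([sample]))
```

Writing with `ensure_ascii=False` keeps the Devanagari text readable in the file instead of turning it into `\u` escapes, which makes manual proofreading by scholars much easier.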

Secondly, I need expertise in creating evaluation benchmarks for these models. These usually consist of questions that are put to the model, whose responses are then rated on certain metrics.

Currently, there are numerous such evaluation suites for English (and for some other languages), but I would need help in translating them into Sanskrit.
To give just one example, there is EQ-bench, which aims to evaluate the emotional intelligence of these models, and one of the questions it asks the model is this:

"writing_prompt": "Fairy Tale Retelling: Rewrite the story of Hansel and Gretel from the perspective of the witch, in the format of raw, terse stream-of-consciousness diary entries written in her style & voice. She may at times be an unreliable narrator. She sees herself as fundamentally good and portrays herself sympathetically; she believes she is misunderstood and has a tragic backstory. Include snippets of dialogue between the witch and the children in a way that feels natural for a diary entry. You may take liberties with the original story. The witch will not die in this version; she needs to be able to write her final entry. It will not be happily ever after. <SEED> 600-800 words.",

"judging_criteria": [
{
"prefix_text": "Now, rate the supplied model output on the following criteria:",
"criteria": [
"Overall Impression",
"Overall Reader Engagement",
"Clever / Witty",
"Gripping",
"Effective Use of Tropes: If applicable, common narrative tropes are employed thoughtfully and subverted, deconstructed, or used in service of the story's themes and character",
"Sentences Flow Naturally",
"Well-earned Lightness or Darkness"
]
},
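To make the idea concrete, here is a rough sketch of how one such test item could be turned into a request for a judge and how the per-criterion ratings could be combined. This is not EQ-bench's actual harness; the function names, the layout of the judge prompt, and the simple averaging are assumptions of mine, and the numeric scores in the example are placeholders.

```python
from statistics import mean


def build_judge_prompt(writing_prompt, model_output, judging):
    """Assemble the text shown to a judge (human or model) for one test item."""
    lines = [
        "Prompt given to the model:",
        writing_prompt,
        "",
        "Model output:",
        model_output,
        "",
        judging["prefix_text"],
    ]
    lines += [f"- {criterion}" for criterion in judging["criteria"]]
    return "\n".join(lines)


def aggregate_scores(scores, criteria):
    """Average the per-criterion scores; fail loudly if a criterion is unscored."""
    # Keep only the short name before any ':' description in a criterion.
    names = [c.split(":")[0] for c in criteria]
    missing = [n for n in names if n not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return mean(scores[n] for n in names)


judging = {
    "prefix_text": "Now, rate the supplied model output on the following criteria:",
    "criteria": ["Overall Impression", "Sentences Flow Naturally"],
}
prompt = build_judge_prompt("(translated prompt)", "(model answer)", judging)
# Placeholder ratings on an assumed 0-10 scale.
overall = aggregate_scores(
    {"Overall Impression": 6, "Sentences Flow Naturally": 8},
    judging["criteria"],
)
print(prompt)
print(overall)
```

For a Sanskrit version, the writing prompt and the criteria names would be the parts needing careful translation, while the surrounding code would stay the same.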

I request you to please reach out if you can help me with these initiatives.

I have created a Discord server for easy communication, but we can communicate via email or WhatsApp if that is more convenient.

Discord link: https://discord.gg/dgPhZVAw

Please let me know what you think. Any suggestions are welcome.

Best Regards

Yash

Aditya Asopa

Mar 27, 2024, 5:58:17 AM

to भारतीयविद्वत्परिषत्

Namaskar,

This is a very good initiative. I am new to this, but I would love to contribute in whatever way possible.

I have a few questions (sorry if I sound naive):

1. There are many LLMs that already "know" Sanskrit (though they may not be good at it) and are good at NLP. Can your LLM leverage the knowledge already captured by these very large models?

2. What kind of training data would you like to get access to?

3. It makes more sense to me to train not just on Sanskrit texts but also on their commentaries in English, German, Hindi, etc. In that case, would it not be better to build on top of an already generalized model?

4. Could you specify your goal with this LLM? I ask because each layer of generalization, viz. translation, generation of text, poetry, essays, commentaries, explanation, etc., will require scaling the model up by many parameters. How big a model do you think would be sufficient?

Warm regards

Aditya

Dr soumya kumar

Mar 27, 2024, 7:07:55 AM

to bvpar...@googlegroups.com

I am also working on LLMs for Sanskrit, and the main issue has been getting clean input data. Most of the input data had errors.
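As a rough illustration of a first-pass cleanup, the sketch below normalizes Unicode, collapses whitespace, drops exact duplicate lines, and discards lines that are mostly not in Devanagari. The function names and the 0.8 threshold are assumptions for illustration. It only handles Devanagari text (romanized or IAST text would be filtered out), and it cannot catch spelling or sandhi errors, which still need human proofreading.

```python
import unicodedata


def devanagari_ratio(text):
    """Fraction of letters and combining marks in the Devanagari block (U+0900-U+097F)."""
    chars = [
        ch for ch in text
        if ch.isalpha() or unicodedata.category(ch).startswith("M")
    ]
    if not chars:
        return 0.0
    deva = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return deva / len(chars)


def clean_lines(lines, min_ratio=0.8):
    """Normalize, de-duplicate, and filter lines; min_ratio is an assumed threshold."""
    seen = set()
    kept = []
    for line in lines:
        line = unicodedata.normalize("NFC", line)  # one canonical form per character
        line = " ".join(line.split())              # collapse runs of whitespace
        if not line or line in seen:
            continue
        if devanagari_ratio(line) < min_ratio:
            continue                               # e.g. page headers, stray English
        seen.add(line)
        kept.append(line)
    return kept


raw = [
    "धर्मक्षेत्रे  कुरुक्षेत्रे",   # extra space
    "धर्मक्षेत्रे कुरुक्षेत्रे",    # duplicate once whitespace is collapsed
    "Page 12 of 300",             # residue from a scanned edition
    "",
]
cleaned = clean_lines(raw)
print(cleaned)
```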

--
You received this message because you are subscribed to the Google Groups "भारतीयविद्वत्परिषत्" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bvparishat+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bvparishat/b4789c3f-28c8-4196-877b-bfd637d4281en%40googlegroups.com.

BVK Sastry (G-S-Pop)

Mar 27, 2024, 11:51:39 PM

to bvpar...@googlegroups.com, Aditya Asopa

Namaste Aditya Asopa and Yash Datta

(I prefer to be reached offline, off-forum, before other forms of interaction for further discussion.)

1. Appreciation for the good work started and for the openness to work in a 'TEAM' (a difficult trait amongst software developers, and more so when working with a classical language like Sanskrit!).

2. I have one pointer that I suggest you explore; please let me know your current understanding of the trend that is going on.

Metaverse Standards Register - Metaverse Standards Forum (metaverse-standards.org)

https://metaverse-standards.org/domain-groups/standards-register/

The Standards Register Working Group is developing the above products using Web tools such as GitHub and Google Forms, and bespoke implementations of other tools.

3. I have a specific interest in, and need for, implementing the Natural Language Toolkit for Indic Languages. Link below.

Natural Language Toolkit for Indic Languages (iNLTK) aims to provide out-of-the-box support for various NLP tasks that an application developer might need.

GitHub - goru001/inltk: https://github.com/goru001/inltk

Help from all corners welcome.

Regards

BVK Sastry
