Fwd: Help regarding Sanskrit LLM


विश्वासो वासुकेयः

Mar 28, 2024, 6:13:18 AM
to sanskrit-programmers


---------- Forwarded message ---------
From: yash datta <Unknown>
Date: Saturday 23 March, 2024 at 10:17:19 pm UTC+5:30
Subject: Help regarding Sanskrit LLM
To: bvpar...@googlegroups.com <Unknown>


Respected Scholars,

Pranam

With the recent advances in open large language models, I have been working towards creating a Sanskrit large language model that I think will be immensely useful for anyone interested in learning Sanskrit or even conducting research in it.

To that end, I have the technical know-how to create these models but lack the fundamentals of Sanskrit and the clean, diverse data that this requires. I am therefore asking this group of amazing scholars to help me assemble a Sanskrit text corpus that can be used to create such a tool.

To validate whatever little I know, I went ahead and created one trial model (which I named Rudra) using an amazing GitHub repo that I came across here: https://github.com/sanskrit/raw_etexts
Many thanks to the repo contributors for this amazing effort. I also included a corpus of ~24K Sanskrit Wikipedia articles from here: https://www.kaggle.com/datasets/disisbig/sanskrit-wikipedia-articles

The model is uploaded to Hugging Face, with directions to test it: https://huggingface.co/saucam/Rudra-7b
The dataset used to train the model is available here: https://huggingface.co/datasets/saucam/sans_data/viewer

Everything is Apache 2.0 licensed.

Trial runs of this model are encouraging and hence I summoned the courage to write to this group.

What I want help with:
I am looking to collect an even larger (and cleaner) Sanskrit text corpus to further train text generation models. So please reach out if you can contribute such data.
I also want to create an instruction-following model, for which we need data in question-and-answer or chat-style format, something like this:
[ { "content": "Could you please tell me in Hindi about Edward Johnston who won the Medal of Honor?", "role": "user" }, { "content": "एडवर्ड जॉनस्टन का जन्म 8 फरवरी 1844 को, पेन यान, न्यूयॉर्क में हुआ। उन्होंने अमेरिकी सेना में शामिल होकर इंडियन वार्स के दौरान 5वें यूएस कवलरी में सेवा की। 21 अक्टूबर 1876 से 8 जनवरी 1877 तक, उन्होंने मोंटाना में सिउक्स के खिलाफ एक महत्वपूर्ण मिशन का हिस्सा बनाया। इस मिशन के दौरान अपने बहादुरीपूर्ण कार्यों के लिए, एडवर्ड जॉनस्टन उन 31 व्यक्तियों में से एक थे जिन्हें मेडल ऑफ हॉनर से पुरस्कृत किया गया। अपनी सैनिक सेवा के बाद, एडवर्ड जॉनस्टन अपने घर लौटे और 20 जनवरी 1920 को 75 वर्ष की उम्र में उनकी मृत्यु हो गई। उन्हें लेकव्यू समाधि स्थल में गाड़ा गया। ", "role": "assistant" } ]

Of course, I am looking for such data in Sanskrit (though the instructions can be mixed: Sanskrit or English).
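
For anyone preparing such data, here is a minimal sketch of how chat-style records of this shape could be checked and written out one conversation per line (JSON Lines). The function names, the set of allowed roles, and the "messages" wrapper key are assumptions made for illustration, not a format the Rudra training code is known to require; the example content is a placeholder.

```python
import json

# Assumption: these three roles are the only ones expected in the data.
ALLOWED_ROLES = {"user", "assistant", "system"}


def validate_conversation(messages):
    """Return a list of problems found in one chat-style conversation."""
    if not isinstance(messages, list) or not messages:
        return ["conversation must be a non-empty list of messages"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        content = msg.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    return problems


def to_jsonl_line(messages):
    """Serialize one conversation as a single JSON Lines record."""
    # ensure_ascii=False keeps Devanagari readable instead of \u escapes.
    return json.dumps({"messages": messages}, ensure_ascii=False)


# Placeholder conversation; real entries would carry full questions and answers.
example = [
    {"role": "user", "content": "What is this language called in Sanskrit?"},
    {"role": "assistant", "content": "संस्कृतम्"},
]

if __name__ == "__main__":
    issues = validate_conversation(example)
    if issues:
        print("\n".join(issues))
    else:
        print(to_jsonl_line(example))
```

Contributors could run a check like this over their files before sharing them, so that malformed entries are caught early.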

Secondly, I need expertise in creating evaluation benchmarks for these models. These usually consist of questions that are put to the model, with the responses then rated on certain metrics.
Currently, there are numerous such evaluation suites for English (and for some other languages), but I would need help translating them into Sanskrit.
To give just one example, EQ-bench looks to evaluate the emotional intelligence of these models, and one of the questions it asks the model is this:

  "writing_prompt": "Fairy Tale Retelling: Rewrite the story of Hansel and Gretel from the perspective of the witch, in the format of raw, terse stream-of-consciousness diary entries written in her style & voice. She may at times be an unreliable narrator. She sees herself as fundamentally good and portrays herself sympathetically; she believes she is misunderstood and has a tragic backstory. Include snippets of dialogue between the witch and the children in a way that feels natural for a diary entry. You may take liberties with the original story. The witch will not die in this version; she needs to be able to write her final entry. It will not be happily ever after. <SEED> 600-800 words.",

"judging_criteria": [
{
"prefix_text": "Now, rate the supplied model output on the following criteria:",
"criteria": [
"Overall Impression",
"Overall Reader Engagement",
"Clever / Witty",
"Gripping",
"Effective Use of Tropes: If applicable, common narrative tropes are employed thoughtfully and subverted, deconstructed, or used in service of the story's themes and character",
"Sentences Flow Naturally",
"Well-earned Lightness or Darkness"
]
},
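
To make the structure above concrete, here is a rough sketch of how an entry with a "writing_prompt" and "judging_criteria" could be turned into a prompt for a judge (a human rater or another model). This is not EQ-bench's actual code: the function name and the [PROMPT] / [MODEL OUTPUT] labels are assumptions for illustration, and the writing prompt is shortened to a placeholder.

```python
def build_judge_prompt(item, model_output):
    """Assemble a rating prompt from one benchmark entry and a model's answer."""
    blocks = []
    for group in item["judging_criteria"]:
        # Each group has an instruction line followed by its criteria.
        lines = [group["prefix_text"]]
        lines += [f"- {criterion}" for criterion in group["criteria"]]
        blocks.append("\n".join(lines))
    return (
        "[PROMPT]\n" + item["writing_prompt"] + "\n\n"
        "[MODEL OUTPUT]\n" + model_output + "\n\n"
        + "\n\n".join(blocks)
    )


# Entry shaped like the excerpt above; the writing prompt is a placeholder.
item = {
    "writing_prompt": "Fairy Tale Retelling: ...",
    "judging_criteria": [
        {
            "prefix_text": "Now, rate the supplied model output on the following criteria:",
            "criteria": [
                "Overall Impression",
                "Overall Reader Engagement",
                "Gripping",
                "Sentences Flow Naturally",
            ],
        }
    ],
}

if __name__ == "__main__":
    print(build_judge_prompt(item, "<model response goes here>"))
```

A Sanskrit version of such a benchmark would need the writing prompts and the criteria translated, while the surrounding structure could stay the same.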



Request you to please reach out in case you can help me with these initiatives.

I have created a Discord server for easy communication, but we can communicate via email / WhatsApp if that is more convenient.

Discord link: https://discord.gg/dgPhZVAw

Please let me know what you think. Any suggestions are welcome.

Best Regards
Yash

Prabhat kumar SINGH

May 14, 2024, 11:03:40 AM
to sanskrit-programmers
Is there an email address for Yash Datta?

विश्वासो वासुकिजः (Vishvas Vasuki)

May 14, 2024, 12:25:25 PM
to sanskrit-p...@googlegroups.com

--
Vishvas /विश्वासः
