Re: Vidyut: a high-performance Sanskrit toolkit


विश्वासो वासुकिजः (Vishvas Vasuki)

Jan 6, 2023, 9:25:22 AM
to Madhav Deshpande, aruNa-prasAdaH अरुणप्रसादः, ambuda-discuss

On Fri, 6 Jan 2023 at 19:35, Madhav Deshpande <mmd...@umich.edu> wrote:
I have no programming background whatsoever, yet I find this demo very impressive. Congratulations to those who are working on this project. Best wishes,

Madhav M. Deshpande
Professor Emeritus, Sanskrit and Linguistics
University of Michigan, Ann Arbor, Michigan, USA
Senior Fellow, Oxford Center for Hindu Studies
Adjunct Professor, National Institute of Advanced Studies, Bangalore, India

[Residence: Campbell, California, USA]


On Thu, Jan 5, 2023 at 8:47 PM विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:


---------- Forwarded message ---------
From: Arun <aru...@gmail.com>
Date: Fri, 6 Jan 2023 at 09:37
Subject: Re: Vidyut: a high-performance Sanskrit toolkit
To: sanskrit-programmers <sanskrit-p...@googlegroups.com>


Thanks -- we'll take this feedback into account and update our demo soon.

I have made some small changes to the demo, and although it could still use more work, I believe it is ready to share more widely. I have posted to samskrita but don't have posting rights on bvparishat, and I am sure there are other interested lists I am not aware of.

I would be grateful if members of the list, if they deem this project worthy, could share the announcement below with whoever they feel would find it interesting:

~

नमो विद्वद्भ्यः (Salutations to the learned) --

vidyut-prakriya generates Sanskrit words by applying Paninian rules step-by-step. Our long-term goal is for the program to generate all valid Paninian words.

Summary
vidyut-prakriya is heavily indebted to the SanskritVerb generator from Dr. Dhaval Patel, I.A.S and Dr. Shivakumari Katuri, and we are grateful to the authors for their encouragement in this project.

If you are familiar with SanskritVerb, vidyut-prakriya offers the following improvements:

- It adds many more forms (जुगुप्सते etc., अतत etc., -आम्बभूव etc.).
- It fixes various small bugs.
- Its prakriyas generally have more detail, especially for it-Agama rules.
- It can run in a web browser without an internet connection.
- It is much faster. On my laptop, vidyut-prakriya can generate all kartari-tinantas in about 4 seconds. This speed is especially useful for testing and for natural language processing tools.
- It has partial support for sanAdi forms. These have not been tested very much, so please use with caution.

Demo: https://ambuda-org.github.io/vidyullekha/ -- click on a dhatu to see its tinanta padas, and click on a pada to see its prakriya.

Notes:
We are sharing our system publicly so that we can collect feedback and discover bugs sooner, so please share it widely.

- For bug reports, please use https://github.com/ambuda-org/vidyut/issues .
- We are eager to partner with scholars and experts to better test our system. If you are interested in working with us more closely, please contact us at https://ambuda.org/contact.
- We have partial support for subantas and krdantas, but these are not available in the demo yet.
- Testing was done by comparing our program to the output of SanskritVerb. Most differences have been accounted for and are (we believe) in vidyut-prakriya's favor, but a few small errors likely remain.
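
The comparison-based testing described in the last note could look roughly like the sketch below. It is only an illustration: the `Forms` layout, the `diff_outputs` helper, and the sample entries are assumptions made up for this example, not the actual test harness or real output from either program.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: (dhatu, lakara) -> set of generated forms.
Forms = Dict[Tuple[str, str], Set[str]]


def diff_outputs(ours: Forms, reference: Forms) -> Dict[Tuple[str, str], Dict[str, Set[str]]]:
    """Report every key where the two generators disagree."""
    report = {}
    for key in sorted(set(ours) | set(reference)):
        a = ours.get(key, set())
        b = reference.get(key, set())
        if a != b:
            # Keep both directions so each difference can be reviewed by hand.
            report[key] = {"only_ours": a - b, "only_reference": b - a}
    return report


# Made-up sample data, for illustration only.
ours: Forms = {("bhU", "laT"): {"bhavati"}, ("x", "laT"): {"form-a", "form-b"}}
reference: Forms = {("bhU", "laT"): {"bhavati"}, ("x", "laT"): {"form-a"}}

for key, delta in diff_outputs(ours, reference).items():
    print(key, delta)
```

Each reported difference still needs a human decision about which program is right, which is why the report keeps the two sides separate instead of merging them.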

Regards,

Arun Prasad

On Sunday, January 1, 2023 at 7:58:18 PM UTC-8 drdhav...@gmail.com wrote:
Also at the end of the prakriya, it would make sense to show the final form as the last step.

On Mon, 2 Jan 2023 at 9:24 AM, Dhaval Patel <drdhav...@gmail.com> wrote:
I checked the tool. Kind request to show the text of the sutra along with the sutra number. 

On Mon, 2 Jan 2023 at 7:20 AM, Arun <aru...@gmail.com> wrote:
Thanks to Shreevatsa, we now have an online demo of vidyut-prakriya available:


This demo runs entirely in the browser. I've noticed that Safari is quite a bit slower than Chrome and Firefox, though.

Currently, parasmaipada / Atmanepada forms are mixed in the same table, which is confusing. Once this is fixed, I'll circulate the tool more widely.

Arun

On Wednesday, December 28, 2022 at 10:17:51 PM UTC-8 Arun wrote:
(Replied on GitHub, cross-posting here.)

Sadly it would not be simple -- details below. My current focus is on making vidyut-prakriya fast and correct, and I think exploring this problem is a better fit for a fork of this repo. [Edit to add: I think this problem is a great one, and I am happy to help someone use our library to explore it.]

~

Right now, I use a similar approach to SanskritVerb: I hand-code a specific rule ordering. I haven't followed any specific philosophy except "produce the correct padas with a reasonable-looking prakriya." But since this project is heavily inspired by SanskritVerb (which, I believe, draws on the work of Smt. Pushpa Dikshit), in practice I am using a पौष्पी प्रक्रिया.

Regarding rule selection, my thinking was: as long as the major sections of the Ashtadhyayi are unimplemented, modeling the resolution of rule conflicts is extra complexity and a new source of bugs and degraded performance. For my current needs, I want to focus first on generating the correct forms quickly with reasonable prakriyas.

However, I agree that modeling rule conflicts is tremendously useful for the reasons you mention.

Because these changes are substantial, I think a fork is best, but such a fork could lean on this library's rich APIs and exhaustive test suite. So perhaps something like this would be workable:

  1. Refactor each rule so that it's in its own function. Then, create a new function that receives the name of a rule and runs the associated function on the current prakriya. This lets us have dynamic control flow.
  2. Find a way to examine the "meta" aspects of each rule: which properties they select on, which samjnas they use, which changes they cause, etc. The only academic work I'm familiar with in this vein is here, but I don't know if it's public. Otherwise, some work might be required to create an inspectable representation of each rule. (One hacky approach might be to walk the AST we create in (1).)
  3. Implement a function that accepts a prakriya and returns the rule that should be run. This is the core logic we would test.
  4. Validate the implementation of (3) against our test suite.
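
Steps 1, 3, and 4 above might be sketched as follows. This is a toy under stated assumptions, not vidyut-prakriya's code: the `Prakriya` class, the rule names `toy-guna` and `toy-av`, and the string encoding are all hypothetical, and step 2 (inspecting rule metadata) is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Prakriya:
    """Toy derivation state: the current string plus the rules applied so far."""
    text: str
    history: List[str] = field(default_factory=list)


# Step 1: each rule lives in its own function (hypothetical toy rules).
def toy_guna(p: Prakriya) -> None:
    p.text = p.text.replace("bhU+a", "bho+a")


def toy_av(p: Prakriya) -> None:
    p.text = p.text.replace("bho+a", "bhav+a")


# Step 1, continued: look a rule up by name and run it on the prakriya.
RULES: Dict[str, Callable[[Prakriya], None]] = {
    "toy-guna": toy_guna,
    "toy-av": toy_av,
}


def apply_rule(name: str, p: Prakriya) -> None:
    RULES[name](p)
    p.history.append(name)


# Step 3: given a prakriya, decide which rule (if any) should run next.
# A real selector would encode a conflict-resolution policy here.
def choose_rule(p: Prakriya) -> Optional[str]:
    if "bhU+a" in p.text:
        return "toy-guna"
    if "bho+a" in p.text:
        return "toy-av"
    return None


def derive(p: Prakriya, max_steps: int = 100) -> Prakriya:
    # The step limit guards against a selector that never terminates.
    for _ in range(max_steps):
        name = choose_rule(p)
        if name is None:
            break
        apply_rule(name, p)
    return p


# Step 4: validate the selector against a known-good result.
result = derive(Prakriya("bhU+a+ti"))
assert result.text == "bhav+a+ti"
print(result.text, result.history)
```

Because `choose_rule` is the only place that encodes ordering, a different conflict-resolution policy could be swapped in there and checked against the same expected outputs.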

This procedure is quite promising because each step can lean on our test suite, so the developer can always know that the overall system is in a reasonable state. While it would still involve quite a bit of effort, it's far less effort than writing a system from scratch.

Arun
On Wednesday, December 28, 2022 at 9:38:06 PM UTC-8 Vishvas Vasuki wrote:
On Wed, 28 Dec 2022 at 11:42, Arun <aru...@gmail.com> wrote:
Thanks for letting me know, Les. I'll update our readme to avoid confusion with your wonderful project.

~

As part of our work on Vidyut, we are releasing vidyut-prakriya, a fast Paninian word generator.

 
If you remember my post on our Python-based generator from earlier this year, here are the major improvements from that system.

- It's much more comprehensive. We currently have reasonable support for karmani prayoga, and I'll also add support for sanAdi pratyayas by the end of the year. We have experimental support for various krdantas and basic support for subantas.


Added a request https://github.com/ambuda-org/vidyut/issues/15 :

Currently, how are rule conflicts handled in prakriyA simulation? The regular interpretation of विप्रतिषेधे परं कार्यम्, augmented by a web of paribhAShA-s?

Would it be simple to implement an option to resolve such rule conflicts by means of the simpler framework described in Rishi Rajpopat's thesis, which recently entered the news and fascinated / surprised many? This would be enormously valuable in validating the claims made there, and would likely advance our understanding of what pANini intended and of any drawbacks therein.


 
- It has much better documentation.
- It's much faster. After compilation, my computer can generate all kartari tinantas in under 5 seconds. Incremental compile + generate takes about 15 seconds.
- It can be compiled to WebAssembly, which means that with a bit of work, it can run in the browser.

My hope is that vidyut-prakriya can eventually be a comprehensive reference implementation for the Ashtadhyayi, including subantas, tinantas, krdantas, taddhitAntas, chAndasa usage, and rules for svaras. The speed of the library is an important feature here: being able to run hundreds of thousands of test cases lets us make changes with confidence.

If you like व्याकरण and want to help improve this library, please see our Contributing section and consider joining our community. I would be very grateful for your help!

Please don't circulate this post on other large lists just yet. I want to have a WebAssembly demo ready first so that people can experiment with the system in their web browsers.

Arun
On Thursday, November 3, 2022 at 9:36:43 AM UTC-7 lesmor...@gmail.com wrote:
FYI, Vidyut is the name of the Windows Input Method Editor phonetic keyboard that I co-developed many years ago. The Vidyut keyboard enables direct typing of Unicode-compliant Devanāgarī and selected Sanskrit Vedic and metrical marks on Windows computers using a phonetic method. It has been available since 2010 as a free download.
On Tuesday, November 1, 2022 at 9:02:12 PM UTC-7 Arun wrote:

Summary
From the readme:

Vidyut is a lightning-fast toolkit for processing Sanskrit text. Vidyut aims to provide standard components that are fast, memory-efficient, and competitive with the state of the art.

Vidyut compiles to native code and can be bound to your language of choice. As part of our work on Ambuda, we provide first-class support for Python bindings through vidyut-py.

Vidyut is currently experimental code, and its API is not stable. If you wish to use Vidyut for your production use case, please file an issue first.

Components
From the readme:

Lexicon maps Sanskrit words to their semantics with high performance and minimal memory usage. In one recent test, we were able to store 29.5 million inflected Sanskrit words in 31 megabytes of disk space for a total cost of around 1 byte per word, and we were able to retrieve these words at around 820 ns/word, as compared to 530 ns/word for a standard in-memory hash map.
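
As a quick check of the figures quoted above, the arithmetic can be worked out as below. The only assumption is that "megabytes" means 10^6 bytes; the variable names are made up for this sketch.

```python
# Figures quoted from the readme.
WORDS = 29_500_000
DISK_BYTES = 31 * 10**6  # assumption: decimal megabytes (10^6 bytes)
LEXICON_NS_PER_WORD = 820
HASHMAP_NS_PER_WORD = 530

# Storage cost per inflected word.
bytes_per_word = DISK_BYTES / WORDS

# Lookup cost relative to the in-memory hash map baseline.
slowdown = LEXICON_NS_PER_WORD / HASHMAP_NS_PER_WORD

# Lookups per second implied by the per-word latency.
words_per_second = 1e9 / LEXICON_NS_PER_WORD

print(f"{bytes_per_word:.2f} bytes/word")
print(f"{slowdown:.2f}x the hash map's lookup time")
print(f"{words_per_second:,.0f} lookups/second")
```

So the quoted numbers work out to a little over one byte per word, at roughly one and a half times the lookup latency of the hash map.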

Segmenter performs a padaccheda on a Sanskrit phrase and annotates each pada with its basic morphological information.

Segmenter is not yet competitive with other options, but we are optimistic that we can improve it over time. What is quite special, however, is its sheer speed: Segmenter can process a shloka in under 10 milliseconds, and we expect it to become even faster in the future.


Context
The context for this toolkit is that many of Ambuda's challenges -- word analysis for texts not in DCS, high-quality "spellcheck" for our proofing work, pedagogical tools for learners, and more flexible interfaces for our search tools -- would be better supported if there were a standard set of high-quality modules that were performant enough to run on a commodity webserver.

While all of these components exist in the broader ecosystem of Sanskrit programs, I could not find an option that was public, high performance, and high quality. Vidyut is an effort to create a set of components that meets all three of these conditions.

The main technical item of note here is that Vidyut is implemented in Rust, which compiles to native code and has a rich ecosystem of bindings to other languages. We plan to offer first-class support for Python bindings through vidyut-py, and we plan to investigate PHP and WebAssembly bindings in the future as well.

Once our padaccheda engine is stable, I plan to revive my work on Padmini so that Vidyut will also have a complete prakriya engine.

Questions
I would like to find short Sanskrit names for these components and namespaces. Lexicon might be called Rupavali, and Segmenter might be called Chedaka, but I am open to suggestions.

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/d4480e9c-6023-4671-903c-9a491d6009efn%40googlegroups.com.


--
--
Vishvas /विश्वासः

--
Dr. Dhaval Patel
www.sanskritworld.in



--
You received this message because you are subscribed to the Google Groups "shabda-shAstram" group.
To unsubscribe from this group and stop receiving emails from it, send an email to shabda-shastr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/shabda-shastram/CAFY6qgH%3DReyXsGwSgaKqCEUxrK93rJGc7WkeRZkLB0kAD4tNYg%40mail.gmail.com.


--
--
Vishvas /विश्वासः

Arun

Jan 7, 2023, 1:17:08 AM
to विश्वासो वासुकिजः (Vishvas Vasuki), Madhav Deshpande, ambuda-discuss
Hello Professor Deshpande,

Thank you for the kind words. We are making rapid progress on krdantas and subantas and will add them to our demo by the end of the month. Look forward to more interesting work soon.

Arun

Madhav Deshpande

Jan 7, 2023, 1:21:38 AM
to Arun, विश्वासो वासुकिजः (Vishvas Vasuki), ambuda-discuss
Thank you so much, Arun Ji. As someone who is deeply into Sanskrit studies but has relatively little expertise in computer software, my request to you is to make your interface accessible to a wide-ranging audience. Best wishes for your ongoing work.

Madhav M. Deshpande
Professor Emeritus, Sanskrit and Linguistics
University of Michigan, Ann Arbor, Michigan, USA
Senior Fellow, Oxford Center for Hindu Studies
Adjunct Professor, National Institute of Advanced Studies, Bangalore, India

[Residence: Campbell, California, USA]

Arun Prasad

Jan 8, 2023, 11:51:08 AM
to ambuda-discuss
Hello Madhav ji,

It was a pleasure to meet you yesterday. I have updated the demo with basic krdanta-prakriyas, but the words here have not been tested thoroughly.

Here are some quick logistical notes on our generator:

Our demo will be at https://ambuda-org.github.io/vidyullekha/ for the foreseeable future. If the demo moves, I will update that page with a prominent link pointing to its new location.

Updates will be posted both to our project mailing lists (ambuda-discuss, ambuda-announce) and to a few public mailing lists that might be interested in this work (sanskrit-programmers, samskrita, bvparishat). Feel free to suggest other lists as well.

Our goal is to generate a very large set of well-formed Sanskrit words for our Ambuda library project. This generator has a variety of implicit constraints that I have used mostly unconsciously. To the extent that I can see them, I share them for your interest and curiosity:

- Pragmatism. Our goal is simply to generate valid Sanskrit words, where by "valid" I mean that at least one traditional authority accepts the word.
- Concision. Where possible, we implement a rule exactly once and invoke it in exactly one place.
- Consistency. Since this is a program, we are required to explicitly and exactly formalize how a rule should apply across all prakriyas.

You can help in several ways. Our main need right now is to "debug" the program by catching and fixing as many word errors as we can. I am fixing the errors I notice, but I am not a professional वैयाकरण and my insight goes only so far. So if you see erroneous or missing words, please tell us about them. We would also be grateful if you could share our work with any lists or scholars that might find it interesting. This is part of a rule of thumb in open-source code: if more people use our program, we can find more bugs.

I vaguely recall reading the observation (perhaps by Kiparsky) that even though many modern linguists have seen echoes of their pet theory in Panini, it is precisely because Panini focused on concision that his work is so fundamental and so lasting. In a much humbler way, I hope that even though our project's goals are pragmatic, our focus on concision and consistency will yield something that is useful across multiple domains.

Arun