From: BVK Sastry (G-S-Pop) [mailto:sastr...@gmail.com]
Sent: 16 August 2024 10:00
To: 'bvpar...@googlegroups.com'
Subject: RE: {भारतीयविद्वत्परिषत्} Dharmamitra machine translation
Namaste Abhinav Kadambi
Focused answers to your two questions:
1. Has anyone tested the efficacy of their currently launched machine translation?
- I have tested these models reasonably thoroughly; I have also gone through the 'output translations' of Sanskrit texts into English and of English texts into Sanskrit that my friends provided for review.
My summary conclusion: the output deserves a grade of 'C+' at best for Sanskrit. A lot more work is needed.
The effort is good. The outcome falls short of expectation and is misleading / undependable.
The root defect stems from the assumption that Samskrutham, a.k.a. Sanskrit (a language regulated by the Panini tradition), can be analysed by an LLM (Large Language Model) using an 'English-likeness' Techno-Anglo-linguistic rule base.
I attach the research paper used in the MADLAD project, for those who want to dig deeper into my pointers here.
2. Are there similar projects with Bharatiyas at the helm?
There are projects that claim collaborative, comparable, or similar kinds of work; they are sporadically distributed.
Many such projects have their noses pointed towards 'AI initiatives' built on a 'Natural Language LLM' (??).
The shortcoming of this approach is swept under the carpet, so that other interests stay prioritized over 'Native Natural Language Appropriateness'.
The critical need for designing a native natural Brahmi-LLM-based Panini language translator is little appreciated, nay, kept out of the discussion.
The result: 'A.I. techno-colonial linguists' rule over natural-language linguists in A.I. for Brahmi languages.
In scientific research and modelling, such an error is called a hermeneutic error, or category error.
'Fruit-specific research' (= Brahmi-Panini language-family AI research) will not become appropriate by using tools built for '2000 varieties of other fruits' (= social languages, statistically distributed all over the globe and handled with slack grammar-processing models), on the strength of the botany definition of a fruit (= language): the seed-bearing structure in flowering plants that is formed from the ovary after flowering. Fruits are the means by which flowering plants disseminate their seeds (= people of different communities use language for mother-tongue communication).
Statistics is NOT the validating criterion for a programmatically processed language.
3. It seems you are referring to the MITRA project, which uses google/madlad400-3b-mt (Hugging Face).
MADLAD-400-3B-MT is a multilingual machine translation model based on the T5 architecture that was trained on 1 trillion tokens covering over 450 languages using publicly available data. It is competitive with models that are significantly larger.
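For those who want to probe the base model themselves before judging the MITRA fine-tune, a minimal inference sketch with the Hugging Face transformers library is given below. The model name and the '<2xx>' target-language prefix follow the public model card; the sample sentence, the 'en' target code and the generation settings are only illustrative assumptions, not a statement of the MITRA pipeline.

    # Minimal sketch (assumption: standard transformers seq2seq API and enough
    # memory for a ~3B-parameter model). Not the MITRA pipeline itself.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    model_name = "google/madlad400-3b-mt"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # Sanskrit -> English: the "<2en>" token prepended to the source sentence
    # selects the target language, per the model card convention.
    text = "<2en> धर्मो रक्षति रक्षितः"   # illustrative sentence only
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Whether the raw output of such a run deserves more than the 'C+' noted above is something readers can check for themselves.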
The MITRA project - About · Dharmamitra https://dharmamitra.org/about
MITRA is a research project in the Berkeley AI Research lab in EECS at the University of California, Berkeley. It is led by Kurt Keutzer and Sebastian Nehrdich and focuses on bridging the linguistic divide between ancient wisdom source languages and contemporary languages through the application of advanced Deep Learning and AI technologies.
Initiated in 2023, the project quickly evolved from its conceptual phase to a dynamic development process, accelerated by its collaborative efforts with organizations such as monlam.ai and with contributions from a diverse array of sources, including translators and AI researchers. Leveraging a robust corpus of over four million sentence pairs from various sources and utilizing Google's MADLAD-400 model as a foundation, MITRA has fine-tuned a specialized translation model that not only promises enhanced fluency in translations but also aims to significantly expand access to ancient wisdom texts.
Through continuous improvements in data quality, sentence alignment, and model fine-tuning, the project seeks to overcome the challenges inherent in low-resource language translation. The MITRA project stands as a testament to the transformative potential of AI in transcending language barriers, embodying a commitment to cultural preservation, academic research, and the democratization of access to Tibetan literature and wisdom.
The following disclaimer notes on the model's Hugging Face page may be read carefully for their implications.
Out-of-Scope Use
These models are trained on general domain data and are therefore not meant to work on domain-specific models out of the box. Moreover, these research models have not been assessed for production use cases.
Bias, Risks, and Limitations
We note that we evaluate on only 204 of the languages supported by these models and on machine translation and few-shot machine translation tasks. Users must consider use of this model carefully for their own use case.
Ethical considerations and risks
We trained these models with MADLAD-400 and publicly available data to create baseline models that support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora. Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or otherwise low-quality content despite extensive pre-processing, it is still possible that these issues in the underlying training data may cause differences in model performance and toxic (or otherwise problematic) output for certain domains. Moreover, large models are dual use technologies that have specific risks associated with their use and development. We point the reader to surveys such as those written by Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling et al. for a thorough discussion of the risks of machine translation systems.
Training Details
We train models of various sizes: a 3B, 32-layer parameter model, a 7.2B 48-layer parameter model and a 10.7B 32-layer parameter model. We share all parameters of the model across language pairs, and use a Sentence Piece Model with 256k tokens shared on both the encoder and decoder side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target language.
Training Data
For both the machine translation and language model, MADLAD-400 is used. For the machine translation model, a combination of parallel data sources covering 157 languages is also used. Further details are described in the paper.
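To make the '<2xx>' convention in the training details above concrete, a small sketch follows of how one source sentence is prepared for several target languages with the single shared model; the two-letter codes used here (hi, ta, sa) are assumptions based on the MADLAD-400 language list, not verified against the released tokenizer.

    # Sketch: one shared model is steered to different target languages purely
    # by the "<2xx>" token prepended to the source sentence.
    def build_input(source_sentence: str, target_lang: str) -> str:
        """Prepend the target-language token, per the MADLAD-400 convention."""
        return f"<2{target_lang}> {source_sentence}"

    sentence = "Knowledge is the highest wealth."
    for lang in ("hi", "ta", "sa"):   # assumed codes: Hindi, Tamil, Sanskrit
        print(build_input(sentence, lang))
    # e.g. "<2hi> Knowledge is the highest wealth."

The point to note for Sanskrit is that the same statistically trained parameters serve all 400-plus languages; no Panini-specific processing is involved at this stage.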
Regards
BVK Sastry