Maintaining 500+ Sanskrit scripts and 200+ Tamil scripts in 142 languages

sayantan roy

Dec 6, 2025, 1:22:00 PM

to sams...@googlegroups.com

Dear Team,

My name is Sayantan Roy. I am an independent researcher and software developer working in large-scale Indological digitization.

🔥 PROUD TO ANNOUNCE ONE OF THE BIGGEST SINGLE-HANDED INDOLOGY PROJECTS EVER DONE! 🔥

After years of dedication, sleepless nights, and thousands of lines of code… I’ve successfully completed a massive Indological digitization and AI project — all by myself.

📚 500+ Sanskrit texts

📚 200+ Classical Tamil texts

💾 50+ GB of processed data

🧠 142 languages + multi-script support

💻 50+ custom software tools built from scratch

⚡ Tasks that normally need 100+ people working for years, my MLTT system completes in minutes to hours.

Yes… one person did what entire teams couldn’t.

And that one person is me.

✨ MLTT – Multi-Language Translation & Transliteration System

My AI-powered platform unlocks ancient Hindu, Buddhist, and Jain texts and converts them into 142 world languages and multiple file formats—accurately, instantly, and beautifully.

This is not just software.

This is preservation of heritage, computational linguistics, digital humanities, and AI innovation coming together.

For Indology researchers, monks, scholars, linguists, and students—this project opens a door that never existed before.

🔹 A 50GB+ digital treasure

🔹 700+ classical texts processed

🔹 Rare scripts preserved

🔹 AI + human knowledge combined

🔹 All built by a single developer

I’m proud, humbled, and excited to finally share the scale of what I built.

This journey proved one thing:

💥 With obsession, passion, and code — one person CAN change the landscape of knowledge. 💥

If you’re interested in ancient texts, AI, linguistics, or Indology…

Stay tuned. Something big has begun. 🚀

Project database links:

Database 1: https://drive.google.com/drive/folders/1PX3o4o6COEYbOlGFlh6-zc6TGT3hbwLD

Database 2: https://drive.google.com/drive/folders/1GbvJNqi3_yLwzbI3at88fWG4FEMbjkVI

Internet Archive: https://archive.org/details/@sayaantan

ResearchGate papers: https://www.researchgate.net/profile/Sayantan-Roy-14?ev=hdr_xprf

GitHub code: https://github.com/sayantanr

Sandeep Maher

Dec 8, 2025, 7:18:09 AM

to sams...@googlegroups.com

Regret the delayed response, Mr. Roy.

Although I cannot fully comprehend or experience the vastness of your achievement, even a literal reading of what you have shared surely deserves applause.

Any effort that elevates our suppressed greatness, which transcends geographical boundaries, must surely be lauded.

May you continue on your path, and achieve what you have set out to. All my very best.

Regards,

Sandeep H. Maher

--
You received this message because you are subscribed to the Google Groups "samskrita" group.
To unsubscribe from this group and stop receiving emails from it, send an email to samskrita+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/samskrita/CA%2B6AELePce_19M9B%2BuTmhy2q%3Doz0DFDyr0PXOUm3ntg%3D_5d3BQ%40mail.gmail.com.

Anunad Singh

Dec 8, 2025, 7:27:49 AM

to sams...@googlegroups.com

Sayantan Roy mahashaya,

It is great to hear about what you have done single-handedly. I also hope that others can use it to do the same for the thousands of books that still remain to be OCRed or translated.

Could you please describe, at a system level, what the tool takes as input and what it produces as output? My loose understanding is that it is a kind of OCR tool that then translates the text into many modern languages.

-- अनुनाद

HONGANOUR KRISHNA

Dec 8, 2025, 11:45:14 AM

to sams...@googlegroups.com

Dear Sayantan Roy,
Congratulations on your landmark achievement!

Thank you for sharing this extraordinary milestone with us. Your dedication, perseverance, and vision in the field of Indological digitization are truly inspiring.
The scale of your accomplishment — digitizing 500+ Sanskrit texts, 200+ Classical Tamil texts, building 50+ custom software tools, and enabling 142 languages with multi-script support — is nothing short of remarkable. Achieving what typically requires large teams, all through your own effort, speaks volumes about your commitment and ingenuity.

This project is not only a personal triumph but also a significant contribution to the preservation and accessibility of our cultural and linguistic heritage. I deeply appreciate the impact your work will have on researchers, scholars, and communities worldwide.

Congratulations once again on this groundbreaking achievement. I look forward to seeing how your innovations continue to shape the future of Indological studies and digital humanities.

With admiration and best wishes,

Thank you,

HONGANOUR S KRISHNA

Mandar Bhanushe

Dec 8, 2025, 12:16:03 PM

to sams...@googlegroups.com

Most excellent! (बहु उत्तमम्!)

Mandar Bhanushe

Head, Faculty of Science & Technology

Coordinator, CEED

Centre for Distance and Online Education, University of Mumbai

Member of IKS Sub-Committee under NEP2020 Steering Committee, Maharashtra

UGC nominated EC member of Symbiosis International (Deemed) University

CDC Member of St Teresa Institute of Education

Member, Academic Council of St Xavier's Institute of Education (Autonomous)

Former Member of Academic Council of Sathaye College of Arts, Science & Commerce (Autonomous)

Member of Pratap Center of Philosophy, Amalner, KBC North Maharashtra University

Invited member of BoS of Mathematics, University of Mumbai

BoS (AI) member of Chandrabhan Sharma College, Mumbai

sayantan roy

Dec 8, 2025, 6:54:00 PM

to sams...@googlegroups.com

Thank you, sir!

संस्कृत संवादः

May 16, 2026, 12:12:02 PM

to samskrita

Subject: Critical Questions regarding the "142-Language Digital Corpus" – Innovation or Digital Hallucination?

I have been following the recent posts regarding the "142-Language Digital Corpus of Sanskrit, Tamil, and Indic Religious Literature." While the scale of the project sounds monumental on paper, a closer look at the methodology and claims raises several "red flags" that the scholarly community deserves to have addressed.

Before we celebrate this as a "breakthrough," we should ask the following critical questions:

1. The "Vanity Metric" of 142 Languages

What is the actual utility of translating classical Sanskrit or Sangam Tamil into 142 languages simultaneously? Most serious Indologists work in Devanagari, IAST, or major regional scripts.

The Problem: Automated translation (via OpenAI/Claude/Google) of ancient, highly nuanced philosophical Sanskrit into low-resource languages (like Zulu or Hmong) is notorious for "hallucinating" meaning.
The Question: Has even 1% of these 142 translations been verified by a human expert in those languages? If not, aren't we just creating "digital noise" or "garbage data" that misrepresents our heritage?

2. The OCR and Proofreading Mystery

The author claims to have digitized, cleaned, and normalized over 700 texts (500+ Sanskrit, 200+ Tamil) single-handedly.

The Problem: Anyone who has worked with Tesseract or EasyOCR on Sanskrit knows that OCR output is riddled with errors, especially with samyuktaksharas (conjuncts) and sandhi. Cleaning just one major text (like the Mahabharata or a Purana) to scholarly standards takes a team of scholars years.
The Question: How can one researcher claim to have "cleaned" 700+ texts alone? Where is the "Ground Truth" data? Or is this simply raw, error-prone OCR output being passed off as a "Digital Corpus"?
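For readers wondering how a "Ground Truth" check is usually quantified, here is a minimal sketch, assuming a proofread reference transcription is available for at least a sample of lines. It computes the character error rate (CER): the edit distance between the OCR output and the reference, divided by the reference length. The function names `levenshtein` and `character_error_rate` and the sample strings are illustrative assumptions, not anything taken from the project under discussion.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """CER = edit distance / reference length, counted in Unicode code points."""
    # Normalize both sides so visually identical text compares equal.
    ocr = unicodedata.normalize("NFC", ocr_text)
    ref = unicodedata.normalize("NFC", reference)
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ocr, ref) / len(ref)


if __name__ == "__main__":
    # Hypothetical sample: the OCR line drops one vowel sign from a conjunct-heavy word.
    reference = "धर्मक्षेत्रे कुरुक्षेत्रे"
    ocr_output = "धर्मक्षत्रे कुरुक्षेत्रे"
    print(f"CER = {character_error_rate(ocr_output, reference):.3f}")
```

Because the count is per code point, a dropped vowel sign or virama registers as a single error. Reporting this figure for a random sample of lines from each text would give the community something concrete to audit.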

3. Academic Rigor vs. Self-Publishing

The project relies heavily on preprints uploaded to ResearchGate and code on GitHub.

The Problem: These papers do not appear to have undergone traditional peer review by any recognized Indology or Computational Linguistics journal.
The Question: Has this methodology been audited by a panel of Sanskritists or Tamil scholars to ensure the semantic integrity of the texts? In the age of AI, "Massive" does not mean "Accurate."

4. Where is the Source?

The author mentions a 50GB dataset and a 12GB repository, but there is no centralized, searchable, and transparent website where a scholar can look up a specific verse and verify the OCR/Translation quality.

The Question: Why is this being promoted as a "Global Utility" if the data is buried in bulk-upload archives without a functional user interface for scholars to audit the work?

5. Pipeline or Wrapper?

The "pipeline" described seems to be a combination of existing open-source tools (Aksharamukha, Tesseract, OpenAI API).

The Question: What original computational work has been done here? Automating a script to call an API to translate text into 100 languages is a weekend project for a programmer—it is not necessarily a "research breakthrough" in Indology.

Conclusion: We must be careful not to mistake "Big Data" for "Big Scholarship." Sanskrit and Tamil deserve precision, not just volume. I would like to ask the author: Can you provide a side-by-side comparison of a complex Sanskrit verse from your corpus against a traditionally edited critical edition (like the BORI Mahabharata) to show the accuracy of your "cleaned" OCR?

Until there is transparency regarding proofreading and verification, we should treat these "142-language" claims with extreme skepticism.
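The side-by-side comparison requested in the conclusion could be produced along the following lines. This is only a sketch, assuming both the corpus line and the critical-edition line are available as plain text in the same script; the helper name `diff_words` is hypothetical, and it relies solely on Python's standard `difflib` module.

```python
import difflib


def diff_words(ocr_line: str, reference_line: str) -> list[tuple[str, str, str]]:
    """Return (operation, reference words, OCR words) for every word-level mismatch."""
    ref_words = reference_line.split()
    ocr_words = ocr_line.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / delete / insert spans
            report.append((tag,
                           " ".join(ref_words[i1:i2]),
                           " ".join(ocr_words[j1:j2])))
    return report


if __name__ == "__main__":
    # Hypothetical example: the second string simulates an OCR line missing a diacritic.
    reference = "dharmakṣetre kurukṣetre samavetā yuyutsavaḥ"
    ocr_output = "dharmaksetre kurukṣetre samavetā yuyutsavaḥ"
    for operation, expected, found in diff_words(ocr_output, reference):
        print(f"{operation}: expected '{expected}' but OCR has '{found}'")
```

A word-level report like this is easier for a Sanskritist to review than a raw error count, since each mismatch can be checked directly against the printed edition.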
