--
You received this message because you are subscribed to the Google Groups "samskrita" group.
To unsubscribe from this group and stop receiving emails from it, send an email to samskrita+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/samskrita/CA%2B6AELePce_19M9B%2BuTmhy2q%3Doz0DFDyr0PXOUm3ntg%3D_5d3BQ%40mail.gmail.com.
Sayantan Roy mahashaya,
It is great to hear about what you have done single-handedly. It is also encouraging to think that others could use it to do the same with the thousands of books that still remain to be OCRed and translated. Could you please describe, at a system level, what the tool takes as input and what it gives as output? My loose understanding is that it is a kind of OCR tool that also translates the text into many modern languages.
-- अनुनाद
Subject: Critical Questions regarding the "142-Language Digital Corpus" – Innovation or Digital Hallucination?
I have been following the recent posts regarding the "142-Language Digital Corpus of Sanskrit, Tamil, and Indic Religious Literature." While the scale of the project sounds monumental on paper, a closer look at the methodology and claims raises several "red flags" that the scholarly community deserves to have addressed.
Before we celebrate this as a "breakthrough," we should ask the following critical questions:
1. The "Vanity Metric" of 142 Languages: What is the actual utility of translating classical Sanskrit or Sangam Tamil into 142 languages simultaneously? Most serious Indologists work in Devanagari, IAST, or major regional scripts.
2. The OCR and Proofreading Mystery: The author claims to have digitized, cleaned, and normalized over 700 texts (500+ Sanskrit, 200+ Tamil) single-handedly.
3. Academic Rigor vs. Self-Publishing: The project relies heavily on preprints uploaded to ResearchGate and code on GitHub.
4. Where is the Source? The author mentions a 50GB dataset and a 12GB repository, but there is no centralized, searchable, and transparent website where a scholar can look up a specific verse and verify the OCR/Translation quality.
5. Pipeline or Wrapper? The "pipeline" described seems to be a combination of existing open-source tools (Aksharamukha, Tesseract, OpenAI API).
Conclusion: We must be careful not to mistake "Big Data" for "Big Scholarship." Sanskrit and Tamil deserve precision, not just volume. I would like to ask the author: Can you provide a side-by-side comparison of a complex Sanskrit verse from your corpus against a traditionally edited critical edition (like the BORI Mahabharata) to show the accuracy of your "cleaned" OCR?
Until there is transparency regarding proofreading and verification, we should treat these "142-language" claims with extreme skepticism.
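For anyone who wants to put a number on the side-by-side comparison requested above, one common measure is the character error rate (CER): the edit distance between the OCR output and the reference text, divided by the length of the reference. The following is only a minimal sketch in Python. The function names are my own, the sample strings are illustrative and not taken from the corpus or from any critical edition, and it counts Unicode code points after NFC normalization, which is a simplification for Indic scripts where one visible akshara may span several code points.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, measured in code points.

    Both strings are NFC-normalized first so that composed and decomposed
    forms of the same character are not counted as errors.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative strings only: a hypothetical OCR slip (ṣ read as s).
    reference = "dharmakṣetre kurukṣetre"
    ocr_output = "dharmaksetre kurukṣetre"
    print(f"CER: {character_error_rate(reference, ocr_output):.4f}")
```

A figure like this is only meaningful if the reference side is a proofread edition and the sample of verses is chosen independently of the person reporting the result.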