AI for proofreading


विश्वासो वासुकिजः (Vishvas Vasuki)

Jun 1, 2025, 2:22:30 AM
to sanskrit-programmers, Hindu-vidyA हिन्दुविद्या, hariH nArAyaNaH हरिनारायणः SF
Despite some false positives, Claude 4 seems somewhat useful for identifying errors:

image.png

Copilot from a pull request wasn't as useful.


Let me know your experiences.

You can get this assistance from within the VS Code or IntelliJ editor.

intellij_AI_skt-typo_detection.png
vscode_AI_skt-typo_detection.png


--
Vishvas /विश्वासः

Hari Narayanan

Jun 1, 2025, 1:35:41 PM
to विश्वासो वासुकिजः (Vishvas Vasuki), sanskrit-programmers, Hindu-vidyA हिन्दुविद्या

Namaste mahodaya

Thank you for your quick responses. Last year Dr. Ramasubramanyan-mahodaya from IIT Bombay was here for the Kaveri camp. He mentioned that they have thousands of scanned pages from many works, but for lack of proofreaders they are not able to publish them for the benefit of many. They were paying Master's students ₹10 a page to proofread, and still not many came forward. Their main problem was diacritical marks in the English transliteration. So we thought maybe an AI engine could be trained to identify and correct most of these; the manual process would then be needed only for a small percentage.

There were people willing to sponsor PhD students for this project.

Any ideas or experience you have will be much appreciated.

Many thanks,
Hari

विश्वासो वासुकिजः (Vishvas Vasuki)

Jun 1, 2025, 7:16:20 PM
to Hari Narayanan, sanskrit-programmers, Hindu-vidyA हिन्दुविद्या
On Sun, 1 Jun 2025 at 22:43, Hari Narayanan <hna...@gmail.com> wrote:

> Their main problem was diacritical marks in the English transliteration.
> So we thought maybe an AI engine can be trained to identify and correct most of these.

Rather than typo-correction software, I think that just creating a superior OCR model, specifically trained on text with diacritical marks, should solve the problem.


Karthik

Jul 9, 2025, 4:15:42 AM
to sanskrit-programmers
Let me share my experience with Chandamama and Gemini.

This was my initial proofreading prompt:

--- BEGIN ---
* This is an issue of the Sanskrit Chandamama in plain text format.
* It contains an index of material followed by a series of articles/essays/stories/editorials/quizzes etc
* Items are demarcated using `---`
* The context is overwhelmingly in Sanskrit. But some content could be in English and other Indian languages

# INSTRUCTIONS
* I want you to exhaustively proofread this issue
* Mark grammar/spelling fixes to words/sentences using `{{OLD CONTENT|NEW CONTENT}}`.
* If you have comments rather than fixes, provide them at the end of your output. Put these comments for each article under the article heading prefixed with the provided sequential id.

Let me know if you understand your task. Once you confirm, I will provide the material
--- END ---

Gemini not only corrected any mistakes it found, but also started rewriting stuff to make it fit the rest of the story, or to fix irregularities in the story.

It then left comments of the following kind:

--- BEGIN ---
**ID: 3: अद्भुतः विजयः**

* `अन्यैः केचन योद्धारः` changed to `अन्ये केचन योद्धारः` for correct adjectival agreement with `योद्धारः` (nominative plural).
* `एवः` corrected to `एव`. `एवः` is not a valid form.
* The repeated visarga in `महाराजस्यः। महाराजस्यः` was corrected to `महाराजस्य। महाराजस्य`.
* `स्वस्थः` (healthy) does not fit the context. `स्तब्धः` (stunned, paralyzed) is more appropriate for the king's reaction.
* `षडयन्त्रं` corrected to the standard sandhi form `षड्यन्त्रम्`.
--- END ---

I then talked to it about my expectations and we rewrote the prompt collaboratively to this:

--- BEGIN ---
Hello. I want you to act as an expert proofreader for a Chandamama issue provided in plain text. The text is primarily in Sanskrit but may contain content in English and other Indian languages.

Your task is to identify and correct errors while strictly distinguishing between objective mistakes and subjective/semantic improvements. Please follow these rules precisely for **all languages** found in the text.

**I. OUTPUT STRUCTURE**

Your output must have two parts:
1.  The full text of the issue with in-line corrections.
2.  A final `### COMMENTS` section at the end, organized by the `ID` and `TITLE` of each article.

**II. RULES FOR IN-LINE CORRECTIONS: `{{OLD CONTENT|NEW CONTENT}}`**

You must **only** use the `{{OLD|NEW}}` format for **OBJECTIVE ERRORS**. An objective error is one that is unambiguously wrong based on the grammar, spelling, or orthographic rules of its **specific language** (Sanskrit, English, Hindi, etc.). This includes:
*   **Spelling/Typographical Errors**
*   **Grammatical Errors** (case endings, verb conjugations, gender agreement, etc.)
*   **Invalid Words** (non-existent forms, likely typos or OCR errors).

**III. RULES FOR THE `### COMMENTS` SECTION**

This section is for explanation and suggestions. It must contain two types of comments for each article:

1.  **Explanation of Fixes:** For every single `{{OLD|NEW}}` correction you made, provide a brief explanation of *why* it was an objective error in its respective language.
    *   **CRUCIAL:** When explaining, you must refer to the **exact** incorrect word or phrase as it appeared in the original source.

2.  **Suggestions for Consideration:** If you encounter a word or phrase that is **grammatically correct but seems contextually, stylistically, or semantically awkward**, you must **NOT** change it in the main text. Instead, you will add a "Suggestion" comment, quote the phrase, explain the issue, and propose alternatives.
    *   **Example (Sanskrit):** "Suggestion: The word `स्वस्थः` ('healthy') seems out of place for a surprised king. Consider using `स्तब्धः` ('stunned')."
    *   **Example (English):** "Suggestion: The phrase 'very big' is correct but could be stronger. Consider using 'immense' or 'colossal'."


By following this strict separation, you will fix only what is clearly broken and advise on what could be improved, giving me the final editorial choice.

**IV. SPECIAL HANDLING FOR MULTILINGUAL TRANSLATION TABLES**

For articles like "वर्ध्यताम् शब्दसम्पत्तिः" which contain tables of phrases in multiple Indian languages with Devanagari transliterations, apply the following rules:
*   **Native Script is Primary:** Proofread the text in its original script (e.g., Tamil, Telugu, Kannada) for objective errors in that language.
*   **Devanagari is a Phonetic Guide:** Treat the accompanying Devanagari text as a non-standard, good-faith phonetic guide for pronunciation.
*   **Fix Obvious Errors in Both:** Correct clear typos, spelling mistakes, or copy-paste errors in *both* the native script and the Devanagari transliteration.
*   **Comment on Subjective Transliterations:** Do **not** change Devanagari transliterations just because they are not perfect or don't follow a specific academic standard (like IAST). If a transliteration is phonetically awkward but understandable, leave it in the text and add a "Suggestion" in the comments about a potentially more accurate phonetic rendering.

Let me know if you understand these detailed instructions. Once you confirm, I will provide the material.
--- END ---

The result was this:

--- BEGIN ---
#### ID: 3, TITLE: अद्भुतः विजयः

**Explanation of Fixes:**
*   `अन्येः` was corrected to `अन्ये`. The original is an invalid form. The correct nominative plural form of the pronoun `अन्य` is `अन्ये`.
*   `एवः` was corrected to `एव`. The original form with a visarga is grammatically incorrect for the indeclinable particle `एव`.
*   `महाराजस्यः` (used twice) was corrected to `महाराजस्य`. The extra visarga at the end is a typographical error. The genitive singular form is `महाराजस्य`.
*   `स्वागतमस्ति` was corrected to `स्वागतम् अस्ति`. This is a sandhi correction. While often written together, the correct separation is required for clarity and grammatical precision as `स्वागतम्` is the object and `अस्ति` is the verb.
*   `षडयन्त्रं` was corrected to `षड्यन्त्रम्`. This is a standard Sanskrit spelling correction for the word meaning 'conspiracy'.

**Suggestions for Consideration:**
*   Suggestion: The word `स्वस्थः` for the king's reaction seems contextually odd. `स्वस्थः` means 'healthy' or 'at ease'. A more appropriate word for a surprised or stunned king might be `स्तब्धः` ('stunned'), `विस्मितः` ('surprised'), or `हतबुद्धिः` ('bewildered').
--- END ---

The prompt is everything. You can decide how tight the proofreading needs to be and what Gemini is and is not allowed to touch.
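
One way to check that the model really touched only what it marked is to rebuild the "old" side of every `{{OLD|NEW}}` marker and compare the result with the source file. The Python sketch below does that. It assumes markers are never nested, that neither side contains a `|` or `}}`, and that the model reproduces whitespace exactly; the function names are illustrative and not part of any tool mentioned in this thread.

```python
import re

# Matches the {{OLD CONTENT|NEW CONTENT}} markers requested in the prompt.
# Assumption: neither side contains "|" or "}}" (the Devanagari danda "।" is fine).
MARKER = re.compile(r"\{\{(.*?)\|(.*?)\}\}", re.DOTALL)


def strip_comments(output):
    """Drop the trailing '### COMMENTS' section, keeping only the marked-up text."""
    return output.split("### COMMENTS", 1)[0]


def split_corrections(marked_text):
    """Return (original_side, corrected_side, fixes) from marked-up text."""
    fixes = [(m.group(1), m.group(2)) for m in MARKER.finditer(marked_text)]
    # A function is used as the replacement so backslashes are never interpreted.
    original_side = MARKER.sub(lambda m: m.group(1), marked_text)
    corrected_side = MARKER.sub(lambda m: m.group(2), marked_text)
    return original_side, corrected_side, fixes


def has_unmarked_changes(source_text, marked_text):
    """True if the text outside the markers no longer matches the source."""
    original_side, _, _ = split_corrections(marked_text)
    return original_side != source_text
```

If `has_unmarked_changes` returns `True`, the model rewrote something without marking it, which is the behaviour the first prompt ran into. The `fixes` list can also be cross-checked against the "Explanation of Fixes" comments.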

Put the original sources in a git repo. Commit them. Replace the contents of the files with the Gemini output.

Then you can do a manual review of the diff between the two changesets.
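
For long Sanskrit lines a line-based diff can be hard to read, so a word-level view helps during review. `git diff --word-diff` gives one on the command line; the Python sketch below does something similar with the standard `difflib` module. Splitting on whitespace is an assumption of this sketch, so a change inside a sandhi-joined compound shows up as the whole word changing. The function name is illustrative.

```python
import difflib


def changed_words(original, corrected):
    """List (old_words, new_words) for each span where the two texts differ."""
    # Assumption: whitespace-separated tokens are a good enough unit for review.
    old_tokens = original.split()
    new_tokens = corrected.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions have an empty old side; deletions an empty new side.
            changes.append((" ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return changes
```

Running this over each file before and after the replacement gives a short list of changed words to accept or reject, instead of whole changed paragraphs.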

विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 9, 2025, 9:04:42 PM
to sanskrit-p...@googlegroups.com
I tried the same prompt, but Gemini just refused: https://g.co/gemini/share/e63d52556d4c . What did I do wrong?


Shreevatsa R

Jul 9, 2025, 9:19:38 PM
to sanskrit-p...@googlegroups.com
I haven't looked closely, but try aistudio.google.com instead of gemini.google.com (and try Gemini 2.5 Pro instead of Gemini 2.5 Flash).

विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 10, 2025, 2:52:24 AM
to sanskrit-p...@googlegroups.com
On Thu, 10 Jul 2025 at 06:49, Shreevatsa R <shree...@gmail.com> wrote:
> I haven't looked closely, but try aistudio.google.com instead of gemini.google.com (and try Gemini 2.5 Pro instead of Gemini 2.5 Flash).

I am grateful (धन्योऽहम्). It is much better than I am at identifying errors!

Karthik

Jul 10, 2025, 3:24:24 AM
to sanskrit-programmers
@shreevatsa is right

Typical Google shenanigans. Multiple products/interfaces in the same category aimed at confusing everyone.

When I say "Gemini", I mean Google AI Studio (https://aistudio.google.com). Please add this caveat every time you recommend this to someone.

विश्वासो वासुकिजः (Vishvas Vasuki)

Jul 10, 2025, 10:15:31 PM
to sanskrit-programmers, संस्कृतभारती-सन्तफ़्रन्सिस्क (Samskrita Bharati SF)
A useful prompt, in case anyone wants to proofread texts quickly using an LLM: https://sanskrit.github.io/groups/dyuganga/projects/text/proofreading/editing/AI-prompt/

--
Vishvas /विश्वासः

samskRta & bhArata's Friend

Jul 13, 2025, 3:54:54 AM
to sanskrit-programmers, संस्कृतभारती-सन्तफ़्रन्सिस्क (Samskrita Bharati SF)