Proofreading on Ambuda


Arun

Jan 29, 2026, 9:07:07 PM
to sanskrit-programmers
In this thread I'll share some details about the proofing system we use on Ambuda, which is live in production. Our proofing system has received special attention for the past few months and will continue to be our focus going forward.

For this first message, I wanted to share Ambuda's visual semantic editor. This is a live link, so feel free to click around. For comparison, see the editor on Sanskrit Wikisource, which is a traditional markup-based presentation editor.

Visual editing means that the elements in the transcribed page can be annotated with custom styles like colors, font weights, and backgrounds. In comparison, traditional markup-based editors depend on <b>tags</b> or other symbols to encode information. 

The main advantages of visual editing are:

1. We can easily see how the text has been marked up without having to carefully read markup tags.
2. Editing is user-friendly and less intimidating.
3. We can add custom integrations on top, e.g. a spell checker to highlight dubious words.

Semantic editing, meanwhile, means that elements are annotated not for how they look but for how they function. In comparison, presentation editors focus on visual qualities. Where a presentation editor lets you specify some text's font size, weight, and position, a semantic editor lets you label that text as a heading.

The main advantage of semantic editing is that we can work with the text's structure directly and then present it however we want. This is possible with presentational data too, but it is more error-prone: we might first have to look for bold, large text and then tag it as a heading.
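To make the contrast concrete, here is a minimal sketch (with a hypothetical block schema, not Ambuda's actual data model) of how a semantically annotated block can be rendered in different ways after the fact:

```python
# A semantic block records function, not appearance (hypothetical schema).
block = {"type": "heading", "level": 1, "text": "meghadUtam"}

def to_html(b: dict) -> str:
    """Render a block as HTML, deriving presentation from its role."""
    if b["type"] == "heading":
        return f"<h{b['level']}>{b['text']}</h{b['level']}>"
    return f"<p>{b['text']}</p>"

def to_plain(b: dict) -> str:
    """Render the same block for plain text, with a different convention."""
    if b["type"] == "heading":
        return b["text"].upper()
    return b["text"]
```

With presentational markup we would have to guess the role from font size and weight; here the role is stored directly and each output format decides how to show it.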

Future iterations of our editor will add more integrations and markup types to the editor. For an example of an integration: many people have discussed using a metrical classifier to check for typos. If a block is annotated as a verse, we can run that check immediately and notify the user in the editor itself.

This work builds upon a prototype suggested by Shreevatsa based on the ProseMirror library. ProseMirror also lets us apply validation rules to text content on save, which gives us stronger guarantees that the text is well-formed.

For details, see the Ambuda source code https://github.com/ambuda-org/ambuda and particularly the "proofer.js" file.

Arun

Arun

Jan 31, 2026, 12:13:43 PM
to sanskrit-programmers
In this second message, I'll talk about Ambuda's approach to OCR and bounding boxes.

Optical character recognition (OCR) is essential to digitizing Sanskrit texts at scale. Sanskrit OCR has greatly improved in quality in recent months, both through traditional offerings like Google OCR and through advances in large multimodal models. I have not run a formal quality analysis between these two systems yet and don't know how they compare. But the main advantage that traditional OCR systems have is that they provide bounding boxes for the items they detect. That is, they map an OCR'd word to the page region it comes from. (Example.)

Ambuda offers two nice features on top of this bounding box data, and you can try them both out in our editor.

1. First, the bounding box is displayed as the user moves around the text box. This is currently always on, but we will add an option to turn it off.
2. Second, you can click View > Track bounding box to automatically scroll the page to the highlighted word. This saves the manual step of re-scrolling the page image into view.

There are still some small bugs, but together, these features enable a nice workflow that I've been using for proofing. If you select View > Show image on top and click the "Fit width to container" option in the image controls, you get a large zoom on the relevant part of the image, and it auto-scrolls as you move through the page.

Despite advances in AI and OCR, human review will always be necessary to ensure that texts are trustworthy and reliable. I find that this flow removes the overhead of proofing a text and lets me focus on what is most valuable: making sure the text matches the page.

Arun

Screenshot 2026-01-31 at 9.06.03 AM.png

Arun

Feb 9, 2026, 1:09:39 AM
to sanskrit-programmers
In this third message, I'll talk about how projects are converted to published texts.

A project, which typically represents a single book, can contain an unlimited number of texts within it, and these texts can also have complex relationships with each other. A simple example is an anthology that contains dozens or hundreds of smaller texts. A more complex example is a book that contains both a main text and its translation and commentary.

While we could create one project per text, doing so would be confusing and potentially lead to duplicated work. So, we need some way to publish a single project with multiple texts.

The way we do this on Ambuda is by defining a Publish config that declares basic metadata about the text and how to extract its content from the project. An example is below:

Screenshot 2026-02-08 at 10.01.03 PM.png

Most of the fields here should be straightforward: title is the display title, slug is the URL title, and author / genre / language are what they sound like. The small green + lets us quickly define a new genre or author. (Under the hood, these are represented as separate database rows and bound to the text with a foreign key relation.) Parent slug is specific to translations and commentaries, and we may move off of it in the future.

Filter is the interesting part here. Here we define a simple query language for extracting blocks of text from a project. For ease of implementation, this query language is an s-expression with simple logical operators. Examples:

(image 5 15)      # Match all blocks from page image 5 to page image 15 inclusive
(image 5 15:foo)  # Match all blocks from page image 5 to page image 15 inclusive (ending at the first block with label `foo`)
(tag p)           # Match all blocks representing paragraphs
(label foo)       # Match all blocks marked with the label "foo"

(or (image 5 15) (label foo))   # Match all blocks with (image 5 15) OR (label foo)
(and (image 5 15) (label foo))  # Match all blocks with (image 5 15) AND (label foo)
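As an illustrative sketch of how such a filter might be parsed and evaluated (the real implementation lives in the Ambuda source; the block fields below are assumptions), the core of the language fits in a few lines of Python:

```python
import re

def parse(src: str):
    """Parse an s-expression like '(or (image 5 15) (label foo))' into nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", src)
    def read(pos):
        if tokens[pos] == "(":
            expr, pos = [], pos + 1
            while tokens[pos] != ")":
                node, pos = read(pos)
                expr.append(node)
            return expr, pos + 1
        return tokens[pos], pos + 1
    expr, _ = read(0)
    return expr

def matches(expr, block: dict) -> bool:
    """Decide whether one block (a dict) satisfies a parsed filter."""
    op, *args = expr
    if op == "or":
        return any(matches(a, block) for a in args)
    if op == "and":
        return all(matches(a, block) for a in args)
    if op == "image":
        # Ignore a ':label' suffix here; see the note below.
        lo, hi = int(args[0].split(":")[0]), int(args[1].split(":")[0])
        return lo <= block["image"] <= hi
    if op == "tag":
        return block["tag"] == args[0]
    if op == "label":
        return block.get("label") == args[0]
    raise ValueError(f"unknown operator: {op}")
```

One caveat: the 15:foo boundary form ("stop at the first block labeled foo") needs the ordered sequence of blocks, not just a per-block predicate, so this sketch only reads the page numbers.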

We are refining the system as we go so that we can publish texts more easily and pleasantly. For example, an earlier version of this query language did not support the label field when defining boundaries within images, so we ended up with complex queries like:

(or (and (image 42) (label PRAN)) (image 43) (and (image 44) (label PRAN)))

Adding an optional label-based boundary makes the language more expressive and the intent easier to understand, so for this text, we can simply define (image 67:GOPI_START 71).

Arun

Arun

Feb 19, 2026, 2:50:37 AM
to sanskrit-programmers
In this fourth message, I'll talk about two ways Ambuda's proofreader removes friction.

First, there's the matter of typing Devanagari. Broadly, there are two ways to write Devanagari on a computer. One is to transliterate as you type, using a Devanagari keyboard, an input method editor (IME), or the like. Another is to use a transliteration tool like Aksharamukha, indic-transliteration, and so on. Both work, but both break flow in different ways. Live transliteration works well but adds friction when the user needs to switch back to Latin characters, such as when using a command palette. A dedicated transliteration tool has the same problem, only more so. And both options require some setup.

To work around these problems, Ambuda's proofreading tool has its own IME implemented in the browser, and it is automatically disabled when using the command palette or other tools. You can play around with it here by selecting Tools > Transliterator IME.

When selected, this option assumes Harvard-Kyoto > Devanagari by default, but it can be tuned through the Tools > Transliterate... modal. It transliterates as you type and works like an ordinary IME.
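For a sense of how the core of such an IME works, here is an illustrative Python sketch of Harvard-Kyoto to Devanagari transliteration covering a subset of the scheme (Ambuda's actual IME runs in the browser and is more complete):

```python
# Subset of the Harvard-Kyoto scheme (illustrative, not complete).
# Each vowel maps to (independent form, post-consonant matra form).
VOWELS = {"a": ("अ", ""), "A": ("आ", "ा"), "i": ("इ", "ि"), "I": ("ई", "ी"),
          "u": ("उ", "ु"), "U": ("ऊ", "ू"), "R": ("ऋ", "ृ"),
          "e": ("ए", "े"), "ai": ("ऐ", "ै"), "o": ("ओ", "ो"), "au": ("औ", "ौ")}
CONSONANTS = {"k": "क", "kh": "ख", "g": "ग", "gh": "घ", "G": "ङ",
              "c": "च", "ch": "छ", "j": "ज", "jh": "झ", "J": "ञ",
              "T": "ट", "Th": "ठ", "D": "ड", "Dh": "ढ", "N": "ण",
              "t": "त", "th": "थ", "d": "द", "dh": "ध", "n": "न",
              "p": "प", "ph": "फ", "b": "ब", "bh": "भ", "m": "म",
              "y": "य", "r": "र", "l": "ल", "v": "व",
              "z": "श", "S": "ष", "s": "स", "h": "ह"}
OTHERS = {"M": "ं", "H": "ः"}
VIRAMA = "्"

def transliterate(text: str) -> str:
    out, i, pending = [], 0, False  # pending: consonant awaiting a vowel
    while i < len(text):
        # Greedy longest match: try two-character tokens (kh, ai, ...) first.
        for n in (2, 1):
            tok = text[i:i + n]
            if tok in VOWELS:
                out.append(VOWELS[tok][1] if pending else VOWELS[tok][0])
                pending = False
                break
            if tok in CONSONANTS:
                if pending:
                    out.append(VIRAMA)  # consonant cluster
                out.append(CONSONANTS[tok])
                pending = True
                break
            if tok in OTHERS:
                out.append(OTHERS[tok])
                pending = False
                break
        else:
            if pending:
                out.append(VIRAMA)  # word-final consonant before a space etc.
            out.append(text[i])  # pass through anything unrecognized
            pending = False
            n = 1
        i += n
    if pending:
        out.append(VIRAMA)
    return "".join(out)
```

The two interesting details are the inherent vowel (a consonant followed by "a" needs no matra) and the virama inserted between stacked consonants.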

It's a small feature but one of the many ways that Ambuda's proofreading environment gets rid of friction.

Screenshot 2026-02-18 at 11.42.55 PM.png

Second is the command palette that I mentioned above. This is a quickselect window that saves the trouble of having to navigate different dropdown menus. It's available through View > Command palette or by typing Cmd-K (or your OS equivalent).

Screenshot 2026-02-18 at 11.46.50 PM.png

Arun

Arun

Feb 25, 2026, 12:55:14 AM
to sanskrit-programmers
In this fifth message, I'll talk about Ambuda's approach to automatic error detection in texts.

While proofreading is important, it is just one of many ways to prevent defects in a text. Taking the wider view, there are several other techniques we can bring to bear on a text to make sure it is of high quality.

As a first step in this direction, all published texts on Ambuda now have an associated quality score and quality report that explains how the text was tested. For example, the About section of the गायत्र्युपनिषत् shows a quality score of 5/5, meaning that five different checks were run and all five of them passed. When clicked, that score links to a report with specifics on these tests. A full list of quality scores is available in the proofing section of the site.

Our current tests are of three kinds. First, there are tests that verify that a text's XML is well-formed. Next, there are simple character tests that verify that all text conforms to a certain character set. But our third and most interesting tests work with metrical patterns and verify that verses follow a known meter. Future tests will go even deeper into a text's structure by checking tokens and phrases against our lexicon.
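To illustrate how a metrical check can work, here is a simplified sketch (not Ambuda's implementation): classical prosody marks each syllable as light (laghu) or heavy (guru), and the resulting weight pattern can be compared against known meters. This version scans Harvard-Kyoto input:

```python
import re

LONG_VOWELS = {"A", "I", "U", "e", "o", "ai", "au"}
VOWEL_RE = re.compile(r"(ai|au|[aAiIuUeoR])")

def scan_weights(line: str) -> str:
    """Return the weight pattern of a Harvard-Kyoto line as one 'L'/'G' per syllable."""
    s = line.replace(" ", "")
    # Collapse aspirates (kh, gh, ...) to a single consonant so that
    # cluster length counts phonemes, not letters.
    s = re.sub(r"([kgcjTDtdpb])h", r"\1", s)
    parts = VOWEL_RE.split(s)
    vowels, clusters = parts[1::2], parts[2::2]  # clusters follow each vowel
    weights = []
    for i, (v, c) in enumerate(zip(vowels, clusters)):
        heavy = (
            v in LONG_VOWELS                 # long vowel
            or "M" in c or "H" in c          # anusvara or visarga
            or len(c) >= 2                   # conjunct after the vowel
            or (i == len(vowels) - 1 and len(c) >= 1)  # closed final syllable
        )
        weights.append("G" if heavy else "L")
    return "".join(weights)
```

A real checker would then compare this L/G string against a database of meter patterns, and would also handle final-syllable licence, avagraha, punctuation, and so on.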

Together, these checks have caught a variety of errors in our proofread texts. What is more surprising (to me, anyway) is that these checks have also caught many errors in texts we've imported from GRETIL. For example, our system has found that 21 verses in the GRETIL copy of the Meghaduta have metrical defects. Of these, 1 is a false positive and the other 20 are true errors.

Screenshot 2026-02-24 at 9.37.21 PM.png

We found a similar proportion of errors in the other metrical texts Ambuda carries from GRETIL, many of which are popular texts that have been available on GRETIL for many years. This is not to criticize GRETIL at all, but rather to emphasize that metrical checks can complement ordinary proofreading and help make texts even more accurate.

The success of metrical checks has been so great that we have also started integrating these checks directly into the proofreading editor to surface and catch errors even earlier. Below is a verse from our editor with a real-time check on the verse's meter.

Screenshot 2026-02-24 at 9.38.57 PM.png

More work is necessary, but we have become more bullish on using automated techniques for proofreading, and I think techniques like this can be a powerful way to scale up proofreading across the Sanskrit world.

In my next email, I'm planning to talk about our approach to tokenization and token-based error correction.

Arun

Anunad Singh

Feb 25, 2026, 3:01:47 AM
to sanskrit-p...@googlegroups.com
Great to hear about the ideas implemented for automatic error detection, especially the idea of testing metrical patterns. With this, it becomes a superset of the 'grammar check' implemented in many advanced applications.

I was wondering if the 'metrical pattern test' is applicable to texts having 'split-sandhis' also?

-- anunAda





Arun

Feb 25, 2026, 9:07:19 PM
to sanskrit-programmers
Thanks, it's been very useful so far.

> I was wondering if the 'metrical pattern test' is applicable to texts having 'split-sandhis' also?

My current test does not support this, since almost all of the texts we proofread do not use split sandhi. But we could support it if we knew ahead of time that a text had split sandhi: we could apply basic sandhi rules and then run the same metrical check.
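For illustration, "applying basic sandhi rules" could look something like the sketch below, which joins two Harvard-Kyoto words using a few of the most common vowel sandhi rules (illustrative only, far from complete):

```python
def sandhi_join(first: str, second: str) -> str:
    """Join two Harvard-Kyoto words with a few basic vowel sandhi rules."""
    f, s = first[-1], second[0]
    vowels = "aAiIuUeo"
    if f in "aA":
        # a/A followed by a vowel: lengthening, guna, or vrddhi.
        table = {"a": "A", "A": "A", "i": "e", "I": "e",
                 "u": "o", "U": "o", "e": "ai", "o": "au"}
        if s in table:
            return first[:-1] + table[s] + second[1:]
    if f in "iI":
        if s in "iI":
            return first[:-1] + "I" + second[1:]  # i + i -> long I
        if s in vowels:
            return first[:-1] + "y" + second      # i + other vowel -> y
    if f in "uU":
        if s in "uU":
            return first[:-1] + "U" + second[1:]
        if s in vowels:
            return first[:-1] + "v" + second
    return first + second  # no rule applies; simple concatenation
```

A split-sandhi verse could be rejoined pada by pada like this and then fed to the same syllable-weight scan.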

Arun

Anunad Singh

Feb 26, 2026, 1:36:07 AM
to sanskrit-p...@googlegroups.com
dhanyavaada!

We could make a list of common errors which creep into such documents and test for their presence. 

I would like to list some of the errors here:
1) use of the pipe character ( | ) or slash ( / ) in place of the Devanagari danda ( । )
2) use of two single dandas ( ।। ) in place of the Unicode double danda ( ॥ )
3) use of ( s ) in place of avagraha ( ऽ )
4) use of ( ऊँ ) in place of ( ॐ )
5) presence of ( श्रृ ) instead of ( शृ )
6) presence of ( ड.) in place of ( ङ )
7) presence of ( स्त्र ) in place of ( स्र )
8) wrong sequence of mātrās coming together, for example ( ंो ) in place of ( ों )
9) presence of a very long run of characters without a space, which may result from a deleted space.
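Several of the checks above could be sketched as simple regex patterns (the patterns and wording below are illustrative, not an actual Ambuda check list):

```python
import re

# (pattern, description) pairs for a few of the errors listed above.
CHECKS = [
    (r"[|/]", "ASCII pipe or slash used in place of danda (।)"),
    (r"।।", "two single dandas used in place of double danda (॥)"),
    (r"श्रृ", "श्रृ used in place of शृ"),
    (r"ऊँ", "ऊँ used in place of ॐ"),
    (r"[^\s।॥]{30,}", "very long run without spaces (missing space?)"),
]

def find_errors(text: str) -> list[str]:
    """Return the descriptions of all checks that match the given text."""
    return [desc for pat, desc in CHECKS if re.search(pat, text)]
```

Checks like these are cheap to run on every save and catch a surprising number of mechanical typos.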
