Proofreading on Ambuda

Arun

Jan 29, 2026, 9:07:07 PM
to sanskrit-programmers
In this thread I'll share some details about the proofing system we use on Ambuda, which is live in production. Our proofing system has received special attention for the past few months and will continue to be our focus going forward.

For this first message, I wanted to share Ambuda's visual semantic editor. This is a live link, so feel free to click around. For comparison, see the editor on Sanskrit Wikisource, which is a traditional markup-based presentation editor.

Visual editing means that elements of the transcribed page can be annotated with custom styles like colors, font weights, and backgrounds. In comparison, traditional markup-based editors depend on <b>tags</b> or other symbols to encode information.

The main advantages of visual editing are:

1. We can easily see how the text has been marked up without having to carefully read markup tags.
2. Editing is user-friendly and less intimidating.
3. We can add custom integrations on top, e.g. a spell checker to highlight dubious words (sketched just below).
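
As a rough illustration of integration point 3: Ambuda's editor is built on the ProseMirror library (more on this below), where a plugin can decorate ranges of text without changing the underlying document. This is a minimal sketch assuming a hypothetical `isDubious` check; it is not Ambuda's actual spell checker.

```ts
import { Plugin } from "prosemirror-state";
import { Decoration, DecorationSet } from "prosemirror-view";

// Highlight every word that a (hypothetical) checker flags as dubious.
function dubiousWordHighlighter(isDubious: (word: string) => boolean): Plugin {
  return new Plugin({
    props: {
      decorations(state) {
        const decorations: Decoration[] = [];
        state.doc.descendants((node, pos) => {
          if (!node.isText || !node.text) return;
          // Check each whitespace-separated word in this text node.
          for (const match of node.text.matchAll(/\S+/g)) {
            if (isDubious(match[0])) {
              const from = pos + match.index!;
              decorations.push(
                Decoration.inline(from, from + match[0].length, { class: "dubious" })
              );
            }
          }
        });
        return DecorationSet.create(state.doc, decorations);
      },
    },
  });
}
```

Because the highlight lives in a decoration rather than in the document, the styling never leaks into the saved text.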

Semantic editing, meanwhile, means that elements are annotated not for how they look but for how they function. In comparison, presentation editors focus on visual qualities. Where a presentation editor lets you specify some text's font size, weight, and position, a semantic editor lets you label that text as a heading.

The main advantage of semantic editing is that we can work with the text's structure directly and then present it however we want. This is possible to do with presentational data, but it is more error-prone: e.g., we might first have to look for bold, large text and then tag it as a heading.

Future iterations will add more integrations and markup types to the editor. As an example of an integration: many people have discussed using a metrical classifier to check for typos. If a block is annotated as a verse, we can run that check immediately and notify the user in the editor itself.
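
A hedged sketch of what that check could look like, assuming a `verse` node type in the editor schema and a hypothetical `classifyMeter` function (neither is confirmed Ambuda API):

```ts
import { Node as ProseMirrorNode } from "prosemirror-model";

// Hypothetical classifier: returns the recognized meter, or null if none fits.
declare function classifyMeter(text: string): string | null;

// Collect verse blocks whose text does not scan in any known meter.
function findSuspectVerses(doc: ProseMirrorNode): { pos: number; text: string }[] {
  const suspects: { pos: number; text: string }[] = [];
  doc.descendants((node, pos) => {
    if (node.type.name === "verse" && classifyMeter(node.textContent) === null) {
      suspects.push({ pos, text: node.textContent });
    }
  });
  return suspects;
}
```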

This work builds upon a prototype suggested by Shreevatsa based on the ProseMirror library. ProseMirror also lets us apply validation rules to text content on save, which gives us stronger guarantees that the text is well-formed.
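
To give a sense of how semantic structure and validation fit together in ProseMirror: a schema declares which node types a document may contain and what each may hold, so a document that violates those rules cannot even be constructed. A minimal sketch (these node names are illustrative, not Ambuda's actual schema):

```ts
import { Schema } from "prosemirror-model";

// Nodes are labeled by function (heading, verse, paragraph); toDOM decides
// separately how each one is presented.
const schema = new Schema({
  nodes: {
    doc: { content: "block+" }, // a page is one or more blocks
    heading: { group: "block", content: "text*", toDOM: () => ["h2", 0] },
    verse: { group: "block", content: "text*", toDOM: () => ["div", { class: "verse" }, 0] },
    paragraph: { group: "block", content: "text*", toDOM: () => ["p", 0] },
    text: {},
  },
});
```

Because presentation lives only in `toDOM`, the same `verse` node can later be rendered any other way without touching the stored document.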

For details, see the Ambuda source code https://github.com/ambuda-org/ambuda and particularly the "proofer.js" file.

Arun

Arun

Jan 31, 2026, 12:13:43 PM
to sanskrit-programmers
In this second message, I'll talk about Ambuda's approach to OCR and bounding boxes.

Optical character recognition (OCR) is essential to digitizing Sanskrit texts at scale. Sanskrit OCR has improved greatly in quality in recent months, both through traditional offerings like Google OCR and through advances in large multimodal models. I have not yet run a formal quality analysis of these two approaches, so I don't know how they compare. But the main advantage of traditional OCR systems is that they provide bounding boxes for the items they detect. That is, they map each OCR'd word to the page region it comes from. (Example.)
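
For concreteness, the bounding box data looks roughly like this; the field names are illustrative, not any particular vendor's response format:

```ts
// One OCR'd word and the page region it came from, in image pixels.
interface OcrWord {
  text: string; // the recognized word
  page: number; // index of the page image
  box: { x: number; y: number; width: number; height: number };
}
```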

Ambuda offers two nice features on top of this bounding box data, and you can try them both out in our editor.

1. First, the current word's bounding box is highlighted on the page image as you move through the text box. This is currently always on, but we will add an option to turn it off.
2. Second, you can click View > Track bounding box to automatically scroll the page image to the highlighted word (sketched just below). This saves the manual step of re-scrolling the page image into view.
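
A minimal sketch of the auto-scroll step, assuming the box is in image pixel coordinates and the page image sits inside a scrollable container; this is not the actual proofer.js code:

```ts
// Scroll the page-image container so the highlighted word is centered.
function scrollToBox(
  container: HTMLElement,
  image: HTMLImageElement,
  box: { x: number; y: number; width: number; height: number }
): void {
  // The image may be zoomed, so convert image pixels to on-screen pixels.
  const scale = image.clientWidth / image.naturalWidth;
  const centerY = (box.y + box.height / 2) * scale;
  container.scrollTo({
    top: centerY - container.clientHeight / 2,
    behavior: "smooth",
  });
}
```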

There are still some small bugs, but together, these features enable a nice workflow that I've been using for proofing. If you select View > Show image on top and click the "Fit width to container" option in the image controls, you get a large zoom on the relevant part of the image, and it auto-scrolls as you move through the page.

Despite advances in AI and OCR, human review will always be necessary to ensure that texts are trustworthy and reliable. I find that this flow removes the overhead of proofing a text and lets me focus on what is most valuable: making sure the text matches the page.

Arun

Screenshot 2026-01-31 at 9.06.03 AM.png

Arun

Feb 9, 2026, 1:09:39 AM
to sanskrit-programmers
In this third message, I'll talk about how projects are converted to published texts.

A project, which typically represents a single book, can contain any number of texts, and these texts can have complex relationships with each other. A simple example is an anthology that contains dozens or hundreds of smaller texts. A more complex example is a book that contains a main text along with its translation and commentary.

While we could create one project per text, doing so would be confusing and could lead to duplicated work. So, we need some way to publish a single project with multiple texts.

The way we do this on Ambuda is by defining a Publish config that declares basic metadata about the text and how to extract its content from the project. An example is below:

Screenshot 2026-02-08 at 10.01.03 PM.png

Most of the fields here should be straightforward: title is the display title, slug is the URL title, and author, genre, and language are self-explanatory. The small green + lets us quickly define a new genre or author. (Under the hood, these are represented as separate database rows and bound to the text with a foreign key relation.) Parent slug is specific to translations and commentaries, and we may move off of it in the future.
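
In code form, the config amounts to something like the following sketch; the field names mirror the form above but are not Ambuda's actual schema:

```ts
// Declares what a published text is and where its blocks come from.
interface PublishConfig {
  title: string;       // display title
  slug: string;        // URL title
  author: string;
  genre: string;
  language: string;
  parentSlug?: string; // only for translations and commentaries
  filter: string;      // s-expression query, e.g. "(image 5 15)"
}
```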

Filter is the interesting part. It defines a simple query language for extracting blocks of text from a project. For ease of implementation, the query language uses s-expressions with simple logical operators. Examples:

(image 5 15)      # Match all blocks from page image 5 to page image 15 inclusive
(image 5 15:foo)  # Match all blocks from page image 5 to page image 15 inclusive (ending at the first block with label `foo`)
(tag p)           # Match all blocks representing paragraphs
(label foo)       # Match all blocks marked with the label "foo"

(or (image 5 15) (label foo))   # Match all blocks matching (image 5 15) OR (label foo)
(and (image 5 15) (label foo))  # Match all blocks matching (image 5 15) AND (label foo)
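
To make the semantics concrete, here is a small sketch of a parser and per-block matcher for this language. It covers the operators above except the `15:foo` boundary form, which needs state across the block sequence; the Block fields are assumptions, not Ambuda's actual data model.

```ts
type SExpr = string | SExpr[];

// Assumed shape of a proofed block for this sketch.
interface Block {
  image: number;    // page image the block belongs to
  tag: string;      // semantic tag, e.g. "p"
  labels: string[]; // user-assigned labels
}

// Split "(or (image 5 15) (label foo))" into tokens, then parse recursively.
function tokenize(src: string): string[] {
  return src.replace(/([()])/g, " $1 ").trim().split(/\s+/);
}

function parse(tokens: string[]): SExpr {
  const tok = tokens.shift();
  if (tok === undefined) throw new Error("unexpected end of query");
  if (tok !== "(") return tok;
  const list: SExpr[] = [];
  while (tokens[0] !== ")") list.push(parse(tokens));
  tokens.shift(); // consume ")"
  return list;
}

function matches(expr: SExpr, block: Block): boolean {
  const [op, ...args] = expr as SExpr[];
  switch (op) {
    case "image": {
      const lo = Number(args[0]);
      const hi = args.length > 1 ? Number(args[1]) : lo;
      return block.image >= lo && block.image <= hi;
    }
    case "tag":
      return block.tag === args[0];
    case "label":
      return block.labels.includes(args[0] as string);
    case "and":
      return args.every((e) => matches(e, block));
    case "or":
      return args.some((e) => matches(e, block));
    default:
      throw new Error(`unknown operator: ${String(op)}`);
  }
}

// Example: select every block on pages 5-15 or labeled "foo".
const query = parse(tokenize("(or (image 5 15) (label foo))"));
const select = (blocks: Block[]) => blocks.filter((b) => matches(query, b));
```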

We are refining the system as we go so that we can publish texts more easily and pleasantly. For example, an earlier version of this query language did not support the label field when defining boundaries within images, so we ended up with complex queries like:

(or (and (image 42) (label PRAN)) (image 43) (and (image 44) (label PRAN)))

Adding an optional label-based boundary makes the language more expressive and the intent easier to understand, so for this text, we can simply define (image 67:GOPI_START 71).

Arun