Proposal: A distributed transcription platform for Sanskrit texts


Arun

Oct 7, 2021, 10:46:18 PM
to sanskrit-programmers

I'm sharing this after some discussions with various people involved with Sanskrit transcription and OCR, both in India and in the west.

This is just my attempt at a reasonable proposal. Please leave comments or let me know what you think. Ideally, someone has already solved this problem so well that I don't need to do anything. :-)

Arun

विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 7, 2021, 11:21:07 PM
to sanskrit-programmers
On Fri, Oct 8, 2021 at 8:16 AM Arun <aru...@gmail.com> wrote:

I'm sharing this after some discussions with various people involved with Sanskrit transcription and OCR, both in India and in the west.

Incidental ad, in case anyone is interested in participating in text-fixing/formatting: https://groups.google.com/g/sanskrit-ocr/c/HYeKytDxnzY

 

This is just my attempt at a reasonable proposal. Please leave comments or let me know what you think.

अत्यन्तं चारुर् विचारस् ते ऽरुणप्रसाद! [A most lovely idea of yours, Arunaprasada!]

A major problem is the tension between logical textual sections (example: commentary for shloka 12) and book-printing "sections" (example: page 123), and between logical textual formatting (paragraph, list) and book-printing formatting (https://github.com/sanskrit-lexicon/csl-lnum/issues/2). Ultimately, the former is useful (leading to presentations like https://vishvasa.github.io/kalpAntaram/smRtiH/manuH/sarva-prastutiH/01_praveshaH/ ) and the latter is more of a hindrance (though convenient for side-by-side comparison). Eventually, one needs to think about better solutions for this.

An important related challenge is how to deal with footnotes. I suppose you will follow Wikisource in just transcribing the corresponding text in an appropriate section of a pseudo-page. For direct transcription to markdown, here's what I've arrived at for now: https://sanskrit.github.io/groups/dyuganga/projects/text/proofreading/procedure_en/

I'd love a transcription system which favors this kind of "direct" transcription (allowing placement of additional structural and formatting hints) while bringing in some nice UI to display the relevant page on the side.


Not to mention (but good to be made explicit) - data ought to be downloadable (possibly in various formats including my favorite - markdown) and not be stuck in the system.



Ideally, someone has already solved this problem so well that I don't need to do anything. :-)


सा स्थितिर् नास्ति :-( [That is not the situation.]

 
Arun

--
Vishvas /विश्वासः

Arun

Oct 8, 2021, 12:10:35 AM
to sanskrit-programmers
धन्यवादो 'स्मि मित्र। [I am grateful, friend.]

On Thursday, October 7, 2021 at 8:21:07 PM UTC-7 vi...@gmail.com wrote:

A major problem is the tension between logical textual sections (example: commentary for shloka 12) and book-printing "sections" (example: page 123), and between logical textual formatting (paragraph, list) and book-printing formatting (https://github.com/sanskrit-lexicon/csl-lnum/issues/2). Ultimately, the former is useful (leading to presentations like https://vishvasa.github.io/kalpAntaram/smRtiH/manuH/sarva-prastutiH/01_praveshaH/ ) and the latter is more of a hindrance (though convenient for side-by-side comparison). Eventually, one needs to think about better solutions for this.

I believe Distributed Proofreading splits this into two phases. The first phase follows the book structure for side-by-side comparison, as you suggested. Then a formatting specialist merges and combines these changes in the second phase. Distributed Proofreading has been around long enough that they have their own conventions for this and (presumably) their own tooling as well. I suppose we will get there someday.
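To make the two-phase idea concrete, here is a minimal sketch of the second phase: page-level transcriptions are merged into logical sections, and page breaks are ignored. It is not Distributed Proofreading's actual tooling. The `Page` and `Section` types, the `merge_pages` function, and the `## ` section marker are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Page:
    """Phase 1 output: text proofread against a single page image."""
    number: int
    lines: list[str]


@dataclass
class Section:
    """Phase 2 output: a logical unit such as "commentary for shloka 12"."""
    title: str
    paragraphs: list[str] = field(default_factory=list)


# Hypothetical marker a formatter places where a logical section starts.
SECTION_MARK = "## "


def merge_pages(pages: list[Page]) -> list[Section]:
    """Join page-level text into logical sections, ignoring page breaks."""
    sections: list[Section] = []
    buffer: list[str] = []

    def flush() -> None:
        # Close the paragraph being collected, if any.
        if buffer:
            if not sections:
                # Text before the first marker goes into an untitled section.
                sections.append(Section(title=""))
            sections[-1].paragraphs.append(" ".join(buffer))
            buffer.clear()

    for page in sorted(pages, key=lambda p: p.number):
        for line in page.lines:
            if line.startswith(SECTION_MARK):
                flush()
                sections.append(Section(title=line[len(SECTION_MARK):].strip()))
            elif not line.strip():
                flush()  # a blank line ends a paragraph
            else:
                buffer.append(line.strip())
        # No flush at the page boundary: a paragraph may continue overleaf.
    flush()
    return sections
```

The page number survives only as a sort key here. A real system would probably keep it as metadata so that the side-by-side view can still be reconstructed.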
 
An important related challenge is how to deal with footnotes. I suppose you will follow Wikisource in just transcribing the corresponding text in an appropriate section of a pseudo-page. For direct transcription to markdown, here's what I've arrived at for now: https://sanskrit.github.io/groups/dyuganga/projects/text/proofreading/procedure_en/

I'm not sure yet, but for now, I'll note that Distributed Proofreading uses a slightly different format that is more markdown-like: https://www.pgdp.net/wiki/DP_Official_Documentation:Proofreading/Proofreading_Guidelines#Footnotes.2FEndnotes . Either way, the conventions here are not for me to decide on my own. As Panini said, it is अन्यप्रमाणत्व [the authority lies elsewhere]. If this project has wide enough support, we can deliberate on this further.
 
I'd love a transcription system which favors this kind of "direct" transcription (allowing placement of additional structural and formatting hints) while bringing in some nice UI to display the relevant page on the side.

Yes, I think a little would go a long way. Fast, simple, and built specifically for Sanskrit.
 

Not to mention (but good to be made explicit) - data ought to be downloadable (possibly in various formats including my favorite - markdown) and not be stuck in the system.

Absolutely -- the result will be exportable as plain text of some kind. I would find anything less to be morally unacceptable.
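As a rough illustration of what such an export could look like, here is a small sketch that renders sections as Markdown. The `to_markdown` function and the (title, paragraphs) input shape are assumptions made for this example, not a decided format.

```python
from __future__ import annotations


def to_markdown(sections: list[tuple[str, list[str]]]) -> str:
    """Render (title, paragraphs) pairs as plain Markdown text."""
    blocks: list[str] = []
    for title, paragraphs in sections:
        blocks.append(f"## {title}")  # one heading per logical section
        blocks.extend(paragraphs)     # each paragraph becomes its own block
    # Blank lines between blocks, one trailing newline at the end.
    return "\n\n".join(blocks) + "\n"
```

Because the output is an ordinary string, it can be written to a file, committed to a repository, or converted to other formats with existing tools.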

Arun


Arun

Oct 8, 2021, 12:11:05 AM
to sanskrit-programmers
PS: That of course should be धन्यो'स्मि [I am grateful].