I'm sharing this after some discussions with various people involved with Sanskrit transcription and OCR, both in India and in the west.
This is just my attempt at a reasonable proposal. Please leave comments or let me know what you think.
Ideally, someone has already solved this problem so well that I don't need to do anything. :-)
Arun
A major problem is the tension between logical textual sections (for example, the commentary on shloka 12) and book-printing "sections" (for example, page 123), and likewise between logical textual formatting (paragraphs, lists) and book-printing formatting (https://github.com/sanskrit-lexicon/csl-lnum/issues/2). Ultimately, the former is useful (leading to presentations like https://vishvasa.github.io/kalpAntaram/smRtiH/manuH/sarva-prastutiH/01_praveshaH/), while the latter is more of a hindrance (though convenient for side-by-side comparison). Eventually, one needs to think about better solutions for this.
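To make the distinction concrete, here is a minimal sketch of one possible approach: the text is stored by logical section, and printed page boundaries are kept only as inline markers, so the page-wise view can be derived when it is needed for side-by-side comparison. The `[[page N]]` marker syntax and the function names are my own assumptions for illustration, not a convention used by any of the projects linked above.

```python
import re

# Hypothetical marker format: a line such as "[[page 123]]" records
# where a printed page begins inside a logical section.
PAGE_MARKER = re.compile(r"^\[\[page (\d+)\]\]$")


def split_by_page(section_text):
    """Return (page_number, text) pairs for one logical section.

    Text that appears before the first marker gets page number None.
    """
    pages = []
    current_page = None
    buffer = []
    for line in section_text.splitlines():
        match = PAGE_MARKER.match(line.strip())
        if match:
            # A new page starts: flush what was collected so far.
            if buffer:
                pages.append((current_page, "\n".join(buffer)))
            current_page = int(match.group(1))
            buffer = []
        else:
            buffer.append(line)
    if buffer:
        pages.append((current_page, "\n".join(buffer)))
    return pages


def strip_page_markers(section_text):
    """Return the logical text alone, with page markers removed."""
    return "\n".join(
        line
        for line in section_text.splitlines()
        if not PAGE_MARKER.match(line.strip())
    )
```

With this layout, the logical view (`strip_page_markers`) is the primary one, and the page view (`split_by_page`) is only a derived aid for proofreading against scans.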
An important related challenge is how to deal with footnotes. I suppose you will follow Wikisource in just transcribing the corresponding text in an appropriate section of a pseudo-page. For direct transcription to markdown, here's what I've arrived at for now: https://sanskrit.github.io/groups/dyuganga/projects/text/proofreading/procedure_en/ .
I'd love a transcription system which favors this kind of "direct" transcription (allowing placement of additional structural and formatting hints) while bringing in some nice UI to display the relevant page on the side.
It should go without saying, but is worth making explicit: the data ought to be downloadable (possibly in various formats, including my favorite, markdown) and not be stuck in the system.