OCR-ing abhyankar's grammar dictionary?

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 21, 2016, 12:43:56 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH
  • Abhyankar's dictionary (अभ्यङ्करकोशः) - I would like to obtain a digitized copy of it. But I have neither the skill nor the time for that. Would anyone among you do it? Any volunteers?

--
--
Vishvas /विश्वासः

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 24, 2016, 2:21:41 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK
+vedam_chandu who is interested in following up on this.

I've never done OCR myself, so I request more experienced list-folk to guide him as needed.

Here are some links to OCR software:

  1. Sanskrit optical character recognition (OCR) tools (TAD1BX, Gdrive-SP).
To start off, try D1.

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 25, 2016, 12:26:13 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK

Luckily, using the wonderful infrastructure and a couple of hundred machines I have access to at my workplace, I was able to cobble together something to get a better OCR in about an hour - https://raw.githubusercontent.com/sanskrit-coders/stardict-sanskrit/master/sa-head/abhyankar-grammar/abhyankar-grammar-gocr.txt . Now, all that remains is for someone to:
1] Mark new headwords with a string - say "############". 
2] Fix egregious errors - especially in the headwords - to facilitate lookup. Typo errors in the meanings are more tolerable (usually the fixes are obvious to the reader).

For example:
in the current text we have:
हेमाब्दानुशासनलधुन्यास a short comm-
entary on Hemacandra'sSabdanu-
Sव्रsana written by Devendrastri.
हेमाब्दानुशासनव्राते a short gloss call-
ed अवचूरि also, written by a Jain
grammarian नन्दसुन्दर on the ईम-
इब्दानुद्भासन. _
ह्यस्तनी imperfect tense; a term
used by ancient grammarians for
the affixes of the immediate past
tense, but not comprising the
present day, corresponding to the
term लङ्क of Pafini. The term is
found in the Katantra and Haima-
candra grammars; cf. Kt. III.
1.23, 27; cf. Hema. III. 3.9.
इस्व short, a term used in connec-
tion with the short vowels taking
a umit of time measured by one
matra for their utterance: cf.
ऊकालेोज्इरस्वदीर्घप्लुत: P. I. 2.27.
This should be replaced with the following (note the corrected letters, such as श and ह्र, which were bolded in the original message):
############
हेमाब्दानुशासनलधुन्यास a short comm-
entary on Hemacandra's Sabdanu-
Sव्रsana written by Devendrastri.
############
हेमाब्दानुशासनव्राते a short gloss call-
ed अवचूरि also, written by a Jain
grammarian नन्दसुन्दर on the हेम-
ब्दानुद्भासन. _
############
ह्यस्तनी imperfect tense; a term
used by ancient grammarians for
the affixes of the immediate past
tense, but not comprising the
present day, corresponding to the
term लङ्क of Pafini. The term is
found in the Katantra and Haima-
candra grammars; cf. Kt. III.
1.23, 27; cf. Hema. III. 3.9.
############
ह्रस्व short, a term used in connec-
tion with the short vowels taking
a umit of time measured by one
matra for their utterance: cf.
ऊकालेोज्इरस्वदीर्घप्लुत: P. I. 2.27.
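Once headwords are marked this way, the file becomes easy to split into entries by program. The following is only a minimal sketch, assuming the marker sits alone on its own line and that the headword is the first whitespace-delimited token of each entry; `split_entries` is a hypothetical helper name, not something from the repository.

```python
MARKER = "############"


def split_entries(text, marker=MARKER):
    """Split marker-separated OCR text into (headword, body) pairs.

    Assumptions (not from the thread): the marker occupies a whole line,
    and the headword is the first whitespace-delimited token of the entry.
    Any text before the first marker is treated as an entry as well.
    """
    groups = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == marker:
            # A marker closes the entry collected so far (if any).
            if current:
                groups.append(current)
            current = []
        elif stripped:
            current.append(stripped)
    if current:
        groups.append(current)

    entries = []
    for lines in groups:
        headword, _, rest = lines[0].partition(" ")
        body = " ".join([rest] + lines[1:]).strip()
        entries.append((headword, body))
    return entries


sample = """############
ह्यस्तनी imperfect tense; a term
used by ancient grammarians for
############
ह्रस्व short, a term used in connec-
tion with the short vowels taking
"""

entries = split_entries(sample)
for headword, body in entries:
    print(headword, "=>", body)
```

Words hyphenated across line breaks (such as "connec-" / "tion") are left as they are here; rejoining them would be a separate cleanup step.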




Shreevatsa R

Feb 25, 2016, 7:00:58 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK
This is wonderful!

It may be better to keep the OCR result as separate pages, and have them associated with the original images.

This can help in two ways:
(1) whoever volunteers to proofread it can use something like Distributed Proofreaders (http://www.pgdp.net/) or Wikisource (see screenshots in attached images) to proofread the text by seeing the scanned book page beside it, and 
(2) it may be helpful to include with each entry a link to the page in the book, so that even if the reader/user suspects OCR errors, they can click on the link and see the original page. (I've used this feature in the MW dictionary a few times.)
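To illustrate point (2), here is a rough sketch of how per-page OCR text could be turned into an index from headword to scanned page. It assumes the OCR output is kept as one text per page, that headwords are preceded by a "############" marker line as proposed earlier in the thread, and that page images are reachable through some URL pattern; the `url_template` value and the page numbers below are purely hypothetical placeholders.

```python
def entries_with_pages(pages, marker="############",
                       url_template="https://example.org/scans/page/{page}"):
    """Yield (headword, page_number, page_url) for each marked entry.

    `pages` is an iterable of (page_number, page_text) pairs.
    `url_template` is a made-up placeholder, not a real scan location.
    """
    for page_number, text in pages:
        lines = text.splitlines()
        for i, line in enumerate(lines):
            # A headword is taken to be the first token of the line
            # that directly follows a marker line.
            if line.strip() == marker and i + 1 < len(lines):
                next_line = lines[i + 1].strip()
                if next_line:
                    headword = next_line.split()[0]
                    yield headword, page_number, url_template.format(page=page_number)


# Hypothetical page numbers, for illustration only.
pages = [
    (1, "############\nह्यस्तनी imperfect tense; a term\nused by ancient grammarians\n"),
    (2, "############\nह्रस्व short, a term used in\n"),
]

index = list(entries_with_pages(pages))
for headword, page_number, url in index:
    print(headword, page_number, url)
```

Keeping the page number next to each entry means a dictionary front end can show a "view original page" link without re-running any OCR.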



--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

std_proofing_interface.png
Distributed_Proofreaders.png
wikisource.png

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 25, 2016, 11:59:07 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK

Anunad Singh

Feb 26, 2016, 12:34:03 AM
to sanskrit-p...@googlegroups.com


On Thu, Feb 25, 2016 at 10:55 PM, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:


Luckily using the wonderful infrastructure and a couple of hundred machines I have access to at my workplace, I was able cobble together something to get a better OCR in about an hour - https://raw.githubusercontent.com/sanskrit-coders/stardict-sanskrit/master/sa-head/abhyankar-grammar/abhyankar-grammar-gocr.txt  . Now, all that remains is for someone to:
1] Mark new headwords with a string - say "############". 
2] Fix egregious errors - especially in the headwords - to facilitate lookup. Typo errors in the meanings are more tolerable (usually the fixes are obvious to the reader).

Marking the headwords can be done better automatically, using find and replace in regular expression mode:

find                         \.\n([^a-zA-Z0-9\,\;\(\)\[\]\-\{\}])

Replace with         .\n\n>>> \1


a sample of the output of such replacement:

>>> सौत्र belonging to the stra; found
in the sutra as contrasted with
what is given elsewhere; cf. सौत्रोयं
धातु: or सौत्रं पुरुरवम etc. cf also सौत्री
निर्दढाः M. Bh. on P. III. 2.139,
III. 4.60, 64, IW. 2.64 etc.

>>> सौनाग name of a school of ancient
grammarians who composed Var-
ttikas in explanation of the stras
of Panini; cf. सौनागाः पठन्ति P. III.
2.56 Virt. 1, IV. 1.74 Vart. 1.
cf. एतदेव सौननैर्विस्तरतरकेण पाठतम् M.
Bh. on II. 2.18 Wart. 4.

>>> सूर्यभगवान् an ancient grammarian
quoted in the Mahibhasya: cf.

>>> तत्र सीर्यभगवतेाक्तमनिधिज्ञी वाडवः पठति |
इष्यत एव चतुर्मात्रः त: M. Bh. on
P. VIII. 2.106 Vart. 3.

>>> संवादिक a root of the स्वादिगण or the
Fifth Conjugation.

>>> स्कन्धच् a tad. affix in the sense of
collection, added to the words
नर, करि and तुरङ्ग: cf. Varttika on P.
IV. 2.51 quoted in the Kasik-
vrtti.

>>> स्तु a term used for the sibilant स्
and dental class consonants for
thc substitution of the sibilant ्
and palatal consonants in respec-
tive order cf. स्तोः श्चुना श्चुः P.
VIII. 4.40.

>>> खत्रो (1) the sense of the feminine;
cf. त्रयाम् P. IV. ].3-8l (2) a word
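For anyone who prefers a script to an editor, the find/replace pair above can be sketched in Python roughly as follows. The pattern is copied from the message; the function name `mark_headwords` is just an illustrative choice. The second example shows that a Devanagari quotation starting a line after "cf." gets marked too, so the output still needs review.

```python
import re

# Same pattern as in the message: a full stop at the end of a line, followed
# by a line whose first character is not a Latin letter, digit, or one of
# the listed punctuation marks (so typically a Devanagari character).
HEADWORD_BREAK = re.compile(r"\.\n([^a-zA-Z0-9\,\;\(\)\[\]\-\{\}])")


def mark_headwords(text):
    """Insert a blank line and a '>>> ' marker before each suspected headword."""
    return HEADWORD_BREAK.sub(r".\n\n>>> \1", text)


good = mark_headwords("III. 4.60, 64, IW. 2.64 etc.\nसौनाग name of a school of ancient")
print(good)

# A quotation after "cf." is wrongly marked as a headword.
false_positive = mark_headwords("quoted in the Mahibhasya: cf.\nतत्र पठति")
print(false_positive)
```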

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 12:35:09 AM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK
I am planning to seek volunteers for what I hope will be a big OCR-ing project. But before that, I want to figure out the smoothest flows for proofreading. Can anyone help me achieve the views suggested by shrIvatsa for this book?

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 12:40:03 AM
to sanskrit-programmers
2016-02-25 21:34 GMT-08:00 Anunad Singh <anu...@gmail.com>:
Marking the keywords can be better done automatically using find and replace in regular expression mode-

find                         \.\n([^a-zA-Z0-9\,\;\(\)\[\]\-\{\}])

Replace with         .\n\n>>> \1



​Can almost be done automatically!​ But see below:

 
>>> तत्र सीर्यभगवतेाक्तमनिधिज्ञी वाडवः पठति |
इष्यत एव चतुर्मात्रः त: M. Bh. on
P. VIII. 2.106 Vart. 3.
​This is not a headword.​


​But if you're able to mark the headwords (mostly) through a series of regex​ replacements - please do so! Even with all the errors, the output of such an effort can still be used to produce an early version of a very useful stardict dictionary.

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 1:39:09 AM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK
OK - I succeeded in setting up proofreading on Sanskrit Wikisource, using the instructions from here:
Proofreading setup on wikisource.
Now all one needs to do is:
  • Copy the OCR-ed text (here) to the appropriate pages.
  • Correct stuff progressively.
All of the above is linked here.

Anunad Singh

Feb 26, 2016, 4:03:27 AM
to sanskrit-p...@googlegroups.com
On Fri, Feb 26, 2016 at 11:09 AM, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

2016-02-25 21:34 GMT-08:00 Anunad Singh <anu...@gmail.com>:
Marking the keywords can be better done automatically using find and replace in regular expression mode-

find                         \.\n([^a-zA-Z0-9\,\;\(\)\[\]\-\{\}])

Replace with         .\n\n>>> \1



​Can almost be done automatically!​ But see below:

 
>>> तत्र सीर्यभगवतेाक्तमनिधिज्ञी वाडवः पठति |

इष्यत एव चतुर्मात्रः त: M. Bh. on
P. VIII. 2.106 Vart. 3.
​This is not a headword.​


​But if you're able to mark the headwords (mostly) through a series of regex​ replacements - please do so! Even with all the errors, the output of such an effort can still be used to produce an early version of a very useful stardict dictionary.


--
--
Vishvas /विश्वासः


Vishvas ji,

I feel the 'automatic' replacements should be done centrally by one person. Regarding the error you pointed out, it (and similar errors caused by the previous step) can be undone by

find                           cf.\n\n>>>\s

replace with              cf.\n

There are some systematic errors which can also be cleared in a semi-automatic way. Among them, the first is replacing the visarga wherever it comes after non-Devanagari characters. The second is replacing the halanta wherever it comes after non-Devanagari characters.
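The undo step and the two systematic fixes could be scripted along these lines. This is a sketch under stated assumptions: the message does not say what the misplaced visarga and halanta should become, so turning a visarga (U+0903) into a colon and deleting a stray halanta (U+094D) are guesses that should be checked against the scans. The dot in "cf." is escaped here so that it only matches a literal full stop.

```python
import re

# Undo a headword marker that was wrongly inserted right after "cf."
FALSE_MARK = re.compile(r"cf\.\n\n>>>\s")

# Assumption: a visarga not preceded by a Devanagari character
# is an OCR misreading of a colon.
MISPLACED_VISARGA = re.compile(r"(?<![\u0900-\u097F])\u0903")

# Assumption: a halanta not preceded by a Devanagari character
# is a stray mark and can be dropped.
STRAY_HALANTA = re.compile(r"(?<![\u0900-\u097F])\u094D")


def clean(text):
    """Apply the undo step and the two semi-automatic fixes, in that order."""
    text = FALSE_MARK.sub("cf.\n", text)
    text = MISPLACED_VISARGA.sub(":", text)
    text = STRAY_HALANTA.sub("", text)
    return text


fixed = clean("Mahibhasya\u0903 cf.\n\n>>> तत्र वाडवः पठति")
print(fixed)
```

A visarga or halanta that follows a Devanagari letter, as in वाडवः or स्, is left untouched.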

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 8:46:07 AM
to sanskrit-programmers

2016-02-26 1:03 GMT-08:00 Anunad Singh <anu...@gmail.com>:

I feel the 'automatic' replacements be done centrally by one person. Regarding the error  pointed by you, it (and similar other errors caused by the previous step) can be undone  by

find                           cf.\n\n>>>\s

replace with              cf.\n

There are some systematic errors which also can be cleared in semiautomatic way. Among them, first one is replacement of visarga where ever it comes after non-Devanagari characters. The second one it replacing halanta where ever it comes after non-Devanagari characters.

Good, sir. Would you be able to do this? It would be a help to us. If so, please start from https://raw.githubusercontent.com/sanskrit-coders/sanskrit-ocr-r0/master/vaak/vyAkaraNam/abhyankar-grammar/abhyankar-grammar-gocr.txt .

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 12:24:33 PM
to sanskrit-programmers, Vidya Jayaraman
+ smt vidya who will proofread this.

Vidya J

Feb 26, 2016, 12:55:22 PM
to sanskrit-programmers, vidy...@gmail.com
The process of using Wikisource is straightforward. I initially started off editing in place and then went on to use regex within a text editor for common patterns and strings. I am finding that this can only be done one page at a time in Wikisource. If someone has already started editing one page at a time, what is the way to pull the text out, do a global replace, and then continue further along? Maybe we should have a set of standard replacements for all OCR-ed documents before we upload them to Wikisource? (Please correct me if I am missing some step here.)

From a process standpoint, is there any way on Wikisource to distinguish between pages that an individual proofreader has proofread and pages that a subsequent second-pass reviewer has checked? I wanted to check that before marking something as "proofread".

Vidya
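One way to picture a "set of standard replacements" is an ordered list of (pattern, replacement) pairs that is run over every page before upload. The sketch below is illustrative only: the three rules are taken from misreadings visible in the OCR samples earlier in this thread ("Pafini", "umit", "thc"), and the names `STANDARD_REPLACEMENTS` and `apply_replacements` are hypothetical.

```python
import re

# Ordered (pattern, replacement) rules; order matters if rules overlap.
STANDARD_REPLACEMENTS = [
    (r"\bPafini\b", "Panini"),
    (r"\bumit\b", "unit"),
    (r"\bthc\b", "the"),
]


def apply_replacements(text, rules=STANDARD_REPLACEMENTS):
    """Apply every rule, in order, to one page of OCR text."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


pages = [
    "term लङ्क of Pafini. The term is",
    "a umit of time; thc substitution",
]
cleaned_pages = [apply_replacements(page) for page in pages]
for page in cleaned_pages:
    print(page)
```

Keeping the rules as data rather than as one-off editor actions makes it possible to re-run them on a fresh OCR pass or on another book.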

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 1:33:54 PM
to sanskrit-programmers, Vidya J
2016-02-26 9:55 GMT-08:00 Vidya J <vidy...@gmail.com>:
 If someone already started editing a page at a time, what is the way to pull it out, globally replace and then continue further along.  May be we should have a set of standard replacements for all OCRed documents before we upload them on wikisource?  (please correct if I am missing some step here)? 
Oh - ​Warning!! 😇 The ​OCR-ed text you see in wikisource is from archive-OCR, which does not handle devanAgarI. So, the superior OCR text has not been uploaded at all. You should copy paste it.

 

From a process standpoint, is there anyway on wikisource to distinguish between pages that an individual proofreader has proofread it and a subsequent second-pass reviewer? I wanted to check that before marking something as "proofread". 
Yes - see the पृष्ठस्थिति (page status) setting at the bottom of this screenshot - http://i.imgur.com/GOJP4CE.png - and just set it appropriately.

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 9:50:24 PM
to sanskrit-programmers, Vidya J
namaste vidya, could you save the regex replacements you make (in the form described here) for future use? Also, mind signing into github and assigning https://github.com/sanskrit-coders/sanskrit-ocr-r0/issues/3 and https://github.com/sanskrit-coders/sanskrit-ocr-r0/issues/1 to yourself? This will help propagate updates (such as the below) to people interested in this task without bothering the list.

Now we have two ocr-s (quoting from the newer readme)
OCR-ed text
  • gocr: Better english worse sanskrit here.
  • gocr: Better sanskrit worse english here.

विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 2, 2016, 4:39:45 PM
to sanskrit-programmers
Only now did I realize that shrI anunad had sent in a corrected file: https://groups.google.com/d/msg/samskrita/4kYOv3sfgIo/Jm4UqqavAAAJ (I am subscribed to samskrita@googlegroups on a no-email setting, so hadn't noticed it earlier).

Thanks, anunAda!
