OCR-ing Abhyankar's grammar dictionary?

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 21, 2016, 12:43:59 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH
  • Abhyankar's dictionary (अभ्यङ्करकोशः): I would like to obtain a digitized copy of it, but I have neither the skill nor the time for that. Could one of you do it? Any volunteers?

--
--
Vishvas /विश्वासः

Shrivathsa B

Feb 21, 2016, 2:20:19 PM
to saMskRRita-sandesha-shreNiH

hariH OM,
sakhe,

If it is mixed (devanAgarii + English), it will be very difficult. Send me a few page scans and I can tell you how easy or difficult it is. For mixed content you may be better off typing from scratch.

If it is only devanAgarii content, send me the scans and I will do it. If the scans are at 300 dpi, the OCR output will be quite good.

svasti,
       JAYA BHAVAANII BHAARATII,
                                                      shrivathsa.

--
You received this message because you are subscribed to the Google Groups "samskrita" group.
To unsubscribe from this group and stop receiving emails from it, send an email to samskrita+...@googlegroups.com.
To post to this group, send email to sams...@googlegroups.com.
Visit this group at https://groups.google.com/group/samskrita.
For more options, visit https://groups.google.com/d/optout.

Nityanand Misra

Feb 21, 2016, 7:50:24 PM
to samskrita, sanskrit-p...@googlegroups.com

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 24, 2016, 2:21:43 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK
+vedam_chandu who is interested in following up on this.

I've never done OCR myself, so I request more experienced list-folk to guide him as needed.

Here are some links to OCR software:

  1. Sanskrit optical character recognition (OCR) tools (TAD1BX, Gdrive-SP).
To start off, try D1.

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 25, 2016, 12:26:14 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK

Luckily, using the wonderful infrastructure and the couple of hundred machines I have access to at my workplace, I was able to cobble together something to get a better OCR in about an hour: https://raw.githubusercontent.com/sanskrit-coders/stardict-sanskrit/master/sa-head/abhyankar-grammar/abhyankar-grammar-gocr.txt . Now all that remains is for someone to:
1] Mark new headwords with a string, say "############".
2] Fix egregious errors, especially in the headwords, to facilitate lookup. Typos in the meanings are more tolerable (the fixes are usually obvious to the reader).

For example, the current text has:
हेमाब्दानुशासनलधुन्यास a short comm-
entary on Hemacandra'sSabdanu-
Sव्रsana written by Devendrastri.
हेमाब्दानुशासनव्राते a short gloss call-
ed अवचूरि also, written by a Jain
grammarian नन्दसुन्दर on the ईम-
इब्दानुद्भासन. _
ह्यस्तनी imperfect tense; a term
used by ancient grammarians for
the affixes of the immediate past
tense, but not comprising the
present day, corresponding to the
term लङ्क of Pafini. The term is
found in the Katantra and Haima-
candra grammars; cf. Kt. III.
1.23, 27; cf. Hema. III. 3.9.
इस्व short, a term used in connec-
tion with the short vowels taking
a umit of time measured by one
matra for their utterance: cf.
ऊकालेोज्इरस्वदीर्घप्लुत: P. I. 2.27.
This should be replaced with the following (note the bolded letters, such as श and ह्र, which have been fixed):
############
हेमाब्दानुशासनलधुन्यास a short comm-
entary on Hemacandra's Sabdanu-
Sव्रsana written by Devendrastri.
हेमाब्दानुशासनव्राते a short gloss call-
ed अवचूरि also, written by a Jain
grammarian नन्दसुन्दर on the हेम-
ब्दानुद्भासन. _
############
ह्यस्तनी imperfect tense; a term
used by ancient grammarians for
the affixes of the immediate past
tense, but not comprising the
present day, corresponding to the
term लङ्क of Pafini. The term is
found in the Katantra and Haima-
candra grammars; cf. Kt. III.
1.23, 27; cf. Hema. III. 3.9.
############
ह्रस्व short, a term used in connec-
tion with the short vowels taking
a umit of time measured by one
matra for their utterance: cf.
ऊकालेोज्इरस्वदीर्घप्लुत: P. I. 2.27.
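
A first pass at step 1 could probably be automated. The sketch below is not part of the workflow described in this thread; it is a rough heuristic under stated assumptions: a headword line starts with a Devanagari character, and the previous line ends a sentence (it ends with a period, but not with "cf.", which in this text tends to introduce a Devanagari citation). The name `mark_headwords` and the rule itself are illustrative only, and the output would still need a human check, since a misread headword or citation can fool it.

```python
import re

MARKER = "############"
# Any character in the Unicode Devanagari block (U+0900 to U+097F).
DEVANAGARI_START = re.compile(r"^[\u0900-\u097F]")


def mark_headwords(lines):
    """Insert MARKER before lines that look like the start of an entry.

    Heuristic (an assumption, not a rule taken from the dictionary):
    the line starts with Devanagari, and the previous non-empty line is
    either absent or ends a sentence without a dangling "cf.".
    """
    out = []
    prev = ""
    for raw in lines:
        line = raw.strip()
        prev_closed = prev == "" or (
            prev.endswith(".") and not prev.endswith("cf.")
        )
        if DEVANAGARI_START.match(line) and prev_closed:
            out.append(MARKER)
        out.append(raw)
        if line:
            # Drop trailing OCR junk such as " _" before remembering the line.
            prev = line.rstrip(" _")
    return out


# A few lines taken from the uncorrected excerpt above.
sample = [
    "ह्यस्तनी imperfect tense; a term",
    "used by ancient grammarians for",
    "1.23, 27; cf. Hema. III. 3.9.",
    "इस्व short, a term used in connec-",
    "tion with the short vowels taking",
    "matra for their utterance: cf.",
    "ऊकालेोज्इरस्वदीर्घप्लुत: P. I. 2.27.",
]
marked = mark_headwords(sample)
```

The marker goes in only as a candidate boundary; fixing the headword spelling itself (step 2) remains manual work.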




Shreevatsa R

Feb 25, 2016, 9:09:19 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK
This is wonderful!

It may be better to keep the OCR result as separate pages, and have them associated with the original images.

This can help in two ways:
(1) whoever volunteers to proofread it can use something like Distributed Proofreaders (http://www.pgdp.net/) or Wikisource (see screenshots in attached images) to proofread the text by seeing the scanned book page beside it, and 
(2) it may be helpful to include with each entry a link to the page in the book, so that even if the reader/user suspects OCR errors, they can click on the link and see the original page. (I've used this feature in the MW dictionary a few times.)
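
To make point (2) concrete, here is a minimal sketch of how entries could carry a page reference if the OCR output is kept page by page. Everything named here is an assumption for illustration: the `Entry` class, `split_entries`, `page_link`, and especially the `SCAN_URL` template, which is a placeholder and not a real location of the scans. It also assumes headwords have already been marked with "############" as proposed earlier in the thread.

```python
from dataclasses import dataclass
from typing import List

MARKER = "############"
# Hypothetical URL pattern for page scans; replace with the real host.
SCAN_URL = "https://example.org/abhyankar/page_{page:04d}.png"


@dataclass
class Entry:
    headword: str
    page: int          # 1-based index of the page on which the entry starts
    lines: List[str]


def split_entries(pages, marker=MARKER):
    """Split per-page OCR text into entries, remembering the start page.

    `pages` is a list of strings, one per scanned page, in book order.
    Text before the first marker is ignored.
    """
    entries = []
    start_new = False
    for page_no, page_text in enumerate(pages, start=1):
        for raw in page_text.splitlines():
            line = raw.strip()
            if not line:
                continue
            if line == marker:
                start_new = True
                continue
            if start_new:
                # The first token of the first line is taken as the headword.
                entries.append(Entry(line.split()[0], page_no, [line]))
                start_new = False
            elif entries:
                # Continuation lines, possibly spilling onto the next page.
                entries[-1].lines.append(line)
    return entries


def page_link(entry, template=SCAN_URL):
    """Build a link to the scan of the page where the entry begins."""
    return template.format(page=entry.page)


pages = [
    "############\nह्यस्तनी imperfect tense; a term\nused by ancient grammarians for",
    "the affixes of the immediate past\n############\nह्रस्व short, a term used in connec-",
]
entries = split_entries(pages)
```

Note that the page index here is the position of the scan in the list, which may be offset from the page number printed in the book.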



--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
std_proofing_interface.png
Distributed_Proofreaders.png
wikisource.png

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 25, 2016, 11:59:08 PM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 12:35:11 AM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK
I am planning to seek volunteers for what I hope will be a big OCR-ing project. But before that, I want to figure out the smoothest flows for proofreading. Can anyone help me achieve the views suggested by shrIvatsa for this book?

shankara

Feb 26, 2016, 1:39:02 AM
to sams...@googlegroups.com, sanskrit-programmers, INCIDENT SIETK
Vishvasji,

I appreciate your efforts in digitizing Sanskrit kosas. In the case of Abhyankar's dictionary, it is necessary to note that this book is not yet in the public domain. It was first published in 1961. I am not sure about the year of the author's demise (it must be after 1961). As per the Indian Copyright Act, a book enters the public domain only 60 years after the year of the author's demise, if the copyright is with the author or his family. If the copyright is with the publisher, then the book enters the public domain 60 years after the date of publication, i.e. in this case in 1921.
 
regards
shankara





विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 26, 2016, 1:39:11 AM
to sanskrit-programmers, संस्कृतसन्देशश्रेणिः samskrta-yUthaH, INCIDENT SIETK
OK, I succeeded in setting up proofreading on Sanskrit Wikisource using instructions from here:
Proofreading setup on wikisource:
Now all one needs to do is:
  • Copy the OCR-ed text here to the appropriate pages.
  • Correct stuff progressively.
All of the above is linked here 
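
For the copying step, a small helper could prepare the per-page text in advance. This is only a sketch under assumptions that are not confirmed in this thread: that the combined OCR file separates pages with a form feed character, that the index file is called `Abhyankar.djvu` (a made-up name), and that page titles follow the `Page:<file>/<number>` pattern used by the proofreading extension on Wikisource (the namespace name on Sanskrit Wikisource may differ).

```python
def pages_for_wikisource(ocr_text, index_file, first_page=1, sep="\f"):
    """Map assumed Wikisource page titles to per-page OCR text.

    `sep` is the assumed page separator in the combined OCR dump;
    `index_file` is the (hypothetical) name of the uploaded scan file.
    """
    pages = {}
    for offset, chunk in enumerate(ocr_text.split(sep)):
        title = f"Page:{index_file}/{first_page + offset}"
        pages[title] = chunk.strip()
    return pages


combined = "text of the first page\n\ftext of the second page\n"
by_title = pages_for_wikisource(combined, "Abhyankar.djvu")
```

Each value can then be pasted into the page with the matching title and corrected there against the scan.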

ajit.gargeshwari

Feb 26, 2016, 2:40:09 AM
to samskrita, sanskrit-p...@googlegroups.com, vedam_...@yahoo.com, shanka...@yahoo.com
Shankaraji,

You mean 2021. You are right; it is always better to ask for the publisher's consent for material that is under copyright, and publishers normally do give their consent. A problem will arise if the publisher objects, and more problems will arise if the publisher takes legal action. Normally they don't.

shankara

Feb 26, 2016, 2:50:29 AM
to sams...@googlegroups.com
Ajitji,

You are right. Please read 2021 in place of 1921.
 
regards
shankara





Manish Modi

Feb 26, 2016, 11:26:47 PM
to samskrita, sanskrit-p...@googlegroups.com
Mr Shankara is quite right. Here is the Indian Copyright Act, 1957. Please go through it carefully.

Indian Copyright Act, 1957.

http://copyright.gov.in/Documents/CopyrightRules1957.pdf

Ajit Gargeshwari

Feb 27, 2016, 12:54:21 AM
to Samskrita Google Group
If formal permission from the publisher has not been obtained, the crowdsourcing and OCR-ing of this book need to be stopped.

Regards
Ajit Gargeshwari
It is never born, nor does it ever die; nor, having once been, does it ever cease to be.
Unborn, eternal, everlasting, and ancient, it is not slain when the body is slain. (2.20)

--
You received this message because you are subscribed to a topic in the Google Groups "samskrita" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/samskrita/4kYOv3sfgIo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to samskrita+...@googlegroups.com.

Pradyumna Achar

Feb 29, 2016, 11:55:59 PM
to samskrita
The publisher seems to be MSU Baroda.
A list of publications under the Gaekwad Oriental Series is at their website:
http://www.msubaroda.ac.in/faculty.php?action=home&fac_id=17

Ajit Gargeshwari

Mar 1, 2016, 12:06:25 AM
to Samskrita Google Group
One needs to contact
Prof. Sweta Prajapati
Director
Oriental Institute,
The M.S.University of Baroda
Opp. Palace Gate, palace Road,
Vadodara  - 390001 (Guj)
M: +919898472669
Email swet...@gmail.com

Regards
Ajit Gargeshwari

Ajit Gargeshwari

Mar 1, 2016, 12:17:11 AM
to Samskrita Google Group
I spoke with the director. She has asked the leader of the volunteer group to send her a formal request, which she will then consider. The request and its purpose need to be clear. She has asked that the request be sent to spraja...@yahoo.co.in

Regards
Ajit Gargeshwari

Anunad Singh

Mar 1, 2016, 2:10:04 AM
to sams...@googlegroups.com
I have made the changes mentioned above.
The corrected file is attached in zip format.
abhayankar_kosh_02.txt.zip

विश्वासो वासुकिजः

Mar 2, 2016, 4:35:07 PM
to samskrita, Nityanand Misra नित्यानन्द-मिश्रः रामभद्राचार्यशिष्यः, Ajit Gargeshwari
I have only just now seen several messages in this thread that were not addressed to me separately.

Friend Shrivatsa, the problem has now been nicely resolved.

Gentlemen Nityanand and Ajit: your pointers have been a great help.

On Sunday, 21 February 2016 at 11:20:19 AM UTC-8, Shrivathsa B wrote: