Sanskrit OCR and rendering

venkat veeraraghavan

Jul 16, 2018, 1:02:56 PM
to भारतीयविद्वत्परिषत्
Dear List members,

Do you know of any open-source OCR (Optical Character Recognition) engines that can read Sanskrit image files and render the Sanskrit characters, with svara markings, in, say, a Word document?

Thanks for your help.

Venkat

ajit.gargeshwari

Jul 17, 2018, 4:14:06 AM
to भारतीयविद्वत्परिषत्
I doubt any software exists that can OCR Sanskrit texts the way one can OCR scanned English PDFs. The OCR software for Sanskrit texts that is being sold doesn't even come close to ABBYY FineReader.
You can see a review here
Regards
Ajit Gargeshwari

shankara

Jul 17, 2018, 4:23:08 AM
to भारतीयविद्वत्परिषत्
Namaste,

Google Drive's OCR is a good option; its output is up to 90% accurate as long as the image quality is good. The only drawback is a restriction of 10 pages per session (though this is not mentioned anywhere).



regards
shankara


--
You received this message because you are subscribed to the Google Groups "भारतीयविद्वत्परिषत्" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bvparishat+...@googlegroups.com.
To post to this group, send email to bvpar...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ajit Gargeshwari

Jul 17, 2018, 4:35:22 AM
to भारतीयविद्वत्परिषत्
Namaste Shankaraji,

90% accuracy is not good enough. archive.org has four thousand books printed prior to 1923. On average a book might have 300 or more pages. With current software it may not be possible to do any serious OCR even if one spends one's entire lifetime on it. After OCR the books have to be corrected for errors; I am factoring that operation in.
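
To make the scale of this argument concrete, here is a rough back-of-the-envelope sketch. Only the book count, the page count and the 90% figure come from the thread. The characters per page, the seconds per correction, the working hours per year, and the reading of "90%" as per-character accuracy are all assumptions chosen purely for illustration.

```python
# Back-of-the-envelope estimate of post-OCR correction effort.
# Values marked "assumed" are NOT from the thread; change them freely.

BOOKS = 4_000             # from the post
PAGES_PER_BOOK = 300      # from the post ("300 or more")
ACCURACY_PERCENT = 90     # figure quoted in the thread, read here as per-character
CHARS_PER_PAGE = 1_500    # assumed
SECONDS_PER_FIX = 10      # assumed time to spot and fix one wrong character
HOURS_PER_YEAR = 2_000    # assumed full-time working year


def correction_workload(books, pages_per_book, chars_per_page,
                        accuracy_percent, seconds_per_fix):
    """Return (total pages, expected wrong characters, correction hours)."""
    total_pages = books * pages_per_book
    total_chars = total_pages * chars_per_page
    # Integer arithmetic keeps the estimate free of float rounding noise.
    wrong_chars = total_chars * (100 - accuracy_percent) // 100
    hours = wrong_chars * seconds_per_fix / 3600
    return total_pages, wrong_chars, hours


if __name__ == "__main__":
    pages, errors, hours = correction_workload(
        BOOKS, PAGES_PER_BOOK, CHARS_PER_PAGE, ACCURACY_PERCENT, SECONDS_PER_FIX)
    print(f"{pages:,} pages, {errors:,} wrong characters, "
          f"{hours / HOURS_PER_YEAR:,.0f} person-years of correction")
```

Under these assumed numbers the correction pass alone comes to roughly 250 person-years, which is the kind of order of magnitude the post is pointing at. Different assumptions will move the figure considerably.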

Regards
Ajit Gargeshwari
न जायते म्रियते वा कदाचिन्नायं भूत्वा भविता वा न भूयः।
अजो नित्यः शाश्वतोऽयं पुराणो न हन्यते हन्यमाने शरीरे।।2.20।।

--
You received this message because you are subscribed to a topic in the Google Groups "भारतीयविद्वत्परिषत्" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/bvparishat/pbUn1CgjPyQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to bvparishat+unsubscribe@googlegroups.com.

Amba Kulkarni

Jul 17, 2018, 4:56:12 AM
to bvparishat, Devaraj Adiga
Hello Venkat,

Recently at the World Sanskrit Conference in Vancouver, there was a paper by Devaraj Adiga and others on OCR. The team at IIT Bombay has developed a wrapper around existing OCR engines to improve their performance.

The results look promising. The machine suggests corrections for unrecognised words and letters.

You may contact Devaraj Adiga, whom I've included in cc, for further details.

With kind regards,
Amba Kulkarni

आ नो भद्रा: क्रतवो यन्तु विश्वत: ll
Let noble thoughts come to us from every side.
- Rig Veda, I-89-i.

Professor & Head
Department of Sanskrit Studies
University of Hyderabad
Prof. C.R. Rao Road 
Hyderabad-500 046

(91) 040 23133802(off)



venkat veeraraghavan

Jul 17, 2018, 5:01:55 AM
to भारतीयविद्वत्परिषत्
Namaste Sri Ajitji and Sri Shankaraji:

Thank you for your inputs. One always has to proof-read after a scan and rendering of text just to make sure. 90% accuracy sounds really promising. Is this with svara marks or without?


venkat veeraraghavan

Jul 17, 2018, 5:04:38 AM
to भारतीयविद्वत्परिषत्
Thanks for that. This is where machine learning can be applied with a user interface to proof-read initial copies so that contextual knowledge is built into the engine. 

shankara

Jul 17, 2018, 5:05:44 AM
to भारतीयविद्वत्परिषत्
Venkatji,

I have not yet tried OCR on texts with svaras.

regards
shankara


shankara

Jul 17, 2018, 5:09:06 AM
to भारतीयविद्वत्परिषत्
Ajitji,

I agree with you. As pointed out in the review you shared, lack of demand from the Sanskrit community is one of the reasons the development of Sanskrit OCR is nowhere near that of English.

regards
shankara



Ajit Gargeshwari

Jul 17, 2018, 5:14:19 AM
to भारतीयविद्वत्परिषत्
Namaste Venkatji,
As Prof. Ambaji has said, there is promising research going on on the automation front. If you have a single text in mind, I suggest typing it: the task will be much easier and more error-free, and the resulting text will be one hundred percent searchable.

Regards
Ajit Gargeshwari
न जायते म्रियते वा कदाचिन्नायं भूत्वा भविता वा न भूयः।
अजो नित्यः शाश्वतोऽयं पुराणो न हन्यते हन्यमाने शरीरे।।2.20।।


venkat veeraraghavan

Jul 17, 2018, 5:22:16 AM
to भारतीयविद्वत्परिषत्
Sure. Thanks. Can you suggest easy-to-use Sanskrit editors in which it is easy to type Devanagari with svara markings?
Or should I make a separate post requesting this, since it is not related to the topic?

Nagaraj Paturi

Jul 17, 2018, 6:13:40 AM
to Bharatiya Vidvat parishad
No need to start a new thread. 




--
Nagaraj Paturi
 
Hyderabad, Telangana, INDIA.


BoS, MIT School of Vedic Sciences, Pune, Maharashtra

BoS, Chinmaya Vishwavidyapeeth, Veliyanad, Kerala

Former Senior Professor of Cultural Studies
 
FLAME School of Communication and FLAME School of  Liberal Education,
 
(Pune, Maharashtra, INDIA )
 
 
 

Panuganti Siva

Jul 17, 2018, 6:55:17 AM
to bvpar...@googlegroups.com
Dear Sir,

For adding svara marks to Veda mantras while typing, the software called 'Itrans99' will help. It is available online as a free download. Please try it by installing it on your PC.



Thanks
Siva Panuganti

--
Dr. Siva Panuganti, Ph.D.

venkat veeraraghavan

Jul 17, 2018, 7:16:15 AM
to भारतीयविद्वत्परिषत्
Thank you, Dr. Siva.

Balasubramanian Ramakrishnan

Jul 17, 2018, 7:35:47 AM
to bvpar...@googlegroups.com
Regarding Sanskrit editors with Vedic accents, you can use the Omkarananda Ashram itrans2003. You can find details at https://www.oah.in/Sanskrit/Itranslt.html

You can also try the Baraha software, which has free and paid versions. Google baraha.

Unfortunately, easy and good don't go together, as with everything else in life. All of these have issues, especially with the samyuktaakShara-s (conjunct consonants). If you want to produce professional, high-grade documents with Vedic accents, you do need to use Wikner's skt package with LaTeX. You can get multiple font sizes, bold, italic, etc. This has been my experience. It all depends on how particular you are. You do need to be very comfortable with LaTeX and get out of the WYSIWYG mode, though.

You can also use other fonts, such as Chandas, with LaTeX. The advantage is that you don't need a preprocessor, as you do with Wikner's font. You can use the polyglossia package with XeLaTeX. Just google polyglossia LaTeX. However, I ran into issues with that route because I needed certain modifications for generating "authentic" yajurveda documents, which I was able to get by hacking Wikner's C code for the preprocessor. Again, it all depends on how particular you are. I am extremely particular. If easy is the main criterion, go the Baraha route.

Ramakrishnan 

venkat veeraraghavan

Jul 17, 2018, 7:52:47 AM
to भारतीयविद्वत्परिषत्
Yes. I do need to churn out a high-quality document with Vedic accents. Instead of re-inventing the wheel, I'll just pick your brains regarding the same, sir.

Devaraj Adiga

Jul 17, 2018, 8:21:40 AM
to bvpar...@googlegroups.com
Namaste

Neither Google OCR (free to use, but each image file has to be uploaded to Google Drive and then opened via right-click, "Open with" > "Google Docs") nor Indsenz's SanskritOCR (commercial) recognizes Vedic accents.

As Prof. Ambaji mentioned, we came up with a post-processing tool for Indic OCRs, which can be downloaded from goo.gl/WqoVi2.

For all four Vedas, data in editable text format is already available.


Best Regards
Devaraj Adiga

Devaraj Adiga

Jul 17, 2018, 8:31:57 AM
to bvpar...@googlegroups.com
Namaste

A Unicode font for Devanagari named Shobhika (https://github.com/Sandhi-IITBombay/Shobhika), developed at IIT Bombay, supports the accents as well. It is one of the most aesthetic fonts for Sanskrit and closely resembles the font used by NirnayaSagara Press.

Best Regards
Devaraj Adiga


Balasubramanian Ramakrishnan

Jul 17, 2018, 10:00:27 AM
to bvpar...@googlegroups.com
Is it yajur, Rk or sAma? If it's Rk, you have many options. If it's taittirIya-yajur and you are extremely particular, then that's a different story. I am not sure about sAman, but Wikner has very good support there. The phonology of taittirIya-yajur is extremely involved, and you may have to resort to Wikner to produce high-quality documents if they are to be used as recitation aids.

Ramakrishnan

Balasubramanian Ramakrishnan

Jul 17, 2018, 10:13:36 AM
to bvpar...@googlegroups.com
Are there examples of Shobhika's output? I am indeed interested in moving away from Wikner to something Unicode-compliant. The link above does not seem to have any.

Ramakrishnan


venkat veeraraghavan

Jul 17, 2018, 11:01:50 AM
to bvpar...@googlegroups.com
It is Taittiriya Yajur.


Ramanujachar P

Jul 17, 2018, 11:27:22 AM
to bvpar...@googlegroups.com
Our site has many Taittiriya texts in Unicode.

--
Dr. P. Ramanujan
Parankushachar Institute of Vedic Studies (Regd.)
Bengaluru
080-25433239 (R)
9449088616

Harsha M

Jan 7, 2019, 2:18:41 AM
to Bharatiya Vidvat parishad
Hi All,

Sorry for opening a closed thread.

Instead of OCR, this is the algorithm I am thinking of following:
  -- First, split a sample of printed text into individual letters using ImageMagick.
  -- For every letter not already mapped, manually map it to a Unicode character or set of characters. This is a one-time act for one particular font. This is the "map" step.
  -- Once all the letters are mapped, feed each entire scanned page to ImageMagick and convert the pages to Unicode using the above mapping. This is the "reduce" step.

This works for one particular font; there is no OCR, as the mapping is manual.

Since we don't map handwritten or palm-leaf manuscripts and limit ourselves to scanned printed pages, we don't need a huge data set during the learning phase. However, as we continue this for more books, the previous data sets can be used for learning as well. No ML is needed, as the data set is limited. This is a simple map-reduce algorithm.
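
The steps above can be sketched in a few lines. This is a toy illustration under loud assumptions, not a working pipeline: the "image" is a list of strings in which `#` is ink and `.` is paper, letters are assumed to be separated by ink-free columns, and glyphs are matched by exact equality. Real Devanagari print joins letters with a headline stroke and scans never match pixel for pixel, so a real version would need headline removal and a similarity measure. The names `split_glyphs` and `transcribe` are hypothetical, and ImageMagick is not involved here.

```python
from typing import Dict, List, Tuple

# One cropped letter: rows of '#' (ink) and '.' (paper). Hypothetical toy format.
Glyph = Tuple[str, ...]


def split_glyphs(line: List[str]) -> List[Glyph]:
    """Cut a binarised text line into glyphs at columns that contain no ink."""
    width = len(line[0])
    glyphs: List[Glyph] = []
    start = None
    for x in range(width + 1):
        # The extra iteration (x == width) closes a glyph touching the right edge.
        has_ink = x < width and any(row[x] == "#" for row in line)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            glyphs.append(tuple(row[start:x] for row in line))
            start = None
    return glyphs


def transcribe(line: List[str],
               glyph_map: Dict[Glyph, str]) -> Tuple[str, List[Glyph]]:
    """Apply the manual glyph-to-Unicode map; collect glyphs that still need a label."""
    text: List[str] = []
    unknown: List[Glyph] = []
    for glyph in split_glyphs(line):
        if glyph in glyph_map:
            text.append(glyph_map[glyph])
        else:
            text.append("\ufffd")   # placeholder until a human labels this glyph
            unknown.append(glyph)
    return "".join(text), unknown


if __name__ == "__main__":
    line = ["#..##",
            "#..#."]
    glyph_map = {("#", "#"): "\u0915"}           # manual "map" step for one glyph
    text, todo = transcribe(line, glyph_map)     # todo holds the unlabeled glyph
    glyph_map[todo[0]] = "\u092e"                # a human labels it once
    print(transcribe(line, glyph_map)[0])
```

The design point is that `unknown` becomes the manual labeling queue: each new glyph is labeled once and every later page in the same font reuses the map.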

Has anybody attempted this? If yes, please give me some inputs. I would like to try this approach this year.

Regards,
Harsha






venkat veeraraghavan

Apr 26, 2019, 1:24:50 PM
to भारतीयविद्वत्परिषत्
Harshaji:

Could you kindly email me privately at vvenk...@gmail.com?

Thanks.

Karthik Raman

Nov 24, 2019, 12:13:47 AM
to भारतीयविद्वत्परिषत्
[Reviving an old thread, since someone might land here looking for this.]

http://ocr.sanskritdictionary.com/ is simply superb. Highly effective, and pretty accurate for the most part. Haven't tried swaras etc. much.

Regards,
Karthik

Ajit Gargeshwari

Nov 24, 2019, 12:36:35 AM
to भारतीयविद्वत्परिषत्
Have you tried it on 10 books of 199 pages? What is the accuracy? Please share some statistical feedback.

To view this discussion on the web visit https://groups.google.com/d/msgid/bvparishat/492963fd-366e-4441-9e85-3da609222b8a%40googlegroups.com.

Karthik Raman

Nov 24, 2019, 1:13:47 AM
to भारतीयविद्वत्परिषत्
No serious statistics. In practice, I have tried out several (mostly good scans) pages, and barely get 1-2 mistakes per page --- some of them being difficult for the human eye too, like a faint anuswara etc.

Downside: cannot be automated -- have to paste every page manually. It bills itself as a "sanskrit snippet OCR", not as a book OCR though... it seems to be built on Google (default) and can use Tesseract too.
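
For anyone who wants to put a number on "mistakes per page", one common measure is the character error rate: the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below is a plain-Python illustration and was not used by anyone in this thread. It counts Unicode code points, so in Devanagari a dropped anusvara or matra counts as one error; that granularity is an assumption, and a per-akshara count would give different figures.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute, or match
        prev = cur
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance per reference code point (0.0 means a perfect match)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)
```

Running `character_error_rate` over a handful of proofread pages from each book would give the kind of statistic asked for above without having to process whole books first.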

Regards,
Karthik



Praveen R. Bhat

Nov 24, 2019, 1:35:24 AM
to bvpar...@googlegroups.com
Namaste Karthikji,


Unless we choose to use Tesseract (which, with default settings, I haven't found to be very good), wouldn't using Google OCR directly via Google Drive be easier?

Kind rgds,
--Praveen R. Bhat
/* येनेदं सर्वं विजानाति, तं केन विजानीयात्। Through what should one know That owing to which all this is known! [Br.Up. 4.5.15] */

Karthik Raman

Nov 24, 2019, 4:19:44 AM
to bvpar...@googlegroups.com
Namaste Bhat ji,

Pardon my ignorance! I just saw Google OCR on this thread, but hadn't used it. You saved some 50 hours of my time by suggesting this. I OCR'ed a 150-page document in about half an hour!

नमो नमः! 

My current hack: put together a long image by appending PNGs (up to 2 MB) and upload it to Google Drive. Accuracy is OK --- anyway, careful proofreading is always important.

If anyone has ideas to auto proofread and flag errors in the text, kindly let me know!

My pet project, which I've never gotten around to, is on Chandas-based proofreading -- I've found it very useful to flag errors in many encoded texts, doing it semi-manually.
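
As a rough illustration of what metre-based flagging could look like (this is not the project mentioned above, just a minimal sketch), the code below counts syllables in a Devanagari pada and flags padas whose count differs from an expected value. It assumes an anushtubh text with 8 syllables per pada, counts a syllable for every independent vowel and every consonant not followed by a virama, and ignores laghu/guru patterns entirely. The function names are hypothetical.

```python
from typing import List, Tuple

VIRAMA = "\u094d"   # Devanagari sign virama (halant)
NUKTA = "\u093c"    # may sit between a consonant and its virama


def is_consonant(ch: str) -> bool:
    """Devanagari consonants KA..HA plus the precomposed nukta letters."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095f"


def is_independent_vowel(ch: str) -> bool:
    """Independent vowel letters, including vocalic RR and LL."""
    return "\u0904" <= ch <= "\u0914" or ch in "\u0960\u0961"


def count_syllables(text: str) -> int:
    """One syllable per independent vowel and per consonant that keeps a vowel."""
    count = 0
    for i, ch in enumerate(text):
        if is_independent_vowel(ch):
            count += 1
        elif is_consonant(ch):
            j = i + 1
            if j < len(text) and text[j] == NUKTA:
                j += 1
            # A consonant followed by virama is part of a conjunct or a final
            # halant and carries no vowel of its own.
            if not (j < len(text) and text[j] == VIRAMA):
                count += 1
    return count


def flag_padas(padas: List[str], expected: int = 8) -> List[Tuple[int, int]]:
    """Return (index, syllable count) for every pada that misses the expected count."""
    flagged = []
    for index, pada in enumerate(padas):
        n = count_syllables(pada)
        if n != expected:
            flagged.append((index, n))
    return flagged
```

A wrong count only says that something in the pada was dropped or added; it cannot catch a substitution that preserves the syllable count, so it complements rather than replaces proofreading.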

Regards,
Karthik


Irene Galstian

Nov 24, 2019, 4:51:59 AM
to भारतीयविद्वत्परिषत्
Karthik ji,

Do you automate the assembly of individual PNGs into the mega-PNG to be fed to Google?
And also, if the limit on PNG size is 2MB, how many pages did you manage to get into a single mega-PNG?  
In other words, I'm trying to see how labour-intensive a 150 page doc is with your method and how it can be automated as much as possible.

Thank you,
Irene

Mārcis Gasūns

Nov 24, 2019, 6:49:40 AM
to भारतीयविद्वत्परिषत्
https://ocr.sanskritdictionary.com/ takes a long time, but does the job now.

venkat veeraraghavan

Nov 24, 2019, 7:01:40 AM
to भारतीयविद्वत्परिषत्
None of these, however, seems to work with swara markings.


Karthik Raman

Nov 24, 2019, 9:26:27 AM
to bvpar...@googlegroups.com
Namaste Irene,

Yes, the mega file creation is automated. Have put up the script here https://gist.github.com/karthikraman/27e8c8f9fa5206d38fd1ecf702f3cc0a

The script is quick and dirty but works. I have a folder with all the png files 001.png through 284.png and then create these long PNGs. Ended up with seven files like: https://www.dropbox.com/s/eivwbeehugz9v4i/merged_001_016.png?dl=0 -- which can be fed to http://ocr.sanskritdictionary.com/ -- somehow, larger files didn't work. Had to restrict to ~600 kb.
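
(The gist itself is not reproduced here. As a rough illustration only, the grouping step described above might look like the sketch below. The function name plan_merges is hypothetical, and the figure of 16 pages per merged file is an assumption read off the example filename merged_001_016.png. Stacking the page images into one tall PNG needs an imaging library and is not shown; the ~600 kb limit would have to be checked on each file after it is written.)

```python
def plan_merges(first_page, last_page, pages_per_file):
    """Group page numbers into consecutive runs and name each merged file.

    Returns a list of (merged_filename, [page_filenames]) tuples.
    The naming scheme mirrors the 001.png ... 284.png folder in the post.
    """
    plan = []
    start = first_page
    while start <= last_page:
        # The last group may be shorter than pages_per_file.
        end = min(start + pages_per_file - 1, last_page)
        pages = [f"{n:03d}.png" for n in range(start, end + 1)]
        plan.append((f"merged_{start:03d}_{end:03d}.png", pages))
        start = end + 1
    return plan


if __name__ == "__main__":
    # 16 pages per file is an assumption, not a figure from the gist.
    for name, pages in plan_merges(1, 284, 16):
        print(name, "<-", pages[0], "...", pages[-1])
```

If a merged file turns out larger than the OCR site accepts, lowering pages_per_file and re-running the plan is the simplest adjustment.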

Regards,
Karthik




BVK Sastry

Nov 24, 2019, 10:57:18 AM11/24/19
to bvpar...@googlegroups.com

Namaste

On < None of these however seem to work with swara markings >:

BECAUSE

There is no clarity on SWARA CHINHAS in the LIPI-SAMKETHA of Bharateeya-bhashaas as seen in manuscripts and print.

To achieve this, a paradigm shift needs to take place in HMI for the Samskruth language.

As of date, there are not many people who see any value in exploring and investing on these lines. This is a 'self-contentment syndrome' which drains the research initiative in Samskruth studies.

The model is: 'if a kaupeenam is satisfying enough, why look for the goodness and aesthetics of a 16-yard dhoti to wrap the body?' Sorry if the simile hurts.

Regards

BVK Sastry


ajit.gargeshwari

Nov 25, 2019, 8:09:46 AM11/25/19
to भारतीयविद्वत्परिषत्
https://ocr.sanskritdictionary.com/ is not suitable if you have a large number of books to be OCRed. It may work with around 90 or 95% accuracy for a few pages. If one wants to OCR 10 books of 200 pages each, the amount of manual work required would be very high. No OCR tool is close to ABBYY FineReader, which works nearly perfectly for Greek or Latin scripts.
I have been reading about all these developments, but no good product has been developed as a package which can be sold or offered free.
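
(To put a rough number on "very high": the sketch below estimates how many characters would need manual correction for 10 books of 200 pages at a given accuracy. The figure of 1500 characters per page is purely an assumption for illustration, as is reading "accuracy" as per-character accuracy; neither comes from any measurement in this thread.)

```python
def expected_corrections(books, pages_per_book, chars_per_page, accuracy):
    """Estimate how many characters a proofreader must fix by hand.

    accuracy is the fraction of characters the OCR gets right (0.0 to 1.0).
    """
    total_chars = books * pages_per_book * chars_per_page
    return round(total_chars * (1.0 - accuracy))


if __name__ == "__main__":
    # chars_per_page=1500 is a hypothetical figure, not a measured one.
    for acc in (0.90, 0.95, 0.99):
        fixes = expected_corrections(10, 200, 1500, acc)
        print(f"accuracy {acc:.0%}: about {fixes} characters to correct")
```

Under these assumptions, even 95% accuracy leaves errors in the hundreds of thousands across the ten books, which is the point being made about proofreading effort.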
Regards
Ajit Gargeshwari

BVK Sastry

Nov 25, 2019, 11:37:05 AM11/25/19
to bvpar...@googlegroups.com

Namaste

I agree with the observation of Ajit Gargeshwari.

If I am to ask the question differently: what quality standard of OCR is needed for a Samskruth document? If current OCR solutions are not satisfactory for what we want, what should we call for in a Samskruth OCR, especially for manuscripts which did not use computer fonts?

I would set the expected quality standard at 200% accuracy! This also answers why 90 to 95% accuracy is unacceptable.

Why? The story (narrated in the Ramayana) of the ocean jump may help to explain this demand on OCR quality for Samskruth: a language sensitive to the change (addition, dropping or modification) of even a single swara (?!), varna (?!) or maatraa (?!).

 

 

Story (from the Ramayana, for Samskruth OCR design):

The vanara-sena (with Angada the chief, the wise old veteran Jambavan, the young enthusiasts Nala, Neela et al., and the quiet Hanuman ...) was on the seashore.

The vanara-sena had the directive from King Sugriva to find Sita and bring back information and a complete report on her; failure carried capital punishment! It was death either way: jump into the ocean or return with the mission failed.

The vanara-sena had the information: Sita is captive in the city of Lanka on the other side of the sea, a hundred yojanas away. But the means to achieve the goal was the missing element.

The vanara-sena TEAM started asking the question: who amongst us is capable of jumping across the ocean and bringing back a message about Sita?

Each team member carried out a self-assessment and declared: I can jump, one way, ten, fifty, seventy ... ninety, ninety-five yojanas.

The chief, Prince Angada, said he could jump a hundred yojanas and land on the other side, but was not sure of jumping back to bring the validated information about Sita at Lanka! This could not be an acceptable option, for it meant not only failing to get the information validated, but also risking their future king.

So the message was clear! A one-way, 100% jumping ability is NOT enough for the work. The messenger needs to go, validate and return safe. That is, the bar set for action is 200% plus. This situation led the team even to the point of killing themselves by various means, like jumping into the sea or starving to death.

At this point Jambavan is said to have stepped in and prompted the quiet Hanuman to undertake the work, using his dormant powers. The rest is the story from the Kishkinda-kaanda leading to the Yuddha-kaanda!

 

How is this relevant to Samskruth OCR? When Venkat Veeraraghavan noted < None of these however seem to work with swara markings >, the challenge ahead became clear.

The current OCR designs have not understood the 'Lipi-Shaastra' model of the Devanagari and Brahmi scripts. On the contrary, many of them are based on the alien model of Roman script design.

The need is for the home team to work ab initio, with clarity, on 'OCR for Samskruth based on Brahmi-language native design', a feature which is not present in the current Unicode Devanagari glyph model of script shape construction. The indology model of 'script evolution and scripting conventions in the multiple scripts of India' is an approximation of about 45 to 60% in reading the manuscripts.

Further deliberation on these points clearly involves issues with commercial implications and is technology oriented; it is not appropriate for a free-for-all public forum debate.

I had raised this issue almost a decade ago: the need to advance on the IPA diacritic model approach for Brahmi-language scripts. This would call for ab initio rethinking on the use of the 'Romanization approach' and the foundation for advancing the thought of a 'Samskruth programming language / Panini machine development'.

The goal is achievable in a realistic timeline, given the ingenuity and capability of the current Samskruth and computer-related human resources.

TEAM work can help to jump-start this 'ocean jumping'! But only if Samskruth teams get to 'Saha-Charyaa' (co-working, collaborative working, common motivation and concurrent objectives) to revisit 'language-computing models', 'computational linguistics base and basics', and MT and AI goals.

The shift needed is to work with a new perspective on the 'language-technology relation' using 'Panini in Vedanga mode'. There is no need to discard all that has been done. But there is every necessity to see what the Samskruth language needs from technology. This need not be the same as the goals and expectations set, achieved or designed for English! Each language has a body and soul of its own unique nature.

Now, it all depends on the 'Karma-Samkalpa / shraddhaa': the interest and initiative to develop in-house technology tools appropriate for Samskruth. And what about 'Purushartha' (benefit)? And who invests and supports? For Punya and Purushartha?

One is always free to make the conscious choice of sacrificing 'Punya and Purusha' for 'Artha' gains!

Please write to me offline, with what your plans are, if this interests you.

 

Regards

 

BVK Sastry

 

 

 


venkat veeraraghavan

Nov 25, 2019, 11:59:57 AM11/25/19
to भारतीयविद्वत्परिषत्
Dear Shri Sastry Garu,

The svara issue can be solved by treating each svara marking over or under the akshara as a separate symbol or separate state of the same symbol which would end up increasing the character set of Vedic Sanskrit by a factor of 3.
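
(A small sketch of what "a factor of 3" means in practice, assuming the marks in question are the two Devanagari accent signs U+0951, placed above, and U+0952, placed below. In Unicode these are separate combining characters, so each akshara has an unmarked form and two marked forms. The function names are hypothetical, and other Vedic accent marks exist that this sketch ignores.)

```python
import unicodedata

UDATTA = "\u0951"    # DEVANAGARI STRESS SIGN UDATTA (mark above the akshara)
ANUDATTA = "\u0952"  # DEVANAGARI STRESS SIGN ANUDATTA (mark below the akshara)


def svara_variants(akshara):
    """Return the unmarked akshara plus its two accent-marked forms."""
    return [akshara, akshara + UDATTA, akshara + ANUDATTA]


def expanded_classes(aksharas):
    """Build the enlarged label set an OCR model would have to recognise."""
    return [variant for a in aksharas for variant in svara_variants(a)]


if __name__ == "__main__":
    base = ["अ", "क", "ग्नि"]
    classes = expanded_classes(base)
    print(len(base), "base aksharas ->", len(classes), "classes")
    for mark in (UDATTA, ANUDATTA):
        print(f"U+{ord(mark):04X}", unicodedata.name(mark))
```

Treating the mark as a separate symbol instead (the other option mentioned above) would keep the base set unchanged and add only the marks themselves as extra classes, at the cost of having to attach each detected mark to the right akshara afterwards.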

The issue of 200% accuracy, I fear, will never be unconditionally achieved and will always be dependent on the quality of the document.
As in all things GIGO (Garbage in Garbage out) applies.
Even with machine learning algorithms this will not work without human scrutiny and interference.

Kind Regards,

Venkat
 


BVK Sastry

Nov 25, 2019, 1:21:15 PM11/25/19
to bvpar...@googlegroups.com

Namaste

1. On < The svara issue can be solved by treating each svara marking over or under the akshara as a separate symbol or separate state of the same symbol which would end up increasing the character set of Vedic Sanskrit by a factor of 3. >:

A band-aid patch is not the total curative medicine! I am asking for the willingness to go for the real medicine, and not the nice bandage.

2. On < The issue of 200% accuracy I fear will never be unconditionally achieved and will always be dependent on the quality of the document. As in all things GIGO (Garbage in Garbage out) applies. Even with machine learning algorithms this will not work without human scrutiny and interference. >:

You are conceding defeat even before stepping onto the battlefield and calling for the war to start! This is the typical syndrome of the first chapter of the Gita, and of the eleventh chapter, where seeing the cosmic reality, Viswaroopa, triggers fear before devotion!

This is because of two reasons. The existing education for becoming an IT/computer professional has transformed the 'human to work like a machine'.

The model I am proposing is 'become human' and 'design a machine that will be human-like / support human needs like a human / humanize the machine'. This is the 'Prana-Pratishtaa' principle in 'shivo bhootvaa shivam yajet'. Make a machine to simulate the human model.

Unless one is ready to see the shifts for the new paradigm, which is endorsed in tradition and is being explored by the major IT corporates, one can remain happily at a traffic light, burning fuel, not moving forward, waiting for someone ahead to yield and move. The only comfort is listening to the noise from the neighbouring vehicle or self-chosen music from one's own radio!

This situation is a result of double jeopardy: an inaccurate understanding of the tradition for Purushartha-Vijnana, and selling out to the 'alien model of machine over human' philosophy for the comfort of a system promoting lessened effort and lowered use of human faculties in the name of systems organization!

Ajit Gargeshwari

Nov 25, 2019, 1:21:49 PM11/25/19
to भारतीयविद्वत्परिषत्
To put things in a straightforward manner: I scan a 200-page Sanskrit book at, say, 300 or 600 DPI and run an OCR. I should be able to copy and paste most of what I have scanned into Notepad or a Word document. The more accurate the recognition, the less proof correction is needed. Can I do that using any existing software? The answer is no. (OCR should be faster and more accurate than manual typing.)
Software using the Google interface is available, but one first has to convert the PDF to images, upload the images one at a time, download the image, and convert the image back to PDF. For a 200-page book I will have to do this 200 times. Google's software works for up to 10 or 20 pages. Not all books are that small. If I have thousands of books to OCR, can I complete the work within my lifetime?
With a 200-page scanned book in English without diacritics, I am able to copy most of what's scanned into a Word document or Notepad.
In short, we don't have good software to OCR Sanskrit books.
Now let me come back to https://ocr.sanskritdictionary.com/. Why should I upload what I scan to a third-party server? I have already written about the limitations when one wants to OCR large books.
Everything else that Prof Sastry has written I don't understand.


Regards
Ajit Gargeshwari
न जायते म्रियते वा कदाचिन्नायं भूत्वा भविता वा न भूयः।
अजो नित्यः शाश्वतोऽयं पुराणो न हन्यते हन्यमाने शरीरे।।2.20।।


Irene Galstian

Nov 25, 2019, 1:34:44 PM11/25/19
to भारतीयविद्वत्परिषत्
Writing a script to get Google to OCR a PDF of any page count makes sense. Then one could bulk-buy scanning credits from Google and get on with one’s work. At least that’s what I’ve decided to do after looking through the immediately available options. Thank you to all the thread participants for the help and advice.

Irene

Ajit Gargeshwari

Nov 25, 2019, 1:38:48 PM11/25/19
to भारतीयविद्वत्परिषत्
Will Google allow it? Please write once Google permits.


BVK Sastry

Nov 25, 2019, 2:06:52 PM11/25/19
to bvpar...@googlegroups.com

Namaste

< Ajit (1): Why should I upload what I scan to a third party server; Ajit (2): Will google allow please write once google permits. >

Irrespective of what Google does or does not do, one is ready to share the documents with the public, allow 'less than 100%' scans to be on 'public global access', and then get on to < the job of proof correction > in 'manual mode', with less knowledgeable editors!

Why not, in the first phase, get the right software designed? Or encourage manual digital text entry in native-language scripts?

The lure of going 'free and easy' has taken deep roots and drained the desire to develop need-appropriate tools. The saying is 'sab chaltaa hai, adjust maadikolli swamy' (= everything is fine and OK; just make some adjustments!).

This does not work for Samskruth; it does more damage and harm to the core tradition, for all the good intentions.

There is a beautiful meemaamsaa term for this: killing with mercy (pashu maarana-yajna); good intentions to send the animal to heaven!

Regards

BVK Sastry

 


Irene Galstian

Nov 25, 2019, 2:17:57 PM11/25/19
to भारतीयविद्वत्परिषत्
Google allows, we know that already. I have lots of meetings plus the scanning queue for the next 2 weeks or so, then I'll work on the script. Once this thing is up and running, we'll try both my and your Devanagari PDFs in it and go from there.

Irene



Sudarshan HS

Nov 26, 2019, 10:02:11 PM11/26/19
to भारतीयविद्वत्परिषत्
Namaste Karthik,


On Sunday, 24 November 2019 19:56:27 UTC+5:30, Karthik Raman wrote:
Yes, the mega file creation is automated. Have put up the script here https://gist.github.com/karthikraman/27e8c8f9fa5206d38fd1ecf702f3cc0a

The script is quick and dirty but works. I have a folder with all the png files 001.png through 284.png and then create these long PNGs. Ended up with seven files like: https://www.dropbox.com/s/eivwbeehugz9v4i/merged_001_016.png?dl=0 -- which can be fed to http://ocr.sanskritdictionary.com/ -- somehow, larger files didn't work. Had to restrict to ~600 kb.

This is an excellent idea to make mega PNGs, and great to know that they work. Will give this script a try. I have seen that the sa.wikisource.org team are manually concatenating page-wise outputs till now and this may be useful for them too.

BTW, we have tried http://ocr.sanskritdictionary.com/ on handwritten text in Devanagari & Telugu lipis (including old manuscripts) and it now works much better than earlier. Though the quality is nowhere near the handling of printed text, it appears that there is active work going on in the backend; the handling of background noise, slanted lines and lipi variants has improved in the last 2 years.

- Sudarshan

pravesh vyas

Nov 27, 2019, 1:15:04 AM11/27/19
to भारतीयविद्वत्परिषत्
The premium version of the "CamScanner" application has a Sanskrit OCR option. It works very well. I tried it.

Darshat Shah

Nov 27, 2019, 1:30:36 AM11/27/19
to भारतीयविद्वत्परिषत्
Yes, I agree with Sudarshan.
http://ocr.sanskritdictionary.com is hooked up to the Google Cloud Vision API. It wasn't working very well on handwritten docs a few months back. Lately the API seems to have improved a lot. As an example, it detects the text in the margin, rotated by 90 degrees, in the attached image.

It also supports batch mode (up to 2000 images), so if one uses the API directly the mega-merge approach should not be needed, I think. The Vedavaapi team at IIITH is building a transcription/OCR platform with Google OCR as one of the options. So hopefully everybody can use that platform - https://www.vedavaapi.org/
image-2.jpg

venkat veeraraghavan

Nov 27, 2019, 4:51:48 AM11/27/19
to भारतीयविद्वत्परिषत्
Does it recognise and render svara markings well?
And what is the accuracy % ?


Shashi Joshi

Nov 27, 2019, 5:10:08 AM11/27/19
to bvpar...@googlegroups.com
It does Devanagari, not Vedic.
And the Devanagari conversion is extremely good.
I found 3 minor mistakes in a full-page scan of the Kathopanishad with Sanskrit and Hindi.
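
(For anyone who wants an accuracy % rather than a count of mistakes, one common way is to compare the OCR output against a hand-corrected reference using edit distance. The sketch below is only that: a sketch. It counts Unicode code points, so a conjunct or an akshara with a matra counts as several characters, and the function names are hypothetical.)

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, ocr_output):
    """Fraction of the reference recovered correctly, floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    reference = "धर्मक्षेत्रे कुरुक्षेत्रे"
    ocr_output = "धर्मक्षेत्रे कुरुक्षेत्र"  # final matra dropped
    print(f"{char_accuracy(reference, ocr_output):.1%}")
```

This is quadratic in the text length, which is fine for a page at a time but slow for a whole book in one call.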


Thanks,
Shashi

venkat veeraraghavan

Nov 27, 2019, 5:13:09 AM11/27/19
to भारतीयविद्वत्परिषत्
ah good. Thanks!


Irene Galstian

Nov 27, 2019, 7:09:51 AM11/27/19
to भारतीयविद्वत्परिषत्
"directly using the API"
Do they have their own API others can use? If so, I'd like to try that out; it could be a quicker option than going via the Cloud Vision API directly.
Could you please say more? 

Thank you,
Irene



Darshat Shah

Nov 27, 2019, 7:18:05 AM11/27/19
to भारतीयविद्वत्परिषत्
I meant that the Google Cloud Vision API also provides a bulk mode - see here https://cloud.google.com/vision/docs/batch
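
(For readers who have not used the API: the sketch below only builds the JSON body for the synchronous images:annotate REST endpoint and splits a list of page images into requests. It does not send anything; authentication with an API key or OAuth token is needed for that and is not shown. The language hint "sa" and the limit of 16 images per request are assumptions to be checked against the current Google documentation, and the asynchronous batch mode linked above uses a different endpoint that is not covered here.)

```python
import base64
import json

ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"
MAX_IMAGES_PER_REQUEST = 16  # assumed per-request limit; check current quotas


def build_request(image_blobs, language_hints=("sa",)):
    """Build the JSON-serialisable body for one images:annotate call."""
    requests = []
    for blob in image_blobs:
        requests.append({
            # Image bytes are sent base64-encoded inside the JSON body.
            "image": {"content": base64.b64encode(blob).decode("ascii")},
            "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            "imageContext": {"languageHints": list(language_hints)},
        })
    return {"requests": requests}


def chunked(items, size=MAX_IMAGES_PER_REQUEST):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Placeholder bytes stand in for real page images read from disk.
    pages = [b"fake-png-bytes-1", b"fake-png-bytes-2"]
    for batch in chunked(pages):
        body = json.dumps(build_request(batch))
        print(len(batch), "image(s),", len(body), "bytes of JSON for", ENDPOINT)
```

With real files, each entry in pages would be the bytes of one page image, and each body would be POSTed to the endpoint with credentials attached.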

Irene Galstian

Nov 27, 2019, 7:33:48 AM11/27/19
to भारतीयविद्वत्परिषत्
Oh, I see, that's what I have too. Thank you for clarifying. 

Irene

Irene Galstian

Nov 30, 2019, 3:10:12 PM11/30/19
to भारतीयविद्वत्परिषत्
Dear Pravesh Vyas,

How did you set up Camscanner to recognise Sanskrit? 
I just downloaded it on my phone to try out. One thing I have to say is that this app has way too many button clicks to be suitable for large volume scanning. But even so, I'm curious to find out the setup for Devanagari OCR, even just for Sanskrit, since the languages displayed in the OCR section of the app don't include Sanskrit as an option.

Thank you,
Irene

Shashi Joshi

Nov 30, 2019, 9:20:00 PM11/30/19
to bvpar...@googlegroups.com
Irene ji,
You need to do a cloud OCR for this (100 free cloud OCRs allowed with free version).
For the paid version you can download it locally.
Please find the attached PDF with screenshots of the process.

There is one screenshot missing, where it asks for a second time to do OCR.

The app is indeed very powerful, with a lot of options, but if you want to do bulk conversions, you will need to figure out a way to fix the camera above the book to be scanned.

Once that is figured out:
- the app offers an air-bubble type of level meter, so you know the camera is perfectly horizontal.
- it allows a bulk mode of scanning, where you can keep scanning and all those scans will become one page.
- it does a decent job of straightening the scanned part and identifying the 'useful' rectangle that has text. (It may need initial 'training' through a few corrections.)
- it also allows post-scanning corrections.

It is worth the time to figure this app out, if you have the plans to go for the paid version.


Thanks,
~ Shashi



श्रीमल्ललितालालितः

Jan 10, 2020, 1:54:38 PM1/10/20
to भारतीयविद्वत्परिषत्
http://ocr.sanskritdictionary.com/ uses Google Vision and Tesseract.
I'm using the Vision API.
We have a few scripts available which can automate the process. You need to register for the Google Vision API, though.
For splitting PDFs there are a few Python scripts.

Currently, I'm using two scripts to do OCR for big books.

Irene Galstian

Jan 10, 2020, 4:03:03 PM1/10/20
to भारतीयविद्वत्परिषत्
I'm registered already. As for scripts, I looked around briefly to see if any are available to use and modify but didn't find any. Hence writing my own. However, if you know of any publicly available ones, please post further information. It's good to have extra options. I mean the OCR itself here, not PDF splitting and other simple tasks.

Chandramouli Mahadevan

Feb 16, 2023, 10:42:25 AM2/16/23
to bvpar...@googlegroups.com
Can you share the scripts? I am looking for a batch processing option. Splitting a PDF into JPGs is not a problem. Many thanks. Chandramouli


shankara

Feb 16, 2023, 11:24:31 AM2/16/23
to bvpar...@googlegroups.com
Namaste,

You may try the following.

The online / offline OCR tool at pdf24.org allows us to OCR PDFs in multiple languages and create a searchable text layer within the PDF.


regards
shankara

