extraction of Grantha Script from a Scanned PDF (OCR Scan extract)

Prashanth Anantharamu

Feb 20, 2022, 9:13:51 PM
to sanskr...@googlegroups.com

Namaste!

  FYI, I am volunteering to extract some shlokas in Grantha script from a scanned PDF and then transliterate them to Devanagari. While researching this process, I came across some of your work on Google and at https://sites.google.com/site/sanskritcode/ocr/0-introduction, so I thought I would take the liberty of reaching out for any possible help or leads on this extraction and transliteration.

The source scripture is a scanned PDF, which I tried to extract (OCR scan-to-text) using tools like Adobe, Sejda, etc. The text gets extracted but distorted, because these tools don't support some of the native fonts. I am also exploring Google Cloud Vision OCR; no luck yet!

Can you please advise on the best approach or tool for extracting Grantha scripture from a scanned PDF?

Sample text

image.png


Thanks & Regards

Prashanth Anantharaman
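
For the transliteration half of this request, a minimal Python sketch follows. It assumes the OCR step has already produced Unicode Grantha text (U+11300 to U+1137F), and it relies on the observation that most Grantha letters and signs sit at the same position within their block as their Devanagari counterparts. The function name and the range table are illustrative, not part of any tool mentioned in this thread, and characters outside the listed ranges are passed through unchanged.

```python
# Hypothetical helper: Grantha -> Devanagari by code-point offset.
# Assumption: input is Unicode Grantha (block U+11300..U+1137F).
GRANTHA_TO_DEVANAGARI_OFFSET = 0x11300 - 0x0900

# Inclusive Grantha ranges that are assumed to line up with Devanagari.
ALIGNED_RANGES = [
    (0x11301, 0x11303),  # candrabindu, anusvara, visarga
    (0x11305, 0x1130C),  # independent vowels a .. vocalic l
    (0x1130F, 0x11310),  # independent vowels e, ai
    (0x11313, 0x11328),  # independent vowels o, au; consonants ka .. na
    (0x1132A, 0x11330),  # consonants pa .. ra
    (0x11332, 0x11333),  # consonants la, lla
    (0x11335, 0x11339),  # consonants va .. ha
    (0x1133C, 0x11344),  # nukta, avagraha, vowel signs aa .. vocalic rr
    (0x11347, 0x11348),  # vowel signs e, ai
    (0x1134B, 0x1134D),  # vowel signs o, au; virama
    (0x11350, 0x11350),  # om
    (0x11360, 0x11363),  # vocalic rr, vocalic ll and their vowel signs
]


def grantha_to_devanagari(text: str) -> str:
    """Map each aligned Grantha character to Devanagari; keep the rest."""
    out = []
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ALIGNED_RANGES):
            out.append(chr(cp - GRANTHA_TO_DEVANAGARI_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Grantha NA + MA + VISARGA, written as code points to avoid font issues.
    sample = chr(0x11328) + chr(0x1132E) + chr(0x11303)
    print(grantha_to_devanagari(sample))
```

This only covers the final step; it does nothing for the OCR problem itself, and any Grantha-specific signs without a Devanagari equivalent would need separate handling.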

विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 20, 2022, 10:19:34 PM
to Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
On Mon, 21 Feb 2022 at 07:43, Prashanth Anantharamu <prashanth...@gmail.com> wrote:

Namaste!

  FYI, I am volunteering to extract some shlokas in Grantha script from a scanned PDF and then transliterate them to Devanagari. While researching this process, I came across some of your work on Google and at https://sites.google.com/site/sanskritcode/ocr/0-introduction, so I thought I would take the liberty of reaching out for any possible help or leads on this extraction and transliteration.
That site is obsolete; https://sanskrit-coders.github.io/content/ocr/ is newer.

Someone on sanskrit-programmers (cc'ed) might have ideas you could benefit from. But you won't necessarily get their responses unless you become a member/subscriber of https://groups.google.com/g/sanskrit-programmers/ and https://groups.google.com/g/sanskrit-ocr/

 

The source scripture is a scanned PDF, which I tried to extract (OCR scan-to-text) using tools like Adobe, Sejda, etc. The text gets extracted but distorted, because these tools don't support some of the native fonts. I am also exploring Google Cloud Vision OCR; no luck yet!


I suspect Google OCR won't work: it only recognizes Tamil letters, not Grantha (see below). I suspect the situation is the same for other OCR tools.

 
Can you please advise on the best approach or tool for extracting Grantha scripture from a scanned PDF?

Sample text

image.png


This is what Google Drive OCR, run via https://ocr.sanskritdictionary.com/#, yields for the above: all Tamil.

வெ வாரண நி குவெரா உஹாவில்விழoner திணெ த . | கவொறே மஜா நெலானவாறு ஜெ வாவபு தீவத, BUTo ராணா

 

Thanks & Regards

Prashanth Anantharaman

--
You received this message because you are subscribed to the Google Groups "sanskrit-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-ocr...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/sanskrit-ocr/CAO2eXKY%3D9tRn6i_TNrA6CXR92Z95zuhvAA5aWe1PAXNAPxsuNQ%40mail.gmail.com.


--
--
Vishvas /विश्वासः
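
The "all Tamil" observation above can be checked mechanically. The sketch below counts how many characters of an OCR result fall in the Tamil block (U+0B80 to U+0BFF) versus the Grantha block (U+11300 to U+1137F). The function name is hypothetical, and the sample is a short excerpt of the OCR output quoted in this message.

```python
from collections import Counter

# Unicode block boundaries (inclusive) for the two scripts of interest.
SCRIPT_RANGES = {
    "Tamil": (0x0B80, 0x0BFF),
    "Grantha": (0x11300, 0x1137F),
}


def script_census(text: str) -> Counter:
    """Count non-space characters per script block; the rest go to 'Other'."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
        else:
            # Latin letters, punctuation, digits and so on.
            counts["Other"] += 1
    return counts


if __name__ == "__main__":
    ocr_output = "வெ வாரண நி குவெரா"  # excerpt of the OCR result above
    print(script_census(ocr_output))
```

A result with zero Grantha characters indicates that the engine mapped the glyphs onto Tamil code points, which is consistent with the engine having no Grantha model at all.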

Shylaja Venkatraman

Feb 21, 2022, 1:28:53 AM
to विश्वासो वासुकिजः (Vishvas Vasuki), Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr

Namaste,

 

Even the Tamil recognition seems to be bad. Is there a better sample of Tamil text extracted by OCR? I am just curious whether it is even worth the trouble of going the OCR route.

 


 

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 1, 2022, 3:00:16 AM
to Prashanth Anantharamu, sanskrit-programmers, Ravikiran SFO shAkhA, sanskrit-ocr
+ ravikiraN - are you aware of any solution to this Grantha script OCR problem? Would you be able to train a model?

Ravi Kiran Sarvadevabhatla

Dec 1, 2022, 4:12:33 AM
to विश्वासो वासुकिजः (Vishvas Vasuki), Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
I do not know of any specific systems. Maybe https://github.com/Shreeshrii/kraken_grantha is a possibility.

Ravi

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 1, 2022, 4:15:32 AM
to Shree Devi Kumar, Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
Thanks! (adding shreeshree and shrIramaNa, ravi to bcc)

Namaste Shreedevi,

Could you please point me to a guide that describes how to use this model?

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 1, 2022, 8:00:23 AM
to Shree Devi Kumar, Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
(I saw a usage guide in the README file there :-D)
Does this remain the best?

Shree Devi Kumar

Dec 1, 2022, 9:15:03 PM
to विश्वासो वासुकिजः (Vishvas Vasuki), Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
(adding Vinodh Rajan also to the conversation)

I haven't looked at this (experimental training) for a while. 

It would help if those who are familiar with Grantha script could create line images and the corresponding Unicode text ground truth for training.

I had tried using the limited set of Unicode Grantha fonts to create training data, but the printed texts use legacy fonts which look quite different, hence the results are suboptimal.

Maybe the focus should be on a single typeface to begin with.
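
For anyone preparing such line images and ground truth, a small consistency check can catch missing transcriptions before training. The sketch below assumes one common layout, where each line image `NAME.png` sits next to a UTF-8 text file `NAME.gt.txt`; the suffixes and the function name are assumptions for illustration, so adjust them to whatever the training tool actually expects.

```python
from pathlib import Path


def find_unpaired_lines(folder, image_suffix=".png", text_suffix=".gt.txt"):
    """Return names of line images with a missing or empty transcription.

    Assumed layout (hypothetical): NAME.png next to NAME.gt.txt.
    """
    unpaired = []
    for image in sorted(Path(folder).glob(f"*{image_suffix}")):
        stem = image.name[: -len(image_suffix)]
        ground_truth = image.with_name(stem + text_suffix)
        # A transcription that is absent or only whitespace is unusable.
        if not ground_truth.is_file():
            unpaired.append(image.name)
        elif not ground_truth.read_text(encoding="utf-8").strip():
            unpaired.append(image.name)
    return unpaired


if __name__ == "__main__":
    # "training_lines" is a placeholder directory name.
    for name in find_unpaired_lines("training_lines"):
        print("no ground truth for", name)
```

Running this over a single-typeface data set first keeps the check simple, in line with the suggestion to start with one typeface.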

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 5, 2022, 5:33:00 AM
to Shree Devi Kumar, Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
Recently some new fonts were announced: https://groups.google.com/g/sanskrit-programmers/c/lOQWdqs_ITc . I wonder whether training with them would improve performance.
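
One way to answer the "does it improve performance" question is to compare character error rates on the same hand-transcribed lines before and after retraining. The sketch below is a plain Levenshtein-based character error rate in Python; it is not taken from any of the tools in this thread, and it counts Unicode code points, so a conjunct or a consonant plus vowel sign counts as several characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, by code point."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Placeholder strings; real use would pair ground truth with OCR output.
    print(character_error_rate("abcd", "abxd"))
```

Computing this for the same set of lines with each model gives a like-for-like comparison, provided the ground truth comes from the printed source and not from the synthetic fonts.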

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 12, 2022, 12:09:27 PM
to Shree Devi Kumar, Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr

 kraken -f pdf -I vidvaj-jana-vinodhini.pdf -o .txt segment -bl ocr -m grantha_kraken.mlmodel

is what I ran. Any tips?