Re: extraction of Grantha Script from a Scanned PDF (OCR Scan extract)


विश्वासो वासुकिजः (Vishvas Vasuki)

Feb 20, 2022, 10:19:34 PM
to Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr


On Mon, 21 Feb 2022 at 07:43, Prashanth Anantharamu <prashanth...@gmail.com> wrote:

> Namaste!
>
> FYI, I am working on a voluntary activity to extract some shlokas in Grantha script from a scanned PDF and then transliterate them to Devanagari. While researching this process, I came across some of your work on Google, and at https://sites.google.com/site/sanskritcode/ocr/0-introduction, so I thought I would take the liberty of reaching out for any possible help or leads on this extraction and transliteration.

That site is obsolete; https://sanskrit-coders.github.io/content/ocr/ is newer.

Someone on sanskrit-programmers (cc-ed) might have ideas you could benefit from. But you won't necessarily get their responses unless you become a member/subscriber of https://groups.google.com/g/sanskrit-programmers/ and https://groups.google.com/g/sanskrit-ocr/

> The source scripture is in a scanned PDF, which I tried to extract (OCR scan-to-text) using tools like Adobe and Sejda. The text gets extracted, but with distortion, since these tools don't support some of the native fonts. I am also exploring Google Cloud Vision OCR for now; no luck yet.

I suspect Google OCR won't work. It only understands Tamil letters, not Grantha (see below). I suspect the situation is the same for other OCR tools.

> Can you please advise on the best approach/tool to extract Grantha scripture from a scanned PDF?
>
> Sample text
>
> image.png

This is what Google Drive OCR from https://ocr.sanskritdictionary.com/# yields for the above: all Tamil.

வெ வாரண நி குவெரா உஹாவில்விழoner திணெ த . | கவொறே மஜா நெலானவாறு ஜெ வாவபு தீவத, BUTo ராணா
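
A quick way to confirm that an OCR engine is emitting Tamil rather than Grantha code points is to count characters per Unicode block. The sketch below is only an illustration: the function name `script_histogram` and the choice of blocks are assumptions made here, not part of any tool mentioned in this thread.

```python
from collections import Counter

# Unicode block ranges (inclusive). Tamil and Grantha live in separate blocks,
# so the check needs no knowledge of either script.
BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Tamil": (0x0B80, 0x0BFF),
    "Grantha": (0x11300, 0x1137F),
}


def script_histogram(text: str) -> Counter:
    """Count non-space characters of `text` per Unicode block ('Other' otherwise)."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        cp = ord(ch)
        for name, (lo, hi) in BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
        else:
            # Latin letters, punctuation and anything else not listed above
            counts["Other"] += 1
    return counts


# The first few words of the OCR output quoted above
ocr_output = "வெ வாரண நி குவெரா"
print(script_histogram(ocr_output))
```

If the "Grantha" count stays at zero for a page that is visibly printed in Grantha, the engine is most likely mapping the glyphs onto its Tamil model.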

 

> Thanks & Regards
>
> Prashanth Anantharaman

--
You received this message because you are subscribed to the Google Groups "sanskrit-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-ocr...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/sanskrit-ocr/CAO2eXKY%3D9tRn6i_TNrA6CXR92Z95zuhvAA5aWe1PAXNAPxsuNQ%40mail.gmail.com.


--
--
Vishvas /विश्वासः

Ambarish Sridharanarayanan

Feb 20, 2022, 10:27:31 PM
to sanskrit-p...@googlegroups.com, Prashanth Anantharamu, sanskrit-ocr
Prashanth,

As you probably know, once OCR-ed, transliterating into Devanagari or ISO-15919 etc. is trivial these days; the real value is in the textual content. What are these slokas, and do we know that they're not already available in electronic form, even if in a different script?
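
To illustrate why the transliteration step is the easy part: the Grantha block (U+11300 onwards) is laid out largely in parallel with the Devanagari block (U+0900 onwards), so a rough conversion is a fixed code point offset. The sketch below is a simplification under that assumption, not how Aksharamukha or any other tool in this thread works; the function name and the "safe" ranges are choices made for this example, and Grantha-only characters outside those ranges are passed through unchanged.

```python
import unicodedata

GRANTHA_BASE = 0x11300
DEVANAGARI_BASE = 0x0900

# Inclusive ranges where the parallel layout is assumed to hold.
# Assumes well-formed Grantha input (no unassigned code points).
SAFE_RANGES = [
    (0x11301, 0x1134D),  # signs, vowels, consonants, vowel signs, virama
    (0x11350, 0x11350),  # OM
    (0x11360, 0x11363),  # vocalic RR/LL letters, vocalic L/LL vowel signs
]


def grantha_to_devanagari(text: str) -> str:
    """Map Grantha code points to Devanagari by block offset (rough sketch)."""
    # NFC composes two-part vowel signs where Unicode defines a composition.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in SAFE_RANGES):
            out.append(chr(cp - GRANTHA_BASE + DEVANAGARI_BASE))
        else:
            out.append(ch)  # spaces, punctuation, unmapped Grantha signs
    return "".join(out)


# GA, VIRAMA, RA, NA, VIRAMA, THA: the word "grantha" in Grantha script
word = "\U00011317\U0001134D\U00011330\U00011328\U0001134D\U00011325"
print(grantha_to_devanagari(word))  # ग्रन्थ
```

A dedicated transliteration tool remains the better choice for real texts, since it also handles the characters this offset mapping skips.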

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/CAFY6qgG2-NrPcYxCy60izYAN%2B8F72w%2BfvQ3%3DWYUB%2BWbTdGZ5MA%40mail.gmail.com.

Shylaja Venkatraman

Feb 21, 2022, 1:53:54 AM
to विश्वासो वासुकिजः (Vishvas Vasuki), Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr

Namaste,

 

Even the Tamil recognition seems to be bad. Is there a better sample of Tamil text extracted by OCR? Just curious whether the OCR route is even worth the trouble.

 


Prashanth Anantharamu

Feb 21, 2022, 4:24:06 PM
to sanskrit-programmers
1. "I suspect Google OCR won't work"

Thank you. Yes, it appears that Google OCR doesn't recognize Grantha script yet. It was also suggested that I try Google Cloud Vision OCR, which I haven't done yet.

It appears that similar work was done to extract Grantha script from scanned PDFs as part of a research activity (please check the whitepaper below), but I couldn't get much information from that source.


2. For the transliteration itself, the Aksharamukha utility has been very useful.


3. "What are these slokas, and do we know that they're not already available in electronic form, even if in a different script?"

     These slokas are part of the Tiruchendur Sthala Puranam, and it appears they aren't available in a different script.

4. "Is there any better sample of Tamil text by OCR extraction"

     I guess https://ocr.sanskritdictionary.com/#, as suggested by Sri विश्वासः, and a few other tools (like Sejda, and Mac Pages-to-Google Pages) work for extracting Tamil script from a scanned PDF.

Thanks, and I appreciate your inputs and suggestions.

Thanks & Regards

Prashanth Anantharaman

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 1, 2022, 3:00:17 AM
to Prashanth Anantharamu, sanskrit-programmers, Ravikiran SFO shAkhA, sanskrit-ocr
+ ravikiraN: are you aware of any solution to this Grantha script OCR problem? Would you be able to train a model?

Ravi Kiran Sarvadevabhatla

Dec 1, 2022, 3:52:02 AM
to विश्वासो वासुकिजः (Vishvas Vasuki), Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
I do not know of any specific systems. Maybe https://github.com/Shreeshrii/kraken_grantha is a possibility.

Ravi

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 1, 2022, 4:15:33 AM
to Shree Devi Kumar, Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
Thanks! (adding shreeshree and shrIramaNa, ravi to bcc)

namaste shreedevi,

could you please point me to a guide which describes how to use this model?

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 1, 2022, 8:00:24 AM
to Shree Devi Kumar, Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
On Thu, 1 Dec 2022 at 14:44, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:
> Thanks! (adding shreeshree and shrIramaNa, ravi to bcc)
>
> namaste shreedevi,
>
> could you please point me to a guide which describes how to use this model?

(I saw one in the README file there :-D )
Does this remain the best?

Shree Devi Kumar

Dec 1, 2022, 5:44:08 PM
to विश्वासो वासुकिजः (Vishvas Vasuki), Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
(adding Vinodh Rajan also to the conversation)

I haven't looked at this (experimental training) for a while. 

It would help if those who are familiar with the Grantha script created line images and the corresponding Unicode text ground truth for training.

I had tried using the limited set of Unicode Grantha fonts to create training data, but the printed texts use legacy fonts that look quite different, hence the results are suboptimal.

Maybe the focus should be on a single font face to begin with.
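
For anyone preparing such line-level ground truth, a small consistency check before training can save time. The sketch below assumes a layout where each line image `NAME.png` sits next to a transcription `NAME.gt.txt`; that naming, and both function names, are assumptions for illustration, so adjust them to whatever the training tool actually expects.

```python
from pathlib import Path


def find_unpaired_lines(data_dir, image_ext=".png", gt_ext=".gt.txt"):
    """Return (images without ground truth, ground-truth files without an image)."""
    root = Path(data_dir)
    images = {p.name[: -len(image_ext)] for p in root.glob(f"*{image_ext}")}
    texts = {p.name[: -len(gt_ext)] for p in root.glob(f"*{gt_ext}")}
    return sorted(images - texts), sorted(texts - images)


def non_grantha_characters(gt_text):
    """List characters outside the Grantha block (U+11300..U+1137F).

    These are not necessarily errors (dandas, digits and Latin punctuation
    live in other blocks), but they are worth a manual look, for example to
    catch lines that were typed in Tamil or Devanagari by mistake.
    """
    return sorted(
        {ch for ch in gt_text
         if not ch.isspace() and not 0x11300 <= ord(ch) <= 0x1137F}
    )


if __name__ == "__main__":
    # "grantha_lines" is a hypothetical directory name
    missing_text, missing_image = find_unpaired_lines("grantha_lines")
    print("images without transcription:", missing_text)
    print("transcriptions without image:", missing_image)
```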

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 5, 2022, 5:33:00 AM
to Shree Devi Kumar, Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr
On Fri, 2 Dec 2022 at 04:14, Shree Devi Kumar <shree...@gmail.com> wrote:
> (adding Vinodh Rajan also to the conversation)
>
> I haven't looked at this (experimental training) for a while.
>
> It would help if those who are familiar with the Grantha script created line images and the corresponding Unicode text ground truth for training.
>
> I had tried using the limited set of Unicode Grantha fonts to create training data, but the printed texts use legacy fonts that look quite different, hence the results are suboptimal.

Some new fonts were announced recently: https://groups.google.com/g/sanskrit-programmers/c/lOQWdqs_ITc . I wonder whether they improve performance ...

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 12, 2022, 12:09:27 PM
to Shree Devi Kumar, Shriramana श्रीरमणशर्मा होता, Prashanth Anantharamu, sanskrit-programmers, sanskrit-ocr

 kraken -f pdf -I vidvaj-jana-vinodhini.pdf -o .txt segment -bl ocr -m grantha_kraken.mlmodel

is what I ran. Any tips?

श्रीमल्ललितालालितः

May 15, 2024, 2:18:02 PM
to sanskrit-programmers
Om.
Today I tried kraken with the Grantha model.

I had to use something like this:

for i in ~/kraken_grantha/test/6.png; do
  # Strip the image extension so that 6.png is written to 6.txt
  kraken -i "$i" "${i%.*}.txt" binarize segment ocr -s -m grantha_best.mlmodel
done

Please note that it works very well on good scans, but fails on old books with different font styles. I used Aksharamukha to type a sentence in Grantha script, took a screenshot of it, and then OCR-ed it. It was accurate.
But when I tried it on a scan of an old printed book, the results were miserable. I'm not qualified enough to understand the training process, so I left it there.

If anyone knows a better way to OCR Grantha script text from the 1900-1940 era, please suggest it.
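
The difference between "accurate" on a synthetic screenshot and "miserable" on an old print can be put into numbers with a character error rate (CER): the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below is a plain Levenshtein implementation with function names chosen for this example; it counts Unicode code points, so a Grantha consonant plus a vowel sign counts as two characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between `a` and `b`, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length (0.0 means identical)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong character out of four
print(character_error_rate("abcd", "abxd"))  # 0.25
```

Transcribing even a few lines of the old print by hand and comparing them this way would show how far a model is from being usable, and whether retraining on a single font face helps.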

Ramesh Kn

May 16, 2024, 10:38:09 AM
to sanskrit-programmers
I am also stuck with old Grantha-script prints from 1900-1940.
A sample is attached. Any help is appreciated.

Untitled.png

Vinodh Rajan

May 16, 2024, 12:01:53 PM
to sanskrit-programmers
tasks.png

Himanshu Vyas

May 28, 2024, 5:29:20 AM
to sanskrit-p...@googlegroups.com
Indian Sanatani people learning and speaking Sanskrit with their families: that is my only way of making Sanskrit popular among Hindu people.
So, see the link below.
The above link is only one video; if you look at the channel, you can see more videos related to learning Sanskrit.
It is my only attempt (prayas) for the Sanskrit language.
By profession, I'm an IT engineer.


विश्वासो वासुकिजः (Vishvas Vasuki)

May 28, 2024, 6:21:41 AM
to sanskrit-p...@googlegroups.com, HRVyas
On Tue, 28 May 2024 at 14:59, Himanshu Vyas <himansh...@gmail.com> wrote:
> Indian Sanatani people learning and speaking Sanskrit with their families: that is my only way of making Sanskrit popular among Hindu people.
> So, see the link below.
> The above link is only one video; if you look at the channel, you can see more videos related to learning Sanskrit.
> It is my only attempt (prayas) for the Sanskrit language.
> By profession, I'm an IT engineer.

Did you mean to send this somewhere else? It is not relevant to this thread (or to this mailing list).

 