OCR of Devanagari + Diacritics + English


Alexander Gribanov

Sep 15, 2019, 10:13:55 AM
to tesseract-ocr
Hello!

I finally have a real OCR project.
Could anybody please give me step-by-step advice on how to OCR pages like these?
Do I need to split the pages manually into different types of blocks before running OCR?
What command should I use for OCR?

Thank you all very much in advance.

Ravi Annaswamy

Sep 15, 2019, 10:44:31 AM
to tesser...@googlegroups.com
I recently wrote a notebook that reads a single-column page of Sanskrit and English and runs it through Tesseract to OCR both languages.

I will take a look at your file and create a Colab notebook sometime today or tomorrow.




Shree Devi Kumar

Sep 15, 2019, 11:34:05 AM
to tesseract-ocr
Try http://ocr.sanskritdictionary.com/
for OCR of Devanagari + Diacritics + English.

Its Google option gives better results than Tesseract.


Ravi Annaswamy

Sep 15, 2019, 12:00:59 PM
to tesser...@googlegroups.com
Alex

Here are the results and linked below is an example notebook for you to get started with.

The code is self-explanatory and you can adapt it. You will need to improve many things,
but here is a start. Please let me know if you have any questions.


Please feel free to share any improvements you make.
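
The notebook itself is not reproduced in this thread. As a rough answer to the "what command" question, here is a minimal sketch of calling the Tesseract command line from Python for a page that mixes Devanagari and English. It assumes the `tesseract` executable and the `san` and `eng` traineddata files are installed; the function names and the file name `page_018.png` are made up for illustration and do not come from the notebook.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, langs=("san", "eng"), psm=3):
    """Return the argument list for one Tesseract CLI call.

    Languages are joined with "+" for the -l option. The special output
    base "stdout" makes Tesseract print the text instead of writing a file.
    """
    return [
        "tesseract", image_path, "stdout",
        "-l", "+".join(langs),
        "--psm", str(psm),
    ]


def ocr_page(image_path, langs=("san", "eng"), psm=3):
    """Run Tesseract on one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, langs, psm),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# Example call (hypothetical file name):
# text = ocr_page("page_018.png")
```

Page segmentation mode 3 is Tesseract's default fully automatic layout analysis; mode 6 treats the image as a single uniform block of text and may be worth trying on a cropped verse. Swapping `san` for `hin` is another thing to try, since the verses here are not classical Sanskrit.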

Thanks
Ravi Annaswamy

----

18 A Small Collection of श्रीकृष्ण का अलोकिकत्व SRI KRISHNAS TRANSCENDENTAL FORM This savaiya reveals Sri Krishnas amazing and surprising behaviour which cannot be predicted. The normal paths of spiritual knowledge cannot reveal his secret and favourite pastimes. The pastimes within Vraja are incomparable and can only be perceived through total surrender to Him. ब्रह्य मे दूढ्यो पुरानन गानन । वेद रिचा सुनि चौगुने चायन ।। Cel GA Hae a fads वह केसे सरूप अरु कैसे सुभायन । टेरत हेरत हारि TEM | रसखानि बतायौ न लोग लुगायन । देख्यो दर्यो वह कुज कुटीर में बेट्यो पलोटत राधिका पायन ।।

----


The Great Rasakhanjis Poetry 19 brahma mem dhtindhyo puranana ganana veda rca suni caugune cayana dekhyo sunyo kabahum na kitahum vaha kaise sartipa aru kaise subhayana terata herata hari paryau rasakhani batayau na loga lugayana dekhyau duryau vaha kunja kutira mem baithyo palotata radhika payana | have been searching for the supreme Brahman in the Puranic songs and from listening to Vedic verses my desire to meet Him has increased four-fold; Never and nowhere have I seen or heard of His Form and His Nature; Rasakhan says "I have completely failed to find Him despite calling to him and searching. Neither man nor woman has been able to tell me where to find Him" But then I beheld Him afar, sitting in a secret love bower, massaging Sri Radhikajus Feet. (2)


Ravi Annaswamy

Sep 15, 2019, 12:06:54 PM
to tesser...@googlegroups.com
That is a beautiful app.

Shree Devi Kumar, what service does the 'google' selection hit? Is it free?

Ravi


Ravi Annaswamy

Sep 15, 2019, 12:12:33 PM
to tesser...@googlegroups.com
I split the scan into left and right pages and submitted them to that OCR site with the Google option. Here are the results. I have not compared them in detail yet, but a couple of observations:
1. Yes, Google OCR captures diacritics!
2. Tesseract retains line breaks, but Google OCR provides flowing text, which is great.
3. Tesseract is waylaid by the decorative images and tries to read them as text; Google OCR skips them, which is nice.
Thanks for the pointer
Ravi
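
For anyone repeating the left/right split mentioned above, here is a minimal sketch of how it could be done. It assumes the scan is a two-page spread with the gutter at the horizontal centre, which may not hold for every scan; the function names and the `gutter` parameter are hypothetical. The second function needs the third-party Pillow package.

```python
def split_spread_boxes(width, height, gutter=0):
    """Return crop boxes (left, upper, right, lower) for the two pages.

    The spread is cut at the horizontal centre. `gutter` is the number of
    pixels dropped on each side of the centre line, e.g. to remove the
    shadow of the binding.
    """
    if width < 2 or height < 1:
        raise ValueError("image is too small to split")
    if gutter < 0 or gutter >= width // 2:
        raise ValueError("gutter must be >= 0 and less than half the width")
    middle = width // 2
    left_box = (0, 0, middle - gutter, height)
    right_box = (middle + gutter, 0, width, height)
    return left_box, right_box


def split_spread(path, gutter=0):
    """Open a scanned spread and return (left_page, right_page) images."""
    from PIL import Image  # Pillow; imported here so the box helper has no dependencies

    with Image.open(path) as spread:
        left_box, right_box = split_spread_boxes(*spread.size, gutter=gutter)
        return spread.crop(left_box), spread.crop(right_box)
```

Each half can then be saved and passed to whichever OCR engine is being tested, so both engines see the same input.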


18 
A Small Collection of 
श्रीकृष्ण का अलौकिकत्व SRI KRISHNAS TRANSCENDENTAL FORM 
This savaiya reveals Sri Krishnas amazing and surprising behaviour which cannot be predicted. The normal paths of spiritual knowledge cannot reveal his secret and favourite pastimes. The pastimes within Vraja are incomparable and can only be perceived through total surrender to Him. 
ब्रह्म में ढूँढ्यो पुरानन गानन। वेद रिचा सुनि चौगुने चायन।। 
देख्यो सुन्यो कबहुँ न कितहुँ वह कैसे सरूप अरु कैसे सुभायन।। 
टेरत हेरत हारि पस्यौ। रसखानि बतायौ न लोग लुगायन। 
देख्यौ दुर्यों वह कुंज कुटीर में बैठ्यो पलोटत राधिका पायन।।

The Great Rasakhanjis Poetry 
brahma mem dhündhyo purānana gānana 
veda șcă suni caugune cāyana dekhyo sunyo kabahuń na kitahum vaha kaise sarūpa aru kaise subhāyana 
terata herata hāri paryau rasakhāni batāyau na loga lugāyana dekhyau duryau vaha kuñja kuțīra mem 
baithyo palotata rādhikā pāyana 
I have been searching for the supreme Brahman in the Puranic songs and from listening to Vedic verses my desire to meet Him has increased four-fold; 

Never and nowhere have I seen or heard of His Form and His Nature; 
Rasakhan says "I have completely failed to find Him despite calling to him and searching. Neither man nor woman has been able to tell me where to find Him" But then I beheld Him afar, sitting in a secret love bower, massaging Sri Radhikajus Feet. (2)

Alexander Gribanov

Sep 15, 2019, 12:26:48 PM
to tesser...@googlegroups.com
Yeah, it looks great, but there are still some mistakes in the Google output...
As far as I have heard, Tesseract can be trained, but I had never heard of that Google service, so I am not sure whether such problems can be reduced with more training.
I mean, is the Google service actually trainable?

On Sun, Sep 15, 2019 at 19:12, Ravi Annaswamy <ravi.an...@gmail.com> wrote:

Shree Devi Kumar

Sep 15, 2019, 12:56:16 PM
to tesseract-ocr

Ravi Annaswamy

Sep 15, 2019, 1:48:53 PM
to tesser...@googlegroups.com
OK, thanks for sharing. It is a very nice interface.
As Alex points out, Tesseract allows further customization.
I want to learn how to train Tesseract for Tamil and Sanskrit using your scripts and guides myself, but I haven't found a good starting point yet.



Shree Devi Kumar

Sep 16, 2019, 2:36:57 AM
to tesseract-ocr
That site works for other languages also, though it is not specified explicitly.
