SanskritOCR vs. Google OCR


विश्वासो वासुकिजः (Vishvas Vasuki)

Jun 6, 2018, 1:44:50 AM
to sanskrit-programmers, Arun Prasad
Excerpts follow:

SanskritOCR vs. Google OCR

Author: Arun (Link) 

As far as I am aware, there are two options for high-quality Sanskrit OCR:

  • the SanskritOCR tool created by Dr. Oliver Hellwig
  • the Google OCR API exposed through the Google Cloud Vision API

I was curious to get a sense of how these two tools compared on a sample page, and whether there was a clear difference in quality between the two. Here I’ll describe how I conducted this comparison and my results.

Doing a fair analysis is time-consuming because a variety of page sizes, scan qualities, genres, and so on must be tested, and all must be tested for both the number and the severity of the errors produced.

So rather than make a firm pronouncement on how one tool compares to the other, I focused on just one page that I found particularly challenging and saw how the tools compared. Specifically, I focused on page 28 of [this edition] of the Raghuvamsa, with the commentary by Mallinatha.

This is a reasonable test because the text contains both large type for the main poem and smaller type (with long compounds) for the commentary. The print quality ranges from perfectly clear to quite ugly.

...

Tentative conclusions

The two tools produce roughly the same number of errors, but when the errors are compared by edit distance, SanskritOCR's errors are roughly twice as severe as Google OCR's.
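To make the distinction between the number of errors and their severity concrete, here is a minimal sketch of an edit-distance comparison. It is not the script used for the analysis above. It assumes the reference and OCR outputs have already been split into word lists that line up one to one, it counts distance in Unicode code points (for Devanagari, grapheme clusters might be a better unit), and the sample words are invented for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            # Cheapest of: deletion, insertion, substitution (or match).
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def error_stats(reference_words, ocr_words):
    """Return (number of wrong words, total edit distance) for aligned word lists."""
    distances = [edit_distance(r, o) for r, o in zip(reference_words, ocr_words)]
    return sum(d > 0 for d in distances), sum(distances)


# Invented sample data: both hypothetical tools get one word wrong,
# but the second tool's error is further from the reference.
reference = ["rama", "vanam", "gacchati"]
tool_a = ["rama", "vanan", "gacchati"]
tool_b = ["rama", "bamam", "gacchati"]

stats_a = error_stats(reference, tool_a)
stats_b = error_stats(reference, tool_b)
```

In this toy data both tools have the same error count, while the second has twice the total edit distance, which is the kind of gap the conclusion above describes.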

Since the page I chose is particularly hard, I have confidence that Google OCR can maintain this quality or perform even better on easier data.

Both tools make errors that can seem at first like normal Sanskrit. Google OCR seems more prone to this, since its errors are less severe (and therefore closer to actual Sanskrit). It may be possible to work around this with better linguistic understanding.

My recommendations on this basis:

  • Google OCR is likely at least as good as SanskritOCR and can be used for projects that require high-quality Sanskrit OCR.
  • Better documentation and assistance for how to tune SanskritOCR may be useful.
  • Investing in tooling for Google OCR is a worthwhile priority for the community, since the OCR output is easy to process programmatically and can be used to build more sophisticated proofreading tools.
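As a rough illustration of what "easy to process programmatically" could mean for a proofreading tool, the sketch below walks a Cloud Vision style response and flags low-confidence words for human review. It assumes the response has already been parsed from JSON into a dict using the REST-style field names (`fullTextAnnotation`, `pages`, `blocks`, `paragraphs`, `words`, `symbols`, `confidence`). The function name, the threshold, and the sample data are hypothetical, so check the field names against a real response before relying on them.

```python
def low_confidence_words(response: dict, threshold: float = 0.8):
    """Yield (word, confidence) pairs whose confidence is below the threshold."""
    annotation = response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word is stored as a list of symbols; join them back together.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    # A missing confidence is treated as 0.0, so the word gets flagged.
                    confidence = word.get("confidence", 0.0)
                    if confidence < threshold:
                        yield text, confidence


# Hand-made sample in the assumed shape; not real OCR output.
sample_response = {
    "fullTextAnnotation": {
        "pages": [{
            "blocks": [{
                "paragraphs": [{
                    "words": [
                        {"confidence": 0.98,
                         "symbols": [{"text": "r"}, {"text": "a"}, {"text": "m"}, {"text": "a"}]},
                        {"confidence": 0.55,
                         "symbols": [{"text": "v"}, {"text": "a"}, {"text": "n"}, {"text": "a"}, {"text": "n"}]},
                    ]
                }]
            }]
        }]
    }
}

flagged = list(low_confidence_words(sample_response))
```

A proofreading interface could highlight the flagged words first, which may help with the errors that look like plausible Sanskrit.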


--
Vishvas /विश्वासः

Arun

Jun 6, 2018, 2:00:43 AM
to sanskrit-programmers
Thanks for posting, Vishvas! Forgive me in advance for any grammatical errors or confusing turns of phrase, as this was written rather quickly.

Here's a direct link to the page I used for the analysis:


I hope others can experiment with more input images so we can get a better sense of how these two tools compare. I also plan to make a basic interface for Google OCR so that the usability barrier isn't so high. Having an open-source web interface (and therefore an interface that can be hacked and extended) would be really useful, I think!

Arun

Kaushal

Jun 6, 2018, 2:08:55 AM
to sanskrit-p...@googlegroups.com
Hi,

I second Vishvas Vasuki’s conclusion, based on my team's own trials with the Google OCR Vision APIs.

The problem I’m trying to work on is how to augment the Vision APIs by training them to read manuscripts (with svaras).

We’re racking our brains over this, and any contributions are welcome.

You can reach me on Skype - Kau...@Prodio.in

Kaushal Trivedi

For Whom The Bell Tolls.
--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Anunad Singh

Jun 6, 2018, 11:03:08 PM
to sanskrit-p...@googlegroups.com
The above experiment carried out by Arun ji will be of great use. It
will also be useful to hear about experiences comparing Google
OCR with Tesseract. ShreeDeviKumar ji has done a lot of work on
Tesseract and could perhaps share its latest state.

--anunAda

विश्वासो वासुकिजः (Vishvas Vasuki)

Jun 12, 2018, 11:28:35 AM
to sanskrit-programmers, कौशलो मोहितमित्रम् kaushal, Sai सायिः साङ्गणकविद्वान् Susarla, मोहितो भारद्वाजः माध्यन्दिनशाखाध्यायी mohito bhaaradvaajaH noiDa-vaasii veda-samarthakaH
For background: shrI kaushal is associated with this inspiring and formidable project by +shrI mohit - http://indiafacts.org/after-millenia-tradition-reborn-vaidika-bharata/ .

If +shrI kaushal can describe the things they've tried and their results, we could perhaps brainstorm ideas. One thing that directly comes to mind is: have you tried segmenting the page and submitting each line (or fraction of a line) separately?
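For anyone who wants to try the line-by-line idea, here is a minimal sketch of one common way to split a page into lines, a horizontal projection profile. It assumes the page has already been binarized into rows of 0/1 pixels (1 = ink). The function name and the `min_ink` parameter are made up for this example, and real manuscript scans would likely also need deskewing and noise handling.

```python
def find_line_bands(binary_rows, min_ink=1):
    """Return (start, end) row ranges (end exclusive) that contain ink."""
    bands = []
    start = None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                   # a new text line begins here
        elif not has_ink and start is not None:
            bands.append((start, y))    # the current text line ended on the previous row
            start = None
    if start is not None:               # page ends while still inside a line
        bands.append((start, len(binary_rows)))
    return bands


# Tiny invented "page": two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

bands = find_line_bands(page)
line_images = [page[start:end] for start, end in bands]
```

Each entry in `line_images` could then be sent to the OCR service as its own request. Raising `min_ink` is a crude way to ignore specks, though it risks cutting off svara marks that sit above or below the main line.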

The closest project (very far from completion) that I can think of is one envisioned by +shrI sAi, where image segmentation and OCR are to be accomplished as follows:
- an expert annotates text segments - say letters, and this annotation is propagated wherever an almost-identical image is found.
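
One possible reading of this propagation idea, sketched under heavy assumptions and not taken from the project itself: treat each segmented glyph as an equal-sized binary bitmap flattened into a tuple, and copy an expert's label to any unlabeled glyph that differs from an annotated one by at most a few pixels. The function names, the Hamming-distance measure, and the `max_diff` threshold are all hypothetical choices for illustration.

```python
def hamming(a, b):
    """Number of differing pixels between two equal-sized binary glyph bitmaps."""
    if len(a) != len(b):
        raise ValueError("glyph bitmaps must have the same size")
    return sum(x != y for x, y in zip(a, b))


def propagate_labels(annotated, unlabeled, max_diff=1):
    """Copy each expert label to every unlabeled glyph within max_diff pixels.

    annotated: list of (bitmap, label) pairs supplied by the expert.
    unlabeled: list of bitmaps; returns a parallel list of labels (or None).
    """
    results = []
    for glyph in unlabeled:
        best_label, best_diff = None, max_diff + 1
        for bitmap, label in annotated:
            diff = hamming(glyph, bitmap)
            if diff < best_diff:        # keep the closest annotated glyph
                best_label, best_diff = label, diff
        results.append(best_label)
    return results


# Invented 3x3 glyphs, flattened row by row.
annotated = [
    ((1, 1, 0, 0, 1, 1, 0, 0, 0), "ka"),
    ((0, 0, 1, 1, 0, 0, 1, 1, 1), "ma"),
]
unlabeled = [
    (1, 1, 0, 0, 1, 1, 0, 0, 1),  # one pixel away from "ka"
    (0, 0, 1, 1, 0, 0, 1, 1, 1),  # identical to "ma"
    (1, 0, 1, 0, 1, 0, 1, 0, 1),  # close to neither
]

labels = propagate_labels(annotated, unlabeled)
```

Glyphs that match nothing closely stay unlabeled (`None`) and would go back to the expert, so each new annotation shrinks the remaining pool.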



विश्वासो वासुकिजः (Vishvas Vasuki)

Jun 12, 2018, 11:30:09 AM
to sanskrit-programmers, कौशलो मोहितमित्रम् kaushal, Sai सायिः साङ्गणकविद्वान् Susarla, मोहितो भारद्वाजः माध्यन्दिनशाखाध्यायी mohito bhaaradvaajaH noiDa-vaasii veda-samarthakaH
(fixing email id.)