Auto translation by Google OCR


विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 8, 2021, 1:55:09 AM
to sanskrit-programmers

and observe that it outputs:

fate (procedure) instead of विधि (procedure)!

So, beware!

--
Vishvas /विश्वासः

Arun

Oct 8, 2021, 9:46:17 AM
to sanskrit-programmers
Interesting. My guess is that Google OCR first applies a language detection model then uses the appropriate OCR engine. Here you can see that all of the Devanagari words are "transcribed" into English letters.
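
One way to catch this failure mode automatically is to measure how much of the OCR output is actually in Devanagari. The following is only a minimal sketch: the function name `devanagari_share` is made up for illustration, it looks only at the basic Devanagari block (U+0900 to U+097F), and it says nothing about how Google OCR works internally.

```python
import unicodedata


def devanagari_share(text: str) -> float:
    """Return the fraction of letters and combining marks that are Devanagari.

    Only the basic Devanagari block (U+0900..U+097F) is counted; spaces,
    digits and punctuation are ignored. Hypothetical helper for illustration.
    """
    # Keep letters (category L*) and combining marks (category M*, e.g. matras).
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    deva = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return deva / len(letters)


if __name__ == "__main__":
    # A page known to be Sanskrit whose output scores near zero is suspicious.
    for sample in ["विधि (procedure)", "fate (procedure)"]:
        print(sample, round(devanagari_share(sample), 2))
```

For a page that is known to be mostly Sanskrit, a share close to zero suggests the Devanagari was read as Latin letters and the page should be re-run or checked by hand.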

Arun

Suhas Mahesh

Oct 9, 2021, 7:50:51 AM
to sanskrit-programmers
Another "reader beware": like a human copyist, Google OCR sometimes changes rarer usages to more frequent usages it knows from training. Since the changes are often valid forms, and sometimes even fit in the context, they can be easily missed by the proofreader.


questions...@gmail.com

Oct 9, 2021, 4:24:08 PM
to sanskrit-programmers
On Saturday, October 9, 2021 at 4:50:51 AM UTC-7 suha...@gmail.com wrote:
Another "reader beware": like a human copyist, Google OCR sometimes changes rarer usages to more frequent usages it knows from training. Since the changes are often valid forms, and sometimes even fit in the context, they can be easily missed by the proofreader.

Yes, that came up in my comparison between SanskritOCR and Google OCR from a few years ago. While Google OCR is generally as accurate as SanskritOCR, its errors are less severe and so easier to overlook, which is perhaps worse in the context of proofreading.

Since there's a strong temptation to compare text and image on the basis of shape, perhaps one quality improvement would be to do the transcription in romanized script (or really any non-Devanagari script). घावति for धावति is hard to notice because the words look similar, but ghāvati for धावति has to be compared more carefully, which makes the error clear.
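
To make the idea concrete, here is a rough sketch of converting OCR output to IAST before proofreading. It is not a complete transliterator: the tables cover only the common Sanskrit letters, `to_iast` is a name invented for this example, and a maintained transliteration library would be the better choice for real work.

```python
# Minimal Devanagari -> IAST sketch (common Sanskrit letters only).
CONSONANTS = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh", "ङ": "ṅ",
    "च": "c", "छ": "ch", "ज": "j", "झ": "jh", "ञ": "ñ",
    "ट": "ṭ", "ठ": "ṭh", "ड": "ḍ", "ढ": "ḍh", "ण": "ṇ",
    "त": "t", "थ": "th", "द": "d", "ध": "dh", "न": "n",
    "प": "p", "फ": "ph", "ब": "b", "भ": "bh", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "श": "ś", "ष": "ṣ", "स": "s", "ह": "h",
}
VOWELS = {
    "अ": "a", "आ": "ā", "इ": "i", "ई": "ī", "उ": "u", "ऊ": "ū",
    "ऋ": "ṛ", "ए": "e", "ऐ": "ai", "ओ": "o", "औ": "au",
}
MATRAS = {
    "ा": "ā", "ि": "i", "ी": "ī", "ु": "u", "ू": "ū",
    "ृ": "ṛ", "े": "e", "ै": "ai", "ो": "o", "ौ": "au",
}
OTHER = {"ं": "ṃ", "ः": "ḥ", "ऽ": "'"}
VIRAMA = "्"


def to_iast(text: str) -> str:
    """Transliterate Devanagari to IAST; unknown characters pass through."""
    out = []
    pending_a = False  # True while the last consonant still owes its inherent 'a'
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in MATRAS:
            # A vowel sign replaces the inherent 'a'.
            out.append(MATRAS[ch])
            pending_a = False
        elif ch == VIRAMA:
            # Virama suppresses the inherent 'a'.
            pending_a = False
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(VOWELS.get(ch) or OTHER.get(ch) or ch)
    if pending_a:
        out.append("a")
    return "".join(out)


if __name__ == "__main__":
    # The OCR error from the example above becomes an obvious gh/dh mismatch.
    print(to_iast("घावति"), "vs", to_iast("धावति"))
```

The proofreader then reads the romanized text against the Devanagari image, so a shape-based glance is no longer enough to pass the line.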

Shreevatsa R

Oct 9, 2021, 7:47:48 PM
to sanskrit-programmers
+1 on plausible-looking errors, and using a different script for proofreading is a great idea. 

About the original example, these are the Devanagari → Latin transcriptions I see for that image (and no Devanagari has remained as Devanagari):

विधि → fate
नित्यक्रिया → fessut
क्रिया-s → fput-s
आगमादि-शैव-शास्त्र-s → 30THI-210--s
आगम-s → 3114-s
धर्मकर्ता → Updf 

Given the other examples (note that a leading ि seems to consistently become f, etc.), it's not yet clear to me whether विधि → fate is actually a translation or just an amazing coincidence.

(If you can't see it, this may help: [attached image: image.png] :-)