OCR on Internet Archive.


विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 15, 2016, 11:00:41 AM
to sanskrit-programmers, Kumar Shankara
+ Shankara, a frequent uploader, so that he can specify good language metadata.

Many old Sanskrit texts are uploaded to the Internet Archive (archive.org), and such uploads can be very convenient for later proofreading on Wikisource (through various import tools that match the OCR text with the right page).

But the OCR output for Devanagari is just gibberish.

So, I was wondering what software the Internet Archive uses for OCR, and whether we can somehow help improve it. So far, I discern the following:


--
Vishvas /विश्वासः

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 15, 2016, 11:19:19 AM
to sanskrit-programmers, Kumar Shankara
Also: 

https://archive.org/about/faqs.php#1165 :

Why is OCR so bad? Why do the epub, djvu, mobi, text files have garbled or missing text?

OCR (Optical Character Recognition) is inexact. Sometimes it can be poor. It largely relies on factors of the physical book such as type font, color, cleanliness of the page, language (some are not OCRable at this time), and page orientation (sometimes charts are turned at 90°). At this time we do not offer a way for you to either correct bad OCR or add your own corrected OCR file. Several of the derived file formats such as mobi, epub and djvu rely on OCR. So, if the OCR is poor, those files will also have garbled or misspelled words.


