Re: Adding Devanagari OCR at archive.org

विश्वासो वासुकिजः (Vishvas Vasuki)

Jan 9, 2018, 1:36:00 AM

to shankara, sanskrit-programmers, Shree Devi Kumar

Great idea! Recall https://groups.google.com/forum/#!topic/sanskrit-programmers/qZiNacZu0Gg .

Wikisource basically uses Google OCR via a web API call. Folks at +sanskrit-programmers (esp. shrIdevI) might be able to fill you in on the feasibility of Tesseract.

My experience has been that the archive.org folks are relatively slow to update their system, or uninterested in doing so, even when offered ready code (example: https://archive.org/post/1083611/want-to-contribute-single-item-and-gt-podcast-tool-webservice ). Still, it wouldn't hurt to add to https://archive.org/post/1056091/indian-language-ocr-no-good-can-we-help (again seemingly ignored), suggesting that they try to set up some free OCR API usage with Google (which generally likes to help non-profits), and offering to negotiate with Google on their behalf (with their permission) and to set up the OCR code.

If that doesn't work, we can try getting free API time from Google on behalf of some other Sanskrit-friendly non-profit organization and write a web service to OCR whole texts from archive.org images.
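A minimal sketch of what such a web service might prepare for each page image, assuming the Google Cloud Vision `images:annotate` REST endpoint and the usual archive.org download URL pattern. The function names and the `hi` language hint are illustrative assumptions, not something anyone in this thread has built; authentication and the actual HTTP calls are left out.

```python
import base64
from urllib.parse import quote

# Assumed endpoint: Google Cloud Vision REST API (needs an API key or OAuth token).
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def archive_download_url(identifier, filename):
    """Build the download URL for one file of an archive.org item."""
    return f"https://archive.org/download/{quote(identifier)}/{quote(filename)}"


def build_ocr_request(image_bytes, language_hints=("hi",)):
    """Build the JSON body for one page image.

    The language hints are an assumption; which codes give usable
    Devanagari output would have to be checked against the API docs.
    """
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "imageContext": {"languageHints": list(language_hints)},
            }
        ]
    }
```

The service would fetch each page image from `archive_download_url(...)`, POST the body from `build_ocr_request(...)` to `VISION_ENDPOINT`, and join the returned page texts into one document.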

On Mon, Jan 8, 2018 at 10:13 PM, shankara <shanka...@yahoo.com> wrote:

Vishvasji,

Namaste.

You must be aware that archive.org does not OCR the Devanagari documents uploaded there. There are now more than 56,000 books in Sanskrit and more than 60,000 books in Hindi at archive.org. Thus the total number of documents in Devanagari will exceed 1,20,000 if we include Marathi and Konkani.

It seems archive.org has not yet thought of adding Devanagari OCR to their servers. It would be very useful to all Sanskrit and Hindi scholars and students if the archive.org team could be convinced to add Devanagari OCR.

They may be prompted to do this if they don't have to spend much money on it. Do you know whether Devanagari OCR is available in the open domain? I am aware of Google Drive OCR and have heard of Wikisource OCR. But I am not sure whether archive.org can use them freely. Please let me know your thoughts on this.

regards
shankara

--
Vishvas /विश्वासः

shankara

Jan 9, 2018, 2:06:58 AM

to sanskrit-programmers, Shree Devi Kumar, विश्वासो वासुकिजः (Vishvas Vasuki)

Vishvasji,

Glad to know that you had already thought of this. 

I can write to the Internet Archive (IA) support team about this. But I won't be able to communicate technical matters effectively, since I know little about OCR technology. It would be good if someone who knows it well volunteered to correspond and coordinate with the IA team.

regards
shankara

ShreeDevi Kumar

Jan 10, 2018, 12:49:10 AM

to VishvAs VAsuki, shankara, sanskrit-programmers

Please see https://archive.org/post/1010389/using-tesseract-to-improve-ocr-for-some-languages

I asked yesterday about the use of Tesseract or the Google API for Indian languages.

The answer is no.

shankara

Jan 10, 2018, 1:50:58 AM

to VishvAs VAsuki, ShreeDevi Kumar, sanskrit-programmers

Namaste,

It is unfortunate. I have corresponded with Jeff a few times in the past and he was very supportive.

As Vishvasji mentioned, the Archive team may be either slow to update their system or uninterested in doing so, or they may give low priority to Indian language books.

Even ABBYY FineReader supports no Indian languages, whereas it does support Korean, Thai and Vietnamese. I wonder what the reason could be for this neglect of Indian languages.

regards
shankara

विश्वासो वासुकिजः (Vishvas Vasuki)

Jan 10, 2018, 1:32:20 PM

to shankara, ShreeDevi Kumar, sanskrit-programmers

I added to that post seeking clarification. I suspect that this may partly have to do with some past acrimony between IA and Google.

Regarding why ABBYY neglects Indian languages: money? Perhaps Indian comfort with English makes local-language tools quite optional?

shankara

Jan 11, 2018, 1:04:19 AM

to विश्वासो वासुकिजः (Vishvas Vasuki), ShreeDevi Kumar, sanskrit-programmers

Namaste,
Saw your post and responses from Jeff and Shree. Nice to see that the thread is alive. 

regards
shankara

विश्वासो वासुकिजः (Vishvas Vasuki)

Jan 11, 2018, 1:55:09 AM

to shankara, ShreeDevi Kumar, sanskrit-programmers

No, it's not :-( Jeff was clear: "you are welcome to OCR anything and upload that as a separate item" (emphasis on the word "separate").

They ain't changing their system.

shankara

Jan 11, 2018, 2:17:15 AM

to विश्वासो वासुकिजः (Vishvas Vasuki), ShreeDevi Kumar, sanskrit-programmers

Vishwasji,
You are right about Jeff's response. I missed it somehow.

Do you know what he meant when he asked you to look at https://archive.org/download/HistoryOfDharmasastraancientAndMediaevalReligiousAndCivilLawV.1/Kane_A-History-of-Dharmasastra-v1_1930_abbyy.gz ?

Does their OCR text have some special features that would be missing in Google OCR?

regards
shankara
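
For anyone who wants to inspect a file like the `_abbyy.gz` above, here is a rough sketch. It assumes the file is gzip-compressed ABBYY FineReader XML in which `line` elements carry `l`/`t`/`r`/`b` pixel coordinates and wrap `charParams` elements holding one character each. That layout is an assumption and should be checked against the actual file. If it holds, the per-line bounding boxes are the kind of positional detail that a plain-text OCR dump would not carry.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}line'."""
    return tag.rsplit("}", 1)[-1]


def abbyy_lines(fileobj):
    """Yield (text, (l, t, r, b)) for each <line> element in the XML stream."""
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if local_name(elem.tag) != "line":
            continue
        # Assumed layout: each <charParams> child holds a single character.
        text = "".join(
            child.text or ""
            for child in elem.iter()
            if local_name(child.tag) == "charParams"
        )
        box = tuple(int(elem.get(key, 0)) for key in ("l", "t", "r", "b"))
        yield text, box
        elem.clear()  # free memory; these files can be large


def read_abbyy_gz(path_or_file):
    """Read a gzip-compressed ABBYY XML file into a list of (text, box) pairs."""
    with gzip.open(path_or_file, "rb") as stream:
        return list(abbyy_lines(stream))
```

Calling `read_abbyy_gz("Kane_A-History-of-Dharmasastra-v1_1930_abbyy.gz")` on a downloaded copy would then show what text and coordinates the file records for each line.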
