Vishvasji, Namaste.
You must be aware that archive.org does not run OCR on the Devanagari documents uploaded there. There are now more than 56,000 books in Sanskrit and more than 60,000 books in Hindi at archive.org, so the total number of Devanagari documents will exceed 1,20,000 once Marathi and Konkani are included. It seems archive.org has not yet thought of adding Devanagari OCR to their servers. It would be very useful to all Sanskrit and Hindi scholars and students if the archive.org team could be convinced to add it. They may be more willing to do so if they don't have to spend much money on it. Do you know whether Devanagari OCR is available in the open domain? I am aware of Google Drive OCR and have heard of Wikisource OCR, but I am not sure whether archive.org can use them freely. Please let me know your thoughts on this.
Regards,
shankara
Great idea! Recall https://groups.google.com/forum/#!topic/sanskrit-programmers/qZiNacZu0Gg .
Wikisource basically uses Google OCR via a web API call. Folks at +sanskrit-programmers (especially shrIdevI) might be able to fill you in on the feasibility of Tesseract.
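For what it's worth, here is a minimal sketch of what a local Tesseract pass over one scanned page could look like. It assumes the `tesseract` command-line tool and its `san` and `hin` language data are installed. The helper names (`build_tesseract_command`, `ocr_page`) and the file name are hypothetical, and none of this reflects how archive.org's own pipeline works.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, langs=("san", "hin")):
    """Build the argument list for the Tesseract CLI.

    Tesseract accepts several languages joined with '+', e.g. 'san+hin'.
    """
    if not langs:
        raise ValueError("at least one language code is required")
    return ["tesseract", str(image_path), str(output_base), "-l", "+".join(langs)]


def ocr_page(image_path, output_base, langs=("san", "hin")):
    """Run Tesseract on one page image and return the recognized text.

    Returns None if the tesseract binary is not on PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base, langs), check=True)
    # Tesseract writes its plain-text output to '<output_base>.txt'.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # 'page_0001.png' is a placeholder name for a scanned page image.
    print(build_tesseract_command("page_0001.png", "page_0001"))
```

Whether the output quality is good enough for old printed Devanagari is exactly the feasibility question; the sketch only shows that the plumbing is small.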
My experience has been that the archive.org folks are relatively slow to update their system, or uninterested in doing so, even when handed ready code (example: https://archive.org/post/1083611/want-to-contribute-single-item-and-gt-podcast-tool-webservice ). Still, it wouldn't hurt to add to https://archive.org/post/1056091/indian-language-ocr-no-good-can-we-help (again seemingly ignored) and suggest that they try setting up some free OCR API usage with Google (which generally likes to help non-profits), offering to negotiate with Google on their behalf (with their permission) and to set up the OCR code.