OCR post-processing scripts


विश्वासो वासुकिजः (Vishvas Vasuki)

Aug 15, 2020, 8:28:40 AM
to rohits...@gmail.com, Devaraj देवराजः digital-sanskrit-temp Adiga, sanskrit-programmers
namaste shrI Dr. Rohit Saluja,

I got your email id from shrI devarAj aDiga, who is collaborating with us on a grammar proofreading project. He mentioned that you have developed OCR post-processing scripts that improve accuracy by combining multiple OCR outputs. Are these open source? Could you share the scripts and any details you have about them (such as research papers)? They will be of great interest to us on the sanskrit-programmers mailing list (cc-ed), which you're welcome to join.



--
Vishvas /विश्वासः

PS: to others - shrI devarAja has developed some ASR tech (guided by shrI rAmasubramaniam and Ganesh Ramakrishnan) based on Kaldi (https://kaldi-asr.org/), which will be published / released once his PhD formalities are completed.

विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 3, 2021, 7:54:00 PM
to rohits...@gmail.com, Devaraj देवराजः digital-sanskrit-temp Adiga, dhaval patel, Mārcis Gasūns, sanskrit-programmers

Additionally, I hear (from marcis) that dhaval and marcis have developed some tools to expedite proofreading, and that +dhaval has been using them successfully. Any details?

rohit saluja

Mar 3, 2021, 8:24:11 PM
to विश्वासो वासुकिजः (Vishvas Vasuki), Devaraj देवराजः digital-sanskrit-temp Adiga, dhaval patel, Mārcis Gasūns, sanskrit-programmers
Namaste. Yes, this GitHub repo is open source and can be used.
We need output from two OCR engines to start with.

Dhaval Patel

Mar 3, 2021, 8:56:51 PM
to विश्वासो वासुकिजः (Vishvas Vasuki), rohits...@gmail.com, Devaraj देवराजः digital-sanskrit-temp Adiga, Mārcis Gasūns, sanskrit-programmers
Hi Vishvas,

I do not have a post-processing script for spell correction.

There is one utility which may be of some use:

1. A script to do some generic formatting after Google Docs OCR.

When I have typed / OCR data from two sources, I usually compare them with "meld" and make corrections by hand.

Rohit's solution seems more robust. I have not been able to check it, though.

Suggestion - 

Vishvas has a script in the doc_curation package which generates Google Docs OCR output for a given PDF file. It works quite well for me.

If such a script can be made for any other free OCR engine, we will have two OCR versions of the same PDF. That would satisfy the requirement of Rohit's program.

I am not aware if there is any other OCR machine which is open source / gives access via API. If there is one, efforts in this direction would be highly fruitful.

Avinash L Varna

Mar 4, 2021, 9:26:25 PM
to sanskrit-programmers
>> if there is any other OCR machine which is open source / gives access via API. If there is one, efforts in this direction would be highly fruitful.

What about tesseract (https://github.com/tesseract-ocr/tesseract)? Some members of this group have done quite a bit of investigation into it in the past.
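For anyone who wants to try tesseract as the second engine, a minimal sketch of a wrapper around its command-line tool might look like the following. It assumes the `tesseract` executable is on PATH and that the Sanskrit language data (`san`) is installed; the function names and the example file name are hypothetical and are not part of doc_curation or of Rohit's tool.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="san"):
    """Return the argv list for OCR-ing one page image to stdout."""
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing <outputbase>.txt.
    return ["tesseract", str(Path(image_path)), "stdout", "-l", lang]


def ocr_page(image_path, lang="san"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Calling `ocr_page("page_0001.png")` on each page image of a PDF would give a second OCR version to set beside the Google Docs output.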

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/CADSGPzU3sg9gV9nVsd0pYB8VJhAFMj85%2BD9kNqiAvtNdaU1-Hg%40mail.gmail.com.

Irene Galstian

Mar 4, 2021, 9:38:30 PM
to sanskrit-p...@googlegroups.com
Does the Devanagari OCR used by archive.org use Google OCR or something else?


विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 4, 2021, 10:41:54 PM
to sanskrit-programmers
On Fri, Mar 5, 2021 at 8:08 AM Irene Galstian <gnos...@gmail.com> wrote:
>> Does the Devanagari OCR used by archive.org use Google OCR or something else?

https://archive.org/post/1113647/ocr-software-used (I suspect that it is ABBYY, given their slow, unresponsive nature; but we'll see.)

 

Irene Galstian

Mar 5, 2021, 7:58:29 AM
to sanskrit-p...@googlegroups.com
Good, thank you. As long as it’s not Google, it can at least be tried as the second OCR method needed for the proofreading script.


विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 5, 2021, 9:24:34 AM
to sanskrit-programmers
"we are now using tesseract," they say in that archive.org thread.

Irene Galstian

Mar 5, 2021, 9:33:08 AM
to sanskrit-p...@googlegroups.com
Thank you, noted.

विश्वासो वासुकिजः (Vishvas Vasuki)

May 20, 2021, 1:36:45 AM
to rohit saluja, Avinash Varna, dhaval patel, sanskrit-programmers

https://github.com/rohitsaluja22/OpenOCRCorrect/issues seems to indicate that the software doesn't work or is abandoned. Would someone be interested in porting it to (say) Python? That would be very useful. +avinash - given that it seems to use sandhi and the like, you might be interested.

Avinash L Varna

May 22, 2021, 8:43:33 PM
to विश्वासो वासुकिजः (Vishvas Vasuki), rohit saluja, dhaval patel, sanskrit-programmers
I got a segfault with the software as well. It might not be easy to port if we can't run it and understand all its features. Unfortunately, the video link in the README doesn't work either, but based on the papers and presentation on Shri Saluja's website, it appears that there are two main features:
1. Compare the two independent OCR outputs and color-code the text in the GUI so that differences are easy to identify (pg. 7 of the presentation)
2. Perform auto-corrections / provide suggestions for the mismatches
(Please feel free to add anything I may have missed)
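The comparison in feature 1 can be approximated without any GUI. The sketch below is not the algorithm used by OpenOCRCorrect; it simply assumes a word-level alignment with Python's standard `difflib` is good enough for a first pass, and the `{a|b}` notation stands in for the color coding.

```python
import difflib


def mark_differences(ocr_a, ocr_b):
    """Return (tag, words_a, words_b) triples for two OCR outputs."""
    words_a, words_b = ocr_a.split(), ocr_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [
        (tag, words_a[i1:i2], words_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def render(ocr_a, ocr_b):
    """Render agreed words as-is and each disagreement as {a|b}."""
    parts = []
    for tag, a, b in mark_differences(ocr_a, ocr_b):
        if tag == "equal":
            parts.extend(a)
        else:
            # One side is empty for pure insertions or deletions.
            parts.append("{" + " ".join(a) + "|" + " ".join(b) + "}")
    return " ".join(parts)


print(render("rAmo vanam gacchati", "rAmo vanaM gacchati"))
# rAmo {vanam|vanaM} gacchati
```

A UI would only need to map the non-"equal" spans to a highlight color; the spans themselves are also the natural input for the suggestion step in feature 2.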

For #1, I am confident that there are capable members of this community who could modify/create a UI if necessary.
#2 is a more challenging and interesting problem. It is related to the problem of creating a spell-checker/autocomplete (which is something we've discussed in the past), but tuned to the kinds of errors that the two OCR engines make. If we had enough ground truth data (images + corresponding proofread text), it might be possible to model/learn the distribution of errors and generate suggestions/corrections. If we don't have data, we could try to create it synthetically by taking sentences from a source such as the Sanskrit (sa) Wikipedia and intentionally introducing errors. However, an algorithm trained on such synthetic data might perform quite differently on real OCR output if the engines' actual error distribution differs a lot from the synthetic one. It might still be worth a try.
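As a rough illustration of the synthetic-data idea, the sketch below corrupts clean sentences using a small character confusion table and returns (noisy, clean) training pairs. The table entries are hypothetical examples of visually similar Devanagari letters, not measured error statistics for any OCR engine, and real OCR errors also include insertions, deletions and broken conjuncts that this sketch ignores.

```python
import random

# Hypothetical confusion table: each key may be replaced by one of its
# values. A real table would be estimated from proofread ground truth.
CONFUSIONS = {
    "ब": ["व"],
    "व": ["ब"],
    "म": ["भ"],
    "भ": ["म"],
    "घ": ["ध"],
    "ध": ["घ"],
}


def corrupt(text, error_rate=0.1, rng=None):
    """Replace confusable characters with probability error_rate."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(sentences, error_rate=0.1, seed=0):
    """Return reproducible (noisy, clean) pairs for training or testing."""
    rng = random.Random(seed)
    return [(corrupt(s, error_rate, rng), s) for s in sentences]
```

Because the substitutions are one-to-one, the noisy and clean strings stay the same length, which keeps alignment trivial; that convenience is also the main way this differs from real engine output.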

I seem to recall some discussions around creating an OCR benchmark/database a while ago. Is anyone aware of progress on this front?
@dhaval patel, could we leverage the various koshas that have been proofread and released under your leadership? Are the corresponding image files available as well?

In the course of searching, I also found a recent paper from Shri Saluja and collaborators that claims better performance than a few other alternatives such as Google-OCR (https://github.com/ihdia/sanskrit-ocr/blob/main/wa-ecr-final.png). Code here - https://github.com/ihdia/sanskrit-ocr. I did a quick test to run it on some of the sample images in their repo, if anyone wants to take a look. Creating a Python wrapper around it / repackaging it for TensorFlow.js should be possible if the community is interested in using it as another OCR option.

Thanks
Avinash


Dhaval Patel

May 22, 2021, 9:13:06 PM
to Avinash L Varna, sanskrit-programmers
Sorry, the earlier mail was sent to Mr. Avinash Varna only.

On Sun, 23 May 2021, 06:41 Dhaval Patel, <drdhav...@gmail.com> wrote:
There are corresponding image files. The koshas also have page numbers in them, so we can split the text into pages and align each page with its image. So the koshas are a good starting point, I guess.
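The page-wise alignment described above could be sketched as follows. The marker format `[Page 12]` and the image naming pattern are assumptions made for illustration; the actual kosha files may mark pages differently, in which case only the regular expression needs to change.

```python
import re

# Assumed page marker, e.g. "[Page 12]". Adjust to the real kosha markup.
PAGE_MARKER = re.compile(r"\[Page\s*(\d+)\]")


def split_by_page(text):
    """Map page number -> proofread text that follows its marker."""
    # With one capture group, re.split returns
    # [preamble, number1, body1, number2, body2, ...].
    parts = PAGE_MARKER.split(text)
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages


def pair_with_images(pages, image_pattern="page_{:04d}.png"):
    """Return (image file name, page text) pairs in page order."""
    return [
        (image_pattern.format(number), body)
        for number, body in sorted(pages.items())
    ]
```

Each resulting pair is one ground-truth sample (image plus proofread text) of the kind an OCR benchmark or an error model would need.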

Dhaval Patel

May 25, 2021, 8:37:06 PM
to sanskrit-programmers
Not as of now, and it would be a pain to locate some of them: my hard drive crashed some months ago. So I will put in that work only if it is really going to be used.

विश्वासो वासुकिजः (Vishvas Vasuki)

Aug 1, 2022, 11:21:04 PM
to rohit saluja, Devaraj देवराजः digital-sanskrit-temp Adiga, dhaval धवलः patel, sanskrit-programmers
https://github.com/IITB-OpenOCRCorrect/iitb-openocr-digit-tool seems to be the more updated repository for this.