I got a segfault with the software as well. Porting it will be difficult if we can't run it and understand all of its features. Unfortunately, the video link in the README doesn't work either, but based on the papers and
presentation on Shri Saluja's website, there appear to be two main features:
1. Compare the two independent OCR outputs and color-code the text in the GUI to easily identify differences (pg. 7 of the presentation)
2. Auto-correct the mismatches, or provide suggestions for them
(Please feel free to add anything I may have missed)
For #1, I am confident that there are capable members of this community who could modify/create a UI if necessary.
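As a starting point for #1, the alignment itself doesn't need anything exotic: Python's standard-library `difflib` can align two OCR outputs and tag the agreeing and disagreeing spans, which a GUI could then color-code. This is only a sketch of the idea (the function name and the sample strings are my own, not from the original software):

```python
from difflib import SequenceMatcher

def diff_ocr(a, b):
    """Align two OCR outputs and tag matching/mismatching spans.

    Returns a list of (tag, text_a, text_b) tuples, where tag is
    'equal' where the engines agree and 'replace'/'delete'/'insert'
    where they disagree -- exactly the spans a GUI would highlight.
    """
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

# A typical OCR confusion ('1' vs 'l'):
for tag, span_a, span_b in diff_ocr("namaste wor1d", "namaste world"):
    print(tag, repr(span_a), repr(span_b))
```

For real use one would run this at the word or akshara level rather than per character, but the opcode structure stays the same.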
#2 is a more challenging and interesting problem. It is related to building a spell-checker/autocomplete (something we've discussed in the past), but tuned to the kinds of errors the two OCR engines actually make. If we had enough ground-truth data (images plus corresponding proofread text), we could model the error distribution and generate suggestions/corrections from it.

If we don't have such data, we could try to create it synthetically: take clean sentences from a source such as SA wikipedia and intentionally introduce errors. The risk is that an algorithm trained on synthetic data may perform very differently if the real error distribution of the OCR engines doesn't match the one we simulated. It still seems worth a try.
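To make the synthetic-data idea concrete, here is a minimal sketch of error injection using a confusion table. Everything here is hypothetical: the `CONFUSIONS` entries are placeholder Latin-script examples, and in practice the table would be estimated from real OCR-output/ground-truth pairs (and would cover Devanagari confusions):

```python
import random

# Hypothetical confusion table. In practice, estimate these
# substitutions (and their probabilities) from aligned pairs of
# OCR output and proofread text.
CONFUSIONS = {
    "rn": ["m"],        # two glyphs misread as one
    "m": ["rn"],
    "l": ["1", "I"],
    "o": ["0"],
}

def corrupt(sentence, rate=0.1, seed=None):
    """Introduce OCR-like errors into a clean sentence.

    Scans left to right; at each position, with probability `rate`,
    replaces the longest matching key from CONFUSIONS with one of
    its alternatives. Returns the corrupted sentence.
    """
    rng = random.Random(seed)
    keys = sorted(CONFUSIONS, key=len, reverse=True)  # prefer 'rn' over 'r'
    out, i = [], 0
    while i < len(sentence):
        hit = next((k for k in keys if sentence.startswith(k, i)), None)
        if hit is not None and rng.random() < rate:
            out.append(rng.choice(CONFUSIONS[hit]))
            i += len(hit)
        else:
            out.append(sentence[i])
            i += 1
    return "".join(out)
```

Pairs of (clean sentence, `corrupt(clean sentence)`) would then serve as training data for a correction model, with the caveat noted above about the synthetic distribution diverging from the engines' real one.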
I seem to recall some discussions around creating an OCR benchmark/database a while ago. Is anyone aware of progress on this front?
@dhaval patel, could we leverage the various koshas that have been proofread and released under your leadership? Are the corresponding image files available as well?