Hi,
I'm a researcher in statistical machine translation, and my work relies on a collection of translated texts (in multiple languages), some of which were automatically generated via OCR. I recently noticed that some of these texts contain substantial numbers of OCR errors, which I would of course like to correct to improve the quality of my data.
I was therefore wondering whether I could use tesseract or a related software tool to correct at least some of these OCR-generated errors (e.g. through statistical language modelling techniques). Note that I unfortunately don't have access to the original scans; I only have the raw, OCR-produced text.
Any suggestions?
Thanks!
Pierre