Post-correction of OCR-generated text

Pierre Lison

Sep 2, 2014, 11:13:50 AM
to tesser...@googlegroups.com

Hi,

I'm a researcher in statistical machine translation, and my work relies on a bunch of translated texts (in multiple languages), some of which were automatically generated via OCR. I recently noticed that some texts include substantial numbers of OCR errors, which I would of course like to correct to improve the quality of my data.

I was therefore wondering if I could use tesseract or some related software tool to correct at least some of these OCR-generated errors (e.g. through statistical language modelling techniques). Note that I unfortunately don't have access to the original scans; I only have the raw, OCR-produced text.

Any suggestions?

Thanks!

Pierre

Rick Leir

Sep 5, 2014, 9:14:18 AM
to tesser...@googlegroups.com
Here is something about automated corrections:

http://ilk.uvt.nl/downloads/pub/papers/CICLING08.TICCL.MRE.postpublication.pdf

Unrelated to the above, I would like to use languagetool.org to automate corrections. So much to do, so little time...
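
For the text-only case (no scans), a rough idea of what frequency-based correction can look like is sketched below. This is not TICCL's algorithm and not anything tesseract or LanguageTool ships; it is a minimal word-level corrector that assumes you have some clean text in the same language to count words from. The names `tokenize`, `edits1`, `correct_word` and the tiny sample corpus are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct_word(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    known = [w for w in edits1(word) if w in counts]
    if not known:
        return word  # nothing plausible found: leave the token untouched
    return max(known, key=counts.__getitem__)

# Toy stand-in for a large clean corpus (assumption: you have one available).
clean_text = "the quality of the data is substantial and the data is clean"
counts = Counter(tokenize(clean_text))

print(correct_word("subtantial", counts))
print(correct_word("qualitv", counts))
```

With a realistic corpus the counts would come from millions of words, and you would probably want to restrict the edits to character confusions that OCR engines actually make, rather than the full alphabet.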

Shree Devi Kumar

Sep 5, 2014, 1:07:27 PM
to tesser...@googlegroups.com
Interesting paper on TICCL. I'm wondering whether tesseract is using a similar approach for the 3.04 language data, with the unigram and bigram lists along with 'clean' word lists ...

See section 4.4 (processing steps).
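
To make the unigram/bigram idea concrete, here is a small sketch of how such lists could be used to pick between correction candidates. I don't know that tesseract or TICCL works this way; this is only an illustration. The confusion table `CONFUSIONS`, the helper names, and the toy corpus are all hypothetical; a real confusion table would have to be estimated from OCR output aligned with ground truth.

```python
from collections import Counter

# Hypothetical OCR confusion pairs (misread -> intended).
CONFUSIONS = [("1", "l"), ("1", "i"), ("0", "o"), ("rn", "m")]

def candidates(token, lexicon):
    """Lexicon words reachable from token by one confusion substitution."""
    found = set()
    for wrong, right in CONFUSIONS:
        start = token.find(wrong)
        while start != -1:
            variant = token[:start] + right + token[start + len(wrong):]
            if variant in lexicon:
                found.add(variant)
            start = token.find(wrong, start + 1)
    return found

def correct_tokens(tokens, unigrams, bigrams):
    """Replace unknown tokens, preferring candidates that follow the previous word."""
    out = []
    for token in tokens:
        if token in unigrams:
            out.append(token)
            continue
        options = candidates(token, unigrams)
        if not options:
            out.append(token)  # no candidate: keep the raw OCR token
            continue
        prev = out[-1] if out else None
        # Rank by bigram count with the previous word, then by unigram count;
        # the word itself is the last key so ties break deterministically.
        out.append(max(options, key=lambda w: (bigrams[(prev, w)], unigrams[w], w)))
    return out

# Toy 'clean' text standing in for real unigram and bigram lists.
clean = "the tall man saw the dog wag its tail . a tall tree".split()
unigrams = Counter(clean)
bigrams = Counter(zip(clean, clean[1:]))

print(correct_tokens("its ta1l".split(), unigrams, bigrams))
print(correct_tokens("a ta1l tree".split(), unigrams, bigrams))
```

Here "ta1l" could be either "tall" or "tail"; the bigram counts decide based on the preceding word, and the unigram count is only the fallback when no bigram evidence exists.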

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

