Unable to OCR following image

viraf

Feb 18, 2016, 1:08:37 AM
to tesseract-ocr
I am having trouble with OCR accuracy and was hoping someone could walk me through how to debug the problem, so that I can apply the same techniques to other OCR issues I run into. Attached is a snippet of a document that is not OCR'd correctly. The output I get is:

RE U'EST FO DICAL

The following config entries were added to configs/use-userdict:
load_system_dawg F
load_freq_dawg F
load_punc_dawg F
load_number_dawg F
load_unambig_dawg F
load_bigram_dawg F
load_fixed_length_dawgs F
user_words_suffix user-words
tessedit_write_images T
tessedit_dump_pageseg_images T

and eng.user-words has the following entries:
REQUEST
FOR
INDEPENDENT
MEDICAL
REVIEW

The following command line was used:

tesseract test.png stdout -l eng use-userdict
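
For anyone who wants to reproduce this setup, here is a minimal Python sketch that writes the two files and assembles the command line shown above. It assumes the usual Tesseract 3.x layout, where config files live under `tessdata/configs/` and the user word list is read from `tessdata/<lang>.<suffix>`; the `tessdata_dir` argument and the helper names `write_userdict_setup` and `build_command` are hypothetical and not part of Tesseract.

```python
from pathlib import Path

CONFIG_NAME = "use-userdict"
USER_WORDS_SUFFIX = "user-words"

# Config entries exactly as listed in the post.
CONFIG_LINES = [
    "load_system_dawg F",
    "load_freq_dawg F",
    "load_punc_dawg F",
    "load_number_dawg F",
    "load_unambig_dawg F",
    "load_bigram_dawg F",
    "load_fixed_length_dawgs F",
    f"user_words_suffix {USER_WORDS_SUFFIX}",
    "tessedit_write_images T",
    "tessedit_dump_pageseg_images T",
]


def write_userdict_setup(tessdata_dir, lang, words):
    """Write the config file and the user word list under tessdata_dir.

    Assumed layout: <tessdata_dir>/configs/use-userdict and
    <tessdata_dir>/<lang>.user-words.
    """
    tessdata = Path(tessdata_dir)
    config_path = tessdata / "configs" / CONFIG_NAME
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text("\n".join(CONFIG_LINES) + "\n")

    words_path = tessdata / f"{lang}.{USER_WORDS_SUFFIX}"
    words_path.write_text("\n".join(words) + "\n")
    return config_path, words_path


def build_command(image, lang):
    """Return the argument list for the command line used in the post."""
    return ["tesseract", image, "stdout", "-l", lang, CONFIG_NAME]
```

The list returned by `build_command("test.png", "eng")` can be passed to `subprocess.run` on a machine where `tesseract` is installed.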


test.png

viraf

Feb 18, 2016, 8:39:55 AM
to tesseract-ocr
So, I decided to manually remove the underline from the image and OCR it.  The new image is attached.
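
The underline here was removed by hand in an image editor. As a rough illustration of how the same step might be automated, the sketch below clears long horizontal runs of ink from an already binarized image. It is only an assumption about one possible approach, not what was done for test2.png: the image is represented as a list of rows of 0/1 values (1 = ink), and the function name and the `min_run` threshold are hypothetical. A rule this simple would also erase any glyph strokes that happen to form a long horizontal run.

```python
def remove_long_horizontal_runs(image, min_run):
    """Return a copy of a binary image with long horizontal ink runs cleared.

    image   -- list of rows, each a list of 0/1 values (1 = ink)
    min_run -- runs of at least this many consecutive ink pixels are erased
    """
    cleaned = [row[:] for row in image]  # leave the input untouched
    for row in cleaned:
        x = 0
        width = len(row)
        while x < width:
            if row[x] == 1:
                start = x
                # Advance to the end of this run of ink pixels.
                while x < width and row[x] == 1:
                    x += 1
                # Erase the run only if it is long enough to be a rule line.
                if x - start >= min_run:
                    for k in range(start, x):
                        row[k] = 0
            else:
                x += 1
    return cleaned
```

With a threshold close to the width of the text line, short strokes belonging to letters are left alone while a full-width underline is cleared.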

$ tesseract test2.png stdout -l eng 
REQUEST FOR INDEPENDENT NIEDICAL REVIEW

$ tesseract test2.png stdout -l eng use-userdict
REQUEST FOR INDEPENDENT IVIEDICAL REVIEW
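
To make the comparison between runs explicit, here is a small sketch using Python's standard `difflib` module that lists the words where an OCR result departs from the expected text. The helper name `mismatched_words` is hypothetical; the strings are the ones shown above.

```python
import difflib


def mismatched_words(expected, actual):
    """Return (expected_words, actual_words) pairs for every differing span."""
    exp, act = expected.split(), actual.split()
    matcher = difflib.SequenceMatcher(a=exp, b=act, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(exp[i1:i2]), " ".join(act[j1:j2])))
    return diffs


expected = "REQUEST FOR INDEPENDENT MEDICAL REVIEW"
print(mismatched_words(expected, "REQUEST FOR INDEPENDENT NIEDICAL REVIEW"))
# prints [('MEDICAL', 'NIEDICAL')]
print(mismatched_words(expected, "REQUEST FOR INDEPENDENT IVIEDICAL REVIEW"))
# prints [('MEDICAL', 'IVIEDICAL')]
```

In both runs the only mismatch is the word MEDICAL, and in both cases it is the leading M that is misread (as NI or as IVI).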

Since I specified the user dictionary, I would have expected the output to be correct. Could someone please explain why the two outputs differ?
I have also observed that Tesseract handles underlines correctly in other places, so I am unclear on what is required here. What are the rules for handling underlined text?

Thanks

- viraf

viraf

Feb 18, 2016, 8:40:28 AM
to tesseract-ocr
test2.png