Unable to OCR following image

viraf

Feb 18, 2016, 1:08:37 AM

to tesseract-ocr

I am having trouble with OCR accuracy and was hoping someone could guide me through debugging the problem, so that I can apply the same techniques to other OCR issues I run into. Attached is a snippet of a document that is not OCR'd correctly. The output I get is:

RE U'EST FO DICAL

The following config entries were added to configs/use-userdict:

load_system_dawg F
load_freq_dawg F
load_punc_dawg F
load_number_dawg F
load_unambig_dawg F
load_bigram_dawg F
load_fixed_length_dawgs F
user_words_suffix user-words
tessedit_write_images T
tessedit_dump_pageseg_images T

and eng.user-words has the following entries:

REQUEST
FOR
INDEPENDENT
MEDICAL
REVIEW

The following command line was used:

tesseract test.png stdout -l eng use-userdict

test.png

viraf

Feb 18, 2016, 8:39:55 AM

to tesseract-ocr

So, I decided to manually remove the underline from the image and OCR it. The new image is attached.
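For reference, the manual underline removal could in principle be automated. The following is only a minimal sketch of one common idea, erasing horizontal ink runs that are much longer than any stroke of a letter. It is not what was done to produce test2.png, and it is not a Tesseract feature. It assumes the image has already been binarized into a list of rows of 0 (background) and 1 (ink); the function name `remove_long_horizontal_runs` and the `min_run` threshold are made up for illustration.

```python
def remove_long_horizontal_runs(image, min_run):
    """Return a copy of a binary image with long horizontal ink runs erased.

    image   -- list of rows, each a list of 0 (background) or 1 (ink)
    min_run -- ink runs at least this many pixels wide are treated as underline
    """
    cleaned = [row[:] for row in image]  # leave the caller's image untouched
    for row in cleaned:
        start = None
        # Iterate over a copy with a trailing 0 so that a run touching the
        # right edge of the image is still closed and measured.
        for x, pixel in enumerate(row + [0]):
            if pixel and start is None:
                start = x  # a new ink run begins here
            elif not pixel and start is not None:
                if x - start >= min_run:
                    row[start:x] = [0] * (x - start)  # erase the long run
                start = None
    return cleaned


# Tiny hand-made example: two rows of "glyph" pixels above an underline row.
image = [
    [0, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],  # underline spanning the full width
]
cleaned = remove_long_horizontal_runs(image, min_run=6)
```

The threshold has to be wider than the widest horizontal stroke in the font. Where the underline touches or crosses a letter, this approach also removes the part of the letter that lies on the line, so the result would still need to be checked by eye.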

$ tesseract test2.png stdout -l eng

REQUEST FOR INDEPENDENT NIEDICAL REVIEW

$ tesseract test2.png stdout -l eng use-userdict

REQUEST FOR INDEPENDENT IVIEDICAL REVIEW

Having specified the user dictionary, I would have expected the output to be correct. Could someone please explain why the two outputs differ?
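To make the comparison between the two runs concrete, here is a small standalone Python sketch that lists which OCR tokens are missing from the user word list and which listed word each one most resembles. It uses only the standard `difflib` module and has nothing to do with how Tesseract applies its dictionaries internally. The helper names `closest_user_word` and `report_mismatches` and the 0.6 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

# The entries from eng.user-words, as listed earlier in this post.
USER_WORDS = ["REQUEST", "FOR", "INDEPENDENT", "MEDICAL", "REVIEW"]


def closest_user_word(token, words=USER_WORDS, cutoff=0.6):
    """Return the most similar word from the list, or None if nothing is close."""
    matches = difflib.get_close_matches(token, words, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def report_mismatches(ocr_line, words=USER_WORDS):
    """Return (ocr_token, nearest_word) pairs for tokens not in the word list."""
    return [
        (token, closest_user_word(token, words))
        for token in ocr_line.split()
        if token not in words
    ]


# The two outputs obtained for test2.png, without and with use-userdict.
without_dict = report_mismatches("REQUEST FOR INDEPENDENT NIEDICAL REVIEW")
with_dict = report_mismatches("REQUEST FOR INDEPENDENT IVIEDICAL REVIEW")
print(without_dict)  # [('NIEDICAL', 'MEDICAL')]
print(with_dict)     # [('IVIEDICAL', 'MEDICAL')]
```

In both runs the only token outside the word list is the one that should have been MEDICAL, and in both cases the leading M was split into several narrower characters (NI and IVI).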

I have also observed that Tesseract handles underlines correctly in other places, so I am unclear on what is required here. What are the rules for handling underlined text?

Thanks

- viraf

viraf

Feb 18, 2016, 8:40:28 AM

to tesseract-ocr

test2.png
