Can I use the Tesseract dictionary to fix non-dictionary words?


Jakub Dolecki

Aug 21, 2015, 6:05:38 PM
to tesseract-ocr
Hello everyone,

I've searched this group for an answer to my question but couldn't find anything satisfactory, so here goes. For the attached image, the OCR result is the following:

Review the Main Idea state-


ment at the beginning of this


section. List five sources that a


historian'might use to write


a history of your Iife.Then,


eValIJate them for authenticity,

reiiability (72 confidence), and bias.


The command I used to run OCR is `tesseract rotated.jpeg foo -psm 1 -c language_model_penalty_non_dict_word 1.0`. 


Tesseract does a good job overall, but fails to determine that "reiiability" should be "reliability" (among a few other words, but I'm curious about this case in particular). Can you please explain to me why Tesseract fails to find the dictionary word?


Assuming I cannot fix this discrepancy on the word-recognition level, can I utilize the API in some way to iterate over the words and only pick dictionary words from available choices? 


Since the DAWG is a graph, is it possible to ask Tesseract for a dictionary word that is, say, 1 or 2 characters away from the current best candidate?
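
To make that last question concrete, here is a rough sketch of the kind of lookup I have in mind, done as post-processing outside Tesseract rather than inside its DAWG. It is only an illustration: `edit_distance`, `nearby_dictionary_words`, and the tiny `WORDLIST` are names and data I made up for this example, not part of the Tesseract API, and a real run would need a full word list for the language.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def nearby_dictionary_words(candidate, wordlist, max_distance=2):
    """Return (distance, word) pairs within max_distance, closest first."""
    lowered = candidate.lower()
    hits = []
    for word in wordlist:
        # The length difference is a lower bound on the edit distance,
        # so words that differ too much in length can be skipped cheaply.
        if abs(len(word) - len(lowered)) > max_distance:
            continue
        distance = edit_distance(lowered, word.lower())
        if distance <= max_distance:
            hits.append((distance, word))
    return sorted(hits)


# Hypothetical stand-in for a real dictionary word list.
WORDLIST = ["reliability", "liability", "readability", "evaluate", "life"]

if __name__ == "__main__":
    for distance, word in nearby_dictionary_words("reiiability", WORDLIST):
        print(distance, word)
```

A linear scan like this is slow on a large dictionary and ignores which characters OCR tends to confuse (such as "l" and "i"), so I'd much rather reuse Tesseract's own dictionary and candidate choices if that is possible.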


Thanks a lot for your help,

Jakub

rotated.jpeg