approach when given a limited vocabulary

Ivar

Sep 13, 2015, 5:27:49 AM
to tesseract-ocr
I'm very new to OCR and image processing in general, so please excuse me if this question is a FAQ - I haven't been able to track down any recommendations yet.

I'm looking to identify words in images, where every word to be recognized comes from a limited pool of known words (~5000 words). The fonts will all be very similar, but the images will generally be of poor quality.

What would be the recommended approach? 
1) use tesseract as-is and post-process its output to find the closest known word (using Levenshtein, Jaro-Winkler, or a similar string distance)
2) train tesseract on the known set of words
3) something else?
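
To make option 1 concrete, here is a rough sketch of the post-processing step I have in mind. It only covers the matching side: it assumes the OCR output is already available as a string, and the names `levenshtein`, `closest_word`, and the `max_distance` cutoff are my own placeholders, not anything provided by tesseract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest_word(ocr_text, vocabulary, max_distance=2):
    """Return the vocabulary word nearest to ocr_text, or None if none is close enough.

    max_distance is an assumed cutoff; it would need tuning on real OCR output.
    """
    candidate = ocr_text.strip().lower()
    best_word, best_distance = None, max_distance + 1
    for word in vocabulary:
        distance = levenshtein(candidate, word.lower())
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word


if __name__ == "__main__":
    words = ["apple", "maple", "ample"]  # stand-in for the ~5000 known words
    print(closest_word("app1e", words))
```

A linear scan over ~5000 words per recognized token should be cheap enough; the cutoff is there so that badly garbled output is rejected rather than forced onto some unrelated word.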
