approach when given a limited vocabulary

Ivar

Sep 13, 2015, 5:27:49 AM

to tesseract-ocr

I'm very new to OCR and image processing in general, so please excuse me if this is a FAQ. I haven't been able to track down any recommendations yet.

I'm looking to identify words in images, where every word to be recognized comes from a limited pool of known words (~5000 words). The words will also be set in very similar fonts, but the images will generally be of poor quality.

What would be the recommended approach?

1) use Tesseract as-is and post-process its output, matching each result against the known words (using Levenshtein distance, Jaro-Winkler similarity, or something similar)

2) train Tesseract on the known set of words

3) something else?
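
To make option 1 concrete, here is a rough sketch of the kind of post-processing I have in mind. It is only an illustration, not a tested solution: `closest_word`, `VOCABULARY`, and the `max_distance` threshold are hypothetical names and values, and the input strings are made-up examples of OCR errors rather than real Tesseract output.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def closest_word(ocr_text, vocabulary, max_distance=2):
    """Return the vocabulary word nearest to ocr_text, or None if nothing
    is within max_distance edits (assumed threshold, tune for real data)."""
    candidate = ocr_text.strip().lower()
    best_word, best_distance = None, max_distance + 1
    for word in vocabulary:
        distance = levenshtein(candidate, word.lower())
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word


# Placeholder list standing in for the real ~5000-word pool.
VOCABULARY = ["invoice", "receipt", "total", "balance"]

print(closest_word("lnvo1ce", VOCABULARY))  # two substitutions away from "invoice"
print(closest_word("xyzzy", VOCABULARY))    # nothing close enough, so None
```

With a pool of about 5000 short words, a plain linear scan like this should be cheap enough per recognized word. What I can't judge is whether the raw output on poor-quality images would be close enough to the true word for this kind of matching to work.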
