I'm very new to OCR and image processing in general, so please excuse me if this question is a FAQ - I haven't been able to track down any recommendations yet.
I'm looking to identify words in images, where every word to be recognized comes from a limited pool of known words (~5000 words). The words will all be in very similar fonts, but the images will generally be of poor quality.
What would be the recommended approach?
1) use Tesseract as-is and post-process its output to match each result to the closest known word (using Levenshtein, Jaro-Winkler, or a similar string distance)
2) train Tesseract on the known set of words
3) something else?
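
To make option 1 concrete, here is a rough sketch of the post-processing step I have in mind. It assumes the OCR output is already available as a plain string and that the word list is lowercase. The names `levenshtein`, `closest_word`, and `KNOWN_WORDS`, the sample words, and the `max_distance` cutoff are all placeholders of my own, not anything provided by Tesseract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


def closest_word(ocr_text, lexicon, max_distance=2):
    """Return the lexicon word nearest to ocr_text, or None if none is close enough."""
    candidate = ocr_text.strip().lower()
    best = min(lexicon, key=lambda word: levenshtein(candidate, word))
    # Reject matches that are too far away rather than forcing a guess.
    return best if levenshtein(candidate, best) <= max_distance else None


# Hypothetical stand-in for the real ~5000-word pool.
KNOWN_WORDS = ["invoice", "receipt", "total", "balance"]

print(closest_word("rece1pt", KNOWN_WORDS))  # a typical OCR confusion of "i" and "1"
print(closest_word("xqzzy", KNOWN_WORDS))    # nothing in the pool is close
```

With only ~5000 known words, a brute-force scan of the whole list per recognized token seems cheap enough, which is why the sketch does not bother with any indexing.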