Python Tesseract OCR: training on a specific list of words

Inês Martins

Jun 12, 2015, 9:10:55 AM

to tesser...@googlegroups.com

I am quite new to OCR and to Tesseract.

So far I have a working script that extracts text from images quite well.

My question is whether it is possible to train Tesseract to return only words/characters that are present in some kind of dictionary file.

For example, I have a .txt file with a big list of person names, and I want to teach Tesseract that "SONIA" is not "50NlA", that "YANNICK" is not "VANNlD", etc.
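To make the goal concrete, here is a minimal sketch of the behaviour I am after, done as post-processing of the OCR output rather than as Tesseract training. It is not a Tesseract feature: it uses Python's standard `difflib` module, and the confusion table, the `correct_name` helper and the short `names` list are all hypothetical, chosen only to illustrate the two examples above.

```python
import difflib

# Assumed mapping of characters that get confused in upper-case names
# (digits and lower-case "l" read in place of letters). Illustrative only.
CONFUSIONS = str.maketrans({"5": "S", "0": "O", "1": "I", "l": "I", "|": "I"})


def correct_name(ocr_word, names, cutoff=0.6):
    """Return the closest entry in `names`, or the raw OCR word if none is close."""
    # Undo the assumed character confusions, then compare in upper case.
    normalized = ocr_word.translate(CONFUSIONS).upper()
    matches = difflib.get_close_matches(normalized, names, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word


# Hypothetical stand-in for the big .txt list of person names.
names = ["SONIA", "YANNICK", "MARIA"]

for raw in ["50NlA", "VANNlD"]:
    print(raw, "->", correct_name(raw, names))
```

The `cutoff` value trades missed corrections against wrong ones; with a large name list it would probably need tuning, since many names are only one or two letters apart.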

If Tesseract has the list of all the names, will it be able to give better accuracy? Sorry if this is a stupid question. I am looking for the best approach, or for tutorials, if this is possible.

I have read this thread https://groups.google.com/forum/#!topic/tesseract-ocr/r5qkHxQOT98 and the manual http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html, and I have created the eng.user-words and bazaar files. What should the next step be?
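As far as I understand the manual, the command line takes the image, an output base name, the language and then the name of a config file, so my guess at the next step is sketched below. The assumptions are mine: that the bazaar config sits in tessdata/configs and contains a line such as `user_words_suffix user-words`, that eng.user-words sits in tessdata, and that `names.png` / `names_out` are placeholder file names. The helper functions are hypothetical wrappers, not part of any Tesseract Python binding.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", config="bazaar"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang> <configfile>."""
    return ["tesseract", image_path, output_base, "-l", lang, config]


def run_tesseract(image_path, output_base, lang="eng", config="bazaar"):
    """Run Tesseract; requires the tesseract binary to be on PATH."""
    cmd = build_tesseract_command(image_path, output_base, lang, config)
    subprocess.run(cmd, check=True)


# Only print the command here, so the sketch runs without Tesseract installed.
cmd = build_tesseract_command("names.png", "names_out")
print(" ".join(cmd))
```

Calling `run_tesseract("names.png", "names_out")` would then write the recognized text to names_out.txt, if the assumptions above hold.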