Training by providing a text file accompanying an image?

Philipp Lenssen

Nov 20, 2008, 7:29:50 AM

to tesseract-ocr

Hi! I read through the training tutorial
(http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract)
but wanted to see if there's an easier option than creating bounding
boxes for each letter (which is what I understand the tutorial
requires). Is there any option where one would simply point to a TIF
file and a TXT file containing the correct text, and have Tesseract
train on that pair?

For instance, I'm currently getting a result like this one on an
image:
------------
Aprll 15 1953
Foober
------------

So I would like to change the text to
------------
April 15 1953
Foobar
------------
... for training purposes. I'm guessing Tesseract could work out the
bounding boxes itself, as it did for the first, incorrect run?

Thanks!

Ray Smith

Nov 28, 2008, 1:17:45 PM

to tesser...@googlegroups.com

It is possible, and there are broken bits of code that support that kind of training, but they haven't been used for years and no longer work, so it would take quite a lot of effort to get them working again.
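
For substitution-only errors like the "Aprll"/"April" case in the question, the idea can be approximated outside Tesseract by relabeling the boxes from the incorrect run with the characters from the corrected text. The following is only a rough sketch, not part of Tesseract: it assumes each box line starts with the character followed by its coordinates, that there is exactly one box per non-whitespace character, and that the corrected text has the same number of glyphs as the OCR output. The function name `relabel_boxes` and the coordinates in the example are hypothetical.

```python
def relabel_boxes(box_lines, corrected_text):
    """Swap the character label on each box line for the corrected character.

    Assumption: one box per non-whitespace character, in reading order,
    and the OCR run made only substitution errors (no merged, split,
    missing, or extra glyphs). Anything else raises ValueError.
    """
    # Box files carry no entries for spaces or newlines, so drop them.
    chars = [c for c in corrected_text if not c.isspace()]
    # Each box line: <char> <coord> <coord> ... ; keep the coordinates as-is.
    boxes = [line.split() for line in box_lines if line.strip()]

    if len(chars) != len(boxes):
        raise ValueError(
            "corrected text has %d glyphs but there are %d boxes; "
            "manual box editing is needed" % (len(chars), len(boxes))
        )

    return [" ".join([c] + fields[1:]) for c, fields in zip(chars, boxes)]


# Hypothetical boxes for the misread word "Aprll" (coordinates are made up).
ocr_boxes = [
    "A 10 100 30 130",
    "p 32 92 48 120",
    "r 50 100 60 120",
    "l 62 100 68 130",
    "l 70 100 76 130",
]

fixed = relabel_boxes(ocr_boxes, "April")
for line in fixed:
    print(line)
```

This only covers the easy case. If the first run merged or split characters, the glyph counts differ and the boxes still have to be fixed by hand, which is the part the tutorial's manual box editing handles.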