I have a set of English single-page TIFF document images that come with ground truth (GT) files. Each TIFF has a single rectangular zone of text, and each GT file is a UTF-8 text file containing the correct text.
I built Tesseract 3.03 from source and ran it on this set using the English model that comes out of the box. Results were mixed, so the question I am trying to answer is this:
Can I incrementally train Tesseract using a part of this corpus to get better accuracy?
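To make "better accuracy" concrete, one option is to hold out part of the corpus and compare the character error rate of the OCR output against the GT text before and after training. Below is a minimal sketch of that measurement, assuming both texts are already loaded as plain strings; the function names `edit_distance` and `char_error_rate` are hypothetical helpers written for illustration and are not part of Tesseract.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text: str, gt_text: str) -> float:
    """Edit distance normalized by the length of the ground truth."""
    if not gt_text:
        # Assumption: empty GT counts as fully wrong unless OCR is empty too.
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, gt_text) / len(gt_text)
```

Averaging `char_error_rate` over the held-out pages for the stock model and for the retrained one would show whether the training actually helped on pages it did not see.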
I've been reading
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 but it's unclear to me whether incremental training is possible. Is it? If so, how would I have to modify the training procedure so that it starts from the previously trained data and extends it with whatever comes from the new data?
Thx