Hi All,
I'm trying to use Tesseract's training tools to optimize it for my dataset, which is a collection of fairly low-resolution scans of books from the 1930s. The text is in English, and I have built training and test sets with ground-truth text. Training from scratch, I've produced a model that's nearly as good on this dataset as the original best eng model. Where I'm struggling is retraining from the best eng model.
When I do this, the character error rate starts very high, usually more than 5.0 (depending on the learning rate I specify). It comes down slowly over many iterations, but the resulting models still produce garbage when I test them. What am I doing wrong?
I'm downloading the best eng model to the from_full directory I've created for this with:
I'm then making my .lstm file with:
combine_tessdata -e from_full/eng.traineddata from_full/eng.lstm
Finally, I'm running the retraining with:
lstmtraining \
  --continue_from from_full/eng.lstm \
  --traineddata from_full/eng.traineddata \
  --train_listfile data/list.train \
  --learning_rate 1e-3 \
  --model_output from_full/checkpoints/retrain400 \
  --max_iterations 400
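For anyone wanting to reproduce the testing step, here is a minimal sketch of how a checkpoint from the run above could be scored with lstmeval. It assumes lstmtraining wrote its checkpoint to from_full/checkpoints/retrain400_checkpoint (the --model_output prefix plus _checkpoint) and that the test list is at data/list.eval, which is a placeholder name. If lstmeval or the checkpoint is missing, the script only prints the command it would have run.

```shell
# Assumed paths: adjust to the actual layout.
CHECKPOINT="from_full/checkpoints/retrain400_checkpoint"  # assumed checkpoint name
TRAINEDDATA="from_full/eng.traineddata"
EVAL_LIST="data/list.eval"                                # placeholder test list

# Score a checkpoint against a list of .lstmf files.
run_eval() {
  lstmeval --model "$1" --traineddata "$2" --eval_listfile "$3"
}

if command -v lstmeval >/dev/null 2>&1 && [ -f "$CHECKPOINT" ]; then
  run_eval "$CHECKPOINT" "$TRAINEDDATA" "$EVAL_LIST"
else
  # Tool or checkpoint not available: show the command instead of running it.
  echo "Would run: lstmeval --model $CHECKPOINT --traineddata $TRAINEDDATA --eval_listfile $EVAL_LIST"
fi
```

Running the same evaluation on from_full/eng.lstm before any retraining gives a baseline to compare the retrained checkpoint against.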
How do I make this work? Where am I going wrong?
Thanks!
--Sam