Retraining from the best eng model leads to garbage results

Samuel Bell

Aug 26, 2019, 2:10:12 PM

to tesseract-ocr

Hi All,

I'm trying to fine-tune Tesseract for my dataset, which is a collection of fairly low-resolution scans of books from the 1930s. The text is in English, and I have built training and test sets with ground-truth text. Training from scratch on this dataset, I've gotten a model that's nearly as good as the original best eng model. Where I'm struggling is retraining from the best eng model.

When I do this, the character error rate starts very high, usually more than 5.0 (depending on the learning rate I specify). It slowly comes down over many iterations, but the resulting models still produce garbage when I test them. What am I doing wrong?

I'm downloading the best eng model to the from_full directory I've created for this with:

wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata

I'm then making my .lstm file with:

combine_tessdata -e from_full/eng.traineddata from_full/eng.lstm

Finally, I'm running the retraining with:

lstmtraining \
  --continue_from from_full/eng.lstm \
  --traineddata from_full/eng.traineddata \
  --train_listfile data/list.train \
  --learning_rate 1e-3 \
  --model_output from_full/checkpoints/retrain400 \
  --max_iterations 400

How do I make this work? Where am I going wrong?

Thanks!

--Sam

Samuel Bell

Aug 26, 2019, 2:41:01 PM

to tesseract-ocr

Of course, as soon as I posted this, I found the problem: I was making a mistake in the evaluation command, not in the retraining itself.
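
For anyone landing here with the same symptom: the post does not say what the mistake in the evaluation command was, so the following is only a minimal sketch of how a checkpoint from the retraining command above might be evaluated with lstmeval. It assumes lstmtraining wrote its checkpoint to the --model_output path with a _checkpoint suffix, that the same eng.traineddata used for retraining is passed to the evaluation, and that the test set is listed in data/list.eval, which is a hypothetical file name not taken from the post.

```shell
# Paths reused from the retraining command; adjust to your own layout.
CHECKPOINT=from_full/checkpoints/retrain400_checkpoint  # assumed: <model_output>_checkpoint
TRAINEDDATA=from_full/eng.traineddata                   # same traineddata used for retraining
EVAL_LIST=data/list.eval                                # hypothetical name for the test-set list

# Run the evaluation only if the tool and the checkpoint are actually present.
if command -v lstmeval >/dev/null 2>&1 && [ -f "$CHECKPOINT" ]; then
  lstmeval \
    --model "$CHECKPOINT" \
    --traineddata "$TRAINEDDATA" \
    --eval_listfile "$EVAL_LIST"
else
  echo "lstmeval or $CHECKPOINT not found; skipping evaluation" >&2
fi
```

One thing worth checking in a setup like this is that the model and the traineddata passed to the evaluation belong together; pairing a checkpoint with a traineddata file from a different run is one plausible way to get nonsense output from a model that trained normally.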
