Hello,
I am fine-tuning my own traineddata model for the chi_sim language. My training data contains some mathematical symbols, such as "∞", "β", "△" and so on, which the official chi_sim.traineddata model cannot recognize.
So we replaced the content of the chi_sim.training_text file with our own training text.
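To check how often each new symbol actually appears in the training text, something like the following could be used. This is only a minimal sketch: `count_symbols` and `TARGET_SYMBOLS` are hypothetical names, the list holds just the three symbols named above, and the file is assumed to be UTF-8.

```python
from collections import Counter

# Only the three symbols named in this post; extend the list as needed.
TARGET_SYMBOLS = ["∞", "β", "△"]


def count_symbols(text, symbols=TARGET_SYMBOLS):
    """Return a dict mapping each target symbol to its number of occurrences in text."""
    counts = Counter(text)  # per-character frequency of the whole text
    return {s: counts[s] for s in symbols}


# Example usage (the path is an assumption, adjust to your setup):
# with open("chi_sim.training_text", encoding="utf-8") as f:
#     print(count_symbols(f.read()))
```

A symbol that shows up with a count of zero or only a handful of times in the output is unlikely to be learned well, whatever the iteration count.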
We then ran the training command:
training/lstmtraining --model_output ~/tesstutorial/trainspecial/special \
--continue_from ~/tesstutorial/trainspecial/chi_sim.lstm \
--traineddata ~/tesstutorial/trainspecial/chi_sim/chi_sim.traineddata \
--old_traineddata tessdata/best/chi_sim.traineddata \
--train_listfile ~/tesstutorial/trainspecial/chi_sim.training_files.txt \
--max_iterations 400000
After the 400000 iterations specified in the command, the char error rate is about 0.2% and the word error rate is about 4.2%.
By then the error rate had started to oscillate and no longer decreased, so we stopped training and exported the traineddata model.
When we tested the exported traineddata model, its accuracy was unsatisfactory: it is lower than that of the official model (the one provided on the Tesseract GitHub site).
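For anyone who wants to reproduce this kind of comparison, a character error rate can be computed as the edit distance between the recognized text and the ground truth, divided by the ground-truth length. The sketch below is an illustration under that assumption; `edit_distance` and `char_error_rate` are hypothetical helper names, and the figure it gives is not necessarily calculated the same way as the char error printed by lstmtraining.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted per Unicode code point."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ref, hyp):
    """Edit distance divided by the reference length.

    Convention assumed here: an empty reference gives 0.0 if the
    hypothesis is also empty, and 1.0 otherwise.
    """
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)
```

Running both the official model and the fine-tuned model on the same held-out lines and averaging `char_error_rate` over them gives a like-for-like number for each model.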
We would like the recognition accuracy of our trained model to match that of the official model. How can we improve the accuracy of our model further?
Does anyone know the details of how the official language models were trained, such as the number of iterations, the lowest char error and word error reached, the value of learning_rate, and so on?
If you have this information, please share some tips.
Thank you.