Improving the accuracy of a fine-tuned chi_sim.traineddata model


roberty...@gmail.com

Sep 19, 2017, 4:30:24 AM
to tesseract-ocr
Hello,

I am training my own traineddata model for the chi_sim language by fine-tuning. My training data contains some mathematical symbols, such as "∞", "β", and "△", which the official chi_sim.traineddata model cannot recognize.

So we changed the content of the chi_sim.training_text file and filled it with our own training data.


Then we ran the training command:
training/lstmtraining --model_output ~/tesstutorial/trainspecial/special \
  --continue_from ~/tesstutorial/trainspecial/chi_sim.lstm \
  --traineddata ~/tesstutorial/trainspecial/chi_sim/chi_sim.traineddata \
  --old_traineddata tessdata/best/chi_sim.traineddata \
  --train_listfile ~/tesstutorial/trainspecial/chi_sim.training_files.txt \
  --max_iterations 400000

With this command, after 400000 iterations the char error is about 0.2% and the word error is about 4.2%.
The error rate has started to oscillate and no longer goes down, so we stopped training and exported the traineddata model.

When we tested the exported traineddata model, its accuracy was not satisfactory: it was lower than that of the official model (from the Tesseract GitHub site).
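One way to make this comparison concrete is to score both models on the same held-out list with lstmeval, which reports char and word error rates. The following is only a sketch: the function names, the LSTMEVAL and TRAIN_DIR variables, and the EVAL_LIST path are hypothetical, and the checkpoint name special_checkpoint is assumed from the --model_output prefix used above.

```shell
# Sketch only. EVAL_LIST is a hypothetical held-out list, not from the post.
LSTMEVAL="${LSTMEVAL:-training/lstmeval}"
TRAIN_DIR="${TRAIN_DIR:-$HOME/tesstutorial/trainspecial}"
EVAL_LIST="${EVAL_LIST:-$HOME/tesstutorial/evalspecial/chi_sim.training_files.txt}"

# Score the fine-tuned checkpoint (name assumed: <model_output>_checkpoint).
eval_finetuned() {
  "$LSTMEVAL" \
    --model "$TRAIN_DIR/special_checkpoint" \
    --traineddata "$TRAIN_DIR/chi_sim/chi_sim.traineddata" \
    --eval_listfile "$EVAL_LIST"
}

# Score the official model on the same list for comparison.
eval_official() {
  "$LSTMEVAL" \
    --model tessdata/best/chi_sim.traineddata \
    --eval_listfile "$EVAL_LIST"
}
```

Run `eval_finetuned` and then `eval_official` from the Tesseract source directory. Note that the official model cannot output the added symbols, so it is expected to make errors on any evaluation lines that contain them.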

We would like our trained model to match the recognition accuracy of the official one. How can we further improve its accuracy?

Does anyone know the details of how the official language model was trained, such as the number of iterations, the lowest char error and word error, the learning_rate value, and so on?

If you know any of this, please share some tips.


Thank you.

ShreeDevi Kumar

Sep 19, 2017, 4:49:30 AM
to tesser...@googlegroups.com
As per Ray's comments, for fine-tuning, or for adding or removing a few characters, the number of iterations should be limited to 3000 or so.

It probably won't get to a 0.2% char error, but you might get better results.
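As a rough sketch of that advice, the command from the first post can be rerun with only --max_iterations lowered. The function name finetune_short and the LSTMTRAINING and TRAIN_DIR variables are illustrative, not part of Tesseract; the flags and paths are the ones from the original command, and 3000 is just the approximate figure quoted here, not a tuned value.

```shell
# Sketch only: same flags as the original command, shorter run.
LSTMTRAINING="${LSTMTRAINING:-training/lstmtraining}"
TRAIN_DIR="${TRAIN_DIR:-$HOME/tesstutorial/trainspecial}"

# Usage: finetune_short [max_iterations]   (defaults to 3000)
finetune_short() {
  local max_iterations="${1:-3000}"
  "$LSTMTRAINING" \
    --model_output "$TRAIN_DIR/special" \
    --continue_from "$TRAIN_DIR/chi_sim.lstm" \
    --traineddata "$TRAIN_DIR/chi_sim/chi_sim.traineddata" \
    --old_traineddata tessdata/best/chi_sim.traineddata \
    --train_listfile "$TRAIN_DIR/chi_sim.training_files.txt" \
    --max_iterations "$max_iterations"
}
```

Calling `finetune_short` from the Tesseract source directory runs the short fine-tune; `finetune_short 3600` would try a slightly longer one.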

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a9a25aeb-2182-41d5-9a69-aef34a92eb27%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

roberty...@gmail.com

Sep 19, 2017, 4:58:51 AM
to tesseract-ocr
Does fine-tuning update all the parameters in all of the layers?

We need to add lots of mathematical symbols and some other special symbols. Maybe we should train from scratch instead?

What char error and how many iterations should we expect if we train chi_sim (Simplified Chinese) from scratch?



On Tuesday, September 19, 2017 at 4:49:30 PM UTC+8, shree wrote:

ShreeDevi Kumar

Sep 19, 2017, 5:06:57 AM
to tesser...@googlegroups.com
Ray is the only one who would know those details.


ShreeDevi

roberty...@gmail.com

Sep 19, 2017, 5:08:45 AM
to tesseract-ocr
OK. Thanks for your reply.

On Tuesday, September 19, 2017 at 5:06:57 PM UTC+8, shree wrote:

ShreeDevi Kumar

Sep 19, 2017, 5:09:58 AM
to tesser...@googlegroups.com
If you unpack the traineddata file, the version string usually has the network spec used for building the traineddata.

For chi_sim, I think Ray has also mentioned it in the wiki on the training page.
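A minimal sketch of that unpacking step, assuming combine_tessdata is built under training/ as in the commands earlier in this thread and that the unpacked version component is written as <lang>.version. The function name show_version_string and the COMBINE_TESSDATA variable are hypothetical helpers, not part of Tesseract.

```shell
# Sketch only: unpack a traineddata file and print its version string.
COMBINE_TESSDATA="${COMBINE_TESSDATA:-training/combine_tessdata}"

# Usage: show_version_string <file.traineddata> <output_dir>
show_version_string() {
  local traineddata="$1"
  local out_dir="$2"
  local lang
  lang="$(basename "$traineddata" .traineddata)"
  mkdir -p "$out_dir"
  # -u unpacks every component to files named <prefix><component>.
  "$COMBINE_TESSDATA" -u "$traineddata" "$out_dir/$lang." &&
    cat "$out_dir/$lang.version"
}
```

For example, `show_version_string tessdata/best/chi_sim.traineddata /tmp/chi_sim_unpacked` should print the version string, which can then be checked for a bracketed network spec.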

ShreeDevi