Re-training chi_sim.traineddata from scratch for study purposes

roberty...@gmail.com

22 Aug 2017, 02:47:38
to tesseract-ocr
Hello,

I'm trying to re-train the chi_sim.traineddata model from scratch as a learning exercise.

I am using chi_sim.training_text from https://github.com/tesseract-ocr/langdata/tree/master/chi_sim as the source data, and I train the model with this command:

training/lstmtraining --debug_interval 100 \
--traineddata ~/tesstutorial/trainspecial/chi_sim/chi_sim.traineddata \
--net_spec '[1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c1]' \
--model_output ~/tesstutorial/specialoutput/base --learning_rate 20e-4 \
--train_listfile ~/tesstutorial/trainspecial/chi_sim.training_files.txt \
--eval_listfile ~/tesstutorial/evalspecial/chi_sim.training_files.txt \
--max_iterations 3600 &>~/tesstutorial/specialoutput/basetrain.log


The net_spec is the same as the one in the official model (chi_sim.traineddata from the tessdata GitHub repository).



After converting the training checkpoint with lstmtraining --stop_training, a new chi_sim.traineddata model was generated, which I named chi_sim_new.traineddata.
To tell the two apart, I refer to the official model as chi_sim.traineddata.


Then I extracted all the characters from the two traineddata models.

There are 4384 characters in chi_sim.traineddata, but only 2538 characters in the chi_sim_new.traineddata that I generated.
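
One way to make this comparison reproducible is to diff the character lists of the two models' unicharset files. The sketch below assumes each unicharset has already been unpacked to a plain-text file (for example with `combine_tessdata -u`), that the first line of the file is the entry count, and that every later line starts with the character followed by a space. The function and file names are hypothetical, and this is not necessarily how the counts above were obtained.

```shell
# List the entries of a unicharset file, one per line.
# Assumption: line 1 is the entry count; each later line begins with the
# character itself, followed by a space and its properties.
unicharset_chars() {
  tail -n +2 "$1" | cut -d' ' -f1 | LC_ALL=C sort -u
}

# Print the entries found in the first unicharset but not in the second.
missing_chars() {
  tmp_a=$(mktemp)
  tmp_b=$(mktemp)
  unicharset_chars "$1" > "$tmp_a"
  unicharset_chars "$2" > "$tmp_b"
  # comm -23 keeps only the lines unique to the first (sorted) file.
  LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
  rm -f "$tmp_a" "$tmp_b"
}

# Example (file names are assumptions):
# missing_chars official.lstm-unicharset new.lstm-unicharset | wc -l
```

Special entries such as `NULL` are listed like any other line, so the totals may differ by a few from a count of visible characters.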

Why do the two models contain different character sets? Has chi_sim.training_text not been updated to match?
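
To check whether the training text itself explains the smaller figure, the distinct characters in chi_sim.training_text can be counted directly. This is a rough sketch, assuming a UTF-8 locale is active so that `grep -o .` emits one character (not one byte) per line; the function name and the example path are hypothetical.

```shell
# Count the distinct non-whitespace characters in a text file.
# Assumption: a UTF-8 locale is active; in the C locale multi-byte
# characters may be split into bytes and miscounted.
count_distinct_chars() {
  grep -o . "$1" |                # one character per line
    grep -v '^[[:space:]]*$' |    # drop spaces and tabs
    LC_ALL=C sort -u |            # byte-wise sort keeps distinct characters distinct
    wc -l | tr -d ' '
}

# Example (path is an assumption):
# count_distinct_chars langdata/chi_sim/chi_sim.training_text
```

A count close to 2538 would suggest that the gap comes from the training text rather than from the training run itself.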

ShreeDevi Kumar

22 Aug 2017, 03:22:36
to tesser...@googlegroups.com
The langdata files have not been updated for 4.00alpha.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1111e3f0-588b-456f-90bf-a878f20b1f26%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

roberty...@gmail.com

22 Aug 2017, 03:28:11
to tesseract-ocr
Thanks for your reply.

Do you know where I can find the new langdata files?


ShreeDevi Kumar

22 Aug 2017, 03:54:33
to tesser...@googlegroups.com
The files will be at Google. You will have to wait until Ray Smith updates the repository.

ShreeDevi
