How to improve the quality of Training From Scratch

bruce

Oct 29, 2018, 5:40:19 AM

to tesseract-ocr

Recently I have been using Tesseract to train my own chi_sim model. I want to produce a chi_sim.traineddata that is better than the official one.

I generated training data containing 82915 characters and trained on it with 7 common fonts.

After 4434207 iterations, the training error rate is below 0.016%, but recognition is much worse than with the official traineddata.

So I'm confused.

How can I improve the quality of the training?

Do I need more training data, or more training fonts? What is the right amount?

I would like to know what training data and which fonts were used for the official traineddata.
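One way to put a number on "much worse" is to run both models on the same held-out page and compare character error rates against a hand-checked transcription. The following is only a minimal sketch, assuming the OCR output and the ground truth are already available as plain strings; the function names and sample strings are illustrative and not part of Tesseract.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # character missing from the OCR output
                cur[j - 1] + 1,          # extra character in the OCR output
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = cur
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative strings only, not real OCR output.
truth = "你好世界"
print(char_error_rate(truth, "你好世畀"))  # one substituted character
print(char_error_rate(truth, "你好"))      # two missing characters
```

Measuring on pages that were not rendered from the training text matters here, because a very low training error combined with poor real-world results can simply mean the model has memorized its own training lines.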

Shree Devi Kumar

Oct 29, 2018, 2:41:27 PM

to tesser...@googlegroups.com

Please look at the langdata_lstm repo, specifically the chi_sim folder. It has the training_text as well as the list of fonts used for LSTM training.

Shree Devi Kumar

Oct 29, 2018, 2:43:05 PM

to tesser...@googlegroups.com

https://github.com/tesseract-ocr/langdata_lstm/tree/master/chi_sim
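For anyone comparing their own training text against the one in that folder, a character inventory is a quick first check. This is a rough sketch, assuming both texts have been saved locally as UTF-8; the file paths in the comments are hypothetical and the helper names are made up for illustration.

```python
from collections import Counter


def char_inventory(text: str) -> Counter:
    """Count every non-whitespace character in a training text."""
    return Counter(ch for ch in text if not ch.isspace())


def compare_inventories(mine: str, reference: str) -> dict:
    """Report which distinct characters each text has that the other lacks."""
    mine_set = set(char_inventory(mine))
    ref_set = set(char_inventory(reference))
    return {
        "only_in_mine": sorted(mine_set - ref_set),
        "only_in_reference": sorted(ref_set - mine_set),
        "shared": len(mine_set & ref_set),
    }


# With real files (hypothetical paths, adjust as needed):
#   with open("my.training_text", encoding="utf-8") as f:
#       mine = f.read()
#   with open("reference.training_text", encoding="utf-8") as f:
#       reference = f.read()
mine = "你好 abc"
reference = "你好世界 ab1"
print(compare_inventories(mine, reference))
```

The per-character counts from char_inventory are also worth a look, since a character that appears only a handful of times in the text gives the network very little to learn from.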

bruce

Oct 30, 2018, 2:32:43 AM

to tesseract-ocr

Thank you for your reply, Shree.

I've looked at the training_text and the list of fonts.

I will try again.

Before I start my next training run from scratch, I'd like to ask a few questions:

1. Does a training_text containing more characters give better training results? Is there an upper limit?

2. Does using more fonts give better training results?

3. I noticed that the official text contains not only Chinese characters but also English letters and numbers.

Suppose I run a command like this: tesseract.exe test.png c:\dir\test -l eng+chi_sim

In that case, is it better for me to train on a training_text containing only Chinese characters?
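To see how much of a given training_text is actually non-Chinese, the characters can be tallied by rough category. This sketch does not answer whether a Chinese-only text trains better; it only measures the mix. It assumes the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is a good enough test for "Chinese character", and the function name is illustrative.

```python
def script_mix(text: str) -> dict:
    """Tally non-whitespace characters into rough script categories."""
    counts = {"cjk": 0, "latin": 0, "digit": 0, "other": 0}
    for ch in text:
        if ch.isspace():
            continue
        if "\u4e00" <= ch <= "\u9fff":
            # Assumption: only the basic CJK Unified Ideographs block counts.
            counts["cjk"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif ch.isascii() and ch.isdigit():
            counts["digit"] += 1
        else:
            # Punctuation, full-width forms, rarer ideographs, and so on.
            counts["other"] += 1
    return counts


# Illustrative sample, not taken from the official training text.
print(script_mix("中文 abc 123，"))
```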

On Tuesday, October 30, 2018 at 2:43:05 AM UTC+8, shree wrote:

Shree Devi Kumar

Oct 30, 2018, 10:58:39 AM

to tesser...@googlegroups.com

Please read the wiki page regarding training 4.0 and the presentation files in docs by Ray Smith.
