How to check the .unicharset in a .traineddata file

Jingjing Lin

Jun 17, 2019, 10:50:24 AM
to tesseract-ocr
I tried to fine-tune the model and add a new character via training, but the newly generated traineddata still can't recognize this character. To debug, I want to check whether the new character is in the .unicharset inside the generated traineddata. Is there a way to do this?

shree

Jun 17, 2019, 11:31:24 AM
to tesseract-ocr
combine_tessdata -u new.traineddata new.

will unpack the traineddata file. Check the extracted new.lstm-unicharset.
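A minimal sketch of that check as a shell helper, assuming the extracted unicharset uses the usual plain-text layout (the first line is the entry count, and every later line starts with the character itself). The function name unicharset_has_char and the file names are hypothetical, not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper: succeed if CHAR is an entry in the unicharset FILE.
# Assumes line 1 holds the entry count and each later line begins with
# the character, followed by a space and its properties.
unicharset_has_char() {
  file=$1
  char=$2
  awk -v c="$char" 'NR > 1 && $1 == c { found = 1 } END { exit !found }' "$file"
}

# Typical use after unpacking (file names are placeholders):
#   combine_tessdata -u new.traineddata new.
#   unicharset_has_char new.lstm-unicharset '±' && echo "± is present"
```

The count line is skipped on purpose so that a digit entry is not confused with the entry count.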

Jingjing Lin

Jun 17, 2019, 1:16:56 PM
to tesseract-ocr
Thanks. It works. The new character I added was there.

Do you have any idea why Tesseract still can't recognize the new character after fine-tuning? Adding '±' to eng works, but adding '±' to chi_sim does not (details below). Is there anything we need to pay attention to when fine-tuning languages other than eng?

I used
lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt 2>&1 |
  grep ±
to check, and ± shows up only in the Truth lines, never in the OCR lines.
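To make the Truth/OCR comparison less dependent on reading raw grep output, one option is to count the matching lines on each side separately. This is a rough sketch, assuming the evaluation output labels its lines with the prefixes "Truth" and "OCR" as described above; the function name count_char_by_side is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read evaluation output on stdin and report how many
# "Truth" lines and how many "OCR" lines contain the given character.
# Assumes each line of interest starts with the label "Truth" or "OCR".
count_char_by_side() {
  awk -v c="$1" '
    index($0, c) == 0 { next }   # skip lines without the character
    /^Truth/ { truth++ }
    /^OCR/   { ocr++ }
    END { printf "truth=%d ocr=%d\n", truth, ocr }
  '
}

# Typical use (arguments as in the command above):
#   lstmeval ... 2>&1 | count_char_by_side '±'
```

A result with a non-zero truth count and an ocr count of zero matches the situation described here: the character is in the evaluation data but is never produced by the model.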

Shree Devi Kumar

Jun 17, 2019, 2:16:31 PM
to tesser...@googlegroups.com
How big was your training text? How many iterations? Did the fonts you used for training support the plus-minus sign?

You can run training with --debug_interval -1 so that the console messages show whether the plus-minus sign is being picked for training.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d251e677-5f9d-4f8f-b41a-aa015538ca47%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jingjing Lin

Jun 17, 2019, 4:50:13 PM
to tesseract-ocr
The training text was only about 2200 lines (200 kB), and I ran 3600 iterations. The fonts I used support ±.

What do you mean by 'whether ± is being picked for training'? When I set --debug_interval -1, each iteration prints output for only one line. Does that mean only one line is used for training in each iteration?

Jingjing Lin

Jun 17, 2019, 4:54:56 PM
to tesseract-ocr
I was using only two fonts, and the lowest error rate reached during training was 11.271. Does this mean I really need to increase the iterations?

Jingjing Lin

Jun 17, 2019, 5:43:14 PM
to tesseract-ocr
When I checked with --debug_interval -1, I found that although ± is in the GROUND TRUTH, it always shows up as + or something else, never as ±, in the BEST OCR TEXT. What can I do in this situation?

Shree Devi Kumar

Jun 18, 2019, 12:25:18 AM
to tesser...@googlegroups.com
Yes, each iteration is one line.

For eng, the langdata training text is about 80 lines and you add 15 symbols for plus minus. With 30 fonts, you will have about 2400 lines. So in 3600 iterations, all samples will be seen and trained.

For chi_sim with larger training text it will be different.
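The arithmetic above can be sketched as a quick estimate. It assumes, as a simplification, one training sample per text line per font and one sample consumed per iteration; coverage_percent is a hypothetical helper, and the 2200-line, two-font figures are the ones reported earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: estimate what percentage of the training samples can
# be visited at least once, assuming samples = lines * fonts and one sample
# is consumed per iteration (a simplification).
coverage_percent() {
  lines=$1
  fonts=$2
  iterations=$3
  samples=$((lines * fonts))
  echo $((iterations * 100 / samples))
}

coverage_percent 80 30 3600    # eng example: about 2400 samples, prints 150
coverage_percent 2200 2 3600   # chi_sim run in this thread: about 4400 samples, prints 81
```

A value above 100 suggests every sample can be seen at least once; a value below 100 suggests that some lines, possibly the ones containing ±, may never be trained on.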

See https://github.com/Shreeshrii/tess4training for details of the training tutorial.

Shree Devi Kumar

Jun 18, 2019, 12:28:25 AM
to tesser...@googlegroups.com
If you increase the iterations, the plus type of training will not give good results, i.e. the other letters will lose accuracy.

You can try to reduce the training text size while still keeping all the characters that you need as part of the training text.

Jingjing Lin

Jun 18, 2019, 9:00:32 AM
to tesseract-ocr
Thanks for your advice. I'll try reducing the training text size.

Jingjing Lin

Jun 18, 2019, 10:09:18 AM
to tesseract-ocr
Sorry to bother you again.
I reduced the training text to about 450 lines, with about 30 ± in it. I used two fonts and 1000 iterations. But ± is still not picked up in the BEST OCR TEXT at all; it is always recognized as something else. What is happening here? Should I increase the number of ±, or the number of fonts? I'm now trying more iterations.

Shree Devi Kumar

Jun 18, 2019, 10:32:37 AM
to tesser...@googlegroups.com
Increase the number of ± to about 100.

Jingjing Lin

Jun 18, 2019, 2:17:45 PM
to tesseract-ocr
It still didn't work after I increased the number of ± to about 100, and the error rate after 2000 iterations is about 11. This is a pretty high error rate compared to what we get when adding a few characters to eng. With such a high error rate, I'm not surprised that it can't recognize special characters like ±. Is this the limit for chi_sim, or can I increase the iterations to bring the error rate down?
Thanks for your help.

Shree Devi Kumar

Jun 18, 2019, 2:21:18 PM
to tesser...@googlegroups.com
I will test tomorrow and let you know.

Jingjing Lin

Jun 18, 2019, 2:26:22 PM
to tesseract-ocr
Thanks a lot!

Jingjing Lin

Jun 18, 2019, 3:06:34 PM
to tesseract-ocr
Can you please test on arrows ( or ) instead of ± if it's not inconvenient for you?

Shree Devi Kumar

Jun 19, 2019, 11:44:22 AM
to tesser...@googlegroups.com
Update:

1. When a smaller training_text is used for chi_sim plus training, the unicharset gets restricted. So, merge the lstm-unicharset with it.

2. The unicharset for chi_sim built from langdata is different from the one extracted from tessdata_best, so using the training_text from langdata will add more characters.

3. The fonts used for LSTM training are listed in okfonts.txt in langdata_lstm. For plus training the same fonts should be used; otherwise it will require training new typefaces.

4. Another user was trying to fine-tune chi_sim (check old forum posts) to add the theta sign. If I remember correctly, the plus type of training did not work for it. Replacing the top layer was the better option.

5. I am training with the following fonts.
"Adobe Heiti Std" \
"Adobe Kaiti Std" \
"Arial Unicode MS" \
"Bitstream CyberCJK" \
"Microsoft YaHei UI" \
"Microsoft YaHei" \
"NSimSun" \
"Noto Sans CJK SC" \
"Noto Sans Mono CJK SC" \
"STXihei" \
"SimSun" \
"WenQuanYi Zen Hei Medium" \
"WenQuanYi Zen Hei Mono Medium" \
"WenQuanYi Zen Hei Sharp Medium" \

At iteration 1046/1100/1100, Mean rms=0.704%, delta=1.445%, char train=4.888%, word train=46.842%, skip ratio=0%,  New best char error = 4.888 wrote best model:/home/ubuntu/tesstutorial/chi_sim_plus/chi_sim_plus4.888_1046.checkpoint wrote checkpoint.
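For point 1, it can help to see which characters the restricted unicharset is missing before merging. A minimal sketch, assuming both files use the plain-text unicharset layout (entry count on line 1, then one entry per line starting with the character); unicharset_missing and the file names are hypothetical. The merge itself would then be done with the training tools, for example merge_unicharsets, whose exact arguments should be checked against the installed version.

```shell
#!/bin/sh
# Hypothetical helper: print entries that appear in the FULL unicharset but
# not in the SMALL one. Assumes line 1 of each file is the entry count and
# every later line starts with the character.
unicharset_missing() {
  full=$1
  small=$2
  awk '
    FNR == 1 { next }                     # skip the count line of each file
    NR == FNR { have[$1] = 1; next }      # first file (small): remember entries
    !($1 in have) { print $1 }            # second file (full): report the rest
  ' "$small" "$full"
}

# Typical use (file names are placeholders):
#   unicharset_missing chi_sim.lstm-unicharset my_small.unicharset
```

An empty result would mean the smaller unicharset already covers everything in the original one.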



Shree Devi Kumar

Jun 19, 2019, 11:46:51 AM
to tesser...@googlegroups.com

Jingjing Lin

Jun 19, 2019, 1:36:09 PM
to tesseract-ocr
Thanks for your comments. 

So do you mean that the method used to add a special character to eng cannot be used to add one to chi_sim, and that we have to retrain the top layer instead?

Another question: when we use a smaller .training_text, the .unicharset contains only a limited number of characters. For Chinese, this unicharset is much smaller than the one in langdata_lstm (GitHub). How do we combine the original .traineddata with the .traineddata generated via fine-tuning? I tried the command below, but it doesn't seem to do that:

lstmtraining --stop_training \
  --continue_from ~/tesstutorial/eng_from_chi/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --model_output ~/tesstutorial/eng_from_chi/eng.traineddata

