I tried to train a traineddata file myself, but encountered an [Error]

鹿青年

Dec 6, 2024, 1:11:21 AM

to tesseract-ocr

Hello, I tried to train a traineddata file myself, but an [Error] occurred during use. Could you please give me some guidance on how to resolve this error? Thank you very much.
Perform OCR
···

tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
···
The error content is:
····
Error: Tesseract (legacy) engine requested, but components are not present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
Failed loading language 'my_chi_sim'
Tesseract couldn't load any languages!
Could not initialize tesseract.
····

My training steps are as follows:

Punctuation Dictionary:
dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg d:\tesseract\tessdata_best\punc.txt

Let's start with the key steps.
1. Generate the character set box file from the txt file

text2image --text d:\tesseract\chi_sim.txt --outputbase d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei" --fontconfig_tmpdir d:\tesseract\tmp

2. Generate character set lstm-unicharset file

1) Generate with box file
unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box

2) Generate with txt file
unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt

3. Generate starter traineddata file
1. Generate dictionary text file
Base these on the three dictionary files in the d:\tesseract\tessdata_best folder (word: words, number: numbers, punc: punctuation marks).
2. Generate starter traineddata file
combine_lang_model --input_unicharset d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]" --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs d:\tesseract\punc.txt --pass_through_recoder

3. View the contents of the newly generated starter traineddata file
combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata

4. Generate training files
1. Generate the training text file train.txt

2. Generate image and box files

text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
3. Generate training files:
tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6 lstm.train

4. Create a new training list file
Create a new d:\tesseract\train_listfile.txt file with the content d:\tesseract\train.lstmf
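
The list file in step 4 is a plain text file with one .lstmf path per line, which is what lstmtraining reads through --train_listfile. A minimal Python sketch for writing it, assuming the paths from the steps above; `write_listfile` is a hypothetical helper, not part of Tesseract, and the demo writes into a temporary directory rather than d:\tesseract:

```python
import tempfile
from pathlib import Path


def write_listfile(listfile, lstmf_paths):
    """Write one .lstmf path per line (hypothetical helper, not a Tesseract tool)."""
    listfile = Path(listfile)
    text = "\n".join(str(p) for p in lstmf_paths) + "\n"
    listfile.write_text(text, encoding="utf-8")
    return listfile


# Demo in a temporary directory so nothing under d:\tesseract is touched.
with tempfile.TemporaryDirectory() as tmp:
    out = write_listfile(Path(tmp) / "train_listfile.txt",
                         [r"d:\tesseract\train.lstmf"])
    written = out.read_text(encoding="utf-8")

print(written, end="")
```

The same helper would cover the evaluation list file later on by passing the eval.lstmf path instead.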
5. Training

2. Start training:
lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]" --model_output d:\tesseract\output\output --train_listfile d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01 --debug_interval -1

6. Evaluate the generated checkpoint file
1. Generate evaluation text eval.txt
Write some evaluation text and save it to d:\tesseract\eval.txt. It should cover the character set as comprehensively as possible and be reasonably complex.
2. Generate image and box files
text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
3. Generate evaluation lstmf file
tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6 lstm.train
4. Generate evaluation list file
Create a new d:\tesseract\eval_listfile.txt file with the content d:\tesseract\eval.lstmf
5. Start evaluating

lstmeval --model d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile d:\tesseract\eval_listfile.txt
7. Generate standard traineddata
1. Generate a floating point (decimal) traineddata file (similar to tessdata_best)
lstmtraining --stop_training --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
2. Generate an integer traineddata file (similar to tessdata_fast)
lstmtraining --stop_training --convert_to_int --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata

3. View the generated traineddata information
combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
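
Note that steps 1 and 2 above pass the same --model_output path, so running both in sequence leaves only the integer model on disk. A minimal Python sketch that builds the two commands with separate output names; it uses only the flags already shown above, and both `export_command` and the `chi_sim_float` / `chi_sim_int` file names are hypothetical, chosen purely for illustration:

```python
def export_command(checkpoint, starter, output, convert_to_int=False):
    """Build an lstmtraining --stop_training argv list (hypothetical helper)."""
    cmd = ["lstmtraining", "--stop_training"]
    if convert_to_int:
        cmd.append("--convert_to_int")
    cmd += ["--continue_from", checkpoint,
            "--traineddata", starter,
            "--model_output", output]
    return cmd


CHECKPOINT = r"d:\tesseract\output\output_checkpoint"
STARTER = r"d:\tesseract\chi_sim\chi_sim.traineddata"

# Hypothetical file names, chosen only so the two exports do not collide.
float_cmd = export_command(
    CHECKPOINT, STARTER, r"d:\tesseract\output\chi_sim_float.traineddata")
int_cmd = export_command(
    CHECKPOINT, STARTER, r"d:\tesseract\output\chi_sim_int.traineddata",
    convert_to_int=True)

for cmd in (float_cmd, int_cmd):
    print(" ".join(cmd))
```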

Zdenko Podobny

Dec 6, 2024, 2:07:14 AM

to tesser...@googlegroups.com

Error: Tesseract (legacy) engine requested, but components are not present in /usr/local/share/tessdata/my_chi_sim.traineddata!!

The message is clear: YOU explicitly asked Tesseract to use the legacy engine (--oem 2 in your command), but YOUR language data file (the one you created by training) does not contain a legacy model.
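
For reference, `tesseract --help-oem` lists four engine modes: 0 (legacy engine only), 1 (neural nets LSTM engine only), 2 (legacy + LSTM engines), and 3 (default, based on what is available). A file produced by LSTM training alone should load with 1 or 3, but not with 0 or 2. A minimal Python sketch that rebuilds the OCR command from the first post with an LSTM-only mode; `build_ocr_command` is a hypothetical helper, not part of Tesseract:

```python
# OEM values as listed by `tesseract --help-oem`.
OEM_MODES = {
    0: "Legacy engine only",
    1: "Neural nets LSTM engine only",
    2: "Legacy + LSTM engines",
    3: "Default, based on what is available",
}


def build_ocr_command(image, lang, psm=6, oem=1):
    """Return a tesseract argv list (hypothetical helper, not a Tesseract API)."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown --oem value: {oem}")
    return ["tesseract", image, "stdout",
            "-l", lang, "--psm", str(psm), "--oem", str(oem)]


cmd = build_ocr_command("0791.tif", "my_chi_sim")
print(" ".join(cmd))
# prints: tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 1
```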

Zdenko

On Fri, Dec 6, 2024 at 7:11, 鹿青年 <luqingn...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com.

محمود محمد‎

Dec 6, 2024, 2:15:40 AM

to tesser...@googlegroups.com

I think jTessBoxEditor is a good tool for the training process.

鹿青年

Dec 6, 2024, 2:32:57 AM

to tesseract-ocr

Thank you for your reply. How should I proceed to merge the old engine into my trained model?
Or, are there any parameters that can specify that the OCR operation should not use the old engine?

鹿青年

Dec 6, 2024, 2:37:16 AM

to tesseract-ocr

Thanks for the reply.
Yes, I also use jTessBoxEditor, but it is more of a data normalization tool. Some of the font files are incomplete, so I want to use LSTM training to build a complete traineddata file of my own.

محمود محمد‎

Dec 6, 2024, 2:44:46 AM

to tesser...@googlegroups.com

You can collect images, put them in a folder, and then use jTessBoxEditor to create a training file for your model from that collection. To start, select the language, then create the box files by specifying the path of the images and clicking the Create box files box. After that you can start training, creating, and improving your model.

محمود محمد‎

Dec 6, 2024, 2:47:10 AM

to tesser...@googlegroups.com

After finishing each image, click Save, then give the program the path to the box files and select the option to create a training file using external boxes. The training process then runs automatically and your .traineddata file is created. Name it and add it to the Tesseract-OCR data, in the tessdata folder.

Zdenko Podobny

Dec 6, 2024, 5:55:23 AM

to tesser...@googlegroups.com

Your question indicates you have no clue what you are doing with tesseract and also with training.

First you need to invest time to learn tesseract and read documentation.

Zdenko

On Fri, Dec 6, 2024 at 8:17, 鹿青年 <luqingn...@gmail.com> wrote:

Thank you for your reply. How should I proceed to merge the old engine into my trained model?
Or, are there any parameters that can specify that the OCR operation should not use the old engine?

On Friday, December 6, 2024 at 15:07:14 UTC+8, <zdenop> wrote:

محمود محمد‎

Dec 11, 2024, 8:21:25 AM

to tesser...@googlegroups.com

Hello, I would like to work with you on creating a simple traineddata file for Tesseract using jTessBoxEditor and then test it. Could you let me know a time to discuss the steps? Thanks.
