Retrain Tesseract 4 model from real images (not from a text file)

tu tonquang

Oct 19, 2018, 5:56:01 PM
to tesseract-ocr

I get errors when I follow a tutorial to retrain Tesseract with my own image dataset. I am retraining from real images, not from images generated from a text file.

These are my steps to retrain the Tesseract LSTM model:

Step 1: I created my training data (tif image + box file) from my images.
I generated the box files with this command: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

Step 2: I corrected the box files manually with Qt-box-editor.
So now I have these files:
.tif file
.box file
.lstmf file (generated by command: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] lstm.train)
unicharset file

Step 3: I created the .traineddata with this command:
combine_lang_model --input_unicharset unicharset --script_dir langdata --output_dir output --lang "eng"
I downloaded the langdata directory beforehand.

Step 4: I extracted the existing LSTM model from the existing traineddata with this command:
combine_tessdata -e /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata eng.lstm

Step 5: I retrained Tesseract (fine tuning for ± a few characters) with this command:
lstmtraining --model_output output_model --continue_from eng.lstm --traineddata output_basic/eng/eng.traineddata --old_traineddata /usr/share tesseract-ocr/4.00/tessdata/eng.traineddata --train_listfile eng.training_files.txt --debug_interval -1 --max_iterations 400

This is the format of my eng.training_files.txt:

I get the following error:

Screenshot from 2018-10-19 21-49-00.png

This is an example of my training images:

I am trying to retrain Tesseract from real images (not from a text file).

Please let me know if you have any idea how to fix this.

Thank you in advance!

tu tonquang

Oct 19, 2018, 5:59:11 PM
to tesseract-ocr
I want my application to be able to recognize characters like 'Φ'.

On Saturday, October 20, 2018 at 00:56:01 UTC+7, tu tonquang wrote:

Seokbong Choi

Oct 20, 2018, 2:02:26 AM
Can you share the content of the "eng.training_files.txt" file that the --train_listfile argument refers to?

Shree Devi Kumar

Oct 20, 2018, 2:19:28 AM
On Fri, Oct 19, 2018 at 10:02 PM Seokbong Choi wrote:
Can you share the content of the "eng.training_files.txt" file that the --train_listfile argument refers to?

The contents will differ based on the fonts chosen and the output directory. See the following for a sample:


tu tonquang

Oct 20, 2018, 3:16:02 AM
to tesseract-ocr
Thank you.
I did the same thing, but I still get the same error. This is my file:

Screenshot from 2018-10-20 09-53-37.png

This is my terminal output:

Screenshot from 2018-10-20 10-14-07.png

On Saturday, October 20, 2018 at 09:19:28 UTC+7, shree wrote:

Shree Devi Kumar

Oct 20, 2018, 3:36:26 AM
The files need to use Unix EOL.
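
One quick way to test this from the shell is to look for carriage-return bytes in the list file. The following is only a minimal sketch: the helper name `has_crlf` is made up for this illustration, and the file name is the one used earlier in this thread.

```shell
#!/usr/bin/env bash
# has_crlf is a hypothetical helper, not part of Tesseract.
# It returns 0 if the file contains at least one carriage return
# (DOS/Windows EOL) and 1 if it contains none (plain Unix EOL).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Example use; nothing is printed if the file does not exist.
list_file="eng.training_files.txt"
if [ -f "$list_file" ]; then
    if has_crlf "$list_file"; then
        echo "$list_file contains CR characters; convert it to Unix EOL"
    else
        echo "$list_file uses Unix EOL"
    fi
fi
```

If carriage returns are found, `tr -d '\r' < old_file > new_file` is one way to strip them.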

tu tonquang

Oct 20, 2018, 3:57:16 AM
to tesseract-ocr
I'm editing this file on Linux (Ubuntu). I also wrote a shell script to check the file's EOL, and the result is Unix EOL.

Screenshot from 2018-10-20 10-54-18.png

Screenshot from 2018-10-20 10-54-30.png

On Saturday, October 20, 2018 at 10:36:26 UTC+7, shree wrote:

Shree Devi Kumar

Oct 20, 2018, 4:08:11 AM
Maybe it is not finding your ./eng.training_files.txt

Try giving its full path in lstmtraining command.
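
A rough way to rule out path problems is to check that the list file is readable and that every file it names exists. This sketch assumes the list file holds one .lstmf path per line, and that relative paths are meant relative to the directory you run the check from; `check_listfile` is a made-up helper name.

```shell
#!/usr/bin/env bash
# check_listfile is a hypothetical helper, not part of Tesseract.
# Assumption: the list file contains one training file path per line.
# It prints every listed entry that does not exist on disk and returns 1
# if the list file is unreadable or any entry is missing, 0 otherwise.
check_listfile() {
    local list_file="$1" missing=0 entry
    if [ ! -r "$list_file" ]; then
        echo "cannot read list file: $list_file" >&2
        return 1
    fi
    # "|| [ -n ... ]" also handles a last line without a trailing newline.
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -z "$entry" ] && continue      # skip empty lines
        if [ ! -f "$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done < "$list_file"
    return "$missing"
}
```

For example, `check_listfile /full/path/to/eng.training_files.txt` (placeholder path) prints nothing and returns 0 when everything is in place.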

tu tonquang

Oct 20, 2018, 4:34:26 AM
to tesseract-ocr
I have tried the following cases:

1. Removing the --train_listfile argument from the lstmtraining command
2. Changing the argument value to a file that does not exist in the file system, for example: --train_listfile "wrong_file.txt"
3. Giving the full path of the "eng.training_files.txt" file: "/home/tonquangtu/Desktop/tessdata/new/tiff/eng.training_files.txt"

In every case I get the same error as above: "Must supply a list of training filenames! --train_listfile"

On Saturday, October 20, 2018 at 11:08:11 UTC+7, shree wrote:

Lorenzo Bolzani

Oct 20, 2018, 9:06:46 AM

First, check the version of Tesseract, just to be sure; maybe you have more than one installed:

lstmtraining -v

If the training file is missing the error is:

Failed to load list of training filenames from eng.training_files.txt

If the --train_listfile option is missing the error is:

Must supply a list of training filenames! --train_listfile

If an invalid option is passed, like --xtrain_listfile, the error is:

ERROR: Non-existent flag --xtrain_listfile

First, I would retype the whole command line from scratch in a new file, with no copy and paste. I suspect some non-printable characters are messing things up. I think the problem is in the command line, not in the file.

Anyway, also please post the output of the following command:

head /home/tonquangtu/Desktop/tessdata/new/tiff/eng.training_files.txt

(just like that, no changes, to check the path and the content). How did you create the eng.training_files.txt?
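
To look for stray characters in a pasted command, one option is to save the command to a file and search it for bytes outside printable ASCII. This is only a sketch under some assumptions: it relies on GNU grep (for the -a option), and `find_odd_bytes` is a made-up helper name.

```shell
#!/usr/bin/env bash
# find_odd_bytes is a hypothetical helper, not part of Tesseract.
# Assumption: GNU grep is available (the -a option treats the file as text).
# It prints, with line numbers, every line that contains a byte outside
# printable ASCII and whitespace, for example a pasted en dash or a
# non-breaking space. It returns 0 if such a line was found, 1 otherwise.
find_odd_bytes() {
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1"
}
```

For example, after saving the lstmtraining command to a file (say `train_cmd.sh`, a placeholder name), `find_odd_bytes train_cmd.sh` prints the suspicious lines, if any.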



tu tonquang

Oct 25, 2018, 11:58:01 AM
to tesseract-ocr
Hi all,

I solved it. The cause was an error introduced when I copied the command line and pasted it into the terminal.
Thanks to @Lorenzo Blz for giving me the idea that solved it. Thank you very much!

But now I have another problem, and I hope someone can give me an idea of how to handle it.
After fixing all the errors in the training step, I got checkpoint files from the lstmtraining command.

My workflow is:
Step 1: Train the LSTM with this command:
lstmtraining --model_output output_model --continue_from eng.lstm --traineddata output_basic/eng/eng.traineddata --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata --train_listfile eng.training_files.txt --debug_interval -1 --max_iterations 400

--> I got checkpoint files:

Screenshot from 2018-10-25 18-32-57.png

Step 2: Convert the checkpoint file to .traineddata:
lstmtraining --stop_training --continue_from path/to/lastest_checkpoint --traineddata path/to/my_traineddata --model_output path/to/new_traineddata

--> I got a new eng.traineddata

Step 3: I copied the new eng.traineddata into the tesseract-ocr folder:
sudo cp -rf eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata

Step 4: Test with the new traineddata
I tested two cases:
Case 1: With special characters, where I got very good accuracy.


--> The result is '*4' (because I encode 'Φ' as '*')

Case 2: With normal characters, where I got a very bad result (it seems unable to recognize normal characters).

test5.png --> The result is '. . 1'

In both cases I also get the same errors:


If anyone has a suggestion for fixing this, please share it with me. Thank you very much!
P.S. I want my application to be able to recognize both normal and special characters.

On Saturday, October 20, 2018 at 00:56:01 UTC+7, tu tonquang wrote:

Sreehari B S

Oct 27, 2018, 4:29:04 AM
to tesseract-ocr

Something similar happened when I fine-tuned for ':'. When doing OCR, it recognized some ':' as '1', so I fine-tuned for that character.

Now when I OCR ':', it works well. When I OCR some real data, it is now worse than before.

* I trained on the best eng.traineddata.
* I created the boxes with the Tesseract makebox command and edited them with jTessBoxEditor, but the box dimensions were not perfect.
Note: I trained from a real image. (Do I really need to edit the coordinates by hand to adjust the dimensions?)

tu tonquang

Oct 27, 2018, 6:04:31 AM
It's similar to my problem. The model recognizes the special characters (the newly trained data) well, but misrecognizes normal characters and words.

On Sat, Oct 27, 2018 at 11:29, Sreehari B S wrote:

Lorenzo Bolzani

Oct 27, 2018, 10:50:10 AM

Check the unicharset file to see if all the characters you want to recognize are there.

combine_tessdata -u trained_model.traineddata output_dir
cat output_dir/*unicharset

Otherwise you need to merge the old one with the new one before training.

This is how ocrd-train does it (you could try to use it BTW).

combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset merged.unicharset

my.unicharset is the new one, something.lstm-unicharset is the old one, NORM_MODE = 2, and ALL_BOXES is a file listing all the box file names.

And then something like this: combine_tessdata -o continue_from.traineddata merged.unicharset

It's probably the same thing that Qt-box-editor does. I never tried this; I use ocrd, which does things in a slightly different way.

At the very beginning of training, lstmtraining prints a message if the set of characters differs from the previous model.
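
The unicharset check suggested at the start of this message can also be scripted instead of reading the `cat` output by eye. The sketch below is an illustration only: `unicharset_has` is a made-up helper, and it assumes that the first line of a unicharset file is the entry count and that each following line begins with the character itself.

```shell
#!/usr/bin/env bash
# unicharset_has is a hypothetical helper, not part of Tesseract.
# Assumption: line 1 of a unicharset file is the number of entries, and every
# later line starts with the character, followed by a space and its properties.
# Returns 0 if the given character is listed, 1 otherwise.
unicharset_has() {
    # Appending "" forces a string comparison, so digits are compared as text.
    awk -v ch="$1" 'NR > 1 && ($1 "") == (ch "") { found = 1 } END { exit !found }' "$2"
}
```

For example, `unicharset_has 'Φ' path/to/extracted-unicharset` (placeholder path for the file that combine_tessdata -u extracted) returns 0 only if 'Φ' is in the character set.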



