Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)


tu tonquang

Oct 19, 2018, 5:56:01 PM
to tesseract-ocr
Hi,

I get an error when I follow this tutorial to retrain Tesseract with my own image dataset (I retrain from real images, not from text files via tesstrain.sh):
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata

These are my steps to retrain the Tesseract LSTM model:


Step 1: I create my training data (tif image + box file) from my images.
I generated them with this command: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox


Step 2: I edit the box files manually with Qt-box-editor (following this link: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files).
So now I have these files:
.tif file
.box file
.lstmf file (generated by command: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] lstm.train)
unicharset file


Step 3: I create the starter .traineddata with this command:
combine_lang_model --input_unicharset unicharset --script_dir langdata --output_dir output --lang "eng"
I downloaded the langdata directory from here: https://github.com/tesseract-ocr/langdata


Step 4: I extract the existing LSTM model from the existing traineddata with this command:
combine_tessdata -e /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata eng.lstm


Step 5: I retrain Tesseract (Fine Tuning for ± a few characters: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters) with this command:
lstmtraining --model_output output_model --continue_from eng.lstm --traineddata output_basic/eng/eng.traineddata --old_traineddata /usr/share tesseract-ocr/4.00/tessdata/eng.traineddata --train_listfile eng.training_files.txt --debug_interval -1 --max_iterations 400

  • This is the format of my eng.training_files.txt:
    path/to/lstmf
I get an error like the following:

Screenshot from 2018-10-19 21-49-00.png

Here is an example of my training images:
eng.centurygothic.exp0.png





I am trying to retrain Tesseract from real images (not from text files via tesstrain.sh).

Please share any ideas you have on how to fix this.


Thank you in advance!



tu tonquang

Oct 19, 2018, 5:59:11 PM
to tesseract-ocr
I want my application to be able to recognize characters like 'Φ'.

On Saturday, October 20, 2018 at 00:56:01 UTC+7, tu tonquang wrote:

Seokbong Choi

Oct 20, 2018, 2:02:26 AM
to tesser...@googlegroups.com
Can you share the content of the "eng.training_files.txt" file that the --train_listfile argument refers to?
Thanks.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d08df2e0-ccc3-49bc-90ab-6588f9ab6ef3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Oct 20, 2018, 2:19:28 AM
to tesser...@googlegroups.com
On Fri, Oct 19, 2018 at 10:02 PM Seokbong Choi <zodia...@gmail.com> wrote:
Can you share the content of the "eng.training_files.txt" file that the --train_listfile argument refers to?
Thanks.


The contents will differ based on the fonts chosen and the output directory. See the following for a sample:

/home/ubuntu/tesstutorial/digits/eng.Arial_Bold.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.Arial_Bold_Italic.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.Arial.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.Arial_Italic.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.Courier_New_Bold.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.Courier_New_Bold_Italic.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.Courier_New.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.Courier_New_Italic.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.FreeMono.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.FreeSans.exp0.lstmf
/home/ubuntu/tesstutorial/digits/eng.FreeSerif.exp0.lstmf
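
When training from real images, a list file like the sample above can be generated instead of typed by hand. The following is only a minimal sketch, assuming all .lstmf files sit under one directory; the function name make_listfile and the paths in the example are made up for illustration.

```shell
# Hypothetical helper: write the absolute path of every .lstmf file found
# under directory $1 into list file $2, one path per line (Unix line endings).
make_listfile() {
    dir=$(cd "$1" && pwd) || return 1   # resolve $1 to an absolute path
    find "$dir" -type f -name '*.lstmf' | LC_ALL=C sort > "$2"
}

# Example (paths are placeholders):
# make_listfile /home/ubuntu/tesstutorial/digits eng.training_files.txt
```

The resulting file can then be passed to lstmtraining through --train_listfile.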

tu tonquang

Oct 20, 2018, 3:16:02 AM
to tesseract-ocr
Thank you.
I did the same thing, but I still get the same error. This is my file:

Screenshot from 2018-10-20 09-53-37.png




This is my terminal:

Screenshot from 2018-10-20 10-14-07.png




On Saturday, October 20, 2018 at 09:19:28 UTC+7, shree wrote:

Shree Devi Kumar

Oct 20, 2018, 3:36:26 AM
to tesser...@googlegroups.com
The files need to use Unix EOL.
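
One way to check this from a shell, as a rough sketch: look for carriage-return characters in the list file and strip them if any are present. The function names has_crlf and strip_cr are invented for this example.

```shell
# Succeed (exit 0) if file $1 contains a carriage return, i.e. DOS/Windows EOL.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Rewrite file $1 in place with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example:
# has_crlf eng.training_files.txt && strip_cr eng.training_files.txt
```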


tu tonquang

Oct 20, 2018, 3:57:16 AM
to tesseract-ocr
I'm using a Linux (Ubuntu) system to edit this file. I also wrote this shell script to check the file's EOL, and the result is Unix EOL:

Screenshot from 2018-10-20 10-54-18.png




Screenshot from 2018-10-20 10-54-30.png




On Saturday, October 20, 2018 at 10:36:26 UTC+7, shree wrote:

Shree Devi Kumar

Oct 20, 2018, 4:08:11 AM
to tesser...@googlegroups.com
Maybe it is not finding your ./eng.training_files.txt


Try giving its full path in lstmtraining command.

tu tonquang

Oct 20, 2018, 4:34:26 AM
to tesseract-ocr
I have tried several cases:

1. Removing the --train_listfile argument from the lstmtraining command
2. Changing the argument value, for example --train_listfile "wrong_file.txt" (wrong_file.txt does not exist in the file system)
3. Giving the full path of the "eng.training_files.txt" file as "/home/tonquangtu/Desktop/tessdata/new/tiff/eng.training_files.txt"

But in every case I get the same error as above: "Must supply a list of training filenames! --train_listfile"

On Saturday, October 20, 2018 at 11:08:11 UTC+7, shree wrote:

Lorenzo Bolzani

Oct 20, 2018, 9:06:46 AM
to tesser...@googlegroups.com

First check the version of tesseract, just to be sure, maybe you have more than one around:

lstmtraining -v

If the training file is missing the error is:

Failed to load list of training filenames from eng.training_files.txt

If the --train_listfile option is missing the error is:

Must supply a list of training filenames! --train_listfile

If an invalid option is passed, like --xtrain_listfile, the error is:

ERROR: Non-existent flag --xtrain_listfile

Next, I would try retyping the whole command line from scratch in a new file, with no copy and paste. I suspect some non-printable characters are messing things up. I think the problem is in the command line, not in the file.

Anyway, also please post the output of the following command:

head /home/tonquangtu/Desktop/tessdata/new/tiff/eng.training_files.txt

(just like that, no changes, to check the path and the content). How did you create the eng.training_files.txt?
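
The suspicion about stray characters in the command line can also be checked mechanically. This is a sketch only: it assumes the command has been saved to a file, and the name find_odd_bytes is made up. It flags any byte outside printable ASCII, for example a typographic dash that a web page substituted for a plain hyphen; the thread does not say which character was actually at fault.

```shell
# Hypothetical helper: print, with line numbers, every line of file $1 that
# contains a byte outside printable ASCII (space through ~). Tabs are flagged too.
# Exit status is 0 if at least one such line was found, 1 if none.
find_odd_bytes() {
    LC_ALL=C grep -n '[^ -~]' "$1"
}

# Example (train_cmd.sh is a placeholder for the file holding the command):
# find_odd_bytes train_cmd.sh && echo "retype the flagged lines by hand"
```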


Bye

Lorenzo





tu tonquang

Oct 25, 2018, 11:58:01 AM
to tesseract-ocr
Hi all,

I solved it. The cause was an error introduced when copying the command line and pasting it into the terminal.
Thanks to @Lorenzo Blz for giving me the idea that solved it. Thank you very much!

But now I have another problem, and I hope someone can suggest how to handle it.
After fixing all the errors related to the training step, I got checkpoint files from running the lstmtraining command.

My workflow is:
Step 1: train the LSTM with this command:
lstmtraining --model_output output_model --continue_from eng.lstm --traineddata output_basic/eng/eng.traineddata --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata --train_listfile eng.training_files.txt --debug_interval -1 --max_iterations 400

--> I got checkpoint files:

Screenshot from 2018-10-25 18-32-57.png




Step 2: convert the checkpoint file --> .traineddata
lstmtraining --stop_training --continue_from path/to/latest_checkpoint --traineddata path/to/my_traineddata --model_output path/to/new_traineddata

--> I got a new eng.traineddata

Step 3: I copy the new eng.traineddata into the tesseract-ocr folder.
sudo cp -rf eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata

Step 4: Test with the new traineddata
I tested 2 cases:
Case 1: With special characters, I got very good accuracy.

test2.png

--> The result is '*4' (because I encoded 'Φ' as '*')



Case 2: With normal characters, I got a very bad result (it seems it cannot recognize normal characters)


test5.png --> The result is '. . 1'


In both cases I also get the same errors:

a.png


If anyone has a suggestion for fixing this, please share it with me. Thank you very much!
P.S. I want my application to be able to recognize both normal and special characters.
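
Step 3 above overwrites the stock eng.traineddata, which makes it hard to compare the old and new models on the same image. One cautious alternative, sketched here under assumptions: keep the new model in a private directory under a different language name and point tesseract at it with --tessdata-dir. The function name install_model_as, the directory, and the name engphi are invented for this example.

```shell
# Hypothetical helper: copy a newly built model ($1) into a private tessdata
# directory ($2) under a distinct language name ($3), leaving the stock
# eng.traineddata under /usr/share untouched.
install_model_as() {
    mkdir -p "$2" && cp "$1" "$2/$3.traineddata"
}

# Example (directory, language name and output names are placeholders):
# install_model_as eng.traineddata "$HOME/tessdata_test" engphi
# tesseract test5.png out_new --tessdata-dir "$HOME/tessdata_test" -l engphi
# tesseract test5.png out_stock -l eng    # stock model, for comparison
```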




On Saturday, October 20, 2018 at 00:56:01 UTC+7, tu tonquang wrote:

Sreehari B S

Oct 27, 2018, 4:29:04 AM
to tesseract-ocr
Hi,

Something similar happened when I fine-tuned for ':'. When doing OCR, it recognized some ':' as '1', so I fine-tuned for that character.

Now when I OCR ':', it works well. When I OCR some real data, it is now worse than before.

* I trained on the best eng.traineddata
* I created boxes using the tesseract makebox command and edited them using jTessBoxEditor, but the box dimensions were not perfect.
Note: I trained from a real image. (Do I really need to edit the coordinates by hand to adjust the dimensions?)

tu tonquang

Oct 27, 2018, 6:04:31 AM
to tesser...@googlegroups.com
It's similar to my problem. It recognizes special characters (the newly trained data) well but misrecognizes normal characters and words.

On Sat, Oct 27, 2018 at 11:29, Sreehari B S <sreeha...@gmail.com> wrote:

Lorenzo Bolzani

Oct 27, 2018, 10:50:10 AM
to tesser...@googlegroups.com

Check the unicharset file to see if all the characters you want to recognize are there.

combine_tessdata -u trained_model.traineddata output_dir
cat output_dir/*unicharset


Otherwise you need to merge the old one with the new one before training.

This is how ocrd-train does it (you could try to use it BTW).

combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset merged.unicharset

my.unicharset is the new one, something.lstm-unicharset is the old one, NORM_MODE = 2, ALL_BOXES is a file with all the box files names.

And then something like this: combine_tessdata -o continue_from.traineddata merged.unicharset

It's probably the same thing that Qt-box-editor does. I have never tried this; I use ocrd-train, which does things in a slightly different way.

At the very beginning of training, lstmtraining prints a message if the set of characters differs from that of the previous model.
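
The unicharset check suggested at the top of this message can be scripted rather than done by eye. A minimal sketch, assuming the usual unicharset layout (first line is the entry count, every later line starts with the character followed by a space); the function name has_unichar and the file path in the example are placeholders.

```shell
# Hypothetical helper: succeed (exit 0) if character $2 is the first field of
# some entry in unicharset file $1. Line 1 holds the entry count and is skipped.
# Note: a backslash in $2 would need escaping because of awk's -v handling.
has_unichar() {
    awk -v c="$2" 'NR > 1 && ($1 "") == (c "") { found = 1; exit }
                   END { exit !found }' "$1"
}

# Example (path/to/unicharset is a placeholder for the extracted file):
# for ch in a 1 '*'; do
#     has_unichar path/to/unicharset "$ch" || echo "missing: $ch"
# done
```

If ordinary letters and digits are reported missing after fine-tuning, that would be consistent with the symptom described earlier in the thread, where only the newly trained characters were recognized.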



Bye

Lorenzo
