Making custom traineddata


kaminski....@gmail.com

Sep 5, 2018, 9:37:46 AM
to tesseract-ocr
Hi,

(I might butcher English grammar, you have been warned!)

For some time I've been trying to teach Tesseract to read MRZ codes. Unfortunately, it's not going very well. I'm using the latest version of Tesseract (4.0), so I'm trying to train it with the LSTM method. I've managed to pull it off and got some custom traineddata samples, but the results of using them are... let's say slightly unsatisfying. As a matter of fact, they are not even remotely close to the eng traineddata. I know that there was an MRZ traineddata for the previous version of Tesseract.

I'm out of ideas on how to improve accuracy, so I'll need your help, guys.

At first I thought I could use images, .txt files containing the already-read data, and font data to somehow make box files (basically you have an image and a .txt containing everything read from the image). I was disappointed when I realized that without manual correction of the boxes Tesseract won't know how to apply them correctly. Of course, I need an automated method to apply the boxes (I can't use any GUI or anything like that).

At the moment I'm only using .txt files, and these are the steps I'm doing (it's also worth mentioning that I'm trying to train from scratch):
-Using the .txt and a font (OcrB) to create .tiff and box files with text2image
-Creating a unicharset from all box files
-(optional, but for the sake of it) Applying unicharset properties
-Building a traineddata from the unicharset and langdata, using a custom language as the parameter
-Creating .lstmf files by running "tesseract some.tiff output lstm.train"
-Creating the list of files to train on
-Running LSTM training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111] and learning rate 20e-4
-At the end, using the last checkpoint to create a traineddata for use

Currently the initial .txt files are randomly generated by my own program in the form of MRZ codes (samples included). I also tried generating files with a mixed alphabet to get more variety of characters. I was using about 1000 samples for training, and the result doesn't differ from using 100 samples.
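
The random generation of MRZ codes into a single .txt file could be sketched roughly as below. This is not the program used in the post: the function names, the simplified passport-style (two lines of 44 characters) layout, and the random field contents are all assumptions for illustration. The check digit uses the 7, 3, 1 weighting from ICAO Doc 9303, where digits keep their value, A to Z map to 10 to 35, and the filler < counts as 0.

```python
import random
import string

MRZ_ALPHABET = string.digits + string.ascii_uppercase + "<"


def char_value(ch):
    """Numeric value of one MRZ character for the check digit."""
    if ch.isdigit():
        return int(ch)
    if ch == "<":
        return 0
    return ord(ch) - ord("A") + 10


def check_digit(field):
    """Check digit with repeating weights 7, 3, 1, modulo 10."""
    weights = (7, 3, 1)
    total = sum(char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return str(total % 10)


def random_date(rng):
    """Random YYMMDD string (day limited to 28 to stay valid)."""
    return "{:02d}{:02d}{:02d}".format(
        rng.randint(0, 99), rng.randint(1, 12), rng.randint(1, 28))


def random_mrz_pair(rng):
    """One fake two-line, 44-character MRZ (hypothetical content)."""
    letters = string.ascii_uppercase
    state = "".join(rng.choice(letters) for _ in range(3))
    surname = "".join(rng.choice(letters) for _ in range(rng.randint(3, 12)))
    given = "".join(rng.choice(letters) for _ in range(rng.randint(3, 10)))
    line1 = ("P<" + state + surname + "<<" + given).ljust(44, "<")

    number = "".join(rng.choice(letters + string.digits) for _ in range(9))
    birth = random_date(rng)
    expiry = random_date(rng)
    sex = rng.choice("MF<")
    personal = "<" * 14
    body = (number + check_digit(number) + state + birth + check_digit(birth)
            + sex + expiry + check_digit(expiry)
            + personal + check_digit(personal))
    # Composite check digit covers positions 1-10, 14-20 and 22-43.
    composite = body[0:10] + body[13:20] + body[21:43]
    return line1, body + check_digit(composite)


def generate_lines(count, seed=0):
    """Return 2 * count MRZ lines, reproducible for a given seed."""
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        lines.extend(random_mrz_pair(rng))
    return lines


def write_training_text(path, count, seed=0):
    """Write all generated codes into one text file, one line each."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(generate_lines(count, seed)) + "\n")
```

Calling write_training_text("mrz.training_text", 1000) would produce one merged file that can be passed to text2image in a single run; the file name here is only a placeholder.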

Also, I disabled dictionary in the OCR process to prevent tesseract from treating whole MRZ code as a word.
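
The post does not say how the dictionary was disabled. One common way, shown here only as an assumption, is to switch off the load_system_dawg and load_freq_dawg configuration variables on the command line. The helper name, the file names, and the choice of --psm 6 are illustrative.

```python
import shlex


def build_ocr_command(image_path, output_base, lang="eng"):
    """Build a tesseract command line with the word dictionaries turned off.

    load_system_dawg and load_freq_dawg are Tesseract config variables;
    setting them to 0 stops the main and frequent-word dawgs from loading.
    """
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "--psm", "6",               # treat the image as one uniform text block
        "-c", "load_system_dawg=0",
        "-c", "load_freq_dawg=0",
    ]


def as_shell_string(command):
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(part) for part in command)
```

The resulting list can be passed to subprocess.run, or printed with as_shell_string to run by hand.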

I might not understand some things despite reading a lot about this topic, but I'm pretty sure that I'm doing the training process correctly. Do you have any tips on how to improve it? Feel free to point out even the dumbest things I could have forgotten about.
mrz0.txt
mrz1.txt

Shree Devi Kumar

Sep 5, 2018, 3:22:04 PM
to tesser...@googlegroups.com
I think fine-tuning will be a better option than training from scratch.

Using a small training/test text (40 lines), I get:

---------------------------------

+ lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/script/Latin.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=0.73106061, Word error rate=13.75

---------------------------------

+ lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=47.444889, Word error rate=92.5

---------------------------------

At iteration 16/410/410, Mean rms=0.236%, delta=0.131%, char train=0.448%, word train=3.659%, skip ratio=0%,  New best char error = 0.448 wrote checkpoint.

Finished! Error rate = 0.448

---------------------------------


+ lstmeval --model /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint --traineddata /home/ubuntu/tesstutorial/ocrb/eng/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
/home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a recognition model, trying training checkpoint...
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0

---------------------------------
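
For readers who want to try the same kind of fine-tuning, the usual sequence (extract the LSTM model from an existing traineddata, continue training from it, then convert the checkpoint) can be sketched as below. These are not necessarily the exact commands behind the logs above: the helper name, the directory layout, and the iteration count are assumptions, and the simple case shown here keeps the character set of the base model unchanged.

```python
def finetune_commands(base_traineddata, work_dir, train_listfile,
                      max_iterations=400):
    """Return the three command lines of a basic LSTM fine-tuning run.

    All paths are placeholders supplied by the caller; max_iterations
    is an arbitrary example value.
    """
    base_lstm = f"{work_dir}/base.lstm"
    model_prefix = f"{work_dir}/finetuned"
    return [
        # 1. Extract the LSTM component from the existing traineddata.
        ["combine_tessdata", "-e", base_traineddata, base_lstm],
        # 2. Continue training from the extracted model on the new lstmf files.
        ["lstmtraining",
         "--continue_from", base_lstm,
         "--traineddata", base_traineddata,
         "--train_listfile", train_listfile,
         "--model_output", model_prefix,
         "--max_iterations", str(max_iterations)],
        # 3. Convert the last checkpoint into a usable traineddata file.
        ["lstmtraining", "--stop_training",
         "--continue_from", model_prefix + "_checkpoint",
         "--traineddata", base_traineddata,
         "--model_output", f"{work_dir}/finetuned.traineddata"],
    ]
```

Each list can be passed to subprocess.run in order. The intermediate checkpoint can be evaluated with lstmeval in the same way as in the logs above.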

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Sep 5, 2018, 4:03:41 PM
to tesser...@googlegroups.com
for the files and traineddata.

kaminski....@gmail.com

Sep 6, 2018, 10:53:08 AM
to tesseract-ocr
Thank you for your reply, Shreeshrii!

Indeed, fine-tuning is a much, much better solution for my problem. Thanks to your logs and the data provided in the repo, I realized that I don't need to generate every single MRZ code separately (I'm sure it was mentioned somewhere <dummy>). In fact, the process of making tiffs with boxes and then lstmf files was oddly long (also, loading lines in the form of pages takes much less time). Using merged data is now just a matter of seconds. I don't know if it affected accuracy, but now I'm generating all the codes in one .txt file and then processing it.

I've managed to make my own traineddata based on the Polish language, and the results are outstanding. Thank you very much!

Until now I've avoided the tesstrain.sh script and tried to use my own, just to customize the process and control it. When it's combining the language model, I've spotted that it's making some dawg files. Is that because I'm using already existing language data? If so, how can I generate langdata myself for a custom language? The documentation isn't so clear in this case. I know that it's created by combine_lang_model based on the wordlist (langdata). I don't need it at the moment, but I think it's a good idea to clear that up in case I need to do some training from scratch, although I know that's a rare case.

Thank you for taking your time to solve my problem! :)

Shree Devi Kumar

Sep 6, 2018, 11:56:44 AM
to tesser...@googlegroups.com
> When it's combining language model I've spotted that it's making some dawg files.

Yes, it takes the files from langdata repo specified in the training command. 

You could change langdata/pol/pol.wordlist to have only the LAST NAMES and GIVEN NAMES, pol.punc to have only <, and change the number formats in pol.numbers to the MRZ number patterns (i.e. any required customizations based on your use case).

I am not sure how much the dawgs help with the LSTM engine, but you can try after customizing to see if you get improved results.


kaminski....@gmail.com

Sep 10, 2018, 9:35:15 AM
to tesseract-ocr
Thank you, Shreeshrii, for the reply!

Manual customization of these files might be kinda annoying. If I need to experiment with the dawg files and achieve something, I'll surely let you know whether there is any difference. Again, thank you for your help and time :)

Vinod Gattani

Oct 16, 2018, 6:04:15 AM
to tesseract-ocr
Hi All,

I have started a project to do OCR on Identity Cards. I am learning to train tesseract models with custom fonts.

Please help me on this.

Steps till now:

2. Then I followed the instructions on the training page up to the command "sudo make training-install".
3. Downloaded eng.traineddata from https://github.com/tesseract-ocr/tessdata_best into the tessdata folder.
4. Ran the command "src/training/tesstrain.sh --fonts_dir /usr/share/fonts --fontlist "Arial Bold" --lang eng --linedata_only --noextract_font_properties --langdata_dir ../langdata --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain"

It is giving error:
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Tue Oct 16 05:41:31 UTC 2018] /usr/bin/tesseract /tmp/tmp.4EGdp9wW57/eng.Arial_Bold.exp0.tif /tmp/tmp.4EGdp9wW57/eng.Arial_Bold.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
fseek(data_file_, static_cast<size_t>(offset_table_[tessdata_type]), SEEK_SET) == 0:Error:Assert failed:in file ../ccutil/tessdatamanager.h, line 173
ERROR: /tmp/tmp.4EGdp9wW57/eng.Arial_Bold.exp0.lstmf does not exist or is not readable

Why the version is 4.0.

Also, how do I download a custom font for my identity cards?

Regards,

Robert Kamiński

Oct 16, 2018, 6:23:12 AM
to tesser...@googlegroups.com
Hi,
"Why the version is 4.0." What do you mean by that? The log states that it's v3.04: "Tesseract Open Source OCR Engine v3.04.01 with Leptonica".
The problem might be that version 4 uses LSTM files, whereas you have version 3.04, which uses box files instead. Try checking which version of Tesseract is installed. Also note that I'm not the expert here ^.^
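
A quick way to check which Tesseract binary and version a training script will pick up is sketched below. The function names are made up for this example, and the "major version 4 or later" test is only a rough stand-in for "supports LSTM training".

```python
import re
import shutil
import subprocess


def parse_version(banner):
    """Extract (major, minor, patch) from a Tesseract banner or version string."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups(default="0"))


def supports_lstm_training(banner):
    """Rough check: the LSTM engine arrived with major version 4."""
    version = parse_version(banner)
    return version is not None and version[0] >= 4


def describe_installed_tesseract():
    """Return (path, version) of the first tesseract found on PATH."""
    path = shutil.which("tesseract")
    if path is None:
        return None, None
    # Older releases print the version to stderr, so capture both streams.
    result = subprocess.run([path, "--version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return path, parse_version(result.stdout)
```

If describe_installed_tesseract() reports a path such as /usr/bin/tesseract together with a 3.x version, an older packaged install is still shadowing the freshly built one.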



Vinod Gattani

Oct 16, 2018, 6:26:04 AM
to tesser...@googlegroups.com
Hi,
Typo: "Why is the version not 4.0?"
I installed using "git pull https://github.com/tesseract-ocr/tesseract" and then followed the instructions on the training page.

Regards

Zdenko Podobny

Oct 16, 2018, 6:45:07 AM
to tesser...@googlegroups.com
Robert is pointing you in the right direction. Did you read the log you posted here?
"Tesseract Open Source OCR Engine v3.04.01 with Leptonica"
You are mixing Tesseract versions, so problems are no surprise.

Zdenko



Vinod Gattani

Oct 16, 2018, 6:53:50 AM
to tesser...@googlegroups.com
Robert/ Zdenko

Yes, in the log I see version "3.4v".

To install v4, I used the link "https://github.com/tesseract-ocr/tesseract". I thought it has tesseract v4, as the Readme file say "Source code for the new LSTM based 4.0 version is available from the master branch on GitHub." So, I did a git pull.

Steps:

  1. git pull https://github.com/tesseract-ocr/tesseract
  2. cd tesseract
  3. sudo apt-get install libicu-dev
  4. sudo apt-get install libpango1.0-dev
  5. sudo apt-get install libcairo2-dev
  6. sh autogen.sh
  7. sh ./configure
  8. make
  9. make training
  10. sudo make training-install
  11. Training Command gives the error as mentioned.

Also, when I run tesseract -v, I see 3.04.01 too.

So, is there any other way of installing v4.0? Please let me know what I am doing wrong.

Regards,
Vinod

Zdenko Podobny

Oct 16, 2018, 7:06:47 AM
to tesser...@googlegroups.com
You obviously forgot to uninstall Tesseract 3.04.
You cannot have two installations of Tesseract unless you know your system and know how to handle this kind of situation.
Whatever you do, you should understand what you are doing.
 
Zdenko



Soumik Ranjan Dasgupta

Oct 16, 2018, 7:38:10 AM
to tesser...@googlegroups.com
You should uninstall (purge) v3 first. Then build the v4 from scratch.




--
Regards,
Soumik Ranjan Dasgupta

Vinod Gattani

Oct 17, 2018, 5:18:17 AM
to tesser...@googlegroups.com
Thanks everyone.

With your suggestions and by following this link "https://www.youtube.com/watch?v=WZLJucXZy-g", I was able to run a demo training for a font.

I used Shreeshrii's GitHub repo "https://github.com/Shreeshrii/tessdata_ocrb".

I need some help with the points below. If there is any documentation available, I will look into it.

1. What do the metrics below mean? This would help me decide when we should stop training (#iterations).

At iteration 9/410/410, Mean rms=0.187%, delta=0.446%, char train=2.537%, word train=9.024%, skip ratio=0%,  wrote checkpoint.
Finished! Error rate = 2.537

2. Are there any tips for preparing the training text, like a minimum number of occurrences for each letter?
3. How do I find the best matching font type for my document?
4. Should the folder name of the font match the font name? How does Tesseract identify the right font corresponding to the parameter --fontlist "DejaVu Sans" in the font directory?
5. Is it recommended to use multiple fonts in the --fontlist argument?
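
On point 1, a small parser for the progress lines can help with tracking a run. In the quoted log the final "Error rate" equals the char train value, which suggests that char train is the character error rate on the training data, in percent; the sketch below makes no further claims about the other fields and simply passes them through. The function and key names are invented for this example.

```python
import re

LOG_PATTERN = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), "
    r"Mean rms=([\d.]+)%, delta=([\d.]+)%, "
    r"char train=([\d.]+)%, word train=([\d.]+)%, "
    r"skip ratio=([\d.]+)%"
)


def parse_progress_line(line):
    """Parse one lstmtraining progress line into a dict, or return None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    counters = tuple(int(value) for value in match.groups()[:3])
    rms, delta, char_err, word_err, skip = (
        float(value) for value in match.groups()[3:])
    return {
        "iterations": counters,    # the three slash-separated counters, as printed
        "mean_rms": rms,
        "delta": delta,
        "char_train": char_err,    # matches the final "Error rate" in the log
        "word_train": word_err,
        "skip_ratio": skip,
    }


def best_char_error(lines):
    """Lowest char train value among all parseable lines, or None."""
    parsed = [parse_progress_line(line) for line in lines]
    values = [entry["char_train"] for entry in parsed if entry is not None]
    return min(values) if values else None
```

Feeding the whole training log through best_char_error shows whether the error is still dropping, which is one practical signal for deciding when to stop.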

Regards,
Vinod





Jankees Korstanje

Apr 8, 2019, 4:41:16 PM
to tesseract-ocr
Hi Shree,

We have tried your traineddata file for MRZ and noticed that it does not detect the character X.

Looking at https://github.com/Shreeshrii/tessdata_ocrb/blob/master/eng.MRZ.training_text, we see that there is no X in there.

In addition, it might be good to add a couple of lines that are specific to IDs (starting with I). Note that they are all fake:

IDESPANH186495123456789X<<<<<<
IXESPE002561410<0233181G<<<<<
I<NLDIS2KX87214<<<<<<<<<<<<<<<
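
A missing character like this can be caught before training by counting how often each MRZ character occurs in the training text. The sketch below assumes the MRZ alphabet is A to Z, 0 to 9 and the filler <; the function names are made up for this example.

```python
import string
from collections import Counter

MRZ_ALPHABET = string.ascii_uppercase + string.digits + "<"


def character_counts(text):
    """Count occurrences of each MRZ character in the training text."""
    return Counter(ch for ch in text if ch in MRZ_ALPHABET)


def missing_characters(text, minimum=1):
    """MRZ characters that occur fewer than `minimum` times."""
    counts = character_counts(text)
    return [ch for ch in MRZ_ALPHABET if counts[ch] < minimum]


def unexpected_characters(text):
    """Non-whitespace characters that are not part of the MRZ alphabet."""
    return sorted({ch for ch in text
                   if ch not in MRZ_ALPHABET and not ch.isspace()})
```

Running missing_characters on the contents of the training text file lists every character the model never sees, and raising minimum flags characters that are present but rare.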






Shree Devi Kumar

Apr 8, 2019, 5:15:29 PM
to tesser...@googlegroups.com
If you can provide another 40-50 lines of training data (text file), I will rerun the training.



shree

Apr 9, 2019, 9:16:04 AM
to tesseract-ocr

see https://github.com/Shreeshrii/tessdata_ocrb


Retrained to add missing X using 3 fonts at 3 exposures and a larger training text compared to previous version.

Both float/best and integer/fast versions are provided.

I would appreciate feedback. If this is useful, we can add it to https://github.com/tesseract-ocr/tessdata_contrib



shree

Apr 9, 2019, 9:19:42 AM
to tesseract-ocr
Correction: fast version is ocrb_int (not ocrb-int).