Fine-tuning an existing model


Lorenzo Bolzani

Jun 29, 2018, 12:01:08 PM
to tesser...@googlegroups.com

Hi,
I'm trying to fine-tune an existing model using line images and text labels. I'm running this version:

tesseract 4.0.0-beta.3-56-g5fda
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE



I used OCR-D to generate lstmf files for the demo data.

If I run the make command it works fine.

make training MODEL_NAME=prova

Now I isolated this command from the build:

lstmtraining \
  --traineddata data/prova/prova.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/prova \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000

and it works fine.
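For context, the backtick expression in the net spec above works because the first line of a unicharset file is the number of symbols it contains; `O1c` just needs that count as the size of the output layer. A minimal illustration with a synthetic unicharset (path and contents are made up):

```shell
# The first line of a unicharset holds the symbol count, so
# O1c`head -n1 data/unicharset` sizes the output layer to match it.
printf '111\nNULL\nJoined\n' > /tmp/demo.unicharset
head -n1 /tmp/demo.unicharset   # prints: 111  (i.e. the spec becomes O1c111)
```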

Now I'm trying to modify it to fine-tune the existing eng model. I made a few attempts, all ending in different errors (see the attached file for full output).

I used:

combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm

to extract the eng.lstm model.

This seems to work, but I'm not sure it's correct.

lstmtraining \
  --continue_from  extracted/eng.lstm \
  --traineddata data/prova/prova.traineddata \
  --old_traineddata extracted/eng.traineddata \
  --model_output data/checkpoints/prova \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000

(extracted/eng.traineddata is just a copy of eng.traineddata)


The training resumes exactly at the RMS of prova_checkpoint (6%), so it looks like it is training from that checkpoint, not from eng.lstm.

Is this correct? What should I change?
I'm following this guide:



I think continue_from and traineddata should refer to the eng model and old_traineddata should point to prova.traineddata, but if I do that I get a segmentation fault:

[...]
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Segmentation fault

What am I missing?


Thanks, bye

Lorenzo

errors.txt

Shree Devi Kumar

Jun 29, 2018, 12:09:09 PM
to tesser...@googlegroups.com
I modified the makefile for ocrd-train to do fine-tuning.  It is pasted below:

export

SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA =  $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata

# Name of the model to be built
MODEL_NAME = frk

# Name of the model to continue from
CONTINUE_FROM = frk

# Normalization Mode - see src/training/language_specific.sh for details 
NORM_MODE = 2

# Tesseract model repo to use. Default: $(TESSDATA_REPO)
TESSDATA_REPO = _best

# Train directory
TRAIN := data/train

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"

# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
lists: $(ALL_LSTMF) data/list.train data/list.eval

data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	   tail -n "$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"

$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"

$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --psm 6 lstm.train

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 3000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	  --stop_training \
	  --continue_from $^ \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output $@

# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints
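The list.train/list.eval recipes above boil down to a simple split of the shuffled lstmf list. As a standalone sketch (the function name is mine; awk is used instead of bc, with the same integer truncation):

```shell
# Print "train_count eval_count" for a list of a given length,
# the way the head/tail recipes above consume them.
split_counts() {
  total=$1
  ratio=${2:-0.90}   # same default as RATIO_TRAIN
  awk -v t="$total" -v r="$ratio" 'BEGIN { n = int(t * r); print n, t - n }'
}

split_counts 100   # prints: 90 10
```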

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Lorenzo Bolzani

Jun 29, 2018, 3:33:34 PM
to tesser...@googlegroups.com
Hi Shree, thanks for your answer.

I tried the script setting:

TESSDATA=extracted            # here I have the eng.lstm and eng.traineddata
LANGDATA=langdata-master      # all langdata downloaded by OCR-D

MODEL_NAME = eng
CONTINUE_FROM = eng


First I run the old Makefile to create the boxes.

$ make training MODEL_NAME=eng


I stop it as soon as the training starts:

At iteration 400/400/400, Mean rms=6.657%, delta=40.765%, char train=100.827%, word train=100%, skip ratio=0%,  New worst char error = 100.827 wrote checkpoint.


At iteration 500/500/500, Mean rms=6.644%, delta=40.423%, char train=100.662%, word train=100%, skip ratio=0%,  New worst char error = 100.662 wrote checkpoint.

^Cmake: *** Deleting file 'data/checkpoints/eng_checkpoint'
Makefile:110: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Interrupt

Notice that the data/checkpoints/eng_checkpoint file is deleted; I do not know whether that is relevant.


Then I switch to the new one and I get this:

$ make training

mkdir -p data/checkpoints
lstmtraining \
  --continue_from   extracted/eng.lstm \
  --old_traineddata extracted/eng.traineddata \
  --traineddata data/eng/eng.traineddata \
  --model_output data/checkpoints/eng \
  --debug_interval -1 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --sequential_training \
  --max_iterations 3000
Loaded file extracted/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 76!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc76:76, 0
Total weights = 1404064
Previous null char=110 mapped to 75
Continuing from extracted/eng.lstm
Loaded 1/1 pages (1-1) of document data/train/mueller_waldhornist_1821_0130_010.lstmf
Loaded 1/1 pages (1-1) of document data/train/bismarck_erinnerungen02_1898_0274_002.lstmf
Loaded 1/1 pages (1-1) of document data/train/spyri_heidi_1880_0062_005.lstmf
Loaded 1/1 pages (1-1) of document data/train/novalis_ofterdingen_1802_0210_001.lstmf
Iteration 0: ALIGNED TRUTH : Sparoͤfen kauft' ich auch und Sorgenstuͤhle,
Iteration 0: BEST OCR TEXT : l bd o D V fc ds ft hs D t' dsu PM )k ,„cGs D t' D„Gs 'A AKG„9„t d tft ü!Vt Eb ht Ac )k uF ' K,cGPFVts
File data/train/mueller_waldhornist_1821_0130_010.lstmf page 0 :

!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Makefile:113: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Segmentation fault


What am I doing wrong?



Lorenzo


Shree Devi Kumar

Jun 29, 2018, 4:27:46 PM
to tesser...@googlegroups.com
You should be able to use the new makefile after you change all the directory locations to match your setup.

Change the language from frk to eng, though the sample training text seems to be non-English; in that case it is better to use the appropriate language traineddata, e.g. tessdata_best/deu.traineddata for German.


Lorenzo Bolzani

Jun 29, 2018, 6:17:35 PM
to tesser...@googlegroups.com

I think I found the problem. Running the new Makefile directly, I had this error:

make: *** No rule to make target 'data/train/alexis_ruhe01_1852_0018_022.box', needed by 'data/all-boxes'.  Stop.

The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script directly.

I also replaced the eng.traineddata with the one from here:


and it's training correctly. (It also works with the previous model, from https://github.com/tesseract-ocr/tessdata.)



One more question: I wanted to check whether the output character sets of the new and old models differ. I used:

combine_tessdata -u eng.traineddata orig

on both models and compared the unicharset files. I see that some characters are missing and some others are added. It looks good. Is this the correct way to check this?
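For reference, here is that comparison as a small helper (the function name and temp paths are mine; each argument is a unicharset unpacked with combine_tessdata -u, and note that the first line of a unicharset is the symbol count, so differing counts show up in the diff too):

```shell
# Show unicharset entries present in only one of the two models:
# lines prefixed '<' exist only in the first file, '>' only in the second.
# (Exits non-zero when the two sets are identical, since grep finds nothing.)
unicharset_diff() {
  sort "$1" > /tmp/ucs_old
  sort "$2" > /tmp/ucs_new
  diff /tmp/ucs_old /tmp/ucs_new | grep '^[<>]'
}
```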

In this way, can I train a model that, for example, only recognizes uppercase characters, or numbers, simply by providing only uppercase training data? Or is there something else to configure?


Thanks, bye

Lorenzo



Shree Devi Kumar

Jun 30, 2018, 4:19:42 AM
to tesser...@googlegroups.com
> The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script directly.

Oh, I remember now. I had changed that for ease in renaming files for some reason.

> In this way can I train a model that, for example, only recognize uppercase characters, or numbers, simply by providing only uppercase training data? Or is there something else to configure?

You could try fine-tuning from English. Remove the line merging the unicharsets from my makefile (use the command from the original script). 300 iterations should be enough, as you are not adding any characters. Try to have a training text which resembles the kind of words that you expect to OCR.

Lorenzo Bolzani

Jul 2, 2018, 10:24:14 AM
to tesser...@googlegroups.com
Hi Shree,
I replaced the line:

 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"

with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"

(I write this in case someone else is following this thread).

And now I have a brand-new fine-tuned model with only the characters I need. Nice :)

For the training I'm using actual crops from the documents I need to OCR, painfully hand-labeled.

As for the number of iterations, I'm still trying to figure that out. I've seen that there is a train/eval split; I've set it to 80/20.

I did 300/600/1000/5000/7500/10000 iterations and checked the model with:

lstmeval --model export/$1.traineddata --eval_listfile data/list.eval 2>&1 | grep iteration

and I see that the eval error keeps going down, with a big drop from 1.17 to 0.5 going from 7500 to 10000 iterations. My characters are very noisy and irregular, and my lines are very short, 1 to 4 words at most. Maybe this is why I need more iterations.

I'm fine-tuning from Italian, the language of my documents; I'll try eng too to see if it works better. Now that the pipeline is in place it's easy to try different options.


Thank you for your help so far.


Bye

Lorenzo



Raniem

Sep 6, 2018, 11:02:54 AM
to tesseract-ocr
Hi @Lorenzo Blz,


How many data lines and iterations have you used in your fine-tuning?
In your last reply you have mentioned you replaced 
 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"

with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"
which is very helpful, as I wanted to do the same and generate a new model without specific characters that I need to exclude from the unicharset. But the new model is always worse than my original model.

Can you please advise?

Regards

Lorenzo Bolzani

Sep 6, 2018, 2:48:13 PM
to tesser...@googlegroups.com
Hi Raniem,
I did 5 fine tunings for different fonts and text content with roughly these numbers:

iterations:   samples (training data)
750:            208 numbers (4 upper case + 5 digits each)
1000:          400 MRZ codes (22 uppercase chars each)
1800:          1000 numbers (10 digits each)
22500:        1664 words (from 8 to 30 uppercase chars each)
57500:        54800 words (from 4 to 30 chars each, alphanum, mixed case and font)

I work in this way:
- split the data into training/evaluation sets. OCR-D will do this for you. I use 80/20.
- train (fine-tune) for a few iterations, like 100, then run:

lstmeval --model data/YOUR_MODEL.traineddata --eval_listfile data/list.eval

to check the current accuracy on the evaluation set.
- resume the training up to 200 iterations (OCR-D will resume from the last checkpoint automatically), check the evaluation accuracy again, and so on. Repeat until the evaluation accuracy decreases for a few training steps in a row.

For small datasets I did 100, 200, ..., 1000, 1200, 1400, ..., and coarser steps for the large ones: 1000, 2000, .... Pick the model with the best evaluation score. This way you do not need to guess the number of iterations.
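The "pick the model with the best evaluation score" step is easy to automate once the numbers are logged; a sketch, assuming you collect one "iterations error" pair per lstmeval run in a file (the log format and names are mine):

```shell
# Print the entry with the lowest eval error from an "iterations error" log.
best_checkpoint() {
  sort -k2,2n "$1" | head -n1
}

# Example log and usage:
printf '5000 1.17\n7500 1.17\n10000 0.50\n' > /tmp/eval.log
best_checkpoint /tmp/eval.log   # prints: 10000 0.50
```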

You can find a more detailed description here:

I think the number of iterations depends on the type of text you are targeting: for digits you need only a few; for fixed-font uppercase text just a little more. For complex upper/lower multi-font text/numbers, like the last one, it takes more time.

For the training and evaluation images use the same height and border trimming that you will use for the real data (I used height=54px, no border).


Bye

Lorenzo


Raniem

Sep 6, 2018, 11:01:06 PM
to tesseract-ocr
Thanks for the detailed answer. I am giving it a shot and hoping to get some better results :)

Thanks for all your help and support

Best Regards



Raniem

Sep 10, 2018, 12:31:07 PM
to tesseract-ocr
Thanks Lorenzo.

Your method makes all the magic I needed.

One other question please: I am attempting to fine-tune only the last layer, so I have replaced the
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \

line in the lstmtraining command with:

--continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \
--append_index 5 --net_spec '[Lfx256 O1c69]'

but I am getting this error:
int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 222
Makefile:129: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Segmentation fault (core dumped)

Can anyone please advise on what I am doing wrong?
P.S. My unicharset contains 69 characters.


Regards

Lorenzo Bolzani

Sep 10, 2018, 12:52:21 PM
to tesser...@googlegroups.com

I think there is no need to change the network definition by appending layers with a limited number of output chars. The line you replaced already takes care of this with:

--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]"


I had this error when I was mixing best models with non-best models.

I would try to run again:

combine_tessdata -e base_model/eng.traineddata base_model/eng.lstm

to generate the eng.lstm from the "_best" model (the ones from /usr/share/tessdata are not the "_best" models).

Also see:



Bye

Lorenzo


Raniem

Sep 10, 2018, 1:38:38 PM
to tesseract-ocr
> I think there is no need to change the network definition appending layers with a limited number of output chars. The line you replaced already takes care of this with:

I am actually not doing that to limit the number of output chars; I am doing it because I thought this way I would only be tuning the final layer, keeping the weights of the other layers.
I wanted to experiment to see whether this gives even better performance with fewer iterations or data lines, without overfitting (please correct me if I am wrong and this change does not preserve the weights of the remaining layers).

I will double-check that I am not mixing models. Thanks for the advice :) I appreciate your time and the real-time response :)

Regards

Raniem

Sep 10, 2018, 2:08:03 PM
to tesseract-ocr
You were right regarding the different model types. Thanks :)

Lorenzo Bolzani

Sep 10, 2018, 7:51:54 PM
to tesser...@googlegroups.com
On Mon, Sep 10, 2018 at 15:38 Raniem <raniem...@gmail.com> wrote:
> I am actually doing that not to limit the number of output chars, I am doing it cause I thought this way I am only tuning the final layer as I wanted to keep the weights for other layers.
> I was trying to experiment whether this is going to give me even better performance with a fewer number of iterations or data lines without over fitting (please correct me if i am wrong whether this update is not maintaining the weights in the remaining layers).
 
Ok, now I got it. I never did this myself, and I suppose this is where you are coming from:

If I got this right, you do not really freeze the lower layers; you just replace the final ones with new untrained layers (the training will then update all the weights as needed, even if the impact on the lower ones should be minor). Honestly I cannot see why this should be better than simple fine-tuning unless the "font" you are training on is completely different from the ones learned by the base model. But, having enough data, I think it's worth trying.

But I expect this is going to require more data and more iterations than simple fine-tuning, as the docs seem to suggest.


> I will double check that I am not mixing models. Thanks for the advice :) appreciate your time and the real time response :)

You are welcome. I just remember how difficult it was to make sense of all those "Assert failed" messages :)

Bye

Lorenzo

Raniem

Sep 12, 2018, 9:21:13 AM
to tesseract-ocr
You were right again, actually :)
I will stick with the simple fine tuning.
However I wouldn't have been able to experiment with the other scenarios without your help. Thanks! All is working perfectly well.

Regards

Varun Sab

Sep 18, 2018, 12:29:03 PM
to tesseract-ocr
Hi @Lorenzo Blz,
    I am also getting the same segmentation fault error. Can you please suggest how you solved it?

Shree Devi Kumar

Sep 18, 2018, 3:54:53 PM
to tesser...@googlegroups.com
If you are getting error

!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244

You are probably using the traineddata file which has an `integer` model.

Please use tessdata_best as base for further training.


Varun Sab

Sep 19, 2018, 7:20:16 AM
to tesseract-ocr
Thank you so much. That worked. :)

Tairen Chen

May 2, 2019, 9:48:15 PM
to tesseract-ocr
Hi, Lorenzo and Shree

     Thanks for your sharing.
     I am trying to repeat what you have done here. 
     I followed your posts and changed the Makefile, but when I run $ make training,
     I got the following errors: 
           mkdir -p data/checkpoints
           lstmtraining \
  --continue_from     extracted/eng.lstm \
  --old_traineddata   extracted/eng.traineddata \
  --traineddata data/eng/eng.traineddata \
  --model_output data/checkpoints/eng \
  --debug_interval -1 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --sequential_training \
  --max_iterations 3000
Must provide a --traineddata see training wiki
Makefile:111: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Error 1

      However, I can manually run:

      lstmtraining \
        --traineddata data/eng/eng.traineddata \
        --continue_from extracted/eng.lstm \
        --old_traineddata extracted/eng.traineddata \
        --model_output data/checkpoints/eng \
        --debug_interval -1 \
        --train_listfile data/list.train \
        --eval_listfile data/list.eval \
        --sequential_training \
        --max_iterations 3000
      
      I don't know what to change; I am new to Tesseract and to Makefiles. Please share your wisdom.
      Thank you!
All the best,
                            Tairen

Lorenzo Bolzani

May 2, 2019, 10:10:58 PM
to tesser...@googlegroups.com
Hi Tairen,
the error is quite clear:

Must provide a --traineddata see training wiki

You say that it works if you run it as a single line, so I suppose there is something wrong in the makefile, probably a typo. Maybe there is a space or a tab after a "\"?

Maybe there are some extra characters from copying and pasting from an email. The --traineddata option is on the third line, so it is likely something on line 2 or 3.

If you cannot find the problem, check out the project again and start over. Run it after each single change you make to see if/when it breaks.
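A mechanical way to spot the whitespace-after-backslash problem (the grep pattern and function name are mine):

```shell
# List lines where '\' is followed by trailing spaces or tabs: the '\'
# then no longer continues the line, and make runs a truncated command.
check_continuations() {
  grep -nE '\\[[:blank:]]+$' "$1"
}
```

`cat -A Makefile` is also handy here, since it prints tabs as `^I` and marks line ends with `$`.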



Lorenzo




Tairen Chen

May 3, 2019, 1:48:24 AM
to tesseract-ocr
Thank you very much for your quick answer, Lorenzo!

    You are right: it was an extra space at the beginning of the line where "TESSDATA" is defined, not at the "lstmtraining" line.
    
    I still have a few questions to ask you.

    1. I set "--max_iterations 20000" but the training stops at 5700, like below:
         " At iteration 351/5700/5700, Mean rms=0.117%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  wrote checkpoint"

         I assume 5700 is the iteration number, but I do not know what 351 means here. Also, why does the training stop at 5700, not at 10000 or some other number less than 20000? Is there an RMS threshold or some other condition that stops the training, or is it because I have a small number of training images?

    2. I can generate the "eng.traineddata" using the weights from "tessdata_best", but not from "tessdata". Shree said this is because the "tessdata" weights are an `integer` model. What does "integer" model mean? Can we generate the "eng.traineddata" from the "tessdata" model?
    
     3. Meanwhile, I notice that the "eng.traineddata" I generated is smaller than the model from "tessdata" (11.7M vs 23.5M). So does the "tessdata" model have more parameters than the model from "tessdata_best"? What is the difference between the two?
    
     Thank you!
All the best,
                           Tairen

Lorenzo Bolzani

May 3, 2019, 9:30:12 AM
to tesser...@googlegroups.com
See answer inline.

On Fri, May 3, 2019 at 03:48 Tairen Chen <chent...@gmail.com> wrote:

> 1. I define the "--max_iterations 20000" but the training stops at 5700, like below:
> " At iteration 351/5700/5700, Mean rms=0.117%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  wrote checkpoint"
>
> I assume 5700 is the iteration number, but I do not know what is 351 mean here. Meanwhile why the training stops at 5700, not at 10000, or other numbers that less than 20000? I think there may be "rms" definition to stop the training or any other conditions? or because I have a small number of training images?

You are already at 100% accuracy on your training set (0% error), so there is no point in training further.

This obviously does not mean that you'll get 100% accuracy on your real-world data.

See this thread on how to train for a reasonable number of epochs (training too much is bad):


I do not remember what 351 means.

 
    2. I can generate "eng.traineddata" using the weights from "tessdata_best", but not from "tessdata". Shree said this is because the weights in "tessdata" are an `integer` model. What does an "integer" model mean? Can we generate "eng.traineddata" from the "tessdata" model?

What's the problem with the best model? An integer model is a "simplified/compressed" model that trades some accuracy for better speed. You can convert your trained model to an integer one at the end of the training using combine_tessdata -c, but I have never done it.
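If you want to try it, a minimal sketch of that conversion could look like this (assuming a float "best" model is available locally; the paths are placeholders):

```shell
# combine_tessdata -c converts the LSTM component to integer in place,
# so work on a copy of the float ("best") model.
cp tessdata_best/eng.traineddata eng_int.traineddata
combine_tessdata -c eng_int.traineddata
```

The resulting file should be noticeably smaller and faster at inference, at the cost of some accuracy.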

    
     3. Meanwhile, I notice that the "eng.traineddata" I generated is smaller than the model from "tessdata" (11.7M vs 23.5M). So does the "tessdata" model have more parameters than the model from "tessdata_best"? What is the difference between these two?

No, tessdata_best is a bigger model (more parameters). There are three model sizes: best, normal and fast. Each of these can also be converted to an integer model.

The models I get from fine tuning the eng model are about 6.7 MB.

A traineddata file is an archive, like a zip; maybe you are including fewer files than the original (other than the neural network model itself). But I do not know much about the traineddata details. Maybe best is the LSTM model only, while "normal" includes the 3.x models too? You can use combine_tessdata -u to extract all the contents and check.
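Something like this should show which components each file contains (filenames are placeholders for your own models):

```shell
# Unpack the traineddata archive into its individual components
# (lstm network, unicharset, dawgs, ...) and compare their sizes.
mkdir -p unpacked
combine_tessdata -u eng.traineddata unpacked/eng.
ls -l unpacked/
```

Running it on both files and comparing the listings should tell you where the size difference comes from.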


Bye

Lorenzo

To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.

To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.

Shree Devi Kumar

unread,
May 3, 2019, 9:59:12 AM5/3/19
to tesser...@googlegroups.com
>There are three model sizes: best, normal and fast. Each of these can also be converted to an integer model.

Only `best` can be converted to integer and in fact the LSTM models in `tessdata` are the integer versions of best along with the base/legacy models.

`fast` models have been trained with a smaller network spec compared to `best` and have been converted to integer. 

So, fast and normal are already integer. You can convert `best` to integer (that file will be smaller than `normal` because it does not have the base/legacy models).

>maybe you are including fewer files than the original (other than the neural network model itself).

If you do not use the wordlist, numbers and punc files then the dawgs will not be built. Some traineddata files are really large because of these.

Lorenzo Bolzani

unread,
May 3, 2019, 10:22:17 AM5/3/19
to tesser...@googlegroups.com

Shree, thanks for the clarification.


Tairen Chen

unread,
May 3, 2019, 4:18:59 PM5/3/19
to tesseract-ocr
Hi, Lorenzo,
    Thank you very much for your reply. It really gives me more clues about the training.
All the best,
                      Tairen 

Tairen Chen

unread,
May 3, 2019, 4:19:47 PM5/3/19
to tesseract-ocr
Thank you for your further explanation, Shree!!

Ayush Pandey

unread,
Sep 5, 2019, 7:55:02 AM9/5/19
to tesseract-ocr

Tesseract Version: 4.1.0

I am trying to fine-tune tesseract on a custom dataset with the following Makefile:

export

SHELL := /bin/bash
HOME := $(PWD)
TESSDATA = $(HOME)/tessdata
LANGDATA = $(HOME)/langdata

# Train directory
# TRAIN := $(HOME)/train_data
TRAIN := /media/vimaan/Data/OCR/tesseract_train

# Name of the model to be built
MODEL_NAME = eng
LANG_CODE = eng

# Name of the model to continue from
CONTINUE_FROM = eng

TESSDATA_REPO = _best

# Normalization Mode - see src/training/language_specific.sh for details 
NORM_MODE = 1

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"

# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
#lists: $(ALL_LSTMF) data/list.train data/list.eval
lists: $(ALL_LSTMF)

train-lists: data/list.train data/list.eval

data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	   tail -n "$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	mkdir -p data/$(START_MODEL)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata  $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	#merge_unicharsets data/$(START_MODEL)/$(START_MODEL).lstm-unicharset $(GROUND_TRUTH_DIR)/my.unicharset  "$@"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"
	
$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
	
$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%.gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*.gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --dpi 300 --psm 7 lstm.train
	

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 170000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	--stop_training \
	--continue_from $^ \
	--old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	--traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	--model_output $@

# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints

The number of .lstmf files being generated is significantly lower than the number of .box files. For example:
Number of .tif files: 10k
Number of .gt.txt files: 10k
Number of .box files: 10k
Number of .lstmf files: 8k

Could anyone point me to the possible reasons for this issue?
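For reference, this is roughly how I compare the two sets to find the images that have a .tif but no matching .lstmf (TRAIN is my training directory):

```shell
# Collect the basenames of .tif and .lstmf files and diff the sorted
# lists; whatever is left had no .lstmf generated for it.
TRAIN=${TRAIN:-.}
find "$TRAIN" -name '*.tif'   | sed 's/\.tif$//'   | sort > /tmp/tifs.txt
find "$TRAIN" -name '*.lstmf' | sed 's/\.lstmf$//' | sort > /tmp/lstmfs.txt
comm -23 /tmp/tifs.txt /tmp/lstmfs.txt
```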

Shree Devi Kumar

unread,
Sep 5, 2019, 8:30:20 AM9/5/19
to tesseract-ocr

Ayush Pandey

unread,
Sep 5, 2019, 8:48:27 AM9/5/19
to tesseract-ocr
Hi Shree,
             Thank you so much for your response. I also wanted to ask: after training I get an empty output on a lot of images, even though the height and width of the images are usually > 100 pixels. Apart from changing the psm value, is there any other way to reduce this?

Lorenzo Bolzani

unread,
Sep 6, 2019, 7:04:22 AM9/6/19
to tesser...@googlegroups.com
Can you please share an example?

An empty output usually means that it failed to recognize the black parts as text; this could be because the text is too big or too small, or because of a wrong dpi setting. Or the image is not reasonably clean.

To better understand the problem you can try to downscale the images (according to some tests done by a user on this forum, 35/50px is what worked best for him), try different dpi settings, remove borders, denoise, etc. Compare the images that work with the ones that do not.
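As a rough sketch of such a cleanup pass (not something I have tuned, and assuming ImageMagick is available; the 40px height and 60% threshold are arbitrary starting points):

```shell
# Grayscale, scale the line image down, add a white border and binarize.
convert line.png -colorspace Gray -resize x40 \
        -bordercolor white -border 10 -threshold 60% line_clean.png
```

Try a few variations of the size and threshold and compare the OCR results on your eval images.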



Lorenzo







Ayush Pandey

unread,
Sep 6, 2019, 9:04:15 AM9/6/19
to tesseract-ocr
Hi Lorenzo. The empty output was due to the fact that I was using 7 as PSM parameter. Using 13 as PSM parameter completely eliminated the problem.

Lorenzo Bolzani

unread,
Sep 6, 2019, 9:24:49 AM9/6/19
to tesser...@googlegroups.com
Hi Ayush,
PSM 6 and 7 do some extra pre-processing of the image; 13 does much less.

Unless your image contains text like this:

----
====
....

I would not expect much difference between PSM 6/7 and 13. While PSM 13 solves some problems, I got more "ghost letter" errors (letters that are repeated more than once, or split into similar variations, like O becoming O0). So it may not be an ideal solution.

Also there is no reason why a clean single line of text should not work with 6 or 7.

For some single line images with messy background I found that PSM 6 works better than 7.


Lorenzo


Ayush Pandey

unread,
Sep 8, 2019, 2:05:51 PM9/8/19
to tesseract-ocr
Hi Lorenzo, Shree
I have a few questions:
  1. In the link provided by Shree -> https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#tesseract-fails-to-create-lstm-files it says that .lstmf files are not generated for some images if you use the default list.train settings. Using PSM=13 helps build those lstmf files, whereas using PSM=6 or 7 skips them. Any clues as to why that is the case? Tesseract does give me output text for those images with PSM values 6, 7 and 13.
  2. If I use PSM 13 for generating the lstmf files used for training, is it okay to use PSM values 6 and 7 while testing?
  3. How can I check the contents of the lstmf files to see if they contain the ground truth text and the image data correctly?
  4. Side question: lstmtraining saves the checkpoints in the format loss_iteration. It saves the checkpoints for a few iterations with the best loss (apart from eng_checkpoint, which contains the metadata, I guess). Is the loss calculated on the training data or the evaluation data? Is there a way to save all checkpoints?
  5. Side question: does lstmeval use the psm value with which the lstmf file was generated for evaluation?
I know it's a lot of questions and doubts. Thank you for your time in helping me out.

Lorenzo Bolzani

unread,
Sep 8, 2019, 5:03:09 PM9/8/19
to tesser...@googlegroups.com
Hi Ayush,
usually training images are denoised much more than yours. I think the standard models are trained on pure black text on a pure white background, maybe with a little noise. It could still work on these images, especially with fine tuning, but this is not the typical training data, so I'm not surprised you have problems.

Anyway, I think your problem here is with segmentation, not with the LSTM model. I suppose segmentation is done with thresholding and component analysis, and these are quite sensitive to noise.

I suspect the problem with the SW-something image might be the small fragments on top, while the 3-M-something image is probably fooled by the red line at the bottom. You could do component analysis to clear these fragments, but with this amount of noise it is very hard. If you can, try to crop tighter (and see if it helps).

About your questions:
1. I suppose the segmentation step during training is different: it should use the box files rather than doing the page analysis. PSM 6/7 do some extra cleanup. I do not know why it fails.
2. I do the training with PSM 6 and, for one model, I use 13 at runtime and it works fine. A fine tuning training for me usually takes less than one hour, so when I have doubts like these I just try all the alternatives and see what works best on the eval set.
3. No idea. Check the box file too.

4. I manually do incremental training: 100 iterations, save the model, run lstmeval, 200 iterations, save the model, lstmeval, etc.
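Roughly like this (the paths and model names are from my setup, and this is a sketch rather than a polished script):

```shell
# Train in increasing chunks of iterations (--max_iterations is the
# total, so each call continues where the previous one stopped), then
# evaluate the latest checkpoint on the eval list with lstmeval.
for iters in 100 200 400 800; do
  lstmtraining \
    --continue_from  data/checkpoints/prova_checkpoint \
    --traineddata    data/prova/prova.traineddata \
    --model_output   data/checkpoints/prova \
    --train_listfile data/list.train \
    --max_iterations "$iters"
  lstmeval \
    --model data/checkpoints/prova_checkpoint \
    --traineddata data/prova/prova.traineddata \
    --eval_listfile data/list.eval
done
```

This way you can see when the eval error stops improving and pick the checkpoint from that point.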


5. I do not know.


Lorenzo
