Fine-tuning an existing model


Lorenzo Bolzani

Jun 29, 2018, 12:01:08 PM
to tesser...@googlegroups.com

Hi,
I'm trying to fine-tune an existing model using line images and text labels. I'm running this version:

tesseract 4.0.0-beta.3-56-g5fda
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE



I used OCR-D to generate lstmf files for the demo data.

If I run the make command it works fine.

make training MODEL_NAME=prova

Now I isolated this command from the build:

lstmtraining \
  --traineddata data/prova/prova.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/prova \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000

and it works fine.
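For context, the backtick expression in the net spec above works because the first line of a unicharset file is the number of symbols it contains; `O1c` just needs that count as the size of the output layer. A minimal illustration with a synthetic unicharset (path and contents are made up):

```shell
# The first line of a unicharset holds the symbol count, so
# O1c`head -n1 data/unicharset` sizes the output layer to match it.
printf '111\nNULL\nJoined\n' > /tmp/demo.unicharset
head -n1 /tmp/demo.unicharset   # prints: 111  (i.e. the spec becomes O1c111)
```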

Now I'm trying to modify it to fine-tune the existing eng model. I made a few attempts, all ending in different errors (see the attached file for full output).

I used:

combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm

to extract the eng.lstm model.

This seems to work, but I'm not sure it's correct.

lstmtraining \
  --continue_from  extracted/eng.lstm \
  --traineddata data/prova/prova.traineddata \
  --old_traineddata extracted/eng.traineddata \
  --model_output data/checkpoints/prova \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000

(extracted/eng.traineddata is just a copy of eng.traineddata)


The training resumes exactly at the RMS of prova_checkpoint (6%), so it looks like it is training from that checkpoint, not from eng.lstm.

Is this correct? What should I change?
I'm following this guide:



I think continue_from and traineddata should refer to the eng model and old_traineddata should point to prova.traineddata, but if I do that I get a segmentation fault:

[...]
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Segmentation fault

What am I missing?


Thanks, bye

Lorenzo

errors.txt

Shree Devi Kumar

Jun 29, 2018, 12:09:09 PM
to tesser...@googlegroups.com
I modified the makefile for ocrd-train to do fine-tuning.  It is pasted below:

export

SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA =  $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata

# Name of the model to be built
MODEL_NAME = frk

# Name of the model to continue from
CONTINUE_FROM = frk

# Normalization Mode - see src/training/language_specific.sh for details 
NORM_MODE = 2

# Tesseract model repo to use. Default: $(TESSDATA_REPO)
TESSDATA_REPO = _best

# Train directory
TRAIN := data/train

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"

# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
lists: $(ALL_LSTMF) data/list.train data/list.eval

data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	   tail -n "$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"

$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"

$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --psm 6 lstm.train

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 3000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	  --stop_training \
	  --continue_from $^ \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output $@

# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints
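The list.train/list.eval recipes above boil down to a simple split of the shuffled lstmf list. As a standalone sketch (the function name is mine; awk is used instead of bc, with the same integer truncation):

```shell
# Print "train_count eval_count" for a list of a given length,
# the way the head/tail recipes above consume them.
split_counts() {
  total=$1
  ratio=${2:-0.90}   # same default as RATIO_TRAIN
  awk -v t="$total" -v r="$ratio" 'BEGIN { n = int(t * r); print n, t - n }'
}

split_counts 100   # prints: 90 10
```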

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Lorenzo Bolzani

Jun 29, 2018, 3:33:34 PM
to tesser...@googlegroups.com
Hi Shree, thanks for your answer.

I tried the script setting:

TESSDATA=extracted            # here I have the eng.lstm and eng.traineddata
LANGDATA=langdata-master      # all langdata downloaded by OCR-D

MODEL_NAME = eng
CONTINUE_FROM = eng


First I run the old Makefile to create the boxes.

$ make training MODEL_NAME=eng


I stop it as soon as the training starts:

At iteration 400/400/400, Mean rms=6.657%, delta=40.765%, char train=100.827%, word train=100%, skip ratio=0%,  New worst char error = 100.827 wrote checkpoint.


At iteration 500/500/500, Mean rms=6.644%, delta=40.423%, char train=100.662%, word train=100%, skip ratio=0%,  New worst char error = 100.662 wrote checkpoint.

^Cmake: *** Deleting file 'data/checkpoints/eng_checkpoint'
Makefile:110: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Interrupt

Notice that the data/checkpoints/eng_checkpoint file is deleted; I do not know whether that is relevant.


Then I switch to the new one and I get this:

$ make training

mkdir -p data/checkpoints
lstmtraining \
  --continue_from   extracted/eng.lstm \
  --old_traineddata extracted/eng.traineddata \
  --traineddata data/eng/eng.traineddata \
  --model_output data/checkpoints/eng \
  --debug_interval -1 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --sequential_training \
  --max_iterations 3000
Loaded file extracted/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 76!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc76:76, 0
Total weights = 1404064
Previous null char=110 mapped to 75
Continuing from extracted/eng.lstm
Loaded 1/1 pages (1-1) of document data/train/mueller_waldhornist_1821_0130_010.lstmf
Loaded 1/1 pages (1-1) of document data/train/bismarck_erinnerungen02_1898_0274_002.lstmf
Loaded 1/1 pages (1-1) of document data/train/spyri_heidi_1880_0062_005.lstmf
Loaded 1/1 pages (1-1) of document data/train/novalis_ofterdingen_1802_0210_001.lstmf
Iteration 0: ALIGNED TRUTH : Sparoͤfen kauft' ich auch und Sorgenstuͤhle,
Iteration 0: BEST OCR TEXT : l bd o D V fc ds ft hs D t' dsu PM )k ,„cGs D t' D„Gs 'A AKG„9„t d tft ü!Vt Eb ht Ac )k uF ' K,cGPFVts
File data/train/mueller_waldhornist_1821_0130_010.lstmf page 0 :

!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
Makefile:113: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Segmentation fault


What am I doing wrong?



Lorenzo


Shree Devi Kumar

Jun 29, 2018, 4:27:46 PM
to tesser...@googlegroups.com
You should be able to use the new makefile after you change all the directory locations to match your setup.

Change the language from frk to eng, though the sample training text seems to be non-English; in that case it is better to use the appropriate language traineddata, e.g. tessdata_best/deu.traineddata for German.


Lorenzo Bolzani

Jun 29, 2018, 6:17:35 PM
to tesser...@googlegroups.com

I think I found the problem. Running the new Makefile directly, I had this error:

make: *** No rule to make target 'data/train/alexis_ruhe01_1852_0018_022.box', needed by 'data/all-boxes'.  Stop.

The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script directly.

I also replaced the eng.traineddata with the one from here:


and it's training correctly. (It also works with the previous model, from https://github.com/tesseract-ocr/tessdata.)



One more question: I wanted to check whether the output character sets of the new and old models differ. I used:

combine_tessdata -u eng.traineddata orig

on both models and compared the unicharset files. I see that some characters are missing and some others are added. It looks good. Is this the correct way to check this?
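For reference, here is that comparison as a small helper (the function name and temp paths are mine; each argument is a unicharset unpacked with combine_tessdata -u, and note that the first line of a unicharset is the symbol count, so differing counts show up in the diff too):

```shell
# Show unicharset entries present in only one of the two models:
# lines prefixed '<' exist only in the first file, '>' only in the second.
# (Exits non-zero when the two sets are identical, since grep finds nothing.)
unicharset_diff() {
  sort "$1" > /tmp/ucs_old
  sort "$2" > /tmp/ucs_new
  diff /tmp/ucs_old /tmp/ucs_new | grep '^[<>]'
}
```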

In this way, can I train a model that, for example, only recognizes uppercase characters, or numbers, simply by providing only uppercase training data? Or is there something else to configure?


Thanks, bye

Lorenzo



Shree Devi Kumar

Jun 30, 2018, 4:19:42 AM
to tesser...@googlegroups.com
> The problem was a "-gt.txt" rather than a ".gt.txt" as in my train files. Now I can run your script directly.

Oh, I remember now. I had changed that for ease in renaming files for some reason.

> In this way can I train a model that, for example, only recognize uppercase characters, or numbers, simply by providing only uppercase training data? Or is there something else to configure?

You could try fine-tuning from English. Remove the line merging the unicharsets from my makefile (use the command from the original script). 300 iterations should be enough, as you are not adding any characters. Try to have a training text which resembles the kind of words that you expect to OCR.

Lorenzo Bolzani

Jul 2, 2018, 10:24:14 AM
to tesser...@googlegroups.com
Hi Shree,
I replaced the line:

 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"

with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"

(I write this in case someone else is following this thread).

And now I have a brand-new fine-tuned model with only the characters I need. Nice :)

For the training I'm using actual crops from the documents I need to OCR, painfully hand-labeled.

As for the number of iterations, I'm still trying to figure that out. I've seen that there is a train/eval split; I've set it to 80/20.

I did 300/600/1000/5000/7500/10000 iterations and checked the model with:

lstmeval --model export/$1.traineddata --eval_listfile data/list.eval 2>&1 | grep iteration

and I see that the eval error keeps going down, with a big drop from 1.17 to 0.5 going from 7500 to 10000 iterations. My characters are very noisy and irregular, and my lines are very short, 1 to 4 words at most. Maybe this is why I need more iterations.

I'm fine-tuning from Italian, the language of my documents; I'll try eng too to see if it works better. Now that the pipeline is in place it's easy to try different options.


Thank you for your help so far.


Bye

Lorenzo



Raniem

Sep 6, 2018, 11:02:54 AM
to tesseract-ocr
Hi @Lorenzo Blz,


How many data lines and iterations have you used in your fine-tuning?
In your last reply you have mentioned you replaced 
 merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"

with:

 cp "$(TRAIN)/my.unicharset" "data/unicharset"
which is very helpful, as I wanted to do the same and generate a new model without specific characters that I need to exclude from the unicharset. But the new model is always worse than my original model.

Can you please advise?

Regards

Lorenzo Bolzani

Sep 6, 2018, 2:48:13 PM
to tesser...@googlegroups.com
Hi Raniem,
I did 5 fine tunings for different fonts and text content with roughly these numbers:

iterations:   samples (training data)
750:            208 numbers (4 upper case + 5 digits each)
1000:          400 MRZ codes (22 uppercase chars each)
1800:          1000 numbers (10 digits each)
22500:        1664 words (from 8 to 30 uppercase chars each)
57500:        54800 words (from 4 to 30 chars each, alphanum, mixed case and font)

I work in this way:
- split the data into training/evaluation sets. OCR-D will do this for you. I use 80/20.
- train (fine-tune) for a few iterations, like 100, then run:

lstmeval --model data/YOUR_MODEL.traineddata --eval_listfile data/list.eval

to check the current accuracy on the evaluation set.
- resume the training up to 200 iterations (OCR-D will resume from the last checkpoint automatically), check the evaluation accuracy again, and so on. Repeat until the evaluation accuracy decreases for a few training steps in a row.

For small datasets I did 100, 200, ..., 1000, 1200, 1400, ..., and coarser steps for the large ones: 1000, 2000, .... Pick the model with the best evaluation score. This way you do not need to guess the number of iterations.
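The "pick the model with the best evaluation score" step is easy to automate once the numbers are logged; a sketch, assuming you collect one "iterations error" pair per lstmeval run in a file (the log format and names are mine):

```shell
# Print the entry with the lowest eval error from an "iterations error" log.
best_checkpoint() {
  sort -k2,2n "$1" | head -n1
}

# Example log and usage:
printf '5000 1.17\n7500 1.17\n10000 0.50\n' > /tmp/eval.log
best_checkpoint /tmp/eval.log   # prints: 10000 0.50
```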

You can find a more detailed description here:

I think the number of iterations depends on the type of text you are targeting: for digits you need only a few; for fixed-font uppercase text just a little more. For complex upper/lower multi-font text/numbers, like the last one, it takes more time.

For the training and evaluation images use the same height and border trimming that you will use for the real data (I used height=54px, no border).


Bye

Lorenzo


Raniem

Sep 6, 2018, 11:01:06 PM
to tesseract-ocr
Thanks for the detailed answer. I am giving it a shot and hoping to get some better results :)

Thanks for all your help and support

Best Regards



Raniem

Sep 10, 2018, 12:31:07 PM
to tesseract-ocr
Thanks Lorenzo.

Your method makes all the magic I needed.

One other question please: I am attempting to fine-tune only the last layer, so I have replaced the
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \

line in the lstmtraining command with:

--continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \
--append_index 5 --net_spec '[Lfx256 O1c69]'

but I am getting this error:
int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 222
Makefile:129: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Segmentation fault (core dumped)

Can anyone please advise on what I am doing wrong?
P.S. My unicharset contains 69 characters.


Regards

Lorenzo Bolzani

Sep 10, 2018, 12:52:21 PM
to tesser...@googlegroups.com

I think there is no need to change the network definition by appending layers with a limited number of output chars. The line you replaced already takes care of this with:

--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]"


I had this error when I was mixing best models with non-best models.

I would try to run again:

combine_tessdata -e base_model/eng.traineddata base_model/eng.lstm

to generate the eng.lstm from the "_best" model (the ones from /usr/share/tessdata are not the "_best" models).

Also see:



Bye

Lorenzo


Raniem

Sep 10, 2018, 1:38:38 PM
to tesseract-ocr
> I think there is no need to change the network definition appending layers with a limited number of output chars. The line you replaced already takes care of this with:

I am actually not doing that to limit the number of output chars; I am doing it because I thought this way I would only be tuning the final layer, keeping the weights of the other layers.
I wanted to experiment to see whether this gives even better performance with fewer iterations or data lines, without overfitting (please correct me if I am wrong and this change does not preserve the weights of the remaining layers).

I will double-check that I am not mixing models. Thanks for the advice :) I appreciate your time and the real-time response :)

Regards

Raniem

Sep 10, 2018, 2:08:03 PM
to tesseract-ocr
You were right regarding the different model types. Thanks :)

Lorenzo Bolzani

Sep 10, 2018, 7:51:54 PM
to tesser...@googlegroups.com
On Mon, Sep 10, 2018 at 15:38 Raniem <raniem...@gmail.com> wrote:
> I am actually doing that not to limit the number of output chars, I am doing it cause I thought this way I am only tuning the final layer as I wanted to keep the weights for other layers.
> I was trying to experiment whether this is going to give me even better performance with a fewer number of iterations or data lines without over fitting (please correct me if i am wrong whether this update is not maintaining the weights in the remaining layers).
 
Ok, now I got it. I never did this myself, and I suppose this is where you are coming from:

If I got this right, you do not really freeze the lower layers; you just replace the final ones with new untrained layers (the training will then update all the weights as needed, even if the impact on the lower ones should be minor). Honestly I cannot see why this should be better than simple fine-tuning unless the "font" you are training on is completely different from the ones learned by the base model. But, having enough data, I think it's worth trying.

But I expect this is going to require more data and more iterations than simple fine-tuning, as the docs seem to suggest.


> I will double check that I am not mixing models. Thanks for the advice :) appreciate your time and the real time response :)

You are welcome. I just remember how difficult it was to make sense of all those "Assert failed" messages :)

Bye

Lorenzo

Raniem

Sep 12, 2018, 9:21:13 AM
to tesseract-ocr
You were right again, actually :)
I will stick with the simple fine tuning.
However I wouldn't have been able to experiment with the other scenarios without your help. Thanks! All is working perfectly well.

Regards

Varun Sab

Sep 18, 2018, 12:29:03 PM
to tesseract-ocr
Hi @Lorenzo Blz,
    I am also getting the same segmentation fault error. Can you please suggest how you solved it?

Shree Devi Kumar

Sep 18, 2018, 3:54:53 PM
to tesser...@googlegroups.com
If you are getting error

!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244

You are probably using the traineddata file which has an `integer` model.

Please use tessdata_best as base for further training.


Varun Sab

Sep 19, 2018, 7:20:16 AM
to tesseract-ocr
Thank you so much. That worked. :)

Tairen Chen

May 2, 2019, 9:48:15 PM
to tesseract-ocr
Hi, Lorenzo and Shree

     Thanks for your sharing.
     I am trying to repeat what you have done here. 
     I followed your posts and changed the Makefile, but when I run $ make training,
     I got the following errors: 
           mkdir -p data/checkpoints
           lstmtraining \
  --continue_from     extracted/eng.lstm \
  --old_traineddata   extracted/eng.traineddata \
  --traineddata data/eng/eng.traineddata \
  --model_output data/checkpoints/eng \
  --debug_interval -1 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --sequential_training \
  --max_iterations 3000
Must provide a --traineddata see training wiki
Makefile:111: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Error 1

      However, I can manually run:

      lstmtraining \
        --traineddata data/eng/eng.traineddata \
        --continue_from extracted/eng.lstm \
        --old_traineddata extracted/eng.traineddata \
        --model_output data/checkpoints/eng \
        --debug_interval -1 \
        --train_listfile data/list.train \
        --eval_listfile data/list.eval \
        --sequential_training \
        --max_iterations 3000
      
      I don't know what to change; I am new to Tesseract and to Makefiles. Please share your wisdom.
      Thank you!
All the best,
                            Tairen

Lorenzo Bolzani

May 2, 2019, 10:10:58 PM
to tesser...@googlegroups.com
Hi Tairen,
the error is quite clear:

Must provide a --traineddata see training wiki

You say that it works if you run it as a single line, so I suppose there is something wrong in the makefile, probably a typo. Maybe there is a space or a tab after a "\"?

Maybe there are some extra characters from copying and pasting from an email. The --traineddata option is on the third line, so it is likely something on line 2 or 3.

If you cannot find the problem, check out the project again and start over. Run it after each single change you make to see if/when it breaks.
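A mechanical way to spot the whitespace-after-backslash problem (the grep pattern and function name are mine):

```shell
# List lines where '\' is followed by trailing spaces or tabs: the '\'
# then no longer continues the line, and make runs a truncated command.
check_continuations() {
  grep -nE '\\[[:blank:]]+$' "$1"
}
```

`cat -A Makefile` is also handy here, since it prints tabs as `^I` and marks line ends with `$`.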



Lorenzo




Tairen Chen

May 3, 2019, 1:48:24 AM
to tesseract-ocr
Thank you very much for your quick answer, Lorenzo!

    You are right: it was an extra space at the beginning of the line where "TESSDATA" is defined, not at the "lstmtraining" line.
    
    I still have a few questions to ask you.

    1. I set "--max_iterations 20000" but the training stops at 5700, like below:
         " At iteration 351/5700/5700, Mean rms=0.117%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  wrote checkpoint"

         I assume 5700 is the iteration number, but I do not know what 351 means here. Also, why does the training stop at 5700, not at 10000 or some other number less than 20000? Is there an RMS threshold or some other condition that stops the training, or is it because I have a small number of training images?

    2. I can generate the "eng.traineddata" using the weights from "tessdata_best", but not from "tessdata". Shree said this is because the "tessdata" weights are an `integer` model. What does "integer" model mean? Can we generate the "eng.traineddata" from the "tessdata" model?
    
     3. Meanwhile, I notice that the "eng.traineddata" I generated is smaller than the model from "tessdata" (11.7M vs 23.5M). So does the "tessdata" model have more parameters than the model from "tessdata_best"? What is the difference between the two?
    
     Thank you!
All the best,
                           Tairen

Lorenzo Bolzani

May 3, 2019, 9:30:12 AM
to tesser...@googlegroups.com
See answer inline.

On Fri, May 3, 2019 at 03:48 Tairen Chen <chent...@gmail.com> wrote:

> 1. I define the "--max_iterations 20000" but the training stops at 5700, like below:
> " At iteration 351/5700/5700, Mean rms=0.117%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  wrote checkpoint"
>
> I assume 5700 is the iteration number, but I do not know what is 351 mean here. Meanwhile why the training stops at 5700, not at 10000, or other numbers that less than 20000? I think there may be "rms" definition to stop the training or any other conditions? or because I have a small number of training images?

You are already at 100% accuracy on your training set (0% error), so there is no point in training further.

This obviously does not mean that you'll get 100% accuracy on your real-world data.

See this thread on how to train for a reasonable number of epochs (training too much is bad):


I do not remember what 351 means.

 
    2. I can generate "eng.traineddata" using the weights from "tessdata_best", but not from "tessdata". Shree said this is because the weights in "tessdata" are an `integer` model. What does an "integer" model mean? Can we generate "eng.traineddata" from the "tessdata" model?

What's the problem with the best model? An integer model is a "simplified/compressed" model that trades some accuracy for better speed. You can convert your trained model to an integer one at the end of the training using combine_tessdata -c, but I have never done it.
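If you want to try it, a minimal sketch of that conversion could look like this (assuming a float "best" model is available locally; the paths are placeholders):

```shell
# combine_tessdata -c converts the LSTM component to integer in place,
# so work on a copy of the float ("best") model.
cp tessdata_best/eng.traineddata eng_int.traineddata
combine_tessdata -c eng_int.traineddata
```

The resulting file should be noticeably smaller and faster at inference, at the cost of some accuracy.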

    
     3. Meanwhile, I notice that the "eng.traineddata" I generated is smaller than the model from "tessdata" (11.7M vs 23.5M). So does the "tessdata" model have more parameters than the model from "tessdata_best"? What is the difference between these two?

No, tessdata_best is a bigger model (more parameters). There are three model sizes: best, normal and fast. Each of these can also be converted to an integer model.

The models I get from fine tuning the eng model are about 6.7 MB.

A traineddata file is an archive, like a zip; maybe you are including fewer files than the original (other than the neural network model itself). But I do not know much about the traineddata details. Maybe best is the LSTM model only, while "normal" includes the 3.x models too? You can use combine_tessdata -u to extract all the contents and check.
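Something like this should show which components each file contains (filenames are placeholders for your own models):

```shell
# Unpack the traineddata archive into its individual components
# (lstm network, unicharset, dawgs, ...) and compare their sizes.
mkdir -p unpacked
combine_tessdata -u eng.traineddata unpacked/eng.
ls -l unpacked/
```

Running it on both files and comparing the listings should tell you where the size difference comes from.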


Bye

Lorenzo

To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.

To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.

Shree Devi Kumar

unread,
May 3, 2019, 9:59:12 AM5/3/19
to tesser...@googlegroups.com
>There are three model sizes: best, normal and fast. Each of these can also be converted to an integer model.

Only `best` can be converted to integer and in fact the LSTM models in `tessdata` are the integer versions of best along with the base/legacy models.

`fast` models have been trained with a smaller network spec compared to `best` and have been converted to integer. 

So, fast and normal are already integer. You can convert `best` to integer (that file will be smaller than `normal` because it does not have the base/legacy models).

>maybe you are including fewer files than the original (other than the neural network model itself).

If you do not use the wordlist, numbers and punc files then the dawgs will not be built. Some traineddata files are really large because of these.

Lorenzo Bolzani

unread,
May 3, 2019, 10:22:17 AM5/3/19
to tesser...@googlegroups.com

Shree, thanks for the clarification.


Tairen Chen

unread,
May 3, 2019, 4:18:59 PM5/3/19
to tesseract-ocr
Hi, Lorenzo,
    Thank you very much for your reply. It really gives me more clues about the training.
All the best,
                      Tairen 

Tairen Chen

unread,
May 3, 2019, 4:19:47 PM5/3/19
to tesseract-ocr
Thank you for your further explanation, Shree!!

Ayush Pandey

unread,
Sep 5, 2019, 7:55:02 AM9/5/19
to tesseract-ocr

Tesseract Version: 4.1.0

I am trying to fine-tune tesseract on a custom dataset with the following Makefile:

export

SHELL := /bin/bash
HOME := $(PWD)
TESSDATA = $(HOME)/tessdata
LANGDATA = $(HOME)/langdata

# Train directory
# TRAIN := $(HOME)/train_data
TRAIN := /media/vimaan/Data/OCR/tesseract_train

# Name of the model to be built
MODEL_NAME = eng
LANG_CODE = eng

# Name of the model to continue from
CONTINUE_FROM = eng

TESSDATA_REPO = _best

# Normalization Mode - see src/training/language_specific.sh for details 
NORM_MODE = 1

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"

# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
#lists: $(ALL_LSTMF) data/list.train data/list.eval
lists: $(ALL_LSTMF)

train-lists: data/list.train data/list.eval

data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l`; \
	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	   tail -n "$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	mkdir -p data/$(START_MODEL)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata  $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	#merge_unicharsets data/$(START_MODEL)/$(START_MODEL).lstm-unicharset $(GROUND_TRUTH_DIR)/my.unicharset  "$@"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset  "$@"
	
$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
	
$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%.gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*.gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --dpi 300 --psm 7 lstm.train
	

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 170000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	--stop_training \
	--continue_from $^ \
	--old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	--traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	--model_output $@

# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints

The number of .lstmf files being generated is significantly lower than the number of .box files. For example:
Number of .tif files: 10k
Number of .gt.txt files: 10k
Number of .box files: 10k
Number of .lstmf files: 8k

Could anyone point me to the possible reasons for this issue?
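For reference, this is roughly how I compare the two sets to find the images that have a .tif but no matching .lstmf (TRAIN is my training directory):

```shell
# Collect the basenames of .tif and .lstmf files and diff the sorted
# lists; whatever is left had no .lstmf generated for it.
TRAIN=${TRAIN:-.}
find "$TRAIN" -name '*.tif'   | sed 's/\.tif$//'   | sort > /tmp/tifs.txt
find "$TRAIN" -name '*.lstmf' | sed 's/\.lstmf$//' | sort > /tmp/lstmfs.txt
comm -23 /tmp/tifs.txt /tmp/lstmfs.txt
```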

Shree Devi Kumar

unread,
Sep 5, 2019, 8:30:20 AM9/5/19
to tesseract-ocr

Ayush Pandey

unread,
Sep 5, 2019, 8:48:27 AM9/5/19
to tesseract-ocr
Hi Shree,
             Thank you so much for your response. I also wanted to ask: after training I get an empty output on a lot of images, even though the height and width of the images are usually > 100 pixels. Apart from changing the psm value, is there any other way to reduce this?

Lorenzo Bolzani

unread,
Sep 6, 2019, 7:04:22 AM9/6/19
to tesser...@googlegroups.com
Can you please share an example?

An empty output usually means that it failed to recognize the black parts as text; this could be because the text is too big or too small, or because of a wrong dpi setting. Or the image is not reasonably clean.

To better understand the problem you can try to downscale the images (according to some tests done by a user on this forum, 35/50px is what worked best for him), try different dpi settings, remove borders, denoise, etc. Compare the images that work with the ones that do not.
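As a rough sketch of such a cleanup pass (not something I have tuned, and assuming ImageMagick is available; the 40px height and 60% threshold are arbitrary starting points):

```shell
# Grayscale, scale the line image down, add a white border and binarize.
convert line.png -colorspace Gray -resize x40 \
        -bordercolor white -border 10 -threshold 60% line_clean.png
```

Try a few variations of the size and threshold and compare the OCR results on your eval images.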



Lorenzo







Ayush Pandey

unread,
Sep 6, 2019, 9:04:15 AM9/6/19
to tesseract-ocr
Hi Lorenzo. The empty output was due to the fact that I was using 7 as PSM parameter. Using 13 as PSM parameter completely eliminated the problem.

Lorenzo Bolzani

unread,
Sep 6, 2019, 9:24:49 AM9/6/19
to tesser...@googlegroups.com
Hi Ayush,
PSM 6 and 7 do some extra pre-processing of the image; 13 does much less.

Unless your image contains text like this:

----
====
....

I would not expect much difference between PSM 6/7 and 13. While PSM 13 solves some problems, I got more "ghost letter" errors (letters that are repeated more than once, or split into similar variations, like O becoming O0). So it may not be an ideal solution.

Also there is no reason why a clean single line of text should not work with 6 or 7.

For some single line images with messy background I found that PSM 6 works better than 7.


Lorenzo


Ayush Pandey

unread,
Sep 8, 2019, 2:05:51 PM9/8/19
to tesseract-ocr
Hi Lorenzo, Shree
I have a few questions:
  1. In the link provided by Shree -> https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#tesseract-fails-to-create-lstm-files it says that .lstmf files are not generated for some images if you use the default list.train settings. Using PSM=13 helps build those lstmf files, whereas using PSM=6 or 7 skips them. Any clues as to why that is the case? Tesseract does give me output text for those images with PSM values 6, 7 and 13.
  2. If I use PSM 13 for generating the lstmf files used for training, is it okay to use PSM values 6 and 7 while testing?
  3. How can I check the contents of the lstmf files to see if they contain the ground truth text and the image data correctly?
  4. Side question: lstmtraining saves the checkpoints in the format loss_iteration. It saves the checkpoints for a few iterations with the best loss (apart from eng_checkpoint, which contains the metadata, I guess). Is the loss calculated on the training data or the evaluation data? Is there a way to save all checkpoints?
  5. Side question: does lstmeval use the psm value with which the lstmf file was generated for evaluation?
I know it's a lot of questions and doubts. Thank you for your time in helping me out.

Lorenzo Bolzani

unread,
Sep 8, 2019, 5:03:09 PM9/8/19
to tesser...@googlegroups.com
Hi Ayush,
usually training images are denoised much more than yours. I think the standard models are trained on pure black text on a pure white background, maybe with a little noise. It could still work on these images, especially with fine tuning, but this is not the typical training data, so I'm not surprised you have problems.

Anyway, I think your problem here is with segmentation, not with the LSTM model. I suppose segmentation is done with thresholding and component analysis, and these are quite sensitive to noise.

I suspect the problem with the SW-something image might be the small fragments on top, while the 3-M-something image is probably fooled by the red line at the bottom. You could do component analysis to clear these fragments, but with this amount of noise it is very hard. If you can, try to crop tighter (and see if it helps).

About your questions:
1. I suppose the segmentation step during training is different: it should use the box files rather than doing the page analysis. PSM 6/7 do some extra cleanup. I do not know why it fails.
2. I do the training with PSM 6 and, for one model, I use 13 at runtime and it works fine. A fine tuning training for me usually takes less than one hour, so when I have doubts like these I just try all the alternatives and see what works best on the eval set.
3. No idea. Check the box file too.

4. I manually do incremental training: 100 iterations, save the model, run lstmeval, 200 iterations, save the model, lstmeval, etc.
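Roughly like this (the paths and model names are from my setup, and this is a sketch rather than a polished script):

```shell
# Train in increasing chunks of iterations (--max_iterations is the
# total, so each call continues where the previous one stopped), then
# evaluate the latest checkpoint on the eval list with lstmeval.
for iters in 100 200 400 800; do
  lstmtraining \
    --continue_from  data/checkpoints/prova_checkpoint \
    --traineddata    data/prova/prova.traineddata \
    --model_output   data/checkpoints/prova \
    --train_listfile data/list.train \
    --max_iterations "$iters"
  lstmeval \
    --model data/checkpoints/prova_checkpoint \
    --traineddata data/prova/prova.traineddata \
    --eval_listfile data/list.eval
done
```

This way you can see when the eval error stops improving and pick the checkpoint from that point.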


5. I do not know.


Lorenzo
