Can I mix tiff/box files generated by ocrd-train with the original training data used to train a specific language in tesseract4 (from the langdata directory)?


Raniem AROUR

Sep 4, 2018, 9:04:40 AM
to tesseract-ocr
Hello,

I am trying to fine-tune dan.traineddata for my specific use case. However, the model is overfitting on my data and seems to be forgetting the original data it was trained on. I remember reading somewhere that this can be solved by showing the original training data to the network so that I don't regress on the original performance.

I have images and their corresponding ground-truth files, so I used ocrd-train to do the fine-tuning earlier (following some advice found in this thread, thanks to Shree).
I then mixed my training data with the original training data using the hints provided by Shree in this thread.

The command I used after updating tesstrain.sh as recommended was:

~/tesseract/src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang dan --linedata_only \
  --noextract_font_properties --langdata_dir /home/my_user/ocrd-train/langdata \
  --tessdata_dir /home/my_user/tesseract/tessdata \
  --output_dir /home/my_user/my_models/danNew/



Then I ran "make training" in the ocrd-train directory as I usually do for fine-tuning. The fine-tuning started; however, I got some errors that I believe result from the original data, e.g.:

Encoding of string failed! Failure bytes: ffffffc3 ffffffb6 20 65 72 20 31 2e 34 35 24 2e 20 74 69 64 6c 69 67 65 72 65 20 31 37 2e 20 68 61 76 65 20 6d 61 6e 67 65 20 4e 59 20 2d 20 76 ffffffc3 ffffffa6 72 65 20 69 20 53 ffffffc3 ffffff85 20 43 61 6e 61 6c 2b 20 6f 67
Can't encode transcription: 'har Søg butik været blevet Ifö er 1.45$. tidligere 17. have mange NY - være i SÅ Canal+ og' in language ''
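For reference, the failure bytes can be decoded back into the characters the trainer could not encode (ö, æ, Å in this case). This is a quick sketch of my reading of the log, assuming only that the ffffff prefix is sign extension of a signed char:

```python
# Decode the "Failure bytes" from the lstm.train log. The ffffff prefix is
# sign extension of a signed char, so keeping the low byte of each token and
# decoding as UTF-8 recovers the characters missing from the unicharset.
failure = (
    "ffffffc3 ffffffb6 20 65 72 20 31 2e 34 35 24 2e 20 74 69 64 6c 69 67 "
    "65 72 65 20 31 37 2e 20 68 61 76 65 20 6d 61 6e 67 65 20 4e 59 20 2d "
    "20 76 ffffffc3 ffffffa6 72 65 20 69 20 53 ffffffc3 ffffff85 20 43 61 "
    "6e 61 6c 2b 20 6f 67"
)
raw = bytes(int(tok, 16) & 0xFF for tok in failure.split())
print(raw.decode("utf-8"))
# → ö er 1.45$. tidligere 17. have mange NY - være i SÅ Canal+ og
```

So the failure starts exactly at the first non-ASCII character (ö), which suggests those characters are missing from the unicharset used for training.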

P.S. I know the box files produced by ocrd-train look different from the usual box format used for training tesseract4, but this worked when fine-tuning other models; I was wondering whether it is simply a bad idea to mix them this way.

What could have gone wrong in this process? I appreciate every suggestion.


Kind Regards

Shree Devi Kumar

Sep 4, 2018, 9:30:08 AM
to tesser...@googlegroups.com
For fine-tuning, I like to use the original unicharset along with the unicharset from the training set, so that all characters are included.

Please see below a modified Makefile that can be used for this - please adjust it to your requirements.

export

SHELL := /bin/bash
LOCAL := $(PWD)/usr
PATH := $(LOCAL)/bin:$(PATH)
HOME := /home/ubuntu
TESSDATA =  $(HOME)/tessdata_best
LANGDATA = $(HOME)/langdata

# Name of the model to be built
MODEL_NAME = san

# Name of the model to continue from
CONTINUE_FROM = san

# Normalization Mode - see src/training/language_specific.sh for details 
NORM_MODE = 2

# Tesseract model repo to use. Default: $(TESSDATA_REPO)
TESSDATA_REPO = _best

# Train directory
TRAIN := data/train

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"
# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
lists: $(ALL_LSTMF) data/list.train data/list.eval

data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l` \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l` \
	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	   tail -n "$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"

$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"

$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --psm 6 lstm.train

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --pass_through_recoder \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 3000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	  --stop_training \
	  --continue_from $^ \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output $@

# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints
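As a side note, the intended effect of the RATIO_TRAIN split in the list.train / list.eval targets (head takes the first 90%, the eval list gets the remaining 10%) can be sketched in Python. The file names below are hypothetical, for illustration only:

```python
# Sketch of the 90/10 train/eval split over the shuffled lstmf file list.
ratio_train = 0.90  # RATIO_TRAIN in the Makefile
lstmf_files = [f"line{i:04d}.lstmf" for i in range(100)]  # hypothetical list

total = len(lstmf_files)
n_train = int(total * ratio_train)   # bc "total * RATIO_TRAIN / 1"
train_list = lstmf_files[:n_train]   # head -n "$no" -> first 90%
eval_list = lstmf_files[n_train:]    # remaining 10% for evaluation

print(len(train_list), len(eval_list))
# → 90 10
```

The two lists must stay disjoint, otherwise the eval error rate reported during training is meaningless.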



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e9676a7b-7396-4d05-8978-97c9bfbc387f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

Raniem AROUR

Sep 4, 2018, 10:12:13 AM
to tesseract-ocr
Thanks Shree for your quick reply.
I have already used your modified version of the Makefile for fine-tuning, which you shared in one of the threads I referenced above. I also tried this one (which is the same except for passing --pass_through_recoder to combine_lang_model, which I will research to understand what difference it makes - or maybe you can advise me, please).

I appreciate the support, but my main question is about your suggestion of merging training data as in this thread:

I copy them to my langdata/language directory and then use a modified tesstrain.sh to copy them to the tmp training directory. tesstrain.sh changes:

```
mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"
cp ../langdata/${LANG_CODE}/*.box ${TRAINING_DIR}
cp ../langdata/${LANG_CODE}/*.tif ${TRAINING_DIR}
ls -l ${TRAINING_DIR}
source "$(dirname $0)/language-specific.sh"
```
After doing those steps, tesstrain.sh worked and generated *.lstmf files, which I copied to where my training data is, and I ran "make training" again. The process worked and generated a final model, but there were some errors like the one I quoted in my original post. The final unicharset is identical to the one from the original model, yet there is a regression in accuracy compared to the original.

I thought maybe it is a bad idea to merge data from ocrd-train with the original data, since the box formats look different, and wanted to get some advice.

Thanks, and I appreciate all the time you spend supporting people.


Regards

Shree Devi Kumar

Sep 4, 2018, 2:25:33 PM
to tesser...@googlegroups.com
My earlier suggestion of mixing the two kinds of images - scanned pages and text2image-created synthetic ones - was from before ocrd-train was available.

ocrd-train works on single-line images, while tesstrain.sh works on multipage tiffs. By mixing these, the single-line images will get more iterations during training.

--pass_through_recoder is needed for complex scripts such as Indic scripts and may not be needed for Latin-script-based languages.

For fine-tuning, the number of iterations should be very low: about 300-400 for a new font and 3000-4000 for adding a new character. More iterations will lead to overfitting, as you are seeing.

Please experiment with different options to see what works best for your language and testsets.

Raniem

Sep 5, 2018, 12:07:36 PM
to tesseract-ocr
Thanks Shree, appreciate your support

Regards