--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
I think there is no need to change the network definition by appending layers with a limited number of output characters. The line you replaced already takes care of this with:
I am not doing that to limit the number of output chars. I am doing it because I thought that this way I would tune only the final layer, since I wanted to keep the weights of the other layers. I wanted to test whether this gives even better performance with fewer iterations or fewer data lines, without overfitting. (Please correct me if I am wrong and this update does not preserve the weights in the remaining layers.)
I will double-check that I am not mixing models. Thanks for the advice :) I appreciate your time and the real-time response :)
--continue_from extracted/eng.lstm \
--old_traineddata extracted/eng.traineddata \
--traineddata data/eng/eng.traineddata \
--model_output data/checkpoints/eng \
--debug_interval -1 \
--train_listfile data/list.train \
--eval_listfile data/list.eval \
--sequential_training \
--max_iterations 3000
Must provide a --traineddata see training wiki
Makefile:111: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Error 1
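One possible explanation, which the thread does not confirm, is that a line continuation in the recipe broke (for example, a space after a trailing backslash), so lstmtraining never saw the --traineddata flag. Below is a minimal Python sketch that sidesteps shell continuations by building the argument list directly. The helper name `build_lstmtraining_args` and the `OPTIONS` layout are hypothetical; only flags quoted in the message above are used.

```python
# Hypothetical helper: build the lstmtraining argument list without shell
# line continuations. Flag names and values are copied from the message above.
OPTIONS = {
    "continue_from": "extracted/eng.lstm",
    "old_traineddata": "extracted/eng.traineddata",
    "traineddata": "data/eng/eng.traineddata",
    "model_output": "data/checkpoints/eng",
    "debug_interval": -1,
    "train_listfile": "data/list.train",
    "eval_listfile": "data/list.eval",
    "sequential_training": True,  # True marks a switch that takes no value
    "max_iterations": 3000,
}


def build_lstmtraining_args(options):
    """Turn a dict of flag -> value into an argv list for lstmtraining."""
    if not options.get("traineddata"):
        # Fail early, mirroring the error text reported in the thread.
        raise ValueError("--traineddata is missing or empty")
    args = ["lstmtraining"]
    for flag, value in options.items():
        args.append(f"--{flag}")
        if value is not True:
            args.append(str(value))
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)`. Because every flag is a separate list element, no backslash continuation is involved.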
1. I set "--max_iterations 20000", but training stops at 5700, as shown below: "At iteration 351/5700/5700, Mean rms=0.117%, delta=0%, char train=0%, word train=0%, skip ratio=0%, wrote checkpoint." I assume 5700 is the iteration number, but I do not know what 351 means here. Also, why does training stop at 5700 rather than at 10000 or some other number below 20000? Is there an "rms" threshold or some other condition that stops training? Or is it because I have a small number of training images?
2. I can generate "eng.traineddata" using the weights from "tessdata_best", but not from "tessdata". Shree said this is because the "tessdata" weights are an `integer` model. What does "integer" model mean? Can we generate "eng.traineddata" from the "tessdata" model?
3. I also notice that the "eng.traineddata" I generated is smaller than the model from "tessdata" (11.7M vs. 23.5M). Does the "tessdata" model have more parameters than the model from "tessdata_best"? What is the difference between the two?
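The three numbers in the log line from question 1 can be pulled apart mechanically. A minimal Python sketch follows. The field names `learning_iteration`, `training_iteration`, and `sample_iteration` are an assumption about how lstmtraining orders these counters and should be verified against the Tesseract source; `parse_log_line` is a hypothetical helper.

```python
import re

# Pattern for the progress line quoted in question 1. The three iteration
# field names are assumed, not confirmed by the thread.
LOG_PATTERN = re.compile(
    r"At iteration (?P<learning_iteration>\d+)/"
    r"(?P<training_iteration>\d+)/"
    r"(?P<sample_iteration>\d+), "
    r"Mean rms=(?P<mean_rms>[\d.]+)%, "
    r"delta=(?P<delta>[\d.]+)%, "
    r"char train=(?P<char_train>[\d.]+)%, "
    r"word train=(?P<word_train>[\d.]+)%, "
    r"skip ratio=(?P<skip_ratio>[\d.]+)%"
)

INT_FIELDS = ("learning_iteration", "training_iteration", "sample_iteration")


def parse_log_line(line):
    """Return the numeric fields of one progress line, or None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        name: int(value) if name in INT_FIELDS else float(value)
        for name, value in match.groupdict().items()
    }
```

Running this over a whole training log makes it easy to see which counter is approaching `--max_iterations` and how the error rates evolve.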
See answer inline.
Tesseract Version: 4.1.0
I am trying to fine-tune Tesseract on a custom dataset with the following Makefile:
export

SHELL := /bin/bash
HOME := $(PWD)
TESSDATA = $(HOME)/tessdata
LANGDATA = $(HOME)/langdata

# Train directory
# TRAIN := $(HOME)/train_data
TRAIN := /media/vimaan/Data/OCR/tesseract_train

# Name of the model to be built
MODEL_NAME = eng
LANG_CODE = eng

# Name of the model to continue from
CONTINUE_FROM = eng

TESSDATA_REPO = _best

# Normalization Mode - see src/training/language_specific.sh for details
NORM_MODE = 1

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
	@echo ""
	@echo "  Targets"
	@echo ""
	@echo "    unicharset       Create unicharset"
	@echo "    lists            Create lists of lstmf filenames for training and eval"
	@echo "    training         Start training"
	@echo "    proto-model      Build the proto model"
	@echo "    leptonica        Build leptonica"
	@echo "    tesseract        Build tesseract"
	@echo "    tesseract-langs  Download tesseract-langs"
	@echo "    langdata         Download langdata"
	@echo "    clean            Clean all generated files"
	@echo ""
	@echo "  Variables"
	@echo ""
	@echo "    MODEL_NAME         Name of the model to be built"
	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
	@echo "    TRAIN              Train directory"
	@echo "    RATIO_TRAIN        Ratio of train / eval training data"

# END-EVAL

# Ratio of train / eval training data
RATIO_TRAIN := 0.90

ALL_BOXES = data/all-boxes
ALL_LSTMF = data/all-lstmf

# Create unicharset
unicharset: data/unicharset

# Create lists of lstmf filenames for training and eval
#lists: $(ALL_LSTMF) data/list.train data/list.eval
lists: $(ALL_LSTMF)

train-lists: data/list.train data/list.eval

data/list.train: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l` \
	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
	   head -n "$$no" $(ALL_LSTMF) > "$@"

data/list.eval: $(ALL_LSTMF)
	total=`cat $(ALL_LSTMF) | wc -l` \
	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
	   tail -n "$$no" $(ALL_LSTMF) > "$@"

# Start training
training: data/$(MODEL_NAME).traineddata

data/unicharset: $(ALL_BOXES)
	mkdir -p data/$(START_MODEL)
	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
	#merge_unicharsets data/$(START_MODEL)/$(START_MODEL).lstm-unicharset $(GROUND_TRUTH_DIR)/my.unicharset "$@"
	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"

$(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"

$(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%.gt.txt
	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*.gt.txt" > "$@"

$(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"

$(TRAIN)/%.lstmf: $(TRAIN)/%.box
	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --dpi 300 --psm 7 lstm.train

# Build the proto model
proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata

data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
	combine_lang_model \
	  --input_unicharset data/unicharset \
	  --script_dir $(LANGDATA) \
	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
	  --output_dir data/ \
	  --lang $(MODEL_NAME)

data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset proto-model
	mkdir -p data/checkpoints
	lstmtraining \
	  --continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output data/checkpoints/$(MODEL_NAME) \
	  --debug_interval -1 \
	  --train_listfile data/list.train \
	  --eval_listfile data/list.eval \
	  --sequential_training \
	  --max_iterations 170000

data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
	lstmtraining \
	  --stop_training \
	  --continue_from $^ \
	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
	  --model_output $@

# Clean all generated files
clean:
	find data/train -name '*.box' -delete
	find data/train -name '*.lstmf' -delete
	rm -rf data/all-*
	rm -rf data/list.*
	rm -rf data/$(MODEL_NAME)
	rm -rf data/unicharset
	rm -rf data/checkpoints
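The `data/list.train` and `data/list.eval` rules in the Makefile above split the shuffled list by line count, using `bc` with a final `/ 1` to truncate. A minimal Python sketch of the same arithmetic follows, assuming `bc` runs at its default scale of 0 so that the division truncates toward zero. The function names are hypothetical.

```python
from decimal import Decimal


def split_counts(total, ratio="0.90"):
    """Return (n_train, n_eval) as the two bc expressions would compute them."""
    r = Decimal(ratio)
    t = Decimal(total)
    n_train = int(t * r)      # "$$total * RATIO_TRAIN / 1", truncated
    n_eval = int(t - t * r)   # "($$total - $$total * RATIO_TRAIN) / 1", truncated
    return n_train, n_eval


def split_list(lines, ratio="0.90"):
    """Split lines like `head -n n_train` and `tail -n n_eval` would."""
    n_train, n_eval = split_counts(len(lines), ratio)
    train = lines[:n_train]
    # tail -n 0 yields nothing, whereas lines[-0:] would yield everything.
    evaluation = lines[len(lines) - n_eval:] if n_eval else []
    return train, evaluation
```

With 8000 list entries this gives 7200 training and 800 evaluation lines. Because both counts are truncated independently, a total that is not a multiple of 10 can leave one entry in neither list.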
The number of .lstmf files generated is significantly lower than the number of .box files.
For example:
Number of .tif files: 10k
Number of .gt.txt files: 10k
Number of .box files: 10k
Number of .lstmf files: 8k
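A first diagnostic step is to find out which images produced no .lstmf file. A minimal Python sketch follows, assuming all files sit directly in the train directory and share a base name, as the pattern rules in the Makefile imply. `missing_lstmf` is a hypothetical helper.

```python
from pathlib import Path


def missing_lstmf(train_dir):
    """Return sorted base names of .tif files that have no matching .lstmf."""
    train_dir = Path(train_dir)
    tif_stems = {path.stem for path in train_dir.glob("*.tif")}
    lstmf_stems = {path.stem for path in train_dir.glob("*.lstmf")}
    return sorted(tif_stems - lstmf_stems)
```

Running the Makefile's own `tesseract <name>.tif <name> --dpi 300 --psm 7 lstm.train` command by hand on one of the reported names should show the message that `make` output may have buried.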