I need to train Tesseract on images of a few symbols like '?,<,' etc. Following the [docs][1] for 4.0, I tested this step:

    src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
      --noextract_font_properties --langdata_dir ../langdata \
      --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

which actually runs the following steps:

    /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.cGLxwSj3wP --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/eng-2019-01-12.dy8/eng.FreeMono.exp0 --max_pages=0 --font=FreeMono --text=../langdata/eng/eng.training_text
    /usr/local/bin/unicharset_extractor --output_unicharset /tmp/eng-2019-01-12.dy8/eng.unicharset --norm_mode 1 /tmp/eng-2019-01-12.dy8/eng.FreeMono.exp0.box
    /usr/local/bin/set_unicharset_properties -U /tmp/eng-2019-01-12.dy8/eng.unicharset -O /tmp/eng-2019-01-12.dy8/eng.unicharset -X /tmp/eng-2019-01-12.dy8/eng.xheights --script_dir=../langdata
    /usr/local/bin/tesseract /tmp/eng-2019-01-12.dy8/eng.FreeMono.exp0.tif /tmp/eng-2019-01-12.dy8/eng.FreeMono.exp0 --psm 6 lstm.train
    /usr/local/bin/combine_lang_model --input_unicharset /tmp/eng-2019-01-12.dy8/eng.unicharset --script_dir ../langdata --words ../langdata/eng/eng.wordlist --numbers ../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir /home/faizan/tesstutorial/engtrain --lang eng

In my case I already have the tif images, and I can create the box files with any GUI tool, so I could run these steps individually, starting from step 2 (`unicharset_extractor`). Is that all? That is, do I only have to move the resulting `eng.traineddata` file to the `tessdata` folder, or are there more steps to follow, like this one?

    training/lstmtraining --debug_interval 100 \
      --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
      --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
      --model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \
      --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
      --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
      --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log
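
To make the question concrete, here is a minimal dry-run sketch of what I mean by starting from step 2. It only prints the four commands for one existing tif/box pair, reusing the flags from the log above. `WORK`, `BASE`, `LANGDATA`, `LANG_CODE` and `OUT` are placeholder names I made up for illustration, and I have not confirmed that this sequence is sufficient on its own.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints (does not execute) steps 2-5 for one tif/box pair.
# All variables below are hypothetical placeholders, not paths from a real setup.
set -eu

WORK="${WORK:-/tmp/mytrain}"          # assumed folder holding the .tif and .box files
BASE="${BASE:-eng.myfont.exp0}"       # assumed base name shared by the tif/box pair
LANGDATA="${LANGDATA:-../langdata}"   # langdata checkout, as in the log above
LANG_CODE="${LANG_CODE:-eng}"
OUT="${OUT:-$WORK/out}"               # assumed output folder for the traineddata

print_steps() {
  # Step 2: build the unicharset from the hand-made box file
  echo "unicharset_extractor --output_unicharset $WORK/$LANG_CODE.unicharset --norm_mode 1 $WORK/$BASE.box"
  # Step 3: set the unicharset properties (same flags as in the log)
  echo "set_unicharset_properties -U $WORK/$LANG_CODE.unicharset -O $WORK/$LANG_CODE.unicharset -X $WORK/$LANG_CODE.xheights --script_dir=$LANGDATA"
  # Step 4: run tesseract on the tif with the lstm.train config
  echo "tesseract $WORK/$BASE.tif $WORK/$BASE --psm 6 lstm.train"
  # Step 5: combine everything into the starter traineddata
  echo "combine_lang_model --input_unicharset $WORK/$LANG_CODE.unicharset --script_dir $LANGDATA --words $LANGDATA/$LANG_CODE/$LANG_CODE.wordlist --numbers $LANGDATA/$LANG_CODE/$LANG_CODE.numbers --puncs $LANGDATA/$LANG_CODE/$LANG_CODE.punc --output_dir $OUT --lang $LANG_CODE"
}

print_steps
```

Running it with, for example, `WORK=~/symbols BASE=eng.symbols.exp0` set in the environment prints the commands with those paths substituted, so they can be checked before anything is actually run.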