Ocr-d train - Tesseract 4.0 Training

sarat...@gmail.com

Feb 4, 2019, 2:34:27 AM

to tesseract-ocr

I am a beginner at OCR training. Can anyone briefly explain how to use ocrd-train?

I have Tesseract and the Leptonica library installed in Cygwin:

tesseract 4.0.0

leptonica-1.77.0

libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX2

Found AVX

Found SSE

I want to train on handwritten digits because the default traineddata does not recognize them correctly. I searched the group and found no detailed instructions. I have used a combination of OpenCV and Python Tesseract for OCR of printed text, and I moved to Linux in order to train for handwritten digits. Kindly provide step-by-step instructions; they may help others as well. I have attached sample images of the kind that need training. Thanks in advance.

182_p_3_3141_4624_s_10.9666748046875_3162.152_4642.595_3116.325_4633.715_3121.517_4606.92_3167.344_4615.8_27.29338_46.67888.png

203_p_3_2379_4994_s_19.9819488525391_2393.411_5011.368_2357.152_4998.184_2364.667_4977.516_2400.927_4990.7_21.99221_38.58244.png

122_p_3_3955_3697_s_359.812889352441_3927.577_3714.418_3927.47_3681.717_3983.721_3681.534_3983.828_3714.234_56.25177_32.70107.png

180_p_4_3955_4574_s_357.137640714645_3925.837_4593.208_3924.161_4559.692_3984.982_4556.651_3986.658_4590.167_60.89729_33.55847.png

Shree Devi Kumar

Feb 4, 2019, 2:37:44 AM

to tesser...@googlegroups.com

see https://github.com/OCR-D/ocrd-train

Kristóf Horváth

Feb 4, 2019, 2:42:36 AM

to tesseract-ocr

So I have the same issue as you: I had no clue how Tesseract works because of the poor documentation. I have been working on it for a week now and managed to make a demo in which I actually produced a traineddata file. I did that by contacting a contributor, who sent me the following:

Please see 

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

If you have the latest version of tesseract (built using master branch from github) then you can use the following script.  
Alternately you can install latest version from a ppa 
see https://github.com/tesseract-ocr/tesseract/wiki#tesseract-400-beta-packages-with-lstm-engine-and-related-traineddata

Use whichever sections of the following script you need. It puts all the commands for tesstutorial from the wiki in one place. You will need to change the file locations to match your environment.

#!/bin/bash
#
##sudo apt update
##sudo apt install ttf-mscorefonts-installer
##sudo apt install fonts-dejavu
##fc-cache -vf
#------------------------
# ./configure --enable-openmp --disable-debug --disable-opencl --disable-graphics
#------------------------ 
cd ~/tesseract
#------------------------ 
rm -rf  ~/tesstutorial/engtrain 
bash ./src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata_best  \
  --output_dir ~/tesstutorial/engtrain
#------------------------
rm -rf ~/tesstutorial/engeval
bash  ./src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang eng --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata_best  \
  --exposures "0" \
  --save_box_tiff \
  --maxpages 0 \
  --workspace_dir ~/tmp \
  --fontlist "Impact Condensed" \
  --output_dir ~/tesstutorial/engeval
#------------------------
rm -rf ~/tesstutorial/engoutput
mkdir -p ~/tesstutorial/engoutput
#
./src/training/lstmtraining \
  --debug_interval 0  \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base \
  --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000 
#------------------------
./src/training/lstmeval \
  --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
### Line 810: At iteration 0, stage 0, Eval Char error rate=87.883967, Word error rate=98.548647 
#------------------------
./src/training/lstmeval \
  --model ~/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
###    Line 922: At iteration 0, stage 0, Eval Char error rate=2.2153534, Word error rate=7.1494965
#------------------------
./src/training/lstmeval \
  --model ~/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
###    Line 1409: At iteration 0, stage 0, Eval Char error rate=0.21176785, Word error rate=0.54202697
#------------------------
###
./src/training/lstmtraining \
  --debug_interval 0  \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output ~/tesstutorial/engoutput/base \
  --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 10000 \
  &>~/tesstutorial/engoutput/basetrain10k.log
#
./src/training/lstmeval \
  --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
###    Line 1558: At iteration 0, stage 0, Eval Char error rate=86.96414, Word error rate=98.968011
#------------------------
# FINETUNING FOR IMPACT
#--------------------------------------
rm -rf ~/tesstutorial/impact_from_small
mkdir -p ~/tesstutorial/impact_from_small
#
time ./src/training/lstmtraining \
  --debug_interval 0  \
  --model_output ~/tesstutorial/impact_from_small/impact \
  --continue_from ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 1200
#
time ./src/training/lstmeval \
  --model ~/tesstutorial/impact_from_small/impact_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
###    Line 1609: At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
#------------------------
# FINETUNING FOR IMPACT - FROM TESSDATA_BEST
#--------------------------------------
rm -rf ~/tesstutorial/impact_from_full
mkdir -p ~/tesstutorial/impact_from_full
#
combine_tessdata -e ~/tessdata_best/eng.traineddata \
  ~/tesstutorial/impact_from_full/eng.lstm
#
time ./src/training/lstmtraining \
  --sequential_training \
  --debug_interval  0 \
  --model_output ~/tesstutorial/impact_from_full/impact \
  --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
  --traineddata ~/tessdata_best/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 400
#------------------------
time ./src/training/lstmeval \
  --model ~/tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata ~/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
###    Line 1652: At iteration 0, stage 0, Eval Char error rate=0.014619883, Word error rate=0.073099415
#------------------------
time ./src/training/lstmeval \
  --model ~/tesstutorial/impact_from_full/impact_checkpoint \
  --traineddata ~/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt
###    Line 2249: At iteration 0, stage 0, Eval Char error rate=0.27672804, Word error rate=0.64643663
#------------------------
#------------------------
# PLUSMINUS
#----------------------------
# add lines from https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
# to training text for plusminus training
#------------------------------------------
cp ~/langdata/eng/eng.training_text   ~/langdata/eng/eng.plusminusnew.training_text 

cat <<EOM >>~/langdata/eng/eng.plusminusnew.training_text 
alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS ﬁrm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberﬂachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
EOM

shuf -o ~/langdata/eng/eng.plusminusnew.training_text <~/langdata/eng/eng.plusminusnew.training_text 
#---------------------------------------------------
rm -rf  ~/tesstutorial/trainplusminus 
time bash ./src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata  \
  --training_text ~/langdata/eng/eng.plusminusnew.training_text \
  --output_dir ~/tesstutorial/trainplusminus
#----------------------------
rm -rf  ~/tesstutorial/evalplusminus 
time bash ./src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata \
  --tessdata_dir ~/tessdata  \
  --training_text ~/langdata/eng/eng.plusminusnew.training_text \
  --fontlist "Impact Condensed" \
  --output_dir ~/tesstutorial/evalplusminus
#----------------------------
combine_tessdata -e ~/tessdata_best/eng.traineddata \
  ~/tesstutorial/trainplusminus/eng.lstm
#----------------------------
time ./src/training/lstmtraining \
  --debug_interval  0 \
  --model_output ~/tesstutorial/trainplusminus/plusminus \
  --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --old_traineddata ~/tessdata_best/eng.traineddata \
  --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
  --max_iterations 3600
#----------------------------
time ./src/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt 
###    Line 2944: At iteration 0, stage 0, Eval Char error rate=0.014645373, Word error rate=0.036469851
#----------------------------
time ./src/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt 
###    Line 3086: At iteration 0, stage 0, Eval Char error rate=3.8430058, Word error rate=10.827586
#----------------------------
time ./src/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt
###
#----------------------------
time ./src/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt \
  --verbosity 2  2>&1 |   grep ±

If you read through the code examples alongside the wiki, you can work through the whole process. If you need more information, I can only provide it once I have written it up, so good luck.
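The lstmeval runs in the script above report their results in lines of the form "At iteration 0, stage 0, Eval Char error rate=..., Word error rate=...". As a minimal sketch, assuming the log format stays as shown in those comments, the two rates can be pulled out of saved output with sed. The function name extract_error_rates and the log file name in the usage line are made up for illustration.

```bash
#!/bin/bash
# Read lstmeval output on stdin and print "char_error word_error"
# for every line that contains an evaluation result.
# Assumes the "Eval Char error rate=X, Word error rate=Y" format
# shown in the comments of the script above.
extract_error_rates() {
  sed -n -E 's/.*Eval Char error rate=([0-9.]+), Word error rate=([0-9.]+).*/\1 \2/p'
}

# Hypothetical usage, with lstmeval output saved to a log file first:
#   extract_error_rates < ~/tesstutorial/engoutput/eval.log
```

Comparing these numbers across the base model, the fine-tuned model, and tessdata_best is how the script above judges whether a training run helped.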

sarat...@gmail.com

Feb 4, 2019, 6:21:15 AM

to tesseract-ocr

I checked that too. I cannot understand how I should give input to Tesseract, because my source is not a book. I'm trying to do OCR on survey plans. If possible, please send your working OCR-D folder so that I can have a look and modify it. Please accept my invitation so that I can ask questions when needed. Thanks.

sarat...@gmail.com

Feb 4, 2019, 6:38:55 AM

to tesseract-ocr

Really appreciate your help! I will try to work through what you have sent.

Please send me your contact email. Thanks again!

Lorenzo Bolzani

Feb 4, 2019, 2:47:27 PM

to tesser...@googlegroups.com

To use ocrd you need to prepare image files and txt files with the same name but different extension.

For example:

sample1.png

sample1.gt.txt

The gt.txt is a simple text file containing the correct text, 145, for example.

The images must be cropped with no border, or with a border of just a couple of pixels. Text height should be about 30 to 40 px. Try different options to see what works best.
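Before training, it can help to confirm that every image really has its transcription next to it. The following is only a sketch, assuming the images are .png files as in the example above and that all pairs sit in one directory; the function name check_gt_pairs and the directory in the usage line are hypothetical.

```bash
#!/bin/bash
# Report every .png in a directory that lacks a non-empty .gt.txt
# with the same base name. Returns non-zero if any pair is incomplete.
check_gt_pairs() {
  local dir="$1" missing=0 img base
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue           # the glob matched nothing
    base="${img%.png}"
    if [ ! -s "$base.gt.txt" ]; then    # file is missing or empty
      echo "missing: $base.gt.txt"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Hypothetical usage:
#   check_gt_pairs ~/ocrd-train/data/ground-truth && echo "all pairs complete"
```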

To recognize numbers ONLY, you also need to replace the line:

merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"

with:

cp "$(TRAIN)/my.unicharset" "data/unicharset"

in the makefile (see https://groups.google.com/forum/#!searchin/tesseract-ocr/l.bolzani%7Csort:date/tesseract-ocr/be4-rjvY2tQ/32evtMHlAQAJ )
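The replacement described above can also be applied with sed rather than by hand. This is a hedged sketch, assuming your copy of the makefile still contains a recipe line that starts with merge_unicharsets as quoted; the function name patch_unicharset_rule is made up, and the original file is kept as a .bak copy.

```bash
#!/bin/bash
# Swap the merge_unicharsets recipe line for the plain cp quoted above.
# Leading whitespace (the recipe tab) is preserved so make still parses it.
patch_unicharset_rule() {
  local makefile="$1"
  [ -f "$makefile" ] || { echo "no such file: $makefile" >&2; return 1; }
  cp "$makefile" "$makefile.bak"        # keep the unmodified original
  sed -E 's|^([[:space:]]*)merge_unicharsets .*$|\1cp "$(TRAIN)/my.unicharset" "data/unicharset"|' \
    "$makefile.bak" > "$makefile"
}

# Hypothetical usage, run from the ocrd-train checkout:
#   patch_unicharset_rule Makefile && diff Makefile.bak Makefile
```

The diff in the usage line should show exactly one changed line; if it shows none, the makefile no longer matches the line quoted above and needs editing by hand.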

Then follow the instructions on the ocrd site.

You can try 100, 250, 500, 1000 and 2000 iterations and see what works best (it depends on how much data you have).

If you need to recognize nothing but handwritten numbers, you can also look for github projects (not related to tesseract) about "MNIST" handwritten numbers recognition with pre-trained models.

Bye

Lorenzo

sarat...@gmail.com

Feb 6, 2019, 11:10:24 PM

to tesseract-ocr

Thanks for your response. Since these are handwritten digits, I don't have font data; what I have are cropped image blocks, and I have prepared some .gt.txt files. Is it possible to do LSTM training without font data?

Timothy Snyder

Feb 6, 2019, 11:25:21 PM

to tesser...@googlegroups.com

I'm pretty sure you have to have a font for LSTM training. When I trained Tesseract 4 for handwriting, I used a font that was based on handwriting to fulfill Tesseract's requirement for at least one font.

Kristóf Horváth

Feb 7, 2019, 2:11:39 AM

to tesseract-ocr

I might be wrong, but I think OCR-D does it without a font.

Lorenzo Bolzani

Feb 7, 2019, 7:29:20 AM

to tesser...@googlegroups.com

You do not need any font or font data, just the images and the corresponding text. As a bare minimum, 500 to 1000 of them.
