Train Tesseract 4.0 LSTM based on images

Ahmad Moawad

Apr 12, 2017, 1:17:33 AM

to tesseract-ocr

This is the relevant part of https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

My question is about training from existing images, not about generating training data from text.

The overall training process is similar to training 3.04. Conceptually, the steps are the same:

Prepare training text.
Render text to image + box file. (Or create hand-made box files for existing image data.)
Make unicharset file.
Optionally make dictionary data.
Run tesseract to process image + box file to make training data set.
Run training on training data set.
Combine data files.

Are the above steps similar to:

tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
unicharset_extractor ara.arial.exp4.box
echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font's properties
mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
shapeclustering -F unicharset ara.arial.exp4.tr
cntraining ara.arial.exp4.tr

mv inttemp ara.inttemp
mv normproto ara.normproto
mv pffmtable ara.pffmtable
mv shapetable ara.shapetable
combine_tessdata ara.

Should I use these steps or not?

The key differences are:

The boxes only need to be at the textline level. It is thus far easier to make training data from existing image data.
The .tr files are replaced by .lstmf data files.
Fonts can and should be mixed freely instead of being separate.
The clustering steps (mftraining, cntraining, shapeclustering) are replaced with a single slow lstmtraining step.

I don't know much about this part.

Thanks!

ShreeDevi Kumar

Apr 12, 2017, 4:49:24 AM

to tesser...@googlegroups.com

To understand LSTM training in more detail, read these bash scripts in the training directory:

tesstrain.sh
tesstrain_utils.sh
language_specific.sh

- excuse the brevity, sent from mobile

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/60029df8-2149-4f5a-8c3a-32e96c27ce79%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ahmad Moawad

Apr 12, 2017, 5:05:16 AM

to tesseract-ocr

Thanks for your reply, Shree, I appreciate it. My question is: is that the right path for training Tesseract 4.0 LSTM or not?


srn...@gmail.com

Apr 12, 2017, 5:56:43 AM

to tesseract-ocr

Can you please tell me whether the command

tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train

is right for Tesseract 4 when training from image files? It produces .tr files when I run it with Tesseract 4.


ShreeDevi Kumar

Apr 12, 2017, 6:08:11 AM

to tesser...@googlegroups.com

See https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh

if ((LINEDATA)); then
  phase_E_extract_features "lstm.train" 8 "lstmf"
  make__lstmdata
else
  phase_E_extract_features "box.train" 8 "tr"
  phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
  if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
    phase_S_cluster_shapes
  fi
  phase_M_cluster_microfeatures
  phase_B_generate_ambiguities
  make__traineddata
fi

--------------------

lstm.train is for LSTM training

box.train is for training the legacy Tesseract 3.0x engine

Please note that the current master code is an alpha version of 4.0 LSTM and will most probably drop support for the legacy engine.

If you want the legacy tesseract engine and train for it, please use the 3.05 branch of the github repo.
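
For existing image data, the lstm.train branch above comes down to running tesseract with the lstm.train config instead of box.train on every image that has a box file next to it. Below is a minimal dry-run sketch of that idea, assuming the pairs are named like eng.arial.exp4.tif / eng.arial.exp4.box. The helper names (run, config_for, extract_all) and the DRY_RUN switch are made up for illustration; only the two config names and the output extensions come from tesstrain.sh.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Map the engine to the config name used in tesstrain.sh.
config_for() {
    case $1 in
        lstm)   echo lstm.train ;;   # writes <base>.lstmf
        legacy) echo box.train ;;    # writes <base>.tr
        *)      echo "unknown engine: $1" >&2; return 1 ;;
    esac
}

# Extract features from every .tif in the current directory that has a box file.
extract_all() {
    config=$(config_for "$1") || return 1
    for img in *.tif; do
        base=${img%.*}                   # eng.arial.exp4.tif -> eng.arial.exp4
        [ -e "$base.box" ] || continue   # skip images without a box file
        run tesseract "$img" "$base" "$config"
    done
}

extract_all lstm
```

Run it in the directory that holds the image/box pairs, and set DRY_RUN=0 once the printed commands look right.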

srn...@gmail.com

Apr 12, 2017, 6:30:17 AM

to tesseract-ocr

Hello Shree, thank you for your valuable reply. Which changes do the commands below need? They are for Tesseract 3.0.

tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
unicharset_extractor ara.arial.exp4.box
echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font's properties
mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
shapeclustering -F unicharset ara.arial.exp4.tr
cntraining ara.arial.exp4.tr

mv inttemp ara.inttemp
mv normproto ara.normproto
mv pffmtable ara.pffmtable
mv shapetable ara.shapetable
combine_tessdata ara.

Please suggest changes for the above steps. I plan to publish a thorough explanatory tutorial once I have an overview of all the steps.

Thank you.

ShreeDevi Kumar

Apr 12, 2017, 6:34:42 AM

to tesser...@googlegroups.com

Arabic was never trained with the legacy Tesseract engine, and I doubt you will get any improvement over the existing traineddata, which use Cube or LSTM.

You are free to experiment and see what you come up with.

I have pointed to the bash scripts for training. Please refer to them for the correct process.


srn...@gmail.com

Apr 12, 2017, 6:38:18 AM

to tesseract-ocr

Sorry, I gave the wrong commands (for Arabic). I was actually referring to English:

tesseract eng.arial.exp4.tif eng.arial.exp4 nobatch box.train
unicharset_extractor eng.arial.exp4.box
echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font's properties
mftraining -F font_properties -U unicharset -O eng.unicharset eng.arial.exp4.tr
shapeclustering -F unicharset eng.arial.exp4.tr
cntraining eng.arial.exp4.tr

mv inttemp eng.inttemp
mv normproto eng.normproto
mv pffmtable eng.pffmtable
mv shapetable eng.shapetable
combine_tessdata eng.

Please suggest the changes the above commands need for Tesseract 4.0; they are for Tesseract 3.0.


ShreeDevi Kumar

Apr 12, 2017, 7:04:20 AM

to tesser...@googlegroups.com

LSTM training is not like legacy training. Please read the wiki pages on 4.0 training; I have given all the sample commands there. There are three different ways of training.

Read the bash training scripts to learn more.

tesstrain.sh with --linedata_only creates the box/tiff pairs, but only the .lstmf files are saved in the output directory.

Without --linedata_only you will get a 3.0 traineddata.
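
To make the difference concrete, here is a dry-run sketch of the two invocations. The flag names are the ones used on the 4.0 training wiki page. The paths (/usr/share/fonts, ../langdata, ./tessdata, the output directory) and the wrapper function tesstrain are only assumptions for illustration, so adjust them to your checkout.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# engine "lstm":   adds --linedata_only, so only .lstmf files are kept
# anything else:   no extra flags, tesstrain.sh builds a 3.0x traineddata
tesstrain() {
    engine=$1 lang=$2 out_dir=$3
    # Collect the common arguments in the positional parameters.
    set -- --fonts_dir /usr/share/fonts --lang "$lang" \
           --langdata_dir ../langdata --tessdata_dir ./tessdata \
           --output_dir "$out_dir"
    if [ "$engine" = lstm ]; then
        set -- "$@" --linedata_only --noextract_font_properties
    fi
    run training/tesstrain.sh "$@"
}

tesstrain lstm eng "$HOME/tesstutorial/engtrain"
```

Calling `tesstrain legacy eng <dir>` instead prints the same command without the two LSTM-specific flags.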

There are multiple steps to be done using the lstmf files to create the final 4.0 traineddata.
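
As a rough illustration of those steps, the sketch below collects the .lstmf files into a list file and prints the commands for fine-tuning an existing model, which is one of the training modes described on the wiki. Treat it as a dry run under assumptions: the directory layout (train/, out/), the file names and the iteration count are invented, and the exact set of flags lstmtraining requires differs between the 4.0 alpha and later releases, so check `lstmtraining --help` for your build.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch), otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Write the paths of all .lstmf files in directory $1 to list file $2,
# one path per line, for use with --train_listfile.
write_listfile() {
    : > "$2"                        # start with an empty list
    for f in "$1"/*.lstmf; do
        [ -e "$f" ] || continue     # directory without .lstmf files
        echo "$f" >> "$2"
    done
}

# Hypothetical layout: .lstmf files in ./train, results in ./out.
TRAIN_DIR=${TRAIN_DIR:-train}
OUT_DIR=${OUT_DIR:-out}

mkdir -p "$OUT_DIR"
write_listfile "$TRAIN_DIR" "$OUT_DIR/eng.training_files.txt"

# 1. Pull the LSTM model out of an existing traineddata to start from.
run combine_tessdata -e eng.traineddata "$OUT_DIR/eng.lstm"

# 2. The single (slow) training step that replaces the clustering tools.
run lstmtraining \
    --model_output "$OUT_DIR/tuned" \
    --continue_from "$OUT_DIR/eng.lstm" \
    --train_listfile "$OUT_DIR/eng.training_files.txt" \
    --max_iterations 400
```

Turning the resulting checkpoint into a final traineddata is a further step (lstmtraining has a --stop_training mode for that); its details also depend on the version, so follow the wiki for your release.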

Since you want to write a tutorial, please do your own reading and trials first.
