Training Tesseract 4.0 using images?

denni...@berkeley.edu

Apr 13, 2018, 6:39:12 AM

to tesseract-ocr

Hi all,

I read in a different post that training Tesseract 4.0 from images is not supported. Is this true? So far I have been able to train Tesseract 4.0 successfully using font data. When I run tesstrain.sh, the script creates a number of files, including an lstmf file alongside the usual traineddata file (and some others, such as unicharset). Is it possible to use the traineddata generation from images and box files described in the Tesseract 3.0 training instructions to create these files for training Tesseract 4.0? The Tesseract 3.0 instructions already produce a traineddata file; if this is possible, how can I generate the lstmf file (and the others)?

Thank you,
Dennis

ShreeDevi Kumar

Apr 13, 2018, 8:19:47 AM

to tesser...@googlegroups.com

Training Tesseract 4.0 from images is not officially supported. Several people have had success doing LSTM training with box/tiff pairs, but it requires hacks or programming on their part to create 4.0.0-compatible box files.

tesstrain.sh creates box/tiff files in the /tmp directory; these are used to create the lstmf files for LSTM training. Depending on the options chosen, tesstrain.sh can create either a 3.0x-compatible traineddata or a 4.0.0-compatible starter traineddata. For 4.0.0, this starter traineddata is used along with the lstmf files for LSTM training.
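
As a rough sketch of the box/tiff to lstmf step described above: the helper below only prints a `tesseract ... lstm.train` command for each .tif that has a matching .box file, so nothing runs until the output is reviewed and piped to a shell. The function name and the flat directory layout are assumptions made for illustration, and the box files must already be in the 4.0 format. Depending on the images, a `--psm` option may also be needed.

```shell
# Hypothetical helper (not part of Tesseract): print one lstm.train command
# per complete box/tiff pair found in a directory.
print_lstmf_commands() (
  dir="$1"
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue              # directory contains no .tif files
    base="${tif%.tif}"
    if [ -f "$base.box" ]; then
      # The "lstm.train" config makes tesseract write <base>.lstmf
      printf "tesseract '%s' '%s' lstm.train\n" "$tif" "$base"
    else
      printf 'skipping %s: no matching .box file\n' "$tif" >&2
    fi
  done
)
```

For example, `print_lstmf_commands /tmp/mytraining | sh` would run the printed commands, where `/tmp/mytraining` is a placeholder path.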

The format of traineddata files for 3.0x and 4.0.0 is different.

For the different components of a traineddata file, see

https://github.com/tesseract-ocr/tesseract/blob/master/doc/combine_tessdata.1.asc

For creating 4.0-compatible box files, see

https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375247341

https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#training-tesseract-lstm-engine

Please note that all of these are unsupported options.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/bc664de6-5386-45b3-ae4d-70ac5338938c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

denni...@berkeley.edu

Apr 15, 2018, 3:06:21 AM

to tesseract-ocr

Hi Shree,

Thanks for your reply. Is there any option in Tesseract 4.0's tesstrain.sh to generate the traineddata and lstmf files from images and box files? Or do I still have to go through the process listed in the Tesseract 3.0 instructions? In that case, I would be able to generate the traineddata file (and the unicharset file, I think), but not the lstmf file. How can I generate it? Is there a tool I can use?

Thanks again,

Dennis

ShreeDevi Kumar

Apr 15, 2018, 4:55:16 AM

to tesser...@googlegroups.com

Hi Dennis,

1. Copy the 4.0-format box/tiff pairs to the langdata/$lang directory, or to any other folder of your choice.

2. Modify tesstrain.sh to copy these files to your /tmp directory. The following excerpt shows where the lines need to be added (they are marked #shree):

    source "$(dirname $0)/tesstrain_utils.sh"

    ARGV=("$@")
    parse_flags

    mkdir -p ${TRAINING_DIR}
    tlog "\n=== Starting training for language '${LANG_CODE}'"

    # copy box tiff pairs from langdata/lang directory #shree
    cp ./langdata/${LANG_CODE}/*.tif "${TRAINING_DIR}/" #shree
    cp ./langdata/${LANG_CODE}/*.box "${TRAINING_DIR}/" #shree
    ls -l "${TRAINING_DIR}/" #shree

    source "$(dirname $0)/language-specific.sh"
    set_lang_specific_parameters ${LANG_CODE}

3. Run tesstrain.sh with at least one font and a sample training text, in addition to the provided box/tiff pairs.
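
The copy in step 2 could also be done defensively, so that only complete pairs reach the training directory. This is a minimal sketch, assuming a flat source folder; the function name is hypothetical and it is not part of tesstrain.sh.

```shell
# Hypothetical helper: copy only complete box/tiff pairs into the training
# directory and print how many pairs were copied.
copy_box_tiff_pairs() (
  src="$1"
  dest="$2"
  mkdir -p "$dest" || exit 1
  copied=0
  for tif in "$src"/*.tif; do
    [ -e "$tif" ] || continue              # source folder has no .tif files
    box="${tif%.tif}.box"
    if [ -f "$box" ]; then
      cp "$tif" "$box" "$dest"/ && copied=$((copied + 1))
    else
      echo "warning: $tif has no .box file, skipped" >&2
    fi
  done
  echo "$copied"
)
```

Inside tesstrain.sh, the two cp lines marked #shree could then be replaced by something like `copy_box_tiff_pairs "./langdata/${LANG_CODE}" "${TRAINING_DIR}"`, which skips any image that has no box file instead of copying it.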

ShreeDevi

denni...@berkeley.edu

Apr 15, 2018, 10:49:30 PM

to tesseract-ocr

Hi Shree,

Thanks for your help. I was able to train successfully with the box files. Is it possible to provide no font data at all? Hypothetically, if I were training for a document whose font is not available on the web, what would I do?
In tesstrain.sh, after I copy the box/tiff pairs into /tmp as you described, does the script still generate box/tiff pairs from font data? It seems that the lines

phase_I_generate_image 8
phase_UP_generate_unicharset

serve this function. Is the script still relying on training data generated from fonts? Sorry, I'm not clear on the entire process that tesstrain.sh uses.

Thanks once again,
Dennis

ShreeDevi Kumar

Apr 15, 2018, 11:16:56 PM

to tesser...@googlegroups.com

Please take a look at tesstrain_utils.sh and language-specific.sh in the training directory for more details about how training works.

As mentioned before, training with box/tiff pairs is not supported.
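
One quick way to get an overview of those scripts is to list the functions they define. The helper below is a small sketch with a hypothetical name; it assumes functions are written in the `name() {` style at the start of a line, which may not cover every definition.

```shell
# Hypothetical helper: print the names of top-level functions defined in a
# shell script, assuming the "name() {" definition style.
list_shell_functions() {
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)[[:space:]]*()[[:space:]]*{.*$/\1/p' "$1"
}
```

For example, `list_shell_functions tesstrain_utils.sh | grep '^phase_'` would show only the phase functions, such as the ones called from tesstrain.sh.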

shree

Jun 16, 2020, 9:45:23 PM

to tesseract-ocr

To those who come across this old thread:

Training from single-line images and their ground truth is now possible using the Makefile in the tesstrain repo.
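
Before starting such a run, it can help to confirm that every line image has a transcription next to it. The sketch below assumes the layout described in the tesstrain README, where each image (.tif or .png) sits beside a file with the same base name and the extension .gt.txt. The function name is hypothetical, and images with double extensions such as .bin.png are not handled.

```shell
# Hypothetical helper: print every line image that lacks a .gt.txt file.
# Handles plain .tif and .png names only.
missing_ground_truth() (
  dir="$1"
  for img in "$dir"/*.tif "$dir"/*.png; do
    [ -e "$img" ] || continue              # pattern matched no files
    [ -f "${img%.*}.gt.txt" ] || echo "$img"
  done
)
```

If the function prints nothing for the ground-truth folder, training can be started with an invocation along the lines of `make training MODEL_NAME=foo START_MODEL=eng`, where `foo` is a placeholder model name.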

https://stackoverflow.com/questions/43352918/how-do-i-train-tesseract-4-with-image-data-instead-of-a-font-file

The above link has a good explanation.
The only change I would suggest is to download tessdata_best/eng.traineddata (or another language, as needed) individually with wget for use as the start model, rather than cloning the whole repo, which is a few gigabytes of data.
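
A single-file download might look like the sketch below. The URL layout (the `raw/main` path on GitHub) and the default `./tessdata` destination are assumptions, and both function names are hypothetical.

```shell
# Assumed URL layout for the tessdata_best repository on GitHub.
start_model_url() {
  printf 'https://github.com/tesseract-ocr/tessdata_best/raw/main/%s.traineddata\n' "$1"
}

# Download one start model into a local tessdata directory (default ./tessdata).
fetch_start_model() (
  lang="$1"
  dest="${2:-./tessdata}"
  mkdir -p "$dest" || exit 1
  wget -O "$dest/$lang.traineddata" "$(start_model_url "$lang")"
)
```

Calling `fetch_start_model eng` would fetch only eng.traineddata; any other language code can be passed the same way.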