Training from scratch

aggi...@gmail.com

May 19, 2017, 3:19:47 AM

to tesseract-ocr

Suppose I am training Tesseract 4 from scratch, for English for example. I know I need to have the proper fonts installed, but what other parameters would be needed to reproduce the provided English model? I.e., what exposure settings were used to degrade the images, etc.?

ShreeDevi Kumar

May 19, 2017, 3:31:48 AM

to tesser...@googlegroups.com

As per Ray, 4500 fonts and 400000 lines of text were used to create the models for Latin-script-based languages, so I am not sure whether you can replicate the model.

For language-specific exposure settings etc., see

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e8b28a60-7ebb-44ab-aa7a-9cebd2086cbb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

aggi...@gmail.com

May 19, 2017, 5:28:55 PM

to tesseract-ocr

I have already been going through language-specific.sh but I still have a few questions I hope someone can answer.

My initial question, I guess, is: were other tools used to create the training data for the English model that is currently provided (other than the ones provided on git)? I.e., was some script other than tesstrain.sh used?

The reason I ask is that in language-specific.sh no exposure settings are set for the English language code; it just takes the default value for that parameter, which is 0. However, for the Latin language code, for example (and for a few of the others), you have EXPOSURES="-3 -2 -1 0 1 2 3". Would any of these training sets be combined to create the final model?
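To make the effect of that setting concrete, here is a rough Python sketch of how a list of exposures multiplies the number of renderings. It assumes each (font, exposure) pair is rendered once, which is my reading of the scripts and not something confirmed in this thread. The font names and the rendering_jobs helper are made up for illustration; the real font lists live in language-specific.sh.

```python
from itertools import product

# Illustrative values only: these font names are placeholders,
# not the fonts actually listed in language-specific.sh.
fonts = ["Arial", "Times New Roman", "Verdana"]
exposures = [-3, -2, -1, 0, 1, 2, 3]  # the EXPOSURES value quoted above


def rendering_jobs(fonts, exposures):
    """Return one (font, exposure) pair per assumed rendering pass."""
    return list(product(fonts, exposures))


jobs = rendering_jobs(fonts, exposures)
print(len(jobs))                       # 3 fonts x 7 exposures
print(len(rendering_jobs(fonts, [0])))  # default exposure only: one pass per font
```

With the default exposure of 0 the count is just the number of fonts, which would match seeing one image per font for eng.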

Also, Ray mentioned that it was trained with 4500 fonts, as you said. However, in language-specific.sh I only see about 60 unique fonts specified for Latin languages, and there are a lot more fonts listed in langdata/font_properties. Would language-specific.sh need to be modified to cover the rest, i.e. was it modified to render with all 4500 fonts?

And finally, Ray mentioned:

However much or little corpus text there is, the rendering process makes 50000 chunks of 50 words to render in a different combination of font and random degradation, which results in 400000-800000 rendered textlines. The words are chosen to approximately echo the real frequency of conjunct clusters (characters in most languages) in the source text, while also using the most frequent words.

However, from what I'm seeing in the output, for example with the language code eng, it just produces TIFF images of exactly what's in eng.training_text, one for each font, using whatever exposure setting is set, which in the eng case is just 0.
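For comparison, this is roughly how I read Ray's description quoted above, as a Python sketch. It is a simplification and not the actual tool: it samples whole words by their frequency in the corpus, whereas Ray describes matching the frequency of conjunct clusters while also using the most frequent words. The make_chunks function and its parameters are hypothetical names of my own.

```python
import random
from collections import Counter


def make_chunks(corpus_text, num_chunks=50000, words_per_chunk=50, seed=0):
    """Build chunks of words sampled in proportion to corpus word frequency.

    Simplified stand-in for the process Ray describes; the real process
    reportedly balances conjunct-cluster frequency, which this does not do.
    """
    rng = random.Random(seed)
    counts = Counter(corpus_text.split())
    words = list(counts)
    weights = [counts[w] for w in words]
    return [
        " ".join(rng.choices(words, weights=weights, k=words_per_chunk))
        for _ in range(num_chunks)
    ]


# Small demonstration with a toy corpus and only four chunks.
chunks = make_chunks("the cat sat on the mat", num_chunks=4)
for chunk in chunks:
    print(len(chunk.split()), "words")
```

Each chunk would then be rendered with some font and degradation, which is different from rendering eng.training_text verbatim once per font.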

Am I missing something?

Thanks!

Also


ShreeDevi Kumar

May 20, 2017, 12:43:15 AM

to tesser...@googlegroups.com

Google has not shared its training method with complete scripts etc. The training instructions on the wiki are only a tutorial for learning about LSTM training.

Please also see https://github.com/tesseract-ocr/tesseract/issues/644

ShreeDevi

ShreeDevi Kumar

May 20, 2017, 10:31:46 AM

to tesser...@googlegroups.com

Also see

https://github.com/tesseract-ocr/tesseract/blob/master/contrib/genlangdata.pl

ShreeDevi
