I have already been going through language-specific.sh, but I still have a few questions I hope someone can answer.
My initial question, I guess, is: were there other tools used to create the training data for the English model that is currently provided (other than the ones provided on git)? I.e., was some script used other than tesstrain.sh?
The reason I ask is that in language-specific.sh no exposure settings are given for the English language code; it just takes the default value for that parameter, which is 0. However, for the latin language code, for example (and for a few of the others), you have EXPOSURES="-3 -2 -1 0 1 2 3". Would any of these training sets be combined to create the final model?
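For reference, this is roughly the kind of thing I'm looking at in language-specific.sh (paraphrased from memory to show what I mean, not the exact file contents):

case ${LANG_CODE} in
  eng )
    # no EXPOSURES set here, so the script-wide default (just 0) applies
    ;;
  lat )
    # several degradation levels get rendered for each font
    EXPOSURES="-3 -2 -1 0 1 2 3"
    ;;
esac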
Also, Ray mentioned that it was trained with 4500 fonts, as you said; however, in language-specific.sh I only see about 60 unique fonts specified for the Latin-script languages, and there are a lot more fonts listed in langdata/font_properties. Would language-specific.sh need to be modified to produce the rest, i.e., would it have been modified to produce all 4500 fonts?
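(For context, each line in langdata/font_properties is just a font name plus italic/bold/fixed/serif/fraktur flags, something like the lines below, so it seems a font being listed there doesn't by itself mean it gets rendered; it also has to be in the font list the script actually uses. The names and flag values here are just made-up examples.)

Arial 0 0 0 0 0
Times New Roman Italic 1 0 0 1 0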
And finally, Ray mentioned:
However much or little corpus text there is, the rendering process makes
50000 chunks of 50 words to render in a different combination of font and
random degradation, which results in 400000-800000 rendered textlines.
The words are chosen to approximately echo the real frequency of conjunct
clusters (characters in most languages) in the source text, while also
using the most frequent words.
However, from what I'm seeing in the output (for example, using the language code eng), it just produces tif images of exactly what's in eng.training_text, one per font, using whatever exposure setting is set, which in the eng case is just 0.
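(To illustrate what I mean: for each font and each exposure value, the rendering step ends up running something roughly like the command below, which is why with eng I only get one exposure-0 tif per font. The flags are from text2image as I understand them; the exact invocation tesstrain.sh builds may differ.)

text2image --text=langdata/eng/eng.training_text \
  --outputbase=eng.Arial.exp0 \
  --font="Arial" \
  --fonts_dir=/usr/share/fonts \
  --exposure=0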
Am I missing something?
Thanks!