What is the effect of changing the scope of the training text?

akmalkady

Jul 24, 2021, 11:25:12 PM

to tesseract-ocr

Hi,

I am following the TessTutorial to train Tesseract from scratch, and I have some questions about the lang data to understand how the training works.

The provided training text contains what look like random English words. My questions about the training text:

1 - Will using text from a specific scope improve Tesseract's performance on that scope? For example, training Tesseract on special names or vocabulary that is not English but uses Latin letters, digits and special characters (a-z, A-Z, 0-9). Example: pH_scale1

2 - Will generating words from random letters have the same effect as using English words?

The provided eng.trainingtext has text such as:

"different New Articles page 23 a To Service ~~ a details DC that don't as 7 «« Date:"

What if I use something random like this:

"sqwrLwU2bo BLiRDhvAoM USyWtpBFi5 UwLgXyoz1e UqiXudhrhz dDKAdnI8Z2 YIl6T6d7m6 G2IVtTRbuu Lh6NvWNLc3 CGD2SXOoNT"

Thanks

akm

Jul 27, 2021, 12:23:19 PM

to tesseract-ocr

I would like to add one more question: were the other Latin-script languages, such as French, trained from scratch, or were they fine-tuned from the English model?

Helmut Wollmersdorfer

Aug 18, 2021, 9:00:46 AM

to tesseract-ocr

AFAIK all the language models are trained from scratch.

In my experience the error rate is significantly higher on names, e.g. scientific names in botany, which are mostly some sort of latinised Greek. The same goes for personal names if they do not fit the main language of the model (or any language at all).

Thus I guess that recognition can be improved by additional training with wordlists or texts from a special domain. (Domain is the usual linguistic term for a text classification such as poems, drama, news, science, tech, etc.)
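
As a rough illustration of the wordlist idea, here is a minimal Python sketch that turns a domain wordlist into lines of training text. Everything in it is an assumption made for illustration: the function name `build_training_lines`, the sample words, and the line and word counts are hypothetical, and it only produces the text itself, not the images or any Tesseract training step.

```python
import random


def build_training_lines(words, n_lines=5, words_per_line=8, seed=0):
    """Return n_lines lines, each made of words_per_line words drawn at random
    (with repetition) from the given wordlist and joined by single spaces."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for _ in range(n_lines):
        picked = [rng.choice(words) for _ in range(words_per_line)]
        lines.append(" ".join(picked))
    return lines


# Hypothetical domain wordlist: botanical names plus a few technical tokens.
domain_words = ["pH_scale1", "Quercus", "robur", "Bellis", "perennis", "NaCl", "H2O"]

lines = build_training_lines(domain_words)
training_text = "\n".join(lines) + "\n"

# Characters that actually occur in the generated text (ignoring whitespace).
# Useful for spotting characters the text never exercises.
charset = sorted(set(training_text) - {" ", "\n"})

print(training_text)
print("".join(charset))
```

Unlike purely random letter strings, text built this way keeps the character sequences that really occur in the domain, which is the property the additional training is meant to exploit.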
