Advice on using tesstrain and text2image

Florian Pommerening

May 25, 2023, 5:26:13 AM

to tesseract-ocr

Hi all,

I want to use tesstrain to create a model for a book that contains text in two languages, German (deu) and Kichagga (old). To avoid confusion about this, "old" is the language code (https://iso639-3.sil.org/code/old) not the adjective.

I have some 20 pages (example attached), cleaned up, split into ~800 line images and transcribed, but the documentation mentions that much more data (~400000 lines) should be used. I planned to generate synthetic data with text2image to make up the difference. The books I want to use this for all use the same font, so training on that single font should be fine. The figure of 400000 lines was said to span 4500 fonts, so I'm not sure how much training data I should aim for.

I am also unclear about whether it makes more sense to try and train a model for "old" starting from scratch and using only the Kichagga text to train, and then use "deu+old" for the recognition, or if I should train a model "deu_and_old" for the mixed text by starting from the existing model for deu and training it on the mixed text as it occurs in the book.

One thing that confuses me about tesstrain is the unicharset and the box files. The makefile of tesstrain generates these files automatically, but the information in the automatically generated files is very coarse: box files are generated one per line, where the box of each letter is the full line, and the generated unicharset always uses 0/255 for the minimal/maximal values of everything. Does this not matter for learning?

Finally, the book uses characters like ḏ and f̱. While ḏ is a single Unicode code point (d with line below, U+1E0F), f̱ is made up of two code points (f U+0066 + combining macron below U+0331). I would like to treat both ḏ and f̱ as a single letter to recognize, rather than treating ḏ as one and f̱ as two. However, when generating images with text2image, the box files look like this:

...
f 2261 341 2277 379 0
̱ 2259 334 2276 338 0
...

where the combining unicode character is treated as a separate symbol. This then leads tesstrain to also treat it as a separate symbol. Is there a way to avoid this?
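
For illustration, here is a minimal Python sketch of one possible workaround: post-processing a box file so that an entry consisting only of combining marks is folded into the entry before it, with the two boxes merged into their union. It assumes the six-field layout shown above (symbol, left, bottom, right, top, page). The helper name merge_combining_boxes is hypothetical and is not part of text2image or tesstrain, and the thread does not confirm whether the training pipeline then treats the merged symbol as a single unit. The first lines also show why Unicode normalization alone does not help: NFC has a precomposed form for ḏ but none for f̱.

```python
import unicodedata

# NFC composes d + U+0331 into U+1E0F, but leaves f + U+0331 as two code points.
assert unicodedata.normalize("NFC", "d\u0331") == "\u1e0f"
assert len(unicodedata.normalize("NFC", "f\u0331")) == 2


def merge_combining_boxes(lines):
    """Fold box entries made only of combining marks into the preceding entry.

    Hypothetical helper. Assumes each line is 'symbol left bottom right top page'.
    """
    merged = []  # tuples: (symbol, left, bottom, right, top, page)
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        # Split from the right so a symbol that is itself a space survives.
        symbol, *coords = raw.rstrip("\n").rsplit(" ", 5)
        left, bottom, right, top, page = (int(c) for c in coords)
        is_mark = symbol != "" and all(unicodedata.combining(ch) for ch in symbol)
        if is_mark and merged and merged[-1][5] == page:
            prev = merged[-1]
            # Append the mark to the base symbol and take the union of both boxes.
            merged[-1] = (
                prev[0] + symbol,
                min(prev[1], left),
                min(prev[2], bottom),
                max(prev[3], right),
                max(prev[4], top),
                page,
            )
        else:
            merged.append((symbol, left, bottom, right, top, page))
    return [" ".join(str(field) for field in entry) for entry in merged]


boxes = ["f 2261 341 2277 379 0", "\u0331 2259 334 2276 338 0"]
result = merge_combining_boxes(boxes)
print(result)
```

With the two entries from the excerpt above, this yields a single entry for f̱ whose box covers both the letter and the macron.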

Cheers
Florian

Example page:

Zdenko Podobny

Jun 11, 2023, 11:20:08 AM

to tesser...@googlegroups.com

Hello,

The ~400000 lines are needed because of the number of fonts being trained. If you are training for one font, you do not need that much input data.

Kichagga looks like a Latin-based language, so you could try extending the Latin or deu traineddata instead of training from scratch.

LSTM training is based on words (the legacy engine was based on characters), so just follow the examples in https://github.com/tesseract-ocr/tesstrain
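
As a rough sketch of what extending the deu model could look like, the Python snippet below only assembles a tesstrain make invocation and prints it; it does not run anything. MODEL_NAME, START_MODEL, TESSDATA and MAX_ITERATIONS are variables documented in the tesstrain README, while the model name deu_and_old (taken from the question), the tessdata path and the iteration count are placeholder assumptions that would need adjusting.

```python
import shlex


def tesstrain_make_command(model_name, start_model, tessdata_dir, max_iterations):
    """Build (but do not run) a tesstrain fine-tuning command.

    Hypothetical helper; the values passed in are placeholders.
    """
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",        # name of the resulting model
        f"START_MODEL={start_model}",      # existing traineddata to continue from
        f"TESSDATA={tessdata_dir}",        # directory containing the start model
        f"MAX_ITERATIONS={max_iterations}",
    ]


# Placeholder values, assumed for illustration only.
cmd = tesstrain_make_command("deu_and_old", "deu", "/path/to/tessdata_best", 10000)
print(shlex.join(cmd))
```

The printed command would be run from the tesstrain checkout, with the line images and transcriptions placed in the ground-truth directory that the README describes.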

Zdenko

On Thu, May 25, 2023 at 11:26, Florian Pommerening <florian.p...@gmail.com> wrote:
