Hi all,
I want to use tesstrain to create a model for a book that contains text in two languages, German (deu) and Kichagga (old). To avoid confusion: "old" is the ISO 639-3 language code (https://iso639-3.sil.org/code/old), not the adjective.
I have some 20 pages (example attached), cleaned up, split into ~800 line images and transcribed, but the documentation mentions that much more data (~400000 lines) should be used. I plan to generate synthetic data with text2image to make up the difference. The books I want to use this for all use the same font, so training on that font alone should be fine. The 400000-line figure was said to span 4500 fonts, so I'm not sure how much training data I should aim for with a single font.
I am also unsure which approach makes more sense: training a model for "old" from scratch using only the Kichagga text and then using "deu+old" for recognition, or training a model "deu_and_old" for the mixed text by starting from the existing deu model and training it on the mixed text as it occurs in the book.
I am also confused by how tesstrain handles the unicharset and the box files. The tesstrain Makefile generates these files automatically, but the information in them is very coarse: one box file is generated per line image, and in it the box of every letter is simply the full line; the generated unicharset always uses 0/255 for the minimum/maximum values of every property. Does this not matter for learning?
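To make clear what I mean by "coarse", here is a minimal Python sketch of my understanding of those box files. It is not tesstrain's actual code: the function name `line_box_entries` and the image size are made up for illustration, and I am assuming the usual `symbol left bottom right top page` box format.

```python
def line_box_entries(text, width, height, page=0):
    """Return one box entry per code point, each spanning the whole line image.

    Hypothetical helper for illustration only; not part of tesstrain.
    """
    # Assumed box format: <symbol> <left> <bottom> <right> <top> <page>
    return [f"{ch} 0 0 {width} {height} {page}" for ch in text]


if __name__ == "__main__":
    # Every character gets the same box: the full 1200x48 line image.
    for entry in line_box_entries("Haus", width=1200, height=48):
        print(entry)
```

So the position of each individual letter within the line is not recorded at all, which is what makes me wonder whether it is needed.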
Finally, the book uses characters like ḏ and f̱. While ḏ is a single Unicode code point (U+1E0F, officially named "Latin small letter d with line below"), f̱ is made up of two code points (f U+0066 + combining macron below U+0331). I would like to treat both ḏ and f̱ as a single letter to recognize, rather than treating ḏ as one and f̱ as two. However, when generating images with text2image, the box files look like this:
...
f 2261 341 2277 379 0
̱ 2259 334 2276 338 0
...
where the combining Unicode character is treated as a separate symbol. This in turn leads tesstrain to treat it as a separate symbol as well. Is there a way to avoid this?
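Unicode normalization alone does not help here, as far as I can tell. The following minimal Python sketch (standard library only) shows that NFC composes d + U+0331 into the single code point U+1E0F, while f + U+0331 stays as two code points because no precomposed character exists. The helper `clusters` is my own illustration of the grouping I would like to get, not something tesstrain or text2image provides.

```python
import unicodedata


def nfc(text):
    """Normalize text to Unicode NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: a simplified stand-in for grapheme clustering.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the combining mark to the previous base
        else:
            out.append(ch)
    return out


if __name__ == "__main__":
    print(len(nfc("d\u0331")))  # d + combining macron below
    print(len(nfc("f\u0331")))  # f + combining macron below
    print(len(clusters("af\u0331b")))
```

What I am after is for the training pipeline to treat f̱ the way `clusters` does, as one symbol, instead of as f followed by a separate U+0331.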
Cheers
Florian
Example page:
