Hello everyone,
I am planning to create a language pack for a language that isn't covered by the existing language packs. I don't need a new font, since my language is Latin-based. Is there a way to train a new model using just a plain training text / language corpus together with the existing fonts of other Latin-based languages? What steps would I need to follow for this project?
I found this and this already, but I'm not sure if these are what I need (or which parts of these descriptions I need). For example, it says I should provide ground truth consisting of single-line images and their transcriptions. Is this really necessary when the language doesn't introduce a new script? Or can I somehow generate "fake" training images?
I also found a list of langdata folders -- how do I write one for my language, and is there anything I should pay attention to while doing so?
I'm sorry that this question is pretty unspecific; I am still a newbie when it comes to Tesseract training. I hope you can help me anyway, or point me to any useful links!
Tim