Creating a new language pack

TiMauzi

Jun 22, 2022, 1:26:37 PM

to tesseract-ocr

Hello everyone,

I am planning to create a language pack for a language that is not covered by the existing language packs. I don't need to train on a new font, since my language uses the Latin script. Is there a way to train a new model from just a plain training text or language corpus, using the existing fonts of other Latin-based languages? What steps would I need to follow for this project?

I have already found this and this, but I'm not sure whether they are what I need (or which parts of these descriptions apply to me). For example, they say I should provide ground truth consisting of single-line images and their transcriptions. Is this really necessary for a language that doesn't introduce a new script? Or can I somehow generate "fake" training images?

I also found a list of langdata folders. How do I write one for my language, and is there anything I should pay attention to while doing so?

I'm sorry that this question is rather unspecific; I am still a newbie when it comes to Tesseract training. I hope you can help me anyway, or that you know of some useful links!