Links to wiki for training new language in tesseract

Jephthah Anga

Nov 16, 2023, 10:25:40 AM

to tesseract-ocr

Good day folks,

The question of "How to train a new language in Tesseract" has probably been asked a few times by now, and I don't want to start the same tiresome conversation again. I am having some difficulty navigating the Tesseract wiki and would appreciate it if someone could point me to documentation on how to add an entirely new language to Tesseract. Most of the information I have found so far focuses on training Tesseract on already existing languages, but I want to create an entirely new language from handwritten texts. The language in question is Innu-aimun. The alphabet is quite simple, consisting of some of the Latin alphabets with the addition of a superscript u character that always appears after a consonant.

Thank you for your help.

Tom Morris

Nov 17, 2023, 12:58:50 PM

to tesseract-ocr

Hi and welcome to the group.

On Thursday, November 16, 2023 at 10:25:40 AM UTC-5 israel...@gmail.com wrote:

I want to create an entirely new language from handwritten texts.

I think the "handwritten" aspect is probably at least as important as the "new language" part. Tesseract was designed to do optical character recognition of mechanically printed texts. Handwriting is very different. There have been some attempts to do this in the past, but only with block printed characters and, even then, recognition rates were under 90%, which isn't adequate for most uses. If you search the archives here or google "tesseract handwriting" (without the quotes), you'll find lots of reading material.

The language in question is Innu-aimun. The alphabet is quite simple, consisting of some of the Latin alphabets with the addition of a superscript u character that always appears after a consonant.

There is a Latin script model which has been trained in a language independent fashion, so you could give that a try to see how well it does (modulo your superscript u).
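A quick way to try that is sketched below. It is only a sketch, and it assumes the `tesseract` command is on your PATH and that the Latin script model is installed as `script/Latin` in your tessdata directory. The helper names and the choice of `--psm 7` (treat the image as a single text line) are my own assumptions for illustration, not something your setup necessarily needs.

```python
import subprocess


def latin_script_command(image_path, psm=7):
    """Build a tesseract command line that uses the Latin script model.

    Assumes the model is installed as script/Latin.traineddata; psm=7 tells
    tesseract to treat the image as a single line of text.
    """
    return [
        "tesseract", str(image_path), "stdout",
        "-l", "script/Latin",
        "--psm", str(psm),
    ]


def recognize_line(image_path):
    """Run tesseract on one line image and return the recognized text."""
    result = subprocess.run(
        latin_script_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Running `recognize_line` on a handful of your line images and comparing the output with your own transcriptions should give a rough idea of how far the stock model gets before any training.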

For training with natural images (standard training uses synthesized images), look at some of the examples in the tesstrain wiki, particularly the GT4HistOCR page.

For any training you'll need ground truth text matched with your segmented line images to train on.
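As a rough sketch of what "matched" means in practice: tesstrain expects each line image (`.png` or `.tif`) to sit next to a transcription file with the same base name and a `.gt.txt` extension. The helper below checks a directory for that pairing and counts the characters used in the transcriptions, which is one way to confirm that the superscript u actually appears in your ground truth. The function names and the directory layout are assumptions for illustration.

```python
from collections import Counter
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".tif"}


def pair_ground_truth(gt_dir):
    """Return (pairs, missing) for the line images found in gt_dir.

    pairs   -- list of (image, transcript) paths where <name>.gt.txt exists
    missing -- list of images that have no transcription file yet
    """
    pairs, missing = [], []
    for image in sorted(Path(gt_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        transcript = image.with_name(image.stem + ".gt.txt")
        if transcript.is_file():
            pairs.append((image, transcript))
        else:
            missing.append(image)
    return pairs, missing


def character_inventory(pairs):
    """Count every non-whitespace character used in the transcriptions."""
    counts = Counter()
    for _, transcript in pairs:
        text = transcript.read_text(encoding="utf-8")
        counts.update(ch for ch in text if not ch.isspace())
    return counts
```

Any image listed in `missing` still needs a transcription, and any character that never shows up in the inventory cannot be learned from that data.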

Good luck! It sounds like an interesting (but non-trivial) project.

Tom

Ali hussain

Nov 17, 2023, 10:08:10 PM

to tesseract-ocr

Read the full thread at this link: https://groups.google.com/g/tesseract-ocr/c/-G7TZEnVHgE . I think it can help if you are looking into fine-tuning or training from scratch, but I don't know about handwritten texts.
