Which Low-Resource Languages Continue to Challenge Tesseract?

Alro wilde

May 28, 2026, 2:44:52 PM

to tesseract-ocr

Hi everyone,

I'm looking for input on which emerging market languages currently have the most urgent need for better OCR support.

Many low-resource languages still suffer from poor or missing trained models in Tesseract-OCR and PaddleOCR, mainly because collecting enough high-quality real data is extremely time-consuming and expensive.

I’ve developed a synthetic data generation tool (Synthetic Engine) specifically for this problem. It can create large volumes of realistic training samples for scripts and languages where real labeled data is scarce. This allows us to quickly bootstrap and train new language models.

I’d like to collect feedback from the community:

Which languages or scripts in emerging markets are you finding most difficult to support right now?
Where is the current support in Tesseract-OCR and PaddleOCR clearly insufficient?

I’m happy to use my tool to help generate synthetic data and attempt to build a new model for the languages that need it most. If you’re interested, I can also share sample synthetic data or run small experiments.

Looking forward to your thoughts!

Best regards, Alro Wilde

Dmitry Yatcenko

May 29, 2026, 9:50:17 AM

to tesseract-ocr

I use Tesseract in a program for translating and redesigning card games. Often the problem isn't the language but the grotesque fonts on the cards. I have the font file, but without training a model I can't force the OCR to recognize that specific font. I'd like a simple, user-friendly solution: one that would let me create a model for a specific font file in two clicks, optionally linking it to a specific language (Russian, English, Spanish). It would also be interesting, though perhaps impossible, to recognize icons and replace them with macros like [gun], [sword], [hearth]...
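For readers wondering what the font-specific workflow could look like today, here is a minimal sketch of the planning step, not a finished tool. It assumes Tesseract's text2image utility is installed and that the font is visible to it through a fonts directory. The function name, the file naming scheme, and the page-size values are hypothetical choices made for illustration. The sketch only builds the commands and runs nothing.

```python
from pathlib import Path


def plan_font_samples(lines, font_name, fonts_dir, out_dir):
    """Plan one text2image call per training line.

    Returns a list of (gt_path, text, argv) tuples. Nothing is written
    to disk and nothing is executed here.
    """
    out = Path(out_dir)
    jobs = []
    for i, raw in enumerate(lines):
        text = raw.strip()
        if not text:
            continue  # blank lines carry no training signal
        base = out / f"line_{i:05d}"
        # Ground-truth file holding the single line of text. text2image
        # reads its input text from a file, so this file doubles as input.
        gt_path = out / f"line_{i:05d}.gt.txt"
        argv = [
            "text2image",
            f"--text={gt_path}",
            f"--outputbase={base}",
            f"--font={font_name}",
            f"--fonts_dir={fonts_dir}",
            # Assumed values: a small page so that each image holds one line.
            "--ptsize=24",
            "--xsize=2400",
            "--ysize=200",
            "--margin=20",
        ]
        jobs.append((gt_path, text, argv))
    return jobs
```

To use the plan, one would write each text to its gt_path and run each argv with subprocess.run, then point a training setup such as tesstrain at the resulting image and .gt.txt pairs. Whether the box files that text2image also emits should be kept depends on the training setup, so treat that step as an open question.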

On Thursday, May 28, 2026 at 21:44:52 UTC+3, alro...@gmail.com wrote:

Alro wilde

May 29, 2026, 12:42:07 PM

to tesseract-ocr

It seems you could use template matching to solve this problem, provided the font or the card name is large enough.

You could also add YOLO to your stack. In my experience the set of cards is finite and enumerable, so this may really be a classification task.
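As a rough illustration of the template-matching idea, here is a minimal pure-Python sketch that slides a small grayscale template over an image and keeps the position with the lowest sum of squared differences. The function name and the list-of-lists image format are assumptions made for the example. A real pipeline would more likely use an optimized routine such as OpenCV's cv2.matchTemplate.

```python
def match_template(image, template):
    """Find the best position of `template` inside `image`.

    Both arguments are 2D lists of grayscale values. Returns
    (row, col, score), where score is the sum of squared differences
    (0 means a perfect match), or None if the template does not fit.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = sum(
                (image[r + dr][c + dc] - template[dr][dc]) ** 2
                for dr in range(th)
                for dc in range(tw)
            )
            if best is None or score < best[2]:
                best = (r, c, score)
    return best
```

With one template per icon, the icon whose best score is lowest (and below some threshold chosen by experiment) could be emitted as its macro, for example [sword].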

Robel Grmay

Jun 1, 2026, 4:38:08 AM

to tesser...@googlegroups.com

Tigrigna is a low-resource language.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/6282a50f-e13c-457c-9f9a-eace8affd7c4n%40googlegroups.com.

Nikola Smolenski

Jun 1, 2026, 8:20:27 AM

to tesser...@googlegroups.com

Not emerging or low-resource, but strangely neglected: Russian old orthography. All books printed in Russia before the October Revolution used it, including those from Ukraine and Belarus. Without support for it, it is not really possible to OCR anything from Russia that is more than 100 years old.


Alro wilde

Jun 15, 2026, 5:27:16 AM

to tesseract-ocr

Thank you for the reminder about Tigrinya (Tigrigna) being a low-resource language. There are still many low-resource languages that lack proper OCR support.

I've been thinking about this challenge over the past few days. I believe I’ve developed some practical solutions to help build OCR models more effectively for low-resource languages. I’ve also started working on constructing a combined detection + recognition dataset for Tigrinya, and my next step is to train a dedicated Tigrinya OCR model to test how well it performs.

[Attached image: Snipaste_2026-06-15_11-13-54.png.cropped.png]

Alro wilde

Jun 15, 2026, 5:35:58 AM

to tesseract-ocr

Hi Nikola, could you share a few example images or sample texts from pre-revolutionary books? I'd like to see the actual challenges and scenarios.

Nikola Smolenski

Jun 16, 2026, 5:37:11 AM

to tesser...@googlegroups.com

See for example https://en.wikipedia.org/wiki/Reforms_of_Russian_orthography#Comparison and the rest of the article.
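To make the gap concrete, here is a small sketch that checks text or a model's character set for the four letters dropped by the 1918 reform (yat, decimal i, fita, izhitsa), in both cases. The function names are hypothetical. The charset string is assumed to be whatever list of characters a given model supports, obtained however is convenient.

```python
# Letters removed from Russian by the 1918 orthography reform:
# yat (Ѣ ѣ), decimal i (І і), fita (Ѳ ѳ), izhitsa (Ѵ ѵ).
PRE_REFORM = "ѢѣІіѲѳѴѵ"


def pre_reform_letters_in(text):
    """Return the pre-reform letters that occur in `text`, sorted by code point."""
    return sorted(set(text) & set(PRE_REFORM))


def missing_from_charset(charset):
    """Return the pre-reform letters that a model's character set lacks."""
    return [ch for ch in PRE_REFORM if ch not in charset]
```

A model whose character set lacks any of these letters cannot output them, however well it reads the rest of the page. The check does not cover the other change of the reform, the dropped word-final hard sign, because ъ itself remains in the modern alphabet.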

Blair

Jun 18, 2026, 8:33:00 AM

to tesser...@googlegroups.com

Please pass me the release dates of all UMG and Sony releases so that I can plan my own schedule, if you could. And if it's possible, look up whether they are using any web host's built-in website or app, so that I can host their site within my website and redirect the crowd to my own site when they open it, that kind of thing.

Thank you

Subhashish

Jun 19, 2026, 6:20:10 AM

to tesseract-ocr

Hi Alro,

I'd add two languages from India -- Santali and Ho -- with a caveat.

Santali is recognised as a part of the 8th Schedule of the Indian Constitution. It is also included for MLE (mother-language-based education) in some schools, but widespread use in education and digital spaces is tied to socio-economics: an average speaker will prioritise getting paid for work first and only then think of promoting their language, since there are barely any jobs available just for being fluent in it. So progress is naturally slow. Community members such as Prasanta Hembram and Ramjit Tudu have attempted to train models, but have paused or given up for other life priorities. I shared a PR here a few days back: https://github.com/tesseract-ocr/tessdata/pull/203. Santali has a sizeable corpus of printed publications, some of which have made it to Wikisource with parallel text. In fact, that was the source for the actual scan-based training.

Ho, on the other hand, could use some help.

Some resources in Ho:

- https://ho.triballanguage.in/ (site is broken but has some content)

- https://incubator.wikimedia.org/wiki/Category:Wp/hoc (not all articles are in Warang Citi, the native script)
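Since not all of that content is in the native script, a corpus built from it would need filtering first. Here is a minimal sketch that keeps only lines written mostly in a given Unicode block. The Warang Citi (U+118A0 to U+118FF) and Ol Chiki (U+1C50 to U+1C7F) ranges are the standard Unicode blocks. The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
WARANG_CITI = (0x118A0, 0x118FF)  # Unicode block used for Ho
OL_CHIKI = (0x1C50, 0x1C7F)       # Unicode block used for Santali


def script_share(text, block):
    """Fraction of non-whitespace characters in `text` that fall inside `block`."""
    lo, hi = block
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(lo <= ord(ch) <= hi for ch in chars) / len(chars)


def keep_lines(lines, block, threshold=0.5):
    """Keep the lines whose share of in-block characters reaches `threshold`."""
    return [line for line in lines if script_share(line, block) >= threshold]
```

Punctuation and ASCII digits count against a line here, so the threshold may need lowering for text that mixes scripts heavily.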

Subhashish
