Which Low-Resource Languages Continue to Challenge Tesseract?


Alro wilde

May 28, 2026, 2:44:52 PM
to tesseract-ocr

Hi everyone,

I'm looking for input on which emerging market languages currently have the most urgent need for better OCR support.

Many low-resource languages still suffer from poor or missing trained models in Tesseract-OCR and PaddleOCR, mainly because collecting enough high-quality real data is extremely time-consuming and expensive.

I’ve developed a synthetic data generation tool (Synthetic Engine) specifically for this problem. It can create large volumes of realistic training samples for scripts and languages where real labeled data is scarce. This allows us to quickly bootstrap and train new language models.

I’d like to collect feedback from the community:

  • Which languages or scripts in emerging markets are you finding most difficult to support right now?
  • Where is the current support in Tesseract-OCR and PaddleOCR clearly insufficient?

I’m happy to use my tool to help generate synthetic data and attempt to build a new model for the languages that need it most. If you’re interested, I can also share sample synthetic data or run small experiments.

Looking forward to your thoughts!

Best regards, Alro Wilde

Dmitry Yatcenko

May 29, 2026, 9:50:17 AM
to tesseract-ocr
I use Tesseract in a program for translating and redesigning card games. Often the problem isn't the language but the grotesque fonts on the cards. Even when I have the font file, I can't make the OCR recognize that specific font without training a model. I'd like a simple, user-friendly solution that would let me create a model for a specific font file in two clicks, optionally linked to a specific language (Russian, English, Spanish). It would also be interesting, though probably impossible, to recognize icons and replace them with macros like [gun], [sword], [hearth]...
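
One possible starting point for the font-file workflow described above is Tesseract's `text2image` training tool, which renders a text file in a chosen font into an image plus a box file. The sketch below only assembles the command line and does not run it. The helper name `build_text2image_cmd`, the font name "Card Gothic", and all paths are hypothetical placeholders. The rendered pairs would still have to go through the usual fine-tuning steps, so this is not a two-click solution.

```python
import shlex


def build_text2image_cmd(font_name, fonts_dir, text_file, output_base, ptsize=12):
    """Assemble a text2image command line as an argv list (does not run it)."""
    return [
        "text2image",
        f"--font={font_name}",          # font name as fontconfig reports it
        f"--fonts_dir={fonts_dir}",     # directory holding the font file
        f"--text={text_file}",          # UTF-8 text to render
        f"--outputbase={output_base}",  # prefix for the generated image and box file
        f"--ptsize={ptsize}",           # point size of the rendered text
    ]


# Hypothetical font name and paths, for illustration only.
cmd = build_text2image_cmd(
    "Card Gothic", "./fonts", "rus_sample.txt", "out/rus.CardGothic.exp0"
)
print(shlex.join(cmd))
```

The argv list can be passed to `subprocess.run(cmd, check=True)` once the Tesseract training tools are installed.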

On Thursday, May 28, 2026 at 21:44:52 UTC+3, alro...@gmail.com wrote:

Alro wilde

May 29, 2026, 12:42:07 PM
to tesseract-ocr
It seems you could use template matching to solve this problem, provided the font or the card name is large enough.

You could also add YOLO to your technical stack. In my experience, the number of cards is enumerable, so it may be a classification task.
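
A minimal sketch of the template-matching idea mentioned above, assuming grayscale images stored as plain 2D lists and a sum-of-squared-differences score (lower is better). The function name `match_template` and the toy data are illustrative only. On real card scans one would more likely call OpenCV's `cv2.matchTemplate` than loop in pure Python.

```python
def match_template(image, template):
    """Return (row, col, score) of the best match by sum of squared differences.

    image and template are 2D lists of grayscale values; lower score is better.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    # Slide the template over every position where it fits entirely.
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            ssd = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(th)
                for j in range(tw)
            )
            if best is None or ssd < best[2]:
                best = (r, c, ssd)
    return best


# Toy 4x5 "card" containing a 2x2 icon whose top-left corner is at row 1, column 2.
card = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 1, 0],
    [0, 0, 1, 9, 0],
    [0, 0, 0, 0, 0],
]
icon = [
    [9, 1],
    [1, 9],
]
row, col, score = match_template(card, icon)
print(row, col, score)  # prints: 1 2 0
```

A matched icon position could then be mapped to a text macro such as [sword], with a score threshold deciding whether the icon is present at all.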

Robel Grmay

Jun 1, 2026, 4:38:08 AM
to tesser...@googlegroups.com
Tigrigna is a low-resource language.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/6282a50f-e13c-457c-9f9a-eace8affd7c4n%40googlegroups.com.

Nikola Smolenski

Jun 1, 2026, 8:20:27 AM
to tesser...@googlegroups.com
Not emerging or low-resource, but strangely neglected: Russian old orthography. All books printed in Russia before the October Revolution used it, including those printed in Ukraine and Belarus. Without support for it, it is not really possible to OCR anything from Russia that is more than 100 years old.


Alro wilde

Jun 15, 2026, 5:27:16 AM
to tesseract-ocr

Thank you for the reminder about Tigrinya (Tigrigna) being a low-resource language. There are still many low-resource languages that lack proper OCR support.

I've been thinking about this challenge over the past few days. I believe I’ve developed some practical solutions to help build OCR models more effectively for low-resource languages. I’ve also started working on constructing a combined detection + recognition dataset for Tigrinya, and my next step is to train a dedicated Tigrinya OCR model to test how well it performs.

Snipaste_2026-06-15_11-13-54.png.cropped.png

Snipaste_2026-06-15_11-17-05.png

Alro wilde

Jun 15, 2026, 5:35:58 AM
to tesseract-ocr

Hi Nikola, could you share a few example images or sample texts from pre-revolutionary books? I would like to see the actual challenges and scenarios.

Nikola Smolenski

Jun 16, 2026, 5:37:11 AM
to tesser...@googlegroups.com

Blair

Jun 18, 2026, 8:33:00 AM
to tesser...@googlegroups.com
Please pass me the release dates for all UMG and Sony releases, if you could, so that I can work out my own schedule and plans. And if it's possible, look up whether they are using any web host's built-in website or app, so that I can host their site on my website and redirect the crowd to my own site when they open it, that kind of thing.
Thank you

Subhashish

Jun 19, 2026, 6:20:10 AM
to tesseract-ocr
Hi Alro,

I'd add two languages from India -- Santali and Ho -- with a caveat.

Santali is recognised under the 8th Schedule of the Indian Constitution. It is also included for MLE (mother-language-based education) in some schools, but its widespread use in education and digital spaces is tied to socio-economics (an average speaker will prioritise getting paid for work first and only then think of promoting their language, since there are barely any jobs available just for being fluent in it). So progress is naturally slow. Community members such as Prasanta Hembram and Ramjit Tudu have attempted to train a model, but have paused or given up because of other life priorities. I shared a PR here a few days ago: https://github.com/tesseract-ocr/tessdata/pull/203. Santali has a sizeable corpus of printed publications, some of which have made it to Wikisource with parallel text. In fact, that was the source for the actual scan-based training.

Ho, on the other hand, could use some help. 

Some resources in Ho:

- https://ho.triballanguage.in/ (site is broken but has some content)
- https://incubator.wikimedia.org/wiki/Category:Wp/hoc (not all articles are in Warang Citi, the native script)

Subhashish