Hi everyone,
I'm looking for input on which emerging market languages currently have the most urgent need for better OCR support.
Many low-resource languages still suffer from poor or missing trained models in Tesseract-OCR and PaddleOCR, mainly because collecting enough high-quality real data is extremely time-consuming and expensive.
I’ve developed a synthetic data generation tool (Synthetic Engine) specifically for this problem. It can create large volumes of realistic training samples for scripts and languages where real labeled data is scarce. This allows us to quickly bootstrap and train new language models.
I’d like to collect feedback from the community on this.
I’m happy to use my tool to help generate synthetic data and attempt to build a new model for the languages that need it most. If you’re interested, I can also share sample synthetic data or run small experiments.
Looking forward to your thoughts!
Best regards, Alro Wilde
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/6282a50f-e13c-457c-9f9a-eace8affd7c4n%40googlegroups.com.
Thank you for the reminder about Tigrinya (Tigrigna) being a low-resource language. There are still many low-resource languages that lack proper OCR support.
I've been thinking about this challenge over the past few days. I believe I’ve developed some practical solutions to help build OCR models more effectively for low-resource languages. I’ve also started working on constructing a combined detection + recognition dataset for Tigrinya, and my next step is to train a dedicated Tigrinya OCR model to test how well it performs.


Hi Nikola, could you share a few example images or sample texts from pre-revolutionary books? I would like to see the actual challenges and scenarios.