Hi everyone,
I'm looking for input on which emerging market languages currently have the most urgent need for better OCR support.
Many low-resource languages still suffer from poor or missing trained models in Tesseract-OCR and PaddleOCR, mainly because collecting enough high-quality real data is extremely time-consuming and expensive.
I’ve developed a synthetic data generation tool (Synthetic Engine) specifically for this problem. It can create large volumes of realistic training samples for scripts and languages where real labeled data is scarce. This allows us to quickly bootstrap and train new language models.
I’d like to collect feedback from the community on which languages and scripts should be prioritized.
I’m happy to use my tool to help generate synthetic data and attempt to build a new model for the languages that need it most. If you’re interested, I can also share sample synthetic data or run small experiments.
Looking forward to your thoughts!
Best regards, Alro Wilde