Guidance for optimising a model

Thomas

Jan 19, 2026, 10:31:19 AM

to tesseract-ocr

Hi!

Thank you for reading. I'm new here and am having a hard time getting my head around how the training works.
I have read https://github.com/tesseract-ocr/tesstrain , but I can't figure out which approach is best for my case.

My situation is like this:
I have about 22,000 pages (PDF / TIFF images), all with the same font and similar content.
They contain English plus IAST-transliterated Sanskrit or Bengali.

There is an IAST.traineddata model that works quite well but makes some mistakes, e.g. the dot below a ḍ or above a ṁ is sometimes missing.
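To make that kind of error measurable, a small sketch like the following could compare an OCR line with its hand-corrected version and list the positions where the base letter matches but the diacritic differs. The function names `strip_marks` and `missing_diacritics` are hypothetical, and the sketch assumes the two strings line up character by character after NFC normalization (that is, the only errors are diacritic errors).

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Return text with all combining marks removed (e.g. ḍ -> d, ṁ -> m)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def missing_diacritics(ocr: str, corrected: str) -> list[tuple[int, str, str]]:
    """List (index, ocr_char, corrected_char) where only the diacritic differs.

    Assumption: both strings align one-to-one after NFC normalization.
    If their lengths differ, no alignment is attempted and [] is returned.
    """
    ocr_nfc = unicodedata.normalize("NFC", ocr)
    gt_nfc = unicodedata.normalize("NFC", corrected)
    if len(ocr_nfc) != len(gt_nfc):
        return []
    return [
        (i, o, g)
        for i, (o, g) in enumerate(zip(ocr_nfc, gt_nfc))
        if o != g and strip_marks(o) == strip_marks(g)
    ]
```

Running this over a handful of corrected pages would show which characters lose their dots most often, which could help decide which pages are worth correcting first.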

I want to optimise this model to work as accurately as possible on my data set. I don't care if it can no longer handle other fonts.

I was thinking that I could run my existing model on some pages, correct the output and feed it back somehow, but I can't figure out how. The information I find online mixes instructions for different Tesseract versions (3, 4 and 5).
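A rough sketch of that loop, under assumptions taken from the tesstrain README rather than from anything confirmed in this thread: ground truth is a folder of single-line images (`.tif` or `.png`), each paired with a `.gt.txt` transcription of the same base name, and fine-tuning an existing model goes through the `START_MODEL` and `TESSDATA` variables of `make training`. The helper names `paired_ground_truth` and `training_command`, the model name `IAST_mybooks` and all paths are hypothetical.

```python
from pathlib import Path

# Image types checked here; an assumption, adjust to your own line images.
IMAGE_SUFFIXES = (".tif", ".png")


def paired_ground_truth(gt_dir: Path) -> list[str]:
    """Return sorted base names of line images that have a matching .gt.txt."""
    stems = []
    for image in gt_dir.iterdir():
        if image.suffix.lower() in IMAGE_SUFFIXES:
            if (gt_dir / f"{image.stem}.gt.txt").is_file():
                stems.append(image.stem)
    return sorted(stems)


def training_command(model_name: str, start_model: str, tessdata: str,
                     max_iterations: int) -> list[str]:
    """Build (but do not run) a tesstrain fine-tuning command line."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",         # name of the new model
        f"START_MODEL={start_model}",       # existing model to continue from
        f"TESSDATA={tessdata}",             # folder holding the start model
        f"MAX_ITERATIONS={max_iterations}",
    ]
```

For example, `training_command("IAST_mybooks", "IAST", "/path/to/tessdata", 10000)` returns the argument list for a command that would be run from the tesstrain checkout, with the corrected line images and transcriptions placed in `data/IAST_mybooks-ground-truth`. Checking `paired_ground_truth` first catches images whose corrected text is still missing.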

If there is a clear step-by-step, command-by-command guide, that would be very useful.

Any assistance will be greatly appreciated. If it turns out to be difficult, I might be able to collect some donations as a reward for someone who does the training for me.

References:
Data set: all PDFs found here that start with a copyright notice from ISKCON MEDIA VEDIC LIBRARY. (Please don't worry about the copyright; it's my late friend's work that I am trying to preserve.)

Existing model:
https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST
Probably related langdata: https://github.com/tesseract-ocr/langdata/tree/main/iast
