Hi!
Thank you for reading. I'm new here and am having a hard time getting my head around how training works.
I read https://github.com/tesseract-ocr/tesstrain, but I can't figure out what the best approach is.
My situation is like this:
I have about 22,000 pages (PDF / TIFF images), all with the same font and similar content.
They contain English plus IAST-transliterated Sanskrit or Bengali.
There is an IAST.traineddata model that works quite well but makes some mistakes, e.g. the dot below a ḍ or above a ṁ is sometimes missing.
I want to optimise this model to work as well as possible on my data set; I don't care if it can no longer handle other fonts.
I was thinking that I could run my existing model on some pages, correct the output and feed it back somehow, but I can't figure out how. The information I find online mixes instructions for versions 3, 4 and 5.
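To make the "correct the output and feed it back" idea concrete, here is a rough sketch of the bookkeeping I imagine. As far as I understand the tesstrain README, it wants single-line images paired with `.gt.txt` transcription files in a `data/MODEL_NAME-ground-truth` folder. Everything else here is my own assumption: the function name `build_ground_truth`, the model name `MYMODEL`, the idea that the pages are already cut into line images, and the NFC normalisation (my guess at keeping ḍ and ṁ encoded consistently).

```python
from pathlib import Path
import shutil
import unicodedata


def build_ground_truth(line_images, corrected_text, out_dir):
    """Copy each line image into out_dir and write its corrected text beside it.

    line_images:    paths to single-line images (hypothetical input; cutting
                    pages into lines is not shown here).
    corrected_text: dict mapping image file name -> hand-corrected transcription.
    out_dir:        e.g. data/MYMODEL-ground-truth (layout as I understand it
                    from the tesstrain README; MYMODEL is a made-up name).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in map(Path, line_images):
        text = corrected_text.get(image.name)
        if not text or not text.strip():
            continue  # skip lines that have not been corrected yet
        # Assumption: store precomposed characters (NFC) so that "ḍ" is always
        # one code point rather than "d" followed by a combining dot below.
        text = unicodedata.normalize("NFC", text.strip())
        shutil.copy(image, out_dir / image.name)
        gt_path = out_dir / (image.stem + ".gt.txt")
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

If this layout is right, I believe training would then be started from the tesstrain folder with something like `make training MODEL_NAME=MYMODEL START_MODEL=IAST`, but I have not managed to verify that myself.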
If there is a clear step-by-step, command-by-command guide, that would be very useful.
Any assistance will be greatly appreciated. If it turns out to be difficult, I might be able to collect some donations as a reward for someone who does the training for me.
References:
All PDFs found here that start with a copyright notice from ISKCON MEDIA VEDIC LIBRARY (please don't worry about the copyright; it's my late friend's work that I am trying to preserve):
data-set
Existing model:
https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST
Probably related langdata:
https://github.com/tesseract-ocr/langdata/tree/main/iast