Hi,
I am interested in evaluating the performance of Tesseract on a domain-specific test set. I would like to establish a baseline with vanilla settings and then re-run with some domain-specific user-words and user-patterns, as documented
here.
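For reference, this is roughly what I have in mind for the two runs (just a sketch: I am assuming the --user-words / --user-patterns command-line flags from that page, and the file names are placeholders):

    import subprocess

    # Baseline run: vanilla settings, no extra configuration.
    subprocess.run(["tesseract", "scan.png", "baseline"], check=True)

    # Tuned run: same image, with domain-specific word and pattern lists.
    subprocess.run(
        ["tesseract", "scan.png", "tuned",
         "--user-words", "domain-words.txt",
         "--user-patterns", "domain-patterns.txt"],
        check=True,
    )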
Is it possible to leverage the OCR evaluation process that has to be carried out during model training to calculate word and character error rates on new (domain-specific) documents?
If this is not possible, I could synthesise my own scan images from source documents using
ImageMagick, but it would be good if anyone could recommend a standard algorithm/library for calculating character and word error rates.
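To make the ask concrete, this is the kind of calculation I mean: a plain Levenshtein edit distance, over characters for CER and over whitespace-separated tokens for WER (just a sketch, not tied to any particular library):

    def edit_distance(ref, hyp):
        # Classic dynamic-programming Levenshtein distance between two sequences.
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            curr = [i]
            for j, h in enumerate(hyp, start=1):
                cost = 0 if r == h else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def cer(reference, hypothesis):
        # Character error rate: edits per reference character.
        return edit_distance(list(reference), list(hypothesis)) / len(reference)

    def wer(reference, hypothesis):
        # Word error rate: edits per whitespace-separated reference token.
        ref_words, hyp_words = reference.split(), hypothesis.split()
        return edit_distance(ref_words, hyp_words) / len(ref_words)

    print(cer("domain specific", "d0main specific"))  # ~0.067
    print(wer("domain specific", "d0main specific"))  # 0.5

If there is an accepted standard tool for this, I would rather use that than roll my own.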
Thanks in advance
Matt