Evaluating Tesseract with new domain-specific documents

Matthew Hodgskiss

Jan 25, 2019, 5:56:59 AM
to tesseract-ocr
Hi,

I am interested in evaluating the performance of Tesseract against some domain-specific test documents. I would like to establish a baseline using vanilla settings and then compare it with a run using some domain-specific user-words and user-patterns, as documented here.
Is it possible to reuse the OCR evaluation process that must be performed during model training to calculate word and character error rates on new (domain-specific) documents?

If this is not possible, then I could synthesise my own scan images from documents using ImageMagick, but it would be good if anyone could recommend a standard algorithm/library for calculating character and word error rates.

Thanks in advance

Matt



Lorenzo Bolzani

Jan 25, 2019, 6:47:09 AM
to tesser...@googlegroups.com
This is an option if you want to consider missing/extra chars too:


You should be able to find implementations for most languages.
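
The link in this message was not preserved, so the following is only a minimal sketch of one common approach: character and word error rates computed from edit (Levenshtein) distance, which counts substitutions as well as missing and extra characters. That this is the metric meant above is an assumption, and the function names `edit_distance`, `cer` and `wer` are illustrative rather than taken from any library.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # item missing from the OCR output
                curr[j - 1] + 1,     # extra item in the OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference, ocr_output):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


def wer(reference, ocr_output):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, ocr_output.split()) / len(ref_words)
```

For example, `cer("abcd", "abd")` gives 0.25, since one character out of four is missing. Both rates can exceed 1.0 when the OCR output contains many extra items, because the divisor is the reference length.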


Bye

Lorenzo



Shree Devi Kumar

Jan 25, 2019, 7:04:13 AM
to tesser...@googlegroups.com

Matthew Hodgskiss

Jan 31, 2019, 5:41:29 AM
to tesseract-ocr
Thanks very much for the advice. The ocr-evaluation tools look particularly useful.