Evaluating Tesseract with new domain-specific documents

Matthew Hodgskiss

Jan 25, 2019, 5:56:59 AM

to tesseract-ocr

Hi,

I am interested in evaluating the performance of Tesseract on some domain-specific documents. I would like to establish a baseline using vanilla settings and then compare it with results obtained using domain-specific user-words and user-patterns, as documented here.

Is it possible to leverage the OCR evaluation process, which must be performed during model training, to calculate word and character error rates on new (domain-specific) documents?

If this is not possible, I could synthesise my own scan images from documents using ImageMagick, but it would be good if anyone could recommend a standard algorithm or library for calculating character and word error rates.

Thanks in advance

Matt

Lorenzo Bolzani

Jan 25, 2019, 6:47:09 AM

to tesser...@googlegroups.com

This is an option if you want to consider missing/extra chars too:

https://en.wikipedia.org/wiki/Levenshtein_distance

You should be able to find implementations for most languages.
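
For illustration, here is a minimal sketch of how character and word error rates could be derived from the Levenshtein distance. The function names (levenshtein, cer, wer) are hypothetical and not part of Tesseract or any of its tools, and the sketch assumes you already have the ground-truth text and the OCR output as plain strings. Error rates are normalised by the length of the reference, which is a common convention but not the only one.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # item missing from hyp (deletion)
                curr[j - 1] + 1,     # extra item in hyp (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "the quick brown box"
    print("CER:", cer(truth, ocr))
    print("WER:", wer(truth, ocr))
```

Splitting on whitespace is the simplest possible word tokenisation; you may want to normalise case, punctuation or whitespace first, depending on what counts as an error in your domain.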

Bye

Lorenzo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5cb0a65c-dae5-431b-9d0c-2c099d2cf90b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Jan 25, 2019, 7:04:13 AM

to tesser...@googlegroups.com

Also see:

https://github.com/impactcentre/ocrevalUAtion

https://github.com/Shreeshrii/ocr-evaluation-tools

https://github.com/tesseract-ocr/test/tree/master/unlvtests

--

____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

Matthew Hodgskiss

Jan 31, 2019, 5:41:29 AM

to tesseract-ocr

Thanks very much for the advice. The ocr-evaluation-tools look particularly useful.
