What is the "Confidence" value returned by Tesseract and how is it calculated?


Thilina Jayathilaka

Jun 1, 2017, 7:09:12 AM
to tesseract-ocr
Hello, 

1. What is the confidence value returned by the Tesseract API, and how is it calculated?

2. Is there any way to change the accuracy levels of Tesseract?

3. Can I get the confidence value for each letter separately when I pass in an image that contains a word?

akhil katpally

Jun 7, 2017, 3:51:43 PM
to tesseract-ocr
3-> Yes, you can get the confidence at the character level. Please see the Tesseract API examples: https://github.com/tesseract-ocr/tesseract/wiki/APIExample#example-of-iterator-over-the-classifier-choices-for-a-single-symbol
1-> I don't know; I am looking for this as well. I hope this helps: whenever Tesseract tries to recognize a particular character, it has several choices for that letter. Of those, it takes the one with the maximum confidence value and returns it to us. You can also get the other choices and their confidences with the tesseract::ChoiceIterator class.
2-> What do you mean by changing accuracy levels of tesseract?
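
The selection described in point 1 can be sketched in a few lines. This is only an illustration in Python with made-up candidate values; `best_choice`, `ranked`, and the sample numbers are hypothetical and are not Tesseract's own code. In the real C++ API the candidates come from `tesseract::ChoiceIterator`, as in the linked wiki example.

```python
from typing import List, Tuple

# One candidate for a symbol: (text, confidence in percent).
# All values used below are invented for illustration, not real Tesseract output.
Choice = Tuple[str, float]


def best_choice(choices: List[Choice]) -> Choice:
    """Return the candidate with the highest confidence."""
    if not choices:
        raise ValueError("no candidates for this symbol")
    return max(choices, key=lambda c: c[1])


def ranked(choices: List[Choice]) -> List[Choice]:
    """Return all candidates, most confident first."""
    return sorted(choices, key=lambda c: c[1], reverse=True)


if __name__ == "__main__":
    candidates = [("O", 71.2), ("0", 88.5), ("Q", 12.9)]
    print("best:", best_choice(candidates))
    print("all :", ranked(candidates))
```

Keeping the full ranked list is useful when the top two candidates are close, since that is often where a dictionary or a format check can break the tie.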

Thilina Jayathilaka

Jun 9, 2017, 3:09:15 AM
to tesseract-ocr
> What do you mean by changing accuracy levels of tesseract?

I meant: if I know how Tesseract calculates its confidence value, can I influence the final confidence by changing some configuration, such as dictionary search or unambiguous char probability?

akhil katpally

Jun 9, 2017, 5:00:27 PM
to tesseract-ocr
Understood. I would recommend searching the papers published on Tesseract; one of them has probably touched on how the confidence is calculated.

ShreeDevi Kumar

Jun 10, 2017, 12:37:31 AM
to tesser...@googlegroups.com

akhil katpally

Jun 10, 2017, 1:36:55 AM
to tesser...@googlegroups.com
Thanks, Shree.


Michael John Ambait

Jan 13, 2019, 11:50:41 PM
to tesseract-ocr
If anyone is still interested, I found it here: https://zdenop.github.io/tesseract-doc/cluster_8cpp_source.html

Sathyanarayana Gorla

May 5, 2020, 12:00:40 AM
to tesseract-ocr
1. I would like to know how the hOCR algorithm works and how it gives confidence scores for each character.
2. Can we change anything in the hOCR algorithm to increase the performance?
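
For reference, hOCR is an output format that Tesseract can write, and in that output each word's confidence appears in the `title` attribute as `x_wconf`. Below is a minimal sketch of reading those values in Python, assuming word-level `ocrx_word` spans. The sample markup is made up, and `WordConfParser` and `word_confidences` are hypothetical names chosen for this example.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Made-up fragment in the shape of Tesseract's hOCR output; values are invented.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Hello</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 109 92 180 116; x_wconf 89'>w0rld</span>
</span>
"""

WCONF_RE = re.compile(r"x_wconf\s+(\d+(?:\.\d+)?)")


class WordConfParser(HTMLParser):
    """Collect (text, confidence) for every ocrx_word span."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, Optional[float]]] = []
        self._depth = 0  # span nesting depth inside the current word; 0 = outside
        self._conf: Optional[float] = None
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth > 0:  # a nested span inside a word
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = WCONF_RE.search(attr.get("title") or "")
            self._conf = float(match.group(1)) if match else None
            self._chunks = []
            self._depth = 1

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:  # the word span just closed
                self.words.append(("".join(self._chunks).strip(), self._conf))


def word_confidences(hocr: str) -> List[Tuple[str, Optional[float]]]:
    parser = WordConfParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, conf in word_confidences(SAMPLE_HOCR):
        print(text, conf)
```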

Lorenzo Bolzani

May 5, 2020, 7:54:54 AM
to tesser...@googlegroups.com
Hi,
I think the confidence score is returned by the neural network itself. In my experience, results with values below 95 are usually unusable, and results above 99 are usually correct. I would set the threshold somewhere between 97.5 and 98.5, depending on your requirements.

The lowest value I have ever seen is 75, but anything below 90 is extremely rare, and even values below 95 are rare.

From a very rough measurement on the data I'm using, a score of 97.5 corresponds to about 10% wrong characters on average, and a score of 99 to about 2%.

These numbers are based on fine-tuned models (measured on validation data). They partly depend on the model you are using, the image quality, etc.
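
A threshold like the one above could be applied to Tesseract's TSV output, which has a `conf` column per word. Here is a minimal sketch in Python using only the standard library. The sample rows are made up, `split_by_confidence` is a hypothetical helper, and the default of 98.0 is just one value from the suggested range, so tune it on your own data.

```python
import csv
import io
from typing import List, Tuple

# Made-up sample in the layout of Tesseract's TSV output
# (for example from `tesseract image.png out tsv`); the values are invented.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "4\t1\t1\t1\t1\t0\t10\t10\t200\t20\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t10\t10\t60\t20\t99.2\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t80\t10\t40\t20\t96.1\tN0\n"
    "5\t1\t1\t1\t1\t3\t130\t10\t70\t20\t98.7\t12345\n"
)

Word = Tuple[str, float]


def split_by_confidence(
    tsv_text: str, threshold: float = 98.0
) -> Tuple[List[Word], List[Word]]:
    """Split recognized words into (accepted, rejected) by confidence."""
    accepted: List[Word] = []
    rejected: List[Word] = []
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        conf = float(row["conf"])
        text = (row["text"] or "").strip()
        # Rows for pages, blocks, and lines carry conf -1 and no text.
        if conf < 0 or not text:
            continue
        (accepted if conf >= threshold else rejected).append((text, conf))
    return accepted, rejected


if __name__ == "__main__":
    ok, review = split_by_confidence(SAMPLE_TSV)
    print("accepted:", ok)
    print("needs review:", review)
```

Sending the rejected words to manual review rather than discarding them keeps the trade-off between error rate and coverage visible.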


Lorenzo

