Is it possible to get a confidence value for the tesseract OCR result?

caro

Jul 9, 2010, 5:01:49 AM
to tesseract-ocr
I am working with Tesseract OCR and I would like to get, at the end of
the algorithm, a confidence value that indicates whether the
recognition looks reliable or not.

For example, I have an image with the text: TEST RESULTS ARE OK.
Depending on a threshold value, I can get different output of the OCR:
- TEST RESSUTTS AKE OC
- TEST TELLUTTS ARE OB
....
The best threshold can be different for different images.
So if I can get this confidence value, maybe it can tell me the best
threshold to choose for the OCR?

Thank you for your help,
Caroline

patrickq

Jul 9, 2010, 11:55:30 AM
to tesseract-ocr
TesseractExtractResult() returns the confidence numbers for all
characters returned. A high number means low confidence. Caveats:
1. The confidence numbers are the same for all letters in a word (even
though Tesseract does compute confidence numbers for each letter, it
just doesn't return them to the API)
2. From personal experience, these numbers are not very reliable and
we decided not to use them - but feel free to test yourself, we gave
up fairly quickly.

Patrick

Jimmy O'Regan

Jul 9, 2010, 12:28:33 PM
to tesser...@googlegroups.com
On 9 July 2010 16:55, patrickq <patrick.q...@gmail.com> wrote:
> TesseractExtractResult() returns the confidence numbers for all
> characters returned. A high number means low confidence. Caveats:
> 1. The confidence numbers are the same for all letters in a word (even
> though Tesseract does compute confidence numbers for each letter, it
> just doesn't return them to the API)
> 2. From personal experience, these numbers are not very reliable and
> we decided not to use them - but feel free to test yourself, we gave
> up fairly quickly.
>

Right; if I could sketch this on some paper it might be a bit more
clear, but I can't so I'll try to describe it...

R to K is the easiest to describe: cover the top of the R and it looks
like a K. Smudges, glare from the scanner's light, boxing errors,
etc., can all cause this kind of degradation. Thresholding can
contribute to the problem, because it converts greyscale to binary: a
stroke that is too light is effectively wiped out. Access to the
character probabilities won't actually help, because if threshold 1
gives you an R with a broken top, it will get a relatively low
confidence value as 'R', whereas threshold 2, which has removed the
top completely, will get a higher confidence value as 'K'. Going purely
by character probabilities can just as easily give you the worst
results of both as the best.

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

Jimmy O'Regan

Jul 9, 2010, 12:35:12 PM
to tesser...@googlegroups.com

If you want to delve into the guts of tesseract, you can get at the
character choices and the confidence values attached by the
classifier, but that information by itself won't be much help -- see
my other mail.

You've got the start of a good idea here, but you need something
external to get you the rest of the way. One way that you can get
external information is to pass the words through a spellchecker or
use the DAWG facilities: the better thresholding value will have a
higher number of recognised words.
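
A minimal sketch of that idea in C++, not taken from the thread: run the
OCR once per threshold, then keep the output that contains the most
dictionary words. The helper names (countKnownWords, pickBestCandidate)
and the in-memory upper-case word set are assumptions made for
illustration; a real setup would use a spellchecker or Tesseract's own
DAWG word lists instead.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: counts the whitespace-separated tokens of `text`
// that appear in `dictionary`. Tokens are upper-cased and stripped of
// non-letters first, so the dictionary is assumed to hold upper-case words.
std::size_t countKnownWords(const std::string& text,
                            const std::set<std::string>& dictionary) {
    std::istringstream in(text);
    std::string token;
    std::size_t known = 0;
    while (in >> token) {
        std::string word;
        for (unsigned char c : token) {
            if (std::isalpha(c)) word += static_cast<char>(std::toupper(c));
        }
        if (!word.empty() && dictionary.count(word) > 0) ++known;
    }
    return known;
}

// Returns the index of the OCR output with the most dictionary words.
// Ties go to the earliest candidate.
std::size_t pickBestCandidate(const std::vector<std::string>& candidates,
                              const std::set<std::string>& dictionary) {
    assert(!candidates.empty());
    std::size_t best = 0;
    std::size_t bestScore = 0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const std::size_t score = countKnownWords(candidates[i], dictionary);
        if (score > bestScore) {
            bestScore = score;
            best = i;
        }
    }
    return best;
}
```

With a dictionary of TEST, RESULTS, ARE and OK, the two outputs from the
original question score 1 and 2 respectively, so the second threshold
would be preferred.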

Ian Ozsvald (A.I. Cookbook)

Jul 10, 2010, 2:55:58 PM
to tesser...@googlegroups.com
Hi Caroline. I'm thinking of using a dictionary approach coupled with
varying thresholds to come up with votes for correct sentence parts. A
rough sketch (for recognising English Heritage Plaques) is here:
http://aicookbook.com/wiki/Automatic_plaque_transcription

Basically:
- Try many thresholds, extract OCR results for each
- Use a dictionary to vote on how English each sentence is
- Choose the highest voted sentence to build a composite result

The dictionary step will include problem-specific rules - for plaque
recognition it'll include rules about date formats (they're usually
something like "1863-1845", i.e. four digits, a hyphen, four digits).
The dictionary will also include proper names for people and locations
that are associated with the geo tags for the plaque.
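
A rough C++ sketch of the voting step, under stated assumptions rather
than as the actual plaque code: every reading is assumed to split into
the same number of words, a token counts as plausible if it is in the
dictionary or matches the year-range rule, and the plausible token seen
in the most readings wins each position. The names splitWords,
isPlausible and buildComposite are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Splits a line on whitespace.
std::vector<std::string> splitWords(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// A token is plausible if it is in the dictionary or looks like a year
// range: four digits, a hyphen, four digits.
bool isPlausible(const std::string& token,
                 const std::set<std::string>& dictionary) {
    static const std::regex yearRange("[0-9]{4}-[0-9]{4}");
    return dictionary.count(token) > 0 || std::regex_match(token, yearRange);
}

// Builds a composite line. At each word position the plausible token that
// appears in the most readings wins; if no reading offers a plausible
// token, the first reading's token is kept unchanged.
std::string buildComposite(const std::vector<std::string>& readings,
                           const std::set<std::string>& dictionary) {
    assert(!readings.empty());
    std::vector<std::vector<std::string>> split;
    for (const auto& r : readings) split.push_back(splitWords(r));

    std::string out;
    const std::size_t n = split.front().size();
    for (std::size_t pos = 0; pos < n; ++pos) {
        std::map<std::string, int> votes;
        for (const auto& words : split) {
            if (pos < words.size() && isPlausible(words[pos], dictionary)) {
                ++votes[words[pos]];
            }
        }
        std::string best = split.front()[pos];
        int bestVotes = 0;
        for (const auto& kv : votes) {
            if (kv.second > bestVotes) {
                bestVotes = kv.second;
                best = kv.first;
            }
        }
        if (!out.empty()) out += ' ';
        out += best;
    }
    return out;
}
```

Voting only helps where at least one threshold reads the word
correctly: for the two readings in Caroline's question the composite is
"TEST RESSUTTS ARE OC", which recovers ARE but nothing else.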

HTH,
Ian.


--
Ian Ozsvald (A.I. researcher, screencaster)
i...@IanOzsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://blog.AICookbook.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald

caro

Jul 13, 2010, 4:26:47 AM
to tesseract-ocr
OK, thank you for your help.
Could you explain exactly how to use the function
TesseractExtractResult()?

Thank you,
Caroline


Patrick Questembert

Jul 13, 2010, 1:40:39 PM
to tesser...@googlegroups.com
Here is a code snippet:

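    // Run the first recognition pass over the block list.
    PAGE_RES* page_res_pass1 = myTess->RecognitionPass1(block_list);

    // Output arrays, allocated by the call below and freed by the caller.
    char *textOCR = NULL;   // UTF-8 bytes of all letters, not null-terminated
    int matchedChars = 0;   // number of letters found
    int *lengths = NULL;    // byte length of each letter in textOCR
    int *x0 = NULL;         // bounding-box coordinates, one entry per letter
    int *y0 = NULL;
    int *x1 = NULL;
    int *y1 = NULL;
    float *costs = NULL;    // one cost per letter (high cost = low confidence)

    matchedChars = myTess->TesseractExtractResult(&textOCR, &lengths, &costs, &x0, &y0, &x1, &y1, page_res_pass1);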

Comments:
- the textOCR array is a series of multibyte UTF-8 characters; the lengths array gives the number of bytes in each letter, so the total length of textOCR in bytes is sum(lengths[i]) with i iterating from 0 to (matchedChars - 1)
- Note: you will need to null-terminate the textOCR array yourself
- matchedChars is the number of letters found
- costs has one float value per letter found. As mentioned, these values will be identical for all letters in a given word
- no newlines returned: spaces and newlines are returned as spaces and your code needs to decide if it's a newline or a space based on the x0,y0,x1,y1 coords
- all arrays need to be freed by the caller
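
As an illustration of the points above, here is a hedged sketch of how
the returned arrays might be walked. It does not call Tesseract; it only
takes buffers shaped as described. The OcrWord struct and both helper
names are made up for this example, and two things are assumed rather
than confirmed: that a space occupies its own slot in lengths and costs,
and that text runs left to right.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for one recognised word.
struct OcrWord {
    std::string text;  // UTF-8 bytes of the word
    float cost;        // cost shared by the word's letters (high = low confidence)
};

// Splits the flat buffers into words. `text` is not null-terminated;
// letter i occupies lengths[i] bytes. A single-byte space ends a word.
std::vector<OcrWord> splitIntoWords(const char* text, const int* lengths,
                                    const float* costs, int matchedChars) {
    std::vector<OcrWord> words;
    OcrWord current{"", 0.0f};
    std::size_t offset = 0;
    for (int i = 0; i < matchedChars; ++i) {
        assert(lengths[i] > 0);
        const std::size_t len = static_cast<std::size_t>(lengths[i]);
        const std::string letter(text + offset, len);
        offset += len;
        if (letter == " ") {
            if (!current.text.empty()) words.push_back(current);
            current = OcrWord{"", 0.0f};
        } else {
            // All letters of a word carry the same cost, so the first is enough.
            if (current.text.empty()) current.cost = costs[i];
            current.text += letter;
        }
    }
    if (!current.text.empty()) words.push_back(current);
    return words;
}

// Heuristic (an assumption): a returned space is really a line break when
// the letter after it starts to the left of where the letter before it ends.
bool isLineBreak(int prevX1, int nextX0) {
    return nextX0 < prevX1;
}
```

Building each letter as a std::string from a pointer and a byte count
sidesteps the missing null terminator, and multibyte letters stay intact
because the lengths array is used instead of a fixed stride of one byte.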

Let me know if you need more help.
