Character confidence level / probability score using Tesseract 2.04

Nik

Jan 18, 2010, 2:05:11 PM

to tesseract-ocr

Hi,
I am using Tesseract version 2.04 and trying to extract the
confidence level for each character. There has been a previous
discussion about this issue, but it hasn't been active for the past
two and a half years, so I wanted to get some new input.

The previous thread was:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/1cdb99045c77d04/f34d76199b8b8fea?hl=en&lnk=gst&q=confidence+character

Tesseract works fine for the most part. However, when a character
is not recognized, it chooses the most likely option from the
character set and prints it. In this case I would like to output
an error or a special character whenever a character in the input
file cannot be recognized with a certain confidence level.
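
A minimal sketch of the kind of post-filter described here, assuming the per-character values have already been extracted somehow. The function name `mark_uncertain`, the `(character, certainty)` pair layout, the threshold, and the convention that certainty is negative with values closer to zero meaning more confident are all assumptions for illustration, not part of Tesseract's API. The sample numbers are the second column of the chop_word lines in the trace later in this post.

```python
def mark_uncertain(chars, threshold=-3.0, placeholder="?"):
    """Replace characters whose certainty falls below `threshold`.

    `chars` is an iterable of (character, certainty) pairs. The pair layout
    and the "closer to zero is better" convention are assumptions.
    """
    return "".join(
        ch if certainty >= threshold else placeholder
        for ch, certainty in chars
    )


# Hypothetical input built from the chop_word lines of the sample trace.
sample = [("0", -2.03), ("9", -1.49), ("0", -1.52), ("6", -3.94), ("3", -1.12)]
print(mark_uncertain(sample))
```

The threshold would need to be tuned on real scans. Whether these numbers are reliable enough for that is exactly what is in question here.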

I have been able to follow the previous thread (thanks to all the
members) and to print a final file containing the probability of
each character. But I don't know how to make sense of the different
iterations that take place to correct an image to improve its
clarity and matching characteristics.

If someone could explain the format in which the traces are printed
by the tprintf function, it would be greatly appreciated.

Example output for an image containing "09063" as input:

Tesseract Open Source OCR Engine
chop_word:
10.79 -2.03 : 0 [30 ]0
chop_word:
6.03 -1.49 : 9 [39 ]0
chop_word:
8.08 -1.52 : 0 [30 ]0
chop_word:
16.86 -3.94 : 6 [36 ]0
chop_word:
5.20 -1.12 : 3 [33 ]0
improve 1:
20.42 -5.92 : 6 [36 ]0
improve 2:
16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0
pieces:
80.98 -9.23 : 0 [30 ]0
pieces:
58.07 -9.68 : 3 [33 ]0
rebuild
16.86 -3.94 : 6 [36 ]0
chop_word:
0.42 -0.08 : 0 [30 ]0
chop_word:
6.03 -1.49 : 9 [39 ]0
chop_word:
6.14 -1.15 : 0 [30 ]0
chop_word:
16.86 -3.94 : 6 [36 ]0
chop_word:
5.20 -1.12 : 3 [33 ]0
improve 1:
20.42 -5.92 : 6 [36 ]0
improve 2:
16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0
pieces:
80.98 -9.23 : 0 [30 ]0
pieces:
58.07 -9.68 : 3 [33 ]0
rebuild
16.86 -3.94 : 6 [36 ]0
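
For anyone who wants to post-process a trace like the one above, here is a rough parsing sketch. It assumes each candidate has the form "number number : character [hex ]", that the two numbers are a rating and a certainty (those field names are my guess, not confirmed by the source), and that any line without a candidate is a phase label. The one thing the trace itself does show is that the bracketed hex value is the character code of the printed character (30 is "0", 3a is ":"). The trailing "0" on each line is ignored because its meaning is unclear to me.

```python
import re

# One candidate: two decimals, a colon, the character, then its hex code.
CANDIDATE = re.compile(
    r"(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+:\s+(\S+)\s+\[([0-9a-fA-F]+)\s*\]"
)


def parse_trace(text):
    """Return a list of (label, candidates) in the order they were printed."""
    entries = []
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        found = CANDIDATE.findall(line)
        if found:
            # "rating" and "certainty" are assumed names for the two columns.
            candidates = [
                {"rating": float(a), "certainty": float(b), "char": ch, "hex": hx}
                for a, b, ch, hx in found
            ]
            entries.append((label, candidates))
        else:
            # Lines such as "chop_word:" or "rebuild" label what follows.
            label = line.rstrip(":")
    return entries


SAMPLE = """\
Tesseract Open Source OCR Engine
chop_word:
10.79 -2.03 : 0 [30 ]0
improve 2:
16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0
rebuild
16.86 -3.94 : 6 [36 ]0
"""

for label, candidates in parse_trace(SAMPLE):
    print(label, [(c["char"], c["rating"], c["certainty"]) for c in candidates])
```

This only restructures what was printed. It does not recover which position in the image each line refers to, since the trace carries no location information.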

Thanks,
Nik

patrickq

Jan 18, 2010, 8:56:06 PM

to tesseract-ocr

I have been dutifully gathering and storing confidence values in my
application in case these values one day become reliable. However,
at least in my own experience, they are not usable: I have routinely
seen higher (meaning less reliable) numbers for recognized characters
that were in fact better than the characters returned for the same
section of the image (but scanned differently).

In addition, I think the confidence numbers are set to the same value
for all characters in the same word.

Unfortunately, I am therefore ignoring these numbers completely and
applying different logic (such as examining the percentage of
non-letter characters).
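
A rough sketch of what such a non-letter check could look like. The exact logic is not spelled out in this post, so the function names and the 0.3 cutoff below are hypothetical, and whitespace is ignored as an assumption.

```python
def non_letter_fraction(text):
    """Fraction of non-whitespace characters that are not letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    non_letters = sum(1 for ch in visible if not ch.isalpha())
    return non_letters / len(visible)


def looks_unreliable(text, max_fraction=0.3):
    """Flag OCR output with too many non-letters (cutoff is hypothetical)."""
    return non_letter_fraction(text) > max_fraction


print(looks_unreliable("hello world"))
print(looks_unreliable("h~l|o ;: w.r1d"))
```

A check like this only makes sense for text that is expected to be mostly letters. It would flag every result for digits-only input.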

Disclaimer: it is certainly possible that my findings are caused by
some error on my part; Tesseract is still very much a black box to me.

Patrick

Nik

Jan 19, 2010, 9:22:06 AM

to tesseract-ocr

Thanks for your quick response on this issue.

You are right, the confidence values do sometimes seem to be
incorrect. Unfortunately, for my application it can be very difficult
to find the mismatched characters using other methods, since I only
have digits in a different font.

However, you can actually get the numbers for each character. At the
API level the program applies an algorithm and computes a confidence
level for each word, but you can print the traces and find the
confidence for each character blob as it is computed. This is what I
understood from the previous thread that I referred to earlier.
The traces are printed by the function "tprintf" (Tesseract project,
"ccutil" folder, "tprintf.cpp"), which is invoked by a piece of code
in the "wordrec" folder, "wordclass.cpp", lines 132-139.

The output I was able to get is in my first post. The wiki page
on debugging helped with the format of each line. The only thing that
I do not understand is the order in which the iterations take place.
The 'chop_word' phase takes place first, then 'improve', 'pieces' and
'rebuild'. I do not fully understand what these mean, or where the
location of the character is represented in the trace. Without
any reference to the location, you cannot tell which character it is
trying to rebuild/rematch.

Nik

Jan 25, 2010, 9:53:06 AM

to tesseract-ocr

I have been fiddling around with the Tesseract 2.04 code for at least
4 days now. I am trying to trace back to the function where each
character is recognized, with no luck so far.
Can anyone guide me on this issue?

I am looking for the function that matches each character in the image
with the characters stored in the language files.

Any help regarding this problem would be greatly appreciated.

Nik

dythmall

Mar 11, 2010, 5:28:26 AM

to tesseract-ocr

I think you might want to look at the adaptive classifier.
Tesseract uses this method to match blobs to trained features.
