The previous thread was:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/1cdb99045c77d04/f34d76199b8b8fea?hl=en&lnk=gst&q=confidence+character
Tesseract works fine for the most part. However, when a character is
not recognized, it chooses the most likely option from the character
set and prints it. In that case I would like to output an error or a
special character instead, whenever a character in the input file
cannot be recognized with a given confidence level.
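A minimal sketch of that idea in Python, assuming the second number on
each trace line is a per-character certainty where values closer to 0
are better. The (character, score) pairs are copied from the first
chop_word pass of the example output in this post. The function name
mask_low_confidence, the -3.0 cutoff and the "?" placeholder are made up
for illustration and are not part of Tesseract.

```python
def mask_low_confidence(chars, threshold, placeholder="?"):
    """Replace every character whose score is below the threshold.

    chars is a list of (character, score) pairs. The score is ASSUMED to
    be a certainty where a value closer to 0 means a better match.
    """
    return "".join(
        ch if score >= threshold else placeholder
        for ch, score in chars
    )


# Second column of the first chop_word pass in the example output.
scored = [("0", -2.03), ("9", -1.49), ("0", -1.52), ("6", -3.94), ("3", -1.12)]

# The cutoff is arbitrary and would need tuning on real images.
print(mask_low_confidence(scored, threshold=-3.0))  # prints 090?3
```

With this cutoff only the 6 (score -3.94) is replaced.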
I have been able to follow the previous thread (thanks to all the
members) and can now print a final file containing the probability of
each character. But I don't know how to make sense of the different
iterations that take place to correct an image and improve its clarity
and matching characteristics.
If someone could explain the format in which the traces are printed by
the tprintf function, it would be greatly appreciated.
Example output for an image containing "09063" as input:
Tesseract Open Source OCR Engine
chop_word:
10.79 -2.03 : 0 [30 ]0
chop_word:
6.03 -1.49 : 9 [39 ]0
chop_word:
8.08 -1.52 : 0 [30 ]0
chop_word:
16.86 -3.94 : 6 [36 ]0
chop_word:
5.20 -1.12 : 3 [33 ]0
improve 1:
20.42 -5.92 : 6 [36 ]0
improve 2:
16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0
pieces:
80.98 -9.23 : 0 [30 ]0
pieces:
58.07 -9.68 : 3 [33 ]0
rebuild
16.86 -3.94 : 6 [36 ]0
chop_word:
0.42 -0.08 : 0 [30 ]0
chop_word:
6.03 -1.49 : 9 [39 ]0
chop_word:
6.14 -1.15 : 0 [30 ]0
chop_word:
16.86 -3.94 : 6 [36 ]0
chop_word:
5.20 -1.12 : 3 [33 ]0
improve 1:
20.42 -5.92 : 6 [36 ]0
improve 2:
16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0
pieces:
80.98 -9.23 : 0 [30 ]0
pieces:
58.07 -9.68 : 3 [33 ]0
rebuild
16.86 -3.94 : 6 [36 ]0
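The data lines above can be split into fields with a short Python
script. This is only a sketch: the field names are guesses (the first
number looks like a rating and the second like a certainty), the
bracketed values are read as the hex codes of the printed character
(30 is "0", 3a is ":"), and the meaning of the character that sometimes
follows the closing bracket is not established in this thread, so it is
kept as an opaque suffix. Choice, CHOICE_RE and parse_choices are
hypothetical names, not Tesseract identifiers.

```python
import re
from typing import List, NamedTuple


class Choice(NamedTuple):
    rating: float      # first number (ASSUMED to be a rating)
    certainty: float   # second number (ASSUMED to be a certainty)
    text: str          # the character as printed
    codes: List[int]   # hex codes inside the brackets
    suffix: str        # whatever follows "]" (meaning unknown here)


# One entry looks like: 17.86 -13.23 : 0 [30 ]0
CHOICE_RE = re.compile(
    r"(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+:\s+(\S+)\s+\[([0-9a-fA-F ]+)\](\S*)"
)


def parse_choices(line: str) -> List[Choice]:
    """Return every entry found on one data line of the trace."""
    found = []
    for m in CHOICE_RE.finditer(line):
        codes = [int(h, 16) for h in m.group(4).split()]
        found.append(
            Choice(float(m.group(1)), float(m.group(2)),
                   m.group(3), codes, m.group(5))
        )
    return found


# The "improve 2" line from the trace holds two entries.
for choice in parse_choices("16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0"):
    print(choice)
```

Label lines such as "chop_word:" contain no entry and yield an empty list.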
Thanks,
Nik
In addition, I think the confidence numbers are set to the same value
for every character in the same word.
Unfortunately, I am therefore ignoring these numbers completely and
applying different logic (such as examining the percentage of
non-letter characters).
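A rough Python sketch of such a check. How the percentage is actually
computed is not stated here, so the details are assumptions: whitespace
is ignored, "letter" means whatever str.isalpha() accepts, and the
function name non_letter_fraction is made up for illustration.

```python
def non_letter_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that are not letters."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0  # nothing to judge
    non_letters = sum(1 for c in visible if not c.isalpha())
    return non_letters / len(visible)


# Two of the eight visible characters are not letters.
print(non_letter_fraction("t#e q;ick"))  # prints 0.25
```

A word or line whose fraction is unusually high could then be flagged for
review. This only makes sense for text that is expected to be mostly
letters.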
Disclaimer: it is certainly possible that my findings are caused by
some error on my part; Tesseract is still very much a black box to me.
Patrick
On Jan 18, 2:05 pm, Nik <n89sha...@gmail.com> wrote:
> Hi,
> I am using Tesseract version 2.04 and trying to extract the
> confidence level for each character. There has been a previous
> discussion about this issue, but it hasnt been discussed for the past
> 2 and a half years therefore, I wanted to get some new input.
>
> the previous thread was :http://groups.google.com/group/tesseract-ocr/browse_thread/thread/1cd...
You are right, the confidence values can seem incorrect at times.
Unfortunately, for my application it is very difficult to find the
mismatched characters using other methods, since I only have digits in
a different font.
However, you can actually get the numbers for each character. At the
API level the program applies an algorithm and computes a confidence
level for each word, but you can print the traces and find the
confidence for each character blob as it is computed. This is what I
understood from the previous post that I referred to before.
The traces can be printed using the function "tprintf" (Tesseract
project, "ccutil" folder, file "tprintf.cpp"), which can be invoked by
a piece of code in the "wordrec" folder, file "wordclass.cpp", lines
132-139.
The output I was able to get is in my first post. The wiki page on
debugging helped with the format of each line. The only thing that I do
not understand is the order in which the iterations take place. The
'chop_word' phase takes place first, then 'improve', 'pieces' and
'rebuild'. I do not fully understand what these mean, or where the
location of the character is represented in the trace. Without any
reference to the location you cannot tell which character it is
trying to rebuild/rematch.
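To at least see the order of the phases, the trace can be grouped by its
label lines. The Python sketch below does only that. It assumes that a
line consisting of "chop_word", "improve N", "pieces" or "rebuild" (with
or without a trailing colon) starts a new group and that the lines after
it belong to that group. It does not recover the blob position, which
does not appear in this output. LABEL_RE and group_by_phase are
hypothetical names.

```python
import re

# Labels as they appear in the trace; "rebuild" is printed without a colon.
LABEL_RE = re.compile(r"^(chop_word|improve \d+|pieces|rebuild):?$")


def group_by_phase(lines):
    """Return a list of (label, data_lines) pairs in trace order."""
    groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = LABEL_RE.match(line)
        if m:
            groups.append((m.group(1), []))
        elif groups:
            # Data line: attach it to the most recent label.
            groups[-1][1].append(line)
        # Lines before the first label (e.g. the banner) are skipped.
    return groups


trace = """Tesseract Open Source OCR Engine
improve 1:
20.42 -5.92 : 6 [36 ]0
improve 2:
16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0
pieces:
80.98 -9.23 : 0 [30 ]0
rebuild
16.86 -3.94 : 6 [36 ]0"""

for label, data in group_by_phase(trace.splitlines()):
    print(label, "->", data)
```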
Nik
I am looking for the function that matches each character in the image
against the characters stored in the language files.
Any help regarding this problem would be greatly appreciated.
Nik