Some Clue on Generating Probability scores for each character/word

Basu

Sep 28, 2007, 5:51:34 AM
to tesseract-ocr, withbl...@gmail.com
Hi,

I am trying to generate probability scores for each character in a
word.
Since I am working with handwritten words, this would be useful
information.

For example, if I give a handwritten bmp/tif word image of "hello" as
input, I currently get output such as "nollo" (fixed, with no
alternative suggestions).
I want to generate output in the following probabilistic form (one
row per character in the word):
a(0.01)..b(0.02)....n(0.6) .. ..z(0.01) --> recognized as 'n'
a(0.01)..b(0.02)....o(0.7) .. ..z(0.01) --> recognized as 'o'
a(0.01)..b(0.02)....l(0.6) ... ..z(0.01) --> recognized as 'l'
a(0.01)..b(0.02)....l(0.6) ... ..z(0.01) --> recognized as 'l'
a(0.01)..b(0.02)....o(0.8).. ..z(0.01) --> recognized as 'o'

While working on this problem and studying the code, I found that the
significant functions are executed in the following sequence, across
different source files (in the standard scenario):

TessBaseAPI::TesseractRectUNLV (baseapi.cpp)
TessBaseAPI::Recognize (baseapi.cpp)
recog_all_words (control.cpp)
classify_word_pass1 (control.cpp)
tess_segment_pass1 (tessbox.cpp)
recog_word (tfacepp.cpp)
recog_word_recursive (tfacepp.cpp)
cc_recog (tface.cpp)
chop_word_main (chopper.cpp)
etc...
etc...

Now, chop_word_main returns a CHOICES_LIST, that is, a list of
possible words ordered by best score.
Can anybody help me get at this list, or at the choice values for each
character blob? They are floats.
Generating a separate output file from here would also help me.

I am also somewhat confused here:
for each change to a source file, do I have to rebuild the whole
application? (It takes a lot of time.)
I am working in VC++.

basu.

Scan...@gmail.com

Sep 28, 2007, 8:14:54 AM
to tesseract-ocr
The DLL returns a confidence value for the word. There is a define that
the DLL turns on to do this. You may be able to trace this value
backwards to find what you are looking for.

Basu

Oct 3, 2007, 3:08:54 AM
to tesseract-ocr
Thanks Glen for your suggestions.
I backtracked through a lot of the source files and finally got the
desired result a few minutes ago.
The print_choices function in dict\choices.cpp does the job of printing
probability and certainty values for each character (blob) in the
input image.
It can be invoked from the classify_blob function in
wordrec\wordclass.cpp; there is a piece of code like this at the bottom
of that function:

#ifndef GRAPHICS_DISABLED
  if (display_ratings && string)
    print_choices(string, rating);

  if (blob_pause)
    window_wait(blob_window);
#endif

So you can use print_choices(string, rating); to get probabilistic
output.
Please note: I have only used tprintf to print the output to my log
file.
A log file listing (along with my backtracking trace lines) is given
below for reference.
I gave a simple bmp file as input, with the handwritten text "basu"
written in it.
I suppose one can easily customize the output from here.

===================================================================
Tesseract Open Source OCR Engine
tessedit_serial_unlv=0
I am Here in 3rd Else
I am Here in baseapi.cpp
I am Here in control.cpp
control.cpp->classify_word_pass1
tessbox.cpp->tess_segment_pass1
tfacepp.cpp->recog_word
tfacepp.cpp->recog_word_recursive
tface.cpp->cc_recog
chopper.cpp->chop_word_main:
chop_word: |-> b <- 53.66 -7.01| t 61.22 -8.00
chop_word: |-> a <- 40.80 -8.00
chop_word: |-> s <- 27.28 -8.46| o 30.09 -9.33| z 30.40 -9.43| e 36.22 -11.23
chop_word: |-> u <- 46.97 -9.08| w 51.38 -9.93| a 56.72 -10.96
OUTPUT:Best-Choice:basu
OUTPUT:Blob Choices:4
tfacepp.cpp->recog_word
tfacepp.cpp->recog_word_recursive
tface.cpp->cc_recog
chopper.cpp->chop_word_main:
chop_word: |-> b <- 53.66 -7.01| t 61.22 -8.00
chop_word: |-> a <- 40.80 -8.00
chop_word: |-> s <- 27.28 -8.46| o 30.09 -9.33| z 30.40 -9.43| e 36.22 -11.23
chop_word: |-> u <- 46.97 -9.08| w 51.38 -9.93| a 56.72 -10.96
basu
==========================================================================

Hope this will be helpful for somebody like me who is working with
handwritten data and wants some kind of probability score for each
character.
Thanks all,
Basu
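
A small parser for the chop_word: lines in the log above can turn them into (label, rating, certainty) records for further processing. The following C++ sketch assumes that each chop_word: entry has been joined onto a single line, that labels are single tokens containing no whitespace or '|', and that the two numbers are rating then certainty (as they are identified later in this thread). BlobChoice and parse_chop_word_line are hypothetical names, not part of Tesseract.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One candidate label for a blob, as printed in the debug log.
struct BlobChoice {
    std::string label;
    double rating;
    double certainty;
};

// Parses a line such as
//   "chop_word: |-> b <- 53.66 -7.01| t 61.22 -8.00"
// into its choices, best choice first (the order used in the log).
// Assumption: labels never contain whitespace or '|'.
// std::stod throws if a number field is malformed.
std::vector<BlobChoice> parse_chop_word_line(std::string line) {
    // Treat the '|' separators as whitespace so the line splits into tokens.
    for (char& ch : line) {
        if (ch == '|') ch = ' ';
    }
    std::istringstream in(line);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) {
        // Drop the prefix and the arrows that mark the best choice.
        if (tok == "chop_word:" || tok == "->" || tok == "<-") continue;
        tokens.push_back(tok);
    }
    // What remains is a flat sequence of (label, rating, certainty) triples.
    std::vector<BlobChoice> choices;
    for (std::size_t i = 0; i + 2 < tokens.size(); i += 3) {
        choices.push_back({tokens[i], std::stod(tokens[i + 1]),
                           std::stod(tokens[i + 2])});
    }
    return choices;
}
```

Calling parse_chop_word_line on the first chop_word: entry of the log yields two records, 'b' and 't', each with its two numbers.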

Keith Beaumont

Oct 5, 2007, 10:15:04 AM
to tesser...@googlegroups.com
Basu,
I found this too & some other stuff. Can you explain what the numbers mean? I have an awk job that reads some of these results from the log & formats them prettily. Let me know if you want a copy.
KB

Scan...@gmail.com

Oct 5, 2007, 10:45:13 AM
to tesseract-ocr
Good research. Right now the DLL returns confidence per word. It would
seem a better way would be to bring these values forward on a per
character basis. The dll client can then figure out the confidence for
the word if it wants to.

Any thoughts?
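
One way a client could derive a word-level figure from per-character values, as suggested above, is to take the weakest character. The sketch below is only an illustration: word_certainty is a hypothetical helper, not part of the Tesseract DLL, and using the minimum is an assumed aggregation rule (an average would be another option). It relies on certainties being non-positive with values nearer zero being better, which is how Ray describes them later in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical aggregation: a word is treated as only as reliable as its
// least certain character. Since certainties are <= 0 and closer to zero
// is better, the weakest character has the smallest (most negative) value.
double word_certainty(const std::vector<double>& char_certainties) {
    assert(!char_certainties.empty());  // a word needs at least one character
    return *std::min_element(char_certainties.begin(),
                             char_certainties.end());
}
```

For the best choices in the basu log earlier in the thread (-7.01, -8.00, -8.46, -9.08) this returns -9.08.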


Basu

Oct 8, 2007, 8:04:13 PM
to tesseract-ocr
Thanks Keith and Glen for your interest in this.
I could make some kind of probabilistic estimate per character,
but I am still a little confused about the 'rating' values.
Keith, can you please forward your findings as well?

chop_word: |-> a <- 40.80 -8.00

There are two numbers associated with each character: the first is the
'rating' and the second is the 'certainty'.
The higher the certainty value on the negative scale, the more certain
Tesseract is of its label.
Is it the same for 'rating', i.e. the higher the value, the more
accurate the label? I have some confusion here.
It appears the characters are ordered by their certainty values.
For my work, I want only one probabilistic measure for each character
or its alternatives,
so I have used only the certainty metric (so far).
Following is part of my output from Tesseract, with one character (and
its alternatives) in each row, for each word:

c 0.500691 e 0.499309
c 0.500013 o 0.499987
n 0.334020 u 0.333551 w 0.332430
v 1.000000
e 0.334936 t 0.334708 o 0.330356
r 0.500656 n 0.499344
t 1.000000
s 1.000000

The original handwritten word was "converts".
As you can see, the probabilities in each row now sum to 1.
Tesseract appears to be very certain about some character shapes
(v, t, s).
I am still searching for different tuning parameters.

Glen, can you give me your comments on the last line?
I mean, what are the parameters to loosen up the matching process?
What are the 'ratings'? Is it OK if I use only the 'certainty' value
and ignore 'rating'?

Basu.

Keith Beaumont

Oct 9, 2007, 10:53:50 AM
to tesser...@googlegroups.com
Basu,
I'll try & send my stuff soon. Currently busy looking at the starbase output.
KB
 
Keith Beaumont

Oct 10, 2007, 10:20:26 AM
to tesser...@googlegroups.com
Basu,
Here's a small debug sample. Files are:
tess in: do1.tif
tess out log: do1_dbg2
awk job: awk_tess_dbg2
awk job out: do1_dbg2_out.txt
 
Some of it comes from the verbose code provided by Fil's Hacker's Guide & some from setting debug flags. I can't remember exactly which flags I set to get all of this. I have to re-visit it soon. Currently side-tracked by the starbase code.
If you can explain the meanings of any of the output, that would be useful.
KB
 
basu.zip

Ray Smith

Oct 11, 2007, 12:41:29 PM
to tesser...@googlegroups.com
Basu,

The certainty is the confidence from the classifier.
The rating is the confidence scaled by the length of the outline. This is more useful when comparing total ratings of words of different lengths.

If you want to loosen up the classifier, it will sloow doown horribly, but here goes:
You can change the class pruner from the command line with (a) a config file in tessconfigs, or  (b) you can change the source code, or (c) you can change it programmatically:
(a)
my_config in tessconfigs should read:
ClassPrunerThreshold 200
and use my_config on the command line after your output filename. If you are using the dll this may not be an option.

(b)
Change the 229 on line 323 of classify/intmatcher.cpp to read some other number (like 200)

(c) in your code:
extern int ClassPrunerThreshold;
ClassPrunerThreshold = 200;

The current value is 229. It is applied as a fraction of 256, so 200 is probably an interesting number to try first. Set this number too low, and tesseract could run up to 100X slower than normal! You have been warned!
Ray.
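
To make the "fraction of 256" remark concrete, the arithmetic can be written out. This is only a worked calculation: pruner_fraction is a hypothetical helper, and how Tesseract applies the resulting fraction internally is not shown here.

```cpp
// Expresses a ClassPrunerThreshold setting as the fraction of 256 it
// represents (hypothetical helper for illustration only).
double pruner_fraction(int threshold) {
    return threshold / 256.0;
}
```

pruner_fraction(229) is 0.89453125 (the default, roughly 89%), and pruner_fraction(200) is 0.78125 (roughly 78%).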



Scan...@gmail.com

Oct 11, 2007, 7:40:35 PM
to tesseract-ocr
So the magic question is: will making this number higher make the
recognition faster but less accurate?

Some people want speed, especially for hidden-text PDF.

Basu

Oct 11, 2007, 8:42:56 PM
to tesseract-ocr
Thanks Ray for your excellent tips.
I was looking forward to your help on this.
Yesterday I managed to identify these parameters.
I tried tuning the ClassPrunerThreshold, ClassPrunerMultiplier,
IntegerMatcherMultiplier and SimilarityCenter parameters.
Each of them gives some variation in the output when tuned
independently,
but as you confirmed today, ClassPrunerThreshold is probably the key
parameter.
Right now I am experimenting with default values only.
I wonder what its best value would be for people working with digits
only!
As for me, I am now checking with around 30 classes.
Regarding the time factor:
in my case I need to give one line at a time as input (during testing),
so maybe I can afford to loosen it up a little. I need to check.

Thanks for your tip on configs.
It is working great for me and is a real value addition;
I did not know how to use these configs.
I am not using the DLL right now.

But I still have some confusion about the certainty and rating values.
For example, the handwritten word 'attesting' gave the following
result in (character, rating, certainty) format:

a 16.371164 -2.538165
t 18.585669 -4.589054 c 27.228859 -6.723175 e 28.481579 -7.032489
t 16.506004 -4.152454 e 26.314919 -6.620106 c 27.497448 -6.917597
e 17.168756 -3.135846
s 16.620825 -2.841166 z 32.940907 -5.630924
t 23.237991 -5.251523
i 9.632350 -5.351305 l 12.125552 -6.736418
n 11.628005 -2.627798
g 30.566916 -3.396324

One can observe that the higher the certainty (on the negative scale),
the lower the rating!
Are these characters ordered by lower rating or by higher certainty?
From my observation they are ordered by certainty. Am I correct?
Does that mean it is good to have the lowest rating for a
character/word?
I am confused: it appears from the code that a high rating is good,
but I am getting a different result.

I need to construct a single probability score from here.
For now, I have discarded the ratings and used only the certainty, and
got the following:

a 1.000000
t 0.338751 c 0.331174 e 0.330076
t 0.339512 e 0.330771 c 0.329717
e 1.000000
s 0.507283 z 0.492717
t 1.000000
i 0.503686 l 0.496314
n 1.000000
g 1.000000

Have I misinterpreted Tesseract's findings anywhere here?
Please suggest.
Basu.
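
The scores listed above are numerically consistent with adding 100 to each certainty and dividing by the row total, for example (100 - 4.589054) / 281.655282 = 0.338751 for the first 't'. That mapping is inferred from the posted numbers and is not stated in the post. The sketch below uses it under that assumption; certainties_to_scores and its offset parameter are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Converts the certainties of one blob's alternatives into scores that sum
// to 1 by shifting each value by `offset` and normalizing.
// Assumption: offset = 100 reproduces the figures in the post; the author
// does not confirm that this is the formula used.
std::vector<double> certainties_to_scores(
        const std::vector<double>& certainties, double offset = 100.0) {
    assert(!certainties.empty());
    double total = 0.0;
    for (double c : certainties) {
        total += offset + c;
    }
    std::vector<double> scores;
    scores.reserve(certainties.size());
    for (double c : certainties) {
        scores.push_back((offset + c) / total);
    }
    return scores;
}
```

Because the offset of 100 is large compared with the spread of the certainties, the alternatives in a row come out nearly equal, as in the listing above.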

Basu

Oct 11, 2007, 9:00:24 PM
to tesseract-ocr
Thanks Keith for your files.
I am also trying to understand the exact significance of these
numbers.
Certainly they are some kind of certainty values :-)
But I need to clarify some doubts with Ray on this.
I'll let you know the details if I can decipher them.
Basu


Basu

Oct 11, 2007, 9:06:23 PM
to tesseract-ocr
I am also interested to know
whether this number has any link with the maximum number of classes.
If so, people working with digits need to make a note of this.
People working with more than 256 classes, like Mr. Sriranga
(74 yrs old), may also be interested in these findings.
For a lower number of classes, can we afford a low value for this
threshold, for better accuracy?
Basu

Ray Smith

Oct 12, 2007, 1:02:09 PM
to tesser...@googlegroups.com
Glen,
Yes: the closer ClassPrunerThreshold gets to 255, the faster we go, and the less accurate. There will be fewer calls to the full classifier (which is the slowest part), but it will have less chance to overrule the class pruner on the best choice.

Basu,
With both rating and certainty, the closer to zero, the better.
Within a factor of 10 in the number of classes, it probably doesn't make much difference to the speed/accuracy tradeoff, but with a small number of training samples, it might pay off more to reduce ClassPrunerThreshold.
Ray.
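
Given the rule above that values closer to zero are better for both measures, ranking the alternatives for one blob can be sketched as follows. Choice and rank_choices are hypothetical names, and sorting by certainty first with rating as a tie-breaker is an assumption made for illustration, not Tesseract's own ordering logic.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

// One alternative for a blob: label plus the two values from the log.
struct Choice {
    std::string label;
    double rating;
    double certainty;
};

// Orders choices best-first: the certainty closest to zero wins, and the
// rating closest to zero breaks ties (assumed ordering, for illustration).
void rank_choices(std::vector<Choice>& choices) {
    std::sort(choices.begin(), choices.end(),
              [](const Choice& a, const Choice& b) {
                  if (std::fabs(a.certainty) != std::fabs(b.certainty)) {
                      return std::fabs(a.certainty) < std::fabs(b.certainty);
                  }
                  return std::fabs(a.rating) < std::fabs(b.rating);
              });
}
```

Applied to the four alternatives for the third blob in the basu log (s, o, z, e), this puts 's' first, which matches the order Tesseract printed.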

Keith Beaumont

Oct 13, 2007, 9:40:31 AM
to tesser...@googlegroups.com
To All,
I suggest it is better to CLEAR the reply box BEFORE replying. That way we don't get lots of LONG entries repeating time after time. I've also made this mistake in the past & will probably make it again when I forget about this message!!
 
Basu,
This looks very interesting. I have yet to read & try to understand it all. Look forward to your future comments about other parts of my debug sample.
 
Please explain how you got round the earlier problem with normproto & tesseract not recognizing similar handwritten characters as being the same character.
KB

merve

Jul 12, 2012, 7:11:57 AM
to tesser...@googlegroups.com
Hello,
Do you input adjacent words to Tesseract? I mean, do you input letters that are physically connected to each other, or discrete letters as independent blobs?
I think the "converts" example is excellent.
Depending on your reply, I will decide whether to use OpenCV to chop the letters or to input adjacent letter blobs (words) to Tesseract.
Thanks in advance.

Anne

Jan 19, 2015, 5:23:15 AM
to tesser...@googlegroups.com
Hi all,

I'm trying to do the same as Basu, but it seems that the code has changed a lot in the past seven years. Could anyone give me a hint how to do that with version 3.02.02 of Tesseract?

Anne

Karuna Goyal

Jun 6, 2017, 4:44:21 AM
to tesseract-ocr, withbl...@gmail.com
Can anyone help with getting the probability scores for all the similar characters of a word? I tried, but I only get the probability score for the highest-probability character.