How to get per-character confidence values in Tesseract 3.01?

micke

Apr 6, 2011, 1:16:34 PM
to tesseract-ocr
Hi,

I'm using Tesseract 3.01 on images that basically contain two columns of
multidigit numbers. The source material is fairly poor computer
printouts from the 1960s. I've trained Tesseract specifically for that
data, using a unicharset containing only the relevant characters, and
overall I'm very pleased with the accuracy. At the character level, I'm
getting about 99.8 percent. What I'm trying to do now is find a way to
locate probable errors to make it easier to fix them.

My first approach is to make use of Tesseract's confidence data.
Having researched this a bit, I realize those numbers may not do me a
whole lot of good, but I'd like to at least give it a try. What I've
tried so far is to patch TessBaseAPI::GetBoxText to include a new
column in the box file containing the confidence values, by calling
Confidence(RIL_SYMBOL) on the ResultIterator for each character. The
problem is that I get the same confidence value for all characters in
a "word", rather than character-specific values. Is this what's meant
to happen?

I've found that for my data, best_choice->blob_choices() always
returns NULL in ResultIterator::Confidence. Is this why I get word
confidences, or would it be the same thing if I did get choices, and
choice_it.data()->certainty() was called instead of
best_choice->certainty()? And should I be worried that there are no choices?

Of course, if there's a better way of getting at the character-level
confidence values, I'd appreciate any pointers you may have.

Thanks in advance,
Mikael

Dmitri Silaev

Apr 7, 2011, 12:01:05 AM
to tesser...@googlegroups.com, micke
Use this:

// This ensures Tesseract's "blob_choices" structures are filled
SetVariable("save_best_choices", "T");

Warm regards,
Dmitri Silaev


micke

Apr 7, 2011, 1:49:32 AM
to tesseract-ocr
Thanks, save_best_choices worked great.

/mikael
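
For anyone following the same route: once the patched box file carries a per-character confidence column, flagging probable errors could look roughly like the sketch below. It is only an illustration under assumptions that are not confirmed in this thread: each line is taken to be "symbol left bottom right top page confidence", confidence is taken to be a number where lower means less certain, and the names BoxEntry and FindSuspects are made up for this example. The threshold is something you would have to tune on your own data.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One line of the patched box file. The trailing confidence column is an
// assumption about the patch's output, not a standard box-file field.
struct BoxEntry {
  std::string symbol;
  int left, bottom, right, top, page;
  float confidence;
};

// Reads "symbol left bottom right top page confidence" lines from `in` and
// returns the entries whose confidence is strictly below `threshold`.
// Blank or malformed lines are skipped.
std::vector<BoxEntry> FindSuspects(std::istream& in, float threshold) {
  std::vector<BoxEntry> suspects;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    BoxEntry e;
    if (!(fields >> e.symbol >> e.left >> e.bottom >> e.right >> e.top >>
          e.page >> e.confidence)) {
      continue;
    }
    if (e.confidence < threshold) {
      suspects.push_back(e);
    }
  }
  return suspects;
}
```

Opening the box file with std::ifstream and passing it to FindSuspects gives a list of characters, with their bounding boxes, to review by hand first.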
