micke
Apr 6, 2011, 1:16:34 PM
to tesseract-ocr
Hi,
I'm using Tesseract 3.01 on images that basically contain two columns
of multidigit numbers. The source material is fairly poor-quality
computer printouts from the 1960s. I've trained Tesseract specifically
for that data, using a unicharset containing only the relevant
characters, and overall I'm very pleased with the accuracy. At the
character level, I'm getting about 99.8 percent. What I'm trying to do
now is find a way to locate probable errors to make them easier to fix.
My first approach is to make use of Tesseract's confidence data.
Having researched this a bit, I realize those numbers may not do me a
whole lot of good, but I'd like to at least give it a try. What I've
tried so far is to patch TessBaseAPI::GetBoxText to include a new
column in the box file containing the confidence values, by calling
Confidence(RIL_SYMBOL) on the ResultIterator for each character. The
problem is that I get the same confidence value for all characters in
a "word", rather than character-specific values. Is this what's meant
to happen?
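
To make the goal concrete, here is a minimal sketch of how the patched
box file could be used to flag probable errors once it carries usable
confidence values. It assumes each line has the form "symbol left
bottom right top page confidence", where the trailing column is my own
extension and not part of the standard box format. The names BoxEntry,
ParseBoxLine and FlagLowConfidence are hypothetical, and the threshold
is an arbitrary value that would need tuning on real data.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One row of the patched box file. The confidence column is an
// assumed extension, not part of the standard box format.
struct BoxEntry {
    std::string symbol;
    int left, bottom, right, top, page;
    float confidence;
};

// Parse "symbol left bottom right top page confidence".
// Returns false if the line does not match that layout.
bool ParseBoxLine(const std::string& line, BoxEntry* out) {
    std::istringstream in(line);
    return static_cast<bool>(in >> out->symbol >> out->left >> out->bottom
                                >> out->right >> out->top >> out->page
                                >> out->confidence);
}

// Collect every entry whose confidence is below the given threshold.
// Malformed lines are skipped.
std::vector<BoxEntry> FlagLowConfidence(const std::vector<std::string>& lines,
                                        float threshold) {
    std::vector<BoxEntry> flagged;
    for (const std::string& line : lines) {
        BoxEntry entry{};
        if (ParseBoxLine(line, &entry) && entry.confidence < threshold) {
            flagged.push_back(entry);
        }
    }
    return flagged;
}
```

The flagged entries keep their bounding boxes, so a review tool could
highlight just those regions of the page. As long as every character in
a word shares one confidence value, though, this can only single out
suspicious words rather than individual characters.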
I've found that for my data, best_choice->blob_choices() always
returns NULL in ResultIterator::Confidence. Is this why I get word
confidences, or would it be the same thing if I did get choices, and
choice_it.data()->certainty() was called instead of
best_choice->certainty()? And should I be worried that there are no
choices?
Of course, if there's a better way of getting at the character-level
confidence values, I'd appreciate any pointers you may have.
Thanks in advance,
Mikael