Right; this would be clearer if I could sketch it on paper, but I
can't, so I'll try to describe it.
R to K is the easiest case to describe: cover the top of the R and it
looks like a K. Smudges, glare from the scanner's light, boxing errors,
and so on can all cause this kind of degradation. Thresholding can
contribute to the problem because it converts greyscale to binary: if a
stroke is too light, it is effectively wiped out. Access to the
character probabilities won't actually help here. If threshold 1 gives
you an R with a broken top, it will have a relatively low confidence
value, whereas threshold 2, which has removed the top completely, will
give a higher confidence value for the character as 'K'. Going purely
by character probabilities can just as easily give you the worst
results of both as the best.
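To make the R/K point concrete, here is a minimal sketch in Python. The confidence numbers are invented purely for illustration (they are not real tesseract output), and pick_by_confidence is a hypothetical helper, not part of any tesseract API.

```python
# Hypothetical classifier output for the same glyph (really an 'R') under
# two thresholds. The confidence values are made up for illustration.
candidates = {
    "threshold 1": ("R", 0.55),  # top of the R is broken but still there
    "threshold 2": ("K", 0.90),  # top of the R has been wiped out entirely
}

def pick_by_confidence(cands):
    """Return (threshold, char, conf) for the highest-confidence candidate."""
    name, (char, conf) = max(cands.items(), key=lambda kv: kv[1][1])
    return name, char, conf

choice = pick_by_confidence(candidates)
print(choice)
```

With numbers like these, choosing on confidence alone selects the clean-looking but wrong 'K' over the damaged but correct 'R'.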
> Patrick
>
> On Jul 9, 5:01 am, caro <caroline.ma...@gmail.com> wrote:
>> I am working with tesseract OCR and I would like to get at the end of
>> the algorithm a confidence value which may express if the recognition
>> seems OK or not really.
>>
>> For example, I have an image with the text: TEST RESULTS ARE OK.
>> Depending on a threshold value, I can get different output of the OCR:
>> - TEST RESSUTTS AKE OC
>> - TEST TELLUTTS ARE OB
>> ....
>> The best threshold can be different for different images.
>> So if I can get this confidence value, maybe it can give me the best
>> threshold to choose for the OCR?
>>
>> Thank you for your help,
>> Caroline
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com.
> To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
>
>
--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
If you want to delve into the guts of tesseract, you can get at the
character choices and the confidence values attached by the
classifier, but that information by itself won't be much help -- see
my other mail.
You've got the start of a good idea here, but you need something
external to get you the rest of the way. One way that you can get
external information is to pass the words through a spellchecker or
use the DAWG facilities: the better thresholding value will have a
higher number of recognised words.
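As a rough sketch of the spellchecker idea, the snippet below counts how many tokens of each OCR output appear in a word list. It uses the two sample outputs from the original question; the four-word WORDS set is only a stand-in (an assumption) for a real dictionary or spellchecker, and recognised_words is a name made up for this example.

```python
import re

# Tiny stand-in word list; a real run would load a full dictionary or
# query a spellchecker instead.
WORDS = {"TEST", "RESULTS", "ARE", "OK"}

def recognised_words(text, words=WORDS):
    """Count the alphabetic tokens in `text` that appear in `words`."""
    tokens = re.findall(r"[A-Za-z]+", text.upper())
    return sum(1 for token in tokens if token in words)

# The two OCR outputs quoted in the original question.
outputs = ["TEST RESSUTTS AKE OC", "TEST TELLUTTS ARE OB"]
scores = {text: recognised_words(text) for text in outputs}
best = max(outputs, key=recognised_words)
print(scores, best)
```

Neither output is correct here, but the count still ranks them, which is all that is needed to compare thresholds.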
Basically:
1. Try many thresholds, extract OCR results for each
2. Use a dictionary to vote on how English each sentence is
3. Choose the highest voted sentence to build a composite result
The dictionary step will include problem-specific rules: for plaque
recognition it'll include rules about date formats (they're usually
something like "1863-1845", i.e. four digits, a hyphen, four digits).
The dictionary will also include proper names for people and locations
that are associated with the geo tags for the plaque.
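One way the voting and composite steps might look in Python is sketched below. Everything here is an assumption made for illustration: the OCR results are hard-coded stand-ins keyed by threshold rather than real tesseract output, the vocabulary and names are invented, and the sketch assumes the lines of each result line up one-to-one across thresholds.

```python
import re

# Problem-specific rule: four digits, a hyphen, four digits.
DATE_RANGE = re.compile(r"\d{4}-\d{4}")

def token_ok(token, vocabulary):
    """A token passes if it matches the date rule or is in the vocabulary."""
    core = token.strip(".,;:")
    return bool(DATE_RANGE.fullmatch(core)) or core.upper() in vocabulary

def line_score(line, vocabulary):
    """Fraction of tokens in the line that pass the dictionary step."""
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(token_ok(t, vocabulary) for t in tokens) / len(tokens)

def composite(results_by_threshold, vocabulary):
    """For each line position, keep the best-scoring candidate line."""
    n_lines = max(len(lines) for lines in results_by_threshold.values())
    chosen = []
    for i in range(n_lines):
        options = [lines[i] for lines in results_by_threshold.values()
                   if i < len(lines)]
        chosen.append(max(options, key=lambda l: line_score(l, vocabulary)))
    return chosen

# Hypothetical vocabulary (ordinary words plus proper names) and
# hypothetical OCR output for two threshold values.
vocabulary = {"JOHN", "SMITH", "LIVED", "HERE"}
results = {
    100: ["JOHN SMITH", "l845-I863"],
    140: ["J0HN SM1TH", "1845-1863"],
}
merged = composite(results, vocabulary)
print(merged)
```

Neither threshold gets both lines right on its own, but the composite takes the name from one and the date range from the other.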
HTH,
Ian.
On 9 July 2010 10:01, caro <carolin...@gmail.com> wrote:
--
Ian Ozsvald (A.I. researcher, screencaster)
i...@IanOzsvald.com
http://IanOzsvald.com
http://MorConsulting.com/
http://blog.AICookbook.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald