How to get word confidence level?


emre

Jul 25, 2011, 10:15:15 AM
to tesseract-ocr
I have a test application that uses Tesseract to produce a text file
from an image. Is it possible to get the word scores along with the
text file?

Could I pass a parameter on the command line to get the ratings or word
scores?

Thanks

emre

Jul 27, 2011, 4:01:01 AM
to tesseract-ocr
Any answer?

Lutz, Michael

Jul 27, 2011, 4:10:13 AM
to tesser...@googlegroups.com
No, you cannot do that from the command line. If you use the API, you can.

-----Original Message-----
From: tesser...@googlegroups.com [mailto:tesser...@googlegroups.com] On Behalf Of emre
Sent: Wednesday, July 27, 2011 10:01
To: tesseract-ocr
Subject: Re: How to get word confidence level?

Any answer?

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

This message is confidential and intended only for the addressee. If you have received this message in error, please immediately notify the postm...@nds.com and delete it from your system as well as any copies. The content of e-mails as well as traffic data may be monitored by NDS for employment and security purposes.
To protect the environment please do not print this e-mail unless necessary.

An NDS Group Limited company. www.nds.com

emre

Jul 27, 2011, 4:23:17 AM
to tesseract-ocr
I have searched Google several times. Can you explain how I can do
that?


Lutz, Michael

Jul 27, 2011, 4:34:56 AM
to tesser...@googlegroups.com
// Returns the average confidence for the recognized text (between 0 and 100)
int Avg_Confidence = pTessBase->MeanTextConf();
// Returns all word confidences (between 0 and 100) in an array, terminated by -1.
// The caller must delete [] the array after use.
int* pAvg_Word_Confidence = pTessBase->AllWordConfidences();

Here pTessBase is an instance of the API. You should be able to find both methods in the API headers.
Call the methods after you have done the recognition, e.g. after getUTF8...
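
For anyone reading later, here is a minimal sketch of how the -1-terminated array described above could be consumed. The helper names CollectWordConfidences and LowConfidenceWords are hypothetical and are not part of the Tesseract API. The code only assumes the layout stated above (values between 0 and 100, terminated by -1), so it works on any such array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copies a -1-terminated confidence array (the layout
// described above for AllWordConfidences()) into a vector.
// It does not take ownership of the array; the caller still frees it.
std::vector<int> CollectWordConfidences(const int* confidences) {
    std::vector<int> result;
    if (confidences == nullptr) {
        return result;  // nothing was recognized or no array was provided
    }
    for (const int* p = confidences; *p != -1; ++p) {
        result.push_back(*p);
    }
    return result;
}

// Hypothetical helper: returns the indices of the words whose confidence
// is strictly below `threshold`.
std::vector<std::size_t> LowConfidenceWords(const std::vector<int>& confidences,
                                            int threshold) {
    std::vector<std::size_t> indices;
    for (std::size_t i = 0; i < confidences.size(); ++i) {
        if (confidences[i] < threshold) {
            indices.push_back(i);
        }
    }
    return indices;
}
```

With a real API instance, the idea would be to pass the pointer returned by pTessBase->AllWordConfidences() to CollectWordConfidences() and then delete [] the pointer. The threshold is an arbitrary choice that depends on your images.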


emre

Jul 27, 2011, 6:11:09 AM
to tesseract-ocr
Thank you very much, Michael. I'll do that.

Max Cantor

Jul 27, 2011, 5:35:49 AM
to tesser...@googlegroups.com
That's not entirely true. If you generate hOCR files, the word confidences are available in the hOCR output.

max

emre

Jul 27, 2011, 6:40:26 AM
to tesseract-ocr
Max, I can get the hOCR output, but it contains only the coordinates of
the words. Am I missing something? Where is the score in the output?

Lutz, Michael

Jul 27, 2011, 6:44:56 AM
to tesser...@googlegroups.com
Nice one. Cheers, you owe me a beer.


emre

Jul 27, 2011, 8:15:24 AM
to tesseract-ocr
Can you explain where the confidence is in the HTML output?


Max Cantor

Jul 27, 2011, 10:10:03 AM
to tesser...@googlegroups.com
Look for spans of class "xocr_word". You should see something like:

<span class="xocr_word" id="xword_1_25" title="x_wconf -1"> </span>

The number after x_wconf in the title attribute is a negative number representing confidence: the lower the number (that is, the higher its absolute value), the lower the confidence.
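
As a rough sketch of how those values could be pulled out of the hOCR text programmatically, the function below scans a string for "x_wconf" followed by an integer. The name ExtractWordConfidences is hypothetical. The code assumes the value appears exactly in the form shown in the span above and does no real HTML parsing, so treat it as an illustration only.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: extracts every "x_wconf <integer>" value found in an
// hOCR string, in document order. A leading minus sign is accepted because
// the values described above are negative.
std::vector<int> ExtractWordConfidences(const std::string& hocr) {
    static const std::regex pattern(R"(x_wconf\s+(-?\d+))");
    std::vector<int> values;
    for (std::sregex_iterator it(hocr.begin(), hocr.end(), pattern), end;
         it != end; ++it) {
        // Capture group 1 holds the number, including any sign.
        values.push_back(std::stoi((*it)[1].str()));
    }
    return values;
}
```

Passing the span shown above to this function yields a vector containing the single value -1. Values further below zero would indicate less confident words, per the description above.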

max

emre

Aug 5, 2011, 3:21:49 AM
to tesseract-ocr
Thanks, Max, it helped a lot.