Question about confidence level


Fawei Biralatei James

Sep 1, 2015, 4:08:46 PM
to eop-users
Hello Roberto,
 
I built a dataset and tried it on the platform. I got some good results, but one thing I don't understand is how the system computes the confidence level.
 
Could you please explain how the system computes the confidence level? That would help me interpret the results I have obtained.
 
 
Thanks.
 
Fawei

Roberto Ferrari

Sep 2, 2015, 3:12:13 AM
to eop-...@googlegroups.com
Hello,

The confidence can be considered a measure of how sure the system is about the entailment decision (it is in the range [0, 1], where 1 means full confidence). Since different EDAs compute this measure in different ways (it basically depends on the algorithm they use for classification), we would need to know which EDA you are using to see how it calculates that value.

Best,
Roberto



Fawei Biralatei James

Sep 2, 2015, 6:38:19 AM
to eop-users

Hello Roberto,
 
I am sorry I didn't mention which EDAs I used. I used the Textual Inference Engine (TIE) and Edit Distance EDAs.
 
 
 
Thanks,
Fawei

Roberto Ferrari

Sep 2, 2015, 7:16:19 AM
to eop-...@googlegroups.com
The EditDistance EDA in the EOP uses a linear classifier that computes the separation surface that best separates the positive and negative examples in the training data. Here the confidence is the distance between the classified pair and that separation surface: the farther the pair is from the surface, the higher the confidence.
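
To make the idea concrete, here is a minimal sketch of a distance-based confidence for a linear separator. It is not the EOP code: the function names, the weights, the bias and the feature vectors are all hypothetical, and the sketch does not show how the EOP maps the distance into the [0, 1] range.

```python
import math


def signed_distance(weights, bias, features):
    """Signed distance from a feature vector to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return score / norm


def classify(weights, bias, features):
    """Return (label, confidence); confidence is the unsigned distance."""
    d = signed_distance(weights, bias, features)
    label = "Entailment" if d >= 0 else "NonEntailment"
    return label, abs(d)


# Hypothetical separator and two hypothetical T-H feature vectors.
weights, bias = (3.0, 4.0), -5.0
print(classify(weights, bias, (3.0, 4.0)))  # far from the surface
print(classify(weights, bias, (0.0, 0.0)))  # closer, on the other side
```

The side of the surface gives the label, and the distance tells how far the pair is from being classified the other way.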

As regards TIE, Rui, who is the main author of the system, said the following to another person who had the same question:
"The confidence value in TIE is directly from the weka output. If you need the formula, I assume it's something like the scoring function shown here:
https://en.wikipedia.org/wiki/Multinomial_logistic_regression#Introduction_2
So even though we use 'confidence' to name the value, the values cannot be compared across different models."
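
As a rough illustration of the scoring function Rui points to, the sketch below turns per-class linear scores into probabilities with a softmax, as in multinomial logistic regression, and reports the probability of the winning class as the confidence. This is an assumption about the general technique, not the actual Weka or TIE code; the function names and the input scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(class_scores):
    """class_scores maps label -> linear score; returns (label, probability)."""
    labels = list(class_scores)
    probs = softmax([class_scores[label] for label in labels])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical scores for one T-H pair.
print(predict_with_confidence({"Entailment": 2.0, "NonEntailment": 0.0}))
```

Because the probabilities depend on the weights learned by each model, the same pair can receive quite different values from two models, which is why the values are not comparable across models.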


Fawei Biralatei James

Sep 3, 2015, 10:10:41 AM
to eop-users
Hello Roberto,
 
Thanks for the information about the confidence level. It was helpful and I appreciate it.
 
Another issue I have is interpreting the results I got with my dataset. In the gold standard of the dataset I developed, each question contributes three nonentailment pairs and one entailment pair. The pairs were taken from examination questions. These were multiple-choice questions, each with one right answer and three wrong answers. For the textual entailment exercise, I regrouped each question and its answers into four text-hypothesis pairs.
 
Could you please give me an idea of how to interpret such results?
 
 
Thanks
Fawei

Roberto Ferrari

Sep 4, 2015, 8:24:31 AM
to eop-...@googlegroups.com
I'm not sure I caught the question: you have a question "T" and 4 answers for it: H1, H2, H3, H4.


So you have represented each question like:

T-H1
T-H2
T-H3
T-H4

Now you want to use the EOP to classify each of the created T-H pairs as Entailment or NonEntailment. After classification you will have, for example, something like this:

T-H1  Entailment     0.7
T-H2  NonEntailment  0.3
T-H3  NonEntailment  0.1
T-H4  NonEntailment  0.4

where the confidence score is in the last column.

Finally, among those 4 assigned labels (Entailment, NonEntailment, NonEntailment, NonEntailment), you want to select the one in which you think the EOP is most confident (i.e. Entailment 0.7).

Is this what you have in mind?
 


 

Fawei Biralatei James

Sep 7, 2015, 5:00:28 AM
to eop-users
Hello Roberto,
 
Thanks for your assistance.
 
My results look exactly like the example you presented (repeated below), and I have results for about one hundred questions. My question is how to interpret them, because evaluating against a gold standard with one entailment pair and three nonentailment pairs per question seems biased to me: entailment and nonentailment are not evenly distributed.

T-H1  Entailment     0.7
T-H2  NonEntailment  0.3
T-H3  NonEntailment  0.1
T-H4  NonEntailment  0.4

where the confidence score is in the last column.

I would like to know whether there is any way to interpret such results.
 
 
Thanks

 

Roberto Ferrari

Sep 7, 2015, 6:36:16 AM
to eop-...@googlegroups.com
From my point of view, the training data set should have the same characteristics as the data set to be annotated. So if you have such a data set for training the system, or you think you could use a portion of your own data set for training and testing, that would be a good thing to try. In contrast, if you want or have to use one of the models distributed with the EOP, you have to consider that they might have been built on data sets whose characteristics differ from those of your data set; for example, the RTE-3 data set that the EOP was trained on is balanced (half of the examples are positive examples).

As regards how to interpret the results and select the best answer, I would say the following could be a simple approach. In the simplest case, where the system outputs 1 ENTAILMENT and 3 NonEntailment, there is no problem. An issue arises instead if, for a question T and its H1, H2, H3, H4, the system says that there is more than one ENTAILMENT (e.g. ENTAILMENT for both T-H1 and T-H2), or if it says NonEntailment for all of T-H1, T-H2, T-H3, T-H4; in both cases we know this is not possible.

In the first case, I would take the T-H pair annotated with ENTAILMENT in which the system is most confident as the pair where there is ENTAILMENT (all the remaining pairs will be annotated with NonEntailment). In the second case (NonEntailment for all the generated T-H pairs), I would annotate with ENTAILMENT the T-H pair in which the system is least confident (e.g. given T-H1 NonEntailment 0.8, T-H2 NonEntailment 0.3, T-H3 NonEntailment 0.1, T-H4 NonEntailment 0.5, I would consider T-H3 to be the pair where there is ENTAILMENT).
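
The selection rule described above can be sketched in a few lines. This is only an illustration applied to the EOP output after classification: the function name `pick_entailed` and the tuple layout `(pair_id, label, confidence)` are hypothetical, not part of the EOP.

```python
def pick_entailed(results):
    """Choose the single pair to treat as ENTAILMENT for one question.

    results: list of (pair_id, label, confidence) tuples for the four
    T-H pairs of a question, with label "Entailment" or "NonEntailment".
    """
    entailed = [r for r in results if r[1] == "Entailment"]
    if entailed:
        # One or more ENTAILMENT labels: keep the most confident one.
        return max(entailed, key=lambda r: r[2])[0]
    # All NonEntailment: take the pair the system is least confident about.
    return min(results, key=lambda r: r[2])[0]


all_negative = [
    ("T-H1", "NonEntailment", 0.8),
    ("T-H2", "NonEntailment", 0.3),
    ("T-H3", "NonEntailment", 0.1),
    ("T-H4", "NonEntailment", 0.5),
]
print(pick_entailed(all_negative))  # T-H3
```

With about one hundred questions, the selected pair for each question can then be compared with the known right answer to get a per-question accuracy, which is not affected by the 1-to-3 imbalance between the two labels.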



