Evaluation methods

Nov 23, 2017, 8:02:19 AM
to STS SemEval
Dear all,

I have a few questions regarding the evaluation process. 

Given a vector of scores A produced by method X, a vector of scores B produced by method Y, and a vector of gold-standard scores GS, two Pearson correlations are computed: P1 = Pearson(A, GS) and P2 = Pearson(B, GS). If P1 > P2, then method X outperformed method Y; this is how the paper ranks all participating systems.
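The comparison described above can be sketched as follows. This is only an illustration: the `pearson` helper and the three score lists are made up for the example and are not taken from the task data or the official evaluation script.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the centered scores.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores, invented for illustration only.
gs = [0.0, 1.5, 2.0, 3.5, 5.0]   # gold standard GS
a = [0.2, 1.0, 2.5, 3.0, 4.8]    # scores A from method X
b = [1.0, 1.2, 3.5, 2.0, 4.0]    # scores B from method Y

p1 = pearson(a, gs)
p2 = pearson(b, gs)
print(f"P1 = {p1:.4f}, P2 = {p2:.4f}")
print("X ranks above Y" if p1 > p2 else "Y ranks at or above X")
```

Note that this ranking only compares the two numbers; by itself it says nothing about whether the difference between P1 and P2 is statistically significant.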

I saw that a t-test was applied to determine which systems actually perform the same even though their Pearson values differ. Now I have this problem: P1 = P2, but when I apply the t-test it reports an extremely significant difference. So I cannot say that method X performs the same as method Y, even though their Pearson values are identical.

Please help me out. What can I do in such cases? And is there another statistical test that I could apply in order to actually know which method outperforms the other one? 

And could I use the F-score for evaluation? It seems to me that the Pearson correlation does not evaluate systems accurately enough, since there are cases where the Pearson values are the same but there is a statistically significant difference.

Thank you very much for your time. 

Walid Shalaby

Nov 26, 2017, 2:31:35 PM
to sts-s...@googlegroups.com
Dear Tsuki,
You're raising a valid concern about gauging the statistical significance of performance differences among techniques for measuring semantic similarity. We raised this issue and introduced a study on statistical significance using Steiger's Z test in our paper here. Our analysis agrees with your intuition: unfortunately, the current evaluation measures (Pearson or Spearman correlation) are used only to rank results numerically, disregarding the statistical significance of the differences between them.

You can find the code for performing Steiger's Z test here; please reach out to me if you need further assistance. Hope this helps.
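For readers without access to the linked code, here is a minimal sketch of one common form of Steiger's Z test for two dependent correlations that share a variable (here, the gold standard). It is not the code referred to above; the function name `steiger_z`, its parameter names, and the example numbers are assumptions made for illustration. The test needs three correlations: each system against the gold standard, and the two systems against each other.

```python
import math


def steiger_z(r_a_gs, r_b_gs, r_a_b, n):
    """Steiger's Z for comparing Pearson(A, GS) with Pearson(B, GS).

    r_a_gs, r_b_gs: correlation of each system with the gold standard
    r_a_b:          correlation between the two systems' scores
    n:              number of scored sentence pairs
    Returns (z, two-sided p-value).
    """
    # Fisher z-transform of the two correlations being compared.
    z_a = math.atanh(r_a_gs)
    z_b = math.atanh(r_b_gs)
    # Pooled correlation, used to estimate the covariance of z_a and z_b.
    r_bar_sq = ((r_a_gs + r_b_gs) / 2.0) ** 2
    psi = (r_a_b * (1.0 - 2.0 * r_bar_sq)
           - 0.5 * r_bar_sq * (1.0 - 2.0 * r_bar_sq - r_a_b ** 2))
    cov = psi / (1.0 - r_bar_sq) ** 2
    z = (z_a - z_b) * math.sqrt(n - 3) / math.sqrt(2.0 - 2.0 * cov)
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical values, invented for illustration only.
z, p = steiger_z(r_a_gs=0.80, r_b_gs=0.70, r_a_b=0.60, n=250)
print(f"Z = {z:.3f}, p = {p:.4f}")
```

Because this statistic is driven by the difference between the two correlations, it is zero whenever P1 = P2. A t-test on the raw scores asks a different question (whether the score values differ), which may explain why it can report a significant difference even when the Pearson values are equal.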


Website of task, http://alt.qcri.org/semeval2017/task1/

Walid Shalaby
PhD Candidate
UNC Charlotte | Department of Computer Science
432 Woodward Hall | 9201 University City Blvd. | Charlotte, NC 28223
Web page: http://webpages.uncc.edu/~wshalaby/