# Evaluation methods

### Tsuki

Nov 23, 2017, 8:02:19 AM
to STS SemEval
Dear all,

I have a few questions regarding the evaluation process.

Given a vector of scores A obtained with method X, a vector of scores B obtained with another method Y, and a vector of gold-standard scores GS, the Pearson correlations are computed as P1 = Pearson(A, GS) and P2 = Pearson(B, GS). If P1 > P2, then method X outperformed method Y; in the paper, all participating systems are ranked by their Pearson correlation.

I saw that a t-test was applied to determine which systems actually perform the same even though their Pearson values differ. Now I have this problem: P1 = P2, but when I apply the t-test, it reports an extremely significant difference. So I cannot say that method X performs the same as method Y, even though their Pearson values are the same.

Please help me out. What can I do in such cases? And is there another statistical test that I could apply in order to actually know which method outperforms the other one?

Could I also use the F-score for evaluation? It seems to me that the Pearson correlation does not evaluate systems accurately enough, since there are cases where the Pearson values are the same but there is a statistically significant difference.

Thank you very much for your time.
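
For readers following the thread, the ranking step described in the question above (P1 = Pearson(A, GS), P2 = Pearson(B, GS)) can be sketched in plain Python. This is only a minimal illustration: the `pearson` helper and the toy score vectors `gs`, `a`, and `b` are made up for this example and are not taken from the task data or the official evaluation script.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score vectors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: gold-standard scores and two systems' outputs.
gs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
a = [0.1, 1.2, 1.9, 3.2, 3.8, 5.1]   # method X
b = [1.0, 0.5, 2.5, 2.0, 4.5, 3.5]   # method Y

p1 = pearson(a, gs)
p2 = pearson(b, gs)
print("P1 =", p1, "P2 =", p2, "X ranks higher:", p1 > p2)

# Rescaling a system's scores linearly (positive slope) leaves Pearson unchanged.
a_rescaled = [2.0 * x + 1.0 for x in a]
print("P1 after rescaling =", pearson(a_rescaled, gs))
```

Note that Pearson is unchanged by a positive linear rescaling of the scores, so two systems can have equal Pearson values while their raw scores differ substantially.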

### Walid Shalaby

Nov 26, 2017, 2:31:35 PM
Dear Tsuki,
You're raising a valid concern about gauging the statistical significance of performance differences among techniques for measuring semantic similarity. We raised this issue and presented a study of statistical significance using Steiger's Z test in our paper here. Our analysis agrees with your intuition; unfortunately, current evaluation measures (Pearson or Spearman correlation) are used only to rank results numerically, disregarding the statistical significance of their differences.

You can find the code for performing Steiger's Z test here; please reach out to me if you need further assistance. Hope this helps.

Regards,
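
The linked code is not reproduced in this thread, so what follows is not the authors' implementation. As a rough sketch of the test named in the reply above, here is one common form of Steiger's (1980) Z for two dependent correlations that share a variable (here the gold standard), using Fisher's z-transform and the averaged correlation. The function name `steiger_z` and its parameter names are assumptions made for this example; the inputs are Pearson(A, GS), Pearson(B, GS), Pearson(A, B), and the number of sentence pairs.

```python
import math


def steiger_z(r_xg, r_yg, r_xy, n):
    """Steiger's Z for two dependent correlations sharing one variable.

    r_xg: correlation of system X with the gold standard
    r_yg: correlation of system Y with the gold standard
    r_xy: correlation between the two systems' scores
    n:    number of scored items (must be > 3)
    Returns (z, two_sided_p).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    # Fisher z-transform of the two correlations being compared.
    z_xg = math.atanh(r_xg)
    z_yg = math.atanh(r_yg)
    # Pooled (average) correlation, as in Steiger's modification.
    r_bar = (r_xg + r_yg) / 2.0
    r_bar_sq = r_bar ** 2
    # Estimated covariance term between the two transformed correlations.
    c = (r_xy * (1.0 - 2.0 * r_bar_sq)
         - 0.5 * r_bar_sq * (1.0 - 2.0 * r_bar_sq - r_xy ** 2)) / (1.0 - r_bar_sq) ** 2
    z = (z_xg - z_yg) * math.sqrt(n - 3) / math.sqrt(2.0 - 2.0 * c)
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical values: two systems with close Pearson scores on 250 pairs.
z, p = steiger_z(r_xg=0.85, r_yg=0.80, r_xy=0.90, n=250)
print("z =", z, "two-sided p =", p)
```

With equal correlations against the gold standard the statistic is zero, so under this test two systems with the same Pearson value are not distinguished; it only assesses whether the difference between the two correlations is larger than expected by chance.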

