I have a few questions regarding the evaluation process.
Given a vector of scores A obtained with method X, a vector of scores B obtained with another method Y, and a vector of gold standard scores GS, the Pearson correlations are computed as P1 = Pearson(A, GS) and P2 = Pearson(B, GS). If P1 > P2, then method X is taken to outperform method Y, since in the paper all participating systems are ranked by their Pearson correlation.
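To make the setup concrete, here is a minimal sketch of the ranking step in plain Python. The score values are hypothetical and chosen only for illustration, and the `pearson` helper is my own, not the official evaluation script.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the centered values
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores, for illustration only (not real system output)
GS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # gold standard
A = [0.5, 1.2, 1.8, 3.5, 3.9, 4.6]    # scores from method X
B = [1.0, 0.4, 2.9, 2.2, 4.8, 3.7]    # scores from method Y

P1 = pearson(A, GS)
P2 = pearson(B, GS)
print(P1, P2, "X ranks above Y" if P1 > P2 else "Y ranks above or ties X")
```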
I saw that a t-test was applied to determine which systems actually perform the same even though their Pearson values differ. Now I have this problem: P1 = P2, but when I apply the t-test it reports an extremely significant difference. So I cannot say that method X performs the same as method Y, even though their Pearson values are the same.
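A minimal sketch of how this situation can arise, assuming the t-test is a paired t-test applied directly to the two score vectors A and B (that is an assumption, and the numbers are hypothetical). Pearson correlation is unchanged by a positive linear rescaling of the scores, whereas a t-test compares means, so B = 2A + 1 yields the same correlation with GS but a large t statistic.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def paired_t_statistic(x, y):
    """t statistic of a paired t-test on the differences x[i] - y[i]."""
    n = len(x)
    diffs = [a - b for a, b in zip(x, y)]
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator)
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Hypothetical scores, for illustration only
GS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
A = [0.5, 1.2, 1.8, 3.5, 3.9, 4.6]
B = [2.0 * a + 1.0 for a in A]   # same ordering and shape as A, different scale

P1 = pearson(A, GS)
P2 = pearson(B, GS)
t = paired_t_statistic(A, B)     # degrees of freedom: len(A) - 1 = 5

# P1 and P2 agree up to floating-point rounding, yet |t| is far above
# 2.571, the two-sided 5% critical value for 5 degrees of freedom.
print(P1, P2, t)
```

In this sketch the t-test reacts to the shift in the mean score, which the Pearson correlation with GS ignores entirely.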
Please help me out. What can I do in such cases? Is there another statistical test that I could apply to determine which method actually outperforms the other?
Also, could I use the F-score for evaluation? It seems to me that the Pearson correlation does not evaluate the systems accurately enough, since there are cases where the Pearson values are the same but the t-test shows a statistically significant difference.
Thank you very much for your time.