evaluation results question

Ted Pedersen

May 2, 2011, 11:03:42 PM

to disco2011...@googlegroups.com

Greetings all,

I've been going over the evaluation results, and just want to make
sure I understand the system scores for the numeric and the
coarse-grained results. I see the note about Spearman's rho and Kendall's
tau in the PDF file, but it doesn't clearly align with either set
of results, so I just wanted to double check what was used on what set
of results....could you clarify that?

Thanks!
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

Organizer DISCo Workshop 2011

May 3, 2011, 1:53:07 AM

to disco2011...@googlegroups.com

Hi Ted,

Thanks for your question. You are right: from the table it is not entirely clear which results the rho and tau scores belong to.

The English table is split into two parts: numerical and coarse scoring results. Surprisingly, a system's performance under one scoring scheme does not seem to predict its performance under the other very well.

Spearman's rho and Kendall's tau were computed only for the numerical scoring. Spearman in particular does not make much sense on our coarse values. So the rho and tau values apply only to the numerical part. There is a weak relation between lower numerical scores (that is, smaller average point differences) and higher rho/tau scores, as long as they are significant. Rank correlation scores that were not significant are given in parentheses.
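
For readers who want to reproduce rank correlations of this kind, here is a minimal sketch in plain Python. It is not the workshop's scoring script: the tie handling (average ranks for Spearman, the tau-b variant for Kendall) is an assumption, the function names are made up for illustration, the scores at the bottom are hypothetical, and no significance test is included.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects the denominator for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes nothing
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


# Hypothetical scores on the 0-100 scale, for illustration only.
gold = [98, 95, 95, 60, 20]
system = [90, 97, 80, 85, 10]
print(round(spearman_rho(gold, system), 3))
print(round(kendall_tau_b(gold, system), 3))
```

Both measures look only at the ordering of the items, which is why they say little about three coarse labels where almost every pair of items is tied.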

Hope this helps,

Chris

Ted Pedersen

May 3, 2011, 8:19:52 AM

to disco2011...@googlegroups.com

Hi Chris,

Thanks for clarifying this. I think it's falling into place now...let
me just use an example from the results to make sure I'm
understanding...

For numeric scoring there is a system called pred.en.submit which
reported 174 answers...the Spearman's rho for that was .27 and the
Kendall's tau was .18. Is that correct?

Then as we move across that table there are values of 16.19, 14.93,
21.64 and 14.66....what are those?

For that same system in the coarse-grained results it answered 118 times,
and there are values of .356, .346, .5 and .275....what are those?

Thanks!
Ted

Organizer DISCo Workshop 2011

May 3, 2011, 8:43:41 AM

to disco2011...@googlegroups.com

Hi Ted, all,

We report numerical and coarse-grained scoring, as computed by the scoring scripts that were bundled with the training data.

The left part of the table covers numerical scoring and reports the average point difference for the full test data as well as for each individual relation.

In your example: for the pred.en.submit system, the rho and tau scores were .27 and .18. The average point difference for the full English test set is 16.19, with 14.93, 21.64 and 14.66 being the scores for the three individual relations ADJ_NN, V_SUBJ and V_OBJ.
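
As a rough illustration of how an average point difference can be computed overall and per relation, here is a small sketch. It assumes the measure is the mean absolute difference between system and gold scores, which is not confirmed here and may differ from the bundled scoring script. The function names are made up, the gold scores and the system score for "collect data" are taken from the lists posted in this thread, and the other two system scores are hypothetical.

```python
from collections import defaultdict


def average_point_difference(gold, predicted):
    """Mean absolute score difference over items present in both dicts."""
    shared = [key for key in predicted if key in gold]
    if not shared:
        raise ValueError("no items in common between gold and predictions")
    return sum(abs(predicted[k] - gold[k]) for k in shared) / len(shared)


def by_relation(gold, predicted):
    """Average point difference computed separately for each relation."""
    groups = defaultdict(dict)
    for (relation, phrase), score in predicted.items():
        groups[relation][(relation, phrase)] = score
    return {relation: average_point_difference(gold, preds)
            for relation, preds in groups.items()}


# Keys are (relation, phrase) pairs; scores are on the 0-100 scale.
gold = {
    ("EN_ADJ_NN", "red wine"): 96,
    ("EN_V_OBJ", "collect data"): 95,
    ("EN_V_OBJ", "help people"): 97,
}
predicted = {
    ("EN_ADJ_NN", "red wine"): 80,      # hypothetical system score
    ("EN_V_OBJ", "collect data"): 96,
    ("EN_V_OBJ", "help people"): 90,    # hypothetical system score
}

overall = average_point_difference(gold, predicted)
per_relation = by_relation(gold, predicted)
print(overall)       # 8.0
print(per_relation)  # {'EN_ADJ_NN': 16.0, 'EN_V_OBJ': 4.0}
```

Under this reading, lower values are better, and the per-relation figures need not average to the overall figure because the relations contain different numbers of items.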

On the right side of the table, the scores are label precision for the coarse-grained labels "low", "medium", "high".

As described previously, the test set is smaller for those since we excluded items with scores in the gray area between high and medium, and medium and low. We thus evaluate only on 118 items instead of 174. Again, we report on the full set of 118 as well as on single relations.
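
The coarse-grained evaluation described above can be sketched as follows. This assumes that label precision is the fraction of scored items whose predicted label equals the gold label, and that gray-area items simply have no gold label and are skipped. The item identifiers, the labels, and the function name are hypothetical.

```python
def label_precision(gold_labels, predicted_labels):
    """Return (precision, number of scored items).

    Items without a gold label (for example, gray-area items that were
    excluded from the coarse test set) are ignored.
    """
    scored = [item for item in predicted_labels if item in gold_labels]
    if not scored:
        raise ValueError("no predicted item has a gold label")
    correct = sum(predicted_labels[item] == gold_labels[item] for item in scored)
    return correct / len(scored), len(scored)


# Hypothetical data: item_d has no gold label, so it is not scored.
gold_labels = {"item_a": "high", "item_b": "high", "item_c": "low"}
predicted_labels = {
    "item_a": "high",
    "item_b": "medium",
    "item_c": "low",
    "item_d": "medium",
}

precision, n_scored = label_precision(gold_labels, predicted_labels)
print(n_scored)             # 3
print(round(precision, 3))  # 0.667
```

The same function can be applied to the subset of items belonging to one relation to get the per-relation figures.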

The pred.en.submit system has, interestingly, very nice scores for the numerical evaluation, yet quite low scores for the coarse-grained evaluation. Since we allowed systems to submit separate files, this is possible. For this system, the mapping from numerical scores to coarse labels seems to be somewhat unfortunate. The Duluth-1 system, however, performs the other way around: while average point differences are in mid-field and there was no rank correlation detectable by Spearman and Kendall, the coarse classification into "low", "medium" and "high" compositionality seems to work well. The Duluth system seems to put items into correct buckets but doesn't seem to order them well within these buckets.

cheers,

Ted Pedersen

May 3, 2011, 1:11:32 PM

to disco2011...@googlegroups.com

Hi Chris,

Thanks for these clarifications, this is quite helpful.

I didn't actually catch on til just now that the (-0.01) and (-0.01)
reported for duluth-1 (for example) were the actual Spearman's and
Kendall's values!! That's quite unexpected....After realizing this I
thought this might have something to do with ties, but it seems like
the gold standard also has ties so that didn't seem like a likely
explanation....

For example the top 12 (most literal) pairs from duluth-1 are put into
5 ranks ...

100 EN_V_OBJ develop methods
99 EN_V_SUBJ fans want
98 EN_V_OBJ wait minute
98 EN_V_OBJ raise bar
98 EN_V_OBJ foot bill
97 EN_V_SUBJ economist call
97 EN_V_OBJ take plunge
97 EN_V_OBJ pay visit
97 EN_V_OBJ help children
96 EN_V_OBJ spread word
96 EN_V_OBJ double number
96 EN_V_OBJ collect data

From the gold standard...these are the top 12 pairs, organized in 6 ranks....

98 EN_ADJ_NN small island
98 EN_ADJ_NN early version
97 EN_V_OBJ help people
96 EN_ADJ_NN red wine
96 EN_ADJ_NN rechargeable battery
95 EN_V_OBJ provide service
95 EN_V_OBJ provide information
95 EN_V_OBJ collect data
95 EN_ADJ_NN cheap price
94 EN_ADJ_NN statistical analysis
93 EN_V_OBJ obtain information
93 EN_ADJ_NN little girl

Interestingly enough there is only one pair in common between these
two sets (collect data) :) ...so I think the hypothesis about the
disordered buckets in duluth-1 is very valid...
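
The overlap between the two lists above can be checked mechanically. This short sketch copies the rows exactly as posted and intersects the (relation, phrase) pairs; the helper name `pairs` is just for illustration.

```python
# Rows as posted above: (score, relation, phrase).
duluth_top = [
    (100, "EN_V_OBJ", "develop methods"),
    (99, "EN_V_SUBJ", "fans want"),
    (98, "EN_V_OBJ", "wait minute"),
    (98, "EN_V_OBJ", "raise bar"),
    (98, "EN_V_OBJ", "foot bill"),
    (97, "EN_V_SUBJ", "economist call"),
    (97, "EN_V_OBJ", "take plunge"),
    (97, "EN_V_OBJ", "pay visit"),
    (97, "EN_V_OBJ", "help children"),
    (96, "EN_V_OBJ", "spread word"),
    (96, "EN_V_OBJ", "double number"),
    (96, "EN_V_OBJ", "collect data"),
]
gold_top = [
    (98, "EN_ADJ_NN", "small island"),
    (98, "EN_ADJ_NN", "early version"),
    (97, "EN_V_OBJ", "help people"),
    (96, "EN_ADJ_NN", "red wine"),
    (96, "EN_ADJ_NN", "rechargeable battery"),
    (95, "EN_V_OBJ", "provide service"),
    (95, "EN_V_OBJ", "provide information"),
    (95, "EN_V_OBJ", "collect data"),
    (95, "EN_ADJ_NN", "cheap price"),
    (94, "EN_ADJ_NN", "statistical analysis"),
    (93, "EN_V_OBJ", "obtain information"),
    (93, "EN_ADJ_NN", "little girl"),
]


def pairs(rows):
    """Drop the score and keep the (relation, phrase) identity of each row."""
    return {(relation, phrase) for _, relation, phrase in rows}


shared = pairs(duluth_top) & pairs(gold_top)
print(sorted(shared))  # [('EN_V_OBJ', 'collect data')]
```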

Anyway, just wanted to say thanks and confirm your observations
here....quite interesting....I'll keep thinking on this...