[clls2010:34] Scoring metrics:: a discussion


ambidextrous

May 11, 2010, 1:36:27 AM
to SemEval2010.CrossLingualLexicalSubstitution
Hi everybody,

Thanks to all of you! I am sure you had a great time participating, and
I am sure we all learned something from this experiment.

We would like to start a discussion about the scoring metrics used in
the task, especially the 'oot normal' metric, which allows duplicates
and can therefore reach scores above 100. Some systems took advantage
of this and obtained high precision and recall for out-of-ten (oot),
while others did not. Additionally, some systems did not supply 10
translations for oot, which put them at a disadvantage. We will of
course not change any of the official scores, but we would like to
give the floor to any of you who have thoughts or analysis to share.
We can discuss in this group, and any of us might then do further
analysis for discussion.
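
To make the duplicate effect concrete, here is a minimal sketch. It assumes that the per-item oot credit is the summed gold frequency of every submitted guess divided by the total number of gold responses, with no deduplication of guesses. The function name `oot_item_score` and the toy gold counts are hypothetical and are not taken from the task data or the official scorer.

```python
from collections import Counter

def oot_item_score(guesses, gold_counts):
    """Per-item out-of-ten credit under the assumed formula.

    guesses:     list of system translations (duplicates are NOT removed)
    gold_counts: Counter mapping each gold translation to its annotator count
    """
    total_gold = sum(gold_counts.values())
    if total_gold == 0:
        return 0.0
    # Every guess earns its gold frequency, so a repeated guess earns it again.
    credit = sum(gold_counts.get(g, 0) for g in guesses)
    return credit / total_gold

# Toy gold standard (hypothetical): 5 annotator responses over 3 translations.
gold = Counter({"t1": 3, "t2": 1, "t3": 1})

distinct = ["t1", "t2", "t3", "t4"]   # four different guesses
repeated = ["t1"] * 10                # the most likely guess, ten times

print(oot_item_score(distinct, gold))  # 1.0, i.e. 100 on a percentage scale
print(oot_item_score(repeated, gold))  # 6.0, i.e. 600 on a percentage scale
```

Under this assumption, repeating one strong guess outscores covering every gold translation once, and a system that submits fewer than 10 guesses simply has fewer chances to accumulate credit.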

We do hope that you can make the meeting at Uppsala. We are thinking
of carrying on the discussion there, perhaps over an informal lunch.

All thoughts and comments are welcome.

Ravi

--
You received this message because you are subscribed to the Google Groups "SemEval2010.CrossLingualLexicalSubstitution" group.
To post to this group, send email to clls...@googlegroups.com.
To unsubscribe from this group, send email to clls2010+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/clls2010?hl=en.

Marine Carpuat

May 12, 2010, 3:45:42 PM
to clls...@googlegroups.com
Hi all,

I have been playing with a variant of the official "best" metric for the purpose of evaluating the lexical choice performance of a machine translation system. Admittedly, this is not the main goal of the CLLS task, but I thought I'd mention it in case it is relevant to others.

I modified the best score (equation 1 in the system description paper) to allow the system prediction to match any of the gold translations. (Note that unlike most systems, I'm producing a single translation candidate per instance.)

Given a single system prediction s for instance i with a set of gold translations T_i:
score(i) = 1 if s belongs to T_i
score(i) = 0 otherwise

I'm interested in this score, because it is closer to what is done in MT evaluation when comparing 1-best system output against multiple reference sentences. There is no partial credit and all gold translations are considered equally valid.
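
A minimal sketch of this any-match score is below. The function names, the dictionary layout, and the choice to average over all instances with a prediction are assumptions made for illustration; the message above only defines the per-instance score.

```python
def any_match_score(prediction, gold_translations):
    """Return 1 if the single prediction is among the gold translations, else 0."""
    return 1 if prediction in gold_translations else 0

def corpus_any_match(predictions, gold_sets):
    """Average the per-instance scores (averaging is an assumption).

    predictions: dict mapping instance id -> single predicted translation
    gold_sets:   dict mapping instance id -> set of gold translations
    """
    if not predictions:
        return 0.0
    total = sum(
        any_match_score(pred, gold_sets.get(instance_id, set()))
        for instance_id, pred in predictions.items()
    )
    return total / len(predictions)

# Hypothetical toy data: instance 1 matches a gold translation, instance 2 does not.
predictions = {1: "t1", 2: "t9"}
gold_sets = {1: {"t1", "t2"}, 2: {"t3"}}
print(corpus_any_match(predictions, gold_sets))  # 0.5
```

Because gold translations are stored as a plain set, annotator frequencies play no role here, which reflects the "no partial credit, all gold translations equally valid" design.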

  Marine

ambidextrous

May 16, 2010, 2:52:11 PM
to SemEval2010.CrossLingualLexicalSubstitution
Hi Marine,

Very interesting variation of the best metric. We are glad to see
participants extending the metrics to better suit their needs. Your
variant of 'best' with 1/0 scoring is closest to the 'mode' variation
of our metrics, except that instead of taking into account only those
items that have a mode, you are accounting for all the items.

A minor comment is that most of the participating systems produced a
single translation candidate per instance, not the other way around.

Best,
Ravi