OOT Scoring


Richard Wicentowski

Mar 19, 2010, 2:37:20 PM
to SemEval2010.CrossLingualLexicalSubstitution
Hi all,

In the Semeval-2007 Lexical Substitution task, participants had
different understandings of how the OOT scoring worked. Some
participants believed that, for each instance, the items in the list
of substitutes needed to be unique. Other participants believed that,
for each instance, the list of substitutes could contain duplicates.

In order to be sure everyone realizes which rules are being used, page
4 of the documentation says "Duplicates are allowed so a system may
put more emphasis on items it is more confident of". This is confused
a bit on page 5 where the documentation talks about A as a *set*
which, mathematically, should not be allowed to contain duplicates.

Note that providing duplicate answers can dramatically change your
score. Following from the example provided in the CLLS documentation
(id 99), let's say that one participant supplied "feliz; contento" and
another provided "feliz; contento; feliz; contento; feliz; contento;
feliz; contento; feliz; contento", the difference between the two
participants' precision scores would be a factor of five!
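
To make the arithmetic concrete, here is a minimal sketch of per-item oot credit as I read the task description: each supplied answer earns the number of annotators who gave it, and the sum is divided by the total number of annotator responses for the item, with no division by the number of answers supplied. The gold counts below are made up for illustration (they are not the real counts for id 99), and `oot_credit` is a hypothetical helper, not the official scorer.

```python
from fractions import Fraction

def oot_credit(answers, gold_counts):
    """Per-item oot credit; a duplicated answer is credited every time it appears."""
    total_responses = sum(gold_counts.values())
    matched = sum(gold_counts.get(answer, 0) for answer in answers)
    return Fraction(matched, total_responses)

# Hypothetical gold annotations: substitute -> number of annotators who chose it.
gold = {"feliz": 3, "contento": 2, "alegre": 1}

short_list = ["feliz", "contento"]
padded_list = short_list * 5  # the same two answers repeated to fill all 10 slots

short_score = oot_credit(short_list, gold)
padded_score = oot_credit(padded_list, gold)
print(short_score, padded_score, padded_score / short_score)
```

Under this reading the ratio is exactly five whatever the gold counts are, because repeating the list five times multiplies the matched count by five while the denominator stays fixed.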

Editorializing for a moment, I don't think this is the best method for
handling the scoring metric. The two systems shown above should
clearly receive the same credit for their answers and hopefully this
can be addressed in a future Semeval task.

Good luck,

-Rich


Diana McCarthy

Mar 20, 2010, 5:09:41 AM
to clls...@googlegroups.com
Hi Rich

Richard Wicentowski wrote, On 19/03/10 18:37:


> Hi all,
>
> In the Semeval-2007 Lexical Substitution task, participants had
> different understandings of how the OOT scoring worked. Some
> participants believed that, for each instance, the items in the list
> of substitutes needed to be unique. Other participants believed that,
> for each instance, the list of substitutes could contain duplicates.
>
>

Yes, in 2007 we had not thought about someone providing duplicates, and
the scorer gave credit for them. In our journal paper for the task

McCarthy, D. and R. Navigli (2009) The English Lexical Substitution
Task, In /Language Resources and Evaluation/ 43 (2) Special Issue on
Computational Semantic Analysis of Language: SemEval-2007 and Beyond,
Agirre, E., Màrquez, L. and Wicentowski, R. (Eds). pp 139-159 Springer.

we highlighted this difference and showed which systems (perhaps
consciously or unconsciously) had included duplicates.


> In order to be sure everyone realizes which rules are being used, page
> 4 of the documentation says "Duplicates are allowed so a system may
> put more emphasis on items it is more confident of". This is confused
> a bit on page 5 where the documentation talks about A as a *set*
> which, mathematically, should not be allowed to contain duplicates.
>
>

Sure: we should have changed that to a multiset.


> Note that providing duplicate answers can dramatically change your
> score. Following from the example provided in the CLLS documentation
> (id 99), let's say that one participant supplied "feliz; contento" and
> another provided "feliz; contento; feliz; contento; feliz; contento;
> feliz; contento; feliz; contento", the difference between the two
> participants' precision scores would be a factor of five!
>
>

Right, but the "best" score will point in the right direction and anyone
doing the oot task would know they should provide 10 answers. If someone
provides:


"feliz; contento; feliz; contento; feliz; contento; feliz; contento;
feliz; contento"

and someone else provides:

"feliz; contento; XXX; YYYY; ZZZ; AAA; CCC; DDD; RRR; WWW"

then the first scores higher because they have put more weight on items that they have more confidence in and not hedged their bets. If someone only puts 2 answers for the OOT task they will perform worse as a rule because you are intended to supply 10 answers.

We were not sure whether to allow duplicates or not, but we decided that confident systems should be allowed to weight their responses in this way.
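
As a rough illustration of that comparison, the sketch below scores three answer lists for one item. It assumes oot credit is the summed annotator count of every supplied answer (duplicates included) divided by the total number of annotator responses; the gold counts are invented for the example and `oot_credit` is a hypothetical helper rather than the official scorer.

```python
def oot_credit(answers, gold_counts):
    """Per-item oot credit; each occurrence of an answer is credited separately."""
    total_responses = sum(gold_counts.values())
    return sum(gold_counts.get(answer, 0) for answer in answers) / total_responses

# Hypothetical gold annotations: substitute -> number of annotators who chose it.
gold = {"feliz": 3, "contento": 2, "alegre": 1}

systems = {
    # Ten slots spent on the two answers the system is confident about.
    "confident": ["feliz", "contento"] * 5,
    # Two good answers plus eight fillers that match nothing in the gold data.
    "hedged": ["feliz", "contento", "XXX", "YYYY", "ZZZ",
               "AAA", "CCC", "DDD", "RRR", "WWW"],
    # Only two answers supplied, eight slots left unused.
    "two_only": ["feliz", "contento"],
}

scores = {name: oot_credit(answers, gold) for name, answers in systems.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With these made-up counts the confident list scores highest, and the two-answer list ties with the hedged list only because every filler in the hedged list is wrong; any correct filler would put the hedged list ahead of the two-answer one.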

> Editorializing for a moment, I don't think this is the best method for
> handling the scoring metric. The two systems shown above should
> clearly receive the same credit for their answers and hopefully this
> can be addressed in a future Semeval task.
>
>

No, not for oot: if one system only puts 2 answers it will do worse,
unless the other system's remaining answers are all wrong, in which
case it does the same. You are intended to supply 10 answers.

I **do** agree that there are many different ways of scoring. This is
why we have many different ways of scoring best / oot and mode
precision and recall for both of these. Note that mode will score the
two systems you described the same (even for oot). I do concede that
there may still be other better ways of scoring and I hope someone out
there might take up the challenge for the next SemEval!
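
For the mode point, a small sketch, assuming mode scoring gives an item credit of 1 when the single most frequent gold substitute appears anywhere in the answer list and 0 otherwise, and skips items without a unique mode. The gold counts are invented and `oot_mode_credit` is a hypothetical name, not a function from the official scorer.

```python
def oot_mode_credit(answers, gold_counts):
    """Return 1 if the unique gold mode is among the answers, 0 if not,
    and None when the item has no unique mode (such items are not scored)."""
    top_count = max(gold_counts.values())
    modes = [sub for sub, count in gold_counts.items() if count == top_count]
    if len(modes) != 1:
        return None
    return 1 if modes[0] in answers else 0

# Hypothetical gold annotations: substitute -> number of annotators who chose it.
gold = {"feliz": 3, "contento": 2, "alegre": 1}

short_list = ["feliz", "contento"]
padded_list = short_list * 5

short_mode = oot_mode_credit(short_list, gold)
padded_mode = oot_mode_credit(padded_list, gold)
print(short_mode, padded_mode)
```

Because the check is simple membership, repeating an answer cannot change the result, so the two lists get the same mode credit.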

Thanks for the comments. I am sure they will really help everyone.

very best

Diana
> Good luck,
>
> -Rich
>
>
>


--

===========================================================================
Diana McCarthy, http://www.dianamccarthy.co.uk/
Lexical Computing Ltd. http://www.sketchengine.co.uk/
===========================================================================

