F-Score penalizes new senses too heavily


Денис Кокосинский

Mar 27, 2024, 2:59:11 AM
to AXOLOTL-24

Hello!


I understand I might be late with this suggestion, but I would like to discuss it regardless. Currently, the F-Score is computed over both the gold senses and any new senses a system predicts, and it heavily penalizes the prediction of new senses. Consider the following examples with the current implementation of the F-Score:

>>> from sklearn.metrics import f1_score
>>> f1_score(y_true=['a']*30 + ['b']*30, y_pred=['a']*31 + ['b']*29, average='macro', zero_division=0.0)
0.983328702417338
>>> f1_score(y_true=['a']*30 + ['b']*30, y_pred=['a']*30 + ['b']*29 + ['novel'], average='macro', zero_division=0.0)
0.6610169491525424

So, if a system discovers a novel sense and assigns it to even one example that actually carries an old sense, the F1 score drops dramatically. This forces models to be very conservative in proposing new senses.
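The reason becomes clear from the per-class scores (computed with scikit-learn's f1_score, as above, using average=None): the spurious 'novel' prediction introduces a third class with zero true positives, and its F1 of 0.0 is averaged together with the two near-perfect classes, giving (1.0 + 0.983 + 0.0) / 3 ≈ 0.661:

>>> f1_score(y_true=['a']*30 + ['b']*30, y_pred=['a']*30 + ['b']*29 + ['novel'], average=None, zero_division=0.0)
array([1.        , 0.98305085, 0.        ])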

Actually, there is a simple fix: providing the labels (classes) explicitly prevents this from happening:

>>> f1_score(y_true=['a']*30 + ['b']*30, y_pred=['a']*31 + ['b']*29, average='macro', zero_division=0.0, labels=['a', 'b'])
0.983328702417338
>>> f1_score(y_true=['a']*30 + ['b']*30, y_pred=['a']*30 + ['b']*29 + ['novel'], average='macro', zero_division=0.0, labels=['a', 'b'])
0.9915254237288136
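Restricting the averaging to the gold labels means the single 'novel' prediction only costs class 'b' one false negative instead of contributing a whole zero-F1 class, as the per-class scores show:

>>> f1_score(y_true=['a']*30 + ['b']*30, y_pred=['a']*30 + ['b']*29 + ['novel'], average=None, zero_division=0.0, labels=['a', 'b'])
array([1.        , 0.98305085])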


My second question about the F-Score concerns its extension to words with no examples of old senses. Currently, this is handled as a special case: the word receives a score of 1 if no old senses are predicted and a score of 0 if an old sense is predicted for at least one example. I think the penalty for misclassifying even a single example is too harsh in this case. Maybe we could simply ignore words with no old senses and not assign them an F-Score at all? (A sketch of the current rule follows below.)
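As I understand it, the current rule amounts to something like this minimal sketch (the function name and signature are mine, not the actual code of the evaluation script):

def score_word_without_old_senses(predictions, old_senses):
    # Gold data for this word contains only novel senses: the score is
    # 1.0 only if the system never predicts an old sense, else 0.0.
    return 0.0 if any(p in old_senses for p in predictions) else 1.0

>>> score_word_without_old_senses(['novel_1', 'novel_1'], old_senses={'a', 'b'})
1.0
>>> score_word_without_old_senses(['novel_1', 'a'], old_senses={'a', 'b'})
0.0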


What do you think? If you find my arguments reasonable, I could open a PR with a fix in FScore later today.


Best regards,

Kokosinskii Denis



Andrey Kutuzov

Mar 27, 2024, 2:12:37 PM
to axolo...@googlegroups.com
Dear Denis,

Thanks for your feedback; it is thought-provoking and valuable. After
some discussion among the organizers, we decided not to change the
default settings of the Subtask 1 evaluation code. The main reason is
that it would not be entirely fair to change the rules at this late
stage of the shared task.

In addition, we do not completely agree with your arguments:

1) The first fix you propose (ignoring "novel sense" predictions) would
lead to AXOLOTL not penalizing systems which wrongly predict a novel
sense for new usages of old senses.
It basically reduces the task to good old synchronic WSD. However, the
point of AXOLOTL is to introduce a diachronic aspect into this task;
that's why we penalize assigning a novel sense (instead of the correct
old sense) more heavily than assigning a wrong old sense.

2) Your second suggestion deals with target words which do not feature
any old senses for the new usages. Linguistically, it means that the
meaning of the target word has changed completely.

Our standard macro F1 metric is ill-defined in these cases. Thus,
indeed, our evaluation script assigns F1=1 if the system doesn't predict
any old sense for any new usage, and F1=0 if the system predicts an old
sense for at least one usage.

The value of "0" is of course arbitrary. It is in principle possible to
choose another value for such cases, but it will be equally arbitrary.
Our reasoning was that "0" at least signals that this is a degenerate
case, and the score is not directly comparable to other F1 scores.

We do not want to simply skip such cases, excluding them from the
average F1 score. We do want to penalize systems for predicting old
senses when there are in fact none: it means the system missed the fact
that the word has changed its meaning completely.

Having said that, it would still be interesting to also evaluate the
participating systems with the changes you propose. Thus, we would
definitely appreciate a PR adding these changes to `scorer_track1.py` as
non-default options. It is quite possible that future shared tasks will
use this evaluation scheme.

Two more important notes:

1) Subtask 2 features a second evaluation metric in addition to F1: the
Adjusted Rand Index (ARI). It partially addresses the issues you
mentioned (and probably introduces other issues); see the toy
illustration after these notes.

2) It is also important to mention that the tasks in AXOLOTL'24 are
novel and experimental. Thus, low evaluation scores (compared to more
mainstream NLP tasks) are to be expected.
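To give some intuition about ARI: it compares the induced partitions of
usages and ignores label identities, so a spurious novel cluster is
treated quite differently from a zero-F1 class under macro F1. A toy
sketch on Denis's examples (using scikit-learn's adjusted_rand_score;
the printed values are approximate):

from sklearn.metrics import adjusted_rand_score

# One usage assigned the wrong old sense: ARI comes out around 0.933
print(adjusted_rand_score(['a']*30 + ['b']*30, ['a']*31 + ['b']*29))
# One usage assigned a spurious novel sense: ARI comes out around 0.967
print(adjusted_rand_score(['a']*30 + ['b']*30,
                          ['a']*30 + ['b']*29 + ['novel']))

Here ARI is in fact more lenient toward the novel-sense error than
toward the wrong-old-sense one.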

Thanks again for your input!

On behalf of other organizers,
Andrey Kutuzov


--
Andrey
Language Technology Group (LTG)
University of Oslo

Andrey Kutuzov

Mar 27, 2024, 2:14:41 PM
to axolo...@googlegroups.com
On 27.03.2024 19:12, Andrey Kutuzov wrote:
> 1) Subtask 2 features a second evaluation metric in addition to F1:

Sorry, of course I meant Subtask 1 here.