Labels for evaluation data

Scott Hale

Feb 4, 2022, 4:18:58 PM
to semeval-2022-task-8-multilingual-news
Dear all,

Thank you so much for participating in the shared task. We will be releasing the official rankings next week after we make sure we have the highest scoring submission from each team.

The evaluation/test data with the gold-standard labels is now available on the website and linked below. Per the competition description, submissions were scored using Pearson's correlation with the 'Overall' column. If a 'pair_id' appeared in the submitted data more than once, only the first score was used to compute the correlation. We hope this data will be useful in evaluating your models by, for example, producing breakdowns for different language pairs.

Please save the file with a .csv extension:

We're excited to see your papers and code. We strongly encourage everyone to submit regardless of your score. Please also consider sharing your code on GitHub or another location. 

Papers should use the ACL template (https://github.com/acl-org/acl-style-files), but please follow the SemEval guidelines for specifics on length, etc.: https://semeval.github.io/system-paper-template.html . Papers are due Feb 23, 2022 at 11:59 PM AoE. We'll send more details on where to upload them in due course.

Best wishes,
Scott on behalf of all task organizers

Dirk

Feb 6, 2022, 8:15:40 AM
to semeval-2022-task-8-multilingual-news
Hi Scott,

Thanks for releasing the final eval data, but I have some questions about it.

1. The row counts differ between the final eval data and the previous release (the one without scores): 4902 rows vs. 4953 rows. From previous discussions in this group, I assume this is because of duplicated rows?

2. My manually calculated scores differ from the results on the competition site. Whether I use the final data (4902 rows) or the previous data (4953 rows) as the base, the gap between the competition-site score and my calculated score is quite large (about 0.02). What could the problem be?

Thanks for help and best regards,
Dirk

Scott Hale

Feb 6, 2022, 8:24:17 AM
to Dirk, semeval-2022-task-8-multilingual-news
Please ensure the pair_ids match (not the row numbers, which, as you note, may not be consistent). For the official scores, we (a) keep only the first entry for any duplicated pair_id, (b) sort both the submitted and gold data by the `pair_id` value, (c) ensure all pair_ids match between the submitted and gold data, and (d) compute Pearson's correlation using scipy.stats.pearsonr.
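To make the four steps concrete, here is a minimal sketch. It is an illustrative reimplementation, not the official scorer: the official pipeline uses scipy.stats.pearsonr, while this sketch computes Pearson's r with the standard library only so the example is self-contained.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's r, equivalent to scipy.stats.pearsonr(xs, ys)[0]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def official_score(submission, gold):
    """submission, gold: lists of (pair_id, score) tuples."""
    # (a) keep only the first entry for any duplicated pair_id
    dedup = {}
    for pid, score in submission:
        dedup.setdefault(pid, score)
    # (b) sort both sides by pair_id
    gold_map = dict(gold)
    ids = sorted(dedup)
    # (c) ensure all pair_ids match between submission and gold
    assert ids == sorted(gold_map), "pair_ids do not match"
    # (d) Pearson correlation over the aligned score lists
    return pearson([dedup[i] for i in ids], [gold_map[i] for i in ids])
```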

Cheers,
Scott



Dirk

Feb 6, 2022, 9:20:35 AM
to semeval-2022-task-8-multilingual-news
Hi Scott,

I modified my script to match your description and I still get exactly the same result as before (so I assume my previous calculation was already correct). That means my manually calculated score is still different from the score on the competition site. Naturally, I'm using the exact same result file that I submitted on the site for my manual calculation.

From what I know, the competition site we are currently on still runs Python 2.7.x. Could this affect the scores? (Though I doubt it.)

Regards,
Dirk

Scott Hale

Feb 6, 2022, 9:23:56 AM
to Dirk, semeval-2022-task-8-multilingual-news
Yes, CodaLab runs Python 2.7. I cannot say at this point whether that is the reason for the difference. Things like this are exactly what we want to check before releasing the final rankings.

Dirk

Feb 6, 2022, 10:33:39 AM
to semeval-2022-task-8-multilingual-news
It would be nice if someone else could also confirm whether their manual calculations match CodaLab's or differ.

Famille Dufour

Feb 7, 2022, 6:45:33 AM
to semeval-2022-task-8-multilingual-news
I got a very slight difference, under 0.001. But the most important thing for me was the difference between languages: my scores ranged from 0.55 (Polish) to 0.82 (English-French). I also noticed that many of the largest errors are related to scraping problems (although we scraped several times in response to observed problems: empty or incomplete pages, pages containing only the newspaper's title, European GDPR consent walls, for example). For those pairs, our score was close to 4 while the true score was often 1. Too bad...

sebastien / BL Research

EMM

Feb 12, 2022, 6:32:26 AM
to semeval-2022-task-8-multilingual-news

Dear organizers,

first of all thanks for organizing that interesting task!

I concur with Sebastien: unavailable articles seem to significantly undermine the performance of our system.

To compare the approaches fairly, it would make sense to evaluate on the subset of successfully downloaded articles.
This could realistically be done by relying on the list of unavailable URLs that was shared previously, or, perhaps even better, on the list of URLs that are unavailable right now for direct scraping.
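Concretely, the subset evaluation I have in mind could look like the sketch below. The unavailable_ids set is hypothetical here; in practice it would be built from the previously shared list of unavailable URLs.

```python
def filter_available(gold, predictions, unavailable_ids):
    """gold, predictions: dicts of pair_id -> score.
    Returns aligned (predicted, gold) score lists restricted to pairs
    that are present in both dicts and not flagged as unavailable."""
    keep = sorted(pid for pid in gold
                  if pid in predictions and pid not in unavailable_ids)
    return ([predictions[p] for p in keep], [gold[p] for p in keep])
```

The two returned lists can then be fed directly into the same Pearson-correlation scoring as the official evaluation.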

Best,

Nicolas

Scott Hale

Feb 12, 2022, 6:42:22 AM
to EMM, semeval-2022-task-8-multilingual-news
Hi Nicolas,

We welcome discussion of this in the papers. We sought to ensure all pages in the evaluation dataset are available on the Internet Archive, but understand participants had slightly different experiences accessing these. Given these slight differences, I think it best to address the issue in each paper.

Best wishes,
Scott

