Official SemEval Task 8 ranking announcement

David Jurgens

Feb 11, 2022, 2:29:40 PM
to semeval-2022-task-8-multilingual-news
Hi Task 8 participants,

  After much double-checking, we've finalized the official team ranking, which we computed in a new way. In ranking teams, we were inspired by recent work (e.g., Dodge et al., 2019) on estimating model performance while recognizing that not all systems solving a task are on equal footing. In particular, some teams made more or fewer submissions due to time, computational budget, or model performance. These varying submission counts create an opportunity to rethink how we estimate how well each system actually performs.

  Our approach is as follows. We assume that, for most teams, submissions explore the hyperparameter/model-configuration space of their system, so each submission's score is informative about the distribution of the system's expected performance. To create Task 8's rankings, we use a bootstrapping approach across teams' submissions to estimate their expected rank. Specifically, we bootstrap rankings by sampling an equal number of submissions (n = 5) from each team's 50 most-recent submissions and then using the maximum score among each team's sampled submissions to compute one ranking. To get the final ranking, we repeat this process 10,000 times and take each team's average rank across these bootstrap samples. In essence, this measures the expected ranking if every team were given the same number of hyperparameter/configuration searches.
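
  For anyone who wants to replicate the procedure, here is a minimal sketch in Python. The function name and the dict-of-score-lists input format are our own illustration, not the exact scoring code, and the description above doesn't say whether sampling is with or without replacement, so the sketch samples with replacement (the standard bootstrap choice):

    import random
    from statistics import mean

    def expected_ranks(scores_by_team, n_sub=5, n_boot=10_000, window=50, seed=0):
        # scores_by_team: dict mapping team name -> list of submission
        # scores, ordered oldest to newest.
        rng = random.Random(seed)
        # Keep only each team's `window` most-recent submissions.
        recent = {t: s[-window:] for t, s in scores_by_team.items()}
        rank_samples = {t: [] for t in recent}
        for _ in range(n_boot):
            # One bootstrap iteration: draw n_sub submissions per team
            # (with replacement) and keep each team's best sampled score.
            best = {t: max(rng.choices(s, k=n_sub)) for t, s in recent.items()}
            # Rank teams by best sampled score, highest first (rank 1 = best;
            # ties are broken arbitrarily in this sketch).
            ordered = sorted(best, key=best.get, reverse=True)
            for rank, team in enumerate(ordered, start=1):
                rank_samples[team].append(rank)
        # Final ranking: each team's average rank across all bootstrap samples.
        return {t: mean(ranks) for t, ranks in rank_samples.items()}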

   To illustrate why this is useful, consider a team with 50 submissions where most scores are near 0.5 but one submission scores above 0.75. The single high score may place the team near the top of the original leaderboard, yet that rank reflects one very specific configuration rather than the system's general performance. Under our new evaluation, this team's expected rank may fall below other teams' (since most bootstrap samples would have a maximum score near 0.5), which better reflects the model's expected performance.
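
  Concretely, running the sketch above on two hypothetical teams shows the effect (all numbers invented for illustration):

    team_a = [0.50] * 49 + [0.75]   # one lucky configuration among 50
    team_b = [0.60] * 50            # consistently mid-range
    print(expected_ranks({"team_a": team_a, "team_b": team_b}))
    # With 5 draws from 50 submissions, roughly 90% of bootstrap samples
    # miss team_a's lone 0.75, so team_a's maximum is usually 0.50 and
    # team_b ends up with the better (lower) expected rank.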

  The final all-data rankings are now on CodaLab.

  In practice, our new ranking approach largely does not change the ranking shown on CodaLab, with only a handful of teams shifting. This result suggests that models were quite stable in their performance and that differences in hyperparameter/configuration selection were not driving any team's ranking (which is great!).

  We hope this approach leaves you more confident in the results and official ranking. We also hope this evaluation strategy inspires other tasks and gives later scholars more confidence about which systems offer significant performance improvements or can be expected to perform well on average.

  Best,
  David Jurgens (on behalf of the Task 8 Organizers)