[Metrics Task] Test sets now available

Nitika Mathur

Jul 23, 2024, 11:25:42 PM

to wmt-...@googlegroups.com

We are happy to announce that the metrics task inputs are now available! There are 11 language pairs in the generaltest2024 test set, as well as 5 additional challenge sets, including African languages. For general-purpose metrics, we expect participation in all language pairs.

Submissions are now open via Codabench

The submission deadline has been extended by 12 hours and is now 12:00 pm July 31 (AoE), which is 00:00 August 1 (UTC).
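The AoE/UTC equivalence above can be checked with a short calculation. This is only a sketch: it assumes the usual definition of "Anywhere on Earth" as the UTC-12:00 offset, and the helper name `aoe_to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: AoE ("Anywhere on Earth") is the fixed UTC-12:00 offset.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_to_utc(year, month, day, hour, minute=0):
    """Convert a wall-clock time given in AoE to the equivalent UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=AOE)
    return local.astimezone(timezone.utc)


# 12:00 pm on July 31 (AoE), as stated in the announcement.
deadline_utc = aoe_to_utc(2024, 7, 31, 12)
print(deadline_utc.isoformat())  # 2024-08-01T00:00:00+00:00
```

Noon AoE always lands on midnight UTC of the following calendar day, which is why the two forms of the deadline differ by one date.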

Process:

1. Register your metric here, if you haven’t already.
2. Create an account on Codabench.
   - You’re allowed one primary submission for a reference-based metric and one primary submission for a reference-free metric. If you are submitting two metrics that take widely different approaches (for example, one LLM-based metric and one lexical metric), create two accounts on Codabench.
3. Download the data (link; link also available on Codabench).
4. Prepare your scores and submit them on Codabench, following the instructions on the "Getting Started -> evaluation" page on Codabench.

Please contact us if you see any issue or have any other questions.

Kind regards,

Metrics Task organizers wmt-m...@googlegroups.com

Genta Winata

Jul 30, 2024, 9:34:57 PM

to WMT: Workshop on Machine Translation

Hi organizers,

We noticed that some values in the src and mt columns are missing for the QE task. Is this expected? Should we omit the samples with missing values from our predictions?

Looking forward to your reply.

Thank you

Genta

Nitika Mathur

Jul 30, 2024, 9:51:41 PM

to wmt-...@googlegroups.com

Hi Genta,

Can you give an example of which language pair and which line numbers?

Best,

Nitika

--
You received this message because you are subscribed to the Google Groups "WMT: Workshop on Machine Translation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to wmt-tasks+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/wmt-tasks/f9a9b6fa-de2e-40a9-9eb4-04216865cdb7n%40googlegroups.com.

Nitika Mathur

Jul 30, 2024, 10:17:39 PM

to wmt-...@googlegroups.com

Hi all,

The new deadline is 00:00 August 2nd UTC (or 12pm August 1st AoE)

As of now, you have 1 day 21 hours 41 minutes before the deadline

Please see the countdown timer here:

https://www.timeanddate.com/countdown/generic?iso=20240802T00&p0=1440&msg=Metrics+Task+Submission+Deadline&font=cursive

Best,

Metrics Task organizers

Genta Winata

Jul 30, 2024, 11:38:14 PM

to WMT: Workshop on Machine Translation

Hi Nitika,

We found 4305 missing values (src or mt) in the entire test set. One example is shown in the screenshot. Let us know if we downloaded the wrong version of the test set.
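A count like this can be reproduced with a few lines of Python. The sketch below is illustrative only: it assumes the test set is tab-separated with a header row that names `src` and `mt` columns, which may not match the actual file layout, and `count_missing` is a hypothetical helper rather than part of the task's scripts.

```python
import csv
import io


def count_missing(tsv_text, columns=("src", "mt")):
    """Count rows where at least one of the given columns is empty or absent."""
    # Assumption: tab-separated text with a header row; quoting is disabled
    # so that quote characters inside sentences are kept literally.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    missing = 0
    for row in reader:
        if any(not (row.get(col) or "").strip() for col in columns):
            missing += 1
    return missing


# Tiny made-up sample: the second row lacks src, the third lacks mt.
sample = "lp\tsrc\tmt\nen-de\tHello\tHallo\nen-de\t\tHallo\nen-de\tHello\t\n"
print(count_missing(sample))  # 2
```

For a real file, the text would be read from disk first, for example with `open(path, encoding="utf-8").read()`.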

Thank you.

Screenshot 2024-07-30 at 11.35.51 PM.png

Markus Freitag

Jul 31, 2024, 1:48:40 PM

to wmt-...@googlegroups.com

Hi Genta,

Thanks for noticing this. This is not a bug; it is intended. You have the right data :)

Thank you,

Markus

Genta Winata

Jul 31, 2024, 2:48:38 PM

to wmt-...@googlegroups.com

Hi Markus,

Thank you for the reply. This is our first time participating in the shared task, so we may be missing some context. How should we handle the missing data? Should we treat those samples as bad translations? We are asking only so that we can prepare our submission, since our first attempt did not go through (we had skipped the samples with missing values). Additionally, we were not able to run the baseline scoring script for BLEU because reference data is missing in the challenge sets. Any guidance would be very helpful.

Thank you

Best regards,

Genta

Genta Indra Winata

@gentaiscool
https://id.linkedin.com/in/gentaiscool

"Too many of us are not living our dreams because we are living our fears."

Nitika Mathur

Jul 31, 2024, 6:15:18 PM

to wmt-...@googlegroups.com

Hi Genta,

When submitting, we need a score for every line, so please do not skip lines with missing values.

The scoring scripts should just run without any error.

For BLEU, run

```
python prepare_scores.py --baseline BLEU
```

If you add a function for your metric in prepare_scores.py (or prepare_qe_scores.py if your metric does not require the reference), then running the script should output the data in a format that is consistent with our requirements.

Best,

Nitika

Genta Winata

Jul 31, 2024, 7:38:56 PM

to wmt-...@googlegroups.com

Hi Nitika,

Thank you for clarifying.

We would like to ask for some further clarification.

1) If the sample does not have a reference, does it mean it won't be evaluated for "reference-based metrics"? So, any numbers we put there won't affect the final results?

2) Also for samples without source, will it be evaluated for "reference-free metrics"?

We found that we were using an old version of the sacreBLEU library to run the baseline. It works now. Thank you for your help.

Looking forward to hearing from you. Thank you for your time.

Best regards,

Genta

Nitika Mathur

Aug 1, 2024, 2:46:29 AM

to wmt-...@googlegroups.com

Hi Genta,
We will evaluate all samples for all metrics, even those with a missing source or reference.

Participants are free to handle these special cases in whatever way they think is appropriate. We will evaluate metrics based on their agreement with human evaluators.
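One possible way to keep one score per line while still handling the special cases is sketched below. Everything here is an assumption for illustration: the row format (dicts with `src` and `mt` keys), the `score_all` helper, the constant fallback value, and the placeholder `toy_metric` are not part of the official scripts, and the thread leaves the handling policy to each participant.

```python
def score_all(rows, metric, fallback=0.0):
    """Return exactly one score per input row, in the original order.

    Rows with a missing src or mt receive `fallback` instead of being
    skipped, so the output always lines up with the input.
    """
    scores = []
    for row in rows:
        src, mt = row.get("src"), row.get("mt")
        if not src or not mt:
            scores.append(fallback)  # policy choice, not an official rule
        else:
            scores.append(metric(src, mt))
    return scores


def toy_metric(src, mt):
    """Placeholder metric (length ratio); stands in for a real QE metric."""
    return min(len(src), len(mt)) / max(len(src), len(mt))


rows = [
    {"src": "Hello", "mt": "Hallo"},
    {"src": "", "mt": "Hallo"},   # missing source
    {"src": "Hi", "mt": None},    # missing translation
]
scores = score_all(rows, toy_metric)
assert len(scores) == len(rows)  # nothing was skipped
print(scores)  # [1.0, 0.0, 0.0]
```

Whether a constant fallback, a low score, or some other treatment agrees best with human judgments is a design decision for each metric.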

Best,
Nitika

Genta Winata

Aug 1, 2024, 3:05:11 AM

to wmt-...@googlegroups.com

Hi Nitika,

Thank you for your help. We appreciate it.

Best Regards,

Genta

Genta Winata

Aug 1, 2024, 1:56:40 PM

to wmt-...@googlegroups.com

Hi Nitika,

We have one more quick question. How do we designate which submission is the primary one? We could not find a description field on the submissions. Will we have a chance to designate the primary submission after the open phase?

Thank you

Best regards,

Genta

Nitika Mathur

Aug 1, 2024, 11:52:23 PM

to wmt-...@googlegroups.com

Hi Genta,

Happy to help!

If you are submitting multiple variations of the same metric, for example by using different embeddings or a different size, we would like you to pick one primary metric (one for reference-based and one for reference-free evaluation). You can pick the metric that you think has the highest chance of performing well on the test set (i.e. agreeing with human evaluation).

Please don't choose it based on the current leaderboard for the 2024 task as these are correlations with another automatic metric.

You can designate the primary metric when you fill in the registration form: https://docs.google.com/forms/d/e/1FAIpQLSenoN9svmqkfJshM8bHd8p3Ofyepdvcg9Mtns0VdEbsFunvEA/viewform

We will only include the primary metric on the main leaderboard. We will also report detailed results for all metrics.

Best,

Nitika

Genta Winata

Aug 1, 2024, 11:58:10 PM

to wmt-...@googlegroups.com

Hi Nitika,

Thank you for your reply and helpful suggestion. We filled in the form yesterday. We plan to start writing the paper early, since the paper deadline is very tight. Do you know when we can expect to find the final evaluation on the leaderboard?

Thank you

Best regards,

Genta

Nitika Mathur

Aug 7, 2024, 12:01:49 AM

to wmt-...@googlegroups.com

Hi Genta,

The human MQM evaluation is in progress right now. We will not be able to receive the results in time to share the final evaluation before the WMT paper submission deadline on the 20th.

We expect to have the results finalised well before the camera-ready deadline, so we suggest that you submit the paper with results on previous years' data and then add this year's results to the camera-ready version.

Best wishes,

Nitika

Genta Winata

Aug 7, 2024, 12:27:00 AM

to wmt-...@googlegroups.com

Hi Nitika,

Thank you for letting us know. We will use previous years' data for the first submission.

Genta
