Call for Participation: WMT 2021 Machine Translation using Terminologies


Philipp Koehn

Apr 23, 2021, 6:03:58 PM
to wmt-...@googlegroups.com, Moses Support, COR...@uib.no

WMT 2021 Shared Task: 

Machine Translation using Terminologies

November 10-11, 2021
Punta Cana, Dominican Republic

Language domains that require very careful use of terminology are abundant. The need to adequately translate within such domains is undeniable, as shown by e.g. the different WMT shared tasks on biomedical translation.

More interestingly, as the abundance of research on domain adaptation shows, such language domains are (a) not adequately covered by existing data and models, while (b) new (or “surge”) domains arise and models need to be adapted, often with significant downstream implications: consider the new COVID-19 domain and the large efforts for translation of critical information regarding pandemic handling and infection prevention strategies.

In the case of newly developed domains, while parallel data are hard to come by, it is fairly straightforward to create word- or phrase-level terminologies, which can be used to guide professional translators and ensure both accuracy and consistency.

This shared task will replicate such a scenario, and invites participants to explore methods to incorporate terminologies into either the training or the inference process, in order to improve both the accuracy and consistency of MT systems on a new domain.

IMPORTANT DATES

Release of training data and terminologies: April 2021
Surprise languages announced: June 28, 2021
Test set available: July 19, 2021
Submission of translations: July 23, 2021
System descriptions due: August 5, 2021
Camera-ready for system descriptions: September 15, 2021
Conference in Punta Cana: November 10-11, 2021

SETTINGS

In this shared task, we will distinguish submissions that use the terminology only at inference time (e.g., for constrained decoding or similar approaches) from submissions that use the terminology at training time (e.g., for data selection, data augmentation, or explicit training). Note that basic linguistic tools such as taggers, parsers, or morphological analyzers are allowed in the constrained condition.
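
Purely as an illustration, and by no means a required or recommended baseline, one possible training-time use of the terminology is to annotate matched source terms inline with their desired target translations before training. A minimal sketch follows; the tag tokens and the greedy string matching are arbitrary choices made here for illustration:

```python
# Illustrative sketch only: inline terminology annotation of the source side,
# one of several possible training-time approaches (not an official baseline).

def annotate_with_terms(src_sentence, terminology):
    """Wrap each matched source term with placeholder tags and append the
    desired target term, so a model trained on such data learns to copy it."""
    out = src_sentence
    for src_term, tgt_term in terminology:
        if src_term in out:
            out = out.replace(src_term,
                              f"<term> {src_term} <trans> {tgt_term} </term>")
    return out

terms = [("hand sanitizer", "gel hydroalcoolique")]
print(annotate_with_terms("use hand sanitizer regularly", terms))
# use <term> hand sanitizer <trans> gel hydroalcoolique </term> regularly
```

Inference-time approaches, such as lexically constrained decoding, would instead leave the training data unchanged and enforce the terminology during search.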

The submission report should highlight the ways in which participants' methods and data differ from the standard MT approach, and should make clear which tools and training sets were used.

LANGUAGE PAIRS

The shared task will focus on four language pairs; systems will be evaluated on:
  • English to French
  • English to Chinese
  • Two surprise language pairs English-X (announced 3 weeks before the evaluation deadline)
We will provide training/development data and terminologies for the above language pairs. Test sets will be released at the beginning of the evaluation period. The goal of this setting (with both development and surprise language pairs) is to avoid approaches that overfit on language selection, and instead to evaluate the more realistic scenario of having to tackle the new domain in a new language in a limited amount of time. The surprise language pairs will be announced three weeks before the start of the evaluation campaign; at the same time, we will provide training data and terminologies for them.

You may participate in any or all of the language pairs.

ORGANIZERS

Antonis Anastasopoulos, George Mason University
Md Mahfuz ibn Alam, George Mason University
Laurent Besacier, NAVER
James Cross, Facebook
Georgiana Dinu, AWS
Marcello Federico, AWS
Matthias Gallé, NAVER
Philipp Koehn, Facebook / Johns Hopkins University
Vassilina Nikoulina, NAVER
Kweon Woo Jung, NAVER

Mārcis Pinnis

Apr 26, 2021, 11:00:12 AM
to Workshop on Statistical Machine Translation
Hi Philipp,

We are considering participating in this shared task.
We have some questions though.

It seems that the scope of this task is fairly similar to the news translation shared task, with an optional additional resource: terminology. Meaning: build the highest-quality MT system given the data and the time period.
I am therefore wondering: do you plan to somehow analyse whether terminology translation quality improves as well, and which methods allow improvements in which directions? Or will just the standard human or automatic evaluation be performed?

That is, we would want to benchmark our terminology integration methods, but it seems a bit odd if huge/deep/monster models from GPU-rich teams could easily overshadow the methods that address the specific issues of in-domain terminology...
For a future shared task, maybe a fixed training set and a fixed baseline NMT configuration could be considered, so that participants can benchmark their contributions against the baseline (in whatever framework they operate). Just a thought...

Best regards,
Mārcis ;o)

Philipp Koehn

Apr 30, 2021, 10:57:23 AM
to wmt-...@googlegroups.com, Antonis Anastasopoulos
Hi Mārcis,

thank you for your good questions and suggestions.

While we will compute BLEU scores, the main focus is the accuracy of the terminology terms and their correct placement in the sentence.

We will release the scoring script in a couple of weeks (we are still refining and testing the metrics).
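
To give a rough idea, the exact-match component of such an evaluation could look something like the sketch below. This is an illustration only, not the actual scoring script, and the final metric may differ:

```python
# Illustrative sketch only -- not the official WMT21 scoring script.
# Exact-match terminology accuracy: the fraction of triggered terminology
# entries whose expected target term appears in the system output.

def term_exact_match(source, hypothesis, terminology):
    """source/hypothesis: lowercased strings;
    terminology: list of (source_term, target_term) pairs."""
    triggered = [(s, t) for s, t in terminology if s in source]
    if not triggered:
        return None  # no terminology entry applies to this sentence
    hits = sum(1 for _, t in triggered if t in hypothesis)
    return hits / len(triggered)

terms = [("face mask", "masque"), ("hand sanitizer", "gel hydroalcoolique")]
src = "please wear a face mask and use hand sanitizer"
hyp = "veuillez porter un masque et utiliser du desinfectant"
print(term_exact_match(src, hyp, terms))  # 0.5 -- one of two terms matched
```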

Regards,
Philipp Koehn


Toms Bergmanis

Jun 8, 2021, 7:19:03 AM
to wmt-...@googlegroups.com, Antonis Anastasopoulos
Hi Philipp, Antonis, 
It's been a while since the task was initially announced, but we still cannot find the development data.
Are you still planning to carry out the task at this year's WMT?

Regards,

Dr. T. Bergmanis

NLProc researcher @ Tilde MT




Antonis Anastasopoulos

Jun 23, 2021, 10:09:24 AM
to Workshop on Statistical Machine Translation
Hi all,
the development data, as well as the evaluation metrics, are now available on the website!
Best,
Antonios Anastasopoulos

Alexander Molchanov

Jul 9, 2021, 12:35:12 PM
to Workshop on Statistical Machine Translation
Hi Antonis, Philipp, and others!

First of all, thank you for the tool for terminology evaluation!
Speaking about the paper: as far as I understand, there are four different scores: BLEU, exact match, window overlap with window = 2/3, and TERm. Do you consider combining the scores, and if so, how? I didn't find this part in the paper; correct me if I am wrong. There might be situations where system1 translates better in terms of BLEU but system2 is more consistent with respect to terminology.

Talking about window overlap: system1 would probably get better scores just because it translates better (thus, the context around the terms that system1 was, perhaps accidentally, able to match is translated better and 'fits' the human translation more closely). I guess my question is related to the one asked by Mārcis earlier. I totally understand that you do write about such cases and even provide examples, but I couldn't find a unified conclusion about this. What if you have, e.g., 40 vs. 30 BLEU, 60% vs. 80% term match (exact or window), and conflicting TERm scores?
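
Just so we are on the same page, my rough reading of the window-overlap computation is something like the following sketch (my own interpretation, not the official implementation; please correct me if the paper defines it differently):

```python
# My rough reading of window overlap (window = 2), as a sanity check --
# not the official implementation, happy to be corrected.

def window_overlap(hyp_tokens, ref_tokens, term_tokens, n=2):
    """Locate the target term in hypothesis and reference, take n tokens on
    each side, and return the fraction of reference-window tokens that also
    appear in the hypothesis window."""
    def window(tokens):
        for i in range(len(tokens) - len(term_tokens) + 1):
            if tokens[i:i + len(term_tokens)] == term_tokens:
                left = tokens[max(0, i - n):i]
                right = tokens[i + len(term_tokens):i + len(term_tokens) + n]
                return left + right
        return None
    h, r = window(hyp_tokens), window(ref_tokens)
    if h is None or r is None:
        return 0.0
    return len(set(h) & set(r)) / max(len(set(r)), 1)

hyp = "please wear a protective face mask when indoors".split()
ref = "please wear a face mask when you are indoors".split()
print(window_overlap(hyp, ref, "face mask".split()))  # 0.5
```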

Thank you,

Alex Molchanov
PROMT LLC
On Friday, April 30, 2021 at 17:57:23 UTC+3, Philipp Koehn wrote:

Antonis Anastasopoulos

Jul 9, 2021, 12:59:31 PM
to wmt-...@googlegroups.com
Hi Alex,
these are indeed great questions! 
We are not planning on combining the scores. 
My (personal) view is that the TERm metric is in a sense a combination of the other three, but all of them provide meaningful and orthogonal information. 
There are more reasons for not combining the scores:
  1. In our paper we find that the metrics generally correlate with each other, so I would expect a system that is better in terms of BLEU to also do well in terms of window overlap, as you said.
  2. There are obvious ways in which the exact-match accuracy can be "cheated"; that's why we need the additional metrics.
For the final evaluation, we will take a Pareto-frontier approach across the metrics. That is, if there are two systems with "40 vs 30 BLEU and 60% vs 80% term match (exact or window) and some controversial TERm scores", as you said, then both systems would be a "winner", and we will discuss their pros and cons in the findings paper.
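
Concretely, by a Pareto-frontier approach I mean something along these lines (a rough illustrative sketch, not our exact evaluation code):

```python
# Rough illustration of Pareto-frontier selection over per-system metric
# tuples (all higher-is-better); not the exact evaluation code.

def pareto_frontier(systems):
    """systems: dict mapping system name -> tuple of metric scores.
    A system is dominated if some other system is >= on every metric and
    strictly > on at least one; the frontier is everything undominated."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [name for name, scores in systems.items()
            if not any(dominates(other, scores)
                       for other_name, other in systems.items()
                       if other_name != name)]

scores = {
    "system1": (40.0, 0.60),  # (BLEU, term exact-match accuracy)
    "system2": (30.0, 0.80),
    "system3": (29.0, 0.55),  # dominated by both of the above
}
print(pareto_frontier(scores))  # ['system1', 'system2'] -- both are "winners"
```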

Does that make sense?

Antonis

