Call for Participation: WMT 2023 Shared Task on Parallel Data Curation


Philipp Koehn

Jun 2, 2023, 5:03:51 PM
to moses-support, Priscilla Rasmussen


We introduce a new shared task that aims to evaluate parallel data curation methods. The goal of the task is to find the best MT training data within a provided pile of web-crawled data. We encourage submissions that address any aspect of: document alignment, sentence alignment, comparable corpora, bitext filtering, language ID, or related fields.

Important dates

  • Organizers release data: June 15, 2023
  • Submission deadline: September 1
  • System paper submission deadline: September 22
  • Organizers release final results: September 25
  • Camera-ready paper deadline: October 9
  • WMT Conference: December 6-7

All deadlines and release dates are Anywhere on Earth.


A Machine Translation system is only as good as its training data. The web provides vast amounts of translations that can be used as training data. The challenge is to find pairs of sentences or documents that are translations of each other, which can be used to train the best possible MT system. 

For this shared task, the organizers will provide

  1. Web-crawled data

  2. Intermediate outputs from a baseline pipeline, so participants can focus on specific aspects of the task

  3. MT training and evaluation scripts

The participants' task is to find the best possible set of training data within the provided web-crawled data to train a downstream MT model, using the provided model training scripts. Downstream MT performance will be judged using automatic MT metrics.

This shared task builds on prior shared tasks on document alignment (WMT 2016) and sentence filtering (WMT 2018-2020).

Participants may use only pre-trained models and datasets publicly released with a research-friendly license on or before May 1, 2023. All participants are required to submit a system description paper. 


We provide web-crawled data in Estonian-Lithuanian from a single recent snapshot of CommonCrawl. We chose this language pair to balance two goals: the data should be large enough to train a reasonable Estonian -> Lithuanian MT model, yet small enough to keep the task accessible to participants with limited compute. For the same reason, we also release pre-computed intermediate outputs from a baseline (e.g. LASER embeddings, sentence pairs from FAISS search, etc.), so participants can choose to focus on a single aspect of the task (e.g. sentence filtering).
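The mining step referenced above (sentence embeddings plus a FAISS nearest-neighbor search) can be sketched in plain numpy. The function below is an illustrative stand-in only, not the organizers' baseline: it assumes sentence embeddings are already computed (e.g. with LASER), uses a brute-force similarity matrix in place of a FAISS index, and the threshold value is arbitrary.

```python
import numpy as np

def mine_pairs(src_emb, tgt_emb, threshold=0.5):
    """Toy embedding-based bitext mining: for each source sentence
    embedding, find the most similar target embedding by cosine
    similarity and keep the pair if it clears a threshold.
    (A real pipeline would use a FAISS index instead of the full
    similarity matrix; this numpy version is for illustration.)"""
    # L2-normalize rows so that a dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                       # (n_src, n_tgt) similarities
    best = sims.argmax(axis=1)               # nearest target per source
    scores = sims[np.arange(len(src)), best]
    return [(i, int(j), float(s))
            for i, (j, s) in enumerate(zip(best, scores))
            if s >= threshold]
```

In practice the scores would feed a filtering step, where the threshold (or a margin-based criterion) trades recall for precision of the mined pairs.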


Organizers:

  • Tobias Domhan (Amazon)

  • Thamme Gowda (Microsoft)

  • Huda Khayrallah (Microsoft)

  • Philipp Koehn (Johns Hopkins University)

  • Steve Sloto (Microsoft)

  • Brian Thompson (Amazon)

To reach the organizers, please email:
To get updates about the shared task, please join this mailing list:

More information and data will be posted to this website:

Idris Abdulmumin

Jun 2, 2023, 5:18:50 PM
to moses-support, Priscilla Rasmussen

Sorry for asking here. The email provided in the call rejected my post.

We recently participated in a competition that required using a closed-access multilingual embedding model. We leveraged it to create parallel data alignments for a low-resource language.

Restricted access to that embedding model is provided free of charge for testing and research purposes. Can we use it to participate in this task?

Idris Abdulmumin
