We introduce a new shared task that aims to evaluate parallel data curation methods. The goal of the task is to find the best MT training data within a provided pile of web-crawled data. We encourage submissions that address any aspect of: document alignment, sentence alignment, comparable corpora, bitext filtering, language ID, or related fields.
All deadlines and release dates are Anywhere on Earth.
A Machine Translation system is only as good as its training data. The web provides vast amounts of translations that can be used as training data. The challenge is to find pairs of sentences or documents that are translations of each other, which can be used to train the best possible MT system.
For this shared task, the organizers will provide:
Web-crawled data
Intermediate outputs from the baseline, so that participants can focus on specific aspects of the task
MT training and evaluation scripts
The participants' task is to find the best possible set of training data within the provided web-crawled data to train a downstream MT model, using the provided model training scripts. Downstream MT performance will be judged using automatic MT metrics.
This shared task builds on prior shared tasks on document alignment (WMT 2016) and sentence filtering (WMT 2018-2020).
Participants may use only pre-trained models and datasets publicly released with a research-friendly license on or before May 1, 2023. All participants are required to submit a system description paper.
We provide web-crawled data in Estonian-Lithuanian from a single recent snapshot of CommonCrawl. We chose this language pair as a balance: the data is large enough to train a reasonable Estonian -> Lithuanian MT model, but small enough to keep the task accessible to participants with limited compute. For the same reason, we also release pre-computed intermediate outputs from a baseline (e.g. LASER embeddings, sentence pairs from FAISS search, etc.), so participants can choose to focus on one aspect of the task (e.g. sentence filtering).
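To illustrate the kind of step the intermediate outputs cover, here is a minimal sketch of mining candidate sentence pairs from sentence embeddings by nearest-neighbor cosine similarity. It is not the organizers' baseline: the function names (cosine, mine_pairs), the 0.8 threshold, and the two-dimensional toy vectors are all assumptions made for illustration, and the brute-force search stands in for an index-based search such as FAISS over real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_pairs(src_embs, tgt_embs, threshold=0.8):
    """For each source sentence, keep its nearest target sentence
    if their cosine similarity reaches the (hypothetical) threshold.

    Returns a list of (source_index, target_index, similarity) tuples.
    """
    pairs = []
    for i, u in enumerate(src_embs):
        best_j, best_sim = -1, -1.0
        # Brute-force search over all targets; a real pipeline would use an index.
        for j, v in enumerate(tgt_embs):
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j >= 0 and best_sim >= threshold:
            pairs.append((i, best_j, best_sim))
    return pairs


# Toy stand-ins for sentence embeddings (illustrative values only).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [-1.0, 0.0]]
pairs = mine_pairs(src, tgt, threshold=0.8)
```

With these toy vectors only the first source sentence finds a target above the threshold. A fixed threshold on raw cosine similarity is the simplest possible filter; participants working on sentence filtering would likely replace it with their own scoring.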
Tobias Domhan (Amazon)
Thamme Gowda (Microsoft)
Huda Khayrallah (Microsoft)
Philipp Koehn (Johns Hopkins University)
Steve Sloto (Microsoft)
Brian Thompson (Amazon)
To reach the organizers, please email:
wmt-data-tas...@googlegroups.com
To get updates about the shared task, please join this mailing list:
https://groups.google.com/g/wmt-data-task/
More information and data will be posted to this website: statmt.org/wmt23/data-task.html