Machine translation research has traditionally placed an outsized focus on a limited number of languages, mostly belonging to the Indo-European family. Progress for many languages, some with millions of speakers, has been held back by data scarcity. An inspiring recent trend has been the increased attention paid to low-resource languages. However, these modelling efforts have been hindered by the lack of high-quality, standardised evaluation benchmarks.
For the second edition of the Large-Scale MT shared task, we aim to bring the community together on the topic of machine translation for a set of 24 African languages. We do so by introducing a high-quality benchmark, paired with a fair and rigorous evaluation procedure.
Task Description
The shared task will consist of three tracks.
- The Data track focuses on the contribution of novel corpora. Participants may submit monolingual, bilingual or multilingual datasets relevant to the training of MT models for this year’s set of languages. Further information on this track is available below.
- Two translation tracks will evaluate the performance of translation models covering all of this year’s languages. Translation will be evaluated to and from English and French as well as to/from select African languages within particular geographical/cultural clusters:
  - In the Constrained Translation track, only the data listed on this page will be allowed, including submissions accepted to the Data track. The use of open source pre-trained models will be permitted, provided that they are published before the Data track submission deadline.
  - In the Unconstrained Translation track, no restrictions will be made on the use of data or models.
ABSA Chair of Data Science, Assoc. Professor
Department of Computer Science, University of Pretoria