Hi all,
The data sets for the WMT16 (news) translation task are now all
available, with the exception of the updated CzEng corpus, which will be
released before the end of January.
New and updated for this year:
Parallel
- Updated version of news-commentary (en-cs, en-de, en-ru)
- Sentence-split version of europarl (en-ro)
- New release of CzEng - coming soon (en-cs)
- SETIMES2 as a parallel data set (en-ro, en-tr)
- dev sets of news translations (en-ro, en-tr)
Monolingual
- Updated news-commentary (cs, de, en, ru)
- Common Crawl monolingual (all)
- News Crawl from 2015 (all except tr)
All links to the data sets are on the website:
http://www.statmt.org/wmt16/translation-task.html
The test data will be released on April 18th.
Note that the data sets listed above are those for the *constrained*
version of the task. For the unconstrained version of the task you are
free to use whatever data sets you want. In tr-en, for example, the
constrained data set is quite small, but much larger resources are
available from OpenSubtitles. We hope to have an update of the
CommonCrawl parallel data for some of the language pairs reasonably
soon, but this will not be part of the constrained track.
Best wishes,
Barry