Data sets for WMT16


Barry Haddow

Jan 18, 2016, 5:40:19 AM
to wmt-...@googlegroups.com
Hi All

The data sets for the WMT16 (news) translation task are now all
available, with the exception of the updated CzEng corpus. This will be
released before the end of January.

New and updated for this year:
Parallel
- Updated version of news-commentary (en-cs, en-de, en-ru)
- Sentence-split version of europarl (en-ro)
- New release of CzEng - coming soon (en-cs)
- SETIMES2 as a parallel data set (en-ro, en-tr)
- dev sets of news translations (en-ro, en-tr)

Monolingual
- Updated news-commentary (cs, de, en, ru)
- Common Crawl monolingual (all)
- news crawl from 2015 (all except tr)

All links to the data sets are on the website
http://www.statmt.org/wmt16/translation-task.html

The test data will be released on April 18th.

Note that the above represent the data sets for the *constrained*
version of the task. For the unconstrained version of the task you are
free to use whatever data sets you want. In tr-en, for example, the
constrained data set is quite small, but much larger resources are
available from OpenSubtitles. We hope to have an update of the
CommonCrawl parallel data for some of the language pairs reasonably
soon, but this will not be part of the constrained track.

best wishes
Barry


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Thanh-Le Ha

Jan 18, 2016, 9:12:29 AM
to wmt-...@googlegroups.com
Hi,

I cannot find the monolingual News Discussion corpus (English and French) in this year's corpora. Is it permitted data this year? Has that data changed in any way?

Thanks,
Thanh-Le.


Barry Haddow

Jan 18, 2016, 9:45:21 AM
to wmt-...@googlegroups.com
Hi Thanh-Le

Good point. There's no reason not to include it, so I have added it to the website.

We did not update the news-discuss corpus this year, although the crawler has been running.

cheers - Barry

Franck Brl

Jan 20, 2016, 12:23:07 PM
to Workshop on Statistical Machine Translation, bha...@inf.ed.ac.uk
Dear Barry,

I have looked at the SETIMES2 corpus provided by OPUS for en-ro.
It turns out that both sides (en and ro) are already tokenized, which
can be a problem, since all the other corpora are provided untokenized.
Would it be possible to have the untokenized version of SETIMES2?
Otherwise, could we know which tokenizers were used for this corpus?
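
For anyone who wants to check their own copy, here is a rough heuristic sketch in Python. It is not the tokenizer used for the corpus, and the function name is made up for illustration. It only assumes that tokenized text tends to have sentence punctuation separated from the preceding word by a space, while untokenized text has it attached.

```python
import re

# Punctuation glued to the preceding character, e.g. "word," (typical of raw text).
ATTACHED = re.compile(r"\S[,.;:!?](?=\s|$)")
# Punctuation standing alone after whitespace, e.g. "word ," (typical of tokenized text).
DETACHED = re.compile(r"(?<=\s)[,.;:!?](?=\s|$)")


def detached_punct_ratio(lines):
    """Return the share of sentence punctuation that is preceded by whitespace.

    Values near 1.0 suggest the text is already tokenized; values near 0.0
    suggest it is untokenized. Returns 0.0 if no punctuation is found.
    """
    attached = detached = 0
    for line in lines:
        attached += len(ATTACHED.findall(line))
        detached += len(DETACHED.findall(line))
    total = attached + detached
    return detached / total if total else 0.0
```

Running this over the first few thousand lines of each side of a corpus and comparing the ratio with that of a corpus known to be untokenized should be enough to spot a mismatch. It says nothing about which tokenizer was applied.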

Thank you in advance,
Best regards,
Franck

Jorg Tiedemann

Jan 20, 2016, 1:54:09 PM
to wmt-...@googlegroups.com, bha...@inf.ed.ac.uk

It’s true that the downloadable files used to be tokenized, and SETIMES2 was still in that format. I have regenerated the en-ro files and they should be untokenized now. I will do the same for the other language pairs in SETIMES2 as well (en-ro is already done). I hope that helps.


All the best,
Jörg


Franck Brl

Jan 20, 2016, 3:24:26 PM
to Workshop on Statistical Machine Translation, bha...@inf.ed.ac.uk
Dear Jörg,

Thank you! I just got the untokenized SETIMES2 (both en and ro).

Best regards,
Franck

Alexander Molchanov

Jan 27, 2016, 9:05:19 AM
to Workshop on Statistical Machine Translation, bha...@inf.ed.ac.uk
Hi Barry,

The NEWS task DOWNLOAD section says there is TR-EN data for the Yandex corpus, but there's only one link for RU-EN. Is this just a mistake, or will the link be published later?

Thanks,

Alex Molchanov


Barry Haddow

Jan 27, 2016, 9:25:39 AM
to wmt-...@googlegroups.com
Hi Alex

Yes, sorry, this is an error. The Yandex training corpus will not be available for the constrained tr-en task.

cheers - Barry