Preprocessed versions of WMT17 data


Barry Haddow

Mar 29, 2017, 4:07:44 AM
to wmt-...@googlegroups.com
Hi All

I have released preprocessed versions of all the WMT17 news task
training and dev data (except, so far, zh-en). They are available here:
http://data.statmt.org/wmt17/translation-task/preprocessed/

This data is distributed in the hope that it will be useful for the
task, and for future research. It uses a standard Moses preprocessing
pipeline, but of course for the task you are free to use your own
pipeline -- and encouraged to experiment with preprocessing if you think
it will help.
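
For reference, the per-language part of a standard Moses pipeline looks roughly like the sketch below; the paths, language pair and options are illustrative rather than the exact script used:

    # sketch of a standard Moses preprocessing pipeline (illustrative)
    MOSES=/path/to/mosesdecoder     # assumed checkout of the Moses scripts
    for l in en tr; do
        # normalise punctuation, then tokenise
        $MOSES/scripts/tokenizer/normalize-punctuation.perl -l $l < corpus.$l \
            | $MOSES/scripts/tokenizer/tokenizer.perl -l $l -threads 4 > corpus.tok.$l
        # train a truecasing model on the tokenised text, then apply it
        $MOSES/scripts/recaser/train-truecaser.perl --model tc-model.$l --corpus corpus.tok.$l
        $MOSES/scripts/recaser/truecase.perl --model tc-model.$l < corpus.tok.$l > corpus.tc.$l
    done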

Please mail this list if you find any issues with the data. I have not
yet built systems with this exact version of the data, although the
preprocessing is the same as the pipeline we use. Also, if you are
interested in other versions of the data (e.g. subsampled) then let me know,

cheers - Barry

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Ozan Çağlayan

Mar 29, 2017, 7:13:07 AM
to wmt-...@googlegroups.com
Hello,

In the TR-En prepare.sh script, the $year variable is never used; I think $testset should have been used for this pair, but it is not. As a result, the dev.tgz file is buggy: it contains a newstest.tc.en/newstest.tc.tr pair whose filenames carry neither the year nor the correct dev/test information.
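
To illustrate the pattern I mean (hypothetical lines and values, since I am only guessing at the script internals):

    # hypothetical sketch of the suspected bug in prepare.sh
    testset=newstest2016     # dev/test set name for this pair (hypothetical value)
    year=2016                # defined but apparently never used

    # what seems to happen: the set name is dropped from the output file
    #   cp dev.tc.en newstest.tc.en
    # what was presumably intended, keeping the set info in the filename:
    cp dev.tc.en $testset.tc.en    # -> newstest2016.tc.en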

Thanks!

Barry Haddow

Mar 29, 2017, 7:29:29 AM
to wmt-...@googlegroups.com

Hi Ozan

Thanks, that should be fixed now. I also think the tr-en corpus was misaligned, but that should be fixed as well,

cheers - Barry


Ozan Çağlayan

Mar 29, 2017, 2:39:24 PM
to wmt-...@googlegroups.com
Hello,

Do you mean it was misaligned in these new preprocessed tarballs, or in the original SETIMES files?

Thanks.

Barry Haddow

Mar 29, 2017, 2:49:02 PM
to wmt-...@googlegroups.com

It was misaligned in the preprocessed data.



Ergun Bicici

Mar 29, 2017, 5:16:13 PM
to wmt-...@googlegroups.com

Hi Barry,

The SETIMES dataset for tr-en appears to be aligned; however, it may have been updated since WMT 2016, because I am now obtaining a different number of sentences for tr-en: 32 sentences more.
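
The check is nothing more sophisticated than comparing line counts with the ones I recorded last year, e.g.:

    # the two sides must have identical counts, and they should match WMT 2016
    wc -l corpus.tc.en corpus.tc.tr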

Are you also planning to prepare something for zh-en?

Best Regards,
Ergun

Ergun Biçici

Barry Haddow

Mar 29, 2017, 5:33:03 PM
to wmt-...@googlegroups.com
Hi Ergun

Not sure where the 32 sentence difference comes from.

I will release zh-en at some point, but probably not in time to be useful for the task itself. I do not have a settled pre-processing pipeline for
Chinese yet,

cheers - Barry

Ergun Bicici

Mar 29, 2017, 5:39:21 PM
to wmt-...@googlegroups.com

The difference is relative to the number of sentences I obtained in WMT 2016.

I have them all prepared and waiting for May 2:
Test data released: May 2, 2017
Translation submission deadline: May 8, 2017


Best Regards,
Ergun

Ergun Biçici



Maksym Del

Mar 29, 2017, 10:00:28 PM
to wmt-...@googlegroups.com
Hi Barry,

It seems like the Ru-En data is misaligned. Compare, for example, the output of tail corpus.tc.en with that of tail corpus.tc.ru and you will see it (no knowledge of Russian is needed; see the commands below).

There are also some duplicate sentences and noise in the data, but that might be due to the low quality of some of the data sources. It might also be partially fixed by restoring proper alignment.
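
For reference, the checks above are just standard shell tools run on the files from the release:

    # put the last lines of both sides next to each other; with aligned
    # data each row should be a translation pair
    paste <(tail corpus.tc.en) <(tail corpus.tc.ru) | less -S

    # rough duplicate check: list sentences occurring more than once
    sort corpus.tc.en | uniq -cd | sort -rn | head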

Best,
Maksym

Barry Haddow

Mar 30, 2017, 7:28:18 AM
to wmt-...@googlegroups.com

Hi Maksym

Thanks for pointing that out. I have identified the problem and I'm regenerating the ru-en data.

Yes, there will be noise in the original data, and the only filtering I do is length-based. I tried to keep the pre-processing fairly 'light touch'.
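
By length-based filtering I mean the standard Moses cleaning step, along these lines (the bounds here are illustrative):

    # drop pairs that are empty, longer than 80 tokens, or have an extreme
    # length ratio between the two sides
    $MOSES/scripts/training/clean-corpus-n.perl corpus.tc en ru corpus.clean 1 80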

cheers - Barry

Maksym Del

Mar 31, 2017, 1:01:27 AM
to wmt-...@googlegroups.com
Hi Barry

Ru-En data looks fine to me now, thank you for your contribution.

Best,
Maksym Del