WMT 2010 Shared Task: restrict controlled data to free corpora?

Alexander Fraser

Dec 4, 2009, 4:25:26 AM

to wmt09

Hi Philipp and other WMT09 participants,

Great to see another shared task coming up.

However, I wonder if promoting the use of non-free corpora in the 2010
shared task is not counter-productive? (See WMT10 web page)

The 2nd edition French Gigaword corpus, for instance, costs $4000.00
in 2009 for sites that are not LDC members.

When we initially discussed the idea of a shared task years ago, two
of the main ideas were:

(1) lower the barrier to participation in the shared task to sites
that might not have an LDC membership

(2) try to make the experiments repeatable later by anyone at any time

The latter is not true for the NIST MT evaluation, for instance, which
has the effect of reducing participation (since sites which do not
already have access to the data have to wait until the evaluation
license is available, not leaving much time to build a competitive
system), and making it difficult to repeat experiments later (since
the evaluation license is not available at times when there is no NIST
task).

Up until now the "free corpora only" policy for the WMT shared tasks
has been quite successful, and in my opinion it should be continued.

Cheers, Alex

On Fri, Dec 4, 2009 at 9:00 AM, Philipp Koehn <pko...@inf.ed.ac.uk> wrote:
> ACL 2010 FIFTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION
> Shared Task: Machine Translation for European Languages
> July 15-16, in conjunction with ACL 2010 in Uppsala, Sweden
>
> http://www.statmt.org/wmt10/translation-task.html
>
> As part of the ACL WMT 2010 workshop, as in previous years, we organize
> a shared task on machine translation between European language pairs.
>
> Translation quality will be evaluated on an unseen test set of news stories.
> We provide a parallel corpus as training data, a baseline system, and additional
> resources for download. Participants may augment the baseline system or use
> their own system.
>
> The goals of the shared translation task are:
> * To investigate the applicability of current MT techniques when translating
> into languages other than English
> * To examine special challenges in translating between European languages,
> including word order differences and morphology
> * To create publicly available corpora for machine translation and machine
> translation evaluation
> * To generate up-to-date performance numbers for European languages in
> order to provide a basis of comparison in future research
> * To offer newcomers a smooth start with hands-on experience in state-of-the-art
> statistical machine translation methods
>
> We hope that both beginners and established research groups will participate
> in this task.
>
> You may participate in any or all of the following language pairs:
> * French-English
> * Spanish-English
> * German-English
> * Czech-English
>
> For all language pairs we will test translation in both directions. To have a
> common framework that allows for comparable results, and also to lower
> the barrier to entry, we provide a common training set and baseline system.
>
> Dates
> * December 4: Training data released
> * February 15: Test data released (available on this web site)
> * February 19: Results submissions
> * March 26: Short paper submissions (4 pages)
>
> Organizers
> * Chris Callison-Burch (Johns Hopkins University)
> * Philipp Koehn (University of Edinburgh)
> * Christof Monz (University of Amsterdam)
> * Kay Peterson (NIST)
>

Philipp Koehn

Dec 4, 2009, 5:35:54 AM

to wm...@googlegroups.com

Hi,

thanks - this is a good point and worthy of discussion.

There were clearly some who were pushing for including the LDC sets,
but on the other hand the news sets that we are making available are
pretty large.

-phi

Chris Callison-Burch

Dec 4, 2009, 8:40:37 AM

to wm...@googlegroups.com

One compromise option might be for us to release LMs trained on the
LDC data for sites that don't have access to those corpora themselves.

--C

Alexander Fraser

Dec 7, 2009, 8:25:09 AM

to wm...@googlegroups.com

Hi Chris,

Do you have results showing that these corpora will make a difference
(not between constrained systems, but instead between non-constrained
and constrained systems)? If not, why else would you want to include
them?

Your suggestion probably requires that all sites use a standard
tokenization/casing solution.

Cheers, Alex

Holger Schwenk

Dec 7, 2009, 8:44:20 AM

to wm...@googlegroups.com

Hi all,

In fact it was me who suggested that those corpora be included in the
list. I assumed that all the sites are members of LDC and that they have
those corpora anyway (possibly not the very latest version), and to the
best of my knowledge they were already used in many systems in WMT'09.

In the past, the Gigaword corpora were pretty useful for language
modeling and decreased the perplexity substantially (these are
newspaper texts and we build news systems for WMT). I can try to provide
a number if this is still useful in comparison to the provided
monolingual corpora.

Holger

Chris Callison-Burch

Dec 7, 2009, 12:50:56 PM

to wm...@googlegroups.com

Hi Everyone,

Let's settle this democratically. Please vote whether you want to
include the Gigaword corpora as an allowed resource in WMT10:

http://doodle.com/z695ambbn2nr8hna

--Chris
