Overview paper, workshop arrangements

Philipp Koehn

Feb 20, 2009, 2:07:46 PM
to Fourth Workshop on Statistical Machine Translation (WMT09)
Hi,

enclosed please find a copy of the final version of the overview paper.

We have also set the schedule for the workshop. March 30th is dedicated
to the shared task, with poster presentations of all short papers and an
invited talk by Martin Kay. March 31st features 12 full paper oral
presentations, each lasting 30 minutes.

Looking forward to seeing you in Greece in 6 weeks!

Regards,
Philipp Koehn

paper.pdf

Matthew Snover

Feb 20, 2009, 2:49:22 PM
to WM...@googlegroups.com, Matthew Snover
Was any analysis of the evaluation metrics done at anything other than
the system level? System-level correlations, where you have between 5
and 21 data points, aren't really that useful. For example, the 95%
confidence interval of a correlation of 0.93 with 21 data points -- as
is the case for METEOR on FR-EN -- is 0.833 - 0.971, making that
correlation statistically indistinguishable from the correlations of
all but one of the other metrics (since only one metric had a
correlation below 0.83 on that data set).
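
(For concreteness, a quick sketch of where an interval like that comes from,
assuming the quoted range is the usual Fisher z-transform confidence interval
for a correlation coefficient; the function and variable names are just
illustrative.)

    import math

    def correlation_ci(r, n):
        # 95% confidence interval for a correlation via the Fisher z-transform
        z = math.atanh(r)                # Fisher transform of r
        se = 1.0 / math.sqrt(n - 3)      # standard error of z
        lo, hi = z - 1.96 * se, z + 1.96 * se
        return math.tanh(lo), math.tanh(hi)   # back-transform to the r scale

    print(correlation_ci(0.93, 21))      # roughly (0.83, 0.97)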

--Matt

Chris Callison-Burch

Feb 20, 2009, 5:38:32 PM
to WM...@googlegroups.com, Matthew Snover
Matt,

Thanks for your follow-up. I did not report my analysis of the
automatic metrics at the sentence level in the overview paper, since
my initial experiments did not show anything interesting. I will make
the complete set of data available to anyone who would like to do
their own analysis.

Your point about the confidence interval for French-English is
correct; the range of correlation numbers there is quite tight. All
the metrics, with the exception of one, had correlation coefficients
ranging between .83 and .93, so it doesn't really matter which metric
you use to rank our French-English systems. German-English (also with
21 systems) had a much wider range of .38 to .78. In that case,
choosing an appropriate metric is important.

--Chris

Alon

Feb 20, 2009, 5:42:06 PM
to Fourth Workshop on Statistical Machine Translation (WMT09)
Matt is right about this. I believe I brought this up last
year... I realize you guys did a tremendous amount of work on this,
and you can't do everything. But as Matt points out, system-level
correlations with human judgments really don't shed much light on
automatic metrics. It should not be that difficult to calculate
segment-level or document-level correlations. NIST has software for
doing this, which was used for the MetricsMATR workshop, and they have
been happy to share it... It would be great if these could be
reported at the workshop in Athens!

- Alon

Sebastian Pado

Feb 21, 2009, 12:24:48 AM
to WM...@googlegroups.com, Fourth Workshop on Statistical Machine Translation (WMT09)

> It should not be that difficult to calculate
> segment-level or document-level correlations.

The WMT 2009 dataset has rank-based annotations that
aren't directly comparable across sentences. I may be
wrong, but how to compute a sentence-level correlation
for this type of data seems more like a research question
than a straightforward application of existing scripts.

One possibility that comes to my mind is to correlate pairwise
differences in human-annotated ranks with pairwise differences
in metric-predicted scores. Of course, for some sentences
the differences might be larger than for others, but it might
be a first step...
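
(A rough sketch of what I mean, in Python. The input format is hypothetical,
and scipy's pearsonr stands in for whatever correlation one would actually use;
the sign of the result depends on whether rank 1 means best or worst.)

    import itertools
    from scipy.stats import pearsonr

    def pairwise_difference_correlation(human_ranks, metric_scores):
        # human_ranks[sentence][system]   -> human-annotated rank    (hypothetical format)
        # metric_scores[sentence][system] -> metric-predicted score  (hypothetical format)
        rank_diffs, score_diffs = [], []
        for sent, ranks in human_ranks.items():
            systems = [s for s in ranks if s in metric_scores.get(sent, {})]
            for a, b in itertools.combinations(systems, 2):
                rank_diffs.append(ranks[a] - ranks[b])
                score_diffs.append(metric_scores[sent][a] - metric_scores[sent][b])
        r, _ = pearsonr(rank_diffs, score_diffs)
        return r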

Sebastian

Sebastian Pado

Feb 21, 2009, 12:30:57 AM
to WM...@googlegroups.com
Clarification:

> Of course, for some sentences the differences might be
> larger than for others

For some sentences, the real quality differences are
larger than for others, but this is not reflected in
the ranks. Unless the metrics turn their predictions
into ranks as well, scaling is bound to be a confounding
factor.

Sebastian

Matthew Snover

Feb 21, 2009, 1:53:57 PM
to WM...@googlegroups.com, Matthew Snover
Sebastian,

That would be an issue with Pearson correlations, but Spearman
correlations already convert the scores to ranks, so it wouldn't matter.

But, really, the data doesn't lend itself to a correlation analysis at
the segment level. I think the straightforward thing to do would
simply be to look at all of the pairwise human preference judgments
(on segments), tossing out those that are ties, and see how many of
those judgments the evaluation metrics agree with, so you'd get a
percentage score for each metric for each language pair. You couldn't
analyze this in terms of correlations, but it would give those of us
working on metrics a better idea of how the metrics differed and performed.
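
(Something along these lines -- the data format is hypothetical, but the
counting itself is straightforward.)

    def pairwise_agreement(human_prefs, metric_scores):
        # human_prefs: iterable of (segment, sys_a, sys_b, winner), with winner
        #              in {sys_a, sys_b, "tie"}                (hypothetical format)
        # metric_scores[(segment, system)]: metric score, higher = better
        agree = total = 0
        for seg, a, b, winner in human_prefs:
            if winner == "tie":
                continue                 # ties are tossed out
            total += 1
            metric_pick = a if metric_scores[(seg, a)] > metric_scores[(seg, b)] else b
            if metric_pick == winner:
                agree += 1
        return agree / total if total else float("nan")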

A lot of information is lost when doing Spearman correlations,
especially at the system level, so it's hard to draw any real
conclusions from those numbers. I think the segment-level numbers
could be really informative.

--Matt

Sebastian Pado

Feb 21, 2009, 5:07:10 PM
to WM...@googlegroups.com
> That would be an issue with Pearson correlations, but Spearman
> correlations are already converted to ranks, so it wouldn't matter.

I don't think so. There's a crucial difference in whether the scores
are converted into ranks at the sentence level or globally.
I proposed the former, but Spearman correlations do the latter.

Assume the following situation.

Sentence A, MT Hypotheses A1, A2, A3.
Sentence B, MT Hypotheses B1, B2, B3.

All MT hypotheses for A are very good (but still distinguishable).
(Absolute scores would be A1: 5, A2: 6, A3: 7.)
Ranks are A1: 1, A2: 2, A3: 3.

The translations for B run the gamut from abysmal to great.
(Absolute scores would be B1: 1, B2: 4, B3: 7).
Ranks are B1: 1, B2: 2, B3: 3.

A good system for the prediction of absolute scores would presumably
predict a better score for all A hypotheses than for all B hypotheses.

A1: 0.8; A2: 0.85; A3: 0.9
B1: 0.1; B2: 0.3; B3: 0.7.

If you turn those predictions into ranks globally (which is what happens if you
just throw those numbers into a Spearman formula), all As will outrank all Bs.
This is not true in your gold ranks, and your correlation will be really bad.

But if you turn predictions into ranks at the sentence level -- I feel
that might be worth looking at. You'd still get lots of ties -- but you get
those with absolute scores, too.
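
(To make the difference concrete with the toy numbers above; scipy's spearmanr
stands in for "the Spearman formula", and the code is only a sketch.)

    from scipy.stats import spearmanr

    gold_ranks = {"A": {"A1": 1, "A2": 2, "A3": 3},
                  "B": {"B1": 1, "B2": 2, "B3": 3}}
    predicted  = {"A": {"A1": 0.8, "A2": 0.85, "A3": 0.9},
                  "B": {"B1": 0.1, "B2": 0.3, "B3": 0.7}}

    sents = ["A", "B"]
    gold_flat = [gold_ranks[s][h] for s in sents for h in sorted(gold_ranks[s])]
    pred_flat = [predicted[s][h]  for s in sents for h in sorted(predicted[s])]

    # Global ranking: all As outrank all Bs, so the correlation drops
    print("global rho:", spearmanr(gold_flat, pred_flat)[0])        # ~0.48

    # Per-sentence ranking: both sentences are ranked perfectly
    per_sent = [spearmanr([gold_ranks[s][h] for h in sorted(gold_ranks[s])],
                          [predicted[s][h]  for h in sorted(gold_ranks[s])])[0]
                for s in sents]
    print("mean per-sentence rho:", sum(per_sent) / len(per_sent))  # 1.0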

> I think the straightforward thing to do would
> simply be to look at all of the pairwise human preference judgments
> (on segments), tossing out those that are ties, and see how many of
> those judgments the evaluation metrics agree with, so you'd get a
> percentage score for each metric for each language pair.

Those percentages were actually in the WMT 2008 overview paper for last year's
shared evaluation task (called "consistency"). Is there a particular reason why
they weren't reported this year?

Sebastian
