Enclosed please find a copy of the final version of the overview paper.
We have also set the schedule for the workshop. March 30th is dedicated
to the shared task, with poster presentations of all short papers and an
invited talk by Martin Kay. March 31st features 12 full paper oral
presentations, each lasting 30 minutes.
Looking forward to seeing you in Greece in 6 weeks!
Regards,
Philipp Koehn
The WMT 2009 dataset has rank-based annotations that
aren't directly comparable across sentences. I may be
wrong, but how to compute a sentence-level correlation
for this type of data seems more like a research question
than a straightforward application of existing scripts.
One possibility that comes to my mind is to correlate pairwise
differences in human-annotated ranks with pairwise differences
in metric-predicted scores. Of course, for some sentences
the differences might be larger than for others, but it might
be a first step...
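
Just to make that a bit more concrete, a rough Python sketch of the idea
(untested; human_ranks and metric_scores are placeholder dicts mapping a
sentence id to one value per MT hypothesis, in the same order):

# Correlate within-sentence pairwise rank differences with
# within-sentence pairwise metric score differences.
from itertools import combinations
from scipy.stats import pearsonr

def pairwise_diff_correlation(human_ranks, metric_scores):
    rank_diffs, score_diffs = [], []
    for sent_id, ranks in human_ranks.items():
        scores = metric_scores[sent_id]
        for i, j in combinations(range(len(ranks)), 2):
            # negate rank differences so that, like score differences,
            # a positive value means hypothesis i is preferred
            # (assuming rank 1 = best, higher metric score = better)
            rank_diffs.append(-(ranks[i] - ranks[j]))
            score_diffs.append(scores[i] - scores[j])
    return pearsonr(rank_diffs, score_diffs)[0]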
Sebastian
> Of course, for some sentences the differences might be
> larger than for others
For some sentences, the real quality differences are
larger than for others, but this is not reflected in
the ranks. Unless the metrics turn their predictions
into ranks as well, scaling is bound to be a confounding
factor.
Sebastian
I don't think so. There's a crucial difference in whether the scores
are converted into ranks at the sentence level or globally.
I proposed the former, but Spearman correlations do the latter.
Assume the following situation.
Sentence A, MT Hypotheses A1, A2, A3.
Sentence B, MT Hypotheses B1, B2, B3.
All MT hypotheses for A are very good (but still distinguishable).
(Absolute scores would be A1: 5, A2: 6, A3: 7.)
Ranks are A1: 1, A2: 2, A3: 3.
The translations for B run the gamut from abysmal to great.
(Absolute scores would be B1: 1, B2: 4, B3: 7.)
Ranks are B1: 1, B2: 2, B3: 3.
A good system for the prediction of absolute scores would presumably
predict a better score for all A hypotheses than for all B hypotheses.
A1: 0.8; A2: 0.85; A3: 0.9
B1: 0.1; B2: 0.3; B3: 0.7.
If you turn those predictions into ranks globally (which is what happens if you
just throw those numbers into a Spearman formula), all As will outrank all Bs.
This is not true in your gold ranks, and your correlation will be really bad.
But if you turn the predictions into ranks at the sentence level, like the
gold ranks -- I feel that might be worth looking at. You'd still get lots of
ties -- but you get those with absolute scores, too.
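
To illustrate with the toy numbers above (a small sketch using scipy; the
per-sentence regrouping is hard-coded just for this example):

# Global vs. sentence-level ranking of the predictions (A1..A3, then B1..B3).
from scipy.stats import spearmanr, rankdata

gold_ranks  = [1, 2, 3, 1, 2, 3]               # per-sentence gold ranks
predictions = [0.8, 0.85, 0.9, 0.1, 0.3, 0.7]  # metric scores

# Global: Spearman ranks the predictions across all six hypotheses,
# so every A ends up above every B, unlike in the gold ranks.
global_rho, _ = spearmanr(gold_ranks, predictions)

# Sentence-level: rank the predictions within each sentence first.
within = list(rankdata(predictions[:3])) + list(rankdata(predictions[3:]))
sentence_rho, _ = spearmanr(gold_ranks, within)

print(global_rho, sentence_rho)   # the sentence-level variant comes out higher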
> I think the straightforward thing to do would
> simply be to look at all of the pairwise human preference judgments
> (on segments), tossing out those that are ties, and see how many of
> those judgments the evaluation metrics agree with, so you'd get a
> percentage score for each metric for each language pair.
Those percentages were actually reported in the WMT 2008 overview paper for
last year's shared evaluation task (under the name "consistency"). Is there a
particular reason why they weren't reported this year?
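
In case it helps, that number is cheap to recompute from the raw judgments;
a sketch (with a hypothetical list of per-pair tuples) could look like this:

# "Consistency": share of non-tied pairwise human judgments a metric agrees
# with. 'pairs' is a hypothetical list of (human_rank_x, human_rank_y,
# metric_score_x, metric_score_y) tuples, each for two hypotheses of the
# same source sentence.
def consistency(pairs):
    agree = total = 0
    for rank_x, rank_y, score_x, score_y in pairs:
        if rank_x == rank_y:        # human tie: tossed out
            continue
        total += 1
        human_prefers_x = rank_x < rank_y     # lower rank = better
        metric_prefers_x = score_x > score_y  # higher score = better
        if human_prefers_x == metric_prefers_x:
            agree += 1
    return float(agree) / total if total else 0.0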
Sebastian