WMT09 shared evaluation task results

Chris Callison-Burch

Feb 20, 2009, 11:17:49 AM
to WM...@googlegroups.com
I have attached two tables containing the Spearman's rank correlation
coefficient numbers for the automatic evaluation metrics that were
submitted to WMT09. These tables show how well each metric did at
predicting the human judgments of translation quality.
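
In case it's useful for anyone reproducing the numbers, the system-level
correlation boils down to the standard Spearman rank-correlation formula.
Here is a minimal sketch in Python (variable names are illustrative; this is
not the actual scoring script):

def spearman_rho(metric_scores, human_scores):
    # Spearman's rank correlation between two lists of system-level scores
    # (no correction for ties).
    n = len(metric_scores)
    def ranks(scores):
        # rank 1 = highest score
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    d = [a - b for a, b in zip(ranks(metric_scores), ranks(human_scores))]
    return 1 - 6 * sum(di * di for di in d) / (n * (n * n - 1))

# e.g. five systems, one metric score and one human score per system
print(spearman_rho([0.31, 0.28, 0.25, 0.22, 0.30],
                   [0.60, 0.55, 0.40, 0.35, 0.58]))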

Also, participants in the shared evaluation task might be interested
in the Z-MERT software developed by Omar Zaidan (one of the students
in my lab at Hopkins):
http://www.cs.jhu.edu/~ozaidan/zmert/
It performs minimum error rate training for SMT systems, and has a
modular design that allows you to easily use your own automatic
evaluation metric as the objective function. Omar even put together a
YouTube tutorial on how to integrate your own metric:
http://www.youtube.com/watch?v=Yr56pD8bTUc&fmt=18

I'm looking forward to seeing everyone in Athens.

All the best,
Chris Callison-Burch

system-level-correlation-out-of-English.pdf
system-level-correlation-into-English.pdf

Chris Callison-Burch

Feb 23, 2009, 12:00:20 PM
to WM...@googlegroups.com
Following on from the discussion about the merits of system-level v.
sentence-level analysis of the automatic evaluation metrics, I have
attached tables showing how consistent each metric is at predicting
sentence-level judgments.

I define consistency over every pair of sentences that was assessed in
the manual evaluation. A metric is consistent if the human
judgment was A > B and the metric score was A > B, or if the human
judgment was A < B and the metric score was A < B. I ignore
situations where the human judgment was A = B, and I was careful about
reversing the polarity of scores for error metrics like TER.

Tables 10 and 11 show the percent of time that each metric was
consistent with the human judgments for all of the language pairs.
Since we're only assessing whether A > B or B > A, the random-choice
baseline is 50%. The tables show that the task is very difficult,
because many of the metrics underperform the chance baseline.
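
Roughly, the calculation looks like this (a simplified sketch rather than the
actual scoring script; note that in this sketch a tie in the metric scores
counts as inconsistent):

def consistency(pairs, lower_is_better=False):
    # pairs: (human_a, human_b, metric_a, metric_b) for each judged sentence pair
    agree = total = 0
    for human_a, human_b, metric_a, metric_b in pairs:
        if human_a == human_b:
            continue                       # ties in the human judgment are ignored
        if lower_is_better:                # flip the polarity of error metrics like TER
            metric_a, metric_b = -metric_a, -metric_b
        total += 1
        if human_a > human_b and metric_a > metric_b:
            agree += 1
        elif human_a < human_b and metric_a < metric_b:
            agree += 1
    return agree / total                   # the chance baseline is 0.50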

sentence-level-consistency-into-English.pdf
sentence-level-consistency-out-of-English.pdf
system-level-scores-as-sentence-level-scores-out-of-English.pdf
system-level-scores-as-sentence-level-scores-into-English.pdf

Matthew Snover

Feb 23, 2009, 12:16:50 PM
to WM...@googlegroups.com, Matthew Snover
Thanks, Chris. This is really interesting and valuable! It also
clearly indicates that the metrics are pretty poor at making
segment-by-segment binary preference judgments.

I have a feeling this might be because most metrics get most of their
good segment- or document-level correlation from identifying which
segments or documents are difficult, and that by making these binary
preference judgments we're effectively removing the effect of segment
or document difficulty.

Very interesting. Thanks!

One quick question:

From the paper, it sounds like when you combined the preference
judgments to get an overall score for a system, you combined them
uniformly; that is, you didn't weight sentences by their length, so
doing better on a very short segment was treated the same as doing
better on a very long segment. Is that correct?
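
Just so I'm sure I understand the distinction, here is a toy sketch of the
two aggregation schemes I have in mind (illustrative only, with made-up
numbers, not your actual code):

wins        = [1, 0, 1, 1]       # 1 = the system won the preference judgment on that segment
seg_lengths = [5, 40, 12, 3]     # segment lengths in words

# uniform combination: every segment counts the same
uniform = sum(wins) / len(wins)

# length-weighted alternative: wins on longer segments count for more
weighted = sum(w * n for w, n in zip(wins, seg_lengths)) / sum(seg_lengths)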

--Matt
> <sentence-level-consistency-into-English.pdf>
> <sentence-level-consistency-out-of-English.pdf>
>
>
> I have also attached two tables (Tables 12 and 13) that show what
> would happen if each metric posited its system-level score in place of
> its segment-level scores. That is, instead of assigning a different
> metric score for every sentence, we assign Google's system-level score
> to every sentence translated by Google, and Edinburgh's system-level
> score to every sentence translated by Edinburgh, etc. In this case
> the consistency is much better. To me, this indicates that metrics
> ought to incorporate a prior into their sentence-level scores, which
> could be based on how well a metric did on the entire test set.
>
>
> <system-level-scores-as-sentence-level-scores-out-of-English.pdf>
> <system-level-scores-as-sentence-level-scores-into-English.pdf>
>
>
> Let me know if you have any questions. Also, I would like to extend a
> big thanks to Sebastian Pado, who spent several hours with me on Skype
> yesterday working through the logic of my scoring scripts, and
> independently verifying the results using his own scripts.
>
> Yours,
> Chris
