Fwd: WMT09 shared evaluation task results

2 views

Skip to first unread message

Matthew Snover

unread,

Feb 23, 2009, 1:03:48 PM2/23/09

to WM...@googlegroups.com, Gregor Leusch, Matthew Snover

From Gregor, because he still cannot post to the list.

Gregor, I think you are correct, segment level Tau should be perfectly
usable in this case and would be very enlightening.

--Matt

Begin forwarded message:

> From: Gregor Leusch <leu...@i6.informatik.rwth-aachen.de>
> Date: February 23, 2009 12:58:44 PM EST
> To: Matthew Snover <sno...@cs.umd.edu>, Chris Callison-Burch <c...@cs.jhu.edu
> >
> Cc: WM...@googlegroups.com, Gregor Leusch <leu...@filmstudio.rwth-aachen.de
> >
> Subject: Re: WMT09 shared evaluation task results
>
> Chris, Matt,
>
> I am adressing you directly, because I still can not post to the
> mailing list. Please feel free to forward this mail to the list.
>
> ***
>
> Chris, In how far is your "pairwise segment ranking coefficient"
> different from Kendall's tau (or more precisely, an arithmetic
> average over all the source-segment-wise tau)?
>
> An averaged sentence-wise Tau should be ideal to remove the effect
> of "difficult" vs. "simple" source sentence, and only assess the
> ability of a metric to rank "good" vs "bad" MT systems on the
> sentence level.
>
> I am asking because back in 2005 I ran a couple of experiments
> comparing Pearson's r with Kendall's tau (or \bar{\tau} in this case).
> There did not seem to be any significant differences in the
> "ranking" of different evaluation measures, and/or different
> settings of evaluation measures, with regard to r as opposed to tau,
> so I later only used r (because confidence ranges were easier to
> obtain there).
>
> I have attached a couple of results on NIST 2003 and NIST 2004, as
> well as IWSLT 2004 data.
>
> If you are interested, I could try and "warm up" my software from
> 2005 (for tau, basically a couple of R scripts).
>
> Best,
>
> Gregor
>
>
>
> On Mon, 23 Feb 2009, Matthew Snover wrote:
>
>>
>> Thanks Chris. This is really interesting and valuable! It also
>> clearly indicates that the metrics are pretty poor at doing the
>> segment by segment binary preference judgments.
>>
>> I have a feeling this might be because most metrics get most of their
>> good segment or document correlation from identifying those segments
>> or documents that are difficult, and that by doing this binary
>> preference judgment, we're effectively removing the effect of segment
>> or document difficulty.
>>
>> Very interesting. Thanks!
>>
>> One quick question:
>>
>> From the paper, it sounds like when you combined the preference
>> judgments to get an overall score for a system, you combined them
>> uniformly, that is you didn't weight sentences by their length, so
>> doing better on a very short segment was treated the same as doing
>> better on a very long segment. Is that correct?
>>
>> --Matt
>>
>> On Feb 23, 2009, at 12:00 PM, Chris Callison-Burch wrote:
>>
>>> Following on from the discussion about the merits of system-level v.
>>> sentence-level analysis of the automatic evaluation metrics, I have
>>> attached tables showing how consistent each metric is at predicting
>>> sentence-level judgments.
>>>
>>> I define consistency over every pair of sentences that was
>>> assessed in
>>> the the manual evaluation. A metric is consistent if the human
>>> judgment was A > B and the metric score was A > B, or if the human
>>> judgment was A < B and the metric score was A < B. I ignore
>>> situations where the human judgment was A = B, and I was careful
>>> about
>>> reversing the polarity of scores for error metrics like TER.
>>>
>>> Tables 10 and 11 show the percent of time that each metric was
>>> consistent with the human judgments for all of the language pairs.
>>> Since we're only assessing whether A > B or B > A, the random-choice
>>> baseline is 50%. The tables show that the task is very difficult,
>>> because many of the metrics underperform the chance baseline.
>>>
>>>>
>>> <sentence-level-consistency-into-English.pdf><sentence-level-
>>> consistency-out-of-English.pdf>
>>>
>>>
>>> I have also attached two tables (Tables 12 and 13) that show what
>>> would happen if each metric posited its system-level score in
>>> place of
>>> its segment-level scores. That is, instead of assigning a different
>>> metric score for every sentence, we assign Google's system-level
>>> score
>>> to every sentence translated by Google, and Edinburgh's system-level
>>> score to every sentence translated by Edinburgh, etc. In this case
>>> the consistency is much better. To me, this indicates that metrics
>>> ought to incorporate a prior into their sentence-level scores, which
>>> could be based on how well a metric did on the entire test set.
>>>
>>>
>>> <system-level-scores-as-sentence-level-scores-out-of-
>>> English.pdf><system-level-scores-as-sentence-level-scores-into-
>>> English.pdf>
>>>
>>>
>>> Let me know if you have any questions. Also, I would like to
>>> extend a
>>> big thanks to Sebastian Pado, who spent several hours with me on
>>> Skype
>>> yesterday working through the logic of my scoring scripts, and
>>> independently verifying that the results using his own scripts.
>>>
>>> Yours,
>>> Chris
>>
>>
>> >>
>>
>
> --
> Dipl.-Inform. Gregor Leusch Tel
> +49-241-80-21618
> Chair of Computer Science 6 Fax
> +49-241-80-22219
> RWTH Aachen University leu...@informatik.rwth-aachen.de
> D-52056 Aachen, Germany www-i6.informatik.rwth-aachen.de/
> ~leusch/
>

comparison_r_tau.thesispages.pdf

Chris Callison-Burch

unread,

Dec 31, 2009, 8:43:17 AM12/31/09

to WM...@googlegroups.com, Tero Tapiovaara

Dear WMT09 participants,

I'm writing to notify you of a correction to our overview paper for the workshop. In the version that we published, we incorrectly reported that most automatic metrics did only marginally better than random choice at the sentence-level. This incorrect result was due to a mismatch in the segment indices used by the automatic metrics and the human rankings (one was indexed starting at 1 and the other at 0). I fixed this mismatch, which resulted in considerably improved sentence-level performance for the automatic metrics. This weakens my conclusion that using the system-level ranks in place of the scores for individual segments results in better performance for most metrics. I have attached the published tables and the corrected tables to this e-mail.

Thanks to Tero Tapiovaara for diagnosing this error and altering me to it. I will post a corrected version of the paper to my web site and to the WMT09 site.

I apologize for this error. WMT10 will be a joint workshop with the NIST Metrics Matr workshop. NIST therefore will be running the assessment of the automatic metrics, and they will bring their normal rigor to the evaluation.

Yours,
Chris Callison-Burch

incorrect-segment-level-scores.pdf

corrected-segment-level-scores.pdf

Reply all

Reply to author

Forward

0 new messages