From Gregor, because he still cannot post to the list.
Gregor, I think you are correct: segment-level tau should be perfectly
usable in this case and would be very enlightening.
--Matt
Begin forwarded message:
> From: Gregor Leusch <leu...@i6.informatik.rwth-aachen.de>
> Date: February 23, 2009 12:58:44 PM EST
> To: Matthew Snover <sno...@cs.umd.edu>, Chris Callison-Burch <c...@cs.jhu.edu>
> Cc: WM...@googlegroups.com, Gregor Leusch <leu...@filmstudio.rwth-aachen.de>
> Subject: Re: WMT09 shared evaluation task results
>
> Chris, Matt,
>
> I am addressing you directly, because I still cannot post to the
> mailing list. Please feel free to forward this mail to the list.
>
> ***
>
> Chris, how does your "pairwise segment ranking coefficient" differ
> from Kendall's tau (or, more precisely, from an arithmetic average
> over all the source-segment-wise taus)?
>
> An averaged sentence-wise tau should be ideal for removing the
> effect of "difficult" vs. "simple" source sentences, and for
> assessing only a metric's ability to rank "good" vs. "bad" MT
> systems at the sentence level.
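>
> For concreteness, a minimal sketch of that averaged tau in Python
> (my actual 2005 scripts were in R, and the data layout here is
> hypothetical, just to illustrate the computation):
>
>     from scipy.stats import kendalltau
>
>     def mean_segment_tau(metric_scores, human_scores):
>         """Arithmetic average of Kendall's tau over source segments.
>         Both arguments: dict seg_id -> dict sys_id -> score."""
>         taus = []
>         for seg, human in human_scores.items():
>             systems = sorted(human)
>             h = [human[s] for s in systems]
>             m = [metric_scores[seg][s] for s in systems]
>             tau, _ = kendalltau(h, m)
>             if tau == tau:  # kendalltau gives NaN for all-tied segments
>                 taus.append(tau)
>         return sum(taus) / len(taus)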
>
> I am asking because back in 2005 I ran a couple of experiments
> comparing Pearson's r with Kendall's tau (or \bar{\tau} in this case).
> There did not seem to be any significant differences in the
> "ranking" of different evaluation measures, or of different
> settings of evaluation measures, with respect to r as opposed to
> tau, so I later used only r (because confidence intervals were
> easier to obtain there).
>
> I have attached a couple of results on NIST 2003 and NIST 2004, as
> well as IWSLT 2004 data.
>
> If you are interested, I could try and "warm up" my software from
> 2005 (for tau, basically a couple of R scripts).
>
> Best,
>
> Gregor
>
>
>
> On Mon, 23 Feb 2009, Matthew Snover wrote:
>
>>
>> Thanks Chris. This is really interesting and valuable! It also
>> clearly indicates that the metrics are pretty poor at making the
>> segment-by-segment binary preference judgments.
>>
>> I have a feeling this might be because most metrics get most of their
>> good segment or document correlation from identifying those segments
>> or documents that are difficult, and that by doing this binary
>> preference judgment, we're effectively removing the effect of segment
>> or document difficulty.
>>
>> Very interesting. Thanks!
>>
>> One quick question:
>>
>> From the paper, it sounds like when you combined the preference
>> judgments to get an overall score for a system, you combined them
>> uniformly; that is, you didn't weight sentences by their length, so
>> doing better on a very short segment was treated the same as doing
>> better on a very long segment. Is that correct?
>>
>> --Matt
>>
>> On Feb 23, 2009, at 12:00 PM, Chris Callison-Burch wrote:
>>
>>> Following on from the discussion about the merits of system-level v.
>>> sentence-level analysis of the automatic evaluation metrics, I have
>>> attached tables showing how consistent each metric is at predicting
>>> sentence-level judgments.
>>>
>>> I define consistency over every pair of sentences that was assessed
>>> in the manual evaluation. A metric is consistent if the human
>>> judgment was A > B and the metric score was A > B, or if the human
>>> judgment was A < B and the metric score was A < B. I ignore
>>> situations where the human judgment was A = B, and I was careful to
>>> reverse the polarity of scores for error metrics like TER.
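>>>
>>> In sketch form, the computation is roughly the following (Python,
>>> with a hypothetical data layout; this is not my actual scoring
>>> script):
>>>
>>>     def consistency(pairs, metric_score, lower_is_better=False):
>>>         """pairs: iterable of (a, b, human_pref), where a and b are
>>>         translation ids and human_pref is '>', '<', or '='."""
>>>         sign = -1.0 if lower_is_better else 1.0  # flip error metrics like TER
>>>         agree = total = 0
>>>         for a, b, human_pref in pairs:
>>>             if human_pref == '=':
>>>                 continue  # human ties are ignored
>>>             diff = sign * (metric_score[a] - metric_score[b])
>>>             metric_pref = '>' if diff > 0 else '<'  # a metric tie counts as inconsistent
>>>             if metric_pref == human_pref:
>>>                 agree += 1
>>>             total += 1
>>>         return agree / total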
>>>
>>> Tables 10 and 11 show the percent of time that each metric was
>>> consistent with the human judgments for all of the language pairs.
>>> Since we're only assessing whether A > B or B > A, the random-choice
>>> baseline is 50%. The tables show that the task is very difficult,
>>> because many of the metrics underperform the chance baseline.
>>>
>>> <sentence-level-consistency-into-English.pdf>
>>> <sentence-level-consistency-out-of-English.pdf>
>>>
>>>
>>> I have also attached two tables (Tables 12 and 13) that show what
>>> would happen if each metric posited its system-level score in place
>>> of its segment-level scores. That is, instead of assigning a
>>> different metric score to every sentence, we assign Google's
>>> system-level score to every sentence translated by Google,
>>> Edinburgh's system-level score to every sentence translated by
>>> Edinburgh, etc. In this case the consistency is much better. To me,
>>> this indicates that metrics ought to incorporate a prior into their
>>> sentence-level scores, which could be based on how well a system
>>> scored on the entire test set.
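>>>
>>> One simple way to do that would be to interpolate the two scores
>>> (purely illustrative; we have not tested this, and the weight lam
>>> would have to be tuned):
>>>
>>>     def score_with_prior(seg_score, system_score, lam=0.5):
>>>         """Blend a segment-level score with a system-level prior;
>>>         lam is a hypothetical interpolation weight."""
>>>         return lam * seg_score + (1.0 - lam) * system_score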
>>>
>>>
>>> <system-level-scores-as-sentence-level-scores-out-of-English.pdf>
>>> <system-level-scores-as-sentence-level-scores-into-English.pdf>
>>>
>>>
>>> Let me know if you have any questions. Also, I would like to extend
>>> a big thanks to Sebastian Pado, who spent several hours with me on
>>> Skype yesterday working through the logic of my scoring scripts and
>>> independently verifying the results using his own scripts.
>>>
>>> Yours,
>>> Chris
>>
>>
>>
>
> --
> Dipl.-Inform. Gregor Leusch          Tel +49-241-80-21618
> Chair of Computer Science 6          Fax +49-241-80-22219
> RWTH Aachen University               leu...@informatik.rwth-aachen.de
> D-52056 Aachen, Germany              www-i6.informatik.rwth-aachen.de/~leusch/
>