BLEU version?

Michael White

Mar 23, 2021, 1:19:43 PM

to gem-benchmark

Howdy folks

First, a million thanks for all your hard work on the shared task!

I couldn't find in the paper which version of BLEU was used to calculate the baseline results in Table 2. It would be helpful to know this so that we can sanity-check our own baselines.

Relatedly, I see that the TGen baseline has a BLEU score of 46.0 for E2E clean in Table 2, but the INLG-19 paper on the E2E clean dataset appears to report TGen scores around 40. Is this just a difference in BLEU versions?

Warm regards

Mike

odusek

Mar 23, 2021, 2:30:06 PM

to gem-benchmark

Dear Michael,

Thank you for pointing out the score difference! In theory, the scores should be the same (SacreBLEU and MTEval should be 100% compatible). I'll try to debug this.

The scripts that compute our automatic metrics are at https://github.com/GEM-benchmark/GEM-metrics; we use SacreBLEU to compute BLEU. The INLG19 paper used e2e-metrics (https://github.com/tuetschek/e2e-metrics), which internally uses the original MTEval Perl script. Both SacreBLEU and MTEval have been slightly modified to allow a variable number of references. SacreBLEU now officially supports this; for MTEval, I made this change in e2e-metrics only.

Best regards,

Ondrej

Michael White

Mar 23, 2021, 4:09:29 PM

to gem-benchmark

Hi Ondrej

Thanks for your quick response! Is the GEM-metrics GitHub page linked from the GEM website? I didn't see a link there.

The getting-started page shows an example of using ROUGE via Hugging Face (https://gem-benchmark.com/get_started#generating-and-evaluating-predictions). Will this return the same results as GEM-metrics?

In the past we've found substantial differences between BLEU using e2e-metrics and other versions of BLEU. This could be due to tokenization or smoothing differences.

If I understand correctly, the cleaned E2E dataset is single-reference, so that wouldn't be the difference.

Mike

odusek

Mar 28, 2021, 1:51:04 PM

to gem-benchmark

Dear Michael,

I'm sorry for the delayed answer. I've looked more closely at the E2E scores in the GEM paper and the INLG paper, and there are actually two reasons for the differences:

- The main one: The GEM paper reports validation/development set scores, but the INLG paper reported test set scores. The test set scores are actually quite a bit lower (I'm getting 40.21 for TGen test set output).

- A minor one: The GEM paper reports evaluation over a single output of TGen, but the INLG paper reports averages over 5 different random initializations of the model.

> The getting-started page shows an example of using ROUGE via Hugging Face (https://gem-benchmark.com/get_started#generating-and-evaluating-predictions). Will this return the same results as GEM-metrics?

Yes, it will – both evaluation scripts are actually using https://pypi.org/project/rouge-score/ internally. The GEM-metrics script isn't yet linked from the main website since we're still adding more metrics and working on making it easier to use. It will be soon.

> In the past we've found substantial differences between BLEU using e2e-metrics and other versions of BLEU. This could be due to tokenization or smoothing differences.

Yes, I agree that this happens (a lot :-)). However, both e2e-metrics and SacreBLEU (https://www.aclweb.org/anthology/W18-6319/), which is used internally by GEM-metrics, aim to be compatible with the "original" (evolved) Perl implementation of Papineni et al.'s BLEU, the NIST mteval-v13a script (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl). So the scores should in theory be the same if they're computed by these two.
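To make the tokenization point concrete, here is a minimal sketch of unsmoothed, case-sensitive corpus-level BLEU in plain Python. It is not the mteval-v13a or SacreBLEU implementation, and the two tokenizers (`whitespace_tok`, `punct_tok`) and the example sentences are invented for illustration only. It just shows that the same output and references can receive different scores depending on how punctuation is split.

```python
import math
import re
from collections import Counter


def whitespace_tok(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def punct_tok(text):
    """Split words and punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps, refs_per_hyp, tokenize, max_n=4):
    """Unsmoothed corpus-level BLEU with a variable number of references."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, refs in zip(hyps, refs_per_hyp):
        h = tokenize(hyp)
        rs = [tokenize(r) for r in refs]
        hyp_len += len(h)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(h)), len(r)) for r in rs)[1]
        for n in range(1, max_n + 1):
            h_counts = ngrams(h, n)
            max_ref = Counter()
            for r in rs:
                max_ref |= ngrams(r, n)   # per-n-gram maximum over references
            matches[n - 1] += sum((h_counts & max_ref).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0   # no smoothing: an order with no matches zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))
    return 100 * brevity * math.exp(log_prec)


# Invented example: one system output with two references.
hyps = ["The Eagle is a cheap coffee shop near Burger King."]
refs = [[
    "The Eagle is a cheap coffee shop located near Burger King.",
    "Near Burger King, The Eagle is a cheap coffee shop.",
]]

bleu_ws = corpus_bleu(hyps, refs, whitespace_tok)
bleu_punct = corpus_bleu(hyps, refs, punct_tok)
print(f"whitespace: {bleu_ws:.2f}  punctuation-split: {bleu_punct:.2f}")
```

The two printed scores differ even though nothing but the tokenizer changed, which is why comparing numbers across scripts only makes sense when both apply the same tokenization and smoothing.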

> If I understand correctly, the cleaned E2E dataset is single-reference, so that wouldn't be the difference.

No, it's actually multi-reference. For example, in the cleaned E2E development set, you have 1484 input meaning representations (MRs) and 4299 reference texts. The number of references per instance isn't constant, though – for some MRs, you only have 1 reference, for others, you have more.
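A minimal sketch of how a variable number of references per MR can be handled when preparing data for a BLEU script: group the (MR, reference) pairs by MR, then transpose the per-MR lists into equal-length "reference streams", padding the missing slots. The helper names (`group_references`, `to_reference_streams`), the `None` padding value, and the example MRs are assumptions made for illustration; check which padding value your BLEU implementation expects.

```python
from collections import defaultdict


def group_references(pairs):
    """Group (mr, reference) pairs into one reference list per distinct MR."""
    grouped = defaultdict(list)
    for mr, ref in pairs:
        grouped[mr].append(ref)
    return dict(grouped)   # keeps MRs in order of first occurrence


def to_reference_streams(refs_per_instance, pad=None):
    """Transpose per-instance reference lists into equal-length streams.

    Stream k holds the k-th reference of every instance, or `pad` where an
    instance has fewer than k + 1 references.
    """
    width = max(len(refs) for refs in refs_per_instance)
    return [
        [refs[k] if k < len(refs) else pad for refs in refs_per_instance]
        for k in range(width)
    ]


# Invented E2E-style data: the first MR has two references, the second has one.
pairs = [
    ("name[The Eagle], eatType[coffee shop]", "The Eagle is a coffee shop."),
    ("name[The Eagle], eatType[coffee shop]", "There is a coffee shop called The Eagle."),
    ("name[Loch Fyne], food[Italian]", "Loch Fyne serves Italian food."),
]

grouped = group_references(pairs)
refs_per_instance = list(grouped.values())
streams = to_reference_streams(refs_per_instance)
print(len(grouped), "MRs,", sum(len(r) for r in refs_per_instance), "references")
```

With this layout, a system produces one output per distinct MR, and each output is scored against all references for that MR rather than against a single one.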

Best,
Ondrej
