Dear Michael,
I'm sorry for the delayed answer. I've looked more closely at the E2E scores for GEM and the INLG paper, and there are actually 2 reasons for the differences:
- The main one: The GEM paper reports validation/development set scores, but the INLG paper reports test set scores. The test set scores are actually quite a bit lower (I'm getting 40.21 for TGen test set output).
- A minor one: The GEM paper reports evaluation over a single output of TGen, but the INLG paper reports averages over 5 different random initializations of the model.
Yes, it will – both evaluation scripts are actually using https://pypi.org/project/rouge-score/ internally. The GEM-metrics script isn't yet linked from the main website since we're still adding more metrics and working on making it easier to use. It will be soon.
> In the past we've found substantial differences between BLEU using e2e-metrics and other versions of BLEU. This could be due to tokenization or smoothing differences.
> If I understand correctly, the clean e2e dataset is single reference, so that wouldn't be the difference.
No, it's actually multi-reference. For example, in the cleaned E2E development set, you have 1484 input MRs and 4299 reference texts. The number of references per instance isn't constant, though – for some MRs, you only have 1 reference, for others, you have more.
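In case it helps, here is a minimal sketch of how the references can be grouped per MR before scoring. It assumes the data has already been loaded as (MR, reference) pairs; the function name group_references and the toy rows are made up for illustration and are not taken from the dataset or from either evaluation script.

```python
from collections import defaultdict


def group_references(pairs):
    """Group (mr, reference) pairs into {mr: [references]}.

    Hypothetical helper: the MRs keep the order of their first appearance,
    and each MR maps to a list of one or more reference texts.
    """
    grouped = defaultdict(list)
    for mr, ref in pairs:
        grouped[mr].append(ref)
    return dict(grouped)


# Toy rows, invented for illustration only (not real dataset entries).
toy_pairs = [
    ("name[Alpha], eatType[pub]", "Alpha is a pub."),
    ("name[Alpha], eatType[pub]", "There is a pub called Alpha."),
    ("name[Beta], food[Italian]", "Beta serves Italian food."),
]

refs_by_mr = group_references(toy_pairs)

# The number of references differs from MR to MR.
for mr, refs in refs_by_mr.items():
    print(len(refs), mr)
```

A multi-reference metric would then score each system output against the whole list for its MR rather than against a single text. With the development set counts above, that works out to just under 3 references per MR on average.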
Best,
Ondrej