intended use of challenge sets?


Michael White

Apr 27, 2021, 9:36:41 PM
to gem-benchmark
Hi folks

Sorry if I'm being dense, but I am having a hard time understanding the intended use of the challenge sets for data-to-text/dialogue from the updated arXiv paper. Could you please elaborate on how you expect the new datasets to be used?

For example, the schema-guided dialogue dataset has one challenge train sample partition, a challenge validation sample partition and a bunch of new test sets. Is the idea to augment the train set with the challenge train sample set and tune on the validation sample set, which has a mix of the things in the various challenge test sets? Or are we supposed to only train on the challenge train sample set? Or what exactly?

Relatedly, I'm afraid I couldn't grok the purpose of the challenge sets, so that makes it harder to figure out intended usage.  In particular:
  • With randomly ordered inputs, are we not supposed to normalize the ordering?  If we normalize the ordering anyway, then this doesn't seem to test anything.
  • With typographical errors and missing punctuation, I imagine we're supposed to be robust to them, rather than emulate them?
  • With backtranslation, are we supposed to make different lexical choices?  Or just produce more diverse outputs?
Many thanks!
Mike

M, Simon

Apr 28, 2021, 4:58:58 AM
to Michael White, gem-benchmark
Dear Mike,

thanks for your questions, and apologies for not making this clearer.

> For example, the schema-guided dialogue dataset has one challenge train sample partition, a challenge validation sample partition and a bunch of new test sets. Is the idea to augment the train set with the challenge train sample set and tune on the validation sample set, which has a mix of the things in the various challenge test sets? Or are we supposed to only train on the challenge train sample set? Or what exactly?

All special test sets are intended to be used as test material: you apply your model(s) to them and we evaluate the outputs. In all cases, the model is trained on the original training material only. We added subsets of the training and development data as test sets in order to examine how much each model's scores drop on the actual test set. A model is expected to score highest on the train subset and lowest on the original test set, but it could be that some models don't behave like this for some reason.

> Relatedly, I'm afraid I couldn't grok the purpose of the challenge sets, so that makes it harder to figure out intended usage.  In particular:
>   • With randomly ordered inputs, are we not supposed to normalize the ordering?  If we normalize the ordering anyway, then this doesn't seem to test anything.
With this one we intend to see to what extent a model is sensitive to the order of the components of the input structures. Models that perform some custom ordering of the input triples/dialogue acts/etc. before generating indeed won't be affected, but again that may not be the case for all models.
>   • With typographical errors and missing punctuation, I imagine we're supposed to be robust to them, rather than emulate them?
Yes, in the best case a model will produce the same output as from the non-perturbed data, so it will not be affected by the missing final punctuation or by the typos. We expect the scores to drop, in particular with the introduction of typos. To measure how much more difficult the typos make the job, there are two test files with typos, one having more typos than the other.
>   • With backtranslation, are we supposed to make different lexical choices?  Or just produce more diverse outputs?
Same as above, we will compare the outputs generated from the backtranslated data to the outputs obtained on the original test subset; we will hopefully see to what extent a model is sensitive to the wording of the input sentences according to different metrics (for instance, we would expect BLEU scores to be quite different, but BERTScores and human scores to be somewhat in line).
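
To make the order-sensitivity comparison above concrete, here is a minimal Python sketch. It is not GEM code: `scramble`, `toy_generate` and `changed_fraction` are hypothetical names, the attribute/value pairs are made up, and the "model" is a toy that simply verbalizes the components in the order it receives them. It only illustrates why a model that normalizes the ordering is unaffected by the scrambled inputs while an order-following model is not.

```python
import random

def scramble(example, rng):
    """Return a copy of the input with its components in random order."""
    shuffled = list(example)
    rng.shuffle(shuffled)
    return shuffled

def toy_generate(example, normalize):
    """Stand-in for a model (assumption): verbalizes components in input order.
    With normalize=True it sorts the components first (custom ordering)."""
    items = sorted(example) if normalize else list(example)
    return ", ".join(f"{attr} is {val}" for attr, val in items)

def changed_fraction(examples, normalize, seed=0):
    """Fraction of examples whose output differs between the original
    input and a randomly reordered copy of it."""
    rng = random.Random(seed)
    changed = 0
    for ex in examples:
        original = toy_generate(ex, normalize)
        perturbed = toy_generate(scramble(ex, rng), normalize)
        changed += original != perturbed
    return changed / len(examples)

# Made-up inputs in the style of attribute/value pairs.
examples = [
    [("name", "X"), ("food", "Italian"), ("area", "riverside")],
    [("name", "Y"), ("food", "French"), ("price", "cheap")],
]
print(changed_fraction(examples, normalize=True))  # 0.0: sorting removes the effect
print(changed_fraction(examples, normalize=False))
```

The same comparison (output on the original subset vs. output on the perturbed subset) would apply to the typo and backtranslation sets, with a real model and real metrics in place of the exact-match check.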

We had other ideas for producing maybe more interesting special sets, but could not finish some of them in time. Keep in mind that anyone is welcome to join us to discuss ideas and participate in the creation of more special sets for future tasks!

I hope this answers your questions; don't hesitate to ask if further clarification is needed.

best,
simon

PS: @Other members of the special test set group, please don't hesitate to add comments!

Michael White

Apr 28, 2021, 2:30:22 PM
to M, Simon, gem-benchmark
Hi Simon

Thanks for the quick reply!  Let me make sure I understand the expected behavior for participants.  The idea is to train models using the standard training and validation sets, then run them on all test sets without necessarily making any effort to do better or worse on the challenge sets.  Moreover, the partitions we're supposed to submit results on are those that contain the words 'test' OR 'challenge' in their name.  In particular, we're not supposed to use challenge_train_sample and challenge_validation_sample for training, rather we're supposed to treat them as test sets.  Correct?

If this is correct, then I don't really need to understand the rationale for the challenge sets at this time.  But I am still confused about the rationales for the noisy challenge sets in the case of data-to-text NLG.  In that case, I would expect that the inputs primarily come from structured sources such as KBs or DBs where values are in normalized form, and thus eg typos would not normally be relevant.  For example, suppose for E2E a user asks about "Italiqn" restaurants, mistyping "Italian", but the DB only has "Italian" as an attribute value.  Then a response can only be generated if the typo is recognized prior to invoking NLG.  I could see a case for including typos in the context/prompt in the schema guided dialog dataset, perhaps that's what's meant?  (Or are the typos in the references?  Then it would seem to be a challenge only for the metrics.)

Re future challenge sets, normally I think of this in terms of train/test mismatch, which is why I wanted to make sure I understood what we were supposed to train on.  For example, in our INLG paper (on discourse relations in our Methodius reimplementation), we found that a challenge test set where the outputs should only be half as long as in training was quite difficult for neural models.  I gather that some of the challenge sets are distributionally different, but I imagine that additional differences (eg re length if that's not already taken into account) would make sense for future challenge sets.

Best
Mike

M, Simon

Apr 29, 2021, 4:25:41 AM
to Michael White, gem-benchmark
Hi Mike,
 
> Thanks for the quick reply!  Let me make sure I understand the expected behavior for participants.  The idea is to train models using the standard training and validation sets, then run them on all test sets without necessarily making any effort to do better or worse on the challenge sets.  Moreover, the partitions we're supposed to submit results on are those that contain the words 'test' OR 'challenge' in their name.  In particular, we're not supposed to use challenge_train_sample and challenge_validation_sample for training, rather we're supposed to treat them as test sets.  Correct?

That's all correct! challenge_train_sample and challenge_validation_sample are subsets of the training and development data respectively, so there would be no gain in using them for training/development.
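
For anyone scripting this, the rule confirmed above ("test" OR "challenge" in the partition name) could be sketched in Python as follows. This is only an illustration, not an official GEM utility: the function names are hypothetical, and apart from challenge_train_sample and challenge_validation_sample the split names in the list are assumed for the example.

```python
def is_evaluation_split(name):
    """True if outputs should be submitted for this split:
    its name contains 'test' OR 'challenge'."""
    return "test" in name or "challenge" in name

def partition_splits(split_names):
    """Separate the splits used for training/tuning from those
    treated as test material."""
    evaluation = [n for n in split_names if is_evaluation_split(n)]
    development = [n for n in split_names if not is_evaluation_split(n)]
    return development, evaluation

# Illustrative names; only the two *_sample names appear in this thread.
splits = [
    "train", "validation", "test",
    "challenge_train_sample", "challenge_validation_sample",
    "challenge_test_scramble",
]
development, evaluation = partition_splits(splits)
print(development)  # ['train', 'validation']
print(evaluation)
```

Note that challenge_train_sample lands in the evaluation list even though its name contains "train", which is exactly the point of the clarification.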
 
> If this is correct, then I don't really need to understand the rationale for the challenge sets at this time.  But I am still confused about the rationales for the noisy challenge sets in the case of data-to-text NLG.  In that case, I would expect that the inputs primarily come from structured sources such as KBs or DBs where values are in normalized form, and thus eg typos would not normally be relevant.  For example, suppose for E2E a user asks about "Italiqn" restaurants, mistyping "Italian", but the DB only has "Italian" as an attribute value.  Then a response can only be generated if the typo is recognized prior to invoking NLG.  I could see a case for including typos in the context/prompt in the schema guided dialog dataset, perhaps that's what's meant?  (Or are the typos in the references?  Then it would seem to be a challenge only for the metrics.)

From what I know (anyone, please correct me if I'm wrong), we did not apply the typo perturbation to the data-to-text challenge sets, but only to the text-to-text and schema_guided_dialog ones, where we also thought it made more sense. If there are typos in some data-to-text data, it could indeed be for the metrics challenge; maybe someone from the metrics group can clarify this?
 
> Re future challenge sets, normally I think of this in terms of train/test mismatch, which is why I wanted to make sure I understood what we were supposed to train on.  For example, in our INLG paper (on discourse relations in our Methodius reimplementation), we found that a challenge test set where the outputs should only be half as long as in training was quite difficult for neural models.  I gather that some of the challenge sets are distributionally different, but I imagine that additional differences (eg re length if that's not already taken into account) would make sense for future challenge sets.

Thank you for the suggestion! Creating new test sets with features different from those seen in the training data was one of our objectives for this year, but we could only achieve it on the xsum and ml_sum datasets (addition of covid-related texts). Along the lines that you describe, for this edition we will also evaluate the models on some feature-controlled subsets of the different test sets. We will split some of the test sets based on input complexity, length, gender features, frequency in the training data, etc., and see on which subsets models perform better or worse.
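
As a rough illustration of what such a feature-controlled evaluation could look like, here is a small Python sketch that groups examples by input length and averages a per-example score within each group. It is an assumption-laden toy, not the procedure we will actually run: `length_bucket` and `mean_score_by_bucket` are hypothetical names, the bucket edges are arbitrary, and the scores would come from whatever metric is being examined.

```python
from collections import defaultdict

def length_bucket(text, edges=(10, 20)):
    """Label an input by its whitespace token count (edges are arbitrary)."""
    n = len(text.split())
    for edge in edges:
        if n <= edge:
            return f"<={edge}"
    return f">{edges[-1]}"

def mean_score_by_bucket(inputs, scores, bucket_fn=length_bucket):
    """Average a per-example score within each feature bucket."""
    grouped = defaultdict(list)
    for text, score in zip(inputs, scores):
        grouped[bucket_fn(text)].append(score)
    return {bucket: sum(vals) / len(vals) for bucket, vals in grouped.items()}
```

Any other feature (input complexity, frequency in the training data, etc.) could be plugged in by passing a different `bucket_fn`.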

Thanks again for your interest in the task!

best,
simon

Michael White

Apr 29, 2021, 1:02:31 PM
to gem-benchmark
Hi Simon

Thanks for confirming!  I suggest adding a clarification to the Submitting Outputs section of the shared task page (https://gem-benchmark.com/shared_task) re the challenge train and challenge val sets (basically that it's "challenge" OR "test" rather than "challenge" AND "test" as I had been assuming).

Best
Mike