Thanks for the quick reply! Let me make sure I understand the expected behavior for participants. The idea is to train models on the standard training and validation sets, then run them on all test sets without necessarily making any special effort to do better or worse on the challenge sets. Moreover, the partitions we're supposed to submit results on are those whose names contain the word 'test' or 'challenge'. In particular, we're not supposed to use challenge_train_sample and challenge_validation_sample for training; rather, we should treat them as test sets. Correct?
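Just to make my reading of the submission rule concrete, here is a minimal sketch; the partition names other than the two challenge_* samples mentioned above are invented for illustration:

```python
# Sketch of the selection rule as I understand it: submit on every
# partition whose name contains "test" or "challenge"; train only on
# the rest. Partition names beyond challenge_train_sample and
# challenge_validation_sample are hypothetical examples.
partitions = [
    "train",
    "validation",
    "test",
    "challenge_train_sample",
    "challenge_validation_sample",
]

submit_on = [p for p in partitions if "test" in p or "challenge" in p]
train_on = [p for p in partitions if p not in submit_on]

print(submit_on)  # ['test', 'challenge_train_sample', 'challenge_validation_sample']
print(train_on)   # ['train', 'validation']
```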
If this is correct, then I don't really need to understand the rationale for the challenge sets at this time. But I am still confused about the rationale for the noisy challenge sets in the case of data-to-text NLG. There, I would expect the inputs to come primarily from structured sources such as KBs or DBs, where values are in normalized form, so that e.g. typos would not normally be relevant. For example, suppose for E2E a user asks about "Italiqn" restaurants, mistyping "Italian", but the DB only has "Italian" as an attribute value. Then a response can only be generated if the typo is recognized before NLG is invoked. I could see a case for including typos in the context/prompt of the schema-guided dialog dataset; perhaps that's what's meant? (Or are the typos in the references? Then it would seem to be a challenge only for the metrics.)
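To illustrate what I mean by "the typo is recognized prior to invoking NLG": something upstream has to map the mistyped value onto a normalized DB value before generation can happen at all. A minimal sketch using fuzzy string matching (the value list is made up; this is one possible normalization strategy, not anything the benchmark prescribes):

```python
import difflib

# Hypothetical normalized attribute values, as they might appear in a DB.
known_food_types = ["Italian", "French", "Japanese", "Indian", "Chinese"]

def normalize_value(value, known_values, cutoff=0.8):
    """Map a possibly mistyped attribute value to a known DB value,
    or return None if nothing is close enough."""
    matches = difflib.get_close_matches(value, known_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalize_value("Italiqn", known_food_types))  # 'Italian'
print(normalize_value("zzzz", known_food_types))     # None
```

Only after this step does the NLG component ever see the input, which is why I'd expect typos in the structured input itself to be unrealistic for data-to-text.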
Re future challenge sets: normally I think of this in terms of train/test mismatch, which is why I wanted to make sure I understood what we were supposed to train on. For example, in our INLG paper (on discourse relations in our Methodius reimplementation), we found that a challenge test set where the outputs should be only half as long as in training was quite difficult for neural models. I gather that some of the challenge sets are distributionally different, but I imagine that additional differences (e.g., in length, if that's not already taken into account) would make sense for future challenge sets.
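For concreteness, the kind of length-based mismatch I have in mind could be constructed roughly like this (toy data, invented for illustration): select candidate test items whose references are at most half the mean training reference length.

```python
# Toy references standing in for a training set and a candidate test pool.
train_refs = [
    "The Phoenix is an Italian restaurant near the river with high prices.",
    "Aromi is a family-friendly coffee shop in the city centre serving French food.",
]
candidate_refs = [
    "The Phoenix serves Italian food.",
    "Aromi is a family-friendly coffee shop in the city centre that serves "
    "French food at moderate prices.",
]

# Mean training reference length in whitespace tokens.
mean_train_len = sum(len(r.split()) for r in train_refs) / len(train_refs)

# Keep only candidates at most half the mean training length.
challenge_subset = [r for r in candidate_refs if len(r.split()) <= mean_train_len / 2]

print(challenge_subset)  # ['The Phoenix serves Italian food.']
```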