Dear organizers,
We have noticed that the evaluation script does not handle responses containing 'NONE' correctly: it normalizes 'NONE' like any other string, lowercasing it to 'none', so it is no longer matched against the ground truth. We have modified the script to change this behavior, but since you will be evaluating the predictions, we want to confirm that this is acceptable.
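To illustrate the behavior we observed, here is a minimal sketch. The function names are ours and the details are assumptions about the script's internals, not its actual code; we assume the sentinel check compares against the literal string 'NONE' after normalization has already run:

```python
def normalize(text: str) -> str:
    # Assumed normalization step: lowercases and strips every response,
    # including the 'NONE' sentinel
    return text.lower().strip()

def is_unanswerable(prediction: str) -> bool:
    # Assumed sentinel check: compares the (already normalized) prediction
    # verbatim against 'NONE', so it never matches
    return prediction == "NONE"

# Observed: the sentinel survives as 'none' and is scored as a regular answer
normalized = normalize("NONE")          # -> 'none'
unanswerable = is_unanswerable(normalized)  # -> False

# Our modification (sketch): check the sentinel on the raw response,
# before normalization touches it
def is_unanswerable_fixed(raw_prediction: str) -> bool:
    return raw_prediction.strip() == "NONE"
```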
Furthermore, we want to bring to your attention that BERT does not predict the NONE token, which may affect the F1 score. In particular, empty predictions are penalized with an F1 score of 0 even when the gold answer is NONE; is this intended behavior?
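For reference, a SQuAD-style token F1 (a sketch of the usual formulation, not necessarily your exact implementation) shows why an empty prediction always scores 0, regardless of the gold answer:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    # Token-level F1 in the style of the SQuAD evaluation script
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Number of tokens shared between prediction and gold (with counts)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        # An empty prediction always lands here, even when gold is 'NONE'
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```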
Best,
Selene Baez Santamaria