The authors argue that multi-candidate evaluation is a reasonable default for tasks whose references are semantically diverse (e.g., visual description). They show how existing semantic-similarity metrics can be used within their proposed framework for multi-candidate evaluation, and they present experiments in a visual description case study. Their framework should benefit any conditional language generation model, including MT, NLG, end-to-end dialogue, etc.
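As a rough intuition for what multi-candidate evaluation against diverse references could look like (this is my own minimal sketch, not the paper's actual method; `SequenceMatcher` is a stand-in for a real semantic-similarity metric, and the example strings are made up):

```python
from difflib import SequenceMatcher


def similarity(candidate: str, reference: str) -> float:
    # Stand-in for a semantic-similarity metric (assumption: the real
    # framework would use a learned metric here, not string overlap).
    return SequenceMatcher(None, candidate, reference).ratio()


def multi_candidate_score(candidates: list[str], references: list[str]) -> float:
    # For each reference, take the best-matching candidate, then average.
    # This rewards a candidate set that collectively covers a diverse
    # reference set, rather than forcing one output to match everything.
    return sum(
        max(similarity(c, r) for c in candidates) for r in references
    ) / len(references)


candidates = ["a dog runs on the beach", "a puppy plays in the sand"]
references = ["a dog running along the shore", "a small dog playing in sand"]
score = multi_candidate_score(candidates, references)
```

The max-then-average aggregation is just one plausible choice; the point is that no single candidate is penalized for failing to match every semantically distinct reference.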
Please at least skim the paper before the meeting. Joining remotely via Zoom will also be possible.