For the final ranking that determines the shared task prizes on each leaderboard, we will consider both automatic metrics (`total`) and human evaluations.
We will conduct human evaluations on the submissions with the highest `total` scores (up to the top 5 teams, depending on how close the scores are). Human annotators will rank the generated utterances from the same set of randomly selected turns across these top submissions, focusing on (1) "relevance": how relevant each utterance is to the grounding document span and the dialogue context; and (2) "fluency": how grammatically fluent it is. The final rank will be the average of a team's ranks on `total`, `relevance`, and `fluency`, with ties broken by the rank on `total`.
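For concreteness, the sketch below shows one way this final ranking could be computed. The team names, scores, and the `rank_of` helper are hypothetical illustrations; only the rule itself (average of the three per-metric ranks, ties broken by the `total` rank) comes from this section.

```python
# A minimal sketch of the ranking rule described above.
# Assumption: higher scores are better for all three metrics.

def rank_of(scores):
    """Map each team to its rank (1 = best) under higher-is-better scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {team: i + 1 for i, team in enumerate(ordered)}

# Hypothetical per-team scores: `total` is the automatic metric;
# `relevance` and `fluency` would come from human annotators.
teams = {
    "team_a": {"total": 41.2, "relevance": 3.8, "fluency": 4.1},
    "team_b": {"total": 40.9, "relevance": 4.0, "fluency": 4.1},
    "team_c": {"total": 39.5, "relevance": 3.6, "fluency": 3.9},
}

# Per-metric ranks for each team.
ranks = {
    metric: rank_of({t: s[metric] for t, s in teams.items()})
    for metric in ("total", "relevance", "fluency")
}

def final_key(team):
    # Average of the three ranks; a tie on the average is broken
    # by the team's rank on `total`.
    avg = sum(ranks[m][team] for m in ("total", "relevance", "fluency")) / 3
    return (avg, ranks["total"][team])

for team in sorted(teams, key=final_key):
    print(team, final_key(team))
```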