Dear Organizers,
We have observed that the new evaluation model 'gemma-2b-it' fails to produce evaluation scores in the format required by the evaluation prompt.
In our baseline test during the testing phase, only 1 out of 100 responses contained the expected "#thescore: " marker; the remaining responses yielded invalid scores.
For example, the model typically outputs only "#score", or refuses to respond at all.
This has caused significant problems for our experiments. Could you please look into whether it can be fixed? That would help our experiments greatly.
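In case it helps with diagnosis, below is a minimal sketch of the lenient extraction we can use as a client-side workaround (the function name and the exact tag variants are our assumptions, not part of the official evaluation code). It tolerates "#score"-style tags as well as the expected "#thescore: " marker, but it cannot recover responses that contain no number at all, such as refusals:

```python
import re

def extract_score(text):
    """Try to recover an integer score from a judge response.

    Accepts the expected "#thescore: N" format, but also tolerates
    variants such as "#score: N" or "#score N" (hypothetical variants
    we have seen in gemma-2b-it outputs). Returns None when no score
    can be recovered, e.g. when the model refuses to answer.
    """
    match = re.search(r"#(?:the)?score\s*:?\s*(\d+)", text)
    return int(match.group(1)) if match else None
```

Even with this workaround, most gemma-2b-it responses remain unscorable, which is why we are raising the issue with you.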