As an internationally acclaimed NLP competition, this activity is meaningful. Thank you for your efforts in running this interesting event. We have really enjoyed it.
In fact, the evaluation is only a means of testing participants' work; its purpose is not to rank. A stable and consistent evaluation measure makes it easier for everyone to analyze their own system. But it should be fair and effective.
Regarding the change in results:
(1) The evaluation measure was changed after the deadline, but participants had used the previously published measure to train their models. How can the results of systems built on the earlier measure be accurate?
(2) If the change is due to problems with the evaluation measure, there was no need to wait until the end of the contest, 3 weeks after the problem was discovered.
(3) The Gold Standard has already been released, and the evaluation measure and results were changed only afterwards, which might be unreasonable.
Best,
Jiang
On Saturday, February 25, 2017 at 2:41:56 AM UTC+8, tobias daudert wrote: