Change of the evaluation method

tobias daudert

Feb 24, 2017, 1:41:56 PM

to SemEval 2017: Task 5

Hi all,

I have to announce that we have slightly changed how the results are evaluated, following some good feedback in the Google group.
As a consequence, the results have changed. Apologies for the inconvenience.
In our opinion, a better evaluation benefits everyone, since your scores become more meaningful.
We hope you understand our decision.

Best,
Tobias

Youness

Feb 24, 2017, 3:27:14 PM

to SemEval 2017: Task 5

Hello,

Are the new results and the new evaluation method available somewhere?

Best,

Youness

tobias daudert

Feb 24, 2017, 3:31:24 PM

to SemEval 2017: Task 5, mansar....@gmail.com

Hi Youness,

Yes, the results are on the website. The link hasn't changed.

I'll try to provide a formula or a better description tomorrow, but for the time being, have a look at this thread: https://groups.google.com/d/msg/semeval-2017-task-5/HT1xNTVGApg/pjmd3q6AAQAJ

Best,

Tobias

Mengxiao Jiang

Feb 25, 2017, 1:21:17 AM

to SemEval 2017: Task 5

Hi Tobias,

This is an internationally acclaimed NLP competition, and the activity is meaningful. Thank you for your work on this interesting event. We really enjoy it.

In fact, the evaluation is only a means of testing participants' work; its purpose is not to rank. A stable and consistent evaluation measure makes it easier for everyone to analyze their own system. But it should be fair and effective.

About the change of results:

(1) The evaluation changed after the deadline, but participants used the previously published measure to train their models. How can the results of those systems be accurate?

(2) If the change is due to problems with the evaluation measure, there was no need to wait until the end of the contest, 3 weeks after the problem was discovered.

(3) The gold standard has already been released, and only afterwards did you change the evaluation measure and the results, which may be unreasonable.

Best,

Jiang


jack william

Feb 25, 2017, 1:36:06 AM

to SemEval 2017: Task 5

Hi Tobias,

Since you don't provide the evaluation script, everyone's evaluation method might be different. Your slight change could have a great impact on the results. There is no standard measure to prove which method is better, so the sudden change in the results is unfair and baffling to us.

best,

jack

shad....@gmail.com

Feb 25, 2017, 3:12:39 AM

to SemEval 2017: Task 5

Dear Organizers,

This is by far our worst experience with a SemEval competition. The standard that SemEval has established over the years has clearly not been done justice by such an under-prepared task. Normally, the evaluation metric is well defined and provided for validation. But first the gold standard datasets were not available, and now the evaluation metric has been changed repeatedly after the results were announced. How can anyone guarantee that a system is evaluated correctly when tuning and testing use different metrics? We (and, I believe, others too) used the previously stated metric formula to validate our systems, and evaluating them now on a different metric does not seem reasonable or correct. The change has a huge impact on measured system performance. As a matter of fact, our system was in 2nd position according to the first evaluation, but now we stand 19th. How justifiable is that? If the evaluation metric is changed, then ideally we should get more time to tune our systems accordingly.

I sincerely suggest that you either revert the results to the earlier evaluation and keep the improved evaluation script for future work, or allow us to re-submit our systems for a fresh evaluation.

On a closing note, you should respect all the teams' efforts and hard work.

--
Regards,
Shad

Mengxiao Jiang

Feb 25, 2017, 6:27:37 AM

to SemEval 2017: Task 5

Hi Tobias,

If the metric used to train a system is not the same as the metric used to evaluate the model on the test set, the impact on system performance cannot be ignored. The results and rankings are meaningless.

The metric should be consistent for the evaluation of system performance to be fair and reasonable. Even if you think the current metric is more reasonable, it would be fair to let everyone re-train their systems using it.

The team ranked 1st under the current metric might be there just by coincidence, because its system happens to fit the current metric.

It is hard to prove the effectiveness of their model, since a different metric was used to train it.

We sincerely suggest using the original evaluation metric.

best,

jiang
