SemEval-2017 Task-5 : Evaluation metric update

Task five Semeval

Feb 27, 2017, 5:59:57 PM

to semeval-2...@googlegroups.com

Dear Task 5 participants,

In light of the recent feedback and concerns about the evaluation metric, we have decided to go back to the original metric. This is our final decision, which we feel is fair to all participants, given that all systems were trained according to the metric we originally proposed. We apologise for any inconvenience caused, but please understand that we initially acted in the best interest of all participants, based on all the feedback we received. The results on our task web page have just been updated according to the original metric (the other scores remain at the bottom of the page), so please use those when explaining your performance in the system paper. Please note that we will report all this valuable feedback in the Evaluation section of our task paper.

We are currently talking with the SemEval organisers about the possibility of extending the deadline to Thursday 2nd March, but please keep working towards the original deadline until you receive final confirmation from us.

Thanks again for your patience, and many thanks for all the feedback received. We appreciate your valuable time and look forward to receiving your system paper contributions.

Kind regards,

Task 5 organisation team

pedros...@gmail.com

Feb 27, 2017, 6:33:13 PM

to SemEval 2017: Task 5, semeval...@mail.com

Dear organizers,

You mentioned that you decided to go back to the original metric, but the results published right now are different from the ones published in early February.
In fact, this is the fourth change in results. Right now it's hard to be sure that it will not happen a fifth time.

It makes sense to include more than one metric in the task description paper. That would give a better picture of the performance of the different systems and add some transparency to the competition after all these misunderstandings. I suggest also including a standard regression metric such as MAE or MSE.
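For reference, a minimal sketch of the two regression metrics mentioned above (mean absolute error and mean squared error), computed over a gold vector and a predicted vector of the same length. The function names and the sample scores are illustrative assumptions, not part of any official scorer.

```python
def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def mean_squared_error(gold, pred):
    """Average squared difference between gold and predicted scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


# Hypothetical sentiment scores, made up for illustration only.
gold = [0.5, -0.25, 0.0, 1.0]
pred = [0.25, -0.25, 0.5, 0.5]

mae = mean_absolute_error(gold, pred)  # (0.25 + 0 + 0.5 + 0.5) / 4 = 0.3125
mse = mean_squared_error(gold, pred)   # (0.0625 + 0 + 0.25 + 0.25) / 4 = 0.140625
print(mae, mse)
```

Unlike cosine similarity, both metrics penalise predictions that point in the right direction but have the wrong magnitude.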

As for the "official" ranking of the different systems, it does not seem relevant at this point...

Best,

Pedro

Sudipta Kar

Feb 27, 2017, 7:11:41 PM

to SemEval 2017: Task 5, semeval...@mail.com

Hi,

I think it's better to keep the official metric as published on the website from the very beginning, because we trained our model by building a scorer function that follows that description.

From the description on the website, we understood that if there are, say, 1000 messages in the test data, there is a vector of 1000 gold scores. We have to submit scores in JSON format, which will also produce a vector of 1000 sentiment scores in the same order. Then a simple cosine similarity function like the one on this page will be used.

https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/metrics/pairwise.py#L875
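A minimal sketch of that reading of the metric: a single cosine similarity between one gold vector and one prediction vector in the same order. It is a plain-Python illustration rather than the scikit-learn function linked above or the organisers' scorer; the function name, the zero-vector handling, and the sample scores are assumptions.

```python
import math


def cosine_similarity(gold, pred):
    """Cosine of the angle between two equal-length score vectors."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    dot = sum(g * p for g, p in zip(gold, pred))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    if norm_gold == 0.0 or norm_pred == 0.0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_gold * norm_pred)


# Hypothetical scores, made up for illustration only.
gold = [0.4, -0.2, 0.8]
same_direction = [0.2, -0.1, 0.4]   # every score halved
opposite = [-0.4, 0.2, -0.8]        # every sign flipped

print(cosine_similarity(gold, same_direction))
print(cosine_similarity(gold, opposite))
```

Note that halving every predicted score leaves the cosine essentially unchanged, since the measure depends only on the direction of the vectors and not on their magnitude.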

There was no official scorer function available, so each participant had to code their own, and there might have been differences between the scoring functions. That is only natural.

But now it's clear that things were not what they seemed. The scorer was slightly different from what we, at least, had assumed. Moreover, I agree with Pedro that MAE or MSE would be a more reasonable function. But no one trained their system using these.

I think it would be wise to report the scores using the official scorer that you used in the first phase of evaluation, and to provide the function so that we can understand what's going on. Additionally, you could report the scores using a single vector for the scores (the link I provided). There would be two different rankings, but it would still be useful and fair.

It's just a suggestion.

tobias daudert

Feb 28, 2017, 9:26:58 AM

to SemEval 2017: Task 5

Hi Sudipta,

We are building two vectors (one gold standard (GS) vector and one input vector) comprising all scores, then calculating the cosine similarity between the two vectors and multiplying the similarity score by the cosine weight. It is as you said.

Just make sure that the scores in your vector are in the same order (for Microblogs: id and cashtag comparison; for Headlines: id comparison).
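The procedure described above could be sketched roughly as follows. This is not the organisers' scorer: the record field names (`id`, `cashtag`, `score`) are hypothetical, and the cosine weight is assumed here to be the fraction of gold items for which a prediction was submitted, which this thread does not spell out.

```python
import math


def cosine_similarity(gold, pred):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(g * p for g, p in zip(gold, pred))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    if norm_gold == 0.0 or norm_pred == 0.0:
        return 0.0  # assumption: an all-zero vector scores 0
    return dot / (norm_gold * norm_pred)


def align(gold_items, pred_items, key_fields):
    """Pair gold and predicted scores by key, following the gold order."""
    pred_by_key = {
        tuple(item[f] for f in key_fields): item["score"] for item in pred_items
    }
    gold_scores, pred_scores = [], []
    for item in gold_items:
        key = tuple(item[f] for f in key_fields)
        if key in pred_by_key:  # gold items without a prediction are skipped
            gold_scores.append(item["score"])
            pred_scores.append(pred_by_key[key])
    return gold_scores, pred_scores


def weighted_cosine(gold_items, pred_items, key_fields):
    """Cosine similarity scaled by the (assumed) cosine weight."""
    if not gold_items:
        raise ValueError("gold_items must not be empty")
    gold_scores, pred_scores = align(gold_items, pred_items, key_fields)
    weight = len(pred_scores) / len(gold_items)  # assumed weight definition
    return weight * cosine_similarity(gold_scores, pred_scores)


# Hypothetical Microblog records, keyed by id and cashtag.
gold_items = [
    {"id": "1", "cashtag": "$AAPL", "score": 0.5},
    {"id": "1", "cashtag": "$MSFT", "score": -0.5},
    {"id": "2", "cashtag": "$AAPL", "score": 0.25},
]
# Same scores submitted in a different order.
shuffled = [gold_items[2], gold_items[0], gold_items[1]]
# Only two of the three gold items answered.
partial = [gold_items[1], gold_items[0]]

full_score = weighted_cosine(gold_items, shuffled, ("id", "cashtag"))
partial_score = weighted_cosine(gold_items, partial, ("id", "cashtag"))
print(full_score, partial_score)
```

For Headlines, the same sketch would be called with `("id",)` as the key. Matching by key rather than by position means the order of the submitted records no longer affects the score.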

Please send me your code in case you still get a different score so I can make sure that I didn't miss anything.

Best,

Tobias

tobias daudert

Feb 28, 2017, 11:54:46 AM

to SemEval 2017: Task 5

Hi all,

We have uploaded a description of the evaluation metric (which was discussed in the Google group) used for generating the previous results. Since we're not allowed to provide the runnable code, we have used a mixture of description and pseudocode.

This way of evaluation will be discussed in our task paper, so please feel free to use it as well if you want. You can find the description, together with the previous results, here: http://alt.qcri.org/semeval2017/task5/index.php?id=data-and-tools