Discrepancy in the cosine similarity

Abhishek Kumar

Feb 23, 2017, 5:09:43 AM

to SemEval 2017: Task 5

I get a different cosine score when I run my model and compute the cosine similarity against the gold scores that were made available. Is anyone else facing this issue?

tobias daudert

Feb 23, 2017, 5:12:56 AM

to SemEval 2017: Task 5

Do you have any source code we can talk about?

Best,

Tobias

Abhishek Kumar

Feb 23, 2017, 5:25:59 AM

to SemEval 2017: Task 5

Yes. I have written an evaluation script. I am emailing it to you along with the submission file for task 2.

tobias daudert

Feb 23, 2017, 8:26:12 AM

to SemEval 2017: Task 5

Hi Abhishek,

I just had a look at your evaluation file and saw why you receive a different score. I have to admit you couldn't have known that our evaluation differs slightly from what is described on Codalab, which we did in order to get a more accurate, message-focused similarity score.

What you are doing is basically creating one vector for all headline scores, and one vector for all gold standard scores to then calculate the cosine similarity of both.
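
The single-vector approach described here could be sketched roughly as follows. This is only an illustration: the function name `global_cosine` and the sample scores are hypothetical and are not taken from any script mentioned in this thread.

```python
import math

def global_cosine(gold, pred):
    """Cosine similarity between two equally long score vectors."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    dot = sum(g * p for g, p in zip(gold, pred))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    if norm_gold == 0.0 or norm_pred == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_gold * norm_pred)

# Hypothetical scores for four headline entities, all placed in one vector per side.
gold_scores = [0.8, -0.3, 0.1, 0.5]
pred_scores = [0.6, -0.1, 0.2, 0.4]
score = global_cosine(gold_scores, pred_scores)
print(round(score, 4))
```

Every entity contributes to the same pair of vectors here, so the grouping of entities by headline plays no role in the result.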

With that approach you're missing three vital parts in which our evaluation differs:

1. You're not using the cosine weight (but that doesn't matter in your case).
2. Entities are not taken into account. Your approach handles all headlines as one instance, without considering that some headlines might include multiple companies.
3. To calculate your similarity score, you're creating one vector for all instances, even though the different headlines are unrelated to one another.

What we did, for both datasets (the gold standard and the submission), was to filter the data to find out which headlines have multiple entities. Then we created one vector for each instance (the vectors have different lengths, according to the number of entities in the instance). With a vector for each instance, we calculated the cosine similarity per instance. Those similarity scores were summed and divided by the number of instances to obtain an average cosine similarity score over all instances, which was multiplied by the cosine weight at the end.
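
A minimal sketch of the procedure described above, not the organisers' script (which was not released). The row layout, the names `per_instance_score` and `cosine_weight`, and the handling of zero vectors are assumptions made for illustration.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: the thread does not say how zero vectors are scored.
        return 0.0
    return dot / (norm_u * norm_v)

def per_instance_score(rows, cosine_weight=1.0):
    """rows: iterable of (instance_id, gold_score, predicted_score)."""
    gold_by_id = defaultdict(list)
    pred_by_id = defaultdict(list)
    for instance_id, gold, pred in rows:
        gold_by_id[instance_id].append(gold)
        pred_by_id[instance_id].append(pred)
    if not gold_by_id:
        raise ValueError("no instances to score")
    # One cosine similarity per instance, then the mean over instances.
    similarities = [cosine(gold_by_id[i], pred_by_id[i]) for i in gold_by_id]
    return cosine_weight * sum(similarities) / len(similarities)

# Hypothetical data: headline h1 has one entity, headline h2 has two.
rows = [
    ("h1", 0.5, 0.4),
    ("h2", 0.2, 0.3),
    ("h2", -0.6, -0.1),
]
score = per_instance_score(rows)
print(round(score, 4))
```

In this toy input the single-entity headline scores 1 and the two-entity headline scores 0.6, so the average is 0.8 before any cosine weight is applied.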

Is that clear to you?

Best,

Tobias

Mengxiao Jiang

Feb 23, 2017, 8:59:02 AM

to SemEval 2017: Task 5

Hi Tobias,

Could you publish the evaluation script?

Thank you!

Best,

Jiang

On Thursday, February 23, 2017 at 9:26:12 PM UTC+8, tobias daudert wrote:

tobias daudert

Feb 23, 2017, 9:07:10 AM

to SemEval 2017: Task 5

Hi Jiang,

I'd like to share it with you but unfortunately, I'm not allowed to do that. It was already a hassle to get the agreement from all stakeholders to release the data.

If you have any further questions regarding our approach, go ahead and ask and I'll do my best to explain it.

If you have specific code and cannot adapt it to our evaluation from my description, just email it to me as Abhishek did.

Best,

Tobias

Deepanway Ghosal

Feb 23, 2017, 9:43:01 AM

to SemEval 2017: Task 5

Hi Tobias,

I understand that your evaluation is different from what is described on Codalab, but consider this:

In subtask 1 there were 800 examples with 800 ids. Among those there are 672 unique ids; 592 of them have only one company associated with them and the remaining 80 have multiple companies associated with them.

What I understand from your evaluation algorithm is that you are creating 672 instances corresponding to those 672 unique ids, you have 672 cosine similarity scores computed for each instance and you are taking the mean of those scores and multiplying it by cosine weight at the end.

But consider the 592 ids that have only one company associated with them. Let's say that for one of these ids the gold score is 0.8, system 1 predicted a score of 0.8 and system 2 predicted a score of 0.01. For both systems the cosine similarity score will be 1 (as the gold score is positive and both systems predicted a positive score), which shouldn't be the case, as system 1 predicted the better result and should be rewarded more.

Similarly if gold score is -0.01, system 1 predicted a score of 0.01 and system 2 predicted score of -0.99 then system 1 gets a cosine similarity score -1 (reducing the overall score) and system 2 gets a similarity score of +1 (increasing the overall score) which really shouldn't be the case as system 1's prediction is much closer to the gold score.

So it essentially boils down to a classification problem of positive versus negative: predicting the correct class gives you a cosine score of 1, predicting the wrong class gives you a score of -1, and the actual predicted sentiment scores don't matter at all (for these 592 ids only).
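
The point about single-company ids can be checked directly: for two non-zero vectors of length 1, the cosine similarity reduces to the product of the signs. A small sketch using the numbers from this post (the helper name `cosine_1d` is made up for illustration):

```python
def cosine_1d(gold, pred):
    """Cosine similarity of the length-1 vectors [gold] and [pred] (both non-zero)."""
    return (gold * pred) / (abs(gold) * abs(pred))

# Gold 0.8: an exact prediction and a weak one get the same score.
same_sign_exact = cosine_1d(0.8, 0.8)
same_sign_weak = cosine_1d(0.8, 0.01)

# Gold -0.01: the close prediction is punished, the distant one is rewarded.
close_but_wrong_sign = cosine_1d(-0.01, 0.01)
far_but_right_sign = cosine_1d(-0.01, -0.99)

print(same_sign_exact, same_sign_weak, close_but_wrong_sign, far_but_right_sign)
```

The magnitude of the prediction cancels out, so only its sign survives in the score.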

Now, as 592 of 672, or more than 88%, of the instances have only a single company name, the final score is rather skewed. The same holds for subtask 2: there are 461 unique headlines and 433 (93.9%) of them contain only a single company name.
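
The proportions quoted above can be reproduced from the counts given in this post (the variable names are illustrative only):

```python
# Counts as stated in the post.
unique_task1, single_task1 = 672, 592
unique_task2, single_task2 = 461, 433

multi_task1 = unique_task1 - single_task1    # ids with several companies
share_task1 = single_task1 / unique_task1    # fraction of single-company ids
share_task2 = single_task2 / unique_task2    # fraction of single-company headlines

print(multi_task1, round(100 * share_task1, 1), round(100 * share_task2, 1))
```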

So wouldn't overall cosine similarity be a better evaluation metric for this task?

Best,

Deepanway

Deepanway Ghosal

Feb 23, 2017, 10:25:20 AM

to SemEval 2017: Task 5

Hi Tobias,

There are 12 unique instances in both subtask 1 and subtask 2 that have only one cashtag/company name associated with them and a gold sentiment score of zero. I am curious to know how you are calculating cosine similarity scores for these instances.
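
The difficulty with a zero gold score is that the norm of the gold vector is zero, so the textbook cosine formula divides by zero. A short sketch of a plain implementation with no special handling (the prediction 0.3 is a made-up value):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

try:
    cosine([0.0], [0.3])   # gold score of zero, hypothetical prediction
    outcome = "defined"
except ZeroDivisionError:
    outcome = "undefined (division by zero)"

print(outcome)
```

Any evaluation script therefore has to pick a convention for these instances; which one was used is not stated in this thread.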

Best,

Deepanway

Andrew Moore

Feb 23, 2017, 12:31:35 PM

to SemEval 2017: Task 5

Hi all,

I have created an evaluation script that gets the same scores as posted by the organisers. Here is a link to the script:
https://github.com/apmoore1/semeval/blob/master/examples/eval.py

Best,

Andrew.

tobias daudert

Feb 24, 2017, 1:22:47 PM

to SemEval 2017: Task 5

Hi all,

@Andrew: Thank you for your contribution. I haven't had the time to look at it, but I suppose it's fine if you're obtaining the same scores.

@Deepanway: I agree with you; you raise a good point that needs to be addressed. I appreciate your thoughts on this and thank you for picking up this topic, since every improvement in the evaluation benefits each participant. The better the evaluation is, the more meaningful the results are.

We’ve been looking into this and I think we found a good extension to the current way of evaluation.

We are going to treat vectors with a length of 1 differently than vectors including the scores of multiple entities.

This will allow us to still take the relation of entities into consideration while creating an overall score, and it also handles the “single score” problem. Scores with different signs (+/- or vice versa) are still going to be 0 since the sentiment is totally opposite. But to give a positive score for a positive prediction (or negative and negative), we are using an additional measure that includes the distance in the overall score. This will be the distance between both scores, 1 - |GS_i - PS_i|, which gives us a similarity score between 0 and 1, as the cosine similarity does. In addition, to avoid overweighting single scores (for a 1 in the cosine similarity you need to have predicted multiple sentiments correctly, while the single score is derived from only one prediction), we are weighting the cosine similarity scores according to the length of the given input vector.
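
One possible reading of this extension is sketched below. It is not the organisers' implementation: the exact weighting scheme (each instance weighted by its vector length, with the total divided by the number of entities), the treatment of a zero score as "not opposite", and the zero-vector fallback are all assumptions, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: zero vectors are not covered in the thread
    return dot / (norm_u * norm_v)

def instance_score(gold, pred):
    """Length-1 vectors use 1 - |GS - PS| (0 for opposite signs), others use cosine."""
    if len(gold) == 1:
        g, p = gold[0], pred[0]
        if g * p < 0:            # opposite polarity
            return 0.0
        return 1.0 - abs(g - p)
    return cosine(gold, pred)

def extended_score(instances):
    """instances: list of (gold_vector, pred_vector), one pair per tweet or headline."""
    total_entities = sum(len(gold) for gold, _ in instances)
    weighted_sum = sum(len(gold) * instance_score(gold, pred)
                       for gold, pred in instances)
    return weighted_sum / total_entities

# Hypothetical instances: two single-entity messages and one with two entities.
instances = [
    ([0.9], [0.7]),
    ([0.1], [-0.1]),
    ([0.2, 0.3], [0.2, 0.3]),
]
score = extended_score(instances)
print(round(score, 4))
```

With these inputs the three instances score 0.8, 0 and 1; weighting by length gives (0.8 + 0 + 2 * 1) / 4 = 0.7.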

Putting all scores in one GS vector and one Input vector to then use the cosine similarity (or something similar) is no solution for this since the task was to score the entities and not instances (or documents if you wish). By creating one vector for all scores, each entity would be treated in the same way. The entity (cashtag) - instance (tweet) relation would totally be ignored.

I'm looking forward to hearing your thoughts on this.

Best,

Tobias

Abhishek Kumar

Feb 24, 2017, 1:48:15 PM

to SemEval 2017: Task 5

Hello Tobias,

The proposed metric seems to work when both the gold values and the predicted values are close to 1 or -1. But consider a case where the gold sentiment value is 0.1 and the model predicts -0.1. In this case, your proposed metric would output 0, but it should give a score similar to the case where the gold sentiment score is 0.9 and the predicted output is 0.7. The absolute error is the same in both cases, yet the proposed metric penalizes one of them severely. Secondly, cosine similarity has an inherent non-linearity in its computation, whereas the proposed formula is linear, so it would not be fair to combine results obtained with cosine similarity with results obtained using the proposed formula.
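
The two cases compared above can be put side by side with a short sketch of the proposed single-score rule, as it is described in this thread (the function name `single_score` is hypothetical):

```python
def single_score(gold, pred):
    """Proposed rule for length-1 vectors: 0 for opposite signs, else 1 - |GS - PS|."""
    if gold * pred < 0:
        return 0.0
    return 1.0 - abs(gold - pred)

# Both predictions are off by an absolute error of 0.2.
near_zero_case = single_score(0.1, -0.1)   # opposite signs
high_value_case = single_score(0.9, 0.7)   # same sign

print(near_zero_case, round(high_value_case, 2))
```

The same absolute error yields 0 in the first case and about 0.8 in the second, which is the asymmetry being pointed out.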

Regards,

Abhishek Kumar

Manel Zarrouk

Feb 24, 2017, 2:21:29 PM

to semeval-2...@googlegroups.com

Hi Kumar,

Predicting an opposite polarity for a sentiment can be not only useless but even misleading especially in the financial domain.

Our new evaluation metric will be rewarding the closer sentiment having the same polarity of course (eg user's score=0.8 - gold score=0.9 VS user's score=0.1 - gold score=0.9).

Predicting a negative sentiment for a positive message/entity can't in any way be rewarded (in a more strict environment it should even be penalised).

Thank you for your e-mail and your thoughts.

Regards

Dr. Manel Zarrouk

Postdoctoral Researcher & Adjunct Lecturer

Knowledge Discovery Unit

The Insight Centre for Data Analytics

NUI Galway, Ireland

+353 0851909501

http://mzarrouk.net

--

You received this message because you are subscribed to the Google Groups "SemEval 2017: Task 5" group.

To unsubscribe from this group and stop receiving emails from it, send an email to semeval-2017-ta...@googlegroups.com.

To post to this group, send email to semeval-2...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/semeval-2017-task-5/2b63cebf-c75f-4f04-8ae1-b4bfdb1c34a1%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Deepanway Ghosal

Feb 24, 2017, 2:24:45 PM

to SemEval 2017: Task 5

Hi Tobias and Manel

I think the scores with different signs shouldn't totally be ignored.

For instances with multiple companies the cosine score can also be negative, e.g. GSi = [0.2, 0.3] and PSi = [-0.2, -0.3] gives a cosine score of -1. So the range of the cosine score for these instances is [-1, +1].

But in your metric the range of the score for instances with a single company is [0, 1]. This range should also be [-1, 1], as you are finally taking a weighted average of these scores.
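
The mismatch in ranges can be seen with a small sketch that scores a fully opposite prediction both ways. The single-score rule follows the description given earlier in the thread; the function names and the single-entity values are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def single_score(gold, pred):
    """Proposed rule for length-1 vectors: 0 for opposite signs, else 1 - |GS - PS|."""
    return 0.0 if gold * pred < 0 else 1.0 - abs(gold - pred)

multi_opposite = cosine([0.2, 0.3], [-0.2, -0.3])   # lower end of the cosine range
single_opposite = single_score(0.2, -0.2)           # lower end of the single-score range

print(round(multi_opposite, 4), single_opposite)
```

An opposite prediction costs a multi-company instance a full -1, while a single-company instance bottoms out at 0.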

Still the issue of linearity and non-linearity remains.

Manel Zarrouk

Feb 24, 2017, 2:49:09 PM

to semeval-2...@googlegroups.com

Hi all,

Thank you for your valuable feedback.

Due to lack of time, we will not be able to improve the evaluation metric further.

But your feedback and thoughts will be cited as possible (future) improvements.

Thank you for your understanding.

Regards.

On Fri, 24 Feb 2017, at 19:37, Abhishek Kumar wrote:

Hello Manel and Tobias,

I do agree that predicting sentiment of the opposite polarity is not good, but then the question is: is this metric fair enough? As I understand the working of this metric, it heavily penalizes classification errors. Fair enough. But is this what the cosine similarity function does too? Cosine similarity would only penalize by the same amount if the gold score and the predicted score are exactly opposite in nature. Example: if the gold score is [0.9, 0.5, 0.8] and the predicted score is [-0.9, -0.5, -0.8], only then would cosine similarity penalize by the same amount, i.e. it would output 0. But sadly this is not the case with your proposed metric. Please look into this!

Regards,

Abhishek Kumar

Abhishek Kumar

Feb 24, 2017, 2:52:43 PM

to SemEval 2017: Task 5, ma...@mzarrouk.net

Hello Manel and Tobias,

I do agree that predicting sentiment of the opposite polarity is not good, but then the question is: is this metric fair enough? As I understand the working of this metric, it heavily penalizes classification errors. Fair enough. But is this what the cosine similarity function does too? Cosine similarity would not necessarily penalize a classification error; rather, it focuses on the spatial orientation of the vectors. Example: if the gold score is [0.2, 0.3] and the predicted score is [-0.1, 0.5], then cosine similarity would give an output of about 0.71, which is significantly greater than 0. But sadly this is not the case with your proposed metric.
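
For the two vectors in this example, a direct computation of the cosine similarity gives roughly 0.71 (exactly 1/√2), even though the first component has the wrong sign. A minimal check, with an illustrative helper name:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

gold = [0.2, 0.3]
pred = [-0.1, 0.5]   # first entity predicted with the opposite polarity
value = cosine(gold, pred)

print(round(value, 4))
```

The dot product is 0.13 and the norms are √0.13 and √0.26, so the quotient is 0.13 / (0.13 · √2).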

Regards,

Abhishek Kumar

tobias daudert

Feb 24, 2017, 3:19:45 PM

to SemEval 2017: Task 5, Abhishek Kumar

Hi Abhishek,

Your example is not correct: opposite numbers as in your example are of course penalised, since the output of the cosine similarity ranges over [-1, 1].

This would actually lead to a decrease of the score, in contrast to the single-score comparison, which would give us 0 as the result. But given that we are dividing the total sum by the number of entities, a 0 or “no score” is basically also a decrease in the final score. Therefore, both approaches penalise opposite sentiments. We can of course argue about how much penalisation is ideal, but given that there is no subjective factor in the approach, I guess it is quite fair.
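
Both statements can be illustrated with a short sketch: the three-element example from the quoted message yields a cosine of -1, and adding either a 0 or a -1 to an average lowers it, only by different amounts. The list of per-instance scores is made up for illustration and uses a plain unweighted mean.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean(values):
    return sum(values) / len(values)

# Exactly opposite vectors: the cosine is at the bottom of its range.
opposite = cosine([0.9, 0.5, 0.8], [-0.9, -0.5, -0.8])

# Hypothetical per-instance scores: two perfect ones plus one opposite prediction.
baseline = mean([1.0, 1.0])
with_zero = mean([1.0, 1.0, 0.0])        # opposite single score counted as 0
with_minus_one = mean([1.0, 1.0, -1.0])  # opposite multi-entity vector counted as -1

print(round(opposite, 4), baseline, round(with_zero, 4), round(with_minus_one, 4))
```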

That leaves us with the linearity and non-linearity. Since there is no common approach to combine metrics for sentiment evaluation given the requirements, it is also fair to go with this. Do you have a better suggestion?

Best,

Tobias

tobias daudert

Feb 24, 2017, 3:24:26 PM

to SemEval 2017: Task 5

Hi again,

Now for my reply to your second post:

The zero penalisation only takes place for vectors with a length of 1. Given your second example of [ 0.2, 0.3] and [-0.1, 0.5], we would go with the cosine similarity.