Hi,
I think it's better to keep the official metric as it has been published on the website from the very beginning, because we trained our model with a scorer function that we built by following that description.
From the description on the website, we understood that if there are, say, 1000 messages in the test data, there is a vector of 1000 gold scores. We have to submit scores in JSON format, which will also produce a vector of 1000 sentiment scores in the same order. Then a simple cosine similarity function like the one on this page will be used.
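To make our reading of the description concrete, here is a minimal sketch of the single-vector cosine similarity we assumed. It is only our interpretation, not the official scorer; the function name and the zero-vector handling are my own assumptions.

```python
import math

def cosine_similarity(gold, pred):
    """Cosine similarity between two equal-length score vectors.

    Assumption: one gold score and one predicted score per message,
    both listed in the same message order.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    dot = sum(g * p for g, p in zip(gold, pred))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    # Assumption: return 0.0 when either vector is all zeros,
    # since the cosine is undefined in that case.
    if norm_gold == 0 or norm_pred == 0:
        return 0.0
    return dot / (norm_gold * norm_pred)
```

With this reading, the whole test set yields a single number, for example `cosine_similarity(gold_scores, submitted_scores)`, where both lists have 1000 entries.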
There was no official scorer function available, so each participant had to code their own, and there may have been differences between those scoring functions. That is only natural.
But now it's clear that the scorer was not what it appeared to be. It was slightly different from what we, at least, had assumed. Moreover, I agree with Pedro that MAE or MSE would be a more reasonable function. But no one trained their system using these.
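For reference, MAE and MSE over the same pair of score vectors could be sketched as below. The function names are mine and this is only an illustration, not a proposed official scorer.

```python
def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal in length")
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

def mean_squared_error(gold, pred):
    """Average squared difference between gold and predicted scores."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal in length")
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)
```

Unlike cosine similarity, which depends only on the direction of the score vector, both of these penalize differences in the magnitude of the predicted scores.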
I think it would be wise to report the scores using the official system that you used in the first phase of evaluation, and to provide the function so that we can understand what's going on. Additionally, you could report the scores computed over a single vector of scores (the link I provided). There would be two different rankings, but it would still be useful and fair.
It's just a suggestion.