Gold test data


Maria Pontiki

Apr 11, 2014, 11:14:55 AM
to semeva...@googlegroups.com

Saif Mohammad

Apr 11, 2014, 3:31:53 PM
to semeva...@googlegroups.com

Thanks Maria and the organizers of ABSA for getting this data out and running the evaluations.
We very much appreciate the efforts.
Best wishes.
-Saif

Pablo N. Mendes

Apr 15, 2014, 12:20:26 AM
to Saif Mohammad, semeva...@googlegroups.com

+1 Thanks to organizers!





--

Pablo N. Mendes

Hussam Hamdan

Apr 15, 2014, 5:26:08 PM
to Maria Pontiki, semeva...@googlegroups.com
Thanks for the efforts.
Could you please provide us with the baseline scores for the four subtasks?






--
Best Regards
Hussam Hamdan

Dimitris Galanis

Apr 17, 2014, 7:50:45 AM
to semeva...@googlegroups.com, Maria Pontiki
Hello Hussam,

The current version of the baselines gets the following scores:

Laptops     term extraction:     F1 = 35.64
Restaurants term extraction:     F1 = 47.15
Restaurants category extraction: F1 = 63.89

Laptops     term polarity:       Accuracy = 51.07
Restaurants term polarity:       Accuracy = 64.28
Restaurants category polarity:   Accuracy = 65.65
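
For reference, the extraction subtasks are scored with F1 over exact matches between the predicted and gold term sets. The Python sketch below only illustrates that computation; it is not the official eval.jar code, and the sample offsets are made up:

# Illustrative only -- the official scores come from eval.jar.
def extraction_f1(predicted, gold):
    """predicted/gold: sets of (sentence_id, start, end) term offsets."""
    tp = len(predicted & gold)                     # exact matches
    p = tp / len(predicted) if predicted else 0.0  # precision
    r = tp / len(gold) if gold else 0.0            # recall
    return 2 * p * r / (p + r) if p + r else 0.0

# Made-up example: 2 of 3 predictions match 4 gold terms -> F1 = 4/7.
pred = {("s1", 0, 7), ("s1", 15, 22), ("s2", 3, 9)}
gold = {("s1", 0, 7), ("s2", 3, 9), ("s2", 12, 20), ("s3", 5, 11)}
print(extraction_f1(pred, gold))  # 0.571...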

Dimitris

Hussam Hamdan

Apr 17, 2014, 9:04:39 AM
to Dimitris Galanis, semeva...@googlegroups.com, Maria Pontiki
ok, thanks

blinof...@gmail.com

Apr 30, 2014, 1:55:08 AM
to semeva...@googlegroups.com, Maria Pontiki
Hi,

Can anyone explain the reasons for choosing accuracy as the primary measure for the polarity detection subtasks? Accuracy seems less suitable than the F-measure for skewed test data.
Could the organisers provide F-measure results for the baseline system on the polarity detection subtasks?
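
To make the concern concrete with made-up numbers (not the actual test distribution): a trivial majority-class predictor on a skewed label set gets high accuracy but a poor macro-averaged F-measure.

# Hypothetical skewed gold labels: 80 positive, 15 negative, 5 neutral.
gold = ["positive"] * 80 + ["negative"] * 15 + ["neutral"] * 5
pred = ["positive"] * 100  # trivial majority-class predictor

print(sum(p == g for p, g in zip(pred, gold)) / len(gold))  # accuracy = 0.80

def f1(label):
    tp = sum(p == g == label for p, g in zip(pred, gold))
    fp = sum(p == label != g for p, g in zip(pred, gold))
    fn = sum(g == label != p for p, g in zip(pred, gold))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

labels = sorted(set(gold))
# Macro-F1 is only ~0.30: F1 is 0 for every class the predictor never outputs.
print(sum(f1(l) for l in labels) / len(labels))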

Thank you!

Joachim Wagner

May 1, 2014, 4:27:53 PM
to semeva...@googlegroups.com, Maria Pontiki
Hi organisers,

Will the baseline systems that produce the above results be described in the task overview paper?

If so, could you paste the reference entry here? I remember seeing it in another thread, but it would be handy for everyone to have it here as well.

Joachim

Dimitris Galanis

May 6, 2014, 9:12:43 AM
to semeva...@googlegroups.com, Maria Pontiki

Hello Pavel,

We use accuracy as the primary evaluation measure for the polarity detection subtasks, because it is widely used
and it provides an overall score for all four categories together (positive, negative, neutral, conflict).
The evaluation script that we provided also calculates F1/Precision/Recall for each class, as well as the confusion matrix.
Attached you will find the F1/Precision/Recall scores for the baselines.
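
For anyone who wants to recompute such per-class scores by hand, here is a rough Python sketch of P/R/F1 from a confusion matrix (illustrative only; the official numbers come from the provided evaluation script and eval.jar):

LABELS = ["positive", "negative", "neutral", "conflict"]

def per_class_scores(gold, pred):
    """gold/pred: parallel lists of polarity labels."""
    # Confusion matrix: conf[g][p] = count of gold label g predicted as p.
    conf = {g: {p: 0 for p in LABELS} for g in LABELS}
    for g, p in zip(gold, pred):
        conf[g][p] += 1
    scores = {}
    for c in LABELS:
        tp = conf[c][c]
        fp = sum(conf[g][c] for g in LABELS) - tp  # predicted c, gold differs
        fn = sum(conf[c][p] for p in LABELS) - tp  # gold c, predicted differs
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f1)
    return conf, scores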


Dimitris
baselines.txt

blinof...@gmail.com

May 6, 2014, 9:51:29 AM
to semeva...@googlegroups.com, Maria Pontiki
Thank you!

Maria Pontiki

May 6, 2014, 11:05:13 AM
to semeva...@googlegroups.com, Maria Pontiki
Hello Joachim,

Yes, the baseline systems will be described in the task overview paper.
The reference should look like this:
 
Maria Pontiki, Dimitrios Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect Based Sentiment Analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Maria

Aitor García

May 6, 2014, 11:40:50 AM
to semeva...@googlegroups.com, Maria Pontiki
Hello Dimitris,

I have a small question: on which data did you run the baselines to obtain those scores?
When I run the provided baseline scripts on the test set with the gold annotations, I get different scores. The same happens if I run them on the train_v2 datasets.

Best regards,

Aitor.

Dimitris Galanis

May 6, 2014, 11:53:48 AM
to Aitor García, semeva...@googlegroups.com, Maria Pontiki
Hello,

The script for laptops is the following:

#!/bin/bash
clear

# Input data: training set and gold-annotated test set.
trainf=Laptop_Train_v2.xml
testf=Laptops_Test_DataD_finalD.xml.gold

# Prediction files written by semeval_base.py.
A=absa--test.predicted-aspect.xml
B=absa--test.predicted-aspectPolar.xml

# Subtask 1: aspect term extraction.
echo -e "\n\n--- Running baseline for SB 1"
python ../semeval_base.py --train $trainf -k $testf --task 1
echo -e "\n\n--- Validating " $A
java -cp ../eval.jar Main.Valid $A ../SemEvalSchema.xsd
echo -e "\n\n--- Evaluating " $A
java -cp ../eval.jar Main.Aspects $A $testf

# Subtask 2: aspect term polarity (run with --task 3 in semeval_base.py).
echo -e "\n\n--- Running baseline for SB 2"
python ../semeval_base.py --train $trainf -k $testf --task 3
echo -e "\n\n--- Validating " $B
java -cp ../eval.jar Main.Valid $B ../SemEvalSchema.xsd
echo -e "\n\n--- Evaluating " $B
java -cp ../eval.jar Main.Polarity $B $testf


The baseline learns from Laptop_Train_v2.xml (the training data) and predicts the terms and polarities for the test sentences.
Then it compares the predictions with the gold annotations and calculates the evaluation measures.
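
(Assuming semeval_base.py, eval.jar, SemEvalSchema.xsd, and the data files sit in the relative locations the script references, you can save it under any name, e.g. run_laptops.sh, and run it with "bash run_laptops.sh" from that directory; a restaurants run would be analogous with the corresponding files.)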

Dimitris.






Dimitris Galanis

May 6, 2014, 12:10:46 PM
to Aitor García, semeva...@googlegroups.com, Maria Pontiki
I forgot to say that the detailed evaluation scores (F1/P/R per class) are calculated by eval.jar.

Dimitris

Aitor García

May 6, 2014, 12:11:11 PM
to semeva...@googlegroups.com, Aitor García, Maria Pontiki
Thank you very much :-)

Pramodith Bprum

Dec 20, 2019, 11:45:45 AM
to SemEval-ABSA
Hi, I can't seem to download the gold test data. I've created an account on metashare, but when I try to log in it says that the account is inactive. Is there some other URL I can use to download the gold tags?