STS benchmark


eneko

Feb 24, 2017, 4:16:41 AM
to sts-s...@googlegroups.com

Dear participant,

We are glad to launch the STS Benchmark. The STS Benchmark comprises a selection
of the English datasets used in the STS tasks we organized in the
context of SemEval between 2012 and 2017.

In order to provide a standard benchmark for comparing systems, we
organized it into train, development and test splits. The development part can
be used to develop and tune the hyperparameters of the systems, and the test
part should be used only once, for the final system.
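For illustration only (this sketch is not part of the original announcement): the intended protocol is to train on the train split, tune on dev, and report the Pearson correlation between system scores and the gold scores, the official STS metric, once on test. A minimal plain-Python sketch of the metric, on made-up scores:

```python
import math

def pearson(gold, pred):
    """Pearson correlation between gold and system similarity scores,
    the official STS evaluation metric."""
    n = len(gold)
    mg, mp = sum(gold) / n, sum(pred) / n
    cov = sum((g - mg) * (p - mp) for g, p in zip(gold, pred))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gold))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (sg * sp)

# Toy example: gold scores on the 0-5 STS scale vs. system predictions.
gold = [0.0, 1.5, 2.0, 3.5, 5.0]
pred = [0.2, 1.0, 2.5, 3.0, 4.8]
r = pearson(gold, pred)  # approx. 0.974 for these made-up numbers
```

In practice one would read the gold scores and sentence pairs from the distributed files and replace `pred` with real system output; the file layout is documented on the benchmark page linked below.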

Please find all details in http://ixa.si.ehu.eus/stswiki/STSbenchmark

We think that this benchmark can be used in the future for establishing the
state of the art in Semantic Textual Similarity for English.

We would like to gather the results of the best participants in 2017, so
we can compile a leaderboard at
http://ixa.si.ehu.eus/stswiki/STSbenchmark. We would also like to
include a section in the task description paper. Those interested,
please train one of your runs using just the train dataset provided, and
send us your development and test results by the end of March,
mentioning the CodaLab run ID.

best

eneko


--

Eneko Agirre
Euskal Herriko Unibertsitatea
University of the Basque Country
http://ixa2.si.ehu.eus/eneko

Basma Hassan

Feb 24, 2017, 6:28:39 AM
to sts-s...@googlegroups.com
Dear Eneko,
It is great to know that the STS Benchmark has launched, and I am interested in running my system on it. Unfortunately, the link you sent does not open successfully; I got the page shown in the attached screenshot.

[screenshot attached]

Regards,
Basma

----
Basma Hassan Kamal,
Assistant Lecturer,
Computer Science Department,
Faculty of Computers and Information,
Fayoum University, Egypt




--
Website of task: http://alt.qcri.org/semeval2017/task1/
To post to this group, send email to sts-s...@googlegroups.com
To unsubscribe, send email to sts-semeval+unsubscribe@googlegroups.com
For more options: http://groups.google.com/group/sts-semeval?hl=en
---
You received this message because you are subscribed to the Google Groups "STS SemEval" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sts-semeval+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

eneko

Feb 24, 2017, 9:19:04 AM
to sts-s...@googlegroups.com


Sorry about that; the correct URL is http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark


On Feb 24, 2017 at 12:27, Basma Hassan wrote:

w0927...@gmail.com

Feb 26, 2017, 1:48:32 PM
to STS SemEval
Hi Eneko,

Thanks for providing this standard STS benchmark!

We have trained our three runs using the data provided and the same settings as in our system description paper. Below are our results on the Dev and Test sets. By the way, is there anything else that needs to be provided?

--------------------------------------------------------------
Run ID   Model     Dev      Test
386608   RF        0.8333   0.7993
386610   GB        0.8356   0.8022
386611   EN-seven  0.8466   0.8100
--------------------------------------------------------------

Best wishes 
Junfeng Tian

Eneko Agirre

Mar 6, 2017, 1:32:04 PM
to sts-s...@googlegroups.com


Hi Junfeng,

thanks for sending your results, I have already added them to http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Just to double check, please confirm whether your system was trained on the train data in STSbenchmark, or whether you used additional training data.

best

eneko




On 02/26/2017 09:07 AM, w0927...@gmail.com wrote:

w0927...@gmail.com

Mar 7, 2017, 5:53:40 AM
to STS SemEval
Hi Eneko,

Our system was trained on the train data (5,749 pairs) in STSbenchmark, without other training data.

best

junfeng Tian

sndr....@gmail.com

Mar 10, 2017, 4:29:49 AM
to STS SemEval
Thank you very much; this idea is great, but the link is broken.

Could you help me find the correct link? By the way, could you also help me find code for good solutions?

Thanks,
Sander

eneko

Mar 10, 2017, 4:30:35 AM
to sts-s...@googlegroups.com


http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark


On Mar 9, 2017 at 21:29, sndr....@gmail.com wrote:

Ergun Bicici

Mar 11, 2017, 6:50:13 AM
to sts-s...@googlegroups.com

Dear Eneko,

The benchmark is only a subset of the released data. Is there a reason for that?

Do you plan to make additional English datasets available in the future, to be added on top of STSbenchmark, or do you suggest discarding the training data previously compiled for English from the STS websites and using the benchmark dataset from now on?


Best Regards,
Ergun

Ergun Biçici


Ergun Bicici

Mar 11, 2017, 8:15:00 AM
to sts-s...@googlegroups.com

Additional comments:
- The dev set can also be used for training models, in addition to tuning (you asked: "Just to double check, please confirm whether your system was trained on the train data in STSbenchmark, or whether you used additional training data.").
- The total number of sentences now distributed for STS English is smaller:
  benchmark: 8628
  others: 5750 (actually contains 5650 sentences)
  total: 14378

What does not match the STSbenchmark datasets (http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) is listed partially in the following table:
dataset     OnWN  FNWN  headl  images  twt-nws  dft-nws  dft-for  MSRpar  MSRvid  SMT  SMTeur  SMTnws  belief  ans-st  ans-for  Total
trial2012
train2012      0     0      0       0        0        0        0     750     750    0     734       0       0       0        0   2234
test2012     750     0      0       0        0        0        0     750     750    0     459     399       0       0        0   3108
trial2013      5     5      5       0        0        0        0       0       0    5       0       0       0       0        0     20
test2013     561   189    750       0        0        0        0       0       0  750       0       0       0       0        0   2250
trial2014      5     0      5       5        5        5        5       0       0    0       0       0       0       0        0     30
test2014     750     0    750     750      750      300      450       0       0    0       0       0       0       0        0   3750
Total       2071   194   1510     755      755      305      455    1500    1500  755    1193     399       0       0        0  11392
trial2015      0     0      5       5        0        0        0       0       0    0       0       0      10      25       25     70
test2015       0     0    750     750        0        0        0       0       0    0       0       0     375     750      375   3000
Total       2071   194   2265    1510      755      305      455    1500    1500  755    1193     399     385     775      400  14462

100 more

2016 adds new datasets (post-editing, plagiarism, answer-answer, question-question):
test2016     headlines 249, post-editing 244, plagiarism 230, answer-answer 254, question-question 209           1186
Total (through 2016): 15648
track5.en-en test set: 250
14278 = 15648 - 755 - 775 - 90 + 250

These are:
- 100 additional instances for SMTeuroparl
- 90 missing trial instances
- the SMT and answers-students datasets are discarded; together they total 1,530 instances.

Additionally, the table does not include the 5 trial instances from 2012 that I mentioned before. The 2012 trial dataset scoring appears sorted, and the true scores appear to be missing:
http://ixa2.si.ehu.es/stswiki/images/d/d3/STS2012-en-trial.zip
For instance, the first sentence pair could get a 5 but is scored 0:
The bird is bathing in the sink.        Birdie is washing itself in the water basin.

The corresponding error rate is 1.4% (195 / 14,373).


Best Regards,
Ergun

Ergun Biçici


Eneko Agirre

Mar 15, 2017, 11:23:00 AM
to sts-s...@googlegroups.com


Dear Ergun,

thanks for your questions.

We separated the datasets released in past Semeval STS campaigns into the STS benchmark and the Companion dataset.

The STS Benchmark focuses on image captions, news headlines and user forums, in order to reduce the variability of genres and provide a standard benchmark for comparing meaning representation systems in future years. We would love to see competing meaning representation proposals (e.g. Arora et al. 2017; Mu et al. 2017; Wieting et al. 2016) being evaluated on the STS Benchmark, alongside the results of SemEval participants.

I'm not sure about the second question. The STS benchmark dataset is frozen, and we do not expect it to change in the near future.

best

eneko



On 03/11/2017 12:49 PM, Ergun Bicici wrote:

Tsuki

Mar 20, 2017, 12:03:44 PM
to STS SemEval
Dear Eneko,

In the case of unsupervised approaches that rely on word/phrase embeddings, the 17K sentences (train, dev and test combined) are not enough for training good word embeddings. Therefore, approaches that rely on vector representations obtained using more training data cannot participate in this benchmark.
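As an illustration of the kind of system discussed above (this sketch is not from the original thread): a common unsupervised baseline averages pretrained word vectors and scores a pair by cosine similarity, so the representation quality comes from the external corpus rather than from the benchmark sentences. The tiny 3-dimensional vectors below are invented for the sketch; a real system would load pretrained embeddings such as 300-dimensional GloVe or word2vec vectors:

```python
import math

def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of in-vocabulary tokens (simple unsupervised baseline)."""
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Invented 3-d "pretrained" vectors, for illustration only.
emb = {
    "bird":   [0.9, 0.1, 0.0],
    "birdie": [0.8, 0.2, 0.1],
    "water":  [0.1, 0.9, 0.2],
    "sink":   [0.2, 0.8, 0.3],
}
# The example pair quoted elsewhere in this thread:
s1 = sentence_vector("The bird is bathing in the sink", emb, 3)
s2 = sentence_vector("Birdie is washing itself in the water basin", emb, 3)
sim = cosine(s1, s2)   # close to 1 for this related pair
sts = 5 * sim          # a crude mapping of cosine onto the 0-5 STS scale
```

Scaling the cosine by 5 is only a heuristic; supervised systems instead fit a regressor from features to the gold 0-5 scores.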

Best regards,
Stefania

Eneko Agirre

Mar 22, 2017, 6:04:33 AM
to sts-s...@googlegroups.com


Dear Tsuki,

thanks for your comment.

It is totally OK to use vector representations learned from external corpora. We now realize that the wording was confusing, so we will try to make it clearer.

We welcome any other suggestions which would make the STS benchmark more useful for the community.

best

eneko


On 03/20/2017 05:03 PM, Tsuki wrote:

nabin maharjan

Mar 23, 2017, 5:10:13 PM
to STS SemEval
Hi Eneko,

Below is the best result on the benchmark dataset obtained by our system (run ID: 385819).

We trained the system with the benchmark training data alone, with the same settings as described in our system description paper.

........................................
Run ID   Model                          Dev       Test
385819   Gradient Boosting Regressor    0.83012   0.79195
........................................

Thanks,
Nabin Maharjan

Eneko Agirre

Apr 6, 2017, 10:19:54 AM
to sts-s...@googlegroups.com


thanks again!

I would be grateful if you could please check
http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results and Section
6 of the attached PDF, and ensure that everything said with respect to
your system is correct.

Any comments welcome!

best

eneko


On 03/23/2017 10:10 PM, nabin maharjan wrote:

SemEval2017_STS.pdf

Eneko Agirre

Apr 6, 2017, 10:20:57 AM
to sts-s...@googlegroups.com

Sorry for the previous mail; it was intended for a specific participant. Please ignore it.

best

eneko


On 04/06/2017 04:19 PM, Eneko Agirre wrote:

Ergun Bicici

Apr 7, 2017, 12:08:06 PM
to sts-s...@googlegroups.com

Dear Eneko,

The STS Benchmark could potentially be included as an entry in the ACL wiki state-of-the-art pages.
Thank you for expertly preparing the STS datasets on a yearly basis.

Q1) About the train/dev/test split: do you only want to see results that use the train dataset for training and dev for tuning? If I prepare a model that uses train+dev for training, will you consider the results valid?

I looked into what could have lowered RTM's performance on en-en this year:

                          RAE     MAER    MRAER
STS 2017  English  ALL    0.85    0.87    1.04
STS 2016  English  ALL    0.673   0.5954  0.719
STS 2015  English  ALL    0.722   0.7379  0.788
STS 2014  English  ALL    0.745   0.7274  0.757
STS 2013  English  ALL    0.779   0.8494  0.77

This was also the case before for some domains and tasks where MRAER is > 1 [1]; such entries can correspond to new tasks, out-of-domain tasks, or harder tasks. Considering these results, en-en is harder in STS 2017, and it is at an extreme due to the lowest average score:
average STS test score 2017 = 2.2776
average STS test score 2016 = 2.4132
average STS test score 2015 = 2.4059
average STS test score 2014 = 2.8114
average STS test score 2013 = 2.9555
average STS test score 2012 = 3.5061
and the lowest average number of words in the test set:
average STS test # words 2017 = 8.7
average STS test # words 2016 = 11.38
average STS test # words 2015 = 10.45
average STS test # words 2014 = 9.13
average STS test # words 2013 = 14.66
average STS test # words 2012 = 10.78
This could have decreased RTM's performance on en-en. The Stanford Natural Language Inference (SNLI) corpus focuses on inference and entailment tasks, and in the case of entailment we assume a direction: one of the two sentences is entailed to some degree by the other. This may be the reason behind the difference in the distributions, yet you clarified the difference before:
"STS is different from TE inasmuch as it assumes bidirectional graded equivalence between a pair of textual snippets" [1]

If we look at the correlation (r) results, we observe that, in contrast, the top r and top RTM model's r increased in 2017:
top r in STS 2017         = 0.8518 (ECNU)
top RTM r in STS 2017 = 0.71 (unofficial results)
top r in STS 2016         = 0.7781 (Samsung Poland NLP Team)
top RTM r in STS 2016 = 0.6746 (unofficial results [4])
top r in STS 2015         = 0.8015 (DLS@CU-S1)
top RTM r in STS 2015 = 0.67 (unofficial results, [1])
top r in STS 2014         = 0.7610 (DLS@CU-run2)
top RTM r in STS 2014 = 0.65 (unofficial results, [1])
top r in STS 2013         = 0.6181 (UMBC_EBIQUITY-ParingWords)
top RTM r in STS 2013 = 0.58 (unofficial results, [1])

About the filtering process also mentioned in [2,6]: an idea that comes up is whether we can make these datasets self-sufficient, filtering and selecting them against each other so that they become a consistent, representative set for the STS English, STS Spanish, etc. tasks. We would maintain some expected MRAER level, testing on a validation set as the filtering proceeds, similar to instance selection for active learning, which improved SMT performance [3]. So, every year, a self-sufficient subset could be selected that maintains some consistency criterion, now that you have close to 20,000 instances.

A related idea is transfer learning (TL, e.g. using models developed for handwritten digit recognition for handwritten character recognition): could we use systems built for other tasks, and if so, what would the performance be? This is cross-task TL (https://www.youtube.com/watch?v=9ChVn3xVNDI): we have the same domain, STS, but we use the models for different tasks (e.g. the en-en RTM model for ar-ar).

Unfortunately, cross use also does not save the en-es and es-en results. The change in the test sets made their results an outlier, with MRAER scores larger than 1.

Cross use increases the correlation in en-en from 0.71 to 0.73 using es-es.


Q2) About the second question you mentioned: I would like more clarification on the usefulness and inclusion of previous STS datasets for supervised learning models in future STS tasks. For instance, for STS en-en, do you recommend training on the STSbenchmark dataset or using all of the previously available data? Based on the RTM results on STS en-en 2017 alone, discarding SNLI from the benchmark dataset, as you did, is a good idea.

Thank you again for organizing STS; it is likely that I will not attend this year; however, I am working on the paper to make my contribution more readable and accessible.


References:
[1] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics, Denver, Colorado, pages 252-263.

[2] Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. June 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). http://aclanthology.info/papers/semeval-2016-task-1-semantic-textual-similarity-monolingual-and-cross-lingual-evaluation

[3] Ergun Bicici and Deniz Yuret. 2015. Optimizing Instance Selection for Statistical Machine Translation with Feature Decay Algorithms. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 23:339-350. doi:10.1109/TASLP.2014.2381882

[4] Ergun Bicici and Andy Way. 2015. Referential translation machines for predicting semantic similarity. Language Resources and Evaluation, pp. 1-27. ISSN: 1574-020X. http://dblp.org/rec/journals/lre/BiciciW16

[5] Ergun Bicici. June 2016. RTM at SemEval-2016 Task 1: Predicting Semantic Similarity with Referential Translation Machines and Related Statistics. In SemEval-2016: Semantic Evaluation Exercises - International Workshop on Semantic Evaluation, San Diego, CA, USA.

[6] Related comment: the probabilities in surface lexical similarity are estimated over the evaluation set data. I assume this to be over all of the ~14,000 pairs (including training and test sets).


Best Regards,
Ergun

Ergun Biçici




Ergun Bicici

Apr 7, 2017, 12:10:43 PM
to sts-s...@googlegroups.com

This was also the case before for some domains and tasks where MRAER is > 1 [4].


Best Regards,
Ergun

Ergun Biçici

