STS benchmark


eneko

unread,
Feb 24, 2017, 4:16:41 AM2/24/17
to sts-s...@googlegroups.com

Dear participant,

We are glad to launch the STS Benchmark. It comprises a selection of the
English datasets used in the STS tasks we organized in the context of
SemEval between 2012 and 2017.

In order to provide a standard benchmark for comparing systems, we
organized it into train, development and test partitions. The development
part can be used to develop systems and tune their hyperparameters, and
the test part should only be used once, for the final system.
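
For illustration, here is a minimal (untested) sketch of this protocol in Python, assuming the tab-separated sts-train.csv, sts-dev.csv and sts-test.csv files from the download, with the gold score in the fifth column and the two sentences in the sixth and seventh:

# Minimal sketch of the intended train/dev/test protocol (assumed file layout).
import csv

def load_sts(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            score, s1, s2 = float(row[4]), row[5], row[6]
            pairs.append((score, s1, s2))
    return pairs

train = load_sts("sts-train.csv")  # fit models here
dev   = load_sts("sts-dev.csv")    # tune hyperparameters here
test  = load_sts("sts-test.csv")   # evaluate only once, for the final system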

Please find all details in http://ixa.si.ehu.eus/stswiki/STSbenchmark

We think that this benchmark can be used in the future for establishing the
state of the art in Semantic Textual Similarity for English.

We would like to collect the results of the best participants in 2017, so
we can build a leaderboard at
http://ixa.si.ehu.eus/stswiki/STSbenchmark. We would also like to
include a section on these results in the task description paper. If you
are interested, please train one of your runs using just the train dataset
provided, and send us your development and test results by the end of
March, mentioning the CodaLab run ID.

best

eneko


--

Eneko Agirre
Euskal Herriko Unibertsitatea
University of the Basque Country
http://ixa2.si.ehu.eus/eneko

Basma Hassan

unread,
Feb 24, 2017, 6:28:39 AM2/24/17
to sts-s...@googlegroups.com
Dear Eneko,
It is great to know that the STS benchmark has been launched, and I am interested in running my system on it. Unfortunately, the link you sent does not open successfully; I get the page shown in the attached screenshot.

[inline image: screenshot of the error page]

Regards,
Basma

----
Basma Hassan Kamal,
Assistant Lecturer,
Computer Science Department,
Faculty of Computers and Information,
Fayoum University, Egypt





eneko

unread,
Feb 24, 2017, 9:19:04 AM2/24/17
to sts-s...@googlegroups.com


Sorry about that, the correct URL is: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark


On Friday, February 24, 2017 at 12:27, Basma Hassan wrote:

w0927...@gmail.com

unread,
Feb 26, 2017, 1:48:32 PM2/26/17
to STS SemEval
Hi Eneko,

Thanks for providing such a standard STS benchmark!

We have trained our three runs using the data provided and the same settings as in our system description paper. Our results on the dev and test sets are below. By the way, is there anything else that needs to be provided?

--------------------------------------------------------------
run ID    Model      Dev       Test
386608    RF         0.8333    0.7993
386610    GB         0.8356    0.8022
386611    EN-seven   0.8466    0.8100
--------------------------------------------------------------
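
(For reference, dev and test figures like these are Pearson correlations between the system scores and the gold scores; a minimal sketch of that computation, assuming two aligned lists of scores:)

# Pearson correlation between system scores and gold scores (aligned lists).
from scipy.stats import pearsonr

def sts_score(predictions, gold):
    r, _p_value = pearsonr(predictions, gold)
    return r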

Best wishes 
Junfeng Tian

Eneko Agirre

unread,
Mar 6, 2017, 1:32:04 PM3/6/17
to sts-s...@googlegroups.com


Hi Junfeng,

Thanks for sending your results; I have already added them to http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

Just to double check, please confirm whether your system was trained on the train data in STSbenchmark, or whether you used additional training data.

best

eneko




On 02/26/2017 at 09:07 AM, w0927...@gmail.com wrote:

w0927...@gmail.com

unread,
Mar 7, 2017, 5:53:40 AM3/7/17
to STS SemEval
Hi eneko,

Our system was trained on the train data (5749 pairs) in STSbenchmark, without any other training data.

best

Junfeng Tian

sndr....@gmail.com

unread,
Mar 10, 2017, 4:29:49 AM3/10/17
to STS SemEval
Thank you very much, this is a great idea, but the link is broken.
Could you help me find the correct link?
By the way, could you also help me find code for good solutions?
Thanks,
Sander

eneko

unread,
Mar 10, 2017, 4:30:35 AM3/10/17
to sts-s...@googlegroups.com


http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark


On Thursday, March 9, 2017 at 21:29, sndr....@gmail.com wrote:

Ergun Bicici

unread,
Mar 11, 2017, 6:50:13 AM3/11/17
to sts-s...@googlegroups.com

Dear Eneko,

The benchmark is only a subset of the released STS data. Is there a reason for this?

Do you plan to make additional English datasets available in the future that we can add on top of STSbenchmark, or
do you suggest discarding the training data compiled for English from the STS websites and using the benchmark dataset from now on?


Best Regards,
Ergun

Ergun Biçici



Ergun Bicici

unread,
Mar 11, 2017, 8:15:00 AM3/11/17
to sts-s...@googlegroups.com

Additional comments:
- The dev set can be used for training models as well, in addition to being used for tuning. (Note that you asked: "Just to double check, please confirm whether your system was trained on the train data in STSbenchmark, or whether you used additional training data.")
- The total number of sentences now distributed for STS English is smaller:
benchmark: 8628
others: 5750 (actually contains 5650 sentences)
total: 14378

What does not match the STSbenchmark datasets (http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) is partially listed in the following table:
trial2012:  -
train2012:  MSRpar 750, MSRvid 750, SMTeuroparl 734   (total 2234)
test2012:   OnWN 750, MSRpar 750, MSRvid 750, SMTeuroparl 459, SMTnews 399   (total 3108)
trial2013:  OnWN 5, FNWN 5, headlines 5, SMT 5   (total 20)
test2013:   OnWN 561, FNWN 189, headlines 750, SMT 750   (total 2250)
trial2014:  OnWN 5, headlines 5, images 5, tweet-news 5, deft-news 5, deft-forum 5   (total 30)
test2014:   OnWN 750, headlines 750, images 750, tweet-news 750, deft-news 300, deft-forum 450   (total 3750)
Total through 2014:  OnWN 2071, FNWN 194, headlines 1510, images 755, tweet-news 755, deft-news 305, deft-forum 455, MSRpar 1500, MSRvid 1500, SMT 755, SMTeuroparl 1193, SMTnews 399   (total 11392)
trial2015:  headlines 5, images 5, belief 10, answers-students 25, answers-forums 25   (total 70)
test2015:   headlines 750, images 750, belief 375, answers-students 750, answers-forums 375   (total 3000)
Total through 2015:  OnWN 2071, FNWN 194, headlines 2265, images 1510, tweet-news 755, deft-news 305, deft-forum 455, MSRpar 1500, MSRvid 1500, SMT 755, SMTeuroparl 1193, SMTnews 399, belief 385, answers-students 775, answers-forums 400   (total 14462)
(100 more)
test2016:   headlines 249, post-editing 244, plagiarism 230, answer-answer 254, question-question 209   (total 1186)
Total through 2016:  15648
track5.en-en test set:  250
14278 = 15648 - 755 (SMT) - 775 (answers-students) - 90 (trial) + 250 (track5.en-en)

These are:
- 100 additional instances for SMTeuroparl
- 90 missing trial instances
- the SMT and answers-students datasets are discarded, totalling 1530 instances.

Additionally, the table does not include the 5 trial instances from 2012 that I mentioned before. The 2012 trial dataset scoring appears to be sorted, and the true scores appear to be missing:
http://ixa2.si.ehu.es/stswiki/images/d/d3/STS2012-en-trial.zip
For instance, the first sentence pair could get a 5 but is scored 0:
The bird is bathing in the sink.        Birdie is washing itself in the water basin.

The corresponding error rate is 1.4% (195 / 14373).


Best Regards,
Ergun

Ergun Biçici


Eneko Agirre

unread,
Mar 15, 2017, 11:23:00 AM3/15/17
to sts-s...@googlegroups.com


Dear Ergun,

thanks for your questions.

We separated the datasets released in past Semeval STS campaigns into the STS benchmark and the Companion dataset.

The STS benchmark focuses on image captions, news headlines and user forums, in order to reduce the variability of genres and provide a standard benchmark for comparing meaning representation systems in future years. We would love to see competing meaning representation proposals (e.g. Arora et al. 2017; Mu et al. 2017; Wieting et al. 2016) being evaluated on the STS benchmark, alongside the results of SemEval participants.

I'm not sure about the second question. The STS benchmark dataset is frozen, and we do not expect it to change in the near future.

best

eneko



On 03/11/2017 at 12:49 PM, Ergun Bicici wrote:
--

Eneko Agirre
Euskal Herriko Unibertsitatea
Universidad del Pais Vasco

Tsuki

unread,
Mar 20, 2017, 12:03:44 PM3/20/17
to STS SemEval
Dear Eneko,

In the case of unsupervised approaches that rely on word/phrase embeddings, the 17K sentences (train, dev and test combined) are not enough for training good word embeddings. Therefore, approaches that rely on vector representations obtained using more training data cannot participate in this benchmark.

Best regards,
Stefania


Eneko Agirre

unread,
Mar 22, 2017, 6:04:33 AM3/22/17
to sts-s...@googlegroups.com


Dear Tsuki,

Thanks for your comment.

It is totally OK to use vector representations learned from external corpora. We now realize that the wording was confusing, so we will try to be clearer about that.
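
For instance, a purely illustrative unsupervised sketch along those lines (hypothetical, not an official baseline) could average externally trained word vectors and score each pair by cosine similarity:

# Hypothetical unsupervised baseline: average externally trained word vectors
# per sentence and score the pair by cosine similarity, mapped to the 0-5 range.
import numpy as np

def sentence_vector(sentence, word_vectors, dim=300):
    # word_vectors: any externally trained {word: numpy array} mapping
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def sts_similarity(s1, s2, word_vectors, dim=300):
    v1 = sentence_vector(s1, word_vectors, dim)
    v2 = sentence_vector(s2, word_vectors, dim)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    cosine = float(v1 @ v2) / denom if denom else 0.0
    return 5.0 * max(cosine, 0.0)  # clip negatives and scale to the 0-5 STS range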

We welcome any other suggestions which would make the STS benchmark more useful for the community.

best

eneko


On 03/20/2017 at 05:03 PM, Tsuki wrote:

nabin maharjan

unread,
Mar 23, 2017, 5:10:13 PM3/23/17
to STS SemEval
Hi eneko,

Below is the best result obtained by our system on the benchmark dataset (run ID: 385819).

We trained the system on the benchmark training data alone, with the same settings as described in our system description paper.

........................................
run ID    Model                          Dev        Test
385819    Gradient Boosting Regressor    0.83012    0.79195

Thanks,
Nabin Maharjan

Eneko Agirre

unread,
Apr 6, 2017, 10:19:54 AM4/6/17
to sts-s...@googlegroups.com


thanks again!

I would be grateful if you could please check
http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results and Section
6 of the attached pdf, and ensure that everything said with respect to
your system is correct.

Any comments welcome!

best

eneko


On 03/23/2017 at 10:10 PM, nabin maharjan wrote:

--

Eneko Agirre
Euskal Herriko Unibertsitatea
Universidad del Pais Vasco
SemEval2017_STS.pdf

Eneko Agirre

unread,
Apr 6, 2017, 10:20:57 AM4/6/17
to sts-s...@googlegroups.com

Sorry for the previous mail; it was intended for a specific participant.
Please ignore it.

best

eneko


On 04/06/2017 at 04:19 PM, Eneko Agirre wrote:

Ergun Bicici

unread,
Apr 7, 2017, 12:08:06 PM4/7/17
to sts-s...@googlegroups.com

Dear Eneko,

The STS benchmark could potentially be included as an entry in the ACL wiki state-of-the-art pages.
Thank you for preparing the STS datasets on a yearly basis as an expert.

Q1) About the train/dev/test split: do you only want to see results that use the train dataset for training and the dev set for tuning? If I prepare a model that uses train+dev for training, will you consider the results valid?

I looked into what could have lowered RTMs' performance on en-en this year:
                        RAE      MAER      MRAER
STS 2017  English  ALL  0.85     0.87      1.04
STS 2016  English  ALL  0.673    0.5954    0.719
STS 2015  English  ALL  0.722    0.7379    0.788
STS 2014  English  ALL  0.745    0.7274    0.757
STS 2013  English  ALL  0.779    0.8494    0.77
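
(For reference, a rough sketch of how these metrics can be computed, assuming the definitions in [4,5]: RAE is the total absolute error relative to that of a predictor always outputting the target mean; MAER is the mean absolute error relative to each target's magnitude; MRAER is the mean absolute error relative to each target's deviation from the mean; denominators are floored to avoid division by zero.)

# Sketch of the relative-error metrics above (assumed definitions, see note).
import numpy as np

def rae(gold, pred):
    gold, pred = np.asarray(gold, float), np.asarray(pred, float)
    return np.sum(np.abs(gold - pred)) / np.sum(np.abs(gold - gold.mean()))

def maer(gold, pred, eps=1e-3):
    gold, pred = np.asarray(gold, float), np.asarray(pred, float)
    return np.mean(np.abs(gold - pred) / np.maximum(np.abs(gold), eps))

def mraer(gold, pred, eps=1e-3):
    gold, pred = np.asarray(gold, float), np.asarray(pred, float)
    return np.mean(np.abs(gold - pred) / np.maximum(np.abs(gold - gold.mean()), eps))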

This was also the case before for some domains and tasks where MRAER is > 1 [1]:

[inline image: table of tasks and domains with MRAER > 1]

where the entries can be new tasks, out-of-domain tasks, or harder tasks. Considering these results, en-en is harder in STS 2017, and it is at an extreme due to having the lowest average score:
average STS test score 2017 = 2.2776
average STS test score 2016 = 2.4132
average STS test score 2015 = 2.4059
average STS test score 2014 = 2.8114
average STS test score 2013 = 2.9555
average STS test score 2012 = 3.5061
and the lowest average number of words in the test set:
average STS test # words 2017 = 8.7
average STS test # words 2016 = 11.38
average STS test # words 2015 = 10.45
average STS test # words 2014 = 9.13
average STS test # words 2013 = 14.66
average STS test # words 2012 = 10.78
This could have decreased RTMs' performance on en-en. The Stanford Natural Language Inference (SNLI) corpus focuses on inference and entailment tasks, and in the case of entailment we assume a direction: one of the two sentences is entailed to some degree by the other. This may be the reason behind the difference in the distributions, although you clarified the difference before:
"STS is different from TE inasmuch as it assumes bidirectional graded equivalence between a pair of textual snippets" [1]

If we look at the correlation (r) results, we observe that, in contrast, the top r and top RTM model's r increased in 2017:
top r in STS 2017         = 0.8518 (ECNU)
top RTM r in STS 2017 = 0.71 (unofficial results)
top r in STS 2016         = 0.7781 (Samsung Poland NLP Team)
top RTM r in STS 2016 = 0.6746 (unofficial results [4])
top r in STS 2015         = 0.8015 (DLS@CU-S1)
top RTM r in STS 2015 = 0.67 (unofficial results, [1])
top r in STS 2014         = 0.7610 (DLS@CU-run2)
top RTM r in STS 2014 = 0.65 (unofficial results, [1])
top r in STS 2013         = 0.6181 (UMBC_EBIQUITY-ParingWords)
top RTM r in STS 2013 = 0.58 (unofficial results, [1])

About the filtering process also mentioned in [2,6], an idea that comes up is whether we can make these datasets self-sufficient: filter and select them against each other so that they become a consistent, representative set for the STS English, STS Spanish, etc. tasks, where we maintain some expected MRAER level on a validation set as the filtering continues, similar to the instance selection for active learning that improved SMT performance [3]. So, every year, a self-sufficient subset could be selected that maintains some consistency criterion, now that you have close to 20000 instances.

A related idea is transfer learning (TL, e.g. using models developed for handwritten digit recognition for handwritten character recognition): could we use systems built for other tasks, and if so, what would the performance be? This is cross-task TL (https://www.youtube.com/watch?v=9ChVn3xVNDI); we have the same domain, STS, but we use the models for different tasks (e.g. the en-en RTM model for ar-ar):

[inline images: cross-task RTM transfer results]

Unfortunately, cross use also does not save the en-es and es-en results. The change in the test sets made their results outliers, with MRAER scores larger than 1.

Cross use increases the correlation on en-en from 0.71 to 0.73 when using the es-es model.


Q2) About the second question you mentioned, I would like more clarification on the usefulness and inclusion of previous STS datasets for supervised learning models in future STS tasks. For instance, for STS en-en, do you recommend training on the STSbenchmark dataset or using all of the data available from before? Based only on the RTM results on STS en-en 2017, discarding SNLI from the benchmark dataset, as you did, is a good idea.

Thank you again for organizing STS; it is likely that I will not attend this year; however, I am working on the paper to make my contribution more readable and accessible.


References:
[1] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics, Denver, Colorado, pages 252-263.

[2] Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). http://aclanthology.info/papers/semeval-2016-task-1-semantic-textual-similarity-monolingual-and-cross-lingual-evaluation

[3] Ergun Bicici and Deniz Yuret. Optimizing Instance Selection for Statistical Machine Translation with Feature Decay Algorithms. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 23:339-350, 2015. doi:10.1109/TASLP.2014.2381882

[4] Ergun Bicici and Andy Way. Referential Translation Machines for Predicting Semantic Similarity. Language Resources and Evaluation, pp. 1-27, 2015. ISSN: 1574-020X. http://dblp.org/rec/journals/lre/BiciciW16

[5] Ergun Bicici. RTM at SemEval-2016 Task 1: Predicting Semantic Similarity with Referential Translation Machines and Related Statistics. In SemEval-2016: Semantic Evaluation Exercises - International Workshop on Semantic Evaluation, San Diego, CA, USA, June 2016.

[6] Related comment: the probabilities in surface lexical similarity are estimated over the evaluation set data. I assume this to be over all of the ~14000 pairs (including training and test sets).


Best Regards,
Ergun

Ergun Biçici


Ergun Bicici

unread,
Apr 7, 2017, 12:10:43 PM4/7/17
to sts-s...@googlegroups.com

This was also the case before for some domains and tasks where MRAER is > 1 [4]:

[inline image: table of tasks and domains with MRAER > 1]


Best Regards,
Ergun

Ergun Biçici

