Final versions of datasets on GitHub

Sapna Negi

Oct 28, 2018, 12:53:50 PM
to SemEval 2019: Task 9
Dear participants,

The final versions of the datasets are now available in our GitHub repos, incorporating the corrections to the previous versions that some of you pointed out.

In case you are not already aware, please note that the evaluation phase will tentatively start on the 10th of January, when a fresh evaluation dataset will be uploaded to CodaLab.
Currently, our CodaLab leaderboard reflects the trial phase results only. The trial test data can be treated as a validation set for your final SemEval submissions.

We would also like to remind you that we have a separate CodaLab website for Subtask B, i.e. cross-domain suggestion mining.


Regards
Task organisers


Ananda Seelan

Nov 14, 2018, 9:06:09 PM
to SemEval 2019: Task 9
Hello,
There seem to be a number of duplicate lines in the training file. I checked the Subtask-A training file and have attached the list of duplicate row ids.

Cheers
duplicate_ids.txt

Sapna Negi

Nov 15, 2018, 10:43:33 AM
to Ananda Seelan, SemEval 2019: Task 9
Hi Ananda, All,

Thanks for pointing this out. We will upload the revised file. About 219 rows are duplicated.

However, this should have a minimal effect, so you may not need to re-train your model right away just to update the trial leaderboard.

Sapna

--
You received this message because you are subscribed to the Google Groups "SemEval 2019: Task 9" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semeval-2019-ta...@googlegroups.com.
To post to this group, send email to semeval-2...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/semeval-2019-task-9/b64468a1-bb8e-4ba9-be42-d7ccaa957576%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
<duplicate_ids.txt>

Scarecrow

Dec 5, 2018, 1:13:31 AM
to SemEval 2019: Task 9
Hello Sapna,
The duplicates that I reported earlier were based only on the sentence ids. However, there also seem to be duplicates of the actual training sentences themselves.
We found 900+ duplicate examples, which is almost 10% of the total data. For a relatively smaller subset, perhaps this calls for more diligent curation of training data?

Regards,
Ananda Seelan
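
The two kinds of duplication discussed in this thread, repeated row ids and repeated sentence text, can be checked separately. Below is a minimal sketch, assuming the training file is a CSV with `id, sentence, label` columns. The column layout, the `find_duplicates` helper, and the sample rows are assumptions made for illustration and are not taken from the task repository.

```python
import csv
import io
from collections import Counter


def find_duplicates(rows):
    """Return (duplicate_ids, duplicate_sentences) for (id, sentence, label) rows."""
    rows = list(rows)
    id_counts = Counter(row[0] for row in rows)
    # Collapse whitespace and lowercase so trivially different copies still match.
    text_counts = Counter(" ".join(row[1].split()).lower() for row in rows)
    dup_ids = sorted(i for i, n in id_counts.items() if n > 1)
    dup_texts = sorted(t for t, n in text_counts.items() if n > 1)
    return dup_ids, dup_texts


# Hypothetical sample rows in the assumed "id,sentence,label" layout.
sample = io.StringIO(
    '1,"Please add a dark mode.",1\n'
    '2,"The app crashes on start.",0\n'
    '2,"The app crashes on start.",0\n'
    '3,"please add a  dark mode.",1\n'
)
dup_ids, dup_texts = find_duplicates(csv.reader(sample))
print(dup_ids)    # ids that occur on more than one row
print(dup_texts)  # normalised sentences that occur more than once
```

Checking by id alone would miss the row with id 3 in this sample, because its sentence repeats under a different id. That is the gap between an id-based count and a sentence-based count.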

Scarecrow

Dec 5, 2018, 1:21:53 AM
to SemEval 2019: Task 9
*relatively smaller dataset

Sapna Negi

Dec 5, 2018, 5:03:06 AM
to Scarecrow, SemEval 2019: Task 9

Hi Ananda,

I think the duplicates could originate from the raw data itself, i.e. duplication of text as a title, re-submission of posts by their authors, or quoting of a post in a different post.

This dataset is a SemEval-specific extension of an older set in which such issues were not encountered. We will investigate this further asap.

Regardless, apologies. We will try to rectify this asap, and will take extra precautions with the evaluation set.

Thanks and regards,
Organizers

Scarecrow

Dec 5, 2018, 5:22:25 AM
to SemEval 2019: Task 9
Thank you for the clarification. Appreciate the prompt response.

Regards,
Ananda Seelan
