Use of monolingual data for Subtask A


T T

Jan 23, 2022, 8:35:31 AM
to semeval-202...@googlegroups.com
Hi SemEval Task 2 organisers,

Can we use unlabelled monolingual data in the Subtask A zero-shot setting (e.g. by collecting example sentences that contain the target idioms)? I assume this is OK, given that we can use pre-trained language models trained on any monolingual data, but I want to make sure.

Best

Takashi Wada

Harish Tayyar Madabushi

Jan 23, 2022, 4:04:48 PM
to semeval-2022-task-2-MWE
Hi, 

I'm afraid that is not allowed. You can only use the training data made available; see the Data Restrictions section of the Task Description website and also this thread for details. If you do use your own training data, your team will not be included in the final rankings, but you can still submit a paper (and we will mention you in the task description paper). Please make sure you mention this when you sign up for the task. See this thread for more information.

You are right that you can use (monolingual or multilingual) models pre-trained on any data, but that data cannot have been influenced by the task in any way (for example, through additional pre-training on sentences containing the target idioms).

For completeness, here is what that section of the site says: 

  1. Subtask A, Zero Shot: Use "train_zero_shot.csv" only. Do NOT use your own training data or the one shot data.

  2. Subtask A, One shot: Use "train_zero_shot.csv" and "train_one_shot.csv" only. Do not use your own training data.

  3. Subtask B, pre-train: Use any training data that you generate which does NOT explicitly consider MWEs or idiomaticity for the assignment of STS scores (no idiom specific data). An example of data that you can NOT use is any data that is similar to the fine-tune setting's training data. For clarity: You are allowed to use sentences containing MWEs (e.g. for pre-training as in the dataset paper) as long as you do not include associated STS scores.

  4. Subtask B, fine-tune: Use train_data.csv and any training data you generate. For clarity: You are allowed to add your own sentences containing STS scores for this setting.


Note that train_one_shot.csv in point 2 above refers to "SubTaskA/TestData/train_one_shot.csv"
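
As a rough illustration only, the two Subtask A rules quoted above could be encoded as a small allow-list check. The setting keys and the function name below are hypothetical; only the file names and the path come from the rules themselves.

```python
from pathlib import PurePosixPath

# Hypothetical setting keys; the file names are the ones named in the rules.
ALLOWED_TRAINING_FILES = {
    "subtask_a_zero_shot": {"train_zero_shot.csv"},
    "subtask_a_one_shot": {"train_zero_shot.csv", "train_one_shot.csv"},
}


def disallowed_files(setting, files_used):
    """Return the training files that the given Subtask A setting does not permit."""
    allowed = ALLOWED_TRAINING_FILES[setting]
    # Compare on the file name only, so paths into the data folders still match.
    return sorted(f for f in files_used if PurePosixPath(f).name not in allowed)


files = ["train_zero_shot.csv", "SubTaskA/TestData/train_one_shot.csv"]
print(disallowed_files("subtask_a_zero_shot", files))
print(disallowed_files("subtask_a_one_shot", files))
```

A check like this only covers which of the provided files are used; it cannot tell whether externally collected data (such as sentences containing the target idioms) has been mixed in.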

Also, note that the kind of training you are suggesting is allowed for Subtask B.

Hope this helps and let us know if something remains unclear. 

Best
Harish