Hi,
Thanks for this question - I should have been clearer in my original post. The description of the Subtask B, pre-train setting should have read as follows:
Subtask B, pre-train: Use any training data that you generate which does NOT explicitly consider MWEs or idiomaticity for the assignment of STS scores (no idiom specific data). An example of data that you can NOT use is any data that is similar to the fine-tune setting's training data. For clarity: You are allowed to use sentences containing MWEs (e.g. for pre-training as in the dataset paper) as long as you do not include associated STS scores.
So, to answer your question: as long as you haven't assigned STS scores to the sentences you have collected, you will fall under the "pre-train" setting. Please note that the test set will contain a different set of idioms, so you should be prepared to collect whatever associated data you need. You will have between 10 January (when the test data will be released) and the end of January (when the evaluation period ends) to do this.
For completeness, I've put down the full list of restrictions below (and have also updated the task description website with this):
- Subtask A, Zero shot: Use "train_zero_shot.csv" only. Do NOT use your own training data or the one shot data.
- Subtask A, One shot: Use "train_zero_shot.csv" and "train_one_shot.csv" only. Do not use your own training data.
- Subtask B, pre-train: Use any training data that you generate which does NOT explicitly consider MWEs or idiomaticity for the assignment of STS scores (no idiom specific data). An example of data that you can NOT use is any data that is similar to the fine-tune setting's training data. For clarity: You are allowed to use sentences containing MWEs (e.g. for pre-training as in the dataset paper) as long as you do not include associated STS scores.
- Subtask B, fine-tune: Use train_data.csv and any training data you generate. For clarity: You are allowed to add your own sentences containing STS scores for this setting.
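If it helps to sanity-check a data pipeline against the list above, the restrictions can be sketched as a small lookup table. This is only an illustrative summary, not an official checker: the setting keys, field names and helper functions below are all made up for this example, and the task description website remains the authoritative source.

```python
# Hypothetical encoding of the restrictions listed above.
# Keys, field names and helpers are illustrative assumptions, not official.
RULES = {
    ("A", "zero_shot"): {
        "provided_files": ["train_zero_shot.csv"],
        "own_data": False,          # no self-generated training data
        "idiom_sts_scores": False,
    },
    ("A", "one_shot"): {
        "provided_files": ["train_zero_shot.csv", "train_one_shot.csv"],
        "own_data": False,
        "idiom_sts_scores": False,
    },
    ("B", "pre_train"): {
        "provided_files": [],       # the post names no provided file here
        "own_data": True,           # allowed, but without idiom-specific STS scores
        "idiom_sts_scores": False,
    },
    ("B", "fine_tune"): {
        "provided_files": ["train_data.csv"],
        "own_data": True,
        "idiom_sts_scores": True,   # own sentences with STS scores are allowed
    },
}


def may_use_file(subtask, setting, filename):
    """Return True if the provided file is listed for this setting."""
    return filename in RULES[(subtask, setting)]["provided_files"]


def may_use_own_data(subtask, setting, has_idiom_sts_scores):
    """Return True if self-generated data is permitted in this setting.

    has_idiom_sts_scores: whether the data assigns STS scores with
    MWEs/idiomaticity explicitly in mind.
    """
    rule = RULES[(subtask, setting)]
    if not rule["own_data"]:
        return False
    return rule["idiom_sts_scores"] or not has_idiom_sts_scores
```

For example, may_use_own_data("B", "pre_train", has_idiom_sts_scores=False) is permitted under this sketch, while the same call with has_idiom_sts_scores=True is not.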
Hope this helps and let me know if anything is unclear.
Best
Harish