Hi,
I'm afraid that is not allowed. You can only use the training data made available; see the Data Restrictions section of the Task Description website and also this thread for details.
If you do use your own training data, your team will not be included in the final rankings, but you can still submit a paper (and we will mention you in the task description paper). Please make sure you mention this when you sign up for the task. See this thread for more information.
You are right in that you can use (monolingual or multilingual) models pre-trained on any data, but that data cannot have been influenced by the task in any way (for example, additional pre-training on sentences containing the target idioms).
For completeness, here is what that section of the site says:
1. Subtask A, Zero Shot: Use "train_zero_shot.csv" only. Do NOT use your own training data or the one shot data.
2. Subtask A, One shot: Use "train_zero_shot.csv" and "train_one_shot.csv" only. Do not use your own training data.
3. Subtask B, pre-train: Use any training data that you generate which does NOT explicitly consider MWEs or idiomaticity for the assignment of STS scores (no idiom specific data). An example of data that you can NOT use is any data that is similar to the fine-tune setting's training data. For clarity: You are allowed to use sentences containing MWEs (e.g. for pre-training as in the dataset paper) as long as you do not include associated STS scores.
4. Subtask B, fine-tune: Use train_data.csv and any training data you generate. For clarity: You are allowed to add your own sentences containing STS scores for this setting.
Note that train_one_shot.csv in point 2 above refers to "SubTaskA/TestData/train_one_shot.csv".
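In case it is useful, the Subtask A file restrictions quoted above can be written down as a small check. This is only an illustrative sketch: the dictionary and the function name `disallowed_files` are made up for this example and are not part of any official task code. Subtask B is left out because its rules depend on how your own data was created, not on file names.

```python
# Illustrative sketch only; names here are hypothetical, not official task code.
# Training files permitted in each Subtask A setting, as quoted from the site.
ALLOWED_TRAINING_FILES = {
    "zero_shot": {"train_zero_shot.csv"},
    "one_shot": {"train_zero_shot.csv", "train_one_shot.csv"},
}


def disallowed_files(setting, files_used):
    """Return, sorted, the files in files_used that the setting does not permit."""
    allowed = ALLOWED_TRAINING_FILES[setting]
    return sorted(set(files_used) - allowed)


if __name__ == "__main__":
    # Using the one shot data in the zero shot setting is flagged.
    print(disallowed_files("zero_shot", ["train_zero_shot.csv", "train_one_shot.csv"]))
```

An empty list means every file used is permitted for that setting.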
Also, note that the kind of training you are suggesting is allowed in Subtask B.
Hope this helps and let us know if something remains unclear.
Best,
Harish