Nick Landia
Jun 30, 2022, 5:54:26 PM
to RecSys-Challenge-2022
Dear Participants,
Looking through the code of the submissions, we have found several cases where test set data was mistakenly used as an input to model training. This is against the rules, which state:
- When you train the model, only use the data from the training file. Do not use any data from the "leaderboard" or "final" test files.
- When predicting, treat each test session independently of all other test sessions (i.e., when predicting for test session B, the model should not have any knowledge of test session A, even if session A came before it in timestamp order).
However, from the code we can see that this is a very easy mistake to make, and we don't want to disqualify many teams because of this oversight. Instead, we are re-opening submissions for about a week. They will open in the next 24 hours and will remain open until 2022-07-06 23:59:59 PST. Please double-check whether your submission uses the test data in a way that is not allowed, and resubmit corrected predictions and code.
Some more information: in almost all of the cases we have come across, the mistake was in a feature-engineering step. Teams would concatenate the training, leaderboard_test and final_test sessions together and do feature engineering on all of the data. This leaks information: the test data influences which features and values are generated. The resulting features are then used in the model training step. That step only uses the training sessions directly, but the feature engineering has already looked at all of the test data, and the model receives this information via the features. This is partly why we decided to re-open submissions rather than disqualify teams directly: all of the cases we have encountered look like honest mistakes rather than deliberate attempts to break the rules, and we felt it is in the spirit of the competition to allow teams to correct this.
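To illustrate the pattern described above, here is a minimal pandas sketch. The toy data and the "item_id"/"session_id" column names are assumptions for illustration; only the file roles (train vs. test sessions) come from the challenge. The first feature is leaky because the test sessions contribute to it; the second is computed from training sessions only.

```python
import pandas as pd

# Toy stand-ins for the session files (columns are hypothetical).
train = pd.DataFrame({"session_id": [1, 1, 2], "item_id": [10, 11, 10]})
leaderboard_test = pd.DataFrame({"session_id": [3], "item_id": [10]})
final_test = pd.DataFrame({"session_id": [4], "item_id": [11]})

# LEAKY: item popularity computed over train + test sessions concatenated.
# The test data has influenced the feature values before training begins.
all_sessions = pd.concat([train, leaderboard_test, final_test])
leaky_popularity = all_sessions["item_id"].value_counts()

# ALLOWED: the same feature computed from the training sessions only.
clean_popularity = train["item_id"].value_counts()
```

Even though the model-fitting step would later use only the training sessions, a model fed `leaky_popularity` has indirectly seen the test data.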
Thanks,
Nick
PS: To be absolutely clear about what data can be used, from all the data files available:
the model can use all data from:
train_purchases.csv
train_sessions.csv
item_features.csv
candidate_items.csv
the model is asked to predict for each session independently; information from one session must not influence any other session:
test_final_sessions.csv
not used at all when generating the "final" prediction file:
test_leaderboard_sessions.csv
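The allowed data flow can be sketched as follows: fit on training data only, then score each test session in isolation, with no state carried over between sessions. The popularity-based recommender and the toy frames here are hypothetical; the point is only the shape of the loop.

```python
import pandas as pd

# Toy stand-ins for train_sessions.csv and test_final_sessions.csv
# (columns are assumptions for illustration).
train_sessions = pd.DataFrame({"session_id": [1, 1, 2], "item_id": [10, 11, 10]})
test_final_sessions = pd.DataFrame({"session_id": [3, 4], "item_id": [11, 10]})

# Fit: a global item-popularity ranking built from training sessions only.
popularity = train_sessions["item_id"].value_counts()

predictions = {}
for session_id, session in test_final_sessions.groupby("session_id"):
    # Each test session is handled independently: the only inputs are the
    # fitted model (popularity) and this session's own items.
    seen = set(session["item_id"])
    predictions[session_id] = [i for i in popularity.index if i not in seen]
```

Any statistic that aggregates across multiple test sessions (e.g., item counts over all of test_final_sessions) would violate the independence rule above.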