Dev data release and CodaLab!


Stephen Mayhew

Mar 4, 2020, 11:55:24 AM
to duolingo-sharedtask-2020
Hello all,

The blind dev data has been released, and is now available at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/38OJR6

Also, we have set up the CodaLab leaderboard for the dev phase! See it here: https://competitions.codalab.org/competitions/23643

Best of luck with the evaluation! Let us know if there are problems or questions.

Stephen

Matt Post

Mar 4, 2020, 2:55:56 PM
to duolingo-sharedtask-2020
Can you say anything about the fairseq baseline?

Stephen Mayhew

Mar 9, 2020, 9:43:26 AM
to duolingo-sharedtask-2020
Sorry for the delay. The fairseq baseline is trained using the code in the repository, specifically train.sh. The data is created with preprocess.sh: the training sentence pairs consist of all prompt+candidate pairs from the training set, while the dev and test data use only the prompt + top translation when creating sentence pairs. At prediction time (run_pretrained.sh), we use a beam size of 10 and write out each item in the beam for each prompt. There was no effort to use a diverse beam.
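
For anyone who wants to picture the pair construction without reading the scripts, here is a rough sketch in Python. It is not the code from preprocess.sh. The in-memory layout (a dict from prompt to a list of (translation, weight) tuples), the function names, and the reading of "top translation" as the highest-weighted one are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout, NOT the official file format:
# prompt text -> list of (candidate translation, weight)
Candidates = Dict[str, List[Tuple[str, float]]]


def training_pairs(data: Candidates) -> List[Tuple[str, str]]:
    """Pair every prompt with every one of its candidate translations."""
    return [(prompt, cand) for prompt, cands in data.items() for cand, _ in cands]


def eval_pairs(data: Candidates) -> List[Tuple[str, str]]:
    """Pair every prompt with a single translation.

    Assumption: "top translation" means the candidate with the highest weight.
    Prompts with no candidates are skipped.
    """
    return [
        (prompt, max(cands, key=lambda c: c[1])[0])
        for prompt, cands in data.items()
        if cands
    ]


if __name__ == "__main__":
    # Toy data made up for this sketch; it does not come from the task files.
    toy: Candidates = {
        "prompt A": [("translation A1", 0.7), ("translation A2", 0.3)],
        "prompt B": [("translation B1", 0.4), ("translation B2", 0.6)],
    }
    print(training_pairs(toy))
    print(eval_pairs(toy))
```

With this construction the training side grows with the number of accepted candidates per prompt, while the dev and test sides stay at one pair per prompt.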

The model is trained on the full train set, and tuned on dev, so it's possible that the baseline scores are slightly inflated in the dev phase.

Stephen 

Terry Daniels

Mar 14, 2020, 4:13:33 PM
to duolingo-sharedtask-2020
Have you encountered any individuals interested in ML for language learning in virtual environments?

Terry