Sorry for the delay. The fairseq baseline is trained with the code in the repository, specifically train.sh, and the data is created with preprocess.sh. The training sentence pairs consist of all prompt+candidate pairs from the training set; the dev and test data use only the prompt + top translation when creating sentence pairs.
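In case it's useful, here is a rough sketch of what that data prep amounts to. The file names and the TSV layout are just placeholders I'm using for illustration; the actual commands and formats are in preprocess.sh:

```bash
# Sketch only: assumes hypothetical TSVs with one prompt<TAB>translation pair
# per line (train.tsv repeats the prompt once per candidate; dev.tsv keeps
# only the top translation). See preprocess.sh for the real pipeline.

# Split each TSV into parallel source/target files.
cut -f1 train.tsv > train.src
cut -f2 train.tsv > train.tgt
cut -f1 dev.tsv   > dev.src
cut -f2 dev.tsv   > dev.tgt

# Binarize with fairseq's standard preprocessing.
fairseq-preprocess \
  --source-lang src --target-lang tgt \
  --trainpref train --validpref dev \
  --destdir data-bin
```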
At prediction time (run_pretrained.sh), we use a beam size of 10 and write out each item in the beam for each prompt. There was no effort to use a diverse beam.
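Roughly, the generation step looks like the following; again, the paths are placeholders and the actual invocation is in run_pretrained.sh:

```bash
# Sketch only: --nbest 10 with --beam 10 prints every beam item for each prompt.
fairseq-generate data-bin \
  --path checkpoints/checkpoint_best.pt \
  --beam 10 --nbest 10 \
  > generations.txt

# The H- lines in fairseq's output hold the hypotheses for each input.
grep '^H-' generations.txt > hypotheses.txt
```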
The model is trained on the full train set, and tuned on dev, so it's possible that the baseline scores are slightly inflated in the dev phase.
Stephen