Sorry for the delay. The fairseq baseline is trained with the code in the repository, specifically train.sh, and the data is created with preprocess.sh. The training sentence pairs consist of all prompt+candidate pairs from the training set; the dev and test data use only the prompt + top translation when creating sentence pairs.
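In case it's useful, here is a rough sketch of what that data prep amounts to. The file names and the TSV layout are just placeholders I'm using for illustration; the actual commands and formats are in preprocess.sh:

```bash
# Sketch only: assumes hypothetical TSVs with one prompt<TAB>translation pair
# per line (train.tsv repeats the prompt once per candidate; dev.tsv keeps
# only the top translation). See preprocess.sh for the real pipeline.

# Split each TSV into parallel source/target files.
cut -f1 train.tsv > train.src
cut -f2 train.tsv > train.tgt
cut -f1 dev.tsv   > dev.src
cut -f2 dev.tsv   > dev.tgt

# Binarize with fairseq's standard preprocessing.
fairseq-preprocess \
  --source-lang src --target-lang tgt \
  --trainpref train --validpref dev \
  --destdir data-bin
```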
At prediction time (run_pretrained.sh), we use a beam size of 10 and write out each item in the beam for each prompt. There was no effort to use a diverse beam.
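Roughly, the generation step looks like the following; again, the paths are placeholders and the actual invocation is in run_pretrained.sh:

```bash
# Sketch only: --nbest 10 with --beam 10 prints every beam item for each prompt.
fairseq-generate data-bin \
  --path checkpoints/checkpoint_best.pt \
  --beam 10 --nbest 10 \
  > generations.txt

# The H- lines in fairseq's output hold the hypotheses for each input.
grep '^H-' generations.txt > hypotheses.txt
```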
The model is trained on the full train set, and tuned on dev, so it's possible that the baseline scores are slightly inflated in the dev phase.
Stephen