Yep, exactly!
To implement this, recall that we introduced slightly more general terminology: "queries" and "response choices." The benefit is that the same model can be used for Q -> A as well as for QA -> R. In the former, the query is the question and the responses are the answer choices; in the latter, the query is the concatenated question+answer and the responses are the rationale choices. We trained one model for each mode.
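In case it's useful, here's a minimal sketch of how that shared format might look in code. The helper name, the `mode` flag, and the field names are all hypothetical, just to illustrate the mapping:

```python
def make_example(question, answer_choices, rationale_choices,
                 mode, chosen_answer=None):
    """Map either task into the shared (query, response_choices) format.

    Hypothetical helper -- names are illustrative, not from our codebase.
    """
    if mode == "Q->A":
        # Query is the question; responses are the answer choices.
        return question, answer_choices
    if mode == "QA->R":
        # Query is the question concatenated with an answer choice;
        # responses are the rationale choices.
        return question + " " + chosen_answer, rationale_choices
    raise ValueError(f"unknown mode: {mode}")


query, responses = make_example(
    "Why is she smiling?", ["a0", "a1", "a2", "a3"],
    ["r0", "r1", "r2", "r3"], mode="Q->A")
print(query)       # "Why is she smiling?"
print(responses)   # ["a0", "a1", "a2", "a3"]
```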
The only other thing to watch out for is submitting to the leaderboard. To preserve the integrity of the test set, we don't release the labels. So, to score QA -> R, we require that users produce 4 sets of QA -> R predictions: one conditioned on each possible answer choice A.
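Concretely, that means each question ends up with a 4x4 grid of rationale scores (row i = scores conditioned on answer choice i), and the evaluation server picks the row for the true answer. A rough sketch, where `score_rationales` is a dummy stand-in for your trained QA -> R model:

```python
import numpy as np


def score_rationales(query, rationale_choices):
    """Dummy stand-in for a trained QA -> R model: one score per rationale."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    return rng.random(len(rationale_choices))


def build_qa_to_r_predictions(question, answer_choices, rationale_choices):
    """Return a (4, 4) matrix: row i holds the rationale scores when the
    query is conditioned on answer choice i."""
    return np.stack([
        score_rationales(question + " " + answer, rationale_choices)
        for answer in answer_choices
    ])


preds = build_qa_to_r_predictions(
    "Why is she smiling?",
    ["a0", "a1", "a2", "a3"],
    ["r0", "r1", "r2", "r3"],
)
print(preds.shape)  # (4, 4)
```

The exact submission file format may differ -- this is just the shape of what's being asked for, not our actual evaluation code.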
Hope this helps! Let me know if I can help more. Thanks,
Rowan