In that case you can make the prediction for frame 1, 2, or 3. We didn't notice a substantial difference during training based on which frame you choose for the label. The advantage of picking the last frame is that you are always making a prediction for the current time instant, rather than predicting the past with a delay.
This means that if you run an M-frame classifier over an N-frame sequence, you only get N - M + 1 predictions if you don't pad. You can either pad the input with zero frames or pad the output by cloning the decisions at the start/end of the sequence. In practice we choose the latter approach.
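As a rough illustration of the counting and the output-padding option, here is a minimal Python sketch. The names `sliding_windows`, `pad_predictions`, `label_offset`, and `toy_classifier` are hypothetical and only stand in for whatever model and pipeline you actually use. The sketch assumes one decision per window and that each decision is assigned to the frame at position `label_offset` inside the window.

```python
def sliding_windows(frames, m):
    """Return every length-m window over frames, without padding."""
    return [frames[i:i + m] for i in range(len(frames) - m + 1)]


def pad_predictions(preds, m, label_offset):
    """Clone the first/last decisions so every frame gets a prediction.

    label_offset is the position (0 .. m-1) inside a window of the frame
    that each prediction is assigned to; m - 1 means the last frame.
    """
    if not preds:
        return []
    left = label_offset              # frames before the first labeled frame
    right = (m - 1) - label_offset   # frames after the last labeled frame
    return [preds[0]] * left + list(preds) + [preds[-1]] * right


def toy_classifier(window):
    # Hypothetical stand-in for a real M-frame classifier: majority vote.
    return int(sum(window) * 2 > len(window))


frames = [0, 0, 1, 1, 1, 0, 0]   # N = 7 frames
M = 3                            # classifier looks at 3 frames at a time

preds = [toy_classifier(w) for w in sliding_windows(frames, M)]
# len(preds) == N - M + 1 == 5

# Label assigned to the last frame of each window: pad only at the start.
per_frame_last = pad_predictions(preds, M, label_offset=M - 1)

# Label assigned to the middle frame: pad one frame on each side.
per_frame_mid = pad_predictions(preds, M, label_offset=1)
```

With the label on the last frame, all of the padding falls at the start of the sequence, so each new prediction lines up with the most recent frame.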