Hi authors of MT3,
Thank you for the amazing work on MT3!
I tried to reproduce the results reported in the paper and obtained the following scores:
Onset + offset + program F1 (flat): 0.48 (reported) vs 0.5039 (reproduced)
Onset + offset + program F1 (midi_class): 0.62 (reported) vs 0.4784 (reproduced)
Onset + offset + program F1 (full): 0.55 (reported) vs 0.2846 (reproduced)
I would like to ask the following questions:
(i) Is the open-source model checkpoint the same model version used for evaluation in the ICLR paper?
(ii) Is it expected that `flat` performs worse than `midi_class`, while `midi_class` performs better than `full`?
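From what I understand, in terms of strictness `flat` < `midi_class` < `full`. In that case, every true positive under `midi_class` should also be a true positive under `flat`, so it seems like `flat` should score at least as well as `midi_class`, if not better. My reproduced numbers follow this ordering (0.5039 > 0.4784 > 0.2846), but the reported ones do not (0.48 < 0.62 > 0.55).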
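
To make this reasoning concrete, here is a minimal toy sketch of why I expect the ordering. It is not the MT3 evaluation code: it matches notes exactly on (onset, offset, pitch, program) instead of using onset/offset tolerances, and the `map_program` helper is my own assumption about how each granularity collapses program numbers (`flat` maps everything to one program, `midi_class` maps to the first program of each block of 8, `full` keeps the program). The names `map_program`, `toy_f1`, and the note tuples are hypothetical.

```python
from collections import Counter


def map_program(program, granularity):
    """Assumed program mapping per granularity (not taken from the MT3 code)."""
    if granularity == "flat":
        return 0  # every instrument collapses to a single program
    if granularity == "midi_class":
        return 8 * (program // 8)  # first program of each block of 8
    if granularity == "full":
        return program  # keep the exact program number
    raise ValueError(f"unknown granularity: {granularity}")


def toy_f1(ref_notes, est_notes, granularity):
    """F1 with exact matching on (onset, offset, pitch, mapped program).

    Simplification: real transcription metrics use tolerance windows;
    exact matching is used here only to isolate the effect of the
    program granularity.
    """
    def as_counter(notes):
        return Counter(
            (onset, offset, pitch, map_program(program, granularity))
            for onset, offset, pitch, program in notes
        )

    ref, est = as_counter(ref_notes), as_counter(est_notes)
    matched = sum((ref & est).values())  # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(est.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Notes are (onset, offset, pitch, program). Timing and pitch are all
# correct; only the predicted programs differ from the reference.
ref_notes = [(0.0, 0.5, 60, 0), (0.5, 1.0, 64, 0), (1.0, 1.5, 67, 40)]
est_notes = [(0.0, 0.5, 60, 0), (0.5, 1.0, 64, 1), (1.0, 1.5, 67, 24)]

scores = {g: toy_f1(ref_notes, est_notes, g) for g in ("flat", "midi_class", "full")}
for granularity, score in scores.items():
    print(f"{granularity}: {score:.4f}")
```

In this toy case the scores come out as 1.0 for `flat`, about 0.67 for `midi_class`, and about 0.33 for `full`. Under these assumptions, coarsening the program labels can only merge note groups, so the number of matched notes cannot decrease while the note counts stay the same. If the actual evaluation treats some tracks differently per granularity (for example drums), the ordering might not hold strictly, which is part of what I would like to confirm.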
Thank you in advance, and I look forward to discussing this with you!
Best regards,
Hao Hao Tan