Reproducing MT3 transcription results on Slakh2100


Hao Hao Tan

May 24, 2023, 10:39:15 AM
to Magenta Discuss
Hi authors of MT3,

Thank you for the amazing work on MT3!

I tried to reproduce the "Multi-instrument F1 score" results reported in the ICLR'22 paper, using the open-source model checkpoint on Colab to export the transcribed MIDIs, and the evaluation script on Github. Below are my reproduced results:

Onset + offset + program F1 (flat):       0.48 (reported) vs 0.5039 (reproduced)
Onset + offset + program F1 (midi_class): 0.62 (reported) vs 0.4784 (reproduced)
Onset + offset + program F1 (full):       0.55 (reported) vs 0.2846 (reproduced)

I would like to ask the following questions:
(i) Is the open-source model checkpoint the same model version used for evaluation in the ICLR paper?
(ii) Is it expected that `flat` performs worse than `midi_class`, while `midi_class` performs better than `full`?
From what I understand, in terms of strictness `flat` < `midi_class` < `full`. In that case, true positives under `midi_class` should also be true positives under `flat`, so `flat` should score at least as high as `midi_class`, if not higher (see the sketch after this question).
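For concreteness, below is a minimal sketch of how I understand the three granularity levels to remap program numbers before matching. I am assuming this mirrors the mapping in MT3's `vocabularies.py`; please correct me if the grouping is different.

# Sketch of the three program granularities as I understand them
# (assumption: this mirrors mt3/vocabularies.py; it is not copied from it).

def flat_program(program: int) -> int:
    # "flat": instrument identity is ignored; every program collapses to one.
    return 0

def midi_class_program(program: int) -> int:
    # "midi_class": programs are grouped into the 16 General MIDI classes
    # of 8 programs each (piano, chromatic percussion, organ, guitar, ...).
    return 8 * (program // 8)

def full_program(program: int) -> int:
    # "full": the exact General MIDI program number must match.
    return program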

Thank you in advance; I look forward to discussing this with you!


Best regards,
Hao Hao Tan

Ian Simon

Jun 2, 2023, 2:07:37 PM
to Hao Hao Tan, Magenta Discuss
Sorry about the slow response!  A few thoughts:

1) In general I would expect flat < midi_class < full (where "<" means easier), but there will also be some randomness, since in Table 3 (IIRC) the evaluation is for 3 separate models trained at the different instrument granularities.  I can hypothesize about other reasons, e.g. explicitly predicting the instrument might make it easier to predict subsequent notes, but I have no evidence for that.  That said, the checkpoint we shared was trained on "full", and yet that's where you observed the largest discrepancy.

2) The model used in the colab is not exactly the same as the model used for evaluation, because the internal codebase we used to train & evaluate the models for the paper is slightly different from the open source codebase.  However, I would not expect to see a difference as large as the one you observed since the two codebases should be functionally near-identical.

3) I suspect the issue here is that our metrics code is kind of a mess and computes a bunch of different F1 scores without a clear mapping to the ones reported in the paper.  One of the annoying differences is how drums are handled; in some metrics drums are dropped, in others drums are included but only onsets are considered.  (Note that the drum issue should only affect Slakh; do your scores for other datasets match our reported ones more closely?).  If it's not too much trouble, can you check both the "Onset + offset + program F1 (full)" and "Nondrum onset + offset + program F1 (full)"?  I honestly can't remember which one of these we reported.  If you still see a large difference, I will try to rerun the eval myself using the open-source checkpoint to see what might be going on.
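To make the drum distinction concrete, here is a rough sketch of the two kinds of filtering I mean, using the NoteSequence proto from note_seq (illustrative only, not our actual metrics code):

import copy

import note_seq

def remove_drums(ns: note_seq.NoteSequence) -> note_seq.NoteSequence:
    # Copy of the sequence with all drum notes dropped ("nondrum" metrics).
    out = copy.deepcopy(ns)
    del out.notes[:]
    out.notes.extend(n for n in ns.notes if not n.is_drum)
    return out

def drums_only(ns: note_seq.NoteSequence) -> note_seq.NoteSequence:
    # Copy with only drum notes, for metrics that score drums on onsets alone.
    out = copy.deepcopy(ns)
    del out.notes[:]
    out.notes.extend(n for n in ns.notes if n.is_drum)
    return out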

Thank you for bringing this to our attention, and sorry again about the slow response!

-Ian


Hao Hao Tan

Jun 3, 2023, 10:31:01 AM
to Magenta Discuss, ians...@google.com, Magenta Discuss, Hao Hao Tan
Hi Ian,

I think it is still a large difference on my side. Here is how I ran it: using the code in the Colab notebook, I first transcribed all of the test-set .wav files in Slakh to MIDI. The output MIDI is identical to what I get when processing the same .wav file directly in the Colab notebook.
Then I run `_program_aware_note_scores` in `metrics.py`; since the function takes in NoteSequences, I use `note_seq.midi_file_to_note_sequence` to convert each MIDI file. I am not sure this is the correct procedure.
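Concretely, my evaluation loop looks roughly like the sketch below. The directory layout is made up for illustration, and I am writing the `_program_aware_note_scores` call from memory, so the exact argument names may not match `metrics.py`:

import glob

import note_seq
from mt3 import metrics

all_scores = []
# Hypothetical layout: one reference MIDI and one MT3-transcribed MIDI per track.
for ref_path in sorted(glob.glob('slakh_test/*/ref.mid')):
    est_path = ref_path.replace('ref.mid', 'mt3_est.mid')
    ref_ns = note_seq.midi_file_to_note_sequence(ref_path)
    est_ns = note_seq.midi_file_to_note_sequence(est_path)
    # Assumed signature; the real function in metrics.py may take different arguments.
    all_scores.append(
        metrics._program_aware_note_scores(ref_ns, est_ns,
                                           granularity_type='full'))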
Below are my results on Slakh:

Drum onset F1 (flat): 0.6789
Drum onset F1 (full): 0.6789
Drum onset F1 (midi_class): 0.6789
Drum onset precision (flat): 0.7053
Drum onset precision (full): 0.7053
Drum onset precision (midi_class): 0.7053
Drum onset recall (flat): 0.6662
Drum onset recall (full): 0.6662
Drum onset recall (midi_class): 0.6662
Nondrum onset + offset + program F1 (flat): 0.4225
Nondrum onset + offset + program F1 (full): 0.09958
Nondrum onset + offset + program F1 (midi_class): 0.3856
Nondrum onset + offset + program precision (flat): 0.4784
Nondrum onset + offset + program precision (full): 0.1124
Nondrum onset + offset + program precision (midi_class): 0.4362
Nondrum onset + offset + program recall (flat): 0.3834
Nondrum onset + offset + program recall (full): 0.09033
Nondrum onset + offset + program recall (midi_class): 0.3501
Onset + offset + program F1 (flat): 0.5039
Onset + offset + program F1 (full): 0.2846
Onset + offset + program F1 (midi_class): 0.4784
Onset + offset + program precision (flat): 0.555
Onset + offset + program precision (full): 0.3132
Onset + offset + program precision (midi_class): 0.5266
Onset + offset + program recall (flat): 0.4659
Onset + offset + program recall (full): 0.2631
Onset + offset + program recall (midi_class): 0.4425

Ian Simon

Jun 3, 2023, 11:15:30 AM
to Hao Hao Tan, Ethan Manilow, Magenta Discuss
Interesting!

It looks like a) the drum scores are good, b) the "flat" and "midi class" instrument scores are pretty good, but c) the "full" instrument scores are bad.

My guess as to what's happening here is that the program numbers in the MIDI files from Slakh (where are you getting those?) don't exactly match the program numbers output by MT3.
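One quick sanity check (just a sketch, assuming you have the reference and estimated NoteSequences for one track loaded) would be to compare which program numbers actually appear on each side:

import collections

def program_histogram(ns):
    # Count non-drum notes per MIDI program number in a NoteSequence.
    counts = collections.Counter()
    for note in ns.notes:
        if not note.is_drum:
            counts[note.program] += 1
    return counts

# Compare e.g. program_histogram(ref_ns) vs. program_histogram(est_ns) for one
# Slakh track; if the two sets of programs barely overlap, that would explain
# why only the "full" metric collapses.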

Adding @Ethan Manilow who might know more about the Slakh program numbers.

-Ian

Hao Hao Tan

Jun 3, 2023, 11:49:01 AM
to Magenta Discuss, ians...@google.com, Magenta Discuss, Hao Hao Tan, eman...@google.com
That sounds possible. Indeed, the drum part does have good results.

I did not explicitly load or change the program numbers. From my understanding of the code, the MT3 inference -> transcribed MIDI step does not group the program numbers based on the granularity level.
It is grouped in the metrics evaluation code: https://github.com/magenta/mt3/blob/main/mt3/metrics.py#L54