MusicTransformer reproduction


Pierre Cournut

Nov 10, 2020, 1:02:00 PM

I’m trying to reproduce your model from both your papers and your implementation, and I’m having a hard time matching the truly awesome results in your Colab!

After looking extensively through the papers and the code, there are still a few points that I’m unsure of:
  • Data preprocessing: do you use the full MIDI pitch range (0-127) as stated in your paper, or the restricted piano range (21-108) hard-coded in the implementation? (See the pitch-filtering sketch after this list.)
  • What learning rate do you refer to in the original Music Transformer paper and in the update_small_lr hparams setup function? I’m confused because the transformer_base_v3 hparams update function already defines a learning-rate constant and schedule, which in my understanding fully determines the learning rate throughout training. (A sketch of that schedule follows this list.)
  • Could you quickly explain the roles of these three variables: hidden_size, attention_key_channels, and filter_size? I’m confused by this sentence in the Music Transformer paper, “We found that reducing the query and key hidden size (att) to half the hidden size (hs) works well and use this relationship for all of the models”, given the values I found in the hparams (hidden_size = 384, attention_key_channels = 512, filter_size = 1024). (A shape sketch follows this list.)
  • I have not implemented local attention yet. Could it help me get closer to reproducing your results, or does it mostly save memory and thus just let me train faster? (See the mask sketch after this list.)
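
To make the first question concrete, here is the kind of pitch filtering I mean; this is just an illustration using pretty_midi, not the actual Magenta preprocessing:

    import pretty_midi

    PIANO_MIN, PIANO_MAX = 21, 108  # A0..C8, the 88-key piano range

    def clip_to_piano_range(midi_path):
        # Drop notes outside the piano range (illustration only; the real
        # pipeline may transpose or filter differently).
        pm = pretty_midi.PrettyMIDI(midi_path)
        for inst in pm.instruments:
            inst.notes = [n for n in inst.notes
                          if PIANO_MIN <= n.pitch <= PIANO_MAX]
        return pm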
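On the learning rate, my current reading of the schedule those hparams define is sketched below (following the tensor2tensor "constant * linear_warmup * rsqrt_decay" convention; the constant and warmup values here are placeholders I have not confirmed against the paper):

    import math

    def t2t_style_lr(step, constant=2.0, warmup_steps=8000):
        # constant * linear_warmup * rsqrt_decay, as I read the t2t code.
        warmup = min(1.0, step / warmup_steps)            # linear warmup factor
        decay = 1.0 / math.sqrt(max(step, warmup_steps))  # rsqrt decay after warmup
        return constant * warmup * decay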
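For the third question, this is how I currently understand the three hparams mapping onto per-layer weight shapes, assuming the tensor2tensor convention that attention_key_channels is the total query/key projection width across all heads (purely illustrative, not the actual code):

    # Illustrative shapes only; names follow tensor2tensor conventions.
    hidden_size = 384             # width of embeddings and the residual stream
    attention_key_channels = 512  # total Q/K projection size (split over heads)
    filter_size = 1024            # inner width of the feed-forward sublayer

    def layer_param_shapes(num_heads=8):
        head_dim = attention_key_channels // num_heads
        return {
            "W_q": (hidden_size, attention_key_channels),
            "W_k": (hidden_size, attention_key_channels),
            "W_v": (hidden_size, hidden_size),  # value channels default to hidden_size
            "W_o": (hidden_size, hidden_size),
            "ffn_in": (hidden_size, filter_size),
            "ffn_out": (filter_size, hidden_size),
            "qk_dim_per_head": head_dim,
        }

    print(layer_param_shapes())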
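And on local attention, my understanding is that each position attends only to a fixed-size window of the past. The toy mask below captures the idea, though a real implementation works blockwise so the full seq_len x seq_len matrix is never materialized, which is where the memory saving comes from:

    import numpy as np

    def local_attention_mask(seq_len, window):
        # Causal banded mask: position i attends to positions (i - window, i].
        # Toy illustration; real local attention never builds this full matrix.
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)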

Thanks in advance!


Ian Simon

Nov 10, 2020, 4:39:34 PM
to Pierre Cournut, Anna Huang, Magenta Discuss
Hi Pierre, the model in the Colab is trained on piano transcriptions from YouTube, not MAESTRO, and it doesn't use relative attention; it just uses the transformer_tpu hparams but with 16 hidden layers.
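
If it helps, grabbing that configuration should look something like this (an untested sketch; transformer_tpu is the hparams set registered in tensor2tensor):

    from tensor2tensor.models import transformer

    hparams = transformer.transformer_tpu()  # base TPU hparams set
    hparams.num_hidden_layers = 16           # depth used for the Colab model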

I'm actually not sure what configuration was used for the Music Transformer paper, but +Anna Huang may be able to help you.



Pierre Cournut

Nov 12, 2020, 3:46:21 AM
to Anna Huang, Ian Simon, Magenta Discuss
Hi Ian, 

Thank you for your quick answer! 
I’ll give the transformer_tpu params a closer look then. 


Drew Edwards

Feb 9, 2023, 7:29:21 AM
to Magenta Discuss
Hi, I'm also curious about Pierre's initial queries. Additionally, I am wondering why the blog/Colab model uses a different architecture. 

Thanks in advance for any learnings you are able to share!
