MusicTransformer reproduction


Pierre Cournut

Nov 10, 2020, 1:02:00 PM

I’m trying to reproduce your model from both your papers and your implementation, and I’m having a hard time matching the truly awesome results in your Colab!

After looking extensively through the papers and the code, there are still a few points that I’m unsure of:
  • Data preprocessing: do you use the full MIDI pitch range (0-127) as stated in your paper, or the restricted piano range (21-108) hard-coded in the implementation? (See the pitch-filtering sketch after this list.)
  • What learning rate do you refer to in the original Music Transformer paper and in the update_small_lr hparams setup function? I’m confused because the transformer_base_v3 hparams update function already defines a learning-rate constant and schedule, which in my understanding fully determines the learning rate throughout training. (A sketch of that schedule follows this list.)
  • Could you quickly explain the roles of these three variables: hidden_size, attention_key_channels, and filter_size? I’m confused by this sentence in the Music Transformer paper, “We found that reducing the query and key hidden size (att) to half the hidden size (hs) works well and use this relationship for all of the models”, given the values I found in the hparams (hidden_size = 384, attention_key_channels = 512, filter_size = 1024). (A shape sketch follows this list.)
  • I have not implemented local attention yet. Could it help me get closer to reproducing your results, or does it mostly save memory and thus just let me train faster? (See the mask sketch after this list.)
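
To make the first question concrete, here is the kind of pitch filtering I mean; this is just an illustration using pretty_midi, not the actual Magenta preprocessing:

    import pretty_midi

    PIANO_MIN, PIANO_MAX = 21, 108  # A0..C8, the 88-key piano range

    def clip_to_piano_range(midi_path):
        # Drop notes outside the piano range (illustration only; the real
        # pipeline may transpose or filter differently).
        pm = pretty_midi.PrettyMIDI(midi_path)
        for inst in pm.instruments:
            inst.notes = [n for n in inst.notes
                          if PIANO_MIN <= n.pitch <= PIANO_MAX]
        return pm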
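On the learning rate, my current reading of the schedule those hparams define is sketched below (following the tensor2tensor "constant * linear_warmup * rsqrt_decay" convention; the constant and warmup values here are placeholders I have not confirmed against the paper):

    import math

    def t2t_style_lr(step, constant=2.0, warmup_steps=8000):
        # constant * linear_warmup * rsqrt_decay, as I read the t2t code.
        warmup = min(1.0, step / warmup_steps)            # linear warmup factor
        decay = 1.0 / math.sqrt(max(step, warmup_steps))  # rsqrt decay after warmup
        return constant * warmup * decay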
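For the third question, this is how I currently understand the three hparams mapping onto per-layer weight shapes, assuming the tensor2tensor convention that attention_key_channels is the total query/key projection width across all heads (purely illustrative, not the actual code):

    # Illustrative shapes only; names follow tensor2tensor conventions.
    hidden_size = 384             # width of embeddings and the residual stream
    attention_key_channels = 512  # total Q/K projection size (split over heads)
    filter_size = 1024            # inner width of the feed-forward sublayer

    def layer_param_shapes(num_heads=8):
        head_dim = attention_key_channels // num_heads
        return {
            "W_q": (hidden_size, attention_key_channels),
            "W_k": (hidden_size, attention_key_channels),
            "W_v": (hidden_size, hidden_size),  # value channels default to hidden_size
            "W_o": (hidden_size, hidden_size),
            "ffn_in": (hidden_size, filter_size),
            "ffn_out": (filter_size, hidden_size),
            "qk_dim_per_head": head_dim,
        }

    print(layer_param_shapes())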
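And on local attention, my understanding is that each position attends only to a fixed-size window of the past. The toy mask below captures the idea, though a real implementation works blockwise so the full seq_len x seq_len matrix is never materialized, which is where the memory saving comes from:

    import numpy as np

    def local_attention_mask(seq_len, window):
        # Causal banded mask: position i attends to positions (i - window, i].
        # Toy illustration; real local attention never builds this full matrix.
        i = np.arange(seq_len)[:, None]
        j = np.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)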

Thanks in advance!


Ian Simon

Nov 10, 2020, 4:39:34 PM
to Pierre Cournut, Anna Huang, Magenta Discuss
Hi Pierre, the model in the Colab is trained on piano transcriptions from YouTube, not MAESTRO, and it doesn't use relative attention; it just uses the transformer_tpu hparams but with 16 hidden layers.
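
If it helps, grabbing that configuration should look something like this (an untested sketch; transformer_tpu is the hparams set registered in tensor2tensor):

    from tensor2tensor.models import transformer

    hparams = transformer.transformer_tpu()  # base TPU hparams set
    hparams.num_hidden_layers = 16           # depth used for the Colab model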

I'm actually not sure what configuration was used for the Music Transformer paper, but +Anna Huang may be able to help you.



Pierre Cournut

Nov 12, 2020, 3:46:21 AM
to Anna Huang, Ian Simon, Magenta Discuss
Hi Ian, 

Thank you for your quick answer! 
I’ll give the transformer_tpu params a closer look then. 


Drew Edwards

Feb 9, 2023, 7:29:21 AM
to Magenta Discuss
Hi, I'm also curious about Pierre's initial queries. Additionally, I am wondering why the blog/Colab model uses a different architecture. 

Thanks in advance for any learnings you are able to share!
