Hello Magenta community,
I have two sound-synthesis questions relating to the following Magenta blog post:
https://magenta.tensorflow.org/gansynth. I think the post would benefit from including the answers to these questions, but maybe I'm missing some foundational knowledge that I can't seem to find anywhere.
1.) When discussing phase recovery, the blog post points to related work, namely this article:
https://tifgan.github.io/#P-E. I'm not quite sure why they:
a.) Throw out the phase information from the STFT and give the model only the magnitude spectrogram (why not give it both magnitude and phase? see the sketch after this list for what gets discarded).
b.) Why not predict both magnitude and phase, instead of reconstructing the phase purely from the magnitude? I'm assuming that predicting magnitude and phase separately is too difficult for some reason, and I'd like to know what that reason is. That said, I have seen the Visual 2.5D Sound paper (https://arxiv.org/abs/1812.04204) predict the real and imaginary STFT components separately, and also take both real and imaginary components as input (nothing is thrown out!). Granted, real/imaginary is not the same parameterization as magnitude/phase (and it's not clear to me why they prefer real/imaginary over magnitude/phase), but why isn't giving the model all of the STFT information more common practice?
c.) Something that might be easier: reconstructing the phase from the magnitude with a neural network itself? Input: magnitude spectrogram plus features predicted by the model --> neural network --> phase reconstruction?
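To make question 1 concrete, here is a minimal sketch of the decomposition I have in mind (using plain NumPy/SciPy, not the actual GANSynth or TiFGAN code; the signal and STFT parameters are just placeholders). It splits the complex STFT into magnitude/phase and real/imaginary, then compares reconstruction with the true phase against reconstruction with a dummy phase, which is the gap that phase-recovery methods have to close:

```python
import numpy as np
from scipy.signal import stft, istft

# Placeholder signal: 1 second of a 440 Hz tone plus a little noise at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)

# Complex STFT: each bin is magnitude * exp(1j * phase) = real + 1j * imag.
_, _, Z = stft(x, fs=sr, nperseg=1024)

magnitude = np.abs(Z)        # the part these models typically keep
phase = np.angle(Z)          # the part that gets thrown away / recovered later
real, imag = Z.real, Z.imag  # the alternative parameterization (2.5D Sound)

# Reconstruction with the true phase is essentially perfect...
_, x_true_phase = istft(magnitude * np.exp(1j * phase), fs=sr, nperseg=1024)

# ...but reconstruction with a dummy (zero) phase is badly degraded,
# which is why phase recovery (Griffin-Lim, IF estimation, etc.) is needed.
_, x_zero_phase = istft(magnitude * np.exp(1j * 0.0), fs=sr, nperseg=1024)

print("max error with true phase:", np.abs(x - x_true_phase[:len(x)]).max())
print("max error with zero phase:", np.abs(x - x_zero_phase[:len(x)]).max())
```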
2.) I was also trying to understand why most music synthesis nowadays goes through transcription --> MIDI --> autoregressive model, and why there aren't more approaches that skip MIDI and stick to the raw waveforms. I still haven't fully grasped why, beyond the fact that transcribing to MIDI discretizes the music into a much shorter, more structured sequence, which presumably helps the model learn long-range dependencies (a rough comparison is sketched below). I was wondering if there is more nuance to it than that?
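Just to make my intuition about the discretization point concrete, here is a rough back-of-the-envelope comparison of sequence lengths for the same passage as raw audio versus a MIDI-like note stream. The numbers are purely illustrative assumptions on my part, not from the blog post:

```python
# Rough sequence-length comparison for a 30-second passage (illustrative numbers).
duration_s = 30
sample_rate = 16000          # a typical rate for neural audio models
notes_per_second = 8         # a fairly dense stream of MIDI note events

raw_audio_steps = duration_s * sample_rate     # one prediction per audio sample
midi_events = duration_s * notes_per_second    # one prediction per note event

print(f"raw audio steps : {raw_audio_steps:,}")                # 480,000
print(f"MIDI note events: {midi_events:,}")                    # 240
print(f"ratio           : {raw_audio_steps // midi_events:,}x")  # 2,000x
```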
I'd really appreciate the community's insight on these questions.
Thanks a lot!