Hello Magenta community,
I have two sound-synthesis questions relating to the following Magenta blog post:
https://magenta.tensorflow.org/gansynth. I think the post would benefit from including the answers to these questions, but maybe I'm missing some foundational knowledge that I can't seem to find anywhere.
1.) When discussing phase recovery, the blog post points to related work, namely this article:
https://tifgan.github.io/#P-E. I'm not quite sure why they:
a.) Throw out the phase information from the STFT and give the model only the magnitude spectrogram (why not give it both magnitude and phase? see the sketch after this list for what gets discarded).
b.) Why not predict both magnitude and phase, instead of reconstructing the phase purely from the magnitude? I'm assuming that predicting magnitude and phase separately is too difficult for some reason, and I'd like to know what that reason is. That said, I have seen the Visual 2.5D Sound paper (https://arxiv.org/abs/1812.04204) predict the real and imaginary STFT components separately, and also take both real and imaginary components as input (nothing is thrown out!). Granted, real/imaginary is not the same parameterization as magnitude/phase (and it's not clear to me why they prefer real/imaginary over magnitude/phase), but why isn't giving the model all of the STFT information more common practice?
c.) Something that might be easier: reconstructing the phase from the magnitude with a neural network itself? Input: magnitude spectrogram plus features predicted by the model --> neural network --> phase reconstruction?
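To make question 1 concrete, here is a minimal sketch of the decomposition I have in mind (using plain NumPy/SciPy, not the actual GANSynth or TiFGAN code; the signal and STFT parameters are just placeholders). It splits the complex STFT into magnitude/phase and real/imaginary, then compares reconstruction with the true phase against reconstruction with a dummy phase, which is the gap that phase-recovery methods have to close:

```python
import numpy as np
from scipy.signal import stft, istft

# Placeholder signal: 1 second of a 440 Hz tone plus a little noise at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)

# Complex STFT: each bin is magnitude * exp(1j * phase) = real + 1j * imag.
_, _, Z = stft(x, fs=sr, nperseg=1024)

magnitude = np.abs(Z)        # the part these models typically keep
phase = np.angle(Z)          # the part that gets thrown away / recovered later
real, imag = Z.real, Z.imag  # the alternative parameterization (2.5D Sound)

# Reconstruction with the true phase is essentially perfect...
_, x_true_phase = istft(magnitude * np.exp(1j * phase), fs=sr, nperseg=1024)

# ...but reconstruction with a dummy (zero) phase is badly degraded,
# which is why phase recovery (Griffin-Lim, IF estimation, etc.) is needed.
_, x_zero_phase = istft(magnitude * np.exp(1j * 0.0), fs=sr, nperseg=1024)

print("max error with true phase:", np.abs(x - x_true_phase[:len(x)]).max())
print("max error with zero phase:", np.abs(x - x_zero_phase[:len(x)]).max())
```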
2.) I was also trying to understand why most music synthesis nowadays goes through transcription --> MIDI --> autoregressive model, and why there aren't more approaches that skip MIDI and stick to the raw waveforms. I still haven't fully grasped why, beyond the fact that transcribing to MIDI discretizes the music into a much shorter, more structured sequence, which presumably helps the model learn long-range dependencies (a rough comparison is sketched below). I was wondering if there is more nuance to it than that?
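Just to make my intuition about the discretization point concrete, here is a rough back-of-the-envelope comparison of sequence lengths for the same passage as raw audio versus a MIDI-like note stream. The numbers are purely illustrative assumptions on my part, not from the blog post:

```python
# Rough sequence-length comparison for a 30-second passage (illustrative numbers).
duration_s = 30
sample_rate = 16000          # a typical rate for neural audio models
notes_per_second = 8         # a fairly dense stream of MIDI note events

raw_audio_steps = duration_s * sample_rate     # one prediction per audio sample
midi_events = duration_s * notes_per_second    # one prediction per note event

print(f"raw audio steps : {raw_audio_steps:,}")                # 480,000
print(f"MIDI note events: {midi_events:,}")                    # 240
print(f"ratio           : {raw_audio_steps // midi_events:,}x")  # 2,000x
```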
I'd really appreciate the community's insight on these questions.
Thanks a lot!