Hi y'all,
I have just read the paper associated with this (I encourage a quick skim, the diagrams are amazing!):
very human speech synthesis (and piano synthesis!).
The English researchers seem to be speaking another language when it comes to deep nets, so I'm unsure whether their notion of dilated convolutions is just another way of looking at 1D pooling:
"It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow its receptive field to grow exponentially with depth and cover thousands of timesteps."
(there is a great animation that illustrates this description).
Am I crazy, or is this just a series of 1D convolutional layers in parallel that feed into a final MLP layer of one node with a softmax?
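For what it's worth, my current reading is that the layers are stacked in series, not in parallel: each layer is an ordinary 1D convolution whose taps are spaced `dilation` steps apart, and doubling the dilation at each layer (1, 2, 4, 8, ...) makes the receptive field grow exponentially with depth without any pooling or loss of resolution. Here's a tiny numpy sketch (my own toy illustration, not the paper's code) that pushes an impulse through four dilated causal layers with kernel size 2 and counts how many timesteps end up influencing the output:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal dilated 1D convolution: output[t] = sum_i w[i] * x[t - i*dilation]."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # zero-pad on the left (causal)
    return np.array([sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
                     for t in range(len(x))])

# Impulse input: a single 1.0 at t=0.
x = np.zeros(32)
x[0] = 1.0

# Stack four layers with dilations 1, 2, 4, 8 (kernel size 2, weights all ones).
h = x
for d in [1, 2, 4, 8]:
    h = dilated_conv1d(h, np.array([1.0, 1.0]), d)

# The impulse has spread over 2**4 = 16 timesteps: receptive field doubles per layer.
receptive_field = int(np.count_nonzero(h))
print(receptive_field)  # 16
```

So four layers already see 16 timesteps, and ten layers would see 1024, which is how the paper's network covers "thousands of timesteps" with modest depth.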
I'm going to have a play in Keras to see if I can get speech synthesis happening (assuming I can find some data). Scratching my head a little on the Keras architecture but will give it a shot (any initial thoughts would be welcome).
alex