Help needed with input to CNN for 1D conv on audio


Gautam Bhattacharya

Jun 20, 2016, 1:35:30 PM
to lasagne-users
Hello all,

I want to process fixed-size 'snapshots' of audio (speech). Each snapshot is 3 seconds long, i.e. 300 frames, so each example is 300x60 (60-dimensional MFCC + delta + delta-delta features).

I would like to learn a representation that is invariant to the ordering of frames, so I figured that doing a convolution over time only made sense (at least to start). I was thinking about using an architecture something like this: https://arxiv.org/abs/1502.01710

Q. If I want to use Lasagne's Conv1DLayer, how should I present each datapoint to the network?
     I have stored my dataset as numpy/HDF5 arrays of shape num_exp x 300 x 60, and I want to convolve along the second dimension (time) only.
     Do I need to transpose each datapoint so that the net sees 60x300?
     
Any help regarding the best / most efficient way to do this would be greatly appreciated. 
For reference, I have a dataset of ~675,000 examples with 1,630 classes.

Thanks!

Jan Schlüter

Jun 21, 2016, 11:30:36 AM
to lasagne-users
I would like to learn a representation that is invariant to the ordering of frames

I assume you meant invariant to the ordering of features? (Otherwise a convolution doesn't make sense.)


Q. If I want to use Lasagne's Conv1DLayer, how should I present each datapoint to the network?

As per the documentation, it should be of shape (batchsize, channels, frames), where "channels" would be your 60 features.


Do I need to transpose each datapoint so that the net sees 60x300?

Either this (using a DimshuffleLayer), or you present your data as (batchsize, 1, 300, 60) and use a Conv2DLayer with filter_size=(something, 60) for the first layer, and (something, 1) for subsequent ones. The Conv1DLayer internally uses a 2D convolution anyway, so performance will probably be about the same (depends on what the underlying convolution implementation does).
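For concreteness, a minimal sketch of both options (layer names and filter sizes are just illustrative):

import lasagne
from lasagne.layers import (InputLayer, DimshuffleLayer, Conv1DLayer,
                            ReshapeLayer, Conv2DLayer)

# Option 1: dimshuffle to (batchsize, channels, frames), then Conv1DLayer
l_in = InputLayer(shape=(None, 300, 60))        # (batch, frames, features)
l_shuf = DimshuffleLayer(l_in, (0, 2, 1))       # (batch, 60, 300)
l_conv = Conv1DLayer(l_shuf, num_filters=256, filter_size=3)

# Option 2: add a channel axis, then Conv2DLayer with a full-height filter
l_in2 = InputLayer(shape=(None, 300, 60))
l_resh = ReshapeLayer(l_in2, ([0], 1, 300, 60)) # (batch, 1, 300, 60)
l_conv2 = Conv2DLayer(l_resh, num_filters=256, filter_size=(3, 60))
# subsequent Conv2DLayers would use filter_size=(something, 1)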

Best, Jan

Gautam Bhattacharya

Jun 21, 2016, 1:48:28 PM
to lasagne-users


On Tuesday, 21 June 2016 11:30:36 UTC-4, Jan Schlüter wrote:
I would like to learn a representation that is invariant to the ordering of frames

I assume you meant invariant to the ordering of features? (Otherwise a convolution doesn't make sense.)
   Yes, exactly.

Q. If I want to use Lasagne's Conv1DLayer, how should I present each datapoint to the network?

As per the documentation, it should be of shape (batchsize, channels, frames), where "channels" would be your 60 features.

Do I need to transpose each datapoint so that the net sees 60x300?

Either this (using a DimshuffleLayer), or you present your data as (batchsize, 1, 300, 60) and use a Conv2DLayer with filter_size=(something, 60) for the first layer, and (something, 1) for subsequent ones. The Conv1DLayer internally uses a 2D convolution anyway, so performance will probably be about the same (depends on what the underlying convolution implementation does).

Thanks for the tip. My dataset is fairly large (I think) so it would help to do things efficiently. 

Jan, I know you do a lot of work with musical audio. I am trying to classify speakers from these 3 second clips. Is that maybe too long a clip to be feeding a CNN?
 
Thanks,

Jan Schlüter

Jun 22, 2016, 6:15:23 AM
to lasagne-users
Jan, I know you do a lot of work with musical audio. I am trying to classify speakers from these 3 second clips. Is that maybe too long a clip to be feeding a CNN?

No, 300 frames sounds reasonable. You can try to reduce the frame rate (or add a temporal pooling step) if you want to reduce the size, and you should try omitting the delta and delta-delta features (that's something the CNN could figure out by itself). Also you can try omitting the DCT from the MFCC computation and then use a 2D convolution.
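For example, both suggestions could be sketched like this, assuming the 60 features are ordered [MFCC, delta, delta-delta] along the last axis (that ordering is an assumption) and X is the num_exp x 300 x 60 array:

from lasagne.layers import InputLayer, MaxPool1DLayer

# Drop delta and delta-delta: keep only the first 20 coefficients
# (assumes the feature order is [mfcc, delta, delta-delta])
X_static = X[:, :, :20]                      # (num_exp, 300, 20)

# Temporal pooling right after the input to halve the frame rate
l_in = InputLayer(shape=(None, 20, 300))     # (batch, channels, frames), after dimshuffle
l_pool = MaxPool1DLayer(l_in, pool_size=2)   # (batch, 20, 150)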

Gautam Bhattacharya

Aug 6, 2016, 4:51:53 PM
to lasagne-users
Hi,

I am replying on this thread, as I am trying to set up this network.
I am working with 3-second segments of speech (20-dimensional MFCCs). I only want to do convolution along the time axis; here is my network:

network['input'] = InputLayer(shape=(None, 300, 20), input_var=net_input)
batchs, _, _ = network['input'].input_var.shape
network['reshape'] = ReshapeLayer(network['input'], (batchs, 1, 300, 20))

# convolutional layers
network['conv1'] = batch_norm(ConvLayer(network['reshape'], 256, (3, 20), stride=1, pad='full', flip_filters=False, W=HeNormal('relu')))
network['pool1'] = MaxPool2DLayer(network['conv1'], (3, 20))

network['conv2'] = batch_norm(ConvLayer(network['pool1'], 256, (3, 20), stride=1, pad='full', flip_filters=False, W=HeNormal('relu')))
network['pool2'] = MaxPool2DLayer(network['conv2'], (3, 20))

network['conv3'] = batch_norm(ConvLayer(network['pool2'], 256, (3, 20), stride=1, pad='full', flip_filters=False, W=HeNormal('relu')))
network['pool3'] = MaxPool2DLayer(network['conv3'], (3, 20))

# one fully connected layer
network['fc1'] = batch_norm(DenseLayer(network['pool3'], 1024, nonlinearity=lasagne.nonlinearities.rectify, W=HeNormal('relu')))
network['fc1_drop'] = DropoutLayer(network['fc1'], p=0.5)

# softmax
network['fc2'] = DenseLayer(network['fc1_drop'], 1628, nonlinearity=None)
network['prob'] = NonlinearityLayer(network['fc2'], nonlinearity=lasagne.nonlinearities.softmax)



The ConvLayer I am using is a Conv2DLayer (based on Jan's suggestion), mostly because I wasn't sure whether the Conv1DLayer would use cuDNN. Will it?
In that case I need to use MaxPool2DLayer for pooling.
Q. Is the way I have used it actually doing a 1D max-pool along time? In general, is my net doing what I want it to do?

The model pasted above is training, maybe a little slowly for the amount of data (quite a lot), but the loss is moving in the right direction. Suggestions for network architectures would be most welcome.

I am trying to build a model like in : https://arxiv.org/pdf/1502.01710v5.pdf 

Q. I was a little confused by this description: when they say the conv layers do not use stride, does that mean a stride of 1? What does this mean exactly in terms of padding? What size of output feature map, etc.?
"Table 1. Convolutional layers used in our experiments. The convolutional layers do not use stride and pooling layers are all non-overlapping ones, so we omit the description of their strides."

P.S. I trained a much deeper convnet on the same data with a VGGNet-like architecture, and each epoch was faster than an epoch of the much smaller net pasted above. Also, this smaller net runs out of memory if I use 512 filters instead of 256 in each layer. Does this mean I need to use larger filters when doing 1D convolutions?

Thanks in advance,
Gautam

Jan Schlüter

Aug 17, 2016, 1:18:30 PM
to lasagne-users
In general, is my net doing what I want it to do?

No. Only the first 2D layer should have a kernel height of 20, all others should have a height of 1, and all layers should have pad='valid'. Otherwise you're convolving both over time and quefrency. If you want full padding over time, but no padding over quefrency, you can use pad=(2,0) for filter length 3.
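In code, the corrected stack could look like this (a sketch keeping the filter counts from the post above):

# First layer: full-height filter, 'full' padding along time only
network['conv1'] = batch_norm(ConvLayer(network['reshape'], 256, (3, 20),
                                        stride=1, pad=(2, 0),
                                        flip_filters=False, W=HeNormal('relu')))
# Height is now 1, so pool over time only
network['pool1'] = MaxPool2DLayer(network['conv1'], (3, 1))

# Subsequent layers: kernel height 1, still padding along time only
network['conv2'] = batch_norm(ConvLayer(network['pool1'], 256, (3, 1),
                                        stride=1, pad=(2, 0),
                                        flip_filters=False, W=HeNormal('relu')))
network['pool2'] = MaxPool2DLayer(network['conv2'], (3, 1))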

Suggestions for network architectures would be most welcome

Look at the literature. There's a lot about convnets on mel spectrograms. You may not need to change much for convnets on MFCCs (except for not convolving over the quefrency axis of the MFCCs).

Best, Jan

Gautam Bhattacharya

Aug 19, 2016, 2:38:52 PM
to lasagne-users
Thanks for the help (as always) Jan!
Is the pooling happening correctly with (3,20), or does the height-1 rule also apply to the pooling layers?

Thanks,
Gautam

Jan Schlüter

Aug 22, 2016, 7:27:10 AM
to lasagne-users
Is the pooling happening correctly with (3,20), or does the height-1 rule also apply to the pooling layers?

You can figure this out by yourself. If you correct the padding (so you have no padding in the quefrency direction), then you can't have a (3,20) pooling: the output height of the first convolution will already be reduced to 1.
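A quick way to verify this is to ask Lasagne for the output shapes, e.g.:

from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            get_output_shape)

l = InputLayer((None, 1, 300, 20))
l = Conv2DLayer(l, 256, (3, 20), pad=(2, 0))
print(get_output_shape(l))   # (None, 256, 302, 1) -- height is already 1
l = MaxPool2DLayer(l, (3, 1))
print(get_output_shape(l))   # (None, 256, 100, 1)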

Gautam Bhattacharya

Aug 23, 2016, 2:01:28 PM
to lasagne-users
Yup, I figured as much :)
I am going to see if I can (analogously) build the same network but with a 1D convolution across frequency.
Thanks for your patience!

Jan Schlüter

Aug 23, 2016, 2:28:43 PM
to lasagne-users
I am going to see if I can (analogously) build the same network but with a 1D convolution across frequency.

If you compute MFCCs, you don't have frequency, but quefrency. It most probably won't be useful to share filters across different components of that -- it's like trying to share filters across PCA components. So don't be too surprised if it doesn't work! Alternatively, you can omit the DCT from the MFCC computation; this will give you mel spectrograms, which are more suitable for moving filters across the frequency dimension.
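For example, with librosa (assuming that's what you use for feature extraction; the file name and parameters below are just illustrative):

import librosa

y, sr = librosa.load('clip.wav', sr=16000)

# MFCCs: a log-mel spectrogram followed by a DCT
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# Omitting the DCT gives a (log-)mel spectrogram instead,
# which is more suitable for convolving over the frequency axis
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=60)
logmel = librosa.power_to_db(mel)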

Best, Jan