Help needed with input to CNN for 1D conv on audio


Gautam Bhattacharya

Jun 20, 2016, 1:35:30 PM
to lasagne-users
Hello all,

I want to process fixed-size 'snapshots' of audio (speech). Each snapshot is 3 seconds long, i.e. 300 frames, so each example is 300x60 (60-dimensional MFCC + delta + delta-delta features).

I would like to learn a representation that is invariant to the ordering of frames, so I figured that doing a convolution over time only made sense (at least to start). I was thinking about using an architecture something like this: https://arxiv.org/abs/1502.01710

Q. If I want to use Lasagne's Conv1DLayer, how should I present each datapoint to the network?
     I have stored my dataset as numpy/HDF5 arrays of shape num_exp x 300 x 60, and I want to convolve along the second dimension (time) only.
     Do I need to transpose each datapoint so that the net sees 60x300?
     
Any help regarding the best / most efficient way to do this would be greatly appreciated. 
For reference, I have a dataset of ~675,000 examples with 1,630 classes.

Thanks!

Jan Schlüter

Jun 21, 2016, 11:30:36 AM
to lasagne-users
I would like to learn a representation that is invariant to the ordering of frames

I assume you meant invariant to the ordering of features? (Otherwise a convolution doesn't make sense.)


Q. If I want to use Lasagne's Conv1DLayer, how should I present each datapoint to the network?

As per the documentation, it should be of shape (batchsize, channels, frames), where "channels" would be your 60 features.


Do I need to transpose each datapoint so that the net sees 60x300?

Either this (using a DimshuffleLayer), or you present your data as (batchsize, 1, 300, 60) and use a Conv2DLayer with filter_size=(something, 60) for the first layer, and (something, 1) for subsequent ones. The Conv1DLayer internally uses a 2D convolution anyway, so performance will probably be about the same (depends on what the underlying convolution implementation does).
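For concreteness, a minimal sketch of both options (layer names and filter sizes are just illustrative):

import lasagne
from lasagne.layers import (InputLayer, DimshuffleLayer, Conv1DLayer,
                            ReshapeLayer, Conv2DLayer)

# Option 1: dimshuffle to (batchsize, channels, frames), then Conv1DLayer
l_in = InputLayer(shape=(None, 300, 60))        # (batch, frames, features)
l_shuf = DimshuffleLayer(l_in, (0, 2, 1))       # (batch, 60, 300)
l_conv = Conv1DLayer(l_shuf, num_filters=256, filter_size=3)

# Option 2: add a channel axis, then Conv2DLayer with a full-height filter
l_in2 = InputLayer(shape=(None, 300, 60))
l_resh = ReshapeLayer(l_in2, ([0], 1, 300, 60)) # (batch, 1, 300, 60)
l_conv2 = Conv2DLayer(l_resh, num_filters=256, filter_size=(3, 60))
# subsequent Conv2DLayers would use filter_size=(something, 1)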

Best, Jan

Gautam Bhattacharya

Jun 21, 2016, 1:48:28 PM
to lasagne-users


On Tuesday, 21 June 2016 11:30:36 UTC-4, Jan Schlüter wrote:
I would like to learn a representation that is invariant to the ordering of frames

I assume you meant invariant to the ordering of features? (Otherwise a convolution doesn't make sense.)
   Yes, exactly.

Q. If I want to use Lasagne's Conv1DLayer, how should I present each datapoint to the network?

As per the documentation, it should be of shape (batchsize, channels, frames), where "channels" would be your 60 features.

Do I need to transpose each datapoint so that the net sees 60x300?

Either this (using a DimshuffleLayer), or you present your data as (batchsize, 1, 300, 60) and use a Conv2DLayer with filter_size=(something, 60) for the first layer, and (something, 1) for subsequent ones. The Conv1DLayer internally uses a 2D convolution anyway, so performance will probably be about the same (depends on what the underlying convolution implementation does).

Thanks for the tip. My dataset is fairly large (I think) so it would help to do things efficiently. 

Jan, I know you do a lot of work with musical audio. I am trying to classify speakers from these 3 second clips. Is that maybe too long a clip to be feeding a CNN?
 
Thanks,

Jan Schlüter

Jun 22, 2016, 6:15:23 AM
to lasagne-users
Jan, I know you do a lot of work with musical audio. I am trying to classify speakers from these 3 second clips. Is that maybe too long a clip to be feeding a CNN?

No, 300 frames sounds reasonable. You can try to reduce the frame rate (or add a temporal pooling step) if you want to reduce the size, and you should try omitting the delta and delta-delta features (that's something the CNN could figure out by itself). Also you can try omitting the DCT from the MFCC computation and then use a 2D convolution.
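For example, both suggestions could be sketched like this, assuming the 60 features are ordered [MFCC, delta, delta-delta] along the last axis (that ordering is an assumption) and X is the num_exp x 300 x 60 array:

from lasagne.layers import InputLayer, MaxPool1DLayer

# Drop delta and delta-delta: keep only the first 20 coefficients
# (assumes the feature order is [mfcc, delta, delta-delta])
X_static = X[:, :, :20]                      # (num_exp, 300, 20)

# Temporal pooling right after the input to halve the frame rate
l_in = InputLayer(shape=(None, 20, 300))     # (batch, channels, frames), after dimshuffle
l_pool = MaxPool1DLayer(l_in, pool_size=2)   # (batch, 20, 150)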

Gautam Bhattacharya

Aug 6, 2016, 4:51:53 PM
to lasagne-users
Hi,

I am replying on this thread, as I am trying to set up this network.
I am working with 3-second segments of speech (20-dimensional MFCCs). I only want to do convolution along the time axis; here is my network:

network['input'] = InputLayer(shape=(None, 300, 20), input_var=net_input)
batchs, _, _ = network['input'].input_var.shape
network['reshape'] = ReshapeLayer(network['input'], (batchs, 1, 300, 20))

# convolutional layers
network['conv1'] = batch_norm(ConvLayer(network['reshape'], 256, (3, 20), stride=1, pad='full', flip_filters=False, W=HeNormal('relu')))
network['pool1'] = MaxPool2DLayer(network['conv1'], (3, 20))

network['conv2'] = batch_norm(ConvLayer(network['pool1'], 256, (3, 20), stride=1, pad='full', flip_filters=False, W=HeNormal('relu')))
network['pool2'] = MaxPool2DLayer(network['conv2'], (3, 20))

network['conv3'] = batch_norm(ConvLayer(network['pool2'], 256, (3, 20), stride=1, pad='full', flip_filters=False, W=HeNormal('relu')))
network['pool3'] = MaxPool2DLayer(network['conv3'], (3, 20))

# one fully connected layer
network['fc1'] = batch_norm(DenseLayer(network['pool3'], 1024, nonlinearity=lasagne.nonlinearities.rectify, W=HeNormal('relu')))
network['fc1_drop'] = DropoutLayer(network['fc1'], p=0.5)

# softmax
network['fc2'] = DenseLayer(network['fc1_drop'], 1628, nonlinearity=None)
network['prob'] = NonlinearityLayer(network['fc2'], nonlinearity=lasagne.nonlinearities.softmax)



The ConvLayer I am using is a Conv2DLayer (based on Jan's suggestion), mostly because I wasn't sure whether the Conv1DLayer would use cuDNN. Will it?
In that case I need to use MaxPool2DLayer for pooling.
Q. Is the way I have used it actually doing a 1D max-pool along time? In general, is my net doing what I want it to do?

The model pasted above is training, maybe a little slowly for the amount of data (quite a lot), but the loss is moving in the right direction. Suggestions for network architectures would be most welcome.

I am trying to build a model like in : https://arxiv.org/pdf/1502.01710v5.pdf 

Q. I was a little confused by this description: when they say the conv layers do not use stride, does that mean a stride of 1? What does this mean exactly in terms of padding? What size of output feature map, etc.?
"Table 1. Convolutional layers used in our experiments. The convolutional layers do not use stride and pooling layers are all non-overlapping ones, so we omit the description of their strides."

P.S. I trained a much deeper convnet on the same data with a VGGNet-like architecture, and each epoch was faster than an epoch of the much smaller net pasted above. Also, this smaller net runs out of memory if I use 512 filters instead of 256 in each layer. Does this mean I need to use larger filters when doing 1D convolutions?

Thanks in advance,
Gautam

Jan Schlüter

Aug 17, 2016, 1:18:30 PM
to lasagne-users
In general, is my net doing what I want it to do?

No. Only the first 2D layer should have a kernel height of 20, all others should have a height of 1, and all layers should have pad='valid'. Otherwise you're convolving both over time and quefrency. If you want full padding over time, but no padding over quefrency, you can use pad=(2,0) for filter length 3.
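In code, the corrected stack could look like this (a sketch keeping the filter counts from the post above):

# First layer: full-height filter, 'full' padding along time only
network['conv1'] = batch_norm(ConvLayer(network['reshape'], 256, (3, 20),
                                        stride=1, pad=(2, 0),
                                        flip_filters=False, W=HeNormal('relu')))
# Height is now 1, so pool over time only
network['pool1'] = MaxPool2DLayer(network['conv1'], (3, 1))

# Subsequent layers: kernel height 1, still padding along time only
network['conv2'] = batch_norm(ConvLayer(network['pool1'], 256, (3, 1),
                                        stride=1, pad=(2, 0),
                                        flip_filters=False, W=HeNormal('relu')))
network['pool2'] = MaxPool2DLayer(network['conv2'], (3, 1))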

Suggestions for network architectures would be most welcome

Look at the literature. There's a lot about convnets on mel spectrograms. You may not need to change much for convnets on MFCCs (except for not convolving over the quefrency axis of the MFCCs).

Best, Jan

Gautam Bhattacharya

Aug 19, 2016, 2:38:52 PM
to lasagne-users
Thanks for the help (as always) Jan!
Is the pooling happening correctly with (3,20), or does the height-1 rule also apply to the pooling layers?

Thanks,
Gautam

Jan Schlüter

Aug 22, 2016, 7:27:10 AM
to lasagne-users
Is the pooling happening correctly with (3,20), or does the height-1 rule also apply to the pooling layers?

You can figure this out by yourself. If you correct the padding (so you have no padding in the quefrency direction), then you can't have a (3,20) pooling: the output height of the first convolution will already be reduced to 1.
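A quick way to verify this is to ask Lasagne for the output shapes, e.g.:

from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            get_output_shape)

l = InputLayer((None, 1, 300, 20))
l = Conv2DLayer(l, 256, (3, 20), pad=(2, 0))
print(get_output_shape(l))   # (None, 256, 302, 1) -- height is already 1
l = MaxPool2DLayer(l, (3, 1))
print(get_output_shape(l))   # (None, 256, 100, 1)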

Gautam Bhattacharya

Aug 23, 2016, 2:01:28 PM
to lasagne-users
Yup, I figured as much :)
I am going to see if I can (analogously) build the same network but with a 1D convolution across frequency.
Thanks for your patience!

Jan Schlüter

Aug 23, 2016, 2:28:43 PM
to lasagne-users
I am going to see if I can (analogously) build the same network but with a 1D convolution across frequency.

If you compute MFCCs, you don't have frequency, but quefrency. It most probably won't be useful to share filters across different components of that -- it's like trying to share filters across PCA components. So don't be too surprised if it doesn't work! Alternatively, you can omit the DCT from the MFCC computation; this will give you mel spectrograms, which are more suitable for moving filters across the frequency dimension.
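For example, with librosa (assuming that's what you use for feature extraction; the file name and parameters below are just illustrative):

import librosa

y, sr = librosa.load('clip.wav', sr=16000)

# MFCCs: a log-mel spectrogram followed by a DCT
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# Omitting the DCT gives a (log-)mel spectrogram instead,
# which is more suitable for convolving over the frequency axis
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=60)
logmel = librosa.power_to_db(mel)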

Best, Jan