nnet3 convolution padding.


Rémi Francis

Jun 29, 2016, 7:50:49 AM
to kaldi-help
Is it possible to use the convolution layers in a way that preserves the input dimensions? The component doesn't seem to be able to do zero-padding.
It could work by doing the padding separately on the input before feeding it to the component, but I'm not quite sure how to do that.

Vijayaditya Peddinti

Jun 29, 2016, 9:50:16 AM
to kaldi-help
Frequency padding is easy to implement, but temporal padding would be tough, given that we support computation of chunks of outputs rather than single outputs. If you were planning to support zero-padding for each frame of output, as in IBM's CNN recipe, you would need to make major changes to the component: at each time step there would be several different convolution outputs per filter, one corresponding to each output in the output chunk. This would be very tough to implement.

You could achieve this trivially for frame-level objective functions, since you always compute single-frame outputs. However, you would have to be very careful to ensure that you compute single-frame outputs during decoding as well, which would make decoding very slow.

--Vijay



Daniel Povey

Jun 29, 2016, 3:17:12 PM
to kaldi-help
If the kind of temporal padding he wants to do is just at the beginning/end of the file, this would be quite possible through appropriate use of IfDefined().
Dan
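
A sketch of what this might look like in a network config. `Offset()`, `Append()` and `IfDefined()` are real nnet3 descriptor functions (`IfDefined(x)` evaluates to zero wherever `x` is not computable, which gives zero-padding at the utterance edges), but the component and node names here are hypothetical:

```
# Hypothetical config fragment: splice frames t-1, t, t+1 as input to a
# convolution component, with IfDefined() supplying zeros past the
# beginning/end of the file instead of making those frames required context.
component-node name=conv1 component=conv1 input=Append(IfDefined(Offset(input, -1)), input, IfDefined(Offset(input, 1)))
```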

Rémi Francis

Jul 5, 2016, 6:47:32 AM
to kaldi-help
On Wednesday, 29 June 2016 14:50:16 UTC+1, Vijayaditya Peddinti wrote:

I was thinking of treating the whole chunk as the input, as if it were an image being processed in computer vision. The results would differ with different chunk sizes, but that happens with BLSTMs too.
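
A minimal sketch of that idea (an illustration, not Kaldi code): treat the whole chunk as one "image" and apply a zero-padded ("same") 1-D convolution along time, so the output has exactly as many frames as the input chunk. As noted above, the values near the edges depend on where the chunk boundaries fall.

```python
import numpy as np

def conv1d_same(chunk, kernel):
    """'Same' 1-D convolution: zero-pad so output length equals input length."""
    pad = len(kernel) // 2
    padded = np.pad(chunk, pad, mode="constant")  # zeros at both chunk edges
    return np.array([np.dot(padded[t:t + len(kernel)], kernel)
                     for t in range(len(chunk))])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = conv1d_same(x, np.array([0.25, 0.5, 0.25]))
# len(y) == len(x): the chunk's temporal dimension is preserved,
# but y[0] and y[-1] "see" the zero padding at the chunk boundary.
```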


Daniel Povey

Jul 5, 2016, 4:50:23 PM
to kaldi-help
That kind of thing will happen automatically (padding by repeating the first/last frames) at the level of the wrapping code that calls the core nnet3 code [if --pad-input is true, which it is by default]. It looks at the 'left-context' and 'right-context' of the network (the minimum amounts of left and right context required to produce a single frame of output) and pads the input by those amounts, during both training and test.

Dan
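
A sketch of the wrapper-level behavior Dan describes (an illustration under that description, not Kaldi's actual code): pad the input chunk by repeating its first and last frames, by the network's left-context and right-context.

```python
import numpy as np

def pad_input(feats, left_context, right_context):
    """feats: (num_frames, feat_dim) matrix. Repeat the first frame
    left_context times and the last frame right_context times, as with
    --pad-input=true."""
    return np.pad(feats, ((left_context, right_context), (0, 0)), mode="edge")

feats = np.arange(12, dtype=float).reshape(4, 3)
padded = pad_input(feats, left_context=2, right_context=1)
# padded has shape (4 + 2 + 1, 3) = (7, 3); rows 0-2 all equal feats[0]
```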

Rémi Francis

Jul 7, 2016, 9:44:13 AM
to kaldi-help, dpo...@gmail.com
I see, so that solves the problem in the temporal dimension indeed.
What would be the easiest way to do frequency padding?

Vijayaditya Peddinti

Jul 7, 2016, 10:21:40 AM
to kaldi-help, Daniel Povey
If you do not want to change any component dimensions, you could do all the necessary padding at the feature level. However, this would result in a lot of unnecessary computation at the initial convolution layers. A solution that modifies the component's code would involve creating a new, padded input matrix inside the component; this would require changing both the Propagate and Backprop methods. A better solution would pad the component's inputs while the input tensor is being reshaped for the matrix-matrix multiplication, which avoids creating additional temporary matrices.

--Vijay
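
The first (feature-level) option can be sketched in a few lines; this is an illustration of the idea, not Kaldi code, and the bin counts are made up. Zero-pad the frequency axis before the features reach the network, so a frequency convolution preserves the feature dimension:

```python
import numpy as np

def pad_frequency(feats, pad):
    """feats: (num_frames, num_freq_bins). Zero-pad the frequency axis on
    both sides, leaving the time axis untouched."""
    return np.pad(feats, ((0, 0), (pad, pad)), mode="constant")

feats = np.ones((10, 40))           # e.g. 10 frames of 40 filterbank bins
padded = pad_frequency(feats, pad=4)
# padded.shape == (10, 48): a width-9 frequency filter now yields 40 outputs
```

As Vijay notes, the cost of this simplicity is that the first convolution layer does extra multiplications against bins that are known to be zero.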

Daniel Povey

Jul 7, 2016, 2:52:45 PM
to Vijayaditya Peddinti, kaldi-help
I don't think we should put a lot of effort into supporting this at the component level if it's going to delay the availability of "normal" convolutional nets.
Dan