1D convolution with conv2d


Sander Dieleman

Apr 21, 2011, 9:07:37 AM
to theano-users
Hi,

Last summer I joined the Reservoir Lab group at the University of
Ghent, in Belgium. I recently started using Theano to speed up some
experiments with convolutional RBMs.


I was trying to do a 1D convolution, but Theano only has 'conv2d' as
far as I can tell. I'm using the implementations from nnet (with 4D
tensors).

Say you have some sequence data with 'dim' dimensions and 'T'
timesteps, and you want to apply a convolution in time, but not across
the dimensions.

There are two approaches to implementing a 1D convolution using
conv2d, and I found one to be significantly more efficient than the
other, so I thought I'd share my experiences.

- one is to interpret the dimensions as forming one feature map
together; so there is 1 (dim x T) feature map.

- the other is to interpret the dimensions as different feature maps,
where each feature map corresponds with a (1 x T) sequence.

In the former case, it is important to ensure that the filters used in
the convolution have height 'dim', so that the convolution is
1D - let's say they're (dim x w). The output of a valid convolution
with a single filter will then be ((dim - dim + 1) x (T - w + 1)) =
(1 x K).

In the latter case, the filter will be a combination of 'dim' filter
components, one for each feature map, again yielding a (1 x K)
output.

As it turns out, the former is much faster on the GPU than the latter.
My guess is that in the latter case, a separate convolution operation
is run for each input feature map, causing it to be an order of
magnitude slower. In the former case, even though you're performing a
'useless' convolution operation in the 'dim' direction, it's much
faster.

One huge gotcha though: the convolution operation flips the filter
matrices, so when 'swapping' the dimensions to convert one method into
the other, the filters also need to be flipped to achieve the same
results.
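The flip can be verified with a small NumPy/SciPy sketch (scipy.signal.convolve2d, like Theano's conv2d, performs a true convolution, i.e. it flips the filter); the shapes here are illustrative:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))   # (dim x T) input, dim = 4, T = 10
f = rng.standard_normal((4, 3))    # (dim x w) filter, w = 3

# a valid convolution collapses the dim axis: output is (1 x (T - w + 1))
out = convolve2d(x, f, mode='valid')

# convolution flips the filter, so reproducing the result as a per-window
# dot product requires flipping both axes of the filter first
f_flipped = f[::-1, ::-1]
manual = np.array([[(x[:, t:t + 3] * f_flipped).sum() for t in range(8)]])
```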


I ran into this while implementing a Convolutional RBM. Here's some
(research-quality) code for the computation of the activation of the
visibles; the old version that uses the slow method (multiple feature
maps), and the new version that uses the faster one (single feature
map). With 100 hidden maps it's approximately 10 times faster for me.

I had to do some manual zero padding because this is actually a full
convolution in the time direction. The trick only works when the
convolution is valid in the 'dimensions' direction.

Both methods are completely equivalent, save for some rounding
differences.

def T_activation_v_OLD(self, h):
    # vbias is per visible dimension.
    # The first two dimensions of the weights tensor have to be swapped here,
    # because they represent the number of input and output feature maps.
    W_shuffled = self.W.dimshuffle(1, 0, 2, 3)
    filter_shape = [self.filter_shape[k] for k in [1, 0, 2, 3]]
    return conv.conv2d(h, W_shuffled, border_mode='full',
                       image_shape=self.hiddens_shape,
                       filter_shape=filter_shape) \
        + self.vbias.dimshuffle('x', 'x', 0, 'x')


def T_activation_v(self, h):
    # This version of T_activation_v swaps the 'dim' and 'hmaps' dimensions,
    # because this seems to be a LOT faster.
    # It seems to be a little less precise than the original, but otherwise
    # the output is identical.
    W_shuffled = self.W.dimshuffle(2, 1, 0, 3)

    # The hmaps dimension needs to be flipped, because it was previously not
    # convolved over, and now it is. Note that a valid convolution with
    # filtersize = inputsize is NOT equivalent to a product - this is only
    # the case if you FLIP the filter.
    # The dim dimension need not be flipped, because it is convolved over
    # when computing T_activation_h (so it is already inherently flipped).
    W_shuffled = W_shuffled[:, :, ::-1, :]

    h_shuffled = h.dimshuffle(0, 2, 1, 3)

    # This is a full convolution in the time direction and a valid one in
    # the other, hence the need to pad the input manually.
    zero_padding = T.zeros_like(h_shuffled)[:, :, :, 0:self.filter_width - 1]
    h_padded = T.concatenate([zero_padding, h_shuffled, zero_padding], axis=3)

    filter_shape = [self.filter_shape[k] for k in [2, 1, 0, 3]]
    image_shape = [self.hiddens_shape[k] for k in [0, 2, 1, 3]]
    image_shape[3] += 2 * (self.filter_width - 1)
    tmp = conv.conv2d(h_padded, W_shuffled, border_mode='valid',
                      image_shape=image_shape, filter_shape=filter_shape)
    return tmp.dimshuffle(0, 2, 1, 3) \
        + self.vbias.dimshuffle('x', 'x', 0, 'x')
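As a sanity check on the trick itself (independent of Theano), the two shapings can be compared with NumPy/SciPy stand-ins for conv2d; note the flip along the axis that changes its role:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
dim, T, w = 4, 12, 3
x = rng.standard_normal((dim, T))   # the data
f = rng.standard_normal((dim, w))   # one (dim x w) filter

# method 1 (fast on GPU): one (dim x T) feature map, valid 2D convolution
out1 = convolve2d(x, f, mode='valid')[0]   # shape (T - w + 1,)

# method 2 (slow on GPU): dim separate (1 x T) feature maps, one 1D filter
# row per map, summed. The dim axis is no longer convolved over, so the
# filter rows must be flipped along dim to get the same answer.
out2 = sum(np.convolve(x[d], f[dim - 1 - d], mode='valid')
           for d in range(dim))
```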

Olivier Delalleau

Apr 21, 2011, 11:40:53 AM
to theano...@googlegroups.com
Thanks for sharing this tip!

-=- Olivier


Brian Vandenberg

Apr 21, 2011, 11:47:46 AM
to theano...@googlegroups.com
I can't comment on this specific issue (using conv2d for a 1d
convolution), but I do have some suggestions for improving performance
on a convolutional RBM.

The suggestion is to do two basic things that can help a lot:

* Use DFTs and Hadamard products for the convolutions
* Mix pairs of samples as complex samples, e.g. given two samples A &
B, you'd produce a new sample C = A + iB

The latter allows you to convolve a single filter with two samples at
once, decreasing the total computation by a fair amount. Although it
isn't necessary to do so, you can still separate the two samples in
the frequency domain without using an inverse DFT.

If you use regular convolutions instead of using the frequency
domain, on the down passes the number of computations you need to
perform is pretty immense, whereas in the frequency domain you can sum
the results from the convolutions on the down pass before doing the
inverse DFT. A similar savings can be found when calculating the
gradient.

It's a little more involved to get the dimensions right, and you
have to use zero-padded samples to avoid doing circular convolutions
(though, some DFT libraries have built-in functionality for padding a
sample implicitly) ... but, I think it'll help a lot.
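For the curious, here is a minimal NumPy sketch of both ideas - zero-padded DFT convolution and the C = A + iB packing trick (which works because the filter is real, so the convolution distributes over the real and imaginary parts):

```python
import numpy as np

rng = np.random.default_rng(2)
n, w = 32, 5
a = rng.standard_normal(n)   # sample A
b = rng.standard_normal(n)   # sample B
f = rng.standard_normal(w)   # a real-valued filter

# zero-pad to the full linear-convolution length to avoid circular wrap
L = n + w - 1
F = np.fft.fft(f, L)

# pack two real samples into one complex signal: C = A + iB, then one
# DFT + Hadamard product + inverse DFT convolves both samples at once
y = np.fft.ifft(np.fft.fft(a + 1j * b, L) * F)

# the real and imaginary parts are the two linear convolutions
conv_a, conv_b = y.real, y.imag
```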

Before you ask, I don't have numbers to share. I've done the
analysis, but haven't had the time to implement it.

-Brian

Razvan Pascanu

Apr 21, 2011, 2:50:18 PM
to theano...@googlegroups.com
On Thu, Apr 21, 2011 at 11:47 AM, Brian Vandenberg <phan...@gmail.com> wrote:
>  I can't comment on this specific issue (using conv2d for a 1d
> convolution), but I do have some suggestions for improving performance
> on a convolutional RBM.
>
>  The suggestion is to do two basic things that can help a lot:
>
> * Use DFTs and hadamard products for the convolutions
> * Mix pairs of samples as complex samples.  eg, given two samples A &
> B, you'd produce a new sample C = A + iB
>
Can you do this with Theano right now ? That would be cool.

And by the way, thanks both of you guys for sharing such tricks. We
should at some point in time write all these kinds of tricks somewhere
for new users to see. This somehow feels like a tutorial on
convolutional RBMs ... though quite a bit more work is needed to get
it into a tutorial. Anyone who wants to dedicate time to writing a
tutorial is welcome (I think we can easily accept 3rd party tutorials
:) - giving credit of course to the author).

Razvan

James Bergstra

Apr 21, 2011, 3:10:09 PM
to theano...@googlegroups.com
On Thu, Apr 21, 2011 at 2:50 PM, Razvan Pascanu <r.pa...@gmail.com> wrote:
> On Thu, Apr 21, 2011 at 11:47 AM, Brian Vandenberg <phan...@gmail.com> wrote:
>>  I can't comment on this specific issue (using conv2d for a 1d
>> convolution), but I do have some suggestions for improving performance
>> on a convolutional RBM.
>>
>>  The suggestion is to do two basic things that can help a lot:
>>
>> * Use DFTs and hadamard products for the convolutions
>> * Mix pairs of samples as complex samples.  eg, given two samples A &
>> B, you'd produce a new sample C = A + iB
>>
> Can you do this with Theano right now ? That would be cool.
>
> And by the way, thanks both of you guys for sharing such tricks. We
> should at some point in time write all these kinds of tricks somewhere
> for new users to see. This somehow feels like a tutorial on
> convolutional RBMs ... though quite a bit more work is needed to get
> it into a tutorial. Anyone who wants to dedicate time to writing a
> tutorial is welcome (I think we can easily accept 3rd party tutorials
> :) - giving credit of course to the author).
>
> Razvan

I actually wonder if these tips can both just disappear under the hood and benefit everyone... without making anyone rewrite anything, much less the tutorials.  Fred and Josh worked a little to make the conv2d implementation switch to Fourier domain once the filters got sufficiently big - but I don't think those changes ever made it to trunk, and the A+Bi trick can be hidden in the implementation at that point.

The conv (aka conv1d) should be a different function I think (in signal, rather than nnet), and we can kick it off with the code originally attached until there is a more native implementation.

If the OP wants to fork the repo from here: https://bitbucket.org/jaberg/theano and make that change, it would speed up the process of getting it merged into trunk ;-)

James
-- 
http://www-etud.iro.umontreal.ca/~bergstrj

Brian Vandenberg

Apr 21, 2011, 3:46:36 PM
to theano...@googlegroups.com
> I actually wonder if these tips can both just disappear under the hood and
> benefit everyone... without making anyone rewrite anything, much less the
> tutorials.  Fred and Josh worked a little to make the conv2d implementation

A big part of the cost savings is being able to do summations prior
to computing inverse DFTs. If it's straightforward to write it in
such a way that it is aware of and can take advantage of the
properties of DFTs, then awesome; otherwise it might be more hassle
than it's worth.

Even if it is too much of a hassle to get all the tricks in there, a
general version that handles most users' needs would still be a good
thing to have.

-Brian

Sander Dieleman

Apr 22, 2011, 5:49:31 AM
to theano-users
On Apr 21, 5:47 pm, Brian Vandenberg <phant...@gmail.com> wrote:
>   I can't comment on this specific issue (using conv2d for a 1d
> convolution), but I do have some suggestions for improving performance
> on a convolutional RBM.
>
>   The suggestion is to do two basic things that can help a lot:
>
> * Use DFTs and hadamard products for the convolutions
> * Mix pairs of samples as complex samples.  eg, given two samples A &
> B, you'd produce a new sample C = A + iB
>
> [...]
>
>   It's a little more involved to get the dimensions right, and you
> have to use zero-padded samples to avoid doing circular convolutions
> (though, some DFT libraries have built-in functionality for padding a
> sample implicitly) ... but, I think it'll help a lot.
>
>   Before you ask, I don't have numbers to share.  I've done the
> analysis, but haven't had the time to implement it.
>
> -Brian

Thanks for the tips! Although I think I need to do some reading before
I'd be able to implement these things, and I really need to get this
thing running ASAP so I'm going to stop optimizing it for now. I
definitely want to try it later, though.



On Apr 21, 9:10 pm, James Bergstra <james.bergs...@gmail.com> wrote:
> I actually wonder if these tips can both just disappear under the hood and
> benefit everyone... without making anyone rewrite anything, much less the
> tutorials. Fred and Josh worked a little to make the conv2d implementation
> switch to Fourier domain once the filters got sufficiently big - but I don't
> think those changes ever made it to trunk, and the A+Bi trick can be hidden
> in the implementation at that point.
>
> The conv (aka conv1d) should be a different function I think (in signal,
> rather than nnet) and we can kick it off with the code originally attached
> while there isn't a more native implementation.
>
> If the OP wants to fork the repo from here: https://bitbucket.org/jaberg/theano
> and make that change, it would speed up the process of getting it merged
> into trunk ;-)
>
> James
> -- http://www-etud.iro.umontreal.ca/~bergstrj

I wouldn't mind taking a look at it, but I need to familiarize myself
with Theano a bit more before I delve into its internals - I've only
been using it for a few weeks :)


I was also wondering if there is perhaps a better way to do zero
padding than this:

zero_padding = T.zeros_like(h_shuffled)[:,:,:, 0:self.filter_width-1]
h_padded = T.concatenate([zero_padding, h_shuffled, zero_padding], axis=3)

... seeing as this seems to be the bottleneck now, rather than the
convolution :)


Thanks for your feedback everyone.

Sander

James Bergstra

Apr 22, 2011, 9:40:38 AM
to theano...@googlegroups.com
On Fri, Apr 22, 2011 at 5:49 AM, Sander Dieleman <sanderd...@gmail.com> wrote:
> I was also wondering if there is perhaps a better way to do zero
> padding than this:
>
> zero_padding = T.zeros_like(h_shuffled)[:,:,:, 0:self.filter_width-1]
> h_padded = T.concatenate([zero_padding, h_shuffled, zero_padding], axis=3)
>
> ... seeing as this seems to be the bottleneck now, rather than the
> convolution :)
>
> Thanks for your feedback everyone.
>
> Sander

Good that you profiled, sometimes things are slow in surprising places.  You could try using the tensor.inc_subtensor or tensor.set_subtensor to just write the h_shuffled into one large buffer of zeros.
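For reference, a NumPy sketch of the "one large zero buffer" idea (Theano's tensor.set_subtensor would play the role of the slice assignment below; the shapes are illustrative):

```python
import numpy as np

def pad_time_axis(h, pad):
    # allocate one large zero buffer and write h into its middle,
    # instead of concatenating two zero tensors around it
    shape = h.shape[:3] + (h.shape[3] + 2 * pad,)
    out = np.zeros(shape, dtype=h.dtype)
    out[:, :, :, pad:pad + h.shape[3]] = h   # cf. T.set_subtensor
    return out

h = np.arange(24.0).reshape(1, 2, 3, 4)
hp = pad_time_axis(h, 2)
```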

Sander Dieleman

Apr 22, 2011, 10:46:53 AM
to theano-users
On Apr 22, 3:40 pm, James Bergstra <james.bergs...@gmail.com> wrote:
>
> Good that you profiled, sometimes things are slow in surprising places.  You
> could try using the tensor.inc_subtensor or tensor.set_subtensor to just
> write the h_shuffled into one large buffer of zeros.
>
> James
> -- http://www-etud.iro.umontreal.ca/~bergstrj

Profile mode is great, it makes optimizing the code much less tedious.
I especially love the tips section :)

I'll give it a try, thanks!

Sander

Philippe Hamel

Apr 22, 2011, 10:48:24 AM
to theano...@googlegroups.com
Thanks for the analysis of the 1D convolution. I am also using conv2d
to do 1D convolutions on audio signals, and I thought it was a bit
slow. I'll use your analysis to optimize my code.
I was thinking that Theano could use an optimized 1D convolution, but
never took the time to get to it. Although I have no experience with
GPU code, I'd be willing to help code such an op (maybe the CPU
part?). However, I have a deadline in a few weeks and cannot invest
much time in this right now.

Just to be clear, are the dimensions of your input tensor h the
following (tell me if I am wrong) ?

0 : number of inputs
1 : number of feature maps (in this case = 1)
2 : number of features at each time step (same as the filter shape)
3 : time steps

Brian Vandenberg

Apr 22, 2011, 11:05:52 AM
to theano...@googlegroups.com
> I was also wondering if there is perhaps a better way to do zero
> padding than this:
>
> zero_padding = T.zeros_like(h_shuffled)[:,:,:, 0:self.filter_width-1]
> h_padded = T.concatenate([zero_padding, h_shuffled, zero_padding], axis=3)
>
> ... seeing as this seems to be the bottleneck now, rather than the
> convolution :)

Unless you're doing convolutions with DFTs instead of directly
computing the convolution, I don't think you need to do zero padding.

-Brian

Philippe Hamel

Apr 22, 2011, 12:49:59 PM
to theano...@googlegroups.com
Sorry, I believe I was wrong - I misread your first post. The order of
the dimensions I wrote is the actual order that you want to send to
conv2d, right? But since you shuffle h in your code, your input tensor
h has dimensions 1 and 2 switched.

Philippe

Sander Dieleman

Apr 22, 2011, 3:32:48 PM
to theano-users
On Apr 22, 4:48 pm, Philippe Hamel <hamel.p...@gmail.com> wrote:
> Thanks for the analysis of the 1D convolution. I am also using conv2D
> to do 1D convolutions on audio signals, and I thought it was a bit
> slow.  I'll use your analysis to optimize my code.
> I was thinking that theano could use an optimized 1D convolution, but
> never took the time to get to it. Although I have no experience with
> GPU code, I'd be willing to help code such an op (maybe the CPU
> part?). However, I have a deadline in a few weeks, and I cannot invest
> much time in this right now.

hmm, deadline in a few weeks... sounds familiar :p
You don't happen to be toying with the Million Song Dataset as
well? :)

>
> Just to be clear,  are the dimensions of your input tensor h  the
> following (tell me if I am wrong) ?
>
> 0 : number of inputs
> 1 : number of feature maps (in this case = 1)
> 2 : number of features at each time step (same as the filter shape)
> 3 : time steps
>

In the original implementation, the dimensions of h are:
0: number of minibatches = 1 (I don't use this for now - maybe I should!)
1: number of feature maps = 100 (I use 100 hidden feature maps)
2: number of hidden rows = 1
3: number of hidden columns = number of timesteps

the last two dimensions are convolved over, but seeing as the number
of rows is 1 and this is a full convolution, this amounts to a 1D
convolution (only over axis 3 = timesteps).

In the second implementation, dimensions 1 and 2 are swapped to get a
single feature map, which turns out to be much faster. The 'feature
maps' from the original form of h are now convolved over, which also
requires that the new convolution is valid instead of full (else the
trick doesn't work).
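In NumPy terms, the axis swap is just a transpose (a dimshuffle in Theano); the sizes below match the 100 hidden maps mentioned above, with an illustrative 250 timesteps:

```python
import numpy as np

# original layout: (minibatches, feature maps, rows, columns)
h = np.zeros((1, 100, 1, 250))        # 100 hidden maps, 250 timesteps

# swapping axes 1 and 2 yields a single (100 x 250) feature map
h_shuffled = h.transpose(0, 2, 1, 3)  # h.dimshuffle(0, 2, 1, 3) in Theano
```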




On Apr 22, 5:05 pm, Brian Vandenberg <phant...@gmail.com> wrote:
>
> Unless you're doing convolutions with DFTs instead of directly
> computing the convolution, I don't think you need to do zero padding.
>
> -Brian

I have to pad zeros in the timesteps dimension (axis 3), because my
trick only works with a valid convolution.

The original convolution is a full one (in both axis 2 and 3), whereas
the 'transformed' convolution has to be valid in axis 2 and full in
axis 3. Since theano.nnet.conv.conv2d doesn't allow for the
border_mode to be set separately for each axis, I have to turn the
valid convolution in axis 3 into a full one manually, by padding
(filter width - 1) columns of zeros on each side.
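The padding trick is easy to check numerically: padding (filter width - 1) zeros on each side of the time axis turns a valid convolution into a full one (scipy.signal.convolve2d as a stand-in for conv2d, with illustrative sizes):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(3)
x = rng.standard_normal((1, 10))   # a (1 x T) sequence
f = rng.standard_normal((1, 4))    # a (1 x w) filter
w = f.shape[1]

# pad (w - 1) columns of zeros on each side of the time axis
xp = np.pad(x, [(0, 0), (w - 1, w - 1)])

full = convolve2d(x, f, mode='full')
valid_on_padded = convolve2d(xp, f, mode='valid')
```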


Sander

Neal Becker

Apr 27, 2011, 3:25:53 PM
to theano...@googlegroups.com, Brian Vandenberg

Product of DFTs produces circular convolution. To simulate linear convolution
you need padding. A good introduction is "Digital Signal Processing" by
Proakis.

If you are convolving a short sequence with a long one (i.e., filtering), look
for the overlap-save (or overlap-add) technique. I find overlap-save easier
conceptually.

If sequence1 is length N1 and sequence2 is length N2 > N1, then use a DFT of
length N > N1. A convenient choice is N = 2*N1. After each inverse DFT, discard
the first N1 outputs. Then shift the inputs by N1 to compute the next batch.

(Correlation is basically the same, but swap which half you discard.)
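A minimal NumPy implementation of overlap-save along these lines (here the standard variant that discards the first M - 1 samples of each block, where M is the filter length; the recipe above, which discards a full N1, is slightly more conservative):

```python
import numpy as np

def overlap_save(x, h, N=None):
    # block-wise FFT convolution of a long signal x with a short filter h;
    # each N-point block overlaps the previous by M - 1 samples, and the
    # first M - 1 outputs of each block (the circular wrap) are discarded
    M = len(h)
    if N is None:
        N = 2 * M
    step = N - M + 1                 # valid outputs per block
    H = np.fft.fft(h, N)
    out_len = len(x) + M - 1         # length of the linear convolution
    x_padded = np.concatenate([np.zeros(M - 1), x, np.zeros(N)])
    y = np.zeros(out_len)
    for start in range(0, out_len, step):
        block = x_padded[start:start + N]
        yb = np.fft.ifft(np.fft.fft(block) * H).real
        keep = yb[M - 1:]            # discard the circularly-wrapped part
        y[start:start + step] = keep[:min(step, out_len - start)]
    return y

rng = np.random.default_rng(4)
x = rng.standard_normal(100)
h = rng.standard_normal(8)
```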
