What is the meaning of 1x1 conv layers?

Alex Orloff

Jan 28, 2016, 5:39:45 PM
to Caffe Users
If I make a convolution with a 1x1 kernel, it simply means multiplying the whole image by the same coefficient.
I really don't understand how that can help in the recognition process.
Sorry for such a dumb question; maybe there are some articles about it?
Thanks

Oleg Klimov

Jan 29, 2016, 6:17:38 AM
to Caffe Users
On Friday, January 29, 2016 at 1:39:45 AM UTC+3, Alex Orloff wrote:
If I make a convolution with a 1x1 kernel, it simply means multiplying the whole image by the same coefficient.

Right, but if it's followed by a ReLU or some other nonlinearity it makes sense. Think of applying the rectifier, flipping the function upside down (multiplying by -1), and applying it again.
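To make that concrete (my own toy sketch, nothing Caffe-specific): a weight of -1, then ReLU, then -1 again computes min(x, 0), a shape a single ReLU cannot produce on its own:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-2.0, 2.0, 5)

# weight -1 -> ReLU -> weight -1: clips the positives instead of the negatives
y = -relu(-x)

# identical to elementwise min(x, 0)
assert np.allclose(y, np.minimum(x, 0.0))
```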

Oleg.

Youssef Kashef

Jan 29, 2016, 10:41:08 AM
to Caffe Users
A 1x1 conv. filter cannot learn any spatial features, but it is still capable of learning any position-invariant linear combination of its input channels (e.g. a combination of colors in the early layers), if that is at all relevant to the task at hand.
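A toy example of such a channel combination (my own illustration): converting RGB to grayscale is exactly a 1x1 convolution with fixed weights:

```python
import numpy as np

rgb = np.random.rand(3, 4, 4)            # (channels, H, W)
w = np.array([0.299, 0.587, 0.114])      # a 1x1x3 filter: standard luma weights

# a 1x1 conv applies the same channel mix at every spatial position
gray = np.tensordot(w, rgb, axes=([0], [0]))   # shape (4, 4)
```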

ath...@ualberta.ca

Jan 29, 2016, 2:27:49 PM
to Caffe Users
If the previous layer has 128 feature maps, say, then "1x1 convolutions" are convolutions across all of these feature maps with filters each of size 1x1x128. If one chooses to have 64 of these 1x1x128 filters, the result is 64 feature maps, each the same spatial size as before. View each output feature map as a "per-pixel" projection (dot product) onto a lower-dimensional space, using a single learned filter (weights tied) across all input feature maps. Basically, they crush 128 feature maps (representational responses to 128 learned filters) into 64 feature maps, ignoring the spatial dimension.

Remember that larger filters, like a 3x3x128 filter, would also learn to summarize feature responses across all feature maps, so in that sense all filter sizes do the same thing. The only difference is that 1x1 (learned) filters ONLY do this across feature maps, whereas 3x3 filters (say) also consider local spatial correlations.
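The per-pixel projection view can be checked numerically; here's a sketch in NumPy with the shapes from the example (random data, nothing Caffe-specific):

```python
import numpy as np

C_in, C_out, H, W = 128, 64, 7, 7
x = np.random.rand(C_in, H, W)       # 128 input feature maps
w = np.random.rand(C_out, C_in)      # 64 filters, each of size 1x1x128

# per-pixel dot products: project each 128-dim response onto 64 dims
y = np.einsum('oc,chw->ohw', w, x)   # 64 output maps, same spatial size

# equivalent view: flatten the pixels, one matrix multiply, reshape back
y2 = (w @ x.reshape(C_in, H * W)).reshape(C_out, H, W)
assert np.allclose(y, y2)
```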

So, they are used for two reasons:

1. Dimensionality reduction: when performing larger convolutions (spatial 3x3 or 5x5, ...) over a large number of feature maps, first bringing down the depth (number of feature maps) reduces computation dramatically. This is done in the GoogLeNet Inception modules (ref. 2).

2. Extra nonlinearity: since a ReLU will be applied again afterwards, each 1x1 layer adds yet another nonlinearity, which can be helpful.
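On point 1, the savings are easy to work out with representative (made-up) numbers: 256 input maps and a 5x5 convolution producing 256 maps, versus first reducing to 64 maps with a 1x1:

```python
# multiply-adds per output pixel ~ C_in * C_out * k * k
c_in, c_out, k = 256, 256, 5

direct = c_in * c_out * k * k              # 5x5 straight on 256 maps
reduced = c_in * 64 + 64 * c_out * k * k   # 1x1 down to 64 maps, then the 5x5

print(direct, reduced)   # 1638400 425984 -> roughly a 3.8x saving
```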

See: 1. Network in Network paper: http://arxiv.org/abs/1312.4400
     2. Going Deeper with Convolutions paper: http://arxiv.org/abs/1409.4842

Hope this helps

Cheers,
Andy

Alex Orloff

Jan 29, 2016, 3:33:23 PM
to Caffe Users
Thank you Andy.
If you don't mind, one more question.
Can I use pooling layers with stride=2 and kernel=2 on blobs with odd dimensions?
What will the output blob be in that case?

On Friday, January 29, 2016 at 10:27:49 PM UTC+3, ath...@ualberta.ca wrote:

ath...@ualberta.ca

Feb 1, 2016, 1:03:45 PM
to Caffe Users
(n - f)/s + 1, where n is the input size (width or height), f is the kernel size, and s is the stride. For odd dimensions this isn't an integer; Caffe's pooling layer rounds the division up (ceil), so the last pooling window just hangs over the edge.
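A quick check for the kernel=2, stride=2 case; I believe Caffe uses ceil rounding for pooling, but treat that as an assumption to verify against your own network:

```python
import math

def pooled_size(n, f=2, s=2):
    # pooling output size with ceil rounding: ceil((n - f) / s) + 1
    return math.ceil((n - f) / s) + 1

# an odd and an even input can end up with the same output size
print(pooled_size(7), pooled_size(8))   # both give 4
```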

zzz

Feb 2, 2016, 3:01:48 PM
to Caffe Users
I think we can understand the 1x1 convolution as a pixel-wise linear classifier.
For example, suppose the input feature map size is (128, 500, 500).
If the output is (1, 500, 500), the convolution kernel size would be (1, 128, 1, 1). That 128-dimensional weight vector is effectively a linear classifier applied at every pixel.
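Sketching that with NumPy shapes (a smaller spatial size than 500x500 to keep it cheap, same idea):

```python
import numpy as np

x = np.random.rand(128, 50, 50)     # input feature maps (C, H, W)
w = np.random.rand(128)             # the (1, 128, 1, 1) kernel as a vector

# the same 128-dim linear classifier applied independently at every pixel
scores = np.einsum('c,chw->hw', w, x)[np.newaxis]   # shape (1, 50, 50)
```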

Correct me if I am wrong, please.