I am having trouble understanding the output of convolutional layers. My current intuition is that every such layer consists of `num_output` convolutional kernels, each of which is a matrix of size `kernel_size x kernel_size`. Upon receiving a batch of `batch_size` images of `H x W` pixels each, the layer convolves every image with every kernel and produces `batch_size * num_output` images, each of size `((H - kernel_size) / stride + 1) x ((W - kernel_size) / stride + 1)` pixels (the exact formula doesn't matter much here). Thus, each convolutional layer contains exactly `num_output` neurons, each of which corresponds to `kernel_size^2` weights.
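In code, my mental model corresponds to something like the following numpy sketch. The function names are mine, purely for illustration; this is what I *imagine* a conv layer does, not necessarily what Caffe actually does:

```python
import numpy as np

def conv2d_single(img, kernel, stride=1):
    """Valid cross-correlation of one 2D image with one 2D kernel."""
    k = kernel.shape[0]
    out_h = (img.shape[0] - k) // stride + 1
    out_w = (img.shape[1] - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

def my_imagined_conv_layer(batch, kernels, stride=1):
    """batch: (batch_size, H, W); kernels: (num_output, k, k).

    Slides every kernel over every image independently, so the output
    is (batch_size * num_output, out_h, out_w) -- my mental model.
    """
    return np.stack([conv2d_single(img, ker, stride)
                     for img in batch for ker in kernels])

batch = np.random.rand(4, 28, 28)    # batch_size=4, 28x28 images
kernels = np.random.rand(20, 5, 5)   # num_output=20, kernel_size=5
out = my_imagined_conv_layer(batch, kernels)
print(out.shape)                     # (80, 24, 24): 4 * 20 images of 24x24
```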
If all this is correct, then when two convolutional layers are stacked, the number of images they output should multiply. So, if I have two layers, the first with 20 kernels and the second with 50, together they should produce `batch_size * 20 * 50 = batch_size * 1000` images. But when training my network in Caffe, I see that two such layers do not output a blob of shape `(batch_size, 1000, H, W)`, but rather one of shape `(batch_size, 50, H, W)`. Where am I wrong?
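For reference, the two layers are defined roughly like this in my prototxt (the layer names and `bottom`/`top` blob names are placeholders):

```
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "conv1"
  top: "conv2"
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
  }
}
```

With this setup, the `conv2` blob comes out with 50 channels, not 1000.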