I am having trouble understanding the output of convolutional layers. My current intuition is that every such layer consists of `num_output` convolutional kernels, each of which is a matrix of size `kernel_size x kernel_size`. Upon receiving a batch of `batch_size` images of `H x W` pixels each, the layer convolves every image with every kernel and produces `batch_size * num_output` images, each of size `((H - kernel_size) / stride + 1) x ((W - kernel_size) / stride + 1)` pixels (the exact formula doesn't matter much here). Thus, each convolutional layer contains exactly `num_output` neurons, each of which corresponds to `kernel_size^2` weights.
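In code, my mental model corresponds to something like the following numpy sketch. The function names are mine, purely for illustration; this is what I *imagine* a conv layer does, not necessarily what Caffe actually does:

```python
import numpy as np

def conv2d_single(img, kernel, stride=1):
    """Valid cross-correlation of one 2D image with one 2D kernel."""
    k = kernel.shape[0]
    out_h = (img.shape[0] - k) // stride + 1
    out_w = (img.shape[1] - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

def my_imagined_conv_layer(batch, kernels, stride=1):
    """batch: (batch_size, H, W); kernels: (num_output, k, k).

    Slides every kernel over every image independently, so the output
    is (batch_size * num_output, out_h, out_w) -- my mental model.
    """
    return np.stack([conv2d_single(img, ker, stride)
                     for img in batch for ker in kernels])

batch = np.random.rand(4, 28, 28)    # batch_size=4, 28x28 images
kernels = np.random.rand(20, 5, 5)   # num_output=20, kernel_size=5
out = my_imagined_conv_layer(batch, kernels)
print(out.shape)                     # (80, 24, 24): 4 * 20 images of 24x24
```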
If all this is correct, then when two convolutional layers are stacked, the number of images they output should multiply. So, if I have two layers, the first with 20 kernels and the second with 50, together they should produce `batch_size * 20 * 50 = batch_size * 1000` images. But when training my network in Caffe, I see that two such layers do not output a blob of shape `(batch_size, 1000, H, W)`, but rather one of shape `(batch_size, 50, H, W)`. Where am I wrong?
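For reference, the two layers are defined roughly like this in my prototxt (the layer names and `bottom`/`top` blob names are placeholders):

```
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "conv1"
  top: "conv2"
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
  }
}
```

With this setup, the `conv2` blob comes out with 50 channels, not 1000.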