Hi,
As an exercise, I'm implementing the convolution layers from scratch, and I wanted to ask about an issue I ran into.
I'm referring only to the ImageNet model.
I noticed that Caffe's 3D filters always have the same depth (number of channels) as their input. For example, the input to the first conv layer is a 3x227x227 array, and that layer contains 96 filters of size 3x11x11. That means there is no sliding in the depth dimension, only element-wise multiplication (there is no "room" to slide, since the filter depth equals the input depth).
So what I'm asking is this:
1. For a single filter (of the 96), can I apply a regular 2D convolution to each of the three channels, obtaining three 2D results, and then simply sum them up (or combine them with some other function)? The output is of size 96x55x55, so the depth dimension somehow disappears (I would have expected the result to be 96x3x55x55, since the result of a 3D convolution is 3D).
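To make question 1 concrete, here is a small NumPy/SciPy sketch of what I mean, using toy sizes instead of Caffe's real ones and stride 1 instead of conv1's actual stride of 4. I use correlate2d rather than convolve2d since, as I understand it, Caffe's "convolution" does not flip the kernel:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
C, H, W = 3, 9, 9       # toy stand-ins for the 3x227x227 input
kH = kW = 3             # toy stand-in for an 11x11 kernel
x = rng.standard_normal((C, H, W))   # one input "image"
w = rng.standard_normal((C, kH, kW)) # one filter, same depth as input

# per-channel 2D cross-correlation, then sum over the channel axis
out = sum(correlate2d(x[c], w[c], mode='valid') for c in range(C))

# reference: slide the full 3xkHxkW filter over the input and take
# the dot product at each position -- the depth dimension collapses
manual = np.zeros((H - kH + 1, W - kW + 1))
for i in range(manual.shape[0]):
    for j in range(manual.shape[1]):
        manual[i, j] = np.sum(x[:, i:i+kH, j:j+kW] * w)
```

If I understand correctly, `out` and `manual` should match exactly, which would mean summing the three per-channel 2D results is equivalent to the full 3D sliding dot product, and that is why the depth dimension disappears.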
2. A second, simpler issue: the bias term (the second blob of the layer) is a vector of size 96, and I assume each bias element corresponds to one filter. Do I just add that filter's bias element to every element of its convolution result (which is of size 55x55)?
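And for question 2, this is the broadcasting I have in mind (variable names are mine, not Caffe's):

```python
import numpy as np

rng = np.random.default_rng(1)
conv_out = rng.standard_normal((96, 55, 55))  # one 55x55 map per filter
bias = rng.standard_normal(96)                # one scalar bias per filter

# add bias[k] to every element of filter k's 55x55 output map
out = conv_out + bias[:, None, None]
```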
Thanks,
Gil