Questions about understanding the LeNet protobuf


Xi Wu

Apr 1, 2016, 4:13:35 PM
to Caffe Users
Hi all,

I am looking into the NetParameter dumped for LeNet, as well as the lenet.prototxt available in the examples, and I have questions about how one maps the proto to the actual construction (or connections) of the network at the second convolutional layer. Specifically:

1. I understand the first convolution and pooling layers. Starting with a 28*28 input image, we first build 20 feature maps, each of size 24*24, using 20 kernels of size 5*5. In other words, for each of the 20 feature maps, all the units in that feature map share a kernel of 5*5 weights and 1 bias. Then we do 2*2 pooling (stride 2), and so we get 20 feature maps, each of size 12*12 (the sketch after point 3 below spells out this arithmetic). This matches what I got from the dumped parameters: the first blob of conv1 has shape 20*1*5*5.

2. My question comes when looking at the second convolutional layer, where the dimension becomes 50*20*5*5 (this indicates that there should be 50*20 kernels, each of size 5*5). conv2 says that the output number is 50. I understand this to mean that there should be 50 feature maps. But how do these 50 feature maps connect to the 20 input feature maps of size 12*12? My guess is that for each i of the 50 output feature maps and each j of the 20 input feature maps, we have a (shared) kernel of size 5*5. If this is the case, it means that, for example, the first neuron of the first output feature map has 20 kernels connected to it (this neuron actually connects to *every* 5*5 neighborhood through a shared kernel). This does give the parameters shown in the dumped proto: there are 50 * 20 * 5 * 5 kernel weight parameters. However, it does not explain why there are only 50 bias parameters.

3. From the above two points, it seems that Caffe is assuming some implicit way to construct, or connect, the neurons. What is this implicit way? Is it explicitly stated somewhere?
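For reference, this is the size arithmetic I am assuming for point 1 (a plain-Python sketch, not Caffe code; the helper names are made up):

def conv_output_size(input_size, kernel, stride=1, pad=0):
    # standard convolution output size formula
    return (input_size + 2 * pad - kernel) // stride + 1

def pool_output_size(input_size, kernel, stride):
    # pooling output size (Caffe rounds up for pooling, but with 24, 2, 2 it makes no difference)
    return (input_size - kernel) // stride + 1

conv1 = conv_output_size(28, kernel=5)               # 24
pool1 = pool_output_size(conv1, kernel=2, stride=2)  # 12
print(conv1, pool1)                                  # 24 12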

Best.

Xi Wu

Apr 3, 2016, 11:16:34 PM
to Caffe Users
Perhaps a naive question first, before I get answers to the previous ones... For the first convolutional layer, why isn't the dimension simply 20 * 5 * 5, but rather 20 * 1 * 5 * 5 (in the blobs proto, the shape has 4 dim fields: 20, 1, 5, 5)? Where does this 1 come from? Is it redundant, or is it forced by some design of Caffe here?

Best.

Jan

Apr 15, 2016, 7:55:35 AM
to Caffe Users
See interleaved answers.


Perhaps a naive question first, before I get answers to the previous ones... For the first convolutional layer, why isn't the dimension simply 20 * 5 * 5, but rather 20 * 1 * 5 * 5 (in the blobs proto, the shape has 4 dim fields: 20, 1, 5, 5)? Where does this 1 come from? Is it redundant, or is it forced by some design of Caffe here?

Because you have 1 input image channel. If you used RGB images, you would have a param shape of 20, 3, 5, 5.
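If you want to check the shapes yourself, you can print them from pycaffe, e.g. (adjust the paths to your own prototxt/caffemodel, these are just examples):

import caffe

net = caffe.Net('examples/mnist/lenet.prototxt',
                'examples/mnist/lenet_iter_10000.caffemodel',
                caffe.TEST)

for name, blobs in net.params.items():
    # blobs[0] is the weight blob, blobs[1] the bias blob (if the layer has one)
    print(name, [b.data.shape for b in blobs])

# For LeNet this prints something like:
#   conv1  (20, 1, 5, 5)  (20,)
#   conv2  (50, 20, 5, 5) (50,)
#   ...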


2. My question comes when looking at the second convolutional layer, where the dimension becomes 50*20*5*5 (this indicates that there should be 50*20 kernels, each of size 5*5). conv2 says that the output number is 50. I understand this to mean that there should be 50 feature maps. But how do these 50 feature maps connect to the 20 input feature maps of size 12*12? My guess is that for each i of the 50 output feature maps and each j of the 20 input feature maps, we have a (shared) kernel of size 5*5. If this is the case, it means that, for example, the first neuron of the first output feature map has 20 kernels connected to it (this neuron actually connects to *every* 5*5 neighborhood through a shared kernel). This does give the parameters shown in the dumped proto: there are 50 * 20 * 5 * 5 kernel weight parameters. However, it does not explain why there are only 50 bias parameters.

All you say and assume is correct. There are 50*20 kernels in total. For each of the 50 output feature maps there are 20 kernels, one for each input feature map, and only one bias. The layer convolves every kernel with its corresponding input feature map, sums the 20 results for each output channel and adds the bias. Having 20 individual bias values per output channel would not make sense: functionally there is no difference, since 20 per-channel biases would just collapse into a single constant anyway, but the storage needs would increase.
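To make the summation concrete, here is a rough NumPy sketch of how a single conv2 output map is computed (just the math, not Caffe's actual implementation):

import numpy as np

def conv2_single_output(inputs, kernels, bias):
    # inputs: (20, 12, 12), kernels: (20, 5, 5), bias: one scalar
    c, h, w = inputs.shape
    kh, kw = kernels.shape[1:]
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = inputs[:, i:i + kh, j:j + kw]
            # sum over all 20 input channels and the 5x5 window,
            # then add the single per-output-map bias
            out[i, j] = np.sum(patch * kernels) + bias
    return out

out = conv2_single_output(np.random.rand(20, 12, 12),
                          np.random.rand(20, 5, 5),
                          bias=0.1)
print(out.shape)  # (8, 8)

Repeat this for each of the 50 sets of 20 kernels and you get the 50 output maps of conv2.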

Jan

