How can fully convolutional layers work for a segmentation task?

john1...@gmail.com

Jan 23, 2017, 3:22:45 AM
to Caffe Users
Hello all, I am reading the paper "Fully Convolutional Networks for Semantic Segmentation" by Jonathan Long*, Evan Shelhamer*, and Trevor Darrell (CVPR 2015 and PAMI 2016).
I want to understand why it works for semantic segmentation. Let's look at the FCN-32s architecture: it consists of two phases, feature extraction (conv1_1 -> pool5) and feature classification (fc6 -> score_fr). Compared with a normal classification network, the main difference is the second phase. FCN-32s replaces the fully connected layers with fully convolutional layers (1x1 in fc7) to retain the spatial map (as described in the caption of figure 2 of the paper). Hence, I am confused about two points:

1. If we replace the fully connected layers with fully convolutional layers, how can the network learn its weights the same way a traditional classification architecture does?
2. Why can we retain the spatial map (heatmap) by using fully convolutional layers?

Thank you in advance.

Przemek D

Jan 23, 2017, 8:12:23 AM
to Caffe Users
1. What is a traditional classification architecture? Look at what the FC layer does: it simply connects all pixels of the last convolved blob to all of its outputs. Doing it "traditionally" - that is, flattening the input blob into a vector and then multiplying by a weight matrix to produce an output vector of length N - is mathematically equivalent to convolving the blob with N filters of the exact same size as the blob. If you write it down, you will find that those kernels equal the columns (or rows, depending on the convention) of the FC weight matrix - the alignment of data in memory is the only difference (see the first sketch at the end of this message).
2. Imagine a standard AlexNet with a 256x6x6 pool5 output. The "convolutionized" fc6 needs a kernel size of 6x6 (256 channels deep, but that's not important now) so that it outputs a 1x1xNUM_OUTPUT blob. Now make your input image substantially larger than the original 227x227: all the conv outputs grow, and so does pool5 - say it becomes 256x10x10. But since your fc6 still has a 6x6 kernel, it will no longer cover this map exactly - it will slide over it with some stride (by default 1), producing a 4096x5x5 blob instead of a single vector. This is as if you took patches of your input image and fed each of them through a standard CNN with an FC classifier - except this way it's much more efficient. fc8 would normally output a blob of shape [batch_size, num_classes], but in this case we get [batch_size, 5, 5, num_classes] - which can be interpreted as a heatmap (the second sketch at the end of this message checks this size arithmetic).
Note that this is not possible with a standard FC network: if you reshape the input to an arbitrary size, the conv layer outputs reshape too, but fc6 will still attempt to connect all pool5 output pixels to its own outputs, potentially resulting in an immense number of weights (and if your training images are not all of the same shape, the network will fail to train).
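
As a toy illustration of point 1 (all shapes here are invented for the example), a few lines of NumPy show that flatten-and-multiply and a blob-sized convolution produce identical numbers, assuming the W @ x convention where kernels correspond to rows of the weight matrix:

```python
import numpy as np

# Toy shapes, invented purely for illustration.
C, H, W, N = 4, 3, 3, 10
blob = np.random.randn(C, H, W)              # the last convolved blob
fc_weights = np.random.randn(N, C * H * W)   # FC weight matrix (W @ x convention)

# "Traditional" FC: flatten the blob, then multiply by the weight matrix.
fc_out = fc_weights @ blob.reshape(-1)       # output vector of length N

# Convolutionized FC: N kernels of the exact same size as the blob,
# each kernel being one row of the FC matrix reshaped back to C x H x W.
kernels = fc_weights.reshape(N, C, H, W)
conv_out = np.array([(k * blob).sum() for k in kernels])  # one 1x1 output per filter

assert np.allclose(fc_out, conv_out)
```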
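
And for point 2, the 5x5 heatmap size follows from the standard valid-convolution formula; this little helper (again, just for illustration) checks the AlexNet numbers from above:

```python
# Valid-convolution output size: (in + 2*pad - kernel) // stride + 1.
def conv_output_size(in_size, kernel, stride=1, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

print(conv_output_size(6, 6))   # 1 -> original 227x227 input: a single vector
print(conv_output_size(10, 6))  # 5 -> larger input, 10x10 pool5: a 5x5 heatmap
```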

john1...@gmail.com

Jan 23, 2017, 9:24:34 AM
to Caffe Users
Thank you for the valuable comment. I am looking at VGG-16 and FCN-32s.
In VGG-16, the shapes from pool5 through fc8 are

`pool5(7x7x512)--conv(7x7x512)-->fc6(1x1x4096)--conv(1x1x4096)-->fc7(1x1x4096)--conv(1x1x4096)-->fc8(1x1x1000)  (where 1000 is number of classes)`

In FCN-32s, they are

`pool5(22x22x512)--conv(7x7x512)-->fc6(16x16x4096)--conv(1x1x4096)-->fc7(16x16x4096)--conv(1x1x4096)-->fc8(16x16x1000)`

In VGG-16, after pool5, they use a 7x7x512 kernel to make a full connection between all input neurons and each output neuron. Meanwhile, FCN-32s uses the same 7x7x512 kernel on a 22x22x512 input, making a local connection (it connects just a 7x7x512 window of neurons to each output neuron). At this point, I understand how FCN-32s converts the full connection into a full convolution. But after fc6 in FCN-32s, why do they use a 1x1 kernel instead of, say, 3x3? And the output of fc8 is 16x16x1000 in order to retain the spatial output map, instead of 1x1x1000 - is that right?
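
As a sanity check, the numbers in both pipelines work out under the same valid-convolution formula as in the sketch above (the helper below is just for illustration):

```python
# Valid-convolution output size, restated here for the VGG/FCN numbers above.
def out_size(in_size, kernel, stride=1, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

assert out_size(7, 7) == 1    # VGG-16:  pool5 7x7   --conv 7x7--> fc6 1x1
assert out_size(22, 7) == 16  # FCN-32s: pool5 22x22 --conv 7x7--> fc6 16x16
assert out_size(16, 1) == 16  # 1x1 convs keep fc7 and fc8 at 16x16
```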



On Monday, January 23, 2017 at 22:12:23 UTC+9, Przemek D wrote:

Przemek D

Jan 24, 2017, 3:23:19 AM
to Caffe Users
Let's look at the fc6 blob in VGG-16 - it's a 4096-element vector of intermediate classification results. What does the next layer, fc7, do with it? It takes another 4096 inner products of this vector with its weight vectors, producing 4096 outputs - each input pixel is connected to every output pixel exactly once.
Now, what is fc6 in the FCN-32s model? It's a 16x16x4096 blob: a 16x16 map of 4096-element vectors of intermediate classification results, one for each spatial location in the input image - essentially fc6 vectors from VGG-16. fc7 works on individual vectors (1x4096), and to replicate this behavior you need a 1x1x4096 convolution - it connects all pixels within one spatial location to all pixels in the corresponding spatial location of the output blob, but does not mix data from different spatial locations.
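
A small NumPy sketch of this (toy channel counts, and an HWC layout chosen for readability rather than Caffe's actual NCHW): a 1x1 convolution applies the same FC transform independently at every spatial location:

```python
import numpy as np

# Toy sizes; layout is H x W x C here for readability (Caffe itself uses NCHW).
H, W, C_in, C_out = 16, 16, 8, 6
blob = np.random.randn(H, W, C_in)       # the "fc6" map: a grid of feature vectors
weights = np.random.randn(C_out, C_in)   # one shared FC weight matrix

# 1x1 convolution: the same matrix multiply at each location (h, w).
conv1x1 = np.einsum('hwc,oc->hwo', blob, weights)

# Equivalent: loop over spatial locations, applying the FC layer to each vector.
fc_per_location = np.array(
    [[weights @ blob[h, w] for w in range(W)] for h in range(H)]
)

assert np.allclose(conv1x1, fc_per_location)
# No information crosses between spatial locations - that is why the
# 16x16 heatmap layout survives fc7 and fc8.
```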
The output of fc8 is 16x16x1000 as a consequence of feeding in a larger input image and no longer forcing the FC layers to a constant output size.

FCNs can be imagined as sweeping a CNN over an image. The original paper, "Fully Convolutional Networks for Semantic Segmentation", will be more helpful to you now.