Understanding Deconvolution in FCN


DI

Mar 7, 2017, 9:23:59 AM3/7/17
to Caffe Users
Hi,

I am trying to understand how the deconvolution layer in FCN32, FCN16, and FCN8 works. In FCN32, I understand that the stride is set to 32 in the Deconvolution layer because the input is downsampled by a factor of 32 by the five pooling operations. But I don't get why the kernel size is set to 64.

I also want to understand how exactly this deconvolution layer upsamples its input. For example, consider the FCN32 case.

1. Input layer dimensions: 500x500x3
2. score_fr layer dimensions: 16x16x21
3. "upscore" layer dimensions: 544x544x21 ( (input_h - 1)*stride + kernel_size - 2*pad )

> Please explain how exactly the upsampling is done with stride 32 and a convolution kernel of size 64.
> Is the input upsampled first, using a MATLAB-style "expand" function, and then convolved with the kernel of size 64?
> If the input is expanded the MATLAB way, are the newly added rows and columns replicated from the input values, or are they filled with zeros?
> Is there a formula for finding the kernel size of a deconvolution layer?
> Also, why is the kernel size even rather than odd in the deconvolution layer?

"score_fr Layer output" 

[score_fr output values omitted: a spatial grid (one channel of the 16x16x21 score_fr blob) of small integer class scores such as 9, 18, 10, 5, 2, 15, and 0]

"Deconvolution Layer output" : ?????

Thanks and Regards
DI

Alex Orloff

Mar 7, 2017, 2:43:30 PM3/7/17
to Caffe Users
You should have some degree in math to understand this.
If you have none, better forget about it.

DI

Mar 7, 2017, 5:54:09 PM3/7/17
to Caffe Users
Hey Alex,

Thanks for the reply and your suggestion. I do need that degree in math. But I understand that the math behind CNNs is not that difficult, as it mostly boils down to differential calculus: find the errors, compute the deltas across all layers by backpropagating the errors with the help of the layer outputs, then find the gradient of the error with respect to the weight matrices, and finally update the weight matrices accordingly (also accounting for the learning rate, momentum, etc. while updating the weights).

I tried to understand the math behind deconvolution by myself, but due to my weak grip on mathematics I am not able to figure it out, and I cannot find a proper explanation of deconvolution. I watched Andrej Karpathy's lecture (CS231n) on FCNs. I also read the article "A guide to convolution arithmetic for deep learning": https://arxiv.org/abs/1603.07285

In CS231n, it is explained as each input value being multiplied elementwise by the weight matrix and simply dumped into the output, with overlapping values in the output summed up as the kernel moves across the output. Nothing was specifically mentioned about the kernel size. Please correct me if I understood it incorrectly.

Whereas in "A guide to convolution arithmetic for deep learning", Vincent starts well by giving an example of converting the convolution operation into a sparse matrix multiplication. But later, in Chapter 4, he explains deconvolution as just a normal convolution with the same kernel size, but with the input dilated with zeros in both directions and zero-padded. He does not explain it in the sparse-matrix-multiplication way.
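For reference, the sparse-matrix view mentioned above can be sketched in 1-D (an illustrative example; the kernel values and sizes are made up, and this is not Caffe's actual implementation):

```python
import numpy as np

# A 1-D convolution with kernel [w0, w1, w2], stride 1, no padding, on a
# length-4 input can be written as a matrix multiplication with a sparse
# matrix C. The transposed convolution ("deconvolution") is then simply
# multiplication by C.T, mapping the length-2 output back to length 4.

w = np.array([1.0, 2.0, 3.0])
C = np.array([
    [w[0], w[1], w[2], 0.0],
    [0.0,  w[0], w[1], w[2]],
])                      # maps a length-4 input to a length-2 output

x = np.array([1.0, 1.0, 1.0, 1.0])
y = C @ x               # forward convolution: [6., 6.]
up = C.T @ y            # transposed convolution: [6., 18., 30., 18.]
print(y, up)
```

Note that `C.T @ y` does not recover `x`; it only produces an output of the input's shape, with each kernel weight scattering its contribution back to the positions it originally read from.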

Hence I am getting confused about the kernel size and the deconvolution process. In FCN32, why is the kernel size set to 64 instead of, say, 3? Is it arbitrary, or is there some math behind it?

Please suggest a few universities that offer a part-time math degree; I will enroll.
It would be great if you could edit your earlier comment, as it may discourage others from replying to this post.

Regards
DI

Przemek D

Mar 8, 2017, 2:44:25 AM3/8/17
to Caffe Users
Deconvolution is (at least in Caffe) convolution applied backwards. I cannot find the quote on this right now, but AFAIK the convolution forward pass == the deconvolution backward pass, and conv BP == deconv FP. You can picture this using the convolution demo from Andrej Karpathy's CS231n notes (module 2, note 1): reading it left to right you get a normal convolution with stride 2, padding 1, and kernel 3; reading it right to left you get a deconvolution with the same parameters.
In your original post you quoted the output-volume shape equation. To help you understand it better: the equation for the convolution output shape is
output_size = ( input_size - kernel_size + 2*padding ) / stride + 1        // convolution
If deconvolution is the reverse of this, the roles of input and output swap, so we can write:
input_size = ( output_size - kernel_size + 2*padding ) / stride + 1        // deconvolution
which we rearrange into:
output_size = ( input_size - 1 )*stride + kernel_size - 2*padding
From this we see immediately that no magic happens: the conv/deconv parameters must relate in this way for the output to have your desired shape.
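A quick sanity check of the shape equations, using the FCN32 numbers from the original post (stride 32, kernel 64, no padding):

```python
def conv_output_size(input_size, kernel_size, padding, stride):
    # Standard convolution output size.
    return (input_size - kernel_size + 2 * padding) // stride + 1

def deconv_output_size(input_size, kernel_size, padding, stride):
    # Deconvolution reverses the roles of input and output.
    return (input_size - 1) * stride + kernel_size - 2 * padding

up = deconv_output_size(16, kernel_size=64, padding=0, stride=32)
print(up)  # 544 -- matches the 544x544x21 "upscore" blob

# Round trip: convolving a 544-wide map with the same parameters gives 16 back.
print(conv_output_size(up, kernel_size=64, padding=0, stride=32))  # 16
```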

Why such a large kernel size? In the visualization I linked, pay attention to the kernel size. The animation is for kernel=3, stride=2, so the kernel positions somewhat overlap on the input image. Now imagine working backwards (deconvolution, i.e. from right to left) and upsampling with a stride of 32 while keeping the kernel size at 3. We could make the padding larger so that the output is as big as we want, but look what happens to the output image (left): we get a black map with tiny 3x3 speckles every 30 pixels or so. The remedy is to boost the kernel size so that its spatial locations overlap, i.e. so that every output pixel is computed based on information from several neighboring input pixels.
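The "multiply by the kernel and dump into the output, summing overlaps" picture from the CS231n lecture can be sketched in 1-D (a toy illustration, not Caffe's implementation):

```python
import numpy as np

# Minimal 1-D "deconvolution" as scatter-and-sum: each input value is
# multiplied by the whole kernel and added into the output at stride-spaced
# offsets; overlapping contributions simply sum up.

def deconv1d(x, kernel, stride):
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])

# kernel = 2*stride: every output sample receives a contribution, and
# neighboring input pixels overlap -- this is the FCN32 kernel=64/stride=32
# situation in miniature.
print(deconv1d(x, np.ones(4), stride=2))   # [1. 1. 3. 3. 5. 5. 3. 3.]

# kernel much smaller than stride: the output is mostly zeros -- the
# "speckles" described above.
print(deconv1d(x, np.ones(2), stride=4))   # [1. 1. 0. 0. 2. 2. 0. 0. 3. 3.]
```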

Hope this helps you understand it. I don't think one needs a math degree for that - I don't have one ;)

zhenhua xu

Oct 24, 2017, 4:04:34 AM10/24/17
to Caffe Users
Hi DI, 
Have you got all the answers to your questions?
I am still confused.
In particular, how is 2 * factor - factor % 2 derived as the kernel size?

On Tuesday, March 7, 2017 at 10:23:59 PM UTC+8, DI wrote:

Przemek D

Oct 25, 2017, 8:09:58 AM10/25/17
to Caffe Users
I can't tell you exactly how it is derived, but being about twice the upscale factor ensures that the kernels overlap at two neighboring spatial locations and not more. Look at the animation linked in my previous post: notice how, as we shift the kernel along one axis of the input, the kernel positions overlap? When you move one stride length to the left, the kernel overlaps its previous location; if you now reverse the process, this means that one input pixel influences two neighboring spatial locations in the deconvolution output. If your kernel size were larger, three subsequent locations could overlap, which would mean that at the center of one output location both neighboring locations would also influence that point.
I'm sorry if this is unclear, it's not so easy to explain without a pen and paper. Perhaps if you try to draw it yourself, you will get the idea.
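For what it's worth, the FCN reference code initializes its deconvolution weights with a bilinear interpolation filter, which is where kernel_size = 2*factor - factor % 2 pays off: the resulting filter interpolates exactly between neighboring input pixels. A sketch along those lines (the function name here is mine):

```python
import numpy as np

def bilinear_kernel(size):
    # 2-D bilinear interpolation filter of the given size: a separable
    # "tent" function peaking at the kernel center.
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)

factor = 2
size = 2 * factor - factor % 2   # = 4: even, roughly twice the upscale factor
print(bilinear_kernel(size))
# 1-D weights are [0.25, 0.75, 0.75, 0.25]; the 2-D kernel is their
# outer product, so adjacent stride-2 placements sum to exactly 1 everywhere.
```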