Dense Feature Extraction Via Sliding Windows

Riddhiman Dasgupta

Oct 2, 2014, 6:08:21 AM
to caffe...@googlegroups.com
Hi,
I'm a graduate student starting out in the area of deep learning, specifically convolutional neural networks. I am interested in scene parsing, and want to implement the following paper(s):
I have previously used CNNs for image classification, but I understand that here, instead of sending the entire image to the CNN, we need to send local patches obtained by a sliding window. 
Is there an efficient way to extract these so-called dense features from the whole image without having to traverse it with a sliding window, which is computationally very expensive?

P.S. I noticed that one paper, viz. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation, talks of extracting dense features efficiently without resorting to a sliding window, but I am unable to understand the exact details.
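
For concreteness, the naive patch-wise pipeline I have in mind looks roughly like the sketch below (plain Python/NumPy; net.predict here is just a stand-in for whatever single-patch classifier is used). It needs one forward pass per pixel, which is exactly the cost I'd like to avoid:

import numpy as np

def naive_dense_features(image, net, patch=32):
    # image: H x W x C array; net.predict(crop) is a placeholder for a
    # single-patch classifier. Pads the image so every pixel gets a centred patch.
    H, W = image.shape[:2]
    r = patch // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode='reflect')
    out = []
    for y in range(H):
        row = []
        for x in range(W):
            crop = padded[y:y + patch, x:x + patch]
            row.append(net.predict(crop))  # one forward pass per pixel
        out.append(row)
    return np.asarray(out)                 # H x W x num_outputs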

Nanne van Noord

Oct 6, 2014, 3:47:37 AM
to caffe...@googlegroups.com

Riddhiman Dasgupta

Oct 6, 2014, 10:35:54 AM
to caffe...@googlegroups.com
Hi, 

Thanks for the reply.

I've seen this, and I understand that it gives us a dense feature map as output. But the paper I originally linked to trains its network by sliding a window over the entire image and sending the patch around each pixel to the convnet for classification.

In fact, the net surgery tutorial mentions this at the end:
"Note that this model isn't totally appropriate for sliding-window detection since it was trained for whole-image classification. Nevertheless it can work just fine. Sliding-window training and finetuning can be done by defining a sliding-window ground truth and loss such that a loss map is made for every location and solving as usual."

The above sentence is what has stumped me. The sliding-window ground truth would be as large as the image. How do I get a convnet, albeit a fully convolutional one, to give an output that is the same size as the input? And how do I define a loss function that takes the loss at each pixel and backpropagates it accordingly?
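
To make the second question concrete, here is how I currently picture a per-pixel softmax loss over a dense score map, just to check my understanding (plain NumPy, with shapes following Caffe's N x C x H x W convention):

import numpy as np

def pixelwise_softmax_loss(scores, labels):
    # scores: N x C x H x W class scores, labels: N x H x W integer class indices.
    # Returns the mean per-pixel cross-entropy and its gradient w.r.t. the scores.
    N, C, H, W = scores.shape
    shifted = scores - scores.max(axis=1, keepdims=True)     # numerically stable softmax
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)                # N x C x H x W
    n, h, w = np.meshgrid(np.arange(N), np.arange(H), np.arange(W), indexing='ij')
    loss_map = -np.log(probs[n, labels, h, w])               # N x H x W: one loss per pixel
    grad = probs.copy()
    grad[n, labels, h, w] -= 1.0                             # (p - 1) at the true class, p elsewhere
    grad /= (N * H * W)
    return loss_map.mean(), grad

If I read the net surgery note correctly, Caffe's SoftmaxWithLoss layer does essentially this when the label blob has the same spatial dimensions as the score blob, so the "loss map" comes more or less for free; please correct me if that's wrong.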

Apologies in advance if my questions are too naive; I'm just starting out in this area.

Carlos Treviño

Jun 17, 2015, 9:02:56 AM
to caffe...@googlegroups.com
Hi,

Any idea how to do the training? Based on this work https://gist.github.com/shelhamer/80667189b218ad570e82#file-readme-md I created my own network, and I intend to do fine-tuning, but I'm not able to train it yet. Thanks in advance.
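
For reference, what I'm attempting in pycaffe is roughly the standard fine-tuning setup below; "solver.prototxt" and "fcn-weights.caffemodel" are placeholders for my own solver file and the weights downloaded from the gist:

import caffe

caffe.set_mode_gpu()
solver = caffe.SGDSolver('solver.prototxt')         # my solver, pointing at my train/val nets
solver.net.copy_from('fcn-weights.caffemodel')      # initialise from the pre-trained weights
solver.solve()                                      # or solver.step(n) to train incrementally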

-Carlos

Axel Angel

Jun 19, 2015, 8:10:08 AM
to caffe...@googlegroups.com

"The above sentence is what has stumped me. The sliding-window ground truth would be as large as the image. How do I get a convnet, albeit a fully convolutional one, to give an output that is the same size as the input? And how do I define a loss function that takes the loss at each pixel and backpropagates it accordingly?"

I think you are right that something is off. I'm pretty sure the output isn't exactly the whole image, because the borders of the convolutions cut off some parts. So either the image is padded by the kernel size on each side, or the output is a bit smaller than the input. In both cases it gives you a heat map: one prediction per receptive-field location. In the second case the output size is nearly, but not exactly, the image size.
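
As a sanity check, you can work out the output size layer by layer with the usual formula out = floor((in + 2*pad - kernel) / stride) + 1. A small sketch in Python (the layer list is only an example, not the exact net from the paper):

def conv_output_size(in_size, kernel, stride=1, pad=0):
    # standard output-size formula for convolution and pooling layers
    return (in_size + 2 * pad - kernel) // stride + 1

# example stack: conv 7x7/1 no pad, pool 2x2/2, conv 5x5/1 no pad, pool 2x2/2
layers = [(7, 1, 0), (2, 2, 0), (5, 1, 0), (2, 2, 0)]

size = 256                      # example input height/width
for kernel, stride, pad in layers:
    size = conv_output_size(size, kernel, stride, pad)
print(size)                     # 60: the map shrinks at the borders and is subsampled

With "same" padding on each convolution (pad = kernel // 2) only the pooling subsampling remains, so the heat map is the input size divided by the total stride (here 256 / 4 = 64).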