Dense Feature Extraction Via Sliding Windows

Riddhiman Dasgupta

Oct 2, 2014, 7:34:51 AM
to pylear...@googlegroups.com
Hi,
I'm a graduate student starting out in the area of deep learning, specifically convolutional neural networks. I am interested in scene parsing, and want to implement the following papers: "Learning Hierarchical Features for Scene Labeling" and "Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers".
I have previously used CNNs for image classification, but I understand that here, instead of sending the entire image to the CNN, we need to send local patches obtained by a sliding window.
Is there an efficient way to extract these so-called dense features from the entire image, without traversing the whole image with a sliding window, which is computationally very expensive?

P.S. I noticed that one paper, viz. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation, talks of extracting dense features efficiently without resorting to a sliding window, but I am unable to understand the exact details.

Frédéric Bastien

Oct 9, 2014, 12:43:35 PM
to pylear...@googlegroups.com
Hi,

I don't have time to read the papers, so I am basing this email on the title. You want a sliding window; I suppose the window size and the stride are constant.

I didn't find anything in Pylearn2 for this. In Theano there is images2neibs:

http://deeplearning.net/software/theano/library/sandbox/neighbours.html#theano.sandbox.neighbours.images2neibs

It lets you get the patches you want with a constant window size and stride. It works on both CPU and GPU. The current implementation copies the data. There are different ways to treat the border.
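
A minimal sketch of how it can be used (the input and patch shapes here are just for illustration):

import numpy as np
import theano
import theano.tensor as T
from theano.sandbox.neighbours import images2neibs

images = T.tensor4('images')  # (batch, channels, rows, cols)
# Extract 3x3 patches with a stride of 1 in both directions.
patches = images2neibs(images, neib_shape=(3, 3), neib_step=(1, 1))
f = theano.function([images], patches)

x = np.random.randn(1, 1, 5, 5).astype(theano.config.floatX)
print(f(x).shape)  # one flattened 3x3 patch per row: (9, 9)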

Here is a specific example for a vector; it can be generalized to images. It returns a view of the original data, so it doesn't copy anything.

import numpy as np
n = 10
a = np.arange(n)
window_size = 2
overlap = 1
step = window_size - overlap
# Number of complete windows when advancing by `step`.
n_windows = (a.shape[0] - window_size) // step + 1
# as_strided returns a view onto `a`: nothing is copied.
b = np.lib.stride_tricks.as_strided(
    a, shape=(n_windows, window_size),
    strides=(step * a.strides[0], a.strides[0]))
print(b)
[[0 1]
 [1 2]
 [2 3]
 [3 4]
 [4 5]
 [5 6]
 [6 7]
 [7 8]
 [8 9]]
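
For images, the same trick generalizes to 2D (a sketch, with made-up sizes):

import numpy as np

img = np.arange(36).reshape(6, 6)
win, step = 3, 1  # window size and stride
shape = ((img.shape[0] - win) // step + 1,
         (img.shape[1] - win) // step + 1, win, win)
strides = (step * img.strides[0], step * img.strides[1],
           img.strides[0], img.strides[1])
patches = np.lib.stride_tricks.as_strided(img, shape, strides)
print(patches.shape)  # (4, 4, 3, 3): still a view, no copy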

Something similar can be done on the GPU.

Can you tell me if that is what you wanted?

I made an issue to document this better: https://github.com/Theano/Theano/issues/2166

Fred


Jesse Livezey

Oct 9, 2014, 11:23:39 PM
to pylear...@googlegroups.com
I'm not sure if this is what was meant by sliding window, but I'm curious whether there is any demand for a dataset iterator that returns patches of topological data. This is something I've wanted, but it doesn't fit well (as far as I can tell) into the current iterator structure.

Jesse

Kyle Kastner

Oct 10, 2014, 11:22:24 AM
to pylear...@googlegroups.com
I don't know if this is exactly what you mean, but if you turn all the top (fully connected) layers into convolutional layers after training, a CNN can be slid over windows of arbitrary size, as long as the input is larger than the original training size. This is exactly what we do in sklearn-theano (http://sklearn-theano.github.io/) to slide OverFeat over large images.

This is definitely a way to do dense feature extraction, but I don't know how the efficiency of this approach compares to the others in these papers.
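
For example, a fully connected layer on top of a feature map is just a convolution whose kernel covers the whole map. A toy numpy check (all sizes made up):

import numpy as np

C, H, W, n_out = 4, 5, 5, 10
feat = np.random.randn(C, H, W)
W_fc = np.random.randn(n_out, C * H * W)  # trained FC weights
fc_out = W_fc.dot(feat.ravel())           # standard FC forward pass
# The same weights viewed as n_out filters of size C x H x W,
# applied at a single spatial position:
W_conv = W_fc.reshape(n_out, C, H, W)
conv_out = np.einsum('ochw,chw->o', W_conv, feat)
print(np.allclose(fc_out, conv_out))      # True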

Kyle

Arjun Jain

Oct 10, 2014, 12:04:44 PM
to pylear...@googlegroups.com
Essentially we save redundant convolution computations, and easily replicate the fully connected layers over the overlapping patches using 1x1 convolutions (and fold the sliding window inside after pooling).
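
To make that concrete, here is a toy numpy sketch (sizes made up): applying a fully connected layer at every spatial location is exactly a 1x1 convolution.

import numpy as np

n_in, n_out, H, W = 10, 3, 4, 4
feat = np.random.randn(n_in, H, W)          # one feature vector per location
W_fc = np.random.randn(n_out, n_in)         # trained FC weights
out = np.einsum('oi,ihw->ohw', W_fc, feat)  # 1x1 conv == FC at each (h, w)
print(out.shape)                            # (3, 4, 4)

HTH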

Riddhiman Dasgupta

Oct 10, 2014, 2:16:07 PM
to pylear...@googlegroups.com
So, I convert the final fully connected layers to convolutional ones, with 1x1 filters, and perform all the convolution, activation and pooling operations on the image at one go?

As far as I understand, this will give me a sort of heatmap, albeit of a much smaller size than the input image. If my input image has dense labelling, I can downsample the original groundtruth to match the size of my output heatmap, or vice versa, and backpropagate the errors at each pixel. Is this the right way to go about it if I intend to train a model that has fully convolutional layers instead of fully connected ones? Won't upsampling the heatmap or downsampling the groundtruth give rise to errors?
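
(For concreteness, this is the kind of groundtruth downsampling I have in mind, assuming a total output stride of 4:)

import numpy as np

stride = 4                                  # assumed total downsampling factor
labels = np.random.randint(0, 5, (64, 64))  # dense per-pixel groundtruth
# One target per heatmap cell: the label at the centre of its receptive field.
targets = labels[stride // 2::stride, stride // 2::stride]
print(targets.shape)                        # (16, 16)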

Also, what does "fold the sliding window inside after pooling" mean exactly?

Thanks a ton.

Riddhiman Dasgupta

Oct 10, 2014, 2:18:55 PM
to pylear...@googlegroups.com
Yes, this is what I mean. I understand that converting the fully connected layers to fully convolutional ones would give me dense features. But what I want to know is how to train a convnet using such a scheme. Specifically, I have dense labels, i.e. each pixel might be labelled. In that case, I can use fully convolutional layers to avoid the sliding-window computation costs and get a resultant heatmap of sorts, if I am not wrong. But this heatmap will not be the same size as the original input, so how do I compute the error against the groundtruth?

Many thanks.

kdxi...@gmail.com

Dec 31, 2014, 8:03:13 PM
to pylear...@googlegroups.com

Hi! Have you solved this problem yet? I am confused by dense feature extraction too. Do you have any good ideas? Thanks a lot :-)

bargoti...@gmail.com

Jan 13, 2015, 11:06:09 PM
to pylear...@googlegroups.com, kdxi...@gmail.com
Not exactly the solution you asked for, but perhaps an alternative path:

In many cases you do not need every single patch of the image as part of your training. Instead, randomly sample local patches (along with their associated labels) and train the CNN on just those.
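
A rough sketch of what I mean (the function name and sizes are made up):

import numpy as np

def sample_patches(image, labels, patch_size, n_samples, rng=np.random):
    """Return random square patches and the label of each patch's centre pixel."""
    H, W = labels.shape
    half = patch_size // 2
    rows = rng.randint(half, H - half, n_samples)
    cols = rng.randint(half, W - half, n_samples)
    patches = np.array([image[:, r - half:r + half + 1, c - half:c + half + 1]
                        for r, c in zip(rows, cols)])
    return patches, labels[rows, cols]

img = np.random.randn(3, 100, 100)         # (channels, H, W)
lab = np.random.randint(0, 8, (100, 100))  # dense per-pixel labels
X, y = sample_patches(img, lab, patch_size=15, n_samples=32)
print(X.shape, y.shape)                    # (32, 3, 15, 15) (32,)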