Subtracting mean before TransformerLayer gives weird results!


James Guo

Oct 12, 2015, 3:03:16 PM
to lasagne-users
Hello guys,

I'm new to Lasagne. I recently encountered a problem and would like to discuss it with you to get some ideas.

My intention was to use the idea of Spatial Transformer Networks to fine-tune a pre-trained model. The task is 9-class classification and the pre-trained model is VGG16. So I inserted a localization network and a TransformerLayer before the CNN, and I only update the parameters of the localization network and the last softmax layer of VGG16. But the result is much worse than training without the Spatial Transformer.

I noticed that the output from the TransformerLayer is quite strange and looks nothing like my input image. Since the TransformerLayer only performs an affine transform on the image, I assume it shouldn't change much. So I suspect this might be caused by the mean subtraction I apply before feeding in the images. I looked up the examples in Lasagne/Recipes, and it seems that in the MNIST example there is no mean subtraction.

Any comments or suggestions would be very much appreciated!! Thanks!


James

emolson

Oct 12, 2015, 4:37:20 PM
to lasagne-users
I doubt mean subtraction is the problem.
What are you using for a localization network? Have you looked at the numbers coming out of it?
I think it's quite possible that the output of TransformerLayer will be unrecognizable if the input parameters are nonsense.

Søren Sønderby

Oct 12, 2015, 5:18:27 PM
to lasagn...@googlegroups.com
Yes, the output will indeed look strange if the “zoom” parameters are close to zero, and probably also in a lot of other cases.

Do you init the transform to identity?



James Guo

Oct 13, 2015, 6:03:26 AM
to lasagne-users
I did some more experiments and you are probably right: it's not a problem with mean subtraction. Here is the localization network definition:

# Imports added for completeness; the original may have used the cuDNN-backed
# Conv/Pool aliases from lasagne.layers.dnn instead of the standard layers.
from collections import OrderedDict
import numpy as np
import lasagne
from lasagne.layers import InputLayer, DenseLayer
from lasagne.layers import Conv2DLayer as ConvLayer
from lasagne.layers import Pool2DLayer as PoolLayer
from lasagne.init import HeUniform, Constant
from lasagne.nonlinearities import identity

net = OrderedDict()
net['input'] = InputLayer((None, 3, 224, 224))
# Localization Network
b = np.zeros((2, 3), dtype='float32')  # bias initialized to the identity transform
b[0, 0] = 1
b[1, 1] = 1
b = b.flatten()
net['loc_l1'] = PoolLayer(net['input'], 4)
net['loc_l2'] = ConvLayer(net['loc_l1'], num_filters=20, filter_size=(5, 5), W=HeUniform(), name='loc_l2')
net['loc_l3'] = PoolLayer(net['loc_l2'], 4)
net['loc_l4'] = ConvLayer(net['loc_l3'], num_filters=20, filter_size=(5, 5), W=HeUniform(), name='loc_l4')
net['loc_l5'] = DenseLayer(net['loc_l4'], num_units=50, W=HeUniform('relu'), name='loc_l5')
net['loc_out'] = DenseLayer(net['loc_l5'], num_units=6, b=b, W=Constant(0.0), nonlinearity=identity, name='loc_out')
# Spatial Transformer Network
net['l_trans1'] = lasagne.layers.TransformerLayer(net['input'], net['loc_out'], downsample_factor=1.0)
print "Transformer network output shape: ", net['l_trans1'].output_shape

The output of net['loc_out'] is 

[[  2.35622382 -10.85038757  38.63009262  -2.80802655 -13.87930298 -43.23282242]]

These must be nonsense values. I tried some images without mean subtraction: before the TransformerLayer the pixel values are in the range [0, 255], and after the TransformerLayer the range is [25.1875, 15.625], so basically you can see nothing visually.

And if I perform mean subtraction on the same image, the output of net['loc_out'] is

[[  2.2725296   -6.936903    22.42156982  -2.63432074  -7.82012177, -27.00503922]]

And the pixel value range before the TransformerLayer is [-94.77, 170.74], while after the TransformerLayer it is [-65.30, -75.21].

So, given the parameters generated by the localization network, the TransformerLayer doesn't seem to act correctly?
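
For reference, roughly how these values can be inspected (a minimal sketch, assuming the net dict from the definition above and a preprocessed input batch X of shape (1, 3, 224, 224)):

import theano
import lasagne

# Compile a function returning both the warped image and the 6 affine
# parameters predicted by the localization network.
warped_out, theta_out = lasagne.layers.get_output(
    [net['l_trans1'], net['loc_out']], deterministic=True)
inspect_fn = theano.function([net['input'].input_var], [warped_out, theta_out])

warped, theta = inspect_fn(X)
print theta                        # the six affine parameters
print warped.min(), warped.max()   # pixel range after the TransformerLayer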

Thanks!


James

James Guo

Oct 13, 2015, 6:05:49 AM
to lasagne-users
Please see my last post about the localization network definition I used. I think I initialize the transformer parameters as identity.

Søren Sønderby

Oct 13, 2015, 6:16:14 AM
to lasagn...@googlegroups.com
I don’t know why you get those large values; my guess is that those values are from after you did training?

If you initialized the network with W=0 and b set to the identity transform, then you wouldn’t get those outputs from the localization network.

-Søren

James Guo

Oct 13, 2015, 6:27:03 AM
to lasagne-users
Yes, those values are after training.

Søren Sønderby

Oct 13, 2015, 6:27:37 AM
to lasagn...@googlegroups.com
Your training is probably diverging then. 

James Guo

Oct 13, 2015, 6:38:34 AM
to lasagne-users
Actually, I'm not exactly sure my training makes sense. The rough idea is:

input image ---> Localization Network
     |                    |
     v                    v
     +----------> Transformer Layer ---> VGG16 ---> Softmax layer

Updates are only applied to the Localization Network and the Softmax layer; could this training setup cause divergence?

Søren Sønderby

Oct 13, 2015, 8:05:57 AM
to lasagn...@googlegroups.com
I don't know. Try training the model without the transformer layer. When that works, try inserting the transformer layer.
Try lowering the learning rate and fiddling with the weight initialization, etc.

-Søren 

Christian S. Perone

Oct 13, 2015, 8:31:32 AM
to lasagn...@googlegroups.com
I had a very similar problem, James; what solved it for me is what Søren said: lowering the learning rate. Sometimes I had to use an LR of 0.001.






Søren Sønderby

Oct 13, 2015, 8:32:39 AM
to lasagn...@googlegroups.com
Also try clipping the gradients or using norm scaling.
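
Something like this (a rough sketch; it assumes loss, params and learning_rate are already defined as in the usual training setup, and that your Lasagne version provides lasagne.updates.total_norm_constraint):

import theano
import lasagne

# Compute the gradients explicitly, rescale them so their total L2 norm
# stays below max_norm, then build the updates from the clipped gradients.
grads = theano.grad(loss, params)
grads = lasagne.updates.total_norm_constraint(grads, max_norm=5.0)
updates = lasagne.updates.sgd(grads, params, learning_rate=learning_rate)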


emolson

Oct 13, 2015, 8:34:56 AM
to lasagne-users
I'm not too familiar with this, but it seems to me that getting a good localization network will be comparable in difficulty to training a good classifier.
Note that in the paper they used a truncated GoogLeNet model (pre-trained on ImageNet).

James Guo

Oct 13, 2015, 9:54:59 AM
to lasagne-users
I've tried rmsprop and SGD with an LR of 0.000001; neither provides any improvement. I will keep trying the other suggestions.

James Guo

Oct 13, 2015, 10:23:00 AM
to lasagne-users
If I fine-tune without the spatial transformer network, it works and produces quite good classification accuracy. I will look into the other suggestions.

Søren Sønderby

Oct 13, 2015, 10:24:00 AM
to lasagn...@googlegroups.com
What are you using it for?


James Guo

Oct 13, 2015, 10:30:19 AM
to lasagne-users
The task is can classification: I need to classify cans by brand. The images come from a surveillance camera, and a can can be dropped at different locations with different rotations in the image. So I thought a spatial transformer could be a valuable enhancement.

Søren Sønderby

Oct 13, 2015, 10:32:58 AM
to lasagn...@googlegroups.com
Will you share your VGG16 setup? 

It should, by the way, be possible to exchange any pooling layers with transformer layers.

James Guo

Oct 13, 2015, 10:43:07 AM
to lasagne-users
The whole network setup is

from lasagne.layers import DropoutLayer

net = OrderedDict()
net['input'] = InputLayer((None, 3, 224, 224))
# Localization Network
b = np.zeros((2, 3), dtype='float32')
b[0, 0] = 1
b[1, 1] = 1
b = b.flatten()
net['loc_l1'] = PoolLayer(net['input'], 4)
net['loc_l2'] = ConvLayer(net['loc_l1'], num_filters=20, filter_size=(5, 5), W=HeUniform(), name='loc_l2')
net['loc_l3'] = PoolLayer(net['loc_l2'], 4)
net['loc_l4'] = ConvLayer(net['loc_l3'], num_filters=20, filter_size=(5, 5), W=HeUniform(), name='loc_l4')
net['loc_l5'] = DenseLayer(net['loc_l4'], num_units=50, W=HeUniform('relu'), name='loc_l5')
net['loc_out'] = DenseLayer(net['loc_l5'], num_units=6, b=b, W=Constant(0.0), nonlinearity=identity, name='loc_out')
# Spatial Transformer Network
net['l_trans1'] = lasagne.layers.TransformerLayer(net['input'], net['loc_out'], downsample_factor=1.0)
print "Transformer network output shape: ", net['l_trans1'].output_shape
# VGG16
net['conv1_1'] = ConvLayer(net['l_trans1'], 64, 3, pad=1, name='conv1_1')
net['conv1_2'] = ConvLayer(net['conv1_1'], 64, 3, pad=1, name='conv1_2')
net['pool1'] = PoolLayer(net['conv1_2'], 2)
net['conv2_1'] = ConvLayer(net['pool1'], 128, 3, pad=1, name='conv2_1')
net['conv2_2'] = ConvLayer(net['conv2_1'], 128, 3, pad=1, name='conv2_2')
net['pool2'] = PoolLayer(net['conv2_2'], 2)
net['conv3_1'] = ConvLayer(net['pool2'], 256, 3, pad=1, name='conv3_1')
net['conv3_2'] = ConvLayer(net['conv3_1'], 256, 3, pad=1, name='conv3_2')
net['conv3_3'] = ConvLayer(net['conv3_2'], 256, 3, pad=1, name='conv3_3')
net['pool3'] = PoolLayer(net['conv3_3'], 2)
net['conv4_1'] = ConvLayer(net['pool3'], 512, 3, pad=1, name='conv4_1')
net['conv4_2'] = ConvLayer(net['conv4_1'], 512, 3, pad=1, name='conv4_2')
net['conv4_3'] = ConvLayer(net['conv4_2'], 512, 3, pad=1, name='conv4_3')
net['pool4'] = PoolLayer(net['conv4_3'], 2)
net['conv5_1'] = ConvLayer(net['pool4'], 512, 3, pad=1, name='conv5_1')
net['conv5_2'] = ConvLayer(net['conv5_1'], 512, 3, pad=1, name='conv5_2')
net['conv5_3'] = ConvLayer(net['conv5_2'], 512, 3, pad=1, name='conv5_3')
net['pool5'] = PoolLayer(net['conv5_3'], 2)
net['fc6'] = DenseLayer(net['pool5'], num_units=4096, name='fc6')
net['fc6_dropout'] = DropoutLayer(net['fc6'], p=0.5)
net['fc7'] = DenseLayer(net['fc6_dropout'], num_units=4096, name='fc7')
net['fc7_dropout'] = DropoutLayer(net['fc7'], p=0.5)
# log_softmax is presumably a custom log-softmax nonlinearity defined elsewhere
# (it is not a built-in lasagne.nonlinearities function).
net['classifier_output'] = DenseLayer(net['fc7_dropout'], num_units=9, nonlinearity=log_softmax, name='classifier_output')

Søren Sønderby

Oct 13, 2015, 10:55:43 AM
to lasagn...@googlegroups.com

James Guo

Oct 14, 2015, 5:41:00 AM
to lasagne-users
An update on my experiments: it seems the explosion of the localization parameters is due to my pixel values being too large (uint8). I tried normalizing them by dividing by 255.0, and now it looks reasonable. I think my theory makes sense because the last DenseLayer of the localization network (see below) doesn't really have any nonlinearity (the intention, I assume, being to initialize the identity transform), so large pixel values could easily cause the explosion problem.

net['loc_out'] = DenseLayer(net['loc_l5'], num_units=6, b=b, W=Constant(0.0), nonlinearity=identity, name='loc_out')

The remaining problems are:

1. Although the accuracy is better than before, it's still a lot worse than training without the spatial transformer layer. This may be caused by the fact that the VGG16 we are using was pre-trained on uint8 inputs, and we are now feeding it normalized pixel values. We could insert a layer that multiplies the input by 255.0 to re-establish the proper input range for VGG16 (see the sketch after this list).

2. I've checked the output of the 'loc_out' layer:
[[  9.37732399e-01  -4.42903414e-02  -7.92445168e-02  -1.68894301e-04, 9.90774453e-01  -3.39208022e-02]]
It's still very much the identity transform we initialized, so I think the localization network didn't learn much about how to find the object. I'm not sure how to deal with this problem yet.
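
For point 1, the rescaling idea would look roughly like this (a sketch, assuming the net dict from the setup posted earlier and that ExpressionLayer is available in your Lasagne version; the transformer then sees inputs in [0, 1] while VGG16 gets back the [0, 255] range it was pre-trained on):

from lasagne.layers import ExpressionLayer

# net['l_trans1'] operates on inputs already divided by 255.0; scale back up
# before the first VGG16 conv layer.
net['rescale'] = ExpressionLayer(net['l_trans1'], lambda X: X * 255.0)
net['conv1_1'] = ConvLayer(net['rescale'], 64, 3, pad=1, name='conv1_1')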

goo...@jan-schlueter.de

Oct 14, 2015, 7:28:13 AM
to lasagne-users
I think my theory makes sense because the last DenseLayer of the localization network (see below) doesn't really have any nonlinearity (the intention, I assume, being to initialize the identity transform), so large pixel values could easily cause the explosion problem.

Instead of downscaling the inputs and upscaling them again for VGG-16, you might also get away with reducing the learning rate for the localization network (since it's piecewise linear).

2. I've checked the output of the 'loc_out' layer:
[[  9.37732399e-01  -4.42903414e-02  -7.92445168e-02  -1.68894301e-04, 9.90774453e-01  -3.39208022e-02]]
It's still very much the identity transform we initialized, so I think the localization network didn't learn much about how to find the object. I'm not sure how to deal with this problem yet.

Either try increasing the learning rate for the localization network, or initialize W to something a little larger (e.g., lasagne.init.Uniform(smallvalue)).

To modify the learning rate of only some of the layers, see this thread: https://groups.google.com/forum/#!topic/lasagne-users/2z-6RrgiHkE
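
A rough sketch of the idea (assuming the net dict from earlier, a scalar loss expression, and that only the localization network and the classifier output layer are being trained, as in your setup):

import lasagne

# Parameters of the localization network (everything below 'loc_out') and of
# the final classifier layer, kept as two separate groups.
loc_params = lasagne.layers.get_all_params(net['loc_out'], trainable=True)
cls_params = net['classifier_output'].get_params(trainable=True)

# Build one update dictionary per group, each with its own learning rate.
updates = lasagne.updates.sgd(loss, cls_params, learning_rate=1e-3)
updates.update(lasagne.updates.sgd(loss, loc_params, learning_rate=1e-5))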


I'm not too familiar with this, but it seems to me that getting a good localization network will be comparable in difficulty to training a good classifier.

Sure, it's also possible that the network is not complex enough to find a can. As Søren said, you could try inserting the localization network later in the stack, so it could benefit from the pre-trained low-layer features in VGG. I'd try to continue your present route for a bit, though.

Good luck, and keep us updated!

Best, Jan

James Guo

Oct 14, 2015, 11:03:22 AM
to lasagne-users, goo...@jan-schlueter.de
1. I've tried reducing the learning rate for the localization network, but the output parameters still eventually go to very large numbers.

2. I tried increasing the learning rate for the localization network, but it's not helping; the parameters are still roughly the identity.

3. I will think about using a more complex localization network, or inserting it in the middle of the stack.

Thanks for all suggestions and comments!!

goo...@jan-schlueter.de

Oct 15, 2015, 6:56:31 AM
to lasagne-users
Hey,


1. I've tried reducing the learning rate for the localization network, but the output parameters still eventually go to very large numbers.

2. I tried increasing the learning rate for the localization network, but it's not helping; the parameters are still roughly the identity.

I wonder why you can't find anything in between (learning too fast and learning too slow)?

3. I will think about using a more complex localization network, or inserting it in the middle of the stack.

All right. I'm still interested in your findings, let us know what comes out of it!

James Guo

Oct 15, 2015, 8:04:36 AM
to lasagne-users, goo...@jan-schlueter.de
Here are some of my personal thoughts; correct me if you find any mistakes. There are two problems here:

1) The explosion problem when training the localization parameters. The localization network I copied from Lasagne/Recipes doesn't have any non-linear function in between, which makes it linear between the conv layers. You can imagine the pixel values remaining in the uint8 range while passing through the localization network layers, so I think it's possible that even with a very low learning rate (I've tried 1e-6) I still get large parameter values from the localization network. As a direct result of these large, strange parameters, the transformer layer produces similar images for different inputs, within a narrow pixel value range.

2) When I normalize the pixel values down to [0, 1], the localization network always produces roughly the identity transform, no matter what the input is. Basically, this means no object position or rotation information is learned by the localization network. Maybe it can be solved by a more complex localization network, with nonlinear functions and more layers?

So I think what I learned from this experiment is: DON'T just COPY the example network; design your own localization network for your own application. LOL

bawdyb

Oct 15, 2015, 11:41:26 AM
to lasagne-users, goo...@jan-schlueter.de
Hi James,


1) The explosion problem when training the localization parameters. The localization network I copied from Lasagne/Recipes doesn't have any non-linear function in between, which makes it linear between the conv layers. You can imagine the pixel values remaining in the uint8 range while passing through the localization network layers, so I think it's possible that even with a very low learning rate (I've tried 1e-6) I still get large parameter values from the localization network. As a direct result of these large, strange parameters, the transformer layer produces similar images for different inputs, within a narrow pixel value range.


What do you mean by "doesn't have a non-linear function in between"? By default each Conv2DLayer is using a relu (check the documentation: nonlinearity=lasagne.nonlinearities.rectify).

goo...@jan-schlueter.de

Oct 15, 2015, 1:19:40 PM
to lasagne-users
What do you mean by "doesn't have a non-linear function in between"? By default each Conv2DLayer is using a relu

But there is no *squashing* nonlinearity in between, everything is piecewise linear, so a large input range results in a large output range (unless the weights are small).

bawdyb

Oct 15, 2015, 4:16:14 PM
to lasagne-users, goo...@jan-schlueter.de
Huh, I am probably misunderstanding something; what do you mean by a squashing nonlinearity in between?

Thanks!

James Guo

Oct 16, 2015, 6:22:51 AM
to lasagne-users, goo...@jan-schlueter.de
By *squashing*, I mean limiting the output value range, for example the hyperbolic tangent, whose output range is -1 to 1. The linear rectifier, in contrast, is linear for positive inputs, so the output value can grow linearly with a positive input. Hope this explanation helps.

James

bawdyb

Oct 16, 2015, 3:30:06 PM
to lasagne-users, goo...@jan-schlueter.de
OK! Squashing = bounding/limiting; sorry for my poor understanding of the meaning of "squashing" :) Now I get it: you have unbounded output values that will perturb the parameters of the ST network.

Thanks again!

James Guo

Oct 23, 2015, 9:56:51 AM
to lasagne-users, goo...@jan-schlueter.de
Hey guys,

Some updates on my recent findings: I changed the network structure a bit.

from lasagne.nonlinearities import tanh

net = OrderedDict()
net['input'] = InputLayer((None, 3, 224,224))
# Localization Network
b = np.zeros((2, 3), dtype='float32')
b[0, 0] = 1
b[1, 1] = 1
b = b.flatten()
net['loc_l1'] = PoolLayer(net['input'], 2)
net['loc_l2'] = ConvLayer(net['loc_l1'], num_filters=20, filter_size=(5, 5), W=HeUniform(), name='loc_l2')
net['loc_l3'] = PoolLayer(net['loc_l2'], 2)
net['loc_l4'] = ConvLayer(net['loc_l3'], num_filters=20, filter_size=(5, 5), W=HeUniform(), name='loc_l4')
net['loc_l5'] = PoolLayer(net['loc_l4'], 2)
net['loc_l6'] = ConvLayer(net['loc_l5'], num_filters=30, filter_size=(5, 5), W=HeUniform(), name='loc_l6')
net['loc_l7'] = DenseLayer(net['loc_l6'], num_units=100, W=HeUniform('relu'), nonlinearity=tanh, name='loc_l7')
net['loc_out'] = DenseLayer(net['loc_l7'], num_units=6, b=b, W=Constant(0.0), nonlinearity=identity, name='loc_out')

The major change is that I added one more conv layer to the localization network, and I also changed the nonlinearity of net["loc_l7"] to tanh; the purpose of this change is to have a "squashing" nonlinearity. I feed images with the uint8 pixel range into the network for training.

I tested one image, and the transformer parameters from the localization network are
[[ 1.27262008 -0.86602426 -0.82796508  1.53299463  1.29031622  0.56877941]]

These seem like reasonable values? (I'm not so sure about that.) The effect of these parameters is to rotate the image, move it a bit to the right, and zoom out (according to my observation). The localization network finally got trained somehow. Personally, I think the nonlinearity I changed plays an important role in this.

The problem comes when I test with more images: the localization parameters always remain
[[ 1.27262008 -0.86602426 -0.82796508  1.53299463  1.29031622  0.56877941]]

So apparently only the bias of net["loc_out"] is trained, and the weights of net["loc_out"] remain at their initial value of 0. This is kind of weird to me.
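
One sanity check (a sketch, assuming the net dict above): verify that the weights of net['loc_out'] actually appear among the trainable parameters passed to the update rule, and watch whether they move at all during training.

import lasagne

# The weight matrix and bias of the localization output layer.
W, b = net['loc_out'].W, net['loc_out'].b
print W.get_value()   # stays all zeros if it never receives updates
print b.get_value()

# 'loc_out.W' should show up among the trainable parameters below the
# transformer; if it doesn't, it is simply never being trained.
params = lasagne.layers.get_all_params(net['l_trans1'], trainable=True)
print [p.name for p in params]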



Søren Sønderby

Oct 23, 2015, 10:29:12 AM
to lasagn...@googlegroups.com, goo...@jan-schlueter.de
You can try a multiscale transformer. It's a small function that creates 3 tied zooms at different resolutions.
The output is more or less identical to what you see here: http://torch.ch/blog/2015/09/21/rmva.html

(attachment: multicrop.py)

galin.g...@gammadynamics.com

Dec 10, 2015, 11:03:43 PM
to lasagne-users, goo...@jan-schlueter.de
James, this "scale mismatch" is a weakness of the 1st generation of transformer nets, which rely on a classifier. 

If you don't want hacks, there are at least two more conceptual ways to resolve this issue:

i) using batch normalization in the localization network (http://arxiv.org/abs/1502.03167). This is still a heuristic and adds overhead but, if well implemented, you should see the weights in the localization net get optimized right away (see the sketch after this list);

ii) replacing the classifier with (or adding to it) an auto-encoder, as in the ACE (http://arxiv.org/abs/1511.02841). A reconstruction error is generally more sensitive to spatial symmetry statistics (produced by the localization net) than a classification error.
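
For (i), a rough sketch using Lasagne's batch_norm convenience wrapper (assuming the layer names from the earlier posts and a Lasagne version that provides lasagne.layers.batch_norm):

from lasagne.layers import batch_norm

# Wrap the localization conv/dense layers in batch normalization; the wrapper
# inserts a BatchNormLayer after each layer and moves the nonlinearity behind it.
net['loc_l2'] = batch_norm(ConvLayer(net['loc_l1'], num_filters=20,
                                     filter_size=(5, 5), W=HeUniform(),
                                     name='loc_l2'))
net['loc_l3'] = PoolLayer(net['loc_l2'], 4)
net['loc_l4'] = batch_norm(ConvLayer(net['loc_l3'], num_filters=20,
                                     filter_size=(5, 5), W=HeUniform(),
                                     name='loc_l4'))
net['loc_l5'] = batch_norm(DenseLayer(net['loc_l4'], num_units=50,
                                      W=HeUniform('relu'), name='loc_l5'))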

goo...@jan-schlueter.de

Dec 11, 2015, 12:00:20 PM
to lasagne-users
There's also the "GradNets" approach: http://arxiv.org/abs/1511.06827 (Section 3.9)
The idea is to start with a mean pooling layer (or anything else that will give you an output of a reasonable target scale and size) and gradually replace it by a spatial transformer during training, via linear interpolation. Supposedly it makes training more stable because it allows the classifier to learn something before using its error signal to update the localization network.
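
A rough sketch of that interpolation (assuming both branches are made to produce the same output shape, e.g. the transformer downsampled to match a hypothetical pooling branch net['pool_branch'], and that alpha is a shared variable annealed from 0 to 1 over training):

import numpy as np
import theano
from lasagne.layers import ElemwiseMergeLayer

alpha = theano.shared(np.float32(0.0))  # push towards 1.0 as training proceeds

# Blend the "safe" pooling branch with the spatial transformer output.
net['blend'] = ElemwiseMergeLayer(
    [net['pool_branch'], net['l_trans1']],
    merge_function=lambda pooled, warped: (1.0 - alpha) * pooled + alpha * warped)

# Later, e.g. once per epoch:
# alpha.set_value(np.float32(min(1.0, alpha.get_value() + 0.1)))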

ngc...@gmail.com

Aug 17, 2016, 3:27:13 AM
to lasagne-users
Hi,
I am facing the same problem with RGB images.
Specifically, I am using VGG-19 (pre-trained on ImageNet) for a binary classification problem and got good results.
Then I tried adding an STN in front of the VGG net. I used a separate pre-trained VGG-19 as the localization net. The difference is that I removed the two FC-4096 layers and replaced them with an FC-512 (ReLU) and an FC-6 (similar to the MNIST example).
Unfortunately, the classifier loss is not decreasing (I used the same learning rate as for the classifier net without the STN, while trying different learning rates for the localization net).
The localization parameters did not change much from their initial values (set as per the example).
And the weird thing is that the transformed images have uniform values; for example, the red channel pixels all have the same value, and ditto for the other color channels.

Jan Schlüter

Aug 18, 2016, 8:16:05 AM
to lasagne-users, ngc...@gmail.com
And the weird thing is that the transformed images have uniform values; for example, the red channel pixels all have the same value, and ditto for the other color channels.

Samples outside the input image are clamped to its borders. So what you see is probably one of the corner pixels being repeated over the full image, because the transformation is so far off it doesn't sample the source image at all. The network will probably not be able to recover from this state. Make sure you initialize the localization network to produce the identity transform, and train it very slowly (or not train it at all for a while). Spatial Transformers are tricky!

Best, Jan

leepe...@gmail.com

Aug 6, 2019, 10:16:09 AM
to lasagne-users
Hi, can you please send me the vgg19.pkl file?
