Convolutional autoencoder not training properly


Afzalul Haque

May 4, 2016, 9:32:56 PM
to Discuss

Why is my convolutional autoencoder not converging properly? I have a very simple layer stack.


Encoder: Conv/ReLU (kernel 7x7, stride 1, padding 0) => MaxPool (kernel 2x2, stride 2) => Conv/ReLU (kernel 5x5, stride 1, padding 0) => MaxPool (kernel 2x2, stride 2)

Decoder: Nearest Neighbour Upsampling => Deconv/ReLU => Nearest Neighbour Upsampling => Deconv/ReLU


Training images are of size 30x30x1. I tried training with 1000 images over 1000 epochs, but the error (MSE) is still ~120. Is it because my model is wrong?


The model is here:

http://pastebin.com/3hkh2nMk

Andy Kitchen

May 5, 2016, 2:07:31 AM
to Afzalul Haque, Discuss
Hi mate,

I've written and debugged a few autoencoders. First check you've got all your fundamentals down:

One, make sure your data is suitably normalized: it should have approximately a mean of 0
and a standard deviation of 1 at every pixel, for both input and output (you might need to tweak
the activation function on the output). Even better, whiten your data; most autoencoder work uses whitened data.
Look at ZCA whitening or local contrast normalization.
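For example, per-pixel standardisation in NumPy might look roughly like this (my own sketch;
array names and shapes are assumptions, not taken from your posted code):

    import numpy as np

    # images: float array of shape [num_images, 30, 30, 1], raw pixel values in [0, 255]
    def standardise(images, eps=1e-8):
        # per-pixel mean and std, computed across the whole training set
        mean = images.mean(axis=0, keepdims=True)
        std = images.std(axis=0, keepdims=True)
        return (images - mean) / (std + eps)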

Two, your architecture is a little weird: you use convolutions but don't have any bias parameters.
This is not totally wrong, but it's probably not what you want. In TensorFlow, just using the convolution
ops doesn't automatically include a bias term.
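As a rough sketch (my own, with assumed shapes rather than anything from your pastebin),
adding the bias by hand looks like:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 30, 30, 1])

    # filter weights AND an explicit per-output-channel bias
    W = tf.Variable(tf.truncated_normal([7, 7, 1, 32], stddev=0.05))
    b = tf.Variable(tf.zeros([32]))

    conv = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='VALID')
    h = tf.nn.relu(tf.nn.bias_add(conv, b))  # the bias has to be added explicitly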

Three, a general NN debugging tip: quadruple-check that everything has the exact shape you expect at
every point and that every aggregation is performed over the dimension you intend.

Four, the weird thing about training autoencoders (especially when not doing denoising) is that
the reconstruction MSE on its own isn't necessarily a good indication of the quality of the
internal representation. Look into other tools like t-SNE to actually assess the quality of internal
representations.

Debugging Tips:

Start small: can you get a simple one-layer network with no pooling or bottleneck to converge?
There is obviously a trivial solution to this, so you should hit 0 MSE pretty quickly. Slowly
add more layers one by one, and make sure the system converges every time you do.

Try building a fully connected autoencoder and getting it working before moving to
a more complicated convolutional architecture. Try replicating something
from a paper before putting together a custom architecture; that way you'll know
the bug is probably in your code if it doesn't work.
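For what it's worth, a one-layer fully connected autoencoder along those lines might look roughly
like this (a sketch only; the hidden size, learning rate and initialisation are arbitrary choices of mine):

    import tensorflow as tf

    # 900 pixels -> 256 hidden units -> 900 pixels
    x = tf.placeholder(tf.float32, [None, 900])             # flattened 30x30 images
    W_enc = tf.Variable(tf.truncated_normal([900, 256], stddev=0.05))
    b_enc = tf.Variable(tf.zeros([256]))
    W_dec = tf.Variable(tf.truncated_normal([256, 900], stddev=0.05))
    b_dec = tf.Variable(tf.zeros([900]))

    code = tf.nn.relu(tf.matmul(x, W_enc) + b_enc)
    recon = tf.matmul(code, W_dec) + b_dec                   # linear output for standardised targets

    loss = tf.reduce_mean(tf.squared_difference(recon, x))   # per-element MSE
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)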

Continually assess the quality of your internal representation; for highly structured
30x30 images (e.g. MNIST) you may only need one encoding layer and one
decoding layer to get useful results.

Visualise everything, all the time. If you've done things correctly, you should see
very clear structure develop in your layer-1 weights. If you don't, your model probably
isn't converging.


Hope that helps,


Kind Regards

AK


Afzalul Haque

May 5, 2016, 2:33:47 AM
to Andy Kitchen, Discuss
Hey Andy, thanks a lot for the suggestions.

1) Will add normalization.

2) I did not know that a bias isn't included by default in TensorFlow. I will definitely add one.

3) Have checked the sizes, and they are correct.

4) I have read about MSE not being a very good measure of error for images. In actuality, this autoencoder was created just to learn TensorFlow, and it is part of a generative network that only uses MSE, so I can't really move away from that.

Finally, I trained the same model using 10,000 images (up from the original 1000), and the error is significantly lower (~80 in 250 epochs). Is it just a case of the images being too complex, or is it definitely a problem with the model? I understand that the answer depends on the complexity and structure of the data, so here is an imgur link to the images.
The images are bouncing balls (3 white balls bouncing in a closed black box) from this paper (Sutskever, Hinton, Taylor).

Thanks & Regards,
Afzalul Haque.

Andy Kitchen

May 5, 2016, 2:52:31 AM
to Afzalul Haque, Discuss
From reading your code, I think (not 100% sure) your total squared error is averaged over every pixel. In that case, a pixel-wise MSE of 120 or even 80 is far too high. My guess is, with that architecture – and properly normalised data – you should have a pixel-wise MSE of less than 0.1.

Try visualising the outputs of the autoencoder side by side with the input. I've built a similar architecture to the one you're using: my output looked like a blurry version of the original.
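A quick way to do that (a sketch with matplotlib; the array names are my own):

    import matplotlib.pyplot as plt

    # inputs, recons: arrays of shape [N, 30, 30]
    def show_pairs(inputs, recons, n=5):
        for i in range(n):
            plt.subplot(2, n, i + 1)           # top row: originals
            plt.imshow(inputs[i], cmap='gray')
            plt.axis('off')
            plt.subplot(2, n, n + i + 1)       # bottom row: reconstructions
            plt.imshow(recons[i], cmap='gray')
            plt.axis('off')
        plt.show()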


Kind Regards

AK

Afzalul Haque

May 5, 2016, 6:21:48 AM
to Andy Kitchen, Discuss
I tried visualising the output alongside the input, and they don't really match. The output does contain 3 white balls on a black background, but the positions of the balls don't match at all.

Regards.

Afzalul Haque

May 5, 2016, 7:35:35 AM
to Andy Kitchen, Discuss
You are right, the error is indeed averaged per pixel. I just checked with X=[[8,0,0],[0,0,0],[0,0,0]] and Y=[[0,0,0],[0,0,0],[0,0,0]]: tf.reduce_mean(tf.squared_difference(X,Y)) gives 10 as the result (64/(2*3) ≈ 10).

While checking, it occurred to me that since my images' pixel values (ranging from 0 to 255) are not normalized, my per-pixel MSE isn't either. Applying a simple normalization to the [0,1] range, my per-pixel MSE comes to about 0.471, which is of course much larger than 0.1, but not as bad as the gap between 0.1 and 120 suggested.

Also, I added a bias (of shape equal to the number of output channels) to both the convolution and deconvolution layers. The error after 350 epochs is still ~115 per pixel. Here is the complete code.

Thanks & Regards,
Afzalul Haque.

Afzalul Haque

May 5, 2016, 7:36:25 AM
to Andy Kitchen, Discuss
Apologies, that should have been X=[[8,0],[0,0],[0,0]] and Y=[[0,0],[0,0],[0,0]].
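i.e. the same check, spelled out with the session API of the time:

    import tensorflow as tf

    X = tf.constant([[8., 0.], [0., 0.], [0., 0.]])
    Y = tf.zeros([3, 2])

    mse = tf.reduce_mean(tf.squared_difference(X, Y))

    with tf.Session() as sess:
        print(sess.run(mse))   # 64 / 6 ~= 10.67 -- the mean runs over every element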

Andy Kitchen

May 5, 2016, 10:07:28 AM
to Afzalul Haque, Discuss
Hi Afzalul,

Make sure you are training to convergence and then a little bit past it; epoch 350 doesn't mean anything on its own.

An MSE of 0.47 feels high to me (if the data is normalised, just returning a constant 0 would give an MSE of 1). But again, with an autoencoder it's not very productive to assess the quality of the learned representations using MSE alone. It's hard to judge these things without being there and going through the process of building up smaller models for comparison.

I just had a quick look at your initialisation code before sending this email: for your large filters and many features, an initialisation stddev of 0.01 is really very large. As I said in my last answers, you really need to get your fundamentals right. When using convolutional layers, always start with "Xavier" initialisation first and then tweak from there; have a look at:
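For illustration only (my own sketch rather than anything linked in the thread; the filter shape is an assumption), Xavier-style initialisation for the first conv filter works out to something like:

    import numpy as np
    import tensorflow as tf

    # "Xavier"/Glorot initialisation for a 7x7 conv filter, 1 input channel, 32 output channels
    fan_in = 7 * 7 * 1
    fan_out = 7 * 7 * 32
    stddev = np.sqrt(2.0 / (fan_in + fan_out))   # scales with filter size and channel count

    W = tf.Variable(tf.truncated_normal([7, 7, 1, 32], stddev=stddev))
    # later TF versions provide tf.contrib.layers.xavier_initializer() for this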


---

I was tempted to delete the rest of this email to avoid confusion, but there's some good stuff here. Really, just make sure you've got the fundamentals covered: normalisation, initialisation, and training to convergence. Otherwise you won't have a fun time.

I also think you'll learn a lot from implementing a 1 layer, fully connected auto-encoder and then building up from there.

Post Script:

If there are no bugs and all the fundamentals are there, it's also possible that your current architecture just isn't very good. The first place I'd tweak is the decoder. You are using resize_images with bilinear upsampling, which is very close to (if not exactly) a linear operation, followed by a convolution, which is also a linear operation, so the composition doesn't gain you anything. You could get a very similar effect with a single strided conv2d_transpose. (Anyone want to chime in here? I'd love a second opinion on that line of reasoning.) A setup like that will probably be easier to train.
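For concreteness, a single strided transposed-convolution layer might look like this (a sketch with made-up shapes; note that conv2d_transpose needs an explicit output_shape and takes its filter as [height, width, out_channels, in_channels]):

    import tensorflow as tf

    batch_size = 32
    h = tf.placeholder(tf.float32, [batch_size, 5, 5, 64])   # stand-in for the encoder output

    W = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.05))
    b = tf.Variable(tf.zeros([32]))

    # one op does the 2x upsampling and the filtering together
    up = tf.nn.conv2d_transpose(h, W, output_shape=[batch_size, 10, 10, 32],
                                strides=[1, 2, 2, 1], padding='SAME')
    up = tf.nn.relu(tf.nn.bias_add(up, b))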

More subtly, the max-pooling drops spatial information in the encoder, but there isn't really a good way to get it back in the decoder. In this architecture the only way to rebuild fine-scale spatial information is to somehow embed it in the features and have it traverse several layers. This is probably not what you want.

Here are some things to try:

Simply keep the system very highly over-parameterised all the way through, something like [128, 128, 128, 128] instead of [32, 64, 64, 32]. You may want to use liberal regularisation in this case; popular options are L1/L2 penalties on weights and/or activations, though I've found dropout generally works better. (You might want to shrink your filter sizes a bit if you do this, to avoid things taking too long to train.)

Use something like max_pool_with_argmax to keep the positions of the maxima and use them during upsampling to partially 'undo' the max-pooling. This is mostly only useful if you are throwing away the decoder after training and just using the representation.

Undoing a max-pooling is a difficult non-linear operation, and most of the classic autoencoder work doesn't use straight max-pooling on the encode side. You might not have enough non-linearity / capacity on your decode side to cope; you could go for an asymmetric decode side that has more layers of smaller (say 3x3) convolutions. (You have 4 non-linear operations on your encode side, but only 2 on your decode side as far as I can tell.)

Any of these changes will certainly decrease your MSE, but again, that isn't necessarily what you need. If you have a specific purpose in mind for your autoencoder, you should test it against that first.


Hope that helps!

Kind Regards

AK

Afzalul Haque

May 5, 2016, 10:19:06 AM
to Andy Kitchen, Discuss
Hey Andy, thanks for all the great advice. I am training the autoencoder with biases and 10k images (all scaled to mean 0 and variance 1), and I am currently getting a per-pixel MSE of 0.23 (on that same normalised scale). The reason I built this was to make sure the model works somewhat in theory; the main aim is a predictive generative network that predicts the next frame given a set of input frames, with an LSTM DNN between the encoder and decoder. Looks like I still have to work on it some more, but again, thanks for all the advice.

Warm Regards,
Afzalul Haque

Andy Kitchen

May 5, 2016, 11:27:46 AM
to Afzalul Haque, Discuss
Glad I could help!

Sounds like some fun and interesting work! One thing that is easy to get running quickly is feeding the input frames in as different channels to your CNN: to feed in the last five frames, make the input [30, 30, 5]. This would be a good benchmark to try to beat with the LSTM DNN setup. My guess is that for this specific task the multi-input CNN will be hard to beat.
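Something like this would build such an input (a sketch; the frame array layout is an assumption):

    import numpy as np

    # frames: array of shape [num_frames, 30, 30]
    def make_input(frames, t, history=5):
        window = frames[t - history:t]            # the last 5 frames, [5, 30, 30]
        return np.transpose(window, (1, 2, 0))    # [30, 30, 5], channels-last for the CNN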

Happy Hacking!