Hi Caffe Users,
We have created a working Convolutional Auto-Encoder (CAE) in Caffe, though still without pooling-unpooling layers. The CAE works with the modified version of Caffe at https://github.com/HyeonwooNoh/caffe.
We have published the details of its creation as a paper on arXiv:
http://arxiv.org/ftp/arxiv/papers/1512/1512.01596.pdf
The paper is also included below in this post.
I have attached the following folder/files (in zipped form):
/VisInsideCAE – the folder containing the files used to plot Fig. 7 of the paper.
All files in the root folder – .prototxt files and Matlab files – were used for training the CAE model (only 3 runs out of the total 9) and for its visualization. I have also included the resulting files and figures for these 3 runs. In addition, I have attached the MNIST test set in the file <mnist_test.mat> and changed the .m files a bit (the loop is still not optimized, sorry) to work with this <mnist_test.mat> file. For the original visualization we used another, bigger MNIST test set file; that is why the attached figures are not exactly the same as in the paper. The .m files that visualize the 10- and 30-dimensional CAEs require t-SNE (https://lvdmaaten.github.io/tsne/) to be installed on your machine/Matlab.
I hope this small result will help the Caffe community to create a CAE with pooling-unpooling layers, and that it answers some questions I have seen in this group before: how to visualize the network itself and how to use t-SNE for visualization.
I appreciate any feedback on the paper, especially about how we calculated the number of trainable parameters in the encoder and decoder parts. In some papers where Caffe was used, I have seen another way to calculate the number of trainable parameters, so our approach may be inaccurate.
Cheers,
Vlad
PS. Oh, I am deleting <mnist_test.mat> from the zip archive; it is about 4 GB, which does not allow me to publish this post. I believe I took it from GitHub somewhere.
Hi Bruno,
I think you should read more about convolution/deconvolution layers, because you need to understand that a convolution operation decreases the size of the output feature maps, while a deconvolution operation increases it. This is how the encoding-decoding paradigm works.
Read the paper by Masci et al. (2011):
J. Masci, U. Meier, D. Ciresan, J. Schmidhuber, Stacked convolutional auto-encoders for hierarchical feature extraction, Lecture Notes in Computer Sci. 6791 (2011) 52-59.
In that paper, on page 3, in the paragraph before formula (3), the authors explain two formulas (you can find them in many other papers) for how the sizes of the output feature maps decrease and increase. According to them, the convolution layer in Caffe implements a 'valid' convolution operation and the deconvolution layer in Caffe implements a 'full' convolution operation.
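As a minimal sketch of those two formulas (assuming stride 1 and no padding): a 'valid' convolution of an n x n input with a k x k filter produces an (n - k + 1) x (n - k + 1) output, while a 'full' convolution produces (n + k - 1) x (n + k - 1). The 9x9 kernel below is inferred from the 28 -> 20 shape change visible in the log excerpt further down, not stated directly here.

```python
def valid_out(n, k):
    """Output side length of a 'valid' convolution
    (Caffe convolution layer), stride 1, no padding."""
    return n - k + 1

def full_out(n, k):
    """Output side length of a 'full' convolution
    (Caffe deconvolution layer), stride 1, no padding."""
    return n + k - 1

# MNIST example: a 28x28 input through a 9x9 'valid' convolution
# shrinks to 20x20 (matching 'Top shape: 100 8 20 20' in the log),
# and a 9x9 'full' deconvolution restores 20 -> 28.
print(valid_out(28, 9))  # 20
print(full_out(20, 9))   # 28
```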
Once you understand this, I suggest looking at the file 'mnistCAE10sym0202-03.log' in my CAEzip.zip archive. This log shows how the network operates: you can see how the sizes decrease in the encoder part and then increase in the decoder part. Start reading from these lines:
I1123 14:58:29.456792 31450 net.cpp:105] Top shape: 100 1 28 28 (78400)
I1123 14:58:29.456801 31450 net.cpp:105] Top shape: 100 1 28 28 (78400)
I1123 14:58:29.456809 31450 net.cpp:105] Top shape: 100 1 28 28 (78400)
I1123 14:58:29.456816 31450 net.cpp:115] Memory required for data: 1254400
I1123 14:58:29.456823 31450 layer_factory.hpp:78] Creating layer conv1
I1123 14:58:29.456851 31450 net.cpp:69] Creating Layer conv1
I1123 14:58:29.456861 31450 net.cpp:396] conv1 <- data_data_0_split_0
I1123 14:58:29.456887 31450 net.cpp:358] conv1 -> conv1
I1123 14:58:29.456908 31450 net.cpp:98] Setting up conv1
I1123 14:58:29.457490 31450 net.cpp:105] Top shape: 100 8 20 20 (320000)
I1123 14:58:29.457504 31450 net.cpp:115] Memory required for data: 2534400
I1123 14:58:29.457533 31450 layer_factory.hpp:78] Creating layer sig1en
I1123 14:58:29.457551 31450 net.cpp:69] Creating Layer sig1en
I1123 14:58:29.457561 31450 net.cpp:396] sig1en <- conv1
I1123 14:58:29.457584 31450 net.cpp:347] sig1en -> conv1 (in-place)
I1123 14:58:29.457599 31450 net.cpp:98] Setting up sig1en
This corresponds to Table 2 in my paper; read that Table carefully and think about it, and then you will understand everything.
So, when you need to create a model for other image sizes, the goal is to restore the same image size at the output, so you should play with the filter sizes (second column of Table 2) that give you the same output image size at the end of the decoder part. You can either calculate everything theoretically (using those two formulas) or run the model with some initial filter sizes. It probably will not work at first, but the log file shows what happens inside the network from layer to layer, so you will see where the error is, and that gives you an idea of what filter sizes are needed for your particular problem.
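The "calculate everything theoretically" approach can be sketched as follows: chain the 'valid' formula through the encoder and the 'full' formula through the decoder, and check that the final size matches the input. The kernel sizes here are hypothetical (a symmetric two-layer example, not necessarily the ones in Table 2).

```python
def trace_sizes(n, encoder_kernels, decoder_kernels):
    """Trace the feature-map side length through 'valid' convolutions
    (encoder) and 'full' convolutions (decoder), stride 1, no padding."""
    sizes = [n]
    for k in encoder_kernels:
        sizes.append(sizes[-1] - k + 1)  # 'valid': output shrinks
    for k in decoder_kernels:
        sizes.append(sizes[-1] + k - 1)  # 'full': output grows
    return sizes

# Hypothetical symmetric CAE on 28x28 MNIST images:
sizes = trace_sizes(28, encoder_kernels=[9, 9], decoder_kernels=[9, 9])
print(sizes)  # [28, 20, 12, 20, 28] -- the decoder restores the input size
```

With a symmetric set of kernels the last size always equals the first, which is exactly the condition you are tuning the filter sizes to satisfy.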
Cheers,
Vlad
Hi Caffe Users,
I1123 14:58:29.458076 31450 net.cpp:98] Setting up ip1encode
I1123 14:58:29.485141 31450 net.cpp:105] Top shape: 100 250 1 1 (25000)