Hard to understand Caffe MNIST example

Vietanh Hosy

Feb 21, 2016, 3:10:13 AM
to Caffe Users

After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html


I am really confused about the different (and efficient) model used in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt


As I understand it, a Convolution layer in Caffe simply computes Wx+b for each input, without applying any activation function. If we want an activation function, we have to add another layer immediately after that convolution layer, such as a Sigmoid, TanH, or ReLU layer. Every paper/tutorial I have read on the internet applies an activation function to the neuron units.
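
For example, this is roughly what I expected the first conv/pool stack to look like, sketched here with pycaffe's net_spec (the relu1 layer and the exact parameter values are my own sketch, not copied from lenet_train_test.prototxt):

    import caffe
    from caffe import layers as L, params as P

    n = caffe.NetSpec()
    # MNIST LMDB data layer, roughly as in the tutorial
    n.data, n.label = L.Data(batch_size=64, backend=P.Data.LMDB,
                             source='examples/mnist/mnist_train_lmdb',
                             transform_param=dict(scale=1. / 255), ntop=2)
    # convolution alone computes only Wx + b, no activation
    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20,
                            weight_filler=dict(type='xavier'))
    # the activation layer I expected to see (applied in place on conv1)
    n.relu1 = L.ReLU(n.conv1, in_place=True)
    n.pool1 = L.Pooling(n.relu1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    print(str(n.to_proto()))   # prints the corresponding prototxt

In the actual lenet_train_test.prototxt there is no such ReLU between conv1 and pool1 (the only ReLU in that file sits between the ip1 and ip2 fully connected layers, if I am reading it correctly).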


This leaves me with a big question mark, because in this model we only see Convolution and Pooling layers interleaved. I hope someone can give me an explanation.

As a side note, another thing I am unsure about is max_iter in this solver: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt

We have 60,000 images for training and 10,000 images for testing. So why is max_iter here only 10,000 (and the model still reaches > 99% accuracy)? What does Caffe do in each iteration? Also, I'm not sure whether the accuracy is total correct predictions / test set size.


I'm amazed by this example, as I haven't found any other example or framework that achieves such a high accuracy in such a short time (only 5 minutes to get > 99% accuracy). Hence, I suspect there is something I have misunderstood.


Thanks.

P.S.: I also posted a similar question on Stack Overflow here: http://stackoverflow.com/questions/35533703/hard-to-understand-caffe-mnist-example


Nam Vo

Feb 21, 2016, 3:21:58 AM
to Caffe Users
I think an explanation would be that it's an ancient network, so the design is a little bit obsolete and different from the networks you see these days.
Each iteration is a forward/backward pass plus a parameter update on one mini-batch. If you want to know how this works, you just need to find some machine learning material and read it. The accuracy should be correct predictions / test size.
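
To make the numbers concrete, here is a rough back-of-the-envelope check in Python; the batch sizes and solver values are the ones I remember from lenet_train_test.prototxt and lenet_solver.prototxt, so verify them against the repo:

    # rough accounting of what max_iter: 10000 means for MNIST
    train_images, test_images = 60000, 10000
    train_batch = 64     # batch_size of the TRAIN data layer
    test_batch = 100     # batch_size of the TEST data layer
    max_iter = 10000     # from lenet_solver.prototxt
    test_iter = 100      # from lenet_solver.prototxt

    # every iteration processes one mini-batch, so:
    images_seen = max_iter * train_batch            # 640,000 image presentations
    epochs = images_seen / float(train_images)      # ~10.7 passes over the training set

    # every test phase runs test_iter mini-batches:
    test_coverage = test_iter * test_batch          # 10,000 images = the whole test set
    print(epochs, test_coverage)

So 10,000 iterations is not 10,000 images; it is roughly ten passes over the training set, and the reported accuracy is averaged over the full 10,000-image test set.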

Vietanh Hosy

Feb 21, 2016, 3:38:54 AM
to Caffe Users
Thanks,

Yes, I now understand the mini-batch iterations. However, I'm still stuck on the idea of removing the activation function (to reduce time) and yet still getting a very high accuracy (even higher than when I add the activation function).

Evan Shelhamer

Feb 21, 2016, 1:56:36 PM
to Vietanh Hosy, Caffe Users
Max pooling is itself a non-linearity, so by interleaving conv and max pooling there is still meaningful composition, unlike when convolution layers are stacked alone without any non-linearity.

Not having non-linearities / activations is not a speed optimization, as it barely saves any computation; these are the cheapest layers in the network to compute.
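
To see that max pooling really is a non-linearity, here is a toy numpy check (the values are made up for illustration): a linear operation would have to distribute over addition, and max does not.

    import numpy as np

    # two toy inputs over a single 2-element pooling window
    a = np.array([1.0, -1.0])
    b = np.array([-1.0, 1.0])

    print((a + b).max())        # pooling the sum:           0.0
    print(a.max() + b.max())    # summing the pooled values: 2.0
    # the two differ, so max pooling is not a linear map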

Evan Shelhamer

Jan C Peters

Feb 22, 2016, 5:24:57 AM
to Caffe Users, hosyv...@gmail.com
As Master Shelhamer said: the strength of deep learning is in the cascade of non-linearly activated, filtered features. Max pooling in itself is already a non-linear operation and seems to suffice. In my own experience, adding another non-linearity like a ReLU after (or even before) the pooling does not improve the result much; in fact, it hardly changes performance at all.

Jan

Christian Baumgartner

Feb 26, 2016, 6:46:13 AM
to Caffe Users, hosyv...@gmail.com
This is very interesting... has leaving out the activation function really been shown to work better? The answers here seem to imply that using max pooling as the only non-linearity is common practice.
 
However, all of the other examples I have looked at seem to have ReLUs (e.g. https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_full.prototxt, https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/train_val.prototxt). In contrast to what Nam Vo said above, the original LeNet *did* have non-linearities (in addition to the pooling), namely the tanh(.) function (see "Gradient-Based Learning Applied to Document Recognition", http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=726791). State-of-the-art models like GoogLeNet also have ReLUs: https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/train_val.prototxt

However, it is interesting to see that the MNIST example seems to perform very well without them. Would anybody be able to point me to a paper or other resource explaining why dropping the activation function can be helpful? Or a published model where this has been used?

Chris

Jan C Peters

Feb 26, 2016, 7:47:45 AM
to Caffe Users, hosyv...@gmail.com
Hmm, interesting. I, too, was under the impression that the original LeNet did not have activation functions in the convolutional layers, but after reading the paper again I have to admit I was wrong. It is hidden in the text and not immediately clear from the network diagram, though.

I too would be interested in a definitive answer to this question, but on the other hand I suspect it is not definitively answerable in general, as is the case with so many questions in deep learning.

Jan