Could it be that there is a mistake in the MNIST example: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt?
The convolutional layers are not followed by any non-linearities (apart from the max pooling). In the original paper they were followed by sigmoids.
The tutorial at http://caffe.berkeleyvision.org/gathered/examples/mnist.html says that the sigmoids were replaced by ReLUs, yet the prototxt contains only a single ReLU, placed after the first fully connected layer.
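If ReLUs after the conv layers were intended, I'd expect something like the following in the prototxt (layer name here is my own invention, following the naming convention of the existing `relu1` layer; Caffe allows the activation to be applied in place):

```
layer {
  name: "relu_conv1"   # hypothetical name, not in the shipped prototxt
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"         # in-place activation on the conv1 blob
}
```

with an analogous layer after conv2 — but no such layers appear in lenet_train_test.prototxt.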
What is also confusing to me is that the network still seems to perform extremely well at classification even without those non-linearities... Any idea why that is?