I've been comparing Keras and Caffe and have started to wonder whether their implementations of Stochastic Gradient Descent (SGD) are equivalent. I began to suspect this after implementing the same network architecture in both and training it on the CIFAR-10 dataset: with SGD, Caffe gave markedly better results than Keras. If, however, I let Keras use RMSProp, it achieved results comparable to Caffe's.
I let each Keras experiment run for 10 epochs on the 50,000-image training set and saw the following results:
SGD Optimization:
Epoch 10/10
50000/50000 [==============================] - 6s - loss: 1.6015 - acc: 0.4301 - val_loss: 1.5918 - val_acc: 0.4318
RMSProp Optimization:
Epoch 10/10
50000/50000 [==============================] - 8s - loss: 0.4915 - acc: 0.8298 - val_loss: 0.8381 - val_acc: 0.7388
For Caffe, I used a batch size of 100 and let it run for 5000 iterations, so that it would see the same number of training samples as the Keras runs. The results I saw were as follows:
SGD Optimization:
I0223 16:26:49.427073 21532 solver.cpp:337] Iteration 5000, Testing net (#0)
I0223 16:26:50.764348 21532 solver.cpp:404] Test net output #0: accuracy = 0.715601
I0223 16:26:50.764392 21532 solver.cpp:404] Test net output #1: loss = 0.852712 (* 1 = 0.852712 loss)
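As a sanity check on the training budgets quoted above, here is a small sketch of the arithmetic. The variable names are just illustrative. Note that it only equates the number of samples seen: the number of weight updates matches as well only if the Keras batch size was also 100, which the logs above don't show.

```python
# Compare the Caffe training budget (iterations x batch size)
# with the Keras training budget (epochs x training-set size).
train_images = 50_000        # CIFAR-10 training set size
keras_epochs = 10
caffe_batch_size = 100
caffe_iterations = 5_000

keras_samples_seen = train_images * keras_epochs
caffe_samples_seen = caffe_batch_size * caffe_iterations
caffe_epochs = caffe_samples_seen / train_images

print("Keras samples seen:", keras_samples_seen)
print("Caffe samples seen:", caffe_samples_seen)
print("Caffe epochs:", caffe_epochs)
```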
Since I used the same parameters for SGD in both Caffe and Keras, I began to suspect that the two implementations differ. My guess is that the difference lies in the 'Normalize' and 'Regularize' functions that Caffe's SGDSolver<Dtype>::ApplyUpdate calls before actually applying the updates. These methods seem to scale the gradients and weights of the net, though I don't understand how or why. If I'm correct in believing that the root of the discrepancy in my results lies in how SGD is implemented in Caffe and Keras, it would really help if someone could explain the differences or point me towards the right papers.
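For reference, here is a minimal NumPy sketch of what those two steps appear to do, based on a reading of Caffe's SGD solver source and not verified against a running build. The function names and arguments are hypothetical, and the sketch assumes L2 regularization only. It ignores per-layer lr_mult/decay_mult and gradient clipping. The second function is textbook momentum SGD with no weight decay in the optimizer, which is my assumption for what a plain SGD optimizer does when no regularizer is attached to the layers.

```python
import numpy as np


def caffe_style_sgd_step(w, grad, history, lr, momentum=0.0,
                         weight_decay=0.0, iter_size=1):
    """One SGD step in the order Caffe's ApplyUpdate appears to use (assumed)."""
    # "Normalize": average gradients accumulated over iter_size mini-batches.
    grad = grad / iter_size
    # "Regularize": L2 weight decay adds weight_decay * w to the gradient.
    grad = grad + weight_decay * w
    # Momentum update: history accumulates the learning-rate-scaled gradient.
    history = momentum * history + lr * grad
    # Apply the update to the weights.
    return w - history, history


def plain_sgd_step(w, grad, velocity, lr, momentum=0.0):
    """Textbook momentum SGD without weight decay, for comparison."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


w0 = np.array([1.0, -2.0])
g0 = np.array([0.5, 0.5])
zeros = np.zeros_like(w0)

# Without weight decay the two steps coincide.
w_caffe, _ = caffe_style_sgd_step(w0, g0, zeros, lr=0.1, momentum=0.9)
w_plain, _ = plain_sgd_step(w0, g0, zeros, lr=0.1, momentum=0.9)
print("no weight decay:  ", w_caffe, w_plain)

# With weight decay the Caffe-style step also shrinks the weights.
w_decay, _ = caffe_style_sgd_step(w0, g0, zeros, lr=0.1, momentum=0.9,
                                  weight_decay=0.004)
print("with weight decay:", w_decay)
```

If this reading is right, the update rules themselves agree for a constant learning rate, and any difference would come from settings such as weight_decay in the solver file or the learning-rate schedule, which would have to be reproduced separately on the Keras side.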
So that anyone interested can verify whether I really did use the same net architecture and optimization parameters, I'm attaching the Python script I used for Keras and the configuration files I used for Caffe.