Is Keras' implementation of SGD significantly different than Caffe's?


Steven

Feb 23, 2017, 5:04:19 PM
to Keras-users
I've been trying to compare Keras and Caffe and have started to wonder whether they implement Stochastic Gradient Descent differently. I started to consider this after I implemented the same network architecture in both frameworks on the CIFAR-10 dataset and found that, using SGD, Caffe gave markedly better results than Keras. If, however, I let Keras use RMSProp instead, it achieved results comparable to Caffe's.

I let each Keras experiment run for 10 epochs over the 50,000-image training set and saw the following results:

SGD Optimization:
Epoch 10/10
50000/50000 [==============================] - 6s - loss: 1.6015 - acc: 0.4301 - val_loss: 1.5918 - val_acc: 0.4318


RMSProp Optimization:
Epoch 10/10
50000/50000 [==============================] - 8s - loss: 0.4915 - acc: 0.8298 - val_loss: 0.8381 - val_acc: 0.7388
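
To be explicit about how the two Keras runs differed: the only change between them was the optimizer passed to model.compile. A stripped-down version of the setup looks roughly like this (the real architecture and hyperparameter values are in the attached cifar10_caffe_cnn_v6.py; the ones below are placeholders):

from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import SGD, RMSprop
from keras.utils import np_utils

# Stand-in data and model -- the real CNN is in the attached script.
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train, X_test = X_train.astype('float32') / 255, X_test.astype('float32') / 255
Y_train, Y_test = np_utils.to_categorical(y_train, 10), np_utils.to_categorical(y_test, 10)

model = Sequential()
model.add(Convolution2D(32, 5, 5, activation='relu', input_shape=X_train.shape[1:]))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

# Placeholder hyperparameters; swap SGD(...) for RMSprop() to get the second run.
optimizer = SGD(lr=0.001, momentum=0.9)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=100, nb_epoch=10,
          validation_data=(X_test, Y_test))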



For Caffe, I used a batch size of 100 and let it run for 5000 iterations (5000 × 100 = 500,000 images, i.e. the same ten passes over the 50,000-image training set), so that it would do the same amount of training as the Keras runs. The results I saw were as follows:

SGD Optimization:
I0223 16:26:49.427073 21532 solver.cpp:337] Iteration 5000, Testing net (#0)
I0223 16:26:50.764348 21532 solver.cpp:404]     Test net output #0: accuracy = 0.715601
I0223 16:26:50.764392 21532 solver.cpp:404]     Test net output #1: loss = 0.852712 (* 1 = 0.852712 loss)



Since I used the same parameters for both Caffe's and Keras's SGD, I began to suspect that the two implementations differ. I suspect the difference lies in the 'Normalize' and 'Regularize' functions that Caffe's SGDSolver<Dtype>::ApplyUpdate calls before actually applying the updates. These functions appear to rescale the net's gradients and weights, though I don't understand how or why. If I'm right that the root of the discrepancy in my results lies in how SGD is implemented in Caffe and Keras, it would really help if someone could explain the differences, or point me towards the right papers.
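
For what it's worth, here is how I currently read those two functions, written out as plain NumPy so the comparison with Keras is explicit. This is only my paraphrase of sgd_solver.cpp and of Keras's SGD optimizer, not something I've verified against either codebase line by line:

import numpy as np

# My reading of one Caffe SGDSolver step (paraphrased from sgd_solver.cpp).
def caffe_sgd_step(w, grad, history, lr, momentum, weight_decay, iter_size=1):
    grad = grad / iter_size                      # Normalize(): average gradients accumulated over iter_size batches
    grad = grad + weight_decay * w               # Regularize(): fold solver-level L2 weight decay into the gradient
    history[:] = momentum * history + lr * grad  # ComputeUpdateValue(): momentum buffer
    w -= history                                 # Blob::Update(): data -= diff
    return w

# Keras (1.x) SGD step as I understand it: the same momentum form up to a sign,
# but no weight decay here -- in Keras that only happens via per-layer regularizers.
def keras_sgd_step(w, grad, velocity, lr, momentum):
    velocity[:] = momentum * velocity - lr * grad
    w += velocity
    return w

# With weight_decay=0 and iter_size=1 the two rules trace the same trajectory:
w1, w2 = np.ones(3), np.ones(3)
h, v = np.zeros(3), np.zeros(3)
g = np.array([0.1, -0.2, 0.3])
print(caffe_sgd_step(w1, g, h, lr=0.001, momentum=0.9, weight_decay=0.0))
print(keras_sgd_step(w2, g, v, lr=0.001, momentum=0.9))

If that reading is right, the momentum update itself matches, and the practical differences are the division by iter_size, the solver-level weight_decay, and the fact that Caffe also follows whatever lr_policy the solver file specifies, while Keras's SGD only has its own fixed (or lr-decayed) learning rate and leaves weight decay to per-layer regularizers.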

So that anyone who's interested can verify that I really did use the same net architecture and optimization parameters, I'm attaching the relevant Python script for Keras and the Caffe configuration files.


cifar10_quick_train_test_compare.prototxt
train_quick_compare.sh
cifar10_quick_solver_compare.prototxt
cifar10_caffe_cnn_v6.py

François Chollet

Feb 23, 2017, 5:18:15 PM
to Steven, Keras-users
Look, if RMSprop works much better than SGD on your model, it means that you are using a terrible learning rate in SGD. Hyperparameters matter. You can't expect to pick any set of hyperparameters and get optimal results. Which is why you should generally stay away from SGD and use RMSprop instead.


Steven

Feb 23, 2017, 7:42:06 PM
to Keras-users, s.m.gu...@gmail.com
The issue is that SGD worked much better with Caffe than with Keras. 

RMSProp with Keras was only there to convince myself that the issue lay in a difference between Keras's SGD optimizer and Caffe's. To the best of my knowledge, I used the same hyperparameters for SGD with Caffe that I did with Keras, and I used the same net architecture.

The same net architecture with the same optimization method and same hyperparameters should not result in a 30% difference in accuracy after the same amount of training on the same data.

The success of RMSProp made me think that the problem probably doesn't lie in my architecture or data. I'm fairly certain I copied the hyperparameters correctly, but maybe I did something foolish and didn't correctly transfer Caffe's net and solver settings to Keras.

I saw functions in Caffe's SGDSolver that look as though they adjust the net's gradients and weights to keep them within some sort of reasonable bounds, but I didn't really understand the C++ I was reading. So now I'm trying to find out either what I should have done to reliably reproduce the Caffe model in Keras, or what Caffe does that makes SGD perform so much better.
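
One concrete guess: if the solver-level weight_decay is the piece I failed to carry over, the closest Keras equivalent I can see is attaching an l2 regularizer to every trainable layer, since Keras's SGD(decay=...) decays the learning rate rather than the weights. Something along these lines (Keras 1.x API; the 0.004 below is only a placeholder for whatever weight_decay value the attached solver prototxt actually uses):

from keras.layers import Convolution2D, Dense
from keras.regularizers import l2

WEIGHT_DECAY = 0.004  # placeholder for the solver's weight_decay

# W_regularizer=l2(...) on each trainable layer is, as far as I can tell,
# the closest Keras analogue of Caffe's solver-level L2 weight_decay.
conv = Convolution2D(32, 5, 5, border_mode='same',
                     W_regularizer=l2(WEIGHT_DECAY),
                     input_shape=(3, 32, 32))
fc = Dense(10, W_regularizer=l2(WEIGHT_DECAY), activation='softmax')

(If I'm reading both libraries right, Caffe adds weight_decay * w to the gradient, while Keras's l2(c) adds c * sum(w**2) to the loss, i.e. a gradient of 2*c*w, so an exact match would actually be l2(weight_decay / 2). I'd appreciate a sanity check from someone who knows both codebases.)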