Best Caffe interfaces/wrappers to optimize training parameters

Giulia

Jan 27, 2016, 5:16:31 AM
to Caffe Users
Hi,

I guess most people today are struggling to find the best parameter combination to train, or more likely fine-tune, a given network on a given task in Caffe.

The parameters involved can include, for instance:
- which layers to learn from scratch
- initial learning rate per layer
- learning rate decay policy
- dropout percentage
...

And the best combination is usually found by grid, random, or Bayesian search.

I would like suggestions from the community about frameworks to handle this search, that is, to:

- set up multiple training trials (by defining multiple solver.prototxt and train_val.prototxt)
- train all of them for a certain number of epochs and store the best validation accuracy/loss for each trial
- retrieve the model of the best trial at the best epoch

What do you think of this code: https://github.com/kuz/caffe-with-spearmint?

Are there any other simple solutions (ideally, I would like to use Caffe's Matlab interface...)?

Since I have a single Tesla K40 and can usually train only one model at a time, even a simple Matlab script that runs many training trials in series would be fine, along the lines of the sketch below.
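(To make this concrete, here is a minimal sketch of the kind of serial loop I mean, written with pycaffe; the solver paths, the 'accuracy' blob name and the iteration budget are placeholders, and the same structure should be reproducible through the Matlab interface.)

import caffe

caffe.set_mode_gpu()

# One solver.prototxt per trial; paths are placeholders.
trials = ['trial1_solver.prototxt', 'trial2_solver.prototxt']
results = {}

for solver_path in trials:
    solver = caffe.get_solver(solver_path)
    best_acc = 0.0
    for _ in range(20):                 # e.g. 20 validation checks per trial
        solver.step(500)                # train 500 iterations between checks
        solver.test_nets[0].forward()   # one validation batch; average over more in practice
        acc = float(solver.test_nets[0].blobs['accuracy'].data)
        if acc > best_acc:              # keep the snapshot of the best point so far
            best_acc = acc
            solver.net.save(solver_path + '.best.caffemodel')
    results[solver_path] = best_acc

print(results)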

Thanks
Giulia





Jan C Peters

Jan 27, 2016, 7:04:48 AM
to Caffe Users
Very good question.

In the past I have used the pycaffe interface to do exactly that:

- set up multiple training trials (by defining multiple solver.prototxt and train_val.prototxt)
- train all of them for a certain number of epochs and store the best validation accuracy/loss for each trial
- retrieve the model of the best trial at the best epoch

and Spearmint to manage the hyperparameter optimization (I did not know about the GitHub project you referenced; it probably did not exist back then). It is quite a bit of work to write up the necessary scripts, but it pays off. Keep in mind that every single evaluation of the cost function over the hyperparameters (I used the mapping "training parameters" to "best final test score") is extremely expensive, so you should only optimize one (or a few) parameters at a time. In my personal experience it is better to understand how all the parameters affect the training, choose them reasonably, and do the hyperparameter optimization only for the fine-tuning of the fine-tuning, so to speak: when you really need that final 0.1% improvement for some reason and have reasonable hope that you can actually get there. You can be a little more "careless" and do more automated optimization if you have a large GPU-heavy infrastructure for training, but for the average user this just will not do (unless you are willing to wait weeks for your results).
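To give a rough idea of the glue code involved (a sketch only: the Spearmint entry point main(job_id, params), the 'accuracy' blob name, the iteration budget and the write_solver helper are assumptions that depend on your Spearmint version and your prototxt files):

import caffe

def run_trial(solver_path, test_interval=1000, max_iter=10000, test_iters=100):
    """Train with pycaffe and return the best validation accuracy seen."""
    caffe.set_mode_gpu()
    solver = caffe.get_solver(solver_path)
    best_acc = 0.0
    for _ in range(0, max_iter, test_interval):
        solver.step(test_interval)            # train for test_interval iterations
        acc = 0.0
        for _ in range(test_iters):           # average accuracy over the validation set
            solver.test_nets[0].forward()
            acc += float(solver.test_nets[0].blobs['accuracy'].data)
        acc /= test_iters
        if acc > best_acc:
            best_acc = acc
            solver.net.save('best_trial.caffemodel')   # keep the best snapshot
    return best_acc

def main(job_id, params):
    # Spearmint proposes hyperparameters in `params` (e.g. params['base_lr']);
    # write them into a fresh solver.prototxt (write_solver is a hypothetical
    # helper) and return a value for Spearmint to minimize.
    solver_path = write_solver(job_id, params)
    return -run_trial(solver_path)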

Another point to keep in mind: if you also optimize the architecture itself, such as the number of filters per conv layer, the different runs (cost function evaluations) can vary a lot in wall time. In particular, configurations that do not make much sense but need a LOT of training time may be evaluated as well by the hyperparameter search (e.g. if it is a grid search).

Just to put in my two cents.

Jan

Giulia

Jan 27, 2016, 8:04:22 AM
to Caffe Users
My experience agrees with yours. I spent some time fine-tuning nets with DIGITS, trying to understand the influence of the different parameters, and reached the same conclusion: even after trying many possible combinations, I was not able to increase the maximum validation accuracy by more than 5%, which is a lot, I admit, but not an incredible gain. I observed more or less the same maximum validation accuracy in both the overfitted and the more regularized cases. Nevertheless, I did see the benefit of more regularization in terms of the stability of the validation accuracy across epochs: when overfitting, the accuracy dropped after reaching its maximum in the very first epoch(s); with more regularization, the accuracy kept increasing a little with each epoch (after the big jump in the first epoch(s)).

Now I would like to automate this a bit and make the process more systematic, rather than spending days launching jobs and watching the trend of the validation loss. But since my computational infrastructure is quite limited, and I am more interested in reaching a "reasonable" accuracy on multiple datasets than in gaining the final 0.1% on a specific dataset, I think I will closely follow your advice.

Today I am playing with https://github.com/kuz/caffe-with-spearmint, which seems quite easy to plug and play. I will see whether it is flexible enough or whether it is better to write some customized (Matlab) code; in the latter case I will let you know.

Giulia

Jan C Peters

Jan 27, 2016, 8:54:06 AM
to Caffe Users
I should repeat that these are my personal experiences and that, although I have been doing CNN training for some time now (about one year), I would not call myself the ultimate expert on the matter. In short: my intuitions may well be wrong (but I like to think they are getting better with time).

Apart from that disclaimer: I am happy to see my intuitions confirmed, and I am always happy to learn more, so please do let me know about your findings. There is so much we don't know about deep learning; I frequently have the feeling that the things we don't know significantly outweigh the things we do know. Every so often someone introduces a new regularization/optimization/improvement method that brings further parameters with it. The result: the practitioner wonders whether to use that extension, whether it actually helps their problem, and how to choose the additional parameters. And currently there is such a multitude of approaches and extensions you _could_ use that it simply overwhelms you. To find something that might work for you, the only real guidance you have is your intuition and experience. At least this is how I feel about it.

Jan

P.S.: Yes, the effect of regularization on the error/accuracy evolution you observed is familiar to me, too. I think that it matches the "common notion" of the deep learning community as well.