I've taken a pre-trained model (FCN8s) and fine-tuned it on my data for a very challenging instance segmentation task. I've tried many optimizers from the Caffe library, but only Adam seems able to avoid bad saddle points (I understand that "local minimum" is not really the right term in deep learning).
The problem is, Adam's behavior is hard to understand. What I mean is: when I take, for example, SGD or Adagrad and look at their performance after 10K, 15K, 20K, etc. iterations, they seem to be heading in a consistent direction (not always a good one, of course), and you can more or less see the convergence. So when I run the model on the test data, a 20K-iteration model usually outperforms a 10K one, and so on.
I don't have the same clarity with Adam. Although the training error goes down overall, when I compare results after (say) 5K and 15K iterations of training, they are truly baffling: after 15K the model can do much worse than after, say, 12K, and then all of a sudden improve after 3K more iterations. There doesn't seem to be any convergence at all, and I don't understand when to stop training.
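To make the pattern concrete, here is a toy sketch (plain Python, with hypothetical mean-IoU numbers, not my real results) of what my snapshot comparison looks like, and the obvious workaround of keeping the best snapshot seen so far instead of trusting the latest one:

```python
# Hypothetical validation scores (e.g. mean IoU) per snapshot iteration,
# illustrating the non-monotonic behavior I see with Adam.
val_scores = {
    5000: 0.41,
    10000: 0.47,
    12000: 0.52,  # better than 15K...
    15000: 0.45,  # ...which regresses...
    18000: 0.55,  # ...before suddenly improving again
}

def best_snapshot(scores):
    """Return the iteration whose snapshot scored best on validation,
    rather than assuming the most recent snapshot is the best one."""
    return max(scores, key=scores.get)

print(best_snapshot(val_scores))  # 18000, not the naive "train longer" pick
```

Picking the best-scoring snapshot is what I'm effectively doing by hand, but it doesn't tell me when it's safe to stop training.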
Any suggestions on what to do, and on why this may be happening, are appreciated.