I've taken a pre-trained model (FCN8s) and fine-tuned it on my data for a very challenging instance segmentation task. I've tried many optimizers from the Caffe library, but only Adam seems able to avoid bad saddle points (I understand that "local minimum" is not really the right term in deep learning).
The problem is, Adam's behavior is hard to understand. What I mean is: when I take, for example, SGD or Adagrad and look at their performance after 10K, 15K, 20K, etc. iterations, they seem to be heading in a consistent direction (not always a good one, of course), and you can more or less see the convergence. So when I run the model on the test data, a 20K-iteration model usually outperforms a 10K one, and so on.
I don't have the same clarity with Adam. Although the training error goes down overall, when I compare results after (say) 5K and 15K iterations of training, they are truly baffling: after 15K the model can do much worse than after, say, 12K, and then all of a sudden improve after 3K more iterations. There doesn't seem to be any convergence at all, and I don't understand when to stop training.
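To make the pattern concrete, here is a toy sketch (plain Python, with hypothetical mean-IoU numbers, not my real results) of what my snapshot comparison looks like, and the obvious workaround of keeping the best snapshot seen so far instead of trusting the latest one:

```python
# Hypothetical validation scores (e.g. mean IoU) per snapshot iteration,
# illustrating the non-monotonic behavior I see with Adam.
val_scores = {
    5000: 0.41,
    10000: 0.47,
    12000: 0.52,  # better than 15K...
    15000: 0.45,  # ...which regresses...
    18000: 0.55,  # ...before suddenly improving again
}

def best_snapshot(scores):
    """Return the iteration whose snapshot scored best on validation,
    rather than assuming the most recent snapshot is the best one."""
    return max(scores, key=scores.get)

print(best_snapshot(val_scores))  # 18000, not the naive "train longer" pick
```

Picking the best-scoring snapshot is what I'm effectively doing by hand, but it doesn't tell me when it's safe to stop training.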
Any suggestions on what to do, and on why this may be happening, are appreciated.