Trust regions, useRand, escaping saddle points...

Nicolas Boumal

Feb 5, 2016, 3:05:08 PM
to Manopt
Hello everyone,

This is an e-mail I sent to Ju Sun in reply to some questions.
Ju does some excellent work regarding global optimality on manifolds; check out his papers here.

The discussion touches upon a topic that is interesting for many Manopt users, namely: how do you escape saddle points when using Riemannian Trust Regions?

Saddle points are points where the gradient is zero but the Hessian has at least one negative eigenvalue (so such a point is not even a local minimizer, yet an algorithm that is not careful can get stuck there).


---------------------
Hello Ju,

Here is my take on truncated conjugate gradients (tCG), the option "useRand" (on/off) and the issue of escaping saddle points, with Riemannian Trust Regions (RTR) in Manopt.

With useRand "off", tCG starts with the zero vector and iterates a certain procedure. The procedure is such that, if eta_0 = 0, then eta_1 is the Cauchy point. As I am sure you know, the Cauchy point is the optimizer of the TR inner problem restricted to the gradient direction. Theory guarantees that if RTR achieves at least (a fraction of) the Cauchy decrease in the model at each iteration, then we get convergence to critical points. Combine that with the fact that tCG is monotonically improving, and you get convergence for RTR with tCG, regardless of when you stop tCG. (For fast local convergence, you need more work, of course.)
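
For concreteness, in standard notation: with g = grad f(x), H = Hess f(x) and trust-region radius Delta, the model is

    m(eta) = f(x) + <g, eta> + (1/2) <eta, H[eta]>,

and the Cauchy point is

    eta_C = -tau * (Delta / ||g||) * g,   with tau = 1 if <g, H[g]> <= 0, and tau = min( ||g||^3 / (Delta * <g, H[g]>), 1 ) otherwise.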

Now, if you are at a saddle, the gradient is zero, and there is no Cauchy point to speak of. In those cases (and similarly, if you are at a point where the gradient is very small), an alternative strategy is to compute a negative-curvature direction and follow that direction all the way to the boundary of the TR. You can show that if you do that, you get convergence to second-order critical points (this will appear soon in a paper with Absil and Cartis).
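
Concretely: take a unit-norm tangent vector u with <u, H[u]> < 0, flip its sign if needed so that <u, g> <= 0, and step to eta = Delta * u, which lies on the TR boundary.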

You can cover both scenarios if you have a method that solves the inner problem optimally at each step (or at least gets a fixed fraction of the optimal model improvement). If you do that, you are certain to perform at least as well as a fraction of the Cauchy step and at least as well as a fraction of the "most negative curvature step" (I call these eigensteps). Thus, you will get convergence to second-order critical points as well. But... you have to solve this problem globally. It can be done in polynomial time, but it's a bit expensive (overkill).

Setting useRand to "on" is an optimist's take on this state of affairs. You don't want to compute eigensteps, and you don't want to solve the TR inner problem optimally. So, instead, you pick your eta_0 at random. This allows you to explore directions other than the Cauchy step, giving you better chances of discovering negative curvature when it is overwhelmingly present (many negative eigenvalues in the Hessian). In the code, there is also a check to make sure we do at least as well as the Cauchy point, so as not to harm convergence to critical points.

Of course, it makes a lot more sense to set useRand to "on" only when the gradient is small. I think that was the originally intended use when Chris Baker, Absil and Gallivan coded the original software, GenRTR: first solve with useRand "off"; then, just to be a little safer, run again from your small-gradient point with useRand "on". If that doesn't make you move away, your confidence increases (a heuristic).
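
In Manopt, that heuristic looks roughly like this (a minimal sketch; variable names and the absence of other options are just for illustration):

    options.useRand = false;
    x1 = trustregions(problem, x0, options);   % standard RTR run
    options.useRand = true;                    % randomize tCG's initial vector
    x2 = trustregions(problem, x1, options);   % rerun from the candidate point
    % If x2 is essentially the same as x1, that is (heuristic) evidence that x1 is not a saddle.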

The thing about eigensteps is that it is deceptively difficult to find good literature on how hard they are (or aren't) to compute. There are some really good (but old) papers about this, but the focus back then was on different things, and so they don't quite answer the precise questions that are useful today. They are also pretty difficult to get into, so adapting them is not trivial. We'll try to discuss that briefly in the paper with Absil and Cartis.

I hope this makes sense and partly addresses your questions.

Cheers,

Nicolas

PS: to compute a most-negative-curvature direction in Manopt, if the Hessian is defined in your problem structure, you can call hessianextreme.
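
For example (a sketch; see the help of hessianextreme for the full interface):

    [u, lambda] = hessianextreme(problem, x);   % lambda = <u, Hess f(x)[u]>, minimized over unit-norm u
    % If lambda < 0, then u is a direction of negative curvature at x.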
---------------------

Ju Sun

Feb 5, 2016, 9:18:48 PM
to Manopt
I thank Nicolas very much for his very informative discussion on this issue! 

Just to add a bit on the implementation side.

Based on his discussion, one simple (though far from efficient) way to ensure escaping saddle points [NB: we call saddle points at which the Hessian has a negative eigenvalue "ridable/strict saddles" -- not all saddles have this property; see my expository article] is to use a negative-curvature direction (with its sign adjusted according to its correlation with the gradient vector) as the initial search direction for the tCG solver, whenever the gradient is small and such directions exist.

I made such changes to the tCG solver, using the hessianextreme routine to determine a negative-curvature direction when necessary. It did not work right away: the program simply stagnated at some point. The reason is that hessianextreme itself calls the trustregions solver when Hessian information is available, and the trustregions solver in turn calls the modified tCG, which again calls hessianextreme at some point. This creates a calling recursion, and there currently seems to be no stopping rule in place to prevent it. One way to get around this is to choose another solver, say conjugategradient, for the hessianextreme routine.
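
Concretely, that workaround can go something like this (the opts name is mine; as I understand it, hessianextreme forwards its options structure to manoptsolve, which lets you pick the inner solver):

    opts.solver = @conjugategradient;                     % avoid the recursive call to trustregions
    [u, lambda] = hessianextreme(problem, x, 'min', [], opts);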

Best, 
Ju
