next meeting on friday

Leif Johnson

Apr 7, 2015, 5:35:10 PM
to ut-flare
Hi folks -

Our next meeting will be this Friday, April 10th, at 2:30pm in GDC 3.516.

It seems like there's a good bit of interest in a few recent papers on
optimization and local minima. We'll focus on the following paper:


http://arxiv.org/abs/1405.4604v2

On the saddle point problem for non-convex optimization
Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, Yoshua Bengio

A central challenge to many fields of science and engineering involves
minimizing non-convex error functions over continuous, high
dimensional spaces. Gradient descent or quasi-Newton methods are
almost ubiquitously used to perform such minimizations, and it is
often thought that a main source of difficulty for the ability of
these local methods to find the global minimum is the proliferation of
local minima with much higher error than the global minimum. Here we
argue, based on results from statistical physics, random matrix
theory, and neural network theory, that a deeper and more profound
difficulty originates from the proliferation of saddle points, not
local minima, especially in high dimensional problems of practical
interest. Such saddle points are surrounded by high error plateaus
that can dramatically slow down learning, and give the illusory
impression of the existence of a local minimum. Motivated by these
arguments, we propose a new algorithm, the saddle-free Newton method,
that can rapidly escape high dimensional saddle points, unlike
gradient descent and quasi-Newton methods. We apply this algorithm to
deep neural network training, and provide preliminary numerical
evidence for its superior performance.
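
To make the core idea concrete, here's a rough NumPy sketch of the kind of step the
paper proposes. This is my own paraphrase, not the authors' code, and it assumes we
can afford a full eigendecomposition of the Hessian, which the real algorithm is
designed to avoid:

    import numpy as np

    def saddle_free_newton_step(grad, hessian, lr=1.0, damp=1e-8):
        """Rescale the gradient by the inverse *absolute* eigenvalues of the
        Hessian, so directions of negative curvature are followed away from
        the saddle rather than toward it (where a plain Newton step heads)."""
        eigvals, eigvecs = np.linalg.eigh(hessian)          # Hessian is symmetric
        inv_abs = 1.0 / np.maximum(np.abs(eigvals), damp)   # |lambda|^-1, damped near zero
        # |H|^-1 g = V diag(|lambda|^-1) V^T g
        return -lr * eigvecs.dot(inv_abs * eigvecs.T.dot(grad))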


For this week's meeting, it might also be interesting to look over
another paper from the same lab showing that RMSProp implements a
technique much like the one above: http://arxiv.org/abs/1502.04390
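
The connection is easier to see next to a bare-bones RMSProp update (a sketch in
plain NumPy, not the paper's code): each coordinate of the gradient gets divided by
a running RMS of its recent gradients, which acts like a rough diagonal
preconditioner in the same spirit as the rescaling above.

    import numpy as np

    def rmsprop_step(theta, grad, ms, lr=1e-3, rho=0.9, eps=1e-8):
        """One RMSProp update: keep a running average of squared gradients and
        divide each gradient coordinate by its RMS."""
        ms = rho * ms + (1.0 - rho) * grad ** 2
        theta = theta - lr * grad / (np.sqrt(ms) + eps)
        return theta, ms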

lmj

--
http://www.cs.utexas.edu/~leif

Leif Johnson

Apr 10, 2015, 10:56:03 AM
to ut-flare
Reminder: meeting today!
--
http://www.cs.utexas.edu/~leif

Karl Pichotta

Apr 10, 2015, 6:35:22 PM
to Leif Johnson, ut-flare
Hey y'all

If you're interested, the same authors (more or less) had a NIPS 2014 paper that was a very close followup to the tech report we read this afternoon:

http://arxiv.org/pdf/1406.2572.pdf

Here are a few new things that are fun and worth noting ((2) especially):

(1) Figures 1(a) and 1(c) plot empirical error epsilon vs. alpha (the "index", i.e. the fraction of negative eigenvalues of the Hessian) for a single-hidden-layer multilayer perceptron trained on both MNIST and CIFAR. Basically they found critical points (using some weird methods, described in an appendix) and, for each of these, computed epsilon and alpha.

(2) They give a way to approximate the Hessian, following Vinyals and Povey (2012), using a Krylov subspace spanned by vectors found with the Lanczos method. This way, we don't have to do an eigendecomposition of an explicitly computed Hessian at every single iteration. In this approximation, (A) the eigenvectors of the Hessian with the largest eigenvalues will be spanned by this subspace with high probability; and (B) the representation is low-rank enough that it's cheap to work with. This is explained in the last paragraph of p. 6 and Appendix E, though I wish they had elaborated a bit more. (There's a rough code sketch of this idea below, after these notes.)

One thing that seems weird about this: we scale by the inverse absolute eigenvalues to get off plateaus, so it's the directions with eigenvalues close to zero that benefit most from Newton's method. But according to point (A) above, it's unclear whether those directions will even be represented in the low-rank Hessian approximation. I don't really know this math well, so I don't know. It Seems To Work, Though.

(3) Using this approximation, they apply their method to pretty-big networks (a deep autoencoder on non-scaled-down MNIST, though they don't say how deep, and a vanilla RNN with 120 hidden states doing character-level language modeling on text).

They do this pretty cool thing wherein they wait for error under SGD to plateau, and then apply their method to that net. Turns out that works (see fig 4)!
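
For anyone who wants to poke at (2), here's a very rough sketch of the low-rank
version using SciPy's Lanczos-based eigsh on a Hessian-vector-product operator.
This is just my guess at the flavor of the construction, not the paper's actual
Krylov method; hvp, grad, n_params, and the choice to pass the out-of-subspace
part of the gradient through unchanged are all my own placeholders/assumptions.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, eigsh

    def saddle_free_step_lowrank(grad, hvp, n_params, k=20, lr=1.0, damp=1e-4):
        """Approximate saddle-free step using only the k largest-magnitude
        eigenpairs of the Hessian, found via Lanczos iterations that need
        nothing but Hessian-vector products (no explicit Hessian)."""
        H = LinearOperator((n_params, n_params), matvec=hvp)
        eigvals, eigvecs = eigsh(H, k=k, which='LM')   # ARPACK/Lanczos under the hood
        coeffs = eigvecs.T.dot(grad)
        # Rescale by 1/|lambda| inside the subspace; leave the remainder of the
        # gradient untouched (a simple choice, not necessarily theirs).
        in_sub = eigvecs.dot(coeffs / np.maximum(np.abs(eigvals), damp))
        out_sub = grad - eigvecs.dot(coeffs)
        return -lr * (in_sub + out_sub)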

_k



Leif Johnson

Apr 12, 2015, 10:38:33 PM
to Karl Pichotta, ut-flare
Thanks for forwarding this!

Re: the low-rank approximation to the Hessian: I think even a low-rank
approximation could identify some negative eigenvalues (they just wouldn't
have small magnitudes). If that's the case, then the large negative
eigenvalues would be exactly the directions where we most want to take a
step, right? (Thinking out loud: those are the directions along which the
derivative is decreasing most quickly, so stepping along them seems like a
good idea.)
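
Here's the tiniest possible sanity check of that intuition (a made-up 2-D
quadratic, nothing from the paper); the only point is the sign of the step
along the negative-curvature direction:

    import numpy as np

    # f(x) = 0.5 * x^T H x, with one positive and one large negative eigenvalue.
    H = np.diag([2.0, -10.0])
    x = np.array([1.0, 0.5])              # a point near the saddle at the origin
    g = H.dot(x)                          # gradient = [2.0, -5.0]

    newton = -np.linalg.solve(H, g)       # = -x = [-1.0, -0.5]: jumps toward the saddle
    saddle_free = -g / np.abs(np.diag(H)) # = [-1.0,  0.5]: same first coordinate, but
                                          # the second moves away from the saddle, i.e.
                                          # downhill along the negative-curvature axis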

lmj
--
http://www.cs.utexas.edu/~leif