NARC This Week: Second Order Optimization for Non-Convex Machine Learning


Brian DeSilva

Nov 13, 2017, 5:36:04 PM
to uwn...@googlegroups.com
Hi everyone,

This week we will be discussing the paper below in preparation for the talk Michael Mahoney will be giving later this month. If you are interested in optimization in big-data settings, be sure to look over the paper and stop by!

Title: Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study
Abstract: below

There will be snacks.

Best,

Brian


Abstract:
The resurgence of deep learning as a highly effective machine learning paradigm has brought back to life the old optimization question of non-convexity. Indeed, the challenges posed by the large-scale nature of many modern machine learning applications are severely exacerbated by the inherent non-convexity of the underlying models. In this light, efficient optimization algorithms that can be effectively applied to such large-scale and non-convex learning problems are highly desirable. However, the bulk of research has been almost completely restricted to the class of first-order algorithms, i.e., those that employ only gradient information. This is despite the fact that employing curvature information, e.g., in the form of the Hessian, can indeed yield effective methods with desirable convergence properties for non-convex problems, e.g., avoiding saddle points and converging to local minima. The conventional wisdom in the machine learning community is that applying second-order methods, i.e., those that employ the Hessian as well as gradient information, can be highly inefficient. Consequently, first-order algorithms, such as stochastic gradient descent (SGD), have taken center stage for solving such machine learning problems.

Here, we aim to address this misconception by considering efficient and stochastic variants of Newton's method, namely sub-sampled trust-region and cubic regularization, whose theoretical convergence properties have recently been established in [67]. Using a variety of experiments, we empirically evaluate the performance of these methods for solving non-convex machine learning applications. In doing so, we highlight the shortcomings of first-order methods, e.g., high sensitivity to hyper-parameters such as step size and undesirable behavior near saddle points, and showcase the advantages of employing curvature information as an effective remedy.
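If you want a concrete picture of what "sub-sampled" second-order optimization looks like before the meeting, here is a minimal sketch (mine, not the authors' code) of a sub-sampled Newton-CG step in Python: the gradient is exact, while the Hessian is only ever touched through Hessian-vector products on a random sub-sample of the data. The least-squares objective and the problem sizes are hypothetical stand-ins for the non-convex problems in the paper, and the trust-region / cubic-regularization safeguards the authors analyze are omitted for brevity.

# Minimal sketch of a sub-sampled Newton-CG iteration (illustrative only).
# Full gradient + Hessian-vector products on a random sub-sample + CG solve.
import numpy as np

def loss_grad(w, X, y):
    # least-squares loss 0.5*mean((Xw - y)^2) and its gradient
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

def hessian_vec(v, X_sub):
    # Hessian-vector product using only the sub-sample X_sub
    return X_sub.T @ (X_sub @ v) / len(X_sub)

def cg_solve(hvp, b, iters=50, tol=1e-8):
    # conjugate gradients for H p = b, using only Hessian-vector products
    p, r, d = np.zeros_like(b), b.copy(), b.copy()
    rs = r @ r
    for _ in range(iters):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

rng = np.random.default_rng(0)
n, dim, batch = 10_000, 50, 500            # hypothetical sizes
X = rng.standard_normal((n, dim))
y = X @ rng.standard_normal(dim) + 0.1 * rng.standard_normal(n)
w = np.zeros(dim)

for it in range(10):
    f, g = loss_grad(w, X, y)
    idx = rng.choice(n, size=batch, replace=False)   # Hessian sub-sample
    step = cg_solve(lambda v: hessian_vec(v, X[idx]), -g)
    w += step                                        # plain Newton step
    print(f"iter {it}: loss {f:.6f}")

The point of the sketch is that curvature information never requires forming or inverting the full Hessian: a handful of Hessian-vector products on a small batch is enough to produce a step whose scale is set by the curvature rather than by a hand-tuned learning rate.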