NARC This Week: Second Order Optimization for Non-Convex Machine Learning


Brian DeSilva

Nov 13, 2017, 5:36:04 PM
to uwn...@googlegroups.com
Hi everyone,

This week we will be discussing the paper below in preparation for the talk Michael Mahoney will be giving later this month. If you are interested in optimization in big-data settings, be sure to look over the paper and stop by!

Title: Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study
Abstract: below

There will be snacks.

Best,

Brian


Abstract:
The resurgence of deep learning as a highly effective machine learning paradigm has brought back to life the old optimization question of non-convexity. Indeed, the challenges posed by the large-scale nature of many modern machine learning applications are severely exacerbated by the inherent non-convexity of the underlying models. In this light, efficient optimization algorithms that can be effectively applied to such large-scale and non-convex learning problems are highly desirable. However, the bulk of research has been almost completely restricted to the class of first-order algorithms, i.e., those that employ only gradient information. This is despite the fact that employing curvature information, e.g., in the form of the Hessian, can indeed yield effective methods with desirable convergence properties for non-convex problems, e.g., avoiding saddle points and converging to local minima. The conventional wisdom in the machine learning community is that applying second-order methods, i.e., those that employ the Hessian as well as gradient information, can be highly inefficient. Consequently, first-order algorithms, such as stochastic gradient descent (SGD), have taken center stage for solving such machine learning problems.

Here, we aim to address this misconception by considering efficient and stochastic variants of Newton's method, namely sub-sampled trust-region and cubic regularization, whose theoretical convergence properties have recently been established in [67]. Using a variety of experiments, we empirically evaluate the performance of these methods for solving non-convex machine learning applications. In doing so, we highlight the shortcomings of first-order methods, e.g., high sensitivity to hyper-parameters such as step size and undesirable behavior near saddle points, and showcase the advantages of employing curvature information as an effective remedy.
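If you want a concrete picture of what "sub-sampled" second-order optimization looks like before the meeting, here is a minimal sketch (mine, not the authors' code) of a sub-sampled Newton-CG step in Python: the gradient is exact, while the Hessian is only ever touched through Hessian-vector products on a random sub-sample of the data. The least-squares objective and the problem sizes are hypothetical stand-ins for the non-convex problems in the paper, and the trust-region / cubic-regularization safeguards the authors analyze are omitted for brevity.

# Minimal sketch of a sub-sampled Newton-CG iteration (illustrative only).
# Full gradient + Hessian-vector products on a random sub-sample + CG solve.
import numpy as np

def loss_grad(w, X, y):
    # least-squares loss 0.5*mean((Xw - y)^2) and its gradient
    r = X @ w - y
    return 0.5 * np.mean(r ** 2), X.T @ r / len(y)

def hessian_vec(v, X_sub):
    # Hessian-vector product using only the sub-sample X_sub
    return X_sub.T @ (X_sub @ v) / len(X_sub)

def cg_solve(hvp, b, iters=50, tol=1e-8):
    # conjugate gradients for H p = b, using only Hessian-vector products
    p, r, d = np.zeros_like(b), b.copy(), b.copy()
    rs = r @ r
    for _ in range(iters):
        Hd = hvp(d)
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p

rng = np.random.default_rng(0)
n, dim, batch = 10_000, 50, 500            # hypothetical sizes
X = rng.standard_normal((n, dim))
y = X @ rng.standard_normal(dim) + 0.1 * rng.standard_normal(n)
w = np.zeros(dim)

for it in range(10):
    f, g = loss_grad(w, X, y)
    idx = rng.choice(n, size=batch, replace=False)   # Hessian sub-sample
    step = cg_solve(lambda v: hessian_vec(v, X[idx]), -g)
    w += step                                        # plain Newton step
    print(f"iter {it}: loss {f:.6f}")

The point of the sketch is that curvature information never requires forming or inverting the full Hessian: a handful of Hessian-vector products on a small batch is enough to produce a step whose scale is set by the curvature rather than by a hand-tuned learning rate.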