Title: Living on the Edge (of Stability): Why "Too Large" Step Sizes Are Actually Good
Abstract:
In the old-school optimization theory analysis of Gradient Descent (GD), as long as the step size is small enough, GD enjoys smooth, monotonic descent to the minimum. The math is clean, and theoreticians are happy.
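For reference, the textbook statement behind this picture is the descent lemma (a standard fact, stated here for an L-smooth loss f and step size $\eta$):

    $$ f\big(x_t - \eta \nabla f(x_t)\big) \;\le\; f(x_t) - \eta\Big(1 - \tfrac{\eta L}{2}\Big)\,\|\nabla f(x_t)\|^2, $$

so the loss decreases monotonically whenever $\eta < 2/L$. On a quadratic with curvature $\lambda$, the error evolves as $x_{t+1} - x^* = (1 - \eta\lambda)(x_t - x^*)$, which contracts exactly when $\eta < 2/\lambda$; this 2/(step size) curvature threshold is the "stability" in "Edge of Stability."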
Unfortunately, modern deep learning does not care about our feelings. In practice, step sizes are often too large for the sharpness (the top Hessian eigenvalue) of the loss. In this week's Theory Lunch, we will review a line of work on what happens when step sizes are "too large."
We will look at how realistic, non-monotone GD trains modern neural networks at the "Edge of Stability," where the sharpness hovers right at the stability threshold 2/eta. We will see how large-step-size GD causes oscillations that self-stabilize and drive the iterates toward flatter minima.
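To make this concrete, here is a minimal numerical sketch (my own toy example, not taken from the papers we will cover): full-batch GD on the scalar factorization loss f(u, v) = 0.5*(uv - 1)^2, whose minima form a valley uv = 1 along which the sharpness varies. Starting from an unbalanced point whose sharpness exceeds 2/eta, the Edge-of-Stability prediction is that the loss decreases non-monotonically while the printed sharpness drifts down toward, and then hovers near, 2/eta.

    import numpy as np

    def loss(w):
        u, v = w
        return 0.5 * (u * v - 1.0) ** 2

    def grad(w):
        u, v = w
        r = u * v - 1.0
        return np.array([r * v, r * u])

    def sharpness(w):
        # Top eigenvalue of the Hessian [[v^2, 2uv-1], [2uv-1, u^2]].
        u, v = w
        H = np.array([[v * v, 2 * u * v - 1.0],
                      [2 * u * v - 1.0, u * u]])
        return float(np.linalg.eigvalsh(H)[-1])

    eta = 0.4                    # stability threshold 2/eta = 5.0
    w = np.array([3.0, 0.3])     # unbalanced start: sharpness ~9, well above 2/eta
    for t in range(201):
        if t % 25 == 0:
            print(f"t={t:3d}  loss={loss(w):.4f}  sharpness={sharpness(w):.2f}  (2/eta={2/eta:.1f})")
        w = w - eta * grad(w)

The "flatter minimum" part is not an accident: any minimum that GD with this step size can stably settle into must have sharpness at most roughly 2/eta.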
We will also revisit the classical convex setting and show that intentionally violating the descent lemma is mathematically optimal. We'll look at how GD's sample complexity can be improved simply by occasionally taking massive steps, and how this can be analyzed as a two-player game.
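As a taste of the convex story, here is another sketch of my own, using the classical quadratic special case (Chebyshev step sizes) rather than the general convex constructions from the talk: on a quadratic with eigenvalues in [mu, L], the Chebyshev schedule deliberately uses steps far larger than 2/L. Individual steps can increase the loss, i.e. the descent lemma is violated, yet after T steps the error is far smaller than what constant-step GD with eta = 1/L achieves.

    import numpy as np

    rng = np.random.default_rng(0)
    d, mu, L, T = 50, 1.0, 100.0, 50

    lam = np.concatenate(([mu, L], rng.uniform(mu, L, d - 2)))   # spectrum in [mu, L]
    x0 = rng.standard_normal(d)
    loss = lambda x: 0.5 * np.sum(lam * x ** 2)                  # f(x) = 0.5 x' diag(lam) x

    def run(x, steps):
        history = [loss(x)]
        for eta in steps:
            x = x - eta * lam * x                                # grad f(x) = diag(lam) x
            history.append(loss(x))
        return np.array(history)

    # Chebyshev schedule: inverse Chebyshev roots on [mu, L]; the largest steps
    # approach 1/mu, far beyond the "safe" 2/L.
    k = np.arange(1, T + 1)
    roots = (L + mu) / 2 + (L - mu) / 2 * np.cos((2 * k - 1) * np.pi / (2 * T))
    cheb = np.sort(1.0 / roots)                                  # small steps first

    const = run(x0, np.full(T, 1.0 / L))
    wild = run(x0, cheb)

    print("steps exceeding 2/L:", int(np.sum(cheb > 2.0 / L)))
    print("loss-increasing steps under the Chebyshev schedule:", int(np.sum(np.diff(wild) > 0)))
    print(f"final loss, constant eta = 1/L: {const[-1]:.3e}")
    print(f"final loss, Chebyshev steps:   {wild[-1]:.3e}")

Quadratics make this computation exact; the point of the convex-setting results we will discuss is that the same principle, non-monotone schedules with occasional massive steps, can be made to work far more generally.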