In 1963, Polyak proposed a simple condition that is sufficient to show a global linear convergence rate for gradient descent. This condition is a special case of the Łojasiewicz inequality proposed in the same year, and it does not require strong convexity (or even convexity). In this work, we show that this much-older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years. We also use the PL inequality to give new analyses of coordinate descent and stochastic gradient for many non-strongly-convex (and some non-convex) functions. We further propose a generalization that applies to proximal-gradient methods for non-smooth optimization, leading to simple proofs of linear convergence for support vector machines and L1-regularized least squares without additional assumptions.
Fitting most machine learning models involves solving some sort of optimization problem. Gradient descent, and variants of it like coordinate descent and stochastic gradient, are the workhorse tools used by the field to solve very large instances of these problems. In this work we consider the basic problem of minimizing a smooth function and the convergence rate of gradient descent methods. It is well-known that if f is strongly-convex, then gradient descent achieves a global linear convergence rate for this problem [28]. However, many of the fundamental models in machine learning, like least squares and logistic regression, yield objective functions that are convex but not strongly-convex. Further, if f is merely convex, gradient descent achieves only a sub-linear rate.
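To make the setting concrete, the following is a minimal sketch of gradient descent with the fixed step size 1/L on a small least-squares problem. The matrix, vector, and function names are illustrative choices for this sketch, not data or code from the paper.

```python
import numpy as np

def gradient_descent(grad, x0, step, iters):
    """Plain gradient descent: x_{k+1} = x_k - step * grad(x_k)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Illustrative problem: min_x 0.5 * ||A x - b||^2 (convex and smooth).
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
grad = lambda x: A.T @ (A @ x - b)

# The gradient is Lipschitz with constant L = largest eigenvalue of A^T A.
L = np.linalg.eigvalsh(A.T @ A).max()
x = gradient_descent(grad, np.zeros(2), 1.0 / L, 500)
```

Since this particular A has full column rank, the iterates approach the unique least-squares solution; the paper's point is precisely that such strong conditions are not needed for a linear rate.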
This situation has motivated a variety of alternatives to strong convexity (SC) in the literature, in order to show that we can obtain linear convergence rates for problems like least squares and logistic regression. One of the oldest of these conditions is the error bounds (EB) of Luo and Tseng [22], but four other recently-considered conditions are essential strong convexity (ESC) [20], weak strong convexity (WSC) [25], the restricted secant inequality (RSI) [45], and the quadratic growth (QG) condition [2]. Some of these conditions have different names in the special case of convex functions: a convex function satisfying RSI is said to satisfy restricted strong convexity (RSC) [45] while a convex function satisfying QG is said to satisfy optimal strong convexity (OSC) [19] or (confusingly) WSC [23]. The proofs of linear convergence under all of these relaxations are typically not straightforward, and it is rarely discussed how these conditions relate to each other.
In this work, we consider a much older condition that we refer to as the Polyak-Łojasiewicz (PL) inequality. This inequality was originally introduced by Polyak [31], who showed that it is a sufficient condition for gradient descent to achieve a linear convergence rate. We describe it as the PL inequality because it is also a special case of the inequality introduced in the same year by Łojasiewicz [21]. We review the PL inequality in the next section and how it leads to a trivial proof of the linear convergence rate of gradient descent. Next, in terms of showing a global linear convergence rate to the optimal solution, we show that the PL inequality is weaker than all of the more recent conditions discussed in the previous paragraph. This suggests that we can replace the long and complicated proofs under any of the conditions above with simpler proofs based on the PL inequality. Subsequently, we show how this result implies gradient descent achieves linear rates for standard problems in machine learning like least squares and logistic regression that are not necessarily SC, and even for some non-convex problems (Sect. 2.3). In Sect. 3 we use the PL inequality to give new convergence rates for randomized and greedy coordinate descent (implying a new convergence rate for certain variants of boosting), sign-based gradient descent methods, and stochastic gradient methods in either the classical or variance-reduced setting. Next we turn to the closely-related problem of minimizing the sum of a smooth function and a simple non-smooth function. We propose a generalization of the PL inequality that allows us to show linear convergence rates for proximal-gradient methods without SC. This leads to a simple analysis showing linear convergence of methods for training support vector machines. 
It also implies that we obtain a linear convergence rate for \(\ell _1\)-regularized least squares problems, showing that the extra conditions previously assumed to derive linear convergence rates in this setting are in fact not needed.
for all x and y. For twice-differentiable objectives this assumption means that the eigenvalues of \(\nabla ^2 f(x)\) are bounded above by some L, which is typically a reasonable assumption. We also assume that the optimization problem has a non-empty solution set \(\mathcal X^*\), and we use \(f^*\) to denote the corresponding optimal function value. We will say that a function satisfies the PL inequality if the following holds for some \(\mu > 0\),

\[ \frac{1}{2}\Vert \nabla f(x)\Vert ^2 \ge \mu \left( f(x) - f^*\right) , \quad \forall \, x. \]
This inequality simply requires that the gradient grows faster than a quadratic function as we move away from the optimal function value. Note that this inequality implies that every stationary point is a global minimum. But unlike SC, it does not imply that there is a unique solution. Linear convergence of gradient descent under these assumptions was first proved by Polyak [31]. Below we give a simple proof of this result when using a step-size of 1/L.
Re-arranging and subtracting \(f^*\) from both sides gives us \(f(x_{k+1}) - f^* \le \left( 1- \frac{\mu }{L}\right) (f(x_k) - f^*)\). Applying this inequality recursively gives the result. \(\square \)
A beautiful aspect of this proof is its simplicity; in fact it is simpler than the proof of the same fact under the usual SC assumption. It is certainly simpler than typical proofs which rely on the other conditions mentioned in Sect. 1. Further, it is worth noting that the proof does not assume convexity of f. Thus, this is one of the few general results we have for global linear convergence on non-convex problems.
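The linear rate proved above can be observed numerically. The following hypothetical check runs gradient descent with step size 1/L on a rank-deficient least-squares problem, which satisfies the PL inequality (with \(\mu \) equal to the smallest non-zero eigenvalue of \(A^TA\)) but is not strongly convex; the data and all names are illustrative.

```python
import numpy as np

# Rank-deficient least squares: f(x) = 0.5 * ||A x - b||^2 is PL but not
# strongly convex (the Hessian A^T A has a zero eigenvalue).
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0]])
b = A @ np.array([1.0, 2.0, 3.0])           # consistent system, so f* = 0
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

eigs = np.linalg.eigvalsh(A.T @ A)
L = eigs.max()                              # Lipschitz constant of the gradient
mu = eigs[eigs > 1e-10].min()               # PL constant: smallest non-zero eigenvalue

x = np.zeros(3)
gaps = [f(x)]
for _ in range(200):
    x = x - grad(x) / L                     # gradient step with step size 1/L
    gaps.append(f(x))                       # optimality gap, since f* = 0 here
```

Each recorded gap shrinks by at least the factor \(1 - \mu /L\) per iteration, matching the rate in the proof, even though the solution set here is a whole affine subspace rather than a single point.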
As mentioned in Sect. 1, several other assumptions have been explored over the last 25 years in order to show that gradient descent achieves a linear convergence rate. These typically assume that f is convex, and lead to more complicated proofs than the one above. However, it is rarely discussed how the conditions relate to each other. Indeed, all of the relationships that have been explored have only been in the context of convex functions [19, 25, 44]. In Appendix 2.1, we give the precise definitions of all conditions and also prove the result below giving relationships between the conditions.
This result shows that QG is the weakest assumption among those considered. However, QG allows non-global local minima, so it is not enough to guarantee that gradient descent finds a global minimizer. This means that, among those considered above, PL and the equivalent EB are the most general conditions that allow linear convergence to a global minimizer. Note that in the convex case QG is called OSC, but the result above shows that in the convex case it is also equivalent to EB and PL (as well as RSI, which is known as RSC in this case).
While the PL inequality does not imply convexity of f, it does imply the weaker condition of invexity. Invexity was first introduced by Hanson in 1981 [12], and has been used in the context of learning output kernels [8]. Craven and Glover [7] show that a smooth f is invex if and only if every stationary point of f is a global minimum. Since the PL inequality implies that all stationary points are global minimizers, functions satisfying the PL inequality must be invex. Indeed, Theorem 2 shows that all of the previous conditions except (QG) imply invexity. The function \(f(x) = x^2 + 3\sin ^2(x)\) is an example of an invex but non-convex function satisfying the PL inequality (with \(\mu = 1/32\)). Thus, Theorem 1 implies gradient descent obtains a global linear convergence rate on this function.
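As a sanity check, the claimed PL constant for this example can be verified numerically; the grid, starting point, and tolerances below are arbitrary choices for the sketch.

```python
import numpy as np

# f(x) = x^2 + 3*sin^2(x) is non-convex but satisfies the PL inequality
# 0.5 * f'(x)^2 >= mu * (f(x) - f*), with f* = 0 attained at x = 0.
f = lambda x: x ** 2 + 3.0 * np.sin(x) ** 2
df = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)

xs = np.linspace(-10.0, 10.0, 100001)
xs = xs[np.abs(xs) > 1e-3]                  # avoid 0/0 at the minimizer
ratios = 0.5 * df(xs) ** 2 / f(xs)
min_ratio = ratios.min()                    # empirical PL constant on the grid

# Gradient descent with step 1/L; here f''(x) = 2 + 6*cos(2x) <= 8, so L = 8.
x = 3.0
for _ in range(1000):
    x = x - df(x) / 8.0
```

On this grid the smallest ratio is well above \(1/32\) (the paper's \(\mu \) is a conservative bound), and gradient descent started in the non-convex region still reaches the global minimizer at 0.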
In this section, we use the PL inequality to analyze several variants of two of the most widely-used techniques for handling large-scale machine learning problems: coordinate descent and stochastic gradient methods. In particular, the PL inequality yields very simple analyses of these methods that apply to more general classes of functions than previously analyzed. We also note that the PL inequality has recently been used by Garber and Hazan [9] to analyze the Frank-Wolfe algorithm. Further, inspired by the resilient backpropagation (RPROP) algorithm of Riedmiller and Braun [32], in Appendix 3 we also give the first convergence rate analysis for sign-based gradient descent methods.
Nesterov [29] shows that randomized coordinate descent achieves a faster convergence rate than gradient descent for problems where we have d variables and it is d times cheaper to update one coordinate than it is to compute the entire gradient. The expected linear convergence rates in this previous work rely on SC, but in this section we show that randomized coordinate descent achieves an expected linear convergence rate if we only assume that the PL inequality holds.
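The flavor of randomized coordinate descent can be sketched as follows: sample a coordinate uniformly at random and take a gradient step in that coordinate with the coordinate-wise step size. The quadratic test problem and all names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem: min_x 0.5 * ||A x - b||^2.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
Li = np.sum(A ** 2, axis=0)                 # coordinate-wise Lipschitz constants

x = np.zeros(2)
for _ in range(2000):
    i = rng.integers(2)                     # sample a coordinate uniformly
    g_i = A[:, i] @ (A @ x - b)             # partial derivative along coordinate i
    x[i] -= g_i / Li[i]                     # coordinate step with step size 1/L_i
```

Each iteration touches a single column of A, which is the d-times-cheaper update that makes the comparison with full-gradient descent meaningful.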
However, Nutini et al. [30] show that a faster convergence rate can be obtained for the GS rule by measuring SC in the 1-norm. Since the PL inequality is defined on the dual (gradient) space, in order to derive an analogous result we could measure the PL inequality in the \(\infty \)-norm,
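The greedy Gauss-Southwell (GS) rule replaces random sampling with selection of the coordinate whose partial derivative is largest in magnitude; the example below is an illustrative sketch, not code from the paper.

```python
import numpy as np

# Illustrative problem: min_x 0.5 * ||A x - b||^2.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
Li = np.sum(A ** 2, axis=0)                 # coordinate-wise Lipschitz constants

x = np.zeros(2)
for _ in range(200):
    g = A.T @ (A @ x - b)                   # full gradient
    i = np.argmax(np.abs(g))                # GS rule: largest partial derivative
    x[i] -= g[i] / Li[i]                    # coordinate step with step size 1/L_i
```

Note that the GS rule requires the full gradient at every iteration, which is why it is mainly attractive when the gradient can be maintained cheaply, as in the boosting setting discussed below.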
Meir and Rätsch [24] show that we can view some variants of boosting algorithms as implementations of coordinate descent with the GS rule. They use the error bound property to argue that these methods achieve a linear convergence rate, but this property does not lead to an explicit rate. Our simple result above thus provides the first explicit convergence rate for these variants of boosting.