Gradient Compression with Error Feedback for Distributed Optimization
Abstract:
Communication overhead is a key bottleneck that hinders the perfect scalability of distributed optimization algorithms. Various recent works propose quantization or sparsification techniques to reduce the amount of data that must be communicated between workers.
We analyze Stochastic Gradient Descent (SGD) with compression (e.g., sparsification or quantization) of the exchanged messages and show that these schemes converge at the same rate as vanilla SGD when equipped with an error-compensation technique (i.e., keeping track of the accumulated compression errors in memory). We discuss both centralized and decentralized implementations, the latter based on a modification of the gossip protocol.
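A minimal single-worker sketch of the error-compensation idea, using top-k sparsification as the compressor. The compressor choice, function names, and hyperparameters here are illustrative assumptions, not the talk's actual implementation:

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsification: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef_sgd_step(w, grad, memory, lr, k):
    """One error-feedback SGD step on a single worker (illustrative sketch).

    The worker compresses (lr * grad + memory), applies/transmits only the
    compressed message, and stores the compression error back into memory
    so the dropped mass is re-injected on later steps.
    """
    corrected = lr * grad + memory   # add back previously dropped mass
    message = top_k(corrected, k)    # compressed update actually sent
    memory = corrected - message     # error kept locally for next step
    w = w - message                  # apply the compressed update
    return w, memory
```

Without the memory term, coordinates that are repeatedly zeroed out by the compressor would simply be lost; with it, every gradient component is eventually applied, which is what restores the vanilla-SGD rate.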
------ Dan Garber
Highly-efficient Algorithms for Online Learning with Semi-adversarial Sequences
Abstract:
Online Learning is an appealing and successful paradigm for sequential prediction under uncertainty. In particular, efficient online algorithms for regret minimization usually require only convexity and boundedness assumptions on the model (i.e., the feasible set and loss functions), and thus can be applied, with provable performance guarantees, to a great variety of prediction scenarios, going far beyond the more traditional models that assume stochastic i.i.d. data.
On the downside, for several problems of interest, all current online learning algorithms fail to scale to large instances. For instance, all current algorithms for Online Principal Component Analysis require quadratic memory and at least quadratic runtime per iteration. This is in stark contrast to the offline or stochastic i.i.d. variants of the problem, for which there exist algorithms that require only linear memory and linear runtime per iteration. Another example is when the loss functions are exp-concave (e.g., linear regression). In this case, the only current algorithm achieving (optimal) logarithmic regret requires quadratic memory and solving a linear system of equations on each iteration, whereas the offline problem can be solved via highly efficient gradient methods that require only linear memory.
In this talk I will show that imposing some more structure on the data can yield much more efficient regret minimization algorithms. In particular, I will consider cases in which the data follows an (unknown) stationary i.i.d. distribution combined with adversarial perturbations. I will show how such models make it possible to 1) obtain a linear-memory and linear-runtime algorithm for Online PCA with a nearly optimal regret bound, and 2) obtain a logarithmic regret bound for the vanilla Online Gradient Descent method for several problems of interest with exp-concave losses, such as Online LASSO and Online Portfolio Selection.
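For concreteness, the vanilla Online Gradient Descent method mentioned above can be sketched as follows; the function names and the choice of a Euclidean-ball feasible set are illustrative assumptions, not details from the talk:

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto the L2 ball of the given radius."""
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

def ogd(loss_grads, dim, radius, eta):
    """Vanilla projected Online Gradient Descent (illustrative sketch).

    loss_grads: iterable of callables; the t-th callable maps the current
    iterate to the gradient of the round-t loss. Uses O(dim) memory and
    O(dim) time per round. Returns the sequence of iterates played.
    """
    w = np.zeros(dim)
    played = []
    for g in loss_grads:
        played.append(w.copy())           # commit the prediction for round t
        w = project_ball(w - eta * g(w), radius)  # gradient step, then project
    return played
```

The per-round cost is a single gradient evaluation plus a projection, which is exactly the linear-memory, linear-runtime profile that makes logarithmic-regret guarantees for this method attractive compared to quadratic-memory alternatives.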