Daily TMLR digest for Nov 01, 2022


TMLR

Oct 31, 2022, 8:00:08 PM
to tmlr-anno...@googlegroups.com


New submissions
===============


Title: On the Regularity of Attention

Abstract: Attention is a powerful component of modern neural networks across a wide variety of domains. In this paper, we seek to quantify the regularity (i.e., the smoothness) of the attention operation. To accomplish this goal, we propose a new mathematical framework that uses measure theory and integral operators to model attention. Specifically, we formulate attention as an operator acting on empirical measures over representations of tokens. We show that this framework is consistent with the usual definition, captures the essential properties of attention, and can handle inputs of arbitrary length. Then we use it to prove that, on compact domains, the attention operation is Lipschitz continuous with respect to the 1-Wasserstein distance, and we provide an estimate of its Lipschitz constant. Additionally, by focusing on a specific type of attention, we extend these Lipschitz continuity results to non-compact domains. Finally, we discuss the effects regularity can have on NLP models, as well as applications to invertible and infinitely-deep networks.

URL: https://openreview.net/forum?id=8oGth5Ufqc
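
A minimal numerical sketch of the object under study, not the paper's construction: a standard single-head softmax attention layer applied to n token vectors, read as the empirical measure (1/n) sum_i delta_{x_i}. The helper names (attention, w1), the weight matrices, and the perturbation scale are illustrative assumptions. The ratio of output to input 1-Wasserstein distances on a perturbed pair gives an empirical lower bound on the Lipschitz constant that the paper bounds from above.

    # Hedged sketch: attention as a map between empirical measures of tokens.
    # Weights, sizes, and the perturbation scale are illustrative choices.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def attention(X, Wq, Wk, Wv):
        """Single-head softmax attention over the rows of X (n tokens, d dims)."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        S = Q @ K.T / np.sqrt(K.shape[1])
        P = np.exp(S - S.max(axis=1, keepdims=True))   # stable softmax
        P /= P.sum(axis=1, keepdims=True)
        return P @ V

    def w1(A, B):
        """Exact 1-Wasserstein distance between uniform empirical measures on
        equal-size point clouds: the optimal transport plan is a matching."""
        C = cdist(A, B)
        rows, cols = linear_sum_assignment(C)
        return C[rows, cols].mean()

    rng = np.random.default_rng(0)
    n, d = 32, 8
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    X = np.tanh(rng.standard_normal((n, d)))        # tokens in a compact domain
    Y = X + 0.01 * rng.standard_normal((n, d))      # small perturbation of X

    # Empirical lower bound on the Lipschitz constant of attention w.r.t. W1.
    ratio = w1(attention(X, Wq, Wk, Wv), attention(Y, Wq, Wk, Wv)) / w1(X, Y)
    print(f"empirical Lipschitz ratio on this pair: {ratio:.3f}")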

---

Title: Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence

Abstract: The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of this algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum.
To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD; our paper, in contrast, considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.

URL: https://openreview.net/forum?id=gZna3IiGfl
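
For reference, a hedged sketch of the algorithm being analyzed: stochastic heavy ball, i.e., SGD plus Polyak's momentum term, training a two-layer network. The synthetic teacher data, width, and hyperparameters below are illustrative assumptions, not the paper's setting.

    # Hedged sketch: stochastic heavy ball (SGD with Polyak's momentum) on a
    # two-layer tanh network fit to a synthetic teacher by half-MSE.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, width = 256, 4, 256
    X = rng.standard_normal((n, d))
    y = np.tanh(X @ rng.standard_normal(d))            # synthetic teacher targets

    W = rng.standard_normal((width, d)) / np.sqrt(d)   # hidden layer
    a = rng.standard_normal(width) / np.sqrt(width)    # output layer
    vW, va = np.zeros_like(W), np.zeros_like(a)        # momentum buffers
    lr, beta, batch = 0.01, 0.9, 32

    for step in range(5001):
        idx = rng.integers(0, n, size=batch)           # sample a minibatch
        H = np.tanh(X[idx] @ W.T)                      # hidden activations
        err = H @ a - y[idx]
        # gradients of the half mean-squared error on the minibatch
        ga = H.T @ err / batch
        gW = ((err[:, None] * a) * (1.0 - H**2)).T @ X[idx] / batch
        # heavy ball update: v <- beta * v + grad; theta <- theta - lr * v
        va = beta * va + ga
        vW = beta * vW + gW
        a -= lr * va
        W -= lr * vW
        if step % 1000 == 0:
            loss = 0.5 * np.mean((np.tanh(X @ W.T) @ a - y) ** 2)
            print(f"step {step:5d}  loss {loss:.4f}")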

---