# Weekly TMLR digest for May 01, 2022

5 views

### TMLR

Apr 30, 2022, 8:00:06 PMApr 30

New submissions
===============

Title: BISLERi: Ask Your Neural Network Not To Forget In Streaming Learning Scenarios

Abstract: This paper introduces a new method for \emph{class-incremental streaming learning}. In streaming learning, a learner encounters one single training example at a time and is constrained to: $(i)$ utilize each sample only once, i.e., single-pass learning, $(ii)$ adapt the parameters immediately, $(iii)$ effectively predict on any new example at any time step without involving any additional computation, i.e., it can be evaluated anytime, and $(iv)$ minimize the storage cost. Moreover, in streaming setting, the input data-stream cannot be assumed i.i.d, that is, there can be a temporal coherence in the input data-stream. Finally, the class-incremental learning implies that the learner does not require any task-id for the inference. A wide variety of the existing lifelong learning approaches are either designed to utilize more than one example once/multiple times or not optimized for the fast update or anytime inference. The premise of their designs, as well as other aspects (e.g., memory buffer/replay size, the requirement of fine-tuning), render some of the existing methods sub-optimal, if not ill-suited, for streaming learning setup. We propose a streaming Bayesian framework that enables fast parameter update of the network, given a single example, and allows it to be evaluated anytime. In addition, we also apply an implicit regularizer in the form of snap-shot self-distillation to effectively minimize the information loss further. The proposed method utilizes a tiny-episodic memory buffer and replays to conform with the streaming learning constraints. We also propose an efficient online memory replay and buffer replacement policies that significantly boost the model's performance. Extensive experiments and ablations on multiple datasets in different scenarios demonstrate the superior performance of our method over several strong baselines.

---

Title: Costs and Benefits of Fair Regression

Abstract: Real-world applications of machine learning tools in high-stakes domains are often regulated to be fair, in the sense that the predicted target should satisfy some quantitative notion of parity with respect to a protected attribute. However, the exact tradeoff between fairness and accuracy with a real-valued target is not entirely clear. In this paper, we characterize the inherent tradeoff between statistical parity and accuracy in the regression setting by providing a lower bound on the error of any fair regressor. Our lower bound is sharp, algorithm-independent, and admits a simple interpretation: when the moments of the target differ between groups, any fair algorithm has to make an error on at least one of the groups. We further extend this result to give a lower bound on the joint error of any (approximately) fair algorithm, using the Wasserstein distance to measure the quality of the approximation. With our novel lower bound, we also show that the price paid by a fair regressor that does not take the protected attribute as input is less than that of a fair regressor with explicit access to the protected attribute. On the upside, we establish the first connection between individual fairness, accuracy parity, and the Wasserstein distance by showing that if a regressor is individually fair, it also approximately verifies the accuracy parity, where the gap is again given by the Wasserstein distance between the two groups. Inspired by our theoretical results, we develop a practical algorithm for fair regression through the lens of representation learning, and conduct experiments on a real-world dataset to corroborate our findings.

---

Title: Attentive Walk-Aggregating Graph Neural Networks

Abstract: Graph neural networks (GNNs) have been shown to possess strong representation power, which can be exploited for downstream prediction tasks on graph-structured data, such as molecules and social networks. They typically learn representations by aggregating information from the K-hop neighborhood of individual vertices or from the enumerated walks in the graph. Prior studies have demonstrated the effectiveness of incorporating weighting schemes into GNNs; however, this has been primarily limited to K-hop neighborhood GNNs so far. In this paper, we aim to design an algorithm incorporating weighting schemes into walk-aggregating GNNs and analyze their effect. We propose a novel GNN model, called AWARE, that aggregates information about the walks in the graph using attention schemes. This leads to an end-to-end supervised learning method for graph-level prediction tasks in the standard setting where the input is the adjacency and vertex information of a graph, and the output is a predicted label for the graph. We then perform theoretical, empirical, and interpretability analyses of AWARE. Our theoretical analysis identifies successful conditions for provable guarantees, demonstrating how the graph information is encoded in the representation, and how the weighting schemes in AWARE affect the representation and learning performance. Our experiments demonstrate the strong performance of AWARE in graph-level prediction tasks in the standard setting in the domains of molecular property prediction and social networks. Lastly, our interpretation study illustrates that AWARE can successfully learn to capture the important substructures of the input graph.

---

Title: Online Coresets for Parameteric and Non-Parametric Bregman Clustering

Abstract: We present algorithms that create coresets in an online setting for clustering problems based on a wide subset of Bregman divergences. Notably, our coresets have a small additive error, similar in magnitude to the gap between expected and empirical loss (Bachem et. al. 2017), and take update time $O(d)$ for every incoming point where $d$ is dimension of the point. Our first algorithm gives online coresets of size $\tilde{O}(\mbox{poly}(k,d,\epsilon,\mu))$ for $k$-clusterings according to any $\mu$-similar Bregman divergence. We further extend this algorithm to show existence of a non-parametric coresets, where the coreset size is independent of $k$, the number of clusters, for the same subclass of Bregman divergences. Our non-parametric coresets also function as coresets for non-parametric versions of the Bregman clustering like DP-Means. While these coresets provide additive error guarantees, they are also significantly smaller (scaling with $O(\log n)$ as opposed to $O(d^d)$ for points in $\mathbb{R}^d$) than the (relative-error) coresets obtained in (Bachem et. al 2015) for DP-Means. We also present experiments to compare the performance of our algorithms with other sampling techniques.

---

Title: Non-Deterministic Behavior of Thompson Sampling with Linear Payoffs and How to Avoid It

Abstract: Thompson Sampling with Linear Payoffs (LinTS) is a popular learning policy in contextual bandit algorithms for solving sequential decision making problems. While LinTS has been studied extensively in the academic literature, surprisingly, its behavior in terms of reproducibility did not receive the same attention. In this paper, we show that a standard and seemingly correct LinTS implementation leads to non-deterministic behavior. This might go unnoticed easily, yet impact results adversely. This calls the reproducibility of papers that use LinTS into question. Further, it forbids using this particular implementation in any industrial application where reproducibility is critical not only for debugging purposes but also for the trustworthiness of machine learning models. We first study the root cause of the non-deterministic behavior. We then conduct experiments on recommendation system benchmarks to demonstrate the impact of non-deterministic behavior in terms of reproducibility and downstream metrics. Finally, as a remedy, we show how to avoid the issue to ensure reproducible results and share general advice for practitioners.

---

Title: Effective, Explainable, and Robust Policy Design in an Economy-SIR Model using Two-Level Reinforcement Learning

Abstract: Optimizing economic policies in the face of severe economic shocks, such as pandemics, poses a complex multi-agent challenge. Here we show that the AI Economist framework can learn sets of effective, robust, and explainable policies using two-level reinforcement learning (RL) and data-driven simulations. Our framework can optimize for a wide range of (mis)aligned social welfare objectives and accounts for strategic behavioral responses. In contrast, existing analytical and computational methods do not scale to this setting. We validate our framework on optimizing state policies and federal subsidies in an unemployment-vaccination-SIR simulation fitted to US COVID-19 data. We find that log-linear RL policies significantly improve public health and economic metrics compared to real-world outcomes. In particular, federal subsidies can align incentives more between state and federal agents. Their behavior is explainable, e.g., RL policies respond strongly to changes in recovery and vaccination rates. They are also robust to calibration errors, e.g., infection rates that are over or underestimated. As of yet, real-world policymaking has not seen adoption of machine learning methods at large, including RL and AI-driven simulations. Our results show the potential of AI to guide policy design and improve social welfare amidst the complexity of the real world.

---

Title: Decoding EEG With Spiking Neural Networks on Neuromorphic Hardware

Abstract: Decoding motor activity accurately and reliably from electroencephalography (EEG) signals is essential for several portable brain-computer interface (BCI) applications ranging from neural prosthetics to the control of industrial and mobile robots. Spiking neural networks (SNNs) is an emerging brain-inspired architecture that is well-suited for decoding EEG signals due to their built-in ability to integrate information at multiple timescales, leading to energy-efficient solutions for portable BCI. In practice, however, current SNN solutions suffer from i) an inefficient spike encoding of the EEG signals; ii) non-specialized network architectures that cannot capture EEG priors of spatiotemporal dependencies; and iii) the limited generalizability of the local learning rules commonly used to train the networks. These untapped challenges result in a performance gap between the current SNN approaches and the state-of-the-art deep neural network (DNN) methods. Moreover, the black-box nature of most current SNN solutions masks their correspondence with the underlying neurophysiology, further hindering their reliability for real-world applications. Here, we propose an SNN architecture with an input encoding and network design that exploits the priors of spatial and temporal dependencies in the EEG signal. To extract spatiotemporal features, the network comprised of spatial convolutional, temporal convolutional, and recurrent layers. The network weights and the neuron membrane parameters were trained jointly using gradient descent and our method was validated in classifying movement on two datasets: i) an in-house dataset comprising of complex components of movement, namely reaction time and directions, and ii) the publicly available eegmmidb dataset for motor imagery and movement. We deployed our SNN on Intel's Loihi neuromorphic processor, and show that our method consumed 95\% less energy per inference than the state-of-the-art DNN methods on NVIDIA Jeston TX2, while achieving similar levels of classification performance. Finally, we interpreted the SNN using a network perturbation study to identify the spectral bands and brain activity that correlated with the SNN outputs. The results were in agreement with the current neurophysiological knowledge implicating the activation patterns in the delta-band oscillations over the motor cortex for hand movement and imagery tasks. Overall, our approach demonstrates the effectiveness of SNNs in accurately and reliably decoding EEG while availing the computational advantages offered by neuromorphic computing, and paves the way for employing neuromorphic methods in portable BCI systems.

---

Title: Piecewise-Linear Activations or Analytic Activation Functions: Which Produce More Expressive Neural Networks?

Abstract: Many currently available universal approximation theorems affirm that deep feedforward networks defined using any suitable activation functions can approximate any integrable function locally in $L^1$-norm. Though different approximation rates are available for deep neural networks defined using other classes of activation functions, there is little explanation for the empirically confirmed advantage that ReLU networks exhibit over their classical (e.g. sigmoidal) counterparts. Our main result demonstrates that deep networks with piecewise linear activation (e.g. ReLU or PReLU) are fundamentally more expressive than deep feedforward networks with analytic (e.g. sigmoid, Swish, GeLU, or Softplus). More specifically, we construct a strict refinement of the topology on the space $L^1_{\operatorname{loc}}(\mathbb{R}^d,\mathbb{R}^D)$ of locally Lebesgue-integrable functions, in which the set of deep ReLU networks with (bilinear) pooling $\operatorname{NN}^{\operatorname{ReLU} + \operatorname{Pool}}$ is dense (i.e. universal) but the set of deep feedforward networks defined using any combination of analytic activation functions with (or without) pooling layers $\operatorname{NN}^{\omega+\operatorname{Pool}}$ is not dense (i.e. not universal). The separation phenomenon'' persists when comparing deep ReLU networks with pooling to classical polynomial functions; which we show are also not dense in this space. Our main result is further explained by quantitatively demonstrating that this separation phenomenon'' between the networks in $\operatorname{NN}^{\operatorname{ReLU}+\operatorname{Pool}}$ and those in $\operatorname{NN}^{\omega+\operatorname{Pool}}$ by showing that the networks in $\operatorname{NN}^{\operatorname{ReLU}}$ are capable of approximate any compactly supported Lipschitz function while simultaneously approximating its essential support; whereas, the networks in $\operatorname{NN}^{\omega+\operatorname{Pool}}$ cannot.

---

Title: Learning the Transformer Kernel

Abstract: In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, data driven framework for learning the kernel function in Transformers. Our framework approximates the Transformer kernel as a dot product between spectral feature maps and learns the kernel by learning the spectral distribution. This not only helps in learning a generic kernel end-to-end, but also reduces the time and space complexity of Transformers from quadratic to linear. We show that KERNELIZED TRANSFORMERs achieve performance comparable to existing efficient Transformer architectures, both in terms of accuracy and computational efficiency. Our study also demonstrates that the choice of the kernel has a substantial impact on performance, and kernel learning variants are competitive alternatives to fixed kernel Transformers, both in long as well as short sequence tasks.

---

Title: DR-DSGD: A Distributionally Robust Decentralized Learning Algorithm over Graphs

Abstract: In this paper, we propose to solve a distributionally robust learning problem in the decentralized setting, taking into account data heterogeneity. By adding a Kullback-Liebler regularization function to the robust min-max optimization problem, the learning problem can be reduced to a modified robust minimization problem and solved efficiently. Leveraging the new formulated optimization problem, we propose a robust version of Decentralized Stochastic Gradient Descent (DSGD), coined Distributionally Robust Decentralized Stochastic Gradient Descent (DR-DSGD). We theoretically prove that DR-DSGD achieves a fast convergence rate of $\mathcal{O}(1/\sqrt{KT})$, where $K$ is the number of devices and $T$ is the number of iterations, under some mild assumptions. Simulation results show that our proposed algorithm can improve the worst distribution test accuracy by up to $10\%$. Moreover, DR-DSGD is more communication-efficient than DSGD since it requires fewer communication rounds (up to $20$ times less) to achieve the same worst distribution test accuracy target. Furthermore, the conducted experiments reveal that DR-DSGD results in a fairer performance across devices in terms of test accuracy.

---

Title: A Survey of Multi-agent Reinforcement Learning from Game Theoretical Perspective

Abstract: Following the remarkable success of the AlphaGO series, 2019 was a booming year that witnessed significant advances in multi-agent reinforcement learning (MARL) techniques. MARL corresponds to the learning problem in a multi-agent system in which multiple agents learn simultaneously. It is an interdisciplinary domain with a long history that includes game theory, machine learning, stochastic control, psychology, and optimisation. Although MARL has achieved considerable empirical success in solving real-world games, there is a lack of a self-contained overview in the literature that elaborates the game theoretical foundations of modern MARL methods and summarises the recent advances. In fact, the majority of existing surveys are outdated and do not fully cover the recent developments since 2010. In this work, we provide a monograph on MARL that covers both the fundamentals and the latest developments in the research frontier. This survey is separated into two parts. From §1 to §4, we present the self-contained fundamental knowledge of MARL, including problem formulations, basic solutions, and existing challenges. Specifically, we present the MARL formulations through two representative frameworks, namely, stochastic games and extensive-form games, along with different variations of games that can be addressed. The goal of this part is to enable the readers, even those with minimal related background, to grasp the key ideas in MARL research. From §5 to §9, we present an overview of recent developments of MARL algorithms. Starting from new taxonomies for MARL methods, we conduct a survey of previous survey papers. In later sections, we highlight several modern topics in MARL research, including Q-function factorisation, multi-agent soft learning, networked multi-agent MDP, stochastic potential games, zero-sum continuous games, online MDP, turn-based stochastic games, policy space response oracle, approximation methods in general-sum games, and mean-field type learning in games with infinite agents. Within each topic, we select both the most fundamental and cutting-edge algorithms. The goal of our survey is to provide a self-contained assessment of the current state-of-the-art MARL techniques from a game theoretical perspective. We expect this work to serve as a stepping stone for both new researchers who are about to enter this fast-growing domain and existing domain experts who want to obtain a panoramic view and identify new directions based on recent advances.

---