# Weekly TMLR digest for Jun 26, 2022

### TMLR

Jun 25, 2022, 8:00:07 PM

Accepted papers
===============

Title: Decoding EEG With Spiking Neural Networks on Neuromorphic Hardware

Authors: Neelesh Kumar, Guangzhi Tang, Raymond Yoo, Konstantinos P. Michmizos

Abstract: Decoding motor activity accurately and reliably from electroencephalography (EEG) signals is essential for several portable brain-computer interface (BCI) applications ranging from neural prosthetics to the control of industrial and mobile robots. Spiking neural networks (SNNs) are an emerging brain-inspired architecture that is well suited for decoding EEG signals due to their built-in ability to integrate information at multiple timescales, leading to energy-efficient solutions for portable BCI. In practice, however, current SNN solutions suffer from i) an inefficient spike encoding of the EEG signals; ii) non-specialized network architectures that cannot capture EEG priors of spatiotemporal dependencies; and iii) the limited generalizability of the local learning rules commonly used to train the networks. These untapped challenges result in a performance gap between the current SNN approaches and the state-of-the-art deep neural network (DNN) methods. Moreover, the black-box nature of most current SNN solutions masks their correspondence with the underlying neurophysiology, further hindering their reliability for real-world applications. Here, we propose an SNN architecture with an input encoding and network design that exploits the priors of spatial and temporal dependencies in the EEG signal. To extract spatiotemporal features, the network comprised spatial convolutional, temporal convolutional, and recurrent layers. The network weights and the neuron membrane parameters were trained jointly using gradient descent and our method was validated in classifying movement on two datasets: i) an in-house dataset comprising complex components of movement, namely reaction time and directions, and ii) the publicly available eegmmidb dataset for motor imagery and movement.
We deployed our SNN on Intel's Loihi neuromorphic processor, and showed that our method consumed 95% less energy per inference than the state-of-the-art DNN methods on NVIDIA Jetson TX2, while achieving similar levels of classification performance. Finally, we interpreted the SNN using a network perturbation study to identify the spectral bands and brain activity that correlated with the SNN outputs. The results were in agreement with the current neurophysiological knowledge implicating the activation patterns in the low-frequency oscillations over the motor cortex for hand movement and imagery tasks. Overall, our approach demonstrates the effectiveness of SNNs in accurately and reliably decoding EEG while availing the computational advantages offered by neuromorphic computing, and paves the way for employing neuromorphic methods in portable BCI systems.

---

Title: Understanding Linearity of Cross-Lingual Word Embedding Mappings

Authors: Xutan Peng, Mark Stevenson, Chenghua Lin, Chen Li

Abstract: The technique of Cross-Lingual Word Embedding (CLWE) plays a fundamental role in tackling Natural Language Processing challenges for low-resource languages. Its dominant approaches assume that the relationship between embeddings can be represented by a linear mapping, but there has been no exploration of the conditions under which this assumption holds. This research gap has recently become critical, as there is evidence that relaxing mappings to be non-linear can lead to better performance in some cases. We, for the first time, present a theoretical analysis that identifies the preservation of analogies encoded in monolingual word embeddings as a *necessary and sufficient* condition for the ground-truth CLWE mapping between those embeddings to be linear. On a novel cross-lingual analogy dataset that covers five representative analogy categories for twelve distinct languages, we carry out experiments which provide direct empirical support for our theoretical claim. These results offer additional insight into the observations of other researchers and contribute inspiration for the development of more effective cross-lingual representation learning strategies.
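
Digest note: the easy direction of the claim in the abstract above can be written out in one line. This is only a sketch under two assumptions that the abstract does not state: that the mapping has the form $f(x) = Mx$ for some matrix $M$, and that an analogy "a is to b as c is to d" is encoded as equal vector offsets, $a - b = c - d$. Under those assumptions a linear map carries the analogy across languages:

$$
f(a) - f(b) = M(a - b) = M(c - d) = f(c) - f(d).
$$

This shows only that linearity implies analogy preservation. The converse direction is the part the paper claims to establish, and nothing here should be read as its proof.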

---

New submissions
===============

Title: Flipped Classroom: Effective Teaching for Time Series Forecasting

Abstract: Sequence-to-sequence models based on LSTM and GRU are a popular choice for forecasting time series data, reaching state-of-the-art performance. Training such models can be delicate though. The two most common training strategies within this context are teacher forcing (TF) and free running (FR). TF can be used to help the model to converge faster but may provoke an exposure bias issue due to a discrepancy between the training and inference phases. FR helps to avoid this but does not necessarily lead to better results, since it tends to make the training slow and unstable instead. Scheduled sampling was the first approach tackling these issues by picking the best from both worlds and combining it into a curriculum learning (CL) strategy. Although scheduled sampling seems to be a convincing alternative to FR and TF, we found that, even if parametrized carefully, scheduled sampling may lead to premature termination of the training when applied for time series forecasting. To mitigate the problems of the above approaches we formalize CL strategies along the training as well as the training iteration scale. We propose several new curricula, and systematically evaluate their performance in two experimental sets. For our experiments, we utilize six datasets generated from prominent chaotic systems. We found that the newly proposed increasing training scale curricula with a probabilistic iteration scale curriculum consistently outperform previous training strategies, yielding an NRMSE improvement of up to 81% over FR or TF training. For some datasets we additionally observe a reduced number of training iterations. We observed that all models trained with the new curricula yield higher prediction stability, allowing for longer prediction horizons.
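
Digest note: the difference between teacher forcing, free running, and a probabilistic mix of the two can be sketched in a few lines. This is a toy illustration, not the curricula proposed in the paper: `rollout`, `toy_step`, and `teacher_prob` are hypothetical names, and the "model" is a fixed function rather than a trained LSTM or GRU.

```python
import random

def rollout(step, series, teacher_prob, rng):
    """Predict series[1:] one step at a time.

    At each step the next model input is the ground truth with
    probability teacher_prob, otherwise the model's own prediction.
    teacher_prob=1.0 is teacher forcing, 0.0 is free running.
    """
    preds = []
    model_input = series[0]
    for t in range(1, len(series)):
        pred = step(model_input)
        preds.append(pred)
        use_truth = rng.random() < teacher_prob
        model_input = series[t] if use_truth else pred
    return preds

# Hypothetical imperfect one-step model for the series x_t = 2 * x_{t-1}.
def toy_step(x):
    return 1.9 * x

series = [1.0, 2.0, 4.0, 8.0, 16.0]
rng = random.Random(0)

tf = rollout(toy_step, series, teacher_prob=1.0, rng=rng)  # teacher forcing
fr = rollout(toy_step, series, teacher_prob=0.0, rng=rng)  # free running
```

In the free-running rollout the per-step error compounds because each prediction is fed back in, which is the behaviour behind the exposure bias the abstract mentions. A curriculum would change `teacher_prob` over training; how it changes is exactly what the paper studies and is not reproduced here.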

---

Title: Data Leakage in Federated Averaging

Abstract: Recent attacks have shown that user data can be reconstructed from FedSGD updates, thus breaking privacy. However, these attacks are of limited practical relevance as federated learning typically uses the FedAvg algorithm. It is generally accepted that reconstructing data from FedAvg updates is much harder than FedSGD as: (i) there are unobserved intermediate weight updates, (ii) the order of inputs matters, and (iii) the order of labels changes every epoch. In this work, we propose a new optimization-based attack which successfully attacks FedAvg by addressing the above challenges. First, we solve the optimization problem using automatic differentiation that forces a simulation of the client’s update for the reconstructed labels and inputs so as to match the received client update. Second, we address the unknown input order by treating images at different epochs as independent during optimization, while relating them with a permutation invariant prior. Third, we reconstruct the labels by estimating the parameters of existing FedSGD attacks at every FedAvg step. On the popular FEMNIST dataset, we demonstrate that on average we successfully reconstruct >45% of the client’s images from realistic FedAvg updates computed on 10 local epochs of 10 batches each with 5 images, compared to only <10% using the baseline. These findings indicate that many real-world federated learning implementations based on FedAvg are vulnerable.
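
Digest note: the gap between the two update types named in the abstract above can be made concrete with a toy scalar model. This is a sketch, not the attack from the paper; the quadratic loss, the function names, and the learning rate are all assumptions chosen for illustration.

```python
def grad(w, batch):
    """Gradient of the mean of 0.5 * (w - x)**2 over the batch."""
    return sum(w - x for x in batch) / len(batch)

def fedsgd_update(w, batch):
    """FedSGD: the client reports one gradient at the shared weight."""
    return grad(w, batch)

def fedavg_update(w, batches, epochs, lr):
    """FedAvg: the client runs several local SGD steps and reports
    only the total weight change; intermediate weights stay hidden."""
    w_local = w
    for _ in range(epochs):
        for batch in batches:
            w_local -= lr * grad(w_local, batch)
    return w_local - w

batches = [[1.0, 3.0], [5.0, 7.0]]
g = fedsgd_update(0.0, [x for b in batches for x in b])
delta = fedavg_update(0.0, batches, epochs=2, lr=0.5)
```

The FedSGD gradient is a direct function of the data at a known weight, whereas the FedAvg delta depends on every intermediate weight and on the order in which batches were visited, which is the source of difficulties (i) and (ii) listed in the abstract.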

---

Title: Structured Uncertainty in the Observation Space of Variational Autoencoders

Abstract: Variational autoencoders (VAEs) are a popular class of deep generative models with many variants and a wide range of applications. Improvements upon the standard VAE mostly focus on the modelling of the posterior distribution over the latent space and the properties of the neural network decoder. In contrast, improving the model for the observational distribution is rarely considered and typically defaults to a pixel-wise independent categorical or normal distribution. In image synthesis, sampling from such distributions produces spatially-incoherent results with uncorrelated pixel noise, resulting in only the sample mean being somewhat useful as an output prediction. In this paper, we aim to stay true to VAE theory by improving the samples from the observational distribution. We propose an alternative model for the observation space, encoding spatial dependencies via a low-rank parameterisation. We demonstrate that this new observational distribution has the ability to capture relevant covariance between pixels, resulting in spatially-coherent samples. In contrast to pixel-wise independent distributions, our samples seem to contain semantically meaningful variations from the mean allowing the prediction of multiple plausible outputs with a single forward pass.
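
Digest note: one way to see why a low-rank term yields correlated pixels is to sample from a Gaussian whose covariance is a low-rank matrix plus a diagonal. The abstract says only "low-rank parameterisation"; the low-rank-plus-diagonal form, the function names, and the toy numbers below are assumptions for illustration, not the paper's model.

```python
import random

def sample_low_rank_gaussian(mean, factors, diag_var, rng):
    """Draw x ~ N(mean, P P^T + diag(diag_var)).

    factors is P as a list of n rows with r entries each. One shared
    latent draw z couples all pixels; eps adds independent noise.
    """
    rank = len(factors[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(rank)]
    sample = []
    for mu_i, row, var_i in zip(mean, factors, diag_var):
        correlated = sum(p * z_k for p, z_k in zip(row, z))
        independent = (var_i ** 0.5) * rng.gauss(0.0, 1.0)
        sample.append(mu_i + correlated + independent)
    return sample

def covariance(factors, diag_var):
    """Dense covariance implied by the low-rank-plus-diagonal form."""
    n = len(factors)
    return [[sum(a * b for a, b in zip(factors[i], factors[j]))
             + (diag_var[i] if i == j else 0.0)
             for j in range(n)] for i in range(n)]

# Three "pixels": the first two share a factor, the third is independent.
P = [[1.0], [1.0], [0.0]]
D = [0.1, 0.1, 0.1]
cov = covariance(P, D)
x = sample_low_rank_gaussian([0.0, 0.0, 0.0], P, D, random.Random(0))
```

Sampling costs one draw of size r plus one of size n, so the dense n-by-n covariance never has to be formed; `covariance` is included only to inspect the off-diagonal entries.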

---

Abstract: We introduce and analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks. We prove that the regret of MT-OMD is of order $\sqrt{1 + \sigma^2(N-1)}\sqrt{T}$, where $\sigma^2$ is the task variance according to the geometry induced by the regularizer, $N$ is the number of tasks, and $T$ is the time horizon. Whenever tasks are similar, that is $\sigma^2 \le 1$, our method improves upon the $\sqrt{NT}$ bound obtained by running independent OMDs on each task. We further provide a matching lower bound, and show that our multitask extensions of Online Gradient Descent and Exponentiated Gradient, two major instances of OMD, enjoy closed-form updates, making them easy to use in practice. Finally, we present experiments which support our theoretical findings.
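
Digest note: the comparison between the two bounds quoted in the abstract above follows from a short calculation, using only the symbols the abstract defines. For $N > 1$,

$$
\frac{\sqrt{1 + \sigma^2 (N-1)}\,\sqrt{T}}{\sqrt{NT}}
= \sqrt{\frac{1 + \sigma^2 (N-1)}{N}} \le 1
\iff \sigma^2 (N-1) \le N - 1
\iff \sigma^2 \le 1 .
$$

At the two extremes, $\sigma^2 = 0$ gives $\sqrt{T}$ (the tasks behave like a single task) and $\sigma^2 = 1$ recovers $\sqrt{NT}$. As an illustrative instance, $N = 100$ and $\sigma^2 = 0.01$ give a leading factor of $\sqrt{1.99} \approx 1.41$ instead of $\sqrt{100} = 10$.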

---

Title: Generalization In Multi-Objective Machine Learning

Abstract: Modern machine learning tasks often require considering not just one but multiple objectives. For example, besides the prediction quality, this could be the efficiency, robustness or fairness of the learned models, or any of their combinations. Multi-objective learning offers a natural framework for handling such problems without having to commit to early trade-offs. Surprisingly, statistical learning theory so far offers almost no insight into the generalization properties of multi-objective learning. In this work, we make first steps to fill this gap: we establish foundational generalization bounds for the multi-objective setting as well as generalization and excess bounds for learning with scalarizations. We also provide the first theoretical characterization of the relation between the Pareto-optimal sets of the true objectives and the Pareto-optimal sets of their empirical approximations from training data. In particular, we show a surprising asymmetry: all Pareto-optimal solutions can be approximated by empirically Pareto-optimal ones, but not vice versa.
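
Digest note: for readers less familiar with the term, Pareto optimality over a finite set of candidate models can be checked directly. The sketch below assumes every objective is a loss to be minimized; the function names and the loss values are made up for illustration and do not come from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Return the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (error, unfairness) pairs for four candidate models.
losses = [(0.10, 0.90), (0.20, 0.40), (0.30, 0.50), (0.60, 0.10)]
front = pareto_front(losses)
```

Running the same function on empirical losses and on true losses generally gives two different fronts; the abstract's asymmetry result concerns how those two sets relate.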

---

Title: Exploring Efficient Few-shot Adaptation for Vision Transformers

Abstract: The task of Few-shot Learning (FSL) aims to do inference on novel categories containing only a few labeled examples, with the help of knowledge learned from base categories containing abundant labeled training samples. While there are numerous works on the FSL task, Vision Transformers (ViTs) have rarely been taken as the backbone for FSL, with a single exception. Essentially, although ViTs have been shown to enjoy comparable or even better performance on other vision tasks, it is still very nontrivial to efficiently finetune the ViTs in real-world FSL scenarios. To this end, we propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the FSL tasks. The key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA) for the task and backbone finetuning, respectively. Specifically, in APT, the prefix is projected to new key and value pairs that are attached to each self-attention layer to provide the model with task-specific information. Moreover, we design the DRA in the form of learnable offset vectors to handle the potential domain gaps between base and novel data. To ensure the APT would not deviate from the initial task-specific information much, we further propose a novel prototypical regularization, which minimizes the similarity between the projected distribution of prefix and initial prototypes, regularizing the update procedure. Our method achieves outstanding performance on the challenging Meta-Dataset. We conduct extensive experiments to show the efficacy of our model. Our model and codes will be released.

---

Title: On the Choice of Interpolation Scheme for Neural CDEs

Abstract: Neural controlled differential equations (Neural CDEs) are a continuous-time extension of recurrent neural networks (RNNs), achieving state-of-the-art (SOTA) performance at modelling functions of irregular time series. In order to interpret discrete data in continuous time, current implementations rely on non-causal interpolations of the data. This is fine when the whole time series is observed in advance, but means that Neural CDEs are not suitable for use in *online prediction tasks*, where predictions need to be made in real-time: a major use case for recurrent networks. Here, we show how this limitation may be rectified. First, we identify several theoretical conditions that control paths for Neural CDEs should satisfy, such as boundedness and uniqueness. Second, we use these to motivate the introduction of new schemes that address these conditions, offering in particular measurability (for online prediction), and smoothness (for speed). Third, we empirically benchmark our online Neural CDE model on three continuous monitoring tasks from the MIMIC-IV medical database: we demonstrate improved performance on all tasks against ODE benchmarks, and on two of the three tasks against SOTA non-ODE benchmarks.
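
Digest note: what "non-causal" means for an interpolation can be shown with two textbook schemes. Neither function below is one of the schemes proposed in the paper; forward fill is used only as the simplest causal stand-in (it is not even continuous), and all names are hypothetical.

```python
def linear_interp(times, values, t):
    """Non-causal: the value at t depends on the NEXT observation."""
    for i in range(len(times) - 1):
        if times[i] <= t <= times[i + 1]:
            w = (t - times[i]) / (times[i + 1] - times[i])
            return (1 - w) * values[i] + w * values[i + 1]
    raise ValueError("t outside observed range")

def forward_fill(times, values, t):
    """Causal: the value at t depends only on observations up to t."""
    latest = None
    for time, value in zip(times, values):
        if time <= t:
            latest = value
    if latest is None:
        raise ValueError("no observation at or before t")
    return latest

times = [0.0, 1.0, 2.0]
before = [0.0, 10.0, 0.0]
after = [0.0, 99.0, 0.0]  # same past, different future observation at t=1

# Query at t=0.5, i.e. before the observation at t=1 has arrived.
linear_changed = linear_interp(times, before, 0.5) != linear_interp(times, after, 0.5)
fill_changed = forward_fill(times, before, 0.5) != forward_fill(times, after, 0.5)
```

The linear path at t=0.5 changes when only the future observation changes, so it cannot be computed online; the forward-filled path does not.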

---

Title: ANCER: Anisotropic Certification via Sample-wise Volume Maximization

Abstract: Randomized smoothing has recently emerged as an effective tool that enables certification of deep neural network classifiers at scale. All prior art on randomized smoothing has focused on isotropic $\ell_p$ certification, which has the advantage of yielding certificates that can be easily compared among isotropic methods via $\ell_p$-norm radius. However, isotropic certification limits the region that can be certified around an input to worst-case adversaries, i.e., it cannot reason about other "close", potentially large, constant prediction safe regions. To alleviate this issue, (i) we theoretically extend the isotropic randomized smoothing $\ell_1$ and $\ell_2$ certificates to their generalized anisotropic counterparts following a simplified analysis. Moreover, (ii) we propose evaluation metrics allowing for the comparison of general certificates - a certificate is superior to another if it certifies a superset region - with the quantification of each certificate through the volume of the certified region. We introduce ANCER, a framework for obtaining anisotropic certificates for a given test set sample via volume maximization. We achieve it by generalizing memory-based certification of data-dependent classifiers. Our empirical results demonstrate that ANCER achieves state-of-the-art $\ell_1$ and $\ell_2$ certified accuracy on CIFAR-10 and ImageNet in the data-dependence setting, while certifying larger regions in terms of volume, highlighting the benefits of moving away from isotropic analysis.
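
Digest note: the volume comparison in the abstract above can be illustrated in the $\ell_2$ case with a standard fact about ellipsoids. The notation here ($V_n$, $r$, $r_i$) is introduced for illustration and is not taken from the paper. Let $V_n$ be the volume of the unit ball in $\mathbb{R}^n$. A ball of radius $r$ and an axis-aligned ellipsoid with semi-axes $r_1, \dots, r_n$ have volumes

$$
\mathrm{Vol}(B_r) = V_n\, r^n,
\qquad
\mathrm{Vol}(E) = V_n \prod_{i=1}^{n} r_i .
$$

If $r_i \ge r$ for every $i$, then $E \supseteq B_r$ and $\mathrm{Vol}(E) \ge \mathrm{Vol}(B_r)$, so the anisotropic region is superior in the superset sense the abstract describes. When neither region contains the other, volume still gives a single number to compare.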

---

Title: Teaching Models to Express Their Uncertainty in Words

Abstract: We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without use of model logits. When given a question, the model generates both an answer and a level of confidence (e.g. "90% confidence" or "high confidence"). These levels map to probabilities that are well calibrated. The model also remains moderately calibrated under distribution shift, and is sensitive to uncertainty in its own answers, rather than imitating human examples. To our knowledge, this is the first time a model has been shown to express calibrated uncertainty about its own answers in natural language. For testing calibration, we introduce the CalibratedMath suite of tasks. We compare the calibration of uncertainty expressed in words ("verbalized probability") to uncertainty extracted from model logits. Both kinds of uncertainty are capable of generalizing calibration under distribution shift. We also provide evidence that GPT-3's ability to generalize calibration depends on pre-trained latent representations that correlate with epistemic uncertainty over its answers.
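
Digest note: "well calibrated" has a simple operational reading: among answers given with stated confidence p, roughly a fraction p should be correct. The sketch below computes expected calibration error, a standard binned metric. It is an assumption that this matches what the paper reports; the abstract does not name its metric, and the function name and data are illustrative.

```python
def calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error: bin answers by stated confidence,
    then average |accuracy - mean confidence| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += len(members) / total * abs(accuracy - mean_conf)
    return error

# Ten answers, each stated at "90% confidence".
stated = [0.9] * 10
well_calibrated = calibration_error(stated, [True] * 9 + [False])
overconfident = calibration_error(stated, [True] * 5 + [False] * 5)
```

The same function works whether the confidences come from verbalized statements mapped to probabilities or from model logits, which is the comparison the abstract describes.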

---

Title: Variational Variance: Simple and Reliable Noise Variance Parameterization

Abstract: Models employing heteroscedastic Gaussian likelihoods parameterized by amortized mean and variance networks are both probabilistically interpretable and highly flexible, but unfortunately can be brittle to optimize. Maximizing log likelihood encourages local Dirac densities for sufficiently flexible mean and variance networks. Data lacking nearby neighbors can provide this flexibility. Gradients near these unbounded optima explode, prohibiting convergence of the mean and thus requiring high noise variance to explain the dependent variable. We propose posterior predictive checks to identify such failures, which we observe can surreptitiously occur alongside high model likelihoods. We find existing approaches that bolster optimization of mean and variance networks to improve likelihoods still exhibit poor predictive mean and variance calibrations. Our notably simpler solution, to treat heteroscedastic variance variationally in an Empirical Bayes regime, regularizes variance away from zero and stabilizes optimization, allowing us to preserve or outperform existing likelihoods while improving predictive mean and variance calibrations and thereby sample quality. We empirically demonstrate these findings on a variety of regression and variational autoencoding tasks.
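
Digest note: the failure mode described in the abstract above can be seen from the Gaussian negative log likelihood alone. The sketch evaluates it at a single data point for shrinking variance; the numbers are illustrative and the function is the textbook formula, not code from the paper.

```python
import math

def gaussian_nll(y, mean, variance):
    """Negative log likelihood of y under N(mean, variance)."""
    return 0.5 * (math.log(2 * math.pi * variance) + (y - mean) ** 2 / variance)

variances = (1.0, 1e-2, 1e-4, 1e-8)

# With a perfect mean fit, shrinking the variance lowers the NLL without bound.
perfect_fit = [gaussian_nll(1.0, 1.0, v) for v in variances]

# With any residual, the same shrinkage eventually makes the NLL blow up.
small_residual = [gaussian_nll(1.0, 0.9, v) for v in variances]
```

The first list decreases without bound as the variance goes to zero, which is the "local Dirac density" the abstract refers to; the second shows why the optimizer is pushed toward large variance whenever the mean has not converged.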

---

Title: Mitigating Catastrophic Forgetting in Spiking Neural Networks through Threshold Modulation

Abstract: Artificial Neural Networks (ANNs) trained with Backpropagation and Stochastic Gradient Descent (SGD) suffer from the problem of Catastrophic Forgetting; when learning tasks sequentially, the ANN tends to abruptly forget previous knowledge upon being trained on a new task. On the other hand, biological neural networks do not suffer from this problem. Spiking Neural Networks (SNNs) are a class of Neural Networks that are closer to biological networks than ANNs and their intrinsic properties inspired from biology could alleviate the problem of Catastrophic Forgetting. In this paper, we investigate if the firing threshold mechanism of SNNs can be used to gate the activity of the network in order to reduce catastrophic forgetting. To this end, we evolve a Neuromodulatory Network that adapts the thresholds of an SNN depending on the spiking activity of the previous layer. Our experiments on different datasets show that the neuromodulated SNN can mitigate forgetting significantly with respect to a fixed threshold SNN. We also show that the evolved Neuromodulatory Network can generalize to multiple new scenarios and analyze its behavior.
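
Digest note: how a firing threshold can gate activity is easy to see in a single leaky integrate-and-fire neuron. The neuron model, the leak value, and the names below are assumptions for illustration; the abstract does not specify the neuron model or how thresholds are modulated.

```python
def spike_count(inputs, threshold, leak=0.9):
    """Leaky integrate-and-fire neuron: accumulate input, then spike and
    reset to zero whenever the membrane potential reaches threshold."""
    potential = 0.0
    spikes = 0
    for current in inputs:
        potential = leak * potential + current
        if potential >= threshold:
            spikes += 1
            potential = 0.0
    return spikes

inputs = [0.5] * 20
low = spike_count(inputs, threshold=1.0)    # neuron participates
high = spike_count(inputs, threshold=10.0)  # neuron is effectively gated off
```

With the same input, raising the threshold silences the neuron entirely, so a network that sets thresholds per task or per input can decide which neurons take part in a computation.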

---

Title: Identifiable Deep Generative Models via Sparse Decoding

Abstract: We develop the sparse VAE for unsupervised representation learning on high-dimensional data. The sparse VAE learns a set of latent factors (representations) which summarize the associations in the observed data features. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. As examples, in ratings data each movie is only described by a few genres; in text data each word is only applicable to a few topics; in genomics, each gene is active in only a few biological processes. We prove such sparse deep generative models are identifiable: with infinite data, the true model parameters can be learned. (In contrast, most deep generative models are not identifiable.) We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller heldout reconstruction error than related methods.

---

Title: Evolving Decomposed Plasticity Rules for Information-Bottlenecked Meta-Learning

Abstract: Artificial neural networks (ANNs) are typically confined to accomplishing pre-defined tasks by learning a set of static parameters. In contrast, biological neural networks (BNNs) can adapt to various new tasks by continually updating their connection weights based on their observations, which is aligned with the paradigm of learning effective learning rules in addition to static parameters, e.g., meta-learning. Among broad classes of biologically inspired learning rules, Hebbian plasticity updates the neural network weights using local signals without the guide of an explicit target function, closely simulating the learning of BNNs. However, typical plastic ANNs using large-scale meta-parameters violate the nature of the genomics bottleneck and deteriorate the generalization capacity.
This work proposes a new learning paradigm decomposing those connection-dependent plasticity rules into neuron-dependent rules, thus accommodating $O(n^2)$ learnable parameters with only $O(n)$ meta-parameters. The decomposed plasticity, along with different types of neural modulation, is applied to a recursive neural network starting from scratch to adapt to different tasks. Our algorithms are tested in challenging random 2D maze environments, where the agents have to use their past experiences to improve their performance without any explicit objective function or human intervention, namely *learning by interacting*. The results show that rules satisfying the genomics bottleneck adapt to out-of-distribution tasks better than previous model-based and plasticity-based meta-learning with verbose meta-parameters.

---

Title: Exploring Conditional Shifts for Domain Generalization

Abstract: Learning a domain-invariant representation has become one of the most popular approaches for domain adaptation/generalization.
In this paper, we show that the invariant representation may not be sufficient to guarantee a good generalization, where the **labeling function shift** should be taken into consideration. Inspired by this, we first derive a new generalization upper bound on the empirical risk that explicitly considers the labeling function shift. We then propose **Domain-specific Risk Minimization (DRM)**, which can model the distribution shifts of different domains separately and select the most appropriate one for the target domain. Extensive experiments on four popular domain generalization datasets, CMNIST, PACS, VLCS, and DomainNet, demonstrate the effectiveness of the proposed DRM for domain generalization with the following advantages: 1) it significantly outperforms competitive baselines; 2) it enables either comparable or superior accuracies on all training domains compared to vanilla empirical risk minimization (ERM); 3) it remains very simple and efficient during training; and 4) it is complementary to invariant learning approaches.

---

Title: Differentially Private Stochastic Expectation Propagation

Abstract: We are interested in privatizing an approximate posterior inference algorithm, called Expectation Propagation (EP). EP approximates the posterior distribution by iteratively refining approximations to the local likelihood terms. By doing so, EP typically provides better posterior uncertainties than variational inference (VI) which globally approximates the likelihood term. However, EP needs a large memory to maintain all local approximations associated with each datapoint in the training data. To overcome this challenge, stochastic expectation propagation (SEP) considers a single unique local factor that captures the average effect of each likelihood term to the posterior and refines it in a way analogous to EP. In terms of privatization, SEP is more tractable than EP. It is because at each factor’s refining step we fix the remaining factors, where these factors are independent of other datapoints, which is different from EP. This independence makes the sensitivity analysis straightforward. We provide a theoretical analysis of the privacy-accuracy trade-off in the posterior distributions under our method, which we call differentially private stochastic expectation propagation (DP-SEP). Furthermore, we test the DP-SEP algorithm on both synthetic and real-world datasets and evaluate the quality of posterior estimates at different levels of guaranteed privacy.

---

Title: Federated Causal Discovery with Additive Noise Models

Abstract: Causal discovery aims to learn a causal graph from observational data. To date, most causal discovery methods require data to be stored in a central server. However, data owners gradually refuse to share their personalized data to avoid privacy leakage, making this task more troublesome by cutting off the first step. A puzzle arises: how do we infer causal relations from decentralized data? In this paper, focusing on the additive noise models (ANMs) assumption of data, we take the first step in developing a gradient-based learning framework named DAG-Shared Federated Causal Discovery (DS-FCD), which can learn the causal graph without directly touching the local data and naturally handle the data heterogeneity. DS-FCD benefits from a two-level structure of each local model. The first level structure learns the causal graph and communicates with the server to get the model information from other clients during the learning procedure, while the second level structure approximates the causal mechanisms and personally updates from its own data to accommodate the data heterogeneity. Moreover, DS-FCD formulates the overall learning task as a continuous optimization problem by taking advantage of an equality acyclicity constraint, which can be solved by gradient descent methods. Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.
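
Digest note: an "equality acyclicity constraint" turns the combinatorial requirement that a graph be a DAG into a smooth function that is zero exactly on DAGs. The abstract does not say which constraint DS-FCD uses. The sketch below assumes the trace-exponential style popularized by NOTEARS, written as a power series truncated at d terms (the truncation is a simplification made here so the example needs no matrix-exponential routine); the function name is hypothetical.

```python
def acyclicity_penalty(weights):
    """Sum over k = 1..d of trace((W * W)^k) / k!, with W * W taken
    elementwise. It is zero exactly when the weighted graph has no
    directed cycle, and positive otherwise."""
    d = len(weights)
    squared = [[w * w for w in row] for row in weights]
    # Start from the identity and multiply by `squared` once per term.
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    penalty, factorial = 0.0, 1.0
    for k in range(1, d + 1):
        power = [[sum(power[i][m] * squared[m][j] for m in range(d))
                  for j in range(d)] for i in range(d)]
        factorial *= k
        penalty += sum(power[i][i] for i in range(d)) / factorial
    return penalty

dag = [[0.0, 1.0], [0.0, 0.0]]     # edge 0 -> 1 only
cyclic = [[0.0, 1.0], [1.0, 0.0]]  # edges 0 -> 1 and 1 -> 0
h_dag = acyclicity_penalty(dag)
h_cyclic = acyclicity_penalty(cyclic)
```

Because the penalty is a polynomial in the edge weights, it can be driven to zero with gradient-based optimizers, which is the property the abstract relies on when it describes the task as continuous optimization.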

---

Title: Estimating Potential Outcome Distributions with Collaborating Causal Networks

Abstract: Traditional causal inference approaches leverage observational study data to determine an individual’s outcome change due to a potential treatment, known as the Conditional Average Treatment Effect (CATE). However, CATE is the comparison on the first moment and might be insufficient in reflecting the full picture of treatment effects. As an alternative, estimating the full potential outcome distributions could provide greater insights. However, existing methods in estimating the potential outcome distributions often impose assumptions about the treatment effect outcome distributions. Here, we propose Collaborating Causal Networks (CCN), a novel methodology which goes beyond the estimation of CATE alone by learning the full potential outcome distribution. Our proposed method facilitates estimation of the utility of each possible treatment and permits individual-specific variation in utility functions (e.g., risk tolerance variability). CCN not only extends outcome estimation beyond traditional risk difference, but also enables a more comprehensive decision making process through definition of flexible comparisons. Under standard causal inference assumptions, we show that CCN learns distributions that asymptotically capture the correct potential outcome distributions. Furthermore, we propose an adjustment approach that is empirically effective in alleviating sample imbalance between treatment groups in observational studies. Finally, we evaluate CCN's performance in multiple extensive experiments. We demonstrate CCN learns improved distribution estimates compared to existing Bayesian and deep generative methods as well as improved decisions on a variety of utility functions.

---