Weekly TMLR digest for Apr 30, 2023


TMLR

Apr 29, 2023, 8:00:18 PM
to tmlr-annou...@googlegroups.com

Accepted papers
===============


Title: Group Fairness in Reinforcement Learning

Authors: Harsh Satija, Alessandro Lazaric, Matteo Pirotta, Joelle Pineau

Abstract: We pose and study the problem of satisfying fairness in the online Reinforcement Learning (RL) setting. We focus on the group notions of fairness, according to which agents belonging to different groups should have similar performance based on some given measure. We consider the setting of maximizing return in an unknown environment (unknown transition and reward function) and show that it is possible to have RL algorithms that learn the best fair policies without violating the fairness requirements at any point in time during the learning process. In the tabular finite-horizon episodic setting, we provide an algorithm that combines the principle of optimism and pessimism under uncertainty to achieve zero fairness violation with arbitrarily high probability while also maintaining sub-linear regret guarantees. For the high-dimensional Deep-RL setting, we present algorithms based on the performance-difference style approximate policy improvement update step and we report encouraging empirical results on various traditional RL-inspired benchmarks showing that our algorithms display the desired behavior of learning the optimal policy while performing a fair learning process.

URL: https://openreview.net/forum?id=JkIH4MeOc3

---

Title: On the Statistical Complexity of Estimation and Testing under Privacy Constraints

Authors: Clément Lalanne, Aurélien Garivier, Rémi Gribonval

Abstract: The challenge of producing accurate statistics while respecting the privacy of the individuals in a sample is an important area of research. We study minimax lower bounds for classes of differentially private estimators. In particular, we show how to characterize the power of a statistical test under differential privacy in a plug-and-play fashion by solving an appropriate transport problem. With specific coupling constructions, this observation allows us to derive Le Cam-type and Fano-type inequalities not only for regular definitions of differential privacy but also for those based on Renyi divergence. We then proceed to illustrate our results on three simple, fully worked out examples. In particular, we show that the problem class has a huge importance on the provable degradation of utility due to privacy. In certain scenarios, we show that maintaining privacy results in a noticeable reduction in performance only when the level of privacy protection is very high. Conversely, for other problems, even a modest level of privacy protection can lead to a significant decrease in performance. Finally, we demonstrate that the DP-SGLD algorithm, a private convex solver, can be employed for maximum likelihood estimation with a high degree of confidence, as it provides near-optimal results with respect to both the size of the sample and the level of privacy protection. This algorithm is applicable to a broad range of parametric estimation procedures, including exponential families.

URL: https://openreview.net/forum?id=OarsigVib0

---

Title: Positive Difference Distribution for Image Outlier Detection using Normalizing Flows and Contrastive Data

Authors: Robert Schmier, Ullrich Koethe, Christoph-Nikolas Straehle

Abstract: Detecting test data deviating from training data is a central problem for safe and robust machine learning. Likelihoods learned by a generative model, e.g., a normalizing flow via standard log-likelihood training, perform poorly as an outlier score. We propose to use an unlabelled auxiliary dataset and a probabilistic outlier score for outlier detection. We use a self-supervised feature extractor trained on the auxiliary dataset and train a normalizing flow on the extracted features by maximizing the likelihood on in-distribution data and minimizing the likelihood on the contrastive dataset. We show that this is equivalent to learning the normalized positive difference between the in-distribution and the contrastive feature density. We conduct experiments on benchmark datasets and compare to the likelihood, the likelihood ratio and state-of-the-art anomaly detection methods.
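The resulting outlier score can be illustrated numerically; below is a minimal sketch in which toy Gaussian densities stand in for the learned in-distribution and contrastive feature densities (all function names are illustrative, not the paper's code):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    # Isotropic Gaussian density; a toy stand-in for the learned densities.
    d = x.shape[-1]
    z = ((x - mean) ** 2).sum(-1) / std ** 2
    return np.exp(-0.5 * z) / (2.0 * np.pi * std ** 2) ** (d / 2)

def positive_difference_score(x, p_in, p_contrast):
    # The contrastively trained flow learns (a normalised version of) the
    # positive difference between in-distribution and contrastive densities;
    # points score high only where in-distribution data dominates.
    return np.maximum(p_in(x) - p_contrast(x), 0.0)

p_in = lambda x: gaussian_pdf(x, mean=0.0, std=1.0)        # in-distribution
p_contrast = lambda x: gaussian_pdf(x, mean=0.0, std=3.0)  # broad auxiliary data

inlier = np.zeros((1, 2))
outlier = np.full((1, 2), 4.0)
score_in = positive_difference_score(inlier, p_in, p_contrast)
score_out = positive_difference_score(outlier, p_in, p_contrast)
```

Points far from the in-distribution density receive a score of exactly zero, which is the behaviour a likelihood trained only on in-distribution data often fails to produce.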

URL: https://openreview.net/forum?id=B4J40x7NjA

---

Title: Uncovering the Representation of Spiking Neural Networks Trained with Surrogate Gradient

Authors: Yuhang Li, Youngeun Kim, Hyoungseob Park, Priyadarshini Panda

Abstract: Spiking Neural Networks (SNNs) are recognized as a candidate for the next-generation neural networks due to their bio-plausibility and energy efficiency. Recently, researchers have demonstrated that SNNs are able to achieve nearly state-of-the-art performance in image recognition tasks using surrogate gradient training. However, some essential questions pertaining to SNNs remain little studied: Do SNNs trained with surrogate gradient learn different representations from traditional Artificial Neural Networks (ANNs)? Does the time dimension in SNNs provide unique representation power? In this paper, we aim to answer these questions by conducting a representation similarity analysis between SNNs and ANNs using Centered Kernel Alignment (CKA). We start by analyzing the spatial dimension of the networks, including both the width and the depth. Furthermore, our analysis of residual connections shows that SNNs learn a periodic pattern, which rectifies the representations in SNNs to be ANN-like. We additionally investigate the effect of the time dimension on SNN representation, finding that deeper layers encourage more dynamics along the time dimension. We also investigate the impact of input data such as event-stream data and adversarial attacks. Our work uncovers a host of new findings on representations in SNNs. We hope this work will inspire future research to fully comprehend the representation power of SNNs. Code is released at https://github.com/Intelligent-Computing-Lab-Yale/SNNCKA.
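CKA itself is compact enough to sketch; here is a minimal linear-CKA implementation for two representation matrices (the paper may additionally use kernel variants):

```python
import numpy as np

def linear_cka(X, Y):
    # Linear Centered Kernel Alignment between two representation matrices
    # of shape (n_examples, n_features). Invariant to orthogonal transforms
    # and isotropic scaling of either representation.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 16))
B = 2.0 * A                      # same representation up to scale
C = rng.normal(size=(100, 16))   # unrelated representation
```

Here `linear_cka(A, B)` equals 1 because CKA is scale-invariant, while two independent random representations score much lower.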

URL: https://openreview.net/forum?id=s9efQF3QW1

---

Title: PAC-Bayes Generalisation Bounds for Heavy-Tailed Losses through Supermartingales

Authors: Maxime Haddouche, Benjamin Guedj

Abstract: While PAC-Bayes is now an established learning framework for light-tailed losses (\emph{e.g.}, subgaussian or subexponential), its extension to the case of heavy-tailed losses remains largely uncharted and has attracted a growing interest in recent years. We contribute PAC-Bayes generalisation bounds for heavy-tailed losses under the sole assumption of bounded variance of the loss function. Under that assumption, we extend previous results from \citet{kuzborskij2019efron}. Our key technical contribution is exploiting an extension of Markov's inequality for supermartingales. Our proof technique unifies and extends different PAC-Bayesian frameworks by providing bounds for unbounded martingales as well as bounds for batch and online learning with heavy-tailed losses.

URL: https://openreview.net/forum?id=qxrwt6F3sf

---

Title: Generalization as Dynamical Robustness--The Role of Riemannian Contraction in Supervised Learning

Authors: Leo Kozachkov, Patrick Wensing, Jean-Jacques Slotine

Abstract: A key property of successful learning algorithms is generalization. In classical supervised learning, generalization can be achieved by ensuring that the empirical error converges to the expected error as the number of training samples goes to infinity. Within this classical setting, we analyze the generalization properties of iterative optimizers such as stochastic gradient descent and natural gradient flow through the lens of dynamical systems and control theory. Specifically, we use contraction analysis to show that generalization and dynamical robustness are intimately related through the notion of algorithmic stability.

In particular, we prove that Riemannian contraction in a supervised learning setting implies generalization. We show that if a learning algorithm is contracting in some Riemannian metric with rate $\lambda > 0$, it is uniformly algorithmically stable with rate $\mathcal{O}(1/\lambda n)$, where $n$ is the number of examples in the training set. The results hold for stochastic and deterministic optimization, in both continuous and discrete-time, for convex and non-convex loss surfaces.

The associated generalization bounds reduce to well-known results in the particular case of gradient descent over convex or strongly convex loss surfaces. They can be shown to be optimal in certain linear settings, such as kernel ridge regression under gradient flow. Finally, we demonstrate that the well-known Polyak-Lojasiewicz condition is intimately related to the contraction of a model's outputs as they evolve under gradient descent. This correspondence allows us to derive uniform algorithmic stability bounds for nonlinear function classes such as wide neural networks.

URL: https://openreview.net/forum?id=Sb6p5mcefw

---

Title: Differentially Private Image Classification from Features

Authors: Harsh Mehta, Walid Krichene, Abhradeep Guha Thakurta, Alexey Kurakin, Ashok Cutkosky

Abstract: In deep learning, leveraging transfer learning has recently been shown to be an effective strategy for training large high performance models with Differential Privacy (DP). Moreover, somewhat surprisingly, recent works have found that privately training just the last layer of a pre-trained model provides the best utility with DP. While past studies largely rely on using first-order differentially private training algorithms like DP-SGD for training large models, in the specific case of privately learning from features, we observe that computational burden is often low enough to allow for more sophisticated optimization schemes, including second-order methods. To that end, we systematically explore the effect of design parameters such as loss function and optimization algorithm. We find that, while commonly used logistic regression performs better than linear regression in the non-private setting, the situation is reversed in the private setting. We find that least-squares linear regression is much more effective than logistic regression from both the privacy and computational standpoints, especially at stricter epsilon values ($\epsilon < 1$). On the optimization side, we also explore using Newton's method, and find that second-order information is quite helpful even with privacy, although the benefit significantly diminishes with stricter privacy guarantees. While both methods use second-order information, least squares is more effective at lower epsilon values while Newton's method is more effective at larger epsilon values. To combine the benefits of both methods, we propose a novel optimization algorithm called DP-FC, which leverages feature covariance instead of the Hessian of the logistic regression loss and performs well across all $\epsilon$ values we tried. With this, we obtain new SOTA results on ImageNet-1k, CIFAR-100 and CIFAR-10 across all values of $\epsilon$ typically considered. Most remarkably, on ImageNet-1K, we obtain top-1 accuracy of 88\% under a DP guarantee of $(8, 8 \times 10^{-7})$ and 84.3\% under $(0.1, 8 \times 10^{-7})$.
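For intuition, here is a toy sketch of the sufficient-statistics idea behind feature-covariance-based private least squares. This is not the paper's DP-FC algorithm: the per-row clipping and the calibration of the noise to a target $(\epsilon, \delta)$ are omitted, and all names are illustrative.

```python
import numpy as np

def noisy_least_squares(X, y, noise_std, ridge=0.1, seed=0):
    # Perturb the sufficient statistics: the feature covariance X^T X and
    # the correlation X^T y, then solve the regularised normal equations.
    # A real DP mechanism would clip rows of X and calibrate noise_std to
    # the desired (epsilon, delta) privacy guarantee.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    cov = X.T @ X + rng.normal(scale=noise_std, size=(d, d))
    cov = (cov + cov.T) / 2.0  # keep the noisy covariance symmetric
    xty = X.T @ y + rng.normal(scale=noise_std, size=d)
    return np.linalg.solve(cov + ridge * np.eye(d), xty)

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(500, 2))
y = X @ w_true
w_hat = noisy_least_squares(X, y, noise_std=1e-3)
```

With small noise the solution recovers the underlying weights; the privacy/utility trade-off comes entirely from how large the injected noise must be.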

URL: https://openreview.net/forum?id=Cj6pLclmwT

---


New submissions
===============


Title: Offline Equilibrium Finding

Abstract: Offline reinforcement learning (offline RL) is an emerging field that has recently attracted significant interest across a wide range of application domains, owing to its ability to learn policies from previously collected datasets. The success of offline RL has paved the way for tackling previously intractable real-world problems, but so far, only in single-agent scenarios. Given its potential, our goal is to generalize this paradigm to the multiplayer-game setting. To this end, we introduce a novel problem, called \textit{offline equilibrium finding} (OEF), and construct various types of datasets spanning a wide range of games using several established methods. To solve the OEF problem, we design a model-based framework capable of directly adapting any online equilibrium finding algorithm to the OEF setting while making minimal changes. We adapt the three most prominent contemporary online equilibrium finding algorithms to the context of OEF, resulting in three model-based variants: OEF-PSRO and OEF-CFR, which generalize the widely-used algorithms PSRO and Deep CFR for computing Nash equilibria, and OEF-JPSRO, which generalizes the JPSRO for calculating (coarse) correlated equilibria. Additionally, we combine the behavior cloning policy with the model-based policy to enhance performance and provide a theoretical guarantee regarding the quality of the solution obtained. Extensive experimental results demonstrate the superiority of our approach over traditional offline RL algorithms and highlight the importance of using model-based methods for OEF problems. We hope that our work will contribute to the advancement of research in large-scale equilibrium finding.

URL: https://openreview.net/forum?id=t1kKoTSWwp

---

Title: A Unified Perspective on Natural Gradient Variational Inference with Gaussian Mixture Models

Abstract: Variational inference with Gaussian mixture models (GMMs) enables learning of highly tractable yet multi-modal approximations of intractable target distributions with up to a few hundred dimensions. The two currently most effective methods for GMM-based variational inference, VIPS and iBayes-GMM, both employ independent natural gradient updates for the individual components and their weights. We show, for the first time, that their derived updates are equivalent, although their practical implementations and theoretical guarantees differ. We identify several design choices that distinguish both approaches, namely with respect to sample selection, natural gradient estimation, stepsize adaptation, and whether trust regions are enforced or the number of components adapted. We argue that for both approaches, the quality of the learned approximations can heavily suffer from the respective design choices: by updating the individual components using samples from the mixture model, iBayes-GMM often fails to produce meaningful updates to low-weight components, and by using a zero-order method for estimating the natural gradient, VIPS scales badly to higher-dimensional problems. Furthermore, we show that information-geometric trust regions (used by VIPS) are effective even when using first-order natural gradient estimates, and often outperform the improved Bayesian learning rule (iBLR) update used by iBayes-GMM. We systematically evaluate the effects of design choices and show that a hybrid approach significantly outperforms both prior works. Along with this work, we publish our highly modular and efficient implementation for natural gradient variational inference with Gaussian mixture models, which supports $432$ different combinations of design choices, facilitates the reproduction of all our experiments, and may prove valuable for the practitioner.

URL: https://openreview.net/forum?id=tLBjsX4tjs

---

Title: Novel Class Discovery for Long-tailed Recognition

Abstract: While novel class discovery has achieved great success, existing methods usually evaluate their algorithms on balanced datasets. However, in real-world visual recognition tasks, the class distribution of a dataset is often long-tailed, making it challenging to apply those methods. In this paper, we propose a more realistic setting for novel class discovery where the distribution of novel and known classes is long-tailed. The challenge of this new problem is to discover novel classes with the help of known classes under an imbalanced class scenario. To discover imbalanced novel classes efficiently, we propose an adaptive self-labeling strategy based on an equiangular prototype representation. Our method infers better pseudo-labels for the novel classes by solving a relaxed optimal transport problem and effectively mitigates the biases in learning the known and novel classes. The extensive results on CIFAR100, ImageNet100, and the challenging Herbarium19 datasets demonstrate the superiority of our method.
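The relaxed-transport pseudo-labelling step can be sketched with plain Sinkhorn iterations. The version below is generic (uniform sample marginals, a given, possibly long-tailed, class marginal); the paper's exact relaxation differs in its details.

```python
import numpy as np

def sinkhorn_plan(logits, class_marginal, n_iters=200, eps=0.3):
    # Soft assignment matrix whose column sums match the target class
    # marginal, obtained by alternately rescaling the rows and columns of
    # the exponentiated, temperature-scaled logits.
    Q = np.exp(logits / eps)
    Q /= Q.sum()
    r = np.full(Q.shape[0], 1.0 / Q.shape[0])    # uniform over samples
    c = np.asarray(class_marginal, dtype=float)  # long-tailed over classes
    for _ in range(n_iters):
        Q *= (r / Q.sum(axis=1))[:, None]  # match the sample marginal
        Q *= (c / Q.sum(axis=0))[None, :]  # match the class marginal
    return Q

rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 4))
marginal = np.array([0.55, 0.25, 0.15, 0.05])  # imbalanced classes
plan = sinkhorn_plan(logits, marginal)
pseudo_labels = plan / plan.sum(axis=1, keepdims=True)
```

Enforcing the class marginal is what keeps low-frequency (tail) classes from being starved of pseudo-labels, which is the failure mode naive confidence-based self-labeling exhibits under imbalance.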

URL: https://openreview.net/forum?id=ey5b7kODvK

---

Title: Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent

Abstract: Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose \emph{output-specific} $(\varepsilon,\delta)$-DP to characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We also design an efficient algorithm to investigate individual privacy across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies that groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 44.2\% higher than that of the class with the highest accuracy.

URL: https://openreview.net/forum?id=l4Jcxs0fpC

---

Title: Execution-based Code Generation using Deep Reinforcement Learning

Abstract: The utilization of programming language (PL) models, pre-trained on large-scale code corpora, as a means of automating software engineering processes has demonstrated considerable potential in streamlining various code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting unique sequence-level characteristics of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that synergistically combines pre-trained PL models with Proximal Policy Optimization (PPO) deep reinforcement learning. By utilizing non-differentiable feedback from code execution and structure alignment, PPOCoder seamlessly integrates external code-specific knowledge into the model optimization process. It's important to note that PPOCoder is a task-agnostic and model-agnostic framework that can be used across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, achieving significant improvements in compilation success rates and functional correctness across different PLs.

URL: https://openreview.net/forum?id=0XBuaxqEcG

---

Title: Some Supervision Required: Incorporating Oracle Policies in Reinforcement Learning via Epistemic Uncertainty Metrics

Abstract: An inherent problem of reinforcement learning is performing exploration of an environment through random actions, of which a large portion can be unproductive. Instead, exploration can be improved by initializing the learning policy with an existing (previously learned or hard-coded) oracle policy, offline data, or demonstrations. In the case of using an oracle policy, it can be unclear how best to incorporate the oracle policy's experience into the learning policy in a way that maximizes learning sample efficiency. In this paper, we propose a method termed Critic Confidence Guided Exploration (CCGE) for incorporating such an oracle policy into standard actor-critic reinforcement learning algorithms. More specifically, CCGE takes in the oracle policy's actions as suggestions and incorporates this information into the learning scheme when uncertainty is high, while ignoring it when the uncertainty is low. CCGE is agnostic to methods of estimating value function uncertainty, and we show that it is equally effective with two different techniques. Empirically, we evaluate the effect of CCGE on various benchmark reinforcement learning tasks, and show that this idea can lead to improved sample efficiency and final performance. Furthermore, when evaluated on sparse reward environments, CCGE is able to perform competitively against adjacent algorithms that also leverage an oracle policy. Our experiments show that it is possible to utilize uncertainty as a heuristic to guide exploration using an oracle in reinforcement learning. We expect that this will inspire more research in this direction, where various heuristics are used to determine the direction of guidance provided to learning.
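The gating rule at the heart of CCGE is simple to sketch. Below is a toy version that uses the standard deviation across a critic ensemble as the uncertainty estimate; as the abstract notes, the method is agnostic to the estimator, and these names are illustrative rather than the authors' code.

```python
import numpy as np

def ccge_action(critics, state, learner_action, oracle_action, threshold):
    # Query each critic in the ensemble; if they disagree strongly about the
    # learner's own action (high epistemic uncertainty), defer to the
    # oracle's suggestion, otherwise trust the learner.
    q_values = np.array([q(state, learner_action) for q in critics])
    return oracle_action if q_values.std() > threshold else learner_action

# Toy critics: an uncertain ensemble and a confident one.
uncertain = [lambda s, a: 0.0, lambda s, a: 5.0, lambda s, a: -5.0]
confident = [lambda s, a: 1.0, lambda s, a: 1.0, lambda s, a: 1.1]
a_unc = ccge_action(uncertain, state=None, learner_action="left",
                    oracle_action="right", threshold=1.0)
a_conf = ccge_action(confident, state=None, learner_action="left",
                     oracle_action="right", threshold=1.0)
```

Early in training the critics disagree almost everywhere, so the oracle dominates; as they converge, the learner's own policy takes over.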


URL: https://openreview.net/forum?id=XRpt5JYF8m

---

Title: Exploiting Category Names for Few-Shot Classification with Vision-Language Models

Abstract: Vision-language foundation models pretrained on large-scale data influence many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that can map two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks like image classification. However, when there are only a few examples per category, the potential of large vision-language models is not fully realized, mainly due to the disparity between the vast number of parameters and the relatively limited amount of training data. This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head. More interestingly, we can borrow the non-perfect category names, or even names from a foreign language, to improve the few-shot classification performance compared with random initialization. With the proposed category name initialization method, our model obtains state-of-the-art performance on several few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning). Additionally, we conduct an in-depth analysis of category name initialization, explore the point at which the benefits of category names decrease, examine how distillation techniques can enhance the performance of smaller models, and investigate other pivotal factors and intriguing phenomena in the realm of few-shot learning. Our findings offer valuable insights and guidance for future research endeavors.
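The initialization itself is a one-liner once the text embeddings are in hand; here is a sketch with made-up embeddings standing in for a real text encoder (names are illustrative):

```python
import numpy as np

def init_head_from_names(name_embeddings):
    # Use the L2-normalised text embeddings of the category names as the
    # initial weights of the linear classification head, instead of a
    # random initialisation.
    W = np.asarray(name_embeddings, dtype=float)
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def logits(image_features, W):
    # Few-shot logits start out as cosine similarities to the class names,
    # i.e. the zero-shot classifier, before any fine-tuning on the shots.
    f = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    return f @ W.T

# Toy "text embeddings" for three category names.
names = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.5]])
W = init_head_from_names(names)
image = np.array([[0.9, 0.1, 0.0]])  # feature closest to category 0
```

Starting from this head, few-shot fine-tuning only needs to refine an already meaningful decision boundary rather than learn one from scratch.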

URL: https://openreview.net/forum?id=Qx6ejJHEK9

---

Title: Tackling Provably Hard Representative Selection via Graph Neural Networks

Abstract: Representative Selection (RS) is the problem of finding a small subset of exemplars from a dataset that is representative of the dataset. In this paper, we study RS for unlabeled datasets and focus on finding representatives that optimize the accuracy of a model trained on the selected representatives. Theoretically, we establish a new hardness result for RS by proving that a particular, highly practical variant of it (RS for Learning) is hard to approximate in polynomial time within any reasonable factor, which implies a significant potential gap between the optimum solution of widely-used surrogate functions and the actual accuracy of the model. We then study a setting where additional information in the form of a (homophilous) graph structure is available, or can be constructed, between the data points. We show that with an appropriate modeling approach, the presence of such a structure can turn a hard RS (for learning) problem into one that can be effectively solved. To this end, we develop RS-GNN, a representation learning-based RS model based on Graph Neural Networks. Empirically, we demonstrate the effectiveness of RS-GNN on problems with predefined graph structures as well as problems with graphs induced from node feature similarities, by showing that RS-GNN achieves significant improvements over established baselines on a suite of eight benchmarks.

URL: https://openreview.net/forum?id=3LzgOQ3eOb

---

Title: Advantage Actor-Critic Training Framework Leveraging Lookahead Rewards for Automatic Question Generation

Abstract: Existing approaches in Automatic Question Generation (AQG) train sequence-to-sequence (seq2seq) models to generate questions from input passages and answers using the teacher-forcing algorithm, a supervised learning method, resulting in exposure bias and training-testing evaluation measure mismatch. Several works have also attempted to train seq2seq models for AQG using reinforcement learning, leveraging Monte-Carlo return-based policy gradient (PG) methods like REINFORCE with baseline. However, such Monte-Carlo return-based PG methods depend on sentence-level rewards, which limits the training to sparse and high-variance global reward signals. Temporal difference learning (TD)-based Actor-Critic methods can provide finer-grained training signals for solving text-generation tasks by leveraging subsequence-level information. However, only a few works have explored the Actor-Critic methods for text generation because it becomes an additional challenge to train the seq2seq models steadily using such TD methods. Another severe issue is the vocabulary size-related intractable action space bottleneck inherent in all natural language generation (NLG) tasks. This work proposes an Advantage Actor-Critic training framework to train seq2seq models for AQG, which uses sub-sequence level information to train the models efficiently and stably. The proposed training framework also addresses the problems of exposure bias, evaluation measure mismatch and global rewards by facilitating the autoregressive token generation, BLEU-based task optimization and question prefix-based Critic signals and provides a workaround for the intractable action space bottleneck by leveraging relevant ideas from existing supervised learning and reinforcement learning literature. The training framework uses an off-policy approach for training the Critic, which prevents the Critic from overfitting the highly correlated on-policy training samples. The off-policy Critic training also uses an explicit division of high-reward and low-reward experiences, which provides additional improvement to the training process. In this work, we conduct experiments on multiple datasets from QG-Bench to show how the different components of our proposed Advantage Actor-Critic training framework work together to improve the quality of the questions generated by the seq2seq models by including necessary contextual information and ensuring that the generated questions have a high degree of surface-level similarity with the ground truth.

URL: https://openreview.net/forum?id=A6IjcbMcOC

---

Title: PAVI: Plate-Amortized Variational Inference

Abstract: Given observed data and a probabilistic generative model, Bayesian inference searches for the distribution of the model's parameters that could have yielded the data. Inference is challenging for large population studies where millions of measurements are performed over a cohort of hundreds of subjects, resulting in a massive parameter space. This large cardinality renders off-the-shelf Variational Inference (VI) computationally impractical.

In this work, we design structured VI families that efficiently tackle large population studies. Our main idea is to share the parameterization and learning across the different i.i.d. variables in a generative model, symbolized by the model's $\textit{plates}$. We name this concept $\textit{plate amortization}$. Contrary to off-the-shelf stochastic VI, which slows down inference, plate amortization yields variational distributions that are orders of magnitude faster to train. Applied to large-scale hierarchical problems, PAVI yields expressive, parsimoniously parameterized VI with an affordable training time, effectively unlocking inference in those regimes.

We illustrate the practical utility of PAVI through a challenging Neuroimaging example featuring 400 million latent parameters, demonstrating a significant step towards scalable and expressive Variational Inference.

URL: https://openreview.net/forum?id=vlY9GDCCA6

---

Title: TSMixer: An all-MLP Architecture for Time Series Forecasting

Abstract: Real-world time-series datasets are often multivariate with complex dynamics. To capture this complexity, high-capacity architectures like recurrent- or attention-based sequential deep learning models have become popular. However, recent work demonstrates that simple univariate linear models can outperform such deep learning models on several commonly used academic benchmarks. Extending them, in this paper, we investigate the capabilities of linear models for time-series forecasting and present Time-Series Mixer (TSMixer), a novel architecture designed by stacking multi-layer perceptrons (MLPs). TSMixer is based on mixing operations along both the time and feature dimensions to extract information efficiently. On popular academic benchmarks, the simple-to-implement TSMixer is comparable to specialized state-of-the-art models that leverage the inductive biases of specific benchmarks. On the challenging and large-scale M5 benchmark, a real-world retail dataset, TSMixer demonstrates superior performance compared to the state-of-the-art alternatives. Our results underline the importance of efficiently utilizing cross-variate and auxiliary information for improving the performance of time series forecasting. We present various analyses to shed light on the capabilities of TSMixer. The design paradigms utilized in TSMixer are expected to open new horizons for deep learning-based time series forecasting.
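The two mixing operations are easy to sketch; below is a minimal block (without the normalisation layers) on an input of shape `(time_steps, features)`, as a reading of the described architecture rather than the authors' code:

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    # Two-layer ReLU MLP applied to the last axis of x.
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def tsmixer_block(x, time_params, feat_params):
    # Time-mixing: an MLP shared across features, applied along the time
    # axis; then feature-mixing: an MLP shared across time steps, applied
    # along the feature axis. Each mixing step has a residual connection.
    x = x + mlp(x.T, *time_params).T
    x = x + mlp(x, *feat_params)
    return x

T, F, H = 8, 3, 16  # time steps, variates, hidden width
rng = np.random.default_rng(0)
time_params = (rng.normal(size=(T, H)), np.zeros(H),
               rng.normal(size=(H, T)), np.zeros(T))
feat_params = (rng.normal(size=(F, H)), np.zeros(H),
               rng.normal(size=(H, F)), np.zeros(F))
out = tsmixer_block(rng.normal(size=(T, F)), time_params, feat_params)
```

The feature-mixing step is what lets the model use cross-variate information that purely univariate linear models discard.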

URL: https://openreview.net/forum?id=wbpxTuXgm0

---

Title: Orthonormalising gradients improves neural network optimisation

Abstract: The optimisation of neural networks can be improved by orthogonalising the gradients before the optimisation step, ensuring the diversification of the learned intermediate representations. We orthonormalise the gradients of a layer's components/filters with respect to each other to separate out the latent features. Our method of orthogonalisation allows the weights to be used more flexibly, in contrast to restricting the weights to an orthogonal sub-space. We tested this method on image classification (ImageNet and CIFAR-10) and on self-supervised learning with BarlowTwins, obtaining both better accuracy than SGD with fine-tuning and better accuracy for naïvely chosen hyper-parameters.
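One minimal way to realise per-layer gradient orthonormalisation is a QR decomposition, treating each filter's gradient as a row; this is a sketch of the idea, and the authors' exact procedure may differ.

```python
import numpy as np

def orthonormalise_gradient(grad):
    # grad: (n_filters, fan_in) gradient of one layer, n_filters <= fan_in.
    # QR on the transpose gives orthonormal columns; multiplying by the
    # signs of R's diagonal keeps each orthonormalised filter gradient
    # positively aligned with the original one.
    q, r = np.linalg.qr(grad.T)
    return (q * np.sign(np.diag(r))).T

g = np.random.default_rng(0).normal(size=(4, 10))
g_orth = orthonormalise_gradient(g)
```

The resulting per-filter gradients are mutually orthogonal unit vectors, so no two filters receive redundant updates in the same direction.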


URL: https://openreview.net/forum?id=ZfcmwokDvr

---

Title: Quantization Robust Federated Learning for Efficient Inference on Heterogeneous Devices

Abstract: Federated Learning (FL) is a machine learning paradigm to distributively learn machine learning models from decentralized data that remains on-device. Despite the success of standard Federated optimization methods, such as Federated Averaging (FedAvg) in FL, the energy demands and hardware induced constraints for on-device learning have not been considered sufficiently in the literature. Specifically, an essential demand for on-device learning is to enable trained models to be quantized to various bit-widths based on the energy needs and heterogeneous hardware designs across the federation. In this work, we introduce multiple variants of federated averaging algorithm that train neural networks robust to quantization. Such networks can be quantized to various bit-widths with only limited reduction in full precision model accuracy. We perform extensive experiments on standard FL benchmarks to evaluate our proposed FedAvg variants for quantization robustness and provide a convergence analysis for our Quantization-Aware variants in FL. Our results demonstrate that integrating quantization robustness results in FL models that are significantly more robust to different bit-widths during quantized on-device inference.
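The on-device constraint being simulated is uniform quantization; here is a symmetric per-tensor sketch of that operation (the paper's contribution is the FedAvg training variants that make models robust to it across bit-widths, which are not reproduced here):

```python
import numpy as np

def quantize(w, bits):
    # Symmetric uniform quantization of a weight tensor to `bits` bits:
    # snap each weight to the nearest of 2^(bits-1) - 1 levels per sign.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

w = np.random.default_rng(0).normal(size=1000)
errors = {b: np.abs(quantize(w, b) - w).mean() for b in (2, 4, 8)}
```

Quantization error shrinks roughly by half per extra bit, which is why heterogeneous devices with different bit-width budgets see very different accuracy from the same non-robust model.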

URL: https://openreview.net/forum?id=lvevdX6bxm

---

Title: Effective Neural Network $L_0$ Regularization With BinMask

Abstract: $L_0$ regularization of neural networks is a fundamental problem. In addition to regularizing models for better generalizability, $L_0$ regularization also applies to selecting input features and training sparse neural networks. There is a large body of research on related topics, some with quite complicated methods. In this paper, we show that a straightforward formulation, BinMask, which multiplies weights with deterministic binary masks and uses the identity straight-through estimator for backpropagation, is an effective $L_0$ regularizer. We evaluate BinMask on three tasks: feature selection, network sparsification, and model regularization. Despite its simplicity, BinMask achieves competitive performance on all the benchmarks without task-specific tuning compared to methods designed for each task. Our results suggest that decoupling weights from mask optimization, which has been widely adopted by previous work, is a key component for effective $L_0$ regularization.
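The mechanism named in the abstract, deterministic binary masks with an identity straight-through estimator, is simple enough to sketch directly. This is an illustrative forward/backward pass written from the abstract's description, with manually written gradients rather than an autograd framework; details of the authors' training setup (initialisation, regularisation schedule) are omitted.

```python
import numpy as np

def binmask_forward(w, s):
    """BinMask forward pass: weights multiplied by a deterministic binary
    mask, here taken as the sign of a real-valued score s (1 if s >= 0)."""
    m = (s >= 0).astype(w.dtype)
    return w * m, m

def binmask_backward(grad_out, w, m):
    """Backward pass with the identity straight-through estimator: the
    binarisation m(s) is treated as the identity when differentiating,
    so the score gradient is simply grad_out * w."""
    dw = grad_out * m            # ordinary product rule w.r.t. the weights
    ds = grad_out * w            # identity STE: pretend dm/ds = 1
    return dw, ds

rng = np.random.default_rng(0)
w = rng.normal(size=(5,))
s = rng.normal(size=(5,))        # mask scores, optimised separately from w
y, m = binmask_forward(w, s)
dw, ds = binmask_backward(np.ones_like(y), w, m)
print("mask:", m)                # zero entries prune the corresponding weights
```

Note how the weight gradient `dw` is zeroed wherever the mask is zero, while the score gradient `ds` still flows, which is what lets pruned weights be revived: this decoupling of weights from mask optimisation is the property the abstract highlights.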

URL: https://openreview.net/forum?id=CqZvqyqusY

---

Title: WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series

Abstract: Deep learning models often fail to generalize well under distribution shifts. Understanding and overcoming these failures have led to a new research field on Out-of-Distribution (OOD) generalization. Despite being extensively studied for static computer vision tasks, OOD generalization has been severely underexplored for time series tasks. To shine a light on this gap, we present WOODS: 10 challenging time series benchmarks covering a diverse range of data modalities, such as videos, brain recordings, and smart device sensory signals. We revise the existing OOD generalization algorithms for time series tasks and evaluate them using our systematic framework. Our experiments show substantial room for improvement for empirical risk minimization and OOD generalization algorithms on our datasets, thus underscoring the new challenges posed by time series tasks.

URL: https://openreview.net/forum?id=mvftzofTYQ

---

Title: Does ‘Deep Learning on a Data Diet’ reproduce? Overall yes, but GraNd at Initialization does not

Abstract: The paper 'Deep Learning on a Data Diet' by Paul et al. (2021) introduces two innovative metrics for pruning datasets during the training of neural networks. While we are able to replicate the results for the EL2N score at epoch 20, the same cannot be said for the GraNd score at initialization. The GraNd scores later in training provide useful pruning signals, however. The GraNd score at initialization calculates the average gradient norm of an input sample across multiple randomly initialized models before any training has taken place. Our analysis reveals a strong correlation between the GraNd score at initialization and the input norm of a sample, suggesting that the latter could have been a cheap new baseline for data pruning. Unfortunately, neither the GraNd score at initialization nor the input norm surpasses random pruning in performance. This contradicts one of the findings in Paul et al. (2021). We were unable to reproduce their CIFAR-10 results using either an updated version of the original JAX repository or a newly implemented PyTorch codebase. An investigation of the underlying JAX/FLAX code from 2021 surfaced a bug in the checkpoint restoring code that was fixed in April 2021 (https://github.com/google/flax/commit/28fbd95500f4bf2f9924d2560062fa50e919b1a5).
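The correlation the reproduction reports is easy to see in a toy setting. The sketch below computes a GraNd-at-initialization score for a linear softmax classifier (an assumption made here for tractability; the paper uses deep networks): the per-sample gradient is the outer product of the softmax error and the input, so its norm factors as ||softmax(Wx) - y|| * ||x||, which ties the score directly to the input norm.

```python
import numpy as np

def grand_at_init(x, y_onehot, n_models=16, n_classes=10, seed=0):
    """GraNd score at initialization for a toy linear softmax model:
    the per-sample gradient norm, averaged over several random inits.
    For this model grad = outer(softmax(Wx) - y, x), so its Frobenius
    norm is ||softmax(Wx) - y|| * ||x||, making the reported correlation
    with the input norm explicit."""
    rng = np.random.default_rng(seed)
    norms = []
    for _ in range(n_models):
        W = rng.normal(scale=0.1, size=(n_classes, x.size))
        logits = W @ x
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = np.outer(p - y_onehot, x)         # gradient of cross-entropy w.r.t. W
        norms.append(np.linalg.norm(grad))
    return np.mean(norms)

rng = np.random.default_rng(1)
x_small = rng.normal(size=32)
x_large = 3.0 * x_small                          # same direction, larger input norm
y = np.eye(10)[0]
print(grand_at_init(x_small, y) < grand_at_init(x_large, y))  # larger input norm, larger score
```

In deep networks the factorisation is not exact, but the report's empirical finding is that the correlation remains strong at initialization.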

URL: https://openreview.net/forum?id=1dwXa9vmOI

---

Title: Turning a Curse into a Blessing: Enabling In-Distribution-Data-Free Backdoor Removal via Stabilized Model Inversion

Abstract: The effectiveness of many existing techniques for removing backdoors from machine learning models relies on access to clean in-distribution data. However, given that these models are often trained on proprietary datasets, it may not be practical to assume that in-distribution samples will always be available.
On the other hand, model inversion techniques, which are typically viewed as privacy threats, can reconstruct realistic training samples from a given model, potentially eliminating the need for in-distribution data.
To date, the only prior attempt to integrate backdoor removal and model inversion involves a simple combination that produced very limited results. This work represents a first step toward a more thorough understanding of how model inversion techniques could be leveraged for effective backdoor removal. Specifically, we seek to answer several key questions: What properties must reconstructed samples possess to enable successful defense? Is perceptual similarity to clean samples enough, or are additional characteristics necessary? Is it possible for reconstructed samples to contain backdoor triggers?

We demonstrate that relying solely on perceptual similarity is insufficient for effective defenses. The stability of model predictions in response to input and parameter perturbations also plays a critical role. To address this, we propose a new bi-level optimization based framework for model inversion that promotes stability in addition to visual quality. Interestingly, we also find that reconstructed samples from a pre-trained generator's latent space do not contain backdoors, even when signals from a backdoored model are utilized for reconstruction. We provide a theoretical analysis to explain this observation. Our evaluation shows that our stabilized model inversion technique achieves state-of-the-art backdoor removal performance without requiring access to clean in-distribution data. Furthermore, its performance is on par with or even better than using the same amount of clean samples.
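The stability criterion described above can be illustrated with a toy objective. This is not the paper's bi-level formulation; it is a simplified single-level sketch, assuming a linear softmax "model" and Gaussian input perturbations, that combines a class-fit term with a penalty on how much predictions drift under small input perturbations.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def stabilized_inversion_loss(x, W, target, lam=1.0, eps=0.05, n_pert=8, seed=0):
    """Illustrative stability-regularised inversion objective: fit the
    target class while penalising prediction drift under small random
    input perturbations (a simplified stand-in for the paper's bi-level
    optimization framework)."""
    rng = np.random.default_rng(seed)
    p = softmax(W @ x)
    fit = -np.log(p[target] + 1e-12)             # cross-entropy to the target class
    drift = 0.0
    for _ in range(n_pert):
        delta = eps * rng.normal(size=x.shape)
        drift += np.sum((softmax(W @ (x + delta)) - p) ** 2)
    return fit + lam * drift / n_pert

rng = np.random.default_rng(2)
W = rng.normal(size=(10, 32))                    # toy linear "model"
x = rng.normal(size=32)                          # candidate reconstructed sample
print(f"loss: {stabilized_inversion_loss(x, W, target=3):.4f}")
```

Minimising such an objective over `x` (or over a generator's latent code, as in the paper) favours reconstructions whose predictions are stable, the property the authors identify as necessary beyond perceptual similarity.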

URL: https://openreview.net/forum?id=XuOE99cmST

---
