Weekly TMLR digest for Dec 03, 2023

TMLR

Dec 2, 2023, 7:00:15 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: Causal Reinforcement Learning: A Survey

Zhihong Deng, Jing Jiang, Guodong Long, Chengqi Zhang

https://openreview.net/forum?id=qqnttX9LPo

---


Featured Certification: Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

Avery Ma, Yangchen Pan, Amir-massoud Farahmand

https://openreview.net/forum?id=ed8SkMdYFT

---


Featured Certification: Learning to reconstruct signals from binary measurements alone

Julián Tachella, Laurent Jacques

https://openreview.net/forum?id=ioFIAQOBOS

---


Expert Certification: Pairwise Learning with Adaptive Online Gradient Descent

Tao Sun, Qingsong Wang, Yunwen Lei, Dongsheng Li, Bao Wang

https://openreview.net/forum?id=rq1SaHQg2k

---


Survey Certification: Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking

Hanna Krasowski, Jakob Thumm, Marlon Müller, Lukas Schäfer, Xiao Wang, Matthias Althoff

https://openreview.net/forum?id=mcN0ezbnzO

---


Reproducibility Certification: PAVI: Plate-Amortized Variational Inference

Louis Rouillard, Alexandre Le Bris, Thomas Moreau, Demian Wassermann

https://openreview.net/forum?id=vlY9GDCCA6

---


Accepted papers
===============


Title: Accelerating Batch Active Learning Using Continual Learning Techniques

Authors: Arnav Mohanty Das, Gantavya Bhatt, Megh Manoj Bhalerao, Vianne R. Gao, Rui Yang, Jeff Bilmes

Abstract: A major problem with Active Learning (AL) is high training costs since models are typically retrained from scratch after every query round. We start by demonstrating that standard AL on neural networks with warm starting fails, both to accelerate training and to avoid catastrophic forgetting when using fine-tuning over AL query rounds. We then develop a new class of techniques, circumventing this problem, by biasing further training towards previously labeled sets. We accomplish this by employing existing, and developing novel, replay-based Continual Learning (CL) algorithms that are effective at quickly learning the new without forgetting the old, especially when data comes from an evolving distribution. We call this paradigm \textit{"Continual Active Learning" (CAL)}. We show CAL achieves significant speedups using a plethora of replay schemes that use model distillation and that select diverse/uncertain points from the history. We conduct experiments across many data domains, including natural language, vision, medical imaging, and computational biology, each with different neural architectures and dataset sizes. CAL consistently provides a $\sim$3x reduction in training time, while retaining performance and out-of-distribution robustness, showing its wide applicability.
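
As a rough illustration of the paradigm described above, the sketch below runs a few continual active learning rounds with warm starting and a simple replay of previously labeled points. The uncertainty-based query rule, the synthetic data, and all names are illustrative stand-ins rather than the paper's implementation.

# Minimal sketch of a continual active learning (CAL) round: instead of
# retraining from scratch after each query, continue training on the newly
# labeled points mixed with a replay of previously labeled ones.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(2000, 20))
y_pool = (X_pool[:, 0] + 0.5 * X_pool[:, 1] > 0).astype(int)  # synthetic labels

labeled = list(rng.choice(len(X_pool), 20, replace=False))    # seed set
model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])

for round_ in range(5):
    model.partial_fit(X_pool[labeled], y_pool[labeled], classes=classes)
    # Query: pick the most uncertain unlabeled points (margin closest to 0).
    unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
    margins = np.abs(model.decision_function(X_pool[unlabeled]))
    queried = unlabeled[np.argsort(margins)[:20]]
    # Replay: mix the new queries with a sample of the labeled history,
    # biasing further training toward previously labeled sets.
    replay = rng.choice(labeled, size=min(60, len(labeled)), replace=False)
    batch = np.concatenate([queried, replay])
    model.partial_fit(X_pool[batch], y_pool[batch])
    labeled.extend(queried.tolist())

print("accuracy on full pool:", model.score(X_pool, y_pool))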

URL: https://openreview.net/forum?id=T55dLSgsEf

---

Title: Revisiting Topic-Guided Language Models

Authors: Carolina Zheng, Keyon Vafa, David Blei

Abstract: A recent line of work in natural language processing has aimed to combine language models and topic models. These \textit{topic-guided language models} augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora. Surprisingly, we find that \textit{none of these methods outperform a standard LSTM language model baseline}, and most fail to learn good topics. Further, we train a probe of the neural language model that shows that the baseline's hidden states already encode topic information. We make public all code used for this study.

URL: https://openreview.net/forum?id=lXBEwFfxpA

---

Title: Two-Level Actor-Critic Using Multiple Teachers

Authors: Su Zhang, Srijita Das, Sriram Ganapathi Subramanian, Matthew E. Taylor

Abstract: Deep reinforcement learning has successfully allowed agents to learn complex behaviors for many tasks. However, a key limitation of current learning approaches is the sample-inefficiency problem, which limits performance of the learning agent. This paper considers how agents can benefit from improved learning via teachers' advice. In particular, we consider the setting with multiple sub-optimal teachers, as opposed to having a single near-optimal teacher. We propose a flexible two-level actor-critic algorithm where the high-level network learns to choose the best teacher in the current situation while the low-level network learns the control policy.

URL: https://openreview.net/forum?id=LfQ6uAVAEo

---

Title: ECG Representation Learning with Multi-Modal EHR Data

Authors: Sravan Kumar Lalam, Hari Krishna Kunderu, Shayan Ghosh, Harish Kumar A, Samir Awasthi, Ashim Prasad, Francisco Lopez-Jimenez, Zachi I Attia, Samuel Asirvatham, Paul Friedman, Rakesh Barve, Melwin Babu

Abstract: Electronic Health Records (EHRs) provide a rich source of medical information across different modalities such as electrocardiograms (ECG), structured EHRs (sEHR), and unstructured EHRs (text). Inspired by the fact that many cardiac and non-cardiac diseases influence the behavior of the ECG, we leverage structured and unstructured EHRs from multiple sources by pairing them with ECGs, and propose a set of three new multi-modal contrastive learning models that combine the ECG, sEHR, and text modalities. The performance of these models is compared against different baseline models, such as supervised learning models trained from scratch with random weight initialization and self-supervised learning models trained only on ECGs. We pre-train the models on a large proprietary dataset of about 9 million ECGs from around 2.4 million patients and evaluate the pre-trained models on various downstream tasks such as classification, zero-shot retrieval, and out-of-distribution detection involving the prediction of various heart conditions using ECG waveforms as input, and demonstrate that the models presented in this work show significant improvements compared to all baseline models.

URL: https://openreview.net/forum?id=UxmvCwuTMG

---

Title: Variational Causal Dynamics: Discovering Modular World Models from Interventions

Authors: Anson Lei, Bernhard Schölkopf, Ingmar Posner

Abstract: Latent world models allow agents to reason about complex environments with high-dimensional observations. However, adapting to new environments and effectively leveraging previous knowledge remain significant challenges. We present Variational Causal Dynamics (VCD), a structured world model that exploits the invariance of causal mechanisms across environments to achieve fast and modular adaptation. By causally factorising a transition model, VCD is able to identify reusable components across different environments. This is achieved by combining causal discovery and variational inference to learn a latent representation and transition model jointly in an unsupervised manner. Specifically, we optimise the evidence lower bound jointly over a representation model and a transition model structured as a causal graphical model. In evaluations on simulated environments with state and image observations, we show that VCD is able to successfully identify causal variables, and to discover consistent causal structures across different environments. Moreover, given a small number of observations in a previously unseen, intervened environment, VCD is able to identify the sparse changes in the dynamics and to adapt efficiently. In doing so, VCD significantly extends the capabilities of the current state-of-the-art in latent world models while also comparing favourably in terms of prediction accuracy.

URL: https://openreview.net/forum?id=V9tQKYYNK1

---

Title: Causal Reinforcement Learning: A Survey

Authors: Zhihong Deng, Jing Jiang, Guodong Long, Chengqi Zhang

Abstract: Reinforcement learning is an essential paradigm for solving sequential decision problems under uncertainty. Despite many remarkable achievements in recent decades, applying reinforcement learning methods in the real world remains challenging. One of the main obstacles is that reinforcement learning agents lack a fundamental understanding of the world and must therefore learn from scratch through numerous trial-and-error interactions. They may also face challenges in providing explanations for their decisions and generalizing the acquired knowledge. Causality, however, offers notable advantages by formalizing knowledge in a systematic manner and harnessing invariance for effective knowledge transfer. This has led to the emergence of causal reinforcement learning, a subfield of reinforcement learning that seeks to enhance existing algorithms by incorporating causal relationships into the learning process. In this survey, we provide a comprehensive review of the literature in this domain. We begin by introducing basic concepts in causality and reinforcement learning, and then explain how causality can help address key challenges faced by traditional reinforcement learning. We categorize and systematically evaluate existing causal reinforcement learning approaches, with a focus on their ability to enhance sample efficiency, advance generalizability, facilitate knowledge transfer, mitigate spurious correlations, and promote explainability, fairness, and safety. Lastly, we outline the limitations of current research and shed light on future directions in this rapidly evolving field.

URL: https://openreview.net/forum?id=qqnttX9LPo

---

Title: RCT Rejection Sampling for Causal Estimation Evaluation

Authors: Katherine A. Keith, Sergey Feldman, David Jurgens, Jonathan Bragg, Rohit Bhattacharya

Abstract: Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates---such as text data, genomics, or the behavioral social sciences---researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT---which we release publicly---consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
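
The following toy sketch shows the general idea of subsampling an RCT to induce confounding; the acceptance rule used here is a generic illustration under an assumed binary covariate, not the paper's RCT rejection sampling algorithm.

# Hypothetical sketch of subsampling an RCT to induce confounding: keep each
# unit with a probability that depends on both its covariate and its treatment,
# so that treatment and covariate become correlated in the retained sample.
# The acceptance rule here is illustrative, not the paper's RCT rejection sampler.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.5, n)                   # a binary covariate
t = rng.binomial(1, 0.5, n)                   # randomized treatment (RCT)
y = 2.0 * t + 1.0 * x + rng.normal(size=n)    # true ATE = 2.0

# Confounding by design: treated units with x=1 (and controls with x=0)
# are retained more often, making x a confounder in the subsample.
p_keep = np.where(t == x, 0.8, 0.2)
keep = rng.random(n) < p_keep

naive = y[keep][t[keep] == 1].mean() - y[keep][t[keep] == 0].mean()
print(f"RCT ground-truth ATE: 2.0, naive estimate on confounded subsample: {naive:.2f}")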

URL: https://openreview.net/forum?id=F74ZZk5hPa

---

Title: Tight conditions for when the NTK approximation is valid

Authors: Enric Boix-Adserà, Etai Littwin

Abstract: We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of $\alpha = O(T)$ suffices for the NTK approximation to be valid until training time $T$. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger rescaling factor of $\alpha = O(T^2)$.

URL: https://openreview.net/forum?id=qM7JPBYROr

---

Title: Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

Authors: Avery Ma, Yangchen Pan, Amir-massoud Farahmand

Abstract: Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam and RMSProp, have been widely used in training deep neural networks. We empirically show that while the difference between the standard generalization performance of models trained using these methods is small, those trained using SGD exhibit far greater robustness under input perturbations. Notably, our investigation demonstrates the presence of irrelevant frequencies in natural datasets, where alterations do not affect models' generalization performance. However, models trained with adaptive methods show sensitivity to these changes, suggesting that their use of irrelevant frequencies can lead to solutions sensitive to perturbations. To better understand this difference, we study the learning dynamics of gradient descent (GD) and sign gradient descent (signGD) on a synthetic dataset that mirrors natural signals. With a three-dimensional input space, the models optimized with GD and signGD have standard risks close to zero but vary in their adversarial risks. Our result shows that linear models' robustness to $\ell_2$-norm bounded changes is inversely proportional to the model parameters' weight norm: a smaller weight norm implies better robustness. In the context of deep learning, our experiments show that SGD-trained neural networks have smaller Lipschitz constants, explaining the better robustness to input perturbations than those trained with adaptive gradient methods. Our source code is available at https://github.com/averyma/opt-robust.

URL: https://openreview.net/forum?id=ed8SkMdYFT

---

Title: Data-Free Diversity-Based Ensemble Selection for One-Shot Federated Learning

Authors: Naibo Wang, Wenjie Feng, yuchen deng, Moming Duan, Fusheng Liu, See-Kiong Ng

Abstract: The emerging availability of various machine learning models creates a great demand to harness the collective intelligence of many independently well-trained models to improve overall performance. Considering the privacy concern and non-negligible communication costs, one-shot federated learning and ensemble learning in a data-free manner attract significant attention. However, conventional ensemble selection approaches are neither training efficient nor applicable to federated learning due to the risk of privacy leakage from local clients; meanwhile, the "many could be better than all" principle under data-free constraints makes it even more challenging. Therefore, it becomes crucial to design an effective ensemble selection strategy to find a good subset of the base models as the ensemble team for the federated learning scenario. In this paper, we propose a novel data-free diversity-based framework, DeDES, to address the ensemble selection problem with diversity consideration for models under the one-shot federated learning setting. Experimental results show that our method can achieve both better performance and higher efficiency over 5 datasets, 4 different model structures, and both homogeneous and heterogeneous model groups under four different data-partition strategies.

URL: https://openreview.net/forum?id=ORMlg4g3mG

---

Title: Learning to reconstruct signals from binary measurements alone

Authors: Julián Tachella, Laurent Jacques

Abstract: Recent advances in unsupervised learning have highlighted the possibility of learning to reconstruct signals from noisy and incomplete linear measurements alone. These methods play a key role in medical and scientific imaging and sensing, where ground truth data is often scarce or difficult to obtain. However, in practice measurements are not only noisy and incomplete but also quantized. Here we explore the extreme case of learning from binary observations and provide necessary and sufficient conditions on the number of measurements required for identifying a set of signals from incomplete binary data. Our results are complementary to existing bounds on signal recovery from binary measurements. Furthermore, we introduce a novel self-supervised learning approach, which we name SSBM, that only requires binary data for training. We demonstrate in a series of experiments with real datasets that SSBM performs on par with supervised learning and outperforms sparse reconstruction methods with a fixed wavelet basis by a large margin.
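
For intuition on the measurement model, here is a small learning-free sketch of recovery from binary observations y = sign(Ax) using normalized back-projection; it illustrates the one-bit setting only and is not the SSBM training procedure.

# Toy illustration of the binary (one-bit) measurement model the abstract refers
# to: we only observe y = sign(Ax), and a simple normalized back-projection
# A^T y already recovers the direction of x. This is a generic 1-bit
# compressed-sensing baseline, not the SSBM method itself.
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 512                       # signal dimension, number of measurements
x = np.zeros(d)
x[rng.choice(d, 5, replace=False)] = rng.normal(size=5)   # sparse signal
x /= np.linalg.norm(x)               # 1-bit measurements lose the scale anyway

A = rng.normal(size=(m, d))
y = np.sign(A @ x)                   # binary observations only

x_hat = A.T @ y
x_hat /= np.linalg.norm(x_hat)
print("cosine similarity to ground truth:", float(x_hat @ x))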

URL: https://openreview.net/forum?id=ioFIAQOBOS

---

Title: Universal Graph Continual Learning

Authors: Thanh Duc Hoang, Do Viet Tung, Duy-Hung Nguyen, Bao-Sinh Nguyen, Huy Hoang Nguyen, Hung Le

Abstract: We address catastrophic forgetting issues in graph learning as the arrival of new data from diverse task distributions often leads graph models to prioritize the current task, causing them to forget valuable insights from previous tasks. Whereas prior studies primarily tackle one setting of graph continual learning such as incremental node classification, we focus on a universal approach wherein each data point in a task can be a node or a graph, and the task varies from node to graph classification. We refer to this setting as Universal Graph Continual Learning (UGCL), which includes node-unit node classification (NUNC), graph-unit node classification (GUNC), and graph-unit graph classification (GUGC). Our novel method maintains a replay memory of nodes and neighbours to remind the model of past graph structures through distillation. Emphasizing the importance of preserving distinctive graph structures across tasks, we enforce that coarse-to-grain graph representations stay close to previous ones by minimizing our proposed global and local structure losses. We benchmark our method against various continual learning baselines in 8 real-world graph datasets and achieve significant improvement in average performance and forgetting across tasks.

URL: https://openreview.net/forum?id=wzRE5kTnl3

---

Title: Cross-client Label Propagation for Transductive and Semi-Supervised Federated Learning

Authors: Jonathan Scott, Michelle Yeo, Christoph H Lampert

Abstract: We present Cross-Client Label Propagation (XCLP), a new method for transductive and semi-supervised federated learning. XCLP estimates a data graph jointly from the data of multiple clients and computes labels for the unlabeled data by propagating label information across the graph. To avoid clients having to share their data with anyone, XCLP employs two cryptographically secure protocols: secure Hamming distance computation and secure summation. We demonstrate two distinct applications of XCLP within federated learning. In the first, we use it in a one-shot way to predict labels for unseen test points. In the second, we use it to repeatedly pseudo-label unlabeled training data in a federated semi-supervised setting. Experiments on both real federated and standard benchmark datasets show that in both applications XCLP achieves higher classification accuracy than alternative approaches.

URL: https://openreview.net/forum?id=gY04GX8R5k

---

Title: MERMAIDE: Learning to Align Learners using Model-Based Meta-Learning

Authors: Arundhati Banerjee, Soham Rajesh Phade, Stefano Ermon, Stephan Zheng

Abstract: We study how a principal can efficiently and effectively intervene on the rewards of a previously unseen learning agent in order to induce desirable outcomes. This is relevant to many real-world settings like auctions or taxation, where the principal may not know the learning behavior nor the rewards of real people. Moreover, the principal should be few-shot adaptable and minimize the number of interventions, because interventions are often costly. We introduce MERMAIDE, a model-based meta-learning framework to train a principal that can quickly adapt to out-of-distribution agents with different learning strategies and reward functions. We validate this approach step-by-step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective in intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both $0$-shot and $1$-shot settings with partial agent information.

URL: https://openreview.net/forum?id=H5VRvCXCzf

---

Title: Meta Continual Learning on Graphs with Experience Replay

Authors: Altay Unal, Abdullah Akgül, Melih Kandemir, Gozde Unal

Abstract: Continual learning is a machine learning approach where the challenge is that a constructed learning model executes incoming tasks while maintaining its performance over the earlier tasks. In order to address this issue, we devise a technique that combines two uniquely important concepts in machine learning, namely "replay buffer" and "meta learning", aiming to exploit the best of both worlds. In this method, the model weights are initially computed by using the current task dataset. Next, the dataset of the current task is merged with the stored samples from the earlier tasks and the model weights are updated using the combined dataset. This aids in preventing the model weights from converging to the optimal parameters of the current task and enables the preservation of information from earlier tasks. We choose to adapt our technique to the graph data structure and the task of node classification on graphs. We introduce MetaCLGraph, which outperforms the baseline methods over various graph datasets including Citeseer, Corafull, Arxiv, and Reddit. This method illustrates the potential of combining replay buffers and meta learning in the field of continual learning on graphs.

URL: https://openreview.net/forum?id=8tnrh56P5W

---

Title: Pairwise Learning with Adaptive Online Gradient Descent

Authors: Tao Sun, Qingsong Wang, Yunwen Lei, Dongsheng Li, Bao Wang

Abstract: In this paper, we propose an adaptive online gradient descent method with momentum for pairwise learning, in which the stepsize is determined by historical information. Due to the structure of pairwise learning, the sample pairs are dependent on the parameters, causing difficulties in the convergence analysis. To this end, we develop novel techniques for the convergence analysis of the proposed algorithm. We show that the proposed algorithm can output the desired solution in strongly convex, convex, and nonconvex cases. Furthermore, we present theoretical explanations for why our proposed algorithm can accelerate previous workhorses for online pairwise learning. All assumptions used in the theoretical analysis are mild and common, making our results applicable to various pairwise learning problems. To demonstrate the efficiency of our algorithm, we compare the proposed adaptive method with the non-adaptive counterpart on the benchmark online AUC maximization problem.
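
A minimal sketch of the flavor of algorithm described above: online gradient steps on a pairwise logistic loss with a stepsize derived from the accumulated gradient history (an AdaGrad-norm style rule). The momentum term and the exact stepsize schedule of the paper are not reproduced.

# Minimal sketch of adaptive online gradient descent on a pairwise loss:
# at each step a positive/negative pair is drawn, and the stepsize is set
# from the accumulated history of gradient norms.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(int)
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]

w = np.zeros(d)
acc = 1e-8                                   # accumulated squared gradient norms
for t in range(20000):
    i, j = rng.choice(pos), rng.choice(neg)  # a sample pair (positive, negative)
    margin = np.clip(w @ (X[i] - X[j]), -30.0, 30.0)
    # gradient of the pairwise logistic loss log(1 + exp(-margin))
    g = -(X[i] - X[j]) / (1.0 + np.exp(margin))
    acc += g @ g
    w -= (1.0 / np.sqrt(acc)) * g            # stepsize from historical information

# crude AUC estimate over positive/negative pairs
scores = X @ w
auc = (scores[pos][:, None] > scores[neg][None, :]).mean()
print("pairwise AUC:", round(float(auc), 3))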

URL: https://openreview.net/forum?id=rq1SaHQg2k

---

Title: Improved identification accuracy in equation learning via comprehensive $\boldsymbol{R^2}$-elimination and Bayesian model selection

Authors: Daniel Nickelsen, Bubacarr Bah

Abstract: In the field of equation learning, exhaustively considering all possible combinations derived from a basis function dictionary is infeasible. Sparse regression and greedy algorithms have emerged as popular approaches to tackle this challenge. However, the presence of strong collinearities poses difficulties for sparse regression techniques, and greedy steps may inadvertently exclude important components of the true equation, leading to reduced identification accuracy. In this article, we present a novel algorithm that strikes a balance between comprehensiveness and efficiency in equation learning. Inspired by stepwise regression, our approach combines the coefficient of determination, $R^2$, and the Bayesian model evidence, $p(y|\mathcal{M})$, in a novel way. Through three extensive numerical experiments involving random polynomials and dynamical systems, we compare our method against two standard approaches, four state-of-the-art methods, and bidirectional stepwise regression incorporating $p(y|\mathcal{M})$. The results demonstrate that our less greedy algorithm surpasses all other methods in terms of identification accuracy. Furthermore, we discover a heuristic approach to mitigate the overfitting penalty associated with $R^2$ and propose an equation learning procedure solely based on $R^2$, which achieves high rates of exact equation recovery.

URL: https://openreview.net/forum?id=0ck7hJ8EVC

---

Title: Reliable Active Learning via Influence Functions

Authors: Meng Xia, Ricardo Henao

Abstract: Due to the high cost and time-consuming nature of collecting labeled data, having insufficient labeled data is a common challenge that can negatively impact the performance of deep learning models when applied to real-world applications. Active learning (AL) aims to reduce the cost and time required for obtaining labeled data by selecting valuable samples during model training. However, recent works have pointed out the performance unreliability of existing AL algorithms for deep learning (DL) architectures under different scenarios, which manifests as their performance being comparable (or worse) to that of basic random selection. This behavior compromises the applicability of these approaches. We address this problem by proposing a theoretically motivated AL framework for DL architectures. We demonstrate that the most valuable samples for the model are those that, unsurprisingly, improve its performance on the entire dataset, most of which is unlabeled, and present a framework to efficiently estimate such performance (or loss) via influence functions, pseudo labels and diversity selection. Experimental results show that the proposed reliable active learning via influence functions (RALIF) can consistently outperform the random selection baseline as well as other existing and state-of-the-art active learning approaches.

URL: https://openreview.net/forum?id=dN9YICB6hN

---

Title: Personalized Federated Learning with Communication Compression

Authors: El houcine Bergou, Konstantin Pavlovich Burlachenko, Aritra Dutta, Peter Richtárik

Abstract: In contrast to training traditional machine learning~(ML) models in data centers, federated learning~(FL) trains ML models over local datasets contained on resource-constrained heterogeneous edge devices. Existing FL algorithms aim to learn a single global model for all participating devices, which may not be helpful to all devices participating in the training due to the heterogeneity of the data across the devices. Recently, Hanzely and Richt\'{a}rik (2020) proposed a new formulation for training personalized FL models aimed at balancing the trade-off between the traditional global model and the local models that could be trained by individual devices using their private data only. They derived a new algorithm, called {\em loopless gradient descent}~(L2GD), to solve it and showed that this algorithm leads to improved communication complexity guarantees in regimes where more personalization is required. In this paper, we equip their L2GD algorithm with a {\em bidirectional} compression mechanism to further reduce the communication bottleneck between the local devices and the server. Unlike other compression-based algorithms used in the FL setting, our compressed L2GD algorithm operates on a probabilistic communication protocol, where communication does not happen on a fixed schedule. Moreover, our compressed L2GD algorithm maintains a similar convergence rate as vanilla SGD without compression. To empirically validate the efficiency of our algorithm, we perform diverse numerical experiments on both convex and non-convex problems, using various compression techniques.

URL: https://openreview.net/forum?id=dZugyhbNFY

---

Title: Uncovering Unique Concept Vectors through Latent Space Decomposition

Authors: Mara Graziani, Laura O'Mahony, An-phi Nguyen, Henning Müller, Vincent Andrearczyk

Abstract: Interpreting the inner workings of deep learning models is crucial for establishing trust and ensuring model safety. Concept-based explanations have emerged as a superior approach that is more interpretable than feature attribution estimates such as pixel saliency. However, defining the concepts for the interpretability analysis biases the explanations by the user’s expectations on the concepts. To address this, we propose a novel post-hoc unsupervised method that automatically uncovers the concepts learned by deep models during training. By decomposing the latent space of a layer in singular vectors and refining them by unsupervised clustering, we uncover concept vectors aligned with directions of high variance that are relevant to the model prediction, and that point to semantically distinct concepts. Our extensive experiments reveal that the majority of our concepts are readily understandable to humans, exhibit coherency, and bear relevance to the task at hand. Moreover, we showcase the practical utility of our method in dataset exploration, where our concept vectors successfully identify outlier training samples affected by various confounding factors. This novel exploration technique has remarkable versatility to data types and model architectures and it will facilitate the identification of biases and the discovery of sources of error within training data.
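
A rough sketch of the decomposition-plus-clustering recipe, assuming access to a matrix of layer activations; the relevance filtering and refinement steps of the actual method are omitted, and the activation matrix here is synthetic.

# Rough sketch: decompose a layer's activation matrix with an SVD, then cluster
# samples in the space spanned by the leading singular vectors to surface
# candidate concept directions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 256))           # stand-in for a layer's activations

acts_centered = acts - acts.mean(axis=0)
U, S, Vt = np.linalg.svd(acts_centered, full_matrices=False)
top_dirs = Vt[:10]                            # directions of highest variance

proj = acts_centered @ top_dirs.T             # samples in the reduced space
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(proj)

# One candidate "concept vector" per cluster: the mean direction of its members.
concept_vectors = np.stack([
    acts_centered[labels == k].mean(axis=0) for k in range(5)
])
concept_vectors /= np.linalg.norm(concept_vectors, axis=1, keepdims=True)
print(concept_vectors.shape)                  # (5, 256)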

URL: https://openreview.net/forum?id=LT4DXqUJTD

---

Title: RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Authors: Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, Tong Zhang

Abstract: Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects high-quality samples, discards those that exhibit undesired behavior, and subsequently enhances the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve model performance on both reward learning and other automated metrics, for both large language models and diffusion models.
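
The core loop is easy to convey with a toy numerical example: sample from the current model, rank by reward, and refit on the retained top samples. A one-dimensional Gaussian "generator" and an arbitrary reward function stand in for the LLM or diffusion model and the learned reward model.

# Toy numerical illustration of the reward-ranked fine-tuning idea: sample from
# the current model, keep only the highest-reward samples, and refit the model
# on that filtered set.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0                       # initial "generative model"
reward = lambda x: -(x - 3.0) ** 2         # stand-in reward model

for it in range(10):
    samples = rng.normal(mu, sigma, size=1024)          # generate
    top = samples[np.argsort(reward(samples))[-128:]]   # rank and filter by reward
    mu, sigma = top.mean(), max(top.std(), 0.1)         # "fine-tune" on the kept set
    print(f"iter {it}: mu={mu:.2f}, sigma={sigma:.2f}, "
          f"mean reward={reward(samples).mean():.2f}")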

URL: https://openreview.net/forum?id=m7p5O7zblY

---

Title: SANTA: Source Anchoring Network and Target Alignment for Continual Test Time Adaptation

Authors: Goirik Chakrabarty, Manogna Sreenivas, Soma Biswas

Abstract: Adapting a trained model to perform satisfactorily on continually changing test environments is an important and challenging task. In this work, we propose a novel framework, SANTA, which aims to satisfy the following characteristics required for online adaptation: 1) can work effectively for different (even small) batch sizes; 2) should continue to work well on the source domain; 3) should have minimal tunable hyperparameters and storage requirements. Given a pre-trained network trained on source domain data, the proposed framework modifies the affine parameters of the batch normalization layers using source anchoring based self-distillation. This ensures that the model incorporates knowledge from the newly encountered domains, without catastrophically forgetting the previously seen domains. We also propose a source-prototype driven contrastive alignment to ensure natural grouping of the target samples, while maintaining the already learnt semantic information. Extensive evaluation on three benchmark datasets under challenging settings justify the effectiveness of SANTA for real-world applications. Code here: https://github.com/goirik-chakrabarty/SANTA

URL: https://openreview.net/forum?id=V7guVYzvE4

---

Title: The Analysis of the Expected Change in the Classification Probability of the Predicted Label

Authors: Ruo Yang, Ping Liu, Mustafa Bilgic

Abstract: We present a formalism for estimating the expected change in the probability distribution of the predicted label of an object, with respect to all small perturbations to the object. We first derive analytically an estimate of the expected probability change as a function of the input noise. We then conduct three empirical studies: in the first study, experimental results on image classification show that the proposed measure can be used to distinguish the not-robust label predictions from those that are robust, even when they are all predicted with high confidence. The second study shows that the proposed robustness measure is almost always higher for the predictions on the corrupted images, compared to the predictions on the original versions of them. The final study shows that the proposed measure is lower for models when they are trained using adversarial training approaches.
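
A Monte Carlo version of the quantity being estimated can be written in a few lines; the sketch below uses a fixed linear softmax model and Gaussian input noise as stand-ins, not the paper's analytic estimate.

# Monte Carlo sketch of the expected change in the predicted label's probability
# under small random input perturbations.
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)    # toy 3-class linear model

def predict_proba(x):
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_prob_change(x, noise_std=0.05, n_samples=1000):
    p = predict_proba(x)
    label = int(np.argmax(p))
    perturbed = x + noise_std * rng.normal(size=(n_samples, x.size))
    p_pert = np.array([predict_proba(xi)[label] for xi in perturbed])
    return float(np.mean(np.abs(p_pert - p[label])))

x = rng.normal(size=5)
print("expected change in predicted-label probability:", expected_prob_change(x))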

URL: https://openreview.net/forum?id=gvqzvUVPiQ

---

Title: Latent State Models of Training Dynamics

Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho

Abstract: The impact of randomness on model training is poorly understood. How do differences in data order and initialization actually manifest in the model, such that some training runs outperform others or converge faster? Furthermore, how can we interpret the resulting training dynamics and the phase transitions that characterize different trajectories? To understand the effect of randomness on the dynamics and outcomes of neural network training, we train models multiple times with different random seeds and compute a variety of metrics throughout training, such as the $L_2$ norm, mean, and variance of the neural network's weights. We then fit a hidden Markov model (HMM) over the resulting sequences of metrics. The HMM represents training as a stochastic process of transitions between latent states, providing an intuitive overview of significant changes during training. Using our method, we produce a low-dimensional, discrete representation of training dynamics on grokking tasks, image classification, and masked language modeling. We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
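
A minimal sketch of this pipeline, assuming the hmmlearn package and synthetic per-step weight metrics in place of real training runs:

# Compute per-step weight statistics for several training runs, then fit a
# hidden Markov model over the concatenated metric sequences to obtain a
# discrete set of latent training "phases". Metrics here are synthetic stand-ins.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
runs = []
for seed in range(5):                                  # 5 training runs
    # toy metric trajectory: [L2 norm, mean, variance] of the weights per step
    t = np.linspace(0, 1, 200)[:, None]
    metrics = np.hstack([1 + 3 * t, 0.1 * t, 0.5 - 0.4 * t])
    runs.append(metrics + 0.05 * rng.normal(size=metrics.shape))

X = np.vstack(runs)
lengths = [len(r) for r in runs]
model = hmm.GaussianHMM(n_components=4, covariance_type="diag",
                        n_iter=100, random_state=0).fit(X, lengths)
states = model.predict(runs[0])                        # latent-phase sequence of run 0
print("latent states over training:", states[::40])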

URL: https://openreview.net/forum?id=NE2xXWo0LF

---

Title: Differentially Private Optimizers Can Learn Adversarially Robust Models

Authors: Zhiqi Bu, Yuan Zhang

Abstract: Machine learning models have shone in a variety of domains and attracted increasing attention from both the security and the privacy communities. One important yet worrying question is: Will training models under the differential privacy (DP) constraint have an unfavorable impact on their adversarial robustness? While previous works have postulated that privacy comes at the cost of worse robustness, we give the first theoretical analysis to show that DP models can indeed be robust and accurate, even sometimes more robust than their naturally-trained non-private counterparts. We observe three key factors that influence the privacy-robustness-accuracy tradeoff: (1) hyper-parameters for DP optimizers are critical; (2) pre-training on public data significantly mitigates the accuracy and robustness drop; (3) the choice of DP optimizers makes a difference. With these factors set properly, we achieve 90\% natural accuracy, 72\% robust accuracy ($+9\%$ over the non-private model) under the $l_2(0.5)$ attack, and 69\% robust accuracy ($+16\%$ over the non-private model) with a pre-trained SimCLRv2 model under the $l_\infty(4/255)$ attack on CIFAR10 with $\epsilon=2$. In fact, we show both theoretically and empirically that DP models are Pareto optimal on the accuracy-robustness tradeoff. Empirically, the robustness of DP models is consistently observed across various datasets and models. We believe our encouraging results are a significant step towards training models that are private as well as robust.

URL: https://openreview.net/forum?id=o8VgRNYh6n

---

Title: Addressing caveats of neural persistence with deep graph persistence

Authors: Leander Girrbach, Anders Christensen, Ole Winther, Zeynep Akata, A. Sophia Koepke

Abstract: Neural Persistence is a prominent measure for quantifying neural network complexity, proposed in the emerging field of topological data analysis in deep learning. In this work, however, we find both theoretically and empirically that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence. Whilst this captures useful information for linear classifiers, we find that no relevant spatial structure is present in later layers of deep neural networks, making neural persistence roughly equivalent to the variance of weights. Additionally, the proposed averaging procedure across layers for deep neural networks does not consider interaction between layers. Based on our analysis, we propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers, which is equivalent to calculating neural persistence on one particular matrix. This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues through standardisation. Code is available at https://github.com/ExplainableML/Deep-Graph-Persistence.

URL: https://openreview.net/forum?id=oyfRWeoUJY

---

Title: Replay-enhanced Continual Reinforcement Learning

Authors: Tiantian Zhang, Kevin Zehua Shen, Zichuan Lin, Bo Yuan, Xueqian Wang, Xiu Li, Deheng Ye

Abstract: Replaying past experiences has proven to be a highly effective approach for averting catastrophic forgetting in supervised continual learning. However, some crucial factors are still largely ignored, making it vulnerable to serious failure, when used as a solution to forgetting in continual reinforcement learning, even in the context of perfect memory where all data of previous tasks are accessible in the current task. On the one hand, since most reinforcement learning algorithms are not invariant to the reward scale, the previously well-learned tasks (with high rewards) may appear to be more salient to the current learning process than the current task (with small initial rewards). This causes the agent to concentrate on those salient tasks at the expense of generality on the current task. On the other hand, offline learning on replayed tasks while learning a new task may induce a distributional shift between the dataset and the learned policy on old tasks, resulting in forgetting. In this paper, we introduce RECALL, a replay-enhanced method that greatly improves the plasticity of existing replay-based methods on new tasks while effectively avoiding the recurrence of catastrophic forgetting in continual reinforcement learning. RECALL leverages adaptive normalization on approximate targets and policy distillation on old tasks to enhance generality and stability, respectively. Extensive experiments on the Continual World benchmark show that RECALL performs significantly better than purely perfect memory replay, and achieves comparable or better overall performance against state-of-the-art continual learning methods.

URL: https://openreview.net/forum?id=91hfMEUukm

---

Title: The (Un)Scalability of Informed Heuristic Function Estimation in NP-Hard Search Problems

Authors: Sumedh Pendurkar, Taoan Huang, Brendan Juba, Jiapeng Zhang, Sven Koenig, Guni Sharon

Abstract: The A* algorithm is commonly used to solve NP-hard combinatorial optimization problems. When provided with a completely informed heuristic function, A* can solve such problems in time complexity that is polynomial in the solution cost and branching factor. In light of this fact, we examine a line of recent publications that propose fitting deep neural networks to the completely informed heuristic function. We assert that these works suffer from inherent scalability limitations since --- under the assumption of NP $\not \subseteq$ P/poly --- such approaches result in either (a) network sizes that scale super-polynomially in the instance sizes or (b) the accuracy of the fitted deep neural networks scales inversely with the instance sizes. Complementing our theoretical claims, we provide experimental results for three representative NP-hard search problems. The results suggest that fitting deep neural networks to informed heuristic functions requires network sizes that grow quickly with the problem instance size. We conclude by suggesting that the research community should focus on scalable methods for integrating heuristic search with machine learning, as opposed to methods relying on informed heuristic estimation.

URL: https://openreview.net/forum?id=JllRdycmLk

---

Title: Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking

Authors: Hanna Krasowski, Jakob Thumm, Marlon Müller, Lukas Schäfer, Xiao Wang, Matthias Althoff

Abstract: Ensuring the safety of reinforcement learning (RL) algorithms is crucial to unlock their potential for many real-world tasks. However, vanilla RL and most safe RL approaches do not guarantee safety. In recent years, several methods have been proposed to provide hard safety guarantees for RL, which is essential for applications where unsafe actions could have disastrous consequences. Nevertheless, there is no comprehensive comparison of these provably safe RL methods. Therefore, we introduce a categorization of existing provably safe RL methods, present the conceptual foundations for both continuous and discrete action spaces, and empirically benchmark existing methods. We categorize the methods based on how they adapt the action: action replacement, action projection, and action masking. Our experiments on an inverted pendulum and a quadrotor stabilization task indicate that action replacement is the best-performing approach for these applications despite its comparatively simple realization. Furthermore, adding a reward penalty every time the safety verification is engaged improved training performance in our experiments. Finally, we provide practical guidance on selecting provably safe RL approaches depending on the safety specification, RL algorithm, and type of action space.
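
To make the categorization concrete, the sketch below shows the three action-adaptation mechanisms for a one-dimensional action space, assuming a safety verifier that returns a provably safe action interval; computing that interval is the hard part in practice and is what the surveyed methods actually provide.

# Illustrative sketch of the three categories of provably safe RL named above.
import numpy as np

def safe_interval(state):
    # placeholder safety verifier: returns the provably safe action range
    return -0.5, 0.5

def action_projection(state, a):
    lo, hi = safe_interval(state)
    return float(np.clip(a, lo, hi))            # project onto the safe set

def action_replacement(state, a, fallback=0.0):
    lo, hi = safe_interval(state)
    return a if lo <= a <= hi else fallback      # replace unsafe actions entirely

def action_masking(state, candidate_actions):
    lo, hi = safe_interval(state)
    return [a for a in candidate_actions if lo <= a <= hi]  # mask before sampling

state, proposed = None, 0.9
print(action_projection(state, proposed),
      action_replacement(state, proposed),
      action_masking(state, [-0.8, -0.2, 0.4, 0.9]))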

URL: https://openreview.net/forum?id=mcN0ezbnzO

---

Title: A Combinatorial Semi-Bandit Approach to Charging Station Selection for Electric Vehicles

Authors: Niklas Åkerblom, Morteza Haghir Chehreghani

Abstract: In this work, we address the problem of long-distance navigation for battery electric vehicles (BEVs), where one or more charging sessions are required to reach the intended destination. We consider the availability and performance of the charging stations to be unknown and stochastic, and develop a combinatorial semi-bandit framework for exploring the road network to learn the parameters of the queue time and charging power distributions. Within this framework, we first outline a method for transforming the road network graph into a graph of feasible paths between charging stations to handle the constrained combinatorial optimization problem in an efficient way. Then, for the feasibility graph, we use a Bayesian approach to model the stochastic edge weights, utilizing conjugate priors for the one-parameter exponential and two-parameter gamma distributions, the latter of which is novel to the multi-armed bandit literature. Finally, we apply combinatorial versions of Thompson Sampling, BayesUCB and Epsilon-greedy to the problem. We demonstrate the performance of our framework on long-distance navigation problem instances in large-scale country-sized road networks, with simulation experiments in Norway, Sweden and Finland.
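
One ingredient of the framework, Thompson sampling for exponential queue times with a conjugate Gamma prior on the rate, can be sketched as follows; the feasibility-graph construction and the two-parameter gamma model for charging power are omitted.

# Thompson sampling over stations with unknown exponential queue-time rates.
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([1.0, 0.5, 2.0])        # per-station exponential rates
alpha = np.ones(3)                            # Gamma prior shape per station
beta = np.ones(3)                             # Gamma prior rate per station

for t in range(2000):
    sampled_rates = rng.gamma(alpha, 1.0 / beta)        # draw a rate per station
    station = int(np.argmax(sampled_rates))             # highest rate = shortest mean wait
    wait = rng.exponential(1.0 / true_rates[station])   # observe a queue time
    alpha[station] += 1.0                               # conjugate posterior update
    beta[station] += wait

print("posterior mean rates:", alpha / beta, "(true:", true_rates, ")")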

URL: https://openreview.net/forum?id=ndw90pkNM9

---

Title: Invertible Hierarchical Generative Model for Images

Authors: Heikki Timonen, Miika Aittala, Jaakko Lehtinen

Abstract: Normalizing flows (NFs) as generative models enjoy desirable properties such as exact invertibility and exact likelihood evaluation, while being efficient to sample from. These properties, however, come at the cost of heavy restrictions on the architecture. Due to these limitations, modeling multi-modal probability distributions can yield poor results even with low-dimensional data. Additionally, typical flow architectures employed on real image datasets produce samples with visible aliasing artifacts and limited variation. The latent decomposition of flow models also falls short of that of competing methods, with uneven contributions to a decoded image. In this work we build an invertible generative model using conditional normalizing flows in a hierarchical fashion to circumvent the aforementioned limitations. We show that we can achieve superior sample quality among flow-based models with fewer parameters compared to the state of the art. We demonstrate the ability to control individual levels of detail via the latent decomposition of our model.

URL: https://openreview.net/forum?id=4rkKN4tM63

---

Title: PAVI: Plate-Amortized Variational Inference

Authors: Louis Rouillard, Alexandre Le Bris, Thomas Moreau, Demian Wassermann

Abstract: Given observed data and a probabilistic generative model, Bayesian inference searches for the distribution of the model's parameters that could have yielded the data. Inference is challenging for large population studies where millions of measurements are performed over a cohort of hundreds of subjects, resulting in a massive parameter space. This large cardinality renders off-the-shelf Variational Inference (VI) computationally impractical.

In this work, we design structured VI families that efficiently tackle large population studies. Our main idea is to share the parameterization and learning across the different i.i.d. variables in a generative model, symbolized by the model's $\textit{plates}$.
We name this concept $\textit{plate amortization}$. Contrary to off-the-shelf stochastic VI, which slows down inference, plate amortization yields variational distributions that are orders of magnitude faster to train. Applied to large-scale hierarchical problems, PAVI yields expressive, parsimoniously parameterized VI with an affordable training time, effectively unlocking inference in those regimes.

We illustrate the practical utility of PAVI through a challenging Neuroimaging example featuring 400 million latent parameters, demonstrating a significant step towards scalable and expressive Variational Inference.

URL: https://openreview.net/forum?id=vlY9GDCCA6

---

Title: Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods

Authors: Yuchen Lu, Zhen Liu, Aristide Baratin, Romain Laroche, Aaron Courville, Alessandro Sordoni

Abstract: We address the problem of evaluating the quality of self-supervised learning (SSL) models without access to supervised labels, while being agnostic to the architecture, learning algorithm or data manipulation used during training. We argue that representations can be evaluated through the lens of expressiveness and learnability. We propose to use the Intrinsic Dimension (ID) to assess expressiveness and introduce Cluster Learnability (CL) to assess learnability. CL is measured in terms of the performance of a KNN classifier trained to predict labels obtained by clustering the representations with K-means. We thus combine CL and ID into a single predictor, CLID. Through a large-scale empirical study with a diverse family of SSL algorithms, we find that CLID better correlates with in-distribution model performance than other competing recent evaluation schemes. We also benchmark CLID on out-of-domain generalization, where CLID serves as a predictor of the transfer performance of SSL models on several visual classification tasks, yielding improvements with respect to the competing baselines.
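
A small sketch of the two ingredients on a toy representation matrix, using the TwoNN intrinsic-dimension estimator as one common choice (the paper's ID estimator may differ) and K-means pseudo-labels scored by a cross-validated KNN classifier for cluster learnability:

# Intrinsic dimension and cluster learnability on a synthetic representation matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 8)) @ rng.normal(size=(8, 128))   # representations on an 8-d subspace

# Intrinsic dimension (TwoNN): MLE from the ratio of 2nd to 1st neighbor distances.
dists, _ = NearestNeighbors(n_neighbors=3).fit(Z).kneighbors(Z)
mu = dists[:, 2] / dists[:, 1]
intrinsic_dim = len(mu) / np.sum(np.log(mu))

# Cluster learnability: how well a KNN classifier predicts K-means pseudo-labels.
pseudo_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)
cl = cross_val_score(KNeighborsClassifier(n_neighbors=5), Z, pseudo_labels, cv=5).mean()

print(f"ID ~ {intrinsic_dim:.1f}, cluster learnability ~ {cl:.2f}")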

URL: https://openreview.net/forum?id=BxdrpnRHNh

---

Title: Learning Multiscale Non-stationary Causal Structures

Authors: Gabriele D'Acunto, Gianmarco De Francisci Morales, Paolo Bajardi, Francesco Bonchi

Abstract: This paper addresses a gap in the current state of the art by providing a solution for modeling causal relationships that evolve over time and occur at different time scales. Specifically, we introduce the multiscale non-stationary directed acyclic graph (MN-DAG), a framework for modeling multivariate time series data. Our contribution is twofold. Firstly, we expose a probabilistic generative model by leveraging results from spectral and causality theories. Our model allows sampling an MN-DAG according to user-specified priors on the time-dependence and multiscale properties of the causal graph. Secondly, we devise a Bayesian method named Multiscale Non-stationary Causal Structure Learner (MN-CASTLE) that uses stochastic variational inference to estimate MN-DAGs. The method also exploits information from the local partial correlation between time series over different time resolutions. The data generated from an MN-DAG reproduces well-known features of time series in different domains, such as volatility clustering and serial correlation. Additionally, we show the superior performance of MN-CASTLE on synthetic data with different multiscale and non-stationary properties compared to baseline models. Finally, we apply MN-CASTLE to identify the drivers of the natural gas prices in the US market. Causal relationships have strengthened during the COVID-19 outbreak and the Russian invasion of Ukraine, a fact that baseline methods fail to capture. MN-CASTLE identifies the causal impact of critical economic drivers on natural gas prices, such as seasonal factors, economic uncertainty, oil prices, and gas storage deviations.

URL: https://openreview.net/forum?id=SQnPE63jtA

---

Title: Bag of Image Patch Embedding Behind the Success of Self-Supervised Learning

Authors: Yubei Chen, Adrien Bardes, ZENGYI LI, Yann LeCun

Abstract: Self-supervised learning (SSL) has recently achieved tremendous empirical advancements in learning image representation. However, our understanding of the principle behind learning such a representation is still limited. This work shows that joint-embedding SSL approaches learn a representation of image patches, which reflects their co-occurrence. Such a connection to co-occurrence modeling can be established formally, and it supplements the prevailing invariance perspective. We empirically show that learning a representation for fixed-scale patches and aggregating local patch representations as the image representation achieves similar or even better results than the baseline methods. We denote this process as {\it BagSSL}. Even with $32\times 32$ patch representation, BagSSL achieves $62\%$ top-1 linear probing accuracy on ImageNet. On the other hand, with a multi-scale pretrained model, we show that the whole image embedding is approximately the average of local patch embeddings. While the SSL representation is relatively invariant at the global scale, we show that locality is preserved when we zoom into local patch-level representation. Further, we show that patch representation aggregation can improve various SOTA baseline methods by a large margin. The patch representation is considerably easier to understand, and this work makes a step to demystify self-supervised representation learning.
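
The aggregation step is simple to sketch: embed fixed-scale patches independently and average the patch embeddings into an image embedding. The random linear patch encoder below is a stand-in for a pretrained SSL backbone.

# Bag-of-patch-embeddings aggregation: image embedding = mean of patch embeddings.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32 * 32 * 3, 256)) / 50.0     # stand-in patch encoder

def embed_patch(patch):
    return patch.reshape(-1) @ W                   # (32, 32, 3) -> (256,)

def bag_embedding(image, patch=32):
    h, w, _ = image.shape
    embs = [embed_patch(image[i:i + patch, j:j + patch])
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]
    return np.mean(embs, axis=0)

image = rng.random((224, 224, 3))
print(bag_embedding(image).shape)                  # (256,)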

URL: https://openreview.net/forum?id=r06xREo3QG

---

Title: One-Round Active Learning through Data Utility Learning and Proxy Models

Authors: Jiachen T. Wang, Si Chen, Ruoxi Jia

Abstract: While active learning (AL) techniques have demonstrated the potential to produce high-performance models with fewer labeled data, their application remains limited due to the necessity for multiple rounds of interaction with annotators. This paper studies the problem of one-round AL, which aims at selecting a subset of unlabeled points and querying their labels \emph{all at once}. A fundamental challenge is how to measure the utility of different choices of labeling queries for learning a target model. Our key idea is to learn such a utility metric from a small initial labeled set. We demonstrate that our approach leads to state-of-the-art performance on various AL benchmarks and is more robust to the lack of initial labeled data.

In addition to algorithmic development and evaluation, we introduce a novel metric for quantifying `\emph{utility transferability}' -- the degree of correlation between the performance changes of two learning algorithms due to variations in training data selection. Previous studies have often observed a notable utility transferability between models, even those with differing complexities. Such transferability enabled our approach, as well as other techniques such as coresets, hyperparameter tuning, and data valuation, to scale up to more sophisticated target models by substituting them with smaller proxy models. Nevertheless, utility transferability has not yet been rigorously defined within a formal mathematical framework, a gap that our work addresses innovatively. We further propose two Monte Carlo-based methods for efficiently comparing utility transferability for different proxy models, thereby facilitating a more informed selection of proxy models.

URL: https://openreview.net/forum?id=8HQCOMRa7g

---

Title: Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies

Authors: Shivakanth Sujit, Pedro Braga, Jorg Bornschein, Samira Ebrahimi Kahou

Abstract: Reinforcement learning (RL) has shown great promise with algorithms learning in environments with large state and action spaces purely from scalar reward signals. A crucial challenge for current deep RL algorithms is that they require a tremendous amount of environment interactions for learning. This can be infeasible in situations where such interactions are expensive, such as in robotics. Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data without needing to interact with the environment from the very beginning. While online RL algorithms are typically evaluated as a function of the number of environment interactions, there isn't a single established protocol for evaluating offline RL methods. In this paper, we propose a sequential approach to evaluate offline RL algorithms as a function of the training set size and thus by their data efficiency. Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset while also harmonizing the visualization of the offline and online learning phases. Our approach is generally applicable and easy to implement. We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.
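
A schematic of the sequential evaluation protocol, with a trivial logged-bandit dataset and a "pick the empirically best action" learner standing in for a real offline RL algorithm and environment:

# Train the offline method on growing prefixes of the logged dataset and report
# performance as a function of the training set size.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.4, 0.35, 0.8, 0.2])          # 5 actions
logged_actions = rng.integers(0, 5, size=10_000)           # behavior policy: uniform
logged_rewards = rng.normal(true_means[logged_actions], 1.0)

for n in [100, 300, 1_000, 3_000, 10_000]:                 # growing training set sizes
    a, r = logged_actions[:n], logged_rewards[:n]
    est = np.array([r[a == k].mean() if (a == k).any() else -np.inf for k in range(5)])
    policy = int(np.argmax(est))                            # "offline-trained" policy
    print(f"n={n:>6}: chosen action {policy}, true value {true_means[policy]:.2f}")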

URL: https://openreview.net/forum?id=J3veZdVpts

---

Title: RLTF: Reinforcement Learning from Unit Test Feedback

Authors: Jiate Liu, Yiqin Zhu, Kaiwen Xiao, QIANG FU, Xiao Han, Yang Wei, Deheng Ye

Abstract: The goal of program synthesis, or code generation, is to generate executable code based on given descriptions. Recently, there has been an increasing number of studies employing reinforcement learning (RL) to improve the performance of large language models (LLMs) for code.
However, some of the current representative RL methods have only used offline frameworks, limiting the exploration of new sample spaces.
Additionally, the utilization of unit test signals is limited, not accounting for specific error locations within the code.
To address these issues, we proposed RLTF, i.e., Reinforcement Learning from Unit Test Feedback, a novel online RL framework with unit test feedback of multi-granularity for refining code LLMs.
Our approach generates data in real-time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code.
Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and the MBPP benchmarks.
Our code is available at: \url{https://github.com/Zyq-scut/RLTF}.
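
As a rough illustration of what multi-granularity unit-test feedback could look like, the sketch below runs a generated program against its tests and maps the outcome to a coarse scalar reward plus a line-level penalty taken from the traceback. The reward values and the `run_tests` helper are hypothetical; the authors' actual reward design lives in the repository linked above and may differ.

```python
# Hedged sketch: turning unit-test outcomes into coarse and fine-grained rewards
# for RL fine-tuning of a code LLM. Reward values and helpers are illustrative.
import re, subprocess, sys, tempfile

def run_tests(program: str, test_code: str, timeout: float = 5.0):
    """Execute the generated program plus its tests; return (status, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout", ""
    if proc.returncode == 0:
        return "pass", ""
    if "AssertionError" in proc.stderr:
        return "fail", proc.stderr
    if "SyntaxError" in proc.stderr:
        return "syntax_error", proc.stderr
    return "runtime_error", proc.stderr

def coarse_reward(status: str) -> float:
    # Hypothetical values: full credit only when every test passes.
    return {"pass": 1.0, "fail": -0.3, "runtime_error": -0.6,
            "timeout": -1.0, "syntax_error": -1.0}[status]

def line_level_penalty(stderr: str, num_lines: int):
    """Fine-grained signal: penalize only the line the traceback points to."""
    penalties = [0.0] * num_lines
    m = re.search(r"line (\d+)", stderr)
    if m:
        k = min(int(m.group(1)) - 1, num_lines - 1)
        penalties[k] = -1.0
    return penalties

program = "def add(a, b):\n    return a - b"   # deliberately buggy candidate solution
status, stderr = run_tests(program, "assert add(2, 3) == 5")
print(status, coarse_reward(status), line_level_penalty(stderr, program.count("\n") + 1))
```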

URL: https://openreview.net/forum?id=hjYmsV6nXZ

---

Title: Visualizing the Diversity of Representations Learned by Bayesian Neural Networks

Authors: Dennis Grinwald, Kirill Bykov, Shinichi Nakajima, Marina MC Höhne

Abstract: Explainable Artificial Intelligence (XAI) aims to make learning machines less opaque, and offers researchers and practitioners various tools to reveal the decision-making strategies of neural networks. In this work, we investigate how XAI methods can be used for exploring and visualizing the diversity of feature representations learned by Bayesian Neural Networks (BNNs). Our goal is to provide a global understanding of BNNs by making their decision-making strategies a) visible and tangible through feature visualizations and b) quantitatively measurable with a distance measure learned by contrastive learning. Our work provides new insights into the posterior distribution in terms of human-understandable feature information with regard to the underlying decision-making strategies. The main findings of our work are the following: 1) global XAI methods can be applied to explain the diversity of decision-making strategies of BNN instances, 2) Monte Carlo dropout with commonly used dropout rates exhibits increased diversity in feature representations compared to the multimodal posterior approximation of MultiSWAG, 3) the diversity of learned feature representations highly correlates with the uncertainty estimate for the output, and 4) the inter-mode diversity of the multimodal posterior decreases as the network width increases, while the intra-mode diversity increases. These findings are consistent with recent deep neural network theory, providing additional intuition about what the theory implies in terms of human-understandable concepts.

URL: https://openreview.net/forum?id=ZSxvyWrX6k

---

Title: Automated Detection of Causal Inference Opportunities: Regression Discontinuity Subgroup Discovery

Authors: Tony Liu, Patrick Lawlor, Lyle Ungar, Konrad Kording, Rahul Ladhania

Abstract: The gold standard for the identification of causal effects is the randomized controlled trial (RCT), but RCTs may not always be feasible to conduct. When treatments depend on a threshold, however, such as the blood sugar threshold for diabetes diagnosis, we can still sometimes estimate causal effects with regression discontinuities (RDs). RDs are valid when units just above and below the threshold have the same distribution of covariates and thus no confounding in the presence of noise, establishing an as-if randomization. In practice, however, implementing RD studies can be difficult, as identifying treatment thresholds requires considerable domain expertise; furthermore, the thresholds may differ across subgroups (e.g., the blood sugar threshold for diabetes may differ across demographics), and ignoring these differences can lower statistical power. Finding the thresholds and to whom they apply is an important problem currently solved manually by domain experts, and data-driven approaches are needed when domain expertise is not sufficient. Here, we introduce Regression Discontinuity SubGroup Discovery (RDSGD), a machine-learning method that identifies statistically powerful and interpretable subgroups for RD thresholds. Using a medical claims dataset with over 60 million patients, we apply RDSGD to multiple clinical contexts and identify subgroups with increased compliance to treatment assignment thresholds.
As treatment thresholds matter for many diseases and policy decisions, RDSGD can be a powerful tool for discovering new avenues for causal estimation.

URL: https://openreview.net/forum?id=cdRYoTyHZh

---

Title: Invariant Structure Learning for Better Generalization and Causal Explainability

Authors: Yunhao Ge, Sercan O Arik, Jinsung Yoon, Ao Xu, Laurent Itti, Tomas Pfister

Abstract: Learning the causal structure behind data is invaluable for improving generalization and obtaining high-quality explanations. Towards this end, we propose a novel framework, Invariant Structure Learning (ISL), that is designed to improve causal structure discovery by utilizing generalization as an indicator in the process. ISL splits the data into different environments, and learns a structure that is invariant to the target across different environments by imposing a consistency constraint. The proposed aggregation mechanism then selects the classifier based on a graph structure that reflects the causal mechanisms in the data more accurately compared to the structures learnt from individual environments. Furthermore, we extend ISL to a self-supervised learning setting, where accurate causal structure discovery does not rely on any labels. Self-supervised ISL utilizes proposals for invariant causality, by iteratively setting different nodes as targets. On synthetic and real-world datasets, we demonstrate that ISL accurately discovers the causal structure, outperforms alternative methods, and yields superior generalization for datasets with significant distribution shifts.

URL: https://openreview.net/forum?id=A9yn7KTwsK

---

Title: Data pruning and neural scaling laws: fundamental limitations of score-based algorithms

Authors: Fadhel Ayed, Soufiane Hayou

Abstract: Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results (Guo, B. Zhao, and Bai, 2022) reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of 30% or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; see (Sorscher et al., 2022), where the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate “No Free Lunch” theorems for data pruning and discuss potential solutions to these limitations.
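
A minimal illustration of the score-based pruning family analyzed above: rank examples by a scalar score (here, simply the per-sample loss of a cheaply trained model) and keep only the highest-scoring fraction, then compare against a random subset of the same size. The score choice and the 30% keep-rate are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch: score-based data pruning vs. random pruning in the
# high-compression regime (keep 30% of the data). Illustrative only.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
keep = int(0.3 * len(X_tr))                      # high-compression regime

# Score each training example by its per-sample loss under a cheap model.
scorer = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
probs = scorer.predict_proba(X_tr)
losses = -np.log(probs[np.arange(len(y_tr)), y_tr] + 1e-12)

score_idx = np.argsort(losses)[-keep:]           # keep the hardest examples
rand_idx = np.random.default_rng(0).choice(len(X_tr), keep, replace=False)

for name, idx in [("score-based", score_idx), ("random", rand_idx)]:
    acc = LogisticRegression(max_iter=2000).fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)
    print(f"{name:12s} pruning, keep 30%: test acc = {acc:.3f}")
```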

URL: https://openreview.net/forum?id=iRTL4pDavo

---

Title: Offline Reinforcement Learning with Additional Covering Distributions

Authors: Chenjie Mao

Abstract: We study learning optimal policies from a logged dataset, i.e., offline RL, with function general approximation. Despite the efforts devoted, existing algorithms with theoretic finite-sample guarantees typically assume exploratory data coverage or strong realizable function classes (e.g., Bellman-completeness), which is hard to be satisfied in reality. While there are recent works that successfully tackle these strong assumptions, they either require the gap assumptions that could only be satisfied by part of MDPs or use the behavior regularization that makes the optimality of learned policy even intractable. To solve this challenge, we provide finite-sample guarantees for a simple algorithm based on marginalized importance sampling (MIS), showing that sample-efficient offline RL for general MDPs is possible with only a partial coverage dataset (instead of assuming a dataset covering all possible policies) and weak realizable function classes (assuming function classes containing simply one function) given additional side information of a covering distribution. We demonstrate that the covering distribution trades off prior knowledge of the optimal trajectories against the coverage requirement of the dataset, revealing the effect of this inductive bias in the learning processes. Furthermore, when considering the exploratory dataset, our analysis shows that only realizable function classes are enough for learning near-optimal policies, even with no side information on the additional coverage distributions.

URL: https://openreview.net/forum?id=AfXq3x3X16

---

Title: Online model selection by learning how compositional kernels evolve

Authors: Eura Shin, Predrag Klasnja, Susan Murphy, Finale Doshi-Velez

Abstract: Motivated by the need for efficient, personalized learning in health, we investigate the problem of online compositional kernel selection for multi-task Gaussian Process regression. Existing composition selection methods do not satisfy our strict criteria in health; selection must occur quickly, and the selected kernels must maintain the appropriate level of complexity, sparsity, and stability as data arrives online. We introduce the Kernel Evolution Model (KEM), a generative process on how to evolve kernel compositions in a way that manages the bias--variance trade-off as we observe more data about a user. Using pilot data, we learn a set of kernel evolutions that can be used to quickly select kernels for new test users. KEM reliably selects high-performing kernels for a range of synthetic and real data sets, including two health data sets.

URL: https://openreview.net/forum?id=23WZFQBUh5

---

Title: NOFLITE: Learning to Predict Individual Treatment Effect Distributions

Authors: Toon Vanderschueren, Jeroen Berrevoets, Wouter Verbeke

Abstract: Estimating the effect of a treatment on an individual's outcome of interest is an important challenge in various fields, such as healthcare, economics, marketing, and education. Previous work in machine learning has focused on estimating the expected value of the treatment effect. However, effective personalized decision-making requires more than just the treatment's expected effect; it requires knowing the entire treatment effect distribution. Knowing this distribution allows analyzing the treatment's expected utility or quantifying the uncertainty regarding a treatment's effect. This information is essential for prescribing optimal treatments. The ability of a model to predict accurate individual treatment effect distributions is captured by its likelihood. In light of this, we propose a novel neural architecture, NOFLITE, that uses normalizing flows to directly optimize this likelihood, while simultaneously learning flexible estimates of the individual treatment effect distribution. Experiments on various semi-synthetic data sets show that NOFLITE outperforms existing methods in terms of log-likelihood. Moreover, we illustrate how the predicted distributions can enable an in-depth analysis of the treatment effect and more accurate decision-making.

URL: https://openreview.net/forum?id=EjqopDxLbG

---

Title: Stochastic Mirror Descent: Convergence Analysis and Adaptive Variants via the Mirror Stochastic Polyak Stepsize

Authors: Ryan D'Orazio, Nicolas Loizou, Issam H. Laradji, Ioannis Mitliagkas

Abstract: We investigate the convergence of stochastic mirror descent (SMD) under interpolation in relatively smooth and smooth convex optimization. In relatively smooth convex optimization we provide new convergence guarantees for SMD with a constant stepsize. For smooth convex optimization we propose a new adaptive stepsize scheme --- the mirror stochastic Polyak stepsize (mSPS). Notably, our convergence results in both settings do not make bounded gradient assumptions or bounded variance assumptions, and we show convergence to a neighborhood that vanishes under interpolation. Consequently, these results correspond to the first convergence guarantees under interpolation for the exponentiated gradient algorithm for fixed or adaptive stepsizes. mSPS generalizes the recently proposed stochastic Polyak stepsize (SPS) (Loizou et al. 2021) to mirror descent and remains both practical and efficient for modern machine learning applications while inheriting the benefits of mirror descent. We complement our results with experiments across various supervised learning tasks and different instances of SMD, demonstrating the effectiveness of mSPS.
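
For reference, the (capped) stochastic Polyak stepsize of Loizou et al. (2021), which mSPS generalizes, can be written as below, where $f_{i_t}$ is the loss on the sampled mini-batch, $f_{i_t}^*$ its minimum value (often $0$ under interpolation), $c > 0$ a constant, and $\gamma_b$ an upper bound on the stepsize. The mirror variant in the paper replaces the Euclidean geometry with that induced by the mirror map, so the exact mSPS expression may differ from this sketch.

$$\gamma_t = \min\left\{ \frac{f_{i_t}(x_t) - f_{i_t}^*}{c\,\lVert \nabla f_{i_t}(x_t) \rVert_2^2},\; \gamma_b \right\}.$$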

URL: https://openreview.net/forum?id=28bQiPWxHl

---

Title: GraphPNAS: Learning Probabilistic Graph Generators for Neural Architecture Search

Authors: Muchen Li, Jeffrey Yunfan Liu, Leonid Sigal, Renjie Liao

Abstract: Neural architectures can be naturally viewed as computational graphs. Motivated by this perspective, we study neural architecture search (NAS) through the lens of learning graph generative models. In contrast to existing NAS methods, which largely focus on searching for a single best architecture, i.e., point estimation, we propose GraphPNAS, a deep graph generative model that learns a distribution of well-performing architectures. Relying on graph neural networks (GNNs), our GraphPNAS can better capture topologies of good neural architectures and the relations between operators therein. Moreover, our graph generator leads to a learnable probabilistic search method that is more flexible and efficient than the commonly used RNN generator and random search methods. Finally, we learn our generator via an efficient reinforcement learning formulation for NAS. To assess the effectiveness of GraphPNAS, we conduct extensive experiments on four search spaces, including the challenging RandWire on TinyImageNet, ENAS on CIFAR10, and NAS-Bench-101/201. We show that our proposed graph generator consistently outperforms the RNN-based one and achieves better or comparable performance relative to state-of-the-art NAS methods.

URL: https://openreview.net/forum?id=ok18jj7cam

---

Title: Provably Personalized and Robust Federated Learning

Authors: Mariel Werner, Lie He, Michael Jordan, Martin Jaggi, Sai Praneeth Karimireddy

Abstract: Clustering clients with similar objectives and learning a model per cluster is an intuitive and interpretable approach to personalization in federated learning. However, doing so with provable and optimal guarantees has remained an open challenge. In this work, we formalize personalized federated learning as a stochastic optimization problem. We propose simple clustering-based algorithms which iteratively identify and train within clusters, using local client gradients. Our algorithms have optimal convergence rates which asymptotically match those obtained if we knew the true underlying clustering of the clients, and are provably robust in the Byzantine setting where some fraction of the clients are malicious.

URL: https://openreview.net/forum?id=B0uBSSUy0G

---

Title: Conditional Sampling of Variational Autoencoders via Iterated Approximate Ancestral Sampling

Authors: Vaidotas Simkus, Michael U. Gutmann

Abstract: Conditional sampling of variational autoencoders (VAEs) is needed in various applications, such as missing data imputation, but is computationally intractable. A principled choice for asymptotically exact conditional sampling is Metropolis-within-Gibbs (MWG). However, we observe that the tendency of VAEs to learn a structured latent space, a commonly desired property, can cause the MWG sampler to get “stuck” far from the target distribution. This paper mitigates the limitations of MWG: we systematically outline the pitfalls in the context of VAEs, propose two original methods that address these pitfalls, and demonstrate an improved performance of the proposed methods on a set of sampling tasks.

URL: https://openreview.net/forum?id=I5sJ6PU6JN

---

Title: Rewiring with Positional Encodings for Graph Neural Networks

Authors: Rickard Brüel Gabrielsson, Mikhail Yurochkin, Justin Solomon

Abstract: Several recent works use positional encodings to extend the receptive fields of graph neural network (GNN) layers equipped with attention mechanisms. These techniques, however, extend receptive fields to the complete graph, at substantial computational cost and risking a change in the inductive biases of conventional GNNs, or require complex architecture adjustments. As a conservative alternative, we use positional encodings to expand receptive fields to r-hop neighborhoods. More specifically, our method augments the input graph with additional nodes/edges and uses positional encodings as node and/or edge features. We thus modify graphs before inputting them to a downstream GNN model, instead of modifying the model itself. This makes our method model-agnostic, i.e., compatible with any of the existing GNN architectures. We also provide examples of positional encodings that are lossless with a one-to-one map between the original and the modified graphs. We demonstrate that extending receptive fields via positional encodings and a virtual fully-connected node significantly improves GNN performance and alleviates over-squashing using small r. We obtain improvements on a variety of models and datasets and reach competitive performance using traditional GNNs or graph Transformers.
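
A minimal sketch of the kind of model-agnostic rewiring described above: for each node, add edges to its r-hop neighborhood and attach the shortest-path (hop) distance as an edge feature, one simple choice of lossless positional encoding. The library calls are standard networkx; the exact encodings and virtual-node construction used in the paper may differ.

```python
# Hedged sketch: expand receptive fields to r-hop neighborhoods by adding
# edges annotated with shortest-path distance as a positional encoding.
import networkx as nx

def rewire_with_positional_encodings(G: nx.Graph, r: int) -> nx.Graph:
    H = G.copy()
    for u in G.nodes:
        # Distances to all nodes within r hops of u.
        dists = nx.single_source_shortest_path_length(G, u, cutoff=r)
        for v, d in dists.items():
            if v == u:
                continue
            # Add (or keep) the edge and store the hop distance as a feature.
            H.add_edge(u, v, hop_distance=d)
    return H

G = nx.cycle_graph(8)                          # toy input graph
H = rewire_with_positional_encodings(G, r=3)
print(G.number_of_edges(), "->", H.number_of_edges(), "edges after rewiring")
```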

URL: https://openreview.net/forum?id=dn3ZkqG2YV

---

Title: A Robust Backpropagation-Free Framework for Images

Authors: Timothy Zee, Alex Ororbia, Ankur Mali, Ifeoma Nwogu

Abstract: While current deep learning algorithms have been successful for a wide variety of artificial intelligence (AI) tasks, including those involving structured image data, they present deep neurophysiological conceptual issues due to their reliance on the gradients computed by backpropagation of errors (backprop). Gradients are required to obtain synaptic weight adjustments, but backward propagation requires knowledge of feedforward activities, a biologically implausible process. This is known as the "weight transport problem". Therefore, in this work, we present a more biologically plausible approach towards solving the weight transport problem for image data. This approach, which we name the error-kernel driven activation alignment (EKDAA) algorithm, accomplishes this through the introduction of locally derived error transmission kernels and error maps. Like standard deep learning networks, EKDAA performs the standard forward process via weights and activation functions; however, its backward error computation involves adaptive error kernels that propagate local error signals through the network. The efficacy of EKDAA is demonstrated by performing visual-recognition tasks on the Fashion MNIST, CIFAR-10 and SVHN benchmarks, along with demonstrating its ability to extract visual features from natural color images. Furthermore, in order to demonstrate its non-reliance on gradient computations, results are presented for an EKDAA-trained CNN that employs a non-differentiable activation function.

URL: https://openreview.net/forum?id=leqr0vQzeN

---

Title: Minorization-Maximization for Learning Determinantal Point Processes

Authors: Takahiro Kawashima, Hideitsu Hino

Abstract: A determinantal point process (DPP) is a powerful probabilistic model that generates diverse random subsets from a ground set. Since a DPP is characterized by a positive definite kernel, a DPP on a finite ground set can be parameterized by a kernel matrix. Recently, DPPs have gained attention in the machine learning community and have been applied to various practical problems; however, there is still room for further research on the learning of DPPs. In this paper, we propose a simple learning rule for full-rank DPPs based on a minorization-maximization (MM) algorithm, which monotonically increases the likelihood in each iteration. We show that the minorizer in our MM algorithm provides a locally tighter lower bound than that of an existing method. We also generalize the algorithm for further acceleration. In our experiments on both synthetic and real-world datasets, our method outperforms existing methods in most settings. Our code is available at https://github.com/ISMHinoLab/DPPMMEstimation.
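
As background for the objective being maximized, a full-rank DPP with kernel matrix $L \succ 0$ on a finite ground set assigns a subset $A$ the probability below, and maximum-likelihood learning from observed subsets $A_1, \dots, A_N$ maximizes the corresponding average log-likelihood (standard formulation from the DPP literature; the paper's contribution is an MM minorizer of this non-concave objective, which is not reproduced here).

$$\mathcal{P}_L(A) = \frac{\det(L_A)}{\det(L + I)}, \qquad \ell(L) = \frac{1}{N} \sum_{n=1}^{N} \log\det(L_{A_n}) - \log\det(L + I).$$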

URL: https://openreview.net/forum?id=65AzNvY73Q

---

Title: Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization

Authors: Runlong Zhou, Zelin He, Yuandong Tian, Yi Wu, Simon Shaolei Du

Abstract: Over recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study of why RL helps is still in its early stages. This paper presents the first systematic study of policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Best Choice Problem (BCP), we formally prove that the distribution shift is reduced exponentially with curriculum learning, even if the curriculum is a randomly generated BCP on a smaller scale. Our theory also shows that we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Best Choice Problem, Online Knapsack, and AdWords to verify our findings.

URL: https://openreview.net/forum?id=gKEbBKRUjA

---

Title: Training DNNs Resilient to Adversarial and Random Bit-Flips by Learning Quantization Ranges

Authors: Kamran Chitsaz, Goncalo Mordido, Jean-Pierre David, François Leduc-Primeau

Abstract: Promoting robustness in deep neural networks (DNNs) is crucial for their reliable deployment in uncertain environments, such as low-power settings or in the presence of adversarial attacks. In particular, bit-flip weight perturbations in quantized networks can significantly degrade performance, underscoring the need to improve DNN resilience. In this paper, we introduce a training mechanism to learn the quantization range of different DNN layers to enhance DNN robustness against bit-flip errors on the model parameters. The proposed approach, called weight clipping-aware training (WCAT), minimizes the quantization range while preserving performance, striking a balance between the two.
Our experimental results on different models and datasets showcase that DNNs trained with WCAT can tolerate a high amount of noise while keeping the accuracy close to the baseline model. Moreover, we show that our method significantly enhances DNN robustness against adversarial bit-flip attacks. Finally, when considering the energy-reliability trade-off inherent in on-chip SRAM memories, we observe that WCAT consistently improves the Pareto frontier of test accuracy and energy consumption across diverse models.
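
The motivation for learning a tight quantization range can be seen in a small numeric sketch: the error caused by flipping one bit of an int8-quantized weight scales with the quantization step, i.e., with the clipping range. This illustrates the failure mode only, not the authors' training procedure; the ranges and bit position below are arbitrary assumptions.

```python
# Hedged sketch: bit-flip error magnitude vs. quantization (clipping) range.
import numpy as np

def quantize(w, w_max, bits=8):
    """Symmetric uniform quantization of w into signed 8-bit integers."""
    scale = w_max / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(w / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.astype(np.int8), scale

def flip_bit(q, bit):
    """Flip one bit of int8 values, interpreted as raw two's-complement bytes."""
    return (q.view(np.uint8) ^ np.uint8(1 << bit)).view(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=1000)             # typical small DNN weights
for w_max in (0.2, 1.0):                       # tight vs. loose clipping range
    q, scale = quantize(w, w_max)
    flipped = flip_bit(q, 6)                   # flip a high-order bit
    err = np.abs((flipped.astype(np.float32) - q.astype(np.float32)) * scale)
    print(f"range +/-{w_max}: mean |weight error| after bit flip = {err.mean():.4f}")
```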

URL: https://openreview.net/forum?id=BxjHMPwZIH

---

Title: Feature-Attending Recurrent Modules for Generalization in Reinforcement Learning

Authors: Wilka Torrico Carvalho, Andrew Kyle Lampinen, Kyriacos Nikiforou, Felix Hill, Murray Shanahan

Abstract: Many important tasks are defined in terms of objects. To generalize across these tasks, a reinforcement learning (RL) agent needs to exploit the structure that the objects induce. Prior work has either hard-coded object-centric features, used complex object-centric generative models, or updated state using local spatial features. However, these approaches have had limited success in enabling general RL agents. Motivated by this, we introduce “Feature-Attending Recurrent Modules” (FARM), an architecture for learning state representations that relies on simple, broadly applicable inductive biases for capturing spatial and temporal regularities. FARM learns a state representation that is distributed across multiple modules that each attend to spatiotemporal features with an expressive feature attention mechanism. We show that this improves an RL agent’s ability to generalize across object-centric tasks. We study task suites in both 2D and 3D environments and find that FARM generalizes better than competing architectures that leverage attention or multiple modules.

URL: https://openreview.net/forum?id=j4y3gN7VtW

---

Title: Achieving Risk Control in Online Learning Settings

Authors: Shai Feldman, Liran Ringel, Stephen Bates, Yaniv Romano

Abstract: To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk---such as coverage of confidence intervals, false negative rate, or F1 score---in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion.
The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost.
We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks.
To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks.
Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.
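
A minimal sketch in the spirit of online risk control, here for marginal coverage of prediction intervals: after each time step, widen the interval if it missed the label and shrink it otherwise, so the long-run miscoverage tracks the target level even under distribution shift. The update rule, constants, and toy data below are illustrative assumptions, not the authors' exact algorithm.

```python
# Hedged sketch: online calibration of a prediction-interval width so that the
# long-run miscoverage tracks a target level alpha, even under distribution shift.
import numpy as np

alpha, gamma = 0.1, 0.05          # target miscoverage and learning rate
theta = 1.0                       # current half-width of the interval
rng = np.random.default_rng(0)
misses = 0

for t in range(1, 5001):
    x = rng.normal()
    y = 2.0 * x + rng.normal(scale=1.0 + (t > 2500))   # noise level shifts at t = 2500
    pred = 2.0 * x                                      # fixed base predictor
    miss = float(abs(y - pred) > theta)                 # did the interval miss y?
    misses += miss
    theta = max(1e-3, theta + gamma * (miss - alpha))   # widen on miss, shrink otherwise

print(f"empirical miscoverage: {misses / 5000:.3f} (target {alpha})")
```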

URL: https://openreview.net/forum?id=5Y04GWvoJu

---

Title: Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation

Authors: YiFan Zhang, Hanlin Zhang, Zachary Chase Lipton, Li Erran Li, Eric Xing

Abstract: Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, making strong parametric assumptions that are intractable in practical applications. Recent works use Multilayer Perceptrons (MLPs) for modeling causal relationships; however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generalizability. To extend beyond the single-domain formulation and towards more realistic learning scenarios, we explore model design spaces beyond MLPs, i.e., transformer backbones, which provide flexibility where attention layers govern interactions among treatments and covariates to exploit structural similarities of potential outcomes for confounding control. Through careful model design, Transformers as Treatment Effect Estimators (TransTEE) is proposed. We show empirically that TransTEE can: (1) serve as a general-purpose treatment effect estimator which significantly outperforms competitive baselines on a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable both when covariates are tabular and when they consist of structural data (e.g., texts, graphs); (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, explainability in covariate adjustment, and real-world utility in auditing pre-trained language models.

URL: https://openreview.net/forum?id=1kl4YM2Q7P

---

Title: Federated Learning under Partially Disjoint Data via Manifold Reshaping

Authors: Ziqing Fan, Jiangchao Yao, Ruipeng Zhang, Lingjuan Lyu, Yanfeng Wang, Ya Zhang

Abstract: Statistical heterogeneity severely limits the performance of federated learning (FL), motivating several approaches, e.g., FedProx, MOON and FedDyn, to alleviate this problem. Despite their effectiveness, the scenario they consider generally requires samples from almost all classes during the local training of each client, although some covariate shift may exist among clients. In fact, the natural case of partially class-disjoint data (PCDD), where each client contributes a few classes (instead of all classes) of samples, is practical yet underexplored. Specifically, the unique collapse and invasion characteristics of PCDD can induce a biased optimization direction in local training, which hampers the efficiency of federated learning. To address this dilemma, we propose a manifold reshaping approach called FedMR to calibrate the feature space of local training. FedMR adds two interplaying losses to vanilla federated learning: an intra-class loss that decorrelates feature dimensions to counteract collapse, and an inter-class loss that guarantees a proper margin among categories during feature expansion. We conduct extensive experiments on a range of datasets to demonstrate that FedMR achieves much higher accuracy and better communication efficiency.

URL: https://openreview.net/forum?id=jLJTqJXAG7

---

Title: Improving Continual Learning by Accurate Gradient Reconstructions of the Past

Authors: Erik Daxberger, Siddharth Swaroop, Kazuki Osawa, Rio Yokota, Richard E Turner, José Miguel Hernández-Lobato, Mohammad Emtiyaz Khan

Abstract: Weight-regularization and experience replay are two popular continual-learning strategies with complementary strengths: while weight-regularization requires less memory, replay can more accurately mimic batch training. How can we combine them to get better methods? Despite the simplicity of the question, little is known or done to optimally combine these approaches. In this paper, we present such a method by using a recently proposed principle of adaptation that relies on a faithful reconstruction of the gradients of the past data. Using this principle, we design a prior which combines two types of replay methods with a quadratic weight-regularizer and achieves better gradient reconstructions. The combination improves performance on standard task-incremental continual learning benchmarks such as Split-CIFAR, SplitTinyImageNet, and ImageNet-1000, achieving $>\!80\%$ of the batch performance by simply utilizing a memory of $<\!10\%$ of the past data. Our work shows that a good combination of the two strategies can be very effective in reducing forgetting.

URL: https://openreview.net/forum?id=b1fpfCjja1

---

Title: Synthetic Data from Diffusion Models Improves ImageNet Classification

Authors: Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, David J. Fleet

Abstract: Deep generative models are becoming increasingly powerful, now generating diverse, high fidelity, photo-realistic samples given text prompts. Nevertheless, samples from such models have not been shown to significantly improve model training for challenging and well-studied discriminative tasks like ImageNet classification. In this paper we show that augmenting the ImageNet training set with samples from a generative diffusion model can yield substantial improvements in ImageNet classification accuracy over strong ResNet and Vision Transformer baselines. To this end we explore the fine-tuning of large-scale text-to-image diffusion models, yielding class-conditional ImageNet models with state-of-the-art FID score (1.76 at 256×256 resolution) and Inception Score (239 at 256×256). The model also yields a new state-of-the-art in Classification Accuracy Scores, i.e., ImageNet test accuracy for a ResNet-50 architecture trained solely on synthetic data (64.96 top-1 accuracy for 256×256 samples, improving to 69.24 for 1024×1024 samples). Adding up to three times as many synthetic samples as real training samples consistently improves ImageNet classification accuracy across multiple architectures.

URL: https://openreview.net/forum?id=DlRsoxjyPm

---

Title: ILPO-MP: Mode Priors Prevent Mode Collapse when Imitating Latent Policies from Observations

Authors: Oliver Struckmeier, Ville Kyrki

Abstract: Imitation learning from observations (IfO) constrains the classic imitation learning setting to cases where expert observations are easy to obtain, but no expert actions are available. Most existing IfO methods require access to task-specific cost functions or many interactions with the target environment. Learning a forward dynamics model in combination with a latent policy has been shown to solve these issues. However, the limited supervision in the IfO scenario can lead to mode collapse when learning the generative forward dynamics model and the corresponding latent policy. In this paper, we analyze the mode collapse problem in this setting and show that it is caused by a combination of deterministic expert data and bad initialization of the models. Under the assumption of piecewise continuous system dynamics, we propose ILPO-MP, a method to prevent the mode collapse using clustering of expert transitions to impose a mode prior on the generative model and the latent policy. We show that ILPO-MP prevents mode collapse and improves performance in a variety of environments.

URL: https://openreview.net/forum?id=f3JLnnZsAm

---

Title: Complementary Sparsity: Accelerating Sparse CNNs with High Accuracy on General-Purpose Computing Platforms

Authors: Kang Zhao, Yijun Tan, Kai Han, Ting Hu, Hanting Chen, Tao Yuan, Yunhe Wang, Jun Yao

Abstract: Model sparsity is a promising approach to reducing the parameters or FLOPs of convolutional neural networks (CNNs). Compared to unstructured or coarse-grained structured sparsity, fine-grained structured sparsity, e.g., the N:M sparse pattern, can achieve a better balance between accuracy and efficiency on general computing platforms like CPUs and GPUs. In particular, 2:4 sparsity can accelerate CNN inference by 2$\times$ with a negligible accuracy drop. However, N:M sparsity requires dedicated hardware circuits on the GPU and hardly achieves significant speedups on common GPUs. To accelerate CNNs with general-purpose computing resources while retaining model accuracy as much as possible, this paper proposes complementary sparsity (CS). CS denotes that only one weight can be retained among weights spaced at the same distance. On the one hand, CS features high mask flexibility, which is naturally favorable to high model accuracy. Moreover, we propose a CS-specific sparse training method to improve the accuracy of CS-based CNNs under high parameter sparsities ($>$75\%). On the other hand, CS itself is memory-access balanced and robust to pattern hyperparameters, which can be exploited to speed up CS-based convolution computation on CPUs and common GPUs. We thus propose a CS convolution parallel computing algorithm that adapts to common GPUs without sparse tensor cores. Experimental results show that, compared to other sparsity patterns, the proposed CS achieves the optimal trade-off between accuracy and latency for CPUs and common GPUs, respectively. Codes will be available at https://gitee.com/mindspore/models/tree/master/research/cv/CS.
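
For readers unfamiliar with the baseline pattern discussed above, the sketch below applies N:M (here 2:4) magnitude pruning: in every group of M consecutive weights, only the N largest-magnitude weights are kept. This illustrates the N:M pattern only; the proposed complementary-sparsity (CS) pattern constrains weights differently and is not reproduced here.

```python
# Hedged sketch: N:M fine-grained structured sparsity (here 2:4) by magnitude.
import numpy as np

def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m consecutive weights."""
    flat = weights.reshape(-1, m)
    mask = np.zeros_like(flat, dtype=bool)
    topk = np.argsort(np.abs(flat), axis=1)[:, -n:]      # indices of the n largest
    np.put_along_axis(mask, topk, True, axis=1)
    return (flat * mask).reshape(weights.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))            # toy weight matrix; last dim divisible by m
W_sparse = nm_prune(W, n=2, m=4)
print("kept fraction:", np.mean(W_sparse != 0))   # -> 0.5 for a 2:4 pattern
```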

URL: https://openreview.net/forum?id=g1B4qgOw79

---

Title: Finding Neurons in a Haystack: Case Studies with Sparse Probing

Authors: Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas

Abstract: Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale. With $k=1$, we localize individual neurons that are highly relevant for a particular feature and perform a number of case studies to illustrate general properties of LLMs. In particular, we show that early layers make use of sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons to represent higher-level contextual features, and that increasing scale causes representational sparsity to increase on average, but there are multiple types of scaling dynamics.
In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.
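
A minimal sketch of the k-sparse probing setup described above, on synthetic stand-in activations: select the k activations most predictive of a binary feature, then fit a linear probe restricted to them. The selection criterion, probe, and planted feature below are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: k-sparse probing of hidden activations with a planted feature.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))                    # stand-in for LLM neuron activations
feature = (acts[:, 137] + 0.5 * rng.normal(size=2000) > 0).astype(int)   # planted binary feature

X_tr, X_te, y_tr, y_te = train_test_split(acts, feature, random_state=0)
for k in (1, 4, 16):
    selector = SelectKBest(f_classif, k=k).fit(X_tr, y_tr)
    probe = LogisticRegression(max_iter=1000).fit(selector.transform(X_tr), y_tr)
    acc = probe.score(selector.transform(X_te), y_te)
    print(f"k={k:3d}: probe accuracy {acc:.3f}, neurons {selector.get_support(indices=True)[:5]}")
```

With k=1 the probe should localize the single planted neuron (index 137), mirroring the paper's use of k=1 probes to localize feature-relevant neurons.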

URL: https://openreview.net/forum?id=JYs1R9IMJr

---

Title: SIESTA: Efficient Online Continual Learning with Sleep

Authors: Md Yousuf Harun, Jhair Gallardo, Tyler L. Hayes, Ronald Kemker, Christopher Kanan

Abstract: In supervised continual learning, a deep neural network (DNN) is updated with an ever-growing data stream. Unlike the offline setting where data is shuffled, we cannot make any distributional assumptions about the data stream. Ideally, only one pass through the dataset is needed for computational efficiency. However, existing methods are inadequate and make many assumptions that cannot be made for real-world applications, while simultaneously failing to improve computational efficiency. In this paper, we propose SIESTA, a novel continual learning method based on a wake/sleep training framework that is well aligned with the needs of on-device learning. The major goal of SIESTA is to advance compute-efficient continual learning so that DNNs can be updated efficiently using far less time and energy. The principal innovations of SIESTA are: 1) rapid online updates using a rehearsal-free, backpropagation-free, and data-driven network update rule during its wake phase, and 2) expedited memory consolidation using a compute-restricted rehearsal policy during its sleep phase. For memory efficiency, SIESTA adapts latent rehearsal using memory indexing from REMIND. Compared to REMIND and prior art, SIESTA is far more computationally efficient, enabling continual learning on ImageNet-1K in under 2 hours on a single GPU; moreover, in the augmentation-free setting it matches the performance of the offline learner, a milestone critical to driving adoption of continual learning in real-world applications.

URL: https://openreview.net/forum?id=MqDVlBWRRV

---

Title: Inducing Meaningful Units from Character Sequences with Dynamic Capacity Slot Attention

Authors: Melika Behjati, James Henderson

Abstract: Characters do not convey meaning, but sequences of characters do. We propose an unsupervised distributional method to learn the abstract meaning-bearing units in a sequence of characters. Rather than segmenting the sequence, our Dynamic Capacity Slot Attention model discovers continuous representations of the objects in the sequence, extending an architecture for object discovery in images. We train our model on different languages and evaluate the quality of the obtained representations with forward and reverse probing classifiers. These experiments show that our model succeeds in discovering units which are similar to those proposed previously in form, content, and level of abstraction, and which show promise for capturing meaningful information at a higher level of abstraction.

URL: https://openreview.net/forum?id=m8U9rSs6gU

---

Title: Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Authors: Wenhu Chen, Xueguang Ma, Xinyi Wang, William W. Cohen

Abstract: Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thought prompting (CoT) is the state-of-the-art method for many of these tasks. CoT uses language models to produce text describing reasoning and computation, and finally the answer to a question. Here we propose `Program of Thoughts' (PoT), which uses language models (mainly Codex) to generate text and programming language statements, and finally an answer. In PoT, the computation can be delegated to a program interpreter, which is used to execute the generated program, thus decoupling complex computation from reasoning and language understanding. We evaluate PoT on five math word problem datasets and three financial-QA datasets in both few-shot and zero-shot settings. We find that PoT has an average performance gain over CoT of around 12% across all datasets. By combining PoT with self-consistency decoding, we achieve extremely strong performance on all the math and financial datasets. All of our data and code will be released.
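
A minimal sketch of the PoT-style execution loop described above: the model is asked to emit Python that stores its result in `ans`, and the interpreter, not the model, does the arithmetic. The `generate` function is a hypothetical stand-in for an LLM completion call; the prompt format and variable name are illustrative, not the paper's exact templates.

```python
# Hedged sketch: Program-of-Thoughts-style prompting with computation delegated
# to the Python interpreter. `generate` is a placeholder for an LLM call.
def generate(prompt: str) -> str:
    # Placeholder: return a canned program for the demo instead of calling an LLM.
    return "principal = 35_000\nrate = 0.07\nyears = 3\nans = principal * (1 + rate) ** years"

question = "If $35,000 grows at 7% per year for 3 years, what is the final amount?"
prompt = (
    "Write Python that computes the answer to the question and stores it in `ans`.\n"
    f"Question: {question}\n# Python:\n"
)
program = generate(prompt)

namespace: dict = {}
exec(program, namespace)            # delegate the computation to the interpreter
print(f"answer: {namespace['ans']:.2f}")
```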

URL: https://openreview.net/forum?id=YfZ4ZPt8zd

---

Title: DP-LFlow: Differentially Private Latent Flow for Scalable Sensitive Image Generation

Authors: Dihong Jiang, Sun Sun

Abstract: Privacy concerns grow with the success of modern deep learning models, especially when the training set contains sensitive data. Differentially private generative models (DPGMs) can serve as a solution to circumvent such concerns by generating data that are distributionally similar to the original data yet carry differential privacy (DP) guarantees. While GANs have attracted major attention, existing DPGMs based on flow generative models are limited and have only been developed on low-dimensional tabular datasets. The capability of exact density estimation makes the flow model exceptional when density estimation is of interest. In this work, we first show that it is challenging (or even infeasible) to train a DP-flow via DP-SGD, i.e., the workhorse algorithm for private deep learning, on high-dimensional image sets with acceptable utility, and we then give an effective solution by reducing the generation from the pixel space to a lower-dimensional latent space. We show the effectiveness and scalability of the proposed method via extensive experiments, where the proposed method achieves a significantly better privacy-utility trade-off compared to existing alternatives. Notably, our method is the first DPGM to scale to high-resolution image sets (up to 256 × 256). Our code is available at https://github.com/dihjiang/DP-LFlow.

URL: https://openreview.net/forum?id=GEcneTl9Mk

---

Title: Binary Classification under Local Label Differential Privacy Using Randomized Response Mechanisms

Authors: Shirong XU, Chendi Wang, Will Wei Sun, Guang Cheng

Abstract: Label differential privacy is a popular branch of $\epsilon$-differential privacy for protecting labels in training datasets with non-private features. In this paper, we study the generalization performance of a binary classifier trained on a dataset privatized under the label differential privacy achieved by the randomized response mechanism. Particularly, we establish minimax lower bounds for the excess risks of the deep neural network plug-in classifier, theoretically quantifying how privacy guarantee $\epsilon$ affects its generalization performance. Our theoretical result shows: (1) the randomized response mechanism slows down the convergence of excess risk by lessening the multiplicative constant term compared with the non-private case $(\epsilon=\infty)$; (2) as $\epsilon$ decreases, the optimal structure of the neural network should be smaller for better generalization performance; (3) the convergence of its excess risk is guaranteed even if $\epsilon$ is adaptive to the size of training sample $n$ at a rate slower than $O(n^{-1/2})$. Our theoretical results are validated by extensive simulated examples and two real applications.
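
For context, the binary randomized response mechanism studied above is simple to state: each label is kept with probability $e^{\epsilon}/(1 + e^{\epsilon})$ and flipped otherwise, which satisfies $\epsilon$-label differential privacy. The sketch below implements only this standard mechanism; the downstream plug-in classifier and the paper's theoretical analysis are not reproduced.

```python
# Hedged sketch: binary randomized response for label differential privacy.
import numpy as np

def randomized_response(labels: np.ndarray, eps: float, rng) -> np.ndarray:
    """Keep each binary label w.p. e^eps / (1 + e^eps), flip it otherwise."""
    p_keep = np.exp(eps) / (1.0 + np.exp(eps))
    keep = rng.random(labels.shape) < p_keep
    return np.where(keep, labels, 1 - labels)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)
for eps in (0.5, 1.0, 4.0):
    y_priv = randomized_response(y, eps, rng)
    print(f"eps={eps}: fraction of labels flipped = {np.mean(y_priv != y):.3f}")
```

Smaller $\epsilon$ flips more labels, which matches the abstract's observation that stronger privacy degrades the achievable generalization.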

URL: https://openreview.net/forum?id=uKCGOw9bGG

---

Title: Learn the Time to Learn: Replay Scheduling in Continual Learning

Authors: Marcus Klasson, Hedvig Kjellstrom, Cheng Zhang

Abstract: Replay methods are known to be successful at mitigating catastrophic forgetting in continual learning scenarios despite having limited access to historical data. However, storing historical data is cheap in many real-world settings, yet replaying all historical data is often prohibited due to processing time constraints. In such settings, we propose that continual learning systems should learn the time to learn and schedule which tasks to replay at different time steps. We first demonstrate the benefits of our proposal by using Monte Carlo tree search to find a proper replay schedule, and show that the found replay schedules can outperform fixed scheduling policies when combined with various replay methods in different continual learning settings. Additionally, we propose a framework for learning replay scheduling policies with reinforcement learning. We show that the learned policies can generalize better in new continual learning scenarios compared to equally replaying all seen tasks, without added computational cost. Our study reveals the importance of learning the time to learn in continual learning, which brings current research closer to real-world needs.

URL: https://openreview.net/forum?id=Q4aAITDgdP

---

Title: Neighborhood Gradient Mean: An Efficient Decentralized Learning Method for Non-IID Data

Authors: Sai Aparna Aketi, Sangamesh Kodge, Kaushik Roy

Abstract: Decentralized learning algorithms enable the training of deep learning models over large distributed datasets, without the need for a central server. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed (IID). In practical scenarios, the distributed datasets can have significantly different data distributions across the agents. This paper focuses on improving decentralized learning on non-IID data with minimal compute and memory overheads. We propose Neighborhood Gradient Mean (NGM), a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. In particular, the proposed method averages the local gradients with model-variant or data-variant cross-gradients based on the communication budget. Model-variant cross-gradients are derivatives of the received neighbors’ model parameters with respect to the local dataset. Data-variant cross-gradients are derivatives of the local model with respect to its neighbors’ datasets. The data-variant cross-gradients are aggregated through an additional communication round. We theoretically analyze the convergence characteristics of NGM and demonstrate its efficiency on non-IID data sampled from various vision and language datasets. Our experiments demonstrate that the proposed method either remains competitive or outperforms (by 0-6%) the existing state-of-the-art (SoTA) decentralized learning algorithm on non-IID data with significantly lower compute and memory requirements. Further, we show that the model-variant cross-gradient information available locally at each agent can improve the performance on non-IID data by 3-20% without additional communication costs.

URL: https://openreview.net/forum?id=vkiKzK5G3e

---

Title: Limitation of Characterizing Implicit Regularization by Data-independent Functions

Authors: Leyang Zhang, Zhi-Qin John Xu, Tao Luo, Yaoyu Zhang

Abstract: In recent years, understanding the implicit regularization of neural networks (NNs) has become a central task in deep learning theory. However, implicit regularization is itself not completely defined and well understood. In this work, we attempt to mathematically define and study implicit regularization. Importantly, we explore the limitations of a common approach to characterizing implicit regularization using data-independent functions. We propose two dynamical mechanisms, i.e., Two-point and One-point Overlapping mechanisms, based on which we provide two recipes for producing classes of one-hidden-neuron NNs that provably cannot be fully characterized by a type of or all data-independent functions. Following the previous works, our results further emphasize the profound data dependency of implicit regularization in general, inspiring us to study in detail the data dependency of NN implicit regularization in the future.

URL: https://openreview.net/forum?id=140kSqm0uy

---

Title: Population-based Evaluation in Repeated Rock-Paper-Scissors as a Benchmark for Multiagent Reinforcement Learning

Authors: Marc Lanctot, John Schultz, Neil Burch, Max Olan Smith, Daniel Hennes, Thomas Anthony, Julien Perolat

Abstract: Progress in fields of machine learning and adversarial planning has benefited significantly from benchmark domains, from checkers and the classic UCI data sets to Go and Diplomacy. In sequential decision-making, agent evaluation has largely been restricted to few interactions against experts, with the aim to reach some desired level of performance (e.g. beating a human professional player). We propose a benchmark for multiagent learning based on repeated play of the simple game Rock, Paper, Scissors along with a population of forty-three tournament entries, some of which are intentionally sub-optimal. We describe metrics to measure the quality of agents based both on average returns and exploitability. We then show that several RL, online learning, and language model approaches can learn good counter-strategies and generalize well, but ultimately lose to the top-performing bots, creating an opportunity for research in multiagent learning.

URL: https://openreview.net/forum?id=gQnJ7ODIAx

---

Title: Convergence of SGD for Training Neural Networks with Sliced Wasserstein Losses

Authors: Eloi Tanguy

Abstract: Optimal Transport has sparked vivid interest in recent years, in particular thanks to the Wasserstein distance, which provides a geometrically sensible and intuitive way of comparing probability measures. For computational reasons, the Sliced Wasserstein (SW) distance was introduced as an alternative to the Wasserstein distance, and has seen uses for training generative Neural Networks (NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed practically in such a setting, there is to our knowledge no theoretical guarantee for this observation. Leveraging recent works on convergence of SGD on non-smooth and non-convex functions by Bianchi et al. (2022), we aim to bridge that knowledge gap, and provide a realistic context under which fixed-step SGD trajectories for the SW loss on NN parameters converge. More precisely, we show that the trajectories approach the set of (sub)-gradient flow equations as the step decreases. Under stricter assumptions, we show a much stronger convergence result for noised and projected SGD schemes, namely that the long-run limits of the trajectories approach a set of generalised critical points of the loss function.
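
For readers unfamiliar with the loss in question, the Sliced Wasserstein distance averages one-dimensional Wasserstein distances over random projection directions, and for equal-size empirical measures each 1D distance reduces to a sorted comparison. The Monte Carlo estimator below is a standard illustration and is not tied to the paper's training setup.

```python
# Hedged sketch: Monte Carlo estimator of the Sliced Wasserstein distance between
# two empirical measures with equal sample counts.
import numpy as np

def sliced_wasserstein(x: np.ndarray, y: np.ndarray, n_proj: int = 200, p: int = 2,
                       rng=np.random.default_rng(0)) -> float:
    d = x.shape[1]
    thetas = rng.normal(size=(n_proj, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)   # directions on the unit sphere
    total = 0.0
    for theta in thetas:
        xp, yp = np.sort(x @ theta), np.sort(y @ theta)       # 1D projections
        total += np.mean(np.abs(xp - yp) ** p)                # 1D W_p^p for uniform weights
    return (total / n_proj) ** (1.0 / p)

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=(1000, 5))
b = rng.normal(0.5, 1.0, size=(1000, 5))
print(f"SW_2 estimate: {sliced_wasserstein(a, b):.3f}")
```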

URL: https://openreview.net/forum?id=aqqfB3p9ZA

---

Title: Not All Causal Inference is the Same

Authors: Matej Zečević, Devendra Singh Dhami, Kristian Kersting

Abstract: Neurally-parameterized Structural Causal Models in the Pearlian notion of causality, referred to as NCM, were recently introduced as a step towards next-generation learning systems. However, said NCM are only concerned with the learning aspect of causal inference and totally miss out on the architecture aspect. That is, actual causal inference within NCM is intractable in that the NCM won’t return an answer to a query in polynomial time. This insight follows as a corollary to the more general statement on the intractability of arbitrary structural causal model (SCM) parameterizations, which we prove in this work through a classical 3-SAT reduction. Since future learning algorithms will be required to deal with both high-dimensional data and highly complex mechanisms governing the data, we ultimately believe work on tractable inference for causality to be decisive. We also show that not all “causal” models are created equal. More specifically, there are models capable of answering causal queries that are not SCM, which we refer to as partially causal models (PCM). We provide a tabular taxonomy in terms of tractability properties for all of the different model families, namely correlation-based, PCM and SCM. To conclude our work, we also provide some initial ideas on how to overcome parts of the intractability of causal inference with SCM by showing an example of how parameterizing an SCM with SPN modules can at least allow for tractable mechanisms. With this work we hope that our insights can raise awareness for this novel research direction, since achieving success with causality in real-world downstream tasks will not only depend on learning correct models but also require having the practical ability to gain access to model inferences.

URL: https://openreview.net/forum?id=ySWQ6eXAKp

---


New submissions
===============


Title: Personalised Federated Learning On Heterogeneous Feature Spaces

Abstract: Personalised federated learning (FL) approaches assume that the raw data of all clients are defined in a common space, \emph{i.e.}, all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use {\em heterogeneous} data representations. To bridge the gap between the assumption of a shared subspace and the more realistic situation of client-specific spaces, we propose a general framework coined FLIC that maps clients' data onto a common feature space via local embedding functions, in a federated manner. Preservation of class information in the latent space is ensured by a distribution alignment with respect to a learned reference distribution. We provide the algorithmic details of FLIC as well as theoretical insights supporting the relevance of our methodology. We compare its performance against FL benchmarks involving heterogeneous input feature spaces. Notably, we are the first to present a successful application of FL to Brain-Computer Interface signals acquired with differing numbers of sensors.

URL: https://openreview.net/forum?id=xgKsuwT8iR

---

Title: Benchmarking Robustness of Text-Image Composed Retrieval

Abstract: Text-image composed retrieval aims to retrieve the target image through the composed query, which is specified in the form of an image plus some text that describes desired modifications to the input image. It has recently attracted attention due to its ability to leverage both information-rich images and concise language to precisely express the requirements for target images. However, the robustness of these approaches against real-world corruptions or deeper text understanding has never been studied. In this paper, we perform the first robustness study and establish three new diversified benchmarks for systematic analysis of text-image composed retrieval against natural corruptions in both vision and text, and to further probe textual understanding. For natural corruption analysis, we introduce two new large-scale benchmark datasets, CIRR-C and FashionIQ-C, for testing in the open domain and the fashion domain respectively, both of which apply 15 visual corruptions and 7 textual corruptions. For textual understanding analysis, we introduce a new diagnostic dataset, CIRR-D, by expanding the original raw data with synthetic data, which contains modified text to better probe textual understanding ability including numerical variation, attribute variation, object removal, background variation, and fine-grained evaluation.

URL: https://openreview.net/forum?id=2i3Mlx1Gg7

---

Title: Auto-Rotating Neural Networks: An Alternative Approach for Preventing Vanishing Gradients

Abstract: Neural networks with saturating activations are often avoided due to vanishing gradients. This problem is frequently tackled with Batch Normalization techniques, but we propose a different approach: the Auto-Rotation (AR). An existing AR-based method is the Auto-Rotating Perceptron (ARP), which enhances Rosenblatt's Perceptron and alleviates vanishing gradients by limiting the pre-activation to a region where the neurons do not saturate. However, this method is only defined for dense layers and requires additional hyperparameter tuning. In this paper, we present an extension of the ARP concept: Auto-Rotating Neural Networks (ARNN). With them, we obtain convolutional layers and learnable pre-activation saturation regions. In all of our experiments, the AR outperformed the Batch Normalization approach at preventing vanishing gradients. Our results also show that the AR enhances the performance of convolutional nets that use saturating activations, even allowing them to slightly outperform ReLU-activated models. Moreover, enabling the AR yields faster convergence and, due to reduced hyperparameter tuning, greater ease of use. Furthermore, our method experimentally produces much more uniform gradients across layers and more stable gradients across epochs. We expect our Auto-Rotating layers to be used in deeper models with saturating and non-saturating activations, since our approach prevents vanishing gradients and issues related to gradient continuity, such as those that occur with ReLUs.

URL: https://openreview.net/forum?id=Gg3JlR9btC

---

Title: Emergence of Grounded, Optimally Compositional Spatial Language among Homogeneous Agents

Abstract: A mechanism of effective communication is integral to human existence. An essential aspect of a functional communication scheme among a rational human population involves an efficient, unambiguous, adaptive, and coherent apparatus to convey one’s goal to others. Such an effective macro characteristic can emerge in a finite population through incremental learning via trial and error at the individual (micro) level, with nearly consistent individual learning faculty and experience across the population. In this paper, we study minimal yet pertinent aspects of glossogenetics, specifically primal human communication mechanisms, through computational modeling. In particular, we model the process as a language game within the fabric of a decentralized, multi-agent deep reinforcement learning setting, where the agents with local learning and neural cognitive faculties interact through a series of dialogues. Our model seeks to achieve the principle of least effort and overcome the poverty of stimulus among homogeneous agents through mirror networks. In our examinations, we observe the emergence of successful and efficient communication among static and dynamic agent populations through consistent learning.

URL: https://openreview.net/forum?id=dAs6Zk2BWZ

---

Title: Application of Zone Method based Physics-Informed Neural Networks in Reheating Furnaces

Abstract: Foundation Industries (FIs) comprise glass, metals, cement, ceramics, bulk chemicals, paper, steel, etc., and provide crucial, foundational materials for a diverse set of economically relevant industries: automobiles, machinery, construction, household appliances, chemicals, etc. Reheating furnaces within the manufacturing chain of FIs are energy-intensive. Accurate and real-time prediction of underlying temperatures in reheating furnaces has the potential to reduce the overall heating time, thereby controlling the energy consumption for achieving the Net-Zero goals in FIs. In this paper, we cast this prediction as a regression task and explore neural networks due to their inherent capability of being effective and efficient, given adequate data. However, since obtaining good-quality real data is infeasible in scenarios like reheating furnaces, the classical Hottel's zone method-based computational model is used to generate data for model training. To further enhance the Out-Of-Distribution generalization capability of the trained model, we propose a Physics-Informed Neural Network (PINN) that incorporates prior physical knowledge using a set of novel Energy-Balance regularizers.

URL: https://openreview.net/forum?id=BIdMTV6lBr

---

Title: Domain-Generalizable Multiple-Domain Clustering

Abstract: This work generalizes the problem of unsupervised domain generalization to the case in which no labeled samples are available (completely unsupervised). We are given unlabeled samples from multiple source domains, and we aim to learn a shared predictor that assigns examples to semantically related clusters. Evaluation is done by predicting cluster assignments in previously unseen domains. Towards this goal, we propose a two-stage training framework: (1) self-supervised pre-training for extracting domain-invariant semantic features, and (2) multi-head cluster prediction with pseudo-labels, which rely on both the feature space and the cluster-head predictions, further leveraging a novel prediction-based label smoothing scheme.
We demonstrate empirically that our model is more accurate than baselines that require fine-tuning using samples from the target domain or some level of supervision.

URL: https://openreview.net/forum?id=O9RUANpPmb

---

Title: CLIP-QDA: An Explainable Concept Bottleneck Model

Abstract: In this paper, we introduce an explainable algorithm, designed from a multi-modal foundation model, that performs fast and explainable image classification. Drawing inspiration from CLIP-based Concept Bottleneck Models (CBMs), our method creates a latent space where each neuron is linked to a specific word. Observing that this latent space can be modeled with simple distributions, we use a Mixture of Gaussians (MoG) formalism to enhance the interpretability of this latent space. Then, we introduce CLIP-QDA, a classifier that only uses statistical values to infer labels from the concepts. In addition, this formalism allows for both local and global explanations. Because these explanations come from the inner design of our architecture, our work is part of a new family of grey-box models, combining the performance of opaque foundation models with the interpretability of transparent models. Our empirical findings show that in instances where the MoG assumption holds, CLIP-QDA achieves accuracy similar to state-of-the-art CBM methods. Our explanations compete with existing XAI methods while being faster to compute.
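
The abstract does not spell out the classifier itself, but quadratic discriminant analysis over concept activations is a standard construction, so the following sketch (my own illustration under toy assumptions, not the paper's CLIP-QDA implementation) fits class-conditional Gaussians to concept-activation vectors and classifies with their log-posteriors.

import numpy as np

def fit_qda(acts, labels, reg=1e-3):
    """Fit class-conditional Gaussians (mean, precision, log-det, log-prior) to concept activations."""
    stats = {}
    for c in np.unique(labels):
        Xc = acts[labels == c]
        cov = np.cov(Xc, rowvar=False) + reg * np.eye(acts.shape[1])
        stats[c] = (Xc.mean(axis=0), np.linalg.inv(cov),
                    np.linalg.slogdet(cov)[1], np.log(len(Xc) / len(acts)))
    return stats

def qda_predict(x, stats):
    """Pick the class with the highest Gaussian log-posterior (up to a shared constant)."""
    def score(c):
        mu, prec, logdet, logprior = stats[c]
        d = x - mu
        return -0.5 * (d @ prec @ d) - 0.5 * logdet + logprior
    return max(stats, key=score)

# Toy concept activations for two classes with different means and spreads.
rng = np.random.default_rng(0)
acts = np.vstack([rng.normal(0.0, 1.0, size=(100, 4)), rng.normal(2.0, 0.5, size=(100, 4))])
labels = np.array([0] * 100 + [1] * 100)
stats = fit_qda(acts, labels)
print(qda_predict(np.full(4, 1.8), stats))   # predicts class 1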

URL: https://openreview.net/forum?id=jjmdiMiag7

---

Title: Improving Efficiency of Neural Image Classification and Object Detection Systems using Automated Layer Caching

Abstract: Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services. A variety of these services require high throughput and (close to) real-time features, for instance, to respond or react to users’ requests or to process a stream of incoming data on time. However, the trend in DNN design is toward larger models with many layers and parameters to achieve more accurate results. Although these models are often pre-trained, the computational complexity in such large models can still be relatively significant, hindering low inference latency. In this paper, we propose an end-to-end automated caching solution to improve the performance of DNN-based services in terms of their computational complexity and inference latency. Our method adopts the ideas of self-distillation of DNN models and early-exits. The proposed solution is an automated online layer caching mechanism that allows early-exiting of a large model during inference time if the cache model in one of the early-exits is confident enough for the final prediction. One of the main contributions of this paper is that we have implemented the idea as online caching, meaning that the cache models do not need access to training data and perform solely based on the incoming data at run-time, making it suitable for applications using pre-trained models. Our experimental results on two downstream tasks (image classification and object detection) show that, on average, caching can reduce the computational complexity of these services by up to 58% (in terms of FLOPs count) and improve their inference latency by up to 46% with low to zero reduction in accuracy. Our approach also outperforms existing approaches, particularly when applied to complex models and larger datasets. It achieves a remarkable 51.6% and 30.4% reduction in latency, surpassing the Gati and BranchyNet methods for CIFAR100-Resnet50. This enhancement is accompanied by 2.92% and 0.87% increases in mean accuracy, further highlighting the superiority of our approach in demanding scenarios.
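
As a rough illustration of the confidence-gated early-exit idea described above (not the paper's actual caching system), the sketch below attaches a cache head after each backbone block and exits as soon as one head is confident enough; the block/head names, toy random weights, and the threshold are illustrative assumptions.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cached_inference(x, backbone_blocks, cache_heads, final_head, threshold=0.9):
    """Run blocks sequentially; exit early if a cache head is confident enough.

    backbone_blocks: list of callables mapping features -> features
    cache_heads:     list of callables mapping features -> class logits (one per block)
    final_head:      callable mapping features -> class logits
    """
    h = x
    for block, head in zip(backbone_blocks, cache_heads):
        h = block(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:          # cache model is confident: exit early
            return probs.argmax(), probs, "early-exit"
    probs = softmax(final_head(h))            # otherwise fall through to the full model
    return probs.argmax(), probs, "full-model"

# Toy usage with random linear blocks and heads (illustrative only).
rng = np.random.default_rng(0)
blocks = [lambda h, W=rng.normal(size=(8, 8)): np.tanh(h @ W) for _ in range(3)]
heads = [lambda h, W=rng.normal(size=(8, 5)): h @ W for _ in range(3)]
final = lambda h, W=rng.normal(size=(8, 5)): h @ W
print(cached_inference(rng.normal(size=8), blocks, heads, final))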

URL: https://openreview.net/forum?id=RxJhDyegpU

---

Title: A density estimation perspective on learning from pairwise human preferences

Abstract: Learning from human feedback (LHF)—and in particular learning from pairwise preferences—has recently become a crucial ingredient in training large language models (LLMs), and has been the subject of much research. Most recent works frame it as a reinforcement learning problem, where a reward function is learned from pairwise preference data and the LLM is treated as a policy which is adapted to maximize the rewards, often under additional regularization constraints. We propose an alternative interpretation which centers on the generative process for pairwise preferences and treats LHF as a density estimation problem. We provide theoretical and empirical results showing that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution. Finally, we discuss and present findings on "annotator misspecification"—failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models—suggesting that approaches that learn from pairwise human preferences could have trouble learning from a population of annotators with diverse viewpoints.
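
For readers unfamiliar with the setup, the sketch below fits a linear reward function to pairwise preferences under the standard Bradley-Terry model, the kind of generative process the density-estimation view analyzes; it is a minimal illustration under my own assumptions (linear reward, toy noiseless annotator), not the paper's method.

import numpy as np

def preference_nll(w, winners, losers):
    """Negative log-likelihood of pairwise preferences under a linear reward r(x) = w @ x
    and the Bradley-Terry model P(a preferred over b) = sigmoid(r(a) - r(b))."""
    margins = winners @ w - losers @ w
    return np.mean(np.log1p(np.exp(-margins)))

def fit_reward(winners, losers, lr=0.1, steps=500):
    """Gradient descent on the Bradley-Terry negative log-likelihood."""
    w = np.zeros(winners.shape[1])
    for _ in range(steps):
        margins = winners @ w - losers @ w
        sig = 1.0 / (1.0 + np.exp(margins))           # derivative of -log sigmoid(margin)
        grad = -((winners - losers) * sig[:, None]).mean(axis=0)
        w -= lr * grad
    return w

# Toy data: the annotator's implicit reward is r*(x) = x[0]; preferences are noiseless here.
rng = np.random.default_rng(1)
xs, ys = rng.normal(size=(200, 3)), rng.normal(size=(200, 3))
pref_a = xs[:, 0] > ys[:, 0]
winners = np.where(pref_a[:, None], xs, ys)
losers = np.where(pref_a[:, None], ys, xs)
w = fit_reward(winners, losers)
print("learned reward direction:", w / np.linalg.norm(w))
print("final NLL:", preference_nll(w, winners, losers))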

URL: https://openreview.net/forum?id=YH3oERVYjF

---

Title: Sketch and shift: a robust decoder for compressive clustering

Abstract: Compressive learning is an emerging approach to drastically reduce the memory footprint of large-scale learning, by first summarizing a large dataset into a low-dimensional sketch vector, and then decoding from this sketch the latent information needed for learning. In light of recent progress on information preservation guarantees for sketches based on random features, a major objective is to design easy-to-tune algorithms (called decoders) to robustly and efficiently extract this information. To address the underlying non-convex optimization problems, various heuristics have been proposed. In the case of compressive clustering, the standard heuristic is CL-OMPR, a variant of sliding Frank-Wolfe. Yet, CL-OMPR is hard to tune, and the examination of its robustness was overlooked.
In this work, we undertake a careful examination of CL-OMPR to circumvent its limitations. In particular, we show how this algorithm can fail to recover the clusters even in advantageous scenarios. To gain insight, we show how the deficiencies of this algorithm can be attributed to optimization difficulties related to the structure of a correlation function appearing at core steps of the algorithm. To address these limitations, we propose an alternative decoder offering substantial improvements over CL-OMPR. Its design is notably inspired by the mean shift algorithm, a classic approach to detecting the local maxima of kernel density estimators. The proposed algorithm can extract clustering information from a sketch of the MNIST dataset that is 10 times smaller than previously possible.
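
Since the proposed decoder is inspired by mean shift, the following minimal sketch shows plain mean-shift mode seeking on a Gaussian kernel density estimator (the classic building block, with toy data, not the paper's full decoder).

import numpy as np

def mean_shift_mode(x0, data, bandwidth=0.5, iters=50, tol=1e-6):
    """Find a local maximum of a Gaussian kernel density estimator by mean shift."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        x_new = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# Toy data: two clusters; starting near one of them converges to its density mode.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0, 0], 0.3, size=(100, 2)),
                  rng.normal([3, 3], 0.3, size=(100, 2))])
print(mean_shift_mode([2.5, 2.5], data))   # approximately [3, 3]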

URL: https://openreview.net/forum?id=6rWuWbVmgz

---

Title: On the Out-of-Distribution Coverage of Combining Split Conformal Prediction and Bayesian Deep Learning

Abstract: Bayesian deep learning and conformal prediction are two methods that have been used to convey uncertainty and increase safety in machine learning systems. We focus on combining Bayesian deep learning with split conformal prediction and how this combination affects out-of-distribution coverage, particularly in the case of multiclass image classification. We suggest that if the model is generally underconfident on the calibration set, then the resultant conformal sets may exhibit worse out-of-distribution coverage compared to simple predictive credible sets. Conversely, if the model is overconfident on the calibration set, the use of conformal prediction may improve out-of-distribution coverage. We evaluate prediction sets as a result of combining split conformal methods and neural networks trained with (i) stochastic gradient descent, (ii) deep ensembles, and (iii) mean-field variational inference. Our results suggest that combining Bayesian deep learning models with split conformal prediction can, in some cases, cause unintended consequences such as reducing out-of-distribution coverage.
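
For reference, the sketch below implements vanilla split conformal prediction for multiclass classification with the score 1 - p_model(y|x); the (Bayesian or frequentist) model producing the probabilities is abstracted away, and the data are toy assumptions.

import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with the score s(x, y) = 1 - p_model(y | x).

    cal_probs:  (n_cal, K) predicted class probabilities on the calibration set
    cal_labels: (n_cal,)   true calibration labels
    test_probs: (m, K)     predicted class probabilities on test points
    Returns a boolean (m, K) matrix of prediction sets with marginal coverage >= 1 - alpha.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    idx = int(np.ceil((n + 1) * (1 - alpha))) - 1        # finite-sample corrected order statistic
    qhat = np.sort(scores)[min(idx, n - 1)]
    return (1.0 - test_probs) <= qhat

# Toy usage with random (but normalized) probability vectors.
rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(5), size=200)
cal_y = rng.integers(0, 5, size=200)
test_p = rng.dirichlet(np.ones(5), size=3)
print(split_conformal_sets(cal_p, cal_y, test_p, alpha=0.1))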

URL: https://openreview.net/forum?id=TySx8fsSSU

---

Title: $\sigma$-PCA: a unified neural model for linear and nonlinear principal component analysis

Abstract: Linear principal component analysis (PCA), nonlinear PCA, and linear independent component analysis (ICA) -- these are three methods with single-layer autoencoder formulations for learning linear transformations with certain characteristics from data. Linear PCA learns orthogonal transformations (rotations) that orient axes to maximise variance, but it suffers from a subspace rotational indeterminacy: it fails to find a unique rotation for axes that share the same variance. Both nonlinear PCA and linear ICA reduce the subspace indeterminacy from rotational to permutational by maximising statistical independence under the assumption of unit variance. The main difference between them is that nonlinear PCA only learns rotations while linear ICA learns not just rotations but any linear transformation with unit variance. The relationship between all three can be understood by the singular value decomposition of the linear ICA transformation into a sequence of rotation, scale, rotation. Linear PCA learns the first rotation; nonlinear PCA learns the second. The scale is simply the inverse of the standard deviations. The problem is that, in contrast to linear PCA, conventional nonlinear PCA cannot be used directly on the data to learn the first rotation, the first being special as it reduces dimensionality and orders by variances. In this paper, we have identified the cause, and as a solution we propose $\sigma$-PCA: a unified neural model for linear and nonlinear PCA as single-layer autoencoders. One of its key ingredients: modelling not just the rotation but also the scale -- the variances. This model bridges the disparity between linear and nonlinear PCA, and shows that, whereas linear PCA relies on the decoder contribution, nonlinear PCA relies on the encoder contribution. With our formulation, nonlinear PCA can learn not just the second, but also the first rotation. And so, like linear PCA, it can learn a semi-orthogonal transformation that reduces dimensionality and orders by variances, but, unlike linear PCA, it does not suffer from rotational indeterminacy.
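
The rotation-scale-rotation decomposition mentioned above can be checked numerically: the sketch below draws unit-variance independent sources, mixes them with an arbitrary linear map, and verifies that linear PCA recovers one of the SVD rotations together with the scale. This illustrates only the SVD relationship in the abstract, not the $\sigma$-PCA model itself.

import numpy as np

rng = np.random.default_rng(0)

# Independent, unit-variance sources mixed by an arbitrary linear map A.
S = rng.laplace(size=(50000, 3)) / np.sqrt(2.0)
A = rng.normal(size=(3, 3))
X = S @ A.T

# SVD of the mixing map: A = U @ diag(d) @ Vt, i.e. rotation, scale, rotation.
U, d, Vt = np.linalg.svd(A)

# Linear PCA on X recovers the rotation U (up to sign), since Cov(X) = A A^T = U diag(d^2) U^T.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvecs = eigvecs[:, ::-1]                               # sort by decreasing variance
print(np.round(np.abs(eigvecs.T @ U), 2))                # approximately the identity (up to sign)
print(np.round(np.sqrt(eigvals[::-1]), 2), np.round(d, 2))  # PCA std devs match the scale d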

URL: https://openreview.net/forum?id=KpVJ6CGnwI

---

Title: Continual Learning: Applications and the Road Forward

Abstract: Continual learning is a sub-field of machine learning, which aims to allow machine learning models to continuously learn on new data, by accumulating knowledge without forgetting what was learned in the past. In this work, we take a step back, and ask: "Why should one care about continual learning in the first place?''. We set the stage by surveying recent continual learning papers published at three major machine learning conferences, and show that memory-constrained settings dominate the field. Then, we discuss five open problems in machine learning, and even though they seem unrelated to continual learning at first sight, we show that continual learning will inevitably be part of their solution. These problems are model-editing, personalization, on-device learning, faster (re-)training and reinforcement learning. Finally, by comparing the desiderata from these unsolved problems and the current assumptions in continual learning, we highlight and discuss four future directions for continual learning research. We hope that this work offers an interesting perspective on the future of continual learning, while displaying its potential value and the paths we have to pursue in order to make it successful.

URL: https://openreview.net/forum?id=axBIMcGZn9

---

Title: Adaptive Training Distributions with Scalable Online Bilevel Optimization

Abstract: Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work considers modifying the pretraining distribution in the case where one has a small sample of data reflecting the targeted test conditions. We propose an algorithm motivated by a recent formulation of this setting as an online, bilevel optimization problem. With scalability in mind, our algorithm prioritizes computing gradients at training points which are likely to most improve the loss on the targeted distribution. Empirically, we show that in some cases this approach is beneficial over existing strategies from the domain adaptation literature but may not succeed in other cases. We propose a simple test to evaluate when our approach can be expected to work well and point towards further research to address current limitations.
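
A common first-order instantiation of "prioritize training points likely to most improve the targeted loss" is to score each example by the alignment of its gradient with the gradient on the small target sample; the sketch below shows that scoring rule as an illustration under my own assumptions, not the paper's online bilevel algorithm.

import numpy as np

def target_aligned_scores(train_grads, target_grad):
    """Score each training example by the alignment of its gradient with the gradient of the
    loss on a small target sample; to first order, higher scores mean training on the example
    is expected to reduce the target loss more."""
    return train_grads @ target_grad

def select_batch(train_grads, target_grad, batch_size):
    """Pick the batch_size examples with the highest alignment scores."""
    scores = target_aligned_scores(train_grads, target_grad)
    return np.argsort(-scores)[:batch_size]

# Toy per-example gradients and a target-sample gradient.
rng = np.random.default_rng(0)
train_grads = rng.normal(size=(1000, 16))
target_grad = rng.normal(size=16)
print(select_batch(train_grads, target_grad, batch_size=8))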

URL: https://openreview.net/forum?id=SiUzyvAkAg

---

Title: Attending to Graph Transformers

Abstract: Recently, transformer architectures for graphs emerged as an alternative to established techniques for machine learning with graphs, such as (message-passing) graph neural networks. So far, they have shown promising empirical results, e.g., on molecular prediction datasets, often attributed to their ability to circumvent graph neural networks’ shortcomings, such as over-smoothing and over-squashing. Here, we derive a taxonomy of graph transformer architectures, bringing some order to this emerging field. We overview their theoretical properties, survey structural and positional encodings, and discuss extensions for important graph classes, e.g., 3D molecular graphs. Empirically, we probe how well graph transformers can recover various graph properties, how well they can deal with heterophilic graphs, and to what extent they prevent over-squashing. Further, we outline open challenges and research directions to stimulate future work.

URL: https://openreview.net/forum?id=HhbqHBBrfZ

---

Title: Hyper-parameter Tuning for Fair Classification without Sensitive Attribute Access

Abstract: Fair machine learning methods seek to train models that balance model performance across demographic subgroups defined over sensitive attributes like race and gender. Although sensitive attributes are typically assumed to be known during training, they may not be available in practice due to privacy and other logistical concerns. Recent work has sought to train fair models without sensitive attributes on training data. However, these methods need extensive hyper-parameter tuning to achieve good results, and hence assume that sensitive attributes are known on validation data. Yet this assumption too might not be practical. Here, we propose Antigone, a framework to train fair classifiers without access to sensitive attributes on either training or validation data. Instead, we generate pseudo sensitive attributes on the validation data by training an ERM model and using the classifier’s incorrectly (correctly) classified examples as proxies for disadvantaged (advantaged) groups. Since fairness metrics like demographic parity, equal opportunity and subgroup accuracy can be estimated to within a proportionality constant even with noisy sensitive attribute information, we show theoretically and empirically that these proxy labels can be used to maximize fairness under average accuracy constraints. Key to our results is a principled approach to select the hyper-parameters of the ERM model in a completely unsupervised fashion (meaning without access to ground truth sensitive attributes) that minimizes the gap between fairness estimated using noisy versus ground-truth sensitive labels. We demonstrate that Antigone outperforms existing methods on CelebA, Waterbirds, and UCI datasets.

URL: https://openreview.net/forum?id=ZSWKdRi2cU

---

Title: Improving Subgraph-GNNs via Edge-Level Ego-Network Encodings

Abstract: We present a novel edge-level ego-network encoding for learning on graphs that can boost Message Passing Graph Neural Networks (MP-GNNs) by providing additional node and edge features or extending message-passing formats. The proposed encoding is sufficient to distinguish Strongly Regular Graphs, a family of challenging 3-WL equivalent graphs. We show theoretically that such encoding is more expressive than node-based sub-graph MP-GNNs. In an empirical evaluation on four benchmarks with 10 graph datasets, our results match or improve previous baselines on expressivity, graph classification, graph regression, and proximity tasks---while reducing memory usage by 18.1x in certain real-world settings.

URL: https://openreview.net/forum?id=N0Sc0KY0AH

---

Title: Benchmarking Machine Learning Models for Quantum Error Correction

Abstract: Quantum Error Correction (QEC) is one of the fundamental problems in quantum computer systems, which aims to detect and correct errors in the data qubits within quantum computers. Due to the presence of unreliable data qubits in existing quantum computers, implementing quantum error correction is a critical step when establishing a stable quantum computer system. Recently, machine learning (ML)-based approaches have been proposed to address this challenge. However, they lack a thorough understanding of quantum error correction. To bridge this research gap, we provide a new perspective to understand machine learning-based QEC in this paper. We find that syndromes in the ancilla qubits result from errors on connected data qubits, and distant ancilla qubits can provide auxiliary information to rule out some incorrect predictions for the data qubits. Therefore, to detect errors in data qubits, we must consider the information present in the long-range ancilla qubits. To the best of our knowledge, this long-range dependency structure of QEC has been little explored with machine learning. To fill this gap, we curate a machine learning benchmark to assess the capacity to capture long-range dependencies for quantum error correction. To provide a comprehensive evaluation, we evaluate seven state-of-the-art deep learning algorithms spanning diverse neural network architectures, such as convolutional neural networks, graph neural networks, and graph transformers. Our exhaustive experiments reveal an enlightening trend: by enlarging the receptive field to exploit information from distant ancilla qubits, the accuracy of QEC significantly improves. For instance, U-Net can improve CNN by a margin of about 50%. Finally, we provide a comprehensive analysis that could inspire future research in this field. We will release the code when the paper is published.

URL: https://openreview.net/forum?id=qBo2jObPxa

---

Title: Deep End-to-end Causal Inference

Abstract: Causal inference is essential for data-driven decision-making across domains such as business engagement, medical treatment, and policy making. However, in practice, causal inference suffers from many limitations including unknown causal graphs, missing data problems, and mixed data types. To tackle those challenges, we develop the Deep End-to-end Causal Inference (DECI) framework, a flow-based non-linear additive noise model combined with variational inference, which can perform both Bayesian causal discovery and inference. Theoretically, we show that DECI unifies many existing structural causal model (SCM)-based causal inference techniques and can recover the ground truth mechanism under standard assumptions. Motivated by the challenges in the real world, we further extend DECI to heterogeneous, mixed-type data with missing values, allowing for both continuous and discrete treatment decisions. Empirically, we conduct extensive experiments (over a thousand) to show the competitive performance of DECI when compared to relevant baselines for both causal discovery and inference with both synthetic and causal machine learning benchmarks across data types and levels of missingness.

URL: https://openreview.net/forum?id=6Rn3nGyvcD

---

Title: VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing

Abstract: Recently, diffusion-based generative models have achieved remarkable success for image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing ensuring strong temporal and spatial consistency. Firstly, we propose to combine atlas-based and pre-trained text-to-image diffusion models to provide a training-free and efficient editing method, which by design fulfills temporal smoothness. Secondly, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures a fine spatial control on targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset, regarding semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video only takes approximately one minute, and it can generate multiple compatible edits based on a unique text prompt.

URL: https://openreview.net/forum?id=i02A009I5a

---

Title: An Improved Federated Clustering Algorithm with Model-based Clustering

Abstract: Federated learning (FL) is a distributed learning paradigm that allows multiple clients to collaboratively train a shared model via communications to a central server. However, optimal models of different clients often differ due to heterogeneity of data across clients.
In this paper, we address the dichotomy between heterogeneous models and simultaneous training in FL via a clustering structure among the clients. The clustering framework is one way to allow for a high level of heterogeneity between clients, while clients with similar data can still train a shared model. We define a new clustering framework for FL based on the (optimal) local models of the clients: two clients belong to the same cluster if their local models are close. We propose an algorithm, \emph{Successive Refine Federated Clustering Algorithm} (\texttt{SR-FCA}), that treats each client as a singleton cluster as an initialization, and then successively refines the cluster estimates by exploiting similarity with other clients. In any intermediate step, \texttt{SR-FCA} uses an {\em error-tolerant} federated learning algorithm within each cluster to exploit simultaneous training and to correct clustering errors. Unlike some prominent prior works, \texttt{SR-FCA} does not require any \emph{good} initialization (or warm start), both in theory and practice. We show that with a proper choice of learning rate, \texttt{SR-FCA} incurs arbitrarily small clustering error. Additionally, unlike some prior works, \texttt{SR-FCA} does not require knowledge of the number of clusters a priori. We validate the performance of \texttt{SR-FCA} on real-world FL datasets including FEMNIST and Shakespeare in non-convex problems and show the benefits of \texttt{SR-FCA} over several baselines.
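
To make the clustering notion concrete, the sketch below performs one simplified refinement pass in the spirit of the framework: start from singleton clusters and merge clusters whose local model parameters are close. The threshold, the merge rule, and the toy models are my own assumptions for illustration, not the SR-FCA algorithm.

import numpy as np

def refine_clusters(client_models, threshold):
    """Start from singleton clusters and repeatedly merge clusters whose average model
    parameters are within `threshold` of each other (a naive stand-in for one refinement step)."""
    clusters = [[i] for i in range(len(client_models))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ma = np.mean([client_models[i] for i in clusters[a]], axis=0)
                mb = np.mean([client_models[i] for i in clusters[b]], axis=0)
                if np.linalg.norm(ma - mb) <= threshold:
                    clusters[a] += clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Toy local models: clients 0-2 and 3-4 have similar optima.
models = [np.array([1.0, 0.0]), np.array([1.1, 0.1]), np.array([0.9, -0.1]),
          np.array([5.0, 5.0]), np.array([5.2, 4.9])]
print(refine_clusters(models, threshold=1.0))   # [[0, 1, 2], [3, 4]]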

URL: https://openreview.net/forum?id=1ZGA5mSkoB

---

Title: Relation-Oriented: Toward Causal Knowledge-Aligned AGI

Abstract: The current relationship modeling paradigm, grounded in the observational i.i.d. assumption, fundamentally misaligns with our causal knowledge understanding due to two key oversights: 1) the unobservable relations, which lead to undetectable hierarchical levels of knowledge, driving the need for model generalizability; 2) the counterfactual relative timings, which fundamentally support our structural knowledge comprehension, resulting in inherent biases under this Observation-Oriented paradigm. Adopting a novel Relation-Oriented perspective, this paper proposes a new framework to address the various confusions surrounding causality learning, ranging from traditional causal inference to modern language models.

Also, relation-indexed representation learning (RIRL) is proposed as a baseline implementation of the new paradigm, alongside comprehensive experiments demonstrating its efficacy in autonomously identifying dynamical effects in relationship learning.

URL: https://openreview.net/forum?id=4JMuzveEJi

---

Title: Revisiting stochastic submodular maximization with cardinality constraint: A bandit perspective

Abstract: In this paper, we focus on the problem of maximizing non-negative, monotone, stochastic submodular functions under cardinality constraint. Recent works have explored continuous optimization algorithms via multi-linear extensions for such problems and provided appropriate approximation guarantees. We take a fresh look into this problem from a discrete, (stochastic) greedy perspective under a probably approximately correct (PAC) setting, i.e., the goal is to obtain solutions whose expected objective value is greater than or equal to $(1-1/e-\epsilon){\rm OPT}-\nu$ with at least $1-\delta$ probability, where ${\rm OPT}$ is the optimal objective value. Using the theory of multi-armed bandits, we propose novel bandit stochastic greedy (BSG) algorithms in which selection of the next element at iteration $i$ is posed as a $(\nu_i,\delta_i)$-PAC best-arm identification problem. Given $(\nu,\delta)$-PAC parameters to BSG, we formally characterize a set of per-iteration $(\nu_i,\delta_i)$-policies such that any policy from this set guarantees a $(\nu,\delta)$-PAC solution for the stochastic submodular maximization problem using BSG. Next, we derive the optimal $(\nu^{*}_i,\delta^{*}_i)$-policy from this set which incurs the least computational cost in terms of the number of stochastic function calls $(N)$. With the obtained optimal policy, we show that BSG has better complexity in $N$ than the existing approaches. Lastly, we also analyze the inverse fixed-budget problem, i.e., obtaining the best per-iteration policy given a fixed budget $N=N_0$ of stochastic function calls. Experiments on various problems illustrate the efficacy of our approach in terms of optimization quality as well as computational efficiency.
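
As a stripped-down illustration of the greedy skeleton (without the PAC best-arm identification machinery of BSG), the sketch below estimates marginal gains of a stochastic submodular function by averaging a fixed number of noisy evaluations per candidate; the evaluation budget and toy coverage function are assumptions.

import numpy as np

def stochastic_greedy(ground_set, stoch_f, k, n_samples=50, rng=None):
    """Greedy maximization of a monotone stochastic submodular function under a cardinality
    constraint k. Marginal gains are estimated by averaging n_samples noisy evaluations of
    stoch_f(S) (a naive stand-in for a per-iteration best-arm identification step)."""
    if rng is None:
        rng = np.random.default_rng()
    S = []
    def est(T):
        return np.mean([stoch_f(T, rng) for _ in range(n_samples)])
    base = est(S)
    for _ in range(k):
        gains = {e: est(S + [e]) - base for e in ground_set if e not in S}
        best = max(gains, key=gains.get)
        S.append(best)
        base += gains[best]
    return S

# Toy stochastic coverage function: noisy count of covered elements.
universe = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 2, 3, 4}}
def noisy_coverage(S, rng):
    covered = set().union(*(universe[e] for e in S)) if S else set()
    return len(covered) + rng.normal(scale=0.1)

print(stochastic_greedy(list(universe), noisy_coverage, k=2))   # typically picks element 3 first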

URL: https://openreview.net/forum?id=57ETChLAOE

---

Title: Active & Passive Causal Inference: Introduction

Abstract: This paper serves as a starting point for machine learning researchers, engineers and students who are interested in but not yet familiar with causal inference. We start by laying out an important set of assumptions that are collectively needed for causal identification, such as exchangeability, positivity, consistency and the absence of interference. From these assumptions, we build out a set of important causal inference techniques, which we do by categorizing them into two buckets: active and passive approaches. We describe and discuss randomized controlled trials and bandit-based approaches from the active category. We then describe classical approaches, such as matching and inverse probability weighting, in the passive category, followed by more recent deep learning based algorithms. We finish by discussing aspects of causal inference not covered in this paper, such as collider biases, and we expect this paper to provide readers with a diverse set of starting points for further reading and research in causal inference and discovery.

URL: https://openreview.net/forum?id=96AuPuVW1m

---

Title: LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations

Abstract: Can a Large Language Model (LLM) solve simple abstract reasoning problems? We explore this broad question through a systematic analysis of GPT on the Abstraction and Reasoning Corpus (ARC), a representative benchmark of abstract reasoning ability from limited examples in which solutions require some "core knowledge" of concepts such as objects, goal states, counting, and basic geometry. GPT-4 solves only 13/50 of the most straightforward ARC tasks when using textual encodings for their two-dimensional input-output grids. Our failure analysis reveals that GPT-4's capacity to identify objects and reason about them is significantly influenced by the sequential nature of the text that represents an object within a text encoding of a task. To test this hypothesis, we design a new benchmark, the 1D-ARC, which consists of one-dimensional (array-like) tasks that are more conducive to GPT-based reasoning, and where it indeed performs better than on the (2D) ARC. To alleviate this issue, we propose an object-based representation that is obtained through an external tool, resulting in nearly doubling the performance on solved ARC tasks and near-perfect scores on the easier 1D-ARC. Although the state-of-the-art GPT-4 is unable to "reason" perfectly within non-language domains such as the 1D-ARC or a simple ARC subset, our study reveals that the use of object-based representations can significantly improve its reasoning ability.

URL: https://openreview.net/forum?id=E8m8oySvPJ

---

Title: Achieving the Minimax Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach

Abstract: Offline reinforcement learning aims to learn from pre-collected datasets without active exploration. This problem faces significant challenges, including limited data availability and distributional shifts. Existing approaches adopt a pessimistic stance towards uncertainty by penalizing rewards of under-explored state-action pairs to estimate value functions conservatively.
In this paper, we show that the distributionally robust optimization (DRO) based approach can also address these challenges and is minimax optimal. Specifically, we directly model the uncertainty in the transition kernel and construct an uncertainty set of statistically plausible transition kernels. We then find the policy that optimizes the worst-case performance over this uncertainty set. We first design a metric-based Hoeffding-style uncertainty set such that with high probability the true transition kernel is in this set. We prove that to achieve a sub-optimality gap of $\epsilon$, the sample complexity is $\mathcal{O}(S^2C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4})$, where $\gamma$ is the discount factor, $S$ is the number of states, and $C^{\pi^*}$ is the single-policy clipped concentrability coefficient which quantifies the distribution shift. To achieve the optimal sample complexity, we further propose a less conservative Bernstein-style uncertainty set, which, however, does not necessarily include the true transition kernel. We show that an improved sample complexity of $\mathcal{O}(SC^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3})$ can be obtained, which matches with the minimax lower bound for offline reinforcement learning, and thus is minimax optimal.

URL: https://openreview.net/forum?id=Y7FbGcjOuD

---

Title: Non-Uniform Smoothness for Gradient Descent

Abstract: The analysis of gradient descent-type methods typically relies on the Lipschitz continuity of the objective gradient. This generally requires an expensive hyperparameter tuning process to appropriately calibrate a stepsize for a given problem. In this work we introduce a local first-order smoothness oracle (LFSO) which generalizes the Lipschitz continuous gradients smoothness condition and is applicable to any twice-differentiable function. We show that this oracle can encode all relevant problem information for tuning stepsizes for a suitably modified gradient descent method and give global and local convergence results. We also show that LFSOs in this modified first-order method can yield global linear convergence rates for non-strongly convex problems with extremely flat minima, and thus improve over the lower bound on rates achievable by general (accelerated) first-order methods.
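
A minimal one-dimensional illustration, under my simplified reading of a local smoothness oracle as L(x) = |f''(x)|: on f(x) = x^4, which has an extremely flat minimum and no global Lipschitz gradient constant, a stepsize of 1/L(x) makes much faster progress than a fixed stepsize. This is an illustration of the idea only, not the paper's modified method.

import numpy as np

# f(x) = x^4: flat minimum at 0; the local second-derivative bound L(x) = 12 x^2 is easy to evaluate.
f = lambda x: x ** 4
grad = lambda x: 4 * x ** 3
local_L = lambda x: max(12 * x ** 2, 1e-12)   # local smoothness bound (eps avoids division by zero)

x_fixed, x_lfso = 2.0, 2.0
for _ in range(200):
    x_fixed -= 1e-2 * grad(x_fixed)           # fixed stepsize: needs tuning, slow near the flat minimum
    x_lfso -= grad(x_lfso) / local_L(x_lfso)  # stepsize adapted to the local smoothness
print(f(x_fixed), f(x_lfso))                  # the locally adapted stepsize reaches a far lower value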

URL: https://openreview.net/forum?id=17ESEjETbP

---

Title: Robust Distortion-free Watermarks for Language Models

Abstract: We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers—which we compute using a randomized watermark key—to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models—OPT-1.3B, LLaMA-7B and Alpaca-7B—to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$-$50$\% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses—whose median length is around $100$ tokens—are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
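
To give a flavour of the exponential minimum sampling scheme named above, the sketch below picks the next token as argmin_i -log(r_i)/p_i using key-derived uniforms r_i, which marginally still samples from the model distribution, and then scores a text against a key. The keyed PRNG, the single-sequence setup, and the detection statistic are simplified assumptions, not the paper's exact protocol.

import numpy as np

def watermark_sample(probs, key_randoms):
    """Exponential-minimum (Gumbel-trick) sampling: choose argmin_i -log(r_i) / p_i.
    Marginally the chosen token follows probs, but it is a deterministic function of the
    key-derived randoms r_i."""
    return int(np.argmin(-np.log(key_randoms) / np.maximum(probs, 1e-12)))

def detection_score(tokens, key_randoms_per_step):
    """Alignment statistic: larger when the chosen tokens coincide with large key randoms."""
    return sum(-np.log(1.0 - r[t]) for t, r in zip(tokens, key_randoms_per_step))

rng = np.random.default_rng(0)               # stands in for a keyed PRNG shared with the detector
V, T = 20, 50
probs = rng.dirichlet(np.ones(V), size=T)    # toy "language model" distributions, one per position
key_rs = rng.uniform(size=(T, V))            # key-derived randoms, one row per position
tokens = [watermark_sample(p, r) for p, r in zip(probs, key_rs)]

other_rs = np.random.default_rng(1).uniform(size=(T, V))   # the wrong key
# The score is typically clearly larger with the correct key than with the wrong one.
print(detection_score(tokens, key_rs), detection_score(tokens, other_rs))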

URL: https://openreview.net/forum?id=FpaCL1MO2C

---

Title: Leveraging Function Space Aggregation for Federated Learning at Scale

Abstract: The federated learning paradigm has motivated the development of methods for aggregating multiple client updates into a global server model, without sharing client data. Many federated learning algorithms, including the canonical Federated Averaging (FedAvg), take a direct (possibly weighted) average of the client parameter updates, motivated by results in distributed optimization. In this work, we adopt a function space perspective and propose a new algorithm, FedFish, that aggregates local approximations to the functions learned by clients, using an estimate based on their Fisher information. We evaluate FedFish on realistic, large-scale cross-device benchmarks. While the performance of FedAvg can suffer as client models drift further apart, we demonstrate that FedFish is more robust to longer local training. Our evaluation across several settings in image and language benchmarks shows that FedFish outperforms FedAvg as local training epochs increase. Further, FedFish results in global networks that are more amenable to efficient personalization via local fine-tuning on the same or shifted data distributions. For instance, federated pretraining on the C4 dataset, followed by few-shot personalization on Stack Overflow, results in a 7% improvement in next-token prediction by FedFish over FedAvg.
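
One simplified way to aggregate clients with Fisher information, which may well differ from the exact FedFish update, is a per-coordinate diagonal-Fisher weighted average of client parameters, sketched below with toy numbers.

import numpy as np

def fisher_weighted_aggregate(client_params, client_fishers, eps=1e-8):
    """Aggregate client parameter vectors, weighting each coordinate by the client's
    (diagonal) Fisher information estimate. Reduces to a plain average when all Fisher
    estimates are equal (as in uniformly weighted FedAvg)."""
    P = np.stack(client_params)       # (num_clients, dim)
    F = np.stack(client_fishers)      # (num_clients, dim), nonnegative
    return (F * P).sum(axis=0) / (F.sum(axis=0) + eps)

# Toy example: client 0 is much more certain (higher Fisher) about coordinate 0.
params = [np.array([1.0, 0.0]), np.array([-1.0, 2.0])]
fishers = [np.array([10.0, 1.0]), np.array([0.1, 1.0])]
print(fisher_weighted_aggregate(params, fishers))   # coordinate 0 stays close to client 0's value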

URL: https://openreview.net/forum?id=Ytp9KFKZfZ

---

Title: Decentralized Stochastic Gradient Descent Ascent for Finite-Sum Minimax Problems

Abstract: Minimax optimization problems have attracted significant attention in recent years due to their widespread application in numerous machine learning models. To solve the minimax problem, a wide variety of stochastic optimization methods have been proposed. However, most of them ignore the distributed setting where the training data is distributed on multiple workers. In this paper, we develop a novel decentralized stochastic gradient descent ascent method for the finite-sum minimax problem. In particular, by employing the variance-reduced gradient, our method can achieve $O(\frac{\sqrt{n}\kappa^3}{(1-\lambda)^2\epsilon^2})$ sample complexity and $O(\frac{\kappa^3}{(1-\lambda)^2\epsilon^2})$ communication complexity for the nonconvex-strongly-concave minimax problem. As far as we know, our work is the first one to achieve such theoretical complexities for this kind of minimax problem. At last, we apply our method to optimize the AUC maximization problem, and the experimental results confirm the effectiveness of our method.

URL: https://openreview.net/forum?id=2XESMJoZmt

---

Title: Discrete Graph Auto-Encoder

Abstract: Despite advances in generative methods, accurately modeling the distribution of graphs remains a challenging task primarily because of the absence of predefined or inherent unique graph representations.
Given a graph of $n$ nodes, there are $n!$ permutations, each corresponding to a representation of the same graph, and there is no known fast algorithm for selecting a canonical one. Two main strategies have emerged to tackle this issue: 1) restricting the number of possible representations by sorting the nodes with a heuristic such as Breadth-First Search (BFS), or 2) using permutation-invariant/equivariant functions, specifically Graph Neural Networks (GNNs). Both approaches have their constraints and drawbacks.

In this paper, we introduce a new framework named Discrete Graph Auto-Encoder (DGAE), which leverages the strengths of both strategies and mitigates their respective limitations. In essence, our method uses a permutation-equivariant auto-encoder to convert graphs into sets of discrete node embeddings and to reconstruct graphs from these sets. We then turn these sets into unique sequences and learn their distribution via an auto-regressive model.

Through multiple experimental evaluations, we demonstrate the superior performance of our model in comparison to the existing state-of-the-art across various datasets.

URL: https://openreview.net/forum?id=bZ80b0wb9d

---

Title: ASPEST: Bridging the Gap Between Active Learning and Selective Prediction

Abstract: Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain. These predictions can then be deferred to humans for further evaluation. As an everlasting challenge for machine learning, in many real-world scenarios the distribution of test data differs from that of the training data. This results in more inaccurate predictions, and often increased dependence on humans, which can be difficult and expensive. Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples. Selective prediction and active learning have been approached from different angles, with the connection between them missing. In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain while increasing accuracy and coverage. For this new paradigm, we propose a simple yet effective approach, ASPEST, that utilizes ensembles of model snapshots with self-training with their aggregated outputs as pseudo labels. Extensive experiments on numerous image, text and structured datasets, which suffer from domain shifts, demonstrate that ASPEST can significantly outperform prior work on selective prediction and active learning (e.g. on the MNIST$\to$SVHN benchmark with the labeling budget of 100, ASPEST improves the AUACC metric from 79.36% to 88.84%) and achieves better utilization of humans in the loop.

URL: https://openreview.net/forum?id=3nprbNR3HB

---

Title: Tabula: Efficiently Computing Nonlinear Activation Functions for Secure Neural Network Inference

Abstract: Multiparty computation approaches to secure neural network inference commonly rely on garbled circuits for securely executing nonlinear activation functions. However, garbled circuits require excessive communication between server and client, impose significant storage overheads, and incur large runtime penalties; for example, securely evaluating ResNet-32 using standard approaches requires more than 300MB of communication, over 10s of runtime, and around 5 GB of preprocessing storage. To reduce these costs, we propose an alternative to garbled circuits: Tabula, an algorithm based on secure lookup tables. Our approach precomputes lookup tables during an offline phase that contains the result of all possible nonlinear function calls. Because these tables incur exponential storage costs in the number of operands and the precision of the input values, we use quantization to reduce these storage costs to make this approach practical. This enables an online phase where securely computing the result of a nonlinear function requires just a single round of communication, with communication cost equal to twice the number of bits of the input to the nonlinear function. In practice our approach costs 2 bytes of communication per nonlinear function call in the online phase. Compared to garbled circuits with quantized inputs, when computing individual nonlinear functions during the online phase, experiments show Tabula uses between $280$-$560 \times$ less communication, is over $100\times$ faster, and uses a comparable amount of storage; compared against other state-of-the-art protocols Tabula achieves greater than $40\times$ communication reduction. This leads to significant performance gains over garbled circuits with quantized inputs during the online phase of secure inference of neural networks: Tabula reduces end-to-end inference communication by up to $9 \times$ and achieves an end-to-end inference speedup of up to $50 \times$, while imposing comparable storage and offline preprocessing costs.
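
The core lookup-table trick can be illustrated without any of the secure-computation machinery: precompute the nonlinearity over all quantized inputs offline, then answer each online call with one table read, as in the sketch below (the bit width and input range are illustrative assumptions).

import numpy as np

def build_table(fn, bits=8, lo=-8.0, hi=8.0):
    """Offline phase: tabulate fn over all 2^bits quantized inputs."""
    grid = np.linspace(lo, hi, 2 ** bits)
    return grid, fn(grid)

def lookup(x, grid, table):
    """Online phase: one index computation plus one table read per nonlinear call."""
    idx = np.clip(np.round((x - grid[0]) / (grid[1] - grid[0])).astype(int), 0, len(grid) - 1)
    return table[idx]

grid, relu_table = build_table(lambda z: np.maximum(z, 0.0), bits=8)
x = np.array([-3.2, 0.1, 5.7])
print(lookup(x, grid, relu_table))   # approximately [0.0, 0.1, 5.7], up to quantization error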

URL: https://openreview.net/forum?id=CXPb4twsrq

---

Title: Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits

Abstract: Weight averaging of Stochastic Gradient Descent (SGD) iterates is a popular method for training deep learning models. While it is often used as part of complex training pipelines to improve generalization or serve as a `teacher' model, weight averaging lacks proper evaluation on its own. In this work, we present a systematic study of the Exponential Moving Average (EMA) of weights. We first explore the training dynamics of EMA, give guidelines for hyperparameter tuning, and highlight its good early performance, partly explaining its success as a teacher. We also observe that EMA requires less learning rate decay compared to SGD since averaging naturally reduces noise, introducing a form of implicit regularization. Through extensive experiments, we show that EMA solutions differ from last-iterate solutions. EMA models not only generalize better but also exhibit improved i) robustness to noisy labels, ii) repeatability, iii) calibration and iv) transfer learning. Therefore, we suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
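
For completeness, the standard EMA-of-weights update studied here is a one-liner per parameter; the sketch below shows it on a toy parameter dictionary standing in for a model's weights.

import copy

def update_ema(ema_params, current_params, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current, applied elementwise."""
    for k in ema_params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * current_params[k]
    return ema_params

# Toy training loop: parameters drift, the EMA tracks a smoothed (implicitly regularized) version.
params = {"w": 0.0}
ema = copy.deepcopy(params)
for step in range(1, 2001):
    params["w"] += 0.01                  # stand-in for an SGD update
    ema = update_ema(ema, params, decay=0.99)
print(params["w"], ema["w"])             # the EMA lags behind the last iterate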

URL: https://openreview.net/forum?id=2M9CUnYnBA

---

Title: An advantage based policy transfer algorithm for reinforcement learning with metrics of transferability

Abstract: Reinforcement learning (RL) can enable sequential decision-making in complex and high-dimensional environments if the acquisition of a new state-action pair is efficient, i.e., when interaction with the environment is inexpensive. However, there are a myriad of real-world applications in which a high number of interactions are infeasible. In these environments, transfer RL algorithms, which can be used for the transfer of knowledge from one or multiple source environments to a target environment, have been shown to increase learning speed and improve initial and asymptotic performance. However, most existing transfer RL algorithms are on-policy and sample inefficient, and often require heuristic choices in algorithm design. This paper proposes an off-policy Advantage-based Policy Transfer algorithm, APT-RL, for fixed domain environments. Its novelty is in using the popular notion of ``advantage'' as a regularizer, to weigh the knowledge that should be transferred from the source, relative to new knowledge learned in the target, removing the need for heuristic choices. Further, we propose a new transfer performance metric to evaluate the performance of our algorithm and unify existing transfer RL frameworks. Finally, we present a scalable, theoretically-backed task similarity measurement algorithm to illustrate the alignments between our proposed transferability metric and similarities between source and target environments. Numerical experiments on three continuous control benchmark tasks demonstrate that APT-RL outperforms existing transfer RL algorithms on most tasks, and is $10\%$ to $75\%$ more sample efficient than learning from scratch.

URL: https://openreview.net/forum?id=1yRo6jwMb7

---

Title: ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions

Abstract: Asking insightful questions is crucial for acquiring knowledge and expanding our understanding of the world. However, the importance of questioning has been largely overlooked in AI research, where models have been primarily developed to answer questions. With the recent advancements of large language models (LLMs) like ChatGPT, we discover their capability to ask high-quality questions when provided with a suitable prompt. This discovery presents a new opportunity to develop an automatic questioning system. In this paper, we introduce ChatCaptioner, a novel automatic-questioning method deployed in image captioning. Here, ChatGPT is prompted to ask a series of informative questions about images to BLIP-2, a strong vision question-answering model. In ChatCaptioner, we investigate whether two AI models, unable to individually describe images in detail, can collaborate through an automated, visually guided dialogue to generate a better and more enriched image description than a single AI model.
We conduct human-subject evaluations on common image caption datasets such as COCO, Conceptual Captions, and WikiArt, and compare ChatCaptioner with BLIP-2 as well as ground truth. Our results demonstrate that ChatCaptioner's captions are significantly more informative, receiving three times as many votes from human evaluators as BLIP-2 alone for providing the most image information.
Besides, ChatCaptioner identifies 53% more objects within the image than BLIP-2 alone measured by WordNet synset matching.

URL: https://openreview.net/forum?id=1LoVwFkZNo

---

Title: Guiding Online Reinforcement Learning with Action-Free Offline Pretraining

Abstract: Offline RL methods have been shown to reduce the need for environment interaction by training agents using offline collected episodes. However, the action information in offline episodes can be difficult or even impossible to collect in some practical cases. This paper investigates the problem of using action-free offline datasets to improve online reinforcement learning. We introduce Action-Free Guide (AF-Guide), a method to extract task-relevant knowledge from separate action-free offline datasets. AF-Guide employs an Action-Free Decision Transformer (AFDT) that learns from such datasets to plan the next states, given desired future returns. In turn, AFDT guides an online-learning agent trained by "Guided Soft Actor-Critic"(Guided SAC). Experiments show that AF-Guide can improve RL sample efficiency and performance. Our code is in the supplementary and will be made publicly available.

URL: https://openreview.net/forum?id=0vQjv636yT

---

Title: One by One, Continual Coordinating with Humans via Hyper-Teammate Identification

Abstract: One of the primary objectives in modern artificial intelligence research is to empower agents to effectively coordinate with diverse teammates, particularly human teammates. Previous studies have focused on training agents either with a fixed population of pre-generated teammates or through the co-evolution of distinct populations of agents and teammates. However, it is challenging to enumerate all possible teammates in advance, and it is costly, or even impractical, to maintain a sufficiently diverse population and repeatedly interact with previously encountered teammates. Additional design considerations, such as prioritized sampling, are also required to ensure efficient training. To address these challenges and obtain an efficient human-AI coordination paradigm, we propose a novel approach called \textbf{Concord}. Considering that human participants tend to arrive in a sequential manner, we model the training process with different teammates as a continual learning framework, akin to how humans learn and adapt in the real world. We propose a mechanism based on hyper-teammate identification to prevent catastrophic forgetting while promoting forward knowledge transfer. Concretely, we introduce a teammate recognition module that captures the identity of the corresponding teammate. Leveraging this identification, a well-coordinated AI policy can be generated via a hyper-network. The entire framework is trained in a decomposed policy gradient manner, allowing for effective credit assignment among agents. This approach enables us to train agents with each generated teammate or human one by one, ensuring that agents can coordinate effectively with current teammates without forgetting previous knowledge. Our approach outperforms multiple baselines in various multi-agent benchmarks, either with generated human proxies or real human participants.

URL: https://openreview.net/forum?id=HVxumpoWBm

---

Title: EffSeg: Efficient Fine-Grained Instance Segmentation using Structure-Preserving Sparsity

Abstract: Many two-stage instance segmentation heads predict a coarse 28x28 mask per instance, which is insufficient to capture the fine-grained details of many objects. To address this issue, PointRend and RefineMask predict a 112x112 segmentation mask resulting in higher quality segmentations. However, both methods have limitations by either not having access to neighboring features (PointRend) or by performing computation at all spatial locations instead of sparsely (RefineMask). In this work, we propose EffSeg performing fine-grained instance segmentation in an efficient way by using our Structure-Preserving Sparsity (SPS) method based on separately storing the active features, the passive features, and a dense 2D index map containing the feature indices. The goal of the index map is to preserve the 2D spatial configuration or structure between the features such that any 2D operation can still be performed. EffSeg achieves similar performance on COCO compared to RefineMask, while reducing the number of FLOPs by 71% and increasing the FPS by 29%. Code will be released.

URL: https://openreview.net/forum?id=nrgquPL60D

---

Title: InduCE: Inductive Counterfactual Explanations for Graph Neural Networks

Abstract: Graph neural networks (GNNs) drive several real-world applications including drug discovery, recommendation engines, and chip design. Unfortunately, GNNs are a black box since they do not allow human-intelligible explanations of their predictions. Counterfactual reasoning is an effort to overcome this limitation. Specifically, the objective is to minimally perturb the input graph to a GNN so that its prediction changes. While several algorithms have been proposed towards counterfactual explanations of GNNs, the majority suffer from three key limitations: (1) they only consider perturbations in the form of deletions of existing edges, (2) they perform an inefficient exploration of the combinatorial search space, and (3) the counterfactual explanation model is transductive in nature, i.e., it does not generalize to unseen data. In this work, we propose an inductive algorithm called InduCE that overcomes these limitations. Through extensive experiments on graph datasets, we show that incorporating edge additions and modelling the marginal effect of perturbations aid in generating better counterfactuals among the available recourses. Furthermore, inductive modeling enables InduCE to directly predict counterfactual perturbations without requiring instance-specific training. This leads to significant computational speed-up over baselines and allows counterfactual analyses for GNNs at scale.

URL: https://openreview.net/forum?id=RZPN8cgqST

---

Title: Asynchronous Training Schemes in Distributed Learning with Time Delay

Abstract: In the context of distributed deep learning, the issue of stale weights or gradients could result in poor algorithmic performance. This issue is usually tackled by delay tolerant algorithms with some mild assumptions on the objective functions and step sizes. In this paper, we propose a different approach to develop a new algorithm, called \textbf{P}redicting \textbf{C}lipping \textbf{A}synchronous \textbf{S}tochastic \textbf{G}radient \textbf{D}escent (aka, PC-ASGD). Specifically, PC-ASGD has two steps - the \textit{predicting step} leverages the gradient prediction using Taylor expansion to reduce the staleness of the outdated weights while the \textit{clipping step} selectively drops the outdated weights to alleviate their negative effects. A tradeoff parameter is introduced to balance the effects between these two steps. Theoretically, we present the convergence rate considering the effects of delay of the proposed algorithm with constant step size when the smooth objective functions are weakly strongly-convex and nonconvex. One practical variant of PC-ASGD is also proposed by adopting a condition to help with the determination of the tradeoff parameter. For empirical validation, we demonstrate the performance of the algorithm with two deep neural network architectures on two benchmark datasets.
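
The sketch below illustrates my reading of the two steps on a toy quadratic: the predicting step approximates the current gradient from a stale one via a first-order Taylor expansion (with a finite-difference Hessian-vector product), and the clipping step drops the stale gradient when the parameter drift is too large. Function names, the drift threshold, and the fallback behaviour are illustrative assumptions, not the PC-ASGD implementation.

import numpy as np

def predict_gradient(grad_fn, theta_stale, grad_stale, theta_now, eps=1e-5):
    """Taylor-style prediction: g(theta_now) ~ g(theta_stale) + H(theta_stale) @ (theta_now - theta_stale),
    with the Hessian-vector product approximated by a finite difference of grad_fn."""
    d = theta_now - theta_stale
    hvp = (grad_fn(theta_stale + eps * d) - grad_stale) / eps
    return grad_stale + hvp

def clip_or_predict(grad_fn, theta_stale, grad_stale, theta_now, max_drift=1.0):
    """Clipping step: if the drift is too large for the expansion to be trusted, drop the
    stale gradient (return None) instead of predicting."""
    if np.linalg.norm(theta_now - theta_stale) > max_drift:
        return None
    return predict_gradient(grad_fn, theta_stale, grad_stale, theta_now)

# Quadratic toy loss f(theta) = 0.5 * theta^T A theta, so the prediction is exact up to eps error.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_fn = lambda t: A @ t
theta_stale, theta_now = np.array([1.0, 1.0]), np.array([0.6, 0.9])
pred = clip_or_predict(grad_fn, theta_stale, grad_fn(theta_stale), theta_now)
print(pred, grad_fn(theta_now))   # nearly identical for this quadratic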

URL: https://openreview.net/forum?id=zOGJxw07Z6

---

Title: The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

Abstract: The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model's throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity throughout all our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research.

URL: https://openreview.net/forum?id=LxBpIl1uBD
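
A back-of-the-envelope sketch of the compute-class idea: the token budget follows from a model's measured throughput and the chosen accelerator-hour budget (the numbers below are illustrative):

    def token_budget(tokens_per_second: float, accelerator_hours: float) -> int:
        # Tokens a model can be trained on within a given compute class.
        return int(tokens_per_second * accelerator_hours * 3600)

    # e.g. a model with a measured throughput of 25k tokens/s in a 6h compute class:
    print(token_budget(25_000, 6))   # 540,000,000 tokens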

---

Title: Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection

Abstract: We present a comprehensive experimental study on pretrained feature extractors for visual out-of-distribution (OOD) detection, focusing on adapting contrastive language-image pretrained (CLIP) models. Without fine-tuning on the training data, we are able to establish a positive correlation ($R^2\geq0.92$) between in-distribution classification and unsupervised OOD detection for CLIP models in $4$ benchmarks. We further propose a new simple and scalable method called \textit{pseudo-label probing} (PLP) that adapts vision-language models for OOD detection. Given a set of label names of the training set, PLP trains a linear layer using the pseudo-labels derived from the text encoder of CLIP. To test the OOD detection robustness of pretrained models, we develop a novel feature-based adversarial OOD data manipulation approach to create adversarial samples. Intriguingly, we show that (i) PLP outperforms the previous state-of-the-art \citep{ming2022mcm} on all $5$ large-scale benchmarks based on ImageNet, specifically by an average AUROC gain of 3.4\% using the largest CLIP model (ViT-G), (ii) linear probing outperforms fine-tuning by large margins for CLIP architectures (i.e., CLIP ViT-H achieves a mean gain of 7.3\% AUROC across all ImageNet-based benchmarks), and (iii) billion-parameter CLIP models still fail at detecting adversarially manipulated OOD images. The code and adversarially created datasets will be made publicly available.

URL: https://openreview.net/forum?id=YCgX7sJRF1
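
A hedged sketch of the pseudo-label probing (PLP) idea, assuming image and class-name text features have already been extracted with a frozen CLIP model (function and variable names are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def plp_probe(image_feats: np.ndarray, text_feats: np.ndarray) -> LogisticRegression:
        # image_feats: (N, D) features of unlabeled training images (frozen image encoder)
        # text_feats:  (K, D) embeddings of the K class names (frozen text encoder)
        img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
        txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
        pseudo_labels = (img @ txt.T).argmax(axis=1)     # zero-shot pseudo-labels
        return LogisticRegression(max_iter=1000).fit(image_feats, pseudo_labels)

    # OOD scoring could then use, e.g., the probe's maximum softmax probability.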

---

Title: Large Language Models' Understanding of Mathematics: Source Criticism and Extrapolation

Abstract: It has been suggested that large language models such as GPT-4 have acquired some form of understanding beyond the correlations among the words in text, including some understanding of mathematics. Here, we perform a critical inquiry into this claim by evaluating the mathematical understanding of the GPT-4 model. Considering that GPT-4's training set is secret, it is not straightforward to evaluate whether the model's correct answers are based on mathematical understanding or on replication of proofs that the model has seen before. We specifically craft mathematical questions whose formal proofs are not readily available on the web, proofs that GPT-4 is unlikely to have seen. We see that GPT-4 is unable to solve those problems despite their simplicity. It is hard to find scientific evidence suggesting that GPT-4 has acquired an understanding of even basic mathematical concepts. A straightforward way to find failure modes of GPT-4 in theorem proving is to craft questions whose formal proofs are not available on the web. Our findings suggest that GPT-4's ability lies in reproducing, rephrasing, and polishing mathematical proofs that it has seen before, not in grasping mathematical concepts. We also see that GPT-4's ability to prove mathematical theorems is continuously expanding over time despite the claim that it is a fixed model. We suggest that the task of proving mathematical theorems in formal language is comparable to the methods used in search engines such as Google, while predicting the next word in a sentence may be a misguided approach, a recipe that often leads to excessive extrapolation and eventual failures. Prompting GPT-4 over and over may benefit GPT-4 and OpenAI, but we question whether it is valuable for machine learning or for theorem proving.

URL: https://openreview.net/forum?id=oR9XuWYZLq

---

Title: Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models

Abstract: The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. While GD has optimal statistical and computational complexities for estimating the true parameter under the high signal-to-noise ratio (SNR) regime of the GLMs, it has sub-optimal complexities when the SNR is low, namely, the iterates of GD require a polynomial number of iterations to reach the final statistical radius due to the local convexity of the least-square loss functions of the GLMs in this case. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings as it is $\mathcal{O}(d^3)$. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method to solve parameter estimation of the GLMs, which has a per iteration cost of $\mathcal{O}(d^2)$. On the optimization side, when the SNR is low, we demonstrate that iterates of BFGS converge linearly to the optimal solution of the population least-square loss function, and the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. On the statistical side, we prove that the iterates of BFGS reach the final statistical radius of the low SNR GLMs after a logarithmic number of iterations, which is much lower than the polynomial number of iterations of GD.

URL: https://openreview.net/forum?id=PIL3YWXmx2
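
For reference, the standard BFGS inverse-Hessian update that the analysis builds on, in minimal form; each update costs $\mathcal{O}(d^2)$, versus the $\mathcal{O}(d^3)$ of a Newton step (an illustrative sketch, not the paper's GLM-specific procedure):

    import numpy as np

    def bfgs_inverse_hessian_update(H: np.ndarray, s: np.ndarray, y: np.ndarray) -> np.ndarray:
        # s = x_{k+1} - x_k,  y = grad_{k+1} - grad_k
        rho = 1.0 / (y @ s)
        I = np.eye(len(s))
        V = I - rho * np.outer(s, y)
        return V @ H @ V.T + rho * np.outer(s, s)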

---

Title: PC-SwinMorph: Patch Representation for Unsupervised Medical Image Registration

Abstract: Medical image registration is a critical task for several clinical procedures. Manual realisation of these tasks is time-consuming, and the quality is highly dependent on the physician's level of expertise. To mitigate this laborious task, automatic tools have been developed, the majority of which are supervised techniques. However, in the medical domain, the strong assumption of having a well-representative ground truth is far from being realistic. To overcome this challenge, unsupervised techniques have been investigated. However, they are still limited in performance and fail to produce plausible results. In this work, we propose a novel unsupervised framework for image registration that we call PC-SwinMorph. The core of our framework consists of two patch-based strategies, where we demonstrate that patch representation is key for performance gain. We first introduce a patch-based contrastive strategy that enforces locality conditions and richer feature representation. We also introduce a novel patch stitching strategy based on a 3D window/shifted-window multi-head self-attention module to eliminate artifacts from the patch splitting. We demonstrate, through a set of numerical and visual results, that our technique outperforms current state-of-the-art unsupervised techniques.

URL: https://openreview.net/forum?id=7p8Ce3Cq8p

---

Title: Hierarchical Decomposition Framework for Feasibility-hard Combinatorial Optimization

Abstract: Combinatorial optimization (CO) is a widely-applied method for addressing a variety of real-world optimization problems. However, due to the NP-hard nature of these problems, complex problem-specific heuristics are often required to tackle them at real-world scales. Neural combinatorial optimization has emerged as an effective approach to tackle CO problems, but it often requires a pre-computed optimal solution or a hand-designed process to ensure that the model generates a feasible solution, neither of which may be available in many real-world CO problems. We propose the hierarchical combinatorial optimizer (HCO) that does not rely on such restrictive assumptions. HCO decomposes the given CO problem into multiple sub-problems on different scales with smaller search spaces, where each sub-problem can be optimized separately and their solutions can be combined to compose the entire solution. Our experiments demonstrate that this hierarchical decomposition facilitates more efficient learning and stronger generalization capabilities in terms of optimality of the solution. It outperforms traditional heuristic, mathematical optimization, and learning-based algorithms on the Steiner Tree Packing Problem (STPP), a problem for which a hand-designed process cannot guarantee a feasible solution.

URL: https://openreview.net/forum?id=GdPVxV6t9F

---

Title: Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold

Abstract: Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the \emph{selected completely at random} (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms can estimate $\alpha$ or the probability of an individual unlabeled instance being positive or both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.

URL: https://openreview.net/forum?id=L4lHDXzcDu

---

Title: New Guarantees for Learning Revenue Maximizing Menus of Lotteries and Two-Part Tariffs

Abstract: We advance a recently flourishing line of work at the intersection of learning theory and computational economics by studying the learnability of two classes of mechanisms prominent in economics, namely menus of lotteries and two-part tariffs. The former is a family of randomized mechanisms designed for selling multiple items, known to achieve revenue beyond deterministic mechanisms, while the latter is designed for selling multiple units (copies) of a single item with applications in real-world scenarios such as car or bike-sharing services. We focus on learning high-revenue mechanisms of this form from buyer valuation data in both distributional settings, where we have access to buyers’ valuation samples up-front, and the more challenging and less-studied online settings, where buyers arrive one-at-a-time and no distributional assumption is made about their values. We provide a suite of results with regard to these two families of mechanisms. We provide the first online learning algorithms for menus of lotteries and two-part tariffs with strong regret-bound guarantees. Since the space of parameters is infinite and the revenue functions have discontinuities, the known techniques do not readily apply. However, we are able to provide a reduction to online learning over a finite number of experts, in our case, a finite number of parameters. Furthermore, in the limited buyers type case, we show a reduction to online linear optimization, which allows us to obtain no-regret guarantees by presenting buyers with menus that correspond to a barycentric spanner. In addition, we provide algorithms with improved running times over prior work for the distributional settings. Finally, we demonstrate how techniques from the recent literature in data-driven algorithm design are insufficient for our studied problems.

URL: https://openreview.net/forum?id=mhawjZcmrJ

---

Title: Fast and Expressive Gesture Recognition using a Combination-Homomorphic Electromyogram Encoder

Abstract: We study the task of gesture recognition from electromyography (EMG), with the goal of enabling expressive human-computer interaction at high accuracy, while minimizing the time required for new subjects to provide calibration data.
To fulfill these goals, we define combination gestures consisting of a direction component and a modifier component.
New subjects only demonstrate the single component gestures and we seek to extrapolate from these to all possible single or combination gestures.
We extrapolate to unseen combination gestures by combining the feature vectors of real single gestures to produce synthetic training data.
This strategy allows us to provide a large and flexible gesture vocabulary, while not requiring new subjects to demonstrate combinatorially many example gestures.
We pre-train an encoder and a combination operator using self-supervision, so that we can produce useful synthetic training data for unseen test subjects.
To evaluate the proposed method, we collect a real-world EMG dataset, and measure the effect of augmented supervision against two baselines: a partially-supervised model trained with only single gesture data from the unseen subject, and a fully-supervised model trained with real single and real combination gesture data from the unseen subject.
We find that the proposed method provides a dramatic improvement over the partially-supervised model, and achieves a useful classification accuracy that in some cases approaches the performance of the fully-supervised model.

URL: https://openreview.net/forum?id=j5T4pcLbcY

---

Title: PopulAtion Parameter Averaging (PAPA)

Abstract: Ensemble methods combine the predictions of multiple models to improve performance, but they require significantly higher computation costs at inference time. To avoid these costs, multiple neural networks can be combined into one by averaging their weights. However, this usually performs significantly worse than ensembling. Weight averaging is only beneficial when the networks are different enough to benefit from combining them, yet similar enough to average well. Based on this idea, we propose PopulAtion Parameter Averaging (PAPA): a method that combines the generality of ensembling with the efficiency of weight averaging. PAPA leverages a population of diverse models (trained on different data orders, augmentations, and regularizations) while slowly pushing the weights of the networks toward the population average of the weights. PAPA reduces the performance gap between averaging and ensembling, increasing the average accuracy of a population of models by up to 0.8% on CIFAR-10, 1.9% on CIFAR-100, and 1.6% on ImageNet when compared to training independent (non-averaged) models.

URL: https://openreview.net/forum?id=cPDVjsOytS
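
A minimal sketch of a PAPA-style push toward the population average (illustrative, not the authors' code; it assumes all models share the same architecture so parameters align):

    import torch

    @torch.no_grad()
    def papa_push(models: list, rate: float = 0.01) -> None:
        # Nudge every network's weights toward the population average of the weights.
        for layer in zip(*(m.parameters() for m in models)):
            avg = torch.stack([p.data for p in layer]).mean(dim=0)
            for p in layer:
                p.data.lerp_(avg, rate)   # p <- (1 - rate) * p + rate * avg

    # Called every few optimization steps, while each model otherwise trains on its
    # own data order, augmentations, and regularization.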

---

Title: Addressing Attribute Bias with Adversarial Support-Matching

Abstract: When trained on diverse labelled data, machine learning models have proven themselves to be a powerful tool in all facets of society. However, due to budget limitations, deliberate or non-deliberate censorship, and other problems during data collection, certain groups may be under-represented in the labelled training set. We investigate a scenario in which the absence of certain data is linked to the second level of a two-level hierarchy in the data. Inspired by the idea of protected attributes from algorithmic fairness, we consider generalised secondary "attributes" which subdivide the classes into smaller partitions. We refer to the partitions defined by the combination of an attribute and a class label, or leaf nodes in the aforementioned hierarchy, as groups. To characterise the problem, we introduce the concept of classes with incomplete attribute support. The representational bias in the training set can give rise to spurious correlations between the classes and the attributes, which cause standard classification models to generalise poorly to unseen groups. To overcome this bias, we make use of an additional, diverse but unlabelled dataset, called the deployment set, to learn a representation that is invariant to the attributes. This is done by adversarially matching the support of the training and deployment sets in representation space using a set discriminator operating on sets, or bags, of samples. In order to learn the desired invariance, it is paramount that the bags are balanced by class; this is easily achieved for the training set, but requires using semi-supervised clustering for the deployment set. We demonstrate the effectiveness of our method on several datasets and realisations of the problem.

URL: https://openreview.net/forum?id=JYbnJ92TJf

---

Title: On the neural approximation of set functions: A survey from permutation-invariant perspective

Abstract: Conventional machine learning algorithms have traditionally been designed under the assumption that input data follows a vector-based format, with an emphasis on vector-centric paradigms. However, as the demand for tasks involving set-based inputs has grown, there has been a paradigm shift in the research community towards addressing these challenges. In recent years, the emergence of neural network architectures such as Deep Sets and Transformers has presented a significant advancement in the treatment of set-based data. These architectures are specifically engineered to naturally accommodate sets as input, enabling more effective representation and processing of set structures. Consequently, there has been a surge of research endeavors dedicated to exploring and harnessing the capabilities of these architectures for various tasks involving the approximation of set functions. This survey aims to provide an overview of the diverse problem settings and ongoing research efforts pertaining to neural networks that approximate set functions. By delving into the intricacies of these approaches and elucidating the associated challenges, the survey aims to equip readers with a comprehensive understanding of the field. Through this perspective, we hope that researchers and practitioners can gain valuable insights into the potential applications, inherent limitations, and future directions of set-based neural networks.

URL: https://openreview.net/forum?id=gELzFTYQDi

---

Title: Multi-Agent Coordination via Multi-Level Communication

Abstract: The partial observability and stochasticity in multi-agent settings can be mitigated by accessing more information about others via communication. However, the coordination problem still exists since agents cannot communicate actual actions with each other at the same time due to the circular dependencies. In this paper, we propose a novel multi-level communication scheme, Sequential Communication (SeqComm). SeqComm treats agents asynchronously (the upper-level agents make decisions before the lower-level ones) and has two communication phases. In the negotiation phase, agents determine the priority of decision-making by communicating hidden states of observations and comparing the value of intention, which is defined as the agent’s future behavior without considering others and obtained by modeling the environment dynamics. In the launching phase, the upper-level agents take the lead in making decisions and then communicate their actions with the lower-level agents. Theoretically, we prove the policies learned by SeqComm are guaranteed to improve monotonically and converge. Empirically, we show that SeqComm outperforms existing methods in a variety of cooperative multi-agent tasks.

URL: https://openreview.net/forum?id=D3F6NqsSNv

---

Title: Progressive-Hint Prompting Improves Reasoning in Large Language Models

Abstract: The performance of Large Language Models (LLMs) in reasoning tasks depends heavily on prompt design, with Chain-of-Thought (CoT) and self-consistency being critical methods that enhance this ability. However, these methods do not fully exploit the answers generated by the LLM to guide subsequent responses. This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP), that enables automatic multiple interactions between users and LLMs by using previously generated answers as hints to progressively guide toward the correct answers. PHP is orthogonal to CoT and self-consistency, making it easy to combine with state-of-the-art techniques to further improve performance. We conducted extensive and comprehensive experiments on seven benchmarks. The results show that PHP significantly improves accuracy while remaining highly efficient. For instance, with text-davinci-003, we observed a 4.2% improvement on GSM8K with greedy decoding compared to Complex CoT, and a 46.17% reduction in sample paths with self-consistency. With GPT-4 and PHP, we achieve state-of-the-art performance on SVAMP (89.1% $\rightarrow$ 91.9%), GSM8K (92% $\rightarrow$ 95.5%), AQuA (76.4% $\rightarrow$ 79.9%) and MATH (50.3% $\rightarrow$ 53.9%).

URL: https://openreview.net/forum?id=5HsBuYYx4i
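
A sketch of a Progressive-Hint Prompting loop; `ask_llm` is a placeholder for any completion call, and the prompt wording and stopping rule below are illustrative rather than the authors' exact recipe:

    def progressive_hint_prompting(question: str, ask_llm, max_rounds: int = 5) -> str:
        hints = []
        answer = ask_llm(question)                    # base (e.g. chain-of-thought) answer
        while len(hints) < max_rounds:
            hints.append(answer)
            hinted = f"{question}\n(Hint: the answer is near {', '.join(hints)}.)"
            new_answer = ask_llm(hinted)
            if new_answer == answer:                  # stop once the answer stabilizes
                break
            answer = new_answer
        return answer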

---

Title: Maximizing Global Model Appeal in Federated Learning

Abstract: Federated learning (FL) aims to collaboratively train a global model using local data from a network of clients. To warrant collaborative training, each federated client may expect the resulting global model to satisfy some individual requirement, such as achieving a certain loss threshold on their local data. However, in real FL scenarios, the global model may not satisfy the requirements of all clients in the network due to the data heterogeneity across clients. In this work, we explore the problem of global model appeal in FL, which we define as the total number of clients that find that the global model satisfies their individual requirements. We discover that global models trained using traditional FL approaches can result in a significant number of clients unsatisfied with the model based on their local requirements. As a consequence, we show that global model appeal can directly impact how clients participate in training and how the model performs on new clients at inference time. Our work proposes MaxFL, which maximizes the number of clients that find the global model appealing. MaxFL achieves a 22-40% and 18-50% improvement in the test accuracy of training clients and (unseen) test clients respectively, compared to a wide range of FL approaches that tackle data heterogeneity, aim to incentivize clients, and learn personalized/fair models.

URL: https://openreview.net/forum?id=8GI1SXqJBk

---

Title: Sharpness-Aware Minimization Scaled by Outlier Normalization for Improving Robustness on Noisy DNN Accelerators

Abstract: Energy-efficient deep neural network (DNN) accelerators are prone to non-idealities that degrade DNN performance at inference time. To mitigate such degradation, existing methods typically add perturbations to the DNN weights during training to simulate inference on noisy hardware. However, this often requires knowledge about the target hardware and leads to a trade-off between DNN performance and robustness, decreasing the former to increase the latter. In this work, we first show that applying sharpness-aware training, by optimizing for both the loss value and loss sharpness, significantly improves robustness to noisy hardware at inference time without relying on any assumptions about the target hardware. Then, we propose a new adaptive sharpness-aware method that conditions the worst-case perturbation of a given weight not only on its magnitude but also on the range of the weight distribution. This is achieved by performing sharpness-aware minimization scaled by outlier normalization (SAMSON). Our extensive results on several models and datasets show that SAMSON increases model robustness to noisy weights without compromising generalization performance in noiseless regimes.

URL: https://openreview.net/forum?id=jRACDK2dGZ

---

Title: Learning Sparse Graphs for Functional Regression using Graph-induced Operator-valued Kernels

Abstract: A functional regression problem aims to learn a map $\mathfrak{F}:\mathcal{Z}\mapsto\mathcal{Y}$, where $\mathcal{Z}$ is an appropriate input space and $\mathcal{Y}$ is a space of output functions. When $\mathcal{Z}$ is also a space of functions, the learning problem is known as function-to-function regression. In this work, we consider the problem of learning a map of the form ${F}:{\mathcal{Z}}^p\mapsto\mathcal{Y}$, a many-to-one function-to-function regression problem, where the aim is to learn a suitable $F$ which maps $p$ input functions to an output function. In order to solve this regression problem with $p$ input functions and a corresponding output function, we propose a graph-induced operator-valued kernel (OVK) obtained by imposing a graphical structure describing the inter-relationships among the $p$ input functions. When the underlying graphical structure is unknown, we propose to learn an appropriate Laplacian matrix characterizing the graphical structure, which would also aid in learning the map $F$. We formulate a learning problem using the proposed graph-induced OVK, and devise an alternating minimization framework to solve the learning problem. To learn $F$ along with meaningful and important interactions in the graphical structure, a minimax concave penalty (MCP) is used as a sparsity-inducing regularization on the Laplacian matrix. We further extend the alternating minimization framework to learn $F$, where each of the $p$ constituent input functions as well as the output function are multi-dimensional. To scale the proposed algorithm to large datasets, we design an efficient sample-based approximation algorithm. Further, we provide bounds on generalization error for the map obtained by solving the proposed learning problem. An extensive empirical evaluation on both synthetic and real data demonstrates the utility of the proposed learning framework. Our experiments show that simultaneous learning of $F$ along with sparse graphical structure helps in discovering significant relationships among the input functions, and motivates interpretability of such relationships driving the regression problem.

URL: https://openreview.net/forum?id=f9l4eiPKpV

---

Title: Guided DropBlock and CNN Filter Augmentation for Data Constrained Learning

Abstract: Deep learning algorithms have achieved remarkable success in various domains; yet, training them in environments with limited data remains a significant hurdle, owing to their reliance on millions of parameters. This paper addresses the intricate issue of training under data scarcity, introducing two novel techniques: Guided DropBlock and Filter Augmentation for resource-constrained deep learning scenarios. Guided DropBlock is inspired by the DropBlock regularization method. Unlike its predecessor, which randomly omits a contiguous segment of the image, the proposed approach is more selective, focusing the omission on the background and specific blocks that carry critical semantic information about the objects in question. On the other hand, the filter augmentation technique we propose involves performing a series of operations on the Convolutional Neural Network (CNN) filters during the training phase. Our findings indicate that integrating filter augmentation while fine-tuning the CNN model can substantially enhance performance in data-limited situations. This approach results in a smoother decision boundary and behavior resembling an ensemble model. Imposing these additional constraints on loss optimization helps mitigate the challenges posed by data scarcity, ensuring robust feature extraction from the input signal, even when some learnable parameters within the CNN layers are frozen. We have validated these enhancements on seven publicly accessible benchmark datasets, as well as two real-world use cases, namely, identifying newborns and monitoring post-cataract surgery conditions, providing empirical support for our claims.

URL: https://openreview.net/forum?id=1KWQ3eKonE

---

Title: A decoder-only foundation model for time-series forecasting

Abstract: Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.

URL: https://openreview.net/forum?id=eU79Cn1SiI

---

Title: Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Abstract: Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations---such as word-substitution---does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets---both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1\%$\,\to\,$50.1\%) and on two future unseen rounds of human generated attacks (32.5\%$\,\to\,$43.4\%, and 29.4\%$\,\to\,$40.2\%). In hate speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.

URL: https://openreview.net/forum?id=UAT4j3Y7HP

---

Title: Low-Rank Learning by Design: the Role of Network Architecture and Activation Linearity in Gradient Rank Collapse

Abstract: Our understanding of learning dynamics of deep neural networks (DNNs) remains incomplete. Recent research has begun to uncover the mathematical principles underlying these networks, including the phenomenon of ``Neural Collapse'', where linear classifiers within DNNs converge to specific geometrical structures during late-stage training. However, the role of geometric constraints in learning extends beyond this terminal phase. For instance, gradients in fully-connected layers naturally develop a low-rank structure due to the accumulation of rank-one outer products over a training batch. Despite the attention given to methods that exploit this structure for memory saving or regularization, the emergence of low-rank learning as an inherent aspect of certain DNN architectures has been under-explored. In this paper, we conduct a comprehensive study of gradient rank in DNNs, examining how architectural choices and structure of the data affect gradient rank bounds. Our theoretical analysis provides these bounds for training fully-connected, recurrent, and convolutional neural networks. We also demonstrate, both theoretically and empirically, how design choices like activation function linearity, bottleneck layer introduction, convolutional stride, and sequence truncation influence these bounds. Our findings not only contribute to the understanding of learning dynamics in DNNs, but also provide practical guidance for deep learning engineers to make informed design decisions.

URL: https://openreview.net/forum?id=emI4kWlOZQ

---

Title: Mixed Nash for Robust Federated Learning

Abstract: We study robust federated learning (FL) within a game-theoretic framework to alleviate the server's vulnerability to even an informed adversary who can tailor training-time attacks. Specifically, we introduce RobustTailor, a simulation-based framework that prevents the adversary from being omniscient, and we derive its convergence guarantees. RobustTailor improves robustness to training-time attacks significantly while preserving almost the same privacy guarantees as standard robust aggregation schemes in FL. Empirical results under challenging attacks show that RobustTailor performs close to an upper bound with perfect knowledge of honest clients.

URL: https://openreview.net/forum?id=mqMzerrVOB

---

Title: Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and as Non-Linear Diffusion

Abstract: This paper presents a comprehensive theoretical analysis of the graph p-Laplacian regularized framelet network (pL-UFG) to establish a solid understanding of its properties. We conduct a convergence analysis on pL-UFG, addressing the gap in the understanding of its asymptotic behaviors. Further by investigating the generalized Dirichlet energy of pL-UFG, we demonstrate that the Dirichlet energy remains non-zero throughout convergence, ensuring the avoidance of over-smoothing issues. Additionally, we elucidate the energy dynamic perspective, highlighting the synergistic relationship between the implicit layer in pL-UFG and graph framelets. This synergy enhances the model's adaptability to both homophilic and heterophilic data. Notably, we reveal that pL-UFG can be interpreted as a generalized non-linear diffusion process, thereby bridging the gap between pL-UFG and differential equations on the graph. Importantly, these multifaceted analyses lead to unified conclusions that offer novel insights for understanding and implementing pL-UFG, as well as other graph neural network (GNN) models. Finally, based on our dynamic analysis, we propose two novel pL-UFG models with manually controlled energy dynamics. We demonstrate empirically and theoretically that our proposed models not only inherit the advantages of pL-UFG but also significantly reduce computational costs for training on large-scale graph datasets.

URL: https://openreview.net/forum?id=q4iSLPoFe7

---

Title: Discovering Model Structure of Dynamical Systems with Combinatorial Bayesian Optimization

Abstract: Deciding on a model structure is a fundamental problem in machine learning. We are interested in building a data-based model for the governing equations of a physical system from a library of discrete components. In addition to optimizing the model for performance, we consider crash and inequality constraints that arise from additional model requirements, such as real-time capability and model complexity regularization. We address this task of model structure selection with a focus on dynamical systems and propose to search over potential model structures efficiently using a constrained combinatorial Bayesian Optimization (BO) algorithm. We propose expressive surrogate models suited for combinatorial domains and a novel acquisition function that can handle both inequality and crash constraints and can be computed in closed form. We provide simulated benchmark problems within the domain of equation discovery of nonlinear dynamical systems. Our method outperforms the state-of-the-art in constrained combinatorial optimization of black-box functions and has a favorable computational overhead compared to other BO methods. As a real-world application example, we apply our method to optimize the configuration of an electric vehicle's digital twin while ensuring its real-time capability for the use in one of the world's largest driving simulators.

URL: https://openreview.net/forum?id=2iOOvQmJBK

---

Title: Anytime-Valid Confidence Sequences for Consistent Uncertainty Estimation in Early-Exit Neural Networks

Abstract: Early-exit neural networks (EENNs) facilitate adaptive inference by producing predictions at multiple stages of the forward pass. In safety-critical applications, these predictions are only meaningful when complemented with reliable uncertainty estimates. Yet, due to their sequential structure, an EENN's uncertainty estimates should also be consistent: labels that are deemed improbable at one exit should not reappear within the confidence interval / set of later exits. We show that standard uncertainty quantification techniques, like Bayesian methods or conformal prediction, can lead to inconsistency across exits. We address this problem by applying anytime-valid confidence sequences (AVCSs) to the exits of EENNs. By design, AVCSs maintain consistency across exits. We examine the theoretical and practical challenges of applying AVCSs to EENNs and empirically validate our approach on both regression and classification tasks.

URL: https://openreview.net/forum?id=GNVNEm6NI9
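
An illustrative sketch of the consistency property discussed above: intersecting per-exit confidence sets as the forward pass proceeds guarantees that a label ruled out at an early exit never reappears later (the sets here are placeholders, not actual anytime-valid confidence sequences):

    def consistent_sets(per_exit_sets):
        # per_exit_sets: one confidence set (set of label ids) per exit, in exit order.
        running, out = None, []
        for s in per_exit_sets:
            running = set(s) if running is None else running & s
            out.append(set(running))
        return out

    print(consistent_sets([{0, 1, 2}, {1, 2, 3}, {2}]))   # [{0, 1, 2}, {1, 2}, {2}]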

---

Title: Are Population Graphs Really as Powerful as Believed?

Abstract: Population graphs and their use in combination with graph neural networks (GNNs) have demonstrated promising results for multi-modal medical data integration and improving disease diagnosis and prognosis. Several different methods for constructing these graphs as well as advanced graph learning techniques have been applied and established to maximise the predictive power of GNNs on population graphs. However, in this work, we raise the question of whether existing methods are really strong enough by showing that simple baseline methods, such as random forests or linear regression, perform on par with advanced graph learning models on several population graph datasets for a variety of different clinical applications, such as age regression or disease prediction. We utilise benchmark citation datasets as well as the commonly used public population graph datasets TADPOLE and ABIDE, a brain age estimation and a cardiac dataset from the UK Biobank, and a real-world in-house COVID dataset. We investigate (a) the utility of GNNs for multi-modal data integration in the context of population graphs and (b) the impact of the graph structure on GNN performance. We conclude that GNNs are only beneficial for population graph studies if the graph structure adds meaningful additional information to the node features and show that the node features dominate the predictive power of GNNs in these studies.

URL: https://openreview.net/forum?id=TTRDCVnbjI

---

Title: The Interplay of Uncertainty Modeling and Deep Active Learning: An Empirical Analysis

Abstract: Deep active learning (AL) seeks to reduce the annotation costs required for training deep neural networks (DNNs). Often, deep AL strategies focus on instances where the predictive uncertainty of a DNN is high. Furthermore, Bayesian concepts to model uncertainty are frequently adopted. Despite considerable research, a detailed analysis of the role of uncertainty in deep AL is still missing, especially regarding aleatoric and epistemic uncertainty, both related to predictive uncertainty. This article provides an in-depth empirical study analyzing the interplay of uncertainty and deep AL in image classification. Our study investigates four hypotheses that provide an intuitive understanding of the effects of accurately estimating aleatoric and epistemic uncertainty on existing uncertainty-based AL strategies but also, in the opposite direction, the impact of uncertainty-based AL on the quality of uncertainty estimates that are needed in many applications. By analyzing these hypotheses on synthetic and real-world data, we find that accurate aleatoric estimates can even impair instance selection, while accurate epistemic estimates have negligible effects. Moreover, we provide a publicly available toolbox for deep AL with various models and strategies to facilitate further research and practical applications. Code is available at github.com/anonymous.

URL: https://openreview.net/forum?id=KLBD13bsVl

---

Title: Fixed-Budget Best-Arm Identification in Sparse Linear Bandits

Abstract: We study the best-arm identification problem in sparse linear bandits under the fixed-budget setting. In sparse linear bandits, the unknown feature vector $\theta^*$ may be of large dimension $d$, but only a few, say $s \ll d$, of these features have non-zero values. We design a two-phase algorithm, Lasso and Optimal-Design based linear best-arm identification (Lasso-OD). The first phase of Lasso-OD leverages the sparsity of the feature vector by applying the thresholded Lasso introduced by Zhou (2009), which estimates the support of $\theta^*$ correctly with high probability using rewards from the selected arms and a judicious choice of the design matrix. The second phase of Lasso-OD applies the OD-LinBAI algorithm by Yang and Tan (2022) on that estimated support. We derive a non-asymptotic upper bound on the error probability of Lasso-OD by carefully choosing hyperparameters (such as Lasso's regularization parameter) and balancing the error probabilities of both phases. For fixed sparsity $s$ and budget $T$, the exponent in the error probability of Lasso-OD depends on $s$ but not on the dimension $d$, yielding a significant performance improvement for sparse and high-dimensional linear bandits. Furthermore, we show that Lasso-OD is almost minimax optimal in the exponent. Finally, we provide numerical examples to demonstrate the significant performance improvement over the existing algorithms for non-sparse linear bandits such as OD-LinBAI, BayesGap, Peace, LinearExploration, and GSE.

URL: https://openreview.net/forum?id=Igxp7FC8uf
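
A sketch of the first, support-estimation phase described above, using a thresholded Lasso on rewards from the sampled arms; the regularization and threshold values are illustrative, and the second phase (OD-LinBAI on the estimated support) is omitted:

    import numpy as np
    from sklearn.linear_model import Lasso

    def estimate_support(X: np.ndarray, rewards: np.ndarray,
                         alpha: float = 0.05, threshold: float = 0.1) -> np.ndarray:
        # X: (n, d) features of the arms pulled so far; rewards: (n,) observed rewards.
        theta_hat = Lasso(alpha=alpha).fit(X, rewards).coef_
        return np.flatnonzero(np.abs(theta_hat) > threshold)   # estimated support of theta*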

---

Title: DIVINE: Diverse-Inconspicuous Feature Learning to Mitigate Abridge Learning

Abstract: Deep learning algorithms aim to minimize overall classification error, and they exhibit impressive performance on test datasets across various domains. However, they often struggle with "out-of-distribution" data samples. We posit that deep models primarily focus on capturing the prominent features beneficial for classification, while neglecting other subtler yet still discriminative features. This phenomenon is referred to as \textit{Abridge Learning}. To address this issue and promote a more comprehensive learning process from data, we introduce a novel \textit{DIVerse and INconspicuous feature lEarning} (DIVINE) approach aimed at counteracting Abridge Learning. DIVINE embodies a holistic learning methodology, effectively utilizing data by engaging with its diverse dominant features. Through experiments conducted on seven datasets, including MNIST, CIFAR10, CIFAR100, TinyImageNet, and their corrupted counterparts (CIFAR10-C, CIFAR100-C, and TinyImageNet-C), we demonstrate that DIVINE encourages the learning of a rich set of features. This, in turn, boosts the model’s robustness and its ability to generalize. The results on out-of-distribution datasets, such as those that are corrupted or have suppressed features, attest to the efficacy of our proposed approach.

URL: https://openreview.net/forum?id=K7gICLoCEo

---

Title: An Evaluation of Large Language Models in Bioinformatics Research

Abstract: Large language models (LLMs) such as ChatGPT have gained considerable interest across diverse research communities. Their notable ability for text completion and generation has inaugurated a novel paradigm for language-interfaced problem solving. However, the potential and efficacy of these models in bioinformatics remain incompletely explored. In this work, we study the performance of GPT variants on a wide spectrum of crucial bioinformatics tasks. These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems. Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks. In addition, we provide a thorough analysis of their limitations in the context of complicated bioinformatics tasks. We envision this work to provide new perspectives and motivate future research in the field of both LLM applications and bioinformatics.

URL: https://openreview.net/forum?id=cCT9cJxj0h

---

Title: Improving and generalizing flow-based generative models with minibatch optimal transport

Abstract: Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized conditional flow matching (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, OT-CFM is the first method to compute dynamic OT in a simulation-free way. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single cell dynamics, unsupervised image translation, and Schrödinger bridge inference.

URL: https://openreview.net/forum?id=CD9Snc73AW
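
A minimal sketch of a conditional flow matching training step with an optional minibatch optimal-transport pairing in the spirit of OT-CFM (a straight-line conditional path is assumed; `v_net` and its signature are placeholders):

    import torch
    from scipy.optimize import linear_sum_assignment

    def cfm_loss(v_net, x0: torch.Tensor, x1: torch.Tensor, ot_pairing: bool = True):
        if ot_pairing:                                    # minibatch OT pairing (OT-CFM)
            cost = torch.cdist(x0, x1) ** 2
            _, cols = linear_sum_assignment(cost.detach().cpu().numpy())
            x1 = x1[torch.as_tensor(cols)]
        t = torch.rand(x0.shape[0], 1)                    # t ~ U(0, 1)
        x_t = (1 - t) * x0 + t * x1                       # point on the conditional path
        target_velocity = x1 - x0                         # its time derivative
        return ((v_net(x_t, t) - target_velocity) ** 2).mean()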

---

Title: Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments

Abstract: As a primary contribution, we present a convergence theorem for stochastic iterations, and in particular, Q-learning iterates, under a general, possibly non-Markovian, stochastic environment. Our conditions for convergence involve an ergodicity and a positivity criterion. We provide a precise characterization on the limit of the iterates and conditions on the environment and initializations for convergence. As our second contribution, we discuss the implications and applications of this theorem to a variety of stochastic control problems with non-Markovian environments involving (i) quantized approximations of fully observed Markov Decision Processes (MDPs) with continuous spaces (where quantization break down the Markovian structure), (ii) quantized approximations of belief-MDP reduced partially observable MDPS (POMDPs) with weak Feller continuity and a mild version of filter stability (which requires the knowledge of the model by the controller), (iii) finite window approximations of POMDPs under a uniform controlled filter stability (which does not require the knowledge of the model), and (iv) for multi-agent models where convergence of learning dynamics to a new class of equilibria, subjective Q-learning equilibria, will be studied. In addition to the convergence theorem, some implications of the theorem above are new to the literature and others are interpreted as applications of the convergence theorem. Some open problems are noted.

URL: https://openreview.net/forum?id=1Yp6xpTV55

---

Title: Identify Ambiguous Tasks Combining Crowdsourced Labels by Weighting Areas Under the Margin

Abstract: In supervised learning — for instance in image classification — modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training, generally leveraging a per-worker trust score.
Yet, such worker-oriented approaches discard the tasks' ambiguity.
Ambiguous tasks might fool expert workers, which is often harmful for the learning step.
In standard supervised learning settings -- with one label per task -- the Area Under the Margin (AUM) was tailored to identify mislabeled data.
We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted Areas Under the Margin (WAUM).
The WAUM is an average of AUMs weighted according to task-dependent scores.
We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization performance.
We report improvements over existing strategies for learning with a crowd, both on simulated settings, and on real datasets such as CIFAR-10H (a crowdsourced dataset with a high number of answered labels), LabelMe and Music (two datasets with few answered votes).

URL: https://openreview.net/forum?id=raD846nj2q
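
A hedged numpy sketch of the WAUM for a single task: an AUM is computed for each worker's answer, and the AUMs are averaged with task-dependent worker weights (shapes and the weighting scheme are illustrative assumptions):

    import numpy as np

    def waum(logits_per_epoch, worker_labels, worker_weights):
        # logits_per_epoch: (epochs, n_classes) logits of one task over training
        # worker_labels:    (n_workers,) label answered by each worker for this task
        # worker_weights:   (n_workers,) task-dependent trust scores
        aums = []
        for y in worker_labels:
            assigned = logits_per_epoch[:, y]
            others = np.delete(logits_per_epoch, y, axis=1).max(axis=1)
            aums.append((assigned - others).mean())   # AUM of this worker's answer
        w = np.asarray(worker_weights, dtype=float)
        w = w / w.sum()
        return float(w @ np.asarray(aums))            # low WAUM -> likely ambiguous task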

---

Title: Anomaly detection with semi-supervised classification based on risk estimators

Abstract: A significant limitation of one-class classification anomaly detection methods is their reliance on the assumption that unlabeled training data only contains normal instances. To overcome this impractical assumption, we propose two novel classification-based anomaly detection methods. Firstly, we introduce a semi-supervised shallow anomaly detection method based on an unbiased risk estimator. Secondly, we present a semi-supervised deep anomaly detection method utilizing a nonnegative (biased) risk estimator. We establish estimation error bounds and excess risk bounds for both risk minimizers. Additionally, we propose techniques to select appropriate regularization parameters that ensure the nonnegativity of the empirical risk in the shallow model under specific loss functions. Our extensive experiments provide evidence of the effectiveness of the risk-based anomaly detection methods.

URL: https://openreview.net/forum?id=ekvsBtCBUK

---

Title: Faster Convergence of Local SGD for Over-Parameterized Models

Abstract: Modern machine learning architectures are often highly expressive. They are usually over-parameterized and can interpolate the data by driving the empirical loss close to zero. We analyze the convergence of Local SGD (or FedAvg) for such over-parameterized models in the heterogeneous data setting and improve upon the existing literature by establishing the following convergence rates. For general convex loss functions, we establish an error bound of $\mathcal{O}(1/T)$ under a mild data similarity assumption and an error bound of $\mathcal{O}(K/T)$ otherwise, where $K$ is the number of local steps and $T$ is the total number of iterations. For non-convex loss functions, we prove an error bound of $\mathcal{O}(K/T)$. These bounds improve upon the best previous bound of $\mathcal{O}(1/\sqrt{nT})$ in both cases, where $n$ is the number of agents, when no assumption on the model being over-parameterized is made. We complete our results by providing problem instances in which our established convergence rates are tight up to a constant factor with a reasonably small stepsize. Finally, we validate our theoretical results by performing large-scale numerical experiments that reveal the convergence behavior of Local SGD for practical over-parameterized deep learning models, in which the $\mathcal{O}(1/T)$ convergence rate of Local SGD is clearly shown.

URL: https://openreview.net/forum?id=VBAKc4DtZ1
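
A toy sketch of the Local SGD scheme analyzed above on a heterogeneous least-squares problem: each agent runs $K$ local gradient steps from the shared iterate, and the server averages the results (purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n_agents, K, d, rounds, lr = 4, 5, 3, 200, 0.05
    A = [rng.standard_normal((20, d)) for _ in range(n_agents)]
    b = [Ai @ np.ones(d) + 0.01 * rng.standard_normal(20) for Ai in A]   # heterogeneous data

    w = np.zeros(d)
    for _ in range(rounds):
        local_iterates = []
        for Ai, bi in zip(A, b):
            wi = w.copy()
            for _ in range(K):                         # K local gradient steps
                wi -= lr * Ai.T @ (Ai @ wi - bi) / len(bi)
            local_iterates.append(wi)
        w = np.mean(local_iterates, axis=0)            # server averaging (FedAvg)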

---

Title: Genetic InfoMax: Exploring Mutual Information Maximization in High-Dimensional Imaging Genetics Studies

Abstract: Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits. When applied to high-dimensional medical imaging data, a key step is to extract lower-dimensional, yet informative representations of the data as traits. Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS in comparison to typical visual representation learning. In this study, we tackle this problem from the mutual information (MI) perspective by identifying key limitations of existing methods. We introduce a trans-modal learning framework Genetic InfoMax (GIM), including a regularized MI estimator and a novel genetics-informed transformer to address the specific challenges of GWAS. We evaluate GIM on human brain 3D MRI data and establish standardized evaluation protocols to compare it to existing approaches. Our results demonstrate the effectiveness of GIM and a significantly improved performance on GWAS.

URL: https://openreview.net/forum?id=9UgUMFW67X

---

Title: Error Bounds for Flow Matching Methods

Abstract: Score-based generative models are a popular class of generative modelling techniques relying on stochastic differential equations (SDEs). From their inception, it was realized that it was also possible to perform generation using ordinary differential equations (ODEs) rather than SDEs. This led to the introduction of the probability flow ODE approach and denoising diffusion implicit models. Flow matching methods have recently further extended these ODE-based approaches and approximate a flow between two arbitrary probability distributions. Previous work derived bounds on the approximation error of diffusion models under the stochastic sampling regime, given assumptions on the $L^2$ loss. We present error bounds for the flow matching procedure using fully deterministic sampling, assuming an $L^2$ bound on the approximation error and a certain regularity condition on the data distributions.

URL: https://openreview.net/forum?id=uqQPyWFDhY

---

Title: How Much Pre-training Is Enough to Discover a Good Subnetwork?

Abstract: Neural network pruning helps discover efficient, high-performing subnetworks within pre-trained, dense network architectures. More often than not, it involves a three-step process—pre-training, pruning, and re-training—that is computationally expensive, as the dense model must be fully pre-trained. While previous work has revealed through experiments the relationship between the amount of pre-training and the performance of the pruned network, a theoretical characterization of such dependency is still missing. Aiming to mathematically analyze the amount of dense network pre-training needed for a pruned network to perform well, we discover a simple theoretical bound on the number of gradient descent pre-training iterations on a two-layer, fully connected network, beyond which pruning via greedy forward selection \citep{provable_subnetworks} yields a subnetwork that achieves good training error. Interestingly, this threshold is logarithmically dependent upon the size of the dataset, meaning that experiments with larger datasets require more pre-training for subnetworks obtained via pruning to perform well. Lastly, we empirically validate our theoretical results on multi-layer perceptrons and residual-based convolutional networks trained on MNIST, CIFAR, and ImageNet datasets.

URL: https://openreview.net/forum?id=UVE7LllpXe
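
A sketch of greedy forward selection for pruning, as referenced above: neurons are added one at a time, each time picking the one that most reduces the training loss; `loss_with_subset` is a caller-supplied placeholder, and this is not the cited construction verbatim:

    def greedy_forward_selection(n_neurons, budget, loss_with_subset):
        # loss_with_subset(subset) -> training loss of the subnetwork built from `subset`.
        selected = []
        for _ in range(budget):
            remaining = [i for i in range(n_neurons) if i not in selected]
            best = min(remaining, key=lambda i: loss_with_subset(selected + [i]))
            selected.append(best)
        return selected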

---

Title: Video Diffusion Models - A Survey

Abstract: Diffusion generative models have recently overtaken GANs in the text-to-image domain and show great potential for video generation and editing tasks. This review offers an overview of the current literature on video diffusion models. We provide a systematic overview of relevant aspects such as applications, architecture, and temporal dynamics. Developments in the field are outlined through paper summaries. The review concludes with an examination of remaining challenges and an outlook on the future of the field.

URL: https://openreview.net/forum?id=sgDFqNTdaN

---

Title: Anticipatory Music Transformer

Abstract: We introduce anticipation: a method for constructing a controllable generative model of a temporal point process (the event process) conditioned asynchronously on realizations of a second, correlated process (the control process). We achieve this by interleaving sequences of events and controls, such that controls appear following stopping times in the event sequence. This work is motivated by problems arising in the control of symbolic music generation. We focus on infilling control tasks, whereby the controls are a subset of the events themselves, and conditional generation completes a sequence of events given the fixed control events. We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset. These models match the performance of autoregressive models for prompted generation, with the additional capability to perform infilling control tasks, including accompaniment. Human evaluators report that an anticipatory model produces accompaniments whose musicality is similar even to that of music composed by humans over a 20-second clip.

URL: https://openreview.net/forum?id=EBNJ33Fcrl

---

Title: Federated TD Learning with Linear Function Approximation under Environmental Heterogeneity

Abstract: We initiate the study of federated reinforcement learning under environmental heterogeneity by considering a policy evaluation problem. Our setup involves $N$ agents interacting with environments that share the same state and action space but differ in their reward functions and state transition kernels. Assuming agents can communicate via a central server, we ask: \textit{Does exchanging information expedite the process of evaluating a common policy?} To answer this question, we provide the first comprehensive finite-time analysis of a federated temporal difference (TD) learning algorithm with linear function approximation, while accounting for Markovian sampling, heterogeneity in the agents' environments, and multiple local updates to save communication. Our analysis crucially relies on several novel ingredients: (i) deriving perturbation bounds
on TD fixed points as a function of the heterogeneity in the agents' underlying Markov decision processes (MDPs); (ii) introducing a virtual MDP to closely approximate the dynamics of the federated TD algorithm; and (iii) using the virtual MDP to make explicit connections to federated optimization. Putting these pieces together, we prove that in a low-heterogeneity regime, exchanging model estimates leads to linear convergence speedups in the number of agents. Our theoretical contribution is significant in that it is the first result of its kind in multi-agent/federated reinforcement learning that complements the numerous analogous results in heterogeneous federated optimization.
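
As a concrete picture of the setting, here is a minimal sketch of federated TD(0) with linear function approximation: each agent runs local TD updates on its own environment (here a synthetic, randomly generated MDP) and a server periodically averages the model estimates. All sizes, step sizes, and the synthetic MDPs are illustrative assumptions, not the paper's experimental setup.

    # Federated TD(0) with linear function approximation (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    S, d, N, gamma, alpha, K, rounds = 10, 4, 5, 0.9, 0.1, 20, 50
    phi = rng.normal(size=(S, d))                      # shared state features
    # Each agent has its own transition kernel and rewards (environmental heterogeneity).
    P = [rng.dirichlet(np.ones(S), size=S) for _ in range(N)]
    R = [rng.normal(size=S) for _ in range(N)]

    theta = np.zeros(d)                                # server model
    for _ in range(rounds):
        local = []
        for i in range(N):                             # K local TD(0) updates per agent
            th, s = theta.copy(), rng.integers(S)
            for _ in range(K):
                s_next = rng.choice(S, p=P[i][s])
                delta = R[i][s] + gamma * phi[s_next] @ th - phi[s] @ th
                th += alpha * delta * phi[s]           # TD(0) update direction
                s = s_next
            local.append(th)
        theta = np.mean(local, axis=0)                 # server averages model estimates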

URL: https://openreview.net/forum?id=hdQspgyFrk

---

Title: GSURE-Based Diffusion Model Training with Corrupted Data

Abstract: Diffusion models have demonstrated impressive results in both data generation and downstream tasks such as inverse problems, text-based editing, classification, and more. However, training such models usually requires large amounts of clean signals which are often difficult or impossible to obtain. In this work, we propose a novel training technique for generative diffusion models based only on corrupted data. We introduce a loss function based on the Generalized Stein’s Unbiased Risk Estimator (GSURE), and prove that under some conditions, it is equivalent to the training objective used in fully supervised diffusion models. We demonstrate our technique on face images as well as Magnetic Resonance Imaging (MRI), where the use of undersampled data significantly alleviates data collection costs. Our approach achieves generative performance comparable to its fully supervised counterpart without training on any clean signals. In addition, we deploy the resulting diffusion model in various downstream tasks beyond the degradation present in the training set, showcasing promising results.

URL: https://openreview.net/forum?id=BRl7fqMwaJ

---

Title: Concatenative Contrastive Sampling for Transformer-based Sequential Recommendation

Abstract: Sequential recommendation represents a significant research direction in recommender systems, which aims to analyze users' sequential actions to forecast the subsequent item or item sequence they are likely to engage with. This entails deploying machine learning models such as Markov Chains (MC), recurrent neural networks (RNNs), and transformers to unravel the underlying user history patterns in recommender systems and generate recommendations according to their capability in processing sequential data. However, prior endeavors, while successfully leveraging user history attributes, are constrained in capturing the interplay between user history and new items, as well as the contrastive signals between authentic and unfavorable items. To surmount these limitations, we introduce an attention-based sequential recommendation model with a concatenate-then-split structure that intentionally integrates these interactions. Experimental findings underscore the efficacy of integrating such interactions, with our new model achieving state-of-the-art performance across prevalent sequential recommendation benchmarks.

URL: https://openreview.net/forum?id=GhVnUdudJ1

---

Title: InfoNCE is variational inference in a recognition parameterised model

Abstract: Here, we show that the InfoNCE objective is equivalent to the ELBO in a new class of probabilistic generative model, the recognition parameterised model (RPM). When we learn the optimal prior, the RPM ELBO becomes equal to the mutual information (MI; up to a constant), establishing a connection to pre-existing self-supervised learning methods such as InfoNCE. However, practical InfoNCE methods do not use the MI as an objective; the MI is invariant to arbitrary invertible transformations, so using an MI objective can lead to highly entangled representations (Tschannen et al., 2019). Instead, the actual InfoNCE objective is a simplified lower bound on the MI which is loose even in the infinite sample limit. Thus, an objective that works (i.e. the actual InfoNCE objective) appears to be motivated as a loose bound on an objective that does not work (i.e. the true MI which gives arbitrarily entangled representations).
We give an alternative motivation for the actual InfoNCE objective. In particular, we show that in the infinite sample limit, and for a particular choice of prior, the actual InfoNCE objective is equal to the ELBO (up to a constant); and the ELBO is equal to the marginal likelihood with a deterministic recognition model. Thus, we argue that our VAE perspective gives a better motivation for InfoNCE than MI, as the actual InfoNCE objective is only loosely bounded by the MI, but is equal to the ELBO/marginal likelihood (up to a constant).
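
For reference, the "actual InfoNCE objective" discussed above is usually implemented as a cross-entropy over in-batch similarities; a minimal sketch follows, with the temperature and embedding sizes as illustrative assumptions rather than the paper's settings.

    # InfoNCE as a cross-entropy over in-batch similarities (illustrative sketch).
    import torch
    import torch.nn.functional as F

    def info_nce(z, z_pos, temperature=0.1):
        """z, z_pos: (batch, dim) embeddings of two views of the same samples."""
        z = F.normalize(z, dim=1)
        z_pos = F.normalize(z_pos, dim=1)
        logits = z @ z_pos.t() / temperature       # (batch, batch) similarity matrix
        labels = torch.arange(z.shape[0])          # positives sit on the diagonal
        return F.cross_entropy(logits, labels)

    loss = info_nce(torch.randn(32, 16), torch.randn(32, 16))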

URL: https://openreview.net/forum?id=chbRsWwjax

---

Title: Semantic similarity prediction is better than other semantic similarity measures

Abstract: Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a model fine-tuned for the STS-B task from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations of a robust semantic similarity measure than other approaches.
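
The abstract does not pin down an implementation, but a cross-encoder fine-tuned on STS-B, of the kind described, can be queried in a few lines; the checkpoint name below is a publicly available example and is an assumption, not necessarily the authors' model.

    # Illustrative: score semantic similarity with a model fine-tuned on STS-B.
    # The checkpoint is a public example, not necessarily the authors' choice.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("cross-encoder/stsb-roberta-base")
    pairs = [("A man is playing a guitar.", "Someone plays an instrument."),
             ("A man is playing a guitar.", "The stock market fell today.")]
    scores = model.predict(pairs)   # higher means more semantically similar
    print(scores)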

URL: https://openreview.net/forum?id=bfsNmgN5je

---

Title: Latent space Projection Predictive Inference

Abstract: Given a reference model that includes all the available variables, projection predictive inference replaces its posterior with a constrained projection including only a subset of all variables.
We extend projection predictive inference to enable computationally efficient variable and structure selection in models outside the exponential family.
By adopting a latent space projection predictive perspective we are able to:
1) propose a unified and general framework to do variable selection in complex models while fully honouring the original model structure,
2) properly identify relevant structure and retain posterior uncertainties from the original model, and
3) provide an improved approach also for non-Gaussian models in the exponential family.
We demonstrate the superior performance of our approach by thoroughly testing and comparing it against popular variable selection approaches in a wide range of settings, including realistic data sets.
Our results show that our approach successfully recovers relevant terms and model structure in complex models, selecting fewer variables than competing approaches on realistic datasets.

URL: https://openreview.net/forum?id=shpewEvlL0

---

Title: DynaConF: Dynamic Forecasting of Non-Stationary Time-Series

Abstract: Deep learning has shown impressive results in a variety of time series forecasting tasks, where modeling the conditional distribution of the future given the past is the essence. However, when this conditional distribution is non-stationary, it poses challenges for these models to learn consistently and to predict accurately. In this work, we propose a new method to model non-stationary conditional distributions over time by clearly decoupling stationary conditional distribution modeling from non-stationary dynamics modeling. Our method is based on a Bayesian dynamic model that can adapt to conditional distribution changes and a deep conditional distribution model that handles multivariate time series using a factorized output space. Our experimental results on synthetic and real-world datasets show that our model can adapt to non-stationary time series better than state-of-the-art deep learning solutions.

URL: https://openreview.net/forum?id=48pHFcg0YO

---

Title: Local Masked Reconstruction for Efficient Self-Supervised Learning on High-resolution Images

Abstract: Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks, such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global reconstruction mechanism is computationally demanding, especially for high-resolution images. The computational cost increases substantially when scaled to a large dataset. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that reconstructs image patches from small neighboring regions. The strategy can be easily integrated into any generative self-supervised learning technique and improves the trade-off between efficiency and accuracy compared to reconstruction over the entire image. LoMaR is 2.5$\times$ faster than MAE and 5.0$\times$ faster than BEiT on 384$\times$384 ImageNet pretraining, and surpasses them by 0.2\% and 0.8\% in accuracy, respectively. It is 2.1$\times$ faster than MAE on iNaturalist pretraining and gains 0.2\% in accuracy. On MS COCO, LoMaR outperforms MAE by 0.5 $\text{AP}^\text{box}$ on object detection and 0.5 $\text{AP}^\text{mask}$ on instance segmentation. It also outperforms MAE by 0.2\% on semantic segmentation. Our code will be made publicly available.
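
A minimal sketch of the local sampling idea as described: masking and reconstruction are restricted to a small window of neighboring patches rather than the full patch grid. Grid size, window size, and mask ratio below are illustrative assumptions, not the paper's settings.

    # Sample a local k x k window on the patch grid and mask within it (illustrative).
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_local_window(grid=24, k=7, mask_ratio=0.8):
        """Return (visible, masked) flat patch indices inside a random k x k window."""
        top = rng.integers(0, grid - k + 1)
        left = rng.integers(0, grid - k + 1)
        rows, cols = np.meshgrid(np.arange(top, top + k), np.arange(left, left + k),
                                 indexing="ij")
        window = (rows * grid + cols).ravel()
        perm = rng.permutation(window)
        n_mask = int(mask_ratio * len(window))
        return perm[n_mask:], perm[:n_mask]        # encoder sees only the visible part

    visible, masked = sample_local_window()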

URL: https://openreview.net/forum?id=gmSoX5rccT

---

Title: Learning optimal policies through contact in differentiable simulation

Abstract: Model-Free Reinforcement Learning (MFRL) has garnered significant attention for its effectiveness in continuous motor control tasks. However, its limitations become apparent in high-dimensional problems, often leading to suboptimal policies even with extensive training data. Conversely, First-Order Model-Based Reinforcement Learning (FO-MBRL) methods harnessing differentiable simulation offer more accurate gradients but are plagued by instability due to exploding gradients arising from the contact approximation model. We propose Adaptive Horizon Actor Critic (AHAC), a massively parallel FO-MBRL approach that truncates trajectory gradients upon encountering stiff contact, resulting in more stable and accurate gradients. We experimentally show this on a variety of simulated locomotion tasks, where our method achieves up to 91% higher asymptotic episodic reward than state-of-the-art MFRL algorithms while also exhibiting lower variance and less hyper-parameter sensitivity than prior FO-MBRL methods. Moreover, our method scales to high-dimensional motor control tasks while maintaining better wall-clock-time efficiency. We believe the ability to learn high-performance policies in a few minutes enables new opportunities to scale RL for robot motor control.

URL: https://openreview.net/forum?id=fnBAaUksGL

---

Title: Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo

Abstract: Low-precision training has emerged as a promising low-cost technique to enhance the training efficiency of deep neural networks without sacrificing much accuracy. Its Bayesian counterpart can further provide uncertainty quantification and improved generalization accuracy.
This paper investigates low-precision sampling via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with low-precision and full-precision gradient accumulators for both strongly log-concave and non-log-concave distributions. Theoretically, our results show that to achieve $\epsilon$-error in the 2-Wasserstein distance for non-log-concave distributions, low-precision SGHMC achieves quadratic improvement ($\tilde{\mathcal{O}}\left({\epsilon^{-2}{\mu^*}^{-2}\log^2\left({\epsilon^{-1}}\right)}\right)$) compared to the state-of-the-art low-precision sampler, Stochastic Gradient Langevin Dynamics (SGLD) ($\tilde{\mathcal{O}}\left({{\epsilon}^{-4}{\lambda^{*}}^{-1}\log^5\left({\epsilon^{-1}}\right)}\right)$). Moreover, we prove that low-precision SGHMC is more robust to the quantization error compared to low-precision SGLD due to the robustness of the momentum-based update w.r.t. gradient noise.
Empirically, we conduct experiments on synthetic data, and MNIST, CIFAR-10 \& CIFAR-100 datasets, which validate our theoretical findings. Our study highlights the potential of low-precision SGHMC as an efficient and accurate sampling method for large-scale and resource-limited machine learning.
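
A minimal sketch of low-precision SGHMC on a toy 1-D Gaussian target, assuming a fixed-point parameter buffer with stochastic rounding; the step size, friction, and quantization grid are illustrative, and the gradient here is exact rather than stochastic.

    # One SGHMC chain with a stochastically rounded low-precision parameter buffer.
    import numpy as np

    rng = np.random.default_rng(0)

    def grad_U(theta):                     # negative log-density gradient of N(0, 1)
        return theta

    def stochastic_round(x, step=1 / 256):
        """Round to a fixed-point grid; round up with probability given by the remainder."""
        low = np.floor(x / step) * step
        p_up = (x - low) / step
        return low + step * (rng.random(x.shape) < p_up)

    eta, friction, T = 1e-3, 0.1, 5000
    theta, v = np.zeros(1), np.zeros(1)
    samples = []
    for _ in range(T):
        noise = rng.normal(scale=np.sqrt(2 * friction * eta), size=1)
        v = (1 - friction) * v - eta * grad_U(theta) + noise   # momentum-based update
        theta = stochastic_round(theta + v)                    # low-precision accumulator
        samples.append(theta[0])
    print(np.mean(samples), np.var(samples))                   # rough check against N(0, 1)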

URL: https://openreview.net/forum?id=uSLNzzuiDJ

---

Title: BaSIS-Net: From Point Estimate to Predictive Distribution in Neural Networks - A Bayesian Sequential Importance Sampling Framework

Abstract: Data-driven Deep Learning (DL) models have revolutionized autonomous systems, but ensuring their safety and reliability necessitates the assessment of predictive confidence or uncertainty. Bayesian DL provides a principled approach to quantify uncertainty via probability density functions defined over model parameters. However, the exact solution is intractable for most DL models, and the approximation methods, often based on heuristics, suffer from scalability issues and stringent distribution assumptions and may lack theoretical guarantees. This work develops a Sequential Importance Sampling framework that approximates the posterior probability density function through weighted samples (or particles), which can be used to find the mean, variance, or higher-order moments of the posterior distribution. We demonstrate that propagating particles, which capture information about the higher-order moments, through the layers of the DL model results in increased robustness to natural and malicious noise (adversarial attacks). The variance computed from these particles effectively quantifies the model’s decision uncertainty, demonstrating well-calibrated and accurate predictive confidence.

URL: https://openreview.net/forum?id=V92PnXQ7UW

---

Title: CoDeC: Communication-Efficient Decentralized Continual Learning

Abstract: Training at the edge utilizes continuously evolving data generated at different locations. Privacy concerns prohibit the co-location of this spatially as well as temporally distributed data, making it crucial to design training algorithms that enable efficient continual learning
over decentralized private data. Decentralized learning allows serverless training with spatially distributed data. A fundamental barrier in such setups is the high bandwidth cost of communicating model updates between agents. Moreover, existing works under this training paradigm are not inherently suitable for learning a temporal sequence of tasks while retaining the previously acquired knowledge. In this work, we propose CoDeC, a novel communication-efficient decentralized continual learning algorithm that addresses these challenges. We mitigate catastrophic forgetting while learning a distributed task sequence by incorporating orthogonal gradient projection within a gossip-based decentralized learning algorithm. Further, CoDeC includes a novel lossless communication compression scheme based on the gradient subspaces. We theoretically analyze the convergence rate for our algorithm and demonstrate through an extensive set of experiments that CoDeC successfully learns distributed continual tasks with minimal forgetting. The proposed compression scheme results in up to 4.8× reduction in communication costs without any loss in performance.
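
A minimal sketch of the orthogonal gradient projection step at the heart of the method: new-task gradients are projected onto the complement of a subspace spanned by important directions of past-task gradients. The basis construction below (SVD with an energy threshold) is an illustrative stand-in for the paper's procedure, and the gossip and compression machinery is omitted.

    # Orthogonal gradient projection for continual learning (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)

    def project_orthogonal(grad, basis):
        """Remove the components of `grad` lying in the span of the orthonormal `basis`."""
        if basis.shape[1] == 0:
            return grad
        return grad - basis @ (basis.T @ grad)

    def extend_basis(basis, past_grads, energy=0.99):
        """Add principal directions of past-task gradients until `energy` is captured."""
        residual = past_grads - basis @ (basis.T @ past_grads)
        U, s, _ = np.linalg.svd(residual, full_matrices=False)
        keep = np.cumsum(s**2) / np.sum(s**2) <= energy
        return np.hstack([basis, U[:, keep]])

    d = 8
    basis = np.zeros((d, 0))
    past = rng.normal(size=(d, 20))             # gradients collected on earlier tasks
    basis = extend_basis(basis, past)
    g_new = rng.normal(size=d)
    g_proj = project_orthogonal(g_new, basis)   # update direction for the new task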

URL: https://openreview.net/forum?id=N05OnQG1BA

---

Title: Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures

Abstract: Separating signals from an additive mixture may be an unnecessarily hard problem when one is only interested in specific properties of a given signal. In this work, we tackle simpler "statistical component separation" problems that focus on recovering a predefined set of statistical descriptors of a target signal from a noisy mixture. Assuming access to samples of the noise process, we investigate a method devised to match the statistics of the solution candidate corrupted by noise samples with those of the observed mixture. We first analyze the behavior of this method using simple examples with analytically tractable calculations. Then, we apply it in an image denoising context employing 1) wavelet-based descriptors and 2) ConvNet-based descriptors on astrophysics and ImageNet data. In the case of 1), we show that our method better recovers the descriptors of the target data than a standard denoising method in most situations. Additionally, despite not being constructed for this purpose, it performs surprisingly well in terms of peak signal-to-noise ratio on full signal reconstruction. In comparison, representation 2) appears less suitable for image denoising. Finally, we extend this method by introducing a diffusive stepwise algorithm which gives a new perspective to the initial method and leads to promising results for image denoising under specific circumstances.
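
A minimal sketch of the descriptor-matching idea described above: optimize a candidate signal so that the statistics of the candidate corrupted by fresh noise samples match those of the observed mixture. The toy 1-D signal and the simple moment descriptors below stand in for the wavelet or ConvNet descriptors used in the paper.

    # Match statistics of (candidate + noise) to those of the observed mixture.
    import torch

    def phi(u):
        """Toy statistical descriptors: mean, variance, and mean absolute gradient."""
        return torch.stack([u.mean(), u.var(), (u[1:] - u[:-1]).abs().mean()])

    torch.manual_seed(0)
    signal = torch.sin(torch.linspace(0, 6.28, 256))
    noise = lambda: 0.5 * torch.randn(256)
    y = signal + noise()                          # observed mixture

    x = torch.zeros(256, requires_grad=True)      # solution candidate
    opt = torch.optim.Adam([x], lr=0.05)
    for _ in range(500):
        stats = torch.stack([phi(x + noise()) for _ in range(8)]).mean(0)
        loss = ((stats - phi(y)) ** 2).sum()      # match descriptors of the mixture
        opt.zero_grad(); loss.backward(); opt.step()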

URL: https://openreview.net/forum?id=OUWG6O4yo9

---

Title: Provable Membership Inference Privacy

Abstract: In applications involving sensitive data, such as finance and healthcare, the necessity for preserving data privacy can be a significant barrier to machine learning model development. Differential privacy (DP) has emerged as one canonical standard for provable privacy. However, DP's strong theoretical guarantees often come at the cost of a large drop in its utility for machine learning; and DP guarantees themselves are difficult to interpret. In this work, we propose a novel privacy notion, membership inference privacy (MIP), to address these challenges. We give a precise characterization of the relationship between MIP and DP, and show that MIP can be achieved using less randomness than is required to guarantee DP, leading to a smaller drop in utility. MIP guarantees are also easily interpretable in terms of the success rate of membership inference attacks. Our theoretical results also give rise to a simple algorithm for guaranteeing MIP which can be used as a wrapper around any algorithm with continuous outputs, including parametric model training.

URL: https://openreview.net/forum?id=3ludyxPbb6

---

Title: On the Optimization and Generalization of Multi-head Attention

Abstract: The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Besides, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect the analysis can be extended to various data-model and architecture variations.

URL: https://openreview.net/forum?id=wTGjn7JvYK

---

Title: Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning

Abstract: In large-scale distributed machine learning, recent works have studied the effects of compressing gradients in stochastic optimization to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in multi-agent reinforcement learning, almost nothing is known about the analogous question: \textit{Are common reinforcement learning (RL) algorithms also robust to similar perturbations?} We investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our work makes three important technical contributions. First, we prove that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. Second, we show that our analysis framework extends seamlessly to nonlinear stochastic approximation schemes that subsume Q-learning. Third, we prove that for multi-agent TD learning, one can achieve linear convergence speedups with respect to the number of agents while communicating just $\tilde{O}(1)$ bits per iteration. Notably, these are the first finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our proofs hinge on the construction of novel Lyapunov functions that capture the dynamics of a memory variable introduced by error-feedback.
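
A minimal sketch of compressed TD(0) with error feedback on a synthetic MDP: the compressor here is generic top-k sparsification, and the memory variable re-injects whatever the compressor discarded at the next step. All sizes and step sizes are illustrative assumptions, not the paper's setup.

    # Compressed TD(0) with an error-feedback memory variable (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    S, d, gamma, alpha = 10, 6, 0.9, 0.05
    phi = rng.normal(size=(S, d))
    P = rng.dirichlet(np.ones(S), size=S)
    R = rng.normal(size=S)

    def top_k(x, k=2):
        """Keep only the k largest-magnitude coordinates (a generic compression operator)."""
        out = np.zeros_like(x)
        idx = np.argsort(np.abs(x))[-k:]
        out[idx] = x[idx]
        return out

    theta, err, s = np.zeros(d), np.zeros(d), 0
    for _ in range(5000):
        s_next = rng.choice(S, p=P[s])
        delta = R[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        g = delta * phi[s]                   # TD update direction
        p = alpha * g + err                  # add back the residual memory
        c = top_k(p)                         # transmit only the compressed update
        theta += c
        err = p - c                          # error-feedback memory
        s = s_next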

URL: https://openreview.net/forum?id=dltUedmUVT

---

Title: Embracing Unknown Step by Step: Towards Reliable Sparse Training in Real World

Abstract: Sparse training has emerged as a promising method for resource-efficient deep neural networks (DNNs) in real-world applications. However, the reliability of sparse models remains a crucial concern, particularly in detecting unknown out-of-distribution (OOD) data. This study addresses the knowledge gap by investigating the reliability of sparse training from an OOD perspective and reveals that sparse training exacerbates OOD unreliability. The lack of unknown information and the sparse constraints hinder effective exploration of weight space and accurate differentiation between known and unknown knowledge. To tackle these challenges, we propose a new unknown-aware sparse training method, which incorporates a loss modification, auto-tuning strategy, and a voting scheme to guide weight space exploration and mitigate confusion between known and unknown information without incurring significant additional costs or requiring access to additional OOD data. Theoretical insights demonstrate how our method reduces model confidence when faced with OOD samples. Empirical experiments across multiple datasets, model architectures, and sparsity levels validate the effectiveness of our method, with improvements of up to \textbf{8.4\%} in AUROC while maintaining comparable or higher accuracy and calibration. This research enhances the understanding and readiness of sparse DNNs for deployment in resource-limited applications. Our code is available on: \url{https://anonymous.4open.science/r/MOON-B69E/}.

URL: https://openreview.net/forum?id=Db5c3Wxj9E

---

Title: Making Reliable and Flexible Decisions in Long-tailed Classification

Abstract: Long-tailed classification is challenging due to its heavy imbalance in class probabilities. While existing methods often focus on overall accuracy or accuracy for tail classes, they overlook a critical aspect: certain types of errors can carry greater risks than others in real-world long-tailed problems. For example, misclassifying patients (a tail class) as healthy individuals (a head class) entails far more serious consequences than the reverse scenario. To address this critical issue, we introduce Making Reliable and Flexible Decisions in Long-tailed Classification (RF-DLC), a novel method aimed at ensuring reliable predictions in long-tailed problems. Leveraging Bayesian Decision Theory, we introduce an integrated gain to seamlessly combine long-tailed data distribution and the decision-making procedure. We further propose an efficient variational optimization strategy for the decision risk objective. Our method adapts readily to diverse utility matrices, which can be designed for specific tasks, ensuring its flexibility for different problem settings. In empirical evaluation, we design a new metric, False Head Rate, to quantify tail-sensitivity risk, and conduct comprehensive experiments on multiple real-world tasks, including classification, uncertainty estimations, and ablation studies, to demonstrate the reliability and flexibility of our method.

URL: https://openreview.net/forum?id=6xwqONp6KK

---

Title: DeepReShape: Redesigning Neural Networks for Efficient Private Inference

Abstract: Prior work on Private Inference (PI)--inferences performed directly on encrypted input--has focused on minimizing a network's ReLUs, which have been assumed to dominate PI latency, rather than on FLOPs. Recent work has shown that FLOPs for PI can no longer be ignored and have high latency penalties. In this paper, we develop DeepReShape, a network redesign technique that tailors architectures to PI constraints, optimizing for both ReLUs and FLOPs for the first time. The {\em key insight} is that a strategic allocation of channels such that the network's ReLUs are aligned in their criticality order simultaneously optimizes ReLU and FLOPs efficiency. DeepReShape automates network development with an efficient process, and we call the generated networks HybReNets. We evaluate DeepReShape using standard PI benchmarks and demonstrate a 2.1\% accuracy gain with a 5.2$\times$ runtime improvement at iso-ReLU on CIFAR-100 and an 8.7$\times$ runtime improvement at iso-accuracy on TinyImageNet. Furthermore, we demystify the input network selection in prior ReLU optimizations and shed light on the key network attributes enabling PI efficiency.

URL: https://openreview.net/forum?id=iwCBWULItx

---

Title: A general framework for formulating structured variable selection

Abstract: In variable selection, a selection rule that prescribes the permissible sets of selected variables (called a "selection dictionary") is desirable due to the inherent structural constraints among the candidate variables. Such selection rules can be complex in real-world data analyses, and failing to incorporate such restrictions could not only compromise the interpretability of the model but also lead to decreased prediction accuracy. However, no general framework has been proposed to formalize selection rules and their applications, which poses a significant challenge for practitioners seeking to integrate these rules into their analyses. In this work, we establish a framework for structured variable selection that can incorporate universal structural constraints. We develop a mathematical language for constructing arbitrary selection rules, where the selection dictionary is formally defined. We demonstrate that all selection rules can be expressed as combinations of operations on constructs, facilitating the identification of the corresponding selection dictionary. We use a detailed and complex example to illustrate the developed framework. Once this selection dictionary is derived, practitioners can apply their own user-defined criteria to select the optimal model. Additionally, our framework enhances existing penalized regression methods for variable selection by providing guidance on how to appropriately group variables to achieve the desired selection rule. Furthermore, our innovative framework opens the door to establishing new $\ell_0$-based penalized regression techniques that can be tailored to respect arbitrary selection rules, thereby expanding the possibilities for more robust and tailored model development.

URL: https://openreview.net/forum?id=cvOpIhQQMN

---

Title: DSI2I: Dense Style for Unpaired Exemplar-based Image-to-Image Translation

Abstract: Unpaired exemplar-based image-to-image (UEI2I) translation aims to translate a source image to a target image domain with the style of a target image exemplar, without ground-truth input-translation pairs. Existing UEI2I methods represent style using one vector per image or rely on semantic supervision to define one style vector per object. Here, in contrast, we propose to represent style as a dense feature map, allowing for a finer-grained transfer to the source image without requiring any external semantic information. We then rely on perceptual and adversarial losses to disentangle our dense style and content representations. To stylize the source content with the exemplar style, we extract unsupervised cross-domain semantic correspondences and warp the exemplar style to the source content. We demonstrate the effectiveness of our method on four datasets using standard metrics together with a localized style metric we propose, which measures style similarity in a class-wise manner. Our results show that the translations produced by our approach are more diverse, preserve the source content better, and are closer to the exemplars when compared to the state-of-the-art methods.

URL: https://openreview.net/forum?id=mrJi5kdKA4

---

Title: Enhancing Robustness to Class-Conditional Distribution Shift in Long-Tailed Recognition

Abstract: For the long-tailed recognition problem, beyond imbalanced label distribution, unreliable empirical data distribution due to instance scarcity has recently emerged as a concern. It inevitably causes Class-Conditional Distribution (CCD) shift between training and test. Data augmentation and head-to-tail information transfer methods indirectly alleviate the problem by synthesizing novel examples but may remain biased. In this paper, we conduct a thorough study on the impact of CCD shift and propose Distributionally Robust Augmentation (DRA) to directly train models robust to the shift. DRA admits a novel generalization bound reflecting the benefit of distributional robustness to CCD shift for long-tailed recognition. Extensive experiments show DRA greatly improves existing re-balancing and data augmentation methods when combined with them. It also alleviates the recently discovered saddle-point issue, verifying its ability to achieve enhanced robustness.

URL: https://openreview.net/forum?id=n2gAD8Fdzk

---

Title: LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Abstract: Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial reasoning. This work proposes to enhance prompt understanding capabilities in diffusion models. Our method leverages a pretrained large language model (LLM) for grounded generation in a novel two-stage process. In the first stage, the LLM generates a scene layout that comprises captioned bounding boxes from a given prompt describing the desired image. In the second stage, a novel controller guides an off-the-shelf diffusion model for layout-grounded image generation. Both stages utilize existing pretrained models without additional model parameter optimization. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images according to prompts that require various capabilities, doubling the generation accuracy across four tasks on average. Furthermore, our method enables instruction-based multi-round scene specification and can handle prompts in languages not supported by the underlying diffusion model. We anticipate our method to unleash users' creativity by accurately following more complex prompts.

URL: https://openreview.net/forum?id=hFALpTb4fR

---

Title: Almost Equivariance via Lie Algebra Convolutions

Abstract: Recently, the equivariance of models with respect to a group action has become an important topic of research in machine learning. Analysis of the built-in equivariance of existing neural network architectures, as well as the study of methods for building model architectures that explicitly "bake in" equivariance, have become significant research areas in their own right. However, imbuing an architecture with a specific group equivariance imposes a strong prior on the types of data transformations that the model expects to see. While strictly-equivariant models enforce symmetries, such as those due to rotations or translations, real-world data does not always conform to such strict equivariances, be it due to noise in the data or underlying physical laws that encode only approximate or partial symmetries. In such cases, the prior of strict equivariance can actually prove too strong and cause models to underperform on real-world data. Therefore, in this work we study a closely related topic, that of almost equivariance. We provide a definition of almost equivariance that differs from those extant in the current literature and give a practical method for encoding almost equivariance in models by appealing to the Lie algebra of a Lie group. Specifically, we define Lie algebra convolutions and demonstrate that they offer several benefits over Lie group convolutions, including being computationally tractable and well-defined for non-compact groups. From there, we pivot to the realm of theory and demonstrate connections between the notions of equivariance and isometry and those of almost equivariance and almost isometry. We prove two existence theorems, one showing the existence of almost isometries within bounded distance of isometries of a general manifold, and another showing the converse for Hilbert spaces. We then extend these theorems to prove the existence of almost equivariant manifold embeddings within bounded distance of fully equivariant embedding functions, subject to certain constraints on the group action and the function class. Finally, we demonstrate the validity of our approach by benchmarking against datasets in fully equivariant and almost equivariant settings.

URL: https://openreview.net/forum?id=BWKaPYAKp8

---

Title: INSPIRE: Incorporating Diverse Feature Preferences in Recourse

Abstract: Most recourse generation approaches optimize for indirect distance-based metrics like diversity, proximity, and sparsity, or a shared cost function across all users. A shared cost function in particular is an unrealistic assumption because users can have diverse feature preferences (FPs), i.e. the features they are willing to act upon to obtain recourse. In this work, we propose a novel method, INSPIRE, to incorporate diverse feature preferences in both recourse generation and evaluation procedures by focusing on the cost incurred by a user when opting for a recourse. To achieve this, we first propose an objective function, Expected Minimum Cost (EMC) based on two key ideas: (1) the user should be comfortable adopting at least one solution when presented with multiple options, and (2) we can provide users with multiple options that cover a wide variety of FPs when the user's FPs are unknown. To optimize for EMC, we propose a novel discrete optimization algorithm, Cost-Optimized Local Search (COLS), that is guaranteed to improve the quality of the recourse set over iterations. Next, we propose a cost-based evaluation procedure that computes user satisfaction by simulating each user's cost function and then computing the incurred cost for the provided recourse set. Experimental evaluation on popular real-world datasets demonstrates that our method is fairer than baselines and satisfies up to 25.9% more users. Additionally, we also show that our method is robust to misspecifications of the cost function distribution.

URL: https://openreview.net/forum?id=6yzIuqKGnq

---

Title: Series of Hessian-Vector Products for Tractable Saddle-Free Newton Optimisation of Neural Networks

Abstract: Despite their popularity in the field of continuous optimisation, second-order quasi-Newton methods are challenging to apply in machine learning, as the Hessian matrix is intractably large. This computational burden is exacerbated by the need to address non-convexity, for instance by modifying the Hessian’s eigenvalues as in Saddle-Free Newton methods. We propose an optimisation algorithm which addresses both of these concerns — to our knowledge, the first efficiently-scalable optimisation algorithm to asymptotically use the exact (eigenvalue-modified) inverse Hessian. Our method frames the problem as a series which principally square-roots and inverts the squared Hessian, then uses it to precondition a gradient vector, all without explicitly computing or eigendecomposing the Hessian. A truncation of this infinite series provides a new optimisation algorithm which is scalable and comparable to other first- and second-order optimisation methods in both runtime and optimisation performance. We demonstrate this in a variety of settings, including a ResNet-18 trained on CIFAR-10.
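
The primitive such a method builds on is the Hessian-vector product obtained by double backpropagation, which never materializes or eigendecomposes the Hessian; a minimal sketch of that primitive (not the authors' full series) follows.

    # Hessian-vector products via double backprop: the building block that lets a
    # series precondition gradients without forming the Hessian explicitly.
    import torch

    def hvp(loss, params, vec):
        """Return H @ vec for the Hessian of `loss` w.r.t. the flattened `params`."""
        grads = torch.autograd.grad(loss, params, create_graph=True)
        flat_grad = torch.cat([g.reshape(-1) for g in grads])
        gv = torch.dot(flat_grad, vec)
        hv = torch.autograd.grad(gv, params)
        return torch.cat([h.reshape(-1) for h in hv])

    w = torch.randn(5, requires_grad=True)
    loss = (w ** 4).sum()              # toy non-convex objective
    v = torch.randn(5)
    print(hvp(loss, [w], v))           # equals diag(12 * w**2) @ v here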

URL: https://openreview.net/forum?id=qBZeQBEDIW

---

Title: From Continuous Dynamics to Graph Neural Networks: Neural Diffusion and Beyond

Abstract: Graph neural networks (GNNs) have demonstrated significant promise in modelling relational data and have been widely applied in various fields of interest. The key mechanism behind GNNs is the so-called message passing where information is iteratively aggregated to central nodes from their neighbourhood. Such a scheme has been found to be intrinsically linked to a physical process known as heat diffusion, where the propagation of GNNs naturally corresponds to the evolution of heat density. Analogizing the process of message passing to heat dynamics allows us to fundamentally understand the power and pitfalls of GNNs and consequently informs better model design. Recently, a plethora of works has emerged that propose GNNs inspired by the continuous dynamics formulation, in an attempt to mitigate the known limitations of GNNs, such as oversmoothing and oversquashing. In this survey, we provide the first systematic and comprehensive review of studies that leverage the continuous perspective of GNNs. To this end, we introduce foundational ingredients for adapting continuous dynamics to GNNs, along with a general framework for the design of graph neural dynamics. We then review and categorize existing works based on their driven mechanisms and underlying dynamics. We also summarize how the limitations of classic GNNs can be addressed under the continuous framework. We conclude by identifying multiple open research directions.

URL: https://openreview.net/forum?id=fPQSxjqa2o

---

Title: Temporal Label-Refinement for Weakly-Supervised Audio-Visual Event Localization

Abstract: Audio-Visual Event Localization (AVEL) is the task of temporally localizing and classifying \emph{audio-visual events}, i.e., events simultaneously visible and audible in a video. In this paper, we solve AVEL in a weakly-supervised setting, where only video-level event labels (their presence/absence, but not their locations in time) are available as supervision for training. Our idea is to use a base model to estimate pseudo labels on the training data at a finer temporal resolution than at the video level (``label-refinement'') and then re-train the model with these new labels. In label-refinement, we estimate the subset of labels for each \emph{slice} of frames in a training video by (i) replacing the frames outside the slice with those from a second video having no overlap in video-level labels, and (ii) feeding this synthetic video into the base model to extract labels for just the slice in question. To handle the out-of-distribution nature of our synthetic videos, we propose an auxiliary objective to train the base model that induces more reliable predictions of the localized event labels as desired. Our three-stage pipeline outperforms several existing AVEL methods with no architectural changes and improves performance on a related weakly-supervised task as well. We also find that the evaluation of existing AVEL methods has been seriously misleading and therefore propose new metrics for a better sense of performance.

URL: https://openreview.net/forum?id=Gw0CEuV2ci

---

Title: NorMatch: Matching Normalizing Flows with Discriminative Classifiers for Semi-Supervised Learning

Abstract: Semi-Supervised Learning (SSL) aims to learn a model using a tiny labeled set and massive amounts of unlabeled data. To better exploit the unlabeled data, the latest SSL methods use pseudo-labels predicted from \emph{a single discriminative classifier}. However, the generated pseudo-labels are inevitably linked to inherent confirmation bias and noise which greatly affects the model performance. In this work, we introduce a new framework for SSL named NorMatch. Firstly, we introduce a new uncertainty estimation scheme based on normalizing flows, as an auxiliary classifier, to enforce highly certain pseudo-labels, yielding a boost to the discriminative classifier. Secondly, we introduce a threshold-free sample weighting strategy to exploit better both high and low confidence pseudo-labels. Furthermore, we utilize normalizing flows to model, in an unsupervised fashion, the distribution of unlabeled data. This modelling assumption can further improve the performance of generative classifiers via unlabeled data, thus implicitly contributing to training a better discriminative classifier. We demonstrate, through numerical and visual results, that NorMatch achieves state-of-the-art performance on several datasets.

URL: https://openreview.net/forum?id=ebiAFpQ0Lw

---

Title: Discrete VQ-IHDM: MRI Generation with Vector Quantized Inverse Heat Dissipation Model

Abstract: Accurate and efficient MRI generation is critical in various clinical settings, such as neurology and radiology. The complex data collection procedures, privacy concerns, and lack of medical experts present a bottleneck in the medical imaging data collection and annotation process. In this paper, we adopt a method to unconditionally generate 2D axial brain MRI using a combination of Vector-Quantized image representation and Inverse Heat Dissipation Model (IHDM). We utilize Gaussian Blur as an alternative to order-agnostic masking in the forward process and train a Transformer model to learn the reverse process. This approach allows us to create a single-step sampling algorithm while maintaining high image fidelity. On the ADNI dataset, our model has a FID score of 38.57, a KID score of 0.036, and an ISC score of 1.84.
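
A minimal sketch of the forward process the abstract refers to, where increasing Gaussian blur plays the role that masking plays in masked generative models; the blur schedule and the random stand-in image are illustrative assumptions.

    # Inverse-heat-dissipation style forward process: a schedule of Gaussian blurs.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    rng = np.random.default_rng(0)
    image = rng.random((64, 64))                       # stand-in for a 2D brain MRI slice
    sigmas = np.linspace(0.5, 8.0, 10)                 # illustrative blur schedule
    forward_states = [gaussian_filter(image, sigma=s) for s in sigmas]
    # A model is then trained to reverse this degradation, from most to least blurred.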

URL: https://openreview.net/forum?id=dE6Y8UADIw

---

Title: PaDPaF: Partial Disentanglement with Partially-Federated GANs

Abstract: Federated learning has become a popular machine learning paradigm with many potential real-life applications, including recommendation systems, the Internet of Things (IoT), healthcare, and self-driving cars. Though most current applications focus on classification-based tasks, learning personalized generative models remains largely unexplored, and their benefits in the heterogeneous setting still need to be better understood. This work proposes a novel architecture combining global client-agnostic and local client-specific generative models. We show that using standard techniques for training federated models, our proposed model achieves privacy and personalization by implicitly disentangling the globally-consistent representation (i.e. content) from the client-dependent variations (i.e. style). Using such decomposition, personalized models can generate locally unseen labels while preserving the given style of the client and can predict the labels for all clients with high accuracy by training a simple linear classifier on the global content features. Furthermore, disentanglement enables other essential applications, such as data anonymization, by sharing only the content. Extensive experimental evaluation corroborates our findings, and we also provide partial theoretical justifications for the proposed approach.

URL: https://openreview.net/forum?id=vsez76EAV8

---

Title: From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative Modeling

Abstract: Deep generative models have shown tremendous success in data density estimation and data generation from finite samples. While these models have shown impressive performance by learning correlations among features in the data, some fundamental shortcomings are their lack of explainability, tendency to induce spurious correlations, and poor out-of-distribution extrapolation. In an effort to remedy such challenges, one can incorporate the theory of causality in deep generative modeling. Structural causal models (SCMs) describe data-generating processes and model complex causal relationships and mechanisms among variables in a system. Thus, SCMs can naturally be combined with deep generative models. Causal models offer several beneficial properties to deep generative models, such as distribution shift robustness, fairness, and interpretability. We provide a technical survey on causal generative modeling categorized into causal representation learning and controllable counterfactual generation methods. We focus on fundamental theory, methodology, drawbacks, datasets, metrics, and applications of causal generative models in fairness, privacy, out-of-distribution generalization, and precision medicine. We also discuss open problems and fruitful research directions for future work in the field.

URL: https://openreview.net/forum?id=PUpZXvNqmb

---

Title: DP-ImgSyn: Dataset Alignment for Obfuscated, Differentially Private Image Synthesis

Abstract: The availability of abundant data has catalyzed the expansion of deep learning vision algorithms. However, certain vision datasets depict visually sensitive content such as content moderation images. Sharing or releasing these datasets to the community would improve the performance of neural models, but poses moral and ethical questions. Thus, there is a need to share such datasets with privacy guarantees without sharing visually sensitive data. Traditionally, Generative Adversarial Networks (GANs) with Differential Privacy (DP) guarantees are employed to generate and release data. However, GAN-based approaches result in images that are visually similar to private images. In this paper, we propose a non-generative framework, Differentially Private Image Synthesis (DP-ImgSyn), to sanitize and release visually sensitive data with DP guarantees to address these issues. DP-ImgSyn consists of the following steps. First, a teacher model is trained (for classification) using a DP training algorithm. Second, optimization is performed on a public dataset using the teacher model to align it with the private dataset. We show that this alignment improves performance (up to $\approx$ **17%**) and ensures that the generated/aligned images are visually similar to the public images. The optimization uses the teacher network's batch normalization layer statistics (mean, standard deviation) to inject information about the private images into the public images. The synthesized images, along with their corresponding soft labels obtained from the teacher model, are released as the sanitized dataset. A student model is trained on the released dataset using KL-divergence loss. The proposed framework circumvents the issues of generative methods and generates images visually similar to the public dataset. Thus, it obfuscates the private dataset using the public dataset. Our experiments on various vision datasets show that when using similar DP training mechanisms, our framework performs better than generative techniques (up to $\approx$ **20%**).
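
A minimal sketch of the alignment step described above, in the DeepInversion style: public images are optimized so that the feature statistics they induce in the teacher match the running statistics stored in its BatchNorm layers. The ResNet-18 stand-in, image size, and optimizer settings are illustrative assumptions; DP training of the teacher and the label distillation step are omitted.

    # Align public images with the teacher's BatchNorm statistics (illustrative sketch).
    import torch
    import torch.nn as nn
    import torchvision

    teacher = torchvision.models.resnet18(num_classes=10)   # stand-in for a DP-trained teacher
    teacher.eval()

    bn_losses = []
    def bn_hook(module, inputs, output):
        x = inputs[0]
        mean = x.mean(dim=(0, 2, 3))
        var = x.var(dim=(0, 2, 3), unbiased=False)
        bn_losses.append(((mean - module.running_mean) ** 2).sum()
                         + ((var - module.running_var) ** 2).sum())

    for m in teacher.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.register_forward_hook(bn_hook)

    public = torch.rand(8, 3, 32, 32, requires_grad=True)   # public images to align
    opt = torch.optim.Adam([public], lr=0.05)
    for _ in range(100):
        bn_losses.clear()
        teacher(public)
        loss = torch.stack(bn_losses).sum()
        opt.zero_grad(); loss.backward(); opt.step()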

URL: https://openreview.net/forum?id=ZHAtvJtJnR

---

Title: Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Abstract: This paper investigates the performance of the Contrastive Language-Image Pre-training (CLIP) method when scaled down to limited computation budgets. We explore CLIP from three perspectives: data, architecture, and training strategies. With regard to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality. We also examine how model performance varies with different dataset sizes, suggesting that smaller ViT models are better suited to smaller datasets, while larger models perform better on larger datasets with fixed compute. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. Moreover, we compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resources. Our analysis reveals that CLIP+Data Augmentation can achieve comparable performance to CLIP using only half of the training data. This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications.

URL: https://openreview.net/forum?id=t4nnCi5AO6

---

Title: Voyager: An Open-Ended Embodied Agent with Large Language Models

Abstract: We introduce Voyager, the first LLM-powered embodied lifelong learning agent in an open-ended world that continuously explores, acquires diverse skills, and makes novel discoveries without human intervention in Minecraft. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent’s capability rapidly and alleviates catastrophic forgetting. Empirically, Voyager demonstrates strong in-context lifelong learning capabilities. It outperforms prior SOTA by obtaining 3.1x more unique items, unlocking tech tree milestones up to 15.3x faster, and traveling 2.3x longer distances. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize.

URL: https://openreview.net/forum?id=ehfRiF0R3a

---

Title: Reinforcement Learning of Adaptive Acquisition Policies for Inverse Problems

Abstract: A promising way to mitigate the expensive process of obtaining a high-dimensional signal is to acquire a limited number of low-dimensional measurements and solve an under-determined inverse problem by utilizing the structural prior about the signal. In this paper, we focus on adaptive acquisition schemes to further reduce the number of measurements.
To this end, we propose a reinforcement learning-based approach that sequentially collects measurements so as to better recover the underlying signal with fewer acquisitions. Our approach applies to general inverse problems with continuous action spaces and jointly learns the recovery algorithm. Using insights obtained from theoretical analysis, we also provide a probabilistic design for our methods using a variational formulation. We evaluate our approach on multiple datasets and with two measurement spaces (Gaussian, Radon).
Our results confirm the benefits of adaptive strategies in low-acquisition horizon settings.

URL: https://openreview.net/forum?id=aL3PpuXnPm

---

Title: Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks

Abstract: Recent analyses of neural networks with shaped activations (i.e. the activation function is scaled as the network size grows) have led to scaling limits described by differential equations. However, these results do not a priori tell us anything about ``ordinary'' unshaped networks, where the activation is unchanged as the network size grows. In this article, we find a similar differential-equation-based asymptotic characterization for two types of unshaped networks.

Firstly, we show that the following two architectures converge to the same infinite-depth-and-width limit at initialization:
(i) a fully connected ResNet with a $d^{-1/2}$ factor on the residual branch, where $d$ is the network depth.
(ii) a multilayer perceptron (MLP) with depth $d \ll$ width $n$ and shaped ReLU activation at rate $d^{-1/2}$.

Secondly, for an unshaped MLP at initialization, we derive the first order asymptotic correction to the layerwise correlation. In particular, if $\rho_\ell$ is the correlation at layer $\ell$, then $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \frac{\ell}{n}$ converges to an SDE with a singularity at $t=0$.

These results together provide a connection between shaped and unshaped network architectures, and open up the possibility of studying the effect of normalization methods and how they connect with shaped activation functions.

URL: https://openreview.net/forum?id=iRDwUXYsSJ

---

Title: Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation

Abstract: The recent success of ChatGPT and GPT-4 has drawn widespread attention to multimodal dialogue systems. However, the academic community lacks a dataset that can validate the multimodal generation capabilities of Visual Language Models (VLMs) in textual-visual chat tasks. In this paper, we construct two new multimodal datasets: the synthetic CLEVR-ATVC dataset (620K) and the manually pictured Fruit-ATVC dataset (50K), both featuring visual and text-based inputs and outputs. Additionally, to enable the multimodal system to reject human requests (i.e., demonstrate accountability), as in language-based ChatGPT conversations, we develop and incorporate specific rules into the datasets as supervisory signals. This allows the trained VLM to provide a yes or no answer after visual and textual reasoning, accompanied by a language explanation as to why the human instruction cannot be executed. In our method, we propose a two-stage training procedure to train the image auto-encoder and auto-regressive transformer from scratch. The first stage involves a discrete variational autoencoder (dVAE) to compress each image into short tokens, which are then concatenated with text tokens as a single data stream to be fed into the decoder-based transformer for generating visual re-creation and textual feedback in the second stage. We provide comprehensive analyses of experimental results in terms of re-created image quality, answer accuracy, and the model behavior when faced with uncertainty and imperfect user queries. We hope our explorations and findings contribute valuable insights regarding the accountability of textual-visual generative models.

URL: https://openreview.net/forum?id=kQmz1BMIYi

---

Title: Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization

Abstract: By ensuring differential privacy in the learning algorithm, one can rigorously mitigate the risk of large models memorizing sensitive training data. In this paper, we study two algorithms for this purpose, namely DP-SGD and DP-NSGD, which first clip or normalize \textit{per-sample} gradients to bound the sensitivity and then add noise to obfuscate the exact information. We analyze the convergence behavior of these two algorithms in the non-convex empirical risk minimization setting under two common assumptions and achieve a rate of $\mathcal{O}\left(\sqrt[4]{\frac{d\log(1/\delta)}{N^2\epsilon^2}}\right)$ on the gradient norm for a $d$-dimensional model, $N$ samples, and $(\epsilon,\delta)$-DP, which improves over previous bounds under much weaker assumptions. Specifically, we introduce a regularizing factor in DP-NSGD and show that it is crucial in the convergence proof and subtly controls the bias-noise trade-off. Our proof deliberately handles the per-sample gradient clipping and normalization that are specific to the private setting. Empirically, we demonstrate that the two algorithms achieve similar best accuracy, while DP-NSGD is comparatively easier to tune than DP-SGD.
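[Editor's note] To make the clip-vs-normalize distinction concrete, here is a minimal numpy sketch (not the paper's code) of one DP-SGD-style step with per-sample clipping and one DP-NSGD-style step with per-sample normalization plus a regularizing factor r. The toy least-squares loss, noise scales, and hyperparameters are placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)
d, batch = 10, 32
w = rng.normal(size=d)
X = rng.normal(size=(batch, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=batch)

def per_sample_grads(w):
    # Gradient of 0.5 * (x_i . w - y_i)^2 for each sample i (toy least-squares loss).
    residual = X @ w - y
    return residual[:, None] * X            # shape (batch, d)

C, r, sigma, lr = 1.0, 0.01, 1.0, 0.1
g = per_sample_grads(w)
norms = np.linalg.norm(g, axis=1, keepdims=True)

# DP-SGD style: clip each per-sample gradient to norm at most C, average, add noise.
g_clip = g * np.minimum(1.0, C / np.maximum(norms, 1e-12))
noisy_clip = g_clip.mean(axis=0) + (sigma * C / batch) * rng.normal(size=d)

# DP-NSGD style: normalize each per-sample gradient with regularizer r, average, add noise.
g_norm = g / (norms + r)
noisy_norm = g_norm.mean(axis=0) + (sigma / batch) * rng.normal(size=d)

print("DP-SGD-style update :", np.round(w - lr * noisy_clip, 3)[:4])
print("DP-NSGD-style update:", np.round(w - lr * noisy_norm, 3)[:4])
```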

URL: https://openreview.net/forum?id=wLg9JrwFvL

---

Title: Text Recognition with Masked Vision-Language Pre-training

Abstract: Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images, which allows us to learn strong visual representations. In contrast to introducing linguistic knowledge with an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme.
Notably, the encoder is kept frozen during the pre-training phase of the sequence decoder. Experimental results demonstrate that our proposed method achieves superior performance on benchmark datasets, including Chinese and English text images. The code for our approach will be made available.
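[Editor's note] The freezing step can be illustrated with a small, assumption-based PyTorch snippet: a stand-in feature encoder has its parameters frozen while only a stand-in sequence decoder is optimized. Module shapes, the character-class count, and the toy data are invented, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pre-trained feature encoder and the sequence decoder.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
decoder = nn.Sequential(nn.Flatten(), nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 37))

# Freeze the encoder: its parameters receive no gradient updates during decoder pre-training.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

images = torch.randn(8, 3, 32, 128)       # batch of synthesized text images (toy data)
labels = torch.randint(0, 37, (8,))        # toy character targets

with torch.no_grad():
    feats = encoder(images)                # frozen visual features
logits = decoder(feats)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                            # gradients flow only into the decoder
optimizer.step()
print("decoder pre-training step done, loss:", float(loss))
```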

URL: https://openreview.net/forum?id=KNAWoKKpi3

---

Title: Empowering GNNs via Edge-Aware Weisfeiler-Leman Algorithm

Abstract: Message passing graph neural networks (GNNs) are known to have their expressiveness upper-bounded by the 1-dimensional Weisfeiler-Leman (1-WL) algorithm. To achieve more powerful GNNs, existing attempts either require \emph{ad hoc} features or involve operations that incur high time and space complexities. In this work, we propose a \textit{general} and \textit{provably powerful} GNN framework that preserves the \textit{scalability} of the message passing scheme. In particular, we first propose to empower 1-WL for graph isomorphism testing by considering edges among neighbors, giving rise to NC-1-WL. We show theoretically that the expressiveness of NC-1-WL is strictly above 1-WL and below 3-WL. Further, we propose the NC-GNN framework as a differentiable neural version of NC-1-WL. Our simple implementation of NC-GNN is provably as powerful as NC-1-WL. Experiments demonstrate that NC-GNN performs effectively and efficiently on various benchmarks.
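[Editor's note] The snippet below is a schematic, not the paper's exact NC-1-WL definition: it runs 1-WL-style color refinement in which each node's new color also records the multiset of edges among its neighbors, which is the kind of extra signal the abstract describes. On two 2-regular graphs that plain 1-WL cannot distinguish (two triangles vs. a 6-cycle), this edge-aware refinement separates them.

```python
from collections import Counter

def refine(adj, rounds=2):
    """1-WL-style color refinement that also records edges among each node's neighbors.
    adj: dict mapping node -> set of neighboring nodes (undirected graph)."""
    colors = {v: 0 for v in adj}                       # uniform initial coloring
    for _ in range(rounds):
        new_colors = {}
        for v in adj:
            nbr_colors = tuple(sorted(colors[u] for u in adj[v]))
            # Multiset of edges among v's neighbors, recorded by their endpoint colors.
            nbr_edges = tuple(sorted(
                tuple(sorted((colors[u], colors[w])))
                for u in adj[v] for w in adj[v]
                if u < w and w in adj[u]
            ))
            new_colors[v] = (colors[v], nbr_colors, nbr_edges)
        colors = new_colors
    return Counter(colors.values())                    # color histogram, comparable across graphs

two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
hexagon = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(refine(two_triangles) == refine(hexagon))        # False: edge-aware refinement tells them apart
```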

URL: https://openreview.net/forum?id=VDy6LgErFM

---
