Weekly TMLR digest for Aug 22, 2023

TMLR

Aug 22, 2023, 1:13:28 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Expert Certification: Diffusion Models for Constrained Domains

Nic Fishman, Leo Klarner, Valentin De Bortoli, Emile Mathieu, Michael John Hutchinson

https://openreview.net/forum?id=xuWTFQ4VGO

---


Featured Certification, Expert Certification: Neural Ordinary Differential Equations for Modeling Epidemic Spreading

Chrysoula Kosma, Giannis Nikolentzos, George Panagopoulos, Jean-Marc Steyaert, Michalis Vazirgiannis

https://openreview.net/forum?id=yrkJGne0vN

---


Featured Certification: Neural Monge Map estimation and its applications

Jiaojiao Fan, Shu Liu, Shaojun Ma, Hao-Min Zhou, Yongxin Chen

https://openreview.net/forum?id=2mZSlQscj3

---


Featured Certification: Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration

Mauricio Delbracio, Peyman Milanfar

https://openreview.net/forum?id=VmyFF5lL3F

---


Survey Certification: Data Distillation: A Survey

Noveen Sachdeva, Julian McAuley

https://openreview.net/forum?id=lmXMXP74TO

---


Expert Certification: Catastrophic overfitting can be induced with discriminative non-robust features

Guillermo Ortiz-Jimenez, Pau de Jorge, Amartya Sanyal, Adel Bibi, Puneet K. Dokania, Pascal Frossard, Grégory Rogez, Philip Torr

https://openreview.net/forum?id=10hCbu70Sr

---


Featured Certification: Finding and Only Finding Differential Nash Equilibria by Both Pretending to be a Follower

Xuchan Bao, Guodong Zhang

https://openreview.net/forum?id=igdWKxK5RZ

---


Expert Certification: POMRL: No-Regret Learning-to-Plan with Increasing Horizons

Khimya Khetarpal, Claire Vernade, Brendan O'Donoghue, Satinder Singh, Tom Zahavy

https://openreview.net/forum?id=brGgOAXYtr

---


Expert Certification: Off-Policy Evaluation with Out-of-Sample Guarantees

Sofia Ek, Dave Zachariah, Fredrik D. Johansson, Peter Stoica

https://openreview.net/forum?id=XnYtGPgG9p

---


Featured Certification: Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning

Yinglun Xu, Qi Zeng, Gagandeep Singh

https://openreview.net/forum?id=25G63lDHV2

---


Expert Certification: Towards a More Rigorous Science of Blindspot Discovery in Image Classification Models

Gregory Plumb, Nari Johnson, Angel Cabrera, Ameet Talwalkar

https://openreview.net/forum?id=MaDvbLaBiF

---


Survey Certification: On the Predictive Accuracy of Neural Temporal Point Process Models for Continuous-time Event Data

Tanguy Bosser, Souhaib Ben Taieb

https://openreview.net/forum?id=3OSISBQPrM

---


Survey Certification: Augmented Language Models: a Survey

Grégoire Mialon, Roberto Dessi, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Roziere, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, Thomas Scialom

https://openreview.net/forum?id=jh7wH2AzKK

---


Expert Certification: Black-Box Batch Active Learning for Regression

Andreas Kirsch

https://openreview.net/forum?id=fvEvDlKko6

---


Expert Certification: Stochastic gradient updates yield deep equilibrium kernels

Russell Tsuchida, Cheng Soon Ong

https://openreview.net/forum?id=p7UTv2hWgM

---


Accepted papers
===============


Title: Diffusion Models for Constrained Domains

Authors: Nic Fishman, Leo Klarner, Valentin De Bortoli, Emile Mathieu, Michael John Hutchinson

Abstract: Denoising diffusion models are a novel class of generative algorithms that achieve state-of-the-art performance across a range of domains, including image generation and text-to-image tasks. Building on this success, diffusion models have recently been extended to the Riemannian manifold setting, broadening their applicability to a range of problems from the natural and engineering sciences. However, these Riemannian diffusion models are built on the assumption that their forward and backward processes are well-defined for all times, preventing them from being applied to an important set of tasks that consider manifolds defined via a set of inequality constraints. In this work, we introduce a principled framework to bridge this gap. We present two distinct noising processes based on (i) the logarithmic barrier metric and (ii) the reflected Brownian motion induced by the constraints. As existing diffusion model techniques cannot be applied in this setting, we proceed to derive new tools to define such models in our framework. We then empirically demonstrate the scalability and flexibility of our methods on a number of synthetic and real-world tasks, including applications from robotics and protein design.

URL: https://openreview.net/forum?id=xuWTFQ4VGO

---

Title: One-Step Distributional Reinforcement Learning

Authors: Mastane Achab, Reda ALAMI, YASSER ABDELAZIZ DAHOU DJILALI, Kirill Fedyanin, Eric Moulines

Abstract: Reinforcement learning (RL) allows an agent interacting sequentially with an environment to maximize its long-term expected return. In the distributional RL (DistrRL) paradigm, the agent goes beyond the limit of the expected value, to capture the underlying probability distribution of the return across all time steps. The set of DistrRL algorithms has led to improved empirical performance. Nevertheless, the theory of DistrRL is still not fully understood, especially in the control case. In this paper, we present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework encompassing only the randomness induced by the one-step dynamics of the environment. Contrary to DistrRL, we show that our approach comes with a unified theory for both policy evaluation and control. Indeed, we propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis. The proposed approach compares favorably with categorical DistrRL on various environments.

URL: https://openreview.net/forum?id=ZPMf53vE1L

---

Title: Tackling Provably Hard Representative Selection via Graph Neural Networks

Authors: Mehran Kazemi, Anton Tsitsulin, Hossein Esfandiari, Mohammadhossein Bateni, Deepak Ramachandran, Bryan Perozzi, Vahab Mirrokni

Abstract: Representative Selection (RS) is the problem of finding a small subset of exemplars from a dataset that is representative of the dataset. In this paper, we study RS for attributed graphs, and focus on finding representative nodes that optimize the accuracy of a model trained on the selected representatives. Theoretically, we establish a new hardness result for RS (in the absence of a graph structure) by proving that a particular, highly practical variant of it (RS for Learning) is hard to approximate in polynomial time within any reasonable factor, which implies a significant potential gap between the optimum solution of widely-used surrogate functions and the actual accuracy of the model. We then study the setting where a (homophilous) graph structure is available, or can be constructed, between the data points. We show that with an appropriate modeling approach, the presence of such a structure can turn a hard RS (for learning) problem into one that can be effectively solved. To this end, we develop RS-GNN, a representation learning-based RS model based on Graph Neural Networks. Empirically, we demonstrate the effectiveness of RS-GNN on problems with predefined graph structures as well as problems with graphs induced from node feature similarities, by showing that RS-GNN achieves significant improvements over established baselines on a suite of eight benchmarks.

URL: https://openreview.net/forum?id=3LzgOQ3eOb

---

Title: Dual Representation Learning for Out-of-distribution Detection

Authors: Zhilin Zhao, Longbing Cao

Abstract: To classify in-distribution samples, deep neural networks explore strongly label-related information and discard weakly label-related information according to the information bottleneck. Out-of-distribution samples drawn from distributions differing from that of in-distribution samples could be assigned with unexpected high-confidence predictions because they could obtain minimum strongly label-related information. To distinguish in- and out-of-distribution samples, Dual Representation Learning (DRL) makes out-of-distribution samples harder to have high-confidence predictions by exploring both strongly and weakly label-related information from in-distribution samples. For a pretrained network exploring strongly label-related information to learn label-discriminative representations, DRL trains its auxiliary network exploring the remaining weakly label-related information to learn distribution-discriminative representations. Specifically, for a label-discriminative representation, DRL constructs its complementary distribution-discriminative representation by integrating diverse representations less similar to the label-discriminative representation. Accordingly, DRL combines label- and distribution-discriminative representations to detect out-of-distribution samples. Experiments show that DRL outperforms the state-of-the-art methods for out-of-distribution detection.

URL: https://openreview.net/forum?id=PHAr3q49h6

---

Title: Cyclophobic Reinforcement Learning

Authors: Stefan Sylvius Wagner, Peter Arndt, Jan Robine, Stefan Harmeling

Abstract: In environments with sparse rewards, finding a good inductive bias for exploration is crucial to the agent's success. However, there are two competing goals: novelty search and systematic exploration. While existing approaches such as curiosity-driven exploration find novelty, they sometimes do not systematically explore the whole state space, akin to depth-first search vs. breadth-first search. In this paper, we propose a new intrinsic reward that is cyclophobic, i.e., it does not reward novelty, but punishes redundancy by avoiding cycles. Augmenting the cyclophobic intrinsic reward with a sequence of hierarchical representations based on the agent's cropped observations, we are able to achieve excellent results in the MiniGrid and MiniHack environments. Both are particularly hard, as they require complex interactions with different objects in order to be solved. Detailed comparisons with previous approaches and thorough ablation studies show that our newly proposed cyclophobic reinforcement learning is more sample efficient than other state-of-the-art methods in a variety of tasks.

URL: https://openreview.net/forum?id=83rgSFPpws

---

Title: Neural Ordinary Differential Equations for Modeling Epidemic Spreading

Authors: Chrysoula Kosma, Giannis Nikolentzos, George Panagopoulos, Jean-Marc Steyaert, Michalis Vazirgiannis

Abstract: Mathematical models of infectious diseases have long been used for studying the mechanisms by which diseases spread, for predicting the spread of epidemics, and also for controlling their outbreaks. These models are based on some assumptions and different assumptions give rise to different models. Models on social networks of individuals which capture contact patterns are usually more realistic and can more accurately model contagion dynamics. Unfortunately, computing the output of realistic models is often hard. Thus, modeling the evolution of contagion dynamics over large complex networks constitutes a challenging task. In this paper, we present a computational approach to model the contagion dynamics underlying infectious diseases. Specifically, we focus on the susceptible-infectious-recovered (SIR) epidemic model on networks. Given that this model can be expressed by an intractable system of ordinary differential equations, we devise a simpler system that approximates the output of the model. Then, we capitalize on recent advances in neural ordinary differential equations and propose a neural architecture that can effectively predict the course of an epidemic on the network. We apply the proposed architecture on several network datasets and compare it against state-of-the-art methods under different experimental settings. Our results indicate that the proposed method improves predictions in various spreading scenarios, paving the way for the extensive application of interpretable neural networks in the field of epidemic spreading. At the same time, the proposed model is highly efficient even when trained on very large networks where traditional algorithms become significantly slower.
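
As a point of reference for the dynamics being approximated, the classical well-mixed SIR equations can be integrated directly; the sketch below is purely illustrative (it is not the paper's network-structured model or neural architecture, and the rates beta=0.3, gamma=0.1 and the initial condition are arbitrary choices).

    # Minimal SIR integration with SciPy (illustrative toy, not the paper's code).
    from scipy.integrate import solve_ivp

    def sir_rhs(t, y, beta, gamma):
        S, I, R = y
        N = S + I + R
        dS = -beta * S * I / N              # new infections
        dI = beta * S * I / N - gamma * I   # infections minus recoveries
        dR = gamma * I                      # recoveries
        return [dS, dI, dR]

    sol = solve_ivp(sir_rhs, (0, 160), [990.0, 10.0, 0.0], args=(0.3, 0.1))
    print(sol.y[:, -1])   # final S, I, R counts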

URL: https://openreview.net/forum?id=yrkJGne0vN

---

Title: Rotation-Invariant Random Features Provide a Strong Baseline for Machine Learning on 3D Point Clouds

Authors: Owen Melia, Eric M Jonas, Rebecca Willett

Abstract: Rotational invariance is a popular inductive bias used by many fields in machine learning, such as computer vision and machine learning for quantum chemistry. Rotation-invariant machine learning methods set the state of the art for many tasks, including molecular property prediction and 3D shape classification. These methods generally either rely on task-specific rotation-invariant features, or they use general-purpose deep neural networks which are complicated to design and train. However, it is unclear whether the success of these methods is primarily due to the rotation invariance or the deep neural networks. To address this question, we suggest a simple and general-purpose method for learning rotation-invariant functions of three-dimensional point cloud data using a random features approach. Specifically, we extend the random features method of Rahimi & Recht (2007) by deriving a version that is invariant to three-dimensional rotations and showing that it is fast to evaluate on point cloud data. We show through experiments that our method matches or outperforms the performance of general-purpose rotation-invariant neural networks on standard molecular property prediction benchmark datasets QM7 and QM9. We also show that our method is general-purpose and provides a rotation-invariant baseline on the ModelNet40 shape classification task. Finally, we show that our method has an order of magnitude smaller prediction latency than competing kernel methods.
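
For context, the random features method of Rahimi & Recht (2007) that the paper extends can be sketched in a few lines; this is the standard, rotation-sensitive version only (the rotation-invariant construction in the paper is not reproduced, and the lengthscale and feature count below are arbitrary).

    # Standard random Fourier features approximating an RBF kernel (illustrative only).
    import numpy as np

    def random_fourier_features(X, num_features=512, lengthscale=1.0, seed=0):
        # Returns Z such that Z @ Z.T approximates the RBF kernel matrix of X.
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((X.shape[1], num_features)) / lengthscale
        b = rng.uniform(0.0, 2.0 * np.pi, num_features)
        return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

    X = np.random.default_rng(1).standard_normal((100, 3))   # e.g. 3-D point coordinates
    Z = random_fourier_features(X)
    K_approx = Z @ Z.T                                        # approximate kernel matrix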


URL: https://openreview.net/forum?id=nYzhlFyjjd

---

Title: Graph Neural Networks for Temporal Graphs: State of the Art, Open Challenges, and Opportunities

Authors: Antonio Longa, Veronica Lachi, Gabriele Santin, Monica Bianchini, Bruno Lepri, Pietro Lio, franco scarselli, Andrea Passerini

Abstract: Graph Neural Networks (GNNs) have become the leading paradigm for learning on (static) graph-structured data. However, many real-world systems are dynamic in nature, since the graph and node/edge attributes change over time. In recent years, GNN-based models for temporal graphs have emerged as a promising area of research to extend the capabilities of GNNs. In this work, we provide the first comprehensive overview of the current state of the art of temporal GNNs, introducing a rigorous formalization of learning settings and tasks and a novel taxonomy categorizing existing approaches in terms of how the temporal aspect is represented and processed. We conclude the survey with a discussion of the most relevant open challenges for the field, from both research and application perspectives.

URL: https://openreview.net/forum?id=pHCdMat0gI

---

Title: A Systematic Approach to Universal Random Features in Graph Neural Networks

Authors: Billy Joe Franks, Markus Anders, Marius Kloft, Pascal Schweitzer

Abstract: Universal random features (URF) are state of the art regarding practical graph neural networks that are provably universal. There is great diversity regarding terminology, methodology, benchmarks, and evaluation metrics used among existing URF. Not only does this make it increasingly difficult for practitioners to decide which technique to apply to a given problem, but it also stands in the way of systematic improvements. We propose a new comprehensive framework that captures all previous URF techniques. On the theoretical side, among other results, we formally prove that under natural conditions all instantiations of our framework are universal. The framework thus provides a new simple technique to prove universality results. On the practical side, we develop a method to systematically and automatically train URF. This in turn enables us to impartially and objectively compare all existing URF. New URF naturally emerge from our approach, and our experiments demonstrate that they improve the state of the art.

URL: https://openreview.net/forum?id=AXUtAIX0Fn

---

Title: FairGrad: Fairness Aware Gradient Descent

Authors: Gaurav Maheshwari, Michaël Perrot

Abstract: We address the problem of group fairness in classification, where the objective is to learn models that do not unjustly discriminate against subgroups of the population. Most existing approaches are limited to simple binary tasks or involve difficult to implement training mechanisms which reduces their practical applicability. In this paper, we propose FairGrad, a method to enforce fairness based on a re-weighting scheme that iteratively learns group specific weights based on whether they are advantaged or not. FairGrad is easy to implement, accommodates various standard fairness definitions, and comes with minimal overhead. Furthermore, we show that it is competitive with standard baselines over various datasets including ones used in natural language processing and computer vision.

FairGrad is available as a PyPI package at - https://pypi.org/project/fairgrad
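
The general idea of iteratively re-weighting groups according to whether they are currently advantaged can be illustrated with a generic toy update; this sketch is not the fairgrad package's API or the paper's exact rule, and the multiplicative update and step size are assumptions made purely for illustration.

    # Generic group re-weighting sketch (illustrative; not the fairgrad package API).
    import numpy as np

    def update_group_weights(weights, group_rates, overall_rate, step=0.1):
        # Up-weight groups performing below the overall rate, down-weight the others.
        gap = overall_rate - group_rates                 # > 0 means the group is disadvantaged
        weights = weights * np.exp(step * gap)           # multiplicative-weights-style update
        return weights * len(weights) / weights.sum()    # renormalize to mean 1

    w = np.ones(2)
    w = update_group_weights(w, group_rates=np.array([0.70, 0.90]), overall_rate=0.80)
    print(w)   # the lower-performing group receives a larger weight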

URL: https://openreview.net/forum?id=0f8tU3QwWD

---

Title: V1T: large-scale mouse V1 response prediction using a Vision Transformer

Authors: Bryan M. Li, Isabel Maria Cornacchia, Nathalie Rochefort, Arno Onken

Abstract: Accurate predictive models of the visual cortex neural response to natural visual stimuli remain a challenge in computational neuroscience. In this work, we introduce $V{\small 1}T$, a novel Vision Transformer based architecture that learns a shared visual and behavioral representation across animals. We evaluate our model on two large datasets recorded from mouse primary visual cortex and outperform previous convolution-based models by more than 12.7% in prediction performance. Moreover, we show that the self-attention weights learned by the Transformer correlate with the population receptive fields. Our model thus sets a new benchmark for neural response prediction and can be used jointly with behavioral and neural recordings to reveal meaningful characteristic features of the visual cortex.

URL: https://openreview.net/forum?id=qHZs2p4ZD4

---

Title: Novel Class Discovery for Long-tailed Recognition

Authors: Chuyu Zhang, Ruijie Xu, Xuming He

Abstract: While the novel class discovery has recently made great progress, existing methods typically focus on improving algorithms on class-balanced benchmarks. However, in real-world recognition tasks, the class distributions of their corresponding datasets are often imbalanced, which leads to serious performance degeneration of those methods. In this paper, we consider a more realistic setting for novel class discovery where the distributions of novel and known classes are long-tailed. One main challenge of this new problem is to discover imbalanced novel classes with the help of long-tailed known classes. To tackle this problem, we propose an adaptive self-labeling strategy based on an equiangular prototype representation of classes. Our method infers high-quality pseudo-labels for the novel classes by solving a relaxed optimal transport problem and effectively mitigates the class biases in learning the known and novel classes. We perform extensive experiments on CIFAR100, ImageNet100, Herbarium19 and large-scale iNaturalist18 datasets, and the results demonstrate the superiority of our method. Our code is available at \url{https://github.com/kleinzcy/NCDLR}.

URL: https://openreview.net/forum?id=ey5b7kODvK

---

Title: Asymptotic Analysis of Conditioned Stochastic Gradient Descent

Authors: Rémi Leluc, François Portier

Abstract: In this paper, we investigate a general class of stochastic gradient descent (SGD) algorithms, called $\textit{conditioned}$ SGD, based on a preconditioning of the gradient direction. Using a discrete-time approach with martingale tools, we establish under mild assumptions the weak convergence of the rescaled sequence of iterates for a broad class of conditioning matrices including stochastic first-order and second-order methods. Almost sure convergence results, which may be of independent interest, are also presented. Interestingly, the asymptotic normality result consists in a stochastic equicontinuity property, so that when the conditioning matrix is an estimate of the inverse Hessian, the algorithm is asymptotically optimal.
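
The conditioned SGD iteration studied here has the generic form $\theta_{k+1} = \theta_k - \gamma_k C_k \hat{g}_k$ for a preconditioning matrix $C_k$; a minimal toy instance on least squares is sketched below (the fixed inverse-Hessian preconditioner and step sizes are illustrative choices, not the paper's setting).

    # Conditioned SGD on a toy least-squares problem (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 5))
    b = A @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 0.1 * rng.standard_normal(200)

    theta = np.zeros(5)
    C = np.linalg.inv(A.T @ A / len(A))              # fixed inverse-Hessian preconditioner
    for k in range(1, 501):
        i = rng.integers(len(A))                     # sample one observation
        g = (A[i] @ theta - b[i]) * A[i]             # stochastic gradient
        theta -= (1.0 / k) * (C @ g)                 # conditioned SGD update
    print(theta)   # close to the true coefficients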

URL: https://openreview.net/forum?id=U4XgzRjfF1

---

Title: Learned Thresholds Token Merging and Pruning for Vision Transformers

Authors: Maxim Bonnaerens, Joni Dambre

Abstract: Vision transformers have demonstrated remarkable success in a wide range of computer vision tasks in recent years; however, their high computational costs remain a significant barrier to their practical deployment.
In particular, the complexity of transformer models is quadratic with respect to the number of input tokens.
Therefore techniques that reduce the number of input tokens that need to be processed have been proposed.
This paper introduces Learned Thresholds token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning.
LTMP uses learned threshold masking modules that dynamically determine which tokens to merge and which to prune.
We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task.
Our results demonstrate that LTMP achieves state-of-the-art accuracy across reduction rates while requiring only a single fine-tuning epoch, which is an order of magnitude faster than previous methods.
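
A simplified view of the pruning half of this idea, dropping tokens whose importance score falls below a threshold, is sketched below; the learned-threshold masking modules, the merging step, and the way scores are computed in LTMP are not reproduced, and the scores and threshold here are placeholders.

    # Threshold-based token pruning sketch (illustrative; not the LTMP modules).
    import torch

    def prune_tokens(tokens, scores, threshold):
        # tokens: (N, D) token embeddings; scores: (N,) importance scores.
        keep = scores >= threshold
        return tokens[keep], keep

    tokens = torch.randn(197, 768)      # e.g. ViT-B/16 tokens for one image
    scores = torch.rand(197)            # placeholder importance scores
    kept, mask = prune_tokens(tokens, scores, threshold=0.5)
    print(kept.shape)                   # roughly half the tokens remain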

URL: https://openreview.net/forum?id=WYKTCKpImz

---

Title: MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

Authors: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, xiyang luo, Benjamin Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew M. Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

Abstract: The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.

URL: https://openreview.net/forum?id=FqOG4osY7C

---

Title: Chasing Better Deep Image Priors between Over- and Under-parameterization

Authors: Qiming Wu, Xiaohan Chen, Yifan Jiang, Zhangyang Wang

Abstract: Deep Neural Networks (DNNs) are well-known to act as \textbf{over-parameterized} deep image priors (DIP) that regularize various image inverse problems. Meanwhile, researchers also proposed extremely compact, \textbf{under-parameterized} image priors (e.g., deep decoder) that are strikingly competent for image restoration too, despite a loss of accuracy. These two extremes push us to think whether there exists a better solution in the middle: \textit{between over- and under-parameterized image priors, can one identify ``intermediate" parameterized image priors that achieve better trade-offs between performance, efficiency, and even preserving strong transferability?} Drawing inspirations from the lottery ticket hypothesis (LTH), we conjecture and study a novel ``lottery image prior" (\textbf{LIP}) by exploiting DNN inherent sparsity, stated as: \textit{given an over-parameterized DNN-based image prior, it will contain a sparse subnetwork that can be trained in isolation, to match the original DNN's performance when being applied as a prior to various image inverse problems}. Our results validate the superiority of LIPs: we can successfully locate the LIP subnetworks from over-parameterized DIPs at substantial sparsity ranges. Those LIP subnetworks significantly outperform deep decoders under comparably compact model sizes (by often fully preserving the effectiveness of their over-parameterized counterparts), and they also possess high transferability across different images as well as restoration task types. Besides, we also extend LIP to compressive sensing image reconstruction, where a \textit{pre-trained} GAN generator is used as the prior (in contrast to \textit{untrained} DIP or deep decoder), and confirm its validity in this setting too. To our best knowledge, this is the first time that LTH is demonstrated to be relevant in the context of inverse problems or image priors. Codes are available at https://github.com/VITA-Group/Chasing-Better-DIPs.

URL: https://openreview.net/forum?id=EwJJks2cSa

---

Title: Federated High-Dimensional Online Decision Making

Authors: Chi-Hua Wang, Wenjie Li, Guang Lin

Abstract: We resolve the main challenge of federated bandit policy design via exploration-exploitation trade-off delineation under data decentralization with a local privacy protection argument. Such a challenge is practical in domain-specific applications and admits another layer of complexity in applications of medical decision-making and web marketing, where high-dimensional decision contexts are sensitive but important to inform decision-making. Existing (low dimensional) federated bandits suffer super-linear theoretical regret upper bound in high-dimensional scenarios and are at risk of client information leakage due to their inability to separate exploration from exploitation. This paper proposes a class of bandit policy design, termed Fedego Lasso, to complete the task of federated high-dimensional online decision-making with sub-linear theoretical regret and local client privacy argument. Fedego Lasso relies on a novel multi-client teamwork-selfish bandit policy design to perform decentralized collaborative exploration and federated egocentric exploration with logarithmic communication costs. Experiments demonstrate the effectiveness of the proposed algorithms on both synthetic and real-world datasets.

URL: https://openreview.net/forum?id=TjaMO63fc9

---

Title: Regret Bounds for Satisficing in Multi-Armed Bandit Problems

Authors: Thomas Michel, Hossein Hajiabolhassan, Ronald Ortner

Abstract: This paper considers the objective of \textit{satisficing} in multi-armed bandit problems. Instead of aiming to find an optimal arm, the learner is content with an arm whose reward is above a given satisfaction level. We provide algorithms and analysis for the realizable case when such a satisficing arm exists as well as for the general case when this may not be the case. Introducing the notion of \textit{satisficing regret}, our main result shows that in the general case it is possible to obtain constant satisficing regret when there is a satisficing arm (thereby correcting a contrary claim in the literature), while standard logarithmic regret bounds can be re-established otherwise. Experiments illustrate that our algorithm is not only superior to standard algorithms in the satisficing setting, but also works well in the classic bandit setting.
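
To make the satisficing objective concrete, the toy loop below commits to any arm whose lower confidence bound clears the satisfaction level, even when a better arm exists; this is only an illustration of the objective, not the algorithm analyzed in the paper, and the arm means, confidence radius, and horizon are arbitrary.

    # Toy satisficing bandit loop (illustration of the objective, not the paper's algorithm).
    import numpy as np

    rng = np.random.default_rng(0)
    means, S = np.array([0.3, 0.55, 0.7]), 0.5       # true arm means, satisfaction level
    counts, sums = np.zeros(3), np.zeros(3)
    for t in range(1, 2001):
        n = np.maximum(counts, 1)
        lcb = sums / n - np.sqrt(2.0 * np.log(t) / n)        # lower confidence bounds
        satisficing = np.flatnonzero(lcb >= S)
        arm = satisficing[0] if len(satisficing) else t % 3  # round-robin until satisfied
        counts[arm] += 1
        sums[arm] += rng.binomial(1, means[arm])
    print(counts)   # play concentrates on an arm with mean >= S, not necessarily the best one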


URL: https://openreview.net/forum?id=QnT41ZGNh9

---

Title: Using Confounded Data in Latent Model-Based Reinforcement Learning

Authors: Maxime Gasse, Damien GRASSET, Guillaume Gaudron, Pierre-Yves Oudeyer

Abstract: In the presence of confounding, naively using off-the-shelf offline reinforcement learning (RL) algorithms leads to sub-optimal behaviour. In this work, we propose a safe method to exploit confounded offline data in model-based RL, which improves the sample-efficiency of an interactive agent that also collects online, unconfounded data. First, we import ideas from the well-established framework of $do$-calculus to express model-based RL as a causal inference problem, thus bridging the gap between the fields of RL and causality. Then, we propose a generic method for learning a causal transition model from offline and online data, which captures and corrects the confounding effect using a hidden latent variable. We prove that our method is correct and efficient, in the sense that it attains better generalization guarantees thanks to the confounded offline data (in the asymptotic case), regardless of the confounding effect (the offline expert's behaviour). We showcase our method on a series of synthetic experiments, which demonstrate that a) using confounded offline data naively degrades the sample-efficiency of an RL agent; b) using confounded offline data correctly improves sample-efficiency.

URL: https://openreview.net/forum?id=nFWRuJXPkU

---

Title: Meta-Learning via Classifier(-free) Diffusion Guidance

Authors: Elvis Nava, Seijin Kobayashi, Yifei Yin, Robert K. Katzschmann, Benjamin F Grewe

Abstract: We introduce meta-learning algorithms that perform zero-shot weight-space adaptation of neural network models to unseen tasks. Our methods repurpose the popular generative image synthesis techniques of natural language guidance and diffusion models to generate neural network weights adapted for tasks. We first train an unconditional generative hypernetwork model to produce neural network weights; then we train a second "guidance" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: "HyperCLIP"-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model ("HyperLDM"), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing multi-task and meta-learning methods in a series of zero-shot learning experiments on our Meta-VQA dataset.

URL: https://openreview.net/forum?id=1irVjE7A3w

---

Title: Optimizing Learning Rate Schedules for Iterative Pruning of Deep Neural Networks

Authors: Shiyu Liu, Rohan Ghosh, John Chong Min Tan, Mehul Motani

Abstract: The importance of learning rate (LR) schedules on network pruning has been observed in a few recent works. As an example, Frankle and Carbin (2019) highlighted that winning tickets (i.e., accuracy preserving subnetworks) can not be found without applying a LR warmup schedule. Renda, Frankle and Carbin (2020) also demonstrated that rewinding the LR to its initial state at the end of each pruning cycle can improve pruning performance. In this paper, we go one step further by first providing a theoretical justification for the surprising effect of LR schedules. Next, we propose a LR schedule for network pruning called SILO, which stands for S-shaped Improved Learning rate Optimization. The advantages of SILO over existing LR schedules are two-fold: (i) SILO has a strong theoretical motivation and dynamically adjusts the LR during pruning to improve generalization. Specifically, SILO increases the LR upper bound (max_lr) in an S-shape. This leads to an improvement of 2% - 4% in extensive experiments with various types of networks (e.g., Vision Transformers, ResNet) on popular datasets such as ImageNet, CIFAR-10/100. (ii) In addition to the strong theoretical motivation, SILO is empirically optimal in the sense of matching an Oracle, which exhaustively searches for the optimal value of max_lr via grid search. We find that SILO is able to precisely adjust the value of max_lr to be within the Oracle optimized interval, resulting in performance competitive with the Oracle with significantly lower complexity.
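
The S-shaped growth of the LR upper bound can be illustrated with a simple logistic ramp over pruning cycles; the exact parameterization used by SILO may differ, and the bounds and steepness below are arbitrary placeholders.

    # Logistic (S-shaped) ramp for max_lr across pruning cycles (illustrative parameterization).
    import math

    def s_shaped_max_lr(cycle, total_cycles, lr_low=0.01, lr_high=0.1, steepness=10.0):
        t = cycle / max(total_cycles - 1, 1)                   # progress in [0, 1]
        s = 1.0 / (1.0 + math.exp(-steepness * (t - 0.5)))     # logistic S-curve
        return lr_low + (lr_high - lr_low) * s

    for c in range(5):
        print(c, round(s_shaped_max_lr(c, 5), 4))   # max_lr rises slowly, then quickly, then saturates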

URL: https://openreview.net/forum?id=nGW2Hotpq3

---

Title: Foiling Explanations in Deep Neural Networks

Authors: Snir Vitrack Tamam, Raz Lapid, Moshe Sipper

Abstract: Deep neural networks (DNNs) have greatly impacted numerous fields over the past decade. Yet despite exhibiting superb performance over many problems, their black-box nature still poses a significant challenge with respect to explainability. Indeed, explainable artificial intelligence (XAI) is crucial in several fields, wherein the answer alone---sans a reasoning of how said answer was derived---is of little value. This paper uncovers a troubling property of explanation methods for image-based DNNs: by making small visual changes to the input image---hardly influencing the network's output---we demonstrate how explanations may be arbitrarily manipulated through the use of evolution strategies. Our novel algorithm, AttaXAI, a model-and-data XAI-agnostic, adversarial attack on XAI algorithms, only requires access to the output logits of a classifier and to the explanation map; these weak assumptions render our approach highly useful where real-world models and data are concerned. We compare our method's performance on two benchmark datasets---CIFAR100 and ImageNet---using four different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet, MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that the XAI methods can be manipulated without the use of gradients or other model internals. Our novel algorithm is successfully able to manipulate an image in a manner imperceptible to the human eye, such that the XAI method outputs a specific explanation map. To our knowledge, this is the first such method in a black-box setting, and we believe it has significant value where explainability is desired, required, or legally mandatory.

URL: https://openreview.net/forum?id=wvLQMHtyLk

---

Title: Supervised Knowledge May Hurt Novel Class Discovery Performance

Authors: ZIYUN LI, Jona Otholt, Ben Dai, Di Hu, Christoph Meinel, Haojin Yang

Abstract: Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset by leveraging prior knowledge of a labeled set comprising disjoint but related classes. Given that most existing literature focuses primarily on utilizing supervised knowledge from a labeled set at the methodology level, this paper considers the question: Is supervised knowledge always helpful at different levels of semantic relevance? To proceed, we first establish a novel metric, so-called transfer leakage, to measure the semantic similarity between labeled/unlabeled datasets. To show the validity of the proposed metric, we build up a large-scale benchmark with various degrees of semantic similarities between labeled/unlabeled datasets on ImageNet by leveraging its hierarchical class structure. The results based on the proposed benchmark show that the proposed transfer leakage is in line with the hierarchical class structure; and that NCD performance is consistent with the semantic similarities (measured by the proposed metric). Next, by using the proposed transfer leakage, we conduct various empirical experiments with different levels of semantic similarity, yielding that supervised knowledge may hurt NCD performance. Specifically, using supervised information from a low-similarity labeled set may lead to a suboptimal result as compared to using pure self-supervised knowledge. These results reveal the inadequacy of the existing NCD literature which usually assumes that supervised knowledge is beneficial. Finally, we develop a pseudo-version of the transfer leakage as a practical reference to decide if supervised knowledge should be used in NCD. Its effectiveness is supported by our empirical studies, which show that the pseudo transfer leakage (with or without supervised knowledge) is consistent with the corresponding accuracy based on various datasets.

URL: https://openreview.net/forum?id=oqOBTo5uWD

---

Title: Long-term Forecasting with TiDE: Time-series Dense Encoder

Authors: Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, Rose Yu

Abstract: Recent work has shown that simple linear models can outperform several Transformer based approaches in long term time-series forecasting. Motivated by this, we propose a Multi-layer Perceptron (MLP) based encoder-decoder model, \underline{Ti}me-series \underline{D}ense \underline{E}ncoder (TiDE), for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies. Theoretically, we prove that the simplest linear analogue of our model can achieve near optimal error rate for linear dynamical systems (LDS) under some assumptions. Empirically, we show that our method can match or outperform prior approaches on popular long-term time-series forecasting benchmarks while being 5-10x faster than the best Transformer based model.
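
A heavily simplified MLP encoder-decoder forecaster in this spirit is sketched below; it omits the covariate handling and residual structure of TiDE, and the lookback, horizon, and hidden sizes are arbitrary.

    # Minimal dense (MLP) forecaster sketch (greatly simplified relative to TiDE).
    import torch
    import torch.nn as nn

    class DenseForecaster(nn.Module):
        def __init__(self, lookback, horizon, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(lookback, hidden), nn.ReLU(),   # dense encoder
                nn.Linear(hidden, hidden), nn.ReLU(),     # dense decoder
                nn.Linear(hidden, horizon),               # projection to the forecast horizon
            )

        def forward(self, past):          # past: (batch, lookback)
            return self.net(past)         # -> (batch, horizon)

    model = DenseForecaster(lookback=96, horizon=24)
    y_hat = model(torch.randn(8, 96))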

URL: https://openreview.net/forum?id=pCbC3aQB5W

---

Title: Empirical Limitations of the NTK for Understanding Scaling Laws in Deep Learning

Authors: Nikhil Vyas, Yamini Bansal, Preetum Nakkiran

Abstract: The ``Neural Tangent Kernel'' (NTK) (Jacot et al 2018), and its empirical variants have been proposed as a proxy to capture certain behaviors of real neural networks. In this work, we study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization. In particular, we demonstrate realistic settings where finite-width neural networks have significantly better data scaling exponents as compared to their corresponding empirical and infinite NTKs at initialization. This reveals a more fundamental difference between the real networks and NTKs, beyond just a few percentage points of test accuracy. Further, we show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling. Finally, we show that the empirical NTK continues to evolve throughout most of the training, in contrast with prior work which suggests that it stabilizes after a few epochs of training. Altogether, our work establishes concrete limitations of the NTK approach in understanding scaling laws of real networks on natural datasets.

URL: https://openreview.net/forum?id=Y3saBb7mCE

---

Title: Transport Score Climbing: Variational Inference Using Forward KL and Adaptive Neural Transport

Authors: Liyi Zhang, David Blei, Christian A Naesseth

Abstract: Variational inference often minimizes the ``reverse'' Kullback-Leibler (KL) $D_{KL}(q||p)$ from the approximate distribution $q$ to the posterior $p$. Recent work studies the ``forward'' KL $D_{KL}(p||q)$, which unlike reverse KL does not lead to variational approximations that underestimate uncertainty. Markov chain Monte Carlo (MCMC) methods were used to evaluate the expectation in computing the forward KL. This paper introduces Transport Score Climbing (TSC), a method that optimizes $D_{KL}(p||q)$ by using Hamiltonian Monte Carlo (HMC) but running the HMC chain on a transformed, or warped, space. A function called the transport map performs the transformation by acting as a change-of-variable from the latent variable space. TSC uses HMC samples to dynamically train the transport map while optimizing $D_{KL}(p||q)$. TSC leverages synergies, where better transport maps lead to better HMC sampling, which then leads to better transport maps. We demonstrate TSC on synthetic and real data, including using TSC to train variational auto-encoders. We find that TSC achieves competitive performance on the experiments.

URL: https://openreview.net/forum?id=7KW7zvKd7J

---

Title: Distributionally Robust Classification on a Data Budget

Authors: Benjamin Feuer, Ameya Joshi, Minh Pham, Chinmay Hegde

Abstract: Real world uses of deep learning require predictable model behavior under distribution shifts. Models such as CLIP show emergent natural distributional robustness comparable to humans, but may require hundreds of millions of training samples. Can we train robust learners in a domain where data is limited? To rigorously address this question, we introduce JANuS (Joint Annotations and Names Set), a collection of four new training datasets with images, labels, and corresponding captions, and perform a series of carefully controlled investigations of factors contributing to robustness in image classification, then compare those results to findings derived from a large-scale meta-analysis. Using this approach, we show that standard ResNet-50 trained with the cross-entropy loss on 2.4 million image samples can attain comparable robustness to a CLIP ResNet-50 trained on 400 million samples. To our knowledge, this is the first result showing (near) state-of-the-art distributional robustness on limited data budgets.

URL: https://openreview.net/forum?id=D5Z2E8CNsD

---

Title: The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain

Authors: Arsenii Kirillovich Moskvichev, Victor Vikram Odouard, Melanie Mitchell

Abstract: The ability to form and abstract concepts is key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven’s Progressive Matrices and Bongard problems, but even when AI systems succeed on such problems, the systems are rarely evaluated in depth to see if they have actually grasped the concepts they are meant to capture.

In this paper we describe an in-depth evaluation benchmark for the Abstraction and Reasoning Corpus (ARC), a collection of few-shot abstraction and analogy problems developed by Chollet (2019). In particular, we describe ConceptARC, a new, publicly available benchmark in the ARC domain that systematically assesses abstraction and generalization abilities on a number of basic spatial and semantic concepts. ConceptARC differs from the original ARC dataset in that it is specifically organized around “concept groups”—sets of problems that focus on specific concepts and that vary in complexity and level of abstraction. We report results on testing humans on this benchmark as well as three machine solvers: the top two programs from a 2021 ARC competition and OpenAI’s GPT-4. Our results show that humans substantially outperform the machine solvers on this benchmark, showing abilities to abstract and generalize concepts that are not yet captured by AI systems. We believe that this benchmark will spur improvements in the development of AI systems for conceptual abstraction and in the effective evaluation of such systems.


URL: https://openreview.net/forum?id=8ykyGbtt2q

---

Title: Adaptive Compression for Communication-Efficient Distributed Training

Authors: Maksim Makarenko, Elnur Gasanov, Abdurakhmon Sadiev, Rustem Islamov, Peter Richtárik

Abstract: We propose Adaptive Compressed Gradient Descent (AdaCGD) -- a novel optimization algorithm for communication-efficient training of supervised machine learning models with adaptive compression level. Our approach is inspired by the recently proposed three point compressor (3PC) framework of Richtarik et al. (2022) , which includes error feedback (EF21), lazily aggregated gradient (LAG), and their combination as special cases, and offers the current state-of-the-art rates for these methods under weak assumptions. While the above mechanisms offer a fixed compression level or adapt between two extreme compression levels, we propose a much finer adaptation. In particular, we allow users to choose between selected contractive compression mechanisms, such as Top-$K$ sparsification with a user-defined selection of sparsification levels $K$, or quantization with a user-defined selection of quantization levels, or their combination. AdaCGD chooses the appropriate compressor and compression level adaptively during the optimization process. Besides i) proposing a theoretically-grounded multi-adaptive communication compression mechanism, we further ii) extend the 3PC framework to bidirectional compression, i.e., allow the server to compress as well, and iii) provide sharp convergence bounds in the strongly convex, convex, and nonconvex settings. The convex regime results are new even for several key special cases of our general mechanism, including 3PC and EF21. In all regimes, our rates are superior compared to all existing adaptive compression methods.
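
For concreteness, the Top-$K$ sparsification operator mentioned above (one of the contractive compressors AdaCGD can choose among) can be written in a few lines; the sketch is illustrative and independent of the paper's implementation.

    # Top-K sparsification: keep the k largest-magnitude entries, zero the rest.
    import numpy as np

    def top_k(x, k):
        out = np.zeros_like(x)
        idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest magnitudes
        out[idx] = x[idx]
        return out

    g = np.random.default_rng(0).standard_normal(10)
    print(top_k(g, 3))   # the compressed gradient message a worker would send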

URL: https://openreview.net/forum?id=Rb6VDOHebB

---

Title: Expected Worst Case Regret via Stochastic Sequential Covering

Authors: Changlong Wu, Mohsen Heidari, Ananth Grama, Wojciech Szpankowski

Abstract: We study the problem of sequential prediction and online minimax regret with stochastically generated features under a general loss function. In an online learning setting, Nature selects features and associates a true label with these features. A learner uses features to predict a label, which is compared to the true label, and a loss is incurred. The total loss over $T$ rounds, when compared to a loss incurred by a set of experts, is known as a regret. We introduce the notion of *expected worst case minimax regret* that generalizes and encompasses prior known minimax regrets. For such minimax regrets, we establish tight upper bounds via a novel concept of *stochastic global sequential covering*. We show that for a hypothesis class of VC-dimension $\mathsf{VC}$ and $i.i.d.$ generated features over $T$ rounds, the cardinality of stochastic global sequential covering can be upper bounded with high probability (w.h.p.) by $e^{O(\mathsf{VC} \cdot \log^2 T)}$. We then improve this bound by introducing a new complexity measure called the *Star-Littlestone* dimension, and show that classes with Star-Littlestone dimension $\mathsf{SL}$ admit a stochastic global sequential covering of order $e^{O(\mathsf{SL} \cdot \log T)}$. We further establish upper bounds for real valued classes with finite fat-shattering numbers. Finally, by applying information-theoretic tools for the fixed design minimax regrets, we provide lower bounds for expected worst case minimax regret. We demonstrate the effectiveness of our approach by establishing tight bounds on the expected worst case minimax regrets for logarithmic loss and general mixable losses.

URL: https://openreview.net/forum?id=H1SekypXKA

---

Title: Robust Alzheimer's Progression Modeling using Cross-Domain Self-Supervised Deep Learning

Authors: Saba Dadsetan, Mohsen Hejrati, Shandong Wu, Somaye Hashemifar

Abstract: Developing successful artificial intelligence systems in practice depends on both robust deep learning models and large, high-quality data. However, acquiring and labeling data can be prohibitively expensive and time-consuming in many real-world applications, such as clinical disease models. Self-supervised learning has demonstrated great potential in increasing model accuracy and robustness in small data regimes. In addition, many clinical imaging and disease modeling applications rely heavily on regression of continuous quantities. However, the applicability of self-supervised learning for these medical-imaging regression tasks has not been extensively studied. In this study, we develop a cross-domain self-supervised learning approach for disease prognostic modeling as a regression problem using medical images as input. We demonstrate that self-supervised pretraining can improve the prediction of Alzheimer's Disease progression from brain MRI. We also show that pretraining on extended (but not labeled) brain MRI data outperforms pretraining on natural images. We further observe that the highest performance is achieved when both natural images and extended brain-MRI data are used for pretraining.

URL: https://openreview.net/forum?id=HVAeM6sNo8

---

Title: Learning Augmentation Distributions using Transformed Risk Minimization

Authors: Evangelos Chatzipantazis, Stefanos Pertigkiozoglou, Kostas Daniilidis, Edgar Dobriban

Abstract: We propose a new \emph{Transformed Risk Minimization} (TRM) framework as an extension of classical risk minimization.
In TRM, we optimize not only over predictive models, but also over data transformations; specifically over distributions thereof.
As a key application, we focus on learning augmentations; for instance appropriate rotations of images, to improve classification performance with a given class of predictors. Our TRM method (1) jointly learns transformations and models in a \emph{single training loop}, (2) works with any training algorithm applicable to standard risk minimization, and (3) handles any transforms, such as discrete and continuous classes of augmentations. To avoid overfitting when implementing empirical transformed risk minimization, we propose a novel regularizer based on PAC-Bayes theory. For learning augmentations of images, we propose a new parametrization of the space of augmentations via a stochastic composition of blocks of geometric transforms. This leads to the new \emph{Stochastic Compositional Augmentation Learning} (SCALE) algorithm. The performance of TRM with SCALE compares favorably to prior methods on CIFAR10/100. Additionally, we show empirically that SCALE can correctly learn certain symmetries in the data distribution (recovering rotations on rotated MNIST) and can also improve calibration of the learned model.

URL: https://openreview.net/forum?id=LRYtNj8Xw0

---

Title: Structured Low-Rank Tensors for Generalized Linear Models

Authors: Batoul Ahmad Taki, Anand Sarwate, Waheed U. Bajwa

Abstract: Recent works have shown that imposing tensor structures on the coefficient tensor in regression problems can lead to more reliable parameter estimation and lower sample complexity compared to vector-based methods. This work investigates a new low-rank tensor model, called Low Separation Rank (LSR), in Generalized Linear Model (GLM) problems. The LSR model – which generalizes the well-known Tucker and CANDECOMP/PARAFAC (CP) models, and is a special case of the Block Tensor Decomposition (BTD) model – is imposed onto the coefficient tensor in the GLM model. This work proposes a block coordinate descent algorithm for parameter estimation in LSR-structured tensor GLMs. Most importantly, it derives a minimax lower bound on the error threshold on estimating the coefficient tensor in LSR tensor GLM problems. The minimax bound is proportional to the intrinsic degrees of freedom in the LSR tensor GLM problem, suggesting that its sample complexity may be significantly lower than that of vectorized GLMs. This result can also be specialised to lower bound the estimation error in CP and Tucker-structured GLMs. The derived bounds are comparable to tight bounds in the literature for Tucker linear regression, and the tightness of the minimax lower bound is further assessed numerically. Finally, numerical experiments on synthetic datasets demonstrate the efficacy of the proposed LSR tensor model for three regression types (linear, logistic and Poisson). Experiments on a collection of medical imaging datasets demonstrate the usefulness of the LSR model over other tensor models (Tucker and CP) on real, imbalanced data with limited available samples.

URL: https://openreview.net/forum?id=qUxBs3Ln41

---

Title: Scalable Stochastic Gradient Riemannian Langevin Dynamics in Non-Diagonal Metrics

Authors: Hanlin Yu, Marcelo Hartmann, Bernardo Williams, Arto Klami

Abstract: Stochastic-gradient sampling methods are often used to perform Bayesian inference on neural networks. It has been observed that the methods in which notions of differential geometry are included tend to have better performances, with the Riemannian metric improving posterior exploration by accounting for the local curvature. However, the existing methods often resort to simple diagonal metrics to remain computationally efficient. This loses some of the gains. We propose two non-diagonal metrics that can be used in stochastic-gradient samplers to improve convergence and exploration but have only a minor computational overhead over diagonal metrics. We show that for fully connected neural networks (NNs) with sparsity-inducing priors and convolutional NNs with correlated priors, using these metrics can provide improvements. For some other choices the posterior is sufficiently easy also for the simpler metrics.

URL: https://openreview.net/forum?id=dXAuvo6CGI

---

Title: Towards a Defense Against Federated Backdoor Attacks Under Continuous Training

Authors: Shuaiqi Wang, Jonathan Hayase, Giulia Fanti, Sewoong Oh

Abstract: Backdoor attacks are dangerous and difficult to prevent in federated learning (FL), where training data is sourced from untrusted clients over long periods of time. These difficulties arise because: (a) defenders in FL do not have access to raw training data, and (b) a phenomenon we identify called backdoor leakage causes models trained continuously to eventually suffer from backdoors due to cumulative errors in defense mechanisms. We propose a framework called shadow learning for defending against backdoor attacks in the FL setting under long-range training. Shadow learning trains two models in parallel: a backbone model and a shadow model. The backbone is trained without any defense mechanism to obtain good performance on the main task. The shadow model combines filtering of malicious clients with early-stopping to control the attack success rate even as the data distribution changes. We theoretically motivate our design and show experimentally that our framework significantly improves upon existing defenses against backdoor attacks.

URL: https://openreview.net/forum?id=HwcB5elyuG

---

Title: mL-BFGS: A Momentum-based L-BFGS for Distributed Large-scale Neural Network Optimization

Authors: Yue Niu, Zalan Fabian, Sunwoo Lee, Mahdi Soltanolkotabi, Salman Avestimehr

Abstract: Quasi-Newton methods still face significant challenges in training large-scale neural networks due to additional compute costs in the Hessian related computations and instability issues in stochastic training.
A well-known method, L-BFGS that efficiently approximates the Hessian using history parameter and gradient changes, suffers convergence instability in stochastic training.
So far, attempts that adapt L-BFGS to large-scale stochastic training incur considerable extra overhead, which offsets its convergence benefits in wall-clock time.
In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization.
mL-BFGS introduces a nearly cost-free momentum scheme into L-BFGS update and greatly reduces stochastic noise in the Hessian, therefore stabilizing convergence during stochastic optimization.
For model training at a large scale, mL-BFGS approximates a block-wise Hessian, thus enabling distributing compute and memory costs across all computing nodes.
We provide a supporting convergence analysis for mL-BFGS in stochastic settings.
To investigate mL-BFGS's potential in large-scale DNN training, we train benchmark neural models using mL-BFGS and compare performance with baselines (SGD, Adam, and other quasi-Newton methods).
Results show that mL-BFGS achieves both noticeable iteration-wise and wall-clock speedup.

URL: https://openreview.net/forum?id=9jnsPp8DP3

---

Title: Mitigating Real-World Distribution Shifts in the Fourier Domain

Authors: Kiran Krishnamachari, See-Kiong Ng, Chuan-Sheng Foo

Abstract: While machine learning systems can be highly accurate in their training environments, their performance in real-world deployments can suffer significantly due to distribution shifts. Real-world distribution shifts involve various input distortions due to noise, weather, device and other variations. Many real-world distribution shifts are not represented in standard domain adaptation datasets and prior empirical work has shown that domain adaptation methods developed using these standard datasets may not generalize well to real-world distribution shifts. Furthermore, motivated by observations of the sensitivity of deep neural networks (DNN) to the spectral statistics of data, which can vary in real-world scenarios, we propose Fourier Moment Matching (FMM), a model-agnostic input transformation that matches the Fourier-amplitude statistics of source to target data using unlabeled samples. We demonstrate through extensive empirical evaluations across time-series, image classification and semantic segmentation tasks that FMM is effective both individually and when combined with a variety of existing methods to overcome real-world distribution shifts.
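
The underlying operation, imposing target Fourier-amplitude statistics while keeping the source phase, can be sketched as follows; this is only a rough illustration (FMM estimates amplitude statistics from unlabeled target samples and its exact matching procedure may differ), and grayscale 2-D arrays are assumed.

    # Impose a target Fourier amplitude while keeping the source phase (illustrative sketch).
    import numpy as np

    def match_fourier_amplitude(src, target_amplitude):
        F = np.fft.fft2(src)
        phase = np.angle(F)
        return np.real(np.fft.ifft2(target_amplitude * np.exp(1j * phase)))

    rng = np.random.default_rng(0)
    src, tgt = rng.random((32, 32)), rng.random((32, 32))
    tgt_amp = np.abs(np.fft.fft2(tgt))          # e.g. an amplitude spectrum estimated from target data
    out = match_fourier_amplitude(src, tgt_amp)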

URL: https://openreview.net/forum?id=lu4oAq55iK

---

Title: Learning representations that are closed-form Monge mapping optimal with application to domain adaptation

Authors: Oliver Struckmeier, Ievgen Redko, Anton Mallasto, Karol Arndt, Markus Heinonen, Ville Kyrki

Abstract: Optimal transport (OT) is a powerful geometric tool used to compare and align probability measures following the least-effort principle. Despite its widespread use in machine learning (ML), the OT problem still carries a significant computational burden, while at the same time suffering from the curse of dimensionality for measures supported on general high-dimensional spaces. In this paper, we propose to tackle these challenges using representation learning. In particular, we seek to learn an embedding space in which the samples of the two input measures become alignable with a simple affine mapping that can be calculated efficiently in closed form. We then show that such an approach leads to results that are comparable to solving the original OT problem when applied to the transfer learning task on which many OT baselines were previously evaluated, in both homogeneous and heterogeneous domain adaptation settings.
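
Between Gaussian approximations of two measures, the closed-form affine map alluded to above is the quadratic-cost Monge map; the sketch below computes it from sample means and covariances. Learning the embedding itself is not shown, and the paper's estimator may differ in details.

import numpy as np
from scipy.linalg import sqrtm, inv

def gaussian_monge_map(X_src, X_tgt):
    """Affine map T(x) = mu_t + A (x - mu_s) that is OT-optimal between
    N(mu_s, Sigma_s) and N(mu_t, Sigma_t) under the quadratic cost."""
    mu_s, mu_t = X_src.mean(0), X_tgt.mean(0)
    S_s = np.cov(X_src, rowvar=False)
    S_t = np.cov(X_tgt, rowvar=False)
    S_s_half = np.real(sqrtm(S_s))
    S_s_half_inv = inv(S_s_half)
    A = S_s_half_inv @ np.real(sqrtm(S_s_half @ S_t @ S_s_half)) @ S_s_half_inv
    return lambda x: mu_t + (x - mu_s) @ A.T

rng = np.random.default_rng(0)
X_src = rng.normal(size=(500, 3))
X_tgt = rng.normal(size=(500, 3)) @ np.diag([2.0, 0.5, 1.0]) + np.array([1.0, -1.0, 0.0])
T = gaussian_monge_map(X_src, X_tgt)
mapped = T(X_src)
print("mapped mean ~ target mean:", mapped.mean(0).round(2), X_tgt.mean(0).round(2))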

URL: https://openreview.net/forum?id=nOIGfQnFZm

---

Title: Learning from time-dependent streaming data with online stochastic algorithms

Authors: Antoine Godichon-Baggioni, Nicklas Werge, Olivier Wintenberger

Abstract: This paper addresses stochastic optimization in a streaming setting with time-dependent and biased gradient estimates. We analyze several first-order methods, including Stochastic Gradient Descent (SGD), mini-batch SGD, and time-varying mini-batch SGD, along with their Polyak-Ruppert averages. Our non-asymptotic analysis establishes novel heuristics that link dependence, biases, and convexity levels, enabling accelerated convergence. Specifically, our findings demonstrate that (i) time-varying mini-batch SGD methods have the capability to break long- and short-range dependence structures, (ii) biased SGD methods can achieve comparable performance to their unbiased counterparts, and (iii) incorporating Polyak-Ruppert averaging can accelerate the convergence of the stochastic optimization algorithms. To validate our theoretical findings, we conduct a series of experiments using both simulated and real-life time-dependent data.
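
A self-contained illustration of the streaming setting analyzed above (not the paper's exact estimators or rates): time-varying mini-batch SGD on a linear model with AR(1)-dependent observation noise, together with its Polyak-Ruppert average.

import numpy as np

rng = np.random.default_rng(0)
dim, steps = 5, 400
theta_true = rng.normal(size=dim)
theta, theta_avg = np.zeros(dim), np.zeros(dim)

def ar1_noise(n, state, phi=0.8, sigma=0.5):
    """Time-dependent (AR(1)) observation noise carried across mini-batches."""
    out = np.empty(n)
    for i in range(n):
        state = phi * state + sigma * rng.normal()
        out[i] = state
    return out, state

state = 0.0
for t in range(1, steps + 1):
    batch = int(np.ceil(np.sqrt(t)))             # time-varying (growing) mini-batch size
    X = rng.normal(size=(batch, dim))
    eps, state = ar1_noise(batch, state)
    y = X @ theta_true + eps
    grad = X.T @ (X @ theta - y) / batch         # least-squares gradient on the current batch
    theta -= (0.5 / np.sqrt(t)) * grad           # decaying step size
    theta_avg += (theta - theta_avg) / t         # Polyak-Ruppert running average

print("last-iterate error:", round(np.linalg.norm(theta - theta_true), 3))
print("averaged error:    ", round(np.linalg.norm(theta_avg - theta_true), 3))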

URL: https://openreview.net/forum?id=kdfiEu1ul6

---

Title: DoCoM: Compressed Decentralized Optimization with Near-Optimal Sample Complexity

Authors: Chung-Yiu Yau, Hoi To Wai

Abstract: This paper proposes the Doubly Compressed Momentum-assisted stochastic gradient tracking algorithm (DoCoM) for communication-efficient decentralized optimization. The algorithm features two main ingredients to achieve a near-optimal sample complexity while allowing for communication compression. First, the algorithm tracks both the averaged iterate and stochastic gradient using compressed gossiping consensus. Second, a momentum step is incorporated for adaptive variance reduction with the local gradient estimates. We show that DoCoM finds a near-stationary solution at all participating agents satisfying $\mathbb{E}[ \| \nabla f( \theta ) \|^2 ] = {\cal O}( 1 / T^{2/3} )$ in $T$ iterations, where $f(\theta)$ is a smooth (possibly non-convex) objective function. The proof is achieved by analytically designing a new potential function that tightly tracks the one-iteration progress of DoCoM. As a corollary, our analysis also establishes the linear convergence of DoCoM to a globally optimal solution for objective functions satisfying the Polyak-Łojasiewicz condition. Numerical experiments demonstrate that our algorithm outperforms several state-of-the-art algorithms in practice.

URL: https://openreview.net/forum?id=W0ehjkl9x7

---

Title: Neural Monge Map estimation and its applications

Authors: Jiaojiao Fan, Shu Liu, Shaojun Ma, Hao-Min Zhou, Yongxin Chen

Abstract: The Monge map refers to the optimal transport map between two probability distributions and provides a principled approach to transform one distribution into another. Neural network-based optimal transport map solvers have gained great attention in recent years. Along this line, we present a scalable algorithm for computing the neural Monge map between two probability distributions. Our algorithm is based on a weak form of the optimal transport problem; thus it only requires samples from the marginals instead of their analytic expressions, and it can be applied in large-scale settings. Furthermore, using the duality gap, we provide a rigorous \textit{a posteriori} error analysis for the method. Unlike other existing methods for estimating Monge maps from samples, which are usually restricted to quadratic costs, our algorithm is suitable for general cost functions. The performance of our algorithm is demonstrated through a series of experiments with both synthetic and realistic data, including text-to-image generation, class-preserving mapping, and image inpainting tasks.

URL: https://openreview.net/forum?id=2mZSlQscj3

---

Title: GPS++: Reviving the Art of Message Passing for Molecular Property Prediction

Authors: Dominic Masters, Josef Dean, Kerstin Klaeser, Zhiyi Li, Samuel Maddrell-Mander, Adam Sanders, Hatem Helal, Deniz Beker, Andrew William Fitzgibbon, Shenyang Huang, Ladislav Rampášek, Dominique Beaini

Abstract: We present GPS++, a hybrid Message Passing Neural Network / Graph Transformer model for molecular property prediction. Our model integrates a well-tuned local message passing component and biased global attention with other key ideas from prior literature to achieve state-of-the-art results on large-scale molecular dataset PCQM4Mv2. Through a thorough ablation study we highlight the impact of individual components and find that nearly all of the model’s performance can be maintained without any use of global self-attention, showing that message passing is still a competitive approach for 3D molecular property prediction despite the recent dominance of graph transformers. We also find that our approach is significantly more accurate than prior art when 3D positional information is not available.

URL: https://openreview.net/forum?id=moVEUgJaHO

---

Title: Improved Group Robustness via Classifier Retraining on Independent Splits

Authors: Thien Hang Nguyen, Hongyang R Zhang, Huy Nguyen

Abstract: Deep neural networks trained by minimizing the average risk can achieve strong average performance. Still, their performance for a subgroup may degrade if the subgroup is underrepresented in the overall data population. Group distributionally robust optimization (Sagawa et al., 2020a), or group DRO in short, is a widely used baseline for learning models with strong worst-group performance. We note that this method requires group labels for every example at training time and can overfit to small groups, requiring strong regularization. Given a limited amount of group labels at training time, Just Train Twice (Liu et al., 2021), or JTT in short, is a two-stage method that first infers a pseudo group label for every unlabeled example and then applies group DRO based on the inferred group labels. The inference process is also sensitive to overfitting, sometimes involving additional hyperparameters. This paper designs a simple method based on the idea of classifier retraining on independent splits of the training data. We find that using a novel sample-splitting procedure achieves robust worst-group performance in the fine-tuning step. When evaluated on benchmark image and text classification tasks, our approach consistently performs favorably compared to group DRO, JTT, and other strong baselines when either group labels are available during training or are only given in validation sets. Importantly, our method only relies on a single hyperparameter, which adjusts the fraction of labels used for training feature extractors vs. training classification layers. We justify the rationale of our splitting scheme with a generalization-bound analysis of the worst-group loss.
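
A schematic sketch of the two-stage recipe described above (the paper's exact sample-splitting and group-balancing details are omitted, and the model choices below are placeholders): the feature extractor is trained on one independent split and only the final classifier is refit on the other split; the split fraction plays the role of the single hyperparameter.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

split_frac = 0.7                                   # fraction of data used for the feature extractor
n_a = int(split_frac * len(X))
idx = rng.permutation(len(X))
A, B = idx[:n_a], idx[n_a:]

# Stage 1: train the representation (here, a small MLP) on split A only.
backbone = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
backbone.fit(X[A], y[A])

# Stage 2: freeze the representation and retrain a fresh linear head on split B.
def features(Z):
    # hidden activations of the fitted MLP (ReLU of its first layer)
    return np.maximum(Z @ backbone.coefs_[0] + backbone.intercepts_[0], 0.0)

head = LogisticRegression(max_iter=1000).fit(features(X[B]), y[B])
print("held-in accuracy of retrained head:", head.score(features(X[B]), y[B]))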

URL: https://openreview.net/forum?id=Qlvgq9eC63

---

Title: Execution-based Code Generation using Deep Reinforcement Learning

Authors: Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, Chandan K. Reddy

Abstract: The utilization of programming language (PL) models, pre-trained on large-scale code corpora, as a means of automating software engineering processes has demonstrated considerable potential in streamlining various code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting unique sequence-level characteristics of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that synergistically combines pre-trained PL models with Proximal Policy Optimization (PPO) which is a widely used deep reinforcement learning technique. By utilizing non-differentiable feedback from code execution and structure alignment, PPOCoder seamlessly integrates external code-specific knowledge into the model optimization process. It's important to note that PPOCoder is a task-agnostic and model-agnostic framework that can be used across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, achieving significant improvements in compilation success rates and functional correctness across different PLs.
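
The non-differentiable execution feedback is the easiest piece to sketch; the snippet below shows one plausible reward signal (compilability plus unit-test pass rate) that could be fed to PPO. The PPO fine-tuning loop and the paper's exact reward shaping are not shown.

import os, subprocess, sys, tempfile, textwrap

def execution_reward(generated_code: str, tests: list, timeout: float = 5.0) -> float:
    """Scalar reward from execution: -1 if the code does not even compile,
    otherwise the fraction of unit tests that pass."""
    try:
        compile(generated_code, "<generated>", "exec")
    except SyntaxError:
        return -1.0

    passed = 0
    for test in tests:
        program = generated_code + "\n" + textwrap.dedent(test)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
            passed += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass
        finally:
            os.unlink(path)
    return passed / len(tests)

sample = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
print("reward:", execution_reward(sample, tests))   # 1.0 for this sample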

URL: https://openreview.net/forum?id=0XBuaxqEcG

---

Title: TabCBM: Concept-based Interpretable Neural Networks for Tabular Data

Authors: Mateo Espinosa Zarlenga, Zohreh Shams, Michael Edward Nelson, Been Kim, Mateja Jamnik

Abstract: Concept-based interpretability addresses the opacity of deep neural networks by constructing an explanation for a model's prediction using high-level units of information referred to as concepts. Research in this area, however, has been mainly focused on image and graph-structured data, leaving high-stakes tasks whose data is tabular out of reach of existing methods. In this paper, we address this gap by introducing the first definition of what a high-level concept may entail in tabular data. We use this definition to propose Tabular Concept Bottleneck Models (TabCBMs), a family of interpretable self-explaining neural architectures capable of learning high-level concept explanations for tabular tasks. As our method produces concept-based explanations both when partial concept supervision or no concept supervision is available at training time, it is adaptable to settings where concept annotations are missing. We evaluate our method in both synthetic and real-world tabular tasks and show that TabCBM outperforms or performs competitively compared to state-of-the-art methods, while providing a high level of interpretability as measured by its ability to discover known high-level concepts. Finally, we show that TabCBM can discover important high-level concepts in synthetic datasets inspired by critical tabular tasks (e.g., single-cell RNAseq) and allows for human-in-the-loop concept interventions in which an expert can identify and correct mispredicted concepts to boost the model's performance.

URL: https://openreview.net/forum?id=TIsrnWpjQ0

---

Title: Data Augmentation is a Hyperparameter: Cherry-picked Self-Supervision for Unsupervised Anomaly Detection is Creating the Illusion of Success

Authors: Jaemin Yoo, Tiancheng Zhao, Leman Akoglu

Abstract: Self-supervised learning (SSL) has emerged as a promising alternative for creating supervisory signals for real-world problems, avoiding the extensive cost of manual labeling. SSL is particularly attractive for unsupervised tasks such as anomaly detection (AD), where labeled anomalies are rare or often nonexistent. A large catalog of augmentation functions has been used for SSL-based AD (SSAD) on image data, and recent works have reported that the type of augmentation has a significant impact on accuracy. Motivated by these findings, this work sets out to put image-based SSAD under a larger lens and investigate the role of data augmentation in SSAD. Through extensive experiments on 3 different detector models and across 420 AD tasks, we provide comprehensive numerical and visual evidence that the alignment between data augmentation and the anomaly-generating mechanism is the key to the success of SSAD, and that, in its absence, SSL may even impair accuracy. To the best of our knowledge, this is the first meta-analysis on the role of data augmentation in SSAD.

URL: https://openreview.net/forum?id=HyzCuCV1jH

---

Title: Spectral learning of Bernoulli linear dynamical systems models for decision-making

Authors: Iris R Stone, Yotam Sagiv, Il Memming Park, Jonathan W. Pillow

Abstract: Latent linear dynamical systems with Bernoulli observations provide a powerful modeling framework for identifying the temporal dynamics underlying binary time series data, which arise in a variety of contexts such as binary decision-making and discrete stochastic processes such as binned neural spike trains. Here we develop a spectral learning method for fast, efficient fitting of probit-Bernoulli latent linear dynamical system (LDS) models. Our approach extends traditional subspace identification methods to the Bernoulli setting via a transformation of the first and second sample moments. This results in a robust, fixed-cost estimator that avoids the hazards of local optima and the long computation time of iterative fitting procedures like the expectation-maximization (EM) algorithm. In regimes where data is limited or assumptions about the statistical structure of the data are not met, we demonstrate that the spectral estimate provides a good initialization for Laplace-EM fitting. Finally, we show that the estimator provides substantial benefits to real world settings by analyzing data from mice performing a sensory decision-making task.

URL: https://openreview.net/forum?id=giw2vcAhiH

---

Title: Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration

Authors: Mauricio Delbracio, Peyman Milanfar

Abstract: Inversion by Direct Iteration (InDI) is a new formulation for supervised image restoration that avoids the so-called ``regression to the mean'' effect and produces more realistic and detailed images than existing regression-based methods. It does this by gradually improving image quality in small steps, similar to generative denoising diffusion models.

Image restoration is an ill-posed problem where multiple high-quality images are plausible reconstructions of a given low-quality input. The outcome of a single-step regression model is therefore typically an aggregate of all possible explanations, lacking detail and realism. The main advantage of InDI is that it does not try to predict the clean target image in a single step but instead gradually improves the image in small steps, resulting in better perceptual quality.

While generative denoising diffusion models also work in small steps, our formulation is distinct in that it does not require knowledge of any analytic form of the degradation process. Instead, we directly learn an iterative restoration process from low-quality and high-quality paired examples. InDI can be applied to virtually any image degradation, given paired training data. In conditional denoising diffusion image restoration the denoising network generates the restored image by repeatedly denoising an initial image of pure noise, conditioned on the degraded input. Contrary to conditional denoising formulations, InDI directly proceeds by iteratively restoring the input low-quality image, producing high-quality results on a variety of image restoration tasks, including motion and out-of-focus deblurring, super-resolution, compression artifact removal, and denoising.
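
A minimal sketch of the small-step restoration loop described above. The blending rule below is one natural way to take small steps toward the model's clean-image prediction and may not match the paper's exact update; the network, time discretization, and training procedure are placeholders.

import torch

@torch.no_grad()
def indi_restore(model, degraded, num_steps=20):
    """model(x, t) is assumed to predict the clean image given the iterate x at time t."""
    x = degraded.clone()
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        delta = t - t_next
        x_hat = model(x, t.expand(x.shape[0]))          # predicted clean image
        x = (delta / t) * x_hat + (1 - delta / t) * x   # small step toward the prediction
    return x

# Toy usage with an identity "model" just to show the call signature.
dummy_model = lambda x, t: x
restored = indi_restore(dummy_model, torch.randn(2, 3, 32, 32))
print(restored.shape)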

URL: https://openreview.net/forum?id=VmyFF5lL3F

---

Title: Self-Supervision is All You Need for Solving Rubik’s Cube

Authors: Kyo Takano

Abstract: Existing combinatorial search methods are often complex and require some level of expertise. This work introduces a simple and efficient deep learning method for solving combinatorial problems with a predefined goal, represented by Rubik's Cube. We demonstrate that, for such problems, training a deep neural network on random scrambles branching from the goal state is sufficient to achieve near-optimal solutions. When tested on Rubik's Cube, 15 Puzzle, and 7$\times$7 Lights Out, our method outperformed the previous state-of-the-art method DeepCubeA, improving the trade-off between solution optimality and computational cost, despite significantly less training data. Furthermore, we investigate the scaling law of our Rubik's Cube solver with respect to model size and training data volume.
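
A toy sketch of the self-supervised data generation described above, with 3x3 Lights Out standing in for Rubik's Cube to keep the snippet short: scrambles branch out from the solved state, and each visited state is labeled with the move just applied (whose inverse points back toward the goal). Training the network and the test-time search are omitted.

import numpy as np

rng = np.random.default_rng(0)
N = 3
GOAL = np.zeros((N, N), dtype=int)            # all lights off

def press(state, move):
    """Toggle the pressed cell and its 4 neighbours (Lights Out move)."""
    r, c = divmod(move, N)
    out = state.copy()
    for dr, dc in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
        rr, cc = r + dr, c + dc
        if 0 <= rr < N and 0 <= cc < N:
            out[rr, cc] ^= 1
    return out

def random_scramble(max_len=8):
    """Yield (state, last_move) pairs along a random walk starting at the goal."""
    state = GOAL
    for _ in range(rng.integers(1, max_len + 1)):
        move = int(rng.integers(N * N))
        state = press(state, move)
        yield state.flatten().copy(), move    # supervision comes for free from the scramble

dataset = [pair for _ in range(1000) for pair in random_scramble()]
X = np.stack([s for s, _ in dataset])
y = np.array([m for _, m in dataset])
print("training pairs:", X.shape, "labels:", y.shape)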

URL: https://openreview.net/forum?id=bnBeNFB27b

---

Title: Towards Better Generalization with Flexible Representation of Multi-Module Graph Neural Networks

Authors: HyunGeun Lee, Kijung Yoon

Abstract: Graph neural networks (GNNs) have become compelling models designed to perform learning and inference on graph-structured data. However, little work has been done to understand the fundamental limitations of GNNs for scaling to larger graphs and generalizing to out-of-distribution (OOD) inputs. In this paper, we use a random graph generator to systematically investigate how the graph size and structural properties affect the predictive performance of GNNs. We present specific evidence that the average node degree is a key feature in determining whether GNNs can generalize to unseen graphs, and that the use of multiple node update functions can improve the generalization performance of GNNs when dealing with graphs of multimodal degree distributions. Accordingly, we propose a multi-module GNN framework that allows the network to adapt flexibly to new graphs by generalizing a single canonical nonlinear transformation over aggregated inputs. Our results show that the multi-module GNNs improve the OOD generalization on a variety of inference tasks in the direction of diverse structural features.

URL: https://openreview.net/forum?id=EYjfLeJL4l

---

Title: Assisting Human Decisions in Document Matching

Authors: Joon Sik Kim, Valerie Chen, Danish Pruthi, Nihar B Shah, Ameet Talwalkar

Abstract: Many practical applications, ranging from paper-reviewer assignment in peer review to job-applicant matching for hiring, require human decision makers to identify relevant matches by combining their expertise with predictions from machine learning models. In many such model-assisted document matching tasks, the decision makers have stressed the need for assistive information about the model outputs (or the data) to facilitate their decisions. In this paper, we devise a proxy matching task that allows us to evaluate which kinds of assistive information improve decision makers’ performance (in terms of accuracy and time). Through a crowdsourced (N = 271 participants) study, we find that providing black-box model explanations reduces users’ accuracy on the matching task, contrary to the commonly-held belief that they can be helpful by allowing better understanding of the model. On the other hand, custom methods that are designed to closely attend to some task-specific desiderata are found to be effective in improving user performance. Surprisingly, we also find that the users’ perceived utility of assistive information is misaligned with their objective utility (measured through their task performance).

URL: https://openreview.net/forum?id=5rq8iRzHAQ

---

Title: Bayesian Quadrature for Neural Ensemble Search

Authors: Saad Hamid, Xingchen Wan, Martin Jørgensen, Binxin Ru, Michael A Osborne

Abstract: Ensembling can improve the performance of Neural Networks, but existing approaches struggle when the architecture likelihood surface has dispersed, narrow peaks. Furthermore, existing methods construct equally weighted ensembles, and this is likely to be vulnerable to the failure modes of the weaker architectures. By viewing ensembling as approximately marginalising over architectures we construct ensembles using the tools of Bayesian Quadrature -- tools which are well suited to the exploration of likelihood surfaces with dispersed, narrow peaks. Additionally, the resulting ensembles consist of architectures weighted commensurate with their performance. We show empirically -- in terms of test likelihood, accuracy, and expected calibration error -- that our method outperforms state-of-the-art baselines, and verify via ablation studies that its components do so independently.

URL: https://openreview.net/forum?id=T5sXdAO3EQ

---

Title: JiangJun: Mastering Xiangqi by Tackling Non-Transitivity in Two-Player Zero-Sum Games

Authors: Yang Li, Kun Xiong, Yingping Zhang, Jiangcheng Zhu, Stephen Marcus McAleer, Wei Pan, Jun Wang, Zonghong Dai, Yaodong Yang

Abstract: This paper presents an empirical exploration of non-transitivity in perfect-information games, specifically focusing on Xiangqi, a traditional Chinese board game comparable in game-tree complexity to chess and shogi. By analyzing over 10,000 records of human Xiangqi play, we highlight the existence of both transitive and non-transitive elements within the game’s strategic structure. To address non-transitivity, we introduce the JiangJun algorithm, an innovative combination of Monte-Carlo Tree Search (MCTS) and Policy Space Response Oracles (PSRO) designed to approximate a Nash equilibrium. We evaluate the algorithm empirically using a WeChat mini program and achieve a Master level with a 99.41% win rate against human players. The algorithm’s effectiveness in overcoming non-transitivity is confirmed by a plethora of metrics, such as relative population performance and visualization results. Our project site is available at https://sites.google.com/view/jiangjun-site/.

URL: https://openreview.net/forum?id=MMsyqXIJuk

---

Title: Contrastive Attraction and Contrastive Repulsion for Representation Learning

Authors: Huangjie Zheng, Xu Chen, Jiangchao Yao, Hongxia Yang, Chunyuan Li, Ya Zhang, Hao Zhang, Ivor Tsang, Jingren Zhou, Mingyuan Zhou

Abstract: Contrastive learning (CL) methods effectively learn data representations in a self-supervised manner, where the encoder contrasts each positive sample over multiple negative samples via a one-vs-many softmax cross-entropy loss. By leveraging large amounts of unlabeled image data, recent CL methods have achieved promising results when pretrained on large-scale datasets, such as ImageNet. However, most of them consider augmented views from the same instance to be positive pairs and views from other instances to be negative ones. Such a binary partition insufficiently considers the relation between samples and tends to yield worse performance when generalizing to images in the wild. In this paper, to further improve the performance of CL and enhance its robustness on various datasets, we propose a doubly CL strategy that contrasts positive samples and negative ones within themselves separately. We realize this strategy with contrastive attraction and contrastive repulsion (CACR), which makes the query not only exert a greater force to attract more distant positive samples but also do so to repel closer negative samples. Theoretical analysis reveals that CACR generalizes CL's behavior by positive attraction and negative repulsion. It further considers the intra-contrastive relation within the positive and negative pairs to narrow the gap between the sampled and true distribution, which is important when datasets are less curated. Extensive large-scale experiments on standard vision tasks show that CACR not only consistently outperforms existing CL methods on benchmark datasets, but also shows better robustness when generalizing to imbalanced image datasets.

URL: https://openreview.net/forum?id=f39UIDkwwc

---

Title: Data Distillation: A Survey

Authors: Noveen Sachdeva, Julian McAuley

Abstract: The popularity of deep learning has led to the curation of a vast number of massive and multifarious datasets. Despite having close-to-human performance on individual tasks, training parameter-hungry models on large datasets poses multi-faceted problems such as (a) high model-training time; (b) slow research iteration; and (c) poor eco-sustainability. As an alternative, data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements of the original dataset for scenarios like model training, inference, architecture search, etc. In this survey, we present a formal framework for data distillation, along with providing a detailed taxonomy of existing approaches. Additionally, we cover data distillation approaches for different data modalities, namely images, graphs, and user-item interactions (recommender systems), while also identifying current challenges and future research directions.

URL: https://openreview.net/forum?id=lmXMXP74TO

---

Title: Catastrophic overfitting can be induced with discriminative non-robust features

Authors: Guillermo Ortiz-Jimenez, Pau de Jorge, Amartya Sanyal, Adel Bibi, Puneet K. Dokania, Pascal Frossard, Grégory Rogez, Philip Torr

Abstract: Adversarial training (AT) is the de facto method for building robust neural networks, but it can be computationally expensive. To mitigate this, fast single-step attacks can be used, but this may lead to catastrophic overfitting (CO). This phenomenon appears when networks gain non-trivial robustness during the first stages of AT, but then reach a breaking point where they become vulnerable in just a few iterations. The mechanisms that lead to this failure mode are still poorly understood. In this work, we study the onset of CO in single-step AT methods through controlled modifications of typical datasets of natural images. In particular, we show that CO can be induced at much smaller $\epsilon$ values than previously observed, simply by injecting images with seemingly innocuous features. These features aid non-robust classification but are not enough to achieve robustness on their own. Through extensive experiments we analyze this novel phenomenon and discover that the presence of these easy features induces a learning shortcut that leads to CO. Our findings provide new insights into the mechanisms of CO and improve our understanding of the dynamics of AT.

URL: https://openreview.net/forum?id=10hCbu70Sr

---

Title: Finding and Only Finding Differential Nash Equilibria by Both Pretending to be a Follower

Authors: Xuchan Bao, Guodong Zhang

Abstract: Finding Nash equilibria in two-player differentiable games is a classical problem in game theory with important relevance in machine learning. We propose double Follow-the-Ridge (double-FTR), an algorithm that locally converges to and only to differential Nash equilibria in general-sum two-player differentiable games. To our knowledge, double-FTR is the first algorithm with such guarantees for general-sum games. Furthermore, we show that by varying its preconditioner, double-FTR leads to a broader family of algorithms with the same convergence guarantee. In addition, double-FTR avoids oscillation near equilibria because the eigenvalues of its Jacobian at fixed points are real. Empirically, we validate the double-FTR algorithm on a range of simple zero-sum and general-sum games, as well as simple Generative Adversarial Network (GAN) tasks.

URL: https://openreview.net/forum?id=igdWKxK5RZ

---

Title: Conditional Generative Models are Provably Robust: Pointwise Guarantees for Bayesian Inverse Problems

Authors: Fabian Altekrüger, Paul Hagemann, Gabriele Steidl

Abstract: Conditional generative models have become a powerful tool for sampling from the posteriors of Bayesian inverse problems. It is well known in the classical Bayesian literature that posterior measures are quite robust with respect to perturbations of both the prior measure and the negative log-likelihood, which includes perturbations of the observations. However, to the best of our knowledge, the robustness of conditional generative models with respect to perturbations of the observations has not been investigated yet. In this paper, we prove for the first time that appropriately learned conditional generative models provide robust results for single observations.

URL: https://openreview.net/forum?id=Wcui061fxr

---

Title: A Characteristic Function for Shapley-Value-Based Attribution of Anomaly Scores

Authors: Naoya Takeishi, Yoshinobu Kawahara

Abstract: In anomaly detection, the degree of irregularity is often summarized as a real-valued anomaly score. We address the problem of attributing such anomaly scores to input features for interpreting the results of anomaly detection. We particularly investigate the use of the Shapley value for attributing anomaly scores of semi-supervised detection methods. We propose a characteristic function specifically designed for attributing anomaly scores. The idea is to approximate the absence of some features by locally minimizing the anomaly score with regard to the to-be-absent features. We examine the applicability of the proposed characteristic function and other general approaches for interpreting anomaly scores on multiple datasets and multiple anomaly detection methods. The results indicate the potential utility of the attribution methods including the proposed one.

URL: https://openreview.net/forum?id=eLX5XrajXh

---

Title: The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science

Authors: Santiago Miret, Kin Long Kelvin Lee, Carmelo Gonzales, Marcel Nassar, Matthew Spellings

Abstract: We present the Open MatSci ML Toolkit: a flexible, self-contained, and scalable Python-based framework to apply deep learning models and methods on scientific data with a specific focus on materials science and the OpenCatalyst Dataset. Our toolkit provides: 1. A scalable machine learning workflow for materials science leveraging PyTorch Lightning, which enables seamless scaling across different computation capabilities (laptop, server, cluster) and hardware platforms (CPU, GPU, XPU). 2. Deep Graph Library (DGL) support for rapid graph neural network prototyping and development. By publishing and sharing this toolkit with the research community via open-source release, we hope to: 1. Lower the entry barrier for new machine learning researchers and practitioners who want to get started with the OpenCatalyst dataset, which is presently the largest computational materials science dataset. 2. Enable the scientific community to apply advanced machine learning tools to high-impact scientific challenges, such as modeling of materials behavior for clean energy applications. We demonstrate the capabilities of our framework by enabling three new equivariant neural network models for multiple OpenCatalyst tasks and arrive at promising results for compute scaling and model performance. The code for the framework and the experiments presented in this paper is publicly available at https://github.com/IntelLabs/matsciml.

URL: https://openreview.net/forum?id=QBMyDZsPMd

---

Title: Differentiable Logic Machines

Authors: Matthieu Zimmer, Xuening Feng, Claire Glanois, Zhaohui JIANG, Jianyi Zhang, Paul Weng, Dong Li, Jianye HAO, Wulong Liu

Abstract: The integration of reasoning, learning, and decision-making is key to building more general artificial intelligence systems. As a step in this direction, we propose a novel neural-logic architecture, called differentiable logic machine (DLM), that can solve both inductive logic programming (ILP) and reinforcement learning (RL) problems, where the solution can be interpreted as a first-order logic program. Our proposal includes several innovations. Firstly, our architecture defines a restricted but expressive continuous relaxation of the space of first-order logic programs by assigning weights to predicates instead of rules, in contrast to most previous neural-logic approaches. Secondly, with this differentiable architecture, we propose several (supervised and RL) training procedures, based on gradient descent, which can recover a fully-interpretable solution (i.e., logic formula). Thirdly, to accelerate RL training, we also design a novel critic architecture that enables actor-critic algorithms. Fourthly, to solve hard problems, we propose an incremental training procedure that can learn a logic program progressively. Compared to state-of-the-art (SOTA) differentiable ILP methods, DLM successfully solves all the considered ILP problems with a higher percentage of successful seeds (up to 3.5x). On RL problems, without requiring an interpretable solution, DLM outperforms other non-interpretable neural-logic RL approaches in terms of rewards (up to 3.9%). When enforcing interpretability, DLM can solve harder RL problems (e.g., Sorting, Path) than other interpretable RL methods. Moreover, we show that deep logic programs can be learned via incremental supervised training. In addition to this excellent performance, DLM can scale well in terms of memory and computational time, especially during the testing phase, where it can deal with many more constants (>2x) than SOTA methods.

URL: https://openreview.net/forum?id=mXfkKtu5JA

---

Title: Vulnerability-Aware Instance Reweighting For Adversarial Training

Authors: Olukorede Fakorede, Ashutosh Kumar Nirala, Modeste Atsague, Jin Tian

Abstract: Adversarial Training (AT) has been found to substantially improve the robustness of deep learning classifiers against adversarial attacks. AT involves obtaining robustness by including adversarial examples in training a classifier. Most variants of AT algorithms treat every training example equally. However, recent works have shown that better performance is achievable by treating them unequally. In addition, it has been observed that AT exerts an uneven influence on different classes in a training set and unfairly hurts examples corresponding to classes that are inherently harder to classify. Consequently, various reweighting schemes have been proposed that assign unequal weights to robust losses of individual examples in a training set. In this work, we propose a novel instance-wise reweighting scheme. It considers the vulnerability of each natural example and the resulting information loss on its adversarial counterpart occasioned by adversarial attacks. Through extensive experiments, we show that our proposed method significantly improves over existing reweighting schemes, especially against strong white and black-box attacks.

URL: https://openreview.net/forum?id=kdPcLdJbt1

---

Title: Lifelong Reinforcement Learning with Modulating Masks

Authors: Eseoghene Ben-Iwhiwhu, Saptarshi Nath, Praveen Kumar Pilly, Soheil Kolouri, Andrea Soltoggio

Abstract: Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have encountered problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple supervised classification tasks that involve changes in the input distribution, lifelong reinforcement learning (LRL) must deal with variations in the state and transition distributions, and in the reward functions. Modulating masks with a fixed backbone network, recently developed for classification, are particularly suitable for dealing with such a large spectrum of task variations. In this paper, we adapt modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigate the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, but the algorithm also solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.

URL: https://openreview.net/forum?id=V7tahqGrOq

---

Title: Semantic Self-adaptation: Enhancing Generalization with a Single Sample

Authors: Sherwin Bahmani, Oliver Hahn, Eduard Zamfir, Nikita Araslanov, Daniel Cremers, Stefan Roth

Abstract: The lack of out-of-domain generalization is a critical weakness of deep networks for semantic segmentation. Previous studies relied on the assumption of a static model, i.e., once the training process is complete, model parameters remain fixed at test time. In this work, we challenge this premise with a self-adaptive approach for semantic segmentation that adjusts the inference process to each input sample. Self-adaptation operates on two levels. First, it fine-tunes the parameters of convolutional layers to the input image using consistency regularization. Second, in Batch Normalization layers, self-adaptation interpolates between the training and the reference distribution derived from a single test sample. Despite both techniques being well known in the literature, their combination sets new state-of-the-art accuracy on synthetic-to-real generalization benchmarks. Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time for improving deep network generalization to out-of-domain data. Our code and pre-trained models are available at https://github.com/visinf/self-adaptive.
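
A minimal PyTorch sketch of the Batch Normalization part of self-adaptation (the consistency-based fine-tuning of convolutional layers is not shown, and the interpolation below is a guess at one reasonable instantiation rather than the paper's exact rule): each BN layer's statistics are replaced by a blend of the stored training statistics and the statistics of the single test sample.

import torch
import torch.nn as nn

class InterpolatedBN2d(nn.Module):
    def __init__(self, bn: nn.BatchNorm2d, alpha: float = 0.8):
        super().__init__()
        self.bn, self.alpha = bn, alpha

    def forward(self, x):
        sample_mean = x.mean(dim=(0, 2, 3))
        sample_var = x.var(dim=(0, 2, 3), unbiased=False)
        mean = self.alpha * self.bn.running_mean + (1 - self.alpha) * sample_mean
        var = self.alpha * self.bn.running_var + (1 - self.alpha) * sample_var
        x = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + self.bn.eps)
        return x * self.bn.weight[None, :, None, None] + self.bn.bias[None, :, None, None]

def adapt_batchnorm(model: nn.Module, alpha: float = 0.8):
    """Recursively wrap every BatchNorm2d layer with the interpolated version."""
    for name, child in model.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(model, name, InterpolatedBN2d(child, alpha))
        else:
            adapt_batchnorm(child, alpha)
    return model

# Toy usage on a small segmentation-style head.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(), nn.Conv2d(8, 2, 1))
net.eval()
adapt_batchnorm(net, alpha=0.8)
out = net(torch.randn(1, 3, 64, 64))       # single test sample
print(out.shape)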

URL: https://openreview.net/forum?id=ILNqQhGbLx

---

Title: Fair Kernel Regression through Cross-Covariance Operators

Authors: Adrian Perez-Suay, Paula Gordaliza, Jean-Michel Loubes, Dino Sejdinovic, Gustau Camps-Valls

Abstract: Ensuring fairness in machine learning models is a difficult problem from both a formulation and implementation perspective. One sensible criterion for achieving fairness is Equalised Odds, which requires that subjects in protected and unprotected groups have equal true and false positive rates. However, practical implementation is challenging. This work proposes two ways to address this issue through the conditional independence operator. First, given the output values, it is used as a fairness measure of independence between model predictions and sensitive variables. Second, it is used as a regularisation term in the problem formulation, which seeks optimal models that balance performance and fairness concerning the sensitive variables. To illustrate the potential of our approach, we consider different scenarios. First, we use the Gaussian model to provide new insights into the problem formulation and numerical results on its convergence. Second, we present the formulation using the conditional cross-covariance operator. We anticipate that a closed-form solution is possible in the general problem formulation, including in the case of a kernel formulation setting. Third, we introduce a normalised criterion of the conditional independence operator. All formulations are posed under the risk minimisation principle, which leads to theoretical results on the performance.
Additionally, insights are provided into using these operators under a Gaussian Process setting. Our methods are compared to state-of-the-art methods in terms of performance and fairness metrics on a representative set of real problems. The results obtained with our proposed methodology show promising performance-fairness curves. Furthermore, we discuss the usefulness of linear weights in the fair model to describe the behaviour of the features when enforcing fairness over a particular set of input features.

URL: https://openreview.net/forum?id=MyQ1e1VQQ3

---

Title: The Score-Difference Flow for Implicit Generative Modeling

Authors: Romann M. Weber

Abstract: Implicit generative modeling (IGM) aims to produce samples of synthetic data matching the characteristics of a target data distribution. Recent work (e.g. score-matching networks, diffusion models) has approached the IGM problem from the perspective of pushing synthetic source data toward the target distribution via dynamical perturbations or flows in the ambient space. In this direction, we present the score difference (SD) between arbitrary target and source distributions as a flow that optimally reduces the Kullback-Leibler divergence between them while also solving the Schrödinger bridge problem. We apply the SD flow to convenient proxy distributions, which are aligned if and only if the original distributions are aligned. We demonstrate the formal equivalence of this formulation to denoising diffusion models under certain conditions. We also show that the training of generative adversarial networks includes a hidden data-optimization sub-problem, which induces the SD flow under certain choices of loss function when the discriminator is optimal. As a result, the SD flow provides a theoretical link between model classes that individually address the three challenges of the "generative modeling trilemma"—high sample quality, mode coverage, and fast sampling—thereby setting the stage for a unified approach.
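
A one-dimensional illustration of the score-difference flow with known scores (the paper works with proxy distributions and learned models; this only shows the dynamics): particles from the source Gaussian follow dx/dt = s_target(x) - s_source(x), where the evolving source score is approximated by a Gaussian fit to the current particles, and drift toward the target.

import numpy as np

def gaussian_score(x, mu, sigma):
    return -(x - mu) / sigma**2          # d/dx log N(x; mu, sigma^2)

rng = np.random.default_rng(0)
x = rng.normal(loc=-2.0, scale=1.0, size=5000)       # source samples
mu_t, sigma_t = 3.0, 0.5                             # target N(3, 0.25)

dt, steps = 0.01, 2000
for _ in range(steps):
    # the "source" score is tracked by a Gaussian fit to the current particle distribution
    mu_s, sigma_s = x.mean(), x.std()
    x = x + dt * (gaussian_score(x, mu_t, sigma_t) - gaussian_score(x, mu_s, sigma_s))

print("final mean/std:", round(float(x.mean()), 2), round(float(x.std()), 2), "(target: 3.0 / 0.5)")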

URL: https://openreview.net/forum?id=dpGSNLUCzu

---

Title: POMRL: No-Regret Learning-to-Plan with Increasing Horizons

Authors: Khimya Khetarpal, Claire Vernade, Brendan O'Donoghue, Satinder Singh, Tom Zahavy

Abstract: We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying relatedness across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within each task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically.

URL: https://openreview.net/forum?id=brGgOAXYtr

---

Title: Off-Policy Evaluation with Out-of-Sample Guarantees

Authors: Sofia Ek, Dave Zachariah, Fredrik D. Johansson, Peter Stoica

Abstract: We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss (aka. disutility or negative reward) and the main problem is making valid inferences about its out-of-sample loss when the past data was observed under a different and possibly unknown policy. Using a sample-splitting method, we show that it is possible to draw such inferences with finite-sample coverage guarantees about the entire loss distribution, rather than just its mean. Importantly, the method takes into account model misspecifications of the past policy - including unmeasured confounding. The evaluation method can be used to certify the performance of a policy using observational data under a specified range of credible model assumptions.

URL: https://openreview.net/forum?id=XnYtGPgG9p

---

Title: A Unified Perspective on Natural Gradient Variational Inference with Gaussian Mixture Models

Authors: Oleg Arenz, Philipp Dahlinger, Zihan Ye, Michael Volpp, Gerhard Neumann

Abstract: Variational inference with Gaussian mixture models (GMMs) enables learning of highly tractable yet multi-modal approximations of intractable target distributions with up to a few hundred dimensions. The two currently most effective methods for GMM-based variational inference, VIPS and iBayes-GMM, both employ independent natural gradient updates for the individual components and their weights. We show for the first time, that their derived updates are equivalent, although their practical implementations and theoretical guarantees differ. We identify several design choices that distinguish both approaches, namely with respect to sample selection, natural gradient estimation, stepsize adaptation, and whether trust regions are enforced or the number of components adapted. We argue that for both approaches, the quality of the learned approximations can heavily suffer from the respective design choices: By updating the individual components using samples from the mixture model, iBayes-GMM often fails to produce meaningful updates to low-weight components, and by using a zero-order method for estimating the natural gradient, VIPS scales badly to higher-dimensional problems. Furthermore, we show that information-geometric trust-regions (used by VIPS) are effective even when using first-order natural gradient estimates, and often outperform the improved Bayesian learning rule (iBLR) update used by iBayes-GMM. We systematically evaluate the effects of design choices and show that a hybrid approach significantly outperforms both prior works. Along with this work, we publish our highly modular and efficient implementation for natural gradient variational inference with Gaussian mixture models, which supports $432$ different combinations of design choices, facilitates the reproduction of all our experiments, and may prove valuable for the practitioner.

URL: https://openreview.net/forum?id=tLBjsX4tjs

---

Title: On the Gradient Formula for learning Generative Models with Regularized Optimal Transport Costs

Authors: Antoine Houdard, Arthur Leclaire, Nicolas Papadakis, Julien Rabin

Abstract: Learning a Wasserstein Generative Adversarial Network (WGAN) requires the differentiation of the optimal transport cost with respect to the parameters of the generative model. In this work, we provide sufficient conditions for the existence of a gradient formula in two different frameworks: the case of semi-discrete optimal transport (i.e. with a discrete target distribution) and the case of regularized optimal transport (i.e. with an entropic penalty). In both cases the gradient formula involves a solution of the semi-dual formulation of the optimal transport cost. Our study makes a connection between the gradient of the WGAN loss function and the Laguerre diagrams associated with semi-discrete transport maps. The learning problem is addressed with an alternating algorithm, which is in general not convergent. However, in most cases, it stabilizes close to a relevant solution for the generative learning problem. We also show that entropic regularization can improve the convergence speed but noticeably changes the shape of the learned generative model.

URL: https://openreview.net/forum?id=FbztvhdCX9

---

Title: Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

Authors: Jie Zhu, Jiyang Qi, Mingyu Ding, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang

Abstract: In this paper, we are interested in understanding self-supervised pretraining by studying the extent to which self-supervised methods learn part-aware representations. The study is mainly motivated by the observation that random views, used in contrastive learning, and randomly masked (visible) patches, used in masked image modeling, are often about object parts.

We explain that contrastive learning is a part-to-whole task: the projection layer hallucinates the whole object representation from the object part representation learned from the encoder, and that masked image modeling is a part-to-part task: the masked patches of the object are hallucinated from the visible patches. The explanation suggests that the self-supervised pretrained encoder leans toward understanding the object part. We empirically compare the off-the-shelf encoders pretrained with several representative methods on object-level recognition and part-level recognition. The results show that the fully-supervised model outperforms self-supervised models for object-level recognition, and most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method for part-level recognition. It is observed that the combination of contrastive learning and masked image modeling further improves the performance.

URL: https://openreview.net/forum?id=HP7Qpui5YE

---

Title: DSpar: An Embarrassingly Simple Strategy for Efficient GNN training and inference via Degree-based Sparsification

Authors: Zirui Liu, Kaixiong Zhou, Zhimeng Jiang, Li Li, Rui Chen, Soo-Hyun Choi, Xia Hu

Abstract: Running Graph Neural Networks (GNNs) on large graphs is notoriously inefficient. This is attributed to sparse graph-based operations, which are hard to accelerate on commodity hardware, e.g., GPUs and CPUs. One potential solution is to ``sketch'' the original graph by removing unimportant edges, so that both training and inference are executed on the sparsified graph with improved efficiency. Traditional graph sparsification work calculates the edge importance score, i.e., effective resistance, from graph topology with theoretical guarantees. However, estimating effective resistance is even more expensive than training the GNN itself. More recently, learning-based sparsification methods learn the edge importance from data, but with significant overhead due to the extra learning process. Thus, both approaches introduce significant ahead-of-training overhead. In this paper, we experimentally and theoretically show that effective resistance can be approximated using only node degree information, yielding similar node representations on graphs with and without sparsification. Based on this finding, we propose DSpar, which sparsifies the graph once before training based only on node degree information, with negligible ahead-of-training overhead. In practice, for the training phase, DSpar achieves up to $5.9\times$ speedup over the baseline with almost no accuracy drop. For the inference phase, DSpar reduces latency by up to $90\%$.
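
A sketch of degree-based sparsification in the spirit of DSpar (the paper's exact sampling probabilities and reweighting may differ): each edge (u, v) is scored by 1/deg(u) + 1/deg(v), a cheap proxy for effective resistance, and edges are kept with probability proportional to that score.

import numpy as np

def dspar_sparsify(edges, num_nodes, keep_ratio=0.5, seed=0):
    rng = np.random.default_rng(seed)
    deg = np.bincount(edges.ravel(), minlength=num_nodes)
    score = 1.0 / deg[edges[:, 0]] + 1.0 / deg[edges[:, 1]]   # degree-based proxy score
    prob = np.minimum(1.0, keep_ratio * len(edges) * score / score.sum())
    kept = rng.random(len(edges)) < prob
    weights = 1.0 / prob[kept]                                # reweight kept edges (unbiased in expectation)
    return edges[kept], weights

# Toy usage on a random graph given as an (E, 2) edge list.
rng = np.random.default_rng(0)
edges = rng.integers(0, 100, size=(1000, 2))
sparse_edges, w = dspar_sparsify(edges, num_nodes=100, keep_ratio=0.3)
print(f"kept {len(sparse_edges)} of {len(edges)} edges")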

URL: https://openreview.net/forum?id=SaVEXFuozg

---

Title: Efficient Reward Poisoning Attacks on Online Deep Reinforcement Learning

Authors: Yinglun Xu, Qi Zeng, Gagandeep Singh

Abstract: We study reward poisoning attacks on online deep reinforcement learning (DRL), where the attacker is oblivious to the learning algorithm used by the agent and the dynamics of the environment. We demonstrate the intrinsic vulnerability of state-of-the-art DRL algorithms by designing a general, black-box reward poisoning framework called adversarial MDP attacks. We instantiate our framework to construct two new attacks which only corrupt the rewards for a small fraction of the total training timesteps and make the agent learn a low-performing policy. We provide a theoretical analysis of the efficiency of our attack and perform an extensive empirical evaluation. Our results show that our attacks efficiently poison agents learning in several popular classical control and MuJoCo environments with a variety of state-of-the-art DRL algorithms, such as DQN, PPO, SAC, etc.

URL: https://openreview.net/forum?id=25G63lDHV2

---

Title: Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations

Authors: Cong Lu, Philip J. Ball, Tim G. J. Rudner, Jack Parker-Holder, Michael A Osborne, Yee Whye Teh

Abstract: Offline reinforcement learning has shown great promise in leveraging large pre-collected datasets for policy learning, allowing agents to forgo often-expensive online data collection. However, offline reinforcement learning from visual observations with continuous action spaces remains under-explored, with a limited understanding of the key challenges in this complex domain. In this paper, we establish simple baselines for continuous control in the visual domain and introduce a suite of benchmarking tasks for offline reinforcement learning from visual observations designed to better represent the data distributions present in real-world offline RL problems and guided by a set of desiderata for offline RL from visual observations, including robustness to visual distractions and visually identifiable changes in dynamics. Using this suite of benchmarking tasks, we show that simple modifications to two popular vision-based online reinforcement learning algorithms, DreamerV2 and DrQ-v2, suffice to outperform existing offline RL methods and establish competitive baselines for continuous control in the visual domain. We rigorously evaluate these algorithms and perform an empirical evaluation of the differences between state-of-the-art model-based and model-free offline RL methods for continuous control from visual observations. All code and data used in this evaluation are open-sourced to facilitate progress in this domain.

URL: https://openreview.net/forum?id=1QqIfGZOWu

---

Title: Towards a More Rigorous Science of Blindspot Discovery in Image Classification Models

Authors: Gregory Plumb, Nari Johnson, Angel Cabrera, Ameet Talwalkar

Abstract: A growing body of work studies Blindspot Discovery Methods ("BDM"s): methods that use an image embedding to find semantically meaningful (i.e., united by a human-understandable concept) subsets of the data where an image classifier performs significantly worse. Motivated by observed gaps in prior work, we introduce a new framework for evaluating BDMs, SpotCheck, that uses synthetic image datasets to train models with known blindspots and a new BDM, PlaneSpot, that uses a 2D image representation. We use SpotCheck to run controlled experiments that identify factors that influence BDM performance (e.g., the number of blindspots in a model, or features used to define the blindspot) and show that PlaneSpot is competitive with and in many cases outperforms existing BDMs. Importantly, we validate these findings by designing additional experiments that use real image data from MS-COCO, a large image benchmark dataset. Our findings suggest several promising directions for future work on BDM design and evaluation. Overall, we hope that the methodology and analyses presented in this work will help facilitate a more rigorous science of blindspot discovery.

URL: https://openreview.net/forum?id=MaDvbLaBiF

---

Title: Adjusting Machine Learning Decisions for Equal Opportunity and Counterfactual Fairness

Authors: Yixin Wang, Dhanya Sridhar, David Blei

Abstract: Machine learning (ML) methods have the potential to automate high-stakes decisions, such as bail admissions or credit lending, by analyzing and learning from historical data. But these algorithmic decisions may be unfair: in learning from historical data, they may replicate discriminatory practices from the past. In this paper, we propose two algorithms that adjust fitted ML predictors to produce decisions that are fair. Our methods provide post-hoc adjustments to the predictors, without requiring that they be retrained. We consider a causal model of the ML decisions, define fairness through counterfactual decisions within the model, and then form algorithmic decisions that capture the historical data as well as possible but are provably fair. In particular, we consider two definitions of fairness. The first is ``equal counterfactual opportunity,'' where the counterfactual distribution of the decision is the same regardless of the protected attribute; the second is counterfactual fairness. We evaluate the algorithms, and the trade-off between accuracy and fairness, on datasets about admissions, income, credit, and recidivism.

URL: https://openreview.net/forum?id=P6NcRPb13w

---

Title: On the Predictive Accuracy of Neural Temporal Point Process Models for Continuous-time Event Data

Authors: Tanguy Bosser, Souhaib Ben Taieb

Abstract: Temporal Point Processes (TPPs) serve as the standard mathematical framework for modeling asynchronous event sequences in continuous time. However, classical TPP models are often constrained by strong assumptions, limiting their ability to capture complex real-world event dynamics. To overcome this limitation, researchers have proposed Neural TPPs, which leverage neural network parametrizations to offer more flexible and efficient modeling. While recent studies demonstrate the effectiveness of Neural TPPs, they often lack a unified setup, relying on different baselines, datasets, and experimental configurations. This makes it challenging to identify the key factors driving improvements in predictive accuracy, hindering research progress. To bridge this gap, we present a comprehensive large-scale experimental study that systematically evaluates the predictive accuracy of state-of-the-art neural TPP models. Our study encompasses multiple real-world and synthetic event sequence datasets, following a carefully designed unified setup. We thoroughly investigate the influence of major architectural components such as event encoding, history encoder, and decoder parametrization on both time and mark prediction tasks. Additionally, we delve into the less explored area of probabilistic calibration for neural TPP models. By analyzing our results, we draw insightful conclusions regarding the significance of history size and the impact of architectural components on predictive accuracy. Furthermore, we shed light on the miscalibration of mark distributions in neural TPP models. Our study aims to provide valuable insights into the performance and characteristics of neural TPP models, contributing to a better understanding of their strengths and limitations.

URL: https://openreview.net/forum?id=3OSISBQPrM
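
For readers outside the TPP literature, the quantity all of these models are trained to maximize is the marked event-sequence log-likelihood, which in its standard form reads

    \log p\big(\{(t_i, k_i)\}_{i=1}^{n}\big) = \sum_{i=1}^{n} \Big[\log \lambda^*(t_i) + \log p^*(k_i \mid t_i)\Big] - \int_{0}^{T} \lambda^*(t)\,\mathrm{d}t,

where $\lambda^*(t)$ is the conditional (ground) intensity and $p^*(k \mid t)$ the conditional mark distribution; the architectural components discussed above (event encoding, history encoder, decoder) differ mainly in how they parameterize these two quantities.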

---

Title: Mind the Gap: Mitigating the Distribution Gap in Graph Few-shot Learning

Authors: Chunhui Zhang, Hongfu Liu, Jundong Li, Yanfang Ye, Chuxu Zhang

Abstract: Prevailing supervised deep graph learning models often suffer from the issue of label scarcity, leading to performance degradation in the face of limited annotated data. Although numerous graph few-shot learning (GFL) methods have been developed to mitigate this problem, they tend to rely excessively on labeled data. This over-reliance on labeled data can result in impaired generalization ability in the test phase due to the existence of a distribution gap. Moreover, existing GFL methods lack generality, as their designs are coupled with task- or data-specific characteristics. To address these shortcomings, we propose a novel Self-Distilled Graph Few-shot Learning framework (SDGFL) that is both general and effective. SDGFL leverages a self-distilled contrastive learning procedure to boost GFL. Specifically, our model first pre-trains a graph encoder with contrastive learning using unlabeled data. Later, the trained encoder is frozen as a teacher model to distill a student model with a contrastive loss. The distilled model is then fed to GFL. By learning data representation in a self-supervised manner, SDGFL effectively mitigates the distribution gap and enhances generalization ability. Furthermore, our proposed framework is task- and data-independent, making it a versatile tool for general graph mining purposes. To evaluate the effectiveness of our proposed framework, we introduce an information-based measurement that quantifies its capability. Through comprehensive experiments, we demonstrate that SDGFL outperforms state-of-the-art baselines on various graph mining tasks across multiple datasets in the few-shot scenario. We also provide a quantitative measurement of SDGFL’s superior performance in comparison to existing methods.

URL: https://openreview.net/forum?id=LEVbhNrLEL

---

Title: Augmented Language Models: a Survey

Authors: Grégoire Mialon, Roberto Dessi, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Roziere, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, Thomas Scialom

Abstract: This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks, while the latter consists of calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demonstrations. While adhering to a standard missing-token prediction objective, such augmented LMs can use various, possibly non-parametric external modules to expand their context processing ability, thus departing from the pure language modeling paradigm. We therefore refer to them as Augmented Language Models (ALMs). The missing-token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks and even outperforming most regular LMs on several benchmarks. In this work, after reviewing current advances in ALMs, we conclude that this new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues.

URL: https://openreview.net/forum?id=jh7wH2AzKK

---

Title: Black-Box Batch Active Learning for Regression

Authors: Andreas Kirsch

Abstract: Batch active learning is a popular approach for efficiently training machine learning models on large, initially unlabelled datasets by repeatedly acquiring labels for batches of data points. However, many recent batch active learning methods are white-box approaches and are often limited to differentiable parametric models: they score unlabeled points using acquisition functions based on model embeddings or first- and second-order derivatives. In this paper, we propose black-box batch active learning for regression tasks as an extension of white-box approaches. Crucially, our method only relies on model predictions. This approach is compatible with a wide range of machine learning models, including regular and Bayesian deep learning models and non-differentiable models such as random forests. It is rooted in Bayesian principles and utilizes recent kernel-based approaches. This allows us to extend a wide range of existing state-of-the-art white-box batch active learning methods (BADGE, BAIT, LCMD) to black-box models. We demonstrate the effectiveness of our approach through extensive experimental evaluations on regression datasets, achieving surprisingly strong performance compared to white-box approaches for deep learning models.

URL: https://openreview.net/forum?id=fvEvDlKko6
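
A rough sketch of the black-box idea (hypothetical helper names; the paper's actual estimators and the BADGE/BAIT/LCMD extensions are more refined): build a kernel over the unlabeled pool purely from model predictions, for example across an ensemble, then pick a diverse batch greedily.

    # Prediction-based (black-box) batch selection sketch.
    # preds: (n_models, n_pool) predictions of an ensemble on the unlabeled pool.
    import numpy as np

    def prediction_kernel(preds):
        """Empirical output-covariance kernel built only from model predictions."""
        centered = preds - preds.mean(axis=0, keepdims=True)
        return centered.T @ centered / preds.shape[0]        # (n_pool, n_pool)

    def greedy_batch(K, batch_size, jitter=1e-6):
        """Greedily maximise the log-det of the selected kernel submatrix
        (a simple diversity heuristic standing in for the paper's methods)."""
        selected = []
        for _ in range(batch_size):
            best_i, best_gain = None, -np.inf
            for i in range(K.shape[0]):
                if i in selected:
                    continue
                idx = selected + [i]
                _, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)] + jitter * np.eye(len(idx)))
                if logdet > best_gain:
                    best_i, best_gain = i, logdet
            selected.append(best_i)
        return selected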

---

Title: Consistent Collaborative Filtering via Tensor Decomposition

Authors: Shiwen Zhao, Guillermo Sapiro

Abstract: Collaborative filtering is the de facto standard for analyzing users’ activities and building recommendation systems for items. In this work we develop Sliced Anti-symmetric Decomposition (SAD), a new model for collaborative filtering based on implicit feedback. In contrast to traditional techniques where latent representations of users (user vectors) and items (item vectors) are estimated, SAD introduces one additional latent vector to each item, using a novel three-way tensor view of user-item interactions. This new vector extends user-item preferences calculated by standard dot products to general inner products, producing interactions between items when evaluating their relative preferences. SAD reduces to state-of-the-art (SOTA) collaborative filtering models when the vector collapses to 1, while in this paper we allow its value to be estimated from data. Allowing the values of the new item vector to be different from 1 has profound implications. It suggests users may have nonlinear mental models when evaluating items, allowing the existence of cycles in pairwise comparisons. We demonstrate the efficiency of SAD in both simulated and real world datasets containing over 1M user-item interactions. By comparing with seven SOTA collaborative filtering models with implicit feedback, SAD produces the most consistent personalized preferences while maintaining top-level accuracy in personalized recommendations. We release the model and inference algorithms in a Python library https://github.com/apple/ml-sad.


URL: https://openreview.net/forum?id=HqIuAzBxbh

---

Title: Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Authors: Jun Song, Niao He, Lijun Ding, Chaoyue Zhao

Abstract: Trust-region methods based on Kullback-Leibler divergence are pervasively used to stabilize policy optimization in reinforcement learning. In this paper, we exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein policy optimization (WPO) and Sinkhorn policy optimization (SPO). Instead of restricting the policy to a parametric distribution class, we directly optimize the policy distribution and derive their closed-form policy updates based on Lagrangian duality. Theoretically, we show that WPO guarantees a monotonic performance improvement, and SPO provably converges to WPO as the entropic regularizer diminishes. Moreover, we prove that with a decaying Lagrangian multiplier to the trust region constraint, both methods converge to global optimality. Experiments across tabular domains, robotic locomotion, and continuous control tasks further demonstrate the performance improvement of both approaches over state-of-the-art policy gradient methods, the greater robustness of WPO to sample insufficiency, and the faster convergence of SPO.

URL: https://openreview.net/forum?id=jkTqJJOGMS
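
The Sinkhorn divergence that defines the SPO trust region comes from entropy-regularized optimal transport; a minimal, generic sketch of the underlying Sinkhorn iterations (not the paper's closed-form policy update) is:

    # Sinkhorn iterations for the entropy-regularized transport plan between
    # discrete distributions a and b with cost matrix C and regularizer eps.
    import numpy as np

    def sinkhorn_plan(a, b, C, eps=0.1, n_iters=200):
        K = np.exp(-C / eps)                   # Gibbs kernel
        u, v = np.ones_like(a), np.ones_like(b)
        for _ in range(n_iters):
            u = a / (K @ v)
            v = b / (K.T @ u)
        return u[:, None] * K * v[None, :]     # plan P = diag(u) K diag(v)

As eps shrinks, the regularized plan approaches the unregularized Wasserstein coupling, mirroring the result above that SPO converges to WPO as the entropic regularizer diminishes.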

---

Title: On Average-Case Error Bounds for Kernel-Based Bayesian Quadrature

Authors: Xu Cai, Thanh Lam, Jonathan Scarlett

Abstract: In this paper, we study error bounds for Bayesian quadrature (BQ), with an emphasis on noisy settings, randomized algorithms, and average-case performance measures. We seek to approximate the integral of functions in a Reproducing Kernel Hilbert Space (RKHS), particularly focusing on the Mat\'ern-$\nu$ and squared exponential (SE) kernels, with samples from the function potentially being corrupted by Gaussian noise. We provide a two-step meta-algorithm that serves as a general tool for relating the average-case quadrature error with the $L^2$-function approximation error. When specialized to the Mat\'ern kernel, we recover an existing near-optimal error rate while avoiding the existing method of repeatedly sampling points. When specialized to other settings, we obtain new average-case results for settings including the SE kernel with noise and the Mat\'ern kernel with misspecification. Finally, we present algorithm-independent lower bounds that have greater generality and/or give distinct proofs compared to existing ones.

URL: https://openreview.net/forum?id=JJrKbq35l4
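
For context, the estimator whose average-case error is being bounded is the standard Bayesian quadrature posterior mean: with noisy observations $y_i = f(x_i) + \xi_i$, kernel $k$, and integration measure $\pi$,

    \hat{Z} = \mathbf{z}^\top (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad z_i = \int k(x, x_i)\,\mathrm{d}\pi(x), \quad K_{ij} = k(x_i, x_j),

and the average-case error measures the expected deviation of $\hat{Z}$ from the true integral $\int f\,\mathrm{d}\pi$.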

---

Title: Self-Supervised Graph Representation Learning for Neuronal Morphologies

Authors: Marissa A. Weis, Laura Pede, Timo Lüddecke, Alexander S Ecker

Abstract: Unsupervised graph representation learning has recently gained interest in several application domains such as neuroscience, where modeling the diverse morphology of cell types in the brain is one of the key challenges. It is currently unknown how many excitatory cortical cell types exist and what their defining morphological features are. Here we present GraphDINO, a purely data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled large-scale datasets. GraphDINO is a novel transformer-based representation learning method for spatially-embedded graphs. To enable self-supervised learning on transformers, we (1) developed data augmentation strategies for spatially-embedded graphs, (2) adapted the positional encoding and (3) introduced a novel attention mechanism, AC-Attention, which combines attention-based global interaction between nodes and classic graph convolutional processing. We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings that are on par with manual feature-based classification by experts, but without using prior knowledge about the structural features of neurons. Moreover, it outperforms previous approaches on quantitative benchmarks predicting expert labels. Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets. It is applicable beyond neuroscience in settings where samples in a dataset are graphs and graph-level embeddings are desired.

URL: https://openreview.net/forum?id=ThhMzfrd6r

---

Title: Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling

Authors: Junhyun Nam, Sangwoo Mo, Jaeho Lee, Jinwoo Shin

Abstract: Trying to capture the sample-label relationship, conditional generative models often end up inheriting the spurious correlation in the training dataset, giving label-conditional distributions that are severely imbalanced in another latent attribute. To mitigate such undesirable correlations engraved into generative models, which we call spurious causality, we propose a general two-step strategy. (a) Fairness Intervention (FI): Emphasize the minority samples that are hard to generate due to the spurious correlation in the training dataset. (b) Corrective Sampling (CS): Filter the generated samples explicitly to follow the desired label-conditional latent attribute distribution. We design the fairness intervention for various degrees of supervision on the spurious attribute, including unsupervised, weakly-supervised, and semi-supervised scenarios. Our experimental results show that the proposed FICS can successfully resolve the spurious correlation in generated samples on various datasets.

URL: https://openreview.net/forum?id=VV4zJwLwI7

---

Title: Stochastic Constrained DRO with a Complexity Independent of Sample Size

Authors: Qi Qi, Jiameng Lyu, Kung-Sik Chan, Er-Wei Bai, Tianbao Yang

Abstract: Distributionally Robust Optimization (DRO), as a popular method to train robust models against distribution shift between training and test sets, has received tremendous attention in recent years. In this paper, we propose and analyze stochastic algorithms that apply to both non-convex and convex losses for solving the Kullback–Leibler divergence constrained DRO problem. Compared with existing methods for solving this problem, our stochastic algorithms not only enjoy a competitive (if not better) complexity that is independent of the sample size, but also require only a constant batch size at every iteration, which is more practical for broad applications. We establish a nearly optimal complexity bound for finding an $\epsilon$-stationary solution for non-convex losses and an optimal complexity for finding an $\epsilon$-optimal solution for convex losses. Empirical studies demonstrate the effectiveness of the proposed algorithms for solving non-convex and convex constrained DRO problems.

URL: https://openreview.net/forum?id=VpaXrBFYZ9
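
For reference, the KL-constrained DRO problem has the standard form (with per-example losses $\ell_i(\mathbf{w})$ on $n$ training points and constraint radius $\rho$):

    \min_{\mathbf{w}} \;\max_{\mathbf{p} \in \Delta_n,\; \mathrm{KL}(\mathbf{p} \,\|\, \mathbf{1}/n) \le \rho} \;\sum_{i=1}^{n} p_i \,\ell_i(\mathbf{w}),

where $\Delta_n$ is the probability simplex; the inner maximization re-weights training examples adversarially, and the proposed stochastic algorithms solve this min-max problem with constant batch sizes.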

---

Title: Neural Networks beyond explainability: Selective inference for sequence motifs

Authors: Antoine Villié, Philippe Veber, Yohann De Castro, Laurent Jacob

Abstract: Over the past decade, neural networks have been successful at making predictions from biological sequences, especially in the context of regulatory genomics. As in other fields of deep learning, tools have been devised to extract features such as sequence motifs that can explain the predictions made by a trained network. Here we intend to go beyond explainable machine learning and introduce SEISM, a selective inference procedure to test the association between these extracted features and the predicted phenotype. In particular, we discuss how training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score. We adapt existing sampling-based selective inference procedures by quantizing this selection over an infinite set to a large but finite grid. Finally, we show that sampling under a specific choice of parameters is sufficient to characterize the composite null hypothesis typically used for selective inference - a result that goes well beyond our particular framework. We illustrate the behavior of our method in terms of calibration, power and speed and discuss its power/speed trade-off with a simpler data-split strategy. SEISM paves the way to an easier analysis of neural networks used in regulatory genomics, and to more powerful methods for genome-wide association studies (GWAS).

URL: https://openreview.net/forum?id=nddEHTSnqg

---

Title: Stochastic gradient updates yield deep equilibrium kernels

Authors: Russell Tsuchida, Cheng Soon Ong

Abstract: Implicit deep learning allows one to compute with implicitly defined features, for example features that solve optimisation problems. We consider the problem of computing with implicitly defined features in a kernel regime. We call such a kernel a deep equilibrium kernel (DEKer). Specialising to a stochastic gradient descent (SGD) update rule applied to features (not weights) in a latent variable model, we find an exact deterministic update rule for the DEKer in a high-dimensional limit. This derived update rule resembles previously introduced infinitely wide neural network kernels. To perform our analysis, we describe an alternative parameterisation of the link function of exponential families, a result that may be of independent interest. This new parameterisation allows us to draw new connections between a statistician's inverse link function and a machine learner's activation function. We describe an interesting property of SGD in this high-dimensional limit: even though individual iterates are random vectors, inner products of any two iterates are deterministic, and can converge to a unique fixed point as the number of iterates increases. We find that the DEKer empirically outperforms related neural network kernels on a series of benchmarks.

URL: https://openreview.net/forum?id=p7UTv2hWgM

---


New submissions
===============


Title: Local Function Complexity for Active Learning via Mixture of Gaussian Processes

Abstract: Inhomogeneities in real-world data, e.g., due to changes in the observation noise level or variations in the structural complexity of the source function, pose a unique set of challenges for statistical inference. Accounting for them can greatly improve predictive power when physical resources or computation time is limited. In this paper, we draw on recent theoretical results on the estimation of local function complexity (LFC), derived from the domain of local polynomial smoothing (LPS), to establish a notion of local structural complexity, which is used to develop a model-agnostic active learning (AL) framework. Due to its reliance on pointwise estimates, the LPS model class is neither robust nor scalable with respect to the large input-space dimensions that typically accompany real-world problems.
Here, we derive and estimate the Gaussian process regression (GPR)-based analog of the LPS-based LFC and use it as a substitute in the above framework to make it robust and scalable. We assess the effectiveness of our LFC estimate in an AL application on a prototypical low-dimensional synthetic dataset, before taking on the challenging real-world task of reconstructing a quantum chemical force field for a small organic molecule and demonstrating state-of-the-art performance with a significantly reduced training demand.

URL: https://openreview.net/forum?id=w4MoQ39zmc

---

Title: Revisiting Discrete Soft Actor-Critic

Abstract: We study the adaptation of soft actor-critic (SAC) from continuous action spaces to discrete action spaces. We revisit vanilla SAC and provide an in-depth understanding of its Q-value underestimation and performance instability issues when applied to discrete settings. We thereby propose entropy-penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on typical benchmarks with discrete action spaces, including Atari games and a large-scale MOBA game, show the efficacy of our proposed method. Our code is at: https://github.com/revisiting-sac/Revisiting-Discrete-SAC.git.

URL: https://openreview.net/forum?id=Wum4m8tMVP
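
For orientation, a minimal sketch of a discrete-action SAC critic target with twin Q-networks and an entropy bonus is shown below; the paper's specific entropy-penalty and double-average-Q-with-Q-clip modifications are described in the abstract above and are not reproduced here.

    # Standard discrete SAC critic target (generic sketch, not the paper's variant).
    # q1_next, q2_next: (batch, n_actions) target-network Q-values at s';
    # log_pi_next: (batch, n_actions) log-probabilities of the current policy at s'.
    import numpy as np

    def discrete_sac_target(rewards, dones, q1_next, q2_next, log_pi_next,
                            alpha=0.2, gamma=0.99):
        pi_next = np.exp(log_pi_next)
        q_min = np.minimum(q1_next, q2_next)                  # clipped double-Q
        # soft value: E_{a ~ pi}[ Q(s', a) - alpha * log pi(a | s') ]
        v_next = np.sum(pi_next * (q_min - alpha * log_pi_next), axis=1)
        return rewards + gamma * (1.0 - dones) * v_next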

---

Title: A Case Study of Low Ranked Self-Expressive Structures in Neural Network Representations

Abstract: Understanding neural networks by studying their underlying geometry can help us design more robust training methodologies and better architectures. In this work we approach this problem through the lens of subspace clustering, where each input in the representation space is represented as a linear combination of some other set of inputs. Such structures are called self-expressive structures, and we analyse their evolution and compare their ability with that of linear classifiers. We also compare the subspace-clustering-based analysis with other analysis tools to demonstrate how they are related but different formulations of one another. Additionally, we monitor how self-expressive structures evolve in networks trained to memorise parts of their training data and how they differ from networks which generalise well. Next, to test the limitations of the proposed subspace clustering approach and of other linear probing methodologies in the literature, we perform a similar set of tests on networks with non-linear activation functions and demonstrate the weakness of linear structures in differentiating between models that generalise and models that memorise. Finally, we analyse the relationship between networks trained with a cross-entropy loss and with a subspace separation loss to better understand how self-expressive structures emerge in neural networks trained simply to classify data.

URL: https://openreview.net/forum?id=8MwtJeZBmz
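
The self-expressive structures analysed above can be made concrete with a standard ridge-regularised formulation (a generic sketch, not the paper's exact analysis pipeline): each representation is reconstructed as a linear combination of the others, and the resulting coefficient matrix is inspected for low-rank or block structure.

    # Self-expressive coefficients: for each column x_i of X, solve
    #   min_c ||x_i - X_{-i} c||^2 + lam * ||c||^2
    # and store c in column i of C (zero diagonal by construction).
    import numpy as np

    def self_expressive_coefficients(X, lam=1e-2):
        """X: (d, n) matrix of n representations; returns an (n, n) coefficient matrix."""
        d, n = X.shape
        C = np.zeros((n, n))
        for i in range(n):
            others = np.delete(np.arange(n), i)
            A = X[:, others]                                   # (d, n-1)
            c = np.linalg.solve(A.T @ A + lam * np.eye(n - 1), A.T @ X[:, i])
            C[others, i] = c
        return C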

---

Title: Controlling Federated Learning for Covertness

Abstract: A learner aims to minimize a function $f$ by repeatedly querying a distributed oracle that provides noisy gradient evaluations. At the same time, the learner seeks to hide $\arg\min f$ from a malicious eavesdropper that observes the learner's queries. This paper considers the problem of \textit{covert} or \textit{learner-private} optimization, where the learner has to dynamically choose between learning and obfuscation by exploiting the stochasticity. The problem of controlling the stochastic gradient algorithm for covert optimization is modeled as a Markov decision process, and we show that the dynamic programming operator has a supermodular structure implying that the optimal policy has a monotone threshold structure. A computationally efficient policy gradient algorithm is proposed to search for the optimal querying policy without knowledge of the transition probabilities. As a practical application, our methods are demonstrated on a hate speech classification task in a federated setting where an eavesdropper can use the optimal weights to generate toxic content, which is more easily misclassified. Numerical results show that when the learner uses the optimal policy, an eavesdropper can only achieve a validation accuracy of $52\%$ with no information and $69\%$ when it has a public dataset with 10\% positive samples compared to $83\%$ when the learner employs a greedy policy.

URL: https://openreview.net/forum?id=g01OVahtN9

---

Title: Bayesian Inference for Sequence Mixture Density Networks using Bézier Curve Gaussian Processes

Abstract: Probabilistic models for sequential data are the basis for a variety of applications concerned with processing timely ordered information. The predominant approach in this domain is given by recurrent neural networks, implementing either an approximate Bayesian approach (e.g. Variational Autoencoders or Generative Adversarial Networks) or a regression-based approach, i.e. variations of Mixture Density networks (MDN). In this paper, we focus on the \emph{$\mathcal{N}$-MDN} variant, which parameterizes (mixtures of) probabilistic B\'ezier curves (\emph{$\mathcal{N}$-Curves}) for modeling stochastic processes. While MDNs are favorable in terms of computational cost and stability, they generally fall behind approximate Bayesian approaches in terms of expressiveness. Towards this end, we present an approach for closing this gap by enabling full Bayesian inference on top of $\mathcal{N}$-MDNs. For this, we show that $\mathcal{N}$-Curves are a special case of non-stationary Gaussian processes (denoted as $\mathcal{N}$-GP) and then derive corresponding mean and kernel functions for different modalities. Following this, we propose the use of the $\mathcal{N}$-MDN as a data-dependent generator for $\mathcal{N}$-GP prior distributions. We show the advantages granted by this combined model in an application context, using human trajectory prediction as an example.

URL: https://openreview.net/forum?id=geKUqg4p9b

---

Title: Variational Causal Dynamics: Discovering Modular World Models from Interventions

Abstract: Latent world models allow agents to reason about complex environments with high-dimensional observations. However, adapting to new environments and effectively leveraging previous knowledge remain significant challenges. We present Variational Causal Dynamics (VCD), a structured world model that exploits the invariance of causal mechanisms across environments to achieve fast and modular adaptation. By causally factorising a transition model, VCD is able to identify reusable components across different environments. This is achieved by combining causal discovery and variational inference to learn a latent representation and transition model jointly in an unsupervised manner. Specifically, we optimise the evidence lower bound jointly over a representation model and a transition model structured as a causal graphical model. In evaluations on simulated environments with state and image observations, we show that VCD is able to successfully identify causal variables, and to discover consistent causal structures across different environments. Moreover, given a small number of observations in a previously unseen, intervened environment, VCD is able to identify the sparse changes in the dynamics and to adapt efficiently. In doing so, VCD significantly extends the capabilities of the current state-of-the-art in latent world models while also comparing favourably in terms of prediction accuracy.

URL: https://openreview.net/forum?id=V9tQKYYNK1

---

Title: Federated Classification in Hyperbolic Spaces via Secure Aggregation of Convex Hulls

Abstract: Hierarchical and tree-like data sets arise in many relevant applications, including language processing, graph data mining, phylogeny and genomics. It is known that tree-like data cannot be embedded into Euclidean spaces of finite dimension with small distortion, and that this problem can be mitigated through the use of hyperbolic spaces. When such data also has to be processed in a distributed and privatized setting, it becomes necessary to work with new federated learning methods tailored to hyperbolic spaces. As an initial step towards the development of the field of federated learning in hyperbolic spaces, we propose the first known approach to federated classification in hyperbolic spaces. Our contributions are as follows. First, we develop distributed versions of convex SVM classifiers for Poincar\'e discs. In this setting, the information conveyed from clients to the global classifier are convex hulls of clusters present in individual client data. Second, to avoid label switching issues, we introduce a number-theoretic approach for label recovery based on the so-called integer $B_h$ sequences. Third, we compute the complexity of the convex hulls in hyperbolic spaces to assess the extent of data leakage; at the same time, in order to limit the communication cost for the hulls, we propose a new quantization method for the Poincar\'e disc coupled with Reed-Solomon-like encoding. Fourth, at the server level, we introduce a new approach for aggregating convex hulls of the clients based on balanced graph partitioning. We test our method on a collection of diverse data sets, including hierarchical single-cell RNA-seq data from different patients distributed across different repositories that have stringent privacy constraints. The classification accuracy of our method is up to $\sim11\%$ better than its Euclidean counterpart, demonstrating the importance of privacy-preserving learning in hyperbolic spaces.

URL: https://openreview.net/forum?id=umggDfMHha
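
For readers unfamiliar with the geometry, the Poincaré-disc distance that underlies such hyperbolic classifiers is the standard formula (independent of the paper's federated machinery):

    # Geodesic distance between points u, v in the Poincare disc (||u||, ||v|| < 1).
    import numpy as np

    def poincare_distance(u, v, eps=1e-9):
        sq = np.sum((u - v) ** 2)
        denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
        return np.arccosh(1.0 + 2.0 * sq / (denom + eps))

Distances blow up near the boundary of the disc, which is what lets tree-like data be embedded with low distortion.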

---

Title: Differentially Private Optimizers Can Learn Adversarially Robust Models

Abstract: Machine learning models have shone in a variety of domains and attracted increasing attention from both the security and the privacy communities. One important yet worrying question is: will training models under the differential privacy (DP) constraint unfavorably impact adversarial robustness? While previous works have postulated that privacy comes at the cost of worse robustness, we give the first theoretical analysis to show that DP models can indeed be robust and accurate, even sometimes more robust than their naturally-trained non-private counterparts. We observe three key factors that influence the privacy-robustness-accuracy tradeoff: (1) hyperparameters for DP optimizers are critical; (2) pre-training on public data significantly mitigates the accuracy and robustness drop; (3) choice of DP optimizers makes a difference. With these factors set properly, we achieve 90\% natural accuracy, 72\% robust accuracy ($+9\%$ over the non-private model) under $l_2(0.5)$ attack, and 69\% robust accuracy ($+16\%$ over the non-private model) with pre-trained SimCLRv2 model under $l_\infty(4/255)$ attack on CIFAR10 with $\epsilon=2$. In fact, we show both theoretically and empirically that DP models are Pareto optimal on the accuracy-robustness tradeoff. Empirically, the robustness of DP models is consistently observed on MNIST, Fashion MNIST and CelebA datasets, with ResNet and Vision Transformer. We believe our encouraging results are a significant step towards training models that are private as well as robust.


URL: https://openreview.net/forum?id=o8VgRNYh6n
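
The DP optimizers referred to above follow the usual DP-SGD recipe of per-example gradient clipping plus Gaussian noise; a minimal sketch of one such step (generic, with hypothetical argument names) is:

    # One DP-SGD step: clip each per-example gradient, average, add Gaussian noise.
    import numpy as np

    def dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0,
                    noise_multiplier=1.0, rng=None):
        """per_example_grads: (batch, dim) array of per-example gradients."""
        rng = np.random.default_rng() if rng is None else rng
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
        noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
        return w - lr * (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]

The clipping norm, noise multiplier, and learning rate are precisely the hyperparameters the abstract identifies as critical to the privacy-robustness-accuracy tradeoff.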

---

Title: Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

Abstract: Stochastic gradient descent (SGD) and adaptive gradient methods, such as Adam and RMSProp, have been widely used in training deep neural networks. We empirically show that while the difference between the standard generalization performance of models trained using these methods is small, those trained using SGD exhibit far greater robustness under input perturbations. Notably, our investigation demonstrates the presence of irrelevant frequencies in natural datasets, where alterations do not affect models' generalization performance. However, models trained with adaptive methods show sensitivity to these changes, suggesting that their use of irrelevant frequencies can lead to solutions sensitive to perturbations. To better understand this difference, we study the learning dynamics of gradient descent (GD) and sign gradient descent (signGD) on a synthetic dataset that mirrors natural signals. With a three-dimensional input space, the models optimized with GD and signGD have standard risks close to zero but vary in their adversarial risks. Our result shows that linear models' robustness to $\ell_2$-norm bounded changes is inversely proportional to the model parameters' weight norm: a smaller weight norm implies better robustness. In the context of deep learning, our experiments show that SGD-trained neural networks show smaller Lipschitz constants, explaining the better robustness to input perturbations than those trained with adaptive gradient methods.

URL: https://openreview.net/forum?id=ed8SkMdYFT

---

Title: Step-size Optimization for Continual Learning

Abstract: In continual learning, a learner has to keep learning from the data over its whole lifetime. A key issue is to decide what knowledge to keep and what knowledge to let go. In a neural network, this can be implemented by using a step-size vector to scale how much gradient samples change network weights. Common algorithms, like RMSProp and Adam, use heuristics, specifically normalization, to adapt this step-size vector. In this paper, we show that those heuristics ignore the effect of their adaptation on the overall objective function, for example by moving the step-size vector away from better step-size vectors. On the other hand, stochastic meta-gradient descent algorithms, like IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to the overall objective function. On simple problems, we show that IDBD is able to consistently improve step-size vectors, where RMSProp and Adam do not. We explain the differences between the two approaches and their respective limitations. We conclude by suggesting that combining both approaches could be a promising future direction to improve the performance of neural networks in continual learning.

URL: https://openreview.net/forum?id=467OqytOnI
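
For concreteness, the IDBD rule cited above (Sutton, 1992) keeps a per-weight log step-size beta and an auxiliary trace h, and adapts them with a single meta step-size theta; one update for linear regression looks like:

    # IDBD (Sutton, 1992): per-weight step-sizes adapted by a meta step-size theta.
    import numpy as np

    def idbd_update(w, beta, h, x, y, theta=0.01):
        delta = y - w @ x                          # prediction error
        beta = beta + theta * delta * x * h        # meta-gradient step on log step-sizes
        alpha = np.exp(beta)                       # per-weight step-sizes
        w = w + alpha * delta * x
        h = h * np.maximum(0.0, 1.0 - alpha * x * x) + alpha * delta * x
        return w, beta, h

Unlike the RMSProp/Adam normalizations discussed above, the beta update explicitly follows the gradient of the overall objective with respect to the step-size vector.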

---

Title: Input Normalized Stochastic Gradient Descent Training for Deep Neural Networks

Abstract: In this paper, we propose a novel optimization algorithm for training machine learning models called Input Normalized Stochastic Gradient Descent (INSGD), inspired by the Normalized Least Mean Squares (NLMS) algorithm used in adaptive filtering. When training complex models on large datasets, the choice of optimizer parameters, particularly the learning rate, is crucial to avoid divergence. Our algorithm updates the network weights using stochastic gradient descent with $\ell_1$ and $\ell_2$-based normalizations applied to the learning rate, similar to NLMS. However, unlike existing normalization methods, we exclude the error term from the normalization process and instead normalize the update term using the input vector to the neuron. Our experiments demonstrate that our optimization algorithm achieves higher accuracy levels compared to different initialization settings. We evaluate the efficiency of our training algorithm on benchmark datasets using ResNet-20, WResNet-18, ResNet-50, and a toy neural network. Our INSGD algorithm improves the accuracy of ResNet-20 on CIFAR-10 from 92.55\% to 92.80\%, the accuracy of MobileNetV3 on CIFAR-10 from 90.83\% to 91.13\%, WResNet-18 on CIFAR-100 from 78.75\% to 78.85\%, and ResNet-50 on ImageNet-1K from 75.56\% to 75.89\%.

URL: https://openreview.net/forum?id=7u9OfPTHVm
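
The NLMS rule from adaptive filtering that inspires INSGD normalizes the update by the input energy; a minimal sketch of classical NLMS for a single linear neuron is shown below (INSGD's exact per-neuron $\ell_1$/$\ell_2$ normalization, as described above, differs):

    # Classical NLMS update for a linear filter w given input x and target y.
    import numpy as np

    def nlms_update(w, x, y, mu=0.5, eps=1e-8):
        e = y - w @ x                              # instantaneous error
        return w + mu * e * x / (eps + x @ x)      # step normalized by input energy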

---

Title: InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers

Abstract: We carried out a reproducibility study of InPars, which is a method for unsupervised training of neural rankers (Bonifacio et al., 2022). As a by-product, we developed InPars-Light, which is a simple-yet-effective modification of InPars. Unlike InPars, InPars-Light uses 7x-100x smaller ranking models and only the freely available language model BLOOM, which, as we found out, produced more accurate rankers compared to a proprietary GPT-3 model. On all five English retrieval collections (used in the original InPars study) we obtained substantial (7%-30%) and statistically significant improvements over BM25 (in nDCG and MRR) using only a 30M parameter six-layer MiniLM-30M ranker and a single three-shot prompt. In contrast, in the InPars study only a 100x larger monoT5-3B model consistently outperformed BM25, whereas their smaller monoT5-220M model (which is still 7x larger than our MiniLM ranker) outperformed BM25 only on MS MARCO and TREC DL 2020. In the same three-shot prompting scenario, our 435M parameter DeBERTA v3 ranker was on par with the 7x larger monoT5-3B (average gain over BM25 of 1.3 vs 1.32): in fact, on three out of five datasets, DeBERTA slightly outperformed monoT5-3B. Finally, these good results were achieved by re-ranking only 100 candidate documents compared to 1000 used by Bonifacio et al. (2022). We believe that InPars-Light is the first truly cost-effective prompt-based unsupervised recipe to train and deploy neural ranking models that outperform BM25. Our code and data are publicly available: https://anonymous.4open.science/r/inpars_anonymized-0635/

URL: https://openreview.net/forum?id=sHSKFYyINO

---

Title: TAMUNA: Doubly Accelerated Federated Learning with Local Training, Compression, and Partial Participation

Abstract: In federated learning, a large number of users collaborate to learn a global model. They alternate local computations and communication with a distant server. Communication, which can be slow and costly, is the main bottleneck in this setting. In addition to communication efficiency, a robust algorithm should allow for partial participation, the desirable feature that not all clients need to participate in every round of the training process. To reduce the communication load and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose TAMUNA, the first algorithm for distributed optimization and federated learning that jointly harnesses these two strategies and allows for partial participation. TAMUNA converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: it provably benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the model dimension, respectively.

URL: https://openreview.net/forum?id=vk103loaxw

---

Title: Beyond Boundaries: A Novel Data-Augmentation Discourse for Open Domain Generalization

Abstract: The problem of Open Domain Generalization (ODG) is multifaceted, encompassing shifts in domains and labels across all source and target domains. Existing approaches have encountered challenges such as style bias towards training domains, insufficient feature-space disentanglement to highlight semantic features, and discriminativeness of the latent space. Additionally, they rely on a confidence-based target outlier detection approach, which can lead to misclassifications when target open samples visually align with the source domain data.
In response to these challenges, we present a solution named \textsc{ODG-Net}. We aim to create a direct open-set classifier within a \textit{discriminative}, \textit{unbiased}, and \textit{disentangled} semantic embedding space. To enrich data density and diversity, we introduce a generative augmentation framework that produces \textit{style-interpolated} novel domains for closed-set images and novel pseudo-open images by \textit{interpolating the contents of paired training images}. Our augmentation strategy skillfully utilizes \textit{disentangled style and content information} to synthesize images effectively.
Furthermore, we tackle the issue of style bias by representing all images in relation to all source domain properties, which effectively accentuates complementary visual features. Consequently, we train a multi-class semantic object classifier, incorporating both closed and open class classification capabilities, along with a style classifier to identify style primitives. The joint use of style and semantic classifiers facilitates the disentanglement of the latent space, thereby enhancing the generalization performance of the semantic classifier.
To ensure discriminativeness in both closed and open spaces, we optimize the semantic feature space using novel metric losses. The experimental results on six benchmark datasets convincingly demonstrate that \textsc{ODG-Net} surpasses the state-of-the-art by an impressive margin of $1-4\%$ in both open and closed-set DG scenarios.

URL: https://openreview.net/forum?id=jpZmhiIys1

---

Title: Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

Abstract: Learning Markov decision processes (MDP) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation since the underlying structure of the loss function and transition kernel are especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDP achieve a regret of $\tilde{\mathcal{O}}({K^{6/7}})$ ($K$ denotes the number of episodes), which admits a large room for improvement. In this paper, we propose a novel explore-exploit algorithm framework and investigate the problem with a new view, which reduces linear MDP into linear optimization by subtly setting the feature maps of the bandit arms of linear optimization. This new technique, under an exploratory assumption, yields an improved bound of $\tilde{\mathcal{O}}({K^{4/5}})$ for linear adversarial MDP without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.

URL: https://openreview.net/forum?id=KcmWZSk53y

---

Title: The Kernel Density Quantile Transformation

Abstract: Feature preprocessing continues to play a critical role when applying machine learning and statistical methods to tabular data. In this paper, we propose the use of the kernel-density quantile transformation as a feature preprocessing step. We describe and demonstrate how the kernel-density quantile transformation can be used as a simple drop-in replacement for either min-max scaling or quantile transformation, combining the advantages of each of those preprocessing methods. Furthermore, we show that, in addition to its utility in supervised machine learning, the kernel-density transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.

URL: https://openreview.net/forum?id=6OEcDKZj5j
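
A minimal sketch of the drop-in usage described above, mapping each value through the CDF of a Gaussian KDE fitted to the training feature (a generic implementation; the authors' estimator and bandwidth choice may differ):

    # Kernel-density quantile transform for a single feature: values map to (0, 1).
    import numpy as np
    from scipy.stats import gaussian_kde

    def kde_quantile_transform(x_train, x_new):
        kde = gaussian_kde(x_train)                # default (Scott) bandwidth
        return np.array([kde.integrate_box_1d(-np.inf, v) for v in x_new])

Like min-max scaling it produces bounded outputs, and like the empirical quantile transform it is monotone, but the KDE smooths the empirical CDF so nearby inputs receive nearby outputs.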

---

Title: Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

Abstract: When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks where the questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods focus on combining rationales generated by a larger Language Model with longer contexts created from a multi-hop dense retrieval system. The first method ($\textit{RR}$) involves training a Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We then use the scores to derive combined contexts from both knowledge sources using a number of combinatory strategies. For the second method ($\textit{RATD}$) we train a smaller Reasoning model using retrieval-augmented training datasets such that it becomes proficient at utilising relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. Generally we find that both methods are effective but that the $\textit{RATD}$ method is more straightforward to apply and produces the strongest results in the unseen setting on which we focus. Our single best Reasoning model using only 440 million parameters materially improves upon strong comparable prior baselines for unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1) and a version utilising our prior knowledge of each type of question in selecting a context combination strategy does even better. Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and few-shot answer-only settings.

URL: https://openreview.net/forum?id=RiSTA3qFDB

---

Title: A Fully Decentralized Surrogate for Multi-Agent Policy Optimization

Abstract: The study of fully decentralized learning or independent learning in cooperative multi-agent reinforcement learning has a history of decades. Recent empirical studies have shown that independent PPO (IPPO) can achieve good performance, comparable to or even better than the methods of centralized training with decentralized execution, in several benchmarks. However, a decentralized actor-critic algorithm with convergence guarantee is still an open problem. In this paper, we propose \textit{decentralized policy optimization} (DPO), a decentralized actor-critic algorithm with monotonic improvement and convergence guarantee. We derive a novel decentralized surrogate for policy optimization such that the monotonic improvement of joint policy can be guaranteed by each agent \textit{independently} optimizing the surrogate. For practical implementation, this decentralized surrogate can be realized by two adaptive coefficients for policy optimization at each agent. Empirically, we evaluate DPO, IPPO, and independent Q-learning (IQL) in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces, as well as fully and partially observable environments. The results show DPO outperforms both IPPO and IQL in most tasks, which serves as evidence for our theoretical results.

URL: https://openreview.net/forum?id=MppUW90uU2

---

Title: Image retrieval outperforms diffusion models on data augmentation

Abstract: Diffusion models excel at generating photorealistic images from text-queries. Naturally, many approaches have been proposed to use these generative abilities to augment training datasets for downstream tasks, such as classification. However, diffusion models are themselves trained on large datasets, often with noisy annotations. It is an open question to what extent diffusion models are useful to augment data for improved downstream classification performance. In particular, it is unclear if they generalize enough to improve over directly using the additional data of their pre-training process for augmentation. We perform a systematic evaluation of existing methods to generate images from diffusion models and study new extensions to assess their benefit for data augmentation. We find that personalizing diffusion models towards the target data outperforms simpler prompting strategies. However, using the pre-training data of the diffusion model alone, via a simple nearest neighbor retrieval procedure, leads to even stronger downstream performance. Overall, our study explores the potential of diffusion models in generating new training data and at the same time surprisingly finds that these sophisticated models are not yet able to beat a simple and strong image retrieval baseline on simple downstream vision tasks.

URL: https://openreview.net/forum?id=xflYdGZMpv

---

Title: Exact and Approximate Conformal Inference for Multi-task Learning

Abstract: It is common in machine learning to estimate a response $y$ given covariate information $x$. However, these predictions alone do not quantify any uncertainty associated with said predictions. One way to overcome this deficiency is with conformal inference methods, which construct a set containing the unobserved response $y$ with a prescribed probability. Unfortunately, even with a one-dimensional response, conformal inference is computationally expensive despite recent encouraging advances. In this paper, we explore multi-task learning within a regression setting, delivering exact derivations of conformal inference $p$-values when the predictive model can be described as a linear function of $y$. Additionally, we propose \texttt{unionCP} and a multivariate extension of \texttt{rootCP} as efficient ways of approximating the conformal prediction region for a wide array of multi-task predictors while preserving computational advantages. We also provide both theoretical and empirical evidence of the effectiveness of these methods.

URL: https://openreview.net/forum?id=ixcW7aQNHl
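
As a point of reference, the simplest split-conformal recipe for a single regression task is sketched below; the paper's contribution lies in exact $p$-value derivations and the \texttt{unionCP}/\texttt{rootCP}-style approximations for multi-task predictors, which go beyond this baseline.

    # Split conformal prediction interval for one regression task.
    import numpy as np

    def split_conformal_interval(residuals_cal, y_hat_test, alpha=0.1):
        """residuals_cal: |y - y_hat| on a held-out calibration set."""
        n = len(residuals_cal)
        k = int(np.ceil((n + 1) * (1 - alpha)))    # conformal quantile index
        q = np.sort(residuals_cal)[min(k, n) - 1]  # clamped for tiny calibration sets
        return y_hat_test - q, y_hat_test + q      # marginal (1 - alpha) coverage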

---

Title: SIESTA: Efficient Online Continual Learning with Sleep

Abstract: In supervised continual learning, a deep neural network (DNN) is updated with an ever-growing data stream. Unlike the offline setting where data is shuffled, we cannot make any distributional assumptions about the data stream. Ideally, only one pass through the dataset is needed for computational efficiency. However, existing methods are inadequate and make many assumptions that cannot be made for real-world applications, while simultaneously failing to improve computational efficiency. In this paper, we propose SIESTA, a novel online continual learning method based on a wake/sleep training framework, which is well aligned with the needs of on-device learning. The major goal of SIESTA is to advance compute-efficient continual learning so that DNNs can be updated efficiently using far less time and energy. The principal innovations of SIESTA are: 1) rapid online updates using a rehearsal-free, backpropagation-free, and data-driven network update rule during its wake phase, and 2) expedited memory consolidation using a compute-restricted rehearsal policy during its sleep phase. For memory efficiency, SIESTA adapts latent rehearsal using memory indexing from REMIND. Compared to REMIND and prior art, SIESTA is far more computationally efficient, enabling continual learning on ImageNet-1K in under 2.4 hours on a single GPU; moreover, in the augmentation-free setting it matches the performance of the offline learner, a milestone critical to driving adoption of continual learning in real-world applications.

URL: https://openreview.net/forum?id=MqDVlBWRRV

---

Title: Relation-Oriented: Toward Knowledge-Aligned Causal AI

Abstract: In machine learning, we naturally apply an Observation-Oriented principle, in which observational variables preexist and set the stage for constructing relationships. While this principle is sufficient for traditional models, the integration of AI with big data exposes a misalignment between observational models and our actual comprehension. In contrast, humans shape cognitive entities defined by relationships, enabling us to formulate knowledge across temporal and hyper-dimensional spaces rather than being confined to observational constructs.
From an innovative Relation-Oriented perspective, this study examines the roots of this misalignment within our current modeling paradigm, illuminated by intuitive examples from computer vision and health informatics. We also introduce the relation-defined representation learning methodology as a practical implementation of Relation-Oriented modeling, supported by extensive experimental validation.

Consider an analogy where ants dwell on the two-dimensional plane of a floor. If these ants were to construct models, they might use the nearest tree as a reference to specify elevation in their two-dimensional models. Through this modeling, they observe increased disruption at the tree's mid-level, which indicates a higher chance of encountering children. However, since they fail to comprehend humans as three-dimensional beings, instead of interpreting this phenomenon in a new dimension, "height", they solely relate it to the tree's mid-level. If they migrate to a different tree with a different height, where the mid-level no longer presents a risk, they might conclude that human behavior is too complex to model effectively.
Similarly, when modeling time series, we usually discount the dimension "time" as a single timeline, which has become our "tree".

URL: https://openreview.net/forum?id=EzOJkQk4Ar

---

Title: Meta Continual Learning on Graphs with Experience Replay

Abstract: Continual learning is a machine learning problem where the challenge is that a constructed learning model executes incoming tasks while maintaining its performance over the earlier tasks. In order to solve this problem, we devise a technique that combines two uniquely important concepts in machine learning, namely "replay buffer" and "meta learning", aiming to exploit the best of both worlds. In our approach, first, the model is trained by using the current task dataset, and the model weights for the current task are calculated. Next, we merge the dataset of the current task with the stored samples from the earlier tasks and update the model parameters by training the model with the calculated parameters and the extended dataset. This allows us to distance the parameters from the optimal parameters for the current task and keep the model aligned for the earlier tasks. Furthermore, we choose to adapt our technique to the graph data structure and the task of node classification on graphs. We demonstrate that our new method, the MetaCLGraph, outperforms the baseline methods over various graph datasets including Citeseer, Corafull, Arxiv, and Reddit.

URL: https://openreview.net/forum?id=8tnrh56P5W

---

Title: Experiential Learning: Using Partition-Tree Weighting and MAML in the Continual and Online Setting

Abstract: Learning from experience means remaining adaptive and responsive to errors over time. However, gradient-based deep learning can fail dramatically in the continual, online setting. In this work we address this shortcoming by combining two meta-learning methods: the purely online Partition Tree Weighting (PTW) mixture-of-experts algorithm, and a novel variant of the Model-Agnostic Meta-Learning (MAML) initialization-learning procedure. We demonstrate our approach, \RMPTW, in a piecewise stationary classification task. In this continual, online setting, \RMPTW matches and even outperforms an augmented learner that is allowed to pre-train offline and is given explicit notification when the task changes. \RMPTW thus provides a base learner with the benefits of offline training, explicit task sampling, and boundary notification, all for a $O(\log_2(t))$ increase in computation and memory. This makes deep learning more viable for fully online, task-agnostic continual learning, which is at the heart of general intelligence.

URL: https://openreview.net/forum?id=BKB4VOdfN6

---

Title: Manifold Contrastive Learning with Variational Lie Group Operators

Abstract: Self-supervised learning of deep neural networks has become a prevalent paradigm for learning representations that transfer to a variety of downstream tasks. Similar to proposed models of the ventral stream of biological vision, it is observed that these networks lead to a separation of category manifolds in the representations of the penultimate layer. Although this observation matches the manifold hypothesis of representation learning, current self-supervised approaches are limited in their ability to explicitly model this manifold. Indeed, current approaches often only apply a pre-specified set of augmentations for "positive pairs" during learning. In this work, we propose a contrastive learning approach that directly models the latent manifold using Lie group operators parameterized by coefficients with a sparsity-promoting prior. A variational distribution over these coefficients provides a generative model of the manifold, with samples which provide feature augmentations applicable both during contrastive training and downstream tasks. Additionally, learned coefficient distributions provide a quantification of which transformations are most likely at each point on the manifold while preserving identity. We demonstrate benefits in self-supervised benchmarks for image datasets, as well as a downstream semi-supervised task. In the former case, we demonstrate that the proposed methods can effectively apply manifold feature augmentations and improve learning both with and without a projection head. In the latter case, we demonstrate that feature augmentations sampled from learned Lie group operators can improve classification performance when using few labels.

URL: https://openreview.net/forum?id=lVE1VeGQwg

---

Title: On the Efficacy of Differentially Private Few-shot Image Classification

Abstract: There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on private downstream datasets that are relatively large and similar in distribution to the pretraining data. However, in many applications including personalization and federated learning, it is crucial to perform well (i) in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and (ii) on datasets from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, downstream dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases by as much as $20-35\times$ at $\epsilon=1$. We also show that learning parameter-efficient FiLM adapters under DP is competitive with and often superior to learning just the final classifier layer or learning all of the network parameters. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR benchmark.

URL: https://openreview.net/forum?id=hFsr59Imzm

---

Title: Preserving Fairness in AI under Domain Shift

Abstract: Existing algorithms for ensuring fairness in AI use a single-shot training strategy, where an AI model is trained on an annotated training dataset with sensitive attributes and then fielded for utilization. This training strategy is effective in problems with stationary distributions, where both the training and testing data are drawn from the same distribution. However, it is vulnerable to distributional shifts in the input space that may occur after the initial training phase. As a result, the time-dependent nature of data can introduce biases and performance degradation into the model predictions. Model retraining from scratch using a new annotated dataset is a naive solution that is expensive and time-consuming. We develop an algorithm to adapt a fair model to remain fair and generalizable under domain shift using solely new unannotated data points. We recast this learning setting as an unsupervised domain adaptation (UDA) problem. Our algorithm is based on updating the model such that the internal representation of data remains unbiased despite distributional shifts in the input space. We provide empirical validation on three common fairness datasets to show that the challenge exists in practical settings and to demonstrate the effectiveness of our algorithm.

URL: https://openreview.net/forum?id=M7csdqcyOW

---

Title: Provably Personalized and Robust Federated Learning

Abstract: Clustering clients with similar objectives and learning a model per cluster is an intuitive and interpretable approach to personalization in federated learning. However, doing so with provable and optimal guarantees has remained an open challenge. In this work, we formalize personalized federated learning as a stochastic optimization problem where the stochastic gradients on a client may correspond to one of $K$ distributions. In such a setting, we show that using i) a simple thresholding-based clustering algorithm, and ii) local client gradients obtains optimal convergence guarantees. In fact, our rates asymptotically match those obtained if we knew the true underlying clustering of the clients. Furthermore, our algorithms are provably robust in the Byzantine setting where some fraction of the gradients are corrupted.
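
To convey the flavour of gradient-based clustering of clients, here is a toy thresholding rule on pairwise gradient distances; the threshold, data, and union-find grouping are illustrative only, and the paper's actual algorithm and guarantees are more involved.

import numpy as np

rng = np.random.default_rng(0)

# toy setup: 6 clients, gradients drawn around one of K=2 cluster centres
centres = np.array([[1.0, 0.0], [-1.0, 0.0]])
client_grads = np.vstack([centres[i % 2] + 0.1 * rng.standard_normal(2) for i in range(6)])

def threshold_cluster(grads, tau=1.0):
    """Group clients whose gradients are within distance tau of each other (union-find)."""
    n = len(grads)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(grads[i] - grads[j]) < tau:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

print(threshold_cluster(client_grads))   # clients 0,2,4 and 1,3,5 end up in separate groups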

URL: https://openreview.net/forum?id=B0uBSSUy0G

---

Title: ADIR: Adaptive Diffusion for Image Reconstruction

Abstract: In recent years, denoising diffusion models have demonstrated outstanding image generation performance.
The information on natural images captured by these models is useful for many image reconstruction applications,
where the task is to restore a clean image from its degraded observations.
In this work, we propose a conditional sampling scheme that exploits the prior learned by diffusion models while retaining agreement with the measurements. We then combine it with a novel approach to adapting pre-trained diffusion denoising networks to their input. We examine two adaptation strategies: the first uses only the degraded image, while the second, which we advocate, is performed using images that are ``nearest neighbors'' of the degraded image, retrieved from a diverse dataset with an off-the-shelf visual-language model. To evaluate our method, we test it on two state-of-the-art publicly available diffusion models, Stable Diffusion and Guided Diffusion. We show that our proposed `adaptive diffusion for image reconstruction' (ADIR) approach achieves a significant improvement in image reconstruction tasks.
Our code will be available online upon publication.

URL: https://openreview.net/forum?id=ATVqgUofSr

---

Title: Unsupervised Discovery of Steerable Factors When Graph Deep Generative Models Are Entangled

Abstract: Deep generative models (DGMs) have been widely developed for graph data. However, much less investigation has been carried out on understanding the latent space of such pretrained graph DGMs. Such an understanding has the potential to provide constructive guidelines for crucial tasks, such as graph controllable generation. Thus, in this work, we are interested in studying this problem and propose GraphCG, a method for the unsupervised discovery of steerable factors in the latent space of pretrained graph DGMs. We first examine the representation space of three pretrained graph DGMs with six disentanglement metrics, and we observe that the pretrained representation space is entangled. Motivated by this observation, GraphCG learns the steerable factors by maximizing the mutual information between semantically rich directions, such that graphs moved along the same direction share the same steerable factors. We quantitatively verify that GraphCG outperforms four competitive baselines on two graph DGMs pretrained on two molecule datasets. Additionally, we qualitatively illustrate seven steerable factors learned by GraphCG on five pretrained DGMs over five graph datasets, including two for molecules and three for point clouds.

URL: https://openreview.net/forum?id=wyU3Q4gahM

---

Title: Multimodal Language Learning for Object Retrieval in Low Data Regimes and in the Face of Missing Modalities

Abstract: Our study is motivated by robotics, where we often need to balance the competing concerns of relying on complex, multimodal data from a variety of sensors against a general lack of large, representative datasets. Despite the complexity of modern robotic platforms and the need for multimodal interaction, there has been little research on integrating more than two modalities in a low data regime with the real-world constraint that sensors fail due to obstructions or adverse conditions. In this work, we consider a case in which natural language is used as a retrieval query against objects, represented across multiple modalities, in a physical environment. We introduce extended multimodal alignment (EMMA), a method that learns to select the appropriate object while jointly refining modality-specific embeddings through a geometric (distance-based) loss. In contrast to prior work, our approach is able to incorporate an arbitrary number of views (modalities) of a particular piece of data. We demonstrate the efficacy of our model on a grounded language object retrieval scenario. We show that our model outperforms state-of-the-art baselines when little training data is available. A link to our code will be included in the camera-ready version.


URL: https://openreview.net/forum?id=cXa6Xdm0v7

---

Title: The Counterattack of CNNs in Self-Supervised Learning: Larger Kernel Size might be All You Need

Abstract: Vision Transformers have been rapidly rising in computer vision thanks to their outstanding scaling trends, and are gradually replacing convolutional neural networks (CNNs). Recent works on self-supervised learning (SSL) introduce siamese pre-training tasks, on which Transformer backbones continue to demonstrate stronger results than CNNs. People have come to believe that Transformers or self-attention modules are inherently more suitable than CNNs in the context of SSL. However, it is noteworthy that most if not all prior works on SSL with CNNs chose the standard ResNets as their backbones, whose architectural effectiveness is known to already lag behind advanced Vision Transformers. Therefore, it remains unclear whether the self-attention operation is crucial for the recent advances in SSL - or whether CNNs can deliver the same excellence with more advanced designs, too. Can we close the SSL performance gap between Transformers and CNNs? To answer these intriguing questions, we apply self-supervised pre-training to the recently proposed, stronger larger-kernel CNN architecture and conduct an apples-to-apples comparison with Transformers in their SSL performance. Our results show that we are able to build pure CNN SSL architectures that perform on par with or better than the best SSL-trained Transformers, by just scaling up convolutional kernel sizes besides other small tweaks. Impressively, when transferring to the downstream tasks of MS COCO detection and segmentation, our SSL pre-trained CNN model (trained in 100 epochs) achieves the same good performance as the 300-epoch pre-trained Transformer counterpart. We hope this work can help to better understand what is essential (or not) for self-supervised learning backbones. Codes will be made public.

URL: https://openreview.net/forum?id=NOEfOcSG3z

---

Title: Fast Slate Policy Optimization: Going Beyond Plackett-Luce

Abstract: An increasingly important building block of large scale machine learning systems is based on returning slates: ordered lists of items given a query. Applications of this technology include search, information retrieval and recommender systems. When the action space is large, decision systems are restricted to a particular structure to complete online queries quickly. This paper addresses the optimization of these large scale decision systems given an arbitrary reward function. We cast this learning problem in a policy optimization framework and propose a new class of policies, born from a novel relaxation of decision functions. This results in a simple, yet efficient learning algorithm that scales to massive action spaces. We compare our method to the commonly adopted Plackett-Luce policy class and demonstrate the effectiveness of our approach on problems with action space sizes on the order of millions.
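
For context, the Plackett-Luce policy class used as the baseline can be sampled with the standard Gumbel perturbation trick, and a slate's log-probability has a simple sequential-softmax form; a toy version follows (the scores and slate size are arbitrary, and the paper's proposed policy class is different).

import numpy as np

rng = np.random.default_rng(0)

def sample_slate(logits, k, rng=rng):
    """Sample a slate of k items from a Plackett-Luce policy via Gumbel perturbation."""
    gumbel = -np.log(-np.log(rng.random(len(logits))))
    return np.argsort(-(logits + gumbel))[:k]

def slate_log_prob(logits, slate):
    """log P(slate) under Plackett-Luce: sequential softmax over the remaining items."""
    remaining = np.ones(len(logits), dtype=bool)
    logp = 0.0
    for item in slate:
        z = logits[remaining]
        logp += logits[item] - z.max() - np.log(np.sum(np.exp(z - z.max())))
        remaining[item] = False
    return logp

logits = rng.standard_normal(10)          # scores for 10 items
slate = sample_slate(logits, k=3)
print(slate, slate_log_prob(logits, slate))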

URL: https://openreview.net/forum?id=f7a8XCRtUu

---

Title: Smoothed Differential Privacy

Abstract: Differential privacy (DP) is a widely-accepted and widely-applied notion of privacy based on worst-case analysis. Often, DP classifies most mechanisms without additive noise as non-private (Dwork et al., 2014). Thus, additive noise is added to improve privacy (to achieve DP). However, in many real-world applications, adding noise is undesirable (Bagdasaryan et al., 2019) and sometimes prohibited (Liu et al., 2020).

In this paper, we propose a natural extension of DP following the worst average-case idea behind the celebrated smoothed analysis (Spielman & Teng, May 2004). Our notion, smoothed DP, can effectively measure the privacy leakage of mechanisms without additive noise under realistic settings. We prove that any discrete mechanism with sampling procedures is more private than what DP predicts, while many continuous mechanisms with sampling procedures are still non-private under smoothed DP. In addition, we prove several desirable properties of smoothed DP, including composition, robustness to post-processing, and distribution reduction. Based on those properties, we propose an efficient algorithm to calculate the privacy parameters for smoothed DP. Experimentally, we verify that, according to smoothed DP, the discrete sampling mechanisms are private in real-world elections, and some discrete neural networks can be private without adding any noise. We believe that these results contribute to the theoretical foundation of realistic privacy measures beyond worst-case analysis.

URL: https://openreview.net/forum?id=CviCLt44Em

---

Title: Error bounds and dynamics of bootstrapping in actor-critic reinforcement learning

Abstract: Actor-critic algorithms such as DDPG, TD3, and SAC, which are built on Silver's deterministic policy gradient theorem, are among the most successful reinforcement-learning methods, but their mathematical basis is not entirely clear. In particular, the critic networks in these algorithms learn to estimate action-value functions by a “bootstrapping” technique based on Bellman error, and it is unclear why this approach works so well in practice, given that Bellman error is only very loosely related to value error, i.e. to the inaccuracy of the action-value estimate. Here we show that policy training in this class of actor-critic methods depends not on the accuracy of the critic's action-value estimate but on how well the critic estimates the gradient of the action-value, which is better assessed using what we call difference error. We show that this difference error is closely related to the Bellman error — a finding which helps to explain why Bellman-based bootstrapping leads to good policies. Further, we show that value error and difference error show different dynamics along on-policy trajectories through state-action space: value error is a low-pass anticausal (i.e., backward-in-time) filter of Bellman error, and therefore accumulates along trajectories, whereas difference error is a high-pass filter of Bellman error. It follows that techniques which reduce the high-frequency Fourier components of the Bellman error may improve policy training even if they increase the actual size of the Bellman errors. These findings help to explain certain aspects of actor-critic methods that are otherwise theoretically puzzling, such as the use of policy (as distinct from exploratory) noise, and they suggest other measures that may improve these methods.
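
One simplified way to see the anticausal accumulation described above: under deterministic, on-policy transitions and an exact critic at the terminal step, the value error obeys e_t = -delta_t + gamma * e_{t+1}, where delta_t is the Bellman error. The toy recursion below illustrates this reading; it is a hedged simplification of the idea, not the paper's analysis.

import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.99, 50

# toy per-step Bellman errors along one on-policy trajectory
bellman_err = 0.1 * rng.standard_normal(T)

# value error accumulates Bellman errors backward in time (anticausal filter),
# assuming the critic is exact at the terminal step
value_err = np.zeros(T + 1)
for t in reversed(range(T)):
    value_err[t] = -bellman_err[t] + gamma * value_err[t + 1]

print(np.abs(bellman_err).mean(), np.abs(value_err[:-1]).mean())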

URL: https://openreview.net/forum?id=QCjMJfSnYk

---

Title: Distributed Architecture Search Over Heterogeneous Distributions

Abstract: Federated learning (FL) is an efficient learning framework that assists distributed machine learning when data cannot be shared with a centralized server. Recent advancements in FL use predefined architecture-based learning for all clients. However, given that clients' data are invisible to the server and data distributions are non-identical across clients, a predefined architecture discovered in a centralized setting may not be an optimal solution for all the clients in FL. Motivated by this challenge, we introduce SPIDER, an algorithmic framework that aims to Search PersonalIzed neural architecture for feDERated learning. SPIDER is designed based on two unique features: (1) alternately optimizing one architecture-homogeneous global model in a generic FL manner and architecture-heterogeneous local models that are connected to the global model by weight-sharing-based regularization, (2) achieving architecture-heterogeneous local models by a perturbation-based neural architecture search method. Experimental results demonstrate superior prediction performance compared with other state-of-the-art personalization methods.

URL: https://openreview.net/forum?id=sY75NqDRk1

---

Title: SANTA: Source Anchoring Network and Target Alignment for Continual Test Time Adaptation

Abstract: Adapting a trained model to perform satisfactorily on continually changing test environments is an important and challenging task. In this work, we propose a novel framework, SANTA, which aims to satisfy the following characteristics required for online adaptation: 1) can work effectively for different (even small) batch sizes; 2) should continue to work well on the source domain; 3) should have minimal tunable hyperparameters and storage requirements. Given a pre-trained network trained on source domain data, the proposed framework modifies the affine parameters of the batch normalization layers using source anchoring based self-distillation. This ensures that the model incorporates knowledge from the newly encountered domains, without catastrophically forgetting the previously seen domains.
We also propose a source-prototype driven contrastive alignment to ensure natural grouping of the target samples, while maintaining the already learnt semantic information. Extensive evaluation on three benchmark datasets under challenging settings justifies the effectiveness of SANTA for real-world applications.

URL: https://openreview.net/forum?id=V7guVYzvE4

---

Title: Active Learning via Classifier Impact and Greedy Selection for Interactive Image Retrieval

Abstract: Active Learning (AL) is a user-interactive approach aimed at reducing annotation costs by selecting the most crucial examples to label. Although AL has been extensively studied for image classification tasks, the specific scenario of interactive image retrieval has received relatively little attention. This scenario presents unique characteristics, including an open-set and class-imbalanced binary classification, starting with very few labeled samples.
To address this specific scenario, we introduce a novel batch-mode Active Learning framework named GAL (Greedy Active Learning) that incorporates a new objective function for sample selection, which measures the impact of each unlabeled sample on the classifier. We further embed this strategy in a greedy selection approach. We evaluate our framework with both linear (SVM) and non-linear (Gaussian Process) classifiers. For the linear case, our method considers a pseudo-label strategy for each sample while ensuring tractability through a greedy approach. For our Gaussian Process objective function, we show a theoretical guarantee for the greedy approximation. Finally, we assess our performance on the interactive content-based image retrieval task and demonstrate its superiority over existing approaches and common baselines.

URL: https://openreview.net/forum?id=OEee0WzVXd

---

Title: Dynamic Window-level Granger Causality of Multi-channel Time Series

Abstract: Granger causality (GC) techniques facilitate the examination of temporal causal relationships by assessing the predictive utility of one time series for another. However, conventional GC approaches are limited to linear scenarios and assume that causalities exclusively exist between entire time series channels, remaining constant over time. This assumption falls short when dealing with real-world time series data characterized by dynamic causalities among channels. To address this limitation, we present the dynamic window-level Granger causality with causality index (DWGC-CI) method, which incorporates nonlinear window-level variability into the traditional GC framework. The DWGC-CI method specifically employs F-tests on sliding windows to assess forecasting errors, subsequently extracting dynamic causalities using a causality index. This index involves the introduction of a negative feedback adjustment mechanism, which mitigates window-level noisy fluctuations and helps extract dynamic causalities. We theoretically demonstrate that, compared to traditional channel-level GC methods, DWGC-CI accurately captures window-level GCs. In experimental evaluations involving two synthetic datasets, DWGC-CI significantly surpasses the baseline in terms of accuracy and recall rates. Furthermore, when applied to two real-world datasets, DWGC-CI effectively identifies seasonal and annual varying GCs, demonstrating a markedly better consistency with domain-specific knowledge of seasonal meteorology and stock market dynamics.

URL: https://openreview.net/forum?id=0niO1TILv1

---

Title: Variational Classification

Abstract: We present a latent variable generalisation of neural network softmax classification trained with cross-entropy loss, referred to as Variational Classification (VC). Our approach offers a novel probabilistic perspective on the highly familiar softmax classification model, to which it relates as variational autoencoders relate to traditional autoencoders. We derive a training objective based on the evidence lower bound (ELBO) that is non-trivial to optimize, and therefore propose an adversarial approach to maximise it. We show that VC addresses an inherent inconsistency within softmax classification, whilst also allowing more flexible choices of prior distributions in the latent space in place of implicit assumptions revealed within off-the-shelf softmax classifiers. Empirical evaluation on image and text classification datasets demonstrates that variational classification maintains prediction accuracy while improving other desirable properties such as calibration and adversarial robustness, particularly under distribution shift and low data settings.

URL: https://openreview.net/forum?id=EWv9XGOpB3

---

Title: Nonlinear Feature Aggregation: Two Algorithms driven by Theory

Abstract: Many real-world machine learning applications are characterized by a huge number of features, leading to computational and memory issues, as well as the risk of overfitting. Ideally, only relevant and non-redundant features should be considered to preserve the complete information of the original data and limit the dimensionality. Dimensionality reduction and feature selection are common preprocessing techniques addressing the challenge of efficiently dealing with high-dimensional data. Dimensionality reduction methods control the number of features in the dataset while preserving its structure and minimizing information loss. Feature selection aims to identify the most relevant features for a task, discarding the less informative ones. Previous works have proposed approaches that aggregate features depending on their correlation, discarding none of them and preserving their interpretability by aggregating with the mean. A limitation of methods based on correlation is the assumption of linearity in the relationship between features and target. In this paper, we relax such an assumption in two ways. First, we propose a bias-variance analysis for general models with additive Gaussian noise, leading to a dimensionality reduction algorithm (NonLinCFA) which aggregates non-linear transformations of features with a generic aggregation function. Then, we extend the approach assuming that a generalized (non-)linear model regulates the relationship between features and target. A deviance analysis leads to a second dimensionality reduction algorithm (GenLinCFA), applicable to a larger class of regression problems and classification settings. Finally, we test the algorithms on synthetic and real-world datasets, performing regression and classification tasks, showing competitive performance.

URL: https://openreview.net/forum?id=WlFknvjBfx

---

Title: Synthetic Data from Diffusion Models Improves ImageNet Classification

Abstract: Deep generative models are becoming increasingly powerful, now generating diverse, high fidelity, photo-realistic samples given text prompts. Nevertheless, samples from such models have not been shown to significantly improve model training for challenging and well-studied discriminative tasks like ImageNet classification. In this paper we show that augmenting the ImageNet training set with samples from a generative diffusion model can yield substantial improvements in ImageNet classification accuracy over strong ResNet and Vision Transformer baselines. To this end we explore the fine-tuning of large-scale text-to-image diffusion models, yielding class-conditional ImageNet models with a state-of-the-art FID score (1.76 at 256×256 resolution) and Inception Score (239 at 256×256). The model also yields a new state-of-the-art in Classification Accuracy Scores, i.e., ImageNet test accuracy for a ResNet-50 architecture trained solely on synthetic data (64.96 top-1 accuracy for 256×256 samples, improving to 69.24 for 1024×1024 samples). Adding up to three times as many synthetic samples as real training samples consistently improves ImageNet classification accuracy across multiple architectures.

URL: https://openreview.net/forum?id=DlRsoxjyPm

---

Title: RCT Rejection Sampling for Causal Estimation Evaluation

Abstract: Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates---such as text data, genomics, or the behavioral social sciences---researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT---which we release publicly---consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.
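
To convey the flavour of subsampling an RCT to induce confounding, the toy rejection-sampling step below keeps treated and control units with covariate-dependent probabilities, so the naive difference in means becomes biased relative to the RCT's average effect; the acceptance rule is made up for illustration and does not reproduce the paper's algorithm or its identification guarantees.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# a toy RCT: covariate x, randomised treatment t, outcome depends on both (true ATE = 1.0)
x = rng.standard_normal(n)
t = rng.integers(0, 2, size=n)
y = 1.0 * t + 0.5 * x + rng.standard_normal(n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# rejection sampling: keep treated units with probability sigmoid(x) and
# control units with probability sigmoid(-x), so x becomes a confounder
keep_prob = np.where(t == 1, sigmoid(x), sigmoid(-x))
keep = rng.random(n) < keep_prob
t_obs, y_obs = t[keep], y[keep]

# the naive difference in means is now biased upward relative to the RCT ground truth
print(y_obs[t_obs == 1].mean() - y_obs[t_obs == 0].mean())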

URL: https://openreview.net/forum?id=F74ZZk5hPa

---

Title: Size Lowerbounds for Deep Operator Networks

Abstract: Deep Operator Networks are an increasingly popular paradigm for solving regression in infinite dimensions and hence solving families of PDEs in one shot. In this work, we aim to establish a first-of-its-kind data-dependent lower bound on the size of DeepONets required for them to be able to reduce empirical error on noisy data. In particular, we show that for low training errors to be obtained on $n$ data points it is necessary that the common output dimension of the branch and the trunk net scale as $\Omega \left ( {\sqrt{n}} \right )$. This inspires our experiments with DeepONets solving the advection-diffusion-reaction PDE, where we demonstrate the possibility that, at a fixed model size, leveraging an increase in this common output dimension to obtain a monotonic decrease in training error may require the training data size to scale quadratically with it.
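
As a reminder of where the "common output dimension" enters, a DeepONet predicts G(u)(y) as a sum over p products of branch and trunk outputs, and the lower bound concerns how p must grow with the number of data points n. A toy forward pass with random weights (all sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)

m, p = 32, 16          # number of input-function sensors, common output dimension p

def mlp(x, Ws):
    for W in Ws[:-1]:
        x = np.tanh(x @ W)
    return x @ Ws[-1]

branch_W = [rng.standard_normal((m, 64)) / 8, rng.standard_normal((64, p)) / 8]
trunk_W  = [rng.standard_normal((1, 64)) / 8, rng.standard_normal((64, p)) / 8]

def deeponet(u_sensors, y):
    """G(u)(y) ~= sum_k branch_k(u) * trunk_k(y), with both nets sharing output dim p."""
    b = mlp(u_sensors, branch_W)        # shape (p,)
    t = mlp(np.array([y]), trunk_W)     # shape (p,)
    return float(b @ t)

u = np.sin(np.linspace(0, np.pi, m))    # a sampled input function
print(deeponet(u, 0.3))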

URL: https://openreview.net/forum?id=RwmWODTNFE

---

Title: Complementary Sparsity: Accelerating Sparse CNNs with High Accuracy on General-Purpose Computing Platforms

Abstract: Model sparsity is a promising approach to reduce the parameters or FLOPs of convolutional neural networks (CNNs). Compared to unstructured or coarse-grained structured sparsity, fine-grained structured sparsity, e.g., the N:M sparse pattern, can achieve a better balance between accuracy and efficiency on general computing platforms like CPUs and GPUs. In particular, 2:4 sparsity can accelerate CNN inference by 2$\times$ with negligible accuracy drop. However, N:M sparsity needs to be supported by specific hardware circuits on the GPU and hardly achieves significant speedups on common GPUs. To accelerate CNNs with general-purpose computing resources while retaining model accuracy as much as possible, this paper proposes complementary sparsity (CS). Under CS, only one weight can be retained among weights spaced at the same distance. On the one hand, CS features high mask flexibility, which is naturally favorable to high model accuracy. Moreover, we propose a CS-specific sparse training method to improve the accuracy of CS-based CNNs under high parameter sparsities ($>$75\%). On the other hand, CS itself is memory-access balanced and robust to pattern hyperparameters, which can be exploited to speed up CS-based convolution on CPUs and common GPUs. We thus propose a CS convolution parallel computing algorithm that adapts to common GPUs without sparse tensor cores. Experimental results show that, compared to other sparsity patterns, the proposed CS achieves the best trade-off between accuracy and latency on both CPUs and common GPUs.

URL: https://openreview.net/forum?id=g1B4qgOw79

---

Title: A Survey on Graph Construction for Geometric Deep Learning in Medicine: Methods and Recommendations

Abstract: Graph neural networks are powerful tools that enable deep learning on non-Euclidean data structures like graphs, point clouds, and meshes. They leverage the connectivity of data points and can even benefit learning tasks on data that is not naturally graph-structured, like point clouds. In these cases, the graph structure needs to be determined from the dataset, which adds a significant challenge to the learning process. This opens up a multitude of design choices for creating suitable graph structures, which have a substantial impact on the success of the graph learning task. However, so far no concrete guidance for choosing the most appropriate graph construction is available, not only due to the large variety of available methods but also because of its strong connection to the dataset at hand. In medicine, for example, a large variety of different data types complicates the selection of graph construction methods even more. We therefore summarise the current state-of-the-art graph construction methods, especially for medical data. In this paper, we identify two main strands of graph construction: static and adaptive methods, discuss their advantages and disadvantages, and formulate recommendations for choosing a suitable graph construction method. We also discuss how a created graph structure can be assessed and to what degree it supports graph learning. With this work, we hope to support medical research using graph deep learning by elucidating the wide variety of graph construction methods.
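
As a concrete example of the static construction methods surveyed, a k-nearest-neighbour graph can be built directly from pairwise feature distances; the sketch below is generic and not a recommendation from the paper.

import numpy as np

rng = np.random.default_rng(0)

def knn_adjacency(X, k=3):
    """Symmetric k-NN adjacency matrix from Euclidean distances between data points."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                       # exclude self-loops
    A = np.zeros_like(d2)
    nn = np.argsort(d2, axis=1)[:, :k]                 # indices of the k nearest neighbours
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = 1.0
    return np.maximum(A, A.T)                          # symmetrise

X = rng.standard_normal((10, 5))     # e.g. 10 patients with 5 features each
print(knn_adjacency(X).sum(axis=1))  # node degrees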

URL: https://openreview.net/forum?id=sWlHhfijcS

---

Title: StarCoder: may the source be with you!

Abstract: The ANONYMIZED community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

URL: https://openreview.net/forum?id=KoFOg41haE

---

Title: Detecting danger in gridworlds using Gromov’s Link Condition

Abstract: Gridworlds have long been utilised in AI research, particularly in reinforcement learning, as they provide simple yet scalable models for many real-world applications such as robot navigation, emergent behaviour, and operations research. We initiate a study of gridworlds using the mathematical framework of reconfigurable systems and state complexes due to Abrams, Ghrist & Peterson. State complexes represent all possible configurations of a system as a single geometric space, thus making them conducive to study using geometric, topological, or combinatorial methods. The main contribution of this work is a modification to the original Abrams, Ghrist & Peterson setup which we introduce to capture agent braiding and thereby more naturally represent the topology of gridworlds. With this modification, the state complexes may exhibit geometric defects (failure of Gromov's Link Condition). Serendipitously, we discover these failures occur exactly where undesirable or dangerous states appear in the gridworld. Our results therefore provide a novel method for seeking guaranteed safety limitations in discrete task environments with single or multiple agents, and offer useful safety information (in geometric and topological forms) for incorporation in or analysis of machine learning systems. More broadly, our work introduces tools from geometric group theory and combinatorics to the AI community and demonstrates a proof-of-concept for this geometric viewpoint of the task domain through the example of simple gridworld environments.

URL: https://openreview.net/forum?id=t4p612DftO

---

Title: The Fair Value of Data Under Heterogeneous Privacy Constraints in Federated Learning

Abstract: Modern data aggregation often involves a platform collecting data from a network of users. Increasingly, users request that the data they provide be protected with a guarantee of privacy. With privacy options for users, platforms must solve the problem of how to allocate incentives to users to convince them to share their data. The main goal of this paper is to characterize a \textit{fair} amount to compensate users for their data at a given privacy level. We propose an axiomatic definition of fairness, along the lines of the celebrated Shapley value. The notion of fairness we propose is ultimately related to the average marginal contribution of a user. To the best of our knowledge, these are the first fairness concepts for data that explicitly consider privacy constraints. We also formulate a heterogeneous federated learning problem for the platform with privacy level options for users. By studying this problem, we investigate the amount of compensation users receive under fair allocations with different privacy levels, amounts of data and degrees of heterogeneity. Under certain conditions, we characterize the optimal behavior of the platform when incentives are constrained to be fair, revealing that the optimal behavior of the platform can be separated into three regimes, depending on the privacy sensitivity of the users. When privacy sensitivity is low, the platform will set incentives to ensure that it collects all the data with the lowest privacy options. When the privacy sensitivity is above a given threshold, the platform will provide no incentives to users. Between these two extremes, the platform will set the incentives so that some fraction of the users choose the higher privacy option and the rest choose the lower privacy option.
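
The "average marginal contribution" idea behind this fairness notion can be illustrated with a standard Monte Carlo Shapley estimate over random user orderings; the value function below is a made-up stand-in for the platform's utility at given privacy levels.

import numpy as np

rng = np.random.default_rng(0)

data_amounts = np.array([4.0, 2.0, 1.0])        # toy per-user data quantities
privacy_cost = np.array([0.5, 0.2, 0.8])        # toy per-user privacy discounts

def platform_value(subset):
    """Hypothetical platform utility: diminishing returns in privacy-weighted data."""
    total = sum(data_amounts[i] * (1 - privacy_cost[i]) for i in subset)
    return np.sqrt(total)

def monte_carlo_shapley(n_users, value, n_perm=2000, rng=rng):
    """Average marginal contribution of each user over random arrival orders."""
    contrib = np.zeros(n_users)
    for _ in range(n_perm):
        order = rng.permutation(n_users)
        prefix, prev = set(), value(set())
        for i in order:
            prefix.add(i)
            cur = value(prefix)
            contrib[i] += cur - prev
            prev = cur
    return contrib / n_perm

print(monte_carlo_shapley(3, platform_value))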

URL: https://openreview.net/forum?id=ynG5Ak7n7Q

---

Title: Online Reference Tracking For Linear Systems with Unknown Dynamics and Unknown Disturbances

Abstract: This paper presents an online learning mechanism to address the challenge of state tracking for unknown linear systems under general adversarial disturbances. The reference trajectory is assumed to be generated by unknown exosystem dynamics, which relaxes the common assumption of known dynamics for exosystems. Learning a tracking control policy for unknown systems with unknown exosystem dynamics under general disturbances is challenging and surprisingly unsettled. To face this challenge, the presented online learning algorithm has two stages: In the first stage, an algorithm identifies the dynamics of the uncertain system, and in the second stage, an online parametrized memory-augmented controller accounts for the identification error, unknown exosystem dynamics as well as disturbances. The controller's parameters are learned to optimize a general convex cost function, and learning the control parameters is formulated as an online convex optimization problem. This approach uses the memory of previous disturbances and reference values to capture their effects on performance over time. Moreover, it implicitly learns the dynamics of the exosystems. The algorithm enables online tuning of controller parameters to achieve state tracking and disturbance rejection while minimizing general convex costs. It is shown that the algorithm achieves a policy regret of $\mathcal{O}({T}^{\frac{2}{3}})$. In simulations, the performance of the presented tracking algorithm is compared with certainty-equivalent $H_{\infty}$ control and the linear quadratic regulator.

URL: https://openreview.net/forum?id=pfbVayaUMc

---

Title: Model Selection of Discrete Classifiers under Class and Cost Distribution Change: An Empirical Study

Abstract: A variety of important machine learning applications require predictions on test data with
different characteristics than the data on which a model was trained and validated. In
particular, test data may have a different relative frequency of positives and negatives (i.e.,
class distribution) and/or different mislabeling costs of false positive and false negative
errors (i.e., cost distribution) than the training data. Selecting models that have been
built in conditions that are substantially different from the conditions under which they
will be applied is more challenging than selecting models for identical conditions. Several
approaches to this problem exist, but they have mostly been studied in theoretical contexts.
This paper presents an empirical evaluation of approaches for model selection under class and
cost distribution change, based on Receiver Operating Characteristic (ROC) analysis. The
analysis compares the ROC Convex Hull (ROCCH) method with other candidate approaches
for selecting discrete classifiers for several UCI Machine Learning Repository and simulated
datasets. Surprisingly, the ROCCH method did not perform well in the experiments, despite
being developed for this task. Instead, the results indicate that a reliable approach for
selecting discrete classifiers on the ROC convex hull is to select the model that optimizes
the cost metric of interest on the validation data (which has the same characteristics as the
training data) but weighted by the class and/or cost distributions anticipated at test time.
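
The recommended selection rule amounts to re-weighting validation error rates by the class prior and error costs anticipated at test time; a toy version with hypothetical candidate classifiers follows.

import numpy as np

def expected_cost(fpr, fnr, p_pos, c_fp=1.0, c_fn=1.0):
    """Expected per-example cost under the anticipated class prior p_pos and error costs."""
    return c_fp * fpr * (1 - p_pos) + c_fn * fnr * p_pos

# validation-set error rates of three candidate discrete classifiers (toy numbers)
candidates = {"A": (0.05, 0.30), "B": (0.15, 0.10), "C": (0.02, 0.50)}   # (FPR, FNR)

# anticipated deployment conditions: rarer positives, costly false negatives
p_pos_test, c_fp, c_fn = 0.05, 1.0, 10.0

best = min(candidates,
           key=lambda k: expected_cost(*candidates[k], p_pos_test, c_fp, c_fn))
print(best, {k: round(expected_cost(*v, p_pos_test, c_fp, c_fn), 3)
             for k, v in candidates.items()})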

URL: https://openreview.net/forum?id=BbSNFN2SqE

---

Title: An Empirical Study of Focus Identification in Large Language Models with Modal Contexts

Abstract: We examine whether pre-trained large language models GPT-2 and Flan-T5 use prior context to accurately identify the focus of negated sentences. We present to the models procedurally-generated negated sentences and procedurally-generated prior contexts to prime the models to generate text coherent with a particular focus in the negated sentences. We examine model accuracy with and without explicit modal operators for necessity and possibility and formalize our examples in modal logic. We find that larger models of the same class are much more likely to generate sentence continuations that cohere with the provided context, but even the larger models often struggle to outperform a random baseline. Certain necessity modals in the prior context greatly increase the likelihood of a continuation that
is both relevant to the context and consistent with the primed negation focus.

URL: https://openreview.net/forum?id=g9pcPz1Hq1

---

Title: Ensemble learning for Physics Informed Neural Networks: a Gradient Boosting approach

Abstract: While the popularity of physics-informed neural networks (PINNs) is steadily rising, to date PINNs have not been successful in simulating multi-scale and singular perturbation problems. In this work, we present a new training paradigm referred to as "gradient boosting" (GB), which significantly enhances the performance of PINNs. Rather than learning the solution of a given PDE using a single neural network directly, our algorithm employs a sequence of neural networks to achieve a superior outcome. This approach allows us to solve problems that present great challenges for traditional PINNs. Our numerical experiments demonstrate the effectiveness of our algorithm through various benchmarks, including comparisons with finite element methods and PINNs. Furthermore, this work also opens the door to employing ensemble learning techniques in PINNs, providing opportunities for further improvement in solving PDEs.

URL: https://openreview.net/forum?id=e2A07ruqDW

---

Title: AdaFed: Fair Federated Learning via Adaptive Common Descent Direction

Abstract: Federated learning (FL) is a promising technology via which some edge devices/clients
collaboratively train a machine learning model orchestrated by a server. Learning an unfair
model is a well-known critical problem in federated learning, where the trained model may
unfairly advantage or disadvantage some of the devices. To tackle this problem, in this work,
we propose AdaFed. The goal of AdaFed is to find an updating direction for the server along
which (i) all the clients’ loss functions are decreasing; and (ii) more importantly, the loss
functions for the clients with larger values decrease at a higher rate. AdaFed adaptively
tunes this common direction based on the values of local gradients and loss functions. We
validate the effectiveness of AdaFed on a suite of federated datasets, and demonstrate that
AdaFed outperforms state-of-the-art fair FL methods.

URL: https://openreview.net/forum?id=rFecyFpFUp

---

Title: Lightweight Vision Transformer Course-to-Fine Search via Latency Profiling

Abstract: Despite their impressive performance on various tasks, vision transformers (ViTs) are heavy for mobile vision applications. Recent works have proposed combining the strengths of ViTs and convolutional neural networks (CNNs) to build lightweight networks. Still, these approaches rely on hand-designed architectures with a pre-determined number of parameters. This requires re-running the training process to obtain different lightweight ViTs under different resource constraints.
In this work, we address the challenge of finding optimal lightweight ViTs given constraints on model size and computational cost using neural architecture search. Using the proposed method, we first train a supernet, which is a hybrid architecture of CNNs and transformers. To efficiently search for the optimal architecture, we use a search algorithm that considers both model parameters and on-device deployment latency. This method analyzes network properties, hardware memory access pattern, and degree of parallelism to directly and accurately estimate the network latency. To prevent the need for extensive testing during the search process, we use a lookup table based on a detailed breakdown of the speed of each component and operation, which can be reused to evaluate the whole latency of each search structure. Our approach leads to improved efficiency compared to testing the speed of the whole model during the search process.
With extensive experiments on ImageNet, we demonstrate that under similar parameters and FLOPs, our searched lightweight ViTs have higher accuracy and lower latency compared to state-of-the-art models. For example, our AutoViT\_XXS (71.3\% Top-1 accuracy and 10.2ms latency) has a 2.3\% higher accuracy and 4.5ms lower latency compared to MobileViT\_XXS (69.0\% Top-1 accuracy and 14.7ms latency).

URL: https://openreview.net/forum?id=sTdd0yCOZ2

---

Title: Data-Free Diversity-Based Ensemble Selection for One-Shot Federated Learning

Abstract: The emerging availability of various machine learning models creates a great demand to harness the collective intelligence of many independently well-trained models to improve overall performance. Considering the privacy concern and non-negligible communication costs, one-shot federated learning and ensemble learning in a data-free manner attract significant attention. However, conventional ensemble selection approaches are neither training efficient nor applicable to federated learning due to the risk of privacy leakage from local clients; meanwhile, the "many could be better than all" principle under data-free constraints makes it even more challenging. Therefore, it becomes crucial to design an effective ensemble selection strategy to find a good subset of the base models as the ensemble team for the federated learning scenario. In this paper, we propose a novel data-free diversity-based framework, DeDES, to address the ensemble selection problem with diversity consideration for models under the one-shot federated learning setting. Experimental results show that our method can achieve both better performance and higher efficiency over 5 datasets, 4 different model structures, and both homogeneous and heterogeneous model groups under four different data-partition strategies.

URL: https://openreview.net/forum?id=ORMlg4g3mG

---

Title: Probabilistic Forecasting with Coherent Aggregation

Abstract: Obtaining accurate probabilistic forecasts while respecting hierarchical information is an important operational challenge in many applications, perhaps most obviously in energy management, supply chain planning, and resource allocation. The basic challenge, especially for multivariate forecasting, is that forecasts are often required to be coherent with respect to the hierarchical structure. In this paper, we propose a new model which leverages a factor model structure to produce coherent forecasts by construction. This is a consequence of a simple (exchangeability) observation: permuting base-level series in the hierarchy does not change their aggregates. Our model uses a convolutional neural network to produce parameters for the factors, their loadings and base-level distributions; it produces samples which can be differentiated with respect to the model's parameters; and it can therefore optimize for any sample-based loss function, including the Continuous Ranked Probability Score and quantile losses. We can choose arbitrary continuous distributions for the factor and the base-level distributions. We compare our method to two previous methods which can be optimized end-to-end, while enforcing coherent aggregation. Our model achieves significant improvements: between $11.8 - 41.4\%$ on three hierarchical forecasting datasets. We also analyze the influence of parameters in our model with respect to base-level distribution and number of factors.
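
The "coherent by construction" property can be illustrated simply: sample the base-level series from any joint model (here a toy one-factor model) and compute the aggregates with the hierarchy's summation matrix, so the aggregation constraints hold for every sample. All numbers below are illustrative, not the paper's model.

import numpy as np

rng = np.random.default_rng(0)

# toy 2-level hierarchy: total = b1 + b2 + b3
S = np.array([[1, 1, 1],     # total
              [1, 0, 0],     # b1
              [0, 1, 0],     # b2
              [0, 0, 1]])    # b3

def sample_coherent(n_samples, rng=rng):
    """One-factor model for the 3 base series; aggregates are computed, not modelled."""
    factor = rng.standard_normal(n_samples)                   # shared latent factor
    loadings = np.array([1.0, 0.5, 0.2])
    base = loadings[None, :] * factor[:, None] + 0.1 * rng.standard_normal((n_samples, 3))
    return base @ S.T                                         # (n_samples, 4), coherent by construction

samples = sample_coherent(1000)
# check coherence: the total column equals the sum of the base columns, for every sample
print(np.allclose(samples[:, 0], samples[:, 1:].sum(axis=1)))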

URL: https://openreview.net/forum?id=B2LY1P3vrx

---

Title: Convergence of SGD for Training Neural Networks with Sliced Wasserstein Losses

Abstract: Optimal Transport has sparked vivid interest in recent years, in particular thanks to the Wasserstein distance, which provides a geometrically sensible and intuitive way of comparing probability measures. For computational reasons, the Sliced Wasserstein (SW) distance was introduced as an alternative to the Wasserstein distance, and has been used for training generative Neural Networks (NNs). While convergence of Stochastic Gradient Descent (SGD) has been observed practically in such a setting, there is to our knowledge no theoretical guarantee for this observation. Leveraging recent works on convergence of SGD on non-smooth and non-convex functions by Bianchi et al. (2022), we aim to bridge that knowledge gap, and provide a realistic context under which fixed-step SGD trajectories for the SW loss on NN parameters converge. More precisely, we show that the trajectories approach the set of (sub)-gradient flow equations as the step decreases. Under stricter assumptions, we show a much stronger convergence result for noised and projected SGD schemes, namely that the long-run limits of the trajectories approach a set of generalised critical points of the loss function.
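
For readers unfamiliar with the loss in question, below is a minimal Monte Carlo estimate of the squared Sliced Wasserstein-2 distance between two equally sized empirical samples; this is the standard construction, not the paper's training setup.

import numpy as np

rng = np.random.default_rng(0)

def sliced_wasserstein2(x, y, n_proj=100, rng=rng):
    """Average over random 1-D projections of the squared W2 distance between sorted samples."""
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        px, py = np.sort(x @ theta), np.sort(y @ theta)   # 1-D W2 via matched order statistics
        total += np.mean((px - py) ** 2)
    return total / n_proj

x = rng.standard_normal((512, 2))
y = rng.standard_normal((512, 2)) + np.array([1.0, 0.0])
print(sliced_wasserstein2(x, y))

Using equal sample sizes lets the one-dimensional optimal transport reduce to matching order statistics, which is what makes slicing cheap compared to the full Wasserstein distance.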

URL: https://openreview.net/forum?id=aqqfB3p9ZA

---

Title: Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo Line Search

Abstract: Stochastic gradient descent (SGD) is the simplest deep learning optimizer with which to train deep neural networks. While SGD can use various learning rates, such as constant or diminishing rates, previous numerical results showed that SGD performs better than other deep learning optimizers when it uses learning rates given by line search methods. In this paper, we perform a convergence analysis on SGD with a learning rate given by an Armijo line search for nonconvex optimization. The analysis indicates that the upper bound of the expectation of the squared norm of the full gradient becomes small when the number of steps and the batch size are large. Next, we show that, for SGD with the Armijo-line-search learning rate, the number of steps needed for nonconvex optimization is a monotone decreasing convex function of the batch size; that is, the number of steps needed for nonconvex optimization decreases as the batch size increases. Furthermore, we show that the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, is a convex function of the batch size; that is, there exists a critical batch size that minimizes the SFO complexity. Finally, we provide numerical results that support our theoretical results. The numerical results indicate that the number of steps needed for training deep neural networks decreases as the batch size increases and that there exist critical batch sizes that can be estimated from the theoretical results.
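
For concreteness, the learning-rate rule in question is classical Armijo backtracking evaluated on the current mini-batch; here is a toy SGD step for a least-squares problem (all settings are illustrative, not the paper's experimental setup).

import numpy as np

rng = np.random.default_rng(0)

# toy least-squares problem
A, b = rng.standard_normal((256, 10)), rng.standard_normal(256)

def batch_loss_grad(w, idx):
    Ai, bi = A[idx], b[idx]
    r = Ai @ w - bi
    return 0.5 * np.mean(r ** 2), Ai.T @ r / len(idx)

def armijo_sgd_step(w, batch_size=32, lr0=1.0, c=1e-4, shrink=0.5, rng=rng):
    """Backtrack from lr0 until the Armijo condition holds on the same mini-batch."""
    idx = rng.choice(len(b), size=batch_size, replace=False)
    loss, grad = batch_loss_grad(w, idx)
    lr = lr0
    while batch_loss_grad(w - lr * grad, idx)[0] > loss - c * lr * grad @ grad:
        lr *= shrink
    return w - lr * grad, lr

w = np.zeros(10)
for _ in range(5):
    w, lr = armijo_sgd_step(w)
    print(round(lr, 4), round(batch_loss_grad(w, np.arange(len(b)))[0], 4))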

URL: https://openreview.net/forum?id=fAdPby5vZg

---

Title: QDC: Quantum Diffusion Convolution Kernels on Graphs

Abstract: Graph convolutional neural networks (GCNs) operate by aggregating messages over local neighborhoods given the prediction task of interest. Many GCNs can be understood as a form of generalized diffusion of input features on the graph, and significant work has been dedicated to improving predictive accuracy by altering the ways of message passing. In this work, we propose a new convolution kernel that effectively rewires the graph according to the occupation correlations of the vertices by trading the generalized diffusion paradigm for the propagation of a quantum particle over the graph. We term this new convolution kernel the Quantum Diffusion Convolution (QDC) operator. In addition, we introduce a multiscale variant that combines messages from the QDC operator and the traditional combinatorial Laplacian. To understand our method, we explore the spectral dependence of homophily and the importance of quantum dynamics in the construction of a bandpass filter. Through these studies, as well as experiments on a range of datasets, we observe that QDC improves predictive performance on the widely used benchmark datasets when compared to similar methods.

URL: https://openreview.net/forum?id=uXGUSX8GoY

---

Title: $f$-MICL: Understanding and Generalizing InfoNCE-based Contrastive Learning

Abstract: In self-supervised contrastive learning, a widely-adopted objective function is InfoNCE, which uses the heuristic cosine similarity for the representation comparison, and is closely related to maximizing the Kullback-Leibler (KL)-based mutual information. In this paper, we aim at answering two intriguing questions: (1) Can we go beyond the KL-based objective? (2) Besides the popular cosine similarity, can we design a better similarity function? We provide answers to both questions by generalizing the KL-based mutual information to the $f$-Mutual Information in Contrastive Learning ($f$-MICL) using the $f$-divergences. To answer the first question, we provide a wide range of $f$-MICL objectives which share the nice properties of InfoNCE (e.g., alignment and uniformity), and meanwhile result in similar or even superior performance. For the second question, assuming that the joint feature distribution is proportional to the Gaussian kernel, we derive an $f$-Gaussian similarity with better interpretability and empirical performance. Finally, we identify close relationships between the $f$-MICL objective and several popular InfoNCE-based objectives. Using benchmark tasks from both vision and natural language, we empirically evaluate $f$-MICL with different $f$-divergences on various architectures (SimCLR, MoCo, and MoCo v3) and datasets. We observe that $f$-MICL generally outperforms the benchmarks and the best-performing $f$-divergence is task and dataset dependent.

URL: https://openreview.net/forum?id=ZD03VUZmRx

---

Title: Learning to reconstruct signals from binary measurements alone

Abstract: Recent advances in unsupervised learning have highlighted the possibility of learning to reconstruct signals from noisy and incomplete linear measurements alone. These methods play a key role in medical and scientific imaging and sensing, where ground truth data is often scarce or difficult to obtain. However, in practice measurements are not only noisy and incomplete but also quantized. Here we explore the extreme case of learning from binary observations and provide necessary and sufficient conditions on the number of measurements required for identifying a set of signals from incomplete binary data. Our results are complementary to existing bounds on signal recovery from binary measurements. Furthermore, we introduce a novel self-supervised learning approach, which we name SSBM, that only requires binary data for training. We demonstrate in a series of experiments with real datasets that SSBM performs on par with supervised learning and outperforms sparse reconstruction methods with a fixed wavelet basis by a large margin.

URL: https://openreview.net/forum?id=ioFIAQOBOS

---

Title: Conformal prediction under ambiguous ground truth

Abstract: In safety-critical classification tasks, conformal prediction allows one to perform rigorous uncertainty quantification by providing confidence sets including the true class with a user-specified probability. This generally assumes the availability of a held-out calibration set with access to ground truth labels. Unfortunately, in many domains, such labels are difficult to obtain and usually approximated by aggregating expert opinions. In fact, this holds true for almost all datasets, including well-known ones such as CIFAR and ImageNet. Applying conformal prediction using such labels underestimates uncertainty. Indeed, when expert opinions are not resolvable, there is inherent ambiguity present in the labels. That is, we do not have ``crisp'', definitive ground truth labels and this uncertainty should be taken into account during calibration. In this paper, we develop a conformal prediction framework for such ambiguous ground truth settings which relies on an approximation of the underlying posterior distribution of labels given inputs. We demonstrate our methodology on synthetic and real datasets, including a case study of skin condition classification in dermatology.
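
As background, standard split conformal prediction with "crisp" labels computes a quantile of nonconformity scores on a calibration set and thresholds test-time scores; the ambiguous-label setting above replaces the hard calibration labels with a posterior over labels. A minimal crisp-label sketch (all data is synthetic):

import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 500, 4, 0.1

# toy calibration data: softmax scores and true labels
logits = rng.standard_normal((n_cal, n_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = np.array([rng.choice(n_classes, p=p) for p in probs])

# nonconformity score: 1 - probability assigned to the true class
scores = 1.0 - probs[np.arange(n_cal), labels]
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
qhat = np.quantile(scores, q_level, method="higher")

def prediction_set(p):
    """All classes whose score 1 - p_k does not exceed the calibrated threshold."""
    return np.where(1.0 - p <= qhat)[0]

p_test = probs[0]                      # pretend this is a new test example
print(prediction_set(p_test), qhat)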

URL: https://openreview.net/forum?id=CAd6V2qXxc

---

Title: Addressing caveats of neural persistence with deep graph persistence

Abstract: Neural Persistence is a prominent measure for quantifying neural network complexity, proposed in the emerging field of topological data analysis in deep learning. In this work, however, we find both theoretically and empirically that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence. Whilst this captures useful information for linear classifiers, we find that no relevant spatial structure is present in later layers of deep neural networks, making neural persistence roughly equivalent to the variance of weights. Additionally, the proposed averaging procedure across layers for deep neural networks does not consider interaction between layers. Based on our analysis, we propose an extension of the filtration underlying neural persistence to the whole neural network instead of single layers, which is equivalent to calculating neural persistence on one particular matrix. This yields our deep graph persistence measure, which implicitly incorporates persistent paths through the network and alleviates variance-related issues through standardisation. Code will be publicly released upon acceptance of the paper.

URL: https://openreview.net/forum?id=oyfRWeoUJY

---

Title: Out-of-Distribution Optimality of Invariant Risk Minimization

Abstract: Deep Neural Networks often inherit spurious correlations embedded in training data and hence may fail to generalize to unseen domains, which have different distributions from the domains that provide the training data. M. Arjovsky et al. (2019) introduced the concept of out-of-distribution (o.o.d.) risk, which is the maximum risk among all domains, and formulated the issue caused by spurious correlations as a minimization problem of the o.o.d. risk. Invariant Risk Minimization (IRM) is considered to be a promising approach to minimize the o.o.d. risk: IRM estimates a minimum of the o.o.d. risk by solving a bi-level optimization problem. While IRM has attracted considerable attention with empirical success, it comes with few theoretical guarantees. In particular, a solid theoretical guarantee that the bi-level optimization problem gives the minimum of the o.o.d. risk has not yet been established. Aiming at providing a theoretical justification for IRM, this paper rigorously proves that a solution to the bi-level optimization problem minimizes the o.o.d. risk under certain conditions. The result also provides sufficient conditions on distributions providing training data and on a dimension of feature space for the bi-level optimization problem to minimize the o.o.d. risk.

URL: https://openreview.net/forum?id=pWsfWDnJDa

---

Title: 3D Molecular Generation via Virtual Dynamics

Abstract: Structure-based drug design, a critical aspect of drug discovery, aims to identify high-affinity molecules for target protein pockets. Traditional virtual screening methods, which involve exhaustive searches within large molecular databases, are inefficient and limited in discovering novel molecules. The pocket-based 3D molecular generation model offers a promising alternative by directly generating molecules with 3D structures and binding positions in the pocket. In this paper, we present VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen features a series of carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket, VD-Gen randomly initializes multiple virtual particles within the pocket and learns to iteratively move them to approximate the distribution of molecular atoms in 3D space. After the iterative movement, a 3D molecule is extracted and further refined through additional iterative movement, yielding a high-quality 3D molecule with a confidence score. Comprehensive experimental results on pocket-based molecular generation demonstrate that VD-Gen can generate novel 3D molecules that fill the target pocket cavity with high binding affinities, significantly outperforming previous baselines.

URL: https://openreview.net/forum?id=QvipGVdE6L

---

Title: Non-asymptotic approximations of Gaussian neural networks via second-order Poincar\'e inequalities

Abstract: There is a growing interest in large-width asymptotic and non-asymptotic properties of deep Gaussian neural networks (NNs), namely NNs with weights initialized as Gaussian distributions. For a Gaussian NN of depth $L\geq1$ and width $n\geq1$, a well-established result is that, as $n\rightarrow+\infty$, the NN's output converges in distribution to a Gaussian process. Recently, quantitative versions of this CLT have been obtained by exploiting the recursive structure of the NN and its infinitely wide Gaussian limit, showing that the NN's output converges to its Gaussian limit at the rate $n^{-1/2}$, in the $2$-Wasserstein distance, as well as in some convex distances. In this paper, we investigate the use of second-order Gaussian Poincar\'e inequalities to obtain quantitative CLTs for the NN's output, showing their pros and cons in this new field of application. For shallow Gaussian NNs, i.e. $L=1$, we show how second-order Poincar\'e inequalities provide a powerful tool, reducing the problem of establishing quantitative CLTs to the algebraic problem of computing the gradient and the Hessian of the NN's output, and lead to the rate of convergence $n^{-1/2}$ in the $1$-Wasserstein distance. In contrast, for deep Gaussian NNs, i.e. $L\geq2$, the use of second-order Poincar\'e inequalities turns out to be more problematic. By relying on exact computations of the gradient and the Hessian of the NN's output, which is a non-trivial task due to its (algebraic) complexity that increases with $L$, we show that for $L=2$ second-order Poincar\'e inequalities still lead to a quantitative CLT in the $1$-Wasserstein distance, though with the rate of convergence $n^{-1/4}$, and we conjecture the same rate for any depth $L\geq2$. Such a worsening in the rate is a peculiar feature of the use of second-order Poincar\'e inequalities, which are designed to be applied directly to the NN's output as a function of all the previous layers, hence not exploiting the recursive structure of the NN and/or its infinitely wide Gaussian limit. While this is a negative result relative to the state-of-the-art, it does not diminish the value of second-order Poincar\'e inequalities, which we show remain effective in establishing a quantitative CLT for a complicated functional of Gaussian processes such as the deep Gaussian NN.

URL: https://openreview.net/forum?id=OUn4ezaMem

---

Title: Dual-windowed Vision Transformer with Angular Self-Attention

Abstract: Following the great success in natural language processing, transformer-based models have emerged as competitive alternatives to convolutional neural networks in computer vision. Vision transformer (ViT) and its subsequent variants have exhibited promising performance in tasks such as image classification, object detection and semantic segmentation. The core of vision transformers is the self-attention mechanism, which models the long-range dependency of different tokens. Conventionally, the attention matrix in self-attention is calculated by the scaled dot-product of \textit{query} (Q) and \textit{key} (K). In this case, the attention weight depends on the norms of Q and K as well as the angle between them. In this paper, we propose a new attention mechanism named angular self-attention, which replaces the scaled dot-product operation with an angular function in order to effectively model the relationship between tokens. In particular, we propose two variants of angular self-attention, based on quadratic and cosine functions. Based on angular self-attention, we design a new vision transformer architecture called dual-windowed angular vision transformer (\textbf{DWAViT}). DWAViT is a hierarchically structured model characterized by the angular self-attention and a new local window mechanism. We evaluate DWAViT on multiple computer vision benchmarks, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20k. We also validate the effectiveness of our angular self-attention by investigating the performance of vision transformers with the scaled dot-product operation replaced by our angular function on several tasks.
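
As an illustration only, a minimal cosine-based reading of angular attention might look like the sketch below; the paper's exact quadratic and cosine formulations, and whether a softmax is retained, are not specified here, so every design choice in the snippet is an assumption.

    import torch
    import torch.nn.functional as F

    def angular_attention(q, k, v, eps=1e-6):
        # Normalize queries and keys so only the angle between them matters.
        qn = F.normalize(q, dim=-1)
        kn = F.normalize(k, dim=-1)
        cos = (qn @ kn.transpose(-2, -1)).clamp(-1.0 + eps, 1.0 - eps)
        theta = torch.acos(cos)                  # pairwise angles in [0, pi]
        scores = 1.0 - theta / torch.pi          # larger when query and key align
        attn = F.softmax(scores, dim=-1)         # softmax kept here as an assumption
        return attn @ v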

URL: https://openreview.net/forum?id=pK6FkQv1Hq

---

Title: UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

Abstract: Large Language Models (LLMs) have brought the ambitious quest for generalist agents significantly closer to reality. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is to unify models, allowing the support of a myriad of tasks and modalities while scaling easily. While a few large models (e.g., Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more than two modalities, current small to mid-scale unified models are still limited to two modalities (e.g., image-text or video-text). The question that we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on fancy dataset sizes or models with billions of parameters, the ~0.25B parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows performance competitive with existing state-of-the-art approaches across image and video-text tasks. The representation learned from image and video-text modalities allows the model to achieve performance competitive with the state of the art when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing its benefits for out-of-distribution generalization. We motivate unification by showing the synergy between tasks. The model weights and code will be open-sourced.

URL: https://openreview.net/forum?id=4uflhObpcp

---

Title: Replay-enhanced Continual Reinforcement Learning

Abstract: Replaying past experiences has proven to be a highly effective approach for averting catastrophic forgetting in supervised continual learning. However, some crucial factors are still largely ignored, making it vulnerable to serious failure, when used as a solution to forgetting in continual reinforcement learning, even in the context of perfect memory where all data of previous tasks are accessible in the current task. On the one hand, since most reinforcement learning algorithms are not invariant to the reward scale, the previously well-learned tasks (with high rewards) may appear to be more salient to the current learning process than the current task (with small initial rewards). This causes the agent to concentrate on those salient tasks at the expense of generality on the current task. On the other hand, the off-policy learning on replayed tasks while learning a new task may induce policy drift on the old tasks, thus exacerbating forgetting. In this paper, we introduce RECALL, a replay-enhanced method that greatly avoids catastrophic forgetting while ensuring effective learning of the current task in continual reinforcement learning. RECALL leverages adaptive normalization on approximate targets and policy distillation on old tasks to enhance generality and stability, respectively. Extensive experiments on the Continual World benchmark show that RECALL performs significantly better than purely perfect memory replay, and achieves better overall performance compared with state-of-the-art continual learning methods.

URL: https://openreview.net/forum?id=91hfMEUukm

---

Title: Neural Implicit Manifold Learning for Topology-Aware Density Estimation

Abstract: Natural data observed in $\mathbb{R}^n$ is often constrained to an $m$-dimensional manifold $\mathcal{M}$, where $m < n$. This work focuses on the task of building theoretically principled generative models for such data. Current generative models learn $\mathcal{M}$ by mapping an $m$-dimensional latent variable through a neural network $f_\theta: \mathbb{R}^m \to \mathbb{R}^n$. These procedures, which we call pushforward models, incur a straightforward limitation: manifolds cannot in general be represented with a single parameterization, meaning that attempts to do so will incur either computational instability or the inability to learn probability densities within the manifold. To remedy this problem, we propose to model $\mathcal{M}$ as a neural implicit manifold: the set of zeros of a neural network. We then learn the probability density within $\mathcal{M}$ with a constrained energy-based model, which employs a constrained variant of Langevin dynamics to train and sample from the learned manifold. In experiments on synthetic and natural data, we show that our model can learn manifold-supported distributions with complex topologies more accurately than pushforward models.
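
A minimal sketch of one constrained Langevin step on an implicit manifold $\{x : F(x) = 0\}$ is given below, assuming a PyTorch network F mapping R^n to R^k and a scalar energy function; the tangent projection and Newton-style pull-back are generic textbook ingredients, not necessarily the paper's exact scheme.

    import torch

    def constrained_langevin_step(x, F, energy, step=1e-2, newton_iters=3):
        # Langevin proposal in ambient space, projected onto the tangent space
        # of {F = 0}, then pulled back to the zero set by a few Gauss-Newton
        # corrections. Noise scaling is simplified for illustration.
        x = x.detach().requires_grad_(True)
        g = torch.autograd.grad(energy(x), x)[0]
        v = -step * g + (2.0 * step) ** 0.5 * torch.randn_like(x)
        J = torch.autograd.functional.jacobian(F, x)            # shape (k, n)
        lam = torch.linalg.solve(J @ J.T, J @ v)
        x_new = (x + v - J.T @ lam).detach()                     # tangent step
        for _ in range(newton_iters):                            # project back to F = 0
            r = F(x_new)
            Jn = torch.autograd.functional.jacobian(F, x_new)
            x_new = x_new - Jn.T @ torch.linalg.solve(Jn @ Jn.T, r)
        return x_new.detach()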

URL: https://openreview.net/forum?id=lTOku838Zv

---

Title: Improving Continual Learning by Accurate Gradient Reconstructions of the Past

Abstract: Regularization and experience replay are two popular continual-learning strategies with complementary strengths: while regularization requires less memory, replay can more accurately mimic batch training. But can we combine them to get better methods? Despite the simplicity of the question, little is known or done to optimally combine these approaches. In this paper, we present such a method by using a recently proposed principle of adaptation that relies on a faithful reconstruction of the gradients of the past data. Using this principle, we design a prior which combines two types of replay methods with a quadratic weight-regularizer and achieves better gradient reconstructions. The combination improves performance on standard benchmarks such as Split-CIFAR, SplitTinyImageNet, and ImageNet-1000, achieving $>\!80\%$ of the batch performance by simply utilizing a memory of $<\!10\%$ of the past data. Our work shows that a good combination of replay and regularizer-based methods can be very effective in reducing forgetting, and can sometimes even completely eliminate it.

URL: https://openreview.net/forum?id=b1fpfCjja1

---

Title: NOFLITE: Learning to Predict Individual Treatment Effect Distributions

Abstract: Estimating the effect of a treatment on an individual's outcome of interest is an important challenge in various fields, such as healthcare, economics, marketing, and education. Previous work in machine learning has focused on estimating the expected value of the treatment effect. However, effective personalized decision-making requires more than just the treatment's expected effect; it requires the entire treatment effect distribution. Knowing this distribution allows analyzing the treatment's expected utility or quantifying the uncertainty regarding a treatment's effect. This information is essential for prescribing optimal treatments. The ability of a model to predict accurate individual treatment effect distributions is captured by its likelihood. In light of this, we propose a novel neural architecture, NOFLITE, that uses normalizing flows to directly optimize this likelihood, while simultaneously learning flexible estimates of the individual treatment effect distribution. Experiments on various semi-synthetic data sets show that NOFLITE outperforms existing methods in terms of log-likelihood and performs competitively in terms of PEHE and IoU.

URL: https://openreview.net/forum?id=EjqopDxLbG

---

Title: Robust Contextual Linear Bandits

Abstract: Model misspecification is a major consideration in applications of statistical methods and machine learning. However, it is often neglected in contextual bandits. This paper studies a common form of misspecification, an inter-arm heterogeneity that is not captured by context. To address this issue, we assume that the heterogeneity arises due to arm-specific random variables, which can be learned. We call this setting a robust contextual bandit. The arm-specific variables explain the unknown inter-arm heterogeneity, and we incorporate them in the robust contextual estimator of the mean reward and its uncertainty. We develop two efficient bandit algorithms for our setting: a UCB algorithm called RoLinUCB and a posterior-sampling algorithm called RoLinTS. We analyze both algorithms and bound their $n$-round Bayes regret. Our experiments show that RoLinTS is comparably statistically efficient to the classic methods when the misspecification is low, more robust when the misspecification is high, and significantly more computationally efficient than its naive implementation.

URL: https://openreview.net/forum?id=g7aOs1xFcl

---

Title: Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling

Abstract: We revisit the classical problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions. The optimal complexity of stochastic first-order methods in terms of the number of gradient evaluations of individual functions is $\mathcal{O}\left(n + n^{1/2}\varepsilon^{-1}\right)$, attained by the optimal SGD methods SPIDER (Fang et al., 2018) and PAGE (Li et al., 2021), for example, where $\varepsilon$ is the error tolerance. However, i) the big-$\mathcal{O}$ notation hides crucial dependencies on the smoothness constants associated with the functions, and ii) the rates and theory in these methods assume simplistic sampling mechanisms that do not offer any flexibility. In this work, we remedy the situation. First, we generalize the PAGE (Li et al., 2021) algorithm so that it can provably work with virtually any (unbiased) sampling mechanism. This is particularly useful in federated learning, as it allows us to construct and better understand the impact of various combinations of client and data sampling strategies. Second, our analysis is sharper as we make explicit use of certain novel inequalities that capture the intricate interplay between the smoothness constants and the sampling procedure. Indeed, our analysis is better even for the simple sampling procedure analyzed in the PAGE (Li et al., 2021) paper. However, this already improved bound can be further sharpened by a different sampling scheme which we propose. In summary, we provide the most general and most accurate analysis of optimal SGD in the smooth nonconvex regime. Finally, our theoretical findings are supported by carefully designed experiments.
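
To make the starting point concrete, here is a rough sketch of the PAGE gradient estimator of Li et al. (2021) that this paper generalizes; the flexible client/data sampling studied in the paper is abstracted here into a plain uniform minibatch draw, so the snippet is illustrative only.

    import numpy as np

    def page(grad_i, n, x0, lr=0.1, p=0.2, batch=8, steps=200, seed=0):
        # grad_i(x, i): gradient of the i-th component function at x.
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        g = np.mean([grad_i(x, i) for i in range(n)], axis=0)    # full gradient
        for _ in range(steps):
            x_new = x - lr * g
            if rng.random() < p:                                 # occasional full pass
                g = np.mean([grad_i(x_new, i) for i in range(n)], axis=0)
            else:                                                # cheap correction step
                S = rng.choice(n, size=batch, replace=False)
                g = g + np.mean([grad_i(x_new, i) - grad_i(x, i) for i in S], axis=0)
            x = x_new
        return x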

URL: https://openreview.net/forum?id=zKgJ6TWAFE

---

Title: Debiasing, calibrating, and improving Semi-supervised Learning performance via simple Ensemble Projector

Abstract: Recent studies on semi-supervised learning (SSL) have achieved great success.
Despite their promising performance, current state-of-the-art methods tend toward increasingly complex designs at the cost of introducing more network components and additional training procedures.
In this paper, we propose a simple method named Ensemble Projectors Aided for Semi-supervised Learning (EPASS), which focuses mainly on improving the learned embeddings to boost the performance of the existing contrastive joint-training semi-supervised learning frameworks.
Unlike standard methods, where the learned embeddings from one projector are stored in memory banks to be used with contrastive learning, EPASS stores the ensemble embeddings from multiple projectors in memory banks.
As a result, EPASS improves generalization, strengthens feature representation, and boosts performance.
For instance, EPASS improves strong baselines for semi-supervised learning by 39.47\%/31.39\%/24.70\% top-1 error rate, while using only 100k/1\%/10\% of labeled data for SimMatch, and achieves 40.24\%/32.64\%/25.90\% top-1 error rate for CoMatch on the ImageNet dataset.
These improvements are consistent across methods, network architectures, and datasets, proving the general effectiveness of the proposed methods.

URL: https://openreview.net/forum?id=nkuIcZO2ZB

---

Title: One-Round Active Learning through Data Utility Learning and Proxy Models

Abstract: While active learning (AL) techniques have demonstrated the potential to produce high-performance models with fewer labeled data, their application remains limited due to the necessity for multiple rounds of interaction with annotators. This paper studies the problem of one-round AL, which aims at selecting a subset of unlabeled points and querying their labels \emph{all at once}. A fundamental challenge is how to measure the utility of different choices of labeling queries for learning a target model. Our key idea is to learn such a utility metric from a small initial labeled set. We demonstrate that our approach leads to state-of-the-art performance on various AL benchmarks and is more robust to the lack of initial labeled data.

In addition to algorithmic development and evaluation, we introduce a novel metric for quantifying `\emph{utility transferability}' -- the degree of correlation between the performance changes of two learning algorithms due to variations in training data selection. Previous studies have often observed a notable utility transferability between models, even those with differing complexities. Such transferability enabled our approach, as well as other techniques such as coresets, hyperparameter tuning, and data valuation, to scale up to more sophisticated target models by substituting them with smaller proxy models. Nevertheless, utility transferability has not yet been rigorously defined within a formal mathematical framework, a gap that our work addresses innovatively. We further propose two Monte Carlo-based methods for efficiently comparing utility transferability for different proxy models, thereby facilitating a more informed selection of proxy models.

URL: https://openreview.net/forum?id=8HQCOMRa7g

---

Title: Revisiting Image Classifier Training for Improved Certified Robust Defense against Adversarial Patches

Abstract: Certifiably robust defenses against adversarial patches for image classifiers ensure correct prediction against any changes to a constrained neighborhood of pixels. PatchCleanser, the state-of-the-art certified defense, uses a double-masking strategy for robust classification. The success of this strategy relies heavily on the model's invariance to image pixel masking. In this paper, we take a closer look at model training schemes to improve this invariance. Instead of using Random Cutout augmentations like PatchCleanser, we introduce the notion of worst-case masking, i.e., selecting masked images which maximize classification loss. However, finding worst-case masks requires an exhaustive search, which might be prohibitively expensive to do on-the-fly during training. To solve this problem, we propose a two-round greedy masking strategy (Greedy Cutout) which finds an approximate worst-case mask location with much less compute. We show that models trained with our Greedy Cutout improve certified robust accuracy over Random Cutout in PatchCleanser across a range of datasets and architectures. Certified robust accuracy on ImageNet with a ViT-B16-224 model increases from 58.1% to 62.3% against a 3% square patch applied anywhere on the image.
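
The description suggests something along the lines of the following two-round search; the mask size, grid strides, and use of cross-entropy as the ranking loss are assumptions for illustration, not the paper's exact procedure.

    import torch
    import torch.nn.functional as F

    def greedy_cutout(model, image, label, mask=64, coarse=56, fine=14):
        # Return an approximate worst-case top-left corner (y, x) for a square mask.
        # image: (C, H, W) tensor; label: 0-dim LongTensor.
        def loss_at(y, x):
            masked = image.clone()
            masked[..., y:y + mask, x:x + mask] = 0.0
            return F.cross_entropy(model(masked.unsqueeze(0)), label.view(1)).item()

        H, W = image.shape[-2:]
        grid1 = [(y, x) for y in range(0, H - mask + 1, coarse)
                        for x in range(0, W - mask + 1, coarse)]
        y0, x0 = max(grid1, key=lambda p: loss_at(*p))           # round 1: coarse scan
        grid2 = [(y, x)
                 for y in range(max(0, y0 - coarse), min(H - mask, y0 + coarse) + 1, fine)
                 for x in range(max(0, x0 - coarse), min(W - mask, x0 + coarse) + 1, fine)]
        return max(grid2, key=lambda p: loss_at(*p))             # round 2: local refinement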

URL: https://openreview.net/forum?id=2tdhQMLg36

---

Title: A Quantitative Approach to Predicting Representational Learning and Performance in Neural Networks

Abstract: A key property of neural networks (both biological and artificial) is how they learn to represent and manipulate input information in order to solve a task. Different types of representations may be suited to different types of tasks, making identifying and understanding learned representations a critical part of understanding and designing useful networks. In this paper, we introduce a new pseudo-kernel based tool for analyzing and predicting learned representations, based only on the initial conditions of the network and the training curriculum. We validate the method on a simple test case, before demonstrating its use on a question about the effects of representational learning on sequential single versus concurrent multitask performance. We show that our method can be used to predict the effects of the scale of weight initialization and training curriculum on representational learning and downstream concurrent multitasking performance.

URL: https://openreview.net/forum?id=dXx8vzRrRX

---

Title: Partial Optimal Transport for Support Subset Selection

Abstract: In probabilistic terms, optimal transport aims to find a joint distribution that couples two distributions, which minimizes the cost of transforming one distribution to another. Any feasible coupling necessarily maintains the support of both distributions. However, maintaining the entire support is not ideal when only a subset of one of the distributions, namely the source, is assumed to align with the other target distribution. For these cases, which are common in machine learning applications, we propose to relax the constraints on the joint distribution, allowing it to underrepresent a subset of the source by overrepresenting other subsets of the source by a constant factor. This is a special case of partial optimal transport. In the discrete distribution case, such as in the case of two samples from continuous random variables, optimal transport with the relaxed constraints is a linear program. When sufficiently relaxed, the solution has a source marginal with only a subset of its original support. We demonstrate the usefulness of this support subset selection in applications such as color transfer, partial point cloud alignment, and semi-supervised machine learning, where a part of data is curated to have reliable labels and another part is unlabeled or has unreliable labels. Our experiments show that optimal transport under the relaxed constraint can improve the performance of these applications by allowing for more flexible alignment between distributions.
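
Because the relaxed problem remains a linear program in the discrete case, it can be prototyped directly with an off-the-shelf LP solver. The sketch below is one plausible reading of the relaxation (source marginal bounded above by a constant factor s times the original weights, target marginal kept exact); the paper's precise constraint set may differ.

    import numpy as np
    from scipy.optimize import linprog

    def relaxed_ot(C, a, b, s=2.0):
        # C: (n, m) cost matrix; a, b: source/target weights summing to one.
        n, m = C.shape
        A_eq = np.zeros((m, n * m))            # column sums of P equal b
        for j in range(m):
            A_eq[j, j::m] = 1.0
        A_ub = np.zeros((n, n * m))            # row sums of P at most s * a
        for i in range(n):
            A_ub[i, i * m:(i + 1) * m] = 1.0
        res = linprog(C.ravel(), A_ub=A_ub, b_ub=s * a, A_eq=A_eq, b_eq=b,
                      bounds=(0, None), method="highs")
        return res.x.reshape(n, m)             # source rows with zero mass drop out of the support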

URL: https://openreview.net/forum?id=75CcopPxIr

---

Title: DINOv2: Learning Robust Visual Features without Supervision

Abstract: The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP, on most of the benchmarks at the image and pixel levels.

URL: https://openreview.net/forum?id=a68SUt6zFt

---

Title: Expanding continual few-shot learning benchmarks to include recognition of specific instances

Abstract: Continual learning and few-shot learning are important frontiers in progress towards broader Machine Learning (ML) capabilities. There is a growing body of work in both, but few works combining the two. One exception is the Continual Few-shot Learning (CFSL) framework of Antoniou et al. (2020). In this study, we extend CFSL in two ways that capture a broader range of challenges, important for intelligent agent behaviour in real-world conditions. First, we modify CFSL to make it more comparable to standard continual learning experiments, where usually a much larger number of classes are presented. Second, we introduce an ‘instance test’ which requires recognition of specific instances of classes – a capability of animal cognition that is usually neglected in ML. For an initial exploration of ML model performance under these conditions, we selected representative baseline models from the original CFSL work and added a model variant with replay. As expected, learning more classes is more difficult than the original CFSL experiments, and interestingly, the way in which image instances and classes are presented affects classification performance. Surprisingly, accuracy in the baseline instance test is comparable to other classification tasks, but poor given significant occlusion and noise. The use of replay for consolidation improves performance substantially for both types of tasks, but particularly the instance test.

URL: https://openreview.net/forum?id=aPyXAqeg3r

---

Title: Revisiting Generalized p-Laplacian Regularized Framelet GCNs: Convergence, Energy Dynamic and as Non-Linear Diffusion

Abstract: This paper presents a comprehensive theoretical analysis of the graph p-Laplacian regularized framelet network (pL-UFG) to establish a solid understanding of its properties. We conduct a convergence analysis on pL-UFG, addressing the gap in the understanding of its asymptotic behaviors. Further, by investigating the generalized Dirichlet energy of pL-UFG, we demonstrate that the Dirichlet energy remains non-zero throughout convergence, ensuring the avoidance of over-smoothing issues. Additionally, we elucidate the energy dynamic perspective, highlighting the synergistic relationship between the implicit layer in pL-UFG and graph framelets. This synergy enhances the model's adaptability to both homophilic and heterophilic data. Notably, we reveal that pL-UFG can be interpreted as a generalized non-linear diffusion process, thereby bridging the gap between pL-UFG and differential equations on the graph. Importantly, these multifaceted analyses lead to unified conclusions that offer novel insights for understanding and implementing pL-UFG, as well as other graph neural network (GNN) models. Finally, based on our dynamic analysis, we propose two novel pL-UFG models with manually controlled energy dynamics. We demonstrate empirically and theoretically that our proposed models not only inherit the advantages of pL-UFG but also significantly reduce computational costs for training on large-scale graph datasets.

URL: https://openreview.net/forum?id=VCXDpyxkEj

---

Title: Revisiting Sparsity Hunting in Federated Learning: Why the Sparsity Consensus Matters?

Abstract: Edge devices can benefit remarkably from federated learning due to their distributed nature; however, their limited resource and compute power pose limitations in deployment. A possible solution to this problem is to utilize off-the-shelf sparse learning algorithms at the clients to meet their resource budget. However, such naive deployment causes significant accuracy degradation, especially for highly resource-constrained clients. In particular, our investigations reveal that the lack of consensus in the trained sparsity masks among the clients may potentially lead to slower convergence of the global model, which can further result in a substantial accuracy drop.
Based on our observations, we present \textit{federated lottery aware sparsity hunting} (FLASH), a unified sparse learning framework for training a sparse sub-model that maintains the performance under ultra low parameter density while yielding proportional communication benefits.
Moreover, given that different clients may have different resource budgets, we further extend our algorithm to present \textit{hetero-FLASH} where instead of supporting only one target parameter density, clients can take different density budgets based on their device resource limitations. Experimental evaluations on diverse models and datasets show the superiority of FLASH in closing the gap with an unpruned baseline while yielding up to $\mathord{\sim}10.1\%$ improved accuracy with $\mathord{\sim}10.26\times$ fewer communication costs, compared to existing alternatives, at similar hyperparameter settings.

URL: https://openreview.net/forum?id=iHyhdpsnyi

---

Title: Physics informed neural networks for elliptic equations with oscillatory differential operators

Abstract: Physics informed neural network (PINN) based solution methods for differential equations have recently shown success in a variety of scientific computing applications. Several authors have reported difficulties, however, when using PINNs to solve equations with multiscale features. The objective of the present work is to illustrate and explain the difficulty of using standard PINNs for the particular case of elliptic partial differential equations (PDEs) with oscillatory coefficients. We show that if the coefficient in the elliptic operator contains frequencies on the order of $1/\epsilon$, then the Frobenius norm of the neural tangent kernel matrix associated to the loss function grows as $1/\epsilon^2$. This implies that as the separation of scales in the problem increases, training the neural network with gradient descent based methods to achieve an accurate approximation of the solution to the PDE becomes increasingly difficult. Numerical examples illustrate the stiffness of the optimization problem.


URL: https://openreview.net/forum?id=QfyVqvpg7u

---

Title: Noncommutative $C^*$-algebra Net: Learning Neural Networks with Powerful Product Structure in $C^*$-algebra

Abstract: We propose a new generalization of neural networks with noncommutative $C^*$-algebra.
An important feature of $C^*$-algebras is their noncommutative structure of products, but the existing $C^*$-algebra net frameworks have only considered commutative $C^*$-algebras.
We show that this noncommutative structure of $C^*$-algebras induces powerful effects in learning neural networks.
Our framework has a wide range of applications, such as learning multiple related neural networks simultaneously with interactions and learning invariant features with respect to group actions.
We also show the validity of our framework numerically, which illustrates its potential power.

URL: https://openreview.net/forum?id=4glHUNZB6C

---

Title: Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold

Abstract: Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the \emph{selected completely at random} (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms can estimate $\alpha$ or the probability of an individual unlabeled instance being positive or both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.

URL: https://openreview.net/forum?id=XCAQwD4iIU

---

Title: When is Momentum Extragradient Optimal? A Polynomial-Based Analysis

Abstract: The extragradient method has gained increasing attention due to its favorable convergence on differentiable games. The eigenvalues of the game vector field's Jacobian are distributed on the complex plane, making game dynamics more complicated than single objective minimization. Even for bilinear games, the simple gradient method diverges, while the extragradient method converges. Our work focuses on studying the momentum extragradient method, which was recently proven to converge at an accelerated rate for bilinear games \citep{azizian2020accelerating}. By using polynomial-based analysis, we identify three cases of game Jacobian eigenvalue distribution where the momentum extragradient method exhibits accelerated convergence rates. These cases include when the eigenvalues are distributed on the (positive) real line, both on the real line along with complex conjugates, and only as complex conjugates. We also derive the optimal hyperparameters for each scenario.
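
For concreteness, one standard way to write the momentum extragradient update analyzed here is sketched below; the step sizes and momentum value in the example are arbitrary illustrations, whereas the paper derives the optimal hyperparameters per eigenvalue regime.

    import numpy as np

    def momentum_extragradient(F, z0, gamma, eta, m, steps=500):
        # z_{t+1} = z_t - gamma * F(z_t - eta * F(z_t)) + m * (z_t - z_{t-1})
        z_prev, z = z0.copy(), z0.copy()
        for _ in range(steps):
            z_half = z - eta * F(z)                      # extrapolation step
            z_new = z - gamma * F(z_half) + m * (z - z_prev)
            z_prev, z = z, z_new
        return z

    # Example: bilinear game min_x max_y x*y, whose vector field is F(x, y) = (y, -x);
    # plain gradient descent-ascent diverges on this problem.
    F = lambda z: np.array([z[1], -z[0]])
    print(momentum_extragradient(F, np.array([1.0, 1.0]), gamma=0.3, eta=0.3, m=0.1))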

URL: https://openreview.net/forum?id=ZLVbQEu4Ab

---

Title: Regularized Data Programming with Automated Bayesian Prior Selection

Abstract: The cost of manual data labeling can be a significant obstacle in supervised learning. Data programming (DP) offers a weakly supervised solution for training dataset creation, wherein the outputs of user-defined programmatic labeling functions (LFs) are reconciled through unsupervised learning. However, DP can fail to outperform an unweighted majority vote in some scenarios, including low-data contexts. This work introduces a Bayesian extension of classical DP that mitigates failures of unsupervised learning by augmenting the DP objective with regularization terms. Regularized learning is achieved through maximum a posteriori estimation with informative priors. Majority vote is proposed as a proxy signal for automated prior parameter selection. Results suggest that regularized DP improves performance relative to maximum likelihood and majority voting, confers greater interpretability, and bolsters performance in low-data regimes.

URL: https://openreview.net/forum?id=3sanNL5ROL

---

Title: Uncovering Unique Concept Vectors through Latent Space Decomposition

Abstract: Interpreting the inner workings of deep learning models is crucial for establishing trust and ensuring model safety. Concept-based explanations have emerged as a superior approach that is more interpretable than feature attribution estimates such as pixel saliency. However, defining the concepts for the interpretability analysis biases the explanations toward the user's expectations of the concepts. To address this, we propose a novel post-hoc unsupervised method that automatically uncovers the concepts learned by deep models during training. By decomposing the latent space of a layer into singular vectors and refining them by unsupervised clustering, we uncover concept vectors aligned with directions of high variance that are relevant to the model prediction, and that point to semantically distinct concepts. Our extensive experiments reveal that the majority of our concepts are readily understandable to humans, exhibit coherency, and bear relevance to the task at hand. Moreover, we showcase the practical utility of our method in dataset exploration, where our concept vectors successfully identify outlier training samples affected by various confounding factors. This novel exploration technique is remarkably versatile across data types and model architectures, and it will facilitate the identification of biases and the discovery of sources of error within training data.
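
As a toy illustration of combining a latent-space decomposition with unsupervised clustering (not the paper's exact pipeline), one could do something along these lines:

    import numpy as np
    from sklearn.cluster import KMeans

    def candidate_concepts(acts, n_dirs=20, n_concepts=5, seed=0):
        # acts: (num_samples, latent_dim) activations of one layer.
        acts = acts - acts.mean(axis=0, keepdims=True)
        _, _, Vt = np.linalg.svd(acts, full_matrices=False)      # singular directions
        coords = acts @ Vt[:n_dirs].T                            # project onto top directions
        labels = KMeans(n_clusters=n_concepts, n_init=10,
                        random_state=seed).fit_predict(coords)
        concepts = np.stack([acts[labels == k].mean(axis=0)      # one direction per cluster
                             for k in range(n_concepts)])
        return concepts / np.linalg.norm(concepts, axis=1, keepdims=True)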

URL: https://openreview.net/forum?id=LT4DXqUJTD

---

Title: Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies

Abstract: Reinforcement learning (RL) has shown great promise with algorithms learning in environments with large state and action spaces purely from scalar reward signals.
A crucial challenge for current deep RL algorithms is that they require a tremendous amount of environment interactions for learning.
This can be infeasible in situations where such interactions are expensive, such as in robotics.
Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data without needing to interact with the environment from the very beginning.
While online RL algorithms are typically evaluated as a function of the number of environment interactions, there isn't a single established protocol for evaluating offline RL methods.
In this paper, we propose a sequential approach to evaluate offline RL algorithms as a function of the training set size and thus by their data efficiency.
Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset while also harmonizing the visualization of the offline and online learning phases.
Our approach is generally applicable and easy to implement.
We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.

URL: https://openreview.net/forum?id=J3veZdVpts

---

Title: Ranking evaluation metrics from a group-theoretic perspective

Abstract: The search for ever better-performing machine learning techniques requires continuous comparison with well-established methods. Faced with the challenge of finding the right evaluation metric to demonstrate the strengths of a proposed model, choosing one metric over another may, intentionally or not, hide the method's weaknesses. Conversely, a single metric that fits all applications probably does not exist, and searching for one is hopeless.
In several applications, comparing rankings is a severe challenge: various context-specific metrics have appeared to evaluate their similarities and differences. However, most metrics have spread to other areas without a complete understanding of their internal functioning, leading to unexpected results and misuse. Furthermore, since different metrics focus on different aspects and characteristics of rankings, the comparisons of model outputs given by the various metrics are often contradictory.
We propose to formalize rankings using the mathematical framework of symmetric groups, rising above the contextual nature of individual evaluation metrics. We prove that contradictory evaluations frequently appear among pairs of metrics, introduce the agreement ratio to measure the frequency of such disagreement, and formally define essential mathematical properties for ranking evaluation metrics. We also check whether any of these metrics is a distance in the mathematical sense. In conclusion, our analysis explains the reasons for these inconsistencies, compares the metrics purely on the basis of mathematical concepts, and allows for a more conscious choice based on specific requirements.

URL: https://openreview.net/forum?id=5ibNhaomZT

---

Title: Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning

Abstract: This work focuses on equilibrium selection in no-conflict multi-agent games, where we specifically study the problem of selecting a Pareto-optimal equilibrium among several existing equilibria. It has been shown that many state-of-the-art multi-agent reinforcement learning (MARL) algorithms are prone to converging to Pareto-dominated equilibria due to the uncertainty each agent has about the policy of the other agents during training. To address sub-optimal equilibrium selection, we propose Pareto Actor-Critic (Pareto-AC), which is an actor-critic algorithm that utilises a simple property of no-conflict games (a superset of cooperative games): the Pareto-optimal equilibrium in a no-conflict game maximises the returns of all agents and therefore is the preferred outcome for all agents. We evaluate Pareto-AC in a diverse set of multi-agent games and show that it converges to higher episodic returns compared to seven state-of-the-art MARL algorithms, and that it successfully converges to a Pareto-optimal equilibrium in a range of matrix games. Finally, we propose PACDCG, a graph neural network extension of Pareto-AC which is shown to efficiently scale in games with a large number of agents.

URL: https://openreview.net/forum?id=3AzqYa18ah

---

Title: Reliable Active Learning via Influence Functions

Abstract: Due to the high cost and time-consuming nature of collecting labeled data, having insufficient labeled data is a common challenge that can negatively impact the performance of deep learning models when applied to real-world applications. Active learning (AL) aims to reduce the cost and time required for obtaining labeled data by selecting valuable samples during model training. However, recent works have pointed out the performance unreliability of existing AL algorithms for deep learning (DL) architectures under different scenarios, which manifests as their performance being comparable (or worse) to that of basic random selection. This behavior compromises the applicability of these approaches. We address this problem by proposing a theoretically motivated AL framework for DL architectures. We demonstrate that the most valuable samples for the model are those that, unsurprisingly, improve its performance on the entire dataset, most of which is unlabeled, and present a framework to efficiently estimate such performance (or loss) via influence functions, pseudo labels and diversity selection. Experimental results show that the proposed reliable active learning via influence functions (RALIF) can consistently outperform the random selection baseline as well as other existing and state-of-the-art active learning approaches.

URL: https://openreview.net/forum?id=dN9YICB6hN

---

Title: Analysis of Convolutions, Non-linearity and Depth in Graph Neural Networks using Neural Tangent Kernel

Abstract: The fundamental principle of Graph Neural Networks (GNNs) is to exploit the structural information of the data by aggregating the neighboring nodes using a `graph convolution' in conjunction with a suitable choice for the network architecture, such as depth and activation functions. Therefore, understanding the influence of each of these design choices on the network performance is crucial. Convolutions based on the graph Laplacian have emerged as the dominant choice, with the symmetric normalization of the adjacency matrix as the most widely adopted one. However, some empirical studies show that row normalization of the adjacency matrix outperforms it in node classification. Despite the widespread use of GNNs, there is no rigorous theoretical study of the representation power of these convolutions that could explain this behavior. Similarly, the empirical observation that linear GNNs perform on par with non-linear ReLU GNNs lacks a rigorous theory.

In this work, we theoretically analyze the influence of different aspects of the GNN architecture using the Graph Neural Tangent Kernel in a semi-supervised node classification setting. Under the population Degree Corrected Stochastic Block Model, we prove that: (i) linear networks capture the class information as well as ReLU networks; (ii) row normalization preserves the underlying class structure better than other convolutions; (iii) performance degrades with network depth due to over-smoothing, but the loss in class information is the slowest in row normalization; (iv) skip connections retain the class information even at infinite depth, thereby eliminating over-smoothing.
We finally validate our theoretical findings numerically and on real datasets such as Cora and Citeseer.
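
The two convolutions compared in points (ii) and (iii) differ only in how the adjacency matrix is normalized; a minimal sketch (shown here with self-loops added, a common convention that may differ from the paper's precise operator):

    import numpy as np

    def graph_convolution(A, mode="row"):
        # 'sym': D^{-1/2} (A + I) D^{-1/2}   (symmetric normalization)
        # 'row': D^{-1} (A + I)              (row normalization)
        A_hat = A + np.eye(A.shape[0])
        deg = A_hat.sum(axis=1)
        if mode == "sym":
            d = 1.0 / np.sqrt(deg)
            return A_hat * d[:, None] * d[None, :]
        return A_hat / deg[:, None]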

URL: https://openreview.net/forum?id=xgYgDEof29

---

Title: Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning

Abstract: Algorithms that combine deep reinforcement learning and search to train agents, such as AlphaZero, have demonstrated remarkable success in producing human-level game-playing AIs for large adversarial domains. We propose a similar combination that can be applied to general-sum, imperfect information games, by integrating a novel search procedure with a population-based deep RL training framework. The outer loop of our algorithm is implemented by Policy Space Response Oracles (PSRO), which generates a diverse population of rationalizable policies by interleaving game-theoretic analysis and deep RL. We train each policy using an Information-Set Monte-Carlo Tree Search (IS-MCTS) procedure, with concurrent learning of a deep generative model for handling imperfect information during search. We furthermore propose two new meta-strategy solvers for PSRO based on the Nash bargaining solution. Our approach thus combines planning, inferring environmental state, and predicting opponents' strategies during online decision-making. To demonstrate the efficacy of this training framework, we evaluate PSRO's ability to compute approximate Nash equilibria in benchmark games. We further explore its performance on two negotiation games: Colored Trails, and Deal-or-No-Deal. Employing our integrated search method, we conduct behavioral studies where human participants negotiate with our agents ($N = 346$). We find that search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that, when negotiating with humans, achieve social welfare comparable to that of humans trading among themselves.

URL: https://openreview.net/forum?id=cI7uknM0nR

---

Title: Resmax: An Alternative Soft-Greedy Operator for Reinforcement Learning

Abstract: Soft-greedy operators, namely $\varepsilon$-greedy and softmax, remain a common choice to induce a basic level of exploration for action-value methods in reinforcement learning. These operators, however, have a few critical limitations. In this work, we investigate a simple soft-greedy operator, which we call resmax, that takes actions proportionally to their max action gap: the residual to the estimated maximal value. It is simple to use and ensures coverage of the state-space like $\varepsilon$-greedy, but focuses exploration more on potentially promising actions like softmax. Further, it does not concentrate probability as quickly as softmax, and so better avoids overemphasizing sub-optimal actions that appear high-valued during learning. Additionally, we prove it is a non-expansion for any fixed exploration hyperparameter, unlike the softmax policy which requires a state-action specific temperature to obtain a non-expansion (called mellowmax). We empirically validate that resmax is comparable to or outperforms $\varepsilon$-greedy and softmax across a variety of environments in tabular and deep RL.

URL: https://openreview.net/forum?id=wzzrs5QH5k

---

Title: Revisiting Parameter Sharing in Multi-Agent Deep Reinforcement Learning

Abstract: Parameter sharing, where each agent independently learns a policy with fully shared parameters between all policies, is a popular baseline method for multi-agent deep reinforcement learning. Unfortunately, since all agents share the same policy network, they cannot learn different policies or tasks. This issue has been circumvented experimentally by adding an agent-specific indicator signal to observations, which we term ``agent indication.'' Agent indication is limited, however, in that without modification it does not allow parameter sharing to be applied to environments where the action spaces and/or observation spaces are heterogeneous. This work formalizes the notion of agent indication and proves that it enables convergence to optimal policies for the first time. Next, we formally introduce methods to extend parameter sharing to learning in heterogeneous observation and action spaces, and prove that these methods allow for convergence to optimal policies. Finally, we experimentally confirm that the methods we introduce function empirically, and conduct a wide array of experiments studying the empirical efficacy of many different agent indication schemes for image based observation spaces.
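
Agent indication itself is simple to implement; a minimal sketch of the idea (a one-hot agent identifier appended to each observation so one shared policy network can represent different per-agent behaviours) is shown below. The optional zero-padding for heterogeneous observation spaces is only a hint at the extensions the paper formalizes, and its details are assumptions.

    import numpy as np

    def indicate_agent(obs, agent_id, num_agents, pad_to=None):
        # Flatten, optionally zero-pad heterogeneous observations to a shared
        # length (pad_to must be >= the flattened size), then append a one-hot
        # agent indicator.
        flat = np.asarray(obs, dtype=np.float32).ravel()
        if pad_to is not None:
            flat = np.pad(flat, (0, pad_to - flat.size))
        one_hot = np.zeros(num_agents, dtype=np.float32)
        one_hot[agent_id] = 1.0
        return np.concatenate([flat, one_hot])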

URL: https://openreview.net/forum?id=4Phqjr9cHe

---

Title: Efficient Weighting and Optimization in Federated Learning: A Primal-Dual Approach

Abstract: Federated learning has emerged as a promising approach for constructing a large-scale cooperative learning system involving multiple clients without sharing their raw data. However, the task of finding the optimal sampling weights for each client, given a specific global objective, remains largely unexplored. This challenge becomes particularly pronounced when clients' data distributions are non-i.i.d. (independent and identically distributed) and when clients only partially participate in the learning process.

In this paper, we tackle this issue by formulating the aforementioned task as a bi-level optimization problem that incorporates the correlations among different clients. To address this problem, we propose a double-loop primal-dual-based algorithm, designed specifically to solve the bi-level optimization problem efficiently. To establish the effectiveness of our algorithm, we provide rigorous convergence analysis under mild assumptions.
Furthermore, we conduct extensive empirical studies using both toy examples and learning models based on real datasets. Through these experiments, we verify and demonstrate the effectiveness of our proposed method.

URL: https://openreview.net/forum?id=y9YzBkuYT9

---

Title: Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking

Abstract: Ensuring the safety of reinforcement learning (RL) algorithms is crucial to unlock their potential for many real-world tasks. However, vanilla RL and most safe RL approaches do not guarantee safety. In recent years, several methods have been proposed to provide hard safety guarantees for RL, which is essential for applications where unsafe actions could have disastrous consequences, e.g., a car crash. Nevertheless, there is no comprehensive comparison of these provably safe RL methods. Therefore, we introduce a categorization of existing provably safe RL methods, present the conceptual foundations for both continuous and discrete action spaces, and empirically benchmark existing methods. We categorize the methods based on how they adapt the action: action replacement, action projection, and action masking. Our experiments on an inverted pendulum and a quadrotor stabilization task indicate that action replacement is the best-performing approach despite its comparatively simple formulation. Furthermore, adding a reward penalty every time the safety verification is engaged improves training performance. Finally, we provide practical guidance on selecting provably safe RL approaches depending on the safety specification, RL algorithm, and type of action space.

URL: https://openreview.net/forum?id=mcN0ezbnzO

---

Title: Estimating Differential Equations from Temporal Point Processes

Abstract: Ordinary differential equations (ODEs) allow interpretation of phenomena in various scientific fields. They have mostly been applied to numerical data observed at regular intervals, but not to irregularly observed discrete events, also known as point processes. In this study, we introduce an ODE modeling of such events by combining ODEs with log-Gaussian Cox processes (Møller et al., 1998). In the experiments with different types of ODEs regarding infectious disease, predator-prey interaction, and competition among participants, our method outperformed existing baseline methods assuming regularly observed continuous data with respect to the accuracy of recovering the latent parameters of ODEs. Through both synthetic and actual examples, we also showed the ability of our method to extrapolate, model latent events that cannot be observed, and offer interpretability of phenomena from the viewpoint of the estimated parameters of ODE.

URL: https://openreview.net/forum?id=cJgHzw8Qhq

---

Title: A Simple Baseline for Task-Agnostic Self-Training

Abstract: Classical semi-supervised approaches achieve state-of-the-art results on various visual recognition tasks, especially image classification, but they are typically designed with expert knowledge of the task at hand such as task-specific data augmentation. However, these approaches do not generalize to novel tasks such as image segmentation and surface normal estimation. In this work, we instead study self-training for a wide variety of tasks in a task-agnostic fashion. We find a simple success recipe: construct a continuous schedule of learning updates that iterates between self-training on novel segments of the streams of unlabeled data and fine-tuning on the small and fixed labeled data. Our task-agnostic self-training approach works with a few labeled samples per task by leveraging millions of unlabeled web images, and it requires neither enormous computational resources to process data nor domain-specific unlabeled data, which are assumed in most prior works. We show that our simple approach, without hyper-parameter tuning, can be as effective as the state-of-the-art semi-supervised learning method FixMatch, which is designed with task-specific knowledge for image classification. Furthermore, we demonstrate the findings for both (1) pixel-level tasks such as surface normal estimation and segmentation, and (2) diverse domains with extreme differences to web images, including medical, satellite, and agricultural imagery, where there does not exist a large amount of labeled or unlabeled data. The experiments consistently suggest that ours is a competitive baseline to consider before developing compute-heavy and task-specific semi-supervised methods.

URL: https://openreview.net/forum?id=BGBtRCidRM

---

Title: Transfer Learning for High-dimensional Quantile Regression with Statistical Guarantee

Abstract: The task of transfer learning is to improve estimation/inference of a target model by migrating data from closely related source populations. In this article, we propose transfer learning algorithms for high-dimensional Quantile Regression (QR) models with the technique of convolution-type smoothing. Given the transferable source populations, we derive $\ell_1/\ell_2$-estimation error bounds for the estimators of the target regression coefficients under mild conditions. Theoretical analysis shows that the upper bounds are improved over those of the classical penalized QR estimator with only the target data, as long as the target and the sources are sufficiently similar to each other. When the set of informative sources is unknown, a transferable source detection algorithm is proposed to detect informative sources from all available sources. Thorough simulation studies justify our theoretical analysis.

URL: https://openreview.net/forum?id=d3xwrfAG4V

---

Title: RLTF: Reinforcement Learning from Unit Test Feedback

Abstract: The goal of program synthesis, or code generation, is to generate executable code based on given descriptions. Recently, there has been an increasing number of studies employing reinforcement learning (RL) to improve the performance of large language models (LLMs) for code. However, these RL methods have only used offline frameworks, limiting their exploration of new sample spaces. Additionally, current approaches that utilize unit test signals are rather simple, not accounting for specific error locations within the code.
To address these issues, we propose RLTF, i.e., Reinforcement Learning from Unit Test Feedback, a novel online RL framework with unit test feedback of multi-granularity for refining code LLMs. Our approach generates data in real-time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code. Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and the MBPP benchmarks. Our code can be found at: \url{https://github.com/Zyq-scut/RLTF}.

URL: https://openreview.net/forum?id=hjYmsV6nXZ

---

Title: Wrapped $\beta$-Gaussians with compact support for exact probabilistic modeling on manifolds

Abstract: We introduce wrapped $\beta$-Gaussians, a family of wrapped distributions on Riemannian manifolds, supporting efficient reparametrized sampling, as well as exact density estimation, effortlessly supporting high dimensions and anisotropic scale parameters. We extend Fenchel-Young losses for geometry-aware learning with wrapped $\beta$-Gaussians, and demonstrate the efficacy of our proposed family in a suite of experiments on hypersphere and rotation manifolds: data fitting, hierarchy encoding, generative modeling with variational autoencoders, and multilingual word embedding alignment.

URL: https://openreview.net/forum?id=KrequDpWzt

---

Title: Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation

Abstract: Despite their high computation and communication costs, Newton-type methods remain an appealing option for distributed training due to their robustness against ill-conditioned convex problems. In this work, we study communication compression and aggregation mechanisms for curvature information in order to reduce these costs while preserving theoretically superior local convergence guarantees. We prove that the recently developed class of three point compressors (3PC) of [Richtarik et al., 2022] for gradient communication can be generalized to Hessian communication as well. This result opens up a wide variety of communication strategies, such as contractive compression and lazy aggregation, for compressing prohibitively costly curvature information. Moreover, we discover several new 3PC mechanisms, such as adaptive thresholding and Bernoulli aggregation, which require reduced communication and only occasional Hessian computations. Furthermore, we extend and analyze our approach to bidirectional communication compression and partial device participation setups to cater to the practical considerations of applications in federated learning. For all our methods, we derive fast condition-number-independent local linear and/or superlinear convergence rates. Finally, with extensive numerical evaluations on convex optimization problems, we illustrate that our designed schemes achieve state-of-the-art communication complexity compared to several key baselines using second-order information.

URL: https://openreview.net/forum?id=NekBTCKJ1H

---

Title: Design of the topology for contrastive visual-textual alignment

Abstract: Cosine similarity is the common choice for measuring the distance between feature representations in contrastive visual-textual alignment learning. However, a learnable softmax temperature parameter is empirically required when learning on large-scale noisy training data. In this work, we first discuss the role of the softmax temperature from the perspective of the embedding space's topological properties. We argue that the softmax temperature is the key mechanism for contrastive learning on noisy training data: it acts as a scaling factor on the distance range (e.g., [-1, 1] for the cosine similarity), and its learned value indicates the level of noise in the training data. We then propose an alternative design of the topology for the embedding alignment. We make use of multiple class tokens in the transformer architecture and map the feature representations onto an oblique manifold endowed with the negative inner product as the distance function. With this configuration, we substantially improve the zero-shot classification performance of baseline CLIP models pre-trained on large-scale datasets by an average of 6.1%.
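
For context, the sketch below shows where a learnable softmax temperature enters a standard CLIP-style contrastive loss, i.e., how it rescales the bounded cosine-similarity range. It does not reproduce the paper's multi-class-token or oblique-manifold design; the feature shapes and temperature parameterization are assumptions.

```python
# CLIP-style contrastive loss with a learnable softmax temperature.
# The temperature rescales the bounded cosine-similarity range [-1, 1];
# the paper's oblique-manifold design is not shown here.
import torch
import torch.nn.functional as F

def clip_loss(img_feats, txt_feats, log_temp):
    img = F.normalize(img_feats, dim=-1)           # cosine similarity in [-1, 1]
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / log_temp.exp()        # divide by learned temperature
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

log_temp = torch.tensor(0.0, requires_grad=True)   # learned jointly with the encoders
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512), log_temp)
loss.backward()
```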

URL: https://openreview.net/forum?id=zNRGSv7WKm

---

Title: Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization

Abstract: Over the recent years, reinforcement learning (RL) starts to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study on why RL helps is still at its early stage. This paper presents the first systematic study on policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Best Choice Problem (BCP), we formally prove that distribution shift is reduced exponentially with curriculum learning even if the curriculum is a randomly generated BCP on a smaller scale. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Best Choice Problem, Online Knapsack, and AdWords to verify our findings.

URL: https://openreview.net/forum?id=gKEbBKRUjA

---

Title: Invariant Structure Learning for Better Generalization and Causal Explainability

Abstract: Learning the causal structure behind data is invaluable for improving generalization and obtaining high-quality explanations. Towards this end, we propose a novel framework, Invariant Structure Learning (ISL), that is designed to improve causal structure discovery by utilizing generalization as an indication in the process. ISL splits the data into different environments, and learns a structure that is invariant to the target across different environments by imposing a consistency constraint. The proposed aggregation mechanism then selects the classifier based on a graph structure that reflects the causal mechanisms in the data more accurately compared to the structures learnt from individual environments. Furthermore, we extend ISL to a self-supervised learning setting, where accurate causal structure discovery does not rely on any labels. Self-supervised ISL utilizes proposals for invariant causality, by iteratively setting different nodes as targets. On synthetic and real-world datasets, we demonstrate that ISL accurately discovers the causal structure, outperforms alternative methods, and yields superior generalization for datasets with significant distribution shifts.

URL: https://openreview.net/forum?id=A9yn7KTwsK

---

Title: A Survey on Transformers in Reinforcement Learning

Abstract: The Transformer has been considered the dominant neural architecture in NLP and CV, mostly under supervised settings. Recently, a similar surge in the use of Transformers has appeared in reinforcement learning (RL), where it faces unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. In this paper, we seek to systematically review the motivations and progress of using Transformers in RL, provide a taxonomy of existing work, discuss each sub-field, and summarize future prospects.

URL: https://openreview.net/forum?id=r30yuDPvf2

---

Title: Semi-Supervised Single Domain Generalization with Label-Free Adversarial Data Augmentation

Abstract: Domain generalization (DG) has attracted increasing attention recently, as it seeks to improve the generalization ability of visual recognition models to unseen target domains. DG leverages multiple source domains for model training, while single domain generalization (SDG) further restricts this setting by exploiting only a single source domain. Nevertheless, both DG and SDG assume that the source domains are fully labeled, which might not be practical in many real-world scenarios. In this paper, we present a new problem, i.e., semi-supervised single domain generalization (SS-SDG), which aims to train a model with a partially labeled single source domain to generalize to multiple unseen testing domains. We propose an effective framework to address this problem. In particular, we design a label-free adversarial data augmentation strategy to diversify the source domain, and propose a novel multi-pair FixMatch loss to generalize classifiers to unseen testing domains. Extensive experiments on the OfficeHome, PACS and DomainNet20 datasets show that our method surpasses the latest SDG and semi-supervised methods. Moreover, on PACS and DomainNet20, our method approaches the fully supervised ERM upper bound to within a $5\%$ gap while using less than $8\%$ of the labels.

URL: https://openreview.net/forum?id=sUlbRfLijj

---

Title: Invertible Hierarchical Generative Model for Images

Abstract: Normalizing flows (NFs) as generative models enjoy desirable properties such as exact invertibility and exact likelihood evaluation, while being efficient to sample from. These properties, however, come at the cost of heavy restrictions on the architecture. Due to these limitations, modeling multi-modal probability distributions can yield poor results even with low-dimensional data. Additionally, typical flow architectures employed on real image datasets produce samples with visible aliasing artifacts and limited variation. The latent decomposition of flow models also falls short of that of competing methods, with uneven contributions to the decoded image. In this work we build an invertible generative model using conditional normalizing flows in a hierarchical fashion to circumvent the aforementioned limitations. We show that we can achieve superior sample quality among flow-based models with fewer parameters than the state of the art. We demonstrate the ability to control individual levels of detail via the latent decomposition of our model.

URL: https://openreview.net/forum?id=4rkKN4tM63

---

Title: Evaluating Human-Language Model Interaction

Abstract: Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

URL: https://openreview.net/forum?id=hjDYJUn9l1

---

Title: Out-of-distribution Detection with Test Time Augmentation and Consistency Evaluation

Abstract: Deep neural networks are known to be vulnerable to unseen data: they may wrongly assign high confidence scores to out-of-distribution samples. Existing works try to handle this problem using intricate representation learning methods and specific metrics. In this paper, we propose a simple yet effective post-hoc out-of-distribution detection algorithm named Test Time Augmentation Consistency Evaluation~(TTACE). It is inspired by a novel observation that in-distribution data enjoy more consistent predictions for their original and augmented versions on a trained network than out-of-distribution data, which separates in-distribution and out-of-distribution samples.
Experiments on various high-resolution image benchmark datasets demonstrate that TTACE achieves comparable or better detection performance under dataset-vs-dataset anomaly detection settings with a $60\%\sim90\%$ running time reduction of existing classifier-based algorithms.
We also provide empirical verification that the key to TTACE lies in the remaining classes between augmented features, which previous works have partially ignored.
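
A minimal sketch of the consistency idea, assuming a simple augmentation (a horizontal flip) and a cosine-similarity score between the original and augmented softmax outputs; the augmentation choice, score, and threshold are placeholders rather than the paper's exact TTACE procedure.

```python
# Sketch of a test-time-augmentation consistency score for OOD detection:
# in-distribution inputs should yield similar predictions for the original
# and an augmented copy. The flip augmentation and threshold are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def consistency_score(model, x):
    p_orig = F.softmax(model(x), dim=-1)
    p_aug = F.softmax(model(torch.flip(x, dims=[-1])), dim=-1)  # flip width axis
    return F.cosine_similarity(p_orig, p_aug, dim=-1)  # high => likely in-distribution

# Usage: flag inputs whose score falls below a validation-chosen threshold.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
scores = consistency_score(model, torch.randn(4, 3, 32, 32))
is_ood = scores < 0.9   # threshold is a placeholder
```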

URL: https://openreview.net/forum?id=yiG474vJFz

---

Title: MARS: A Motif-based Autoregressive Model for Retrosynthesis Prediction

Abstract: Retrosynthesis is a critical task in drug discovery, aimed at finding a viable pathway for synthesizing a given target molecule. Many existing approaches frame this task as a graph-generation problem. Specifically, these methods first identify the reaction center and break the target molecule accordingly to generate synthons. Reactants are then generated by either adding atoms sequentially to the synthon graphs or directly attaching proper leaving groups. However, both of these strategies have limitations. Adding atoms results in a long prediction sequence, which increases the complexity of generation, while adding leaving groups can only consider those seen in the training set, which results in poor generalization. In this paper, we propose a novel end-to-end graph generation model for retrosynthesis prediction, which sequentially identifies the reaction center, generates the synthons, and adds motifs to the synthons to generate reactants. Since chemically meaningful motifs are bigger than atoms and smaller than leaving groups, our method enjoys lower prediction complexity than adding atoms and better generalization than adding leaving groups. We evaluate our proposed model on a benchmark dataset and show that it significantly outperforms previous state-of-the-art models. Furthermore, we conduct an ablation study to investigate the contribution of each component of the model to overall performance on the benchmark dataset. Our results demonstrate the effectiveness of our model in predicting retrosynthesis pathways and suggest its potential as a valuable tool in drug discovery.

URL: https://openreview.net/forum?id=HCGzA07AjJ

---

Title: When Less is More: Simplifying Inputs Aids Neural Network Understanding

Abstract: How do neural network image classifiers respond to simpler and simpler inputs? And what do such responses reveal about the characteristics of the data and their interaction with the learning process? To answer these questions, we need measures of input simplicity (or inversely, complexity) and class-discriminative input information as well as a framework to optimize these during training. Lastly, we need experiments that evaluate whether this framework can simplify data to remove injected distractors and evaluate the impact of such simplification on real-world data. In this work, we measure simplicity with the encoding bit size given by a pretrained generative model, and minimize the bit size to simplify inputs during training. At the same time, we minimize a gradient- and activation-based distance metric between original and simplified inputs to retain discriminative information. We investigate the trade-off between input simplicity and task performance. For images with injected distractors, such simplification naturally removes superfluous information. For real-world datasets, such simplification retains visually discriminative features and learns features that may be more robust in some settings.

URL: https://openreview.net/forum?id=uNU8m7DWPM

---

Title: Causal Reinforcement Learning: A Survey

Abstract: Reinforcement learning is an essential paradigm for solving sequential decision problems under uncertainty. Despite many remarkable achievements in recent decades, applying reinforcement learning methods in the real world remains challenging. One of the main obstacles is that reinforcement learning agents lack a fundamental understanding of the world and must therefore learn from scratch through numerous trial-and-error interactions. They may also face challenges in providing explanations for their decisions and generalizing the acquired knowledge. Causality, however, offers a notable advantage as it can formalize knowledge in a systematic manner and leverage invariance for effective knowledge transfer. This has led to the emergence of causal reinforcement learning, a subfield of reinforcement learning that seeks to enhance existing algorithms by incorporating causal relationships into the learning process. In this survey, we comprehensively review the literature on causal reinforcement learning. We first introduce the basic concepts of causality and reinforcement learning, and then explain how causality can address core challenges in non-causal reinforcement learning. We categorize and systematically review existing causal reinforcement learning approaches based on their target problems and methodologies. Finally, we outline open issues and future directions in this emerging field.

URL: https://openreview.net/forum?id=qqnttX9LPo

---

Title: MERMAIDE: Learning to Align Learners using Model-Based Meta-Learning

Abstract: We study how a principal can efficiently and effectively intervene on the rewards of a previously unseen learning agent in order to induce desirable outcomes. This is relevant to many real-world settings like auctions or taxation, where the principal may not know the learning behavior nor the rewards of real people. Moreover, the principal should be few-shot adaptable and minimize the number of interventions, because interventions are often costly. We introduce MERMAIDE, a model-based meta-learning framework to train a principal that can quickly adapt to out-of-distribution agents with different learning strategies and reward functions. We validate this approach step-by-step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective in intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both $0$-shot and $1$-shot settings with partial agent information.

URL: https://openreview.net/forum?id=H5VRvCXCzf

---

Title: On Perfect Clustering for Gaussian Processes

Abstract: In this paper, we propose a data-based transformation for infinite-dimensional Gaussian processes and derive its limit theorem. In a clustering problem using mixture models, an appropriate modification of this transformation asymptotically leads to perfect separation of the populations under rather general conditions, except in the scenario where differences between clusters depend only on their locations; in this case our procedure is not useful. Theoretical properties related to label consistency are studied for the k-means clustering algorithm when used on this transformed data. Good empirical performance of the proposed methodology is demonstrated on simulated as well as benchmark data sets, in comparison with some popular parametric and nonparametric methods for such functional data.

URL: https://openreview.net/forum?id=igDOV2KBwM

---

Title: Relating graph auto-encoders to linear models

Abstract: Graph auto-encoders are widely used to construct graph representations in Euclidean vector spaces. However, it has already been pointed out empirically that linear models can outperform graph auto-encoders on many tasks.
In our work, we prove that the solution space induced by graph auto-encoders is a subset of the solution space of a linear map. This demonstrates that linear embedding models have at least the representational power of graph auto-encoders based on graph convolutional networks. So why are we still using nonlinear graph auto-encoders? One reason could be that actively restricting the linear solution space might introduce an inductive bias that helps improve learning and generalization. While many researchers believe that the nonlinearity of the encoder is the critical ingredient towards this end, we instead identify the node features of the graph as a more powerful inductive bias. We give theoretical insights by introducing a corresponding bias in a linear model and analyzing the change in the solution space. Our experiments show that a linear encoder can outperform the nonlinear encoder when using feature information, in line with other empirical work on this question.
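
A minimal sketch of the kind of linear embedding model discussed above: node features are propagated once through a normalized adjacency and projected by a linear map, and edges are reconstructed with the usual inner-product decoder. The normalization choice and the untrained weight matrix are illustrative assumptions.

```python
# Linear graph embedding with the standard inner-product decoder:
# Z = (A_norm @ X) @ W stands in for the nonlinear GCN encoder of a GAE.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 8, 4
A = (rng.random((n, n)) < 0.3).astype(float)
np.fill_diagonal(A, 0.0)
A = np.maximum(A, A.T)                         # symmetrize
X = rng.normal(size=(n, d))                    # node features

A_hat = A + np.eye(n)                          # add self-loops
deg = A_hat.sum(1)
A_norm = A_hat / np.sqrt(np.outer(deg, deg))   # symmetric normalization

W = rng.normal(size=(d, k))                    # linear map (would be learned)
Z = A_norm @ X @ W                             # linear encoder: propagate + project
A_recon = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))     # sigmoid inner-product decoder
```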

URL: https://openreview.net/forum?id=Y1eYplvxrE

---

Title: FEATHERS: Federated Architecture and Hyperparameter Search

Abstract: Deep neural architectures have a profound impact on the performance achieved in many of today's AI tasks, yet their design still relies heavily on human prior knowledge and experience. Neural architecture search (NAS) together with hyperparameter optimization (HO) helps to reduce this dependence. However, state-of-the-art NAS and HO rapidly become infeasible with increasing amounts of data stored in a distributed fashion, typically violating data privacy regulations such as GDPR and CCPA. As a remedy, we introduce FEATHERS - FEderated ArchiTecture and HypERparameter Search, a method that not only jointly optimizes neural architectures and optimization-related hyperparameters in distributed data settings, but also adheres to data privacy through the use of differential privacy (DP). We show that FEATHERS efficiently optimizes architectural and optimization-related hyperparameters alike, while demonstrating convergence on classification tasks at no detriment to model performance when complying with privacy constraints.

URL: https://openreview.net/forum?id=4yFPrQxidt

---

Title: Generalization in Cooperative Multi-Agent Systems

Abstract: Collective intelligence is a fundamental trait shared by many species that has allowed them to thrive in diverse environmental conditions. From simple organisations in an ant colony to complex systems in human groups, collective intelligence is vital for solving many survival tasks. Such natural systems are flexible to changes in their structure: they generalize well when the abilities or number of agents change, which we call Combinatorial Generalization (CG). CG is a highly desirable trait for autonomous systems as it can increase their utility and deployability across a wide range of applications. While recent works addressing specific aspects of CG have shown impressive results on complex domains, they provide no performance guarantees when generalizing to novel situations. In this work, we shed light on the theoretical underpinnings of CG for cooperative multi-agent systems (MAS). Specifically, we study generalization bounds under a linear dependence of the underlying dynamics on the agent capabilities, which can be seen as a generalization of Successor Features to MAS. We then extend the results first for Lipschitz and then arbitrary dependence of rewards on team capabilities. Finally, empirical analysis on various domains using the framework of multi-agent reinforcement learning highlights important desiderata for multi-agent algorithms towards ensuring CG.

URL: https://openreview.net/forum?id=sLlufxgpmO

---

Title: When does knowledge distillation improve robustness?

Abstract: Knowledge distillation, typically employed to condense a large `teacher' network into a smaller `student' network, has been found to also effectively transfer adversarial robustness into mobile-friendly students. In this study, however, we show that knowledge distillation between large models can also be used purely to enhance adversarial robustness. Specifically, we present a thorough analysis of different robust knowledge distillation (RKD) techniques with the aim of providing general guidelines to improve the adversarial performance of a student model. Our ablations demonstrate the significance of early stopping, model ensembling, label mixing, and the use of weakly adversarially trained teachers as keys to maximizing a student's performance; we also find that matching the student and teacher in adversarial regions is beneficial in some settings. We thus introduce a new adversarial knowledge distillation loss (AKD), which matches the student's and teacher's outputs on adversarial examples, to study when it can be beneficial in the context of RKD. Finally, we use our insights to enhance state-of-the-art robust models and find that while our proposed guidelines can complement and improve them, the main achievable performance benefits still depend on the quantity and quality of the training data used.
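
A hedged sketch of what a loss that matches student and teacher outputs on adversarial examples can look like: PGD examples are crafted against the student, then a KL distillation term is added on them. Attack strength, number of steps, temperature, and weighting are assumed values, not the paper's settings.

```python
# Sketch of matching student and teacher outputs on adversarial examples:
# a PGD attack against the student followed by a KL distillation term.
# Attack strength, steps, temperature, and weighting are assumed values.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=5):
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()   # ascend the loss
        x_adv = x.detach() + (x_adv - x).clamp(-eps, eps)  # project to eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def akd_loss(student, teacher, x, y, T=4.0):
    x_adv = pgd_attack(student, x, y)
    s_logits = student(x_adv)
    with torch.no_grad():
        t_logits = teacher(x_adv)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return F.cross_entropy(student(x), y) + kd

# Toy usage with placeholder models and random image-like inputs.
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
akd_loss(student, teacher, x, y).backward()
```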

URL: https://openreview.net/forum?id=e25HGDzfTu

---

Title: A Simple Unified Method for Node Classification on Homophilous and Heterophilous Graphs

Abstract: In graph learning, there have been two predominant inductive biases regarding graph-inspired architectures: On the one hand, higher-order interactions and message passing work well on homophilous graphs and are leveraged by GCNs and GATs. Such architectures, however, cannot easily scale to large real-world graphs. On the other hand, shallow (or node-level) models using ego features and adjacency embeddings work well in heterophilous graphs. In this work, we propose a novel scalable shallow method -- GLINKX -- that can work both on homophilous and heterophilous graphs. GLINKX leverages (i) novel monophilous label propagations (ii) ego/node features, (iii) knowledge graph embeddings as positional embeddings, (iv) node-level training, and (v) low-dimensional message passing. Formally, we prove novel error bounds and justify the components of GLINKX. Experimentally, we show its effectiveness on several homophilous and heterophilous datasets with up to millions of nodes and tens of millions of edges.

URL: https://openreview.net/forum?id=7JKFHoXYEG

---

Title: The Slingshot Effect: A Late-Stage Optimization Anomaly in Adaptive Gradient Methods

Abstract: Adaptive gradient methods, notably Adam ~\citep{kingma2014adam, loshchilov2017decoupled}, have become indispensable for optimizing neural networks, particularly in conjunction with Transformers ~\citep{vaswani2017attention, dosovitskiy2020an}. In this paper, we present a novel optimization anomaly called the \emph{Slingshot Effect}, which manifests during extremely late stages of training. We identify a distinctive characteristic of this phenomenon: cyclic phase transitions between stable and unstable training regimes, as evidenced by the cyclic behavior of the norm of the last layer's weights. Although the Slingshot Effect can be easily reproduced in more general settings, it does not align with any known optimization theories, emphasizing the need for in-depth examination.
Moreover, we make a noteworthy observation that Grokking, as reported by ~\citet{power2021grokking}, occurs predominantly during the onset of the Slingshot Effect and is absent without it, even in the absence of explicit regularization. This finding suggests a surprising inductive bias of adaptive gradient optimizers at late training stages, urging a revised theoretical analysis of their origin.
Our study sheds light on an intriguing optimization behavior that has significant implications for understanding the inner workings of adaptive gradient methods.

URL: https://openreview.net/forum?id=OZbn8ULouY

---

Title: Multi-view Data Visualisation via Manifold Learning

Abstract: Non-linear dimensionality reduction can be performed by manifold learning approaches, such as Stochastic Neighbour Embedding (SNE), Locally Linear Embedding (LLE) and Isometric Feature Mapping (ISOMAP). These methods aim to produce two or three latent embeddings, primarily to visualise the data in intelligible representations. This manuscript proposes extensions of Student's t-distributed SNE (t-SNE), LLE and ISOMAP, for dimensionality reduction and visualisation of multi-view data. Multi-view data refers to multiple types of data generated from the same samples.

The proposed multi-view approaches provide more comprehensible projections of the samples compared to the ones obtained by visualising each data-view separately. Commonly, visualisation is used for identifying underlying patterns within the samples. By incorporating the obtained low-dimensional embeddings from the multi-view manifold approaches into the $K$-means clustering algorithm, it is shown that clusters of the samples are accurately identified. Through extensive comparisons of novel and existing multi-view manifold learning algorithms on real and synthetic data, the proposed multi-view extension of t-SNE, named multi-SNE, is found to have the best performance. We further illustrate the applicability of the multi-SNE approach for the analysis of multi-omics single-cell data, where the aim is to visualise and identify cell heterogeneity and cell types in biological tissues relevant to health and disease.

URL: https://openreview.net/forum?id=jMiwptYV5n

---

Title: Variation-aware Vision Transformer Quantization

Abstract: Despite the remarkable performance of Vision Transformers (ViTs) in various visual tasks, the expanding computation and model size of ViTs have increased the demand for improved efficiency during training and inference. To address the heavy computation and parameter drawbacks, quantization is frequently studied in the community as a representative model compression technique and has seen extensive use on CNNs. However, due to the differing properties of CNNs and ViTs, quantization applications on ViTs are still limited and underexplored. In this paper, we identify the difficulty of ViT quantization as stemming from its unique variation behaviors, which differ from those of traditional CNN architectures. The variations indicate the magnitude of the parameter fluctuations and can also measure outlier conditions. Moreover, the variation behaviors reflect each module's sensitivity to quantization. A quantization sensitivity analysis and comparison of ViTs with CNNs help us locate the underlying differences in variations. We also find that the variations in ViTs cause training oscillations, bringing instability during quantization-aware training (QAT). Correspondingly, we address the variation problem with an efficient knowledge-distillation-based, variation-aware quantization method. The multi-crop knowledge distillation scheme accelerates and stabilizes training and alleviates the influence of variations during QAT. We also propose a module-dependent quantization scheme and a variation-aware regularization term to suppress the oscillation of weights. On ImageNet-1K, we obtain a 77.66% Top-1 accuracy in the extremely low-bit scenario of 2-bit Swin-T, outperforming the previous state-of-the-art quantized model by 3.35%. Code and models will be publicly available.

URL: https://openreview.net/forum?id=yVytaiJzkq

---

Title: Unlocking Unlabeled Data: Ensemble Learning with the Hui-Walter Paradigm for Performance Estimation in Online and Static Settings

Abstract: In the realm of machine learning and statistical modeling, practitioners often work under the assumption of accessible, static, labeled data for evaluation and training. However, this assumption often deviates from reality, where data may be private, encrypted, difficult-to-measure, or unlabeled. In this paper, we bridge this gap by adapting the Hui-Walter paradigm, a method traditionally applied in epidemiology and medicine, to the field of machine learning. This approach enables us to estimate key performance metrics such as false positive rate, false negative rate, and prior in scenarios where no ground truth is available. We further extend this paradigm for handling online data, opening up new possibilities for dynamic data environments. Our methodology, applied to two diverse datasets, the Wisconsin Breast Cancer and the Adult dataset, involves partitioning each into latent classes to simulate multiple data populations and independently training models to replicate multiple tests. By cross-tabulating binary outcomes across ensemble categorizers and multiple populations, we are able to estimate unknown parameters through Gibbs sampling, eliminating the need for ground-truth or labeled data. This paper showcases the potential of our methodology to transform machine learning practices by allowing for accurate model assessment under dynamic and uncertain data conditions.

URL: https://openreview.net/forum?id=s8TsIPTgVG

---

Title: SHAP-XRT: The Shapley Value Meets Conditional Independence Testing

Abstract: The complex nature of artificial neural networks raises concerns on their reliability, trustworthiness, and fairness in real-world scenarios. The Shapley value---a solution concept from game theory---is one of the most popular explanation methods for machine learning models. More traditionally, from a statistical perspective, feature importance is defined in terms of conditional independence. So far, these two approaches to interpretability and feature importance have been considered separate and distinct. In this work, we show that Shapley-based explanation methods and conditional independence testing are closely related. We introduce the \textbf{SHAP}ley-E\textbf{X}planation \textbf{R}andomization \textbf{T}est (SHAP-XRT), a testing procedure inspired by the Conditional Randomization Test (CRT) for a specific notion of \emph{local} (i.e., on a sample) conditional independence. With it, we prove that for binary classification problems, the marginal contributions in the Shapley value provide lower and upper bounds to the $p$-values of their respective tests. Furthermore, we show that the Shapley value itself provides an upper bound to the $p$-value of a \emph{global} (i.e., overall) null hypothesis. As a result, we further our understanding of Shapley-based explanation methods from a novel perspective and characterize under which conditions one can make statistically valid claims about feature importance via the Shapley value.

URL: https://openreview.net/forum?id=WFtTpQ47A7

---

Title: Using VAEs to Learn Latent Variables: Observations on Applications in cryo-EM

Abstract: Variational autoencoders (VAEs) are a popular generative model used to approximate distributions. The encoder part of the VAE is used in amortized learning of latent variables, producing a latent representation for data samples. Recently, VAEs have been used to characterize physical and biological systems. In this case study, we qualitatively examine the amortization properties of a VAE used in biological applications. We find that in this application the encoder bears a qualitative resemblance to a more traditional explicit representation of latent variables.

URL: https://openreview.net/forum?id=JULyZ1rtaS

---

Title: Survey of Scientific Rigor Studied in Machine Learning

Abstract: The concern that Artificial Intelligence (AI) and Machine Learning (ML) are entering a ``reproducibility crisis'' has spurred significant research in the past few years. Yet with each paper, it is often unclear what someone means by ``reproducibility'' and where it fits in the larger scope of what we will call the ``scientific rigor'' literature. Ultimately, the lack of clear rigor standards can affect the manner in which businesses seeking to adopt AI/ML implement such capabilities. In this survey, we use 66 papers published since 2017 to construct a proposed set of eight high-level categories of scientific rigor, describing what each entails and the history of work conducted in each. We argue that these eight rigor types are not mutually exclusive and present a model for how they influence each other. To encourage more researchers to study these questions, we map these rigors to the adoption process in real-world business use cases. In doing so, we can quantify gaps in the literature that suggest an underemphasis on the issues necessary for scientific rigor research to transition to practice.

URL: https://openreview.net/forum?id=f9QWh3lZIt

---

Title: Offline Reinforcement Learning with Mixture of Deterministic Policies

Abstract: Offline reinforcement learning (RL) has recently attracted considerable attention as an approach for utilizing past experiences to learn a policy. Recent studies have reported the challenges of offline RL, such as estimating the values of actions that are outside the data distribution. To mitigate these issues, we propose an algorithm that leverages a mixture of deterministic policies. When the data distribution is multimodal, fitting a policy modeled with a unimodal distribution, such as a Gaussian distribution, may lead to interpolation between separate modes, thereby resulting in value estimation for actions that are outside the data distribution. In our framework, the state-action space is divided by learning discrete latent variables, and the sub-policies corresponding to each region are trained. The proposed algorithm is derived by considering the variational lower bound of the offline RL objective function. We show empirically that the use of the proposed mixture policy can reduce the accumulation of critic loss in offline RL, which has been reported in previous studies. Experimental results also indicate that using a mixture of deterministic policies in offline RL improves performance on the D4RL benchmark datasets.

URL: https://openreview.net/forum?id=zkRCp4RmAF

---

Title: Offline Reinforcement Learning with Additional Covering Distributions

Abstract: We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation. Despite the effort devoted to it, existing algorithms with theoretical finite-sample guarantees typically assume exploratory data coverage or strongly realizable function classes, which are hard to satisfy in reality. While recent works successfully tackle these strong assumptions, they either require gap assumptions that can only be satisfied by some MDPs, or use behavior regularization that makes the optimality of the learned policy intractable. To solve this challenge, we provide finite-sample guarantees for a simple algorithm based on marginalized importance sampling (MIS), showing that sample-efficient offline RL for general MDPs is possible with only a partial coverage dataset and weakly realizable function classes, given additional side information in the form of a covering distribution. Furthermore, we demonstrate that the covering distribution trades off prior knowledge of the optimal trajectories against the coverage requirement of the dataset, revealing the effect of this inductive bias in the learning process.

URL: https://openreview.net/forum?id=AfXq3x3X16

---

Title: Simplifying and Understanding State Space Models with Diagonal Linear RNNs

Abstract: Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispose of the discretization step, and propose a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that $\mathrm{DLR}$ is as performant as previously-proposed SSMs in the presence of strong supervision, despite being conceptually much simpler. Moreover, we characterize the expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs report near-perfect performance on tasks that can be modeled via $\textit{few}$ convolutional kernels, they struggle on tasks requiring $\textit{many}$ such kernels and especially when the desired sequence manipulation is $\textit{context-dependent}$. For example, $\mathrm{DLR}$ learns to perfectly shift a $0.5M$-long input by an arbitrary number of positions but fails when the shift size depends on context. Despite these limitations, $\mathrm{DLR}$ reaches high performance on two higher-order reasoning tasks $\mathrm{ListOpsSubTrees}$ and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$ with input lengths $8K$ and $65K$ respectively, and gives encouraging performance on $\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$ for which attention is not a viable choice.
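
For intuition, a diagonal linear RNN layer can be sketched as an elementwise (diagonal) linear state update with a linear readout, as below. The parameterization and initialization here are simplified assumptions; the paper's exact DLR setup is not reproduced.

```python
# Sketch of a diagonal linear RNN layer: the state update is elementwise
# (diagonal) and linear, x_t = lambda * x_{t-1} + B u_t, with a linear readout.
# Parameterization and initialization are simplified assumptions.
import torch
import torch.nn as nn

class DiagonalLinearRNN(nn.Module):
    def __init__(self, d_in, d_state, d_out):
        super().__init__()
        self.log_lam = nn.Parameter(torch.log(torch.rand(d_state) * 0.5 + 0.5))
        self.B = nn.Parameter(torch.randn(d_in, d_state) / d_in ** 0.5)
        self.C = nn.Parameter(torch.randn(d_state, d_out) / d_state ** 0.5)

    def forward(self, u):                      # u: (batch, time, d_in)
        lam = self.log_lam.exp()               # decay factors in (0, 1)
        x = torch.zeros(u.size(0), lam.numel(), device=u.device)
        ys = []
        for t in range(u.size(1)):
            x = lam * x + u[:, t] @ self.B     # diagonal (elementwise) transition
            ys.append(x @ self.C)
        return torch.stack(ys, dim=1)          # (batch, time, d_out)

y = DiagonalLinearRNN(4, 16, 4)(torch.randn(2, 10, 4))
```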

URL: https://openreview.net/forum?id=YrBFZ2egXv

---

Title: Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity

Abstract: In lifelong learning, data are used to improve performance not only on the current task, but also on previously encountered, and as yet unencountered, tasks. In contrast, classical machine learning, which starts from a blank slate (tabula rasa), uses data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be not only to improve performance on future tasks (forward transfer) but also to improve performance on past tasks (backward transfer) with any new data. Our key insight is that we can synergistically ensemble representations that were learned independently on disparate tasks to enable both forward and backward transfer. This generalizes ensembling independently learned representations (as in decision forests) and complements ensembling dependent representations (as in gradient boosted trees). Moreover, we ensemble representations in quasilinear space and time. We demonstrate this insight with two algorithms: representation ensembles of (1) trees and (2) networks. Both algorithms demonstrate forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, image, spoken, and adversarial tasks, on CIFAR-100, Five-Dataset, Split Mini-Imagenet, and Food1k, as well as the spoken digit dataset. This is in stark contrast to the reference algorithms we compared against, most of which failed to transfer either forward or backward, or both, despite the fact that many of them require quadratic space or time complexity.

URL: https://openreview.net/forum?id=DPtLNKzkg0

---

Title: Federated Sampling with Langevin Algorithm under Isoperimetry

Abstract: Federated learning uses a set of techniques to efficiently distribute the training of a machine learning algorithm across several devices, who own the training data. These techniques critically rely on reducing the communication cost---the main bottleneck---between the devices and a central server. Federated learning algorithms usually take an optimization approach: they are algorithms for minimizing the training loss subject to communication (and other) constraints. In this work, we instead take a Bayesian approach for the training task, and propose a communication-efficient variant of the Langevin algorithm to sample \textit{a posteriori}. The latter approach is more robust and provides more knowledge of the \textit{a posteriori} distribution than its optimization counterpart. We analyze our algorithm without assuming that the target distribution is strongly log-concave. Instead, we assume the weaker log Sobolev inequality, which allows for nonconvexity.
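
As background, the sketch below shows the basic (unadjusted) Langevin update that such samplers build on, targeting exp(-U) for a toy quadratic potential; the federated, communication-efficient machinery of the paper is not shown.

```python
# Unadjusted Langevin algorithm: theta <- theta - eta * grad U(theta)
# + sqrt(2 * eta) * noise, targeting exp(-U). Toy quadratic U => standard normal.
import numpy as np

rng = np.random.default_rng(0)

def grad_U(theta):                 # U(theta) = ||theta||^2 / 2  =>  target N(0, I)
    return theta

theta = rng.normal(size=2)
eta = 1e-2
samples = []
for _ in range(5000):
    theta = theta - eta * grad_U(theta) + np.sqrt(2 * eta) * rng.normal(size=2)
    samples.append(theta.copy())

print(np.mean(samples, axis=0), np.var(samples, axis=0))  # roughly zero mean, unit variance
```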

URL: https://openreview.net/forum?id=Sj7bFPeR6W

---

Title: No Free Lunch in Self Supervised Representation Learning

Abstract: Self-supervised representation learning in computer vision relies heavily on hand-crafted image transformations to learn meaningful and invariant features. However, few extensive explorations of the impact of transformation design have been conducted in the literature. In particular, although the dependence of representation quality on transformation design has been established, it has not been thoroughly studied. In this work, we explore this relationship and its impact on a domain other than natural images. We demonstrate that designing transformations can be viewed as a form of beneficial supervision. First, we show not only that transformations affect the features in representations and the relevance of clustering, but also that each category in a supervised dataset can be impacted differently in a controllable manner. Furthermore, we explore the impact of transformation design on a domain such as microscopy images, where differences between classes are more subtle than in natural images. In this case, we observe a more significant impact on the features encoded into the resulting representations. Finally, we demonstrate that transformation design can be leveraged as a form of supervision, as careful selection of these transformations, based on the desired features, can lead to a drastic increase in performance by improving the resulting representation.

URL: https://openreview.net/forum?id=xXgxlrQ6uo

---

Title: Models of human preference for learning reward functions

Abstract: The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment’s regret, a measure of a segment’s deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
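
A toy numeric sketch contrasting the two preference models: a Bradley-Terry/logistic choice rule applied to (a) summed rewards (partial return) and (b) negative regret. The regret values are supplied as toy numbers, since computing regret in practice requires optimal value estimates.

```python
# Toy contrast of the two preference models: a Bradley-Terry choice rule
# applied to (a) partial return (sum of rewards) and (b) negative regret
# (deviation from optimal decisions, supplied here as toy numbers).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

seg1_rewards = np.array([1.0, 0.0, 1.0])
seg2_rewards = np.array([1.0, 1.0, 1.0])
seg1_regret, seg2_regret = 0.2, 1.5   # toy values; real regret needs optimal values

# Partial-return model: prefer the segment with higher summed reward.
p_partial_return = sigmoid(seg1_rewards.sum() - seg2_rewards.sum())

# Regret model: prefer the segment with lower regret (closer to optimal behavior).
p_regret = sigmoid(seg2_regret - seg1_regret)

print(p_partial_return, p_regret)   # the two models can disagree on the same pair
```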

URL: https://openreview.net/forum?id=hpKJkVoThY

---

Title: MESSY Estimation: Maximum-Entropy based Stochastic and Symbolic densitY Estimation

Abstract: We introduce MESSY estimation, a Maximum-Entropy based Stochastic and Symbolic densitY estimation method. The proposed approach recovers probability density functions symbolically from samples using moments of a gradient flow in which the ansatz serves as the driving force. In particular, we construct a gradient-based drift-diffusion process that connects samples of the unknown distribution function to a guess symbolic expression. We then show that when the guess distribution has the maximum-entropy form, the parameters of this distribution can be found efficiently by solving a linear system of equations constructed using the moments of the provided samples. Furthermore, we use symbolic regression to explore the space of smooth functions and find optimal basis functions for the exponent of the maximum-entropy functional, leading to good conditioning. The cost of the proposed method for each set of selected basis functions is linear in the number of samples and quadratic in the number of basis functions. However, the underlying acceptance/rejection procedure for finding optimal and well-conditioned bases adds to the computational cost. We validate the proposed MESSY estimation method against other benchmark methods for the case of a bi-modal and a discontinuous density, as well as a density at the limit of physical realizability. We find that the addition of a symbolic search for basis functions improves the accuracy of the estimation at a reasonable additional computational cost. Our results suggest that the proposed method outperforms existing density recovery methods in the limit of a small to moderate number of samples by providing a low-bias and tractable symbolic description of the unknown density at a reasonable computational cost.

URL: https://openreview.net/forum?id=Y2ru0LuQeS

---

Title: Detecting Label Errors in Token Classification Data

Abstract: Mislabeled examples are a common issue in real-world data, particularly for tasks like token classification where many labels must be chosen on a fine-grained basis. Here we consider the task of finding sentences that contain label errors in token classification datasets. We study 11 different straightforward methods that score tokens/sentences based on the predicted class probabilities output by a (any) token classification model (trained via any procedure). In precision-recall evaluations based on real-world label errors in entity recognition data from CoNLL-2003, we identify a simple and effective method that consistently detects those sentences containing label errors when applied with different token classification models.
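
One simple scoring rule of this flavor can be sketched as follows: rank each sentence by the lowest predicted probability assigned to any of its given token labels, and review the lowest-scoring sentences first. This is an illustrative example of such a score, not necessarily the method the paper identifies as best.

```python
# Sketch of a simple sentence score for label-error detection: the lowest
# predicted probability the model assigns to any token's given label.
# Low scores flag sentences likely to contain label errors.
import numpy as np

def sentence_score(pred_probs, given_labels):
    # pred_probs: (num_tokens, num_classes) from any token classifier
    # given_labels: (num_tokens,) integer labels from the dataset
    per_token = pred_probs[np.arange(len(given_labels)), given_labels]
    return per_token.min()

probs = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.85, 0.05],
                  [0.02, 0.03, 0.95]])
labels = np.array([0, 1, 0])          # the last label disagrees with the model
print(sentence_score(probs, labels))  # 0.02 -> worth reviewing this sentence
```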


URL: https://openreview.net/forum?id=qxZWeAn4Fy

---

Title: Transport with Support: Data-Conditional Diffusion Bridges

Abstract: The dynamic Schrödinger bridge problem provides an appealing setting for solving optimal transport problems by learning non-linear diffusion processes using efficient iterative solvers. Recent works have demonstrated state-of-the-art results (e.g., in modelling single-cell embryo RNA sequences or sampling from complex posteriors) but are limited to learning bridges with only initial and terminal constraints.
Our work extends this paradigm by proposing the Iterative Smoothing Bridge (ISB). We integrate Bayesian filtering and optimal control into learning the diffusion process, enabling constrained stochastic processes governed by sparse observations at intermediate stages and terminal constraints. We assess the effectiveness of our method on synthetic and real-world data and show that the ISB generalises well to high-dimensional data, is computationally efficient, and provides accurate estimates of the marginals at intermediate and terminal times.

URL: https://openreview.net/forum?id=Mbc58EzF5q

---

Title: Boomerang: Local sampling on image manifolds using diffusion models

Abstract: The inference stage of diffusion models can be seen as running a reverse-time diffusion stochastic differential equation, where samples from a Gaussian latent distribution are transformed into samples from a target distribution that usually reside on a low-dimensional manifold, e.g., an image manifold. The intermediate values between the initial latent space and the image manifold can be interpreted as noisy images, with the amount of noise determined by the forward diffusion process noise schedule. We utilize this interpretation to present Boomerang, an approach for local sampling of image manifolds exploiting the reverse diffusion process dynamics. As implied by its name, Boomerang local sampling involves adding noise to an input image, moving it closer to the latent space, and then mapping it back to the image manifold through a partial reverse diffusion process. Thus, Boomerang generates images on the manifold that are ``similar,'' but nonidentical, to the original input image. We can control the proximity of the generated images to the original by adjusting the amount of noise added. Furthermore, due to the stochastic nature of the partial reverse diffusion process in Boomerang, the generated images display a certain degree of stochasticity, allowing us to obtain ample local samples from the manifold without encountering any duplicates. Boomerang offers the flexibility to work seamlessly with any pretrained diffusion model, such as Stable Diffusion, without necessitating any adjustments to the reverse diffusion process. We present three applications for local sampling using Boomerang. First, we provide a framework for constructing privacy-preserving datasets having controllable degrees of anonymity. Second, we show that using Boomerang for data augmentation increases generalization performance and outperforms state-of-the-art synthetic data augmentation. Lastly, we introduce a perceptual image enhancement framework powered by Boomerang, which enables resolution enhancement.
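
The add-noise-then-partially-denoise loop is close in spirit to the image-to-image pipeline in Hugging Face diffusers, sketched below, where the `strength` argument controls how much noise is added before mapping back to the image manifold. The model id, prompt, and parameter values are illustrative assumptions, not the authors' code.

```python
# Sketch of local sampling via "add noise, then partially reverse-diffuse",
# using the diffusers image-to-image pipeline as a stand-in. Small `strength`
# keeps the generated variants close to the input image. Model id, prompt,
# and parameters are illustrative assumptions.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))
variants = pipe(
    prompt="a photo",            # a generic prompt; the method itself is prompt-agnostic
    image=init_image,
    strength=0.3,                # fraction of the diffusion to traverse (noise level)
    num_images_per_prompt=4,
).images
```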

URL: https://openreview.net/forum?id=NYdThkjNW1

---

Title: About the Cost of Central Privacy in Density Estimation

Abstract: We study non-parametric density estimation for densities in Lipschitz and Sobolev spaces, and under central privacy. In particular, we investigate regimes where the privacy budget is not supposed to be constant. We consider the classical definition of central differential privacy, but also the more recent notion of central concentrated differential privacy. We recover the result of Barber & Duchi (2014) stating that histogram estimators are optimal against Lipschitz distributions for the L2 risk and, under regular differential privacy, we extend it to other norms and notions of privacy. Then, we investigate higher degrees of smoothness, drawing two conclusions: First, and contrary to what happens with constant privacy budget (Wasserman & Zhou, 2010), there are regimes where imposing privacy degrades the regular minimax risk of estimation on Sobolev densities. Second, so-called projection estimators are near-optimal against the same classes of densities in this new setup with pure differential privacy, but contrary to the constant privacy budget case, it comes at the cost of relaxation. With zero concentrated differential privacy, there is no need for relaxation, and we prove that the estimation is optimal.
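
For concreteness, a standard differentially private histogram estimator can be sketched as below: bin the samples, perturb the counts with Laplace noise, then clip and renormalize into a density. The noise scale 2/epsilon assumes replace-one adjacency; the constant depends on the neighboring-dataset convention, and this is a generic sketch rather than the paper's exact estimator.

```python
# Differentially private histogram density estimator on [0, 1]:
# Laplace-perturbed bin counts (scale 2/epsilon, an assumed convention),
# clipped at zero and renormalized into a density.
import numpy as np

def dp_histogram_density(samples, num_bins=20, epsilon=1.0, rng=None):
    rng = rng or np.random.default_rng()
    counts, edges = np.histogram(samples, bins=num_bins, range=(0.0, 1.0))
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=num_bins)
    noisy = np.clip(noisy, 0.0, None)
    bin_width = edges[1] - edges[0]
    total = noisy.sum()
    if total == 0:
        return np.full(num_bins, 1.0), edges    # fall back to the uniform density
    return noisy / (total * bin_width), edges

rng = np.random.default_rng(0)
density, edges = dp_histogram_density(rng.beta(2, 5, size=2000), epsilon=0.5, rng=rng)
```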

URL: https://openreview.net/forum?id=uq29MIWvIV

---

Title: Learn the Time to Learn: Replay Scheduling in Continual Learning

Abstract: Replay methods are known to be successful at mitigating catastrophic forgetting in continual learning scenarios despite having limited access to historical data. However, storing historical data is cheap in many real-world settings, yet replaying all historical data is often prohibited due to processing time constraints. In such settings, we propose that continual learning systems should learn the time to learn and schedule which tasks to replay at different time steps. We first demonstrate the benefits of our proposal by using Monte Carlo tree search to find a proper replay schedule, and show that the found replay schedules can outperform fixed scheduling policies when combined with various replay methods in different continual learning settings. Additionally, we propose a framework for learning replay scheduling policies with reinforcement learning. We show that the learned policies can generalize better in new continual learning scenarios compared to equally replaying all seen tasks, without added computational cost. Our study reveals the importance of learning the time to learn in continual learning, which brings current research closer to real-world needs.

URL: https://openreview.net/forum?id=Q4aAITDgdP

---

Title: Neural Circuit Diagrams: Standardized Diagrams for Deep Learning Architectures

Abstract: Diagrams matter. Unfortunately, the deep learning community has no standard method for diagramming architectures. The current combination of linear algebra notation and ad-hoc diagrams fails to offer the necessary precision to understand architectures in all their detail. However, this detail is critical for faithful implementation, mathematical analysis, further innovation, and ethical assurances. We present neural circuit diagrams, a graphical language based on category theory tailored to the needs of communicating deep learning architectures. Neural circuit diagrams naturally keep track of the changing arrangement of data, precisely show how operations are broadcast over axes, and display the critical parallel behavior of linear operations.

In this paper, we introduce neural circuit diagrams for an audience of machine learning researchers. Our introduction discusses the necessity of improved communication of architectures. Looking at related works, we establish why diagrams based on category theory offer the most promising approach. We then cover the mathematical foundations of neural circuit diagrams by introducing the category of shaped data. This section is accessible, given familiarity with sets and linear algebra, and contributes new perspectives on broadcasting, linear operations, and multilinearity. Finally, our results section justifies the utility of neural diagrams by explaining in quick succession a host of deep-learning architectures and features that are otherwise difficult to communicate. This includes convolution and its extensions with stride, dilation, and transposing; the transformer model; the U-Net; residual networks; and the vision transformer. We briefly discuss differentiation and graphically derive the memory cost of backpropagation, showing the potential of neural circuit diagrams to provide mathematical insight.

URL: https://openreview.net/forum?id=RyZB4qXEgt

---

Title: Prediction interval for neural network models using weighted asymmetric loss functions.

Abstract: We propose a simple and efficient approach to generate prediction intervals (PIs) for approximated and forecasted trends. Our method leverages a weighted asymmetric loss function to estimate the lower and upper bounds of the PIs, with the weights determined by the interval probability level. We provide a concise mathematical proof of the method, show how it can be extended to derive PIs for parametrised functions and argue why the method works for predicting PIs of dependent variables. The presented tests of the method on a real-world forecasting task using a neural network-based model show that it can produce reliable PIs in complex machine learning scenarios.
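
A minimal sketch of the kind of weighted asymmetric loss involved, written here as the standard pinball (quantile) loss in PyTorch; the paper's exact weighting scheme may differ, and the two-head setup in the comment is only one possible way to use it.

import torch

def pinball_loss(pred, target, tau):
    """Weighted asymmetric (pinball / quantile) loss: tau near 1 penalises
    under-prediction more, pushing `pred` towards an upper bound; tau near 0
    yields a lower bound."""
    diff = target - pred
    return torch.mean(torch.maximum(tau * diff, (tau - 1.0) * diff))

# One way to obtain a 90% central prediction interval from two output heads:
#   loss = pinball_loss(lower, y, tau=0.05) + pinball_loss(upper, y, tau=0.95)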

URL: https://openreview.net/forum?id=1oAwXU1PLd

---

Title: Auto-configuring Exploration-Exploitation Tradeoff in Population-Based Optimization: A Deep Reinforcement Learning Approach

Abstract: Population-based optimization (PBO) algorithms, renowned as powerful black-box optimizers, leverage a group of individuals to cooperatively search for the optimum. The exploration-exploitation tradeoff (EET) plays a crucial role in PBO, which, however, has traditionally been governed by manually designed rules. In this paper, we propose a deep reinforcement learning-based framework that autonomously configures and adapts the EET throughout the PBO search process. The framework allows different individuals of the population to selectively attend to the global and local exemplars based on the current search state, maximizing the cooperative search outcome. Our proposed framework is characterized by its simplicity, effectiveness, and generalizability, with the potential to enhance numerous existing PBO algorithms. To validate its capabilities, we apply our framework to several representative PBO algorithms and conduct extensive experiments on the augmented CEC2021 benchmark. The results demonstrate significant improvements in the performance of the backbone algorithms, as well as favorable generalization across diverse problem classes, dimensions, and population sizes. Additionally, we provide an in-depth analysis of the EET issue by interpreting the learned behaviors of PBO.
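
To make the "attend to global and local exemplars" idea concrete, here is a hedged sketch of a particle-swarm-style update in which the per-individual mixing coefficients are supplied externally (in the paper they would come from the learned RL policy conditioned on the search state); all names and constants are illustrative.

import numpy as np

def pso_step(x, v, pbest, gbest, coeffs, inertia=0.7, rng=None):
    """One particle-swarm-style update in which each individual's attention to
    its local exemplar (pbest) and the global exemplar (gbest) is supplied by
    an external controller via `coeffs` of shape (n, 2)."""
    rng = rng or np.random.default_rng(0)
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = (inertia * v
         + coeffs[:, :1] * r1 * (pbest - x)    # exploit the individual's own best
         + coeffs[:, 1:] * r2 * (gbest - x))   # move towards the swarm's best
    return x + v, v

n, d = 32, 10
rng = np.random.default_rng(1)
x, v = rng.standard_normal((n, d)), np.zeros((n, d))
pbest, gbest = x.copy(), x[0]
coeffs = np.full((n, 2), 1.5)                  # placeholder for the RL policy's output
x, v = pso_step(x, v, pbest, gbest, coeffs)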

URL: https://openreview.net/forum?id=U2viPsXLxw

---

Title: Explaining the origin of adversarial attacks using in-distribution adversarial examples.

Abstract: Despite a plethora of proposed theories, understanding why deep neural networks are susceptible to adversarial attacks remains an open question. A promising recent strand of research investigates adversarial attacks within the training data distribution, providing a more stringent and worrisome definition for these attacks. These theories posit that the key issue is that in high dimensional datasets, most data points are close to the ground-truth class boundaries. This has been shown in theory for some simple data distributions, but it is unclear if this theory is relevant in practice. Here, we demonstrate the existence of in-distribution adversarial examples for object recognition. This result provides evidence supporting theories attributing adversarial examples to the proximity of data to ground-truth class boundaries, and calls into question other theories which do not account for this more stringent definition of adversarial attacks. These experiments are enabled by our novel gradient-free, evolutionary strategies (ES) based approach for finding in-distribution adversarial examples in 3D rendered objects, which we call CMA-Search.
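
A toy sketch of a gradient-free evolution-strategy search loop of the kind CMA-Search belongs to; this is a plain (1+lambda)-style strategy rather than CMA-ES, and the objective below is only a stand-in for "render the scene, classify it, and score misclassification".

import numpy as np

def es_search(score_fn, x0, sigma=0.1, pop=16, iters=200, rng=None):
    """Gradient-free (1+lambda)-style evolution strategy: keep the best point
    found so far and propose Gaussian perturbations around it.  `score_fn`
    should return the quantity to maximise, e.g. one minus the target model's
    probability for the true class of the rendered scene."""
    rng = rng or np.random.default_rng(0)
    best = np.asarray(x0, dtype=float)
    best_score = score_fn(best)
    for _ in range(iters):
        cand = best + sigma * rng.standard_normal((pop, best.size))
        scores = np.array([score_fn(c) for c in cand])
        i = int(scores.argmax())
        if scores[i] > best_score:
            best, best_score = cand[i], scores[i]
    return best, best_score

# Toy objective standing in for "render, classify, read off the true-class score".
best, score = es_search(lambda p: -np.sum((p - 0.3) ** 2), x0=np.zeros(4))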

URL: https://openreview.net/forum?id=WOpvWXO5bB

---

Title: Switching Latent Bandits

Abstract: We consider a Latent Bandit problem where the latent state keeps changing over time according to an underlying Markov Chain and every state is represented by a specific Bandit instance. At each step, the agent chooses an arm and observes a random reward but is unaware of which bandit instance it is currently playing. As is typical in Latent Bandits, we assume the reward distributions of the arms of all the Bandit instances are known. Within this setting, our goal is to learn the transition matrix determined by the Markov process, so as to minimize the cumulative regret. We propose a technique to solve this estimation problem that exploits the properties of Markov Chains and results in solving a system of linear equations. We present an offline method that chooses the best subset of possible arms that can be used for matrix estimation, and we ultimately introduce the SL-EC learning algorithm based on an Explore-Then-Commit strategy that builds a belief representation of the current state and optimizes the instantaneous regret at each step. This algorithm achieves a regret of the order $\widetilde{\mathcal{O}}(T^{2/3})$ with $T$ being the considered horizon. Finally, we illustrate the effectiveness of the approach and compare it with state-of-the-art algorithms for non-stationary bandits.
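
The belief representation mentioned above can be illustrated with a standard HMM-style filtering step, assuming the transition matrix and per-state reward likelihoods are available (in SL-EC the transition matrix is first estimated); this is a generic sketch, not the paper's algorithm.

import numpy as np

def belief_update(belief, T, likelihood):
    """One HMM-style filtering step: propagate the belief over latent bandit
    instances through the transition matrix T (T[i, j] = P(s'=j | s=i)) and
    reweight by the likelihood of the observed reward under each instance."""
    predicted = T.T @ belief
    posterior = predicted * likelihood
    return posterior / posterior.sum()

def greedy_arm(belief, means):
    """Pick the arm with the highest expected reward, where means[s, a] is the
    known mean reward of arm a under latent instance s."""
    return int(np.argmax(belief @ means))

T = np.array([[0.9, 0.1], [0.2, 0.8]])
means = np.array([[0.2, 0.8], [0.7, 0.3]])
b = np.array([0.5, 0.5])
arm = greedy_arm(b, means)
b = belief_update(b, T, likelihood=np.array([0.6, 0.1]))  # e.g. Bernoulli reward likelihoods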

URL: https://openreview.net/forum?id=qkKgjX5Rr1

---

Title: Inverse Scaling: When Bigger Isn't Better

Abstract: Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling:
(i) preference to repeat memorized sequences over following in-context instructions,
(ii) imitation of undesirable patterns in the training data,
(iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and
(iv) correct but misleading few-shot demonstrations of the task.
We release the winning datasets at [redacted; see supplementary material] to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.

URL: https://openreview.net/forum?id=DwgRm72GQF

---

Title: Local Advantage Networks for Multi-Agent Reinforcement Learning in Dec-POMDPs

Abstract: Many recent successful off-policy multi-agent reinforcement learning (MARL) algorithms for cooperative partially observable environments focus on finding factorized value functions, leading to convoluted network structures. Building on the structure of independent Q-learners, our LAN algorithm takes a radically different approach, leveraging a dueling architecture to learn, for each agent, a decentralized best-response policy via individual advantage functions. The learning is stabilized by a centralized critic whose primary objective is to reduce the moving target problem of the individual advantages. The critic, whose network size is independent of the number of agents, is cast aside after learning. Evaluation on the StarCraft II multi-agent challenge benchmark shows that LAN reaches state-of-the-art performance and is highly scalable with respect to the number of agents, opening up a promising alternative direction for MARL research.
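
For readers unfamiliar with dueling architectures, a minimal per-agent dueling head in PyTorch is sketched below; it shows the generic Q = V + A - mean(A) decomposition, not LAN's actual networks or its centralized critic.

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Per-agent dueling decomposition Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.adv = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        h = self.trunk(obs)
        a = self.adv(h)
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)

q_values = DuelingHead(obs_dim=32, n_actions=6)(torch.randn(4, 32))  # shape (4, 6)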

URL: https://openreview.net/forum?id=adpKzWQunW

---

Title: General-Purpose In-Context Learning by Meta-Learning Transformers

Abstract: Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose in-context learning algorithms from scratch, using only black-box models with minimal inductive bias. Such a model takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other black-box models can be meta-trained to act as general-purpose in-context learners. We characterize phase transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks, and meta-optimization. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count. Finally, we propose practical interventions such as biasing the training distribution that improve the meta-training and meta-generalization of general-purpose in-context learning algorithms.

URL: https://openreview.net/forum?id=tK4RSfKHug

---

Title: Identifying latent distances with Finslerian geometry

Abstract: Riemannian geometry provides us with powerful tools to explore the latent space of generative models while preserving the underlying structure of the data. The latent space can be equipped with a Riemannian metric, pulled back from the data manifold. With this metric, we can systematically navigate the space using geodesics, defined as the shortest curves between two points.

Generative models are often stochastic, causing the data space, the Riemannian metric, and the geodesics to be stochastic as well. Stochastic objects are at best impractical, and at worst impossible, to manipulate. A common solution is to approximate the stochastic pullback metric by its expectation. But the geodesics derived from this expected Riemannian metric do not correspond to the expected length-minimising curves.

In this work, we propose another metric whose geodesics explicitly minimise the expected length of the pullback metric. We show this metric defines a Finsler metric, and we compare it with the expected Riemannian metric. In high dimensions, we prove that both metrics converge to each other at a rate of $\mathcal{O}\left(\frac{1}{D}\right)$. This convergence implies that the established expected Riemannian metric is an accurate approximation of the theoretically more grounded Finsler metric. This provides justification for using the expected Riemannian metric for practical implementations.
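
A small sketch of the expected Riemannian pullback metric that the Finsler metric is compared against, estimated by Monte Carlo over the decoder's stochasticity; the toy decoder and sample count are placeholders, not the paper's setup.

import torch

def expected_pullback_metric(decode, z, n_samples=16):
    """Monte Carlo estimate of G(z) = E[J(z)^T J(z)] for a stochastic decoder:
    each call to `decode` may draw fresh decoder noise, so averaging the
    per-sample pullback metrics approximates the expectation."""
    G = torch.zeros(z.numel(), z.numel())
    for _ in range(n_samples):
        J = torch.autograd.functional.jacobian(decode, z)   # shape (data_dim, latent_dim)
        G += J.T @ J
    return G / n_samples

decode = lambda z: torch.sin(z @ torch.randn(2, 5))          # toy stochastic decoder
G = expected_pullback_metric(decode, torch.zeros(2))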

URL: https://openreview.net/forum?id=Q2Gi0TUAdS

---

Title: RIFLE: Imputation and Robust Inference from Low Order Marginals

Abstract: The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms has been developed for data imputation, the overwhelming majority perform poorly if there are many missing values and low sample sizes, which are unfortunately common characteristics in empirical data. Such low-accuracy estimations adversely affect the performance of downstream statistical models. We develop a statistical inference framework for predicting the target variable in the presence of missing data without imputation. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to linear regression and normal discriminant analysis, and we provide convergence and performance guarantees. This framework can also be adapted to impute missing data. We compare RIFLE with state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) in numerical experiments. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small. RIFLE is publicly available.
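
A simplified sketch of the moment-estimation idea: estimate second moments entrywise from jointly observed entries and solve a regularised normal equation. The ridge term stands in for RIFLE's distributionally robust confidence-interval treatment and is not the paper's method.

import numpy as np

def pairwise_moments(X, y):
    """Estimate E[X^T X] and E[X^T y] entrywise, using only rows where the
    relevant entries are jointly observed (NaN marks a missing value)."""
    d = X.shape[1]
    C, c = np.zeros((d, d)), np.zeros(d)
    for i in range(d):
        mi = ~np.isnan(X[:, i])
        my = mi & ~np.isnan(y)
        c[i] = np.mean(X[my, i] * y[my])
        for j in range(d):
            m = mi & ~np.isnan(X[:, j])
            C[i, j] = np.mean(X[m, i] * X[m, j])
    return C, c

def moment_regression(X, y, lam=1e-1):
    """Linear regression from the estimated moments; the ridge term replaces
    RIFLE's distributionally robust handling of moment uncertainty."""
    C, c = pairwise_moments(X, y)
    return np.linalg.solve(C + lam * np.eye(len(c)), c)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
X[rng.random(X.shape) < 0.4] = np.nan            # 40% of the feature entries are missing
y = rng.standard_normal(500)
w = moment_regression(X, y)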

URL: https://openreview.net/forum?id=oud7Ny0KQy

---

Title: Graph Neural Networks Formed via Layer-wise Ensembles of Heterogeneous Base Models

Abstract: Graph Neural Networks (GNNs) with numerical node features and graph structure as inputs have demonstrated superior performance on various semi-supervised learning tasks with graph data. However, the numerical node features utilized by GNNs are commonly extracted from raw data which is of text or tabular (numeric/categorical) type in most real-world applications.
The best models for such data types in most standard supervised learning settings with IID (non-graph) data are not simple neural network layers and thus are not easily incorporated into a GNN. Here we propose a robust stacking framework that fuses graph-aware propagation with arbitrary models intended for IID data, which are ensembled and stacked in multiple layers. Our layer-wise framework leverages bagging and stacking strategies to enjoy strong generalization, in a manner which effectively mitigates label leakage and overfitting. Across a variety of graph datasets with tabular/text node features, our method achieves comparable or superior performance relative to both tabular/text and graph neural network models, as well as existing state-of-the-art hybrid strategies that combine the two.

URL: https://openreview.net/forum?id=LO02YHxrxd

---

Title: AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks

Abstract: The fields of both Natural Language Processing (NLP) and Automated Machine Learning (AutoML) have achieved remarkable results over the past years. In NLP, Large Language Models (LLMs) in particular have very recently experienced a rapid series of breakthroughs. We envision that the two fields can radically push the boundaries of each other through tight integration. To showcase this vision, we explore the potential of a symbiotic relationship between AutoML and LLMs, shedding light on how they can benefit each other. In particular, we investigate both the opportunities to enhance AutoML approaches with LLMs from different perspectives and the challenges of leveraging AutoML to further improve LLMs. To this end, we survey existing work, and we critically assess risks. We strongly believe that the integration of the two fields has the potential to disrupt both NLP and AutoML. By highlighting conceivable synergies, but also risks, we aim to foster further exploration at the intersection of AutoML and LLMs.

URL: https://openreview.net/forum?id=cAthubStyG

---

Title: On the Role of Initialization on the Implicit Bias in Deep Learning

Abstract: Despite Deep Learning's (DL) empirical success, our theoretical understanding of its efficacy remains limited. One notable paradox is that while conventional wisdom discourages perfect data fitting, deep neural networks are designed to do just that, yet they generalize effectively. This study explores this phenomenon, which is commonly attributed to an implicit bias at play. Various implicit bias sources have been identified, such as step size, weight initialization, the optimization algorithm, and the number of parameters. In this work, we focus on investigating the implicit bias originating from weight initialization. To this end, we examine the problem of solving underdetermined linear systems in various contexts, scrutinizing the impact of initialization on the implicit regularization when using deep networks to solve such systems. Our findings elucidate the role of initialization in the optimization and generalization paradoxes, contributing to a more comprehensive understanding of DL's performance characteristics.
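
The linear baseline behind this question can be demonstrated in a few lines: plain gradient descent on an underdetermined least-squares problem converges to the solution closest (in Euclidean norm) to its initialization, the simplest instance of initialization-dependent implicit bias. The deep-network case studied in the paper is of course richer; this is only the textbook warm-up.

import numpy as np

# Underdetermined system: more unknowns than equations, infinitely many solutions.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((10, 50)), rng.standard_normal(10)
w0 = rng.standard_normal(50)                     # the initialization under study

w = w0.copy()
for _ in range(20_000):                          # plain gradient descent on ||Xw - y||^2
    w -= 5e-3 * X.T @ (X @ w - y)

# Implicit bias of the initialization: GD returns the solution closest to w0,
# i.e. the Euclidean projection of w0 onto the affine set {w : Xw = y}.
w_closest = w0 + np.linalg.pinv(X) @ (y - X @ w0)
print(np.linalg.norm(w - w_closest))             # ~0: matches the closed form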

URL: https://openreview.net/forum?id=iutDpibXnc

---

Title: Bag of Image Patch Embedding Behind the Success of Self-Supervised Learning

Abstract: Self-supervised learning (SSL) has recently achieved tremendous empirical advancements in learning image representation. However, our understanding of the principle behind learning such a representation is still limited. This work shows that joint-embedding SSL approaches primarily learn a representation of image patches, which reflects their co-occurrence. Such a connection to co-occurrence modeling can be established formally, and it supplements the prevailing invariance perspective. We empirically show that learning a representation for fixed-scale patches and aggregating local patch representations as the image representation achieves similar or even better results than the baseline methods. We denote this process as {\it BagSSL}. Even with $32\times 32$ patch representation, BagSSL achieves $62\%$ top-1 linear probing accuracy on ImageNet. On the other hand, with a multi-scale pretrained model, we show that the whole image embedding is approximately the average of local patch embeddings. While the SSL representation is relatively invariant at the global scale, we show that locality is preserved when we zoom into local patch-level representation. Further, we show that patch representation aggregation can improve various SOTA baseline methods by a large margin. The patch representation is considerably easier to understand, and this work takes a step toward demystifying self-supervised representation learning.
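
A sketch of the aggregation step described above: embed fixed-scale, non-overlapping patches and average their embeddings into an image representation. The `encoder` here is a stand-in for any pretrained SSL patch encoder and is not specified by the paper.

import torch
import torch.nn.functional as F

def bag_of_patch_embedding(encoder, images, patch=32, stride=32):
    """Embed fixed-scale patches and average them into an image representation.
    images: (B, C, H, W); `encoder` maps (N, C, patch, patch) -> (N, D)."""
    B, C, H, W = images.shape
    cols = F.unfold(images, kernel_size=patch, stride=stride)       # (B, C*p*p, L)
    L = cols.shape[-1]
    patches = cols.transpose(1, 2).reshape(B * L, C, patch, patch)
    emb = encoder(patches)                                          # (B*L, D)
    return emb.reshape(B, L, -1).mean(dim=1)                        # (B, D)

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
reps = bag_of_patch_embedding(encoder, torch.randn(8, 3, 224, 224))  # shape (8, 128)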

URL: https://openreview.net/forum?id=r06xREo3QG

---

Title: Sampling-Based Techniques for Training Deep Neural Networks with Limited Computational Resources: A Scalability Evaluation

Abstract: Deep neural networks are superior to shallow networks in learning complex representations. As such, there is fast-growing interest in utilizing them in large-scale settings. The training process of neural networks is already known to be time-consuming, and having a deep architecture only aggravates the issue. This process consists mostly of matrix operations, among which matrix multiplication is the bottleneck. Several sampling-based techniques have been proposed for speeding up the training time of deep neural networks by approximating the matrix products. These techniques fall under two categories: (i) sampling a subset of nodes in every hidden layer as active at every iteration and (ii) sampling a subset of nodes from the previous layer to approximate the current layer's activations using the edges from the sampled nodes. In both cases, the matrix products are computed using only the selected samples. In this paper, we evaluate the scalability of these approaches on CPU machines with limited computational resources. Making a connection between the two research directions as special cases of approximating matrix multiplications in the context of neural networks, we provide a negative theoretical analysis that shows feedforward approximation is an obstacle to scalability. We conduct comprehensive experimental evaluations that demonstrate the most pressing challenges and limitations associated with the studied approaches. We observe that the hashing-based node selection method is not scalable to a large number of layers, confirming our theoretical analysis. Finally, we identify directions for future research.
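
The shared primitive behind these techniques is approximating a matrix product from a sample of inner-dimension indices. Below is a generic unbiased column/row-sampling estimator in that spirit; it is not a reimplementation of any specific method evaluated in the paper.

import numpy as np

def sampled_matmul(A, B, s, rng=None):
    """Unbiased approximation of A @ B from s sampled inner-dimension indices,
    drawn with probability proportional to ||A[:, k]|| * ||B[k, :]|| and
    rescaled by 1 / (s * p_k)."""
    rng = rng or np.random.default_rng(0)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=s, replace=True, p=p)
    scale = 1.0 / (s * p[idx])
    return (A[:, idx] * scale) @ B[idx, :]

A, B = np.random.randn(256, 1024), np.random.randn(1024, 128)
rel_err = np.linalg.norm(sampled_matmul(A, B, s=200) - A @ B) / np.linalg.norm(A @ B)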

URL: https://openreview.net/forum?id=30C7R1LQ68

---

Title: Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods

Abstract: We address the problem of evaluating the quality of self-supervised learning (SSL) models without access to supervised labels, while being agnostic to the architecture, learning algorithm or data manipulation used during training. We argue that representations can be evaluated through the lens of expressiveness and learnability. We propose to use the Intrinsic Dimension (ID) to assess expressiveness and introduce Cluster Learnability (CL) to assess learnability. CL is measured in terms of the performance of a KNN classifier trained to predict labels obtained by clustering the representations with K-means. We thus combine CL and ID into a single predictor – CLID. Through a large-scale empirical study with a diverse family of SSL algorithms, we find that CLID better correlates with in-distribution model performance than other competing recent evaluation schemes. We also benchmark CLID on out-of-domain generalization, where CLID serves as a predictor of the transfer performance of SSL models on several visual classification tasks, yielding improvements with respect to the competing baselines.
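
A rough sketch of the Cluster Learnability component, assuming K-means pseudo-labels and a cross-validated KNN probe; the cluster count, neighbourhood size, and the way CL is combined with the intrinsic-dimension estimate are illustrative, not the paper's exact protocol.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def cluster_learnability(Z, n_clusters=10, k=20):
    """Cluster Learnability: accuracy of a KNN probe at recovering K-means
    pseudo-labels of the representation (higher = more learnable)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Z)
    return cross_val_score(KNeighborsClassifier(n_neighbors=k), Z, labels, cv=5).mean()

Z = np.random.randn(2_000, 64)        # stand-in for SSL embeddings of a dataset
cl = cluster_learnability(Z)
# CLID additionally combines CL with an intrinsic-dimension estimate of Z;
# that combination rule is not reproduced here.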

URL: https://openreview.net/forum?id=BxdrpnRHNh

---

Title: Interleaving Multi-Task Neural Architecture Search

Abstract: Multi-task neural architecture search (MTNAS), which searches for a shared architecture for multiple tasks, has been broadly investigated. In these methods, multiple tasks are learned simultaneously by minimizing the weighted sum of their losses. Balancing these losses by finding the optimal loss weights requires a lot of tuning, which is time-consuming and labor-intensive. To address this problem, we propose an interleaving MTNAS framework, where no tuning of loss weights is needed. In our method, a set of tasks (e.g., A, B, C) is performed in an interleaving loop (e.g., ABCABCABC...) in which each task transfers its knowledge to the next task. Each task is learned by minimizing its loss function alone, without interfering with the losses of other tasks. Loss functions of individual tasks are organized into a multi-level optimization framework which enables all tasks to be performed end-to-end. The effectiveness of our method is demonstrated in a variety of experiments.
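
A minimal sketch of the interleaving loop itself; the paper's multi-level optimization over architecture variables is omitted, and every component passed in is a placeholder (PyTorch-style model, heads, and optimizers are assumed).

import itertools

def interleaved_training(tasks, shared_model, heads, optimizers, steps=3_000):
    """Interleave tasks A, B, C, ... in a fixed loop; at each step only the
    current task's own loss is minimized, so no loss weights are needed.
    tasks:      list of (batch_iterator, loss_fn) pairs
    heads:      per-task output heads on top of the shared architecture
    optimizers: per-task optimizers over shared + head parameters
    """
    order = itertools.cycle(range(len(tasks)))
    for _ in range(steps):
        t = next(order)                              # ...ABCABCABC...
        batch_x, batch_y = next(tasks[t][0])
        loss = tasks[t][1](heads[t](shared_model(batch_x)), batch_y)
        optimizers[t].zero_grad()
        loss.backward()
        optimizers[t].step()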

URL: https://openreview.net/forum?id=8JXbKYbaG2

---

Title: Pairwise Learning with Adaptive Online Gradient Descent

Abstract: In this paper, we propose an adaptive online gradient descent method with momentum for pairwise learning, in which the stepsize is determined by historical information. Due to the structure of pairwise learning, the sample pairs are dependent on the parameters, causing difficulties in the convergence analysis. To this end, we develop novel techniques for the convergence analysis of the proposed algorithm. We show that the proposed algorithm can output the desired solution in strongly convex, convex, and nonconvex cases. Furthermore, we present theoretical explanations for why our proposed algorithm can accelerate previous workhorses for online pairwise learning. All assumptions used in the theoretical analysis are mild and common, making our results applicable to various pairwise learning problems. To demonstrate the efficiency of our algorithm, we compare the proposed adaptive method with the non-adaptive counterpart on the benchmark online AUC maximization problem.
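
A hedged sketch of an adaptive online update on a pairwise objective, using a scalar AdaGrad-style step size with momentum on a hinge surrogate of AUC; this is a generic stand-in for the algorithm analysed in the paper, not its exact update rule.

import numpy as np

def adaptive_pairwise_auc(pairs, d, beta=0.9, eta=1.0, eps=1e-8):
    """Online gradient descent with momentum on the pairwise hinge surrogate
    of AUC, loss(w; x+, x-) = max(0, 1 - w^T (x+ - x-)), with a step size
    scaled by the accumulated historical gradient norm."""
    w, m, G = np.zeros(d), np.zeros(d), 0.0
    for x_pos, x_neg in pairs:                 # one positive/negative pair per round
        z = x_pos - x_neg
        g = -z if w @ z < 1.0 else np.zeros(d)
        G += g @ g                             # historical information for the step size
        m = beta * m + (1 - beta) * g          # momentum on the pairwise gradient
        w -= eta / np.sqrt(G + eps) * m
    return w

rng = np.random.default_rng(0)
stream = ((rng.normal(1, 1, 5), rng.normal(-1, 1, 5)) for _ in range(1_000))
w = adaptive_pairwise_auc(stream, d=5)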

URL: https://openreview.net/forum?id=rq1SaHQg2k

---

Title: An Optical Control Environment for Benchmarking Reinforcement Learning Algorithms

Abstract: Deep reinforcement learning has the potential to address various scientific problems. In this paper, we implement an optics simulation environment for reinforcement learning based controllers. The environment captures the essence of nonconvexity, nonlinearity, and time-dependent noise inherent in optical systems, offering a more realistic setting.
Subsequently, we provide the benchmark results of several reinforcement learning algorithms on the proposed simulation environment. The experimental findings demonstrate the superiority of off-policy reinforcement learning approaches over traditional control algorithms in navigating the intricacies of complex optical control environments.

URL: https://openreview.net/forum?id=61TKzU9B96

---

Title: Tight bounds for maximum $\ell_1$-margin classifiers

Abstract: Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high-dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that, surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^*\|_1^{2/3}}{n^{1/3}}$ for general ground truths. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore the first to show benign overfitting for the maximum $\ell_1$-margin classifier.
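
For separable data, the maximum $\ell_1$-margin classifier itself can be computed as a small linear program, min $\|w\|_1$ s.t. $y_i \langle x_i, w\rangle \ge 1$, which makes the object of study concrete; the sketch below is the standard LP reformulation and not taken from the paper.

import numpy as np
from scipy.optimize import linprog

def max_l1_margin(X, y):
    """Maximum l1-margin direction for separable data via the linear program
    min ||w||_1 s.t. y_i <x_i, w> >= 1, written with w = u - v and u, v >= 0."""
    n, d = X.shape
    c = np.ones(2 * d)                              # objective: sum(u) + sum(v)
    A_ub = -y[:, None] * np.hstack([X, -X])         # encodes -y_i x_i^T (u - v) <= -1
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(n), bounds=[(0, None)] * (2 * d))
    w = res.x[:d] - res.x[d:]
    return w / np.linalg.norm(w, 1)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))                  # n << d: linearly separable w.p. 1
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(20))
w_hat = max_l1_margin(X, y)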

URL: https://openreview.net/forum?id=WWmTWWqXdW

---

Title: FONDUE: an algorithm to automatically find the dimensionality of the latent representations of variational autoencoders

Abstract: When training a variational autoencoder (VAE) on a given dataset, determining the number of latent variables requires human supervision and fully training one or more VAEs. In this paper, we explore ways to simplify this time-consuming process through the lens of the polarised regime. Specifically, we show that the discrepancies between the variances of the mean and the sampled representations of a VAE reveal the presence of passive variables in the latent space, which, in well-behaved VAEs, indicates a superfluous number of dimensions.
After formally demonstrating this phenomenon, we use it to propose FONDUE: an unsupervised algorithm which efficiently finds a suitable number of latent dimensions without fully training any models. We additionally show that FONDUE can be extended in a number of ways, providing a principled and unified method for selecting the number of latent dimensions for VAEs and deterministic autoencoders.
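
A heuristic sketch of the signal FONDUE exploits: passive latent dimensions have near-zero variance of the mean encodings but near-unit variance of the sampled encodings. The threshold and synthetic example are illustrative, not the paper's criterion.

import numpy as np

def count_passive_dims(mu, logvar, threshold=0.1, rng=None):
    """Heuristic count of passive latent dimensions: near-zero variance of the
    mean encodings but near-unit variance of the sampled encodings.
    mu, logvar: encoder outputs over a dataset, each of shape (N, D)."""
    rng = rng or np.random.default_rng(0)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterised samples
    var_mean = mu.var(axis=0)
    var_sample = z.var(axis=0)
    return int(np.sum(var_mean / var_sample < threshold))

N = 1_000
mu = np.hstack([np.random.randn(N, 6), np.zeros((N, 4))])       # last 4 dims unused
logvar = np.hstack([np.full((N, 6), -4.0), np.zeros((N, 4))])   # passive dims keep unit variance
print(count_passive_dims(mu, logvar))                           # -> 4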

URL: https://openreview.net/forum?id=wULNbJ2Cak

---
