Weekly TMLR digest for Feb 04, 2024

TMLR

Feb 3, 2024, 7:00:13 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Reproducibility Certification: We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline

Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari, Judy Hoffman, Prithvijit Chattopadhyay, Viraj Uday Prabhu

https://openreview.net/forum?id=10R6iX6JHm

---


Expert Certification: Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel

https://openreview.net/forum?id=UAT4j3Y7HP

---


Survey Certification: A Survey on Graph Construction for Geometric Deep Learning in Medicine: Methods and Recommendations

Tamara T. Müller, Sophie Starck, Alina Dima, Stephan Wunderlich, Kyriaki-Margarita Bintsi, Kamilia Zaripova, Rickmer Braren, Daniel Rueckert, Anees Kazi, Georgios Kaissis

https://openreview.net/forum?id=sWlHhfijcS

---


Expert Certification: Wavelet Networks: Scale-Translation Equivariant Learning From Raw Time-Series

David W. Romero, Erik J Bekkers, Jakub M. Tomczak, Mark Hoogendoorn

https://openreview.net/forum?id=ga5SNulYet

---


Accepted papers
===============


Title: Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Authors: Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

Abstract: Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.

URL: https://openreview.net/forum?id=5nBqY1y96B
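
As a concrete illustration of the compositional-consistency notion described above (a sketch of my own, not the authors' code; query_model is a hypothetical helper that sends a prompt to the LLM and returns its text answer):

    def compositional_consistency(direct_prompt, step_prompts, composed_template, query_model):
        # Final answer when the model solves the task in one pass.
        direct = query_model(direct_prompt)
        # The model's own answers to the intermediate sub-steps.
        step_answers = [query_model(p) for p in step_prompts]
        # Final answer when those sub-answers are substituted back into the question.
        composed = query_model(composed_template.format(*step_answers))
        # Consistent if both routes give the same final answer.
        return direct.strip() == composed.strip()

Hypothetical consistency could be probed analogously by asking the model to predict its own answer to a prompt and comparing that prediction with the answer it actually gives.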

---

Title: MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments

Authors: Spyros Gidaris, Andrei Bursuc, Oriane Siméoni, Antonín Vobecký, Nikos Komodakis, Matthieu Cord, Patrick Perez

Abstract: Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. Doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols with a training that is at least 3 times faster than prior methods. We provide the implementation code at https://github.com/valeoai/MOCA.

URL: https://openreview.net/forum?id=OdDsCaacZ0

---

Title: Transfer Learning for Bayesian Optimization on Heterogeneous Search Spaces

Authors: Zhou Fan, Xinran Han, Zi Wang

Abstract: Bayesian optimization (BO) is a popular black-box function optimization method, which makes sequential decisions based on a Bayesian model, typically a Gaussian process (GP), of the function. To ensure the quality of the model, transfer learning approaches have been developed to automatically design GP priors by learning from observations on "training" functions. These training functions are typically required to have the same domain as the "test" function (black-box function to be optimized). In this paper, we introduce MPHD, a model pre-training method on heterogeneous domains, which uses a neural net mapping from domain-specific contexts to specifications of hierarchical GPs. MPHD can be seamlessly integrated with BO to transfer knowledge across heterogeneous search spaces. Our theoretical and empirical results demonstrate the validity of MPHD and its superior performance on challenging black-box function optimization tasks.

URL: https://openreview.net/forum?id=emXh4M7TyH

---

Title: A Multilinear Least-Squares Formulation for Sparse Tensor Canonical Correlation Analysis

Authors: Jun Yu, Zhaoming Kong, Kun Chen, Xin Zhang, Yong Chen, Lifang He

Abstract: Tensor data are becoming important recently in various applications, e.g., image and video recognition, which pose new challenges for data modeling and analysis approaches, such as high-order relations of large complexity, varying data scale and gross noise. In this paper, we consider the problem of sparse canonical correlation analysis for arbitrary tensor data. Although several methods have been proposed for this task, there are still limitations hindering its practical applications. To this end, we present a general Sparse Tensor Canonical Correlation Analysis (gSTCCA) method from a multilinear least-squares perspective. Specifically, we formulate the problem as a constrained multilinear least-squares problem with tensor-structured sparsity regularization based on CANDECOMP/PARAFAC (CP) decomposition. Then we present a divide-and-conquer deflation approach to tackle the problem by successive rank-one tensor estimation of the residual tensors, where the overall model is broken up into a set of unconstrained linear least-squares problems that can be efficiently solved. Through extensive experiments conducted on five different datasets for recognition tasks, we demonstrate that the proposed method achieves promising performance compared to the SOTA vector- and tensor-based canonical correlation analysis methods in terms of classification accuracy, model sparsity, and robustness to missing and noisy data. The code is publicly available at https://github.com/junfish/gSTCCA.

URL: https://openreview.net/forum?id=zc0Y0cAuTV

---

Title: Generalizing Neural Additive Models via Statistical Multimodal Analysis

Authors: Young Kyung Kim, Juan Matias Di Martino, Guillermo Sapiro

Abstract: Interpretable models are gaining increasing attention in the machine learning community, and significant progress is being made to develop simple, interpretable, yet powerful deep learning approaches. Generalized Additive Models (GAM) and Neural Additive Models (NAM) are prime examples. Despite these methods' great potential and popularity in critical applications, e.g., medical applications, they fail to generalize to distributions with more than one mode (multimodal\footnote{In this paper, multimodal refers to the context of distributions, wherein a distribution possesses more than one mode.}). The main reason behind this limitation is that these "all-fit-one" models collapse multiple relationships by being forced to fit the data unimodally. We address this critical limitation by proposing interpretable multimodal network frameworks capable of learning a Mixture of Neural Additive Models (MNAM). The proposed MNAM learns relationships between input features and outputs in a multimodal fashion and assigns a probability to each mode. The proposed method shares similarities with Mixture Density Networks (MDN) while keeping the interpretability that characterizes GAM and NAM. We demonstrate how the proposed MNAM balances between rich representations and interpretability with numerous empirical observations and pedagogical studies. We present and discuss different training alternatives and provide extensive practical evaluation to assess the proposed framework. The code is available at \href{https://github.com/youngkyungkim93/MNAM}{https://github.com/youngkyungkim93/MNAM}.

URL: https://openreview.net/forum?id=xLg8ljlEba

---

Title: A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

Authors: Noriyuki Kojima, Hadar Averbuch-Elor, Yoav Artzi

Abstract: Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on ground phrasing annotations, and analyze the dynamics it creates. Code and data are available at https://github.com/lil-lab/phrase_grounding.

URL: https://openreview.net/forum?id=5G3PI1hEdw

---

Title: Size Lowerbounds for Deep Operator Networks

Authors: Anirbit Mukherjee, Amartya Roy

Abstract: Deep Operator Networks are an increasingly popular paradigm for solving regression in infinite dimensions and hence solve families of PDEs in one shot. In this work, we aim to establish a first-of-its-kind data-dependent lowerbound on the size of DeepONets required for them to be able to reduce empirical error on noisy data. In particular, we show that for low training errors to be obtained on $n$ data points it is necessary that the common output dimension of the branch and the trunk net scale as $\Omega\left(\sqrt[4]{n}\right)$.

This inspires our experiments with DeepONets solving the advection-diffusion-reaction PDE, where we demonstrate that, at a fixed model size, the training data might need to grow at least quadratically with this common output dimension in order for increasing it to yield a monotonic lowering of the training error.

URL: https://openreview.net/forum?id=RwmWODTNFE

---

Title: Extending Path-Dependent NJ-ODEs to Noisy Observations and a Dependent Observation Framework

Authors: William Andersson, Jakob Heiss, Florian Krach, Josef Teichmann

Abstract: The \emph{Path-Dependent Neural Jump Ordinary Differential Equation (PD-NJ-ODE)} is a model for predicting continuous-time stochastic processes with irregular and incomplete observations. In particular, the method learns optimal forecasts given irregularly sampled time series of incomplete past observations. So far the process itself and the coordinate-wise observation times were assumed to be independent and observations were assumed to be noiseless. In this work we discuss two extensions to lift these restrictions and provide theoretical guarantees as well as empirical examples for them. In particular, we can lift the assumption of independence by extending the theory to much more realistic settings of conditional independence without any need to change the algorithm. Moreover, we introduce a new loss function, which allows us to deal with noisy observations and explain why the previously used loss function did not lead to a consistent estimator.

URL: https://openreview.net/forum?id=0T2OTVCCC1

---

Title: Blind Biological Sequence Denoising with Self-Supervised Set Learning

Authors: Nathan Hoyen Ng, Ji Won Park, Jae Hyeon Lee, Ryan Lewis Kelly, Stephen Ra, Kyunghyun Cho

Abstract: Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the “average” of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of ≤ 6 subreads with 17% fewer errors and large reads of > 6 subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.

URL: https://openreview.net/forum?id=3s7ior0WZ5

---

Title: We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline

Authors: Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari, Judy Hoffman, Prithvijit Chattopadhyay, Viraj Uday Prabhu

Abstract: There has been abundant work in unsupervised domain adaptation for semantic segmentation (DAS) seeking to adapt a model trained on images from a labeled source domain to an unlabeled target domain. While the vast majority of prior work has studied this as a frame-level Image-DAS problem, a few Video-DAS works have sought to additionally leverage the temporal signal present in adjacent frames. However, Video-DAS works have historically studied a distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking. In this work, we address this gap. Surprisingly, we find that (1) even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods (HRDA and HRDA+MIC) outperform Video-DAS methods on established Video-DAS benchmarks (+14.5 mIoU on Viper$\rightarrow$CityscapesSeq, +19.0 mIoU on Synthia$\rightarrow$CityscapesSeq), and (2) naive combinations of Image-DAS and Video-DAS techniques only lead to marginal improvements across datasets. To avoid siloed progress between Image-DAS and Video-DAS, we open-source our codebase with support for a comprehensive set of Video-DAS and Image-DAS methods on a common benchmark. Code available at https://github.com/SimarKareer/UnifiedVideoDA

URL: https://openreview.net/forum?id=10R6iX6JHm

---

Title: Unsupervised Discovery of Steerable Factors When Graph Deep Generative Models Are Entangled

Authors: Shengchao Liu, Chengpeng Wang, Jiarui Lu, Weili Nie, Hanchen Wang, Zhuoxinran Li, Bolei Zhou, Jian Tang

Abstract: Deep generative models (DGMs) have been widely developed for graph data. However, much less investigation has been carried out on understanding the latent space of such pretrained graph DGMs. These understandings possess the potential to provide constructive guidelines for crucial tasks, such as graph controllable generation. Thus in this work, we are interested in studying this problem and propose GraphCG, a method for the unsupervised discovery of steerable factors in the latent space of pretrained graph DGMs. We first examine the representation space of three pretrained graph DGMs with six disentanglement metrics, and we observe that the pretrained representation space is entangled. Motivated by this observation, GraphCG learns the steerable factors via maximizing the mutual information between semantic-rich directions, where the controlled graph moving along the same direction will share the same steerable factors. We quantitatively verify that GraphCG outperforms four competitive baselines on two graph DGMs pretrained on two molecule datasets. Additionally, we qualitatively illustrate seven steerable factors learned by GraphCG on five pretrained DGMs over five graph datasets, including two for molecules and three for point clouds.

URL: https://openreview.net/forum?id=wyU3Q4gahM

---

Title: Are you using test log-likelihood correctly?

Authors: Sameer Deshpande, Soumya Ghosh, Tin D. Nguyen, Tamara Broderick

Abstract: Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.

URL: https://openreview.net/forum?id=n2YifD4Dxo
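
A toy numerical illustration of point (ii) above (constructed for this digest, not taken from the paper): a biased but well-calibrated predictive distribution can win on test log-likelihood while losing on RMSE.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    y = rng.normal(0.0, 1.0, size=10_000)          # held-out data, truly N(0, 1)

    # Model A: unbiased mean, overdispersed. Model B: biased mean, well-calibrated spread.
    mu_a, sigma_a = 0.0, 3.0
    mu_b, sigma_b = 0.5, 1.0

    rmse = lambda mu: np.sqrt(np.mean((y - mu) ** 2))
    loglik = lambda mu, sigma: norm.logpdf(y, mu, sigma).mean()

    print(f"A: RMSE={rmse(mu_a):.3f}, test log-lik={loglik(mu_a, sigma_a):.3f}")
    print(f"B: RMSE={rmse(mu_b):.3f}, test log-lik={loglik(mu_b, sigma_b):.3f}")
    # A has the lower RMSE, yet B has the higher test log-likelihood.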

---

Title: Blockwise Self-Supervised Learning at Scale

Authors: Shoaib Siddiqui, David Krueger, Yann LeCun, Stephane Deny

Abstract: Current state-of-the-art deep networks are all powered by backpropagation. However, long backpropagation paths as found in end-to-end training are biologically implausible, as well as inefficient in terms of energy consumption. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48\%, only 1.1\% below the accuracy of an end-to-end pretrained network (71.57\% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.

URL: https://openreview.net/forum?id=M2m618iIPk
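
For readers unfamiliar with the block-local objective, here is a minimal sketch of the Barlow Twins loss that the paper applies independently at each of the four ResNet-50 blocks (my own simplification; the authors' training code has more moving parts):

    import torch

    def barlow_twins_loss(z1, z2, lambd=5e-3, eps=1e-5):
        # z1, z2: (batch, dim) embeddings of two augmented views from one block.
        z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)
        z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
        c = z1.T @ z2 / z1.shape[0]                       # cross-correlation matrix
        on_diag = ((torch.diagonal(c) - 1) ** 2).sum()    # push the diagonal towards 1
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
        return on_diag + lambd * off_diag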

---

Title: Temporally Rich Deep Learning Models for Magnetoencephalography

Authors: Tim Chard, Mark Dras, Paul Sowman, Steve Cassidy, Jia Wu

Abstract: Deep learning has been used in a wide range of applications, but it has only very recently been applied to Magnetoencephalography (MEG). MEG is a neurophysiological technique used to investigate a variety of cognitive processes such as language and learning, and an emerging technology in the quest to identify neural correlates of cognitive impairments such as those occurring in dementia.
Recent work has shown that it is possible to apply deep learning to MEG to categorise induced responses to stimuli across subjects.
While novel in the application of deep learning, such work has generally used relatively simple neural network (NN) models compared to those being used in domains such as computer vision and natural language processing.
In these other domains, there is a long history in developing complex NN models that combine spatial and temporal information.
We propose more complex NN models that focus on modelling temporal relationships in the data, and apply them to the challenges of MEG data.
We apply these models to an extended range of MEG-based tasks, and find that they substantially outperform existing work on a range of tasks, particularly but not exclusively temporally-oriented ones. We also show that an autoencoder-based preprocessing component that focuses on the temporal aspect of the data can improve the performance of existing models.
Our source code is available at https://github.com/tim-chard/DeepLearningForMEG.

URL: https://openreview.net/forum?id=zSeoG5dRHK

---

Title: Disciplined Saddle Programming

Authors: Philipp Schiele, Eric Sager Luxenberg, Stephen P. Boyd

Abstract: We consider convex-concave saddle point problems, and more generally convex optimization problems we refer to as saddle problems, which include the partial supremum or infimum of convex-concave saddle functions. Saddle problems arise in a wide range of applications, including game theory, machine learning, and finance. It is well known that a saddle problem can be reduced to a single convex optimization problem by dualizing either the convex (min) or concave (max) objectives, reducing a min-max problem into a min-min (or max-max) problem. Carrying out this conversion by hand can be tedious and error prone. In this paper we introduce disciplined saddle programming (DSP), a domain specific language (DSL) for specifying saddle problems, for which the dualizing trick can be automated. The language and methods are based on recent work by Juditsky and Nemirovski, who developed the idea of conic-representable saddle point programs, and showed how to carry out the required dualization automatically using conic duality. Juditsky and Nemirovski's conic representation of saddle problems extends Nesterov and Nemirovski's earlier development of conic representable convex problems; DSP can be thought of as extending disciplined convex programming (DCP) to saddle problems. Just as DCP makes it easy for users to formulate and solve complex convex problems, DSP allows users to easily formulate and solve saddle problems. Our method is implemented in an open-source package, also called DSP.

URL: https://openreview.net/forum?id=KhMLfEIoUm
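
To make the "dualizing by hand is tedious" point concrete, here is a minimal CVXPY example (not using the DSP package itself) of the manual reduction for the matrix game $\min_x \max_y x^T A y$ over probability simplices, which is exactly the kind of conversion DSP is designed to automate:

    import cvxpy as cp
    import numpy as np

    A = np.array([[1.0, -1.0],
                  [-2.0, 3.0]])                 # payoff matrix
    x = cp.Variable(A.shape[0], nonneg=True)
    t = cp.Variable()

    # max over y in the simplex of y^T (A^T x) equals max_j (A^T x)_j, so bound it by t.
    prob = cp.Problem(cp.Minimize(t), [cp.sum(x) == 1, A.T @ x <= t])
    prob.solve()
    print("game value:", prob.value, "optimal x:", x.value)

With DSP, one would instead state the saddle function directly and let the package perform this dualization.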

---

Title: Federated Sampling with Langevin Algorithm under Isoperimetry

Authors: Lukang Sun, Adil Salim, Peter Richtárik

Abstract: Federated learning uses a set of techniques to efficiently distribute the training of a machine learning algorithm across several devices, which own the training data. These techniques critically rely on reducing the communication cost---the main bottleneck---between the devices and a central server. Federated learning algorithms usually take an optimization approach: they are algorithms for minimizing the training loss subject to communication (and other) constraints. In this work, we instead take a Bayesian approach for the training task, and propose a communication-efficient variant of the Langevin algorithm to sample \textit{a posteriori}. The latter approach is more robust and provides more knowledge of the \textit{a posteriori} distribution than its optimization counterpart. We analyze our algorithm without assuming that the target distribution is strongly log-concave. Instead, we assume the weaker log Sobolev inequality, which allows for nonconvexity.

URL: https://openreview.net/forum?id=Sj7bFPeR6W
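
For orientation, the (unadjusted) Langevin step that underlies such samplers looks as follows; this is a generic sketch of Langevin dynamics for a target proportional to $\exp(-U)$, not the paper's communication-efficient federated variant:

    import numpy as np

    def langevin_step(theta, grad_U, step, rng):
        # One unadjusted Langevin update: gradient step plus Gaussian noise.
        noise = rng.normal(size=theta.shape)
        return theta - step * grad_U(theta) + np.sqrt(2.0 * step) * noise

    # Example: sampling from a standard Gaussian, where U(theta) = ||theta||^2 / 2.
    rng = np.random.default_rng(0)
    theta = np.zeros(2)
    samples = []
    for _ in range(5_000):
        theta = langevin_step(theta, lambda th: th, 1e-2, rng)
        samples.append(theta)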

---

Title: TensorVAE: a simple and efficient generative model for conditional molecular conformation generation

Authors: Hongyang Yu, Hongjiang Yu

Abstract: Efficient generation of 3D conformations of a molecule from its 2D graph is a key challenge in in-silico drug discovery. Deep learning (DL) based generative modelling has recently become a potent tool for tackling this challenge. However, many existing DL-based methods are either indirect, leveraging inter-atomic distances, or direct but requiring numerous sampling steps to generate conformations. In this work, we propose a simple model, abbreviated TensorVAE, capable of generating conformations directly from a 2D molecular graph in a single step. The main novelty of the proposed method lies in its feature engineering. We develop a novel encoding and feature extraction mechanism relying solely on standard convolution operations to generate token-like feature vectors for each atom. These feature vectors are then transformed through standard transformer encoders under a conditional Variational Autoencoder framework to generate conformations directly. We show through experiments on two benchmark datasets that with intuitive feature engineering, a relatively simple and standard model can provide promising generative capability, outperforming more than a dozen state-of-the-art models employing more sophisticated and specialized generative architectures.

URL: https://openreview.net/forum?id=rQqzt4gYcc

---

Title: PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

Authors: Yuan Liu, Songyang Zhang, Jiacheng Chen, Kai Chen, Dahua Lin

Abstract: Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT. However, subsequent works have complicated the framework with new auxiliary tasks or extra pre-trained models, inevitably increasing computational overhead. This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction, which examines the input image patches and reconstruction target, and highlights two critical but previously overlooked bottlenecks.
Based on this analysis, we propose a remarkably simple and effective method, PixMIM, that entails two strategies: 1) filtering the high-frequency components from the reconstruction target to de-emphasize the network's focus on texture-rich details and 2) adopting a conservative data transform strategy to alleviate the problem of missing foreground in MIM training. PixMIM can be easily integrated into most existing pixel-based MIM approaches (i.e., using raw images as reconstruction target) with negligible additional computation. Without bells and whistles, our method consistently improves four MIM approaches, MAE, MFF, ConvMAE, and LSMAE, across various downstream tasks. We believe this effective plug-and-play method will serve as a strong baseline for self-supervised learning and provide insights for future improvements of the MIM framework. Code and models will be available.

URL: https://openreview.net/forum?id=qyfz0QrkqP
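
A rough sketch (based only on the abstract, not the released code) of the first strategy, low-pass filtering the reconstruction target so that high-frequency texture is de-emphasised:

    import torch

    def lowpass_target(images, cutoff_ratio=0.25):
        # images: (B, C, H, W). Keep only low spatial frequencies inside a centred square.
        H, W = images.shape[-2:]
        freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
        cy, cx = H // 2, W // 2
        r = max(1, int(cutoff_ratio * min(H, W)))
        mask = torch.zeros(H, W, dtype=torch.bool, device=images.device)
        mask[cy - r:cy + r, cx - r:cx + r] = True
        freq = freq * mask
        return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

The filtered image would then replace the raw pixels as the MAE-style reconstruction target; the exact filter shape and cut-off used in the paper may differ.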

---

Title: Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Authors: Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel

Abstract: Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations---such as word-substitution---does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets---both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1\%$\,\to\,$50.1\%) and on two future unseen rounds of human generated attacks (32.5\%$\,\to\,$43.4\%, and 29.4\%$\,\to\,$40.2\%). In hate speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.

URL: https://openreview.net/forum?id=UAT4j3Y7HP

---

Title: MMD-Regularized Unbalanced Optimal Transport

Authors: Piyushi Manupriya, SakethaNath Jagarlapudi, Pratik Jawanpuria

Abstract: We study the unbalanced optimal transport (UOT) problem, where the marginal constraints are enforced using Maximum Mean Discrepancy (MMD) regularization. Our work is motivated by the observation that the literature on UOT is focused on regularization based on $\phi$-divergence (e.g., KL divergence). Despite the popularity of MMD, its role as a regularizer in the context of UOT seems less understood. We begin by deriving a specific dual of MMD-regularized UOT (MMD-UOT), which helps us prove several useful properties. One interesting outcome of this duality result is that MMD-UOT induces novel metrics, which not only lift the ground metric like the Wasserstein but are also sample-wise efficient to estimate like the MMD. Further, for real-world applications involving non-discrete measures, we present an estimator for the transport plan that is supported only on the given ($m$) samples. Under certain conditions, we prove that the estimation error with this finitely-supported transport plan is also $\mathcal{O}(1/\sqrt{m})$. As far as we know, such error bounds that are free from the curse of dimensionality are not known for $\phi$-divergence regularized UOT. Finally, we discuss how the proposed estimator can be computed efficiently using accelerated gradient descent. Our experiments show that MMD-UOT consistently outperforms popular baselines, including KL-regularized UOT and MMD, in diverse machine learning applications.

URL: https://openreview.net/forum?id=eN9CjU3h1b
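
A small numerical sketch of the objective as I read it from the abstract (notation and kernel choice are my assumptions, not the authors' code): a transport plan P supported on the given samples pays a linear transport cost plus MMD penalties on its marginals.

    import numpy as np

    def rbf_gram(X, Y, sigma=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def mmd_uot_objective(P, C, a, b, Kx, Ky, lam=1.0):
        # MMD^2 between two discrete measures on the same support is a quadratic
        # form of the weight difference in the kernel Gram matrix.
        # For samples X, Y one would take Kx = rbf_gram(X, X) and Ky = rbf_gram(Y, Y).
        da = P.sum(axis=1) - a      # mismatch of the first marginal
        db = P.sum(axis=0) - b      # mismatch of the second marginal
        return (C * P).sum() + lam * (da @ Kx @ da + db @ Ky @ db)

The abstract's proposal is then to minimise this over entrywise-nonnegative P (e.g., with accelerated gradient descent) rather than enforcing the marginals exactly as in balanced OT.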

---

Title: Bandits Corrupted by Nature: Lower Bounds on Regret and Robust Optimistic Algorithms

Authors: Timothée Mathieu, Debabrota Basu, Odalric-Ambrym Maillard

Abstract: We study the corrupted bandit problem, i.e. a stochastic multi-armed bandit problem with $k$ unknown reward distributions, which are heavy-tailed and corrupted by a history-independent adversary or Nature. To be specific, the reward obtained by playing an arm comes from the corresponding heavy-tailed reward distribution with probability $1-\varepsilon \in (0.5,1]$ and from an arbitrary corruption distribution of unbounded support with probability $\varepsilon \in [0,0.5)$.
First, we provide \textit{a problem-dependent lower bound on the regret} of any corrupted bandit algorithm. The lower bounds indicate that the corrupted bandit problem is harder than the classical stochastic bandit problem with subGaussian or heavy-tail rewards.
Following that, we propose a novel UCB-type algorithm for corrupted bandits, namely \texttt{HubUCB}, that builds on Huber's estimator for robust mean estimation. Leveraging a novel concentration inequality of Huber's estimator, we prove that \texttt{HubUCB} achieves a near-optimal regret upper bound.
Since computing Huber's estimator has quadratic complexity, we further introduce a sequential version of Huber's estimator that exhibits linear complexity. We leverage this sequential estimator to design \texttt{SeqHubUCB} that enjoys similar regret guarantees while reducing the computational burden.
Finally, we experimentally illustrate the efficiency of \texttt{HubUCB} and \texttt{SeqHubUCB} in solving corrupted bandits for different reward distributions and different levels of corruptions.

URL: https://openreview.net/forum?id=oGIR0ic3jU

---

Title: High-dimensional Bayesian Optimization via Covariance Matrix Adaptation Strategy

Authors: Lam Ngo, Huong Ha, Jeffrey Chan, Vu Nguyen, Hongyu Zhang

Abstract: Bayesian Optimization (BO) is an effective method for finding the global optimum of expensive black-box functions. However, it is well known that applying BO to high-dimensional optimization problems is challenging. To address this issue, a promising solution is to use a local search strategy that partitions the search domain into local regions with high likelihood of containing the global optimum, and then use BO to optimize the objective function within these regions. In this paper, we propose a novel technique for defining the local regions using the Covariance Matrix Adaptation (CMA) strategy. Specifically, we use CMA to learn a search distribution that can estimate the probabilities of data points being the global optimum of the objective function. Based on this search distribution, we then define the local regions consisting of data points with high probabilities of being the global optimum. Our approach serves as a meta-algorithm as it can incorporate existing black-box BO optimizers, such as BO, TuRBO, and BAxUS, to find the global optimum of the objective function within our derived local regions. We evaluate our proposed method on various benchmark synthetic and real-world problems. The results demonstrate that our method outperforms existing state-of-the-art techniques.

URL: https://openreview.net/forum?id=eTgxr7gPuU

---

Title: A general framework for formulating structured variable selection

Authors: GUANBO WANG, Mireille Schnitzer, Tom Chen, Rui Wang, Robert W Platt

Abstract: In variable selection, a selection rule that prescribes the permissible sets of selected variables (called a "selection dictionary") is desirable due to the inherent structural constraints among the candidate variables. Such selection rules can be complex in real-world data analyses, and failing to incorporate such restrictions could not only compromise the interpretability of the model but also lead to decreased prediction accuracy. However, no general framework has been proposed to formalize selection rules and their applications, which poses a significant challenge for practitioners seeking to integrate these rules into their analyses. In this work, we establish a framework for structured variable selection that can incorporate universal structural constraints. We develop a mathematical language for constructing arbitrary selection rules, where the selection dictionary is formally defined. We demonstrate that all selection rules can be expressed as combinations of operations on constructs, facilitating the identification of the corresponding selection dictionary. We use a detailed and complex example to illustrate the developed framework. Once this selection dictionary is derived, practitioners can apply their own user-defined criteria to select the optimal model. Additionally, our framework enhances existing penalized regression methods for variable selection by providing guidance on how to appropriately group variables to achieve the desired selection rule. Furthermore, our innovative framework opens the door to establishing new $\ell_0$-based penalized regression techniques that can be tailored to respect arbitrary selection rules, thereby expanding the possibilities for more robust and tailored model development.

URL: https://openreview.net/forum?id=cvOpIhQQMN

---

Title: Towards fully covariant machine learning

Authors: Soledad Villar, David W Hogg, Weichi Yao, George A Kevrekidis, Bernhard Schölkopf

Abstract: Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry, and units covariance, all of which have led to important results in physics. In machine learning, the most visible passive symmetry is the relabeling or permutation symmetry of graphs. The active symmetries are those that must be established by observation and experiment. They include, for instance, translation invariances or rotation invariances of physical law. These symmetries are the subject of most of the equivariant machine learning literature. Our goal, in this conceptual contribution, is to understand the implications for machine learning of the many passive and active symmetries in play.
We discuss dos and don'ts for machine learning practice if passive symmetries are to be respected. We discuss links to causal modeling and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. We conjecture that the implementation of passive symmetries might help machine learning in the same ways that it transformed physics in the twentieth century.

URL: https://openreview.net/forum?id=gllUnpYuXg

---

Title: Empowering GNNs via Edge-Aware Weisfeiler-Leman Algorithm

Authors: Meng Liu, Haiyang Yu, Shuiwang Ji

Abstract: Message passing graph neural networks (GNNs) are known to have their expressiveness upper-bounded by 1-dimensional Weisfeiler-Leman (1-WL) algorithm. To achieve more powerful GNNs, existing attempts either require \emph{ad hoc} features, or involve operations that incur high time and space complexities. In this work, we propose a \textit{general} and \textit{provably powerful} GNN framework that preserves the \textit{scalability} of the message passing scheme. In particular, we first propose to empower 1-WL for graph isomorphism test by considering edges among neighbors, giving rise to NC-1-WL. The expressiveness of NC-1-WL is shown to be strictly above 1-WL and below 3-WL theoretically. Further, we propose the NC-GNN framework as a differentiable neural version of NC-1-WL. Our simple implementation of NC-GNN is provably as powerful as NC-1-WL. Experiments demonstrate that our NC-GNN performs effectively and efficiently on various benchmarks.

URL: https://openreview.net/forum?id=VDy6LgErFM
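
An illustrative sketch of the edge-aware refinement idea (my interpretation of the abstract; NC-1-WL's precise update may differ): each node's colour update also hashes how many edges exist among its neighbours.

    def edge_aware_wl(adj, colors, rounds=3):
        # adj: dict node -> set of neighbour nodes (nodes are ints);
        # colors: dict node -> hashable initial colour.
        for _ in range(rounds):
            new_colors = {}
            for v, nbrs in adj.items():
                nbr_colors = tuple(sorted(colors[u] for u in nbrs))
                # Edges among v's neighbours: the signal that plain 1-WL ignores.
                nbhd_edges = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
                new_colors[v] = hash((colors[v], nbr_colors, nbhd_edges))
            colors = new_colors
        return colors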

---

Title: A Survey on Graph Construction for Geometric Deep Learning in Medicine: Methods and Recommendations

Authors: Tamara T. Müller, Sophie Starck, Alina Dima, Stephan Wunderlich, Kyriaki-Margarita Bintsi, Kamilia Zaripova, Rickmer Braren, Daniel Rueckert, Anees Kazi, Georgios Kaissis

Abstract: Graph neural networks are powerful tools that enable deep learning on non-Euclidean data structures like graphs, point clouds, and meshes. They leverage the connectivity of data points and can even benefit learning tasks on data that is not naturally graph-structured, like point clouds. In these cases, the graph structure needs to be determined from the dataset, which adds a significant challenge to the learning process. This opens up a multitude of design choices for creating suitable graph structures, which have a substantial impact on the success of the graph learning task. However, so far no concrete guidance for choosing the most appropriate graph construction is available, not only due to the large variety of methods out there but also because of its strong connection to the dataset at hand. In medicine, for example, a large variety of different data types complicates the selection of graph construction methods even more. We therefore summarise the current state-of-the-art graph construction methods, especially for medical data. In this work, we introduce a categorisation scheme for graph types and graph construction methods. We identify two main strands of graph construction: static and adaptive methods, discuss their advantages and disadvantages, and formulate recommendations for choosing a suitable graph construction method. We furthermore discuss how a created graph structure can be assessed and to what degree it supports graph learning. With this work, we hope to support medical research with graph deep learning by elucidating the wide variety of graph construction methods.

URL: https://openreview.net/forum?id=sWlHhfijcS

---

Title: To Transfer or Not to Transfer: Suppressing Concepts from Source Representations

Authors: Vijay Sadashivaiah, Keerthiram Murugesan, Ronny Luss, Pin-Yu Chen, Chris Sims, James Hendler, Amit Dhurandhar

Abstract: With the proliferation of large pre-trained models in various domains, transfer learning has gained prominence where intermediate representations from these models can be leveraged to train better (target) task-specific models, with possibly limited labeled data. Although transfer learning can be beneficial in many applications, it can transfer undesirable information to target tasks that may severely curtail its performance in the target domain or raise ethical concerns related to privacy and/or fairness. In this paper, we propose a novel approach for suppressing the transfer of user-determined semantic concepts (viz. color, glasses, etc.) in intermediate source representations to target tasks without retraining the source model which can otherwise be expensive or even infeasible. Notably, we tackle a bigger challenge in the input data as a given intermediate source representation is biased towards the source task, thus possibly further entangling the desired concepts. We evaluate our approach qualitatively and quantitatively in the visual domain showcasing its efficacy for classification and generative source models. Finally, we provide a concept selection approach that automatically suppresses the undesirable concepts.

URL: https://openreview.net/forum?id=BNP4MxzDEI

---

Title: On the Choice of Learning Rate for Local SGD

Authors: Lukas Balles, Prabhu Teja S, Cedric Archambeau

Abstract: Distributed data-parallel optimization accelerates the training of neural networks, but requires constant synchronization of gradients between the workers, which can become a bottleneck. One way to reduce communication overhead is to use Local SGD, where each worker asynchronously takes multiple local gradient steps, after which the model weights are averaged. In this work, we discuss the choice of learning rate for Local SGD, showing that it faces an intricate trade-off. Unlike in the synchronous case, its gradient estimate is biased, with the bias dependent on the learning rate itself. Thus, using learning rate scaling techniques designed for faster convergence in the synchronous case results in performance degradation for Local SGD, as previously observed. To analyze the manifestation of this bias, we study the convergence behaviour of Local SGD and synchronous data-parallel SGD when using their optimal learning rates. Our experiments show that the optimal learning rate for Local SGD differs substantially from that of SGD, and when using it the performance of Local SGD matches that of SGD. However, this performance comes at the cost of added training iterations, rendering Local SGD faster than SGD only when communication is much more time-consuming than computation. This suggests that Local SGD may be of limited practical utility.

URL: https://openreview.net/forum?id=DPvwr4HJdt
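
A schematic sketch of one Local SGD round (not the paper's experimental code; the model and data loaders are placeholders, and weight averaging assumes a state dict of floating-point tensors), showing where the learning rate whose choice the paper analyses enters:

    import copy
    import torch

    def local_sgd_round(global_model, worker_loaders, loss_fn, lr=0.1, local_steps=8):
        worker_states = []
        for loader in worker_loaders:
            model = copy.deepcopy(global_model)
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            batches = iter(loader)
            for _ in range(local_steps):      # several local steps before any communication
                x, y = next(batches)
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            worker_states.append(model.state_dict())
        # Synchronise once per round by averaging the worker weights.
        avg = {k: sum(s[k] for s in worker_states) / len(worker_states) for k in worker_states[0]}
        global_model.load_state_dict(avg)
        return global_model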

---

Title: Semantic similarity prediction is better than other semantic similarity measures

Authors: Steffen Herbold

Abstract: Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a fine-tuned model for the Semantic Textual Similarity Benchmark tasks (STS-B) from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.

URL: https://openreview.net/forum?id=bfsNmgN5je
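
One plausible way to reproduce the "predict similarity directly" idea with off-the-shelf tooling (the checkpoint name is an assumption on my part, not necessarily the model used in the paper):

    from sentence_transformers import CrossEncoder

    # A cross-encoder fine-tuned on STS-B regresses a similarity score directly
    # from the sentence pair, instead of comparing separately computed embeddings.
    model = CrossEncoder("cross-encoder/stsb-roberta-base")
    score = model.predict([("A man is playing a guitar.",
                            "Someone is strumming a guitar.")])[0]
    print(score)   # a scalar similarity score (roughly 0-1 for this family of checkpoints)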

---

Title: Prismer: A Vision-Language Model with Multi-Task Experts

Authors: Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar

Abstract: Recent vision-language models have shown impressive multi-modal generation capabilities. However, typically they require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from multiple readily-available, pre-trained experts, and kept frozen during training. By leveraging experts from a wide range of domains, we show Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-arts, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.

URL: https://openreview.net/forum?id=R7H43YD6Lo

---

Title: CAREER: A Foundation Model for Labor Sequence Data

Authors: Keyon Vafa, Emil Palikot, Tianyu Du, Ayush Kanodia, Susan Athey, David Blei

Abstract: Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end we develop CAREER, a foundation model for job sequences. CAREER is first fit to large, passively-collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes, and adjust it on small longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely-used economics datasets. We further find that CAREER can be used to form good predictions of other downstream variables. For example, incorporating CAREER into a wage model provides better predictions than the econometric models currently in use.

URL: https://openreview.net/forum?id=4i1MXH8Sle

---

Title: Hyperspherical Prototype Node Clustering

Authors: Jitao Lu, Danyang Wu, Feiping Nie, Rong Wang, Xuelong Li

Abstract: The general workflow of deep node clustering is to encode the nodes into node embeddings via graph neural networks and uncover clustering decisions from them, so clustering performance is heavily affected by the embeddings. However, existing works only consider preserving the semantics of the graph but ignore the inter-cluster separability of the nodes, so there's no guarantee that the embeddings can present a clear clustering structure. To remedy this deficiency, we propose Hyperspherical Prototype Node Clustering (HPNC), an end-to-end clustering paradigm that explicitly enhances the inter-cluster separability of learned node embeddings. Concretely, we constrain the embedding space to a unit-hypersphere, enabling us to scatter the cluster prototypes over the space with maximized pairwise distances. Then, we employ a graph autoencoder to map nodes onto the same hypersphere manifold. Consequently, cluster affinities can be directly retrieved from cosine similarities between node embeddings and prototypes. A clustering-oriented loss is imposed to sharpen the affinity distribution so that the learned node embeddings are encouraged to have small intra-cluster distances and large inter-cluster distances. Based on the proposed HPNC paradigm, we devise two schemes (HPNC-IM and HPNC-DEC) with distinct clustering backbones. Empirical results on popular benchmark datasets demonstrate the superiority of our method compared to other state-of-the-art clustering methods, and visualization results illustrate improved separability of the learned embeddings.

URL: https://openreview.net/forum?id=z3ZlnaOM0d
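
A compact sketch (my own, based on the abstract; the paper's prototype placement and losses may differ) of the two ingredients: prototypes scattered over the unit hypersphere with maximised pairwise separation, and cosine-similarity cluster affinities for the node embeddings.

    import torch
    import torch.nn.functional as F

    def spread_prototypes(k, dim, steps=500, lr=0.1):
        # Scatter k prototypes on the unit hypersphere by minimising the largest
        # pairwise cosine similarity of each prototype to the others.
        protos = torch.randn(k, dim, requires_grad=True)
        opt = torch.optim.SGD([protos], lr=lr)
        for _ in range(steps):
            p = F.normalize(protos, dim=1)
            sim = p @ p.T - 2.0 * torch.eye(k)       # mask out self-similarity
            loss = sim.max(dim=1).values.mean()
            opt.zero_grad(); loss.backward(); opt.step()
        return F.normalize(protos.detach(), dim=1)

    def cluster_affinities(node_embeddings, prototypes, temperature=0.1):
        z = F.normalize(node_embeddings, dim=1)
        return F.softmax(z @ prototypes.T / temperature, dim=1)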

---

Title: AdaFed: Fair Federated Learning via Adaptive Common Descent Direction

Authors: Shayan Mohajer Hamidi, EN-HUI YANG

Abstract: Federated learning (FL) is a promising technology via which some edge devices/clients collaboratively train a machine learning model orchestrated by a server. Learning an unfair model is known as a critical problem in federated learning, where the trained model may unfairly advantage or disadvantage some of the devices. To tackle this problem, in this work, we propose AdaFed. The goal of AdaFed is to find an updating direction for the server along which (i) all the clients’ loss functions are decreasing; and (ii) more importantly, the loss functions for the clients with larger values decrease with a higher rate. AdaFed adaptively tunes this common direction based on the values of local gradients and loss functions. We validate the effectiveness of AdaFed on a suite of federated datasets, and demonstrate that AdaFed outperforms state-of-the-art fair FL methods.

URL: https://openreview.net/forum?id=rFecyFpFUp

---

Title: Wavelet Networks: Scale-Translation Equivariant Learning From Raw Time-Series

Authors: David W. Romero, Erik J Bekkers, Jakub M. Tomczak, Mark Hoogendoorn

Abstract: Leveraging the symmetries inherent to specific data domains for the construction of equivariant neural networks has led to remarkable improvements in terms of data efficiency and generalization. However, most existing research focuses on symmetries arising from planar and volumetric data, leaving a crucial data source largely underexplored: *time-series*. In this work, we fill this gap by leveraging the symmetries inherent to time-series for the construction of equivariant neural networks. We identify two core symmetries: *scale and translation*, and construct scale-translation equivariant neural networks for time-series learning. Intriguingly, we find that scale-translation equivariant mappings share strong resemblance with the *wavelet transform*. Inspired by this resemblance, we term our networks *Wavelet Networks*, and show that they perform nested non-linear wavelet-like time-frequency transforms. Empirical results show that Wavelet Networks outperform conventional CNNs on raw waveforms, and match strongly engineered spectrogram techniques across several tasks and time-series types, including audio, environmental sounds, and electrical signals. Our code is publicly available at https://github.com/dwromero/wavelet_networks.

URL: https://openreview.net/forum?id=ga5SNulYet

---


New submissions
===============


Title: Neural networks can be FLOP-efficient integrators of 1D oscillatory integrands

Abstract: We demonstrate that neural networks can be FLOP-efficient integrators of one-dimensional oscillatory integrals. We train a feed-forward neural network to compute integrals of highly oscillatory 1D functions. The training set is a parametric combination of functions with varying characters and degrees of oscillatory behavior. Numerical examples show that these networks are FLOP-efficient for oscillatory integrands, with an average gain of $10^{3}$ FLOPs. The network calculates oscillatory integrals better than traditional quadrature methods under the same computational budget or number of floating point operations. We find that feed-forward networks of 5 hidden layers are satisfactory for an accuracy level of $10^{-3}$ in terms of normalized mean squared error loss. The computational burden of inference of the neural network is relatively small, even compared to inner-product pattern quadrature rules. We postulate that this seemingly surprising result follows from learning latent patterns in the oscillatory integrands otherwise opaque to traditional numerical integrators.

URL: https://openreview.net/forum?id=5psgQEHn6t
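
A tiny end-to-end sketch of the setup as described (parameter ranges, integrand family, and architecture details are my assumptions): learn the map from integrand parameters to the value of the oscillatory integral, here for integrands sin(omega*x + phi) on [0, 1].

    import numpy as np
    from scipy.integrate import quad
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    params = rng.uniform([1.0, 0.0], [200.0, 2 * np.pi], size=(2000, 2))   # (omega, phi)
    targets = np.array([quad(lambda x, w=w, p=p: np.sin(w * x + p), 0.0, 1.0,
                             limit=500)[0] for w, p in params])

    # Five hidden layers, matching the depth the abstract reports as sufficient.
    net = MLPRegressor(hidden_layer_sizes=(64,) * 5, max_iter=2000, random_state=0)
    net.fit(params, targets)
    # At inference time a single forward pass replaces the adaptive quadrature above.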

---

Title: Sparsifying Bayesian neural networks with latent binary variables and normalizing flows

Abstract: Artificial neural networks are powerful machine learning methods used in many modern applications. A common issue is that they have millions or billions of parameters, and therefore tend to overfit. Bayesian neural networks (BNN) can improve on this since they incorporate parameter uncertainty. Latent binary Bayesian neural networks (LBBNN) further take into account structural uncertainty by allowing the weights to be turned on or off, enabling inference in the joint space of weights and structures. In this paper, we will consider two extensions of variational inference for the LBBNN: Firstly, by using the local reparametrization trick (LRT), we improve on computational efficiency. Secondly, and more importantly, by using normalizing flows on the variational posterior distribution of the LBBNN parameters, we learn a more flexible variational posterior than the mean field Gaussian. Experimental results on real data show that this improves on predictive power compared to using mean field variational inference on the LBBNN method, while also obtaining sparser networks. We also perform two simulation studies. In the first, we consider variable selection in a logistic regression setting, where the more flexible variational distribution improves results. In the second study, we compare predictive uncertainty based on data generated from two-dimensional Gaussian distributions. Here, we argue that our Bayesian methods lead to more realistic estimates of predictive uncertainty.

URL: https://openreview.net/forum?id=58jTKZb86S

---

Title: RedMotion: Motion Prediction via Redundancy Reduction

Abstract: We introduce RedMotion, a transformer model for motion prediction in autonomous driving that learns environment representations via redundancy reduction. Our first type of redundancy reduction is induced by an internal transformer decoder and reduces a variable-sized set of local road environment tokens, representing road graphs and agent data, to a fixed-sized global embedding. The second type of redundancy reduction is obtained by self-supervised learning and applies the redundancy reduction principle to embeddings generated from augmented views of road environments. Our experiments reveal that our representation learning approach outperforms PreTraM, Traj-MAE, and GraphDINO in a semi-supervised setting. Our RedMotion model achieves competitive results compared to HPTR or MTR++ in the Waymo Motion Prediction Challenge. We provide an anonymized open source implementation that is accessible via Colab: https://colab.research.google.com/drive/16pwsmOTYdPpbNWf2nm1olXcx1ZmsXHB8

URL: https://openreview.net/forum?id=xl1KhKT3Xx

---

Title: Active Sequential Two-Sample Testing

Abstract: A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first \emph{active sequential two-sample testing framework} that queries not only sequentially but also \emph{actively}. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling the ``high-dependency'' features leads to increased power of the proposed testing framework. In theory, we provide the proof that our framework produces an \emph{anytime-valid} $p$-value. In addition, we characterize the proposed framework's gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control.

URL: https://openreview.net/forum?id=EzPRgIq2Tk

---

Title: FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning

Abstract: Federated learning (FL) is an emerging paradigm in machine learning, where a shared model is collaboratively learned using data from multiple devices to mitigate the risk of data leakage. While recent studies posit that Vision Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in addressing data heterogeneity in FL, the specific architectural components that underpin this advantage have yet to be elucidated. In this paper, we systematically investigate the impact of different architectural elements, such as activation functions and normalization layers, on the performance within heterogeneous FL. Through rigorous empirical analyses, we are able to offer the first-of-its-kind general guidance on micro-architecture design principles for heterogeneous FL.

Intriguingly, our findings indicate that with strategic architectural modifications, pure CNNs can achieve a level of robustness that either matches or even exceeds that of ViTs when handling heterogeneous data clients in FL. Additionally, our approach is compatible with existing FL techniques and delivers state-of-the-art solutions across a broad spectrum of FL benchmarks.

URL: https://openreview.net/forum?id=bzTfO4mURl

---

Title: Time Series Continuous Modeling for Imputation and Forecasting with Implicit Neural Representations

Abstract: We introduce a novel modeling approach for time series imputation and forecasting, tailored to address the challenges often encountered in real-world data, such as irregular samples, missing data, or unaligned measurements from multiple sensors. Our method relies on a continuous-time-dependent model of the series' evolution dynamics. It leverages adaptations of conditional, implicit neural representations for sequential data. A modulation mechanism, driven by a meta-learning algorithm, allows adaptation to unseen samples and extrapolation beyond observed time-windows for long-term predictions. The model provides a highly flexible and unified framework for imputation and forecasting tasks across a wide range of challenging scenarios. It achieves state-of-the-art performance on classical benchmarks and outperforms alternative time-continuous models.
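
A minimal sketch of the general idea: a conditional implicit neural representation whose hidden activations are shifted by per-sample modulation codes, adapted with a few gradient steps. The architecture, sizes, and adaptation loop below are illustrative choices, not the paper's exact model, and in the full method the shared weights would be meta-trained across many series.

```python
# Sketch of a conditional implicit neural representation for time series:
# an MLP maps a timestamp t to a value, conditioned on per-sample "modulation"
# codes that shift hidden activations. The codes are adapted to each new series
# with a few gradient steps, then the INR can be queried at arbitrary times
# for imputation or forecasting.
import torch
import torch.nn as nn

class ModulatedINR(nn.Module):
    def __init__(self, hidden=64, layers=3):
        super().__init__()
        self.inp = nn.Linear(1, hidden)
        self.hiddens = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(layers))
        self.out = nn.Linear(hidden, 1)
        self.n_mod = layers  # one shift code per hidden layer

    def forward(self, t, mods):
        # t: (N, 1) timestamps; mods: list of (hidden,) shift vectors
        h = torch.relu(self.inp(t))
        for layer, m in zip(self.hiddens, mods):
            h = torch.relu(layer(h) + m)   # modulation = additive shift
        return self.out(h)

def adapt_modulations(model, t_obs, y_obs, steps=50, lr=1e-2):
    # Inner loop: fit only the modulation codes to the observed points of one series.
    mods = [torch.zeros(model.out.in_features, requires_grad=True) for _ in range(model.n_mod)]
    opt = torch.optim.Adam(mods, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(t_obs, mods) - y_obs) ** 2).mean()
        loss.backward()
        opt.step()
    return mods

# Toy usage: fit a sine wave from irregular samples, then query unseen times
# (here the shared weights are random; in the full method they are meta-trained).
model = ModulatedINR()
t_obs = torch.rand(30, 1)
y_obs = torch.sin(6.28 * t_obs)
mods = adapt_modulations(model, t_obs, y_obs)
t_query = torch.linspace(0, 1.5, 50).unsqueeze(1)   # includes extrapolation beyond [0, 1]
with torch.no_grad():
    y_pred = model(t_query, mods)
```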

URL: https://openreview.net/forum?id=P1vzXDklar

---

Title: Conciliator steering: Imposing user preference in multi-objective reinforcement learning

Abstract: Many real-world problems with multiple objectives require reinforcement learning solutions that can handle trade-offs in a user-preferred manner. In the multi-objective framework, a single algorithm can be developed that adapts to different user preferences based on a pre-defined reward function and a subjectively defined scalarisation function. The scalarisation function can be approximated by fitting a meta-model with information gained from the interaction between the user and the environment or the agent. This interaction requires an exact formulation of constructive feedback that is also simple for the user to give. In this paper, we propose a novel algorithm, Conciliator steering, that leverages priority order and reward transfer to seek optimal user-preferred policies in multi-objective reinforcement learning under the expected scalarised returns criterion. We test Conciliator steering on the DeepSeaTreasure v1 benchmark problem and demonstrate that it can find user-preferred policies through effortless and simple user-agent interaction and with negligible bias, which has not been possible before. Additionally, we show that on average Conciliator steering incurs only a tiny fraction of the carbon dioxide emissions and total energy consumption required to train a fully connected MNIST classifier, both run on a personal laptop.

URL: https://openreview.net/forum?id=XAD2kcBS50

---

Title: Experts on Demand: Dynamic Routing for Personalized Diffusion Models

Abstract: Diffusion models have excelled in the realm of image generation, owing to their expansive parameter space. However, most users only exploit a fraction of the available capabilities for synthesizing specialized image categories. These user-specific requirements are often fixed over the long term (for example, a pet store pursues images of cats and dogs), which poses an efficiency challenge due to the computational complexity involved. In this paper, we introduce Mixture of Expert Diffusion Models (MoEDM), a personalized and efficient strategy for adapting large-scale diffusion models to specific applications. By employing dynamic routing, MoEDM selectively activates only indispensable neurons, thereby optimizing runtime performance for specialized tasks while minimizing computational costs. MoEDM doubles the sampling speed without compromising efficacy across various applications. Moreover, its modular design allows straightforward incorporation of state-of-the-art optimization methods such as DPM-Solver and Latent Diffusion. Empirical assessments, validated by FID score, KID score and human evaluation, confirm the advantages of MoEDM in terms of both efficiency and robustness.

URL: https://openreview.net/forum?id=78KYfELI7U

---

Title: GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

Abstract: Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, which are mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing the multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for fine-tuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.

URL: https://openreview.net/forum?id=zOKAmm8R9B

---

Title: Augmenting Ad-Hoc IR Dataset for Interactive Conversational Search

Abstract: A peculiarity of conversational search systems is that they involve mixed initiatives such as system-generated query clarifying questions. Evaluating those systems at a large scale on the end task of IR is very challenging and requires adequate datasets containing such interactions. However, current datasets only focus on either traditional ad-hoc IR tasks or query clarification tasks, the latter being usually seen as a reformulation task from the initial query.
Only a few datasets are known to contain both document relevance judgments and the associated clarification interactions, such as Qulac and ClariQ. Both are based on the TREC Web Track 2009-12 collection but cover a very limited number of topics (237), far from enough for training and testing conversational IR models.
To fill the gap, we propose a methodology to automatically build large-scale conversational IR datasets from ad-hoc IR datasets in order to facilitate explorations on conversational IR.
Our methodology is based on two processes: 1) generating query clarification interactions through query clarification and answer generators, and 2) augmenting ad-hoc IR datasets with simulated interactions.
In this paper, we focus on MsMarco and augment it with query clarification and answer simulations. We perform a thorough evaluation showing the quality and the relevance of the generated interactions for each initial query. This paper shows the feasibility and utility of augmenting ad-hoc IR datasets for conversational IR.

URL: https://openreview.net/forum?id=z8d7nT1HWw

---

Title: Conservative Prediction via Data-Driven Confidence Minimization

Abstract: In safety-critical applications of machine learning, it is often desirable for a model to be \textit{conservative}, abstaining from making predictions on ``unknown'' inputs which are not well-represented in the training data. However, detecting unknown examples is challenging, as it is impossible to anticipate all potential inputs at test time. To address this, prior work minimizes model confidence on an auxiliary outlier dataset carefully curated to be disjoint from the training distribution. We theoretically analyze the choice of auxiliary dataset for confidence minimization, revealing two actionable insights: (1) if the auxiliary set contains unknown examples similar to those seen at test time, confidence minimization leads to provable detection of unknown test examples, and (2) if the first condition is satisfied, it is unnecessary to filter out known examples for out-of-distribution (OOD) detection. Motivated by these guidelines, we propose the Data-Driven Confidence Minimization (DCM) framework, which minimizes confidence on an \textit{uncertainty dataset}. We apply DCM to two problem settings in which conservative prediction is paramount -- selective classification and OOD detection -- and provide a realistic way to gather uncertainty data for each setting. In our experiments, DCM consistently outperforms existing selective classification approaches on 4 datasets when tested on unseen distributions and outperforms state-of-the-art OOD detection methods on 12 ID-OOD dataset pairs, reducing FPR (at TPR $95\%$) by $6.3\%$ and $58.1\%$ on CIFAR-10 and CIFAR-100 compared to Outlier Exposure.
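
A minimal sketch of the confidence-minimization objective described above, assuming the confidence penalty is the mean maximum softmax probability on the uncertainty dataset; the paper's exact penalty, weighting, and the way the uncertainty set is gathered may differ.

```python
# Sketch of data-driven confidence minimization: standard cross-entropy on labeled
# training data plus a term that minimizes the model's confidence on an auxiliary
# "uncertainty dataset". The penalty used here (mean max-softmax) is one illustrative choice.
import torch
import torch.nn.functional as F

def dcm_loss(model, x_train, y_train, x_uncertain, lam=0.5):
    logits_train = model(x_train)
    ce = F.cross_entropy(logits_train, y_train)

    probs_unc = F.softmax(model(x_uncertain), dim=1)
    confidence = probs_unc.max(dim=1).values.mean()   # high = confident on unknowns

    return ce + lam * confidence

# Toy usage with a linear classifier on random data.
model = torch.nn.Linear(10, 3)
x_train, y_train = torch.randn(32, 10), torch.randint(0, 3, (32,))
x_uncertain = torch.randn(16, 10)                     # stand-in uncertainty dataset
loss = dcm_loss(model, x_train, y_train, x_uncertain)
loss.backward()
```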

URL: https://openreview.net/forum?id=QPuxjsjKCP

---

Title: What Has Been Overlooked in Contrastive Source-Free Domain Adaptation: Leveraging Source-Informed Latent Augmentation within Neighborhood Context

Abstract: Source-free domain adaptation (SFDA) involves adapting a model originally trained using a labeled dataset (source domain) to perform effectively on an unlabeled dataset (target domain) without relying on any source data during adaptation. This adaptation is especially crucial when significant disparities in data distributions exist between the two domains and when there are privacy concerns regarding the source model's training data. The absence of access to source data during adaptation makes it challenging to analytically estimate the domain gap. To tackle this issue, various techniques have been proposed, such as unsupervised clustering, contrastive learning, and continual learning. In this paper, we first conduct an extensive theoretical analysis of SFDA based on contrastive learning, primarily because it has demonstrated superior performance compared to other techniques. Motivated by the obtained insights, we then introduce a straightforward yet highly effective latent augmentation method tailored for contrastive SFDA. This augmentation method leverages the dispersion of latent features within the neighborhood of the query sample, guided by the source pre-trained model, to enhance the informativeness of positive keys. Our approach, based on a single InfoNCE-based contrastive loss, outperforms state-of-the-art SFDA methods on widely recognized benchmark datasets.
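
A schematic sketch of the neighborhood-dispersion augmentation idea, assuming a feature bank built with the source-pretrained encoder and a standard InfoNCE loss; the scaling and the precise formulation of the augmentation in the paper may differ, and all names are illustrative.

```python
# Sketch: the positive key of an InfoNCE loss is perturbed using the dispersion of
# latent features in the query's neighborhood, taken from a (source-informed) feature bank.
import torch
import torch.nn.functional as F

def augment_positive_key(query, key, bank, k=8, scale=0.5):
    # Neighborhood of the query in the feature bank.
    sims = F.normalize(query, dim=0) @ F.normalize(bank, dim=1).T
    neigh = bank[sims.topk(k).indices]
    dispersion = neigh.std(dim=0)                      # per-dimension spread of the neighborhood
    return key + scale * dispersion * torch.randn_like(key)

def info_nce(query, pos_key, neg_keys, tau=0.07):
    q = F.normalize(query, dim=0)
    keys = F.normalize(torch.cat([pos_key.unsqueeze(0), neg_keys]), dim=1)
    logits = (keys @ q) / tau
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage with random 128-d features standing in for encoder outputs.
query, key = torch.randn(128), torch.randn(128)
bank, neg_keys = torch.randn(1024, 128), torch.randn(64, 128)
loss = info_nce(query, augment_positive_key(query, key, bank), neg_keys)
```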

URL: https://openreview.net/forum?id=iulMde3dP1

---

Title: The Survival Bandit Problem

Abstract: In this paper, we introduce and study a new variant of the multi-armed bandit problem (MAB), called the survival bandit problem (S-MAB). While in both problems the objective is to maximize the so-called cumulative reward, in this new variant the procedure is interrupted if the cumulative reward falls below a preset threshold. This simple yet unexplored extension of the MAB arises in many practical applications. For example, when testing two medicines against each other on voluntary patients, people's lives and health are at stake, and it is necessary to be able to interrupt experiments if serious side effects occur or if the treatment does not alleviate the disease symptoms. From a theoretical perspective, the S-MAB is the first variant of the MAB where the procedure may or may not be interrupted.

We start by formalizing the S-MAB and we define its objective as the minimization of the so-called survival regret, which naturally generalizes the regret of the MAB. Then, we show that the objective of the S-MAB is considerably more difficult than that of the MAB, in the sense that, contrary to the MAB, no policy can achieve a reasonably small (i.e., sublinear) survival regret. Instead, we minimize the survival regret in the sense of Pareto, i.e., we seek a policy whose cumulative reward cannot be improved for some problem instance without being sacrificed for another one. For that purpose, we identify two key components in the survival regret: the regret given no ruin (which corresponds to the regret in the MAB), and the probability that the procedure is interrupted, called the probability of ruin. We derive a lower bound on the probability of ruin, as well as policies whose probability of ruin matches the lower bound. Finally, based on a doubling trick on those policies, we derive a policy which minimizes the survival regret in the sense of Pareto, providing an answer to the open problem posed by Perotto et al. (COLT 2019).

URL: https://openreview.net/forum?id=1qZyJQxOof

---

Title: Meta-learning for Positive-unlabeled Classification

Abstract: We propose a meta-learning method for positive and unlabeled (PU) classification, which improves the performance of binary classifiers obtained from only PU data in unseen target tasks. PU learning is an important problem since PU data naturally arise in real-world applications such as outlier detection and information retrieval. Existing PU learning methods require large amounts of PU data, but sufficient data are often unavailable in practice. The proposed method minimizes the test classification risk after the model is adapted to PU data by using related tasks that consist of positive, negative, and unlabeled data. We formulate the adaptation as an estimation problem of the Bayes optimal classifier, i.e., the classifier that minimizes the classification risk. The proposed method embeds each instance into a task-specific space using neural networks. With the embedded PU data, the Bayes optimal classifier is estimated through density-ratio estimation of PU densities, whose solution is obtained in closed form. The closed-form solution enables us to efficiently and effectively minimize the test classification risk. We empirically show that the proposed method outperforms existing methods on one synthetic and three real-world datasets.

URL: https://openreview.net/forum?id=zfTTBwHHvc

---

Title: Internal-Coordinate Density Modelling of Protein Structure: Covariance Matters

Abstract: After the recent ground-breaking advances in protein structure prediction, one of the remaining challenges in protein machine learning is to reliably predict distributions of structural states. Parametric models of fluctuations are difficult to fit due to complex covariance structures between degrees of freedom in the protein chain, often causing models to violate either local or global structural constraints. In this paper, we present a new strategy for modelling protein densities in internal coordinates, which uses constraints in 3D space to induce covariance structure between the internal degrees of freedom. We illustrate the potential of the procedure by constructing a variational autoencoder with full covariance output induced by the constraints implied by the conditional mean in 3D, and demonstrate that our approach makes it possible to scale density models of internal coordinates to full protein backbones in two settings: 1) a unimodal setting for proteins exhibiting small fluctuations and limited amounts of available data, and 2) a multimodal setting for larger conformational changes in a high data regime.

URL: https://openreview.net/forum?id=9XRZtZRmEB

---

Title: Differentially Private Kernel Inducing Points using features from ScatterNets (DP-KIP-ScatterNet) for Privacy Preserving Data Distillation

Abstract: While it is tempting to believe that data distillation itself preserves privacy, distilled data's empirical robustness against known attacks does not imply a provable privacy guarantee. Here, we develop a provably privacy-preserving data distillation algorithm, differentially private kernel inducing points (DP-KIP), by utilizing DP-SGD. While our original intention was simply to apply DP-SGD to the KIP framework, we find that KIP using infinitely-wide convolutional neural tangent kernels (conv-NTKs) performs better than KIP using fully-connected NTKs. However, KIP with conv-NTKs, due to its convolutional and pooling operations, introduces a prohibitive computational complexity, requiring hundreds of V100 GPUs in parallel to train; this is impractical and, more importantly, such computational resources are inaccessible to many. To overcome this issue, we propose an alternative that does not require pre-training (to avoid a privacy loss) and can capture complex information in images as well as the features from conv-NTKs do, while keeping the computational cost manageable on a single V100 GPU. To this end, we propose DP-KIP-ScatterNet, which uses the wavelet features from Scattering networks (ScatterNet) instead of those from conv-NTKs to perform DP-KIP at a reasonable computational cost. We implement DP-KIP-ScatterNet in computationally efficient JAX and test it on several popular image datasets, showing its efficacy and its superior performance compared to state-of-the-art methods in image data distillation with differential privacy guarantees.

URL: https://openreview.net/forum?id=84M8xwNxrc

---

Title: Physics Informed Distillation for Diffusion Models

Abstract: Diffusion models have recently emerged as a potent tool in generative modeling. However, their inherent iterative nature often results in sluggish image generation due to the requirement for multiple model evaluations. Recent progress has unveiled the intrinsic link between diffusion models and Probability Flow Ordinary Differential Equations (ODEs), thus enabling us to conceptualize diffusion models as ODE systems. Simultaneously, Physics Informed Neural Networks (PINNs) have substantiated their effectiveness in solving intricate differential equations through implicit modeling of their solutions. Building upon these foundational insights, we introduce Physics Informed Distillation (PID), which employs a student model to represent the solution of the ODE system corresponding to the teacher diffusion model, akin to the principles employed in PINNs. Through experiments on CIFAR 10 and ImageNet 64x64, we observe that PID achieves performance comparable to recent distillation methods. Notably, it demonstrates predictable trends concerning method-specific hyperparameters and eliminates the need for synthetic dataset generation during the distillation process. Both properties contribute to its ease of use as a distillation approach for diffusion models.

URL: https://openreview.net/forum?id=rOvaUsF996

---

Title: Hybrid Active Learning with Uncertainty-Weighted Embeddings

Abstract: We introduce a hybrid active learning method that simultaneously considers uncertainty and diversity for sample selection. Our method consists of two key steps: computing a novel uncertainty-weighted embedding, then applying distance-based sampling for sample selection. Our proposed uncertainty-weighted embedding is computed by weighting a sample's feature representation by an uncertainty measure. We show how this embedding generalizes the gradient embedding of BADGE so it can be used with arbitrary loss functions and be computed more efficiently, especially for dense prediction tasks and network architectures with large numbers of parameters in the final layer. We extensively evaluate the proposed hybrid active learning method on image classification, semantic segmentation and object detection tasks, and demonstrate that it achieves state-of-the-art performance.
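
A minimal sketch of the two steps described above, assuming predictive entropy as the uncertainty measure and a k-means++-style sampler for the distance-based selection; both are illustrative choices rather than the paper's exact instantiation.

```python
# Sketch: (1) weight each sample's feature vector by an uncertainty score to form an
# uncertainty-weighted embedding, and (2) run distance-based (k-means++-style)
# sampling on those embeddings to pick a diverse, uncertain batch.
import numpy as np

def uncertainty_weighted_embeddings(features, probs):
    # Entropy of the predictive distribution as the uncertainty weight.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return features * entropy[:, None]

def kmeanspp_select(emb, budget, rng=np.random.default_rng(0)):
    chosen = [int(rng.integers(len(emb)))]
    d2 = np.sum((emb - emb[chosen[0]]) ** 2, axis=1)
    while len(chosen) < budget:
        probs = d2 / d2.sum()
        nxt = int(rng.choice(len(emb), p=probs))       # far-away points more likely
        chosen.append(nxt)
        d2 = np.minimum(d2, np.sum((emb - emb[nxt]) ** 2, axis=1))
    return chosen

# Toy usage: 1000 unlabeled samples with 32-d features and 5-class softmax outputs.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))
logits = rng.normal(size=(1000, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
emb = uncertainty_weighted_embeddings(features, probs)
batch = kmeanspp_select(emb, budget=10)
```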

URL: https://openreview.net/forum?id=jD761b5OaE

---

Title: A Study of the Effects of Transfer Learning on Adversarial Robustness

Abstract: The security and robustness of AI systems are paramount in real-world applications. Previous research has focused on developing methods to train robust networks, assuming the availability of sufficient labeled training data. However, in deployment scenarios with limited training data, existing techniques for training robust networks become impractical. In such low-data scenarios, non-robust training methods often resort to transfer learning. This involves pre-training a network on a large, possibly labeled dataset and fine-tuning it for a new task with a limited set of training samples. The efficacy of transfer learning in enhancing adversarial robustness has not been comprehensively explored. Specifically, it remains uncertain whether transfer learning can improve adversarial performance in low-data scenarios. Furthermore, the potential benefits of transfer learning for certified robustness remain unexplored. In this paper, we conduct an extensive analysis of the impact of transfer learning on both empirical and certified adversarial robustness. Employing supervised and self-supervised pre-training methods and fine-tuning across 12 downstream tasks representing diverse data availability scenarios, we identify the conditions conducive to training adversarially robust models through transfer learning. Our study reveals that the effectiveness of transfer learning in improving adversarial robustness is attributed to an increase in standard accuracy and not the direct ``transfer'' of robustness from the source to the target task, contrary to previous beliefs. Our findings provide valuable insights for practitioners aiming to deploy robust ML models in their applications.

URL: https://openreview.net/forum?id=T6RygOFZ6B

---

Title: Learning Hybrid Interpretable Models: Theory, Taxonomy, and Methods

Abstract: A hybrid model involves the cooperation of an interpretable model and a complex black box. At inference, any input of the hybrid model is assigned to either its interpretable or complex component based on a gating mechanism. The advantages of such models over classical ones are two-fold: 1) They grant users precise control over the level of transparency of the system and 2) They can potentially perform better than a standalone black box since redirecting some of the inputs to an interpretable model implicitly acts as regularization. Still, despite their high potential, hybrid models remain under-studied in the interpretability/explainability literature. In this paper, we address this gap by presenting a thorough investigation of such models from three perspectives: Theory, Taxonomy, and Methods. First, we explore the theory behind the generalization of hybrid models from the Probably-Approximately-Correct (PAC) perspective. A consequence of our PAC guarantee is the existence of a sweet spot for the optimal transparency of the system. When such a sweet spot is attained, a hybrid model can potentially perform better than a standalone black box. Second, we provide a general taxonomy for the different ways of training hybrid models: the Post-Black-Box and Pre-Black-Box paradigms. These approaches differ in the order in which the interpretable and complex components are trained. We show where the state-of-the-art hybrid models Hybrid-Rule-Set and Companion-Rule-List fall in this taxonomy.
Third, we implement the two paradigms in a single method: HybridCORELS, which extends the CORELS algorithm to hybrid modeling. By leveraging CORELS, HybridCORELS provides a certificate of optimality of its interpretable component and precise control over transparency. We finally show empirically that HybridCORELS is competitive with existing hybrid models, and performs just as well as a standalone black box (or even better) while being partly transparent.
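
A minimal sketch of hybrid-model inference under a simple confidence-threshold gate, assuming scikit-learn models as the interpretable and black-box components; HybridCORELS and the paper's training paradigms are not reproduced here.

```python
# Sketch: a gate decides, per input, whether the prediction comes from the
# interpretable component (here, a shallow tree used only when confident) or the black box.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
interpretable = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

def hybrid_predict(x, threshold=0.9):
    p = interpretable.predict_proba(x)
    use_interp = p.max(axis=1) >= threshold            # gate: interpretable model if confident
    preds = black_box.predict(x)
    preds[use_interp] = p[use_interp].argmax(axis=1)
    return preds, use_interp.mean()                    # transparency = fraction handled interpretably

preds, transparency = hybrid_predict(X)
print(f"transparency (coverage of interpretable part): {transparency:.2f}")
```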

URL: https://openreview.net/forum?id=XzaSGIStXP

---

Title: Federated Variational Inference: Towards Improved Personalization and Generalization

Abstract: Conventional federated learning algorithms train a single global model by leveraging all participating clients’ data. However, due to heterogeneity in client generative distributions and predictive models, these approaches may not appropriately approximate the predictive
process, converge to an optimal state, or generalize to new clients. We study personalization and generalization in stateless cross-device federated learning setups assuming heterogeneity in client data distributions and predictive models. We first propose a hierarchical generative model and formalize it using Bayesian Inference. We then approximate this process using Variational Inference to train our model efficiently. We call this algorithm Federated Variational Inference (FedVI). We use PAC-Bayes analysis to provide generalization bounds for FedVI. We evaluate our model on FEMNIST and CIFAR-100 image classification and show that FedVI beats the state-of-the-art on both tasks.

URL: https://openreview.net/forum?id=6OmRkUHgs5

---

Title: MUBen: Benchmarking the Uncertainty of Molecular Representation Models

Abstract: Large molecular representation models pre-trained on massive unlabeled data have shown great success in predicting molecular properties. However, these models can overfit the fine-tuning data, resulting in over-confident predictions on test data that fall outside of the training distribution. To address this issue, uncertainty quantification (UQ) methods can be used to improve the models' calibration of predictions. Although many UQ approaches exist, not all of them lead to improved performance. While some studies have included UQ to improve molecular pre-trained models, the process of selecting suitable backbone and UQ methods for reliable molecular uncertainty estimation remains underexplored. To address this gap, we present MUBen, which evaluates different UQ methods for state-of-the-art backbone molecular representation models to investigate their capabilities. By fine-tuning various backbones using different molecular descriptors as inputs with UQ methods from different categories, we critically assess the influence of architectural decisions and training strategies. Our study offers insights for selecting UQ methods for backbone models, which can facilitate research on uncertainty-critical applications in fields such as materials science and drug discovery.

URL: https://openreview.net/forum?id=qYceFeHgm4

---

Title: A True-to-the-model Axiomatic Benchmark for Graph-based Explainers

Abstract: Regulators, researchers, and practitioners recognize the urgency of explainability in artificial intelligence systems, including the ones based on machine learning for graph-structured data. Despite the large number of proposals, however, a common understanding of what constitutes a good explanation is still lacking: different explainers often arrive at different conclusions on the same problem instance, making it hard for practitioners to choose among them. Furthermore, explainers often produce explanations through opaque logic that is hard to understand and assess -- ironically mirroring the black-box nature they aim to elucidate.

Recent proposals in the literature for benchmarking graph-based explainers typically involve embedding specific logic into data, training a black-box model, and then empirically assessing how well the explanation matches the embedded logic, i.e., they test truthfulness to the data. In contrast, we propose a true-to-the-model axiomatic framework for auditing explainers in the task of node classification on graphs.
Our proposal hinges on the fundamental idea that an explainer should discern if a model relies on a particular feature for classifying a node.
Building on this concept, we develop three types of white-box classifiers, with clear internal logic, that are relevant in real-world applications. We then formally prove that the set of features that can induce a change in the classification correctly corresponds to a ground-truth set of predefined important features. This property allows us to use the white-box classifiers to build a testing framework. We apply this framework to both synthetic and real data and evaluate various state-of-the-art explainers, thus characterizing their behavior. Our findings highlight how explainers often react in a rather counter-intuitive fashion to technical details that might be easily overlooked. Our approach offers
valuable insights and recommended practices for selecting the right explainer given the task at hand, and for developing new methods for explaining graph-learning models.

URL: https://openreview.net/forum?id=HSQTv3R8Iz

---

Title: One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning

Abstract: We present Generalized LoRA (GLoRA), a flexible approach for universal parameter-efficient fine-tuning tasks. Enhancing Low-Rank Adaptation (LoRA), GLoRA employs a generalized prompt module to optimize pre-trained model weights and adjust intermediate activations, providing more flexibility and capability across diverse tasks and datasets. Moreover, GLoRA facilitates efficient parameter adaptation by employing a scalable, modular, and layer-wise structure search that learns the individual adapter of each layer. Originating from a unified mathematical formulation, GLoRA exhibits strong transfer learning, few-shot learning, and domain generalization abilities, as it adapts to new tasks through not only the weights but also additional dimensions like activations. Comprehensive experiments demonstrate that GLoRA outperforms all previous methods on natural, specialized, and structured benchmarks in the field of vision, achieving superior accuracy with fewer parameters and computations. To demonstrate its applicability in the language domain, we apply GLoRA to LLaMA-1/2 models, which also achieves considerable gains compared to the original LoRA. Furthermore, our structural re-parameterization design ensures that GLoRA incurs no extra inference cost, rendering it a practical solution for resource-limited applications.
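
A hedged sketch of the general idea of adapting both weights and intermediate activations in one module, assuming a LoRA-style low-rank weight update plus a learnable scale and shift on the output; this is a simplified reading of the abstract, not GLoRA's unified formulation or its layer-wise structure search.

```python
# Sketch: a frozen linear layer augmented with a low-rank weight update and a
# learnable scale/shift on its output activations.
import torch
import torch.nn as nn

class GeneralizedAdapterLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # pre-trained weights stay frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, rank))
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.scale = nn.Parameter(torch.ones(d_out))   # activation adjustment
        self.shift = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        h = self.base(x) + x @ (self.A @ self.B).T     # low-rank weight update
        return self.scale * h + self.shift             # adjust intermediate activations

layer = GeneralizedAdapterLinear(nn.Linear(16, 8))
out = layer(torch.randn(4, 16))
```

Because the low-rank update, scale, and shift are all linear, they can be folded into the base weight and bias after training, which is the kind of structural re-parameterization that avoids extra inference cost.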

URL: https://openreview.net/forum?id=TQeT7UupFk

---

Title: Double Descent and Overfitting under Noisy Inputs and Distribution Shift

Abstract: Despite the importance of denoising in modern machine learning and ample empirical work on supervised denoising, its theoretical understanding is still relatively scarce. One concern about studying supervised denoising is that one might not always have noiseless training data from the test distribution. It is more reasonable to have access to noiseless training data from a different dataset than the test dataset. Motivated by this, we study supervised denoising and noisy-input regression under distribution shift. We add three considerations to increase the applicability of our theoretical insights to real-life data and modern machine learning. First, while most past theoretical work assumes that the data covariance matrix is full-rank and well-conditioned, empirical studies have shown that real-life data is approximately low-rank. Thus, we assume that our data matrices are low-rank. Second, we drop independence assumptions on our data. Third, the rise in computational power and dimensionality of data have made it important to study non-classical regimes of learning. Thus, we work in the non-classical proportional regime, where data dimension $d$ and number of samples $N$ grow as $d/N = c + o(1)$.

For this setting, we derive general test error expressions for both denoising and noisy-input regression, and study when overfitting the noise is benign, tempered or catastrophic. We show that the test error exhibits double descent under general distribution shift, providing insights for data augmentation and the role of noise as an implicit regularizer. We also perform experiments using real-life data, where we match the theoretical predictions with under 1\% MSE error for low-rank data.

URL: https://openreview.net/forum?id=HxfqTdLIRF

---

Title: Shifted Inverse stereographic normal distributions as flexible distribution family on the hypertorus

Abstract: Circular data arises in various fields including robotics, biology, geology and material sciences. Modelling such data requires flexible distribution families on the hypertorus. Common choices are the von Mises and the wrapped normal distributions. In this work we investigate the \textit{inverse stereographic normal distribution} as an interesting and computationally appealing alternative. We demonstrate its flexibility and practical applicability by fitting mixtures of shifted inverse stereographic normal distributions via gradient descent to dihedral data of protein backbones characterizing the conformational landscape of folding. Furthermore, we prove that the inverse stereographic normal distribution is unimodal if and only if all eigenvalues of the covariance matrix are less than or equal to $0.5$.
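
A minimal sketch of sampling from a shifted inverse stereographic normal on the 2-torus, assuming the common convention theta = 2·arctan(x) for the inverse stereographic map; the paper's exact parameterization may differ.

```python
# Sketch: draw from a multivariate normal in R^2 and push each coordinate through the
# inverse stereographic projection onto the circle, then apply an angular shift.
import numpy as np

def sample_shifted_isn(mean, cov, shift, n=1000, rng=np.random.default_rng(0)):
    z = rng.multivariate_normal(mean, cov, size=n)          # Euclidean normal samples
    angles = 2.0 * np.arctan(z)                              # inverse stereographic map to (-pi, pi)
    return np.mod(angles + shift + np.pi, 2 * np.pi) - np.pi # shift on the torus

# Toy usage: a dihedral-angle-like sample on the 2-torus.
cov = np.array([[0.3, 0.1],
                [0.1, 0.4]])                                 # eigenvalues <= 0.5 -> unimodal (per the paper)
samples = sample_shifted_isn(mean=[0.0, 0.0], cov=cov, shift=np.array([1.0, -2.0]))
print(samples.shape, samples.min(), samples.max())
```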

URL: https://openreview.net/forum?id=RVTBo5FoDV

---

Title: Extreme Risk Mitigation in Reinforcement Learning using Extreme Value Theory

Abstract: Risk-sensitive reinforcement learning (RL) has garnered significant attention in recent years due to the growing interest in deploying RL agents in real-world scenarios. A critical aspect of risk awareness involves modelling highly rare risk events (rewards) that could potentially lead to catastrophic outcomes. These infrequent occurrences present a formidable challenge for data-driven methods aiming to capture such risky events accurately. While risk-aware RL techniques do exist, they suffer from high-variance estimation due to the inherent data scarcity. Our work proposes to enhance the resilience of RL agents when faced with very rare and risky events by refining the extreme values predicted by the state-action value distribution. To achieve this, we formulate the extreme values of the state-action value function distribution as parameterized distributions, drawing inspiration from the principles of extreme value theory (EVT). We propose an extreme value theory based actor-critic approach, Extreme Valued Actor-Critic (EVAC), which effectively addresses the issue of infrequent occurrence by leveraging EVT-based parameterization. Importantly, we theoretically demonstrate the advantages of employing these parameterized distributions in contrast to other risk-averse algorithms. Our evaluations show that the proposed method outperforms other risk-averse RL algorithms on a diverse range of benchmark tasks, each encompassing distinct risk scenarios.

URL: https://openreview.net/forum?id=098mb06uhA

---

Title: PID Control-Based Self-Healing to Improve the Robustness of Large Language Models

Abstract: Despite the effectiveness of deep neural networks in numerous natural language processing applications, recent findings have exposed the vulnerability of these language models when minor perturbations are introduced. While appearing semantically indistinguishable to humans, these perturbations can significantly reduce the performance of well-trained language models, raising concerns about the reliability of deploying them in safety-critical situations. In this work, we construct a computationally efficient self-healing process to correct undesired model behavior during online inference when perturbations are applied to input data. This is formulated as a trajectory optimization problem in which the internal states of the neural network layers are automatically corrected using a PID (Proportional-Integral-Derivative) control mechanism. The P controller targets immediate state adjustments, while the I and D controllers consider past states and future dynamical trends, respectively. We leverage the geometrical properties of the training data to design effective linear PID controllers. This approach reduces the computational cost to that of using just the P controller, instead of the full PID control. Further, we introduce an analytical method for approximating the optimal control solutions, enhancing the real-time inference capabilities of this controlled system. Moreover, we conduct a theoretical error analysis of the analytic solution in a simplified setting. The proposed PID control-based self-healing is a low-cost framework that improves the robustness of pre-trained large language models, whether standard or robustly trained, against a wide range of perturbations.
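
A schematic sketch of the PID mechanics only, assuming the "clean" reference is a fixed projection matrix standing in for geometry learned from training data; the paper designs its linear controllers from that geometry and approximates the optimal control analytically, which is not reproduced here.

```python
# Sketch: at each layer, the hidden state's deviation from a reference subspace is
# treated as the error signal, and a PID control term is added before the next layer.
import torch

def pid_self_healing_forward(layers, x, P_clean, kp=0.5, ki=0.1, kd=0.1):
    h = x
    integral, prev_err = torch.zeros_like(x), torch.zeros_like(x)
    for layer in layers:
        err = h @ P_clean - h                          # deviation from the clean subspace
        integral = integral + err                      # I term accumulates past errors
        control = kp * err + ki * integral + kd * (err - prev_err)  # P + I + D
        prev_err = err
        h = torch.tanh(layer(h + control))             # corrected state enters the next layer
    return h

# Toy usage: 4 linear layers and a random rank-deficient projection as the reference subspace.
d = 32
layers = [torch.nn.Linear(d, d) for _ in range(4)]
U, _ = torch.linalg.qr(torch.randn(d, 8))
P_clean = U @ U.T
out = pid_self_healing_forward(layers, torch.randn(16, d), P_clean)
```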

URL: https://openreview.net/forum?id=Fu4mwB0XIU

---

Title: Soft Merging of Experts with Adaptive Routing

Abstract: Neural networks that learn to route their inputs through different "expert" subnetworks provide a form of modularity that standard dense models lack. Despite their possible benefits, modular models with learned routing often underperform their parameter-matched dense counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train modular models that use non-differentiable discrete routing decisions. To address this issue, we introduce $\textbf{S}$oft $\textbf{M}$erging of $\textbf{E}$xperts with $\textbf{A}$daptive $\textbf{R}$outing (SMEAR), which avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters. By routing activations through a single merged expert, SMEAR does not incur a significant increase in computational costs and enables standard gradient-based training. We empirically validate that models using SMEAR outperform models that route based on metadata or learn routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization.
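
A minimal sketch of SMEAR's core operation as described above: the router's probabilities average the experts' parameters into a single merged expert that activations pass through once. The module shapes and the pooling used to compute routing probabilities are illustrative.

```python
# Sketch: merge expert parameters with routing probabilities (fully differentiable),
# then make one pass through the merged expert instead of discrete routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMEARLayer(nn.Module):
    def __init__(self, d=32, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.router = nn.Linear(d, n_experts)

    def forward(self, x):
        # Routing probabilities from the mean token representation (one choice among many).
        probs = F.softmax(self.router(x.mean(dim=0)), dim=-1)
        # Weighted average of the experts' parameters.
        W = sum(p * e.weight for p, e in zip(probs, self.experts))
        b = sum(p * e.bias for p, e in zip(probs, self.experts))
        return F.linear(x, W, b)                        # single pass through the merged expert

layer = SMEARLayer()
out = layer(torch.randn(10, 32))                        # 10 tokens of dimension 32
```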

URL: https://openreview.net/forum?id=7I199lc54z

---

Title: Accelerated Deep Active Learning with Graph-based Sub-Sampling

Abstract: Past years have witnessed the fast and thorough development of active learning, a human-in-the-loop semi-supervised learning paradigm that helps reduce the burden of expensive data annotation. Diverse techniques have been proposed to improve the efficiency of label acquisition. However, the existing techniques are mostly intractable at scale on massive unlabeled instances. In particular, the query time and model retraining time of large-scale image-data models are usually linear or even quadratic in the size of the unlabeled pool set and its dimension. The main reason for this intractability is the need to scan the pool set at least once per iteration in order to select the best samples for label annotation.

To alleviate this computational burden we propose efficient Diffusion Graph Active Learning (DGAL). DGAL operates on a pre-computed Variational Autoencoder (VAE) latent space to restrict the pool set to a much smaller candidate set. This sub-sample is then used in deep architectures, via an additional standard active learning baseline criterion, to reduce the query time.
DGAL demonstrates a query time versus accuracy trade-off that is two or more orders of magnitude faster than state-of-the-art methods. Moreover, we demonstrate the important exploration-exploitation trade-off in DGAL that allows the restricted set to capture the most impactful samples for active learning at each iteration.

URL: https://openreview.net/forum?id=iBorskJRrg

---

Title: Adversarial Imitation Learning from Visual Observations using Latent Information

Abstract: We focus on the problem of imitation learning from visual observations, where the learning agent has access to videos of experts as its sole learning source. The challenges of this framework include the absence of expert actions and the partial observability of the environment, as the ground-truth states can only be inferred from pixels. To tackle this problem, we first conduct a theoretical analysis of imitation learning in partially observable environments. We establish upper bounds on the suboptimality of the learning agent with respect to the divergence between the expert and the agent latent state-transition distributions. Motivated by this analysis, we introduce an algorithm called Latent Adversarial Imitation from Observations, which combines off-policy adversarial imitation techniques with a learned latent representation of the agent's state from sequences of observations. In experiments on high-dimensional continuous robotic tasks, we show that our algorithm matches state-of-the-art performance while providing significant computational advantages. Additionally, we show how our method can be used to improve the efficiency of reinforcement learning from pixels by leveraging expert videos. To ensure reproducibility, we provide free access to our code.

URL: https://openreview.net/forum?id=ydPHjgf6h0

---

Title: AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents

Abstract: Recently, the scientific community has questioned the statistical reproducibility of many empirical results, especially in the field of machine learning.
To solve this reproducibility crisis, we propose a theoretically sound methodology to compare the overall performance of multiple algorithms with stochastic returns. We exemplify our methodology in Deep Reinforcement Learning (Deep RL). Indeed, the performance of one execution of a Deep RL algorithm is random. Therefore, several independent executions are needed to accurately evaluate the overall performance.
When comparing several RL algorithms, a major question is how many executions must be made and how we can ensure that the results of such a comparison are theoretically sound. Researchers in Deep RL often use fewer than 5 independent executions
to compare algorithms: we claim that this is not enough in general. Moreover, when comparing several algorithms at once, the error of each comparison may accumulate and must be taken into account with a multiple tests procedure to preserve low error guarantees. We introduce AdaStop, a new statistical test based on multiple group sequential tests.
When comparing algorithms, AdaStop adapts the number of executions to stop as early as possible while ensuring that we have enough information to distinguish algorithms that perform better than the others in a statistically significant way. We prove theoretically and empirically that AdaStop has a low probability of making a (family-wise) error. Finally, we illustrate the effectiveness of AdaStop in multiple Deep RL use-cases, including toy examples and challenging Mujoco environments.

AdaStop is the first statistical test fitted to this sort of comparison: AdaStop is both a significant contribution to statistics and a major contribution to computational studies performed in reinforcement learning and other domains.

To summarize our contribution, we introduce AdaStop, a formally grounded statistical tool to let anyone answer the practical question: ``Is my algorithm the new state-of-the-art?''

URL: https://openreview.net/forum?id=lXyZr9TLEU

---

Title: Lyra: Orchestrating Dual Correction in Automated Theorem Proving

Abstract: Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in this field, we introduce Lyra, a new framework that employs two distinct correction mechanisms: Tool Correction (TC) and Conjecture Correction (CC). To implement Tool Correction in the post-processing of formal proofs, we leverage prior knowledge to utilize predefined prover tools (e.g., Sledgehammer) to guide the replacement of incorrect tools. Tool Correction significantly contributes to mitigating hallucinations, thereby improving the overall accuracy of the proof. In addition, we introduce Conjecture Correction, an error feedback mechanism designed to interact with the prover and refine formal proof conjectures using prover error messages. Compared to the previous refinement framework, the proposed Conjecture Correction refines generation with instructions but does not collect paired (generation, error & refinement) prompts. Our method has achieved state-of-the-art (SOTA) performance on both miniF2F validation (48.0% → 55.3%) and test (45.5% → 51.2%). We also present 3 IMO problems solved by Lyra. We believe Tool Correction (post-processing for hallucination mitigation) and Conjecture Correction (subgoal adjustment from interaction with the environment) could provide a promising avenue for future research in this field.

URL: https://openreview.net/forum?id=Svt75kotzs

---

Title: Online Continuous Hyperparameter Optimization for Generalized Linear Contextual Bandits

Abstract: In stochastic contextual bandits, an agent sequentially chooses actions from a time-dependent action set, based on past experience, to minimize cumulative regret. Like many other machine learning algorithms, the performance of bandits heavily depends on the values of hyperparameters, and theoretically derived parameter values may lead to unsatisfactory results in practice. Moreover, it is infeasible to use offline tuning methods like cross-validation to choose hyperparameters under the bandit environment, as the decisions should be made in real-time. To address this challenge, we propose the first online continuous hyperparameter tuning framework for contextual bandits, which learns the optimal parameter configuration within a search space on the fly. Specifically, we use a double-layer bandit framework named CDT (Continuous Dynamic Tuning) and formulate the hyperparameter optimization as a non-stationary continuum-armed bandit, where each arm represents a combination of hyperparameters, and the corresponding reward is the algorithmic result. For the top layer, we propose the Zooming TS algorithm, which utilizes Thompson Sampling (TS) for exploration and a restart technique to handle the \textit{switching} environment. The proposed CDT framework can be easily used to tune contextual bandit algorithms without any pre-specified candidate set for multiple hyperparameters. We further show that it achieves sublinear regret in theory and performs consistently better than all existing methods on both synthetic and real datasets.

URL: https://openreview.net/forum?id=lQE2AcbYge

---

Title: Single Image Test-Time Adaptation for Segmentation

Abstract: Test-Time Adaptation (TTA) methods improve the robustness of deep neural networks to domain shift on various tasks such as image classification, key-point estimation, or segmentation. However, the overwhelming majority of methods have been developed for image classification, and new image segmentation methods are each evaluated under very different conditions and compared to a limited set of baselines, making it difficult to understand their performance.
This work explores adapting segmentation models to a single unlabelled image with no other data available at test-time. This allows for individual sample performance analysis while excluding orthogonal factors such as weight restart strategies.
A diverse set of baselines, some modified from other domains or modalities, are first thoroughly validated on synthetic domain shifts and then tested on real datasets. The analysis highlights that simple optimization improvements such as proper choice of the loss function can greatly improve the performance of standard baselines such as pseudolabelling and that different methods and hyper-parameters are optimal for different kinds of domain shift, hindering the TTA performance where no prior knowledge about the domain shift is assumed.

URL: https://openreview.net/forum?id=68LsWm2GuD

---

Title: On the Inherent Privacy Properties of Discrete Denoising Diffusion Models

Abstract: Privacy concerns have led to a surge in the creation of synthetic datasets, with diffusion models emerging as a promising avenue. Although prior studies have performed empirical evaluations on these models, there has been a gap in providing a mathematical characterization of their privacy-preserving capabilities. To address this, we present the pioneering theoretical exploration of the privacy preservation inherent in \emph{discrete diffusion models} (DDMs) for discrete dataset generation. Focusing on per-instance differential privacy (pDP), our framework elucidates the potential privacy leakage for each data point in a given training dataset, offering insights into how the privacy loss of each point correlates with the dataset's distribution. Our bounds also show that training on a dataset of $s$ data points leads to a surge in privacy leakage of the DDM from $(\epsilon, \mathcal{O}(\frac{1}{s^2\epsilon}))$-pDP to $(\epsilon, \mathcal{O}(\frac{1}{s\epsilon}))$-pDP during the transition from the pure-noise phase to the synthetic clean-data phase, and that a faster decay in diffusion coefficients amplifies the privacy guarantee. Finally, we empirically verify our theoretical findings on both synthetic and real-world datasets.

URL: https://openreview.net/forum?id=UuU6C6CUoF

---

Title: Knowledge Translation: A New Pathway for Model Compression

Abstract: Deep learning has witnessed significant advancements in recent years at the cost of increasing training, inference, and model storage overhead. While existing model compression methods strive to reduce the number of model parameters while maintaining high accuracy, they inevitably necessitate the re-training of the compressed model or impose architectural constraints. To overcome these limitations, this paper presents a novel framework, termed Knowledge Translation (KT), wherein a “translation” model is trained to receive the parameters of a larger model and generate compressed parameters. The concept of KT draws inspiration from language translation, which effectively employs neural networks to convert between different languages while preserving meaning. Accordingly, we explore the potential of neural networks to convert models of disparate sizes while preserving their functionality. We propose a comprehensive framework for KT, introduce data augmentation strategies to enhance model performance despite limited training data, and successfully demonstrate the feasibility of KT on the MNIST dataset. Code is available in the supplementary material.

URL: https://openreview.net/forum?id=KVfbOupbxB

---

Title: Understanding and Improving Transfer Learning of Deep Models via Neural Collapse

Abstract: With the ever-increasing complexity of large-scale pre-trained models coupled with a shortage of labeled data for downstream training, transfer learning has become the primary approach in many fields, including natural language processing, computer vision, and multi-modal learning. Despite recent progress, the fine-tuning process for large-scale pre-trained models in vision still mostly relies on trial and error. This work investigates the relationship between neural collapse (NC) and transfer learning for classification problems. NC is an intriguing yet prevalent phenomenon that has recently been discovered in the final-layer features and linear classifiers of trained neural networks. Specifically, during the terminal phase of training, NC implies that the variability of the features within each class diminishes to zero, while the means of features between classes are maximally and equally distanced. In this work, we examine the NC attributes of pre-trained models on both downstream and training data for transfer learning, and we find a strong correlation between feature collapse and downstream performance. In particular, we discover a systematic pattern that emerges when linear probing pre-trained models on downstream training data: the more feature collapse of pre-trained models on downstream data, the higher the transfer accuracy.
Additionally, we also study the relationship between NC and transfer accuracy on the training data. We validate these findings through a series of comprehensive experiments under a range of conditions. Moreover, these findings allow us to develop a principled, parameter-efficient fine-tuning method that employs skip-connections to induce last-layer feature collapse on downstream data. When compared to full model fine-tuning, our proposed fine-tuning method delivers comparable or even superior performance, while reducing fine-tuning parameters by at least 90\% and mitigating overfitting, especially when the downstream data is scarce.

URL: https://openreview.net/forum?id=o8r84MzTQB

---

Title: Vision Learners Meet Web Image-Text Pairs

Abstract: Most recent self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset.
In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy web sourced image-text paired data.
First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting.
We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training.
We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks.
We derive an information-theoretical view to explain these benchmark results, which provides insight into how to design a novel vision learner.
Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator~(MUG), that learns from scalable web sourced image-text data.
MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties.
Pre-trained models and code will be made public upon acceptance.

URL: https://openreview.net/forum?id=OCnTRfmaTb

---

Title: End-to-End Mesh Optimization of a Hybrid Deep Learning Black-Box PDE Solver

Abstract: Deep learning has been widely applied to solve partial differential equations (PDEs) in computational fluid dynamics. Recent research proposed a PDE correction framework that leverages deep learning to correct the solution obtained by a PDE solver on a coarse mesh. However, end-to-end training of such a PDE correction model over both solver-dependent parameters such as mesh parameters and neural network parameters requires the PDE solver to support automatic differentiation through the iterative numerical process. Such a feature is not readily available in many existing solvers. In this study, we explore the feasibility of end-to-end training of a hybrid model with a black-box PDE solver and a deep learning model for fluid flow prediction. Specifically, we investigate a hybrid model that integrates a black-box PDE solver into a differentiable deep graph neural network. To train this model, we use a zeroth-order gradient estimator to differentiate the PDE solver via forward propagation. Although experiments show that the proposed approach based on zeroth-order gradient estimation underperforms the baseline that computes exact derivatives using automatic differentiation, our proposed method outperforms the baseline trained with a frozen input mesh to the solver.
Moreover, with a simple warm-start on the neural network parameters, we show that models trained by these zeroth-order algorithms achieve an accelerated convergence and improved generalization performance.
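
A minimal sketch of zeroth-order (finite-difference) gradient estimation through a black-box solver, with a toy stand-in for the external CFD solve; the estimator below is a generic SPSA-style scheme, not necessarily the exact estimator used in the paper.

```python
# Sketch: the loss is evaluated at randomly perturbed mesh parameters and the
# difference is used as a gradient estimate, so no automatic differentiation
# through the solver is required.
import numpy as np

def black_box_solver(mesh_params):
    # Stand-in for an external, non-differentiable PDE solve.
    return np.sin(mesh_params) + 0.1 * mesh_params ** 2

def loss(mesh_params, target):
    return np.mean((black_box_solver(mesh_params) - target) ** 2)

def zeroth_order_grad(f, theta, n_dirs=8, eps=1e-2, rng=np.random.default_rng(0)):
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.normal(size=theta.shape)               # random probe direction
        grad += (f(theta + eps * u) - f(theta - eps * u)) / (2 * eps) * u
    return grad / n_dirs

# Toy usage: adjust mesh parameters to match a target field.
theta = np.zeros(5)
target = black_box_solver(np.array([0.3, -0.2, 0.5, 0.1, -0.4]))
for step in range(200):
    theta -= 0.05 * zeroth_order_grad(lambda p: loss(p, target), theta)
print("final loss:", loss(theta, target))
```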

URL: https://openreview.net/forum?id=zAAX1GzuMf

---

Title: Revisiting Residual Connections for Neural Structure Learning

Abstract: Recent studies reveal that deep representation learning models without proper regularization can suffer from the dimensional collapse issue, i.e., representation vectors span a lower-dimensional space. In the domain of graph deep representation learning, the phenomenon that node representations become indistinguishable and even shrink to a constant vector is called oversmoothing. Based on an analysis of the rank of node representations, we find that representation oversmoothing and dimensional collapse are highly related for deep graph neural networks (GNNs), and the oversmoothing problem can be interpreted through the dimensional collapse of the representation matrix. To address dimensional collapse and the oversmoothing it triggers in deep graph neural networks, we first find that vanilla residual connections and contrastive learning produce sub-optimal outcomes by ignoring the structural information of graph data. Motivated by this, we propose a novel graph neural network named GearGNN that addresses the oversmoothing issue from the perspective of dimensional collapse in two ways. Specifically, in GearGNN, we design a topology-preserving residual connection for graph neural networks to keep the low-rank hidden representations close to the full-rank input features. Also, we propose a structure-guided contrastive loss to ensure that only close neighbors who share similar local connections have similar representations. Empirical experiments on multiple real-world datasets demonstrate that GearGNN outperforms state-of-the-art deep graph representation baseline algorithms.

URL: https://openreview.net/forum?id=PxurZnjGy9

---

Title: ITEM: Improving Training and Evaluation of Message-Passing based GNNs for top-k recommendation

Abstract: Graph Neural Networks (GNNs), especially message-passing-based models, have become prominent in top-k recommendation tasks, outperforming matrix factorization models due to their ability to efficiently aggregate information from a broader context.
Although GNNs are evaluated with ranking-based metrics, e.g., NDCG@k and Recall@k, they remain largely trained with proxy losses, e.g., the BPR loss. In this work, we explore the use of ranking loss functions to directly optimize the evaluation metrics, an area not extensively investigated in the GNN community for collaborative filtering.
We take advantage of smooth approximations of the rank to facilitate end-to-end training of GNNs and propose a Personalized PageRank-based negative sampling strategy tailored for ranking loss functions. Moreover, we extend the evaluation of GNN models for top-k recommendation tasks with an inductive user-centric protocol, providing a more accurate reflection of real-world applications.
Our proposed method significantly outperforms the standard BPR loss and more advanced losses across four datasets and four recent GNN architectures, while also exhibiting faster training, demonstrating the potential of ranking loss functions for improving GNN training in collaborative filtering tasks.
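
The abstract does not give the exact surrogate, but the core trick of making a ranking metric differentiable through a smooth rank approximation can be sketched as a generic soft-NDCG loss; this is an illustration, not necessarily the paper's loss:

import torch

def soft_ndcg_loss(scores, relevance, tau=1.0):
    # scores, relevance: 1-D float tensors for the candidate items of one user.
    # Smooth rank: rank_i ~= 1 + sum_{j != i} sigmoid((s_j - s_i) / tau); the
    # self term sigmoid(0) = 0.5 makes this 0.5 + the sum over all j.
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)            # diff[i, j] = s_j - s_i
    soft_rank = 0.5 + torch.sigmoid(diff / tau).sum(dim=1)
    gains = (2.0 ** relevance - 1.0) / torch.log2(soft_rank + 1.0)
    ideal_rel, _ = torch.sort(relevance, descending=True)
    positions = torch.arange(len(ideal_rel), dtype=torch.float32)
    ideal_dcg = ((2.0 ** ideal_rel - 1.0) / torch.log2(positions + 2.0)).sum()
    return 1.0 - gains.sum() / ideal_dcg                        # 1 - soft NDCG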

URL: https://openreview.net/forum?id=9B6LM2uoEs

---

Title: Fixed Budget Best Arm Identification in Unimodal Bandits

Abstract: We consider the best arm identification problem in a fixed budget stochastic multi-armed bandit setting, where the arm mean rewards exhibit a unimodal structure. We establish that the probability of misidentifying the optimal arm within a budget of $T$ is lower bounded as $\mathcal{O}\left(\exp\left\{-T/\bar{H}\right\}\right)$, where $\bar{H}$ depends on the sub-optimality gaps of arms in the neighborhood of the optimal arm. In contrast to the lower bound for the unstructured case, the error exponent in this bound does not depend on the number of arms $K$ and is smaller by a factor $K\log K$, which captures the gain achievable by exploiting the unimodal structure. We then develop an algorithm named Fixed Budget Best Arm Unimodal Bandits that exploits unimodality to achieve this gain. Specifically, we show that its error probability is upper bounded as $\mathcal{O}\left(\log_2 K\exp\left\{-T\Delta^2\right\}\right)$, where $\Delta$ is the gap between neighboring arms and $\bar{H}\leq 2\Delta^{-2}$. We demonstrate through extensive simulations that the proposed algorithm outperforms state-of-the-art algorithms. Moreover, it is parameter-free and simple to implement.
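
The abstract does not describe the algorithm's internals; as a rough illustration of how unimodality can be exploited under a fixed budget, here is a noisy ternary-search sketch that discards part of the candidate interval each round, which is illustrative only and not the paper's algorithm:

import numpy as np

def unimodal_fixed_budget(pull, K, T):
    # pull(k) returns one stochastic reward of arm k; arm means 0..K-1 are unimodal.
    rounds = max(1, int(np.ceil(np.log2(K))))
    per_arm = max(1, T // (2 * rounds))                  # budget split across ~log K rounds
    lo, hi = 0, K - 1
    for _ in range(rounds):
        if hi - lo < 2:
            break
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        mu1 = np.mean([pull(m1) for _ in range(per_arm)])
        mu2 = np.mean([pull(m2) for _ in range(per_arm)])
        if mu1 < mu2:
            lo = m1 + 1                                  # the mode lies strictly right of m1
        else:
            hi = m2 - 1
    return (lo + hi) // 2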

URL: https://openreview.net/forum?id=epcLNhkoEL

---

Title: Koopman Spectrum Nonlinear Regulators and Efficient Online Learning

Abstract: Most modern reinforcement learning algorithms optimize a cumulative single-step cost along a trajectory. The optimized motions are often ‘unnatural’, representing, for example, behaviors with sudden accelerations that waste energy and lack predictability. In this work, we present a novel paradigm of controlling nonlinear systems via the minimization of the Koopman spectrum cost: a cost over the Koopman operator of the controlled dynamics. This induces a broader class of dynamical behaviors that evolve over stable manifolds such as nonlinear oscillators, closed loops, and smooth movements. We demonstrate that some dynamics characterizations that are not possible with a cumulative cost are feasible in this paradigm, which generalizes the classical eigenstructure and pole assignments to nonlinear decision making. Moreover, we present a sample efficient online learning algorithm for our problem that enjoys a sub-linear regret bound under some structural assumptions.
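
For readers unfamiliar with the object being penalized, a finite-dimensional Koopman operator of the controlled dynamics can be estimated from trajectory snapshots by simple least squares (dynamic mode decomposition), and a spectrum cost is then a function of its eigenvalues; the cost shown in the trailing comment is only a hypothetical example, not the paper's choice:

import numpy as np

def dmd_koopman_spectrum(X, Y):
    # X, Y: (dim, n_snapshots) matrices of successive observations with Y ~ A X.
    # Least-squares estimate of a finite-dimensional Koopman/transition operator
    # and its eigenvalues (the "spectrum" a spectrum cost is defined over).
    A = Y @ np.linalg.pinv(X)
    return A, np.linalg.eigvals(A)

# Hypothetical spectrum cost penalizing unstable modes (|lambda| > 1):
# _, eigvals = dmd_koopman_spectrum(X, Y)
# cost = np.sum(np.maximum(np.abs(eigvals) - 1.0, 0.0))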

URL: https://openreview.net/forum?id=thfoUZugvS

---

Title: Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits

Abstract: The Indexed Minimum Empirical Divergence (IMED) algorithm is a highly effective approach that offers a stronger theoretical guarantee of asymptotic optimality than the Kullback--Leibler Upper Confidence Bound (KL-UCB) algorithm for the multi-armed bandit problem. Additionally, it has been observed to empirically outperform UCB-based algorithms and Thompson Sampling. Despite its effectiveness, the generalization of this algorithm to contextual bandits with linear payoffs has remained elusive. In this paper, we present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms. We demonstrate that LinIMED provides a $\widetilde{O}(d\sqrt{T})$ upper regret bound where $d$ is the dimension of the context and $T$ is the time horizon. Furthermore, extensive empirical studies reveal that LinIMED and its variants outperform widely-used linear bandit algorithms such as LinUCB and Linear Thompson Sampling in some regimes.
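
For context, the classical IMED index that LinIMED generalizes can be written down in a few lines for Gaussian rewards; the arm with the smallest index is pulled. This is the base multi-armed bandit rule, not the paper's linear variant:

import numpy as np

def imed_indices(counts, means, sigma2=1.0):
    # IMED index: N_a * KL(mu_a, mu_max) + log(N_a), with the Gaussian divergence
    # KL(mu_a, mu_max) = (mu_max - mu_a)^2 / (2 sigma^2). Assumes counts >= 1.
    mu_max = means.max()
    kl = (mu_max - means) ** 2 / (2.0 * sigma2)
    return counts * kl + np.log(counts)

# next_arm = int(np.argmin(imed_indices(counts, means)))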

URL: https://openreview.net/forum?id=wE9kpJSemv

---

Title: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, in which we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that this approach scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can reduce dependence on human-generated data.
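
The three-step loop can be written schematically as below; the callables are user-supplied placeholders (sampling, binary verification, fine-tuning) rather than any specific API:

def self_training(model, prompts, sample_from, verify, fine_tune,
                  n_iters=3, k_samples=8):
    # sample_from(model, prompt, k): draw k candidate solutions.
    # verify(prompt, sample): binary feedback, e.g. answer or test-case correctness.
    # fine_tune(model, data): return the model fine-tuned on (prompt, sample) pairs.
    for _ in range(n_iters):
        data = [(p, s)
                for p in prompts
                for s in sample_from(model, p, k_samples)
                if verify(p, s)]                  # keep only verified generations
        model = fine_tune(model, data)            # then repeat with the improved model
    return model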

URL: https://openreview.net/forum?id=lNAyUngGFK

---

Title: Path Development Network with Finite-dimensional Lie Group

Abstract: Signature, lying at the heart of rough path theory, is a central tool for analysing controlled differential equations driven by irregular paths. Recently it has also found extensive applications in machine learning and data science as a mathematically principled, universal feature that boosts the performance of deep learning-based models in sequential data tasks. It, nevertheless, suffers from the curse of dimensionality when paths are high-dimensional.

We propose a novel, trainable path development layer, which exploits representations of sequential data through finite-dimensional Lie groups, thus resulting in dimension reduction. Its backpropagation algorithm is designed via optimization on manifolds. Our proposed layer, analogous to recurrent neural networks (RNN), possesses an explicit, simple recurrent unit that alleviates the gradient issues.

Our layer demonstrates its strength in irregular time series modelling. Empirical results on a range of datasets show that the development layer consistently and significantly outperforms signature features on accuracy and dimensionality. The compact hybrid model (a one-layer LSTM stacked with the development layer) achieves state-of-the-art performance against various RNN and continuous-time series models. Our layer also enhances the performance of modelling dynamics constrained to Lie groups.
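
As a rough picture of what such a layer computes, here is a minimal, non-trainable development of a path into SO(n): each increment is mapped linearly into the Lie algebra of skew-symmetric matrices and exponentiated onto the group. The trainable layer in the paper is more general; this is only a sketch:

import numpy as np
from scipy.linalg import expm

def develop_path_so_n(path, weights):
    # path: (T, d) sequence; weights: (d, n, n) linear map into the Lie algebra.
    # Recurrence: z_{t+1} = z_t @ expm(skew(M(dx_t))), with z_0 = identity.
    n = weights.shape[-1]
    z = np.eye(n)
    for dx in np.diff(path, axis=0):
        M = np.einsum('d,dij->ij', dx, weights)
        M = 0.5 * (M - M.T)                      # project onto skew-symmetric matrices
        z = z @ expm(M)
    return z                                     # an element of SO(n) summarizing the path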

URL: https://openreview.net/forum?id=YoWBLu74TL

---

Title: From Stability to Chaos: Analyzing Gradient Descent Dynamics in Quadratic Regression

Abstract: We conduct a comprehensive investigation into the dynamics of gradient descent using large-order constant step-sizes in the context of quadratic regression models. Within this framework, we reveal that the dynamics can be encapsulated by a specific cubic map, naturally parameterized by the step-size. Through a fine-grained bifurcation analysis concerning the step-size parameter, we delineate five distinct training phases: (1) monotonic, (2) catapult, (3) periodic, (4) chaotic, and (5) divergent, precisely demarcating the boundaries of each phase. As illustrations, we provide examples involving phase retrieval and two-layer neural networks employing quadratic activation functions and constant outer-layers, utilizing orthogonal training data. Our simulations indicate that these five phases also manifest with generic non-orthogonal data. We also empirically investigate the generalization performance when training in the various non-monotonic (and non-divergent) phases. In particular, we observe that performing an ergodic trajectory averaging stabilizes the test error in non-monotonic (and non-divergent) phases.
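
The phase structure is easy to probe numerically: run gradient descent on a toy phase-retrieval objective and sweep the step size, watching whether the loss decreases monotonically, spikes and recovers, oscillates, or diverges. This is a rough illustration of the setting, not the paper's exact experiments:

import numpy as np

def gd_loss_trajectory(step_size, n=50, d=10, iters=200, seed=0):
    # Toy phase retrieval: L(w) = (1/4n) * sum_i ((x_i . w)^2 - y_i)^2.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = (X @ rng.standard_normal(d)) ** 2
    w = rng.standard_normal(d)
    losses = []
    for _ in range(iters):
        pred = X @ w
        r = pred ** 2 - y
        grad = (X * (r * pred)[:, None]).mean(axis=0)   # (1/n) sum_i r_i (x_i . w) x_i
        w = w - step_size * grad
        losses.append(0.25 * np.mean(r ** 2))
    return losses

# for eta in [0.01, 0.1, 0.5, 1.0, 2.0]:
#     print(eta, gd_loss_trajectory(eta)[-1])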

URL: https://openreview.net/forum?id=Wiklo5VpG7

---

Title: Holistic Molecular Representation Learning via Multi-view Fragmentation

Abstract: Learning chemically meaningful representations from unlabeled molecules plays a vital role in AI-based drug design and discovery. In response to this, several self-supervised learning methods have been developed, focusing either on global (e.g., graph-level) or local (e.g., motif-level) information of molecular graphs. However, it is still unclear which approach is more effective for learning better molecular representations. In this paper, we propose a novel holistic self-supervised molecular representation learning framework that effectively learns both global and local molecular information. Our key idea is to utilize fragmentation, which decomposes a molecule into a set of chemically meaningful fragments (e.g., functional groups), to associate a global graph structure with a set of local substructures, thereby preserving chemical properties, and to learn both types of information via contrastive learning between them. Additionally, we consider the 3D geometry of molecules as another view for contrastive learning. We demonstrate that our framework outperforms prior molecular representation learning methods across various molecular property prediction tasks.
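
The cross-view contrastive objective between a molecule's global embedding and an aggregate of its fragment embeddings can be sketched with a standard InfoNCE loss; this is a generic sketch, as the paper's objective and fragmentation scheme carry additional structure:

import torch
import torch.nn.functional as F

def cross_view_info_nce(global_emb, fragment_emb, temperature=0.2):
    # global_emb, fragment_emb: (batch, dim) embeddings of the same molecules seen
    # through two views; matching rows are positives, all other rows negatives.
    g = F.normalize(global_emb, dim=-1)
    f = F.normalize(fragment_emb, dim=-1)
    logits = g @ f.T / temperature
    targets = torch.arange(g.size(0), device=g.device)
    return F.cross_entropy(logits, targets)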

URL: https://openreview.net/forum?id=ufDh55J1ML

---

Title: Variational excess risk bound for general state space models

Abstract: In this paper, we consider variational autoencoders (VAE) for general state space models. We consider a backward factorization of the variational distributions to analyze the excess risk associated with VAE. Such backward factorizations were recently proposed to perform online variational learning and to obtain upper bounds on the variational estimation error. When independent trajectories of sequences are observed and under strong mixing assumptions on the state space model and on the variational distribution, we provide an oracle inequality explicit in the number of samples and in the length of the observation sequences. We then derive consequences of this theoretical result. In particular, when the data distribution is given by a state space model, we provide an upper bound for the Kullback-Leibler divergence between the data distribution and its estimator and between the variational posterior and the estimated state space posterior distributions. Under classical assumptions, we prove that our results can be applied to Gaussian backward kernels built with dense and recurrent neural networks.

URL: https://openreview.net/forum?id=36OX7uRM5t

---

Title: Graph Harmony: Denoising and Nuclear-Norm Wasserstein Adaptation for Enhanced Domain Transfer in Graph-Structured Data

Abstract: Graph-structured data can be found in numerous domains, yet the scarcity of labeled instances hinders the effective use of deep learning in many scenarios. Traditional unsupervised domain adaptation (UDA) strategies for graphs primarily hinge on adversarial learning and pseudo-labeling. These approaches fail to effectively leverage graph discriminative features, leading to class mismatching and unreliable label quality. To address these obstacles, we develop the Denoising and Nuclear-Norm Wasserstein Adaptation Network (DNAN). DNAN employs the Nuclear-norm Wasserstein discrepancy (NWD), which can simultaneously achieve domain alignment and class discrimination. It also integrates a denoising mechanism via a Variational Graph Autoencoder. This denoising mechanism helps capture essential features of both source and target domains, improving the robustness of the domain adaptation process. Our comprehensive experiments demonstrate that DNAN outperforms state-of-the-art methods on standard UDA benchmarks for graph classification.
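
One ingredient the method relies on, the nuclear norm of the batch prediction matrix, is simple to compute; making it large encourages predictions that are simultaneously confident and class-diverse. The full NWD objective in the paper combines this with a Wasserstein-style critic, so this is only a building-block sketch:

import torch

def batch_nuclear_norm(logits):
    # logits: (batch, num_classes). The nuclear norm (sum of singular values) of
    # the softmax prediction matrix is large when predictions are confident and
    # spread across classes, which helps keep classes distinguishable.
    probs = torch.softmax(logits, dim=1)
    return torch.linalg.svdvals(probs).sum()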

URL: https://openreview.net/forum?id=8HmMzL2rm3

---

Title: Training Graph Neural Networks Subject to a Tight Lipschitz Constraint

Abstract: We propose a strategy for training a wide range of graph neural networks (GNNs) under tight Lipschitz bound constraints.
Specifically, by leveraging graph spectral theory, we derive computationally tractable expressions of their Lipschitz constant. This allows us to propose a constrained-optimization approach to control the constant, ensuring robustness to adversarial perturbations. Unlike the existing methods for controlling the Lipschitz constant, our approach reduces the size of the handled matrices by a factor equal to the square of the number of nodes in the graph. We employ a stochastic projected subgradient algorithm, which operates in a block-coordinate manner, with the projection step performed via an accelerated iterative proximal algorithm.
We focus on defending against attacks that perturb features while keeping the topology of the graph constant. This contrasts with most of the existing defenses, which tackle perturbations of the graph structure. We report experiments on various datasets in the context of node classification tasks, showing the effectiveness of our constrained GNN model.
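
The paper's constraint set and accelerated proximal projection are specific to GNNs, but the basic idea of projecting weights so that a Lipschitz-related quantity stays below a prescribed bound can be illustrated with a plain spectral-norm projection; this is a generic sketch only:

import torch

def project_to_spectral_ball(W, max_norm=1.0):
    # Rescale W so that its spectral norm does not exceed max_norm; bounding
    # per-layer spectral norms is the simplest way to cap a network's Lipschitz
    # constant, though the paper's graph-dependent constraint is tighter.
    sigma_max = torch.linalg.matrix_norm(W, ord=2)
    return W * torch.clamp(max_norm / (sigma_max + 1e-12), max=1.0)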

URL: https://openreview.net/forum?id=KLojVqdj2y

---

Title: Locally Optimal Fixed-Budget Best Arm Identification in Two-Armed Gaussian Bandits with Unknown Variances

Abstract: We address the problem of best arm identification (BAI) with a fixed budget for two-armed Gaussian bandits. In BAI, given multiple arms, we aim to find the best arm, an arm with the highest expected reward, through an adaptive experiment. Kaufmann et al. (2016) develop a lower bound for the probability of misidentifying the best arm. They also propose a strategy, assuming that the variances of rewards are known, and show that it is asymptotically optimal in the sense that its probability of misidentification matches the lower bound as the budget approaches infinity. However, an asymptotically optimal strategy is unknown when the variances are unknown. For this open issue, we propose a strategy that estimates variances during an adaptive experiment and draws arms with a ratio of the estimated standard deviations. We refer to this strategy as the Neyman Allocation (NA)-Augmented Inverse Probability Weighting (AIPW) strategy. We then demonstrate that this strategy is asymptotically optimal by showing that its probability of misidentification matches the lower bound when the budget approaches infinity and the gap between the expected rewards of the two arms approaches zero (the small-gap regime). Our results suggest that under the worst-case scenario characterized by the small-gap regime, our strategy, which employs estimated variances, is asymptotically optimal even when the variances are unknown.
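
The allocation part of the strategy, drawing each arm with probability proportional to its estimated standard deviation, can be sketched as follows; the AIPW estimator used for the final recommendation is omitted, so this only illustrates the Neyman-allocation idea:

import numpy as np

def neyman_allocation_two_arms(pull, T, warmup=20, rng=None):
    # pull(a) returns one stochastic reward of arm a in {0, 1}.
    rng = rng or np.random.default_rng()
    rewards = {0: [], 1: []}
    for t in range(T):
        if t < 2 * warmup:
            a = t % 2                                    # uniform warm-up phase
        else:
            s0 = np.std(rewards[0]) + 1e-6
            s1 = np.std(rewards[1]) + 1e-6
            a = int(rng.random() < s1 / (s0 + s1))       # P(a = 1) proportional to sigma_1
        rewards[a].append(pull(a))
    return int(np.mean(rewards[1]) > np.mean(rewards[0]))   # naive recommendation rule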

URL: https://openreview.net/forum?id=5FE4z0yJXb

---

Title: Investigating the Sim2real Gap in Computer Vision for Robotics

Abstract: A major challenge in designing machine learning systems for the real world is the sim2real gap, i.e. the change in performance when the system is transferred from simulation to the physical environment. Although many algorithms have been proposed to reduce this gap, it is not well understood. In this paper, we perform an empirical study of the sim2real gap for popular models in three standard computer vision tasks, monocular depth estimation, object detection, and image inpainting, in a robotic manipulation environment. We find that the lighting conditions significantly affect the gap for monocular depth estimation while object properties affect the gap for object detection and image inpainting, and these qualitative observations remain stable with different models and renderers.

URL: https://openreview.net/forum?id=itrMf8mjwo

---

Title: VisionAD, a software package of performant anomaly detection algorithms, and Proportion Localised, an interpretable metric

Abstract: We release VisionAD, an anomaly detection library in the domain of images. The library forms the largest and most performant collection of such algorithms to date. Each algorithm is written through a standardised API. The library has a focus on fair benchmarking intended to mitigate the issue of cherry-picked results. It is designed to enable rapid experimentation, and the integration of new algorithms is straightforward. In addition, we propose a new metric, Proportion Localised (PL). It reports the proportion of anomalies that are sufficiently localised by discretely classifying each anomaly as localised or not. The metric is more intuitive because it has a direct physical interpretation, which is attractive to industry practitioners. We also release a thorough benchmark of the top 15 available algorithms on the MVTec dataset. We call this the VisionAD leaderboard. We are committed to hosting an updated version of this leaderboard online, and encourage researchers to add, tweak and improve algorithms to climb this leaderboard.
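
The abstract does not give the exact localisation criterion, but a Proportion Localised style metric can be sketched by counting a ground-truth anomaly region as localised when its overlap with the predicted anomaly mask exceeds a threshold; the IoU criterion and threshold below are assumptions, not necessarily what VisionAD uses:

import numpy as np

def proportion_localised(gt_regions, pred_mask, iou_threshold=0.3):
    # gt_regions: list of boolean masks, one per ground-truth anomaly region.
    # pred_mask: boolean mask of pixels predicted anomalous.
    # An anomaly counts as localised if its IoU with the prediction exceeds the
    # threshold; PL is the fraction of anomalies that are localised.
    localised = 0
    for region in gt_regions:
        inter = np.logical_and(region, pred_mask).sum()
        union = np.logical_or(region, pred_mask).sum()
        if union > 0 and inter / union >= iou_threshold:
            localised += 1
    return localised / max(len(gt_regions), 1)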

URL: https://openreview.net/forum?id=o5kYH7bNe3

---

Title: Dynamic Structure Estimation from Bandit Feedback using Nonvanishing Exponential Sums

Abstract: This work tackles dynamic structure estimation problems for periodically behaved discrete dynamical systems in Euclidean space. We assume the observations become sequentially available in the form of bandit feedback contaminated by sub-Gaussian noise. Under such fairly general assumptions on the noise distribution, we carefully identify a set of recoverable information about periodic structures. Our main results are computationally and sample efficient algorithms that exploit asymptotic behaviors of exponential sums to effectively average out the noise while preventing the information to be estimated from vanishing. In particular, the novel use of the Weyl sum, a variant of exponential sums, allows us to extract spectrum information for linear systems. We provide sample complexity bounds for our algorithms, and we experimentally validate our theoretical claims on simulations of toy examples, including Cellular Automata.

URL: https://openreview.net/forum?id=xNkASJL0F6

---

Title: Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark

Abstract: The recent Long-Range Graph Benchmark (LRGB, Dwivedi et al. 2022) introduced a set of graph learning tasks strongly dependent on long-range interaction between vertices. Empirical evidence suggests that on these tasks Graph Transformers significantly outperform Message Passing GNNs (MPGNNs). In this paper, we carefully reevaluate multiple MPGNN baselines as well as the Graph Transformer GPS (Rampášek et al. 2022) on LRGB. Through a rigorous empirical analysis, we demonstrate that the reported performance gap is overestimated due to suboptimal hyperparameter choices. It is noteworthy that across multiple datasets the performance gap completely vanishes after basic hyperparameter optimization. In addition, we discuss the impact of lacking feature normalization for LRGB's vision datasets and highlight a spurious implementation of LRGB's link prediction metric. The principal aim of our paper is to establish a higher standard of empirical rigor within the graph machine learning community.

URL: https://openreview.net/forum?id=Nm0WX86sKv

---

Title: A Note on the Convergence of Denoising Diffusion Probabilistic Models

Abstract: Diffusion models are one of the most important families of deep generative models. In this note, we derive a quantitative upper bound on the Wasserstein distance between the data-generating distribution and the distribution learned by a diffusion model. Unlike previous works in this field, our result does not make assumptions on the learned score function. Moreover, our bound holds for arbitrary data-generating distributions on bounded instance spaces, even those without a density w.r.t. the Lebesgue measure, and the upper bound does not suffer from exponential dependencies. Our main result builds upon the recent work of Mbacke et al. (2023) and our proofs are elementary.

URL: https://openreview.net/forum?id=wLe1bG93yc

---

Title: CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Abstract: Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.

URL: https://openreview.net/forum?id=Yh8Y7a4myU

---
