Weekly TMLR digest for Jun 02, 2024


TMLR

Jun 2, 2024, 12:00:15 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Reproducibility Certification: 'Explaining RL Decisions with Trajectories': A Reproducibility Study

Karim Ahmed Abdel Sadek, Matteo Nulli, Joan Velja, Jort Vincenti

https://openreview.net/forum?id=QdeBbK5CSh

---


Featured Certification: LeanVec: Searching vectors faster by making them fit

Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, Mark Hildebrand, Theodore L. Willke

https://openreview.net/forum?id=wczqrpOrIc

---


Reproducibility Certification: Reproducibility Study on Adversarial Attacks Against Robust Transformer Trackers

Fatemeh Nourilenjan Nokabadi, Jean-Francois Lalonde, Christian Gagné

https://openreview.net/forum?id=FEEKR0Vl9s

---


Featured Certification: Soft Merging of Experts with Adaptive Routing

Mohammed Muqeeth, Haokun Liu, Colin Raffel

https://openreview.net/forum?id=7I199lc54z

---


Reproducibility Certification: Transfer Learning with Informative Priors: Simple Baselines Better than Previously Reported

Ethan Harvey, Mikhail Petrov, Michael C Hughes

https://openreview.net/forum?id=BbvSU02jLg

---


Survey Certification: Efficient Large Language Models: A Survey

Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang

https://openreview.net/forum?id=bsCCJHbO8A

---


Featured Certification: Deep Generalized Prediction Set Classifier and Its Theoretical Guarantees

Zhou Wang, Xingye Qiao

https://openreview.net/forum?id=H7gLN5nqVF

---


Featured Certification: As large as it gets – Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters

Julia Grabinski, Janis Keuper, Margret Keuper

https://openreview.net/forum?id=xRy1YRcHWj

---


Reproducibility Certification: Reproducibility Study of "Learning Perturbations to Explain Time Series Predictions"

Jiapeng Fan, Luke Cadigan, Paulius Skaisgiris, Sebastian Uriel Arias

https://openreview.net/forum?id=fCNqD2IuoD

---


Survey Certification: A Short Survey on Importance Weighting for Machine Learning

Masanari Kimura, Hideitsu Hino

https://openreview.net/forum?id=IhXM3g2gxg

---


Reproducibility Certification: Efficient Parallelized Simulation of Cyber-Physical Systems

Bas van der Heijden, Laura Ferranti, Jens Kober, Robert Babuska

https://openreview.net/forum?id=VzKXbCzNoU

---


Accepted papers
===============


Title: Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo

Authors: Ziyi Wang, Yujie Chen, Qifan Song, Ruqi Zhang

Abstract: Low-precision training has emerged as a promising low-cost technique to enhance the training efficiency of deep neural networks without sacrificing much accuracy. Its Bayesian counterpart can further provide uncertainty quantification and improved generalization accuracy.
This paper investigates low-precision sampling via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with low-precision and full-precision gradient accumulators for both strongly log-concave and non-log-concave distributions. Theoretically, our results show that to achieve $\epsilon$-error in the 2-Wasserstein distance for non-log-concave distributions, low-precision SGHMC achieves quadratic improvement ($\tilde{\mathcal{O}}\left({\epsilon^{-2}{\mu^*}^{-2}\log^2\left({\epsilon^{-1}}\right)}\right)$) compared to the state-of-the-art low-precision sampler, Stochastic Gradient Langevin Dynamics (SGLD) ($\tilde{\mathcal{O}}\left({{\epsilon}^{-4}{\lambda^{*}}^{-1}\log^5\left({\epsilon^{-1}}\right)}\right)$). Moreover, we prove that low-precision SGHMC is more robust to the quantization error compared to low-precision SGLD due to the robustness of the momentum-based update w.r.t. gradient noise.
Empirically, we conduct experiments on synthetic data, and MNIST, CIFAR-10 \& CIFAR-100 datasets, which validate our theoretical findings. Our study highlights the potential of low-precision SGHMC as an efficient and accurate sampling method for large-scale and resource-limited machine learning.
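For readers who want to see the shape of the update, here is a minimal SGHMC sketch with a low-precision parameter accumulator. The quantizer, step sizes, and the `grad_fn` interface are illustrative placeholders rather than the authors' code; the point is that quantization and gradient noise enter through the friction-damped momentum rather than hitting the parameters directly.

```python
import numpy as np

def quantize(x, num_bits=8, scale=4.0):
    # Hypothetical fixed-point quantizer: clip to [-scale, scale] and round
    # to the nearest representable level (stochastic rounding is also common).
    levels = 2 ** (num_bits - 1) - 1
    step = scale / levels
    return np.clip(np.round(x / step), -levels, levels) * step

def sghmc_low_precision(grad_fn, theta0, n_steps=1000, eta=1e-3, alpha=0.1,
                        temperature=1.0, num_bits=8, rng=None):
    """Minimal SGHMC sketch with low-precision parameter storage.

    grad_fn(theta) returns a (possibly stochastic) gradient of the negative
    log-density. Hyperparameters are placeholders, not the paper's settings.
    """
    rng = rng or np.random.default_rng(0)
    theta = quantize(np.asarray(theta0, dtype=np.float64), num_bits)
    v = np.zeros_like(theta)  # momentum
    for _ in range(n_steps):
        noise = rng.normal(size=theta.shape) * np.sqrt(2 * alpha * eta * temperature)
        # Momentum-based update: gradient/quantization noise is damped by the
        # friction term alpha instead of entering the parameters directly.
        v = (1.0 - alpha) * v - eta * grad_fn(theta) + noise
        theta = quantize(theta + v, num_bits)  # store parameters in low precision
        yield theta
```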

URL: https://openreview.net/forum?id=uSLNzzuiDJ

---

Title: Improved Convergence of Score-Based Diffusion Models via Prediction-Correction

Authors: Francesco Pedrotti, Jan Maas, Marco Mondelli

Abstract: Score-based generative models (SGMs) are powerful tools to sample from complex data distributions. Their underlying idea is to \emph{(i)} run a forward process for time $T_1$ by adding noise to the data, \emph{(ii)} estimate its score function, and \emph{(iii)} use such estimate to run a reverse process. As the reverse process is initialized with the stationary distribution of the forward one, the existing analysis paradigm requires $T_1\to\infty$. This is however problematic: from a theoretical viewpoint, for a given precision of the score approximation, the convergence guarantee fails as $T_1$ diverges; from a practical viewpoint, a large $T_1$ increases computational costs and leads to error propagation.
This paper addresses the issue by considering a version of the popular \emph{predictor-corrector} scheme: after running the forward process, we first estimate the final distribution via an inexact Langevin dynamics and then revert the process. Our key technical contribution is to provide convergence guarantees which require to run the forward process \emph{only for a fixed finite time} $T_1$.
Our bounds exhibit a mild logarithmic dependence on the input dimension and the subgaussian norm of the target distribution, have minimal assumptions on the data, and require only to control the $L^2$ loss on the score approximation, which is the quantity minimized in practice.
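To make the corrector idea concrete, below is a generic sketch of an inexact Langevin corrector driven by a learned score, run after the forward process of fixed length $T_1$ and before the reverse process. The step size, number of steps, and `score_fn` are assumptions for illustration, not the paper's exact scheme or constants.

```python
import numpy as np

def langevin_corrector(x, score_fn, step=1e-3, n_steps=200, rng=None):
    """Unadjusted Langevin dynamics using the learned score of the forward
    marginal at the fixed time T1; it moves the reverse-process initialization
    closer to that marginal without letting T1 grow. Illustrative sketch only.
    """
    rng = rng or np.random.default_rng(0)
    for _ in range(n_steps):
        x = x + step * score_fn(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x
```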

URL: https://openreview.net/forum?id=0zKvH7YiAq

---

Title: 'Explaining RL Decisions with Trajectories': A Reproducibility Study

Authors: Karim Ahmed Abdel Sadek, Matteo Nulli, Joan Velja, Jort Vincenti

Abstract: This work investigates the reproducibility of the paper "Explaining RL decisions with trajectories" by Deshmukh et al. (2023). The original paper introduces a novel approach in explainable reinforcement learning based on attributing the decisions of an agent to specific clusters of trajectories encountered during training. We verify the main claims from the paper, which state that (i) training on fewer trajectories induces a lower initial state value, (ii) trajectories in a cluster present similar high-level patterns, (iii) distant trajectories influence the decision of an agent, and (iv) humans correctly identify the trajectories attributed to the decision of the agent. We recover the environments used by the authors based on the partial original code they provided for one of the environments (Grid-World), and implemented the remaining ones from scratch (Seaquest and HalfCheetah, Breakout, Q*Bert).
While we confirm that (i), (ii), and (iii) partially hold, we extend the largely qualitative experiments from the authors by introducing a quantitative metric to further support (iii), and new experiments and visual results for (i). Moreover, we investigate the use of different clustering algorithms and encoder architectures to further support (ii). We could not support (iv), given the limited extent of the original experiments. We conclude that, while some of the claims can be supported, further investigations and experiments could be of interest. We recognize the novelty of the work from the authors and hope that our work paves the way for clearer and more transparent approaches.

URL: https://openreview.net/forum?id=QdeBbK5CSh

---

Title: Online Tensor Max-Norm Regularization via Stochastic Optimization

Authors: Tong Wu

Abstract: The advent of ubiquitous multidimensional arrays poses unique challenges for low-rank modeling of tensor data due to higher-order relationships, gross noise, and large dimensions of the tensor. In this paper, we consider online low-rank estimation of tensor data where the multidimensional data are revealed sequentially. Induced by the recently proposed tensor-tensor product (t-product), we rigorously deduce the tensor max-norm and reformulate it as an equivalent tensor factorization, where the factors consist of a tensor basis component and a coefficient component. With this formulation, we develop an online max-norm regularized tensor decomposition (OMRTD) method by alternately optimizing over the basis component and the coefficient tensor. The algorithm is scalable to the large-scale setting and the sequence of the solutions produced by OMRTD converges to a stationary point of the expected loss function asymptotically. Further, we extend OMRTD for tensor completion. Numerical experiments demonstrate encouraging results for the effectiveness and robustness of our algorithm. The code is available at https://github.com/twugithub/2024-TMLR-OMRTD.

URL: https://openreview.net/forum?id=1iDpP3GWmS

---

Title: A Study of the Effects of Transfer Learning on Adversarial Robustness

Authors: Pratik Vaishnavi, Kevin Eykholt, Amir Rahmati

Abstract: The security and robustness of AI systems are paramount in real-world applications. Previous research has focused on developing methods to train robust networks, assuming the availability of sufficient labeled training data. However, in deployment scenarios with limited training data, existing techniques for training robust networks become impractical. In such low-data scenarios, non-robust training methods often resort to transfer learning. This involves pre-training a network on a large, possibly labeled dataset and fine-tuning it for a new task with a limited set of training samples. The efficacy of transfer learning in enhancing adversarial robustness is not comprehensively explored. Specifically, it remains uncertain whether transfer learning can improve adversarial performance in low-data scenarios. Furthermore, the potential benefits of transfer learning for certified robustness are unexplored. In this paper, we conduct an extensive analysis of the impact of transfer learning on both empirical and certified adversarial robustness. Employing supervised and self-supervised pre-training methods and fine-tuning across 12 downstream tasks representing diverse data availability scenarios, we identify the conditions conducive to training adversarially robust models through transfer learning. Our study reveals that the effectiveness of transfer learning in improving adversarial robustness is attributed to an increase in standard accuracy and not the direct ``transfer'' of robustness from the source to the target task, contrary to previous beliefs. Our findings provide valuable insights for practitioners aiming to deploy robust ML models in their applications.

URL: https://openreview.net/forum?id=T6RygOFZ6B

---

Title: End-to-End Training Induces Information Bottleneck through Layer-Role Differentiation: A Comparative Analysis with Layer-wise Training

Authors: Keitaro Sakamoto, Issei Sato

Abstract: End-to-end (E2E) training, optimizing the entire model through error backpropagation, fundamentally supports the advancements of deep learning. Despite its high performance, E2E training faces the problems of memory consumption, parallel computing, and discrepancy with the functionalities of the actual brain. Various alternative methods have been proposed to overcome these difficulties; however, none of them can yet match the performance of E2E training, and they thereby fall short in practicality. Furthermore, there is no deep understanding regarding differences in the trained model properties beyond the performance gap.
In this paper, we reconsider why E2E training demonstrates a superior performance through a comparison with layer-wise training, which shares fundamental learning principles and architectures with E2E training, with the granularity of loss evaluation being the only difference. On the basis of the observation that E2E training has an advantage in propagating input information, we analyze the information plane dynamics of intermediate representations based on the Hilbert-Schmidt independence criterion (HSIC). The results of our normalized HSIC value analysis reveal E2E training's ability to exhibit different information dynamics across layers, in addition to efficient information propagation. Furthermore, we show that this layer-role differentiation leads to the final representation following the information bottleneck principle. Our work not only highlights the advantages of E2E training in terms of information propagation and the information bottleneck but also suggests the need to consider the cooperative interactions between layers, not just the final layer, when analyzing the information bottleneck of deep learning.
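As a concrete reference point, the biased empirical HSIC estimator with Gaussian kernels can be computed in a few lines; a CKA-style normalization is shown as one common choice, though the paper's exact normalization may differ.

```python
import numpy as np

def gaussian_gram(x, sigma=None):
    # Pairwise Gaussian kernel matrix for rows of x; sigma defaults to the
    # median heuristic on pairwise squared distances.
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = sq + sq.T - 2.0 * x @ x.T
    if sigma is None:
        sigma = np.sqrt(0.5 * np.median(d2[d2 > 0]))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y):
    """Biased empirical HSIC estimator (Gretton et al.) for samples x, y of
    shape (n, d_x) and (n, d_y)."""
    n = x.shape[0]
    k, l = gaussian_gram(x), gaussian_gram(y)
    h = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2

def normalized_hsic(x, y):
    # CKA-style normalization; assumed here as one reasonable variant.
    return hsic(x, y) / np.sqrt(hsic(x, x) * hsic(y, y) + 1e-12)
```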

URL: https://openreview.net/forum?id=O3wmRh2SfT

---

Title: Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark

Authors: Jan Tönshoff, Martin Ritzert, Eran Rosenbluth, Martin Grohe

Abstract: The recent Long-Range Graph Benchmark (LRGB, Dwivedi et al. 2022) introduced a set of graph learning tasks strongly dependent on long-range interaction between vertices. Empirical evidence suggests that on these tasks Graph Transformers significantly outperform Message Passing GNNs (MPGNNs). In this paper, we carefully reevaluate multiple MPGNN baselines as well as the Graph Transformer GPS (Rampášek et al. 2022) on LRGB. Through a rigorous empirical analysis, we demonstrate that the reported performance gap is overestimated due to suboptimal hyperparameter choices. It is noteworthy that across multiple datasets the performance gap completely vanishes after basic hyperparameter optimization. In addition, we discuss the impact of lacking feature normalization for LRGB's vision datasets and highlight a spurious implementation of LRGB's link prediction metric. The principal aim of our paper is to establish a higher standard of empirical rigor within the graph machine learning community.

URL: https://openreview.net/forum?id=Nm0WX86sKv

---

Title: What Does Softmax Probability Tell Us about Classifiers Ranking Across Diverse Test Conditions?

Authors: Weijie Tu, Weijian Deng, Liang Zheng, Tom Gedeon

Abstract: This work aims to develop a measure that can accurately rank the performance of various classifiers when they are tested on unlabeled data from out-of-distribution (OOD) distributions. We commence by demonstrating that conventional uncertainty metrics, notably the maximum Softmax prediction probability, possess inherent utility in forecasting model generalization across certain OOD contexts. Building on this insight, we introduce a new measure called Softmax Correlation (SoftmaxCorr). It calculates the cosine similarity between a class-class correlation matrix, constructed from Softmax output vectors across an unlabeled test dataset, and a predefined reference matrix that embodies ideal class correlations. A high resemblance of predictions to the reference matrix signals that the model delivers confident and uniform predictions across all categories, reflecting minimal uncertainty and confusion. Through rigorous evaluation across a suite of datasets, including ImageNet, CIFAR-10, and WILDS, we affirm the predictive validity of SoftmaxCorr in accurately forecasting model performance within both in-distribution (ID) and OOD settings. Furthermore, we discuss the limitations of our proposed measure and suggest avenues for future research.
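The measure can be prototyped in a few lines. The construction below is only a guess at the abstract's description: the class-class correlation matrix is taken as the normalized Gram matrix of the softmax outputs, and the identity is used as a placeholder reference matrix; the paper's exact reference matrix and normalization may differ.

```python
import numpy as np

def softmax_corr(probs, ref=None):
    """SoftmaxCorr-style score from softmax outputs on an unlabeled test set.

    probs: (N, C) softmax probabilities.
    ref:   (C, C) reference matrix of "ideal" class correlations; the identity
           is assumed here as a placeholder.
    Returns the cosine similarity between the empirical class-class
    correlation matrix and the reference.
    """
    n, c = probs.shape
    corr = probs.T @ probs / n               # empirical class-class correlation
    ref = np.eye(c) if ref is None else ref
    a, b = corr.ravel(), ref.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```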

URL: https://openreview.net/forum?id=vtiDUgGjyx

---

Title: Certified Deductive Reasoning with Language Models

Authors: Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah Goodman

Abstract: Language models often achieve higher accuracy when reasoning step-by-step in complex tasks. However, even when arriving at a correct final answer, their rationales are often logically unsound or inconsistent. This is a major issue when reliable reasoning traces are needed,
such as when fine-tuning on model-generated reasoning for self-improvement. To tackle these issues, we introduce a class of tools for language models called guides, that use state and incremental constraints to guide generation. A guide can be invoked by the model to constrain its own generation to a set of valid statements given by the tool. In turn, the model’s choices can change the guide’s state. We show how a general system for logical reasoning can be used as a guide, which we call LogicGuide. Given a reasoning problem in natural language, a model can formalize its assumptions for LogicGuide and guarantee that its step-by-step reasoning is sound. In experiments on PrOntoQA, ProofWriter and Syllogism Validity datasets, LogicGuide significantly improves the performance of GPT-3, GPT-3.5 Turbo and LLaMA (accuracy gains up to 35%), while drastically reducing content effects — the interference between unwanted prior assumptions and reasoning, which humans and language models suffer from. We then explore bootstrapping GPT-3.5 Turbo and LLaMA using their own reasoning traces. We find that LogicGuide is critical: by training only on certified self-generated reasoning, models can self-improve, avoiding learning from their own hallucinations. Moreover, bootstrapped models enjoy significant boosts on ReClor, a challenging real-world reasoning dataset, even when not relying on formalization at inference time.

URL: https://openreview.net/forum?id=yXnwrs2Tl6

---

Title: On the Inherent Privacy Properties of Discrete Denoising Diffusion Models

Authors: Rongzhe Wei, Eleonora Kreacic, Haoyu Peter Wang, Haoteng Yin, Eli Chien, Vamsi K. Potluru, Pan Li

Abstract: Privacy concerns have led to a surge in the creation of synthetic datasets, with diffusion models emerging as a promising avenue. Although prior studies have performed empirical evaluations on these models, there has been a gap in providing a mathematical characterization of their privacy-preserving capabilities. To address this, we present the pioneering theoretical exploration of the privacy preservation inherent in \emph{discrete diffusion models} (DDMs) for discrete dataset generation. Focusing on per-instance differential privacy (pDP), our framework elucidates the potential privacy leakage for each data point in a given training dataset, offering insights into how the privacy loss of each point correlates with the dataset's distribution. Our bounds also show that training with $s$-sized data points leads to a surge in privacy leakage from $(\epsilon, \mathcal{O}(\frac{1}{s^2\epsilon}))$-pDP to $(\epsilon, \mathcal{O}(\frac{1}{s\epsilon}))$-pDP of the DDM during the transition from the pure noise to the synthetic clean data phase, and a faster decay in diffusion coefficients amplifies the privacy guarantee. Finally, we empirically verify our theoretical findings on both synthetic and real-world datasets.

URL: https://openreview.net/forum?id=UuU6C6CUoF

---

Title: Online Continual Learning via Logit Adjusted Softmax

Authors: Zhehao Huang, Tao Li, Chenhe Yuan, Yingwen Wu, Xiaolin Huang

Abstract: Online continual learning is a challenging problem where models must learn from a non-stationary data stream while avoiding catastrophic forgetting. Inter-class imbalance during training has been identified as a major cause of forgetting, leading to model prediction bias towards recently learned classes. In this paper, we show theoretically that inter-class imbalance can be entirely attributed to imbalanced class priors, and that the function learned from the intra-class intrinsic distributions is the optimal classifier minimizing the class-balanced error. Building on this analysis, we show that a simple adjustment of model logits during training can effectively resist prior class bias and pursue the corresponding optimum. Our proposed method, Logit Adjusted Softmax, can mitigate the impact of inter-class imbalance not only in class-incremental but also in realistic scenarios that combine class- and domain-incremental learning, with little additional computational cost. We evaluate our approach on various benchmarks and demonstrate significant performance improvements compared to prior art. For example, our approach improves the best baseline by 4.6% on CIFAR-10.
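A minimal sketch of the logit-adjustment idea, in the generic form of adding scaled log class-priors to the logits during training, is shown below; the running class counts, the temperature `tau`, and the helper name are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, labels, class_counts, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(class prior).

    class_counts tracks how often each class has appeared in the stream so
    far; the adjustment counteracts the bias toward recently over-represented
    classes and targets the class-balanced error.
    """
    priors = class_counts.float() / class_counts.sum()
    adjusted = logits + tau * torch.log(priors + 1e-12)
    return F.cross_entropy(adjusted, labels)

# Streaming usage sketch (hypothetical model and data stream):
# class_counts = torch.zeros(num_classes)
# for x, y in stream:
#     class_counts += torch.bincount(y, minlength=num_classes).float()
#     loss = logit_adjusted_loss(model(x), y, class_counts)
```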

URL: https://openreview.net/forum?id=MyQKcQAte6

---

Title: Single Image Test-Time Adaptation for Segmentation

Authors: Klara Janouskova, Tamir Shor, Chaim Baskin, Jiri Matas

Abstract: Test-Time Adaptation methods improve domain shift robustness of deep neural networks.
We explore the adaptation of segmentation models to a single unlabelled image with no other data available at test time. This allows individual sample performance analysis while excluding orthogonal factors such as weight restart strategies. We propose two new segmentation TTA methods and compare them to established baselines and recent state-of-the-art methods. The methods are first validated on synthetic domain shifts and then tested on real-world datasets. The analysis highlights that simple modifications such as the choice of the loss function can greatly improve the performance of standard baselines, and that different methods and hyper-parameters are optimal for different kinds of domain shift, hindering the development of fully general methods applicable in situations where no prior knowledge about the domain shift is assumed.
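As a point of reference for what a single-image TTA baseline looks like, below is a TENT-style entropy-minimization loop on one unlabeled image, adapting only the affine parameters of normalization layers. This is a common established baseline, explicitly not one of the paper's two proposed methods, and the hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def single_image_tta(model, image, steps=10, lr=1e-3):
    """Entropy-minimization TTA on a single image (TENT-style baseline).

    `model` maps a (1, 3, H, W) image to per-pixel logits (1, C, H, W); only
    affine parameters of normalization layers are adapted, a common choice.
    """
    model.train()
    params = [p for m in model.modules()
              if isinstance(m, (torch.nn.BatchNorm2d, torch.nn.GroupNorm))
              for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(image), dim=1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    model.eval()
    return model(image).argmax(dim=1)   # adapted per-pixel prediction
```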

URL: https://openreview.net/forum?id=68LsWm2GuD

---

Title: Normed Spaces for Graph Embedding

Authors: Diaaeldin Taha, Wei Zhao, J. Maxwell Riestenberg, Michael Strube

Abstract: Theoretical results from discrete geometry suggest that normed spaces can abstractly embed finite metric spaces with surprisingly low theoretical bounds on distortion in low dimensions. Inspired by this theoretical insight, we highlight in this paper normed spaces as a more flexible and computationally efficient alternative to several popular Riemannian manifolds for learning graph embeddings. Normed space embeddings significantly outperform several popular manifolds on a large range of synthetic and real-world graph reconstruction benchmark datasets while requiring significantly fewer computational resources. We also empirically verify the superiority of normed space embeddings on growing families of graphs associated with negative, zero, and positive curvature, further reinforcing the flexibility of normed spaces in capturing diverse graph structures as graph sizes increase. Lastly, we demonstrate the utility of normed space embeddings on two applied graph embedding tasks, namely, link prediction and recommender systems. Our work highlights the potential of normed spaces for geometric graph representation learning, raises new research questions, and offers a valuable tool for experimental mathematics in the field of finite metric space embeddings. We make our code and data publicly available at https://github.com/andyweizhao/graphs-normed-spaces.

URL: https://openreview.net/forum?id=4E2XLydJiv

---

Title: Uncovering Sets of Maximum Dissimilarity on Random Process Data

Authors: Miguel de Carvalho, Gabriel Martos

Abstract: The comparison of local characteristics of two random processes can shed light on periods of time or space at which the processes differ the most. This paper proposes a method that learns about regions with a certain volume, where the marginal attributes of two processes are less similar. The proposed methods are devised in full generality for the setting where the data of interest are themselves stochastic processes, and thus the proposed method can be used for pointing out the regions of maximum dissimilarity with a certain volume, in the contexts of point processes, functional data, and time series. The parameter functions underlying both stochastic processes of interest are modeled via a basis representation, and Bayesian inference is conducted via an integrated nested Laplace approximation. The numerical studies validate the proposed methods, and we showcase their application with case studies on criminology, finance, and medicine.

URL: https://openreview.net/forum?id=ntWCJrlDD8

---

Title: Internal-Coordinate Density Modelling of Protein Structure: Covariance Matters

Authors: Marloes Arts, Jes Frellsen, Wouter Boomsma

Abstract: After the recent ground-breaking advances in protein structure prediction, one of the remaining challenges in protein machine learning is to reliably predict distributions of structural states. Parametric models of fluctuations are difficult to fit due to complex covariance structures between degrees of freedom in the protein chain, often causing models to either violate local or global structural constraints. In this paper, we present a new strategy for modelling protein densities in internal coordinates, which uses constraints in 3D space to induce covariance structure between the internal degrees of freedom. We illustrate the potential of the procedure by constructing a variational autoencoder with full covariance output induced by the constraints implied by the conditional mean in 3D, and demonstrate that our approach makes it possible to scale density models of internal coordinates to full protein backbones in two settings: 1) a unimodal setting for proteins exhibiting small fluctuations and limited amounts of available data, and 2) a multimodal setting for larger conformational changes in a high data regime.

URL: https://openreview.net/forum?id=9XRZtZRmEB

---

Title: LeanVec: Searching vectors faster by making them fit

Authors: Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, Mark Hildebrand, Theodore L. Willke

Abstract: Modern deep learning models have the ability to generate high-dimensional vectors whose similarity reflects semantic resemblance.
Thus, similarity search, i.e., the operation of retrieving those vectors in a large collection that are similar to a given query, has become a critical component of a wide range of applications that demand highly accurate and timely answers. In this setting, the high vector dimensionality puts similarity search systems under compute and memory pressure, leading to subpar performance.
Additionally, cross-modal retrieval tasks have become increasingly common, e.g., where a user inputs a text query to find the most relevant images for that query. However, these queries often have different distributions than the database embeddings, making it challenging to achieve high accuracy. In this work, we present LeanVec, a framework that combines linear dimensionality reduction with vector quantization to accelerate similarity search on high-dimensional vectors while maintaining accuracy. We present LeanVec variants for in-distribution (ID) and out-of-distribution (OOD) queries. LeanVec-ID yields accuracies on par with those from recently introduced deep learning alternatives whose computational overhead precludes their usage in practice. LeanVec-OOD uses a novel technique for dimensionality reduction that considers the query and database distributions to simultaneously boost the accuracy and the performance of the framework even further (even presenting competitive results when the query and database distributions match). All in all, our extensive and varied experimental results show that LeanVec produces state-of-the-art results, with up to 3.7x improvement in search throughput and up to 4.9x faster index build time over the state of the art.
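To illustrate the overall recipe (linear dimensionality reduction followed by quantization, then search in the reduced space), here is a small NumPy sketch. The PCA projection, int8 quantizer, and brute-force inner-product search are stand-ins chosen for brevity; they are not the LeanVec library, and LeanVec-OOD's query-aware projection is not shown.

```python
import numpy as np

def fit_projection(database, target_dim):
    # PCA-style linear projection fit on the database vectors.
    mean = database.mean(axis=0)
    _, _, vt = np.linalg.svd(database - mean, full_matrices=False)
    return mean, vt[:target_dim].T                      # (D,), (D, d)

def quantize_int8(x):
    # Simple symmetric scalar quantization to int8.
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def search(query, db_q, db_scale, mean, proj, k=10):
    # Project the query with the same transform, then rank by inner product
    # against the quantized, reduced database vectors.
    q = (query - mean) @ proj
    scores = (db_q.astype(np.float32) * db_scale) @ q
    return np.argsort(-scores)[:k]

# Usage sketch: database (N, D) and query (D,) are assumed inputs.
# mean, proj = fit_projection(database, target_dim=128)
# db_q, scale = quantize_int8((database - mean) @ proj)
# topk = search(query, db_q, scale, mean, proj)
```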

URL: https://openreview.net/forum?id=wczqrpOrIc

---

Title: CLIP-QDA: An Explainable Concept Bottleneck Model

Authors: Rémi Kazmierczak, Eloïse Berthier, Goran Frehse, Gianni Franchi

Abstract: In this paper, we introduce an explainable algorithm designed from a multi-modal foundation model, that performs fast and explainable image classification. Drawing inspiration from CLIP-based Concept Bottleneck Models (CBMs), our method creates a latent space where each neuron is linked to a specific word. Observing that this latent space can be modeled with simple distributions, we use a Mixture of Gaussians (MoG) formalism to enhance the interpretability of this latent space. Then, we introduce CLIP-QDA, a classifier that only uses statistical values to infer labels from the concepts. In addition, this formalism allows for both local and global explanations.
These explanations come from the inner design of our architecture; our work is thus part of a new family of greybox models, combining the performance of opaque foundation models with the interpretability of transparent models. Our empirical findings show that in instances where the MoG assumption holds, CLIP-QDA achieves accuracy similar to that of state-of-the-art CBMs. Our explanations compete with existing XAI methods while being faster to compute.

URL: https://openreview.net/forum?id=jjmdiMiag7

---

Title: Independence Testing for Temporal Data

Authors: Cencheng Shen, Jaewon Chung, Ronak Mehta, Ting Xu, Joshua T Vogelstein

Abstract: Temporal data are increasingly prevalent in modern data science. A fundamental question is whether two time series are related or not. Existing approaches often have limitations, such as relying on parametric assumptions, detecting only linear associations, and requiring multiple tests and corrections. While many non-parametric and universally consistent dependence measures have recently been proposed, directly applying them to temporal data can inflate the p-value and result in an invalid test. To address these challenges, this paper introduces the temporal dependence statistic with block permutation to test independence between temporal data. Under proper assumptions, the proposed procedure is asymptotically valid and universally consistent for testing independence between stationary time series, and capable of estimating the optimal dependence lag that maximizes the dependence. Moreover, it is compatible with a rich family of distance and kernel based dependence measures, eliminates the need for multiple testing, and exhibits excellent testing power in various simulation settings.
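A bare-bones version of the block-permutation recipe is easy to write down: compute any dependence statistic on the original pair of series, then re-compute it on copies of one series whose blocks have been shuffled, so that short-range autocorrelation is preserved under the null. The statistic, block size, and permutation count below are placeholders, not the paper's recommended settings.

```python
import numpy as np

def block_permutation_pvalue(x, y, stat_fn, block_size=20, n_perm=500, seed=0):
    """Block-permutation p-value for independence of two stationary series.

    stat_fn(x, y) is any dependence statistic (e.g., distance correlation).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    observed = stat_fn(x, y)
    n_blocks = int(np.ceil(n / block_size))
    count = 0
    for _ in range(n_perm):
        order = rng.permutation(n_blocks)
        idx = np.concatenate([np.arange(b * block_size, min((b + 1) * block_size, n))
                              for b in order])
        count += stat_fn(x, y[idx]) >= observed
    return (1 + count) / (1 + n_perm)
```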

URL: https://openreview.net/forum?id=jv1aPQINc4

---

Title: Unleashing the Power of Visual Prompting At the Pixel Level

Authors: Junyang Wu, Xianhang Li, Chen Wei, Huiyu Wang, Alan Yuille, Yuyin Zhou, Cihang Xie

Abstract: This paper presents a simple and effective visual prompting method for adapting pre-trained models to downstream recognition tasks. Our approach is underpinned by two key designs. First, rather than directly adding together the prompt and the image, we treat the prompt as an extra and independent learnable entity. We show that the strategy of reconciling the prompt and the image matters, and find that warping the prompt around a properly shrunk image empirically works the best. Second, we re-introduce two “old tricks” commonly used in building transferable adversarial examples, i.e., input diversity and gradient normalization, into the realm of visual prompting. These techniques improve optimization and enable the prompt to generalize better. We provide extensive experimental results to demonstrate the effectiveness of our method. Using a CLIP model, our prompting method registers a new record of 82.5% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.2%. It is worth noting that such performance not only surpasses linear probing by +2.2%, but, in certain datasets, is on par with the results from fully fine-tuning. Additionally, our prompting method shows competitive performance across different data scales and against distribution shifts.

URL: https://openreview.net/forum?id=IX5GX8SNtM

---

Title: RedMotion: Motion Prediction via Redundancy Reduction

Authors: Royden Wagner, Omer Sahin Tas, Marvin Klemp, Carlos Fernandez, Christoph Stiller

Abstract: We introduce RedMotion, a transformer model for motion prediction in self-driving vehicles that learns environment representations via redundancy reduction. Our first type of redundancy reduction is induced by an internal transformer decoder and reduces a variable-sized set of local road environment tokens, representing road graphs and agent data, to a fixed-sized global embedding. The second type of redundancy reduction is obtained by self-supervised learning and applies the redundancy reduction principle to embeddings generated from augmented views of road environments. Our experiments reveal that our representation learning approach outperforms PreTraM, Traj-MAE, and GraphDINO in a semi-supervised setting. Moreover, RedMotion achieves competitive results compared to HPTR or MTR++ in the Waymo Motion Prediction Challenge. Our open-source implementation is available at: https://github.com/kit-mrt/future-motion

URL: https://openreview.net/forum?id=xl1KhKT3Xx

---

Title: CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Authors: Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi, Mohammad Rastegari, Sachin Mehta

Abstract: Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.

URL: https://openreview.net/forum?id=Yh8Y7a4myU

---

Title: Amortized Bayesian Decision Making for simulation-based models

Authors: Mila Gorecki, Jakob H. Macke, Michael Deistler

Abstract: Simulation-based inference (SBI) provides a powerful framework for inferring posterior distributions of stochastic simulators in a wide range of domains. In many settings, however, the posterior distribution is not the end goal itself --- rather, the derived parameter values and their uncertainties are used as a basis for deciding what actions to take. Unfortunately, because posterior distributions provided by SBI are (potentially crude) approximations of the true posterior, the resulting decisions can be suboptimal. Here, we address the question of how to perform Bayesian decision making on stochastic simulators, and how one can circumvent the need to compute an explicit approximation to the posterior. Our method trains a neural network on simulated data and can predict the expected cost given any data and action, and can, thus, be directly used to infer the action with lowest cost. We apply our method to several benchmark problems and demonstrate that it induces costs similar to those of the true posterior distribution. We then apply the method to infer optimal actions in a real-world simulator in the medical neurosciences, the Bayesian Virtual Epileptic Patient, and demonstrate that it allows us to infer actions associated with low cost after only a few simulations.

URL: https://openreview.net/forum?id=BQE4MTAfCE

---

Title: Appropriate Balance of Diversification and Intensification Improves Performance and Efficiency of Adversarial Attacks

Authors: Keiichiro Yamamura, Issa Oe, Nozomi Hata, Hiroki Ishikura, Katsuki Fujisawa

Abstract: Recently, adversarial attacks that generate adversarial examples by optimizing a multimodal function with many local optimums have attracted considerable research attention. Quick convergence to a nearby local optimum (intensification) and fast enumeration of multiple different local optima (diversification) are important to construct strong attacks. Most existing white-box attacks that use the model’s gradient enumerate multiple local optima based on multi-restart; however, our experiments suggest that the ability of diversification based on multi-restart is limited. To tackle this problem, we propose the multi-directions/objectives (MDO) strategy, which uses multiple search directions and objective functions for diversification. Efficient Diversified Attack, a combination of MDO and multi-target strategies, showed further diversification performance, resulting in better performance than recently proposed attacks against around 88% of 41 CNN-based robust models and 100% of 10 more advanced models, including transformer-based architecture. These results suggest a relationship between attack performances and a balance of diversification and intensification, which is beneficial to constructing more potent attacks.

URL: https://openreview.net/forum?id=mK6TwmInTg

---

Title: Reproducibility Study on Adversarial Attacks Against Robust Transformer Trackers

Authors: Fatemeh Nourilenjan Nokabadi, Jean-Francois Lalonde, Christian Gagné

Abstract: New transformer networks have been integrated into object tracking pipelines and have demonstrated strong performance on the latest benchmarks. This paper focuses on understanding how transformer trackers behave under adversarial attacks and how different attacks perform on tracking datasets as their parameters change. We conducted a series of experiments to evaluate the effectiveness of existing adversarial attacks on object trackers with transformer and non-transformer backbones. We experimented on 7 different trackers, including 3 that are transformer-based, and 4 which leverage other architectures. These trackers are tested against 4 recent attack methods to assess their performance and robustness on VOT2022ST, UAV123 and GOT10k datasets. Our empirical study focuses on evaluating adversarial robustness of object trackers based on bounding box versus binary mask predictions, and attack methods at different levels of perturbations. Interestingly, our study found that altering the perturbation level may not significantly affect the overall object tracking results after the attack. Similarly, the sparsity and imperceptibility of the attack perturbations may remain stable against perturbation level shifts. By applying a specific attack on all transformer trackers, we show that new transformer trackers having a stronger cross-attention modeling achieve a greater adversarial robustness on tracking datasets, such as VOT2022ST and GOT10k. Our results also indicate the necessity for new attack methods to effectively tackle the latest types of transformer trackers. The codes necessary to reproduce this study are available at https://github.com/fatemehN/ReproducibilityStudy.

URL: https://openreview.net/forum?id=FEEKR0Vl9s

---

Title: Improve Certified Training with Signal-to-Noise Ratio Loss to Decrease Neuron Variance and Increase Neuron Stability

Authors: Tianhao Wei, Ziwei Wang, Peizhi Niu, ABULIKEMU ABUDUWEILI, Weiye Zhao, Casidhe Hutchison, Eric Sample, Changliu Liu

Abstract: Neural network robustness is a major concern in safety-critical applications. Certified robustness provides a reliable lower bound on worst-case robustness, and certified training methods have been developed to enhance it. However, certified training methods often suffer from over-regularization, leading to lower certified robustness.
This work addresses this issue by introducing the concepts of neuron variance and neuron stability, examining their impact on over-regularization and model robustness. To tackle the problem, we extend the Signal-to-Noise Ratio (SNR) into the realm of model robustness, offering a novel perspective and developing SNR-inspired losses aimed at optimizing neuron variance and stability to mitigate over-regularization. Through both empirical and theoretical analysis, our SNR-based approach demonstrates superior performance over existing methods on the MNIST and CIFAR-10 datasets. In addition, our exploration of adversarial training uncovers a beneficial correlation between neuron variance and adversarial robustness, leading to an optimized balance between standard and robust accuracy that outperforms baseline methods.

URL: https://openreview.net/forum?id=iV0jktFZ5Y

---

Title: Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

Authors: Zhen Lin, Shubhendu Trivedi, Jimeng Sun

Abstract: Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities across a variety of domains. However, gauging the trustworthiness of responses generated by LLMs remains an open challenge, with limited research on uncertainty quantification (UQ) for NLG. Furthermore, existing literature typically assumes white-box access to language models, which is becoming unrealistic either due to the closed-source nature of the latest LLMs or computational constraints. In this work, we investigate UQ in NLG for *black-box* LLMs. We first differentiate *uncertainty* vs *confidence*: the former refers to the ``dispersion'' of the potential predictions for a fixed input, and the latter refers to the confidence on a particular prediction/generation. We then propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment. Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs.
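One way to picture a black-box dispersion measure: sample several responses to the same prompt, embed them with any off-the-shelf sentence encoder, and treat low average pairwise similarity as high uncertainty. The embedding function and the exact aggregation below are assumptions standing in for the measures compared in the paper.

```python
import numpy as np

def dispersion_uncertainty(responses, embed_fn):
    """Semantic dispersion of multiple sampled generations for one prompt.

    responses: list of strings sampled from the black-box LLM.
    embed_fn:  any sentence-embedding function returning a 1-D vector.
    Returns 1 - mean pairwise cosine similarity (higher = more uncertain).
    """
    embs = np.stack([embed_fn(r) for r in responses])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = embs @ embs.T
    n = len(responses)
    mean_sim = (sim.sum() - n) / (n * (n - 1))    # exclude self-similarity
    return 1.0 - mean_sim
```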

URL: https://openreview.net/forum?id=DWkJCSxKU5

---

Title: Enhancing Vision-Language Model with Unmasked Token Alignment

Authors: Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li

Abstract: Contrastive pre-training on image-text pairs, exemplified by CLIP, becomes a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces $\textbf{U}$nmasked $\textbf{T}$oken $\textbf{A}$lignment ($\textbf{UTA}$), a method that leverages existing CLIP models to further enhance its vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient by avoiding using the extra $\mathrm{[MASK]}$ tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks.

URL: https://openreview.net/forum?id=JkFEVbW6wE

---

Title: Soft Merging of Experts with Adaptive Routing

Authors: Mohammed Muqeeth, Haokun Liu, Colin Raffel

Abstract: Neural networks that learn to route their inputs through different "expert" subnetworks provide a form of modularity that standard dense models lack. Despite their possible benefits, modular models with learned routing often underperform their parameter-matched dense counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train modular models that use non-differentiable discrete routing decisions. To address this issue, we introduce $\textbf{S}$oft $\textbf{M}$erging of $\textbf{E}$xperts with $\textbf{A}$daptive $\textbf{R}$outing (SMEAR), which avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters. By routing activations through a single merged expert, SMEAR does not incur a significant increase in computational costs and enables standard gradient-based training. We empirically validate that models using SMEAR outperform models that route based on metadata or learn routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization.
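The core trick is compact enough to sketch: instead of dispatching each input to a discrete expert, average the experts' parameters with the router's softmax weights and run a single forward pass through that merged expert, keeping everything differentiable. The layer below is a toy linear-expert version written for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SMEARLayer(nn.Module):
    """Toy SMEAR-style layer with linear experts and per-example soft merging."""

    def __init__(self, d_in, d_out, n_experts):
        super().__init__()
        self.router = nn.Linear(d_in, n_experts)
        self.expert_w = nn.Parameter(torch.randn(n_experts, d_out, d_in) * 0.02)
        self.expert_b = nn.Parameter(torch.zeros(n_experts, d_out))

    def forward(self, x):                                   # x: (batch, d_in)
        gate = torch.softmax(self.router(x), dim=-1)        # (batch, n_experts)
        # Merge expert parameters with the routing weights (no discrete choice).
        w = torch.einsum("be,eoi->boi", gate, self.expert_w)
        b = gate @ self.expert_b
        # Single forward pass through the merged expert.
        return torch.einsum("boi,bi->bo", w, x) + b
```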

URL: https://openreview.net/forum?id=7I199lc54z

---

Title: MaskOCR: Scene Text Recognition with Masked Vision-Language Pre-training

Authors: Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Abstract: Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images, which allows us to learn strong visual representations. In contrast to introducing linguistic knowledge with an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme.
Significantly, the encoder is frozen during the pre-training phase of the sequence decoder. Experimental results demonstrate that our proposed method achieves superior performance on benchmark datasets, including Chinese and English text images. The code for our approach will be made available.

URL: https://openreview.net/forum?id=KNAWoKKpi3

---

Title: Differentially Private Kernel Inducing Points using features from ScatterNets (DP-KIP-ScatterNet) for Privacy Preserving Data Distillation

Authors: Margarita Vinaroz, Mijung Park

Abstract: Data distillation aims to generate a small data set that closely mimics the performance of a given learning algorithm on the original data set. The distilled dataset is hence useful to simplify the training process thanks to its small data size. However, distilled data samples are not necessarily privacy-preserving, even if they are generally humanly indiscernible. To address this limitation, we introduce differentially private kernel inducing points (DP-KIP) for privacy-preserving data distillation. Unlike our original intention to simply apply DP-SGD to the framework of KIP, we find that KIP using infinitely-wide convolutional neural tangent kernels (conv-NTKs) performs better compared to KIP using fully-connected NTKs. However, KIP with conv-NTKs, due to its convolutional and pooling operations, introduces an unbearable computational complexity, requiring hundreds of V100 GPUs in parallel to train, which is impractical and more importantly, such computational resources are inaccessible to many. To overcome this issue, we propose an alternative that does not require pre-training (to avoid a privacy loss) and can well capture complex information on images, as those features from conv-NTKs do, while the computational cost is manageable by a single V100 GPU. To this end, we propose DP-KIP-ScatterNet, which uses the wavelet features from Scattering networks (ScatterNet) instead of those from conv-NTKs, to perform DP-KIP at a reasonable computational cost. We implement DP-KIP-ScatterNet in -- computationally efficient -- JAX and test on several popular image datasets to show its efficacy and its superior performance compared to state-of-the-art methods in image data distillation with differential privacy guarantees. Our code is available at https://github.com/ParkLabML/DP-KIP.

URL: https://openreview.net/forum?id=84M8xwNxrc

---

Title: Large Language Models can be Guided to Evade AI-generated Text Detection

Authors: Ning Lu, Shengcai Liu, Rui He, Yew-Soon Ong, Qi Wang, Ke Tang

Abstract: Large language models (LLMs) have shown remarkable performance in various tasks and have been extensively utilized by the public.
However, the increasing concerns regarding the misuse of LLMs, such as plagiarism and spamming, have led to the development of multiple detectors, including fine-tuned classifiers and statistical methods. In this study, we equip LLMs with prompts, rather than relying on an external paraphraser, to evaluate the vulnerability of these detectors. We propose a novel Substitution-based In-Context example Optimization method (SICO) to automatically construct prompts for evading the detectors. SICO is cost-efficient as it requires only 40 human-written examples and a limited number of LLM inferences to generate a prompt. Moreover, once a task-specific prompt has been constructed, it can be universally used against a wide range of detectors. Extensive experiments across three real-world tasks demonstrate that SICO significantly outperforms the paraphraser baselines and enables GPT-3.5 to successfully evade six detectors, decreasing their AUC by 0.5 on average. Furthermore, a comprehensive human evaluation shows that the SICO-generated text achieves human-level readability and task completion rates, while preserving high imperceptibility. Finally, we propose an ensemble approach to enhance the robustness of detectors against the SICO attack.

URL: https://openreview.net/forum?id=lLE0mWzUrr

---

Title: Predicting the Encoding Error of SIRENs

Authors: Jeremy Vonderfecht, Feng Liu

Abstract: Implicit Neural Representations (INRs), which encode signals such as images, videos, and 3D shapes in the weights of neural networks, are becoming increasingly popular. Among their many applications is signal compression, for which there is great interest in achieving the highest possible fidelity to the original signal subject to constraints such as neural network size, training (encoding) and inference (decoding) time. But training INRs can be a computationally expensive process, making it challenging to determine the best possible tradeoff under such constraints. Towards this goal, we propose a novel problem: predicting the encoding error (i.e. training loss) that an INR will reach on a given training signal. We present a method which predicts the encoding error that a popular INR network (SIREN) will reach, given its network hyperparameters and the signal to encode. This method is trained on a unique dataset of 300,000 SIRENs, trained across a variety of images and hyperparameters. Our predictive method demonstrates the feasibility of this regression problem, and allows users to anticipate the encoding error that a SIREN network will reach in milliseconds instead of minutes or longer. We also provide insights into the behavior of SIREN networks, such as why narrow SIRENs can have very high random variation in encoding error, and how the performance of SIRENs relates to JPEG compression.

URL: https://openreview.net/forum?id=iKPC7N85Pf

---

Title: Adversarial Imitation Learning from Visual Observations using Latent Information

Authors: Vittorio Giammarino, James Queeney, Ioannis Paschalidis

Abstract: We focus on the problem of imitation learning from visual observations, where the learning agent has access to videos of experts as its sole learning source. The challenges of this framework include the absence of expert actions and the partial observability of the environment, as the ground-truth states can only be inferred from pixels. To tackle this problem, we first conduct a theoretical analysis of imitation learning in partially observable environments. We establish upper bounds on the suboptimality of the learning agent with respect to the divergence between the expert and the agent latent state-transition distributions. Motivated by this analysis, we introduce an algorithm called Latent Adversarial Imitation from Observations, which combines off-policy adversarial imitation techniques with a learned latent representation of the agent's state from sequences of observations. In experiments on high-dimensional continuous robotic tasks, we show that our model-free approach in latent space matches state-of-the-art performance. Additionally, we show how our method can be used to improve the efficiency of reinforcement learning from pixels by leveraging expert videos. To ensure reproducibility, we provide free access to all the learning curves and open-source our code.

URL: https://openreview.net/forum?id=ydPHjgf6h0

---

Title: Transfer Learning with Informative Priors: Simple Baselines Better than Previously Reported

Authors: Ethan Harvey, Mikhail Petrov, Michael C Hughes

Abstract: We pursue transfer learning to improve classifier accuracy on a target task with few labeled examples available for training. Recent work suggests that using a source task to learn a prior distribution over neural net weights, not just an initialization, can boost target task performance. In this study, we carefully compare transfer learning with and without source task informed priors across 5 datasets. We find that standard transfer learning, informed only by an initialization, performs far better than reported in previous comparisons. The relative gains of methods using informative priors over standard transfer learning vary in magnitude across datasets. For the scenario of 5-300 examples per class, we find negative or negligible gains on 2 datasets, modest gains (between 1.5-3 points of accuracy) on 2 other datasets, and substantial gains (>8 points) on one dataset. Among methods using informative priors, we find that an isotropic covariance appears competitive with a learned low-rank covariance matrix while being substantially simpler to understand and tune. Further analysis suggests that the mechanistic justification for informed priors -- hypothesized improved alignment between train and test loss landscapes -- is not consistently supported due to high variability in empirical landscapes. We release code to allow independent reproduction of all experiments.

URL: https://openreview.net/forum?id=BbvSU02jLg

---

Title: From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative Modeling

Authors: Aneesh Komanduri, Xintao Wu, Yongkai Wu, Feng Chen

Abstract: Deep generative models have shown tremendous capability in data density estimation and data generation from finite samples. While these models have shown impressive performance by learning correlations among features in the data, some fundamental shortcomings are their lack of explainability, tendency to induce spurious correlations, and poor out-of-distribution extrapolation. To remedy such challenges, recent work has proposed a shift toward causal generative models. Causal models offer several beneficial properties to deep generative models, such as distribution shift robustness, fairness, and interpretability. Structural causal models (SCMs) describe data-generating processes and model complex causal relationships and mechanisms among variables in a system. Thus, SCMs can naturally be combined with deep generative models. We provide a technical survey on causal generative modeling categorized into causal representation learning and controllable counterfactual generation methods. We focus on fundamental theory, methodology, drawbacks, datasets, and metrics. Then, we cover applications of causal generative models in fairness, privacy, out-of-distribution generalization, precision medicine, and biological sciences. Lastly, we discuss open problems and fruitful research directions for future work in the field.

URL: https://openreview.net/forum?id=PUpZXvNqmb

---

Title: Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning

Authors: Aritra Mitra, George J. Pappas, Hamed Hassani

Abstract: In large-scale distributed machine learning, recent works have studied the effects of compressing gradients in stochastic optimization to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in multi-agent reinforcement learning, almost nothing is known about the analogous question: \textit{Are common reinforcement learning (RL) algorithms also robust to similar perturbations?} We investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our work makes three important technical contributions. First, we prove that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. Second, we show that our analysis framework extends seamlessly to nonlinear stochastic approximation schemes that subsume Q-learning. Third, we prove that for multi-agent TD learning, one can achieve linear convergence speedups with respect to the number of agents while communicating just $\tilde{O}(1)$ bits per iteration. Notably, these are the first finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our proofs hinge on the construction of novel Lyapunov functions that capture the dynamics of a memory variable introduced by error-feedback.
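The error-feedback mechanism itself is simple to write down for TD(0) with linear function approximation: compress the sum of the fresh update direction and the residual memory, apply the compressed direction, and carry forward whatever the compressor dropped. The top-k compressor, step sizes, and the `env_stream`/`phi` interfaces below are illustrative assumptions, not the paper's general setting.

```python
import numpy as np

def top_k(v, k):
    # Keep the k largest-magnitude coordinates (one common compression operator;
    # the paper treats a general class of compressors).
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def td0_error_feedback(env_stream, phi, theta, alpha=0.05, gamma=0.99, k=5):
    """TD(0) with linear function approximation, compression, and error feedback.

    env_stream yields (s, r, s_next) transitions under a fixed policy and
    phi(s) returns the feature vector; both are placeholders.
    """
    e = np.zeros_like(theta)                       # error-feedback memory
    for s, r, s_next in env_stream:
        f, f_next = phi(s), phi(s_next)
        delta = r + gamma * f_next @ theta - f @ theta   # TD error
        g = delta * f                              # uncompressed update direction
        compressed = top_k(e + g, k)               # compress direction + residual
        e = e + g - compressed                     # carry what was dropped
        theta = theta + alpha * compressed
    return theta
```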

URL: https://openreview.net/forum?id=dltUedmUVT

---

Title: Genetic InfoMax: Exploring Mutual Information Maximization in High-Dimensional Imaging Genetics Studies

Authors: Yaochen Xie, Ziqian Xie, Sheikh Muhammad Saiful Islam, Degui Zhi, Shuiwang Ji

Abstract: Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits. When applied to high-dimensional medical imaging data, a key step is to extract lower-dimensional, yet informative representations of the data as traits. Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS in comparison to typical visual representation learning. In this study, we tackle this problem from the mutual information (MI) perspective by identifying key limitations of existing methods. We introduce a trans-modal learning framework, Genetic InfoMax (GIM), which includes a regularized MI estimator and a novel genetics-informed transformer to address the specific challenges of GWAS. We evaluate GIM on human brain 3D MRI data and establish standardized evaluation protocols to compare it to existing approaches. Our results demonstrate the effectiveness of GIM and significantly improved performance on GWAS.

URL: https://openreview.net/forum?id=9UgUMFW67X

---

Title: Enhancing Compositional Generalization via Compositional Feature Alignment

Authors: Haoxiang Wang, Haozhe Si, Huajie Shao, Han Zhao

Abstract: Real-world applications of machine learning models often confront data distribution shifts, wherein discrepancies exist between the training and test data distributions. In the common multi-domain multi-class setup, as the number of classes and domains scales up, it becomes infeasible to gather training data for every domain-class combination. This challenge naturally leads to the quest for models with Compositional Generalization (CG) ability, where models can generalize to unseen domain-class combinations. To delve into the CG challenge, we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets, and observe that the prevalent pretraining-finetuning paradigm on foundational models, such as CLIP and DINOv2, struggles with the challenge. To address this challenge, we propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that i) learns two orthogonal linear heads on a pretrained encoder with respect to class and domain labels, and ii) fine-tunes the encoder with the newly learned heads frozen. We theoretically and empirically justify that CFA encourages compositional feature learning of pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2, two powerful pretrained vision foundation models. Experiment results show that CFA outperforms common finetuning techniques in compositional generalization, corroborating CFA's efficacy in compositional feature learning.
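
A minimal PyTorch sketch of the two-stage recipe described in the abstract. The orthogonality penalty weight, optimizer, loader interface, and the choice to freeze the encoder in stage one are assumptions made here for illustration, not details taken from the paper.

    import torch
    import torch.nn as nn

    def orthogonality_penalty(head_a, head_b):
        # Encourage the two linear heads to span (approximately) orthogonal subspaces.
        return (head_a.weight @ head_b.weight.T).pow(2).sum()

    def cfa_two_stage(encoder, feat_dim, n_classes, n_domains, loader, epochs=(5, 5)):
        class_head = nn.Linear(feat_dim, n_classes)
        domain_head = nn.Linear(feat_dim, n_domains)
        ce = nn.CrossEntropyLoss()

        # Stage 1: learn class and domain heads on frozen features, keeping them orthogonal.
        opt = torch.optim.Adam(list(class_head.parameters()) + list(domain_head.parameters()))
        for _ in range(epochs[0]):
            for x, y_class, y_domain in loader:
                with torch.no_grad():
                    z = encoder(x)
                loss = (ce(class_head(z), y_class) + ce(domain_head(z), y_domain)
                        + 0.1 * orthogonality_penalty(class_head, domain_head))
                opt.zero_grad(); loss.backward(); opt.step()

        # Stage 2: freeze the newly learned heads and finetune the encoder toward them.
        for p in list(class_head.parameters()) + list(domain_head.parameters()):
            p.requires_grad_(False)
        opt = torch.optim.Adam(encoder.parameters())
        for _ in range(epochs[1]):
            for x, y_class, y_domain in loader:
                z = encoder(x)
                loss = ce(class_head(z), y_class) + ce(domain_head(z), y_domain)
                opt.zero_grad(); loss.backward(); opt.step()
        return encoder, class_head, domain_head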

URL: https://openreview.net/forum?id=k3d5C0YvfK

---

Title: Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models

Authors: Qiujiang Jin, Tongzheng Ren, Nhat Ho, Aryan Mokhtari

Abstract: The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the final desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings as it is $\mathcal{O}(d^3)$, where $d$ is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method to solve parameter estimation of the GLMs, which has a per iteration cost of $\mathcal{O}(d^2)$. When the SNR is low, for GLMs with a polynomial link function of degree $p$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-square loss function, and the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of problem parameters and only depends on the degree of the link function $p$. Also, for the empirical loss with $n$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $p$, the iterates of BFGS reach a final statistical radius of $\mathcal{O}((d/n)^{\frac{1}{2p+2}})$ after at most $\log(n/d)$ iterations. This complexity is significantly less than the number required for GD, which scales polynomially with $(n/d)$.
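
For readers unfamiliar with the method, here is a minimal numpy sketch of BFGS applied to an empirical least-squares loss for a GLM with polynomial link $u \mapsto u^p$, which makes the $\mathcal{O}(d^2)$ per-iteration cost of the inverse-Hessian update visible. The fixed step size (in place of a line search), the data interface, and the loop length are illustrative assumptions, not the paper's setup.

    import numpy as np

    def glm_loss_grad(theta, X, y, p):
        """Least-squares loss and gradient for a GLM with link u -> u**p."""
        u = X @ theta
        resid = u**p - y
        loss = 0.5 * np.mean(resid**2)
        grad = X.T @ (resid * p * u**(p - 1)) / len(y)
        return loss, grad

    def bfgs(theta, X, y, p, step=1e-2, iters=100):
        d = theta.size
        H = np.eye(d)                       # inverse-Hessian approximation
        _, g = glm_loss_grad(theta, X, y, p)
        for _ in range(iters):
            direction = -H @ g              # O(d^2) work per iteration
            theta_new = theta + step * direction
            _, g_new = glm_loss_grad(theta_new, X, y, p)
            s, yk = theta_new - theta, g_new - g
            rho = 1.0 / (yk @ s)
            if np.isfinite(rho) and rho > 0:    # standard curvature safeguard
                I = np.eye(d)
                H = (I - rho * np.outer(s, yk)) @ H @ (I - rho * np.outer(yk, s)) \
                    + rho * np.outer(s, s)
            theta, g = theta_new, g_new
        return theta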

URL: https://openreview.net/forum?id=PIL3YWXmx2

---

Title: CascadedGaze: Efficiency in Global Context Extraction for Image Restoration

Authors: Amirhosein Ghasemabadi, Muhammad Kamran Janjua, Mohammad Salameh, CHUNHUA ZHOU, Fengyu Sun, Di Niu

Abstract: Image restoration tasks traditionally rely on convolutional neural networks. However, given the local nature of the convolutional operator, they struggle to capture global information. The promise of attention mechanisms in Transformers is to circumvent this problem, but it comes at the cost of intensive computational overhead. Many recent studies in image restoration have focused on solving the challenge of balancing performance and computational cost via Transformer variants. In this paper, we present the CascadedGaze Network (CGNet), an encoder-decoder architecture that employs the Global Context Extractor (GCE), a novel and efficient way to capture global information for image restoration. The GCE module leverages small kernels across convolutional layers to learn global dependencies, without requiring self-attention. Extensive experimental results show that our approach outperforms a range of state-of-the-art methods on denoising benchmark datasets including both real image denoising and synthetic image denoising, as well as on the image deblurring task, while being more computationally efficient.

URL: https://openreview.net/forum?id=C3FXHxMVuq

---

Title: Revisiting stochastic submodular maximization with cardinality constraint: A bandit perspective

Authors: Pratik Jawanpuria, Bamdev Mishra, Karthik Gurumoorthy

Abstract: In this paper, we focus on the problem of maximizing non-negative, monotone, stochastic submodular functions under cardinality constraint. Recent works have explored continuous optimization algorithms via multi-linear extensions for such problems and provided appropriate approximation guarantees. We take a fresh look into this problem from a discrete, (stochastic) greedy perspective under a probably approximately correct (PAC) setting, i.e., the goal is to obtain solutions whose expected objective value is greater than or equal to $(1-1/e-\epsilon){\rm OPT}-\nu$ with at least $1-\delta$ probability, where ${\rm OPT}$ is the optimal objective value. Using the theory of multi-armed bandits, we propose novel bandit stochastic greedy (BSG) algorithms in which selection of the next element at iteration $i$ is posed as a $(\nu_i,\delta_i)$-PAC best-arm identification problem. Given $(\nu,\delta)$-PAC parameters to BSG, we formally characterize a set $\mathcal{A}(\nu,\delta)$ of per-iteration policies such that any policy from this set guarantees a $(\nu,\delta)$-PAC solution for the stochastic submodular maximization problem using BSG. We next discuss the problem of learning a policy in $\mathcal{A}(\nu,\delta)$ by minimizing the computational cost. With our learned policy, we show that BSG has lower computational cost than existing stochastic submodular maximization approaches. An interesting outcome of our analysis is the development of both linear and almost-linear time algorithms for the exemplar based clustering problem with $(1-1/e-\epsilon)$-approximation guarantee under a PAC setting. Lastly, we also study the problem of learning a policy for BSG under budget setting. Experiments on various problems illustrate the efficacy of our approach in terms of optimization quality as well as computational efficiency.

URL: https://openreview.net/forum?id=57ETChLAOE

---

Title: [Re] Explaining Temporal Graph Models through an Explorer-Navigator Framework

Authors: Miklós Hamar, Matey Krastev, Kristiyan Danielov Hristov, David Beglou

Abstract: Temporal graphs model complex dynamic relations that change over time, and are being used in a growing number of applications. In recent years, several graph neural networks (GNNs) were proposed, designed specifically for this temporal setting (Xu et al., 2020; Rossi et al., 2020). However, these models are notoriously hard to interpret. For this reason, the original authors (Xia et al., 2023) propose the Temporal GNN Explainer (T-GNNExplainer) – an explorer-navigator framework to efficiently compute sparse explanations of target Temporal GNNs. We reproduce the main findings of the original paper, extend their work by proposing a different type of navigator method, and examine in detail the explanation capabilities and efficiency of the provided framework within various model and hyperparameter settings. We confirm that their explainer outperforms the other baselines across nearly all datasets and metrics. Our findings suggest that the navigator helps bias the search process and that T-GNNExplainer can find an exact influential event set. Moreover, we examine the effect of different navigator methods and quantify the runtime-fidelity tradeoff controlled by two hyper-parameters.

URL: https://openreview.net/forum?id=FI1XvwpchC

---

Title: Efficient Large Language Models: A Survey

Authors: Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding and language generation, and thus have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges. In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspectives, respectively. We have also created a GitHub repository where we organize the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey. We will actively maintain the repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of efficient LLMs research and inspire them to contribute to this important and exciting field.

URL: https://openreview.net/forum?id=bsCCJHbO8A

---

Title: Deep Generalized Prediction Set Classifier and Its Theoretical Guarantees

Authors: Zhou Wang, Xingye Qiao

Abstract: A standard classification rule returns a single-valued prediction for any observation without a confidence guarantee, which may result in severe consequences in many critical applications when the uncertainty is high. In contrast, set-valued classification is a new paradigm to handle the uncertainty in classification by reporting a set of plausible labels for observations in highly ambiguous regions. In this article, we propose the Deep Generalized Prediction Set (DeepGPS) method, a network-based set-valued classifier induced by acceptance region learning. DeepGPS is capable of identifying ambiguous observations and detecting out-of-distribution (OOD) observations. It is the first set-valued classification method of this kind that has a theoretical guarantee and scales to large datasets. Our nontrivial proof shows that the risk of DeepGPS, defined as the expected size of the prediction set, attains optimality within a neural network hypothesis class while simultaneously achieving the user-prescribed class-specific accuracy. Additionally, by using a weighted loss, DeepGPS returns tighter acceptance regions, leading to informative predictions and improved OOD detection performance. Empirically, our method outperforms the baselines on several benchmark datasets.

URL: https://openreview.net/forum?id=H7gLN5nqVF

---

Title: As large as it gets – Studying Infinitely Large Convolutions via Neural Implicit Frequency Filters

Authors: Julia Grabinski, Janis Keuper, Margret Keuper

Abstract: Recent work in neural networks for image classification has seen a strong tendency towards increasing the spatial context during encoding. Whether achieved through large convolution kernels or self-attention, models scale poorly with the increased spatial context, such that the improved model accuracy often comes at significant costs. In this paper, we propose a module for studying the effective filter size of convolutional neural networks (CNNs). To facilitate such a study, several challenges need to be addressed: (i) we need an effective means to train models with large filters (potentially as large as the input data) without increasing the number of learnable parameters, (ii) the employed convolution operation should be a plug-and-play module that can replace conventional convolutions in a CNN and allow for an efficient implementation in current frameworks, (iii) the study of filter sizes has to be decoupled from other aspects such as the network width or the number of learnable parameters, and (iv) the cost of the convolution operation itself has to remain manageable, i.e., we cannot naïvely increase the size of the convolution kernel. To address these challenges, we propose to learn the frequency representations of filter weights as neural implicit functions, such that the better scalability of the convolution in the frequency domain can be leveraged. Additionally, due to the implementation of the proposed neural implicit function, even large and expressive spatial filters can be parameterized by only a few learnable weights. Interestingly, our analysis shows that, although the proposed networks could learn very large convolution kernels, the learned filters are well localized and relatively small in practice when transformed from the frequency to the spatial domain. We anticipate that our analysis of individually optimized filter sizes will allow for more efficient, yet effective, models in the future. Our code is available at https://github.com/GeJulia/NIFF.
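
To make the core idea concrete, here is a minimal PyTorch sketch of parameterizing a filter's frequency response with a small MLP over frequency coordinates and applying it as an element-wise product in the Fourier domain (the convolution theorem). The class name, MLP size, and per-channel treatment are illustrative assumptions, not the paper's exact NIFF module.

    import torch
    import torch.nn as nn

    class ImplicitFrequencyFilter(nn.Module):
        """A filter as large as the input, parameterized implicitly in frequency space."""
        def __init__(self, channels, hidden=32):
            super().__init__()
            # A small MLP maps a 2D frequency coordinate to a per-channel response.
            self.mlp = nn.Sequential(
                nn.Linear(2, hidden), nn.ReLU(),
                nn.Linear(hidden, channels)
            )

        def forward(self, x):                       # x: (B, C, H, W)
            B, C, H, W = x.shape
            fy = torch.fft.fftfreq(H, device=x.device)
            fx = torch.fft.rfftfreq(W, device=x.device)
            coords = torch.stack(torch.meshgrid(fy, fx, indexing="ij"), dim=-1)
            filt = self.mlp(coords).permute(2, 0, 1)        # (C, H, W//2 + 1)
            X = torch.fft.rfft2(x)                          # convolution theorem:
            return torch.fft.irfft2(X * filt, s=(H, W))     # multiply in frequency domain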

URL: https://openreview.net/forum?id=xRy1YRcHWj

---

Title: TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

Authors: Dongfu Jiang, Yishan Li, Ge Zhang, Wenhao Huang, Bill Yuchen Lin, Wenhu Chen

Abstract: We present TIGERScore, a \textbf{T}rained metric that follows \textbf{I}nstruction \textbf{G}uidance to perform \textbf{E}xplainable and \textbf{R}eference-free evaluation over a wide spectrum of text generation tasks.
Different from other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instructions to provide error analyses that pinpoint the mistakes in the generated text. Our metric is based on LLaMA-2, trained on our meticulously curated instruction-tuning dataset MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples in the form of (instruction, input, system output $\rightarrow$ error analysis). We collected the `system outputs' from a large variety of models to cover different types of errors.
To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets and nearly approaches the GPT-4 evaluator. As a reference-free metric, its correlation can even surpass the best existing reference-based metrics. To further qualitatively assess the rationales generated by our metric, we conduct a human evaluation of the generated explanations and find that the explanations are 70.8\% accurate.
Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.

URL: https://openreview.net/forum?id=EE1CBKC0SZ

---

Title: Reproducibility Study of "Learning Perturbations to Explain Time Series Predictions"

Authors: Jiapeng Fan, Luke Cadigan, Paulius Skaisgiris, Sebastian Uriel Arias

Abstract: In this work, we attempt to reproduce the results of Enguehard (2023), which introduced ExtremalMask, a mask-based perturbation method for explaining time series data. We investigated the key claims of this paper, namely that (1) the model outperformed other models in several key metrics on both synthetic and real data, and (2) the model performed better when using the loss function of the preservation game relative to that of the deletion game. Although discrepancies exist, our results generally support the core of the original paper’s conclusions. Next, we interpret ExtremalMask’s outputs using new visualizations and metrics and discuss the insights each interpretation provides. Finally, we test whether ExtremalMask creates out-of-distribution samples, and find that the model does not exhibit this flaw on our tested synthetic dataset. Overall, our results support and add nuance to the original paper’s findings.

URL: https://openreview.net/forum?id=fCNqD2IuoD

---

Title: Extreme Risk Mitigation in Reinforcement Learning using Extreme Value Theory

Authors: Karthik Somayaji NS, Yu Wang, Malachi Schram, Jan Drgona, Mahantesh M Halappanavar, Frank Liu, Peng Li

Abstract: Risk-sensitive reinforcement learning (RL) has garnered significant attention in recent years due to the growing interest in deploying RL agents in real-world scenarios. A critical aspect of risk awareness involves modelling highly rare risk events (rewards) that could potentially lead to catastrophic outcomes. These infrequent occurrences present a formidable challenge for data-driven methods aiming to capture such risky events accurately. While risk-aware RL techniques do exist, they suffer from high variance estimation due to the inherent data scarcity. Our work proposes to enhance the resilience of RL agents when faced with very rare and risky events by refining the extreme values predicted by the state-action value distribution. To achieve this, we formulate the extreme values of the state-action value function distribution as parameterized distributions, drawing inspiration from the principles of extreme value theory (EVT). We propose an extreme value theory based actor-critic approach, namely Extreme Valued Actor-Critic (EVAC), which effectively addresses the issue of infrequent occurrence by leveraging EVT-based parameterization. Importantly, we theoretically demonstrate the advantages of employing these parameterized distributions in contrast to other risk-averse algorithms. Our evaluations show that the proposed method outperforms other risk-averse RL algorithms on a diverse range of benchmark tasks, each encompassing distinct risk scenarios.

URL: https://openreview.net/forum?id=098mb06uhA

---

Title: Interpretable Additive Tabular Transformer Networks

Authors: Anton Frederik Thielmann, Arik Reuter, Thomas Kneib, David Rügamer, Benjamin Säfken

Abstract: Attention-based Transformer networks have not only revolutionized Natural Language Processing but have also achieved state-of-the-art results for tabular data modeling. The attention mechanism, in particular, has proven to be highly effective in accurately modeling categorical variables. Although deep learning models have recently outperformed tree-based models, they often lack a complete comprehension of the individual impact of features because of their opaque nature. In contrast, additive neural network structures have proven to be both predictive and interpretable. Within the context of explainable deep learning, we propose Neural Additive Tabular Transformer Networks (NATT), a modeling framework that combines the intelligibility of additive neural networks with the predictive power of Transformer models. NATT offers inherent intelligibility while achieving similar performance to complex deep learning models. To validate its efficacy, we conduct experiments on multiple datasets and find that NATT performs on par with state-of-the-art methods on tabular data and surpasses other interpretable approaches.

URL: https://openreview.net/forum?id=TdJ7lpzAkD

---

Title: Geometrical aspects of lattice gauge equivariant convolutional neural networks

Authors: David I. Müller, Jimmy Aronsson, Daniel Schuh

Abstract: Lattice gauge equivariant convolutional neural networks (L-CNNs) are a framework for convolutional neural networks that can be applied to non-Abelian lattice gauge theories without violating gauge symmetry. We demonstrate how L-CNNs can be equipped with global group equivariance. This allows us to extend the formulation to be equivariant not just under translations but under global lattice symmetries such as rotations and reflections. Additionally, we provide a geometric formulation of L-CNNs and show how convolutions in L-CNNs arise as a special case of gauge equivariant neural networks on SU(N) principal bundles.

URL: https://openreview.net/forum?id=zO4aAVHxPe

---

Title: Boosting Data-Driven Mirror Descent with Randomization, Equivariance, and Acceleration

Authors: Hong Ye Tan, Subhadip Mukherjee, Junqi Tang, Carola-Bibiane Schönlieb

Abstract: Learning-to-optimize (L2O) is an emerging research area in large-scale optimization with applications in data science. Recently, researchers have proposed a novel L2O framework called learned mirror descent (LMD), based on the classical mirror descent (MD) algorithm with learnable mirror maps parameterized by input-convex neural networks. The LMD approach has been shown to significantly accelerate convex solvers while inheriting the convergence properties of the classical MD algorithm. This work proposes several practical extensions of the LMD algorithm, addressing its instability, scalability, and feasibility for high-dimensional problems. We first propose accelerated and stochastic variants of LMD, leveraging classical momentum-based acceleration and stochastic optimization techniques for improving the convergence rate and per-iteration computational complexity. Moreover, for the particular application of training neural networks, we derive and propose a novel and efficient parameterization for the mirror potential, exploiting the equivariant structure of the training problems to significantly reduce the dimensionality of the underlying problem. We provide theoretical convergence guarantees for our schemes under standard assumptions and demonstrate their effectiveness in various computational imaging and machine learning applications such as image inpainting, and the training of support vector machines and deep neural networks.
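
LMD replaces the hand-crafted mirror map with one parameterized by input-convex neural networks; as background, here is a minimal numpy sketch of the classical mirror descent step it builds on, using the negative-entropy mirror map on the probability simplex (exponentiated gradient). The toy objective, step size, and iteration count are illustrative assumptions.

    import numpy as np

    def entropy_mirror_descent(grad_f, x0, step=0.1, iters=100):
        """Classical mirror descent on the simplex with the negative-entropy mirror map:
        grad psi(x_{k+1}) = grad psi(x_k) - step * grad f(x_k), i.e. exponentiated gradient."""
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            x = x * np.exp(-step * grad_f(x))   # dual step followed by the inverse mirror map
            x /= x.sum()                        # Bregman projection back onto the simplex
        return x

    # Toy example: minimize f(x) = ||x - c||^2 over the simplex; iterates approach c.
    c = np.array([0.7, 0.2, 0.1])
    print(entropy_mirror_descent(lambda x: 2 * (x - c), np.ones(3) / 3))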

URL: https://openreview.net/forum?id=r2dx1s1lqG

---

Title: Causal Discovery from Time Series with Hybrids of Constraint-Based and Noise-Based Algorithms

Authors: Daria Bystrova, Charles K. Assaad, Julyan Arbel, Emilie Devijver, Eric Gaussier, Wilfried Thuiller

Abstract: Constraint-based methods and noise-based methods are two distinct families of methods proposed for uncovering causal graphs from observational data. However, both operate under strong assumptions that may be challenging to validate or could be violated in real-world scenarios. In response to these challenges, there is a growing interest in hybrid methods that amalgamate principles from both methods, showing robustness to assumption violations. This paper introduces a novel comprehensive framework for hybridizing constraint-based and noise-based methods designed to uncover causal graphs from observational time series. The framework is structured into two classes. The first class employs a noise-based strategy to identify a super graph, containing the true graph, followed by a constraint-based strategy to eliminate unnecessary edges. In the second class, a constraint-based strategy is applied to identify a skeleton, which is then oriented using a noise-based strategy. The paper provides theoretical guarantees for each class under the condition that all assumptions are satisfied, and it outlines some properties when assumptions are violated. To validate the efficacy of the framework, two algorithms from each class are experimentally tested on simulated data, realistic ecological data, and real datasets sourced from diverse applications. Notably, two novel datasets related to Information Technology monitoring are introduced within the set of considered real datasets. The experimental results underscore the robustness and effectiveness of the hybrid approaches across a broad spectrum of datasets.

URL: https://openreview.net/forum?id=PGLbZpVk2n

---

Title: GLASU: A Communication-Efficient Algorithm for Federated Learning with Vertically Distributed Graph Data

Authors: Xinwei Zhang, Mingyi Hong, Jie Chen

Abstract: Vertical federated learning (VFL) is a distributed learning paradigm, where computing clients collectively train a model based on the partial features of the same set of samples they possess. Current research on VFL focuses on the case when samples are independent, but it rarely addresses an emerging scenario when samples are interrelated through a graph. In this work, we train a graph neural network (GNN) through VFL, where each client owns a part of the node features and a different edge set. This data scenario incurs a significant communication overhead, not only because of the handling of distributed features but also due to neighborhood aggregation in a GNN. Moreover, the training analysis is faced with a challenge caused by the biased stochastic gradients. We propose a model-splitting method that splits a backbone GNN across the clients and the server and a communication-efficient algorithm, GLASU, to train such a model. GLASU adopts lazy aggregation and stale updates to skip communication in neighborhood aggregation and in model updates, respectively, greatly reducing communication while enjoying convergence guarantees. We conduct extensive numerical experiments on real-world datasets, showing that GLASU effectively trains a GNN that matches the accuracy of centralized training, while using only a fraction of the time due to communication savings.

URL: https://openreview.net/forum?id=LHl2I2rWZa

---

Title: Training Graph Neural Networks Subject to a Tight Lipschitz Constraint

Authors: Simona Ioana Juvina, Ana Antonia Neacșu, Jérôme Rony, Jean-Christophe Pesquet, Corneliu Burileanu, Ismail Ben Ayed

Abstract: We propose a strategy for training a wide range of graph neural networks (GNNs) under tight Lipschitz bound constraints.
Specifically, by leveraging graph spectral theory, we derive computationally tractable expressions of a tight Lipschitz constant. This allows us to propose a constrained-optimization approach to control the constant, ensuring robustness to adversarial perturbations. Unlike the existing methods for controlling the Lipschitz constant, our approach reduces the size of the handled matrices by a factor equal to the square of the number of nodes in the graph. We employ a stochastic projected subgradient algorithm, which operates in a block-coordinate manner, with the projection step performed via an accelerated iterative proximal algorithm.
We focus on defending against attacks that perturb features while keeping the topology of the graph constant. This contrasts with most of the existing defenses, which tackle perturbations of the graph structure. We report experiments on various datasets in the context of node classification tasks, showing the effectiveness of our constrained GNN model.

URL: https://openreview.net/forum?id=KLojVqdj2y

---

Title: GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

Authors: Mianchu Wang, Rui Yang, Xi Chen, Hao Sun, Meng Fang, Giovanni Montana

Abstract: Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that contains two key phases: (1) pretraining a prior policy capable of capturing the multi-modal action distribution within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for finetuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation, mitigating the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. With thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.

URL: https://openreview.net/forum?id=zOKAmm8R9B

---

Title: A Short Survey on Importance Weighting for Machine Learning

Authors: Masanari Kimura, Hideitsu Hino

Abstract: Importance weighting is a fundamental procedure in statistics and machine learning that weights the objective function or probability distribution based on the importance of the instance in some sense. The simplicity and usefulness of the idea have led to many applications of importance weighting. For example, it is known that supervised learning under an assumption about the difference between the training and test distributions, called distribution shift, can guarantee statistically desirable properties through importance weighting by the density ratio of the two distributions. This survey summarizes the broad applications of importance weighting in machine learning and related research.
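
As a concrete illustration of the covariate-shift use case mentioned above, the following Python sketch reweights each training loss term by an estimated test-to-train density ratio. The classifier-based ratio estimate is one common practical heuristic and an assumption here, not a method taken from the survey.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def density_ratio_weights(X_train, X_test):
        """Estimate w(x) = p_test(x) / p_train(x) with a probabilistic classifier
        that separates test inputs from training inputs."""
        X = np.vstack([X_train, X_test])
        z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
        clf = LogisticRegression(max_iter=1000).fit(X, z)
        p_test = clf.predict_proba(X_train)[:, 1]
        return (p_test / (1 - p_test)) * (len(X_train) / len(X_test))

    def weighted_squared_loss(y_true, y_pred, weights):
        """Importance-weighted empirical risk for a regression model on the training set."""
        return np.mean(weights * (y_true - y_pred) ** 2)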

URL: https://openreview.net/forum?id=IhXM3g2gxg

---

Title: Depth Scaling in Graph Neural Networks: Understanding the Flat Curve Behavior

Authors: Diana Gomes, Kyriakos Efthymiadis, Ann Nowe, Peter Vrancx

Abstract: Training deep Graph Neural Networks (GNNs) has proved to be a challenging task. A key goal of many new GNN architectures is to enable the depth scaling seen in other types of deep learning models. However, unlike deep learning methods in other domains, deep GNNs do not show significant performance boosts when compared to their shallow counterparts (resulting in a flat curve of performance over depth). In this work, we investigate some of the reasons why this goal of depth still eludes GNN researchers. We also question the effectiveness of current methods to train deep GNNs and show evidence of different types of pathological behavior in these networks. Our results suggest that current approaches hide the problems with deep GNNs rather than solve them, as current deep GNNs are only as discriminative as their respective shallow versions.

URL: https://openreview.net/forum?id=fdyHzoGT8g

---

Title: Understanding Smoothness of Vector Gaussian Processes on Product Spaces

Authors: Emilio Porcu, Ana Paula Peron, Eugenio Massa, Xavier Emery

Abstract: Vector Gaussian processes are becoming increasingly important in machine learning and statistics, with applications to many branches of applied sciences. Recent efforts have allowed us to understand smoothness in scalar Gaussian processes defined over manifolds as well as over product spaces involving manifolds.
Under assumptions of Gaussianity and mean-square continuity, the smoothness of a zero-mean scalar process is in one-to-one correspondence with the smoothness of the covariance kernel. Unfortunately, such a result is not available for vector-valued random fields, as the way each component in the covariance kernel contributes to the smoothness of the vector field is unclear.
This paper addresses the problem of quantifying smoothness of matrix-valued continuous kernels that are associated with mean-square continuous vector Gaussian processes defined over non-Euclidean product manifolds. After noting that a constructive RKHS approach is unsuitable for this specific task, we proceed through the analysis of spectral properties. Specifically, we find a spectral representation to quantify smoothness through Sobolev spaces that are adapted to certain measure spaces of product measures obtained through the tensor product of Haar measures with multivariate Gaussian measures. Our results allow smoothness to be measured in a simple way and open up the study of foundational properties of certain machine learning techniques over product spaces.

URL: https://openreview.net/forum?id=WHAmxfLjeJ

---

Title: MAGDiff: Covariate Data Set Shift Detection via Activation Graphs of Neural Networks

Authors: Charles Arnal, Felix Hensel, Mathieu Carrière, Théo Lacombe, Hiroaki Kurihara, Yuichi Ike, Frederic Chazal

Abstract: Despite their successful application to a variety of tasks, neural networks remain limited, like other machine learning methods, by their sensitivity to shifts in the data: their performance can be severely impacted by differences in distribution between the data on which they were trained and that on which they are deployed. In this article, we propose a new family of representations, called MAGDiff, that we extract from any given neural network classifier and that allows for efficient covariate data shift detection without the need to train a new model dedicated to this task. These representations are computed by comparing the activation graphs of the neural network for samples belonging to the training distribution and to the target distribution, and yield powerful data- and task-adapted statistics for the two-sample tests commonly used for data set shift detection. We demonstrate this empirically by measuring the statistical powers of two-sample Kolmogorov-Smirnov (KS) tests on several different data sets and shift types, and showing that our novel representations induce significant improvements over a state-of-the-art baseline relying on the network output.
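
For orientation, here is a minimal Python sketch of the downstream two-sample Kolmogorov-Smirnov test the abstract mentions, applied feature-wise to some network-derived statistic. The input here is just a placeholder matrix of per-sample statistics, not the MAGDiff construction itself, and the Bonferroni correction is an illustrative assumption.

    import numpy as np
    from scipy.stats import ks_2samp

    def shift_detected(ref_stats, target_stats, alpha=0.05):
        """Run a KS test per feature of a data- and task-adapted statistic and flag a
        shift if any Bonferroni-corrected p-value is significant."""
        ref_stats = np.asarray(ref_stats)          # shape (n_ref, n_features)
        target_stats = np.asarray(target_stats)    # shape (n_target, n_features)
        pvals = [ks_2samp(ref_stats[:, j], target_stats[:, j]).pvalue
                 for j in range(ref_stats.shape[1])]
        return min(pvals) < alpha / len(pvals), pvals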

URL: https://openreview.net/forum?id=kxHIK4x8qc

---

Title: Regret Bounds for Noise-Free Cascaded Kernelized Bandits

Authors: Zihan Li, Jonathan Scarlett

Abstract: We consider optimizing a function network in the noise-free grey-box setting with RKHS function classes, where the exact intermediate results are observable. We assume that the structure of the network is known (but not the underlying functions comprising it), and we study three types of structures: (1) chain: a cascade of scalar-valued functions, (2) multi-output chain: a cascade of vector-valued functions, and (3) feed-forward network: a fully connected feed-forward network of scalar-valued functions. We propose a sequential upper confidence bound based algorithm GPN-UCB along with a general theoretical upper bound on the cumulative regret. In addition, we propose a non-adaptive sampling based method along with its theoretical upper bound on the simple regret for the Mat\'ern kernel. We also provide algorithm-independent lower bounds on the simple regret and cumulative regret. Our regret bounds for GPN-UCB have the same dependence on the time horizon as the best known in the vanilla black-box setting, as well as near-optimal dependencies on other parameters (e.g., RKHS norm and network length).

URL: https://openreview.net/forum?id=oCfamUtecN

---

Title: The Interplay of Uncertainty Modeling and Deep Active Learning: An Empirical Analysis in Image Classification

Authors: Denis Huseljic, Marek Herde, Yannick Nagel, Lukas Rauch, Paulius Strimaitis, Bernhard Sick

Abstract: Deep active learning (AL) seeks to reduce the annotation costs required for training deep neural networks (DNNs). Often, deep AL strategies focus on instances where the predictive uncertainty of a DNN is high. Furthermore, Bayesian concepts to model uncertainty are frequently adopted. Despite considerable research, a detailed analysis of the role of uncertainty in deep AL is still missing, especially regarding aleatoric and epistemic uncertainty, both related to predictive uncertainty. This article provides an in-depth empirical study analyzing the interplay of uncertainty and deep AL in image classification. Our study investigates four hypotheses that provide an intuitive understanding of the effects of accurately estimating aleatoric and epistemic uncertainty on existing uncertainty-based AL strategies but also, in the opposite direction, the impact of uncertainty-based AL on the quality of uncertainty estimates that are needed in many applications. By analyzing these hypotheses on synthetic and real-world data, we find that accurate aleatoric estimates can even impair instance selection, while accurate epistemic estimates have negligible effects. Moreover, we provide a publicly available toolbox for deep AL with various models and strategies to facilitate further research and practical applications. Code is available at github.com/anonymous.

URL: https://openreview.net/forum?id=KLBD13bsVl

---

Title: InduCE: Inductive Counterfactual Explanations for Graph Neural Networks

Authors: Samidha Verma, Burouj Armgaan, Sourav Medya, Sayan Ranu

Abstract: Graph neural networks (GNNs) drive several real-world applications including drug discovery, recommendation engines, and chip design. Unfortunately, GNNs are black boxes since they do not allow human-intelligible explanations of their predictions. Counterfactual reasoning is an effort to overcome this limitation. Specifically, the objective is to minimally perturb the input graph to a GNN so that its prediction changes. While several algorithms have been proposed for counterfactual explanations of GNNs, the majority suffer from three key limitations: (1) they only consider perturbations in the form of deletions of existing edges, (2) they perform an inefficient exploration of the combinatorial search space, and (3) the counterfactual explanation model is transductive in nature, i.e., it does not generalize to unseen data. In this work, we propose an inductive algorithm called InduCE that overcomes these limitations. Through extensive experiments on graph datasets, we show that incorporating edge additions and modelling the marginal effect of perturbations aid in generating better counterfactuals among the available recourses. Furthermore, inductive modeling enables InduCE to directly predict counterfactual perturbations without requiring instance-specific training. This leads to significant computational speed-up over baselines and allows counterfactual analyses for GNNs at scale.

URL: https://openreview.net/forum?id=RZPN8cgqST

---

Title: G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks

Authors: Zhaoyu Li, Jinpei Guo, Xujie Si

Abstract: Graph neural networks (GNNs) have recently emerged as a promising approach for solving the Boolean Satisfiability Problem (SAT), offering potential alternatives to traditional backtracking or local search SAT solvers. However, despite the growing volume of literature in this field, there remains a notable absence of a unified dataset and a fair benchmark to evaluate and compare existing approaches. To address this crucial gap, we present G4SATBench, the first benchmark study that establishes a comprehensive evaluation framework for GNN-based SAT solvers. In G4SATBench, we meticulously curate a large and diverse set of SAT datasets comprising 7 problems with 3 difficulty levels and benchmark a broad range of GNN models across various prediction tasks, training objectives, and inference algorithms. To explore the learning abilities and comprehend the strengths and limitations of GNN-based SAT solvers, we also compare their solving processes with the heuristics in search-based SAT solvers. Our empirical results provide valuable insights into the performance of GNN-based SAT solvers and further suggest that existing GNN models can effectively learn a solving strategy akin to greedy local search but struggle to learn backtracking search in the latent space. Our codebase is available at https://github.com/zhaoyu-li/G4SATBench.

URL: https://openreview.net/forum?id=7VB5db72lr

---

Title: Efficient Parallelized Simulation of Cyber-Physical Systems

Authors: Bas van der Heijden, Laura Ferranti, Jens Kober, Robert Babuska

Abstract: Advancements in accelerated physics simulations have greatly reduced training times for reinforcement learning policies, yet the conventional step-by-step agent-simulator interaction undermines simulation accuracy. In the real world, interactions are asynchronous, with sensing, acting, and processing happening simultaneously. Failing to capture this widens the sim2real gap and results in suboptimal real-world performance. In this paper, we address the challenges of simulating realistic asynchronicity and delays within parallelized simulations, crucial to bridging the sim2real gap in complex cyber-physical systems. Our approach efficiently parallelizes cyber-physical system simulations on accelerator hardware, including physics, sensors, actuators, processing components and their asynchronous interactions. We extend existing accelerated physics simulations with latency simulation capabilities by constructing a `supergraph' that encodes all data dependencies across parallelized simulation steps, ensuring accurate simulation. By finding the smallest supergraph, we minimize redundant computation. We validate our approach on two real-world systems and perform an extensive ablation, demonstrating superior performance compared to baseline methods.

URL: https://openreview.net/forum?id=VzKXbCzNoU

---

Title: PaDPaF: Partial Disentanglement with Partially-Federated GANs

Authors: Abdulla Jasem Almansoori, Samuel Horváth, Martin Takáč

Abstract: Federated learning has become a popular machine learning paradigm with many potential real-life applications, including recommendation systems, the Internet of Things (IoT), healthcare, and self-driving cars. Though most current applications focus on classification-based tasks, learning personalized generative models remains largely unexplored, and their benefits in the heterogeneous setting still need to be better understood. This work proposes a novel architecture combining global client-agnostic and local client-specific generative models. We show that using standard techniques for training federated models, our proposed model achieves privacy and personalization by implicitly disentangling the globally-consistent representation (i.e. content) from the client-dependent variations (i.e. style). Using such decomposition, personalized models can generate locally unseen labels while preserving the given style of the client and can predict the labels for all clients with high accuracy by training a simple linear classifier on the global content features. Furthermore, disentanglement enables other essential applications, such as data anonymization, by sharing only the content. Extensive experimental evaluation corroborates our findings, and we also discuss a theoretical motivation for the proposed approach.

URL: https://openreview.net/forum?id=vsez76EAV8

---

Title: Archetypal Analysis++: Rethinking the Initialization Strategy

Authors: Sebastian Mair, Jens Sjölund

Abstract: Archetypal analysis is a matrix factorization method with convexity constraints. Due to local minima, a good initialization is essential, but frequently used initialization methods yield either sub-optimal starting points or are prone to get stuck in poor local minima. In this paper, we propose archetypal analysis++ (AA++), a probabilistic initialization strategy for archetypal analysis that sequentially samples points based on their influence on the objective function, similar to $k$-means++. In fact, we argue that $k$-means++ already approximates the proposed initialization method. Furthermore, we suggest to adapt an efficient Monte Carlo approximation of $k$-means++ to AA++. In an extensive empirical evaluation of 15 real-world data sets of varying sizes and dimensionalities and considering two pre-processing strategies, we show that AA++ almost always outperforms all baselines, including the most frequently used ones.
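
To ground the analogy the abstract draws, here is a minimal numpy sketch of the $k$-means++-style D$^2$ seeding that the paper argues already approximates AA++: each new candidate is drawn with probability proportional to its squared distance to the points chosen so far. The exact AA++ influence score differs from this, so treat it only as the analogous baseline.

    import numpy as np

    def plus_plus_init(X, k, rng=np.random.default_rng(0)):
        """k-means++-style seeding: sample each new candidate with probability
        proportional to its squared distance to the current selection."""
        n = X.shape[0]
        chosen = [rng.integers(n)]                      # first point drawn uniformly
        d2 = np.sum((X - X[chosen[0]]) ** 2, axis=1)
        for _ in range(k - 1):
            probs = d2 / d2.sum()
            idx = rng.choice(n, p=probs)
            chosen.append(idx)
            d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
        return X[chosen]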

URL: https://openreview.net/forum?id=KVUtlM60HM

---

Title: [Re] On the Reproducibility of Post-Hoc Concept Bottleneck Models

Authors: Nesta Midavaine, Gregory Hok Tjoan Go, Diego Canez, Ioana Simion, Satchit Chatterji

Abstract: To obtain state-of-the-art performance, many deeper artificial intelligence models sacrifice human explainability in their decision-making. One solution proposed for achieving top performance and retaining explainability is the Post-Hoc Concept Bottleneck Model (PCBM) (Yuksekgonul et al., 2023), which can convert the embeddings of any deep neural network into a set of human-interpretable concept weights. In this work, we reproduce and expand upon the findings of Yuksekgonul et al. (2023). Our results show that while most of the authors’ claims and results hold, some of the results they obtained could not be sufficiently replicated. Specifically, the claims relating to PCBM performance preservation and its non-requirement of labeled concept datasets were generally reproduced, whereas the one claiming its model editing capabilities was not. Beyond these results, our contributions to their work include evidence that PCBMs may work for audio classification problems, verification of the interpretability of their methods, and updates to their code for missing implementations. The code for our implementations can be found at https://github.com/dgcnz/FACT.

URL: https://openreview.net/forum?id=8UfhCZjOV7

---

Title: Data Pruning Can Do More: A Comprehensive Data Pruning Approach for Object Re-identification

Authors: Zi Yang, Haojin Yang, Soumajit Majumder, Jorge Cardoso, Guillermo Gallego

Abstract: Previous studies have demonstrated that not every sample in a dataset is of equal importance during training. Data pruning aims to remove less important or informative samples while still achieving comparable results as training on the original (untruncated) dataset, thereby reducing storage and training costs. However, the majority of data pruning methods are applied to image classification tasks. To our knowledge, this work is the first to explore the feasibility of these pruning methods applied to object re-identification (ReID) tasks, while also presenting a more comprehensive data pruning approach. By fully leveraging the logit history during training, our approach offers a more accurate and comprehensive metric for quantifying sample importance, as well as correcting mislabeled samples and recognizing outliers. Furthermore, our approach is highly efficient, reducing the cost of importance score estimation by 10 times compared to existing methods. Our approach is a plug-and-play, architecture-agnostic framework that can eliminate/reduce 35%, 30%, and 5% of samples/training time on the VeRi, MSMT17 and Market1501 datasets, respectively, with negligible loss in accuracy (< 0.1%). The lists of important, mislabeled, and outlier samples from these ReID datasets are available at https://github.com/Zi-Y/data-pruning-reid.

URL: https://openreview.net/forum?id=vxxi7xzzn7

---

Title: Scaling Vision-and-Language Navigation With Offline RL

Authors: Valay Bundele, Mahesh Bhupati, Biplab Banerjee, Aditya Grover

Abstract: The study of vision-and-language navigation (VLN) has typically relied on expert trajectories, which may not always be available in real-world situations due to the significant effort required to collect them. On the other hand, existing approaches to training VLN agents that go beyond available expert data involve data augmentations or online exploration which can be tedious and risky. In contrast, it is easy to access large repositories of suboptimal offline trajectories. Inspired by research in offline reinforcement learning (ORL), we introduce a new problem setup of VLN-ORL which studies VLN using suboptimal demonstration data. We introduce a simple and effective reward-conditioned approach that can account for dataset suboptimality for training VLN agents, as well as benchmarks to evaluate progress and promote research in this area. We empirically study various noise models for characterizing dataset suboptimality among other unique challenges in VLN-ORL and instantiate it for the VLN⟳BERT and MTVM architectures in the R2R and RxR environments. Our experiments demonstrate that the proposed reward-conditioned approach leads to significant performance improvements, even in complex and intricate environments.

URL: https://openreview.net/forum?id=kPIU8PnJPo

---

Title: Momentum-Based Policy Gradient with Second-Order Information

Authors: Saber Salehkaleybar, Mohammadsadegh Khorasani, Negar Kiyavash, Niao He, Patrick Thiran

Abstract: Variance-reduced gradient estimators for policy gradient methods have been one of the main focuses of reinforcement learning research in recent years, as they allow acceleration of the estimation process. We propose a variance-reduced policy-gradient method, called SHARP, which incorporates second-order information into stochastic gradient descent (SGD) using momentum with a time-varying learning rate. The SHARP algorithm is parameter-free, achieving an $\epsilon$-approximate first-order stationary point with $O(\epsilon^{-3})$ trajectories, while using a batch size of $O(1)$ at each iteration. Unlike most previous work, our proposed algorithm does not require importance sampling, which can compromise the advantage of the variance reduction process. Moreover, the variance of the estimation error decays at the fast rate of $O(1/t^{2/3})$, where $t$ is the number of iterations. Our extensive experimental evaluations show the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.

URL: https://openreview.net/forum?id=2bURaH6RN8

---

Title: Offline Reinforcement Learning via Tsallis Regularization

Authors: Lingwei Zhu, Matthew Kyle Schlegel, Han Wang, Martha White

Abstract: Offline reinforcement learning (RL) focuses on learning a good policy from a fixed dataset. The dataset is generated by an unknown behavior policy through interactions with the environment and contains only a subset of the state-action spaces. Standard off-policy algorithms often perform poorly in this setting, suffering from erroneously optimistic values incurred by the out-of-distribution (OOD) actions not present in the dataset. The optimism cannot be corrected, as no further interaction with the environment is possible. Imposing divergence regularization and in-sample constraints are among the most popular methods for overcoming the issue by ensuring that the learned policy stays close to the behavior policy to minimize the occurrence of OOD actions. This paper proposes Tsallis regularization for offline RL, which aligns the induced sparsemax policies to the in-sample constraint. Sparsemax interpolates existing methods utilizing hard-max and softmax policies, in that only a subset of actions contributes non-zero action probability, as compared to softmax (all actions) and hard-max (a single action). We leverage this property to model the behavior policy and show that, under several assumptions, the learned sparsemax policies may have sparsity-conditional KL divergence to the behavior policy, making Tsallis regularization especially suitable for Behavior Cloning methods. We propose a novel actor-critic algorithm, Tsallis Advantage Weighted Actor-Critic (Tsallis AWAC), generalizing AWAC, and analyze its performance in standard Mujoco environments. Our code is available at \url{https://github.com/lingweizhu/tsallis_regularization}.
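
For readers unfamiliar with sparsemax, here is a minimal numpy sketch of the standard transformation (Martins & Astudillo, 2016) that the abstract refers to, which assigns exactly zero probability to low-scoring actions; the Tsallis-regularized actor-critic built on top of it is not shown.

    import numpy as np

    def sparsemax(z):
        """Euclidean projection of a score vector onto the probability simplex.
        Unlike softmax, clearly dominated entries get exactly zero probability."""
        z_sorted = np.sort(z)[::-1]
        cumsum = np.cumsum(z_sorted)
        ks = np.arange(1, len(z) + 1)
        support = 1 + ks * z_sorted > cumsum          # entries that stay in the support
        k = ks[support][-1]
        tau = (cumsum[k - 1] - 1) / k                 # threshold shared by the support
        return np.maximum(z - tau, 0.0)

    # softmax would keep all three actions; sparsemax drops the dominated ones.
    print(sparsemax(np.array([2.0, 1.0, -1.0])))      # -> [1., 0., 0.]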

URL: https://openreview.net/forum?id=HNqEKZDDRc

---

Title: A Survey on Transferability of Adversarial Examples Across Deep Neural Networks

Authors: Jindong Gu, Xiaojun Jia, Pau de Jorge, Wenqian Yu, Xinwei Liu, Avery Ma, Yuan Xun, Anjun Hu, Ashkan Khakzar, Zhijiang Li, Xiaochun Cao, Philip Torr

Abstract: The emergence of Deep Neural Networks (DNNs) has revolutionized various domains by enabling the resolution of complex tasks spanning image recognition, natural language processing, and scientific problem-solving. However, this progress has also brought to light a concerning vulnerability: adversarial examples. These crafted inputs, imperceptible to humans, can manipulate machine learning models into making erroneous predictions, raising concerns for safety-critical applications. An intriguing property of this phenomenon is the transferability of adversarial examples, where perturbations crafted for one model can deceive another, often with a different architecture. This property enables ``black-box'' attacks, which circumvent the need for detailed knowledge of the target model. This survey explores the landscape of the transferability of adversarial examples. We categorize existing methodologies to enhance adversarial transferability and discuss the fundamental principles guiding each approach. While the predominant body of research primarily concentrates on image classification, we also extend our discussion to encompass other vision tasks and beyond. Challenges and opportunities are discussed, highlighting the importance of fortifying DNNs against adversarial vulnerabilities in an evolving landscape.

URL: https://openreview.net/forum?id=AYJ3m7BocI

---

Title: Solving Inverse Problems with Model Mismatch using Untrained Neural Networks within Model-based Architectures

Authors: Peimeng Guan, Naveed Iqbal, Mark A. Davenport, Mudassir Masood

Abstract: Model-based deep learning methods such as loop unrolling (LU) and deep equilibrium model (DEQ) extensions offer outstanding performance in solving inverse problems (IP). These methods unroll the optimization iterations into a sequence of neural networks that in effect learn a regularization function from data. While these architectures are currently state-of-the-art in numerous applications, their success heavily relies on the accuracy of the forward model. This assumption can be limiting in many physical applications due to model simplifications or uncertainties in the apparatus. To address forward model mismatch, we introduce an untrained forward model residual block within the model-based architecture to match the data consistency in the measurement domain for each instance. We propose two variants in well-known model-based architectures (LU and DEQ) and prove convergence under mild conditions. Our approach offers a unified solution that is less parameter-sensitive, requires no additional data, and enables simultaneous fitting of the forward model and reconstruction in a single pass, benefiting both linear and nonlinear inverse problems. The experiments show significant quality improvement in removing artifacts and preserving details across three distinct applications, encompassing both linear and nonlinear inverse problems. Moreover, we highlight reconstruction effectiveness in intermediate steps and showcase robustness to random initialization of the residual block and a higher number of iterations during evaluation.

URL: https://openreview.net/forum?id=XHEhjDxPDl

---


New submissions
===============


Title: Learning a Decision Tree Algorithm with Transformers

Abstract: Decision trees are renowned for their ability to achieve high predictive performance while remaining interpretable, especially on tabular data. Traditionally, they are constructed through recursive algorithms, where they partition the data at every node in a tree. However, identifying a good partition is challenging, as decision trees optimized for local segments may not bring global generalization. To address this, we introduce MetaTree, which trains a Transformer-based model on outputs from classical algorithms to directly produce strong decision trees in a meta-learning manner. Specifically, we fit both greedy decision trees and optimized decision trees on a large number of datasets. We then train MetaTree to produce the trees that achieve strong generalization performance. This training enables MetaTree to emulate these algorithms and intelligently adapt its strategy according to the context, thereby achieving superior generalization performance.

URL: https://openreview.net/forum?id=1Kzzm22usl

---

Title: M$^3$PL: Identifying and Exploiting View Bias of Prompt Learning

Abstract: Prompt learning is an effective means of fine-tuning multi-modal foundation models such as CLIP. Despite existing success, the inner mechanism of multi-modal prompt learning has not been well understood. In this work, we identify an inductive bias of multi-modal prompt learning, which we refer to as view bias, that the learned prompts may extract only a partial subset of useful features (views) and ignore others. This bias can undermine the model's generalization ability, particularly under distribution shifts. We further observe that independently trained prompts have distinct view biases, contrary to the existing belief that they may converge to similar local optima due to having the same cross-modal representation matching objective. Based on our observations, we propose Multi-modal Matching Multi-Prompt Learning (M$^3$PL), which incorporates multiple paired prompts and a cross-modal contrastive regularizer that facilitates the prompt pairs to encapsulate a broader spectrum of views. Extensive experiments show that M$^3$PL effectively boosts the model's generalization capability, achieving state-of-the-art performance under various distribution shifts.

URL: https://openreview.net/forum?id=2rnTIBm19V

---

Title: Variational Autoencoding of Dental Point Clouds

Abstract: Digital dentistry has made significant advancements, yet numerous challenges remain. This paper introduces the FDI 16 dataset, an extensive collection of tooth meshes and point clouds. Additionally, we present a novel approach: Variational FoldingNet (VF-Net), a fully probabilistic variational autoencoder for point clouds. Notably, prior latent variable models for point clouds lack a one-to-one correspondence between input and output points. Instead, they rely on optimizing Chamfer distances, a metric that lacks a normalized distributional counterpart, rendering it unsuitable for probabilistic modeling. We replace the explicit minimization of Chamfer distances with a suitable encoder, increasing computational efficiency while simplifying the probabilistic extension. This allows for straightforward application in various tasks, including mesh generation, shape completion, and representation learning. Empirically, we provide evidence of lower reconstruction error in dental reconstruction and interpolation, showcasing state-of-the-art performance in dental sample generation while identifying valuable latent representations.

URL: https://openreview.net/forum?id=nH416rLxtI

---

Title: Training Wasserstein GANs without Gradient Penalties via Duality

Abstract: We propose a new method for training Wasserstein generative adversarial networks (WGANs) without using gradient penalties. We develop a novel approach to accurately estimate the Wasserstein distance between datasets, specifically tailored for the generative setting. Instead of employing the commonly used gradient penalty term in the WGAN training procedure, we introduce two objective functions that utilize the $c$-transform based on Kantorovich duality, which is a fundamental property in optimal transport theory. Through our experiments, we observe that this algorithm effectively enforces the Lipschitz constraint on the discriminator, paving the way for understanding the optimal transport problem via a deep learning approach. As a result, our method provides an accurate estimation not only for the optimal discriminator but also for the Wasserstein distance between the true and generated distribution. Notably, our method eliminates the need for gradient penalties and corresponding hyperparameter tuning. Moreover, we demonstrate its effectiveness by generating competitive synthetic images using various datasets such as MNIST, Fashion-MNIST, CIFAR-10, and CelebA-HQ.

URL: https://openreview.net/forum?id=vN3eAGHhpC

---

Title: Identifying and Clustering Counter Relationships of Team Compositions in PvP Games for Efficient Balance Analysis

Abstract: $\textbf{How can balance be quantified in game settings?}$ This question is crucial for game designers, especially in player-versus-player (PvP) games, where analyzing the strength relations among predefined team compositions—such as hero combinations in MOBA games or decks in card games—is essential for enhancing gameplay and achieving balance. We have developed two advanced measures that extend beyond the simplistic win rate to quantify balance in competitive scenarios. These measures are derived from win value estimations, which employ strength rating approximations via the Bradley-Terry model and counter relationship approximations via vector quantization, significantly reducing the computational complexity associated with traditional win value estimations.
Throughout the learning process of these models, we identify useful categories of compositions and pinpoint their counter relationships, aligning with the experiences of human players without requiring specific game knowledge. Our methodology, underpinned by a novel discrete representation learning technique, enhances codebook utilization in a deterministic vector quantization process with an extremely small state space.
Our framework has been validated in popular online games, including $\textit{Age of Empires II}$, $\textit{Hearthstone}$, $\textit{Brawl Stars}$, and $\textit{League of Legends}$. The accuracy of the observed strength relations in these games is comparable to traditional pairwise win value predictions, while also offering a more manageable complexity for analysis. Ultimately, our findings contribute to a deeper understanding of PvP game dynamics and present a methodology that significantly improves game balance evaluation and design.
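
For illustration, a minimal sketch (not the authors' code) of estimating Bradley-Terry strengths from pairwise win counts with the classic minorization-maximization update; the toy cycle below also shows why raw strengths alone cannot express counter relationships:

import numpy as np

def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a matrix of pairwise win counts.

    wins[i, j] = number of times composition i beat composition j.
    Uses the classic MM (minorization-maximization) update.
    """
    n = wins.shape[0]
    games = wins + wins.T                 # total games played between each pair
    total_wins = wins.sum(axis=1)
    p = np.ones(n)                        # initial strengths
    for _ in range(n_iter):
        denom = (games / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
        p = total_wins / (denom + 1e-12)
        p /= p.sum()                      # fix the scale (strengths are relative)
    return p

# toy example: 0 tends to beat 1, 1 beats 2, 2 beats 0 (a counter cycle);
# the fitted strengths come out equal, so counters need a separate model
wins = np.array([[0, 8, 2],
                 [2, 0, 8],
                 [8, 2, 0]], dtype=float)
print(fit_bradley_terry(wins))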

URL: https://openreview.net/forum?id=2D36otXvBE

---

Title: Model Recycling Framework for Multi-Source Data-Free Supervised Transfer Learning

Abstract: Increasing concerns for data privacy and other difficulties associated with retrieving source data for model training have created the need for source-free transfer learning, in which one only has access to pre-trained models instead of data from the original source domains. This setting introduces many challenges, as most existing transfer learning methods require the availability of source data, and thus cannot be directly adapted to the source-free scenario. Further, practical concerns make it more difficult, for instance efficiently selecting models for transfer without information on source data, and transferring without full access to the source models. So motivated, we propose a model recycling framework for parameter-efficient training of models that identifies subsets of related source models to reuse in both white-box and black-box settings. Consequently, our framework makes it possible for Model as a Service (MaaS) providers to build libraries of efficient pre-trained models, thus creating an opportunity for multi-source data-free supervised transfer learning.

URL: https://openreview.net/forum?id=CuPQhnRJFs

---

Title: Generalized Category Discovery under the Long-Tailed Distribution

Abstract: Generalized Category Discovery (GCD) is a challenging problem, which aims at discovering novel categories in unlabelled data by transferring the knowledge from the labelled data. Existing methods often assume a uniform distribution of categories, which is not representative of real-world data that typically exhibits a long-tailed distribution. In this paper, we address the problem of GCD under a long-tailed distribution. Our approach introduces a novel framework that tackles the challenges of biased classifier learning and imprecise class number estimation. We propose adaptive sample selection based on confidence and density, balancing the model's training distribution and mitigating bias. Additionally, we present a density-peak-based method for accurate class number estimation in long-tailed settings. Experimental results demonstrate the effectiveness of our approach in discovering novel categories and outperforming state-of-the-art methods.

URL: https://openreview.net/forum?id=0CIS2nthtK

---

Title: Towards Measuring Predictability: To which extent data-driven approaches can extract deterministic relations from data exemplified with time series prediction and classification

Abstract: Minimizing loss functions is one important ingredient for machine learning to fit parameters such that the machine learning models extract relations hidden in the data. The smaller the loss function value on various splittings of a dataset, the better the machine learning model is assumed to perform. However, datasets are usually generated by dynamics consisting of deterministic components where relations are clearly defined and consequently learnable as well as stochastic parts where outcomes are random and thus not predictable. Depending on the amplitude of the deterministic and stochastic processes, the best achievable loss function value varies and is usually not known in real data science scenarios.
In this research, a statistical framework is developed that provides measures to address the predictability of a target given the available input data and, after training a machine learning model, how much of the deterministic relations the model has missed. Consequently, the presented framework makes it possible to differentiate model errors into parts that are unpredictable given the input and systematic misses of deterministic relations. The work extends the definition of model success or failure as well as convergence of a training process. Moreover, it is demonstrated how such measures can enrich the procedure of model training and guide the combination of different models.
The framework is showcased with time series data on different synthetic and real world datasets.
The implementation of the models used and the measures for quantifying the deterministic relations is provided via the git repository .... (the repository will be published and the link will be provided in case of acceptance; for the review process it is provided as a supplementary zip-file)

URL: https://openreview.net/forum?id=jZBAVFGUUo

---

Title: Time series recovery from partial observations via Nonnegative Matrix Factorization

Abstract: In the modern analysis of time series, one may want to forecast several hundreds or thousands of time series with possibly missing entries. We introduce a novel algorithm to address these issues, referred to as the Sliding Mask Method (SMM). SMM is based on predicting a time window using completion of nonnegative matrices. The procedure combines nonnegative factorization and matrix completion with hidden values (i.e., a partially observed matrix). From a theoretical point of view, we prove statistical guarantees on the uniqueness and robustness of the solutions to the completion of partially observed nonnegative matrices. From a numerical point of view, we present experiments on real-world and synthetic datasets confirming the forecasting accuracy of the novel methodology.
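
For illustration, a minimal sketch (not the authors' algorithm) of the underlying building block, nonnegative matrix factorization with a binary observation mask via multiplicative updates; the sliding-mask forecasting procedure itself is omitted:

import numpy as np

def masked_nmf(X, M, rank=5, n_iter=500, eps=1e-9):
    """Factorize a partially observed nonnegative matrix X ~ W @ H.

    M is a binary mask (1 = observed, 0 = missing); missing positions of X
    should hold any finite placeholder (e.g., zero), since they are masked out.
    Only observed entries drive the multiplicative updates, and missing values
    are imputed by the low-rank reconstruction W @ H.
    """
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    Xm = M * X
    for _ in range(n_iter):
        WH = M * (W @ H)
        H *= (W.T @ Xm) / (W.T @ WH + eps)
        WH = M * (W @ H)
        W *= (Xm @ H.T) / (WH @ H.T + eps)
    return W, H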

URL: https://openreview.net/forum?id=to1OL4qBku

---

Title: Do not trust what you trust: Miscalibration in Semisupervised Learning

Abstract: State-of-the-art semi-supervised learning (SSL) approaches rely on highly confident predictions to serve as pseudo-labels that guide the training on unlabeled samples. An inherent drawback of this strategy stems from the quality of the uncertainty estimates, as pseudo-labels are filtered only based on their degree of uncertainty, regardless of the correctness of their predictions. Thus, assessing and enhancing the uncertainty of network predictions is of paramount importance in the pseudo-labeling process. In this work, we empirically demonstrate that SSL methods based on pseudo-labels are significantly miscalibrated, and formally demonstrate the minimization of the min-entropy, a lower bound of the Shannon entropy, as a potential cause for miscalibration. To alleviate this issue, we integrate a simple penalty term, which enforces the logit distances of the predictions on unlabeled samples to remain low, preventing the network predictions from becoming overconfident. Comprehensive experiments on a variety of SSL image classification benchmarks demonstrate that the proposed solution systematically improves the calibration performance of relevant SSL models, while also enhancing their discriminative power, making it an appealing addition for tackling SSL tasks.
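
A schematic sketch in PyTorch (the exact penalty and hyperparameters differ from the paper) of a pseudo-labeling loss augmented with a margin penalty on the unlabeled-sample logit distances:

import torch
import torch.nn.functional as F

def logit_distance_penalty(logits, margin=10.0):
    """Schematic penalty discouraging overconfident predictions on unlabeled data.

    Penalizes the distance between the top logit and the remaining logits
    whenever it exceeds `margin` (the margin value here is arbitrary).
    """
    top = logits.max(dim=1, keepdim=True).values
    dist = top - logits                      # >= 0, zero for the argmax class
    return F.relu(dist - margin).mean()

def ssl_loss(logits_lab, y_lab, logits_unlab, conf_thresh=0.95, lam=0.1):
    """Supervised CE + confidence-filtered pseudo-label CE + calibration penalty."""
    sup = F.cross_entropy(logits_lab, y_lab)
    probs = logits_unlab.softmax(dim=1)
    conf, pseudo = probs.max(dim=1)
    mask = (conf >= conf_thresh).float()
    unsup = (F.cross_entropy(logits_unlab, pseudo, reduction="none") * mask).mean()
    return sup + unsup + lam * logit_distance_penalty(logits_unlab)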

URL: https://openreview.net/forum?id=1WqLLYgBNt

---

Title: On the Detection of Reviewer-Author Collusion Rings From Paper Bidding

Abstract: A major threat to the peer-review systems of computer science conferences is the existence of "collusion rings" between reviewers. In such collusion rings, reviewers who have also submitted their own papers to the conference work together to manipulate the conference's paper assignment, with the aim of being assigned to review each other's papers. The most straightforward way that colluding reviewers can manipulate the paper assignment is by indicating their interest in each other's papers through strategic paper bidding. One potential approach to solve this important problem would be to detect the colluding reviewers from their manipulated bids, after which the conference can take appropriate action. While prior work has developed effective techniques to detect other kinds of fraud, no research has yet established that detecting collusion rings is even possible. In this work, we tackle the question of whether it is feasible to detect collusion rings from the paper bidding. To answer this question, we conduct empirical analysis of two realistic conference bidding datasets, including evaluations of existing algorithms for fraud detection in other applications. We find that collusion rings can achieve considerable success at manipulating the paper assignment while remaining hidden from detection: for example, in one dataset, undetected colluders are able to achieve assignment to up to 30% of the papers authored by other colluders. In addition, when 10 colluders bid on all of each other's papers, no detection algorithm outputs a group of reviewers with more than 31% overlap with the true colluders. These results suggest that collusion cannot be effectively detected from the bidding using popular existing tools, demonstrating the need to develop more complex detection algorithms as well as those that leverage additional metadata (e.g., reviewer-paper text-similarity scores).

URL: https://openreview.net/forum?id=o58uy91V2V

---

Title: Towards Size-Independent Generalization Bounds for Deep Operator Nets

Abstract: In recent times, machine learning methods have made significant advances as tools for analyzing physical systems. A particularly active area in this theme has been ``physics-informed machine learning'', which focuses on using neural nets for numerically solving differential equations. In this work, we aim to advance the theory of measuring out-of-sample error while training DeepONets -- which is among the most versatile ways to solve PDE systems in one shot.

Firstly, for a class of DeepONets, we prove a bound on their Rademacher complexity which does not explicitly scale with the width of the nets involved. Secondly, we use this to show how the Huber loss can be chosen so that for these DeepONet classes generalization error bounds can be obtained that have no explicit dependence on the size of the nets. The effective capacity measure for DeepONets that we thus derive is also shown to correlate with the behavior of generalization error in experiments.

URL: https://openreview.net/forum?id=21kO0u6LN0

---

Title: Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach

Abstract: In practical statistical causal discovery (SCD), embedding domain expert knowledge as constraints into the algorithm is important for creating consistent, meaningful causal models, despite the challenges of systematically acquiring the background knowledge. To overcome these challenges, this paper proposes a novel methodology for causal inference in which SCD methods and knowledge-based causal inference (KBCI) with a large language model (LLM) are synthesized through “statistical causal prompting (SCP)” for LLMs and prior knowledge augmentation for SCD. Experiments reveal that GPT-4 brings both the LLM-KBCI output and the SCD result augmented with LLM-KBCI prior knowledge closer to the ground truth, and that the SCD result improves further when GPT-4 undergoes SCP. Furthermore, using an unpublished real-world dataset, we demonstrate that the background knowledge provided by the LLM can improve SCD on this dataset, even though the dataset has never been included in the LLM's training data. The proposed approach can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains.

URL: https://openreview.net/forum?id=rJRmspwYuW

---

Title: VideoGLUE: Video General Understanding Evaluation of Foundation Models

Abstract: We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs’ efficacy and efficiency when adapting to general video understanding tasks using various cost measurements under different scenarios, namely training, inference, and storage. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. We will release our code upon acceptance.

URL: https://openreview.net/forum?id=wnI4sJtjqL

---

Title: What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

Abstract: In-Context Learning (ICL) has been found effective across a wide range of applications, where Large Language Models (LLMs) learn to complete tasks from the examples in the prompt without tuning their parameters. In this work, we conduct a comprehensive study to understand ICL from a statistical perspective. First, we show that perfectly pretrained LLMs perform Bayesian Model Averaging (BMA) for ICL under a dynamic model of the examples in the prompt. The average error analysis for ICL is then built for perfectly pretrained LLMs with the analysis of BMA. Second, we demonstrate how the attention structure boosts the BMA implementation. With sufficient examples in the prompt, attention is proven to perform BMA under the Gaussian linear ICL model, which also motivates the explicit construction of the hidden concepts from the attention-head values. Finally, we analyze the pretraining behavior of LLMs. The pretraining error is decomposed into a generalization error and an approximation error, which are bounded separately. The ICL average error of the pretrained LLMs is then shown to be the sum of $O(T^{-1})$ and the pretraining error. In addition, we analyze the ICL performance of pretrained LLMs with misspecified examples. The theoretical findings are corroborated by the experimental results.

URL: https://openreview.net/forum?id=inpkC8UrDu

---

Title: Fairness Under Demographic Scarce Regime

Abstract: Most existing approaches to fairness assume that the model has full access to demographic information. However, there exist scenarios where demographic information is partially available because a record was not maintained throughout data collection or for privacy reasons. This setting is known as the demographic scarce regime. Prior research has shown that training an attribute classifier to replace the missing sensitive attributes (proxy) can still improve fairness. However, using proxy-sensitive attributes worsens fairness-accuracy trade-offs compared to true sensitive attributes. To address this limitation, we propose a framework to build attribute classifiers that achieve better fairness-accuracy trade-offs. Our method introduces uncertainty awareness in the attribute classifier and enforces fairness on samples with demographic information inferred with the lowest uncertainty. We show empirically that enforcing fairness constraints on samples with uncertain sensitive attributes is detrimental to fairness and accuracy. Our experiments on five datasets show that the proposed framework yields models with significantly better fairness-accuracy trade-offs than classic attribute classifiers. Surprisingly, our framework can outperform models trained with fairness constraints on the true sensitive attributes in most benchmarks.
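
A schematic sketch (simplified relative to the paper) of the uncertainty-aware selection idea: fairness is assessed only on samples whose proxy sensitive attribute is inferred with low entropy:

import numpy as np

def low_uncertainty_mask(attr_probs, quantile=0.5):
    """Keep samples whose inferred sensitive attribute has the lowest entropy."""
    entropy = -(attr_probs * np.log(attr_probs + 1e-12)).sum(axis=1)
    return entropy <= np.quantile(entropy, quantile)

def demographic_parity_gap(preds, groups, mask):
    """Difference in positive-prediction rates, computed only on confident samples."""
    p, g = preds[mask], groups[mask]
    return abs(p[g == 0].mean() - p[g == 1].mean())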

URL: https://openreview.net/forum?id=TB18G0w6Ld

---

Title: A Fisher-Rao gradient flow for entropic mean-field min-max games

Abstract: Gradient flows play a substantial role in addressing many machine learning problems. We examine the convergence in continuous-time of a Fisher-Rao (Mean-Field Birth-Death) gradient flow in the context of solving convex-concave min-max games with entropy regularization. We propose appropriate Lyapunov functions to demonstrate convergence with explicit rates to the unique mixed Nash equilibrium.

URL: https://openreview.net/forum?id=Afc2CucRaR

---

Title: GANDALF: Gated Adaptive Network for Deep Automated Learning of Features for Tabular Data

Abstract: We propose a novel high-performance, interpretable, and parameter- and computation-efficient deep learning architecture for tabular data, the Gated Adaptive Network for Deep Automated Learning of Features (GANDALF). GANDALF relies on a new tabular processing unit with a gating mechanism and built-in feature selection, the Gated Feature Learning Unit (GFLU), as its feature representation learning unit. We demonstrate that GANDALF outperforms or is at par with SOTA approaches such as XGBoost, SAINT, and FT-Transformers in experiments on multiple established public benchmarks. The code is available under the MIT License.

URL: https://openreview.net/forum?id=syMYVrQcbd

---

Title: Leveraging World Model Disentanglement in Value-Based Multi-Agent Reinforcement Learning

Abstract: In this paper, we propose a novel model-based multi-agent reinforcement learning approach named Value Decomposition Framework with Disentangled World Model to address the challenge of achieving a common goal of multiple agents interacting in the same environment with reduced sample complexity. Due to scalability and non-stationarity problems posed by multi-agent systems, model-free methods rely on a considerable number of samples for training. In contrast, we use a modularized world model, composed of action-conditioned, action-free, and static branches, to unravel the complicated environment dynamics. Our model produces imagined outcomes based on past experience, without sampling directly from the real environment. We employ variational auto-encoders and variational graph auto-encoders to learn the latent representations for the world model, which is merged with a value-based framework to predict the joint action-value function and optimize the overall training objective. Experimental results on StarCraft II micro-management, Multi-Agent MuJoCo, and Level-Based Foraging challenges demonstrate that our method achieves high sample efficiency and exhibits superior performance compared to other baselines across a wide range of multi-agent learning tasks.

URL: https://openreview.net/forum?id=a8VRadL9L4

---

Title: Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Abstract: Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing the dataset size from scaling. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse, and balanced, and propose a clustering-based approach for building datasets satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains, including web-based images, satellite images, and text, show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par with or better than ones trained on manually curated data.
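
A minimal sketch of a two-level clustering plus balanced sampling recipe in this spirit, assuming scikit-learn's KMeans and equal per-leaf sampling (hyperparameters and helper names are illustrative, not the paper's):

import numpy as np
from sklearn.cluster import KMeans

def curate(embeddings, k_top=10, k_sub=5, per_leaf=100, seed=0):
    """Two-level k-means followed by balanced sampling from the leaf clusters."""
    rng = np.random.default_rng(seed)
    top = KMeans(n_clusters=k_top, random_state=seed, n_init=10).fit_predict(embeddings)
    selected = []
    for c in range(k_top):
        idx = np.where(top == c)[0]
        k = min(k_sub, len(idx))
        sub = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(embeddings[idx])
        for s in range(k):
            leaf = idx[sub == s]                       # indices in this leaf cluster
            take = min(per_leaf, len(leaf))
            selected.append(rng.choice(leaf, size=take, replace=False))
    return np.concatenate(selected)                    # balanced subset of indices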

URL: https://openreview.net/forum?id=G7p8djzWOl

---

Title: Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

Abstract: In this paper we explore the evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as input some element drawn from a large, well-defined population (e.g., counting elements in a list, multiplying two k-digit numbers). We examine several conditions per task and perform enough trials so that statistically significant differences can be detected. This allows us to investigate the sensitivity of task accuracy both to query phrasing and to the input parameter population. We find that seemingly trivial modifications in the task prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query phrasing and list length, but also with list composition (i.e., the thing-to-be-counted) and object frequency (e.g., success when an element accounts for ≈ 50\% of a list differs from when it accounts for ≈ 70\%).

We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, where experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions that have been formed based on interactions with humans form a very unreliable guide as to which input modifications should ``make no difference'' to LLM performance.

URL: https://openreview.net/forum?id=qt4d0EGZsK

---

Title: Credal Bayesian Deep Learning

Abstract: Uncertainty quantification and robustness to distribution shifts are important goals in machine learning and artificial intelligence. Although Bayesian Neural Networks (BNNs) allow for uncertainty in the predictions to be assessed, different sources of uncertainty are indistinguishable. We present Credal Bayesian Deep Learning (CBDL).
Heuristically, CBDL allows training an (uncountably) infinite ensemble of BNNs using only finitely many elements. This is possible thanks to prior and likelihood finitely generated credal sets (FGCSs), a concept from the imprecise probability literature. Intuitively, convex combinations of a finite collection of prior-likelihood pairs are able to represent infinitely many such pairs. After training, CBDL outputs a set of posteriors on the parameters of the neural network. At inference time, this posterior set is used to derive a set of predictive distributions, which is in turn utilized to distinguish between aleatoric and epistemic uncertainties and to quantify them. The predictive set also produces either (i) a collection of outputs enjoying desirable probabilistic guarantees, or (ii) the single output that is deemed the best, that is, the one having the highest predictive lower probability -- another imprecise-probabilistic concept. CBDL is more robust than single BNNs to prior and likelihood misspecification, and to distribution shift. We show that CBDL is better at quantifying and disentangling different types of uncertainties than single BNNs and ensembles of BNNs.
In addition, we apply CBDL to two case studies to demonstrate its capabilities on downstream tasks: first, motion prediction in autonomous driving scenarios, and second, modeling blood glucose and insulin dynamics for artificial pancreas control. We show that CBDL performs better than an ensemble-of-BNNs baseline.

URL: https://openreview.net/forum?id=4NHF9AC5ui

---

Title: Improved motif-scaffolding with SE(3) flow matching

Abstract: Protein design often begins with knowledge of a desired function conferred by a motif, around which motif-scaffolding aims to construct a functional protein. Recently, generative models have achieved breakthrough success in designing scaffolds for a range of motifs. However, generated scaffolds tend to lack structural diversity, which can hinder success in wet-lab validation. In this work, we extend FrameFlow, an SE(3) flow matching model for protein backbone generation, to perform motif-scaffolding with two complementary approaches. The first is motif amortization, in which FrameFlow is trained with the motif as input using a data augmentation strategy. The second is motif guidance, which performs scaffolding using an estimate of the conditional score from FrameFlow without additional training. On a benchmark of 24 biologically meaningful motifs, we show our method achieves a 2.5-times higher in-silico success rate of unique motif-scaffolds compared to the state of the art.

URL: https://openreview.net/forum?id=fa1ne8xDGn

---

Title: On Safety in Safe Bayesian Optimization

Abstract: Safe Bayesian Optimization (BO) is increasingly used to optimize an unknown function under safety constraints, a central task in robotics, biomedical engineering, and many other disciplines. Due to the safety-critical nature of these applications, it is crucial that theoretical safety guarantees for these algorithms translate into the real world. In this work, we investigate three safety-related issues in SafeOpt-type algorithms, a popular class of safe BO methods. First, these algorithms critically rely on frequentist uncertainty bounds for Gaussian Process (GP) regression, but concrete implementations typically utilize heuristics that invalidate all safety guarantees. We provide a detailed analysis of this problem and introduce Real-$\beta$-SafeOpt, a variant of the SafeOpt algorithm that leverages recent GP bounds and thus retains all theoretical guarantees. Second, we identify a key technical assumption in SafeOpt-like algorithms, the assumption of an upper bound on the reproducing kernel Hilbert space (RKHS) norm of the target function, as a central obstacle to real-world usage. To overcome this obstacle, we introduce the Lipschitz-only Safe Bayesian Optimization (LoSBO) algorithm,
which guarantees safety without an assumption on the RKHS bound, and show empirically that this algorithm is not only safe, but also outperforms the state-of-the-art on several function classes. Third, SafeOpt and derived algorithms rely on a discrete search space, complicating their application to higher-dimensional problems. To broaden the applicability of these algorithms, we introduce Lipschitz-only Safe GP-UCB (LoS-GP-UCB), a LoSBO variant that is applicable to moderately high-dimensional problems, while retaining safety. By analyzing practical safety issues in an important class of safe BO algorithms, and providing ready-to-use algorithms that overcome these issues, this work contributes to bringing safe and reliable machine learning techniques closer to real world applications.
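
A minimal sketch (not the authors' implementation, and assuming noiseless safety observations) of the Lipschitz-only reasoning behind LoSBO-style safe sets: a candidate is certified safe if some already-evaluated point guarantees it through the Lipschitz constant:

import numpy as np

def lipschitz_safe(candidates, X_safe, f_safe, L, threshold):
    """Certify candidates as safe using only observed safe values and a Lipschitz constant.

    A candidate x is certified if some evaluated point x_i satisfies
    f(x_i) - L * ||x - x_i|| >= threshold, which implies f(x) >= threshold.
    """
    dists = np.linalg.norm(candidates[:, None, :] - X_safe[None, :, :], axis=-1)
    lower_bounds = (f_safe[None, :] - L * dists).max(axis=1)
    return lower_bounds >= threshold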

URL: https://openreview.net/forum?id=tgFHZMsl1N

---

Title: Counterfactual Learning of Stochastic Policies with Continuous Actions

Abstract: Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In this paper, we address the problem of learning stochastic policies with continuous actions from the viewpoint of counterfactual risk minimization (CRM). While the CRM framework is appealing and well studied for discrete actions, the continuous-action case raises new challenges about modeling, optimization, and offline model selection with real data, which turns out to be particularly challenging. Our paper contributes to these three aspects of the CRM estimation pipeline. First, we introduce a modeling strategy based on a joint kernel embedding of contexts and actions, which overcomes the shortcomings of previous discretization approaches. Second, we empirically show that the optimization aspect of counterfactual learning is important, and we demonstrate the benefits of proximal point algorithms and smooth estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.
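
For illustration only, a clipped importance-weighted risk estimate for one-dimensional continuous actions under Gaussian logging and target policies (a deliberate simplification of the CRM setting; the kernel-embedding model and proximal optimization are omitted):

import numpy as np
from scipy.stats import norm

def clipped_crm_risk(costs, actions, mu_log, sigma_log, mu_new, sigma_new, clip=10.0):
    """Clipped importance-weighted estimate of the new policy's expected cost.

    Both logging and target policies are assumed Gaussian over a 1-D continuous
    action; `mu_*` are per-context means and `sigma_*` are scalars.
    """
    w = norm.pdf(actions, mu_new, sigma_new) / (norm.pdf(actions, mu_log, sigma_log) + 1e-12)
    return np.mean(costs * np.minimum(w, clip))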

URL: https://openreview.net/forum?id=fC4bh1PmZr

---

Title: On the Equivalence of Graph Convolution and Mixup

Abstract: This paper investigates the relationship between graph convolution and Mixup techniques. Graph convolution in a graph neural network involves aggregating features from neighboring samples to learn representative features for a specific node or sample. On the other hand, Mixup is a data augmentation technique that generates new examples by averaging features and one-hot labels from multiple samples. One commonality between these techniques is their utilization of information from multiple samples to derive feature representations. This study aims to explore whether a connection exists between these two approaches. Our investigation reveals that, under two mild conditions, graph convolution can be viewed as a specialized form of Mixup that is applied during both the training and testing phases. The two conditions are: 1) \textit{Homophily Relabel} - assigning the target node's label to all its neighbors, and 2) \textit{Test-Time Mixup} - applying Mixup to the features at test time. We establish this equivalence mathematically by demonstrating that graph convolution networks (GCN) and simplified graph convolution (SGC) can be expressed as a form of Mixup. We also empirically verify the equivalence by training an MLP using the two conditions to achieve comparable performance.
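
A minimal sketch making the aggregation view concrete: with self-loops and row normalization, one graph-convolution step outputs, for each node, a convex combination (a Mixup) of the features in its neighborhood:

import numpy as np

def gcn_layer(A, X):
    """One graph-convolution step: average each node's and its neighbors' features.

    A is an adjacency matrix and X a node-feature matrix; with self-loops and
    row normalization, each output row is a convex combination of the feature
    rows in that node's neighborhood.
    """
    A_hat = A + np.eye(A.shape[0])                     # add self-loops
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)
    return D_inv * (A_hat @ X)                         # row-normalized aggregation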

URL: https://openreview.net/forum?id=koC6zyaj73

---

Title: Calibrating Deep Ensemble through Functional Variational Inference

Abstract: Deep Ensemble (DE) is an effective and practical uncertainty quantification approach in deep learning. The uncertainty of DE is usually manifested by the functional inconsistency among the ensemble members, which, yet, originates from unmanageable randomness in the initialization and optimization of neural networks (NNs), and may easily collapse in specific cases. To tackle this issue, we advocate characterizing the functional inconsistency with the empirical covariance of the functions dictated by the ensemble members, and defining a Gaussian process (GP) with it. We perform functional variational inference to tune such a probabilistic model w.r.t. training data and specific prior beliefs. This way, we can explicitly manage the uncertainty of the ensemble of NNs. We further provide strategies to make the training efficient. The proposed approach achieves better uncertainty quantification than DE and its variants across diverse scenarios, while consuming only marginally added training cost compared to standard DE.

URL: https://openreview.net/forum?id=uvPnTWMLll

---

Title: More Agents Is All You Need

Abstract: We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. Also, this method is orthogonal to existing complicated methods to further enhance LLMs, while the degree of enhancement is correlated to the task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify the presence of our finding, and to study the properties that can facilitate its occurrence. Our code will be provided when the submission is accepted.
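
The core sampling-and-voting recipe is simple enough to sketch; the `generate` callable below stands in for sampling an answer from an LLM (names are illustrative, not from the paper):

from collections import Counter

def sample_and_vote(generate, prompt, n_agents=10):
    """Query the model n_agents times and return the majority answer.

    `generate` is any function mapping a prompt to a (normalized) answer string.
    """
    answers = [generate(prompt) for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]

# e.g., with a stub in place of a real model:
# sample_and_vote(lambda p: "42", "What is 6 times 7?")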

URL: https://openreview.net/forum?id=bgzUSZ8aeg

---

Title: Calibration Attacks: A Comprehensive Study of Adversarial Attacks on Model Confidence

Abstract: In this work, we highlight and perform a comprehensive study on calibration attacks, a form of adversarial attacks that aim to trap victim models into being heavily miscalibrated without altering their predicted labels, hence endangering the trustworthiness of the models and follow-up decision making based on their confidence. We propose four typical forms of calibration attacks: underconfidence, overconfidence, maximum miscalibration, and random confidence attacks, conducted in both the black-box and white-box setups. We demonstrate that the attacks are highly effective on both convolutional and attention-based models: with a small number of queries, they seriously skew confidence without changing the predictive performance. Given the potential danger, we further investigate the effectiveness of a wide range of adversarial defence and recalibration methods, including our proposed defences specifically designed for calibration attacks to mitigate the harm. From the ECE and KS scores, we observe that there are still significant limitations in handling calibration attacks. To the best of our knowledge, this is the first dedicated study that provides a comprehensive investigation on calibration-focused attacks. We hope this study helps attract more attention to these types of attacks and hence helps prevent their potentially serious damage. To this end, this work also provides detailed analyses to understand the characteristics of the attacks.
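
For reference, a minimal sketch of the Expected Calibration Error (one of the two scores mentioned above), computed over equal-width confidence bins; inputs are NumPy arrays of per-sample confidences and 0/1 correctness:

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: mean |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of samples falling in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece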

URL: https://openreview.net/forum?id=TXzz9xwdpv

---

Title: AdaFNIO: A Physics-Informed Adaptive Fourier Neural Interpolation Operator for Synthetic Frame Generation

Abstract: We present \textbf{AdaFNIO} - Adaptive Fourier Neural Interpolation Operator, a neural operator-based architecture to perform synthetic frame generation. Current deep learning-based methods rely on local convolutions for feature learning and suffer from not being scale-invariant, thus requiring training data to be augmented through random flipping and re-scaling. In contrast, \textbf{AdaFNIO} leverages the principles of physics to learn the features in the frames, independent of input resolution, through token mixing and global convolution in the Fourier spectral domain by using the Fast Fourier Transform (FFT). We show that \textbf{AdaFNIO} can produce visually smooth and accurate results. To evaluate the visual quality of our interpolated frames, we calculate the structural similarity index (SSIM) and the Peak Signal-to-Noise Ratio (PSNR) between the generated frame and the ground-truth frame. We report the quantitative performance of our model on the Vimeo-90K, DAVIS, UCF101, and DISFA+ datasets. Lastly, we apply the model to in-the-wild videos such as photosynthesis, brain MRI recordings, and red blood cell animations.

URL: https://openreview.net/forum?id=Yi9mtDeWL7

---

Title: Discrete Messages Improve Communication Efficiency among Isolated Intelligent Agents

Abstract: Individuals, despite having varied life experiences and learning processes, can communicate effectively through languages. This study aims to explore the efficiency of language as a communication medium. We put forth two specific hypotheses: First, discrete messages are more effective than continuous ones when agents have diverse personal experiences. Second, communications using multiple discrete tokens are more advantageous than those using a single token. To validate these hypotheses, we designed multi-agent machine learning experiments to assess communication efficiency using various information transmission methods between speakers and listeners. Our empirical findings indicate that, in scenarios where agents are exposed to different data, communicating through sentences composed of discrete tokens offers the best inter-agent communication efficiency. The limitations of our finding include the lack of systematic advantages over more sophisticated encoder-decoder models such as variational autoencoders, and the lack of evaluation on non-image datasets, which we leave for future studies.

URL: https://openreview.net/forum?id=X9866qeMYb

---

Title: GraphPrivatizer: Improved Structural Differential Privacy for Graph Neural Networks

Abstract: Graph privacy is crucial in systems that present a graph structure where the confidentiality and privacy of participants play a significant role in the integrity of the system itself. For instance, it is necessary to ensure the integrity of banking systems and transaction networks, protecting the privacy of customers' financial information and transaction details. We propose a method called GraphPrivatizer that privatizes the structure of a graph and protects it under Differential Privacy. GraphPrivatizer performs a controlled perturbation of the graph structure by randomly replacing the neighbors of a node with other similar neighbors, according to some similarity metric. With regard to neighbor perturbation, we find that aggregating features to compute similarities and imposing a minimum similarity score between the original and the replaced nodes provides the best privacy-utility trade-off. We use our method to train a Graph Neural Network server-side without disclosing users' private information to the server. We conduct experiments on real-world graph datasets and empirically evaluate the privacy of our models against privacy attacks.

URL: https://openreview.net/forum?id=lcPtUhoGYc

---

Title: Affordable Generative Agents

Abstract: The emergence of large language models (LLMs) has significantly advanced the simulation of believable interactive agents. However, the substantial cost of maintaining prolonged agent interactions poses a challenge to the deployment of believable LLM-based agents. Therefore, in this paper, we develop Affordable Generative Agents (AGA), a framework for enabling the generation of believable and low-cost interactions, both agent-environment and inter-agent. Specifically, for agent-environment interactions, we substitute repetitive LLM inferences with learned policies; for inter-agent interactions, we model the social relationships between agents and compress auxiliary dialogue information. Extensive experiments on multiple environments show the effectiveness and efficiency of our proposed framework. We also delve into the mechanisms of emergent believable behaviors in LLM agents, demonstrating that agents can only generate finite behaviors in fixed environments, based on which we identify ways to facilitate emergent interaction behaviors. Our code is publicly available at: https://github.com/AffordableGenerativeAgents/Affordable-Generative-Agents.

URL: https://openreview.net/forum?id=7tlYbcq5DY

---

Title: Data-Centric Defense: Shaping Loss Landscape with Augmentations to Counter Model Inversion

Abstract: Machine Learning models have shown susceptibility to various privacy attacks, with model inversion (MI) attacks posing a significant threat. Current defense techniques are mostly \emph{model-centric}, involving modifying model training or inference. However, these approaches require model trainers' cooperation, are computationally expensive, and often result in a significant privacy-utility tradeoff. To address these limitations, we propose a novel \emph{data-centric} approach to mitigate MI attacks. Compared to traditional model-centric techniques, our approach offers the unique advantage of enabling each individual user to control their data's privacy risk, aligning with findings from a Cisco survey that only a minority actively seek privacy protection. Specifically, we introduce several privacy-focused data augmentations that modify the private data uploaded to the model trainer. These augmentations shape the resulting model's loss landscape, making it challenging for attackers to generate private target samples. Additionally, we provide theoretical analysis to explain why such augmentations can reduce the risk of model inversion. We evaluate our approach against state-of-the-art MI attacks and demonstrate its effectiveness and robustness across various model architectures and datasets. Specifically, in standard face recognition benchmarks, we reduce face reconstruction success rates to $\leq5\%$, while maintaining high utility with only a 2\% classification accuracy drop, significantly surpassing state-of-the-art model-centric defenses. This is the first study to propose a data-centric approach for mitigating model inversion attacks, showing promising potential for decentralized privacy protection.

URL: https://openreview.net/forum?id=r8wXaLJBIS

---

Title: Budget-Aware Sequential Brick Assembly with Efficient Constraint Satisfaction

Abstract: We tackle the problem of \emph{sequential brick assembly} with LEGO bricks to create combinatorial 3D structures. This problem is challenging since the brick assembly task has the characteristics of combinatorial optimization problems: the number of assemblable structures grows exponentially with the number of bricks used. To solve this problem, we propose a new method that predicts scores for the next brick position using a U-shaped sparse 3D convolutional neural network. Alongside the 3D convolutional network, a \emph{one-initialized brick-sized} convolution filter is used to efficiently validate physical constraints between bricks without any training of its own. By the nature of this one-initialized convolution filter, we can readily consider several different brick types by benefiting from modern implementations of convolution operations. To generate a novel structure, we devise a sampling strategy that determines the next brick position while satisfying physical constraints. Moreover, our method is designed for either \emph{budget-free} or \emph{budget-aware} scenarios, where a budget may confine the number of bricks and their types. We demonstrate that our method successfully generates a variety of brick structures and outperforms existing methods based on Bayesian optimization, deep graph generative models, and reinforcement learning.
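
A minimal sketch (not the authors' code) of the one-initialized brick-sized filter idea: convolving an occupancy grid with an all-ones filter of the brick's dimensions counts, for every placement, how many voxels the brick would overlap; zeros mark collision-free positions:

import numpy as np
from scipy.signal import convolve

def placement_collisions(occupancy, brick_shape):
    """Count occupied voxels under every possible placement of a brick.

    The all-ones filter of the brick's size plays the role of the
    one-initialized brick-sized filter: its output at a position is the number
    of voxels a brick placed there would overlap with existing bricks.
    """
    ones_filter = np.ones(brick_shape)
    return convolve(occupancy, ones_filter, mode="valid")

occ = np.zeros((8, 8, 8))
occ[0, 0:4, 0:2] = 1                              # an existing 1x4x2 brick
free = placement_collisions(occ, (1, 4, 2)) == 0  # True where no collision occurs
print(free.sum(), "collision-free placements")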

URL: https://openreview.net/forum?id=0mRfQOnkqk

---

Title: Continual Learning for Long-Tailed Recognition

Abstract: We propose Continual Learning for Long-Tailed Recognition (CLTR), a framework that employs standard off-the-shelf Continual Learning (CL) methods for addressing Long-Tailed Recognition (LTR) problems, by first learning the majority classes (Head) followed by learning of the minority classes (Tail), without forgetting the majority. To ensure that our method is theoretically sound, we first prove that training a model on long-tailed data leads to weights similar to training the same learner on the Head classes. This naturally necessitates another step where the model learns the Tail after the Head in a sequential manner. We then prove that employing CL can effectively mitigate catastrophic forgetting in this setup and thus improve the model's performance in addressing LTR. We evaluate the efficacy of our approach using several standard CL methods on multiple datasets (CIFAR100-LT, CIFAR10-LT, ImageNet-LT, and Caltech256), showing that CLTR achieves state-of-the-art performance on all the benchmarks. Further, we demonstrate the effectiveness of CLTR in the more challenging task of class-incremental LTR, surpassing the state-of-the-art methods in this area by notable margins. Lastly, extensive sensitivity analyses and detailed discussions are provided to further explore the underlying mechanisms of CLTR. Our work not only bridges LTR and CL in a systematic way, but also paves the way for leveraging future advances in CL methods to more effectively tackle LTR problems.

URL: https://openreview.net/forum?id=lJWaicH8Dz

---

Title: LoRA Learns Less and Forgets Less

Abstract: Low-Rank Adaptation (LoRA) is a widely used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low-rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning ($\approx$100K prompt-response pairs) and continued pretraining ($\approx$10B unstructured tokens) data regimes. Our results show that, in most settings, LoRA substantially underperforms full finetuning.
Nevertheless, LoRA exhibits a desirable form of regularization: it better maintains the base model's performance on tasks outside the target domain.
We show that LoRA provides stronger regularization compared to common techniques such as weight decay and dropout; it also helps maintain more diverse generations. We show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.
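
For context, a minimal sketch of the LoRA parameterization itself (the standard formulation, not tied to this paper's experiments): the pretrained weight is frozen and only a low-rank perturbation is trained:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank perturbation B @ A (rank r)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)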

URL: https://openreview.net/forum?id=aloEru2qCG

---

Title: Mitigating Group Bias in Federated Learning: Beyond Local Fairness

Abstract: The issue of group fairness in machine learning models, where certain sub-populations or groups are favored over others, has been recognized for some time. While many mitigation strategies have been proposed in centralized learning, many of these methods are not directly applicable in federated learning, where data is privately stored on multiple clients. To address this, many proposals try to mitigate bias at the level of clients before aggregation, which we call locally fair training. However, the effectiveness of these approaches is not well understood. In this work, we investigate the theoretical foundation of locally fair training by studying the relationship between global model fairness and local model fairness. Additionally, we prove that for a broad class of fairness metrics, the global model's fairness can be obtained using only summary statistics from local clients. Based on that, we propose a globally fair training algorithm that optimizes the fairness-regularized empirical loss. Real-data experiments demonstrate the promising performance of our proposed approach for enhancing fairness.
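
A minimal sketch of the summary-statistics idea for one common metric, demographic parity: each client reports only per-group counts of samples and positive predictions, and the server recovers the global gap (an illustration, not the paper's full algorithm):

def global_demographic_parity(client_stats):
    """Compute the global model's demographic-parity gap from client summaries.

    client_stats: list of dicts {group: (n_samples, n_positive)}; no individual
    data leaves the clients, only these per-group counts.
    """
    totals, positives = {}, {}
    for stats in client_stats:
        for g, (n, pos) in stats.items():
            totals[g] = totals.get(g, 0) + n
            positives[g] = positives.get(g, 0) + pos
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values())

# two clients, groups 0 and 1
print(global_demographic_parity([{0: (50, 20), 1: (40, 10)},
                                 {0: (30, 12), 1: (60, 18)}]))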

URL: https://openreview.net/forum?id=ANXoddnzct

---

Title: Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds

Abstract: Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research. Multi-modal Variational Autoencoders (VAEs) have been a popular generative model class that learns latent representations that jointly explain multiple modalities. Various objective functions for such models have been suggested, often motivated as lower bounds on the multi-modal data log-likelihood or from information-theoretic considerations. To encode latent variables from different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts (MoE) aggregation schemes have been routinely used and shown to yield different trade-offs, for instance, regarding their generative quality or consistency across multiple modalities. In this work, we consider a variational bound that can tightly approximate the data log-likelihood. We develop more flexible aggregation schemes that generalize PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks. Our numerical experiments illustrate trade-offs for multi-modal variational bounds and various aggregation schemes. We show that tighter variational bounds and more flexible aggregation models can become beneficial when one wants to approximate the true joint distribution over observed modalities and latent variables in identifiable models.

URL: https://openreview.net/forum?id=lM4nHnxGfL

---

Title: Exploring Neural Network Landscapes: Star-Shaped and Geodesic Connectivity

Abstract: One of the most intriguing findings in the structure of neural network landscapes is the phenomenon of mode connectivity: For two typical global minima, there exists a path connecting them without barrier. This concept of mode connectivity has played a crucial role in understanding important phenomena in deep learning.

In this paper, we conduct a fine-grained analysis of this connectivity phenomenon. First, we demonstrate that in the overparameterized case, the connecting path can be as simple as a two-piece linear path, and the path length can be nearly equal to the Euclidean distance. This finding suggests that the landscape should be nearly convex in a certain sense. Second, we uncover a surprising star-shaped connectivity: For a finite number of typical minima, there exists a center on the minima manifold that connects all of them simultaneously via linear paths. These results are provably valid for linear networks and two-layer ReLU networks under a teacher-student setup, and are empirically supported by models trained on MNIST and CIFAR-10.
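
A minimal, illustrative PyTorch sketch of probing a two-piece linear path: interpolate parameters from one minimum to a chosen midpoint and on to the other minimum, loading each point into the model and recording the loss (theta_a, theta_mid, theta_b are lists of tensors matching model.parameters()):

import torch

def two_piece_path(theta_a, theta_mid, theta_b, n=11):
    """Parameter lists along the path theta_a -> theta_mid -> theta_b."""
    def lerp(u, v, t):
        return [(1 - t) * x + t * y for x, y in zip(u, v)]
    ts = torch.linspace(0, 1, n)
    return [lerp(theta_a, theta_mid, t) for t in ts] + \
           [lerp(theta_mid, theta_b, t) for t in ts[1:]]

def loss_along_path(model, loss_fn, inputs, targets, path):
    """Load each interpolated parameter set into the model and record its loss."""
    losses = []
    with torch.no_grad():
        for params in path:
            for p, new in zip(model.parameters(), params):
                p.copy_(new)
            losses.append(loss_fn(model(inputs), targets).item())
    return losses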

URL: https://openreview.net/forum?id=V8lv3u5UAX

---

Title: Revisiting System-Heterogeneous Federated Learning through Dynamic Model Search

Abstract: Federated learning is a distributed learning paradigm in which multiple mobile clients train a global model while keeping data local. These mobile clients can have various amounts of available memory and network bandwidth. However, how to utilize the available memory and network bandwidth to the fullest in order to achieve the best global model performance remains an open challenge. In this paper, we propose to assign each client a subset of the global model, having different layers and different channels on each layer. To realize this, we design a constrained model search process with early stopping to improve the efficiency of finding models from such a very large space, and a data-free knowledge distillation mechanism to improve the global model performance when aggregating models with such different structures. For fair and reproducible comparison between different solutions, we develop a new system that can directly allocate different memory and bandwidth to each client according to memory and bandwidth logs collected on mobile devices. The evaluation shows that our solution yields accuracy increases ranging from 2.43\% to 15.81\% and provides 5\% to 40\% more memory and bandwidth utilization with negligible extra running time, compared to existing state-of-the-art system-heterogeneous federated learning methods under different available memory and bandwidth, non-i.i.d.~datasets, and image and text tasks.

URL: https://openreview.net/forum?id=ZoyltImuHh

---

Title: Large Language Models Synergize with Automated Machine Learning

Abstract: Recently, program synthesis driven by large language models (LLMs) has become increasingly popular. However, program synthesis for machine learning (ML) tasks still poses significant challenges. This paper explores a novel form of program synthesis, targeting ML programs, by combining LLMs and automated machine learning (autoML). Specifically, our goal is to fully automate the generation and optimization of the code of the entire ML workflow, from data preparation to modeling and post-processing, utilizing only textual descriptions of the ML tasks. To manage the length and diversity of ML programs, we propose to break each ML program into smaller, manageable parts. Each part is generated separately by the LLM, with careful consideration of their compatibilities. To ensure compatibilities, we design a testing technique for ML programs. Unlike traditional program synthesis, which typically relies on binary evaluations (i.e., correct or incorrect), evaluating ML programs necessitates more than just binary judgments. Therefore, we further assess ML programs numerically and select the optimal programs from a range of candidates using AutoML methods. In experiments across various ML tasks, our method outperforms existing methods in 10 out of 12 tasks for generating ML programs. In addition, autoML significantly improves the performance of the generated ML programs. In experiments, given the textual task description, our method, Text-to-ML, generates the complete and optimized ML program in a fully autonomous process.

URL: https://openreview.net/forum?id=RDEaIfOiJM

---

Title: Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

Abstract: We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full-pipeline model trained with internal datasets to produce 1024×1024 images without cascades, is preferred by human evaluators over SDXL by 44.0% vs. 21.4%.

URL: https://openreview.net/forum?id=GpULi1dAMm

---

Title: Efficient Identification of Direct Causal Parents via Invariance and Minimum Error Testing

Abstract: Invariant causal prediction (ICP) is a popular technique for finding causal parents (direct causes) of a target by exploiting distribution shifts and invariance testing (Peters et al., 2016). However, since ICP needs to run an exponential number of tests and fails to identify parents when distribution shifts only affect a few variables, applying ICP to practical large-scale problems is challenging. We propose MMSE-ICP and fastICP, two approaches which employ an error inequality to address the identifiability problem of ICP. The inequality states that the minimum prediction error of the predictor using causal parents is the smallest among all predictors which do not use descendants. fastICP is an efficient approximation tailored for large problems as it exploits the inequality and a heuristic to run fewer tests. MMSE-ICP and fastICP not only outperform competitive baselines in many simulations but also achieve state-of-the-art results on a large-scale real-data benchmark.

URL: https://openreview.net/forum?id=3G7mFdGVRW

---

Title: Exploiting Hankel-Toeplitz Structures for Fast Computation of Kernel Precision Matrices

Abstract: The Hilbert-space Gaussian process (HGP) approach offers a hyperparameter-independent basis function approximation for speeding up Gaussian process (GP) inference by projecting the GP onto $M$ basis functions. These properties result in a favorable data-independent $\mathcal{O}(M^3)$ computational complexity during hyperparameter optimization but require a dominating one-time precomputation of the precision matrix costing $\mathcal{O}(NM^2)$ operations. In this paper, we lower this dominating computational complexity to $\mathcal{O}(NM)$ with no additional approximations. We can do this because we realize that the precision matrix can be split into a sum of Hankel-Toeplitz matrices, each having $\mathcal{O}(M)$ unique entries. Based on this realization we propose computing only these unique entries at $\mathcal{O}(NM)$ costs. Further, we develop two theorems that prescribe sufficient conditions for the complexity reduction to hold generally for a wide range of other approximate GP models, such as the Variational Fourier features approach. The two theorems do this with no assumptions on the data and no additional approximations of the GP models themselves. Thus, our contribution provides a pure speed-up of several existing, widely used, GP approximations, without further approximations.
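
For intuition, the $\mathcal{O}(NM)$ computation can be made concrete in one dimension with the standard Hilbert-space GP sinusoidal basis: a product of two basis functions reduces to cosines of a frequency difference (Toeplitz part) and a frequency sum (Hankel part), so all unique entries of $\Phi^\top\Phi$ are $\mathcal{O}(M)$ data sums. The sketch below is an illustrative reconstruction of this idea, not the authors' code.

```python
import numpy as np

def hgp_precision_fast(x, M, L):
    """O(NM) construction of Phi^T Phi for the 1D Hilbert-space GP basis
    phi_j(x) = L^{-1/2} sin(pi * j * (x + L) / (2L)), j = 1..M."""
    u = (x + L) * np.pi / (2 * L)            # shape (N,)
    offsets = np.arange(0, 2 * M + 1)        # all frequency sums/differences needed
    c = np.cos(np.outer(offsets, u)).sum(axis=1)   # c[d] = sum_n cos(d * u_n), O(NM)
    i = np.arange(1, M + 1)
    T = c[np.abs(i[:, None] - i[None, :])]   # Toeplitz part: depends on |i - j|
    H = c[i[:, None] + i[None, :]]           # Hankel part: depends on i + j
    return (T - H) / (2 * L)

# Sanity check against the naive O(N M^2) computation
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
L, M = 2.0, 32
Phi = np.sin(np.pi * np.outer(x + L, np.arange(1, M + 1)) / (2 * L)) / np.sqrt(L)
assert np.allclose(Phi.T @ Phi, hgp_precision_fast(x, M, L))
```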

URL: https://openreview.net/forum?id=s9ylaDLvdO

---

Title: Considerations for Distribution Shift Robustness of Diagnostic Models in Healthcare

Abstract: We consider robustness to distribution shifts in the context of diagnostic models in healthcare, where the prediction target Y, e.g., the presence of a disease, is causally upstream of the observations X, e.g., a biomarker. Distribution shifts may occur, for instance, when the training data is collected in a domain with patients having particular demographic characteristics while the model is deployed on patients from a different demographic group. In the domain of applied ML for health, it is common to predict Y from X without considering further information about the patient. However, beyond the direct influence of the disease Y on biomarker X, a predictive model may learn to exploit confounding dependencies (or shortcuts) between X and Y that are unstable under certain distribution shifts. In this work, we highlight a data generating mechanism common to healthcare settings and discuss how recent theoretical results from the causality literature can be applied to build robust predictive models. We theoretically show why ignoring covariates as well as common invariant learning approaches will in general not yield robust predictors in the studied setting, while including certain covariates into the prediction model will. In an extensive simulation study, we showcase the robustness (or lack thereof) of different predictors under various data generating processes. Lastly, we analyze the performance of the different approaches using the PTB-XL dataset, a public dataset of annotated ECG recordings.

URL: https://openreview.net/forum?id=B7TLNCBnlA

---

Title: Reweighting Improves Conditional Risk Bounds

Abstract: In this work, we study the weighted empirical risk minimization (weighted ERM) schema, in which an additional data-dependent weight function is incorporated when the empirical risk function is being minimized. We show that under a general ``balanceable" Bernstein condition, one can design a weighted ERM estimator to achieve superior performance in certain sub-regions over the one obtained from standard ERM, and the superiority manifests itself through a data-dependent constant term in the error bound. These sub-regions correspond to large-margin ones in classification settings and low-variance ones in heteroscedastic regression settings, respectively. Our findings are supported by evidence from synthetic data experiments.

URL: https://openreview.net/forum?id=MvYddudHuE

---

Title: Oops, I Sampled it Again: Reinterpreting Confidence Intervals in Few-Shot Learning

Abstract: The predominant method for computing confidence intervals (CI) in few-shot learning (FSL) is based on sampling the tasks with replacement, i.e. allowing the same samples to appear in multiple tasks. This makes the CI misleading in that it takes into account the randomness of the sampler but not the data itself. To quantify the extent of this problem, we conduct a comparative analysis between CIs computed with and without replacement. These reveal a notable underestimation by the predominant method. This observation calls for a reevaluation of how we interpret confidence intervals and the resulting conclusions in FSL comparative studies. Our research demonstrates that the use of paired tests can partially address this issue. Additionally, we explore methods to further reduce the (size of the) CI by strategically sampling tasks of a specific size. We also introduce a new optimized benchmark, which can be accessed at https://anonymous.4open.science/r/FSL-benchmark-again-F73B

URL: https://openreview.net/forum?id=JxxkKt9yrx

---

Title: Adaptive Self-Distillation for Minimizing Client Drift in Heterogeneous Federated Learning

Abstract: Federated Learning (FL) is a machine learning paradigm that enables clients to jointly train a global model by aggregating the locally trained models without sharing any local training data. In practice, there can often be substantial heterogeneity (e.g., class imbalance) across the local data distributions observed by each of these clients. Under such non-iid data distributions across clients, FL suffers from the 'client-drift' problem where every client drifts to its own local optimum. This results in slower convergence and poor performance of the aggregated model. To address this limitation, we propose a novel regularization technique based on adaptive self-distillation (ASD) for training models on the client side. Our regularization scheme adaptively adjusts to each client's training data based on the global model's prediction entropy and the client-data label distribution. We show in this paper that our proposed regularization (ASD) can be easily integrated atop existing, state-of-the-art FL algorithms, leading to a further boost in the performance of these off-the-shelf methods. We theoretically explain how incorporation of the ASD regularizer leads to a reduction in client-drift and empirically justify the generalization ability of the trained model. We demonstrate the efficacy of our approach through extensive experiments on multiple real-world benchmarks and show substantial gains in performance when the proposed regularizer is combined with popular FL methods. The code is provided as supplementary material.
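
A minimal sketch of what such an entropy- and label-distribution-weighted self-distillation term could look like on the client side; the exact weighting used in the paper is not given in the abstract, so the form, names, and defaults below are assumptions.

```python
import torch
import torch.nn.functional as F

def asd_regularizer(student_logits, global_logits, labels, client_label_freq, tau=2.0):
    """Hypothetical adaptive self-distillation term: per-sample KL between the local
    (student) and frozen global model predictions, weighted by the global model's
    confidence (low entropy) and by how rare the label is on this client."""
    p_global = F.softmax(global_logits.detach() / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    kl = F.kl_div(log_p_student, p_global, reduction="none").sum(dim=1)   # per sample
    entropy = -(p_global * p_global.clamp_min(1e-12).log()).sum(dim=1)
    conf_weight = 1.0 - entropy / torch.log(torch.tensor(float(p_global.size(1))))
    rarity_weight = 1.0 / client_label_freq[labels].clamp_min(1e-6)       # up-weight rare classes
    return (conf_weight * rarity_weight * kl).mean() * tau ** 2
```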

URL: https://openreview.net/forum?id=K58n87DE4s

---

Title: Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator

Abstract: Nonparametric classification does not assume a particular functional form of the underlying data distribution, being a suitable approach for a wide variety of data sets. The $k$-nearest neighbor ($k$-NN) algorithm is one of the most popular methods for nonparametric classification. However, a relevant limitation concerns the definition of the number of neighbors $k$. This parameter exerts a direct impact on several properties of the classifier, such as the bias-variance tradeoff, smoothness of decision boundaries, robustness to noise, and class imbalance handling. In the present paper, we propose a new adaptive $k$-nearest neighbours ($kK$-NN) algorithm that explores the local curvature at a sample to automatically define the neighborhood size. The rationale is that points with low curvature could have larger neighborhoods (locally, the tangent space approximates well the underlying data shape), whereas points with high curvature could have smaller neighborhoods (locally, the tangent space is a loose approximation). We estimate the local Gaussian curvature by computing an approximation to the local shape operator in terms of the local covariance matrix as well as the local Hessian matrix. Results on many real-world data sets indicate that the new $kK$-NN algorithm may provide superior balanced accuracy compared to the established $k$-NN method. This is particularly evident when the number of samples in the training data is limited, suggesting that the $kK$-NN is capable of learning more discriminant functions with less data in some relevant cases.

URL: https://openreview.net/forum?id=ol9Pl2Ogtp

---

Title: MixedNUTS: Training-Free Accuracy-Robustness Balance via Nonlinearly Mixed Classifiers

Abstract: Adversarial robustness often comes at the cost of degraded accuracy, impeding real-life applications of robust classification models. Training-based solutions for better trade-offs are limited by incompatibilities with already-trained high-performance large models, necessitating the exploration of training-free ensemble approaches. Observing that robust models are more confident in correct predictions than in incorrect ones on clean and adversarial data alike, we speculate that amplifying this "benign confidence property" can reconcile accuracy and robustness in an ensemble setting. To this end, we propose "MixedNUTS", a *training-free* method where the output logits of a robust classifier and a standard non-robust classifier are processed by nonlinear transformations with only three parameters, which are optimized through an efficient algorithm. MixedNUTS then converts the transformed logits into probabilities and mixes them as the overall output. On the CIFAR-10, CIFAR-100, and ImageNet datasets, experimental results with custom strong adaptive attacks demonstrate MixedNUTS's vastly improved accuracy and near-SOTA robustness -- it boosts CIFAR-100 clean accuracy by $7.86$ points while sacrificing merely $0.87$ points in robust accuracy.
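
A minimal sketch of the mixing step, with an assumed three-parameter nonlinearity (shift, exponent, scale); the paper optimizes its own transform parameters with a dedicated algorithm, so this is illustrative only.

```python
import torch
import torch.nn.functional as F

def mixed_nuts_style_mix(logits_robust, logits_std, s=1.0, p=1.0, c=0.0, alpha=0.5):
    """Illustrative MixedNUTS-style mixing: nonlinearly transform the robust classifier's
    logits with three parameters (shift c, exponent p, scale s), convert both heads to
    probabilities, and mix them. The exact nonlinearity and mixing weight are assumptions."""
    z = s * torch.relu(logits_robust - c) ** p        # three-parameter transform
    probs = alpha * F.softmax(z, dim=1) + (1 - alpha) * F.softmax(logits_std, dim=1)
    return probs
```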

URL: https://openreview.net/forum?id=pyD6cujUmL

---

Title: Robust Feature Inference: A Test-time Defense Strategy using Spectral Projections

Abstract: Test-time defenses are used to improve the robustness of deep neural networks to adversarial examples during inference. However, existing methods either require an additional trained classifier to detect and correct the adversarial samples, or perform additional complex optimization on the model parameters or the input to adapt to the adversarial samples at test-time, resulting in a significant increase in the inference time compared to the base model. In this work, we propose a novel test-time defense strategy called Robust Feature Inference (RFI) that is easy to integrate with any existing (robust) training procedure without additional test-time computation. Based on the notion of robustness of features that we present, the key idea is to project the trained models to the most robust feature space, thereby reducing the vulnerability to adversarial attacks in non-robust directions. We theoretically characterize the subspace of the eigenspectrum of the feature covariance that is the most robust for a generalized additive model. Our extensive experiments on CIFAR-10, CIFAR-100, tiny ImageNet and ImageNet datasets for several robustness benchmarks, including the state-of-the-art methods in RobustBench show that RFI improves robustness across adaptive and transfer attacks consistently. We also compare RFI with adaptive test-time defenses to demonstrate the effectiveness of our proposed approach.
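
A minimal sketch of the projection idea, assuming the "most robust" subspace is taken to be the top-k eigenvectors of the feature covariance; the paper characterizes this subspace theoretically, so top-k here is an illustrative choice.

```python
import torch

def robust_feature_projection(features, k):
    """Illustrative test-time projection of penultimate-layer features onto the top-k
    eigenvectors of their covariance, suppressing the remaining (non-robust) directions."""
    centered = features - features.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)    # eigenvalues in ascending order
    basis = eigvecs[:, -k:]                      # top-k eigenvectors
    return features @ basis @ basis.T            # projection onto the chosen subspace
```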

URL: https://openreview.net/forum?id=9OHAtWdFWB

---

Title: Optimized Tradeoffs for Private Prediction with Majority Ensembling

Abstract: We study a classical problem in private prediction, the problem of computing an $(m\epsilon, \delta)$-differentially private majority of $K$ $(\epsilon, \Delta)$-differentially private algorithms for $1 \leq m \leq K$ and $1 > \delta \geq \Delta \geq 0$. Standard methods such as subsampling or randomized response are widely used, but do they provide optimal privacy-utility tradeoffs? To answer this, we introduce the Data-dependent Randomized Response Majority (DaRRM) algorithm. It is parameterized by a data-dependent noise function $\gamma$, and enables efficient utility optimization over the class of all private algorithms, encompassing those standard methods. Surprisingly, we show that an $(m\epsilon, \delta)$-private majority algorithm with maximal utility can be computed tractably for any $m \leq K$ by a novel structural result that reduces the infinitely many privacy constraints into a polynomial set. In some settings, we show that DaRRM provably enjoys a privacy gain of a factor of 2 over common baselines, with fixed utility. Lastly, we demonstrate the strong empirical effectiveness of our first-of-its-kind privacy-constrained utility optimization for ensembling labels for private prediction from private teachers in image classification. Notably, our DaRRM framework with an optimized $\gamma$ exhibits substantial utility gains when compared against several baselines.

URL: https://openreview.net/forum?id=dwJluAakM8

---

Title: Strengthening Interpretability: An Investigative Study of Integrated Gradient Methods

Abstract: We conducted a reproducibility study on Integrated Gradients (IG) based methods and the Important Direction Gradient Integration (IDGI) framework. IDGI eliminates the explanation noise in each step of the computation of IG-based methods that use the Riemann Integration for integrated gradient computation. We perform a rigorous theoretical analysis of IDGI and raise a few critical questions that we later address through our study. We also experimentally verify the authors' claims concerning the performance of IDGI over IG-based methods. Additionally, we varied the number of steps used in the Riemann approximation, an essential parameter in all IG methods, and analyzed the corresponding change in results. We also studied the numerical instability of the attribution methods to check the consistency of the saliency maps produced. We developed the complete code to implement IDGI over the baseline IG methods and evaluated them using three metrics since the available code was insufficient for this study. Our code is readily usable and publicly available at [link-hidden-for-submission].
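
For reference, the Riemann-sum form of Integrated Gradients that IDGI builds on can be written compactly as below; this is a generic sketch of IG, and IDGI's per-step noise-removal correction is not reproduced here.

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=64):
    """Integrated Gradients via a Riemann sum along the straight-line path from the
    baseline to the input x, for the logit of the given target class."""
    total_grad = torch.zeros_like(x)
    for a in torch.linspace(0.0, 1.0, steps):
        point = (baseline + a * (x - baseline)).clone().requires_grad_(True)
        score = model(point.unsqueeze(0))[0, target]
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    return (x - baseline) * total_grad / steps
```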

URL: https://openreview.net/forum?id=iBnPpN2hr5

---

Title: Equivariant Graph Network Approximations of High-Degree Polynomials for Force Field Prediction

Abstract: Recent advancements in equivariant deep models have shown promise in accurately predicting atomic potentials and force fields in molecular dynamics simulations. Using spherical harmonics (SH) and tensor products (TP), these equivariant networks gain enhanced physical understanding, like symmetries and many-body interactions. Beyond encoding physical insights, SH and TP are also crucial to represent equivariant polynomial functions. In this work, we analyze the equivariant polynomial functions for the equivariant architecture, and introduce a novel equivariant network, named PACE. The proposed PACE utilizes an edge booster and the Atomic Cluster Expansion (ACE) technique to approximate $SE(3) \times S_n$ equivariant polynomial functions with enhanced degrees. In experiments on commonly used benchmarks, PACE demonstrates state-of-the-art performance and generalization capability in predicting atomic energy and force fields.

URL: https://openreview.net/forum?id=7DAFwp0Vne

---

Title: MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path

Abstract: Image generation using diffusion can be controlled in multiple ways. In this paper, we systematically analyze the equations of modern generative diffusion networks to propose a framework, called MDP, that explains the design space of suitable manipulations. We identify 5 different manipulations, including intermediate latent, conditional embedding, cross attention maps, guidance, and predicted noise. We analyze the corresponding parameters of these manipulations and the manipulation schedule. We show that some previous editing methods fit nicely into our framework. Particularly, we identified one specific configuration as a new type of control by manipulating the predicted noise, which can perform higher-quality edits than previous work for a variety of local and global edits.

URL: https://openreview.net/forum?id=nuIjjebNuu

---

Title: Norm-count Hypothesis: On the Relationship Between Norm and Object Count in Visual Representations

Abstract: We present a novel hypothesis on norms of representations produced by convolutional neural networks (CNNs). In particular, we propose the norm-count hypothesis (NCH), which states that there is a monotonically increasing relationship between the number of certain objects in the image, and the norm of the corresponding representation. We formalize and prove our hypothesis in a controlled setting, showing that the NCH is true for linear and batch normalized CNNs followed by global average pooling, when they are applied to a certain class of images. Further, we present experimental evidence that corroborates our hypothesis for CNN-based representations. Our experiments are conducted with several real-world image datasets, in both supervised and self-supervised learning -- providing new insight on the relationship between object counts and representation norms.

URL: https://openreview.net/forum?id=enOjfAo1LT

---

Title: Analyzing the Impact of Learnable Softmax Temperature in Contrastive Visual-Textual Alignment Systems: Benefits, Drawbacks, and Alternative Approaches

Abstract: This work does NOT read like “fabricate motivation - propose something - obtain sota results”. Instead, we give an analysis of the learnable softmax temperature parameter in the practical training of contrastive visual-textual alignment models (commonly referred to as the “CLIP” model). This parameter is considered imperative for optimal system performance; however, its working mechanism and possible drawbacks have long been neglected. This study addresses this problem and offers a novel solution by leveraging the structure of ViTs. Our argument centers around the pivotal role of the softmax temperature in handling noisy training data. We visualize an equilibrium in the gradient of the contrastive loss, in which the temperature parameter serves as a distance scaling factor; without it, the model has trouble aligning positive pairs due to a numerical problem in the loss term. Conversely, we also show that a large temperature can result in unstable learning dynamics. We then identify alternative approaches that mitigate the problem from a topological view of the contrastive loss. Finally, we capitalize on multiple class tokens embedded within the transformer architecture to offer a concise solution. This configuration significantly boosts zero-shot classification performance, enhancing baseline CLIP models pretrained on large-scale datasets by an average of 6.1%. The codes and learned weights are provided at https://github.com/{Anonymous_authors}.
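
For context, the parameter under study is the learnable temperature in the standard symmetric contrastive (InfoNCE) loss. A minimal sketch of that loss is shown below; this is the common CLIP formulation, not the paper's multi-class-token variant.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, log_temp):
    """Symmetric CLIP-style InfoNCE loss with a learnable softmax temperature.
    log_temp is a learnable scalar tensor (e.g., torch.nn.Parameter); exp(-log_temp)
    scales the cosine similarities, i.e., it plays the role of 1/tau."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.T * torch.exp(-log_temp)
    targets = torch.arange(image_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```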

URL: https://openreview.net/forum?id=rx1QNhsNsK

---

Title: ISR: Invertible Symbolic Regression

Abstract: We introduce an Invertible Symbolic Regression (ISR) method. It is a machine learning technique that generates analytical relationships between inputs and outputs of a given dataset via invertible maps (or architectures). The proposed ISR method naturally combines the principles of Invertible Neural Networks (INNs) and Equation Learner (EQL), a neural network-based symbolic architecture for function learning. In particular, we transform the affine coupling blocks of INNs into a symbolic framework, resulting in an end-to-end differentiable symbolic invertible architecture that allows for efficient gradient-based learning. The proposed ISR framework also relies on sparsity promoting regularization, allowing the discovery of concise and interpretable invertible expressions. We show that ISR can serve as a (symbolic) normalizing flow for density estimation tasks. Furthermore, we highlight its practical applicability in solving inverse problems, including a benchmark inverse kinematics problem, and notably, a geoacoustic inversion problem in oceanography aimed at inferring posterior distributions of underlying seabed parameters from acoustic signals.

URL: https://openreview.net/forum?id=3edv2HTOMk

---

Title: Federated Learning under Evolving Distribution Shifts

Abstract: Federated learning (FL) is a distributed learning paradigm that facilitates training a global machine learning model without collecting the raw data from distributed clients. Recent advances in FL have addressed several considerations that are likely to transpire in realistic settings such as data distribution heterogeneity among clients. However, most of the existing works still consider clients' data distributions to be static or conforming to a simple dynamic, e.g., in participation rates of clients. In real FL applications, client data distributions change over time, and the dynamics, i.e., the evolving pattern, can be highly non-trivial. Further, evolution may take place from training to testing. In this paper, we address dynamics in client data distributions and aim to train FL systems from time-evolving clients that can generalize to future target data. Specifically, we propose two algorithms, FedEvolve and FedEvp, which are able to capture the evolving patterns of the clients during training and are test-robust under evolving distribution shifts. Through extensive experiments on both synthetic and real data, we show the proposed algorithms can significantly outperform the FL baselines across various network architectures.

URL: https://openreview.net/forum?id=8T42p8eMDb

---

Title: Robust Stochastic Optimization via Gradient Quantile Clipping

Abstract: We introduce a clipping strategy for Stochastic Gradient Descent (SGD) which uses quantiles of the gradient norm as clipping thresholds. We prove that this new strategy provides a robust and efficient optimization algorithm for smooth objectives (convex or non-convex), that tolerates heavy-tailed samples (including infinite variance) and a fraction of outliers in the data stream akin to Huber contamination. Our mathematical analysis leverages the connection between constant step size SGD and Markov chains and handles the bias introduced by clipping in an original way. For strongly convex objectives, we prove that the iteration converges to a concentrated distribution and derive high probability bounds on the final estimation error. In the non-convex case, we prove that the limit distribution is localized on a neighborhood with low gradient. We propose an implementation of this algorithm using rolling quantiles which leads to a highly efficient optimization procedure with strong robustness properties, as confirmed by our numerical experiments.
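
A minimal sketch of the rolling-quantile variant described above; the window size and quantile level are illustrative choices.

```python
import numpy as np
from collections import deque

def quantile_clipped_sgd(grad_fn, w0, steps, lr=0.01, q=0.9, window=256):
    """SGD where each stochastic gradient is clipped to the q-th rolling quantile of
    recently observed gradient norms, rather than to a fixed threshold."""
    w = np.array(w0, dtype=float)
    norms = deque(maxlen=window)
    for _ in range(steps):
        g = grad_fn(w)                        # one stochastic gradient sample
        norm = np.linalg.norm(g)
        norms.append(norm)
        threshold = np.quantile(norms, q)
        if norm > threshold and norm > 0:
            g = g * (threshold / norm)        # clip to the rolling quantile
        w -= lr * g
    return w
```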

URL: https://openreview.net/forum?id=HCRkV3kxHW

---

Title: APBench: A Unified Availability Poisoning Attack and Defenses Benchmark

Abstract: The efficacy of availability poisoning, a method of poisoning data by injecting imperceptible perturbations to prevent its use in model training, has been a hot subject of investigation. Previous research suggested that it was difficult to effectively counteract such poisoning attacks. However, the introduction of various defense methods has challenged this notion. Because of the rapid progress in this field, the performance of newly proposed methods cannot be accurately compared owing to variations in experimental setups. To further evaluate the attack and defense capabilities of these poisoning methods, we have developed APBench, a benchmark for assessing the efficacy of adversarial poisoning. APBench consists of 9 state-of-the-art availability poisoning attacks, 8 defense algorithms, and 4 conventional data augmentation techniques. We also set up experiments with varying poisoning ratios, and evaluate the attacks on multiple datasets and their transferability across model architectures. We further conduct a comprehensive evaluation of 2 additional attacks specifically targeting unsupervised models. Our results reveal the glaring inadequacy of existing attacks in safeguarding individual privacy. APBench is open source and available to the deep learning community.

URL: https://openreview.net/forum?id=igJ2XPNYbJ

---

Title: LINOCS: Lookahead Inference of Networked Operators for Continuous Stability

Abstract: Identifying latent interactions within complex systems is key to unlocking deeper insights into their operational dynamics, including how their elements affect each other and contribute to the overall system behavior. For instance, in neuroscience, discovering neuron-to-neuron interactions is essential for understanding brain function; in ecology, recognizing the interactions among populations is key for understanding complex ecosystems. Such systems, often modeled as dynamical systems, typically exhibit noisy high-dimensional and non-stationary temporal behavior that renders their identification challenging. Existing dynamical system identification methods often yield operators that accurately capture short-term behavior but fail to predict long-term trends, suggesting an incomplete capture of the underlying process. Methods that consider extended forecasts (e.g., recurrent neural networks) lack explicit representations of element-wise interactions and require substantial training data, thereby failing to capture interpretable network operators. Here we introduce Lookahead-driven Inference of Networked Operators for Continuous Stability (LINOCS), a robust learning procedure for identifying hidden dynamical interactions in noisy time-series data. LINOCS integrates several multi-step predictions with adaptive weights during training to recover dynamical operators that can yield accurate long-term predictions. We demonstrate LINOCS' ability to recover the ground truth dynamical operators underlying synthetic time-series data for multiple dynamical systems models (including linear, piece-wise linear, time-changing linear systems' decomposition, and regularized linear time-varying systems) as well as its capability to produce meaningful operators with robust reconstructions through various real-world examples.

URL: https://openreview.net/forum?id=A6D3PYSyqJ

---

Title: Orthogonal Random Features: Explicit Forms and Sharp Inequalities

Abstract: Random features have been introduced to scale up kernel methods via randomization techniques. In particular, random Fourier features and orthogonal random features were used to approximate the popular Gaussian kernel. The former is performed by a random Gaussian matrix and leads exactly to the Gaussian kernel after averaging. In this work, we analyze the bias and the variance of the kernel approximation based on orthogonal random features, which makes use of Haar orthogonal matrices. We provide explicit expressions for these quantities using normalized Bessel functions, showing that—contrary to what is commonly thought—orthogonal random features do not approximate the Gaussian kernel but a Bessel kernel. We also derive sharp exponential bounds supporting the view that orthogonal random features are less dispersed than random Fourier features.
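
For reference, the construction analyzed here builds the frequency matrix from Haar-orthogonal blocks with chi-rescaled rows. A minimal sketch follows; per the paper, the induced kernel is a Bessel kernel rather than exactly the Gaussian one.

```python
import numpy as np

def orthogonal_random_features(X, D, sigma=1.0, seed=0):
    """Orthogonal random feature map: frequency rows come from Haar-orthogonal blocks
    (QR of a Gaussian matrix with sign correction), rescaled by chi-distributed norms."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    blocks = []
    for _ in range(int(np.ceil(D / d))):
        G = rng.standard_normal((d, d))
        Q, R = np.linalg.qr(G)
        Q = Q * np.sign(np.diag(R))               # make the block exactly Haar-distributed
        S = np.sqrt(rng.chisquare(d, size=d))     # restore Gaussian-like row norms
        blocks.append(S[:, None] * Q)
    W = np.vstack(blocks)[:D] / sigma
    proj = X @ W.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(D)
```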

URL: https://openreview.net/forum?id=FMtRZ4xzSi

---

Title: Vision-Language Dataset Distillation

Abstract: Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, preserving sufficient information to quickly train a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily in the vision-language space. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach with three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation almost doubles that to 9.9% with just 100 training pairs, an order of magnitude fewer.

URL: https://openreview.net/forum?id=6L6cD65pot

---

Title: Noise Stability Optimization For Flat Minima With Tight Rates

Abstract: We consider minimizing a perturbed function $F(W) = \mathbb{E}_{U}[f(W + U)]$, given a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ and a random sample $U$ from a distribution $\mathcal{P}$ with mean zero. When $\mathcal{P}$ is the isotropic Gaussian, $F(W)$ is roughly equal to $f(W)$ plus a penalty on the trace of $\nabla^2 f(W)$, scaled by the variance of $\mathcal{P}$. This penalty on the Hessian has the benefit of improving generalization, through PAC-Bayes analysis. It is useful in low-sample regimes, for instance, when a (large) pre-trained model is fine-tuned on a small data set. One way to minimize $F$ is by adding $U$ to $W$ and then running SGD. We observe, empirically, that this noise injection does not provide significant gains over SGD in our experiments fine-tuning on three image classification data sets. We design a simple, practical algorithm that adds noise along both $U$ and $-U$, with the option of adding several perturbations and taking their average. We analyze the convergence of this algorithm, showing tight rates on the norm of the output's gradient.

We provide a comprehensive empirical analysis to show that this modified noise injection algorithm can be competitive and even outperform sharpness-reducing training methods. First, we show that in an over-parameterized matrix sensing problem, it can find solutions with lower test loss than naive noise injection. Then, we compare our algorithm with four sharpness-reducing training methods (including the Sharpness-Aware Minimization (Foret et al., 2021)). We find that our algorithm can outperform them by up to 1.8% test accuracy, for fine-tuning ResNet on six image classification data sets. It leads to a 17.7% (and 12.8%) reduction in the trace (and largest eigenvalue) of the Hessian matrix of the loss surface. This form of regularization on the Hessian is compatible with $\ell_2$ weight decay (and data augmentation), in the sense that combining both can lead to improved empirical performance.
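
A minimal PyTorch sketch of the symmetric perturbation step described above: sample perturbations, evaluate gradients at both W + U and W - U, and average. Function and argument names are assumptions, not the paper's code.

```python
import torch

def two_sided_noise_gradient(model, loss_fn, batch, sigma=0.01, k=1):
    """Average the stochastic gradients evaluated at W + U and W - U over k sampled
    perturbations U, targeting F(W) = E[f(W + U)] while cancelling odd noise terms."""
    params = [p for p in model.parameters() if p.requires_grad]
    avg_grads = [torch.zeros_like(p) for p in params]
    for _ in range(k):
        noise = [sigma * torch.randn_like(p) for p in params]
        for sign in (+1.0, -1.0):
            with torch.no_grad():
                for p, u in zip(params, noise):
                    p.add_(sign * u)                  # move to W + sign * U
            loss = loss_fn(model, batch)
            grads = torch.autograd.grad(loss, params)
            for g_acc, g in zip(avg_grads, grads):
                g_acc += g / (2 * k)
            with torch.no_grad():
                for p, u in zip(params, noise):
                    p.sub_(sign * u)                  # restore W
    return avg_grads  # use these in place of the plain SGD gradient
```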

URL: https://openreview.net/forum?id=yfrNkb2Ldd

---

Title: Large-width asymptotics and training dynamics of $\alpha$-Stable ReLU neural networks

Abstract: There is a recent literature on large-width properties of Gaussian neural networks (NNs), namely NNs with Gaussian distributed weights. Two popular results are: i) the characterization of the large-width asymptotic behavior of NNs in terms of Gaussian processes; ii) the characterization of the large-width training dynamics of NNs in terms of the so-called neural tangent kernel (NTK). In this paper, we investigate large-width asymptotics and training dynamics of $\alpha$-Stable NNs, namely NNs whose weights are distributed according to $\alpha$-Stable distributions, with $\alpha\in(0,2]$. First, for $\alpha$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $\alpha$-Stable process, generalizing Gaussian processes. Differently from the Gaussian setting, our result shows that the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $\alpha$-Stable process, the ReLU activation requires an additional logarithmic term in the scaling with respect to sub-linear activations. Then, we characterize the large-width training dynamics of $\alpha$-Stable ReLU-NNs in terms of a random kernel, referred to as the $\alpha$-Stable NTK, showing that, for a sufficiently large width, the gradient descent achieves zero training error at a linear rate. The randomness of the $\alpha$-Stable NTK is a further difference with respect to the Gaussian setting, that is: in the $\alpha$-Stable setting, the randomness of the NN at initialization does not vanish in the large-width regime of the training.

URL: https://openreview.net/forum?id=bEwAAEmRbh

---

Title: Deep Unlearning: Fast and Efficient Training-free Class Forgetting

Abstract: Machine unlearning is a prominent and challenging field, driven by regulatory demands for user data deletion and heightened privacy awareness. Existing approaches involve retraining the model or multiple fine-tuning steps for each deletion request, often constrained by computational limits and restricted data access. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate specific classes from the learned model. Our algorithm first estimates the Retain and the Forget Spaces using Singular Value Decomposition on the layerwise activations of a small subset of samples from the retain and unlearn classes, respectively. We then compute the shared information between these spaces and remove it from the forget space to isolate the class-discriminatory feature space. Finally, we obtain the unlearned model by updating the weights to suppress the class-discriminatory features from the activation spaces. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer, with only a ~1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Further, our algorithm consistently performs well when subject to Membership Inference Attacks, showing a 7.8% improvement on average across a variety of image classification datasets and network architectures compared to other baselines, while being ~6x more computationally efficient.

URL: https://openreview.net/forum?id=BmI5p6wBi0

---

Title: Sensitivity-Aware Amortized Bayesian Inference

Abstract: Sensitivity analyses reveal the influence of various modeling choices on the outcomes of statistical analyses. While theoretically appealing, they are overwhelmingly inefficient for complex Bayesian models. In this work, we propose sensitivity-aware amortized Bayesian inference (SA-ABI), a multifaceted approach to efficiently integrate sensitivity analyses into simulation-based inference with neural networks. First, we utilize weight sharing to encode the structural similarities between alternative likelihood and prior specifications in the training process with minimal computational overhead. Second, we leverage the rapid inference of neural networks to assess sensitivity to data perturbations and preprocessing steps. In contrast to most other Bayesian approaches, both steps circumvent the costly bottleneck of refitting the model for each choice of likelihood, prior, or data set. Finally, we propose to use deep ensembles to detect sensitivity arising from unreliable approximation (e.g., due to model misspecification). We demonstrate the effectiveness of our method in applied modeling problems, ranging from disease outbreak dynamics and global warming thresholds to human decision-making. Our results support sensitivity-aware inference as a default choice for amortized Bayesian workflows, automatically providing modelers with insights into otherwise hidden dimensions.

URL: https://openreview.net/forum?id=Kxtpa9rvM0

---

Title: Fast Computation of Leave-One-Out Cross-Validation for $k$-NN Regression

Abstract: We describe a fast computation method for leave-one-out cross-validation (LOOCV) for $k$-nearest neighbours ($k$-NN) regression. We show that, under a tie-breaking condition for nearest neighbours, the LOOCV estimate of the mean square error for $k$-NN regression is identical to the mean square error of $(k+1)$-NN regression evaluated on the training data, multiplied by the scaling factor $(k+1)^2/k^2$. Therefore, to compute the LOOCV score, one only needs to fit $(k+1)$-NN regression once, rather than repeating training and validation of $k$-NN regression once per training point. Numerical experiments confirm the validity of the fast computation method.
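
The identity is easy to check numerically with scikit-learn. A minimal check on synthetic data follows; with continuous random features, distance ties are almost surely absent, so the tie-breaking condition holds.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
k = 5

# Slow path: actual LOOCV of k-NN regression (n model fits).
loocv_mse = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error").mean()

# Fast path: one (k+1)-NN fit; its training MSE rescaled by (k+1)^2 / k^2.
knn = KNeighborsRegressor(n_neighbors=k + 1).fit(X, y)
fast_mse = ((k + 1) ** 2 / k ** 2) * np.mean((knn.predict(X) - y) ** 2)

print(loocv_mse, fast_mse)   # the two values agree up to numerical precision
```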

URL: https://openreview.net/forum?id=SBE2q9qwZj

---

Title: Leveraging Task Structures for Improved Identifiability in Neural Network Representations

Abstract: This work extends the theory of identifiability in supervised learning by considering the consequences of having access to a distribution of tasks. In such cases, we show that linear identifiability is achievable in the general multi-task regression setting. Furthermore, we show that the existence of a task distribution which defines a conditional prior over latent factors reduces the equivalence class for identifiability to permutations and scaling of the true latent factors, a stronger and more useful result than linear identifiability. Crucially, when we further assume a causal structure over these tasks, our approach enables simple maximum marginal likelihood optimization, and suggests potential downstream applications to causal representation learning. Empirically, we find that this straightforward optimization procedure enables our model to outperform more general unsupervised models in recovering canonical representations for both synthetic data and real-world molecular data.

URL: https://openreview.net/forum?id=WLcPrq6pu0

---

Title: Neural incomplete factorization: learning preconditioners for the conjugate gradient method

Abstract: The convergence of the conjugate gradient method to solve large-scale and sparse linear equation systems depends on the conditioning of the system matrix, which can be improved by preconditioning. In this paper, we develop a computationally efficient data-driven approach to accelerate the generation of effective preconditioners. We, therefore, replace the typically hand-engineered preconditioners by the output of graph neural networks. Optimizing the condition number of the linear system directly is computationally infeasible. Instead, our method generates an incomplete factorization of the matrix and is, therefore, referred to as neural incomplete factorization (NeuralIF). For efficient training, we utilize a stochastic approximation of the Frobenius loss which only requires matrix-vector multiplications. At the core of our method is a novel message-passing block, inspired by sparse matrix theory, that aligns with the objective of finding a sparse factorization of the matrix. We evaluate our proposed method on both synthetic problem instances and on problems arising from the discretization of the Poisson equation on varying domains. Our experiments show that by utilizing data-driven preconditioners within the conjugate gradient method we are able to speed up the convergence of the iterative procedure.
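
The stochastic Frobenius objective can be estimated with a Hutchinson-style probe so that only matrix-vector products are needed. A minimal sketch follows, assuming the objective has the form $\|LL^\top - A\|_F^2$ for a factor $L$; dense PyTorch tensors are used for readability, whereas NeuralIF itself produces sparse factors via a GNN.

```python
import torch

def stochastic_frobenius_loss(L, A, num_probes=8):
    """Hutchinson-style estimate of ||L L^T - A||_F^2 using only matrix-vector products:
    E_z ||(L L^T - A) z||^2 with z ~ N(0, I) equals the squared Frobenius norm."""
    n = A.shape[0]
    loss = 0.0
    for _ in range(num_probes):
        z = torch.randn(n, 1, device=A.device)
        r = L @ (L.T @ z) - A @ z          # (L L^T - A) z without ever forming L L^T
        loss = loss + (r ** 2).sum()
    return loss / num_probes
```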

URL: https://openreview.net/forum?id=FozLrZ3CI5

---

Title: A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Abstract: Hallucination is a common and difficult-to-eradicate problem for Large Vision-Language Models (LVLMs), particularly in long generations: the generated text is partially inconsistent with the image content. To mitigate hallucination, current studies focus either on the process of model inference or on the results of model generation, but the solutions they design do not always handle the various types of queries, or the hallucinations that arise in answers to those queries, appropriately. To accurately deal with various hallucinations, we present a unified framework, Dentist, for hallucination mitigation. The core step is to first classify the queries, then perform different hallucination mitigation processes based on the classification result, just as a dentist first observes the teeth and then makes a plan. In a simple deployment, Dentist can classify queries as perception or reasoning and easily mitigate potential hallucinations in answers, as demonstrated in our experiments. On MMbench, we achieve a 13.44%/10.2%/15.8% improvement in accuracy on Image Quality, a Coarse Perception visual question answering (VQA) task, over the baselines InstructBLIP/LLaVA/VisualGLM. Our source code will be released on GitHub.

URL: https://openreview.net/forum?id=ZVDWzgk6L6

---

Title: Magnifying the Three Phases of GAN Training — Fitting, Refining and Collapsing

Abstract: Generative Adversarial Networks (GANs) are efficient generative models but may suffer from mode mixture and mode collapse. We present an original global characterization of GAN training by dividing it into three successive phases — fitting, refining, and collapsing. Such a characterization underscores a strong correlation between mode mixture and the refining phase, as well as mode collapse and the collapsing phase. To analyze the causes and features of each phase, we propose a novel theoretical framework that integrates both continuous and discrete aspects of GANs, addressing a gap in existing literature that predominantly focuses on only one aspect. We develop a specialized metric to detect the phase transition from refining to collapsing and integrate it in an "early stopping" algorithm to optimize GAN training. Experiments on synthetic datasets and real-world datasets including MNIST, Fashion MNIST and CIFAR-10 substantiate our theoretical insights and highlight the efficacy of our algorithm.

URL: https://openreview.net/forum?id=MWam9EhQYS

---

Title: AdaFlood: Adaptive Flood Regularization

Abstract: Although neural networks are conventionally optimized towards zero training loss, it has recently been shown that targeting a non-zero training loss threshold, referred to as a flood level, often enables better test-time generalization. Current approaches, however, apply the same constant flood level to all training samples, which inherently assumes all samples have the same difficulty. We present AdaFlood, a novel flood regularization method that adapts the flood level of each training sample according to the difficulty of the sample. Intuitively, since training samples are not equal in difficulty, the target training loss should be conditioned on the instance. Experiments on datasets covering four diverse input modalities -- text, images, asynchronous event sequences, and tabular data -- demonstrate the versatility of AdaFlood across data domains and noise levels.
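
For reference, flood regularization replaces the loss $l$ with $|l - b| + b$; AdaFlood makes $b$ per-sample. A minimal sketch follows, assuming the adaptive flood levels are supplied externally (e.g., from an auxiliary model's per-sample loss as a difficulty proxy, which is an assumption here).

```python
import torch

def adaflood_loss(per_sample_loss, flood_levels):
    """Flooding with a per-sample flood level b_i: |l_i - b_i| + b_i. The gradient is
    reversed for samples whose loss has already dropped below their flood level,
    discouraging the model from driving those losses all the way to zero."""
    return ((per_sample_loss - flood_levels).abs() + flood_levels).mean()
```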

URL: https://openreview.net/forum?id=2s5YU6CSEz

---

Title: Robust and Efficient Quantization-aware Training via Coreset Selection

Abstract: Quantization-aware training (QAT) is a representative model compression method to reduce redundancy in weights and activations.
However, most existing QAT methods require end-to-end training on the entire dataset, which suffers from long training time and high energy costs. In addition, the potential label noise in the training data undermines the robustness of QAT. We propose two metrics based on an analysis of the loss and gradient of quantized weights, the error vector score and the disagreement score, to quantify the importance of each sample during training. Guided by these two metrics, we propose a quantization-aware Adaptive Coreset Selection (ACS) method to select the data for the current training epoch. We evaluate our method on various networks (ResNet-18, MobileNetV2, RetinaNet), datasets (CIFAR-10, CIFAR-100, ImageNet-1K, COCO), and under different quantization settings. Specifically, our method can achieve an accuracy of 68.39\% for 4-bit quantized ResNet-18 on the ImageNet-1K dataset with only a 10\% subset, an absolute gain of 4.24\% over the baseline. Our method can also improve the robustness of QAT by removing noisy samples from the training set.

URL: https://openreview.net/forum?id=4c2pZzG94y

---

Title: Scale Equalization for Multi-Level Feature Fusion

Abstract: Deep neural networks have exhibited remarkable performance in a variety of computer vision fields, especially in semantic segmentation tasks. Their success is often attributed to multi-level feature fusion, which enables them to understand both global and local information from an image. However, we found that multi-level features from parallel branches are on different scales. This scale disequilibrium is a universal and unwanted flaw that leads to detrimental gradient descent, thereby degrading performance in semantic segmentation. We discover that scale disequilibrium is caused by bilinear upsampling, which is supported by both theoretical and empirical evidence. Based on this observation, we propose injecting scale equalizers to achieve scale equilibrium across multi-level features after bilinear upsampling. Our proposed scale equalizers are easy to implement, applicable to any architecture, hyperparameter-free, add no extra computational cost, and guarantee scale equilibrium for any dataset. Experiments showed that adopting scale equalizers consistently improved the mIoU index across various target datasets, including ADE20K, PASCAL VOC 2012, and Cityscapes, as well as various decoder choices, including UPerHead, PSPHead, ASPPHead, SepASPPHead, and FCNHead.

URL: https://openreview.net/forum?id=oS4SkVKA7S

---

Title: From Complexity to Clarity: Analytical Expressions of Deep Neural Network Weights via Clifford Algebra and Convexity

Abstract: In this paper, we introduce a novel analysis of neural networks based on geometric (Clifford) algebra and convex optimization. We show that optimal weights of deep ReLU neural networks are given by the wedge product of training samples when trained with standard regularized loss. Furthermore, the training problem reduces to convex optimization over wedge product features, which encode the geometric structure of the training dataset. This structure is given in terms of signed volumes of triangles and parallelotopes generated by data vectors. The convex problem finds a small subset of samples via $\ell_1$ regularization to discover only relevant wedge product features. Our analysis provides a novel perspective on the inner workings of deep neural networks and sheds light on the role of the hidden layers.

URL: https://openreview.net/forum?id=6XPwSwEWsV

---

Title: Calibrated Uncertainty Quantification for Operator Learning via Conformal Prediction

Abstract: Operator learning has been increasingly adopted in scientific and engineering applications, many of which require calibrated uncertainty quantification. Since the output of operator learning is a continuous function, quantifying uncertainty simultaneously at all points in the domain is challenging. Current methods consider calibration at a single point or over one scalar function or make strong assumptions such as Gaussianity. We propose a risk-controlling quantile neural operator, a distribution-free, finite-sample functional calibration conformal prediction method. We provide a theoretical calibration guarantee on the coverage rate, defined as the expected percentage of points on the function domain whose true value lies within the predicted uncertainty ball. Empirical results on a 2D Darcy flow and a 3D car surface pressure prediction task validate our theoretical results, demonstrating calibrated coverage and efficient uncertainty bands outperforming baseline methods. In particular, on the 3D problem, our method is the only one that meets the target calibration percentage (percentage of test samples for which the uncertainty estimates are calibrated) of 98%.
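
The calibration step can be illustrated with a simple split-conformal-style sketch: given a heuristic uncertainty width at each domain point, pick the smallest band multiplier whose empirical point-wise coverage on calibration functions meets the target. The paper's risk-controlling procedure adds a finite-sample correction and operates on a quantile neural operator; the names and shapes below are assumptions.

```python
import numpy as np

def calibrate_band(pred_cal, true_cal, heur_width, target_coverage=0.98):
    """Find the smallest multiplier lam such that, averaged over calibration functions,
    the fraction of domain points inside pred +/- lam * heur_width reaches the target.
    pred_cal, true_cal: arrays of shape (n_cal_functions, n_domain_points)."""
    residuals = np.abs(true_cal - pred_cal)
    for lam in np.linspace(0.0, 10.0, 1001):
        coverage = (residuals <= lam * heur_width).mean()
        if coverage >= target_coverage:
            return lam
    return 10.0  # fall back to the largest multiplier in the search grid
```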

URL: https://openreview.net/forum?id=cGpegxy12T

---

Title: Improving Neural Architecture Search by Minimizing Worst-Case Validation Loss

Abstract: Neural architecture search (NAS) aims at automatically searching for high-performance architectures and has achieved considerable progress. Existing NAS methods learn architectures by minimizing average-case validation losses. As a result, the searched architectures are less capable of making correct predictions under worst-case scenarios. To address this problem, we propose a framework which leverages a deep generative model to generate adversarial validation examples to measure the worst-case validation performance of an architecture and improves the architecture by minimizing the loss on the generated adversarial validation data. Our framework is based on multi-level optimization, which performs multiple learning stages end-to-end. Experiments on a variety of datasets demonstrate the effectiveness of our method.

URL: https://openreview.net/forum?id=kRtMaQxJ3q

---

Title: Federated Fine-Tuning of Vision Foundation Models via Probabilistic Masking

Abstract: Foundation Models (FMs) have revolutionized machine learning with their adaptability and high performance across tasks; yet, their integration into Federated Learning (FL) is challenging due to substantial communication overhead from their extensive parameterization. Current communication-efficient FL strategies, such as gradient compression, reduce bitrates to around 1 bit-per-parameter (bpp). However, these approaches fail to harness the characteristics of FMs, with their large number of parameters still posing a challenge to communication efficiency, even at these bitrate regimes. In this work, we present DeltaMask, a novel method that efficiently fine-tunes FMs in FL at an ultra-low bitrate, well below 1 bpp. DeltaMask employs stochastic masking to detect highly effective subnetworks within FMs and leverages stochasticity and sparsity in client masks to compress updates into a compact grayscale image using probabilistic filters, deviating from traditional weight training approaches. Our comprehensive evaluations across various datasets and architectures demonstrate that DeltaMask efficiently achieves bitrates as low as 0.09 bpp, enhancing communication efficiency while maintaining FM performance, as measured on 8 datasets and 5 pre-trained models of various network architectures.

URL: https://openreview.net/forum?id=IBVG9LKSRF

---

Title: Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Abstract: This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.

URL: https://openreview.net/forum?id=oVTkOs8Pka

---

Title: Fast and Optimal Weight Update for Pruned Large Language Models

Abstract: Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the lost performance caused by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations.

In our paper, we propose a fast and optimal weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). Coupled with a simple gradual pruning mask selection, our algorithm achieves state-of-the-art pruning performance across a wide range of LLMs.
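
A minimal sketch of an ADMM update for one pruned layer, assuming the layer-wise objective $\min_W \|XW - XW_0\|_F^2$ subject to $W$ vanishing outside a fixed mask; the splitting below is the textbook ADMM recipe, and the paper's exact formulation and gradual mask selection are not reproduced here.

```python
import torch

def admm_pruned_weight_update(X, W0, mask, rho=1.0, iters=30):
    """Layer-wise weight update for a fixed pruning mask via ADMM. X: (n, d) calibration
    inputs, W0: (d, out) dense original weights, mask: (d, out) binary sparsity pattern."""
    d = X.shape[1]
    H = X.T @ X + 0.5 * rho * torch.eye(d, device=X.device)   # fixed system matrix
    Hinv = torch.linalg.inv(H)
    Y = X.T @ (X @ W0)
    Z = W0 * mask
    U = torch.zeros_like(W0)
    for _ in range(iters):
        W = Hinv @ (Y + 0.5 * rho * (Z - U))   # quadratic subproblem, closed form
        Z = (W + U) * mask                     # projection onto the sparsity pattern
        U = U + W - Z                          # dual update
    return Z                                   # sparse weights respecting the mask
```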

URL: https://openreview.net/forum?id=1hcpXd9Jir

---

Title: Federated Learning with Nonvacuous Generalisation Bounds

Abstract: We introduce a novel strategy to train randomised predictors in federated learning, where each node of the network aims at preserving its privacy by releasing a local predictor but keeping secret its training dataset with respect to the other nodes.
We then build a global randomised predictor which inherits the properties of the local private predictors in the sense of a PAC-Bayesian generalisation bound. We consider the synchronous case where all nodes share the same training objective (derived from a generalisation bound), and the heterogeneous and homogeneous cases where each node may have its own personalised training objective. We show through a series of numerical experiments that our approach achieves comparable predictive performance to the batch approach where all datasets are shared across nodes. Moreover, the predictors are supported by numerically nonvacuous generalisation bounds while preserving privacy for each node. We explicitly compute the increment on predictive performance and generalisation bounds for our two federated settings, highlighting the price to pay to preserve privacy.

URL: https://openreview.net/forum?id=VEEl3pFmaT

---

Title: Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

Abstract: Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and the class or caption embedding input corresponding to the desired conditional generation. Such conditioning involves scale-and-shift operations to the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers feels arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers without changing or tuning the other parts of the U-Net architecture improves the image generation quality. For example, a drop-in addition of LoRA conditioning to the EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseline of 1.97/1.79.

URL: https://openreview.net/forum?id=38P40gJPrI

---

Title: Sparse Modal Regression with Mode-Invariant Skew Noise

Abstract: Sparse regression methods have been widely used in many fields due to their statistical effectiveness and high interpretability. However, there are few sparse regression methods with skew noise, although statistical modeling using skewness is becoming more important, e.g., in the medical field. Azzalini's skew-normal distribution and its extensions are widely used to model skew distributions, and representative regression methods dealing with skew noise use such skew distributions as the noise model. These methods have a severe problem with statistical interpretability because the regression model is not for the mean, median, or mode. We propose a novel sparse regression method based on mode-invariant skew-normal noise. The regression model is always for a mode regardless of skewness and is easy to interpret. The proposed method is simple to implement and optimize, suggesting it is highly scalable to other machine-learning methods. We also provide theoretical guarantees of the proposed method for the average excess risk and the estimation error. Numerical experiments on artificial and real-world data demonstrate that the proposed method performs significantly better and is more stable than other existing methods for various skew-noise data.

URL: https://openreview.net/forum?id=63r6M1JkXm

---

Title: Identifiable Causal Inference with Noisy Treatment and No Side Information

Abstract: In some causal inference scenarios, the treatment variable is measured inaccurately, for instance in epidemiology or econometrics. Failure to correct for the effect of this measurement error can lead to biased causal effect estimates. Previous research has not studied methods that address this issue from a causal viewpoint while allowing for complex nonlinear dependencies and without assuming access to side information. For such a scenario, this study proposes a model that assumes a continuous treatment variable that is inaccurately measured. Building on existing results for measurement error models, we prove that our model's causal effect estimates are identifiable, even without knowledge of the measurement error variance or other side information. Our method relies on a deep latent variable model in which Gaussian conditionals are parameterized by neural networks, and we develop an amortized importance-weighted variational objective for training the model. Empirical results demonstrate the method's good performance with unknown measurement error. More broadly, our work extends the range of applications in which reliable causal inference can be conducted.

URL: https://openreview.net/forum?id=E0NPcsEZ2f

---

Title: Why do CNNs excel at feature extraction? A mathematical explanation.

Abstract: Over the past decade, deep learning has revolutionized the field of computer vision, with convolutional neural network models proving to be very effective for image classification benchmarks. However, a fundamental theoretical question remains unanswered: why can they solve discrete image classification tasks that involve feature extraction? We address this question in this paper by introducing a novel mathematical model for image classification, based on feature extraction, that can be used to generate images resembling real-world datasets. We show that convolutional neural network classifiers can solve these image classification tasks with zero error. In our proof, we construct piecewise linear functions that detect the presence of features, and show that they can be realized by a convolutional network.

URL: https://openreview.net/forum?id=BDUB1wWR1X

---

Title: A Comprehensive Survey on AI-based Methods for Patents

Abstract: Recent advancements in Artificial Intelligence (AI) and machine learning have demonstrated transformative capabilities across diverse domains. This progress extends to the field of patent analysis and innovation, where AI-based tools present opportunities to streamline and enhance important tasks in the patent cycle such as classification, retrieval, and valuation prediction. This not only accelerates the efficiency of patent researchers and applicants but also opens new avenues for technological innovation and discovery. Our survey provides a comprehensive summary of recent AI tools in patent analysis from more than 40 papers from 26 venues between 2017 and 2023. Unlike existing surveys, we include methods that work for patent image and text data. Furthermore, we introduce a novel taxonomy for the categorization based on the tasks in the patent life cycle as well as the specifics of the AI methods. This interdisciplinary survey aims to serve as a resource for researchers and practitioners who are working at the intersection of AI and patent analysis as well as the patent offices that are aiming to build efficient patent systems.

URL: https://openreview.net/forum?id=GJDWPVqicN

---

Title: Implicit Neural Representations for Robust Joint Sparse-View CT Reconstruction

Abstract: Computed Tomography (CT) is pivotal in industrial quality control and medical diagnostics. Sparse-view CT, offering reduced ionizing radiation, faces challenges due to its under-sampled nature, leading to ill-posed reconstruction problems. Recent advancements in Implicit Neural Representations (INRs) have shown promise in addressing sparse-view CT reconstruction. Recognizing that CT often involves scanning similar subjects, we propose a novel approach to improve reconstruction quality through joint reconstruction of multiple objects using INRs. This approach can potentially leverage both the strengths of INRs and the statistical regularities across multiple objects. While current INR joint reconstruction techniques primarily focus on accelerating convergence via meta-initialization, they are not specifically tailored to enhance reconstruction quality. To address this gap, we introduce a novel INR-based Bayesian framework integrating latent variables to capture the inter-object relationships. These variables serve as a dynamic reference throughout the optimization, thereby enhancing individual reconstruction fidelity. Our extensive experiments, which assess various key factors such as reconstruction quality, resistance to overfitting, and generalizability, demonstrate significant improvements over baselines in common numerical metrics. This underscores a notable advancement in CT reconstruction methods.

URL: https://openreview.net/forum?id=XCzuQI0oXR

---

Title: Causal Mediation Analysis with Multi-dimensional and Indirectly Observed Mediators

Abstract: Causal mediation analysis (CMA) is a powerful method to dissect the total effect of treatment into direct and mediated effects within the potential outcome framework. This is important in many scientific applications to identify the underlying mechanisms of a treatment effect. However, in many scientific applications, the mediator is unobserved, but there may exist related measurements. For example, we may want to identify how changes in brain activity or structure mediate an antidepressant's effect on behavior, but we may only have access to electrophysiological or imaging brain measurements. To date, most CMA methods assume the mediator is one-dimensional and observable, which oversimplifies such real-world scenarios. To overcome this limitation, we introduce a CMA framework that can handle complex and indirectly observed mediators based on the identifiable variational autoencoder (iVAE) architecture. We show the joint distribution over observed and latent variables is identifiable with our method both theoretically and empirically. In addition, our framework captures a disentangled representation of the indirectly observed mediator and yields an accurate estimation of the direct and mediated effects in synthetic and semi-synthetic experiments, providing evidence of its potential utility in real-world applications.

URL: https://openreview.net/forum?id=5tYmuJ9uj9

---

Title: Parametric Entropic Locally Linear Embedding for Small Sample Size Classification

Abstract: Manifold learning algorithms are powerful non-linear dimensionality reduction methods for unsupervised metric learning. The locally linear embedding (LLE) method uses the local geometry of the linear neighborhood spaces to estimate optimal reconstruction weights for each sample. In the present paper, we propose the parametric entropic LLE (PELLE) method, which adopts the relative entropy instead of the pointwise Euclidean metric to build local entropic covariance matrices. This methodological improvement increases the robustness of the method regarding noise and outliers. Moreover, state-of-the-art algorithms such as UMAP require a large number of samples for convergence to good results due to numerical optimization methods (gradient descent). Results considering 25 distinct real-world datasets indicate that the proposed method is capable of generating superior clustering and classification accuracies compared to existing state-of-the-art methods for dimensionality reduction-based metric learning, especially in datasets with a limited number of samples.

URL: https://openreview.net/forum?id=pujpAJGKD3

---

Title: Structural Pruning of Pre-trained Language Models via Neural Architecture Search

Abstract: Pre-trained language models (PLMs), for example BERT or RoBERTa, mark the state of the art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency.
This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade-off efficiency, for example in terms of model size or latency, and generalization performance. We also show how we can utilize more recently developed two-stage weight-sharing NAS approaches in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose to adopt a multi-objective approach that identifies the Pareto optimal set of sub-networks, allowing for a more flexible and automated compression process.

URL: https://openreview.net/forum?id=XiK8tHDQNX

---

Title: Chronos: Learning the Language of Time Series

Abstract: We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M parameters) on a large collection of publicly available datasets, complemented by a synthetic dataset that we generated via Gaussian processes to improve generalization. In a comprehensive benchmark consisting of 42 datasets, and comprising both classical local models and deep learning methods, we show that Chronos models: (a) significantly outperform other methods on datasets that were part of the training corpus; and (b) have comparable and occasionally superior zero-shot performance on new datasets, relative to methods that were trained specifically on them. Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines.
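
The scaling-and-quantization step is the part that is easiest to picture in code. The sketch below mean-scales a univariate series and maps it to a fixed vocabulary of bin ids; the bin range, bin count, and rounding scheme are illustrative guesses, not the exact Chronos configuration.

    import numpy as np

    def tokenize_series(x, num_bins=1024, low=-15.0, high=15.0, eps=1e-8):
        # Scale by the mean absolute value, then quantize into bin ids.
        scale = np.mean(np.abs(x)) + eps
        edges = np.linspace(low, high, num_bins - 1)   # interior bin edges
        tokens = np.digitize(x / scale, edges)         # ids in [0, num_bins - 1]
        return tokens, scale

    def detokenize(tokens, scale, num_bins=1024, low=-15.0, high=15.0):
        # Map bin ids back to approximate real values via bin centres.
        centres = np.linspace(low, high, num_bins)
        return centres[tokens] * scale

    series = np.array([10.0, 12.0, 9.5, 11.0, 30.0])
    toks, s = tokenize_series(series)
    approx = detokenize(toks, s)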

URL: https://openreview.net/forum?id=gerNCVqqtR

---

Title: Interpretability in the Era of Large Language Models: Opportunities and Challenges

Abstract: Interpretable machine learning has exploded as an area of interest over the last decade, sparked by the rise of increasingly large datasets and deep neural networks. Simultaneously, large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks, offering a chance to rethink opportunities in interpretable machine learning. Notably, the capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human. However, these new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.

In this position paper, we start by reviewing existing methods to evaluate the emerging field of LLM interpretation (both interpreting LLMs and using LLMs for explanation). We contend that, despite their limitations, LLMs hold the opportunity to redefine interpretability with a more ambitious scope across many applications, including in auditing LLMs themselves. We highlight two emerging research priorities for LLM interpretation: using LLMs to directly analyze new datasets and to generate interactive explanations.

URL: https://openreview.net/forum?id=yibaLrx5Bm

---

Title: Merging Text Transformer Models from Different Initializations

Abstract: Recent work on one-shot permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models from completely different initializations.
However, this line of work has not yet extended to the Transformer architecture, despite its dominant popularity in the language domain.
Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape.
The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class.
In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging for several models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark.
Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models.
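
For readers unfamiliar with permutation-based merging, the generic one-shot idea (before the Transformer-specific interventions discussed above) can be sketched as follows: align the hidden units of one layer to another by solving an assignment problem on a similarity matrix, then average. The similarity measure and layer shapes below are illustrative; residual streams, attention heads, and sequential inputs need the extra care described in the abstract.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def permute_and_average(Wa, Wb):
        # Rows of Wa/Wb are hidden units; find the permutation of B's units
        # that best matches A's, then average the aligned weights.
        sim = Wa @ Wb.T                             # unit-to-unit similarity
        _, col = linear_sum_assignment(-sim)        # maximize total similarity
        Wb_perm = Wb[col]                           # permute B's units
        return 0.5 * (Wa + Wb_perm), col

    Wa = np.random.randn(8, 16)
    Wb = np.random.randn(8, 16)
    merged, perm = permute_and_average(Wa, Wb)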

URL: https://openreview.net/forum?id=nWnYSLncXa

---

Title: GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data

Abstract: Neural network models often struggle with high-dimensional but small sample-size tabular datasets. One reason is that current weight initialisation methods assume independence between weights, which can be problematic when there are insufficient samples to estimate the model's parameters accurately. In such small data scenarios, leveraging additional structures can improve the model's performance and training stability. To address this, we propose GCondNet, a general approach to enhance neural networks by leveraging implicit structures present in tabular data. We create a graph between samples for each data dimension, and utilise Graph Neural Networks (GNNs) to extract this implicit structure and to condition the parameters of the first layer of an underlying predictor network. By creating many small graphs, GCondNet exploits the data's high-dimensionality, and thus improves the performance of an underlying predictor network. We demonstrate the effectiveness of our method on 12 real-world datasets, where GCondNet outperforms 14 standard and state-of-the-art methods. The results show that GCondNet is a versatile framework for injecting graph-regularisation into various types of neural networks, including MLPs and tabular Transformers.

URL: https://openreview.net/forum?id=y0b0H1ndGQ

---

Title: Prioritized Federated Learning: Leveraging Non-Priority Clients for Targeted Model Improvement

Abstract: Federated Learning (FL) is a distributed machine learning approach to learn models on decentralized heterogeneous data, without the need for clients to share their data. Many existing FL approaches assume that all clients have equal importance and construct a global objective based on all clients. We consider a version of FL we call Prioritized FL, where the goal is to learn a weighted mean objective of a subset of clients, designated as priority clients. An important question arises: How do we choose well-aligned non-priority clients to participate in the federation, while discarding misaligned clients? We present FedALIGN (Federated Adaptive Learning with Inclusion of Global Needs) to address this challenge. The algorithm employs a matching strategy that chooses non-priority clients based on how similar the model’s loss is on their data compared to the global data, thereby ensuring the use of non-priority client gradients only when it is beneficial for priority clients. This approach ensures mutual benefits as non-priority clients are motivated to join when the model performs satisfactorily on their data, and priority clients can utilize their updates and computational resources when their goals align. We present a convergence analysis that quantifies the trade-off between client selection and speed of convergence. Our algorithm shows faster convergence and higher test accuracy than baselines for various synthetic and benchmark datasets.
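
The selection rule can be pictured with a small sketch: a non-priority client joins a round only if the current model's loss on its local data is close to the loss on the priority (global) data. The relative-gap criterion and tolerance below are illustrative stand-ins for the paper's matching strategy.

    def select_aligned_clients(loss_fn, priority_data, nonpriority_clients, tol=0.1):
        # loss_fn(data) evaluates the current global model's loss on a dataset.
        global_loss = loss_fn(priority_data)
        selected = []
        for client_id, data in nonpriority_clients.items():
            local_loss = loss_fn(data)
            # keep clients whose data the model treats similarly to the priority data
            if abs(local_loss - global_loss) <= tol * max(global_loss, 1e-12):
                selected.append(client_id)
        return selected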

URL: https://openreview.net/forum?id=FR8dvo6q8i

---

Title: Perceptual Similarity for Measuring Decision-Making Style and Policy Diversity in Games

Abstract: Defining and measuring decision-making styles, also known as playstyles, are crucial in gaming, where they reflect a broad spectrum of individuality and diversity. However, finding a universally applicable measure for these styles poses a challenge. Building on the $\textit{Playstyle Distance}$, the first unsupervised metric to measure playstyle similarity based on observation-action pairs by identifying comparable states with discrete representations, we introduce three enhancements to increase accuracy: multiscale analysis with varied state granularity, a perceptual kernel rooted in psychology, and the utilization of the intersection-over-union method for efficient evaluation. These innovations not only advance measurement precision but also offer insights into human cognition of similarity. Across two racing games and seven Atari games, our techniques significantly improve the precision of playstyle classification, achieving an accuracy exceeding $90\%$ with fewer than 512 observation-action pairs, less than half an episode of these games. Furthermore, our experiments with $\textit{2048}$ and $\textit{Go}$ demonstrate the potential of discrete playstyle measures in puzzle and board games. We also develop an algorithm for assessing decision-making diversity using these measures. Our findings illuminate promising avenues for real-time game analysis and the evolution of artificial intelligence with diverse playstyles.

URL: https://openreview.net/forum?id=30C9AWBW49

---

Title: Partially Personalized Federated Learning: Breaking the Curse of Data Heterogeneity

Abstract: We consider a partially personalized formulation of Federated Learning (FL) that strikes a balance between the flexibility of personalization and cooperativeness of global training. In our framework, we split the variables into global parameters, which are shared across all clients, and individual local parameters, which are kept private. We prove that under the right split of parameters, it is possible to find global parameters that allow each client to fit their data perfectly, and refer to the obtained problem as overpersonalized. For instance, the shared global parameters can be used to learn good data representations, whereas the personalized layers are fine-tuned for a specific client. Moreover, we present a simple algorithm for the partially personalized formulation that offers significant benefits to all clients. In particular, it breaks the curse of data heterogeneity in several settings, such as training with local steps, asynchronous training, and Byzantine-robust training.

URL: https://openreview.net/forum?id=8tMMCf4YYn

---

Title: Meta-Learning Approach for Joint Multimodal Signals with Multimodal Iterative Adaptation

Abstract: In the pursuit of effectively modeling real-world joint multimodal signals, learning to learn multiple Implicit Neural Representations (INRs) jointly has gained attention to overcome data scarcity and enhance fitting speed. However, predominant methods based on multimodal encoders often underperform due to their reliance on direct data-to-parameter mapping functions, bypassing the optimization steps necessary for capturing the complexities of real-world signals. To address this gap, we propose Multimodal Iterative Adaptation (MIA), a novel framework that combines the strengths of multimodal fusion with optimization-based meta-learning. The key idea is to enhance the learning of INRs by facilitating exchange of cross-modal knowledge among learners during the iterative optimization processes, improving generalization and enabling a more nuanced adaptation to complex signals. To achieve this, we introduce State Fusion Transformers (SFTs), an attention-based meta-learner designed to operate in the backward pass of the learners, aggregating learning states, capturing cross-modal relationships, and predicting enhanced parameter updates for the learners. Our extensive evaluation in various real-world multimodal signal regression setups shows that MIA outperforms existing baselines in both generalization and memorization performances.

URL: https://openreview.net/forum?id=LV04KBaIQt

---

Title: Sparsifying Bayesian neural networks with latent binary variables and normalizing flows

Abstract: Artificial neural networks are powerful machine learning methods used in many modern applications. A common issue is that they have millions or billions of parameters, and therefore tend to overfit. Bayesian neural networks (BNN) can improve on this since they incorporate parameter uncertainty. Latent binary Bayesian neural networks (LBBNN) further take into account structural uncertainty by allowing the weights to be turned on or off, enabling inference in the joint space of weights and structures. Mean-field variational inference is typically used for computation within such models. In this paper, we will consider two extensions of variational inference for the LBBNN: Firstly, by using the local reparametrization trick (LCRT), we improve computational efficiency. Secondly, and more importantly, by using normalizing flows on the variational posterior distribution of the LBBNN parameters, we learn a more flexible variational posterior than the mean field Gaussian. Experimental results on real data show that this improves predictive power compared to using mean field variational inference on the LBBNN method, while also obtaining sparser networks. We also perform two simulation studies. In the first, we consider variable selection in a logistic regression setting, where the more flexible variational distribution improves results. In the second study, we compare predictive uncertainty based on data generated from two-dimensional Gaussian distributions. Here, we argue that our Bayesian methods lead to more realistic estimates of predictive uncertainty.

URL: https://openreview.net/forum?id=d6kqUKzG3V

---

Title: Training-free Graph Neural Networks and the Power of Labels as Features

Abstract: We propose training-free graph neural networks (TFGNNs), which can be used without training and can also be improved with optional training, for transductive node classification. We first advocate labels as features (LaF), which is an admissible but not explored technique. We show that LaF provably enhances the expressive power of graph neural networks. We design TFGNNs based on this analysis. In the experiments, we confirm that TFGNNs outperform existing GNNs in the training-free setting and converge with much fewer training iterations than traditional GNNs.
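
The labels-as-features idea admits a very small illustration: append one-hot training labels (zeros for unlabeled nodes) to the node features and propagate them over the graph, then read predictions off the propagated label channels. This is a plain label-propagation-flavoured sketch, not the exact TFGNN architecture; all names are assumptions.

    import numpy as np

    def training_free_laf(adj, features, train_idx, train_labels, num_classes, hops=2):
        n = adj.shape[0]
        laf = np.zeros((n, num_classes))
        laf[train_idx, train_labels] = 1.0          # labels as extra input features
        h = np.concatenate([features, laf], axis=1)
        P = adj / (adj.sum(axis=1, keepdims=True) + 1e-12)   # row-normalized adjacency
        for _ in range(hops):
            h = P @ h                               # parameter-free message passing
        return h[:, -num_classes:].argmax(axis=1)   # predict from label channels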

URL: https://openreview.net/forum?id=7DzU88VrNU

---

Title: Privacy-Preserving Split Learning with Vision Transformers using Patch-Wise Random and Noisy CutMix

Abstract: In computer vision, the vision transformer (ViT) has increasingly superseded the convolutional neural network (CNN) for improved accuracy and robustness. Since ViT often comes with large model sizes and high sample complexity, split learning (SL) is a promising solution to training ViT using large memory and computing resources at a server with the sheer amount of private data owned by users or clients. In SL, a ViT is split into two parts under a server-client architecture. The server stores its upper segment, which is associated with multiple clients, each of which stores the lower segment. At the cut layer between the upper and lower segments, SL exchanges the cut-layer hidden activations in the forward propagation (FP), referred to as smashed data, and the cut-layer gradients in the backpropagation (BP), which are exposed to various attacks on private training data. To mitigate the risk of data breaches in classification tasks, inspired by the CutMix regularization, we propose a novel privacy-preserving SL framework that injects Gaussian noise into smashed data and mixes randomly chosen patches of smashed data across clients, coined DP-CutMixSL. By analysis, we prove that DP-CutMixSL is a differentially private (DP) mechanism amplifying the privacy budget with respect to membership inference attacks in FP. By simulation, we additionally show that DP-CutMixSL protects privacy from reconstruction attacks in FP and from label inference attacks in BP. Surprisingly, DP-CutMixSL even improves accuracy and robustness to imbalanced data distributions over clients, due to the regularization effect of its patch-wise random CutMix operations.
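
To illustrate the mechanism in the forward pass, the sketch below adds Gaussian noise to two clients' cut-layer ("smashed") activations and swaps a randomly chosen subset of patch positions between them. The noise scale and mixing ratio are placeholders, not calibrated differential-privacy parameters.

    import torch

    def dp_cutmix_smashed(h_a, h_b, mix_ratio=0.25, sigma=0.1):
        # h_a, h_b: (batch, num_patches, dim) ViT activations at the cut layer
        h_a = h_a + sigma * torch.randn_like(h_a)   # Gaussian noise injection
        h_b = h_b + sigma * torch.randn_like(h_b)
        num_patches = h_a.shape[1]
        k = max(1, int(mix_ratio * num_patches))
        idx = torch.randperm(num_patches)[:k]       # patch positions to exchange
        mixed_a, mixed_b = h_a.clone(), h_b.clone()
        mixed_a[:, idx], mixed_b[:, idx] = h_b[:, idx], h_a[:, idx]
        return mixed_a, mixed_b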

URL: https://openreview.net/forum?id=4bo6XAnutd

---

Title: Meta-Learning under Task Shift

Abstract: A common assumption in meta-learning is that meta-training and meta-test tasks are drawn from the same distribution. However, this assumption is often not fulfilled. Under such task shift, standard meta-learning algorithms do not work as desired since their unbiasedness is no longer maintained. In this paper, we propose a new meta-learning method called Importance Weighted Meta-Learning (IWML), which preserves unbiasedness even under task shift. Our approach uses both labeled meta-training datasets and unlabeled datasets in tasks obtained from the meta-test task distribution to assign weights to each meta-training task. These weights are determined by the ratio of meta-test and meta-training task densities. Our method enables the model to focus more on the meta-training tasks that closely align with meta-test tasks during the meta-training process. We meta-learn neural network-based models by minimizing the expected weighted meta-training error, which is an unbiased estimator of the expected error over meta-test tasks. The task density ratio is estimated using kernel density estimation, where the distance between tasks is measured by the maximum mean discrepancy. Our empirical evaluation of few-shot classification datasets demonstrates a significant improvement of IWML over existing approaches.
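
A compact sketch of the weighting step: measure distances between tasks with an RBF-kernel MMD, turn those distances into kernel density estimates over meta-test and meta-training tasks, and weight each training task by the density ratio. Kernel bandwidths and the exact estimator form are assumptions for illustration.

    import numpy as np

    def rbf_mmd2(X, Y, gamma=1.0):
        # Squared maximum mean discrepancy between two task datasets.
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d2)
        return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

    def task_weights(train_tasks, test_tasks, bandwidth=1.0):
        # Kernel density estimate over tasks, with MMD as the task distance.
        def kde(task, tasks):
            d = np.array([rbf_mmd2(task, t) for t in tasks])
            return np.mean(np.exp(-d / (2.0 * bandwidth ** 2)))
        w = np.array([kde(t, test_tasks) / (kde(t, train_tasks) + 1e-12)
                      for t in train_tasks])
        return w / w.sum()                          # normalized importance weights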

URL: https://openreview.net/forum?id=OxV75W90FN

---

Title: Defending Against Unknown Corrupted Agents: Reinforcement Learning of Adversarially Robust Nash Equilibria

Abstract: We consider a Multi-agent Reinforcement Learning (MARL) setting, in which an attacker can arbitrarily corrupt any subset of up to $k$ out of $n$ agents at deployment. Our goal is to design agents that are robust against such an attack, by accounting for the presence of corrupted agents at test time. To that end, we introduce a novel solution concept, the Adversarially Robust Nash Equilibrium (ARNEQ), and provide theoretical proof of its existence in general-sum Markov games. Furthermore, we introduce a proof-of-concept model-based approach to computing it and theoretically prove its convergence under standard assumptions. We also present a practical approach called Adversarially Robust Training (ART), an independent learning algorithm based on stochastic gradient descent ascent. Our experiments in both cooperative and mixed cooperative-competitive environments demonstrate ART's effectiveness and practical value in enhancing MARL resilience against adversarial behavior.

URL: https://openreview.net/forum?id=aggyMifxLQ

---

Title: Adaptive Training Distributions with Scalable Online Bilevel Optimization

Abstract: Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work considers modifying the pretraining distribution in the case where one has a small sample of data reflecting the targeted test conditions. We propose an algorithm motivated by a recent formulation of this setting as an online, bilevel optimization problem. With scalability in mind, our algorithm prioritizes computing gradients at training points which are likely to most improve the loss on the targeted distribution. Empirically, we show that in some cases this approach is beneficial over existing strategies from the domain adaptation literature but may not succeed in other cases. We propose a simple test to evaluate when our approach can be expected to work well and point towards further research to address current limitations.

URL: https://openreview.net/forum?id=JP1GVyF5i5

---

Title: Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering

Abstract: Large language models (LLMs) such as ChatGPT have demonstrated impressive abilities in generating responses based on human instructions. However, their use in the medical field can be challenging due to their lack of specific, in-depth knowledge. A general solution to integrate external knowledge into LLMs is through a retrieval-augmented generation framework. Yet, the choice of corpus and the design of the retrieval pipeline for medical question-answering tasks remain under-explored. In this study, we present a system called LLMs Augmented with Medical Textbooks (LLM-AMT) designed to enhance the proficiency of LLMs in specialized domains. LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules. These modules include a Query Augmenter, a Hybrid Textbook Retriever, and a Knowledge Self-Refiner. Together, they incorporate authoritative medical knowledge. Additionally, an LLM Reader aids in contextual understanding. Our experimental results on three medical QA tasks demonstrate that LLM-AMT significantly improves response quality, with accuracy gains ranging from 11.6% to 16.6%. Notably, with GPT-4-Turbo as the base model, LLM-AMT outperforms the specialized Med-PaLM 2 model pre-trained on a massive amount of medical corpus by 2-3%. We found that despite being 100× smaller in size, medical textbooks as a retrieval corpus is a more effective knowledge database than Wikipedia in the medical domain, boosting performance by 7.8%-13.7%. We will open-source the code for this work.

URL: https://openreview.net/forum?id=SvYTotUshi

---

Title: Overcoming the Stability Gap in Continual Learning

Abstract: Pre-trained deep neural networks (DNNs) are being widely deployed by industry for making business decisions and to serve users; however, a major problem is model decay, where the DNN's predictions become more erroneous over time, resulting in revenue loss or unhappy users. To mitigate model decay, DNNs are retrained from scratch using old and new data. This is computationally expensive, so retraining happens only once performance has significantly decreased. Here, we study how continual learning (CL) could potentially overcome model decay in large pre-trained DNNs and also greatly reduce computational costs for keeping DNNs up-to-date. We identify the ``stability gap'' as a major obstacle in our setting. The stability gap refers to a phenomenon where learning new data causes large drops in performance for past tasks before CL mitigation methods eventually compensate for this drop. We test two hypotheses for why the stability gap occurs and identify a method that vastly reduces this gap. In large-scale experiments for both easy and hard CL distributions (e.g., class incremental learning), we demonstrate that our method reduces the stability gap and greatly increases computational efficiency. Our work aligns CL with the goals of the production setting, where CL is needed for many applications.

URL: https://openreview.net/forum?id=o2wEfwUOma

---

Title: Dependency Structure Search Bayesian Optimization for Decision Making Models

Abstract: Many approaches for optimizing decision making models rely on gradient based methods requiring informative feedback from the environment. However, in the case where such feedback is sparse or uninformative, such approaches may result in poor performance.
Derivative-free approaches such as Bayesian Optimization mitigate the dependency on the quality of gradient feedback, but are known to scale poorly in the high-dimension setting of complex decision making models. This problem is exacerbated if the model requires interactions between several agents cooperating to accomplish a shared goal.
To address the dimensionality challenge, we propose a compact multi-layered architecture modeling the dynamics of agent interactions through the concept of role.
We introduce Dependency Structure Search Bayesian Optimization to efficiently optimize the multi-layered architecture parameterized by a large number of parameters, and give the first improved regret bound in additive high-dimensional Bayesian Optimization since Mutny and Krause (2018). Our approach shows strong empirical results under malformed or sparse reward.

URL: https://openreview.net/forum?id=U6bA2lhwVV

---

Title: Efficient Model-Agnostic Multi-Group Equivariant Networks

Abstract: Constructing model-agnostic group equivariant networks, such as equitune (Basu et al., 2023b) and its generalizations (Kim et al., 2023), can be computationally expensive for large product groups. We address this problem by providing efficient model-agnostic equivariant designs for two related problems: one where the network has multiple inputs each with potentially different groups acting on them, and another where there is a single input but the group acting on it is a large product group. For the first design, we initially consider a linear model and characterize the entire equivariant space that satisfies this constraint. This characterization gives rise to a novel fusion layer between different channels that satisfies an invariance-symmetry (IS) constraint, which we call an IS layer. We then extend this design beyond linear models, similar to equitune, consisting of equivariant and IS layers. We also show that the IS layer is a universal approximator of invariant-symmetric functions. Inspired by the first design, we use the notion of the IS property to design a second efficient model-agnostic equivariant design for large product groups acting on a single input. For the first design, we provide experiments on multi-image classification where each view is transformed independently with transformations such as rotations. We find equivariant models are robust to such transformations and perform competitively otherwise. For the second design, we consider three applications: language compositionality on the SCAN dataset to product groups; fairness in natural language generation from GPT-2 to address intersectionality; and robust zero-shot image classification with CLIP. Overall, our methods are simple and general, competitive with equitune and its variants, while also being computationally more efficient.

URL: https://openreview.net/forum?id=HCMDtc0ZhV

---

Title: Learning Causally Invariant Reward Functions from Diverse Demonstrations

Abstract: Inverse reinforcement learning methods aim to retrieve the reward function of a Markov decision process based on a dataset of expert demonstrations. The commonplace scarcity and heterogeneous sources of such demonstrations can lead to the absorption of spurious correlations in the data by the learned reward function. Consequently, this adaptation often exhibits behavioural overfitting to the expert data set when a policy is trained on the obtained reward function under distribution shift of the environment dynamics. In this work, we explore a novel regularization approach for inverse reinforcement learning methods based on the causal invariance principle with the goal of improved reward function generalization. By applying this regularization to both exact and approximate formulations of the learning task, we demonstrate superior policy performance when trained using the recovered reward functions in a transfer setting.

URL: https://openreview.net/forum?id=1lW6xdQQ3r

---

Title: Rethinking Teacher-Student Curriculum Learning through the Cooperative Mechanics of Experience

Abstract: Teacher-Student Curriculum Learning (TSCL) is a curriculum learning framework that draws inspiration from human cultural transmission and learning. It involves a teacher algorithm shaping the learning process of a learner algorithm by exposing it to controlled experiences. Despite its success, understanding the conditions under which TSCL is effective remains challenging. In this paper, we propose a data-centric perspective to analyze the underlying mechanics of the teacher-student interactions in TSCL. We leverage cooperative game theory to describe how the composition of the set of experiences presented by the teacher to the learner, as well as their order, influences the performance of the curricula found by TSCL approaches. To do so, we demonstrate that for every TSCL problem, there exists an equivalent cooperative game, and several key components of the TSCL framework can be reinterpreted using game-theoretic principles. Through experiments covering supervised learning, reinforcement learning, and classical games, we estimate the cooperative values of experiences and use value-proportional curriculum mechanisms to construct curricula, even in cases where TSCL struggles. The framework and experimental setup we present in this work represent a novel foundation for a deeper exploration of TSCL, shedding light on its underlying mechanisms and providing insights into its broader applicability in machine learning.

URL: https://openreview.net/forum?id=qWh82br6KT

---

Title: Neural Likelihood Approximation for Integer Valued Time Series Data

Abstract: Stochastic processes defined on integer valued state spaces are popular within the physical and biological sciences. These models are necessary for capturing the dynamics of small systems where the individual nature of the populations cannot be ignored and stochastic effects are important. The inference of the parameters of such models, from time series data, is challenging due to the intractability of the likelihood. To work at all, current simulation-based inference methods require the generation of realisations of the model conditional on the data, which can be both tricky to implement and computationally expensive. In this paper we instead construct a neural likelihood approximation that can be trained using unconditional simulation of the underlying model, which is much simpler. We demonstrate our method by performing inference on a number of ecological and epidemiological models, showing that we can accurately approximate the true posterior while achieving significant computational speed ups compared to current best methods.

URL: https://openreview.net/forum?id=MMjRBe4oKF

---

Title: Communication Efficient Federated Learning over Wireless Channels

Abstract: Large-scale federated learning (FL) over wireless multiple access channels (MACs) has emerged as a crucial learning paradigm with a wide range of applications. However, its widespread adoption is hindered by several major challenges, including limited bandwidth shared by many edge devices, noisy and erroneous wireless communications, and heterogeneous datasets with different distributions across edge devices. To overcome these fundamental challenges, we propose Federated Proximal Sketching (FPS), tailored towards band-limited wireless channels and handling data heterogeneity across edge devices. FPS uses a count sketch data structure to address the bandwidth bottleneck and enable efficient compression while maintaining accurate estimation of significant coordinates. Additionally, we modify the loss function in FPS such that it is equipped to deal with varying degrees of data heterogeneity. We establish the convergence of the FPS algorithm under mild technical conditions and characterize how the bias induced by factors like data heterogeneity and noisy wireless channels plays a role in the overall result. We complement the proposed theoretical framework with numerical experiments that demonstrate the stability, accuracy, and efficiency of FPS in comparison to state-of-the-art methods on both synthetic and real-world datasets. Overall, our results show that FPS is a promising solution to tackling the above challenges of FL over wireless MACs.
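
The count sketch component is easy to illustrate on its own: each gradient coordinate is hashed (with a random sign) into a small table on the device side, and the server recovers a per-coordinate estimate from the table. Table sizes below are arbitrary toy values; the proximal loss modification and wireless-channel modelling of FPS are not shown.

    import numpy as np

    class CountSketch:
        def __init__(self, dim, rows=5, width=256, seed=0):
            rng = np.random.default_rng(seed)
            self.h = rng.integers(0, width, size=(rows, dim))    # bucket hashes
            self.s = rng.choice([-1.0, 1.0], size=(rows, dim))   # sign hashes
            self.rows, self.width = rows, width

        def compress(self, g):
            table = np.zeros((self.rows, self.width))
            for r in range(self.rows):
                np.add.at(table[r], self.h[r], self.s[r] * g)    # signed bucket sums
            return table

        def decompress(self, table):
            est = self.s * table[np.arange(self.rows)[:, None], self.h]
            return np.median(est, axis=0)            # median over rows per coordinate

    g = np.random.randn(10_000)                      # a gradient to transmit
    sketch = CountSketch(dim=g.size)
    g_hat = sketch.decompress(sketch.compress(g))    # approximate recovery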

URL: https://openreview.net/forum?id=jk79NlmVIn

---

Title: Stochastic Variance-Reduced Newton: Accelerating Finite-Sum Minimization with Large Batches

Abstract: Stochastic variance reduction has proven effective at accelerating first-order algorithms for solving convex finite-sum optimization tasks such as empirical risk minimization. Incorporating second-order information has proven helpful in further improving the performance of these first-order methods. Yet, comparatively little is known about the benefits of using variance reduction to accelerate popular stochastic second-order methods such as Subsampled Newton. To address this, we propose Stochastic Variance-Reduced Newton (SVRN), a finite-sum minimization algorithm that provably accelerates existing stochastic Newton methods from $O(\alpha\log(1/\epsilon))$ to $O\big(\frac{\log(1/\epsilon)}{\log(n)}\big)$ passes over the data, i.e., by a factor of $O(\alpha\log(n))$, where $n$ is the number of sum components and $\alpha$ is the approximation factor in the Hessian estimate. Surprisingly, this acceleration gets more significant the larger the data size $n$, which is a unique property of SVRN. Our algorithm retains the key advantages of Newton-type methods, such as easily parallelizable large-batch operations and a simple unit step size. We use SVRN to accelerate Subsampled Newton and Iterative Hessian Sketch algorithms, and show that it compares favorably to popular first-order methods with variance~reduction.
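
One SVRN-style iteration can be sketched for l2-regularized logistic regression: a large-batch variance-reduced gradient (anchored at a reference point) is preconditioned by a subsampled Hessian with a unit step size. Batch sizes, the outer loop that refreshes the reference point, and names below are illustrative, not the paper's exact schedule.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def svrn_step(X, y, x, x_ref, hess_idx, batch_idx, lam=1e-3):
        # Variance-reduced gradient anchored at x_ref, Newton preconditioning
        # with a Hessian estimated on the subsample hess_idx.
        def grad(idx, w):
            p = sigmoid(X[idx] @ w)
            return X[idx].T @ (p - y[idx]) / len(idx) + lam * w
        g_ref = grad(np.arange(len(y)), x_ref)               # full gradient at reference
        g_vr = grad(batch_idx, x) - grad(batch_idx, x_ref) + g_ref
        Xs = X[hess_idx]
        p = sigmoid(Xs @ x)
        H = Xs.T @ (Xs * (p * (1.0 - p))[:, None]) / len(hess_idx) \
            + lam * np.eye(X.shape[1])                        # subsampled Hessian
        return x - np.linalg.solve(H, g_vr)                   # unit step size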

URL: https://openreview.net/forum?id=dzQCRHKRdC

---

Title: Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors

Abstract: This paper presents RADAR---Robust Adversarial Detection via Adversarial Retraining---an approach designed to enhance the robustness of adversarial detectors against adaptive attacks, while maintaining classifier performance. An adaptive attack is one where the attacker is aware of the defenses and adapts their strategy accordingly.
Our proposed method leverages adversarial training to reinforce the ability to detect attacks, without compromising clean accuracy. During the training phase, we integrate into the dataset adversarial examples optimized to fool both the classifier and the adversarial detector, enabling the adversarial detector to learn and adapt to potential attack scenarios.

URL: https://openreview.net/forum?id=YPlL1b6gil

---

Title: Resource-efficient Pure Exploration for Contextual Bandits

Abstract: It is often of interest to learn a context-sensitive decision policy, such as in contextual multi-armed bandit processes. To quantify the efficiency of a machine learning algorithm for such settings, probably approximately correct (PAC) bounds, which bound the number of samples required, or cumulative regret guarantees, are typically used. However, many real-world settings have a limited amount of resources for experimentation, and decisions/interventions may differ in the amount of resources required (e.g., money or time). Therefore it is of interest to consider how to design an experiment strategy to learn a near-optimal contextual policy while minimizing the total amount of resources required. In contrast to RL or bandit approaches that embed costs into the reward function, here we focus on minimizing the resources needed to learn a near-optimal policy without resource constraints, which is similar to PAC-style approaches which seek to minimize the amount of data needed to learn a near-optimal policy. We propose two resource-aware algorithms for the contextual bandit setting and provide finite sample performance bounds on the resulting best policy that can be obtained from each of the algorithms. We also evaluate both algorithms on synthetic and semi-synthetic datasets and show that they significantly reduce the total resources needed to learn a near-optimal decision policy compared to prior approaches that use resource-unaware exploration strategies.

URL: https://openreview.net/forum?id=2Zr5mHpYxt

---

Title: Lifelong Learning in StyleGAN through Latent Subspaces

Abstract: StyleGAN is one of the most versatile generative models that have emerged in recent times. However, when it is trained continually on a stream of data (potentially previously unseen distributions), it tends to forget the distribution it has learned, as is the case with any other generative model, due to catastrophic forgetting. Recent studies have shown that the latent space of StyleGAN is very versatile, as data from a variety of distributions can be inverted onto it. In this paper, we propose to leverage this property to facilitate lifelong learning of StyleGAN without forgetting. Specifically, given a StyleGAN trained on a certain task (dataset), we propose to learn a latent subspace characterized by a set of dictionary vectors in its latent space, one for each novel, unseen task (or dataset). We also learn a relatively small set of parameters (feature adaptors) in the weight space to complement the dictionary learning in the latent space. Furthermore, we introduce a method that utilizes the similarity between tasks to effectively reuse the feature adaptor parameters from the previous tasks, aiding in the learning process for the current task at hand. Our approach guarantees that the parameters from previous tasks are reused only if they contribute to a beneficial forward transfer of knowledge. Remarkably, StyleCL avoids catastrophic forgetting because the sets of dictionary vectors and feature adaptor parameters are unique for each task. We demonstrate that our method, StyleCL, achieves better generation quality on multiple datasets with significantly fewer additional parameters per task compared to previous methods. This is a consequence of learning task-specific dictionaries in the latent space, which has a much lower dimensionality compared to the weight space.

URL: https://openreview.net/forum?id=I4IAwVOZrM

---

Title: Locally Adaptive Federated Learning

Abstract: Federated learning is a paradigm of distributed machine learning in which multiple clients coordinate with a central server to learn a model, without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) ensure balance among the clients by using the same stepsize for local updates on all clients. However, this means that all clients need to respect the global geometry of the function which could yield slow convergence. In this work, we propose locally adaptive federated learning algorithms that leverage the local geometric information for each client function. We show that such locally adaptive methods with uncoordinated stepsizes across all clients can be particularly efficient in interpolated (overparameterized) settings, and analyze their convergence in the presence of heterogeneous data for convex and strongly convex settings. We validate our theoretical claims by performing illustrative experiments for both i.i.d. and non-i.i.d. cases. Our proposed algorithms match the optimization performance of tuned FedAvg in the convex setting, outperform FedAvg as well as state-of-the-art adaptive federated algorithms like FedAMS for non-convex experiments, and come with superior generalization performance.
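
A minimal picture of "locally adaptive" client updates, under the assumption that each client runs an AdaGrad-norm-style stepsize on its own local objective instead of a shared global stepsize (the paper's exact stepsize rule may differ):

    import numpy as np

    def local_adaptive_steps(w, grad_fn, num_steps=10, eps=1e-8):
        # grad_fn(w) returns the gradient of this client's local objective.
        acc = 0.0
        for _ in range(num_steps):
            g = grad_fn(w)
            acc += np.sum(g * g)
            w = w - g / (np.sqrt(acc) + eps)   # stepsize adapts to local geometry
        return w                               # sent back to the server for averaging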

URL: https://openreview.net/forum?id=auLdS1iuKW

---

Title: Heterogeneous graph adaptive flow network

Abstract: Many graphs or networks are heterogeneous by nature, involving various vertex types and relation types. Most graph learning models for heterogeneous graphs employ meta-paths to guide neighbor selections and extract composite relations. However, the use of meta-paths to generate relations between the same vertex types may result in directed edges and failure to fully utilize the other vertex or edge types in the data. To address such a limitation, we propose Heterogeneous graph adaptive flow network (HetaFlow), which removes the need for meta-paths. HetaFlow decomposes the heterogeneous graph into flows and performs convolution across heterogeneous vertex and edge types, using an adaptation to change the vertex features based on the corresponding vertex and edge types during aggregation. Experiments on real-world datasets for vertex clustering and vertex classification demonstrate that HetaFlow outperforms other benchmark models and achieves state-of-the-art performance on commonly used benchmark datasets. The code is available at https://github.com/AnonymizedC/HetaFlow.

URL: https://openreview.net/forum?id=usvg3yhjAx

---

Title: On Intriguing Layer-Wise Properties of Robust Overfitting in Adversarial Training

Abstract: Adversarial training has proven to be one of the most effective methods to defend against adversarial attacks. Nevertheless, robust overfitting is a common obstacle in adversarial training of deep networks. There is a common belief that the features learned by different network layers have different properties, however, existing works generally investigate robust overfitting by considering a DNN as a single unit and hence the impact of different network layers on robust overfitting remains unclear. In this work, we divide a DNN into a series of layers and investigate the effect of different network layers on robust overfitting. We find that different layers exhibit distinct properties towards robust overfitting, and in particular, robust overfitting is mostly related to the optimization of latter parts of the network. Based upon the observed effect, we propose a \emph{robust adversarial training} (RAT) prototype: in a minibatch, we optimize the front parts of the network as usual, and adopt additional measures to regularize the optimization of the latter parts. Based on the prototype, we designed two realizations of RAT, and extensive experiments demonstrate that RAT can eliminate robust overfitting and boost adversarial robustness over the standard adversarial training.

URL: https://openreview.net/forum?id=phk5CcKTcc

---

Title: Unleashing the Power of PAC-Bayes Training for Unbounded Loss

Abstract: Previous research on PAC-Bayes learning theory has focused extensively on establishing tight upper bounds for test errors. A recently proposed training procedure, called PAC-Bayes training, updates the network weights toward minimizing these bounds. Although this approach is theoretically sound, in practice, it has not achieved a test error as low as those obtained by empirical risk minimization (ERM) with carefully tuned regularization hyperparameters. Additionally, existing PAC-Bayes training algorithms often require bounded loss functions and may need a search over priors with additional datasets, which limits their broader applicability. In this paper, we introduce a new PAC-Bayes training algorithm with improved performance and reduced reliance on prior tuning. This is achieved by establishing a new PAC-Bayes bound for unbounded loss and a theoretically grounded approach that involves jointly training the prior and posterior using the same dataset. Our comprehensive evaluations across various classification tasks and neural network architectures demonstrate that the proposed method not only outperforms existing PAC-Bayes training algorithms but also approximately matches the test accuracy of ERM that is optimized by SGD/Adam using various regularization methods with optimal hyperparameters.

URL: https://openreview.net/forum?id=MP8bmxvWt6

---

Title: Enhancing the Explainability of Gradient Boosting for Regression Problems through Comparable Samples Selection

Abstract: Gradient-boosted decision trees (GBDT) constitute a highly effective learning method widely used for addressing classification and regression problems. However, akin to other ensemble methods, GBDT suffers from a lack of explainability. Explainability is a desirable property: the ability to discover relationships between input data attributes and the ultimate model predictions is crucial for a comprehensive understanding of the GBDT method. To enhance the explainability of such algorithms, we propose to exhibit particular training data, referred to as \textit{comparable samples}, upon which the model heavily relies for specific predictions. To that end, we show that a prediction of GBDT can be decomposed as a weighted sum of training data when using specific loss functions. It is noteworthy that these weights may be negative. Furthermore, during the prediction of a training sample's response, the weights associated with other training samples in the prediction's decomposition vanish, indicating a potential issue of overfitting. To overcome this issue, we introduce nonnegativity constraints on the weights and substitute gradient descent with a methodology inspired by the Frank-Wolfe algorithm called Explainable Gradient Boosting (ExpGB). The predictions generated by the proposed algorithm can be directly interpreted as convex combinations of the training targets. This allows for selecting training data resembling a given sample by comparing their decomposition coefficients. We conduct a comparative analysis with classical GBDT algorithms across diverse datasets to validate the estimation quality. Additionally, we evaluate the fidelity of comparable samples by demonstrating their proficiency in estimating the characteristics of the considered sample. Our approach, thus, offers a promising avenue for enhancing the explainability of GBDT and similar ensemble methods.

URL: https://openreview.net/forum?id=9oixrKwyfx

---

Title: Incorporating Inductive Bias to Energy-based Generative Model

Abstract: With new techniques for model training (e.g. denoising score-matching) and drawing samples (e.g. Langevin dynamics), energy-based models (EBMs) have gained renewed interest as generative models. Recent EBMs usually use neural networks to devise their energy functions. In this work, we explore a novel hybrid approach that combines an EBM and an exponential family model to inject inductive bias into data modeling. Specifically, we augment the energy term with a parameter-less statistic function, which helps the model to capture key statistics of the data. Similar to an exponential family model, the hybrid model tries to match the distribution statistics to data statistics during model training, even though it only approximately maximizes the data likelihood. This property allows us to impose constraints on the hybrid model. Our empirical study verifies our hypothesis of the statistic-matching property in the hybrid model. Experimental results further demonstrate that data fitting and generation are improved when informative statistics are used in the hybrid model.
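
A small PyTorch sketch of the hybrid form, assuming the energy is a neural-network term plus an exponential-family-style term over fixed, parameter-less statistics (here simply the per-sample mean and variance); the specific statistics and the way they enter the energy are illustrative assumptions.

    import torch
    import torch.nn as nn

    class HybridEnergy(nn.Module):
        def __init__(self, dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(),
                                     nn.Linear(hidden, 1))
            self.lam = nn.Parameter(torch.zeros(2))   # weights on the fixed statistics

        @staticmethod
        def stats(x):
            # Parameter-less statistic function: per-sample mean and variance.
            return torch.stack([x.mean(dim=1), x.var(dim=1)], dim=1)

        def forward(self, x):
            # Energy = neural term + linear term over the fixed statistics.
            return self.net(x).squeeze(-1) + (self.stats(x) * self.lam).sum(dim=1)

    energy = HybridEnergy(dim=32)
    e = energy(torch.randn(8, 32))   # (8,) energies for a toy batch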

URL: https://openreview.net/forum?id=k98ZDblyhN

---
