Weekly TMLR digest for Mar 16, 2025

TMLR

Mar 16, 2025, 12:00:12 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: Align and Distill: Unifying and Improving Domain Adaptive Object Detection

Justin Kay, Timm Haucke, Suzanne Stathatos, Siqi Deng, Erik Young, Pietro Perona, Sara Beery, Grant Van Horn

https://openreview.net/forum?id=ssXSrZ94sR

---


Accepted papers
===============


Title: Ensemble and Mixture-of-Experts DeepONets For Operator Learning

Authors: Ramansh Sharma, Varun Shankar

Abstract: We present a novel deep operator network (DeepONet) architecture for operator learning, the ensemble DeepONet, that allows for enriching the trunk network of a single DeepONet with multiple distinct trunk networks. This trunk enrichment allows for greater expressivity and generalization capabilities over a range of operator learning problems. We also present a spatial mixture-of-experts (MoE) DeepONet trunk network architecture that utilizes a partition-of-unity (PoU) approximation to promote spatial locality and model sparsity in the operator learning problem. We first prove that both the ensemble and PoU-MoE DeepONets are universal approximators. We then demonstrate that ensemble DeepONets containing a trunk ensemble of a standard trunk, the PoU-MoE trunk, and/or a proper orthogonal decomposition (POD) trunk can achieve 2-4x lower relative $\ell_2$ errors than standard DeepONets and POD-DeepONets on both standard and challenging new operator learning problems involving partial differential equations (PDEs) in two and three dimensions. Our new PoU-MoE formulation provides a natural way to incorporate spatial locality and model sparsity into any neural network architecture, while our new ensemble DeepONet provides a powerful and general framework for incorporating basis enrichment in scientific machine learning architectures for operator learning.
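
A minimal PyTorch sketch of the ensemble-trunk idea (illustrative only, not the authors' code): several distinct trunk networks are concatenated into one enriched basis that is paired with the branch output. The network sizes, Tanh activation, and product-sum readout are assumptions.

import torch
import torch.nn as nn

def mlp(sizes):
    # Simple fully connected network with Tanh between layers.
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.Tanh()]
    return nn.Sequential(*layers[:-1])  # no activation on the output

class EnsembleDeepONet(nn.Module):
    def __init__(self, n_sensors, coord_dim, width=64, n_basis=32, n_trunks=3):
        super().__init__()
        # Branch net encodes the input function sampled at n_sensors points.
        self.branch = mlp([n_sensors, width, width, n_basis * n_trunks])
        # Several distinct trunk nets, each contributing n_basis basis functions.
        self.trunks = nn.ModuleList(
            mlp([coord_dim, width, width, n_basis]) for _ in range(n_trunks)
        )
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_sensors, y):
        # u_sensors: (batch, n_sensors); y: (batch, coord_dim)
        b = self.branch(u_sensors)                                  # (batch, n_basis * n_trunks)
        t = torch.cat([trunk(y) for trunk in self.trunks], dim=-1)  # enriched trunk basis
        return (b * t).sum(dim=-1, keepdim=True) + self.bias        # operator value at y

model = EnsembleDeepONet(n_sensors=100, coord_dim=2)
out = model(torch.randn(8, 100), torch.rand(8, 2))  # shape (8, 1)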

URL: https://openreview.net/forum?id=MGdydNfWzQ

---

Title: HARE: Human-in-the-Loop Algorithmic Recourse

Authors: Sai Srinivas Kancheti, Rahul Vigneswaran, Bamdev Mishra, Vineeth N. Balasubramanian

Abstract: Machine learning models are seeing increasing use as decision-making systems in domains such as education, finance and healthcare. It is desirable that these models are trustworthy to the end-user, by ensuring fairness, transparency and reliability of decisions. In this work, we consider a key aspect of responsible and transparent AI models -- actionable explanations, viz. the ability of such models to provide recourse to end users adversely affected by their decisions. While algorithmic recourse has seen a variety of efforts in recent years, there have been very few efforts to explore personalized recourse for a given user. Two users with the same feature profile may prefer vastly different recourses. The limited work in this direction hitherto relies on one-time feature preferences provided by a user. Instead, we present a human-in-the-loop formulation of algorithmic recourse that can incorporate both relative and absolute human feedback for a given test instance. We show that our formulation can extend any existing recourse-generating method, enabling the generation of recourses that are satisfactory to the user. We perform experiments on 3 benchmark datasets on top of 6 popular baseline recourse methods, where we observe that our framework performs significantly better on simulated user preferences.

URL: https://openreview.net/forum?id=56EBglCFvx

---

Title: Domain Generalization for Time Series: Enhancing Drilling Regression Models for Stick-Slip Index Prediction

Authors: Hana YAHIA, Bruno Figliuzzi, Florent Di Meglio, Gerbaud, Stephane Menand, Mohamed MAHJOUB

Abstract: This paper provides a comprehensive comparison of domain generalization techniques applied to time series data within a drilling context, focusing on the prediction of a continuous Stick-Slip Index (SSI), a critical metric for assessing torsional downhole vibrations at the drill bit. The study aims to develop a robust regression model that can generalize across domains by training on $60$-second labeled sequences of $1$ Hz surface drilling data to predict the SSI. The model is tested in wells that are different from those used during training. To fine-tune the model architecture, a grid search approach is employed to optimize key hyperparameters. A comparative analysis of the Adversarial Domain Generalization (ADG), Invariant Risk Minimization (IRM) and baseline models is presented, along with an evaluation of the effectiveness of transfer learning (TL) in improving model performance. The ADG and IRM models achieve performance improvements of $10\%$ and $8\%$, respectively, over the baseline model. Most importantly, severe events are detected $60\%$ of the time, against $20\%$ for the baseline model. Overall, the results indicate that both ADG and IRM models surpass the baseline, with the ADG model exhibiting a slight advantage over the IRM model. Additionally, applying TL to a pre-trained model further improves performance. Our findings demonstrate the potential of domain generalization approaches in drilling applications, with ADG emerging as the most effective approach.

URL: https://openreview.net/forum?id=nNN1pPJRVL

---

Title: Align and Distill: Unifying and Improving Domain Adaptive Object Detection

Authors: Justin Kay, Timm Haucke, Suzanne Stathatos, Siqi Deng, Erik Young, Pietro Perona, Sara Beery, Grant Van Horn

Abstract: Object detectors often perform poorly on data that differs from their training set. Domain adaptive object detection (DAOD) methods have recently demonstrated strong results on addressing this challenge. Unfortunately, we identify systemic benchmarking pitfalls that call past results into question and hamper further progress: (a) Overestimation of performance due to underpowered baselines, (b) Inconsistent implementation practices preventing transparent comparisons of methods, and (c) Lack of generality due to outdated backbones and lack of diversity in benchmarks. We address these problems by introducing: (1) A unified benchmarking and implementation framework, Align and Distill (ALDI), enabling comparison of DAOD methods and supporting future development, (2) A fair and modern training and evaluation protocol for DAOD that addresses benchmarking pitfalls, (3) A new DAOD benchmark dataset, CFC-DAOD, increasing the diversity of available DAOD benchmarks, and (4) A new method, ALDI++, that achieves state-of-the-art results by a large margin. ALDI++ outperforms the previous state-of-the-art by +3.5 AP50 on Cityscapes $\rightarrow$ Foggy Cityscapes, +5.7 AP50 on Sim10k $\rightarrow$ Cityscapes (where ours is the only method to outperform a fair baseline), and +0.6 AP50 on CFC-DAOD. ALDI and ALDI++ are architecture-agnostic, setting a new state-of-the-art for YOLO and DETR-based DAOD as well without additional hyperparameter tuning. Our framework, dataset, and method offer a critical reset for DAOD and provide a strong foundation for future research.

URL: https://openreview.net/forum?id=ssXSrZ94sR

---

Title: Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations

Authors: Akshay Kumar, Jarvis Haupt

Abstract: This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks assumed to have locally Lipschitz gradients and an order of homogeneity strictly greater than two. It is shown here that for sufficiently small initializations, during the early stages of training, the weights of the neural network remain small in (Euclidean) norm and approximately converge in direction to the Karush-Kuhn-Tucker (KKT) points of the recently introduced neural correlation function. Additionally, this paper also studies the KKT points of the neural correlation function for feed-forward networks with (Leaky) ReLU and polynomial (Leaky) ReLU activations, deriving necessary and sufficient conditions for rank-one KKT points.

URL: https://openreview.net/forum?id=VNM6V1gi3k

---

Title: Unlabelled Compressive Sensing under Sparse Permutation and Prior Information

Authors: Garweet Sresth, Satish Mulleti, Ajit Rajwade

Abstract: In this paper, we study the problem of unlabelled compressed sensing, where the correspondence between the measurement values and the rows of the sensing matrix is lost, the number of measurements is less than the dimension of the regression vector, and the regression vector is sparse in the identity basis. Additionally, motivated by practical situations, we assume that we accurately know a small number of correspondences between the rows of the measurement matrix and the measurement vector. We propose a tractable estimator, based on a modified form of the \textsc{Lasso}, to estimate the regression vector, and we derive theoretical error bounds for the estimate. This is unlike previous approaches to unlabelled compressed sensing, which either do not produce theoretical bounds or which produce bounds for intractable estimators. We show that our algorithm outperforms a hard thresholding pursuit (\textsc{Htp}) approach and an $\ell_1$-norm estimator used to solve a similar problem across diverse regimes. We also propose a modified \textsc{Htp} based estimator which has superior properties to the baseline \textsc{Htp} estimator. Lastly, we show an application of unlabelled compressed sensing in image registration, demonstrating the utility of a few known point correspondences.

URL: https://openreview.net/forum?id=HaAg9RN7Hi

---

Title: A unifying framework for generalised Bayesian online learning in non-stationary environments

Authors: Gerardo Duran-Martin, Leandro Sánchez-Betancourt, Alex Shestopaloff, Kevin Patrick Murphy

Abstract: We propose a unifying framework for methods that perform probabilistic online learning in non-stationary environments. We call the framework BONE, which stands for generalised (B)ayesian (O)nline learning in (N)on-stationary (E)nvironments. BONE provides a common structure to tackle a variety of problems, including online continual learning, prequential forecasting, and contextual bandits. The framework requires specifying three modelling choices: (i) a model for measurements (e.g., a neural network), (ii) an auxiliary process to model non-stationarity (e.g., the time since the last changepoint), and (iii) a conditional prior over model parameters (e.g., a multivariate Gaussian). The framework also requires two algorithmic choices, which we use to carry out approximate inference under this framework: (i) an algorithm to estimate beliefs (posterior distribution) about the model parameters given the auxiliary variable, and (ii) an algorithm to estimate beliefs about the auxiliary variable. We show how the modularity of our framework allows for many existing methods to be reinterpreted as instances of BONE, and it allows us to propose new methods. We compare experimentally existing methods with our proposed new method on several datasets, providing insights into the situations that make each method more suitable for a specific task. We provide a Jax open source library to facilitate the adoption of this framework.
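
The modular structure described above (three modelling choices plus two algorithmic choices) can be pictured with a small interface sketch; the names below are illustrative assumptions, not the released Jax library's API.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class BONESpec:
    # Three modelling choices
    measurement_model: Callable[[Any, Any], Any]  # e.g. a neural network f(params, x)
    auxvar_model: Callable[[Any], Any]            # non-stationarity process, e.g. run length
    conditional_prior: Callable[[Any], Any]       # prior over params given the aux variable
    # Two algorithmic choices (approximate inference)
    update_params: Callable[..., Any]             # belief over params given the aux variable
    update_auxvar: Callable[..., Any]             # belief over the aux variable

def online_step(spec: BONESpec, belief_params, belief_aux, x_t, y_t):
    """One prequential step: update both beliefs after observing (x_t, y_t)."""
    belief_aux = spec.update_auxvar(belief_aux, belief_params, x_t, y_t)
    belief_params = spec.update_params(belief_params, belief_aux, x_t, y_t)
    return belief_params, belief_aux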

URL: https://openreview.net/forum?id=osesw2V10u

---

Title: Variational Neural Stochastic Differential Equations with Change Points

Authors: Yousef El-Laham, Zhongchang Sun, Haibei Zhu, Tucker Balch, Svitlana Vyetrenko

Abstract: In this work, we explore modeling change points in time-series data using neural stochastic differential equations (neural SDEs). We propose a novel model formulation and training procedure based on the variational autoencoder (VAE) framework for modeling time-series as a neural SDE. Unlike existing algorithms training neural SDEs as VAEs, our proposed algorithm only necessitates a Gaussian prior of the initial state of the latent stochastic process, rather than a Wiener process prior on the entire latent stochastic process. We develop two methodologies for modeling and estimating change points in time-series data with distribution shifts. Our iterative algorithm alternates between updating neural SDE parameters and updating the change points based on either a maximum likelihood-based approach or a change point detection algorithm using the sequential likelihood ratio test. We also discuss theoretical implications of the proposed change point detection scheme. Finally, we present an empirical evaluation that demonstrates the expressive power of our proposed model, showing that it can effectively model both classical parametric SDEs and some real datasets with distribution shifts.

URL: https://openreview.net/forum?id=GEilvtsFNV

---

Title: Respecting the limit: Bayesian optimization with a bound on the optimal value

Authors: Hanyang Wang, Juergen Branke, Matthias Poloczek

Abstract: In many real-world optimization problems, we have prior information about what objective function values are achievable. In this paper, we study the scenario in which we have either exact knowledge of the minimum value or a (possibly inexact) lower bound on its value. We propose bound-aware Bayesian optimization (BABO), a Bayesian optimization method that uses a new surrogate model and acquisition function to utilize such prior information. We present SlogGP, a new surrogate model that incorporates bound information and adapts the Expected Improvement (EI) acquisition function accordingly. Empirical results on a variety of benchmarks demonstrate the benefit of taking prior information about the optimal value into account, and that the proposed approach significantly outperforms existing techniques. Furthermore, we notice that even in the absence of prior information on the bound, the proposed SlogGP surrogate model still performs better than the standard GP model in most cases, which we attribute to its greater expressiveness.

URL: https://openreview.net/forum?id=y5Hf0otJLk

---

Title: Rethinking Knowledge Transfer in Learning Using Privileged Information

Authors: Danil Provodin, Bram van den Akker, Christina Katsimerou, Maurits Clemens Kaptein, Mykola Pechenizkiy

Abstract: In supervised machine learning, privileged information (PI) is information that is unavailable at inference, but is accessible during training time. Research on learning using privileged information (LUPI) aims to transfer the knowledge captured in PI onto a model that can perform inference without PI. It seems that this extra bit of information ought to make the resulting model better. However, finding conclusive theoretical or empirical evidence that supports the ability to transfer knowledge using PI has been challenging. In this paper, we critically examine the assumptions underlying existing theoretical analyses and argue that there is little theoretical justification for when LUPI should work. We analyze two main LUPI methods - generalized distillation and marginalization with weight sharing - and reveal that apparent improvements in empirical risk may not directly result from PI. Instead, these improvements often stem from dataset anomalies or modifications in model design misguidedly attributed to PI. Our experiments for a wide variety of application domains further demonstrate that state-of-the-art LUPI approaches fail to effectively transfer knowledge from PI. Thus, we advocate for practitioners to exercise caution when working with PI to avoid unintended inductive biases.
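
For reference, generalized distillation (one of the two LUPI methods analyzed) trains a student on regular features while matching the softened predictions of a teacher that had access to privileged information. The sketch below uses an illustrative temperature and mixing weight.

import torch.nn.functional as F

def generalized_distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    # Hard-label term on the student's own predictions.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term matching the PI-trained teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - lam) * hard + lam * soft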

URL: https://openreview.net/forum?id=dg1tqNIWg3

---

Title: Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning

Authors: Ashka Shah, Adela Frances DePavia, Nathaniel C Hudson, Ian Foster, Rick Stevens

Abstract: The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause and effect relationships in a generalized way---without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of Directed Acyclic Graphs (DAGs), to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable, and scalable algorithms for causal discovery are needed to bridge the gap.
In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees under the Maximal Ancestral Graph (MAG) class. We leverage the idea of a superstructure---a set of learned or existing candidate hypotheses---to partition the search space. We prove under certain assumptions that learning with a causal graph partition always yields the Markov Equivalence Class of the true causal graph. We show our algorithm achieves comparable accuracy and a faster time to solution for biologically-tuned synthetic networks and networks with up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.

URL: https://openreview.net/forum?id=FecsgPCOHk

---

Title: On the Robustness of Kolmogorov-Arnold Networks: An Adversarial Perspective

Authors: Tal Alter, Raz Lapid, Moshe Sipper

Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a novel paradigm for function approximation by leveraging univariate spline-based decompositions inspired by the Kolmogorov–Arnold theorem. Despite their theoretical appeal---particularly the potential for inducing smoother decision boundaries and lower effective Lipschitz constants---their adversarial robustness remains largely unexplored. In this work, we conduct the first comprehensive evaluation of KAN robustness in adversarial settings, focusing on both fully connected (FCKANs) and convolutional (CKANs) instantiations for image classification tasks. Across a wide range of benchmark datasets (MNIST, FashionMNIST, KMNIST, CIFAR-10, SVHN, and a subset of ImageNet), we compare KANs against conventional architectures using an extensive suite of attacks, including white-box methods (FGSM, PGD, C\&W, MIM), black-box approaches (Square Attack, SimBA, NES), and ensemble attacks (AutoAttack). Our experiments reveal that while small- and medium-scale KANs are not consistently more robust than their standard counterparts, large-scale KANs exhibit markedly enhanced resilience against adversarial perturbations. An ablation study further demonstrates that critical hyperparameters---such as number of knots and spline order---significantly influence robustness. Moreover, adversarial training experiments confirm the inherent safety advantages of KAN-based architectures. Overall, our findings provide novel insights into the adversarial behavior of KANs and lay a rigorous foundation for future research on robust, interpretable network designs.
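
FGSM, the simplest of the white-box attacks listed above, is architecture-agnostic and applies to KANs exactly as to conventional networks; the perturbation budget below is an illustrative value.

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8 / 255):
    # One signed-gradient ascent step on the loss, clipped back to a valid image range.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()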

URL: https://openreview.net/forum?id=uafxqhImPM

---

Title: CroissantLLM: A Truly Bilingual French-English Language Model

Authors: Manuel Faysse, Patrick Fernandes, Nuno M Guerreiro, António Loison, Duarte Miguel Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro Henrique Martins, Antoni Bigata Casademunt, François Yvon, Andre Martins, Gautier Viaud, CELINE HUDELOT, Pierre Colombo

Abstract: We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.

URL: https://openreview.net/forum?id=uA19Xo1o31

---

Title: Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

Authors: Muheng Li, Ruqi Zhang

Abstract: Recently, gradient-based discrete sampling has emerged as a highly efficient, general-purpose solver for various combinatorial optimization (CO) problems, achieving performance comparable to or surpassing the popular data-driven approaches. However, we identify a critical issue in these methods, which we term ``wandering in contours''. This behavior refers to sampling new different solutions that share very similar objective values for a long time, leading to computational inefficiency and suboptimal exploration of potential solutions. In this paper, we introduce a novel reheating mechanism inspired by the concept of critical temperature and specific heat in physics, aimed at overcoming this limitation. Empirically, our method demonstrates superiority over existing sampling-based and data-driven algorithms across a diverse array of CO problems.
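
A schematic picture of a reheating rule (the stagnation trigger and the reheat temperature are assumptions for illustration, not the paper's exact mechanism): cool as usual, but raise the temperature again when recent objective values stagnate, i.e. when the sampler starts "wandering in contours".

import numpy as np

def reheated_temperature(objective_trace, temp, t_reheat=1.0, decay=0.995,
                         window=200, tol=1e-3):
    temp *= decay  # ordinary geometric cooling
    recent = objective_trace[-window:]
    if len(recent) == window and np.std(recent) < tol:
        temp = t_reheat  # reheat when progress has stalled
    return temp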

URL: https://openreview.net/forum?id=uPCvfyr2KP

---

Title: Enhancing Fairness in Unsupervised Graph Anomaly Detection through Disentanglement

Authors: Wenjing Chang, Kay Liu, Philip S. Yu, Jianjun Yu

Abstract: Graph anomaly detection (GAD) is becoming increasingly crucial in various applications, ranging from financial fraud detection to fake news detection. However, current GAD methods largely overlook the fairness problem, which might result in discriminatory decisions skewed toward certain demographic groups defined on sensitive attributes (e.g., gender). This greatly limits the applicability of these methods in real-world scenarios in light of societal and ethical restrictions. To address this critical gap, we make the first attempt to integrate fairness with utility in GAD decision-making. Specifically, we devise a novel DisEntangle-based FairnEss-aware aNomaly Detection framework on the attributed graph, named DEFEND. DEFEND first introduces disentanglement in GNNs to capture informative yet sensitive-irrelevant node representations, effectively reducing bias inherent in graph representation learning. Besides, to alleviate discriminatory bias in evaluating anomalies, DEFEND adopts a reconstruction-based method, which concentrates solely on node attributes and avoids incorporating biased graph topology. Additionally, given the inherent association between sensitive-relevant and -irrelevant attributes, DEFEND further constrains the correlation between the reconstruction error and predicted sensitive attributes. Empirical evaluations on real-world datasets reveal that DEFEND performs effectively in GAD and significantly enhances fairness compared to state-of-the-art baselines. Our code is available at https://github.com/AhaChang/DEFEND.

URL: https://openreview.net/forum?id=5zRs34Ls3C

---

Title: Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-image Diffusion Models!

Authors: Arash Mari Oriyad, Mohammadali Banayeeanzade, Reza Abbasi, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

Abstract: Text-to-image diffusion models such as Stable Diffusion and DALL-E have exhibited impressive capabilities in producing high-quality, diverse, and realistic images based on textual prompts. Nevertheless, a common issue arises where these models encounter difficulties in faithfully generating every entity specified in the prompt, leading to a recognized challenge known as entity missing in visual compositional generation. While previous studies indicated that actively adjusting cross-attention maps during inference could potentially resolve the issue, there has been a lack of systematic investigation into the specific objective function required for this task. In this work, we thoroughly investigate three potential causes of entity missing from the perspective of cross-attention maps: insufficient attention intensity, excessive attention spread, and significant overlap between attention maps of different entities.
Through comprehensive empirical analysis, we found that optimizing metrics that quantify the overlap between attention maps of entities is highly effective at mitigating entity missing. We hypothesize that during the denoising process, entity-related tokens engage in a form of competition for attention toward specific regions through the cross-attention mechanism. This competition may result in the attention of a spatial location being divided among multiple tokens, leading to difficulties in accurately generating the entities associated with those tokens. Building on this insight, we propose four overlap-based loss functions that can be used to implicitly manipulate the latent embeddings of the diffusion model during inference: Intersection over union (IoU), center-of-mass (CoM) distance, Kullback–Leibler (KL) divergence, and clustering compactness (CC). Extensive experiments on a diverse set of prompts demonstrate that our proposed training-free methods substantially outperform previous approaches on a range of compositional alignment metrics, including visual question-answering, captioning score, CLIP similarity, and human evaluation. Notably, our method outperforms the best baseline by $9\%$ in human evaluation.
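
As a concrete example of the first loss, a soft intersection-over-union between the cross-attention maps of two entity tokens can be computed as below (normalization of the maps and the aggregation over entity pairs are assumptions of this sketch); minimizing it pushes the entities' attention apart.

import torch

def soft_iou_overlap(attn_a, attn_b, eps=1e-8):
    # attn_a, attn_b: non-negative cross-attention maps of shape (H, W)
    inter = torch.minimum(attn_a, attn_b).sum()
    union = torch.maximum(attn_a, attn_b).sum() + eps
    return inter / union  # minimize to reduce overlap between the two entities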

URL: https://openreview.net/forum?id=Xv3ZrFayIO

---

Title: The Time-Energy Model: Selective Time-Series Forecasting Using Energy-Based Models

Authors: Jonas Brusokas, Seshu Tirupathi, Dalin Zhang, Torben Bach Pedersen

Abstract: Time-series forecasting is an important task in many domains, including finance, weather prediction, and energy consumption forecasting, and deep learning methods have emerged as the best-performing time-series forecasting methods over the last few years. However, most proposed time-series forecasting models are deterministic and are prone to errors when deployed in production, potentially causing significant losses and penalties when making predictions with low confidence. In this paper, we propose the Time-Energy Model (TEM), a framework that introduces so-called selective time-series forecasting using energy-based models. Selective forecasting estimates model confidence and allows the end-user to selectively reject forecasts while maintaining a desired target coverage. TEM is model-agnostic and can be used to improve forecasting accuracy of any encoder-decoder deterministic time-series forecasting model. TEM is trained using a combination of supervised and self-supervised learning, leveraging excellent single-point prediction accuracy while maintaining the ability to reject forecasts based on model confidence. Experimental results indicate that TEM generalizes well across 5 state-of-the-art deterministic time-series forecasting models and 5 benchmark time-series forecasting datasets. Using selective forecasting, TEM reduces prediction error by up to $49.1\%$ over 5 state-of-the-art deterministic models. Furthermore, TEM has up to $87.0\%$ lower error than selected baseline EBM models, and achieves significantly better performance than state-of-the-art selective deep learning models. Code for the proposed TEM framework is available at https://github.com/JonasBrusokas/Time-Energy-Model
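
Selective forecasting in its generic form can be sketched as follows (under the assumption that lower energy means higher confidence; the coverage target is illustrative): score each forecast, keep the most confident fraction, and reject the rest.

import numpy as np

def selective_forecast(preds, energies, target_coverage=0.9):
    # preds, energies: numpy arrays of equal length
    n_keep = int(np.ceil(target_coverage * len(preds)))
    keep_idx = np.argsort(energies)[:n_keep]      # lowest energy = most confident
    mask = np.zeros(len(preds), dtype=bool)
    mask[keep_idx] = True
    return preds[mask], mask                      # accepted forecasts and their mask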

URL: https://openreview.net/forum?id=iHYCdTAOqF

---

Title: Characterizing the Convergence of Game Dynamics via Potentialness

Authors: Martin Bichler, Davide Legacci, Panayotis Mertikopoulos, Matthias Oberlechner, Bary Pradelski

Abstract: Understanding the convergence landscape of multi-agent learning is a fundamental problem of great practical relevance in many applications of artificial intelligence and machine learning.
While it is known that learning dynamics converge to Nash equilibrium in potential games, the behavior of dynamics in many important classes of games that do not admit a potential is poorly understood.
To measure how ``close'' a game is to being potential, we consider a distance function that we call ``potentialness'', which relies on a strategic decomposition of games introduced by Candogan et al. (2011).
We introduce a numerical framework enabling the computation of this metric, which we use to calculate the degree of ``potentialness'' in generic matrix games, as well as (non-generic) games that are important in economic applications, namely auctions and contests. Understanding learning in the latter games has become increasingly important due to the widespread automation of bidding and pricing with no-regret learning algorithms.
We empirically show that potentialness decreases and concentrates with an increasing number of agents or actions;
in addition, potentialness turns out to be a good predictor for the existence of pure Nash equilibria and the convergence of no-regret learning algorithms in matrix games.
In particular, we observe that potentialness is very low for complete-information models of the all-pay auction where no pure Nash equilibrium exists, and much higher for Tullock contests, first-, and second-price auctions, explaining the success of learning in the latter. In the incomplete-information version of the all-pay auction, a pure Bayes-Nash equilibrium exists and it can be learned with gradient-based algorithms. Potentialness nicely characterizes these differences to the complete-information version.
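
For background, the reference point that ``potentialness'' measures distance from is the class of (exact) potential games, in which a single function $\Phi$ reproduces every player's unilateral payoff differences:

\[
  u_i(a_i', a_{-i}) - u_i(a_i, a_{-i})
  \;=\;
  \Phi(a_i', a_{-i}) - \Phi(a_i, a_{-i})
  \qquad \text{for all players } i \text{ and all } a_i,\, a_i',\, a_{-i}.
\]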

URL: https://openreview.net/forum?id=Is9APiPg4V

---

Title: Active Diffusion Subsampling

Authors: Oisín Nolan, Tristan Stevens, Wessel L. van Nierop, Ruud Van Sloun

Abstract: Subsampling is commonly used to mitigate costs associated with data acquisition, such as time or energy requirements, motivating the development of algorithms for estimating the fully-sampled signal of interest $x$ from partially observed measurements $y$. In maximum-entropy sampling, one selects measurement locations that are expected to have the highest entropy, so as to minimize uncertainty about $x$. This approach relies on an accurate model of the posterior distribution over future measurements, given the measurements observed so far. Recently, diffusion models have been shown to produce high-quality posterior samples of high-dimensional signals using guided diffusion. In this work, we propose Active Diffusion Subsampling (ADS), a method for designing intelligent subsampling masks using guided diffusion in which the model tracks a distribution of beliefs over the true state of $x$ throughout the reverse diffusion process, progressively decreasing its uncertainty by actively choosing to acquire measurements with maximum expected entropy, ultimately producing the posterior distribution $p(x | y)$. ADS can be applied using pre-trained diffusion models for any subsampling rate, and does not require task-specific retraining – just the specification of a measurement model. Furthermore, the maximum entropy sampling policy employed by ADS is interpretable, enhancing transparency relative to existing methods using black-box policies. Experimentally, we show that through designing informative subsampling masks, ADS significantly improves reconstruction quality compared to fixed sampling strategies on the MNIST and CelebA datasets, as measured by standard image quality metrics, including PSNR, SSIM, and LPIPS. Furthermore, on the task of Magnetic Resonance Imaging acceleration, we find that ADS performs competitively with existing supervised methods in reconstruction quality while using a more interpretable acquisition scheme design procedure. Code is available at https://active-diffusion-subsampling.github.io/.
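
The acquisition step can be pictured with a simplified sketch: per-pixel variance over guided-diffusion posterior samples is used here as a proxy for expected entropy, which is an assumption of the sketch rather than the paper's exact estimator.

import numpy as np

def next_measurement(posterior_samples, acquired_mask):
    # posterior_samples: (n_samples, H, W) draws of x given measurements so far
    # acquired_mask: (H, W) boolean array of locations already measured
    disagreement = posterior_samples.var(axis=0)
    disagreement[acquired_mask] = -np.inf          # never re-acquire a location
    return np.unravel_index(np.argmax(disagreement), disagreement.shape)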

URL: https://openreview.net/forum?id=OGifiton47

---

Title: Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

Authors: Pihe Hu, Shaolong Li, Xun Wang, Longbo Huang

Abstract: Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce about $75$% of Floating Point Operations (FLOPs) while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism to maintain performance and minimize training FLOPs concurrently. Our experiment on GPT-2 showcases a FLOP reduction of $4\times$ without compromising performance.

URL: https://openreview.net/forum?id=XosdLS7KVE

---

Title: Online Control-Informed Learning

Authors: Zihao Liang, Tianyu Zhou, Zehui Lu, Shaoshuai Mou

Abstract: This paper proposes an Online Control-Informed Learning (OCIL) framework, which employs the well-established optimal control and state estimation techniques in the field of control to solve a broad class of learning tasks in an online fashion. This novel integration effectively handles practical issues in machine learning such as noisy measurement data, online learning, and data efficiency. By considering any robot as a tunable optimal control system, we propose an online parameter estimator based on an extended Kalman filter (EKF) to incrementally tune the system in an online fashion, enabling it to complete designated learning or control tasks. The proposed method also improves the robustness in learning by effectively managing noise in the data. Theoretical analysis is provided to demonstrate the convergence of OCIL. Three learning modes of OCIL, i.e., Online Imitation Learning, Online System Identification, and Policy Tuning On-the-fly, are investigated via experiments, which validate their effectiveness.
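
For reference, the standard extended-Kalman-filter recursion that such an online parameter estimator builds on, written for parameters $\theta$ treated as the filtered state of a random-walk model with measurement function $h$ (generic notation, not the paper's):

\begin{align*}
  \hat{\theta}_{t|t-1} &= \hat{\theta}_{t-1}, &
  P_{t|t-1} &= P_{t-1} + Q,\\
  H_t &= \left.\frac{\partial h}{\partial \theta}\right|_{\hat{\theta}_{t|t-1}}, &
  K_t &= P_{t|t-1} H_t^\top \big(H_t P_{t|t-1} H_t^\top + R\big)^{-1},\\
  \hat{\theta}_t &= \hat{\theta}_{t|t-1} + K_t\big(y_t - h(\hat{\theta}_{t|t-1})\big), &
  P_t &= (I - K_t H_t)\, P_{t|t-1}.
\end{align*}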

URL: https://openreview.net/forum?id=LDzvZEVl5H

---

Title: Visual Privacy Auditing with Diffusion Models

Authors: Kristian Schwethelm, Johannes Kaiser, Moritz Knolle, Sarah Lockfisch, Daniel Rueckert, Alexander Ziller

Abstract: Data reconstruction attacks on machine learning models pose a substantial threat to privacy, potentially leaking sensitive information. Although defending against such attacks using differential privacy (DP) provides theoretical guarantees, determining appropriate DP parameters remains challenging. Current formal guarantees on the success of data reconstruction suffer from overly stringent assumptions regarding adversary knowledge about the target data, particularly in the image domain, raising questions about their real-world applicability. In this work, we empirically investigate this discrepancy by introducing a reconstruction attack based on diffusion models (DMs) that only assumes adversary access to real-world image priors and specifically targets the DP defense. We find that (1) real-world data priors significantly influence reconstruction success, (2) current reconstruction bounds do not model the risk posed by data priors well, and (3) DMs can serve as heuristic auditing tools for visualizing privacy leakage.

URL: https://openreview.net/forum?id=D3DA7pgpvn

---


New submissions
===============


Title: Knockout: A simple way to handle missing inputs

Abstract: Deep learning models benefit from rich (e.g., multi-modal) input features. However, multimodal models might be challenging to deploy, because some inputs may be missing at inference. Current popular solutions include marginalization, imputation, and training multiple models. Marginalization achieves calibrated predictions, but it is computationally expensive and only feasible for low dimensional inputs. Imputation may result in inaccurate predictions, particularly when high-dimensional data, such as images, are missing. Training multiple models, where each model is designed to handle different subsets of inputs, can work well but requires prior knowledge of missing input patterns. Furthermore, training and retaining multiple models can be costly. We propose an efficient method to learn both the conditional distribution using full inputs and the marginal distributions. Our method, Knockout, randomly replaces input features with appropriate placeholder values during training. We provide a theoretical justification for Knockout and show that it can be interpreted as an implicit marginalization strategy. We evaluate Knockout across a wide range of simulations and real-world datasets and show that it offers strong empirical performance.
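
The training-time trick itself is simple to sketch (the placeholder value and knockout probability below are illustrative assumptions): each modality is randomly replaced, per sample, by a fixed placeholder so the network also learns the marginal predictors.

import torch

def knockout(features, p=0.3, placeholder=0.0):
    # features: dict of modality name -> tensor of shape (batch, ...)
    out = {}
    for name, x in features.items():
        drop = torch.rand(x.shape[0], device=x.device) < p       # per-sample decision
        mask = drop.view(-1, *([1] * (x.dim() - 1)))              # broadcast over feature dims
        out[name] = torch.where(mask, torch.full_like(x, placeholder), x)
    return out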

URL: https://openreview.net/forum?id=K71y5pge84

---

Title: Visually Descriptive Language Model for Vector Graphics Reasoning

Abstract: Despite significant advancements, current large multimodal models (LMMs) struggle to bridge the gap between low-level visual perception—focusing on shapes, sizes, and layouts—and high-level language reasoning involving semantics, events, and logic. This limitation becomes evident in tasks requiring precise visual perception, such as comparing geometric properties or solving visual algorithmic reasoning problems. To study this failure mode, we focus on an important visual domain: vector graphics —images composed purely of 2D objects and shapes, which are prevalent in Web, PC, and Mobile environments. Importantly, we consider rasterized vector graphics without assuming access to their underlying vector code. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To accurately capture low-level visual details, we explore using SVG for the precise encoding of visual scenes. However, SVGs are not readily interpretable by LLMs or LMMs in a zero-shot manner. To address this challenge, we propose the Visually Descriptive Language Model (VDLM) to build a bridge between low-level visual perception and high-level language reasoning. VDLM learns an intermediate symbolic representation called Primal Visual Description (PVD), which translates raw SVGs into a higher-level abstraction comprising primitive attributes. This abstraction allows for direct interpretation by foundation models for zero-shot generalization to different reasoning tasks. Without any human-annotated data, VDLM leads to significant improvements in state-of-the-art LMMs, such as GPT-4o, across various low-level multimodal perception and reasoning tasks on rasterized vector graphics. Additionally, we provide extensive analyses of VDLM’s performance, showing that our framework offers improved interpretability due to its disentangled perception and reasoning processes. As the first attempt to construct a descriptive intermediate representation for low-level visual reasoning, we also conduct an in-depth error analysis, highlighting remaining limitations and suggesting directions for future research.

URL: https://openreview.net/forum?id=WzS33L1iPC

---

Title: Ellipsoidal Optimal Recovery: A Minimax Approach to Robust Counterfactual Estimation

Abstract: Consider the problem of quantifying the causal effects of an intervention to determine whether the intervention achieved desired outcomes. Researchers address this problem using statistical, machine learning, or signal processing techniques that suffer from high bias or require expert knowledge. We present a new minimax geometric approach called ellipsoidal optimal recovery (EOpR) for estimating the unobservable outcome of a treatment unit. It is an approximation-theoretic technique that recovers unknown observations given a learned signal/principal vector and a set of known observations. The significance of our approach is that it improves pre-treatment fit and mitigates bias of the post-treatment estimate relative to other methods in causal inference. Beyond recovery of the unit of interest, an advantage of EOpR is that it produces worst-case limits over the estimates produced. We assess our approach on synthetically-generated data, on standard datasets commonly used in the econometrics (synthetic control) literature, and in the context of the COVID-19 pandemic, showing better performance than baseline techniques.

URL: https://openreview.net/forum?id=iVxlbNl8Ow

---

Title: Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

Abstract: In this paper, we introduce a novel framework for joint scene graph - image generation, a challenging task involving high-dimensional, multi-modal structured data. To effectively model this complex joint distribution, we adopt a factorized approach: first generating a scene graph, followed by image generation conditioned on the generated scene graph. While conditional image generation has been widely explored in the literature, our primary focus is on the unconditional generation of scene graphs from noise, which provides efficient and interpretable control over the image generation process. This task requires generating plausible scene graphs with heterogeneous attributes for both nodes (objects) and edges (relations between objects), encompassing continuous attributes (e.g., object bounding boxes) and discrete attributes (e.g., object and relation categories). To address this challenge, we introduce DiffuseSG, a novel diffusion model that jointly models the heterogeneous node and edge attributes. We explore different encoding strategies to effectively handle the categorical data. Leveraging a graph transformer as the denoiser, DiffuseSG progressively refines scene graph representations in a continuous space before discretizing them to generate structured outputs. Additionally, we introduce an IoU-based regularization term to enhance empirical performance. Our model outperforms existing methods in scene graph generation on the Visual Genome and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task's complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: (1) achieving superior results in a range of scene graph completion tasks, and (2) enhancing scene graph detection models by leveraging additional training samples generated by DiffuseSG.

URL: https://openreview.net/forum?id=2cxxZI2LOL

---

Title: Learning Using a Single Forward Pass

Abstract: We propose a learning algorithm to overcome the limitations of traditional backpropagation in resource-constrained environments: the Solo Pass Embedded Learning Algorithm (SPELA). SPELA is equipped with rapid learning capabilities and operates with local loss functions to update weights, significantly saving on resources allocated to the propagation of gradients and storing computational graphs while being sufficiently accurate. Consequently, SPELA can closely match backpropagation with less data, computing, storage, and power. Moreover, SPELA can effectively fine-tune pre-trained image recognition models for new tasks. Further, SPELA is extended with significant modifications to train CNN networks, which we evaluate for equivalent performance on the CIFAR-10, CIFAR-100, and SVHN datasets. Our results indicate that SPELA can be an ideal candidate for learning in resource-constrained edge AI applications.

URL: https://openreview.net/forum?id=EDQ8QOGqjr

---

Title: A2Perf: Real-World Autonomous Agents Benchmark

Abstract: Autonomous agents and systems cover a number of application areas, from robotics and digital assistants to combinatorial optimization, all sharing common, unresolved research challenges. It is not sufficient for agents to merely solve a given task; they must generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and on-device deployment, among other requirements. Several major classes of methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs. However, there is currently no benchmarking suite that defines the environments, datasets, and metrics which can be used to develop reference implementations and seed leaderboards with baselines, providing a meaningful way for the community to compare progress. We introduce A2Perf—a benchmarking suite including three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion.
A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, which are all critical to real-world applications. Using A2Perf, we demonstrate that web navigation agents can achieve latencies comparable to human reaction times on consumer hardware, reveal important reliability trade-offs between algorithms for quadruped locomotion, and quantify the total energy costs of different learning approaches for computer chip design. In addition, we propose a data cost metric to account for the cost incurred in acquiring offline data for imitation learning, reinforcement learning, and hybrid algorithms, which allows us to better compare these approaches. A2Perf also contains baseline implementations of standard algorithms, enabling apples-to-apples comparisons across methods and facilitating progress in real-world autonomy. As an open-source and extendable benchmark, A2Perf is designed to remain accessible, documented, up-to-date, and useful to the research community over the long term.

URL: https://openreview.net/forum?id=AoGliDAEPC

---

Title: Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey

Abstract: Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine learning systems and has shaped the field of OOD detection. Meanwhile, several other problems are closely related to OOD detection, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD). To unify these problems, a generalized OOD detection framework was proposed, taxonomically categorizing these five problems. However, Vision Language Models (VLMs) such as CLIP have significantly changed the paradigm and blurred the boundaries between these fields, again confusing researchers. In this survey, we first present a generalized OOD detection v2, encapsulating the evolution of these fields in the VLM era. Our framework reveals that, with some field inactivity and integration, the demanding challenges have become OOD detection and AD. Then, we highlight the significant shift in the definition, problem settings, and benchmarks; we thus feature a comprehensive review of the methodology for OOD detection and related tasks to clarify their relationship to OOD detection. Finally, we explore the advancements in the emerging Large Vision Language Model (LVLM) era, such as GPT-4V. We conclude with open challenges and future directions.

URL: https://openreview.net/forum?id=FO3IA4lUEY

---

Title: MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment

Abstract: Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges because interactions with an environment are prohibited. In this paper, we propose a new framework, namely \textbf{M}ulti-\textbf{A}gent \textbf{C}ausal \textbf{C}redit \textbf{A}ssignment (\textbf{MACCA}), to address credit assignment in the offline MARL setting. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to integrate with various offline MARL methods seamlessly. Theoretically, we prove that under the offline dataset setting, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which lays the foundation for the correctness of our modeling. In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones.

URL: https://openreview.net/forum?id=gwUOzI4DuV

---

Title: A Survey on Quantum Machine Learning: Basics, Current Trends, Challenges, Opportunities, and the Road Ahead

Abstract: Quantum Computing (QC) claims to improve the efficiency of solving complex problems, compared to classical computing. When QC is integrated with Machine Learning (ML), it creates a Quantum Machine Learning (QML) system. This paper aims to provide a thorough understanding of the foundational concepts of QC and its notable advantages over classical computing. Following this, we delve into the key aspects of QML in a detailed and comprehensive manner.

In this survey, we investigate a variety of QML algorithms, discussing their applicability across different domains. We examine quantum datasets, highlighting their unique characteristics and advantages. The survey also covers the current state of hardware technologies, providing insights into the latest advancements and their implications for QML. Additionally, we review the software tools and simulators available for QML development, discussing their features and usability.

Furthermore, we explore practical applications of QML, illustrating how it can be leveraged to solve real-world problems more efficiently than classical ML methods. This paper serves as a valuable resource for readers seeking to understand the current state-of-the-art techniques in the QML field, offering a solid foundation to embark on further exploration and development in this rapidly evolving area.

URL: https://openreview.net/forum?id=rBNQdXdzBg

---

Title: Factor Learning Portfolio Optimization Informed by Continuous-Time Finance Models

Abstract: We study financial portfolio optimization in the presence of unknown and uncontrolled system variables referred to as stochastic factors. Existing work falls into two distinct categories: (i) reinforcement learning employs end-to-end policy learning with flexible factor representation, but does not precisely model the dynamics of asset prices or factors; (ii) continuous-time finance methods, in contrast, take advantage of explicitly modeled dynamics but pre-specify, rather than learn, factor representation. We propose FaLPO (factor learning portfolio optimization), a framework that interpolates between these two approaches. Specifically, FaLPO hinges on deep policy gradient to learn a performant investment policy that takes advantage of flexible representation for stochastic factors. Meanwhile, FaLPO also incorporates continuous-time finance models when modeling the dynamics. It uses the optimal policy functional form derived from such models and optimizes an objective that combines policy learning and model calibration. We prove the convergence of FaLPO and provide performance guarantees via a finite-sample bound. On both synthetic and real-world portfolio optimization tasks, we observe that FaLPO outperforms five leading methods. Finally, we show that FaLPO can be extended to other decision-making problems with stochastic factors.

URL: https://openreview.net/forum?id=KLOJUGusVE

---

Title: Long-Term Fairness Inquiries and Pursuits in Machine Learning: A Survey of Notions, Methods, and Challenges

Abstract: The widespread integration of Machine Learning systems in daily life, particularly in high-stakes domains, has raised concerns about the fairness implications. While prior works have investigated static fairness measures, recent studies reveal that automated decision-making has long-term implications and that off-the-shelf fairness approaches may not serve the purpose of achieving long-term fairness. Additionally, the existence of feedback loops and the interaction between models and the environment introduces additional complexities that may deviate from the initial fairness goals. In this survey, we review existing literature on long-term fairness from different perspectives and present a taxonomy for long-term fairness studies. We highlight key challenges and consider future research directions, analyzing both current issues and potential further explorations.

URL: https://openreview.net/forum?id=mYi6EWvFlR

---

Title: Semantic Mapping in Indoor Embodied AI - A Survey on Advances, Challenges, and Future Directions

Abstract: Intelligent embodied agents (e.g. robots) need to perform complex semantic tasks in unfamiliar environments. Among many skills that the agents need to possess, building and maintaining a semantic map of the environment is most crucial in long-horizon tasks. A semantic map captures information about the environment in a structured way, allowing the agent to reference it for advanced reasoning throughout the task. While existing surveys in embodied AI focus on general advancements or specific tasks like navigation and manipulation, this paper provides a comprehensive review of semantic map-building approaches in embodied AI, specifically for indoor navigation. We categorize these approaches based on their structural representation (spatial grids, topological graphs, dense point-clouds or hybrid maps) and the type of information they encode (implicit features or explicit environmental data). We also explore the strengths and limitations of the map building techniques, highlight current challenges, and propose future research directions. We identify that the field is moving towards developing open-vocabulary, queryable, task-agnostic map representations, while high memory demands and computational inefficiency remain open challenges. This survey aims to guide current and future researchers in advancing semantic mapping techniques for embodied AI systems.

URL: https://openreview.net/forum?id=USgQ38RG6G

---

Title: A Max-Min Approach to the Worst-Case Class Separation Problem

Abstract: In this paper, we propose a novel discriminative feature learning method based on a minorization-maximization framework for min-max (MM4MM) to address the long-standing “worst-case class separation (WCCS)” problem, which, in our design, refers to maximizing the minimum pairwise Chernoff distance between all class pairs in the low-dimensional subspace. The proposed algorithm relies on the relaxation of a semi-orthogonality constraint, which is proven to be tight at every iteration of the algorithm. To solve the worst-case class separation problem, we first introduce the vanilla version of the proposed algorithm, which requires solving a semi-definite program (SDP) at each iteration. We further simplify it to solving a quadratic program by formulating the dual of the surrogate maximization problem. We also then present reformulations of the worst-case class separation problem that enforce sparsity of the dimension-reducing matrix. The proposed algorithms are computationally efficient and are guaranteed to converge to optimal solutions. An important feature of these algorithms is that they do not require any hyperparameter tuning (except for the sparsity case, where a penalty parameter controlling sparsity must be chosen by the user). Experiments on several machine learning datasets demonstrate the effectiveness of the MM4MM approach.

URL: https://openreview.net/forum?id=EEmwBd4tfZ

---

Title: Exploring the Limitations of Layer Synchronization in Spiking Neural Networks

Abstract: Neural-network processing in machine learning applications relies on layer synchronization. This is practiced even in artificial Spiking Neural Networks (SNNs), which are touted as consistent with neurobiology, in spite of processing in the brain being in fact asynchronous. A truly asynchronous system, however, would allow all neurons to evaluate concurrently their threshold and emit spikes upon receiving any presynaptic current. Omitting layer synchronization is potentially beneficial, for latency and energy efficiency, but asynchronous execution of models previously trained with layer synchronization may entail a mismatch in network dynamics and performance. We present and quantify this problem, and show that models trained with layer synchronization either perform poorly in the absence of synchronization, or fail to benefit from any energy and latency reduction, when such a mechanism is in place. We then explore a potential solution direction, based on a generalization of backpropagation-based training that integrates knowledge about an asynchronous execution scheduling strategy, for learning models suitable for asynchronous processing. We experiment with two asynchronous neuron execution scheduling strategies on datasets that encode spatial and temporal information, and we show the potential of asynchronous processing to use fewer spikes (up to 50\%), complete inference faster (up to 2x), and achieve competitive or even better accuracy (up to $\sim$10\% higher). Our exploration affirms that asynchronous event-based AI processing can be indeed more efficient, but we need to rethink how we train our SNN models to benefit from it.
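
The contrast with layer-synchronized execution can be seen in a toy event-driven update (an illustrative integrate-and-fire rule, not the paper's training scheme): a neuron integrates each incoming spike as it arrives and may fire immediately, without waiting for its layer to finish.

def on_presynaptic_spike(v, w, threshold=1.0, reset=0.0):
    """Event-driven integrate-and-fire: called whenever a weighted spike arrives."""
    v = v + w                    # integrate the incoming presynaptic current
    if v >= threshold:
        return reset, True       # fire immediately and reset the membrane potential
    return v, False              # otherwise just keep the updated potential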

URL: https://openreview.net/forum?id=mfmAVwtMIk

---

Title: Implicit 3D Reconstruction of Fine Details from Multi-View Images using Wavelet-based Geometric Prior

Abstract: High-fidelity 3D reconstruction from images remains a fundamental challenge in computer vision. Implicit Signed Distance Field (SDF) models leverage photometric loss for isosurface reconstruction, while recent approaches, such as planar constrained Gaussian splatting, integrate 3D-2D geometry priors to improve structural accuracy. However, existing methods struggle to capture fine-grained geometric details due to the loss of high-frequency geometric details during feature learning, which results in limited multi-scale representation. To address this, we introduce a novel wavelet-conditioned implicit SDF model that enhances geometric precision by leveraging a pretrained wavelet autoencoder optimized with sharp depth maps. This autoencoder extracts multi-scale wavelet-transformed features, which are fused with implicit 3D triplane features via triplane projection, producing a more structured and detail-preserving distance field. Our method can serve as a plug-and-play module, seamlessly integrating with any implicit SDF representation.

Extensive evaluations on DTU, Tanks and Temples, and a cultural heritage dataset demonstrate that our model consistently outperforms state-of-the-art implicit and explicit 3D reconstruction methods, achieving more complete surfaces with fine-detail preservation across diverse scene scales, from small objects to large architectural buildings.

URL: https://openreview.net/forum?id=5NNJQJD5BS

---

Title: CLImage: Human-Annotated Datasets for Complementary-Label Learning

Abstract: Complementary-label learning (CLL) is a weakly-supervised learning paradigm that aims to train a multi-class classifier using only complementary labels, which indicate classes to which an instance does not belong. Despite numerous algorithmic proposals for CLL, their practical applicability remains unverified for two reasons. Firstly, these algorithms often rely on assumptions about the generation of complementary labels, and it is not clear how far these assumptions are from reality. Secondly, their evaluation has been limited to synthetic datasets. To gain insights into the real-world performance of CLL algorithms, we developed a protocol to collect complementary labels from human annotators. Our efforts resulted in the creation of four datasets: CLCIFAR10, CLCIFAR20, CLMicroImageNet10, and CLMicroImageNet20, derived from the well-known classification datasets CIFAR10, CIFAR100, and TinyImageNet200. These datasets represent the very first real-world CLL datasets. Through extensive benchmark experiments, we discovered a notable decrease in performance when transitioning from synthetic to real-world datasets. We investigated the key factors contributing to this decrease with a thorough dataset-level ablation study. Our analyses highlight annotation noise as the most influential factor in the real-world datasets. In addition, we discover that the biased nature of human-annotated complementary labels and the difficulty of validating with only complementary labels are two outstanding barriers to practical CLL. These findings suggest that the community should focus more research effort on developing CLL algorithms and validation schemes that are robust to noisy and biased complementary-label distributions.
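
For readers new to the setting, the snippet below shows one simple negative-learning-style loss for complementary labels (push down the probability of the class an example is known not to belong to); this is a generic baseline for illustration, not any specific algorithm benchmarked on the CLImage datasets.

import torch
import torch.nn.functional as F

def complementary_loss(logits, comp_labels):
    # Push down the probability of the complementary (known-wrong) class.
    p = F.softmax(logits, dim=1)
    p_comp = p.gather(1, comp_labels.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - p_comp + 1e-8).mean()

logits = torch.randn(16, 10, requires_grad=True)   # toy batch: 16 examples, 10 classes
comp = torch.randint(0, 10, (16,))                 # the class each example does NOT belong to
complementary_loss(logits, comp).backward()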

URL: https://openreview.net/forum?id=FHkWY4aGsN

---

Title: A Framework for Finding Local Saddle Points in Two-Player Zero-Sum Black-Box Games

Abstract: Saddle point optimization is a critical problem employed in numerous real-world applications, including portfolio optimization, generative adversarial networks, and robotics. It has been extensively studied in cases where the objective function is known and differentiable. Existing work in black-box settings with unknown objectives that can only be sampled either assumes convexity-concavity in the objective to simplify the problem or operates with noisy gradient estimators. In contrast, we introduce a framework inspired by Bayesian optimization which utilizes Gaussian processes to model the unknown (potentially nonconvex-nonconcave) objective and requires only zeroth-order samples. Our approach frames the saddle point optimization problem as a two-level process which can flexibly integrate existing and novel approaches to this problem. The upper level of our framework produces a model of the objective function by sampling in promising locations, and the lower level of our framework uses the existing model to frame and solve a general-sum game to identify locations to sample. This lower level procedure can be designed in complementary ways, and we demonstrate the flexibility of our approach by introducing variants which appropriately trade off between factors like runtime, the cost of function evaluations, and the number of available initial samples. We experimentally demonstrate these algorithms on synthetic and realistic datasets in black-box nonconvex-nonconcave settings, showcasing their ability to efficiently locate local saddle points in these contexts.
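
A heavily simplified sketch of the two-level idea, with a Gaussian-process surrogate fit from zeroth-order samples (upper level) and a brute-force minimax selection on a grid standing in for the paper's more sophisticated lower-level game-solving procedures; the objective, grid, and kernel below are illustrative only.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

f = lambda x, y: np.sin(3 * x) * np.cos(3 * y) + 0.5 * x * y   # toy nonconvex-nonconcave objective

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10, 2))                            # initial zeroth-order samples
Z = np.array([f(x, y) for x, y in X])

grid = np.linspace(-1, 1, 41)
XX, YY = np.meshgrid(grid, grid)
cand = np.column_stack([XX.ravel(), YY.ravel()])

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(0.3)).fit(X, Z)    # upper level: model the objective
    mu = gp.predict(cand).reshape(41, 41)                        # posterior mean on the grid
    i = np.argmin(mu.max(axis=0))                                # lower level: x minimizing the worst case over y
    j = np.argmax(mu[:, i])                                      # y maximizing at that x
    x_new, y_new = grid[i], grid[j]
    X = np.vstack([X, [x_new, y_new]])                           # query the true objective there
    Z = np.append(Z, f(x_new, y_new))

print("candidate saddle point:", X[-1], "f =", Z[-1])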

URL: https://openreview.net/forum?id=NbRybPuWCv

---

Title: Inverse Reinforcement Learning via Inverse Optimization

Abstract: Inverse reinforcement learning (IRL) and inverse optimization (IO) for Markov decision processes (MDPs) have developed independently in the literature, despite addressing the same problem. We establish the relationship between the IO framework for MDPs and the convex-analytic view of the apprenticeship learning (AL) formalism proposed by Kamoutsi et al. (2021). Furthermore, we demonstrate that this view of the AL formalism emerges as a relaxation of the IRL problem when observed through the lens of IO. The proposed formulation frames the IRL problem as a regularized min-max problem, extending prior approaches. Notably, the AL formalism is a special case when the regularization term is absent. We solve the regularized convex-concave min-max problem using stochastic mirror descent (SMD) and establish convergence bounds for the proposed method. Numerical experiments highlight the critical role of regularization in recovering the true cost vector for IRL problems.

URL: https://openreview.net/forum?id=AEvdHZFUJR

---

Title: Explaining Caption-Image Interactions in CLIP models with Second-Order Attributions

Abstract: Dual encoder architectures like Clip models map two types of inputs into a shared embedding space and predict similarities between them. Despite their success, it is, however, not understood how these models compare their two inputs. Common first-order feature-attribution methods can only provide limited insights into dual encoders since their predictions depend on feature interactions rather than on individual features.
In this paper, we first derive a second-order method enabling the attribution of predictions by any differentiable dual encoder onto feature interactions between its inputs. Second, we apply our method to Clip models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This visual-linguistic grounding ability, however, varies heavily between object classes and exhibits pronounced out-of-domain effects. We can identify individual errors as well as systematic failure categories, including object coverage, unusual scenes and correlated contexts.
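
The toy sketch below illustrates the general shape of a second-order (interaction) attribution for a dual encoder: each caption-token / image-patch pair (k, j) is scored by the bilinear contraction of the mixed second derivative of the similarity with the two inputs. The mean-pool "encoder" and random embeddings are stand-ins, not a Clip model or the authors' exact estimator.

import torch

torch.manual_seed(0)
txt = torch.randn(4, 8, requires_grad=True)    # 4 caption-token embeddings (toy)
img = torch.randn(6, 8, requires_grad=True)    # 6 image-patch embeddings (toy)

def score(t, i):
    a, b = t.mean(0), i.mean(0)                # stand-in dual encoder: mean-pool each side
    return torch.dot(a, b) / (a.norm() * b.norm())

s = score(txt, img)
(g_txt,) = torch.autograd.grad(s, txt, create_graph=True)

attr = torch.zeros(4, 6)                       # token-patch interaction attributions
for k in range(4):
    contracted = (g_txt[k] * txt[k].detach()).sum()
    (h,) = torch.autograd.grad(contracted, img, retain_graph=True)
    attr[k] = (h * img.detach()).sum(dim=1)

print(attr)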

URL: https://openreview.net/forum?id=HUUL19U7HP

---

Title: Online Test-time Adaptation for Time Series Forecasting

Abstract: Multivariate time series forecasting, which predicts future dynamics by analyzing historical data, has become an essential tool in modern data analysis. With the development of deep models, batch-training based time series forecasting has made significant progress. However, in real-world applications, time series data is often collected incrementally in a streaming manner, with only a portion of the data available at each time step. As time progresses, distribution shifts in the data can occur, leading to a drastic decline in model performance. To address this challenge, online test-time adaptation and online time series forecasting have emerged as promising solutions. However, for the former, most online test-time adaptation methods are primarily designed for images and do not consider the specific characteristics of time series. As for the latter, online time series forecasting typically relies on updating the model with each newly collected sample individually, which is problematic when the sample deviates significantly from the historical data distribution and contains noise, potentially leading to worse generalization performance.
In this paper, we propose Batch Training with Transferable Online Augmentation (BTOA), which enhances model performance through three key ideas while enabling batch training. First, to fully leverage historical information, Transferable Historical Sample Selection (THSS) is proposed with theoretical guarantees to select historical samples that are most similar to the test-time distribution. Then, to mitigate the negative impact of distribution shifts through batch training and take advantage of the unique characteristics of time series, Transferable Online Augmentation (TOA) is proposed to augment the selected historical samples from the perspective of amplitude and phase in the frequency domain in a two-stream manner. Finally, a prediction module that utilizes a series decomposition module and a two-stream forecaster is employed to extract the complex patterns in time series, boosting the prediction performance. Moreover, BTOA is a general approach that is readily pluggable into any existing batch-training based deep models. Experiments demonstrate that our method achieves superior performance across seven benchmark datasets. Compared to state-of-the-art approaches, our method reduces the Mean Squared Error (MSE) by up to 13.7\%. The code is available at \href{https://anonymous.4open.science/r/BTOA-447B/}{https://anonymous.4open.science/r/BTOA/}.
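
As a small illustration of frequency-domain augmentation of a selected historical window (amplitude and phase jitter via the FFT); the actual TOA module is transferable and guided by the test-time distribution, which this sketch does not attempt to capture.

import numpy as np

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 96)) + 0.1 * rng.standard_normal(96)   # a selected historical window

spec = np.fft.rfft(x)
amp, phase = np.abs(spec), np.angle(spec)

amp_aug = amp * (1.0 + 0.1 * rng.standard_normal(amp.shape))       # perturb amplitudes
phase_aug = phase + 0.05 * rng.standard_normal(phase.shape)        # perturb phases

x_aug = np.fft.irfft(amp_aug * np.exp(1j * phase_aug), n=len(x))   # augmented training sample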

URL: https://openreview.net/forum?id=Ht7rlkRCHq

---

Title: Large Language Model Confidence Estimation via Black-Box Access

Abstract: Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and extensible framework where we engineer novel features and train an (interpretable) model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating confidence of Flan-ul2, Llama-13b, Mistral-7b and GPT-4 on four benchmark Q&A tasks, as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over 10% (in AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.
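
A minimal sketch of the framework's overall shape: compute hand-engineered features from repeated black-box queries and fit an interpretable classifier on them. The three features and the stand-in sampling function below are hypothetical; the paper's feature set is richer.

import numpy as np
from sklearn.linear_model import LogisticRegression

def confidence_features(prompt, sample_fn, k=5):
    # Hypothetical black-box features from k sampled responses to the same prompt.
    responses = [sample_fn(prompt) for _ in range(k)]
    lengths = [len(r.split()) for r in responses]
    agreement = max(responses.count(r) for r in responses) / k    # self-consistency rate
    return [np.mean(lengths), np.std(lengths), agreement]

rng = np.random.default_rng(0)
sample_fn = lambda p: rng.choice(["answer A", "answer B", "answer A, I think"])  # stand-in for an LLM API
prompts = [f"question {i}" for i in range(40)]
X = np.array([confidence_features(p, sample_fn) for p in prompts])
y = rng.integers(0, 2, size=40)                                   # toy correctness labels

clf = LogisticRegression().fit(X, y)                              # interpretable confidence model
print("estimated confidence:", clf.predict_proba(X[:3])[:, 1])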

URL: https://openreview.net/forum?id=WrWYChkyRI

---

Title: Controlled Model Debiasing through Minimal and Interpretable Updates

Abstract: Traditional approaches to learning fair machine learning models often require rebuilding models from scratch, generally without accounting for potentially existing previous models. In a context where models need to be retrained frequently, this can lead to inconsistent model updates, as well as redundant and costly validation testing. To address this limitation, we introduce the notion of controlled model debiasing, a novel supervised learning task relying on two desiderata: that the differences between the new fair model and the existing one should be (i) interpretable and (ii) minimal. After providing theoretical guarantees to this new problem, we introduce a novel algorithm for algorithmic fairness, COMMOD, that is both model-agnostic and does not require the sensitive attribute at test time. In addition, our algorithm is explicitly designed to enforce (i) minimal and (ii) interpretable changes between biased and debiased predictions—a property that, while highly desirable in high-stakes applications, is rarely prioritized as an explicit objective in fairness literature. Our approach combines a concept-based architecture and adversarial learning, and we demonstrate through empirical results that it achieves comparable performance to state-of-the-art debiasing methods while performing minimal and interpretable prediction changes.

URL: https://openreview.net/forum?id=B9fdU4qjpD

---

Title: Long Context Transfer from Language to Vision

Abstract: Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon \textit{long context transfer} and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME and MLVU among 7B-scale models by densely sampling more input frames.

URL: https://openreview.net/forum?id=30RAWQVGlx

---

Title: FlowBench: Benchmarking Optical Flow Estimation Methods for Reliability and Generalization

Abstract: Optical flow estimation is a crucial computer vision task often applied to safety-critical real-world scenarios like autonomous driving and medical imaging.
While optical flow estimation accuracy has greatly benefited from the emergence of deep learning, learning-based methods are also known for their lack of generalization and reliability.
However, reliability is paramount when optical flow methods are employed in the real world, where safety is essential.
Furthermore, a deeper understanding of the robustness and reliability of learning-based optical flow estimation methods is still lacking, hindering the research community from building methods safe for real-world deployment.
Thus we propose FlowBench, a robustness benchmark and evaluation tool for learning-based optical flow methods.
FlowBench facilitates streamlined research into the reliability of optical flow methods by benchmarking their robustness to adversarial attacks and out-of-distribution samples.
With FlowBench, we benchmark 63 methods across 3 different datasets under 7 diverse adversarial attacks and 23 established common corruptions, making it the most comprehensive robustness analysis of optical flow methods to date.
Across this wide range of methods, we consistently find that methods with state-of-the-art performance on established standard benchmarks lack reliability and generalization ability.
Moreover, we find interesting correlations between the performance, reliability, and generalization ability of optical flow estimation methods, under various lenses such as design choices used, number of parameters, etc.
After acceptance, FlowBench will be open-source and publicly available, including all tested model weights.

URL: https://openreview.net/forum?id=Kh4bj6YDNm

---

Title: Debiasing Through Circuits: A Reproducibility Study in Mechanistic Interpretability

Abstract: Large language models (LLMs) achieve remarkable performance yet remain vulnerable to adversarial attacks. Mechanistic interpretability offers a promising avenue for diagnosing these weaknesses by identifying the circuits that drive model behavior. We reproduce and critically assess the pipeline introduced by García-Carrasco et al. (2024), which uses activation patching, gradient-based adversarial attacks, and logit attribution to locate vulnerabilities in a synthetic acronym prediction task for GPT-2 small. While their approach provides an interesting toy example, we find incomplete circuit identification and limited adversarial effectiveness. To address these shortcomings, we apply edge attribution patching for more faithful circuit discovery, generalize their adversarial approach to multi-token inputs, and scale the analysis to a larger model, Llama-3.2-1B-Instruct, on a more complex and socially relevant task: toxicity detection with a focus on name-related biases. We further introduce Differential Circuit Editing (DICE) to demonstrate how targeted interventions in the identified circuits can mitigate harmful behavior without compromising task accuracy, resulting in a bias reduction of 12.6% while slightly improving accuracy by 3.4%.
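
For readers unfamiliar with the core tool, the sketch below shows generic activation patching with forward hooks on a toy module: cache an activation from a "clean" run and patch it into a "corrupted" run to measure a component's causal effect. This illustrates the technique in general, not the study's GPT-2/Llama pipeline or the DICE intervention itself.

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))   # toy stand-in for a language model
x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

cache = {}
def save_hook(module, inputs, output):          # 1) cache the clean activation of a component
    cache["act"] = output.detach()
h = model[0].register_forward_hook(save_hook)
logits_clean = model(x_clean)
h.remove()

def patch_hook(module, inputs, output):         # 2) patch the cached activation into the corrupted run
    return cache["act"]
h = model[0].register_forward_hook(patch_hook)
logits_patched = model(x_corrupt)
h.remove()

logits_corrupt = model(x_corrupt)
# The component's effect is read off from how much patching restores the clean output.
print(logits_clean, logits_corrupt, logits_patched)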

URL: https://openreview.net/forum?id=sM34rNHMyv

---

Title: Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects

Abstract: The Abstraction and Reasoning Corpus (ARC) is a popular benchmark focused on visual reasoning in the evaluation of Artificial Intelligence systems. In its original framing, an ARC task requires solving a program synthesis problem over small 2D images using a few input-output training pairs. In this work, we adopt the recently popular data-driven approach to the ARC and ask whether a Vision Transformer (ViT) can learn the implicit mapping, from input image to output image, that underlies the task. We show that a ViT—otherwise a state-of-the-art model for images—fails dramatically on most ARC tasks even when trained on one million examples per task. This points to an inherent representational deficiency of the ViT architecture that makes it incapable of uncovering the simple structured mappings underlying the ARC tasks. Building on these insights, we propose VITARC, a ViT-style architecture that unlocks some of the visual reasoning capabilities required by the ARC. Specifically, we use a pixel-level input representation, design a spatially-aware tokenization scheme, and introduce a novel object-based positional encoding that leverages automatic segmentation, among other enhancements. Our task-specific VITARC models achieve a test solve rate close to 100% on more than half of the 400 public ARC tasks strictly through supervised learning from input-output grids. This calls attention to the importance of imbuing the powerful (Vision) Transformer with the correct inductive biases for abstract visual reasoning that are critical even when the training data is plentiful and the mapping is noise-free. Hence, VITARC provides a strong foundation for future research in visual reasoning using transformer-based architectures.

URL: https://openreview.net/forum?id=Al72Fp0rCg

---

Title: Preference Discerning with LLM-Enhanced Generative Retrieval

Abstract: In sequential recommendation, models recommend items based on a user's interaction history. To this end, they usually incorporate information such as item descriptions and user intent or preferences. User preferences are usually not given in open-source datasets and thus need to be approximated, for example via large language models (LLMs). Current works incorporate approximated user preferences as targets for auxiliary tasks during training of the recommendation model to assist with downstream performance. However, this is limiting, as such models cannot dynamically adapt to changing user preferences after training and require re-training, which is impractical. To address this issue, we propose a new paradigm, namely preference discerning, in which we explicitly condition a generative recommendation model on user preferences in natural language, within its context. Furthermore, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following. Upon assessing current state-of-the-art methods using our benchmark, we discover that they struggle to accurately discern user preferences. To address this, we propose a new method named Mender (Multimodal Preference Discerner), which achieves state-of-the-art performance in our benchmark.
Our results show that Mender can be effectively guided by human preferences, even if not observed during training, paving the way toward more personalized recommendation models.

URL: https://openreview.net/forum?id=74mrOdhvvT

---

Title: Generalization over Memorization in In-Context Learning

Abstract: Transformers exhibit remarkable in-context learning capabilities, solving new tasks without requiring explicit model weight updates. However, existing training paradigms for in-context learners rely on vast, unstructured datasets, which are costly and challenging to collect. These paradigms diverge significantly from how humans learn. Motivated by these limitations, we propose a paradigm shift: training on multiple smaller, domain-specific datasets to improve generalization. We investigate this paradigm by leveraging meta-learning to train an in-context learner across diverse, small-scale datasets using the Meta-Album benchmark. We further investigate realistic scenarios, including domain streaming with curriculum learning strategies and settings where training data is entirely unlabeled. Our experiments demonstrate that this multi-dataset approach promotes broader generalization, enhances robustness in streaming scenarios, and achieves competitive performance even under unsupervised conditions. These findings advance the in-context learning paradigm and shed light on how to bridge the gap between artificial and natural learning processes.

URL: https://openreview.net/forum?id=XMuVlWbjQm

---

Title: Physics-Aware Spatiotemporal Causal Graph Network for Forecasting with Limited Data

Abstract: Spatiotemporal models have drawn significant interest recently due to their widespread applicability across many domains. These models are often made more practically useful by incorporating beneficial inductive biases, such as laws or symmetries from domain-relevant physics equations. This "physics-awareness" provides an interpretable means of grounding otherwise purely data-driven models, improving robustness and boosting performance in settings with limited data. In this work, we view physical dynamics as domain knowledge that captures fundamental causal relationships across space and time, and can be effectively leveraged by our proposed physics-aware spatiotemporal causal graph network (P-STCGN). We first describe a means of deriving causal relationships from spatiotemporal data, serving as physics-aware labels to learn a causal structure via a dedicated neural module. We then formulate a forecasting module that can operate under this causal structure, producing predictions that are guided by physics-aware cause-effect relationships among modeled variables. Extensive experimentation demonstrates that our method is robust to noisy and limited data, outperforming existing models across a variety of challenging synthetic tasks and benchmark datasets. We further evaluate our method on real-world graph signals and observe superior forecasting performance, achieved by effectively utilizing causal signals from prior physics knowledge.

URL: https://openreview.net/forum?id=n3yrVzPcNa

---

Title: Reproducibility Study of "Attack-Resilient Image Watermarking Using Stable Diffusion"

Abstract: This paper presents a reproducibility study and robustness evaluation of the paper ‘Attack-Resilient Image Watermarking Using Stable Diffusion’ by Zhang et al. (2024), which proposes ZoDiac, a Stable Diffusion-based framework for attack-resilient image watermarking. While successfully replicating the original method’s core claims—achieving >90% watermark detection rate (WDR) against diffusion-based regeneration attacks and across MS-COCO, DiffusionDB, and WikiArt datasets—we identify critical vulnerabilities under adversarial and geometrically asymmetric attack paradigms. Our extended analysis demonstrates that gradient-based adversarial perturbations reduce ZoDiac’s WDR, a threat model absent in prior evaluations. We also investigate rotationally asymmetric attacks achieving WDR below 65%. We further investigate a new loss function to mitigate these limitations. Despite these enhancements, composite attacks combining adversarial noise with other methods reduce WDR to near-zero, exposing vulnerabilities through multi-stage offensive pipelines. Our implementation can be found on Anonymous Github.
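
A generic sketch of the kind of gradient-based perturbation used in the extended threat model: PGD that pushes down the detection score of a stand-in differentiable watermark detector. ZoDiac's actual detector works in the diffusion latent space, which this toy omits; the epsilon and step sizes are illustrative.

import torch
import torch.nn as nn

torch.manual_seed(0)
detector = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))   # stand-in detector: logit > 0 means "watermarked"
image = torch.rand(1, 3, 32, 32)                                     # toy watermarked image
eps, alpha, steps = 8 / 255, 2 / 255, 10

delta = torch.zeros_like(image, requires_grad=True)
for _ in range(steps):                                               # PGD: minimize the detection logit
    logit = detector(image + delta)
    logit.sum().backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()
        delta.clamp_(-eps, eps)                                      # stay within the L-infinity budget
        delta.add_(image).clamp_(0, 1).sub_(image)                   # keep the perturbed image valid
    delta.grad.zero_()

adversarial = (image + delta).detach()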

URL: https://openreview.net/forum?id=xoQV6kdTqG

---

Title: [Re] Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents

Abstract: Large Language Models (LLMs) are increasingly used in strategic decision-making environments, including game-theoretic scenarios where multiple agents interact under predefined rules. One such setting is the common pool resource environment. In this study, we build upon Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents (Piatti et al., 2024), a framework designed to test cooperation strategies among LLM agents. We begin by replicating their results to a large degree to validate the framework. Then, we extend their analysis by identifying a notable trend: specialized models trained on research papers and mathematical reasoning tasks outperform general-purpose models of similar scale in this environment. Additionally, we evaluate the recently released DeepSeek-R1-Distill models, which show improvements over their baseline counterparts but come at a higher computational cost. Finally, we investigate the impact of different prompting strategies, including the veil of ignorance mechanism and other prompting strategies based on universalization principles with varying levels of abstraction. Our results suggest that older models benefit significantly from explicit boundary conditions, whereas newer models demonstrate greater robustness to implicit constraints.

URL: https://openreview.net/forum?id=EWWxSkUchO

---

Title: Reproducibility Study of "XRec: Large Language Models for Explainable Recommendation"

Abstract: In this study, we reproduced the work done in the paper “XRec: Large Language Models for Explainable Recommendation” by Ma et al. (2024). The original authors introduced XRec, a model-agnostic collaborative instruction-tuning framework that enables large language models (LLMs) to provide users with comprehensive explanations of generated recommendations. Our objective was to replicate the results of the original paper, albeit using Llama 3 as the LLM for evaluation instead of GPT-3.5-turbo. We built on the source code provided by Ma et al. (2024) to achieve our goal. Our work extends the original paper by modifying the input embeddings or deleting the output embeddings of XRec’s Mixture of Experts module. Based on our results, XRec effectively generates personalized explanations and its stability is improved by incorporating collaborative information. However, XRec did not consistently outperform all baseline models in every metric. Our extended analysis further highlights the importance of the Mixture of Experts embeddings in shaping the explanation structures, showcasing how collaborative signals interact with language modeling. Through our work, we provide an open-source evaluation implementation that enhances accessibility for researchers and practitioners alike. Our complete code can be found at
https://anonymous.4open.science/r/xrec-repro-C2CD/.

URL: https://openreview.net/forum?id=np1P9HR9hQ

---

Title: Reproducibility Study: Mastering cooperation between small LLMs within the Governance of the Commons Simulation

Abstract: Governance of the Commons Simulation (GovSim) is a Large Language Model (LLM) multi-agent framework designed to study cooperation and sustainability between LLM agents in resource-sharing environments (Piatti et al., 2024). Understanding the cooperation capabilities of LLMs is vital to the real-world applicability of these models. This reproducibility study aims to verify the claims in the original paper by replicating their experiments using small open-source LLMs and extending the framework. The original paper claims that (1) GovSim enables the study and benchmarking of emergent sustainable behavior, (2) only the largest and most powerful LLM agents achieve a sustainable equilibrium, while smaller models fail, and (3) agents using universalization-based reasoning significantly improve sustainability. To test the second claim, we conducted simulations with the small open-source models used in the original study. Additionally, by running the same experiments with small SOTA DeepSeek models, we successfully achieved a sustainable equilibrium. This contradicts the original claim, suggesting that recent advances in LLMs have improved the cooperation abilities of small LLMs. Regarding the third claim, our results confirm that universalization-based reasoning improves performance in the GovSim environment, supporting the original authors' claim. However, further analysis suggests that the improved performance primarily stems from the numerical instructions provided to agents rather than the principle of universalization itself.

URL: https://openreview.net/forum?id=ON8EMrNwww

---

Title: A Reproducibility Study of Decoupling Feature Extraction and Classification Layers for Calibrated Neural Networks

Abstract: Many neural networks, especially over-parameterized ones, suffer from poor calibration and overconfidence. To address this, Jordahn & Olmos (2024) recently proposed a Two-Stage Training (TST) procedure that decouples the training of feature extraction and classification layers. In this study, we replicate their findings and extend their work through a series of ablation studies. We reproduce their main results and find that most of them replicate, with slight deviations for CIFAR100. Additionally, we extend the authors' results by exploring the impact of different model architectures, Monte Carlo (MC) sample sizes, and classification head designs. We further compare the method with focal loss - an implicit regularization technique known to improve calibration - and investigate whether calibration can be improved further by combining the two methods. We find that calibration can be improved even further by using focal loss in the first stage of two-stage training. Our experiments validate the claims made by Jordahn & Olmos (2024), and show the transferability of two-stage training to different architectures.
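
For reference, a standard multi-class focal loss of the kind compared against above (the focusing parameter gamma down-weights well-classified examples); how it is combined with the first stage of two-stage training follows the study's setup, which this snippet does not reproduce.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Focal loss: cross-entropy scaled by (1 - p_t)^gamma.
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

logits = torch.randn(8, 100, requires_grad=True)    # e.g. a CIFAR100-sized output
targets = torch.randint(0, 100, (8,))
focal_loss(logits, targets).backward()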

URL: https://openreview.net/forum?id=5Hwzd48ILf

---

Title: Reassessing Fairness: A Reproducibility Study of NIFA’s Impact on GNN Models

Abstract: Graph Neural Networks (GNNs) have demonstrated exceptional performance in processing graph-structured data, yet fairness concerns remain a critical challenge due to GNNs amplifying bias and prejudice in training data. The Node Injection-based Fairness Attack (NIFA)
(Luo et al., 2024) was recently proposed as a gray-box method to compromise fairness while maintaining model utility. This study aims to reproduce and validate the claims of NIFA, assessing its impact across multiple datasets and GNN architectures. This reproduction
study confirms that NIFA is an effective gray-box attack that degrades the fairness metrics, statistical parity, and equal odds while having a negligible utility loss. Additionally, NIFA’s ability to outperform other graph utility and fairness attacks is inconclusive. Finally, we extend the original work by evaluating NIFA’s performance under multi-class sensitive attributes and varying levels of homophily. NIFA’s ability to degrade fairness shows promising results in a multi-class sensitive attribute environment. Varying levels of homophily showed minimal utility loss and stable fairness metrics across most configurations, with the exception of heterophilic-homophilic and highly homophilic settings. The codebase used in this study can be found at https://anonymous.4open.science/r/Reassessing-NIFA-B4F5/.

URL: https://openreview.net/forum?id=l5fXUKi8GO

---

Title: Scaling Channel-Invariant Self-Supervised Learning

Abstract: Recent advances in self-supervised pre-training of foundation models for natural images have made them a popular choice for various visual systems and applications. Self-supervised strategies are also promising in non-RGB scientific imaging domains such as biology, medical and satellite imagery, but their broader application is hampered by heterogeneity in channel composition and semantics between relevant datasets: two datasets may contain different numbers of channels, and these may reveal distinct aspects of an object or scene. Recent works on channel-invariant strategies report substantial advantages for those that account for variable channel compositions without sacrificing the ability to jointly encode channels; yet, how these strategies behave at scale remains unclear. Here we show that, surprisingly, when trained across large-scale microscopy datasets, independent encoding of channels consistently outperforms joint-encoding methods by a substantial margin. We validate this result through an extensive set of experiments on various datasets from cell microscopy to geospatial imagery. Our DINO BoC approach sets a new state-of-the-art across challenging benchmarks, including generalization to out-of-distribution tasks and unseen channel combinations at test time. We will open-source the code, along with model weights that constitute a new general-purpose feature extractor for fluorescent microscopy.

URL: https://openreview.net/forum?id=pT8sgtRVAf

---

Title: Reproducibility Study of "FairViT: Fair Vision Transformer via Adaptive Masking"

Abstract: Recently, Vision Transformers (ViTs) have excelled in computer vision tasks but often struggle with fairness issues related to attributes like gender and hair colour. FairViT, by Tian et al. (2024), aims to address this challenge by introducing adaptive masking combined with a distance-based loss to improve fairness and accuracy while maintaining competitive computational efficiency compared to other baseline methods. In our reproducibility study, we evaluated FairViT on the CelebA dataset on tasks related to attractiveness and facial expression prediction, while considering specific sensitive attributes. We then compared FairViT against the Vanilla and Fair Supervised Contrastive Loss (FSCL) baseline models. Contrary to the original claim regarding the effectiveness of adaptive masking, we observed that its impact is negligible in terms of both fairness and accuracy, a finding also confirmed on the UTKFace dataset. On the other hand, the distance-based loss demonstrated partial effectiveness, but mainly when tested in the context of different architectures. Finally, in terms of computational efficiency, FairViT required almost double the training time per epoch compared to the Vanilla model and did not outperform FSCL, which had the lowest training time for the specified dataset size used by the authors. Overall, our findings highlight the potential effectiveness of the proposed distance loss. However, the adaptive masking method did not deliver the expected improvements while also increasing the computational cost. Our implementation is available at: https://anonymous.4open.science/r/FairViT-reproducibility-study-54B0/.

URL: https://openreview.net/forum?id=QeERUU3GEB

---

Title: Time-Uniform Confidence Spheres for Means of Random Vectors

Abstract: We study sequential mean estimation in $\mathbb{R}^d$. In particular, we derive time-uniform confidence spheres---\emph{confidence sphere sequences} (CSSs)---which contain the mean of random vectors with high probability simultaneously across all sample sizes.
Our results include a dimension-free CSS for log-concave random vectors, a dimension-free CSS for sub-Gaussian random vectors, and
CSSs for sub-$\psi$ random vectors (which include sub-gamma and sub-exponential distributions). Many of our results are optimal. For sub-Gaussian distributions we also provide a CSS which tracks a time-varying mean, generalizing Robbins' mixture approach to the multivariate setting. Finally, we provide several CSSs for heavy-tailed random vectors (two moments only). Our bounds hold under a martingale assumption on the mean and do not require that the observations be iid. Our work is based on PAC-Bayesian theory and inspired by an approach of Catoni and Giulini.

URL: https://openreview.net/forum?id=2NSb3cJE03

---

Title: Revisiting Sparse Learning Methods: A Comprehensive Comparison of Best Subset Selection and LASSO

Abstract: Understanding the comparative performance of $L_0$ and $L_1$ models is crucial for developing accurate and efficient machine learning systems, particularly in noisy, real-world settings. The current understanding in the literature is that $L_1$-penalized linear models perform better than $L_0$ models as noise increases. However, prior studies have largely relied on small and synthetic datasets and limited comparisons between differing optimizers, leaving practical implications for diverse applications underexplored.
We fill these gaps by testing multiple $L_0$- and $L_1$-based optimizers on a larger variety of real datasets, and demonstrate that performance differences between $L_0$ and $L_1$ models depend significantly on the choice of optimizer and dataset characteristics. In many cases, the difference in performance from changing the optimization algorithm, while leaving the regularization penalty constant, is larger than the difference from changing the penalty. Additionally, we demonstrate cases where an $L_0$-penalized model can be both sparser and more accurate than the $L_1$-penalized variants. Together, our results show that even convex $L_1$ models can vary significantly in performance according to optimizer implementation, and that $L_0$-penalized models are more viable for many smaller, noisy real-world situations than previously recognized.
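
A small sketch of the kind of comparison described above, using two off-the-shelf solvers as stand-ins: Lasso for the $L_1$ penalty and Orthogonal Matching Pursuit as a greedy $L_0$-style method. The paper evaluates a broader set of optimizers on real (not synthetic) datasets.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

l1 = Lasso(alpha=1.0).fit(X_tr, y_tr)                              # L1-penalized fit
l0 = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X_tr, y_tr)  # greedy L0-style fit

for name, m in [("L1 (Lasso)", l1), ("L0-style (OMP)", l0)]:
    nnz = int(np.sum(np.abs(m.coef_) > 1e-8))
    print(f"{name}: test R^2 = {m.score(X_te, y_te):.3f}, nonzeros = {nnz}")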

URL: https://openreview.net/forum?id=W6O5QTjXVY

---

Title: DistDD: Distributed Data Distillation Aggregation through Gradient Matching

Abstract: In this paper, we introduce DistDD, a novel approach within the federated learning framework that reduces the need for repetitive communication by distilling data directly on clients’ devices. Unlike traditional federated learning that requires iterative model updates across nodes, DistDD facilitates a one-time distillation process that extracts a global distilled dataset, maintaining the privacy standards of federated learning while significantly cutting down communication costs. By leveraging DistDD's distilled dataset, FL developers can achieve just-in-time parameter tuning and neural architecture search over FL without repeating the whole FL process multiple times. We provide a detailed convergence proof of the DistDD algorithm, reinforcing its mathematical stability and reliability for practical applications. Our experiments demonstrate the effectiveness and robustness of DistDD, particularly in non-i.i.d. and mislabeled data scenarios, showcasing its potential to handle complex real-world data challenges distinctively from conventional federated learning methods. We also evaluate DistDD in a neural architecture search (NAS) use case and demonstrate its effectiveness and communication savings there.
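
A rough single-client sketch of the gradient-matching idea that dataset distillation builds on: synthetic examples are optimized so that their gradients match the gradients of real data. The federated aggregation, label handling, and convergence machinery of DistDD are not reproduced here; the model and data are toys.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
params = list(model.parameters())

x_real = torch.randn(64, 20)                       # a client's real data (toy)
y_real = torch.randint(0, 2, (64,))

x_syn = torch.randn(10, 20, requires_grad=True)    # learnable distilled examples
y_syn = torch.randint(0, 2, (10,))
opt = torch.optim.Adam([x_syn], lr=0.01)

for step in range(100):
    g_real = torch.autograd.grad(F.cross_entropy(model(x_real), y_real), params)
    g_real = [g.detach() for g in g_real]

    g_syn = torch.autograd.grad(F.cross_entropy(model(x_syn), y_syn), params, create_graph=True)

    match = sum(F.mse_loss(a, b) for a, b in zip(g_syn, g_real))   # gradient-matching objective
    opt.zero_grad()
    match.backward()
    opt.step()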

URL: https://openreview.net/forum?id=8nJWFAFAqQ

---

Title: Training on Fake Labels: Mitigating Label Leakage in Split Learning via Secure Dimension Transformation

Abstract: Two-party split learning has emerged as a popular paradigm for vertical federated learning. To preserve the privacy of the label owner, split learning utilizes a split model, which only requires the exchange of intermediate representations (IRs) based on the inputs and gradients for each IR between two parties during the learning process. However, split learning has recently been shown to be vulnerable to label inference attacks. Though several defense methods could be adopted, they either have limited defensive performance or significantly harm the original task. In this paper, we propose a novel two-party split learning method to defend against existing label inference attacks while maintaining the high utility of the learned models. Specifically, we first craft a dimension transformation module, SecDT, which achieves bidirectional mapping between the original labels and expanded $K$-class labels to mitigate label leakage from the directional perspective. Then, a gradient normalization algorithm is designed to remove the magnitude divergence of gradients from different classes. We propose a softmax-normalized Gaussian noise to mitigate privacy leakage and make our $K$ unknowable to adversaries. We conducted experiments on real-world datasets, including two binary-classification datasets (Avazu and Criteo) and three multi-classification datasets (MNIST, FashionMNIST, CIFAR-10); we also considered current attack schemes, including direction, norm, spectral, and model completion attacks. The detailed experiments demonstrate our proposed method's effectiveness and superiority over existing approaches. For instance, on the Avazu dataset, the attack AUC of the four evaluated prominent attacks could be reduced by 0.4532±0.0127.

URL: https://openreview.net/forum?id=Ol0DBcIM7p

---

Title: [RE] Are Your Models Still Fair? Fairness Attacks on Graph Neural Networks via Node Injections

Abstract: Graph Neural Networks (GNNs) have become indispensable for learning on graph-structured data, with applications in socially sensitive domains such as recommendation systems and healthcare. However, recent research has revealed that fairness-enhancing GNNs remain vulnerable to adversarial attacks, raising concerns about their real-world robustness. This paper represents a reproducibility study of Luo et al. (2024), which demonstrates that adversarial node injection can effectively compromise fairness while preserving overall predictive accuracy. Our results confirm that such attacks are efficient (requiring minimal perturbations), realistic (exploiting feasible node injections), and deceptive (causing fairness degradation without significant accuracy loss). Along with validating the original findings, we redefine their framework as an evasion attack, showing that the attack remains effective on a clean model. Furthermore, we propose a novel defense strategy and analyze the impact of model depth on the attack. Our results highlight the need for more robust GNN architectures against fairness-targeted adversarial threats.

URL: https://openreview.net/forum?id=wnnh4XjFXp

---

Title: Carefully Blending Adversarial Training, Purification, and Aggregation Improves Adversarial Robustness

Abstract: In this work, we propose a novel adversarial defence mechanism for image classification - CARSO - blending the paradigms of adversarial training and adversarial purification in a synergistic robustness-enhancing way. The method builds upon an adversarially-trained classifier, and learns to map its internal representation associated with a potentially perturbed input onto a distribution of tentative clean reconstructions. Multiple samples from such a distribution are classified by the same adversarially-trained model, and a carefully chosen aggregation of its outputs finally constitutes the robust prediction of interest. Experimental evaluation on a well-established benchmark of strong adaptive attacks, across different image datasets, shows that CARSO is able to defend itself against adaptive end-to-end white-box attacks devised for stochastic defences. Paying a modest clean accuracy toll, our method improves the state of the art by a significant margin for Cifar-10, Cifar-100, and TinyImageNet-200 $\ell_\infty$ robust classification accuracy against AutoAttack.

URL: https://openreview.net/forum?id=40BXthYscW

---

Title: MARL-LNS: Efficient Multi-agent Deep Reinforcement Learning via Large Neighborhoods Search

Abstract: Cooperative multi-agent reinforcement learning (MARL) has emerged as a powerful paradigm for addressing complex real-world problems. However, the well-established centralized training with decentralized execution framework is hampered by the curse of dimensionality, leading to prolonged training times and inefficient convergence. In this work, we introduce MARL-LNS, a general training framework that overcomes these challenges by iteratively training on alternating subsets of agents with existing deep MARL algorithms serving as low-level trainers—without incurring any additional trainable parameters. Building on this framework, we propose three variants—Random Large Neighborhood Search (RLNS), Batch Large Neighborhood Search (BLNS), and Adaptive Large Neighborhood Search (ALNS)—each differing in its strategy for alternating agent subsets. Empirical evaluations on both the StarCraft Multi-Agent Challenge and Google Research Football environments demonstrate that our approach can reduce training time by at least 10\% while achieving comparable final performance to state-of-the-art methods.
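
A schematic sketch of the alternating-subset idea behind the simplest variant (RLNS): each phase trains only a randomly sampled "neighborhood" of agents with an existing MARL trainer, represented here by a placeholder function; agent counts and phase lengths are illustrative.

import random

def train_subset(agent_ids, num_updates):
    # Placeholder for an existing deep MARL trainer restricted to `agent_ids`.
    print(f"training agents {sorted(agent_ids)} for {num_updates} updates")

n_agents, neighborhood_size, phases = 8, 3, 5
agents = list(range(n_agents))

random.seed(0)
for phase in range(phases):                        # RLNS: resample the neighborhood each phase
    subset = random.sample(agents, neighborhood_size)
    train_subset(subset, num_updates=1000)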

URL: https://openreview.net/forum?id=O3e4x8W6GL

---

Title: Transferring Reasoning Capabilities between LLMs operating via Curriculum Learning Policy

Abstract: In-context reasoning methods, exemplified by Chain-of-Thought (CoT) prompting and related techniques, empower the reasoning abilities of large language models (LLMs), eliciting them to solve complex reasoning tasks step by step. Nevertheless, the capacity to deliver robust CoT explanations arises only in models with billions of parameters, representing a barrier to entry for many users forced to operate on a smaller model scale, i.e., Small Language Models (SLMs). Even though many companies are releasing LLMs of the same family with a reduced number of parameters, these models sometimes produce misleading answers and are unable to deliver accurate step-wise reasoned answers. This paper proposes a method to transfer step-wise reasoning to SLMs by operating via Instruction-tuning (IT) on synthetic demonstrations delivered in a pedagogically motivated manner. In particular, we first propose aligning step-wise reasoning capabilities via IT using demonstrations "taught" by LLM teachers to SLM students. Then, we operate via Curriculum Learning (CL), a pedagogically motivated learning method that improves the IT phase. We analyse the impact on the downstream performance of four question-answering benchmarks. The results show that SLMs can be instructed to reason via demonstrations delivered by LLMs. We take this a step further: conceiving SLMs as human learners, we expose them to a CL teaching-based approach, obtaining better downstream performance.
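
A schematic sketch of the curriculum-ordering step: demonstrations are sorted from easy to hard before instruction tuning. The difficulty proxy (rationale length) and the fine-tuning call are placeholders, not the exact recipe used in the paper.

def difficulty(demo):
    # Hypothetical proxy: longer rationales are treated as harder.
    return len(demo["rationale"].split())

demos = [
    {"question": "What is 2 + 2?", "rationale": "2 plus 2 equals 4.", "answer": "4"},
    {"question": "A train covers 120 km in 2 hours; what is its speed?",
     "rationale": "Speed is distance over time, so 120 km / 2 h = 60 km/h.", "answer": "60 km/h"},
]

curriculum = sorted(demos, key=difficulty)         # easy-to-hard ordering
for stage, demo in enumerate(curriculum):
    # fine_tune(slm, demo)  <- placeholder for one instruction-tuning step on the SLM
    print(f"stage {stage}: {demo['question']}")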

URL: https://openreview.net/forum?id=zPKqyjmyEQ

---

Title: Learning with Noisy Labels [Re]visited

Abstract: Learning with noisy labels (LNL) is a subfield of supervised machine learning investigating scenarios in which the training data contain errors. While most research has focused on synthetic noise, where labels are randomly corrupted, real-world noise from human annotation errors is more complex and less understood. Wei et al. (2022) introduced CIFAR-N, a dataset with human-labeled noise and claimed that real-world noise is fundamentally more challenging than synthetic noise. This study aims to reproduce their experiments on testing the characteristics of human-annotated label noise, memorization dynamics, and benchmarking of LNL methods. We successfully reproduce some of the claims but identify some quantitative discrepancies. Notably, our attempts to reproduce the reported benchmark reveal inconsistencies in the reported results. To address these issues, we develop a unified framework and propose a refined benchmarking protocol that ensures a fairer evaluation of LNL methods. Our findings confirm that real-world noise differs structurally from synthetic noise and is memorized more rapidly by deep networks. By open-sourcing our implementation, we provide a more reliable foundation for future research in LNL.

URL: https://openreview.net/forum?id=GKZ2leags0

---

Title: UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting

Abstract: Transformer-based models have emerged as powerful tools for multivariate time series forecasting (MTSF). However, existing Transformer models often fall short of capturing the intricate dependencies across both variate and temporal dimensions in MTS data. Some recent models have been proposed to separately capture variate and temporal dependencies through either two sequential or parallel attention mechanisms. However, these methods cannot directly and explicitly learn the intricate inter-series and intra-series dependencies. In this work, we first demonstrate that these dependencies are very important as they usually exist in real-world data. To directly model these dependencies, we propose a transformer-based model UniTST containing a unified attention mechanism on the flattened patch tokens. Additionally, we add a dispatcher module which reduces the complexity and makes the model feasible for a potentially large number of variates. Although our proposed model employs a simple architecture, it offers compelling performance, as shown in our extensive experiments on several datasets for time series forecasting.

URL: https://openreview.net/forum?id=p3y5q4cvzV

---

Title: Unifying Generative and Dense Retrieval for Sequential Recommendation

Abstract: Sequential dense retrieval models utilize advanced sequence learning techniques to compute item and user representations, which are then used to rank relevant items for a user through inner product computation between the user and all item representations. However, this approach requires storing a unique representation for each item, resulting in significant memory requirements as the number of items grows. In contrast, the recently proposed generative retrieval paradigm offers a promising alternative by directly predicting item indices using a generative model trained on semantic IDs that encapsulate items’ semantic information. Despite its potential for large-scale applications, a comprehensive comparison between generative retrieval and sequential dense retrieval under fair conditions is still lacking, leaving open questions regarding performance, and computation trade-offs. To address this, we compare these two approaches under controlled conditions on academic benchmarks and observe performance gaps, where dense retrieval achieves better ranking performance but at a higher computational cost. Motivated by these observations, we propose LIGER (LeveragIng dense retrieval for GEnerative Retrieval), a hybrid model that combines the strengths of these two widely used approaches. LIGER integrates sequential dense retrieval into generative retrieval, mitigating performance differences, and enhancing cold-start item recommendation in the evaluated datasets. This hybrid approach provides insight into the trade-offs between these approaches and demonstrates improvements in efficiency and effectiveness for recommendation systems in small-scale benchmarks.

URL: https://openreview.net/forum?id=jxdnFIsjCb

---

Title: The Accuracy Cost of Weakness: A Theoretical Analysis of Fixed-Segment Weak Labeling for Events in Time

Abstract: Accurate labels are critical for deriving robust machine learning models. Labels are used to train supervised learning models and to evaluate most machine learning paradigms. In this paper, we model the accuracy and cost of a common weak labeling process where annotators assign presence or absence labels to fixed-length data segments for a given event class. The annotator labels a segment as "present" if it sufficiently covers an event from that class, e.g., a birdsong sound event in audio data. We analyze how the segment length affects the label accuracy and the required number of annotations, and compare this fixed-length labeling approach with an oracle method that uses the true event activations to construct the segments. Furthermore, we quantify the gap between these methods and verify that in most realistic scenarios the oracle method is better than the fixed-length labeling method in both accuracy and cost. Our findings provide a theoretical justification for adaptive weak labeling strategies that mimic the oracle process, and a foundation for optimizing weak labeling processes in sequence labeling tasks.
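
A small simulation of the fixed-segment weak-labeling process analyzed above: a segment is marked "present" when a sufficient fraction of it is covered by true events. The coverage criterion, threshold, and event intervals below are illustrative; the paper formalizes the labeling model precisely.

import numpy as np

events = [(2.3, 4.1), (7.8, 8.4)]             # true event intervals (e.g. birdsong), in seconds
seg_len, total, tau = 1.0, 10.0, 0.5          # segment length, clip length, coverage threshold

def overlap(a, b, c, d):
    return max(0.0, min(b, d) - max(a, c))

labels = []
for start in np.arange(0.0, total, seg_len):
    end = start + seg_len
    covered = sum(overlap(start, end, s, e) for s, e in events)
    labels.append(int(covered / seg_len >= tau))  # "present" if coverage is sufficient

print(labels)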

URL: https://openreview.net/forum?id=tTw8wXBQ18

---

Title: Neural varifolds: an aggregate representation for quantifying the geometry of point clouds

Abstract: Point clouds are popular 3D representations for real-life objects (such as in LiDAR and Kinect) due to their detailed and compact representation of surface-based geometry. Recent approaches characterise the geometry of point clouds by bringing deep learning based techniques together with geometric fidelity metrics such as optimal transportation costs (e.g., Chamfer and Wasserstein metrics). In this paper, we propose a new surface geometry characterisation within this realm, namely a neural varifold representation of point clouds. Here, the surface is represented as a measure/distribution over both point positions and tangent spaces of point clouds. The varifold representation quantifies not only the surface geometry of point clouds through the manifold-based discrimination, but also subtle geometric consistencies on the surface due to the combined product space. This study proposes neural varifold algorithms to compute the varifold norm between two point clouds using neural networks on point clouds and their neural tangent kernel representations. The proposed neural varifold is evaluated on three different sought-after tasks -- shape matching, few-shot shape classification, and shape reconstruction. Detailed evaluation and comparison to the state-of-the-art methods demonstrate that the proposed versatile neural varifold is superior in shape matching and few-shot shape classification, and is competitive for shape reconstruction.

URL: https://openreview.net/forum?id=P02hoA7vln

---

Title: DeblurDiNAT: A Compact Model with Exceptional Generalization and Visual Fidelity on Unseen Domains

Abstract: Recent deblurring networks have effectively restored clear images from blurred ones. However, they often struggle with generalization to unknown domains. Moreover, these models typically focus on distortion metrics such as PSNR and SSIM, neglecting metrics aligned with human perception. To address these limitations, we propose DeblurDiNAT, a deblurring Transformer based on Dilated Neighborhood Attention. First, DeblurDiNAT employs an alternating dilation factor paradigm to capture both local and global blurred patterns, enhancing generalization and perceptual clarity. Second, a local cross-channel learner aids the Transformer block in understanding the short-range relationships between adjacent channels. Additionally, we present a linear feed-forward network with a simple yet effective design. Finally, a dual-stage feature fusion module is introduced as an alternative to the existing approach, which efficiently processes multi-scale visual information across network levels. Compared to state-of-the-art models, our compact DeblurDiNAT demonstrates superior generalization capabilities and achieves remarkable performance on perceptual metrics, while maintaining a favorable model size.

URL: https://openreview.net/forum?id=zzubCvauSv

---
