J2C Certification: TimeAutoDiff: A Unified Framework for Generation, Imputation, Forecasting, and Time-Varying Metadata Conditioning of Heterogeneous Time Series Tabular Data
Namjoon Suh, Yuning Yang, Din-Yin Hsieh, Qitong Luan, Shirong Xu, Shixiang Zhu, Guang Cheng
https://openreview.net/forum?id=bkUd1Dg46c
---
J2C Certification: LOGLO-FNO: Efficient Learning of Local and Global Features in Fourier Neural Operators
Marimuthu Kalimuthu, David Holzmüller, Mathias Niepert
https://openreview.net/forum?id=MQ1dRdHTpi
---
Featured Certification, J2C Certification: LO-BCQ: Locally Optimal Block Clustered Quantization for 4-bit (W4A4) LLM Inference
Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany
https://openreview.net/forum?id=loWISTqGwW
---
Featured Certification, Reproducibility Certification, J2C Certification: Robust Reinforcement Learning in a Sample-Efficient Setting
Siemen Herremans, Ali Anwar, Siegfried Mercelis
https://openreview.net/forum?id=iij6nLYLjF
---
J2C Certification: Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes
David Mark Bossens, Atsushi Nitanda
https://openreview.net/forum?id=tmfdqtFUqO
---
J2C Certification: Tractable Representation Learning with Probabilistic Circuits
Steven Braun, Sahil Sidheekh, Antonio Vergari, Martin Mundt, Sriraam Natarajan, Kristian Kersting
https://openreview.net/forum?id=h8D75pVKja
---
Featured Certification, J2C Certification: Inverse Scaling in Test-Time Compute
Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez
https://openreview.net/forum?id=NXgyHW1c7M
---
Accepted papers
===============
Title: Top-$k$ Feature Importance Ranking
Authors: Eric Chen, Tiffany Tang, Genevera I. Allen
Abstract: Accurate ranking of important features is a fundamental challenge in interpretable machine learning with critical applications in scientific discovery and decision-making. Unlike feature selection and feature importance, the specific problem of ranking important features has received considerably less attention. We introduce RAMPART (Ranked Attributions with MiniPatches And Recursive Trimming), a framework that utilizes any existing feature importance measure in a novel algorithm specifically tailored for ranking the top-$k$ features. Our approach combines an adaptive sequential halving strategy that progressively focuses computational resources on promising features with an efficient ensembling technique using both observation and feature subsampling. Unlike existing methods that convert importance scores to ranks as post-processing, our framework explicitly optimizes for ranking accuracy. We provide theoretical guarantees showing that RAMPART achieves the correct top-$k$ ranking with high probability under mild conditions, and demonstrate through extensive simulation studies that RAMPART consistently outperforms popular feature importance methods, concluding with two high-dimensional genomics case studies.
URL: https://openreview.net/forum?id=2OSHpccsaV
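Note: the sequential halving idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' RAMPART implementation. The function names, the synthetic `TRUE_IMPORTANCE` table, and the noisy scoring oracle are all hypothetical, and the oracle ignores observation subsampling entirely. Each round scores the surviving features on small random feature subsets ("minipatches"), averages the scores, keeps the better half, and doubles the scoring budget.

```python
import random

def sequential_halving_top_k(features, score_minipatch, k, patch_size=4, passes=2, seed=0):
    """Rank the top-k features by repeatedly halving the candidate set.

    score_minipatch(patch, rng) is a hypothetical oracle that returns a
    dict {feature: importance score} for the features in `patch`.
    """
    rng = random.Random(seed)
    candidates = list(features)
    while True:
        totals = {f: 0.0 for f in candidates}
        counts = {f: 0 for f in candidates}
        for _ in range(passes):
            # Shuffle and split so every candidate lands in exactly one patch per pass.
            order = candidates[:]
            rng.shuffle(order)
            for start in range(0, len(order), patch_size):
                patch = order[start:start + patch_size]
                for f, s in score_minipatch(patch, rng).items():
                    totals[f] += s
                    counts[f] += 1
        ranked = sorted(candidates, key=lambda f: totals[f] / counts[f], reverse=True)
        if len(candidates) <= k:
            return ranked[:k]
        # Keep the better half (never fewer than k) and spend more passes on it.
        candidates = ranked[:max(k, len(candidates) // 2)]
        passes *= 2

# Synthetic ground truth: x0 is the most important feature, x15 the least.
TRUE_IMPORTANCE = {f"x{i}": float(16 - i) for i in range(16)}

def noisy_importance(patch, rng):
    # Stand-in for any feature importance measure evaluated on a minipatch.
    return {f: TRUE_IMPORTANCE[f] + rng.uniform(-0.4, 0.4) for f in patch}

top3 = sequential_halving_top_k(list(TRUE_IMPORTANCE), noisy_importance, k=3)
print(top3)
```

Because the noise here is bounded well below the gap between consecutive importances, the returned order matches the synthetic ground truth. With real importance measures the ranking is only correct with some probability, which is what the paper's guarantees address.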
---
Title: Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Authors: Zubair Bashir, Bhavik Chandna, Procheta Sen
Abstract: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because these tasks share important components with the bias-related computations.
URL: https://openreview.net/forum?id=EpQ2CBJTjD
---
Title: End-to-End Conformal Calibration for Optimization Under Uncertainty
Authors: Christopher Yeh, Nicolas Christianson, Alan Wu, Adam Wierman, Yisong Yue
Abstract: Machine learning can significantly improve performance for decision-making under uncertainty across a wide range of domains. However, ensuring robustness guarantees requires well-calibrated uncertainty estimates, which can be difficult to achieve with neural networks. Moreover, in high-dimensional settings, there may be many valid uncertainty estimates, each with its own performance profile—i.e., not all uncertainty is equally valuable for downstream decision-making. To address this problem, this paper develops an end-to-end framework to _learn_ uncertainty sets for conditional robust optimization in a way that is informed by the downstream decision-making loss, with robustness and calibration guarantees provided by conformal prediction. In addition, we propose to represent general convex uncertainty sets with partially input-convex neural networks, which are learned as part of our framework. Our approach consistently improves upon two-stage estimate-then-optimize baselines on concrete applications in energy storage arbitrage and portfolio optimization.
URL: https://openreview.net/forum?id=yM8qkT0f9H
---
Title: Are We Really Learning the Score Function? Reinterpreting Diffusion Models Through Wasserstein Gradient Flow Matching
Authors: An Vuong, Michael Thompson McCann, Javier E. Santos, Yen Ting Lin
Abstract: Diffusion models are commonly interpreted as learning the score function, i.e., the gradient of the log-density of noisy data. However, this learning target is a conservative vector field (i.e., a vector field that is the gradient of some function), a property not enforced by neural network architectures used in practice. We show numerically that trained diffusion networks violate both the integral and differential constraints that conservative vector fields must satisfy, indicating that the learned vector fields are not score functions of any density. Despite this, the models perform remarkably well as generative mechanisms. To explain this paradox, we propose a new theoretical perspective: diffusion training is better understood as \emph{flow matching} to the velocity field of a Wasserstein Gradient Flow (WGF), rather than as score learning for a reverse-time stochastic differential equation.
Under this view, the "probability flow" arises naturally from the WGF framework, eliminating the need to invoke reverse-time SDE theory and clarifying why generative sampling remains successful, even when the neural vector field is not a true score. We further show that non-conservative errors from neural approximation do not necessarily harm density transport. Our results advocate adopting the WGF perspective as a principled, elegant, and theoretically grounded framework for understanding diffusion generative models.
URL: https://openreview.net/forum?id=CzyJqXQRhJ
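Note: the differential constraint mentioned in the abstract above is that a conservative (gradient) vector field has a symmetric Jacobian. A minimal numerical sketch of that check follows. It uses two hand-written toy fields rather than a trained diffusion network, and all function names are hypothetical.

```python
def jacobian(field, x, eps=1e-5):
    """Central finite-difference Jacobian: J[i][j] = d field_i / d x_j."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = field(xp), field(xm)
        for i in range(n):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

def asymmetry(field, x):
    """Largest |J_ij - J_ji|; zero up to numerical error for a gradient field."""
    J = jacobian(field, x)
    n = len(x)
    return max(abs(J[i][j] - J[j][i]) for i in range(n) for j in range(n))

def gaussian_score(x):
    # Score of a standard Gaussian, grad log p(x) = -x: a conservative field.
    return [-xi for xi in x]

def rotated_field(x):
    # The same score plus a rotational term; not the gradient of any function.
    return [-x[0] - 0.5 * x[1], -x[1] + 0.5 * x[0]]

point = [0.3, -1.2]
print(asymmetry(gaussian_score, point))  # close to 0
print(asymmetry(rotated_field, point))   # close to 1
```

For a neural network the same test would apply to the Jacobian of the network output with respect to its input, typically obtained by automatic differentiation rather than finite differences.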
---
Title: GraphFM: A generalist graph transformer that learns transferable representations across diverse domains
Authors: Divyansha Lachi, Mehdi Azabou, Vinam Arora, Eva L Dyer
Abstract: Graph neural networks (GNNs) are often trained on individual datasets, requiring specialized models and significant hyperparameter tuning due to the unique structures and features of each dataset. This approach limits the scalability and generalizability of GNNs, as models must be tailored for each specific graph type. To address these challenges, we introduce GraphFM, a scalable multi-graph pretraining approach designed for learning across diverse graph datasets. GraphFM uses a Perceiver-based encoder with learned latent tokens to compress domain-specific features into a shared latent space, enabling generalization across graph domains. We propose new techniques for scaling up graph training on datasets of different sizes, allowing us to train GraphFM on 152 distinct graph datasets, containing a total of 7.4 million nodes and 189 million edges. This allows us to study the effect of scale on pretraining across domains such as molecules, citation networks, and product graphs, and show that training on diverse datasets improves performance over single-source pretraining. Additionally, pretraining with a mixture of synthetic and real graphs enhances adaptability and stability, leading to competitive performance with state-of-the-art models across various node classification tasks. This approach reduces the burden of dataset-specific training and provides a single generalist model capable of performing across multiple diverse graph structures and tasks. Code is available at https://github.com/nerdslab/GraphFM.
URL: https://openreview.net/forum?id=sZTpRfRUtR
---
Title: On the Hardness of Computing Counterfactual and Semi-factual Explanations in XAI
Authors: André Artelt, Martin Olsen, Kevin Tierney
Abstract: Providing clear explanations for the choices of machine learning models is essential for these models to be deployed in crucial applications. Counterfactual and semi-factual explanations have emerged as two mechanisms for providing users with insights into the outputs of their models. We provide an overview of the computational complexity results in the literature for generating these explanations, finding that in many cases, generating explanations is computationally hard. We strengthen the argument for this considerably by further contributing our own inapproximability results showing that not only are explanations often hard to generate, but under certain assumptions, they are also hard to approximate. We discuss the implications of these complexity results for the XAI community and for policymakers seeking to regulate explanations in AI.
URL: https://openreview.net/forum?id=aELzBw0q1O
---
Title: Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge
Authors: Dong-Sig Han, Jaein Kim, HEE BIN YOO, Byoung-Tak Zhang
Abstract: The Schrödinger bridge (SB) has evolved into a universal class of probabilistic generative models. In practice, however, estimated learning signals are innately uncertain, and the reliability promised by existing methods is often based on speculative optimal case scenarios. Recent studies regarding the Sinkhorn algorithm through mirror descent (MD) have gained attention, revealing geometric insights into solution acquisition of the SB problems. In this paper, we propose a variational online MD (OMD) framework for the SB problems, which provides further stability to SB solvers. We formally prove convergence and a regret bound for the novel OMD formulation of SB acquisition. As a result, we propose a simulation-free SB algorithm called Variational Mirrored Schrödinger Bridge (VMSB) by utilizing the Wasserstein-Fisher-Rao geometry of the Gaussian mixture parameterization for Schrödinger potentials. Based on the Wasserstein gradient flow theory, the algorithm offers tractable learning dynamics that precisely approximate each OMD step. In experiments, we validate the performance of the proposed VMSB algorithm across an extensive suite of benchmarks. VMSB consistently outperforms contemporary SB solvers on a wide range of SB problems, demonstrating the robustness as well as generality predicted by our OMD theory.
URL: https://openreview.net/forum?id=g3SsM9FLpm
---
Title: Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors
Authors: Haodong Lu, Xinyu Zhang, Kristen Moore, Jason Xue, Lina Yao, Anton van den Hengel, Dong Gong
Abstract: Continual learning (CL) enables deep neural networks to acquire new knowledge over time while mitigating catastrophic forgetting of previously learned information. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, further bridging the gap between PTMs and continual adaptation. Leveraging its multi-modal visual and textual representations, CLIP offers a natural paradigm for CL, where new tasks can be accommodated by incrementally learning lightweight parameters, particularly prompts. However, existing prompt-based CL methods for PTMs often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incrementation processes. While these approaches improve performance, they frequently introduce additional (and possibly unnecessary) complexity, underutilizing CLIP's intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors to guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we activate the language branch and extend our approach to jointly optimize both visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting.
Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.
URL: https://openreview.net/forum?id=YJnjkzKq5Y
---
Title: STLDM: Spatio-Temporal Latent Diffusion Model for Precipitation Nowcasting
Authors: Shi Quan Foo, Chi-Ho Wong, Zhihan Gao, Dit-Yan Yeung, Ka-Hing Wong, Wai-Kin Wong
Abstract: Precipitation nowcasting is a critical spatio-temporal prediction task for society to prevent severe damage owing to extreme weather events. Despite the advances in this field, the complex and stochastic nature of this task still poses challenges to existing approaches. Specifically, deterministic models tend to produce blurry predictions while generative models often struggle with poor accuracy. In this paper, we present a simple yet effective model architecture termed STLDM, a diffusion-based model that learns the latent representation end to end alongside both the Variational Autoencoder and the conditioning network. STLDM decomposes this task into two stages: a deterministic forecasting stage handled by the conditioning network, and an enhancement stage performed by the latent diffusion model. Experimental results on multiple radar datasets demonstrate that STLDM achieves superior performance compared to the state of the art, while also improving inference efficiency. The code is available at https://github.com/sqfoo/stldm_official.
URL: https://openreview.net/forum?id=f4oJwXn3qg
---
Title: Demystifying amortized causal discovery with transformers
Authors: Francesco Montagna, Max Cairney-Leeming, Dhanya Sridhar, Francesco Locatello
Abstract: Supervised learning for causal discovery from observational data often achieves competitive performance despite seemingly avoiding the explicit assumptions that traditional methods require for identifiability. In this work, we analyze CSIvA (Ke et al., 2023), a transformer architecture for amortized inference that promises to train on synthetic data and transfer to real data, on bivariate causal models. First, we bridge the gap with identifiability theory, showing that the training distribution implicitly defines a prior on the causal model of the test observations: consistent with classical approaches, good performance is achieved when we have a good prior on the test data, and the underlying model is identifiable. Second, we find that CSIvA cannot generalize to classes of causal models unseen during training: to overcome this limitation, we theoretically and empirically analyze \textit{when} training CSIvA on datasets generated by multiple identifiable causal models with different structural assumptions improves its generalization at test time. Overall, we find that amortized causal discovery still adheres to identifiability theory, violating the previous hypothesis from Lopez-Paz et al. (2015) that supervised learning methods could overcome its restrictions.
URL: https://openreview.net/forum?id=9Lgy7IGSfp
---
Title: TimeAutoDiff: A Unified Framework for Generation, Imputation, Forecasting, and Time-Varying Metadata Conditioning of Heterogeneous Time Series Tabular Data
Authors: Namjoon Suh, Yuning Yang, Din-Yin Hsieh, Qitong Luan, Shirong Xu, Shixiang Zhu, Guang Cheng
Abstract: We present \texttt{TimeAutoDiff}, a unified latent-diffusion framework that addresses four fundamental time-series tasks (unconditional generation, missing-data imputation, forecasting, and time-varying-metadata conditional generation) within a single model that natively handles heterogeneous features (continuous, binary, and categorical). We unify these tasks through a simple masked-modeling strategy: a binary mask specifies which time-feature cells are observed and which must be generated. To make this work on mixed data types, we pair a lightweight variational autoencoder (VAE), which maps continuous, categorical, and binary variables into a continuous latent sequence, with a diffusion model that learns dynamics in that latent space, avoiding separate likelihoods for each data type while still capturing temporal and cross-feature structure. Two design choices give \texttt{TimeAutoDiff} clear speed and scalability advantages. First, the diffusion process samples a single latent trajectory for the full time horizon rather than denoising one timestep at a time; this whole-sequence sampling drastically reduces reverse-diffusion calls and yields an order-of-magnitude throughput gain. Second, the VAE compresses along the feature axis, so very wide tables are modeled in a lower-dimensional latent space, further reducing computational load. Empirical evaluation demonstrates that \texttt{TimeAutoDiff} matches or surpasses strong baselines in synthetic sequence fidelity (discriminative, temporal-correlation, and predictive metrics) and consistently lowers MAE/MSE for imputation and forecasting tasks. Time-varying metadata conditioning unlocks real-world scenario exploration: by editing metadata sequences, practitioners can generate coherent families of counterfactual trajectories that track intended directional changes, preserve cross-feature dependencies, and remain conditionally calibrated, making "what-if" analysis practical.
Our ablation studies confirm that performance is impacted by key architectural choices, such as the VAE's continuous feature encoding and specific components of the DDPM denoiser. Furthermore, a distance-to-closest-record (DCR) audit demonstrates that the model achieves generalization with limited memorization given a sufficiently large dataset. Code implementations of \texttt{TimeAutoDiff} are provided at https://github.com/namjoonsuh/TimeAutoDiff.
URL: https://openreview.net/forum?id=bkUd1Dg46c
---
Title: Overcoming the Stability Gap in Continual Learning
Authors: Md Yousuf Harun, Christopher Kanan
Abstract: Pre-trained deep neural networks (DNNs) are being widely deployed by industry for making business decisions and to serve users; however, a major problem is model decay, where the DNN's predictions become more erroneous over time, resulting in revenue loss or unhappy users. To mitigate model decay, DNNs are retrained from scratch using old and new data. This is computationally expensive, so retraining happens only once performance significantly decreases. Here, we study how continual learning (CL) could potentially overcome model decay in large pre-trained DNNs and greatly reduce computational costs for keeping DNNs up-to-date. We identify the "stability gap" as a major obstacle in our setting. The stability gap refers to a phenomenon where learning new data causes large drops in performance for past tasks before CL mitigation methods eventually compensate for this drop. We test two hypotheses to investigate the factors influencing the stability gap and identify a method that vastly reduces this gap. In large-scale experiments for both easy and hard CL distributions (e.g., class incremental learning), we demonstrate that our method reduces the stability gap and greatly increases computational efficiency. Our work aligns CL with the goals of the production setting, where CL is needed for many applications.
URL: https://openreview.net/forum?id=o2wEfwUOma
---
Title: Diffusion Self-Weighted Guidance for Offline Reinforcement Learning
Authors: Augusto Tagle, Javier Ruiz-del-solar, Felipe Tobar
Abstract: Offline reinforcement learning (RL) recovers the optimal policy $\pi$ given historical observations of an agent. In practice, $\pi$ is modeled as a weighted version of the agent's behavior policy $\mu$, using a weight function $w$ working as a critic of the agent's behavior. Although recent approaches to offline RL based on diffusion models (DM) have exhibited promising results, they require training a separate guidance network to compute the required scores, which is challenging due to their dependence on the unknown $w$. In this work, we construct a diffusion model over both the actions and the weights, to explore a more streamlined DM-based approach to offline RL. With the proposed setting, the required scores are directly obtained from the diffusion model without learning additional networks. Our main conceptual contribution is a novel exact guidance method, where guidance comes from the same diffusion model; therefore, our proposal is termed Self-Weighted Guidance (SWG). Through an experimental proof of concept for SWG, we show that the proposed method i) generates samples from the desired distribution on toy examples, ii) performs competitively against state-of-the-art methods on D4RL when using resampling, and iii) exhibits robustness and scalability via ablation studies.
URL: https://openreview.net/forum?id=jmXBnpmznv
---
Title: Language Models for Controllable DNA Sequence Design
Authors: Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji
Abstract: We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design.
URL: https://openreview.net/forum?id=itwnoEu60S
---
Title: LOGLO-FNO: Efficient Learning of Local and Global Features in Fourier Neural Operators
Authors: Marimuthu Kalimuthu, David Holzmüller, Mathias Niepert
Abstract: Modeling high-frequency information is a critical challenge in scientific machine learning. For instance, fully turbulent flow simulations of the Navier-Stokes equations at Reynolds numbers 3500 and above can generate high-frequency signals due to swirling fluid motions caused by eddies and vortices. Faithfully modeling such signals using neural networks depends on the accurate reconstruction of moderate to high frequencies. However, it has been well known that deep neural nets exhibit the so-called spectral or frequency bias towards learning low-frequency components. Meanwhile, Fourier Neural Operators (FNOs) have emerged as a popular class of data-driven models in recent years for solving Partial Differential Equations (PDEs) and for surrogate modeling in general. Although impressive results have been achieved on several PDE benchmark problems, FNOs often perform poorly in learning non-dominant frequencies characterized by local features. This limitation stems from the spectral bias inherent in neural networks and the explicit exclusion of high-frequency modes in FNOs and their variants. Therefore, to mitigate these issues and improve FNO's spectral learning capabilities to represent a broad range of frequency components, we propose two key architectural enhancements: (i) a parallel branch performing local spectral convolutions and (ii) a high-frequency propagation module. Moreover, we propose a novel frequency-sensitive loss term based on radially binned spectral errors. This introduction of a parallel branch for local convolutions reduces the number of trainable parameters by up to 50% while achieving the accuracy of the baseline FNO that relies solely on global convolutions. Moreover, our findings demonstrate that the proposed model improves the stability over longer rollouts. 
Experiments on six challenging PDE problems in fluid mechanics, wave propagation, and biological pattern formation, and the qualitative and spectral analysis of predictions, show the effectiveness of our method over the state-of-the-art neural operator families of baselines.
URL: https://openreview.net/forum?id=MQ1dRdHTpi
---
Title: LO-BCQ: Locally Optimal Block Clustered Quantization for 4-bit (W4A4) LLM Inference
Authors: Reena Elangovan, Charbel Sakr, Anand Raghunathan, Brucek Khailany
Abstract: Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-$8$-bits while maintaining activations at $8$-bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ) wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded to W4A4 format (with $0.5$-bits of overhead for storing scaling factors and codebook selectors), we advance the current state-of-the-art by demonstrating $<1$\% loss in inference accuracy across several LLMs and downstream tasks.
URL: https://openreview.net/forum?id=loWISTqGwW
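Note: the alternation between block clustering and codebook design described in the abstract above can be sketched in a few lines. This is an illustrative toy, not the authors' LO-BCQ algorithm. It assumes blocks are assigned to whichever codebook quantizes them with the least squared error (the abstract says only that blocks are clustered by their statistics), it omits scaling factors, and it uses 4-level codebooks for brevity where a 4-bit format would use 16 levels. All names are hypothetical.

```python
import random

def nearest(x, codebook):
    """Index of the codebook level closest to scalar x."""
    return min(range(len(codebook)), key=lambda j: (x - codebook[j]) ** 2)

def block_error(block, codebook):
    """Squared quantization error of one block under one codebook."""
    return sum((x - codebook[nearest(x, codebook)]) ** 2 for x in block)

def assign_blocks(blocks, codebooks):
    """Clustering step: send each block to the codebook that quantizes it best."""
    return [min(range(len(codebooks)), key=lambda k: block_error(b, codebooks[k]))
            for b in blocks]

def update_codebooks(blocks, assignment, codebooks):
    """Codebook step: one Lloyd update per codebook on the scalars of its blocks."""
    updated = []
    for k, codebook in enumerate(codebooks):
        sums = [0.0] * len(codebook)
        counts = [0] * len(codebook)
        for block, a in zip(blocks, assignment):
            if a != k:
                continue
            for x in block:
                j = nearest(x, codebook)
                sums[j] += x
                counts[j] += 1
        # Move each level to the mean of its scalars; keep unused levels as they are.
        updated.append([s / n if n else c for s, n, c in zip(sums, counts, codebook)])
    return updated

def fit(blocks, codebooks, iterations=5):
    """Alternate the two steps and record the total squared error after each round."""
    history = []
    for _ in range(iterations):
        assignment = assign_blocks(blocks, codebooks)
        codebooks = update_codebooks(blocks, assignment, codebooks)
        history.append(sum(block_error(b, codebooks[a])
                           for b, a in zip(blocks, assignment)))
    return codebooks, assignment, history

# Synthetic data: 20 narrow-range blocks and 20 wide-range blocks of 8 scalars each.
rng = random.Random(0)
blocks = ([[rng.gauss(0.0, 0.2) for _ in range(8)] for _ in range(20)]
          + [[rng.gauss(0.0, 2.0) for _ in range(8)] for _ in range(20)])
initial = [[-1.0, -0.3, 0.3, 1.0], [-3.0, -1.0, 1.0, 3.0]]
codebooks, assignment, history = fit(blocks, initial)
print(history)
```

Neither step can increase the total squared error, so the recorded error is non-increasing across rounds. Like any greedy alternation, it reaches a local optimum that depends on the initial codebooks.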
---
Title: Fourier Learning Machines: Nonharmonic Fourier-Based Neural Networks for Scientific Machine Learning
Authors: Mominul Rubel, Adam Meyers, Gabriel Nicolosi
Abstract: We introduce the Fourier Learning Machine (FLM), a neural network (NN) architecture designed to represent a multidimensional nonharmonic Fourier series. The FLM uses a simple feedforward structure with cosine activation functions to learn the frequencies, amplitudes, and phase shifts of the series as trainable parameters. This design allows the model to create a problem--specific spectral basis adaptable to both periodic and nonperiodic functions. Unlike previous Fourier--inspired NN models, the FLM is the first architecture able to represent a multidimensional Fourier series with a complete set of basis functions in separable form, doing so by using a standard Multilayer Perceptron--like architecture. A one--to--one correspondence between the Fourier coefficients and amplitudes and phase-shifts is demonstrated, allowing for the translation between a full, separable basis form and the cosine phase--shifted one. Additionally, we evaluate the performance of FLMs on several scientific computing problems, including benchmark Partial Differential Equations (PDEs) and a family of Optimal Control Problems (OCPs). Computational experiments show that the performance of FLMs is comparable, and often superior, to that of established architectures like SIREN and vanilla feedforward NNs.
URL: https://openreview.net/forum?id=LPKt5vd7yz
---
Title: Robust Reinforcement Learning in a Sample-Efficient Setting
Authors: Siemen Herremans, Ali Anwar, Siegfried Mercelis
Abstract: The performance of reinforcement learning (RL) in real-world applications can be hindered by the absence of robustness and safety in the learned policies. More specifically, an RL agent that trains in a certain Markov decision process (MDP) often struggles to perform well in MDPs that slightly deviate. To address this issue, we employ the framework of Robust MDPs (RMDPs) in a model-based setting and introduce a second learned transition model. Our method specifically incorporates an auxiliary pessimistic model, updated adversarially, to estimate the worst-case MDP within a Kullback-Leibler uncertainty set. In comparison to several existing works, our method does not impose any additional conditions on the training environment, such as the need for a parametric simulator. To test the effectiveness of the proposed pessimistic model in enhancing policy robustness, we integrate it into a practical RL algorithm, called Robust Model-Based Policy Optimization (RMBPO). Our experimental results indicate a notable improvement in policy robustness on high-dimensional control tasks, with the auxiliary model enhancing the performance of the learned policy in distorted MDPs, while maintaining the data-efficiency of the base algorithm. Our methodology is also compared against various other robust RL approaches. We further examine how pessimism is achieved by exploring the learned deviation between the proposed auxiliary world model and the nominal model. By introducing a pessimistic world model and demonstrating its role in improving policy robustness, our research presents a general methodology for robust RL in a model-based setting.
URL: https://openreview.net/forum?id=iij6nLYLjF
---
Title: Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes
Authors: David Mark Bossens, Atsushi Nitanda
Abstract: Safety is an essential requirement for reinforcement learning systems. The newly emerging framework of robust constrained Markov decision processes allows learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty. This paper presents mirror descent policy optimisation for robust constrained Markov decision processes, making use of policy gradient techniques to optimise both the policy (as a maximiser) and the transition kernel (as an adversarial minimiser) on the Lagrangian representing a constrained Markov decision process. Our proposed algorithm obtains an $\tilde{\mathcal{O}}\left(1/T^{1/3}\right)$ convergence rate in the sample-based robust constrained Markov decision process setting. The paper also contributes an algorithm for approximate gradient descent in the space of transition kernels, which is of independent interest for designing adversarial environments in general Markov decision processes. Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms.
URL: https://openreview.net/forum?id=tmfdqtFUqO
---
Title: Tractable Representation Learning with Probabilistic Circuits
Authors: Steven Braun, Sahil Sidheekh, Antonio Vergari, Martin Mundt, Sriraam Natarajan, Kristian Kersting
Abstract: Probabilistic circuits (PCs) are powerful probabilistic models that enable exact and tractable inference, making them highly suitable for probabilistic reasoning and inference tasks. While dominant in neural networks, representation learning with PCs remains underexplored, with prior approaches relying on external neural embeddings or activation-based encodings. To address this gap, we introduce autoencoding probabilistic circuits (APCs), a novel framework leveraging the tractability of PCs to model probabilistic embeddings explicitly. APCs extend PCs by jointly modeling data and embeddings, obtaining embedding representations through tractable probabilistic inference. The PC encoder allows the framework to natively handle arbitrary missing data and is seamlessly integrated with a neural decoder in a hybrid, end-to-end trainable architecture enabled by differentiable sampling. Our empirical evaluation demonstrates that APCs outperform existing PC-based autoencoding methods in reconstruction quality, generate embeddings competitive with those of neural autoencoders, and handle missing data more robustly than neural autoencoders. These results highlight APCs as a powerful and flexible representation learning method that exploits the probabilistic inference capabilities of PCs, showing promising directions for robust inference, out-of-distribution detection, and knowledge distillation.
URL: https://openreview.net/forum?id=h8D75pVKja
---
Title: Group-robust Machine Unlearning
Authors: Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
Abstract: Machine unlearning is an emerging paradigm to remove the influence of specific training data (i.e., the forget set) from a model while preserving its knowledge of the rest of the data (i.e., the retain set). Previous approaches assume the forget data to be uniformly distributed from all training datapoints. However, if the data to unlearn is dominant in one group (e.g., ethnicity, gender), we empirically show that performance for this group degrades, leading to fairness issues. To perform unlearning while preserving fairness, this work addresses the overlooked problem of non-uniformly distributed forget sets, which we refer to as group-robust machine unlearning. We formalize the problem and present a simple and effective exact unlearning strategy that mitigates the performance loss in dominant groups via sample distribution reweighting. Moreover, we present MIU (Mutual Information-aware Machine Unlearning), the first approach for group robustness in approximate machine unlearning. MIU minimizes the mutual information between model features and group information, achieving unlearning while reducing performance degradation in the dominant group of the forget set. Additionally, MIU exploits sample distribution reweighting and mutual information calibration with the original model to preserve group robustness. We conduct experiments on three datasets and show that MIU outperforms standard methods, achieving unlearning without compromising model robustness. Source code available at https://github.com/tdemin16/group-robust_machine_unlearning
URL: https://openreview.net/forum?id=StSq7mpUVw
---
Title: Inverse Scaling in Test-Time Compute
Authors: Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, Ethan Perez
Abstract: We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.
URL: https://openreview.net/forum?id=NXgyHW1c7M
---
Title: TicketLLM: Next-Generation Sparse and Low-bit Transformers with Supermask-based Method
Authors: Yasuyuki Okoshi, Hikari Otsuka, Daichi Fujiki, Masato Motomura
Abstract: Strong Lottery Tickets (SLTs) are subnetworks within a randomly weighted network uncovered by a binary mask called supermask. They offer a promising approach to model compression by eliminating the need to store weights since their effective subnetwork can be regenerated from a fixed random seed and the supermask. However, extending this approach to large language models (LLMs) is non-trivial due to limited scalability and inefficient training dynamics of existing SLT methods. To address these challenges, we propose Adaptive Supermask (Ada-Sup), a scalable and efficient method for discovering high-quality multi-bit supermasks through an innovative quantization-based approach. Building on this method, we introduce TicketLLM, a low-bit and sparse Transformer-based LLM architecture powered by Ada-Sup. Experimental results show that Ada-Sup can discover high-quality supermasks with significantly reduced training costs compared to previous methods in both binary and multi-bit settings. Furthermore, TicketLLM outperforms BitNet b1.58 on a 1.3B parameter model with the same memory per connection, achieving 0.6% reduction in perplexity (from 13.62 to 13.54) while operating at a higher sparsity level (around 50% vs. around 33%). These results highlight the potential of supermask-based methods as a promising approach for building lightweight LLMs.
URL: https://openreview.net/forum?id=sE69HKykQw
---
Title: Accumulator-Aware Post-Training Quantization for Large Language Models
Authors: Ian Colbert, Giuseppe Franco, Fabian Grob, Jinjie Zhang, Rayan Saab
Abstract: When quantizing weights and activations to increasingly narrower representations, the cost of additions begins to dominate that of multiplications in multiply-accumulate (MAC) units. Recent studies show that reducing addition costs via low-precision accumulation improves throughput, power, and area across inference platforms, albeit with an increased risk of overflow. Accumulator-aware quantization research has so far only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models and datasets continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To bridge this gap, we introduce AXE—the first accumulator-aware quantization framework explicitly designed to endow overflow avoidance guarantees to PTQ algorithms. We present theoretical motivation for AXE and demonstrate its flexibility by implementing it on top of two existing algorithms: GPFQ and OPTQ. We design AXE to support multi-stage accumulation, opening the door to full datapath optimization for the first time. We evaluate AXE using recent language generation models; when quantizing Llama3 8B for a 16-bit multi-stage accumulation datapath, AXE maintains up to 98% of the FP16 perplexity, surpassing naïve bit width manipulation by up to 15%.
URL: https://openreview.net/forum?id=p6l0579yj7
---
Title: A Pattern Language for Machine Learning Tasks
Authors: Benjamin Rodatz, Ian Fan, Tuomas Laakkonen, Neil John Ortega, Thomas Hoffmann, Vincent Wang
Abstract: We formalise the essential data of objective functions as equality constraints on composites of learners. We call these constraints ``tasks'', and we investigate the idealised view that such tasks determine model behaviours. We develop a flowchart-like graphical mathematics for tasks that allows us to: offer a unified perspective of approaches in machine learning across domains; design and optimise desired behaviours model-agnostically; and import insights from theoretical computer science into practical machine learning.
As preliminary experimental validation of our theoretical framework, we exhibit and implement a novel ``manipulation'' task that minimally edits input data to have a desired attribute. Our model-agnostic approach achieves this end-to-end, and without the need for custom architectures, adversarial training, random sampling, or interventions on the data, hence enabling capable, small-scale, and training-stable models.
URL: https://openreview.net/forum?id=IOianP0UHC
---
New submissions
===============
Title: You Only Train Once: Differentiable Subset Selection for Omics Data
Abstract: Selecting compact and informative gene subsets from single-cell transcriptomic data is essential for biomarker discovery, improving interpretability, and cost-effective profiling. However, most existing feature selection approaches either operate as multi-stage pipelines or rely on post hoc feature attribution, making selection and prediction weakly coupled. In this work, we present YOTO (you only train once), an end-to-end framework that jointly identifies discrete gene subsets and performs prediction within a single differentiable architecture. In our model, the prediction task directly guides which genes are selected, while the learned subsets, in turn, shape the predictive representation. This closed feedback loop enables the model to iteratively refine both what it selects and how it predicts during training. Unlike existing approaches, YOTO enforces sparsity so that only the selected genes contribute to inference, eliminating the need to train additional downstream classifiers. Through a multi-task learning design, the model learns shared representations across related objectives, allowing partially labeled datasets to inform one another, and discovering gene subsets that generalize across tasks without additional training steps.
We evaluate YOTO on two representative single-cell RNA-seq datasets, showing that it consistently outperforms state-of-the-art baselines. These results demonstrate that sparse, end-to-end, multi-task gene subset selection improves predictive performance and yields compact and meaningful gene subsets, advancing biomarker discovery and single-cell analysis.
URL: https://openreview.net/forum?id=xQiXlADW5v
---
Title: PSAG: Projection-based Stabilized Attribution Guidance for Online Continual Learning
Abstract: Online Continual Learning (OCL) aims to incrementally learn from non-stationary data streams in a one-pass setting, facing the dual challenges of catastrophic forgetting and insufficient training. These challenges intensify the stability-plasticity dilemma, where preserving old knowledge conflicts with acquiring new information. In this paper, we propose Projection-based Stabilized Attribution Guidance (PSAG), a modular framework that leverages gradient-based attributions as active guidance signals to selectively preserve task-relevant representations. Our framework consists of three complementary mechanisms: (1) Attribution-Guided Feature Modulation (AGFM) that anchors critical features in the representation space; (2) Importance-Aware Loss Reweighting (IALR) that prioritizes informative samples at the loss level; and (3) Manifold-Consistent Projection (MCP) that emphasizes critical feature dimensions within a Riemannian metric space. To address the issue of attribution instability in online continual learning, we introduce a Reliable Reference Model (R-Model) that maintains consistent knowledge through exponential moving average updates. This design prevents feedback loops during attribution computation and enables reliable feature importance estimation. Extensive experiments on Split CIFAR-10, Split CIFAR-100, and Split Mini-ImageNet demonstrate that PSAG achieves consistent improvements over strong baselines, confirming the efficacy of stabilized attribution guidance in resolving the stability-plasticity dilemma.
URL: https://openreview.net/forum?id=NvXpSvMrXS
---
Title: Bridging Graph Neural Networks and Large Language Models: A Survey and Unified Perspective
Abstract: Decoder-Transformers have achieved remarkable success and have laid the groundwork for the development of Large Language Models (LLMs). At the core of these models is the self-attention matrix, which allows different tokens to interact with each other. This process is remarkably similar to the message-passing mechanism used in Graph Neural Networks (GNNs), and as such decoder-Transformers suffer many of the optimization difficulties studied extensively in the GNN literature. In this paper, we present a unified graph perspective that bridges the theoretical understanding of decoder-Transformers and GNNs. We systematically examine how well-known phenomena in GNNs, such as over-smoothing and over-squashing, directly manifest as analogous issues like rank collapse and representational collapse in deep Transformer architectures. By interpreting Transformers' self-attention as a learned adjacency operator, we reveal shared underlying principles governing signal propagation and demonstrate how insights from one field can illuminate challenges and solutions in the other. We analyze the role of architectural components like residual connections, normalization, and causal masking in these issues. We aim to provide a framework for understanding how information flows through deep learning models that perform sequence mixing through an adjacency operator, and to highlight areas for cross-pollination of research, as well as to provide a comprehensive reference for researchers interested in the underpinnings of these architectures.
URL: https://openreview.net/forum?id=H9zhC5pVnH
---
Title: LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed
Abstract: World models have emerged as a key module in decision making, with MuZero and Dreamer achieving remarkable success in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world model is either evaluated as a general world simulator, or as a functional module of the agent, i.e., predicting the transitions to assist the planning. This paper argues that LLM-based world models can make decisions solely, but rigorous evaluations are needed. We first present two key observations to showcase how LLM-based world models can make decisions solely, and then present three key observations to demonstrate why the current evaluation framework for LLM-based world models is not sufficient. Then, we present our suggested evaluation framework: policy verification, action proposal, and policy planning, where the world model is used for decision making solely, and finally we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate the rule-based policy of each environment for diverse evaluations. The key findings include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially for tasks that require domain knowledge, e.g., scientific tasks, ii) the performance of the LLM-based world models depends predominantly on their performance in key steps, while the total number of steps required for task completion is not a reliable indicator of task difficulty, and iii) the combination of world models’ functionalities for decision making introduces performance instability and partially obscures the performance gap between strong and weak models.
URL: https://openreview.net/forum?id=XmYCERErcD
---
Title: Structure-Augmented Reasoning Generation
Abstract: Recent advances in Large Language Models (LLMs) have significantly improved complex reasoning capabilities. Retrieval-Augmented Generation (RAG) has further extended these capabilities by grounding generation in dynamically retrieved evidence, enabling access to information beyond the model's training parameters. However, while RAG addresses knowledge availability, standard pipelines treat retrieved documents as independent, unstructured text chunks, forcing models to implicitly connect information across fragmented context. This limitation becomes critical for multi-hop queries, where answering correctly requires synthesizing information scattered across different documents. We present Structure-Augmented Reasoning Generation (SARG), a post-retrieval framework that addresses this gap by materializing explicit reasoning structures from retrieved context. SARG operates in three stages: extracting relational triples from retrieved documents via few-shot prompting, organizing these triples into a domain-adaptive knowledge graph, and performing multi-hop traversal to identify relevant reasoning chains. These chains, along with their associated text chunks, are then integrated into the generation prompt to explicitly guide the model's reasoning process. Importantly, SARG doesn't require custom retrievers or domain-specific fine-tuning. Instead, it functions as a modular layer compatible with all existing RAG pipelines. Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning coherence. Furthermore, by surfacing the exact traversal paths used during generation, SARG provides fully traceable and interpretable inference.
URL: https://openreview.net/forum?id=mXwuAhCj1z
---
Title: Reference-Guided Identity Preserving Face Restoration
Abstract: Preserving face identity is a critical yet persistent challenge in diffusion-based image restoration. While reference faces offer a path forward, existing methods typically suffer from partial reference information and inefficient identity losses. This paper introduces a novel approach that directly solves both issues, involving three key contributions: 1) Composite Context, a representation that fuses high- and low-level facial information to provide more comprehensive guidance than traditional singular representations, 2) Hard Example Identity Loss, a novel loss function that uses the reference face to address the identity learning inefficiencies of the standard identity loss, 3) Training-free multi-reference inference, a new method that leverages multiple references for restoration, despite being trained with only a single reference. The proposed method demonstrably restores high-quality faces and achieves state-of-the-art identity preserving restoration on benchmarks such as FFHQ-Ref and CelebA-Ref-Test, consistently outperforming previous work.
URL: https://openreview.net/forum?id=g9YzUDnUUS
---
Title: LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking
Abstract: Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems. A common approach to sorting the ranking list is to prompt LLMs for pairwise or setwise comparisons, which often relies on sorting algorithms. However, sorting-based methods require consistent comparisons to sort the passages correctly, a requirement that we show LLMs often violate. We identify two kinds of intrinsic inconsistency in LLM-based pairwise comparisons: order inconsistency which leads to conflicting results when switching the passage order, and transitive inconsistency which leads to non-transitive triads among all preference pairs. Our study of these inconsistencies is relevant for understanding and improving the stability of any ranking scheme based on relative preferences. In this paper, we propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list. LLM-RankFusion mitigates order inconsistency using in-context learning (ICL) to demonstrate order-agnostic comparisons and calibration to estimate the underlying preference probability between two passages. We then address transitive inconsistency by aggregating the ranking results from multiple rankers. In our experiments, we empirically show that LLM-RankFusion can significantly reduce inconsistent comparison results, improving the ranking quality by making the final ranking list more robust.
URL: https://openreview.net/forum?id=VUY0j74Yes
---
Title: Adaptive Conformal Prediction for Quantum Machine Learning
Abstract: Quantum machine learning seeks to leverage quantum computers to improve upon classical machine learning algorithms. Currently, robust uncertainty quantification methods remain underdeveloped in the quantum domain, despite the critical need for reliable and trustworthy predictions. Recent work has introduced quantum conformal prediction, a framework that produces prediction sets that are guaranteed to contain the true outcome with user-specified probability. In this work, we formalise how the time-varying noise inherent in quantum processors can undermine conformal guarantees, even when calibration and test data are exchangeable. To address this challenge, we draw on Adaptive Conformal Inference, a method which maintains validity over time via repeated recalibration. We introduce Adaptive Quantum Conformal Prediction (AQCP), an algorithm which preserves asymptotic average coverage guarantees under arbitrary hardware noise conditions. Empirical studies on an IBM quantum processor demonstrate that AQCP achieves target coverage levels and exhibits greater stability than quantum conformal prediction.
URL: https://openreview.net/forum?id=ShkPB9OeEW
---
Title: SHEP: Spatial Heterogeneity–Driven Experience Prioritization in Scalable Multi-Agent Reinforcement Learning
Abstract: Scalable Multi-Agent Reinforcement Learning (MARL) faces severe challenges regarding the exponential explosion of joint state-action space dimensionality and the difficulty of global coordination as the number of agents increases. Traditional methods optimize fine-grained individual strategies within an exponentially vast state space, leading to low sample efficiency and training bottlenecks in large-scale scenarios. To address these issues, this paper proposes SHEP (Spatial Heterogeneity–Driven Experience Prioritization), a mesoscopic guidance framework designed for large-scale group coordination. SHEP utilizes Occupancy Entropy, Action Diversity Entropy, and Moran's I to construct a set of topological feature descriptors, mapping the high-dimensional individual state space into a low-dimensional, interpretable group feature space. Building on this, we design heterogeneity-driven prioritized experience replay and Group Hindsight Experience Replay (Group-HER). By identifying critical moments of abrupt spatial heterogeneity changes or highly structured clustering, these mechanisms accurately screen for high-value samples and perform ``dimensionality reduction pruning'' on the ineffective exploration space, significantly improving sample efficiency. Due to the universality of its experience screening mechanism, SHEP can be seamlessly integrated as a ``plug-in'' into mainstream centralized training algorithms like MAPPO without altering their underlying policy optimization objectives. In MAgent environments and SMAC benchmarks, SHEP demonstrates superior performance, with convergence speed and final win rates significantly outperforming baseline methods such as QMIX and Mean-Field approaches. These results robustly validate that introducing explicit spatial heterogeneity features to guide experience prioritization is an effective paradigm for resolving the curse of dimensionality in scalable MARL.
URL: https://openreview.net/forum?id=b6WUL2GH1w
---
Title: A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation
Abstract: Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like ``garlic bread'' vs. ``hot dog'' for food, or ``highway'' vs. ``dam'' for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like ``cracked'' mud, ``porous'' sponge, ``veined'' leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
URL: https://openreview.net/forum?id=4MuLx2YDmi
---
Title: SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models
Abstract: Modeling long-sequence medical time series data, such as electrocardiograms (ECG), poses significant challenges due to high sampling rates, multichannel signal complexity, inherent noise, and limited labeled data. While recent self-supervised learning (SSL) methods, based on various encoder architectures such as convolutional neural networks, have been proposed to learn representations from unlabeled data, they often fall short in capturing long-range dependencies and noise-invariant features. Structured state space models (S4) have recently shown promise for efficient long-sequence modeling; however, existing S4-based architectures are not designed to capture the unique characteristics of multichannel physiological waveforms. In this work, we propose SL-S4Wave, a self-supervised learning framework that combines contrastive learning with a tailored encoder built on structured state space models. The encoder incorporates multi-layer global convolution using multiscale subkernels, enabling the capture of both fine-grained local patterns and long-range temporal dependencies in noisy, high-resolution multichannel waveforms. Extensive experiments on three real-world datasets demonstrate that SL-S4Wave (1) consistently outperforms state-of-the-art supervised and self-supervised baselines in a challenging arrhythmia detection task, (2) achieves high performance with significantly fewer labeled examples, showcasing strong label efficiency, (3) maintains robust performance on long waveform segments, highlighting its capacity to model complex temporal dynamics in long sequences that most existing approaches fail to efficiently model, and (4) transfers effectively to unseen arrhythmia types, underscoring its robust cross-domain generalization.
URL: https://openreview.net/forum?id=km0xS3jZeO
---
Title: Rethinking Coreset Selection: The Surprising Effectiveness of Soft Labels
Abstract: Data-efficient deep learning is an emerging and powerful branch of deep learning that focuses on minimizing the amount of labeled data required for training. Coreset selection is one such method, where the goal is to select a representative subset from the original dataset, which can achieve comparable generalization performance at a much lower computation and disk space overhead. Dataset Distillation (DD), another branch of data-efficient deep learning, achieves this goal through distilling a small synthetic dataset from the original dataset. While DD works exploit soft labels (probabilistic target labels instead of traditional one-hot labels), which have yielded significant improvement over hard labels, to the best of our knowledge, no such study exists for coreset selection. In this work, for the first time, we study the impact of soft labels on generalization accuracy for the image classification task for various coreset selection algorithms. While soft labels improve the performance of all the methods, surprisingly, random selection with soft labels performs on par with or better than existing coreset selection approaches. Our findings suggest that future coreset algorithms should benchmark against random selection with soft labels as an important baseline.
URL: https://openreview.net/forum?id=Ll78kAR1lj
---
Title: Debiasing Diffusion Models via Score Guidance
Abstract: With the increasing use of Diffusion Models (DMs) in everyday applications, it is very important to ensure that these models are fair towards various demographic/societal groups. However, for several reasons, DMs inherit biases towards specific genders, races, and communities, which can perpetuate and amplify societal inequities. Hence, it is important to debias DMs. Previous debiasing approaches require additional reference data, model fine-tuning, or auxiliary classifier training, each of which incurs additional cost. In this work, we provide a training-free inference-time method for debiasing diffusion models. First, we provide a theoretical explanation for the cause of biases exhibited by DMs. Specifically, we show that the unconditional score predicted by the denoiser can be expressed as a convex combination of conditional scores corresponding to the attributes under consideration. We then argue that the weights allocated to underrepresented attributes are smaller, which leads to domination of other attributes in the overall score function. Building on this, we propose a score-guidance method that adheres to a user-provided reference distribution for generation. Moreover, we show that this score guidance can be achieved via different modalities like `text' and `exemplar images'. To our knowledge, our method is the first to provide a debiasing framework that can utilize different modalities for diffusion models. We demonstrate the effectiveness of our method across various attributes on both unconditional and conditional text-based diffusion models, including Stable Diffusion.
URL: https://openreview.net/forum?id=vAz8xUHyTe
---
Title: Bridging VMP and CEP: Theoretical Insights for Connecting Different Approximate Bayesian Inference Methods
Abstract: Approximate Bayesian inference (ABI) methods have become indispensable tools in modern machine learning and statistics for approximating intractable posterior distributions. Despite the related extensive studies and applications across diverse domains, the theoretical connections among these methods have remained relatively unexplored. This paper takes the first step to uncover the underlying relationships between two widely employed ABI techniques: the variational message passing (VMP) and the conditional expectation propagation (CEP) methods. Through rigorous mathematical analysis, we demonstrate a strong connection between these two approaches under mild conditions, from optimization as well as graphical model perspectives. This newly unveiled connection not only enhances our understanding of the performance and convergence properties of VMP and CEP, but it also facilitates the cross-fertilization of their respective strengths. For instance, we establish the convergence of CEP under mild conditions and demonstrate how this connection facilitates the construction of streaming VMP. Furthermore, our findings provide insights into the underlying relationships and distinctive characteristics of other ABI methods, shedding new light on the understanding and development of more advanced ABI techniques. To validate our theoretical findings, we derive and analyze various ABI methods within the context of Bayesian tensor decomposition, a fundamental tool in machine learning research. Specifically, we show that these two approaches yield the same updates within this context and illustrate how the established connection can be leveraged to construct a streaming version of the VMP-based Bayesian tensor decomposition algorithm.
URL: https://openreview.net/forum?id=QdO4VrnNfb
---
Title: Gaming and Cooperation in Federated Learning: What Can Happen and How to Monitor It
Abstract: The success of federated learning (FL) ultimately depends on how strategic participants behave under partial observability, yet most formulations still treat FL as a static optimization problem. We instead view FL deployments as governed strategic systems and develop an analytical framework that separates welfare-improving behavior from metric gaming. Within this framework, we introduce indices that quantify manipulability, the price of gaming, and the price of cooperation, and we use them to study how rules, information disclosure, evaluation metrics, and aggregator-switching policies reshape incentives and cooperation patterns. We derive threshold conditions for deterring harmful gaming while preserving benign cooperation, and for triggering auto-switch rules when early-warning indicators become critical. Building on these results, we construct a design toolkit including a governance checklist and a simple audit-budget allocation algorithm with a provable performance guarantee. Simulations across diverse stylized environments and a federated learning case study consistently match the qualitative and quantitative patterns predicted by our framework. Taken together, our results provide design principles and operational guidelines for reducing metric gaming while sustaining stable, high-welfare cooperation in FL platforms.
URL: https://openreview.net/forum?id=Ck3q5YdWIv
---
Title: Seeing is Simulating: Differentiable Physics for Interaction-Aware Material Estimation
Abstract: Modeling human-object interactions is crucial for creating immersive virtual experiences. However, synthesizing 3D object dynamics conditioned on actions remains a challenging problem. Existing approaches equip static 3D objects with motion priors distilled from video diffusion models. However, this methodology has two drawbacks: (i) video diffusion models are not physically grounded. Thus, the generated videos may contain physical inaccuracies; (ii) video diffusion models cannot generate complex dynamics where multiple objects interact under actions with long durations and large spatial extent. We present $\textbf{PhysInteract}$, a physics-based framework that (i) models interactions with a representation that captures their duration and contact information; (ii) estimates object material properties (e.g., Young's modulus) from objects' deformation caused by interactions; (iii) uses physics simulation to reproduce realistic object dynamics based on estimated interactions and material properties. We highlight that PhysInteract is fully differentiable, enabling joint optimization of interaction representations and object material properties. PhysInteract achieves better performance than existing methods. We demonstrate its superiority by quantitatively testing PhysInteract on a curated dataset. In conjunction with an additional user study, our method shows a step towards more realistic and immersive virtual experiences.
URL: https://openreview.net/forum?id=lwuaTI4ISa
---
Title: RIGID: A Training-Free and Model-Agnostic Framework for Robust AI-Generated Image Detection
Abstract: The rapid advances in generative AI models have empowered the creation of highly realistic images with arbitrary content, raising concerns about potential misuse and harm, such as Deepfakes. Current research focuses on training detectors using large datasets of generated images. However, these training-based solutions are often computationally expensive and show limited generalization to unseen generated images. In this paper, we propose a training-free method to distinguish between real and AI-generated images. We first observe that real images are more robust to tiny noise perturbations than AI-generated images in the representation space of vision foundation models. Based on this observation, we propose RIGID, a training-free and model-agnostic method for robust AI-generated image detection. RIGID is a simple yet effective approach that identifies whether an image is AI-generated by comparing the representation similarity between the original and the noise-perturbed counterpart. Our comprehensive evaluation demonstrates RIGID’s exceptional performance. RIGID surpasses existing training-free detectors by more than 25% on average. Remarkably, RIGID performs comparably to training-based methods, particularly on unseen domain data. Additionally, RIGID maintains consistent performance across various image generation techniques and demonstrates strong resilience to common image corruptions.
URL: https://openreview.net/forum?id=NBkBI2Zjlm
---
Title: When Lifelong Novelty Fails: Coordination Breakdown in Decentralised MARL
Abstract: Lifelong novelty bonuses are a cornerstone of exploration in reinforcement learning, but we identify a critical failure mode when they are applied to decentralised multi-agent coordination tasks: \emph{coordination de-synchronisation}. In sequential coordination tasks with multiple joint coordination checkpoints (states that all agents must occupy simultaneously), agents searching for later checkpoints must repeatedly traverse earlier ones. Under lifelong novelty, this repeated traversal gradually depletes intrinsic motivation to revisit these critical locations and can destabilise coordination. Within a stylised analytical framework, we derive lower bounds showing that the \emph{guaranteed} success probability under a lifelong novelty scheme can shrink polynomially with a problem-dependent geometric \emph{revisit pressure} and the number of agents, whereas episodic bonuses, which reset at the start of each episode, provide a time-uniform lower bound on the probability of reaching a given checkpoint. We further prove that a hybrid scheme, which multiplicatively combines episodic and lifelong bonuses, inherits both a constant ``coordination floor'' at known checkpoints and a persistent drive to discover previously unseen states. We validate the qualitative predictions of this framework in GridWorld, Overcooked, and StarCraft~II, where hybrid bonuses yield substantially more reliable coordination than lifelong-only exploration in environments with multiple sequential checkpoints or narrow geometric bottlenecks, such as corridors that force agents to pass through the same cells many times. Together, these results provide a theoretical and empirical account of when different intrinsic motivation schemes are effective in decentralised multi-agent coordination.
URL: https://openreview.net/forum?id=xOPjPFTuvy
---
Title: OSHO-CCA: Orthogonal and Scalable High-Order Canonical Correlation Analysis
Abstract: Canonical Correlation Analysis (CCA) is a classical technique for learning shared representations from two views of data by maximizing the correlation between the resulting representations. Existing extensions to more than two views either maximize pairwise correlations, sacrificing higher-order structure, or model high-order interactions at the expense of orthogonality and scalability. In this paper, we propose OSHO-CCA, a novel method for Orthogonal and Scalable High-Order CCA that jointly addresses all three desiderata: (1) it captures high-order dependencies across views, (2) enforces orthogonality among projected features to ensure decorrelated embeddings, and (3) scales efficiently with the number of views. We further introduce a new evaluation metric for Total Canonical Correlation (TCC) that generalizes traditional two-view CCA metrics to the multiview setting. Experiments on real and synthetic datasets demonstrate that OSHO‑CCA outperforms existing methods in both correlation maximization and downstream classification tasks, while maintaining scalability and orthogonality even in challenging multiview scenarios.
URL: https://openreview.net/forum?id=H0opV3oEMt
---
Title: V-OCBF: Learning Safety Filters from Offline Data via Value-Guided Offline Control Barrier Functions
Abstract: Ensuring safety in autonomous systems requires controllers that satisfy hard, state-wise constraints without relying on online interaction. While existing Safe Offline RL methods typically enforce soft expected-cost constraints, they do not guarantee forward invariance. Conversely, Control Barrier Functions (CBFs) provide rigorous safety guarantees but usually depend on expert-designed barrier functions or full knowledge of the system dynamics. We introduce Value-Guided Offline Control Barrier Functions (V-OCBF), a framework that learns a neural CBF entirely from offline demonstrations. Unlike prior approaches, V-OCBF does not assume access to the dynamics model; instead, it derives a recursive finite-difference barrier update, enabling model-free learning of a barrier that propagates safety information over time. Moreover, V-OCBF incorporates an expectile-based objective that avoids querying the barrier on out-of-distribution actions and restricts updates to the dataset-supported action set. The learned barrier is then used with a Quadratic Program (QP) formulation to synthesize real-time safe control. Across multiple case studies, V-OCBF yields substantially fewer safety violations than baseline methods while maintaining strong task performance, highlighting its scalability for offline synthesis of safety-critical controllers without online interaction or hand-engineered barriers.
URL: https://openreview.net/forum?id=PGO9mpIyyb
---
Title: On the Fundamental Limits of LLMs at Scale
Abstract: Large Language Models (LLMs) have benefited enormously from scaling, yet these gains are bounded by five fundamental limitations: (1) hallucination, (2) context compression, (3) reasoning degradation, (4) retrieval fragility, and (5) multimodal misalignment. While existing surveys describe these phenomena empirically, they lack a rigorous theoretical synthesis connecting them to the foundational limits of computation, information, and learning. This work closes that gap by presenting a unified, proof-informed framework that formalizes the innate theoretical ceilings of LLM scaling. First, computability and uncomputability imply an irreducible residue of error: for any computably enumerable model family, diagonalization guarantees inputs on which some model must fail, and undecidable queries (e.g., halting-style tasks) induce infinite failure sets for all computable predictors. Second, information-theoretic and statistical constraints bound attainable accuracy even on decidable tasks, finite description length enforces compression error, and long-tail factual knowledge requires prohibitive sample complexity. Third, geometric and computational effects compress long contexts far below their nominal size due to positional under-training, encoding attenuation, and softmax crowding. We further show how likelihood-based training favors pattern completion over inference, how retrieval under token limits suffers from semantic drift and coupling noise, and how multimodal scaling inherits shallow cross-modal alignment. Across sections, we pair theorems and empirical evidence to outline where scaling helps, where it saturates, and where it cannot progress, providing both theoretical foundations and practical mitigation paths like bounded-oracle retrieval, positional curricula, and sparse or hierarchical attention.
URL: https://openreview.net/forum?id=BIRDGVrom8
---
Title: LARP: Learner-Agnostic Robust Data Prefiltering
Abstract: Public datasets, crucial for modern machine learning and statistical inference, often contain low-quality or contaminated data that harms model performance. This motivates the development of principled prefiltering procedures that facilitate accurate downstream learning. In this work, we formalize the problem of **L**earner-**A**gnostic **R**obust data **P**refiltering (LARP), which aims at finding prefiltering procedures that minimize a worst-case loss over a pre-specified set of learners. We instantiate this framework in two theoretical settings, providing a hardness result and upper bounds. Our theoretical results indicate that performing LARP on heterogeneous learner sets causes some performance loss compared to individual, learner-specific prefiltering; we term this gap as the price of LARP. To assess whether LARP remains worthwhile, we (i) empirically measure the price of LARP across image and tabular tasks and (ii) introduce a game-theoretic cost model that trades off the price of LARP against the cost of learner-specific prefiltering. The model yields sufficient conditions under which LARP is provably beneficial.
URL: https://openreview.net/forum?id=gI6VOV3jfO
---
Title: Recursive Reasoning for Sample-Efficient Multi-Agent Reinforcement Learning
Abstract: Policy gradient algorithms for deep multi-agent reinforcement learning (MARL) typically employ an update that responds to the current strategies of other agents. While straightforward, this approach does not account for the updates of other agents within the same update step, resulting in miscoordination and reduced sample efficiency. In this paper, we introduce methods that recursively refine the policy gradient by updating each agent against the updated policies of other agents within the same update step, speeding up the discovery of effective coordinated policies. We provide principled implementations of recursive reasoning in MARL by applying it to competitive multi-agent algorithms in both on- and off-policy regimes. Empirically, we demonstrate superior performance and sample efficiency over existing deep MARL algorithms in StarCraft II and multi-agent MuJoCo. We theoretically prove that higher recursive reasoning in gradient-based methods with finite iterates achieves monotonic convergence to a local Nash equilibrium under certain conditions.
URL: https://openreview.net/forum?id=k5zVPe32VX
---
Title: FairSpace: Search Space Pruning of AutoML for Fairness-Accuracy Trade-off
Abstract: A major challenge in responsible Machine Learning (ML) engineering is ensuring fairness across multiple protected attributes and their intersections. Existing bias mitigation techniques and Automated Machine Learning (AutoML) systems often fail to address this due to the combinatorial explosion of configurations during hyperparameter optimization (HPO). We propose \textsc{FairSpace}, a fairness-aware framework that jointly performs HPO and dataset-specific feature engineering while strategically pruning the configuration space. \textsc{FairSpace} integrates LLM-assisted feature engineering methods with a bi-objective cost function to balance fairness and accuracy. Experimental results on five widely-used datasets demonstrate that \textsc{FairSpace} achieves win–win outcomes—simultaneously improving fairness and accuracy for 63\% of the cases, outperforming state-of-the-art (SOTA) baselines that achieve up to 60\%. Moreover, \textsc{FairSpace} achieves these results with approximately 25\% less computation time, owing to its targeted pruning strategy as compared to the SOTA AutoML baseline such as FairAutoML. By explicitly tackling intersectional fairness, \textsc{FairSpace} reaches 94\% of its outcomes in the \emph{win–win} and \emph{good trade-off} regions, providing a consistent and generalizable foundation for fairness-aware AutoML.
URL: https://openreview.net/forum?id=dkO4IwfwJe
---
Title: Evaluating LLM Understanding via Structured Tabular Decision Simulations
Abstract: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. True LLM understanding, analogous to human expertise, requires making consistent, well-founded decisions across multiple instances and diverse domains, relying on relevant and domain-grounded decision factors. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings that evaluate LLMs as if they were professionals undertaking structured decision "exams". In this context, understanding is defined as the ability to identify and rely on the correct decision factors, features that determine outcomes within a domain. STaDS jointly assesses understanding through: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. By analyzing 9 frontier LLMs across 15 diverse decision settings, we find that (a) most models struggle to achieve consistently strong accuracy across diverse domains; (b) models can be accurate yet globally unfaithful, and there are frequent mismatches between stated rationales and factors driving predictions. Our findings highlight the need for global-level understanding evaluation protocols and advocate for novel frameworks that go beyond accuracy to enhance LLMs' understanding ability.
URL: https://openreview.net/forum?id=R4NninzmGb
---
Title: Transitioning Heads Conundrum: The Hidden Bottleneck in Long-Tailed Class-Incremental Learning
Abstract: Long-Tailed Class-Incremental Learning (LTCIL) faces a fundamental tension: models must sequentially learn new classes while contending with extreme class imbalance, which amplifies catastrophic forgetting. A particularly overlooked phenomenon is the Transitioning Heads Conundrum: as replay buffers constrain memory, initially well-represented head classes shrink over time and effectively become tail classes, undermining knowledge retention. Existing approaches fail to address this because they apply knowledge distillation too late, after these transitions have already eroded head-class representations. To overcome this, we introduce DEcoupling Representations for Early Knowledge distillation (DEREK), which strategically employs Early Knowledge Distillation to safeguard head-class knowledge before data constraints manifest. Comprehensive evaluation across 2 LTCIL benchmarks, 12 experimental settings, and 24 baselines, including Long-Tail, Class-Incremental, Few-Shot CIL, and LTCIL methods, shows that DEREK maintains competitive performance across categories, establishing new state-of-the-art results.
URL: https://openreview.net/forum?id=Hb2Jvi5M7X
---
Title: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis
Abstract: Time series analysis is crucial in real-world applications, yet traditional methods focus on isolated tasks only, and recent studies on time series reasoning remain either limited to single-step inference or constrained to natural language answers. In this work, we introduce TS-Reasoner, a domain-specialized agent designed for multi-step time series inference. By integrating large language model (LLM) reasoning with domain-specific computational tools and an error feedback loop, TS-Reasoner enables domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. We assess the system’s capabilities along two axes: 1) fundamental time series understanding, assessed by TimeSeriesExam, and 2) complex, multi-step inference, evaluated by a newly proposed dataset designed to test both compositional reasoning and computational precision in time series analysis. Experiments show that our approach outperforms standalone general-purpose LLMs in both basic time series concept understanding and the multi-step time series inference task, highlighting the promise of domain-specialized agents for automating real-world time series reasoning and analysis.
URL: https://openreview.net/forum?id=yhy7Vigjcf
---
Title: Meta-Learning and Meta-Reinforcement Learning - Tracing the Path towards DeepMind's Adaptive Agent
Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning overcomes this limitation by allowing models to acquire transferable knowledge from various tasks, enabling rapid adaptation to new challenges with minimal data. This survey provides a rigorous, task-based formalization of meta-learning and meta-reinforcement learning and uses that paradigm to chronicle the landmark algorithms that paved the way for DeepMind’s Adaptive Agent, consolidating the essential concepts needed to understand the Adaptive Agent and other generalist approaches.
URL: https://openreview.net/forum?id=NZp1UVstvt
---
Title: Attention Trajectories as a Diagnostic Axis for Deep Reinforcement Learning
Abstract: While deep reinforcement learning agents demonstrate high performance across domains, their internal decision processes remain difficult to interpret when evaluated only through performance metrics. In particular, it is poorly understood which input features agents rely on, how these dependencies evolve during training, and how they relate to behavior. We introduce a scientific methodology for analyzing the learning process through quantitative analysis of saliency. This approach aggregates saliency information at the object and modality level into hierarchical attention profiles, quantifying how agents allocate attention over time, thereby forming attention trajectories throughout training. Applied to Atari benchmarks, custom Pong environments, and muscle-actuated biomechanical user simulations in visuomotor interactive tasks, this methodology uncovers algorithm-specific attention biases, reveals unintended reward-driven strategies, and diagnoses overfitting to redundant sensory channels. These patterns correspond to measurable behavioral differences, demonstrating empirical links between attention profiles, learning dynamics, and agent behavior. To assess robustness of the attention profiles, we validate our findings across multiple saliency methods and environments. The results establish attention trajectories as a promising diagnostic axis for tracing how feature reliance develops during training and for identifying biases and vulnerabilities invisible to performance metrics alone.
URL: https://openreview.net/forum?id=0aa9zthk7k
---
Title: Characterizing the ability of LLMs to recapitulate Americans’ distributional responses to public opinion polling questions across political issues
Abstract: Traditional survey-based political issue polling is becoming less tractable due to increasing costs and risk of bias associated with growing non-response rates and declining coverage of key demographic groups. With researchers and pollsters seeking alternatives, Large Language Models have drawn attention for their potential to augment human population studies in polling contexts. We propose and implement a new framework for anticipating human responses on multiple-choice political issue polling questions by directly prompting an LLM to predict a distribution of responses. By comparison with a large and high-quality issue poll of the US population, the Cooperative Election Study, we evaluate how the accuracy of this framework varies across a range of demographics and questions on a variety of topics, as well as how this framework compares to previously proposed frameworks where LLMs are repeatedly queried to simulate individual respondents. We find the proposed framework consistently exhibits more accurate predictions than individual querying at significantly lower cost. In addition, we find the performance of the proposed framework varies much more systematically and predictably across demographics and questions, making it possible for those performing AI polling to better anticipate model performance using only information available before a query is issued.
URL: https://openreview.net/forum?id=TR84HetGOH
---