Weekly TMLR digest for May 17, 2026

TMLR

May 17, 2026, 12:00:10 AM

to tmlr-annou...@googlegroups.com

New certifications
==================

Featured Certification, J2C Certification: Understanding the Effects of Neuron Dominance in Deep Reinforcement Learning

Zifan Wu, Qian Lin, Blake Lawlor, Haijun Zhao, Daniel S. Brown

https://openreview.net/forum?id=VNV1h77UnH

---

Survey Certification: From Models to Systems: A Comprehensive Survey of Efficient Multimodal Learning

Pan Wang, Siwei Song, Hui Ji, Siqi Cao, Heng Yu, Zhijian Liu, Huanrui Yang, Yingyan Celine Lin, Beidi Chen, Mohit Bansal, Xiaoming Liu, Pengfei Zhou, Ming-Hsuan Yang, Tianlong Chen, Jingtong Hu

https://openreview.net/forum?id=yfTU8FTS2Z

---

Featured Certification: DINOv3

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seung Eun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel HAZIZA, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Herve Jegou, Patrick Labatut, Piotr Bojanowski

https://openreview.net/forum?id=2NlGyqNjns

---

Expert Certification: Glocal Smoothness: Line search and adaptive sizes can help in theory too!

Curtis Fox, Aaron Mishkin, Sharan Vaswani, Mark Schmidt

https://openreview.net/forum?id=be9PdukwEL

---

Featured Certification: Accelerating Inference of Discrete Autoregressive Normalizing Flows by Selective Jacobi Decoding

Jiaru Zhang, Juanwu Lu, Xiaoyu Wu, Ziran Wang, Ruqi Zhang

https://openreview.net/forum?id=xYATz9HpE7

---

J2C Certification: Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Karthik Valmeekam, Vardhan Palod, Kaya Stechly, Atharva Gundawar, Subbarao Kambhampati

https://openreview.net/forum?id=gDE7YcRC3F

---

J2C Certification: Differentiable Cluster Discovery in Temporal Graphs

Md Hafizur Rahman, Chi-Guhn Lee

https://openreview.net/forum?id=1caZVb6zL7

---

J2C Certification: On Fitting Flow Models with Large Sinkhorn Couplings

Stephen Y. Zhang, Alireza Mousavi-Hosseini, Michal Klein, Marco Cuturi

https://openreview.net/forum?id=3MLKJZgY62

---

J2C Certification: Regret minimization in Linear Bandits with offline data via extended D-optimal exploration.

Sushant Vijayan, Arun Suggala, Karthikeyan Shanmugam, Soumyabrata Pal

https://openreview.net/forum?id=4WcK8gKgCi

---

Accepted papers
===============

Title: Memory-Efficient Differentially Private Training with Gradient Random Projection

Authors: Alex Mulrooney, Devansh Gupta, James Flemings, Huanyu Zhang, Murali Annavaram, Meisam Razaviyayn, Xinwei Zhang

Abstract: Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. DP-GRAPE is motivated by our finding that privatization flattens the gradient singular value spectrum, making SVD-based projections (Zhao et al., 2024) unnecessary. Consequently, DP-GRAPE employs three key components: (1) random Gaussian matrices replace SVD-based subspaces, (2) gradients are privatized after projection, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility trade-off comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can significantly reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters, a scale at which DP-Adam fails due to memory constraints. Our code is available at https://github.com/alexmul1114/DP_GRAPE.

URL: https://openreview.net/forum?id=CyxgbXCrWZ
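
A minimal sketch of components (1) and (2) from the DP-GRAPE abstract above: per-sample gradients are projected with a random Gaussian matrix, then clipped and noised in the low-dimensional space. This is not the authors' implementation (their repository is linked in the abstract). The function names, the 1/sqrt(r) scaling of the projection, and the clipping and noise parameters are illustrative assumptions. Component (3), projecting during backpropagation, and the mapping back to parameter space are omitted.

```python
import math
import random

def gaussian_projection(d, r, rng):
    # d x r matrix with N(0, 1/r) entries; the 1/sqrt(r) scaling is an assumption.
    return [[rng.gauss(0.0, 1.0 / math.sqrt(r)) for _ in range(r)] for _ in range(d)]

def project(grad, P):
    # Map a d-dimensional gradient to r dimensions: P^T g.
    r = len(P[0])
    return [sum(grad[i] * P[i][j] for i in range(len(grad))) for j in range(r)]

def clip(vec, max_norm):
    # Rescale vec so that its L2 norm is at most max_norm.
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def private_projected_gradient(per_sample_grads, P, max_norm, noise_multiplier, rng):
    # Privatize AFTER projection: clip each projected per-sample gradient,
    # sum, add Gaussian noise with std noise_multiplier * max_norm, then average.
    r = len(P[0])
    total = [0.0] * r
    for g in per_sample_grads:
        clipped = clip(project(g, P), max_norm)
        total = [t + c for t, c in zip(total, clipped)]
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy]

rng = random.Random(0)
P = gaussian_projection(d=8, r=3, rng=rng)
grads = [[rng.uniform(-5, 5) for _ in range(8)] for _ in range(4)]
update = private_projected_gradient(grads, P, max_norm=1.0, noise_multiplier=1.0, rng=rng)
print(len(update))  # prints 3: the update lives in the r-dimensional subspace
```

Only r numbers per sample need to be stored for clipping, which is where a memory saving of the kind the abstract reports would come from in this simplified picture.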

---

Title: A Resilience Framework for Bi-Criteria Combinatorial Optimization with Bandit Feedback

Authors: Vaneet Aggarwal, Shweta Jain, Subham Pokhriyal, Christopher John Quinn

Abstract: We study bi-criteria combinatorial optimization under noisy function evaluations. While resilience and black-box offline-to-online reductions have been studied in single-objective settings, extending these ideas to bi-criteria problems introduces new challenges due to the coupled degradation of approximation guarantees for objectives and constraints. We introduce a notion of $(\alpha,\beta,\delta,\texttt{N})$-resilience for bi-criteria approximation algorithms, capturing how joint approximation guarantees degrade under bounded (possibly worst-case) oracle noise, and develop a general black-box framework that converts any resilient offline algorithm into an online algorithm for bi-criteria combinatorial multi-armed bandits with bandit feedback. The resulting online guarantees achieve sublinear regret and cumulative constraint violation of order $\tilde{O}(\delta^{2/3}\texttt{N}^{1/3}T^{2/3})$ without requiring structural assumptions such as linearity, submodularity, or semi-bandit feedback on the noisy functions. We demonstrate the applicability of the framework by establishing resilience for several classical greedy algorithms in submodular optimization.

URL: https://openreview.net/forum?id=jcjxXUMyJ5

---

Title: On the Unreasonable Effectiveness of Last-layer Retraining

Authors: John Collins Hill, Tyler LaBonte, Xinchen Zhang, Vidya Muthukumar

Abstract: Last-layer retraining (LLR) methods --- wherein the last layer of a neural network is reinitialized and retrained on a held-out set following ERM training --- have garnered interest as an efficient approach to rectify dependence on spurious correlations and improve performance on minority groups. Surprisingly, LLR has been found to improve worst-group accuracy even when the held-out set is an imbalanced subset of the training set. We initially hypothesize that this ``unreasonable effectiveness'' of LLR is explained by its ability to mitigate neural collapse through the held-out set, resulting in the implicit bias of gradient descent benefiting robustness. Our empirical investigation does not support this hypothesis. Instead, we present strong evidence for an alternative hypothesis: that the success of LLR is primarily due to better group balance in the held-out set. We conclude by showing how the recent algorithms CB-LLR and AFR perform implicit group-balancing to elicit a robustness improvement.

URL: https://openreview.net/forum?id=h81ztbrkFb

---

Title: Token-Based Detection of Spurious Correlations in Vision Transformers

Authors: Solha Kang, Esla Timothy Anzaku, Wesley De Neve, Arnout Van Messem, Joris Vankerschaver, Francois Rameau, Utku Ozbulak

Abstract: Due to their powerful feature association capabilities, neural network-based computer vision models have the ability to detect and exploit unintended patterns within the data, potentially leading to correct predictions based on incorrect or unintended but statistically relevant signals. These clues may vary from simple color aberrations to small pieces of text within the image. In situations where these unintended signals align with the predictive task, models can mistakenly link these features with the task and rely on them for making predictions. This phenomenon is referred to as spurious correlations, where patterns appear to be associated with the task but are actually coincidental. As a result, detection and mitigation of spurious correlations have become crucial tasks for building trustworthy, reliable, and generalizable machine learning models. In this work, we present a token-based diagnostic pipeline that applies leave-one-out token removal to detect spurious correlations in vision transformers. The proposed approach quantifies a model’s reliance on non-core visual cues through complementary measures that capture both aggregate and localized spurious effects at the token level. Using both supervised and self-supervised trained models, we present large-scale experiments on the ImageNet dataset demonstrating the ability of the proposed method to identify spurious correlations. We also find that, even if the same architecture is used, the training methodology has a substantial impact on the model's reliance on spurious correlations. Furthermore, we show that for certain ImageNet classes, many images exhibit strong reliance on non-core visual cues across multiple models, and we discuss common sources of such signals (e.g., watermarks and background artifacts). Lastly, we present a case study investigating spurious signals in invasive breast mass classification, grounding our work in a real-world scenario.

URL: https://openreview.net/forum?id=GlPXPhwOzI
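
The leave-one-out token removal mentioned in the abstract above can be illustrated with a generic sketch: remove one token at a time and record how much a score drops. The `toy_score` function and its weights are hypothetical stand-ins for a vision transformer's class confidence. The paper's actual aggregate and localized measures are not reproduced here.

```python
def leave_one_out_effects(tokens, score_fn):
    # Effect of token i = score with all tokens minus score with token i removed.
    base = score_fn(tokens)
    effects = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        effects.append(base - score_fn(reduced))
    return effects

def toy_score(tokens):
    # Hypothetical model: a watermark token contributes almost as much as the object.
    weights = {"object": 0.5, "background": 0.1, "watermark": 0.4}
    return sum(weights.get(t, 0.0) for t in tokens)

tokens = ["object", "background", "watermark"]
effects = leave_one_out_effects(tokens, toy_score)
# A large effect on a non-core token (here "watermark") hints at a spurious cue.
most_influential = tokens[max(range(len(tokens)), key=lambda i: effects[i])]
```

With a real model, `score_fn` would run a forward pass on the image with one patch token dropped, so the cost grows linearly with the number of tokens.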

---

Title: Understanding the Effects of Neuron Dominance in Deep Reinforcement Learning

Authors: Zifan Wu, Qian Lin, Blake Lawlor, Haijun Zhao, Daniel S. Brown

Abstract: Recent studies in deep reinforcement learning have revealed that neural networks tend to lose their capacity to adapt to new targets over the course of training. The proliferation of inactive neurons, i.e., the so-called ``dormant neurons'', has been identified as one source of capacity loss. This paper investigates \textit{dominant neurons}, neurons whose activation values are significantly larger than average, as a potential cause for neuron dormancy. We demonstrate the existence of dominant neurons in a number of visual control tasks, and perform an analysis of the learning dynamics showing how dominant neurons can induce dormancy in the subsequent layer. To gain a better understanding of this phenomenon, we examine it through the lens of representation learning and establish its connection with representation collapse. Furthermore, this paper evaluates several mitigation strategies for dominant neurons across a variety of visual control tasks. Our results show that strategies that induce lower peak activation scores tend to exhibit greater representational capacity, lower dormant neuron percentage, and better performance. Among these mitigation strategies, LayerNorm with weight decay has the strongest performance, despite its simplicity. Moreover, switching the value learning loss from regression to a classification loss also significantly mitigates the neuron dominance issue and improves the performance. As a potential explanation of the effectiveness of classification losses, we provide an analysis that shows how a classification loss can prevent representation collapse.

URL: https://openreview.net/forum?id=VNV1h77UnH
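
To make the notions of dormant and dominant neurons above concrete, here is a small sketch that normalizes each neuron's mean absolute activation by the layer average and flags outliers on both ends. The normalization and both thresholds are assumptions for illustration. The abstract only states that dominant neurons have activations significantly larger than average, and the paper's exact score may differ.

```python
def normalized_activation_scores(activations):
    # activations: list of samples, each a list of per-neuron activations in one layer.
    n_samples = len(activations)
    n_neurons = len(activations[0])
    mean_abs = [
        sum(abs(sample[j]) for sample in activations) / n_samples
        for j in range(n_neurons)
    ]
    layer_mean = sum(mean_abs) / n_neurons
    if layer_mean == 0:
        return [0.0] * n_neurons
    # A score of 1.0 means "as active as the average neuron in this layer".
    return [m / layer_mean for m in mean_abs]

def flag_neurons(scores, dormant_threshold=0.1, dominant_threshold=3.0):
    # Both thresholds are hypothetical choices, not values from the paper.
    dormant = [j for j, s in enumerate(scores) if s <= dormant_threshold]
    dominant = [j for j, s in enumerate(scores) if s >= dominant_threshold]
    return dormant, dominant

batch = [[0.0, 1.0, 1.0, 10.0], [0.0, 1.0, 1.0, 10.0]]
scores = normalized_activation_scores(batch)
dormant, dominant = flag_neurons(scores)
```

In this toy batch, neuron 3 dominates the layer average, which pushes the relative scores of the other neurons down.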

---

Title: Post-Training Neural Network Pruning using Graph Curvature

Authors: Shuhang Tan, Jayson Sia, Paul Bogdan, Radoslav Ivanov

Abstract: This paper provides a fresh view of the neural network (NN) pruning problem, through the lens of graph theory. To achieve effective pruning, we aim to identify the main NN data flows and the corresponding NN connections that are most (and least) important for the performance of the full model. Unlike the standard approach to NN data flow analysis, which is based on information theory, we employ the notion of graph curvature, specifically Ollivier-Ricci curvature (ORC). The ORC has been successfully used to identify important graph edges in various domains such as road traffic analysis, biological and social networks. In particular, edges with negative ORC are considered bottlenecks and as such are critical to the graph’s overall connectivity, whereas positive-ORC edges are not essential. We use this intuition for the case of NNs to: 1) construct a graph induced by the NN structure and introduce the notion of neural curvature (NC) based on the ORC; 2) calculate curvatures based on activation patterns for a set of input examples; 3) demonstrate that NC can indeed be used to rank edges according to their importance for the overall NN functionality. We evaluate our method through pruning experiments on a variety of small-to-medium-size models trained on three image datasets, namely MNIST, CIFAR-10 and CIFAR-100. The results indicate that our method can identify a larger number of unimportant edges as compared to existing pruning methods. (Code: https://github.com/SH-Tan/Post-Training-NN-Pruning-using-Graph-Curvature)

URL: https://openreview.net/forum?id=kwACVY73Ug

---

Title: From Models to Systems: A Comprehensive Survey of Efficient Multimodal Learning

Authors: Pan Wang, Siwei Song, Hui Ji, Siqi Cao, Heng Yu, Zhijian Liu, Huanrui Yang, Yingyan Celine Lin, Beidi Chen, Mohit Bansal, Xiaoming Liu, Pengfei Zhou, Ming-Hsuan Yang, Tianlong Chen, Jingtong Hu

Abstract: The rapid expansion of multimodal models has surfaced formidable bottlenecks in computation, memory, and deployment, catalyzing the rise of Efficient Multimodal Learning (EML) as a pivotal research frontier. Despite intensive progress, a cohesive understanding of $\textit{what}$, $\textit{how}$, and $\textit{where}$ efficiency is manifested across the learning stack remains fragmented. This survey systematizes the EML landscape by introducing the first structured, model-to-system taxonomy. We distill insights from over 300 seminal works into three hierarchical levels ($\textit{model}$, $\textit{algorithm}$, and $\textit{system}$), addressing architectural parsimony, execution refinement, and hardware-aware orchestration, respectively. Moving beyond a purely categorical review, we offer a methodological synthesis of the vertical synergies between these layers, elucidating how cross-layer co-design contributes to the fundamental "Efficiency-Utility-Privacy" trade-off. Through an integrative case study of Multimodal Large Language Models (MLLMs), we trace the field’s evolutionary trajectory from initial structural adjustments to modern full-stack resource orchestration. Furthermore, we provide a holistic discussion and application-specific optimization blueprints for diverse domains and posit a paradigm shift toward self-regulating intelligence, where efficiency is an intrinsic, emergent property of the model’s fundamental design rather than a post-hoc constraint. Finally, we present open challenges and future directions that will define the trajectory of EML research. This survey establishes a structured framework for multimodal systems that are not only high-performing and generalizable but natively efficient and ready for ubiquitous deployment. A continuously updated version is available at https://github.com/pwang322/Efficient-Multimodal-Learning-Survey.

URL: https://openreview.net/forum?id=yfTU8FTS2Z

---

Title: RPATH: Explaining Time Series Mixture of Experts Routing via Ensemble Consensus and Structural Robustness

Authors: Temesgen Mikael Abraha, Yves Lucet

Abstract: Mixture-of-Experts (MoE) architectures achieve strong performance in time series forecasting through sparse expert activation, but understanding why specific experts are selected remains challenging. We present RPATH (Routing Pathway Analysis for Temporal Hierarchies), a post-hoc explainability framework for time series MoE models that combines temporal saliency mapping with counterfactual generation. Evaluating on Time-MoE-50M across 300 expert-sample pairs, we discover two properties of the routing architecture: (1) Ensemble Consensus, where experts at different layers consistently identify the same critical temporal windows (mean saliency Intersection over Union (IoU) = 0.677), rather than developing distinct specializations; and (2) Structural Robustness, characterized by a 300-fold "Stability Gap" where gentle perturbations alter routing in only 0.3% of cases while aggressive perturbations succeed in 99.7%, indicating that routing decisions reflect structural anchors rather than superficial signal characteristics. Together, these findings demonstrate that Time-MoE achieves reliable forecasting through Ensemble Redundancy: multiple experts verify the same structural features, providing consensus that is insensitive to noise but responsive to fundamental signal changes. Our framework provides practitioners with tools to visualize expert attention, identify critical input regions, and quantify routing stability for deployed MoE models.

URL: https://openreview.net/forum?id=kwpDOqas2x

---

Title: Improving OOD Robustness via Background-Aware Test-Time Augmentation in Black-Box and Resource-Constrained Settings

Authors: Ping Song, Adegboyega Ojo, Edward Curry

Abstract: Deep learning models for text classification typically achieve strong performance on in-distribution (ID) data but often fail to generalize to out-of-distribution (OOD) inputs. This degradation frequently arises because models rely on spurious background cues (e.g., specific syntax or register) learned during training, which become unreliable when the domain changes. While recent Test-Time Augmentation (TTA) approaches have enabled robustness in black-box settings, they often rely on unconstrained rewriting strategies. For instance, standard In-Context Rewriting (ICR) instructs Large Language Models (LLMs) to modify input details to match ID exemplars, creating a high risk of semantic drift and label flipping, particularly when using smaller, resource-constrained LLMs. In this work, we propose a Background-Aware TTA (BA-TTA) framework that strictly disentangles style from semantics. Unlike prior methods that encourage broad paraphrasing, we utilize a semantic-constrained alignment strategy that enables small, efficient LLMs to transform specific background attributes, such as tone and sentence structure, to match in-distribution priors while explicitly enforcing the preservation of original meaning. This approach mitigates OOD degradation by neutralizing spurious background shifts, allowing frozen black-box models to process inputs in their native distribution without risking semantic corruption. Empirical evaluations across multiple text classification benchmarks demonstrate that our targeted alignment strategy outperforms unconstrained augmentation baselines. By generating higher-fidelity augmentations, our method achieves superior OOD robustness with reduced computational overhead, establishing a viable path for deploying robust models in resource-limited black-box environments. We validate the versatility of BA-TTA using a range of open-weights generators, from Llama-2 based models to the recent Llama-3.1-8B and Qwen-2.5-7B, showing consistent gains across model families.

URL: https://openreview.net/forum?id=xptPQVCy5X

---

Title: DINOv3

Authors: Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seung Eun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel HAZIZA, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Herve Jegou, Patrick Labatut, Piotr Bojanowski

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images, using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

URL: https://openreview.net/forum?id=2NlGyqNjns

---

Title: Post-Training Adaptive Conformal Prediction for Incomplete Time Series

Authors: Baiting Chen, Xiaofan Zhou, Lu Cheng

Abstract: Conformal Prediction (CP) is widely used for uncertainty quantification but faces significant challenges with time series due to non-exchangeability. The issue is exacerbated by missing data, where the exponential growth of missing patterns makes existing approaches computationally expensive and unable to adequately represent each missing pattern. To address this, we propose a novel approach that uses a post-training Neural Network (NN) to handle temporal dependencies and structured missingness in time series data. With a novel non-conformity score function, our method improves conditional coverage for different missing patterns, ensuring prediction intervals are both reliable and informative. We introduce features that capture different missingness mechanisms, enabling the model to adapt to various patterns. Theoretically, we establish asymptotic validity for conditional coverage with adaptive adjustments. Experiments on semi-synthetic benchmarks demonstrate the method's efficiency in producing tight prediction intervals while maintaining group conditional coverage.

URL: https://openreview.net/forum?id=KMBU4wx79B

---

Title: Offline changepoint localization using a matrix of conformal p-values

Authors: Sanjit Dandapanthula, Aaditya Ramdas

Abstract: Changepoint localization is the problem of estimating the index at which a change occurred in the data generating distribution of an ordered list of data, or declaring that no change occurred. We present the broadly applicable MCP algorithm, which uses a matrix of conformal p-values to produce a confidence interval for a (single) changepoint under the mild assumption that the pre-change and post-change distributions are each exchangeable. We prove a novel conformal Neyman-Pearson lemma, motivating practical classifier-based choices for our conformal score function. Finally, we exemplify the MCP algorithm on a variety of synthetic and real-world datasets, including using black-box pre-trained classifiers to detect changes in sequences of images, text, and accelerometer data.

URL: https://openreview.net/forum?id=bo2WlznUOc

---

Title: SynQuE: Estimating Synthetic Dataset Quality Without Annotations

Authors: Arthur Chen, Victor Zhong

Abstract: We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE) problem: ranking synthetic datasets by their expected real-world task performance using only limited unannotated real data. This addresses a critical and open challenge where data is scarce due to collection costs or privacy constraints. We establish the first comprehensive benchmarks for this problem by introducing and evaluating proxy metrics that choose synthetic data for training to maximize task performance on real data. We introduce the first proxy metrics for SynQuE by adapting distribution and diversity-based distance measures to our context via embedding models. To address the shortcomings of these metrics on complex planning tasks, we propose Lens, a novel proxy that leverages large language model reasoning. Our results show that SynQuE proxies correlate with real task performance across diverse tasks, including sentiment analysis, Text2SQL, web navigation, and image classification, with Lens consistently outperforming others on complex tasks by capturing nuanced characteristics. For instance, on text-to-SQL parsing, training on the top-3 synthetic datasets selected via SynQuE proxies can raise accuracy from 30.4% to 38.4% (+8.1%) on average compared to selecting data indiscriminately. This work establishes SynQuE as a practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation model-based data characterization and fine-grained data selection. We release our code.

URL: https://openreview.net/forum?id=W4Pwb4SX3P

---

Title: PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

Authors: Divyam Goel, Nithin Chalapathi, Sanjeev Raja, Aditi S. Krishnapriyan

Abstract: Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields, a fundamental task in numerous scientific domains. Neural networks, and particularly neural operators, are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem (mapping physical parameters to solution fields), to our knowledge there are no similar comprehensive studies and benchmark datasets on PDE inverse problems (mapping solution fields to underlying physical parameters). We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: (1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, (2) incorporating PDE derivatives as input features consistently improves accuracy, and (3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and evaluation codebase freely available to facilitate reproducibility and further development of our work.

URL: https://openreview.net/forum?id=MSjhqRnNyZ

---

Title: Adaptive multi-frame sampling for consistent zero-shot text-to-video editing

Authors: Thérèse Tisseau des Escotais, Clément Rambour, Bertrand Leroy, Arnaud Breloy

Abstract: Achieving convincing temporal coherence is a fundamental challenge in zero-shot text-to-video editing. To address this issue, this paper introduces AMAC (Adaptive Multi-frame sAmpling for Consistent zero-shot text-to-video editing), a novel method that effectively balances temporal consistency with detail preservation. Our approach proposes a theoretical framework with a fully adaptive sampling strategy that selects frames for joint processing using a pre-trained text-to-image diffusion model. By reformulating the sampling strategy as a stochastic permutation over frame indexes and constructing its distribution based on inter-frame similarities, we promote consistent processing of related content. This method demonstrates superior robustness against temporal variations and shot transitions, making it particularly well-suited for editing long dynamic video sequences, as validated through experiments on DAVIS and BDD100K datasets. Some examples of generated videos are available in the following anonymous repository https://anonymous.4open.science/r/AMAC-A406.

URL: https://openreview.net/forum?id=vcZ6qdbADL

---

Title: Synergizing Deconfounding and Temporal Generalization For Time-series Counterfactual Outcome Estimation

Authors: Yiling Liu, Juncheng Dong, Chen Fu, Wei Shi, Ziyang Jiang, Qi Xu, Zhigang Hua, David Carlson

Abstract: Estimating counterfactual outcomes from time‑series observations is crucial for effective decision-making, e.g. when to administer a life‑saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning marginal distributions of the treatments in latent space, SGA uses iterative treatment‑agnostic clustering to identify fine-grained sub‑treatment groups. Aligning these fine‑grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk up to an additive constant term and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noise during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.

URL: https://openreview.net/forum?id=xuJH3BJiNu
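
Random Temporal Masking, as the abstract above describes it, replaces input covariates with Gaussian noise during training. A minimal sketch of that idea follows. Masking whole time steps, the masking probability, and the noise scale are assumptions made for illustration, not details from the paper.

```python
import random

def random_temporal_mask(sequence, mask_prob, rng, noise_std=1.0):
    # sequence: list of time steps, each a list of covariate values.
    # With probability mask_prob, a time step's covariates are replaced by
    # Gaussian noise (per-step masking is an assumption of this sketch).
    masked = []
    for step in sequence:
        if rng.random() < mask_prob:
            masked.append([rng.gauss(0.0, noise_std) for _ in step])
        else:
            masked.append(list(step))
    return masked

rng = random.Random(0)
covariates = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
train_input = random_temporal_mask(covariates, mask_prob=0.3, rng=rng)
```

The masking would be applied only at training time. At evaluation time the model sees the unmodified covariates.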

---

Title: Adaptive Conformal Prediction for Quantum Machine Learning

Authors: Douglas Spencer, Samual Nicholls, Michele Caprio

Abstract: Quantum machine learning seeks to leverage quantum computers to improve upon classical machine learning algorithms. Currently, robust uncertainty quantification methods remain underdeveloped in the quantum domain, despite the critical need for reliable and trustworthy predictions. Recent work has introduced quantum conformal prediction, a framework that produces prediction sets that are guaranteed to contain the true outcome with a user-specified probability. In this work, we formalise how the time-varying noise inherent in quantum processors can undermine conformal guarantees, even when calibration and test data are exchangeable. To address this challenge, we draw on Adaptive Conformal Inference, a method which maintains validity over time via repeated recalibration. We introduce Adaptive Quantum Conformal Prediction (AQCP), an algorithm which provides asymptotic average coverage guarantees under arbitrary hardware noise conditions. Empirical studies on an IBM quantum processor demonstrate that AQCP achieves the target coverage level and exhibits greater stability than quantum conformal prediction.

URL: https://openreview.net/forum?id=ShkPB9OeEW
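
The recalibration idea behind Adaptive Conformal Inference can be sketched with its classical update rule, `alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t)`, where `err_t` is 1 when the prediction set misses the true outcome. The code below is a generic sketch of that rule on one-dimensional nonconformity scores, not AQCP itself; the sliding calibration window, the step size `gamma`, and the function names are assumptions.

```python
import numpy as np

def aci_update(alpha_t, target_alpha, err_t, gamma):
    """One ACI step: lower alpha_t after a miss, raise it after a cover."""
    return alpha_t + gamma * (target_alpha - err_t)

def run_aci(scores, target_alpha=0.1, gamma=0.01, window=100):
    """Run ACI over a stream of nonconformity scores.

    At step t the threshold is the (1 - alpha_t) quantile of the previous
    `window` scores; a miss occurs when the new score exceeds it.
    Returns the 0/1 miss indicators for every step after the warm-up window.
    """
    alpha_t = target_alpha
    errs = []
    for t in range(window, len(scores)):
        calib = scores[t - window:t]
        if alpha_t <= 0.0:
            covered = True    # threshold is +infinity: the set is everything
        elif alpha_t >= 1.0:
            covered = False   # threshold is -infinity: the set is empty
        else:
            covered = scores[t] <= np.quantile(calib, 1.0 - alpha_t)
        err_t = 0.0 if covered else 1.0
        errs.append(err_t)
        alpha_t = aci_update(alpha_t, target_alpha, err_t, gamma)
    return np.array(errs)
```

Because every miss pushes `alpha_t` down and every cover pushes it up, the running miss rate is pulled toward `target_alpha` even when the score distribution drifts over time.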

---

Title: SA-PEF: Step-Ahead Partial Error Feedback for Efficient Federated Learning

Authors: Dawit Kiros Redie, Reza Arablouei, Stefan Werner

Abstract: Biased gradient compression with error feedback (EF) reduces communication in federated learning (FL), but under heterogeneous (non-IID) data and local updates, the compression residual can decay slowly. This induces a mismatch between where gradients are evaluated and where the (decompressed) update is effectively applied, often slowing progress in the early rounds. We propose step-ahead partial error feedback (SA-PEF), which introduces a tunable step-ahead coefficient $\alpha_r\in[0,1]$ and previews only a fraction of the residual while carrying the remainder through standard EF. SA-PEF interpolates smoothly between EF ($\alpha_r=0$) and full step-ahead EF (SAEF; $\alpha_r=1$). For nonconvex objectives with $\delta$-contractive compressors, we develop a second-moment bound and a residual recursion that yield nonconvex stationarity guarantees under data heterogeneity and partial client participation. With a constant inner stepsize, the bound exhibits the standard $\mathcal{O}\!\bigl((\eta\,\eta_0TR)^{-1}\bigr)$ optimization term and an $R$-independent variance/heterogeneity floor induced by biased compression. Our analysis highlights a step-ahead-controlled residual contraction factor $\rho_r$, explaining the observed early-phase acceleration, and suggests choosing $\alpha_r$ near a theory-predicted optimum to balance SAEF’s rapid warm-up with EF’s long-run stability. Experiments across architectures, datasets, and compressors show that SA-PEF consistently reaches target accuracy in fewer communication rounds than EF.

URL: https://openreview.net/forum?id=ejnVWfknCm

---

Title: TOAST: Transformer Optimization using Adaptive and Simple Transformations

Authors: Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E Vogt

Abstract: Foundation models achieve state-of-the-art performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or finetuning, limiting their practicality.
Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited for enabling techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source for efficiency gains. In this paper, we introduce Transformer Optimization using Adaptive and Simple Transformations (TOAST), a framework that exploits these redundancies to approximate entire transformer blocks with lightweight closed-form mappings, such as linear transformations or even the identity function, without any additional training. Across state-of-the-art pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.

URL: https://openreview.net/forum?id=fSwMCsBtTG

---

Title: Glocal Smoothness: Line search and adaptive sizes can help in theory too!

Authors: Curtis Fox, Aaron Mishkin, Sharan Vaswani, Mark Schmidt

Abstract: Iteration complexities for optimizing smooth functions with first-order algorithms are typically stated in terms of a global Lipschitz constant of the gradient, and near-optimal results are then achieved using fixed step sizes. But many objective functions that arise in practice have regions with small Lipschitz constants where larger step sizes can be used. Many local Lipschitz assumptions have been proposed, which have led to results showing that adaptive step sizes and/or line searches yield improved convergence rates over fixed step sizes. However, these faster rates tend to depend on the iterates of the algorithm, which makes it difficult to compare the iteration complexities of different methods. We consider a simple characterization of global and local ("glocal") smoothness that only depends on properties of the function. This allows upper bounds on iteration complexities in terms of iterate-independent constants and enables us to compare iteration complexities between algorithms. Under this assumption it is straightforward to show the advantages of line searches over fixed step sizes and that, in some settings, gradient descent with line search has a better iteration complexity than accelerated methods with fixed step sizes. We further show that glocal smoothness can lead to improved complexities for the Polyak and AdGD step sizes, as well as other algorithms including coordinate optimization, stochastic gradient methods, accelerated gradient methods, and non-linear conjugate gradient methods.

URL: https://openreview.net/forum?id=be9PdukwEL

---

Title: Accelerating Inference of Discrete Autoregressive Normalizing Flows by Selective Jacobi Decoding

Authors: Jiaru Zhang, Juanwu Lu, Xiaoyu Wu, Ziran Wang, Ruqi Zhang

Abstract: Discrete normalizing flows are promising generative models with advantages such as analytical log-likelihood computation and end-to-end training.
However, the architectural constraints to ensure invertibility and tractable Jacobian computation limit their expressive power and practical usability.
Recent advancements utilize autoregressive modeling, significantly enhancing expressive power and generation quality.
Nevertheless, such sequential modeling inherently restricts parallel computation during inference, leading to slow generation that impedes practical deployment.
In this paper, we first identify that strict sequential dependency in inference is unnecessary to generate high-quality samples.
We observe that sub-variables in sequential modeling can also be approximated without strictly conditioning on all preceding sub-variables.
Moreover, the models tend to exhibit low dependency redundancy in the initial layer and higher redundancy in subsequent layers.
Leveraging these observations, we propose a selective Jacobi decoding strategy that accelerates autoregressive inference through parallel iterative optimization.
Theoretical analyses demonstrate the method's superlinear convergence rate and guarantee that the number of iterations required is no greater than that of the original sequential approach.
Empirical evaluations across multiple datasets validate the generality and effectiveness of our acceleration technique, achieving up to 4.7 times faster inference on modern normalizing flow models while preserving generation quality.

URL: https://openreview.net/forum?id=xYATz9HpE7
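
The core of Jacobi decoding can be illustrated on a toy autoregressive system in which each variable depends only on its predecessors. The sketch below is not the paper's flow model or its selective strategy: the `tanh` update, the weight matrix `W`, and the bias `b` are assumptions chosen to keep the example small. It shows that updating all positions in parallel reaches the sequential solution in at most `n` sweeps, and usually fewer when dependencies are weak.

```python
import numpy as np

def sequential_decode(W, b):
    """Reference decoder: x[i] = tanh(W[i, :i] @ x[:i] + b[i]), one i at a time."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = np.tanh(W[i, :i] @ x[:i] + b[i])
    return x

def jacobi_decode(W, b, tol=1e-10, max_iters=None):
    """Jacobi fixed-point iteration: update every position in parallel.

    After sweep k the first k positions already match the sequential result,
    so n sweeps always suffice; the loop stops earlier once a sweep changes
    nothing beyond `tol`. Returns the solution and the number of sweeps used.
    """
    n = len(b)
    max_iters = n if max_iters is None else max_iters
    L = np.tril(W, k=-1)          # keep only dependencies on earlier positions
    x = np.zeros(n)
    for it in range(1, max_iters + 1):
        x_new = np.tanh(L @ x + b)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new, it
        x = x_new
    return x, max_iters
```

Each sweep is a single matrix-vector product, which is the kind of work accelerators parallelize well, whereas the sequential loop performs `n` dependent steps.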

---

Title: Exploration-Driven Optimization for Test-Time Large Language Model Reasoning

Authors: ChangHao Li, Yuchen Zhuang, Chenxiao Gao, Haotian Sun, Rushi Qiang, Chao Zhang, Bo Dai

Abstract: Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive experiments demonstrate that both ED-iDPO and ED-GRPO exhibit greater solution diversity and improved reasoning abilities, particularly when combined with test-time computation techniques like self-consistency. Across three in-distribution reasoning benchmarks, EDO achieves a 1.0-1.3\% improvement over the strongest baselines, and delivers an additional 1.5\% average gain on five out-of-distribution tasks. Beyond accuracy, EDO preserves model entropy and stabilizes RL training dynamics, highlighting its effectiveness in preventing over-optimization collapse. Taken together, these results establish EDO as a practical framework for balancing exploration and exploitation in LLM reasoning, especially in settings that rely on test-time scaling.

URL: https://openreview.net/forum?id=NiINDlzvNj

---

Title: A Faster Generalized Two-Stage Approximate Top-K

Authors: Yashas Samaga B L, Varun Yerram, Spandana Raj Babbula, Prateek Jain, Praneeth Netrapalli

Abstract: We consider the Top-$K$ selection problem, which aims to identify the largest $K$ elements in an array. Top-$K$ selection arises in many machine learning algorithms and often becomes a bottleneck on accelerators, which are optimized for dense matrix multiplications. To address this problem, Chern et al. (2022) proposed a fast two-stage *approximate* Top-$K$ algorithm that: (i) partitions the input array into equal-sized chunks and selects the top-$1$ element from each partition; and (ii) sorts the resulting *smaller subset* and returns the top $K$ elements. In this paper, we generalize the first stage so that each partition selects the top $K'$ elements (for $1 \leq K' \leq K$). Our contributions include: (i) an expression for the expected recall of this generalized algorithm under random partitioning, and a demonstration that choosing $K' > 1$ with *fewer partitions* in the first stage more effectively reduces the input size to the second stage while maintaining the same expected recall as the original algorithm; (ii) a bound on the expected recall of the original algorithm as a function of the algorithm parameters that is provably tighter by a factor of $2$ than the bound reported by Chern et al. (2022); and (iii) an implementation of our algorithm on Cloud TPUv5e that achieves approximately an order of magnitude speedup over the original algorithm without sacrificing recall.

URL: https://openreview.net/forum?id=izqZ1Crpjz
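
The two stages described in the abstract can be sketched in NumPy as follows. This is an illustrative CPU version, not the authors' TPU implementation; the function names and the use of a random permutation for partitioning are assumptions made for the example.

```python
import numpy as np

def two_stage_approx_top_k(x, k, num_partitions, k_prime=1, rng=None):
    """Approximate Top-K: keep the top k_prime of each chunk, then sort the survivors.

    Returns indices into x of the selected elements, largest value first.
    """
    x = np.asarray(x)
    n = x.shape[0]
    if n % num_partitions != 0:
        raise ValueError("len(x) must be divisible by num_partitions")
    chunk = n // num_partitions
    if k_prime > chunk or num_partitions * k_prime < k:
        raise ValueError("need k_prime <= chunk size and num_partitions * k_prime >= k")
    rng = np.random.default_rng() if rng is None else rng
    # Random partitioning: shuffle the indices, then cut into equal-sized chunks.
    idx = rng.permutation(n).reshape(num_partitions, chunk)
    vals = x[idx]
    # Stage 1: top k_prime positions within every chunk (order inside is arbitrary).
    local = np.argpartition(-vals, k_prime - 1, axis=1)[:, :k_prime]
    candidates = np.take_along_axis(idx, local, axis=1).ravel()
    # Stage 2: sort the much smaller candidate set and keep the k largest.
    order = np.argsort(-x[candidates])[:k]
    return candidates[order]

def recall(approx_idx, exact_idx):
    """Fraction of the true top-k indices that the approximation recovered."""
    return len(set(approx_idx.tolist()) & set(exact_idx.tolist())) / len(exact_idx)
```

The second stage sorts `num_partitions * k_prime` candidates, so using fewer partitions with `k_prime > 1` shrinks that input, which is the trade-off the abstract analyzes.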

---

Title: Graph Generation via Temporal-Aware Biased Walks

Authors: Resul Tugay, Eren Olug, Elif Ak, Kiymet Kaya, Şule Gündüz Öğüdücü

Abstract: Some real networks keep a fixed structure (e.g., roads, sensors and their connections) while node or edge signals evolve over time. Existing graph generators either model topology changes (i.e., edge additions/deletions) or focus only on static graph properties (such as degree distributions or motifs), without considering how temporal signals shape the generated structure. By approaching the problem from an unconventional perspective, we introduce TANGEM, which integrates a temporal similarity matrix into biased random walks, thereby coupling signals with structure to generate graphs that highlight patterns reflecting how nodes co-activate over time. We evaluate TANGEM using an approach that separates structural fidelity (clustering, spectral metrics) from downstream temporal consistency, allowing us to clearly isolate the impact of the topology generator itself. In structural benchmarks, TANGEM consistently outperforms strong baselines while remaining lightweight. These results show that adding attribute-guided bias to structural sampling produces more realistic graphs and establishes TANGEM as a basis for future models that further integrate evolving signals and structure.

URL: https://openreview.net/forum?id=lDnMlhk3aw

---

Title: Re-evaluating Minimum Bayes Risk Decoding for Automated Speech Recognition Tasks

Authors: Yuu Jinnai

Abstract: While sample-based Minimum Bayes Risk (MBR) decoding has been shown to outperform beam search in many text-to-text generation tasks with modern LLMs, beam search remains the dominant approach for Automatic Speech Recognition (ASR) and Speech Translation (ST). To date, the efficacy of MBR decoding within modern speech systems lacks comprehensive evaluation.
Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks.
In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models, as well as supplementary autoregressive baselines.
We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated.
The results show that MBR decoding is a promising method for ASR and ST tasks that require high accuracy.

URL: https://openreview.net/forum?id=I6iLWhRIsf
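
Sample-based MBR decoding picks, from a set of sampled hypotheses, the one with the lowest expected loss against the other samples. The sketch below uses word-level edit distance as the loss, which is an assumption made for this example; the paper's choice of utility function and sampling setup is not stated in the abstract.

```python
def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two whitespace-tokenized strings."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        curr = [i]
        for j, hw in enumerate(h, start=1):
            cost = 0 if rw == hw else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def mbr_decode(samples):
    """Return the sample with the lowest average edit distance to all samples."""
    best, best_risk = None, float("inf")
    for hyp in samples:
        # Monte Carlo estimate of the expected loss, using the samples as pseudo-references.
        risk = sum(word_edit_distance(ref, hyp) for ref in samples) / len(samples)
        if risk < best_risk:
            best, best_risk = hyp, risk
    return best, best_risk
```

For example, given the samples `"the cat sat"`, `"the cat sat"`, `"a cat sat"`, and `"the cat sad"`, the function selects `"the cat sat"`, the hypothesis closest to the consensus of the sample set.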

---

Title: Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

Authors: Karthik Valmeekam, Vardhan Palod, Kaya Stechly, Atharva Gundawar, Subbarao Kambhampati

Abstract: Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. While these traces certainly seem to help model performance, it is not clear how they actually influence model performance, with some works ascribing semantics to them and others cautioning against relying on them as transparent and faithful proxies of the model's internal computational process. To systematically investigate the role of end-user semantics of derivational traces, we set up a controlled study where we train transformer models from scratch on formally verifiable reasoning traces and the solutions they lead to, constraining both intermediate steps and final outputs to align with those of a formal solver. We notice that, despite significant improvements over the solution-only baseline, models trained on entirely correct traces can still produce invalid reasoning traces even when arriving at correct solutions. More interestingly, our experiments also show that models trained on corrupted traces, whose intermediate reasoning steps bear no relation to the problem they accompany, achieve performance largely comparable to those trained on correct traces. In fact, our corrupted models generalize better on out-of-distribution tasks. We also study the effect of GRPO-based RL post-training on trace validity, noting that while solution accuracy increases, this is not accompanied by any improvements in trace validity. Finally, we examine whether reasoning-trace length reflects inference-time scaling and find that trace length is largely agnostic to the underlying computational complexity of the problem being solved.
These results challenge the assumption that intermediate tokens or "Chains of Thought" reflect or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in language models.

URL: https://openreview.net/forum?id=gDE7YcRC3F

---

Title: Differentiable Cluster Discovery in Temporal Graphs

Authors: Md Hafizur Rahman, Chi-Guhn Lee

Abstract: Existing temporal graph clustering methods suffer from poor optimization dynamics due to reliance on heuristically initialized cluster assignment distribution without considering the dynamic nature of the evolving graph. The target cluster assignment distribution often conflicts with evolving temporal representations, leading to oscillatory gradients and unstable convergence. Motivated by the need for differentiable and adaptive clustering in dynamic settings, we propose TGRAIL (Temporal Graph Alignment and Index Learning), a novel end-to-end framework for temporal graph clustering based on Gumbel–Softmax sampling. TGRAIL enables discrete cluster assignments while maintaining the gradient flow. To ensure stable training, we formulate the clustering objective as an expectation over Monte Carlo samples and show that this estimator is both unbiased and variance-reduced. Furthermore, we incorporate a temporal consistency loss to preserve the order of interactions across time. Extensive experiments on six real-world temporal graph datasets demonstrate that our approach consistently outperforms state-of-the-art baselines, achieving higher clustering accuracy and robustness. Our results validate the effectiveness of jointly optimizing temporal dynamics and discrete cluster assignments in evolving graphs.

URL: https://openreview.net/forum?id=1caZVb6zL7

---

Title: Convex Optimization with Local Label Differential Privacy: Tight Bounds in All Privacy Regimes

Authors: Lynn Chua, Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Ziteng Sun, Chiyuan Zhang

Abstract: We study the problem of Stochastic Convex Optimization (SCO) under the constraint of local Label Differential Privacy (L-LDP). In this setting, the features are considered public, but the corresponding labels are sensitive and must be randomized by each user locally before being sent to an untrusted analyzer. Prior work for SCO under L-LDP (Ghazi et al., 2021) established an excess population risk bound with a *linear* dependence on the size of the label space, $K$: $O\left(\frac{K}{\epsilon\sqrt{n}}\right)$ in the high-privacy regime ($\epsilon \leq 1$) and $O\left(\frac{K}{e^{\epsilon} \sqrt{n}}\right)$ in the medium-privacy regime ($1 \leq \epsilon \leq \ln K$). This left open whether this linear cost is fundamental to the L-LDP model. In this note, we resolve this question. First, we present a novel and efficient non-interactive L-LDP algorithm that achieves an excess risk of $O\left(\sqrt{\frac{K}{\epsilon n}}\right)$ in the high-privacy regime ($\epsilon \leq 1$) and $O\left(\sqrt{\frac{K}{e^{\epsilon} n}}\right)$ in the medium-privacy regime ($1 \leq \epsilon \leq \ln K$). This quadratically improves the dependency on the label space size from $O(K)$ to $O(\sqrt{K})$. Second, we prove a matching information-theoretic lower bound across all privacy regimes for any sufficiently large $n$.

URL: https://openreview.net/forum?id=8Sjm0FrV2u
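
To make the quadratic improvement concrete, the ratio of the two high-privacy bounds can be worked out directly. Constants hidden by the $O(\cdot)$ notation are ignored, and the values $K=100$, $\epsilon=1$, $n=10^{6}$ are illustrative assumptions, not figures from the paper.

```latex
\[
\frac{K/(\epsilon\sqrt{n})}{\sqrt{K/(\epsilon n)}}
  = \frac{K}{\epsilon\sqrt{n}}\cdot\frac{\sqrt{\epsilon n}}{\sqrt{K}}
  = \sqrt{\frac{K}{\epsilon}},
\qquad
\text{e.g. } K=100,\ \epsilon=1,\ n=10^{6}:\quad
\frac{K}{\epsilon\sqrt{n}} = 0.1,
\quad
\sqrt{\frac{K}{\epsilon n}} = 0.01 .
\]
```

The gap between the two bounds therefore grows as $\sqrt{K/\epsilon}$, a factor of 10 in this example.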

---

Title: On Fitting Flow Models with Large Sinkhorn Couplings

Authors: Stephen Y. Zhang, Alireza Mousavi-Hosseini, Michal Klein, marco cuturi

Abstract: Flow models transform data gradually from one modality (e.g. noise) onto another (e.g. images). Such models are parameterized by a time-dependent velocity field, trained to fit segments connecting pairs of source and target points. When the pairing between source and target points is given, training flow models boils down to a supervised regression problem. When no such pairing exists, as is the case when generating data from noise, training flows is much harder. A popular approach lies in picking source and target points independently (Lipman et al., 2023). This can, however, lead to velocity fields that are not only slow to train but also costly to integrate at inference time. In theory, one would greatly benefit from training flow models by sampling pairs from an optimal transport (OT) measure coupling source and target, since this would lead to a highly efficient flow solving the Benamou-Brenier dynamical OT problem. In practice, recent works have proposed to sample mini-batches of $n$ source and $n$ target points and reorder them using an OT solver to form better pairs. These works have advocated using batches of size $n\approx 256$, and considered OT solvers that return couplings that are either sharp (using e.g. the Hungarian algorithm) or blurred (using e.g. entropic regularization, a.k.a. Sinkhorn). We follow in the footsteps of these works by exploring the benefits of increasing this mini-batch size $n$ by three to four orders of magnitude, and look more carefully at the effect of the entropic regularization $\varepsilon$ used in the Sinkhorn algorithm. Our analysis is facilitated by new scale-invariant quantities to report the sharpness of a coupling, while our sharded computations across multiple GPUs or GPU nodes allow scaling up $n$. We show that in both synthetic and image generation tasks, flow models greatly benefit when fitted with large Sinkhorn couplings, with a low entropic regularization $\varepsilon$.

URL: https://openreview.net/forum?id=3MLKJZgY62
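
The mini-batch pairing step can be sketched with a plain Sinkhorn iteration. This is a small dense NumPy illustration, not the sharded multi-GPU solver the abstract refers to; the squared-Euclidean cost, its normalization by the mean, and the function names are assumptions for the example.

```python
import numpy as np

def sinkhorn_coupling(x, y, eps=0.5, n_iters=1000):
    """Entropic OT coupling between two point clouds with uniform weights.

    x: (n, d) source points, y: (m, d) target points.
    eps is expressed relative to the mean pairwise cost (an assumed convention).
    """
    n, m = len(x), len(y)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    cost = cost / cost.mean()
    K = np.exp(-cost / eps)                  # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)                      # rescale to match the row marginals
        v = b / (K.T @ u)                    # rescale to match the column marginals
    return u[:, None] * K * v[None, :]

def sample_pairs(coupling, num_pairs, rng=None):
    """Draw (source index, target index) pairs with probability given by the coupling."""
    rng = np.random.default_rng() if rng is None else rng
    flat = rng.choice(coupling.size, size=num_pairs, p=(coupling / coupling.sum()).ravel())
    return np.unravel_index(flat, coupling.shape)
```

Smaller values of `eps` concentrate the coupling on fewer pairs, at the price of more iterations and, in this naive form, a risk of numerical underflow; log-domain implementations are the usual remedy.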

---

Title: Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs

Authors: Jiaqi Lin, Malyaban Bal, Abhronil Sengupta

Abstract: Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To address the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates intermediate error signals to enhance information flow and convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, paving the way for its application in real-world systems.

URL: https://openreview.net/forum?id=iXFmzKpPNA

---

Title: Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation

Authors: Hongsin Lee, Hye Won Chung

Abstract: Adversarial distillation in the standard min–max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students–a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability–the fraction of student-crafted adversarial examples that remain effective against the teacher–as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods.

URL: https://openreview.net/forum?id=ek45VamPCE

---

Title: Analyzing Best-Response Dynamics for Cooperation in Markov Potential Games

Authors: Dingyang Chen, Xiaoling Zeng, Thinh T. Doan, Qi Zhang

Abstract: Simultaneous gradient updates are widely used in multi-agent learning. However, this method introduces non-stationarity from the perspective of each agent due to the co-evolution of other agents' policies. To address this issue, we consider best-response dynamics, where only one agent updates its policy at a time. We theoretically show that with best-response dynamics, convergence results from single-agent reinforcement learning extend to Markov potential games (MPGs). Moreover, building on the concept of price of anarchy and smoothness from normal-form games, we aim to find policies in MPGs that achieve optimal cooperation and provide the first known suboptimality guarantees for policy gradient variants under the best-response dynamics. Empirical results demonstrate that the best-response dynamics significantly improves cooperation across policy gradient variants in classic and more complex games.

URL: https://openreview.net/forum?id=klFSzxt4MC

---

Title: SpecEval: Evaluating Model Adherence to Behavior Specifications

Authors: Ahmed M Ahmed, Kevin Klyman, Yi Zeng, Sanmi Koyejo, Percy Liang

Abstract: Companies that develop foundation models often publish behavioral guidelines they pledge their models will follow, but it remains unclear whether models actually do so, since there has been no systematic audit of adherence to these guidelines. We propose a simple but important baseline: at minimum, a foundation model should consistently satisfy its developer's own behavioral specifications when judged by the developer's own evaluator models. We focus on \emph{three-way consistency}: the relationship between a provider's specification, the provider's model outputs, and adherence scores from the provider model as a judge, extending prior two-way generator-validator consistency. We introduce an automated framework that audits models against their providers' specifications by (i) parsing statements that delineate desired behaviors, (ii) generating targeted prompts to elicit the aforementioned behaviors, and (iii) using the responses as inputs to models to judge adherence. We apply our framework to 16 models from six developers across 100+ behavioral statements, finding three-way consistency gaps of up to 20\% across providers, as measured by each provider's own model acting as judge.

URL: https://openreview.net/forum?id=VzLIQ3Lqm9

---

Title: Debiasing Diffusion Models via Score Guidance

Authors: Piyush Tiwary, Prabhav Verma, Prathosh AP

Abstract: With the increasing use of Diffusion Models (DMs) in everyday applications, it is very important to ensure that these models are \textit{fair} towards various demographic/societal groups.
However, for several reasons, DMs inherit biases towards specific genders, races, and communities, which can perpetuate and amplify societal inequities.
Hence, it is important to \textit{debias} DMs.
Previous debiasing approaches require additional reference data, model fine-tuning, or auxiliary classifier training, each of which incurs additional cost. In this work, we provide a training-free inference-time method for debiasing diffusion models. First, we provide a theoretical explanation for the cause of biases exhibited by DMs. Specifically, we show that the unconditional score predicted by the denoiser can be expressed as a convex combination of conditional scores corresponding to the attributes under consideration. We then argue that the weights allocated to underrepresented attributes are smaller, which leads to the domination of other attributes in the overall score function. Building on this, we propose a score-guidance method that adheres to a user-provided reference distribution for generation. Moreover, we show that this score guidance can be achieved via different modalities like `text' and `exemplar images'. To our knowledge, our method is the first to provide a debiasing framework that can utilize different modalities for diffusion models. We demonstrate the effectiveness of our method across various attributes on both unconditional and conditional text-based diffusion models, including Stable Diffusion.

URL: https://openreview.net/forum?id=vAz8xUHyTe
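
The convex-combination claim follows from the standard mixture identity for scores. The derivation below is a sketch in assumed notation (attribute values $a$ with prior $\pi_a$, noisy marginal $p_t$); the paper's own notation and conditions may differ.

```latex
\[
p_t(x) = \sum_{a} \pi_a\, p_t(x \mid a)
\;\Longrightarrow\;
\nabla_x \log p_t(x)
  = \sum_{a} \frac{\pi_a\, \nabla_x p_t(x \mid a)}{p_t(x)}
  = \sum_{a} \underbrace{\frac{\pi_a\, p_t(x \mid a)}{p_t(x)}}_{w_a(x)\,=\,p_t(a \mid x)}
    \nabla_x \log p_t(x \mid a),
\qquad
w_a(x) \ge 0,\quad \sum_{a} w_a(x) = 1 .
\]
```

Since $w_a(x)$ is proportional to $\pi_a$, an attribute that is rare in the training data receives a small weight, which matches the abstract's explanation of why dominant attributes drive the overall score.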

---

Title: Reranker Optimization via Geodesic Distances on k-NN Manifolds

Authors: WEN G GONG

Abstract: Current neural reranking approaches for retrieval-augmented generation (RAG) rely on cross-encoders or large language models (LLMs), requiring substantial computational resources and exhibiting latencies of 3–5 seconds per query. We propose Maniscope, a geometric reranking method that computes geodesic distances on k-nearest neighbor (k-NN) manifolds constructed over retrieved document candidates. This approach combines global cosine similarity with local manifold geometry to capture neighborhood coherence within the candidate set that global pairwise similarity alone cannot model. Evaluated on 15 BEIR benchmark datasets (∼25,000 queries spanning scientific, biomedical, financial, web search, and fact-verification domains), Maniscope achieves 0.9806 average NDCG@10, ranking best on 13 of 15 datasets and outperforming HNSW (0.9673) and three established graph-diffusion baselines (0.7326–0.7630) at 13 ms average latency, 1.8× faster than HNSW (23.7 ms). The algorithm requires $O(ND + M^2 D + Mk\log k)$ complexity with $M \ll N$. Code and data are released as open source.

URL: https://openreview.net/forum?id=HvzgEt51f2

---

Title: Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Authors: Washim Uddin Mondal, Vaneet Aggarwal

Abstract: This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement over the state-of-the-art last-iterate guarantees of general parameterized CMDPs.

URL: https://openreview.net/forum?id=JedrMCZC6l

---

Title: Nested Slice Sampling: Vectorized Nested Sampling for GPU-Accelerated Inference

Authors: David Yallup, Namu Kroupa, Will Handley

Abstract: Model comparison and calibrated uncertainty quantification often require integrating over parameters, but scalable inference can be challenging for complex, multimodal targets. Nested Sampling is a robust alternative to standard MCMC, yet its typically sequential structure and hard constraints make efficient accelerator implementations difficult. This paper introduces Nested Slice Sampling (NSS), a GPU-friendly, vectorized formulation of Nested Sampling that uses Hit-and-Run Slice Sampling for constrained updates. A tuning analysis yields a simple near-optimal rule for setting the slice width, improving high-dimensional behavior and making per-step compute more predictable for parallel execution. Experiments on challenging synthetic targets, high-dimensional Bayesian inference, and Gaussian process hyperparameter marginalization show that NSS maintains accurate evidence estimates and high-quality posterior samples, and is particularly robust on difficult multimodal problems where current state-of-the-art methods such as tempered SMC baselines can struggle. An open-source implementation is released to facilitate adoption and reproducibility.

URL: https://openreview.net/forum?id=5mF2eRl3gt

---

Title: Identifiable Latent Bandits: Leveraging observational data for personalized decision-making

Authors: Ahmet Zahid Balcıoğlu, Newton Mwai, Emil Carlsson, Fredrik D. Johansson

Abstract: Sequential decision-making algorithms such as multi-armed bandits can find optimal personalized decisions, but are notoriously sample-hungry. In personalized medicine, for example, training a bandit from scratch for every patient is typically infeasible, as the number of trials required is much larger than the number of decision points for a single patient. To combat this, latent bandits offer rapid exploration and personalization beyond what context variables alone can offer, provided that a latent variable model of problem instances can be learned consistently. However, existing works give no guidance as to how such a model can be found. In this work, we propose an identifiable latent bandit framework that leads to optimal decision-making with a shorter exploration time than classical bandits by learning from historical records of decisions and outcomes. Our method is based on nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances. We verify this strategy in simulated and semi-synthetic environments, showing substantial improvement over online and offline learning baselines when identifying conditions are satisfied.

URL: https://openreview.net/forum?id=SvkZ76wKpu

---

Title: Not All CAMs Are Complete: Completeness as the Key to Faithfulness

Authors: Vincenzo Buono, Peyman Sheikholharam Mashhadi, Mahmoud Rahat, Prayag Tiwari, Stefan Byttner

Abstract: Although input-gradient techniques have evolved to mitigate the challenges associated with gradients, modern gradient-weighted CAM approaches still rely on vanilla gradients, which are inherently susceptible to the saturation phenomena. Despite recent enhancements that incorporate counterfactual gradient strategies as a mitigating measure, these local explanation techniques still exhibit a lack of sensitivity to their baseline parameter. Our work introduces a general distributional framework for gradient-based CAMs that recovers Integrated Grad-CAM and SmoothGrad-CAM as special cases of a single perturbation distribution, and from which we derive optimal weights minimizing explanation infidelity, an optimality we prove is governed by completeness as both a necessary and sufficient axiom. Consequently, methods that violate completeness, such as SmoothGrad-based variants, are provably suboptimal. Our technique, Expected Grad-CAM, instantiates this optimum via Expected Gradients and data-aware perturbations, purposefully designed as an enhanced substitute of the foundational Grad-CAM algorithm and any method built therefrom. By revisiting the original formulation as the smoothed expectation of the perturbed integrated gradients, one can concurrently construct more faithful, localized and robust explanations; through fine modulation of the perturbation distribution, it is possible to regulate the explanation complexity by selectively discriminating stable features. Quantitative and qualitative evaluations have been conducted to assess the effectiveness of our method.

URL: https://openreview.net/forum?id=NeeGBwXNs5

---

Title: Regret minimization in Linear Bandits with offline data via extended D-optimal exploration.

Authors: Sushant Vijayan, Arun Suggala, Karthikeyan Shanmugam, Soumyabrata Pal

Abstract: We consider the problem of online regret minimization in stochastic linear bandits with access to prior observations (\emph{i.e.,} offline data) from the underlying bandit model. This setting is highly relevant to numerous applications where extensive offline data is often available, such as in recommendation systems, personalized healthcare, and online advertising. Consequently, this problem has been studied intensively in recent works such as~\cite{banerjee2022artificial, wagenmaker2022leveraging, agrawal2023optimal,hao2023leveraging,cheung2024leveraging}. We introduce the Offline-Online Phased Elimination (OOPE) algorithm, which effectively incorporates the offline data to substantially reduce the online regret compared to prior work. To leverage offline information prudently, OOPE uses an extended D-optimal design within each exploration phase. We show that OOPE achieves an online regret of $\tilde{O}(\sqrt{d_{\text{eff}} T \log \left(|\mathcal{A}|T\right)}+d^2)$, where $\mathcal{A}$ is the action set, $d$ is the dimension and $T$ is the online horizon. $d_{\text{eff}} \hspace{0.1cm} (\leq d)$ is the \emph{effective problem dimension} which measures the number of poorly explored directions in offline data and depends on the eigen-spectrum $(\lambda_k)_{k \in [d]}$ of the Gram matrix of the offline data. Thus the eigen-spectrum $(\lambda_k)_{k \in [d]}$ is a quantitative measure of the \emph{quality} of offline data. If the offline data is poorly explored ($d_{\text{eff}} \approx d$), we recover the established regret bounds for purely online linear bandits. Conversely, when offline data is abundant ($T_{\text{off}} \gg T$) and well-explored ($d_{\text{eff}} = o(1) $), the online regret reduces substantially. Additionally, we provide the first known minimax regret lower bounds in this setting that depend explicitly on the quality of the offline data. These lower bounds establish the optimality of our algorithm (optimal within log factors in $T, T_{\text{off}}$ and additive constants in $d$) in regimes where offline data is either well-explored or poorly explored. Finally, by using a Frank-Wolfe approximation to the extended optimal design we further improve the $O(d^{2})$ term to $O\left(\frac{d^{2}}{d_{\text{eff}} } \min \{ d_{\text{eff}},1\} \right)$, which can be substantial in high dimensions with moderate quality of offline data $d_{\text{eff}} = \Omega(1)$.

URL: https://openreview.net/forum?id=4WcK8gKgCi

---

Title: Domain Indexing Collaborative Filtering for Recommender Systems

Authors: Rohit Amarnath, Zihao Xu, Qi Xu, Zhigang Hua, Yan Xie, Shuang Yang, Bo Long, Hao Wang

Abstract: In cross-domain recommendation systems, addressing cold-start items remains a significant challenge. Previous methods typically focus on maximizing performance using cross-domain knowledge, often treating the knowledge transfer process as a black box. However, the recent development of domain indexing introduces a new approach to better address such challenges. We have developed an adversarial Bayesian framework, Domain Indexing Collaborative Filtering (DICF), that infers domain indices during cross-domain recommendation. This framework not only significantly improves the recommendation performance but also provides interpretability for cross-domain knowledge transfer. This is verified by our empirical results on both synthetic and real-world datasets.

URL: https://openreview.net/forum?id=2Wvpq5M42E

---

Title: A Systematic Assessment of Weak-to-Strong Confidence Prediction in Large Language Models

Authors: Tracy Yixin Zhu, Yukai Yang, Marco Morucci, Tim G. J. Rudner

Abstract: As large language models (LLMs) are deployed in increasingly diverse applications, understanding their capacity through uncertainty quantification (UQ) is crucial for ensuring safe and reliable behavior. Reliable uncertainty estimates that accompany the text generated by an LLM can signal when a response is likely to be incorrect and thus serve as an effective fail-safe mechanism against hallucinations. We study the extent to which a smaller and weaker open-access model, using only question embeddings and a lightweight probe, can predict the probability that a stronger black-box generator answers a query correctly. Across six benchmarks, two generators, and fifteen open-access predictors, we find that this simple approach provides useful confidence estimates: embeddings from models as small as Llama3-8b achieve 83.4\% AUROC on TriviaQA and 64.3\% on MMLU, and improve selective generator accuracy by up to 17.9\%. Our analysis shows that performance is not determined by predictor size alone, but depends more strongly on representational compatibility between weak model embeddings and strong model correctness. The signal is robust to decoding configurations, label imbalance, and embedding aggregation choices, but is weaker on reasoning-heavy benchmarks such as SuperGPQA and transfers poorly across datasets. These findings suggest that weak-to-strong probes are best viewed as lightweight in-distribution confidence estimators: after generator-based labels are collected for training, they provide efficient deployment-time uncertainty estimates without repeated generator sampling. Overall, our results provide a systematic baseline for studying scalable oversight of black-box LLMs. Our code and data are available at: https://github.com/YukaiYang0803/w2s-confidence-prediction.

URL: https://openreview.net/forum?id=xYSzkg5qPD
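
The two quantities reported in the abstract above, AUROC of a confidence score against generator correctness and accuracy under selective answering, can be illustrated with a small sketch. The scores and labels below are made-up toy values, and `auroc` and `selective_accuracy` are hypothetical helper names, not the authors' code; the sketch only shows how such metrics are typically computed.

```python
def auroc(scores, labels):
    """Probability that a correct answer gets a higher score than an incorrect one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def selective_accuracy(scores, labels, coverage):
    """Accuracy when only the most confident `coverage` fraction is answered."""
    k = max(1, round(coverage * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Toy data (assumed): probe confidence and whether the generator was correct.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(auroc(scores, labels))
print(selective_accuracy(scores, labels, coverage=1.0))
print(selective_accuracy(scores, labels, coverage=0.5))
```

On this toy data, answering only the more confident half of the queries raises accuracy relative to answering everything, which is the kind of gain selective generation relies on.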

---

New submissions
===============

Title: When Should a Principal Delegate to an Agent in Selection Processes?

Abstract: Decision-makers in high-stakes selection processes often face a fundamental choice: whether to make decisions themselves or to delegate authority to another entity whose incentives may only be partially aligned with their own. Such delegation arises naturally in settings like graduate admissions, hiring, or promotion, where a principal (e.g. a professor or manager) either reviews applicants personally and makes decisions or decisions are delegated to an agent (e.g. a committee or third-party or AI agent).

The principal has the expertise to conduct holistic evaluations of applicants (even accounting for factors like team fit), but incurs a cost for every application reviewed. In contrast, the agent can review a large volume of applications efficiently, greatly lowering the principal's costs. However, the agent's evaluation is on the basis of a signal that is only correlated with the principal's metric but may be potentially misaligned, diminishing the expected quality of selected applicants. We study this fundamental trade-off in a stylized selection model with noisy signals.
Our goal is to characterize when delegation is beneficial versus when decision-making should remain with the principal. We compare these regimes along three dimensions: (i) the principal’s utility; (ii) the quality of the selected applicants according to the principal's metric; and (iii) the fairness of selection outcomes under disparate signal qualities.

URL: https://openreview.net/forum?id=qQitnIBsLA

---

Title: Gradient Heterogeneity Complements Hessian Heterogeneity in Transformer Optimization

Abstract: Transformers are difficult to optimize with stochastic gradient descent (SGD) and largely rely on adaptive optimizers such as Adam. Despite extensive efforts, a theoretical explanation for Adam's advantage over SGD in Transformer optimization is still incomplete. In this study, we analyze the optimization of Transformer models in the fine-tuning setting through the lens of gradient heterogeneity, defined as the variation in gradient norms across parameter blocks. We provide a theoretical analysis showing that gradient heterogeneity, together with Hessian heterogeneity, degrades the convergence of gradient-based methods such as SGD, while sign-based methods are substantially less sensitive to this effect. Adam's coordinate-wise normalization makes its update directions depend mainly on gradient signs, so Adam can be interpreted as a soft variant of SignSGD. Our analysis uses the fact that SGD and SignSGD follow steepest descent directions under different norms, and derives upper bounds on the iteration complexity with implications for learning rate scaling of SignSGD. We further investigate the origin of gradient heterogeneity in Transformer architectures and show that it is strongly influenced by the placement of layer normalization, with Post-LN architectures exhibiting particularly pronounced heterogeneity. Experimental results from fine-tuning Transformers in both NLP and vision domains validate our theoretical analysis.

URL: https://openreview.net/forum?id=wZJcQb5m1e
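
As a rough illustration of the contrast drawn in the abstract above, the sketch below compares a plain SGD step with a sign-based step when gradient norms differ widely across parameter blocks. The block names, gradient values, and the max/min norm ratio used as a heterogeneity summary are assumptions made for this example, not the paper's definitions.

```python
import math


def block_norms(grads):
    """Euclidean norm of the gradient in each parameter block."""
    return {name: math.sqrt(sum(g * g for g in block)) for name, block in grads.items()}


def sgd_step(grads, lr):
    """Plain SGD update: proportional to the raw gradient."""
    return {name: [-lr * g for g in block] for name, block in grads.items()}


def sign_step(grads, lr):
    """SignSGD update: every nonzero coordinate moves by exactly lr."""
    def sign(g):
        return (g > 0) - (g < 0)
    return {name: [-lr * sign(g) for g in block] for name, block in grads.items()}


# Hypothetical per-block gradients with very different scales.
grads = {
    "embedding": [1e-4, -2e-4],
    "attention": [0.5, -0.3],
    "head": [20.0, 15.0],
}

norms = block_norms(grads)
heterogeneity = max(norms.values()) / min(norms.values())  # crude summary (assumed)

print(heterogeneity)
print(sgd_step(grads, lr=0.01))   # the "head" block dominates the update
print(sign_step(grads, lr=0.01))  # all blocks move by the same magnitude
```

With a single learning rate, the SGD step is either too large for the high-norm block or negligible for the low-norm block, while the sign-based step has the same magnitude in every coordinate.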

---

Title: Reproducibility Failures in Deep Learning for Variant Calling: A Four-Pronged Case Study

Abstract: Deep learning approaches to genomic variant calling are increasingly reported in the literature, often with striking accuracy improvements claimed over classical pipelines. We examine the methodology underlying such claims through a four-pronged case study built around a single binary classifier on the Genome in a Bottle (GIAB) HG001 benchmark. An initial analysis of our own pipeline produced an apparently rigorous result, synthetic-data $F_1 = 0.994$ for Focal Loss versus $0.975$ for binary cross-entropy, with a precision-dependent training-collapse pattern (24\%/18\%/0\% across FP32/BF16/FP16) on real GIAB data over 50 random seeds. A subsequent detailed analysis tested each load-bearing component independently. We find that (i) the synthetic-to-real generalization gap is severe and inverts the loss-function ranking; on real data Focal Loss collapses to $F_1=0$ while BCE achieves $F_1 \in [0.27, 0.34]$; (ii) the proposed mechanism explaining the precision-collapse pattern (gradient-noise-as-implicit-regularization) fails under controlled testing with both round-to-nearest and stochastic rounding; (iii) the feature pipeline used in the initial analysis contains a structural label leak by construction; and (iv) the precision-collapse pattern itself does not survive faithful re-implementation: across 150 trainings (50 seeds $\times$ 3 precisions $\times$ 30 epochs), zero collapses occur in any precision (Fisher exact $p < 10^{-3}$ versus the initial counts). Each individual finding has a plausible benign explanation; their conjunction in a methodology that appeared rigorous is the contribution of this work. We articulate four specific evaluation pitfalls implied by the case study and propose a minimal protocol to detect them prospectively.

URL: https://openreview.net/forum?id=6kBqfCosog

---

Title: Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Long-Horizon Workflows

Abstract: Long-horizon tool-using tasks sometimes benefit from revisiting earlier subtasks, but explicit revisitation also adds routing, coordination, and token cost. We study complete cyclic subtask graphs for large language model (LLM) agents: a workflow controller in which executable subtasks are fully connected and a unified state-analysis-and-routing agent selects transitions from natural-language criteria. We evaluate task-specific (Spec-Cyc) and benchmark-generic (Gen-Cyc) cyclic graphs on TextCraft, ALFWorld, and Finance-Agent against ReAct and dependency-directed acyclic workflows. The results expose three task regimes. TextCraft behaves like a prerequisite-chain setting, where cyclic routing often adds overhead. ALFWorld behaves like a partially observable recovery setting, where explicit revisitation improves exploration and success. Finance-Agent behaves like an open-ended evidence-synthesis setting, where retrieval, grounding, and synthesis bottlenecks limit all controllers. We add a task-regime selection matrix, fault-injection robustness analysis, token-cost accounting, and scaling discussion. Overall, complete cyclic subtask graphs are best understood as a diagnostic workflow-control tool: they quantify when flexible backtracking is worth its cost and when simpler or sparsified controllers are preferable.

URL: https://openreview.net/forum?id=FAkarhXCfI

---

Title: A new perspective on the nature of dropout

Abstract: In this work, we explore the average behavior of the learning process with dropout in the contexts of linear regression, generalized linear models, matrix factorization, and fully-connected neural networks with dropout in the last layer. Initially, we find that the average behavior does not distinguish the original dropped-out quantity. The implication of this is that the dropout-induced regularization and optimization are ambiguous from the perspective of the average behavior. To resolve this, we reformulate the average behavior based on the elementary operations that a practitioner is able to apply in the learning process with dropout. Then, we disambiguate the dropout-induced regularization and optimization from the perspective of each reformulation. In the context of linear regression, we show that all of the reformulations result in the same predictions at test time, where the invariant in these predictions is the square of the coefficient of variation of the dropout distribution. More broadly, we demonstrate that the penalty term under dropout depends on the data, parameters, and predictions at train time, when the mean of the dropout distribution is not equal to one.

URL: https://openreview.net/forum?id=X5hYaYn7QM

---

Title: Agentic Subjective Q-Learning Equilibrium

Abstract: In many applications, agents/decision makers take part in systems with very complex dynamics and they respond by inevitably making incorrect modeling assumptions. In this context, we define the concept of Agentic Subjective Q-Learning Equilibrium as an equilibrium concept where each agent uses local/partial information in their learning algorithm, as if the partial information constitutes an approximate Markov model. A distinguishing feature of such a setup is that the exploration policy used for learning impacts the perceived model and there is a dual dependence of the induced cost on the agent policies: By noting an equivalence with empirical model learning, it follows that an exploration policy generates the sample path which induces a model (which depends on the exploration policy), and the model is used to obtain an optimal policy (for the learned model) either via reinforcement learning or empirical learning. This then leads to the question on existence of a fixed point equation involving learning and exploration. An agentic subjective learning equilibrium policy is thus defined as a policy which is self-confirming: the model induced by the policy has the policy as an optimal solution. We establish an existence result on equilibria critically building on continuous dependence of invariant measures on policies under a suitable control topology. We then present an associated learning/convergence theorem to $\epsilon$-equilibria via policy revision dynamics. We show implications for symmetric dynamic games (including mean-field games), weakly acyclic games (including potential games), and generalized weakly acyclic games.

URL: https://openreview.net/forum?id=793RRQwlLz

---

Title: Structure Over Scale: Rethinking Adaptation for Reinforcement Learning with Verifiable Rewards

Abstract: The standard justification for Full Fine-Tuning (FFT) in Reinforcement Learning with Verifiable Rewards (RLVR) rests on a reasonable intuition: reasoning requires expressive weight updates that Low-Rank Adaptation (LoRA) cannot provide. We show this intuition identifies the wrong variable. Through a systematic rank sweep under GRPO, we document *rank collapse*---a discontinuous performance cliff where increasing LoRA rank beyond a threshold causes catastrophic, irrecoverable policy failure, a phenomenon absent from the SFT literature. A batch-size ablation shows that this failure is not rescued by larger batches under the same one-epoch cold-start GRPO protocol: LoRA ranks $128$ and $256$ remain near floor across batch sizes $64$, $128$, and $256$, while rank $64$ itself falls from $73.1%$ at batch size $64$ to $8.7%$ and $6.0%$ at batch sizes $128$ and $256$. This failure is not generic undertraining: LoRA $r=8$, DoRA $r=16$, and QuanTA $d=3$ remain trainable under the same larger-batch regimes. Spectral analysis suggests a mechanism: collapsed high-rank adapters concentrate update energy into a small number of singular directions, consistent with degenerate optimization rather than distributed reasoning improvement. FFT shows a milder version of the same spectral concentration pattern, achieving lower effective rank than structured adapters despite updating far more parameters. Expressivity alone is therefore not the bottleneck; the structure of the update manifold is. Structured adapters that constrain which high-rank solutions are reachable by gradient descent outperform LoRA and FFT on our primary DeepMath-Hard comparison and remain more robust under the larger-batch stress tests. Across three 8B base models, the relative behavior of low-rank and structured high-rank adapters also correlates with frozen-weight spectral structure and reported pre-training scale, a pattern we term the Model Maturity Hypothesis. 
We present this as a falsifiable hypothesis rather than a causal law: architecture, tokenizer, and data mixture remain confounded with pre-training scale in the current model set. The operative question for RLVR is not simply whether to use LoRA or FFT, but what structure to impose over the update manifold under a given model, task, and optimization budget.

URL: https://openreview.net/forum?id=17kr8cCW7s

---

Title: A Protocol-Fixed Information-Theoretic Reporting Framework for Evidence-Weak Hallucination in Large Language Models

Abstract: This paper introduces a protocol-fixed information-theoretic reporting framework for evidence-weak hallucination comparisons in large language models (LLMs). The framework uses sequence-level information energy, variational free energy, KL decomposition, and coarse-graining-induced information loss to define a closed reporting object for controlled evidence-weakness comparisons. Its contribution is the protocol-fixed closure of that object, rather than a new free-energy identity, KL theorem, or scalar hallucination score. Within a declared protocol block, it fixes the admissibility conditions for the comparison coordinates (Delta E, T_eff, c0), the apply/not-apply gate, the relative identifiability convention, and a mechanically auditable artifact interface. The completed evidential anchor is a fully observable synthetic toy implementation showing that the declared comparison object can be instantiated, logged, filtered, aggregated, and assigned an explicit status under controlled evidence weakening. The result is a bounded reporting framework for auditable, comparable, and explicitly stoppable evidence-weakness comparisons under a declared protocol. It is a reporting framework rather than a deployment-time detector, mitigation method, protocol-free law, or completed real-LLM validation study.

Keywords
large language models, hallucination, information theory, variational free energy, protocol-fixed evaluation, uncertainty, auditability

URL: https://openreview.net/forum?id=GrzrPoRfEH

---

Title: ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity

Abstract: Large Language Models (LLMs) have achieved remarkable capabilities, but their immense computational demands during training remain a critical bottleneck for widespread adoption. Low-rank training has received attention in recent years due to its ability to significantly reduce training memory usage. Meanwhile, applying 2:4 structured sparsity to weights and activations to leverage NVIDIA GPU support for 2:4 structured sparse format has become a promising direction. However, existing low-rank methods often leave activation matrices in full-rank, which dominates memory consumption and limits throughput during large-batch training. Furthermore, directly applying sparsity to weights often leads to non-negligible performance degradation. To achieve efficient pre-training of LLMs, this paper proposes ELAS: Efficient pre-training of Low-rank LLMs via 2:4 Activation Sparsity, a novel framework for low-rank models via 2:4 activation sparsity. ELAS applies squared ReLU activation functions to the feed-forward networks in low-rank models and implements 2:4 structured sparsity on the activations after the squared ReLU operation. We evaluated ELAS through pre-training experiments on LLaMA models ranging from 60M to 1B parameters. The results demonstrate that ELAS maintains performance with minimal degradation after applying 2:4 activation sparsity, while achieving training and inference acceleration. Moreover, ELAS reduces activation memory overhead, particularly with large batch sizes. Code will be made available.

URL: https://openreview.net/forum?id=yUsbucclgS
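
To make the 2:4 pattern mentioned in the abstract above concrete, here is a minimal sketch that applies a squared ReLU and then keeps two values in every contiguous group of four. Selecting the two largest-magnitude entries is an assumption borrowed from common 2:4 pruning practice; the abstract does not state how ELAS chooses which activations to keep, and the function names are hypothetical.

```python
def squared_relu(x):
    """Squared ReLU: max(x, 0) squared."""
    return max(x, 0.0) ** 2


def prune_2_of_4(row):
    """Keep the 2 largest-magnitude values in each contiguous group of 4; zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value (assumed rule).
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


# Toy pre-activations for one row of a feed-forward layer (assumed values).
pre_activation = [0.5, -1.0, 2.0, 0.1, -0.3, 0.4, 0.2, 3.0]
activated = [squared_relu(v) for v in pre_activation]
sparse = prune_2_of_4(activated)

print(activated)
print(sparse)
```

Because squared ReLU already zeroes every negative input, many groups contain zeros before pruning, so enforcing the 2:4 pattern discards comparatively little activation mass in this toy case.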

---

Title: Multi-Level Spatial Embedding Sharing for Enhanced Online Trajectory-User Linking

Abstract: Trajectory-User Linking (TUL) is a critical task in mobility applications that links unlabeled spatial trajectories to the users or entities that generated them. In these applications, data often arrives as a continuous stream and may experience distributional shifts over time. While adapting TUL models via online learning could address these challenges, this approach remains unexplored in current research. Our work bridges this gap by conducting comprehensive evaluations of common TUL techniques in an online learning context. To improve the performance of existing TUL techniques in this setting, we propose Multi-Level Spatial Embedding Sharing (MiLES), an embedding approach that adapts and extends the principle of multi-scale spatial sharing for online TUL. MiLES partially shares embeddings across neighborhoods of multiple size levels, enabling generalization within neighborhoods while maintaining fine-grained discrimination through more location-specific representations. MiLES also significantly reduces the number of embedding parameters, leading to lower memory usage and more computationally efficient model updates. We further incorporate learnable weighting parameters for each embedding level, allowing the model to learn the influence of different levels during training. Our experimental results on several real-world datasets show that integrating MiLES into state-of-the-art TUL models significantly improves their performance in online learning scenarios, yielding relative gains in top-1 accuracy of up to 24\%, with consistent improvements observed across other training paradigms as well. However, the online gains are particularly relevant, as our findings suggest that online learning is the most suitable paradigm for real-time TUL on streaming data, outperforming periodic batch retraining at substantially lower computational cost. To demonstrate its general applicability, we also evaluate MiLES on the task of destination prediction, where it provides consistent performance improvements, confirming its value as a domain-general embedding technique. Our code is available at https://anonymous.4open.science/r/MiLES-3D20.

URL: https://openreview.net/forum?id=YfyOCduclR

---

Title: When Pointwise Forecast Errors Are Not Enough: An Empirical Study of Temporal Alignment Metrics for Time Series Forecasting

Abstract: Mean squared error (MSE) and mean absolute error (MAE) are the standard metrics used to evaluate time series forecasting models. Although these metrics are useful, they compare predictions and ground truth at fixed timestamps and can miss important failures on rapidly varying series. In particular, a model may obtain a strong MSE or MAE while smoothing sharp peaks, missing deep troughs, shifting ridges in time, or delaying abrupt changes. This paper studies this issue empirically by evaluating five forecasting models (DLinear, PatchTST, TimeMixer, iTransformer, and Chronos-2) using MSE, MAE, Dynamic Time Warping (DTW), and the Temporal Distortion Index (TDI). We compare these metrics on standard forecasting benchmarks and scientific network telemetry from ESnet, with emphasis on cases where local extrema and short-term temporal structure are important. Our results show that pointwise errors can give an incomplete view of model behavior: some forecasts score well under MSE and MAE while visibly smoothing or shifting peaks and troughs, whereas other forecasts better preserve local structure but receive worse pointwise scores. DTW and TDI help expose these differences by measuring shape similarity and temporal misalignment, respectively. We do not argue that DTW and TDI should replace MSE and MAE or that they are sufficient for every forecasting task. Rather, we show that they are useful diagnostic metrics when the timing and shape of peaks, troughs, and ridges matter.

URL: https://openreview.net/forum?id=NhLor0HOUK
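
The disagreement between pointwise and alignment-based metrics described in the abstract above can be reproduced on a toy series. The sketch below uses a textbook dynamic-programming DTW with absolute-difference cost; the series are invented for illustration and TDI is not implemented here.

```python
def mse(a, b):
    """Mean squared error between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def dtw(a, b):
    """Classic O(len(a) * len(b)) DTW distance with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


truth = [0, 0, 0, 5, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 5, 0, 0, 0]  # correct peak, one step late
smooth = [0, 0, 1, 2, 1, 0, 0, 0]   # flattened peak at the right time

print(mse(truth, shifted), mse(truth, smooth))
print(dtw(truth, shifted), dtw(truth, smooth))
```

Here MSE prefers the smoothed forecast, while DTW prefers the forecast that keeps the full peak but delays it by one step. DTW deliberately forgives the timing error, which is why a separate temporal-misalignment measure is useful alongside it.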

---

Title: MANAR: Memory-augmented Attention with Navigational Abstract Conceptual Representation

Abstract: We introduce MANAR, a linear-time attention layer that can directly inherit weights from a pretrained Transformer's multi-head attention (MHA) — a property that distinguishes it from existing linear-time alternatives such as Mamba, RetNet, and Linear Attention, which require training from scratch and therefore forfeit access to the representational capital accumulated in large pretrained Transformers. MANAR augments MHA with a trainable external memory and a constant-size Abstract Conceptual Representation (ACR), a design inspired by the global-workspace bottleneck described in cognitive models of perception. The architecture follows a two-stage logic: (i) an integration phase, in which retrieved memory concepts are combined with the input sequence to form the ACR, a compact global state of the input; and (ii) a broadcasting phase, in which the ACR informs the contextualization of each token together with a local context window, replacing all-to-all attention. Routing global information through a constant-sized ACR yields strictly linear time and memory complexity when the local context window is fixed. Because MANAR preserves the semantic roles of the standard MHA projections, knowledge transfer from pretrained transformers reduces to a direct weight-copy, and we show that transferred models recover and then exceed the accuracy of their sources at a fraction of the from-scratch training budget. MANAR also enables non-convex contextualization: outputs can lie outside the convex hull of the input value vectors, a property we measure empirically and that quadratic softmax attention does not exhibit. Across language, vision, and speech, MANAR is competitive with strong baselines (GLUE 85.1, ImageNet-1K 83.9% top-1, LibriSpeech 2.7%, 6.4% WER) while delivering up to 14.8x single-layer latency reduction and 9.3x peak GPU memory reduction at 4,096 tokens versus quadratic MHA.

URL: https://openreview.net/forum?id=kWxjoZYW33

---

Title: BYOL: Bring Your Own Language Into LLMs

Abstract: Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (<100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and diminished accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework that enables scalable, language-aware LLM development tailored to each language's digital footprint. BYOL begins with a language resource classification—mapping languages into four tiers (Extreme-Low, Low, Mid, High) based on curated web-scale corpora, and uses this classification to determine the appropriate integration strategy. For low-resource languages, we propose a full-stack data refinement and expansion pipeline, combining corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Māori, this pipeline yields two language-specific LLMs that achieve ~12% average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, demonstrating with Inuktitut that a tailored MT system can deliver +4 BLEU improvement over a commercial baseline, enabling high-accuracy LLM access in settings where direct modeling is otherwise infeasible. Our results show that BYOL offers a practical, extensible, and data-efficient recipe for expanding LLM capabilities to the long tail of the world's languages. We will release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Māori, and Inuktitut, and make our codebase and models publicly available.

URL: https://openreview.net/forum?id=kp9xhcz2y4

---

Title: Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data

Abstract: Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset's utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.

URL: https://openreview.net/forum?id=QiKZ2TPGL0

---

Title: Pre-Generating Multi-Difficulty PDE Data for Few-Shot Neural PDE Solvers

Abstract: A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than learning the model itself. Another is that there are clear axes of difficulty (e.g., more complex geometries and higher Reynolds numbers) along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. Towards addressing this chicken-and-egg challenge, we study difficulty transfer on 2D incompressible Navier-Stokes, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination. Similar to how it is possible to spend compute to pre-train foundation models and improve their performance on downstream tasks, we find that by classically solving (analogously pre-generating) many low and medium difficulty examples and including them in the training set, it is possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low and high difficulty data, we can spend 8.9× less compute on pre-generating a dataset to achieve the same error as using only high difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers.

URL: https://openreview.net/forum?id=w884ao8LLW

---

Title: SMART: A Unified Framework for Structural Representation Attribution

Abstract: Recovering latent signals via representation learning is fundamental to analyzing high-dimensional complex systems. However, existing approaches predominantly focus on static single-environment representations, overlooking mechanism shifts across environments, which are crucial for causal discovery, change point detection, and transfer learning, among others. We propose SMART (Sparse Mechanism Attribution for RepresenTation), a unified framework for structural representation attribution. Our two-stage framework first performs signal recovery, then applies sparse regularization on structural differences to attribute distributional shifts to specific mechanisms, yielding an attribution that circumvents the notorious identifiability issue. Theoretically, we show that estimation errors for structure and representation inherit the convergence rates of the first stage. A consistent information criterion is also introduced to determine latent dimensionality. Furthermore, we develop a one-step estimation method for additive noise representations and extend SMART to multi-environment scenarios with theoretical guarantees. Extensive simulations and applications to two real datasets demonstrate the effectiveness and practical utility of SMART.

URL: https://openreview.net/forum?id=yDha4jnJy5

---

Title: Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

Abstract: Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6\% bitrate reduction at matched NIQE, improves KID by up to 64.6\% and FID by up to 37.7\% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate--distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.

URL: https://openreview.net/forum?id=zER4e4GfZH

---

Title: Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Cultural Intelligence with CQBench

Abstract: Cultural Intelligence (CQ) refers to the ability to understand unfamiliar cultural contexts, a crucial skill for large language models (LLMs) to effectively engage with globally diverse users. Existing studies often focus on explicitly stated cultural norms, but fail to capture the subtle, implicit values that are common in daily conversation. To address this gap, we introduce CQ-Bench, a benchmark specifically designed to assess LLMs’ capability to infer implicit cultural values from natural conversational contexts. CQ-Bench consists of multi-character conversation-based stories using values from the World Value Survey and the GlobalOpinions, covering ethical, religious, and social topics, among others. Our automatic dataset construction pipeline integrates rigorous validation procedures (incorporation, consistency, and implicitness checks), achieving a 94.5% human–model agreement in the final validation. To leverage CQ-Bench data, we design three tasks of increasing complexity: attitude detection, value selection, and value extraction. These tasks evaluate whether models can detect attitude and recognize values embedded within natural dialogues rather than relying on explicit cultural knowledge. We find that while frontier models could reach human-level performance in value selection (0.809 F1), they still fall short in nuanced attitude detection (0.622 F1). Notably, fine-tuning a smaller LLaMA-3.2-3B on only 500 culturally-rich examples improves performance by over 10%, even outperforming o3-mini in some cases. Using CQ-Bench, we provide insights into the current challenges in LLMs’ CQ research and suggest practical pathways for enhancing LLMs’ cross-cultural reasoning abilities.

URL: https://openreview.net/forum?id=eIdvFTtuym

---

Title: ICPRL: Acquiring Physical Intuition from Interactive Control

Abstract: VLMs excel at static perception but falter in interactive reasoning in dynamic physical environments, which demands planning and adaptation to dynamic outcomes. Existing physical reasoning methods often depend on abstract symbolic inputs or lack the ability to learn and adapt from direct, pixel-based visual interaction in novel scenarios. We introduce ICPRL (In-Context Physical Reinforcement Learning), a framework inspired by In-Context Reinforcement Learning (ICRL) that empowers VLMs to acquire physical intuition and adapt their policies in-context. Our approach trains a vision-grounded policy model via multi-turn Group Relative Policy Optimization (GRPO) over diverse multi-episode interaction histories. This enables the agent to adapt strategies by conditioning on past trial-and-error sequences, without requiring any weight updates. This adaptive policy works in concert with a separately trained world model that provides explicit physical reasoning by predicting the results of potential actions. At inference, the policy proposes candidate actions, while the world model predicts outcomes to guide a root-node PUCT search to select the most promising action. Evaluated on the diverse physics-based puzzle-solving tasks in the DeepPHY benchmark, ICPRL demonstrates significant improvements in both its (i) policy-only and (ii) world-model-augmented stages. Notably, these gains are retained in unseen physical environments, demonstrating that our framework facilitates genuine in-context acquisition of the environment's physical dynamics from interactive experience.

URL: https://openreview.net/forum?id=qTBy2rUhE8

---

Title: LEAD: An EEG Foundation Model for Alzheimer’s Disease Detection

Abstract: Electroencephalography (EEG) provides a non-invasive, highly accessible, and cost-effective approach for detecting Alzheimer’s disease (AD). However, existing methods, whether based on handcrafted feature engineering or standard deep learning, face three major challenges: 1) the lack of large-scale EEG-based AD datasets for robust representation learning and evaluation; 2) limited cross-subject generalizability; and 3) difficulty in adapting to highly heterogeneous data. To address these challenges, we curate the world’s largest EEG-AD corpus to date, comprising 2,238 subjects. Leveraging this unique resource, we propose LEAD, the first foundation model for EEG-based AD detection. Specifically, we design a gated temporal-spatial Transformer that can adapt to EEG recordings with arbitrary lengths, channel configurations, and sampling rates. In addition, we introduce a subject-regularized training strategy to enhance end-to-end subject-level detection. We further employ medical contrastive learning to pre-train on 13 datasets, including 4 AD datasets and 9 non-AD neurological disorder datasets, and fine-tune/test the model on the other 5 AD datasets. LEAD achieves the best average ranking across all 20 evaluations on 5 downstream datasets, substantially outperforming existing approaches, including state-of-the-art (SOTA) EEG foundation models. These results strongly demonstrate the effectiveness and practical potential of the proposed method for real-world EEG-based AD detection. Source code: https://anonymous.4open.science/r/LEAD-3B51

URL: https://openreview.net/forum?id=AigNTyxcvH

---

Title: Analytic Rotor-Based Canonicalisation in VAEs: A Controlled Study of Geometric Inductive Biases

Abstract: This paper asks whether, in pose-aware variational autoencoders where the pose of each training view is already known, it is better to apply that pose through an exact geometric transformation or to give pose coordinates to a decoder and ask the decoder to learn the rendering rule from data. We study this question in a deliberately simple two-dimensional silhouette setting. The encoder extracts object content from two rotated views, the decoder predicts one canonical silhouette, and the proposed model renders each posed view by an analytic rotor-induced image warp. The matched baseline uses the same encoder, latent size, and decoder width, but concatenates the pose code to the decoder input. Across three random seeds in a compressed-capacity setting, the rotor pathway improves validation canonical binary cross-entropy from $0.0919 \pm 0.0022$ to $0.0889 \pm 0.0051$, improves thresholded canonical Dice from $0.8339 \pm 0.0042$ to $0.8407 \pm 0.0035$, improves thresholded view Dice from $0.7981 \pm 0.0119$ to $0.8352 \pm 0.0030$, and reduces relative-pose composition error from $0.0319 \pm 0.0022$ to $0.0092 \pm 0.0004$. The baseline obtains lower probabilistic view cross-entropy, which we interpret cautiously because its smoother predictions can reduce cross-entropy while giving worse thresholded shape agreement. These results support the value of explicit analytic warping for this known-pose canonicalisation problem. They do not establish that planar Geometric Algebra is superior to all analytic matrix or spatial-transformer warps. Finally, we outline how the same design principle could be carried to three-dimensional rotors and Conformal Geometric Algebra motors, but we leave that extension as future empirical work.

URL: https://openreview.net/forum?id=fWssygxB95

---

Title: NashPG: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

Abstract: Finding Nash equilibria in two-player zero-sum imperfect-information games remains a central challenge in multi-agent reinforcement learning. Recent multi-round regularization methods offer a promising direction, yet existing approaches either require full enumeration of the game tree or rely on non-policy-gradient inner solvers that underperform in practice, leaving a scalable policy-gradient-based solution open. In this paper, we propose a novel multi-round regularization procedure and show that it guarantees strictly monotonic reduction in Bregman divergence to Nash equilibria and eventual convergence to one in two-player zero-sum extensive-form games. Guided by this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which places the regularization directly in the policy optimization objective and is implemented using standard policy gradient methods. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold'em, where it attains higher average payoff in head-to-head play.

URL: https://openreview.net/forum?id=yIA2Fjs1FK

---

Title: Sample-wise Constrained Learning via a Sequential Penalty Approach with Applications in Image Processing

Abstract: In many learning tasks, certain requirements on the processing of individual data samples should arguably be formalized as strict constraints in the underlying optimization problem, rather than by means of arbitrary penalties. We show that, in these scenarios, learning can be carried out with a sequential penalty method that properly handles the constraints. For the proposed algorithm, under classical assumptions we prove correctness and almost sure convergence to stationary points. Moreover, experiments on image processing tasks show that the method is viable in practical deep learning scenarios.

URL: https://openreview.net/forum?id=Xi1UWSFooI

---

Title: A Comparative Study of Label-free Representation Quality Metrics in Deep Learning

Abstract: We present a comparative study of label-free metrics for assessing the quality of representations in deep neural networks to understand their reliability under a wide variety of configurations. We group existing label-free metrics into three families based on their construction and analytically establish connections between metrics within the same family. We then characterise the sensitivity of spectral metrics through controlled synthetic experiments. Finally, all label-free metrics are evaluated against downstream task accuracy across a diverse set of 260 vision models on CIFAR-10, CIFAR-100, and ImageNet-1K, stratifying results by architecture class, training objective, and representation dimension. We find that intrinsic dimensionality (ID) is the most reliable predictor in all settings, while the reliability of other metrics is strongly moderated by architecture class and training objective. Our results provide a clearer understanding of what label-free representation quality metrics measure, when they are reliable, and how to interpret them in practice.

URL: https://openreview.net/forum?id=yknkAksqr1

---

Title: GeoDT: Geometry-Aware Decision Transformer for Robust Safe Multi-Task Offline Reinforcement Learning

Abstract: Scaling offline reinforcement learning across heterogeneous tasks remains challenging, especially under safety constraints. In multi-task settings, features processed by a shared model may play different semantic roles across tasks, leading to semantic inconsistency, conflicting optimization signals, and performance degradation as task diversity increases. While prior multi-task and safe offline RL methods address parts of this challenge, few provide a unified framework that is both effective and safety-aware. We propose GeoDT (Geometry-Aware Decision Transformer), a framework for safe multi-task offline RL that biases cross-task sharing toward geometry-related trajectory structure to mitigate semantic inconsistency. GeoDT learns to separate geometry-related structure from task-specific semantics, constructs geometry-aware context from prompt trajectories through relational structure induction and prototype memory, and incorporates safety by using cost signals to shape the feasible region of geometric reuse. The resulting context is fused with task-specific semantic features to condition a cost-aware Decision Transformer. To better assess behavior as task diversity increases, we further introduce the Task Scaling Robustness Score (TSRS) and Inter-Task Balance Score (ITBS), which measure performance retention and cross-task balance as the number of tasks increases. Experiments on multi-task safe offline RL benchmarks show that GeoDT achieves strong reward--cost trade-offs, improved robustness under increasing task diversity, and zero-shot adaptation to unseen safety budgets compared with competitive baselines. These results suggest that geometry-related trajectory structure can provide an effective basis for safe multi-task offline reinforcement learning.

URL: https://openreview.net/forum?id=zhI71qDarU

---

Title: Continual Learning using Evolution Strategies

Abstract: Continual Learning (CL) aims to train neural networks on sequences of tasks without triggering catastrophic forgetting. Existing approaches typically rely on gradient-based optimization, which breaks down in exemplar-free settings where data, and therefore gradients, from past tasks are unavailable. To overcome this limitation, we propose EvoCL, a gradient-free method that employs an evolutionary strategy to optimize a neural network using a surrogate loss constructed by an adapter network. The adapter maps stored latent features of previous classes into the current task's embedding space, enabling joint training of the feature extractor and adapter without access to past data or gradients. This reframes CL as an optimization problem that does not require gradient information. Experiments on multiple benchmarks demonstrate that EvoCL achieves strong performance under tight parameter budgets, highlighting it as a promising direction for gradient-free, data-free CL. The code to reproduce these results is available at (omitted for the review, we enclose it in the supplementary material).

URL: https://openreview.net/forum?id=IBYmdE5g3X

---

Title: Always So Sure: Can LLM's Confidence be Trusted?

Abstract: Confidence estimation techniques are often used to better gauge the answers given by a Large Language Model (LLM). One such technique is $\textit{verbalized confidence}$. This prompting setup produces confidence scores alongside the actual answers, but the mechanisms behind these self-reported confidence values remain poorly understood. This paper presents a comprehensive analysis of verbalized confidence across multiple datasets spanning factual questions, multiple-choice QA, and causal reasoning using four different LLMs.
Our investigation reveals that verbalized confidence scores are $\textit{highly quantized}$, clustering around specific values (e.g., 0, 90, 100) with minimal differentiation between correct and incorrect answers. Through causal mediation analysis and targeted input perturbations, we demonstrate that confidence score generation is primarily influenced by structural prompt elements, such as the word "confidence" and the specified scale range, rather than by the actual question's content.
These findings provide valuable insights into the behavior of verbalized confidence and underscore the importance of developing more reliable self-evaluation mechanisms for LLMs.

URL: https://openreview.net/forum?id=aW6dxQm3Vu

---

Title: Path-Integrated Loss-Gradient Kernels: Auditing and Similarity for Trained Neural Networks

Abstract: Despite their success, deep neural networks remain opaque: it is often unclear why a model fails on a particular input, and classical generalization theory offers limited guidance in the overparameterized regime. Gradient-descent training naturally gives rise to path-dependent inner products between data points, but the resulting kernel matrices are asymmetric and can have negative eigenvalues, precluding their use as proper kernels or similarity measures. We show that a simple modification -- replacing output gradients with loss gradients in these inner products -- restores symmetry and positive semi-definiteness, yielding a Mercer kernel (the path-integrated loss-gradient kernel, PLGK). In particular, the PLGK yields (i) a fine-grained auditing decomposition that attributes how individual predictions arise from training data, and (ii) an intrinsic, behavior-based similarity measure between inputs. We validate both tools in focused experiments, including pruning studies that confirm audit-identified influences predict retraining outcomes, and a capstone analysis demonstrating that adversarial perturbations exploit a cancellation among training influences that prevents the network from learning on adversarial inputs, which can be broken by a simple mode-aware perturbation to largely restore performance.

URL: https://openreview.net/forum?id=tE0RWpe5z2

---

Title: SCOPE-RRG: Symbolic Constraint Preference Optimization for Radiology Report Generation

Abstract: The complexity of medical image data combined with the variability of natural language generation often leads to inconsistencies, hallucinations, and a lack of clinical grounding, especially in automatically generated radiology reports. To address these challenges, we introduce a task-specific symbolic constraint preference optimization technique tailored for radiology report generation. A typical radiology report comprises findings and an impression; the findings capture the complex visual information from the medical image, for example a chest X-ray, and the impression is the implied conclusion. Our framework leverages this structure to derive clinical rules from existing findings and impressions, rules that connect the finding and the impression as a Horn rule. The rules act as an additional, interpretable supervision signal, guiding the preference optimization of Vision–Language Models (VLMs) toward outputs that are fluent as well as clinically coherent. Unlike conventional preference optimization, which relies solely on lexical preferences, our approach enforces alignment with clinically meaningful predicates such as the presence, absence, or severity of key findings. A central feature of this framework is its ability to inject symbolic constraint guidance during optimization, ensuring that generated reports remain both linguistically fluent and clinically coherent. Experimental results on benchmark datasets such as MIMIC–CXR-JPG and IU–Xray demonstrate that our approach substantially improves factual accuracy and overall report quality compared to zero-shot and standard DPO baselines. We record a significant performance boost across lexical and semantic metrics. These results highlight the promise of clinically interpretable preference optimization as a pathway toward trustworthy radiology report generation in medical AI.

URL: https://openreview.net/forum?id=lu1oVLcuiN

---

Title: DS-RNNs: Conditional Computation in Recurrent Models via Input-Dependent Sparse Gating

Abstract: Recurrent neural networks (RNNs) and state-space models (SSMs) typically execute the same dense computation for every input, coupling inference cost to parameter count and exacerbating interference in multi-task or heterogeneous regimes. We introduce Dynamic-Sparse RNNs (DS-RNNs), a framework for conditional computation via learnable, input-dependent sparse gating. In DS-RNNs, a small router network predicts sparse masks over input and hidden-state channels, effectively routing each input to a specialized sparse subnetwork. Unlike prior adaptive methods such as DeltaRNNs or static sparse training like RigL, our approach is fully learned and budget-controlled, and utilizes structured masking that translates into practical FLOP savings on commodity hardware by reducing sparse matrix multiplications to dense operations on active submatrices. Empirically, DS-RNNs maintain most of the dense model performance at 90\% sparsity (sometimes even exceeding it) across diverse architectures, including LSTMs, GRUs, LTCs, and S4 models. They match or outperform RigL and DeltaRNNs on most tasks, while remaining stable in regimes where these baselines fail. We further show that DS-RNNs naturally induce subnetwork specialization without explicit supervision: masks for different classes (single-task) or tasks (multi-task) exhibit low overlap and are predictive of class identity. Our theoretical analysis provides intuition for this behavior, linking it to a formal bound on gradient interference. As a result, DS-RNNs improve robustness: DS-LSTM scales better with task count in multi-task regression and consistently reduces forgetting across benchmarks when combined with standard continual learning methods (e.g., GEM, replay).

URL: https://openreview.net/forum?id=8LkOGzFcVL

---

Title: Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols

Abstract: Large language models are increasingly deployed as \emph{protocols}: structured multi-call procedures that spend additional computation to transform a baseline answer into a final one. These protocols are usually evaluated only by end-to-end accuracy, which reveals whether they deliver gains on average but gives limited insight into when they help, when they hurt, and whether their behavior transfers under distribution shift or composition.
We propose a \emph{paired-outcome measurement interface} for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit $E_0\in\{0,1\}$ and a post-step correctness bit $E_1\in\{0,1\}$, with accuracies $p_t:=\Pr(E_t=1)$. This separates \emph{correction}, $E_0{=}0\to E_1{=}1$, from \emph{corruption}, $E_0{=}1\to E_1{=}0$, through two conditional rates: the correction rate $c=\Pr(E_1{=}1\mid E_0{=}0)$ and the corruption rate $\gamma=\Pr(E_1{=}0\mid E_0{=}1)$. These two rates are sufficient to predict accuracy changes and determine whether a step helps at a given baseline. They also define a reusable empirical interface whose transfer can be tested across seeds, difficulty mixtures, and composed pipelines.
We identify three mechanisms by which this interface can fail to transfer. Under \textbf{mixture shift}, estimates of $(c,\gamma)$ pooled across difficulty regimes become biased when calibration and deployment mixtures differ; conditioning on depth identifies a regime variable under which the interface becomes stable and enables predictive transfer, substantially reducing this bias without additional model calls. Under \textbf{presentation contamination}, selection protocols can change the measured interface through stable presentation artifacts even when candidate content is fixed. Finally, under \textbf{state insufficiency}, the correctness bit alone may not carry enough history for multi-step pipelines to compose predictably; a testable Markov factorization characterizes when composition is valid and identifies where additional state is needed when it is not.
When a protocol step passes these diagnostics, it becomes an auditable module: it can be gated by estimated gain, conditioned on difficulty proxies to correct mixture bias, and composed into multi-step pipelines with predictable accuracy.
We demonstrate these ideas on synthetic mathematical tasks with controlled difficulty and on GSM8K using observable complexity proxies, where the calibrated interface correctly predicts when protocol steps should be activated or suppressed.

URL: https://openreview.net/forum?id=ZiIjTpir8o
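
Illustrative note (not the authors' code): the two rates defined in the abstract above determine the post-step accuracy by the law of total probability, p_1 = p_0 (1 - gamma) + (1 - p_0) c, so a step helps at baseline p_0 when (1 - p_0) c > p_0 gamma. The sketch below works this out from paired correctness bits. The function names (estimate_rates, post_step_accuracy, step_helps) are hypothetical and chosen only for this example.

```python
from typing import Iterable, Optional, Tuple


def estimate_rates(
    pairs: Iterable[Tuple[int, int]],
) -> Tuple[Optional[float], Optional[float]]:
    """Estimate (c, gamma) from paired correctness bits (E0, E1).

    c     = Pr(E1 = 1 | E0 = 0), the correction rate
    gamma = Pr(E1 = 0 | E0 = 1), the corruption rate
    A rate is None when its conditioning event never occurs.
    """
    wrong = fixed = right = broken = 0
    for e0, e1 in pairs:
        if e0 == 0:
            wrong += 1
            fixed += e1          # baseline wrong, step made it right
        else:
            right += 1
            broken += 1 - e1     # baseline right, step made it wrong
    c = fixed / wrong if wrong else None
    gamma = broken / right if right else None
    return c, gamma


def post_step_accuracy(p0: float, c: float, gamma: float) -> float:
    """Law of total probability: p1 = p0 * (1 - gamma) + (1 - p0) * c."""
    return p0 * (1.0 - gamma) + (1.0 - p0) * c


def step_helps(p0: float, c: float, gamma: float) -> bool:
    """True when p1 > p0, i.e. (1 - p0) * c > p0 * gamma."""
    return (1.0 - p0) * c > p0 * gamma


if __name__ == "__main__":
    # Toy paired outcomes (E0, E1); made-up data for illustration only.
    pairs = [(0, 1), (0, 1), (0, 0), (1, 1), (1, 0), (1, 1), (1, 1)]
    c, gamma = estimate_rates(pairs)
    p0 = sum(e0 for e0, _ in pairs) / len(pairs)
    print("c =", c, "gamma =", gamma)
    print("predicted p1 =", post_step_accuracy(p0, c, gamma))
    print("step helps at this baseline:", step_helps(p0, c, gamma))
```

Because the break-even condition depends on p_0, the same (c, gamma) pair can help at a low baseline and hurt at a high one, which is consistent with the abstract's point about gating a step by estimated gain.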

---

Title: Assumption-free stability for ranking problems

Abstract: In this work, we consider ranking problems among a finite set of candidates: for instance, selecting the top-$k$ items from a list or obtaining the full ranking of all items in the set. These problems are often unstable: estimating a ranking from noisy data can exhibit high sensitivity to small perturbations. Concretely, if we use data to assign scores to items (say, by aggregating user preference data), then for two items with similar scores, small fluctuations in the data can alter their relative rankings. Many existing theoretical results sidestep this challenge by assuming a separation condition, but real-world data often contains near-ties, limiting the applicability of existing theory. To address this gap, we develop a new algorithmic stability framework for ranking problems, and propose two novel ranking operators for achieving stability: the \emph{inflated top-$k$} for the top-$k$ selection problem and the \emph{inflated full ranking} for ranking the full list, each of which allows for expressing some uncertainty in the output. Both our proposed methods provide guaranteed stability, with no assumptions on data distributions and no dependence on the total number of candidates to be ranked. Experiments on real-world data confirm that the proposed methods offer stability while retaining informativeness.

URL: https://openreview.net/forum?id=Z4oIsrGqek

---

Title: Retrieval-Free Instruction Selection for Instruction-Tuned Embedding Models via Uncentered Spectral Entropy

Abstract: Instruction-tuned embedding models expose a consequential deployment variable: the query-side instruction.
Choosing that instruction usually requires repeated retrieval evaluation with corpus access and judged queries, precisely when such infrastructure is least available: cold-start domains, API-only embedding services, and large candidate pools that must be screened before a stable retrieval stack exists.
We study this earlier decision point and ask whether candidate instructions can be ranked before retrieval evaluation is practical.
Our central hypothesis is geometric: better instructions induce a broader, less collapsed representation geometry on a small unlabeled set of query-like proxy texts.
Based on this view, we propose Instruction Performance Prediction (IPP), a retrieval-free, label-free screening method that scores each instruction by the normalized spectral entropy of the second-moment matrix of its proxy embeddings.
Across 16 embedding models, 17 retrieval datasets, and all 272 model--dataset pairs (full $16 \times 17$ coverage), IPP attains median oriented Spearman $\rho = 0.806$ and median regret@1 of $0.004$ NDCG@10 points on a 104-instruction pool.
The same evidence also identifies a clear operating boundary: when candidate instructions produce little downstream variation, the ranking problem becomes weakly separable, so geometric screening should hand off to direct retrieval evaluation rather than be over-interpreted.

URL: https://openreview.net/forum?id=3nL4o32Cq3
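
Illustrative note (not the authors' code): one plausible reading of "normalized spectral entropy of the second-moment matrix" is sketched below with NumPy. It assumes the second-moment matrix is the uncentered X^T X / n of the proxy embeddings, that entropy is taken over the eigenvalues normalized to sum to one, and that the result is divided by log(d); the paper may define or normalize the score differently. Ranking higher entropy first is likewise an assumption drawn from the abstract's stated hypothesis. The function names are hypothetical.

```python
import numpy as np


def normalized_spectral_entropy(embeddings: np.ndarray) -> float:
    """Spectral entropy of the uncentered second-moment matrix, scaled to [0, 1].

    embeddings: array of shape (n, d), one row per proxy text.
    Assumption: normalization by log(d), where d is the embedding dimension.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    n, d = x.shape
    second_moment = x.T @ x / n                      # (d, d), no mean subtraction
    eigvals = np.clip(np.linalg.eigvalsh(second_moment), 0.0, None)
    total = eigvals.sum()
    if total <= 0.0 or d < 2:
        return 0.0
    p = eigvals / total                              # eigenvalue distribution
    p = p[p > 0.0]                                   # treat 0 * log(0) as 0
    entropy = -np.sum(p * np.log(p))
    return float(entropy / np.log(d))


def rank_instructions(proxy_embeddings: dict) -> list:
    """Rank instruction names by score, highest (least collapsed) first."""
    scores = {
        name: normalized_spectral_entropy(emb)
        for name, emb in proxy_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spread = rng.normal(size=(64, 8))                # varied proxy embeddings
    collapsed = np.tile(rng.normal(size=(1, 8)), (64, 1))  # all rows identical
    print(rank_instructions({"spread": spread, "collapsed": collapsed}))
```

Under these assumptions, a set of embeddings that all point the same way scores near 0, while embeddings spread evenly across dimensions score near 1.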

---

Title: Personalized Two-sided Dose Interval

Abstract: In fields such as medicine and social sciences, the goal of treatment is often to maintain the outcome of interest within a desirable range rather than to optimize its value. To achieve this, it may be more practical to recommend a treatment dose interval rather than a single fixed level for a study unit. Since individuals may respond differently to the same treatment level, the recommended dose interval should be personalized based on their unique characteristics. Existing methods for one-sided dose intervals and iteratively constructed two-sided intervals provide useful foundations, but their theory does not directly address simultaneous estimation over unrestricted product function spaces. To address this gap, we propose a direct method for learning personalized two-sided dose intervals based on empirical risk minimization with a doubly-robust loss function that is well-defined over a tensor product function space. This formulation enables simultaneous estimation of the lower and upper bounds without constrained alternating updates. We establish statistical properties of the estimated dose interval in terms of excess risk by leveraging reproducing kernel Hilbert space theory. Our simulation study and a real-world application in warfarin dosing show that the proposed direct method compares favorably with competing indirect regression-based methods.

URL: https://openreview.net/forum?id=KlwKRxmLKO

---

Title: Statistical Test for Diffusion-Based Anomaly Localization via Selective Inference

Abstract: Anomaly localization in images, that is, identifying regions that deviate from normal patterns, is vital in applications such as medical diagnosis and industrial inspection. A recent trend is the use of image generation models in anomaly localization, where these models generate normal-looking counterparts of anomalous images, thereby allowing flexible and adaptive anomaly localization. However, these methods inherit the uncertainty and bias implicitly embedded in the employed generative model, raising concerns about their reliability. To address this, we propose a statistical framework based on selective inference to quantify the significance of detected anomalous regions. Our method computes $p$-values to assess the false positive detection rate, providing a principled measure of reliability. As a proof of concept, we consider anomaly localization using a diffusion model and its applications to medical diagnosis and industrial inspection. The results indicate that the proposed method effectively controls the risk of false positive detection, supporting its use in high-stakes decision-making tasks.

URL: https://openreview.net/forum?id=S1df8b69fh

---

Title: FreqSAM: Saliency-Masked Frequency–Spatial Adversarial Attacks for Stealthy Examples

Abstract: Deep Neural Networks (DNNs) have achieved remarkable success across computer vision tasks, yet their vulnerability to adversarial perturbations remains a critical security concern. Existing adversarial attacks often operate predominantly in a single representation (spatial or frequency), which can limit control over the effectiveness--imperceptibility trade-off and lead to perceptible artifacts. We introduce FreqSAM (Frequency-enhanced Salient Area Masking), an adversarial attack that combines saliency-guided spatial localization with frequency-aware updates to generate effective adversarial examples with strong perceptual similarity. FreqSAM strategically localizes spatial perturbations within semantically salient regions identified through gradient-based saliency maps, while shaping perturbations using Fast Fourier Transform (FFT) masking. This spatial--frequency design targets a strong effectiveness--imperceptibility trade-off under standard norm constraints. Experiments on ImageNet across multiple architectures show that FreqSAM achieves high white-box success rates while improving visual fidelity as measured by $L_2$, SSIM, and PSNR, and it exhibits moderate black-box transferability. We further evaluate FreqSAM under several common defense settings, including adversarially trained and augmentation-based models. Our approach highlights that common ImageNet models and several robustness baselines remain vulnerable to jointly spatial--frequency constrained perturbations, motivating defenses and evaluations that consider multi-domain attack vectors.

URL: https://openreview.net/forum?id=hK5gHS9JvS

---

Title: Speculative Decoding for Multimodal Models: A Survey

Abstract: Multimodal generative models have demonstrated remarkable capabilities across diverse domains, from visual understanding and image generation to video processing, audio synthesis, and embodied control. These capabilities, however, incur substantial inference overhead due to autoregressive decoding or iterative generation, which is further compounded by modality-specific challenges such as extensive visual token redundancy, strict real-time latency constraints in robotic control, and prolonged sequential generation in text-to-image synthesis. Speculative decoding has emerged as a promising paradigm to accelerate inference without degrading output quality, yet existing surveys remain focused on text-only large language models. In this survey, we provide a systematic and comprehensive review of speculative decoding methods for multimodal models, spanning Vision–Language, Vision–Language–Action, Video–Language, Speech, Text-to-Image (Vision Auto-Regressive), and Diffusion models. We organize the literature into a unified taxonomy with two primary axes, covering the draft generation stage and the verification and acceptance stage, complemented by an analysis of inference framework support. Through this taxonomy, we identify recurring cross-modal design patterns, including token compression, KV cache optimization, target-informed transfer, drafter-target alignment, verification cost reduction, relaxed acceptance, and verify-to-draft feedback, and examine how successful techniques transfer across modalities. We further provide a systematic comparison of existing methods under both self-reported and standardized benchmarking settings. Finally, we discuss open challenges and outline future directions. We hope this survey can serve as a valuable resource for researchers and practitioners working on accelerating multimodal inference.

URL: https://openreview.net/forum?id=7hVvEoMyob

---

Title: On Provable Benefits of Muon in Federated Learning

Abstract: The recently introduced optimizer, Muon, has gained increasing attention due to its superior performance across a wide range of applications. However, its effectiveness in federated learning remains unexplored. To address this gap, this paper investigates the performance of Muon in the federated learning setting. Specifically, we propose a new algorithm, FedMuon, and establish its convergence rate for nonconvex problems. Our theoretical analysis reveals multiple favorable properties of FedMuon. In particular, due to its orthonormalized update direction, the learning rate of FedMuon is independent of problem-specific parameters, and, importantly, it can naturally accommodate heavy-tailed noise. Extensive experiments on a variety of neural network architectures validate the effectiveness of the proposed algorithm.

URL: https://openreview.net/forum?id=GlZTVr36He
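
To make the phrase "orthonormalized update direction" in the abstract above concrete, here is a minimal sketch of one way a gradient matrix can be replaced by its nearest orthonormal factor, using the textbook cubic Newton--Schulz iteration. This is not the FedMuon algorithm, whose details are not given in the digest; the helper names, step count, learning rate, and toy matrices are all assumptions, and practical Muon implementations may use a different polynomial and add momentum.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def orthogonalize(g, steps=20):
    """Approximate the orthonormal polar factor U V^T of g = U S V^T.

    Uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X,
    which converges for full-rank input once the singular values are
    scaled into (0, sqrt(3)); dividing by the Frobenius norm ensures that.
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    if norm == 0.0:
        return [row[:] for row in g]  # zero gradient: nothing to orthogonalize
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
            for row_x, row_c in zip(x, xxtx)
        ]
    return x


# Hypothetical 2x2 weight and gradient matrices (illustration only).
weights = [[1.0, 0.0], [0.0, 1.0]]
grad = [[3.0, 1.0], [0.0, 2.0]]
learning_rate = 0.1  # assumed value

update = orthogonalize(grad)
weights_new = [
    [w - learning_rate * u for w, u in zip(row_w, row_u)]
    for row_w, row_u in zip(weights, update)
]
print(matmul(update, transpose(update)))  # close to the identity matrix
```

Because every singular value of `update` is driven toward 1, the step size no longer depends on the gradient's scale, which is one intuition for why such a step can tolerate occasional very large gradients.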

---

Title: Action-Conditioned Transformers for Decentralized Multi-Agent World Models

Abstract: Multi-agent reinforcement learning (MARL) has achieved strong results on large-scale decision making, yet most methods are model-free, limiting sample efficiency and making coordination harder as teammates’ policies evolve during training. Model-based reinforcement learning (MBRL) can reduce data usage, but planning and search scale poorly with joint action spaces. We adopt a world model approach to long-horizon coordination while avoiding expensive planning. We introduce MACT, a decentralized transformer world model with linear complexity in the number of agents. Each agent processes discretized observation–action tokens with a shared transformer, while a single cross-agent Perceiver step provides global context under centralized training and decentralized execution. MACT targets long-horizon coordination by coupling Perceiver-derived global context with an action-conditioned contrastive objective that predicts future latent representations over a short horizon conditioned on planned actions. Experiments on the StarCraft Multi-Agent Challenge (SMAC) under tight data budgets show that MACT is competitive with strong model-free baselines and prior world-model variants, with larger gains on coordination-heavy scenarios.

URL: https://openreview.net/forum?id=99nyrFfTJf

---

Title: Tabular Data in Interactive and Conversational AI: A Survey of Foundations, Benchmarks, Systems, and Open Problems

Abstract: Tabular and structured data underlie much of modern analytical work, yet natural language systems for interacting with such data have largely been studied in fragmented subfields. This survey studies that landscape under the broader problem of \emph{conversational AI over tabular and structured data}: systems that support multi-turn, context-dependent interaction with tables, databases, spreadsheets, and hybrid table--text documents. We first clarify the problem setting by defining tabular data, conversational interaction, and the primary interaction modes that distinguish querying, translating, manipulating, and orchestrating over structured data, while treating exploration as a recurrent interaction pattern rather than a separate category. Using an explicit corpus-construction and evidence policy, we organize 106 unique cited works into five categories: \emph{Foundations}, \emph{Conversational Table Question Answering} (CTabQA), \emph{Conversational Text-to-SQL} (CText2SQL), \emph{Interactive Table Manipulation}, and \emph{Agentic Table Systems}. Across these categories, we compare benchmark datasets, modelling paradigms, and evaluation practices, while tracing how closely related problems have often been studied under different task names, benchmarks, and research communities. Our synthesis shows recurring fragmentation in terminology, benchmark conventions, and modelling assumptions across the surveyed literatures, but we treat that fragmentation as a qualitative finding of the review rather than as a formal bibliometric result. We also find that CText2SQL currently has the most standardized benchmark and modelling pipeline, whereas manipulation and agentic systems more closely reflect real user workflows but remain harder to evaluate rigorously. Beyond category-specific findings, we identify three cross-cutting themes shared across the field: intent disambiguation and clarification, dialogue context tracking, and evaluation. 
These reveal a central mismatch between current benchmarks and realistic use: most systems are still optimized for short, clean, single-table interactions rather than long-horizon, ambiguous, multi-source analytical workflows. We conclude by synthesizing the field's main open problems, including unified evaluation, long-dialogue robustness, proactive clarification, interpretability, privacy, domain adaptation, and multi-table reasoning, and argue that progress will depend on moving from narrow task benchmarks toward integrated, user-centered conversational data systems.

URL: https://openreview.net/forum?id=OPLfz7iTUZ

---

Title: Temporal Memory for Resource-Constrained Agents: Continual Learning via Stochastic Compress--Add--Smooth

Abstract: An agent that operates sequentially must incorporate new experience without forgetting old experience, under a fixed memory budget. We propose a framework in which memory is not a parameter vector but a stochastic process: a Bridge Diffusion on a replay interval $[0,1]$, whose terminal marginal encodes the present and whose intermediate marginals encode the past. New experience is incorporated via a three-step \emph{Compress--Add--Smooth} (CAS) recursion. We test the framework on models whose marginal probability densities are Gaussian mixtures with a fixed number~$K$ of components in $d$ dimensions; temporal complexity is controlled by a fixed number~$L$ of piecewise-linear protocol segments whose nodes store Gaussian-mixture states. The entire recursion costs $O(LKd^2)$ flops per day --- no backpropagation, no stored data, no neural networks --- making it viable for controller-light hardware.

Forgetting in this framework arises not from parameter interference but from lossy temporal compression: the re-approximation of a finer protocol by a coarser one under a fixed segment budget. We find that the retention half-life scales linearly as $a_{1/2}\approx c\,L$ with a constant $c>1$ that depends on the dynamics but not on the mixture complexity~$K$, the dimension~$d$, or the geometry of the target family. The constant~$c$ admits an information-theoretic interpretation analogous to the Shannon channel capacity. The stochastic process underlying the bridge provides temporally coherent ``movie'' replay --- compressed narratives of the agent's history, demonstrated visually on an MNIST latent-space illustration. The framework provides a fully analytical ``Ising model'' of continual learning in which the mechanism, rate, and form of forgetting can be studied with mathematical precision.

URL: https://openreview.net/forum?id=wjoixYG0mC

---
