Expert Certification: Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings
Rita González-Márquez, Philipp Berens, Dmitry Kobak
https://openreview.net/forum?id=gVRsIh9x7W
---
Featured Certification: PruneFuse: Efficient Data Selection via Weight Pruning and Network Fusion
Humaira Kousar, Hasnain Irshad Bhatti, Jaekyun Moon
https://openreview.net/forum?id=BvnxenZwqY
---
J2C Certification: Wikipedia in the Era of LLMs: Evolution and Risks
Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
https://openreview.net/forum?id=ahVmnYkVLt
---
J2C Certification: Process Reward Models That Think
Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
https://openreview.net/forum?id=FPVCb0WMuN
---
Survey Certification: A Survey of Reasoning in Autonomous Driving Systems: Open Challenges and Emerging Paradigms
Kejin Yu, Yuhan Sun, Taiqiang Wu, Ruixu Zhang, Zhiqiang Lin, Yuxin Meng, Junjie Wang, Yujiu Yang
https://openreview.net/forum?id=XwQ7dc4bqn
---
J2C Certification: Mollifier Layers: Enabling Efficient High-Order Derivatives in Inverse PDE Learning
Vinayak Vinayak, Ananyae Kumar bhartari, Vivek Shenoy
https://openreview.net/forum?id=6mFVZSzyev
---
Expert Certification: Learning Lagrangian Interaction Dynamics with Sampling-Based Model Order Reduction
Hrishikesh Viswanath, Yue Chang, Aleksey Panas, Julius Berner, Peter Yichen Chen, Aniket Bera
https://openreview.net/forum?id=vXCQA1EzaG
---
Survey Certification: Open Technical Problems in Open-Weight AI Model Risk Management
Stephen Casper, Kyle O'Brien, Shayne Longpre, Elizabeth Seger, Kevin Klyman, Rishi Bommasani, Aniruddha Nrusimha, Ilia Shumailov, Sören Mindermann, Steven Basart, Frank Rudzicz, Kellin Pelrine, Avijit Ghosh, Andrew Strait, Robert Kirk, Dan Hendrycks, Peter Henderson, J Zico Kolter, Geoffrey Irving, Yarin Gal, Yoshua Bengio, Dylan Hadfield-Menell
https://openreview.net/forum?id=8QyGLnFkzc
---
J2C Certification: Continual Robot Learning via Language-Guided Skill Acquisition
Shuo Cheng, Zhaoyi Li, Kelin Yu, Danfei Xu
https://openreview.net/forum?id=oYRNxxGN9u
---
Accepted papers
===============
Title: A Tighter Bound for Reward Learning in Reinforcement Learning from Human Feedback
Authors: Guoxi Chen, Xing Chen, Bo An, Ya Zhang
Abstract: As a key component of reinforcement learning from human feedback (RLHF), reward learning directly influences the final learned policy. Unfortunately, existing theoretical estimation-error bounds in reward learning rely on the complexity of the reward function class, unattainable optimal parameters, or non-zero constants independent of the sample size, leading to uncomputable bounds that are meaningless for reward function classes with unknown complexity. To address this issue, this paper presents an analysis of parameter estimation for reward learning in RLHF under general function approximation, without imposing restrictions on the complexity of the reward function class. A tighter bound is derived that contains no non-zero terms independent of the sample size; the optimal parameters are eliminated by applying a linear approximation around the learned parameters. Additionally, the relationship between the preference dataset and the learned parameters is examined to show how to collect data efficiently based on the currently learned parameters. Inspired by these theoretical results, a novel offline RLHF algorithm with parameter constraints is proposed, restricting parameters to the valid space defined by the dataset. Furthermore, an online RLHF algorithm is proposed to iteratively optimize parameter learning and improve data-collection efficiency. This work provides a tighter bound than previous studies and offers theoretical guidance for online data collection under general function approximation.
URL: https://openreview.net/forum?id=EyMoFzI3Oz
---
Title: SSL-SLR: Self-Supervised Representation Learning for Sign Language Recognition
Authors: Ariel Basso Madjoukeng, Jérôme Fink, Pierre Poitier, Edith Belise Kenmogne, Benoit Frenay
Abstract: Sign language recognition (SLR) is a machine learning task aiming to identify signs in videos. Due to the scarcity of annotated data, unsupervised methods like contrastive learning have become promising in this field. They learn meaningful representations by pulling positive pairs (two augmented versions of the same instance) closer and pushing negative pairs (different from the positive pairs) apart. In SLR, only certain parts of the sign videos provide information that is truly useful for their recognition. Applying contrastive methods to SLR raises two issues: (i) contrastive learning methods treat all parts of a video in the same way, without taking into account the relevance of certain parts over others; (ii) shared movements between different signs make negative pairs highly similar, complicating sign discrimination. These issues lead to learning non-discriminative features for sign recognition and poor results in downstream tasks. In response, this paper proposes a self-supervised learning framework designed to learn meaningful representations for SLR. This framework consists of two key components designed to work together: (i) a new negative-pair-free self-supervised approach; (ii) a new data augmentation technique. This approach shows a considerable gain in accuracy compared to several contrastive and self-supervised methods, across linear evaluation, semi-supervised learning, and transferability between sign languages.
URL: https://openreview.net/forum?id=buTZkTXijy
---
Title: Are foundation models for computer vision good conformal predictors?
Authors: Leo Fillioux, Julio Silva-Rodríguez, Ismail Ben Ayed, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz
Abstract: Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has received little attention. In this work, we delve into the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually leads to efficiency degradation of the conformal set on adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Last, our empirical study exposes APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.
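For readers unfamiliar with CP, the split-conformal procedure such studies evaluate can be sketched in a few lines of NumPy. This is a toy illustration on simulated softmax outputs, not the paper's code; `alpha` is the target miscoverage rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 500, 10, 0.1

# Simulated softmax outputs and true labels for a calibration split.
logits = rng.normal(size=(n_cal, n_classes))
labels = rng.integers(0, n_classes, size=n_cal)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Non-conformity score: 1 - probability assigned to the true class.
scores = 1.0 - probs[np.arange(n_cal), labels]

# Conformal quantile with the finite-sample correction.
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
q_hat = np.quantile(scores, q_level, method="higher")

# Prediction set for a new sample: all classes scoring below the threshold.
test_probs = probs[0]
pred_set = np.where(1.0 - test_probs <= q_hat)[0]
```

Marginal coverage then holds at level $\geq 1 - \alpha$ over calibration/test draws; adaptive methods such as APS replace the score with a cumulative-probability variant.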
URL: https://openreview.net/forum?id=Kxdg98gZp4
---
Title: EdgeMask-DG*: Learning Domain-Invariant Graph Structures via Adversarial Edge Masking
Authors: Rishabh Bhattacharya, Naresh Manwani
Abstract: Structural shifts pose a significant challenge for graph neural networks, as graph topology acts as a covariate that can vary across domains. Existing domain generalization methods rely on fixed structural augmentations or training on globally perturbed graphs, mechanisms that do not pinpoint which specific edges encode domain-invariant information. We argue that domain-invariant structural information is not rigidly tied to a single topology but resides in the consensus across multiple graph structures derived from topology and feature similarity. To capture this, we first propose EdgeMask-DG, a novel min-max algorithm where an edge masker learns to find worst-case continuous masks subject to a sparsity constraint, compelling a task GNN to perform effectively under these adversarial structural perturbations. Building upon this, we introduce EdgeMask-DG*, an extension that applies this adversarial masking principle to an enriched graph. This enriched graph combines the original topology with feature-derived edges, allowing the model to discover invariances even when the original topology is noisy or domain-specific. At equilibrium, the structural patterns that the task GNN relies upon are, by design, robust and generalizable. EdgeMask-DG* is the first to systematically combine adaptive adversarial topology search with feature-enriched graphs. We provide a formal justification for our approach from a robust optimization perspective. We demonstrate that EdgeMask-DG* achieves new state-of-the-art performance on diverse graph domain generalization benchmarks, including citation networks, social networks, and temporal graphs. Notably, on the Cora OOD benchmark, EdgeMask-DG* lifts the worst-case domain accuracy to 78.0%, a +3.8 pp improvement over the prior state of the art (74.2%). The source code for our experiments can be found here: \url{https://anonymous.4open.science/r/TMLR-EAEF/}
URL: https://openreview.net/forum?id=vkfe8Ke7eC
---
Title: Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings
Authors: Rita González-Márquez, Philipp Berens, Dmitry Kobak
Abstract: Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via supervised contrastive fine-tuning. This fine-tuning strategy relies on an external notion of similarity and annotated data for generation of positive pairs. Here we study self-supervised fine-tuning and systematically compare the two most well-known augmentation strategies used for fine-tuning text embedding models. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is substantially below the supervised state-of-the-art models, but for in-domain data, self-supervised fine-tuning can produce high-quality text embeddings after very short fine-tuning. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
URL: https://openreview.net/forum?id=gVRsIh9x7W
---
Title: Training speedups via batching for geometric learning: an analysis of static and dynamic algorithms
Authors: Daniel T. Speckhard, Tim Bechtel, Sebastian Kehl, Jonathan Godwin, Claudia Draxl
Abstract: Graph neural networks (GNNs) have shown promising results in several domains such as materials science, chemistry, and the social sciences. GNN models often contain millions of parameters and, like other neural network (NN) models, are often fed only a fraction of the graphs that make up the training dataset in batches to update model parameters. The effect of batching algorithms on training time and model performance has been thoroughly explored for NNs but not yet for GNNs. We analyze two batching algorithms for graph-based models, namely static and dynamic batching, on two datasets: the QM9 dataset of small molecules and the AFLOW materials database. Our experiments show that changing the batching algorithm can provide up to a 2.7x speedup, but the fastest algorithm depends on the data, model, batch size, hardware, and number of training steps run. Experiments show that for a select number of combinations of batch size, dataset, and model, significant differences in model learning metrics are observed between static and dynamic batching algorithms.
URL: https://openreview.net/forum?id=v8rC6EEUep
---
Title: The Role of Feature Interactions in Graph-based Tabular Deep Learning
Authors: Elias Dubbeldam, Reza Mohammadi, Marit Schoonhoven, Ilker Birbil
Abstract: Accurate predictions on tabular data rely on capturing complex, dataset-specific feature interactions. Attention-based methods and graph neural networks, referred to as graph-based tabular deep learning (GTDL), aim to improve predictions by modeling these interactions as a graph. In this work, we analyze how these methods model the feature interactions. Current GTDL approaches primarily focus on optimizing predictive accuracy, often neglecting the accurate modeling of the underlying graph structure. Using synthetic datasets with known ground-truth graph structures, we find that current GTDL methods fail to recover meaningful feature interactions, as their edge recovery is close to random. This suggests that the attention mechanism and message-passing schemes used in GTDL do not effectively capture feature interactions. Furthermore, when we impose the true interaction structure, we find that the predictive accuracy improves. This highlights the need for GTDL methods to prioritize accurate modeling of the graph structure, as it leads to better predictions.
URL: https://openreview.net/forum?id=olGaiwoZHZ
---
Title: A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems
Authors: William Loh, Sajib Kumer Sinha, Ankur Agarwal, Pascal Poupart
Abstract: Contextual bandits are incredibly useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation is maintained. This formulation lends itself to a wider set of applications such as recommendation tasks. To solve this problem, we introduce *conditionally coupled contextual* ($C_3$) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling, allowing online learning without retraining. Empirical results show that $C_3$ achieves 5.7% lower average cumulative regret than the next best algorithm on four OpenML tabular datasets and demonstrates a 12.4% click lift on the Microsoft News Dataset (MIND) compared to other algorithms.
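A rough sketch of the kind of kernel-smoothed Thompson sampling described in the abstract above (our own toy illustration; the Gaussian kernel, bandwidth, and Beta pseudo-count scheme are assumptions, not the paper's exact $C_3$ estimator):

```python
import numpy as np

rng = np.random.default_rng(1)

def nw_thompson_choose(arm_embs, hist_embs, hist_rewards, bandwidth=0.5):
    """Pick an arm by Thompson sampling with kernel-smoothed Beta posteriors.

    Pseudo-counts for each candidate arm come from a Nadaraya-Watson-style
    Gaussian kernel over the embeddings of past (embedding, reward) pairs,
    so nearby arms share statistical strength without any retraining.
    """
    samples = []
    for e in arm_embs:
        d2 = ((hist_embs - e) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))       # kernel weights
        a = 1.0 + (w * hist_rewards).sum()           # smoothed successes
        b = 1.0 + (w * (1 - hist_rewards)).sum()     # smoothed failures
        samples.append(rng.beta(a, b))               # Thompson draw
    return int(np.argmax(samples))

# Toy usage: 3 candidate arms, 20 past Bernoulli observations in 2-D space.
hist_embs = rng.normal(size=(20, 2))
hist_rewards = rng.integers(0, 2, size=20).astype(float)
arm_embs = rng.normal(size=(3, 2))
choice = nw_thompson_choose(arm_embs, hist_embs, hist_rewards)
```

Because updating amounts to appending one (embedding, reward) pair to the history, the posterior adapts online as reward distributions drift.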
URL: https://openreview.net/forum?id=tRbwfej9uY
---
Title: PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation
Authors: Alexandre Piché, Ehsan Kamalloo, Rafael Pardinas, Xiaoyin Chen, Dzmitry Bahdanau
Abstract: Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by the novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately $2\times$ faster learning compared to conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.
URL: https://openreview.net/forum?id=A35ak14Cyp
---
Title: BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis
Authors: Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Jiazhou Zhou, Lin Wang
Abstract: Text-to-3D synthesis has recently seen intriguing advances by combining the text-to-image priors with 3D representation methods, e.g., 3D Gaussian Splatting (3D GS), via Score Distillation Sampling (SDS). However, a hurdle of existing methods is their low efficiency: they rely on per-prompt optimization for each 3D object. A paradigm shift from per-prompt optimization to feed-forward generation for any unseen text prompt is therefore imperative, yet remains challenging. An obstacle is how to directly generate the set of millions of 3D Gaussians needed to represent a 3D object. This paper presents BrightDreamer, an end-to-end feed-forward approach that can achieve generalizable and fast (77 ms) text-to-3D generation. Our key idea is to formulate the generation process as estimating the 3D deformation from an anchor shape with predefined positions. For this, we first propose a Text-guided Shape Deformation (TSD) network to predict the deformed shape and its new positions, used as the centers (one attribute) of 3D Gaussians. To estimate the other four attributes (i.e., scaling, rotation, opacity, and SH), we then design a novel Text-guided Triplane Generator (TTG) to generate a triplane representation for a 3D object. The center of each Gaussian enables us to transform the spatial feature into the four attributes. The generated 3D Gaussians can be finally rendered at 705 frames per second. Extensive experiments demonstrate the superiority of our method over existing methods. Also, BrightDreamer possesses a strong semantic understanding capability even for complex text prompts. The project code is available in supplementary materials.
URL: https://openreview.net/forum?id=Rb19CQCwbi
---
Title: Fast Debiasing of the LASSO Estimator
Authors: Shuvayan Banerjee, James Saunderson, Radhendushka Srivastava, Ajit Rajwade
Abstract: In high-dimensional sparse regression, the \textsc{Lasso} estimator offers excellent theoretical guarantees but is well-known to produce biased estimates. To address this, \cite{Javanmard2014} introduced a method to ``debias'' the \textsc{Lasso} estimates for a random sub-Gaussian sensing matrix $\boldsymbol{A}$. Their approach relies on computing an ``approximate inverse'' $\boldsymbol{M}$ of the matrix $\boldsymbol{A}^\top \boldsymbol{A}/n$ by solving a convex optimization problem. This matrix $\boldsymbol{M}$ plays a critical role in mitigating bias and allows for the construction of confidence intervals from the debiased \textsc{Lasso} estimates. However, the computation of $\boldsymbol{M}$ is expensive in practice, as it requires iterative optimization. In this work, we re-parameterize the optimization problem to compute a ``debiasing matrix'' $\boldsymbol{W} := \boldsymbol{AM}^{\top}$ directly, rather than the approximate inverse $\boldsymbol{M}$. This reformulation retains the theoretical guarantees of the debiased \textsc{Lasso} estimates, as they depend on the \emph{product} $\boldsymbol{AM}^{\top}$ rather than on $\boldsymbol{M}$ alone. Notably, we derive a simple and computationally efficient closed-form expression for $\boldsymbol{W}$, applicable to the sensing matrix $\boldsymbol{A}$ in the original debiasing framework, under a specific deterministic condition. This condition is satisfied with high probability for a wide class of randomly generated sensing matrices. Moreover, the optimization problem based on $\boldsymbol{W}$ admits a unique optimal solution, unlike the original formulation based on $\boldsymbol{M}$. We verify our main result with numerical simulations.
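For reference, the debiasing step of \cite{Javanmard2014} that this reformulation targets takes the standard form (our paraphrase in the abstract's notation):

```latex
\hat{\boldsymbol{\theta}}^{d}
  = \hat{\boldsymbol{\theta}}^{\textsc{Lasso}}
    + \frac{1}{n}\,\boldsymbol{M}\boldsymbol{A}^{\top}
      \bigl(\boldsymbol{y} - \boldsymbol{A}\hat{\boldsymbol{\theta}}^{\textsc{Lasso}}\bigr)
  = \hat{\boldsymbol{\theta}}^{\textsc{Lasso}}
    + \frac{1}{n}\,\boldsymbol{W}^{\top}
      \bigl(\boldsymbol{y} - \boldsymbol{A}\hat{\boldsymbol{\theta}}^{\textsc{Lasso}}\bigr),
  \qquad \boldsymbol{W} := \boldsymbol{A}\boldsymbol{M}^{\top},
```

since $\boldsymbol{M}\boldsymbol{A}^{\top} = (\boldsymbol{A}\boldsymbol{M}^{\top})^{\top}$; the correction term, and hence the estimator's guarantees, depend on $\boldsymbol{M}$ only through the product $\boldsymbol{W}$.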
URL: https://openreview.net/forum?id=gEVPlLhoNI
---
Title: BNEM: A Boltzmann Sampler Based on Bootstrapped Noised Energy Matching
Authors: RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato
Abstract: Generating independent samples from a Boltzmann distribution is a highly relevant problem in scientific research, \textit{e.g.} in molecular dynamics, where one has initial access to the underlying energy function but not to samples from the Boltzmann distribution. We address this problem by learning the energies of the convolution of the Boltzmann distribution with Gaussian noise. These energies are then used to generate independent samples through a denoising diffusion approach. The resulting method, \textsc{Noised Energy Matching} (NEM), has lower variance and only slightly higher cost than previous related works. We also improve NEM through a novel bootstrapping technique called \textsc{Bootstrap NEM} (BNEM) that further reduces variance while only slightly increasing bias. Experiments on a collection of problems demonstrate that NEM can outperform previous methods while being more robust and that BNEM further improves on NEM. Codes are available at \url{https://github.com/tonyauyeung/BNEM}.
URL: https://openreview.net/forum?id=ZZktU0U6Pu
---
Title: PruneFuse: Efficient Data Selection via Weight Pruning and Network Fusion
Authors: Humaira Kousar, Hasnain Irshad Bhatti, Jaekyun Moon
Abstract: Efficient data selection is crucial for enhancing the training efficiency of deep neural networks and minimizing annotation requirements. Traditional methods often face high computational costs, limiting their scalability and practical use. We introduce PruneFuse, a novel strategy that leverages pruned networks for data selection and later fuses them with the original network to optimize training.
PruneFuse operates in two stages: First, it applies structured pruning to create a smaller pruned network that, due to its structural coherence with the original network, is well-suited for the data selection task. This small network is then trained and selects the most informative samples from the dataset. Second, the trained pruned network is seamlessly fused with the original network. This integration leverages the insights gained during the training of the pruned network to facilitate the learning process of the fused network while leaving room for the network to discover more robust solutions. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.
URL: https://openreview.net/forum?id=BvnxenZwqY
---
Title: TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization
Authors: Hugo Malard, Michel Olvera, Stéphane Lathuilière, Slim Essid
Abstract: Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
URL: https://openreview.net/forum?id=Xt9sdzQQlJ
---
Title: Stepwise Guided Policy Optimization: Coloring Your Incorrect Reasoning in GRPO
Authors: Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, Tianyi Lin
Abstract: Reinforcement learning (RL) has proven effective in strengthening the reasoning capabilities of large language models (LLMs). A widely adopted method, Group Relative Policy Optimization (GRPO)~\citep{Shao-2024-Deepseekmath}, has shown strong empirical results in training recent reasoning models~\citep{Guo-2025-Deepseek}, but it fails to update the policy when all responses within a group are incorrect (i.e., all-negative-sample groups). This limitation highlights a gap between artificial and human intelligence: unlike humans, who can learn from mistakes, GRPO discards these failure signals. We introduce a simple framework to mitigate the all-negative-sample issue by incorporating response diversity within groups using a \textit{step-wise} judge model, which can be trained directly or adapted from existing LLMs. In a simplified setting, we prove that this diversification accelerates GRPO’s learning dynamics. We then empirically validate Stepwise Guided Policy Optimization (SGPO) across model sizes (7B, 14B, 32B) in both offline and online training on nine reasoning benchmarks (including base and distilled variants). Overall, SGPO improves average performance and is effective in early and mid-training when all-negative groups are prevalent, while improvements are not uniform across every benchmark and depend on the structure and informativeness of negative samples. Finally, SGPO does not require the judge model to generate correct solutions, distinguishing it from knowledge distillation methods.
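The all-negative-group failure mode follows directly from GRPO's group-normalized advantage; a minimal NumPy sketch of the standard computation (our illustration, not the authors' code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each response's reward z-scored within
    its sampled group (reward minus group mean, over group std)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A mixed group yields informative, non-zero advantages...
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# ...but an all-incorrect group has zero mean-centered reward everywhere,
# so its advantages collapse to zero and contribute no policy gradient.
all_neg = grpo_advantages([0.0, 0.0, 0.0, 0.0])
```

SGPO's step-wise judge injects reward diversity into exactly these degenerate groups, restoring a usable learning signal.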
URL: https://openreview.net/forum?id=ALnVAqtshR
---
Title: Lifelong Open-Ended Probability Predictors
Authors: Omid Madani
Abstract: We advance probabilistic multiclass prediction on open-ended streams
of items. In this setting, a predictor must emit items with
probabilities, and adapt to significant non-stationarity, including
new item appearances and frequency changes. The predictor is not given
the set of items that it is to predict a priori, and moreover the
totality of the items can grow unbounded: the space-limited predictor
need only track the currently salient items and their probabilities.
We develop Sparse Moving Average techniques (SMAs), including
adaptations of sparse EMA as well as novel queue-based methods with
dynamic per-item histories. For performance evaluation, to handle new
items, we develop a bounded version of log-loss. Our findings, on a
range of synthetic and real data streams, show that dynamic
predictand-specific (per connection) parameters, such as learning
rates, enhance both adaptation speed and stability.
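As a rough illustration of the sparse-EMA member of this family (our own sketch; the paper's predictors additionally use dynamic per-item learning rates and queue-based histories), each observation decays all tracked probabilities, boosts the observed item, and prunes items falling below a threshold to respect the space bound:

```python
def sparse_ema_update(probs, item, lr=0.05, prune_below=1e-3):
    """One sparse-EMA step on a dict of tracked item probabilities:
    decay every tracked item, boost the observed one, prune tiny entries."""
    for k in list(probs):
        probs[k] *= (1.0 - lr)
    probs[item] = probs.get(item, 0.0) + lr
    for k in [k for k, v in probs.items() if v < prune_below]:
        del probs[k]          # space bound: keep only salient items
    return probs

# Toy stream: new items ("b", "c") appear mid-stream and get tracked.
probs = {}
for sym in ["a", "a", "b", "a", "c", "a"]:
    probs = sparse_ema_update(probs, sym)
```

The tracked mass converges toward 1 as the stream progresses, and frequency shifts are absorbed at a rate set by `lr`.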
URL: https://openreview.net/forum?id=rojnGCcMaK
---
Title: Understanding Accelerated Gradient Methods: Lyapunov Analyses and Hamiltonian-Assisted Interpretations
Authors: Penghui Fu, Zhiqiang Tan
Abstract: We formulate two classes of first-order algorithms more general than previously studied for minimizing smooth and strongly convex or, respectively, smooth and convex functions. We establish sufficient conditions, via new discrete Lyapunov analyses, for achieving accelerated convergence rates which match Nesterov's methods in the strongly and general convex settings. Our results identify, for the first time, a simple and unified condition on gradient correction for accelerated convergence. Next, we study the convergence of limiting ordinary differential equations (ODEs), including high-resolution ODEs, and point out currently notable gaps between the convergence properties of the corresponding algorithms and ODEs, especially regarding the role of gradient correction. Finally, we propose a novel class of discrete algorithms, called the Hamiltonian-assisted gradient method, directly based on a Hamiltonian function and several interpretable operations, and then demonstrate meaningful and unified interpretations of our acceleration conditions in terms of the momentum variable updates.
URL: https://openreview.net/forum?id=0jvg4M1W40
---
Title: HiBaNG: Hierarchical Bayesian Nonparametric Granger Causal Discovery in Low-Data Regimes
Authors: He Zhao, Vassili Kitsios, Terence O'kane, Edwin V. Bonilla
Abstract: We present a principled probabilistic framework for discovering Granger causal relationships from multivariate time-series data in low-data regimes, where short sequences limit the applicability of modern deep learning approaches. While deep neural vector autoregressive (VAR) models perform well in high-data settings, they often struggle to generalize with limited samples and provide little insight into model uncertainty. To address these challenges, we introduce HiBaNG, a hierarchical Bayesian nonparametric framework for Granger causal discovery. HiBaNG places a hierarchical factorized prior over binary Granger causal graphs that encodes structured sparsity and enables interpretable, uncertainty-aware inference. We develop a tractable Gibbs sampling algorithm that exploits conjugacy and augmentation for scalable posterior estimation. Extensive experiments on synthetic, semi-synthetic, and real-world climate datasets demonstrate that HiBaNG consistently outperforms both classical and deep VAR baselines, achieving improved accuracy and calibrated uncertainty.
URL: https://openreview.net/forum?id=e4VO3YlRBr
---
Title: Wikipedia in the Era of LLMs: Evolution and Risks
Authors: Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
Abstract: In this paper, we present a comprehensive analysis and monitoring framework for the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing article content and page views to study the recent changes in Wikipedia and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been affected by LLMs, with an impact of approximately 1% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models could shift. Moreover, the effectiveness of RAG might decrease if the knowledge has been contaminated by LLMs. While LLMs have not yet fully changed Wikipedia's language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks in NLP research.
URL: https://openreview.net/forum?id=ahVmnYkVLt
---
Title: Process Reward Models That Think
Authors: Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
Abstract: Step-by-step verifiers—also known as process reward models (PRMs)—are a key ingredient for test-time scaling, but training them requires expensive step-level supervision. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers—using only 1% of the process labels in PRM800K—across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation over subsets of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained with the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. This work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training.
URL: https://openreview.net/forum?id=FPVCb0WMuN
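As background for the best-of-N selection protocol mentioned above: sample N candidate solutions, score each with a verifier, and return the top-scoring one. A minimal generic sketch (the toy scorer below is a stand-in for illustration only, not ThinkPRM; aggregating step scores by their minimum is one common convention for process reward models):

```python
# Generic best-of-N selection with a step-wise verifier.
# A solution is a list of reasoning steps; its score is the score of
# its weakest step (min-aggregation, a common PRM convention).

def best_of_n(candidates, step_scorer):
    """Pick the candidate whose weakest step scores highest."""
    def solution_score(steps):
        return min(step_scorer(s) for s in steps)
    return max(candidates, key=solution_score)

# Toy stand-in scorer: penalize steps containing a flagged token.
def toy_scorer(step):
    return 0.0 if "error" in step else 1.0

cands = [
    ["step 1 ok", "step 2 error"],
    ["step 1 ok", "step 2 ok"],
]
assert best_of_n(cands, toy_scorer) == ["step 1 ok", "step 2 ok"]
```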
---
Title: A Survey of Reasoning in Autonomous Driving Systems: Open Challenges and Emerging Paradigms
Authors: Kejin Yu, Yuhan Sun, Taiqiang Wu, Ruixu Zhang, Zhiqiang Lin, Yuxin Meng, Junjie Wang, Yujiu Yang
Abstract: The development of high-level autonomous driving (AD) is shifting from perception-centric limitations to a more fundamental bottleneck, namely, a deficit in robust and generalizable reasoning. Although current AD systems manage structured environments, they consistently falter in long-tail scenarios and complex social interactions that require human-like judgment. Meanwhile, the advent of large language and multimodal models (LLMs and MLLMs) presents a transformative opportunity to integrate a powerful cognitive engine into AD systems, moving beyond pattern matching toward genuine comprehension. However, a systematic framework to guide this integration is critically lacking. To bridge this gap, we provide a comprehensive review of this emerging field and argue that reasoning should be elevated from a modular component to the system's cognitive core. Specifically, we first propose a novel Cognitive Hierarchy to decompose the monolithic driving task according to its cognitive and interactive complexity. Building on this, we further derive and systematize seven core reasoning challenges, such as the responsiveness-reasoning trade-off and social-game reasoning. Furthermore, we conduct a dual-perspective review of the state-of-the-art, analyzing both system-centric approaches to architecting intelligent agents and evaluation-centric practices for their validation. Our analysis reveals a clear trend toward holistic and interpretable "glass-box" agents. In conclusion, we identify a fundamental and unresolved tension between the high-latency, deliberative nature of LLM-based reasoning and the millisecond-scale, safety-critical demands of vehicle control. For future work, a primary objective is to bridge the symbolic-to-physical gap by developing verifiable neuro-symbolic architectures, robust reasoning under uncertainty, and scalable models for implicit social negotiation.
URL: https://openreview.net/forum?id=XwQ7dc4bqn
---
Title: Mollifier Layers: Enabling Efficient High-Order Derivatives in Inverse PDE Learning
Authors: Vinayak Vinayak, Ananyae Kumar bhartari, Vivek Shenoy
Abstract: Parameter estimation in inverse problems involving partial differential equations (PDEs) underpins modeling across scientific disciplines, especially when parameters vary in space or time. Physics-informed Machine Learning (PhiML) integrates PDE constraints into deep learning, but prevailing approaches depend on recursive automatic differentiation (autodiff), which produces inaccurate high-order derivatives, inflates memory usage, and underperforms in noisy settings. We propose Mollifier Layers, a lightweight, architecture-agnostic module that replaces autodiff with convolutional operations using analytically defined mollifiers. This reframing of derivative computation as smoothing integration enables efficient, noise-robust estimation of high-order derivatives directly from network outputs. Mollifier Layers attach at the output layer and require no architectural modifications. We compare them with three distinct architectures and benchmark performance across first-, second-, and fourth-order PDEs—including Langevin dynamics, heat diffusion, and reaction-diffusion systems—observing significant improvements in memory efficiency, training time and accuracy for parameter recovery across tasks. To demonstrate practical relevance, we apply Mollifier Layers to infer spatially varying epigenetic reaction rates from super-resolution chromatin imaging data—a real-world inverse problem with biomedical significance. Our results establish Mollifier Layers as an efficient and scalable tool for physics-constrained learning.
URL: https://openreview.net/forum?id=6mFVZSzyev
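The identity behind mollifier-based differentiation is standard calculus: for a smooth, normalized mollifier phi, (f * phi)' = f * phi', so derivatives of noisy data can be estimated by convolving with the analytic derivative of the mollifier. A minimal NumPy sketch of this general idea (a Gaussian kernel stands in here; this is not the paper's Mollifier Layer implementation):

```python
import numpy as np

# Estimate f'(x) from noisy samples by convolving with the analytic
# derivative of a normalized Gaussian mollifier: (f * phi)' = f * phi'.
def mollified_derivative(y, dx, sigma):
    half = int(4 * sigma / dx)
    t = np.arange(-half, half + 1) * dx
    phi = np.exp(-t**2 / (2 * sigma**2))
    phi /= phi.sum() * dx                  # normalize so integral(phi) = 1
    dphi = -t / sigma**2 * phi             # analytic derivative of phi
    # np.convolve flips the kernel, matching the convolution integral
    # (f * phi')(x) = integral f(x - t) phi'(t) dt
    return np.convolve(y, dphi, mode="same") * dx

x = np.linspace(0, 2 * np.pi, 2000)
dx = x[1] - x[0]
rng = np.random.default_rng(0)
y = np.sin(x) + 0.03 * rng.standard_normal(x.size)  # noisy samples

dy = mollified_derivative(y, dx, sigma=0.1)
interior = slice(200, -200)                # ignore boundary effects
err = np.max(np.abs(dy[interior] - np.cos(x)[interior]))
assert err < 0.1                           # close to the true derivative
```

Naive finite differences on the same noisy data would amplify the noise by a factor of 1/dx; the smoothing integration absorbs it instead, which is the noise-robustness property the abstract refers to.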
---
Title: Physics-Informed Deep B-Spline Networks
Authors: Zhuoyuan Wang, Raffaele Romagnoli, Saviz Mowlavi, Yorie Nakahira
Abstract: Physics-informed machine learning offers a promising framework for solving complex partial differential equations (PDEs) by integrating observational data with governing physical laws. However, learning PDEs with varying parameters and changing initial conditions and boundary conditions (ICBCs) with theoretical guarantees remains an open challenge. In this paper, we propose physics-informed deep B-spline networks, a novel technique that approximates a family of PDEs with different parameters and ICBCs by learning B-spline control points through neural networks. The proposed B-spline representation reduces the learning task from predicting solution values over the entire domain to learning a compact set of control points, enforces strict compliance to initial and Dirichlet boundary conditions by construction, and enables analytical computation of derivatives for incorporating PDE residual losses. While existing approximation and generalization theories are not applicable in this setting—where solutions of parametrized PDE families are represented via B-spline bases—we fill this gap by showing that B-spline networks are universal approximators for such families under mild conditions. We also derive generalization error bounds for physics-informed learning in both elliptic and parabolic PDE settings, establishing new theoretical guarantees. Finally, we demonstrate in experiments that the proposed technique has improved efficiency-accuracy tradeoffs compared to existing techniques in a dynamical system problem with discontinuous ICBCs and can handle nonhomogeneous ICBCs and non-rectangular domains.
URL: https://openreview.net/forum?id=tHO2zEqmzm
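The abstract's claim that Dirichlet boundary conditions can be enforced "by construction" follows from a basic property of clamped B-splines: with a clamped knot vector, the curve interpolates its first and last control points, so pinning those control points pins the boundary values. A minimal de Boor evaluation sketch of this generic property (not the paper's network):

```python
import numpy as np

def de_boor(x, knots, ctrl, p):
    """Evaluate a degree-p B-spline with a clamped knot vector at x."""
    n = len(ctrl)
    k = np.searchsorted(knots, x, side="right") - 1
    k = min(max(k, p), n - 1)              # clamp span index to a valid range
    d = [ctrl[j + k - p] for j in range(p + 1)]
    for r in range(1, p + 1):              # de Boor triangular recursion
        for j in range(p, r - 1, -1):
            num = x - knots[j + k - p]
            den = knots[j + 1 + k - r] - knots[j + k - p]
            alpha = num / den
            d[j] = (1 - alpha) * d[j - 1] + alpha * d[j]
    return d[p]

# Cubic spline, 6 control points, clamped uniform knots on [0, 1].
p = 3
ctrl = np.array([2.0, 0.3, -1.0, 0.8, 0.1, -2.0])
knots = np.array([0, 0, 0, 0, 1/3, 2/3, 1, 1, 1, 1], dtype=float)

# The values at the domain ends are exactly the end control points,
# so Dirichlet boundary conditions hold by construction.
assert abs(de_boor(0.0, knots, ctrl, p) - ctrl[0]) < 1e-12
assert abs(de_boor(1.0, knots, ctrl, p) - ctrl[-1]) < 1e-12
```

In the paper's setting a network predicts the control points; since the basis is polynomial, derivatives for PDE residual losses are likewise available analytically.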
---
Title: Multiway Multislice PHATE: Visualizing Hidden Dynamics of RNNs through Training
Authors: Jiancheng Xie, Lou C Voinov, Noga Mudrik, Gal Mishne, Adam Shabti Charles
Abstract: Recurrent neural networks (RNNs) are a widely used tool for sequential data analysis; however, they are still often seen as black boxes. Visualizing the internal dynamics of RNNs is a critical step toward understanding their functional principles and developing better architectures and optimization strategies. Prior studies typically emphasize network representations only after training, overlooking how those representations evolve during learning. Here, we present Multiway Multislice PHATE (MM-PHATE), a graph-based embedding method for visualizing the evolution of RNN hidden states across the multiple dimensions spanned by RNNs: time, training epoch, and units. Across controlled synthetic benchmarks and real RNN applications, MM-PHATE preserves hidden-representation community structure among units and reveals training-phase changes in representation geometry. In controlled synthetic systems spanning multiple bifurcation families and smooth state-space warps, MM-PHATE recovers qualitative dynamical progression while distinguishing family-level differences. In task-trained RNNs, the embedding identifies information-processing and compression-related phases during training, and time-resolved geometric and entropy-based summaries align with linear probes, time-step ablations, and label-state mutual information. These results show that MM-PHATE provides an intuitive and comprehensive way to inspect RNN hidden dynamics across training and to better understand how model architecture and learning dynamics relate to performance.
URL: https://openreview.net/forum?id=9Yr4V7iZsq
---
Title: Are vision language models robust to classic uncertainty challenges?
Authors: Xi Wang, Eric Nalisnick
Abstract: Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advancements in large-scale vision language models (VLMs, e.g. GPT-4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. In this work, we sanity-check whether modern VLMs pass the two most "classic" uncertainty quantification challenges: anomaly detection and classification under inherently ambiguous conditions. We find that newer and larger VLMs indeed exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions, often causing them to hallucinate confident responses even when faced with unclear or anomalous inputs. Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from uncertain predictions enables significant reliability gains, achieving near-perfect robustness in several settings. However, for domain-specific tasks such as galaxy morphology classification, a lack of specialized knowledge prevents reliable uncertainty estimation. Finally, we propose a simple mechanism based on caption diversity to reveal a model’s internal uncertainty, enabling practitioners to predict when models will successfully abstain without relying on labeled data.
URL: https://openreview.net/forum?id=4lCSYCNfmo
---
Title: Learning Lagrangian Interaction Dynamics with Sampling-Based Model Order Reduction
Authors: Hrishikesh Viswanath, Yue Chang, Aleksey Panas, Julius Berner, Peter Yichen Chen, Aniket Bera
Abstract: Simulating physical systems governed by Lagrangian dynamics often entails solving partial differential equations (PDEs) over high-resolution spatial domains, leading to significant computational expense. Reduced-order modeling (ROM) mitigates this cost by evolving low-dimensional latent representations of the underlying system. While neural ROMs enable querying solutions from latent states at arbitrary spatial points, their latent states typically represent the global domain and struggle to capture localized, highly dynamic behaviors such as fluids. We propose a sampling-based reduction framework that evolves Lagrangian systems directly in physical space, over the particles themselves, reducing the number of active degrees of freedom via data-driven neural PDE operators. To enable querying at arbitrary spatial locations, we introduce a learnable kernel parameterization that uses local spatial information from time-evolved sample particles to infer the underlying solution manifold. Empirically, our approach achieves a 6.6$\times$–32$\times$ reduction in input dimensionality while maintaining high-fidelity evaluations across diverse Lagrangian regimes, including fluid flows, granular media, and elastoplastic dynamics. We refer to this framework as GIOROM (Geometry-InfOrmed Reduced-Order Modeling). All of our code and data are available at https://github.com/HrishikeshVish/GIOROM
URL: https://openreview.net/forum?id=vXCQA1EzaG
---
Title: Open Technical Problems in Open-Weight AI Model Risk Management
Authors: Stephen Casper, Kyle O'Brien, Shayne Longpre, Elizabeth Seger, Kevin Klyman, Rishi Bommasani, Aniruddha Nrusimha, Ilia Shumailov, Sören Mindermann, Steven Basart, Frank Rudzicz, Kellin Pelrine, Avijit Ghosh, Andrew Strait, Robert Kirk, Dan Hendrycks, Peter Henderson, J Zico Kolter, Geoffrey Irving, Yarin Gal, Yoshua Bengio, Dylan Hadfield-Menell
Abstract: Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted. However, compared to proprietary models, open-weight models pose different opportunities and challenges for effective risk management. For example, they allow for more open research and testing. However, managing their risks is also challenging because they can be modified arbitrarily, used without oversight, and spread irreversibly. Currently, there is limited research on safety tooling specific to open-weight models. Addressing these gaps will be key to both realizing their benefits and mitigating their harms. In this paper, we present 16 open technical challenges for open-weight model safety involving training data, training algorithms, evaluations, deployment, and ecosystem monitoring. We conclude by discussing the nascent state of the field, emphasizing that openness about research, methods, and evaluations -- not just weights -- will be key to building a rigorous science of open-weight model risk management.
URL: https://openreview.net/forum?id=8QyGLnFkzc
---
Title: Continual Robot Learning via Language-Guided Skill Acquisition
Authors: Shuo Cheng, Zhaoyi Li, Kelin Yu, Danfei Xu
Abstract: To support daily human tasks, robots need to tackle complex, long-horizon tasks and continuously acquire new skills to handle new problems. Deep Reinforcement Learning (DRL) offers potential for learning fine-grained skills but relies heavily on human-defined rewards and faces challenges with long-horizon goals. Task and Motion Planning (TAMP) is adept at handling long-horizon tasks but often needs tailored domain-specific skills, resulting in practical limitations and inefficiencies. To overcome these complementary limitations, we propose LG-SAIL (Language Models Guided Sequential, Adaptive, and Incremental Skill Learning), a framework that leverages Large Language Models (LLMs) to synergistically integrate TAMP and DRL for continuous skill learning in long-horizon tasks. Our framework achieves automatic task decomposition, operator creation, and dense reward generation for efficiently acquiring the desired skills. To facilitate new skill learning, our framework maintains a symbolic skill library and utilizes the existing model from semantically related skills to warm-start the training. LG-SAIL demonstrates superior performance compared to baselines across six challenging simulated task domains spanning two benchmarks. Furthermore, we demonstrate the ability to reuse learned skills to expedite learning in new task domains, and deploy the system on a physical robot platform. More results are available at: https://sites.google.com/view/continuallearning.
URL: https://openreview.net/forum?id=oYRNxxGN9u
---
New submissions
===============
Title: Unified Semantic Transformer for 3D Scene Understanding
Abstract: Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed as task-specific systems. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding: a novel feed-forward neural network that unifies a diverse set of 3D dense semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and takes only a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple dense semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, and articulations, solely from RGB images. The method is trained using a combination of 2D distillation and novel multi-view losses designed to ensure 3D view consistency, relying heavily on self-supervision. We demonstrate that UNITE achieves state-of-the-art performance on several different dense semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground-truth 3D geometry.
URL: https://openreview.net/forum?id=eB7oHCJzud
---
Title: Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
Abstract: Despite remarkable progress in Vision-Language-Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. To this end, we present a systematic, data-centric analysis of VLA research organized around three pillars: datasets, benchmarks, and data engines. For datasets, we categorize real-world and synthetic corpora along embodiment diversity, modality composition, and action space formulation, revealing a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. For benchmarks, we analyze task complexity and environment structure jointly, exposing structural gaps in compositional generalization and long-horizon reasoning evaluation that existing protocols fail to address. For data engines, we examine simulation-based, video-reconstruction, and automated task-generation paradigms, identifying their shared limitations in physical grounding and sim-to-real transfer. Synthesizing these analyses, we distill four open challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation. Addressing them, we argue, requires treating data infrastructure as a first-class research problem rather than a background concern.
URL: https://openreview.net/forum?id=tAaWFpvnmm
---
Title: Benchmarking Cross-Seed Feature Correspondence in Sparse Autoencoders
Abstract: Sparse autoencoders (SAEs) trained on the same model learn seed-dependent dictionaries, raising the question of whether features found by one run correspond to those found by another. We introduce a benchmark that evaluates cross-seed matching methods on functional grounds, beyond geometric similarity, using two complementary tests: per-feature ablation fingerprints for scalable screening, and a substitution test that directly measures functional interchangeability by swapping one SAE’s feature contribution for another’s. Both tests are validated against hard negative controls and stratified by feature activity level. Evaluating eight matching methods on BatchTopK and ReLU SAEs (five seeds, Pythia-410M layers 4, 8, and 12, with replication on GPT-2 Small), we find that cross-seed correspondence exhibits a quality/coverage tradeoff analogous to precision/recall. At the top of the ranking, greedy cosine and Sinkhorn optimal transport perform equally well (R = 0.86 at top-100); in the tail, Sinkhorn with uniform marginals retains higher quality (R = 0.60 vs. 0.52 at top-2000), achieving the highest overall AUSQC (area under the substitution-quality curve). Results are validated on a held-out corpus with seed-level bootstrap confidence intervals. All claims are restricted to the fingerprinted feature subset (~42%), and we show that effect sizes attenuate for low-activity features. The benchmark protocol is designed so that future consistency methods can be evaluated on the same footing, providing a shared standard for measuring progress on feature reproducibility.
URL: https://openreview.net/forum?id=5cy6WtSC8f
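Of the matching methods named above, greedy cosine matching is the simplest baseline: L2-normalize each dictionary's feature directions, compute the full cosine-similarity matrix, and repeatedly pair off the highest-similarity remaining row and column. A minimal sketch on synthetic dictionaries (illustrative only, not the benchmark's implementation):

```python
import numpy as np

def greedy_cosine_match(A, B):
    """Greedily pair rows of A with rows of B by cosine similarity.

    A, B: (n_features, d) feature-direction matrices from two SAE seeds.
    Returns (i, j, sim) triples in descending order of similarity.
    """
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    S = An @ Bn.T                          # full cosine-similarity matrix
    pairs = []
    for _ in range(min(A.shape[0], B.shape[0])):
        i, j = np.unravel_index(np.argmax(S), S.shape)
        pairs.append((int(i), int(j), float(S[i, j])))
        S[i, :] = -np.inf                  # each feature matched at most once
        S[:, j] = -np.inf
    return pairs

# Sanity check: seed B's dictionary is a noisy permutation of seed A's,
# so greedy cosine should recover the permutation exactly.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 64))
perm = rng.permutation(20)
B = A[perm] + 0.05 * rng.standard_normal((20, 64))

pairs = greedy_cosine_match(A, B)
recovered = {j: i for i, j, _ in pairs}
assert all(recovered[j] == perm[j] for j in range(20))
```

The quality/coverage tradeoff the abstract describes shows up here directly: the earliest greedy pairs are high-confidence matches, while pairs taken from the tail of the list are increasingly unreliable.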
---
Title: Tight Bounds and Fundamental Impossibility for Knowledge Editing Side Effects in Transformers
Abstract: Knowledge editing enables targeted updates to factual associations in large language models without costly retraining, yet no formal guarantees exist for the unintended side effects these updates introduce, making safe deployment in high-stakes settings certifiably impossible. We close this gap with the first theoretical framework providing provably tight bounds (up to a computable constant $C_\Phi$) on knowledge editing side effects in transformers. Our central theorem establishes tight, computable bounds on how rank-$r$ weight perturbations propagate to unrelated inputs, with all constants made explicit via a non-circular algorithm that avoids the dependency cycles afflicting prior analyses. We further derive edit capacity bounds that predict when sequential edits trigger catastrophic degradation, and prove a fundamental impossibility result: perfect locality and generalization are mutually exclusive under representational superposition, characterizing an inherent Pareto frontier rather than a fixable algorithmic limitation. Experiments across 21,600 edits on GPT-2 and GPT-J, with additional cross-architecture validation on OPT, BLOOM, and LLaMA (Appendix), confirm all theoretical predictions (Spearman $\rho = 0.82$, $p < 10^{-50}$), with the impossibility frontier matching measurements within 3%. Applied as a pre-deployment safety screen on GPT-2 with ROME, our bounds raise locality from 67.1% to 92.3%, demonstrating immediate practical value.
URL: https://openreview.net/forum?id=0h40LsHt34
---
Title: Uncertainty Quantification in Linear Regression With Mismatched Data
Abstract: The fundamental assumption in regression analysis that each response-predictor pair corresponds to the same observational unit is not always valid, especially with mismatched data. This paper presents a novel approach for uncertainty quantification in linear regression when data mismatch occurs. Using the generalized fiducial inference framework, we develop a method to generate fiducial samples for constructing confidence intervals and measuring uncertainty in key regression parameters. We establish the theoretical properties of our approach, and the empirical coverage rates of the proposed method are consistently closer to the target confidence level than those of existing approaches on both simulated and real datasets. To our knowledge, this is the first study to explore uncertainty quantification for mismatched data in linear regression.
URL: https://openreview.net/forum?id=rPPQ21dntB
---
Title: The Scaling Laws of Classification Networks: Insights from Adaptive Exact Average Density Approximation
Abstract: Our main goal is to establish a generalization bound for classification tasks that aligns with the empirical scaling laws observed in deep neural networks (DNNs). Prior studies on scaling laws have placed little emphasis on network depth, which makes our approach particularly innovative. Assuming the boundary of the target classification function is a semi-algebraic set, we show that the generalization error can follow scaling laws for large networks. This study explores the relationship between scaling laws and sample size on data manifolds. The rate of scaling with respect to sample size is intrinsically linked to the effective dimension of the data manifold, independent of the specific network model or learning algorithm applied. In contrast, the rate per parameter exhibits variability across learning methods and network architectures. This variability can be quantified by the number of model parameters required to accurately characterize the target classification boundary at various radii of covering balls. Using this scaling law, we empirically demonstrate the feasibility of predicting the generalization errors of larger models from metrics of smaller models.
URL: https://openreview.net/forum?id=Y7gwqRVnqi
---
Title: Conditional Local Importance by Quantile Expectations
Abstract: Global variable importance measures are commonly used to interpret the results of machine learning models. Local variable importance techniques assess how variables contribute to individual observations. Popular current methods, including LIME and SHAP, typically fail to accurately reflect locally dependent relationships between variables and instead focus on marginal importance values. Additionally, they are not natively adapted for multi-class classification problems. We propose a new model-agnostic method for calculating local variable importance, CLIQUE, that captures locally dependent relationships, provides improvements over permutation-based methods, and can be directly applied to multi-class classification problems. Simulated and real-world examples show that CLIQUE emphasizes locally dependent information, captures interaction behavior beyond what can be evaluated by correlations, and properly reduces bias in regions where variables do not affect the response.
URL: https://openreview.net/forum?id=gsuZFPDRqE
---
Title: Conformal Tradeoffs: Operational Profiles Beyond Coverage
Abstract: Conformal prediction gives exact finite-sample coverage guarantees under exchangeability, but deployed systems are judged by more than coverage alone. For a fixed calibrated rule reused over a finite operational window, stakeholders also care about deployment-facing quantities such as commitment frequency, deferral, and decisive error exposure. These are not determined by coverage: calibration choices with similar coverage can still induce materially different operational profiles. We study this characterization gap in a scoped setting: binary split conformal prediction under exchangeability with a fixed deployed rule. We introduce the Small-Sample Beta Correction (SSBC), which gives finite-sample coverage semantics for the deployed rule: it inverts the Beta/Beta-Binomial law governing calibration-conditional coverage to map a user request $(\alpha^\star,\delta)$ to the least conservative calibration grid point with calibration-conditional PAC semantics for the realized deployed rule. Calibrate-and-Audit then fixes the rule by calibration and uses an independent audit split to estimate the induced region-class label table, a reusable summary from which deployment-facing Key Performance Indicators (KPIs) follow by projection. Under this design, fixed operational rates admit exact finite-sample Binomial inference, while Beta-Binomial envelopes serve as practical predictive summaries for future windows. The induced partition also exposes regime boundaries, Pareto-relevant tradeoffs, and inverse-pricing questions for fixed downstream conventions. Simulations validate the SSBC semantics and compare audit-based summaries with leave-one-out planning proxies; molecular toxicity data provide an audit-based empirical example, and a solubility case study illustrates scenario planning once coverage semantics are fixed.
URL: https://openreview.net/forum?id=UfuChBDHL8
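As background for the abstract above, the standard split-conformal recipe it builds on: calibrate a score threshold as the ceil((n+1)(1-alpha))-th smallest calibration score, which yields marginal coverage of at least 1-alpha under exchangeability. (The paper's SSBC then adjusts this to obtain calibration-conditional, finite-sample semantics.) A minimal sketch of the plain version:

```python
import math
import numpy as np

def split_conformal_threshold(cal_scores, alpha):
    """Standard split-conformal threshold.

    Returns the ceil((n+1)(1-alpha))-th smallest calibration score;
    predicting {y : score(x, y) <= threshold} then has marginal
    coverage >= 1 - alpha under exchangeability.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")                  # too few points: predict everything
    return float(np.sort(cal_scores)[rank - 1])

# Empirical check: coverage on exchangeable test scores.
rng = np.random.default_rng(0)
scores = rng.standard_normal(100_000)
cal, test = scores[:1000], scores[1000:]
q = split_conformal_threshold(cal, alpha=0.1)
coverage = float(np.mean(test <= q))
assert coverage > 0.87                       # ~0.90 expected, allow slack
```

The spread of the realized coverage around 0.90 across calibration draws is exactly the Beta-distributed, calibration-conditional variability that the SSBC described above is designed to control.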
---
Title: RESIST: Resilient Decentralized Learning Using Consensus Gradient Descent
Abstract: Empirical risk minimization (ERM) is a cornerstone of modern machine learning (ML), supported by advances in optimization theory that ensure efficient solutions with provable algorithmic convergence rates, which measure the speed at which optimization algorithms approach a solution, and statistical learning rates, which characterize how well the solution generalizes to unseen data. Privacy, memory, computational, and communications constraints increasingly necessitate data collection, processing, and storage across network-connected devices. In many applications, these networks operate in decentralized settings where a central server cannot be assumed, requiring decentralized ML algorithms that are both efficient and resilient. Decentralized learning, however, faces significant challenges, including an increased attack surface for adversarial interference during decentralized learning processes. This paper focuses on the man-in-the-middle (MITM) attack, wherein adversaries exploit communication vulnerabilities between devices to inject malicious updates during training, potentially causing models to deviate significantly from their intended ERM solutions. To address this challenge, we propose RESIST (Resilient dEcentralized learning using conSensus gradIent deScenT), an optimization algorithm designed to be robust against adversarially compromised communication links, where transmitted information may be arbitrarily altered before being received. Unlike existing adversarially robust decentralized learning methods, which often (i) guarantee convergence only to a neighborhood of the solution, (ii) lack guarantees of linear convergence for strongly convex problems, or (iii) fail to ensure statistical consistency as sample sizes grow, RESIST overcomes all three limitations. It achieves algorithmic and statistical convergence for strongly convex, Polyak–Łojasiewicz, and nonconvex ERM problems by employing a multistep consensus gradient descent framework and robust statistics-based screening methods to mitigate the impact of MITM attacks. Experimental results demonstrate the robustness and scalability of RESIST across diverse attack strategies, screening methods, and loss functions, confirming its suitability for real-world decentralized optimization and learning in adversarial environments.
URL: https://openreview.net/forum?id=aNu8S9O6Ug
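Robust-statistics screening of the kind RESIST employs is often illustrated with a coordinate-wise trimmed mean: discard the largest and smallest values in each coordinate before averaging, so a bounded number of arbitrarily corrupted messages cannot drag the aggregate. A generic sketch of that standard technique (not RESIST's actual screening rule):

```python
import numpy as np

def trimmed_mean(updates, trim):
    """Coordinate-wise trimmed mean of neighbor updates.

    updates: (m, d) array, one row per neighbor message.
    trim: number of largest and smallest entries dropped per coordinate.
    Tolerates up to `trim` arbitrarily corrupted rows.
    """
    s = np.sort(updates, axis=0)
    return s[trim : updates.shape[0] - trim].mean(axis=0)

# Honest neighbors report gradients near [1, -2]; one compromised link
# (e.g. a man-in-the-middle) injects a huge bogus update.
rng = np.random.default_rng(0)
honest = np.array([1.0, -2.0]) + 0.01 * rng.standard_normal((9, 2))
adversarial = np.array([[1e6, -1e6]])
updates = np.vstack([honest, adversarial])

naive = updates.mean(axis=0)               # wrecked by the attacker
robust = trimmed_mean(updates, trim=1)
assert np.abs(naive[0] - 1.0) > 1000       # plain averaging fails
assert np.allclose(robust, [1.0, -2.0], atol=0.05)
```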
---
Title: Functional Safety for Language Models
Abstract: Large language models (LLMs) are increasingly used for document and code editing, yet standard approaches typically lack formal assurances about the scope, structure, or side effects of their modifications. We introduce Functional Safety, a hierarchy-aware editing architecture that formalizes LLM-driven edits as typed plans over explicit hierarchies with deterministic execution. A stochastic planning stage operates on an explicit hierarchical representation and emits a structured plan of typed operations that separate structural reorganization from bounded content generation. We analyze each step with two footprints: a "structural footprint" (nodes whose relations may change) and a "payload footprint" (nodes whose local content may change), and the guarantees are scoped per step. Execution is performed by a deterministic, structure-constrained component that enforces locality, guards protected regions, preserves byte-for-byte payload outside each step's payload footprint, and confines structural changes to each step's structural footprint under the stated assumptions. We formalize the architecture, specify its invariants, and prove Deterministic Safety and Conditional Functional Safety theorems that bound side effects under those assumptions. Empirical evaluations on long-form document rewriting, code refactoring, and multi-page policy briefs show that Functional Safety substantially reduces side effects relative to ReAct-style tool agents reflecting current agentic editor practice. These results demonstrate that principles from functional programming—explicit structure, composability, and controlled side effects—provide a rigorous foundation for reliable LLM-driven editing.
URL: https://openreview.net/forum?id=78bR5opfkN
---
Title: SAFT: Structure-Aware Fine-Tuning of Large Language Models for AMR-to-Text Generation
Abstract: Large Language Models (LLMs) are increasingly applied to tasks involving structured inputs such as semantic graphs, yet adapting them to such inputs remains non-trivial. Common approaches either linearize graphs, discarding structural information, or rely on specialized architectures that are not directly compatible with standard pretrained LLMs. We present SAFT, a structure-aware fine-tuning method that augments LLMs with graph positional encodings derived from the magnetic Laplacian of the input graph. These encodings are projected into the LLM embedding space, introducing relational inductive bias without modifying the model architecture. While SAFT is conceptually applicable to any task involving directed graph inputs with node–token alignment, we focus on the task of generating natural language text from an input AMR (Abstract Meaning Representation) graph, a directed graph encoding predicate-argument semantics of natural language sentences. AMR-to-text generation requires models to integrate both linguistic fluency and structural faithfulness, making it a demanding evaluation setting. We show that SAFT consistently improves or matches standard fine-tuning across all tested model families and scales, with gains that increase with graph structural complexity, both on sentence-level graphs of increasing depth and on document-level graphs of increasing size, demonstrating that structural encoding provides a reliable and scalable inductive bias for LLM fine-tuning.
URL: https://openreview.net/forum?id=QZoUMyzYDB
---
Title: Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing
Abstract: Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.
URL: https://openreview.net/forum?id=MlKBbnC0Fj
---
Title: Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems
Abstract: In this paper, we consider the decentralized, stochastic nonconvex strongly-concave (NCSC) minimax problem with nonsmooth regularization terms on both primal and dual variables, wherein a network of $m$ computing agents collaborate via peer-to-peer communications. We consider settings where the coupling function is in expectation or finite-sum form and the double regularizers are convex functions, applied separately to the primal and dual variables. Our algorithmic framework introduces a Lagrangian multiplier to eliminate the consensus constraint on the dual variable. Coupling this with variance-reduction (VR) techniques, our proposed method, entitled \texttt{VRLM}, requires only a single neighbor communication per iteration and achieves an $\mathcal{O}(\kappa^3\varepsilon^{-3})$ sample complexity under the general stochastic setting, with either a big-batch or small-batch VR option, where $\kappa$ is the condition number of the problem and $\varepsilon$ is the desired solution accuracy. With a big-batch VR, we can additionally achieve $\mathcal{O}(\kappa^2\varepsilon^{-2})$ communication complexity. Under the special finite-sum setting, our method with a big-batch VR can achieve an $\mathcal{O}(n + \sqrt{n} \kappa^2\varepsilon^{-2})$ sample complexity and $\mathcal{O}(\kappa^2\varepsilon^{-2})$ communication complexity, where $n$ is the number of components in the finite sum. All complexity results match the best-known results achieved by a few existing methods for solving special cases of the problem we consider. To the best of our knowledge, this is the first work that provides convergence guarantees for NCSC minimax problems with general convex nonsmooth regularizers applied to both the primal and dual variables in the decentralized stochastic setting. Numerical experiments are conducted on two machine learning problems.
URL: https://openreview.net/forum?id=t1Nj3VTNzQ
---
Title: DC-ESN: Diffusion Convolutional Echo State Network for Spatiotemporal Traffic Forecasting
Abstract: Traffic forecasting is a challenging spatiotemporal problem that requires capturing complex dependencies across both road networks and time. Existing deep learning approaches, such as Convolutional Neural Networks (CNNs) for spatial modeling, Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) networks for temporal modeling, as well as Graph Convolutional Networks (GCNs) for capturing topological structures, have achieved notable success. However, they often incur high computational cost and require a large number of trainable parameters. In this work, we introduce a novel architecture named Diffusion Convolutional Echo State Network (DC-ESN) designed for spatiotemporal forecasting, which combines diffusion convolution for spatial feature extraction with an Echo State Network (ESN) for efficient temporal modeling. This structural decoupling allows the model to learn complex spatial topologies via gradient descent while leveraging the asymptotic stability of the fixed reservoir for temporal memory, offering a robust and computationally efficient alternative to fully trainable spatiotemporal graph networks. Compared with the Diffusion Convolutional Recurrent Neural Network (DCRNN), DC-ESN achieves comparable predictive accuracy while significantly improving inference efficiency and reducing GPU memory usage. Experiments on the METR-LA and PEMS-BAY benchmark traffic datasets demonstrate that DC-ESN attains faster inference with minimal accuracy loss, making it suitable for real-time forecasting applications.
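A minimal echo state network illustrates the "fixed random reservoir, trained linear readout" principle that gives DC-ESN its efficiency; the reservoir size, leak rate, and ridge readout below are generic choices for illustration, not the paper's configuration, and the diffusion-convolution spatial front end is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

class ESN:
    """Minimal echo state network: a fixed random reservoir provides
    temporal memory; only the linear readout is trained (ridge regression)."""
    def __init__(self, n_in, n_res=100, spectral_radius=0.9, leak=0.5):
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # echo-state stability
        self.W, self.leak = W, leak
        self.W_out = None

    def states(self, U):
        x = np.zeros(self.W.shape[0])
        out = []
        for u in U:  # leaky-integrator reservoir update, never trained
            x = (1 - self.leak) * x + self.leak * np.tanh(self.W_in @ u + self.W @ x)
            out.append(x.copy())
        return np.array(out)

    def fit(self, U, Y, ridge=1e-6):
        X = self.states(U)
        self.W_out = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

    def predict(self, U):
        return self.states(U) @ self.W_out

# One-step-ahead prediction of a sine wave
t = np.linspace(0, 20, 400)
u, y = np.sin(t)[:-1, None], np.sin(t)[1:, None]
esn = ESN(n_in=1)
esn.fit(u[:300], y[:300])
mse = float(np.mean((esn.predict(u[300:]) - y[300:]) ** 2))
print(mse)  # small: a fixed reservoir plus a linear readout suffices here
```

Because only `W_out` is fit (in closed form), training avoids backpropagation through time entirely, which is the source of the speed and memory advantages the abstract reports.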
URL: https://openreview.net/forum?id=9AOUXLYlwy
---
Title: Exploiting Completeness Perception with Diffusion Transformer for Unified 3D MRI Synthesis
Abstract: Missing data problems, such as missing modalities in multi-modal brain MRI and missing slices in cardiac MRI, pose significant challenges in clinical practice. Existing methods rely on external guidance to supply detailed missing state for instructing generative models to synthesize missing MRIs. However, manual indicators are not always available or reliable in real-world scenarios due to the unpredictable nature of clinical environments. Moreover, these explicit masks are not informative enough to provide guidance for improving semantic consistency. In this work, we argue that generative models should infer and recognize missing states in a self-perceptive manner, enabling them to better capture subtle anatomical and pathological variations. Towards this goal, we propose CoPeDiT, a general-purpose latent diffusion model equipped with completeness perception for unified synthesis of 3D MRIs. Specifically, we incorporate dedicated pretext tasks into our tokenizer, CoPeVAE, empowering it to learn completeness-aware discriminative prompts, and design MDiT3D, a specialized diffusion transformer architecture for 3D MRI synthesis that effectively uses the learned prompts as guidance to enhance semantic consistency in 3D space. Comprehensive evaluations on three large-scale MRI datasets demonstrate that CoPeDiT significantly outperforms state-of-the-art methods, achieving superior robustness and yielding high-fidelity, structurally consistent synthesis across diverse missing patterns.
URL: https://openreview.net/forum?id=DCaolE9oBN
---
Title: XConv: Low-memory stochastic backpropagation for convolutional layers
Abstract: Training convolutional neural networks at scale demands substantial memory, largely due to storing intermediate activations for backpropagation. Existing approaches---such as checkpointing, invertible architectures, or gradient approximation methods like randomized automatic differentiation---either incur significant computational overhead, impose architectural constraints, or require non-trivial codebase modifications. We propose XConv, a drop-in replacement for standard convolutional layers that addresses all three limitations: it preserves standard backpropagation, imposes no architectural constraints, and integrates seamlessly into existing codebases. XConv exploits the algebraic structure of convolutional layer gradients, storing highly compressed activations and approximating weight gradients via multi-channel randomized trace estimation. We establish convergence guarantees and derive error bounds for the proposed estimator, showing that the variance of the resulting gradient errors is comparable to that of stochastic gradient descent. Empirically, XConv achieves performance comparable to exact gradient methods across classification, generative modeling, super-resolution, inpainting, and segmentation---with gaps that narrow as the number of probing vectors increases---while reducing memory usage by a factor of two or more and remaining computationally competitive with optimized convolution implementations.
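The randomized trace estimation XConv builds on can be illustrated with the classic Hutchinson estimator; the probe count and test matrix below are arbitrary, and XConv applies the multi-channel version to convolutional gradient contractions rather than to an explicit matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(A, n_probes=4000):
    """Randomized trace estimate: E[z^T A z] = tr(A) for Rademacher z,
    so averaging z^T A z over many probes approximates the trace
    without ever forming or storing A's diagonal explicitly."""
    n = A.shape[0]
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        total += z @ A @ z
    return total / n_probes

A = rng.standard_normal((50, 50))
exact = float(np.trace(A))
approx = hutchinson_trace(A)
print(abs(approx - exact))  # error shrinks like 1/sqrt(n_probes)
```

The narrowing accuracy gap with more probing vectors that the abstract reports mirrors this `1/sqrt(n_probes)` behavior.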
URL: https://openreview.net/forum?id=ajv7wvEvnh
---
Title: LoLA: Low-Rank Linear Attention with Sparse Caching
Abstract: The per-token cost of transformer inference scales with context length, preventing its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even on infinite context lengths. While this is a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial to efficiently manage long-term associative memories. On pass-key retrieval tasks, LoLA improves the base model's performance from 0.6% to 97.4% accuracy. This is achieved with a 4.6x smaller cache than Llama-3.1 8B on 4K context length. LoLA also outperforms other 1B and 8B parameter subquadratic models on zero-shot commonsense reasoning tasks.
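Linear attention's constant memory footprint, LoLA's tier (iii), comes from folding every past key-value pair into a fixed-size state instead of a growing KV cache. The sketch below shows only that recurrent component; the sliding-window and sparse global caches are not shown, and the ReLU feature map is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_map(x):
    return np.maximum(x, 0.0) + 1e-6  # simple positive feature map

def linear_attention_stream(qs, ks, vs):
    """Constant-memory attention: each key-value pair is folded into a
    fixed-size state S (plus a normalizer z); each query reads out
    from that state, so memory does not grow with context length."""
    S = np.zeros((ks.shape[1], vs.shape[1]))
    z = np.zeros(ks.shape[1])
    out = []
    for q, k, v in zip(qs, ks, vs):
        phi_k = feature_map(k)
        S += np.outer(phi_k, v)              # write the association
        z += phi_k
        phi_q = feature_map(q)
        out.append(phi_q @ S / (phi_q @ z))  # normalized read-out
    return np.array(out)

T, d, dv = 32, 8, 4
qs = rng.standard_normal((T, d))
ks = rng.standard_normal((T, d))
vs = rng.standard_normal((T, dv))
out = linear_attention_stream(qs, ks, vs)
print(out.shape)  # (32, 4): state stays d x dv regardless of T
```

The limited capacity of `S` is exactly the associative-recall bottleneck LoLA addresses by diverting difficult-to-memorize pairs into its sparse cache.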
URL: https://openreview.net/forum?id=3KhDA252y3
---
Title: The Price of Justice in Machine Learning: Fair Division with Subjective Value under Bounded Rationality
Abstract: Statistical fairness criteria such as Demographic Parity, Equalized Odds, and Calibration are widely used in machine learning, but they constrain rates and may fail to capture how burdens are distributed across individuals. We introduce a harm-allocation view of fairness that models heterogeneous subjective error costs and evaluates decisions through fair-division axioms. Within this framework, we characterize when common statistical criteria can serve as valid surrogates for harm-based fairness, construct counterexamples showing severe failures under cost heterogeneity, and reinterpret classic incompatibilities as conflicts between allocation principles. We further establish approximation lower bounds under realistic constraints, including noisy or coarse cost information and restricted policy classes. Experiments support the theory by showing when surrogate metrics align with or diverge from harm-based fairness, revealing approximation floors and showing that these misalignments persist across tasks, models, and diverse fairness interventions under finer-grained harm diagnostics.
URL: https://openreview.net/forum?id=41J9frg42f
---
Title: Do Object Channels Improve Robustness in Deep Reinforcement Learning?
Abstract: Pixel-based reinforcement learning agents often exploit spurious visual correlations, leading to brittle policies that fail under minor visual perturbations. We systematically investigate spatially grounded semantic channel representations, often called Feature Maps, Planes, or Object Channels, as a representation design principle for reducing shortcut learning.
Object channels map detected entities into binary tensors aligned with the original coordinate frame, preserving compatibility with standard RL backbones without architectural modifications.
Specifically, through systematic evaluation in Atari environments under controlled perturbations, we demonstrate that such channel representations substantially improve zero-shot robustness to distribution shifts while maintaining competitive in-distribution performance.
We analyze the abstraction–fidelity trade-off and show that combining object channels with raw pixels improves robustness and sample efficiency compared to pure pixel-based approaches. The experimental results indicate that spatially grounded object-based encodings offer a practical mechanism for bridging pixel- and object-centric RL.
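A minimal version of the encoding described above: each detected entity is rasterized as a 1 in a per-class binary channel on the original pixel grid. The detections and grid size below are made up for illustration.

```python
import numpy as np

def object_channels(detections, shape, n_classes):
    """Rasterize detections into binary per-class channels aligned with
    the original pixel grid: channel c holds a 1 wherever a class-c
    entity was detected, so standard conv backbones apply unchanged."""
    C = np.zeros((n_classes, *shape), dtype=np.uint8)
    for cls, (y, x) in detections:
        C[cls, y, x] = 1
    return C

# e.g. one ball (class 0) and two paddles (class 1) on an 8x8 grid
dets = [(0, (3, 4)), (1, (2, 0)), (1, (5, 7))]
ch = object_channels(dets, shape=(8, 8), n_classes=2)
print(ch.shape, int(ch.sum()))  # (2, 8, 8) 3
```

Because the channels share the original coordinate frame, they can be concatenated with raw pixel planes, which is the pixel-plus-object combination the abstract evaluates.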
URL: https://openreview.net/forum?id=7BFbso4B3R
---
Title: Adaptive and Stratified Subsampling for High-Dimensional Robust Estimation
Abstract: We study robust high-dimensional sparse regression under finite-variance heavy-tailed noise, ε-contamination, and α-mixing dependence via two subsampling estimators: Adaptive Importance Sampling (AIS) and Stratified Sub-sampling (SS). Under sub-Gaussian design whose scope is precisely delimited and finite-variance noise, a subsample of size $m=\Omega(s\log p)$ achieves the minimax-optimal rate $O(\sqrt{s\log p/m})$. We close the theory-algorithm gap: Theorem 4.6 applies to AIS at termination conditional on stabilized weights (Proposition 4.1), and SS fits the median-of-means M-estimation framework of Lecué and Lerasle (Proposition 4.3). The de-biasing step is fully specified via the nodewise-Lasso precision estimator under a new sparse-precision assumption, yielding valid coordinate-wise CIs (Theorem 4.14). The α-mixing extension uses a calendar-time block protocol that guarantees temporal separation (Theorem 4.12). Empirically, AIS achieves 3.1× lower error than uniform subsampling at 20% contamination, and 29.5% lower test MSE on Riboflavin (p=4,088 ≫ n=71).
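The median-of-means device behind the SS analysis (Proposition 4.3) is simple to state: block the sample, average within blocks, take the median of the block means. A toy demonstration under heavy-tailed noise (the block count and noise distribution are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def median_of_means(x, n_blocks=10):
    """Heavy-tail-robust location estimate: shuffle, split into blocks,
    average within blocks, take the median of the block means. A few
    extreme draws can corrupt only a few blocks, not the median."""
    blocks = np.array_split(rng.permutation(x), n_blocks)
    return float(np.median([b.mean() for b in blocks]))

# Student-t noise with df=2.1: heavy tails but finite variance,
# matching the regime studied in the abstract
x = 2.0 + rng.standard_t(df=2.1, size=10_000)
est = median_of_means(x)
print(abs(est - 2.0))  # close to the true mean despite heavy tails
```

The full SS estimator wraps this idea in an M-estimation framework over stratified subsamples; the sketch shows only the robustness mechanism.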
URL: https://openreview.net/forum?id=R8y19hU9Ab
---
Title: Achieving Adaptivity and Optimality for Multi-armed Bandits using Exponential-Kullback Leibler Maillard Sampling
Abstract: We study the problem of $K$-armed bandits with reward distributions belonging to a one-parameter exponential distribution family. In the literature, several criteria have been proposed to evaluate the performance of such algorithms, including Asymptotic Optimality, Minimax Optimality, Sub-UCB, and variance-adaptive worst-case regret bound. Thompson Sampling-based and Upper Confidence Bound-based algorithms have been employed to achieve some of these criteria. However, none of these algorithms simultaneously satisfy all the aforementioned criteria.
In this paper, we design an algorithm, Exponential Kullback-Leibler Maillard Sampling (abbrev. \expklms), that achieves multiple optimality criteria simultaneously, including Asymptotic Optimality, Minimax Optimality with a $\sqrt{\ln (K)}$ factor, Sub-UCB, and a variance-adaptive worst-case regret bound. Our algorithm design follows the Minimum Empirical Divergence framework~\citep{honda2011asymptotically,maillard2011apprentissage}, with the exploration probability of arm $a$ proportional to $\text{exp}\left(-L(N_{t-1, a}) \text{KL}(\hat{\mu}_{t-1, a}, \max_{a'} \hat{\mu}_{t-1, a'})\right)$, where $L(\cdot)$ is an inverse temperature function, $N_{t-1, a}$ is the number of times arm $a$ has been pulled before time $t$, $\hat{\mu}_{t-1, a}$ is the empirical mean of arm $a$ before time $t$, and $\text{KL}(\cdot, \cdot)$ is the Kullback-Leibler divergence between two distributions in the one-parameter exponential distribution family. Our analysis allows different choices of inverse temperature function $L(k)$. We also provide numerical simulations demonstrating the effectiveness of our algorithm.
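The sampling rule above is explicit enough to compute directly. The sketch below instantiates it for Bernoulli arms with the identity inverse-temperature $L(k)=k$, both illustrative choices rather than the paper's tuned settings:

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def expklms_probs(mu_hat, pulls, L=lambda k: k):
    """Arm-sampling distribution of the exponential-KL Maillard rule:
    p(a) proportional to exp(-L(N_a) * KL(mu_hat_a, max_a' mu_hat_a'))."""
    mu_hat = np.asarray(mu_hat, float)
    pulls = np.asarray(pulls, float)
    logits = -L(pulls) * bernoulli_kl(mu_hat, mu_hat.max())
    w = np.exp(logits - logits.max())   # numerically stabilized normalization
    return w / w.sum()

p = expklms_probs(mu_hat=[0.3, 0.5, 0.7], pulls=[10, 10, 10])
print(p.argmax())  # 2: the empirically best arm is sampled most often
```

The empirically best arm has zero KL to itself and so gets the largest weight, while clearly suboptimal, well-sampled arms are exponentially suppressed.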
URL: https://openreview.net/forum?id=IuVkRmecVp
---
Title: Extracting Common Components from Partially Observed Views Using Diffusion Geometry
Abstract: Data acquired from multiple sensors or modalities, commonly referred to as multiview data, is prevalent in real-world applications. A core problem in multiview data analysis is finding representations of common components across views while filtering out view-specific nuisance factors. A widely spread assumption in existing methods is that the views are fully aligned, where each sample has measurements from all views. However, in practice, data is often partially aligned, where some samples have missing measurements from one or more views, and only a subset of the samples are fully aligned. In this work, we propose ADM+, a multiview manifold learning algorithm that computes a low-dimensional embedding of common information from partially aligned data. ADM+ extends Alternating Diffusion Maps (ADM), an existing multiview manifold learning method, to the partial alignment setting by using fully aligned samples as anchor points for extracting common components for unaligned samples. Unlike existing methods, ADM+ does not require prior imputation of missing data or interpolation in the embedding space and makes use of all available data. We provide a computationally efficient implementation, improving upon the $O(N^3)$ time complexity of ADM, and a theoretical analysis showing that ADM+ approximates an anisotropic diffusion process that emphasizes common components. Empirical evaluations across three domains -- dynamical systems, synthetic multiview images, and real-world functional magnetic resonance imaging (fMRI) -- demonstrate that ADM+ achieves favorable performance compared to kernel- and manifold-based baselines. In addition, ADM+ shows robustness to distributional discrepancies between aligned and unaligned samples.
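The base operation ADM+ extends is classic alternating diffusion: compose the two views' diffusion operators and eigendecompose the product. A dense toy version follows; ADM+'s anchor-based handling of unaligned samples and its sub-$O(N^3)$ implementation are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_operator(X, sigma=0.3):
    """Row-stochastic Gaussian affinity matrix (a diffusion operator)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / (2 * sigma ** 2))
    return K / K.sum(axis=1, keepdims=True)

def alternating_diffusion_embedding(X1, X2, k=2):
    """Compose the two views' diffusion operators and embed with the top
    non-trivial eigenvectors of the product: common structure survives
    the composition, view-specific nuisance is averaged out."""
    P = diffusion_operator(X1) @ diffusion_operator(X2)
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[1:k + 1]].real  # skip the trivial constant mode

# Two views sharing a common 1-D variable t, plus independent nuisance
t = np.linspace(0, 1, 100)
X1 = np.c_[t, 0.05 * rng.standard_normal(100)]
X2 = np.c_[np.sin(2 * t), 0.05 * rng.standard_normal(100)]
emb = alternating_diffusion_embedding(X1, X2)
print(emb.shape)  # (100, 2)
```

In the partial-alignment setting the abstract targets, this composition is only directly available for fully aligned samples, which is where ADM+'s anchor mechanism comes in.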
URL: https://openreview.net/forum?id=yTHGIV8ToF
---
Title: Can LLMs Rank Candidates with Missing Sensitive Attributes Fairly?
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes ranking systems used for hiring, lending, and scholarship allocation, raising concerns about fairness, accountability, and ethical use. These challenges are exacerbated in ranking settings where sensitive demographic attributes are unavailable due to legal, ethical, or practical constraints. Inferring such attributes may introduce harm by violating consent requirements, misrepresenting individuals, and reinforcing structural inequities. This work thus investigates the timely question: How is LLM-based fair re-ranking impacted when demographic information is missing? In this context, we study three alternate strategies: (1) inferring sensitive attributes using traditional third-party services prior to ranking, (2) directly prompting LLMs to produce fair rankings without explicit mention of attribute inference, and (3) employing a chain-of-thought approach in which LLMs are first prompted to infer attributes and thereafter to perform fairness-aware re-ranking. We compare these strategies across multiple datasets using established group-fairness metrics for ranking. Our experiments demonstrate that LLMs match the accuracy of leading third-party services in demographic inference. Moreover, LLMs can embed fairness objectives into rankings even without explicitly inferring sensitive attributes, revealing a new design space for fairness interventions that avoids direct demographic labeling. Lastly, few-shot prompting is found to be crucial for striking the desired balance between fairness and utility. We conclude by discussing the ethical and governance implications of deploying LLMs for fairness-critical ranking tasks. While LLMs offer flexibility under demographic uncertainty, their capacity for implicit inference also raises significant risks if adopted without transparency, evaluation, and institutional oversight. 
To support reproducibility and scrutiny, we release our source code and experimental artifacts.
URL: https://openreview.net/forum?id=VrAs5EJ11G
---
Title: LMA: Latent Motion Adjuster for Physics-based Multi-agent Interaction
Abstract: Learning interactive multi-agent behaviors from scratch is often sample-inefficient and fails to exploit reusable skills learned in simpler settings. While latent skill representations enable efficient single-agent reinforcement learning, their extension to multi-agent interaction requires conditioning behaviors on other agents without destroying pretrained structure. We formulate multi-agent interaction as a latent adaptation problem and propose the Latent Motion Adjuster (LMA), a lightweight conditional module that modifies latent actions produced by a pretrained single-agent policy based on other agents’ states. Rather than relearning policies from scratch, our method performs structured residual adaptation in latent space, enabling efficient skill reuse under both cooperative and competitive scenarios. Experiments on physics-based control benchmarks demonstrate that latent-space adaptation improves sample efficiency and interaction performance over fine-tuning and strategic baselines. These results suggest that conditional latent modulation provides a principled mechanism for transferring single-agent skills to multi-agent reinforcement learning.
URL: https://openreview.net/forum?id=OR5ClOTaWU
---
Title: EditProp: Consistent Video Style Transfer by Editing Propagation
Abstract: Video style transfer, which aims to transfer a source video into another video with a different appearance while preserving its original structure, plays an important role in the video production industry. Existing methods often edit the first frame with an image editing tool, and feed it into an image-to-video generation model with source video guidance to generate the edited video. Although such a paradigm enables users to perform creative video editing with powerful image editing tools, it relies heavily on the native propagation capability of the video generation model, which can be limited by having only the first frame as appearance guidance. As a result, the edited video suffers from appearance drifting and structure distortion, leading to severe inconsistencies as time goes on. To this end, we propose EditProp, a novel video style transfer framework with two propagation stages: i) In the Keyframe Propagation stage, the edit in the first keyframe is faithfully propagated to other keyframes with an image-based in-context generation model, producing high-quality edited keyframes with strong appearance consistency. ii) Then, in the subsequent Video Propagation stage, the source video structure and the propagated keyframes are injected into the video generation model as control signals, providing sufficient appearance and structure guidance to generate the translated video. Experimental results demonstrate that our EditProp enables effective transfer to various styles, achieving superior editing results with strong appearance and structure consistency. Furthermore, thanks to our versatile keyframe-based propagation, our framework also enables extra applications such as smooth video style transition and long video style transfer.
URL: https://openreview.net/forum?id=WA0ApsjWQb
---
Title: Autoregressive Image Generation with Frequency Progression
Abstract: Autoregressive (AR) models for image generation typically adopt a two-stage paradigm of vector quantization and raster-scan ``next-token prediction", inspired by its great success in language modeling. However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction.
In this paper, we introduce the frequency progressive autoregressive (\textbf{FAR}) paradigm and instantiate FAR with the continuous tokenizer.
Specifically, we identify spectral dependency as the desirable regression direction for FAR, wherein higher-frequency components build upon the lower ones to progressively construct a complete image. This design seamlessly fits the causality requirement for autoregressive models and preserves the unique spatial locality of image data.
In addition, we examine the integration of FAR and the continuous tokenizer, introducing a series of techniques to address optimization challenges and improve the efficiency of training and inference.
We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset and verify its potential on text-to-image generation.
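FAR's coarse-to-fine regression direction can be illustrated with a plain FFT low-pass pyramid: each stage keeps a wider frequency band, so later stages strictly refine earlier ones. The band schedule below is our own illustrative choice, not the paper's tokenizer.

```python
import numpy as np

def frequency_stages(img, n_stages=4):
    """Coarse-to-fine decomposition: stage k keeps a growing low-pass
    band of the 2-D spectrum, so each stage adds higher frequencies on
    top of the previous ones."""
    H, W = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.mgrid[:H, :W]
    radius = np.hypot(yy - H // 2, xx - W // 2)
    stages = []
    for k in range(1, n_stages + 1):
        mask = radius <= radius.max() * k / n_stages   # growing band
        stages.append(np.fft.ifft2(np.fft.ifftshift(F * mask)).real)
    return stages

img = np.random.default_rng(0).standard_normal((32, 32))
stages = frequency_stages(img)
print(np.allclose(stages[-1], img))  # True: the last stage is the full image
```

An autoregressive model over these stages predicts stage k conditioned on stages 1..k-1, which matches the causal low-to-high-frequency dependency the abstract identifies.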
URL: https://openreview.net/forum?id=cEfd15ouQ1
---
Title: LoRA-FL: A Low-Rank Adversarial Attack for Compromising Group Fairness in Federated Learning
Abstract: Federated Learning (FL) enables collaborative model training without requiring participants to share raw data, and is increasingly deployed in regulated domains such as healthcare, finance, and large-scale personalization. While FL offers privacy and governance benefits, it can also obscure fairness risks: heterogeneity in client data distributions may lead to models that systematically disadvantage minority groups. Ensuring fairness in such settings is not only an ethical concern but also a regulatory requirement under frameworks such as GDPR and anti-discrimination law. Existing adversarial manipulations in FL, such as noise injection or scaling attacks, typically degrade predictive performance or are mitigated by robust aggregation rules (e.g., KRUM or FLAME), limiting their practical relevance. In this work, we introduce LoRA-FL, a stealthy fairness attack that leverages low-rank adapters to inject group-level bias while preserving accuracy. By constraining adversarial updates to a compact subspace that aligns with benign client variation, LoRA-FL evades both standard and robust aggregators, even under heterogeneous (non-IID) data distributions. We provide empirical results across widely used fairness benchmarks, including tabular datasets such as Adult and Bank. With LoRA-FL, as few as 10–20% adversarial clients can increase violations of demographic parity and equalized odds by over 40%, while maintaining comparable predictive performance.
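The low-rank parameterization the attack rides on is ordinary LoRA: the adversarial update lives in a rank-r subspace with far fewer free parameters than the full weight matrix. A generic illustration follows (dimensions arbitrary; this shows the parameterization, not the attack itself):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 4

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
B = 0.01 * rng.standard_normal((d_out, r))  # trainable low-rank factors
A = 0.01 * rng.standard_normal((r, d_in))

delta = B @ A            # the update lives in a rank-r subspace
W_adapted = W + delta

n_lora = (d_out + d_in) * r
print(np.linalg.matrix_rank(delta))     # at most r = 4
print(n_lora / (d_out * d_in))          # 0.125: fraction of full params
```

Because `delta` is confined to a small subspace, such an update can stay geometrically close to benign client updates, which is the property the abstract exploits to evade robust aggregators.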
URL: https://openreview.net/forum?id=AaRjCkhXNU
---
Title: SensX: Model-Agnostic Local Feature Attribution via Calibrated Global Sensitivity Analysis
Abstract: Local feature attribution is a standard tool for auditing and debugging deep learning predictions, but existing attribution methods are not designed for systems that chain pretrained, frozen, or API-only modules. Gradient-based methods such as Integrated Gradients require an end-to-end computational graph that may be unavailable. Perturbation-based methods such as KernelSHAP require a reference input or background distribution that composite pipelines cannot reliably provide. We present SensX, a local attribution method that treats the model as a black box and replaces arbitrary design choices with interpretable, application-grounded parameters. SensX adapts Morris-style coordinate walks from global sensitivity analysis to local attribution. It requires no access to model internals, training data, or arbitrary reference inputs. We validate SensX across four case studies, each targeting a distinct limitation of existing methods. On a synthetic benchmark where ground-truth relevant features vary per input, SensX reaches $95\%$ top-$2$ attribution accuracy versus $58\%$ for the best KernelSHAP/Integrated Gradients variant. On a ViT with $>150{,}000$ pixel-channel features, SensX produces spatially coherent maps and exposes systematic intra-patch bias where KernelSHAP is infeasible and Integrated Gradients yields task-irrelevant attributions. On single-cell classifiers with unstructured gene-expression features, SensX attains the lowest top-$k$ perturbation AUC. On a composite spatial transcriptomics system where neither method is applicable, SensX reveals reliance on preprocessing grid artifacts and a bias toward low-staining regions.
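Morris-style coordinate walks, which SensX adapts from global sensitivity analysis, estimate per-coordinate elementary effects from one-at-a-time perturbations along random orderings. The step size, walk count, and toy black box below are illustrative choices, and SensX's calibration is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def morris_attribution(f, x, delta=0.1, n_walks=20):
    """One-at-a-time coordinate walks around a single input x: perturb
    each coordinate in a random order, record |change in f| / delta,
    and average these elementary effects per coordinate."""
    d = len(x)
    effects, counts = np.zeros(d), np.zeros(d)
    for _ in range(n_walks):
        z = x.astype(float).copy()
        for i in rng.permutation(d):
            z_new = z.copy()
            z_new[i] += delta * rng.choice([-1.0, 1.0])
            effects[i] += abs(f(z_new) - f(z)) / delta
            counts[i] += 1
            z = z_new                    # the walk keeps moving
    return effects / counts

f = lambda v: 3.0 * v[0] + 0.1 * v[1] ** 2   # toy black box; v[2] is inert
attr = morris_attribution(f, np.array([1.0, 1.0, 1.0]))
print(attr.argmax(), attr[2])  # 0 0.0: v[0] dominates, v[2] gets exactly zero
```

Note that only forward evaluations of `f` are needed: no gradients, no reference input, and no access to model internals, matching the black-box setting the abstract targets.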
URL: https://openreview.net/forum?id=dKzReyfUeW
---
Title: Elytra: A Flexible Framework for Securing Large Vision Systems
Abstract: Adversarial attacks have emerged as a critical threat to autonomous driving systems.
These attacks exploit the underlying neural network, allowing small -- almost invisible -- perturbations to alter the behavior of such systems in potentially malicious ways,
*e.g.*, causing a traffic sign classification network to misclassify a stop sign as a speed limit sign.
Prior work in hardening such systems against adversarial attacks has looked at fine-tuning of the system or adding additional pre-processing steps to the input pipeline.
Such solutions either have a hard time generalizing, require knowledge of adversarial attacks during training, or are computationally undesirable.
Instead, we propose a framework called *Elytra* that takes insights from parameter-efficient fine-tuning and uses low-rank adaptation (LoRA) to train a lightweight security patch (or patches), enabling us to dynamically patch large pre-existing vision systems as new vulnerabilities are discovered.
We demonstrate that the *Elytra* framework can patch pre-trained large vision models to improve classification accuracy by up to 24.09% in the presence of adversarial examples.
URL: https://openreview.net/forum?id=50RD7a2CXi
---
Title: SinGLU: Sinusoidal Gated Linear Units Improve Classification Accuracy of Small Vision Transformers
Abstract: Gated Linear Unit (GLU) variants such as SwiGLU are now widely used in modern Transformers. However, the GLU functions explored in the recent literature represent only a small fraction of the possible GLU design space. Starting from a mathematically complete enumeration of all zeroth-, first-, and second-order GLU formulas, we conduct a controlled study on ViT-Tiny across CIFAR-10, CIFAR-100, SVHN and ImageNet-64, instantiating each GLU formula with Sigmoid, Tanh and Sin activations. Under identical training recipes and equal parameter counts, our proposed first-order GLU variant \textbf{SinGLU} consistently outperforms SwiGLU, the de facto standard in contemporary Transformers. Inference latency differs by <0.1\% on an NVIDIA A100 GPU, confirming cost parity. All code and model weights will be released upon publication.
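The GLU family enumerated here shares one template, act(xW) elementwise-times (xV); swapping the gate activation gives SwiGLU, and using sin is one plausible reading of SinGLU (the paper's exact first-order formula is not reproduced here, so treat the `np.sin` gate as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def glu(x, W, V, act):
    """The shared GLU template: a gate branch act(x W) multiplied
    elementwise with a linear branch x V."""
    return act(x @ W) * (x @ V)

def swish(z):
    return z / (1.0 + np.exp(-z))  # SiLU/Swish, the SwiGLU gate

d_in, d_h = 8, 16
W = rng.standard_normal((d_in, d_h))
V = rng.standard_normal((d_in, d_h))
x = rng.standard_normal((4, d_in))

swiglu_out = glu(x, W, V, swish)   # de facto standard gate
singlu_out = glu(x, W, V, np.sin)  # sinusoidal gate, same params and cost
print(swiglu_out.shape, singlu_out.shape)  # (4, 16) (4, 16)
```

Since both variants share the same two weight matrices and differ only in a pointwise function, the parameter-count and latency parity claimed in the abstract follows directly.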
URL: https://openreview.net/forum?id=qq4yipldw2
---
Title: Behavior-dLDS: A decomposed linear dynamical systems model for neural activity partially constrained by behavior
Abstract: Brain-wide recordings of large-scale networks of neurons now provide an unprecedented view into how the brain drives behavior. However, brain activity contains both information directly related to behavior as well as the potential for many internal computations. Moreover, observable behavior is executed not only by the brain, but also by the spinal cord and peripheral nervous system. Behavior is a coarse-grained product of neural activity, and we thus take the view that it can be best represented by lower-dimensional latent neural dynamics. Capturing this indirect relationship while disambiguating behavior-generating networks from internal computations running in parallel requires new modeling approaches that can embody the parallel and distributed nature of large-scale neural populations. We thus present behavior-decomposed linear dynamical systems (b-dLDS) to disentangle simultaneously recorded subsystems and identify how the latent neural subsystems relate to behavior. We demonstrate the ability of b-dLDS to decouple behavioral vs. internal computations on controlled, simulated data, showing improvements over a state-of-the-art model that uses behavior to supervise all dynamics. We then show that b-dLDS can further scale up to tens of thousands of neurons by applying our model to a large-scale recording of a zebrafish hindbrain during a complex positional homeostasis behavior, wherein b-dLDS highlights behavior-related dynamic connectivity networks.
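The decomposed-LDS backbone (latent dynamics as a time-varying mixture of a small dictionary of linear operators) can be simulated directly. The sketch below is generic: b-dLDS additionally ties a subset of these subsystems to behavior, which is omitted here.

```python
import numpy as np

def simulate_dlds(A_dict, coeffs, x0):
    """Decomposed LDS: latent dynamics are a time-varying mixture of a
    small dictionary of linear operators,
    x_{t+1} = (sum_j c_j(t) A_j) x_t."""
    xs = [np.asarray(x0, float)]
    for c in coeffs:  # c holds the mixture weights at time t
        A_t = sum(cj * Aj for cj, Aj in zip(c, A_dict))
        xs.append(A_t @ xs[-1])
    return np.array(xs)

th = 0.2
rot = np.array([[np.cos(th), -np.sin(th)],
                [np.sin(th),  np.cos(th)]])    # oscillatory subsystem
decay = 0.95 * np.eye(2)                       # damped subsystem
coeffs = [(1.0, 0.0)] * 20 + [(0.0, 1.0)] * 20  # switch between regimes
xs = simulate_dlds([rot, decay], coeffs, [1.0, 0.0])
print(xs.shape)  # (41, 2)
```

Inferring which dictionary elements are active, and supervising only some of them with behavior, is the disentangling step b-dLDS performs on top of this generative form.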
URL: https://openreview.net/forum?id=8p9tP9qLPN
---
Title: Learning with Differentially Private Sliced Wasserstein Gradients
Abstract: In this work, we introduce a novel framework for privately optimizing objectives that depend on sliced Wasserstein distances between data-dependent empirical measures. Our main theoretical contribution is a non-trivial analysis of the sensitivity of the Wasserstein gradients to individual data points, derived from an explicit formulation of the gradient in a fully discrete setting. This enables strong privacy guarantees with minimal utility loss. We demonstrate that standard privacy accounting methods naturally extend to Wasserstein-based objectives, allowing for large-scale private training. This supports a wide range of private machine learning applications involving distribution matching under privacy constraints on the source, the target, or both. These include: (i) an in-processing method for fairness mitigation using a private Wasserstein penalty, and (ii) what we believe is the first approach for training private sliced Wasserstein autoencoders. We validate our framework through experiments showing its ability to effectively balance privacy and utility, offering a theoretically grounded approach to privacy-preserving machine learning with sliced Wasserstein losses.
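In the fully discrete setting the abstract analyzes, the sliced Wasserstein distance itself reduces to sorted 1-D matchings over random projections. A Monte-Carlo sketch of the non-private distance follows; the projection count is arbitrary, and the paper's sensitivity analysis and privacy accounting are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sliced_wasserstein(X, Y, n_proj=200):
    """Monte-Carlo sliced 2-Wasserstein distance between two empirical
    measures with equal sample counts: project onto random directions,
    sort, and average the resulting 1-D squared transport costs."""
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)           # random unit direction
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)         # 1-D W2^2 via sorted matching
    return float(np.sqrt(total / n_proj))

X = rng.standard_normal((500, 2))
Y = rng.standard_normal((500, 2)) + np.array([3.0, 0.0])  # shifted copy
print(sliced_wasserstein(X, X))  # 0.0: identical measures
print(sliced_wasserstein(X, Y))  # reflects the shift, averaged over directions
```

Because each sample enters the distance only through its sorted projections, the per-point influence is bounded, which is the structure the paper's gradient-sensitivity analysis exploits.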
URL: https://openreview.net/forum?id=pqo5VNImZE
---
Title: AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification
Abstract: Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision–Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce $\textit{AgriPath-LF16}$, a benchmark of 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardised training and evaluation.
We train and evaluate all models under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles: CNNs achieve the highest accuracy on lab imagery but exhibit pronounced degradation under domain shift; contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance; generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.
URL: https://openreview.net/forum?id=5UI1wrq5pS
---
Title: Convergence Bound and Critical Batch Size of Muon Optimizer
Abstract: Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive a lower bound on the critical batch size for Muon---the batch size that minimizes the stochastic first-order oracle (SFO) complexity of training. Because the resulting formula involves problem-dependent quantities that are not directly observable (gradient variance, target precision, effective rank), it does not predict the critical batch size in absolute terms; rather, it reveals how the hyperparameters $\beta$ (momentum) and $\lambda$ (weight decay) govern the qualitative scaling of this value.
Our experiments validate these hyperparameter-dependent predictions across workloads including image classification and language modeling.
URL: https://openreview.net/forum?id=q1bzNUyMzc
---
Title: Koopman-informed recurrent neural networks
Abstract: Recurrent neural networks are a successful neural architecture for many time-dependent problems, including time series analysis, forecasting, and modeling of dynamical systems. Training such networks with backpropagation through time, however, is notoriously difficult because their loss gradients tend to explode or vanish. In this contribution, we introduce Koopman-informed recurrent neural networks, a computational approach to construct all weights and biases of a recurrent neural network without using gradient-based methods. The approach is based on a combination of random feature networks and Koopman operator theory for dynamical systems. The hidden parameters of a single recurrent block are sampled at random, while the outer weights are constructed using extended dynamic mode decomposition. This approach alleviates some problems with backpropagation commonly related to recurrent networks. The connection to Koopman operator theory also allows us to start applying results from this area to the analysis of recurrent neural networks. In computational experiments on time series, forecasting for chaotic dynamical systems, and control problems, as well as on real-world data, we observe that the training time and forecasting accuracy of the Koopman-informed recurrent neural networks we construct are improved compared to models trained with commonly used gradient-based methods.
URL: https://openreview.net/forum?id=KHsnxKYG6k
---
Title: Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
Abstract: We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluate a warm-up schedule in experiments, which empirically outperforms all other strategies in convergence behavior.
URL: https://openreview.net/forum?id=s6DTv7Sorj
---
Title: MEMETRON: Memetic Response Optimizer for Reward-Guided Post-Decoding Optimization of Large Language Model
Abstract: Modern large language models (LLMs) are commonly optimized using scalar reward signals defined over completed responses, applied both during training and at inference time. However, most such reward-guided post-decoding methods remain one-shot: they independently sample a set of responses, score each once, and select the best. This shallow, narrow search leaves higher-reward responses unrealized, while simply widening the sample pool exacerbates reward hacking, making downstream selection methods such as best-of-$N$ and self-consistency unreliable. We propose \MEMETRON, a memetic optimization framework that formulates reward-guided post-decoding optimization (RPDO) as discrete black-box optimization over completed responses. \MEMETRON alternates between \GENETRON for population-based optimization and \ANNETRON for annealing-based local refinement under a black-box scalar reward. Across mathematical reasoning and instruction-following tasks, \MEMETRON reliably discovers higher-scoring responses. On mathematical reasoning, \MEMETRON increases pass@$k$ correctness coverage and improves the selection reliability of best-of-$N$ and self-consistency; on instruction following, it improves LLM-judge preference. On verifiable tasks, \MEMETRON can incorporate ground-truth correctness via reward shaping. Comparing shaped and unshaped runs exposes extreme cases of RM-correctness misalignment, and the resulting contrastive pairs serve as training signal for reward model fine-tuning, rejection-sampling SFT warmups for RL-based training pipelines such as PPO and GRPO, and direct preference learning such as DPO.
URL: https://openreview.net/forum?id=QRW8OGn3vb
---
Title: Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies
Abstract: The widespread adoption of deep learning across various industries has introduced substantial challenges, particularly in terms of model explainability and security. The inherent complexity of deep learning models, while contributing to their effectiveness, also renders them susceptible to adversarial attacks. Among these, backdoor attacks are especially concerning, as they involve surreptitiously embedding specific triggers within training data, causing the model to exhibit aberrant behavior when presented with input containing the triggers. Such attacks often exploit vulnerabilities in outsourced processes, compromising model integrity without affecting performance on clean (trigger-free) input data. In this paper, we present a review of prominent existing mitigation strategies designed to counter backdoor attacks in image recognition. We provide an in-depth analysis of the theoretical foundations, practical efficacy, and limitations of these approaches. In addition, we conduct an extensive benchmarking of sixteen prominent approaches against eight distinct backdoor attacks, utilizing three datasets, four model architectures, and three poisoning ratios. Our results, derived from 122,236 individual experiments, indicate that while many approaches provide some level of protection, their performance can vary considerably. Furthermore, when compared to two seminal approaches, most newer approaches do not demonstrate substantial improvements in overall performance or consistency across diverse settings. Drawing from these findings, we propose potential directions for developing more effective and generalizable defensive mechanisms in the future.
URL: https://openreview.net/forum?id=OysA7cuCUh
---
Title: Training-Free Pseudo-Fusion Strategies for Composed Image Retrieval via Diffusion and Multimodal Large Language Models
Abstract: Composed Image Retrieval (CIR) is an emerging paradigm in Content-based Image Retrieval that enables users to formulate compositional queries by combining a reference image with an auxiliary modality, usually text-based. This approach supports fine-grained search where the target image shares structural elements with the user-provided image but is modified according to the provided auxiliary text. Conventional CIR methods rely on multimodal fusion to combine visual and textual features into a joint query embedding, which requires training modules that align composed queries with the targets. In this work, we propose PEFUSE (for pseudo-fusion), a training-free framework that leverages pretrained models to bridge modalities via generative conversion. We introduce two novel strategies: uni-directional and bi-directional conversion, both implemented using diffusion models and multimodal large language models, converting CIR to four single-modality retrieval problems. These methods reformulate CIR as either intra-modal or cross-modal single-query retrieval tasks, bypassing the need for dedicated training. Extensive experiments on standard benchmarks demonstrate that converting CIR into text-to-image retrieval tasks is better than alternative conversion strategies, achieving competitive or superior performance compared with state-of-the-art methods, while maintaining strong time efficiency. These results highlight the effectiveness of the pseudo-fusion paradigm for composed retrieval. Our code is available at: https://anonymous.4open.science/r/ComposedImageRetrieval-9241.
URL: https://openreview.net/forum?id=6W3pFEQXZc
---
Title: Energy-Efficient Inference with Small Language Models: A Comparative Study on Code Generation, Classification, and Environmental Impact
Abstract: Large language models (LLMs) are widespread in enterprise applications for code completion, email classification, and sentiment analysis. Although these models perform well, their high computational requirements make inference energy-intensive. Can smaller language models (SLMs) with three billion parameters (Qwen2.5-3B-Instruct) perform similarly on structured, high-frequency tasks while having a significantly lower environmental impact? We tested an SLM on three enterprise workloads: code generation with the HumanEval benchmark (164 tasks), HR email routing (1,339 examples), and binary sentiment analysis (100 samples). We recorded output quality, inference latency, throughput, GPU memory utilization, and energy consumption. The SLM achieved a 72.6% pass rate on code generation and 86% on sentiment analysis, at 388–647× less energy per query than GPT-4o on code tasks and 210–1,333× less on classification tasks. Scaled to an organizational context, replacing LLMs with task-specific SLMs for 10,000 daily code completions, 100,000 sentiment queries, and 50,000 monthly email classifications would save 52,642 kWh annually, reducing CO2 emissions by 23.1 metric tons. An SLM-first deployment strategy is a practical way to attain sustainable AI with significant energy savings.
URL: https://openreview.net/forum?id=J7o94C1xnb
---
Title: Data Shifts Hurt CoT: A Theoretical Study
Abstract: Chain of Thought (CoT) has been applied to various large language models (LLMs) and proven to be effective in improving the quality of outputs. In recent studies, transformers have been proven to have absolute upper bounds in terms of expressive power, and consequently, they cannot solve many computationally difficult problems. However, empowered by CoT, transformers are proven to be able to solve some difficult problems effectively, such as the $k$-parity problem. Nevertheless, those works rely on two imperative assumptions: (1) identical training and testing distributions, and (2) corruption-free training data with correct reasoning steps. In the real world, these assumptions do not always hold. Although the risks of data shifts have caught attention, to the best of our knowledge our work is the first to rigorously study the exact harm caused by such shifts. Focusing on the $k$-parity problem, we investigate the joint impact of two types of data shifts, distribution shift and data poisoning, on the quality of trained models obtained by a well-established CoT decomposition. In addition to revealing a surprising phenomenon that CoT leads to worse performance on learning parity than directly generating the prediction, our technical results also give a rigorous and comprehensive explanation of the mechanistic reasons for this impact.
URL: https://openreview.net/forum?id=YWFmIoHP5y
---
Title: Global Linear Convergence of Inexact TD Under Generalized Smoothness
Abstract: Recent work by~\cite{asadi2023td} gave an optimization view of TD learning with target networks and showed stability under a force-dominance condition, but their linear-rate analysis relies on global smoothness (a uniform curvature bound). This assumption can fail even when the inner problem is well posed, since curvature encountered during training can grow with the scale of TD residual–induced gradients. We retain the stabilized regime from prior theory—strong convexity in the inner variable—to isolate upper-curvature growth effects. Under generalized smoothness, where the Hessian norm may grow with gradient scale via a nondecreasing profile $\ell(\cdot)$, we analyze the inexact TD recursion with $K$ inner gradient steps per target refresh and propose a curvature-checked constant stepsize rule that ensures global stability without a global smoothness constant. Our main result proves global linear convergence under force dominance with a single trajectory-dependent admissibility requirement governed by the maximum gradient magnitude $M$ encountered along the run. This yields an explicit scaling law: the largest admissible constant stepsize decays as $1/\ell(cM)$ (for a universal constant $c$), and maintaining a fixed contraction requires $K$ to grow proportionally to $\ell(cM)$. In the uniformly smooth case we recover~\cite{asadi2023td}, while under curvature growth the worst trajectory gradient scale controls both stability and attainable convergence speed, aligning with step-control heuristics used in reinforcement learning (RL).
URL: https://openreview.net/forum?id=Xvyehi4izc
---
Title: Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving
Abstract: The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model’s activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model’s activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
URL: https://openreview.net/forum?id=HlBBy19ojC
---
Title: LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation
Abstract: Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves 20% FID generative improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.
URL: https://openreview.net/forum?id=sJXqzr3oLl
---
Title: TropNNC: Structured Neural Network Compression Using Tropical Geometry
Abstract: We present TropNNC, a framework for compressing neural networks with linear and convolutional layers and ReLU-type activations using tropical geometry. By representing a network’s output as a tropical rational function, TropNNC enables structured compression via reduction of the corresponding tropical polynomials. Our method identifies redundancy via similarity and improves upon the geometric approximation of previous work by adaptively selecting the weights of retained neurons. We relate it to SVD and spectral clustering, and develop a theoretical analysis that yields useful insights into the network compression problem in general. We provide the tightest known theoretical compression bound, and the first successful application of tropical geometry to convolutional layers. TropNNC requires access only to network weights -- no training data -- and achieves competitive performance on MNIST, CIFAR, and ImageNet, matching strong baselines such as ThiNet and CUP.
URL: https://openreview.net/forum?id=u7DRq1icmY
---
Title: Anonymized-Bench: From Performance to Capability, Rethinking Evaluation in Geospatial AI
Abstract: Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. Anonymized-Bench addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce capability groups to rank models on datasets that share common characteristics (e.g., resolution, spectral bands, temporality), enabling users to identify which models excel in each capability and to determine where future work should focus. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This ensures consistency in benchmarking while facilitating research into model adaptation strategies—a key open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext-ImageNet, DINOv3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and operational constraints, and that the goal of a single GeoFM that performs well across all tasks remains open for future research. Anonymized-Bench enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and the leaderboard are publicly released under a permissive license.
URL: https://openreview.net/forum?id=NPf175jnP1
---
Title: Teaching People LLM’s Errors and Getting it Right
Abstract: People use large language models (LLMs) when they should not. This is partly because they see LLMs compose poems and answer intricate questions, so they understandably, but incorrectly, assume LLMs won't stumble on basic tasks like simple arithmetic. Prior work has tried to address this by clustering instance embeddings into regions where an LLM is likely to fail and automatically describing patterns in these regions. The found failure patterns are taught to users to mitigate their overreliance. Yet, this approach has not fully succeeded. In this analysis paper, we aim to understand why.
We first examine whether the negative result stems from the absence of failure patterns. We group instances in two datasets by their meta-labels and evaluate an LLM's predictions on these groups. We then define criteria to flag groups that are sizable and where the LLM is error-prone, and find meta-label groups that meet these criteria. Their meta-labels are the LLM's failure patterns that could be taught to users, so they do exist. We next test whether prompting and embedding-based approaches can surface these known failures. Without this, users cannot be taught about them to reduce their overreliance. We find mixed results across methods, which could explain the negative result. Finally, we revisit the final metric that measures teaching effectiveness. We propose to assess a user's ability to effectively use the given failure patterns to anticipate when an LLM is error-prone. A user study shows a positive effect from teaching with this metric, unlike the human-AI team accuracy. Our findings show that teaching failure patterns could be a viable approach to mitigating overreliance, but success depends on better automated failure-discovery methods and using metrics like ours.
URL: https://openreview.net/forum?id=vwm1xHjxUj
---
Title: MaskGT: Learning Task-Adaptive Connectivity in Graph Transformers
Abstract: Graph Transformers (GTs) enable all-to-all interactions, but the optimal connectivity is task-dependent: some problems favor sparse, topology-aligned message passing, while others need global attention. We propose MaskGT, a GT-agnostic module that learns a discrete sparse gate over attention edges. By learning which node pairs may communicate within self-attention, MaskGT injects a task-adaptive relational inductive bias without fully committing to the input adjacency. Across synthetic and real-world benchmarks, MaskGT improves performance and robustness by suppressing spurious interactions under structural noise, and enables parameter-efficient multi-task and transfer by localizing task-specific structure in the mask while reusing a shared backbone. These results position MaskGT as a step toward more general-purpose graph models.
URL: https://openreview.net/forum?id=CS4BJcbCGF
---
Title: ZipAct: Zipping Interaction History into a Compact State for Efficient LLM Agents
Abstract: Current large language model (LLM) agentic frameworks typically rely on the entire raw interaction history to make decisions. Despite recent remarkable progress, this paradigm suffers from the \textit{context snowball} effect: as the task progresses step by step, the history grows unboundedly, resulting in excessive token consumption and diluted agent attention. To this end, this paper proposes a novel and lightweight framework named ZipAct, which ``zips'' the lengthy history into a compact state during agentic reasoning. In particular, instead of feeding the full history to the model, ZipAct maintains a structured state table comprising the agent's goal, world status, and key constraints, which is updated dynamically at each step. This simple design shifts agentic reasoning from a history-dependent paradigm to a state-dependent one, reducing computational cost from quadratic ($O(T^2)$) to linear ($O(T)$). Extensive experiments across multiple benchmark datasets demonstrate that ZipAct drastically reduces token usage while preserving or improving success rates compared to competing baselines. For reproducibility, our codebase can be accessed at: \url{https://anonymous.4open.science/r/ZipAct-81CD/}.
URL: https://openreview.net/forum?id=ZssIalqqrz
---
Title: You Only Prune Once: A Zero-Shot, Data-Free Pruning at Initialization via Low-Rank Residual Saliency
Abstract: Pruning at initialization (PaI) seeks sparse subnetworks that can be trained from scratch without iterative retraining or post-hoc compression. Most existing PaI methods rely on data, gradients, or iterative structural optimization, and their saliency scores are typically coupled to a specific sparsity budget. This work introduces a zero-shot, data- and gradient-free pruning criterion based on nonnegative low-rank residual saliency. At random initialization, a once-only ordering of parameters is obtained by measuring their deviation from a low-rank additive template in the absolute weight space. This fixed ordering can be re-thresholded to realize arbitrary sparsity levels without rescoring, decoupling parameter ranking from the sparsity budget and dataset.
Structural and dynamical analyses provide insight into the effectiveness of residual-based pruning. Spectral evaluation shows stronger post-pruning low-rank concentration than competing methods, while neural tangent kernel diagnostics indicate alignment between residual magnitude and functional influence. Empirical results across CIFAR-10/100, Tiny-ImageNet, ImageNet, and modern ConvNeXt architectures demonstrate competitive or superior performance relative to gradient-based and topology-driven PaI baselines, particularly at extreme sparsity ($\geq 99\%$), alongside substantial reductions in pruning time. These findings suggest that a once-only, dataset-agnostic saliency ordering can reliably identify trainable sparse subnetworks from intrinsic structural properties of random initialization.
URL: https://openreview.net/forum?id=J78UavuUUA
---
Title: Adaptive Rank Control for Robust Reinforcement Learning
Abstract: Robust reinforcement learning (RL) is commonly formulated as a min--max optimization problem to account for epistemic uncertainty in transition dynamics. While theoretically appealing, such formulations are computationally demanding and often induce overly conservative policies. We study an alternative approach in which transition dynamics are sampled from an uncertainty set and robustness is achieved through explicit control of policy complexity. In the neural tangent kernel regime, we show that training with uniformly sampled dynamics induces a bias--variance tradeoff, with lower-rank policy representations exhibiting reduced sensitivity to epistemic perturbations. Within the framework of entropy-regularized RL, we formulate robust learning as a bi-level optimization problem that balances expressiveness and robustness via adaptive low-rank policy representations, leading to an adaptive rank-selection mechanism that navigates this tradeoff during training. We establish policy convergence and demonstrate empirically on MuJoCo continuous-control benchmarks that the proposed method provides a scalable and computationally efficient alternative to traditional robust RL, achieving improved robustness without the overhead of adversarial inner-loop optimization.
URL: https://openreview.net/forum?id=lG7VizZQcS
---
Title: UnitNorm: Rethinking Normalization for Transformers in Time Series
Abstract: Normalization techniques are crucial for enhancing Transformer models' performance and stability in time series analysis tasks, yet we identify that traditional methods like batch and layer normalization often lead to issues such as token shift, attention shift, and sparse attention. We propose UnitNorm, a novel normalization approach that scales input vectors by their norms and modulates attention patterns, effectively circumventing these challenges. Grounded in existing normalization frameworks, UnitNorm demonstrates its effectiveness across diverse time series analysis tasks, including forecasting, classification, and anomaly detection, via a rigorous evaluation on 6 state-of-the-art models and 10 datasets. UnitNorm shows superior performance, particularly where robust attention and contextual understanding are vital, achieving up to a 1.46 MSE decrease in forecasting and a 4.89\% accuracy increase in classification. This work not only calls for a re-evaluation of normalization strategies in time series Transformers but also sets a new direction for enhancing model performance and stability.
URL: https://openreview.net/forum?id=OOUvt1IAMl
---
Title: FedDID: Discrepancy-Informed Distillation to Target Personalization and Generalization in Federated Learning
Abstract: Statistical heterogeneity poses a central challenge in federated learning (FL), degrading both local personalization through client class imbalance and global generalization through unstable knowledge retention across rounds. Prior work often treats these goals separately, yielding generic FL methods optimized for global performance and personalized FL methods tailored to local models; even recent approaches that consider both typically optimize them with distinct objectives. We observe that global-model regularization is a shared structure across both paradigms and can be leveraged to pursue both goals within a single mechanism. We propose Federated Discrepancy-Informed Distillation (FedDID), which jointly promotes personalization and generalization by adaptively aligning local and global models via classwise knowledge distillation weighted by prediction-confidence discrepancies, without notable computational overhead. We provide theoretical motivation and show strong empirical performance under label heterogeneity, achieving the best overall balance across datasets by combining high global accuracy with low forgetting alongside strong local accuracy. On CIFAR-10 in particular, FedDID improves global accuracy by 17% over the next best-performing baseline while remaining competitive with the local performance of dedicated personalization methods.
URL: https://openreview.net/forum?id=1mbuIjWhzs
---
Title: A New First-Order Meta-Learning Algorithm with Convergence Guarantees
Abstract: Learning new tasks by leveraging prior experience is a fundamental trait of intelligent systems. While Model-Agnostic Meta-Learning (MAML) is a leading approach, it suffers from significant computational and memory overhead due to the requirement of computing second-order meta-gradients. We propose \textbf{FO-B-MAML}, a novel first-order variant of MAML derived from a bi-level optimization perspective. Our framework introduces a new expression of the meta-gradient, defined as the derivative of the solution of a perturbed optimization problem. This formulation allows the meta-gradient to be estimated using various finite difference methods; in this work, we propose and analyze two simple yet effective estimators: a forward and a symmetric approximation.
Unlike existing first-order methods like FO-MAML and Reptile, which suffer from irreducible bias, we prove that FO-B-MAML converges to a stationary point of the meta-objective. Notably, the symmetric estimator achieves an improved $\mathcal{O}(\delta^{2/3})$ bias rate, strictly enhancing previous first-order theory. Furthermore, we demonstrate that the MAML objective violates standard smoothness assumptions; we show instead that its smoothness constant grows with the norm of the meta-gradient. This property theoretically justifies the use of normalized or clipped-gradient methods (SNGDM) over vanilla gradient descent.
Our empirical results validate these advancements: FO-B-MAML achieves high accuracy on MNIST-1D, tracking closely with second-order MAML performance. Crucially, our method bypasses the ``activation bottleneck'' of second-order approaches, maintaining a flat memory footprint even when scaling to deep, activation-heavy CNNs with up to 250 channels.
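The forward and symmetric finite-difference estimators mentioned above can be illustrated on a plain gradient (a minimal sketch; the paper applies them to the meta-gradient of a perturbed inner optimization problem, which is not reproduced here):

```python
import numpy as np

def forward_diff(f, x, delta=1e-3):
    """Forward finite difference: approximation error O(delta)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = delta
        g[i] = (f(x + e) - f(x)) / delta
    return g

def symmetric_diff(f, x, delta=1e-3):
    """Symmetric (central) finite difference: error O(delta**2),
    the sharper of the two estimators."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = delta
        g[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return g

f = lambda v: (v ** 3).sum()   # true gradient: 3 v^2
x = np.array([1.0, 2.0])
true_g = 3 * x ** 2
err_fwd = np.abs(forward_diff(f, x) - true_g).max()
err_sym = np.abs(symmetric_diff(f, x) - true_g).max()
```

The symmetric estimator's lower truncation error is the same mechanism behind the improved $\mathcal{O}(\delta^{2/3})$ bias rate claimed in the abstract.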
URL: https://openreview.net/forum?id=4zGAXQxisB
---
Title: LLM Probability Concentration: How Alignment Shrinks the Generative Horizon
Abstract: Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this consistency in generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the **Branching Factor** (BF)--a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) Alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by a factor of 2--5 overall, and up to an order of magnitude (e.g., from 12 to 1.2) at the beginning positions. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies.
Building on this insight, we find this consistency has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
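One standard way to formalize a token-invariant effective-branching measure like the BF described above is the exponential of the next-token entropy (i.e., the perplexity of the next-token distribution); the paper's exact definition may differ, so treat this as an illustrative sketch:

```python
import numpy as np

def branching_factor(probs):
    """Effective number of plausible next tokens: exp of the Shannon
    entropy of the next-token distribution. A peaked distribution gives
    a value near 1; a uniform one over k tokens gives exactly k."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]  # 0 * log 0 contributes nothing to the entropy
    return float(np.exp(-(nz * np.log(nz)).sum()))

bf_uniform = branching_factor([0.25] * 4)                 # four equally likely tokens
bf_peaked = branching_factor([0.97, 0.01, 0.01, 0.01])    # sharply aligned-style distribution
```

On this toy scale, a drop from `bf_uniform` toward `bf_peaked` mirrors the order-of-magnitude BF reduction (e.g., 12 to 1.2) the abstract reports after alignment tuning.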
URL: https://openreview.net/forum?id=KotVuXj6CL
---
Title: Interpretable factorization of clinical questionnaires to identify latent factors of psychopathology
Abstract: Psychiatry research seeks to understand the manifestations of psychopathology in behavior, as measured in questionnaire data, by identifying a small number of latent factors that explain them. While factor analysis is the canonical tool for this purpose, the resulting factors may not be interpretable, and may also be subject to confounding variables. Moreover, missing data are common, and explicit imputation is often required. To overcome these limitations, we introduce Interpretability Constrained Questionnaire Factorization (ICQF), a non-negative matrix factorization method with regularization tailored for questionnaire data. Our method aims to promote factor interpretability and solution stability. We provide an optimization procedure with theoretical convergence guarantees, and an automated procedure to determine latent dimensionality accurately. We validate these procedures using realistic synthetic data. We demonstrate the effectiveness of our method in a widely used general-purpose questionnaire, in two independent datasets (the Healthy Brain Network and Adolescent Brain Cognitive Development studies). Specifically, we show that ICQF improves interpretability, as defined by domain experts, while preserving diagnostic information across a range of disorders, and outperforms competing methods for smaller dataset sizes. This suggests that the regularization in our method matches domain characteristics.
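ICQF builds on non-negative matrix factorization; a bare multiplicative-update NMF (without the paper's interpretability constraints and questionnaire-specific regularization, which are omitted here) can be sketched as:

```python
import numpy as np

def nmf(X, k, iters=200, seed=0):
    """Plain multiplicative-update NMF minimizing ||X - W H||_F^2.
    W and H stay non-negative because updates are multiplicative."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# A rank-2 non-negative "questionnaire" matrix (rows: subjects, cols: items)
# is recovered well by a rank-2 factorization.
X = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [1.0, 3.0, 3.0]])
W, H = nmf(X, k=2)
err = np.linalg.norm(X - W @ H)
```

In ICQF the rows of `H` would play the role of latent psychopathology factors, with added structure to keep them interpretable.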
URL: https://openreview.net/forum?id=1Yq6INJwiO
---
Title: Hybrid Belief–Reinforcement Learning for Efficient Coordinated Spatial Exploration
Abstract: Coordinating multiple autonomous agents to explore and serve spatially heterogeneous demand requires jointly learning unknown spatial patterns and planning trajectories that maximize task performance. Pure model-based approaches provide structured uncertainty estimates but lack adaptive policy learning, while deep reinforcement learning often suffers from poor sample efficiency when spatial priors are absent. This paper presents a hybrid belief-reinforcement learning (HBRL) framework to address this gap. In the first phase, agents construct spatial beliefs using a Log-Gaussian Cox Process (LGCP) and execute information-driven trajectories guided by a Pathwise Mutual Information (PathMI) planner with multi-step lookahead. In the second phase, trajectory control is transferred to a Soft Actor-Critic (SAC) agent, warm-started through dual-channel knowledge transfer: belief state initialization supplies spatial uncertainty, and replay buffer seeding provides demonstration trajectories generated during LGCP exploration. A variance-normalized overlap penalty enables coordinated coverage through shared belief state, permitting cooperative sensing in high-uncertainty regions while discouraging redundant coverage in well-explored areas. The framework is evaluated on a multi-UAV wireless service provisioning task. Results show 10.8% higher cumulative reward and 38% faster convergence over baselines, with ablation studies confirming that dual-channel transfer outperforms either channel alone.
URL: https://openreview.net/forum?id=ow8AIavKw5
---
Title: Identifying Responders in Randomized Controlled Trials and Observational Studies
Abstract: In this paper, we introduce causal responder detection (CARD), a novel method for responder analysis that identifies treated subjects whose outcomes are statistically different from the control response distribution. Leveraging recent advances in conformal prediction for novelty detection, CARD builds on the AdaDetect framework and, in randomized settings, inherits its finite-sample false discovery rate (FDR) control when coupled with the Benjamini–Hochberg procedure. For observational studies, we propose a propensity score–adjusted extension and establish asymptotic FDR control under standard causal assumptions of ignorability and overlap. Simulation studies and real-data applications demonstrate that CARD effectively detects responders with high power across a range of heterogeneous and distributional treatment-effect scenarios.
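The Benjamini–Hochberg step that CARD couples with is standard and easy to sketch; the conformal (AdaDetect-style) construction of the p-values themselves is not reproduced here:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Benjamini-Hochberg step-up procedure: reject the k smallest
    p-values, where k is the largest index with p_(k) <= q * k / m.
    Returns a boolean rejection mask controlling FDR at level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Tiny example: a few clear "responders" among mostly-null p-values.
mask = benjamini_hochberg([0.001, 0.009, 0.04, 0.2, 0.5, 0.8], q=0.1)
```

In CARD's setting, each p-value would come from comparing a treated subject's outcome against the conformal reference built from controls.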
URL: https://openreview.net/forum?id=zPt0o32mYn
---
Title: The α-Law of Observable Belief Revision in Large Language Model Inference
Abstract: Large language models (LLMs) that iteratively revise their outputs—via chain-of-thought, self-reflection, or multi-agent debate—lack principled guarantees on the stability of their probability updates. We identify a consistent multiplicative scaling law governing how instruction-tuned LLMs revise probability assignments over candidate answers: $\log q_1(i) = \alpha[\log q_0(i) + \log b(i)] + c$, where $\alpha$ is a belief revision exponent and $b$ is evidence from verification. We prove that $\alpha < 1$ is necessary and sufficient for asymptotic stability under iterated revision. Empirical validation across 4,975 problems, four graduate-level benchmarks (GPQA Diamond, TheoremQA, MMLU-Pro, ARC-Challenge), and two primary model families (GPT-5.2, Claude Sonnet 4) yields $\alpha = 1.163 \pm 0.084$ with mean $R^2 = 0.76$: models exhibit near-Bayesian update behavior, slightly above the stability boundary. While single-step $\alpha$ exceeds 1, multi-step validation on 198 GPQA problems over 7 revision steps shows $\alpha$ decays from 0.84 to 0.54, yielding contractive long-run dynamics consistent with the stability theorem. Token-level logprob validation on 191 problems with Llama-3.3-70B confirms median $\alpha \approx 1.0$ for both logprob and self-reported elicitation. Decomposing the update into prior and evidence components reveals architecture-specific trust-ratio fingerprints: GPT-5.2 exhibits balanced weighting ($\tau \approx 1.0$) while Claude shows slight evidence-favoring ($\tau \approx 1.1$). This work characterizes observable inference-time update behavior; it does not claim that LLMs internally perform Bayesian inference. The α-law provides a principled diagnostic for monitoring observable update quality in LLM inference systems.
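On synthetic data, the scaling-law exponent can be recovered by ordinary least squares on the log-probabilities (illustrative only; the paper's elicitation and fitting protocol may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, c_true = 0.8, -0.05
x = rng.normal(size=200)                 # stand-in for log q0(i) + log b(i)
y = alpha_true * x + c_true + 0.01 * rng.normal(size=200)  # log q1(i)

# OLS fit of the scaling law: log q1 = alpha * (log q0 + log b) + c
A = np.column_stack([x, np.ones_like(x)])
(alpha_hat, c_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
```

An estimated `alpha_hat < 1`, as here, corresponds to the contractive (stable) regime the abstract's theorem describes.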
URL: https://openreview.net/forum?id=0HgEgxqpaR
---
Title: FUND: Flow Matching for Sampling Unnormalized Distributions
Abstract: Efficient sampling from Boltzmann distributions is central to modelling complex physical systems. Markov Chain Monte Carlo (MCMC) methods suffer from critical slowing down, high autocorrelation, and poor mode-mixing, limiting their scalability. Recent advances, like Boltzmann Generators, offer a promising alternative but remain constrained by costly MCMC-based training, inefficient sampling, and poor ergodicity. We introduce an algorithm for learning Boltzmann distributions that does not require any true samples for training. Our approach draws inspiration from flow matching but departs fundamentally from sample-trajectory matching to distribution-trajectory matching. The algorithm iteratively reshapes the target distribution, using model-generated samples to guide learning and ensure comprehensive mode coverage. We validate our method on challenging benchmarks, including a 2D Gaussian mixture, Many-Well distributions, and high-dimensional scalar $\phi^4$ theory. Our approach not only outperforms traditional MCMC and flow-based methods in efficiency and accuracy but also establishes a new paradigm for sample-free learning of complex physical distributions.
URL: https://openreview.net/forum?id=O05dDDVcyZ
---
Title: Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including intermediate reasoning steps and optional tool or environment interactions, determines the data that the optimizer ultimately learns from, yet rollout design is often treated as an implementation detail and underreported. This survey provides an optimizer-agnostic view of rollout strategies for RL-based post-training of reasoning LLMs. We formalize rollout pipelines with unified notation and introduce Generate–Filter–Control–Replay (GFCR), a lifecycle taxonomy that decomposes rollout pipelines into four modular and composable stages: Generate proposes candidate trajectories and topologies; Filter constructs intermediate signals via verifiers, judges, or critics; Control allocates compute and makes continuation/branching/stopping decisions under budgets; and Replay retains and reuses artifacts across rollouts without weight updates, including self-evolving curricula that autonomously generate new training tasks and data. We complement GFCR with a criterion taxonomy of reliability, coverage, and cost sensitivity that characterizes the trade-offs rollout designs must navigate. Using this framework, we synthesize methods spanning RL with verifiable rewards, process supervision, judge-based gating, guided and tree/segment rollouts, adaptive compute allocation, early-exit and partial rollouts, systems-level throughput optimization, and replay/recomposition for self-improvement. We ground the framework with case studies in math, code/SQL, multimodal reasoning, tool-using agents, and agentic skill benchmarks that evaluate skill induction, reuse, and cross-task transfer. 
Finally, we provide a practitioner-oriented diagnostic index that maps common rollout pathologies to GFCR modules and mitigation levers, alongside open challenges for building reproducible, compute-efficient, and trustworthy rollout pipelines.
URL: https://openreview.net/forum?id=aB848FutiR
---
Title: No Imputation Needed: A Switch Approach to Irregularly Sampled Time Series
Abstract: Modeling irregularly sampled time series (ISTS) is challenging because of missing values. Most existing methods handle ISTS by converting irregularly sampled data into regularly sampled data via imputation. These models assume an underlying missing mechanism, which may lead to unwanted bias and sub-optimal performance. We present SLAN (Switch LSTM Aggregate Network), which utilizes a group of LSTMs to model ISTS without imputation, eliminating the assumption of any underlying process. It dynamically adapts its architecture on the fly based on the measured sensors using switches. SLAN exploits the irregularity information to explicitly capture each sensor's local summary and maintains a global summary state throughout the observational period. We demonstrate the efficacy of SLAN on two public datasets, namely, MIMIC-III and Physionet 2012, for the in-hospital mortality prediction task. The code will be made publicly available upon acceptance of the manuscript.
URL: https://openreview.net/forum?id=B5FOZfmBlQ
---
Title: A Stochastic Optimization Framework for Private and Fair Learning From Decentralized Data
Abstract: Machine learning models are often trained on sensitive data (e.g., medical records and race/gender) that is distributed across different “silos” (e.g., hospitals). These federated learning models may then be used to make consequential decisions, such as allocating healthcare resources. Two key challenges emerge in this setting: (i) maintaining the privacy of each person’s data, even if other silos or an adversary with access to the central server tries to infer this data; (ii) ensuring that decisions are fair to different demographic groups (e.g., race/gender). In this paper, we develop a novel algorithm for private and fair federated learning (FL). Our algorithm satisfies inter-silo record-level differential privacy (ISRL-DP), a strong notion of private FL requiring that each silo’s communicated messages satisfy record-level differential privacy. In addition to being differentially private, our framework can be used to promote different fairness notions, including demographic parity and equalized odds. We prove that our algorithm converges under mild smoothness assumptions on the loss function (even in nonconvex settings), whereas prior work required strong convexity for convergence. As a byproduct of our analysis, we obtain the first convergent algorithm for ISRL-DP optimization of nonconvex-strongly concave min-max loss functions in federated learning. This convergent DP optimization algorithm is a valuable contribution in its own right. Additionally, our experiments demonstrate the state-of-the-art fairness-accuracy tradeoffs of our algorithm across different privacy levels. Compared to the existing state of the art, we obtain an average reduction of around 64% in demographic parity violation and 95% in equalized odds violation.
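Record-level guarantees on a silo's message are typically obtained with per-record gradient clipping plus Gaussian noise; the sketch below shows that generic mechanism, not the paper's full algorithm (`clip` and `noise_mult` are illustrative parameters):

```python
import numpy as np

def private_silo_gradient(per_record_grads, clip=1.0, noise_mult=1.0, rng=None):
    """Record-level DP for one silo's message: clip each record's gradient
    to L2 norm `clip`, average, and add Gaussian noise scaled to the
    per-record sensitivity clip / n (Gaussian mechanism)."""
    rng = rng or np.random.default_rng(0)
    g = np.asarray(per_record_grads, dtype=float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g * np.minimum(1.0, clip / np.maximum(norms, 1e-12))  # clip per record
    mean = g.mean(axis=0)
    sigma = noise_mult * clip / g.shape[0]  # noise scale for the averaged message
    return mean + rng.normal(scale=sigma, size=mean.shape)

grads = np.array([[3.0, 4.0], [0.3, 0.4], [-0.6, 0.8]])  # one silo's per-record gradients
msg = private_silo_gradient(grads, clip=1.0, noise_mult=0.0)  # noise off to show clipping
```

Under ISRL-DP, every message each silo sends must be protected this way, rather than only the final aggregated model.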
URL: https://openreview.net/forum?id=ClAkvLjKr6
---
Title: When Parametric Knowledge Wins: A Controlled Ablation of Agent Skills and Tool Use for PII Detection in Small Language Models
Abstract: Agent augmentation is widely assumed to improve performance, yet this study shows that for small language models it systematically degrades capability under controlled conditions. We identify a structural failure mode in agentic pipelines: augmentation assumed to benefit capable models consistently hurts performance in the 7–9B parameter class. A controlled ablation was run across four open-weight instruction-tuned models (Gemma 2 9B, Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B) and four conditions: zero-shot prompting, documentation injection (+Docs), tool access (+Tool), and skills injection (+Skills). The benchmark is a stratified 2,000-sample dataset drawn from three public PII sources and scored against PII-Codex canonical types after full label alignment. Under strict canonical-to-canonical scoring, zero-shot prompting outperforms every augmented condition for every model in the 7–9B class. Tool use and skills injection reduce mean F1 by 13 to 24 percentage points relative to zero-shot (p < 0.0001, Cohen’s d from −0.39 to −0.67). Documentation is mostly neutral, though it significantly hurts Llama 3.1 8B (∆ = −0.17). Adding a Skill document on top of tool access provided no measurable benefit for any model. The degradation is not uniform: structured types like Date and IP Address actually improve under tool use, while temporal (Date Time) and medical (Health Insurance ID) types collapse near zero, driven by label-schema mismatches between PII-Codex output and ground truth. Implications are discussed for evaluation methodology and agentic pipeline design in the 7–9B parameter class.
URL: https://openreview.net/forum?id=K03RICaFse
---
Title: In‑Context Planning with Latent Temporal Abstractions
Abstract: Planning-based reinforcement learning for continuous control is bottlenecked by two practical issues: planning at primitive time scales leads to prohibitive branching and long horizons, while real environments are frequently partially observable and exhibit regime shifts that invalidate stationary, fully observed dynamics assumptions. We introduce I‑TAP (In‑Context Latent Temporal‑Abstraction Planner), an offline RL framework that unifies in-context adaptation with online planning in a learned discrete temporal-abstraction space. From offline trajectories, I‑TAP learns an observation-conditioned residual-quantization VAE that compresses each observation–macro-action segment into a coarse-to-fine stack of discrete residual tokens, and a temporal Transformer that autoregressively predicts these token stacks from a short recent history. The resulting sequence model acts simultaneously as a context-conditioned prior over abstract actions and a latent dynamics model. At test time, I‑TAP performs Monte Carlo Tree Search directly in token space, using short histories for implicit adaptation without gradient update, and decodes selected token stacks into executable actions. Across deterministic MuJoCo, stochastic MuJoCo with different latent dynamics regimes, and high-dimensional Adroit manipulation, including partially observable variants, I-TAP consistently matches or outperforms strong model-free and model-based offline baselines, demonstrating efficient and robust in-context planning under stochastic dynamics and partial observability.
URL: https://openreview.net/forum?id=1P3ZSoV1tm
---
Title: Provably Safe Generative Sampling with Constricting Barrier Functions
Abstract: Flow-based generative models, such as diffusion models and flow matching models, have achieved remarkable success in learning complex data distributions. However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints. We propose a safety filtering framework that acts as an online shield for any pre-trained generative model. Our key insight is to cooperate with the generative process rather than override it. We define a constricting safety tube that is relaxed at the initial noise distribution and progressively tightens to the target safe set at the final data distribution, mirroring the coarse-to-fine structure of the generative process itself. By characterizing this tube via Control Barrier Functions (CBFs), we synthesize a feedback control input through a convex Quadratic Program (QP) at each sampling step. As the tube is loosest when noise is high and intervention is cheapest in terms of control energy, most constraint enforcement occurs when it least disrupts the model’s learned structure. We prove that this mechanism guarantees safe sampling while minimizing the distributional shift from the original model at each sampling step, as quantified by the KL divergence. Our framework applies to any pre-trained flow-based sampling scheme requiring no retraining or architectural modifications. We validate the approach across constrained image generation, physically-consistent trajectory sampling, and safe robotic manipulation policies, achieving 100% constraint satisfaction while preserving semantic fidelity.
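With a single affine CBF constraint, the per-step QP has a closed-form projection, which the following toy filter uses (the paper's time-varying constricting tube and multi-constraint QP are not modeled here; `a` and `b` encode one hypothetical safety condition):

```python
import numpy as np

def cbf_qp_filter(u_nom, a, b):
    """Minimal-intervention safety filter for one affine CBF condition:
    minimize ||u - u_nom||^2 subject to a @ u >= b. With a single
    constraint, the QP reduces to projection onto a half-space."""
    a = np.asarray(a, dtype=float)
    u_nom = np.asarray(u_nom, dtype=float)
    slack = a @ u_nom - b
    if slack >= 0:
        return u_nom                        # nominal update already satisfies the CBF
    return u_nom + (-slack / (a @ a)) * a   # smallest correction restoring safety

u_unsafe_fixed = cbf_qp_filter(u_nom=[1.0, 0.0], a=[0.0, 1.0], b=0.5)
u_already_safe = cbf_qp_filter(u_nom=[0.0, 1.0], a=[0.0, 1.0], b=0.5)
```

In the paper's scheme, `b` would tighten over the sampling trajectory (the constricting tube), so most of the correction happens early, when noise is high and intervention is cheap.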
URL: https://openreview.net/forum?id=iZi471b4Pf
---
Title: Tabular Learning Revisited: An Empirical Study of Tabular Classification
Abstract: Tabular data represent one of the most prevalent data formats in applied machine learning, largely because they accommodate a broad spectrum of real-world problems.
Existing literature has studied many of the shortcomings of neural architectures on tabular data and has repeatedly confirmed the scalability and robustness of gradient-boosted decision trees across varied datasets. However, recent deep learning models have not been subjected to a comprehensive evaluation under conditions that allow for a fair comparison with existing classical approaches. This situation motivates an investigation into whether recent deep-learning paradigms outperform classical ML methods on tabular data. Our survey fills this gap by benchmarking twenty state-of-the-art methods, spanning neural networks, classical ML and AutoML techniques. Our empirical results over 68 diverse classification datasets from a well-established benchmark indicate a paradigm shift, where Deep Learning methods outperform classical approaches.
URL: https://openreview.net/forum?id=I8BIGp4XOb
---
Title: Multi-View Modeling for Stock Investment Risk Forecasting
Abstract: Forecasting stock investment risk is crucial for effective financial decision-making. Existing research on stock risk forecasting remains limited by the lack of large-scale datasets and standardized investment risk forecasting tasks. To address this problem, we construct a stock investment risk dataset that standardizes the stock risk forecasting task as regression and classification problems, providing a benchmark for stock investment risk forecasting. Recent works based only on time series data capture a limited aspect of historical stock price data. To address this issue, we propose a multi-view framework that leverages large language models (LLMs) and pre-trained vision models to extract complementary long-periodic and short-periodic patterns from historical stock data. Additionally, we propose a strategy that maps these features back to the temporal domain, effectively preserving temporal dependencies. Experimental results on our dataset demonstrate that our proposed model outperforms competitive baselines on both the regression and classification tasks of stock investment risk forecasting. The code and dataset are released at https://anonymous.4open.science/r/MultiV-RF-F87F
URL: https://openreview.net/forum?id=x6VwjjpXHi
---
Title: Index2Sort: Sorting Algorithm Using Static Index Structure
Abstract: We introduce Index2Sort, a general framework for deriving sorting algorithms from static indexes. Index2Sort treats the index as an opaque box that exposes only two operations: index construction and rank queries. This abstraction allows Index2Sort to be applied to various index structures, including classical and learned indexes. Our theoretical analysis shows that the computational guarantees of the index transfer directly to Index2Sort. If the index can be constructed in expected time $\mathcal{O}(nC(n))$ and can answer rank queries in expected time $\mathcal{O}(Q(n))$, then Index2Sort sorts the input in expected time $\mathcal{O}(nC(n) + nQ(n))$. In particular, when using a state-of-the-art learned index with $C(n)=Q(n)=1$, this yields an expected complexity of $\mathcal{O}(n)$, which is a strictly tighter bound than those of existing learned sorting algorithms. In contrast to recent theoretical works on learned sorting, which derive complexity guarantees by analyzing the internal structure of a learned index and designing a sorting algorithm with a similar structure, Index2Sort achieves stronger guarantees without requiring any inspection or modification of the index internals.
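The framework is easy to instantiate with any index that exposes only construction and rank queries; the sketch below uses a counting-based index as a stand-in for a learned index (toy assumption: small non-negative integer keys):

```python
import numpy as np

class CountingIndex:
    """A toy static index over small non-negative integers: O(n + U)
    construction, O(1) rank queries (count of keys strictly less than x)."""
    def __init__(self, keys, universe):
        counts = np.bincount(keys, minlength=universe)
        self.cum = np.concatenate([[0], np.cumsum(counts)])

    def rank(self, x):
        return int(self.cum[x])

def index2sort(keys, universe=1024):
    """Sort by placing each key at the slot given by its rank; duplicates
    are handled by a per-key offset. The sorter touches the index only
    through construction and rank queries, treating it as an opaque box."""
    idx = CountingIndex(keys, universe)
    out = [None] * len(keys)
    seen = {}
    for x in keys:
        off = seen.get(x, 0)
        out[idx.rank(x) + off] = x
        seen[x] = off + 1
    return out

sorted_keys = index2sort([5, 3, 9, 3, 7])
```

Swapping `CountingIndex` for a learned index with $C(n)=Q(n)=1$ is what yields the expected $\mathcal{O}(n)$ bound the abstract states.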
URL: https://openreview.net/forum?id=YUmVxjMhpm
---
Title: Reproducing FACTER: Fairness via Conformal Thresholding and Prompt Repair
Abstract: Fayyazi et al. (2025) recently proposed FACTER, an innovative model-agnostic framework designed to jointly enforce fairness and statistical coverage in LLM-based recommendation through conformal thresholding and iterative prompt repair. In this work, we conduct a critical reproduction of the FACTER framework across diverse architectures and dataset sparsity levels. Our evaluation reveals a divergence in recommendation utility during open-ended generation tasks, where metrics appear highly sensitive to specific evaluation protocols. To build upon these results, we introduce a re-ranking extension that constrains the search space, successfully aligning utility magnitudes with previously reported findings while maintaining fairness effectiveness. Further analysis of the fairness mechanism suggests that violation reductions are largely influenced by the adaptive nature of the conformal thresholding. Additionally, we find that in constrained ranking scenarios, static fairness instructions can achieve comparable results to the dynamic repair loop, suggesting opportunities for optimising computational overhead. By reconciling implementation nuances with theoretical formalisations, our study provides insights into the practical deployment of fair LLM-based recommenders. All code and reproduction artifacts are available at https://anonymous.4open.science/r/facter-repr-105B/.
URL: https://openreview.net/forum?id=4BPFVex4EM
---
Title: Regularity and Stability Properties of Selective SSMs with Discontinuous Gating
Abstract: Deep selective State-Space Models (SSMs), whose state-space parameters are modulated online by a selection signal, offer significant expressive power but pose challenges for stability analysis, especially under discontinuous gating. We study continuous-time selective SSMs through the lenses of passivity and Input-to-State Stability (ISS), explicitly distinguishing the selection schedule $x(\cdot)$ from the driving (port) input $u(\cdot)$. First, we show that state-strict dissipativity ($\beta>0$) together with quadratic bounds on a storage functional implies exponential decay of homogeneous trajectories ($u\equiv 0$), yielding exponential forgetting. Second, by freezing the selection ($x(t)\equiv 0$) we obtain a passive LTV input-output subsystem and prove that its minimal available storage is necessarily quadratic, $V_{a,0}(t,h)=\tfrac{1}{2}h^H Q_0(t)h,$ with $Q_0 \in \mathrm{AUC}_{\mathrm{loc}}$, accommodating discontinuities induced by gating. Third, under the strong hypothesis that a single quadratic storage certifies passivity uniformly over all admissible selection schedules, we derive a parametric LMI and universal kernel constraints on gating, formalizing an "irreversible forgetting" structure. Finally, we give sufficient conditions for global ISS with respect to the port input $u(\cdot)$, uniformly over admissible selection schedules, and we validate the main predictions in targeted simulation studies.
URL: https://openreview.net/forum?id=7Vav53cDeN
---
Title: Limits and Gains of Test-Time Scaling in Vision-Language Reasoning
Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied across both open-source and closed-source VLMs on different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.
URL: https://openreview.net/forum?id=zHGxvPBydX
---
Title: What is a Number, That a Large Language Model May Know It?
Abstract: Numbers are a basic part of how humans represent and describe the world around them. As a consequence, learning effective representations of numbers is critical for the success of large language models as they become more integrated into everyday decisions. However, these models face a challenge: depending on context, the same sequence of digit tokens, e.g., 911, can be treated as a number or as a string. What kind of representations arise from this duality, and what are its downstream implications? Using a similarity-based prompting technique from cognitive science, we show that LLMs learn representational spaces that blend string-like and numerical representations. In particular, we show that elicited similarity judgments from these models over integer pairs can be captured by a combination of Levenshtein edit distance and numerical Log-Linear distance, suggesting an entangled representation. In a series of experiments we show how this entanglement is reflected in the latent embeddings, how it can be reduced but not entirely eliminated by context, and how it can propagate into a realistic decision scenario. These results shed light on a representational tension in transformer models that must learn what a number is from text input.
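The reported blend of string-like and numerical similarity can be mimicked with a weighted sum of Levenshtein distance on digit strings and log-linear distance (the weight `w` and the exact functional form are illustrative assumptions, not the paper's fitted model):

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def entangled_distance(x: int, y: int, w: float = 0.5) -> float:
    """Blend of string-like and numerical distance between integers:
    w * edit distance on digit strings + (1 - w) * log-linear distance."""
    return w * levenshtein(str(x), str(y)) + (1 - w) * abs(math.log(x) - math.log(y))

d_near_number = entangled_distance(911, 912)   # one edit apart AND numerically close
d_near_string = entangled_distance(911, 9111)  # one edit apart but 10x larger
```

Under a purely numerical metric the second pair would be far apart; the edit-distance component keeps them close, which is the entanglement the abstract describes.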
URL: https://openreview.net/forum?id=dnXulFHKbw
---
Title: Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Abstract: Recent advances in large language models (LLMs) have substantially improved single-turn task performance, yet real-world applications increasingly demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent progress in evaluating and enhancing multi-turn LLM interactions. Centered on a task-oriented taxonomy—spanning instruction following in domains such as mathematics and coding, and conversational engagement in role-playing, healthcare, education, and adversarial jailbreak settings—we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness across prolonged dialogues. We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies (in-context learning, supervised fine-tuning, reinforcement learning, and architectural innovations), external integration approaches (memory augmentation, retrieval-based methods, and knowledge graphs), and agent-based techniques for collaborative interaction. Finally, we identify open challenges and promising directions for future research to further improve the robustness and effectiveness of multi-turn LLM interactions.
URL: https://openreview.net/forum?id=UYNQXPevpF
---
Title: Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning
Abstract: Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from quadratic parameter growth as resolution increases and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths—at their respective best conditions, visual scaling unlocks a 28% performance gain for Impoola over Impala. These gains are strongest in environments that require precise perception of small or distant objects, and gradient saliency analysis confirms that the underlying mechanism is a more spatially localized visual attention of the policy at higher resolutions. Our results challenge the prevailing practice of aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL.
URL: https://openreview.net/forum?id=xNm5W5Widp
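The quadratic parameter growth described above is easy to verify with a back-of-the-envelope count of the linear layer that follows the convolutional trunk. The channel count, hidden width, and 8x downsampling factor below are hypothetical placeholders, not the actual Impala/Impoola configurations.

```python
def head_params(resolution: int, channels: int = 32, hidden: int = 256,
                pool: str = "flatten") -> int:
    # Parameter count of the dense layer after the conv trunk, assuming the
    # trunk downsamples the input by 8x (an assumption for illustration).
    spatial = resolution // 8
    if pool == "flatten":
        features = channels * spatial * spatial  # grows with resolution^2
    else:  # global average pooling, as in Impoola
        features = channels                      # independent of resolution
    return features * hidden + hidden            # weights + biases
```

Doubling the resolution roughly quadruples the flatten head's parameters, while the pooled head stays constant, which is the decoupling the abstract attributes to Impoola.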
---
Title: Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory
Abstract: We study neural network compressibility by using singular learning theory to extend the minimum description length (MDL) principle to singular models like neural networks. Through extensive experiments on the Pythia suite with quantization, factorization, and other compression techniques, we find that complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases, linearly correlated with compressibility. Our results provide a path toward rigorously evaluating the limits of model compression.
URL: https://openreview.net/forum?id=mJVJ2FcbBC
---
Title: Localized Additive Explanations (LAX)
Abstract: We propose a novel technique for training an interpretable artificial neural network for image classification. Models trained with our method generate their own “visual explanation” for a prediction during the prediction step. Our approach, Localized Additive eXplanations (LAX), trains a neural network to predict the areas of an image which, if occluded (dropped out), would prevent the learner from predicting the correct class. These regions must be essential to the prediction mechanism modeled by the learner, so the method extracts regions that contain essential signals used for prediction. This method is applicable to any image classification problem, but requires training a model with a special structure that can be used to produce such visualizations. By utilizing this custom structure we bypass the need for an external explanation step, achieve finer-grained evidence localization, and gain a performance benefit by incorporating explanation generation into the prediction step. Our approach sacrifices a small amount of classification accuracy for the benefit of reduced time-to-explanation. LAX models are trained to optimize objectives which ensure that the signals identified by the explanation are ‘correct’ (they are evidence for a class), ‘complete’ (essential evidence for predicting a class), and ‘exclusive’ (not identifying anything other than evidence for a class). When the joint objective is achieved well, the resultant neural network is not only an accurate classifier; more importantly, the same network provides a truthful explanation along with each prediction.
URL: https://openreview.net/forum?id=AAGozPtEXs
---
Title: ReciNet: Reciprocal Space-Aware Long-Range Modeling for Crystalline Property Prediction
Abstract: Predicting properties of crystals from their structures is a fundamental yet challenging task in materials science. Unlike molecules, crystal structures exhibit infinite periodic arrangements of atoms, requiring methods capable of capturing both local and global information effectively. However, current works fall short of capturing long-range interactions within periodic structures. To address this, we leverage reciprocal space, the natural domain for periodic crystals, and construct a Fourier series representation from fractional coordinates and reciprocal lattice vectors with learnable filters. Building on this, we introduce the reciprocal space-based geometry network (ReciNet), a novel architecture that integrates geometric GNNs and reciprocal blocks to model short-range and long-range interactions. Experiments on comprehensive benchmarks JARVIS, Materials Project, and MatBench demonstrate that ReciNet achieves state-of-the-art predictive accuracy across a range of crystal property prediction tasks. Additionally, we explore a model extension for multi-property prediction with the mixture-of-experts, which demonstrates high computational efficiency and reveals positive transfer between correlated properties. These findings highlight the potential of our model as a scalable and accurate solution for crystal property prediction.
URL: https://openreview.net/forum?id=ODlxgod5e3
---
Title: Merging Feed-Forward Sublayer for Compressed Transformers
Abstract: Pruning is a prevailing model compression method that typically operates by identifying and removing unimportant parameters based on various importance metrics. In this work, we challenge this paradigm by instead targeting redundant parameters via intra-model merging techniques. Specifically, we propose a method that combines multiple feed-forward sublayers in Transformer models through neuron alignment, merging, and weight tying. We find that this method produces compressed models with performance comparable to their original models while tying more than a third of their feed-forward sublayers, and that it outperforms a strong, generalized layer pruning baseline. For example, we can remove more than 21% of the total parameters from a vision transformer while maintaining 99% of its original performance on ImageNet. Additionally, we observe high activation similarity between different feed-forward sublayers, offering novel insight into their behavior and contextualizing their surprising mergeability.
URL: https://openreview.net/forum?id=t8iuiH46g0
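A minimal sketch of the align-then-merge idea, using greedy cosine-similarity matching as a stand-in for the paper's neuron-alignment procedure (the actual alignment and merging details may differ):

```python
import numpy as np

def align_and_merge(w1: np.ndarray, w2: np.ndarray):
    # Greedily match each neuron (row) of w2 to its most similar row of w1
    # by cosine similarity, then average the aligned weights into one tied
    # matrix shared by both sublayers. Illustrative only.
    n1 = w1 / np.linalg.norm(w1, axis=1, keepdims=True)
    n2 = w2 / np.linalg.norm(w2, axis=1, keepdims=True)
    sim = n1 @ n2.T
    perm = np.full(w1.shape[0], -1, dtype=int)
    used = set()
    # Assign w1 rows in order of their best available match.
    for i in np.argsort(-sim.max(axis=1)):
        for j in np.argsort(-sim[i]):
            if int(j) not in used:
                perm[i] = int(j)
                used.add(int(j))
                break
    merged = 0.5 * (w1 + w2[perm])  # tied weights replace both sublayers
    return merged, perm
```

If two sublayers compute the same function up to a neuron permutation, the merged matrix recovers either one exactly; in practice the averaging trades a small reconstruction error for the parameter savings of weight tying.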
---
Title: Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs
Abstract: Large language models are known to hallucinate, generating linguistically plausible but incorrect answers to questions. Uncertainty quantification has been proposed as a strategy to detect such behaviour, but existing methods lack a unified framework to assess reliability at both the prompt and answer level. We introduce a geometric framework which quantifies language model uncertainty at both levels by explicitly modelling a prompt-conditioned semantic distribution in answer embedding space. Our approach is black-box and sampling-based; we generate multiple answers per prompt, and use archetypal analysis to estimate a geometric support for the answer distribution. At the prompt level, we approximate the distribution entropy to quantify uncertainty; for each individual answer, we then use notions of atypicality to assess its reliability relative to the batch. We employ our framework to not only detect hallucinations but correct them, by selecting the batch example deemed most reliable. Experiments show that our framework performs comparably to or better than prior methods on short form question-answering datasets, and achieves superior results on medical datasets where hallucinations carry particularly critical risks. Beyond pure performance, we suggest the theoretical grounding of our work provides support for semantic distributions as useful objects of study for language model uncertainty.
URL: https://openreview.net/forum?id=5UVv7gkgUD
---
Title: EEG-EyeTrack: A Benchmark for Time Series and Functional Data Analysis with Open Challenges and Baselines
Abstract: We present a new benchmark dataset for functional data analysis (FDA), focusing on the reconstruction of eye movements from EEG data. Our contribution is threefold: first, we introduce a challenging dataset collected with consumer-grade hardware under realistic conditions. Second, we propose open challenges and evaluation metrics tailored to FDA applications. Third, we establish baseline results for the primary regression task of reconstructing eye movements from EEG signals using functional neural networks. We report baseline results on both our new dataset and the established EEGEyeNet dataset, which was recorded with research-grade hardware.
URL: https://openreview.net/forum?id=merpXrz6rK
---
Title: [Re] Interpreting CLIP with Hierarchical Sparse Autoencoders
Abstract: CLIP (Contrastive Language-Image Pretraining) is a multimodal model able to transform both images and text into a fixed-size vector (Radford et al., 2021). These vectors are usually uninterpretable due to polysemanticity. Sparse AutoEncoders (SAEs) enable sparser representations of vectors, while also promoting monosemanticity. The purpose of this work is to reproduce the paper Interpreting CLIP with Hierarchical Sparse Autoencoders (Zaigrajew et al., 2025). The authors introduce the Matryoshka Sparse Autoencoder (MSAE), which learns representations in a hierarchical fashion. In order to reproduce this work, we identify and attempt to reproduce four central claims. Our results mainly support these four claims. From this, we conclude that the results are reproducible. We also propose two extensions: (i) a model-agnostic centroid-based method to assess monosemanticity for SAE neurons, and (ii) further inspection of progressive recovery capabilities at various granularity levels.
URL: https://openreview.net/forum?id=H4NvRHjiqD
---
Title: WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning
Abstract: Multimodal information, together with our knowledge, helps us understand the complex and dynamic world. Large language models (LLMs) and large multimodal models (LMMs), however, still struggle to emulate this capability. In this paper, we present WorldQA, a video understanding dataset designed to push the boundaries of multimodal world models, with three appealing properties:
(1) Multimodal Inputs: The dataset comprises 1007 question-answer pairs and 303 videos, necessitating the analysis of both auditory and visual data for successful interpretation.
(2) World Knowledge: We identify five essential types of world knowledge for question formulation. This approach challenges models to extend their capabilities beyond mere perception.
(3) Long-Chain Reasoning: Our dataset introduces an average reasoning step of 4.45, notably surpassing other videoQA datasets. Furthermore, we introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain, thereby facilitating accurate responses to WorldQA queries.
Extensive evaluations of 13 prominent LLMs and LMMs reveal that WorldRetriever, although the most effective model, achieves only 70% of human-level performance on multiple-choice questions. This finding highlights the necessity for further advancement in the reasoning and comprehension abilities of models. Our experiments also yield several key insights. For instance, while humans tend to perform better with more frames, current LMMs, including WorldRetriever, show diminished performance under similar conditions. We hope that WorldQA, our methodology, and these insights can contribute to the future development of multimodal world models.
URL: https://openreview.net/forum?id=Nu3lsLOWa9
---
Title: COSTAR: Improved Temporal Counterfactual Estimation with Self-Supervised Learning
Abstract: Estimation of temporal counterfactual outcomes from observed history is crucial for decision-making in many domains such as healthcare and e-commerce, particularly when randomized controlled trials (RCTs) suffer from high cost or impracticality. For real-world datasets, modeling time-dependent confounders is challenging due to complex dynamics, long-range dependencies and both past treatments and covariates affecting the future outcomes. In this paper, we introduce Counterfactual Self-Supervised Transformer (COSTAR), a novel approach that integrates self-supervised learning for improved historical representations. We propose a component-wise contrastive loss tailored for temporal treatment outcome observations and explain its effectiveness from the view of unsupervised domain adaptation. COSTAR yields superior performance in estimation accuracy and generalization to out-of-distribution data compared to existing models, as validated by empirical results on both synthetic and real-world datasets.
URL: https://openreview.net/forum?id=gut3EImjHX
---
Title: On the Role of MLP Layers in Transformer ICL with Categorical Outcomes
Abstract: We study in-context learning (ICL) with Transformers for categorical outputs $y_i$, a setting largely unexplored compared to research on real-valued $y_i$. While attention-only Transformers can, in principle, perform functional gradient descent (GD) inference for real-valued outputs, we show that categorical $y_i$ introduce a nonlinear interlayer computation. The MLP layers interleaved with attention in the standard Transformer are a natural architectural component to approximate this computation, providing a concrete role for MLPs that is absent in the real-valued setting. We characterize conditions under which attention-only models can nevertheless succeed: at early layers, when all positions share similar representations, and the softmax operates in its approximately linear regime. Our theory predicts that attention-only models should degrade at greater depth and under distribution mismatch between training and testing data -- predictions we confirm empirically across synthetic data, real-world image classification with domain shift, and surgical action triplet recognition. Guided by the analysis, we propose a sparse Transformer parameterization linked to functional GD that reduces trainable parameters by roughly $50\times$ relative to an unconstrained Transformer, with minimal performance degradation. This data efficiency proves to be particularly valuable in data-limited applications, which we demonstrate through the ICL analysis of human surgical procedures.
URL: https://openreview.net/forum?id=8P4V7V1cs4
---
Title: Trace Reconstruction with Language Models
Abstract: The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by deletions, insertions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of data retrieval. In this work, we propose TReconLM, a decoder-only transformer that solves trace reconstruction as a next-token prediction task. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep learning approaches, recovering a substantially higher fraction of sequences without error. We pretrain on synthetic data generated from a simple error model and fine-tune on real-world data to adapt to technology-specific error patterns.
URL: https://openreview.net/forum?id=k2zAyJAfnj
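The synthetic pretraining data comes from a simple error model; a minimal insertion/deletion/substitution channel might look like the sketch below. The per-position error mechanics and rates are assumptions for illustration, not the paper's exact model.

```python
import random

def trace(seq: str, p_del: float, p_ins: float, p_sub: float,
          rng: random.Random) -> str:
    # Corrupt a DNA sequence with independent insertions, deletions, and
    # substitutions (a toy IDS channel; real synthesis/sequencing errors
    # are technology-specific, which is why the paper fine-tunes on real data).
    alphabet = "ACGT"
    out = []
    for ch in seq:
        if rng.random() < p_ins:
            out.append(rng.choice(alphabet))  # random insertion before ch
        r = rng.random()
        if r < p_del:
            continue                          # deletion: drop the symbol
        if r < p_del + p_sub:
            out.append(rng.choice([c for c in alphabet if c != ch]))
        else:
            out.append(ch)                    # transmitted correctly
    return "".join(out)
```

Trace reconstruction then asks: given several independent corrupted copies `trace(seq, ...)`, recover `seq`; TReconLM frames this as next-token prediction over the concatenated traces.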
---
Title: When Does Multimodality Lead to Better Time Series Forecasting?
Abstract: Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 16 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Our findings reveal that the benefits of multimodality are highly condition-dependent. While we confirm reported gains in some settings, these improvements are not universal across datasets or models. To move beyond empirical observations, we disentangle the effects of model architectural properties and data characteristics, drawing data-agnostic insights that generalize across domains. Our findings highlight that on the modeling side, incorporating text information is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate aligning strategies. On the data side, performance gains are more likely when (4) sufficient training data is available and (5) the text offers complementary predictive signal beyond what is already captured from the time series alone. Our study offers a rigorous, quantitative foundation for understanding when multimodality can be expected to aid forecasting tasks, and reveals that its benefits are neither universal nor always aligned with intuition.
URL: https://openreview.net/forum?id=RggcWYWR3N
---
Title: CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization
Abstract: Recent advances in pre-trained vision-language models (VLMs), e.g., contrastive language-image pre-training (CLIP) methods, have shown great potential for learning out-of-distribution (OOD) representations. Despite showing competitive performance, prompt-based CLIP methods still suffer from: i) inaccurate text descriptions, which degrade accuracy and robustness and pose a challenge for zero-shot CLIP methods; and ii) limited vision-language embedding alignment, which significantly affects generalization performance. To tackle these issues, this paper proposes a novel Conditional Domain prompt Learning (CoDoL) method, which utilizes readily available domain information to form prompts and improves vision-language embedding alignment for better out-of-distribution (OOD) generalization. To capture both instance-specific and domain-specific information, we further propose a lightweight Domain Meta Network (DMN) to generate input-conditional tokens for images in each domain. Extensive experiments on four OOD benchmarks (PACS, VLCS, OfficeHome, and DigitDG) validate the effectiveness of our proposed CoDoL method in improving both vision-language embedding alignment and out-of-distribution generalization performance.
URL: https://openreview.net/forum?id=MDxhbeE21D
---
Title: Critical Video-Language Understanding via Query-Guided Frame Selection and Visual-Query Transformation
Abstract: Language-model-based video understanding has advanced rapidly, driven in large part by the emergence of LLMs. Nevertheless, existing research has primarily concentrated on projection mechanisms that convert video features into tokens, an approach that is both conceptually simplistic and practically underperforming. In this study, we propose VaQuitA, a new framework that more effectively unifies video representations and text inputs. At the data level, instead of sampling frames uniformly, we use a CLIP-score-based scheme to ensure the selected frames align more closely with the query. At the feature level, we introduce a tunable Video Perceiver and a Visual-Query Transformer (VQ-Former), which together improve the synergy between the input question and the video features. In addition, we show that adding a brief prompt, "Please be critical.", improves the LLM's ability to comprehend video content. Experiments across various benchmark datasets show that VaQuitA establishes a new state-of-the-art for zero-shot video question-answering, while also enabling high-quality multi-turn video-based dialogues with users. Code will be released.
URL: https://openreview.net/forum?id=0FW5VywmT4
---
Title: O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering
Abstract: Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain, up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-ended problems. Open-ended questions, which are characterized by lacking a standard answer or by admitting diverse, non-unique answers, remain underexplored. To bridge this gap, we present O2-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O2-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt its answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O2-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O2-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O2-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly sized models, while performing on par with much larger ones.
URL: https://openreview.net/forum?id=rbIKFKFEeU
---
Title: Task-Aware Linear Representations for Signal and Sensor Classification: Learning Interpretable Feature Rotations for Gradient Boosted Decision Trees
Abstract: Gradient Boosted Decision Trees (GBDTs) dominate tabular machine learning but suffer from a fundamental geometric limitation: axis-aligned splits cannot efficiently model decision boundaries diagonal to the feature axes. This limitation is especially pronounced for signal and sensor data, where physical measurements exhibit natural correlations that create oblique decision boundaries. We propose Task-Aware Linear Representations (TALR), a two-stage hybrid framework that learns a global orthogonal rotation via differentiable surrogate optimization, then applies this rotation to precondition data for standard GBDT training. Our recommended method, TALR-low_rank (Q = I + UV^T), achieves statistical parity with tuned XGBoost across 45 benchmarks (mean delta −0.11%, p = 0.61) while providing interpretable feature combinations via sparse rotation matrices (Effective Parent Features ≈ 1.0). On 19 signal/sensor datasets, low_rank shows a 57.9% win rate with positive mean improvement. A density guardrail (n/d < 2 ⇒ identity fallback) ensures TALR never degrades performance on ill-conditioned data. TALR significantly outperforms unsupervised RotationForest (81.4% win rate, p < 0.0001) and provides complementary interpretability to EBM (feature combinations vs. individual effects). The diversity of TALR's five rotation methods, each suited to different data geometries, yields a 62.2% oracle win rate (p = 0.001), motivating automatic method selection as future work.
URL: https://openreview.net/forum?id=xvGC6SvpRs
---
Title: Learning Fine-grained Parameter Sharing via Sparse Tensor Decomposition
Abstract: Large neural networks achieve state-of-the-art performance on many tasks, yet their sheer size hinders deployment on resource-constrained devices. Among existing compression approaches, parameter sharing remains relatively unexplored. In this paper, we introduce Fine-grained Parameter Sharing (FiPS), a unified compression framework that combines parameter sharing, tensor decomposition, and sparsity. FiPS compresses transformers by factorizing MLPs concatenated across layers into a shared low-rank basis with sparse, layer-specific projection matrices. Both components are initialized via singular value decomposition (SVD) and jointly optimized through block-wise reconstruction error minimization. FiPS compresses a variety of Vision Transformers (ViTs) and Large Language Models (LLMs) by 20–50% with negligible quality degradation. We further combine FiPS with Quantization-Aware Training (QAT) to achieve state-of-the-art compression on Gemma-2 models. These results establish fine-grained parameter sharing as a practical route to compact, high-performance transformer models.
URL: https://openreview.net/forum?id=vbS7Z8Zswe
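The SVD initialization of a shared basis with layer-specific coefficients can be sketched in dense form (the paper additionally imposes sparsity on the per-layer projections and refines both factors via block-wise reconstruction training, which this sketch omits):

```python
import numpy as np

def shared_basis_factorize(mats, rank):
    # Factorize layer matrices W_l (d x k) into a shared basis B (d x r)
    # and per-layer coefficients C_l (r x k), so W_l ≈ B @ C_l.
    # Initialization via SVD of the column-wise concatenation of all layers.
    concat = np.concatenate(mats, axis=1)            # d x (k * num_layers)
    u, s, vt = np.linalg.svd(concat, full_matrices=False)
    basis = u[:, :rank]                              # shared across layers
    coeffs = [basis.T @ w for w in mats]             # layer-specific factors
    return basis, coeffs
```

When the layers genuinely share a low-dimensional column space, the truncated basis reconstructs each matrix exactly; otherwise the truncation error motivates the joint optimization step described in the abstract.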
---
Title: Make Your LVLM KV Cache More Lightweight
Abstract: Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmarks, e.g., MME and SeedBench. Experimental results demonstrate that with only 50% of the original vision tokens, LightKV (a) halves KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
URL: https://openreview.net/forum?id=n77IeySrQl
---
Title: Dynamic guessing for Hamiltonian Monte Carlo with embedded numerical root-finding
Abstract: Thanks to scientific machine learning, it is possible to fit Bayesian statistical models whose parameters satisfy analytically intractable algebraic conditions like steady-state constraints. This is often done by embedding a differentiable numerical root-finder inside a gradient-based sampling algorithm like Hamiltonian Monte Carlo. However, computing and differentiating large numbers of numerical solutions comes at a high computational cost. We show that dynamically varying the starting guess within a Hamiltonian trajectory can improve performance. To choose a good guess we propose two heuristics: *guess-previous* reuses the previous solution as the guess and *guess-implicit* extrapolates the previous solution using implicit differentiation. We benchmark these heuristics on a range of representative models. We also present a JAX-based Python package providing easy access to a performant sampler augmented with dynamic guessing.
URL: https://openreview.net/forum?id=z4PfNDNAcN
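The guess-previous heuristic is easy to illustrate outside the HMC setting: as a parameter varies along a trajectory, warm-start each root-finding solve from the previous solution. This toy scalar Newton example is an assumption-laden sketch, not the package's implementation.

```python
def newton(f, df, x0, tol=1e-10, max_iter=50):
    # Scalar Newton's method; the iteration count serves as a cost proxy.
    x = x0
    for i in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x, i
        x -= fx / df(x)
    return x, max_iter

def solve_path(thetas, f, df, x_init):
    # Guess-previous: reuse the last root as the starting guess for the
    # next parameter value along the trajectory.
    guess, total_iters = x_init, 0
    for th in thetas:
        guess, n = newton(lambda x: f(x, th), lambda x: df(x, th), guess)
        total_iters += n
    return guess, total_iters
```

Because consecutive parameter values on a Hamiltonian trajectory are close, each warm-started solve begins near its root and converges in a handful of Newton steps; guess-implicit goes one step further by extrapolating the guess with implicit differentiation.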
---