Weekly TMLR digest for May 31, 2026

TMLR

May 31, 2026, 12:00:11 AM

to tmlr-annou...@googlegroups.com

New certifications
==================

J2C Certification: Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics

Hanming Yang, Antonio Khalil Moretti, Sebastian Macaluso, Philippe Chlenski, Christian A. Naesseth, Itsik Pe'er

https://openreview.net/forum?id=pCapRF2vFf

---

J2C Certification: Defending Against Unknown Corrupted Agents: Reinforcement Learning of Adversarially Robust Nash Equilibria

Andi Nika, Jonathan Nöther, Adish Singla, Goran Radanovic

https://openreview.net/forum?id=aggyMifxLQ

---

J2C Certification: PCNN: Probable-Class Nearest-Neighbor Explanations Improve Fine-Grained Image Classification Accuracy for AIs and Humans

Giang Nguyen, Valerie Chen, Mohammad Reza Taesiri, Anh Totti Nguyen

https://openreview.net/forum?id=OcFjqiJ98b

---

J2C Certification: Enhancing Vision-Language Model with Unmasked Token Alignment

Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li

https://openreview.net/forum?id=JkFEVbW6wE

---

J2C Certification: Assessing Robustness via Score-Based Adversarial Image Generation

Marcel Kollovieh, Lukas Gosch, Marten Lienen, Yan Scholten, Leo Schwinn, Stephan Günnemann

https://openreview.net/forum?id=7Oqb6zlGWl

---

J2C Certification: ODNet: Opinion Dynamics-Inspired Neural Message Passing for Graphs and Hypergraphs

Bingxin Zhou, Outongyi Lv, Jing Wang, Xiang Xiao, Weishu Zhao

https://openreview.net/forum?id=ytKFKoCpyK

---

Featured Certification, Reproducibility Certification, Survey Certification: Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

Ziyao Wang, Bingying Wang, Hanrong Zhang, Tingting Du, Tianyang Chen, Guoheng Sun, Yexiao He, Zheyu Shen, Wanghao Ye, Ang Li

https://openreview.net/forum?id=tAaWFpvnmm

---

Accepted papers
===============

Title: ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

Authors: Daniel Rho, Jun Myeong Choi, Biswadip Dey, Roni Sengupta

Abstract: Neural rendering has advanced significantly in 3D reconstruction and novel view synthesis, and integrating physics into these frameworks opens new applications such as physically accurate digital twins for robotics and XR.
However, the inverse problem of estimating physical parameters from visual observations remains challenging.
Existing physics-aware neural rendering methods typically require dense multi-view videos, making them impractical for scalable, real-world deployment.
Under sparse-view settings, the sequential optimization strategies employed by current approaches suffer from severe error accumulation: inaccuracies in initial 3D reconstruction propagate to subsequent stages, degrading physical state and material parameter estimates.
On the other hand, simultaneous optimization of all parameters fails due to the highly non-convex and often non-differentiable nature of the problem.
We propose ProJo4D, a progressive joint optimization framework that gradually expands the set of jointly optimized parameters. This design enables physics-informed gradients to refine geometry while avoiding the instability of direct joint optimization over all parameters.
Evaluations on synthetic and real-world datasets demonstrate that ProJo4D substantially outperforms prior work in 4D future state prediction and physical parameter estimation, achieving up to 10$\times$ improvement in geometric accuracy while maintaining computational efficiency.

URL: https://openreview.net/forum?id=pqvVrqlXCZ

---

Title: On Theoretical Identifiability of Binary Latent Causal Graphical Models

Authors: Seunghyun Lee, Yuqi Gu

Abstract: This paper considers a challenging problem of identifying a causal graphical model under the presence of latent variables. While various identifiability conditions have been proposed in the literature, they often require multiple pure children per latent variable or restrictions on the latent causal graph. Furthermore, it is common for all observed variables to exhibit the same modality. Consequently, the existing identifiability conditions are often too stringent for complex real-world data. We consider a general nonparametric measurement model with arbitrary observed variable types and binary latent variables, and propose a double triangular graphical condition that guarantees identifiability of the entire causal graphical model. The proposed condition significantly relaxes the popular pure children condition. We also establish necessary conditions for identifiability and provide valuable insights into fundamental limits of identifiability. Simulation studies verify that latent structures satisfying our conditions can be accurately estimated from data. We also illustrate the practicality of our conditions with a real data example.

URL: https://openreview.net/forum?id=KiiSlAsLuN

---

Title: Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection

Authors: Faizul Rakib Sayem, Shahana Ibrahim

Abstract: The reliability of artificial intelligence (AI) systems in open-world settings depends heavily on their ability to flag out-of-distribution (OOD) inputs unseen during training. Recent advances in large-scale vision-language models (VLMs) have enabled promising few-shot OOD detection frameworks using only a handful of in-distribution (ID) samples. However, existing prompt learning-based OOD methods largely overlook the geometry of the visual feature embeddings learned by VLMs whose structure is particularly informative for distinguishing ID from OOD data and holds rich representation capacity as they are pre-trained on millions of samples. To address this, we introduce a \textit{geometry-aware context optimization framework} that integrates subspace representation learning with prompt tuning. By projecting ID-relevant features into a subspace spanned by prompt vectors and simultaneously projecting ID-irrelevant components via orthogonal null-space projections, our approach strengthens the discriminative power of the learned prompt vectors, thereby leading to enhanced ID–OOD separability at test time. To enable an easy-to-handle, end-to-end learning under this framework, we design a geometry-regularized learning criterion that ensures strong OOD detection performance as well as high ID classification accuracy across settings. Moreover, the proposed framework can be seamlessly integrated with a wide range of existing context optimization methods, effectively complementing their softmax-based OOD detectors. Experiments on various real-world datasets showcase the effectiveness of our approach for reliable open-world AI systems.

URL: https://openreview.net/forum?id=TFG2gPjkiF

---

Title: Efficient Ensembling Improves Training Data Attribution

Authors: Junwei Deng, Ting-Wei Li, Shichang Zhang, Jiaqi W. Ma

Abstract: Training data attribution (TDA) methods aim to quantify the influence of individual training data points on the model predictions, with broad applications in data-centric AI, such as mislabel detection, data selection, and copyright compensation. However, existing methods in this field, which can be categorized as retraining-based and gradient-based, have struggled with the trade-off between computational efficiency and attribution efficacy. Retraining-based methods can accurately attribute complex non-convex models but are computationally prohibitive, while gradient-based methods are efficient but often fail for non-convex models. Recent research has shown that augmenting gradient-based methods with ensembles of multiple independently trained models can achieve significantly better attribution efficacy. However, this approach remains impractical for very large-scale applications.

In this work, we discover that expensive, fully independent training is unnecessary for ensembling the gradient-based methods, and we propose two efficient ensemble strategies, DROPOUT ENSEMBLE and LORA ENSEMBLE, as alternatives to the naive independent ensemble. These strategies significantly reduce training time (up to 80%), serving time (up to 60%), and space cost (up to 80%) while maintaining similar attribution efficacy to the naive independent ensemble. Our extensive experimental results demonstrate that the proposed strategies are effective across multiple TDA methods on diverse datasets and models, including various generative settings, significantly advancing the Pareto frontier of TDA methods with better computational efficiency and attribution efficacy. We conduct a theoretical analysis that provides insights into the success of our empirical findings.

URL: https://openreview.net/forum?id=4sSSs0fAp3

---

Title: What’s in the Bottle? A Survey and Roadmap of Concept Bottleneck Models

Authors: Patrick Knab, David Steinmann, Christian Bartelt, Kristian Kersting, Bernt Schiele, Thomas Seidl, Udo Schlegel, Wolfgang Stammer

Abstract: Concept Bottleneck Models (CBMs) are interpretable learning architectures that factor predictions through intermediate, ideally human-understandable concepts, enabling explicit and inspectable reasoning. Although CBM research has gained substantial momentum in recent years, this growth has also revealed numerous open challenges and a fragmented set of methodological choices. In this work, we systematically review the CBM literature, identify previously unidentified core components and challenges, and propose a unified taxonomy. Based on this taxonomy, we provide a detailed categorization of existing works. We hereby discuss current challenges for the CBM paradigm and outline important directions to extend it beyond its current scope. Overall, this survey aims to consolidate the CBM landscape, clarify open issues, and provide guidance for developing future models.

URL: https://openreview.net/forum?id=IF5vnqxBEW

---

Title: A Case for Vanilla SWD: New Perspectives on Informative Slices, Sliced-Wasserstein Distances, and Learning Rates

Authors: Huy Tran, Yikun Bai, Ashkan Shahbazi, John R. Hershey, David Hyde, Soheil Kolouri

Abstract: The practical applications of Wasserstein distances (WDs) are constrained by their sample and computational complexities. Sliced-Wasserstein distances (SWDs) provide a workaround by projecting distributions onto one-dimensional subspaces, leveraging the more efficient, closed-form WDs for 1D distributions. However, in high dimensions, most random projections become uninformative due to the concentration of measure phenomenon. Although several SWD variants have been proposed to focus on \textit{informative} slices, they often introduce additional complexity, numerical instability, and compromise desirable theoretical (metric) properties of SWD. Amid the growing literature that focuses on directly modifying the slicing distribution, we revisit the standard, "vanilla" Sliced-Wasserstein through an effective-subspace model and a rescaling view of slice informativeness. We show that, with an effective-subspace-aligned notion of slice informativeness, reweighting all individual slices simplifies in expectation to a single global scaling factor relating ambient-space SWD to effective-subspace SWD. For GD/SGD-style first-order optimization, the same factor appears as a step-size calibration effect. We perform extensive experiments across various machine learning tasks showing that vanilla SWD, when properly calibrated, can often match or surpass the performance of more complex variants while retaining its simplicity and metric structure.

URL: https://openreview.net/forum?id=li8D5pxczd
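
For readers unfamiliar with the quantity in the abstract above, the following is a minimal sketch of a Monte Carlo estimate of the squared sliced-Wasserstein-2 distance between two equal-size point clouds. It is not the authors' code. The function names, the number of projections, and the toy data are all illustrative assumptions. The toy case is a pure translation confined to one coordinate of a d-dimensional space, for which the vanilla estimate concentrates near |v|^2 / d. This is a textbook special case of an ambient-versus-subspace scaling, and it is not claimed to be the exact factor derived in the paper.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein_sq(xs, ys, n_projections=500, seed=0):
    """Monte Carlo estimate of squared SW_2 between equal-size point sets.

    Each slice projects both sets onto a random direction; the 1D squared
    Wasserstein-2 distance is then the mean squared gap between the sorted
    projections.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        px = sorted(sum(a * t for a, t in zip(x, theta)) for x in xs)
        py = sorted(sum(a * t for a, t in zip(y, theta)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections


if __name__ == "__main__":
    rng = random.Random(1)
    dim = 20
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(30)]
    shift = [3.0] + [0.0] * (dim - 1)  # translation lives in a 1D subspace
    ys = [[a + s for a, s in zip(x, shift)] for x in xs]
    estimate = sliced_wasserstein_sq(xs, ys, n_projections=1000)
    print("estimate:", estimate, "reference |v|^2 / d:", 9.0 / dim)
```

In this toy setting most random directions are nearly orthogonal to the shift, which is why the ambient-space value is much smaller than the squared shift length of 9.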

---

Title: When Active Learning Meets Graph Similarity: Evidential Variance for Graph Selection

Authors: Chengtai Cao, Haoyu Yang, Shenglin Wang, Xinglin Lian, Fan Zhou

Abstract: Graph Similarity Learning (GSL) is pivotal in graph data mining, yet training effective models necessitates substantial labeled pairs, which incur prohibitive annotation costs. To address this, we introduce Active Learning (AL) into the GSL paradigm. However, directly transferring existing AL strategies is non-trivial due to two unique impediments: (1) the continuous regression nature of similarity prediction complicates standard uncertainty quantification, and (2) the paired-input structure requires evaluating a graph's informational value across its pairings rather than in isolation. To bridge this gap, we propose EVGS (Evidential Variance for Graph Selection), a novel AL framework tailored for GSL. EVGS leverages evidential deep learning to impose a prior over predictions, enabling disentangled uncertainty estimation. Crucially, we identify a ``gradient shrinkage'' pathology inherent to the data-scarce regime characteristic of AL cycles. We introduce a novel MSE-anchored regularizer to mitigate this issue, ensuring discriminative uncertainty estimation even with limited labels. Furthermore, to address the paired-input challenge, we propose a graph-centric selection criterion: uncertainty variance. This metric captures a graph's holistic informational value by measuring fluctuations in its epistemic uncertainty across diverse interactions. Extensive experiments on three benchmarks with two GSL backbones demonstrate that EVGS consistently outperforms established AL baselines.

URL: https://openreview.net/forum?id=dV6UopxOjX

---

Title: Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems

Authors: Gabriel Mancino-Ball, Yangyang Xu

Abstract: In this paper, we consider the decentralized, stochastic nonconvex strongly-concave (NCSC) minimax problem with nonsmooth regularization terms on both primal and dual variables, wherein a network of $m$ computing agents collaborate via peer-to-peer communications. We consider when the coupling function is in expectation or finite-sum form and the double regularizers are convex functions, applied separately to the primal and dual variables. Our algorithmic framework introduces a Lagrangian multiplier to eliminate the consensus constraint on the dual variable. Coupling this with variance-reduction (VR) techniques, our proposed method, entitled \texttt{VRLM}, by a single neighbor communication per iteration, is able to achieve an $\mathcal{O}(\kappa^3\varepsilon^{-3})$ sample complexity under the general stochastic setting, with either a big-batch or small-batch VR option, where $\kappa$ is the condition number of the problem and $\varepsilon$ is the desired solution accuracy. With a big-batch VR, we can additionally achieve $\mathcal{O}(\kappa^2\varepsilon^{-2})$ communication complexity. Under the special finite-sum setting, our method with a big-batch VR can achieve an $\mathcal{O}(n + \sqrt{n} \kappa^2\varepsilon^{-2})$ sample complexity and $\mathcal{O}(\kappa^2\varepsilon^{-2})$ communication complexity, where $n$ is the number of components in the finite sum. All complexity results match the best-known results achieved by a few existing methods for solving special cases of the problem we consider. To the best of our knowledge, this is the first work which provides convergence guarantees for NCSC minimax problems with general convex nonsmooth regularizers applied to both the primal and dual variables in the decentralized stochastic setting. Numerical experiments are conducted on two machine learning problems.

URL: https://openreview.net/forum?id=t1Nj3VTNzQ

---

Title: Efficient and Programmable Exploration of Synthesizable Chemical Space

Authors: Shitong Luo, Connor W. Coley

Abstract: The constrained nature of synthesizable chemical space poses a significant challenge for sampling molecules that are both synthetically accessible and possess desired properties. In this work, we present PrexSyn, an efficient and programmable model for molecular discovery within synthesizable chemical space. PrexSyn is based on a decoder-only transformer trained on a billion-scale datastream of synthesizable pathways paired with molecular properties, enabled by a real-time, high-throughput C++-based data generation engine. The large-scale training data allows PrexSyn to reconstruct the synthesizable chemical space nearly perfectly at a high inference speed and learn the association between properties and synthesizable molecules. Based on its learned property-pathway mappings, PrexSyn can generate synthesizable molecules that satisfy not only single-property conditions but also composite property queries joined by logical operators, thereby allowing users to ``program'' generation objectives. Moreover, by exploiting this property-based querying capability, PrexSyn can efficiently optimize molecules against black-box oracle functions via iterative query refinement, achieving higher sampling efficiency than even synthesis-agnostic baselines, making PrexSyn a powerful general-purpose molecular optimization tool. Overall, PrexSyn pushes the frontier of synthesizable molecular design by setting a new state of the art in synthesizable chemical space coverage, molecular sampling efficiency, and inference speed.

URL: https://openreview.net/forum?id=xDlIer2UnI

---

Title: Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation

Authors: Zhichao Xu, Zongyu Wu, Yun Zhou, Aosong Feng, Kang Zhou, Sangmin Woo, Kiran Ramnath, Yijun Tian, Xuan Qi, Weikang Qiu, Lin Lee Cheong, Haibo Ding

Abstract: Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent work has begun training LLMs to dynamically plan, query, and reason with search engines as tools --- a paradigm increasingly referred to as agentic search. Although these methods achieve performance improvement across popular short-form QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for agentic search, covering three distinct faithfulness metrics: Think-Search faithfulness, Information-Think faithfulness, and Think-Answer faithfulness. Our evaluations reveal that canonical agentic search systems trained through Reinforcement Learning from Verifiable Reward (RLVR) using episode-level outcome-based reward --- including Search-R1 and ReSearch --- have significant room for improvement on these faithfulness dimensions. To foster faithful reasoning in agentic search, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained turn-level faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve better task performance compared to baselines trained against episode-level outcome-based reward.

URL: https://openreview.net/forum?id=mZ0gGlXelF

---

Title: Automaton Distillation: Neuro-Symbolic Transfer Learning for Deep Reinforcement Learning

Authors: Precious Nwaorgu, Suraj Singireddy, Andre Beckus, Aden McKinney, Mahyar Alinejad, Chinwendu Enyioha, Sumit Kumar Jha, Alvaro Velasquez, George K. Atia

Abstract: Reinforcement learning (RL) agents often struggle to reuse knowledge when task dynamics change, even when the underlying objective remains the same. This sample inefficiency is compounded by poor generalization beyond the training distribution. We introduce automaton distillation, a neuro-symbolic transfer learning approach that addresses both challenges by distilling Q-value estimates from a teacher agent into a compact automaton representation of the shared task objective. Critically, our method requires no explicit alignment between source and target state-action spaces: the automaton serves as a domain-agnostic intermediary through which value information is transferred. We propose two variants. Static transfer performs value iteration over the abstract MDP induced by the automaton, providing a lightweight initialization. Dynamic transfer distills empirical Q-values from a teacher’s replay buffer onto automaton transitions, grounding symbolic abstractions in actual environment dynamics and correcting for mismatches between automaton trace length and true trajectory cost. We evaluate both variants on discrete and continuous gridworld tasks with sparse, non-Markovian rewards, and on a continuous benchmark. These results demonstrate that a shared symbolic objective is a sufficient bridge for effective few-shot transfer, even when source and target environments differ substantially in dynamics.

URL: https://openreview.net/forum?id=Tyxmx2vNDb

---

Title: MIST: Mutual Information Estimation via Supervised Training

Authors: German Gritsai, Megan Richards, Maxime Méloux, Kyunghyun Cho, Maxime Peyrard

Abstract: We propose a fully data-driven approach to designing mutual information (MI) estimators. Since any MI estimator is a function of the observed sample from two random variables, we parameterize this function with a neural network (MIST) and train it end-to-end to predict MI values. Training is performed on a large meta-dataset of 625,000 synthetic joint distributions with known ground-truth MI. To handle variable sample sizes and dimensions, we employ a two-dimensional attention scheme ensuring permutation invariance across input samples. To quantify uncertainty, we optimize a quantile regression loss, enabling the estimator to approximate the sampling distribution of MI rather than return a single point estimate. This research program departs from prior work by taking a fully empirical route, trading universal theoretical guarantees for flexibility and efficiency. Empirically, the learned estimators largely outperform classical baselines across sample sizes and dimensions, including on joint distributions unseen during training. The resulting quantile-based intervals are well-calibrated and more reliable than bootstrap-based confidence intervals, while inference is orders of magnitude faster than existing neural baselines.

URL: https://openreview.net/forum?id=Qi4JgS2PLw
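
The abstract above mentions optimizing a quantile regression loss. As a hedged illustration, here is the standard pinball loss and a check that the constant minimizing it on a sample is an empirical quantile. This is a generic sketch, not the MIST training code. The helper names and the toy sample are assumptions made for the example.

```python
def pinball_loss(y_true, y_pred, q):
    """Standard pinball (quantile) loss for a single target at level q."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball(samples, y_pred, q):
    """Average pinball loss of one constant prediction over a sample."""
    return sum(pinball_loss(y, y_pred, q) for y in samples) / len(samples)


def best_constant(samples, q, grid):
    """Grid search for the constant prediction with the lowest mean loss."""
    return min(grid, key=lambda c: mean_pinball(samples, c, q))


if __name__ == "__main__":
    samples = list(range(1, 100))  # toy data: the integers 1..99
    grid = [float(v) for v in samples]
    print("q=0.5 minimizer:", best_constant(samples, 0.5, grid))
    print("q=0.9 minimizer:", best_constant(samples, 0.9, grid))
```

Training one output per quantile level with this loss yields an interval rather than a single point estimate, which is the general idea behind quantile-based uncertainty estimates.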

---

Title: Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

Authors: Yuwen Tan, Boqing Gong

Abstract: Machine unlearning removes certain training data points and their influence from AI models (e.g., when a data owner revokes their consent to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., who have no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points does. We further discuss the nontrivial challenges in the knowledge-tracing machine unlearning paradigm. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm. Code is available at: https://1yuwen.github.io/Knowledge-Tracing-MU-Page.

URL: https://openreview.net/forum?id=ScvUCNMdYN

---

Title: Maximum Mean Discrepancy with Unequal Sample Sizes via Generalized U-Statistics

Authors: Aaron Wei, Milad Jalali, Danica J. Sutherland

Abstract: Existing two-sample testing techniques, particularly those based on choosing a kernel for the Maximum Mean Discrepancy (MMD), often assume equal sample sizes from the two distributions. Applying these methods in practice can require discarding valuable data, unnecessarily reducing test power. We address this long-standing limitation by extending the theory of generalized U-statistics and applying it to the usual MMD estimator, resulting in a new characterization of the asymptotic distributions of the MMD estimator with unequal sample sizes (particularly outside the proportional regimes required by previous partial results). This generalization also provides a new criterion for optimizing the power of an MMD test with unequal sample sizes. Our approach preserves all available data, enhancing test accuracy and applicability in realistic settings. Along the way, we give much cleaner characterizations of the variance of MMD estimators, revealing something that might be surprising to those in the area: while zero MMD implies a degenerate estimator, it is sometimes possible to have a degenerate estimator with nonzero MMD as well. We give a construction of such a case, and a proof that it does not happen in common situations.

URL: https://openreview.net/forum?id=KjXW75GHHF
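
To make "the usual MMD estimator" with unequal sample sizes concrete, here is a small sketch of the standard unbiased U-statistic estimate of squared MMD for samples of sizes m and n. It illustrates the estimator only. The asymptotic theory and the power criterion from the paper are not reproduced. The Gaussian kernel, its bandwidth, and the toy data are assumptions chosen for the example.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel on scalars; the bandwidth is an illustrative choice."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Unbiased estimate of squared MMD; len(xs) and len(ys) may differ."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("need at least two points in each sample")
    # Within-sample terms skip the diagonal (i == j) to stay unbiased.
    kxx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    kyy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    kxy = sum(kernel(x, y) for x in xs for y in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2.0 * kxy / (m * n)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(0.0, 1.0) for _ in range(80)]        # m = 80
    ys_same = [rng.gauss(0.0, 1.0) for _ in range(50)]   # n = 50, same law
    ys_shift = [rng.gauss(4.0, 1.0) for _ in range(50)]  # n = 50, shifted mean
    print("same distribution:", mmd2_unbiased(xs, ys_same))
    print("shifted distribution:", mmd2_unbiased(xs, ys_shift))
```

Because the estimate is unbiased, it can come out slightly negative when the two samples share a distribution. No data is discarded to equalize m and n.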

---

Title: Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

Authors: Sze-Ann Chen, Zhi-Yi Chin, Kui-Yuan Chen, Chi-Yu Li, Ping-Chun Hsieh

Abstract: Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35\% to 53\% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at \url{https://github.com/rl-bandits-lab/RL-Backdoor}.

URL: https://openreview.net/forum?id=ZKhKxqwuPu

---

Title: When VMP Meets CEP: An Algorithmic Equivalence Under Mild Conditions

Authors: Siyuan Li, Shikai Fang, Lei Cheng, Yik-Chung WU, Sergios Theodoridis

Abstract: Approximate Bayesian inference (ABI) methods have become indispensable tools in modern machine learning and statistics for approximating intractable posterior distributions. Despite extensive studies, the theoretical connections among different ABI methods have remained relatively unexplored. This paper establishes an algorithmic equivalence between two widely employed ABI techniques, namely variational message passing (VMP) and conditional expectation propagation (CEP). Through rigorous mathematical analysis, we demonstrate that these two approaches, despite originating from different perspectives (variational inference and expectation propagation, respectively), yield the same update equations under mild conditions, from both optimization and graphical model viewpoints. As a direct consequence, we establish a convergence guarantee for CEP and show that VMP-derived algorithms can inherit streaming variants without additional derivation effort. To validate our theoretical findings, we apply both VMP and CEP to Bayesian tensor decomposition and verify that they produce identical updates, demonstrating how the equivalence provides a principled route to a streaming variant.

URL: https://openreview.net/forum?id=QdO4VrnNfb

---

Title: UFO2: The Desktop AgentOS

Authors: Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Bo Qiao, Chao Du, Liqun Li, Yu Kang, Paul Jiang, Suzhen Zheng, Rujia Wang, Jiaxu Qian, Minghua Ma, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Abstract: Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution.

We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgents equipped with native APIs, domain-specific knowledge, and a unified GUI--API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi-action planning, reducing per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference.

We evaluate UFO2 across over 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.

The source code of UFO2 is publicly available at https://github.com/microsoft/UFO/, with comprehensive documentation provided at https://microsoft.github.io/UFO/.

URL: https://openreview.net/forum?id=iAuZVWCduc

---

Title: Post-Training Augmentation Invariance

Authors: Keenan Eikenberry, Lizuo Liu, Yoonsang Lee

Abstract: This work develops a framework for post-training augmentation invariance, in which our goal is to add invariance properties to a pretrained network without altering its behavior on the original, non-augmented input distribution. We define this notion precisely and additionally introduce augmented encoders, which are probabilistic encoders that formalize augmentation-based encoding processes and that serve as our fundamental object of study. We introduce two losses for augmented encoders, namely, Markov-Wasserstein minimization and Wasserstein correlation maximization, and we demonstrate empirically that both losses can be used to train lightweight, one-hidden-layer MLP adapter networks $E_{\theta}$ that, when appended to the latent space of a pretrained network $F$, do indeed lead to (approximate) post-training augmentation invariance. For example, on STL10 with $F=\text{DINO}$ features, the composite network $C\circ E_{\theta}\circ F$, where $C$ is a linear classifier and where $E_{\theta}$ is one of our proposed adapter networks, achieves $94\%$ classification accuracy on arbitrarily rotated images, whereas a network of the form $C\circ F$ without the adapter $E_{\theta}$ drops to $71\%$ accuracy. Similarly, we can boost noise-invariant classification results from $58\%$ up to $86\%$. Significantly, we obtain these results with no fine-tuning (the weights of $F$ remain frozen throughout), and our methods introduce little corruption to the original features, since $E_{\theta}$ acts nearly isometrically on the non-augmented latent distribution. In contrast, we show that adapter networks trained with alternative candidate losses, specifically SimCLR and HSIC maximization, produce uncompetitive classification results and fundamentally corrupt the original latent space. Code available at \url{https://github.com/keenan-eikenberry/augmentation_invariance}.

URL: https://openreview.net/forum?id=Z4uUwU6zRe

---

Title: Local MDI+: Local Feature Importances for Tree-Based Models

Authors: Zhongyuan Liang, Zachary T. Rewolinski, Abhineet Agarwal, Tiffany Tang, Bin Yu

Abstract: Tree-based ensembles such as random forests remain the go-to for tabular data over deep learning models due to their prediction performance and computational efficiency. These advantages have led to their widespread deployment in high-stakes domains, where interpretability is essential for ensuring trustworthy predictions. This has motivated the development of popular local (i.e. sample-specific) feature importance (LFI) methods such as LIME and TreeSHAP. However, these approaches rely on approximations that ignore the model’s internal structure and instead depend on potentially unstable perturbations. These issues are addressed in the global setting by MDI+, a global feature importance method which combines tree-based and linear feature importances by exploiting an equivalence between decision trees and least squares on a transformed node basis. However, the global MDI+ scores are not able to explain predictions when faced with heterogeneous individual characteristics. To address this gap, we propose Local MDI+ (LMDI+), a novel extension of the MDI+ framework that quantifies feature importances for each particular sample. Across twelve real-world benchmark datasets, LMDI+ outperforms existing baselines at identifying instance-specific predictive features, yielding an average 10% improvement in predictive performance when using only the selected features. It further demonstrates greater stability by consistently producing similar instance-level feature importance rankings across repeated model fits with different random seeds. Ablation experiments show that each component of LMDI+ contributes to these gains, and that the improvements extend beyond random forests to gradient boosting models. Finally, we show that LMDI+ enables local interpretability use cases by identifying closely matched counterfactuals for each classification benchmark and discovering homogeneous subgroups in a case study using a commonly-used housing dataset.

URL: https://openreview.net/forum?id=TcXidnGHpA

---

Title: Cluster-Dags as Powerful Background Knowledge For Causal Discovery

Authors: Jan Marco Ruiz de Vargas, Kirtan Padh, Niki Kilbertus

Abstract: Finding cause-effect relationships is of key importance in science. Causal discovery aims to recover a graph from data that succinctly describes these cause-effect relationships. However, current methods face several challenges, especially when dealing with high-dimensional data and complex dependencies. Incorporating prior knowledge about the system can aid causal discovery. In this work, we leverage Cluster-DAGs as a prior knowledge framework to warm-start causal discovery. We show that Cluster-DAGs offer greater flexibility than existing approaches based on tiered background knowledge and introduce two modified constraint-based algorithms, Cluster-PC and Cluster-FCI, for causal discovery in the fully and partially observed setting, respectively. Empirical evaluation on simulated data demonstrates that Cluster-PC and Cluster-FCI outperform their respective baselines without prior knowledge.

URL: https://openreview.net/forum?id=gSSmvVDKxB

---

Title: Stretched Exponential Convergence of (Stochastic) Gradient Descent for Separable Logistic Regression

Authors: Sacchit Kale, Piyushi Manupriya, Pierre Marion, Francis Bach, Anant Raj

Abstract: Gradient descent and stochastic gradient descent are central to modern machine learning, yet their behavior under large step sizes remains theoretically unclear. Recent work suggests that acceleration often arises near the edge of stability, where optimization trajectories become unstable and difficult to analyze. Existing results for separable logistic regression achieve faster convergence by explicitly leveraging such unstable regimes through constant or adaptive large step sizes. In this paper, we show that instability is not inherent to acceleration. We prove that gradient descent with a simple, non-adaptive increasing step-size schedule achieves stretched exponential convergence for separable logistic regression under a margin condition, while remaining entirely within a stable optimization regime. The resulting method is anytime and does not require prior knowledge of the optimization horizon or target accuracy. We also establish stretched exponential convergence of stochastic gradient descent using a lightweight adaptive step-size rule that avoids line search and specialized procedures, improving upon existing polynomial-rate guarantees. Together, our results demonstrate that carefully structured step-size growth alone suffices to obtain stretched exponential acceleration for both gradient descent and stochastic gradient descent.

URL: https://openreview.net/forum?id=R5OaFwCmS0

---

Title: Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

Authors: Yuheng Lei, Sitong Mao, Shunbo Zhou, Hongyuan Zhang, Xuelong Li, Ping Luo

Abstract: A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.

URL: https://openreview.net/forum?id=MHVBrjS8cG

---

Title: Decoupling Planning from Control: Stable Hierarchical RL with a Learned Metric Space

Authors: Sho Mitsuhashi, Shin Ishii

Abstract: Hierarchical Reinforcement Learning (HRL) offers a promising framework for solving complex, long-horizon tasks by decomposing them into manageable subproblems. However, conventional HRL methods suffer from a critical non-stationarity problem: the high-level planner's learning process is destabilized because the low-level policy is concurrently learning and constantly changing. This issue is particularly severe in resource-constrained systems, such as edge-cloud robotics, where the low-level controller must be a computationally simple, low-capacity model.
To address this challenge, we propose a novel HRL framework that mitigates the non-stationarity issue by decoupling high-level planning from low-level control. The core of our approach is to reframe the planner's task: instead of learning the planner via RL on non-stationary transitions, it learns to navigate a learned "map" of the environment. This map is represented by a critic network trained to function as a metric space, where distances reflect approximate travel costs. Planning is then simplified to finding optimal subgoals that lie along the shortest path (geodesic) between the current state and the final goal. To further encourage geometric consistency in the learned map, we introduce a trajectory regularization loss based on the agent's experienced trajectories.
Experiments demonstrate that our decoupled framework is highly robust. In scenarios with resource-constrained low-level policies, our method learns to solve complex tasks effectively where standard approaches fail. This result highlights our framework's suitability for real-world systems where low-level controllers have inherently limited computational capacity.
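
As a rough illustration of the planning step described above (not the authors' implementation), the sketch below picks a subgoal that lies close to the shortest path between the current state and the goal. It relies on the fact that a point c is on a geodesic exactly when d(s, c) + d(c, g) = d(s, g). The Euclidean `euclidean` function stands in for the learned critic, and the candidate set and the `reach` budget are hypothetical parameters introduced only for this example.

```python
import numpy as np

def euclidean(a, b):
    # Stand-in for a learned metric; the paper uses a critic network instead.
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def pick_subgoal(state, goal, candidates, dist=euclidean, reach=3.0):
    """Return the reachable candidate with the smallest detour cost.

    A candidate is reachable if dist(state, c) <= reach (hypothetical budget).
    Detour cost is dist(state, c) + dist(c, goal); ties are broken in favour
    of the candidate closer to the goal. Returns None if nothing is reachable.
    """
    best, best_key = None, None
    for c in candidates:
        to_c = dist(state, c)
        if to_c > reach:
            continue
        to_goal = dist(c, goal)
        key = (to_c + to_goal, to_goal)
        if best_key is None or key < best_key:
            best, best_key = c, key
    return best
```

With state (0, 0), goal (10, 0) and a reach of 3, a candidate at (2, 0) is preferred over one at (0, 2), because only the former lies on the straight line to the goal.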

URL: https://openreview.net/forum?id=Kmtlv8X0BN

---

Title: Explicit Second-Order Min-Max Optimization: Practical Algorithms and Complexity Analysis

Authors: Tianyi Lin, Panayotis Mertikopoulos, Michael I. Jordan

Abstract: We propose and analyze several inexact regularized Newton-type methods for finding a global saddle point of \emph{convex-concave} unconstrained min-max optimization problems. Compared to first-order methods, our understanding of second-order methods for min-max optimization is relatively limited, as obtaining global rates of convergence with second-order information can be much more involved. In this paper, we examine how second-order information is used to speed up extra-gradient methods, even under inexactness. In particular, we show that the proposed methods generate iterates that remain within a bounded set and that the averaged iterates converge to an $\epsilon$-saddle point within $O(\epsilon^{-2/3})$ iterations in terms of a restricted gap function. We also provide a simple routine for solving the subproblem at each iteration, requiring a single Schur decomposition and $O(\log\log(1/\epsilon))$ calls to a linear system solver in a quasi-upper-triangular system. Thus, our method improves the existing line-search-based second-order min-max optimization methods~\citep{Monteiro-2012-Iteration, Bullins-2022-Higher, Jiang-2025-Generalized} by shaving off an $O(\log\log(1/\epsilon))$ factor in the required number of Schur decompositions. Finally, we evaluate our method on both synthetic benchmarks and a real-world application arising from AUC maximization on standard LIBSVM datasets, and find that the proposed second-order approach delivers stronger practical efficiency than representative first-order methods on these problems.

URL: https://openreview.net/forum?id=Hyk1GhEXGa

---

Title: Unified Semantic Transformer for 3D Scene Understanding

Authors: Sebastian Koch, Johanna Wald, Hidenobu Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari

Abstract: Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D dense semantic indoor tasks within a single model. Our model is trained in a fully end-to-end manner, operates on unseen scenes, and takes only a couple of seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple dense semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, relying heavily on self-supervision, and novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different dense indoor semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground truth 3D geometry.

URL: https://openreview.net/forum?id=eB7oHCJzud

---

Title: Personalization Toolkit: Training Free Personalization of Large Vision Language Models

Authors: Soroush Seifi, Vaggelis Dorovatas, Matteo Cassinelli, Fabien Despinoy, Daniel Olmeda Reino, Rahaf Aljundi

Abstract: Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users or object instances and to generate contextually tailored responses. Existing approaches rely on time-consuming training for each item, making them impractical for real-world deployment, as reflected in current personalization benchmarks limited to object-centric single-concept evaluations.
In this paper, we present a novel training-free approach to LVLM personalization called Personalization Toolkit. We introduce a comprehensive, real-world benchmark designed to rigorously evaluate various aspects of the personalization task. Personalization Toolkit leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.

URL: https://openreview.net/forum?id=5mbn3B0O29

---

Title: Soft Preference Optimization: Aligning Language Models to Expert Distributions

Authors: Arsalan Sharifnassab, Saber Salehkaleybar, Dale Schuurmans

Abstract: Preference optimization methods such as DPO often yield aligned models that are overly deterministic, reducing output diversity and increasing the risk of mode collapse. This can limit downstream applications that benefit from multiple plausible outputs, such as reasoning and search. We propose Soft Preference Optimization (SPO), a reward-model-free algorithm that controls entropy of the aligned model through a ``softness'' parameter. SPO minimizes a preference-based loss together with a global KL regularization term, which helps prevent unwanted distribution shifts outside the preference dataset. While the method does not rely on any reward model assumption, we provide theoretical guarantees that under a Bradley–Terry assumption, it converges to a softmax distribution over the expert rewards. We present the methodology, theoretical analysis, and comparative advantages in alignment precision and output diversity.

URL: https://openreview.net/forum?id=EUPIcAkrSR

---

Title: MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Authors: Hang Hua, Yolo Y. Tang, Ziyun Zeng, Liangliang Cao, Yang Zhengyuan, Hangfeng He, Chenliang Xu, Jiebo Luo

Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs’ superior capabilities, researchers lack a comprehensive understanding of their compositionality – the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs’ compositionality. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o’s compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training.

URL: https://openreview.net/forum?id=aWO15tpSH8

---

Title: Some Robustness Properties of Label Cleaning

Authors: Chen Cheng, John Duchi

Abstract: We demonstrate that learning procedures that rely on aggregated labels, e.g., label information distilled from noisy responses, enjoy robustness properties impossible without data cleaning. This robustness appears in several ways. In the context of risk consistency---when one takes the standard approach in machine learning of minimizing a surrogate (typically convex) loss in place of a desired task loss (such as the zero-one mis-classification error)---procedures using label aggregation obtain stronger consistency guarantees than those even possible using raw labels. And while classical statistical scenarios of fitting perfectly-specified models suggest that incorporating all possible information---modeling uncertainty in labels---is statistically efficient, consistency fails for ``standard'' approaches as soon as a loss to be minimized is even slightly mis-specified. Yet procedures leveraging aggregated information still converge to optimal classifiers, highlighting how incorporating a fuller view of the data analysis pipeline, from collection to model-fitting to prediction time, can yield a more robust methodology by refining noisy signals.

URL: https://openreview.net/forum?id=O2ORErbcBy

---

Title: The Out-of-sample Extensions of t-SNE: From Gradient Descent to Fixed-point Iteration Algorithms

Authors: Paul Honeine

Abstract: This paper addresses the out-of-sample extension of the t-distributed stochastic neighbor embedding (t-SNE), namely extending the embedding to other data that were not considered in the training of the t-SNE. We demonstrate the ease of deriving the out-of-sample extension of t-SNE, thanks to the proper nature of t-SNE, namely without any auxiliary model. Several resolution strategies are devised, from gradient descent to fixed-point iteration algorithms. Moreover, we establish several theoretical findings that help explain the underlying optimization mechanism of the fixed-point iteration, including connections with the mean shift algorithm and the resolution of the pre-image problem in machine learning. Experimental results on three well-known real data sets show the relevance and efficiency of the proposed out-of-sample methods, with the repulsion-free fixed-point iteration outperforming the other methods.
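
To make the fixed-point idea concrete, here is a minimal sketch under stated assumptions; it is one plausible reading of a "repulsion-free" update, not the paper's algorithm. It assumes the new point has affinities `p` to the training points and that only the attractive part of the t-SNE cost, $\sum_j p_j \log(1+\|y-y_j\|^2)$, is kept. Setting its gradient to zero gives the mean-shift-like update $y \leftarrow \sum_j p_j w_j y_j / \sum_j p_j w_j$ with $w_j = 1/(1+\|y-y_j\|^2)$. The function name and parameters are hypothetical.

```python
import numpy as np

def repulsion_free_fixed_point(Y, p, n_iter=100, tol=1e-8):
    """Embed one new point given trained embeddings Y (n, d) and affinities p (n,).

    Assumption: only the attractive term of the cost is used, so each step
    is a weighted mean of the training embeddings.
    """
    Y = np.asarray(Y, dtype=float)
    p = np.asarray(p, dtype=float)
    y = p @ Y / p.sum()  # start from the affinity-weighted mean
    for _ in range(n_iter):
        # Student-t kernel weights between the current estimate and each y_j
        w = 1.0 / (1.0 + np.sum((Y - y) ** 2, axis=1))
        coef = p * w
        y_new = coef @ Y / coef.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```

Because every iterate is a convex combination of the rows of `Y`, the new point always stays inside the convex hull of the training embeddings in this simplified variant.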

URL: https://openreview.net/forum?id=kYwq49F8Gt

---

Title: ChromaFormer: A Scalable and Accurate Transformer Architecture for Land Cover Classification

Authors: Mingshi Li, Dusan Grujicic, Ben Somers, Stien Heremans, Steven De Saeger, Matthew B. Blaschko

Abstract: Remote sensing satellites such as Sentinel-2 provide high-resolution, multi-spectral imagery that enables dense, large-scale land cover classification. However, most deep learning models used in this domain—typically CNN-based architectures—are limited in their ability to process high-dimensional spectral data and scale with increasing dataset sizes. Moreover, while transformer architectures have recently been introduced for remote sensing tasks, their performance on large, densely labeled multi-spectral datasets remains underexplored.

In this paper, we present ChromaFormer, a scalable family of multi-spectral transformer models designed for large-scale land cover classification. We introduce a novel Spectral Dependency Module (SDM) that explicitly learns inter-band relationships through attention across spectral channels, enabling efficient spectral-spatial feature fusion. Our models are evaluated on the Biological Valuation Map (BVM) of Flanders, a large, densely labeled dataset spanning over 13,500 km² and 14 classes. ChromaFormer models achieve substantial accuracy gains over baselines: while a 23M-parameter UNet++ achieves less than 70% accuracy, a 655M-parameter ChromaFormer attains over 96% accuracy. We also analyze performance scaling trends and demonstrate generalization to standard benchmarks. Our results highlight the effectiveness of combining scalable transformer architectures with explicit spectral modeling for next-generation remote sensing tasks.

URL: https://openreview.net/forum?id=qzJVTJYEBc

---

Title: Dynamic Regret with Untrusted Decision Predictions via Heterogeneous Expert Aggregation

Authors: Wentao Zhang

Abstract: We study online convex optimization with dynamic regret, where the learner has access to untrusted decision predictions about the per-round minimizers. Existing methods either exploit only gradient feedback, achieving $O(\sqrt{T(1+P_T)})$ dynamic regret but remaining unable to benefit from predictions, or follow predictions blindly, obtaining regret proportional to the prediction error but with no worst-case safeguard. We propose a framework based on heterogeneous expert aggregation that simultaneously adapts to both the environment non-stationarity, characterized by path length $P_T$, and prediction quality, measured by cumulative error $\bar{E}_T$, without prior knowledge of either. The framework maintains a diverse pool of experts, which includes a gradient-based expert utilizing Online Gradient Descent, a prediction-based expert following predictions, and a new hybrid subroutine called Online Anchor Mirror Descent. These experts are aggregated by AdaHedge, whose small-loss property is critical to our results. We prove that our strongest variant achieves dynamic regret that smoothly interpolates between $O(GD\log\log T)$ when predictions are accurate and $O(R^*)$ when predictions are adversarial, where $R^* = O(G\sqrt{T(D^2+2DP_T)})$ is the optimal prediction-free rate. The small-loss bound of AdaHedge ensures that the aggregation overhead depends on the best expert's loss rather than on $T$, enabling a qualitative improvement over the $\Omega(\sqrt{T})$ floor of prediction-free methods. We further introduce an instance-dependent refinement of the new hybrid subroutine that can strictly improve the guarantee on favorable trajectories. Experiments on synthetic benchmarks validate all theoretical predictions: our methods achieve near-constant regret under accurate predictions, degrade gracefully under adversarial predictions, and outperform baselines by up to $26\times$ in non-stationary environments.
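
The aggregation idea can be sketched with plain Hedge (exponential weights) using a fixed learning rate. This is a deliberate simplification: the paper uses AdaHedge, which tunes the rate adaptively, and the sketch does not implement the Online Gradient Descent or Online Anchor Mirror Descent experts. The function name, the scalar decisions, and the `eta` parameter are assumptions made for illustration.

```python
import numpy as np

def hedge_aggregate(expert_decisions, loss_fn, eta=1.0):
    """Combine K experts over T rounds with fixed-rate Hedge.

    expert_decisions: array of shape (T, K), one scalar decision per expert.
    loss_fn(t, x): loss of playing decision x in round t.
    Returns the played decisions (T,) and the final expert weights (K,).
    """
    T, K = expert_decisions.shape
    log_w = np.zeros(K)          # log-weights, for numerical stability
    played = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        played[t] = w @ expert_decisions[t]   # weighted average of expert decisions
        losses = np.array([loss_fn(t, x) for x in expert_decisions[t]])
        log_w -= eta * losses                 # multiplicative-weights update
    w = np.exp(log_w - log_w.max())
    return played, w / w.sum()
```

When one expert follows accurate predictions and another ignores them, the weights concentrate on the accurate expert after a few rounds, which is the behaviour the abstract attributes to the aggregation layer.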

URL: https://openreview.net/forum?id=LWsEyfdnp9

---

Title: A Survey on Foundations and Frontiers of Multimodal Agentic Frameworks: Techniques and Applications

Authors: Neel Mokaria, Rishie Raj, Dheeraj Baiju, Xiaoqian Shen, Shraman Pramanick, Kevin Qinghong Lin, Arda Senocak, Mike Zheng Shou, Philip Torr, Mohamed Elhoseiny, Yapeng Tian, Ruohan Gao, Salman Khan, Sayan Nag, Sanjoy Chowdhury, Dinesh Manocha

Abstract: Advances in large language models (LLMs) have fueled a wave of research into agency: the ability to reason, plan, and act. This effort has produced agentic frameworks that orchestrate perception, memory, and decision-making around powerful LLM backbones. With the advent of large multimodal models (LMMs), these systems can process and integrate diverse modalities, including images, audio, and video, thereby improving their real-world applicability. Yet, while surveys of LLM-based agents exist, the role of multimodality in shaping agency has not been systematically examined in recent years. This survey fills the gap by analyzing the impact of multimodality across the core functional modules of the agentic framework: perception, reasoning, planning, memory, and action. Using this lens, we trace the evolution from text-centric agents to multimodal frameworks, examine how modalities are integrated through delegated, late-fusion, and early-fusion architectures, and assess the emergence of agentic behaviors enabled by grounded perception and multimodal reasoning. We organize existing work through a modality-centric taxonomy that links architectural design choices to agent capabilities. Moreover, we review multimodal agentic systems across various application domains, including Robotics, GUI & Web Navigation, Multimedia Content Generation & Editing, and Long-form Video Understanding & Retrieval. Beyond capabilities, we analyze performance across these settings and discuss efficiency-scalability trade-offs, including training and inference costs, latency, and deployment constraints. By focusing on the impact of multimodality in agentic design, we aim to identify key gaps and chart a roadmap toward robust and general-purpose intelligent systems.

URL: https://openreview.net/forum?id=eaVoaI7f8v

---

Title: Beyond ReinMax: Low-Variance Gradient Estimators for Discrete Latent Variables

Authors: Daniel Wang, Thang D Bui

Abstract: Machine learning models involving discrete latent variables require gradient estimators to facilitate backpropagation in a computationally efficient manner. The most recent addition to the Straight-Through family of estimators, ReinMax, can be viewed from a numerical ODE perspective as incorporating an approximation via Heun's method to reduce bias, but at the cost of high variance. In this work, we introduce the ReinMax-Rao and ReinMax-CV estimators which incorporate Rao-Blackwellisation and control variate techniques into ReinMax to reduce its variance. Our estimators demonstrate superior performance on training variational autoencoders with discrete latent spaces. Furthermore, we investigate the possibility of leveraging alternative numerical methods for constructing more accurate gradient approximations and present an alternative view of ReinMax from a simpler numerical integration perspective.

URL: https://openreview.net/forum?id=crlvtnsyIT

---

Title: Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Authors: Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

Abstract: We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 points in F1 score and showcasing its robust capability to accurately assess the correctness of the reasoning step. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.

URL: https://openreview.net/forum?id=unWmplHccF

---

Title: Speeding up fairness reductions

Authors: Andrea Baraldi, Matteo Brucato, Miroslav Dudík, Francesco Guerra, Matteo Interlandi

Abstract: We study the problem of fair classification, where the goal is to optimize classification accuracy subject to fairness constraints. This type of problem occurs in many real-world applications, where we seek to ensure that a deployed AI system does not disproportionately impact historically disadvantaged groups. One of the leading approaches in the literature is the reduction approach (Agarwal et al., 2018; 2019), which enjoys many favorable properties. For instance, it supports a wide range of fairness constraints and model families and is usually easy to incorporate in existing ML pipelines. The reduction approach acts as a wrapper around a standard ML algorithm and obtains a model that satisfies fairness constraints by repeatedly running a fairness-unaware base algorithm. A typical number of iterations is around 100, meaning that the reduction approach can be up to 100 times slower than the base algorithm, which limits its applicability. To overcome this limitation, we introduce two algorithmic innovations. First, we interleave the exponentiated gradient updates of the standard reduction approach with column-generation updates, which leads to a decrease in the number of calls to the base algorithm. Second, we introduce adaptive sampling, which decreases the sizes of the datasets used in the calls to the base algorithm. We conduct comprehensive experiments to evaluate the efficacy of our improvements, showing that our two innovations speed up the reduction approach by an order of magnitude without sacrificing the quality of the resulting solutions.

URL: https://openreview.net/forum?id=C0AdL3r1Dc

---

Title: HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

Authors: Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen

Abstract: An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is hard for humans to verify and to base decisions on accurately. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the question. That is, given an input question, LLMs first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Compared to vanilla chain-of-thought prompting (CoT), HoT reduces the rate of hallucination and, separately, consistently improves the accuracy of 5 LLMs on over 22 tasks ranging from arithmetic and reading comprehension to logical reasoning.
Consistent with the success of HoT few-shot prompting, training small LLMs (LLaMA-3.2-1B and Qwen2.5-1.5B) via supervised fine-tuning on HoT examples improves accuracy (on 5 out-of-distribution tasks) over the baselines and over fine-tuning on CoT examples. When humans are asked to verify LLM responses, highlights help time-limited participants recognize more accurately and efficiently when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to fool users into believing that an answer is correct.
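
As a small illustration of what tag-based grounding makes checkable (a sketch, not the authors' tooling), the snippet below extracts tagged facts from a re-formatted question and from a response, then reports any tag used in the response that never appeared in the question. The tag naming scheme `<fact1>`, `<fact2>`, ... and both function names are assumptions made for this example.

```python
import re

# Assumed tag format: <factN>...</factN>, with matching open and close names.
TAG = re.compile(r"<(fact\d+)>(.*?)</\1>", re.DOTALL)

def extract_facts(text):
    """Map each tag name to the text it highlights."""
    return {name: span.strip() for name, span in TAG.findall(text)}

def ungrounded_tags(question, answer):
    """Return tags used in the answer that are absent from the question."""
    in_question = extract_facts(question)
    in_answer = extract_facts(answer)
    return sorted(name for name in in_answer if name not in in_question)

question = ("Tom has <fact1>3 apples</fact1> and buys <fact2>2 more</fact2>. "
            "How many apples does he have now?")
answer = ("He starts with <fact1>3 apples</fact1>, adds <fact2>2 more</fact2>, "
          "and <fact3>eats 1</fact3>, so he has 4.")
print(ungrounded_tags(question, answer))  # ['fact3']
```

A check like this only confirms that a highlight points back to the input; it says nothing about whether the final answer is correct, which matches the abstract's caution that highlights can make wrong answers look convincing.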

URL: https://openreview.net/forum?id=abm6pDTbT1

---

Title: Divide and Conquer: Selective Value Learning and Policy Optimization for Offline Safe Reinforcement Learning

Authors: Jiahui Zhu, Lei Ying, Honghao Wei

Abstract: Offline safe reinforcement learning (RL) aims to learn policies that maximize reward while satisfying safety constraints from a fixed dataset. Existing methods extend offline RL with primal–dual value learning and behavior-regularized policy optimization, but in safety-critical tasks they struggle: uniform updates across all states ignore the difference between safety-preserving and unsafe states, leading to inaccurate value estimates, infeasible solutions when constraints conflict, and strong sensitivity to dataset quality. We propose SEVPO ($\textbf{SE}$lective $\textbf{V}$alue Learning and $\textbf{P}$olicy $\textbf{O}$ptimization), a divide-and-conquer framework that separates updates based on state safety. SEVPO learns conservative cost values to identify safe states, applying reward-constrained optimization with selective regularization there, and switches to cost-minimization outside to compute least-cost escape paths. Extensive experiments show SEVPO achieves high reward and strict safety guarantees, outperforming state-of-the-art offline safe RL methods across diverse dataset qualities. We further validate SEVPO by training a Unitree Go2 quadruped robot in dynamic environments using only offline data, demonstrating its potential for safety-critical robotics (https://youtu.be/tDpWq2EV_Ig).

URL: https://openreview.net/forum?id=4KYrv6qYMl

---

Title: Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

Authors: Somjit Nath, Jackson J Cone, Derek Nowrouzezahrai, Samira Ebrahimi Kahou

Abstract: Neuroscientific research has revealed that the brain encodes complex behaviors by leveraging structured, low-dimensional manifolds and dynamically fusing multiple sources of information through adaptive gating mechanisms. Inspired by these principles, we propose a novel reinforcement learning (RL) framework that encourages the disentanglement of dynamics-specific and reward-specific features, drawing direct parallels to how neural circuits separate and integrate information for efficient decision-making. Our approach leverages locally linear embeddings (LLEs) to capture the intrinsic, locally linear structure inherent in many environments—mirroring the local smoothness observed in neural population activity—while concurrently deriving reward-specific features through the standard RL objective. An attention mechanism, analogous to cortical gating, adaptively fuses these complementary representations on a per-state basis. Experimental results on benchmark tasks demonstrate that our method, grounded in neuroscientific principles, improves learning efficiency and overall performance compared to conventional RL approaches, highlighting the benefits of explicitly modeling local state structures and adaptive feature selection as observed in biological systems.

URL: https://openreview.net/forum?id=p7p3iuah0G

---

Title: Diagnosing Failure Modes of Neural Operators Across Diverse PDE Families

Authors: Lennon Shikhman

Abstract: Neural PDE solvers are increasingly used as learned surrogates for families of partial differential equations, where the key machine learning challenge is not only interpolation on a fixed benchmark distribution but generalization under structured shifts in coefficients, boundary conditions, discretization, and rollout horizon. Yet evaluation is still often dominated by in-distribution test error, making robustness difficult to assess. We introduce a standardized stress-testing framework for neural PDE solvers under deployment-relevant shift. We instantiate it on three representative architectures--Fourier Neural Operators (FNOs), a DeepONet-style model, and convolutional neural operators (CNOs)--across five qualitatively different PDE families: dispersive, elliptic, multi-scale fluid, financial, and chaotic systems. Across 750 trained models, we measure robustness using baseline-normalized degradation factors together with spectral and rollout diagnostics. The resulting comparisons reveal that strong in-distribution accuracy does not reliably predict robustness, and that failure patterns depend jointly on architecture and PDE family. Our results provide a clearer basis for evaluating robustness claims in neural PDE solvers and suggest that function-space generalization under structured shift should be treated as a first-class evaluation target.

URL: https://openreview.net/forum?id=0S1LWZHQYn

---

Title: Flow Matching for Probabilistic Monocular 3D Human Pose Estimation

Authors: Cuong Le, Pavlo Melnyk, Bastian Wandt, Mårten Wadenbäck

Abstract: Recovering 3D human poses from a monocular camera view is a highly ill-posed problem due to depth ambiguity. Earlier studies on 3D human pose lifting from 2D often produce incorrect-yet-overconfident 3D estimations. To mitigate the problem, emerging probabilistic approaches treat the 3D estimations as a distribution, taking into account the uncertainty measurement of the poses. In the same category, we propose FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach. Conditioned on the 2D cues, the flow matching scheme learns the optimal transport from a simple source distribution to the plausible 3D human pose distribution via continuous normalizing flows. The 2D lifting condition is modeled via graph convolutional networks, leveraging the learnable connections between human body joints as the graph structure for feature aggregation. While trade-offs between processing time and precision exist, already in the equal-accuracy comparison, FMPose exhibits significantly faster processing time than the diffusion model, and also offers another faster and more accurate configuration. Experimental results show major improvements of our FMPose over current state-of-the-art methods on two common benchmarks for 3D human pose estimation, namely Human3.6M and MPI-INF-3DHP. Additionally, FMPose shows competitive performance on the more challenging 3DPW dataset. The code implementation is available at https://github.com/cuongle1206/FMPose.

URL: https://openreview.net/forum?id=UlpH4XBLR4

---

Title: PRISM: Patch Diffusion with Dynamic Retrieval Augmented Guidance and Permutation Invariant Conditioning

Authors: Shivam Pal, Avideep Mukherjee, Vinay P. Namboodiri, Piyush Rai

Abstract: Diffusion models have achieved state-of-the-art results in image generation but often require extensive computational resources and large-scale datasets, limiting their practicality in resource-constrained settings. To address these challenges, we introduce PRISM, a retrieval-guided, patch-based method that trains solely on image patches instead of full resolution images. PRISM achieves superior global coherence and outperforms patch-only baselines, even when trained on only a fraction of the data. For each training example, PRISM retrieves semantically related neighbors from a disjoint retrieval set using CLIP embeddings. It aggregates their unordered signals with a Set Transformer, ensuring permutation-invariant conditioning that captures higher-order relationships. A dynamic neighbor-annealing schedule optimizes the contextual guidance over time, leading to more coherent results. Experiments on unconditional image generation tasks using CIFAR-10, CelebA, ImageNet-100, and AFHQv2 datasets, along with ablation studies, validate our approach, demonstrating that retrieval-augmented, set-based conditioning closes the coherence gap in patch-only diffusion.

URL: https://openreview.net/forum?id=ru712j5D2d

---

Title: Aligning time series anomaly detection research with practical applications

Authors: Daniel Barrish, Jan van Vuuren

Abstract: The field of time series anomaly detection is hindered not by its models and algorithms, but rather by its inadequate evaluation methodologies. A growing number of researchers have claimed in recent years that various prevalent metrics, datasets, and benchmarking practices employed in the literature are flawed. In this paper, we echo this sentiment by demonstrating that widespread metrics are incongruent with desirable model behaviour in practice and that datasets are plagued by inaccurate labels and unrealistic anomaly density, amongst other issues. Furthermore, we provide suggestions and guidance on realigning theoretical research with the demands of practical applications, with the goal of establishing a stable, principled benchmarking framework within which models may be evaluated and compared fairly. Finally, we offer a perspective on the main challenges and unanswered questions in the field, alongside potential future research directions.

URL: https://openreview.net/forum?id=RyMLAr5tFU

---

Title: Multimodal Masked Point Distillation for 3D Representation Learning

Authors: Muhammad Abdullah Jamal, Omid Mohareri

Abstract: We propose a two-stage pre-training approach using point clouds for a diverse set of 3D understanding tasks. In the first stage, we pre-train the 3D encoder to acquire knowledge from other modalities such as vision and language. This stage aligns 3D representations with multiple modalities by leveraging several pre-trained foundation models, unlike the current cross-modal paradigm that typically uses only a single pre-trained model. In the second stage, we improve upon masked point modeling through global-local feature distillation of semantic 3D embeddings and a token shuffling approach. These techniques enable the model to focus on the 3D modality while leveraging the multimodal information associated with the point clouds. This pre-training approach is model-agnostic and can be applied to any 3D transformer encoder. We conduct extensive experiments on a wide range of 3D understanding tasks, from synthetic and real-world object recognition to indoor semantic segmentation and object detection, achieving state-of-the-art results. For instance, on the ScanObjectNN variants, our approach achieves $\textbf{96.1\%}$, $\textbf{94.2\%}$ and $\textbf{91.2\%}$ accuracy using the multi-scale 3D encoder proposed in Point-M2AE.

URL: https://openreview.net/forum?id=Gxb3z4VlM7

---

Title: Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

Authors: Ziyao Wang, Bingying Wang, Hanrong Zhang, Tingting Du, Tianyang Chen, Guoheng Sun, Yexiao He, Zheyu Shen, Wanghao Ye, Ang Li

Abstract: Despite remarkable progress in Vision--Language--Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. To this end, we present a systematic, data-centric analysis of VLA research organized around three pillars: datasets, benchmarks, and data engines. For datasets, we categorize real-world and synthetic corpora along embodiment diversity, modality composition, and action space formulation, revealing a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. For benchmarks, we analyze task complexity and environment structure jointly, exposing structural gaps in compositional generalization and long-horizon reasoning evaluation that existing protocols fail to address. For data engines, we examine simulation-based, video-reconstruction, and automated task-generation paradigms, identifying their shared limitations in physical grounding and sim-to-real transfer. Synthesizing these analyses, we distill four open challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation. Addressing them, we argue, requires treating data infrastructure as a first-class research problem rather than a background concern.

URL: https://openreview.net/forum?id=tAaWFpvnmm

---

New submissions
===============

Title: The Embodiment Gap in Robot Foundation Models

Abstract: Robot foundation models (RFMs), including vision-language-action (VLA) policies, are often read through a familiar scaling story: more data, larger models, and broader benchmarks. Robotics adds a practical follow-up: when a shared model reaches a new body, what work lets it act there? This survey asks what travels across robot bodies and what has to be realized on the target robot. We call the mismatch between reusable structure and target-specific execution the embodiment gap. The gap identifies which structures become reusable, where body-specific work remains, and what evidence should accompany cross-embodiment success claims. We organize this lens around three scaling directions–semantic meaning and perception, physical robot data and interfaces, and embodiment correspondence–and use it to define a reporting agenda for target-body residuals. The goal is to make cross-embodiment progress easier to compare, reproduce, and build on, while encouraging systems that leave new robots with less target-specific work, clearer failure attribution, and safer recovery.

URL: https://openreview.net/forum?id=D0XcH9Cso4

---

Title: CaveAgent: Transforming LLMs into Stateful Runtime Operators

Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms that struggle with long-horizon tasks due to fragile multi-turn dependencies and context drift. We present CaveAgent, a framework that shifts LLM tool use from "LLM-as-Text-Generator" to "LLM-as-Runtime-Operator." CaveAgent introduces a dual-stream architecture: a semantic stream for lightweight reasoning and a runtime stream backed by a persistent Python environment for stateful execution. Rather than treating the LLM's text context as the primary workspace, CaveAgent elevates the persistent runtime as the central locus. Beyond leveraging code generation to resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, CaveAgent introduces Stateful Runtime Management: it injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns, unlike existing code-based approaches that remain text-bound. CaveAgent further provides a runtime-integrated skill management system that extends the Agent Skills open standard, enabling ecosystem interoperability through executable skill injections. This persistence mechanism serves as a high-fidelity external memory that reduces context drift in multi-turn interactions and preserves processed data for downstream applications with less information loss. Evaluations on Tau$^2$-bench and the Berkeley Function Calling Leaderboard (BFCL) across six state-of-the-art LLMs demonstrate consistent improvements in 11 out of 12 settings, with gains up to +13.5% success rate on multi-turn retail tasks. On BFCL, the three open-source models we evaluate all reach 94.0-94.7% under CaveAgent, comparable to closed-source Claude Sonnet 4.5 (94.4%) and Gemini 3 Pro (94.3%) and exceeding GPT-5.1 (89.6%) under their native function-calling protocols; the 30B Qwen3-Coder reaching 94.4% suggests the function-calling protocol is a key performance bottleneck alongside model scale. Token efficiency studies show 28.4% reduction in total token consumption and up to 51% token reduction on data-intensive tasks relative to the best baseline. The accessible runtime state further provides programmatically verifiable feedback, enabling automated evaluation and reward signal generation without human annotation and establishing a structural foundation for future research in Reinforcement Learning with Verifiable Rewards (RLVR).

URL: https://openreview.net/forum?id=p3dlOhpqKD

---

Title: Dale meets Langevin: A Multiplicative Denoising Diffusion Model

Abstract: Exponentiated gradient descent (EGD), a biologically motivated optimization algorithm that respects Dale's law, results in log-normally distributed synaptic weights, in alignment with experimental observations in neuroscience. Since the marginal distribution of geometric Brownian motion (GBM) at any fixed time is log-normal, there is a natural connection between EGD and GBM-based stochastic processes. We propose a multiplicative score-based generative model with GBM as a forward noising process and derive its corresponding reverse-time SDE in both the ambient space and in the $\log$-transformed space. We derive two multiplicative samplers by discretizing the corresponding reverse-time SDEs: a sign-agnostic sampler obtained directly from the ambient-space reverse-time SDE, and a sign-preserving sampler, which we refer to as the Dale-Langevin sampler, obtained via the Lamperti transform. We further connect the framework to Mirrored Langevin Dynamics, showing that the convex function driving EGD in optimization precisely governs the Dale-Langevin sampler. The Stein score, defined as $\nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$ for a random vector $\boldsymbol{X}$ with density $p_{\boldsymbol{X}}$ evaluated at $\boldsymbol{x}$, comes up naturally in the additive noise based diffusion models. In the multiplicative setting, we encounter $\boldsymbol{x} \circ \nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$, a modulated version of the Stein score for sampling, which we name the Hyvärinen score. In order to estimate the Hyvärinen score, we introduce the multiplicative denoising score-matching loss (M-DSM), the multiplicative explicit score-matching loss (M-ESM), and establish their equivalence. This development subsumes the non-negative score matching loss of Hyvärinen (2007) as a special case. Experimental results on MNIST, Fashion-MNIST, Kuzushiji MNIST, and CIFAR-10 validate the generative capability of the proposed framework.

URL: https://openreview.net/forum?id=2LecV2qq1C
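
For readers unfamiliar with the link between geometric Brownian motion and log-normal marginals mentioned in the abstract above, the standard Itô computation is sketched here. The drift $\mu$ and volatility $\sigma$ are generic placeholder symbols assumed for illustration; the abstract does not state the paper's own parameterization of the forward process.

```latex
% Standard GBM with generic coefficients (assumed notation, not the paper's)
dX_t = \mu X_t\,dt + \sigma X_t\,dW_t, \qquad X_0 > 0
% Ito's formula applied to Y_t = \log X_t gives an additive-noise SDE:
dY_t = \left(\mu - \tfrac{1}{2}\sigma^2\right)dt + \sigma\,dW_t
% Hence Y_t is Gaussian and X_t = e^{Y_t} is log-normal at every fixed t:
\log X_t \sim \mathcal{N}\!\left(\log X_0 + \left(\mu - \tfrac{1}{2}\sigma^2\right)t,\ \sigma^2 t\right)
```

In the log-transformed space the noise is additive, which is why a multiplicative process on $X_t$ can be handled with Gaussian tools on $Y_t$.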

---

Title: Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Abstract: Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is valuable, annotating opinions in datasets for model training requires considerable human effort and substantial cost, especially across diverse domains and real-world applications. To address this shortage of domain-specific labelled datasets, we explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis. We use a declarative annotation pipeline, an approach that reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a dedicated methodology for an LLM to adjudicate multiple labels and produce final annotations. We trial the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks. In this work, we attempt to develop fully autonomous LLM-based annotators, but our results reveal an uneven picture characterised by a critical performance bifurcation: LLMs are reliable at the span level yet struggle to faithfully reproduce the relational structures that connect those spans. This suggests that LLMs are better positioned as high-fidelity annotation assistants and data augmentation tools to expand fine-grained opinion-annotated datasets, rather than replacing human annotators entirely.

URL: https://openreview.net/forum?id=BGWI1mSy9j

---

Title: DiffEM: Learning from Corrupted Data with Diffusion Models via Expectation Maximization

Abstract: Diffusion models have emerged as powerful generative priors for high-dimensional inverse problems, yet learning them when observations are only available through a corruption channel remains challenging. In this work, we propose DiffEM, a new method for training diffusion models with Expectation-Maximization (EM) from corrupted data that does not rely on any approximations or heuristics. DiffEM utilizes conditional diffusion models to reconstruct clean data from observations in the E-step, and then uses the reconstructed data to refine the conditional diffusion model in the M-step. Theoretically, we provide monotonic convergence guarantees for the DiffEM iteration, assuming appropriate statistical conditions. We demonstrate the effectiveness of our approach through experiments on various image reconstruction tasks.

URL: https://openreview.net/forum?id=BIs7hDZe8q
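
The E-step/M-step alternation described in the abstract above can be illustrated with a deliberately tiny Gaussian analogue. This is not the DiffEM algorithm: the diffusion model is replaced by a one-dimensional Gaussian prior, the corruption channel is assumed to be additive noise with known variance, and the function name and data are hypothetical.

```python
# Toy Gaussian analogue of EM from corrupted data (NOT the DiffEM algorithm).
# Assumed model: clean x ~ N(m, v); observation y = x + noise, where the noise
# variance s2 is known. Only the noisy y values are available.

def em_gaussian_prior(ys, s2, iters=200, m=0.0, v=1.0):
    n = len(ys)
    for _ in range(iters):
        # E-step: posterior of each clean x_i given y_i under the current prior.
        gain = v / (v + s2)
        post_means = [m + gain * (y - m) for y in ys]
        post_var = v * s2 / (v + s2)
        # M-step: refit the prior to the reconstructed (posterior) data.
        m = sum(post_means) / n
        v = sum((mu - m) ** 2 for mu in post_means) / n + post_var
    return m, v

ys = [0.2, 1.9, -0.7, 3.1, 1.0, 2.4, -1.3, 0.8]  # hypothetical noisy observations
m_hat, v_hat = em_gaussian_prior(ys, s2=0.5)
```

In this toy setting the iteration settles at the sample mean of the observations and at their sample variance minus the noise variance, which is the maximum-likelihood prior for this model when that difference is positive.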

---

Title: Aligned to Catastrophe: A Phase Transition in Welfare Collapse under Preference-Faithful AI Amplification

Abstract: The standard picture of AI catastrophe is a machine whose goals diverge from ours. This paper identifies a different failure: an AI whose goals are exactly ours, when ours contain a net drift toward harm. We call this the Brownian Drift Failure Mode. In a minimal mean-field model of preferences under AI amplification, we prove a sharp phase transition in capability: above the threshold, any negative drift in aggregate preferences---however small---produces exponential collapse of the population mean preference, at a rate independent of how negative the bias is. A two-population extension shows that intergroup conflict becomes unstable at strictly lower capability than aggregate collapse. Above the threshold standard interventions break: bounded corrections shift the basin of collapse but do not remove it, and hard constraints prevent divergence only by pinning welfare permanently at an exogenous floor. Monte Carlo simulations confirm the analytical results and their robustness to population heterogeneity. Alignment is necessary but not sufficient: what the aggregate prefers matters, not only how faithfully it is served.

URL: https://openreview.net/forum?id=gBJfgXegzO

---

Title: Sufficient conditions for the misalignment of AI

Abstract: Formulating alignment mathematically, we prove two sufficient conditions on large language models (LLMs) which imply that an LLM will not correctly evaluate the alignment of all possible inputs. The first sufficient condition is that the embedding vectors of misaligned statements cluster in the embedding space. The second sufficient condition is that alignment is necessarily self-consistent, in a logically precise sense. Practically, our results offer some understanding of how AI may be unintentionally misaligned, which may be useful for high-level AI design considerations.

URL: https://openreview.net/forum?id=8k621zTZgR

---

Title: Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective

Abstract: Generative Modeling via Drifting~\citep{deng2026drifting} has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet the success is largely empirical and its theoretical foundations remain poorly understood. In this paper, we make the following observation: under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions. This insight allows us to answer three key questions that were left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ($V_{p,q}=0\Rightarrow p=q$), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the well-studied score-matching family and enable a rich theoretical perspective for subsequent analysis. By linearizing the McKean-Vlasov dynamics resulting from our formulation and probing these dynamics in Fourier space, we reveal frequency-dependent convergence timescales comparable to Landau damping in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, potentially explaining the empirical preference for the Laplacian kernel. Our analysis also suggests a fix: an exponential bandwidth annealing schedule $\sigma(t)=\sigma_0 e^{-rt}$ that reduces convergence time from $\exp(O(K_{\max}^2))$ to $O(\log K_{\max})$. Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is not a heuristic but is derived directly from the frozen-field discretization mandated by the Jordan, Kinderlehrer and Otto (JKO) scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, which we demonstrate with a Sinkhorn divergence drift. We validate our analysis on toy datasets and scale it up to ImageNet.

URL: https://openreview.net/forum?id=T4h31imzIC
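
The observation in the abstract above can be checked numerically in one dimension using the classical mean-shift identity: for a Gaussian kernel of bandwidth sigma, the kernel-weighted mean shift equals sigma squared times the score of the Gaussian-smoothed density. The sketch below assumes a drift of the mean-shift form (pull toward data minus pull toward generated samples), which the abstract does not spell out; all names and sample values are hypothetical.

```python
import math

def gaussian_kernel(u, sigma):
    return math.exp(-u * u / (2.0 * sigma * sigma))

def mean_shift(x, points, sigma):
    """Kernel-weighted mean of the points, minus x."""
    w = [gaussian_kernel(x - y, sigma) for y in points]
    return sum(wi * (y - x) for wi, y in zip(w, points)) / sum(w)

def smoothed_log_density(x, points, sigma):
    """Log of the Gaussian-smoothed empirical density, up to an additive constant."""
    return math.log(sum(gaussian_kernel(x - y, sigma) for y in points))

def smoothed_score(x, points, sigma, h=1e-5):
    """Central finite-difference derivative of the smoothed log density."""
    return (smoothed_log_density(x + h, points, sigma)
            - smoothed_log_density(x - h, points, sigma)) / (2.0 * h)

data = [-1.0, 0.5, 2.0]        # stand-in for samples from p
generated = [-0.2, 0.1, 1.4]   # stand-in for samples from q
x, sigma = 0.3, 0.7

# Assumed drift: attraction toward data minus attraction toward generated samples.
drift = mean_shift(x, data, sigma) - mean_shift(x, generated, sigma)
# Score difference of the two smoothed densities, scaled by sigma^2.
score_diff = sigma ** 2 * (smoothed_score(x, data, sigma)
                           - smoothed_score(x, generated, sigma))
```

The two quantities agree up to finite-difference error, so under this assumed form a vanishing drift corresponds to matching smoothed scores at that point.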

---

Title: Quantitative LLM Judges

Abstract: LLM-as-a-judge is a framework where a large language model (LLM) evaluates the output of another LLM. While LLMs excel at producing qualitative textual evaluations, they often struggle to predict human preferences and numeric scores. We propose quantitative LLM judges, which align evaluation scores of LLM judges to humans in a given domain using regression models. These models are trained to improve the score of the original judge using its rationale and score. We present four quantitative judges for different types of absolute and relative feedback, which showcases the generality and versatility of our framework. Our framework can be applied to proprietary models and when human feedback is limited, which is expected in practice. We validate our claims empirically on four datasets. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.

URL: https://openreview.net/forum?id=isUKzVxmOB
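
The idea of post-hoc alignment in the abstract above can be illustrated with the simplest possible regression: an ordinary least-squares map from a judge's raw score to the human score. This is a minimal sketch under strong assumptions: it uses only the score (the paper's models also use the judge's rationale), and the scores below are made-up illustrative numbers, not results.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Hypothetical calibration set: raw judge scores and matching human scores.
judge_scores = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0]
human_scores = [1.0, 2.0, 2.0, 3.0, 3.5, 4.5]

slope, intercept = fit_linear(judge_scores, human_scores)
calibrated = [slope * s + intercept for s in judge_scores]

raw_error = mse(judge_scores, human_scores)
calibrated_error = mse(calibrated, human_scores)
```

On the calibration set the fitted map can never have higher squared error than the raw scores, because leaving the scores unchanged is itself one of the affine maps the regression searches over. Held-out performance is a separate question that this sketch does not address.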

---

Title: Private and interpretable clinical prediction with quantum-inspired tensor train models

Abstract: Publicly available clinical machine learning models pose an underappreciated privacy risk: their parameters or outputs can be exploited to identify patients whose data were used during training. Moreover, this risk is exacerbated by models such as logistic regression (LR), which are typically preferred in clinical settings for their transparency. To assess this empirically, we attack LORIS, a publicly available LR model for immunotherapy response prediction hosted on a U.S. government website. From evaluations through its public interface, we recover the model parameters and identify the training cohort with certainty. More broadly, we design cohort-level membership inference attacks under three levels of adversarial access---binary black-box, continuous black-box, and white-box---and apply them to both LR models and shallow neural networks (NNs) trained on the same task. Our results reveal that even a cohort of 35 patients can be reliably identified within training sets of hundreds to thousands, and that common practices such as cross-validation amplify rather than mitigate this risk. To address these vulnerabilities, we propose a quantum-inspired defense based on tensorizing discretized models into tensor trains (TTs). This representation fully obfuscates model parameters and preserves accuracy, while offering black-box privacy comparable to Differential Privacy. Additionally, the TT representations retain LR interpretability and extend it through efficient computation of marginal and conditional distributions, enabling this richer analysis also for black-box models such as NNs. Our results establish tensorization as a practical, post-hoc foundation for private, interpretable, and effective clinical prediction.

URL: https://openreview.net/forum?id=QtG3fC1v5t

---

Title: When Responsibility Guidance Hurts: A Pilot Study of Pre-Execution Projection in LLM Agents

Abstract: Multi-agent LLM orchestration is increasingly framed as a routing problem, an aggregation problem, or a post-hoc failure problem. We study an intermediate object that none of these frames opens up: the responsibility structure an agent projects between receiving a natural-language delegation and executing against an artifact. We formalize responsibility projection as multi-label weight prediction over a closed dimension set, instantiate it on Jv1.1, a 12-dimension taxonomy for paper-research delegation (seven category dimensions, five cross-cutting), and use the closure to make projections from different model families directly comparable. The primary empirical contribution (P1) is that pre-execution responsibility projection is measurable and family-attributable: on a 50-example pilot under the v1.3 Anthropic-excluded cross panel (gpt-5 / gemini-2.5-pro / grok-4) with within-model variance estimated from five claude-sonnet-4-6 repetitions at T = 0.5, cross-family projection mismatch is approximately 6× within-family stochastic variance (median R(d) = 5.87, 95% bootstrap CI [4.47, 7.96], paired bootstrap CI on dC − dW excludes zero, Wilcoxon p < 10^{-15}), and the main-run extension at n = 310 gives median R(d) = 5.40 with CI [4.86, 5.89]. The secondary contribution (P2) is a negative actionability result: under a 12-judge Anthropic-excluded panel and a three-condition execution split, projection-driven execution shows a directional disadvantage relative to direct execution on the headline weighted-R1 settlement loss (paired diff direct_naive − projection_driven = −0.139, 95% bootstrap CI [−0.169, −0.109]), with the cost concentrating on R1.4 (novelty mapping) and R1.7 (citation audit), the two dimensions whose s = 5 anchor demands deep specialty engagement; we report this as a boundary condition, not a refutation of the projection layer. The methodological contribution (P3) is a closed Stage 1 human-anchor pilot on R1.7 that surfaces an anchor specifiability ceiling: form-embedded sharpened anchors are insufficient for reliable rater application without a separately-read protocol document, and the human-anchor scalability constraint is anchor specifiability and domain-expertise gating, not rater throughput. The scope of this paper is one delegation category (R1, paper-research) at pilot scale (n = 50) with the P1 measurability extension validated at n = 310; the four-condition Experiment 2 with task-aware-routing and CLAMBER-style baselines, longitudinal reputation evaluation on real LLM agents, and a main-run-scale r⋆ extension are reserved for follow-up work.

URL: https://openreview.net/forum?id=JbUV6y4mwC

---

Title: Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity

Abstract: In this study, we propose an enhancement to the similarity computation mechanism in multimodal contrastive pretraining frameworks such as CLIP. Prior theoretical research has demonstrated that the optimal similarity metrics between paired modalities should correspond to the pointwise mutual information (PMI) between the two modalities. However, the current implementations of CLIP and its variants fail to fully utilize the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through the inner product in a reproducing kernel Hilbert space. We theoretically prove that our method can approximate PMI with arbitrary accuracy and empirically demonstrate that our approach overall outperforms the standard CLIP formulation across several retrieval and classification tasks.

URL: https://openreview.net/forum?id=xRKVzuZ68J

---

Title: A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

Abstract: We introduce PLACE (Persistence-Landmark Analytic Classification Engine), a closed-form pipeline for classifying point clouds and graphs through their persistent-homology signatures. Three quantitative guarantees -- a margin-based excess-risk rate, a closed-form descriptor-selection rule, and a per-prediction certificate -- are derived from training labels alone, with no learned weights or held-out calibration. The embedding sums Mitra-Virk single-point coordinate functions over a sparse landmark grid; the closed-form weight rule $w_k^2 \propto (d_{k+1}^2 - d_k^2)/R_k^2$ maximizes the distortion slope in Mitra-Virk's affine certificate under $\nu$-coherence. (i) An $O(kR/(\Delta\sqrt{m_{\min}}))$ margin bound, driven by class-mean separation $\Delta$ and embedding radius $R$, matched in the sample-starved regime $m \lesssim R/\Delta$ by a Le Cam minimax lower bound. (ii) The Mahalanobis margin under Ledoit-Wolf-shrunk covariance is the strongest closed-form ranker on a 64-descriptor chemical-graph pool (mean Spearman $\rho = +0.56$ across 11 benchmarks, positive on 10 of 11); the isotropic surrogate $\Delta/\sqrt{\ell}$ admits a closed-form selection-consistency rate on the homogeneous protein/social pools. (iii) A training-time-decided certificate, with no per-prediction overhead, in three concrete radii (Pinelis, Gaussian plug-in, and variance-aware Pinelis-Bernstein). Empirically, PLACE is the strongest diagram-based method on Orbit5k and matches the strongest topology-based baseline within statistical noise on MUTAG and COX2; remaining gaps fall into two diagnosable regimes (descriptor blindness on NCI1/NCI109; pool-coverage limits elsewhere). The Pinelis-Bernstein radius fires on 8 of the 12 benchmarks; on MUTAG the empirical and population nearest-centroid rules agree on every one of 940 held-out test predictions, validating the certificate's mechanism.

URL: https://openreview.net/forum?id=4kZxNlE5Ve

---

Title: Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization

Abstract: Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it requires only a single training run per domain, yet it often leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code and models will be publicly available.

URL: https://openreview.net/forum?id=rxaQbSybeT

---

Title: A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

Abstract: We apply the Weibull distribution, a two-parameter family from extreme-value theory, as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give $|w| \sim \text{HalfNormal}$, which anchors the Weibull shape parameter at $k \approx 1.20$. This makes $k$ a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer enables diagnostics invisible to aggregate statistics.
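The half-normal anchor can be checked numerically with a short sketch. It draws i.i.d. Gaussian "weights", takes magnitudes, and fits a two-parameter Weibull by maximum likelihood. The estimator, the function name `weibull_mle`, and the initialization scale 0.02 are assumptions made for illustration. The abstract does not say how $k$ is fitted, and different estimators can give somewhat different shape values, so the printed number need not coincide exactly with the quoted $k \approx 1.20$.

```python
import math
import random

def weibull_mle(samples, lo=0.05, hi=20.0, iters=50):
    """Maximum-likelihood Weibull fit with location fixed at 0; returns (k, lam)."""
    logs = [math.log(x) for x in samples]
    mean_log = sum(logs) / len(logs)

    def score(k):
        # Profile-likelihood equation for the shape k; it is increasing in k.
        powers = [math.exp(k * lx) for lx in logs]
        weighted = sum(p * lx for p, lx in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iters):  # bisection on the monotone score function
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = 0.5 * (lo + hi)
    # Given k, the MLE of the scale is the k-th power mean of the samples.
    lam = (sum(math.exp(k * lx) for lx in logs) / len(logs)) ** (1.0 / k)
    return k, lam

rng = random.Random(0)
# Hypothetical "weight matrix": 20,000 Gaussian entries with std 0.02.
magnitudes = [abs(rng.gauss(0.0, 0.02)) for _ in range(20000)]
magnitudes = [m for m in magnitudes if m > 0.0]  # guard against log(0)

k_hat, lam_hat = weibull_mle(magnitudes)
print(f"shape k = {k_hat:.3f}, scale lambda = {lam_hat:.4f}")
```

The shape estimate is invariant to the initialization scale, while the scale estimate moves with it, which is consistent with the abstract's reading of $k$ and $\lambda$ as carrying separate information.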

Applying this framework to 12 model entries spanning 7 architectural families reveals the following findings. First, FFN modules and the attention output projection (the Transmission Class) fall in a narrow $k$ band $[1.186, 1.204]$ across architectures (CV $= 0.51\%$). Second, the attention input projections (the Selection Class, $W_q$ and $W_k$) depart from this band in an architecture-dependent manner: separately-stored MHA shows the largest drift, grouped-query attention shows milder drift, and merged storage shows transitional behavior. The scale parameter $\lambda$ grows during training and tracks $\sqrt{\eta/\lambda_{wd}}$ as a within-family scaling trend in Pythia ($n = 5$ sizes). The two Weibull parameters carry independent information: $k$ labels the functional class, $\lambda$ labels training progress.

The framework was further used to diagnose an 11-entry Qwen cohort: shallow-FFN layers exhibit bimodal weight distributions. We release npm-weibull-py v0.4 and DATABASE_v9_1 (anonymized for review; URLs in camera-ready).

URL: https://openreview.net/forum?id=j7qGffCwDa

---

Title: Locally Adaptive Conformal Inference for Operator Models

Abstract: Operator models are regression algorithms between Banach spaces of functions. They have become an increasingly critical tool for spatiotemporal forecasting and physics emulation, especially in high-stakes scenarios where robust, calibrated uncertainty quantification is required. We introduce Local Sliced Conformal Inference (LSCI), a distribution-free framework for generating function-valued, locally adaptive prediction sets for operator models. We prove finite-sample validity and derive a data-dependent upper bound on the coverage gap under local exchangeability. On synthetic Gaussian process tasks and real applications (air quality monitoring, energy demand forecasting, and weather prediction), LSCI yields tighter sets with stronger adaptivity compared to conformal baselines. We also empirically demonstrate robustness against biased predictions and certain out-of-distribution noise regimes.

URL: https://openreview.net/forum?id=RCDStCXN3W

---

Title: Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs

Abstract: Retrieval of information from graph-structured knowledge bases represents a promising direction for improving the factuality of LLMs. While various solutions have been proposed, a comparison of methods is difficult due to the lack of challenging QA datasets with ground-truth targets for graph retrieval. We present SynthKGQA, an LLM-powered framework for generating high-quality Knowledge Graph Question Answering datasets from any Knowledge Graph, providing the full set of ground-truth facts in the KG to reason over questions. To demonstrate its utility, we apply SynthKGQA to Wikidata to generate GTSQA. This new dataset is specifically designed to test zero-shot generalization with respect to unseen graph structures and relation types, enabling us to analyze the abilities and limitations of SOTA graph retrieval approaches at an unprecedented level of granularity. We also show that KG retrievers trained on GTSQA can transfer to human-curated benchmarks, and that the ground-truth subgraphs produced by SynthKGQA provide a better training supervision signal than previously-used heuristics.

URL: https://openreview.net/forum?id=n0Nh5SeR3i

---

Title: How You Say It Matters: Personalizing LLM Responses via Dual Time-Scale Closed-Loop Adaptation

Abstract: Personalization in large language models (LLMs) is framed as a content problem focused on deciding what to retrieve, generate, or recommend. However, identical content can have different effects on the user’s affective and cognitive states depending on how it is delivered, including its structure, tone, and relational style. We present an adaptation framework that addresses delivery personalization using a fast loop that corrects per-turn quality degradation within sessions and a slow loop that learns per-user priors across sessions. We evaluate 1,094 conversations across three models from Anthropic and OpenAI and show that our framework systematically differentiates the output along the targeted affective and cognitive dimensions and outperforms the unconditioned baseline across all measured quality outcomes, with effect sizes of d = 0.61–1.26. These results suggest that delivery can be modeled as a distinct axis of LLM personalization, adapting to both longer-term user patterns and changes within an interaction.

URL: https://openreview.net/forum?id=cBuU9lkNI5

---

Title: When Aggregation Stops Collaborating: Layer-wise Inertia in Low-Data Federated Learning

Abstract: Federated learning (FL) enables collaborative model training across decentralized clients while preserving data privacy, leveraging aggregated updates to build robust global models. However, this training paradigm faces significant challenges due to data heterogeneity, and each client has access to only scarce local training data, which often impedes effective collaboration. In such scenarios, we reveal that the collaboration bottleneck is closely tied to the \textit{Layer-wise Inertia Phenomenon} in FL, where intermediate layers of the global model rapidly become stagnant after early communication rounds, ultimately weakening the effectiveness of global aggregation. We demonstrate the presence of this phenomenon across a wide range of federated settings, spanning diverse datasets and architectures. To address this issue, we propose LIPS (Layer-wise Inertia Phenomenon with Sparsity), a simple yet effective method that periodically introduces \textit{transient sparsity} to stimulate meaningful updates and empower global aggregation. Experiments demonstrate that LIPS effectively mitigates layer-wise inertia, enhances aggregation effectiveness, and improves overall performance in various FL scenarios. This work not only deepens the understanding of layer-wise learning dynamics in FL but also paves the way for more effective collaboration strategies in resource-constrained environments.

URL: https://openreview.net/forum?id=oEiRZPMp6q

---

Title: A Hierarchical Probabilistic Framework for Incremental Knowledge Tracing in Classroom Settings

Abstract: Knowledge tracing (KT) aims to estimate a student's evolving knowledge state and predict their performance on new exercises based on performance history. Many realistic classroom settings for KT are low-resource in data and require online updates as students' exercise history grows, which creates significant challenges for existing KT approaches. To restore strong performance under low-resource conditions, we revisit the hierarchical knowledge concept (KC) information, which is typically available in many classroom settings and can provide a strong prior when data are sparse. We therefore propose Knowledge-Tree-based Knowledge Tracing ($KT^2$), a probabilistic KT framework that models student understanding over a tree-structured hierarchy of knowledge concepts using a Hidden Markov Tree Model. $KT^2$ estimates student mastery via an EM algorithm and supports personalized prediction through an incremental update mechanism as new responses arrive. Our experiments show that $KT^2$ consistently outperforms strong baselines in realistic online, low-resource settings.

URL: https://openreview.net/forum?id=9kfruXm7e9

---

Title: Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

Abstract: Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.

URL: https://openreview.net/forum?id=dfX4zpagYF

---

Title: Inducing Uncertainty on Open-Weight Models for Test-Time Privacy in Image Recognition

Abstract: A key concern for AI safety remains understudied in the machine learning (ML) literature: how can we ensure users of ML models do not leverage predictions on incorrect personal data to harm others? This is particularly pertinent given the rise of open-weight models, where simply masking model outputs does not suffice to prevent adversaries from recovering harmful predictions. To address this threat, which we call test-time privacy, we induce maximal uncertainty on protected instances while preserving accuracy on all other instances. Our proposed algorithm uses a Pareto optimal objective that explicitly balances test-time privacy against utility. We also provide a certifiable approximation algorithm which achieves $(\varepsilon, \delta)$ guarantees without convexity assumptions. We then prove a tight bound that characterizes the privacy-utility tradeoff that our algorithms incur. Empirically, our method obtains more than 3× stronger uncertainty than pretraining with marginal drops in accuracy on various image recognition benchmarks. Altogether, this framework provides a tool to guarantee protection to end users.

URL: https://openreview.net/forum?id=DkZuek4ZOM

---

Title: FIMP: Foundation Model-Informed Message Passing for Graph Neural Networks

Abstract: Foundation models have achieved remarkable success across many domains, relying on pretraining over vast amounts of data. Graph-structured data often lacks the same scale as unstructured data, making the development of graph foundation models challenging. In this work, we propose Foundation-Informed Message Passing (FIMP), a Graph Neural Network (GNN) message-passing framework that repurposes existing pretrained non-textual foundation models for graph-based tasks. We show that the self-attention layers of foundation models can effectively be leveraged on graphs to perform cross-node attention-based message-passing. Our model is evaluated across diverse domains on image networks, single-cell RNA sequencing, and fMRI brain activity recordings in finetuned and zero-shot settings. FIMP outperforms strong baselines, demonstrating that it can effectively leverage state-of-the-art foundation models in graph tasks.

URL: https://openreview.net/forum?id=fj7sjOwtXc

---

Title: Conjugate and MCMC Bayesian Chain-Rule Prediction-Powered Inference for Binary Prevalence Estimation

Abstract: Prediction-Powered Inference (PPI) combines abundant machine predictions with scarce labels to estimate population functionals with improved efficiency. Recent Bayesian treatments of PPI have introduced general conjugate and Monte Carlo formulations. We focus on a narrower but practically important setting: binary prevalence estimation through the chain-rule functional

$$
g=P(H=1\mid A=1)P(A=1)+P(H=1\mid A=0)P(A=0).
$$

For the base Beta-Bernoulli model, we show that the posterior factorizes into independent Beta distributions, so uncertainty in $g$ can be propagated by direct sampling without MCMC. We reserve NUTS only for genuinely non-conjugate extensions, including hierarchical partial pooling, logit-normal priors, $K$-bin score models, and joint threshold uncertainty.
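A minimal sketch of the direct-sampling idea follows, assuming one plausible data layout: the unlabeled predictions inform $P(A=1)$ and the labeled pairs inform $P(H=1\mid A=a)$, each with a Beta(1, 1) prior. The function name `chain_rule_posterior`, the counts, and the prior are hypothetical and are not taken from the paper.

```python
import random

def chain_rule_posterior(n_a1, n_a0, h1_a1, m_a1, h1_a0, m_a0,
                         draws=20000, prior=(1.0, 1.0), seed=0):
    """Sample g = P(H=1|A=1)P(A=1) + P(H=1|A=0)P(A=0) from independent Betas.

    n_a1, n_a0 : counts of predictions A=1 / A=0 (assumed: unlabeled pool)
    h1_a1, m_a1: labeled items with A=1 that have H=1, out of m_a1
    h1_a0, m_a0: labeled items with A=0 that have H=1, out of m_a0
    """
    rng = random.Random(seed)
    a, b = prior
    samples = []
    for _ in range(draws):
        p_a1 = rng.betavariate(a + n_a1, b + n_a0)
        theta1 = rng.betavariate(a + h1_a1, b + m_a1 - h1_a1)
        theta0 = rng.betavariate(a + h1_a0, b + m_a0 - h1_a0)
        samples.append(theta1 * p_a1 + theta0 * (1.0 - p_a1))
    samples.sort()
    return samples

# Hypothetical counts: 1,000 predictions (300 positive) and 100 labels.
g_samples = chain_rule_posterior(n_a1=300, n_a0=700,
                                 h1_a1=40, m_a1=50,
                                 h1_a0=5, m_a0=50)
g_mean = sum(g_samples) / len(g_samples)
g_lo = g_samples[int(0.025 * len(g_samples))]
g_hi = g_samples[int(0.975 * len(g_samples))]
print(f"posterior mean {g_mean:.3f}, 95% interval ({g_lo:.3f}, {g_hi:.3f})")
```

Because the three Beta posteriors are independent, each draw of $g$ needs only three calls to a Beta sampler, with no Markov chain to tune or diagnose.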

We benchmark this conjugate Bayesian chain-rule estimator (CRE) against three baselines: a labeled-only Bayesian estimator, the classical difference estimator, and a prior-free analytic PPI estimator based on continuous probabilities with small-sample $t$ critical values. Empirically, we report (i) simulation-based calibration for the conjugate engine, (ii) repeated-labeling resampling studies with $M=500$ replications on ADNI-derived out-of-fold prediction tables, and (iii) an Alzheimer's disease MRI case study with out-of-fold threshold selection, bootstrap cut-point dispersion, and propensity-overlap diagnostics. In the full cohort, CRE achieves near-nominal coverage and narrower intervals than the labeled-only Bayesian baseline, while substantially improving stability over the difference estimator at small label budgets; gains are smaller but remain competitive in a fixed 65-70 subset. These results position conjugate Bayesian chain-rule PPI as a lightweight and auditable option for deployment-time prevalence monitoring.

URL: https://openreview.net/forum?id=75xIeUlse6

---

Title: PCHMR: Empowering and Benchmarking Human Mesh Recovery in Privacy-Constrained Real-World Settings

Abstract: Fine-tuning directly on user-side data is an effective way to improve the performance of widely adopted data-driven HMR models. However, doing so for existing HMR methods often assumes aggregating user-side data from real-world HMR-deployed devices to a central training server, posing a significant privacy risk as sensitive human images are transmitted. How can human mesh recovery be evaluated and improved in privacy-constrained real-world settings? This paper serves as a benchmark study of this problem. We conduct a comprehensive benchmark in which state-of-the-art HMR models are trained under federated and secure-aggregation variants that avoid raw-image centralization under explicit threat model assumptions and benchmark their performance under a wide array of realistic client-scale and data-heterogeneity settings. We document that common HMR training pipelines are built around centralized data access and quantify how representative HMR backbones behave when that assumption is removed. Furthermore, to study the data bottleneck of privacy-constrained HMR training, we propose a local annotation and fine-tuning pipeline enhanced with depth foundation models, with which collaboratively trained HMR models can be locally tailored to the end user’s distribution. We demonstrate its effectiveness with results on in-the-wild data while clarifying that the reported personalization gains are measured against DePoser-generated pseudo-ground truth. This benchmark aims to support future work on privacy-constrained HMR models and their real-world deployment and evaluation.

URL: https://openreview.net/forum?id=4XHzBaL0qA

---

Title: Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning

Abstract: Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due to the performance degradation incurred by critic regularization. This actor-centric approach overlooks the potential of data rehearsal for value function approximation. Moreover, existing evaluations in CRL rarely consider multi-cyclic environments where task sequences repeat, a critical real-world scenario that exacerbates forgetting and plasticity. We investigate data rehearsal for Deep Q-Networks using Q-value regularization in multi-cyclic settings and propose Qreg+NWLU, which introduces two simple modifications: (1) continuous data rehearsal that dynamically collects and updates stored Q-values throughout training, and (2) "No-Wait" regularization that applies immediately rather than after the first task. Together, these modifications yield improvements in learning efficiency, forgetting mitigation, and knowledge transfer over Qreg and conventional CRL methods within value function approximation settings.

URL: https://openreview.net/forum?id=wYayhflqqR

---

Title: The Canonical Representation of a Task

Abstract: Generalization in deep learning remains poorly understood, as neural networks fall outside the framework of classical statistical learning theory. To make progress on understanding generalization, research has focused on controlled tasks such as modular arithmetic, as a testbed. On these tasks, models exhibit grokking, i.e., a delayed onset of generalization after training loss has converged. Prior work has identified empirical regularities in the learned representations associated with this transition, but the mapping between representation structure and generalization behavior remains empirical and descriptive. We lack a predictive theory of why and when generalization occurs. In this work, we provide such a predictive theory for modular arithmetic tasks including addition, subtraction, multiplication, and division. We introduce the notion of \textit{canonical representation} of a task: the representation determined by the target function prior to training which is needed for perfect generalization. For modular arithmetic, the canonical representation can be derived from the group structure of the task. We then define \textit{representational deviation} as the discrepancy between the learned representation and the canonical representation which meets a specified target loss. From this, we derive that reaching a prescribed level of generalization requires the representational deviation to fall below a threshold. We finally provide a set of reproducible experiments which empirically confirm the above findings and offer a regularizer to accelerate the grokking transition.

URL: https://openreview.net/forum?id=0g8EXT7cTa

---

Title: OTIS: Learning High-Quality Time Series Features With Tiny Encoders

Abstract: We introduce \texttt{OTIS}, an \textbf{o}pen \textbf{ti}me \textbf{s}eries encoder that yields high-quality time series features for downstream deployment on \emph{any} system, including resource-constrained wearables and industrial sensors. Currently, the development of powerful general-purpose encoders relies on the scaling laws hypothesis, using large encoder sizes to memorise the heterogeneous distributions of multi-domain training data. However, this reliance on scale creates a barrier to real-world utility, rendering deployment on resource-constrained systems infeasible due to strict memory, energy, and latency constraints. Surprisingly, we find that tailoring standard masked modelling pre-training to time series properties yields a tiny $7.1\,$M encoder that matches the state-of-the-art performance of $54\times$ larger encoders across $162$ tasks, while requiring $10\times$ less memory, $43\times$ less energy, and $37\times$ lower latency. To achieve this without the capacity tax, we introduce three novel components: (1) a \textit{domain-aware tokeniser} to resolve conflicting semantics within multi-domain training data; (2) a \textit{dual masking strategy} to capture spatiotemporal structures and temporal causality; and (3) a \textit{structure-aware objective} to decouple feature learning from modelling noise. Consequently, \texttt{OTIS} produces high-quality time series features that enable state-of-the-art performance in discriminative tasks and even extend seamlessly to generative tasks at minimal additional cost. To democratise access to powerful time series features on any system, we release our code and pre-trained weights.

URL: https://openreview.net/forum?id=WW206A1Tru

---

Title: Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Abstract: Recent advances in generative artificial intelligence applications have raised new data security concerns. This paper focuses on defending diffusion models against membership inference attacks. This type of attack occurs when the attacker can determine if a certain data point was used to train the model. Although diffusion models are intrinsically more resistant to membership inference attacks than other generative models, they are still susceptible. The defense proposed here utilizes critically-damped higher-order Langevin dynamics, which introduces several auxiliary variables and a joint diffusion process along these variables. The idea is that the presence of auxiliary variables mixes external randomness that helps to corrupt sensitive input data earlier on in the diffusion process. This concept is theoretically investigated and validated on a toy dataset and the CIFAR-10 dataset using the Area Under the Receiver Operating Characteristic (AUROC) curves and the FID metric.

URL: https://openreview.net/forum?id=5ElZ1uT8wU

---

Title: Evaluating Causal Discovery Algorithms Without Ground Truth in Additive Noise Models

Abstract: Evaluating the performance of causal discovery algorithms is a fundamental challenge in the field of causality. Most performance measures, such as Structural Hamming Distance (SHD) and Structural Intervention Distance (SID), rely on comparing the discovered graph with the ground truth, which is often unavailable in real-world applications. In this paper, we propose a novel evaluation measure called PIM score, based on the Principle of Independent Mechanisms, for restricted Additive Noise Models (ANMs) and Linear Non-Gaussian Acyclic Models (LiNGAMs) to assess whether a given graph represents the true underlying structure based solely on observational data. The proposed measure aggregates the mutual information between the residuals obtained by regressing each variable on its parents in the given graph. We show that the true underlying graph achieves the lowest score in terms of this measure. In particular, we show that the mutual information among residuals is zero if and only if the given graph is the true one. Additionally, we introduce a method for leveraging this performance measure to rank the performance of a set of graphs based on how well they represent the ground truth. We evaluate the performance of our proposed measure using both synthetic and real data in discovering the true graph and compare its performance against two other baseline measures. Experimental results show that our proposed measure exhibits higher correlations with SHD and SID compared to existing approaches, making it a promising measure for evaluating recovered graphs when the true graph is unavailable.

URL: https://openreview.net/forum?id=iQasMGKke2

---

Title: Federated Measurement of Demographic Disparities from Quantile Sketches

Abstract: Fairness audits are often defined for a target population, while the score data needed to conduct them are held by separate institutions, jurisdictions, or clients. This fragmentation creates a local--global mismatch: a scoring rule may appear balanced within each silo but display substantial demographic disparity after aggregation, or conversely hide client-level discrepancies in a pooled summary. We study this problem for score-level demographic parity under federated data constraints. We define the population disparity as a Wasserstein--Fr\'echet variance of sensitive-group score distributions and express the same target through client-level conditional laws and group-specific mixing weights. This representation shows why local audits, and averages of local audits, are generally not valid proxies for the population audit. We propose a one-shot quantile-sketch protocol in which each silo releases only subgroup counts and $k$ empirical score quantiles per sensitive group. The server uses these summaries to reconstruct group-mixture score distributions and estimate the global disparity. For the squared Wasserstein distance, the same summaries yield an ANOVA-style decomposition separating mixture/composition effects, barycentric cross-silo heterogeneity, and their interaction. We prove an $O(1/k)$ deterministic discretization bound and finite-sample guarantees based on quantile concentration. Experiments on synthetic data, \texttt{COMPAS}, and \texttt{ACSIncome} show that local audits can misrepresent population-level disparity, while moderate sketch sizes closely recover the centralized benchmark, including with natural state-level clients in \texttt{ACSIncome}.
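The quantile-sketch idea rests on a standard one-dimensional fact: the squared 2-Wasserstein distance between two score distributions is the integral of the squared difference of their quantile functions, so $k$ released quantiles give a simple discretized estimate. The sketch below illustrates this for a single silo with two sensitive groups. The function names `quantile_sketch` and `w2_squared`, the midpoint quantile levels, and the synthetic Gaussian scores are assumptions for illustration; the paper's multi-silo mixture reconstruction and Fréchet-variance target are not reproduced here.

```python
import random

def quantile_sketch(scores, k):
    """Return k empirical quantiles at midpoint levels (j + 0.5) / k."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[min(n - 1, int((j + 0.5) / k * n))] for j in range(k)]

def w2_squared(sketch_a, sketch_b):
    """Discretized squared 2-Wasserstein distance between two 1-D sketches."""
    assert len(sketch_a) == len(sketch_b)
    return sum((qa - qb) ** 2 for qa, qb in zip(sketch_a, sketch_b)) / len(sketch_a)

rng = random.Random(0)
# Hypothetical scores for two sensitive groups whose means differ by 1.
group_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
group_b = [rng.gauss(1.0, 1.0) for _ in range(5000)]

sketch_a = quantile_sketch(group_a, k=50)
sketch_b = quantile_sketch(group_b, k=50)
w2sq = w2_squared(sketch_a, sketch_b)
print(f"estimated squared W2 from 50 quantiles per group: {w2sq:.3f}")
```

For two unit-variance Gaussians whose means differ by 1, the population value of the squared distance is 1, so the estimate gives a quick sanity check on the sketch size.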

URL: https://openreview.net/forum?id=lzLv2xQUHM

---

Title: From Centerlines to Hemodynamics: Anisotropic RBF Decoders for Coronary Arteries

Abstract: Accurate and rapid estimation of hemodynamic metrics, such as pressure and wall shear stress (WSS), is important for assessing the severity of Coronary Artery Disease (CAD). Existing approaches, including invasive Fractional Flow Reserve (FFR) measurements and computationally expensive Computational Fluid Dynamics (CFD) simulations, face challenges in invasiveness, cost, and speed. We present a framework for fast, non-invasive coronary hemodynamics prediction. The model encodes 1D vessel centerlines together with inlet flow rate using a transformer-based encoder, and predicts continuous wall-based fields via an anisotropic Radial Basis Function (RBF) decoder aligned with vessel morphology. To support training and evaluation, we introduce two datasets with paired steady-state OpenFOAM simulations: (i) a synthetic benchmark of $4{,}200$ single-vessel geometries with controlled anatomical variations, and (ii) a multi-vessel dataset derived from ImageCAS including $4{,}800$ cases spanning both right and left coronary arteries, generated by randomly introducing stenoses and varying physiologically plausible flow rates. Across both datasets, our method achieves lower pressure and WSS errors than strong neural-operator baselines (GNOT, Transolver, and ONO) at a fraction of the computational cost of CFD. On the multi-vessel dataset, using $1{,}024$ anisotropic RBF centers our model reduces the mean relative $\ell_2$ error by $52\%$ compared to the best neural-operator baseline, while at $128$ centers it requires $13.8\times$ fewer FLOPs than GNOT and still outperforms all baselines. The single-vessel dataset is publicly available.

URL: https://openreview.net/forum?id=AoJUrVjufP

---

Title: Large Scale Empirical Bayesian Causal Discovery Using Total Effect Estimates From Intervention Data

Abstract: Inferring the causal relationships among a set of variables in the form of a directed acyclic graph (DAG) is an important but notoriously challenging problem. Recently, advancements in high-throughput genomic perturbation screens have inspired the development of methods that leverage interventional data to improve model identification. However, existing methods still suffer from poor performance on large-scale tasks and fail to quantify uncertainty. Here, we propose Interventional Bayesian Causal Discovery (IBCD), an empirical Bayesian framework that infers the causal graph by using intervention data to estimate the effect of each variable on every other, then inferring the posterior graph given these estimates. For tractability, our approach models the likelihood of the matrix of estimated total causal effects, which can be approximated by a matrix normal distribution, rather than the full data matrix. We place a spike-and-slab horseshoe prior on the edges and separately learn data-driven weights for scale-free and Erdős–Rényi structures from observational data, treating each edge as a latent variable to enable uncertainty-aware inference. Through extensive simulation, we show that IBCD achieves superior structure recovery compared to existing baselines. We apply IBCD to CRISPR perturbation (Perturb-seq) data on 521 genes, demonstrating that edge posterior inclusion probabilities enable identification of robust graph structures.

URL: https://openreview.net/forum?id=yW4T2fsf0l

---

Title: Causal Bayesian Optimization: Foundations, Methods, and Applications

Abstract: Causal Bayesian Optimization (CBO) integrates causal inference with Bayesian optimization to enable sample-efficient intervention selection in systems governed by causal structure. This survey provides a comprehensive and systematic review of the CBO landscape, organizing the growing literature through a unified BO-loop perspective that reveals how causal assumptions shape four core components: intervention search spaces, surrogate construction, acquisition design, and decision policies. We classify methods along recurring design axes, i.e., graph knowledge, intervention representation, uncertainty source, and budget allocation, and establish formal connections between CBO and adjacent fields, including causal bandits, Bayesian experimental design, safe optimization, and policy search. To address the lack of standardized evaluation in the field, we introduce a reproducibility-oriented benchmark that covers hard- and soft-intervention settings, implements both the standard GAP metric and a new trajectory-aware Path-Aware GAP (PA-GAP) metric, and evaluates seven CBO methods alongside a non-causal BO baseline under a common scoring protocol. Our empirical study across thirteen datasets, three budget levels, and two metrics reveals that no single method dominates uniformly: rankings depend critically on the dataset, budget, and metric, and strong non-causal baselines remain competitive in several settings. We conclude by identifying six open challenges, including robustness to hidden confounding, scalable unknown-graph optimization, mixed intervention types, realistic cost models, tighter theoretical guarantees, and integration with modern representation learning, that must be addressed for CBO to transition from proof-of-concept demonstrations to reliable real-world deployment.

URL: https://openreview.net/forum?id=XT6DC37m5I

---

Title: When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Abstract: Mixture-of-Experts (MoE) networks promise favorable accuracy--compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a \emph{compute-leverage pattern}: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted point-estimate sign reversals across standard and depthwise backbones (one statistically supported, one directional), ruling out backbone family alone as the explanation. An ImageNet-1K ablation that varies only top-$k$---holding architecture, initialization, and $\rho$ fixed---reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code is included in the supplementary material.

URL: https://openreview.net/forum?id=uZSh6WL0SI

---

Title: Lipschitz-Guided Design of Interpolation Schedules in Generative Models

Abstract: We study the design of interpolation schedules in flow and diffusion-based generative models from both statistical and numerical perspectives.
Within the stochastic interpolants framework, we first show that scalar interpolation schedules are statistically equivalent under the Kullback--Leibler divergence in path space, after optimal a posteriori tuning of the diffusion coefficient.
This equivalence motivates focusing on numerical properties of the drift field rather than purely statistical criteria.
We propose minimizing the averaged squared Lipschitzness of the drift as a principled criterion for schedule design, in contrast with kinetic-energy minimization in optimal transport.
A simple transfer formula expresses the drift of one schedule in terms of the drift of another, allowing the designed schedule to be used at inference time with a model trained under a different (e.g., linear) schedule, without retraining.
We work out the optimal schedules analytically for Gaussian and Gaussian-mixture targets: for Gaussians, we obtain exponential improvements in the Lipschitz constant over linear schedules; for Gaussian mixtures, we obtain schedules that mitigate mode collapse in few-step sampling.
We then validate the approach on high-dimensional invariant measures of stochastic Allen--Cahn and Navier--Stokes equations, where the designed schedule yields markedly more accurate fine-scale statistics at fixed integrator budget.

URL: https://openreview.net/forum?id=e0ET3eweRc

---

Title: Unified Deployment-Aware Evaluation of Open Reasoning Language Models

Abstract: Open reasoning language models are often compared under mixed sample sizes, partially standardized prompts, and accuracy-centered summaries, which makes practical model selection difficult to interpret. We present a unified evaluation of seven open reasoning language model configurations across four benchmarks, namely ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1, under three prompting strategies: zero-shot, chain-of-thought (CoT), and few-shot CoT. Every model--dataset--strategy condition is evaluated on the same 238-example subset, which yields a complete 7 × 4 × 3 design with 84 conditions and 19,992 evaluated examples. In addition to accuracy, we report Wilson confidence intervals, latency, peak video random access memory (VRAM), weighted aggregate performance, Pareto-efficient operating points, prompt-sensitivity metrics, and compatibility diagnostics. Under this unified protocol, the highest weighted score is achieved by Gemma-4-26B-A4B with zero-shot prompting at 0.794, while Gemma-4-E4B remains close to the top across prompting settings with substantially lower latency and memory, making it a particularly attractive practical operating point. Bootstrap and paired-permutation analyses show that top weighted configurations are close enough that deployment tradeoffs remain important. We further find that prompting strategy changes ranking order rather than simply shifting all models in the same direction, and that benchmark-specific complementarity creates measurable routing headroom: an oracle task-aware selector reaches a weighted score of 0.825. Finally, compatibility diagnostics reveal that some apparent failures, especially for Phi-4-Reasoning on GSM8K, reflect deployment-relevant robustness and interface-adherence problems under the shared evaluation pipeline. 
These results support a central claim: open-model evaluation should be framed as a deployment-aware, multi-objective operating-point problem rather than as a single-score leaderboard exercise.
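
The design arithmetic and the Wilson confidence intervals mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name `wilson_interval` and the example count of 190 correct answers are hypothetical, and a 95% level (z = 1.96) is assumed.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Design size stated in the abstract: 7 configurations x 4 benchmarks x 3 strategies.
conditions = 7 * 4 * 3
examples = conditions * 238  # every condition uses the same 238-example subset

# Hypothetical condition: 190 correct out of 238 (not a number from the paper).
low, high = wilson_interval(190, 238)
print(conditions, examples)
print(f"accuracy={190 / 238:.3f}, Wilson CI=({low:.3f}, {high:.3f})")
```

With only 238 examples per condition, the interval is several points wide, which is consistent with the abstract's remark that top configurations are close enough for deployment tradeoffs to matter.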

URL: https://openreview.net/forum?id=FBmgI8UMt8

---

Title: AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration — Learning from Cheap, Optimizing Expensive

Abstract: Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

URL: https://openreview.net/forum?id=Z0H3LjAM7O

---

Title: Fi-sLSTM-Mixer: Multivariate Time Series Forecasting with Fuzzy-Conditioned Scalar Memories

Abstract: Multivariate time series forecasting in real-world deployments must contend with noisy, uncertain, and shifting data conditions that expose a structural weakness in state-of-the-art recurrent architectures: their gates rely on deterministic pre-activations and cannot adapt to input reliability. We introduce Fi-sLSTM-Mixer, a Fuzzy-integrated sLSTM that augments xLSTM-Mixer with a fuzzy relevance value $r_t$ derived from an ITTTFL inference system and injected into the forget and output gate pre-activations via a zero-initialized projection $W_r$. The normalized input gate is provably invariant to $r_t$ by construction, producing a clean separation between data-driven variate attention and reliability-driven memory modulation. A twelve-test mechanistic protocol across 12 benchmarks confirms that the model learns consistent, domain-interpretable routing policies, statistically significant on every dataset, without sacrificing the backbone’s representational capacity. Empirically, Fi-sLSTM-Mixer achieves 41 wins across 90 metric slots against five state-of-the-art baselines, with the largest gains on volatile industrial and high-dimensional streams where reliability signals matter most, at a cost of under 600 additional parameters and negligible training overhead.

URL: https://openreview.net/forum?id=lYKDDI7prJ

---

Title: Learning What to Fail On: Failure-Mode Contextual Bandits for Adversarial Data Curation

Abstract: We introduce a failure-aware adversarial retrieval-augmented framework for improving robustness in natural language understanding. Rather than selecting synthetic examples with a fixed reward threshold, our method formulates adversarial data curation as a failure-mode contextual bandit problem. Candidate examples are generated with retrieval-augmented prompting, filtered by the current target model, automatically validated by an LLM judge ensemble, and clustered into recurring failure modes. A stochastic policy then selects which failure modes to sample for retraining, and is updated using validation-based reward that balances robustness gains, forgetting, and data cost. This makes the data curator itself the learning agent, enabling adaptive selection of the most useful model failures across training rounds. On standard benchmarks, our approach improves RoBERTa-base accuracy from 88.48% to 92.60% on SNLI, from 75.04% to 80.95% on ANLI, and from 54.67% to 71.99% on MultiNLI, while consistently outperforming prior adversarial augmentation methods. We further demonstrate transfer to FEVER fact verification, achieving up to 79.86% FEVER score and 82.45% accuracy with RoBERTa-large. Finally, we provide a theoretical interpretation showing that, under stated assumptions, failure-mode sampling can reduce shortcut-aligned gradient contributions while inducing bounded distributional drift. By combining retrieval, automated validation, contextual-bandit failure selection, and controlled adversarial retraining, our framework enables scalable robustness improvement without additional human annotation.

URL: https://openreview.net/forum?id=DSpKdN6whZ

---

Title: FieldFormer: Locality-Aware Transformers for Spatio-Temporal Modeling on Sparse Sensor Networks

Abstract: Spatio-temporal sensor data in real-world systems is often sparse, noisy, and irregular, making it difficult to infer global structure from limited observations. Under extreme sparsity, we run into the limits of identifiability of latent system states, making latent field reconstruction fundamentally underconstrained. In such scenarios, multiple physically plausible fields may remain consistent with the same observations, requiring reconstruction models to rely heavily on inductive biases regarding locality, transport structure, and spatial regularity.

Under such sparsity regimes, reliable reconstruction becomes concentrated around the observational support induced by the sensor network, making sensor-space modeling a more identifiable objective than unconstrained global field recovery. We introduce FieldFormer, a mesh-free transformer architecture designed for locality-aware sensor-space modeling in persistent sensor networks. For each query, FieldFormer aggregates local evidence using a learnable velocity-scaled distance metric that adapts neighborhood geometry to heterogeneous spatio-temporal relationships. Neighborhoods are constructed as fixed maximal sparse contexts over nearby sensors and bounded temporal windows, while learned velocity-scaled offsets modulate token geometry within this context, enabling stable and scalable inference under extreme sparsity. A local transformer encoder integrates neighborhood information, while global consistency is modeled through coordinate-based neural field formulation.

We evaluate FieldFormer across five benchmarks spanning synthetic and real-world spatio-temporal systems, including anisotropic heat diffusion, shallow-water dynamics, atmospheric transport fields, and pollution monitoring datasets. Our results reveal that locality-aware reconstruction provides strong advantages in persistent sparse sensor networks where local domains of dependence remain observed, enabling FieldFormer to consistently outperform state-of-the-art baselines on sensor-space prediction tasks under highly sparse and noisy sensing regimes.

URL: https://openreview.net/forum?id=we4FYGOE2y

---

Title: Uncertainty Quantification for LLM Function-Calling

Abstract: Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when their effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM's confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language Q&A tasks, we find that in the FC setting, they offer no clear advantage over simple single-sample UQ methods. Additionally, we find that the particularities of FC outputs can be leveraged to improve the performance of existing UQ methods in this setting. Specifically, multi-sample UQ methods benefit from clustering FC outputs based on their abstract syntax tree parsing, while single-sample UQ methods can be improved by selecting only semantically meaningful tokens when calculating logit-based uncertainty scores.
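
The idea of clustering sampled function calls by their abstract syntax tree can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: it assumes the sampled outputs are Python-style call strings, and the helpers `canonical_call` and `cluster_entropy` and the `transfer` example are hypothetical.

```python
import ast
import math
from collections import Counter

def canonical_call(call_str: str) -> str:
    """Parse a Python-style call and return a key that ignores whitespace,
    quote style, and keyword-argument order."""
    node = ast.parse(call_str.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {call_str!r}")
    name = ast.unparse(node.func)
    positional = [ast.dump(arg) for arg in node.args]
    keywords = sorted((kw.arg or "**", ast.dump(kw.value)) for kw in node.keywords)
    return repr((name, positional, keywords))

def cluster_entropy(samples: list[str]) -> float:
    """Entropy (in nats) of the empirical distribution over AST-equivalent clusters."""
    clusters = Counter()
    for sample in samples:
        try:
            key = canonical_call(sample)
        except (SyntaxError, ValueError):
            # Design choice of this sketch: all unparseable outputs share one cluster.
            key = "<unparseable>"
        clusters[key] += 1
    total = sum(clusters.values())
    return -sum((c / total) * math.log(c / total) for c in clusters.values())

# Hypothetical samples: three spellings of the same call and one different amount.
samples = [
    'transfer(amount=100, to="alice")',
    "transfer(to='alice', amount=100)",
    'transfer( amount = 100 , to = "alice" )',
    'transfer(amount=1000, to="alice")',
]
entropy = cluster_entropy(samples)
print(f"cluster entropy: {entropy:.4f}")
```

Comparing raw strings would treat all four samples as distinct answers and overstate the uncertainty; grouping by parsed structure leaves two clusters of sizes three and one.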

URL: https://openreview.net/forum?id=5ah8wJpDRt

---

Title: Confidence Calibration in Vision-Language-Action Models

Abstract: Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present a first-of-its-kind study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural language instructions to low-level robot motor commands. We establish a confidence estimation baseline for VLAs, examine how task success relates to calibration error and how calibration evolves over time, and introduce two lightweight techniques to remedy the miscalibration we observe: prompt ensembles and action-wise Platt scaling. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs trustworthy via reliable uncertainty quantification.

URL: https://openreview.net/forum?id=OmpTwegbHR

---

Title: High-Dimensional Online Change Point Detection with Adaptive Thresholding and Interpretability

Abstract: Change point detection (CPD) identifies abrupt and significant changes in sequential data, with applications in human activity recognition, financial markets, cybersecurity, manufacturing, and autonomous systems. Traditional CPD methods often face computational challenges in high-dimensional settings and typically provide limited explanations for detected changes, which can restrict their practical usability. This paper introduces a CPD framework that improves scalability and interpretability by leveraging the Sliced Wasserstein (SW) distance.
Our contributions are fourfold: (1) we transform multivariate sequential data into one-dimensional scores using the SW distance, making the resulting representation compatible with existing CPD methods; (2) we analyze the distributional behavior of random slices of the SW distance and show that, under suitable assumptions, they can be approximated by a Gamma distribution, providing a principled basis for threshold calibration; (3) we propose a self-adapting online CPD algorithm that combines this SW-based score with an adaptive quantile-based threshold; (4) we introduce a model-specific framework for generating contrastive explanations for annotated change points.
Empirically, our method reduces false positives by at least 48% on average compared with popular online and offline CPD baselines, while maintaining competitive or superior detection performance (code is available at https://anonymous.4open.science/r/SWCPD-7022). At the same time, it produces interpretable change-point annotations, making it practical for deployment in high-stakes applications.
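
Contribution (1), reducing two multivariate windows to a single scalar score via random one-dimensional projections, can be sketched as follows. This is a generic Monte Carlo estimate of the sliced Wasserstein distance, not the authors' implementation: the order (1-Wasserstein), the number of slices, the equal window sizes, and the synthetic Gaussian windows are all assumptions made for illustration.

```python
import math
import random

def sliced_wasserstein(x, y, n_slices=64, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance between two
    equal-sized samples x and y (lists of d-dimensional points)."""
    if len(x) != len(y) or not x:
        raise ValueError("this sketch needs two non-empty samples of equal size")
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_slices):
        # Random unit direction: normalise a standard Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both samples onto the direction and sort the 1-D values.
        px = sorted(sum(a * b for a, b in zip(p, direction)) for p in x)
        py = sorted(sum(a * b for a, b in zip(p, direction)) for p in y)
        # For equal-sized 1-D samples, W1 is the mean gap between order statistics.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_slices

# Synthetic windows (hypothetical data): 50 points in 20 dimensions each.
data_rng = random.Random(1)

def window(mean, n=50, dim=20):
    return [[data_rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

reference = window(0.0)
same = window(0.0)       # same distribution as the reference
shifted = window(2.0)    # mean shift in every coordinate

score_same = sliced_wasserstein(reference, same)
score_shifted = sliced_wasserstein(reference, shifted)
print(f"no change: {score_same:.3f}, after shift: {score_shifted:.3f}")
```

Each slice costs only a projection and a sort, which is what keeps the score cheap in high dimensions; the resulting scalar sequence can then be compared against a threshold.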

URL: https://openreview.net/forum?id=4ewaiYXoiv

---

Title: Auditing Closed-Loop Learning in Recurrent Neural Networks: Reproduction, Robustness, and Generalization

Abstract: Recurrent neural networks are often used as mechanistic models of learning and control, but closed-loop training creates reproducibility challenges because a model's actions alter future inputs. We conduct a claim-level reproducibility study of Ger and Barak's closed-loop RNN learning dynamics, testing independent implementation, seed variation, protocol perturbations, coupled-system diagnostics, and architecture/task transfer. Under a main-text-aligned double-integrator protocol, the trajectory-level peak, not a persistent final gap, reproduces strongly: 50/50 paired seeds show the post-initial open-loop deployed-loss peak, with a mean peak/initial ratio of 19.1, while final open-loop and closed-loop losses converge after open-loop recovery. The spectral stage and coupled-stability diagnostics also reproduce in 50/50 seeds. A targeted A1 analysis separates stability and behavioral tradeoffs: short-horizon improvements coincide with coupled-radius increases in 120/120 runs; long-horizon loss worsening occurs in 62/120, but never without the radius signal. Generalization is hierarchical: GRU variants preserve final-loss divergence, low-rank variants often produce open-loop deployed rollout blow-ups, tanh RNNs preserve the peak signature without a final gap, tracking transfer is strong, and path-integration transfer is weak. These results motivate reporting practices for closed-loop RNN studies: deployed closed-loop loss, peak signatures, paired seeds, spectral stage criteria, coupled-system spectra, feedback strength, rollout horizon, and failure rates.

URL: https://openreview.net/forum?id=e0pSqIqwXO

---

Title: Convergence Analysis of Wasserstein Proximal Algorithm beyond Geodesic Convexity

Abstract: The proximal algorithm is a powerful tool to minimize nonlinear and nonsmooth functionals in a general metric space. Motivated by the recent progress in studying the training dynamics of the noisy gradient descent algorithm on two-layer neural networks in the mean-field regime, we provide in this paper a simple and self-contained analysis for the convergence of the general-purpose Wasserstein proximal algorithm without assuming geodesic convexity on the objective functional. Under a natural Wasserstein analog of the Euclidean Polyak-Łojasiewicz inequality, we show that the proximal algorithm achieves an unbiased and dimension-free linear convergence rate. Our convergence rate improves upon existing rates of the proximal algorithm for solving Wasserstein gradient flows when specialized to strongly geodesically convex functionals. We also extend our analysis to the inexact proximal algorithm for geodesically semiconvex objectives. In our numerical experiments, proximal training demonstrates a faster convergence rate than the noisy gradient descent algorithm on two-layer mean-field neural networks.

URL: https://openreview.net/forum?id=GU9JPTcBsM

---

Title: Revisiting Dynamic Graphs from the Perspective of Time Series

Abstract: Numerous studies have investigated temporal modeling in dynamic graphs. Existing approaches predominantly fall into two categories: discrete-time dynamic graph (DTDG) methods and continuous-time dynamic graph (CTDG) methods. While both paradigms have shown effectiveness in capturing temporal dependencies, they suffer from several inherent limitations. Specifically, DTDG approaches often lose fine-grained temporal information due to snapshot-based discretizations, whereas CTDG methods preserve precise timestamps but may struggle to capture long-range temporal dependencies because of computational constraints. Moreover, interactions in real-world dynamic graphs frequently exhibit predictable and recurring temporal patterns, which are not fully exploited by existing methods. To better leverage such regularities, we propose to transform node interactions into binary time-series representations, enabling explicit modeling of temporal patterns. Building on this formulation, we introduce a novel model, termed Time Series-based Dynamic Graph (TSDyG), which approaches dynamic graph learning from a time-series perspective. Compared to existing DTDG and CTDG methods, TSDyG offers several advantages: it preserves fine-grained temporal information, captures long-range dependencies, and effectively captures recurring interaction patterns. We conduct extensive experiments on multiple benchmark datasets, and the results demonstrate that TSDyG achieves competitive performance on downstream tasks such as temporal link prediction.

URL: https://openreview.net/forum?id=rq697OS6FU

---

Title: Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

Abstract: Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at https://anonymous.4open.science/r/DePPA-5E76.

URL: https://openreview.net/forum?id=5sjQt9I2mt

---

Title: ReSink: Stop Words to Improve Training-Free Referral Segmentation

Abstract: Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features, namely attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision–language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop ReSink, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach achieves strong performance and surpasses prior methods on most datasets, establishing a new state of the art without fine-tuning, additional components, or complex reasoning.

URL: https://openreview.net/forum?id=5DTaxhAP4h

---

Title: U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

Abstract: As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.

URL: https://openreview.net/forum?id=DmgXSefvC0

---

Title: HyperCLIP: Prompt-Conditioned Image Encoders for Contrastive Vision-Language Pre-training

Abstract: CLIP-style image encoders are trained to be discriminative for every category set a user might supply, since the category set is unknown at training time. This makes the encoder's job harder than the job any single deployment actually requires, and is part of why small image encoders underperform large ones on zero-shot classification. In CLIP, the class prompts available at inference are used only to define the classifier head; we argue they carry more task structure than this role exposes, enough to also modulate the image encoder's feature extraction through a small channel (BatchNorm scale and bias). We provide evidence for this view by introducing HyperCLIP, a contrastive pre-training architecture in which a hypernetwork generates the BatchNorm scale and bias of a small image encoder directly from the class-prompt embeddings produced by the text encoder, with all three components trained jointly under the SigLIP loss. Across eight small vision backbones, HyperCLIP improves zero-shot accuracy over a matched SigLIP baseline by up to 3.3% on ImageNet-1K and 5.6% on CIFAR-100; the gains concentrate in BatchNorm-rich backbones, are equivalent to one step up the EfficientNet scaling ladder, and recover roughly half of what supervised BatchNorm fine-tuning can achieve, without any task labels and with no added inference-time cost.

URL: https://openreview.net/forum?id=kX7iwF5s3v

---

Title: Safe Online Learning via Smooth Safety-Structured Policy Composition

Abstract: Safe online reinforcement learning requires policies to respect safety constraints while maintaining smooth optimization dynamics. Existing approaches typically rely on either strict safety enforcement via action interventions, which introduce discontinuities in system interaction and learning, or soft safety constraint formulations, which preserve smooth learning but provide limited safety assurance. We propose AutoSafe, a safety-aware policy architecture that integrates structured safety monitoring and intervention directly into the action generation process. This design enables smooth, risk-dependent transitions between performance-driven and safety-preserving behaviors, resulting in continuous online interaction and learning dynamics. Empirical results across a suite of continuous-control benchmarks demonstrate strong safety enforcement without sacrificing learning smoothness. We further validate AutoSafe on a physical cart-pole system, highlighting its practical effectiveness for safe online learning in the real world.

URL: https://openreview.net/forum?id=kRYK1jqdBz

---

Title: From Linking Homophily and Label Informativeness to Rewiring in GNNs

Abstract: Message-passing graph neural networks (GNNs) are widely used for node classification. These models learn node representations by aggregating information along the edges of a given graph. A central open question remains which graph properties make message passing effective. While homophily was long viewed as a key ingredient, recent work has increasingly questioned this view, arguing that message passing can remain effective under heterophily when the label distribution is informative, i.e., when a node's label is predictable from its neighbors' labels. In this work, we bridge these perspectives by formally connecting label distribution informativeness and homophily, showing they are not independent and, crucially, that strong neighbor-label predictability is unlikely when homophily is low under realistic multi-class label marginals. Building on this insight, we propose a rewiring framework that increases homophily using a reference edge set, providing guarantees on the homophily of the rewired graph and, in regimes we characterize, also provably strengthening neighbor-label predictability. Across diverse heterophilic benchmarks, our approach outperforms existing rewiring methods and specialized heterophily GNNs, yielding higher node-classification accuracy while remaining efficient and scalable to large graphs.

URL: https://openreview.net/forum?id=M0C1hvJo6m

---

Title: Minimal-Intervention KV Retention via Set-Conditioned Diversity

Abstract: KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families on long-form mathematical reasoning (MATH-500~\citep{hendrycks2021math}) at budgets $b \in \{64, 128\}$, under an evaluation standard that tightened over the study and converged on matched mean cache with $n \geq 200$ on two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\citep{deepseek2025r1}). All seven were rejected as catalogue directions, one on screening-grade evidence. We then propose $\alpha$, a one-function modification to the TriAttention~\citep{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $\lambda$. A pre-registered protocol tunes $\lambda$ on a frozen development split and confirms on a disjoint held-out split; with $\lambda = 0.5$, $\alpha$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch~A triggers. The finding is asymmetric: the surviving mechanism was among the smallest tested, but minimality alone did not predict survival (two comparably small scoring modifications were also rejected), so what distinguished $\alpha$ was its set-conditioned selection rule, in which each retention decision depends on the already-retained set, rather than its size. The combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
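
To make "set-conditioned selection" concrete, here is a minimal sketch of greedy selection with a redundancy penalty. It is not the paper's scorer: the penalty form (maximum cosine similarity to already-kept value vectors), the function names, and the toy data are all assumptions. Only the general idea, that each pick depends on what has already been retained and that a single weight `lam` controls the penalty, comes from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def greedy_retain(scores, values, budget, lam):
    """Greedily keep `budget` indices.

    Each step picks the index maximizing
        score - lam * (max cosine similarity to the already-kept value vectors),
    so the decision is conditioned on the retained set (assumed penalty form).
    """
    kept = []
    remaining = set(range(len(scores)))
    while remaining and len(kept) < budget:
        def gain(i):
            redundancy = max((cosine(values[i], values[j]) for j in kept), default=0.0)
            return scores[i] - lam * redundancy

        best = max(sorted(remaining), key=gain)
        kept.append(best)
        remaining.remove(best)
    return kept


# Hypothetical data: entries 0 and 1 have identical value vectors.
scores = [1.0, 0.9, 0.5]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

top_k = greedy_retain(scores, values, budget=2, lam=0.0)    # plain top-k behavior
diverse = greedy_retain(scores, values, budget=2, lam=0.5)  # redundancy penalized
print(top_k, diverse)
```

With `lam=0.0` the procedure reduces to ordinary top-k by score; with `lam=0.5` the redundant second entry is skipped in favor of a lower-scoring but dissimilar one.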

URL: https://openreview.net/forum?id=jzZDL2kqTP

---

Title: Bayesian neural networks with Dirichlet process priors for reinforcement learning

Abstract: We introduce a new class of Bayesian Neural Networks (BNNs) that capture (Bayesian) uncertainty in predictions by exploiting uncertainty about the underlying training-data-generating distribution, treating it as a random variable distributed according to Bayesian nonparametric priors on the space of distribution functions, i.e., Dirichlet Processes (DPs). We show that these DP-based BNNs provide a generalized Bayesian framework for designing randomized value-function-based deep reinforcement learning (RL) algorithms. Crucially, RL with DP-BNNs makes it possible to introduce a "prior" mechanism in a principled Bayesian manner. Such a "prior" mechanism has previously been shown to be decisive (Osband et al., 2018) in the success of randomized value-function-based deep RL algorithms, but a principled Bayesian procedure for it remained unknown.

URL: https://openreview.net/forum?id=VrWIwB8g3Z

---

Title: Zero Attribution Is Not Zero Influence: Feature Lock Attacks and the Limits of Post-Hoc Fairness Auditing

Abstract: Post-hoc explainability methods such as SHAP have become the de facto standard as auditing tools to detect whether protected features influence a machine learning model's prediction. The reliability of this auditing paradigm rests on the assumption that these methods accurately report a feature's influence. We demonstrate that this paradigm is fundamentally vulnerable to a class of input-layer manipulation attacks. This work introduces the Feature Lock Attack, a post-hoc adversarial wrapper that allows a model trained with a protected feature to evade detection by any perturbation-based post-hoc explainability audit where attribution depends on observing output variation when a feature is perturbed. The attack guarantees zero Shapley attribution by construction as it triggers the Dummy Player axiom of cooperative game theory. We then extend this guarantee to LIME and formalize the theoretical boundary of the attack. This paper evaluates the attack across 40 distinct experimental configurations. The attack suppresses the SHAP and LIME attributions to the noise floor of genuine non-use, with zero accuracy cost. Furthermore, the attack becomes proportionally more effective as the model's dependence on the protected feature grows. Our results show that under adversarial deployment, relying on post-hoc explainability tools for fairness auditing is fundamentally brittle, as zero attribution is not evidence of equity, but an artifact of non-detection.
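
The Dummy Player axiom mentioned above says that a player whose presence never changes the payoff of any coalition receives a Shapley value of exactly zero. The sketch below checks this on a small cooperative game by exact enumeration. It illustrates the axiom only, not the Feature Lock Attack itself; the game `v` and all names are hypothetical.

```python
from itertools import permutations


def shapley_values(n, value):
    """Exact Shapley values for an n-player game.

    `value` maps a frozenset of player indices to a float. Enumerates all
    n! orderings, so this is only practical for very small n.
    """
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            phi[p] += value(with_p) - value(coalition)  # marginal contribution
            coalition = with_p
    return [x / len(orders) for x in phi]


def v(coalition):
    """Hypothetical payoff: depends on players 0 and 1 only; player 2 is a dummy."""
    return (
        2.0 * (0 in coalition)
        + 1.0 * (1 in coalition)
        + 0.5 * (0 in coalition and 1 in coalition)
    )


phi = shapley_values(3, v)
print(phi)
```

Player 2 receives zero attribution because no coalition's payoff changes when it joins, which is the situation a perturbation-based audit cannot distinguish from genuine non-use.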

URL: https://openreview.net/forum?id=FKKhGYNBkq

---

Title: The Feynman Trap: When Chain-of-Thought Reasoning Undermines Correct Intuitions in Language Models

Abstract: Chain-of-thought (CoT) prompting is widely assumed to uniformly improve LLM reasoning. We identify and quantify the Feynman Trap: a systematic phenomenon where models that answer correctly under zero-shot conditions produce incorrect answers when prompted to reason step-by-step. Across four 7B-class models (Qwen2.5-7B, Llama-3.1-8B, Llama-2-7B, Mistral-7B) on GSM8K (n=1,319) and CoQA (n=500), we find that 9.5–88% of zero-shot-correct answers flip to incorrect under CoT (corrected for extraction artifacts), with flip rate inversely associated with model capability. CoT is net positive on math but net negative on conversational text QA for all four models (−3.6 to −26.8 percentage points). Counterfactual prompt controls show that format-free brief reasoning produces comparable flip rates, suggesting flips are not solely due to CoT formatting artifacts. Temporal analysis reveals models commit to errors early (median 64–145 tokens) and do not self-correct. Self-consistency (SC@5) rescues 9–67% of flips, with a generally positive but non-monotonic association with capability. Our findings challenge the assumption that reasoning is uniformly beneficial and suggest that the choice to use CoT should be task- and model-dependent.
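
The flip rate described above is conditional: among items answered correctly zero-shot, it is the share that become incorrect under CoT. A minimal sketch of that calculation follows. The function name and the toy correctness flags are hypothetical, and the abstract's correction for answer-extraction artifacts is not modeled here.

```python
def flip_rate(zero_shot_correct, cot_correct):
    """Share of zero-shot-correct items that are incorrect under CoT.

    Both arguments are per-item booleans in the same order.
    Returns NaN if no item was correct zero-shot.
    """
    if len(zero_shot_correct) != len(cot_correct):
        raise ValueError("inputs must align item by item")
    base = [c for z, c in zip(zero_shot_correct, cot_correct) if z]
    if not base:
        return float("nan")
    return sum(1 for c in base if not c) / len(base)


# Hypothetical per-item results for five questions.
zs = [True, True, True, False, True]
cot = [True, False, True, True, False]

rate = flip_rate(zs, cot)
print(rate)
```

Note that the item CoT fixes (index 3) does not enter the flip rate; it would count toward net accuracy change, which is a separate quantity.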

URL: https://openreview.net/forum?id=JNhGH8NVUL

---

Title: Differentially Private Federated Clustering with Random Rebalancing

Abstract: Federated clustering aims to group similar clients into clusters and produce one model for each cluster. Such a personalization approach typically improves model performance compared with training a single model to serve all clients, but can be more vulnerable to privacy leakage. Directly applying client-level differentially private (DP) mechanisms to federated clustering can degrade utility significantly. We identify that such deficiencies are mainly due to the difficulty of averaging privacy noise within each cluster (following standard privacy mechanisms), as the number of clients assigned to each cluster is uncontrolled. To this end, we propose a simple and effective technique, named RR-Cluster, that can be viewed as a lightweight add-on to many federated clustering algorithms. RR-Cluster achieves reduced privacy noise via randomly rebalancing cluster assignments, guaranteeing a minimum number of clients assigned to each cluster. We analyze the tradeoffs between decreased privacy noise variance and potentially increased bias from incorrect assignments and provide convergence bounds for RR-Cluster. Empirically, we demonstrate that RR-Cluster plugged into existing federated clustering algorithms results in significantly improved privacy/utility tradeoffs across both synthetic and real-world datasets.

URL: https://openreview.net/forum?id=7PoXKFo0AZ

---

Title: Transformers on Consumer Hardware: A Critical Perspective of TinyML Optimization Techniques and Open Problems

Abstract: Deep learning models that are based on the transformer architecture have a reputation for requiring large compute resources for training and inference. This requirement has placed transformer-based models, such as large language and other generative AI models, beyond the reach of low-resource devices, which make up most of the computer systems in the world. Conversely, machine learning is currently experiencing a revolution of a smaller sort, in which techniques under the umbrella of TinyML are optimizing feed-forward and convolutional models to run successfully on these low-resource devices. Gated access to high-performance compute clusters, rising compute costs, and lack of general access have driven research in combining these two fields. Today, TinyML techniques such as pruning, quantization, and software-hardware co-design are being applied to transformer-based models in order to deploy them on low-resource and edge devices. Analysis of the surveyed works reveals that the techniques applied are largely orthogonal to one another, that knowledge distillation is significantly underrepresented, and that edge training remains rare. The most accessible path toward progress lies in combining the independently developed contributions already present in the literature.

URL: https://openreview.net/forum?id=fwKQTX8Aew

---

Title: Adaptive Human–AI Coordination via Hierarchical Action Disentanglement

Abstract: Human–AI collaboration requires intelligent agents that can rapidly adapt their strategies to diverse partner styles and skill levels, while remaining capable of coordinating with previously unseen partners. Existing deep hierarchical reinforcement learning (DHRL) approaches often collapse to a single behavior or produce diverse behaviors that do not align with partner dynamics, leading to suboptimal coordination. To address these challenges, we introduce Intrinsic Action Disentanglement (IAD), a DHRL-based approach that trains agents to discover distinct low-level action sequences corresponding to different partner behaviors. IAD achieves this through a novel intrinsic reward that encourages the low-level policy to produce disentangled action distributions conditioned on high-level latent skills. This design ensures that each high-level skill is mapped to a distinct, partner-aware response, enabling agents to flexibly adapt to partners with varying skill levels and coordination styles while maintaining robust coordination with previously unseen partners. We evaluate IAD extensively in the collaborative Overcooked-AI environment across multiple layouts, each presenting unique coordination challenges. Agents are tested with large, unseen populations of partners characterized by varying skill levels and behavioral styles, as well as a human-proxy model trained from human–human gameplay data. Analyses of skill usage reveal that IAD effectively utilizes its full set of skills, dynamically switching between them to adapt to diverse partner behaviors. Across all settings, IAD consistently outperforms baseline methods, achieving higher returns and robust coordination.

URL: https://openreview.net/forum?id=mwPpbDkxJf

---

Title: Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Abstract: The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where users upload their data to a service provider to get a customized model that excels on the user’s selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.

URL: https://openreview.net/forum?id=2XLRmgnqdu

---

Title: Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities

Abstract: We consider Representation Misdirection (RM), a class of large language model (LLM) unlearning methods that achieve forgetting by redirecting the forget-representations, that is, latent representations of forget-samples, toward a target vector. Despite its importance, the role of the target vector used in RM remains underexplored. Here, we revisit RM through the lens of the Linear Representation Hypothesis. Specifically, if one can identify a one-dimensional representation corresponding to a high-level concept, the Linear Representation Hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning via RM elicits controllable emergent side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically validated across a wide range of tasks, including behavioral control (e.g., controlling unlearned models' truthfulness, sentiment, refusal, and language) and capability enhancement (e.g., improving unlearned models' in-context learning (ICL) capability). Our findings reveal that this phenomenon could be either a hidden risk if misused or a mechanism that can be harnessed for developing unlearned models that require stronger capabilities and controllable behaviors.

URL: https://openreview.net/forum?id=5CzZVcDtk6

---

Title: Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Sensor Degradation

Abstract: Pre-trained vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated remarkable generalizability, enabling a wide range of applications, including zero-shot classification. However, vision-language models still struggle to handle distribution shifts, where input samples have large gaps from training ones. We found that CLIP is especially vulnerable to sensor degradation, a type of realistic distribution shift caused by sensor conditions such as weather, light, or noise. Collecting a new dataset from a test distribution for fine-tuning is highly costly since sensor degradation occurs unexpectedly and has a wide variety of types. Thus, we investigate test-time adaptation (TTA) of zero-shot classification, which enables on-the-fly adaptation to the test distribution with unlabeled test data. Existing TTA methods for CLIP mainly focus on modifying image and text embeddings or predictions to address distribution shifts. Although these methods can adapt to domain shifts, such as out-of-distribution or different renditions in input images, they fail to adapt to distribution shifts beyond domain shifts, e.g., sensor degradation. We found that uniformity of image embeddings, which is related to the amount of information, is a key factor that differentiates domain shifts and other distribution shifts. To enable adaptation on distribution shifts including sensor degradation, we propose a novel method called uniformity-aware information-balanced TTA (UnInfo). To address distribution shifts, we introduce uniformity-aware confidence maximization, information-aware loss balancing, and knowledge distillation from the exponential moving average (EMA) teacher. Through experiments, we demonstrate that our UnInfo improves accuracy under sensor degradation by retaining information in terms of uniformity.

URL: https://openreview.net/forum?id=YELPe35KIg

---

Title: GhostWord: A Fine-Grained Backdoor Attack on Automatic Speech Recognition

Abstract: Automatic Speech Recognition (ASR) systems are widely deployed in safety-critical settings but remain vulnerable to data-poisoning backdoor attacks. Existing ASR backdoors typically use phrase-level triggers paired with a fixed target sentence, creating strong artifacts (e.g., repeated transcripts or triggers placed in non-speech regions) that simple preprocessing can mitigate. We propose GhostWord, a word-level, time-localized ASR backdoor that uses codebooks mapping short ($\approx$400 ms) acoustic triggers to target words. During poisoning, we inject a trigger into the forced-aligned time span of a chosen source word in the audio and replace only that word in the transcript, enabling precise semantic flips and composable sentence manipulation while avoiding many-to-one label artifacts. Across Common Voice (v23 English, v24 Lithuanian) and multiple backbones (Whisper-Small/Medium, MMS, SpeechT5), GhostWord achieves an average attack success rate of 89.4% and transfers across languages and models. Adapting optimization-based defenses (ABL, ANP, SAU, I-BAU) reveals a sharp robustness--accuracy trade-off: attack success drops from 89.4% to 28.3% while clean WER rises from 21.9 to 47.2%, consistent with a theoretical explanation that, in the high-class regime, optimization-based defenses incur unavoidable clean-performance degradation.

URL: https://openreview.net/forum?id=vngoPfQCJf

---

Title: Agent Harness Engineering: A Survey

Abstract: The rapid deployment of large language model (LLM) agents in production has revealed a recurring pattern: task execution reliability depends less on the underlying model than on the infrastructure layer that wraps it, the agent execution harness. This survey provides a practice-grounded, systematic treatment of agent harness engineering, organized around three claims. First, the agent harness is an independent system layer whose engineering quality drives a large share of real-world reliability, a position we develop through a three-phase engineering evolution from prompt to context to harness engineering, a cross-layer synthesis covering the cost--quality--speed trilemma, the capability--control tradeoff, and the harness coupling problem, and an open-problem agenda grounded in both research gaps and production pain points. Second, we propose ETCLOVG, a seven-layer taxonomy (Execution environment, Tool interface, Context management, Lifecycle/Orchestration, Observability, Verification, Governance) that extends prior six-component frameworks by treating observability and governance as independent architectural concerns. Third, we map 170+ open-source projects onto this taxonomy to expose ecosystem patterns, coverage gaps, and emerging design principles, alongside engineering principles distilled from production deployments at OpenAI, Anthropic, and LangChain that address the gap between practitioner knowledge and research vocabulary.

URL: https://openreview.net/forum?id=3hXEPbG0dh

---

Title: Understanding and Mitigating Overconfidence in Focus Group Surveys

Abstract: Subjective evaluation tasks including critical analysis and rating remain at the top of Bloom’s Taxonomy. These have emerged as new pathways for evaluating Language Models (LMs) wherein correctness is relative. While LMs present diverse and human-aligned opinions on such tasks, their confidence and reliability in opinions remain unexplored. We take a deeper look at the reliability of LMs for subjective evaluations by selecting one such task, focus group surveys. LMs act as participants by completing survey questionnaires on diverse physical products. Participants must verbalize their opinions and product details in order to aid business organizations in their commercial goals. While survey responses are diverse, detailed and aligned with human intent, participants are found to be overconfident in their responses. Models often confabulate product appearance, shape and haptic feedback with high self-reported confidence. We address overconfidence by taking a surgical approach. We uncover that (1) choice of prompt prefix and (2) steering guidance at earlier layers are pivotal in mitigating overconfidence. Following our desiderata that participants possess long-term awareness and diversity in viewpoints, we propose a framework that minimizes overconfidence using prefix intensity and teacher-guided steering. Our collective recommendations, termed the Over-Confidence Checklist (OCC), aid in minimizing and customizing rating confidence into pre-determined quantiles. We empirically validate that following the OCC leads to reliable confidence ratings while grounding responses in truthful product-specific details. Survey datasets and code will be released in the final version.

URL: https://openreview.net/forum?id=NGuOZYQZBq

---

Title: Evaluating Global Decision Faithfulness of LLMs with Structured Tabular Decision Simulations

Abstract: Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply that their decisions are grounded in relevant, domain-appropriate factors. In structured decision settings, such as medical triage, financial risk assessment, or policy analysis, reliable performance requires more than producing correct labels: a model should make consistent decisions across multiple instances and rely on relevant, domain-grounded decision factors. We introduce **Structured Tabular Decision Simulations (STaDS)**, an evaluation framework that casts expert-like decision problems into tabular form and evaluates LLMs along three behavioral dimensions: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. The third dimension extends faithfulness evaluation from local reasoning traces to global decision faithfulness: whether a model's stated decision factors align with the factors that behaviorally affect its predictions across many instances. By analyzing 9 frontier LLMs across 15 diverse decision settings, we find that predictive competence and global decision faithfulness are empirically separable: models frequently achieve high accuracy while exhibiting low or negative alignment between stated and behaviorally measured feature reliance. This accuracy-faithfulness gap is consistent across model families and domains, and remains visible in a targeted domain-specialized medical-model case study. Our results highlight that accuracy metrics alone are insufficient and motivate the adoption of global faithfulness evaluation as a complementary protocol.

URL: https://openreview.net/forum?id=2QOgUm7iul

---

Title: From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Abstract: Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text–audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our proposed pipeline provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

URL: https://openreview.net/forum?id=69MeTLVKze

---

Title: Subliminal Learning Leaves Traceable Representations in MNIST Autoencoders

Abstract: Knowledge distillation is a widely adopted technique that allows us to efficiently produce cheaper but capable student models from expensive-to-deploy teacher models. However, this can induce a side effect where the student inherits traits from the teacher that were not the intended objective of the distillation, through a phenomenon called subliminal learning. In this short note, we ask whether an unintentional trait in a distilled student can be traced back to the teacher it was subliminally acquired from. We use an auxiliary-logit distillation setup of subliminal learning, similar to prior studies. We demonstrate that in an MNIST autoencoder, a student trained only to imitate auxiliary logits on random noise inputs subliminally acquires reconstruction performance. Moreover, we can trace students back to their source teachers with high accuracy by comparing their internal representations. Where prior work demonstrates transfer of behavioral traits or classifier performance, our result shows that a mechanistic representation trait is also transmitted and can be used to trace back the teacher model.

URL: https://openreview.net/forum?id=owvbayFmdV

---

Title: FloatSOM: GPU Accelerated, Distributed, Topology-Flexible Self-Organizing Maps

Abstract: GPU-accelerated Self-Organizing Map (SOM) implementations are among the most competitive options for large-scale SOM analysis, but growing dataset sizes increasingly challenge their practical use because workloads no longer fit cleanly within device-memory limits. We introduce FloatSOM, a SOM framework for scalable training and deployment that supports multi-GPU execution, out-of-memory disk-backed streaming, and novel topologies beyond regular lattices. We evaluate FloatSOM on 14 synthetic and real benchmark datasets together with controlled speed-scaling benchmarks, and show that these improved topologies, combined with topology-aware hyperparameter fine-tuning, yield lower quantization error than current state-of-the-art SOM baselines. FloatSOM also sustains this performance at large scale with high-throughput distributed execution; in the largest benchmark, it trains a 1024-node SOM network on 1,000,000,000 samples with 50 features in 6.16 minutes on 8 GPUs across two separate high-performance-computing nodes.

URL: https://openreview.net/forum?id=n2NQNxu9Ei

---

Title: ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

Abstract: Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy--reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.

URL: https://openreview.net/forum?id=hZ9gu1iO12

---

Title: IRL-GAD: Graph Anomaly Detection via Inverse Reinforcement Learning as Normality Modeling

Abstract: Most graph anomaly detection methods define anomaly as statistical outlier-ness in a fitted representation, a framing that faces a specific weakness when adversaries position themselves close to normal nodes in feature space. We propose a behavioral alternative: anomaly as deviation from an implicit policy governing the normal population. We recast each node's multi-hop neighborhood as a trajectory in a Markov decision process (the Node-MDP), recover the reward driving normal demonstrations via maximum-entropy inverse reinforcement learning (MaxEnt-GIRL), and score nodes by the KL divergence between their observed aggregation policy and the soft-optimal policy induced by the recovered reward. The reward decomposes into structural, semantic, and temporal components, yielding component-level interpretability of every detection. Theoretically, we establish reward identifiability with a graph-specific strengthening, a finite-sample recovery bound, a camouflage detection margin under a bounded threat model that holds adaptively within the budget against an omniscient adversary, a closed-form soft-value regret bound, and a PAC-style bound on the deployed detector's false-positive rate. Across six benchmarks (homophilic, camouflaged, dynamic, large-scale), IRL-GAD improves on the strongest baseline by $+1.7$ AUC-ROC points on average and by $+2.3$ on YelpChi, with the learned reward transferring to anomaly types absent at training.
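
The scoring idea above, KL divergence between an observed policy and a soft-optimal one, can be sketched for a single state with a handful of actions. This is an illustration under several assumptions: the soft-optimal policy is taken to be a softmax over action values, the divergence direction is KL(observed || soft-optimal), and the action values and node policies are made-up numbers rather than anything recovered by the paper's method.

```python
import math


def soft_optimal_policy(q_values, temperature=1.0):
    """Softmax over action values: pi(a) proportional to exp(Q(a) / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical action values under a recovered reward.
q_values = [1.0, 0.0, -1.0]
reference = soft_optimal_policy(q_values)

typical_node = list(reference)     # behaves exactly like the reference policy
odd_node = [0.05, 0.05, 0.90]      # concentrates on the least-valued action

score_typical = kl_divergence(typical_node, reference)
score_odd = kl_divergence(odd_node, reference)
print(score_typical, score_odd)
```

A node whose behavior matches the reference policy scores zero, while one that favors low-value actions receives a large score even if its features look ordinary.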

URL: https://openreview.net/forum?id=2OKaBZex1C

---

Title: An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders

Abstract: Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments uses encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features than supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice versa far outside of it; however, fine-tuned SSL encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learnt representations orthogonal to existing feature quality estimation methods. Additionally, we find that the silhouette score, when measured in a UMAP-reduced space, is highly correlated with clustering performance, and can therefore be used as a proxy for clustering performance on data with no ground truth labels.
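
For reference, the silhouette score needs only the points and their cluster assignments, which is why it can serve as a label-free proxy. The sketch below is a plain-Python mean silhouette with Euclidean distance. It assumes the points have already been reduced to a low-dimensional space (the UMAP step is omitted), and the function name and toy data are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance.

    For each point, a = mean distance to its own cluster (excluding itself),
    b = smallest mean distance to any other cluster, s = (b - a) / max(a, b).
    Points in singleton clusters contribute 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette taken as 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 2-D embeddings forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_score(points, [0, 0, 1, 1])  # assignment matches the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # assignment cuts across it
print(good, bad)
```

The assignment that follows the geometry scores close to 1, while the one that cuts across it scores below 0, without any ground-truth labels being consulted.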

URL: https://openreview.net/forum?id=gdwg7ntmT5

---

Title: Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets

Abstract: In this paper, we provide a computable characterization of the geometry of optimal representations in Contrastive Learning (CL) when the classes are imbalanced. When classes are balanced and the representation dimension is greater than the number of classes, it is well-known that the optimal representations exhibit Neural Collapse (NC), i.e., representations from the same class collapse to their class means and the class means form an Equiangular Tight Frame (ETF). For imbalanced classes and a large, generalized family of CL losses, we prove that the optimal representations of all samples from the same class collapse to their class means and their geometry exhibits an angular symmetry structure that is determined by the relative class proportions. In general, we show that the geometry can be determined by solving a convex optimization problem. Exploiting this symmetry structure, we analytically investigate a special case where class imbalance is extreme and prove that CL exhibits a phenomenon called Minority Collapse (MC) where all samples from the minority classes (classes with small probabilities) collapse into a single vector, whenever the class imbalance exceeds a threshold, which in turn depends on the regularity properties of the CL loss used and on the number of negative samples. Numerical results are provided to illustrate these phenomena and corroborate the theoretical results. We conclude by identifying a number of open problems.

URL: https://openreview.net/forum?id=CJ0KPc28wP

---

Title: Tensor Network Structure Search Via Canonical Dimension Tree Enumeration

Abstract: Tensor networks provide a powerful framework for compressing multi-dimensional data. The optimal tensor network structure for a given data tensor depends on both data characteristics and specific optimality criteria, making tensor network structure search a challenging problem. Existing solutions typically rely on sampling and compressing numerous candidate structures; these procedures are computationally expensive and therefore limiting for practical applications. We address this challenge by decoupling topology enumeration from rank assignment search. We first represent the search space using canonical dimension trees, a hierarchical structure that encodes potential network topology through nested index partitions. This representation rules out redundant and suboptimal topologies by construction. To eliminate the assessment bottleneck, we introduce a mechanism powered by the precomputation of a singular value map. By archiving the singular values of all feasible tensor matricizations, we transform the evaluation of any candidate dimension tree into a constraint-solving problem. This allows us to solve for the near-optimal ranks and calculate the dimension tree's cost via simple metadata lookup, bypassing on-the-fly tensor decompositions for all but the most promising candidates. Experimental results show that our approach improves search speed by up to $10\times$ and achieves compression ratios $1.5\times$ to $3\times$ better than the state of the art. Notably, our approach scales to larger tensors that are unattainable by prior work. Furthermore, the discovered topologies generalize well to similar data; they achieve compression ratios up to $2.4\times$ better than generic structures, while maintaining a search time of approximately $110$ seconds for 6D tensors of 1--2 GB disk size.

URL: https://openreview.net/forum?id=3p1LKyAbu9

---

Title: Meltdown: Circuits and Bifurcations in Point-Cloud-Conditioned 3D Diffusion Transformers

Abstract: Sparse point clouds are a common input modality for 3D surface reconstruction, including in safety-critical settings such as surgical navigation and autonomous perception. Recent point-cloud-conditioned 3D diffusion transformers achieve state-of-the-art results in this regime by leveraging learned priors. We show that these models can fail catastrophically under realistic input variation, and present a mechanistic case study of why. We identify a failure mode we call Meltdown: tiny on-surface perturbations to a sparse input point cloud can fracture the reconstructed output into hundreds of disconnected pieces. Adversarial search recovers Meltdown in 89.9-100% of shapes across the two open-weight state-of-the-art architectures we study (WaLa, Make-a-Shape) on real-world datasets (GSO, SimJEB) and under both DDPM and DDIM sampling. We trace Meltdown along the forward pass: it is governed by how uniformly the points are distributed on the surface, faithfully transduced through the point-cloud encoder, and committed by a single early-denoising cross-attention write in the diffusion backbone. Diffusion-trajectory ensembles exhibit symmetry-breaking near this commit step, consistent with a bifurcation of the reverse process. Through a suite of matched-magnitude controls, we show that the variable on which the model commits is directional, concentrated in a low-rank subspace of the write's perturbation drift. Motivated by this finding, we introduce PowerRemap, a test-time control that reshapes the singular spectrum of the localized write to suppress this drift, with rescue rates of 98.3% on WaLa and 84.6% on Make-a-Shape. Together, these results link a circuit-level cross-attention mechanism to a trajectory-level account of the failure, demonstrating how mechanistic analysis can explain and guide behavior in conditional diffusion transformers.

URL: https://openreview.net/forum?id=PlHUWukHqp

---

Title: Learning to Partially Defer for Sequences

Abstract: In the Learning to Defer (L2D) framework, a prediction model can either make a prediction or defer it to an expert, as determined by a rejector. Current L2D methods train the rejector to decide whether to reject the entire prediction, which is not desirable when the model predicts long sequences. We present an L2D setting for sequence outputs where the system can defer specific outputs of the whole model prediction to an expert in an effort to interleave the expert and machine throughout the prediction. We propose two types of model-based post-hoc rejectors for pre-trained predictors: a token-level rejector, which defers specific token predictions to experts with next-token prediction capabilities, and a one-time rejector for experts without such abilities, which defers the remaining sequence from a specific point onward. In our experiments, we demonstrate empirically that such granular deferrals achieve better cost-accuracy tradeoffs than whole deferrals on traveling salesman solvers, news summarization, and weather prediction.

URL: https://openreview.net/forum?id=xFFYq6NFQT

---

Title: Scalable Mean-Field Variational Inference through Deterministic Teacher-Guided Knowledge Distillation

Abstract: Scaling mean-field variational inference (MFVI) to large-scale deep networks remains challenging: in high dimensions, Gaussian posteriors concentrate probability mass on thin hyperspherical shells far from the mean (the ``soap-bubble'' phenomenon), so sampled weights produce high-variance gradient estimates that destabilize ELBO training. On ImageNet, vanilla MFVI achieves only 21.57\% top-1 accuracy with a ResNet-18 backbone, indicating severe optimization degradation relative to its deterministic counterpart. We introduce a decoupled training framework that addresses this issue at the level of the optimization objective. A pretrained teacher network supervises the mean parameters $\mu$ through an output-space knowledge distillation term, while the variance parameters $\rho$ are optimized exclusively through the standard ELBO. The combined objective admits a hybrid MAP-VI interpretation in which the distillation term defines a teacher-induced pseudo-prior on $\mu$, recovering standard MFVI as a special case. Under standard local regularity conditions, we prove a dimension-independent confinement bound showing that the optimum lies within a controllable neighborhood of the teacher's effective solution. Empirically, the method reaches 71.58\% top-1 accuracy and ECE 0.0251 on ImageNet with ResNet-18, and is complementary to existing techniques such as MOPED and Radial reparameterization.

URL: https://openreview.net/forum?id=sqloRV7aE8
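
The "soap-bubble" effect mentioned in the abstract can be checked numerically. The sketch below only illustrates the concentration of Gaussian samples on a thin shell of radius about sqrt(d); it is not the paper's training method. The function name `gaussian_norm_ratio` and the chosen dimensions and sample counts are assumptions made for illustration.

```python
import math
import random


def gaussian_norm_ratio(dim, rng):
    """Norm of one N(0, I_dim) sample divided by sqrt(dim)."""
    squared = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))
    return math.sqrt(squared / dim)


rng = random.Random(0)
# Draw 20 samples in a low and a high dimension and compare how much the norms vary.
ratios = {dim: [gaussian_norm_ratio(dim, rng) for _ in range(20)]
          for dim in (2, 10_000)}
spread = {dim: max(values) - min(values) for dim, values in ratios.items()}
```

In low dimension the ratio varies widely, while in high dimension it stays close to 1, so almost no sampled weight vector lies near the mean.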

---

Title: Belief‑Aware Collaborative AI: Planning with Human Beliefs of AI Intentions

Abstract: To enable effective human-AI collaboration, optimizing AI performance in isolation is not sufficient. AI systems need to also account for human factors. Prior research shows that incorporating models of human behavior into AI design can improve collaborative performance. However, existing approaches often implicitly assume that human behavior remains fixed regardless of the AI agent’s actions. In practice, humans adapt their behavior based on their beliefs about the AI’s intentions, that is, what they believe the AI is trying to accomplish. In this work, we develop and evaluate collaborative AI agents that account for human beliefs about AI intentions when choosing their actions. We formulate human-AI collaboration as a goal-oriented multi-agent decision-making problem and develop a belief model by extending level-$k$ reasoning with data-driven models of human behavior. Building on this belief model, we first design explicable AI policies that generate behavior from which humans can more easily infer the AI’s intentions, providing a direct test of whether the model captures human belief formation. We then incorporate the belief model into the training of collaborative AI agents to improve coordination with human partners. Through simulations and extensive human-subject experiments, we show that our belief model better captures human inferences about AI intentions and can be used to generate more explicable AI behavior. More importantly, we demonstrate that collaborative AI agents trained with models of human beliefs significantly improve team performance in human-AI collaboration settings. These results demonstrate the value of modeling human beliefs about AI intentions as a design principle for collaborative AI.

URL: https://openreview.net/forum?id=mx8OsOcTpo

---

Title: Stochastic Equilibrium Propagation for Spiking Convergent Recurrent Neural Networks

Abstract: Spiking Neural Networks (SNNs) promise energy-efficient, sparse, biologically inspired computation. Training them with Backpropagation Through Time (BPTT) and surrogate gradients achieves strong performance but remains biologically implausible. Equilibrium Propagation (EP) provides a more local and biologically grounded alternative. However, existing EP frameworks for SNNs largely rely on deterministic neurons, which require complex mechanisms to handle spiking discontinuities and do not scale beyond simple benchmarks such as MNIST and Fashion-MNIST. Inspired by the stochastic nature of biological spiking mechanisms and recent hardware trends, we propose a stochastic EP framework that integrates probabilistic spiking neurons into the EP paradigm. This formulation smooths the optimization landscape, stabilizes training, and enables efficient and scalable learning in SNNs. We provide theoretical guarantees showing that the proposed stochastic EP dynamics approximate deterministic EP under mean-field theory, thereby inheriting its underlying theoretical guarantees. The proposed framework narrows the performance gap with both BPTT-trained SNNs and EP-trained non-spiking convergent recurrent neural networks (CRNNs) on CIFAR-10, DVS Gesture, and IMDB datasets, while preserving temporal and spatial locality. Our results highlight stochastic EP as a promising approach for neuromorphic and on-chip learning.

URL: https://openreview.net/forum?id=eAfflo9Uoo

---

Title: Verification and Training of Neural Networks for Robustness Against Neuron Pruning

Abstract: Structured neuron pruning removes entire hidden units to reduce model size and computation but often leads to unpredictable accuracy degradation. Existing pruning methods typically rely on heuristic importance scores and provide no formal guarantees on the behavior of pruned models. In this work, we propose a certifiable approach for structured neuron pruning in fully connected layers of feedforward neural networks that guarantees robustness against all pruning masks satisfying a given layer-wise sparsity budget. We further develop a computable upper bound on the worst-case change in pairwise class margins induced by neuron pruning. The analysis models pruning as row-zeroing (equivalently, neuron gating via binary masks) in the weight matrices and bounds the resulting deviation via operator-norm-based error propagation. These bounds are then used to develop a margin-aware robust training objective for certifiable pruning robustness. Experiments on MNIST and CIFAR-10 show that the resulting models achieve non-trivial certified accuracy under a range of pruning budgets and that our robust training substantially improves both certified and empirical robustness over standard baselines.

URL: https://openreview.net/forum?id=6m3LtUZwvF
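
To make the row-zeroing view of pruning concrete, here is a minimal sketch for a two-layer ReLU network without biases. It is not the paper's certified bound: it uses the Frobenius norm as a cheap upper bound on the operator norm, and it bounds the output deviation rather than class margins. All names (`forward`, `deviation_bound`, `max_observed_deviation`) and the toy weights are assumptions made for illustration.

```python
import itertools
import math


def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward(w1, w2, x, pruned=()):
    """Two-layer ReLU network; pruning hidden neuron i zeroes row i of w1."""
    masked_w1 = [[0.0] * len(row) if i in pruned else row
                 for i, row in enumerate(w1)]
    return matvec(w2, relu(matvec(masked_w1, x)))


def deviation_bound(w1, w2, x, budget):
    """Upper bound on the output change over all masks pruning at most `budget` neurons."""
    hidden = relu(matvec(w1, x))
    # The worst mask removes the `budget` largest hidden activations.
    worst_removed = math.sqrt(sum(sorted((h * h for h in hidden), reverse=True)[:budget]))
    # The Frobenius norm upper-bounds the operator (spectral) norm of w2.
    frobenius = math.sqrt(sum(w * w for row in w2 for w in row))
    return frobenius * worst_removed


def max_observed_deviation(w1, w2, x, budget):
    """Brute-force the largest output change over every admissible pruning mask."""
    base = forward(w1, w2, x)
    worst = 0.0
    for size in range(budget + 1):
        for pruned in itertools.combinations(range(len(w1)), size):
            worst = max(worst, math.dist(base, forward(w1, w2, x, pruned)))
    return worst


w1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5]]   # toy 2 -> 3 layer (hypothetical values)
w2 = [[1.0, -2.0, 0.5], [0.5, 1.0, -1.0]]     # toy 3 -> 2 layer (hypothetical values)
x = [2.0, 1.0]
bound = deviation_bound(w1, w2, x, budget=2)
observed = max_observed_deviation(w1, w2, x, budget=2)
```

The brute-force value `observed` never exceeds `bound`, because the output change equals `w2` applied to the removed activations. A tighter certificate would replace the Frobenius norm with the true operator norm.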

---

Title: A Survey of Robotic Learning for Perception and Manipulation: From Modular Pipelines to Robotic Foundation Models

Abstract: Over the past decade, robotic manipulation systems have undergone a fundamental paradigm shift: from carefully engineered hierarchical pipelines to data-driven foundation-model-based robotic policies. Following the 2015 DARPA Robotics Challenge, classical systems relied on decomposed perception-planning-control architectures with strong modeling assumptions and task-specific engineering. Since then, advances in machine learning, large-scale visual representation learning, and robot interaction data collection have enabled a progression toward imitation learning policies, end-to-end generative visuomotor policies, and, most recently, robotic foundation models capable of multi-task and cross-embodiment generalization.

This survey provides a structured perspective on this evolution from the viewpoint of robotic perception and manipulation. We introduce a taxonomy of manipulation systems organized along architectural transitions: \textit{hierarchical pipelines, imitation-based policies, learning-based generative visuomotor policies, and robotic foundation models (e.g., VLAs)}, and analyze each paradigm in terms of system design, data requirements, and embodied intelligence capabilities such as compositionality, generalization, and adaptability. Beyond model architectures, we examine the scaling of data that underpins recent progress, covering developments in large-scale visual and 3D datasets, in-the-wild robot interaction corpora, and emerging multimodal sensing modalities including tactile and force feedback.

We further discuss emerging directions that integrate robotics foundation models with reinforcement learning and world models to enable online adaptation and long-horizon reasoning in physical environments. We review current benchmarks and evaluation protocols, highlighting limitations in measuring generalization, safety, and data efficiency, and conclude by outlining open challenges toward general-purpose embodied agents, including interaction-centric scaling, safety and alignment in physical deployment, multimodal perception integration, and the fusion of cognitive abstraction with physical reasoning.

By synthesizing architectural, data-centric, and systems-level trends, this survey aims to provide both a conceptual map of robotic learning’s recent trajectory and a forward-looking agenda for advancing robotic manipulation toward truly general embodied intelligence.

URL: https://openreview.net/forum?id=DwR0r5eGp0

---

Title: [Re] XFeat: Accelerated Features for Lightweight Image Matching

Abstract: We present a reproducibility study of XFeat, a lightweight local feature extractor and matcher designed for efficient visual correspondence on resource-constrained hardware. We re-implement the architecture based on the paper and supplementary material, re-evaluate the authors' released checkpoint alongside our re-implementation, and conduct additional architectural ablations to clarify unmotivated design choices. This distinction between re-evaluation and reproduction is crucial, as the paper, supplement, and public code differ in several important details, including the backbone layout, the fusion block, and the training losses. Empirically, our reproduced models closely match, and in some cases slightly outperform, the re-evaluated original checkpoint on Megadepth-1500 and ScanNet-1500, supporting the main claim that XFeat provides a strong accuracy–efficiency trade-off for real-world use. At the same time, our ablations explore two seemingly crucial architectural arguments from the original paper. In particular, the parallel keypoint branch is important for semi-dense matching, but its benefit is less pronounced than the original paper claims, and the motivation for the single skip-connection is less conclusive than originally implied. Finally, our experiments show that downstream computer vision tasks, such as homography estimation, can be reproduced successfully, whereas visual localization on Aachen remains below the paper's reported numbers even when re-evaluating the authors' own checkpoint, suggesting the gap stems from underspecified evaluation details rather than the model itself.

URL: https://openreview.net/forum?id=2WI889Ulin

---

Title: Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

Abstract: Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce \textit{Minerva}, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose \textit{MinervaRL}, a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Averaged across four backbones and 12 CTI benchmarks, MinervaRL improves the mean score by 15.8 percentage points over the corresponding base models and by 4.3 points over GRPO.

URL: https://openreview.net/forum?id=w1IMu9exKC

---

Title: Localized Operator Learning with Adaptive Partition-of-Unity Mixture-of-Expert Networks

Abstract: Operator learning methods such as DeepONets and FNOs often struggle with PDE families featuring sharp interfaces, heterogeneous coefficients, and localized multiscale structures.
We introduce a partition-of-unity (POU) mixture-of-experts framework for localized operator learning, in which geometry-aware gating networks produce smooth spatial partitions and route computation to specialized local experts.
Our main contribution is HiRefPOU, a hierarchical residual POU architecture for DeepONets that enables coarse-to-fine refinement while preserving global continuity.
We also show that the same POU principle can be incorporated into Fourier Neural Operators to introduce spatial adaptivity without modifying the underlying spectral layers.
On heterogeneous Darcy and reaction--diffusion benchmarks, HiRefPOU achieves substantially lower error than global DeepONet and static POU-MoE baselines, while the broader operator-learning experiments show that the benefits of localization depend on the PDE structure and the chosen neural-operator backbone.
The learned partitions are interpretable and align with interfaces and regions of rapid solution variation.
These results show that explicit geometric localization can improve both accuracy and interpretability in neural operator learning.

URL: https://openreview.net/forum?id=ccEY1fZ5gA

---

Title: Direction for Detection: A Survey of Automated Vulnerability Detection and all of its Pain Points

Abstract: Security vulnerabilities in software can have severe consequences; however, manual vulnerability detection is costly and does not scale, especially as agentic coding frameworks increase the rate of code production. Over the last decade, a large body of research has applied machine learning to automate vulnerability detection (ML4AVD), yet self-reported performance on the most popular datasets shows no clear upward trend. The ML4AVD research community has identified several flaws in problem formulations, datasets, and metrics, but these are discussed in isolation, leaving the overarching problems that generate and reinforce these flaws unaddressed. We first systematize the field through a survey of 87 influential works based on their problem formulation, input and detection granularity, target programming languages, evaluation metrics, datasets, and detection approach. Drawing on this corpus and prior empirical work, we identify twelve pain points spanning the ML4AVD pipeline and show that they are self-reinforcing and causally inter-meshed: feedback loops between datasets, formulations, baselines, and metrics perpetuate each other and explain the field’s persistent concentration on binary classification of C/C++ vulnerabilities at the function level. Thus, the field optimizes for a narrow and artificial problem that omits vulnerability type prediction, broader language support, and separation of input from detection granularity. We pair each pain point with concrete recommendations to break these loops. Finally, we use AIxCC as a case study to assess how well a recent high-profile effort aligns with these recommendations and reflect on the relevance of ML4AVD in the era of agentic AI.

URL: https://openreview.net/forum?id=01TkT5p3xT

---

Title: When Engineering Outruns Intelligence: Rethinking Instruction-Guided Navigation

Abstract: Recent ObjectNav systems credit large language models (LLMs) for sizable zero-shot gains, yet it remains unclear how much comes from language versus geometry. We revisit this question by re-evaluating an instruction-guided pipeline, InstructNav, under a detector-controlled setting and introducing two training-free variants that only alter the action value map: a geometry-only Frontier Proximity Explorer (FPE) and a lightweight Semantic-Heuristic Frontier (SHF) that polls the LLM with simple frontier votes. Across HM3D and MP3D, FPE matches or exceeds the detector-controlled instruction follower while using no API calls and running faster; SHF attains comparable accuracy with a smaller, localized language prior. These results suggest that carefully engineered frontier geometry accounts for much of the reported progress, and that language is most reliable as a light heuristic rather than an end-to-end planner.

URL: https://openreview.net/forum?id=49WESFidIW

---

Title: AlphaZero in Sparsely Rewarded Games: Limits and Auxiliary Supervision

Abstract: AlphaZero has demonstrated that a neural-guided Monte Carlo Tree Search can achieve superhuman performance, but strong play does not necessarily imply perfect play. We study this gap in two oracle-evaluable domains with contrasting structure: Connect Four, a solved partisan game with exact game-theoretic values, and Chomp, an impartial game whose optimal play is governed by Grundy-number structure. Under a unified self-play $+$ MCTS pipeline, we compare vanilla AlphaZero, a multi-frame variant, and an AlphaZero Auxiliary Loss (AZAL) that adds oracle-derived policy supervision. We find that vanilla AlphaZero achieves strong play across both domains but cannot preserve the exact trajectories required for optimal play: in Connect Four, it fails to maintain the optimal line of play, while in Chomp, it fails to consistently restore the $g=0$ invariant. Multi-frame inputs alone do not remove this gap. Nevertheless, AZAL substantially improves optimality recovery, reaching perfect oracle consistency on the evaluated Chomp traces and near-perfect oracle consistency on the evaluated Connect Four trace. These results suggest that, in these oracle-evaluable settings, a major bottleneck is the weakness of the standard AlphaZero search-learning signal.

URL: https://openreview.net/forum?id=1z0CnFiJKg
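
The $g=0$ invariant referred to in the abstract can be illustrated with a small Grundy-number oracle for Chomp. This is a minimal sketch, not the paper's oracle. It assumes a board encoded as a tuple of non-increasing row lengths, with the poisoned square at position (0, 0) never eaten, so the player left with only that square has no move and loses. The names `moves`, `grundy`, and `restoring_moves` are illustrative.

```python
from functools import lru_cache


def moves(state):
    """Yield every position reachable by one bite that leaves the poisoned (0, 0) square."""
    for r, length in enumerate(state):
        for c in range(length):
            if (r, c) == (0, 0):
                continue
            # Rows above r are untouched; rows from r downward keep at most c squares.
            nxt = state[:r] + tuple(min(n, c) for n in state[r:])
            yield tuple(n for n in nxt if n > 0)


@lru_cache(maxsize=None)
def grundy(state):
    """Grundy number: the minimum excludant of the successors' Grundy numbers."""
    reachable = {grundy(s) for s in moves(state)}
    g = 0
    while g in reachable:
        g += 1
    return g


def restoring_moves(state):
    """Successor positions with g = 0, i.e. the moves an optimal player must choose."""
    return [s for s in moves(state) if grundy(s) == 0]


winning_replies = restoring_moves((2, 2))   # 2 x 2 board, first player to move
```

A position with $g=0$ is lost for the player to move, so optimal play means moving to such a position whenever one exists. On the 2 x 2 board the only such reply is to eat the far corner, leaving `(2, 1)`.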

---

Title: [Re] Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Web-Scale Datasets for Responsible LLMs

Abstract: Web-scale pretraining datasets for large language models (LLMs) contain harmful content, and existing binary filtering approaches either fail to prevent toxicity leakage or remove high-quality contextual material, reducing downstream utility. To address this trade-off, Mendu et al. (2025) introduced a three-dimensional safety taxonomy and the Topical and Toxic Prompt (TTP). However, their evaluation relied primarily on proprietary models, limiting reproducibility and scalability. In this work, we address this limitation by reproducing and evaluating the proposed taxonomy and prompt-based framework using several open-weight models. We benchmark TTP across a comprehensive cohort of open-weight architectures, identifying Llama 3.3 70B as a cost-effective alternative that matches GPT-4o. We replicate the original web-scale harm analysis on Common Crawl, C4, and FineWeb, reproduce the HAVOC benchmark to measure toxicity leakage, and extend the study by deploying HarmFormer at the scale of 1 million pages per dataset. We also report computational and environmental costs for each pipeline. Our results confirm that open-weight models are viable for transparent, reproducible dataset curation. However, we observe that the Safe-Topical boundary depends on the model chosen. Specifically, open-weight models consistently detect at rates twice as high as originally reported, particularly in Ideological Harm.

URL: https://openreview.net/forum?id=QqMlGswuJw

---

Title: FLARE: Fast Low-rank Attention Routing Engine

Abstract: The quadratic complexity of self-attention limits the scalability of transformers on long sequences. We introduce **Fast Low-rank Attention Routing Engine (FLARE)**, a token-mixing operator that realizes low-rank attention by routing information through a small set of latent tokens. Each layer induces an input-input token mixing matrix of rank at most $M$ via a minimal encode-decode factorization implemented using only two standard scaled dot-product attention (SDPA) calls. Because the dominant $\mathcal{O}(NM)$ computation is expressed purely in terms of standard SDPA, FLARE is compatible with fused attention kernels and avoids materializing $M\times N$ projection matrices. FLARE further assigns disjoint latent slices to each attention head, yielding a mixture of head-specific low-rank pathways. Empirically, FLARE scales to **one-million-point unstructured meshes on a single GPU**, delivers strong results across PDE surrogate benchmarks, and performs competitively on the Long Range Arena suite. We additionally release a large-scale additive manufacturing benchmark dataset.

URL: https://openreview.net/forum?id=zjTqjNr76I
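
The encode-decode routing described in the abstract can be sketched with two softmax-attention calls. This is a rough single-head illustration, not the authors' implementation. It assumes that a small set of latent vectors queries the inputs in the encode step and that the input keys query the latents in the decode step. It omits learned projections and head-specific latent slices, and uses plain Python lists instead of a fused SDPA kernel. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_weights(q, k):
    """Row-wise softmax of the scaled dot products q k^T / sqrt(d)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for q_row in q:
        scores = [scale * sum(x * y for x, y in zip(q_row, k_row)) for k_row in k]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def sdpa(q, k, v):
    """Standard scaled dot-product attention."""
    return matmul(attention_weights(q, k), v)


def latent_routed_mixing(latents, keys, values):
    encoded = sdpa(latents, keys, values)   # encode: M latents gather from N tokens
    return sdpa(keys, latents, encoded)     # decode: N tokens read from M latents


rng = random.Random(0)
N, M, d = 6, 2, 4
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
latents = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(M)]

mixed = latent_routed_mixing(latents, keys, values)
# The implied N x N token-mixing matrix is a product of an N x M and an M x N factor,
# so its rank is at most M. It is built here only for inspection.
mixing = matmul(attention_weights(keys, latents), attention_weights(latents, keys))
```

Each call costs on the order of N times M score evaluations. The N x N matrix `mixing` never needs to be formed in practice.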

---

Title: Thermodynamic Cyclic Processes with Markov Samplers in Bayesian Inference

Abstract: The concept of Markov chain Monte Carlo (MCMC) cycles, an analogy to cyclic processes in heat engines, is presented in order to examine Bayesian inference problems. In this effort, we develop adaptive ensemble schedulers that allow the tuning of external parameters of a Bayesian canonical ensemble during an MCMC run. We apply our method to different statistical models. As a fundamental insight, we find that such systems produce a non-zero net work output if and only if the considered model is non-Gaussian.

URL: https://openreview.net/forum?id=88KWqihymD

---

Title: Public Profile Matters: A Scalable Integrated Approach to Recommend Citations in the Wild

Abstract: Proper citation of relevant literature is essential for contextualising and validating scientific contributions. While current citation recommendation systems leverage local and global textual information, they often overlook the nuances of human citation behaviour. Recent methods that incorporate such patterns improve performance but incur high computational costs and introduce systematic biases into downstream rerankers. To address this, we propose Profiler, a lightweight, non-learnable module that captures human citation patterns efficiently and without bias, significantly enhancing candidate retrieval. Furthermore, we identify a critical limitation in current evaluation protocols: systems are assessed in a transductive setting, which fails to reflect real-world scenarios. We introduce a rigorous Inductive evaluation setting that enforces strict temporal constraints, simulating the recommendation of citations for newly authored papers in the wild. Finally, we present DAVINCI, a novel reranking model that integrates profiler-derived confidence priors with semantic information via an adaptive vector-gating mechanism. Our system achieves new state-of-the-art results across multiple benchmark datasets, demonstrating superior efficiency and generalisability. The code and the trained models will be made available upon acceptance.

URL: https://openreview.net/forum?id=R1g1Z4yX57

---

Title: Structured Machine Theory of Mind from Agent Trajectories

Abstract: Predictive models of human behavior trained on large-scale trajectory data optimize for statistical accuracy without representing the mental states that causally generate behavior. Such models support prediction but not principled intervention: they cannot answer how an agent's behavior would change if its beliefs or preferences were different. We introduce Structured Machine Theory of Mind (SMToM), a framework that addresses this limitation by attributing explicit, independently supervised belief and desire representations from observed trajectories within a Belief-Desire-Intention causal structure. The central architectural element is a goal head that consumes only the predicted mental-state channels and a current-trajectory embedding, making counterfactual intervention on beliefs and desires a direct operation. We instantiate SMToM on a controlled pedestrian navigation domain where ground-truth mental states are known by construction, enabling rigorous evaluation of both attribution accuracy and counterfactual validity. The resulting model, BDIBottleneck, outperforms trajectory-only and context-aware baselines on top-1 goal inference across path fractions and held-out agent splits, approaching the approximate upper bound at early-to-mid path reveal. Desire counterfactual experiments confirm that substituting an agent's inferred preferences with a different activity type coherently shifts predicted destinations toward relevant locations. Belief counterfactual experiments confirm that marking a location as unavailable in the agent's belief state reliably reduces its predicted probability as a destination, with effects that are statistically significant on both evaluation splits. Together, these results demonstrate, in a controlled navigation setting, that explicit BDI-structured supervision is a viable foundation for causal behavioral analysis of longitudinal trajectory data.

URL: https://openreview.net/forum?id=PfSNLg0zcK

---

Title: Safely Exploring Large Momentum Steps with Stochastic Curve Searches

Abstract: The use of stochastic line searches has emerged as an effective safeguard strategy for employing large learning rates in the training of deep models via stochastic gradient descent. However, exploiting this approach with different search directions is not straightforward; momentum-type directions, in particular, pose several challenges in this regard, both theoretical and computational.
In this work, we present stochastic curve search (SCS) as a generalization of the stochastic line search. SCS allows updates to be evaluated along directions that may not be descent directions, while still ensuring a sufficient decrease of the mini-batch objective at each iteration.
We show that the proposed framework is well-defined and that, under standard assumptions, the method converges in expectation.
We also empirically establish that using SCS alongside several momentum-based algorithms allows the use of aggressive hyperparameters, improving either the stability or the speed of the training process. The resulting algorithmic framework is shown to perform competitively against state-of-the-art methods in terms of both efficiency and effectiveness across a diverse set of learning benchmarks.

URL: https://openreview.net/forum?id=Uk4Y5Ng0vb
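
As background for the sufficient-decrease condition mentioned in the abstract, here is a minimal sketch of classical Armijo backtracking along the negative gradient of a single (mini-batch) objective. This is the line-search baseline that SCS generalizes, not the curve search itself. The function name `armijo_step`, its default constants, and the toy quadratic are assumptions made for illustration.

```python
def armijo_step(loss, grad, x, initial_step=1.0, c=0.1, shrink=0.5, max_backtracks=50):
    """Backtrack from a large step until the Armijo sufficient-decrease test holds."""
    g = grad(x)
    g_sq = sum(gi * gi for gi in g)
    f0 = loss(x)
    step = initial_step
    for _ in range(max_backtracks):
        candidate = [xi - step * gi for xi, gi in zip(x, g)]
        # Accept only if the loss drops by at least c * step * ||g||^2.
        if loss(candidate) <= f0 - c * step * g_sq:
            return step, candidate
        step *= shrink
    return 0.0, list(x)  # no acceptable step found; keep the current point


def toy_loss(x):
    # Hypothetical ill-conditioned quadratic standing in for a mini-batch loss.
    return 5.0 * x[0] ** 2 + 0.5 * x[1] ** 2


def toy_grad(x):
    return [10.0 * x[0], x[1]]


start = [1.0, 1.0]
step, new_point = armijo_step(toy_loss, toy_grad, start)
```

The search starts from a deliberately large step and shrinks it only as far as needed, which is what makes aggressive initial learning rates safe. With a momentum direction that is not a descent direction, this test may fail for every step size, which is the difficulty the abstract points to.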

---

Title: From Connectivity to Rewards: Dense Reward Learning with Directed State Graphs

Abstract: The integration of graphs with Goal-Conditioned Hierarchical Reinforcement Learning (GCHRL) has received increasing attention, as graphs naturally encode task hierarchies for effective subgoal sampling. However, existing methods often overlook intrinsic connectivity information, failing to fully leverage the underlying topology for efficient learning. Most graph-based GCHRL methods use the graph as a stochastic sampling tool rather than as an environmental model that encodes connectivity and state-accessibility information. This limitation is particularly acute in quasimetric environments, where the inherent asymmetry of state transitions poses a fundamental challenge to stable policy learning and robust path planning. In this paper, we address these problems by introducing a state connectivity model designed to predict pairwise state connectivity strength in asymmetric environments. We transform these connectivity strengths into scalar auxiliary dense rewards, providing continuous guidance across multiple hierarchical levels. We demonstrate that our proposed framework, Graph-Guided Quasimetric Dense Reward (G2QDR), can be integrated into any existing GCHRL architecture to boost performance, and the state connectivity model is efficiently implemented via a neural network trained on a directed state graph generated during exploration. Empirical results across a wide range of sparse reward environments indicate that G2QDR significantly enhances the performance of baseline GCHRL approaches with minimal computational overhead.

URL: https://openreview.net/forum?id=F65zrefsjB

---

Title: Domain Adaptation under Continuous Spurious Shift

Abstract: Recent advances in domain adaptation have shown promise in transferring knowledge across domains characterized by a continuous value or vector, such as varying patient ages, where ``age'' serves as a continuous index. However, these approaches often fail when spurious features shift continuously along with the domain index. This paper introduces a new method designed to withstand the continuous shifting of spurious features during domain adaptation. Our method enhances domain adaptation performance by aligning representations across continuously indexed domains, inspired by principles of causal transportability. Theoretical analysis provides insight into how our approach encourages transportable representations across different domains under certain assumptions. Empirical results, from both semi-synthetic and real-world medical datasets, indicate that our method outperforms state-of-the-art domain adaptation methods.

URL: https://openreview.net/forum?id=ncmIgEQucO

---

Title: MiloNet: A Framework for Traceable and Verified RAG

Abstract: Retrieval-augmented generation (RAG) has improved factual grounding, but strong retrieval alone does not guarantee reliable answers. Systems can still produce unsupported claims or omit required information after relevant evidence has been found. We introduce MiloNet, a framework for traceable and verified RAG that combines summary-based routing, hierarchical evidence construction, and verification-heavy postprocessing to control which evidence is admitted and which content can appear in the final answer. Under a unified evaluation protocol on the RAGBench HotpotQA test split, MiloNet-full achieves near-zero unsupported-claim rates, sharply reduces omissions, and improves faithfulness and overall reliability over the baselines. These results show that decision-grade RAG requires explicit control over synthesis, provenance, and admissibility, not just stronger evidence coverage.

URL: https://openreview.net/forum?id=wBuEiCVBEn

---

Title: RGMoE: A Unified Robust Graph Mixture of Experts to Defend Various Graph Adversarial Attacks

Abstract: Graph Neural Networks (GNNs) have achieved great success in modeling graph-structured data. However, recent research has revealed that GNNs are highly vulnerable to adversarial attacks, including manipulation attacks, node injection attacks, and the emerging threat of backdoor attacks. Although a variety of defense strategies have been proposed, most existing methods are tailored to a single attack type and lack a unified framework capable of defending against multiple threats simultaneously. To address this, we leverage the flexibility of the Mixture of Experts (MoE) architecture to design a unified and scalable framework to defend against various graph adversarial attacks. Our preliminary analysis shows that the inherent diversity of MoE allows some experts to be naturally unaffected by specific attacks. However, the limited number of robust experts and how to route the perturbed samples to them remain open problems. To this end, we propose the Robust Graph MoE (RGMoE), which introduces a novel logic diversity loss to encourage individual experts to focus on distinct neighborhood structures during decision making, thus ensuring a sufficient subset of experts to be unaffected by local structural perturbations. Moreover, RGMoE incorporates a robustness-aware router that identifies perturbed nodes and adaptively routes them to the corresponding unaffected experts, thereby enabling robust predictions for attacked samples. Extensive experiments on large-scale graphs demonstrate the effectiveness of RGMoE against multiple graph adversarial attacks. Our code is available at https://anonymous.4open.science/r/RGMoE-F870.

URL: https://openreview.net/forum?id=jENJpo9HVM

---

Title: ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees

Abstract: Quantile regression is a fundamental tool for distributional learning but poses significant optimization challenges for deep models due to the non-smoothness of the pinball loss. We propose ConquerNet, a class of convolution-smoothed quantile ReLU neural networks, which yield smooth objectives while preserving the underlying quantile structure. We establish general nonasymptotic risk bounds for ConquerNet under mild conditions, providing minimax guarantees over Besov function classes. In numerical studies, we demonstrate that the proposed approach outperforms standard quantile neural networks at multiple quantile levels, showing improved estimation accuracy and training efficiency across the board, with particularly pronounced advantages at high and low quantiles.

URL: https://openreview.net/forum?id=o4hWEuzrSu
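
As a rough illustration of the smoothing idea mentioned in the abstract above, the sketch below compares the standard pinball loss with a version convolved with a kernel. The abstract does not say which kernel or bandwidth ConquerNet uses, so the Gaussian kernel, the bandwidth `h`, and all function names here are assumptions chosen for illustration, not the authors' implementation.

```python
import math


def pinball_loss(u: float, tau: float) -> float:
    """Standard pinball loss rho_tau(u) = u * (tau - 1{u < 0}); has a kink at u = 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def std_normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def smoothed_pinball_loss(u: float, tau: float, h: float) -> float:
    """Pinball loss convolved with a Gaussian kernel of bandwidth h.

    ASSUMPTION: the Gaussian kernel is an illustrative choice only.
    Uses rho_tau(u) = |u| / 2 + (tau - 1/2) * u; the convolution leaves the
    linear term unchanged and replaces |u| by E|u + h * Z| with Z ~ N(0, 1).
    """
    a = u / h
    # Mean of a folded normal: E|u + h * Z|.
    expected_abs = h * (
        math.sqrt(2.0 / math.pi) * math.exp(-0.5 * a * a)
        + a * (1.0 - 2.0 * std_normal_cdf(-a))
    )
    return 0.5 * expected_abs + (tau - 0.5) * u


def smoothed_pinball_grad(u: float, tau: float, h: float) -> float:
    """Derivative of the smoothed loss in u: tau - Phi(-u / h).

    This is a smooth stand-in for the discontinuous subgradient tau - 1{u < 0}.
    """
    return tau - std_normal_cdf(-u / h)
```

Far from zero (relative to `h`) the smoothed loss agrees with the pinball loss, while near zero the kink is replaced by a differentiable curve, which is what makes gradient-based training better behaved in this kind of construction.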

---
