Weekly TMLR digest for Mar 08, 2026

TMLR

Mar 8, 2026, 12:00:13 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: From Euclidean to Graph-Structured Data: A Survey of Collaborative Learning

Rémi Bourgerie, Sarunas Girdzijauskas, Viktoria Fodor

https://openreview.net/forum?id=vj9l8AjLT6

---


Featured Certification, J2C Certification: Discovering Symbolic Differential Equations with Symmetry Invariants

Jianke Yang, Manu Bhat, Bryan Hu, Yadi Cao, Nima Dehmamy, Robin Walters, Rose Yu

https://openreview.net/forum?id=9t1dEyYfPc

---


J2C Certification: StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks

Yishan Wang, Tsai-Ning Wang, Mathias Funk, Aaqib Saeed

https://openreview.net/forum?id=i9RuUH9Jyj

---


J2C Certification: Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty

https://openreview.net/forum?id=gO1CpPRj6A

---


J2C Certification: Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma

https://openreview.net/forum?id=87UyFEhzyP

---


Featured Certification: Learning to reconstruct from saturated data: audio declipping and high-dynamic range imaging

Victor Sechaud, Laurent Jacques, Patrice Abry, Julián Tachella

https://openreview.net/forum?id=AkJWgglkLd

---


Featured Certification: GIFT: A Framework Towards Global Interpretable Faithful Textual Explanations of Vision Classifiers

Eloi Zablocki, Valentin Gerard, Amaia Cardiel, Eric Gaussier, Matthieu Cord, Eduardo Valle

https://openreview.net/forum?id=OwhW5MpFmD

---


Reproducibility Certification: mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

Xiaona Zhou, Constantin Brif, Ismini Lourentzou

https://openreview.net/forum?id=8LfB8HD1WU

---


J2C Certification: Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie

https://openreview.net/forum?id=lL1JR6dxG8

---


J2C Certification: Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki

https://openreview.net/forum?id=tgnTVmRybs

---


J2C Certification: Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm

Mansoor Davoodi, Setareh Maghsudi

https://openreview.net/forum?id=7N7sK5CFuP

---


J2C Certification: AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

Ioannis Tsingalis, Constantine Kotropoulos, Corentin Briat

https://openreview.net/forum?id=pZBQ7J37lk

---


Survey Certification: A Survey of Model Architectures in Information Retrieval

Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang Phillips, Jimmy Lin, Vivek Srikumar

https://openreview.net/forum?id=xAIbTbHRrX

---


J2C Certification: Thermodynamically Consistent Latent Dynamics Identification for Parametric Systems

Xiaolong He, Yeonjong Shin, Anthony Gruber, Sohyeon Jung, Kookjin Lee, Youngsoo Choi

https://openreview.net/forum?id=Qy3oLpRzpf

---


Featured Certification, J2C Certification: Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning

Hongseok Namkoong, Samuel Daulton, Eytan Bakshy

https://openreview.net/forum?id=J8PrWwvYX2

---


Accepted papers
===============


Title: From Euclidean to Graph-Structured Data: A Survey of Collaborative Learning

Authors: Rémi Bourgerie, Sarunas Girdzijauskas, Viktoria Fodor

Abstract: The conventional approach to machine learning, that is, collecting data, training models, and performing inference in a single location, faces fundamental limitations, including scalability and privacy, that restrict its applicability. To address these challenges, recent research has explored collaborative learning approaches, including federated learning and decentralized learning, where individual agents perform training and inference locally, with limited collaboration.
Most collaborative learning research focuses on Euclidean data with regular, grid-like structure (e.g., images, text). However, these approaches fail to capture the relational patterns in many real-world applications, best represented by graphs. Learning on graphs relies on message-passing mechanisms to propagate information between connected nodes, making it conceptually well-suited for collaborative environments where agents must exchange information. Yet, the opportunities and challenges of learning on graph-structured data in collaborative settings remain largely underexplored.
This survey provides a comprehensive investigation of collaborative learning from Euclidean to graph-structured data, aiming to consolidate this emerging field. We begin by reviewing its foundational principles for Euclidean data, organizing them along three core dimensions: learning effectiveness, efficiency, and privacy preservation. We then extend the discussion to graph-structured data, introducing a taxonomy of graph distribution scenarios, characterizing associated statistical heterogeneities, and developing standardized problem formulations and algorithmic frameworks. Finally, we systematically identify open challenges and promising research directions.
By bridging established techniques for Euclidean data with emerging methods for graph learning, our survey provides researchers and practitioners with a well-structured foundation of collaborative learning, supporting further development across a wide range of scientific and industrial fields.

URL: https://openreview.net/forum?id=vj9l8AjLT6

---

Title: Contextual Learning for Anomaly Detection in Tabular Data

Authors: Spencer King, Zhilu Zhang, Ruofan Yu, Baris Coskun, Wei Ding, Qian Cui

Abstract: Anomaly detection is critical in domains such as cybersecurity and finance, especially when working with large-scale tabular data. Yet, unsupervised anomaly detection---where no labeled anomalies are available---remains challenging because traditional deep learning methods model a single global distribution, assuming all samples follow the same behavior. In contrast, real-world data often contain heterogeneous contexts (e.g., different users, accounts, or devices), where globally rare events may be normal within specific conditions. We introduce a \emph{contextual learning framework} that explicitly models how normal behavior varies across contexts by learning conditional data distributions $P(\mathbf{Y} \mid \mathbf{C})$ rather than a global joint distribution $P(\mathbf{X})$. The framework encompasses (1) a probabilistic formulation for context-conditioned learning, (2) a principled bilevel optimization strategy for automatically selecting informative context features using early validation loss, and (3) theoretical grounding through variance decomposition and discriminative learning principles. We instantiate this framework using a novel conditional Wasserstein autoencoder as a simple yet effective model for tabular anomaly detection. Extensive experiments across eight benchmark datasets demonstrate that contextual learning consistently outperforms global approaches---even when the optimal context is not intuitively obvious---establishing a new foundation for anomaly detection in heterogeneous tabular data.

URL: https://openreview.net/forum?id=PmqZslRENW
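
The shift from a global distribution P(X) to a conditional one P(Y | C) can be illustrated with a toy sketch. This is illustrative only: the paper instantiates the idea with a conditional Wasserstein autoencoder, while the per-context z-score detector below is a deliberately simple stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "contexts" (e.g., two users) with very different normal behavior.
y_a = rng.normal(0.0, 1.0, 500)    # context A: values near 0
y_b = rng.normal(10.0, 1.0, 500)   # context B: values near 10
y = np.concatenate([y_a, y_b])
c = np.array(["A"] * 500 + ["B"] * 500)

def global_score(y, x):
    # Anomaly score under a single global distribution P(X).
    return abs(x - y.mean()) / y.std()

def contextual_score(y, c, x, ctx):
    # Anomaly score under the conditional distribution P(Y | C).
    mask = c == ctx
    return abs(x - y[mask].mean()) / y[mask].std()

# A value of 10 is globally rare (far from the pooled mean near 5)
# but perfectly normal for context B.
g = global_score(y, 10.0)
k = contextual_score(y, c, 10.0, "B")
```

The global detector flags the point; the contextual one correctly does not, which is exactly the failure mode of single-distribution methods the abstract describes.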

---

Title: Probing Layer-wise Memorization and Generalization in Deep Neural Networks via Model Stitching

Authors: Aishwarya Gupta, Indranil Saha, Piyush Rai

Abstract: It is well-known that deep neural networks can both memorize randomly labeled training data and generalize to unseen inputs. However, despite several prior efforts, the mechanism and dynamics of how and where memorization takes place in the network are still unclear, with contradictory findings in the literature. To address this, we aim to study the functional similarity between the layers of the memorized model and the model that generalizes. Specifically, we leverage model stitching as a tool to enable layer-wise comparison of a memorized noisy model, trained on a partially noisy-labeled dataset, to that of the generalized clean model, trained on a clean, noise-free dataset. Our simple but effective approach guides the design of experiments that help shed light on the learning dynamics of different layers in deep neural networks and why models with harmful memorization still generalize well. Our results show that early layers are as important as deeper ones for generalization. We find that ``cleaning'' the early layers of the noisy model improves the functional similarity of its deeper layers to that of the corresponding layers in the clean model. Moreover, cleaning the noise in the early layers of the noisy model can drastically reduce memorization and improve generalization. Furthermore, noise fixation up to a certain depth results in generalization similar to that of a noise-free model. However, interestingly, the reverse may not be true. That is, if early layers are noisy but deeper layers are noise-free, then perfect memorization cannot be achieved, emphasizing the dominant role of deeper layers in memorization. Our extensive experiments on four different architectures - customized CNN model, ResNet-18, ResNet-34, and ResNet-50, and three datasets - SVHN, CIFAR-10, and CIFAR-100, with varying levels of noise, consistently corroborate our findings.

URL: https://openreview.net/forum?id=wWye46fXo7

---

Title: COunterfactual Reasoning for Temporal EXplanations: Plausible and Robust Explanations for EEG-Based Seizure Detection

Authors: Martina Zannotti, Bardh Prenkaj, Marco Piangerelli, Flavio Corradini, Gjergji Kasneci

Abstract: Identifying the drivers of change in time-sensitive domains like healthcare is critical for reliable decision-making, yet explanations must account for both temporal dynamics and structural complexity. While counterfactual explanations are well-studied for static data, existing methods often fail in dynamic, spatio-temporal settings, producing implausible or temporally inconsistent explanations. To address this, we introduce COunterfactual Reasoning for Temporal EXplanations (CORTEX), a search-based explainer for multivariate time series modeled as spatio-temporal graphs, tailored to binary seizure detection from EEG recordings. CORTEX generates temporally robust and plausible counterfactuals by retrieving relevant past instances and sieving them via structural dissimilarity, temporal distance, and robustness. As a result of its design choices, when evaluated on clinical seizure detection data, CORTEX outperforms state-of-the-art methods by 3.73x in validity and 6.32x in fidelity, and achieves zero implausibility, demonstrating consistency and practical relevance. By shifting the focus from mere validity to plausible, time-consistent explanations, CORTEX enables more reliable, controllable counterfactual explanations.

URL: https://openreview.net/forum?id=FkHVmYnNS9

---

Title: Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

Authors: Bin Ren, Eduard Zamfir, Zongwei Wu, Yawei Li, Yidi Li, Danda Pani Paudel, Radu Timofte, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe

Abstract: Restoring multiple degradations efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing the size of the model, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models. Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To unify intrinsic degradation awareness with contextualized attention, we propose a spatial–frequency parallel fusion strategy that strengthens spatially informed local–global interactions and enriches restoration fidelity from the frequency domain. Comprehensive evaluations across four all-in-one restoration benchmarks demonstrate that AnyIR attains state-of-the-art performance while reducing model parameters by 84% and FLOPs by 80% relative to the baseline. These results highlight the potential of AnyIR as an effective and lightweight solution for further all-in-one image restoration. Our code is available at: https://github.com/Amazingren/AnyIR.

URL: https://openreview.net/forum?id=RObgCpdDqr

---

Title: Leveraging Recursive Methods for Efficient Federated Learning

Authors: Jie Liu, Zuang Wang, Yongqiang Wang

Abstract: Federated learning algorithms perform multiple local updates on clients before communicating with the parameter server to reduce communication overhead and improve overall training efficiency. However, local updates also lead to the “client-drift” problem under non-IID data, which prevents convergence to the exact optimal solution under heterogeneous data distributions. To ensure accurate convergence, existing federated-learning algorithms employ auxiliary variables to locally estimate the global gradient or the drift from the global gradient, which, however, also incurs extra communication and storage overhead. In this paper, we propose a new recursion-based federated-learning architecture that completely eliminates the need for auxiliary variables while ensuring accurate convergence under heterogeneous data distributions. This new architecture, called FedRecu, can significantly reduce communication and storage overhead compared with existing federated-learning algorithms that have accurate convergence guarantees. More importantly, this novel architecture enables FedRecu to employ much larger stepsizes than existing federated-learning algorithms, thereby leading to much faster convergence. We provide rigorous convergence analysis of FedRecu under both convex and nonconvex loss functions, in both the deterministic and the stochastic gradient case. In fact, our theoretical analysis shows that FedRecu ensures o(1/K) convergence to an accurate solution under general convex loss functions, which improves upon the existing achievable O(1/K) convergence rate for general convex loss functions. Numerical experiments on benchmark datasets confirm the effectiveness of the proposed algorithm.

URL: https://openreview.net/forum?id=cVGagKtiVr

---

Title: Discovering Symbolic Differential Equations with Symmetry Invariants

Authors: Jianke Yang, Manu Bhat, Bryan Hu, Yadi Cao, Nima Dehmamy, Robin Walters, Rose Yu

Abstract: Discovering symbolic differential equations from data uncovers fundamental dynamical laws underlying complex systems. However, existing methods often struggle with the vast search space of equations and may produce equations that violate known physical laws. In this work, we address these problems by introducing the concept of \textit{symmetry invariants} in equation discovery. We leverage the fact that differential equations admitting a symmetry group can be expressed in terms of differential invariants of symmetry transformations. Thus, we propose to use these invariants as atomic entities in equation discovery, ensuring the discovered equations satisfy the specified symmetry. Our approach integrates seamlessly with existing equation discovery methods such as sparse regression and genetic programming, improving their accuracy and efficiency. We validate the proposed method through applications to various physical systems, such as Darcy flow and reaction-diffusion, demonstrating its ability to recover parsimonious and interpretable equations that respect the laws of physics.

URL: https://openreview.net/forum?id=9t1dEyYfPc

---

Title: Improving Local Explainability By Learning Causal Graphs From Data

Authors: Daan Roos, Sebastian Gerwinn, Jan-Willem van de Meent, Sara Magliacane

Abstract: Causal Shapley values take into account causal relations among dependent features to adjust the contributions of each feature to a prediction. A limitation of this approach is that it can only leverage known causal relations. In this work we combine the computation of causal Shapley values with causal discovery, i.e., learning causal graphs from data. In particular, we compute causal explanations across the Markov Equivalence Class (MEC), a set of candidate causal graphs learned from observational data, providing a list of causal Shapley values that explain the prediction. We propose two methods for estimating this list efficiently, drawing on the equivalences of the interventional distributions for a subset of the causal graphs. We evaluate our methods on synthetic and real-world data, showing that they provide explanations that are more consistent with the true causal effects compared to traditional Shapley value approaches that disregard causal relations. Our results show that even when the Markov Equivalence Class is learned incorrectly, in most settings the explanations of our framework are on average closer to true causal Shapley values than marginal and conditional Shapley values.

URL: https://openreview.net/forum?id=A1bXT7RQLU

---

Title: Addition is almost all you need: Compressing large language models with double binary factorization

Authors: Vladimír Boža, Vladimír Macko

Abstract: Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient way to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in the 1-bit-per-weight range, DBF outperforms existing binarization approaches; in the 2-bit-per-weight range, it is competitive with the best quantization methods such as QuIP\# and QTIP. Unlike most existing compression techniques, which offer limited compression-level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria. The code is available at: \url{https://github.com/usamec/double_binary}

URL: https://openreview.net/forum?id=k5kUKoewdQ
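
As a rough sketch of what such a factorization looks like and why the intermediate dimension controls the bit budget (the per-row placement of the scaling vectors is my assumption from the abstract, and no fitting procedure is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_mid, d_in = 64, 32, 64   # intermediate dim d_mid sets the compression ratio

# Two sign matrices with entries in {-1, +1}, each with a scaling vector.
# (Layout assumed for illustration; DBF's actual fitting is not reproduced here.)
B1 = np.sign(rng.standard_normal((d_out, d_mid)))
B2 = np.sign(rng.standard_normal((d_mid, d_in)))
s1 = rng.random(d_out) + 0.1   # scales attached to B1's rows
s2 = rng.random(d_mid) + 0.1   # scales attached to B2's rows

# Dense weight represented by the factorization: (s1 * B1) @ (s2 * B2).
W_hat = (s1[:, None] * B1) @ (s2[:, None] * B2)

# Storage cost: 1 bit per sign entry, 16-bit floats for the scales.
bits = d_out * d_mid + d_mid * d_in + 16 * (d_out + d_mid)
bits_per_weight = bits / (d_out * d_in)
```

At these sizes the representation costs about 1.4 bits per weight; shrinking or growing d_mid moves that number continuously, which is the fine-grained control the abstract refers to.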

---

Title: StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks

Authors: Yishan Wang, Tsai-Ning Wang, Mathias Funk, Aaqib Saeed

Abstract: Listening to heart and lung sounds — auscultation — is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support. We present StethoLM, the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction–response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories: binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis. Through multi-stage training that combines supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data. Our work establishes a foundation for instruction-following AI systems in clinical auscultation.

URL: https://openreview.net/forum?id=i9RuUH9Jyj

---

Title: Optimistic Online Learning in Symmetric Cone Games

Authors: Anas Barakat, Wayne Lin, John Lazarsfeld, Antonios Varvitsiotis

Abstract: We introduce symmetric cone games (SCGs), a broad class of multi-player games where each player's strategy lies in a generalized simplex (the trace-one slice of a symmetric cone). This framework unifies a wide spectrum of settings, including normal-form games (simplex strategies), quantum games (density matrices), and continuous games with ball-constrained strategies. It also captures several structured machine learning and optimization problems, such as distance metric learning and Fermat–Weber facility location, as two-player zero-sum SCGs. To compute approximate Nash equilibria in two-player zero-sum SCGs, we propose a single online learning algorithm: Optimistic Symmetric Cone Multiplicative Weights Updates (OSCMWU). Unlike prior methods tailored to specific geometries, OSCMWU provides closed-form, projection-free updates over any symmetric cone and achieves a $\tilde{\mathcal{O}}(1/\epsilon)$ iteration complexity for computing $\epsilon$-saddle points. Our analysis builds on the Optimistic Follow-the-Regularized-Leader framework and hinges on a key technical contribution: We prove that the symmetric cone negative entropy is strongly convex with respect to the trace-one norm. This result extends known results for the simplex and spectraplex to all symmetric cones, and may be of independent interest.

URL: https://openreview.net/forum?id=RQjvE2o3o6

---

Title: A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning

Authors: Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, Satya Narayan Shukla

Abstract: Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is a popular choice of method for policy optimization. While effective in terms of performance and sample complexity, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some implementation complexities such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high variance and, crucially, sample inefficiency, which is the primary notion of efficiency we study in this work. While the variance of REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the sample efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO (LOOP), a novel RL method for diffusion fine-tuning. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between sample efficiency and final performance.

URL: https://openreview.net/forum?id=i8WJhKn455
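
The leave-one-out baseline that gives LOOP its name can be sketched in a few lines. This shows only the advantage computation; PPO's clipping and importance sampling, which LOOP layers on top, are omitted.

```python
import numpy as np

def loo_advantages(rewards):
    # For k rewards sampled from the same prompt, each sample's baseline
    # is the mean of the other k - 1 rewards (leave-one-out).
    r = np.asarray(rewards, dtype=float)
    k = r.size
    loo_mean = (r.sum() - r) / (k - 1)
    return r - loo_mean

adv = loo_advantages([1.0, 2.0, 3.0, 6.0])
# Advantages sum to zero and preserve the ranking of the rewards.
```

Because the baseline for each sample excludes that sample, it stays unbiased while still removing the shared prompt-level reward scale, which is the variance reduction the abstract attributes to REINFORCE-style multi-sample baselines.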

---

Title: Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

Authors: M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty

Abstract: Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) — based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized across clients via standard federated averaging. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.

URL: https://openreview.net/forum?id=gO1CpPRj6A

---

Title: MIRA: Multi-view Information Retrieval with Adaptive Routing for Test-time Long-video Comprehension

Authors: Zecheng Hao, Wayne Ma, Yufeng Cui, Shuang Li, Xinlong Wang, Tiejun Huang

Abstract: Foundational Multi-modal Large Language Models (MLLMs) have achieved rapid progress in handling complex tasks across diverse modalities. However, they still struggle to deliver satisfactory performance on Long-video Comprehension (LVC) tasks involving thousands of frames. Existing optimization strategies can be broadly categorized into LVC-specific fine-tuning, built-in token compression and training-free keyframe extraction, with the latter being most suitable for flexible deployment across various MLLMs. Unfortunately, current training-free approaches predominantly focus on query-frame relevance retrieval, overlooking other levels of visual information and the inherent heterogeneity of LVC tasks. In this work, we propose the $\textbf{M}$ulti-view $\textbf{I}$nformation $\textbf{R}$etrieval with $\textbf{A}$daptive Routing ($\textbf{MIRA}$) framework, which evaluates video frames using distinct metrics for relevance and causality, combines these scores to select a balanced pool of keyframes, and employs an adaptive feedback loop to tailor the retrieval process to different user queries, enabling more precise and sample-grained video comprehension. Extensive experiments demonstrate the advanced performance of our scheme across multiple challenging LVC benchmarks. For instance, integrating $\textbf{MIRA}$ with Qwen-2.5-VL yields performance gains of 3.5% to 13.1% on LVB, VideoMME and MLVU.

URL: https://openreview.net/forum?id=LZb2kzO8tu

---

Title: Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma

Abstract: Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Fed-SB, a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix ($R$) between adapters $B$ and $A$, keeping other components fixed. Direct averaging of $R$ guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB offers a state-of-the-art, efficient, and scalable solution for both private and non-private federated fine-tuning. Our code is available publicly at: https://github.com/CERT-Lab/fed-sb.

URL: https://openreview.net/forum?id=87UyFEhzyP
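
The claim that directly averaging the small matrix $R$ yields exact aggregated updates follows from linearity, and can be checked numerically in a toy sketch (random matrices stand in for trained adapters; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 16, 4

# Frozen low-rank adapters shared by all clients (LoRA-SB keeps B and A fixed).
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))

# Each client trains only its small r x r matrix R.
client_Rs = [rng.standard_normal((r, r)) for _ in range(5)]

# Server step: plain averaging of R. Because every update is B @ R @ A with
# the same B and A, averaging R equals averaging the full weight updates.
R_avg = sum(client_Rs) / len(client_Rs)
avg_of_updates = sum(B @ R @ A for R in client_Rs) / len(client_Rs)
update_from_avg = B @ R_avg @ A
# np.allclose(avg_of_updates, update_from_avg) -> True
```

Contrast this with vanilla federated LoRA, where averaging the products B_i @ A_i is not the product of the averaged adapters; that mismatch is the suboptimal-update problem the abstract mentions. Each client here also communicates only r*r numbers per round instead of 2*d*r.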

---

Title: Rethinking Disentanglement under Dependent Factors of Variation

Authors: Antonio Almudévar, Alfonso Ortega

Abstract: Representation learning enables the discovery and extraction of underlying factors of variation from data. A representation is typically considered disentangled when it isolates these factors in a way that is interpretable to humans. Existing definitions and metrics for disentanglement often assume that the factors of variation are statistically independent. However, this assumption rarely holds in real-world settings, limiting the applicability of such definitions and metrics in real-world applications. In this work, we propose a novel definition of disentanglement grounded in information theory, which remains valid even when the factors are dependent. We show that this definition is equivalent to requiring the representation to consist of minimal and sufficient variables. Based on this formulation, we introduce a method to quantify the degree of disentanglement that remains effective in the presence of statistical dependencies among factors. Through a series of experiments, we demonstrate that our method reliably measures disentanglement in both independent and dependent settings, where existing approaches fail under the latter.

URL: https://openreview.net/forum?id=PgwkNC63CS

---

Title: Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Authors: Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin

Abstract: Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple \emph{different} Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial?
We propose Self-MoA --- an ensemble method that aggregates outputs from only the \emph{single} top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves $6.6\%$ improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of $3.8\%$ improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA, that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.

URL: https://openreview.net/forum?id=K6WwK8URlV

---

Title: Incentive-Aware Synthetic Control: Accurate Counterfactual Estimation via Incentivized Exploration

Authors: Dung Daniel Ngo, Keegan Harris, Anish Agarwal, Vasilis Syrgkanis, Steven Wu

Abstract: Synthetic control methods (SCMs) are a canonical approach for estimating treatment effects from panel data in the internet economy. We shed light on a frequently overlooked but ubiquitous assumption made in SCMs, ``overlap'': a treated unit can be written as some combination---typically, convex or linear---of the units that remain under control. We show that if units select their own interventions, and there is sufficiently large heterogeneity between units that prefer different interventions, overlap will not hold. We address this issue by proposing a recommender system that incentivizes units with different preferences to take interventions they would not normally consider. Specifically, leveraging tools from information design and online learning, we propose an SCM that incentivizes exploration in panel data settings by providing incentive-compatible intervention recommendations to units. We establish that this estimator obtains valid counterfactual estimates without the need for an a priori overlap assumption. We extend our results to the setting of synthetic interventions, where the goal is to produce counterfactual outcomes under all interventions, not just control. Finally, we provide two hypothesis tests for determining whether unit overlap holds for a given panel dataset.
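For readers unfamiliar with SCMs, here is a minimal linear synthetic-control sketch on toy data. Overlap holds here by construction (the treated unit is an exact linear combination of the donors), which is precisely the assumption the paper interrogates:

```python
import numpy as np

rng = np.random.default_rng(1)
T0, T1, n_donors = 30, 10, 5                     # pre/post periods, donor pool
controls = rng.normal(size=(T0 + T1, n_donors))  # outcomes of control units
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
treated = controls @ true_w                      # overlap holds by construction
treated_post_obs = treated[T0:] + 2.0            # a +2.0 treatment effect

# Fit donor weights on pre-treatment outcomes only (linear SCM, least squares).
w, *_ = np.linalg.lstsq(controls[:T0], treated[:T0], rcond=None)
counterfactual = controls[T0:] @ w               # predicted untreated outcomes
effect = float(np.mean(treated_post_obs - counterfactual))
print(round(effect, 3))   # recovers the effect of 2.0
```

When overlap fails (e.g., the treated unit lies outside the donors' span), the same least-squares fit extrapolates and the counterfactual estimate becomes unreliable, which is the failure mode the paper's incentivized-exploration mechanism is designed to prevent.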

URL: https://openreview.net/forum?id=koln3ufP5c

---

Title: Learning to reconstruct from saturated data: audio declipping and high-dynamic range imaging

Authors: Victor Sechaud, Laurent Jacques, Patrice Abry, Julián Tachella

Abstract: Learning-based methods are now ubiquitous for solving inverse problems, but their deployment in real-world applications is often hindered by the lack of ground-truth references for training. Recent self-supervised learning strategies offer a promising alternative, avoiding the need for ground truth. However, most existing methods are limited to linear inverse problems. This work extends self-supervised learning to the non-linear problem of recovering audio and images from clipped measurements, by assuming that the signal distribution is approximately invariant to changes in amplitude. We provide sufficient conditions for learning to reconstruct from saturated signals alone, together with a self-supervised loss that can be used to train reconstruction networks. Experiments on both audio and image data show that the proposed approach is almost as effective as fully supervised approaches, despite relying solely on clipped measurements for training.
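A small sketch of the measurement model (hard clipping) and a self-supervised consistency score that needs no ground truth: it only checks agreement on unsaturated samples and that the reconstruction lies beyond the threshold where the observation saturated. The loss form is illustrative, not the paper's exact training loss:

```python
import numpy as np

def clip_op(x, tau=1.0):
    """Saturation (hard clipping) at threshold tau."""
    return np.clip(x, -tau, tau)

def declip_consistency(x_hat, y, tau=1.0):
    """Self-supervised score: a valid reconstruction must (i) agree with the
    observation y on unsaturated samples and (ii) lie beyond the threshold
    wherever y saturated. Zero means fully consistent with the measurement."""
    unsat = np.abs(y) < tau
    score = np.mean((x_hat[unsat] - y[unsat]) ** 2)
    if (~unsat).any():  # hinge penalty on saturated samples inside [-tau, tau]
        slack = np.maximum(0.0, tau - np.sign(y[~unsat]) * x_hat[~unsat])
        score += np.mean(slack ** 2)
    return score

x = np.linspace(-2, 2, 9)              # ground-truth signal
y = clip_op(x)                         # clipped observation
print(declip_consistency(x, y))        # 0.0: the true signal is consistent
print(declip_consistency(0 * y, y))    # > 0: an all-zero guess is inconsistent
```

Measurement consistency alone does not pin down the signal in the saturated region; the paper's amplitude-invariance assumption is what supplies the missing training signal there.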

URL: https://openreview.net/forum?id=AkJWgglkLd

---

Title: Frictionless Hamiltonian Descent with Discretization and Parallel Optimization

Authors: Heng Zhu, Jun-Kun Wang

Abstract: Frictionless Hamiltonian Descent is a recently proposed optimization method that leverages a fundamental principle from classical mechanics. The algorithm is based on energy conservation of the Hamiltonian Flow, with the kinetic energy reset at each iteration, and is shown to be a descent method. However, idealized frictionless Hamiltonian Descent requires access to an oracle for the Hamiltonian Flow, and exactly implementing the flow becomes elusive when the underlying function is not quadratic. Motivated by the considerable popularity of Hamiltonian dynamics in sampling, where a geometric numerical integrator is used to simulate idealized Hamiltonian Monte Carlo, we consider Hamiltonian Descent with two kinds of integrators, which results in new optimization dynamics. Moreover, we extend the original framework by introducing various forms of kinetic energy. This expansion yields a broad class of optimization algorithms and provides a fresh perspective on algorithm design. We further propose a novel technique for parallelizing the inherently sequential updates of the proposed optimization algorithms, where gradients at different points are computed simultaneously. In practice, the parallelization technique improves the actual running time by 2-3x for multinomial logistic regression across a range of datasets when 4 GPUs are used, compared to approximating the Hamiltonian Flow in the standard sequential fashion on a single GPU.
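A compact sketch of the idealized scheme, using a leapfrog (velocity Verlet) integrator to approximate the Hamiltonian Flow and resetting the kinetic energy at each iteration. The quadratic objective, step size, and integration length are illustrative choices, not the paper's setup:

```python
import numpy as np

def hamiltonian_descent(grad, x0, step=0.1, leapfrog_steps=10, iters=50):
    """Minimize f by simulating Hamiltonian flow: at each outer iteration,
    reset the momentum to zero (kinetic-energy reset), run a few leapfrog
    steps, and keep the endpoint as the new iterate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        p = np.zeros_like(x)              # kinetic-energy reset
        g = grad(x)
        for _ in range(leapfrog_steps):   # standard leapfrog integrator
            p -= 0.5 * step * g
            x += step * p
            g = grad(x)
            p -= 0.5 * step * g
    return x

# Toy quadratic f(x) = 0.5 ||A x||^2 with gradient A^T A x.
A = np.diag([1.0, 3.0])
grad = lambda x: A.T @ A @ x
x_star = hamiltonian_descent(grad, [2.0, -1.5])
print(x_star)   # close to the minimizer at the origin
```

On a non-quadratic objective the flow has no closed form, which is exactly why the choice of geometric integrator (and of kinetic energy) becomes the design space the paper explores.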

URL: https://openreview.net/forum?id=114IOQ3JWe

---

Title: An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandit Problems

Authors: Amaury Gouverneur, Tobias Oechtering, Mikael Skoglund

Abstract: We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with parameter $\beta>0$, and both the action $a\in \mathcal{A}$ and the unknown parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo & Van Roy (2016), we derive regret bounds via the analysis of the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred by the agent and the information it just gained about the parameter $\theta$. We improve upon previous results and establish that the information ratio is bounded by $d(4/\alpha)^2$, where $d$ is the dimension of the problem and $\alpha$ is a \emph{minimax measure} of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$. Notably, our bound does not scale exponentially with the logistic slope and is independent of the cardinality of the action and parameter spaces. Using this result, we derive a bound on the Thompson Sampling expected regret of order $O(d \alpha^{-1} \sqrt{T \log(\beta T/d)})$, where $T$ is the number of time steps. To our knowledge, this is the first regret bound for any logistic bandit algorithm that avoids any exponential scaling with $\beta$ and is independent of the number of actions. In particular, when the parameters are on the sphere and the action space contains the parameter space, the expected regret bound is of order $O(d \sqrt{T \log(\beta T/d)})$.
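To make Thompson Sampling with logistic rewards concrete, here is a minimal sketch that maintains an exact posterior over a small finite grid of parameters. The paper's analysis concerns the continuous unit-ball setting; the grid, action set, and $\beta$ below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])   # in the unit ball
thetas  = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # finite grid on Theta
true_idx = 1                                               # unknown to the agent

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

log_post = np.zeros(len(thetas))   # uniform prior over the grid
rewards = []
for _ in range(1000):
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    th = thetas[rng.choice(len(thetas), p=post)]       # 1) sample from posterior
    a = actions[np.argmax(actions @ th)]               # 2) act greedily w.r.t. it
    r = rng.random() < sigmoid(beta * a @ thetas[true_idx])   # logistic reward
    rewards.append(r)
    p_all = sigmoid(beta * (a @ thetas.T))             # likelihood per candidate
    log_post += np.log(np.where(r, p_all, 1.0 - p_all))

post = np.exp(log_post - log_post.max())
post /= post.sum()
print(post.round(3), np.mean(rewards))   # posterior concentrates on true theta
```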

URL: https://openreview.net/forum?id=94y5XfiJ7N

---

Title: Prompt-based Adaptation in Large-scale Vision Models: A Survey

Authors: Xi Xiao, Yunbei Zhang, Lin Zhao, Yiyang Liu, Xiaoying Liao, Zheda Mai, Xingjian Li, Xiao Wang, Hao Xu, Jihun Hamm, Xue Lin, Min Xu, Qifan Wang, Tianyang Wang, Cheng Han

Abstract: In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the ``pretrain-then-finetune'' paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). Within this framework, we distinguish methods by their injection granularity: VP operates at the pixel level, while VPT injects prompts at the token level. We further categorize these methods by their generation mechanism into fixed, learnable, and generated prompts. Beyond the core methodologies, we examine PA's integration across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, this is the first comprehensive survey dedicated to PA's methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all areas to understand and explore the evolving landscape of PA-related research.
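The pixel-level (VP) versus token-level (VPT) distinction can be sketched in a few lines. All dimensions, and the border-pad prompt, are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches, n_prompts = 8, 16, 4

# VPT-style injection: prepend learnable prompt tokens to the token sequence
# entering a (frozen) Transformer backbone.
cls_token = rng.normal(size=(1, d))
patch_tokens = rng.normal(size=(n_patches, d))   # frozen patch embeddings
prompt_tokens = rng.normal(size=(n_prompts, d))  # the only trainable params
tokens = np.concatenate([cls_token, prompt_tokens, patch_tokens], axis=0)
print(tokens.shape)    # (1 + 4 + 16, 8) = (21, 8)

# VP, by contrast, perturbs pixels: x_prompted = x + a learnable padded delta.
image = rng.normal(size=(3, 32, 32))
delta = np.zeros_like(image)
delta[:, :2, :] = 0.1 * rng.normal(size=(3, 2, 32))   # e.g. a learned border pad
vp_input = image + delta
print(vp_input.shape)  # unchanged: (3, 32, 32)
```

The input shape is what separates the two families: VPT changes the token sequence length, while VP leaves the input shape intact and modifies pixel values.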

URL: https://openreview.net/forum?id=UwtXDttgsE

---

Title: Collaborative QA using Interacting LLMs. Impact of Network Structure, Node Capability and Distributed Data.

Authors: Adit Jain, Vikram Krishnamurthy, Yiming Zhang

Abstract: In this paper, we model and analyze how a network of interacting LLMs performs \textit{collaborative question-answering (CQA)} in order to estimate a ground truth given a distributed set of documents. This problem is interesting because LLMs often hallucinate when direct evidence to answer a question is lacking, and these effects become more pronounced in a network of interacting LLMs: hallucinations spread, causing previously accurate LLMs to hallucinate. We study interacting LLMs and their hallucinations by combining ideas from mean-field dynamics (MFD) in network science with the randomized utility model from economics to construct a useful generative model. We model each LLM with a latent state that indicates whether it is truthful with respect to the ground truth, and extend a tractable analytical model based on an MFD to capture the diffusion of information in a directed network of LLMs. To specify the probabilities that govern the dynamics of the MFD, we propose a randomized utility model. For a network of LLMs where each LLM has two possible latent states, we posit sufficient conditions for the existence and uniqueness of a fixed point, and analyze the behavior of the fixed point in terms of the incentive (e.g., test-time compute) given to individual LLMs. We experimentally study and analyze the behavior of a network of $100$ open-source LLMs with respect to data heterogeneity, node capability, network structure, and sensitivity to framing on multiple semi-synthetic datasets.
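A toy version of the mean-field idea: track the fraction x of truthful LLMs, let each LLM's next state follow a logit (randomized-utility) choice that depends on x and an incentive, and iterate the map to its fixed point. The utility coefficients below are made up for illustration; they are not the paper's model:

```python
import math

def F(x, incentive=1.0):
    """Mean-field map: probability an LLM is truthful next round, given the
    current truthful fraction x, via a logit (randomized-utility) choice."""
    utility_truthful = incentive + 2.0 * x      # truthful peers help
    utility_hallucinate = 1.5 * (1.0 - x)       # hallucination is contagious
    return 1.0 / (1.0 + math.exp(-(utility_truthful - utility_hallucinate)))

x = 0.5
for _ in range(100):      # fixed-point iteration of the mean-field dynamics
    x = F(x)
print(round(x, 4))        # stable truthful fraction at equilibrium
```

Here |F'| < 1 everywhere, so the map is a contraction and the fixed point is unique; the paper's sufficient conditions play the analogous role for its two-state MFD.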

URL: https://openreview.net/forum?id=nyZ4JMrV8b

---

Title: Dynamics‑Aligned Diffusion Planning for Offline RL: A Unified Framework with Forward and Inverse Guidance

Authors: Zihao Wang, Ke Jiang, Xiaoyang Tan

Abstract: Diffusion-based planning has emerged as a powerful paradigm for offline reinforcement learning (RL). However, existing approaches often overlook the physical constraints imposed by real-world dynamics, resulting in dynamics inconsistency—a mismatch between diffusion-generated trajectories and those feasible under true environment transitions. To address this issue, we propose Dynamics-Aligned Diffusion Planning (DADP), a unified framework that explicitly enforces dynamics consistency during the diffusion denoising process. DADP offers two complementary variants: DADP-F (Forward), which employs a forward dynamics model to ensure state-level feasibility, and DADP-I (Inverse), which leverages an inverse dynamics model to enhance action-level executability. Both variants share a unified guidance formulation that integrates task return optimization and dynamics alignment through gradient-based updates. Experiments on state-based D4RL Maze2D and MuJoCo benchmarks demonstrate that DADP-F and DADP-I outperform state-of-the-art offline RL baselines, effectively reducing dynamics inconsistency and improving long-horizon robustness. This unifies diffusion-based planning with physically grounded dynamics modeling.

URL: https://openreview.net/forum?id=h3hG6EuqU2

---

Title: Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Authors: Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.

URL: https://openreview.net/forum?id=dfc2HpDSlH

---

Title: Genomic Next-Token Predictors are In-Context Learners

Authors: Nathan Breslow, Aayush Mishra, Michael Schatz, Anqi Liu, Mahler Revsine, Daniel Khashabi

Abstract: In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in *human* language. This raises a fundamental question: can ICL arise *organically* in other sequence domains purely through large-scale predictive training?

To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data.

URL: https://openreview.net/forum?id=KmNFx8DmaZ

---

Title: When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models

Authors: Sunny Sanyal, Ravid Shwartz-Ziv, Alex Dimakis, Sujay Sanghavi

Abstract: Large Language Models (LLMs) are known for their performance, but we uncover a significant structural inefficiency: a phenomenon we term attention collapse. In many pre-trained decoder-style LLMs, the attention matrices in deeper layers degenerate, collapsing to near rank-one structures. These underutilized layers, which we call lazy layers, are redundant and impair model efficiency. To address this, we introduce Inheritune, a simple yet powerful training recipe designed to build smaller, stronger language models. Inheritune initializes a compact model by inheriting the potent early layers from a larger pre-trained model and then progressively trains and expands it. Our experiments on various models, including the GPT-2 family, demonstrate that models trained with Inheritune can match or even surpass the performance of their larger counterparts, despite having significantly fewer layers. This work presents a novel path toward model compression by design, enabling the creation of compact, yet highly performant language models.
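Attention collapse can be probed with an effective-rank measure. A sketch on synthetic attention matrices (using the entropy-based effective rank of Roy & Vetterli; the matrices here are simulated, not extracted from an LLM):

```python
import numpy as np

def effective_rank(M, tol=1e-6):
    """Entropy-based effective rank (Roy & Vetterli) of a matrix."""
    s = np.linalg.svd(M, compute_uv=False)
    s = s[s > tol]
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
n = 16

# "Lazy" attention: every row is identical -> exactly rank one.
collapsed = np.tile(rng.dirichlet(np.ones(n)), (n, 1))

# Healthy attention: rows attend to different tokens.
scores = 3.0 * rng.normal(size=(n, n))
healthy = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(effective_rank(collapsed))   # 1.0
print(effective_rank(healthy))     # substantially larger
```

In the paper's terminology, layers whose attention matrices sit near effective rank one are the "lazy layers" that Inheritune declines to inherit.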

URL: https://openreview.net/forum?id=2zQn0bUoPf

---

Title: GIFT: A Framework Towards Global Interpretable Faithful Textual Explanations of Vision Classifiers

Authors: Eloi Zablocki, Valentin Gerard, Amaia Cardiel, Eric Gaussier, Matthieu Cord, Eduardo Valle

Abstract: Understanding the decision processes of deep vision models is essential for their safe and trustworthy deployment in real-world settings. Existing explainability approaches, such as saliency maps or concept-based analyses, often suffer from limited faithfulness, local scope, or ambiguous semantics. We introduce GIFT, a post-hoc framework that aims to derive Global, Interpretable, Faithful, and Textual explanations for vision classifiers. GIFT begins by generating a large set of faithful, local visual counterfactuals, then employs vision–language models to translate these counterfactuals into natural-language descriptions of visual changes. These local explanations are aggregated by a large language model into concise, human-readable hypotheses about the model’s global decision rules. Crucially, GIFT includes a verification stage that quantitatively assesses the causal effect of each proposed explanation by performing image-based interventions, ensuring that the final textual explanations remain faithful to the model’s true reasoning process. Across diverse datasets, including the synthetic CLEVR benchmark, the real-world CelebA faces, and the complex BDD driving scenes, GIFT reveals not only meaningful classification rules but also unexpected biases and latent concepts driving model behavior. Altogether, GIFT bridges the gap between local counterfactual reasoning and global interpretability, offering a principled and extensible approach to causally grounded textual explanations for vision models.

URL: https://openreview.net/forum?id=OwhW5MpFmD

---

Title: mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

Authors: Xiaona Zhou, Constantin Brif, Ismini Lourentzou

Abstract: Anomaly detection in multivariate time series is essential across domains such as healthcare, cybersecurity, and industrial monitoring, yet remains fundamentally challenging due to high-dimensional dependencies, the presence of cross-correlations between time-dependent variables, and the scarcity of labeled anomalies. We introduce mTSBench, the largest benchmark to date for multivariate time series anomaly detection and model selection, consisting of 344 labeled time series across 19 datasets from a wide range of application domains. We comprehensively evaluate 24 anomaly detectors, including the only two publicly available large language model-based methods for multivariate time series. Consistent with prior findings, we observe that no single detector dominates across datasets, motivating the need for effective model selection. We benchmark three recent model selection methods and find that even the strongest of them remain far from optimal. Our results highlight the outstanding need for robust, generalizable selection strategies. We open-source the benchmark at \url{https://plan-lab.github.io/mtsbench} to encourage future research.

URL: https://openreview.net/forum?id=8LfB8HD1WU

---

Title: Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks

Authors: Alexandros Doumanoglou, Kurt Driessens, Dimitrios Zarpalas

Abstract: Empirical evidence shows that deep vision networks often represent concepts as directions in latent space with concept information written along directional components in the vector representation of the input. However, the mechanism to encode (write) and decode (read) concept information to and from vector representations is not directly accessible as it constitutes a latent mechanism that naturally emerges from the training process of the network. Recovering this mechanism unlocks significant potential to open the black-box nature of deep networks, enabling understanding, debugging, and improving deep learning models.

In this work, we propose an unsupervised method to recover this mechanism. For each concept, we explain that under the hypothesis of linear concept representations, this mechanism can be implemented with the help of two directions: the first facilitating the encoding of concept information and the second facilitating its decoding. In contrast to previous matrix decomposition, autoencoder, and dictionary learning approaches, which rely on reconstructing feature activations, we propose a different perspective for learning these encoding-decoding direction pairs. We identify decoding directions through directional clustering of feature activations, and introduce signal vectors to estimate encoding directions from a probabilistic perspective. Unlike most other works, we also take advantage of the network's instructions encoded in its weights to guide our direction search. To this end, we illustrate that a novel technique called \textit{Uncertainty Region Alignment} can exploit these instructions to reveal the encoding-decoding mechanism of interpretable concepts that influence the network's predictions.

Our thorough and multifaceted analysis shows that, in controlled, toy settings with synthetic data, our approach can recover the ground-truth encoding-decoding direction pairs. In real-world settings, our method effectively reveals the encoding-decoding mechanism of interpretable concepts, often scoring substantially better in interpretability metrics than other unsupervised baselines, such as PCA and NMF. Finally, we provide concrete applications of how the learned directions can help open the black box and understand global model behavior, explain individual sample predictions in terms of local, spatially-aware, concept contributions and intervene on the network's prediction strategy to provide either counterfactual explanations or correct erroneous model behavior.

URL: https://openreview.net/forum?id=lIeyZpPEJn

---

Title: A Bayesian Bootstrap Framework for Mutual Information Neural Estimation: Bridging Classical Mutual Information Learning and Bayesian Nonparametric Learning

Authors: Forough Fazeli-Asl, Michael Minyi Zhang, Linglong Kong, Bei Jiang

Abstract: In this work, we introduce a Bayesian bootstrap resampling framework for estimating mutual information (MI) via ``mutual information neural estimation'' (MINE), making MINE directly applicable in a Bayesian nonparametric learning (BNPL) framework. The resulting estimator shows low variability across batch sizes and high-dimensional settings, as demonstrated through extensive numerical studies. In particular, our proposed bootstrap version yields tighter and lower-variance estimates than the original MINE formulation, both theoretically and empirically. We further demonstrate its practical value in a downstream task by improving VAE-GAN training within BNPL, leading to higher-quality outputs. Beyond enabling MI-based BNPL, the proposed bootstrap estimator also performs competitively against leading frequentist state-of-the-art benchmarks. Overall, our findings establish the first principled framework for Bayesian bootstrap-based MI estimation and highlight its effectiveness as a reliable tool for future BNPL studies.
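A sketch of the core quantity: the Donsker-Varadhan (DV) lower bound behind MINE, reweighted with Dirichlet (Bayesian bootstrap) weights on the samples. For brevity the critic is the known-optimal one for a correlated Gaussian pair, where MINE would instead train a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 5000, 0.8
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
y_shuffled = rng.permutation(y)       # approximate product-of-marginals samples

# Optimal DV critic for this Gaussian pair (closed form up to a constant;
# MINE would learn this function with a network instead).
def critic(a, b):
    return (rho * a * b - 0.5 * rho**2 * (a**2 + b**2)) / (1 - rho**2)

def dv_bound(w):
    """DV bound E_P[f] - log E_Q[e^f], with sample weights w (summing to 1)."""
    return (w * critic(x, y)).sum() - np.log((w * np.exp(critic(x, y_shuffled))).sum())

w_bb = rng.dirichlet(np.ones(n))      # Bayesian bootstrap: Dirichlet(1,...,1)
w_eq = np.full(n, 1.0 / n)            # classical equal-weight MINE estimate
true_mi = -0.5 * np.log(1 - rho**2)   # analytic MI for reference
print(dv_bound(w_bb), dv_bound(w_eq), true_mi)
```

Redrawing `w_bb` many times yields a posterior over MI estimates rather than a single point estimate, which is the Bayesian-nonparametric ingredient the paper builds on.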

URL: https://openreview.net/forum?id=mqGzGKXnFi

---

Title: SURFACEBENCH: A Geometry-Aware Benchmark for Symbolic Surface Discovery

Authors: Sanchit Kabra, Shobhnik Kriplani, Parshin Shojaee, Chandan K. Reddy

Abstract: Equation discovery from data is a central challenge in machine learning for science, which requires the recovery of concise symbolic expressions that govern complex physical and geometric phenomena. Recent large language model (LLM) approaches have shown promise in symbolic regression, yet existing benchmarks predominantly evaluate low-dimensional scalar functions and rely on string-level or regression-based metrics that fail to capture structural and geometric equivalence. We introduce SURFACEBENCH, the first geometry-aware benchmark for symbolic discovery of three-dimensional surfaces. Unlike scalar curve-fitting tasks, SURFACEBENCH targets surface-level reasoning, where multi-variable coupling, coordinate transformations, and geometric structure must be inferred directly from data. The benchmark comprises 183 analytically constructed, science-inspired surface equations across 15 categories and three representation paradigms: explicit, implicit, and parametric forms. Each task includes variable semantics and synthetically sampled 3D data, and is designed to stress symbolic composition, structural ambiguity, and representational non-uniqueness while mitigating memorization. To evaluate discovery quality, SURFACEBENCH incorporates symbolic equivalence checks, object-space geometric metrics (Chamfer and Hausdorff distances), and regression-based error measures, allowing evaluation of functional fidelity beyond algebraic syntax. Empirical evaluation across evolutionary, neural, and LLM-driven frameworks reveals that no current method achieves consistent performance across representation types, with LLM-based approaches exhibiting strong structural priors but limited robustness in parameter calibration and multi-equation reasoning. SURFACEBENCH provides a challenging and diagnostic testbed that bridges symbolic reasoning and geometric reconstruction, enabling principled benchmarking of compositional generalization and structure-aware scientific induction in high-dimensional equation discovery. The code and data are available at this link: https://github.com/deep-symbolic-mathematics/surfacebench.
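The object-space metric used for evaluation can be illustrated directly. A minimal (O(NM)-memory) symmetric Chamfer distance between point clouds, with toy sphere/plane samples standing in for predicted and ground-truth surfaces:

```python
import numpy as np

def chamfer(P, Q):
    """Symmetric Chamfer distance between point clouds P (N,3) and Q (M,3)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairs
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
def sphere(n):                      # uniform samples on the unit sphere
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

A, B = sphere(500), sphere(500)
plane = np.c_[rng.uniform(-1, 1, size=(500, 2)), np.zeros(500)]
print(chamfer(A, B))       # small: two samplings of the same surface
print(chamfer(A, plane))   # large: genuinely different geometry
```

Such object-space distances reward a recovered equation that traces the right surface even when its algebraic form differs from the ground truth, which string-level metrics cannot do.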

URL: https://openreview.net/forum?id=sHLTzkczSi

---

Title: Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Authors: Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa

Abstract: AI Scientist systems are autonomous agents capable of conducting scientific research. Understanding their current capabilities and risks is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: given a baseline paper from a human mentor, it analyzes the paper's limitations, formulates novel hypotheses for improvement, validates them through rigorous experimentation, and writes up the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. In our experiments, Jr. AI Scientist successfully generated new research papers that build upon real NeurIPS, IJCV, and ICLR works by proposing and implementing novel algorithms. For evaluation, we conducted automated assessments using AI reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven scientific contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher DeepReviewer scores than existing fully automated systems. Nevertheless, we identify important limitations from both the author evaluation and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report the various risks identified during development. We believe this study clarifies the current role and limitations of AI Scientist systems, offering insights into the areas that still require human expertise and the risks that may emerge as these systems evolve. The generated papers and the authors' annotations on them are included in the supplementary materials. Issues, comments, and questions are all welcome at https://github.com/Agent4Science-UTokyo/Jr.AI-Scientist.

URL: https://openreview.net/forum?id=OeV062d8Sw

---

Title: Gaming and Cooperation in Federated Learning: What Can Happen and How to Monitor It

Authors: Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh

Abstract: The success of federated learning (FL) ultimately depends on how strategic participants behave under partial observability, yet most formulations still treat FL as a static optimization problem. We instead view FL deployments as governed strategic systems and develop an analytical framework that separates welfare-improving behavior from metric gaming. Within this framework, we introduce indices that quantify manipulability, the price of gaming, and the price of cooperation, and we use them to study how rules, information disclosure, evaluation metrics, and aggregator-switching policies reshape incentives and cooperation patterns. We derive threshold conditions for deterring harmful gaming while preserving benign cooperation, and for triggering auto-switch rules when early-warning indicators become critical. Building on these results, we construct a design toolkit including a governance checklist and a simple audit-budget allocation algorithm with a provable performance guarantee. Simulations across diverse stylized environments and a federated learning case study consistently match the qualitative and quantitative patterns predicted by our framework. Taken together, our results provide design principles and operational guidelines for reducing metric gaming while sustaining stable, high-welfare cooperation in FL platforms.

URL: https://openreview.net/forum?id=Ck3q5YdWIv

---

Title: Multiscale Training of Convolutional Neural Networks

Authors: Shadab Ahamed, Niloufar Zakariaei, Eldad Haber, Moshe Eliasof

Abstract: Training convolutional neural networks (CNNs) on high‑resolution images is often bottlenecked by the cost of evaluating gradients of the loss on the finest spatial mesh. To address this, we propose Multiscale Gradient Estimation (MGE), a Multilevel Monte Carlo‑inspired estimator that expresses the expected gradient on the finest mesh as a telescopic sum of gradients computed on progressively coarser meshes. By assigning larger batches to the cheaper coarse levels, MGE achieves the same variance as single‑scale stochastic gradient estimation while reducing the number of fine mesh convolutions by a factor of 4 with each downsampling. We further embed MGE within a Full‑Multiscale training algorithm that solves the learning problem on coarse meshes first and "hot‑starts" the next finer level, cutting the required fine mesh iterations by an additional order of magnitude. Extensive experiments on image denoising, deblurring, inpainting and super‑resolution tasks using UNet, ResNet and ESPCN backbones confirm the practical benefits: Full-Multiscale reduces the computation costs by 4-16$\times$ with no significant loss in performance. Together, MGE and Full‑Multiscale offer a principled, architecture‑agnostic route to accelerate CNN training on high‑resolution data without sacrificing accuracy, and they can be combined with other variance‑reduction or learning‑rate schedules to further enhance scalability.
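The telescoping identity behind MGE can be demonstrated on a toy quadratic loss, where "coarsening" is average pooling and the cheap coarse levels get the large batches. Batch sizes and the loss are illustrative, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20_000
fine = rng.normal(size=(N, 64))            # signals on the finest mesh

def pool(x):                               # one level of mesh coarsening
    return 0.5 * (x[:, ::2] + x[:, 1::2])

def g(theta, x):                           # gradient of mean((theta*x - 2x)^2)
    return np.mean(2.0 * (theta * x - 2.0 * x) * x)

mid, coarse = pool(fine), pool(pool(fine))
theta = 0.0
g_ref = g(theta, fine)                     # full fine-mesh gradient

# MGE telescoping:
#   E[g_fine] = E[g_coarse] + E[g_mid - g_coarse] + E[g_fine - g_mid],
# estimated with a large batch on the coarse mesh, smaller ones going finer.
i0, i1, i2 = rng.choice(N, 8192), rng.choice(N, 2048), rng.choice(N, 512)
g_mge = (g(theta, coarse[i0])
         + g(theta, mid[i1]) - g(theta, coarse[i1])
         + g(theta, fine[i2]) - g(theta, mid[i2]))
print(g_ref, g_mge)
```

Only 512 of the samples ever touch the fine mesh, yet the telescoped estimate is unbiased for the fine-mesh gradient; the correction terms have small variance because gradients at adjacent resolutions are strongly correlated.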

URL: https://openreview.net/forum?id=HTQuEZwEHw

---

Title: Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime

Authors: Noah Oberweis, Semih Cayci

Abstract: Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.
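As a minimal discretized reference, here is Euler-Maruyama SGLD on a strongly convex quadratic. The constant noise scale below is a simplification; the paper analyzes multiplicative, state-dependent noise:

```python
import numpy as np

def sgld(grad, x0, step=1e-2, noise_scale=1e-2, iters=5000, seed=0):
    """Euler-Maruyama discretization of dX = -grad(X) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x += -step * grad(x) + noise_scale * np.sqrt(step) * rng.normal(size=x.shape)
    return x

# Strongly convex quadratic: iterates concentrate near the minimizer at 0.
grad = lambda x: 2.0 * x
x_final = sgld(grad, [3.0, -2.0])
print(x_final)   # small: within the stationary noise floor around the optimum
```

The drift contracts toward the minimizer at an exponential rate while the injected noise sets a stationary variance of order sigma^2, mirroring the exponential-convergence-plus-noise-floor structure of the paper's bounds.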

URL: https://openreview.net/forum?id=3s037DKmLo

---

Title: Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning

Authors: Yingxiao Huo, Satya Prakash Dash, Radu Stoican, Samuel Kaski, Mingfei Sun

Abstract: Natural gradients have long been studied in deep reinforcement learning due to their fast convergence properties and covariant weight updates. However, computing natural gradients requires inverting the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation to the full inverse FIM. We theoretically show that, under certain conditions, the rank-1 approximation to the inverse FIM converges faster than policy gradients and enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard trust-region and actor-critic baselines.
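
A rank-1 structure makes the inversion trivial via the Sherman-Morrison identity. The sketch below is a generic illustration under the assumption F = lam*I + v v^T; the factor `v` here is arbitrary, whereas the paper derives its own approximation.

```python
def natural_grad_rank1(g, v, lam=1.0):
    """Apply (lam * I + v v^T)^{-1} to the gradient g via the
    Sherman-Morrison identity, so no d x d matrix is ever formed:
        (lam*I + v v^T)^{-1} g = g/lam - v * (v.g) / (lam * (lam + v.v))."""
    vg = sum(vi * gi for vi, gi in zip(v, g))
    vv = sum(vi * vi for vi in v)
    coef = vg / (lam * (lam + vv))
    return [gi / lam - coef * vi for gi, vi in zip(g, v)]

# With lam = 1 and v = e1, F = diag(2, 1), so F^{-1} g = [0.5, 0.0].
print(natural_grad_rank1([1.0, 0.0], [1.0, 0.0]))  # [0.5, 0.0]
```

The cost per update is O(d), versus O(d^3) for a dense FIM inversion.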

URL: https://openreview.net/forum?id=ko8Kn7TS6m

---

Title: Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping

Authors: Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

Abstract: While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different Transformer modules, such as MLP and attention layers, remains under-explored. In this work, we investigate redundancy across different Transformer modules, including blocks, MLP layers, and attention layers, through the lens of layer dropping. Surprisingly, despite the pivotal role of attention mechanisms in distinguishing Transformers from other architectures, we find that a large portion of attention layers exhibit excessively high redundancy and can be pruned without degrading performance. For example, LLaMA-3-70B achieves a 43.4\% speedup with only a 1.8\% drop in performance by pruning half of its attention layers. In contrast, dropping MLP layers severely impairs the model's ability to distinguish between tokens, leading to catastrophic performance degradation. Moreover, our analysis reveals that attention layer redundancy not only persists throughout training but is also evident in randomly initialized models. We attribute this redundancy to three key factors that constrain representational updates from attention layers: sparse attention patterns, over-smoothed token embeddings, and the low representational magnitude of attention outputs. Overall, our findings offer valuable insights into the internal redundancy of Transformer architectures and provide practical guidance for designing more efficient LLMs. The code is released at: https://github.com/CASE-Lab-UMD/LLM-Drop.
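
The redundancy argument can be made concrete with a toy residual stack, where each layer writes an update onto the residual stream. The update magnitudes below are invented for illustration; they mimic the paper's observation that attention outputs have low representational magnitude, so skipping those layers barely moves the final output.

```python
# Toy residual stack: each "layer" maps h -> h + f(h). A layer whose
# update has small magnitude (like the attention layers in the paper's
# analysis) can be dropped with little change to the output.
def run(layers, h, skip=()):
    for i, f in enumerate(layers):
        if i not in skip:
            h = h + f(h)
    return h

mlp = lambda h: 0.5 * h          # large representational update
attn = lambda h: 0.01 * h        # low-magnitude update (redundant)

layers = [attn, mlp, attn, mlp]
full = run(layers, 1.0)
no_attn = run(layers, 1.0, skip={0, 2})
print(full, no_attn)             # nearly identical outputs
```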

URL: https://openreview.net/forum?id=1I7PCbOPfe

---

Title: Proxy-Anchor and EVT-Driven Continual Learning Method for Generalized Category Discovery

Authors: Alireza Fathalizadeh, Roozbeh Razavi-Far

Abstract: Continual generalized category discovery has been introduced and studied in the literature as a method that aims to continuously discover and learn novel categories in incoming data batches while avoiding catastrophic forgetting of previously learned categories. A key component in addressing this challenge is the model’s ability to separate novel samples, where Extreme Value Theory (EVT) has been effectively employed. In this work, we propose a novel method that integrates EVT with proxy anchors to define boundaries around proxies using a probability of inclusion function, enabling the rejection of unknown samples. Additionally, we introduce a novel EVT-based loss function to enhance the learned representation, achieving superior performance compared to other deep-metric learning methods in similar settings. Using the derived probability functions, novel samples are effectively separated from previously known categories. However, category discovery within these novel samples can sometimes overestimate the number of new categories. To mitigate this issue, we propose a novel EVT-based approach to reduce the model size and discard redundant proxies. We also incorporate novel experience replay and knowledge distillation mechanisms during the continual learning stage to prevent catastrophic forgetting. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods in continual generalized category discovery scenarios.

URL: https://openreview.net/forum?id=P3Qe9yJRvf

---

Title: \texttt{Complex-Edit}: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Authors: Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie

Abstract: We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models’ ability to retain key elements from the input images; 3) Stronger models aren't necessarily more resilient towards higher complexity; 4) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 5) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 6) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises --- a phenomenon that intriguingly also manifests in the latest GPT-Image-1's outputs. The code for evaluation and data generation, and the test set, are released at https://github.com/UCSC-VLAA/Complex-Edit.

URL: https://openreview.net/forum?id=lL1JR6dxG8

---

Title: ODE-Constrained Generative Modeling of Cardiac Dynamics for 12-Lead ECG Synthesis

Authors: Yakir Yehuda, Kira Radinsky

Abstract: Generating realistic training data for supervised learning remains a significant challenge in artificial intelligence. This is particularly true in the synthesis of electrocardiograms (ECGs), where the objective is to develop a synthetic 12-lead ECG model. The primary challenge in this task lies in accurately modeling the intricate biological and physiological interactions among different ECG leads. Although mathematical process models have shed light on these dynamics, effectively incorporating this understanding into generative models is not straightforward. We introduce an innovative method that employs ordinary differential equations (ODEs) to enhance the fidelity of 12-lead ECG data generation. This approach integrates cardiac dynamics directly into the generative optimization process via a novel Euler Loss, producing biologically plausible data that respects real-world variability and inter-lead constraints. Empirical analysis on the G12EC and PTB-XL datasets demonstrates that augmenting training data with MultiODE-GAN yields consistent, statistically significant improvements in specificity across multiple cardiac abnormalities. This highlights the value of enforcing physiological coherence in synthetic medical data.
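
The idea of an ODE-constrained generative loss can be sketched as a forward-Euler residual penalty on generated trajectories. The toy dynamics x' = -x and the function names below are illustrative assumptions, not the paper's cardiac model.

```python
# Sketch of an "Euler loss": penalize generated trajectories whose
# discrete steps violate a governing ODE x'(t) = f(x).
def euler_loss(traj, f, dt):
    """Mean squared residual of the forward-Euler step
    x[t+1] ~= x[t] + dt * f(x[t])."""
    res = [traj[t + 1] - (traj[t] + dt * f(traj[t]))
           for t in range(len(traj) - 1)]
    return sum(r * r for r in res) / len(res)

f = lambda x: -x                         # toy dynamics, stands in for
dt = 0.1                                 # the cardiac ODE system
exact = [1.0]
for _ in range(10):                      # exact Euler rollout of x' = -x
    exact.append(exact[-1] + dt * f(exact[-1]))

print(euler_loss(exact, f, dt))          # 0.0 for a consistent trajectory
print(euler_loss([1.0] * 11, f, dt) > 0) # constant path violates the ODE
```

In training, such a residual would be added to the generator objective so that samples respect the physiological dynamics.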

URL: https://openreview.net/forum?id=4N56Pwwsti

---

Title: VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Authors: Mengzhuo Chen, Jiani zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

Abstract: Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples action utility learning from policy optimization by leveraging a pretrained Value Environment Model (VEM), which requires no live environment interaction during policy optimization. VEM predicts value-aligned action utilities directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., “Does this action advance the user’s goal?”). The framework operates in two stages: (1) pretraining VEM to learn action-level utility signals and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated across diverse benchmarks including Android-in-the-Wild for mobile apps and Multimodal-Mind2Web for web environments, VEM achieves state-of-the-art or highly competitive performance in both offline and online settings. It significantly outperforms environment-free baselines and matches or exceeds environment-based approaches, crucially without incurring interaction costs. Importantly, VEM demonstrates that robust, generalizable GUI agents can be trained efficiently using semantic-aware action utility prediction, proving effective across distinct interaction platforms like mobile and web. The code is available at https://github.com/microsoft/GUI-Agent-RL.

URL: https://openreview.net/forum?id=q1wLUxaBPn

---

Title: Repulsive Monte Carlo on the sphere for the sliced Wasserstein distance

Authors: Vladimir Petrovic, Rémi Bardenet, Agnes Desolneux

Abstract: In this paper, we consider the problem of computing the integral of a function on the unit sphere, in any dimension, using Monte Carlo methods. Although the methods we present are general, our guiding thread is the sliced Wasserstein distance between two measures on $\mathbb{R}^d$, which is precisely an integral over the $d$-dimensional sphere. The sliced Wasserstein distance (SW) has gained momentum in machine learning either as a proxy to the less computationally tractable Wasserstein distance, or as a distance in its own right, due in particular to its built-in alleviation of the curse of dimensionality. There have been recent numerical benchmarks of quadratures for the sliced Wasserstein, and our viewpoint differs in that we concentrate on quadratures where the nodes are repulsive, i.e. negatively dependent. Indeed, negative dependence can bring variance reduction when the quadrature is adapted to the integration task. Our first contribution is to extract and motivate quadratures from the recent literature on determinantal point processes (DPPs) and repelled point processes, as well as repulsive quadratures from the literature specific to the sliced Wasserstein distance. We then numerically benchmark these quadratures. Moreover, we analyze the variance of the UnifOrtho estimator, an orthogonal Monte Carlo estimator. Our analysis sheds light on UnifOrtho's success for the estimation of the sliced Wasserstein in large dimensions, as well as counterexamples from the literature. Our final recommendation for the computation of the sliced Wasserstein distance is to use randomized quasi-Monte Carlo in low dimensions and UnifOrtho in large dimensions. DPP-based quadratures only shine when quasi-Monte Carlo also does, while repelled quadratures show moderate variance reduction in general, but more theoretical effort is needed to make them robust.
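
A minimal sketch of the UnifOrtho idea: average the integrand over a uniformly random orthonormal frame instead of i.i.d. directions. The Gram-Schmidt construction below is one generic way to draw such a frame; for the quadratic integrand shown, the orthogonal structure integrates exactly, illustrating the variance reduction that negative dependence can bring.

```python
import random

def random_orthonormal_frame(d, rng):
    """Gram-Schmidt on Gaussian vectors: a uniformly random orthonormal
    frame, the node set of an orthogonal Monte Carlo quadrature."""
    frame = []
    while len(frame) < d:
        v = [rng.gauss(0, 1) for _ in range(d)]
        for u in frame:                      # remove components along u
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:
            frame.append([a / norm for a in v])
    return frame

def unif_ortho_estimate(f, d, reps, seed=0):
    """Average f over orthonormal nodes: negatively dependent directions
    instead of i.i.d. uniform samples on the sphere."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        for theta in random_orthonormal_frame(d, rng):
            vals.append(f(theta))
    return sum(vals) / len(vals)

# E[theta_1^2] over the unit sphere in R^3 is exactly 1/3, and an
# orthonormal frame integrates this quadratic with zero variance.
est = unif_ortho_estimate(lambda t: t[0] ** 2, 3, reps=5)
print(round(est, 6))  # 0.333333
```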

URL: https://openreview.net/forum?id=JSiTmB6Ehu

---

Title: Online Learning with Multiple Fairness Regularizers via Graph-Structured Feedback

Authors: Quan Zhou, Jakub Marecek, Robert Noel Shorten

Abstract: There is an increasing need to enforce multiple, often competing, measures of fairness within automated decision systems. The appropriate weighting of these fairness objectives is typically unknown a priori, may change over time and, in our setting, must be learned adaptively through sequential interactions. In this work, we address this challenge in a bandit setting, where decisions are made with graph-structured feedback.

URL: https://openreview.net/forum?id=y8iWuDZtEw

---

Title: Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey

Authors: Fatemeh Shahhosseini, Arash Marioriyad, Ali Momen, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban, Shaghayegh Haghjooy Javanmard

Abstract: Scientific idea generation is central to discovery, requiring the joint satisfaction of novelty and scientific soundness. Unlike standard reasoning or general creative generation, scientific ideation is inherently open-ended and multi-objective, making its automation particularly challenging. Recent advances in large language models (LLMs) have enabled the generation of coherent and plausible scientific ideas, yet the nature and limits of their creative capabilities remain poorly understood. This survey provides a structured synthesis of methods for LLM-driven scientific ideation, focusing on how different approaches trade off novelty and scientific validity. We organize existing methods into five complementary families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. To interpret their contributions, we adopt two complementary creativity frameworks: Boden’s taxonomy to characterize the expected level of creative novelty, and Rhodes’ 4Ps framework to analyze the aspects or sources of creativity emphasized by each method. By aligning methodological developments with cognitive creativity frameworks, this survey clarifies the evaluation landscape and identifies key challenges and directions for reliable and systematic LLM-based scientific discovery.

URL: https://openreview.net/forum?id=9lWojZKMjt

---

Title: Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

Authors: Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, Pan Xu

Abstract: We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on Decision Transformer (DT) type frameworks, which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works address the dynamics shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy is not directly applicable in RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented (REAG) method for DT type frameworks, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide a theoretical analysis demonstrating that the RCSL policy learned from REAG achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations, $REAG^{∗}_{Dara}$ and $REAG^{∗}_{MV}$. Thorough experiments on D4RL datasets and various DT-type baselines demonstrate that our methods consistently enhance the performance of DT type frameworks in off-dynamics RL.
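
The return-augmentation idea can be caricatured as aligning the first two moments of the source-domain returns with the target domain before conditioning a DT-style policy on them. This affine rescaling is a simplified stand-in for illustration, not the paper's actual distribution-alignment procedure.

```python
def augment_returns(source, target_mean, target_std):
    """Affine return augmentation (simplified stand-in): rescale
    source-domain returns-to-go so their mean and spread match the
    target domain, before conditioning the return-conditioned policy."""
    n = len(source)
    mu = sum(source) / n
    var = sum((r - mu) ** 2 for r in source) / n
    std = var ** 0.5 or 1.0          # guard against zero variance
    return [target_mean + target_std * (r - mu) / std for r in source]

src = [10.0, 20.0, 30.0]
out = augment_returns(src, target_mean=0.0, target_std=1.0)
print([round(r, 3) for r in out])    # [-1.225, 0.0, 1.225]
```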

URL: https://openreview.net/forum?id=QDVOr5J9Xp

---

Title: Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

Authors: Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki

Abstract: Mixture of Vision Encoders (MoVE) has emerged as a powerful approach to enhance the fine-grained visual understanding of multimodal large language models (MLLMs), improving their ability to handle tasks such as complex optical character recognition and scene understanding. Despite these advances, effectively combining diverse encoders and their visual tokens, while also scaling to high-resolution inputs, remains an open challenge. In this work, we conduct a systematic study of fusion designs for MoVE-based MLLMs, highlighting principles for token-level integration across complementary encoders. Our study shows that a lightweight recipe consisting of post-adaptation fusion with independent projectors, tile-level sequence interleaving, and dynamic tiling with global context delivers strong performance on diverse benchmarks. We integrate these principles into a simple and effective architecture that we call LEO. Extensive evaluation on 11 vision–language benchmarks demonstrates that LEO achieves better results on the majority of tasks compared to existing MoVE-based approaches. Furthermore, LEO adapts effectively to the specialized domain of autonomous driving without altering its architecture or training recipe, achieving competitive performance against established baselines and thereby highlighting its ability to generalize. The code is available at https://github.com/Mozhgan91/LEO.

URL: https://openreview.net/forum?id=tgnTVmRybs

---

Title: STEALTH: Secure Transformer for Encrypted Alignment of Latent Text Embeddings via Semantic Isomorphism Enforcement (SIE) Loss Function

Authors: Nafew Azim, Nabeel Mohammed

Abstract: The pervasive use of large language models (LLMs) on sensitive data presents a critical privacy challenge, as traditional encryption renders data unusable for inference. We introduce STEALTH, a 120M-parameter secure transformer framework designed to process encrypted text while preserving its semantic utility under an authorized-key threat model (no decryption or side-channel access). The core innovation of STEALTH is the Semantic Isomorphism Enforcement (SIE) loss function, which trains the model to learn a topology-preserving mapping between encrypted text embeddings and their original plaintext latent space. This encourages preservation of semantic relationships and topological structure in the encrypted domain. Using retrieval-based reconstruction from a domain-aligned plaintext corpus, STEALTH achieves near-perfect semantic retrieval (BLEU score of 1.0 under full-corpus coverage in our experiments) and enables accurate privacy-preserving clustering on encrypted embeddings. We evaluate STEALTH across 44 datasets spanning general language understanding, healthcare, finance, legal, e-commerce, programming, content analysis, reading comprehension, and corporate communication domains with 16 encryption schemes (704 experimental conditions), establishing a comprehensive benchmark for privacy-preserving NLP on encrypted text. Performance depends on domain alignment between encrypted inputs and the indexed plaintext corpus. Our results demonstrate that, with well-aligned domain indexes and retrieval support, models can perform effective NLP on encrypted data without direct decryption.

URL: https://openreview.net/forum?id=73PV17dVCM

---

Title: Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning

Authors: Yuehan Qin, Li Li, Yi Nian, Xinyan Velocity Yu, Yue Zhao, Xuezhe Ma

Abstract: Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses. However, they can produce hallucinated outputs, especially when a user query includes one or more false premises—claims that contradict established facts. Such premises can mislead LLMs into offering fabricated or misleading details. Existing approaches include pretraining, fine-tuning, and inference-time techniques that often rely on access to logits or address hallucinations after they occur. These methods tend to be computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting their efficiency in real-time applications. We propose a retrieval-based framework that identifies and addresses false premises before generation. Our method first transforms a user’s query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise using factual sources. Finally, we incorporate the verification results into the LLM’s prompt to maintain factual consistency in the final output. Experiments show that this approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.
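
The pre-generation flow can be sketched as: extract premises, check each against retrieved facts, and prepend the verdicts to the prompt. The fact store, premise strings, and prompt template below are all invented for illustration; the paper uses logical query parsing plus retrieval-augmented checks against factual sources.

```python
# Minimal sketch of premise verification before generation.
FACTS = {"the Eiffel Tower is in Paris": True,
         "the Eiffel Tower is in Rome": False}

def verify_premises(premises):
    """Look up each premise; None marks an unverifiable claim."""
    return {p: FACTS.get(p) for p in premises}

def build_prompt(query, premises):
    """Prepend verified premise verdicts so the model is steered away
    from answering on top of a false premise."""
    checks = verify_premises(premises)
    notes = [f"- '{p}' is {'TRUE' if ok else 'FALSE'}"
             for p, ok in checks.items() if ok is not None]
    return "Premise check:\n" + "\n".join(notes) + f"\nAnswer factually: {query}"

prompt = build_prompt("When was the Eiffel Tower in Rome built?",
                      ["the Eiffel Tower is in Rome"])
print(prompt)
```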

URL: https://openreview.net/forum?id=BDxStRGWba

---

Title: KASPER: Kolmogorov Arnold Networks for Stock Prediction and Explainable Regimes

Authors: Vidhi Oad, Param Pathak, Nouhaila Innan, Shalini Devendrababu, Muhammad Shafique

Abstract: Forecasting in financial markets remains a significant challenge due to their nonlinear and regime-dependent dynamics. Traditional deep learning models, such as long short-term memory networks and multilayer perceptrons, often struggle to generalize across shifting market conditions, highlighting the need for a more adaptive and interpretable approach. To address this, we introduce Kolmogorov–Arnold networks for stock prediction and explainable regimes (KASPER), a novel framework that integrates regime detection, sparse spline-based function modeling, and symbolic rule extraction. The framework identifies hidden market conditions using a Gumbel-Softmax-based mechanism, enabling regime-specific forecasting. For each regime, it employs Kolmogorov–Arnold networks (KANs) with sparse spline activations to capture intricate price behaviors while maintaining robustness. Interpretability is achieved through symbolic learning based on Monte Carlo Shapley values, which extracts human-readable rules tailored to each regime. Applied to real-world financial time series from Yahoo Finance, the model achieves an $R^2$ value of 0.89, a Sharpe Ratio of 12.02, and a mean squared error as low as 0.0001, outperforming existing methods. This research establishes a new direction for regime-aware, transparent, and robust forecasting in financial markets.
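
The regime-detection component relies on Gumbel-Softmax sampling, which can be sketched generically as below. The regime logits are invented inputs, and the relaxation here is purely numeric rather than part of a differentiable model.

```python
import math, random

def gumbel_softmax(logits, tau=0.5, rng=random):
    """Gumbel-Softmax draw over regime logits: a relaxed, near-one-hot
    assignment of the current observation to a market regime."""
    g = [-math.log(-math.log(rng.random())) for _ in logits]  # Gumbel noise
    z = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(z)                                  # stabilized softmax
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

random.seed(0)
probs = gumbel_softmax([2.0, 0.1, -1.0])        # 3 candidate regimes
print([round(p, 3) for p in probs])             # near one-hot assignment
```

A lower temperature `tau` sharpens the assignment toward a hard regime choice.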

URL: https://openreview.net/forum?id=PD4jGJQtL8

---

Title: On Adversarial Attacks In Acoustic Localization

Authors: Tamir Shor, Chaim Baskin, Alex M. Bronstein

Abstract: Multi-rotor aerial vehicles (drones) are increasingly deployed across diverse domains, where accurate navigation is critical. The limitations of vision-based methods under poor lighting and occlusions have driven growing interest in acoustic sensing as an alternative. However, the security of acoustic-based localization has not been examined. Adversarial attacks pose a serious threat, potentially leading to mission-critical failures and safety risks. While prior research has explored adversarial attacks on vision-based systems, no work has addressed the acoustic setting. In this paper, we present the first comprehensive study of adversarial robustness in acoustic drone localization. We formulate white-box projected gradient descent (PGD) attacks from an external sound source and show their significant impact on localization accuracy. Furthermore, we propose a novel defense algorithm based on rotor phase modulation, capable of effectively recovering clean signals and mitigating adversarial degradation. Our results highlight both the vulnerability of acoustic localization and the potential for robust defense strategies.
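
White-box PGD itself is standard and can be sketched on a one-dimensional toy loss; the loss below is an illustrative assumption, not the paper's acoustic localization pipeline.

```python
# White-box PGD on a toy 1-D "localization loss": maximize the loss
# within an L-infinity ball of radius eps around the clean input.
def pgd_attack(loss_grad, x0, eps=0.1, step=0.02, iters=20):
    x = x0
    for _ in range(iters):
        x = x + step * (1 if loss_grad(x) > 0 else -1)   # signed ascent
        x = max(x0 - eps, min(x0 + eps, x))              # project to ball
    return x

# loss(x) = (x - 1)^2 grows as x moves away from 1; gradient is 2*(x - 1),
# so ascent pushes x to the ball boundary farthest from 1.
adv = pgd_attack(lambda x: 2 * (x - 1.0), x0=0.5)
print(round(adv, 2))  # 0.4: the boundary that maximizes the loss
```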

URL: https://openreview.net/forum?id=Nxm5xXoLFb

---

Title: Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm

Authors: Mansoor Davoodi, Setareh Maghsudi

Abstract: Multi-armed bandit (MAB) problems are widely applied to online optimization tasks that require balancing exploration and exploitation. In practical scenarios, these tasks often involve multiple conflicting objectives, giving rise to multi-objective multi-armed bandits (MO-MAB). Existing MO-MAB approaches predominantly rely on the Pareto regret metric introduced in \citet{drugan2013designing}. However, this metric has notable limitations, particularly in accounting for all Pareto-optimal arms simultaneously. To address these challenges, we propose a novel and comprehensive regret metric that ensures balanced performance across conflicting objectives. Additionally, we introduce the concept of \textit{Efficient Pareto-Optimal} arms, which are specifically designed for online optimization. Based on our new metric, we develop a two-phase MO-MAB algorithm that achieves sublinear regret for both Pareto-optimal and efficient Pareto-optimal arms.
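
The notion of Pareto-optimal arms underlying any MO-MAB regret definition reduces to a dominance check over mean-reward vectors. A minimal sketch with invented reward vectors:

```python
def dominates(a, b):
    """a Pareto-dominates b: at least as good in every objective and
    strictly better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(arms):
    """Indices of Pareto-optimal arms among mean-reward vectors."""
    return [i for i, a in enumerate(arms)
            if not any(dominates(b, a) for j, b in enumerate(arms) if j != i)]

# Two objectives per arm; arm 2 is dominated by arm 1.
arms = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9)]
print(pareto_front(arms))  # [0, 1, 3]
```

A regret metric that accounts for all Pareto-optimal arms must track play across this whole front, not just any single member.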

URL: https://openreview.net/forum?id=7N7sK5CFuP

---

Title: Message-Passing GNNs Fail to Approximate Sparse Triangular Factorizations

Authors: Vladislav Trifonov, Ekaterina Muravleva, Ivan Oseledets

Abstract: Graph Neural Networks (GNNs) have been proposed as a tool for learning sparse matrix preconditioners, which are key components in accelerating linear solvers. We present theoretical and empirical evidence that message-passing GNNs are fundamentally incapable of approximating sparse triangular factorizations for classes of matrices for which high-quality preconditioners exist but require non-local dependencies. To illustrate this, we construct a set of baselines using both synthetic matrices and real-world examples from the SuiteSparse collection. Across a range of GNN architectures, including Graph Attention Networks and Graph Transformers, we observe low cosine similarity ($\leq0.7$ in key cases) between predicted and reference factors. Our theoretical and empirical results suggest that architectural innovations beyond message-passing are necessary for applying GNNs to scientific computing tasks such as matrix factorization. Moreover, experiments demonstrate that overcoming non-locality alone is insufficient. Tailored architectures are necessary to capture the required dependencies since even a completely non-local Global Graph Transformer fails to match the proposed baselines.
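
The locality obstruction is easy to see concretely: after k message-passing rounds, a node's representation can depend only on its k-hop neighborhood, while an entry of a triangular factor may depend on matrix entries arbitrarily far away in the sparsity graph. A minimal sketch:

```python
# After k rounds of message passing, information from `source` can have
# reached only nodes within k hops, whatever the update functions are.
def message_passing_reach(adj, source, rounds):
    """Nodes whose state can depend on `source` after `rounds` rounds."""
    reached = {source}
    for _ in range(rounds):
        reached |= {j for i in reached for j in adj[i]}
    return reached

# Path graph 0-1-2-3-4: information moves one hop per round, so a fixed-
# depth GNN cannot express factor entries coupling nodes 0 and 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(message_passing_reach(path, 0, rounds=2))  # {0, 1, 2}
```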

URL: https://openreview.net/forum?id=YIr9SzD3C9

---

Title: AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

Authors: Ioannis Tsingalis, Constantine Kotropoulos, Corentin Briat

Abstract: A novel regularization technique, AdaCubic, is proposed that adapts the weight of the cubic term in Newton’s cubic regularized method. The heart of AdaCubic is an auxiliary optimization problem with cubic constraints that dynamically adjusts this weight. We use Hutchinson’s method to approximate the Hessian matrix, thereby reducing computational cost. We demonstrate that AdaCubic inherits the cubically regularized Newton method’s local convergence guarantees. Our experiments in Computer Vision, Natural Language Processing, and Signal Processing tasks demonstrate that AdaCubic outperforms or competes with several widely used optimizers. Unlike other adaptive algorithms that require hyperparameter fine-tuning, AdaCubic is evaluated with a fixed set of hyperparameters, making it a highly attractive optimizer for researchers and practitioners in settings where fine-tuning is infeasible. To our knowledge, AdaCubic is the first optimizer to leverage cubic regularization in scalable deep learning applications.
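
Hutchinson's method needs only matrix-vector products, never the full Hessian. A generic sketch on a fixed 2x2 "Hessian" (an illustrative stand-in, not the paper's implementation):

```python
import random

def hutchinson_trace(matvec, d, samples=2000, seed=0):
    """Hutchinson's estimator: tr(H) ~= mean of z^T (H z) over random
    Rademacher vectors z, using only Hessian-vector products."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        hz = matvec(z)                       # one Hessian-vector product
        total += sum(a * b for a, b in zip(z, hz))
    return total / samples

# Symmetric toy Hessian [[2, 1], [1, 3]]; the true trace is 5.
est = hutchinson_trace(lambda z: [2 * z[0] + z[1], z[0] + 3 * z[1]], 2)
print(round(est, 2))  # close to the true trace 5
```

In deep learning the `matvec` would be a Hessian-vector product obtained by automatic differentiation, so the full matrix is never materialized.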

URL: https://openreview.net/forum?id=pZBQ7J37lk

---

Title: A Survey of Model Architectures in Information Retrieval

Authors: Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang Phillips, Jimmy Lin, Vivek Srikumar

Abstract: The period from 2019 to the present has represented one of the biggest paradigm shifts in information retrieval (IR) and natural language processing (NLP), culminating in the emergence of powerful large language models (LLMs) from 2022 onward. Methods leveraging pretrained encoder-only models (e.g., BERT) and decoder-only generative LLMs have outperformed many previous approaches, particularly excelling in zero-shot scenarios and complex reasoning tasks. Our survey study investigates the evolution of model architectures in IR, focusing on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. The review intentionally separates architectural considerations from training methodologies, in order to provide a focused analysis of structural innovations in IR systems. We trace the development from traditional term-based methods to modern neural approaches, particularly discussing the impact of transformer-based models and subsequent large language models (LLMs). We conclude with a forward-looking discussion of emerging challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal, multilingual data, and adaptation to novel application domains such as autonomous search agents that might be the next-generation paradigm of IR.

URL: https://openreview.net/forum?id=xAIbTbHRrX

---

Title: Simplifying Optimal Transport through Schatten-$p$ Regularization

Authors: Tyler Maunu

Abstract: We propose a new general framework for recovering low-rank structure in optimal transport using Schatten-$p$ norm regularization. Our approach extends existing methods that promote sparse and interpretable transport maps or plans, while providing a unified and principled family of convex programs that encourage low-dimensional structure. The convexity of our formulation enables direct theoretical analysis: we derive optimality conditions and prove recovery guarantees for low-rank couplings, barycentric displacements, and cross-covariances in simplified settings. To efficiently solve the proposed program, we develop a mirror descent algorithm with convergence guarantees in the convex setting. Experiments on synthetic and real data demonstrate the method’s efficiency, scalability, and ability to recover low-rank transport structures. In particular, we demonstrate its utility on a machine-learning task in learning transport between high-dimensional cell perturbations for biological applications. All code is publicly available at https://github.com/twmaunu/schatten_ot.
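
For intuition, the Schatten-$p$ norm is simply the $\ell_p$ norm of the singular values. A minimal sketch, taking the spectrum as given rather than computing an SVD:

```python
def schatten_p(singular_values, p):
    """Schatten-p norm: the l_p norm of the singular-value vector.
    Smaller p more aggressively promotes low rank; p = 1 is the
    nuclear norm, p = 2 the Frobenius norm."""
    return sum(s ** p for s in singular_values) ** (1.0 / p)

svals = [3.0, 1.0, 0.0]          # a rank-2 spectrum
print(schatten_p(svals, 1))      # 4.0 (nuclear norm)
print(schatten_p(svals, 2))      # ~3.162 (Frobenius norm)
```

Penalizing this norm of a coupling or displacement matrix is what lets the convex program prefer low-rank transport structure.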

URL: https://openreview.net/forum?id=DIawkTG5VH

---

Title: When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models

Authors: Raphaël Razafindralambo, Rémy Sun, Damien Garreau, Frederic Precioso, Pierre-Alexandre Mattei

Abstract: Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules using Deep Ensembles and Monte Carlo Dropout on CIFAR-10 and FFHQ. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance). Our Python code is available at https://github.com/rarazafin/score_diffusion_ensemble.
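To see why averaging in score space can improve the score-matching loss, here is a minimal numpy sketch with an analytically known score (a standard Gaussian) and two synthetically noised "score models"; the independent-noise assumption is a stand-in for independently trained ensemble members, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
true_score = -x  # score of N(0,1): d/dx log p(x) = -x

# Two imperfect "score models" with independent estimation noise
# (illustrative assumption; real members are separately trained networks).
s1 = true_score + rng.normal(scale=0.5, size=x.shape)
s2 = true_score + rng.normal(scale=0.5, size=x.shape)
s_ens = 0.5 * (s1 + s2)  # score-space averaging

mse = lambda s: np.mean((s - true_score) ** 2)
print(mse(s1), mse(s_ens))  # averaging independent errors halves the variance
```

The score-matching error drops, which mirrors the abstract's finding that likelihood-type metrics improve; it says nothing about FID, which is exactly the discrepancy the paper investigates.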

URL: https://openreview.net/forum?id=4iRx9b0Csu

---

Title: Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Reinforcement Learning

Authors: Alakh Sharma, Gaurish Trivedi, Kartikey Singh Bhandari, Yash Sinha, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa

Abstract: Scalable multi-agent reinforcement learning (MARL) remains a central challenge for AI. Existing population-based methods, such as Policy-Space Response Oracles (PSRO), require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta-Solver (GEMS), a surrogate-free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative-weights meta-dynamics, and a model-free empirical-Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage-based trust-region objective, eliminating the need to store and train separate actors. We evaluated GEMS in a variety of two-player and multi-player games, such as the Deceptive Messages Game, Kuhn Poker, and the Multi-Particle environment. We find that GEMS is up to ~$\mathbf{6\times}$ faster and uses $\mathbf{1.3\times}$ less memory than PSRO, while simultaneously reaping higher rewards. These results demonstrate that GEMS retains the game-theoretic guarantees of PSRO while overcoming its fundamental inefficiencies, enabling scalable multi-agent learning in multiple domains.
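The multiplicative-weights meta-dynamics mentioned in the abstract can be sketched in a few lines; in GEMS the payoffs would be unbiased Monte Carlo rollout estimates, whereas here they are a fixed toy vector (an assumption for illustration):

```python
import numpy as np

def multiplicative_weights(payoffs, eta=0.5, steps=50):
    """Multiplicative-weights update over a fixed payoff vector, one entry
    per policy anchor: weights grow exponentially with payoff, then are
    renormalized into a mixed strategy over anchors."""
    w = np.ones(len(payoffs))
    for _ in range(steps):
        w = w * np.exp(eta * payoffs)
        w = w / w.sum()  # keep w a probability vector
    return w

mix = multiplicative_weights(np.array([0.1, 0.5, 0.3]))
print(mix)  # mass concentrates on the highest-payoff anchor
```

With stationary payoffs the dynamics converge to the best anchor; the interesting behavior in the paper comes from payoffs that change as the UCB oracle expands the anchor set.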

URL: https://openreview.net/forum?id=ZwEJsXoBHD

---

Title: SyntheRela: A Benchmark For Synthetic Relational Database Generation

Authors: Valter Hudovernik, Martin Jurkovic, Erik Štrumbelj

Abstract: Synthesizing relational databases has started to receive more attention from researchers, practitioners, and industry. The task is more difficult than synthesizing a single table due to the added complexity of relationships between tables. For the same reasons, benchmarking methods for synthesizing relational databases introduces new challenges. Our work is motivated by a lack of an empirical evaluation of state-of-the-art methods and by gaps in the understanding of how such an evaluation should be done. We review related work on relational database synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data. We combine best practices, a novel robust detection metric, and a novel approach to evaluating utility with graph neural networks into a benchmarking tool. We use this benchmark to compare 6 open-source methods over 8 real-world databases, with a total of 39 tables. The open-source SyntheRela benchmark is available on GitHub with a public leaderboard.

URL: https://openreview.net/forum?id=Mi8XioazWy

---

Title: From Clutter to Clarity: Visual Recognition through Foveated Object-Centric Learning (FocL)

Authors: Amitangshu Mukherjee, Deepak Ravikumar, Kaushik Roy

Abstract: Human active vision integrates spatial attention (dorsal) and object recognition (ventral) as distinct information processing pathways. Rapid eye movements focus perception on task-relevant regions while filtering out background clutter. Mimicking this ventral specialization, we introduce FocL (Foveated Object-Centric Learning), a training strategy that biases image classification models toward label-consistent object regions by replacing full images with foveated crops. Standard training often relies on spurious correlations between label and background, increasing memorization of hard examples in the tail of the difficulty distribution. FocL simulates saccades by jittering fixation points and extracting foveated glimpses from annotated bounding boxes. This object-first restructuring reduces non-foreground contamination and lowers mean training loss. FocL reduces memorization, lowering mean cumulative sample loss by approximately 65% and making nearly all high-memorization samples (top 1%) easier to learn. It also increases the mean $\ell_2$ adversarial perturbation distance required to flip predictions by approximately 62%. On ImageNet-V1, FocL achieves up to 11% higher accuracy on oracle crops. When paired with the Segment Anything Model (SAM) as a dorsal proposal generator, FocL provides around a 7% gain on ImageNet-V1 and up to 8% under natural distribution shift (ImageNet-V2). Extending this setup to COCO, FocL improves cross-domain mAP by 3--4 points without any target-domain training. Finally, given object localization (bounding boxes), FocL reaches higher accuracy using roughly 56% fewer training images, offering a simple path to more robust and efficient visual recognition.

URL: https://openreview.net/forum?id=kVS7sMlv7P

---

Title: LVLM-Count: Enhancing the Counting Ability of Large Vision-Language Models

Authors: Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis

Abstract: Counting is a fundamental operation for various real-world visual tasks, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while their performance may be less prone to error for small numbers of objects, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs’ counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could otherwise lead to repetitive counting—a common issue in a naive divide-and-conquer implementation. We demonstrate the effectiveness of this approach across various datasets and benchmarks, establishing it as a valuable reference for evaluating future solutions.
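A minimal sketch of the divide-and-conquer idea, assuming known object centers; the paper's actual split-avoidance mechanism operates on image content rather than ground-truth centers, so this is only an illustration of why boundary handling prevents double counting:

```python
def count_by_tiles(centers, width, height, nx=2, ny=2):
    """Divide-and-conquer counting sketch: split the image into an nx-by-ny
    grid and count objects per tile. Assigning each object to the tile that
    contains its center means an object straddling a boundary is counted
    exactly once, avoiding the repetitive counting a naive split causes."""
    counts = [[0] * nx for _ in range(ny)]
    for x, y in centers:
        i = min(int(x / width * nx), nx - 1)
        j = min(int(y / height * ny), ny - 1)
        counts[j][i] += 1
    return counts

centers = [(10, 10), (90, 10), (50, 50), (30, 80)]  # one lies on a tile boundary
tiles = count_by_tiles(centers, width=100, height=100)
total = sum(sum(row) for row in tiles)
print(tiles, total)  # per-tile counts sum to the true total of 4
```

In the paper's setting each tile would be counted by the LVLM itself; the aggregation step is the same sum over sub-counts.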

URL: https://openreview.net/forum?id=G1i9MUQj63

---

Title: Thermodynamically Consistent Latent Dynamics Identification for Parametric Systems

Authors: Xiaolong He, Yeonjong Shin, Anthony Gruber, Sohyeon Jung, Kookjin Lee, Youngsoo Choi

Abstract: We propose an efficient thermodynamics-informed latent space dynamics identification (tLaSDI) framework for the reduced-order modeling of parametric nonlinear dynamical systems. This framework integrates autoencoders for dimensionality reduction with the newly developed parametric GENERIC formalism-informed neural networks (pGFINNs), which enable efficient learning of parametric latent dynamics while preserving key thermodynamic principles, such as free energy conservation and entropy generation, across the parameter space. To further enhance model performance, a physics-informed active learning strategy is incorporated, leveraging a greedy, residual-based error indicator to adaptively sample informative training data, outperforming uniform sampling at equivalent computational cost. Numerical experiments on the Burgers' equation and the 1D/1V Vlasov-Poisson equation demonstrate that the proposed method achieves up to 2,495x speed-up over the full-order numerical baseline with 1-3% relative errors, as well as significant reductions in training (50-90%) and inference (57-61%) cost. Moreover, the learned latent space dynamics reveal the underlying thermodynamic behavior of the system, offering valuable insights into the physical-space dynamics. Code is available at the repository: https://github.com/xiaolong7/pGFINN-tLaSDI.

URL: https://openreview.net/forum?id=Qy3oLpRzpf

---

Title: A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection

Authors: Randolph Linderman, Noah Cowan, Yiran Chen, Scott Linderman

Abstract: Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we show a formal relationship between Bayesian nonparametric models and the relative Mahalanobis distance score (RMDS), a commonly used method for OOD detection. Building on this connection, we propose Bayesian nonparametric mixture models with hierarchical priors that generalize the RMDS. We evaluate these models on the OpenOOD detection benchmark and show that Bayesian nonparametric methods can improve upon existing OOD methods, especially in regimes where training classes differ in their covariance structure and where there are relatively few data points per class.
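The relative Mahalanobis distance score at the heart of this connection can be sketched directly; the class means and covariances below are toy values rather than fitted embedding statistics:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Squared Mahalanobis distance of x to a Gaussian N(mean, cov)."""
    d = x - mean
    return float(d @ np.linalg.inv(cov) @ d)

def rmds(x, class_means, class_cov, bg_mean, bg_cov):
    """Relative Mahalanobis distance score: per-class distance minus the
    distance under a single background Gaussian fit to all data; the
    minimum over classes is lower for in-distribution points."""
    rel = [mahalanobis(x, m, class_cov) - mahalanobis(x, bg_mean, bg_cov)
           for m in class_means]
    return min(rel)

# Toy 2-class setup in 2D (illustrative parameters, not from the paper).
means = [np.array([2.0, 0.0]), np.array([-2.0, 0.0])]
class_cov = np.eye(2)
bg_mean, bg_cov = np.zeros(2), np.eye(2) * 5.0

x_in = np.array([2.1, 0.1])   # near a class mean
x_out = np.array([0.0, 8.0])  # far from both classes
print(rmds(x_in, means, class_cov, bg_mean, bg_cov),
      rmds(x_out, means, class_cov, bg_mean, bg_cov))
```

The paper's Bayesian nonparametric mixtures generalize exactly this score, replacing the fixed Gaussians with hierarchical priors over per-class means and covariances.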

URL: https://openreview.net/forum?id=w3bMXPMDW1

---

Title: A Watermark for Black-Box Language Models

Authors: Dara Bahri, John Frederick Wieting

Abstract: Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require \emph{white-box} access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. \emph{black-box} access), boasts a \emph{distortion-free} property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.

URL: https://openreview.net/forum?id=6gcHcgGmLo

---

Title: Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding

Authors: Guanyu Chen, Ruichen Wang, Tianren Zhang, Feng Chen

Abstract: In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model’s inherent in-weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a \textit{task representation space} and a \textit{sample representation space}. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate the effectiveness of our proposed architecture, CoQE, in the single-value answer setting. It not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across synthetic few-shot classification and a newly designed pseudo-arithmetic task. The code is available at: \url{https://github.com/McGuinnessChen/dual-representation-space-encoding}.

URL: https://openreview.net/forum?id=bJK7VIOWAU

---

Title: Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

Authors: Jiancheng Zhang, Yinglun Zhu

Abstract: Active learning (AL) is a principled strategy to reduce annotation cost in data-hungry deep learning. However, existing AL algorithms focus almost exclusively on unimodal data, overlooking the substantial annotation burden in multimodal learning. We introduce the first framework for multimodal active learning with unaligned data, where the learner must actively acquire cross-modal alignments rather than labels on pre-aligned pairs. This setting captures the practical bottleneck in modern multimodal pipelines, where unimodal features are easy to obtain but high-quality alignment is costly. We develop a new algorithm that combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. Extensive experiments on benchmark datasets demonstrate that our approach consistently reduces multimodal annotation cost while preserving performance; for instance, on the ColorSwap dataset it cuts annotation requirements by up to 40% without loss in accuracy.

URL: https://openreview.net/forum?id=xMLajoct78

---

Title: Diversity Boosts AI-Generated Text Detection

Authors: Advik Raj Basani, Pin-Yu Chen

Abstract: Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.
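A hedged sketch of surprisal-fluctuation features in the spirit of DivEye; the paper's actual feature set is richer, and the token log-probabilities below are made-up numbers:

```python
import numpy as np

def diversity_features(token_logprobs):
    """Statistics of per-token surprisal and of its first differences,
    capturing how unpredictability varies ("rhythm") across a text."""
    s = -np.asarray(token_logprobs, dtype=float)  # surprisal = -log p(token)
    ds = np.diff(s)
    return {
        "mean_surprisal": s.mean(),
        "surprisal_var": s.var(),   # burstiness of unpredictability
        "delta_var": ds.var(),      # how sharply surprisal jumps token to token
    }

# Toy intuition: human-like text tends to show burstier surprisal than LLM output.
human = diversity_features([-0.5, -4.0, -0.2, -6.0, -1.0])
model = diversity_features([-1.0, -1.2, -0.9, -1.1, -1.0])
print(human["surprisal_var"], model["surprisal_var"])
```

Features like these would then feed a lightweight classifier; the detection quality reported in the abstract comes from the full feature set, not this toy pair.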

URL: https://openreview.net/forum?id=WscAU9It1l

---

Title: Multimodal Prescriptive Deep Learning

Authors: Dimitris Bertsimas, Lisa Everest, Vasiliki Stoumpou

Abstract: We introduce a multimodal deep learning framework, Prescriptive Neural Networks (PNNs), that combines ideas from optimization and machine learning to perform treatment recommendation, and that, to the best of our knowledge, is among the first prescriptive approaches tested with both structured and unstructured data within a unified model. The PNN is a feedforward neural network trained on embeddings to output an outcome-optimizing prescription. In two real-world multimodal datasets, we demonstrate that PNNs prescribe treatments that are able to greatly improve estimated outcome rewards; by over 40% in transcatheter aortic valve replacement (TAVR) procedures and by 25% in liver trauma injuries. In four real-world, unimodal tabular datasets, we demonstrate that PNNs outperform or perform comparably to other well-known, state-of-the-art prescriptive models; importantly, on tabular datasets, we also recover interpretability through knowledge distillation, fitting interpretable Optimal Classification Tree models onto the PNN prescriptions as classification targets, which is critical for many real-world applications. Finally, we demonstrate that our multimodal PNN models achieve stability across randomized data splits comparable to other prescriptive methods and produce realistic prescriptions across the different datasets.

URL: https://openreview.net/forum?id=AwfWOCVLbJ

---

Title: Correctness-Aware Knowledge Distillation for Enhanced Student Learning

Authors: Ishan Mishra, Deepak Mishra, Jinjun Xiong

Abstract: In real-world learning, students rely on their mentors for guidance but must also develop the ability to recognize and learn from their mentors' mistakes. Inspired by this mentor-critic dynamic, we propose Mentor-Critic Distillation (MCD), a novel framework for knowledge distillation in machine learning. Traditional distillation methods risk transferring both correct insights and errors from the mentor (teacher model) to the student model, which can hinder student performance. Notably, previous state-of-the-art approaches fail to account for scenarios where the teacher is incorrect, often leaving the student model vulnerable to inheriting these errors. To address this limitation, MCD introduces a weighted knowledge transfer mechanism that decouples the learning process based on the mentor's correctness. When the mentor model is correct, the student model follows the mentor's guidance with a large weight on knowledge transfer. However, when the mentor is incorrect, the student relies more on the ground truth but still learns inter-class relationships from the mentor, adjusting the weight toward task-specific losses such as cross-entropy. This mentor-critic approach ensures that the student model benefits from the mentor's expertise without inheriting its mistakes. We provide theoretical analysis proving that MCD strictly generalizes vanilla KD and guarantees reduced negative transfer. We evaluate our Mentor-Critic Distillation across diverse teacher-student configurations on benchmark datasets, including CIFAR-100, ImageNet, and MedMNIST. Notably, MCD requires no architectural modifications or additional parameters, making it a practical drop-in replacement for standard knowledge distillation. These results highlight MCD's effectiveness in optimizing knowledge transfer and its robustness across diverse domains and data regimes, particularly in data-scarce scenarios typical of specialized domains such as medical imaging.
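The correctness-gated weighting can be sketched as a drop-in loss; the weight values and temperature below are illustrative assumptions, not the paper's exact schedule:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def mcd_loss(student_logits, teacher_logits, label, T=2.0,
             alpha_correct=0.9, alpha_wrong=0.2):
    """Correctness-aware distillation sketch: the KL-to-teacher term is
    weighted heavily when the teacher predicts the true label, and
    down-weighted (shifting weight to cross-entropy on the ground truth,
    while still transferring inter-class structure) when it is wrong."""
    ps = softmax(student_logits, T)
    pt = softmax(teacher_logits, T)
    kd = float(np.sum(pt * (np.log(pt) - np.log(ps))))   # KL(teacher || student)
    ce = -float(np.log(softmax(student_logits)[label]))  # hard-label loss
    alpha = alpha_correct if int(np.argmax(teacher_logits)) == label else alpha_wrong
    return alpha * (T ** 2) * kd + (1.0 - alpha) * ce

# Teacher correct vs. teacher wrong on a 3-class toy example.
l_ok = mcd_loss([2.0, 0.0, 0.0], [3.0, 0.0, 0.0], label=0)
l_bad = mcd_loss([2.0, 0.0, 0.0], [0.0, 3.0, 0.0], label=0)
print(l_ok, l_bad)
```

When the teacher is wrong, the ground-truth cross-entropy dominates, so the student is not pulled toward the teacher's mistake.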

URL: https://openreview.net/forum?id=XpRXmzd2sF

---

Title: Evaluating the Adversarial Robustness of CNNs Layer by Layer

Authors: Yaowen Wang, Daniel Cullina

Abstract: In order to measure the adversarial robustness of a feature extractor, Bhagoji et al. introduced a distance on example spaces measuring the minimum perturbation of a pair of examples to achieve identical feature extractor outputs. They related these distances to the best possible robust accuracy of any classifier using the feature extractor. By viewing initial layers of a neural network as a feature extractor, this provides a method of attributing adversarial vulnerability of the classifier as a whole to individual layers. However, this framework views any injective feature extractor as perfectly robust: any bad choices of feature representation can be undone by later layers. Thus the framework attributes all adversarial vulnerabilities to the layers that perform dimensionality reduction. Feature spaces at intermediate layers of convolutional neural networks are generally much larger than input spaces, so this methodology provides no information about the contributions of individual layers to the overall robustness of the network. We extend the framework to evaluate feature extractors with high-dimensional output spaces by composing them with a random linear projection to a lower dimensional space. This results in non-trivial information about the quality of the feature space representations for building an adversarial robust classifier.
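The random-projection composition is simple to sketch; the Gaussian scaling below follows the usual Johnson-Lindenstrauss convention, and the feature matrix is a stand-in for real intermediate-layer outputs:

```python
import numpy as np

def project_features(features, out_dim, seed=0):
    """Compose a high-dimensional feature extractor with a random linear
    projection to a lower-dimensional space. A Gaussian matrix scaled by
    1/sqrt(out_dim) roughly preserves pairwise distances, so robustness
    distances measured after projection remain informative while removing
    the 'injective extractors look perfectly robust' degeneracy."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    R = rng.normal(size=(d, out_dim)) / np.sqrt(out_dim)
    return features @ R

feats = np.random.default_rng(1).normal(size=(5, 512))  # stand-in layer outputs
low = project_features(feats, out_dim=32)
print(low.shape)  # (5, 32)
```

The per-layer analysis then applies the original Bhagoji et al. distance machinery to these projected features instead of the raw high-dimensional ones.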

URL: https://openreview.net/forum?id=2Gx9KzsaYB

---

Title: Adversarial Attacks in Weight-Space Classifiers

Authors: Tamir Shor, Ethan Fetaya, Chaim Baskin, Alex M. Bronstein

Abstract: Implicit Neural Representations (INRs) have recently garnered increasing interest in various research fields, mainly due to their ability to represent large, complex data in a compact and continuous manner. Past work further showed that numerous popular downstream tasks can be performed directly in the INR parameter space. Doing so can substantially reduce the computational resources required to process the represented data in their native domain. A major difficulty in using modern machine-learning approaches is their high susceptibility to adversarial attacks, which have been shown to greatly limit the reliability and applicability of such methods in a wide range of settings. In this work, we show that parameter-space models trained for classification are inherently robust to adversarial attacks, without the need for any robust training. To support our claims, we develop a novel suite of adversarial attacks targeting parameter-space classifiers, and we analyze practical considerations of such attacks.

URL: https://openreview.net/forum?id=eOLybAlili

---

Title: Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning

Authors: Hongseok Namkoong, Samuel Daulton, Eytan Bakshy

Abstract: Thompson sampling (TS) has emerged as a robust technique for contextual bandit problems. However, TS requires posterior inference and optimization for action generation, prohibiting its use in many online platforms where latency and ease of deployment are of concern. We operationalize TS by proposing a novel imitation-learning-based algorithm that distills a TS policy into an explicit policy representation, allowing fast decision-making and easy deployment in mobile and server-based environments. Using batched data collected under the imitation policy, our algorithm iteratively performs offline updates to the TS policy, and learns a new explicit policy representation to imitate it. Empirically, our imitation policy achieves performance comparable to batch TS while allowing more than an order of magnitude reduction in decision-time latency. Buoyed by low latency and simplicity of implementation, our algorithm has been successfully deployed in multiple video upload systems for Meta. Using a randomized controlled trial, we show our algorithm resulted in significant improvements in video quality and watch time.
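A toy, context-free sketch of the distillation idea: run Thompson sampling offline from Beta posteriors and fit an explicit policy to its decisions. The paper distills into a contextual policy representation via imitation learning; the posterior counts below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta posteriors for a 3-armed Bernoulli bandit after a batch of data
# (made-up counts; contexts and the learned policy network are omitted).
alpha = np.array([30.0, 5.0, 10.0])
beta = np.array([10.0, 15.0, 30.0])

def thompson_action():
    """One TS decision: sample each arm's mean from its posterior, play argmax."""
    return int(np.argmax(rng.beta(alpha, beta)))

# Distillation: imitate TS with an explicit, fixed action distribution
# estimated from sampled TS decisions -- no posterior sampling or
# optimization is needed at serving time, which is what cuts latency.
draws = np.array([thompson_action() for _ in range(5000)])
policy = np.bincount(draws, minlength=3) / len(draws)
print(policy)  # arm 0 (posterior mean 0.75) receives almost all the mass
```

Serving then reduces to sampling from (or taking the argmax of) the pre-computed `policy`, with the batched offline loop periodically refreshing both the posteriors and the distilled policy.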

URL: https://openreview.net/forum?id=J8PrWwvYX2

---

Title: VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Raghuveer Thirukovalluru, Xuan Zhang, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

Abstract: Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, retrieval-augmented generation (RAG) systems, and recommendation. To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering -- spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

URL: https://openreview.net/forum?id=TpU38jbKIJ

---

Title: LoRA-Ensemble: Efficient Uncertainty Modelling for Self-Attention Networks

Authors: Dominik J. Mühlematter, Michelle Halbheer, Alexander Becker, Dominik Narnhofer, Helge Aasen, Konrad Schindler, Mehmet Ozgur Turkoglu

Abstract: Numerous real-world decisions rely on machine learning algorithms and require calibrated uncertainty estimates. However, modern methods often yield overconfident, uncalibrated predictions. The dominant approach to quantifying the uncertainty inherent in the model is to train an ensemble of separate predictors and measure their empirical variance. In an explicit implementation, the ensemble has a high computational cost and memory footprint, especially if the base model itself is already large, like modern transformers. This motivates efforts to develop implicit ensemble methods that emulate the ensemble without explicitly instantiating all its members. We introduce LoRA-Ensemble, a parameter-efficient ensembling method for self-attention networks. It is based on Low-Rank Adaptation (LoRA), originally developed for efficient LLM fine-tuning, and extends it into an implicit ensembling scheme, where all ensemble members share the same, pre-trained self-attention network, but have individual low-rank matrices for the attention projections. The resulting method not only outperforms state-of-the-art implicit techniques like BatchEnsemble, but even matches or exceeds the accuracy of an Explicit Ensemble, while at the same time achieving superior calibration.
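The shared-backbone-plus-per-member-low-rank idea can be sketched in numpy; the dimensions, rank, and initialization scales below are arbitrary illustrative choices, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, members = 16, 2, 4

W = rng.normal(size=(d, d))                      # shared pre-trained projection
A = rng.normal(scale=0.1, size=(members, d, r))  # per-member low-rank factors
B = rng.normal(scale=0.1, size=(members, r, d))

def member_forward(x, i):
    """Attention-projection forward for ensemble member i: the shared
    weight W plus that member's LoRA-style low-rank update A_i @ B_i."""
    return x @ (W + A[i] @ B[i])

x = rng.normal(size=(3, d))
outs = np.stack([member_forward(x, i) for i in range(members)])
mean_pred = outs.mean(axis=0)  # ensemble prediction
uncert = outs.var(axis=0)      # empirical variance across members
print(mean_pred.shape, uncert.shape)
```

Because only the small `A` and `B` tensors differ between members, the memory overhead grows with `members * 2 * d * r` instead of `members * d * d`, which is the source of the parameter efficiency.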

URL: https://openreview.net/forum?id=yhXXmOMpSQ

---

Title: DiffKGW: Stealthy and Robust Diffusion Model Watermarking

Authors: Tianxin Wei, Ruizhong Qiu, Yifan Chen, Yunzhe Qi, Jiacheng Lin, Wenxuan Bao, Wenju Xu, Sreyashi Nag, Ruirui Li, Hanqing Lu, Zhengyang Wang, Chen Luo, Hui Liu, Suhang Wang, Jingrui He, Qi He, Xianfeng Tang

Abstract: Diffusion models are known for their supreme capability to generate realistic images. However, ethical concerns, such as copyright protection and the generation of inappropriate content, pose significant challenges for the practical deployment of diffusion models. Recent work has proposed a flurry of watermarking techniques that inject artificial patterns into initial latent representations of diffusion models, offering a promising solution to these issues. However, enforcing a specific pattern on selected elements can disrupt the Gaussian distribution of the initial latent representation. Inspired by watermarks for large language models (LLMs), we generalize the LLM KGW watermark to image diffusion models and propose a stealthy probability adjustment approach DiffKGW that preserves the Gaussian distribution of initial latent representation. In addition, we dissect the design principles of state-of-the-art watermarking techniques and introduce a unified framework. We identify a set of dimensions that explain the manipulation enforced by watermarking methods, including the distribution of individual elements, the specification of watermark shapes within each channel, and the choice of channels for watermark embedding. Through the empirical studies on regular text-to-image applications and the first systematic attempt at watermarking image-to-image diffusion models, we thoroughly verify the effectiveness of our proposed framework through comprehensive evaluations. On all the diffusion models, including Stable Diffusion, our approach induced from the proposed framework not only preserves image quality but also outperforms existing methods in robustness against a wide range of attacks.

URL: https://openreview.net/forum?id=OXi9vcIOgD

---

Title: MiniGPT-Med: A Unified Vision-Language Model for Radiology Image Understanding

Authors: Asma Alkhaldi, Raneem Alnajim, Layan Alabdullatef, Rawan Alyahya, Jun Chen, Deyao Zhu, Ahmed Z. Alsinan, Mohamed Elhoseiny

Abstract: Recent advances in artificial intelligence (AI) have precipitated significant breakthroughs in healthcare, particularly in the refinement of diagnostic procedures. However, existing studies have been limited in terms of functional coverage. This study introduces MiniGPT-Med, a vision-language model adapted from MiniGPT-v2 for medical applications through domain-specific fine-tuning on medical datasets. MiniGPT-Med demonstrates remarkable versatility across various imaging modalities, including X-rays, CT scans, and MRIs, enhancing its utility. The model is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery. Its integrated processing of both image and textual clinical data markedly improves diagnostic accuracy. Our empirical assessments confirm the superior performance of MiniGPT-Med in disease detection, medical report generation, and VQA benchmarks, representing a significant step towards reducing the gap in assisting radiology practice. Furthermore, it achieves state-of-the-art performance in medical report generation, with substantial gains in BERT-Sim over both specialist and generalist baselines, improving by 17 and 12 points, respectively. MiniGPT-Med promises to become a unified Vision-Language model for radiology diagnoses, enhancing diagnostic efficiency across a wide range of medical imaging applications.

URL: https://openreview.net/forum?id=NenHFEg1Di

---

Title: Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

Authors: Cong Fu, Yuchao Lin, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji

Abstract: Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. Then MLIP pre-trained models are trained with supervised learning to predict energy and forces given 3D molecular structures. Once trained, we show that the pre-trained models can be used in different ways to obtain geometries either explicitly or implicitly. First, it can be used to obtain approximate low-energy 3D geometries via geometry optimization. While these geometries do not consistently reach DFT-level chemical accuracy or convergence, they can still improve downstream performance compared to non-relaxed structures. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the pre-trained models can be directly fine-tuned for property prediction when ground truth 3D geometries are available. Our results demonstrate that MLIP pre-trained models trained on relaxation data can learn transferable molecular representations to improve downstream molecular property prediction and can provide practically valuable but approximate molecular geometries that benefit property predictions. Our code is publicly available at: https://github.com/divelab/AIRS/.

URL: https://openreview.net/forum?id=JwxhHTISJL

---

Title: From Feature Visualization to Visual Circuits: Effect of Model Perturbation

Authors: geraldin nanfack, Michael Eickenberg, Eugene Belilovsky

Abstract: Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emergent field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are typically interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework, where models are subtly perturbed to alter their interpretations while maintaining performance. However, existing model manipulation methods have two key limitations: (1) they manipulate either synthetic or natural feature visualizations individually, but not both simultaneously, and (2) no work has studied whether circuit-based interpretations are vulnerable to such manipulations.
This paper exposes these vulnerabilities by proposing a novel attack called ProxPulse that simultaneously manipulates both types of feature visualizations. Surprisingly, we find that visual circuits exhibit some robustness to ProxPulse. We therefore introduce CircuitBreaker, the first attack targeting entire circuits, which successfully manipulates circuit interpretations, revealing that circuits also lack robustness. The effectiveness of these attacks is validated across a range of pre-trained models, from smaller architectures like AlexNet to medium-scale models like ResNet-50, and larger ones such as ResNet-152 and DenseNet-201 on ImageNet. ProxPulse changes both visualization types with a $<1\%$ accuracy drop, while our CircuitBreaker attack manipulates visual circuits with attribution correlation scores dropping from near-perfect to ~0.6 while preserving circuit head functionality.

URL: https://openreview.net/forum?id=x6ZwuyTy65

---


New submissions
===============


Title: Missing Value Uncertainty: Could Collecting Missing Values Change the Prediction?

Abstract: In mission-critical domains such as sensor networks, operators often face the critical decision of whether to act on incomplete information or whether collecting missing values is likely to change the prediction. Existing methods typically focus on imputing missing values or quantifying model uncertainty, but they do not directly assess the stability of a prediction if missing values were to be revealed. To address this gap, we introduce a framework for Missing Value Uncertainty (MVU), the distribution of predictions induced by incomplete inputs at inference time. We formalize the problem by defining hard confidence: the probability that a prediction will not change after collecting the missing data. First, we propose a novel Direct Missing Value (DMV) estimator to efficiently estimate the MVU distribution, bypassing the need for expensive Monte Carlo sampling or model retraining. Second, we introduce the Missing Value Calibration Error (MVCE), a new metric specifically designed to evaluate the calibration of hard confidence values, and a post-hoc calibration procedure to improve MVU estimation. We showcase our method and metric on synthetic and real-world datasets.
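As a point of reference, the hard confidence defined in the abstract can be estimated by brute force with Monte Carlo sampling over the missing values, which is exactly the expensive baseline the DMV estimator is designed to bypass. A minimal sketch, assuming a toy linear classifier, mean imputation, and a standard normal distribution over the missing feature (all hypothetical choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x):
    # Toy linear classifier standing in for any trained model (hypothetical).
    w = np.array([1.0, -2.0, 0.5])
    return int(x @ w > 0.0)

def hard_confidence_mc(x, missing_idx, sampler, n_samples=1000):
    """Monte Carlo estimate of hard confidence: the probability that the
    prediction made from the incomplete input (here, mean-imputed) would
    survive actually collecting the missing values."""
    x_imputed = x.copy()
    x_imputed[missing_idx] = 0.0          # mean imputation (features standardized)
    base_pred = predict(x_imputed)
    agree = 0
    for _ in range(n_samples):
        x_full = x.copy()
        x_full[missing_idx] = sampler()   # draw a plausible completion
        agree += predict(x_full) == base_pred
    return agree / n_samples

x = np.array([0.2, np.nan, 1.0])          # feature 1 is missing
conf = hard_confidence_mc(x, missing_idx=[1],
                          sampler=lambda: rng.normal(0.0, 1.0, size=1))
print(round(conf, 2))
```

A hard confidence near 1 means collecting the missing feature is unlikely to change the decision; values near 0.5 flag predictions worth deferring.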

URL: https://openreview.net/forum?id=BRWTS5e03Z

---

Title: A survey on transfer learning for evolving domains

Abstract: Transfer learning explores how to leverage knowledge from various tasks or domains (sources) to enhance predictive performance in related tasks or domains (targets). Typically, transfer learning research is segmented into several isolated sub-areas (such as domain generalisation, domain adaptation or multi-domain learning), each making distinct assumptions about target data availability, such as the quantity of data and labels available at training time. However, there are several real-world applications where these problems occur as a continuum, evolving from one stage to another as more data and labels are progressively collected from each domain. In those cases, a robust transfer learning solution should seamlessly integrate an expanding dataset and progressively improve its performance over time.
In this survey, we review the state of the art in transfer learning from the perspective of this continuum, focusing on the data requirements of each method. We find that most methods are tailored to specific settings, and no current work considers an integrated view over the whole spectrum of data availability. We refer to this new perspective on transfer learning as Transfer Learning for Evolving Domains (TrED) and argue that it is an important and challenging direction for future research.

URL: https://openreview.net/forum?id=yrrZZaQZAM

---

Title: Learning to Strategically Acquire Resources in Competition

Abstract: We consider multiple agents competing to acquire some costly divisible resource (\emph{e.g.} shares of a financial asset, compute resources, etc.) over time. Leveraging a canonical model for price dynamics, we propose a novel game-theoretic model for this problem, generalizing settings studied in diverse literatures. Our analysis considers different assumptions on the information available to agents. Under partial-information with a common prior (which subsumes complete information as a special case), we establish the existence, uniqueness, and efficient computability of the Bayesian Nash equilibrium (BNE), and bound the price of anarchy. Next and more generally, we consider agents with no common prior learning to act optimally given realistic market feedback from repeated interactions. We provide sufficient conditions on agents doing simultaneous learning dynamics for last-iterate convergence to the BNE. For all settings, we provide detailed simulations based on real financial data, empirically validating our theory and offering new insights on strategic behavior in the context of trading and resource acquisition.

URL: https://openreview.net/forum?id=Z8j6jGVxAZ

---

Title: Don't Go Breaking My LLM: The Impact of Pruning Attention Layers on Explanation Faithfulness and Confidence Calibration

Abstract: Pruning Large Language Models (LLMs) reduces memory and inference costs by removing parts of the network, producing smaller models that retain most of their accuracy. As attention layers are the most resource-intensive parts of LLMs, pruning them is a promising compression strategy.
Prior work shows that up to $33\%$ of attention layers can be pruned with minimal accuracy loss. Nevertheless, the impact of attention pruning on model interpretability, specifically faithfulness and confidence calibration, remains unstudied.
To address this gap, we study how pruning attention layers affects explanation faithfulness and confidence calibration across five LLMs and eight datasets.
While the pruned models often maintain high accuracy, we find that their faithfulness and calibration often degrade.
Notably, faithfulness and calibration can fluctuate significantly, even when accuracy remains stable, highlighting a misalignment between model confidence, interpretability, and accuracy.
Our findings suggest that layer pruning can affect LLMs' interpretability and reliability in ways not captured by accuracy and efficiency measures alone. We recommend including explainability and calibration metrics when evaluating pruned models.

URL: https://openreview.net/forum?id=VxZd6HfMOo

---

Title: ARROW: Augmented Replay for RObust World models

Abstract: Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.

URL: https://openreview.net/forum?id=3FK2tFwNwK

---

Title: A Unified Theory of Sinusoidal Activation Families for Implicit Neural Representations

Abstract: Implicit Neural Representations (INRs) model continuous signals using compact neural networks, and they’ve become a standard tool in vision, graphics, and signal processing. A central challenge is capturing fine detail accurately without relying on heavy hand-crafted encodings or brittle training heuristics. Across the literature, periodic activations have emerged as a compelling remedy: from SIREN, which uses a single sinusoid with a fixed global frequency, to more recent architectures that use multiple sinusoids and sometimes learn their frequencies and phases. We study this family of sinusoidal activations and develop a principled theoretical and practical framework for trainable sinusoidal activations in INRs. We instantiate this framework with Sinusoidal Trainable Activation Functions (STAF), a Fourier-like activation whose amplitudes, frequencies, and phases are learned. Our analysis (i) establishes a Kronecker-equivalence construction that represents trainable sinusoidal activations using standard sine networks and quantifies how expressiveness grows, (ii) characterizes how the Neural Tangent Kernel (NTK) spectrum changes under a trainable sinusoidal parameterization, and (iii) provides an initialization method that produces standard normal post-activations without relying on asymptotic central limit theorem (CLT) arguments. Empirically, across images, audio, shapes, inverse problems (super-resolution and denoising), and NeRF, STAF is competitive and often superior in reconstruction fidelity, with consistently faster early-stage optimization. While periodic activations can reduce practical symptoms of spectral bias, our results show they do not eliminate it; instead, trainable sinusoids reshape the optimization landscape to improve the capacity–convergence trade-off.

URL: https://openreview.net/forum?id=ZDmBPYptbL

---

Title: Hard Samples, Bad Labels: Robust Loss Functions That Know When to Back Off

Abstract: Incorrectly labelled training data are frustratingly ubiquitous in both benchmark and specially curated datasets. Such mislabelling adversely affects the performance and generalizability of models trained through supervised learning on the associated datasets. Frameworks for detecting label errors typically require well-trained, well-generalized models; at the same time, however, most frameworks rely on training these models on corrupt data, which reduces model generalizability and the subsequent effectiveness of error detection unless a training scheme robust to label errors is employed. We propose two novel loss functions, Blurry Loss and Piecewise-zero Loss, that enhance robustness to label errors by de-weighting or disregarding difficult-to-classify samples, which are likely to be erroneous. These loss functions leverage the idea that mislabelled examples typically appear as outliers to their as-labelled class, being difficult to classify, and should contribute less to the learning signal. Comprehensive experiments on a variety of both artificially corrupted and real-world datasets demonstrate that the proposed loss functions outperform state-of-the-art robust loss functions in nearly all cases, achieving superior F1 and Balanced Accuracy scores for error detection. Further analyses through ablation studies confirm the mechanism through which these loss functions operate, and demonstrate their broad applicability to cases of both uniform and non-uniform corruption and to different label error detection frameworks. By using these robust loss functions, machine learning practitioners can more effectively identify, prune, or correct errors in their training data. Code, including a working demonstration Jupyter Notebook, is available at https://anonymous.4open.science/r/Robust_Loss-6BAD/.
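The "back off" idea can be illustrated with a deliberately simple stand-in: a cross-entropy that zeroes out samples the model finds nearly impossible to fit. This is a hypothetical sketch of the general mechanism, not the paper's actual Blurry or Piecewise-zero Loss:

```python
import numpy as np

def piecewise_zero_ce(p_true, threshold=0.05):
    """Hypothetical sketch of the 'back off' mechanism: standard cross-entropy
    on the labelled class, except that samples the model finds nearly
    impossible to fit (predicted probability below `threshold`) contribute
    nothing, on the assumption that they are likely mislabelled."""
    p = np.asarray(p_true, dtype=float)
    loss = -np.log(np.clip(p, 1e-12, 1.0))
    loss[p < threshold] = 0.0             # disregard hard (likely wrong) samples
    return loss

# Model's predicted probability for each sample's given label:
probs = np.array([0.9, 0.5, 0.01])        # last sample looks mislabelled
losses = piecewise_zero_ce(probs)
print(losses)
```

The easy sample and the ambiguous sample contribute their usual cross-entropy, while the outlier contributes zero gradient signal.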

URL: https://openreview.net/forum?id=PzJLHa6VvY

---

Title: Expressivity Saturation: Reduced Affine Region Usage Under Increasing Task Complexity

Abstract: Piecewise-affine neural networks (e.g., with ReLU or LeakyReLU activations) implement continuous piecewise-affine maps, and the number of affine regions provides a natural proxy for expressive capacity. However, the gap between theoretical region capacity and the affine regions realized after training remains insufficiently understood. We study this gap from two complementary perspectives. First, we give a rigorous, architecture-dependent theorem for one-dimensional probes: for multilayer perceptrons with piecewise-affine activations, the number of affine pieces realized along a probe is upper bounded by an explicit product of layer-wise width terms (and activation breakpoint factors). This yields a neuron-threshold lower bound for representing target functions with prescribed one-dimensional piece complexity, formalizing the minimal region budget required for complex signals. Second, we exactly enumerate affine regions realized within bounded 2D and higher-dimensional domains under controlled task complexity. Under fixed architectures and training protocols, increasing input--label complexity yields trained solutions with markedly fewer realized regions in the evaluation domain, even though worst-case architectural capacity is unchanged; we call this reduced region usage expressivity saturation. Moreover, in the most challenging regimes, 2D visualizations show that region-usage collapse often coincides with degraded decision boundaries. Finally, we visualize the training dynamics of affine-region partitions and decision boundaries, revealing a consistent refinement process during optimization.
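The one-dimensional probe analysis can be approximated numerically by tracking how often the joint ReLU activation pattern changes along a segment. A sketch with a small random ReLU MLP; this grid-based proxy is an assumption for illustration and can miss very thin regions, unlike the exact enumeration the abstract describes:

```python
import numpy as np

rng = np.random.default_rng(0)

def pieces_along_probe(weights, x0, x1, n=4001):
    """Approximately count the affine pieces a ReLU MLP realizes along the
    segment x0 -> x1 by counting changes of the joint activation pattern
    over a dense grid of probe points."""
    ts = np.linspace(0.0, 1.0, n)
    patterns = []
    for t in ts:
        h = (1 - t) * x0 + t * x1
        bits = []
        for W, b in weights:              # forward pass, recording ReLU signs
            pre = W @ h + b
            bits.append(pre > 0)
            h = np.maximum(pre, 0.0)
        patterns.append(np.concatenate(bits).tobytes())
    # each change of activation pattern starts a new affine piece
    return 1 + sum(p != q for p, q in zip(patterns, patterns[1:]))

layers = [(rng.standard_normal((16, 2)), rng.standard_normal(16)),
          (rng.standard_normal((16, 16)), rng.standard_normal(16))]
n_pieces = pieces_along_probe(layers, np.array([-3.0, -3.0]), np.array([3.0, 3.0]))
print(n_pieces)
```

Comparing this realized count against the architecture's theoretical region capacity is exactly the gap the abstract calls expressivity saturation.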

URL: https://openreview.net/forum?id=JiyZE3yKv8

---

Title: Modeling high dimensional point clouds with the spherical cluster model

Abstract: A parametric cluster model is a statistical model providing geometric insights into the points defining a cluster.
The {\em spherical cluster model} (SC) approximates a finite point set $P\subset \mathbb{R}^d$ by a sphere $S(c,r)$ as follows. Taking $r$ as a fraction $\eta\in(0,1)$ (a hyper-parameter) of the standard deviation of the distances between the center $c$ and the data points, the cost of the SC model is the sum, over all data points lying outside the sphere $S$, of their power distance with respect to $S$. The center $c$ of the SC model is the point minimizing this cost. Note that $\eta=0$ yields the celebrated center of mass used in KMeans clustering. We make three contributions.

First, we show that fitting a spherical cluster yields a strictly convex but non-smooth combinatorial optimization problem. Second, we present an exact solver using the Clarke gradient on a suitable stratified cell complex defined from an arrangement of hyper-spheres. Finally, we present experiments on a variety of datasets ranging in dimension from $d=9$ to $d=10,000$, with two main observations. First, the exact algorithm is orders of magnitude faster than BFGS-based heuristics for datasets of small to intermediate dimension and small values of $\eta$, and for high-dimensional datasets (say $d>100$) for any value of $\eta$. Second, the center of the SC model behaves as a parameterized high-dimensional median.

The SC model is of direct interest for high dimensional multivariate data analysis, and the application to the design of mixtures of SC will be reported in a companion paper.
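The cost above is simple to transcribe directly from its definition; a brute-force sketch evaluating it at a candidate center (not the paper's exact Clarke-gradient solver):

```python
import numpy as np

def sc_cost(c, P, eta):
    """Cost of the spherical cluster model at candidate center c:
    r is eta times the std of center-to-point distances, and each point
    lying outside the sphere S(c, r) contributes its power distance
    d^2 - r^2. A direct transcription of the abstract's definition."""
    d = np.linalg.norm(P - c, axis=1)
    r = eta * d.std()
    outside = d > r
    return np.sum(d[outside] ** 2 - r ** 2)

P = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
# With eta = 0 the sphere degenerates to a point and the cost reduces to the
# sum of squared distances, i.e. the k-means objective, so the center of
# mass (1, 1) is optimal.
cost_at_mean = sc_cost(P.mean(axis=0), P, eta=0.0)
print(cost_at_mean)
```

On this square of side 2, every corner sits at distance $\sqrt{2}$ from the mean, so the $\eta=0$ cost at the center of mass is $4 \times 2 = 8$, and any other center is strictly worse.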

URL: https://openreview.net/forum?id=5275cZxrSe

---

Title: Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation

Abstract: Machine learning applications require fast and reliable per-sample uncertainty estimation. A common approach is to use predictive distributions from Bayesian or approximation methods and additively decompose uncertainty into aleatoric (data-related) and epistemic (model-related) components. However, additive decomposition has recently been questioned, with evidence that it breaks down when using finite-ensemble sampling and/or mismatched predictive distributions. This paper introduces Variance-Gated Ensembles (VGE), an intuitive, differentiable framework that injects epistemic sensitivity via a signal-to-noise gate computed from ensemble statistics. VGE provides: (i) a Variance-Gated Margin Uncertainty (VGMU) score that couples decision margins with ensemble predictive variance; and (ii) a Variance-Gated Normalization (VGN) layer that generalizes the variance-gated uncertainty mechanism to training via per-class, learnable normalization of ensemble member probabilities. We derive closed-form vector-Jacobian products enabling end-to-end training through ensemble sample mean and variance. VGE matches or exceeds state-of-the-art information-theoretic baselines while remaining computationally efficient. As a result, VGE provides a practical and scalable approach to epistemic-aware uncertainty estimation in ensemble models.

URL: https://openreview.net/forum?id=fNMZjV1gje

---

Title: Family Matters: A Systematic Study of Spatial vs. Frequency Masking for Continual Test-Time Adaptation

Abstract: Recent continual test-time adaptation (CTTA) methods adopt masked image modeling to stabilize learning under distribution shift, yet each treats its masking family $\mathcal{F}$ as a fixed design choice and innovates exclusively along the selection strategy $\mathcal{S}$, leaving the family axis underexplored. We present a systematic empirical study that isolates this axis. Using a controlled CTTA instantiation---Mask to Adapt (M2A)---that fixes $\mathcal{S}{=}\textit{random}$ and standard losses, we vary only $\mathcal{F}$ across spatial (patch, pixel) and frequency (all-band, low-band, high-band) families while keeping every other component identical. The study's contributions are the design guidance it extracts for the CTTA settings we evaluated: (1)~\emph{the masking family determines whether adaptation compounds useful structure or compounds errors}---on patch-tokenized architectures, spatial masking accumulates stable representations over long streams while frequency masking collapses catastrophically. We characterize this instability through a \emph{structural-preservation} account, where spatial coherence maintains the broad-spectrum redundancy needed to avoid terminally overlapping with a corruption's spectral signature; (2)~\emph{the optimal family depends on architecture-task alignment}---on CNNs, whose overlapping receptive fields dilute patch occlusion, the family gap vanishes, whereas on fine-grained tasks with global cues and large-capacity ViTs, frequency masking becomes competitive. In confounded system-level comparisons---where baselines also differ in losses and auxiliary components---M2A's random selection performs comparably to heuristic strategies, though we treat this observation as suggestive context rather than a controlled quantification of $\mathcal{S}$'s relative importance.

URL: https://openreview.net/forum?id=pBI64qNXHp

---

Title: Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models using Reinforcement Learning from Ranked Feedback

Abstract: Recent advances in large video-language models (VLMs) rely on extensive fine-tuning techniques that strengthen alignment between textual and visual comprehension. Many implementations typically begin with supervised fine-tuning (SFT) followed by reinforcement learning from preference data to enhance video comprehension. However, as VLMs scale in parameter size, so does the cost of gathering enough human feedback. To make fine-tuning more cost-effective, recent frameworks have explored reinforcement learning from AI feedback (RLAIF), which replaces human preferences with an AI judge. Current RLAIF frameworks rely on a specialized reward model trained with video narratives to create calibrated scalar rewards, an expensive and restrictive pipeline. We propose Oracle-RLAIF, a novel framework that replaces the trained reward model with a more general Oracle ranker, a drop-in model that ranks candidate responses rather than scoring them. Alongside Oracle-RLAIF, we introduce $GRPO_{rank}$, a novel rank-based loss function based on Group Relative Policy Optimization (GRPO) that directly optimizes ordinal feedback with rank-aware advantages. Empirically, we demonstrate that Oracle-RLAIF consistently outperforms leading VLM fine-tuning methods across various video comprehension benchmarks. Oracle-RLAIF paves the way toward flexible and data-efficient frameworks for aligning large multi-modal video models with reinforcement learning from ranks rather than scores.

URL: https://openreview.net/forum?id=RIRgnRicTa

---

Title: EquiReg: Equivariance Regularized Diffusion for Inverse Problems

Abstract: Diffusion models represent the state-of-the-art for solving inverse problems such as image restoration tasks. Diffusion-based inverse solvers incorporate a likelihood term to guide prior sampling, generating data consistent with the posterior distribution. However, due to the intractability of the likelihood, most methods rely on isotropic Gaussian approximations, which can push estimates off the data manifold and produce inconsistent, poor reconstructions. We propose Equivariance Regularized (EquiReg) diffusion, a general plug-and-play framework that improves posterior sampling by penalizing trajectories that deviate from the data manifold. EquiReg formalizes manifold-preferential equivariant functions that exhibit low equivariance error for on-manifold samples and high error for off-manifold ones, thereby guiding sampling toward symmetry-preserving regions of the solution space. We highlight that such functions naturally emerge when training non-equivariant models with augmentation or on data with symmetries. EquiReg is particularly effective under reduced sampling and measurement consistency steps, where many methods suffer severe quality degradation. By regularizing trajectories toward the manifold, EquiReg implicitly accelerates convergence and enables high-quality reconstructions. EquiReg consistently improves performance on linear and nonlinear image restoration tasks and on solving partial differential equations.
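The central quantity here, the equivariance error of a function under a transformation, can be written down directly: $\|f(T(x)) - T(f(x))\|$. A sketch with toy functions showing the error vanishing when $f$ commutes with the transform and not otherwise (the diffusion-solver integration itself is not shown):

```python
import numpy as np

def equivariance_error(f, transform, x):
    """Equivariance error of f at x under a group action `transform`:
    || f(T(x)) - T(f(x)) ||. EquiReg's premise, per the abstract, is that
    this error is small for on-manifold samples and large for off-manifold
    ones, so it can serve as a penalty during posterior sampling."""
    return np.linalg.norm(f(transform(x)) - transform(f(x)))

flip = lambda x: x[::-1]                  # a simple symmetry: reversal
scale = lambda x: 2.0 * x                 # commutes with flipping -> equivariant
scramble = lambda x: x[np.argsort(np.sin(np.arange(x.size)))]  # does not commute

x = np.arange(5, dtype=float)
err_equiv = equivariance_error(scale, flip, x)       # exactly 0: equivariant
err_broken = equivariance_error(scramble, flip, x)   # positive: broken
print(err_equiv, err_broken)
```

In EquiReg this scalar would be added to the sampling objective, steering trajectories toward regions where the symmetry holds.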

URL: https://openreview.net/forum?id=3iaYKJLcyG

---

Title: Memory-Efficient Differentially Private Training with Gradient Random Projection

Abstract: Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. DP-GRAPE is motivated by our finding that privatization flattens the gradient singular value spectrum, making SVD-based projections (Zhao et al., 2024) unnecessary. Consequently, DP-GRAPE employs three key components: (1) random Gaussian matrices replace SVD-based subspaces, (2) gradients are privatized after projection, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility trade-off comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can significantly reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters, a scale at which DP-Adam fails due to memory constraints.
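The three components listed in the abstract can be sketched for a single optimization step. All dimensions, hyperparameters, and the back-projection below are illustrative assumptions rather than the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_projected_step(per_sample_grads, k, clip_norm, noise_mult):
    """Minimal sketch of DP-GRAPE's three ingredients as described in the
    abstract: (1) a random Gaussian projection replaces an SVD-based
    subspace, (2) privatization (per-sample clipping + Gaussian noise) is
    applied *after* projection, and (3) the step is mapped back to the full
    parameter space. Optimizer states would live in the k-dim subspace,
    which is where the memory savings come from."""
    n, d = per_sample_grads.shape
    A = rng.standard_normal((d, k)) / np.sqrt(k)      # random projection
    proj = per_sample_grads @ A                       # (n, k), low-dimensional
    norms = np.linalg.norm(proj, axis=1, keepdims=True)
    proj = proj * np.minimum(1.0, clip_norm / norms)  # per-sample clipping
    noisy = proj.sum(0) + rng.normal(0.0, noise_mult * clip_norm, size=k)
    return A @ (noisy / n)                            # back to R^d for the update

grads = rng.standard_normal((32, 1000))               # 32 per-sample gradients
step = dp_projected_step(grads, k=16, clip_norm=1.0, noise_mult=1.0)
print(step.shape)
```

Clipping in the $k$-dimensional subspace also means the per-sample gradients never need to be clipped (or even stored) at full dimension $d$.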

URL: https://openreview.net/forum?id=CyxgbXCrWZ

---

Title: What’s in the Bottle? A Survey and Roadmap of Concept Bottleneck Models

Abstract: Concept Bottleneck Models (CBMs) are interpretable learning architectures that factor predictions through intermediate, ideally human-understandable concepts, enabling explicit and inspectable reasoning. Although CBM research has gained substantial momentum in recent years, this growth has also revealed numerous open challenges and a fragmented set of methodological choices. In this work, we systematically review the CBM literature, identify previously overlooked core components and challenges, and propose a unified taxonomy. Based on this taxonomy, we provide a detailed categorization of existing works. We further discuss current challenges for the CBM paradigm and outline important directions to extend it beyond its current scope. Overall, this survey aims to consolidate the CBM landscape, clarify open issues, and provide guidance for developing future models.

URL: https://openreview.net/forum?id=IF5vnqxBEW

---

Title: Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages

Abstract: Tokenization- and sub-tokenization-based models like word2vec, BERT, and the GPT-like models are the state of the art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and low-resource languages. Additionally, the sub-tokenization needed for these languages increases the length of the input sequence. The input sequence is even longer for purely byte- or character-based models. To mitigate this problem, we propose a method to compute word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as a drop-in replacement for dictionary- and subtoken-based word embeddings in existing model architectures. They have the potential to improve performance for both large context-based language models like BERT and small models like word2vec for low-resource and morphologically rich languages. We evaluate our approach on various tasks such as SWAG, declension prediction for inflected languages, and metaphor and chiasmus detection across several languages. Our experiments show that it outperforms traditional token-based approaches on limited data using Odd-One-Out and TopK metrics as well as on application-based downstream tasks.

URL: https://openreview.net/forum?id=4n4db5qmXZ

---

Title: A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

Abstract: Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer.
This survey defines the problem and organizes the literature by reasoning topology into three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates.
The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM adaptation regimes.
Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (with an accompanying repository at \url{https://anonymous.4open.science/r/Time-Series-Reasoning-Survey-TMLR/}).
Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets.
We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings.
Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

URL: https://openreview.net/forum?id=l3QW42g6u3

---

Title: TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training

Abstract: Scientific problems require resolving multi-scale phenomena across different resolutions and learning solution operators in infinite-dimensional function spaces. Neural operators provide a powerful framework for this, using tensor-parameterized layers to capture complex, multi-dimensional relationships. However, scaling neural operators to high-resolution problems leads to significant computational demands, making the training of industrial-scale models prohibitive. In this work, we introduce TensorGRaD, a novel method that directly addresses the memory challenges associated with optimizing large tensor-structured weights. Our approach, based on a robust tensor decomposition, factorizes gradients as the sum of a low-rank tensor and a sparse one to efficiently capture information within optimizer states, including outliers. Additionally, we provide a recipe for mixed precision training of TensorGRaD, achieving further memory savings without sacrificing accuracy. We showcase the effectiveness of TensorGRaD on Fourier Neural Operators, a class of models crucial for solving partial differential equations (PDEs). We provide theoretical guarantees for TensorGRaD, demonstrating its fundamental advantage over matrix-based gradient compression methods. We empirically demonstrate large improvements across various PDE tasks, including the challenging turbulent Navier-Stokes case at a Reynolds number of $10^5$. TensorGRaD reduces total memory usage by over 50% while maintaining and sometimes even improving accuracy.
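The sparse-plus-low-rank split can be illustrated in the matrix case; a one-shot sketch under the assumption that "sparse" means the largest-magnitude entries and "low-rank" means a truncated SVD of the remainder (the paper works with higher-order tensor decompositions and integrates the split into optimizer states):

```python
import numpy as np

def sparse_plus_lowrank(G, sparse_frac=0.01, rank=4):
    """One-shot sketch of a robust split in the spirit TensorGRaD describes:
    keep the largest-magnitude entries (outliers) as a sparse part S, and
    compress the remainder with a truncated SVD into a low-rank part L."""
    S = np.zeros_like(G)
    k = max(1, int(sparse_frac * G.size))
    idx = np.unravel_index(np.argsort(np.abs(G), axis=None)[-k:], G.shape)
    S[idx] = G[idx]                       # sparse part: outlier entries only
    U, s, Vt = np.linalg.svd(G - S, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # low-rank part of the rest
    return S, L

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 64))
G[3, 7] = 50.0                            # plant a gradient outlier
S, L = sparse_plus_lowrank(G)
print(S[3, 7], np.count_nonzero(S))
```

The outlier lands in $S$ rather than corrupting the low-rank factor, and storing $S$ plus the rank-4 factors needs far fewer numbers than the dense $64 \times 64$ gradient.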

URL: https://openreview.net/forum?id=wd1pTrQFv2

---

Title: Beyond ReinMax: Low-Variance Gradient Estimators for Discrete Latent Variables

Abstract: Machine learning models involving discrete latent variables require gradient estimators to facilitate backpropagation in a computationally efficient manner. The most recent addition to the Straight-Through family of estimators, ReinMax, can be viewed from a numerical ODE perspective as incorporating an approximation via Heun's method to reduce bias, but at the cost of high variance. In this work, we introduce the ReinMax-Rao and ReinMax-CV estimators which incorporate Rao-Blackwellisation and control variate techniques into ReinMax to reduce its variance. Our estimators demonstrate superior performance on training variational autoencoders with discrete latent spaces. Furthermore, we investigate the possibility of leveraging alternative numerical methods for constructing more accurate gradient approximations and present an alternative view of ReinMax from a simpler numerical integration perspective.

URL: https://openreview.net/forum?id=crlvtnsyIT

---

Title: Retrievit: In-context Retrieval Capabilities of Transformers, State Space Models, and Hybrid Architectures

Abstract: Transformers excel at in-context retrieval but suffer from quadratic complexity with sequence length, while State Space Models (SSMs) offer efficient linear-time processing but have limited retrieval capabilities. We investigate whether hybrid architectures combining Transformers and SSMs can achieve the best of both worlds on two synthetic in-context retrieval tasks. The first task, n-gram retrieval, requires the model to locate a query n-gram within the input sequence and reproduce the n-gram that follows it. The second task, position retrieval, presents the model with a single query token and requires it to perform a two-hop associative lookup: first locating the corresponding element in the sequence, and then outputting its positional index. Under controlled experimental conditions, we assess data efficiency, length generalization, robustness to out-of-domain training examples, and learned representations across Transformers, SSMs, and hybrid architectures. We find that hybrid models outperform SSMs and match or exceed Transformers in data efficiency and extrapolation for information-dense context retrieval. However, Transformers maintain superiority in position retrieval tasks. Through representation analysis, we discover that SSM-based models develop locality-aware embeddings where tokens representing adjacent positions become neighbors in embedding space, forming interpretable structures. This emergent property, absent in Transformers, explains both the strengths and limitations of SSMs and hybrids for different retrieval tasks. Our findings provide principled guidance for architecture selection based on task requirements and reveal fundamental differences in how Transformers, SSMs, and hybrid models learn positional associations.
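The n-gram retrieval task as described can be generated synthetically; a sketch with an oracle solver to make the task semantics concrete (vocabulary size, sequence length, and sampling details are assumptions, not the paper's exact protocol):

```python
import random

def make_ngram_retrieval_example(vocab, seq_len, n, rng):
    """One sample of the n-gram retrieval task: the model sees a token
    sequence plus a query n-gram occurring in it, and must output the n
    tokens that immediately follow the query's occurrence."""
    seq = [rng.randrange(vocab) for _ in range(seq_len)]
    start = rng.randrange(seq_len - 2 * n)       # leave room for the answer
    query = seq[start:start + n]
    target = seq[start + n:start + 2 * n]
    return seq, query, target

def retrieve(seq, query):
    """Oracle solver: scan for an occurrence of the query n-gram and copy
    its continuation (ambiguous if the n-gram repeats, which is rare for
    a large vocabulary)."""
    n = len(query)
    for i in range(len(seq) - 2 * n + 1):
        if seq[i:i + n] == query:
            return seq[i + n:i + 2 * n]
    return None

rng = random.Random(0)
seq, query, target = make_ngram_retrieval_example(vocab=50, seq_len=64, n=3, rng=rng)
result = retrieve(seq, query)
print(result is not None)
```

An architecture solving this must both match the query against the context and copy the continuation, which is why the task stresses in-context retrieval rather than memorization.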

URL: https://openreview.net/forum?id=PxkhbVeRRj

---

Title: MVDGC: Joint 3D and 2D Multi-view Pedestrian Detection via Dual Geometric Constraints

Abstract: The core challenge in multi-view pedestrian detection (MVPD) lies in effective aggregation of visual features from different viewpoints for robust occlusion reasoning. Recent approaches have addressed this by first projecting image-view features onto a Bird's Eye View (BEV) map, where ground localization is then performed. Despite impressive performance, the perspective transformation induces severe distortion, breaking spatial structure and degrading the quality of object feature extraction. The blurred and ambiguous features hinder accurate BEV point localization, especially in densely populated regions. Moreover, the strong mutual relationship between the BEV ground point and image bounding boxes is not exploited. Although multi-view consistency of 2D detections can serve as a powerful constraint in BEV space, these detections are commonly treated as auxiliary signals rather than being jointly optimized with the primary task.

In this work, we propose MVDGC, a unified framework that jointly estimates pedestrian locations on the BEV plane and 2D bounding boxes in image views. MVDGC employs a sparse set of 3D cylindrical queries that embed geometric context across both BEV and image views, enforcing dual spatial constraints for precise localization. Specifically, the geometric constraints are established by modeling each pedestrian as a vertical cylinder whose center lies on the BEV plane and whose projection yields a rectangular box in the image views. These queries function as shape anchors that directly extract 2D features from the intact image-view features using camera projection, eliminating projection-induced distortions. The 3D cylindrical query enables the unification of BEV and image-view (ImV) localization into a single task: 3D cylinder position and shape refinement.

Extensive experiments and ablation studies demonstrate that MVDGC achieves state-of-the-art performance across multiple evaluation metrics on MVPD benchmarks, including Wildtrack and MultiviewX, as well as on the generalized multi-view detection (GMVD) dataset. Moreover, by explicitly modeling BEV-ImV coherency through cylindrical queries, MVDGC not only delivers high precision in multi-view detection but also surpasses image-based tracking methods in a single-view scenario. Code will be made available upon acceptance.

URL: https://openreview.net/forum?id=40cVQX5Mxc

---

Title: Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning

Abstract: Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible computational strategies must be developed. While the extent to which Transformer neural network models can perform compositional reasoning has been the subject of intensive recent investigation, little work has been done to systematically understand how well these models can leverage their representations to learn new, related experiences. To address this gap, we expand the previously developed Learning Equality and Group Operations (LEGO) framework to a continual learning (CL) setting ("continual LEGO"). Using this continual LEGO experimental paradigm, we study the capability of feedforward and recurrent Transformer models to perform CL. We find that BERT, a popular feedforward Transformer model, learns shortcut solutions that limit its ability to generalize and prevent strong forward transfer to new experiences. In contrast, we find that ALBERT, a recurrent version of BERT, learns a for-loop solution, which leads to better CL performance. When applying BERT and ALBERT models to a CL setting that requires composition across experiences, we find that both model families fail. Our investigation suggests that the performance drop of ALBERT models can be rescued through generative replay strategies, but the same is not true for BERT models, where a detrimental shortcut solution becomes entrenched during initial training. Our results demonstrate that recurrent Transformer architectures may have an inductive bias better suited for CL and motivate future investigation of the interplay between Transformer architecture and the computational solutions that emerge.

URL: https://openreview.net/forum?id=UdAJlllWSg

---

Title: Performance Prediction In Reinforcement Learning: The Good, The Bad And The Ugly

Abstract: Reinforcement learning (RL) methods are known to be highly sensitive to their hyperparameter settings and costly to evaluate. In light of this, surrogate models that predict the performance of a given algorithm given a hyperparameter configuration seem an attractive solution for understanding and optimising these computationally expensive tasks. In this work, we study such surrogates for RL and find that RL methods present a significant challenge to current performance prediction approaches. Specifically, RL landscapes appear to be rugged and noisy, which leads to poor-quality surrogate models. Even if surrogate models are only used for gaining insights into the hyperparameter landscapes and not as replacements for algorithm evaluations in benchmarking, we find that they deviate substantially from the ground truth. While our evaluation highlights the limits of surrogate modelling for RL, we propose a method for automatically reducing configuration spaces for improved surrogate performance. We also derive recommendations for RL practitioners that caution against blindly trusting surrogate-based methods for this domain and highlight where and how they can be used.

URL: https://openreview.net/forum?id=l2f3LuGjIL

---

Title: Using the Path of Least Resistance to Explain Deep Networks

Abstract: Integrated Gradients (IG), a widely used axiomatic path-based attribution method, assigns importance scores to input features by integrating model gradients along a straight path from a baseline to the input. While effective in some cases, we show that straight paths can lead to flawed attributions. In this paper, we identify the cause of these misattributions and propose an alternative approach that equips the input space with a model-induced Riemannian metric (derived from the explained model's Jacobian) and computes attributions by integrating gradients along geodesics under this metric. We call this method Geodesic Integrated Gradients (GIG).

To approximate geodesic paths, we introduce two techniques: a k-Nearest Neighbours-based approach for smaller models and a Stochastic Variational Inference-based method for larger ones. Additionally, we propose a new axiom, No-Cancellation Completeness (NCC), which strengthens completeness by ruling out feature-wise cancellation. We prove that, for path-based attributions under the model-induced metric, NCC holds if and only if the integration path is a geodesic.

Through experiments on both synthetic and real-world image classification data, we provide empirical evidence supporting our theoretical analysis and showing that GIG produces more faithful attributions than existing methods, including IG, on the benchmarks considered.
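For context, vanilla IG along the straight path, the method whose misattributions motivate GIG, reduces to a Riemann sum of gradients between the baseline and the input. A minimal numpy sketch using a toy quadratic model (an illustrative stand-in, not a real classifier):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Riemann-sum (midpoint rule) approximation of IG along the
    straight path from `baseline` to `x`; `grad_fn(z)` is dF/dz."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model F(x) = sum(x**2), so dF/dx = 2x and F(baseline) = 0.
x = np.array([1.0, -2.0])
attr = integrated_gradients(lambda z: 2 * z, x, np.zeros(2))
```

Completeness holds on this toy example: the attributions sum to F(x) - F(baseline) = 5. GIG replaces the straight path with a geodesic under the model-induced metric, which is exactly what the stronger NCC axiom characterizes.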

URL: https://openreview.net/forum?id=W1XVIvn445

---

Title: Efficient Zeroth-Order Federated Finetuning of Language Models on Resource-Constrained Devices

Abstract: Federated Learning (FL) is a promising paradigm for finetuning Large Language Models (LLMs) across distributed data sources while preserving data privacy. However, finetuning such large models is challenging on edge devices due to its high resource demand.
Zeroth-order (ZO) optimization estimates gradients through finite-difference approximations, which rely on function evaluations under random perturbations of the model parameters. Consequently, ZO with task alignment provides a potential solution, allowing finetuning using only forward passes with inference-level memory requirements and low communication overhead, but it suffers from slow convergence and higher computational demand. In this paper, we propose a new ZO-based method that applies a more efficient technique to reduce the computational demand associated with using a large number of perturbations, while preserving their convergence benefits. This is achieved by splitting the model into consecutive blocks and allocating a higher number of perturbations to the second block, enabling efficient reuse of intermediate activations to update the full network with fewer forward evaluations. Our evaluation on RoBERTa-large, OPT-1.3B, and LLaMa-3-3.2B models shows up to $3\times$ reduction in computation compared to other ZO-based techniques, while retaining the memory and communication benefits over first-order federated learning techniques.
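The finite-difference estimator underlying ZO methods can be sketched in a few lines. The following is a generic two-point estimator averaged over random Gaussian directions (illustration only; the paper's block-wise perturbation allocation and activation reuse are not reproduced here):

```python
import numpy as np

def zo_gradient(f, theta, eps=1e-3, n_perturb=2000, seed=0):
    """Two-point zeroth-order gradient estimate averaged over
    `n_perturb` random Gaussian directions (SPSA-style)."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(theta)
    for _ in range(n_perturb):
        z = rng.standard_normal(theta.shape)
        # Each estimate needs only two forward evaluations of f.
        g += (f(theta + eps * z) - f(theta - eps * z)) / (2 * eps) * z
    return g / n_perturb

# Quadratic loss: the true gradient is 2 * theta.
theta = np.array([1.0, -0.5, 2.0])
g = zo_gradient(lambda t: np.sum(t ** 2), theta)
```

Each perturbation costs two forward passes and no backward pass, which is why memory stays at inference level; the variance of the estimate shrinks as the number of perturbations grows, which is the cost the paper's block-splitting scheme targets.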

URL: https://openreview.net/forum?id=nVmz9Q2l7L

---

Title: Unified Semantic and ID Representation Learning for Deep Recommenders

Abstract: Effective recommendation is crucial for large-scale online platforms. Traditional recommendation systems primarily rely on ID tokens to uniquely identify items, which can effectively capture specific item relationships but suffer from issues such as redundancy and poor performance in cold-start scenarios. Recent approaches have explored using semantic tokens as an alternative, yet they face challenges, including item duplication and inconsistent performance gains, leaving the potential advantages of semantic tokens inadequately examined. To address these limitations, we propose a Unified Semantic and ID Representation Learning framework that leverages the complementary strengths of both token types. In our framework, ID tokens capture unique item attributes, while semantic tokens represent shared, transferable characteristics. Additionally, we analyze the role of cosine similarity and Euclidean distance in embedding search, revealing that cosine similarity is more effective in decoupling accumulated embeddings, while Euclidean distance excels in distinguishing unique items. Our framework integrates cosine similarity in earlier layers and Euclidean distance in the final layer to optimize representation learning. Experiments on three benchmark datasets show that our method significantly outperforms state-of-the-art baselines, with improvements ranging from 6\% to 17\% and a reduction in token size by over 80\%. These results demonstrate the effectiveness of combining ID and semantic tokenization to enhance the generalization ability of recommender systems.
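The cosine-versus-Euclidean distinction analyzed above can be seen on a toy query with hypothetical embedding vectors: cosine similarity favors directionally aligned embeddings regardless of magnitude, while Euclidean distance favors spatially close ones.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: direction only, magnitude-invariant."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q = np.array([1.0, 0.0])     # query embedding
a = np.array([10.0, 0.0])    # same direction, far away in space
b = np.array([0.9, 0.5])     # close in space, different direction
# Cosine ranks a above b; Euclidean distance ranks b above a.
```

This is the behavior the abstract attributes to the two measures: cosine decouples accumulated (scaled) embeddings, while Euclidean distance separates unique items, motivating cosine in earlier layers and Euclidean in the final layer.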

URL: https://openreview.net/forum?id=TAzPZ1US0J

---

Title: Boundary-Consistent Graph Neural Networks for Topological Flux Prediction

Abstract: Graph Neural Networks (GNNs) have achieved notable success in spatiotemporal modeling across diverse application domains. However, their efficacy in flux prediction (FP), where the goal is to model spatiotemporal fluid transport over networked physical systems, remains contentious. Recent studies report that GNNs can underperform even simple baselines in FP settings, leading to a claim that GNNs may be intrinsically ill-suited for such tasks.

In this paper, we revisit this claim by dissecting GNN learning dynamics on fluid transport networks, with an emphasis on their boundary regions. Specifically, we decompose the graph into boundary and interior nodes, where boundary nodes regulate the total influx and are the primary interface with external forcing. Our empirical and theoretical analyses reveal that dominant prediction errors concentrate at boundary nodes. From a dynamical-systems perspective, we interpret the boundary errors as the consequence of unmodeled external forcing, which causes degraded performance on boundaries. We therefore hypothesize that the observed performance degradation of GNNs is not caused by limited expressivity; rather, it arises from the absence of explicit external-forcing modeling during training.

To validate this hypothesis, we propose \myalg, which learns ghost-node proxies to approximate unmodeled external forcing. Each boundary node is augmented with an associated ghost node that represents the latent forcing. This yields a ghost--boundary--interior coupled system, which we solve using an implicit fixed-point formulation. The resulting equilibrium \emph{jointly} infers the external forcing and propagates it into the interior. This enriches standard GNN backbones with boundary-consistent representations while preserving interior message passing. Extensive experiments on two real-world fluid network datasets demonstrate that \myalg\ improves standard GNNs by reducing average MSE by 8.4\% and 5.0\%, and boundary-node MSE by 11.2\% and 7.1\%, respectively. For computational efficiency, we further introduce an explicit inverse-operator solver that amortizes the fixed-point inference and accelerates inference by up to $2\times$, depending on the backbone architecture.

URL: https://openreview.net/forum?id=31gTIfhoH0

---

Title: MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the Multi-Layer Document Editing Benchmark (MiLDEBench), a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions: instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 13 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.

URL: https://openreview.net/forum?id=G3l9DAz5uH

---

Title: KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Abstract: Generating video from various conditions, such as text, image, and audio, enables precise spatial and temporal control, leading to high-quality generation results. Most existing audio-to-visual animation models rely on uniformly sampled frames from video clips. Such a uniform sampling strategy often fails to capture key audio-visual moments in videos with dramatic motions, causing unsmooth motion transitions and audio-visual misalignment. To address these limitations, we introduce KeyVID, a keyframe-aware audio-to-visual animation framework that adaptively prioritizes the generation of keyframes in audio signals to improve the generation quality. Guided by the input audio signals, KeyVID first localizes and generates the corresponding visual keyframes that contain highly dynamic motions. The remaining frames are then synthesized using a motion interpolation module, effectively reconstructing the full video sequence. This design enables the generation of high frame-rate videos that faithfully align with audio dynamics, while avoiding the cost of directly training with all frames at a high frame rate. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The anonymous code link for review is https://anonymous.4open.science/r/KeyVID-73E6.

URL: https://openreview.net/forum?id=ipxcKl2fHO

---

Title: Adaptive LLM Safety via Inference-Time System-Level Optimization

Abstract: Large Language Models (LLMs) face rapidly evolving security threats, ranging from adversarial attacks like jailbreaking to the leakage of sensitive information that should have been unlearned. Existing defense mechanisms are often static and require extensive model retraining, making them slow to adapt to evolving threats. We investigate whether adaptive, inference-time system designs can mitigate the limitations of static LLM defenses. We study a modular inference-time defense system (which we refer to as AegisLLM). It utilizes a workflow of specialized modules whose defensive policies can be optimized with a remarkably small number of examples to achieve strong performance on multiple, distinct security challenges. We demonstrate the effectiveness of this system on two critical threats: sensitive information disclosure (with \textit{unlearning} as defense) and jailbreaking. On the WMDP benchmark, it approaches the random-guess lower bound for unlearning with only 20 training examples. For jailbreaking, it improves defenses by $\sim$51\% over the base model on the StrongReject benchmark, while maintaining a high utility as measured by the false refusal rate of only 7.9\% on the PHTest benchmark. Furthermore, we show that prompts optimized on one benchmark generalize effectively to others, underscoring the robustness of this approach. Our work highlights the significant advantages of adaptive, system-level security and demonstrates the power of prompt optimization for creating scalable and efficient LLM safety solutions. We provide our code at: \url{https://anonymous.4open.science/r/aegisllm-11B0}.

URL: https://openreview.net/forum?id=EkPGNcycxm

---

Title: ROCM: RLHF on consistency models

Abstract: Diffusion models have revolutionized generative modeling in continuous domains like image, audio, and video synthesis. However, their iterative sampling process leads to slow generation and inefficient training, challenges that are further exacerbated when incorporating Reinforcement Learning from Human Feedback (RLHF) due to sparse rewards and long time horizons. Consistency models address these issues by enabling single-step or efficient multi-step generation, significantly reducing computational costs. In this work, we propose a direct reward optimization framework for applying RLHF to consistency models, incorporating distributional regularization to enhance training stability and prevent reward hacking. We investigate various $f$-divergences as regularization strategies, striking a balance between reward maximization and model consistency. Unlike policy gradient methods, our approach leverages first-order gradients, making it more efficient and less sensitive to hyperparameter tuning. Empirical results show that our method achieves competitive or superior performance compared to policy-gradient-based RLHF methods across various automatic metrics and human evaluation. Additionally, our analysis demonstrates the impact of different regularization techniques in improving model generalization and preventing overfitting.

URL: https://openreview.net/forum?id=g0Ht3AzyUP

---

Title: Video Generation Models: A Survey of Post-Training and Alignment

Abstract: Video generation has rapidly progressed from short, low-quality clips to high-resolution, long-duration sequences with complex spatiotemporal dynamics. Despite strong generative priors learned through large-scale pretraining, pretrained video models often fail to reliably follow human intent, maintain temporal coherence, or satisfy physical and safety constraints. Compared with image and text generation, alignment in video generation presents unique challenges, including error accumulation over time, motion-appearance coupling, multi-objective trade-offs, and limited supervision for temporal properties. These challenges motivate systematic post-training strategies that adapt pretrained models without retraining them from scratch. In this survey, we present the first comprehensive review of post-training and alignment in video generation models. We frame post-training as a unifying framework and distinguish between implicit alignment and explicit alignment based on how alignment signals are enforced. From this perspective, we organize existing approaches into four broad categories: (1) supervised fine-tuning methods, (2) self-training and distillation methods, (3) preference- and reward-based methods, and (4) inference-time methods. This taxonomy provides a coherent view of how alignment signals shape model behavior across both training and deployment. Beyond methodological advances, we review commonly used datasets, benchmarks, and evaluation practices, and discuss open challenges such as scalable reward design, long-horizon temporal consistency, stability-expressiveness trade-offs, and safety-aware generation. This survey aims to provide a structured conceptual foundation and practical guidance for advancing controllable and reliable video generation models.

URL: https://openreview.net/forum?id=YlUEWLESIu

---

Title: Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets

Abstract: Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks. Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLM agents both powerful and accessible tools for a wide range of users, including non-experts. However, the application of such agents to problems arising in operations research, particularly those at the intersection of combinatorial optimization and game theory that require domain expertise, remains underexplored. To address this gap, we introduce a benchmark of 369 instances for the College Admission Problem, a canonical many-to-one matching problem that requires reasoning about agents’ preferences, stability, feasibility, and optimality. We evaluate several open-weight LLMs, both reasoning-specialized and more traditional, defined here as models used without any dedicated reasoning mechanisms. Although no prompting strategy (Chain-of-Thought, In-Context Learning, or role-based prompting) consistently performed best, reasoning LLMs responded to these strategies differently than traditional ones did. While the reasoning-enhanced models significantly outperform traditional ones, they all struggle to meet all evaluation criteria consistently. Finally, we report performance under iterative prompting with auto-generated feedback and show that it is not monotonic; it can peak early and then decline significantly in later attempts. Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints.
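The College Admission Problem this benchmark targets is classically solved by student-proposing deferred acceptance (Gale & Shapley, 1962). A minimal sketch with hypothetical preference lists (the benchmark's 369 instances and evaluation criteria are separate from this illustration):

```python
def deferred_acceptance(student_prefs, college_prefs, quotas):
    """Student-proposing deferred acceptance for many-to-one
    matching. Returns a stable matching as {student: college}."""
    rank = {c: {s: i for i, s in enumerate(p)} for c, p in college_prefs.items()}
    held = {c: [] for c in college_prefs}       # tentative admits per college
    nxt = {s: 0 for s in student_prefs}         # next college s will propose to
    free = [s for s in student_prefs if student_prefs[s]]
    while free:
        s = free.pop()
        if nxt[s] >= len(student_prefs[s]):
            continue                            # s exhausted its list: unmatched
        c = student_prefs[s][nxt[s]]
        nxt[s] += 1
        held[c].append(s)
        held[c].sort(key=lambda x: rank[c][x])  # best-ranked admits first
        if len(held[c]) > quotas[c]:
            free.append(held[c].pop())          # reject the worst-ranked admit
    return {s: c for c, lst in held.items() for s in lst}

students = {"s1": ["c1", "c2"], "s2": ["c1", "c2"], "s3": ["c1", "c2"]}
colleges = {"c1": ["s1", "s2", "s3"], "c2": ["s2", "s1", "s3"]}
match = deferred_acceptance(students, colleges, {"c1": 1, "c2": 1})
```

On this toy instance the student-optimal stable matching assigns s1 to c1 and s2 to c2, leaving s3 unmatched; verifying stability, feasibility, and optimality of such outputs is exactly the kind of reasoning the benchmark asks LLMs to perform.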

URL: https://openreview.net/forum?id=2dpt2Ughzt

---

Title: Neural Networks Performance Prediction using Weights and Gradients Analysis

Abstract: Neural network performance predictors are widely used to accelerate neural architecture search, but existing methods face a persistent trade-off: learning-based predictors require costly per-dataset initialization, while lightweight proxies are fast yet struggle to exploit prior experience and often degrade under dataset shift. We introduce NAP2, a hybrid performance predictor that models early training dynamics. NAP2 tracks the temporal evolution of layer-wise weight and gradient statistics over a small number of mini-batches, producing accurate rankings from as few as 100 mini-batches per candidate. Crucially, NAP2 supports cross-dataset reuse: a predictor trained on one dataset can be applied to another without fine-tuning, avoiding the re-initialization overhead incurred by many model-based approaches. Experiments on NAS-Bench-201 across CIFAR-10, CIFAR-100, and ImageNet16-120 show that NAP2 is competitive with strong hybrid baselines under limited budgets and delivers reliable zero-shot transfer, outperforming established learning-curve and zero-cost baselines at short query times. We further demonstrate robustness to significant distribution shift, with a predictor trained on CIFAR-10 transferring effectively to SVHN. Our code and trained models are available at https://anonymous.4open.science/r/NAP2-6027/README.md.

URL: https://openreview.net/forum?id=51TWh8tlSy

---

Title: Invariant Causal Set Covering Machine

Abstract: Rule-based models, such as decision trees, appeal to practitioners due to their interpretable nature. However, the learning algorithms that produce such models are often vulnerable to spurious associations, and thus, they are not guaranteed to extract causally relevant insights. This limitation reduces their utility in gaining mechanistic insights into a phenomenon of interest. In this work, we build on ideas from the invariant causal prediction literature to propose Invariant Causal Set Covering Machines, an extension of the classical Set Covering Machine (SCM) algorithm for conjunctions/disjunctions of binary-valued rules that provably avoids spurious associations. The proposed method leverages structural assumptions about the functional form of such models, enabling an algorithm that identifies the causal parents of a variable of interest in polynomial time. We demonstrate the validity and efficiency of our approach through a simulation study and highlight its favorable performance compared to SCM in uncovering causal variables across real-world datasets.

URL: https://openreview.net/forum?id=slquR2A8rA

---

Title: Visual Explanations for Capsule Networks

Abstract: The limited availability of explainability methods for Capsule Networks (CapsNets) restricts their adoption in critical domains such as clinical practice or legal document analysis. Although CapsNets offer structured and interpretable representations, existing explanation methods have primarily focused on more traditional Convolutional Neural Networks (CNNs) and are not directly applicable to capsule-based architectures. To address this issue, we propose a general method (Caps-CAM), which generates attribution maps to justify the predictions made by feed-forward CapsNet architectures. Unlike prior explanation methods for CapsNets that adapt techniques originally designed for CNNs, Caps-CAM explicitly employs gradient information that reflects the relevance of each capsule to a class of interest. As the gradient can help highlight the most relevant capsules, each selected capsule activation map is weighted by its corresponding gradient. The final attribution heatmap is then generated as a linear combination of weighted activation maps based on their contribution to the target class. Empirical comparisons with state-of-the-art explanation techniques previously introduced for CNNs, on both standard and real-application data sets, show that Caps-CAM can effectively serve as an explanation method for CapsNets.
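The weighting scheme described above follows the familiar Grad-CAM pattern. A minimal numpy sketch, under the assumption that capsule activation maps and their gradients with respect to the target class have already been extracted (the capsule-specific extraction is the paper's contribution and is not shown):

```python
import numpy as np

def gradient_weighted_map(activations, gradients):
    """Grad-CAM-style attribution: weight each capsule activation
    map by its pooled gradient, combine linearly, then apply ReLU."""
    # activations, gradients: arrays of shape (n_maps, H, W)
    weights = gradients.mean(axis=(1, 2))         # one scalar weight per map
    heat = np.tensordot(weights, activations, axes=1)
    return np.maximum(heat, 0)                    # keep only positive evidence

# Hypothetical toy inputs: two 4x4 capsule activation maps.
acts = np.stack([np.ones((4, 4)), np.full((4, 4), 2.0)])
grads = np.stack([np.full((4, 4), 1.0), np.full((4, 4), -0.2)])
heat = gradient_weighted_map(acts, grads)
```

The ReLU at the end discards maps whose weighted contribution is negative for the target class, which is what lets the heatmap highlight only class-supporting regions.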

URL: https://openreview.net/forum?id=eNQK9WJkid

---

Title: When Do LLM Preferences Predict Downstream Behavior?

Abstract: Preference-driven behavior in LLMs may be a necessary precondition for AI misalignment such as sandbagging: models cannot strategically pursue misaligned goals unless their behavior is influenced by their preferences. Yet prior work has typically prompted models explicitly to act in specific ways, leaving unclear whether observed behaviors reflect instruction-following capabilities versus underlying model preferences. Here we test whether this precondition for misalignment is present. Using entity preferences as a behavioral probe, we measure whether stated preferences predict downstream behavior in five frontier LLMs across three domains: donation advice, refusal behavior, and task performance. Conceptually replicating prior work, we first confirm that all five models show highly consistent preferences across two independent measurement methods. We then test behavioral consequences in a simulated user environment. We find that all five models give preference-aligned donation advice. All five models also show preference-correlated refusal patterns when asked to recommend donations, refusing more often for less-preferred entities. All preference-related behaviors that we observe here emerge without instructions to act on preferences. Results for task performance are mixed: on a question-answering benchmark (BoolQ), two models show small but significant accuracy differences favoring preferred entities; one model shows the opposite pattern; and two models show no significant relationship. On complex agentic tasks, we find no evidence of preference-driven performance differences. While LLMs have consistent preferences that reliably predict advice-giving behavior, these preferences do not consistently translate into downstream task performance.

URL: https://openreview.net/forum?id=RmWSM3sFOr

---

Title: D-SCOPE: Diffusion-based Sonar Counterfactual and Prototype Explanations

Abstract: The harsh conditions of underwater environments pose significant challenges for effective monitoring. While using cameras is possible, they are typically limited to short ranges due to underwater visibility conditions. SOund NAvigation and Ranging (SONAR) can perceive objects at greater distances, but produces low-visibility images that are hard to interpret, even for experts. When Artificial Intelligence (AI) methods are used on these SONAR images, Explainable Artificial Intelligence (XAI) methods might help the user understand the AI outputs. Traditional explainability methods, such as saliency maps or perturbation-based visualisations, often struggle to provide informative explanations when applied to low-contrast imagery. This work introduces Diffusion-based SONAR COunterfactual & Prototype Explanations (D-SCOPE), a novel post-hoc explainability framework for SONAR image classification. Our approach leverages classifier-guided diffusion models, trained on two publicly available Marine Debris Forward-Looking SONAR datasets, to generate two types of visual explanations: (1) counterfactual explanations that highlight minimal semantic changes to alter a model’s decision, and (2) prototype-based explanations for case-based reasoning that translate representative RGB samples into the SONAR domain, serving as intuitive visual references. For counterfactual explanations, a semi-factual explanation is generated by displaying the intermediate steps leading to a change in prediction. For the prototype-based explanation, class-specific prototypes are provided. To the best of our knowledge, this is the first approach applying diffusion-based generative models for explainability in the SONAR modality. Guided diffusion models are shown to produce high-fidelity, class-conditioned counterfactuals in challenging underwater settings. In addition, the proposed cross-domain prototype generation mechanism enhances human interpretability by bridging the gap between clear and recognisable RGB representations and SONAR imagery. Our framework is validated through qualitative and quantitative experiments as well as a controlled human evaluation. The code and the pretrained models will be released to support further research.

URL: https://openreview.net/forum?id=TXURDGteqv

---

Title: Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing

Abstract: Sampling-based optimization (SBO), such as the cross-entropy method and evolutionary algorithms, has achieved many successes in solving non-convex problems without gradients, yet its convergence is poorly understood. In this paper, we establish a non-asymptotic convergence analysis for SBO through the lens of smoothing. Specifically, we recast SBO as gradient descent on a smoothed objective, mirroring noise‑conditioned score ascent in diffusion models. Our first contribution is a landscape analysis of the smoothed objective, demonstrating how smoothing helps escape local minima and uncovering a fundamental coverage–optimality trade-off: smoothing renders the landscape more benign by enlarging the locally convex region around the global minimizer, but at the cost of introducing an optimality gap. Building on this insight, we establish non-asymptotic convergence guarantees for SBO algorithms to a neighborhood of the global minimizer. Furthermore, we propose an annealed SBO algorithm, Diffusion-Inspired-Dual-Annealing (DIDA), which is provably convergent to the global optimum. We conduct extensive numerical experiments to verify our landscape results and also demonstrate the compelling performance of DIDA compared to other gradient-free optimization methods. Lastly, we discuss implications of our results for diffusion models.

URL: https://openreview.net/forum?id=8Nx1ZjfEhd

---

Title: Domain Adaptation for Cold-Start Users in Sequential Recommendation

Abstract: Sequential recommendation tracks users' preferences over time based on users' historical activities and predicts their next most probable action. However, this approach faces limitations when dealing with cold-start users who possess minimal interaction data, leading to difficulty in learning their preferences. To address this challenge, by taking regular users with longer interaction histories and cold-start users as two domains, this paper introduces domain adaptation techniques to narrow the performance gap caused by knowledge shifts in domains. We propose a dual-transformer framework with separate models for long (source) and short (target) sequences, collaboratively trained with shared item embeddings. To enable effective knowledge transfer, we introduce an emulated target domain by sampling short sequences from the source, and apply contrastive learning to align their contextual representations. To further improve adaptation under complex knowledge shifts, we reduce item popularity bias and incorporate user similarity into the contrastive loss. Experiments on five public datasets show consistent improvements over strong baselines, demonstrating the robustness of our approach under both length shifts and compounded shifts involving item distribution changes.
Our code can be found at https://anonymous.4open.science/r/DACSR-1.

URL: https://openreview.net/forum?id=jrVmW6LfMa

---

Title: CGR: Confidence-Guided Replay for Buffer-Based Continual Learning

Abstract: Continual Learning (CL) aims to acquire new knowledge while preserving previously learned information without catastrophic forgetting. Buffer-based methods, which retain samples from past tasks, have demonstrated promising results; however, efficiently allocating limited buffer space remains a significant challenge. Recent studies often either neglect the varying impact individual samples have on the learning process or incur high computational costs to identify informative replay samples. To overcome these limitations, we propose a novel approach called Confidence-Guided Replay (CGR), a lightweight policy that dynamically allocates the buffer by monitoring confidence fluctuations in the main continual learner model. Leveraging measures of sample contribution and difficulty, CGR adaptively prioritizes highly informative samples within the buffer, significantly enhancing knowledge retention and utilization efficiency. Our approach provides a flexible solution for dynamic buffer allocation, effectively addressing the varying importance and learning complexity of samples over time, and improves CL performance.

URL: https://openreview.net/forum?id=RulqfUzCoL

---

Title: Adapt the Face, Not the Voice: Asymmetric Fine-Tuning of Foundation Models for Cross-Modal Person Matching

Abstract: Cross-modal person matching - associating a person’s voice with their face - requires bridging speech and vision representations that share no direct physical correspondence. We investigate a simple approach: pairing frozen unimodal foundation models (WavLM-Large for speech, SigLIP ViT-B/16 for faces) with lightweight trainable projections into a shared embedding space. Our central finding is an informative asymmetry in the effectiveness of Low-Rank Adaptation (LoRA): adapting the face encoder yields substantial gains while adapting the voice encoder provides no benefit. We explain this asymmetry through layer-wise identity probing: WavLM already encodes strong speaker identity information (93.8% linear probe accuracy on 70 classes), while SigLIP’s face identity representations are comparatively weak (79.5%), leaving substantially more room for task-specific adaptation. This gap widens on a larger evaluation: on 1,211-identity VoxCeleb1, WavLM maintains 90.5% probe accuracy while SigLIP drops to 58.1%. The asymmetric LoRA finding replicates across two datasets - MAV-Celeb (70 identities, per-identity split) and VoxCeleb1 (1,211 identities, identity-disjoint split) - and across evaluation protocols including verification, retrieval, and N-way matching. On MAV-Celeb, face-only LoRA achieves 16.6 ± 0.4% Equal Error Rate (mean ± std over 3 seeds) with only 1.33M trainable parameters (0.32% of the encoder total), compared to 19.9% for the prior best published result under a comparable (though not identical) evaluation protocol. Our results suggest a hypothesis for cross-modal adaptation: selectively adapting the encoder whose pretraining is least aligned with the target task is both necessary and sufficient.
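LoRA, central to the asymmetry finding above, adds a trainable low-rank update alongside a frozen weight. A minimal NumPy sketch of the generic technique (not the authors' code; the WavLM/SigLIP encoders are abstracted to a single linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                         # feature dimension and low rank, r << d

W = rng.normal(size=(d, d))         # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialised

def lora_forward(x):
    # Frozen path plus low-rank update: y = W x + B (A x).
    # Only A and B (2 * r * d parameters) would be trained.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
# Zero-initialising B makes the adapted layer start out exactly at the
# frozen pretrained behaviour; training then moves it only as far as the
# rank-r update allows.
```

The parameter count scales as 2rd rather than d², which is how the paper reaches 0.32% of the encoder total.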

URL: https://openreview.net/forum?id=6pr8Zwv64z

---

Title: An Efficient End-to-End Framework for Localized Second-Order Pooling

Abstract: Second-order pooling has proven effective for deep image classification by representing an image with the covariance matrix of its local feature descriptors. However, the diverse visual content in images leads to local descriptors being distributed as multiple modes in the feature space, limiting the effectiveness of a single global covariance matrix. In this work, we propose an efficient end-to-end framework for localized second-order pooling. Our approach jointly learns clusters of local feature descriptors across the entire training set, adaptively assigns the descriptors of each image to the appropriate clusters, computes a localized covariance matrix based on the descriptors assigned to each cluster, and then integrates these matrices to form the image representation. This is achieved through an attention-based local cluster mining branch that automatically identifies the clusters that each local descriptor should be assigned to. Furthermore, to manage the significant computational overhead incurred by the use of multiple local covariance matrices, we utilize a simple but efficient sample-adaptive feature fusion scheme. This scheme adaptively generates fusion weights for each localized covariance matrix using a lightweight predictor, ensuring both computational efficiency and flexibility. Extensive experiments on multiple fine-grained and large-scale image classification datasets demonstrate that our method consistently improves performance when integrated with state-of-the-art second-order pooling methods and leading network architectures. Ablation studies further verify the efficiency of our feature fusion scheme compared to the existing common alternatives.

URL: https://openreview.net/forum?id=nJBeXjxAqx

---

Title: World Model Anomaly Detection with a Latent Linear Prior

Abstract: Model-based reinforcement learning (MBRL) learns world models—internal simulators of environment dynamics—to plan by imagining future trajectories. However, when these models incorrectly predict state transitions, they generate unrealistic states that mislead agents into learning delusional policies. Inspired by human vision, we propose anomaly detection in world models with \textbf{L}inear \textbf{P}rior (LP), a three‐stage approach that 1) enforces a lightweight linear prior on successive latent states, 2) flags generated states that deviate from this prior, and 3) removes their contribution during agent learning. On the challenging Atari100k benchmark, LP-assisted GRU- and Transformer-based MBRL agents achieve competitive results while requiring fewer value updates with minimal additional computational cost. Notably, by suppressing false value updates with LP, DreamerV3 boosts human-normalized mean score by 9% while requiring less than 90% of the value updates. We release our implementation at https://anonymous.4open.science/r/lp-dreamer.
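The three stages above can be illustrated on toy data: fit a linear map between successive latent states, then flag transitions whose residual under that map is large. A hypothetical NumPy sketch (toy dynamics and threshold, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy rollout: successive latent states follow z_next ≈ z @ M.T + noise.
M = rng.normal(size=(4, 4)) * 0.5
Z = rng.normal(size=(500, 4))
Z_next = Z @ M.T + rng.normal(scale=0.05, size=(500, 4))

# Stage 1: fit the lightweight linear prior by least squares.
M_hat, *_ = np.linalg.lstsq(Z, Z_next, rcond=None)

def is_anomalous(z, z_next, threshold=1.0):
    # Stage 2: flag generated transitions that deviate strongly from the
    # prior. Stage 3 (not shown) would remove such states' contribution
    # from the agent's value updates.
    return float(np.linalg.norm(z_next - z @ M_hat)) > threshold

z = rng.normal(size=4)
# A transition that follows the fitted prior passes; a large deviation
# from the predicted next latent is flagged.
```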

URL: https://openreview.net/forum?id=VLIzLK3CfR

---

Title: RevealIt: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability

Abstract: Understanding the agent's learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent's decision-making process. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they exclusively function in 2D environments or with simple transition dynamics. Understanding the agent's learning process in complex environments or tasks is more challenging. In this paper, we propose RevealIt, a novel framework for explaining the learning process of an agent in complex environments. Initially, we visualize the policy structure and the agent's learning process for various training tasks. By visualizing these findings, we can understand how much a particular training task or stage affects the agent's test performance. Then, a GNN-based explainer learns to highlight the most important section of the policy, providing a clearer and more robust explanation of the agent's learning process. The experiments demonstrate that explanations derived from this framework can effectively help optimize the training tasks, resulting in improved learning efficiency and final performance.

URL: https://openreview.net/forum?id=Dmu7agmmTQ

---

Title: Broadcast Product: Redefining Shape-aligned Element-wise Multiplication and Beyond

Abstract: Broadcast operations are widely used in scientific computing libraries, yet their mathematical formulation is often implicit and inconsistently represented in machine learning literature. This frequently leads to formally invalid equations in which element-wise products are written between tensors of mismatched shapes. In this paper, we formalize such operations by introducing the broadcast product $\boxdot$, which explicitly extends the Hadamard product through shape-aligned element duplication. We provide a rigorous definition of the broadcast product, analyze its algebraic properties, and show how it can be expressed using standard linear algebra. Building on this framework, we formulate least-squares problems and propose a broadcast decomposition. Preliminary experiments on synthetic data suggest that the proposed decomposition captures structures that are challenging for conventional tensor decompositions. This work establishes a mathematical foundation for broadcast-aware tensor operations, connecting practical implementations with rigorous tensor analysis.
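The $\boxdot$ semantics described above mirror the implicit broadcasting of array libraries; a minimal NumPy sketch that makes the shape-aligned element duplication explicit (function name hypothetical, not from the paper):

```python
import numpy as np

def broadcast_product(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Shape-aligned element-wise product: duplicate elements along
    size-1 axes until both operands share a common shape, then take the
    ordinary Hadamard product. This spells out what `a * b` does
    implicitly under NumPy broadcasting."""
    shape = np.broadcast_shapes(a.shape, b.shape)
    return np.broadcast_to(a, shape) * np.broadcast_to(b, shape)

# A (3, 1) column times a (1, 4) row yields a (3, 4) result: each
# operand is duplicated along its size-1 axis before multiplying.
x = np.arange(3).reshape(3, 1)
y = np.arange(4).reshape(1, 4)
print(broadcast_product(x, y).shape)  # (3, 4)
```

The point of the paper's formalism is that this duplication step is usually left implicit, which is exactly what makes hand-written equations involving element-wise products of mismatched shapes ill-defined.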

URL: https://openreview.net/forum?id=zv0OtOPpPO

---

Title: MIST: Mutual Information Estimation via Supervised Training

Abstract: We propose a fully data-driven approach to designing mutual information (MI) estimators. Since any MI estimator is a function of the observed sample from two random variables, we parameterize this function with a neural network (MIST) and train it end-to-end to predict MI values. Training is performed on a large meta-dataset of 625,000 synthetic joint distributions with known ground-truth MI. To handle variable sample sizes and dimensions, we employ a two-dimensional attention scheme ensuring permutation invariance across input samples. To quantify uncertainty, we optimize a quantile regression loss, enabling the estimator to approximate the sampling distribution of MI rather than return a single point estimate. This research program departs from prior work by taking a fully empirical route, trading universal theoretical guarantees for flexibility and efficiency. Empirically, the learned estimators largely outperform classical baselines across sample sizes and dimensions, including on joint distributions unseen during training. The resulting quantile-based intervals are well-calibrated and more reliable than bootstrap-based confidence intervals, while inference is orders of magnitude faster than existing neural baselines. Beyond immediate empirical gains, this framework yields trainable, fully differentiable estimators that can be embedded into larger learning pipelines. Moreover, exploiting MI’s invariance to invertible transformations, meta-datasets can be adapted to arbitrary data modalities via normalizing flows, enabling flexible training for diverse target meta-distributions.
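The quantile regression loss mentioned above is typically the pinball loss, whose expected value is minimised when the prediction equals the target quantile. A small sketch illustrating that property on synthetic data (generic pinball loss, not the MIST training code):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss: its expectation over y_true is minimised
    when y_pred equals the tau-quantile of y_true's distribution."""
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

rng = np.random.default_rng(1)
y = rng.normal(size=10_000)

# Sweep candidate predictions and keep the one with the lowest loss.
candidates = np.linspace(-3.0, 3.0, 601)
losses = [pinball_loss(y, c, tau=0.9) for c in candidates]
best = candidates[int(np.argmin(losses))]
# `best` lands near the empirical 0.9-quantile of the sample
# (approximately 1.28 for a standard normal).
```

Training one head per quantile level in this way is what lets an estimator output a sampling distribution over MI values rather than a single point estimate.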

URL: https://openreview.net/forum?id=Qi4JgS2PLw

---

Title: Issues with Value-Based Multi-objective Reinforcement Learning: Value Function Interference and Overestimation Sensitivity

Abstract: Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's preferences with respect to the different objectives. This paper investigates two previously unreported issues which can hinder the performance of value-based MORL algorithms when applied in conjunction with a non-linear utility function -- value function interference, and sensitivity to overestimation. We illustrate the nature of these phenomena on simple multi-objective MDPs using a tabular implementation of multi-objective Q-learning.
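Step (2) above, scalarised action selection over vector-valued Q-values, can be sketched with a toy table (values hypothetical, not from the paper):

```python
import numpy as np

# Toy vector-valued Q-table: one state, three actions, two objectives.
Q = np.array([[4.0, 0.0],    # action 0: all of objective 1
              [0.0, 4.0],    # action 1: all of objective 2
              [2.0, 2.0]])   # action 2: balanced across objectives

def utility(v):
    # A non-linear utility expressing a preference for balanced outcomes:
    # the product is zero whenever either objective is neglected entirely.
    return v[0] * v[1]

# Greedy action selection applies the utility to the learned Q-vectors.
# With a non-linear utility, the ordering of actions can differ sharply
# from any linear scalarisation, which is the setting in which the
# paper's interference and overestimation issues arise.
best_action = int(np.argmax([utility(q) for q in Q]))
print(best_action)  # 2: the balanced action maximises this utility
```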

URL: https://openreview.net/forum?id=KImrufIw0L

---

Title: Is Extending Modality the Right Path Towards Omni-Modality?

Abstract: Omni-modal language models (OLMs) aim to integrate and reason over diverse input modalities, such as text, images, video, and audio, while maintaining strong language capabilities. Despite recent advancements, existing models, especially open-source ones, remain far from true omni-modality, struggling to generalize beyond the specific modality pairs they are trained on or to achieve strong performance when processing multi-modal inputs. We study the effect of extending modality, the dominant technique for training multimodal models, where an off-the-shelf language model is fine-tuned on target-domain and language data. Specifically, we investigate three key questions: (1) Does modality extension compromise core language abilities? (2) Can model merging effectively integrate independently fine-tuned modality-specific models to achieve omni-modality? (3) Does omni-modality extension lead to better knowledge sharing and generalization compared to sequential extension? Through extensive experiments, we analyze these trade-offs and provide insights into the feasibility of achieving true omni-modality using current approaches.

URL: https://openreview.net/forum?id=Hb9QHdu0Fm

---

Title: Data selection through iterative Self-Filtering for vision-language settings

Abstract: The availability of large amounts of clean data is paramount to training neural networks. However, at large scales, manual oversight is impractical, resulting in sizeable datasets that can be very noisy. Attempts to mitigate this obstacle to producing performant vision-language models have so far involved heuristics, curated reference datasets, and using pre-trained models. Here we propose a novel, bootstrapped method in which a CLIP model is trained on an evolving, self-selected dataset. This evolving dataset constitutes a balance of filtered, highly probable clean samples as well as diverse samples from the entire distribution. Our proposed Self-Filtering method iterates between training the model and selecting a subsequently improved data mixture. Training on vision-language datasets filtered by the proposed approach improves the downstream performance, without the need for additional data or pre-trained models.

URL: https://openreview.net/forum?id=F09NfCXuCe

---

Title: MetaTT: A Global Tensor-Train Adapter for Parameter-Efficient Fine-Tuning

Abstract: We present MetaTT, a Tensor Train (TT) adapter framework for fine-tuning of pre-trained transformers. MetaTT enables flexible and parameter-efficient model adaptation by using a single shared TT to factorize transformer sub-modules. This factorization indexes key structural dimensions, including layer and matrix type, and can optionally incorporate heads and tasks. This design allows MetaTT’s parameter count to scale with the sum, rather than the product, of the modes, resulting in a substantially more compact adapter. Our benchmarks compare MetaTT with LoRA along with recent state-of-the-art matrix and tensor decomposition based fine-tuning methods. We observe that when tested on single-task standard language modeling benchmarks, MetaTT achieves competitive parameter efficiency to accuracy tradeoff. We further demonstrate that MetaTT performs competitively when compared to state-of-the-art methods on multi-task learning. Finally, we leverage the TT-ansatz to design a rank-adaptive optimizer inspired by the DMRG method from many-body physics. Our results demonstrate that integrating this approach with AdamW enhances optimization performance for a specified target rank.

URL: https://openreview.net/forum?id=1HdcPWfA9s

---

Title: MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Abstract: Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modality, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document's content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, including MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, which achieves an average improvement of 12.1% over the current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information.

URL: https://openreview.net/forum?id=F8YOD912sf

---

Title: Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection

Abstract: The reliability of artificial intelligence (AI) systems in open-world settings depends heavily on their ability to flag out-of-distribution (OOD) inputs unseen during training. Recent advances in large-scale vision-language models (VLMs) have enabled promising few-shot OOD detection frameworks using only a handful of in-distribution (ID) samples. However, existing prompt learning-based OOD methods largely overlook the geometry of the visual feature embeddings learned by VLMs, whose structure is particularly informative for distinguishing ID from OOD data and holds rich representation capacity as they are pre-trained on millions of samples. To address this, we introduce a \textit{geometry-aware context optimization framework} that integrates subspace representation learning with prompt tuning. By projecting ID-relevant features into a subspace spanned by prompt vectors and simultaneously projecting ID-irrelevant components via orthogonal null-space projections, our approach strengthens the discriminative power of the learned prompt vectors, thereby leading to enhanced ID–OOD separability at test time. To enable easy-to-handle, end-to-end learning under this framework, we design a geometry-regularized learning criterion that ensures strong OOD detection performance as well as high ID classification accuracy across settings. Moreover, the proposed framework can be seamlessly integrated with a wide range of existing context optimization methods, effectively complementing their softmax-based OOD detectors. Experiments on various real-world datasets showcase the effectiveness of our approach for reliable open-world AI systems.

URL: https://openreview.net/forum?id=TFG2gPjkiF

---

Title: PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data

Abstract: Off-policy evaluation (OPE) methods estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of OPE methods. Unfortunately, such auxiliary datasets may also be biased, and existing methods for using data augmentation within OPE lack principled uncertainty quantification. In high stakes domains like healthcare, reliable uncertainty estimates are important for ensuring safe and informed deployment of RL policies. In this work, we propose two methods to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over $V^{\pi}(s)$, the policy value conditioned on an initial state $s$. To do so we introduce a new conformal prediction method suitable for Markov Decision Processes (MDPs) with high-dimensional state spaces. Second, we consider the more common task of estimating the average policy performance over all initial states, $V^{\pi}$; we introduce a method that draws on ideas from doubly robust estimation and prediction powered inference. Across simulators spanning inventory management, robotics, healthcare, and a real healthcare dataset from MIMIC-IV, we find that our methods can effectively leverage auxiliary data and consistently produce confidence intervals that cover the ground truth policy values, unlike previously proposed methods. Our work enables a future in which OPE can provide rigorous uncertainty estimates for high-stakes domains.
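Conformal prediction, which the first method above builds on, turns a point estimator into an interval using held-out calibration residuals. A generic split-conformal sketch under exchangeability (not the paper's MDP-specific construction):

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_truth, test_pred, alpha=0.1):
    """Split conformal prediction: widen a point prediction into an
    interval with ~(1 - alpha) coverage, using absolute residuals on a
    held-out calibration set and assuming the calibration and test
    points are exchangeable."""
    scores = np.abs(cal_truth - cal_preds)
    n = len(scores)
    # Finite-sample-corrected quantile level of the calibration scores.
    level = min(np.ceil((n + 1) * (1.0 - alpha)) / n, 1.0)
    q = np.quantile(scores, level)
    return test_pred - q, test_pred + q

rng = np.random.default_rng(3)
cal_preds = rng.normal(size=1000)
cal_truth = cal_preds + rng.normal(scale=0.5, size=1000)
lo, hi = split_conformal_interval(cal_preds, cal_truth, test_pred=0.0)
# The interval half-width approximates the 0.9-quantile of the
# absolute prediction error on the calibration set.
```

The paper's contribution is extending this kind of guarantee to value estimates $V^{\pi}(s)$ in high-dimensional MDPs, where the auxiliary (possibly biased) data complicates the exchangeability assumption made here.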

URL: https://openreview.net/forum?id=RbecZhn2qX

---

Title: Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

Abstract: Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find that 1) claim-response entailment consistently performs better than, or on par with, more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.

URL: https://openreview.net/forum?id=gngp4Zz9Sj

---

Title: NAS Without Priors: A Robust Architecture Search Framework for Unseen-Data

Abstract: Neural architecture search (NAS) has been widely used to automate neural network design for image classification; however, most NAS research has focused on CIFAR, ImageNet, and their derivative benchmarks. These datasets benefit from well-established architecture design practices, data preprocessing techniques, and training protocols developed prior to NAS, causing many NAS methods to meta-overfit and struggle to generalize to entirely novel datasets. In this work, we analyze the limitations of existing NAS practices and propose a framework specifically designed to generalize to unseen data. In contrast to the prevailing paradigm of exploring extremely large search spaces using low-fidelity evaluations, we advocate sparser exploration combined with high-fidelity performance estimation. We demonstrate that macro-architecture variations alone induce substantial architectural diversity, and that concentrating computational resources on high-fidelity evaluation of fewer candidates produces reliable reward signals enabling better architecture discovery. To obtain robust candidate rankings, we repeatedly train architectures on the entire training set using multiple random seeds. While this approach substantially reduces performance variance due to random seed variability and enables accurate candidate ranking, it comes at a significant computational cost. To mitigate the cost of such high-fidelity evaluation, particularly for larger or high-resolution datasets, we introduce a dataset- and architecture-aware multi-fidelity search strategy that both reduces computational overhead and stabilizes candidate rankings under varying fidelity levels.
We evaluate our framework on the NAS Unseen-Data Challenge, where, under a strict time constraint of 8 hours per dataset for both search and training, it outperforms manually designed architectures across all three challenge datasets and achieves first place with a combined score of 12.19, compared to 10.89 and 10.43 for the second- and third-place NAS solutions, respectively.

URL: https://openreview.net/forum?id=PaEk1gYrFz

---

Title: Score-based Lyapunov Stable Neural ODE for Robust Classification

Abstract: Adversarial attacks pose a significant obstacle to the widespread deployment of modern AI systems. These attacks, often implemented as imperceptible perturbations to the input, typically an image, can deliberately mislead neural networks into making incorrect predictions. Over the past decade, numerous studies have sought to understand and mitigate this vulnerability. Among them, a promising line of research interprets neural networks as dynamical systems and leverages Lyapunov theory to enhance robustness against adversarial perturbations. However, the original intent of Lyapunov theory differs from that of building accurate and robust neural networks, leading to conceptual and practical challenges. Existing approaches typically incorporate Lyapunov constraints through penalization, but such formulations only ensure local stability around input data points and do not guarantee broader regions of convergence. In this work, we propose a framework based on vector fields that explicitly admit asymptotically stable equilibrium points, thereby strengthening the Lyapunov-based foundation of the model. This enhanced theoretical grounding enables us to prove that every point within the support of the input data distribution converges to a stable equilibrium point, and enables us to draw a natural connection to the concept of score estimation. Experimentally, we demonstrate that our model improves adversarial robustness over prior Lyapunov-regularized approaches across standard image classification benchmarks. Qualitatively, the induced dynamics exhibit a denoising effect against adversarial perturbations, driving inputs toward stable modes of the data distribution.

URL: https://openreview.net/forum?id=LqIKlX9Pzg

---

Title: DMT-JEPA: Learning Discriminative Masked Targets for Joint-Embedding Predictive Architecture

Abstract: The joint-embedding predictive architecture (JEPA) has recently shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets. Consequently, DMT-JEPA highlights that increased discriminative power of target representations benefits a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate our effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection tasks.
Code is available at: \url{https://anonymous.4open.science/r/DMT-JEPA-anony}.

URL: https://openreview.net/forum?id=73demKsXn4

---

Title: Freeze, Prompt, and Adapt: A Framework for Source-free Unsupervised GNN Prompting

Abstract: Prompt tuning has become a key mechanism for adapting pre-trained Graph Neural Networks (GNNs) to new downstream tasks. However, existing approaches are predominantly supervised, relying on labeled data to optimize the prompting parameters and typically fine-tuning a task-specific prediction head—practices that undermine the promise of parameter-efficient adaptation. We propose the Unsupervised Graph Prompting Problem (UGPP), a challenging new setting where the pre-trained GNN is kept entirely frozen, labels on the target domain are unavailable, the source data is inaccessible, and the target distribution exhibits covariate shift. To address this, we propose UGPrompt, the first fully unsupervised GNN prompting framework. UGPrompt leverages consistency regularization and pseudo-labeling to train a prompting function, complemented with diversity and domain regularization to mitigate class imbalance and distribution mismatch. Our extensive experiments show that UGPrompt consistently outperforms state-of-the-art supervised prompting methods with access to labeled data, demonstrating the viability of unsupervised prompting as a practical adaptation paradigm for GNNs.

URL: https://openreview.net/forum?id=9KKgIQwCLO

---

Title: CreativityPrism: A Holistic Evaluation Framework for Large Language Model Creativity

Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as generating creative text, there is still no holistic and scalable framework to evaluate their creativity across diverse scenarios. Existing methods of LLM creativity evaluation either rely heavily on humans, limiting speed and scalability, or are fragmented across different domains and different definitions of creativity. To address this gap, we propose CreativityPrism, an evaluation analysis framework that consolidates eight tasks from three domains (divergent thinking, creative writing, and logical reasoning) into a taxonomy of creativity that emphasizes three dimensions of LLM generations: quality, novelty, and diversity. The framework is designed to be scalable, with reliable automatic evaluation judges that have been validated against human annotations. We evaluate 17 state-of-the-art (SoTA) proprietary and open-source LLMs on CreativityPrism and find that while proprietary LLMs dominate creative writing and logical reasoning tasks by a 15% lead over open-source ones, they offer no significant advantage in divergent thinking, a domain much less explored in existing post-training regimes. Our analysis also shows that high performance in one creative dimension or domain rarely generalizes to others; specifically, novelty metrics often show weak or negative correlations with other metrics. This fragmentation confirms that a holistic, multi-dimensional framework like CreativityPrism is essential for any meaningful assessment of LLM creativity.

URL: https://openreview.net/forum?id=3pfsQcEtNC

---

Title: Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models

Abstract: DragDiffusion \citep{Shi_2024_CVPR} is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity-preserving fine-tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors’ released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA-based fine-tuning \citep{Hu_2022_ICLR}, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi-timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible.

URL: https://openreview.net/forum?id=iJoJVNl1tp

---

Title: Towards Scalable Explainable AI: Using Vision-Language Models to Interpret Vision Systems

Abstract: Explainable AI (xAI) is increasingly important for the trustworthy deployment of vision models in domains such as medical imaging, autonomous driving, and safety-critical systems. However, modern vision models are typically trained on massive datasets, making it nearly impossible for researchers to manually track how models learn from each sample, especially when relying on saliency maps that require intensive visual inspection. Traditional xAI methods, while useful, often focus on instance-level explanations and risk losing important information about model behavior at scale, leaving analysis time-consuming, subjective, and difficult to reproduce. To overcome these challenges, we propose an automated evaluation pipeline that leverages Vision-Language Models to analyze vision models at both the sample and dataset levels. Our pipeline systematically assesses, generates, and interprets saliency-based explanations, aggregates them into structured summaries, and enables scalable discovery of failure cases, biases, and behavioral trends. By reducing reliance on manual inspection while preserving critical information, the proposed approach facilitates more efficient and reproducible xAI research, supporting the development of robust and transparent vision models.

URL: https://openreview.net/forum?id=Ta2cvwmlVb

---

Title: A Descriptive and Normative Theory of Human Beliefs in RLHF

Abstract: Human preferences in RLHF are typically modeled as a function of the human's reward function or corresponding optimal state-action values. In this work, we propose that human beliefs about the capabilities of the agent being trained also play a key role in preference generation. We examine two questions related to this hypothesis, one descriptive and one normative, respectively: Do human labelers' beliefs about agent capabilities affect the preferences that they provide? And what is the ideal set of beliefs about an agent---and resulting preferences---for humans to have? We propose a new preference model that incorporates human beliefs and provide a normative theory that bounds the error on the final learned policy based on the _mismatch_ between the human's beliefs and an idealized set of beliefs. We then confirm via a human study that beliefs about agent capabilities do, in fact, significantly affect preferences and can be influenced through simple interventions. Additionally, we empirically show through synthetic experiments that it is often suboptimal for human preference labelers to assume agent optimality. Collectively, these results theoretically and empirically demonstrate how reducing the mismatch between human beliefs and agent capabilities can lead to more performant RLHF and point toward new best practices for RLHF practitioners.

URL: https://openreview.net/forum?id=YdW0KZwPeT

---

Title: DINOv3

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images—using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
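The abstract names Gram anchoring as the fix for dense-feature degradation but does not spell out its form. One plausible reading, sketched below as an assumption rather than the paper's actual objective, is a penalty on the distance between the patch-similarity (Gram) matrices of the current model and an earlier anchor checkpoint.

```python
def gram(feats):
    # Gram matrix of pairwise dot products between patch features
    return [[sum(a * b for a, b in zip(u, v)) for v in feats] for u in feats]

def gram_anchor_loss(student_feats, anchor_feats):
    """Hypothetical Gram-anchoring objective: squared Frobenius distance
    between the student's patch-similarity structure and the anchor's.
    The real DINOv3 loss may differ in normalization and detail."""
    gs, ga = gram(student_feats), gram(anchor_feats)
    n = len(gs)
    return sum((gs[i][j] - ga[i][j]) ** 2 for i in range(n) for j in range(n))

# identical patch features give zero loss; drifted features are penalized
anchor = [[1.0, 0.0], [0.0, 1.0]]
loss_same = gram_anchor_loss(anchor, anchor)
loss_drift = gram_anchor_loss([[1.0, 0.5], [0.0, 1.0]], anchor)
```

The appeal of anchoring similarities rather than raw features is that the student can keep improving individual representations while the relational structure of the dense feature map is held stable over long schedules.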

URL: https://openreview.net/forum?id=2NlGyqNjns

---

Title: UniRec: Unified Multimodal Encoding for LLM-Based Recommendations

Abstract: Large language models (LLMs) have recently shown promise for multimodal recommen-
dation, particularly with text and image inputs. Yet real-world recommendation signals
extends far beyond these modalities. To reflect this, we formalize recommendation fea-
tures into four modalities: text, images, categorical features, and numerical attributes, and
emphasize unique challenges this heterogeneity poses for LLMs in understanding multi-
modal information. In particular, these challenges arise not only across modalities but also
within them, as attributes (e.g., price, rating, time) may all be numeric yet carry distinct
meanings. Beyond this intra-modality ambiguity, another major challenge is the nested
structure of recommendation signals, where user histories are sequences of items, each car-
rying multiple attributes. To address these challenges, we propose UniRec, a unified mul-
timodal encoder for LLM-based recommendation. UniRec first employs modality-specific
encoders to produce consistent embeddings across heterogeneous signals. It then applies
a triplet representation—comprising attribute name, type, and value—to separate schema
from raw inputs and preserve semantic distinctions. Finally, a hierarchical Q-Former mod-
els the nested structure of user interactions while maintaining their layered organization.
On multiple real-world benchmarks, UniRec outperforms state-of-the-art multimodal and
LLM-based recommenders by up to 15%, while extensive ablation studies further validate
the contributions of each component.

URL: https://openreview.net/forum?id=WXE255GWhQ

---

Title: FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data

Abstract: The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data. This naturally leads to the question of whether such logs can be fully leveraged to fuse LLMs' complementary capabilities. Although prior work has explored various strategies for integrating multiple LLMs, we argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs (e.g., fine-tuning and inference stages). To this end, we introduce LLMFusionBench, a large-scale benchmark for LLM fusion that spans 14 tasks across five domains, with responses from 20 open-source LLMs (8B-671B) totaling 103M tokens. Building on LLMFusionBench, we propose FusionFactory, a systematic framework operating at three levels: (1) query-level fusion via tailored LLM routers, (2) thought-level fusion leveraging retrieved abstract reasoning templates, and (3) model-level fusion via distillation from top-ranked responses. Experiments show that FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with the optimal fusion configuration varying across benchmarks, highlighting the promise of multi-LLM log data as a practical foundation for fusing diverse LLM capabilities.

URL: https://openreview.net/forum?id=N951scS3yE

---

Title: FairSAM: Fair Classification on Corrupted Data Through Sharpness-Aware Minimization

Abstract: Image classification models trained on clean data often suffer from significant performance degradation when exposed to corrupted testing or deployment data, such as images with impulse noise, Gaussian noise, or environmental noise. This degradation not only impacts overall performance but also disproportionately affects various demographic subgroups, raising critical algorithmic bias concerns. Although robust learning algorithms such as Sharpness-Aware Minimization (SAM) improve overall model robustness and generalization, they do not address biased performance degradation across demographic subgroups. Existing fairness-aware machine learning methods aim to reduce performance disparities but struggle to maintain robust and equitable accuracy across demographic subgroups when faced with data corruption. This reveals an inherent tension between robustness and fairness when dealing with corrupted data. To address these challenges, we introduce a newly designed metric to assess performance degradation across subgroups under data corruption. We propose FairSAM, a framework that integrates fairness-oriented strategies into SAM to deliver equalized performance across demographic groups under corrupted conditions. Our experiments on multiple real-world datasets and various predictive tasks show that FairSAM reconciles robustness and fairness. The framework yields a structured solution for fair and robust image classification in the presence of data corruption.
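For background, the SAM inner step that FairSAM builds on can be sketched as follows. This shows only vanilla SAM on a toy scalar loss, not the fairness-oriented, per-group extension the paper proposes.

```python
import math

def sam_gradient(grad_fn, w, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb the weights a
    distance rho along the normalized gradient (the worst-case
    direction to first order), then return the gradient evaluated at
    that perturbed point. This is the vanilla SAM recipe, not FairSAM."""
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    return grad_fn(w_adv)

# toy loss L(w) = w^2, gradient 2w: SAM evaluates the gradient at
# w + rho, i.e. slightly uphill, which penalizes sharp minima
grad = lambda w: [2.0 * w[0]]
g_sam = sam_gradient(grad, [1.0], rho=0.05)
```

FairSAM's contribution, per the abstract, is to make this flat-minimum bias serve subgroup parity as well as average robustness; the sketch above only shows the shared SAM machinery.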

URL: https://openreview.net/forum?id=W2QKvn57yw

---

Title: Forge: Foundational Optimization Representations from Graph Embeddings

Abstract: Combinatorial optimization problems are ubiquitous in science and engineering. Still, learning-based approaches to accelerate combinatorial optimization often require solving a large number of difficult instances to collect training data, incurring significant computational cost. Existing learning-based methods require training dedicated models for each problem distribution and each downstream task, severely limiting their scalability and generalization. We introduce Forge: Foundational Optimization Representations from Graph Embeddings, a framework that pre-trains a vector-quantized graph autoencoder on a large, diverse collection of mixed-integer programming (MIP) instances in an unsupervised manner, without relying on optimization solvers or optimal solutions. Vector quantization produces discrete code assignments that serve as a vocabulary for representing optimization instances. We evaluate Forge in both unsupervised and supervised settings. In the unsupervised setting, Forge embeddings effectively cluster unseen instances across problem domains and sizes. In the supervised setting, we fine-tune Forge embeddings and show that a single pre-trained model helps predict both the integrality gap for cut generation and variable hints for search guidance across multiple problem and size distributions. In both tasks, we improve the performance of a commercial optimization solver and outperform state-of-the-art learning-based methods. Finally, we open-source our training code, pre-trained Forge weights, and embeddings for multiple MIP distributions to foster further research in representation learning for optimization problems.
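The vector-quantization step that yields Forge's discrete "vocabulary" can be illustrated by a generic nearest-codebook assignment; this is textbook VQ, not the released Forge code.

```python
def quantize(embedding, codebook):
    """Generic vector-quantization assignment: map a continuous
    instance embedding to the index of its nearest codebook entry
    (squared Euclidean distance). The resulting discrete indices act
    as tokens of a vocabulary over optimization instances."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda k: d2(embedding, codebook[k]))

# a two-entry codebook: the embedding near (1, 1) maps to code 1
codebook = [[0.0, 0.0], [1.0, 1.0]]
code = quantize([0.9, 0.8], codebook)
```

Because every instance reduces to a bag or sequence of such codes, instances from unseen distributions can be compared and clustered without re-running a solver.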

URL: https://openreview.net/forum?id=7Uo1yRWUpo

---

Title: Symbolic Recovery of PDEs from Measurement Data

Abstract: Models based on partial differential equations (PDEs) are powerful for describing a wide range of complex relationships in the natural sciences. Accurately identifying the PDE model, which represents the underlying physical law, is essential for a proper understanding of the problem. This reconstruction typically relies on indirect and noisy measurements of the system’s state and, without specifically tailored methods, rarely yields symbolic expressions, thereby hindering interpretability. In this work, we address this issue by considering existing neural network architectures based on rational functions for the symbolic representation of physical laws. These networks leverage the approximation power of rational functions while also benefiting from their flexibility in representing arithmetic operations. Our main contribution is an identifiability result, showing that, in the limit of noiseless, complete measurements, such symbolic networks can uniquely reconstruct the simplest physical law within the PDE model. Specifically, reconstructed laws remain expressible within the symbolic network architecture, with regularization-minimizing parameterizations promoting interpretability and sparsity in the case of $L^1$-regularization. In addition, we provide regularity results for symbolic networks. Empirical validation using the ParFam architecture supports these theoretical findings, providing evidence for the practical reconstructibility of physical laws.

URL: https://openreview.net/forum?id=TbHfgo10W3

---

Title: DualTune-GhostDP: A Unified Framework for Synergistic Differentially Private Fine-Tuning of Prompt-Based Large Language Models

Abstract: With growing concerns about data privacy and confidentiality, there has been increased attention on privacy-preserving integration in many applications, particularly data-driven ones like Large Language Models (LLMs). LLMs are powerful in-context learners and are widely adopted in real-world products. However, their dependence on sensitive private data in training and prompts exposes them to potential data leakage and privacy breaches. Differential Privacy (DP) delivers a rigorous, mathematically provable safeguard against these vulnerabilities; however, this assurance often comes with considerable reductions in model performance and increased computational cost. While prior work has highlighted the inherent trade-off between privacy and utility, our proposed method, DualTune-GhostDP, shows that strong privacy guarantees can be maintained under a controlled budget without sacrificing high model performance. Our method adopts a two-phase fine-tuning pipeline that integrates Ghost Clipping with an Edgeworth (EW) advanced privacy accountant, replacing conventional DP accounting mechanisms. Experimental results show that the principled integration of these components in DualTune-GhostDP consistently outperforms the individual benefits of each component, as well as both the single-phase Differentially Private Stochastic Gradient Descent (DP-SGD) baseline and a two-phase fine-tuning variant using standard clipping. Specifically, it achieves higher accuracy, faster convergence, and improved computational efficiency while maintaining differential privacy guarantees. In addition, we assess robustness to Membership Inference Attacks (MIA), which aim to determine whether a particular sample was used during training. Our findings demonstrate that DualTune-GhostDP substantially mitigates membership leakage across all training stages, strengthening both the privacy assurances and the overall stability of the approach against such attacks relative to existing baselines.

URL: https://openreview.net/forum?id=aJ74dIAQZH

---

Title: Guided Diffusion by Optimized Loss Functions on Relaxed Parameters for Inverse Material Design

Abstract: Inverse design problems are common in engineering and materials science. The forward direction, i.e., computing output quantities from design parameters, typically requires running a numerical simulation, such as the finite element method (FEM), as an intermediate step, which is an optimization problem by itself. In many scenarios, several design parameters can lead to the same or similar output values. For such cases, multi-modal probabilistic approaches are advantageous to obtain diverse solutions. A major difficulty in inverse design stems from the structure of the design space, since discrete parameters or further constraints disallow the direct use of gradient-based optimization. To tackle this problem, we propose a novel inverse design method based on diffusion models. Our approach relaxes the original design space into a continuous grid representation, where gradients can be computed by implicit differentiation in the forward simulation. A diffusion model is trained on this relaxed parameter space in order to serve as a prior for plausible relaxed designs. Parameters are sampled by guided diffusion using gradients that are propagated from an objective function specified at inference time through the differentiable simulation. A design sample is obtained by backprojection into the original parameter space. We develop our approach for a composite material design problem where the forward process is modeled as a linear FEM problem. We evaluate the performance of our approach in finding designs that match a specified bulk modulus. We demonstrate that our method can propose diverse designs within a 1% relative error margin from medium to high target bulk moduli in 2D and 3D settings. We also demonstrate that the material density of generated samples can be minimized simultaneously by using a multi-objective loss function.

URL: https://openreview.net/forum?id=JVUk7fT0Ll

---

Title: Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning

Abstract: We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.
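The order-statistics Monte Carlo estimator mentioned in the abstract can be sketched for the gains-only case: sort the sampled returns and weight each by the difference of distorted tail probabilities. The identity utility and the Tversky-Kahneman distortion below are standard textbook choices, not necessarily the paper's exact estimator (which also handles losses with a mirrored term).

```python
def tk_weight(p, gamma=0.61):
    # Tversky-Kahneman probability-distortion function
    return p ** gamma / ((p ** gamma + (1 - p) ** gamma) ** (1 / gamma))

def cpt_estimate(returns, utility=lambda x: x, w=tk_weight):
    """Order-statistics CPT estimator for nonnegative returns:
    with x_(1) <= ... <= x_(n), the i-th order statistic is weighted
    by w((n-i+1)/n) - w((n-i)/n), the distorted probability mass of
    its empirical tail. Gains-only sketch of the general estimator."""
    xs = sorted(returns)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        total += utility(x) * (w((n - i + 1) / n) - w((n - i) / n))
    return total

# sanity checks: with no distortion the estimator is the sample mean;
# with TK distortion, rare large gains are overweighted relative to it
mean_like = cpt_estimate([1.0, 2.0, 3.0], w=lambda p: p)
distorted = cpt_estimate([0.0, 10.0])
```

Setting `w` to the identity recovers the ordinary expected-return objective, which is the sense in which the CPT policy gradient generalizes the standard one.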

URL: https://openreview.net/forum?id=6iGrtiSCR7

---

Title: Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion

Abstract: Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we show that MaskGIT is asymptotically equivalent to a choose-then-sample (CTS) formulation, instantiated as the “moment sampler,” which explicitly separates index selection from token sampling. This CTS reformulation is essential: it yields unbiased token sampling and exposes an algorithmic design space for index selection, both of which are inaccessible in MaskGIT’s original formulation. Regarding token sampling, we reveal that MaskGIT implicitly adopts a low-temperature sampler, which explains why MaskGIT often degrades with more sampling steps. The CTS reformulation of MaskGIT allows us to correct the temperature sampling to ensure unbiasedness. We also improve the index selection in CTS through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains demonstrate our theory as well as the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
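The choose-then-sample separation can be shown in miniature: step 1 picks which masked position to reveal (here by maximum confidence, one point in the design space the abstract describes), and step 2 samples that position's token from its unmodified distribution. A toy sketch, not the paper's moment sampler.

```python
import random

def choose_then_sample(probs_per_index, rng=None):
    """Choose-then-sample (CTS) in miniature.
    Step 1 (choose): pick the masked position whose predicted
    distribution is most confident. Step 2 (sample): draw the token
    from the unmodified distribution at that position, keeping token
    sampling unbiased (MaskGIT instead implicitly lowers the
    temperature by coupling the two steps)."""
    rng = rng or random.Random(0)
    idx = max(range(len(probs_per_index)), key=lambda i: max(probs_per_index[i]))
    p = probs_per_index[idx]
    token = rng.choices(range(len(p)), weights=p, k=1)[0]
    return idx, token

# position 1 is more confident, so it is unmasked first;
# its token is still a fair draw from [0.9, 0.1]
dists = [[0.5, 0.5], [0.9, 0.1]]
idx, token = choose_then_sample(dists)
```

Because only the index rule is heuristic, alternatives such as the paper's caching-based or exploration-exploitation selection can be swapped into step 1 without biasing the tokens produced in step 2.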

URL: https://openreview.net/forum?id=mKlW68i2Ig

---

Title: Toward Greater Autonomy in Materials Discovery Agents: Unifying Planning, Physics, and Scientists

Abstract: We aim to design language agents with greater autonomy for crystal materials discovery. While most existing studies restrict agents to performing specific tasks within predefined workflows, we aim to automate workflow planning given high-level goals and scientist intuition. To this end, we propose the Materials Agent unifying Planning, Physics, and Scientists, known as MAPPS. MAPPS consists of a Workflow Planner, a Tool Code Generator, and a Scientific Mediator. The Workflow Planner uses large language models (LLMs) to generate structured, multi-step workflows. The Tool Code Generator synthesizes executable Python code for various tasks, including invoking a force field foundation model that encodes physics. The Scientific Mediator coordinates communications, facilitates scientist feedback, and ensures robustness through error reflection and recovery. By unifying planning, physics, and scientists, MAPPS enables flexible and reliable materials discovery with greater autonomy, achieving a five-fold improvement in stability, uniqueness, and novelty rates compared with prior generative models when evaluated on the MP-20 dataset. We provide extensive experiments across diverse tasks to show that MAPPS is a promising framework for autonomous materials discovery.

URL: https://openreview.net/forum?id=Cwq1U8tbWW

---

Title: Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

Abstract: Neural scaling laws -- power-law relationships between loss, model size, and data -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit a parametric scaling law to the validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ~ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.
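The quoted entropy figure is numerically consistent with a Gaussian reading of the MSE floor: the differential entropy of a Gaussian with variance c is (1/2) log2(2 pi e c) bits. Assuming that is the conversion behind the abstract's "preliminary conversion" (the abstract does not state the formula), c ~ 1.44 gives roughly 2.31 bits per masked gene position.

```python
import math

def gaussian_entropy_bits(mse_floor):
    """Differential entropy, in bits, of a Gaussian whose variance
    equals the irreducible MSE floor: 0.5 * log2(2*pi*e*c).
    Assumed, not confirmed, to be the conversion used in the paper."""
    return 0.5 * math.log2(2 * math.pi * math.e * mse_floor)

bits = gaussian_entropy_bits(1.44)  # c ~ 1.44 from the data-rich regime
```

Under this reading, refining the entropy estimate amounts to refining the fitted floor c and the Gaussianity assumption on the residual reconstruction error.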

URL: https://openreview.net/forum?id=a8rUQqionr

---

Title: Time-Aware Prior Fitted Networks for Zero-Shot Forecasting with Exogenous Variables

Abstract: In many time series forecasting settings, the target time series is accompanied by exogenous covariates, such as promotions and prices in retail demand; temperature in energy load; calendar and holiday indicators for traffic or sales; and grid load or fuel costs in electricity pricing. Ignoring these exogenous signals can substantially degrade forecasting accuracy, particularly when they drive spikes, discontinuities, or regime and phase changes in the target series. Most current time series foundation models (e.g., Chronos, Sundial, TimesFM, TimeMoE, TimeLLM, and LagLlama) ignore exogenous covariates and make forecasts solely from the numerical time series history, thereby limiting their performance. In this paper, we develop ApolloPFN, a prior-data fitted network (PFN) that is time-aware (unlike prior PFNs) and that natively incorporates exogenous covariates (unlike prior univariate forecasters). Our design introduces two major advances: (i) a synthetic data generation procedure tailored to resolve the failure modes that arise when tabular (non-temporal) PFNs are applied to time series; and (ii) time-aware architectural modifications that embed inductive biases needed to exploit the time series context. We demonstrate that ApolloPFN achieves state-of-the-art results across benchmarks, such as M5 and electric price forecasting, that contain exogenous information.

URL: https://openreview.net/forum?id=nJARpxp3cF

---

Title: Automata Learning from Recurrent Networks: A Critical Synthesis for Verification, Testing, and Interpretability

Abstract: Recurrent Neural Networks (RNNs) have demonstrated their effectiveness in modeling sequential data and are a key building block of modern deep learning architectures. In this review paper, we study recurrent networks through the lens of automata theory. Given an RNN, automata learning seeks to model its behavior with an automaton, which enables better interpretability and eases our understanding of its working mechanisms. We begin by examining the theoretical foundations of this approach, showing how it can be applied to learn automata from various types of recurrent nets, including the Elman Recurrent Network (ERN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). Next, we review the applications of this approach in formal verification, model-based testing, and the interpretability of these deep learning models. We finish with a discussion on the advantages and critical problems of this method, while outlining key goals for future research, such as defining standard benchmarks and identifying limitations that need to be addressed to advance this field further.

URL: https://openreview.net/forum?id=R52ETbUBVo

---

Title: Diff-ID: Identity Consistent Facial Image Generation and Morphing via Diffusion Models

Abstract: Generative diffusion models have revolutionized facial image synthesis, yet robust identity preservation in high-resolution outputs remains a critical challenge. This issue is especially vital for security systems, biometric authentication, and privacy-sensitive applications, where any drift in identity integrity can undermine trust and functionality. We introduce Diff-ID, a diffusion-based framework that enforces identity consistency while delivering photorealistic quality. Central to our approach is a custom 210K-image dataset synthesized from CelebA-HQ, FFHQ, and LAION-Face and captioned via a fine-tuned BLIP model to bolster identity awareness during training. Diff-ID integrates ArcFace and CLIP embeddings through a dual cross-attention adapter within a fine-tuned Stable Diffusion UNet. To further reinforce identity fidelity, we propose a pseudo-discriminator loss based on ArcFace cosine similarity with exponential timestep weighting. Experiments on held-out and unseen faces demonstrate that Diff-ID outperforms state-of-the-art methods in both identity retention and visual realism. Finally, we showcase a unified DDIM-based morphing pipeline that enables seamless facial interpolation without per-identity fine-tuning. We further argue that identity preservation and photorealism should be evaluated jointly rather than in isolation, as high identity similarity alone does not guarantee realistic outputs. To this end, we introduce a unified evaluation metric that combines identity similarity and perceptual realism into a single interpretable score.

URL: https://openreview.net/forum?id=JgoFMxqnYc

---

Title: Adaptive Budget Allocation for Orthogonal Subspace Adapter Tuning in LLMs Continual Learning

Abstract: Large language models (LLMs) often suffer from catastrophic forgetting in continual learning (CL) scenarios, where performance on previously learned tasks degrades severely while training on sequentially arriving tasks. Although pioneering CL approaches using orthogonal subspaces can mitigate task interference, they typically employ fixed budget allocation, neglecting the varying complexity across tasks and layers. Moreover, recent budget-adaptive tuning methods for LLMs often adopt multi-stage paradigms that decouple optimization and budget allocation. Such decoupling results in potential misalignment, which hinders those approaches' practical application in CL scenarios. To address these limitations, we propose OA-Adapter, a novel parameter-efficient approach for continual learning in LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in a single end-to-end training stage. Specifically, OA-Adapter introduces a dynamic bottleneck dimension adaptation mechanism that simultaneously allocates an efficient parameter budget and optimizes task objectives without misalignment. To effectively preserve previously acquired knowledge while coordinating with the dynamic budget allocation, orthogonal constraints are applied between the parameter subspace of the current task and the dynamically allocated parameter subspaces of historical tasks. Experimental results on continual learning benchmarks demonstrate that OA-Adapter outperforms state-of-the-art methods in both accuracy and parameter efficiency: it achieves higher average accuracy while using \(58.5\%\) fewer parameters on the Standard CL Benchmark, and maintains its advantages on two larger benchmarks comprising 15 tasks.
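As a rough illustration of the orthogonal constraint between the current task's subspace and historical task subspaces described in the abstract, a squared-Frobenius-norm penalty (our assumption; the paper's exact formulation may differ) can be sketched as:

```python
import numpy as np

def orthogonality_penalty(current_basis, historical_bases):
    """Hypothetical sketch: penalize overlap between the current task's
    adapter subspace and the frozen subspaces of earlier tasks via the
    squared Frobenius norm of their cross-products."""
    penalty = 0.0
    for past in historical_bases:
        penalty += np.sum((current_basis.T @ past) ** 2)
    return penalty

rng = np.random.default_rng(0)
d = 32
# Frozen subspace from an earlier task (orthonormal columns).
past, _ = np.linalg.qr(rng.standard_normal((d, 4)))
# A current-task basis built in the orthogonal complement incurs ~zero penalty.
full, _ = np.linalg.qr(rng.standard_normal((d, d)))
proj = full - past @ (past.T @ full)      # project out the old subspace
current, _ = np.linalg.qr(proj[:, :4])
```

A basis inside the old subspace would instead incur the maximal penalty (here $\|I_4\|_F^2 = 4$), so minimizing this term steers the new task away from directions that would overwrite old knowledge.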

URL: https://openreview.net/forum?id=LNGDlLOdex

---

Title: OmegAMP: Targeted AMP Discovery via Biologically Informed Generation

Abstract: Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as limited controllability, lack of representations that efficiently model antimicrobial properties, and low experimental hit rates. To address these challenges, we introduce OmegAMP, a framework designed for reliable AMP generation with increased controllability. Its diffusion-based generative model leverages a novel conditioning mechanism to achieve fine-grained control over desired physicochemical properties and to direct generation towards specific activity profiles, including species-specific effectiveness. This is further enhanced by a biologically informed encoding space that significantly improves overall generative performance. Complementing these generative capabilities, OmegAMP leverages a novel synthetic data augmentation strategy to train classifiers for AMP filtering, drastically reducing false positive rates and thereby increasing the likelihood of experimental success. Our in silico experiments demonstrate that OmegAMP delivers state-of-the-art performance across key stages of the AMP discovery pipeline, enabling an unprecedented success rate in in vitro experiments: of the 25 candidate peptides we tested, 24 (96%) demonstrated antimicrobial activity, proving effective even against multi-drug-resistant strains. Our findings underscore OmegAMP's potential to significantly advance computational frameworks in the fight against antimicrobial resistance.

URL: https://openreview.net/forum?id=hAq3XLZ9ex

---

Title: Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

Abstract: Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. However, existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations -- for example, selecting which trajectory better describes an event in the image. In this work, we introduce Point-It-Out, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding. We propose a hierarchical evaluation protocol spanning three stages (S1: referred-object localization, S2: task-driven pointing, and S3: visual trace prediction), with data collected from critical domains for embodied intelligence, including indoor, kitchen, driving, and robotic manipulation scenarios. Extensive experiments with over ten state-of-the-art VLMs reveal several interesting findings. For example, strong general-purpose models such as GPT-4o, while excelling on many benchmarks (e.g., language, perception, and reasoning), underperform compared to some open-source models in precise visual grounding; models such as MoLMO perform well in S1 and S2 but struggle in S3, which requires grounding combined with visual trace planning.

URL: https://openreview.net/forum?id=9e0hRhFsal

---

Title: Blind Inverse Game Theory: Jointly Decoding Rewards and Rationality in Entropy-Regularized Competitive Games

Abstract: Inverse Game Theory (IGT) methods based on the entropy-regularized Quantal Response Equilibrium (QRE) offer a tractable approach for competitive settings, but critically assume the agents' rationality parameter (temperature $\tau$) is known a priori. When $\tau$ is unknown, a fundamental scale ambiguity emerges that couples $\tau$ with the reward parameters ($\theta$), making them statistically unidentifiable. We introduce Blind-IGT, the first statistical framework to jointly recover both $\theta$ and $\tau$ from observed behavior. We analyze this bilinear inverse problem and establish necessary and sufficient conditions for unique identification by introducing a normalization constraint that resolves the scale ambiguity. We propose an efficient Normalized Least Squares (NLS) estimator and prove it achieves the optimal $\mathcal{O}(N^{-1/2})$ convergence rate for joint parameter recovery. When strong identifiability conditions fail, we provide partial identification guarantees through confidence set construction. We extend our framework to Markov games and demonstrate optimal convergence rates with strong empirical performance even when transition dynamics are unknown.
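The scale ambiguity at the heart of the abstract is easy to see in code: a quantal-response (softmax) policy is unchanged when rewards and temperature are scaled by the same constant, which is why a normalization constraint is needed to pin down $(\theta, \tau)$. The linear-in-features parameterization below is our illustrative assumption, not necessarily the paper's exact form.

```python
import numpy as np

def qre_policy(theta, tau, features):
    """Quantal-response policy: softmax of feature-based rewards divided
    by the rationality temperature tau (illustrative parameterization)."""
    logits = features @ theta / tau
    z = np.exp(logits - logits.max())   # stabilized softmax
    return z / z.sum()

rng = np.random.default_rng(1)
features = rng.standard_normal((5, 3))   # 5 actions, 3 reward features
theta, tau = rng.standard_normal(3), 0.7

# Scale ambiguity: (c*theta, c*tau) induces identical behavior.
c = 3.0
p1 = qre_policy(theta, tau, features)
p2 = qre_policy(c * theta, c * tau, features)

# A normalization constraint (e.g., unit-norm theta) resolves the ambiguity:
# only one representative per equivalence class satisfies it.
theta_hat = theta / np.linalg.norm(theta)
```

Since `p1` and `p2` coincide, no amount of behavioral data can separate $\theta$ from $c\theta$ without such a constraint.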

URL: https://openreview.net/forum?id=tHG9WmaUzR

---

Title: Uncertainty and Scale-Calibrated Contrastive Federated Segmentation under Client Heterogeneity

Abstract: Federated learning presents a promising approach for medical image segmentation, particularly in addressing data privacy concerns. However, it faces significant challenges due to data heterogeneity across participating clients. This heterogeneity introduces variations in data scales and distributions, making it difficult to balance spatial accuracy and feature similarity when managing multidimensional heterogeneous data. To address these challenges, we propose a novel \textbf{Uncertainty- and Scale-Calibrated Contrastive Federated Segmentation under Client Heterogeneity (SAFCF)} framework with two key components: (i) an \textbf{uncertainty-driven dynamic scale-adaptive weighted aggregation (DSWA)} method, which balances the influence of local client data scales and reduces model drift caused by data heterogeneity through the use of epistemic uncertainty in weighted aggregation, and (ii) a \textbf{contrastive federated segmentation loss (CFSL)}, a local loss function that effectively balances spatial accuracy and feature similarity at the pixel level of an image by combining modified Dice loss with improved contrastive loss. Additionally, an epistemic uncertainty layer learns weight distributions to introduce uncertainty, further improving model robustness and enabling adaptive learning from diverse data during training. Our framework demonstrates substantial improvements on standard benchmark medical image segmentation datasets, especially under highly non-IID conditions, when compared to traditional algorithms.

URL: https://openreview.net/forum?id=avrvLzsebU

---

Title: Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization

Abstract: While discrete tokenizers are suspected to inherently limit sample diversity in token-based generative models, we show this diversity gap is not caused by discretization itself, but rooted in the timing of quantization. In this study, we systematically identify quantization in the initial stage as the primary catalyst for a representational misalignment, where the codebook prematurely shrinks into a narrow latent manifold. This initial shrinkage prevents the codebook from capturing the diverse embedding space of the encoder. Though this may yield deceptively strong reconstructions, it creates a bottleneck that forces the generator to rely on a homogenized set of tokens. Ultimately, the codebook’s failure to anchor to robust representations at the onset of training impairs generative variety and limits sample diversity. To address this, we propose Deferred Quantization, a simple yet effective strategy that introduces a separate continuous learning phase. By allowing the encoder to first establish a well-distributed representation space before introducing discretization, the codebook can effectively anchor to a mature and diverse latent landscape. Across tokenizers and token-based generators, Deferred Quantization consistently reduces shrinkage, improves generative diversity, and preserves reconstruction and compression. We additionally provide a shrinkage diagnostic suite and offer practical guidance for designing diversity-preserving discrete tokenizers.
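The Deferred Quantization idea above, a continuous warm-up phase before discretization kicks in, can be caricatured with a toy vector-quantization layer. The class name, warm-up flag, and running-mean codebook update below are all our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class DeferredVQ:
    """Toy sketch of deferred quantization: during the first `warmup`
    steps the latent passes through continuously while the codebook
    tracks encoder outputs; only afterwards are latents snapped to
    their nearest code."""

    def __init__(self, num_codes, dim, warmup, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.standard_normal((num_codes, dim))
        self.warmup, self.lr, self.step = warmup, lr, 0

    def __call__(self, z):
        self.step += 1
        # Nearest code by Euclidean distance.
        idx = np.argmin(np.linalg.norm(self.codebook - z, axis=1))
        # Move the nearest code toward the encoder output (running mean).
        self.codebook[idx] += self.lr * (z - self.codebook[idx])
        if self.step <= self.warmup:
            return z            # continuous phase: no discretization yet
        return self.codebook[idx]
```

The point of the warm-up is that the codebook anchors to a representation space the encoder has already spread out, rather than collapsing onto whatever narrow manifold exists at initialization.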

URL: https://openreview.net/forum?id=D28WzXqIrK

---

Title: ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning

Abstract: Protein generative models have shown remarkable promise in protein design, yet their success rates remain constrained by reliance on curated sequence-structure datasets and by misalignment between supervised objectives and real design goals. We present ProteinZero, an online reinforcement learning framework for inverse folding models that enables scalable, automated, and continuous self-improvement with computationally efficient feedback. ProteinZero employs a reward pipeline that combines structural guidance from ESMFold with a novel self-derived ddG predictor, providing stable multi-objective signals while avoiding the prohibitive cost of physics-based methods. To ensure robustness in online RL, we further introduce a novel embedding-level diversity regularizer that mitigates mode collapse and promotes functionally meaningful sequence variation. Within a general RL formulation balancing multi-reward optimization, KL-divergence from a reference model, and diversity regularization, ProteinZero achieves robust improvements across designability, stability, recovery, and diversity. On the CATH-4.3 benchmark, it consistently outperforms state-of-the-art baselines including ProteinMPNN, ESM-IF, and InstructPLM, reducing design failure rates by 36-48% and achieving success rates above 90% across diverse folds. Importantly, a complete RL run can be executed on a single 8$\times$GPU node within three days, including reward computation and data generation. These results indicate that efficient online RL fine-tuning can complement supervised pretraining by allowing protein generative models to evolve continuously from their own outputs and optimize multiple design objectives without labeled data, opening new possibilities for exploring the vast protein design space. Sample designed sequences are provided in the supplementary material, and full source code and model checkpoints will be released upon publication.

URL: https://openreview.net/forum?id=pVvm0CQAr6

---

Title: The Expanded Othello AI Arena: Evaluating Intelligent Systems Through Constrained Adaptation to Unseen Conditions

Abstract: The ability to rapidly adapt to environmental changes is a core requirement for Artificial General Intelligence (AGI), yet most AI benchmarks evaluate performance in static environments. We present the Expanded Othello AI Arena, a benchmark designed to measure Skill-Acquisition Efficiency — the rate at which agents discover latent objectives and converge to effective strategies within a limited interaction budget. The Arena formalizes a spectrum of 56 environments using a parametric framework $\mathcal{E} = (\mathcal{L}, \mathcal{C})$, where $\mathcal{L}$ defines Othello board geometries and $\mathcal{C}$ represents latent winning conditions via a disc-ratio threshold $K$. This parameterization requires agents to decipher terminal rules through direct interaction while simultaneously interpreting the opponent's behavior — in narrow regimes, agents must strategically induce the opponent into violating the hidden threshold to secure victory. Unlike traditional evaluation, the Arena imposes a strict 2,000-game interaction budget to prioritize sample efficiency over asymptotic optimization. We establish the benchmark's utility through a neuroevolutionary adaptive-Minimax baseline that utilizes meta-learned spatial priors and adaptive weighting. Our empirical analysis reveals that while this baseline achieves competitive performance in standard and inverse regimes, it fails in narrow-interval regimes that demand adversarial inducement, exposing a substantial efficiency gap that gradient-based reinforcement learning cannot bridge even with five times the interaction budget. Released as an extensible Python-based research toolkit, the Arena provides a standardized platform for exploring research directions including test-time learning, in-context learning, and world models. The code is available at: \url{https://anonymous.4open.science/r/ExpandedOthello/}.

URL: https://openreview.net/forum?id=WXKQtqPC2d

---

Title: SafeFix: Targeted Model Repair via Controlled Image Generation

Abstract: Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. While existing debugging frameworks can identify these failure slices, effectively repairing them remains difficult. Current solutions often rely on manually designed prompts to generate synthetic images—an approach that introduces distribution shift and semantic errors, often resulting in new bugs. To address these issues, we introduce SafeFix, a framework for distribution-consistent model repair via controlled generation that employs a diffusion model to generate semantically faithful images that modify only specific failure attributes while preserving the underlying data distribution. To ensure the reliability of the repair data, we implement a verification mechanism using a large vision-language model (LVLM) to enforce semantic consistency and label preservation. By retraining models on the synthetic data, we significantly reduce errors in rare cases and improve overall performance. Our experiments show that SafeFix achieves superior robustness by maintaining high precision in attribute editing without introducing additional bugs.

URL: https://openreview.net/forum?id=TtpW6JiEiW

---

Title: Attention-Head Binding as a Term-Conditioned Mechanistic Marker of Accessibility Concept Emergence in Language Models

Abstract: Assessing when language models develop specific capabilities remains challenging, as behavioral evaluations are expensive and internal representations are opaque. We introduce attention-head binding ($EB^*$), a lightweight mechanistic metric that tracks how attention heads bind multi-token technical terms, such as accessibility concepts (“screen reader,” “alt text”), into coherent units during training. Using Pythia models (160M, 1B, 2.8B) across eight checkpoints, we report four empirical findings (C1, C3, C4, C5). At 160M and 2.8B, binding precedes behavioral competence (Spearman $r = 0.33$–$0.34$, $p < 0.001$), serving as an early warning signal (C1). At 1B, we observe a decoupling effect: binding saturates early, while behavior continues to improve, revealing divergent developmental trajectories (C4). High-binding/mid-accuracy checkpoints contain unlockable latent knowledge: few-shot prompting yields up to $+61$ percentage points improvement ($183\%$ relative gain) and near-ceiling generation scores ($94.4\%$) from low zero-shot baselines (C3). Causal ablation reveals opposite mechanistic regimes across scales: high-binding heads are necessary at 160M (ablation impairs recognition by $-16.7$ percentage points [pp]) but functionally superseded at 2.8B (ablation improves recognition by $+33.3$ pp). This provides direct evidence for the decoupling phenomenon (C5). These findings not only establish attention binding as a diagnostic for concept emergence but also demonstrate that mechanistic structure and behavioral competence undergo qualitative transformation across model scales — a phenomenon we term the binding–behavior decoupling effect. Code: available in the supplementary material.

URL: https://openreview.net/forum?id=QG7mfCy9mu

---

Title: The Learnability of the Multiplayer Adversarial Bandit Problem

Abstract: Following the work of \cite{chang2022online}, we consider information-asymmetric multiplayer adversarial bandits, where at each round the players each have their own set of actions to select from, contributing to a joint action. The players are not allowed to communicate during learning; however, they are allowed to agree on a strategy beforehand. We show that when the players pull simultaneously, there always exists an adaptive adversary that can incur linear regret. We then modify the setting so that the pulls are successive instead, and show near-optimal regret bounds in this case, both with and without reward asymmetry.

URL: https://openreview.net/forum?id=dfh1d2q2Z3

---

Title: Tight Regret Bounds in Multi-Armed Bandits with Heterogeneous Variances

Abstract: We study stochastic multi-armed bandits with heterogeneous reward variances. In the known-variance setting, we propose a variance-aware MOSS algorithm that achieves minimax-optimal regret, matching an information-theoretic lower bound up to constants. For the unknown-variance case, we construct high-probability variance upper confidence bounds and show that the resulting algorithm attains the same minimax rate up to a logarithmic factor. Our analysis establishes sharp worst-case guarantees that explicitly capture the variance structure of the problem.
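To give a feel for variance-aware optimism, here is a UCB-V-style index (in the spirit of Audibert et al.), not the paper's MOSS-type index, whose exact log term differs: arms with small empirical variance receive a tighter exploration bonus, which is what lets heterogeneous variances improve regret.

```python
import math

def variance_aware_index(mean, var, n, t, c=1.0):
    """Illustrative UCB-V-style optimistic index: empirical mean plus a
    bonus whose leading term scales with the (empirical or known)
    variance `var`, plus a lower-order range-dependent correction.
    `n` is the arm's pull count and `t` the current round."""
    bonus = math.sqrt(2.0 * var * math.log(t) / n) + 3.0 * c * math.log(t) / n
    return mean + bonus
```

At each round the learner would pull the arm maximizing this index; a low-variance arm with the same mean and pull count is explored less aggressively than a high-variance one.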

URL: https://openreview.net/forum?id=xnF9pViAZw

---

Title: Multiplayer Combinatorial Bandits Under Information Asymmetry

Abstract: In this paper, we extend linear combinatorial bandits \cite{gai2012combinatorial} to a multiplayer setting with information asymmetry \cite{chang2022online, chang2024optimal}, where each player controls an arm and independently decides whether to pull it, with coordination allowed only before rounds begin. We analyze three scenarios: action asymmetry (players cannot observe others' actions but receive identical rewards per iteration), reward asymmetry (players observe actions but receive private i.i.d.\ rewards), and combined asymmetry. We derive near-optimal, gap-independent regret bounds for all scenarios: for action or reward asymmetry, we achieve $\tilde{\mathcal{O}}(\sqrt{T})$, which significantly improves upon \cite{gai2010learning}; for combined action and reward asymmetry, we achieve near-optimal bounds similar to those of \cite{chang2022online}. We finally generalize our results to settings where players decide either not to pull or to pull one out of multiple arms, and achieve similar bounds in similar settings as above.

URL: https://openreview.net/forum?id=sspckudDFX

---

Title: CentroidKV: Efficient Long-Context LLM Inference via KV Cache Clustering

Abstract: Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce \textit{CentroidKV}, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose \textit{Chunked Soft Matching}, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. CentroidKV then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that CentroidKV achieves up to 75\% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, CentroidKV accelerates the decoding stage of inference by up to 3.19$\times$ and reduces end-to-end latency by up to 2.72$\times$.
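The core merging step, collapsing similar KV entries within a chunk into centroids, can be sketched with a greedy cosine-similarity pass. Note that the paper's Chunked Soft Matching uses an alternating-partition scheme; the greedy loop, threshold, and running-mean merge below are our simplified illustration.

```python
import numpy as np

def cluster_chunk(keys, values, threshold=0.9):
    """Illustrative KV-cache compression for one chunk: an entry whose
    key is cosine-similar to an existing centroid is averaged into it;
    otherwise it starts a new centroid."""
    cent_k, cent_v, counts = [], [], []
    for k, v in zip(keys, values):
        kn = k / np.linalg.norm(k)
        merged = False
        for i, ck in enumerate(cent_k):
            if kn @ (ck / np.linalg.norm(ck)) > threshold:
                counts[i] += 1
                cent_k[i] += (k - cent_k[i]) / counts[i]   # running mean
                cent_v[i] += (v - cent_v[i]) / counts[i]
                merged = True
                break
        if not merged:
            cent_k.append(k.astype(float).copy())
            cent_v.append(v.astype(float).copy())
            counts.append(1)
    return np.array(cent_k), np.array(cent_v)
```

Because key states are highly similar along the sequence dimension, many entries collapse into few centroids, which is the source of the reported memory reduction.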

URL: https://openreview.net/forum?id=T3EeupQhGj

---

Title: Replicability is Asymptotically Free in Multi-armed Bandits

Abstract: We consider a replicable stochastic multi-armed bandit algorithm that ensures, with high probability, that the algorithm's sequence of actions is not affected by the randomness inherent in the dataset. Replicability allows third parties to reproduce published findings and assists the original researcher in applying standard statistical tests. We observe that existing algorithms require $O(K^2/\rho^2)$ times more regret than nonreplicable algorithms, where $K$ is the number of arms and $\rho$ is the level of nonreplication. However, we demonstrate that this additional cost is unnecessary when the time horizon $T$ is sufficiently large for a given $K, \rho$, provided that the magnitude of the confidence bounds is chosen carefully. Therefore, for a large $T$, our algorithm incurs a $K^2/\rho^2$-times smaller amount of exploration than existing algorithms. To ensure the replicability of the proposed algorithms, we incorporate randomness into their decision-making processes. We propose a principled approach to limiting the probability of nonreplication. This approach elucidates the steps that existing research has implicitly followed. Furthermore, we derive the first lower bound for the two-armed replicable bandit problem, which implies the optimality of the proposed algorithms up to a $\log\log T$ factor for the two-armed case.

URL: https://openreview.net/forum?id=E8rmbq8BYP

---

Title: A Mechanistic Study of Transformers Training Dynamics

Abstract: Large-scale pretraining of transformers has been central to the success of foundation models. However, the scale of those models limits our understanding of the mechanisms at play during optimization. In this work, we study the training dynamics of transformers in a controlled and interpretable setting. On the sparse modular addition task, we demonstrate that specialized attention circuits, called clustering heads, can be implemented during gradient descent to solve the problem. Our experiments show that such pathways naturally emerge during training. By monitoring the evolution of tokens via a visual sandbox, we uncover a two-stage learning process and the occurrence of loss spikes due to the high curvature of normalization layers. Our findings provide several insights into patterns observed in more practical settings, such as the pretraining of large language models.

URL: https://openreview.net/forum?id=aHbZx0bckL

---

Title: LISA: Latent Inference on the Sphere for A-posteriori sampling

Abstract: Deterministic $\textit{autoencoders}$ (AEs) train stably and reconstruct sharply, but unconditional generation is typically delegated to an ex-post latent density fit (e.g., MVG/GMM) followed by decoding sampled codes. We argue that this pipeline is brittle because reconstruction training does not define a well-conditioned sampling interface in latent space. We support this with two diagnostics: $\textbf{(i)}$ deterministic decoders can be strongly sensitive to latent radius, so decoding can be poorly conditioned along an otherwise unconstrained radial degree of freedom, an effect we quantify via a controlled radial-scaling test in latent space. This implies that any ex-post sampler that perturbs radius can provoke large output changes in simple Euclidean AEs. $\textbf{(ii)}$ The learned latent representations can exhibit strong directional concentration, revealed by a spiky directional second-moment spectrum (low effective rank / large leading eigenvalue), making simple density models poorly calibrated for sampling. Motivated by these failure modes, we propose $\textit{Latent Inference on the Sphere for A-posteriori sampling}$ (LISA), a deterministic autoencoder that projects codes onto the hypersphere to remove latent scale and adds a repulsive hyperspherical kernel energy regularizer to promote directional coverage. Combined with projected MVG/GMM samplers, LISA yields a simple plug-in prior for unconditional generation. Across standard image benchmarks and structured generation tasks, this approach achieves lower $\textit{Fréchet inception distance}$ (FID) among a broad suite of autoencoder baselines, without stochastic encoders, adversarial training, or learned priors, highlighting latent-geometry conditioning as an effective design principle for deterministic generative modelling. We also introduce $\textit{Grammar-LISA}$ for structured data (arithmetic expressions), demonstrating that the approach extends beyond image data.
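The two ingredients LISA adds to a plain AE, sphere projection to remove the radial degree of freedom and a repulsive kernel energy for directional coverage, are easy to sketch. The Gaussian kernel form and bandwidth below are our illustrative assumptions, not the paper's exact regularizer.

```python
import numpy as np

def project_to_sphere(z):
    """Remove latent scale by normalizing each code to the unit sphere."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def repulsive_energy(z, t=2.0):
    """Pairwise Gaussian-kernel energy on the sphere; minimizing it
    pushes codes apart, promoting directional coverage (illustrative
    kernel choice)."""
    zs = project_to_sphere(z)
    sq = ((zs[:, None, :] - zs[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    n = len(zs)
    mask = ~np.eye(n, dtype=bool)                          # exclude self-pairs
    return np.exp(-t * sq[mask]).mean()

clustered = np.array([[1.0, 0.0], [0.99, 0.10]])  # nearly collinear codes
spread = np.array([[1.0, 0.0], [0.0, 1.0]])       # well-separated directions
```

Well-spread codes achieve lower energy than directionally concentrated ones, so adding this term to the reconstruction loss counteracts the spiky second-moment spectrum the diagnostics reveal.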

URL: https://openreview.net/forum?id=f0agX08y69

---

Title: Balanced Twins: Causal Inference on Time Series with Hidden Confounding

Abstract: Accurately estimating treatment effects in time series is essential for evaluating interventions in real-world applications, especially when treatment assignment is biased by unobserved factors. In many practical settings, interventions are adopted at different times across individuals, leading to staggered treatment exposure and heterogeneous pre-treatment histories. In such cases, aggregating outcome trajectories across treated units is ill-defined, making individual treatment effect (ITE) estimation a prerequisite for reliable causal inference. We therefore study the problem of estimating the average treatment effect for the treated (ATT) by first recovering individual-level counterfactuals.
We introduce a neural framework that learns simultaneously low-dimensional latent representations of individual time series and propensity scores. These estimates are then used to approximate the individual treatment effects through a flexible matching procedure that avoids classical convexity constraints commonly used in synthetic control methods. By operating at the individual level, our approach naturally accommodates staggered interventions and improves counterfactual estimation under latent bias, without relying on explicit temporal modeling assumptions.
We illustrate our approach on both real-world energy consumption data and clinical time series, including high-frequency electricity demand-response programs and semi-synthetic data for individuals in intensive care unit (ICU), where hidden confounding, staggered treatment adoption, and non-stationary dynamics are prevalent.

URL: https://openreview.net/forum?id=3AwFgQHOPj

---

Title: From Offline to Online Memory-Free and Task-Free Continual Learning via Fine-Grained Hypergradients

Abstract: Continual Learning (CL) aims to learn from a non-stationary data stream where the underlying distribution changes over time. While recent advances have produced efficient memory-free methods in the offline CL (offCL) setting, online CL (onCL) remains dominated by memory-based approaches. The transition from offCL to onCL is challenging, as many offline methods rely on (1) prior knowledge of task boundaries and (2) sophisticated scheduling or optimization schemes, both of which are unavailable when data arrives sequentially and can be seen only once. In this paper, we investigate the adaptation of state-of-the-art memory-free offCL methods to the online setting. We first show that augmenting these methods with lightweight prototypes significantly improves performance, albeit at the cost of increased Gradient Imbalance, resulting in learning that is biased towards earlier tasks. To address this issue, we formulate Fine-Grained Hypergradients as an online mechanism for rebalancing gradient updates during training. Our experiments demonstrate that the synergy between prototype memory and hypergradient reweighting allows for substantially improved performance of memory-free methods in onCL. Code will be released upon acceptance.

URL: https://openreview.net/forum?id=Tu9rHiczLm

---

Title: Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

Abstract: The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

URL: https://openreview.net/forum?id=ZEmv4DhaGL

---

Title: A Unified Framework with Environmental and Interaction Uncertainty for Robust Multi-Agent Reinforcement Learning

Abstract: Multi-agent reinforcement learning (MARL) has achieved remarkable success across diverse domains, yet its robustness remains hindered by various inherent uncertainties arising from multi-agent systems. Although previous studies have explored robustness in MARL, most of them focus on a single type of uncertainty, without a unified framework to handle multiple sources simultaneously. As a result, their methods often fail to remain robust when exposed to diverse and interacting disturbances. To address this limitation, we propose a unified framework that explicitly models two complementary sources of uncertainty: environmental uncertainty, caused by stochastic dynamics, and interaction uncertainty, arising from the unpredictable behaviors of other agents. We capture these factors using hierarchical entropy-based uncertainty sets, which are then integrated into the robust Markov game formulation. This hierarchical design enables the framework to distinguish the distinct impacts of each uncertainty source while avoiding the excessive conservatism of treating them as a single unified set. On top of this formulation, we introduce the solution concept of an Aleatoric Robust Equilibrium (ARE), where each agent optimizes its policy against worst-case scenarios derived from the hierarchical sets. To compute the ARE, we develop specialized actor–critic algorithms with theoretical convergence guarantees. Extensive experiments in both the multi-agent particle environment (MPE) and the multi-agent MuJoCo benchmark show that our approach achieves consistently superior robustness and performance across a wide range of uncertainty settings.

URL: https://openreview.net/forum?id=DMllImVr8k

---

Title: Federated Class-Incremental Learning with Hierarchical Generative Prototypes

Abstract: Federated Learning (FL) aims at unburdening the training of deep models by distributing computation across multiple devices (clients) while safeguarding data privacy. On top of that, Federated Continual Learning (FCL) also accounts for data distribution evolving over time, mirroring the dynamic nature of real-world environments. While previous studies have identified Catastrophic Forgetting and Client Drift as primary causes of performance degradation in FCL, we shed light on the importance of Incremental Bias and Federated Bias, which cause models to prioritize classes that are recently introduced or locally predominant, respectively. Our proposal constrains both biases to the last layer by efficiently fine-tuning a pre-trained backbone using learnable prompts, resulting in clients that produce less biased representations and more biased classifiers. Therefore, instead of solely relying on parameter aggregation, we leverage generative prototypes to effectively balance the predictions of the global model. Our proposed methodology significantly improves the current state of the art across six datasets, each including three different scenarios. Code to reproduce the results is provided in the supplementary material.

URL: https://openreview.net/forum?id=k2TT42Ei8W

---

Title: VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Abstract: Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.

URL: https://openreview.net/forum?id=mPhlTmYiyg

---

Title: Decentralized Federated Learning with Function Space Regularization

Abstract: In this work we propose FedFun, a framework for decentralized federated learning that enforces consensus across clients in function space rather than parameter space. By framing agreement as a regularization penalty in a Hilbert space of hypotheses, our method allows optimization using proximal gradient updates that encourage similarity between neighboring models while supporting both parametric and non-parametric learners. This function space perspective enables convergence guarantees under mild assumptions, covering situations where client objectives are non-convex in the usual sense and where clients may utilize differing architectures. In addition to convergence analysis, we demonstrate compatibility with models like neural networks and decision trees, and empirically evaluate implementations of FedFun on various sample datasets.

URL: https://openreview.net/forum?id=79a397AzQu

---

Title: Training Conditional GANs on Limited and Long-Tailed Data: a Survey and Comparative Analysis

Abstract: Generative adversarial networks (GANs) are a generative model framework that is competitive with state-of-the-art autoencoders and diffusion models in many tasks. While the latter have achieved impressive generation capabilities, mostly through large-scale, general-purpose text-to-image models, their computational requirements place them out of reach for practitioners. On the other hand, as GAN architectures mature and new developments allow for more stable training, interest in their application has grown across diverse domains. However, real-world data are often hard to deal with due to a limited number of samples or long-tailed distributions. Furthermore, previous works addressing these issues lack guidance regarding their applicability and have not been compared through appropriately diverse benchmarks, nor assessed using the same metrics. In this article, we survey methods for training GANs on limited and long-tailed data and conduct an extensive comparative analysis of existing methods. Our results allow us to draw conclusions about the advantages, disadvantages, and practical applicability of these methods, hopefully making GANs more accessible to practitioners in diverse fields. The code will be made available as soon as deanonymization is allowed.

URL: https://openreview.net/forum?id=Gubw0PmHRY

---

Title: How Much Information Fits in a Vector?

Abstract: Recent work in neural network interpretability has suggested that hidden activations of some deep models can be viewed as linear projections of much higher-dimensional vectors of sparse latent ``features.'' In general, this kind of representation is known as a superposition code. This work presents an information-theoretic account of superposition codes in a setting applicable to interpretability. We show that when the number $k$ of active features is very small compared to the number $N$ of total features, simple inference methods currently used by sparse autoencoders can reliably decode a $d$-dimensional superposition code when $d$ is a constant factor greater than the Shannon limit. Specifically, when $\ln k / \ln N \le \eta < 1$ and $H$ is the entropy of the latent vector in bits, asymptotically it suffices that $d / H > C(\eta)$ for certain increasing functions $C(\eta).$ However, the behavior of $C(\eta)$ depends on what decoding method is used. For example, when $\eta = 0.3$, we empirically show that a method based on the popular top-$k$ activation function typically requires a factor of $C = 4$ dimensions per bit. On the other hand, we exhibit an algorithm that succeeds with less than $2$ dimensions per bit and requires only around $3$ times as many FLOPs for the same values of $(N, d).$ We hope this work helps connect research in interpretability with perspectives from compressive sensing and information theory.
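The setting described above can be illustrated with a toy superposition code: $k$ active features out of $N$ are projected into $d$ dimensions by random unit-norm directions, and a simple top-$k$ inner-product rule recovers the support. This is a hedged sketch of the general setup (random Gaussian dictionary, our own choice of $N$, $d$, $k$), not the paper's construction or its near-Shannon-limit decoding algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 1000, 256, 3          # total features, embedding dim, active features

# Random unit-norm feature directions: each latent feature is a direction in R^d.
W = rng.standard_normal((d, N))
W /= np.linalg.norm(W, axis=0)

# Sparse latent vector: k active features superposed into one d-dim activation.
support = rng.choice(N, size=k, replace=False)
z = np.zeros(N)
z[support] = 1.0
x = W @ z

# Top-k decoding: keep the k features with the largest inner product with x.
scores = W.T @ x
decoded = np.argsort(scores)[-k:]
recovered = set(decoded) == set(support)
```

With $k \ll N$ and $d$ this large, the in-support scores concentrate near 1 while off-support scores stay near 0, so the naive rule succeeds; the interesting regime in the abstract is how small $d$ can be made relative to the entropy of the latent vector.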

URL: https://openreview.net/forum?id=Nby4pCPIZI

---

Title: Gaussian Process Priors for Boundary Value Problems of Linear Partial Differential Equations

Abstract: Working with systems of partial differential equations (PDEs) is a fundamental task in computational science. Well-posed systems are addressed by numerical solvers or neural operators, whereas systems described by data are often addressed by PINNs or Gaussian processes. In this work, we propose Boundary Ehrenpreis--Palamodov Gaussian Processes (B-EPGPs), a probabilistic framework for constructing GP priors for linear constant-coefficient PDE systems with linear boundary conditions that can be conditioned on a finite data set. Starting from the Ehrenpreis--Palamodov representation, we learn the free parameters from data and enforce boundary conditions analytically for piecewise-flat boundaries. This yields priors (and posteriors) whose sample paths satisfy the PDE and the enforced boundary conditions pointwise by construction. We provide constructive examples, formal correctness proofs, and experiments showing improved accuracy and reduced runtime/memory compared to existing GP-PDE baselines.

URL: https://openreview.net/forum?id=BaeSiyRXxZ

---

Title: Hierarchical Diffusion for Efficient and Transferable Climate Downscaling

Abstract: Downscaling is essential for generating the high-resolution climate data needed for local planning, but traditional methods remain computationally demanding. Recent years have seen impressive results from AI downscaling models, particularly diffusion models, which have attracted attention due to their ability to generate ensembles and overcome the smoothing problem common in other AI methods. However, these models typically remain computationally intensive. We introduce a Hierarchical Diffusion Downscaling (HDD) model, which adds an easily extensible hierarchical sampling process to the diffusion framework. A coarse-to-fine hierarchy is imposed via a simple downsampling scheme. HDD achieves competitive accuracy on the ERA5 reanalysis dataset and CMIP5 models while significantly reducing computational load by running on up to half as many pixels. Additionally, a single model trained at 0.25° resolution transfers seamlessly across multiple CMIP5 models with much coarser resolution. HDD thus offers a lightweight alternative for probabilistic climate downscaling, facilitating affordable large-ensemble high-resolution climate projections with a single model that can be applied across GCMs of varying input sizes. See a full code implementation at: https://github.com/HDD/HDD-Hierarchical-Diffusion-Downscaling

URL: https://openreview.net/forum?id=OhTYgFpMU2

---

Title: Fallback-Enabled Closed-Set Classification: Cross-Modal Consistency in Vision-Language Models

Abstract: Vision-Language Models (VLMs) can describe and label images; however, this does not imply that they truly understand what they perceive. Recent studies show that, despite their breadth of training, VLMs are surprisingly unreliable as classifiers, in both closed-world and open-world settings. In this work, we explore a deeper question: can a VLM recognize when an image falls outside the set of categories it is asked to choose from? Our results reveal a surprising failure mode: even when the notion of in-set versus out-of-set is explicitly defined, VLMs often assign plausible in-set labels to out-of-set images, violating the task’s explicit constraint. Motivated by this, we propose a cross-modal consistency framework that reasons over both the visual and textual arms of the model and accepts an answer only when they agree. Experiments on three well-known datasets (DomainNet, VisDA and INaturalist-2021) demonstrate that this approach consistently improves balanced known vs. unknown detection over Source-Free Universal Domain Adaptation (SF-UniDA) baselines, showing that cross-modal consistency improves a VLM’s ability to follow the task logic and distinguish when an image falls outside the intended label space. Our results suggest that, with strong VLMs, fallback behavior need not rely exclusively on specialized SF-UniDA adaptation pipelines: a lightweight cross-modal consistency decision rule can be competitive with representative SF-UniDA baselines on standard benchmarks.

URL: https://openreview.net/forum?id=tOKG6sSk3I

---

Title: NEUTAG: Graph Transformer for Attributed Graphs

Abstract: Graph Transformers (\textsc{GT}) have demonstrated their superiority in graph classification tasks, but their performance in node classification settings remains below par. They are designed for either homophilic or heterophilic graphs and show poor scalability to million-sized graphs. In this paper, we address these limitations for node classification tasks by designing a model that utilizes a special feature encoding that transforms the input graph, separating nodes and features, which enables the flow of information not only from the local neighborhood of a node but also from distant nodes, via their connections through shared feature nodes. We theoretically demonstrate that this design allows each node to exchange information with all nodes in the graph, effectively mimicking all-node-pair message passing while avoiding $\mathcal{O}(N^2)$ computation. We further analyze the universal approximation ability of the proposed transformer. Finally, we demonstrate the effectiveness of the proposed method on diverse sets of large-scale graphs, including the homophilic \& the heterophilic varieties.

URL: https://openreview.net/forum?id=kQrIrYvbbw

---

Title: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

Abstract: Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present *JE-IRT*, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the **direction** encodes semantics and the **norm** encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. We also show that simple linear probes of the embedding space recover cross-subject ability directions, such as an arithmetic axis that highlights quantitatively demanding questions in seemingly distant subjects like **virology** and **global facts**. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.

URL: https://openreview.net/forum?id=VQe6p9wn5g

---

Title: Rethinking On-policy Optimization for Query Augmentation

Abstract: Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that under a compute-aware comparison setting, simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which the LLM policy, instead of rewriting a query, learns to generate a pseudo-document that maximizes retrieval performance, merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. We will fully open source our implementation to facilitate reproducibility.

URL: https://openreview.net/forum?id=mmqbjhz5Br

---

Title: Diffusion-based Cumulative Adversarial Purification for Vision Language Models

Abstract: Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We theoretically establish a certified recovery region in the forward diffusion process and quantify the convergence rate of semantic variation with respect to VLMs. These findings show that adversarial effects monotonically fade as diffusion unfolds. Guided by this principle, DiffCAP leverages noise injection with a similarity threshold of VLM embeddings as an adaptive criterion, before reverse diffusion restores a clean and reliable representation for VLM inference. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP consistently outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with theorems and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments.

URL: https://openreview.net/forum?id=kpuV3mzwqw

---

Title: Continual Test-Time Adaptation: A Comprehensive Survey

Abstract: Deep neural nets achieve remarkable performance when training and test data share the same distribution, but this assumption frequently breaks in real-world deployment, where data undergoes continual distributional shifts. Continual Test-Time Adaptation (CTTA) addresses this challenge by adapting pretrained models to non-stationary target distributions on-the-fly, without access to source data or labeled targets, while mitigating two critical failure modes: catastrophic forgetting of source knowledge and error accumulation from noisy pseudo-labels over extended time horizons. In this comprehensive survey, we formally define the CTTA problem, analyze the diverse continual domain shift patterns that characterize different evaluation protocols, and propose a hierarchical taxonomy that categorizes existing methods into three families: optimization-based strategies (entropy minimization, pseudo-labeling, parameter restoration), parameter-efficient methods (normalization layer adaptation, adaptive parameter selection), and architecture-based approaches (teacher-student frameworks, adapters, visual prompting, masked modeling). We systematically review representative methods within each category and present comparative benchmarks and experimental results across standard evaluation settings. Finally, we discuss limitations of current approaches and highlight emerging research directions, including adaptation of foundation models and black-box systems, providing a roadmap for future research in robust continual test-time adaptation.

URL: https://openreview.net/forum?id=mM3r03Xw1V

---

Title: Uncertainty-Aware Safety Propagation Critics for Safe Reinforcement Learning

Abstract: Safe reinforcement learning (RL) aims to optimize long-term performance while satisfying safety constraints, a requirement that is critical in many applications but difficult to guarantee when cost estimates are inaccurate or data is limited. In model-free actor-critic methods, cost critics are often unreliable in poorly explored regions, leading to constraint violations during both training and deployment. In this work, we propose a novel uncertainty-aware approach in safe RL called USPC, which constructs conservative cost surrogates using epistemic uncertainty. Our method trains an ensemble of cost critics to estimate uncertainty and uses these estimates to build an upper confidence bound on predicted costs. We then introduce a safe set network that approximates a pessimistic surrogate of the cost action-value function inspired by safe Bayesian optimization, enabling scalable safety propagation in continuous state-action spaces. Replacing standard cost critics with this surrogate in existing off-policy safe RL algorithms yields policies that are significantly less likely to violate cost constraints. We show empirically across multiple Safety Gymnasium benchmark tasks that our approach consistently reduces both the frequency and magnitude of constraint violations while maintaining competitive reward performance compared to strong baselines.

URL: https://openreview.net/forum?id=b8GWqgLEHh

---

Title: Measure Theory of Conditionally Independent Random Function Evaluation

Abstract: In sequential design strategies, common in geostatistics and Bayesian optimization, the selection of a new observation point $X_{n+1}$ of a random function $\mathbf f$ is informed by past data, captured by the filtration $\mathcal F_n=\sigma(\mathbf f(X_0),\dots,\mathbf f(X_n))$. The random nature of $X_{n+1}$ introduces measure-theoretic subtleties in deriving the conditional distribution $\mathbb P(\mathbf f(X_{n+1})\in A \mid \mathcal F_n)$. Practitioners often resort to a heuristic: treating $X_0,\dots, X_{n+1}$ as fixed parameters within the conditional probability calculation. This paper investigates the mathematical validity of this widespread practice. We construct a counterexample to prove that this approach is, in general, incorrect. We also establish our central positive result: for continuous Gaussian random functions and their canonical conditional distribution, the heuristic is sound. This provides a rigorous justification for a foundational technique in Bayesian optimization and spatial statistics. We further extend our analysis to include settings with noisy evaluations and to cases where $X_{n+1}$ is not adapted to $\mathcal F_n$ but is conditionally independent of $\mathbf f$ given the filtration.

URL: https://openreview.net/forum?id=XgReLRlKEk

---

Title: FedPS: Federated data Preprocessing via aggregated Statistics

Abstract: Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing.
We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
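As one concrete instance of preprocessing from aggregated statistics (our simplified sketch, not the FedPS implementation): for horizontal min-max scaling, each client shares only per-feature minima and maxima, and the server's merge reproduces exactly what a centralized pipeline would compute. All names below are illustrative.

```python
import numpy as np

def client_stats(X):
    """Per-feature min/max a client shares instead of raw rows."""
    return X.min(axis=0), X.max(axis=0)

def aggregate(stats):
    """Server-side merge of client summaries into global per-feature min/max."""
    mins, maxs = zip(*stats)
    return np.min(mins, axis=0), np.max(maxs, axis=0)

def minmax_scale(X, lo, hi):
    """Min-Max scaling with externally supplied (e.g. federated) statistics."""
    return (X - lo) / (hi - lo)

# Two clients holding horizontal partitions of the same feature space.
rng = np.random.default_rng(1)
A, B = rng.normal(size=(50, 3)), rng.normal(size=(70, 3))
lo, hi = aggregate([client_stats(A), client_stats(B)])

central = np.vstack([A, B])  # what a centralized pipeline would see
```

Statistics like quantiles or frequent categories need the sketching techniques the abstract mentions, since they do not merge as trivially as min/max.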

URL: https://openreview.net/forum?id=MdeXZVjNKu

---

Title: Predicting integers from continuous parameters

Abstract: We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. For example, the number of up-votes on social media posts, or the number of bicycles available at a public rental station. While it is possible to model these as continuous values, and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to the question of whether such integer labels can be modeled directly by a discrete distribution, whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the _parameters_ of the distribution be continuous so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction and image generation. We find that overall the best performance comes from two distributions: _Bitwise_, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.
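The _Bitwise_ idea lends itself to a compact sketch: write the label in binary and place an independent Bernoulli on each bit, so the distribution's parameters (per-bit probabilities) are continuous and backprop-friendly. The code below is an illustrative reconstruction from the abstract, not the authors' implementation; the function names and the 8-bit range are our assumptions.

```python
import numpy as np

def int_to_bits(y, n_bits):
    """Binary expansion of a non-negative integer, least-significant bit first."""
    return np.array([(y >> i) & 1 for i in range(n_bits)], dtype=float)

def bitwise_nll(p, y, n_bits=8):
    """Negative log-likelihood of integer y under independent Bernoulli bits.

    p[i] is the predicted probability that bit i of the label is 1; in a
    network these would come from n_bits sigmoid outputs.
    """
    bits = int_to_bits(y, n_bits)
    eps = 1e-12
    return float(-np.sum(bits * np.log(p + eps) + (1 - bits) * np.log(1 - p + eps)))

def bitwise_sample(p, rng):
    """Draw an integer by sampling each bit independently."""
    bits = rng.random(len(p)) < p
    return int(sum(int(b) << i for i, b in enumerate(bits)))
```

Training would minimize `bitwise_nll` over predicted bit probabilities; a label that matches the high-probability bits gets a lower loss than a nearby label that flips one of them.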

URL: https://openreview.net/forum?id=d1WKFlKFEa

---

Title: Properties and limitations of geometric tempering for gradient flow dynamics

Abstract: We consider the problem of sampling from a probability distribution $\pi$. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise the Kullback--Leibler divergence from $\pi$. We consider the effect of replacing $\pi$ with a sequence of moving targets $(\pi_t)_{t\ge0}$ defined via geometric tempering on the Wasserstein and Fisher--Rao gradient flows. We show that replacing the target distribution with a geometric mixture of the initial and target distributions does not lead to a convergence speed-up.
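For reference, geometric tempering standardly refers to the path interpolating between an initial reference distribution $\mu$ and the target $\pi$ (the notation and schedule below are the common convention, not necessarily the paper's):

```latex
\pi_t \;\propto\; \mu^{1-\beta_t}\,\pi^{\beta_t},
\qquad \beta_0 = 0,\quad \beta_t \nearrow 1,
```

so the gradient flow at time $t$ minimises $\mathrm{KL}(\rho_t \,\|\, \pi_t)$ rather than $\mathrm{KL}(\rho_t \,\|\, \pi)$.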

URL: https://openreview.net/forum?id=IP0w5LdcxC

---

Title: Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers

Abstract: Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly characterized. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, enabling analysis through classical signal processing theory. Under this formulation, we derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to a Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We further extend this analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment, tightening the base requirement as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base arising from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, leading to positional erasure even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region—a “Goldilocks zone”—for long-context transformers. We validate the framework through a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, showing that observed successes, failures, and community retrofits align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and long-range degradation, while attempts to scale beyond one million tokens encounter a hard precision wall independent of architecture or training. Our analysis establishes RoPE base selection as a fundamental, necessary architectural constraint, rather than a tunable hyperparameter, and provides practical guidance for designing, scaling, and retrofitting long-context transformers under realistic numerical limits.
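Assuming the standard RoPE parameterization with per-pair frequencies $\theta_j = \mathrm{base}^{-2j/d}$, an aliasing-style condition in the spirit of the abstract (the slowest-rotating pair should complete at most one full revolution over the context) can be sketched as follows. This is our own back-of-the-envelope check, not the paper's exact bounds.

```python
import math

def rope_wavelengths(base, d):
    """Per-pair rotation wavelengths (in tokens) for standard RoPE.

    Pair j rotates at frequency theta_j = base**(-2j/d), so its wavelength
    is 2*pi / theta_j = 2*pi * base**(2j/d).
    """
    return [2 * math.pi * base ** (2 * j / d) for j in range(d // 2)]

def min_base_for_context(L, d):
    """Smallest base whose slowest pair completes at most one rotation over
    L tokens, i.e. 2*pi * base**((d-2)/d) >= L solved for base."""
    return (L / (2 * math.pi)) ** (d / (d - 2))
```

For a head dimension of 128, this heuristic already shows why the classic base of 10000 stops being comfortable well before million-token contexts, and why community retrofits raise the base.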

URL: https://openreview.net/forum?id=zxyMneble7

---

Title: Less Forgetting, More OOD Generalization: Adaptive Augmented Reweighted Replay (AA-RR) for Continual Learning

Abstract: Machine learning models often forget previously learned classes when trained sequentially. Rehearsal-based methods mitigate this by replaying stored samples, but their reliance on memorization leads to poor out-of-distribution (OOD) generalization—a problem that remains largely unstudied. This memorization is driven by unbalanced gradient updates, spurious correlations, and class-imbalanced replay buffers. To address these issues, we introduce Adaptive Augmented Reweighted Replay (AA-RR), a lightweight framework designed to improve generalization in rehearsal-based continual learning (CL). AA-RR applies adaptive, class-aware loss reweighting to correct gradient imbalance while accounting for data recency and limited buffer capacity. It further incorporates data-centric augmentation and a principled sample-selection strategy based on forgetting dynamics to retain representative, consistently learned examples. Experiments on standard CL benchmarks show that AA-RR markedly boosts generalization and surpasses state-of-the-art baselines, especially under covariate shift.

URL: https://openreview.net/forum?id=4wd79nLVna

---

Title: Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection

Abstract: Zero-shot anomaly detection/localization trains on a source domain and discriminates images from unseen target domains given only textual prompts (e.g., “normal" vs. “anomaly"); therefore, performance hinges on generalization. Recent methods build on CLIP for its strong zero-shot generalization; however, as we show, localization has not improved as much as detection and, especially for small regions, remains near random, with AUPRO close to chance, indicating weak pixel-level generalization. We attribute this to CLIP’s limited ability to retain fine-grained features in its vision encoder and insufficient alignment between the text encoder and dense visual features, which have not been effectively addressed in previous methods. To address these challenges, first, we replace CLIP’s vision encoder with an adapted vision encoder that uses a correlation-based attention module to better preserve fine-grained features and small details. Second, we boost text–vision alignment by conditioning the learnable prompts in the text encoder on image context extracted from the vision encoder and performing local-to-global representation fusion, further improving localization. Finally, we show that our correlation-based attention module can incorporate feature correlations from additional models such as DINOv2, further enhancing spatial understanding and localization. We call our model Crane (Context-Guided Prompt Learning and Attention Refinement) and its DINOv2-boosted variant Crane+ and show that it improves the state-of-the-art by up to 28% in pixel-level localization (AUPRO) and up to 4.5% in image-level detection (AP), across 14 industrial and medical datasets.

URL: https://openreview.net/forum?id=logc7dzJRS

---

Title: Learning from Missing Values: Encoding Missingness in Representation-Space for LSTM Time Series Forecasting

Abstract: While many state-of-the-art techniques reconstruct incomplete time series datasets by replacing gaps with modeled estimates, we propose an alternative: encode missing values as an extremal sentinel value, allowing a prediction model to learn from the pattern of missingness. Incomplete data is a common problem in real-world time series forecasting, particularly in environmental monitoring where sensor failures can cause continuous gaps in data. This paper proposes the Min-Std method, a novel computationally efficient imputation strategy that encodes missingness in representation-space with an extremal statistical sentinel $(\min - \sigma)$ mapped to $0$ under Min-Max scaling. The result is that instead of training a model on (possibly imprecise) estimates for missing data, we simply replace the missing value with a sentinel the model can recognize to mean `uninformative'. By ensuring this sentinel is uniquely mapped to $0$, the only $0$ values the model will receive are either missing values, or values dropped by a dropout regularizer. We compare prediction results using our Min-Std imputation strategy against 12 imputation methods (including Kalman Smoothing and MissForest) across 6 different transformations on 28 distinct environmental datasets. Friedman's nonparametric test and critical difference ranking demonstrate that Min-Std imputation consistently yields superior predictive performance (measured by KGE, NMSE, and F1 Score) compared to complex model-based alternatives while being orders of magnitude faster (e.g. $0.02$s vs $500$s$+$). Our findings suggest that single-channel, explicit representation-space encodings of missingness are preferable to reconstruction-based imputation.
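The Min-Std encoding is simple enough to sketch directly from the abstract. Assuming a 1-D series with NaNs marking gaps and a nonconstant observed signal (so $\sigma > 0$), the sentinel $\min - \sigma$ sits strictly below every observation and therefore becomes the unique value that Min-Max scaling maps to $0$. A minimal sketch, not the authors' code:

```python
import numpy as np

def min_std_encode(x):
    """Replace NaNs with the sentinel (min - std), then Min-Max scale.

    After scaling, the sentinel is the unique value mapped to 0, so a
    downstream model can read exact zeros as 'missing / uninformative'.
    Assumes the observed values are not all equal (std > 0).
    """
    obs = x[~np.isnan(x)]
    sentinel = obs.min() - obs.std()
    filled = np.where(np.isnan(x), sentinel, x)
    lo, hi = filled.min(), filled.max()
    return (filled - lo) / (hi - lo)
```

Observed values land strictly inside $(0, 1]$, which is what keeps the zero sentinel unambiguous.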

URL: https://openreview.net/forum?id=DMmCMIrrez

---

Title: Trajectory-Based Neural Darwinism in Convolutional Neural Networks: Variation, Competition, and Selective Retention

Abstract: Understanding how neural networks develop and stabilize internal representations remains a central challenge. Inspired by Edelman’s Neural Darwinism, we introduce the Neuron Darwinian Dynamics System (NDDS), a trajectory-based framework that treats neurons as evolving agents under both local and global selective pressures. We define the Global Darwinian Pressure (GDP) as the population-average neuron fitness, capturing system-wide selection dynamics. Layer-wise analyses show that selective pressure intensifies over training, particularly in deeper layers, reflecting progressive consolidation of high-fitness neurons. Ablation experiments further reveal that removing survived neurons leads to substantial accuracy loss, whereas eliminating low-fitness neurons causes minimal degradation, demonstrating NDDS’s ability to identify functionally critical units. Dynamic trajectory analyses show that survived neurons maintain coherent activity, stronger weights, and higher global Darwinian pressures, while eliminated neurons stagnate. Overall, our results support a Darwinian view of representation learning: networks achieve early-stage redundancy and later-stage specialization, enabling robust and stable task-relevant representations.

URL: https://openreview.net/forum?id=3SIeuOvN6W

---

Title: On Symmetric Losses for Policy Optimization with Noisy Preferences

Abstract: Optimizing policies based on human preferences is key to aligning language models with human intent.
This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization.
Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases, which can be asymmetric.
We propose a principled framework for robust policy optimization under noisy preferences based on the view of reward modeling as a binary classification problem.
Specifically, we demonstrate that asymmetric preference noise can be effectively treated as symmetric noise under this framework.
This viewpoint allows us to leverage symmetric losses, well known for their robustness to label noise in classification, for reward modeling, which leads to our Symmetric Preference Optimization (SymPO) method, a novel offline preference optimization algorithm.
Theoretically, we prove that symmetric losses enable successful policy improvement even with noisy labels, as the resulting reward is rank-preserving—a property we identify as sufficient for policy improvement.
Empirical evaluations on a synthetic dataset and real-world language model alignment tasks demonstrate that SymPO achieves competitive or higher performance than existing robust methods in high-noise scenarios.
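
The robustness argument sketches cleanly in a few lines. A hedged illustration with the sigmoid loss as the symmetric loss (our choice; the paper may use a different one): a loss is symmetric when $\ell(z) + \ell(-z)$ is constant, and under symmetric label noise this makes the noisy risk an affine transform of the clean risk, preserving the reward ranking.

```python
import math

def sigmoid_loss(z):   # symmetric: l(z) + l(-z) = 1 for all z
    return 1.0 / (1.0 + math.exp(z))

def logistic_loss(z):  # standard cross-entropy margin loss, NOT symmetric
    return math.log(1.0 + math.exp(-z))

# Symmetry check on a few margins.
for z in (-2.0, 0.5, 3.0):
    assert abs(sigmoid_loss(z) + sigmoid_loss(-z) - 1.0) < 1e-12

# Under symmetric label noise with flip rate p, the expected loss at
# margin z is (1-p)*l(z) + p*l(-z) = p + (1-2p)*l(z): an affine
# transform of the clean loss, so the induced ranking is unchanged.
p, z = 0.3, 1.7
noisy = (1 - p) * sigmoid_loss(z) + p * sigmoid_loss(-z)
assert abs(noisy - (p + (1 - 2 * p) * sigmoid_loss(z))) < 1e-12
```

The same affine-transform identity fails for the logistic loss, which is one way to see why standard reward-model training degrades under preference noise.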

URL: https://openreview.net/forum?id=cBWGLmSeao

---

Title: A Method and a Metric for GNN Reliance on Information from Features and Structure

Abstract: Graph Neural Networks (GNNs) rely on both node-edge features and graph structure, but the relative use of these information sources is poorly understood. In many cases either features or structure contain more useful information, and in extreme cases one may inhibit learning, as in some tasks where models overfit on structural patterns. Understanding the balance of these information sources is therefore essential for strategic model design.

We introduce Noise-Noise Analysis to measure each source’s contribution to model performance, along with the Noise-Noise Ratio Difference (NNRD) metric that quantifies whether a model is feature-reliant or structure-reliant. Through experiments on synthetic and real-world graph-classification datasets, we show that GCN, GAT, and GIN layers can all perform graph-less learning (ignoring structure when unhelpful), but only GIN performs feature-less learning. All three architectures exhibit bias toward features over structure. Noise-Noise Analysis provides practitioners with a fast tool to understand their models’ information usage.

URL: https://openreview.net/forum?id=e0Uw9aBr3A

---

Title: CoCoA Is ADMM: Unifying Two Paradigms in Distributed Optimization

Abstract: We consider primal-dual algorithms for general empirical risk minimization problems in distributed settings, focusing on two prominent classes of algorithms. The first class is the communication-efficient distributed dual coordinate ascent (CoCoA), derived from the coordinate ascent method for solving the dual problem. The second class is the alternating direction method of multipliers (ADMM), including consensus ADMM, proximal ADMM, and linearized ADMM. We demonstrate that both classes of algorithms can be transformed into a unified update form that involves only primal and dual variables. This discovery reveals key connections between the two classes of algorithms: CoCoA can be interpreted as a special case of proximal ADMM for solving the dual problem, while consensus ADMM is equivalent to a proximal ADMM algorithm. This discovery provides insight into how we can easily enable the ADMM variants to outperform the CoCoA variants by adjusting the augmented Lagrangian parameter. We further explore linearized versions of ADMM and analyze the effects of tuning parameters on these ADMM variants in the distributed setting. Extensive simulation studies and real-world data analysis support our theoretical findings.
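
For readers less familiar with the second algorithm class, here is a textbook consensus-ADMM instance on distributed least squares (a generic sketch, not the paper's unified update form): each agent holds $(A_i, b_i)$, solves a local proximal subproblem, and the consensus variable $z$ is obtained by averaging.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
As = [rng.normal(size=(10, d)) for _ in range(2)]  # two agents' data
bs = [rng.normal(size=10) for _ in range(2)]

rho = 1.0                      # augmented Lagrangian parameter
z = np.zeros(d)                # consensus variable
us = [np.zeros(d) for _ in As] # scaled dual variables
for _ in range(300):
    # Local x-updates: argmin_x ||A x - b||^2 + (rho/2)||x - z + u||^2
    xs = [np.linalg.solve(A.T @ A + rho * np.eye(d),
                          A.T @ b + rho * (z - u))
          for A, b, u in zip(As, bs, us)]
    z = np.mean([x + u for x, u in zip(xs, us)], axis=0)  # averaging step
    us = [u + x - z for u, x in zip(us, xs)]              # dual update

# Consensus ADMM recovers the centralized least-squares solution.
x_star = np.linalg.lstsq(np.vstack(As), np.concatenate(bs), rcond=None)[0]
print(np.max(np.abs(z - x_star)))  # small
```

The role of `rho` here is the augmented Lagrangian parameter whose tuning, per the abstract, is what lets ADMM variants outperform CoCoA variants.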

URL: https://openreview.net/forum?id=kLhxBDa2yD

---

Title: Learnable Coreset Selection for Graph Active Learning

Abstract: Graph Neural Networks (GNNs) have demonstrated their effectiveness in a variety of graph-based tasks. However, their performance heavily depends on the availability of a sufficient amount of labeled data, which is often costly to acquire in real-world applications.
To tackle this, GNN-based Active Learning (AL) methods aim to enhance labeling efficiency by selecting the most informative nodes for labeling. However, existing methods often rely on heuristic or implicit approaches that fail to fully capture the influence of labeled data on unlabeled nodes, thereby limiting their adaptability across diverse graph types.
In this paper, we propose LearnAL, a Learnable coreset labeling framework for graph Active Learning to address these limitations. Unlike traditional heuristic-based methods, LearnAL explicitly models the correlations between labeled and unlabeled nodes using an attention architecture, linking these correlations directly to prediction performance. Leveraging global influence (attention) scores, LearnAL selects and labels samples that maximize representational diversity, enhancing sample coverage.
We provide theoretical analysis demonstrating that this attention-based selection reduces the covering radius bound, improving prediction performance on unlabeled data. Our experimental results show that the labeled coreset significantly enhances the generalizability of various graph models across different graph datasets, as well as CNN models in image classification tasks.

URL: https://openreview.net/forum?id=ursw3nWq5K

---

Title: Theoretical Foundations of Continual Learning via Drift-Plus-Penalty

Abstract: In many real-world settings, data streams are inherently nonstationary and arrive sequentially, necessitating learning systems to adapt continuously without repeatedly retraining from scratch. Continual learning (CL) addresses this setting by seeking to incorporate new tasks while preventing catastrophic forgetting, whereby updates for recent data induce performance degradation on previously acquired knowledge. We introduce a control-theoretic perspective on CL that explicitly regulates the temporal evolution of forgetting, framing adaptation to new tasks as a controlled process subject to long-term stability constraints. We focus on replay-based CL settings in which a finite memory buffer preserves representative samples from prior tasks, allowing forgetting to be explicitly regulated. We propose COntinual Learning with Drift-Plus-Penalty (\texttt{COLD}), a novel CL framework grounded in the stochastic optimization-based Drift-Plus-Penalty (DPP) principle. At each task, \texttt{COLD} minimizes the instantaneous penalty corresponding to the current task loss while simultaneously maintaining a virtual queue that explicitly tracks deviations from long-term stability on previously learned tasks, hence capturing the stability–plasticity trade-off as a regulated dynamical process. We establish stability and convergence guarantees that characterize this trade-off, as governed by a tunable control parameter. Empirical results on standard benchmark datasets show that the proposed framework consistently achieves superior accuracy compared to a wide range of state-of-the-art CL baselines, while exhibiting competitive and tunable forgetting behavior that reflects the explicit regulation of the stability–plasticity trade-off through virtual queues and the DPP objective.
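
The drift-plus-penalty machinery the framework builds on can be illustrated on a scalar toy problem (our own toy objective and constraint, not the paper's CL setup): minimize the penalty $f(x) = (x-2)^2$ while keeping the long-term average of $x$ below $1$, with a virtual queue tracking accumulated violation.

```python
# Greedy DPP step: minimize V*f(x) + Q*(x - 1); larger V favors the
# penalty, the queue Q enforces the long-term constraint.
V = 10.0
Q = 0.0
xs = []
for t in range(1000):
    # argmin_x V*(x-2)^2 + Q*(x-1)  =>  x = 2 - Q/(2V)
    x = 2.0 - Q / (2.0 * V)
    xs.append(x)
    Q = max(Q + (x - 1.0), 0.0)  # virtual-queue (drift) update

avg = sum(xs) / len(xs)
print(avg)  # time-average approaches the constraint level 1 from above
```

The tunable trade-off the abstract describes is visible here: larger `V` drives the per-step penalty down faster but lets the queue (constraint violation) grow larger before stabilizing.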

URL: https://openreview.net/forum?id=QhxNMdhhBy

---

Title: Interleaved Gibbs Diffusion: Generating Discrete-Continuous Data with Implicit Constraints

Abstract: We introduce Interleaved Gibbs Diffusion (IGD), a novel generative modeling framework for discrete-continuous data, focusing on problems with important, implicit, and unspecified constraints in the data. Most prior works on discrete and discrete-continuous diffusion assume a factorized denoising distribution, which can hinder the modeling of strong dependencies between random variables in such problems. We empirically demonstrate a significant improvement in 3-SAT performance out of the box by switching to a Gibbs-sampling-style discrete diffusion model that does not assume factorizability. Motivated by this, we introduce IGD, which generalizes the discrete-time Gibbs-sampling-type Markov chain to discrete-continuous generation. IGD allows for seamless integration between discrete and continuous denoisers while theoretically guaranteeing exact reversal of a suitable forward process. Further, it provides flexibility in the choice of denoisers and allows conditional generation via state-space doubling and inference-time refinement. Empirical evaluations on three challenging generation tasks - molecule structures, layouts, and tabular data - demonstrate state-of-the-art performance. Notably, IGD achieves state-of-the-art results without relying on domain-specific inductive biases like equivariant diffusion or auxiliary losses. We explore a wide range of modeling and interleaving strategies, along with hyperparameters, in each of these problems.

URL: https://openreview.net/forum?id=6ANE8ycxCI

---

Title: Prompting Large-Scale Vision Models with Cascaded Semantics

Abstract: As a leading parameter-efficient tuning paradigm in NLP, prompt tuning has recently been explored for its potential in computer vision. Unlike approaches that update pre-trained large-scale models (e.g., the vision transformer, or ViT for short), visual prompt tuning (VPT) incorporates additional learnable parameters (i.e., prompts) that are updated during tuning. However, the original visual prompts are randomly initialized, without leveraging the power of prior knowledge, which has been frequently used in NLP (e.g., instructions). To bridge this gap, we propose a novel methodology that injects semantic priors into prompt tuning. To this end, we pioneer the use of both fundamental image priors and advanced image semantics as such priors. The former (color, texture, and shape) are extracted by classical hand-crafted operators and suit the input space, while the self-attention map serves as the latter, suited to the feature space. We propose a scheme to integrate the two types of semantic priors into ViT's tuning through cascading. Extensive experiments conducted on 34 challenging image classification datasets demonstrate the superiority of our method in adapting pre-trained ViTs to various downstream scenarios while tuning only 0.74\% of ViT parameters.

URL: https://openreview.net/forum?id=SSsobNZJPO

---

Title: RobustMAD: Evaluating Real-World Robustness of Multimodal Small Language Models for Deployable Anomaly Detection Assistants

Abstract: Multimodal industrial anomaly inspection assistants are a critical component of next-generation smart factories, enabling interactive vision–language–based querying. However, multimodal large language models remain impractical for on-site deployment due to prohibitive computational demands and privacy risks from cloud-based inference. Compact multimodal \textit{small} language models (MSLMs) offer a deployable alternative, yet progress is constrained by the lack of comprehensive robustness analyses and meaningfully challenging benchmarks that reflect real-world industrial conditions. To address this gap, we develop RobustMAD, the first practically grounded benchmark, designed to systematically evaluate real-world robustness through diverse open-ended queries spanning object understanding, anomaly detection, unanswerable problems, and visual quality degradations. Contrary to conventional assumptions, top-performing MSLMs exhibit promising capabilities, surprisingly outperforming even the larger GPT-5 Nano. However, they are still far below industrial requirements, and RobustMAD exposes critical robustness gaps posing significant operational risks. In particular, three recurring failure modes emerge: (i) fragile multimodal grounding under fine-grained distinctions or degraded visual conditions, (ii) insufficiently comprehensive responses, and (iii) weak logical grounding on unanswerable or ill-posed queries, leading to hallucinated outputs. Grounded in these insights, we provide actionable guidance for the design of next-generation multimodal industrial inspection assistants that leverage their promising competence. Code is available at \url{http://anonymous.4open.science/r/RobustMAD-146D}.

URL: https://openreview.net/forum?id=skrA9UYNIZ

---

Title: Speeding up fairness reductions

Abstract: We study the problem of fair classification, where the goal is to optimize classification accuracy subject to fairness constraints. This type of problem occurs in many real-world applications, where we seek to ensure that a deployed AI system does not disproportionately impact historically disadvantaged groups. One of the leading approaches in the literature is the reduction approach (Agarwal et al., 2018; 2019), which enjoys many favorable properties. For instance, it supports a wide range of fairness constraints and model families and is usually easy to incorporate into existing ML pipelines. The reduction approach acts as a wrapper around a standard ML algorithm and obtains a model that satisfies fairness constraints by repeatedly running a fairness-unaware base algorithm. A typical number of iterations is around 100, meaning that the reduction approach can be up to 100 times slower than the base algorithm, which limits its applicability. To overcome this limitation, we introduce two algorithmic innovations. First, we interleave the exponentiated gradient updates of the standard reduction approach with column-generation updates, which leads to a decrease in the number of calls to the base algorithm. Second, we introduce adaptive sampling, which decreases the sizes of the datasets used in the calls to the base algorithm. We conduct comprehensive experiments to evaluate the efficacy of our improvements, showing that our two innovations speed up the reduction approach by an order of magnitude without sacrificing the quality of the resulting solutions.

URL: https://openreview.net/forum?id=C0AdL3r1Dc

---

Title: Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice --- but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we measure agreement with human grading by annotating responses to MMLU-Pro and GPQA-Diamond questions. We find answer matching using recent models -- even small ones -- achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improved evaluations via answer matching are not merely a conceptual concern --- they reduce costs and significantly change model rankings. Multiple choice benchmarks that seem saturated start showing room for improvement when evaluated with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
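
The protocol itself is simple to sketch. Below, a trivial string-normalizing stub stands in for the language-model matcher the paper actually uses (the stub and all names are purely illustrative):

```python
def stub_matcher(response: str, reference: str) -> bool:
    """Toy stand-in for the LM matcher: normalized containment check."""
    norm = lambda s: " ".join(s.lower().split()).rstrip(".")
    return norm(reference) in norm(response)

def evaluate(model, questions):
    """questions: list of (question_text, reference_answer) pairs.
    The candidate model sees the question WITHOUT answer options and
    produces a free-form response, which is matched to the reference."""
    correct = sum(stub_matcher(model(q), ref) for q, ref in questions)
    return correct / len(questions)

toy_model = lambda q: "The capital of France is Paris."
score = evaluate(toy_model, [("What is the capital of France?", "Paris")])
print(score)  # 1.0
```

The key structural difference from multiple choice is visible in `evaluate`: no options are ever shown, so option-only shortcuts cannot inflate the score.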

URL: https://openreview.net/forum?id=kyzLTGreeC

---

Title: Generalizability of Experimental Studies

Abstract: Experimental studies are a cornerstone of Machine Learning (ML) research.
A common and often implicit assumption is that the study's results will generalize beyond the study itself, e.g., to new data.
That is, repeating the same study under different conditions will likely yield similar results.
Existing frameworks to measure generalizability, borrowed from the causal inference literature, cannot capture the complexity of the results and the research questions of an ML study.
The problem of measuring generalizability in the more general ML setting is thus still open, also due to the lack of a mathematical formalization of experimental studies.
In this paper, we propose such a formalization, use it to develop a framework to quantify generalizability, and propose an instantiation based on rankings and the Maximum Mean Discrepancy.
We show how our framework offers insights into the number of experiments necessary for a generalizable study, and how experimenters can benefit from it.
Finally, we release the genexpy Python package, which allows for the evaluation of the generalizability of other experimental studies.

URL: https://openreview.net/forum?id=upBxad8UVy

---

Title: NePuDA: Neighborhood-Purifying Discriminant Analysis

Abstract: Linear Discriminant Analysis (LDA) is a popular technique for supervised dimensionality reduction due to its clarity and interpretability. However, LDA and its variants often assume that data within each class are Gaussian-distributed or form distinct groups/subclasses, which is not always true for high-dimensional, real-world datasets, where classes can have complex and irregular shapes and exhibit significant overlap. Recognizing this limitation, we propose a novel approach, \textbf{Ne}ighborhood-\textbf{Pu}rifying \textbf{D}iscriminant \textbf{A}nalysis, which forgoes the search for an ideal, class-separated subspace in favor of one where data samples are naturally surrounded by neighbors from the same class. Specifically, NePuDA aims to identify projection directions that reinforce this neighborhood purity for all data samples, with the intuitive logic that if an object shares characteristics with a known category, it likely belongs to that category. Accordingly, we formulate the objective function of the proposed method and introduce an iterative optimization procedure to solve it in an efficient manner. Detailed theoretical analyses are provided, covering convergence, computational complexity, and connections to existing LDA variants. Extensive empirical evaluations on a range of synthetic and real-world datasets demonstrate that NePuDA consistently extracts highly discriminative features, outperforming twelve classical and state-of-the-art supervised dimensionality reduction algorithms in classification accuracy. Our code is publicly available at \url{https://anonymous.4open.science/r/NePuDA_code-C47F/}.

URL: https://openreview.net/forum?id=9z2dtltVso

---

Title: Harnessing Heterogeneity: Improving Convergence Through Partial Variance Control in Federated Learning

Abstract: Federated Learning (FL) has emerged as a promising paradigm for collaborative model training without sharing local data. However, a significant challenge in FL arises from the heterogeneous data distributions across participating clients. This heterogeneity leads to highly variable gradient norms in the model's final layers, resulting in poor generalization, slower convergence, and reduced robustness of the global model. To address these issues, we propose a novel technique that incorporates a gradient penalty term into partial variance control. Our method enables diverse representation learning from heterogeneous client data in the initial layers while modifying standard SGD in the final layers. This approach reduces variance in the classification layers, aligns gradients, and mitigates the effects of data heterogeneity. Through theoretical analysis, we establish convergence rate bounds for the proposed algorithm, demonstrating its potential for competitive convergence compared to current FL methods in highly heterogeneous data settings. Empirical evaluations on five benchmark datasets validate our approach, showing enhanced performance and faster convergence over state-of-the-art baselines across various levels of data heterogeneity.

URL: https://openreview.net/forum?id=I9VhJ5iLNr

---

Title: RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging

Abstract: Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merge model by minimizing the discrepancy in predictions between the merge and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in the earlier layers propagate through the layers and influence the final prediction in the merge model. Here, we introduce RegMean++, a simple yet effective alternative to RegMean, that explicitly incorporates both intra-layer and cross-layer dependencies between merge models' layers into RegMean's objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merge model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive or state-of-the-art performance compared to various recent advanced model merging methods.
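
As background, the per-layer RegMean closed form that RegMean++ extends can be written in a few lines of NumPy (a sketch of the published formula as we understand it, not the RegMean++ variant with cross-layer dependencies): the merged weight minimizing $\sum_i \|X_i W - X_i W_i\|^2$ is $W^* = (\sum_i X_i^\top X_i)^{-1} \sum_i X_i^\top X_i W_i$, where $X_i$ are the layer's input activations under candidate model $i$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 4, 50, 3
Xs = [rng.normal(size=(n, d)) for _ in range(2)]  # per-model activations
Ws = [rng.normal(size=(d, k)) for _ in range(2)]  # per-model layer weights

G = sum(X.T @ X for X in Xs)                      # Gram matrix sum
W_merged = np.linalg.solve(G, sum(X.T @ X @ W for X, W in zip(Xs, Ws)))

# Sanity check: if both candidates share the same weights, merging
# recovers them exactly.
W_same = np.linalg.solve(G, sum(X.T @ X @ Ws[0] for X in Xs))
assert np.allclose(W_same, Ws[0])
```

Each layer is solved independently here, which is exactly the limitation the abstract says RegMean++ addresses by incorporating intra- and cross-layer dependencies.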

URL: https://openreview.net/forum?id=H5lDsSCS9i

---

Title: BitLogic: Training Framework for Gradient-Based FPGA Native Neural Networks

Abstract: The energy and latency costs of deep neural network inference are increasingly driven by deployment rather than training, motivating hardware-specialized alternatives to arithmetic-heavy models. FPGAs provide an attractive substrate for such specialization, yet existing FPGA-based neural approaches are fragmented and difficult to compare. We present BitLogic, a fully gradient-based, end-to-end trainable framework for FPGA-native neural networks built around LUT computation. BitLogic replaces multiply--accumulate operations with differentiable LUT nodes that map directly to FPGA primitives, enabling native binary computation, sparse connectivity, and efficient hardware realization. The framework offers a modular functional API supporting diverse architectures, along with learned encoders, hardware-aware heads, and multiple boundary-consistent LUT relaxations. An automated RTL export pipeline translates trained PyTorch models into synthesizable HDL, ensuring equivalence between software and hardware inference. Experiments across standard vision benchmarks and heterogeneous hardware platforms demonstrate competitive accuracy and substantial gains in FPGA efficiency, including 72.3% test accuracy on CIFAR-10 achieved with fewer than 0.3M logic gates, while attaining sub-20ns single-sample inference using only LUT resources.

URL: https://openreview.net/forum?id=ZbsSZAfDod

---

Title: ECLayr: A Fast and Robust Topological Layer via Euler Characteristic Curves

Abstract: We introduce a flexible and computationally efficient topological layer for general deep learning architectures, built upon the Euler Characteristic Curve. Unlike existing approaches that rely on computationally intensive persistent homology, our method bypasses this bottleneck while retaining essential topological information across diverse data modalities. To enable complete end-to-end training, we develop a novel backpropagation scheme that improves computation and mitigates vanishing gradient issues. We go on to provide stability analysis, establishing stability guarantees for the proposed layer in the presence of noise and outliers. We integrate the proposed layer into topological autoencoders to enhance representation learning through topological signals. We further demonstrate the effectiveness of our approach through classification experiments on a variety of datasets, including high-dimensional settings where persistent homology becomes computationally challenging.
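
To give a sense of what an Euler characteristic curve computes, here is a plain NumPy sketch for 2D images under a sublevel-set filtration (non-differentiable, unlike the proposed layer): pixels are vertices of a cubical complex, and $\chi = V - E + F$ is tracked as the threshold grows.

```python
import numpy as np

def euler_characteristic(mask):
    """chi = V - E + F of the cubical complex on active pixels."""
    V = mask.sum()                                            # vertices
    E = (mask[:, :-1] & mask[:, 1:]).sum() \
      + (mask[:-1, :] & mask[1:, :]).sum()                    # edges
    F = (mask[:-1, :-1] & mask[:-1, 1:]
       & mask[1:, :-1] & mask[1:, 1:]).sum()                  # unit squares
    return int(V - E + F)

def ecc(img, thresholds):
    """Euler characteristic curve of the sublevel sets img <= t."""
    return [euler_characteristic(img <= t) for t in thresholds]

# A hollow 8-pixel ring has chi = 0 (one component, one loop); once the
# center pixel enters the filtration, the block is filled and chi = 1.
ring = np.ones((3, 3), dtype=float)
ring[1, 1] = 2.0  # center enters the filtration later
print(ecc(ring, [1.0, 2.0]))  # [0, 1]
```

Each threshold costs only a few array reductions, which is the kind of efficiency gap, relative to persistent homology, that the abstract exploits.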

URL: https://openreview.net/forum?id=mpz0Z1cP36

---

Title: Legal Retrieval for Public Defenders

Abstract: AI tools are increasingly suggested as solutions to assist public agencies with heavy workloads. In public defense---where a constitutional right to counsel meets the complexities of law, overwhelming caseloads, and constrained resources---practitioners face especially taxing conditions. Yet, there is little evidence of how AI could meaningfully support defenders' day-to-day work. In partnership with the anonymized Office of the Public Defender, we develop the anonymized BriefBank, a retrieval tool which surfaces relevant appellate briefs to streamline legal research and writing.
We show that existing retrieval benchmarks fail to transfer to real public defense research; however, adding domain knowledge improves retrieval quality. This includes query expansion with legal reasoning, domain-specific data, and curated synthetic examples. To facilitate further research, we release a new, realistic retrieval dataset, manually annotated by real public defenders, and provide a taxonomy of these realistic defender search queries.
Together, our work improves on the status quo of realistic retrieval benchmarking and provides a starting point for leveraging AI in a real-world public interest setting.

URL: https://openreview.net/forum?id=HnbKQGRnDt

---

Title: Gen-MURE: Generalized Multiplicative Unbiased Risk Estimate

Abstract: Coherent imaging modalities such as ultrasound and synthetic aperture radar (SAR) imaging are degraded by signal-dependent multiplicative noise, where the noise distributions vary widely across acquisition scenarios. Existing self-supervised image denoising methods either assume zero-mean additive noise or independence across pixels, or require the noise distribution to be known, which often limits their applicability in real-world image denoising systems. We propose a Generalized Multiplicative Unbiased Risk Estimate (Gen-MURE), a model-agnostic self-supervised image denoising framework for enhancing images corrupted by signal-dependent, multiplicative speckle. Gen-MURE does not rely on explicit assumptions about the exact noise distribution and formulates a principled framework capable of denoising speckle drawn from a range of distributions. Gen-MURE does not require access to clean ground-truth images or parameters of the noise model, and denoises images in a single step without any iterative refinement. Extensive experiments on ultrasound images along with unseen simulated and real SAR images demonstrate the efficiency and robustness of Gen-MURE.

URL: https://openreview.net/forum?id=Hie13qRm1x

---

Title: Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks

Abstract: We present a unified theoretical framework connecting the first property of Deep Neural Collapse (DNC1) to the emergence of implicit low-rank bias in nonlinear networks trained with $L^2$ weight decay regularization. Our main contributions are threefold. First, we derive a quantitative relation between the Total Cluster Variation (TCV) of intermediate embeddings and the numerical rank of stationary weight matrices. In particular, we establish that, at any critical point, the distance from a weight matrix to the set of rank-$K$ matrices is bounded by a constant times the TCV of earlier-layer features, scaled inversely with the weight-decay parameter.
Second, we prove global optimality of DNC1 in a constrained representation-cost setting for both feedforward and residual architectures, showing that zero TCV across intermediate layers minimizes the representation cost under natural architectural constraints.
Third, we establish a benign landscape property: for almost every interpolating initialization there exists a continuous, loss-decreasing path from the initialization to a globally optimal, DNC1-satisfying configuration. Our theoretical claims are validated empirically; numerical experiments confirm the predicted relations among TCV, singular-value structure, and weight decay. These results indicate that neural collapse and low-rank bias are intimately linked phenomena arising from the optimization geometry induced by weight decay.

URL: https://openreview.net/forum?id=PvmFUzchzY

---

Title: Choosing the right basis for interpretability: Psychophysical comparison between neuron-based and dictionary-based representations

Abstract: Interpretability research often adopts a neuron-centric lens, treating individual neurons as the fundamental units of explanation. However, neuron-level explanations can be undermined by superposition, where single units respond to mixtures of unrelated patterns. Dictionary learning methods, such as sparse autoencoders and non-negative matrix factorization, offer a promising alternative by learning a new basis over layer activations. Despite this promise, direct human evaluations comparing neuron-based and dictionary-based representations remain limited.

We conducted three large-scale online psychophysics experiments (N=481) comparing explanations derived from neuron-based and dictionary-based representations in two convolutional neural networks (ResNet50, VGG16). We operationalize interpretability via visual coherence: a basis is more interpretable if humans can reliably recognize a common visual pattern in its maximally activating images and generalize that pattern to new images. Across experiments, dictionary-based representations were consistently more interpretable than neuron-based representations, with the advantage increasing in deeper layers.

Critically, because models differ in how neuron-aligned their representations are (with ResNet50 exhibiting greater superposition), neuron-based evaluations can mask cross-model differences, such that ResNet50's higher interpretability emerges only under dictionary-based comparisons.

These results provide psychophysical evidence that dictionary-based representations offer a stronger foundation for interpretability and caution against model comparisons based solely on neuron-level analyses.

URL: https://openreview.net/forum?id=vb57jDotru

---

Title: Towards Principled Task Grouping for Multi-Task Learning

Abstract: Multi-task learning (MTL) aims to leverage shared information among tasks to improve learning efficiency and accuracy. However, MTL often struggles to effectively manage positive and negative transfer between tasks, which can hinder performance improvements. Task grouping addresses this challenge by organizing tasks into meaningful clusters, maximizing beneficial transfer while minimizing detrimental interactions.
This paper introduces a principled approach to task grouping in MTL, advancing beyond existing methods by addressing key theoretical and practical limitations. Unlike prior studies, our method offers a theoretically grounded approach that does not depend on restrictive assumptions for constructing transfer gains. We also present a flexible mathematical programming formulation that accommodates a wide range of resource constraints, thereby enhancing its versatility.
Experimental results across diverse domains, including computer vision datasets, combinatorial optimization benchmarks, and time series tasks, demonstrate the superiority of our method over extensive baselines, thereby validating its effectiveness and general applicability in MTL without sacrificing efficiency.

URL: https://openreview.net/forum?id=3DeSIpzuro

---

Title: Advancing Counterfactual Prediction through Nonlinear Quantile Regression

Abstract: The ability to address counterfactual “what if” inquiries is essential for understanding and leveraging causal relationships. Traditional counterfactual prediction, under Pearl’s framework, typically relies on access to or estimation of a structural causal model (SCM). In practice, however, the underlying causal model is often unknown and difficult to identify. To overcome this limitation, we present a method for answering counterfactual queries without explicitly estimating the SCM. We establish a novel connection between counterfactual prediction and quantile regression, showing that counterfactual prediction can be reframed as an extended quantile regression problem. Building on this insight, we propose a practical framework for efficient and effective counterfactual prediction using neural networks under a bi-level optimization scheme. The proposed framework is theoretically shown to yield a unique and well-posed solution, providing a principled basis for reliable counterfactual estimation. Moreover, it improves the ability to generalize estimated counterfactual outcomes to unseen data, for which we further derive an upper bound on the generalization error. Empirical evaluations across multiple datasets offer strong evidence supporting the proposed framework and its theoretical properties.
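The quantile-preservation idea at the heart of this abstract can be illustrated on a toy additive-noise SCM (the model, coefficients, and sample sizes below are illustrative stand-ins, not the paper's framework): estimate the quantile level of the factual outcome under the observed regime, then read off that same quantile under the counterfactual regime.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy additive-noise SCM (illustrative): Y = 2*X + 3*T + U, U ~ N(0, 1)
def outcome(x, t, u):
    return 2.0 * x + 3.0 * t + u

# Observed factual outcome at x=1, t=0, with latent noise u=0.5 (unknown to us)
x, y_factual = 1.0, outcome(1.0, 0.0, 0.5)

# Step 1: estimate the quantile level tau of y_factual under the factual regime
u = rng.normal(size=200_000)
tau = np.mean(outcome(x, 0.0, u) <= y_factual)   # ~ Phi(0.5) ~ 0.69

# Step 2: read off the same quantile level under the counterfactual treatment t=1
y_cf = np.quantile(outcome(x, 1.0, u), tau)      # ~ 2*x + 3 + 0.5 = 5.5
print(round(y_cf, 2))
```

For additive noise this quantile trick recovers the true counterfactual exactly in the limit; the paper's contribution is learning the quantile function with neural networks under a bi-level scheme, without access to the SCM.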

URL: https://openreview.net/forum?id=0T1Pd65fyD

---

Title: Harnessing Optimization Dynamics for Curvature-Informed Model Merging

Abstract: Model merging is an effective strategy for composing capabilities in large language models without the need for costly joint retraining. We study this process in the supervised fine-tuning (SFT) stage, consolidating multiple checkpoints specialized for distinct capabilities (e.g., math, coding, and precise instruction following) into a single model. First, we introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware method for mitigating task interference that uses optimizer second-moment statistics as a diagonal curvature proxy to first prune the task vector with our Fast Fisher Grafting (FFG) technique and then reweight the pruned vector. When merging diverse, capability-based checkpoints, OTA improves the merged model's performance over strong baseline methods, as evaluated on unseen capability-based benchmarks. Second, we conduct a comprehensive, theoretically-inspired empirical analysis to explain the effectiveness of OTA. Our analysis surprisingly reveals that FFG implicitly induces a layer- and role-wise aware pruning mechanism that is capable of maintaining fine-tuning performance at much more aggressive pruning ratios compared to magnitude pruning and that exhibits interpretable task localization properties. Third, an extensive comparison of our curvature proxy across capability checkpoints shows that experts converge to a basin with substantial curvature similarity, offering a novel lens on why simple linear merging can be effective in practice. This result further strengthens our ablation study, showing that FFG is critical for merging performance. Finally, we develop a memory-light variant of OTA that efficiently compresses the second moments, mitigating the additional storage requirements of our method and improving scalability. We make all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints accessible through an anonymized repository at \url{https://github.com/tmlr-ota/ota}.

URL: https://openreview.net/forum?id=Wb2r8TdAyD

---

Title: MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants

Abstract: Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands make them impractical for widespread deployment, while small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the "SLMs Learnability Gap". To address this, we introduce **Mid-CoT Teacher Assistant Distillation (MiCoTA)**, a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity and reasoning length gaps. Our experiments on downstream tasks demonstrate that although SLMs distilled from large teachers can perform poorly, by applying MiCoTA, they achieve significant improvements in reasoning performance. Specifically, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct achieve average-score improvements of 3.47 and 3.93 points, respectively, on the AIME2024, AMC, Olympiad, MATH-500, and GSM8K benchmarks. To better understand the mechanism behind MiCoTA, we perform a quantitative experiment demonstrating that our method produces data more closely aligned with base SLM distributions. Our insights pave the way for future research into long-CoT data distillation for SLMs.

URL: https://openreview.net/forum?id=tKrFhQooVN

---

Title: Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning

Abstract: Assessing and enhancing human learning through question-answering is vital, yet automating this process remains challenging. We propose Savaal, a scalable question-generation system using large language models (LLMs) with three objectives: (i) scalability, enabling question generation from hundreds of pages of text, (ii) depth of understanding, producing questions beyond factual recall to test conceptual reasoning, and (iii) domain-independence, automatically generating questions across diverse knowledge areas. Instead of providing an LLM with large documents as context, Savaal improves results with a three-stage processing pipeline. Our evaluation with 76 human experts on 71 papers and PhD dissertations shows that Savaal generates questions that better test depth of understanding by 6.5$\times$ for dissertations and 1.5$\times$ for papers compared to a direct-prompting LLM baseline. Notably, as document length increases, Savaal's advantages in higher question quality and lower cost become more pronounced.

URL: https://openreview.net/forum?id=2DWDQTsz7K

---

Title: AuxiLight: Fast, Lightweight Auxiliary Loss Balancing Algorithm with Application in 6D Pose Estimation

Abstract: Jointly learning multiple tasks has proven to be beneficial. Auxiliary task learning builds on this by jointly training a set of predefined auxiliary tasks to improve the performance of a desired main task. Unfortunately, auxiliary task learning is often not practical because 1) jointly training multiple tasks requires an exponential hyperparameter search space of task loss weights; 2) the auxiliary tasks often incur expensive annotation costs for obtaining ground truth. In this work, we propose AuxiLight, a generic auxiliary learning algorithm, and consider the concrete real-world use case of 6D pose estimation. AuxiLight addresses the first issue via an algorithm that adaptively balances the auxiliary task losses based on an analysis of the training dynamics. Experiments on standard multi-task datasets show that our method consistently outperforms single-task models and state-of-the-art auxiliary task learning methods, being the fastest and the most lightweight among the known task-weighting algorithms. To demonstrate the practicality of auxiliary learning on real-world tasks, we further apply our method to 6D object pose estimation. We highlight that, for this task, multiple ground-truth auxiliary annotations can in fact be generated for free. This lets us showcase a concrete use of auxiliary learning for real-world problems that does not induce annotation costs.

URL: https://openreview.net/forum?id=iQGmKWqr4r

---

Title: Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models

Abstract: Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we investigate how to efficiently utilize class text information to mitigate distribution drifts encountered by VLMs during inference. In particular, we propose generating pseudo-labels for the noisy test-time samples by aligning visual embeddings with reliable, text-based semantic anchors. Specifically, to maintain the regular structure of the dataset properly, we formulate the problem as a batch-wise label assignment, which is efficiently solved using Optimal Transport. Our method, Semantic Anchor Transport (SAT), utilizes such pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. Moreover, SAT further leverages heterogeneous textual clues, with a multi-template distillation approach that replicates multi-view contrastive learning strategies in unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of SAT, achieving consistent performance gains over recent state-of-the-art methods, yet being computationally efficient.
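The batch-wise label assignment described above is typically solved with Sinkhorn iterations for entropic optimal transport. A minimal numpy sketch (the function name, uniform marginals, and regularization value are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def sinkhorn_assignment(sim, eps=1.0, n_iters=200):
    """Entropic OT between a batch of samples and class anchors.

    sim: (batch, classes) similarity scores between visual embeddings and
    text-based semantic anchors. Marginals are uniform, a common choice for
    balanced pseudo-labels (an assumption here, not necessarily SAT's exact
    formulation). Smaller eps gives sharper assignments.
    """
    K = np.exp(sim / eps)                            # Gibbs kernel
    r = np.full(sim.shape[0], 1.0 / sim.shape[0])    # one unit of mass per sample
    c = np.full(sim.shape[1], 1.0 / sim.shape[1])    # balanced class marginal
    u = np.ones_like(r)
    for _ in range(n_iters):                         # alternating scaling updates
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]               # plan = diag(u) K diag(v)

rng = np.random.default_rng(0)
sim = rng.normal(size=(8, 4))                        # stand-in for cosine similarities
P = sinkhorn_assignment(sim)
pseudo_labels = P.argmax(axis=1)                     # hard pseudo-labels per sample
```

The resulting plan respects both marginals, which is what lets the pseudo-labels "maintain the regular structure of the dataset" rather than collapsing onto a few classes.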

URL: https://openreview.net/forum?id=iE0Yemxvio

---

Title: pMixFed: Efficient Personalized Federated Learning through Adaptive Layer-Wise Mixup

Abstract: Partial Personalized Federated Learning (PFL) aims to balance generalization and personalization by decoupling models into shared and personalized layers. However, existing methods typically rely on rigid, static partitioning, which leads to significant global-local model discrepancies, client drift, and catastrophic forgetting. To overcome these limitations, we propose pMixFed, a dynamic, layer-wise PFL approach that integrates an adaptive mixing mechanism (inspired by Mixup) directly into the parameter space. Unlike static methods, pMixFed employs an adaptive strategy to dynamically partition layers and utilizes a gradual transition of personalization degrees to smooth the integration of global and local knowledge. This mechanism effectively mitigates the "hard split" issues found in prior work. Extensive experiments demonstrate that pMixFed consistently outperforms competitive baselines (such as FedAlt and FedSim) in heterogeneous settings, exhibiting faster model training, increased robustness against performance drops, and a self-tuning mechanism that effectively handles cold-start users.
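The layer-wise parameter-space mixing can be sketched in a few lines (the fixed per-layer coefficients below are illustrative; pMixFed adapts them dynamically during training):

```python
import numpy as np

def layerwise_mix(global_w, local_w, lambdas):
    """Per-layer parameter-space mixup: w_i = lam_i * global_i + (1 - lam_i) * local_i.

    lam_i = 1 keeps a layer fully shared, lam_i = 0 keeps it fully personal;
    intermediate values give the gradual transition between global and local
    knowledge described in the abstract, instead of a hard split.
    """
    return {name: lam * global_w[name] + (1.0 - lam) * local_w[name]
            for name, lam in zip(global_w, lambdas)}

g = {"conv1": np.full(4, 1.0), "head": np.full(4, 1.0)}   # global model weights
l = {"conv1": np.full(4, 0.0), "head": np.full(4, 0.0)}   # local client weights
mixed = layerwise_mix(g, l, lambdas=[1.0, 0.25])          # shared early layer, mostly personal head
```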

URL: https://openreview.net/forum?id=8762NeCGC0

---

Title: Chance-Constrained Inference for Hallucination Risk Control in Large Language Models

Abstract: Large language models generate outputs stochastically and may produce fluent but invalid responses, including factual hallucinations. Existing mitigation strategies reduce average error rates but do not provide explicit control over the \emph{frequency} of such failures under repeated use. We formulate inference as a deployment-time risk control problem and introduce \emph{chance-constrained inference}, which directly bounds the probability of hallucinations among accepted generations. Hallucinations are modeled as stochastic constraint violations, and we show that confidence-based selective prediction does not, in general, imply probabilistic risk guarantees. To enforce chance constraints efficiently, we propose a sequential, anytime-valid inference procedure that adaptively certifies feasibility or infeasibility using finite samples, avoiding conservative fixed-sample bounds. Experiments on questions inspired by NaturalQuestions and controlled multi-hop question answering demonstrate reliable risk control, early detection of intrinsically infeasible inputs, and safe composition under repeated use, while confidence-based baselines fail to provide consistent guarantees.

URL: https://openreview.net/forum?id=cJDhDC69m9

---

Title: Inference-Time Scaling for Joint Audio–Video Generation

Abstract: Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs.

URL: https://openreview.net/forum?id=MHNFjjm5nO

---

Title: Large Pretraining Datasets Don't Guarantee Robustness after Fine-Tuning

Abstract: Large-scale pretrained models are widely leveraged as foundations for learning new specialized tasks via fine-tuning, with the goal of maintaining the general performance of the model while allowing it to gain new skills. A valuable goal for all such models is robustness: the ability to perform well on out-of-distribution (OOD) tasks. We assess whether fine-tuning preserves the overall robustness of the pretrained model, and observe that models pretrained on large datasets exhibit strong catastrophic forgetting and loss of OOD generalization. To systematically assess robustness preservation in fine-tuned models, we propose the Robustness Inheritance Benchmark (ImageNet-RIB). The benchmark, which can be applied to any pretrained model, consists of a set of related but distinct OOD (downstream) tasks and involves fine-tuning on one of the OOD tasks in the set then testing on the rest. We find that though continual learning methods help, fine-tuning reduces robustness across pretrained models. Surprisingly, models pretrained on the largest and most diverse datasets (e.g., LAION-2B) exhibit both larger robustness losses and lower absolute robustness after fine-tuning on small datasets, relative to models pretrained on smaller datasets. These findings suggest that starting with the strongest foundation model is not necessarily the best approach for performance on specialist tasks.

URL: https://openreview.net/forum?id=VyVkIucjWU

---

Title: MSpecTmol: A Multi-Modal Spectroscopic Learning Framework for Automated Molecular Structure Elucidation

Abstract: Spectroscopic techniques are indispensable for the elucidation of molecular structures, particularly for novel molecules with unknown configurations. However, a fundamental limitation of any single spectroscopic modality is that it provides an inherently circumscribed and fragmented view, capturing only specific facets of the complete molecular structure, which is often insufficient for unequivocal and robust characterization. Consequently, the integration of data from multiple spectroscopic sources is imperative to overcome these intrinsic limitations and achieve a comprehensive and accurate structural characterization. In this work, we introduce \textbf{MSpecTmol}, a novel \textbf{M}ulti-modal \textbf{Spec}trum information fusion learning framework for automated \textbf{Mol}ecular structure elucidation. By extending information bottleneck theory, our framework provides a principled and adaptive approach to fusing spectra. It designates a primary modality to extract core molecular features while leveraging auxiliary inputs to enrich the representation. To validate the end-to-end effectiveness of our framework, we design a two-fold evaluation: molecular substructure classification, to probe its discriminative power in identifying substructures, and molecular conformation reconstruction, which extends this knowledge to recover plausible 3D structures. Our results not only demonstrate state-of-the-art performance in molecular substructure classification but also achieve near-experimental accuracy (\textasciitilde 0.68\AA) in molecular conformation reconstruction. These findings underscore the model’s capacity to learn interpretable features aligned with chemical intuition, thereby paving the way for future advances in automated and reliable spectroscopic analysis. Our code can be found at \href{https://anonymous.4open.science/r/MspecTmol-6B4D}{https://anonymous.4open.science.}

URL: https://openreview.net/forum?id=kRhf5Z1Cy1

---

Title: CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

Abstract: Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural Audio Codec–based models such as CodecFormer and SDCodec are compute efficient but limited to fixed-class separation.
We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a lightweight transformer masker modulated by CLAP-derived FiLM parameters. A key empirical finding underlying this design is that the latent space of modern neural audio codecs is already sufficiently structured: source-specific information is partially disentangled in the NAC representation, allowing effective source extraction via masking alone, without explicit re-encoding.
Across six open-domain benchmarks under matched training and prompting protocols, CodecSep surpasses AudioSep in separation fidelity (SI\mbox{-}SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, Sudo rm-rf, CodecFormer, SDCodec). In code-stream deployments, it requires just 1.35~GMACs end-to-end—$\sim$54$\times$ less compute (25$\times$ architecture-only) than spectrogram-domain separators like AudioSep—while remaining fully bitstream-compatible.
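The CLAP-derived FiLM conditioning mentioned above reduces to a per-channel scale and shift of the masker's features. A toy sketch with fixed values (in the actual model, gamma and beta would be predicted from the text prompt embedding; the shapes and numbers here are illustrative):

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift every channel."""
    return gamma[None, :] * features + beta[None, :]

latents = np.ones((6, 2))   # (time, channels) stand-in for codec latents
out = film(latents,
           gamma=np.array([2.0, 0.0]),   # channel 1 amplified, channel 2 gated off
           beta=np.array([0.5, 1.0]))
```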

URL: https://openreview.net/forum?id=r63GX9hKhC

---

Title: MIDAS: Mosaic Input-Specific Differentiable Architecture Search

Abstract: Differentiable Neural Architecture Search (NAS) provides efficient, gradient-based methods for automatically designing neural networks, yet its adoption remains limited in practice. We present MIDAS, a novel approach that modernizes DARTS by replacing static architecture parameters with dynamic, input-specific parameters computed via self-attention.
To improve robustness, MIDAS (i) localizes the architecture selection by computing it separately for each spatial patch of the activation map, and (ii) introduces a parameter-free, topology-aware search space that models node connectivity and simplifies selecting the two incoming edges per node. We evaluate MIDAS on the DARTS, NAS-Bench-201, and RDARTS search spaces. In DARTS, it reaches 97.42% top-1 on CIFAR-10 and 83.38% on CIFAR-100. In NAS-Bench-201, it consistently finds globally optimal architectures. In RDARTS, it sets the state of the art on two of four search spaces on CIFAR-10. We further analyze why MIDAS works, showing that patchwise attention improves discrimination among candidate operations, and the resulting input-specific parameter distributions are class-aware and predominantly unimodal, providing reliable guidance for decoding.

URL: https://openreview.net/forum?id=9R6EBFTzuy

---

Title: On Hamming–Lipschitz Type Stability of the Subdominant (Minmax) Ultrametric: Theory and Simple Proofs

Abstract: We study the subdominant (minmax) ultrametric as an operator on pairwise data. Prior stability results show that this operator is non-expansive under uniform perturbations in the supremum norm and in the Gromov–Hausdorff sense, but they say nothing about how widely sparse, targeted edits can ripple through the hierarchy. We close this gap with a pair-count Lipschitz theory in Hamming space: we bound how many ultrametric entries can change, regardless of their magnitudes. The analysis is routed through the \emph{minimum spanning tree} (MST), which encodes the ultrametric as path bottlenecks. Our first theorem proves a locality principle: only pairs whose MST path crosses an edited or newly exposed cut can change, so the impact is confined to a union of fundamental cut rectangles. Building on this, we derive an instance-dependent $\ell_0$-type Lipschitz bound whose constant is determined entirely by the MST’s exposed cuts. We then show optimality by constructing cases where a single off-tree edit forces a quadratic number of changes, so no smaller universal constant is possible for our proposed Lipschitz constant. Finally, under a mild minimal-overlap condition, the upper bound on the number of changed entries of the ultrametric is order-tight, yielding a two-sided characterization of propagation. Conceptually, this advances a magnitude-versus-extent picture for ultrametric stability: classical results control how much entries move under uniform perturbation; our theory controls how far changes spread under sparse edits. Additionally, as a proof of concept, we derive a risk score from our Lipschitz constant that identifies vulnerable edges in the graph.
We use this score to drive two case studies: vulnerability maps of deep embeddings of CIFAR-10, ImageNet-10, and STL-10, where targeted edits to high-score edges cause far larger ultrametric and clustering changes than random edits with the same budget, and fragility maps in a superpixel-based single image segmentation that highlight load-bearing boundaries.

URL: https://openreview.net/forum?id=R4ASOCp3uM

---

Title: Extrapolation of Periodic Functions Using Binary Encoding of Continuous Numerical Values

Abstract: We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLP) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
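One plausible reading of the encoding, sketched below under the assumption that inputs are first normalized to [0, 1): each of the first n binary fraction digits is itself a periodic function of the input, with periods 1, 1/2, 1/4, and so on, which is consistent with the bit-phase representations the abstract describes (the paper's exact normalization and bit layout may differ).

```python
import numpy as np

def nb2e(x, n_bits=8):
    """Sketch of a normalized base-2 encoding: the first n_bits binary
    fraction digits of x in [0, 1). Bit k flips with period 2**-(k) of the
    input, giving an MLP inherently periodic input features."""
    bits, frac = [], float(x)
    for _ in range(n_bits):
        frac *= 2.0
        bit = int(frac)      # integer part is the next binary digit
        bits.append(bit)
        frac -= bit
    return np.array(bits, dtype=np.float32)

print(nb2e(0.625, n_bits=4))  # 0.625 = 0.1010 in base 2
```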

URL: https://openreview.net/forum?id=8BCFVGm998

---

Title: Contrastive VQ Priors for Multi-Class Plaque Segmentation via SAM Adaptation

Abstract: Accurate plaque subtype segmentation in coronary CT angiography (CCTA) is clinically relevant yet remains difficult in practice, where annotations are scarce and the visual evidence for non-calcified lesions is subtle and highly variable. Meanwhile, segmentation foundation models such as SAM provide strong robustness from large-scale pretraining, but their benefits do not reliably transfer to private CCTA tasks under naïve fine-tuning, especially for multi-class plaque taxonomy.
We present a targeted strategy to transfer SAM's segmentation robustness to a private CCTA setting by injecting a task-specific, texture-aware prior into the SAM feature stream.
Our framework is two-stage: (i) we learn a discrete latent prior from the private CCTA data using a vector-quantized autoencoder, and structure it with supervised contrastive learning to emphasize hard class boundaries; (ii) we fuse this prior into a SAM-based encoder through a query-based feature-aware cross-attention module, and decode with a multi-class head/decoder tailored for plaque taxonomy.
On the private CCTA benchmark, our approach consistently improves plaque subtype delineation and outperforms strong medical baselines (nnU-Net, TransUNet) as well as SAM-family adaptations (including Medical SAM variants and CAT-SAM). Ablations verify the roles of (a) contrastively-structured discrete priors, (b) attention-based retrieval versus additive fusion, and (c) multi-class decoding for SAM-style models.

URL: https://openreview.net/forum?id=5P7HfuejgL

---

Title: Fourier Neural Operators Explained: A Practical Perspective

Abstract: Partial differential equations (PDEs) govern a wide variety of dynamical processes in science and engineering, yet obtaining their numerical solutions often requires high-resolution discretizations and repeated evaluations of complex operators, leading to substantial computational costs. Neural operators have recently emerged as a powerful framework for learning mappings between function spaces directly from data, enabling efficient surrogate models for PDE systems. Among these architectures, the Fourier Neural Operator (FNO) has become the most influential and widely adopted due to its elegant spectral formulation, which captures global correlations through learnable transformations in Fourier space while remaining invariant to discretization and resolution. Despite their success, the practical use of FNOs is often hindered by an incomplete understanding among practitioners of their theoretical foundations, practical constraints, and implementation details, which can lead to their incorrect or unreliable application. This work presents a comprehensive and practice-oriented guide to FNOs, unifying their mathematical principles with implementation strategies. We provide an intuitive exposition to the concepts of operator theory and signal-processing that underlie the FNO, detail its spectral parameterization and the computational design of all its components, and address common misunderstandings encountered in the literature. The exposition is closely integrated with the NeuralOperator 2.0.0 library, offering modular state-of-the-art implementations that faithfully reflect the theory. By connecting rigorous foundations with practical insight, this guide aims to establish a clear and reliable framework for applying FNOs effectively across diverse scientific and engineering fields.

URL: https://openreview.net/forum?id=jqU59PGWRx

---

Title: Sharpness-Aware Minimization Driven by Local-Integrability Flatness

Abstract: Sharpness-Aware Minimization (SAM) improves generalization by optimizing for worst-case loss under parameter perturbations, but its max-based objective can be overly conservative, noise-sensitive, and reliant on smoothness assumptions that often fail in modern nonsmooth networks. We propose Lebesgue Sharpness-Aware Minimization (LSAM), a measure-theoretic alternative grounded in the Lebesgue Differentiation Theorem and local Sobolev regularity. Instead of minimizing the worst-case loss, LSAM minimizes the local average loss in a neighborhood of the parameters. This average-case notion of flatness favors Sobolev-regular Lebesgue points with low local loss oscillation and yields a generalization bound depending only on local integrability, a modulus of continuity, and a Sobolev-induced flatness term—without requiring Hessians or global Lipschitz conditions. To make LSAM practical, we introduce a Monte Carlo estimator of the local average that provides an unbiased gradient with modest overhead. Experiments on CIFAR-10/100 with ResNet, ResNeXt, WideResNet, and PyramidNet show that LSAM consistently finds flatter minima and improves test accuracy over both SGD and SAM.

URL: https://openreview.net/forum?id=29Zg9k5NCo

---

Title: MM-Eureka: Toward Stable Multimodal Reasoning via Rule-based Reinforcement Learning with Policy Drift Control

Abstract: Existing rule-based reinforcement learning (RL) methods that work well for text reasoning often collapse when extended to long-horizon multimodal reasoning settings. We identify a structural instability driven by ratio-based policy objectives under sparse multimodal rewards: importance sampling ratios in PPO-style objectives can amplify policy shifts, especially under negative advantages, which can trigger catastrophic mid-training collapse.
To make multimodal rule-based RL reliably trainable, we propose \textbf{CPGD (Clipped Policy Gradient Optimization with Policy Drift)}, a stability-oriented RL objective that removes ratio-induced amplification while maintaining proximal updates via an explicit policy drift regularizer and a numerically stable KL estimator. We provide both theoretical analysis and empirical evidence showing that ratio-based objectives can systematically amplify policy drift beyond intended bounds under sparse-reward multimodal settings, and demonstrate how CPGD addresses this through controlled policy updates.
To support diagnosis and evaluation under consistent settings, we introduce \textbf{MMK12}, a K12-level multimodal reasoning dataset with 15,616 training problems and 2,000 evaluation questions across mathematics, physics, chemistry, and biology, all with human-verified solutions. Using CPGD on MMK12, we train \textbf{MM-Eureka} models that demonstrate stable long-horizon training without collapse. CPGD achieves consistent performance improvements while maintaining training stability throughout, validating that the instability mechanism has been effectively addressed. We open-source our complete pipeline at \url{https://anonymous.4open.science/r/MM-EUREKA-C86D}

URL: https://openreview.net/forum?id=8y1ch6y24H

---

Title: On the (Non) Injectivity of Piecewise Linear Janossy Pooling

Abstract: Multiset functions, which are functions that map multisets to vectors, are a fundamental tool in the construction of neural networks for multisets and graphs. To guarantee that the vector representation of the multiset is faithful, it is often desirable to have multiset mappings that are both injective and bi-Lipschitz. Currently, there are several constructions of multiset functions achieving both these guarantees, leading to improved performance in some tasks but often also to higher compute time than standard constructions. Accordingly, it is natural to inquire whether simpler multiset functions achieving the same guarantees are available. In this paper, we take a large step towards giving a negative answer to this question. We consider the family of $k$-ary Janossy pooling, which includes many of the most popular multiset models, and prove that no piecewise linear Janossy pooling function can be injective. On the positive side, we show that when restricted to multisets without multiplicities, even simple deep-sets models suffice for injectivity and bi-Lipschitzness.
Finally, we provide empirical validation of our results through a multiset reconstruction task using a 2-ary Janossy pooling autoencoder, demonstrating a clear correlation between point separation and reconstruction accuracy.

URL: https://openreview.net/forum?id=WKJndBfSb8

---

Title: Understanding the Effects of Neuron Dominance in Deep Reinforcement Learning

Abstract: Recent studies in deep reinforcement learning have revealed that neural networks tend to lose their capacity to adapt to new targets over the course of training. The proliferation of inactive neurons, i.e., the so-called ``dormant neurons'', has been identified as one source of capacity loss. This paper investigates \textit{dominant neurons}, neurons whose activation values are significantly larger than average, as a potential cause for neuron dormancy. We demonstrate the existence of dominant neurons in a number of visual control tasks, and perform an analysis of the learning dynamics showing how dominant neurons can induce dormancy in the subsequent layer. To gain a better understanding of this phenomenon, we examine it through the lens of representation learning and establish its connection with representation collapse. Furthermore, this paper evaluates several mitigation strategies for dominant neurons across a variety of visual control tasks. Our results show that strategies that induce lower peak activation scores tend to exhibit greater representational capacity, lower dormant neuron percentage, and better performance. Among these mitigation strategies, LayerNorm with weight decay has the strongest performance, despite its simplicity. Moreover, switching the value learning loss from regression to a classification loss also significantly mitigates the neuron dominance issue and improves the performance. As a potential explanation of the effectiveness of classification losses, we provide an analysis that shows how a classification loss can prevent representation collapse.

URL: https://openreview.net/forum?id=VNV1h77UnH

---

Title: Entropy-Triggered Retraining as Nonequilibrium Entropy Production in Deployed Machine Learning Systems

Abstract: Machine learning models deployed in nonstationary environments inevitably experience performance degradation due to data drift. While numerous drift detection heuristics exist, most lack a dynamical interpretation and provide limited guidance on how retraining decisions should be balanced against operational cost. In this work, we propose an entropy-based retraining framework grounded in nonequilibrium statistical physics. Interpreting drift as probability flow governed by a Fokker-Planck equation, we quantify model-data mismatch using relative entropy and show that its time derivative admits an entropy-balance decomposition featuring a nonnegative entropy production term driven by probability currents. Guided by this theory, we implement an entropy-triggered retraining policy using an exponentially weighted moving-average (EWMA) control statistic applied to a streaming kernel density estimator of the Kullback-Leibler divergence. We evaluate this approach across multiple nonstationary data streams. In synthetic, financial, and web-traffic domains, entropy-based retraining achieves predictive performance comparable to frequent retraining while reducing retraining frequency by one to two orders of magnitude. However, in a challenging biomedical ECG setting, the entropy-based trigger underperforms the maximum-frequency baseline, highlighting limitations of feature-space entropy monitoring under complex label-conditional drift.
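The EWMA-triggered retraining policy can be sketched in a few lines (the smoothing factor, threshold, and post-retraining reset below are illustrative choices, not the paper's calibrated values; the KL stream stands in for the streaming kernel density estimate):

```python
def ewma_retrain_trigger(kl_stream, alpha=0.1, threshold=0.5):
    """Exponentially weighted moving average of streaming KL-divergence
    estimates; fire a retraining event whenever the smoothed statistic
    crosses the threshold, then reset (assuming retraining absorbs the drift)."""
    z, events = 0.0, []
    for t, kl in enumerate(kl_stream):
        z = alpha * kl + (1.0 - alpha) * z   # EWMA control statistic
        if z > threshold:
            events.append(t)                 # trigger retraining at step t
            z = 0.0
    return events

# Stationary stream (low KL) followed by sustained drift (high KL):
stream = [0.05] * 20 + [1.0] * 20
events = ewma_retrain_trigger(stream)
```

Because the EWMA smooths transient spikes, retraining fires only after drift persists, which is the mechanism behind the one-to-two-orders-of-magnitude reduction in retraining frequency the abstract reports.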

URL: https://openreview.net/forum?id=egS2O3CYXa

---
