Weekly TMLR digest for Mar 01, 2026


TMLR

Mar 1, 2026, 12:00:17 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

J2C Certification: \texttt{Complex-Edit}: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie

https://openreview.net/forum?id=lL1JR6dxG8

---


J2C Certification: Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki

https://openreview.net/forum?id=tgnTVmRybs

---


J2C Certification: Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm

Mansoor Davoodi, Setareh Maghsudi

https://openreview.net/forum?id=7N7sK5CFuP

---


J2C Certification: AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

Ioannis Tsingalis, Constantine Kotropoulos, Corentin Briat

https://openreview.net/forum?id=pZBQ7J37lk

---


Survey Certification: A Survey of Model Architectures in Information Retrieval

Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang Phillips, Jimmy Lin, Vivek Srikumar

https://openreview.net/forum?id=xAIbTbHRrX

---


J2C Certification: Thermodynamically Consistent Latent Dynamics Identification for Parametric Systems

Xiaolong He, Yeonjong Shin, Anthony Gruber, Sohyeon Jung, Kookjin Lee, Youngsoo Choi

https://openreview.net/forum?id=Qy3oLpRzpf

---


Featured Certification, J2C Certification: Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning

Hongseok Namkoong, Samuel Daulton, Eytan Bakshy

https://openreview.net/forum?id=J8PrWwvYX2

---


Accepted papers
===============


Title: Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning

Authors: Yingxiao Huo, Satya Prakash Dash, Radu Stoican, Samuel Kaski, Mingfei Sun

Abstract: Natural gradients have long been studied in deep reinforcement learning for their fast convergence properties and covariant weight updates. However, computing natural gradients requires inverting the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation to the full inverse FIM. We show theoretically that, under certain conditions, the rank-1 approximation to the inverse FIM converges faster than policy gradients and enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard trust-region and actor-critic baselines.
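The abstract does not spell out the rank-1 construction, but the computational appeal is easy to illustrate: for a hypothetical FIM of the form $\lambda I + uu^T$ (an assumption for this sketch, not necessarily the paper's parameterization), the Sherman-Morrison identity turns the matrix inversion into an O(d) update.

```python
import numpy as np

def natural_grad_rank1(grad, u, lam=1.0):
    """Natural gradient with the FIM approximated as lam*I + u u^T.

    Sherman-Morrison gives F^{-1} grad in O(d) without ever forming F:
    F^{-1} = (1/lam) I - (u u^T) / (lam * (lam + u^T u)).
    Illustrative only; the paper's rank-1 construction may differ.
    """
    coef = (u @ grad) / (lam * (lam + u @ u))
    return grad / lam - coef * u

rng = np.random.default_rng(0)
d = 5
g = rng.normal(size=d)
u = rng.normal(size=d)
F = np.eye(d) + np.outer(u, u)         # lam = 1
exact = np.linalg.solve(F, g)          # O(d^3) reference
approx = natural_grad_rank1(g, u)      # O(d) closed form
assert np.allclose(exact, approx)
```

The two computations agree to floating-point precision, which is the point: the expensive per-iteration inverse disappears.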

URL: https://openreview.net/forum?id=ko8Kn7TS6m

---

Title: Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping

Authors: Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

Abstract: While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different Transformer modules, including blocks, MLP layers, and attention layers, through the lens of layer dropping. Surprisingly, despite the pivotal role of attention mechanisms in distinguishing Transformers from other architectures, we find that a large portion of attention layers exhibit excessively high redundancy and can be pruned without degrading performance. For example, LLaMA-3-70B achieves a 43.4\% speedup with only a 1.8\% drop in performance by pruning half of its attention layers. In contrast, dropping MLP layers severely impairs the model's ability to distinguish between tokens, leading to catastrophic performance degradation. Moreover, our analysis reveals that attention layer redundancy persists not only throughout training but is also evident in randomly initialized models. We attribute this redundancy to three key factors that constrain representational updates from attention layers: sparse attention patterns, over-smoothed token embeddings, and the low representational magnitude of attention outputs. Overall, our findings offer valuable insights into the internal redundancy of Transformer architectures and provide practical guidance for designing more efficient LLMs. The code is released at: https://github.com/CASE-Lab-UMD/LLM-Drop.
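As a toy illustration of the layer-dropping protocol studied here (a stand-in, not the paper's code or a real LLM), skipping attention sublayers while keeping their residual streams might look like:

```python
# Toy transformer block stack: each block applies an "attention" sublayer
# (a fixed mixing matrix here) and an MLP sublayer, both with residual
# connections. Dropped attention layers simply become the identity.
import numpy as np

def forward(x, blocks, drop_attn=()):
    for i, (attn_w, mlp_w) in enumerate(blocks):
        if i not in drop_attn:          # prune attention by skipping it
            x = x + x @ attn_w
        x = x + np.tanh(x @ mlp_w)      # MLP sublayer always kept
    return x

rng = np.random.default_rng(1)
blocks = [(0.01 * rng.normal(size=(4, 4)), 0.01 * rng.normal(size=(4, 4)))
          for _ in range(4)]
x = rng.normal(size=(2, 4))
full = forward(x, blocks)
half = forward(x, blocks, drop_attn={1, 3})   # drop half the attention layers
assert full.shape == half.shape == (2, 4)
```

The paper's finding is that, in real LLMs, the `half` variant can stay close to `full` in task performance while running substantially faster.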

URL: https://openreview.net/forum?id=1I7PCbOPfe

---

Title: Proxy-Anchor and EVT-Driven Continual Learning Method for Generalized Category Discovery

Authors: Alireza Fathalizadeh, Roozbeh Razavi-Far

Abstract: Continual generalized category discovery has been introduced and studied in the literature as a method that aims to continuously discover and learn novel categories in incoming data batches while avoiding catastrophic forgetting of previously learned categories. A key component in addressing this challenge is the model’s ability to separate novel samples, where Extreme Value Theory (EVT) has been effectively employed. In this work, we propose a novel method that integrates EVT with proxy anchors to define boundaries around proxies using a probability of inclusion function, enabling the rejection of unknown samples. Additionally, we introduce a novel EVT-based loss function to enhance the learned representation, achieving superior performance compared to other deep-metric learning methods in similar settings. Using the derived probability functions, novel samples are effectively separated from previously known categories. However, category discovery within these novel samples can sometimes overestimate the number of new categories. To mitigate this issue, we propose a novel EVT-based approach to reduce the model size and discard redundant proxies. We also incorporate novel experience replay and knowledge distillation mechanisms during the continual learning stage to prevent catastrophic forgetting. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods in continual generalized category discovery scenarios.
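A common EVT choice for a "probability of inclusion" around a proxy is a Weibull survival function of the embedding distance; the sketch below is illustrative, and the scale/shape parameters are hypothetical rather than the paper's fitted values.

```python
import numpy as np

def p_inclusion(dist, scale, shape):
    """Weibull survival function as a probability of inclusion around a
    proxy anchor: close samples score near 1, far samples near 0.
    (Illustrative EVT form; the paper's fitted model may differ.)"""
    return np.exp(-(dist / scale) ** shape)

known = np.array([0.3, 0.4])           # distances of known-class samples
novel = np.array([2.5, 3.0])           # far samples should be rejected
p_known = p_inclusion(known, scale=1.0, shape=2.0)
p_novel = p_inclusion(novel, scale=1.0, shape=2.0)
assert np.all(p_known > 0.8)           # accepted as known
assert np.all(p_novel < 0.05)          # rejected as novel
```

Thresholding this probability is what turns a metric-learning embedding into an open-set rejector.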

URL: https://openreview.net/forum?id=P3Qe9yJRvf

---

Title: \texttt{Complex-Edit}: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Authors: Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie

Abstract: We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models’ ability to retain key elements from the input images; 3) Stronger models aren't necessarily more resilient towards higher complexity; 4) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 5) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 6) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises --- a phenomenon that intriguingly also manifests in the latest GPT-Image-1's outputs. The code for evaluation and data generation, and the test set is released at https://github.com/UCSC-VLAA/Complex-Edit.
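The "Chain-of-Edit" idea of composing independently generated atomic edits can be sketched without the GPT-4o machinery; the joining rule below is a hypothetical stand-in for the actual integration step.

```python
def compose_instructions(atomic_edits):
    """Chain atomic editing instructions into one complex instruction.
    Toy stand-in for the paper's Chain-of-Edit composition, which uses
    GPT-4o rather than string joining."""
    if len(atomic_edits) == 1:
        return atomic_edits[0]
    steps = "; then ".join(e.rstrip(".").lower() for e in atomic_edits)
    return steps[0].upper() + steps[1:] + "."

atomic = ["Replace the sky with a sunset.",
          "Add a red umbrella on the beach.",
          "Increase the overall contrast."]
complex_instr = compose_instructions(atomic)
assert complex_instr.count("; then ") == 2   # complexity = 3 atomic steps
```

Instruction complexity is then controllable simply by how many atomic steps are chained, which is what the benchmark varies.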

URL: https://openreview.net/forum?id=lL1JR6dxG8

---

Title: ODE-Constrained Generative Modeling of Cardiac Dynamics for 12-Lead ECG Synthesis

Authors: Yakir Yehuda, Kira Radinsky

Abstract: Generating realistic training data for supervised learning remains a significant challenge in artificial intelligence. This is particularly true in the synthesis of electrocardiograms (ECGs), where the objective is to develop a synthetic 12-lead ECG model. The primary challenge in this task lies in accurately modeling the intricate biological and physiological interactions among different ECG leads. Although mathematical process models have shed light on these dynamics, effectively incorporating this understanding into generative models is not straightforward. We introduce an innovative method that employs ordinary differential equations (ODEs) to enhance the fidelity of 12-lead ECG data generation. This approach integrates cardiac dynamics directly into the generative optimization process via a novel Euler Loss, producing biologically plausible data that respects real-world variability and inter-lead constraints. Empirical analysis on the G12EC and PTB-XL datasets demonstrates that augmenting training data with MultiODE-GAN yields consistent, statistically significant improvements in specificity across multiple cardiac abnormalities. This highlights the value of enforcing physiological coherence in synthetic medical data.
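The Euler Loss idea, as described, penalizes generated trajectories that deviate from one explicit-Euler step of the governing ODE; a minimal sketch with a stand-in dynamics function f (not the paper's cardiac model):

```python
import numpy as np

def euler_loss(x_t, x_next, f, h):
    """Penalize deviation of consecutive generated samples from one
    explicit-Euler step of dx/dt = f(x).
    Sketch of the 'Euler Loss' idea; the paper's exact form may differ."""
    pred = x_t + h * f(x_t)
    return float(np.mean((x_next - pred) ** 2))

f = lambda x: -x                      # stand-in for cardiac dynamics
x_t = np.array([1.0, 2.0])
h = 0.1
consistent = x_t + h * f(x_t)         # exactly one Euler step ahead
assert euler_loss(x_t, consistent, f, h) == 0.0   # ODE-consistent pair
assert euler_loss(x_t, x_t, f, h) > 0.0           # stalled trajectory penalized
```

Adding such a term to the generator's objective is how physiological coherence is enforced during training rather than checked afterward.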

URL: https://openreview.net/forum?id=4N56Pwwsti

---

Title: VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Authors: Mengzhuo Chen, Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

Abstract: Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples action utility learning from policy optimization by leveraging a pretrained Value Environment Model (VEM), which requires no live environment interaction during policy optimization. VEM predicts value-aligned action utilities directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., “Does this action advance the user’s goal?”). The framework operates in two stages: (1) pretraining VEM to learn action-level utility signals and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated across diverse benchmarks including Android-in-the-Wild for mobile apps and Multimodal-Mind2Web for web environments, VEM achieves state-of-the-art or highly competitive performance in both offline and online settings. It significantly outperforms environment-free baselines and matches or exceeds environment-based approaches, crucially without incurring interaction costs. Importantly, VEM demonstrates that robust, generalizable GUI agents can be trained efficiently using semantic-aware action utility prediction, proving effective across distinct interaction platforms like mobile and web. The code is available at https://github.com/microsoft/GUI-Agent-RL.

URL: https://openreview.net/forum?id=q1wLUxaBPn

---

Title: Repulsive Monte Carlo on the sphere for the sliced Wasserstein distance

Authors: Vladimir Petrovic, Rémi Bardenet, Agnes Desolneux

Abstract: In this paper, we consider the problem of computing the integral of a function on the unit sphere, in any dimension, using Monte Carlo methods. Although the methods we present are general, our guiding thread is the sliced Wasserstein distance between two measures on $\mathbb{R}^d$, which is precisely an integral over the $d$-dimensional sphere. The sliced Wasserstein distance (SW) has gained momentum in machine learning either as a proxy to the less computationally tractable Wasserstein distance, or as a distance in its own right, due in particular to its built-in alleviation of the curse of dimensionality. There have been recent numerical benchmarks of quadratures for the sliced Wasserstein, and our viewpoint differs in that we concentrate on quadratures where the nodes are repulsive, i.e. negatively dependent. Indeed, negative dependence can bring variance reduction when the quadrature is adapted to the integration task. Our first contribution is to extract and motivate quadratures from the recent literature on determinantal point processes (DPPs) and repelled point processes, as well as repulsive quadratures from the literature specific to the sliced Wasserstein distance. We then numerically benchmark these quadratures. Moreover, we analyze the variance of the UnifOrtho estimator, an orthogonal Monte Carlo estimator. Our analysis sheds light on UnifOrtho's success for the estimation of the sliced Wasserstein in large dimensions, as well as counterexamples from the literature. Our final recommendation for the computation of the sliced Wasserstein distance is to use randomized quasi-Monte Carlo in low dimensions and UnifOrtho in large dimensions. DPP-based quadratures only shine when quasi-Monte Carlo also does, while repelled quadratures show moderate variance reduction in general, but more theoretical effort is needed to make them robust.
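The basic Monte Carlo estimator under discussion, together with an UnifOrtho-style orthonormal frame of projection directions, can be sketched as follows (equal-size point clouds assumed for simplicity; this is a generic sketch, not the paper's code):

```python
import numpy as np

def sliced_w1(X, Y, dirs):
    """Monte Carlo sliced-W1: average the 1-D Wasserstein-1 distance of
    projected samples over a set of directions (equal sample sizes, so
    1-D W1 reduces to a sorted-sample comparison)."""
    vals = [np.mean(np.abs(np.sort(X @ u) - np.sort(Y @ u))) for u in dirs]
    return float(np.mean(vals))

def unif_ortho(d, rng):
    """UnifOrtho-style nodes: one uniformly random orthonormal frame."""
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q.T                         # d mutually orthogonal directions

rng = np.random.default_rng(0)
d, n = 3, 64
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)) + 2.0      # shifted point cloud
dirs = unif_ortho(d, rng)
assert sliced_w1(X, X, dirs) == 0.0
assert sliced_w1(X, Y, dirs) > 0.0
```

Swapping `unif_ortho` for i.i.d. directions recovers the vanilla estimator; the paper's question is which node distributions reduce variance.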

URL: https://openreview.net/forum?id=JSiTmB6Ehu

---

Title: Online Learning with Multiple Fairness Regularizers via Graph-Structured Feedback

Authors: Quan Zhou, Jakub Marecek, Robert Noel Shorten

Abstract: There is an increasing need to enforce multiple, often competing, measures of fairness within automated decision systems. The appropriate weighting of these fairness objectives is typically unknown a priori, may change over time and, in our setting, must be learned adaptively through sequential interactions. In this work, we address this challenge in a bandit setting, where decisions are made with graph-structured feedback.

URL: https://openreview.net/forum?id=y8iWuDZtEw

---

Title: Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey

Authors: Fatemeh Shahhosseini, Arash Marioriyad, Ali Momen, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban, Shaghayegh Haghjooy Javanmard

Abstract: Scientific idea generation is central to discovery, requiring the joint satisfaction of novelty and scientific soundness. Unlike standard reasoning or general creative generation, scientific ideation is inherently open-ended and multi-objective, making its automation particularly challenging. Recent advances in large language models (LLMs) have enabled the generation of coherent and plausible scientific ideas, yet the nature and limits of their creative capabilities remain poorly understood. This survey provides a structured synthesis of methods for LLM-driven scientific ideation, focusing on how different approaches trade off novelty and scientific validity. We organize existing methods into five complementary families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. To interpret their contributions, we adopt two complementary creativity frameworks: Boden’s taxonomy to characterize the expected level of creative novelty, and Rhodes’ 4Ps framework to analyze the aspects or sources of creativity emphasized by each method. By aligning methodological developments with cognitive creativity frameworks, this survey clarifies the evaluation landscape and identifies key challenges and directions for reliable and systematic LLM-based scientific discovery.

URL: https://openreview.net/forum?id=9lWojZKMjt

---

Title: Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

Authors: Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, Pan Xu

Abstract: We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on Decision Transformer (DT) type frameworks, which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works address the dynamics shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy is not directly applicable in RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented (REAG) method for DT type frameworks, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide a theoretical analysis demonstrating that the RCSL policy learned from REAG achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations, $REAG^{*}_{Dara}$ and $REAG^{*}_{MV}$. Thorough experiments on D4RL datasets and various DT-type baselines demonstrate that our methods consistently enhance the performance of DT type frameworks in off-dynamics RL.
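The return-distribution alignment at the heart of the method can be illustrated with simple quantile matching; this is a hedged stand-in for the idea, not necessarily the exact REAG augmentation.

```python
import numpy as np

def augment_returns(src_returns, tgt_returns):
    """Map source-domain returns onto the target return distribution by
    quantile matching: each source return is replaced by the target
    quantile at its own rank. (Simple stand-in for REAG's distribution
    alignment; the paper's exact augmentation may differ.)"""
    ranks = np.argsort(np.argsort(src_returns))
    q = (ranks + 0.5) / len(src_returns)
    return np.quantile(tgt_returns, q)

src = np.array([1.0, 5.0, 3.0])                  # source-domain returns
tgt = np.array([10.0, 20.0, 30.0, 40.0])         # target-domain returns
aug = augment_returns(src, tgt)
assert aug.min() >= tgt.min() and aug.max() <= tgt.max()
assert np.argmax(aug) == np.argmax(src)          # relative ordering preserved
```

The augmented returns live on the target scale while preserving the source ordering, which is what lets a return-conditioned policy trained on source trajectories accept target-style return prompts.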

URL: https://openreview.net/forum?id=QDVOr5J9Xp

---

Title: Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

Authors: Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki

Abstract: Mixture of Vision Encoders (MoVE) has emerged as a powerful approach to enhance the fine-grained visual understanding of multimodal large language models (MLLMs), improving their ability to handle tasks such as complex optical character recognition and scene understanding. Despite these advances, effectively combining diverse encoders and their visual tokens, while also scaling to high-resolution inputs, remains an open challenge. In this work, we conduct a systematic study of fusion designs for MoVE-based MLLMs, highlighting principles for token-level integration across complementary encoders. Our study shows that a lightweight recipe consisting of post-adaptation fusion with independent projectors, tile-level sequence interleaving, and dynamic tiling with global context delivers strong performance on diverse benchmarks. We integrate these principles into a simple and effective architecture that we call LEO. Extensive evaluation on 11 vision–language benchmarks demonstrates that LEO achieves better results on the majority of tasks compared to existing MoVE-based approaches. Furthermore, LEO adapts effectively to the specialized domain of autonomous driving without altering its architecture or training recipe, achieving competitive performance against established baselines and thereby highlighting its ability to generalize. The code is available at https://github.com/Mozhgan91/LEO.

URL: https://openreview.net/forum?id=tgnTVmRybs

---

Title: STEALTH: Secure Transformer for Encrypted Alignment of Latent Text Embeddings via Semantic Isomorphism Enforcement (SIE) Loss Function

Authors: Nafew Azim, Nabeel Mohammed

Abstract: The pervasive use of large language models (LLMs) on sensitive data presents a critical privacy challenge, as traditional encryption renders data unusable for inference. We introduce STEALTH, a 120M secure transformer framework designed to process encrypted text while preserving its semantic utility under an authorized-key threat model (no decryption or side-channel access). The core innovation of STEALTH is the Semantic Isomorphism Enforcement (SIE) loss function, a loss that trains the model to learn a topology-preserving mapping between encrypted text embeddings and their original plaintext latent space. This encourages preservation of semantic relationships and topological structure in the encrypted domain. Using retrieval-based reconstruction from a domain-aligned plaintext corpus, STEALTH achieves near-perfect semantic retrieval (BLEU score of 1.0 under full-corpus coverage in our experiments) and enables accurate privacy-preserving clustering on encrypted embeddings. We evaluate STEALTH across 44 datasets spanning general language understanding, healthcare, finance, legal, e-commerce, programming, content analysis, reading comprehension, and corporate communication domains with 16 encryption schemes (704 experimental conditions), establishing a comprehensive benchmark for privacy-preserving NLP on encrypted text. Performance depends on domain alignment between encrypted inputs and the indexed plaintext corpus. Our results demonstrate that, with well-aligned domain indexes and retrieval support, models can perform effective NLP on encrypted data without direct decryption.

URL: https://openreview.net/forum?id=73PV17dVCM

---

Title: Don't Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning

Authors: Yuehan Qin, Li Li, Yi Nian, Xinyan Velocity Yu, Yue Zhao, Xuezhe Ma

Abstract: Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses. However, they can produce hallucinated outputs, especially when a user query includes one or more false premises—claims that contradict established facts. Such premises can mislead LLMs into offering fabricated or misleading details. Existing approaches include pretraining, fine-tuning, and inference-time techniques that often rely on access to logits or address hallucinations after they occur. These methods tend to be computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting their efficiency in real-time applications. We propose a retrieval-based framework that identifies and addresses false premises before generation. Our method first transforms a user’s query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise using factual sources. Finally, we incorporate the verification results into the LLM’s prompt to maintain factual consistency in the final output. Experiments show that this approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.
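A skeletal version of the premise-verification step, with a toy in-memory fact store standing in for the paper's retrieval over factual sources:

```python
def verify_premises(premises, facts):
    """Check each extracted premise against a fact store before
    generation; the resulting report would be folded into the LLM's
    prompt. (Toy stand-in for the paper's RAG-based verifier, which
    retrieves evidence rather than doing exact-match lookup.)"""
    return [(p, facts.get(p, "unverified")) for p in premises]

facts = {"water boils at 100C at sea level": "true",
         "the sun orbits the earth": "false"}
report = verify_premises(["the sun orbits the earth",
                          "cats are reptiles"], facts)
assert report[0][1] == "false"        # false premise caught pre-generation
assert report[1][1] == "unverified"   # unknown claims flagged, not invented
```

The key design point survives even in this toy form: verification happens before generation, so the model is told about a false premise instead of elaborating on it.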

URL: https://openreview.net/forum?id=BDxStRGWba

---

Title: KASPER: Kolmogorov Arnold Networks for Stock Prediction and Explainable Regimes

Authors: Vidhi Oad, Param Pathak, Nouhaila Innan, Shalini Devendrababu, Muhammad Shafique

Abstract: Forecasting in financial markets remains a significant challenge due to their nonlinear and regime-dependent dynamics. Traditional deep learning models, such as long short-term memory networks and multilayer perceptrons, often struggle to generalize across shifting market conditions, highlighting the need for a more adaptive and interpretable approach. To address this, we introduce Kolmogorov–Arnold networks for stock prediction and explainable regimes (KASPER), a novel framework that integrates regime detection, sparse spline-based function modeling, and symbolic rule extraction. The framework identifies hidden market conditions using a Gumbel-Softmax-based mechanism, enabling regime-specific forecasting. For each regime, it employs Kolmogorov–Arnold networks (KANs) with sparse spline activations to capture intricate price behaviors while maintaining robustness. Interpretability is achieved through symbolic learning based on Monte Carlo Shapley values, which extracts human-readable rules tailored to each regime. Applied to real-world financial time series from Yahoo Finance, the model achieves an $R^2$ value of 0.89, a Sharpe Ratio of 12.02, and a mean squared error as low as 0.0001, outperforming existing methods. This research establishes a new direction for regime-aware, transparent, and robust forecasting in financial markets.
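The Gumbel-Softmax regime-assignment mechanism mentioned above can be sketched in isolation (the logits and temperature below are hypothetical, not KASPER's learned values):

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Soft, sampling-based regime assignment: perturb logits with
    Gumbel noise, then apply a temperature-scaled softmax. Low tau
    pushes the output toward a one-hot regime choice.
    (Toy version of KASPER's regime-detection mechanism.)"""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1)
    z = (logits + g) / tau
    e = np.exp(z - z.max())            # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.1, -1.0])    # three hypothetical market regimes
soft = gumbel_softmax(logits, tau=0.5, rng=rng)
assert np.isclose(soft.sum(), 1.0)     # valid regime distribution
assert soft.shape == (3,)
```

Because the assignment stays differentiable, regime detection can be trained jointly with the per-regime KAN forecasters.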

URL: https://openreview.net/forum?id=PD4jGJQtL8

---

Title: On Adversarial Attacks In Acoustic Localization

Authors: Tamir Shor, Chaim Baskin, Alex M. Bronstein

Abstract: Multi-rotor aerial vehicles (drones) are increasingly deployed across diverse domains, where accurate navigation is critical. The limitations of vision-based methods under poor lighting and occlusions have driven growing interest in acoustic sensing as an alternative. However, the security of acoustic-based localization has not been examined. Adversarial attacks pose a serious threat, potentially leading to mission-critical failures and safety risks. While prior research has explored adversarial attacks on vision-based systems, no work has addressed the acoustic setting. In this paper, we present the first comprehensive study of adversarial robustness in acoustic drone localization. We formulate white-box projected gradient descent (PGD) attacks from an external sound source and show their significant impact on localization accuracy. Furthermore, we propose a novel defense algorithm based on rotor phase modulation, capable of effectively recovering clean signals and mitigating adversarial degradation. Our results highlight both the vulnerability of acoustic localization and the potential for robust defense strategies.
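The white-box PGD attack referenced above follows a standard recipe: repeated signed-gradient steps on the attacker's loss, projected back into an $\epsilon$-ball. The sketch below uses a toy quadratic localization loss, not the paper's acoustic formulation.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps, alpha, steps):
    """Generic L_inf PGD: signed-gradient ascent on the attacker's loss,
    with the perturbation clipped back into the eps-ball each step.
    (Standard recipe; the paper's acoustic attack is more involved.)"""
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta = delta + alpha * np.sign(grad_fn(x + delta))
        delta = np.clip(delta, -eps, eps)   # projection onto the eps-ball
    return x + delta

# Toy loss: attacker pushes the location estimate away from the target.
target = np.array([1.0, -2.0])
loss_grad = lambda z: z - target        # gradient of 0.5*||z - target||^2
x = target + 0.01                       # near-correct initial estimate
adv = pgd_attack(x, loss_grad, eps=0.3, alpha=0.1, steps=10)
assert np.all(np.abs(adv - x) <= 0.3 + 1e-12)   # perturbation stays bounded
```

The projection step is what makes the attack "bounded" and hence physically plausible as an external sound source of limited power.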

URL: https://openreview.net/forum?id=Nxm5xXoLFb

---

Title: Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm

Authors: Mansoor Davoodi, Setareh Maghsudi

Abstract: Multi-armed bandit (MAB) problems are widely applied to online optimization tasks that require balancing exploration and exploitation. In practical scenarios, these tasks often involve multiple conflicting objectives, giving rise to multi-objective multi-armed bandits (MO-MAB). Existing MO-MAB approaches predominantly rely on the Pareto regret metric introduced in \citet{drugan2013designing}. However, this metric has notable limitations, particularly in accounting for all Pareto-optimal arms simultaneously. To address these challenges, we propose a novel and comprehensive regret metric that ensures balanced performance across conflicting objectives. Additionally, we introduce the concept of \textit{Efficient Pareto-Optimal} arms, which are specifically designed for online optimization. Based on our new metric, we develop a two-phase MO-MAB algorithm that achieves sublinear regret for both Pareto-optimal and efficient Pareto-optimal arms.
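The Pareto-optimality notion underlying the regret definition can be made concrete: an arm is Pareto-optimal if no other arm is at least as good in every objective and strictly better in at least one. A small sketch (mean vectors below are hypothetical):

```python
import numpy as np

def pareto_front(means):
    """Indices of Pareto-optimal arms among mean-reward vectors: arm i is
    dominated if some other arm is >= in every objective and > in one."""
    front = []
    for i, m in enumerate(means):
        dominated = any(np.all(o >= m) and np.any(o > m)
                        for j, o in enumerate(means) if j != i)
        if not dominated:
            front.append(i)
    return front

# 4 arms, 2 conflicting objectives (e.g., reward vs. fairness)
means = np.array([[0.9, 0.1],
                  [0.5, 0.5],
                  [0.1, 0.9],
                  [0.4, 0.4]])          # strictly dominated by arm 1
assert pareto_front(means) == [0, 1, 2]
```

The paper's critique of Pareto regret is precisely that a learner can concentrate on one front arm (say arm 0) while ignoring the rest of the front; their metric rewards balanced coverage instead.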

URL: https://openreview.net/forum?id=7N7sK5CFuP

---

Title: Message-Passing GNNs Fail to Approximate Sparse Triangular Factorizations

Authors: Vladislav Trifonov, Ekaterina Muravleva, Ivan Oseledets

Abstract: Graph Neural Networks (GNNs) have been proposed as a tool for learning sparse matrix preconditioners, which are key components in accelerating linear solvers. We present theoretical and empirical evidence that message-passing GNNs are fundamentally incapable of approximating sparse triangular factorizations for classes of matrices for which high-quality preconditioners exist but require non-local dependencies. To illustrate this, we construct a set of baselines using both synthetic matrices and real-world examples from the SuiteSparse collection. Across a range of GNN architectures, including Graph Attention Networks and Graph Transformers, we observe low cosine similarity ($\leq0.7$ in key cases) between predicted and reference factors. Our theoretical and empirical results suggest that architectural innovations beyond message-passing are necessary for applying GNNs to scientific computing tasks such as matrix factorization. Moreover, experiments demonstrate that overcoming non-locality alone is insufficient. Tailored architectures are necessary to capture the required dependencies since even a completely non-local Global Graph Transformer fails to match the proposed baselines.

URL: https://openreview.net/forum?id=YIr9SzD3C9

---

Title: AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

Authors: Ioannis Tsingalis, Constantine Kotropoulos, Corentin Briat

Abstract: A novel regularization technique, AdaCubic, is proposed that adapts the weight of the cubic term. The heart of AdaCubic is an auxiliary optimization problem with cubic constraints that dynamically adjusts the weight of the cubic term in Newton’s cubic regularized method. We use Hutchinson’s method to approximate the Hessian matrix, thereby reducing computational cost. We demonstrate that AdaCubic inherits the cubically regularized Newton method’s local convergence guarantees. Our experiments on Computer Vision, Natural Language Processing, and Signal Processing tasks demonstrate that AdaCubic outperforms or competes with several widely used optimizers. Unlike other adaptive algorithms that require hyperparameter fine-tuning, AdaCubic is evaluated with a fixed set of hyperparameters, making it an attractive option for researchers and practitioners in settings where fine-tuning is infeasible. To our knowledge, AdaCubic is the first optimizer to leverage cubic regularization in scalable deep learning applications.
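Hutchinson-style probes, which the authors use to cheaply approximate Hessian information, recover the Hessian diagonal from Hessian-vector products alone, since E[z ⊙ Hz] = diag(H) for Rademacher z. A sketch on a toy quadratic (the paper's exact use of the estimate may differ):

```python
import numpy as np

def hutchinson_diag(hvp, d, n_samples, rng):
    """Estimate diag(H) from Hessian-vector products with Rademacher
    probes z in {-1, +1}^d, using E[z * (H z)] = diag(H)."""
    est = np.zeros(d)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=d)
        est += z * hvp(z)
    return est / n_samples

# Toy quadratic with a known Hessian; in deep learning, hvp comes from
# autodiff and H is never materialized.
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])
hvp = lambda v: H @ v
rng = np.random.default_rng(0)
est = hutchinson_diag(hvp, 2, 500, rng)
assert np.allclose(est, [2.0, 3.0], atol=0.3)   # close to the true diagonal
```

Only matrix-vector products are needed, which is what makes second-order information affordable at deep-learning scale.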

URL: https://openreview.net/forum?id=pZBQ7J37lk

---

Title: A Survey of Model Architectures in Information Retrieval

Authors: Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang Phillips, Jimmy Lin, Vivek Srikumar

Abstract: The period from 2019 to the present has represented one of the biggest paradigm shifts in information retrieval (IR) and natural language processing (NLP), culminating in the emergence of powerful large language models (LLMs) from 2022 onward. Methods leveraging pretrained encoder-only models (e.g., BERT) and decoder-only generative LLMs have outperformed many previous approaches, particularly excelling in zero-shot scenarios and complex reasoning tasks. Our survey study investigates the evolution of model architectures in IR, focusing on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. The review intentionally separates architectural considerations from training methodologies, in order to provide a focused analysis of structural innovations in IR systems. We trace the development from traditional term-based methods to modern neural approaches, particularly discussing the impact of transformer-based models and subsequent large language models (LLMs). We conclude with a forward-looking discussion of emerging challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal, multilingual data, and adaptation to novel application domains such as autonomous search agents that might be the next-generation paradigm of IR.

URL: https://openreview.net/forum?id=xAIbTbHRrX

---

Title: Simplifying Optimal Transport through Schatten-$p$ Regularization

Authors: Tyler Maunu

Abstract: We propose a new general framework for recovering low-rank structure in optimal transport using Schatten-$p$ norm regularization. Our approach extends existing methods that promote sparse and interpretable transport maps or plans, while providing a unified and principled family of convex programs that encourage low-dimensional structure. The convexity of our formulation enables direct theoretical analysis: we derive optimality conditions and prove recovery guarantees for low-rank couplings, barycentric displacements, and cross-covariances in simplified settings. To efficiently solve the proposed program, we develop a mirror descent algorithm with convergence guarantees in the convex setting. Experiments on synthetic and real data demonstrate the method’s efficiency, scalability, and ability to recover low-rank transport structures. In particular, we demonstrate its utility on a machine-learning task in learning transport between high-dimensional cell perturbations for biological applications. All code is publicly available at https://github.com/twmaunu/schatten_ot.
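The Schatten-$p$ norm at the heart of the regularizer is just the $\ell_p$ norm of the singular values; a two-line sketch makes the familiar special cases explicit:

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten-p norm: the l_p norm of the singular values of A.
    p=1 is the nuclear norm (the classic low-rank-promoting penalty),
    p=2 is the Frobenius norm."""
    s = np.linalg.svd(A, compute_uv=False)
    return float((s ** p).sum() ** (1.0 / p))

A = np.diag([3.0, 4.0])                          # singular values 3 and 4
assert np.isclose(schatten_norm(A, 1), 7.0)      # nuclear norm
assert np.isclose(schatten_norm(A, 2), 5.0)      # Frobenius norm
assert np.isclose(schatten_norm(A, 2), np.linalg.norm(A, "fro"))
```

Penalizing this norm of a coupling or displacement matrix is what encourages the low-dimensional transport structure the paper recovers.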

URL: https://openreview.net/forum?id=DIawkTG5VH

---

Title: When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models

Authors: Raphaël Razafindralambo, Rémy Sun, Damien Garreau, Frederic Precioso, Pierre-Alexandre Mattei

Abstract: Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules using Deep Ensembles and Monte Carlo Dropout on CIFAR-10 and FFHQ. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance). Our Python code is available at https://github.com/rarazafin/score_diffusion_ensemble.
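
As a sketch of the simplest aggregation rule studied, a uniform average of the members' score estimates (function names are illustrative, not from the paper's code):

```python
def average_scores(score_fns, x, t):
    """Deep-ensemble aggregation: average the members' score
    estimates s_i(x, t) pointwise at input x and diffusion time t."""
    scores = [s(x, t) for s in score_fns]
    return [sum(vals) / len(scores) for vals in zip(*scores)]

# Toy 1-D members: each estimates the standard-Gaussian score -x
# with a different systematic bias b.
members = [lambda x, t, b=b: [-(xi + b) for xi in x] for b in (-0.1, 0.0, 0.1)]
print(average_scores(members, [1.0, 2.0], 0.5))  # biases cancel: ~[-1.0, -2.0]
```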

URL: https://openreview.net/forum?id=4iRx9b0Csu

---

Title: Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Reinforcement Learning

Authors: Alakh Sharma, Gaurish Trivedi, Kartikey Singh Bhandari, Yash Sinha, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa

Abstract: Scalable multi-agent reinforcement learning (MARL) remains a central challenge for AI. Existing population-based methods, like Policy-Space Response Oracles (PSRO), require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta-Solver (GEMS), a surrogate-free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative-weights meta-dynamics, and a model-free empirical-Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage-based trust-region objective, eliminating the need to store and train separate actors. We evaluate GEMS in a variety of two-player and multi-player games, such as the Deceptive Messages Game, Kuhn Poker, and the Multi-Particle environment. We find that GEMS is up to ~$\mathbf{6\times}$ faster and uses $\mathbf{1.3\times}$ less memory than PSRO, while also reaping higher rewards. These results demonstrate that GEMS retains the game-theoretic guarantees of PSRO while overcoming its fundamental inefficiencies, enabling scalable multi-agent learning in multiple domains.
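
The multiplicative-weights meta-dynamics the abstract mentions can be sketched in a few lines (the step size and payoffs here are illustrative):

```python
import math

def multiplicative_weights(dist, payoffs, eta=0.5):
    """One multiplicative-weights step over a finite policy set:
    up-weight policies with higher estimated payoff, renormalise."""
    new = [w * math.exp(eta * g) for w, g in zip(dist, payoffs)]
    total = sum(new)
    return [w / total for w in new]

dist = [1 / 3, 1 / 3, 1 / 3]
for _ in range(50):
    dist = multiplicative_weights(dist, [0.0, 1.0, 0.2])
print(dist)  # mass concentrates on the highest-payoff policy
```

In GEMS the payoffs would come from unbiased Monte Carlo rollouts rather than being fixed constants.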

URL: https://openreview.net/forum?id=ZwEJsXoBHD

---

Title: SyntheRela: A Benchmark For Synthetic Relational Database Generation

Authors: Valter Hudovernik, Martin Jurkovic, Erik Štrumbelj

Abstract: Synthesizing relational databases has started to receive more attention from researchers, practitioners, and industry. The task is more difficult than synthesizing a single table due to the added complexity of relationships between tables. For the same reasons, benchmarking methods for synthesizing relational databases introduces new challenges. Our work is motivated by the lack of an empirical evaluation of state-of-the-art methods and by gaps in the understanding of how such an evaluation should be done. We review related work on relational database synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data. We combine best practices, a novel robust detection metric, and a novel approach to evaluating utility with graph neural networks into a benchmarking tool. We use this benchmark to compare 6 open-source methods over 8 real-world databases, with a total of 39 tables. The open-source SyntheRela benchmark is available on GitHub with a public leaderboard.

URL: https://openreview.net/forum?id=Mi8XioazWy

---

Title: From Clutter to Clarity: Visual Recognition through Foveated Object-Centric Learning (FocL)

Authors: Amitangshu Mukherjee, Deepak Ravikumar, Kaushik Roy

Abstract: Human active vision integrates spatial attention (dorsal) and object recognition (ventral) as distinct information processing pathways. Rapid eye movements focus perception on task-relevant regions while filtering out background clutter. Mimicking this ventral specialization, we introduce FocL (Foveated Object-Centric Learning), a training strategy that biases image classification models toward label-consistent object regions by replacing full images with foveated crops. Standard training often relies on spurious correlation between label and background, increasing memorization of hard examples in the tail of the difficulty distribution. FocL simulates saccades by jittering fixation points and extracting foveated glimpses from annotated bounding boxes. This object-first restructuring reduces non-foreground contamination and lowers mean training loss. FocL reduces memorization, lowering mean cumulative sample loss by approximately 65% and making nearly all high-memorization samples (top 1%) easier to learn. It also increases the mean $\ell_2$ adversarial perturbation distance required to flip predictions by approximately 62%. On ImageNet-V1, FocL achieves up to 11% higher accuracy on oracle crops. When paired with the Segment Anything Model (SAM) as a dorsal proposal generator, FocL provides around a 7% gain on ImageNet-V1 and up to 8% under natural distribution shift (ImageNet-V2). Extending this setup to COCO, FocL improves cross-domain mAP by 3--4 points without any target-domain training. Finally, given object localization (bounding boxes), FocL reaches higher accuracy using roughly 56% fewer training images, offering a simple path to more robust and efficient visual recognition.
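
A hypothetical sketch of the foveation step, assuming a box in (x0, y0, x1, y1) pixel coordinates; the crop size, jitter rule, and helper name are illustrative, not the paper's implementation:

```python
import random

def foveated_crop(image, box, jitter=0.1, rng=random):
    """Extract a foveated glimpse: jitter the fixation point around
    the annotated box centre (simulating a saccade), then crop a
    box-sized window centred on the fixation, clipped to the image."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = box
    bw, bh = x1 - x0, y1 - y0
    fx = (x0 + x1) / 2 + rng.uniform(-jitter, jitter) * bw
    fy = (y0 + y1) / 2 + rng.uniform(-jitter, jitter) * bh
    left, top = max(0, int(fx - bw / 2)), max(0, int(fy - bh / 2))
    right, bottom = min(w, left + bw), min(h, top + bh)
    return [row[left:right] for row in image[top:bottom]]

img = [[(r, c) for c in range(8)] for r in range(8)]
crop = foveated_crop(img, (2, 2, 6, 6), jitter=0.0)
print(len(crop), len(crop[0]))  # 4 4
```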

URL: https://openreview.net/forum?id=kVS7sMlv7P

---

Title: LVLM-Count: Enhancing the Counting Ability of Large Vision-Language Models

Authors: Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis

Abstract: Counting is a fundamental operation for various real-world visual tasks, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while their performance may be less prone to error for small numbers of objects, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs’ counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could otherwise lead to repetitive counting—a common issue in a naive divide-and-conquer implementation. We demonstrate the effectiveness of this approach across various datasets and benchmarks, establishing it as a valuable reference for evaluating future solutions.
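
The divide-and-conquer idea can be sketched on toy point data (1-D tiling for brevity; the paper operates on image regions and adds a mechanism to keep objects intact at tile boundaries):

```python
def count_by_tiles(points, width, n_tiles):
    """Divide-and-conquer counting: partition the x-axis into tiles
    and sum per-tile counts. Assigning each object to exactly one
    tile (by its centre) avoids the double counting that naive
    overlapping splits would cause."""
    edges = [width * i / n_tiles for i in range(n_tiles + 1)]
    counts = []
    for lo, hi in zip(edges, edges[1:]):
        counts.append(sum(1 for x, _ in points if lo <= x < hi))
    return counts, sum(counts)

pts = [(0.5, 0), (3.2, 1), (3.9, 2), (7.5, 0)]
per_tile, total = count_by_tiles(pts, width=8, n_tiles=4)
print(per_tile, total)  # [1, 2, 0, 1] 4
```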

URL: https://openreview.net/forum?id=G1i9MUQj63

---

Title: Thermodynamically Consistent Latent Dynamics Identification for Parametric Systems

Authors: Xiaolong He, Yeonjong Shin, Anthony Gruber, Sohyeon Jung, Kookjin Lee, Youngsoo Choi

Abstract: We propose an efficient thermodynamics-informed latent space dynamics identification (tLaSDI) framework for the reduced-order modeling of parametric nonlinear dynamical systems. This framework integrates autoencoders for dimensionality reduction with the newly developed parametric GENERIC formalism-informed neural networks (pGFINNs), which enable efficient learning of parametric latent dynamics while preserving key thermodynamic principles, such as free energy conservation and entropy generation, across the parameter space. To further enhance model performance, a physics-informed active learning strategy is incorporated, leveraging a greedy, residual-based error indicator to adaptively sample informative training data, outperforming uniform sampling at equivalent computational cost. Numerical experiments on the Burgers' equation and the 1D/1V Vlasov-Poisson equation demonstrate that the proposed method achieves up to 2,495x speed-up over the full-order numerical baseline with 1-3% relative errors, as well as significant reductions in training (50-90%) and inference (57-61%) cost.
Moreover, the learned latent space dynamics reveal the underlying thermodynamic behavior of the system, offering valuable insights into the physical-space dynamics. Code is available at the repository: https://github.com/xiaolong7/pGFINN-tLaSDI.

URL: https://openreview.net/forum?id=Qy3oLpRzpf

---

Title: A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection

Authors: Randolph Linderman, Noah Cowan, Yiran Chen, Scott Linderman

Abstract: Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we show a formal relationship between Bayesian nonparametric models and the relative Mahalanobis distance score (RMDS), a commonly used method for OOD detection. Building on this connection, we propose Bayesian nonparametric mixture models with hierarchical priors that generalize the RMDS. We evaluate these models on the OpenOOD detection benchmark and show that Bayesian nonparametric methods can improve upon existing OOD methods, especially in regimes where training classes differ in their covariance structure and where there are relatively few data points per class.
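
For reference, the relative Mahalanobis distance score that the paper generalizes can be sketched as follows (assuming a shared in-class covariance and a single background Gaussian fit to all data; the hierarchical Bayesian models themselves are beyond a snippet):

```python
import numpy as np

def rmds(x, class_means, shared_cov, bg_mean, bg_cov):
    """Relative Mahalanobis distance score: the per-class
    Mahalanobis distance minus the distance under a background
    Gaussian fit to all training data. A low (negative) minimum
    over classes indicates an in-distribution point."""
    def md(x, mu, cov):
        d = x - mu
        return float(d @ np.linalg.inv(cov) @ d)
    md0 = md(x, bg_mean, bg_cov)
    return min(md(x, mu, shared_cov) - md0 for mu in class_means)

means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
cov = np.eye(2)
bg_mean, bg_cov = np.array([2.0, 0.0]), np.array([[5.0, 0.0], [0.0, 1.0]])
print(rmds(np.array([0.1, 0.0]), means, cov, bg_mean, bg_cov))  # negative: in-distribution
print(rmds(np.array([2.0, 9.0]), means, cov, bg_mean, bg_cov))  # positive: flagged as OOD
```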

URL: https://openreview.net/forum?id=w3bMXPMDW1

---

Title: A Watermark for Black-Box Language Models

Authors: Dara Bahri, John Frederick Wieting

Abstract: Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require \emph{white-box} access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. \emph{black-box} access), boasts a \emph{distortion-free} property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
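
A generic illustration of the black-box flavour (not the paper's scheme): with sampling access only, one can draw several candidate outputs and emit the one scoring highest under a keyed pseudorandom function, which a key-holding detector can later test for. All names and choices here are hypothetical:

```python
import hashlib, random

def keyed_score(text, key):
    """Pseudorandom score in [0, 1) derived from (key, text)."""
    h = hashlib.sha256((key + "|" + text).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2 ** 64

def watermark_sample(sample_fn, key, k=32):
    """Draw k candidate outputs using sampling access only and emit
    the one whose keyed score is highest; a detector holding the key
    can test whether the score is improbably high under the null."""
    return max((sample_fn() for _ in range(k)), key=lambda t: keyed_score(t, key))

rng = random.Random(0)
vocab = ["alpha", "beta", "gamma", "delta"]
sample = lambda: " ".join(rng.choice(vocab) for _ in range(5))
out = watermark_sample(sample, key="secret")
print(keyed_score(out, "secret"))  # high with overwhelming probability
```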

URL: https://openreview.net/forum?id=6gcHcgGmLo

---

Title: Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding

Authors: Guanyu Chen, Ruichen Wang, Tianren Zhang, Feng Chen

Abstract: In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model’s inherent in-weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a \textit{task representation space} and a \textit{sample representation space}. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate the effectiveness of our proposed architecture, CoQE, in the single-value answer setting. It not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across synthetic few-shot classification and a newly designed pseudo-arithmetic task. The code is available at: \url{https://github.com/McGuinnessChen/dual-representation-space-encoding}.

URL: https://openreview.net/forum?id=bJK7VIOWAU

---

Title: Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

Authors: Jiancheng Zhang, Yinglun Zhu

Abstract: Active learning (AL) is a principled strategy to reduce annotation cost in data-hungry deep learning. However, existing AL algorithms focus almost exclusively on unimodal data, overlooking the substantial annotation burden in multimodal learning. We introduce the first framework for multimodal active learning with unaligned data, where the learner must actively acquire cross-modal alignments rather than labels on pre-aligned pairs. This setting captures the practical bottleneck in modern multimodal pipelines, where unimodal features are easy to obtain but high-quality alignment is costly. We develop a new algorithm that combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. Extensive experiments on benchmark datasets demonstrate that our approach consistently reduces multimodal annotation cost while preserving performance; for instance, on the ColorSwap dataset it cuts annotation requirements by up to 40% without loss in accuracy.
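
A generic uncertainty-plus-diversity acquisition rule can be sketched as below; this greedy form is illustrative and is neither linear-time nor the paper's modality-aware algorithm:

```python
def acquire(candidates, uncertainty, k, alpha=1.0):
    """Greedy acquisition: repeatedly pick the candidate with the
    highest uncertainty plus alpha times its distance to the
    closest already-selected point (the diversity term)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    selected = []
    while len(selected) < k:
        def score(i):
            div = min((dist(candidates[i], candidates[j]) for j in selected),
                      default=0.0)
            return uncertainty[i] + alpha * div
        best = max((i for i in range(len(candidates)) if i not in selected),
                   key=score)
        selected.append(best)
    return selected

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
unc = [1.0, 0.9, 0.2]
print(acquire(pts, unc, k=2))  # [0, 2]: second pick trades uncertainty for diversity
```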

URL: https://openreview.net/forum?id=xMLajoct78

---

Title: Diversity Boosts AI-Generated Text Detection

Authors: Advik Raj Basani, Pin-Yu Chen

Abstract: Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to $33.2$% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to $18.7$% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.
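
The core signal can be sketched directly: given per-token probabilities under some scoring model, compute the spread of surprisals (the feature set here is a simplification, not DivEye's exact features):

```python
import math

def surprisal_features(token_probs):
    """Diversity-style features from per-token probabilities:
    mean and variance of the surprisal -log p(token). Human text
    tends to show higher surprisal variability than LLM output."""
    s = [-math.log(p) for p in token_probs]
    mean = sum(s) / len(s)
    var = sum((x - mean) ** 2 for x in s) / len(s)
    return mean, var

flat = [0.25, 0.25, 0.25, 0.25]           # uniform: no variability
bursty = [0.9, 0.01, 0.6, 0.05]           # alternating easy/hard tokens
print(surprisal_features(flat)[1])        # 0.0
print(surprisal_features(bursty)[1] > 1)  # True
```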

URL: https://openreview.net/forum?id=WscAU9It1l

---

Title: Multimodal Prescriptive Deep Learning

Authors: Dimitris Bertsimas, Lisa Everest, Vasiliki Stoumpou

Abstract: We introduce a multimodal deep learning framework, Prescriptive Neural Networks (PNNs), that combines ideas from optimization and machine learning to perform treatment recommendation, and that, to the best of our knowledge, is among the first prescriptive approaches tested with both structured and unstructured data within a unified model. The PNN is a feedforward neural network trained on embeddings to output an outcome-optimizing prescription. In two real-world multimodal datasets, we demonstrate that PNNs prescribe treatments that greatly improve estimated outcome rewards: by over 40% in transcatheter aortic valve replacement (TAVR) procedures and by 25% in liver trauma injuries. In four real-world, unimodal tabular datasets, we demonstrate that PNNs outperform or perform comparably to other well-known, state-of-the-art prescriptive models; importantly, on tabular datasets, we also recover interpretability through knowledge distillation, fitting interpretable Optimal Classification Tree models onto the PNN prescriptions as classification targets, which is critical for many real-world applications. Finally, we demonstrate that our multimodal PNN models achieve stability across randomized data splits comparable to other prescriptive methods and produce realistic prescriptions across the different datasets.
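
At decision time, a prescriptive model reduces to an argmax over treatments of an estimated outcome; a toy sketch with a hypothetical reward model:

```python
def prescribe(outcome_model, x, treatments):
    """Prescriptive rule: recommend the treatment whose estimated
    outcome reward is highest for features x. (A sketch of the idea;
    the paper trains a network end-to-end on multimodal embeddings.)"""
    return max(treatments, key=lambda t: outcome_model(x, t))

# Hypothetical reward model: treatment 1 only helps when x[0] is high.
model = lambda x, t: (2.0 * x[0] - 1.0) if t == 1 else 0.0
print(prescribe(model, [0.9], [0, 1]))  # 1
print(prescribe(model, [0.2], [0, 1]))  # 0
```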

URL: https://openreview.net/forum?id=AwfWOCVLbJ

---

Title: Correctness-Aware Knowledge Distillation for Enhanced Student Learning

Authors: Ishan Mishra, Deepak Mishra, Jinjun Xiong

Abstract: In real-world learning, students rely on their mentors for guidance but must also develop the ability to recognize and learn from their mentors' mistakes. Inspired by this mentor-critic dynamic, we propose Mentor-Critic Distillation (MCD), a novel framework for knowledge distillation in machine learning. Traditional distillation methods risk transferring both correct insights and errors from the mentor (teacher model) to the student model, which can hinder student performance. Notably, previous state-of-the-art approaches fail to account for scenarios where the teacher is incorrect, often leaving the student model vulnerable to inheriting these errors. To address this limitation, MCD introduces a weighted knowledge transfer mechanism that decouples the learning process based on the mentor's correctness. When the mentor model is correct, the student model follows the mentor's guidance with a large weight on knowledge transfer. However, when the mentor is incorrect, the student relies more on the ground truth but still learns inter-class relationships from the mentor, adjusting the weight toward task-specific losses such as cross-entropy. This mentor-critic approach ensures that the student model benefits from the mentor's expertise without inheriting its mistakes. We provide theoretical analysis proving that MCD strictly generalizes vanilla KD and guarantees reduced negative transfer. We evaluate our Mentor-Critic Distillation across diverse teacher-student configurations on benchmark datasets, including CIFAR-100, ImageNet, and MedMNIST. Notably, MCD requires no architectural modifications or additional parameters, making it a practical drop-in replacement for standard knowledge distillation. These results highlight MCD's effectiveness in optimizing knowledge transfer and its robustness across diverse domains and data regimes, particularly in data-scarce scenarios typical of specialized domains such as medical imaging.
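
A minimal sketch of correctness-weighted distillation, assuming illustrative weights rather than the paper's schedule:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mcd_loss(student_logits, teacher_logits, label, w_correct=0.9, w_wrong=0.3):
    """Correctness-aware distillation sketch: weight the KD term
    heavily when the teacher predicts the true label, and shift
    weight to cross-entropy on the ground truth when it does not.
    (The weights w_correct / w_wrong are illustrative.)"""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    ce = -math.log(p_s[label])
    kd = sum(t * math.log(t / s) for t, s in zip(p_t, p_s))  # KL(teacher || student)
    w = w_correct if max(range(len(p_t)), key=p_t.__getitem__) == label else w_wrong
    return w * kd + (1 - w) * ce

# Teacher correct: distillation dominates; teacher wrong: ground truth dominates.
print(mcd_loss([2.0, 0.0], [3.0, 0.0], label=0))
print(mcd_loss([2.0, 0.0], [0.0, 3.0], label=0))
```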

URL: https://openreview.net/forum?id=XpRXmzd2sF

---

Title: Evaluating the Adversarial Robustness of CNNs Layer by Layer

Authors: Yaowen Wang, Daniel Cullina

Abstract: In order to measure the adversarial robustness of a feature extractor, Bhagoji et al. introduced a distance on example spaces measuring the minimum perturbation of a pair of examples to achieve identical feature extractor outputs. They related these distances to the best possible robust accuracy of any classifier using the feature extractor. By viewing initial layers of a neural network as a feature extractor, this provides a method of attributing adversarial vulnerability of the classifier as a whole to individual layers. However, this framework views any injective feature extractor as perfectly robust: any bad choices of feature representation can be undone by later layers. Thus the framework attributes all adversarial vulnerabilities to the layers that perform dimensionality reduction. Feature spaces at intermediate layers of convolutional neural networks are generally much larger than input spaces, so this methodology provides no information about the contributions of individual layers to the overall robustness of the network. We extend the framework to evaluate feature extractors with high-dimensional output spaces by composing them with a random linear projection to a lower dimensional space. This results in non-trivial information about the quality of the feature space representations for building an adversarially robust classifier.
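
The proposed fix can be sketched directly: compose the feature extractor's output with a random Gaussian projection before measuring robustness (dimensions and scaling here are illustrative):

```python
import numpy as np

def project_features(features, out_dim, seed=0):
    """Compose a high-dimensional feature extractor with a random
    linear projection to a lower-dimensional space, so that injective
    (dimension-expanding) layers no longer look trivially 'robust'."""
    rng = np.random.default_rng(seed)
    d = features.shape[-1]
    P = rng.normal(size=(d, out_dim)) / np.sqrt(out_dim)
    return features @ P

feats = np.random.default_rng(1).normal(size=(5, 512))  # e.g. a conv layer's output
low = project_features(feats, out_dim=32)
print(low.shape)  # (5, 32)
```

Fixing the seed keeps the projection identical across evaluations, so distances in the projected space are comparable.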

URL: https://openreview.net/forum?id=2Gx9KzsaYB

---

Title: Adversarial Attacks in Weight-Space Classifiers

Authors: Tamir Shor, Ethan Fetaya, Chaim Baskin, Alex M. Bronstein

Abstract: Implicit Neural Representations (INRs) have recently been garnering increasing interest in various research fields, mainly due to their ability to represent large, complex data in a compact and continuous manner. Past work further showed that numerous popular downstream tasks can be performed directly in the INR parameter space. Doing so can substantially reduce the computational resources required to process the represented data in their native domain. A major difficulty in using modern machine-learning approaches is their high susceptibility to adversarial attacks, which have been shown to greatly limit the reliability and applicability of such methods in a wide range of settings. In this work, we show that parameter-space models trained for classification are inherently robust to adversarial attacks – without the need for any robust training. To support our claims, we develop a novel suite of adversarial attacks targeting parameter-space classifiers, and furthermore analyze practical considerations of such attacks.

URL: https://openreview.net/forum?id=eOLybAlili

---

Title: Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning

Authors: Hongseok Namkoong, Samuel Daulton, Eytan Bakshy

Abstract: Thompson sampling (TS) has emerged as a robust technique for contextual bandit problems. However, TS requires posterior inference and optimization for action generation, prohibiting its use in many online platforms where latency and ease of deployment are of concern. We operationalize TS by proposing a novel imitation-learning-based algorithm that distills a TS policy into an explicit policy representation, allowing fast decision-making and easy deployment in mobile and server-based environments. Using batched data collected under the imitation policy, our algorithm iteratively performs offline updates to the TS policy, and learns a new explicit policy representation to imitate it. Empirically, our imitation policy achieves performance comparable to batch TS while allowing more than an order of magnitude reduction in decision-time latency. Buoyed by low latency and simplicity of implementation, our algorithm has been successfully deployed in multiple video upload systems for Meta. Using a randomized controlled trial, we show our algorithm resulted in significant improvements in video quality and watch time.
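
A toy version of the idea for Bernoulli arms, where the "explicit policy" is just the empirical frequency of TS actions (the paper learns a parametric policy by imitation; names here are illustrative):

```python
import random

def ts_action(successes, failures, rng):
    """Thompson sampling for Bernoulli arms: sample one plausible
    success rate per arm from its Beta posterior, act greedily."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def distill_policy(successes, failures, n=10_000, seed=0):
    """Distillation sketch: imitate TS with an explicit, fixed
    action distribution (here the empirical TS action frequencies),
    so serving needs no posterior sampling at decision time."""
    rng = random.Random(seed)
    counts = [0] * len(successes)
    for _ in range(n):
        counts[ts_action(successes, failures, rng)] += 1
    return [c / n for c in counts]

policy = distill_policy(successes=[80, 20], failures=[20, 80])
print(policy)  # nearly all mass on the better arm
```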

URL: https://openreview.net/forum?id=J8PrWwvYX2

---

Title: VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Raghuveer Thirukovalluru, Xuan Zhang, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

Abstract: Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embedding models such as VLM2Vec, E5-V, and GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, retrieval-augmented generation (RAG) systems, and recommendation systems. To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering -- spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

URL: https://openreview.net/forum?id=TpU38jbKIJ

---

Title: LoRA-Ensemble: Efficient Uncertainty Modelling for Self-Attention Networks

Authors: Dominik J. Mühlematter, Michelle Halbheer, Alexander Becker, Dominik Narnhofer, Helge Aasen, Konrad Schindler, Mehmet Ozgur Turkoglu

Abstract: Numerous real-world decisions rely on machine learning algorithms and require calibrated uncertainty estimates. However, modern methods often yield overconfident, uncalibrated predictions. The dominant approach to quantifying the uncertainty inherent in the model is to train an ensemble of separate predictors and measure their empirical variance. In an explicit implementation, the ensemble has a high computational cost and memory footprint, especially if the base model itself is already large, like modern transformers. This motivates efforts to develop implicit ensemble methods that emulate the ensemble without explicitly instantiating all its members. We introduce LoRA-Ensemble, a parameter-efficient ensembling method for self-attention networks. It is based on Low-Rank Adaptation (LoRA), originally developed for efficient LLM fine-tuning, and extends it into an implicit ensembling scheme, where all ensemble members share the same, pre-trained self-attention network, but have individual low-rank matrices for the attention projections. The resulting method not only outperforms state-of-the-art implicit techniques like BatchEnsemble, but even matches or exceeds the accuracy of an Explicit Ensemble, while at the same time achieving superior calibration.
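
The parameter sharing can be sketched in a few lines: every member reuses the frozen weight $W$ and adds only its own rank-$r$ update, so the per-member overhead is $2rd$ parameters instead of $d^2$ (dimensions here are illustrative):

```python
import numpy as np

def lora_ensemble_forward(x, W, lora_pairs):
    """LoRA-Ensemble sketch: all members share the frozen projection
    W; member i adds its own low-rank update B_i @ A_i. The spread
    of the member outputs estimates predictive uncertainty."""
    return [x @ (W + B @ A) for A, B in lora_pairs]

rng = np.random.default_rng(0)
d, r, m = 16, 2, 4                      # width, LoRA rank, ensemble size
W = rng.normal(size=(d, d))             # shared, frozen weight
members = [(rng.normal(size=(r, d)) * 0.1, rng.normal(size=(d, r)) * 0.1)
           for _ in range(m)]           # per-member (A_i, B_i)
outs = lora_ensemble_forward(rng.normal(size=(d,)), W, members)
mean, var = np.mean(outs, axis=0), np.var(outs, axis=0)
print(mean.shape, var.shape)  # (16,) (16,)
```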

URL: https://openreview.net/forum?id=yhXXmOMpSQ

---

Title: DiffKGW: Stealthy and Robust Diffusion Model Watermarking

Authors: Tianxin Wei, Ruizhong Qiu, Yifan Chen, Yunzhe Qi, Jiacheng Lin, Wenxuan Bao, Wenju Xu, Sreyashi Nag, Ruirui Li, Hanqing Lu, Zhengyang Wang, Chen Luo, Hui Liu, Suhang Wang, Jingrui He, Qi He, Xianfeng Tang

Abstract: Diffusion models are known for their remarkable capability to generate realistic images. However, ethical concerns, such as copyright protection and the generation of inappropriate content, pose significant challenges for the practical deployment of diffusion models. Recent work has proposed a flurry of watermarking techniques that inject artificial patterns into initial latent representations of diffusion models, offering a promising solution to these issues. However, enforcing a specific pattern on selected elements can disrupt the Gaussian distribution of the initial latent representation. Inspired by watermarks for large language models (LLMs), we generalize the LLM KGW watermark to image diffusion models and propose a stealthy probability adjustment approach DiffKGW that preserves the Gaussian distribution of the initial latent representation. In addition, we dissect the design principles of state-of-the-art watermarking techniques and introduce a unified framework. We identify a set of dimensions that explain the manipulation enforced by watermarking methods, including the distribution of individual elements, the specification of watermark shapes within each channel, and the choice of channels for watermark embedding. Through empirical studies on regular text-to-image applications and the first systematic attempt at watermarking image-to-image diffusion models, we thoroughly verify the effectiveness of our proposed framework. On all the diffusion models, including Stable Diffusion, our approach induced from the proposed framework not only preserves image quality but also outperforms existing methods in robustness against a wide range of attacks.

URL: https://openreview.net/forum?id=OXi9vcIOgD

---

Title: MiniGPT-Med: A Unified Vision-Language Model for Radiology Image Understanding

Authors: Asma Alkhaldi, Raneem Alnajim, Layan Alabdullatef, Rawan Alyahya, Jun Chen, Deyao Zhu, Ahmed Z. Alsinan, Mohamed Elhoseiny

Abstract: Recent advances in artificial intelligence (AI) have precipitated significant breakthroughs in healthcare, particularly in the refinement of diagnostic procedures. However, existing studies have been limited in terms of functional coverage. This study introduces MiniGPT-Med, a vision-language model adapted from MiniGPT-v2 for medical applications through domain-specific fine-tuning on medical datasets. MiniGPT-Med demonstrates remarkable versatility across various imaging modalities, including X-rays, CT scans, and MRIs, enhancing its utility. The model is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery. Its integrated processing of both image and textual clinical data markedly improves diagnostic accuracy. Our empirical assessments confirm the superior performance of MiniGPT-Med in disease detection, medical report generation, and VQA benchmarks, representing a significant step towards reducing the gap in assisting radiology practice. Furthermore, it achieves state-of-the-art performance in medical report generation, with substantial gains in BERT-Sim over both specialist and generalist baselines, improving by 17 and 12 points, respectively. MiniGPT-Med promises to become a unified Vision-Language model for radiology diagnoses, enhancing diagnostic efficiency across a wide range of medical imaging applications.

URL: https://openreview.net/forum?id=NenHFEg1Di

---

Title: Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

Authors: Cong Fu, Yuchao Lin, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji

Abstract: Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. MLIP models are then pre-trained with supervised learning to predict energies and forces given 3D molecular structures. Once trained, we show that the pre-trained models can be used in different ways to obtain geometries, either explicitly or implicitly. First, they can be used to obtain approximate low-energy 3D geometries via geometry optimization. While these geometries do not consistently reach DFT-level chemical accuracy or convergence, they can still improve downstream performance compared to non-relaxed structures. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the pre-trained models can be directly fine-tuned for property prediction when ground truth 3D geometries are available. Our results demonstrate that MLIP pre-trained models trained on relaxation data can learn transferable molecular representations to improve downstream molecular property prediction and can provide practically valuable but approximate molecular geometries that benefit property predictions. Our code is publicly available at: https://github.com/divelab/AIRS/.
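
Geometry optimization with a learned potential is plain gradient descent on atomic positions along the predicted forces; a 1-D toy with a Lennard-Jones stand-in for the MLIP (illustrative only, not the paper's setup):

```python
def lj_energy_and_forces(positions, eps=1.0, sigma=1.0):
    """Toy 1-D Lennard-Jones pair potential standing in for an
    MLIP's energy/force predictions, used only to show relaxation."""
    n = len(positions)
    energy, forces = 0.0, [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r = abs(positions[i] - positions[j])
            sr6 = (sigma / r) ** 6
            energy += 4 * eps * (sr6 ** 2 - sr6)
            f = 24 * eps * (2 * sr6 ** 2 - sr6) / r  # -dU/dr
            sign = 1.0 if positions[i] > positions[j] else -1.0
            forces[i] += sign * f
            forces[j] -= sign * f
    return energy, forces

def relax(positions, steps=500, lr=1e-3):
    """Geometry optimization: move atoms along the predicted forces
    (steepest descent) until the structure nears a local minimum."""
    for _ in range(steps):
        _, forces = lj_energy_and_forces(positions)
        positions = [x + lr * f for x, f in zip(positions, forces)]
    return positions

relaxed = relax([0.0, 1.3])
print(abs(relaxed[1] - relaxed[0]))  # ~1.122 = 2**(1/6), the LJ minimum
```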

URL: https://openreview.net/forum?id=JwxhHTISJL

---

Title: From Feature Visualization to Visual Circuits: Effect of Model Perturbation

Authors: geraldin nanfack, Michael Eickenberg, Eugene Belilovsky

Abstract: Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emergent field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are typically interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework, where models are subtly perturbed to alter their interpretations while maintaining performance. However, existing model manipulation methods have two key limitations: (1) they manipulate either synthetic or natural feature visualizations individually, but not both simultaneously, and (2) no work has studied whether circuit-based interpretations are vulnerable to such manipulations.
This paper exposes these vulnerabilities by proposing a novel attack called ProxPulse that simultaneously manipulates both types of feature visualizations. Surprisingly, we find that visual circuits exhibit some robustness to ProxPulse. We therefore introduce CircuitBreaker, the first attack targeting entire circuits, which successfully manipulates circuit interpretations, revealing that circuits also lack robustness. The effectiveness of these attacks is validated across a range of pre-trained models, from smaller architectures like AlexNet to medium-scale models like ResNet-50, and larger ones such as ResNet-152 and DenseNet-201 on ImageNet. ProxPulse changes both visualization types with a <1\% accuracy drop, and CircuitBreaker drives circuit attribution correlation scores from near-perfect to ~0.6 while preserving circuit head functionality.

URL: https://openreview.net/forum?id=x6ZwuyTy65

---


New submissions
===============


Title: A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

Abstract: Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer.
This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates.
The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM adaptation regimes.
Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (with an accompanying repository at \url{https://anonymous.4open.science/r/Time-Series-Reasoning-Survey-TMLR/}).
Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets.
We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings.
Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

URL: https://openreview.net/forum?id=l3QW42g6u3

---

Title: MVDGC: Joint 3D and 2D Multi-view Pedestrian Detection via Dual Geometric Constraints

Abstract: The core challenge in multi-view pedestrian detection (MVPD) lies in effectively aggregating visual features from different viewpoints for robust occlusion reasoning. Recent approaches address this by first projecting image-view features onto a Bird's Eye View (BEV) map, where ground localization is then performed. Despite impressive performance, the perspective transformation induces severe distortion, breaking spatial structure and degrading the quality of object feature extraction. The blurred and ambiguous features hinder accurate BEV point localization, especially in densely populated regions. Moreover, the strong mutual relationship between the BEV ground point and image bounding boxes is left unexploited. Although multi-view consistency of 2D detections can serve as a powerful constraint in BEV space, these detections are commonly treated as auxiliary signals rather than being jointly optimized with the primary task.

In this work, we propose MVDGC, a unified framework that jointly estimates pedestrian locations on the BEV plane and 2D bounding boxes in image views. MVDGC employs a sparse set of 3D cylindrical queries that encode geometric context across both BEV and image views, enforcing dual spatial constraints for precise localization. Specifically, the geometric constraints are established by modeling each pedestrian as a vertical cylinder whose center lies on the BEV plane and whose projection casts a rectangular box in the image views. These queries function as shape anchors that directly extract 2D features from the intact image-view features using camera projection, eliminating projection-induced distortions. The 3D cylindrical query enables the unification of BEV and ImV localization into a single task: 3D cylinder position and shape refinement.

Extensive experiments and ablation studies demonstrate that MVDGC achieves state-of-the-art performance across multiple evaluation metrics on MVPD benchmarks, including Wildtrack and MultiviewX, as well as on the generalized multi-view detection (GMVD) dataset. Moreover, by explicitly modeling BEV-ImV coherency through cylindrical queries, MVDGC not only delivers high precision in multi-view detection but also surpasses image-based tracking methods in a single-view scenario. Code is available upon acceptance.

URL: https://openreview.net/forum?id=40cVQX5Mxc

---

Title: Efficient Zeroth-Order Federated Finetuning of Language Models on Resource-Constrained Devices

Abstract: Federated Learning (FL) is a promising paradigm for finetuning Large Language Models (LLMs) across distributed data sources while preserving data privacy. However, finetuning such large models is challenging on edge devices due to its high resource demand.
Zeroth-order (ZO) optimization estimates gradients through finite-difference approximations, which rely on function evaluations under random perturbations of the model parameters. Consequently, ZO with task alignment provides a potential solution, allowing finetuning using only forward passes with inference-level memory requirements and low communication overhead, but it suffers from slow convergence and higher computational demand. In this paper, we propose a new ZO-based method that applies a more efficient technique to reduce the computational demand associated with using a large number of perturbations, while preserving their convergence benefits. This is achieved by splitting the model into consecutive blocks and allocating a higher number of perturbations to the second block, enabling efficient reuse of intermediate activations to update the full network with fewer forward evaluations. Our evaluation on RoBERTa-large, OPT1.3B, and LLaMa-3-3.2B models shows up to $3\times$ reduction in computation compared to other ZO-based techniques, while retaining the memory and communication benefits over first-order federated learning techniques.
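The finite-difference idea underlying ZO finetuning can be sketched with a generic two-point (SPSA-style) estimator. This is an illustrative sketch, not the paper's block-wise perturbation scheme; the function name `zo_grad` and the toy quadratic loss are assumptions for the example.

```python
import numpy as np

def zo_grad(f, theta, eps=1e-3, n_perturb=256, seed=0):
    """Two-point zeroth-order gradient estimate, averaged over random
    Gaussian perturbations. Each estimate costs 2 * n_perturb forward
    passes and no backpropagation (generic sketch, not the paper's
    block-wise perturbation allocation)."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(n_perturb):
        z = rng.standard_normal(theta.shape)
        grad += (f(theta + eps * z) - f(theta - eps * z)) / (2 * eps) * z
    return grad / n_perturb

# Toy quadratic loss with known gradient 2 * theta.
theta = np.array([1.0, -2.0])
g = zo_grad(lambda w: float(w @ w), theta)
```

Averaging over more perturbations reduces estimator variance at the cost of more forward passes, which is exactly the compute/convergence tension the abstract describes.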

URL: https://openreview.net/forum?id=nVmz9Q2l7L

---

Title: Boundary-Consistent Graph Neural Networks for Topological Flux Prediction

Abstract: Graph Neural Networks (GNNs) have achieved notable success in spatiotemporal modeling across diverse application domains. However, their efficacy in flux prediction (FP), where the goal is to model spatiotemporal fluid transport over networked physical systems, remains contentious. Recent studies report that GNNs can underperform even simple baselines in FP settings, leading to a claim that GNNs may be intrinsically ill-suited for such tasks.

In this paper, we revisit this claim by dissecting the GNN learning dynamics on fluid transport networks, with an emphasis on their boundary regions. Specifically, we decompose the graph into boundary and interior nodes, where boundary nodes regulate the total influx and are the primary interface with external forcing. Our empirical and theoretical analyses reveal that dominant prediction errors concentrate at boundary nodes. From a dynamical-systems perspective, we interpret the boundary errors as the consequence of unmodeled external forcing, which degrades performance on boundaries. We therefore hypothesize that the observed performance degradation of GNNs is not caused by their expressivity; rather, it arises from the lack of explicit external forcing modeling during training.

To validate this hypothesis, we propose \myalg, which learns ghost-node proxies to approximate unmodeled external forcing. Each boundary node is augmented with an associated ghost node that represents the latent forcing. This yields a ghost--boundary--interior coupled system, which we solve using an implicit fixed-point formulation. The resulting equilibrium \emph{jointly} infers the external forcing and propagates it into the interior. This enriches standard GNN backbones with boundary-consistent representations while preserving interior message passing. Extensive experiments on two real-world fluid network datasets demonstrate that \myalg\ improves standard GNNs by reducing average MSE by 8.4\% and 5.0\%, and boundary-node MSE by 11.2\% and 7.1\%, respectively. For computational efficiency, we further introduce an explicit inverse-operator solver that amortizes the fixed-point inference and accelerates inference by up to $2\times$, depending on the backbone architecture.

URL: https://openreview.net/forum?id=31gTIfhoH0

---

Title: Adaptive LLM Safety via Inference-Time System-Level Optimization

Abstract: Large Language Models (LLMs) face rapidly evolving security threats, ranging from adversarial attacks like jailbreaking to the leakage of sensitive information that should have been unlearned. Existing defense mechanisms are often static and require extensive model retraining, making them slow to adapt to evolving threats. We investigate whether adaptive, inference-time system designs can mitigate the limitations of static LLM defenses. We study a modular inference-time defense system (which we refer to as AegisLLM). It utilizes a workflow of specialized modules whose defensive policies can be optimized with a remarkably small number of examples to achieve strong performance on multiple, distinct security challenges. We demonstrate the effectiveness of this system on two critical threats: sensitive information disclosure (with \textit{unlearning} as defense) and jailbreaking. On the WMDP benchmark, it approaches the random-guess lower bound for unlearning with only 20 training examples. For jailbreaking, it improves defenses by $\sim$51\% over the base model on the StrongReject benchmark, while maintaining a high utility as measured by the false refusal rate of only 7.9\% on the PHTest benchmark. Furthermore, we show that prompts optimized on one benchmark generalize effectively to others, underscoring the robustness of this approach. Our work highlights the significant advantages of adaptive, system-level security and demonstrates the power of prompt optimization for creating scalable and efficient LLM safety solutions. We provide our code at: \url{https://anonymous.4open.science/r/aegisllm-11B0}.

URL: https://openreview.net/forum?id=EkPGNcycxm

---

Title: ROCM: RLHF on consistency models

Abstract: Diffusion models have revolutionized generative modeling in continuous domains like image, audio, and video synthesis. However, their iterative sampling process leads to slow generation and inefficient training, challenges that are further exacerbated when incorporating Reinforcement Learning from Human Feedback (RLHF) due to sparse rewards and long time horizons. Consistency models address these issues by enabling single-step or efficient multi-step generation, significantly reducing computational costs. In this work, we propose a direct reward optimization framework for applying RLHF to consistency models, incorporating distributional regularization to enhance training stability and prevent reward hacking. We investigate various $f$-divergences as regularization strategies, striking a balance between reward maximization and model consistency. Unlike policy gradient methods, our approach leverages first-order gradients, making it more efficient and less sensitive to hyperparameter tuning. Empirical results show that our method achieves competitive or superior performance compared to policy gradient based RLHF methods, across various automatic metrics and human evaluation. Additionally, our analysis demonstrates the impact of different regularization techniques in improving model generalization and preventing overfitting.

URL: https://openreview.net/forum?id=g0Ht3AzyUP

---

Title: Invariant Causal Set Covering Machine

Abstract: Rule-based models, such as decision trees, appeal to practitioners due to their interpretable nature. However, the learning algorithms that produce such models are often vulnerable to spurious associations, and thus, they are not guaranteed to extract causally relevant insights. This limitation reduces their utility in gaining mechanistic insights into a phenomenon of interest. In this work, we build on ideas from the invariant causal prediction literature to propose Invariant Causal Set Covering Machines, an extension of the classical Set Covering Machine (SCM) algorithm for conjunctions/disjunctions of binary-valued rules that provably avoids spurious associations. The proposed method leverages structural assumptions about the functional form of such models, enabling an algorithm that identifies the causal parents of a variable of interest in polynomial time. We demonstrate the validity and efficiency of our approach through a simulation study and highlight its favorable performance compared to SCM in uncovering causal variables across real-world datasets.

URL: https://openreview.net/forum?id=slquR2A8rA

---

Title: Global Convergence of Sampling-Based Nonconvex Optimization through Diffusion-Style Smoothing

Abstract: Sampling-based optimization (SBO), such as the cross-entropy method and evolutionary algorithms, has achieved many successes in solving non-convex problems without gradients, yet its convergence is poorly understood. In this paper, we establish a non-asymptotic convergence analysis for SBO through the lens of smoothing. Specifically, we recast SBO as gradient descent on a smoothed objective, mirroring noise-conditioned score ascent in diffusion models. Our first contribution is a landscape analysis of the smoothed objective, demonstrating how smoothing helps escape local minima and uncovering a fundamental coverage-optimality trade-off: smoothing renders the landscape more benign by enlarging the locally convex region around the global minimizer, but at the cost of introducing an optimality gap. Building on this insight, we establish non-asymptotic convergence guarantees for SBO algorithms to a neighborhood of the global minimizer. Furthermore, we propose an annealed SBO algorithm, Diffusion-Inspired-Dual-Annealing (DIDA), which is provably convergent to the global optimum. We conduct extensive numerical experiments to verify our landscape results and also demonstrate the compelling performance of DIDA compared to other gradient-free optimization methods. Lastly, we discuss implications of our results for diffusion models.
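The "gradient descent on a smoothed objective" view can be illustrated with a Monte-Carlo estimate of the Gaussian-smoothed gradient. The estimator below and the toy objective are illustrative assumptions, not the paper's DIDA algorithm.

```python
import numpy as np

def smoothed_grad(f, x, sigma=0.5, n=4096, seed=0):
    """Monte-Carlo gradient of f_sigma(x) = E_z[f(x + sigma * z)],
    z ~ N(0, I), via the score-function identity
    grad f_sigma(x) = E_z[f(x + sigma * z) * z] / sigma,
    using the sample mean of f as a variance-reducing baseline."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, x.size))
    fx = np.array([f(x + sigma * zi) for zi in z])
    return ((fx - fx.mean())[:, None] * z).mean(axis=0) / sigma

# For f(x) = ||x||^2 the Gaussian-smoothed objective is
# ||x||^2 + sigma^2 * d, so its gradient at x is still 2x.
x = np.array([1.0])
g = smoothed_grad(lambda v: float(v @ v), x)
```

Larger `sigma` smooths away local minima (the "coverage" side of the trade-off) but shifts the smoothed minimizer away from the true one (the "optimality gap"), which is why an annealed schedule over `sigma` is natural.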

URL: https://openreview.net/forum?id=8Nx1ZjfEhd

---

Title: Adapt the Face, Not the Voice: Asymmetric Fine-Tuning of Foundation Models for Cross-Modal Person Matching

Abstract: Cross-modal person matching - associating a person’s voice with their face - requires bridging speech and vision representations that share no direct physical correspondence. We investigate a simple approach: pairing frozen unimodal foundation models (WavLM-Large for speech, SigLIP ViT-B/16 for faces) with lightweight trainable projections into a shared embedding space. Our central finding is an informative asymmetry in the effectiveness of Low-Rank Adaptation (LoRA): adapting the face encoder yields substantial gains while adapting the voice encoder provides no benefit. We explain this asymmetry through layer-wise identity probing: WavLM already encodes strong speaker identity information (93.8% linear probe accuracy on 70 classes), while SigLIP’s face identity representations are comparatively weak (79.5%), leaving substantially more room for task-specific adaptation. This gap widens on a larger evaluation: on 1,211-identity VoxCeleb1, WavLM maintains 90.5% probe accuracy while SigLIP drops to 58.1%. The asymmetric LoRA finding replicates across two datasets - MAV-Celeb (70 identities, per-identity split) and VoxCeleb1 (1,211 identities, identity-disjoint split) - and across evaluation protocols including verification, retrieval, and N-way matching. On MAV-Celeb, face-only LoRA achieves 16.6 ± 0.4% Equal Error Rate (mean ± std over 3 seeds) with only 1.33M trainable parameters (0.32% of the encoder total), compared to 19.9% for the prior best published result under a comparable (though not identical) evaluation protocol. Our results suggest a hypothesis for cross-modal adaptation: selectively adapting the encoder whose pretraining is least aligned with the target task is both necessary and sufficient.
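The parameter arithmetic behind "face-only LoRA" can be illustrated with a minimal LoRA linear layer. The dimensions, rank, and class name below are illustrative assumptions and do not match the paper's encoders or its 1.33M-parameter configuration.

```python
import numpy as np

class LoRALinear:
    """y = x @ (W + (alpha / r) * B @ A)^T with W frozen; only the
    low-rank factors A and B are trainable (generic LoRA sketch)."""
    def __init__(self, w, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                  # frozen weight, shape (out, in)
        self.a = 0.01 * rng.standard_normal((rank, w.shape[1]))
        self.b = np.zeros((w.shape[0], rank))       # zero init: no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ (self.w + self.scale * self.b @ self.a).T

w = np.arange(8.0 * 16.0).reshape(8, 16)            # toy frozen weight
layer = LoRALinear(w)
x = np.ones((1, 16))
# Trainable parameters scale with rank * (in + out), not in * out.
trainable = layer.a.size + layer.b.size             # 4*16 + 8*4 = 96, vs 128 frozen
```

Because `B` starts at zero, the adapted layer is exactly the frozen layer at initialization, so adaptation begins from the pretrained behavior, which is one reason LoRA can be safely attached to a single encoder as in the paper.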

URL: https://openreview.net/forum?id=6pr8Zwv64z

---

Title: World Model Anomaly Detection with a Latent Linear Prior

Abstract: Model-based reinforcement learning (MBRL) learns world models—internal simulators of environment dynamics—to plan by imagining future trajectories. However, when these models incorrectly predict state transitions, they generate unrealistic states that mislead agents into learning delusional policies. Inspired by human vision, we propose anomaly detection in world models with a \textbf{L}inear \textbf{P}rior (LP), a three-stage approach that 1) enforces a lightweight linear prior on successive latent states, 2) flags generated states that deviate from this prior, and 3) removes their contribution during agent learning. On the challenging Atari100k benchmark, LP-assisted GRU- and Transformer-based MBRL agents achieve competitive results while requiring fewer value updates, at minimal additional computational cost. Notably, by suppressing false value updates with LP, DreamerV3 boosts its human-normalized mean score by 9% while requiring less than 90% of the value updates. We release our implementation at https://anonymous.4open.science/r/lp-dreamer.
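The three stages described here can be sketched in miniature: fit a linear map between successive latent states, then flag transitions whose residual is an outlier. The fitting and thresholding rules below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def fit_linear_prior(z):
    """Least-squares linear map A with z[t+1] ~= z[t] @ A
    (a sketch of the 'linear prior on successive latent states')."""
    a, *_ = np.linalg.lstsq(z[:-1], z[1:], rcond=None)
    return a

def flag_anomalies(z, a, k=2.0):
    """Flag transitions whose residual exceeds mean + k * std
    (illustrative threshold)."""
    resid = np.linalg.norm(z[1:] - z[:-1] @ a, axis=1)
    return resid > resid.mean() + k * resid.std()

# Clean linear latent trajectory, then one injected anomalous state.
a_true = np.array([[0.9, 0.1], [-0.1, 0.9]])
z = np.empty((20, 2)); z[0] = [1.0, 1.0]
for t in range(19):
    z[t + 1] = z[t] @ a_true
a_fit = fit_linear_prior(z)

z_bad = z.copy()
z_bad[10] += [100.0, 0.0]          # an "unrealistic" generated state
flags = flag_anomalies(z_bad, a_fit)
```

The two flagged transitions are the ones entering and leaving the injected state; in the LP setting, these are the imagined states whose contribution would be removed from value updates.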

URL: https://openreview.net/forum?id=VLIzLK3CfR

---

Title: RevealIt: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability

Abstract: Understanding an agent's learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent's decision-making. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they function exclusively in 2D environments or with simple transition dynamics; understanding the agent's learning process in complex environments or tasks is more challenging. In this paper, we propose RevealIt, a novel framework for explaining the learning process of an agent in complex environments. We first visualize the policy structure and the agent's learning process across training tasks, which shows how much a particular training task or stage affects the agent's test performance. Then, a GNN-based explainer learns to highlight the most important section of the policy, providing a clearer and more robust explanation of the agent's learning process. The experiments demonstrate that explanations derived from this framework can effectively help optimize the training tasks, resulting in improved learning efficiency and final performance.

URL: https://openreview.net/forum?id=Dmu7agmmTQ

---

Title: Issues with Value-Based Multi-objective Reinforcement Learning: Value Function Interference and Overestimation Sensitivity

Abstract: Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's preferences with respect to the different objectives. This paper investigates two previously unreported issues which can hinder the performance of value-based MORL algorithms when applied in conjunction with a non-linear utility function -- value function interference, and sensitivity to overestimation. We illustrate the nature of these phenomena on simple multi-objective MDPs using a tabular implementation of multi-objective Q-learning.
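The first issue, value function interference, ultimately stems from the fact that a nonlinear utility does not commute with expectation: a vector value function stores the expected return E[R], but under stochastic outcomes the user cares about E[u(R)]. A minimal numeric illustration (the one-action MDP and the min-utility are illustrative assumptions, not the paper's examples):

```python
import numpy as np

# One state, one action, two equiprobable vector-valued returns.
returns = np.array([[4.0, 0.0],
                    [0.0, 4.0]])
u = lambda r: float(min(r))              # nonlinear utility: worst objective

q_vector = returns.mean(axis=0)          # what vector Q-learning stores: [2, 2]
utility_of_mean = u(q_vector)            # scalarising the stored value: 2.0
true_expected_utility = np.mean([u(r) for r in returns])  # actual: 0.0
```

Scalarising the averaged vector reports a utility of 2, while every actual outcome has utility 0: the two distinct returns interfere in the stored mean, so action selection based on the value function can be arbitrarily misled.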

URL: https://openreview.net/forum?id=KImrufIw0L

---

Title: Data selection through iterative Self-Filtering for vision-language settings

Abstract: The availability of large amounts of clean data is paramount to training neural networks. However, at large scales, manual oversight is impractical, resulting in sizeable datasets that can be very noisy. Attempts to mitigate this obstacle to producing performant vision-language models have so far involved heuristics, curated reference datasets, and using pre-trained models. Here we propose a novel, bootstrapped method in which a CLIP model is trained on an evolving, self-selected dataset. This evolving dataset constitutes a balance of filtered, highly probable clean samples as well as diverse samples from the entire distribution. Our proposed Self-Filtering method iterates between training the model and selecting a subsequently improved data mixture. Training on vision-language datasets filtered by the proposed approach improves the downstream performance, without the need for additional data or pre-trained models.

URL: https://openreview.net/forum?id=F09NfCXuCe

---

Title: MetaTT: A Global Tensor-Train Adapter for Parameter-Efficient Fine-Tuning

Abstract: We present MetaTT, a Tensor Train (TT) adapter framework for fine-tuning of pre-trained transformers. MetaTT enables flexible and parameter-efficient model adaptation by using a single shared TT to factorize transformer sub-modules. This factorization indexes key structural dimensions, including layer and matrix type, and can optionally incorporate heads and tasks. This design allows MetaTT’s parameter count to scale with the sum, rather than the product, of the modes, resulting in a substantially more compact adapter. Our benchmarks compare MetaTT with LoRA along with recent state-of-the-art matrix and tensor decomposition based fine-tuning methods. We observe that when tested on single-task standard language modeling benchmarks, MetaTT achieves competitive parameter efficiency to accuracy tradeoff. We further demonstrate that MetaTT performs competitively when compared to state-of-the-art methods on multi-task learning. Finally, we leverage the TT-ansatz to design a rank-adaptive optimizer inspired by the DMRG method from many-body physics. Our results demonstrate that integrating this approach with AdamW enhances optimization performance for a specified target rank.
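The "sum, rather than product, of the modes" claim is the standard tensor-train parameter count. The helper below is a generic illustration with a uniform internal rank; the mode sizes are invented for the example and are not MetaTT's actual core shapes.

```python
import numpy as np

def tt_param_count(modes, rank):
    """Parameters of a TT factorization with uniform internal rank:
    sum_k r_{k-1} * n_k * r_k, with boundary ranks r_0 = r_d = 1."""
    ranks = [1] + [rank] * (len(modes) - 1) + [1]
    return sum(ranks[k] * modes[k] * ranks[k + 1] for k in range(len(modes)))

# Hypothetical modes: layers x matrix-type x rows x cols.
modes = [24, 4, 64, 64]
tt = tt_param_count(modes, rank=8)       # grows with the SUM of the modes
full = int(np.prod(modes))               # grows with the PRODUCT of the modes
```

With these illustrative modes the TT adapter needs 5,056 parameters versus 393,216 for the full tensor, which is the compactness argument the abstract makes.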

URL: https://openreview.net/forum?id=1HdcPWfA9s

---

Title: Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection

Abstract: The reliability of artificial intelligence (AI) systems in open-world settings depends heavily on their ability to flag out-of-distribution (OOD) inputs unseen during training. Recent advances in large-scale vision-language models (VLMs) have enabled promising few-shot OOD detection frameworks using only a handful of in-distribution (ID) samples. However, existing prompt learning-based OOD methods largely overlook the geometry of the visual feature embeddings learned by VLMs, whose structure is particularly informative for distinguishing ID from OOD data and, thanks to pre-training on millions of samples, carries rich representational capacity. To address this, we introduce a \textit{geometry-aware context optimization framework} that integrates subspace representation learning with prompt tuning. By projecting ID-relevant features into a subspace spanned by prompt vectors and projecting out ID-irrelevant components via orthogonal null-space projections, our approach strengthens the discriminative power of the learned prompt vectors, thereby leading to enhanced ID–OOD separability at test time. To enable easy-to-handle, end-to-end learning under this framework, we design a geometry-regularized learning criterion that ensures strong OOD detection performance as well as high ID classification accuracy across settings. Moreover, the proposed framework can be seamlessly integrated with a wide range of existing context optimization methods, effectively complementing their softmax-based OOD detectors. Experiments on various real-world datasets showcase the effectiveness of our approach for reliable open-world AI systems.
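The projection geometry described here can be sketched with standard linear algebra: an orthogonal projector onto the span of the prompt vectors, and the complementary null-space projector. This is a generic illustration of the decomposition, not the paper's training criterion; dimensions are invented.

```python
import numpy as np

def subspace_projectors(prompts):
    """Given prompt vectors as rows of `prompts` (k x d), return the
    orthogonal projector onto their span and the projector onto its
    orthogonal complement (null space)."""
    q, _ = np.linalg.qr(prompts.T)           # orthonormal basis, d x k
    p_span = q @ q.T
    p_null = np.eye(prompts.shape[1]) - p_span
    return p_span, p_null

rng = np.random.default_rng(0)
prompts = rng.standard_normal((4, 16))       # 4 prompt vectors in 16-d space
p_span, p_null = subspace_projectors(prompts)

feat = rng.standard_normal(16)
id_part, ood_part = p_span @ feat, p_null @ feat  # ID-relevant vs residual parts
```

Every feature splits exactly into an in-span part and an orthogonal residual; scoring the size of the residual against the prompt subspace is one simple way geometry can separate ID from OOD features.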

URL: https://openreview.net/forum?id=TFG2gPjkiF

---

Title: NAS Without Priors: A Robust Architecture Search Framework for Unseen-Data

Abstract: Neural architecture search (NAS) has been widely used to automate neural network design for image classification; however, most NAS research has focused on CIFAR, ImageNet, and their derivative benchmarks. These datasets benefit from well-established architecture design practices, data preprocessing techniques, and training protocols developed prior to NAS, causing many NAS methods to meta-overfit and struggle to generalize to entirely novel datasets. In this work, we analyze the limitations of existing NAS practices and propose a framework specifically designed to generalize to unseen data. In contrast to the prevailing paradigm of exploring extremely large search spaces using low-fidelity evaluations, we advocate sparser exploration combined with high-fidelity performance estimation. We demonstrate that macro-architecture variations alone induce substantial architectural diversity, and that concentrating computational resources on high-fidelity evaluation of fewer candidates produces reliable reward signals enabling better architecture discovery. To obtain robust candidate rankings, we repeatedly train architectures on the entire training set using multiple random seeds. While this approach substantially reduces performance variance due to random seed variability and enables accurate candidate ranking, it comes at a significant computational cost. To mitigate the cost of such high-fidelity evaluation, particularly for larger or high-resolution datasets, we introduce a dataset- and architecture-aware multi-fidelity search strategy that both reduces computational overhead and stabilizes candidate rankings under varying fidelity levels.
We evaluate our framework on the NAS Unseen-Data Challenge, where, under a strict time constraint of 8 hours per dataset for both search and training, it outperforms manually designed architectures across all three challenge datasets and achieves first place with a combined score of 12.19, compared to 10.89 and 10.43 for the second- and third-place NAS solutions, respectively.

URL: https://openreview.net/forum?id=PaEk1gYrFz

---

Title: Score-based Lyapunov Stable Neural ODE for Robust Classification

Abstract: Adversarial attacks pose a significant obstacle to the widespread deployment of modern AI systems. These attacks, often implemented as imperceptible perturbations to the input, typically an image, can deliberately mislead neural networks into making incorrect predictions. Over the past decade, numerous studies have sought to understand and mitigate this vulnerability. Among them, a promising line of research interprets neural networks as dynamical systems and leverages Lyapunov theory to enhance robustness against adversarial perturbations. However, the original intent of Lyapunov theory differs from that of building accurate and robust neural networks, leading to conceptual and practical challenges. Existing approaches typically incorporate Lyapunov constraints through penalization, but such formulations only ensure local stability around input data points and do not guarantee broader regions of convergence. In this work, we propose a framework based on vector fields that explicitly admit asymptotically stable equilibrium points, thereby strengthening the Lyapunov-based foundation of the model. This enhanced theoretical grounding enables us to prove that every point within the support of the input data distribution converges to a stable equilibrium point, and to draw a natural connection to the concept of score estimation. Experimentally, we demonstrate that our model improves adversarial robustness over prior Lyapunov-regularized approaches across standard image classification benchmarks. Qualitatively, the induced dynamics exhibit a denoising effect against adversarial perturbations, driving inputs toward stable modes of the data distribution.

URL: https://openreview.net/forum?id=LqIKlX9Pzg

---

Title: DMT-JEPA: Learning Discriminative Masked Targets for Joint-Embedding Predictive Architecture

Abstract: The joint-embedding predictive architecture (JEPA) has recently shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as the target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets. Consequently, DMT-JEPA shows that increasing the discriminative power of target representations benefits a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate its effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection.
Code is available at: \url{https://anonymous.4open.science/r/DMT-JEPA-anony}.
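
The two steps described in the abstract (similarity-based neighbor selection, then attention-weighted aggregation into a target) can be sketched in a few lines. This is an illustrative numpy reconstruction, not the authors' implementation; the function name `dmt_target`, the cosine-similarity metric, and the softmax-over-similarities weighting are assumptions standing in for the paper's learned cross-attention heads:

```python
import numpy as np

def dmt_target(masked_feat, neighbor_feats, k=3):
    """Sketch of a DMT-JEPA-style target: pick the k neighbors most
    similar to the masked patch, then aggregate them with attention
    weights derived from the same similarities."""
    # cosine similarity between the masked patch and each neighbor
    sims = neighbor_feats @ masked_feat
    sims = sims / (np.linalg.norm(neighbor_feats, axis=1)
                   * np.linalg.norm(masked_feat) + 1e-8)
    top = np.argsort(sims)[-k:]              # semantically closest patches
    w = np.exp(sims[top] - sims[top].max())  # softmax attention weights
    w = w / w.sum()
    return w @ neighbor_feats[top]           # weighted target feature

rng = np.random.default_rng(0)
neighbors = rng.normal(size=(8, 16))   # 8 neighboring patch features
masked = rng.normal(size=16)           # feature of the masked patch
target = dmt_target(masked, neighbors, k=3)
```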

URL: https://openreview.net/forum?id=73demKsXn4

---

Title: Freeze, Prompt, and Adapt: A Framework for Source-free Unsupervised GNN Prompting

Abstract: Prompt tuning has become a key mechanism for adapting pre-trained Graph Neural Networks (GNNs) to new downstream tasks. However, existing approaches are predominantly supervised, relying on labeled data to optimize the prompting parameters and typically finetuning a task-specific prediction head—practices that undermine the promise of parameter-efficient adaptation. We formulate the Unsupervised Graph Prompting Problem (UGPP), a challenging new setting where the pre-trained GNN is kept entirely frozen, labels on the target domain are unavailable, the source data is inaccessible, and the target distribution exhibits covariate shift. To address this, we propose UGPrompt, the first fully unsupervised GNN prompting framework. UGPrompt leverages consistency regularization and pseudo-labeling to train a prompting function, complemented with diversity and domain regularization to mitigate class imbalance and distribution mismatch. Our extensive experiments demonstrate that UGPrompt consistently outperforms state-of-the-art supervised prompting methods that have access to labeled data, establishing the viability of unsupervised prompting as a practical adaptation paradigm for GNNs.
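
A consistency-plus-pseudo-labeling objective of the kind the abstract describes can be sketched as follows. This is a hypothetical form for intuition only (the function name `ugprompt_loss`, the confidence threshold, and the negative-entropy diversity term are assumptions, not the paper's exact losses):

```python
import numpy as np

def ugprompt_loss(p_weak, p_strong, conf_thresh=0.8):
    """Sketch of an unsupervised prompting objective: pseudo-labels come
    from a weakly-perturbed view, cross-entropy is applied to a
    strongly-perturbed view (kept only for confident predictions), and a
    diversity term penalizes collapsed class usage."""
    pseudo = p_weak.argmax(axis=1)
    conf = p_weak.max(axis=1)
    mask = conf >= conf_thresh                       # confident nodes only
    ce = -np.log(p_strong[np.arange(len(pseudo)), pseudo] + 1e-8)
    consistency = (ce * mask).sum() / max(mask.sum(), 1)
    mean_pred = p_strong.mean(axis=0)                # marginal class usage
    diversity = (mean_pred * np.log(mean_pred + 1e-8)).sum()  # neg. entropy
    return consistency + diversity

p_weak = np.array([[0.9, 0.1], [0.55, 0.45]])   # predictions, weak view
p_strong = np.array([[0.8, 0.2], [0.5, 0.5]])   # predictions, strong view
loss = ugprompt_loss(p_weak, p_strong)
```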

URL: https://openreview.net/forum?id=9KKgIQwCLO

---

Title: Towards Scalable Explainable AI: Using Vision-Language Models to Interpret Vision Systems

Abstract: Explainable AI (xAI) is increasingly important for the trustworthy deployment of vision models in domains such as medical imaging, autonomous driving, and safety-critical systems. However, modern vision models are typically trained on massive datasets, making it nearly impossible for researchers to manually track how models learn from each sample, especially when relying on saliency maps that require intensive visual inspection. Traditional xAI methods, while useful, often focus on instance-level explanations and risk losing important information about model behavior at scale, leaving analysis time-consuming, subjective, and difficult to reproduce. To overcome these challenges, we propose an automated evaluation pipeline that leverages Vision-Language Models to analyze vision models at both the sample and dataset levels. Our pipeline systematically assesses, generates, and interprets saliency-based explanations, aggregates them into structured summaries, and enables scalable discovery of failure cases, biases, and behavioral trends. By reducing reliance on manual inspection while preserving critical information, the proposed approach facilitates more efficient and reproducible xAI research, supporting the development of robust and transparent vision models.

URL: https://openreview.net/forum?id=Ta2cvwmlVb

---

Title: DINOv3

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images—using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
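
The "Gram anchoring" idea, matching the patch-to-patch similarity structure of current features to that of an earlier anchor, can be sketched as a loss on Gram matrices. This is an illustrative reconstruction for intuition; the function name `gram_anchor_loss`, the cosine normalization, and the squared-error form are assumptions, not DINOv3's exact objective:

```python
import numpy as np

def gram_anchor_loss(student_patches, anchor_patches):
    """Sketch of a Gram-anchoring-style objective: match the Gram matrix
    (pairwise patch similarities) of current features to that of an earlier
    'anchor' checkpoint, one way to keep dense feature maps from degrading
    over long training schedules."""
    def gram(x):
        x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
        return x @ x.T                      # pairwise patch similarities
    d = gram(student_patches) - gram(anchor_patches)
    return (d ** 2).mean()

rng = np.random.default_rng(1)
anchor = rng.normal(size=(6, 8))                       # 6 patches, dim 8
loss_same = gram_anchor_loss(anchor, anchor)           # identical -> 0
loss_diff = gram_anchor_loss(rng.normal(size=(6, 8)), anchor)
```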

URL: https://openreview.net/forum?id=2NlGyqNjns

---

Title: UniRec: Unified Multimodal Encoding for LLM-Based Recommendations

Abstract: Large language models (LLMs) have recently shown promise for multimodal recommendation, particularly with text and image inputs. Yet real-world recommendation signals extend far beyond these modalities. To reflect this, we formalize recommendation features into four modalities: text, images, categorical features, and numerical attributes, and emphasize the unique challenges this heterogeneity poses for LLMs in understanding multimodal information. In particular, these challenges arise not only across modalities but also within them, as attributes (e.g., price, rating, time) may all be numeric yet carry distinct meanings. Beyond this intra-modality ambiguity, another major challenge is the nested structure of recommendation signals, where user histories are sequences of items, each carrying multiple attributes. To address these challenges, we propose UniRec, a unified multimodal encoder for LLM-based recommendation. UniRec first employs modality-specific encoders to produce consistent embeddings across heterogeneous signals. It then applies a triplet representation—comprising attribute name, type, and value—to separate schema from raw inputs and preserve semantic distinctions. Finally, a hierarchical Q-Former models the nested structure of user interactions while maintaining their layered organization. On multiple real-world benchmarks, UniRec outperforms state-of-the-art multimodal and LLM-based recommenders by up to 15%, while extensive ablation studies further validate the contributions of each component.
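
The (name, type, value) triplet representation can be made concrete with a small data-structure sketch. The class name `AttributeTriplet`, the `encode_item` helper, and the example type map are hypothetical illustrations, not UniRec's actual code:

```python
from dataclasses import dataclass

@dataclass
class AttributeTriplet:
    """Sketch of the triplet described above: the schema (name and type)
    is kept separate from the raw value so that, e.g., 'price' and
    'rating' are not conflated just because both are numeric."""
    name: str
    type: str   # "text" | "image" | "categorical" | "numerical"
    value: object

def encode_item(raw):
    # hypothetical type map; a real system would read this from the schema
    type_map = {"title": "text", "price": "numerical",
                "rating": "numerical", "brand": "categorical"}
    return [AttributeTriplet(k, type_map[k], v) for k, v in raw.items()]

item = encode_item({"title": "espresso maker", "price": 79.0,
                    "rating": 4.5, "brand": "Acme"})
```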

URL: https://openreview.net/forum?id=WXE255GWhQ

---

Title: FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data

Abstract: The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data. This naturally leads to the question of whether such logs can be fully leveraged to fuse LLMs' complementary capabilities. Although prior work has explored various strategies for integrating multiple LLMs, we argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs (e.g., fine-tuning and inference stages). To this end, we introduce LLMFusionBench, a large-scale benchmark for LLM fusion that spans 14 tasks across five domains, with responses from 20 open-source LLMs (8B-671B) totaling 103M tokens. Building on LLMFusionBench, we propose FusionFactory, a systematic framework with three elaborated levels: (1) query-level fusion via tailored LLM routers, (2) thought-level fusion leveraging retrieved abstract reasoning templates, and (3) model-level fusion via distillation from top-ranked responses. Experiments show that FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with the optimal fusion configuration varying across benchmarks, highlighting the promise of multi-LLM log data as a practical foundation for fusing diverse LLM capabilities.

URL: https://openreview.net/forum?id=N951scS3yE

---

Title: FairSAM: Fair Classification on Corrupted Data Through Sharpness-Aware Minimization

Abstract: Image classification models trained on clean data often suffer from significant performance degradation when exposed to corrupted testing or deployment data, such as images with impulse noise, Gaussian noise, or environmental noise. This degradation not only impacts overall performance but also disproportionately affects various demographic subgroups, raising critical algorithmic bias concerns. Although robust learning algorithms such as Sharpness-Aware Minimization (SAM) improve overall model robustness and generalization, they do not address biased performance degradation across demographic subgroups. Existing fairness-aware machine learning methods aim to reduce performance disparities but struggle to maintain robust and equitable accuracy across demographic subgroups when faced with data corruption. This reveals an inherent tension between robustness and fairness when dealing with corrupted data. To address these challenges, we introduce a newly-designed metric to assess performance degradation across subgroups under data corruption. We propose FairSAM, a framework that integrates Fairness-oriented strategies into SAM to deliver equalized performance across demographic groups under corrupted conditions. Our experiments on multiple real-world datasets and various predictive tasks show that FairSAM reconciles robustness and fairness. The framework yields a structured solution for fair and robust image classification in the presence of data corruption.

URL: https://openreview.net/forum?id=W2QKvn57yw

---

Title: Symbolic Recovery of PDEs from Measurement Data

Abstract: Models based on partial differential equations (PDEs) are powerful for describing a wide range of complex relationships in the natural sciences. Accurately identifying the PDE model, which represents the underlying physical law, is essential for a proper understanding of the problem. This reconstruction typically relies on indirect and noisy measurements of the system’s state and, without specifically tailored methods, rarely yields symbolic expressions, thereby hindering interpretability. In this work, we address this issue by considering existing neural network architectures based on rational functions for the symbolic representation of physical laws. These networks leverage the approximation power of rational functions while also benefiting from their flexibility in representing arithmetic operations. Our main contribution is an identifiability result, showing that, in the limit of noiseless, complete measurements, such symbolic networks can uniquely reconstruct the simplest physical law within the PDE model. Specifically, reconstructed laws remain expressible within the symbolic network architecture, with regularization-minimizing parameterizations promoting interpretability and sparsity in case of $L^1$-regularization. In addition, we provide regularity results for symbolic networks. Empirical validation using the ParFam architecture supports these theoretical findings, providing evidence for the practical reconstructibility of physical laws.

URL: https://openreview.net/forum?id=TbHfgo10W3

---

Title: Policy Gradients for Cumulative Prospect Theory in Reinforcement Learning

Abstract: We derive a policy gradient theorem for Cumulative Prospect Theory (CPT) objectives in finite-horizon Reinforcement Learning (RL), generalizing the standard policy gradient theorem and encompassing distortion-based risk objectives as special cases. Motivated by behavioral economics, CPT combines an asymmetric utility transformation around a reference point with probability distortion. Building on our theorem, we design a first-order policy gradient algorithm for CPT-RL using a Monte Carlo gradient estimator based on order statistics. We establish statistical guarantees for the estimator and prove asymptotic convergence of the resulting algorithm to first-order stationary points of the (generally non-convex) CPT objective. Simulations illustrate qualitative behaviors induced by CPT and compare our first-order approach to existing zeroth-order methods.
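
The order-statistics Monte Carlo estimator underlying such CPT objectives can be sketched as follows. The parameter defaults are the standard Tversky-Kahneman values, and the power distortion w(p) = p^gamma is a simplification; the paper's exact estimator and weighting function may differ:

```python
import numpy as np

def cpt_value(returns, ref=0.0, alpha=0.88, lam=2.25, gamma=0.61):
    """Order-statistics Monte Carlo estimate of a CPT value (a sketch).
    Gains and losses are transformed asymmetrically around the reference
    point and reweighted by a probability distortion w(p) = p**gamma."""
    w = lambda p: p ** gamma
    x = np.sort(np.asarray(returns, dtype=float))   # order statistics
    n = len(x)
    gains = np.maximum(x - ref, 0.0) ** alpha
    losses = lam * np.maximum(ref - x, 0.0) ** alpha
    i = np.arange(1, n + 1)
    # distorted decision weights on the sorted samples
    wg = w((n - i + 1) / n) - w((n - i) / n)   # decumulative, for gains
    wl = w(i / n) - w((i - 1) / n)             # cumulative, for losses
    return float((gains * wg).sum() - (losses * wl).sum())

v = cpt_value([1.0, -1.0, 2.0, -0.5])
```

With a single gain of 1 and zero reference, the estimate reduces to u(1) = 1, which serves as a quick sanity check.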

URL: https://openreview.net/forum?id=6iGrtiSCR7

---

Title: Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion

Abstract: Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we show that MaskGIT is asymptotically equivalent to a choose-then-sample (CTS) formulation, instantiated as the “moment sampler,” which explicitly separates index selection from token sampling. This CTS reformulation is essential: it yields unbiased token sampling and exposes an algorithmic design space for index selection, both of which are inaccessible in MaskGIT’s original formulation. Regarding token sampling, we reveal that MaskGIT implicitly adopts a low-temperature sampler, which explains why MaskGIT often degrades with more sampling steps. The CTS reformulation of MaskGIT allows us to fix the temperature sampling to ensure unbiasedness. We also improve the index selection in CTS through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains validate our theory and demonstrate the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
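
One choose-then-sample step, separating unbiased token sampling from index selection, can be sketched as below. This is an illustrative reconstruction: the function name `cts_step` and the confidence-based selection rule are assumptions standing in for the paper's moment sampler:

```python
import numpy as np

def cts_step(probs, n_unmask, rng):
    """One choose-then-sample (CTS) step: sample every masked token from
    its unmodified categorical distribution (unbiased, temperature 1),
    then *choose* which positions to commit via a selection score; the
    score affects only which indices are unmasked, not the token values."""
    n_pos, vocab = probs.shape
    # unbiased token sampling, position by position
    tokens = np.array([rng.choice(vocab, p=probs[j]) for j in range(n_pos)])
    conf = probs[np.arange(n_pos), tokens]      # selection score only
    commit = np.argsort(conf)[-n_unmask:]       # indices to unmask this step
    return tokens, commit

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=4)   # 4 masked positions, vocab 5
tokens, commit = cts_step(probs, n_unmask=2, rng=rng)
```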

URL: https://openreview.net/forum?id=mKlW68i2Ig

---

Title: Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

Abstract: Neural scaling laws -- power-law relationships between loss, model size, and data -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit the parametric scaling law to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ~ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.
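
Fitting a law of the form L(N) = a·N^(-b) + c can be sketched with a grid over the irreducible floor c and a log-log regression for a and b. The fitting procedure here (grid plus least squares) is an assumption for illustration, shown on synthetic data with a floor of 1.44 as in the abstract; the paper's fitting method may differ:

```python
import numpy as np

def fit_scaling_law(n_params, loss, c_grid):
    """Fit L(N) = a * N**(-b) + c: for each candidate floor c, regress
    log(L - c) on log N and keep the best least-squares fit."""
    best = None
    for c in c_grid:
        y = loss - c
        if (y <= 0).any():
            continue                        # floor must stay below the data
        A = np.vstack([np.ones_like(n_params), -np.log(n_params)]).T
        coef, res, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
        sse = float(res[0]) if res.size else 0.0
        if best is None or sse < best[0]:
            best = (sse, np.exp(coef[0]), coef[1], c)   # (sse, a, b, c)
    return best[1:]

# synthetic losses following a known law: a=5, b=0.3, c=1.44
N = np.array([1e3, 1e4, 1e5, 1e6, 1e7, 1e8])
L = 5.0 * N ** -0.3 + 1.44
a, b, c = fit_scaling_law(N, L, c_grid=np.linspace(1.0, 1.44, 45))
```

On this noiseless synthetic data the fit recovers the generating parameters, which is a useful check before applying the same routine to real validation MSE curves.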

URL: https://openreview.net/forum?id=a8rUQqionr

---

Title: Time-Aware Prior Fitted Networks for Zero-Shot Forecasting with Exogenous Variables

Abstract: In many time series forecasting settings, the target time series is accompanied by exogenous covariates, such as promotions and prices in retail demand; temperature in energy load; calendar and holiday indicators for traffic or sales; and grid load or fuel costs in electricity pricing. Ignoring these exogenous signals can substantially degrade forecasting accuracy, particularly when they drive spikes, discontinuities, or regime and phase changes in the target series. Most current time series foundation models (e.g., Chronos, Sundial, TimesFM, TimeMoE, TimeLLM, and LagLlama) ignore exogenous covariates and make forecasts solely from the numerical time series history, thereby limiting their performance. In this paper, we develop ApolloPFN, a prior-data fitted network (PFN) that is time-aware (unlike prior PFNs) and that natively incorporates exogenous covariates (unlike prior univariate forecasters). Our design introduces two major advances: (i) a synthetic data generation procedure tailored to resolve the failure modes that arise when tabular (non-temporal) PFNs are applied to time series; and (ii) time-aware architectural modifications that embed inductive biases needed to exploit the time series context. We demonstrate that ApolloPFN achieves state-of-the-art results across benchmarks, such as M5 and electric price forecasting, that contain exogenous information.

URL: https://openreview.net/forum?id=nJARpxp3cF

---

Title: Automata Learning from Recurrent Networks: A Critical Synthesis for Verification, Testing, and Interpretability

Abstract: Recurrent Neural Networks (RNNs) have demonstrated their effectiveness in modeling sequential data and are a key building block of modern deep learning architectures. In this review paper, we study recurrent networks through the lens of automata theory. Given an RNN, automata learning seeks to model its behavior with an automaton, which enables better interpretability and eases our understanding of its working mechanisms. We begin by examining the theoretical foundations of this approach, showing how it can be applied to learn automata from various types of recurrent nets, including the Elman Recurrent Network (ERN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU). Next, we review the applications of this approach in formal verification, model-based testing, and the interpretability of these deep learning models. We finish with a discussion on the advantages and critical problems of this method, while outlining key goals for future research, such as defining standard benchmarks and identifying limitations that need to be addressed to advance this field further.

URL: https://openreview.net/forum?id=R52ETbUBVo

---

Title: Adaptive Budget Allocation for Orthogonal Subspace Adapter Tuning in LLMs Continual Learning

Abstract: Large language models (LLMs) often suffer from catastrophic forgetting in continual learning (CL) scenarios, where performance on previously learned tasks degrades severely while training on sequentially arriving tasks.
Although pioneering CL approaches using orthogonal subspaces can mitigate task interference, they typically employ fixed budget allocation, neglecting the varying complexity across tasks and layers.
Moreover, recent budget-adaptive tuning methods for LLMs often adopt multi-stage paradigms that decouple optimization from budget allocation. Such decoupling results in potential misalignment, which hinders these approaches' practical application in CL scenarios.
To address these limitations, we propose OA-Adapter, a novel parameter-efficient approach for continual learning in LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in an end-to-end training stage.
Specifically, OA-Adapter introduces a dynamic bottleneck dimension adaptation mechanism that simultaneously allocates an efficient parameter budget and optimizes task objectives without misalignment.
To effectively preserve previously acquired knowledge while coordinating with the dynamic budget allocation, orthogonal constraints are applied specifically between the parameter subspace of the current task and the dynamically allocated parameter subspaces of historical tasks.
Experimental results on continual learning benchmarks demonstrate that OA-Adapter outperforms state-of-the-art methods in both accuracy and parameter efficiency. OA-Adapter achieves higher average accuracy while using \(58.5\%\) fewer parameters on the Standard CL Benchmark, and maintains its advantages on two larger benchmarks comprising 15 tasks.
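
The orthogonality constraint between the current task's subspace and historical subspaces can be sketched as a projection onto the orthogonal complement. This is an illustrative numpy sketch (the function name and the explicit-projection formulation are assumptions; OA-Adapter enforces the constraint during training rather than by post-hoc projection):

```python
import numpy as np

def orthogonalize_to_history(new_dirs, history_basis):
    """Project the current task's adapter directions onto the orthogonal
    complement of the (column-orthonormal) historical subspace, so updates
    for the new task cannot interfere with directions used by earlier
    tasks."""
    proj = history_basis @ (history_basis.T @ new_dirs)
    return new_dirs - proj

rng = np.random.default_rng(0)
hist, _ = np.linalg.qr(rng.normal(size=(10, 3)))  # past-task subspace basis
new = rng.normal(size=(10, 2))                    # current-task directions
new_orth = orthogonalize_to_history(new, hist)
overlap = np.abs(hist.T @ new_orth).max()         # ~0 after projection
```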

URL: https://openreview.net/forum?id=LNGDlLOdex

---

Title: OmegAMP: Targeted AMP Discovery via Biologically Informed Generation

Abstract: Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as limited controllability, lack of representations that efficiently model antimicrobial properties, and low experimental hit rates. To address these challenges, we introduce OmegAMP, a framework designed for reliable AMP generation with increased controllability. Its diffusion-based generative model leverages a novel conditioning mechanism to achieve fine-grained control over desired physicochemical properties and to direct generation towards specific activity profiles, including species-specific effectiveness. This is further enhanced by a biologically informed encoding space that significantly improves overall generative performance. Complementing these generative capabilities, OmegAMP leverages a novel synthetic data augmentation strategy to train classifiers for AMP filtering, drastically reducing false positive rates and thereby increasing the likelihood of experimental success. Our in silico experiments demonstrate that OmegAMP delivers state-of-the-art performance across key stages of the AMP discovery pipeline, enabling an unprecedented success rate in in vitro experiments. We tested 25 candidate peptides; 24 of them (96%) demonstrated antimicrobial activity, proving effective even against multi-drug resistant strains. Our findings underscore OmegAMP's potential to significantly advance computational frameworks in the fight against antimicrobial resistance.

URL: https://openreview.net/forum?id=hAq3XLZ9ex

---

Title: Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

Abstract: Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. However, existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations -- for example, selecting which trajectory better describes an event in the image. In this work, we introduce Point-It-Out, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding. We propose a hierarchical evaluation protocol spanning three stages (S1: referred-object localization, S2: task-driven pointing, and S3: visual trace prediction), with data collected from critical domains for embodied intelligence, including indoor, kitchen, driving, and robotic manipulation scenarios. Extensive experiments with over ten state-of-the-art VLMs reveal several interesting findings. For example, strong general-purpose models such as GPT-4o, while excelling on many benchmarks (e.g., language, perception, and reasoning), underperform compared to some open-source models in precise visual grounding; models such as MoLMO perform well in S1 and S2 but struggle in S3, which requires grounding combined with visual trace planning.

URL: https://openreview.net/forum?id=9e0hRhFsal

---

Title: Blind Inverse Game Theory: Jointly Decoding Rewards and Rationality in Entropy-Regularized Competitive Games

Abstract: Inverse Game Theory (IGT) methods based on the entropy-regularized Quantal Response Equilibrium (QRE) offer a tractable approach for competitive settings, but critically assume the agents' rationality parameter (temperature $\tau$) is known a priori. When $\tau$ is unknown, a fundamental scale ambiguity emerges that couples $\tau$ with the reward parameters ($\theta$), making them statistically unidentifiable. We introduce Blind-IGT, the first statistical framework to jointly recover both $\theta$ and $\tau$ from observed behavior. We analyze this bilinear inverse problem and establish necessary and sufficient conditions for unique identification by introducing a normalization constraint that resolves the scale ambiguity. We propose an efficient Normalized Least Squares (NLS) estimator and prove it achieves the optimal $\mathcal{O}(N^{-1/2})$ convergence rate for joint parameter recovery. When strong identifiability conditions fail, we provide partial identification guarantees through confidence set construction. We extend our framework to Markov games and demonstrate optimal convergence rates with strong empirical performance even when transition dynamics are unknown.
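
The scale ambiguity between $\theta$ and $\tau$ is easy to demonstrate for a single-agent quantal response, since the softmax depends only on the ratio $\theta/\tau$; a norm constraint then pins down the scale. The snippet below is an intuition-building sketch (the normalization $\|\theta\|=1$ is one example constraint; the paper's exact normalization may differ):

```python
import numpy as np

def qre_response(theta, tau):
    """Quantal response: a softmax best response with rationality tau.
    Behavior depends only on theta / tau, hence the scale ambiguity."""
    z = theta / tau
    z = z - z.max()              # numerical stability
    p = np.exp(z)
    return p / p.sum()

theta = np.array([1.0, 2.0, 0.5])
# ambiguity: (theta, tau) and (c*theta, c*tau) induce identical behavior
p1 = qre_response(theta, tau=0.7)
p2 = qre_response(3.0 * theta, tau=3.0 * 0.7)
# a normalization constraint (here ||theta|| = 1) resolves the scale
theta_id = theta / np.linalg.norm(theta)
tau_id = 0.7 / np.linalg.norm(theta)
p3 = qre_response(theta_id, tau_id)
```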

URL: https://openreview.net/forum?id=tHG9WmaUzR

---

Title: Uncertainty and Scale-Calibrated Contrastive Federated Segmentation under Client Heterogeneity

Abstract: Federated learning presents a promising approach for medical image segmentation, particularly in addressing data privacy concerns. However, it faces significant challenges due to data heterogeneity across participating clients. This heterogeneity introduces variations in data scales and distributions, making it difficult to balance spatial accuracy and feature similarity when managing multidimensional heterogeneous data. To address these challenges, we propose a novel framework, \textbf{Uncertainty- and Scale-Calibrated Contrastive Federated Segmentation under Client Heterogeneity (SAFCF)}, with two key approaches: (i) an \textbf{uncertainty-driven dynamic scale-adaptive weighted aggregation (DSWA)} method, which balances the influence of local client data scales and reduces model drift caused by data heterogeneity through the use of epistemic uncertainty in weighted aggregation, and (ii) a \textbf{contrastive federated segmentation loss (CFSL)}, a local loss function that effectively balances spatial accuracy and feature similarity at the pixel level of an image by combining modified Dice loss with improved contrastive loss. Additionally, an epistemic uncertainty layer learns weight distributions to introduce uncertainty, further improving model robustness and enabling adaptive learning from diverse data during training. Our framework demonstrates substantial improvements on standard benchmark medical image segmentation datasets, especially under highly non-IID conditions, when compared to traditional algorithms.

URL: https://openreview.net/forum?id=avrvLzsebU

---

Title: The Expanded Othello AI Arena: Evaluating Intelligent Systems Through Constrained Adaptation to Unseen Conditions

Abstract: The ability to rapidly adapt to environmental changes is a core requirement for Artificial General Intelligence (AGI), yet most AI benchmarks evaluate performance in static environments. We present the Expanded Othello AI Arena, a benchmark designed to measure Skill-Acquisition Efficiency — the rate at which agents discover latent objectives and converge to effective strategies within a limited interaction budget. The Arena formalizes a spectrum of 56 environments using a parametric framework $\mathcal{E} = (\mathcal{L}, \mathcal{C})$, where $\mathcal{L}$ defines Othello board geometries and $\mathcal{C}$ represents latent winning conditions via a disc-ratio threshold $K$. This parameterization requires agents to decipher terminal rules through direct interaction while simultaneously interpreting the opponent's behavior — in narrow regimes, agents must strategically induce the opponent into violating the hidden threshold to secure victory. Unlike traditional evaluation, the Arena imposes a strict 2,000-game interaction budget to prioritize sample efficiency over asymptotic optimization. We establish the benchmark's utility through a neuroevolutionary adaptive-Minimax baseline that utilizes meta-learned spatial priors and adaptive weighting. Our empirical analysis reveals that while this baseline achieves competitive performance in standard and inverse regimes, it fails in narrow-interval regimes that demand adversarial inducement, exposing a substantial efficiency gap that gradient-based reinforcement learning cannot bridge even with five times the interaction budget. Released as an extensible Python-based research toolkit, the Arena provides a standardized platform for exploring research directions including test-time learning, in-context learning, and world models. The code is available at: \url{https://anonymous.4open.science/r/ExpandedOthello/}.
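
A latent winning condition parameterized by a disc-ratio band can be sketched in a few lines. The interval semantics below are illustrative assumptions chosen to show why narrow regimes punish "always maximize discs" play; the benchmark's exact rule encoding may differ:

```python
def outcome(my_discs, opp_discs, lo, hi):
    """Hypothetical latent winning condition C = [lo, hi] on the final
    disc ratio: the agent wins only if its share of discs falls inside
    the interval. In a narrow band, 'more discs' is no longer always
    better, so the agent may need to induce the opponent into pushing
    its ratio past hi."""
    ratio = my_discs / (my_discs + opp_discs)
    return "win" if lo <= ratio <= hi else "loss"

r1 = outcome(40, 24, 0.5, 1.0)    # standard regime: majority (0.625) wins
r2 = outcome(40, 24, 0.5, 0.6)    # narrow regime: 0.625 overshoots the band
```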

URL: https://openreview.net/forum?id=WXKQtqPC2d

---

Title: SafeFix: Targeted Model Repair via Controlled Image Generation

Abstract: Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. While existing debugging frameworks can identify these failure slices, effectively repairing them remains difficult. Current solutions often rely on manually designed prompts to generate synthetic images—an approach that introduces distribution shift and semantic errors, often resulting in new bugs. To address these issues, we introduce SafeFix, a framework for distribution-consistent model repair via controlled generation that employs a diffusion model to generate semantically faithful images that modify only specific failure attributes while preserving the underlying data distribution. To ensure the reliability of the repair data, we implement a verification mechanism using a large vision-language model (LVLM) to enforce semantic consistency and label preservation. By retraining models on the synthetic data, we significantly reduce errors in rare cases and improve overall performance. Our experiments show that SafeFix achieves superior robustness by maintaining high precision in attribute editing without introducing additional bugs.

URL: https://openreview.net/forum?id=TtpW6JiEiW

---

Title: Tight Regret Bounds in Multi-Armed Bandits with Heterogeneous Variances

Abstract: We study stochastic multi-armed bandits with heterogeneous reward variances. In the known-variance setting, we propose a variance-aware MOSS algorithm that achieves minimax-optimal regret, matching an information-theoretic lower bound up to constants. For the unknown-variance case, we construct high-probability variance upper confidence bounds and show that the resulting algorithm attains the same minimax rate up to a logarithmic factor. Our analysis establishes sharp worst-case guarantees that explicitly capture the variance structure of the problem.
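
The core idea, widening each arm's confidence bonus in proportion to its reward variance, can be sketched with a UCB-V-style index. This is for intuition only: the paper's MOSS-type index uses a different log term, and the function name and constants here are assumptions:

```python
import numpy as np

def variance_aware_index(mean, var, pulls, t):
    """Sketch of a variance-aware optimistic index: arms with larger
    reward variance receive wider confidence bonuses, so exploration
    effort adapts to the heterogeneous variance structure."""
    bonus = np.sqrt(2.0 * var * np.log(t) / pulls)
    return mean + bonus

means = np.array([0.5, 0.5])        # identical empirical means
variances = np.array([0.01, 1.0])   # known, heterogeneous variances
pulls = np.array([10, 10])
idx = variance_aware_index(means, variances, pulls, t=100)
```

With equal means and pull counts, the high-variance arm gets the larger index and is explored first, which is exactly the behavior a variance-blind UCB cannot express.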

URL: https://openreview.net/forum?id=xnF9pViAZw

---

Title: Multiplayer Combinatorial Bandits Under Information Asymmetry

Abstract: In this paper, we extend linear combinatorial bandits \cite{gai2012combinatorial} to a multiplayer setting with information asymmetry \cite{chang2022online, chang2024optimal}, where each player controls an arm and independently decides whether to pull it, with coordination allowed only before rounds begin. We analyze three scenarios: action asymmetry (players can't observe others' actions but receive identical rewards per iteration), reward asymmetry (players observe actions but receive private i.i.d.\ rewards), and combined asymmetry. We derive near-optimal, gap-independent regret bounds for all scenarios: For action or reward asymmetry, we achieve $\tilde{\mathcal{O}}(\sqrt{T})$, which improves significantly from \cite{gai2010learning}; for both action and reward asymmetry, we achieve near-optimal bounds similar to that of \cite{chang2022online}. We finally generalize our results to settings where players decide either not to pull or to pull one out of multiple arms, and achieve similar bounds in similar settings as above.

URL: https://openreview.net/forum?id=sspckudDFX

---

Title: Replicability is Asymptotically Free in Multi-armed Bandits

Abstract: We consider a replicable stochastic multi-armed bandit algorithm that ensures, with high probability, that the algorithm's sequence of actions is not affected by the randomness inherent in the dataset. Replicability allows third parties to reproduce published findings and assists the original researcher in applying standard statistical tests. We observe that existing algorithms incur $O(K^2/\rho^2)$ times more regret than nonreplicable algorithms, where $K$ is the number of arms and $\rho$ is the level of nonreplication. However, we demonstrate that this additional cost is unnecessary when the time horizon $T$ is sufficiently large for a given $K, \rho$, provided that the magnitude of the confidence bounds is chosen carefully. Therefore, for large $T$, our algorithm performs a $K^2/\rho^2$ times smaller amount of exploration than existing algorithms. To ensure the replicability of the proposed algorithms, we incorporate randomness into their decision-making processes. We propose a principled approach to limiting the probability of nonreplication. This approach elucidates the steps that existing research has implicitly followed. Furthermore, we derive the first lower bound for the two-armed replicable bandit problem, which implies the optimality of the proposed algorithms up to a $\log\log T$ factor for the two-armed case.

URL: https://openreview.net/forum?id=E8rmbq8BYP

---

Title: A Mechanistic Study of Transformers Training Dynamics

Abstract: Large-scale pretraining of transformers has been central to the success of foundation models. However, the scale of those models limits our understanding of the mechanisms at play during optimization. In this work, we study the training dynamics of transformers in a controlled and interpretable setting. On the sparse modular addition task, we demonstrate that specialized attention circuits, called clustering heads, can be implemented during gradient descent to solve the problem. Our experiments show that such pathways naturally emerge during training. By monitoring the evolution of tokens via a visual sandbox, we uncover a two-stage learning process and the occurrence of loss spikes due to the high curvature of normalization layers. Our findings provide several insights into patterns observed in more practical settings, such as the pretraining of large language models.
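A common formulation of the sparse modular addition task looks like the following sketch. The details (fixed relevant positions, uniform tokens, specific values of $k$ and $p$) are illustrative assumptions, since the abstract does not spell out the setup:

```python
import numpy as np

def sparse_modular_addition(n_samples, seq_len, k, p, seed=0):
    """Toy dataset: the label is the sum of tokens at k fixed
    'relevant' positions, taken modulo p; the remaining positions
    are distractors the model must learn to ignore."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, p, size=(n_samples, seq_len))
    relevant = rng.choice(seq_len, size=k, replace=False)  # hidden sparse support
    y = x[:, relevant].sum(axis=1) % p
    return x, y, relevant

x, y, relevant = sparse_modular_addition(1000, 12, k=3, p=7)
```

A model solving this task must both locate the sparse support and compute the modular sum, which is what makes it a useful controlled probe of attention circuits.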

URL: https://openreview.net/forum?id=aHbZx0bckL

---

Title: Balanced Twins: Causal Inference on Time Series with Hidden Confounding

Abstract: Accurately estimating treatment effects in time series is essential for evaluating interventions in real-world applications, especially when treatment assignment is biased by unobserved factors. In many practical settings, interventions are adopted at different times across individuals, leading to staggered treatment exposure and heterogeneous pre-treatment histories. In such cases, aggregating outcome trajectories across treated units is ill-defined, making individual treatment effect (ITE) estimation a prerequisite for reliable causal inference. We therefore study the problem of estimating the average treatment effect for the treated (ATT) by first recovering individual-level counterfactuals.
We introduce a neural framework that simultaneously learns low-dimensional latent representations of individual time series and propensity scores. These estimates are then used to approximate the individual treatment effects through a flexible matching procedure that avoids classical convexity constraints commonly used in synthetic control methods. By operating at the individual level, our approach naturally accommodates staggered interventions and improves counterfactual estimation under latent bias, without relying on explicit temporal modeling assumptions.
We illustrate our approach on both real-world energy consumption data and clinical time series, including high-frequency electricity demand-response programs and semi-synthetic data for individuals in the intensive care unit (ICU), where hidden confounding, staggered treatment adoption, and non-stationary dynamics are prevalent.
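The matching step can be illustrated with a plain nearest-neighbour variant. This is only a sketch of the general pattern (the paper's procedure is more flexible), and it assumes latent representations for treated and control units have already been learned:

```python
import numpy as np

def matched_counterfactual(z_treated, Z_control, y_control, k=3):
    """Estimate a treated unit's untreated outcome by averaging the
    outcomes of its k nearest control units in latent space."""
    dists = np.linalg.norm(Z_control - z_treated, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_control[nearest].mean()

# toy 1-d latent space: controls at 0, 1, and 10; the far unit is ignored
Z_control = np.array([[0.0], [1.0], [10.0]])
y_control = np.array([1.0, 3.0, 100.0])
est = matched_counterfactual(np.array([0.2]), Z_control, y_control, k=2)  # -> 2.0
```

The individual treatment effect is then the observed treated outcome minus this matched counterfactual, and the ATT is its average over treated units.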

URL: https://openreview.net/forum?id=3AwFgQHOPj

---

Title: Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

Abstract: The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

URL: https://openreview.net/forum?id=ZEmv4DhaGL

---

Title: A Unified Framework with Environmental and Interaction Uncertainty for Robust Multi-Agent Reinforcement Learning

Abstract: Multi-agent reinforcement learning (MARL) has achieved remarkable success across diverse domains, yet its robustness remains hindered by various inherent uncertainties arising from multi-agent systems. Although previous studies have explored robustness in MARL, most of them focus on a single type of uncertainty, without a unified framework to handle multiple sources simultaneously. As a result, their methods often fail to remain robust when exposed to diverse and interacting disturbances. To address this limitation, we propose a unified framework that explicitly models two complementary sources of uncertainty: environmental uncertainty, caused by stochastic dynamics, and interaction uncertainty, arising from the unpredictable behaviors of other agents. We capture these factors using hierarchical entropy-based uncertainty sets, which are then integrated into the robust Markov game formulation. This hierarchical design enables the framework to distinguish the distinct impacts of each uncertainty source while avoiding the excessive conservatism of treating them as a single unified set. On top of this formulation, we introduce the solution concept of an Aleatoric Robust Equilibrium (ARE), where each agent optimizes its policy against worst-case scenarios derived from the hierarchical sets. To compute the ARE, we develop specialized actor–critic algorithms with theoretical convergence guarantees. Extensive experiments in both the multi-agent particle environment (MPE) and the multi-agent MuJoCo benchmark show that our approach achieves consistently superior robustness and performance across a wide range of uncertainty settings.

URL: https://openreview.net/forum?id=DMllImVr8k

---

Title: VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Abstract: Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.

URL: https://openreview.net/forum?id=mPhlTmYiyg

---

Title: Decentralized Federated Learning with Function Space Regularization

Abstract: In this work we propose FedFun, a framework for decentralized federated learning that enforces consensus across clients in function space rather than parameter space. By framing agreement as a regularization penalty in a Hilbert space of hypotheses, our method allows optimization using proximal gradient updates that encourage similarity between neighboring models while supporting both parametric and non-parametric learners. This function space perspective enables convergence guarantees under mild assumptions, covering situations where client objectives are non-convex in the usual sense and where clients may utilize differing architectures. In addition to convergence analysis, we demonstrate compatibility with models like neural networks and decision trees, and empirically evaluate implementations of FedFun on various sample datasets.
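The function-space idea can be sketched in a few lines: instead of penalizing parameter differences, clients are penalized for disagreeing on predictions at shared anchor inputs. This is a hypothetical minimal version (the paper works with proximal updates in a Hilbert space of hypotheses), but it shows why differing architectures pose no problem:

```python
import numpy as np

def function_space_penalty(models, anchors):
    """Disagreement of client models measured on predictions, not parameters.
    models: list of callables mapping an input batch to predictions."""
    preds = np.stack([m(anchors) for m in models])     # (n_clients, n_anchors)
    return ((preds - preds.mean(axis=0)) ** 2).mean()  # variance across clients

# two 'clients' with different parameterizations but similar functions
f1 = lambda x: 2.0 * x
f2 = lambda x: 2.0 * x + 0.1
anchors = np.linspace(-1, 1, 50)
penalty = function_space_penalty([f1, f2], anchors)
```

Because the penalty only touches predictions, one client could be a neural network and another a decision tree; a parameter-space penalty cannot even be written down in that case.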

URL: https://openreview.net/forum?id=79a397AzQu

---

Title: Training Conditional GANs on Limited and Long-Tailed Data: a Survey and Comparative Analysis

Abstract: Generative adversarial networks (GANs) are a generative model framework that is competitive with state-of-the-art autoencoders and diffusion models in many tasks. While the latter have achieved impressive generation capabilities, mostly through large-scale, general-purpose text-to-image models, their computational requirements place them out of reach for practitioners. On the other hand, as GAN architectures mature and new developments allow for more stable training, interest in their application has grown across diverse domains. However, real-world data are often hard to deal with due to a limited number of samples or long-tailed distributions. Furthermore, previous works addressing these issues lack guidance regarding their applicability and have not been compared through appropriately diverse benchmarks nor assessed using the same metrics. In this article, we survey methods for training GANs on limited and long-tailed data and conduct an extensive comparative analysis of existing methods. Our results allow us to draw conclusions about the advantages, disadvantages, and practical applicability of these methods, hopefully making GANs more accessible to practitioners in diverse fields. The code will be made available as soon as deanonymization is allowed.

URL: https://openreview.net/forum?id=Gubw0PmHRY

---

Title: How Much Information Fits in a Vector?

Abstract: Recent work in neural network interpretability has suggested that hidden activations of some deep models can be viewed as linear projections of much higher-dimensional vectors of sparse latent ``features.'' In general, this kind of representation is known as a superposition code. This work presents an information-theoretic account of superposition codes in a setting applicable to interpretability. We show that when the number $k$ of active features is very small compared to the number $N$ of total features, simple inference methods currently used by sparse autoencoders can reliably decode a $d$-dimensional superposition code when $d$ is a constant factor greater than the Shannon limit. Specifically, when $\ln k / \ln N \le \eta < 1$ and $H$ is the entropy of the latent vector in bits, asymptotically it suffices that $d / H > C(\eta)$ for certain increasing functions $C(\eta).$ However, the behavior of $C(\eta)$ depends on what decoding method is used. For example, when $\eta = 0.3$, we empirically show that a method based on the popular top-$k$ activation function typically requires a factor of $C = 4$ dimensions per bit. On the other hand, we exhibit an algorithm that succeeds with less than $2$ dimensions per bit and requires only around $3$ times as many FLOPs for the same values of $(N, d).$ We hope this work helps connect research in interpretability with perspectives from compressive sensing and information theory.
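The top-$k$ decoding setup can be sketched as follows. The Gaussian dictionary, the specific sizes, and the tied encoder (scoring by correlation with each feature direction, as sparse autoencoders commonly do) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 200, 256, 3                       # total features, dimension, active features
A = rng.normal(size=(N, d)) / np.sqrt(d)    # near-unit-norm random feature directions
support = rng.choice(N, size=k, replace=False)
x = A[support].sum(axis=0)                  # superposed d-dimensional activation

scores = A @ x                              # correlate with every feature direction
decoded = np.argsort(scores)[-k:]           # top-k inference
```

With $k \ll N$ and enough dimensions, the $k$ true features score near 1 while interference terms stay small, so top-$k$ recovers the support; the paper's question is how far $d$ can shrink toward the Shannon limit before this breaks.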

URL: https://openreview.net/forum?id=Nby4pCPIZI

---

Title: Fallback-Enabled Closed-Set Classification: Cross-Modal Consistency in Vision-Language Models

Abstract: Vision-Language Models (VLMs) can describe and label images; however, this does not imply that they truly process what they are perceiving. Recent studies show that, despite their breadth of training, VLMs are surprisingly unreliable as classifiers, for either closed-world or open-world settings. In this work, we explore a deeper question: can a VLM recognize when an image falls outside the set of categories it is asked to choose from? Our results reveal a surprising failure mode: even when the notion of in-set versus out-of-set is explicitly defined, VLMs often assign plausible in-set labels to out-of-set images, violating the task’s explicit constraint. Motivated by this, we propose a cross-modal consistency framework that reasons over both the visual and textual arms of the model and accepts an answer only when they agree. Experiments on three well-known datasets (DomainNet, VisDA and INaturalist-2021) demonstrate that this approach consistently improves balanced known vs. unknown detection over Source-Free Universal Domain Adaptation (SF-UniDA) baselines, showing that cross-modal consistency improves a VLM’s ability to follow the task logic and distinguish when an image falls outside the intended label space. Our results suggest that, with strong VLMs, fallback behavior need not rely exclusively on specialized SF-UniDA adaptation pipelines: a lightweight cross-modal consistency decision rule can be competitive with representative SF-UniDA baselines on standard benchmarks.

URL: https://openreview.net/forum?id=tOKG6sSk3I

---

Title: NEUTAG: Graph Transformer for Attributed Graphs

Abstract: Graph Transformers (\textsc{GT}) have demonstrated their superiority in graph classification tasks, but their performance in node classification settings remains below par. They are designed for either homophilic or heterophilic graphs and show poor scalability to million-sized graphs. In this paper, we address these limitations for node classification tasks by designing a model that utilizes a special feature encoding that transforms the input graph, separating nodes and features, which enables the flow of information not only from the local neighborhood of a node but also from distant nodes, via their connections through shared feature nodes. We theoretically demonstrate that this design allows each node to exchange information with all nodes in the graph, effectively mimicking all-node-pair message passing while avoiding $\mathcal{O}(N^2)$ computation. We further analyze the universal approximation ability of the proposed transformer. Finally, we demonstrate the effectiveness of the proposed method on diverse sets of large-scale graphs, including the homophilic \& the heterophilic varieties.
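The node/feature separation can be illustrated with a minimal construction (hypothetical; the paper's actual encoding may differ): each binary feature becomes its own node, connected to every graph node that carries it, so two distant nodes sharing a feature become two hops apart via the shared feature node.

```python
import numpy as np

def node_feature_bipartite(X):
    """X: (n_nodes, n_feats) binary feature matrix.
    Returns edges linking node i to feature-node (n_nodes + f)
    for every feature f active at node i."""
    n_nodes = X.shape[0]
    node_idx, feat_idx = np.nonzero(X)
    return np.stack([node_idx, feat_idx + n_nodes])  # feature nodes get offset ids

X = np.array([[1, 0, 1],
              [0, 0, 1]])
edges = node_feature_bipartite(X)  # columns: (0,2), (0,4), (1,4)
```

Message passing over this augmented graph lets information flow between any two nodes that share any feature, without materializing the $\mathcal{O}(N^2)$ all-pairs attention.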

URL: https://openreview.net/forum?id=kQrIrYvbbw

---

Title: Rethinking On-policy Optimization for Query Augmentation

Abstract: Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model's parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that under a compute-aware comparison setting, simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), in which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. We will fully open source our implementation to facilitate reproducibility.
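The pseudo-document pattern (in the spirit of prompting-based expansion) can be sketched as below. The generator here is a stand-in stub, not the paper's trained policy, and the prompt wording is an assumption:

```python
def expand_with_pseudo_doc(query, generate):
    """generate: any text-generation callable (e.g., an LLM API wrapper).
    The pseudo-document is appended so that lexical or dense retrievers can
    match on its vocabulary as well as the original query's."""
    pseudo_doc = generate(f"Write a short passage that answers: {query}")
    return f"{query} {pseudo_doc}"

# stub generator for illustration
stub = lambda prompt: "Paris is the capital and largest city of France."
expanded = expand_with_pseudo_doc("capital of France", stub)
```

OPQE's contribution, per the abstract, is to train the generator with RL so that the pseudo-document itself maximizes downstream retrieval metrics rather than being produced zero-shot.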

URL: https://openreview.net/forum?id=mmqbjhz5Br

---

Title: Diffusion-based Cumulative Adversarial Purification for Vision Language Models

Abstract: Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We theoretically establish a certified recovery region in the forward diffusion process and meanwhile quantify the convergence rate of semantic variation with respect to VLMs. These findings manifest that adversarial effects monotonically fade as diffusion unfolds. Guided by this principle, DiffCAP leverages noise injection with a similarity threshold of VLM embeddings as an adaptive criterion, before reverse diffusion restores a clean and reliable representation for VLM inference. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP consistently outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with theorems and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments.

URL: https://openreview.net/forum?id=kpuV3mzwqw

---

Title: Continual Test-Time Adaptation: A Comprehensive Survey

Abstract: Deep neural nets achieve remarkable performance when training and test data share the same distribution, but this assumption frequently breaks in real-world deployment, where data undergoes continual distributional shifts. Continual Test-Time Adaptation (CTTA) addresses this challenge by adapting pretrained models to non-stationary target distributions on-the-fly, without access to source data or labeled targets, while mitigating two critical failure modes: catastrophic forgetting of source knowledge and error accumulation from noisy pseudo-labels over extended time horizons. In this comprehensive survey, we formally define the CTTA problem, analyze the diverse continual domain shift patterns that characterize different evaluation protocols, and propose a hierarchical taxonomy that categorizes existing methods into three families: optimization-based strategies (entropy minimization, pseudo-labeling, parameter restoration), parameter-efficient methods (normalization layer adaptation, adaptive parameter selection), and architecture-based approaches (teacher-student frameworks, adapters, visual prompting, masked modeling). We systematically review representative methods within each category and present comparative benchmarks and experimental results across standard evaluation settings. Finally, we discuss limitations of current approaches and highlight emerging research directions, including adaptation of foundation models and black-box systems, providing a roadmap for future research in robust continual test-time adaptation.

URL: https://openreview.net/forum?id=mM3r03Xw1V

---

Title: Measure Theory of Conditionally Independent Random Function Evaluation

Abstract: In sequential design strategies, common in geostatistics and Bayesian optimization, the selection of a new observation point $X_{n+1}$ of a random function $\mathbf f$ is informed by past data, captured by the filtration $\mathcal F_n=\sigma(\mathbf f(X_0),\dots,\mathbf f(X_n))$. The random nature of $X_{n+1}$ introduces measure-theoretic subtleties in deriving the conditional distribution $\mathbb P(\mathbf f(X_{n+1})\in A \mid \mathcal F_n)$. Practitioners often resort to a heuristic: treating $X_0,\dots, X_{n+1}$ as fixed parameters within the conditional probability calculation. This paper investigates the mathematical validity of this widespread practice. We construct a counterexample to prove that this approach is, in general, incorrect. We also establish our central positive result: for continuous Gaussian random functions and their canonical conditional distribution, the heuristic is sound. This provides a rigorous justification for a foundational technique in Bayesian optimization and spatial statistics. We further extend our analysis to include settings with noisy evaluations and to cases where $X_{n+1}$ is not adapted to $\mathcal F_n$ but is conditionally independent of $\mathbf f$ given the filtration.
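The plug-in heuristic under scrutiny can be written explicitly (notation follows the abstract):

```latex
\[
\mathbb P\bigl(\mathbf f(X_{n+1}) \in A \mid \mathcal F_n\bigr)
\;\overset{?}{=}\;
\mathbb P\bigl(\mathbf f(x_{n+1}) \in A \mid \mathbf f(x_0)=y_0,\dots,\mathbf f(x_n)=y_n\bigr)
\Big|_{\,x_i = X_i,\; y_i = \mathbf f(X_i)}
\]
```

That is, one computes the conditional law for fixed design points and then substitutes the random points afterwards; the paper shows this identity can fail in general but holds for continuous Gaussian random functions.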

URL: https://openreview.net/forum?id=XgReLRlKEk

---

Title: Predicting integers from continuous parameters

Abstract: We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. For example, the number of up-votes on social media posts, or the number of bicycles available at a public rental station. While it is possible to model these as continuous values, and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to ask whether such integer labels can be modeled directly by a discrete distribution, whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the _parameters_ of the distribution be continuous so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction and image generation. We find that overall the best performance comes from two distributions: _Bitwise_, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.
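The Bitwise distribution can be sketched in a few lines: $B$ independent Bernoulli parameters, one per bit of the target, give a continuous parameterization of a discrete distribution over $\{0,\dots,2^B-1\}$. In practice the per-bit probabilities would be network outputs; here they are a fixed vector for illustration:

```python
import numpy as np

def bitwise_nll(y, probs):
    """Negative log-likelihood of integer targets y in [0, 2**B) under
    independent Bernoulli distributions on each of the B bits."""
    B = probs.shape[0]
    bits = (y[:, None] >> np.arange(B)) & 1                 # binary expansion
    logp = bits * np.log(probs) + (1 - bits) * np.log(1 - probs)
    return -logp.sum(axis=1)                                # one NLL per sample

y = np.array([0, 5, 7])
nll = bitwise_nll(y, probs=np.full(3, 0.5))  # uniform over {0..7}: NLL = 3*ln 2
```

Because the loss is differentiable in `probs`, it can sit directly at the output of a network and be trained by backpropagation, which is exactly the requirement the abstract states.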

URL: https://openreview.net/forum?id=d1WKFlKFEa

---

Title: Properties and limitations of geometric tempering for gradient flow dynamics

Abstract: We consider the problem of sampling from a probability distribution $\pi$. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise the Kullback--Leibler divergence from $\pi$.
We consider the effect of replacing $\pi$ with a sequence of moving targets $(\pi_t)_{t\ge0}$ defined via geometric tempering on the Wasserstein and Fisher--Rao gradient flows.
We show that replacing the target distribution with a geometric mixture of the initial and target distributions does not lead to a convergence speed-up.
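Geometric tempering interpolates between an initial distribution $\mu$ and the target $\pi$; in the usual formulation (the schedule $\beta_t$ is illustrative) the moving targets are

```latex
\[
\pi_t \;\propto\; \mu^{\,1-\beta_t}\,\pi^{\beta_t},
\qquad \beta_0 = 0, \quad \beta_t \nearrow 1 .
\]
```

The gradient flows then chase this moving sequence $(\pi_t)_{t\ge0}$ instead of the fixed target $\pi$.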

URL: https://openreview.net/forum?id=IP0w5LdcxC

---

Title: Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers

Abstract: Rotary positional embeddings (RoPE) are widely used in large language models to encode
token positions through multiplicative rotations, yet their behavior at long context lengths
remains poorly characterized. In this work, we reinterpret RoPE as phase modulation
appliedto abank of complexoscillators, enabling analysis throughclassical signal processing
theory.
Under this formulation, we derive principled lower bounds on the RoPE base parameter that
are necessary to preserve positional coherence over a target context length. These include
a fundamental aliasing bound, analogous to a Nyquist limit, and a DC-component stability
bound that constrains phase drift in low-frequency positional modes. We further extend
this analysis to deep transformers, showing that repeated rotary modulation across layers
compounds angular misalignment, tightening the base requirement as depth increases.
Complementing these results, we derive a precision-dependent upper bound on the RoPE
base arising from finite floating-point resolution. Beyond this limit, incremental phase up-
datesbecomenumericallyindistinguishable, leadingtopositionalerasureevenintheabsence
of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent
feasibility region—a “Goldilocks zone”—for long-context transformers.
We validate the framework through a comprehensive case study of state-of-the-art models,
includingLLaMA,Mistral, andDeepSeekvariants, showingthatobservedsuccesses, failures,
and community retrofits align closely with the predicted bounds. Notably, models that
violate the stability bound exhibit attention collapse and long-range degradation, while
attempts to scale beyond one million tokens encounter a hard precision wall independent of
architecture or training.
Our analysis establishes RoPE base selection as a fundamental necessary architectural con-
straint, ratherthanatunablehyperparameter, andprovidespracticalguidancefordesigning,
scaling, and retrofitting long-context transformers under realistic numerical limits.
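A back-of-envelope version of the aliasing (Nyquist-style) lower bound: RoPE assigns pair $i$ the frequency $\omega_i = \mathrm{base}^{-2i/d}$, and requiring the slowest component to complete at most one full cycle over a target context $L$ gives $\mathrm{base} \ge (L/2\pi)^{d/(d-2)}$. This is an illustrative reading of the bound, not the paper's exact constant:

```python
import math

def min_rope_base(context_len, head_dim):
    """Smallest base for which the lowest rotary frequency,
    base**(-(head_dim - 2) / head_dim), has period >= context_len."""
    return (context_len / (2 * math.pi)) ** (head_dim / (head_dim - 2))

base_4k = min_rope_base(4096, 128)  # roughly 7e2
```

For a 4k context with head dimension 128 this lands comfortably below the common default base of 10000, consistent with 4k-context models working out of the box; pushing the context length up pushes the required base up with it.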

URL: https://openreview.net/forum?id=zxyMneble7

---

Title: Less Forgetting, More OOD Generalization: Adaptive Augmented Reweighted Replay (AA-RR) for Continual Learning

Abstract: Machine learning models often forget previously learned classes when trained sequentially. Rehearsal-based methods mitigate this by replaying stored samples, but their reliance on memorization leads to poor out-of-distribution (OOD) generalization—a problem that remains largely unstudied. This memorization is driven by unbalanced gradient updates, spurious correlations, and class-imbalanced replay buffers. To address these issues, we introduce Adaptive Augmented Reweighted Replay (AA-RR), a lightweight framework designed to improve generalization in rehearsal-based continual learning (CL). AA-RR applies adaptive, class-aware loss reweighting to correct gradient imbalance while accounting for data recency and limited buffer capacity. It further incorporates data-centric augmentation and a principled sample-selection strategy based on forgetting dynamics to retain representative, consistently learned examples. Experiments on standard CL benchmarks show that AA-RR markedly boosts generalization and surpasses state-of-the-art baselines, especially under covariate shift.
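The class-aware reweighting component can be illustrated with a standard inverse-frequency scheme. This is only the baseline idea AA-RR builds on; the method additionally adapts the weights for data recency and buffer capacity:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class loss weights inversely proportional to class counts,
    normalized to mean 1 over classes present in the buffer."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    present = counts > 0
    w = np.zeros(n_classes)
    w[present] = 1.0 / counts[present]
    w[present] /= w[present].mean()
    return w

labels = np.array([0, 0, 0, 1])            # imbalanced replay buffer
w = inverse_frequency_weights(labels, n_classes=2)  # -> [0.5, 1.5]
```

Upweighting the rare class counteracts the unbalanced gradient updates the abstract identifies as one driver of memorization.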

URL: https://openreview.net/forum?id=4wd79nLVna

---

Title: Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection

Abstract: Zero-shot anomaly detection/localization trains on a source domain and discriminates images from unseen target domains given only textual prompts (e.g., “normal" vs. “anomaly"); therefore, performance hinges on generalization. Recent methods build on CLIP for its strong zero-shot generalization; however, as we show, localization has not improved as much as detection and, especially for small regions, remains near random, with AUPRO close to chance, indicating weak pixel-level generalization. We attribute this to CLIP’s limited ability to retain fine-grained features in its vision encoder and insufficient alignment between the text encoder and dense visual features, which have not been effectively addressed in previous methods. To address these challenges, first, we replace CLIP’s vision encoder with an adapted vision encoder that uses a correlation-based attention module to better preserve fine-grained features and small details. Second, we boost text–vision alignment by conditioning the learnable prompts in the text encoder on image context extracted from the vision encoder and performing local-to-global representation fusion, further improving localization. Finally, we show that our correlation-based attention module can incorporate feature correlations from additional models such as DINOv2, further enhancing spatial understanding and localization. We call our model Crane (Context-Guided Prompt Learning and Attention Refinement) and its DINOv2-boosted variant Crane+ and show that it improves the state-of-the-art by up to 28% in pixel-level localization (AUPRO) and up to 4.5% in image-level detection (AP), across 14 industrial and medical datasets.

URL: https://openreview.net/forum?id=logc7dzJRS

---

Title: Learning from Missing Values: Encoding Missingness in Representation-Space for LSTM Time Series Forecasting

Abstract: While many state-of-the-art techniques reconstruct incomplete time series datasets by replacing gaps with modeled estimates, we propose an alternative: encode missing values as an extremal sentinel value, allowing a prediction model to learn from the pattern of missingness. Incomplete data is a common problem in real-world time series forecasting, particularly in environmental monitoring where sensor failures can cause continuous gaps in data. This paper proposes the Min-Std method, a novel computationally efficient imputation strategy that encodes missingness in representation-space with an extremal statistical sentinel $(\min - \sigma)$ mapped to $0$ under Min-Max scaling. The result is that instead of training a model on (possibly imprecise) estimates for missing data, we simply replace the missing value with a sentinel the model can recognize to mean `uninformative'. By ensuring this sentinel is uniquely mapped to $0$, the only $0$ values the model will receive are either missing values or values dropped by a dropout regularizer. We compare prediction results using our Min-Std imputation strategy against 12 imputation methods (including Kalman Smoothing and MissForest) across 6 different transformations on 28 distinct environmental datasets. Friedman's nonparametric test and critical difference ranking demonstrate that Min-Std imputation consistently yields superior predictive performance (measured by KGE, NMSE, and F1 Score) compared to complex model-based alternatives while being orders of magnitude faster (e.g. $0.02$s vs $500$s$+$). Our findings suggest that single-channel, explicit representation-space encodings of missingness are preferable to reconstruction-based imputation.
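The sentinel construction can be sketched in a few lines, assuming the min/std statistics are taken over the observed values (and that they are not all identical, so the sentinel is strictly below the minimum):

```python
import numpy as np

def min_std_impute_scale(x):
    """Replace NaNs with the sentinel (min - std) of observed values, then
    Min-Max scale so the sentinel is the unique value mapped to 0."""
    obs = x[~np.isnan(x)]
    sentinel = obs.min() - obs.std()        # strictly below all observed values
    filled = np.where(np.isnan(x), sentinel, x)
    return (filled - sentinel) / (obs.max() - sentinel)

x = np.array([1.0, 2.0, np.nan, 3.0])
scaled = min_std_impute_scale(x)            # missing entry maps to exactly 0.0
```

Because every observed value lies strictly above the sentinel, $0$ after scaling is reserved for missingness, which is the property the method exploits.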

URL: https://openreview.net/forum?id=DMmCMIrrez

---

Title: On Symmetric Losses for Policy Optimization with Noisy Preferences

Abstract: Optimizing policies based on human preferences is key to aligning language models with human intent.
This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization.
Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases, which can be asymmetric.
We propose a principled framework for robust policy optimization under noisy preferences based on the view of reward modeling as a binary classification problem.
Specifically, we demonstrate that asymmetric preference noise can be effectively treated as symmetric noise under this framework.
This viewpoint allows us to leverage symmetric losses, well known for their robustness to label noise in classification, for reward modeling, which leads to our Symmetric Preference Optimization (SymPO) method, a novel offline preference optimization algorithm.
Theoretically, we prove that symmetric losses enable successful policy improvement even with noisy labels, as the resulting reward is rank-preserving—a property we identify as sufficient for policy improvement.
Empirical evaluations on a synthetic dataset and real-world language model alignment tasks demonstrate that SymPO achieves competitive or higher performance than existing robust methods in high-noise scenarios.
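A loss $\ell$ is symmetric when $\ell(z) + \ell(-z)$ is constant in the margin $z$; the sigmoid loss is the standard example. This is a minimal illustration of the property the abstract's framework relies on, not the SymPO algorithm itself:

```python
import math

def sigmoid_loss(z):
    """Symmetric classification loss: sigmoid_loss(z) + sigmoid_loss(-z) == 1."""
    return 1.0 / (1.0 + math.exp(z))

# Under symmetric label noise with flip rate rho, the expected loss becomes
# rho + (1 - 2*rho) * l(z): an affine map, so the minimizer (and the induced
# reward ranking) is unchanged by the noise.
total = sigmoid_loss(2.3) + sigmoid_loss(-2.3)
```

A non-symmetric loss such as logistic loss lacks this invariance, which is why symmetric losses are the natural choice when preference labels are noisy.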

URL: https://openreview.net/forum?id=cBWGLmSeao

---

Title: A Method and a Metric for GNN Reliance on Information from Features and Structure

Abstract: Graph Neural Networks (GNNs) rely on both node-edge features and graph structure, but the relative use of these information sources is poorly understood. In many cases either features or structure contain more useful information, and in extreme cases one may inhibit learning, as in some tasks where models overfit on structural patterns. Understanding the balance of these information sources is therefore essential for strategic model design.

We introduce Noise-Noise Analysis to measure each source’s contribution to model performance, along with the Noise-Noise Ratio Difference (NNRD) metric that quantifies whether a model is feature-reliant or structure-reliant. Through experiments on synthetic and real-world graph-classification datasets, we show that GCN, GAT, and GIN layers can all perform graph-less learning (ignoring structure when unhelpful), but only the GIN performs feature-less learning. All three architectures exhibit bias toward features over structure. Noise-Noise Analysis provides practitioners with a fast tool to understand their models’ information usage.

URL: https://openreview.net/forum?id=e0Uw9aBr3A

---

Title: Learnable Coreset Selection for Graph Active Learning

Abstract: Graph Neural Networks (GNNs) have demonstrated their effectiveness in a variety of graph-based tasks. However, their performance heavily depends on the availability of a sufficient amount of labeled data, which is often costly to acquire in real-world applications.
To tackle this, GNN-based Active Learning (AL) methods aim to enhance labeling efficiency by selecting the most informative nodes for labeling. However, existing methods often rely on heuristic or implicit approaches that fail to fully capture the influence of labeled data on unlabeled nodes, thereby limiting their adaptability across diverse graph types.
In this paper, we propose LearnAL, a Learnable coreset labeling framework for graph Active Learning to address these limitations. Unlike traditional heuristic-based methods, LearnAL explicitly models the correlations between labeled and unlabeled nodes using an attention architecture, linking these correlations directly to prediction performance. Leveraging global influence (attention) scores, LearnAL selects and labels samples that maximize representational diversity, enhancing sample coverage.
We provide theoretical analysis demonstrating that this attention-based selection reduces the covering radius bound, improving prediction performance on unlabeled data. Our experimental results show that the labeled coreset significantly enhances the generalizability of various graph models across different graph datasets, as well as that of CNN models in image classification tasks.

URL: https://openreview.net/forum?id=ursw3nWq5K

---

Title: Theoretical Foundations of Continual Learning via Drift-Plus-Penalty

Abstract: In many real-world settings, data streams are inherently nonstationary and arrive sequentially, requiring learning systems to adapt continuously without repeatedly retraining from scratch. Continual learning (CL) addresses this setting by seeking to incorporate new tasks while preventing catastrophic forgetting, whereby updates for recent data induce performance degradation on previously acquired knowledge. We introduce a control-theoretic perspective on CL that explicitly regulates the temporal evolution of forgetting, framing adaptation to new tasks as a controlled process subject to long-term stability constraints. We focus on replay-based CL settings in which a finite memory buffer preserves representative samples from prior tasks, allowing forgetting to be explicitly regulated. We propose COntinual Learning with Drift-Plus-Penalty (\texttt{COLD}), a novel CL framework grounded in the stochastic optimization-based Drift-Plus-Penalty (DPP) principle. At each task, \texttt{COLD} minimizes the instantaneous penalty corresponding to the current task loss while simultaneously maintaining a virtual queue that explicitly tracks deviations from long-term stability on previously learned tasks, hence capturing the stability–plasticity trade-off as a regulated dynamical process. We establish stability and convergence guarantees that characterize this trade-off, as governed by a tunable control parameter. Empirical results on standard benchmark datasets show that the proposed framework consistently achieves superior accuracy compared to a wide range of state-of-the-art CL baselines, while exhibiting competitive and tunable forgetting behavior that reflects the explicit regulation of the stability–plasticity trade-off through virtual queues and the DPP objective.
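
The Drift-Plus-Penalty mechanics behind \texttt{COLD} can be sketched on a toy problem (the action set, weights, and budget below are illustrative, not taken from the paper): each step greedily minimizes $V \cdot \text{penalty} + Q \cdot \text{violation}$, while a virtual queue $Q$ tracks forgetting in excess of a long-term budget:

```python
# Toy Drift-Plus-Penalty loop. Two hypothetical "actions":
#   (penalty = current-task loss, forgetting cost on old tasks)
V = 10.0                              # plasticity-vs-stability control knob
budget = 0.5                          # allowed average forgetting per step
actions = [(1.0, 0.1), (0.2, 0.9)]
Q = 0.0                               # virtual queue for the stability constraint
history = []
for t in range(1000):
    # Greedy DPP choice: minimize V * penalty + Q * violation.
    penalty, forget = min(actions, key=lambda a: V * a[0] + Q * a[1])
    Q = max(Q + forget - budget, 0.0)  # queue drift: excess forgetting accumulates
    history.append(forget)
avg_forget = sum(history) / len(history)
print(avg_forget)  # long-run average forgetting settles near the budget
```

The queue stays bounded, which is exactly the DPP guarantee that the long-term constraint is (approximately) met while the per-step penalty is kept low.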

URL: https://openreview.net/forum?id=QhxNMdhhBy

---

Title: RobustMAD: Evaluating Real-World Robustness of Multimodal Small Language Models for Deployable Anomaly Detection Assistants

Abstract: Multimodal industrial anomaly inspection assistants are a critical component of next-generation smart factories, enabling interactive vision–language–based querying. However, multimodal large language models remain impractical for on-site deployment due to prohibitive computational demands and privacy risks from cloud-based inference. Compact multimodal \textit{small} language models (MSLMs) offer a deployable alternative, yet progress is constrained by the lack of comprehensive robustness analyses and meaningfully challenging benchmarks that reflect real-world industrial conditions. To address this gap, we develop RobustMAD, the first practically grounded benchmark, designed to systematically evaluate real-world robustness through diverse open-ended queries spanning object understanding, anomaly detection, unanswerable problems, and visual quality degradations. Contrary to conventional assumptions, top-performing MSLMs exhibit promising capabilities, surprisingly outperforming even the larger GPT-5 Nano. However, they are still far below industrial requirements, and RobustMAD exposes critical robustness gaps posing significant operational risks. In particular, three recurring failure modes emerge: (i) fragile multimodal grounding under fine-grained distinctions or degraded visual conditions, (ii) insufficiently comprehensive responses, and (iii) weak logical grounding on unanswerable or ill-posed queries, leading to hallucinated outputs. Grounded in these insights, we provide actionable guidance for the design of next-generation multimodal industrial inspection assistants that leverage their promising competence. Code is available at \url{http://anonymous.4open.science/r/RobustMAD-146D}.

URL: https://openreview.net/forum?id=skrA9UYNIZ

---

Title: Speeding up fairness reductions

Abstract: We study the problem of fair classification, where the goal is to optimize classification accuracy subject to fairness constraints. This type of problem occurs in many real-world applications, where we seek to ensure that a deployed AI system does not disproportionately impact historically disadvantaged groups. One of the leading approaches in the literature is the reduction approach (Agarwal et al., 2018; 2019), which enjoys many favorable properties. For instance, it supports a wide range of fairness constraints and model families and is usually easy to incorporate in existing ML pipelines. The reduction approach acts as a wrapper around a standard ML algorithm and obtains a model that satisfies fairness constraints by repeatedly running a fairness-unaware base algorithm. A typical number of iterations is around 100, meaning that the reduction approach can be up to 100 times slower than the base algorithm, which limits its applicability. To overcome this limitation, we introduce two algorithmic innovations. First, we interleave the exponentiated gradient updates of the standard reduction approach with column-generation updates, which leads to a decrease in the number of calls to the base algorithm. Second, we introduce adaptive sampling, which decreases the sizes of the datasets used in the calls to the base algorithm. We conduct comprehensive experiments to evaluate the efficacy of our improvements, showing that our two innovations speed up the reduction approach by an order of magnitude without sacrificing the quality of the resulting solutions.
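
The exponentiated-gradient outer loop that the paper accelerates can be sketched as a multiplicative update on the constraint multipliers (a simplified simplex variant; the violation vector and step size below are invented for illustration):

```python
import numpy as np

eta = 0.5
# Signed fairness-constraint violations g_k(h_t) of the current classifier
# (held fixed here purely to show the update's fixed-point behavior).
violations = np.array([0.10, -0.03, 0.05])
lam = np.ones(3) / 3                      # multipliers on the simplex
for _ in range(50):
    lam = lam * np.exp(eta * violations)  # exponentiated-gradient step
    lam = lam / lam.sum()                 # project back to the simplex
print(lam.round(3))  # mass concentrates on the most-violated constraint
```

In the full reduction, each multiplier update is followed by a call to the fairness-unaware base learner on reweighted data, which is the costly step the paper's column-generation and adaptive-sampling ideas reduce.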

URL: https://openreview.net/forum?id=C0AdL3r1Dc

---

Title: Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Abstract: Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice --- but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we measure agreement with human grading, by annotating responses to MMLU-Pro and GPQA-Diamond questions. We find answer matching using recent models -- even small ones -- achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improved evaluations via answer matching are not merely a conceptual concern --- answer matching reduces costs and significantly changes model rankings. Multiple choice benchmarks that seem saturated start showing room for improvement when evaluated with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
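
The paper's matcher is a language model given the reference answer; as a degenerate stand-in that only conveys the interface, one can match normalized strings (the normalization rules here are illustrative, not the paper's method):

```python
import re

def normalize(ans):
    # Crude normalization: lowercase, drop articles and punctuation, squeeze spaces.
    ans = ans.lower()
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    ans = re.sub(r"[^a-z0-9 ]", " ", ans)
    return " ".join(ans.split())

def matches(response, reference):
    # Stand-in for the LLM matcher: containment after normalization.
    return normalize(reference) in normalize(response)

print(matches("The answer is Paris.", "paris"))
print(matches("It is London.", "Paris"))
```

A real answer-matching pipeline would replace `matches` with a prompted language model that sees the question, the free-form response, and the reference answer.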

URL: https://openreview.net/forum?id=kyzLTGreeC

---

Title: Harnessing Heterogeneity: Improving Convergence Through Partial Variance Control in Federated Learning

Abstract: Federated Learning (FL) has emerged as a promising paradigm for collaborative model training without sharing local data. However, a significant challenge in FL arises from the heterogeneous data distributions across participating clients. This heterogeneity leads to highly variable gradient norms in the model's final layers, resulting in poor generalization, slower convergence, and reduced robustness of the global model. To address these issues, we propose a novel technique that incorporates a gradient penalty term into partial variance control. Our method enables diverse representation learning from heterogeneous client data in the initial layers while modifying standard SGD in the final layers. This approach reduces variance in the classification layers, aligns gradients, and mitigates the effects of data heterogeneity. Through theoretical analysis, we establish convergence rate bounds for the proposed algorithm, demonstrating its potential for competitive convergence compared to current FL methods in highly heterogeneous data settings. Empirical evaluations on five benchmark datasets validate our approach, showing enhanced performance and faster convergence over state-of-the-art baselines across various levels of data heterogeneity.

URL: https://openreview.net/forum?id=I9VhJ5iLNr

---

Title: RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging

Abstract: Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merge model by minimizing the discrepancy in predictions between the merge and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in the earlier layers propagate through the layers and influence the final prediction in the merge model. Here, we introduce RegMean++, a simple yet effective alternative to RegMean, that explicitly incorporates both intra-layer and cross-layer dependencies between merge models' layers into RegMean's objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merge model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive or state-of-the-art performance compared to various recent advanced model merging methods.
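
For context, the original RegMean closed form minimizes $\sum_i \|X_i W - X_i W_i\|^2$ over the merged weights $W$ of one linear layer, which reduces to solving normal equations (random data for illustration; this sketches RegMean itself, not the cross-layer RegMean++ objective):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4
# Two candidate models' weights for the same linear layer, and their layer inputs.
W1, W2 = rng.normal(size=(d, k)), rng.normal(size=(d, k))
X1, X2 = rng.normal(size=(100, d)), rng.normal(size=(100, d))

# Gram matrices of the layer inputs.
G1, G2 = X1.T @ X1, X2.T @ X2
# Closed-form minimizer of ||X1 W - X1 W1||^2 + ||X2 W - X2 W2||^2:
#   (G1 + G2) W = G1 W1 + G2 W2
W_merged = np.linalg.solve(G1 + G2, G1 @ W1 + G2 @ W2)
```

RegMean++ modifies this per-layer objective so the Gram statistics reflect how earlier merged layers transform the inputs, rather than treating each layer in isolation.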

URL: https://openreview.net/forum?id=H5lDsSCS9i

---

Title: BitLogic: Training Framework for Gradient-Based FPGA Native Neural Networks

Abstract: The energy and latency costs of deep neural network inference are increasingly driven by deployment rather than training, motivating hardware-specialized alternatives to arithmetic-heavy models. FPGAs provide an attractive substrate for such specialization, yet existing FPGA-based neural approaches are fragmented and difficult to compare. We present BitLogic, a fully gradient-based, end-to-end trainable framework for FPGA-native neural networks built around LUT computation. BitLogic replaces multiply--accumulate operations with differentiable LUT nodes that map directly to FPGA primitives, enabling native binary computation, sparse connectivity, and efficient hardware realization. The framework offers a modular functional API supporting diverse architectures, along with learned encoders, hardware-aware heads, and multiple boundary-consistent LUT relaxations. An automated RTL export pipeline translates trained PyTorch models into synthesizable HDL, ensuring equivalence between software and hardware inference. Experiments across standard vision benchmarks and heterogeneous hardware platforms demonstrate competitive accuracy and substantial gains in FPGA efficiency, including 72.3% test accuracy on CIFAR-10 achieved with fewer than 0.3M logic gates, while attaining sub-20ns single-sample inference using only LUT resources.

URL: https://openreview.net/forum?id=ZbsSZAfDod

---

Title: ECLayr: A Fast and Robust Topological Layer via Euler Characteristic Curves

Abstract: We introduce a flexible and computationally efficient topological layer for general deep learning architectures, built upon the Euler Characteristic Curve. Unlike existing approaches that rely on computationally intensive persistent homology, our method bypasses this bottleneck while retaining essential topological information across diverse data modalities. To enable complete end-to-end training, we develop a novel backpropagation scheme that improves computation and mitigates vanishing gradient issues. We further provide a stability analysis, establishing guarantees for the proposed layer in the presence of noise and outliers. We integrate the proposed layer into topological autoencoders to enhance representation learning through topological signals. We further demonstrate the effectiveness of our approach through classification experiments on a variety of datasets, including high-dimensional settings where persistent homology becomes computationally challenging.
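
For a graph (1-complex) filtration, the Euler Characteristic Curve is simply vertex-minus-edge counts across sublevel sets, which is why it is so much cheaper than persistent homology; a minimal sketch with illustrative filtration values:

```python
import numpy as np

def ecc(vert_vals, edge_vals, thresholds):
    # chi(t) = #vertices - #edges present in the sublevel set at threshold t.
    return np.array([(vert_vals <= t).sum() - (edge_vals <= t).sum()
                     for t in thresholds])

vert_vals = np.array([0.0, 0.1, 0.2, 0.3])      # vertex filtration values
edge_vals = np.array([0.15, 0.25, 0.35, 0.45])  # edge filtration values
curve = ecc(vert_vals, edge_vals, np.array([0.05, 0.2, 0.4, 0.5]))
print(curve)  # components appear as vertices enter, edges then merge them
```

Each threshold is an O(n) count, so the whole curve is computed in one pass over the filtration values, with no boundary-matrix reduction.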

URL: https://openreview.net/forum?id=mpz0Z1cP36

---

Title: Legal Retrieval for Public Defenders

Abstract: AI tools are increasingly suggested as solutions to assist public agencies with heavy workloads. In public defense---where a constitutional right to counsel meets the complexities of law, overwhelming caseloads, and constrained resources---practitioners face especially taxing conditions. Yet, there is little evidence of how AI could meaningfully support defenders' day-to-day work. In partnership with the anonymized Office of the Public Defender, we develop the anonymized BriefBank, a retrieval tool which surfaces relevant appellate briefs to streamline legal research and writing.
We show that existing retrieval benchmarks fail to transfer to real public defense research; adding domain knowledge, however, improves retrieval quality. This includes query expansion with legal reasoning, domain-specific data, and curated synthetic examples. To facilitate further research, we release a new, realistic retrieval dataset, manually annotated by real public defenders, and provide a taxonomy of these realistic defender search queries.
Together, our work improves on the status quo of realistic retrieval benchmarking and provides a starting point for leveraging AI in a real-world public interest setting.

URL: https://openreview.net/forum?id=HnbKQGRnDt

---

Title: Gen-MURE: Generalized Multiplicative Unbiased Risk Estimate

Abstract: Coherent imaging modalities such as ultrasound and synthetic aperture radar (SAR) produce images degraded by signal-dependent multiplicative noise, where the noise distributions vary widely across acquisition scenarios. Existing self-supervised image denoising methods either assume zero-mean additive noise or independence across pixels, or require the noise distribution to be known, assumptions that often limit their applicability in real-world image denoising systems. We propose a Generalized Multiplicative Unbiased Risk Estimate (Gen-MURE), a model-agnostic self-supervised image denoising framework for enhancing images corrupted by signal-dependent, multiplicative speckle. Gen-MURE does not rely on explicit assumptions about the exact noise distribution and formulates a principled framework capable of denoising speckle drawn from a range of distributions. Gen-MURE does not require access to clean ground-truth images or parameters of the noise model, and denoises images in a single step without any iterative refinement. Extensive experiments on ultrasound images along with unseen simulated and real SAR images demonstrate the efficiency and robustness of Gen-MURE.

URL: https://openreview.net/forum?id=Hie13qRm1x

---

Title: Towards Principled Task Grouping for Multi-Task Learning

Abstract: Multi-task learning (MTL) aims to leverage shared information among tasks to improve learning efficiency and accuracy. However, MTL often struggles to effectively manage positive and negative transfer between tasks, which can hinder performance improvements. Task grouping addresses this challenge by organizing tasks into meaningful clusters, maximizing beneficial transfer while minimizing detrimental interactions.
This paper introduces a principled approach to task grouping in MTL, advancing beyond existing methods by addressing key theoretical and practical limitations. Unlike prior studies, our method offers a theoretically grounded approach that does not depend on restrictive assumptions for constructing transfer gains. We also present a flexible mathematical programming formulation that accommodates a wide range of resource constraints, thereby enhancing its versatility.
Experimental results across diverse domains, including computer vision datasets, combinatorial optimization benchmarks, and time series tasks, demonstrate the superiority of our method over extensive baselines, thereby validating its effectiveness and general applicability in MTL without sacrificing efficiency.

URL: https://openreview.net/forum?id=3DeSIpzuro

---

Title: Harnessing Optimization Dynamics for Curvature-Informed Model Merging

Abstract: Model merging is an effective strategy for composing capabilities in large language models without the need for costly joint retraining. We study this process in the supervised fine-tuning (SFT) stage, consolidating multiple checkpoints specialized for distinct capabilities (e.g., math, coding, and precise instruction following) into a single model. First, we introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware method for mitigating task interference that uses optimizer second-moment statistics as a diagonal curvature proxy to first prune the task vector with our Fast Fisher Grafting (FFG) technique and then reweight the pruned vector. When merging diverse, capability-based checkpoints, OTA improves the merged model's performance over strong baseline methods, as evaluated on unseen capability-based benchmarks. Second, we conduct a comprehensive, theoretically-inspired empirical analysis to explain the effectiveness of OTA. Our analysis surprisingly reveals that FFG implicitly induces a layer- and role-wise aware pruning mechanism that is capable of maintaining fine-tuning performance at much more aggressive pruning ratios compared to magnitude pruning and that exhibits interpretable task localization properties. Third, an extensive comparison of our curvature proxy across capability checkpoints shows that experts converge to a basin with substantial curvature similarity, offering a novel lens on why simple linear merging can be effective in practice. This result further strengthens our ablation study, showing that FFG is critical for merging performance. Finally, we develop a memory-light variant of OTA that efficiently compresses the second moments, mitigating the additional storage requirements of our method and improving scalability. We make all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints accessible through an anonymized repository at \url{https://github.com/tmlr-ota/ota}.

URL: https://openreview.net/forum?id=Wb2r8TdAyD

---

Title: Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models

Abstract: Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance degrades significantly. In this work, we investigate how to efficiently utilize class text information to mitigate distribution drifts encountered by VLMs during inference. In particular, we propose generating pseudo-labels for the noisy test-time samples by aligning visual embeddings with reliable, text-based semantic anchors. Specifically, to maintain the regular structure of the dataset properly, we formulate the problem as a batch-wise label assignment, which is efficiently solved using Optimal Transport. Our method, Semantic Anchor Transport (SAT), utilizes such pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. Moreover, SAT further leverages heterogeneous textual clues, with a multi-template distillation approach that replicates multi-view contrastive learning strategies in unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks of diverse complexity empirically show the superiority of SAT, achieving consistent performance gains over recent state-of-the-art methods while remaining computationally efficient.
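
The batch-wise assignment step can be sketched with entropic optimal transport between uniform marginals over test samples and class anchors (a generic Sinkhorn sketch with random embeddings; the paper's exact cost, marginals, and regularization may differ):

```python
import numpy as np

def sinkhorn(cost, n_iter=200, eps=0.1):
    # Entropic OT between uniform marginals over rows (samples) and cols (classes).
    K = np.exp(-cost / eps)
    n, m = cost.shape
    r, c = np.ones(n) / n, np.ones(m) / m
    u = np.ones(n)
    for _ in range(n_iter):
        v = c / (K.T @ u)      # scale columns toward marginal c
        u = r / (K @ v)        # scale rows toward marginal r
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 32)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 32)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
cost = 1.0 - img @ txt.T               # cosine distance to each class anchor
P = sinkhorn(cost)                     # transport plan over the batch
pseudo_labels = P.argmax(axis=1)       # batch-wise pseudo-labels
```

The marginal constraints are what distinguish this from per-sample nearest-anchor assignment: they enforce a balanced, batch-level labeling that resists collapsing onto a few classes.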

URL: https://openreview.net/forum?id=iE0Yemxvio

---

Title: Chance-Constrained Inference for Hallucination Risk Control in Large Language Models

Abstract: Large language models generate outputs stochastically and may produce fluent but invalid responses, including factual hallucinations. Existing mitigation strategies reduce average error rates but do not provide explicit control over the \emph{frequency} of such failures under repeated use. We formulate inference as a deployment-time risk control problem and introduce \emph{chance-constrained inference}, which directly bounds the probability of hallucinations among accepted generations. Hallucinations are modeled as stochastic constraint violations, and we show that confidence-based selective prediction does not, in general, imply probabilistic risk guarantees. To enforce chance constraints efficiently, we propose a sequential, anytime-valid inference procedure that adaptively certifies feasibility or infeasibility using finite samples, avoiding conservative fixed-sample bounds. Experiments on questions inspired by NaturalQuestions and controlled multi-hop question answering demonstrate reliable risk control, early detection of intrinsically infeasible inputs, and safe composition under repeated use, while confidence-based baselines fail to provide consistent guarantees.
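
One way to realize an anytime-valid certificate of the kind the abstract describes is a time-uniform Hoeffding bound obtained by a union bound over sample sizes (this generic construction is illustrative only; the paper's sequential procedure may differ, and the rates below are synthetic):

```python
import math
import random

def radius(n, delta=0.05):
    # Time-uniform Hoeffding radius: allocate delta/(n*(n+1)) error to step n,
    # so the confidence sequence is valid simultaneously for all n.
    return math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))

random.seed(0)
p_true, alpha = 0.02, 0.10   # true hallucination rate vs. allowed risk bound
fails = 0
decision = None
for n in range(1, 5001):
    fails += random.random() < p_true     # observe one accepted generation
    p_hat, r = fails / n, radius(n)
    if p_hat + r < alpha:                 # certify feasibility, stop early
        decision = ("feasible", n)
        break
    if p_hat - r > alpha:                 # certify infeasibility
        decision = ("infeasible", n)
        break
print(decision)
```

Because the radius is valid at every stopping time, the procedure may halt as soon as either side of the chance constraint is certified, avoiding a conservative fixed-sample budget.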

URL: https://openreview.net/forum?id=cJDhDC69m9

---

Title: Inference-Time Scaling for Joint Audio–Video Generation

Abstract: Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs.

URL: https://openreview.net/forum?id=MHNFjjm5nO

---

Title: Large Pretraining Datasets Don't Guarantee Robustness after Fine-Tuning

Abstract: Large-scale pretrained models are widely leveraged as foundations for learning new specialized tasks via fine-tuning, with the goal of maintaining the general performance of the model while allowing it to gain new skills. A valuable goal for all such models is robustness: the ability to perform well on out-of-distribution (OOD) tasks. We assess whether fine-tuning preserves the overall robustness of the pretrained model, and observe that models pretrained on large datasets exhibit strong catastrophic forgetting and loss of OOD generalization. To systematically assess robustness preservation in fine-tuned models, we propose the Robustness Inheritance Benchmark (ImageNet-RIB). The benchmark, which can be applied to any pretrained model, consists of a set of related but distinct OOD (downstream) tasks and involves fine-tuning on one of the OOD tasks in the set then testing on the rest. We find that though continual learning methods help, fine-tuning reduces robustness across pretrained models. Surprisingly, models pretrained on the largest and most diverse datasets (e.g., LAION-2B) exhibit both larger robustness losses and lower absolute robustness after fine-tuning on small datasets, relative to models pretrained on smaller datasets. These findings suggest that starting with the strongest foundation model is not necessarily the best approach for performance on specialist tasks.

URL: https://openreview.net/forum?id=VyVkIucjWU

---

Title: MSpecTmol: A Multi-Modal Spectroscopic Learning Framework for Automated Molecular Structure Elucidation

Abstract: Spectroscopic techniques are indispensable for the elucidation of molecular structures, particularly for novel molecules with unknown configurations. However, a fundamental limitation of any single spectroscopic modality is that it provides an inherently circumscribed and fragmented view, capturing only specific facets of the complete molecular structure, which is often insufficient for unequivocal and robust characterization. Consequently, the integration of data from multiple spectroscopic sources is imperative to overcome these intrinsic limitations and achieve a comprehensive and accurate structural characterization. In this work, we introduce \textbf{MSpecTmol}, a novel \textbf{M}ulti-modal \textbf{Spec}trum information fusion learning framework for automated \textbf{Mol}ecular structure elucidation. By extending information bottleneck theory, our framework provides a principled and adaptive approach to fusing spectra. It designates a primary modality to extract core molecular features while leveraging auxiliary inputs to enrich the representation. To validate the end-to-end effectiveness of our framework, we design a two-fold evaluation: molecular substructure classification, to probe its discriminative power in identifying substructures, and conformation reconstruction, which extends this knowledge to recover plausible 3D structures. Our results not only demonstrate state-of-the-art performance in molecular substructure classification but also achieve near-experimental accuracy (\textasciitilde 0.68\AA) in molecular conformation reconstruction. These findings underscore the model’s capacity to learn interpretable features aligned with chemical intuition, thereby paving the way for future advances in automated and reliable spectroscopic analysis. Our code can be found at \href{https://anonymous.4open.science/r/MspecTmol-6B4D}{https://anonymous.4open.science.}

URL: https://openreview.net/forum?id=kRhf5Z1Cy1

---

Title: CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

Abstract: Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural Audio Codec–based models such as CodecFormer and SDCodec are compute efficient but limited to fixed-class separation.
We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a lightweight transformer masker modulated by CLAP-derived FiLM parameters. A key empirical finding underlying this design is that the latent space of modern neural audio codecs is already sufficiently structured: source-specific information is partially disentangled in the NAC representation, allowing effective source extraction via masking alone, without explicit re-encoding.
Across six open-domain benchmarks under matched training and prompting protocols, CodecSep surpasses AudioSep in separation fidelity (SI\mbox{-}SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, Sudo rm-rf, CodecFormer, SDCodec). In code-stream deployments, it requires just 1.35~GMACs end-to-end—$\sim$54$\times$ less compute (25$\times$ architecture-only) than spectrogram-domain separators like AudioSep—while remaining fully bitstream-compatible.

URL: https://openreview.net/forum?id=r63GX9hKhC

---

Title: MIDAS: Mosaic Input-Specific Differentiable Architecture Search

Abstract: Differentiable Neural Architecture Search (NAS) provides efficient, gradient-based methods for automatically designing neural networks, yet its adoption remains limited in practice. We present MIDAS, a novel approach that modernizes DARTS by replacing static architecture parameters with dynamic, input-specific parameters computed via self-attention.
To improve robustness, MIDAS (i) localizes the architecture selection by computing it separately for each spatial patch of the activation map, and (ii) introduces a parameter-free, topology-aware search space that models node connectivity and simplifies selecting the two incoming edges per node. We evaluate MIDAS on the DARTS, NAS-Bench-201, and RDARTS search spaces. In DARTS, it reaches 97.42% top-1 on CIFAR-10 and 83.38% on CIFAR-100. In NAS-Bench-201, it consistently finds globally optimal architectures. In RDARTS, it sets the state of the art on two of four search spaces on CIFAR-10. We further analyze why MIDAS works, showing that patchwise attention improves discrimination among candidate operations, and the resulting input-specific parameter distributions are class-aware and predominantly unimodal, providing reliable guidance for decoding.

URL: https://openreview.net/forum?id=9R6EBFTzuy

---

Title: On Hamming–Lipschitz Type Stability of the Subdominant (Minmax) Ultrametric: Theory and Simple Proofs

Abstract: We study the subdominant (minmax) ultrametric as an operator on pairwise data. Prior stability results show that this operator is non-expansive under uniform perturbations in the supremum norm and in the Gromov–Hausdorff sense, but they say nothing about how widely sparse, targeted edits can ripple through the hierarchy. We close this gap with a pair-count Lipschitz theory in Hamming space: we bound how many ultrametric entries can change, regardless of their magnitudes. The analysis is routed through the \emph{minimum spanning tree} (MST), which encodes the ultrametric as path bottlenecks. Our first theorem proves a locality principle: only pairs whose MST path crosses an edited or newly exposed cut can change, so the impact is confined to a union of fundamental cut rectangles. Building on this, we derive an instance-dependent $\ell_0$-type Lipschitz bound whose constant is determined entirely by the MST’s exposed cuts. We then show optimality by constructing cases where a single off-tree edit forces a quadratic number of changes, so no smaller universal constant is possible for our proposed Lipschitz constant. Finally, under a mild minimal-overlap condition, the upper bound on the number of changed entries of the ultrametric is order-tight, yielding a two-sided characterization of propagation. Conceptually, this advances a magnitude-versus-extent picture for ultrametric stability: classical results control how much entries move under uniform perturbation; our theory controls how far changes spread under sparse edits. Additionally, as a proof of concept, we derive a risk score from our Lipschitz constant that identifies vulnerable edges in the graph.
We use this score to drive two case studies: vulnerability maps of deep embeddings of CIFAR-10, ImageNet-10, and STL-10, where targeted edits to high-score edges cause far larger ultrametric and clustering changes than random edits with the same budget, and fragility maps in a superpixel-based single image segmentation that highlight load-bearing boundaries.
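The MST encoding the abstract relies on is concrete: the subdominant ultrametric distance between two points is the largest edge weight on their unique MST path. A minimal self-contained sketch (Kruskal plus a bottleneck DFS; not the authors' code):

```python
import itertools

def subdominant_ultrametric(n, dist):
    # Kruskal's algorithm: the minimum spanning tree encodes the
    # subdominant (minmax) ultrametric -- u(i, j) is the largest edge
    # weight on the unique MST path between i and j.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = {i: [] for i in range(n)}
    for w, i, j in sorted((dist[i][j], i, j)
                          for i, j in itertools.combinations(range(n), 2)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree[i].append((j, w))
            tree[j].append((i, w))
    # DFS from each node, tracking the bottleneck (max edge so far).
    u = [[0.0] * n for _ in range(n)]
    for s in range(n):
        stack, seen = [(s, 0.0)], {s}
        while stack:
            v, bottleneck = stack.pop()
            u[s][v] = bottleneck
            for nb, w in tree[v]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append((nb, max(bottleneck, w)))
    return u

dist = [[0, 1, 4],
        [1, 0, 2],
        [4, 2, 0]]
u = subdominant_ultrametric(3, dist)
print(u[0][2])  # largest edge on the MST path 0-1-2, i.e. 2
```

This is why editing a single MST (or off-tree) edge can only affect pairs whose bottleneck path crosses the corresponding cut.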

URL: https://openreview.net/forum?id=R4ASOCp3uM

---

Title: Extrapolation of Periodic Functions Using Binary Encoding of Continuous Numerical Values

Abstract: We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLPs) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
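One plausible reading of a normalized base-2 encoding is the binary-fraction expansion of the normalized input, where the i-th bit oscillates with period (hi - lo) / 2**i; the exact normalization and bit convention in NB2E may differ, so treat this as an assumption-laden sketch:

```python
import numpy as np

def nb2e(x, lo, hi, n_bits=8):
    # Normalize x into [0, 1), then expand it in base 2: the i-th
    # feature is the i-th bit of the binary fraction. Each bit is a
    # square wave in x, which is what gives an MLP a phase-like,
    # position-independent view of periodic structure.
    frac = (np.asarray(x, dtype=float) - lo) / (hi - lo)
    bits = []
    for _ in range(n_bits):
        frac = frac * 2.0
        bit = np.floor(frac)
        bits.append(bit)
        frac -= bit
    return np.stack(bits, axis=-1)

print(nb2e(0.5, 0.0, 1.0, n_bits=4))   # binary 0.1000 -> [1. 0. 0. 0.]
```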

URL: https://openreview.net/forum?id=8BCFVGm998

---

Title: Contrastive VQ Priors for Multi-Class Plaque Segmentation via SAM Adaptation

Abstract: Accurate plaque subtype segmentation in coronary CT angiography (CCTA) is clinically relevant yet remains difficult in practice, where annotations are scarce and the visual evidence for non-calcified lesions is subtle and highly variable. Meanwhile, segmentation foundation models such as SAM provide strong robustness from large-scale pretraining, but their benefits do not reliably transfer to private CCTA tasks under naïve fine-tuning, especially for multi-class plaque taxonomy.
We present a targeted strategy to transfer SAM's segmentation robustness to a private CCTA setting by injecting a task-specific, texture-aware prior into the SAM feature stream.
Our framework is two-stage: (i) we learn a discrete latent prior from the private CCTA data using a vector-quantized autoencoder, and structure it with supervised contrastive learning to emphasize hard class boundaries; (ii) we fuse this prior into a SAM-based encoder through a query-based feature-aware cross-attention module, and decode with a multi-class head/decoder tailored for plaque taxonomy.
On the private CCTA benchmark, our approach consistently improves plaque subtype delineation and outperforms strong medical baselines (nnU-Net, TransUNet) as well as SAM-family adaptations (including Medical SAM variants and CAT-SAM). Ablations verify the roles of (a) contrastively-structured discrete priors, (b) attention-based retrieval versus additive fusion, and (c) multi-class decoding for SAM-style models.
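Stage (i)'s discrete latent prior rests on standard vector quantization: each encoder feature is replaced by its nearest codebook entry. A minimal sketch of that lookup (shapes and names are illustrative, not the paper's architecture):

```python
import numpy as np

def vq_lookup(z, codebook):
    # Vector quantization: snap each feature vector to its nearest
    # codebook entry (squared Euclidean distance), yielding the
    # discrete prior that is later fused into the SAM encoder.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 4))         # encoder features
codebook = rng.standard_normal((8, 4))  # learned discrete prior
zq, idx = vq_lookup(z, codebook)
print(zq.shape, idx.shape)  # (6, 4) (6,)
```

The paper's contribution is in how this codebook is structured (supervised contrastive learning over hard class boundaries) and fused (query-based cross-attention), not in the quantization step itself.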

URL: https://openreview.net/forum?id=5P7HfuejgL

---

Title: Fourier Neural Operators Explained: A Practical Perspective

Abstract: Partial differential equations (PDEs) govern a wide variety of dynamical processes in science and engineering, yet obtaining their numerical solutions often requires high-resolution discretizations and repeated evaluations of complex operators, leading to substantial computational costs. Neural operators have recently emerged as a powerful framework for learning mappings between function spaces directly from data, enabling efficient surrogate models for PDE systems. Among these architectures, the Fourier Neural Operator (FNO) has become the most influential and widely adopted due to its elegant spectral formulation, which captures global correlations through learnable transformations in Fourier space while remaining invariant to discretization and resolution. Despite their success, the practical use of FNOs is often hindered by an incomplete understanding among practitioners of their theoretical foundations, practical constraints, and implementation details, which can lead to their incorrect or unreliable application. This work presents a comprehensive and practice-oriented guide to FNOs, unifying their mathematical principles with implementation strategies. We provide an intuitive exposition of the operator-theoretic and signal-processing concepts that underlie the FNO, detail its spectral parameterization and the computational design of all its components, and address common misunderstandings encountered in the literature. The exposition is closely integrated with the NeuralOperator 2.0.0 library, offering modular state-of-the-art implementations that faithfully reflect the theory. By connecting rigorous foundations with practical insight, this guide aims to establish a clear and reliable framework for applying FNOs effectively across diverse scientific and engineering fields.
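The spectral formulation at the heart of the FNO can be sketched in a few lines: FFT, pointwise multiplication of the lowest modes by learned complex weights, inverse FFT. This is a bare 1-D illustration (real FNO layers use channel-mixing weight tensors and add a pointwise linear path), not the NeuralOperator implementation:

```python
import numpy as np

def spectral_conv_1d(u, weights, n_modes):
    # Core FNO block: FFT -> multiply the lowest n_modes Fourier
    # coefficients by learned complex weights -> inverse FFT.
    # Truncating to n_modes is what makes the layer resolution-
    # agnostic: the same weights apply at any discretization.
    u_hat = np.fft.rfft(u)                 # (n//2 + 1,) complex
    out_hat = np.zeros_like(u_hat)
    out_hat[:n_modes] = weights * u_hat[:n_modes]
    return np.fft.irfft(out_hat, n=len(u))

rng = np.random.default_rng(0)
n, n_modes = 64, 8
u = np.sin(2 * np.pi * np.arange(n) / n)   # sampled input function
weights = rng.standard_normal(n_modes) + 1j * rng.standard_normal(n_modes)
v = spectral_conv_1d(u, weights, n_modes)
print(v.shape)  # (64,)
```

Note the layer is linear in `u`, so evaluating it at a finer grid only changes the FFT length, not the learned weights.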

URL: https://openreview.net/forum?id=jqU59PGWRx

---

Title: Sharpness-Aware Minimization Driven by Local-Integrability Flatness

Abstract: Sharpness-Aware Minimization (SAM) improves generalization by optimizing for worst-case loss under parameter perturbations, but its max-based objective can be overly conservative, noise-sensitive, and reliant on smoothness assumptions that often fail in modern nonsmooth networks. We propose Lebesgue Sharpness-Aware Minimization (LSAM), a measure-theoretic alternative grounded in the Lebesgue Differentiation Theorem and local Sobolev regularity. Instead of minimizing the worst-case loss, LSAM minimizes the local average loss in a neighborhood of the parameters. This average-case notion of flatness favors Sobolev-regular Lebesgue points with low local loss oscillation and yields a generalization bound depending only on local integrability, a modulus of continuity, and a Sobolev-induced flatness term—without requiring Hessians or global Lipschitz conditions. To make LSAM practical, we introduce a Monte Carlo estimator of the local average that provides an unbiased gradient with modest overhead. Experiments on CIFAR-10/100 with ResNet, ResNeXt, WideResNet, and PyramidNet show that LSAM consistently finds flatter minima and improves test accuracy over both SGD and SAM.
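The Monte Carlo estimator mentioned above amounts to averaging plain gradients at a few perturbed parameter vectors instead of taking SAM's single worst-case ascent step. A toy sketch under stated assumptions (uniform perturbations in a radius-`rho` box; the sampling distribution in LSAM may differ):

```python
import numpy as np

def lsam_grad(grad_fn, params, rho=0.05, n_samples=4, rng=None):
    # Monte Carlo estimate of the gradient of the *local average*
    # loss: average the ordinary gradient at a few parameter vectors
    # perturbed uniformly within a radius-rho box. This is unbiased
    # for the smoothed (average-case) objective.
    rng = rng or np.random.default_rng()
    g = np.zeros_like(params)
    for _ in range(n_samples):
        eps = rng.uniform(-rho, rho, size=params.shape)
        g += grad_fn(params + eps)
    return g / n_samples

# Toy quadratic loss L(w) = ||w||^2 / 2, so grad_fn(w) = w.
params = np.array([1.0, -2.0])
g = lsam_grad(lambda w: w, params, rho=0.01, n_samples=32,
              rng=np.random.default_rng(0))
print(g)  # close to params, since the smoothed gradient of a quadratic is unchanged at the center
```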

URL: https://openreview.net/forum?id=29Zg9k5NCo

---

Title: MM-Eureka: Toward Stable Multimodal Reasoning via Rule-based Reinforcement Learning with Policy Drift Control

Abstract: Existing rule-based reinforcement learning (RL) methods that work well for text reasoning often collapse when extended to long-horizon multimodal reasoning settings. We identify a structural instability driven by ratio-based policy objectives under sparse multimodal rewards: importance sampling ratios in PPO-style objectives can amplify policy shifts, especially under negative advantages, which can trigger catastrophic mid-training collapse.
To make multimodal rule-based RL reliably trainable, we propose \textbf{CPGD (Clipped Policy Gradient Optimization with Policy Drift)}, a stability-oriented RL objective that removes ratio-induced amplification while maintaining proximal updates via an explicit policy drift regularizer and a numerically stable KL estimator. We provide both theoretical analysis and empirical evidence showing that ratio-based objectives can systematically amplify policy drift beyond intended bounds under sparse-reward multimodal settings, and demonstrate how CPGD addresses this through controlled policy updates.
To support diagnosis and evaluation under consistent settings, we introduce \textbf{MMK12}, a K12-level multimodal reasoning dataset with 15,616 training problems and 2,000 evaluation questions across mathematics, physics, chemistry, and biology, all with human-verified solutions. Using CPGD on MMK12, we train \textbf{MM-Eureka} models that demonstrate stable long-horizon training without collapse. CPGD achieves consistent performance improvements while maintaining training stability throughout, validating that the instability mechanism has been effectively addressed. We open-source our complete pipeline at \url{https://anonymous.4open.science/r/MM-EUREKA-C86D}.
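One plausible reading of a ratio-free, drift-regularized objective is sketched below: the policy-gradient term uses log-probabilities directly (no importance ratio to amplify updates), and a separate penalty keeps the policy near its previous iterate via the non-negative `r - 1 - log r` KL estimator. The exact CPGD objective is defined in the paper; this is an illustrative assumption:

```python
import numpy as np

def drift_regularized_loss(logp, logp_old, adv, beta=0.1):
    # Policy-gradient term without an importance ratio, so a large
    # negative advantage cannot be amplified by r = pi/pi_old.
    pg = -(adv * logp).mean()
    # Explicit policy-drift penalty: the estimator r - 1 - log r is
    # >= 0 and numerically stable even for ratios far from 1.
    log_r = logp - logp_old
    drift = (np.exp(log_r) - 1.0 - log_r).mean()
    return pg + beta * drift

logp = np.log(np.array([0.2, 0.5, 0.3]))       # current action log-probs
logp_old = np.log(np.array([0.25, 0.45, 0.30]))  # previous iterate
adv = np.array([1.0, -0.5, 0.2])
loss = drift_regularized_loss(logp, logp_old, adv)
print(float(loss))
```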

URL: https://openreview.net/forum?id=8y1ch6y24H

---

Title: On the (Non) Injectivity of Piecewise Linear Janossy Pooling

Abstract: Multiset functions, which are functions that map multisets to vectors, are a fundamental tool in the construction of neural networks for multisets and graphs. To guarantee that the vector representation of the multiset is faithful, it is often desirable to have multiset mappings that are both injective and bi-Lipschitz. Currently, there are several constructions of multiset functions achieving both these guarantees, leading to improved performance in some tasks but often also to higher compute time than standard constructions. Accordingly, it is natural to inquire whether simpler multiset functions achieving the same guarantees are available. In this paper, we make a large step towards giving a negative answer to this question. We consider the family of $k$-ary Janossy pooling, which includes many of the most popular multiset models, and prove that no piecewise linear Janossy pooling function can be injective. On the positive side, we show that when restricted to multisets without multiplicities, even simple deep-sets models suffice for injectivity and bi-Lipschitzness.
Finally, we provide empirical validation of our results through a multiset reconstruction task using a 2-ary Janossy pooling autoencoder, demonstrating a clear correlation between point separation and reconstruction accuracy.
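For readers unfamiliar with the family under study: $k$-ary Janossy pooling sums a function over all ordered $k$-tuples of distinct elements, which is permutation-invariant by construction (deep sets is the 1-ary case). A minimal 2-ary sketch with an illustrative, piecewise-linear `phi`:

```python
import itertools
import numpy as np

def janossy_2ary(xs, phi):
    # 2-ary Janossy pooling: apply phi to every ordered pair of
    # distinct-index elements and sum. Summation over all ordered
    # pairs makes the result independent of the input order.
    return sum(phi(xs[i], xs[j])
               for i, j in itertools.permutations(range(len(xs)), 2))

xs = [1.0, 2.0, 3.0]
# phi here is piecewise linear (sum and max) -- the class for which
# the paper proves the pooled map cannot be injective on multisets.
out = janossy_2ary(xs, lambda a, b: np.array([a + b, max(a, b)]))
print(out)  # [24. 16.]
```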

URL: https://openreview.net/forum?id=WKJndBfSb8

---

Title: Understanding the Effects of Neuron Dominance in Deep Reinforcement Learning

Abstract: Recent studies in deep reinforcement learning have revealed that neural networks tend to lose their capacity to adapt to new targets over the course of training. The proliferation of inactive neurons, i.e., the so-called ``dormant neurons'', has been identified as one source of capacity loss. This paper investigates \textit{dominant neurons}, neurons whose activation values are significantly larger than average, as a potential cause for neuron dormancy. We demonstrate the existence of dominant neurons in a number of visual control tasks, and perform an analysis of the learning dynamics showing how dominant neurons can induce dormancy in the subsequent layer. To gain a better understanding of this phenomenon, we examine it through the lens of representation learning and establish its connection with representation collapse. Furthermore, this paper evaluates several mitigation strategies for dominant neurons across a variety of visual control tasks. Our results show that strategies that induce lower peak activation scores tend to exhibit greater representational capacity, lower dormant neuron percentage, and better performance. Among these mitigation strategies, LayerNorm with weight decay has the strongest performance, despite its simplicity. Moreover, switching the value learning loss from regression to a classification loss also significantly mitigates the neuron dominance issue and improves the performance. As a potential explanation of the effectiveness of classification losses, we provide an analysis that shows how a classification loss can prevent representation collapse.
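A simple way to operationalize "activation values significantly larger than average" is to compare each neuron's mean absolute activation against the layer mean; the threshold and statistic below are illustrative assumptions, and the paper's exact criterion may differ:

```python
import numpy as np

def dominant_neurons(acts, tau=5.0):
    # Flag neurons whose mean absolute activation over a batch exceeds
    # tau times the layer-wide average -- a crude dominance score.
    scale = np.abs(acts).mean(axis=0)           # per-neuron scale
    return np.where(scale > tau * scale.mean())[0]

rng = np.random.default_rng(0)
acts = rng.standard_normal((128, 16))           # (batch, neurons)
acts[:, 3] *= 50.0                              # inject a dominant neuron
print(dominant_neurons(acts))  # [3]
```

Scores like this make it easy to track, over training, whether mitigations such as LayerNorm with weight decay actually reduce peak activations.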

URL: https://openreview.net/forum?id=VNV1h77UnH

---

Title: DCM Bandits: Multiplayer Information Asymmetric Cascading Bandits For Multiple Clicks

Abstract: In this work, we extend the Dependent Click Model (DCM) Bandits \cite{katariya2016dcm} to a multiplayer information-asymmetric setting \cite{chang2022online}, where multiple agents interact with a shared ranked list and may observe multiple clicks per session, introducing new challenges for selection strategies. We study asymmetry in (1) actions and (2) rewards, providing nearly optimal regret bounds for three settings where at least one asymmetry is present. We further show that for small termination probabilities, the termination ranking need not be known, improving on \cite{katariya2016dcm} in the single-agent case. Experiments confirm that our algorithms perform well across asymmetric environments, and highlight the critical role of feedback structure—full versus first-click—in coordinating exploration and minimizing regret.
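The Dependent Click Model underlying the bandit setting is easy to simulate: the user scans the list top-down, clicks attractive items, and after each click terminates with a position-dependent probability, so multiple clicks per session are possible. A toy generative sketch (the attraction and termination values are made up for illustration):

```python
import numpy as np

def dcm_session(ranked_items, attract, term, rng):
    # Dependent Click Model: scan top-down; click item a with
    # probability attract[a]; after a click at position k, stop with
    # probability term[k], otherwise keep scanning. Unlike the
    # cascade model, this allows several clicks in one session.
    clicks = []
    for k, a in enumerate(ranked_items):
        if rng.random() < attract[a]:
            clicks.append(a)
            if rng.random() < term[k]:
                break
    return clicks

rng = np.random.default_rng(0)
attract = {0: 0.9, 1: 0.5, 2: 0.3, 3: 0.1}
term = [0.5, 0.4, 0.3, 0.2]            # position-dependent termination
print(dcm_session([0, 1, 2, 3], attract, term, rng))
```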

URL: https://openreview.net/forum?id=TzvGkUredR

---
