Featured Certification: Reinforcement Learning from Human Feedback with Active Queries
Kaixuan Ji, Jiafan He, Quanquan Gu
https://openreview.net/forum?id=EScatQaRxz
---
Expert Certification: Learning Equivalence Classes of Bayesian Network Structures with GFlowNet
Michelle Liu, Zhaocheng Zhu, Olexa Bilaniuk, Emmanuel Bengio
https://openreview.net/forum?id=FAcc7oAdaa
---
Accepted papers
===============
Title: HalluEntity: Benchmarking and Understanding Entity-Level Hallucination Detection
Authors: Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
Abstract: To mitigate the impact of the hallucination-prone nature of LLMs, many studies propose detecting hallucinated generations through uncertainty estimation. However, these approaches predominantly operate at the sentence or paragraph level, failing to pinpoint the specific spans or entities responsible for hallucinated content. This lack of granularity is especially problematic for long-form outputs that mix accurate and fabricated information. To address this limitation, we explore entity-level hallucination detection. We propose a new dataset, HalluEntity, which annotates hallucination at the entity level. Based on this dataset, we comprehensively evaluate uncertainty-based hallucination detection approaches across 17 modern LLMs. Our experimental results show that uncertainty estimation approaches focusing on individual token probabilities tend to over-predict hallucinations, while context-aware methods show better but still suboptimal performance. Through an in-depth qualitative study, we identify relationships between hallucination tendencies and linguistic properties and highlight important directions for future research.
HalluEntity: https://huggingface.co/datasets/samuelyeh/HalluEntity
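A minimal sketch of the kind of entity-level scoring this abstract evaluates: aggregate per-token uncertainty over each entity span and flag high-surprisal entities. The mean-aggregation rule and the threshold below are illustrative assumptions, not the paper's method.

    def entity_hallucination_scores(token_logprobs, entity_spans, threshold=2.0):
        """Flag entities whose mean token surprisal (negative log-prob) is high.

        token_logprobs: per-token log-probabilities from the LLM.
        entity_spans: (start, end) token-index pairs, one per entity.
        threshold: illustrative cutoff; the paper compares several detectors.
        """
        results = []
        for start, end in entity_spans:
            span = token_logprobs[start:end]
            surprisal = -sum(span) / max(len(span), 1)  # mean negative log-prob
            results.append((start, end, surprisal, surprisal > threshold))
        return results

    # Toy example: a confident entity followed by an uncertain one.
    logps = [-0.1, -0.2, -3.5, -4.0, -0.3]
    print(entity_hallucination_scores(logps, [(0, 2), (2, 4)]))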
URL: https://openreview.net/forum?id=494k7e9R5D
---
Title: Tree Search for Language Model Agents
Authors: Jing Yu Koh, Stephen Marcus McAleer, Daniel Fried, Ruslan Salakhutdinov
Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. To address this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary to most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments showcase the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute.
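A schematic of best-first search over environment states in the spirit of this abstract; the environment API, value function, and budget below are placeholder assumptions, not the authors' implementation.

    import heapq
    import itertools

    def best_first_search(env, root, propose_actions, value_fn, budget=20):
        """Best-first tree search in the actual environment space.

        propose_actions(state): candidate actions sampled from the LM agent.
        value_fn(state): scalar quality estimate, e.g. an LM-based evaluator.
        env.step(state, action): assumed to return the successor state.
        """
        tie = itertools.count()  # tie-breaker so states are never compared
        frontier = [(-value_fn(root), next(tie), root, [])]
        best_value, best_plan = value_fn(root), []
        for _ in range(budget):
            if not frontier:
                break
            neg_v, _, state, plan = heapq.heappop(frontier)
            if -neg_v > best_value:
                best_value, best_plan = -neg_v, plan
            for action in propose_actions(state):
                nxt = env.step(state, action)
                heapq.heappush(frontier,
                               (-value_fn(nxt), next(tie), nxt, plan + [action]))
        return best_plan, best_value

Replaying the returned plan from the root state reproduces the best state found within the search budget.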
URL: https://openreview.net/forum?id=QF0N3x2XVm
---
Title: System-2 Mathematical Reasoning via Enriched Instruction Tuning
Authors: Huanqia Cai, Yijun Yang, Zhifeng Li
Abstract: Solving complex mathematical problems via system-2 reasoning is a natural human skill, yet it remains a significant challenge for current large language models (LLMs). We identify the scarcity of deliberate multi-step reasoning data as a primary limiting factor. To this end, we introduce Enriched Instruction Tuning (EIT), a method that enriches existing human-annotated mathematical datasets with AI-generated feedback to create fine-grained reasoning trajectories. These datasets are then used to fine-tune open-source LLMs, enhancing their mathematical reasoning abilities without reliance on any symbolic verification program. Concretely, EIT is composed of two critical steps: Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step (ERS). The former generates a high-level plan that breaks down complex instructions into a sequence of simpler objectives, while ERS fills in reasoning contexts often overlooked by human annotators, creating a smoother reasoning trajectory for LLM fine-tuning. Unlike existing CoT prompting methods that generate reasoning chains relying only on the LLM's internal knowledge, our method leverages human-annotated initial answers as ``meta-knowledge'' to help LLMs generate more detailed and precise reasoning processes, leading to a more trustworthy LLM expert for complex mathematical problems. In experiments, EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state-of-the-art fine-tuning and prompting methods, and even matching the performance of tool-augmented methods.
URL: https://openreview.net/forum?id=Cl9Uox031k
---
Title: Beyond ordinary Lipschitz constraints: Differentially Private optimization with TNC
Authors: Difei Xu, Meng Ding, Zihang Xiang, Jinhui Xu, Di Wang
Abstract: We study Stochastic Convex Optimization in the Differential Privacy model (DP-SCO). Unlike previous studies, here we assume the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter $\theta>1$, where the Lipschitz constant of the loss could be extremely large or even unbounded, but the $\ell_2$-norm gradient of the loss has bounded $k$-th moment with $k\geq 2$. For the Lipschitz case with $\theta\geq 2$, we first propose an $(\epsilon, \delta)$-DP algorithm whose utility bound is $\tilde{O}\left(\left(\tilde{r}_{2k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right)$ in high probability, where $n$ is the sample size, $d$ is the model dimension, and $\tilde{r}_{2k}$ is a term that depends only on the $2k$-th moment of the gradient. Notably, this upper bound is independent of the Lipschitz constant. We then extend this result to the case where $\theta\geq \bar{\theta}> 1$ for some known constant $\bar{\theta}$. Moreover, when the privacy budget $\epsilon$ is small enough, we show an upper bound of $\tilde{O}\left(\left(\tilde{r}_{k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right)$ even if the loss function is not Lipschitz. For the lower bound, we show that for any $\theta\geq 2$, the private minimax rate for $\rho$-zero Concentrated Differential Privacy is lower bounded by $\Omega\left(\left(\tilde{r}_{k}\left(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\sqrt{\rho}}\right)^{\frac{k-1}{k}}\right)^{\frac{\theta}{\theta-1}}\right)$.
URL: https://openreview.net/forum?id=SZCygcrGng
---
Title: Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Authors: Or Tal, Felix Kreuk, Yossi Adi
Abstract: Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices.
This diversity complicates efforts to evaluate models fairly and identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm.
We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems.
Specifically, we compare the two arguably most common modeling paradigms: auto-regressive decoding and conditional flow-matching.
We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures.
Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting.
This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation.
Sampled audio examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
URL: https://openreview.net/forum?id=xXc5DeaBYw
---
Title: Multiple Noises in Diffusion Model for Semi-Supervised Multi-Domain Translation
Authors: Tsiry Mayet, Simon Bernard, Romain HÉRAULT, Clement Chatelain
Abstract: In this work, we address the challenge of multi-domain translation, where the objective is to learn mappings between arbitrary configurations of domains within a defined set (such as $(D_1, D_2)\rightarrow{}D_3$, $D_2\rightarrow{}(D_1, D_3)$, $D_3\rightarrow{}D_1$, etc. for three domains) without the need for separate models for each specific translation configuration, enabling more efficient and flexible domain translation.
We introduce Multi-Domain Diffusion (MDD), a method with dual purposes: i) reconstructing any missing views for new data objects, and
ii) enabling learning in semi-supervised contexts with arbitrary supervision configurations. MDD achieves these objectives by exploiting the noise formulation of diffusion models, specifically modeling one noise level per domain.
Similar to existing domain translation approaches, MDD learns the translation between any combination of domains. However, unlike prior work, our formulation inherently handles semi-supervised learning without modification by representing missing views as noise in the diffusion process.
We evaluate our approach through domain translation experiments on BL3NDT, a multi-domain synthetic dataset designed for challenging semantic domain inversion, the BraTS2020 dataset, and the CelebAMask-HQ dataset.
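The key mechanism, one noise level per domain, can be sketched as follows: supervised views receive an ordinary random diffusion timestep, while missing views are pushed to the maximum noise level. The cosine schedule and the names below are illustrative assumptions, not the paper's exact formulation.

    import torch

    def noise_views(views, supervised, num_steps=1000):
        """views: dict domain -> batch tensor; supervised: dict domain -> bool.

        Each domain gets its own noise level; a missing view is set to the
        maximum timestep, i.e. represented as pure noise, which is what
        enables semi-supervised training with arbitrary supervision patterns.
        """
        noisy, timesteps = {}, {}
        for name, x in views.items():
            if supervised[name]:
                t = torch.randint(0, num_steps, (x.shape[0],))
            else:
                t = torch.full((x.shape[0],), num_steps - 1, dtype=torch.long)
            alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2  # toy schedule
            ab = alpha_bar.view(-1, *([1] * (x.dim() - 1)))
            noisy[name] = ab.sqrt() * x + (1 - ab).sqrt() * torch.randn_like(x)
            timesteps[name] = t
        return noisy, timesteps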
URL: https://openreview.net/forum?id=vYdT26kDYM
---
Title: Continuous Language Model Interpolation yields Dynamic and Controllable Text Generation
Authors: Sara Kangaslahti, David Alvarez-Melis
Abstract: As large language models (LLMs) have gained popularity for a variety of use cases, making them adaptable and controllable has become increasingly important, especially for user-facing applications. In particular, linear interpolation between model parameters forms the backbone for many recent approaches to adapting models to user preferences. While the existing literature on LLM adaptation primarily focuses on finding methods that optimize for some set of performance criteria or user preferences, here we instead seek to better understand and characterize the behavior of dense, continuous interpolation between models. Specifically, we use low-rank updates to fine-tune a base model to various different domains, yielding a set of anchor models with distinct generation profiles. Then, we use the weight updates of these anchor models to parametrize the entire (infinite) class of models contained within their convex hull. We empirically show that varying the interpolation weights yields predictable and consistent change in the model outputs with respect to all of the controlled attributes simultaneously. We find that there is little entanglement between most attributes and identify and discuss the pairs of attributes for which this is not the case. Our results suggest that parameter merging facilitates flexible model adaptation due to its predictable behavior within the full interpolation region.
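The parametrization described here is simple to state in code: add a convex combination of the anchors' weight updates to the base model. A minimal sketch; the tensor-dict layout and the two hypothetical style anchors in the comment are illustrative.

    import torch

    def interpolate_anchors(base, anchor_deltas, mix):
        """base: dict name -> base-model tensor.
        anchor_deltas: list of dicts, each one anchor's fine-tuning update.
        mix: non-negative weights summing to 1, so the merged model stays
        inside the convex hull of the anchors.
        """
        assert all(m >= 0 for m in mix) and abs(sum(mix) - 1.0) < 1e-6
        merged = {}
        for name, w0 in base.items():
            merged[name] = w0 + sum(m * d[name] for m, d in zip(mix, anchor_deltas))
        return merged

    # e.g. a 70/30 blend of two hypothetical per-domain anchors:
    # merged = interpolate_anchors(base, [formal_delta, simple_delta], [0.7, 0.3])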
URL: https://openreview.net/forum?id=xD9Nu2Wah4
---
Title: Statistical Test for Saliency Maps of Graph Neural Networks via Selective Inference
Authors: Shuichi Nishino, Tomohiro Shiraishi, Teruyuki Katsuoka, Ichiro Takeuchi
Abstract: Graph Neural Networks (GNNs) have gained prominence for their ability to process graph-structured data across various domains. However, interpreting GNN decisions remains a significant challenge, leading to the adoption of saliency maps for identifying salient subgraphs composed of influential nodes and edges. Despite their utility, the reliability of GNN saliency maps has been questioned, particularly in terms of their robustness to input noise. In this study, we propose a statistical testing framework to rigorously evaluate the significance of saliency maps. Our main contribution lies in addressing the inflation of the Type I error rate caused by double-dipping of data, leveraging the framework of Selective Inference. Our method provides statistically valid $p$-values while controlling the Type I error rate, ensuring that identified salient subgraphs contain meaningful information rather than random artifacts. The method is applicable to a variety of saliency methods with piecewise linearity (e.g., Class Activation Mapping). We validate our method on synthetic and real-world datasets, demonstrating its capability in assessing the reliability of GNN interpretations.
URL: https://openreview.net/forum?id=5NkXTCVa7F
---
Title: Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution
Authors: Naoki Yoshida, Shogo Nakakita, Masaaki Imaizumi
Abstract: We consider a variant of the stochastic gradient descent (SGD) with a random learning rate and reveal its convergence properties. SGD is a widely used stochastic optimization algorithm in machine learning, especially deep learning. Numerous studies reveal the convergence properties of SGD and its theoretically favorable variants. Among these, the analysis of convergence using a stationary distribution of updated parameters provides generalizable results. However, to obtain a stationary distribution, the update direction of the parameters must not degenerate, which limits the applicable variants of SGD. In this study, we consider a novel SGD variant, Poisson SGD, which has degenerated parameter update directions and instead utilizes a random learning rate. Consequently, we demonstrate that the distribution of parameters updated by Poisson SGD converges to a stationary distribution under weak assumptions on the loss function. Based on this, we further show that Poisson SGD finds global minima in non-convex optimization problems and also evaluate the generalization error of this method. As a proof technique, we approximate the distribution produced by Poisson SGD with that of the bouncy particle sampler (BPS) and derive its stationary distribution, using theoretical advances in piecewise deterministic Markov processes (PDMPs).
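As a toy illustration of the random-learning-rate idea only: the exponential draw and the quartic objective below are assumptions for demonstration, while the paper's Poisson SGD and its PDMP-based analysis are more specific.

    import random

    def random_lr_sgd(grad_fn, x0, mean_lr=0.01, steps=2000, seed=0):
        """Gradient descent where every step draws a fresh random learning rate."""
        rng = random.Random(seed)
        x = x0
        for _ in range(steps):
            eta = rng.expovariate(1.0 / mean_lr)  # random rate with mean mean_lr
            x -= eta * grad_fn(x)
        return x

    # Non-convex toy loss f(x) = x**4 - 3 * x**2 with minima near x = +/-1.22.
    print(random_lr_sgd(lambda x: 4 * x**3 - 6 * x, x0=0.1))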
URL: https://openreview.net/forum?id=RPtKkNx9ZK
---
Title: Complementarity: Toward Better Metrics and Optimizing Data Efficiency in LLMs
Authors: Roy Siegelmann
Abstract: Generalist Large Language Models (LLMs) are trained with an immense amount of data from across different domains. However, not all data contribute to model performance equally, and prioritizing data quality can improve domain-specific performance. We suggest that quality is not merely an independent feature of datasets, but rather the manner in which data samples interfere with or complement one another. Furthermore, existing performance metrics for language models are computationally expensive, while also frequently suffering from being mathematically ill-defined and poorly suited to generative AI. Toward improving general performance while reducing the amount of training data, and quantifying how data contributes to downstream tasks vis-a-vis their relation with other data, we introduce a new metric, Complementarity. We first establish a strong correlation between Complementarity and domain-specific task performance. Without reliance on heavy instruction-tuning and text scraping, Complementarity is significantly less expensive to compute and is applicable to a wide variety of potential target domains. Most interestingly, we demonstrate that the Complementarity taken over a training validation set provides a better predictor of generalization to future test sets than directly measuring performance on a test validation set. With this, we introduce an algorithm that carefully selects the data to fine-tune upon, leading to a high-performing fine-tuned generalist model while using only a fraction of the data, and without requiring data from the test domain. Overall, Complementarity may serve as a key metric in future analysis of data utility and design of datasets, and help facilitate the goal of a truly generalist model.
URL: https://openreview.net/forum?id=feAbrMXGMh
---
Title: Exploring the Limitations of Layer Synchronization in Spiking Neural Networks
Authors: Roel Koopman, Amirreza Yousefzadeh, Mahyar Shahsavari, Guangzhi Tang, Manolis Sifalakis
Abstract: Neural-network processing in machine learning applications relies on layer synchronization. This is practiced even in artificial Spiking Neural Networks (SNNs), which are touted as consistent with neurobiology, despite processing in the brain being in fact asynchronous. A truly asynchronous system would allow all neurons to concurrently evaluate their threshold and emit spikes upon receiving any presynaptic current. Omitting layer synchronization is potentially beneficial for latency and energy efficiency, but asynchronous execution of models previously trained with layer synchronization may entail a mismatch in network dynamics and performance. We present and quantify this problem, and show that models trained with layer synchronization either perform poorly in the absence of synchronization, or fail to benefit from any energy and latency reduction when such a mechanism is in place. We then explore a potential solution direction, based on a generalization of backpropagation-based training that integrates knowledge about an asynchronous execution scheduling strategy, for learning models suitable for asynchronous processing. We experiment with two asynchronous neuron execution scheduling strategies on datasets that encode spatial and temporal information, and we show the potential of asynchronous processing to use fewer spikes (up to 50\%), complete inference faster (up to 2x), and achieve competitive or even better accuracy (up to $\sim$10\% higher). Our exploration affirms that asynchronous event-based AI processing can indeed be more efficient, but we need to rethink how we train our SNN models to benefit from it. (Source code available at: \url{https://github.com/RoelMK/asynctorch})
URL: https://openreview.net/forum?id=mfmAVwtMIk
---
Title: Reinforcement Learning from Human Feedback with Active Queries
Authors: Kaixuan Ji, Jiafan He, Quanquan Gu
Abstract: Aligning large language models (LLMs) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many queries for human preference, matches the performance of DPO, establishing it as a data-efficient alternative. The code is available at https://github.com/jkx19/ActiveQuery.
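The active-query idea can be illustrated with the standard DPO implicit reward margin: request human preference labels only where the current model is uncertain about which response wins. The margin formula below follows standard DPO; the threshold-based selection rule is an illustrative assumption, not the paper's criterion.

    import torch

    def dpo_margin(lp_a, lp_b, lp_a_ref, lp_b_ref, beta=0.1):
        """Implicit DPO reward margin between responses a and b."""
        return beta * ((lp_a - lp_a_ref) - (lp_b - lp_b_ref))

    def select_queries(margins, threshold=0.5):
        """Send only low-|margin| (uncertain) pairs to human annotators."""
        m = torch.as_tensor(margins)
        return (m.abs() < threshold).nonzero(as_tuple=True)[0].tolist()

    # Margins as produced by dpo_margin; pairs 1 and 2 get human labels.
    print(select_queries([2.3, 0.1, -0.4, 1.8]))  # -> [1, 2]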
URL: https://openreview.net/forum?id=EScatQaRxz
---
Title: Learning the Language of Protein Structure
Authors: Jérémie DONA, Benoit Gaujac, Timothy Atkinson, Liviu Copoiu, Thomas Pierrot, Thomas D Barrett
Abstract: Representation learning and \emph{de novo} generation of proteins are pivotal computational biology tasks.
Whilst natural language processing (NLP) techniques have proven highly effective for protein sequence modelling, structure modelling presents a complex challenge, primarily due to its continuous and three-dimensional nature.
Motivated by this discrepancy, we introduce an approach using a vector-quantized autoencoder that effectively tokenizes protein structures into discrete representations. This method transforms the continuous, complex space of protein structures into a manageable, discrete format with a codebook ranging from 4096 to 64000 tokens, achieving high-fidelity reconstructions with backbone root mean square deviations (RMSD) of approximately 1-4 \AA. To demonstrate the efficacy of our learned representations, we show that a simple GPT model trained on our codebooks can generate novel, diverse, and designable protein structures. Our approach not only provides representations of protein structure, but also mitigates the challenges of disparate modal representations and sets a foundation for seamless, multi-modal integration, enhancing the capabilities of computational methods in protein design.
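The tokenization step at the core of such a model is standard vector quantization: each continuous structure embedding is snapped to its nearest codebook entry. A generic sketch; the shapes are illustrative, with the codebook size chosen from the 4096 to 64000 range mentioned above.

    import torch

    def quantize(embeddings, codebook):
        """embeddings: (n, d) encoder outputs; codebook: (k, d) learned codes.

        Returns discrete token ids and their quantized vectors.
        """
        dists = torch.cdist(embeddings, codebook)  # (n, k) pairwise distances
        tokens = dists.argmin(dim=1)               # nearest code per position
        return tokens, codebook[tokens]

    ids, vecs = quantize(torch.randn(8, 64), torch.randn(4096, 64))
    print(ids.shape, vecs.shape)  # torch.Size([8]) torch.Size([8, 64])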
URL: https://openreview.net/forum?id=SRRPQIOS4w
---
Title: Certified Robustness to Data Poisoning in Gradient-Based Training
Authors: Philip Sosnin, Mark Niklas Mueller, Maximilian Baader, Calvin Tsay, Matthew Robert Wicker
Abstract: Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding the behavior of learning algorithms under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.
URL: https://openreview.net/forum?id=9WHifn9ZVX
---
Title: MoReact: Generating Reactive Motion from Textual Descriptions
Authors: Xiyan Xu, Sirui Xu, Yu-Xiong Wang, Liangyan Gui
Abstract: Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person's motion to generate the other's reaction, failing to integrate the rich semantic information that underpins human interactions. Yet, these methods often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model generates realistic motion sequences for individuals that respond to the other's actions based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent's movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with the given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach on this novel task, producing realistic, diverse, and controllable reactions that not only closely match the movements of the counterpart but also adhere to the textual guidance. Please find our webpage at https://xiyan-xu.github.io/MoReactWebPage.
URL: https://openreview.net/forum?id=4zuT73heqm
---
Title: SELU: Self-Learning Embodied Multimodal Large Language Models in Unknown Environments
Authors: Boyu Li, Haobin Jiang, Ziluo Ding, Xinrun Xu, Haoran Li, Dongbin Zhao, Zongqing Lu
Abstract: Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, enabling the exploration of autonomously improving MLLMs in unknown environments. However, external feedback, whether from humans or the environment, is not always available. To address this challenge, existing methods primarily focus on enhancing the decision-making capabilities of MLLMs through voting and scoring mechanisms, while little attention has been paid to improving the environmental comprehension of MLLMs in unknown environments. To fully unleash the self-learning potential of MLLMs, we propose a novel actor-critic self-learning paradigm, dubbed SELU, inspired by the corresponding paradigm in reinforcement learning. The critic employs self-asking and hindsight relabeling to extract knowledge from interaction trajectories collected by the actor, thereby augmenting its environmental comprehension. Simultaneously, the actor is improved by the self-feedback provided by the critic, enhancing its decision-making. We evaluate our method in the AI2-THOR and VirtualHome environments, where SELU achieves critic improvements of approximately 28% and 30%, and actor improvements of about 20% and 24%, via self-learning.
URL: https://openreview.net/forum?id=G5gROx8AVi
---
Title: Behaviour Discovery and Attribution for Explainable Reinforcement Learning
Authors: Rishav Rishav, Somjit Nath, Vincent Michalski, Samira Ebrahimi Kahou
Abstract: Building trust in reinforcement learning (RL) agents requires understanding why they make certain decisions, especially in high-stakes applications like robotics, healthcare, and finance. Existing explainability methods often focus on single states or entire trajectories, either providing only local, step-wise insights or attributing decisions to coarse, episode-level summaries. Both approaches miss the recurring strategies and temporally extended patterns that actually drive agent behavior across multiple decisions. We address this gap by proposing a fully offline, reward-free framework for behavior discovery and segmentation, enabling the attribution of actions to meaningful and interpretable behavior segments that capture recurring patterns appearing across multiple trajectories. Our method identifies coherent behavior clusters from state-action sequences and attributes individual actions to these clusters for fine-grained, behavior-centric explanations. Evaluations on four diverse offline RL environments show that our approach discovers meaningful behaviors and outperforms trajectory-level baselines in fidelity, human preference, and cluster coherence. Our code is publicly available.
URL: https://openreview.net/forum?id=JbHtpOIH9l
---
Title: Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions
Authors: Quentin Hillebrand, Vorapong Suppakitpaisarn, Tetsuo Shibuya
Abstract: We suggest the use of hash functions to cut down the communication costs when counting subgraphs under edge local differential privacy. While various algorithms exist for computing graph statistics --- including the count of subgraphs --- under edge local differential privacy, many suffer from high communication costs, making them less efficient for large graphs. Though data compression is a typical approach in differential privacy, its application in local differential privacy requires a form of compression that every node can reproduce. In our study, we introduce linear congruence hashing. Leveraging amplification by sub-sampling, with a sampling size of $s$, our method can cut communication costs by a factor of $s^2$, albeit at the cost of increasing the variance of the published graph statistic by a factor of $s$. The experimental results indicate that, when matched for communication costs, our method achieves a reduction in the $\ell_2$-error of up to 1000 times for triangle counts and up to $10^3$ times for 4-cycle counts compared to the performance of leading algorithms.
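A minimal sketch of a linear congruence hash of the kind named here: shared parameters (a, b, p) let every node reproduce the same bucket assignment, so reporting only one bucket's worth of edges compresses communication. The parameters below are illustrative; the paper's construction and its privacy analysis are more involved.

    def linear_congruence_hash(a, b, p, s):
        """h(x) = ((a*x + b) mod p) mod s maps node ids into s buckets.

        p: prime exceeding the id universe; a in [1, p-1] and b in [0, p-1]
        are shared publicly so all nodes compute identical buckets.
        """
        return lambda x: ((a * x + b) % p) % s

    h = linear_congruence_hash(a=48271, b=11, p=2_147_483_647, s=16)
    neighbors = [3, 17, 42, 1001]
    print([h(v) for v in neighbors], [v for v in neighbors if h(v) == 0])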
URL: https://openreview.net/forum?id=N1J236mepp
---
Title: Leveraging Fully-Observable Solutions for Improved Partially-Observable Offline Reinforcement Learning
Authors: Chulabhaya Wijesundara, Andrea Baisero, Gregory David Castanon, Alan S Carlin, Robert Platt, Christopher Amato
Abstract: Offline reinforcement learning (RL) is a popular learning framework for control problems where online interactions with the environment are expensive, risky, or otherwise impractical.
Existing offline RL methods commonly assume full observability of the state, and few are specialized for the more general case of partially-observable control.
To address this gap, we propose Cross-Observability Conservative Q-Learning (CO-CQL), an offline RL algorithm for partially-observable control that leverages fully-observable expert policies in an asymmetric learning setting.
To motivate the use of fully-observable experts for partially-observable control, we formalize Cross-Observability Optimality Ratio (COOR), a theoretical measure of cross-observability that quantifies the benefit of learning asymmetrically from a fully-observable expert, and Cross-Observability Approximation Ratio (COAR), an estimation of COOR computable from trained policies.
Our empirical evaluation on a wide variety of partially-observable challenges demonstrates that CO-CQL is able to exploit the guidance of fully-observable experts to outperform other state-of-the-art offline algorithms.
URL: https://openreview.net/forum?id=e9p4TDPy6A
---
Title: Learning Equivalence Classes of Bayesian Network Structures with GFlowNet
Authors: Michelle Liu, Zhaocheng Zhu, Olexa Bilaniuk, Emmanuel Bengio
Abstract: Understanding the causal graph underlying a system is essential for enabling causal inference, particularly in fields such as medicine and genetics. Identifying a causal Directed Acyclic Graph (DAG) from observational data alone is challenging because multiple DAGs can encode the same set of conditional independencies. These equivalent DAGs form a Markov Equivalence Class (MEC), which is represented by a Completed Partially Directed Acyclic Graph (CPDAG). Effectively approximating the CPDAG is crucial because it facilitates narrowing down the set of possible causal graphs underlying the data. We introduce CPDAG-GFN, a novel approach that uses a Generative Flow Network (GFlowNet) to learn a posterior distribution over CPDAGs. From this distribution, we sample high-reward CPDAG candidates that approximate the ground truth, with rewards determined by a score function that quantifies how well each graph fits the data. Additionally, CPDAG-GFN incorporates a sparsity-preferring filter to enhance the set of CPDAG candidates and improve their alignment with the ground truth. Experimental results on both simulated and real-world datasets demonstrate that CPDAG-GFN performs competitively with established methods for learning CPDAG candidates from observational data.
URL: https://openreview.net/forum?id=FAcc7oAdaa
---
Title: FlowBench: Benchmarking Optical Flow Estimation Methods for Reliability and Generalization
Authors: Shashank Agnihotri, Julian Yuya Caspary, Luca Schwarz, Xinyan Gao, Jenny Schmalfuss, Andres Bruhn, Margret Keuper
Abstract: Optical flow estimation is a crucial computer vision task often applied to safety-critical real-world scenarios like autonomous driving and medical imaging.
While optical flow estimation accuracy has greatly benefited from the emergence of deep learning, learning-based methods are also known for their lack of generalization and reliability.
However, reliability is paramount when optical flow methods are employed in the real world, where safety is essential.
Furthermore, a deeper understanding of the robustness and reliability of learning-based optical flow estimation methods is still lacking, hindering the research community from building methods safe for real-world deployment.
Thus, we propose FlowBench, a robustness benchmark and evaluation tool for learning-based optical flow methods.
FlowBench facilitates streamlined research into the reliability of optical flow methods by benchmarking their robustness to adversarial attacks and out-of-distribution samples.
With FlowBench, we benchmark 57 checkpoints across 3 datasets under 9 diverse adversarial attacks and 23 established common corruptions, making it the most comprehensive robustness analysis of optical flow methods to date.
Across this wide range of methods, we consistently find that methods with state-of-the-art performance on established standard benchmarks lack reliability and generalization ability.
Moreover, we find interesting correlations between the performance, reliability, and generalization ability of optical flow estimation methods through various lenses, such as design choices and number of parameters.
The open-source code and weights for FlowBench are available in this GitHub repository: https://github.com/shashankskagnihotri/FlowBench.
URL: https://openreview.net/forum?id=Kh4bj6YDNm
---
Title: Unbiased Loss Functions for Multilabel Classification with Missing Labels
Authors: Erik Schultheis, Rohit Babbar
Abstract: This paper considers binary and multilabel classification problems in a setting where labels are missing independently and with a known rate. Missing labels are a ubiquitous phenomenon in extreme multi-label classification (XMC) tasks, such as matching Wikipedia articles to a small subset out of the hundreds of thousands of possible tags, where no human annotator can possibly check the validity of all the negative samples. For this reason, propensity-scored precision, an unbiased estimate of precision-at-k under a known noise model, has become one of the standard metrics in XMC. Few methods take this problem into account already during the training phase, and all of these are limited to loss functions that can be decomposed into a sum of contributions from each individual label. A typical approach to training is to reduce the multilabel problem to a series of binary or multiclass problems, and it has been shown that if the surrogate task is to be consistent for optimizing recall, the resulting loss function is not decomposable over labels. Therefore, this paper develops unbiased estimators for generic, potentially non-decomposable loss functions. These estimators suffer from increased variance and may lead to ill-posed optimization problems, which we address by switching to convex upper bounds. The theoretical considerations are further supplemented by an experimental study showing that the switch to unbiased estimators significantly alters the bias-variance trade-off and thus requires stronger regularization.
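For the label-decomposable case that prior work already handles, the unbiased correction is classical inverse-propensity reweighting; a sketch for intuition (the paper's actual contribution, estimators for non-decomposable losses, does not reduce to this simple per-label form).

    def unbiased_decomposable_loss(loss_pos, loss_neg, observed, scores, propensity):
        """Unbiased per-label loss when each positive is observed w.p. propensity[j].

        Observed positives are upweighted by 1/propensity[j], with a matching
        correction to the negative term, so the expectation equals the clean loss.
        """
        total = 0.0
        for obs, s, p in zip(observed, scores, propensity):
            if obs:  # observed positive
                total += loss_pos(s) / p + (1.0 - 1.0 / p) * loss_neg(s)
            else:    # unobserved: a true negative or a missing positive
                total += loss_neg(s)
        return total

    # Toy squared losses; label 0 is an observed positive with propensity 0.5.
    # Note the estimate can go negative: the variance issue discussed above.
    print(unbiased_decomposable_loss(lambda s: (1 - s) ** 2, lambda s: s ** 2,
                                     [True, False], [0.8, 0.2], [0.5, 0.5]))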
URL: https://openreview.net/forum?id=hMq1hUhLqp
---
Title: Efficient pooling of predictions via kernel embeddings
Authors: Sam Allen, David Ginsbourger, Johanna Ziegel
Abstract: Probabilistic predictions are probability distributions over the set of possible outcomes. Such predictions quantify the uncertainty in the outcome, making them essential for effective decision making. By combining multiple predictions, the information sources used to generate the predictions are pooled, often resulting in a more informative forecast. Probabilistic predictions are typically combined by linearly pooling the individual predictive distributions; this encompasses several ensemble learning techniques, for example. The weights assigned to each prediction can be estimated based on their past performance, allowing more accurate predictions to receive a higher weight. This can be achieved by finding the weights that optimise a proper scoring rule over some training data. By embedding predictions into a Reproducing Kernel Hilbert Space (RKHS), we illustrate that estimating the linear pool weights that optimise kernel-based scoring rules is a convex quadratic optimisation problem. This permits an efficient implementation of the linear pool when optimally combining predictions on arbitrary outcome domains. This result also holds for other combination strategies, and we additionally study a flexible generalisation of the linear pool that overcomes some of its theoretical limitations, whilst allowing an efficient implementation within the RKHS framework. These approaches are compared in an application to operational wind speed forecasts, where this generalisation is found to offer substantial improvements upon the traditional linear pool.
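The convexity result can be made concrete: for a kernel score, the expected score of a linear pool is quadratic in the weights, so the optimal weights solve a small quadratic program over the simplex. A sketch under illustrative assumptions: Gaussian kernel, ensemble forecasts given as samples, and a generic SLSQP solver standing in for a dedicated QP routine.

    import numpy as np
    from scipy.optimize import minimize

    def gauss_kernel(x, y, scale=1.0):
        return np.exp(-0.5 * ((x[:, None] - y[None, :]) / scale) ** 2)

    def optimal_pool_weights(samples, y_obs):
        """samples: list of M arrays (T cases, S draws); y_obs: (T,) outcomes.

        Minimizes the mean kernel score of the linear pool, which is a convex
        quadratic 0.5 * w'Qw - c'w in the pool weights w on the simplex.
        """
        M, T = len(samples), samples[0].shape[0]
        Q, c = np.zeros((M, M)), np.zeros(M)
        for t in range(T):
            for i in range(M):
                c[i] += gauss_kernel(samples[i][t], np.array([y_obs[t]])).mean()
                for j in range(M):
                    Q[i, j] += gauss_kernel(samples[i][t], samples[j][t]).mean()
        Q, c = Q / T, c / T
        res = minimize(lambda w: 0.5 * w @ Q @ w - c @ w, np.full(M, 1.0 / M),
                       bounds=[(0, 1)] * M, method='SLSQP',
                       constraints=({'type': 'eq', 'fun': lambda w: w.sum() - 1},))
        return res.x

    rng = np.random.default_rng(0)
    y = rng.normal(size=50)  # outcomes; forecaster 0 is well calibrated
    ens = [rng.normal(0, 1, (50, 20)), rng.normal(1, 1, (50, 20))]
    print(optimal_pool_weights(ens, y))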
URL: https://openreview.net/forum?id=dji9MfONNP
---
New submissions
===============
Title: Hard Work Does Not Always Pay Off: On the Robustness of NAS to Data Poisoning
Abstract: We study the robustness of data-centric methods for finding neural network architectures, known as neural architecture search (NAS), against data poisoning. To audit this robustness, we design a poisoning framework that enables systematic evaluation of the ability of NAS to produce architectures under data corruption. Our framework examines three off-the-shelf NAS algorithms, representing different approaches to architecture discovery, against four data poisoning attacks, including one we tailor specifically for NAS. In our evaluation on the CIFAR-10 benchmark, we find that NAS is seemingly robust to data poisoning, showing marginal accuracy drops even under large poisoning budgets. However, we demonstrate that when considering NAS algorithms designed to achieve a few percentage points of accuracy gain, this expected improvement can be substantially diminished under data poisoning. We also show that the reduction varies across NAS algorithms and analyze the factors contributing to their robustness. Our findings are: (1) Training-based NAS algorithms are the least robust due to their reliance on data. (2) Training-free NAS approaches are the most robust but produce architectures that perform similarly to random selections from the search space. (3) NAS algorithms can produce architectures with improved accuracy, even when using out-of-distribution data like MNIST. We lastly discuss potential countermeasures.
URL: https://openreview.net/forum?id=Uhayg3Ia9W
---
Title: Supervised score aggregation for active anomaly detection
Abstract: Detecting rare anomalies in batches of multidimensional data is challenging.
We propose an original supervised active-learning framework that sends a small number of data points from each batch to an expert for labeling as 'anomaly' or 'nominal' via two mechanisms: (i) points most likely to be anomalies in the eyes of a supervised classifier trained on previously-labeled data; and (ii) points suggested by an active learner. Instead of training the supervised classifier directly on currently-labeled raw data, we treat the scores calculated by an ensemble of $M$ user-defined unsupervised anomaly detectors as if they were the learner's input features. Our approach generalizes earlier attempts to linearly aggregate unsupervised anomaly detector scores, and broadens the scope of these methods from unordered bags of data to ordered data such as time series. Simulated and real data trials suggest that this method usually outperforms---often significantly---linear strategies.
The Python library acanag implements our proposed method.
URL: https://openreview.net/forum?id=nrmJD3XMA3
---
Title: Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
Abstract: Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence—misalignment between predicted confidence and true correctness—poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements of up to 460% and ECE reductions of up to 90%. Despite these general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts but remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly in person-based queries. We conclude with concrete recommendations—targeted fine-tuning, structured prompting, and strategic model choice—to ensure reliable, trustworthy LLM deployments.
URL: https://openreview.net/forum?id=lyaHnHDdZl
---
Title: Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?
Abstract: Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe environment RL (SE-RL), where the safeguard is treated as part of the environment, and safe policy RL (SP-RL), where it is embedded within the policy through differentiable optimization layers. Despite their practical relevance in safety-critical settings, a formal understanding of their differences is lacking.
In this work, we present a theoretical comparison of SE-RL and SP-RL. We identify a key distinction in how each approach is affected by action aliasing, a phenomenon in which multiple unsafe actions are projected to the same safe action, causing information loss in the policy gradients. In SE-RL, this effect is implicitly approximated by the critic, while in SP-RL, it manifests directly as rank-deficient Jacobians during backpropagation through the safeguard.
Our contributions are threefold: (i) a unified formalization of SE-RL and SP-RL in the context of actor-critic algorithms, (ii) a theoretical analysis of their respective policy gradient estimates, highlighting the role of action aliasing, and (iii) a comparative study of mitigation strategies, including a novel penalty-based improvement for SP-RL that aligns with established SE-RL practices. Empirical results support our theoretical predictions, showing that action aliasing is more detrimental for SP-RL than for SE-RL. However, with appropriate improvement strategies, SP-RL can match or outperform improved SE-RL across a range of environments. These findings provide actionable insights for choosing and refining projection-based safe RL methods based on task characteristics.
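Both variants share the same projection step, which for linear constraints is a small quadratic program. A generic sketch, solved here with SLSQP for brevity; SP-RL would instead use a differentiable QP layer so gradients can flow through the safeguard.

    import numpy as np
    from scipy.optimize import minimize

    def project_action(a_raw, G, h):
        """Return argmin_a 0.5 * ||a - a_raw||^2 subject to G @ a <= h."""
        cons = ({'type': 'ineq', 'fun': lambda a: h - G @ a},)  # feasibility
        res = minimize(lambda a: 0.5 * np.sum((a - a_raw) ** 2), a_raw,
                       constraints=cons, method='SLSQP')
        return res.x

    # Safe set: the box [-1, 1]^2 written as linear inequalities.
    G = np.vstack([np.eye(2), -np.eye(2)])
    h = np.ones(4)
    print(project_action(np.array([2.0, 0.3]), G, h))  # ~ [1.0, 0.3]

Action aliasing is visible in this toy: every proposed action with first coordinate above 1 projects to the same boundary point.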
URL: https://openreview.net/forum?id=DDrGSEYxGU
---
Title: HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
Abstract: There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from the synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, performance drops significantly as task difficulty increases, with the best models and methods recovering only 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.
URL: https://openreview.net/forum?id=cizEoSePyT
---
Title: Unifying Linear-Time Attention via Latent Probabilistic Modelling
Abstract: Transformers have achieved state-of-the-art results across a range of domains, but their quadratic attention mechanism poses significant challenges for long-sequence modelling. Recent efforts to design linear-time attention mechanisms have yielded more scalable alternatives, yet often at the cost of performance, particularly on discrete data such as language. In this work, we revisit linear attention through the lens of probabilistic graphical models. We first show that standard linear attention can be interpreted as an undirected latent variable model, revealing a key limitation: the absence of directionality. To address this, we propose a novel directed parameterisation of linear attention that introduces an asymmetric structure, enabling an interpretation aligned with the causal and sequential nature of language. Our formulation integrates global latent-variable attention with local standard attention in a fully probabilistic framework. Additionally, we introduce a recurrent parameterisation of queries and keys that avoids reliance on relative positional encodings, often incompatible with linear attention. Experiments on language modelling benchmarks demonstrate that our model achieves competitive performance with standard attention and outperforms existing linear attention variants.
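For reference, the standard linear-attention computation this work builds on replaces softmax(QK^T)V with a feature map so the causal case runs with running sums. A sketch with the common elu+1 feature map; the paper's directed latent-variable parameterisation goes beyond this baseline.

    import torch

    def causal_linear_attention(q, k, v, eps=1e-6):
        """q, k: (n, d); v: (n, d_v). Linear-time causal attention."""
        phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive features
        q, k = phi(q), phi(k)
        s = torch.zeros(q.shape[1], v.shape[1])  # running sum of k_i v_i^T
        z = torch.zeros(q.shape[1])              # running sum of k_i
        out = torch.empty_like(v)
        for i in range(q.shape[0]):
            s = s + torch.outer(k[i], v[i])
            z = z + k[i]
            out[i] = (q[i] @ s) / (q[i] @ z + eps)
        return out

    print(causal_linear_attention(torch.randn(5, 4), torch.randn(5, 4),
                                  torch.randn(5, 3)).shape)  # torch.Size([5, 3])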
URL: https://openreview.net/forum?id=TDFIjR7ynG
---
Title: SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
Abstract: Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel methods like repeated sampling are often inefficient and quickly saturate, while sequential methods like SELF-REFINE struggle to improve after a few rounds. Although combining these approaches shows promise, current methods require fine-tuned reward and revision models. This paper proposes Self-Enhanced Test-Time Scaling (SETS), a simple yet effective approach that overcomes these limitations by strategically combining parallel and sequential techniques and fully leveraging LLMs' self-improvement abilities. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This facilitates efficient and scalable test-time computation for enhanced performance on complex tasks without any model training. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
URL: https://openreview.net/forum?id=Wv9NMJoKww
---
Title: On Foundation Models for Dynamical Systems from Purely Synthetic Data
Abstract: Foundation models have demonstrated remarkable generalization, data efficiency, and robustness properties across various domains.
The success of these models is enabled by large-scale pretraining on Internet-scale datasets, which are available in fields like natural language processing and computer vision but do not exist for dynamical systems. In this paper, we explore whether these properties can be achieved in the control domain when pretraining is performed entirely on synthetic data. We address this challenge by pretraining a transformer-based foundation model exclusively on synthetic data and propose to sample dynamics functions from a reproducing kernel Hilbert space (RKHS). Our model, pretrained on purely synthetic data, generalizes to prediction tasks across different dynamical systems, which we validate in simulation and hardware experiments, including cart-pole and Furuta pendulum setups. Additionally, our model can be fine-tuned effectively on new systems, increasing its performance even further. Our results demonstrate that even when pretrained solely on appropriately designed synthetic data, it is feasible to obtain foundation models for dynamical systems that outperform specialist models in terms of generalization, data efficiency, and robustness.
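One way to realize sampling dynamics functions from an RKHS, sketched under illustrative assumptions (RBF kernel, uniform centers, Gaussian coefficients): draw a finite expansion f(x) = sum_i alpha_i k(x, x_i) and use it as a random system when generating pretraining data.

    import numpy as np

    def sample_rkhs_function(dim, n_centers=32, lengthscale=1.0, seed=0):
        """Draw f(x) = sum_i alpha_i * k(x, x_i) with an RBF kernel k."""
        rng = np.random.default_rng(seed)
        centers = rng.uniform(-2.0, 2.0, size=(n_centers, dim))
        alphas = rng.normal(0.0, 1.0 / np.sqrt(n_centers), size=n_centers)
        def f(x):
            d2 = np.sum((x - centers) ** 2, axis=1)
            return alphas @ np.exp(-0.5 * d2 / lengthscale**2)
        return f

    # A random scalar map on a 2-D state; stack several draws for vector fields.
    f = sample_rkhs_function(dim=2)
    print(f(np.zeros(2)))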
URL: https://openreview.net/forum?id=6UfPCYWEvd
---
Title: Towards Performatively Stable Equilibria in Decision-Dependent Games for Arbitrary Data Distribution Maps
Abstract: In decision-dependent games, multiple players optimize their decisions under a data distribution that shifts with their joint actions, creating complex dynamics in applications like market pricing. A practical consequence of these dynamics is the \textit{performatively stable equilibrium}, where each player's strategy is a best response under the induced distribution. Prior work relies on $\beta$-smoothness, assuming Lipschitz continuity of loss function gradients with respect to the data distribution, which is impractical as the data distribution maps, i.e., the relationship between joint decision and the resulting distribution shifts, are typically unknown, rendering $\beta$ unobtainable. To overcome this limitation, we propose a gradient-based sensitivity measure that directly quantifies the impact of decision-induced distribution shifts. Leveraging this measure, we derive convergence guarantees for performatively stable equilibria under a practically feasible assumption of strong monotonicity. Accordingly, we develop a sensitivity-informed repeated retraining algorithm that adjusts players' loss functions based on the sensitivity measure, guaranteeing convergence to performatively stable equilibria for arbitrary data distribution maps. Experiments on prediction error minimization game, Cournot competition, and revenue maximization game show that our approach outperforms state-of-the-art baselines, achieving lower losses and faster convergence.
URL: https://openreview.net/forum?id=39gqhh5NoZ
---
Title: Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching
Abstract: Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on Score Distillation Sampling (SDS) loss, which involves asymmetric KL divergence—a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. Such reformulation, together with SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across various 2D-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark, comparing against nine state-of-the-art baselines. Dive3D also achieves strong results on quantitative metrics, including text–asset alignment, 3D plausibility, text–geometry consistency, texture quality, and geometric detail.
URL: https://openreview.net/forum?id=OUYMueHLMf
---
Title: Causal Graph Learning via Distributional Invariance of Cause-Effect Relationship
Abstract: This paper introduces a new framework for recovering causal graphs from observational data, leveraging the fact that the distribution of an effect, conditioned on its causes, remains invariant to changes in the prior distribution of those causes. This insight enables a direct test for potential causal relationships by checking the variance of their corresponding effect-cause conditional distributions across multiple downsampled subsets of the data. These subsets are selected to reflect different prior cause distributions while preserving the effect-cause conditional relationships. Using this invariance test and exploiting an (empirical) sparsity of most causal graphs, we develop an algorithm that efficiently uncovers causal relationships with quadratic complexity in the number of observational variables, reducing the processing time by up to $25\times$ compared to state-of-the-art methods. Our empirical experiments on a varied benchmark of large-scale datasets show superior or equivalent performance compared to existing works, while achieving enhanced scalability. Our code is openly accessible at https://anonymous.4open.science/r/GLIDE-DC57
URL: https://openreview.net/forum?id=Ey38Q882Xe
---
Title: Rel-HNN: Split Parallel Hypergraph Neural Network for Learning on Relational Databases
Abstract: Relational databases (RDBs) are ubiquitous in enterprise and real-world applications. Flattening an RDB into fixed-size input representations poses challenges for deep learning models, which then struggle to capture the relational semantics of the structured data. Graph neural networks (GNNs) have been proposed to address this, but they often oversimplify relational structures by modeling all the tuples as monolithic nodes and ignoring intra-tuple associations. In this work, we propose a novel hypergraph-based framework, which we call rel-HNN, that models each unique attribute-value pair as a node and each tuple as a hyperedge, enabling the capture of fine-grained intra-tuple relationships. Our approach learns explicit multi-level representations across attribute-value, tuple, and table levels. To address the scalability challenges posed by large RDBs, we further introduce a split-parallel training algorithm that leverages multi-GPU execution for efficient hypergraph learning. Extensive experiments on real-world and benchmark datasets demonstrate that rel-HNN significantly outperforms existing methods in both classification and regression tasks. Moreover, our split-parallel training achieves substantial speedups—up to 3.18x for learning on relational data and up to 2.94x for hypergraph learning—compared to conventional single-GPU execution.
URL: https://openreview.net/forum?id=L7VP7gxpVG
---
Title: $\texttt{SEM-CTRL}$: Semantically Controlled Decoding
Abstract: Ensuring both syntactic and semantic correctness in Large Language Model (LLM) outputs remains a significant challenge, despite being critical for real-world deployment. In this paper, we introduce $\texttt{SEM-CTRL}$, a unified approach that allows for enforcing rich context-sensitive constraints, as well as task- and instance-specific semantics, directly on the LLM decoder. Our approach integrates token-level MCTS, guided by specific syntactic and semantic constraints. The constraints over desired outputs are expressed using Answer Set Grammars, a logic-based formalism that generalizes context-sensitive grammars while incorporating background knowledge to represent task-specific semantics. We show that our approach helps guarantee valid completions for any off-the-shelf LLM without the need for fine-tuning. We evaluate $\texttt{SEM-CTRL}$ on a range of tasks, including synthetic grammar synthesis, combinatorial reasoning, JSON parsing, and planning. Our experimental results demonstrate that $\texttt{SEM-CTRL}$ allows even small pre-trained LLMs to efficiently outperform larger variants and state-of-the-art reasoning models (e.g., $\text{\textit{o4-mini}}$) while simultaneously guaranteeing semantic validity.
URL: https://openreview.net/forum?id=ICUHKhOISN
---
Title: A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection
Abstract: Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we show a formal relationship between Bayesian nonparametric models and the relative Mahalanobis distance score (RMDS), a commonly used method for OOD detection. Building on this connection, we propose Bayesian nonparametric mixture models with hierarchical priors that generalize the RMDS. We evaluate these models on the OpenOOD detection benchmark and show that Bayesian nonparametric methods can improve upon existing OOD methods, especially in regimes where training classes differ in their covariance structure and where there are relatively few data points per class.
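For reference, the relative Mahalanobis distance score that the paper generalizes can be written in a few lines of numpy. This is a minimal sketch of the standard formulation, not the authors' code:

    import numpy as np

    def fit_gaussians(feats, labels):
        classes = np.unique(labels)
        mus = {c: feats[labels == c].mean(0) for c in classes}
        shared_cov = np.cov(np.concatenate(
            [feats[labels == c] - mus[c] for c in classes]).T)
        mu0, cov0 = feats.mean(0), np.cov(feats.T)  # "background" Gaussian
        return mus, np.linalg.inv(shared_cov), mu0, np.linalg.inv(cov0)

    def rmds(z, mus, prec, mu0, prec0):
        md0 = (z - mu0) @ prec0 @ (z - mu0)
        mds = [(z - mu) @ prec @ (z - mu) for mu in mus.values()]
        return -(min(mds) - md0)  # higher = more in-distribution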
URL: https://openreview.net/forum?id=w3bMXPMDW1
---
Title: T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Synthesis in Controllable Concept Art Generation
Abstract: 2D concept art generation for 3D scenes is a crucial yet challenging task in computer graphics, as creating natural, intuitive environments still demands extensive manual effort in concept design. While generative AI has simplified 2D concept design via text-to-image synthesis, it struggles with complex multi-instance scenes and offers limited support for structured terrain layouts. In this paper, we propose a Training-free Triplet Tuning for Sketch-to-Scene (T3-S2S) generation after reviewing the entire cross-attention mechanism. This scheme revitalizes the ControlNet model for detailed multi-instance generation via three key modules: Prompt Balance ensures keyword representation and minimizes the risk of missing critical instances; Characteristic Priority emphasizes sketch-based features by highlighting TopK indices in feature channels; and Dense Tuning refines contour details within instance-related regions of the attention map. Leveraging the controllability of T3-S2S, we also introduce a feature-sharing strategy with dual prompt sets to generate layer-aware isometric and terrain-view representations for the terrain layout. Experiments show that our sketch-to-scene workflow consistently produces multi-instance 2D scenes whose details align with the input prompts.
URL: https://openreview.net/forum?id=lyn2BgKQ8F
---
Title: Uncertainty Quantification in SVM prediction
Abstract: This paper explores SVM models through the lens of uncertainty quantification (UQ), developed for regression and forecasting tasks. Unlike neural network solutions, SVM solutions are typically more certain, stable, sparse, and interpretable, and often globally optimal. However, there is only limited literature addressing UQ in SVM-based prediction. We first provide a comprehensive summary of existing Prediction Interval (PI) estimation and probabilistic forecasting methods developed in the SVM framework. Although SVMs offer globally optimal and stable solutions, the existing literature on UQ within the SVM framework still exhibits several critical gaps. In this work, we also address these gaps and extend contemporary UQ techniques to SVMs, promoting their applicability across diverse domains and more reliable estimation. Our major contributions include the development of sparse SVM models for PI estimation and probabilistic forecasting, an investigation of the role of feature selection in PI estimation, and the extension of SVM regression to the Conformal Regression (CR) setting to construct more stable prediction sets with finite-sample guarantees. Extensive numerical experiments highlight that SVM-based UQ methods yield PIs and probabilistic forecasts that are less uncertain and comparable to, or even better than, those produced by modern complex deep learning and neural network models.
URL: https://openreview.net/forum?id=LrazQEG1QW
---
Title: Compatible Gradient Approximations for Actor-Critic Algorithms
Abstract: Deterministic policy gradient algorithms are foundational for actor-critic methods in controlling continuous systems, yet they often encounter inaccuracies due to their dependence on the derivative of the critic's value estimates with respect to input actions. This reliance requires precise action-value gradient computations, a task that proves challenging under function approximation. We introduce an actor-critic algorithm that bypasses the need for such precision by employing a zeroth-order approximation of the action-value gradient through two-point stochastic gradient estimation within the action space. This approach provably and effectively addresses compatibility issues inherent in deterministic policy gradient schemes. Empirical results further demonstrate that our algorithm not only matches but frequently exceeds the performance of current state-of-the-art methods by a substantial margin.
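The two-point estimator at the core of this idea is compact; a minimal sketch under illustrative shapes and names, where the critic q is any black-box action-value function:

    import numpy as np

    def action_grad_estimate(q, s, a, delta=1e-2, n_dirs=8, rng=None):
        """Estimate grad_a Q(s, a) without differentiating the critic.
        a must be a float array."""
        rng = rng or np.random.default_rng(0)
        g = np.zeros_like(a)
        for _ in range(n_dirs):
            u = rng.normal(size=a.shape)
            u /= np.linalg.norm(u)
            g += (q(s, a + delta * u) - q(s, a - delta * u)) / (2 * delta) * u
        return g * (a.size / n_dirs)  # zeroth-order scaling for sphere directions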
URL: https://openreview.net/forum?id=ZXOSMgwoVi
---
Title: Eyes on the Road, Words in the Changing Skies: Vision-Language Assistance for Autonomous Driving in Transitional Weather
Abstract: The rapid advancement of autonomous vehicle technology (AVT) necessitates robust scene perception and interactive decision-making, particularly under adverse weather conditions. While significant progress has been made in extreme weather scenarios such as cloudy, foggy, rainy, and snowy conditions, a critical challenge remains in transitional weather, such as the shift from cloudy to rainy or from foggy to sunny. These dynamic environmental changes degrade the performance of conventional vision-language systems by causing unpredictable illumination changes and partial occlusions, which are inadequately represented in current AVT datasets. This lack of continuous, transitional training data compromises model robustness and ultimately affects safety and reliability. On the other hand, Vision-language Models (VLMs) enable interpretable reasoning in autonomous driving through tasks such as image captioning and visual question answering. However, current VLMs, designed for clear weather, perform poorly in transitional conditions and rely on computationally expensive LLMs. This leads to high memory usage and slow inference, which is unsuitable for real-time decision-making in AVT. To address these limitations, we propose Vision-language Assistance for Autonomous Driving under Transitional Weather (VLAAD-TW), a lightweight framework with a novel cross-modal spatiotemporal reasoning architecture that robustly interprets and acts on multimodal data. The VLAAD-TW framework integrates a Feature Encoder for Transitional Weather (FETW), a lightweight backbone for robust visual feature extraction, with a Spatiotemporal Contextual Aggregator (SCA), which models dynamic weather-induced changes. It uses a Selective Attention-guided Fusion Module (SAFM) to dynamically balance visual and linguistic cues into a unified representation. Finally, a Semantic Text Generator (STG) fuses these representations to produce context-aware driving information, adapting in real time to both current and predicted weather states. Further, we introduce the AIWD16-text dataset, an adverse intermediate weather driving dataset for vision-language tasks, which features sixteen transitional weather states created using a Stochastic Conditional Variational Autoencoder (SC-VAE) and enriched with manual annotations of image captions and open-ended question-answer pairs. An extensive evaluation on the AIWD16-text and DriveLM datasets demonstrates VLAAD-TW's high performance in BLEU and ROUGE scores, with low memory and computational requirements, confirming its effectiveness in challenging weather conditions.
URL: https://openreview.net/forum?id=PCEDvdVJon
---
Title: Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift
Abstract: We present Zero-Direction Probing (ZDP), a theoretical framework that characterizes model drift from null directions of transformer activations, requiring no task labels or output evaluations. Under explicit assumptions (A1–A6), we prove: (i) the Variance–Leak Theorem (Thm. 1), (ii) Fisher Null-Conservation (Thm. 3), (iii) a Rank–Leak bound for low-rank updates (Thm. 5), and (iv) a logarithmic-regret guarantee for online null-space trackers (Thm. 4). We further derive a Spectral Null-Leakage (SNL) metric with a non-asymptotic Laurent–Massart tail bound and an MP-edge-style concentration inequality, providing a priori thresholds for drift under a Gaussian null model. Together, these results establish that “listening to silence”—monitoring the right/left null spaces of layer activations and their Fisher geometry—yields concrete, testable guarantees on representational change. The manuscript is intentionally theory-only; empirical validation and benchmarking are deferred to companion work.
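The monitoring idea is easy to prototype: estimate the right null space of reference activations, then measure how much later activations leak into it. The leakage ratio below is a simplified stand-in for the paper's SNL metric, not its exact definition.

    import numpy as np

    def null_basis(acts, tol=1e-6):
        # acts: (n_samples, d) layer activations on a probe set
        _, s, vt = np.linalg.svd(acts, full_matrices=True)
        rank = int((s > tol * s[0]).sum())
        return vt[rank:].T  # (d, d - rank) basis of the right null space

    def null_leakage(acts_new, V_null):
        proj = acts_new @ V_null  # components inside the old null space
        return np.linalg.norm(proj) ** 2 / np.linalg.norm(acts_new) ** 2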
URL: https://openreview.net/forum?id=GICx8NVNNC
---
Title: Scaling Gaussian Process Regression with Full Derivative Observations
Abstract: We present a scalable Gaussian Process (GP) method that can fit and predict full derivative observations called DSoftKI. It extends SoftKI, a method that approximates a kernel via softmax interpolation from learned interpolation point locations, to the setting with derivatives. DSoftKI enhances SoftKI’s interpolation scheme to incorporate the directional orientation of interpolation points relative to the data. This enables the construction of a scalable approximate kernel, including its first and second-order derivatives, through interpolation. We evaluate DSoftKI on a synthetic function benchmark and high-dimensional molecular force field prediction (100-1000 dimensions), demonstrating that DSoftKI is accurate and can scale to larger datasets with full derivative observations than previously possible.
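The softmax interpolation that SoftKI builds on has a short numpy form: express each input as softmax weights over learned inducing locations and push the kernel through those weights. A sketch under assumed names; the derivative-aware interpolation DSoftKI adds is not shown:

    import numpy as np

    def softmax_interp_kernel(X, Z, base_kernel, temp=1.0):
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / temp)
        W /= W.sum(1, keepdims=True)  # softmax interpolation weights
        return W @ base_kernel(Z, Z) @ W.T  # approximates K(X, X)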
URL: https://openreview.net/forum?id=fbonXp38r9
---
Title: Offline Model-Based Optimization: Comprehensive Review
Abstract: Offline black-box optimization is a fundamental challenge in science and engineering, where the goal is to optimize black-box functions using only offline datasets. This setting is particularly relevant when querying the objective function is prohibitively expensive or infeasible, with applications spanning protein engineering, material discovery, neural architecture search, and beyond. The main difficulty lies in accurately estimating the objective landscape beyond the available data, where extrapolations are fraught with significant epistemic uncertainty. This uncertainty can lead to objective hacking (reward hacking)—exploiting model inaccuracies in unseen regions—or other spurious optimizations that yield misleadingly high performance estimates outside the training distribution. Recent advances in model-based optimization (MBO) have harnessed the generalization capabilities of deep neural networks to develop offline-specific surrogate and generative models. Trained with carefully designed strategies, these models are more robust against out-of-distribution issues, facilitating the discovery of improved designs. Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review. To bridge this gap, we present the first thorough review of offline MBO. We begin by formalizing the problem for both single-objective and multi-objective settings and by reviewing recent benchmarks and evaluation metrics. We then categorize existing approaches into two key areas: surrogate modeling, which emphasizes accurate function approximation in out-of-distribution regions, and generative modeling, which explores high-dimensional design spaces to identify high-performing designs. Finally, we examine the key challenges and propose promising directions for advancement in this rapidly evolving field.
URL: https://openreview.net/forum?id=QcSZWo1TLl
---
Title: Mapping the Neuro-Symbolic AI Landscape by Architectures: A Handbook on Augmenting Deep Learning Through Symbolic Reasoning
Abstract: Integrating symbolic techniques with statistical ones is a long-standing problem in artificial intelligence. The motivation is that the strengths of either area match the weaknesses of the other, and -- by combining the two -- the weaknesses of either method can be limited. Neuro-symbolic AI focuses on this integration where the statistical methods are in particular neural networks. In recent years, there has been significant progress in this research field, where neuro-symbolic systems outperformed logical or neural models alone. Yet, neuro-symbolic AI is, comparatively speaking, still in its infancy and has not been widely adopted by machine learning practitioners. In this survey, we present the first mapping of neuro-symbolic techniques into families of frameworks based on their architectures, with several benefits: Firstly, it allows us to link different strengths of frameworks to their respective architectures. Secondly, it allows us to illustrate how engineers can augment their neural networks while treating the symbolic methods as black-boxes. Thirdly, it allows us to map most of the field so that future researchers can identify closely related frameworks.
URL: https://openreview.net/forum?id=03CmhwU0RU
---
Title: More Rigorous Software Engineering Would Improve Reproducibility in Machine Learning Research
Abstract: While experimental reproduction remains a pillar of the scientific method, we observe that the software best practices supporting the reproduction of Machine Learning (ML) research are often undervalued or overlooked, leading both to poor reproducibility and damage to trust in the ML community. We quantify these concerns by surveying the usage of software best practices in software repositories associated with publications at major ML conferences and journals such as NeurIPS, ICML, ICLR, TMLR, and MLOSS within the last decade. We report the results of this survey that identify areas where software best practices are lacking and areas with potential for growth in the ML community. Finally, we discuss the implications and present concrete recommendations on how we, as a community, can improve reproducibility in ML research.
URL: https://openreview.net/forum?id=t3FcjU0xwf
---
Title: Lexical Hints of Accuracy in LLM Reasoning Chains
Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering produces models that consistently raise overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity's Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM's internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexicographic hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both Humanity's Last Exam (HLE), a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of uncertainty (e.g., "guess", "stuck", "hard") in the CoT are the strongest indicators of an incorrect response, while shifts in the CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH, where accuracy is already high (≈70%), and carries no signal on the harder HLE (≈9%), indicating that CoT length predicts correctness only in the intermediate-difficulty benchmarks, i.e., inside the model's demonstrated capability, but still below saturation. Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and supports safer deployment of LLMs.
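The strongest signal reported reduces to a very small detector; an abridged illustration where the word list is taken from the abstract's examples and the threshold is illustrative:

    HEDGE_WORDS = {"guess", "stuck", "hard", "unsure", "not sure"}

    def likely_incorrect(cot: str, min_hits: int = 1) -> bool:
        text = cot.lower()
        return sum(w in text for w in HEDGE_WORDS) >= min_hits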
URL: https://openreview.net/forum?id=kSn0NTNVaK
---
Title: Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies
Abstract: Can autoregressive large language models (LLMs) learn consistent probability distributions when trained on sequences in different token orders? We prove formally that for any well-defined probability distribution, sequence perplexity is invariant under any factorization, including forward, backward, or arbitrary permutations. This result establishes a rigorous theoretical foundation for studying how LLMs learn from data and defines principled protocols for empirical evaluation. Applying these protocols, we show that prior studies examining ordering effects suffer from critical methodological flaws. We retrain GPT-2 models across forward, backward, and arbitrary permuted orders on scientific text. We find systematic deviations from theoretical invariance across all orderings with arbitrary permutations strongly deviating from both forward and backward models, which largely (but not completely) agreed with one another. Deviations were traceable to differences in self-attention, reflecting positional and locality biases in processing. Our theoretical and empirical results provide novel avenues for understanding positional biases in LLMs and suggest methods for detecting when LLMs' probability distributions are inconsistent and therefore untrustworthy.
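The invariance result is easy to verify numerically for an exact joint distribution: the chain rule telescopes to the same sequence probability under any variable order. A toy check, illustrative rather than the paper's code:

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.random((2, 2, 2)); p /= p.sum()  # exact joint over (t0, t1, t2)

    def marginal(joint, keep):
        axes = tuple(ax for ax in range(joint.ndim) if ax not in keep)
        return joint.sum(axis=axes)

    def chain_prob(joint, seq, order):
        prob, seen = 1.0, []
        for pos in order:
            keep = sorted(seen + [pos])
            num = marginal(joint, keep)[tuple(seq[a] for a in keep)]
            den = (marginal(joint, sorted(seen))[tuple(seq[a] for a in sorted(seen))]
                   if seen else 1.0)
            prob *= num / den  # p(x_pos | x_seen)
            seen.append(pos)
        return prob

    seq = (1, 0, 1)
    for order in [(0, 1, 2), (2, 1, 0), (1, 2, 0)]:
        print(order, chain_prob(p, seq, order))  # identical values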
URL: https://openreview.net/forum?id=AV3fX3jB0H
---
Title: I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy
Abstract: As LLM-based agents become increasingly autonomous and interact more freely with each other, studying the interplay among them becomes crucial to anticipate emergent phenomena and potential risks. In this work, we provide an in-depth analysis of the interactions among agents within a simulated hierarchical social environment, drawing inspiration from the Stanford Prison Experiment. Leveraging 2,400 conversations across six LLMs (i.e., \texttt{LLama3}, \texttt{Orca2}, \texttt{Command-r}, \texttt{Mixtral}, \texttt{Mistral2}, and \texttt{gpt4.1}) and 240 experimental scenarios, we analyze persuasion and anti-social behavior between a guard and a prisoner agent with differing objectives. We first document model-specific conversational failures in this multi-agent power dynamic context. Among models demonstrating successful interaction, we find that goal setting significantly influences persuasiveness but not anti-social behavior. Moreover, agent personas, especially the guard's, substantially impact both successful persuasion by the prisoner and the manifestation of anti-social actions. Notably, we observe the emergence of anti-social conduct even in the absence of explicit negative personality prompts. These results have important implications for the development of interactive LLM agents and the ongoing discussion of their societal impact.
URL: https://openreview.net/forum?id=FR76oM8eGD
---
Title: Quantum entanglement for attention models
Abstract: Attention mechanisms in deep learning establish relationships between different positions within a sequence, enabling models like Transformers to generate effective outputs by focusing on relevant input segments and their relations. The performance of Transformers is highly dependent on the chosen attention mechanism, with various approaches balancing trade-offs between computational cost, memory efficiency, and generalization ability based on the task. Quantum machine learning models possess the potential to outperform their classical counterparts in specialized settings, making the exploration of quantum resources within classical machine learning models a promising research direction. The role of entanglement in quantum machine learning, whether in fully quantum models or as a subroutine in classical-quantum hybrid models, remains poorly understood. In this work, we investigate the hypothesis that entanglement can be used to model nuanced correlations in classical data, analogous to its role in many-body systems. We further test whether quantum entanglement can be used as a resource to improve the performance of the attention layer in Transformers. We introduce an entanglement entropy-based attention layer within a classical Transformer architecture and numerically evaluate it across various datasets. Our experiments on standard classification tasks in both vision and NLP domains reveal that the entanglement-based attention layer outperforms existing quantum attention frameworks and the widely used quantum kernel attention models, particularly in the presence of noise. Our work contributes toward exploring the power of quantum resources as a subroutine in the classical-quantum hybrid setting to further enhance classical models.
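For readers unfamiliar with the central quantity, the entanglement entropy of a two-qubit pure state is a standalone computation; this is a math illustration, not the paper's layer:

    import numpy as np

    def entanglement_entropy(psi):  # psi: length-4 state vector over qubits A,B
        psi = psi / np.linalg.norm(psi)
        M = psi.reshape(2, 2)
        rho_A = np.einsum('ij,kj->ik', M, M.conj())  # partial trace over B
        evals = np.linalg.eigvalsh(rho_A)
        evals = evals[evals > 1e-12]
        return float(-(evals * np.log2(evals)).sum())

    print(entanglement_entropy(np.array([1, 0, 0, 1.0])))  # Bell state -> 1.0
    print(entanglement_entropy(np.array([1, 0, 0, 0.0])))  # product state -> 0.0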
URL: https://openreview.net/forum?id=G4an9paVtP
---
Title: Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Abstract: Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, effective hallucination detection becomes crucial. To this end, we outline a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper’s companion Python toolkit. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
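The tunable ensemble reduces to a weighted combination of per-scorer confidences, with weights tuned on labeled validation data. A generic sketch; the companion toolkit's actual API may differ:

    import numpy as np
    from itertools import product

    def ensemble(scores, w):  # scores: (n, k) per-scorer confidences in [0, 1]
        w = np.asarray(w, float)
        return scores @ (w / w.sum())

    def tune_weights(scores, is_hallucination, grid=np.linspace(0, 1, 5)):
        # is_hallucination: boolean numpy array of length n
        best_w, best_auc = None, -1.0
        for w in product(grid, repeat=scores.shape[1]):
            if sum(w) == 0:
                continue
            s = ensemble(scores, w)
            pos, neg = s[~is_hallucination], s[is_hallucination]
            auc = (pos[:, None] > neg[None, :]).mean()  # P(correct scores higher)
            if auc > best_auc:
                best_w, best_auc = w, auc
        return best_w, best_auc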
URL: https://openreview.net/forum?id=WOFspd4lq5
---
Title: RoboRAN: A Unified Robotics Framework for Reinforcement Learning-Based Autonomous Navigation
Abstract: Autonomous robots must navigate and operate in diverse environments, from terrestrial and aquatic settings to aerial and space domains. While Reinforcement Learning (RL) has shown promise in training policies for specific autonomous robots, existing frameworks and benchmarks are often constrained to unique platforms, limiting generalization and fair comparisons across different mobility systems. In this paper, we present a multi-domain framework for training, evaluating and deploying RL-based navigation policies across diverse robotic platforms and operational environments. Our work presents four key contributions: (1) a scalable and modular framework, facilitating seamless robot-task interchangeability and reproducible training pipelines; (2) sim-to-real transfer demonstrated through real-world experiments with multiple robots, including a satellite robotic simulator, an unmanned surface vessel, and a wheeled ground vehicle; (3) the release of the first open-source API for deploying IsaacLab-trained policies to real robots, enabling lightweight inference and rapid field validation; and (4) uniform tasks and metrics for cross-medium evaluation, through a unified evaluation testbed to assess performance of navigation tasks in diverse operational conditions (aquatic, terrestrial and space). By ensuring consistency between simulation and real-world deployment, RoboRAN lowers the barrier to developing adaptable RL-based navigation strategies. Its modular design enables straightforward integration of new robots and tasks through predefined templates, fostering reproducibility and extension to diverse domains. To support the community, we release RoboRAN as open-source.
URL: https://openreview.net/forum?id=0wDbhLeMj9
---
Title: One Model for All: Multi-Objective Controllable Language Models
Abstract: Aligning large language models (LLMs) with human preferences is critical to enhancing LLMs' safety, helpfulness, humor, faithfulness, and other desirable properties. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, such as prioritizing humor and empathy in one context, while seeking efficiency and precision in another. Can we train one LLM to produce personalized outputs for different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs with respect to user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC’s potential for real-world applications requiring scalable and customizable LLMs.
URL: https://openreview.net/forum?id=qAM5PmvFYY
---
Title: HyResPINNs: A Hybrid Residual Physics-Informed Neural Network Architecture Designed to Balance Expressiveness and Trainability
Abstract: Physics-informed neural networks (PINNs) have emerged as a powerful approach for solving partial differential equations (PDEs) by training neural networks with loss functions that incorporate physical constraints. In this work, we introduce HyResPINNs, a two-level convex-gated architecture designed to maximize approximation expressiveness for a fixed number of degrees of freedom (DoF). The first level combines, per block, smooth basis functions with trainable sparsity and deep neural networks; the second gates entire blocks (much like in ResNets or Highway Nets), allowing for expressivity along the depth dimension of the architecture. Our empirical evaluation on a diverse set of challenging PDE problems demonstrates that HyResPINNs consistently achieve superior accuracy to baseline methods while remaining competitive in training time. These results highlight the potential of HyResPINNs to combine desirable features from traditional scientific computing methods and modern machine learning, paving the way for more robust and expressive approaches to physics-informed modeling.
URL: https://openreview.net/forum?id=et9WkjkqAw
---
Title: A simple connection from loss flatness to compressed neural representations
Abstract: Sharpness, a geometric measure in the parameter space that reflects the flatness of the loss landscape, has long been studied for its potential connections to neural network behavior. While sharpness is often associated with generalization, recent work highlights inconsistencies in this relationship, leaving its true significance unclear. In this paper, we investigate how sharpness influences the local geometric features of neural representations in feature space, offering a new perspective on its role. We introduce this problem and study three measures for compression: the Local Volumetric Ratio (LVR), based on volume compression; the Maximum Local Sensitivity (MLS), based on sensitivity to input changes; and the Local Dimensionality, based on how uniform the sensitivity is across different directions. We show that LVR and MLS correlate with the flatness of the loss around the local minima, and that this correlation is predicted by a relatively simple mathematical relationship: a flatter loss corresponds to a lower upper bound on the compression metrics of neural representations. Our work builds upon the linear stability insight by Ma and Ying, deriving inequalities between various compression metrics and quantities involving sharpness. Our inequalities readily extend to reparametrization-invariant sharpness as well. Through empirical experiments on various feedforward, convolutional, and transformer architectures, we find that our inequalities predict a consistently positive correlation between local representation compression and sharpness.
URL: https://openreview.net/forum?id=GgpQbU9bFR
---
Title: Positional Encoder Graph Quantile Neural Networks for Geographic Data
Abstract: Positional Encoder Graph Neural Networks (PE-GNNs) are among the most effective models for learning from continuous spatial data. However, their predictive distributions are often poorly calibrated, limiting their utility in applications that require reliable uncertainty quantification. We propose the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel framework that combines PE-GNNs with Quantile Neural Networks, partially monotonic neural blocks, and post-hoc recalibration techniques. The PE-GQNN enables flexible and robust conditional density estimation with minimal assumptions about the target distribution, and it extends naturally to tasks beyond spatial data. Empirical results on benchmark datasets show that the PE-GQNN outperforms existing methods in both predictive accuracy and uncertainty quantification, without incurring additional computational cost. We also identify important special cases arising from our formulation, including the PE-GNN.
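Quantile networks of this kind are trained with the pinball loss, whose minimizer is the tau-th conditional quantile; for reference:

    import numpy as np

    def pinball_loss(y_true, y_pred, tau):
        diff = y_true - y_pred
        return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

    # tau = 0.9 penalizes under-prediction nine times more than over-prediction.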
URL: https://openreview.net/forum?id=5PjL8ZOPBt
---
Title: Guiding Reasoning in Small Language Models with LLM Assistance
Abstract: Small language models (SLMs) typically falter on tasks requiring deep, multi-step reasoning. This paper introduces SMART (Small Reasons, Large Hints), a framework where large language models (LLMs) provide targeted, selective guidance to augment SLM reasoning. Drawing from cognitive scaffolding, SMART uses a score-based mechanism to identify uncertain SLM reasoning steps, triggering LLM correction only when essential. This approach, framing structured reasoning as an optimal policy search, steers SLMs towards correct solutions without exhaustive sampling. On mathematical reasoning datasets, SMART enables SLMs to achieve up to 98.9% of LLM-level performance while reducing LLM token usage by up to 90.0%. Our work paves the way for collaborative use of both SLM and LLM to tackle complex reasoning tasks that are currently unsolvable by SLMs alone.
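The control flow the abstract describes fits in a few lines; slm and llm below are placeholder interfaces, not the released SMART API:

    def solve(problem, slm, llm, threshold=0.7):
        steps = []
        while not slm.done(problem, steps):
            step, confidence = slm.propose_step(problem, steps)  # scored step
            if confidence < threshold:  # uncertain step -> targeted LLM hint
                step = llm.correct_step(problem, steps, step)
            steps.append(step)
        return steps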
URL: https://openreview.net/forum?id=Q2d7tSWFdQ
---
Title: A Survey on Large Language Model Reasoning Failures
Abstract: Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities.
URL: https://openreview.net/forum?id=vnX1WHMNmz
---
Title: Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors
Abstract: Continual learning (CL) enables deep neural networks to acquire new knowledge over time while mitigating catastrophic forgetting of previously learned information. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, further bridging the gap between PTMs and continual adaptation. Leveraging its multi-modal visual and textual representations, CLIP offers a natural paradigm for CL, where new tasks can be accommodated by incrementally learning lightweight parameters, particularly prompts. However, existing prompt-based CL methods for PTMs often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incrementation processes. While these approaches improve performance, they frequently introduce additional, and possibly unnecessary, complexity, underutilizing CLIP's intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors to guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we activate the language branch and extend our approach to jointly optimize both visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.
URL: https://openreview.net/forum?id=YJnjkzKq5Y
---
Title: TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models
Abstract: Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. We will release the code to support open research and promote reproducibility.
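The core pooling step implied by the abstract is compact: average text-aligned patch features inside each SAM2 mask to get one region token, then label regions by cosine similarity to text embeddings. A simplified numpy sketch with assumed shapes:

    import numpy as np

    def region_tokens(patch_feats, masks):
        """patch_feats: (H, W, D) from an image-text model; masks: list of (H, W) bool."""
        return np.stack([patch_feats[m].mean(axis=0) for m in masks])

    def label_regions(tokens, text_embs, names):
        t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        e = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
        return [names[i] for i in (t @ e.T).argmax(axis=1)]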
URL: https://openreview.net/forum?id=KZLmkL62M4
---
Title: Data Diversity as Implicit Regularization: How Does Diversity Shape the Weight Space of Deep Neural Networks?
Abstract: Data augmentation that introduces diversity into the input data has long been used in training deep learning models. It has demonstrated benefits in improving robustness and generalization, practically aligning well with other regularization strategies such as dropout and weight decay. However, the underlying mechanism of how diverse training data contributes to model improvements remains unknown. In this paper, we investigate the impact of data diversity on the weight space of deep neural networks using Random Matrix Theory. Through spectral analysis and comparing models trained with data augmentation, dropout, and weight decay, we reveal that increasing data diversity alters the weight spectral distribution similarly to other regularization techniques, while displaying a pattern more closely aligned with dropout than with weight decay. Building on these insights, we propose a metric to explain and compare the benefits of diversity introduced by traditional data augmentations and those achieved through synthetic data.
URL: https://openreview.net/forum?id=nAjStmwVae
---
Title: Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning
Abstract: Natural gradients have long been studied in deep reinforcement learning for their fast convergence properties and covariant weight updates. However, computing natural gradients requires inverting the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation of the full inverse FIM. We theoretically show that, under certain conditions, the rank-1 approximation to the inverse FIM converges faster than policy gradients and, under certain conditions, enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard trust-region and actor-critic baselines.
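One way to see how a rank-1 curvature model yields O(d) natural-gradient steps: approximate the FIM as lam*I + v v^T and invert it in closed form via Sherman-Morrison. This is an illustration of the idea, not necessarily the paper's estimator.

    import numpy as np

    def rank1_natural_grad(g, v, lam=1e-2):
        """Solve (lam*I + v v^T) x = g without forming or inverting F."""
        return g / lam - v * (v @ g) / (lam * (lam + v @ v))

    # sanity check against the explicit matrix
    d, rng = 5, np.random.default_rng(0)
    g, v = rng.normal(size=d), rng.normal(size=d)
    F = 1e-2 * np.eye(d) + np.outer(v, v)
    assert np.allclose(F @ rank1_natural_grad(g, v), g)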
URL: https://openreview.net/forum?id=ko8Kn7TS6m
---
Title: Learning Task-Aware Abstract Representations for Meta-Reinforcement Learning
Abstract: A central challenge in meta-reinforcement learning (meta-RL) is enabling agents trained on a set of environments to generalize to new, related tasks without requiring full policy retraining. Existing model-free approaches often rely on context-conditioned policies learned via encoder networks. However, these context encoders are prone to overfitting on the training environments, resulting in poor out-of-sample performance on unseen tasks. To address this issue, we adopt an alternative approach that uses an abstract representation model to learn augmented, task-aware abstract states. We achieve this by introducing a novel architecture that offers more flexibility than existing recurrent network-based approaches. In addition, we optimize our model with multiple loss terms that encourage predictive, task-aware representations in the abstract state space. Our method simplifies the learning problem and provides a flexible framework that can be easily combined with any off-the-shelf reinforcement learning algorithm. We provide theoretical guarantees alongside empirical results, showing strong generalization performance across classical control and robotic meta-RL benchmarks.
URL: https://openreview.net/forum?id=3CWyTh4hJ4
---
Title: Demystifying amortized causal discovery with transformers
Abstract: Supervised learning for causal discovery from observational data often achieves competitive performance despite seemingly avoiding the explicit assumptions that traditional methods require for identifiability. In this work, we analyze CSIvA (Ke et al., 2023), a transformer architecture for amortized inference that promises to train on synthetic data and transfer to real data, on bivariate causal models. First, we bridge the gap with identifiability theory, showing that the training distribution implicitly defines a prior on the causal model of the test observations: consistent with classical approaches, good performance is achieved when we have a good prior on the test data, and the underlying model is identifiable. Second, we find that CSIvA cannot generalize to classes of causal models unseen during training: to overcome this limitation, we theoretically and empirically analyze \textit{when} training CSIvA on datasets generated by multiple identifiable causal models with different structural assumptions improves its generalization at test time. Overall, we find that amortized causal discovery still adheres to identifiability theory, violating the previous hypothesis from Lopez-Paz et al. (2015) that supervised learning methods could overcome its restrictions.
URL: https://openreview.net/forum?id=9Lgy7IGSfp
---
Title: Density-Aware Farthest Point Sampling
Abstract: We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Thus, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that linearly depends on the weighted fill distance of the training set—a quantity we can estimate simply by considering the data features. We introduce "Density-Aware Farthest Point Sampling" (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance, thereby aiming at minimizing our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
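A plausible reading of DA-FPS in code: classic farthest point sampling, with candidate distances weighted by a kernel density estimate so dense regions are filled first. The exact weighting used in the paper may differ.

    import numpy as np

    def da_fps(X, k, bandwidth=1.0):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        density = np.exp(-(D / bandwidth) ** 2).sum(1)  # kernel density proxy
        weights = density / density.sum()
        chosen = [int(weights.argmax())]  # start in a dense region
        mind = D[chosen[0]].copy()        # distance to nearest chosen point
        for _ in range(k - 1):
            nxt = int((weights * mind).argmax())  # density-weighted fill distance
            chosen.append(nxt)
            mind = np.minimum(mind, D[nxt])
        return chosen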
URL: https://openreview.net/forum?id=vI47lgIfYc
---
Title: CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?
Abstract: Analyzing Open Source Intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive Cyber Threat Intelligence (CTI) reports. This process usually follows a three-stage workflow---triage, deep search and TI drafting. While Large Language Models (LLMs) offer a promising route toward automation, existing benchmarks still have limitations. These benchmarks often consist of tasks that do not reflect real-world analyst workflows. For example, human analysts rarely receive tasks in the form of multiple-choice questions. Also, existing benchmarks often rely on model-centric metrics that emphasize lexical overlap rather than actionable, detailed insights essential for security analysts. Moreover, they typically fail to cover the complete three-stage workflow. To address these issues, we introduce CyberThreat-Eval, which is collected from the daily CTI workflow of a world-leading company. This expert-annotated benchmark assesses LLMs on practical tasks across all three stages mentioned above. It utilizes analyst-centric metrics that measure factual accuracy, content quality, and operational costs. Our evaluation using this benchmark reveals important insights into the limitations of current LLMs. For example, LLMs often lack the nuanced expertise required to handle complex details and struggle to distinguish between correct and incorrect information. To address these challenges, the CTI workflow incorporates both external ground-truth databases and human expert knowledge. TRA allows human experts to iteratively provide feedback for continuous improvement. The CyberThreat-Eval benchmark will be made available.
URL: https://openreview.net/forum?id=tiFtZHwr7O
---
Title: Understanding Transformer-based Vision Models through Inversion
Abstract: Understanding the mechanisms underlying deep neural networks remains a fundamental challenge in machine learning and computer vision. One promising, yet only preliminarily explored, approach is feature inversion, which attempts to reconstruct images from intermediate representations using trained inverse neural networks. In this study, we revisit feature inversion, introducing a novel, modular variation that enables significantly more efficient application of the technique. We demonstrate how our method can be systematically applied to the large-scale transformer-based vision models, Detection Transformer and Vision Transformer, and how reconstructed images can be qualitatively interpreted in a meaningful way. We further quantitatively evaluate our method, thereby uncovering underlying mechanisms of representing image features that emerge in the two transformer architectures. Our analysis reveals key insights into how these models encode contextual shape and image details, how their layers correlate, and their robustness against color perturbations. These findings contribute to a deeper understanding of transformer-based vision models and their internal representations.
URL: https://openreview.net/forum?id=qcgWlzKiCl
---
Title: BAOSL: Benchmarking Active Optimization for Self-driving Laboratories
Abstract: Discovery of novel materials and antibiotics can be posed as an optimization problem, namely, identifying candidate formulations that maximize one or more desired properties. In practice, however, the enormous dimensionality of the design space and the high cost of each experimental evaluation make exhaustive search strategies infeasible. Active learning methods, which iteratively identify informative data points, offer a promising solution to tackle these challenges by significantly reducing the data-labeling effort and resource requirements. Integrating active learning into optimization workflows, hereafter termed active optimization, accelerates the discovery of optimal candidates while substantially cutting the number of required evaluations. Despite these advances, the absence of standardized benchmarks impedes objective comparison of methodologies, slowing progress in self-driving scientific discovery. To address this, we introduce BAOSL, a comprehensive benchmark designed to systematically evaluate active optimization in self-driving laboratories. BAOSL provides a standardized evaluation protocol and reference implementations to facilitate efficient and reproducible benchmarking. BAOSL includes both synthetic benchmarks and real-world tasks in various fields, designed to address unique challenges, particularly limited data availability, in self-driving laboratories.
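The loop such a benchmark standardizes looks like this in its simplest surrogate-based form; a generic sketch, not the benchmark's API:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def active_optimize(objective, candidates, n_init=5, budget=30, kappa=2.0):
        rng = np.random.default_rng(0)
        idx = list(rng.choice(len(candidates), n_init, replace=False))
        y = [objective(candidates[i]) for i in idx]
        for _ in range(budget - n_init):
            gp = GaussianProcessRegressor(normalize_y=True).fit(candidates[idx], y)
            mu, sd = gp.predict(candidates, return_std=True)
            ucb = mu + kappa * sd          # upper-confidence-bound acquisition
            ucb[idx] = -np.inf             # never repeat an experiment
            i = int(ucb.argmax())
            idx.append(i)
            y.append(objective(candidates[i]))
        best = int(np.argmax(y))
        return candidates[idx[best]], y[best]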
URL: https://openreview.net/forum?id=UYEHUOSPmU
---
Title: Single Train Multi Deploy on Topology Search Spaces using Kshot-Hypernet
Abstract: Neural Architecture Search (NAS) has become a crucial research direction for automating the design of neural networks. The introduction of weight sharing has significantly reduced the computational and time costs of NAS. Recent approaches enable the simultaneous training of numerous sub-networks without the need for retraining; however, these methods are primarily limited to the Size Search Space (SSS), which provides limited architecture diversity. To date, methods based on the more diverse Topology Search Space (TSS), which has greater potential for hardware-aware architecture search, remain unexplored. In this work, we propose a novel NAS method that operates on TSS while maintaining high efficiency. To do so, we introduce Kshot-Hypernet, which extends in-place distillation to TSS, significantly improving supernetwork training. Experiments on NAS-Bench-201 show that, once the supernet is trained, most sub-networks can match or even exceed the performance of those trained from scratch. Furthermore, our method achieves 80.7% top-1 accuracy on ImageNet with only 8.7M parameters.
URL: https://openreview.net/forum?id=WyAMIMF1KZ
---
Title: Domain Translation with Monolingual Lexical Distribution
Abstract: Neural machine translation (NMT) often demands a large amount of high-quality training data when adapting to a new domain with a carefully designed fine-tuning strategy. However, constructing a sufficient amount of parallel data for training poses challenges even for fine-tuning. This work proposes to fine-tune a generic NMT model using only the monolingual lexical distribution estimated from a small amount of in-domain data in the target language. Word frequency plays a critical role in analyzing the differences among corpora in various fields, e.g., psycholinguistics and language education, and our challenge lies in whether we can fit a model using the naive statistics collected from a target language domain in NMT. We leverage a variant of energy-based models (EBMs) based on Conditional Distributional Policy Gradients (CDPG), using a large number of EBMs to constrain the fine-tuning process with the lexical distribution. We conduct experiments across four translation directions and four domain datasets, totaling 16 domain adaptation scenarios. The results demonstrate that our method enables robust domain shift while mitigating catastrophic forgetting, achieving effective domain adaptation using only a small amount of monolingual resources.
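The only in-domain signal the method needs is the kind of statistic below, a unigram distribution over a small target-language corpus; a trivial but faithful illustration of "monolingual lexical distribution":

    from collections import Counter

    def lexical_distribution(sentences):
        counts = Counter(w for s in sentences for w in s.split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}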
URL: https://openreview.net/forum?id=UKLBobrFCR
---
Title: Emergence of Quantised Representations Isolated to Anisotropic Functions
Abstract: This paper presents a novel methodology for determining representational structure, which builds upon the existing Spotlight Resonance method. Particularly, this new tool is used to gain insight into how discrete representations can emerge and organise in autoencoder models, through a controlled ablation study in which only the activation function is altered. Using this technique, it is determined whether function-driven symmetries can act as implicit inductive biases on representations. Representations are found to discretise when the activation functions are defined through a discrete algebraic permutation-equivariant symmetry. In contrast, they remain continuous under a continuous algebraic orthogonal-equivariant definition. This confirms the hypothesis: algebraic symmetries of network primitives can carry unintended inductive biases which produce task-independent artefactual structures in representations. The discrete symmetry of contemporary forms is shown to be a strong predictor for the induction of discrete representations transformed from otherwise continuous structures --- a quantisation effect. This motivates further reassessment of functional forms in common usage. Moreover, this supports a general causal model for one mode in which discrete representations may form, and could constitute a prerequisite for downstream interpretability phenomena, including grandmother neurons, discrete coding schemes, general linear features and possibly Superposition. Hence, this tool and proposed mechanism for the influence of functional form on representations may provide insights into emergent interpretability research. Finally, preliminary results indicate that quantisation of representations appears to correlate with a measurable increase in reconstruction error, reinforcing previous conjectures that this collapse can be detrimental.
URL: https://openreview.net/forum?id=aokVprhs3d
---