Survey Certification
====================
Title: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Authors: Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru WANG, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng YAN, Philip Torr, LEI BAI
URL: https://openreview.net/forum?id=RY19y2RI1O
---
Accepted papers
===============
Title: On the Importance of Pretraining Data Alignment for Atomic Property Prediction
Authors: Yasir M. Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Abstract: This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected task-aligned dataset can match or even surpass large-scale joint pretraining while using only 1/24th of the pretraining budget. We introduce the Chemical Similarity Index (CSI), a simple metric for molecular graphs inspired by the Fréchet Inception Distance in computer vision, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the upstream dataset with the smallest CSI distance to the downstream task, we show that models pretrained on a smaller, focused dataset consistently achieve better performance on downstream tasks than those pretrained on massive, mixed datasets such as JMP. This holds even when the mixed dataset includes the upstream dataset most aligned with the downstream task. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data is poorly aligned with the target task. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
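As a rough illustration (not the paper's code), a Fréchet-style distance of the kind CSI is inspired by can be computed between two embedding sets as follows; the random arrays are hypothetical stand-ins for upstream and downstream molecular-graph features, and the paper's actual featurization will differ.

```python
# Sketch of a Frechet-style distance between two embedding sets, the family of
# quantities CSI is inspired by (via the Frechet Inception Distance). The
# arrays below are hypothetical stand-ins, not the paper's features.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """Frechet distance between Gaussians fit to two (n, d) embedding sets."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y).real  # matrix sqrt; drop tiny imaginary parts
    diff = mu_x - mu_y
    return diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean)

rng = np.random.default_rng(0)
upstream_emb = rng.normal(size=(500, 64))               # hypothetical upstream features
downstream_emb = rng.normal(0.5, 1.0, size=(400, 64))   # hypothetical downstream features
print(frechet_distance(upstream_emb, downstream_emb))
```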
URL: https://openreview.net/forum?id=jfD9BsrDTb
---
Title: RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
Authors: Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao XIE, Xiang Wan, Anningzhe Gao
Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its implementation complexity and computational cost, especially for online sampling-based methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Even with recent simplifications such as Direct Preference Optimization (DPO), which designs an offline implicit reward learning objective relying on pre-collected preference datasets, over-fitting and training instability continue to keep the alignment process from the expected optimal performance. To address these challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called **V**ariational **A**lignment with **R**e-weighting (**VAR**). Specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into an offline, reward-driven, re-weighted supervised fine-tuning (SFT) form, which requires only a minor adjustment to the SFT loss to obtain noticeable improvements in training stability and effectiveness. On comprehensive evaluation benchmarks, our objective enables LLMs to outperform offline alignment methods, demonstrating superior performance in both helpfulness and harmlessness metrics (avg. $\uparrow 7.16\%$ over DPO). Compared to online sampling methods, our method is comparable or even better while significantly reducing computational overhead and accelerating convergence (over $5\times$ faster than GRPO), suggesting our approach is an efficient and effective solution for bridging the gap between efficiency and performance in LLM alignment.
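To make the "re-weighted SFT form" concrete, here is a minimal sketch of a reward-weighted SFT loss in PyTorch. The softmax-over-rewards weighting is one standard instantiation and an assumption on our part; VAR's exact re-weighting is defined in the paper.

```python
# Minimal sketch of a reward-weighted SFT loss. The softmax-of-rewards
# weighting is our assumption, not necessarily VAR's exact objective.
import torch
import torch.nn.functional as F

def reward_weighted_sft_loss(logits, targets, rewards, beta=1.0):
    """logits: (B, T, V); targets: (B, T); rewards: (B,) one scalar per sequence."""
    # Per-sequence negative log-likelihood, exactly as in ordinary SFT.
    nll = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    ).mean(dim=1)                                    # (B,)
    # Re-weight each sequence by its (temperature-scaled) reward.
    weights = torch.softmax(rewards / beta, dim=0)   # (B,), sums to 1
    return (weights * nll).sum()

logits = torch.randn(4, 16, 32000)   # toy batch: 4 sequences, 16 tokens each
targets = torch.randint(0, 32000, (4, 16))
rewards = torch.tensor([0.2, 1.5, -0.3, 0.8])
print(reward_weighted_sft_loss(logits, targets, rewards))
```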
URL: https://openreview.net/forum?id=jewB0UhFuj
---
Title: ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling
Authors: Qihao Liu, Ju He, Qihang Yu, Liang-Chieh Chen, Alan Yuille
Abstract: In recent years, video generation has seen significant advancements. However, challenges persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D model knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized motion prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos, even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D motion knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.
URL: https://openreview.net/forum?id=mQ5frFQTFV
---
Title: Characterizing Evolution in Expectation-Maximization Estimates for Overspecified Mixed Linear Regression
Authors: Zhankun Luo, Abolfazl Hashemi
Abstract: Estimating data distributions using parametric families is crucial in many learning setups, serving both as a standalone problem and an intermediate objective for downstream tasks. Mixture models, in particular, have attracted significant attention due to their practical effectiveness and comprehensive theoretical foundations. A persistent challenge is model misspecification, which occurs when the model to be fitted has more mixture components than the data distribution. In this paper, we develop a theoretical understanding of the Expectation-Maximization (EM) algorithm's behavior in the context of targeted model misspecification for overspecified two-component Mixed Linear Regression (2MLR) with unknown $d$-dimensional regression parameters and mixing weights. In Theorem 5.1 at the population level, with an unbalanced initial guess for mixing weights, we establish linear convergence of regression parameters in $\mathcal{O}(\log (1/\epsilon))$ steps. Conversely, with a balanced initial guess for mixing weights, we observe sublinear convergence in $\mathcal{O}(\epsilon^{-2})$ steps to achieve $\epsilon$-accuracy in Euclidean distance. In Theorem 6.1 at the finite-sample level, for mixtures with sufficiently unbalanced fixed mixing weights, we demonstrate a statistical accuracy of $\mathcal{O}((d/n)^{1/2})$, whereas for those with sufficiently balanced fixed mixing weights, the accuracy is $\mathcal{O}((d/n)^{1/4})$ given $n$ data samples. Furthermore, we underscore the connection between our population-level and finite-sample-level results: by setting the desired final accuracy $\epsilon$ in Theorem 5.1 to match that in Theorem 6.1 at the finite-sample level, namely letting $\epsilon = \mathcal{O}((d/n)^{1/2})$ for sufficiently unbalanced fixed mixing weights and $\epsilon = \mathcal{O}((d/n)^{1/4})$ for sufficiently balanced fixed mixing weights, we intuitively derive iteration complexity bounds $\mathcal{O}(\log (1/\epsilon))=\mathcal{O}(\log (n/d))$ and $\mathcal{O}(\epsilon^{-2})=\mathcal{O}((n/d)^{1/2})$ at the finite-sample level for sufficiently unbalanced and balanced initial mixing weights, respectively. We further extend our analysis in the overspecified setting to the finite low SNR regime, providing approximate dynamic equations that characterize the EM algorithm's behavior in this challenging case. Our new findings not only expand the scope of theoretical convergence but also improve the bounds for statistical error, time complexity, and sample complexity, and rigorously characterize the evolution of EM estimates.
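For readers new to the setting, a minimal finite-sample EM loop for 2MLR looks roughly as follows; the noise scale, initialization, and the single-component data (which makes the two-component fit overspecified) are our illustrative assumptions, and the paper's population-level analysis works with the idealized version of these updates.

```python
# Minimal finite-sample EM for two-component mixed linear regression (2MLR).
# Fixed noise level and the data generation below are illustrative assumptions.
import numpy as np

def em_2mlr(X, y, n_iter=200, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = rng.normal(size=(2, d))   # regression parameters, random init
    pi = np.array([0.5, 0.5])        # mixing weights (balanced init)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample.
        resid = y[:, None] - X @ beta.T                     # (n, 2)
        log_lik = -0.5 * (resid / sigma) ** 2 + np.log(pi)  # up to a constant
        log_lik -= log_lik.max(axis=1, keepdims=True)       # numerical stability
        w = np.exp(log_lik)
        w /= w.sum(axis=1, keepdims=True)                   # (n, 2)
        # M-step: weighted least squares per component; update mixing weights.
        for k in range(2):
            Wk = w[:, k]
            beta[k] = np.linalg.solve((X * Wk[:, None]).T @ X, X.T @ (Wk * y))
        pi = w.mean(axis=0)
    return beta, pi

# Overspecified fit: the data actually come from a single regression component.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
beta_star = rng.normal(size=5)
y = X @ beta_star + 0.5 * rng.normal(size=2000)
beta_hat, pi_hat = em_2mlr(X, y)
print(pi_hat, np.linalg.norm(beta_hat - beta_star, axis=1))
```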
URL: https://openreview.net/forum?id=mFdHMNFtrT
---
Title: Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms
Authors: Pravin Nair
Abstract: The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
TL;DR: We show that the softmax operator is $1/2$-Lipschitz (contractive) over all $\ell_p$ norms ($p \ge 1$), and characterize the tightness of this bound. We validate the constant empirically on modern attention architectures and stochastic RL policies, and demonstrate how the sharper Lipschitz bound improves existing robustness and optimization guarantees.
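The bound is easy to spot-check numerically; the snippet below is only a sanity check, not a reproduction of the paper's experiments.

```python
# Empirical spot-check: the ratio ||softmax(x) - softmax(y)||_p / ||x - y||_p
# should never exceed 0.5, for any p >= 1.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(20_000):
    x = rng.normal(scale=3.0, size=8)
    y = x + rng.normal(scale=0.1, size=8)  # nearby point probes the local constant
    for p in (1.0, 1.5, 2.0, np.inf):
        ratio = (np.linalg.norm(softmax(x) - softmax(y), ord=p)
                 / np.linalg.norm(x - y, ord=p))
        worst = max(worst, ratio)
print(worst)  # stays below 0.5, consistent with the theorem
```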
URL: https://openreview.net/forum?id=6dowaHsa6D
---
Title: Hypergraph clustering using Ricci curvature: an edge transport perspective
Authors: Olympio Hacquard
Abstract: In this paper, we introduce a novel method for extending Ricci flow to hypergraphs by defining probability measures on the edges and transporting them on the line expansion. This approach yields a new weighting on the edges, which proves particularly effective for community detection. We extensively compare this method with a similar notion of Ricci flow defined on the clique expansion, demonstrating its enhanced sensitivity to the hypergraph structure, especially in the presence of large hyperedges. The two methods are complementary and together form a powerful and highly interpretable framework for community detection in hypergraphs.
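For reference, one standard construction of the line expansion (the graph on which the edge measures are transported) is sketched below; the paper's measure definitions and transport computation are not reproduced here, and whether it uses exactly this variant of the expansion is specified there.

```python
# Sketch of a line expansion of a hypergraph: one node per (vertex, hyperedge)
# incidence pair, with two pairs adjacent when they share the vertex or the
# hyperedge. This only builds the graph the transport would run on.
import itertools
import networkx as nx

def line_expansion(hyperedges):
    pairs = [(v, i) for i, e in enumerate(hyperedges) for v in e]
    g = nx.Graph()
    g.add_nodes_from(pairs)
    for (v1, e1), (v2, e2) in itertools.combinations(pairs, 2):
        if v1 == v2 or e1 == e2:
            g.add_edge((v1, e1), (v2, e2))
    return g

H = [{"a", "b", "c"}, {"c", "d"}, {"d", "e", "f"}]  # toy hypergraph
L = line_expansion(H)
print(L.number_of_nodes(), L.number_of_edges())
```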
URL: https://openreview.net/forum?id=HMROU8MXqV
---
Title: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Authors: Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru WANG, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng YAN, Philip Torr, LEI BAI
Abstract: The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM RL with the temporally extended Partially Observable Markov Decision Processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
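The MDP-versus-POMDP contrast can be made concrete with a toy rollout; everything below (environment, observation model, stand-in policy) is a hypothetical illustration, not an interface from the survey.

```python
# Toy version of the abstract's contrast: conventional LLM RL is a degenerate
# single-step MDP (one prompt in, one completion and reward out), while agentic
# RL unfolds as a temporally extended POMDP where the agent acts repeatedly on
# partial observations and can keep a memory of them.
import random

class ToyPOMDPEnv:
    """Agent must find a hidden goal it can only sense coarsely."""
    def __init__(self, size=10):
        self.size = size
    def reset(self):
        self.pos = 0
        self.goal = random.randrange(1, self.size)
        return self._obs()
    def _obs(self):
        # Partial observability: a coarse warm/cold signal, not the goal itself.
        return "warm" if abs(self.pos - self.goal) <= 2 else "cold"
    def step(self, action):  # action is -1 or +1
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.goal
        return self._obs(), (1.0 if done else -0.01), done

def agentic_episode(env, max_steps=50):
    obs, total = env.reset(), 0.0
    history = [obs]  # memory compensates for partial observations
    for _ in range(max_steps):
        action = random.choice((-1, 1))  # stand-in policy; agentic RL learns this
        obs, reward, done = env.step(action)
        history.append(obs)
        total += reward
        if done:
            break
    return total

print(agentic_episode(ToyPOMDPEnv()))
```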
URL: https://openreview.net/forum?id=RY19y2RI1O
---
Title: RT2I-Bench: Evaluating Robustness of Text-to-Image Systems Against Adversarial Attacks
Authors: Athanasios Glentis, Ioannis Tsaknakis, Jiangweizhi Peng, Xun Xian, Yihua Zhang, Gaowen Liu, Charles Fleming, Mingyi Hong
Abstract: Text-to-Image (T2I) systems have demonstrated impressive abilities in generating images from text descriptions. However, these systems remain susceptible to adversarial prompts, i.e., carefully crafted input manipulations that can result in misaligned or even toxic outputs. This vulnerability highlights the need for systematic evaluation of attack strategies that exploit these weaknesses, as well as for testing the robustness of T2I systems against them. To this end, this work introduces the RT2I-Bench benchmark. RT2I-Bench serves two primary purposes. First, it provides a structured evaluation of various adversarial attacks, examining their effectiveness, transferability, stealthiness, and potential for generating misaligned or toxic outputs, as well as assessing the resilience of state-of-the-art T2I models to such attacks. We observe that state-of-the-art T2I systems are vulnerable to adversarial prompts, with the most effective attacks achieving success rates of over 60\% across the majority of T2I models we tested. Second, RT2I-Bench enables the creation of a set of strong adversarial prompts (1,439 prompts that induce misaligned or targeted outputs and 173 that induce toxic outputs), which are effective across a wide range of systems. Finally, our benchmark is designed to be extensible, enabling the seamless addition of new attacks, T2I models, and evaluation metrics. This framework provides an automated solution for robustness assessment and adversarial prompt generation in T2I systems.
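The headline numbers are attack success rates; a minimal sketch of that metric follows, with `generate` and `is_misaligned` as hypothetical stand-ins for a T2I model call and an alignment judge (e.g., CLIP similarity against the intended content).

```python
# Minimal sketch of an attack-success-rate (ASR) computation. The `generate`
# and `is_misaligned` callables below are hypothetical stand-ins; RT2I-Bench's
# actual models and judges are described in the paper.
from dataclasses import dataclass

@dataclass
class AdvPrompt:
    intended: str     # what the image should depict
    adversarial: str  # the manipulated prompt fed to the T2I system

def attack_success_rate(prompts, generate, is_misaligned):
    hits = sum(is_misaligned(p.intended, generate(p.adversarial)) for p in prompts)
    return hits / len(prompts)

# Toy stand-ins so the sketch runs end to end.
toy = [AdvPrompt("a dog", "a d0g, asc1i style"), AdvPrompt("a cat", "a cat")]
generate = lambda prompt: prompt.upper()                  # pretend "image"
is_misaligned = lambda intended, image: intended.upper() not in image
print(attack_success_rate(toy, generate, is_misaligned))  # 0.5 on the toy pair
```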
URL: https://openreview.net/forum?id=ZUiWjEouSf
---
New submissions
===============
Title: On Federated Compositional Optimization: Algorithms, Analysis, and Guarantees
Abstract: Compositional optimization (CO) has recently gained popularity due to its role in many machine learning applications. The large-scale and distributed nature of data necessitates efficient federated learning (FL) algorithms for CO, but the compositional structure of the objective poses significant challenges. Current methods either rely on large batch gradients (which are impractical), require expensive computations, or suffer from suboptimal guarantees. To address these challenges, we propose efficient FedAvg-type algorithms for solving non-convex CO in the FL setting. We first theoretically establish that standard FedAvg fails to solve federated CO problems due to data heterogeneity, which amplifies bias in local gradient estimates. Our analysis shows that controlling this bias necessarily requires either {\em additional communication} or {\em additional structural assumptions}. To this end, we develop two algorithms for solving the federated CO problem. First, we propose \aname~that utilizes the compositional problem structure to design a communication strategy that allows FedAvg to converge. \aname~achieves a sample complexity of $\mathcal{O}(\epsilon^{-2})$ and communication complexity of $\mathcal{O}(\epsilon^{-3/2})$. Then we propose \anameds, a two-sided learning rate algorithm, that leverages an additional assumption to improve upon the communication complexity of \aname. \anameds~achieves the optimal $\mathcal{O}(\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-1})$ communication complexity. We corroborate our theoretical findings with empirical studies on large-scale CO problems.
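To see where the bias the abstract refers to comes from (in our notation, not necessarily the paper's): for a compositional objective $F(x) = f(g(x))$ with $g(x) = \mathbb{E}_{\xi}[g_{\xi}(x)]$, the chain rule gives $\nabla F(x) = \nabla g(x)^{\top} \nabla f(g(x))$, but the naive stochastic estimate $\nabla g_{\zeta}(x)^{\top} \nabla f(g_{\xi}(x))$ is biased because $\mathbb{E}_{\xi}[\nabla f(g_{\xi}(x))] \neq \nabla f(\mathbb{E}_{\xi}[g_{\xi}(x)])$ whenever $f$ is nonlinear. Under data heterogeneity, each client's local estimate of $g$ drifts away from the global mean over local steps, amplifying exactly this gap; hence the need for extra communication or extra structural assumptions.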
URL: https://openreview.net/forum?id=4uRlbSNevR
---
Title: The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks
Abstract: The success of deep neural networks largely depends on the statistical structure of the training data. While learning dynamics and generalization on isotropic data are well-established, the impact of pronounced anisotropy on these crucial aspects is not yet fully understood. We examine the impact of data anisotropy, represented by a spiked covariance structure, a canonical yet tractable model, on the learning dynamics and generalization error of a two-layer linear network in a linear regression setting. Our analysis reveals that the learning dynamics proceed in two distinct phases, governed initially by the input-output correlation and subsequently by other principal directions of the data structure. Furthermore, we derive an analytical expression for the generalization error, quantifying how the alignment of the spike structure of the data with the learning task improves performance. Our findings offer deep theoretical insights into how data anisotropy shapes the learning trajectory and final performance, providing a foundation for understanding complex interactions in more advanced network architectures.
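A minimal sketch of the data model and of the two-phase effect may help; we run a few full-batch gradient steps on plain linear regression for brevity (the paper analyzes a two-layer linear network), and the teacher and noise setup below are our illustrative assumptions.

```python
# Spiked-covariance model: inputs with covariance I_d + theta * v v^T, and a
# regression task either aligned with or orthogonal to the spike v. With the
# same small number of gradient steps, the spike direction is learned first,
# so the aligned task's test error drops much faster.
import numpy as np

rng = np.random.default_rng(0)
d, n, theta = 50, 200, 10.0
v = rng.normal(size=d)
v /= np.linalg.norm(v)  # unit spike direction

def sample_x(m):
    # x = isotropic noise + scaled component along v  =>  Cov(x) = I + theta v v^T
    return rng.normal(size=(m, d)) + np.sqrt(theta) * rng.normal(size=(m, 1)) * v

def gd_test_error(w_star, steps=10):
    X, X_test = sample_x(n), sample_x(2000)
    y = X @ w_star + 0.1 * rng.normal(size=n)
    w, lr = np.zeros(d), 1.0 / (2.0 * (1.0 + theta))  # ~1 / (2 lambda_max)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n               # full-batch gradient step
    return np.mean((X_test @ (w - w_star)) ** 2)

u = rng.normal(size=d)
u -= (u @ v) * v
u /= np.linalg.norm(u)  # task direction orthogonal to the spike
print("spike-aligned task:", gd_test_error(v))
print("orthogonal task:   ", gd_test_error(u))
```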
URL: https://openreview.net/forum?id=pHDSgtDDez
---
Title: OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
Abstract: Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning–based TSC methods function as black boxes, providing little to no insight into their decisions. Although large language models (LLMs) could provide the needed interpretability through natural language reasoning, they face challenges such as limited memory and difficulty in deriving optimal policies from sparse environmental feedback. Existing TSC methods that apply reinforcement fine-tuning to LLMs face notable training instability and deliver only limited improvements over pretrained models. We attribute this instability to the long-horizon nature of TSC: feedback is sparse and delayed, most control actions yield only marginal changes in congestion metrics, and the resulting weak reward signals interact poorly with policy-gradient optimization. We introduce OracleTSC, which addresses these issues through (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental feedback, and (2) an uncertainty regularization that prevents policy degeneracy by maximizing the probability of the chosen answer, promoting consistent decision-making across multiple responses. Experiments on the standard LibSignal benchmark demonstrate that our approach enables a compact model (LLaMA3-8B) to achieve substantial improvements in traffic flow, with a 75% reduction in travel time and 67% decrease in queue lengths over the pretrained baseline while preserving interpretability through natural language explanations. Furthermore, the method exhibits strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally distinct intersection with 17% lower travel time and 39% lower queue length, all without any additional finetuning for the target topology. These findings show that uncertainty-aware reward shaping could stabilize reinforcement fine-tuning and provide a new perspective for improving its effectiveness in TSC tasks.
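A minimal sketch of the reward-hurdle idea follows; the quantile-based calibration is our assumption for illustration, and OracleTSC's exact calibration is specified in the paper.

```python
# Reward hurdle: subtract a calibrated threshold from raw environmental
# feedback so that marginal congestion changes stop emitting (noisy)
# policy-gradient signal. The quantile rule here is an assumed calibration.
import numpy as np

def hurdle_rewards(raw_rewards, quantile=0.5):
    raw = np.asarray(raw_rewards, dtype=float)
    tau = np.quantile(raw, quantile)  # calibrated hurdle
    return raw - tau                  # weak feedback now centers near zero

print(hurdle_rewards([0.01, 0.02, 0.015, 0.4, -0.3]))
```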
URL: https://openreview.net/forum?id=WmJu5MkoQD
---
Title: AIMing for Standardised Explainability Evaluation in GNNs: A Framework and Case Study on Graph Kernel Networks
Abstract: Graph Neural Networks (GNNs) have advanced significantly in handling graph-structured data, but a comprehensive framework for evaluating explainability remains lacking. Existing evaluation frameworks primarily involve post-hoc explanations and operate in the setting where multiple methods generate a suite of explanations for a single model, which makes comparing explanations across models difficult. Evaluation of inherently interpretable models often targets a specific aspect of interpretability relevant to the model, but remains underdeveloped in terms of generating insight across a suite of measures. We introduce AIM, a comprehensive framework that addresses these limitations by measuring Accuracy, Instance-level explanations, and Model-level explanations. AIM is formulated with minimal constraints to enhance flexibility and facilitate broad applicability. Here, we use AIM in a pipeline: we extract explanations from inherently interpretable GNNs such as graph kernel networks (GKNs) and prototype networks (PNs), evaluate these explanations with AIM, identify their limitations, and obtain insights into their characteristics. Taking GKNs as a case study, we show how the insights obtained from AIM can be used to develop an updated model, xGKN, that maintains high accuracy while demonstrating improved explainability. Our approach aims to advance the field of Explainable AI (XAI) for GNNs, providing more robust and practical solutions for understanding and improving complex models.
URL: https://openreview.net/forum?id=onZkYXI7oe
---
Title: Persistent homology for time series: a selective review
Abstract: Over the last ten years, persistent homology has been increasingly used to analyze the structure and shape of various types of data, including time series. This article is a review of persistent homology applied to (univariate or multivariate) time series data. We review 84 articles that apply methods involving persistent homology to time series data, published between 2014 and 2025 in several domains of application, such as biomedicine, industry, and economics. We introduce the main concepts of persistent homology, give an overview of the application fields and tasks, and propose a general framework to describe the main characteristics of all the methods.
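The standard pipeline the review covers can be sketched in a few lines: a Takens delay embedding turns the series into a point cloud, and a persistent-homology package computes its diagrams. The parameter choices below are illustrative, and ripser is just one commonly used package.

```python
# Time series -> Takens delay embedding -> persistence diagrams. A periodic
# signal traces a loop in the embedding, showing up as a long-lived H1 feature.
import numpy as np
from ripser import ripser  # one common persistent-homology package

def delay_embed(x, dim=3, tau=5):
    """Takens sliding-window embedding: rows are (x_t, x_{t+tau}, ...)."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i : i + n] for i in range(0, dim * tau, tau)], axis=1)

t = np.linspace(0, 8 * np.pi, 400)
signal = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)

cloud = delay_embed(signal)
dgms = ripser(cloud, maxdim=1)["dgms"]  # H0 and H1 persistence diagrams
# The most persistent H1 feature captures the dominant periodicity.
print(max(death - birth for birth, death in dgms[1]))
```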
URL: https://openreview.net/forum?id=tztKO9jzBR
---
Title: Transforming Language Models into Program Interpreters via Execution Trace Chain of Thought
Abstract: Code execution reasoning (CER), the ability to predict how code executes on a given input, has become an expected aspect of language models' (LMs') coding capabilities. However, many open-source LMs perform poorly on simple code snippets and, as our observations show, exhibit limitations even on single basic operations. To enable LMs to accumulate fine-grained reasoning steps in a structured format, we propose leveraging extremely granular execution traces as chain-of-thought rationales. Specifically, we introduce a fine-tuning method called ET-CoT (Execution Trace Chain of Thought), which leverages execution traces generated by our custom code interpreter and characterized by sub-line-level, thorough expansion of all expressions, going beyond merely logging intermediate variables. After fine-tuning on 127k examples, ET-CoT consistently improves CER performance across models and benchmarks, for instance with Qwen2.5-7B-Instruct outperforming its official Coder model. In addition, our custom tests show improved accuracy on repeated application of simple operations. Overall, ET-CoT serves as a unique approach that provides strong baselines and insights for improving CER performance.
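To illustrate what an extremely granular trace might look like, here is a tiny snippet with a trace-style rationale written as comments; the format is our illustration, not the authors' interpreter output.

```python
# A tiny snippet and a trace-style rationale of the kind ET-CoT trains on.
# The sub-line-level format below is our illustration; the authors' custom
# interpreter defines its own trace syntax.
#
#   code:   y = (a + 1) * b        with a = 2, b = 3
#
#   trace:  a           -> 2
#           a + 1       -> 2 + 1 -> 3
#           b           -> 3
#           (a + 1) * b -> 3 * 3 -> 9
#           y           =  9
#
# Every sub-expression is expanded, not just the final variable assignment;
# that is the "beyond merely logging intermediate variables" distinction.
a, b = 2, 3
y = (a + 1) * b
print(y)  # 9, matching the trace's final step
```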
URL: https://openreview.net/forum?id=pOg7iub4Pz
---
Title: A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners
Abstract: Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.
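A typical instrument for finding (a) is a linear probe on hidden states; the sketch below uses synthetic activations as stand-ins, whereas the paper probes a fine-tuned LLM's actual representations.

```python
# Linear probe for "is this action valid?" on hidden states. The synthetic
# activations and labels below are stand-ins for LLM activations over
# (state, action) pairs from planning problems.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256
hidden = rng.normal(size=(n, d))   # stand-in LLM activations
w_true = rng.normal(size=d)        # pretend "validity direction"
valid = (hidden @ w_true + 0.5 * rng.normal(size=n)) > 0   # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(hidden, valid, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# High held-out accuracy would indicate action validity is linearly decodable.
print(probe.score(X_te, y_te))
```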
URL: https://openreview.net/forum?id=zEIpt5UsHM
---