Daily TMLR digest for Nov 06, 2025


TMLR

Nov 6, 2025, 12:30:07 AM
to tmlr-anno...@googlegroups.com


New certifications
==================

Expert Certification: Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization

Abdullah Akgül, Gulcin Baykal, Manuel Haussmann, Melih Kandemir

https://openreview.net/forum?id=KTfTwxsVNE

---


Accepted papers
===============


Title: Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Authors: Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su

Abstract: Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Overly relying on test-time search also hurts efficiency. We advocate model-based planning for web agents that employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by: (1) Proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; (2) Training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive, while being - times more efficient, with tree search in sandbox environments (VisualWebArena) and also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparably to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments. All code, models, and data are publicly available at https://github.com/OSU-NLP-Group/WebDreamer

URL: https://openreview.net/forum?id=c6l7yA0HSq

---

Title: DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Authors: Yatong Bai, Jonah Casebeer, Somayeh Sojoudi, Nicholas J. Bryan

Abstract: We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modal encoders such as CLAP are used, the reference may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them with the reward function to construct a positive demonstration set and a negative set, and leverages the contrast between the two finite sets to approximate distributional reward optimization. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Example generations can be found at https://ml-dragon.github.io/web/.

URL: https://openreview.net/forum?id=gobhDku03J

---

Title: Choose Your Model Size: Any Compression of Large Language Models Without Re-Computation

Authors: Martin Genzel, Patrick Putzky, Pengfei Zhao, Sebastian Schulze, Mattes Mollenhauer, Robert Seidel, Stefan Dietzel, Thomas Wollmann

Abstract: The adoption of Foundation Models in resource-constrained environments remains challenging due to their large size and inference costs. A promising way to overcome these limitations is post-training compression, which aims to balance reduced model size against performance degradation. This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off from a single stochastic gradient descent run. To achieve parameter efficiency, we use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty. Importantly, the pruning order of the parameters is used to derive a global score map that allows compressing a model to any target size without re-computation. We evaluate ACIP on a large selection of open-weight LLMs and downstream tasks, demonstrating state-of-the-art results compared to existing factorization-based compression methods. We also show that ACIP seamlessly complements common quantization-based compression techniques.

URL: https://openreview.net/forum?id=Y6hdYf8tsg
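
The abstract above mentions an SVD reparametrization of linear layers whose singular values are pruned. As a rough illustration only, the sketch below shows plain truncated SVD of a single weight matrix, which is the generic building block and not the ACIP procedure itself (no penalty, no score map, no training). The function name `truncate_linear_layer`, the matrix shape, and the rank are hypothetical choices made for this example.

```python
import numpy as np

def truncate_linear_layer(weight, rank):
    """Factor weight (out x in) into A (out x rank) and B (rank x in) via truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # scale the kept left singular vectors by their singular values
    b = vt[:rank, :]             # kept right singular vectors
    return a, b

rng = np.random.default_rng(0)
# Hypothetical 64x32 weight built to be exactly rank 4, so truncating at rank 4 loses nothing.
weight = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))
a, b = truncate_linear_layer(weight, rank=4)

params_before = weight.size          # 64 * 32
params_after = a.size + b.size       # 64 * 4 + 4 * 32
max_error = float(np.max(np.abs(weight - a @ b)))
print(params_before, params_after, max_error)
```

On real layers the singular values decay gradually rather than dropping to zero, so choosing which ones to discard is where a method such as the one in the abstract would differ from this sketch.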

---

Title: Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization

Authors: Abdullah Akgül, Gulcin Baykal, Manuel Haussmann, Melih Kandemir

Abstract: Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. The time-dependency of the state transition dynamics aggravates the notorious stability problems of model-free deep actor-critic architectures. We posit that two properties will play a key role in overcoming non-stationarity in transition dynamics: (i) preserving the plasticity of the critic network and (ii) directed exploration for rapid adaptation to changing dynamics. We show that performing on-policy reinforcement learning with an evidential critic provides both. The evidential design ensures a fast and accurate approximation of the uncertainty around the state value, which maintains the plasticity of the critic network by detecting the distributional shifts caused by changes in dynamics. The probabilistic critic also makes the actor training objective a random variable, enabling the use of directed exploration approaches as a by-product. We name the resulting algorithm \emph{Evidential Proximal Policy Optimization (EPPO)} due to the integral role of evidential uncertainty quantification in both policy evaluation and policy improvement stages. Through experiments on non-stationary continuous control tasks, where the environment dynamics change at regular intervals, we demonstrate that our algorithm outperforms state-of-the-art on-policy reinforcement learning variants in both task-specific and overall return.

URL: https://openreview.net/forum?id=KTfTwxsVNE

---

Title: RANa: Retrieval-Augmented Navigation

Authors: Gianluca Monaci, Rafael S. Rezende, Romain Deffayet, Gabriela Csurka, Guillaume Bono, Hervé Déjean, Stéphane Clinchant, Christian Wolf

Abstract: Methods for navigation based on large-scale learning typically treat each episode as a new problem, where the agent is spawned with a clean memory in an unknown environment. While these generalization capabilities to an unknown environment are extremely important, we claim that, in a realistic setting, an agent should have the capacity of exploiting information collected during earlier robot operations. We address this by introducing a new retrieval-augmented agent, trained with RL, capable of querying a database collected from previous episodes in the same environment and learning how to integrate this additional context information. We introduce a unique agent architecture for the general navigation task, evaluated on ImageNav, Instance-ImageNav and ObjectNav. Our retrieval and context encoding methods are data-driven and employ vision foundation models (FM) for both semantic and geometric understanding. We propose new benchmarks for these settings and we show that retrieval allows zero-shot transfer across tasks and environments while significantly improving performance.

URL: https://openreview.net/forum?id=OWCJ5JfsRB

---

Title: SCNode: Spatial and Contextual Coordinates for Graph Representation Learning

Authors: Md Joshem Uddin, Astrit Tola, Varin Singh Sikand, Cuneyt Gurcan Akcora, Baris Coskunuzer

Abstract: Effective node representation lies at the heart of Graph Neural Networks (GNNs), as it directly impacts their ability to perform downstream tasks such as node classification and link prediction. Most existing GNNs, particularly message-passing graph neural networks (MPGNNs), rely on neighborhood aggregation to iteratively compute node embeddings. While powerful, this paradigm suffers from well-known limitations of oversquashing, oversmoothing, and underreaching that degrade representation quality. More critically, MPGNNs often assume homophily, where connected nodes share similar features or labels, leading to poor generalization in heterophilic graphs where this assumption breaks down.

To address these challenges, we propose *SCNode*, a *Spatial-Contextual Node Embedding* framework designed to perform consistently well in both homophilic and heterophilic settings. SCNode integrates spatial and contextual information, yielding node embeddings that are not only more discriminative but also structurally aware. Our approach introduces new homophily matrices for understanding class interactions and tendencies. Extensive experiments on benchmark datasets show that SCNode achieves superior performance over conventional GNN models, demonstrating its robustness and adaptability in diverse graph structures.

URL: https://openreview.net/forum?id=wdcdKeFbfQ

---


New submissions
===============


Title: Variational Geometric Information Bottleneck: Toward a Geometric Law of Understanding

Abstract: We propose a unified \emph{information–geometric} framework that formalizes understanding in learning as a trade-off between informativeness and geometric simplicity.
An encoder $\phi$ is evaluated by the utility
\[
U(\phi)=I(\phi(X);Y)-\beta\,\mathcal{C}(\phi),
\]
where $I(\phi(X);Y)$ measures task-relevant information and $\mathcal{C}(\phi)$ penalizes curvature and intrinsic dimensionality, promoting smooth, low-complexity manifolds.
Under standard manifold and regularity conditions, we establish non-asymptotic generalization bounds showing that generalization error scales with intrinsic dimension and curvature acts as a stabilizing capacity term linking geometry to sample efficiency.

To operationalize the theory, we introduce the \emph{Variational Geometric Information Bottleneck} (\texttt{V-GIB}), a variational estimator that unifies mutual-information compression with curvature regularization via tractable geometric proxies (Hutchinson trace, Jacobian-norm, and local PCA estimators).

Across synthetic manifolds, few-shot tasks, and real-world datasets (Fashion-MNIST, CIFAR-10), \texttt{V-GIB} exhibits a consistent information–geometry Pareto frontier, estimator stability, and substantial gains in interpretive efficiency.
Fractional-data experiments on CIFAR-10 further confirm the predicted \emph{efficiency–curvature law}, that curvature-aware encoders maintain accuracy under severe data scarcity.

Overall, \texttt{V-GIB} offers a principled and measurable route to representations that are geometrically coherent, data-efficient, and aligned with human-interpretable structure, providing empirical and theoretical evidence for a geometric law of understanding in learning systems.

URL: https://openreview.net/forum?id=D2s86BPSnV
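
The abstract above lists a Hutchinson trace estimator among its geometric proxies. How V-GIB applies it is not stated in the digest, so the following is only a minimal sketch of the generic estimator: it approximates tr(A) by averaging z^T A z over random Rademacher vectors z, using matrix-vector products alone. The test matrix, the dimension, and the number of probes are assumptions chosen for illustration.

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_probes, rng):
    """Estimate tr(A) as the average of z^T (A z) over random Rademacher vectors z."""
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)   # entries are +1 or -1 with equal probability
        total += float(z @ matvec(z))
    return total / num_probes

rng = np.random.default_rng(0)
dim = 50
m = rng.standard_normal((dim, dim))
a = m @ m.T / dim                               # hypothetical symmetric PSD test matrix
estimate = hutchinson_trace(lambda v: a @ v, dim, num_probes=2000, rng=rng)
exact = float(np.trace(a))
print(estimate, exact)
```

The estimator only needs a `matvec` callable, which is why it is commonly paired with Jacobian-vector or Hessian-vector products when the full matrix is too large to form.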

---

Title: Theoretically Understanding Data Reconstruction Leakage in Federated Learning

Abstract: Federated learning (FL) is an emerging collaborative learning paradigm that aims to protect data privacy. Unfortunately, recent works show that FL algorithms are vulnerable to serious data reconstruction attacks. However, existing works lack a theoretical foundation on the extent to which the devices' data can be reconstructed, and the effectiveness of these attacks cannot be compared fairly due to their unstable performance. To address this deficiency, we propose a theoretical framework to understand data reconstruction attacks on FL. Our framework involves bounding the data reconstruction error, and an attack's error bound reflects its inherent attack effectiveness. Under the framework, we can theoretically compare the effectiveness of existing attacks. For instance, our results on multiple datasets validate that the iDLG attack inherently outperforms the DLG attack.

URL: https://openreview.net/forum?id=1UfDXeYxwk

---

Title: Bayesian Network Structure Discovery Using Large Language Models

Abstract: Understanding probabilistic relationships among variables is crucial for analyzing complex systems. Traditional structure learning methods often require extensive observational data and incur high computational costs. Recent studies have explored using large language models (LLMs) for structure learning, but most treat LLMs as auxiliary tools for pre-processing or post-processing, leaving the core learning process data-driven. In this work, we propose a unified framework for Bayesian network structure discovery that places LLMs at the center, supporting both data-free and data-aware settings. In the data-free case, we introduce \textbf{PromptBN} to query LLMs with metadata and efficiently uncover valid probabilistic relationships. When observational data are available, we introduce \textbf{ReActBN}, which integrates the ReAct reasoning paradigm with structure scores such as the Bayesian Information Criterion (BIC) for iterative refinement. Unlike prior methods that offload refinement to external algorithms, our framework maintains the LLM actively in the loop throughout the discovery process. Experiments demonstrate that our method significantly outperforms both existing LLM-based approaches and traditional data-driven algorithms, particularly in the low- or no-data scenario. Code will be publicly available upon publication.

URL: https://openreview.net/forum?id=G4mrO8LVix

---

Title: A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

Abstract: Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer.
This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates.
The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes.
Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (with an accompanying repository at \url{https://anonymous.4open.science/r/Time-Series-Reasoning-Survey-TMLR/}).
Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets.
We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings.
Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

URL: https://openreview.net/forum?id=mgMJ8ksKKA

---

Title: TextOCVP: Object-Centric Video Prediction with Language Guidance

Abstract: Understanding and forecasting future scene states is critical for autonomous agents to plan and act effectively in complex environments. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and predicting future scene states, but often struggle to scale beyond simple synthetic datasets and to integrate external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for video prediction guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, enabling accurate and controllable predictions. TextOCVP’s structured latent space offers a more precise control of the forecasting process, outperforming several video prediction baselines on two datasets. Additionally, we show that structured object-centric representations provide superior robustness to novel scene configurations, as well as improved controllability and interpretability, enabling more precise and understandable predictions. Code will be open-sourced upon acceptance.

URL: https://openreview.net/forum?id=7JEgXCyQgX

---

Title: Provable Domain Adaptation for Offline Reinforcement Learning with Limited Samples

Abstract: Offline reinforcement learning (RL) learns effective policies from a static target dataset. Despite the strong performance of state-of-the-art offline RL algorithms, that performance relies on the size of the target dataset and degrades when only limited samples are available, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. However, establishing the optimal way to trade off the limited target dataset and the large-but-biased source dataset while ensuring provable theoretical guarantees remains an open challenge. To the best of our knowledge, this paper proposes the first framework that theoretically explores the impact of the weights assigned to each dataset on the performance of offline RL. In particular, we establish performance bounds and the existence of the optimal weight, which can be computed in closed form under simplifying assumptions. We also provide algorithmic guarantees in terms of convergence to a neighborhood of the optimum. Notably, these results depend on the quality of the source dataset and the number of samples in the target dataset. Our empirical results on the well-known offline Procgen benchmark substantiate the theoretical contributions in this work.

URL: https://openreview.net/forum?id=xog8ThcXwy

---

Title: From Weighting to Modeling: A Nonparametric Estimator for Off-Policy Evaluation

Abstract: We study off-policy evaluation in the setting of contextual bandits, where we aim to evaluate a new policy using historical data that consists of contexts, actions and received rewards. This historical data typically does not faithfully represent the action distribution of the new policy. A common approach, inverse probability weighting (IPW), adjusts for these discrepancies in action distributions.
However, this method often suffers from high variance due to the probability being in the denominator.
The doubly robust (DR) estimator reduces variance through modeling reward but does not directly address variance from IPW.
In this work, we address the limitation of IPW by proposing a Nonparametric Weighting (NW) approach that constructs weights using a nonparametric model. Our NW approach achieves low bias like IPW but typically exhibits significantly lower variance.
To further reduce variance, we incorporate reward predictions — similar to the DR technique — resulting in the Model-assisted Nonparametric Weighting (MNW) approach. We show that MNW yields accurate value estimates when either the reward model or the behavior policy model is well specified. Extensive empirical comparisons show that our approaches consistently outperform existing techniques, achieving lower variance in value estimation while maintaining low bias.

URL: https://openreview.net/forum?id=RW6PY0AU3w
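
For readers unfamiliar with the two baselines named in the abstract above, here is a minimal sketch of the standard IPW and doubly robust (DR) value estimators for contextual bandits. It does not implement the paper's NW or MNW estimators, whose details are not given in the digest. All array names and the toy numbers are hypothetical; `q_logged` and `v_target` stand in for the outputs of some reward model.

```python
import numpy as np

def ipw_estimate(rewards, target_probs, behavior_probs):
    """IPW: mean of w_i * r_i, where w_i = pi(a_i | x_i) / mu(a_i | x_i)."""
    weights = target_probs / behavior_probs
    return float(np.mean(weights * rewards))

def dr_estimate(rewards, target_probs, behavior_probs, q_logged, v_target):
    """Doubly robust: model-based value plus an IPW correction on the model's residual.

    q_logged[i]: reward-model prediction for the logged action a_i in context x_i
    v_target[i]: expected reward-model prediction under the new policy in context x_i
    """
    weights = target_probs / behavior_probs
    return float(np.mean(v_target + weights * (rewards - q_logged)))

# Toy logged data (hypothetical numbers): four rounds of a contextual bandit.
rewards = np.array([1.0, 0.0, 1.0, 1.0])
behavior_probs = np.array([0.5, 0.5, 0.25, 0.25])   # mu(a_i | x_i), the logging policy
target_probs = np.array([0.5, 0.25, 0.5, 0.25])     # pi(a_i | x_i), the policy being evaluated
q_logged = np.array([0.8, 0.2, 0.9, 0.7])
v_target = np.array([0.6, 0.5, 0.7, 0.6])

ipw_value = ipw_estimate(rewards, target_probs, behavior_probs)
dr_value = dr_estimate(rewards, target_probs, behavior_probs, q_logged, v_target)
print(ipw_value, dr_value)
```

Because `behavior_probs` sits in the denominator of the weights, a logged action with a small logging probability can dominate the average, which is the variance problem the abstract describes.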

---

Title: MissNODAG: Differentiable Learning of Cyclic Causal Graphs from Incomplete Data

Abstract: Causal discovery in real-world systems, such as biological networks, is often complicated by feedback loops and incomplete data. Standard algorithms, which assume acyclic structures or fully observed data, struggle with these challenges. To address this gap, we propose MissNODAG, a differentiable framework for learning both the underlying cyclic causal graph and the missingness mechanism from partially observed data, including data *missing not at random*. Our framework integrates an additive noise model with an expectation-maximization procedure, alternating between imputing missing values and optimizing the observed data likelihood, to uncover both the cyclic structures and the missingness mechanism. We demonstrate the effectiveness of MissNODAG through synthetic experiments and an application to real-world gene perturbation data.

URL: https://openreview.net/forum?id=nNZXQ3Q0GP

---

Title: Auditing Predictive Models for Intersectional Biases

Abstract: Predictive models that satisfy group fairness criteria in aggregate for members of a protected class, but do not guarantee subgroup fairness, could produce biased predictions for individuals at the intersection of two or more protected classes. To address this risk, we propose Conditional Bias Scan (CBS), an auditing framework for detecting intersectional biases in the outputs of classification models that may lead to disparate impact. CBS identifies the subgroup with the most significant bias against the protected class, compared to the equivalent subgroup in the non-protected class. The framework can audit for predictive biases using common group fairness definitions (separation and sufficiency) for both probabilistic and binarized predictions. We show through empirical evaluations that this methodology has significantly higher bias detection power compared to similar methods that audit for subgroup fairness. We then use this approach to detect statistically significant intersectional biases in the predictions of the COMPAS pre-trial risk assessment tool and a model trained on the German Credit data.

URL: https://openreview.net/forum?id=1JTnlHMSmO

---

Title: Re:Form --- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Abstract: Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models can hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, providing such priors to supervise complex programming tasks becomes prohibitively costly. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our approach relies mainly on an automatic and scalable data curation pipeline and on careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark. Anonymized code and models are available at https://github.com/ReFormDafny/ReForm and https://huggingface.co/ReFormDafny.

URL: https://openreview.net/forum?id=cAQmIS4GOe

---

Title: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Abstract: The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM-RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov decision processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.

URL: https://openreview.net/forum?id=RY19y2RI1O

---

Title: Modular Diffusion Policy Training: Decoupling and Recombining Guidance and Diffusion for Offline RL

Abstract: In classifier-free diffusion (CFD), the diffusion model and its guidance are typically learned jointly and applied jointly in the inference stage. Before the guidance has converged, it provides unstable or even misleading gradients, which leads to inefficiency and instability during the early stage of training. Such strict coupling not only leads to self-enforcing variance and biased errors but also prevents the guidance module from being reused across different diffusion models. We propose Guidance-First Diffusion Training (GFDT), which pretrains and freezes the guidance model before diffusion policy learning. GFDT reduces peak memory and computation by 38.1%, decreases diffusion training by 65.6% and 27.66%, and achieves up to 43.16% and 60.98% performance improvements on offline RL benchmarks. Beyond efficiency, we uncover a strong plug-and-play property: replacing the guidance module only at inference time can substantially improve stability. Cross-algorithm swaps (e.g., Implicit Q-Learning (IDQL) guidance for Diffusion Q-Learning (DQL) policies) perform comparably to the stronger of the two, despite never being co-trained. Our theoretical analysis shows that GFDT enables convergence to an optimal guidance and proves that it speeds up training. We also prove that plug-and-play remains valid as long as the guidance and the diffusion model are trained with the same data distribution. Limitations arising from dataset mismatch are analyzed in detail, which further underscores the necessity of distributional alignment. This work opens a new line of research by treating diffusion and guidance as modular units that can be recombined, rather than as a monolithic process, suggesting a paradigm that may guide the future development of diffusion-based reinforcement learning.

URL: https://openreview.net/forum?id=KJSvZwdPFd

---

Title: Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention

Abstract: Foundational vision-language models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test-Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple teacher-student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.

URL: https://openreview.net/forum?id=GFrHdXzZwo

---

Title: Policy Learning with a Language Bottleneck

Abstract: Modern AI systems such as self-driving cars and game-playing agents achieve superhuman performance. But they often lack human-like generalization, interpretability, and interoperability with human users. This paper introduces *Policy Learning with a Language Bottleneck* (PLLB), a framework enabling AI agents to generate linguistic rules that capture the high-level strategies underlying rewarding behaviors. PLLB alternates between a *rule generation* step guided by language models, and an *update* step where agents learn new policies guided by rules. Crucially, PLLB enables this kind of language-guided learning even when a natural language rule is insufficient to completely describe the target policy. Across five diverse tasks, including a two-player signaling game, maze navigation, image reconstruction, and robot grasp planning, we show that PLLB learns more interpretable and generalizable behaviors than standard policy learning methods. In three additional human subject studies, we show that the learned rules significantly improve human task performance, enabling more effective human-AI coordination.

URL: https://openreview.net/forum?id=sK8uEqzQPv

---

Title: Reasoning-Driven Synthetic Data Generation and Evaluation

Abstract: Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution – limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process, enabling fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.

URL: https://openreview.net/forum?id=NALsdGEPhB

---
