Accepted papers
===============
Title: Universal Black-Box Targeted Reward Poisoning Attack Against Online Deep Reinforcement Learning
Authors: Yinglun Xu, Gagandeep Singh
Abstract: This work proposes the first universal black-box targeted attack against online reinforcement learning through reward poisoning during training. Our attack is universally efficient against any efficient learning algorithm training in general RL environments and requires limited attack budgets and computational resources. We generalize a common feature of efficient learning algorithms and assume that such algorithms mostly take the optimal actions, or actions close to them, during training. We quantify the efficiency of an attack and propose an attack framework in which the efficiency of any attack instance can be evaluated under this assumption. Finally, we identify an instance in the framework that requires a minimal per-step perturbation, which we call the 'adaptive target attack.' We theoretically analyze our attack and prove a lower bound on its efficiency in the general RL setting. Empirically, on a diverse set of popular DRL environments learned by state-of-the-art DRL algorithms, we verify that our attack efficiently leads the learning agent to various target policies with limited budgets.
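A minimal sketch of the targeted reward-poisoning idea described above, under the simplifying assumption of a fixed two-level perturbation rule; the function names and the constant penalty `delta` are illustrative, not the paper's adaptive construction:

```python
def poison_reward(state, action, reward, target_policy, delta=1.0):
    """Sketch of a targeted reward-poisoning rule: rewards for the target
    action are left intact, while rewards for every other action are pushed
    down by a per-step perturbation bounded by `delta`.
    `target_policy(state)` is assumed to return the attacker's desired action."""
    if action == target_policy(state):
        return reward            # target action: no perturbation needed
    return reward - delta        # off-target action: bounded penalty

# Toy usage: drive a learner toward always picking action 0.
target_policy = lambda s: 0
print(poison_reward(state=3, action=2, reward=1.0, target_policy=target_policy))
```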
URL: https://openreview.net/forum?id=MX0aDKu8lY
---
New submissions
===============
Title: Lifelong Learning of Video Diffusion Models From a Single Video Stream
Abstract: This work demonstrates that training autoregressive video diffusion models from a single video stream—resembling the experience of embodied agents—is not only possible, but can also be as effective as standard offline training given the same number of gradient steps. Our work further reveals that this main result can be achieved using experience replay methods that only retain a subset of the preceding video stream. To support training and evaluation in this setting, we introduce four new datasets for streaming lifelong generative video modeling: Lifelong Bouncing Balls, Lifelong 3D Maze, Lifelong Drive, and Lifelong PLAICraft, each consisting of one million consecutive frames from environments of increasing complexity. Together, our datasets and investigations lay the groundwork for video generative models and world models that continuously learn from single-sensor video streams rather than from fixed, curated video datasets.
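To illustrate the experience-replay setting the abstract refers to, here is a minimal reservoir-sampling replay buffer that retains only a fixed-size subset of the preceding stream; the capacity and mixing ratio are illustrative choices, not the paper's settings:

```python
import random

class ReservoirReplay:
    """Minimal sketch of experience replay over a video stream: keep a
    fixed-size uniform subsample of all clips seen so far (reservoir sampling),
    so each training batch can mix fresh and replayed clips."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, clip):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(clip)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = clip

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

# Per step: train on the newest clip plus a few replayed ones.
replay = ReservoirReplay(capacity=1000)
for clip in range(10_000):                 # stand-in for streamed video clips
    replay.add(clip)
    batch = [clip] + replay.sample(3)
    # train_step(diffusion_model, batch)   # hypothetical training call
```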
URL: https://openreview.net/forum?id=xbvfqMzoOL
---
Title: GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data
Abstract: Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades because symbolic components provide abstraction while neural components provide generalization. Thus, a marriage of the two components can lead to rapid advancements in AI. Yet, the field has not realized this promise since most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of purely neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side. However, automatically deriving reliable KGs from text corpora has remained an open problem. We address the above challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. Together, GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. More concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When an off-the-shelf large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on the reliability front due to prompt sensitivity, shallow domain expertise, and hallucinated relations. Thus, practitioners should avoid employing LLM-generated KGs in high-stakes domains, e.g., medicine, law, business, education, etc. On text obtained from PubMed papers related to diabetes, our KG extraction pipeline with a small 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only a 40.2% FActScore. The GraphMERT-extracted KG also achieves a significantly higher ValidityScore of 68.7%, compared to an LLM-generated baseline (43.0%), demonstrating its ability to preserve ontology alignment. KG cleaning further improves factuality, with GraphMERT reaching 76.9% FActScore, compared to 55.6% for the LLM baseline. GraphMERT can then treat the augmented KG as the seed KG and refine it further. Finally, human experts can edit and audit the extracted KGs, further increasing their reliability. This is nearly impossible with purely neural representations. Hence, GraphMERT enables efficient, scalable, transparent (interpretable and explainable), attributable (with provenance), accountable (with governance), editable, auditable, and continually improvable state-of-the-art neurosymbolic AI.
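As a rough illustration of the two reliability criteria the abstract names (factuality with provenance, ontology-consistent validity), here is a toy triple-with-provenance structure and relation check; the schema and relation set are hypothetical, not GraphMERT's actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class Triple:
    """Toy KG triple with provenance: the source field records where the
    triple was extracted from (factuality), and is_valid checks the relation
    against a domain ontology fragment (validity)."""
    head: str
    relation: str
    tail: str
    source: str   # provenance: document / sentence the triple was extracted from

# Hypothetical ontology fragment: relations allowed in the domain.
ALLOWED_RELATIONS = {"treats", "causes", "increases_risk_of"}

def is_valid(triple: Triple) -> bool:
    return triple.relation in ALLOWED_RELATIONS

t = Triple("metformin", "treats", "type 2 diabetes", source="PMID:12345, sentence 4")
print(is_valid(t))   # True: relation is ontology-consistent and provenance is recorded
```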
URL: https://openreview.net/forum?id=tnXSdDhvqc
---
Title: COLT: Enhancing Video Large Language Models with Continual Tool Usage
Abstract: The success of Large Language Models (LLMs) has significantly propelled research on video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool-usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction-tuning paradigm for tool-use finetuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability from a successive tool stream without suffering "catastrophic forgetting" of previously learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Relevant tools are then dynamically selected based on the similarity between user instructions and tool features within the codebook. To unleash the tool-usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset, VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
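A small sketch of the codebook-based tool selection the abstract describes, assuming cosine similarity between an instruction embedding and learned tool embeddings; the embeddings and tool names here are random stand-ins, not COLT's trained codebook:

```python
import numpy as np

def select_tools(instruction_emb, tool_codebook, top_k=2):
    """Score each tool embedding in the codebook against the user-instruction
    embedding by cosine similarity and return the top-k tool names."""
    names = list(tool_codebook)
    feats = np.stack([tool_codebook[n] for n in names])
    q = instruction_emb / np.linalg.norm(instruction_emb)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    scores = f @ q
    order = np.argsort(-scores)[:top_k]
    return [names[i] for i in order]

rng = np.random.default_rng(0)
codebook = {name: rng.normal(size=64) for name in ["detector", "tracker", "captioner"]}
print(select_tools(rng.normal(size=64), codebook))
```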
URL: https://openreview.net/forum?id=NT9tHHTlXn
---
Title: Cost-Free Personalization via Information-Geometric Projection in Bayesian Federated Learning
Abstract: Bayesian Federated Learning (BFL) combines uncertainty modeling with decentralized training, enabling the development of personalized and reliable models in the presence of data heterogeneity and privacy constraints. Existing approaches typically rely on Markov Chain Monte Carlo (MCMC) sampling or variational inference, often incorporating personalization mechanisms to better adapt to the local data distributions. In this work, we propose an information-geometric projection framework for personalization in parametric BFL. By projecting the global model onto a neighborhood of the user's local model, our method enables a tunable trade-off between global generalization and local specialization. Under mild assumptions, we show that this projection step is equivalent to computing a barycenter in the statistical manifold, allowing us to derive closed-form solutions and achieve cost-free personalization. We apply the proposed approach within a variational learning setup using the Improved Variational Online Newton (IVON) optimizer and extend it to general aggregation schemes in BFL. Empirical evaluations under heterogeneous data distributions confirm that our method effectively balances global and local performance with minimal computational overhead.
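For intuition on the closed-form projection, here is a toy barycenter-style interpolation between a global and a local Gaussian posterior in natural-parameter space; the scalar, diagonal-Gaussian setting and the mixing weight `lam` are illustrative assumptions, not the paper's exact formulation:

```python
def personalize_gaussian(mu_g, var_g, mu_l, var_l, lam=0.5):
    """Convex combination of Gaussian natural parameters (precision and
    precision-weighted mean), i.e. a barycenter in the exponential-family
    geometry. lam=1 recovers the global posterior, lam=0 the local one."""
    prec = lam / var_g + (1 - lam) / var_l
    mean = (lam * mu_g / var_g + (1 - lam) * mu_l / var_l) / prec
    return mean, 1.0 / prec

mu, var = personalize_gaussian(mu_g=0.0, var_g=1.0, mu_l=2.0, var_l=0.5, lam=0.3)
print(mu, var)   # personalized mean and variance between global and local models
```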
URL: https://openreview.net/forum?id=9y0jCrxjDR
---
Title: Model-diff: A Tool for Comparative Study of Language Models in the Input Space
Abstract: Comparing whether two large language models (LMs) make similar predictions -- e.g., as measured by perplexity -- across massive input spaces is crucial for real-world applications. Traditional analyses average benchmark scores over fixed datasets, masking per-input differences. We propose Model-diff, a framework that estimates the distribution of prediction differences between two LMs across a large, meaningful input space -- defined as the set of token sequences assigned low negative log-likelihood (NLL). Model-diff leverages sampling-based histogram statistics to efficiently quantify output differences without exhaustive enumeration. Experiments reveal, for the first time, quantitative divergences between LMs in their low-NLL regions, providing a scalable tool for model comparison and diagnostic analysis.
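A toy sketch of the sampling-based histogram idea, with stand-in samplers and scorers in place of real LMs; the sampler, scoring functions, and bin count are illustrative assumptions:

```python
import numpy as np

def diff_histogram(sample_input, nll_a, nll_b, n=1000, bins=20):
    """Sample inputs from the region of interest (here, a stand-in for low-NLL
    sequences), score each under both models, and histogram the per-input NLL
    differences rather than reporting a single averaged score."""
    diffs = [nll_a(x) - nll_b(x) for x in (sample_input() for _ in range(n))]
    return np.histogram(diffs, bins=bins)

# Toy stand-ins: random "sequences" and noisy scoring functions.
rng = np.random.default_rng(0)
sampler = lambda: rng.integers(0, 100, size=16)
nll_a = lambda x: float(np.sum(x) % 7) + rng.normal(scale=0.1)
nll_b = lambda x: float(np.sum(x) % 7) + rng.normal(scale=0.1) + 0.5
counts, edges = diff_histogram(sampler, nll_a, nll_b)
print(counts)
```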
URL: https://openreview.net/forum?id=gGZi3blMWA
---
Title: Calibration Enhanced Decision Maker: Towards Trustworthy Sequential Decision-Making with Large Sequence Models
Abstract: Offline deep reinforcement learning (offline DRL) has attracted considerable attention across various domains due to its ability to learn effective policies without direct environmental interaction. Although highly effective, the trustworthiness of the learned agent remains a paramount concern within the community. Offline DRL can be categorized into three principal paradigms: model-based algorithms, model-free algorithms, and trajectory optimization. While extant research predominantly concentrates on calibration enhancement for model-based and model-free algorithms, calibration of trajectory optimization remains a comparatively underexplored avenue of investigation. In this paper, we pioneer the concept of Expected Agent Calibration Error (EACE), a novel metric designed to assess agent calibration, and rigorously prove its theoretical relationship to the state-action marginal distribution distance. Subsequently, we introduce the Calibration Enhanced Decision Maker (CEDM), which employs a binning executor to process feature distribution histograms as input for the large sequence model, thereby minimizing the state-action marginal distribution distance and enhancing the agent's calibration. A series of in-depth case studies examines CEDM, with its application demonstrated across Decision Transformer, Decision ConvFormer, and Decision Mamba. Empirical results substantiate the robustness of EACE and demonstrate the effectiveness of CEDM in enhancing agent calibration, thereby offering valuable insights for future research on trustworthy sequential decision-making.
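EACE itself is defined in the paper over state-action marginals and is not reproduced here; purely as a reference point for the binning idea, the standard binned expected calibration error can be computed as follows:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: group predictions by confidence, then average the
    per-bin gap between mean confidence and empirical accuracy, weighted by
    bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
```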
URL: https://openreview.net/forum?id=b6WcxPEb48
---
Title: A Monte Carlo Framework for Calibrated Uncertainty Estimation in Sequence Prediction
Abstract: Probabilistic prediction of sequences from images and other high-dimensional data remains a key challenge, particularly in safety-critical domains. In these settings, it is often desirable to quantify the uncertainty associated with the prediction (instead of just determining the most likely sequence, as in language modeling). In this paper, we consider a Monte Carlo framework to estimate probabilities and confidence intervals associated with sequences. The framework uses a Monte Carlo simulator, implemented as an autoregressively trained neural network, to sample sequences conditioned on an image input. We then use these samples to estimate probabilities and confidence intervals. Experiments on synthetic and real data show that the framework produces accurate discriminative predictions, but can suffer from miscalibration. To address this shortcoming, we propose a time-dependent regularization method, which produces calibrated predictions.
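A minimal sketch of the Monte Carlo recipe: sample sequences from a (here, stand-in) conditional sampler and report an event probability with a normal-approximation confidence interval; the sampler, event, and interval construction are illustrative assumptions:

```python
import math
import random

def mc_event_probability(sample_sequence, event, n=1000, z=1.96):
    """Draw n sequences from the sampler, estimate the probability that the
    event of interest holds, and attach a normal-approximation confidence
    interval to that estimate."""
    hits = sum(event(sample_sequence()) for _ in range(n))
    p = hits / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half), min(1.0, p + half))

sampler = lambda: [random.random() for _ in range(10)]   # stand-in for model samples
event = lambda seq: max(seq) > 0.9                       # "sequence exceeds a threshold"
print(mc_event_probability(sampler, event))
```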
URL: https://openreview.net/forum?id=sJE59flFC1
---
Title: Adapting to Any Bit-Width: Channel-Wise Mixed-Precision Quantization for LLMs
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ employs a non-uniform quantization strategy and incorporates two outlier-extraction techniques that collaboratively preserve critical information, thereby minimizing quantization loss. Experiments on nine different LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains, with only a modest increase in memory usage, by operating in a mixed-precision manner. CMPQ represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
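To illustrate how a fractional average bit-width can be met channel-wise, here is a toy two-level allocation driven by per-channel activation scales; the two-level scheme and thresholds are an illustrative simplification, not CMPQ's full method:

```python
import numpy as np

def assign_channel_bits(act_scales, target_bits, low=3, high=4):
    """Give the channels with the largest activation scale the higher bit-width
    and the rest the lower one, choosing the split so the mean bit-width hits a
    (possibly fractional) budget."""
    n = len(act_scales)
    frac_high = np.clip((target_bits - low) / (high - low), 0.0, 1.0)
    n_high = int(round(frac_high * n))
    order = np.argsort(-np.asarray(act_scales))    # most sensitive channels first
    bits = np.full(n, low)
    bits[order[:n_high]] = high
    return bits

scales = np.random.default_rng(0).random(8)
bits = assign_channel_bits(scales, target_bits=3.5)
print(bits, bits.mean())   # mean bit-width matches the 3.5-bit budget
```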
URL: https://openreview.net/forum?id=1t6sEhdLxf
---
Title: Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Abstract: Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP), which builds on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines the class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via standard federated averaging. Comprehensive evaluations on the CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
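A shape-level sketch of a class-contextualized prompt mixture of the kind described above, assuming similarity-to-prototype weights modulated by client class priors; the exact weighting and shapes are illustrative, not the paper's precise formulation:

```python
import numpy as np

def mixed_prompt(feature, prototypes, class_prompts, shared_prompt, prior):
    """Weight each class-specific prompt by the sample's similarity to the
    global class prototype, fold in the client's class prior, normalize with a
    softmax, and add the globally shared prompt."""
    sims = prototypes @ feature                       # (C,) prototype similarities
    logits = sims + np.log(prior + 1e-8)              # modulate by client class prior
    w = np.exp(logits - logits.max())
    w /= w.sum()                                      # per-sample mixture weights
    return shared_prompt + w @ class_prompts          # (D,) mixed prompt

rng = np.random.default_rng(0)
C, D = 5, 16                                          # classes, prompt dimension
out = mixed_prompt(rng.normal(size=D), rng.normal(size=(C, D)),
                   rng.normal(size=(C, D)), rng.normal(size=D),
                   prior=np.full(C, 1 / C))
print(out.shape)
```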
URL: https://openreview.net/forum?id=gO1CpPRj6A
---
Title: Muon Optimizes Under Spectral Norm Constraints
Abstract: The pursuit of faster optimization algorithms remains an active and important research direction in deep learning. Recently, the Muon optimizer has demonstrated promising empirical performance, but its theoretical foundation remains less understood. In this paper, we bridge this gap and provide a theoretical analysis of Muon by placing it within the Lion-$\mathcal{K}$ family of optimizers. Specifically, we show that Muon corresponds to Lion-$\mathcal{K}$ when equipped with the nuclear norm, and we leverage the theoretical results of Lion-$\mathcal{K}$ to establish that Muon (with decoupled weight decay) implicitly solves an optimization problem that enforces a constraint on the spectral norm of weight matrices. This perspective not only demystifies the implicit regularization effects of Muon but also leads to natural generalizations through varying the choice of convex map $\mathcal{K}$, allowing for the exploration of a broader class of implicitly regularized and constrained optimization algorithms.
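For readers unfamiliar with Muon's update, the orthogonalization step it is built around, written here with an SVD for clarity (the practical optimizer approximates it with a Newton-Schulz iteration), looks like this; the learning rate and weight decay below are illustrative:

```python
import numpy as np

def msign(update):
    """Replace the momentum matrix M = U S V^T by U V^T, i.e. set all singular
    values to 1. This is the step the paper interprets through the Lion-K /
    nuclear-norm lens."""
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt

# Toy Muon-style step on a weight matrix with decoupled weight decay.
rng = np.random.default_rng(0)
W, grad_momentum = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
lr, wd = 0.02, 0.01
W = (1 - lr * wd) * W - lr * msign(grad_momentum)
print(np.linalg.svd(msign(grad_momentum), compute_uv=False))   # singular values ~ 1
```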
URL: https://openreview.net/forum?id=Blz4hjxLwU
---