Daily TMLR digest for Feb 09, 2026


TMLR

Feb 9, 2026, 12:30:08 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

Authors: Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen

Abstract: Long video understanding has emerged as a crucial capability in real-world applications
such as meeting summarization, video surveillance, educational lecture analysis, and content
moderation. However, it remains computationally prohibitive for VideoLLMs, primarily due
to two bottlenecks: 1) sequential video decoding, the process of converting the raw bit stream
to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling
of up to several million tokens for LLM inference, resulting in high latency and memory
use. To address these challenges, we propose QuickVideo, a system-algorithm co-design
that substantially accelerates long video understanding to support real-time downstream
applications. It comprises three key innovations: QuickCodec, a parallelized CPU-based
video decoder that achieves a 2–3× speedup by splitting videos into keyframe-aligned intervals
processed concurrently; QuickPrefill, a memory-efficient prefilling method that uses KV-cache
pruning to support more frames with less GPU memory; and a scheduling scheme
that overlaps CPU video decoding with GPU inference. Together, these components reduce
the time required to process a long video input by a minute, enabling fast, efficient video
understanding even on limited hardware. Experiments show that QuickVideo generalizes
across durations and sampling rates, making long video processing feasible in practice.
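
The interval-parallel decoding idea can be sketched in a few lines: split the frame range at keyframe offsets and decode the intervals concurrently. This is only a sketch; the placeholder `decode_interval` stands in for a real seek-and-decode step (e.g. via FFmpeg/PyAV), and all names are illustrative rather than taken from the paper:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a per-interval decoder: a real system would seek
# to the interval's keyframe and decode its frames (e.g. with PyAV/FFmpeg).
def decode_interval(interval):
    start, end = interval
    return list(range(start, end))  # placeholder "frames"

def quickcodec_style_decode(keyframe_offsets, total_frames, workers=4):
    """Split the video at keyframe boundaries and decode intervals concurrently."""
    bounds = list(keyframe_offsets) + [total_frames]
    intervals = list(zip(bounds[:-1], bounds[1:]))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(decode_interval, intervals))  # map preserves order
    return [frame for chunk in chunks for frame in chunk]

frames = quickcodec_style_decode([0, 30, 75], 120)
```

Because `Executor.map` yields results in submission order, the decoded frames come back in playback order even though intervals finish out of order.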

URL: https://openreview.net/forum?id=Rpcxgzcsuc

---

Title: Large Language Model-based Data Science Agent: A Survey

Authors: Ke Chen, Peiran Wang, Yaoning Yu, Xianyang Zhan, Haohan Wang

Abstract: The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents designed for data science tasks, summarizing insights from recent studies. From the agent perspective, we discuss the key design principles, covering agent roles, execution, knowledge, and reflection methods. From the data science perspective, we identify key processes for LLM-based agents, including data preprocessing, model development, evaluation, and visualization. Our work offers two key contributions: (1) a comprehensive review of recent developments in applying LLM-based agents to data science tasks; (2) a dual-perspective framework that connects general agent design principles with the practical workflows in data science.

URL: https://openreview.net/forum?id=ZT5SJQN0CS

---

Title: Model-Free Learning with Heterogeneous Dynamical Systems: A Federated LQR Approach

Authors: Han Wang, Leonardo Felipe Toso, Aritra Mitra, James Anderson

Abstract: We study a model-free federated linear quadratic regulator (LQR) problem where M agents with unknown, distinct yet similar dynamics collaboratively learn an optimal policy to minimize an average quadratic cost while keeping their data private. To exploit the similarity of the agents' dynamics, we propose to use federated learning (FL) to allow the agents to periodically communicate with a central server to train policies by leveraging a larger dataset from all the agents. With this setup, we seek to understand the following questions: (i) Is the learned common policy stabilizing for all agents? (ii) How close is the learned common policy to each agent's own optimal policy? (iii) Can each agent learn its own optimal policy faster by leveraging data from all agents? To answer these questions, we propose the federated and model-free algorithm FedLQR. Our analysis overcomes numerous technical challenges, such as heterogeneity in the agents’ dynamics, multiple local updates, and stability concerns. We show that FedLQR produces a common policy that, at each iteration, is stabilizing for all agents. Moreover, we prove that when learning each agent's optimal policy, FedLQR achieves a sample complexity reduction proportional to the number of agents M in a low-heterogeneity regime, compared to the single-agent setting.
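
The FedLQR loop itself has a simple shape: each agent runs a few local policy updates, and a server averages the resulting policies. The sketch below replaces the model-free LQR gradient estimate with the analytic gradient of a toy quadratic surrogate cost, so every constant and name here is illustrative, not from the paper:

```python
import numpy as np

# Toy surrogate: agent i's cost J_i(K) = (K - K_star[i])**2 stands in for the
# (unknown) LQR cost; real FedLQR estimates gradients from system rollouts.
K_star = np.array([1.0, 1.2, 0.9, 1.1])  # heterogeneous per-agent optima (assumed)

def local_update(K, k_star, lr=0.1, local_steps=5):
    """A few local gradient steps on the agent's surrogate cost."""
    for _ in range(local_steps):
        K = K - lr * 2.0 * (K - k_star)
    return K

def fedlqr_round(K, lr=0.1):
    """Server averages the agents' locally updated policies (FedAvg-style)."""
    return np.mean([local_update(K, ks, lr) for ks in K_star])

K = 0.0
for _ in range(50):
    K = fedlqr_round(K)
```

In this toy setting the averaged policy converges to the mean of the agents' optima, mirroring the paper's question of how close the common policy is to each agent's own optimum under heterogeneity.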

URL: https://openreview.net/forum?id=WSRQeCUc3g

---

Title: Unlocking [CLS] Features for Continual Post-Training

Authors: Murat Onur Yildirim, Elif Ceren Gok Yildirim, Joaquin Vanschoren

Abstract: Continual learning requires models to integrate new classes or domains over time while preserving previously acquired knowledge. Within this paradigm, foundation models often achieve strong performance, but they remain subject to the stability–plasticity trade-off, where excessive plasticity leads to forgetting of prior knowledge and excessive stability constrains adaptation. This necessitates an effective post-training strategy that introduces minimal yet functional modifications. To address this challenge, we first introduce a new parameter-efficient fine-tuning module ‘Learn and Calibrate’, or LuCA, designed to acquire task-specific knowledge through an adapter–calibrator couple, enabling well-refined feature representations. Then, for each task, we deploy a sparse LuCA module on top of the last classification token [CLS], just before the classifier, which we refer to as ‘Token-level Sparse Calibration and Adaptation’, or TOSCA. By leaving the generalization capabilities of the foundation models intact and adapting exclusively via the last token, our approach achieves a harmonious balance between stability and plasticity while reducing both training and inference complexity. We demonstrate that TOSCA yields state-of-the-art performance while introducing 8× fewer parameters than prior methods.

URL: https://openreview.net/forum?id=OWfWyj6krc

---

Title: Relative Geometry of Neural Forecasters: Linking Accuracy and Alignment in Learned Latent Geometry

Authors: Deniz Kucukahmetler, Maximilian Jean Hemmann, Julian Mosig von Aehrenfeld, Maximilian Amthor, Christian Deubel, Nico Scherf, Diaaeldin Taha

Abstract: Neural networks can accurately forecast complex dynamical systems, yet how they internally represent underlying latent geometry remains poorly understood. We study neural forecasters through the lens of representational alignment, introducing anchor-based, geometry-agnostic relative embeddings that remove rotational and scaling ambiguities in latent spaces. Applying this framework across seven canonical dynamical systems—ranging from periodic to chaotic—we reveal reproducible family-level structure: multilayer perceptrons align with other MLPs, recurrent networks with RNNs, while transformers and echo-state networks achieve strong forecasts despite weaker alignment. Alignment generally correlates with forecasting accuracy, yet high accuracy can coexist with low alignment.
Relative geometry thus provides a simple, reproducible foundation for comparing how model families internalize and represent dynamical structure.

URL: https://openreview.net/forum?id=t4stf5Gafz

---


New submissions
===============


Title: Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences

Abstract: When do machine learning systems fail to generalize, and what mechanisms could improve their generalization? Here, we draw inspiration from cognitive science to argue that one weakness of parametric machine learning systems is their failure to exhibit latent learning—learning information that is not relevant to the task at hand, but that might be useful in a future task. We show how this perspective links failures ranging from the reversal curse in language modeling to new findings on agent-based navigation. We then highlight how cognitive science points to episodic memory as a potential part of the solution to these issues. Correspondingly, we show that a system with an oracle retrieval mechanism can use learning experiences more flexibly to generalize better across many of these challenges. We also identify some of the essential components for effectively using retrieval, including the importance of within-experience in-context learning for acquiring the ability to use information across retrieved experiences. In summary, our results illustrate one possible contributor to the relative data inefficiency of current machine learning systems compared to natural intelligence, and help to understand how retrieval methods can complement parametric learning to improve generalization. We close by discussing some of the links between our work and findings in cognitive science and neuroscience—including a possible perspective on hippocampal contributions to generalization—and the broader implications.
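
An oracle retrieval mechanism of this kind can be caricatured as a simple episodic store: write experiences under an embedding key, retrieve the most similar ones at query time, and place them in context. A toy sketch, with class and method names that are illustrative rather than from the paper:

```python
import numpy as np

class EpisodicMemory:
    """Toy episodic store: keep raw experiences, retrieve by cosine similarity."""

    def __init__(self):
        self.keys, self.values = [], []

    def write(self, key, value):
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(value)

    def retrieve(self, query, k=1):
        keys = np.stack(self.keys)
        q = np.asarray(query, dtype=float)
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-12)
        return [self.values[i] for i in np.argsort(-sims)[:k]]

mem = EpisodicMemory()
mem.write([1.0, 0.0], "experience A")
mem.write([0.0, 1.0], "experience B")
recalled = mem.retrieve([0.9, 0.1])
```

The point of the sketch is that nothing about "experience A" had to be task-relevant when it was written; it only needs to be retrievable when a later task makes it useful.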

URL: https://openreview.net/forum?id=RuWGeX5ZiB

---

Title: Improving LLM Unlearning Robustness via Random Perturbations

Abstract: Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formalize how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt the unlearned models' behavior, much like successful backdoor attacks. In this sense, LLM unlearning methods themselves poison the model, make it more vulnerable to forget-tokens, and hide rather than erase the target knowledge. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performance. This backdoor attack–defense framework offers insights into the mechanism of unlearning and can shed light on future research directions for improving unlearning robustness.
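
The retaining-side defense is conceptually simple: perturb the retain inputs (or their embeddings) with small random noise during the retaining step, so the model does not lock onto exact forget-token alignments. A minimal sketch, assuming Gaussian noise applied at the embedding level; `sigma` is an assumed hyperparameter, not a value from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_noise_augmentation(embeddings, sigma=0.05):
    """Add small Gaussian noise to retain-query embeddings (RNA-style sketch).

    sigma is an illustrative noise scale; the paper's actual placement of the
    perturbation and its theoretical scaling may differ.
    """
    return embeddings + sigma * rng.standard_normal(embeddings.shape)

x = np.ones((2, 4))       # stand-in for a batch of retain-query embeddings
x_aug = random_noise_augmentation(x)
```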

URL: https://openreview.net/forum?id=QYw192hTdH

---

Title: What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

Abstract: A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks.
Code, data and checkpoints are available in supplementary material.

URL: https://openreview.net/forum?id=cHZn5Gdh8e

---

Title: Flow Matching for Probabilistic Monocular 3D Human Pose Estimation

Abstract: Recovering 3D human poses from a monocular camera view is a highly ill-posed problem due to depth ambiguity. Earlier studies on lifting 3D human poses from 2D often produce incorrect yet overconfident 3D estimates. To mitigate this problem, emerging probabilistic approaches treat the 3D estimates as a distribution, taking the uncertainty of the poses into account. In a similar vein, we propose FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach.
Conditioned on the 2D cues, the flow matching scheme learns the optimal transport from a simple source distribution to the plausible 3D human pose distribution via continuous normalizing flows. The 2D lifting condition is modeled via graph convolutional networks, leveraging the learnable connections between human body joints as the graph structure for feature aggregation. Compared to diffusion-based methods, FMPose with optimal transport produces 3D poses faster and more accurately. Experimental results show major improvements of FMPose over current state-of-the-art methods on three common benchmarks for 3D human pose estimation, namely Human3.6M, MPI-INF-3DHP, and 3DPW.
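
Flow matching over an optimal-transport path admits a compact sketch: regress a velocity field onto x1 - x0 along the linear interpolation x_t = (1 - t) x0 + t x1, then integrate the learned field from the source distribution to the data. The sketch below shows the training-pair construction and a plain Euler sampler; the conditioning on 2D cues and the GCN backbone are omitted, and the 3-vector "pose" is a stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Optimal-transport (linear) probability path used by flow matching:
# x_t = (1 - t) * x0 + t * x1, with regression target u_t = x1 - x0.
def flow_matching_pair(x0, x1, t):
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0

# Plain Euler integration of a (learned) velocity field from source to data.
def euler_sample(velocity_fn, x0, steps=10):
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

x0 = rng.standard_normal(3)        # source sample (e.g. Gaussian noise)
x1 = np.array([1.0, 2.0, 3.0])     # "data" sample (a 3D pose stand-in)
xt, target = flow_matching_pair(x0, x1, 0.5)
```

Because the optimal-transport path has a constant target velocity, a perfectly learned field makes even coarse Euler integration exact, which is one intuition for the speed advantage over diffusion samplers.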

URL: https://openreview.net/forum?id=UlpH4XBLR4

---

Title: VidHal: Benchmarking Hallucinations in Vision LLMs

Abstract: Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with sparse exploration of their video-based counterparts. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address these two limitations, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions representing varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models, including both open-source and proprietary ones. Our results uncover significant limitations in existing VLLMs with respect to video-based hallucinations. Through our benchmark, we aim to inspire further research on i) holistic understanding of VLLM capabilities, particularly regarding hallucination, and ii) advancing VLLMs to alleviate this problem.

URL: https://openreview.net/forum?id=7ccWCDbdM1

---

Title: Cluster-DAGs as Powerful Background Knowledge for Causal Discovery

Abstract: Finding cause-effect relationships is of key importance in science. Causal discovery aims to recover a graph from data that succinctly describes these cause-effect relationships. However, current methods face several challenges, especially when dealing with high-dimensional data and complex dependencies. Incorporating prior knowledge about the system can aid causal discovery. In this work, we leverage Cluster-DAGs as a prior knowledge framework to warm-start causal discovery. We show that Cluster-DAGs offer greater flexibility than existing approaches based on tiered background knowledge and introduce two modified constraint-based algorithms, Cluster-PC and Cluster-FCI, for causal discovery in the fully and partially observed setting, respectively. Empirical evaluation on simulated data demonstrates that Cluster-PC and Cluster-FCI outperform their respective baselines without prior knowledge.

URL: https://openreview.net/forum?id=gSSmvVDKxB

---

Title: Random features for Grassmannian kernel approximation with bounded rank-one projections

Abstract: We propose a family of random feature maps for scalable kernel machines defined over low-dimensional subspaces in high dimensions, i.e., over the Grassmannian manifold. This is typically useful in a machine learning context when data classes or clusters are well represented by the span of a few data points. Classical Grassmannian kernels such as the projection or Binet–Cauchy kernels require constructing full Gram matrices for practical applications, leading to prohibitive computational and memory costs for large subspace datasets in high dimensions. We address this limitation by computing specific random features of subspaces. These combine random rank-one projections of the subspace projection matrices with bounded non-linear transforms, periodic or binary, to tame the resulting heavy-tailed distribution.
We show that, in the random feature space, inner products approximate well-defined, rotation-invariant Grassmannian kernels, i.e., kernels depending only on the principal angles of the considered subspaces. Provided the number of random features is large compared to the subspace intrinsic dimension, we show that this approximation holds uniformly over all subspaces of fixed dimensions with high probability.
When the non-linear transform is periodic, the approximated kernel admits a closed-form expression with a tunable behaviour bridging inverse Binet–Cauchy and Gaussian-type regimes, while the binarised feature has no known closed-form kernel but lends itself to even more compactly represented one-bit subspace features. Moreover, we show how structured rank-one projections, leveraging randomised fast Fourier transforms, further reduce the random feature computational complexity without sacrificing accuracy in practical experiments.
We demonstrate the practicality of these techniques with synthetic experiments and classification tasks on the ETH-80 dataset representing visual object images from different viewpoints. The proposed random features recover Grassmannian geometry with high accuracy while reducing computation, memory, and storage requirements. This demonstrates that rank-one embeddings offer a practical and scalable alternative to classical Grassmannian kernels.
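
The core feature construction can be sketched directly: each random feature passes a rank-one projection of the subspace's projection matrix through a bounded periodic map, which makes it invariant to the choice of orthonormal basis. A sketch assuming a cosine transform and uniform feature weights; the paper's exact scaling and normalisation may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

def grassmann_random_features(U, A, scale=1.0):
    """U: (n, k) orthonormal basis of a subspace; A: (m, n) random directions.

    Each feature applies a bounded periodic map to the rank-one projection
    <U U^T, a a^T> = ||U^T a||^2, taming the heavy-tailed distribution.
    The cosine and the 1/sqrt(m) weighting are illustrative choices.
    """
    proj = np.square(A @ U).sum(axis=1)      # m rank-one projections
    return np.cos(scale * proj) / np.sqrt(A.shape[0])

n, k, m = 8, 2, 64
U, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal basis
A = rng.standard_normal((m, n))                   # shared random probes
phi = grassmann_random_features(U, A)
```

Since ||U^T a|| depends only on the subspace and not on its basis, the features are unchanged under any rotation of the basis, which is exactly the rotation-invariance the kernel approximation requires.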

URL: https://openreview.net/forum?id=wq18dZJ2pA

---

Title: Knowing When Not to Answer: Mitigating Social Bias in LLMs via Epistemic Abstention

Abstract: The growing application of Large Language Models (LLMs) to social contexts has led to an increase in unjustifiable social-group attributions driven by stereotype-based responses, especially for questions where there is little evidence to support an answer or where the context is ambiguous. The lack of sufficient evidence often leads models to hallucinate socially grounded inferences, undermining fairness and trust. In this work, we mitigate social bias under ambiguity via epistemic uncertainty. We introduce BHARATBBQ-R, a rationale-augmented extension of BHARATBBQ that explicitly annotates evidential sufficiency or absence. We propose EPIK (Epistemic Pruning under Implicit Knowledge), an epistemic calibration framework that detects contextual insufficiency and enforces principled abstention in cases of inadequate evidence, while maintaining performance on unambiguous cases. Whereas prior bias mitigation techniques focus on suppressing stereotypes or debiasing representations, our framework reframes biased behavior as a failure of epistemic humility. Experiments across five open-source LLMs show that EPIK substantially reduces the bias score for ambiguous contexts (from 1.41–1.52 to 0.86–0.98) while maintaining accuracy on unambiguous instances. These results establish that epistemic calibration enables selective suppression of stereotype-driven inference without indiscriminately refusing valid social reasoning.
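
Abstention under epistemic uncertainty can be illustrated with a minimal rule: answer only when the predictive distribution is sufficiently peaked. The entropy threshold below is a stand-in for EPIK's learned calibration, not the paper's actual decision rule:

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of a predictive distribution (nats)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def answer_or_abstain(probs, threshold=0.9):
    """Abstain when entropy signals insufficient evidence; else answer.

    The threshold value is illustrative; a calibrated system would fit it
    on annotated ambiguous vs. unambiguous contexts.
    """
    if predictive_entropy(probs) > threshold:
        return "abstain"
    return int(np.argmax(probs))
```

On a confident distribution the rule answers; on a near-uniform one (entropy near log of the number of options) it abstains, which is the behavior sought for ambiguous contexts.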

URL: https://openreview.net/forum?id=UT5E31pYob

---

Title: ABCDE: Agentic-Based Controlled Dynamic Erasure for Intent-Aware Safety Reasoning

Abstract: Concept erasure has emerged as a central mechanism for safety alignment in text-conditioned generative models, yet most existing approaches implicitly adopt an unconditional suppression paradigm in which target concepts are removed whenever they appear, regardless of contextual intent.
This formulation conflates benign and harmful concept usage, leading to systematic over-suppression that unnecessarily censors policy-compliant content and degrades model utility.
We argue that safety intervention should instead be framed as a decision problem grounded in contextual language understanding, rather than as a purely mechanistic removal operation.
Based on this perspective, we introduce Intent-Aware Concept Erasure (ICE), a decision-centric formulation that explicitly separates the question of whether a concept should be suppressed from how suppression is realized, enabling context-sensitive intervention policies that preserve benign usage while maintaining safety guarantees.
To operationalize this formulation, we present ABCDE, an agentic framework that infers stable intervention decisions from semantic context and realizes them through minimal prompt rewriting with closed-loop output feedback.
Experiments on a paired benchmark designed to isolate contextual intent demonstrate that ABCDE substantially reduces unnecessary interventions while preserving strong safety effectiveness, outperforming unconditional concept erasure baselines.

URL: https://openreview.net/forum?id=IFjPhMcXJB

---

Title: Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari

Abstract: Developing generalist systems that retain human-like data efficiency is a central challenge. While world models (WMs) offer a promising path, existing research often conflates architectural mechanisms with the independent impact of model scale. In this work, we use a minimalist transformer world model to analyze scaling behaviors on the Atari 100k benchmark, using fixed offline datasets derived from an expert policy. Our results reveal that the appropriate scaling strategy is governed by the latent intrinsic dimensionality of the environment: the effectiveness of scaling correlates with the underlying structural and dynamic complexity of the task. For individual tasks, we identify distinct regimes: low-complexity environments pass the interpolation threshold, yielding monotonic improvements, while high-complexity dynamics remain in the classical regime, where larger models degrade performance. Conversely, in the unified setting, i.e., a single transformer trained on a suite of 26 Atari environments, we uncover a novel phenomenon that we term positive regularization: joint training stabilizes scaling dynamics, ensuring monotonic gains across all environments regardless of their intrinsic dimensionality. Finally, we demonstrate that improved fidelity translates directly to downstream control, with policies learned entirely within the simulated dynamics achieving a median expert-random-normalized score of 0.770. Our findings suggest that future progress lies as much in precise scaling strategies as in architectural innovation.

URL: https://openreview.net/forum?id=wVcvqtKaMY

---

Title: VECO: VEctor COnformity Based OOD Detection in Text and Multimodal Models

Abstract: Out-of-distribution (OOD) detection is critical for the reliable deployment of natural language processing and multimodal document understanding systems, where domain and semantic shifts are unavoidable. While many post-hoc OOD detection methods were developed for vision models, their direct transfer to textual and multimodal Transformer architectures remains poorly understood. We show that, unlike in vision benchmarks, the feature space provides the dominant OOD signal for text and document models, consistently outperforming logit-based and hybrid detectors.
Building on this observation, we introduce VECO (VEctor COnformity), a geometry-aware, purely feature-based OOD scoring framework that implements a stable soft contrast between in-distribution conformity and residual-space deviation.
We instantiate VECO using principal-subspace conformity for multimodal document models and Mahalanobis distance conformity for text classifiers, reflecting modality-aligned representation structure.
VECO achieves state-of-the-art and consistent performance improvements on multimodal document and text classification benchmarks. These results highlight the modality-dependent nature of OOD detection and the importance of adapting score design to representation cues.
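
The Mahalanobis-distance conformity used for text classifiers can be sketched as: fit class means and a shared covariance on in-distribution features, then score a test feature by its negative minimum class-conditional Mahalanobis distance (VECO's soft contrast with residual-space deviation is omitted here, and the synthetic Gaussian features are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_mahalanobis(features, labels):
    """Fit per-class means and a shared precision matrix on ID features."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-3 * np.eye(features.shape[1])
    return means, np.linalg.inv(cov)

def conformity_score(x, means, prec):
    """Higher = more in-distribution (negative min Mahalanobis distance)."""
    return -min(float((x - m) @ prec @ (x - m)) for m in means.values())

# Two synthetic ID classes, standing in for text-classifier features.
feats = np.vstack([
    rng.standard_normal((100, 2)) + np.array([3.0, 0.0]),
    rng.standard_normal((100, 2)) + np.array([-3.0, 0.0]),
])
labels = np.array([0] * 100 + [1] * 100)
means, prec = fit_mahalanobis(feats, labels)
```

A point near a class mean then scores higher than a point far from every class, which is the basic signal a feature-space detector thresholds.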

URL: https://openreview.net/forum?id=sMbGqh7Zvt

---
