Survey Certification: Survey of Video Diffusion Models: Foundations, Implementations, and Applications
Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu
https://openreview.net/forum?id=2ODDBObKjH
---
Reproducibility Certification: DRDT3: Diffusion-Refined Decision Test-Time Training Model
Xingshuai Huang, Di Wu, Benoit Boulet
https://openreview.net/forum?id=I6zjLhIzgh
---
Expert Certification: Characterizing Vision Backbones for Dense Prediction with Dense Attentive Probing
Timo Lüddecke, Alexander S. Ecker
https://openreview.net/forum?id=neMAx4uBlh
---
Reproducibility Certification: MetaGFN: Exploring Distant Modes with Adapted Metadynamics for Continuous GFlowNets
Dominic Phillips, Flaviu Cipcigan
https://openreview.net/forum?id=dtyNeemB7A
---
Survey Certification: Open Problems in Mechanistic Interpretability
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
https://openreview.net/forum?id=91H76m9Z94
---
Featured Certification: An Information-Theoretic Lower Bound on the Generalization Error of Autoencoders
Shyam Venkatasubramanian, Sean Moushegian, Ahmed Aloui, Vahid Tarokh
https://openreview.net/forum?id=0esF0M467w
---
Accepted papers
===============
Title: Survey of Video Diffusion Models: Foundations, Implementations, and Applications
Authors: Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu
Abstract: Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional generative adversarial network (GAN)-based approaches. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c), which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more up-to-date, and more fine-grained perspective on diffusion-based approaches, with a special section for evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field.
URL: https://openreview.net/forum?id=2ODDBObKjH
---
Title: Autonomous Imagination: Closed-Loop Decomposition of Visual-to-Textual Conversion in Visual Reasoning for Multimodal Large Language Models
Authors: Jingming Liu, Yumeng Li, Boyuan Xiao, Yichang Jian, Ziang Qin, Tianjia Shao, Yao-Xiang Ding, Kun Zhou
Abstract: In the purely textual modality, Large Language Models (LLMs) have demonstrated remarkable success in complex reasoning tasks by decomposing them into simpler sub-problems. However, Multimodal Large Language Models (MLLMs) still struggle with some seemingly straightforward visual tasks, such as counting and solving jigsaw puzzles. We argue that these tasks challenge the ability of {\it visual-to-textual conversion}, where MLLMs convert visual information perceived from the input scene into textual information for further reasoning and answer generation. If the complexity of the visual input is beyond the perceptual capability of the MLLMs, then without decomposing this conversion process, simply scaling inference-time reasoning cannot solve the task because it repeatedly encounters the same perceptual bottleneck. We propose an approach, {\it autonomous imagination}, to enable MLLMs to iteratively modify visual inputs (e.g., isolating objects, rearranging puzzle pieces) into intermediate visual states, decomposing visual-to-textual conversion into closed-loop visual modification steps. We show that, without any retraining, MLLMs can now solve tasks initially beyond their perceptual capability, highlighting that closed-loop visual modification can be an effective way of decomposing the visual reasoning task into solvable substeps. Our code and data are released at https://future-item.github.io/autoimagine-site/.
URL: https://openreview.net/forum?id=MI4yIBLprs
---
Title: Wolf: Dense Video Captioning with a World Summarization Framework
Authors: Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Linxi Fan, Yuke Zhu, Jan Kautz, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone
Abstract: We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore (caption quality) by 55.6% and CapScore (caption similarity) by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment.
URL: https://openreview.net/forum?id=Z1dH7hao7p
---
Title: Model Guidance via Robust Feature Attribution
Authors: Mihnea Ghitu, Vihari Piratla, Matthew Robert Wicker
Abstract: Controlling the patterns a model learns is essential to preventing reliance on irrelevant or misleading features. Such reliance on irrelevant features, often called shortcut features, has been observed across domains, including medical imaging and natural language processing, where it may lead to real-world harms. A common mitigation strategy leverages annotations (provided by humans or machines) indicating which features are relevant or irrelevant. These annotations are compared to model explanations, typically in the form of feature salience, and used to guide the loss function during training. Unfortunately, recent works have demonstrated that feature salience methods are unreliable and therefore offer a poor signal to optimize. In this work, we propose a simplified objective that simultaneously optimizes for explanation robustness and mitigation of shortcut learning. Unlike prior objectives with similar aims, we demonstrate theoretically why our approach ought to be more effective. Across a comprehensive series of experiments, we show that our approach consistently reduces test-time misclassifications by 20\% compared to state-of-the-art methods. We also extend prior experimental settings to include natural language processing tasks. Additionally, we conduct novel ablations that yield practical insights, including the relative importance of annotation quality over quantity. Code for our method and experiments is available at: https://github.com/Mihneaghitu/ModelGuidanceViaRobustFeatureAttribution.
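For orientation, below is a minimal sketch of the generic annotation-guided salience penalty the abstract describes (annotations marking irrelevant features are compared against input-gradient explanations and penalised during training). This is the standard baseline setup, not the paper's robust objective; `model`, `irrelevant_mask`, and `lam` are illustrative names.

```python
import torch
import torch.nn.functional as F

def annotation_guided_loss(model, x, y, irrelevant_mask, lam=1.0):
    # Generic salience-guidance baseline: task loss plus a penalty on input
    # gradients over features annotated as irrelevant. The paper's robust
    # feature-attribution objective differs; this only illustrates the setup.
    x = x.detach().clone().requires_grad_(True)
    task_loss = F.cross_entropy(model(x), y)
    input_grads = torch.autograd.grad(task_loss, x, create_graph=True)[0]
    penalty = (irrelevant_mask * input_grads).pow(2).sum()
    return task_loss + lam * penalty
```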
URL: https://openreview.net/forum?id=AVAHxDSqUu
---
Title: DRDT3: Diffusion-Refined Decision Test-Time Training Model
Authors: Xingshuai Huang, Di Wu, Benoit Boulet
Abstract: Decision Transformer (DT), a trajectory modelling method, has shown competitive performance compared to traditional offline reinforcement learning (RL) approaches on various classic control tasks. However, it struggles to learn optimal policies from suboptimal, reward-labelled trajectories. In this study, we explore the use of conditional generative modelling to facilitate trajectory stitching, given its high-quality data generation ability. Additionally, recent advancements in Recurrent Neural Networks (RNNs) have demonstrated linear complexity and sequence modelling performance competitive with Transformers. We leverage the Test-Time Training (TTT) layer, an RNN that updates hidden states during testing, to model trajectories in the form of DT. We introduce a unified framework, called Diffusion-Refined Decision TTT (DRDT3), to achieve performance beyond DT models. Specifically, we propose the Decision TTT (DT3) module, which harnesses the sequence modelling strengths of both self-attention and the TTT layer to capture recent contextual information and make coarse action predictions. DRDT3 iteratively refines the coarse action predictions through the generative diffusion model, progressively moving closer to the optimal actions. We further integrate DT3 with the diffusion model using a unified optimization objective. With experiments on multiple tasks in the D4RL benchmark, our DT3 model without diffusion refinement demonstrates improved performance over standard DT, while DRDT3 further achieves superior results compared to state-of-the-art DT-based and offline RL methods.
URL: https://openreview.net/forum?id=I6zjLhIzgh
---
Title: Learning to Rank with Top-$K$ Fairness
Authors: Boyang Zhang, Quanqi Hu, Mingxuan Sun, Qihang Lin, Tianbao Yang
Abstract: Fairness in ranking models is crucial, as disparities in exposure can disproportionately affect protected groups. Most fairness-aware ranking systems focus on ensuring comparable average exposure for groups across the entire ranked list, which may not fully address real-world concerns. For example, when a ranking model is used for allocating resources among candidates or disaster hotspots, decision-makers often prioritize only the top-$K$ ranked items, while the ranking beyond top-$K$ becomes less relevant. In this paper, we propose a list-wise learning-to-rank framework that addresses the issues of inequalities in top-$K$ rankings at training time. Specifically, we propose a top-$K$ exposure disparity measure that extends the classic exposure disparity metric in a ranked list. We then learn a ranker to balance relevance and fairness in top-$K$ rankings. Since direct top-$K$ selection is computationally expensive for a large number of items, we transform the non-differentiable selection process into a differentiable objective function and develop efficient stochastic optimization algorithms to achieve both high accuracy and sufficient fairness. Extensive experiments demonstrate that our method outperforms existing methods.
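As a rough illustration of the exposure quantities involved, the sketch below computes per-group average exposure restricted to the top-$K$ positions using the common $1/\log_2(1+\text{rank})$ position weighting; the paper's exact top-$K$ disparity measure and its differentiable relaxation are not reproduced here.

```python
import numpy as np

def topk_group_exposure(ranking_groups, K):
    # Average position-bias exposure per group within the top-K slots,
    # assuming the standard logarithmic position weighting.
    weights = 1.0 / np.log2(np.arange(1, K + 1) + 1)
    exposure, counts = {}, {}
    for rank, g in enumerate(ranking_groups[:K]):
        exposure[g] = exposure.get(g, 0.0) + weights[rank]
        counts[g] = counts.get(g, 0) + 1
    return {g: exposure[g] / counts[g] for g in exposure}

# Disparity between two groups in the top-10 of a ranked list of group labels.
exp = topk_group_exposure(["A", "B", "A", "A", "B", "A", "B", "A", "A", "B"], K=10)
print(abs(exp["A"] - exp["B"]))
```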
URL: https://openreview.net/forum?id=SSPCc39XvO
---
Title: Characterizing Vision Backbones for Dense Prediction with Dense Attentive Probing
Authors: Timo Lüddecke, Alexander S. Ecker
Abstract: The paradigm of pretraining a backbone on a large set of (often unlabeled) images has gained popularity. The quality of the resulting features is commonly measured by freezing the backbone and training different task heads on top of it. However, current evaluations cover only whole-image classification or require complex dense task heads, which introduce a large number of parameters and add their own inductive biases. In this work, we propose dense attentive probing, a parameter-efficient readout method for dense prediction on arbitrary backbones – independent of the size and resolution of their feature volume. To this end, we extend cross-attention with distance-based masks of learnable sizes. We employ this method to evaluate 18 common backbones on dense prediction tasks in three dimensions: instance awareness, local semantics and spatial understanding. We find that DINOv2 outperforms all other backbones tested – including those supervised with masks and language – across all three task categories. Furthermore, our analysis suggests that self-supervised pretraining tends to yield features that separate object instances better than vision-language models.
Code is available at http://eckerlab.org/code/deap.
URL: https://openreview.net/forum?id=neMAx4uBlh
---
Title: MetaGFN: Exploring Distant Modes with Adapted Metadynamics for Continuous GFlowNets
Authors: Dominic Phillips, Flaviu Cipcigan
Abstract: Generative Flow Networks (GFlowNets) are a class of generative models that sample objects in proportion to a specified reward function through a learned policy. They can be trained either on-policy or off-policy, needing a balance between exploration and exploitation for fast convergence to a target distribution. While exploration strategies for discrete GFlowNets have been studied, exploration in the continuous case remains to be investigated, despite the potential for novel exploration algorithms due to the local connectedness of continuous domains. Here, we introduce Adapted Metadynamics, a variant of metadynamics that can be applied to arbitrary black-box reward functions on continuous domains. We use Adapted Metadynamics as an exploration strategy for continuous GFlowNets. We show several continuous domains where the resulting algorithm, MetaGFN, accelerates convergence to the target distribution and discovers more distant reward modes than previous off-policy exploration strategies used for training GFlowNets.
URL: https://openreview.net/forum?id=dtyNeemB7A
---
Title: Open Problems in Mechanistic Interpretability
Authors: Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeffrey Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Isaac Bloom, Stella Biderman, Adrià Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Mary Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, William Saunders, Eric J Michaud, Stephen Casper, Max Tegmark, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Thomas McGrath
Abstract: Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
URL: https://openreview.net/forum?id=91H76m9Z94
---
Title: Low Compute Unlearning via Sparse Representations
Authors: Vedant Shah, Frederik Träuble, Ashish Malik, Hugo Larochelle, Michael Curtis Mozer, Sanjeev Arora, Yoshua Bengio, Anirudh Goyal
Abstract: Machine unlearning, which involves erasing knowledge about a \emph{forget set} from a trained model, can prove to be costly and infeasible using existing techniques. We propose a low-compute unlearning technique based on a discrete representational bottleneck. We show that the proposed technique efficiently unlearns the forget set and incurs negligible damage to the model's performance on the rest of the dataset. We evaluate the proposed technique on the problem of class unlearning using four datasets: CIFAR-10, CIFAR-100, LACUNA-100 and ImageNet-1k. We compare the proposed technique to SCRUB, a state-of-the-art approach which uses knowledge distillation for unlearning. Across all four datasets, the proposed technique performs as well as, if not better than, SCRUB while incurring almost no computational cost.
URL: https://openreview.net/forum?id=GyKXzmk43s
---
Title: A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation
Authors: Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio
Abstract: In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final images, from global characteristics to finer and local details (e.g., StyleGAN, NVAE), emerging as powerful tools for diverse applications. Yet their generative dynamics remain only empirically observed, without a systematic understanding of each latent variable's impact. In this work, we propose a novel framework that quantifies the contribution of each latent variable using Mutual Information (MI) as a metric. Our analysis reveals that current MLVGMs often underutilize some latent variables, and provides actionable insights for their use in downstream applications.
With this foundation, we introduce a method for generating synthetic data for Self-Supervised Contrastive Representation Learning (SSCRL). By leveraging the hierarchical and disentangled variables of MLVGMs, our approach produces diverse and semantically meaningful views without the need for real image data.
Additionally, we introduce a Continuous Sampling (CS) strategy, where the generator dynamically creates new samples during SSCRL training, greatly increasing data variability. Our comprehensive experiments demonstrate the effectiveness of these contributions, showing that MLVGMs' generated views compete on par with or even surpass views generated from real data.
This work establishes a principled approach to understanding and exploiting MLVGMs, advancing both generative modeling and self-supervised learning. Code and pre-trained models at: https://github.com/SerezD/mi_ml_gen
URL: https://openreview.net/forum?id=uaj8ZL2PtK
---
Title: Simplifying Knowledge Transfer in Pretrained Models
Authors: Siddharth Jain, Shyamgopal Karthik, Vineet Gandhi
Abstract: Pretrained models are ubiquitous in the current deep learning landscape, offering strong results on a broad range of tasks. Recent works have shown that models differing in various design choices exhibit categorically diverse generalization behavior, resulting in one model grasping distinct data-specific insights unavailable to the other. In this paper, we propose to leverage large publicly available model repositories as an auxiliary source of model improvements. We introduce a data partitioning strategy where pretrained models autonomously adopt either the role of a student, seeking knowledge, or that of a teacher, imparting knowledge. Experiments across various tasks demonstrate the effectiveness of our proposed approach. In image classification, we improved the performance of ViT-B by approximately 1.4\% through bidirectional knowledge transfer with ViT-T. For semantic segmentation, our method boosted all evaluation metrics by enabling knowledge transfer both within and across backbone architectures. In video saliency prediction, our approach achieved a new state-of-the-art. We further extend our approach to knowledge transfer between multiple models, leading to considerable performance improvements for all model participants.
URL: https://openreview.net/forum?id=eQ9AVtDaP3
---
Title: An Information-Theoretic Lower Bound on the Generalization Error of Autoencoders
Authors: Shyam Venkatasubramanian, Sean Moushegian, Ahmed Aloui, Vahid Tarokh
Abstract: Quantifying the limitations of classical neural network architectures is a critically underexplored area of machine learning research. Deriving lower bounds on the optimal performance of these architectures can facilitate improved neural architecture search and overfitting detection. We present an information-theoretic lower bound on the generalization mean squared error of autoencoders with sigmoid activation functions. Through the Estimation Error and Differential Entropy (EEDE) inequality for continuous random vectors, we derive this lower bound, which provides a new perspective on the inherent limitations and capabilities of autoencoders. Our analysis extends to the examination of how this lower bound is influenced by various architectural features and data distribution characteristics. This study enriches our theoretical understanding of autoencoders and has substantial practical implications for their design, optimization, and application in the field of deep learning.
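For reference, the classical estimation-error/differential-entropy inequality for a $d$-dimensional random vector $X$ estimated from observations $Y$ reads as follows; the paper's EEDE-based bound specialises this kind of inequality to autoencoder reconstructions, and its exact constants and conditions may differ.

$$\frac{1}{d}\,\mathbb{E}\!\left[\big\lVert X - \hat{X}(Y)\big\rVert^{2}\right] \;\ge\; \frac{1}{2\pi e}\,\exp\!\left(\frac{2}{d}\,h(X \mid Y)\right),$$

where $h(X \mid Y)$ denotes the conditional differential entropy of $X$ given $Y$.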
URL: https://openreview.net/forum?id=0esF0M467w
---
Title: Min-Max Optimisation for Nonconvex-Nonconcave Functions Using a Random Zeroth-Order Extragradient Algorithm
Authors: Amir Ali Farzin, Yuen-Man Pun, Philipp Braun, Antoine Lesage-Landry, Youssef Diouane, Iman Shames
Abstract: This study explores the performance of the random Gaussian smoothing Zeroth-Order ExtraGradient (ZO-EG) scheme considering deterministic min-max optimisation problems with possibly NonConvex-NonConcave (NC-NC) objective functions. We consider both unconstrained and constrained, differentiable and non-differentiable settings. We discuss the min-max problem from the point of view of variational inequalities. For the unconstrained problem, we establish the convergence of the ZO-EG algorithm to the neighbourhood of an $\epsilon$-stationary point of the NC-NC objective function, whose radius can be controlled under a variance reduction scheme, along with its complexity. For the constrained problem, we introduce the new notion of proximal variational inequalities and give examples of functions satisfying this property. Moreover, we prove analogous results to the unconstrained case for the constrained problem. For the non-differentiable case, we prove the convergence of the ZO-EG algorithm to a neighbourhood of an $\epsilon$-stationary point of the smoothed version of the objective function, where the radius of the neighbourhood can be controlled, which can be related to the ($\delta,\epsilon$)-Goldstein stationary point of the original objective function.
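A simplified, unconstrained sketch of the kind of scheme analysed here: Gaussian-smoothing zeroth-order gradient estimates plugged into the classic extragradient (extrapolate, then update) iteration for $\min_x \max_y f(x,y)$. The step size, smoothing radius, and single-sample estimator are illustrative choices; the constrained and non-differentiable cases treated in the paper are not covered.

```python
import numpy as np

def zo_grad(f, z, mu, rng):
    # Single-sample Gaussian-smoothing zeroth-order gradient estimate of f at z.
    u = rng.standard_normal(z.shape)
    return (f(z + mu * u) - f(z)) / mu * u

def zo_extragradient_minmax(f, x0, y0, eta=0.01, mu=1e-3, n_iters=1000, seed=0):
    # Zeroth-order extragradient for min_x max_y f(x, y): estimate the
    # gradient field (grad_x f, -grad_y f) by smoothing, extrapolate, then update.
    rng = np.random.default_rng(seed)
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for _ in range(n_iters):
        gx = zo_grad(lambda v: f(v, y), x, mu, rng)
        gy = zo_grad(lambda v: f(x, v), y, mu, rng)
        xh, yh = x - eta * gx, y + eta * gy            # extrapolation step
        gxh = zo_grad(lambda v: f(v, yh), xh, mu, rng)
        gyh = zo_grad(lambda v: f(xh, v), yh, mu, rng)
        x, y = x - eta * gxh, y + eta * gyh            # update step
    return x, y
```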
URL: https://openreview.net/forum?id=1bxY1uAXyr
---
Title: Learning Robust Representations for Visual Reinforcement Learning via Task-Relevant Mask Sampling
Authors: Vedant Dave, Ozan Özdenizci, Elmar Rueckert
Abstract: Humans excel at isolating relevant information from noisy data to predict the behavior of dynamic systems, effectively disregarding non-informative, temporally-correlated noise. In contrast, existing visual reinforcement learning algorithms face challenges in generating noise-free predictions within high-dimensional, noise-saturated environments, especially when trained on world models featuring realistic background noise extracted from natural video streams. We propose Task Relevant Mask Sampling (TRMS), a novel approach for identifying task-specific and reward-relevant masks. TRMS utilizes existing segmentation models as a masking prior, which is subsequently followed by a mask selector that dynamically identifies a subset of masks at each timestep, selecting those most probable to contribute to task-specific rewards. To mitigate the high computational cost associated with these masking priors, a lightweight student network is trained in parallel. This network learns to perform masking independently and replaces the Segment Anything Model~(SAM)-based teacher network after a brief initial phase (<10-25% of total training). TRMS enhances the generalization capabilities of Soft Actor-Critic agents under distractions and achieves better performance on the RL-ViGen benchmark, which includes challenging variants of the DeepMind Control Suite, Dexterous Manipulation, and Quadruped Locomotion tasks.
URL: https://openreview.net/forum?id=2rxNDxHwtn
---
Title: Approximate Bayesian Neural Operators: Uncertainty Quantification for Parametric PDEs
Authors: Emilia Magnani, Nicholas Krämer, Runa Eschenhagen, Lorenzo Rosasco, Philipp Hennig
Abstract: Neural operators are a type of deep architecture that learns to solve (i.e., learns the nonlinear solution operator of) partial differential equations (PDEs). The current state of the art for these models does not provide explicit uncertainty quantification. This is arguably even more of a problem for this kind of task than elsewhere in machine learning, because the dynamical systems typically described by PDEs often exhibit subtle, multiscale structure that makes errors hard for humans to spot. In this work, we first provide a mathematically detailed Bayesian formulation of the ``shallow'' (linear) version of neural operators in the formalism of Gaussian processes. We then extend this analytic treatment to general deep neural operators—specifically, graph neural operators—using approximate methods from Bayesian deep learning, enabling them to incorporate uncertainty quantification. As a result, our approach is able to identify cases, and provide structured uncertainty estimates, where the neural operator fails to predict well.
URL: https://openreview.net/forum?id=6WvIkYsMA8
---
Title: GROOD: GRadient-Aware Out-of-Distribution Detection
Authors: Mostafa ElAraby, Sabyasachi Sahoo, Yann Pequignot, Paul Novello, Liam Paull
Abstract: Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models in real-world applications. Existing methods typically focus on feature representations or output-space analysis, often assuming a distribution over these spaces or leveraging gradient norms with respect to model parameters. However, these approaches struggle to distinguish near-OOD samples and often require extensive hyper-parameter tuning, limiting their practicality.
In this work, we propose GRadient-aware Out-Of-Distribution detection (GROOD), a method that derives an OOD prototype from synthetic samples and computes class prototypes directly from in-distribution (ID) training data. By analyzing the gradients of a nearest-class-prototype loss function with respect to an artificial OOD prototype, our approach achieves a clear separation between in-distribution and OOD samples.
Experimental evaluations demonstrate that gradients computed from the OOD prototype enhance the distinction between ID and OOD data, surpassing established baselines in robustness, particularly on ImageNet-1k. These findings highlight the potential of gradient-based methods and prototype-driven approaches in advancing OOD detection within deep neural networks.
URL: https://openreview.net/forum?id=2V7itvvMVJ
---
Title: Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation
Authors: Xuyi Meng, Chen Wang, Jiahui Lei, Kostas Daniilidis, Jiatao Gu, Lingjie Liu
Abstract: Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
URL: https://openreview.net/forum?id=GVizav9Zf8
---
Title: A noise-corrected Langevin algorithm and sampling by half-denoising
Authors: Aapo Hyvarinen
Abstract: The Langevin algorithm is a classic method for sampling from a given pdf in a real space. In its basic version, it only requires knowledge of the gradient of the log-density, also called the score function. However, in deep learning, it is often easier to learn the so-called "noisy-data score function", i.e., the gradient of the log-density of noisy data, more precisely when Gaussian noise is added to the data. Such an estimate is biased and complicates the use of the Langevin method. Here, we propose a noise-corrected version of the Langevin algorithm, where the bias due to noisy data is removed, at least up to first-order terms. Unlike diffusion models, our algorithm only needs to know the noisy-data score function for one single noise level. We further propose a simple special case which has an interesting intuitive interpretation of iteratively adding noise to the data and then attempting to remove half of that noise.
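A rough sketch of the "half-denoising" special case as described above: add Gaussian noise at a single level sigma, then use the noisy-data score (via Tweedie's formula) to remove roughly half of it. The paper's exact update rule and step sizes may differ; `noisy_score` is an assumed user-supplied estimate of the noisy-data score function at that noise level.

```python
import numpy as np

def half_denoising_sampler(noisy_score, x0, sigma, n_steps, seed=None):
    # One plausible reading of sampling by half-denoising: perturb the current
    # iterate with noise at level sigma, then remove approximately half of that
    # noise using the noisy-data score via Tweedie's formula.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        y = x + sigma * rng.standard_normal(x.shape)   # add noise
        x = y + 0.5 * sigma**2 * noisy_score(y)        # remove (roughly) half of it
    return x
```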
URL: https://openreview.net/forum?id=QGtXn5GtfK
---
Title: Two Is Better Than One: Aligned Representation Pairs for Anomaly Detection
Authors: Alain Ryser, Thomas M. Sutter, Alexander Marx, Julia E Vogt
Abstract: Anomaly detection focuses on identifying samples that deviate from the norm. Discovering informative representations of normal samples is crucial to detecting anomalies effectively. Recent self-supervised methods have successfully learned such representations by employing prior knowledge about anomalies to create synthetic outliers during training. However, we often do not know what to expect from unseen data in specialized real-world applications. In this work, we address this limitation with our new approach, Con2, which leverages prior knowledge about symmetries in normal samples to observe the data in different contexts. Con2 consists of two parts: Context Contrasting clusters representations according to their context, while Content Alignment encourages the model to capture semantic information by aligning the positions of normal samples across clusters. The resulting representation space allows us to detect anomalies as outliers of the learned context clusters. We demonstrate the benefit of this approach in extensive experiments on specialized medical datasets, outperforming competitive baselines based on self-supervised learning and pretrained models and presenting competitive performance on natural imaging benchmarks.
URL: https://openreview.net/forum?id=Bt0zdsnWYc
---
Title: NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities
Authors: Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, Kai Chen
Abstract: The capability of large language models to handle long-context information plays a crucial role across various real-world applications. Existing methods for evaluating long-context abilities often rely either on real-world long texts, making it difficult to exclude the influence of models' inherent knowledge, or introduce large amounts of irrelevant filler content to artificially reach target lengths, reducing the relevance and effectiveness of assessments. To address these limitations, we introduce NeedleBench, a comprehensive synthetic framework designed to assess retrieval and reasoning performance in bilingual long-context tasks with adaptive context lengths (e.g., 32k, 128k, and beyond). NeedleBench systematically embeds key data points at varying depths to rigorously test models' capabilities in diverse settings. Tasks within NeedleBench are categorized into two distinct scenarios: information-sparse, characterized by minimal relevant details embedded within extensive irrelevant text to simulate simpler real-world retrieval tasks; and information-dense, implemented as the Ancestral Trace Challenge, where relevant information is continuously distributed throughout the context to simulate more complex real-world reasoning tasks. Our experiments show that, while recent reasoning models such as DeepSeek-R1 and OpenAI's o3 have demonstrated strong performance on mathematical reasoning benchmarks, they still struggle to generalize their reasoning abilities and perform poorly on our information-dense tasks, frequently encountering difficulties with continuous retrieval and reasoning even at relatively shorter context lengths. Furthermore, we identify and characterize a phenomenon termed `under-thinking', wherein models prematurely conclude their reasoning processes despite the availability of relevant information. NeedleBench thus provides critical insights and targeted evaluation tools essential for understanding and improving the long-context capabilities of LLMs. All code and resources are publicly available at https://github.com/open-compass/opencompass.
URL: https://openreview.net/forum?id=cEvmIKsRw0
---
Title: Wasserstein Convergence of Score-based Generative Models under Semiconvexity and Discontinuous Gradients
Authors: Stefano Bruno, Sotirios Sabanis
Abstract: Score-based Generative Models (SGMs) approximate a data distribution by perturbing it with Gaussian noise and subsequently denoising it via a learned reverse diffusion process. These models excel at modeling complex data distributions and generating diverse samples, achieving state-of-the-art performance across domains such as computer vision, audio generation, reinforcement learning, and computational biology. Despite their empirical success, existing Wasserstein-2 convergence analyses typically assume strong regularity conditions--such as smoothness or strict log-concavity of the data distribution--that are rarely satisfied in practice. In this work, we establish the first non-asymptotic Wasserstein-2 convergence guarantees for SGMs targeting semiconvex distributions with potentially discontinuous gradients. Our upper bounds are explicit and sharp in key parameters, achieving optimal dependence of $O(\sqrt{d})$ on the data dimension $d$ and a convergence rate of order one. The framework accommodates a wide class of practically relevant distributions, including symmetric modified half-normal distributions, Gaussian mixtures, double-well potentials, and elastic net potentials. By leveraging semiconvexity without requiring smoothness assumptions on the potential such as differentiability, our results substantially broaden the theoretical foundations of SGMs, bridging the gap between empirical success and rigorous guarantees in non-smooth, complex data regimes.
URL: https://openreview.net/forum?id=vS9iVRB7XF
---
Title: Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy
Authors: Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian
Abstract: Detecting whether an LLM hallucinates is an important research challenge. One promising way of doing so is to estimate the semantic entropy (Farquhar et al., 2024) of the distribution of generated sequences. We propose a new algorithm for estimating it, with two main advantages. First, because we take a Bayesian approach, we achieve much better semantic entropy estimates for a given budget of samples from the LLM. Second, we are able to tune the number of samples adaptively so that `harder' contexts receive more samples. We demonstrate empirically that our approach systematically beats the baselines, requiring only 53% of the samples used by Farquhar et al. (2024) to achieve the same quality of hallucination detection as measured by AUROC. Moreover, quite counterintuitively, our estimator is useful even with just one sample from the LLM.
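For context, the plug-in semantic entropy estimator of Farquhar et al. (2024) that this work improves upon can be sketched as below, assuming the sampled generations have already been clustered by meaning (e.g. via bidirectional entailment). The Bayesian, budget-adaptive estimator proposed in the paper is not reproduced.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    # Plug-in entropy over semantic clusters of sampled generations:
    # higher entropy suggests the model is more likely to be hallucinating.
    counts = Counter(cluster_ids)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Five samples falling into three meaning clusters.
print(semantic_entropy([0, 0, 1, 0, 2]))
```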
URL: https://openreview.net/forum?id=j2N2RuNdbC
---
Title: Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training
Authors: Hiroki Naganuma, Xinzhi Zhang, Man-Chung Yue, Ioannis Mitliagkas, Russell J. Hewett, Philipp Andre Witte, Yin Tat Lee
Abstract: Following AI scaling trends, frontier models continue to grow in size and continue to be trained on larger datasets. Training these models requires huge investments in exascale computational resources, which has in turn driven the development of distributed deep learning methods. Data parallelism is an essential approach to speed up training, but it requires frequent global communication between workers, which can bottleneck training at the largest scales. In this work, we propose a method called Pseudo-Asynchronous Local SGD (PALSGD) to improve the efficiency of data-parallel training. PALSGD is an extension of Local SGD (Stich, 2018) and DiLoCo (Douillard et al., 2023), designed to further reduce communication frequency by introducing a pseudo-synchronization mechanism. PALSGD allows the use of longer synchronization intervals compared to standard Local SGD. Despite the reduced communication frequency, the pseudo-synchronization approach ensures that model consistency is maintained, leading to performance results comparable to those achieved with more frequent synchronization. Furthermore, we provide a theoretical analysis of PALSGD, establishing its convergence and deriving its convergence rate. This analysis offers insights into the algorithm's behavior and performance guarantees. We evaluated PALSGD on image classification and language modeling tasks. Our results show that PALSGD achieves better performance in less time compared to existing methods like Distributed Data Parallel (DDP) and DiLoCo. Notably, PALSGD trains 18.4% faster than DDP on ImageNet-1K with ResNet-50, 24.4% faster than DDP on TinyStories with GPT-Neo-125M, and 21.1% faster than DDP on TinyStories with GPT-Neo-8M.
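For context, a sketch of the plain Local SGD baseline that PALSGD extends: workers take independent local steps and exchange information only every `sync_every` steps via parameter averaging. PALSGD's pseudo-synchronization mechanism, which relaxes this hard averaging, is not reproduced; `workers` exposing `.model` and `.compute_loss()` is an assumed abstraction for illustration.

```python
import torch

def local_sgd(workers, make_optimizer, sync_every, total_steps):
    # Plain Local SGD (Stich, 2018): independent local steps, with models
    # averaged across workers only every `sync_every` steps.
    optimizers = [make_optimizer(w.model) for w in workers]
    for step in range(total_steps):
        for worker, opt in zip(workers, optimizers):
            opt.zero_grad()
            worker.compute_loss().backward()   # loss on the worker's local batch
            opt.step()
        if (step + 1) % sync_every == 0:       # infrequent global communication
            with torch.no_grad():
                for params in zip(*(w.model.parameters() for w in workers)):
                    avg = torch.stack([p.data for p in params]).mean(dim=0)
                    for p in params:
                        p.data.copy_(avg)
```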
URL: https://openreview.net/forum?id=8VTrvS5vN7
---
Title: Structural Causal Circuits: Probabilistic Circuits Climbing All Rungs of Pearl's Ladder of Causation
Authors: Florian Peter Busch, Moritz Willig, Matej Zečević, Kristian Kersting, Devendra Singh Dhami
Abstract: The complexity and vastness of our world can require large models with numerous variables. Unfortunately, coming up with a model that is both accurate and able to provide predictions in a reasonable amount of time can prove difficult. One possibility to help overcome such problems is sum-product networks (SPNs), probabilistic models with the ability to tractably perform inference in linear time. In this paper, we extend SPNs' capabilities to the field of causality and introduce the family of structural causal circuits (SCCs), a type of SPNs capable of answering causal questions. Starting from conventional SPNs, we ``climb the ladder of causation'' and show how SCCs can represent not only observational but also interventional and counterfactual problems. We demonstrate successful application in different settings, ranging from simple binary variables to physics-based simulations.
URL: https://openreview.net/forum?id=25XyUTICdZ
---
Title: Global Optimization Algorithm through High-Resolution Sampling
Authors: Daniel Cortild, Claire Delplancke, Nadia Oudjane, Juan Peypouquet
Abstract: We present an optimization algorithm that can identify a global minimum of a potentially nonconvex smooth function with high probability, assuming the Gibbs measure of the potential satisfies a logarithmic Sobolev inequality. Our contribution is twofold: on the one hand, we propose said global optimization method, which is built on an oracle sampling algorithm producing arbitrarily accurate samples from a given Gibbs measure. On the other hand, we propose a new sampling algorithm, drawing inspiration from both overdamped and underdamped Langevin dynamics, as well as from the high-resolution differential equation known for its acceleration in deterministic settings. While the focus of the paper is primarily theoretical, we demonstrate the effectiveness of our algorithms on the Rastrigin function, where they outperform recent approaches.
URL: https://openreview.net/forum?id=r3VEA1AWY5
---
Title: RouteFinder: Towards Foundation Models for Vehicle Routing Problems
Authors: Federico Berto, Chuanbo Hua, Nayeli Gast Zepeda, André Hottung, Niels Wouda, Leon Lan, Junyoung Park, Kevin Tierney, Jinkyoo Park
Abstract: This paper introduces RouteFinder, a comprehensive foundation model framework to tackle different Vehicle Routing Problem (VRP) variants. Our core idea is that a foundation model for VRPs should be able to represent variants by treating each as a subset of a generalized problem equipped with different attributes. We propose a unified VRP environment capable of efficiently handling any combination of these attributes. The RouteFinder model leverages a modern transformer-based encoder and global attribute embeddings to improve task representation. Additionally, we introduce two reinforcement learning techniques to enhance multi-task performance: mixed batch training, which enables training on different variants at once, and multi-variant reward normalization to balance different reward scales. Finally, we propose efficient adapter layers that enable fine-tuning for new variants with unseen attributes. Extensive experiments on 48 VRP variants show RouteFinder outperforms recent state-of-the-art learning methods. Our code is publicly available at https://github.com/ai4co/routefinder.
URL: https://openreview.net/forum?id=QzGLoaOPiY
---
Title: CoNNect: Connectivity-Based Regularization for Structural Pruning of Neural Networks
Authors: Christian P.C. Franssen, Jinyang Jiang, Yijie Peng, Bernd Heidergott
Abstract: Pruning encompasses a range of techniques aimed at increasing the sparsity of neural networks (NNs). These techniques can generally be framed as minimizing a loss function subject to an $L_0$ norm constraint. This paper introduces CoNNect, a novel differentiable regularizer for sparse NN training that ensures connectivity between input and output layers. We prove that CoNNect approximates $L_0$ regularization, while preserving essential network structure and preventing the emergence of fragmented or poorly connected subnetworks. Moreover, CoNNect is easily integrated within established structural pruning strategies. Numerical experiments demonstrate that CoNNect can improve classical pruning strategies and enhance state-of-the-art one-shot pruners, such as DepGraph and LLM-pruner.
URL: https://openreview.net/forum?id=RIZCe7BuEp
---
Title: Amdahl’s Law for LLMs: A Throughput-Centric Analysis of Extreme LLM Quantization
Authors: Jinendra Malekar, Ramtin Zand
Abstract: The emergence of 1-bit large language models (LLMs) has sparked significant interest, promising substantial efficiency gains through extreme quantization. However, these benefits are inherently limited by the portion of the model that can be quantized. Specifically, 1-bit quantization typically targets only the projection layers, while the attention mechanisms remain in higher precision, potentially creating significant throughput bottlenecks. To address this, we present an adaptation of Amdahl's Law specifically tailored to LLMs, offering a quantitative framework for understanding the throughput limits of extreme quantization. Our analysis reveals how improvements in quantization can deliver substantial throughput gains, but only to the extent that they address critical throughput-constrained sections of the model. Through extensive experiments across diverse model architectures and hardware platforms, we highlight key trade-offs and performance ceilings, providing a roadmap for future research aimed at maximizing LLM throughput through more holistic quantization strategies.
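For intuition, the classic form of Amdahl's Law already captures the ceiling described above: if only a fraction of decode time is spent in quantizable projection layers, the overall throughput gain is bounded by the remaining (attention) share. The fractions and speedups below are illustrative, not the paper's measurements.

```python
def amdahl_speedup(quantizable_fraction: float, local_speedup: float) -> float:
    """Classic Amdahl's Law: overall speedup when only a fraction of the
    runtime benefits from a local speedup."""
    p, s = quantizable_fraction, local_speedup
    return 1.0 / ((1.0 - p) + p / s)

# E.g. if projection layers take 70% of decode time and 1-bit kernels run 8x faster:
print(amdahl_speedup(0.7, 8.0))            # ~2.6x overall
print(amdahl_speedup(0.7, float("inf")))   # ~3.3x ceiling, even with free projections
```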
URL: https://openreview.net/forum?id=JtrQJJQYpP
---
Title: Double Machine Learning Based Structure Identification from Temporal Data
Authors: Emmanouil Angelis, Francesco Quinzan, Ashkan Soleymani, Patrick Jaillet, Stefan Bauer
Abstract: Learning the causes of time-series data is a fundamental task in many applications, spanning from finance to earth sciences or bio-medical applications. Common approaches for this task are based on vector auto-regression, and they do not take into account unknown confounding between potential causes. However, in settings with many potential causes and noisy data, these approaches may be substantially biased. Furthermore, potential causes may be correlated in practical applications or even contain cycles. To address these challenges, we propose a new double machine learning based method for structure identification from temporal data (DR-SIT). We provide theoretical guarantees, showing that our method asymptotically recovers the true underlying causal structure. Our analysis extends to cases where the potential causes have cycles, and they may even be confounded. We further perform extensive experiments to showcase the superior performance of our method. Code: https://github.com/sdi1100041/TMLR_submission_DR_SIT
URL: https://openreview.net/forum?id=4iHAoFVM2K
---
Title: Byzantine-Robust and Hessian-Free Federated Bilevel Optimization
Authors: Shruti P Maralappanavar, Bharath B N
Abstract: In the last few years, Byzantine-robust algorithms for solving minimization problems in the federated setup have received significant attention. Most existing works consider the problem of Byzantine robustness for single-level optimization, or consider federated bilevel optimization without Byzantine nodes. However, problem formulations such as federated bilevel optimization in the presence of Byzantine nodes remain unexplored. Recognizing this gap, for the first time, we propose a computationally efficient and robust algorithm for solving Federated Bilevel Optimization with Byzantine (FedBOB) nodes that: (1) works under the assumption that the data across nodes are heterogeneous (non-iid), (2) considers a lower-level objective that is non-convex and satisfies the Polyak-\L ojasiewicz (PL) inequality, and (3) is fully first-order and does not rely on second-order information. We achieve this by reformulating the federated bilevel problem into a single penalty problem. We provide theoretical performance guarantees for the proposed algorithm and experimentally corroborate our theoretical findings.
URL: https://openreview.net/forum?id=5trmyvtkeo
---
New submissions
===============
Title: Tabby: A Language Model Architecture for Tabular and Structured Data Synthesis
Abstract: Large language models (LLMs) have greatly improved the quality of synthetic text data. We aim to extend these advances to tabular data with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby represents differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. Pairing Tabby with Plain, our novel tabular training technique, we observe up to a $7\%$ improvement in quality (measured by MLE) over previous methods. Additionally, our approach is more flexible than prior strategies and extends beyond tables, to more general structured data. In a structured JSON setting, Tabby outperforms all other methods by $2$-$3$ points and is the only approach with MLE equal to the upper bound of non-synthetic data.
URL: https://openreview.net/forum?id=b9FPVnb0Bn
---
Title: LVLM-Count: Enhancing the Counting Ability of Large Vision-Language Models
Abstract: Counting is a fundamental operation for various real-world visual tasks, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while their performance may be less prone to error for small numbers of objects, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs’ counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could otherwise lead to repetitive counting—a common issue in a naive divide-and-conquer implementation. We demonstrate the effectiveness of this approach across various datasets and benchmarks, establishing it as a valuable reference for evaluating future solutions.
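A naive sketch of the divide-and-conquer idea: split the image into tiles, query a counting routine on each tile, and sum the results. The paper additionally prevents objects from being split across tile boundaries, which this sketch omits; `count_fn` stands in for whatever LVLM counting prompt the caller supplies.

```python
import numpy as np

def count_by_tiles(image: np.ndarray, count_fn, grid=(2, 2)) -> int:
    # Divide the image into a rows x cols grid, count objects in each tile via
    # the supplied counting function (e.g. an LVLM query), and sum the counts.
    h, w = image.shape[:2]
    rows, cols = grid
    total = 0
    for r in range(rows):
        for c in range(cols):
            tile = image[r * h // rows:(r + 1) * h // rows,
                         c * w // cols:(c + 1) * w // cols]
            total += count_fn(tile)
    return total
```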
URL: https://openreview.net/forum?id=G1i9MUQj63
---
Title: Proc-to-Spec: A Functorial Map of Network Processes
Abstract: The analysis of dynamic networks is central to understanding complex environmental systems in nature, yet traditional methods often focus on describing changing states rather than formalising the underlying processes of change. In this work, we introduce a category-theoretical framework, Proc-to-Spec, that provides a principled, functorial method for analysing the transformations that govern network evolution. We model resource-constrained systems, such as those commonly found in biology and ecology, within a source category Proc, where morphisms represent dissipative physical processes. We then construct a spectral functor, $\chi: Proc \to Spec$, that maps each process to a unique linear transformation between the eigenspaces of the network's symmetrised Laplacian. This framework allows us to establish a set of rigorous theorems. We prove that physical conservation laws in Proc correspond directly to spectral invariants in Spec, such as the conservation of the Laplacian's trace. We derive a spectral sensitivity theorem that formally links resource dissipation to network fragmentation via the Fiedler value. We also establish a stability-spectrum equivalence theorem, proving that a system's physical dynamics converge to a stable state if and only if its spectral geometry converges. We validate our theory with numerical experiments and demonstrate its utility as a tool for scientific discovery in a case study of the Serengeti food web in northern Tanzania. Using a large collection of 1.2 million classified image sets of animal activity from 225 camera traps spread across 1,125 km$^2$ of the Serengeti National Park from 2010 to 2013, we show that our framework can detect the subtle, cyclical signature of seasonal change and identify the unique geometric fingerprint of the 2011 East Africa drought. Our work provides a different way of thinking about dynamic systems, shifting the focus from describing states to understanding the fundamental geometry of change. Code to reproduce all results in the paper is released at https://anonymous.4open.science/r/tmlr_pts
URL: https://openreview.net/forum?id=pT84Ii6igG
---
Title: On the (linear) convergence of Generalized Newton Inexact ADMM
Abstract: This paper presents GeNI-ADMM, a framework for large-scale composite convex optimization that facilitates theoretical analysis of both existing and new approximate ADMM schemes. GeNI-ADMM encompasses any ADMM algorithm that solves a first- or second-order approximation to the ADMM subproblem inexactly. GeNI-ADMM exhibits the usual $O(1/t)$-convergence rate under standard hypotheses and converges linearly under additional hypotheses such as strong convexity. Further, the GeNI-ADMM framework provides explicit convergence rates for ADMM variants accelerated with randomized linear algebra, such as NysADMM and sketch-and-solve ADMM, resolving an important open question on the convergence of these methods. This analysis quantifies the benefit of improved approximations and can aid in the design of new ADMM variants with faster convergence.
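For reference, the standard scaled-form ADMM iteration for $\min_{x,z} f(x)+g(z)$ subject to $Ax+Bz=c$ is shown below; GeNI-ADMM covers variants in which the $x$-subproblem is replaced by a first- or second-order approximation and solved only inexactly.

$$
\begin{aligned}
x^{k+1} &= \operatorname*{arg\,min}_{x}\; f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^{k} - c + u^{k}\rVert^{2},\\
z^{k+1} &= \operatorname*{arg\,min}_{z}\; g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^{k}\rVert^{2},\\
u^{k+1} &= u^{k} + Ax^{k+1} + Bz^{k+1} - c.
\end{aligned}
$$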
URL: https://openreview.net/forum?id=GT3naIXBxK
---
Title: An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandit Problems
Abstract: We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with parameter $\beta>0$, and both the action $a\in \mathcal{A}$ and the unknown parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo & Van Roy (2016), we derive regret bounds via the analysis of the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred by the agent and the information it just gained about the parameter $\theta$. We improve upon previous results and establish that the information ratio is bounded by $d(4/\alpha)^2$, where $d$ is the dimension of the problem and $\alpha$ is a \emph{minimax measure} of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$. Notably, our bound does not scale exponentially with the logistic slope and is independent of the cardinality of the action and parameter spaces. Using this result, we derive a bound on the Thompson Sampling expected regret of order $O(d \alpha^{-1} \sqrt{T \log(\beta T/d)})$, where $T$ is the number of time steps. To our knowledge, this is the \emph{first regret bound for any logistic bandit algorithm} that avoids any exponential scaling with $\beta$ and is independent of the number of actions. In particular, when the parameters are on the sphere and the action space contains the parameter space, the expected regret bound is of order $O(d \sqrt{T \log(\beta T/d)})$.
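A minimal sketch of the reward model stated above; the Thompson Sampling machinery and the information-ratio analysis are not reproduced.

```python
import numpy as np

def logistic_reward_prob(a, theta, beta):
    # P(reward = 1 | action a, parameter theta), exactly as in the abstract:
    # exp(beta <a, theta>) / (1 + exp(beta <a, theta>)).
    z = beta * np.dot(a, theta)
    return 1.0 / (1.0 + np.exp(-z))

def pull(a, theta, beta, rng=None):
    # Simulate one binary reward from the logistic bandit environment.
    rng = np.random.default_rng(rng)
    return rng.random() < logistic_reward_prob(a, theta, beta)
```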
URL: https://openreview.net/forum?id=94y5XfiJ7N
---
Title: kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions
Abstract: We study a missing-value imputation method, termed kNNSampler, that imputes a given unit's missing response by randomly sampling from the observed responses of the k most similar units to the given unit in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation.
Unlike the popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate.
Experiments demonstrate its effectiveness in recovering the distribution of missing values.
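A minimal sketch of the sampling step as described above, assuming Euclidean nearest neighbours over the observed covariates; the paper's exact neighbourhood construction and multiple-imputation usage may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_sampler(X_obs, y_obs, X_miss, k=5, seed=None):
    # For each unit with a missing response, find its k nearest neighbours in
    # covariate space among fully observed units and sample one of their
    # observed responses uniformly at random.
    rng = np.random.default_rng(seed)
    _, neighbour_idx = NearestNeighbors(n_neighbors=k).fit(X_obs).kneighbors(X_miss)
    picks = rng.integers(0, k, size=len(X_miss))
    return np.asarray(y_obs)[neighbour_idx[np.arange(len(X_miss)), picks]]
```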
URL: https://openreview.net/forum?id=4CDnIACCQG
---
Title: StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization
Abstract: The integration of large language models (LLMs) into information retrieval systems introduces new attack surfaces, particularly for adversarial ranking manipulations. We present StealthRank, a novel adversarial attack method that manipulates LLM-driven ranking systems while maintaining textual fluency and stealth. Unlike existing methods that often introduce detectable anomalies, StealthRank employs an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts (SRPs)—adversarial text sequences embedded within item or document descriptions that subtly yet effectively influence LLM ranking mechanisms. We evaluate StealthRank across multiple LLMs, demonstrating its ability to covertly boost the ranking of target items while avoiding explicit manipulation traces. Our results show that StealthRank consistently outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth, highlighting critical vulnerabilities in LLM-driven ranking systems.
URL: https://openreview.net/forum?id=iQe2hBBUn0
---
Title: Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models?
Abstract: As the demand for high-quality training data escalates, researchers have increasingly turned to generative models to create synthetic data, addressing data scarcity and enabling continuous model improvement. However, reliance on self-generated data introduces a critical question: \textit{Will this practice amplify bias in future models?} While most research has focused on overall performance, the impact on model bias, particularly subgroup bias, remains underexplored. In this work, we investigate the effects of the generated data on image classification tasks, with a specific focus on bias. We develop a practical simulation environment that integrates a self-consuming loop, where the generative model and classification model are trained synergistically. Hundreds of experiments are conducted on Colorized MNIST, CIFAR-20/100, and Hard ImageNet datasets to reveal changes in fairness metrics across generations. In addition, we provide a conjecture to explain the bias dynamics when training models on continuously augmented datasets across generations. Our findings contribute to the ongoing debate on the implications of synthetic data for fairness in real-world applications.
URL: https://openreview.net/forum?id=cjZ4LMxX4n
---
Title: Denoising Pretrained Black-box Models via Amplitude-Guided Phase Realignment
Abstract: Pre-trained models tend to inherit noisy label information from their training datasets, internalising it as biased knowledge. While learning with label noise has been explored, existing approaches rarely address the mitigation of biased knowledge embedded in pre-trained representations introduced by noisy labels. Moreover, existing denoising methods invariably rely on modifying training datasets or models to improve downstream task performance. However, we observe a growing trend in which both pre-trained models and their training datasets are scaling up significantly and becoming increasingly inaccessible, making modifications ever more infeasible. In this paper, we propose a black-box biased knowledge mitigation method called ``Lorem'', which leverages feature frequency amplitudes to guide phase correction on pre-trained representations, without access to training data or model parameters. We first present empirical evidence that, across different noise levels, the phase components of pre-trained representations are more sensitive to noisy labels than the amplitude components, while discriminative information for classification is primarily encoded in the amplitude. Moreover, we find that the impact of noisy labels on amplitude is global, leading to a gradual loss of discriminative information. Therefore, corrective strategies must be adaptive across the entire frequency spectrum rather than limited to the high-frequency components. Inspired by this observation, we design a method that leverages the amplitude residual to realign phase, thereby removing biased knowledge from pre-trained representations. Experiments on a variety of popular pre-trained vision and language models suggest that, even with a simple linear classifier, our method can enhance downstream performance across a range of in-domain and out-of-domain tasks.
URL: https://openreview.net/forum?id=526fwttJiK
---
Title: Mixtures of Locally Bounded Langevin dynamics for Bayesian Model Averaging
Abstract: Properties of probability distributions change when going from low to high dimensions, to the extent that they admit counterintuitive behavior. Gaussian distributions intuitively illustrate a well-known effect of moving to higher dimensions, namely that the typical set almost surely does not contain the mean, which is the distribution’s most probable point. This can be problematic in Bayesian Deep Learning, as the samples drawn from the high-dimensional posterior distribution are often used as Monte Carlo samples to estimate the integral of the predictive distribution. Here, the predictive distribution will reflect the behavior of the samples and, therefore, of the typical set. For instance, we cannot expect to sample networks close to the maximum a posteriori estimate after fitting a Gaussian approximation to the posterior using the Laplace method. In this paper, we introduce a method that aims to mitigate this typicality problem in high dimensions by sampling from the posterior with Langevin dynamics on a restricted support enforced by a reflective boundary condition. We demonstrate how this leads to improved posterior estimates by illustrating its capacity for fine-grained out-of-distribution (OOD) ranking on the Morpho-MNIST dataset.
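For intuition, a one-dimensional toy sketch of unadjusted Langevin dynamics with a reflective boundary restricting the support (the Gaussian target, step size, and interval are arbitrary stand-ins, not the paper's posterior or settings):

    import numpy as np

    def reflect(x, lo, hi):
        """Fold a proposal back into [lo, hi] (reflective boundary condition)."""
        width = hi - lo
        x = np.mod(x - lo, 2 * width)
        return lo + np.where(x > width, 2 * width - x, x)

    def grad_log_p(x):                      # toy target: standard Gaussian log-density gradient
        return -x

    rng = np.random.default_rng(0)
    x, eps, lo, hi = 0.0, 1e-2, -1.5, 1.5   # restricted support [-1.5, 1.5]
    samples = []
    for _ in range(20000):
        x = x + eps * grad_log_p(x) + np.sqrt(2 * eps) * rng.standard_normal()
        x = reflect(x, lo, hi)              # keep the chain inside the support
        samples.append(float(x))
    s = np.array(samples)
    print(f"samples lie in [{s.min():.2f}, {s.max():.2f}], empirical std = {s.std():.2f}")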
URL: https://openreview.net/forum?id=ibqfadKjgo
---
Title: Sociodynamics of Reinforcement Learning
Abstract: Reinforcement Learning (RL) has emerged as a core algorithmic paradigm explicitly driving innovation in a growing number of industrial applications, including large language models and quantitative finance. Furthermore, computational neuroscience has long found evidence of natural forms of RL in biological brains. Therefore, it is crucial for the study of social dynamics to develop a scientific understanding of how RL shapes population behaviors. We leverage the framework of Evolutionary Game Theory (EGT) to provide building blocks and insights toward this objective. We propose a methodology that enables simulating large populations of RL agents in simple game theoretic interaction models. More specifically, we derive fast and parallelizable implementations of two fundamental revision protocols from multi-agent RL - Policy Gradient (PG) and Learning with Opponent-Learning Awareness (LOLA) - tailored for population simulations of random pairwise interactions in stateless normal-form games. Our methodology enables us to simulate large populations of 200,000 heterogeneous co-learning agents, yielding compelling insights into how non-stationarity-aware learners affect social dynamics.
In particular, we find that LOLA learners promote cooperation in the Stag Hunt model, delay cooperative outcomes in the Hawk-Dove model, and reduce strategy diversity in the Rock-Paper-Scissors model.
URL: https://openreview.net/forum?id=Ro6Ylnx8se
---
Title: The Initialization Determines Whether In-Context Learning Is Gradient Descent
Abstract: In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for the ICL linear regression setup. Our experiments confirm this result and further observe that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce $y_q$-LSA, a simple generalization of single-head LSA with a trainable initial guess $y_q$. We theoretically establish the capabilities of $y_q$-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widespread LLMs augmented with initial guess capabilities, and show that their performance is improved on a semantic similarity task.
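A small numerical sketch of why initialization matters for one-step gradient descent in in-context linear regression (our construction with a quadratic loss and a weight-space initial guess; the paper's $y_q$ initial guess and LSA parameterization are not reproduced here):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_ctx, n_tasks = 8, 16, 2000
    mu = np.full(d, 1.0)                        # non-zero prior mean of the task weights

    def one_step_gd_pred(X, y, x_q, w0, lr=0.5):
        """Predict the query label after a single GD step on the in-context squared loss."""
        grad = X.T @ (X @ w0 - y) / len(y)
        return (w0 - lr * grad) @ x_q

    err_zero = err_prior = 0.0
    for _ in range(n_tasks):
        w_star = mu + rng.standard_normal(d)    # task weights drawn around a non-zero mean
        X = rng.standard_normal((n_ctx, d))
        y = X @ w_star
        x_q = rng.standard_normal(d)
        target = w_star @ x_q
        err_zero += (one_step_gd_pred(X, y, x_q, np.zeros(d)) - target) ** 2
        err_prior += (one_step_gd_pred(X, y, x_q, mu) - target) ** 2
    print("one-step GD from zero init      :", err_zero / n_tasks)
    print("one-step GD from the prior mean :", err_prior / n_tasks)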
URL: https://openreview.net/forum?id=fvqSKLDtJi
---
Title: Bootstrapping Task Spaces for Self-Improvement
Abstract: Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
URL: https://openreview.net/forum?id=k2VsgUxC6X
---
Title: Statistical Guarantees for Approximate Stationary Points of Shallow Neural Networks
Abstract: Since statistical guarantees for neural networks are usually restricted to global optima of intricate objective functions, it is unclear whether these theories explain the performances of actual outputs of neural network pipelines. The goal of this paper is, therefore, to bring statistical theory closer to practice. We develop statistical guarantees for shallow linear neural networks that coincide up to logarithmic factors with the global optima but apply to stationary points and the points nearby. These results support the common notion that neural networks do not necessarily need to be optimized globally from a mathematical perspective. We then extend our statistical guarantees to shallow ReLU neural networks, assuming the first layer weight matrices are nearly identical for the stationary network and the target. More generally, despite being limited to shallow neural networks for now, our theories make an important step forward in describing the practical properties of neural networks in mathematical terms.
URL: https://openreview.net/forum?id=PNUMiLbLml
---
Title: CAPE: Generalized Convergence Prediction Across Architectures Without Full Training
Abstract: Training deep neural networks to convergence is expensive and time-consuming, especially when exploring new architectures or hardware setups. Prior work has focused on estimating per-iteration cost or total training time assuming a fixed step count, but has largely ignored the critical challenge of predicting how many steps a model will take to converge. We introduce CAPE (Convergence-Aware Prediction Engine), a lightweight, probing-based system that accurately predicts the number of training steps required for convergence without executing full training runs. CAPE probes models at initialization using a small batch of data to extract both structural and dynamical features, including parameter count, gradient norm, NTK trace, dataset size, and learning rate. Using these features, we build a meta-dataset spanning a wide range of model types and train a meta-model to forecast convergence steps. CAPE attains mean absolute errors of 3–9 optimization steps across MLP, CNN, RNN, and Transformer models, consistently surpassing strong baselines. This performance remains stable across a fourfold range in typical convergence horizons (15–60 steps), offering practical value for rapid model selection and budget planning. CAPE offers a practical and generalizable solution for convergence forecasting, supporting faster model selection, efficient scheduling, and resource-aware training.
URL: https://openreview.net/forum?id=wGngf0wBYn
---
Title: SACrificing Intuition: Kullback-Leibler Regularized Actor-Critic
Abstract: One of the most popular algorithms in reinforcement learning is Soft Actor-Critic (SAC), as it promises to elegantly incorporate exploration into the optimization process.
We revisit SAC through the lens of constrained optimization and develop \underline{K}ullback-\underline{L}eibler \underline{A}ctor-\underline{C}ritic (KLAC), a principled extension of Soft Actor Critic that replaces the heuristic entropy bonus of SAC with a Kullback-Leibler regulariser against an arbitrary reference policy.
We contrast Kullback-Leibler Actor Critic with Soft Actor Critic and demonstrate, analytically and with a concrete counterexample, that injecting the entropy term directly into the reward, as implemented in Soft Actor Critic, violates the convexity assumptions of the dual proof of near-optimality and can render the learned policy arbitrarily sub-optimal, regardless of how small the temperature is chosen. This understanding reveals a fundamental systemic flaw in SAC, especially for sparse reward environments.
To retain the empirical exploration benefits without sacrificing theoretical soundness, we introduce a fixed uniform reward bias that captures the intrinsic motivation effect to \textit{stay alive}. Additionally, we propose a Kullback-Leibler annealing schedule that unifies discrete and continuous action spaces by mapping an intuitive probability of exploitation to a closed-form entropy or Kullback-Leibler target.
Together, these contributions yield an algorithm that at least matches the sample efficiency and performance of Soft Actor Critic as demonstrated on MuJoCo and MinAtar benchmarks while enjoying provable near optimality, interpretable hyperparameters, and a theoretically grounded exploration mechanism.
We provide code to reproduce all plots in the paper.
URL: https://openreview.net/forum?id=A7phI8qbdU
---
Title: MDTree: A Masked Dynamic Autoregressive Model for Phylogenetic Inference
Abstract: Phylogenetic tree inference requires optimizing both branch lengths and topologies, yet traditional MCMC-based methods suffer from slow convergence and high computational cost. Recent deep learning approaches improve scalability but remain constrained: Bayesian models are computationally intensive, autoregressive methods depend on fixed species orders, and flow-based models underutilize genomic signals. Fixed-order autoregression introduces an inductive bias misaligned with evolutionary proximity: early misplacements distort subsequent attachment probabilities and compound topology errors (exposure bias). Absent sequence-informed priors, the posterior over the super-exponential topology space remains diffuse and multimodal, yielding high-variance gradients and sluggish convergence for both MCMC proposals and neural samplers.
We propose MDTree, a masked dynamic autoregressive framework that integrates genomic priors into a Dynamic Ordering Network to learn biologically informed node sequences. A dynamic masking mechanism further enables parallel node insertion, improving efficiency without sacrificing accuracy. Experiments on standard benchmarks demonstrate that MDTree outperforms existing methods in accuracy and runtime while producing biologically coherent phylogenies, providing a scalable solution for large-scale evolutionary analysis.
URL: https://openreview.net/forum?id=dTSptQNygv
---
Title: Towards Fast Safe Online Reinforcement Learning via Policy Finetuning
Abstract: High costs and risks involved in extensive environmental interactions hinder the practical application of current online safe reinforcement learning (RL) methods. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online learning, a direction that has yet to be fully investigated. To fill this gap, we first show that naively applying existing O2O algorithms from standard RL would not work well in safe RL due to two unique challenges: \emph{erroneous Q-estimations}, resulting from the offline-online objective mismatch and offline cost sparsity, and \emph{Lagrangian mismatch}, resulting from difficulties in aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce \textbf{Marvel}, the first policy-finetuning based framework for O2O safe RL, comprising two key components that work in concert: \emph{Value Pre-Alignment} to align the learned Q-functions with the online objective before finetuning, and \emph{Adaptive PID Control} to effectively adjust the Lagrange multipliers during finetuning. Extensive experiments demonstrate the superior performance of Marvel over related baselines.
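PID control of a Lagrange multiplier is a known device in constrained RL; the sketch below shows a generic version that adjusts the multiplier from episodic cost violations (gains, cost limit, and the cost trace are illustrative, not Marvel's settings):

    class PIDLagrangian:
        """Adjust a Lagrange multiplier so that average episodic cost tracks a budget."""
        def __init__(self, cost_limit, kp=0.05, ki=0.01, kd=0.05):
            self.cost_limit, self.kp, self.ki, self.kd = cost_limit, kp, ki, kd
            self.integral, self.prev_error = 0.0, 0.0

        def update(self, episode_cost):
            error = episode_cost - self.cost_limit             # positive when the constraint is violated
            self.integral = max(0.0, self.integral + error)    # anti-windup: keep the integral non-negative
            derivative, self.prev_error = error - self.prev_error, error
            return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)

    pid = PIDLagrangian(cost_limit=25.0)
    for cost in [40.0, 35.0, 30.0, 26.0, 24.0]:                # illustrative episodic costs
        print(f"cost={cost:5.1f}  lambda={pid.update(cost):.3f}")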
URL: https://openreview.net/forum?id=1SO7vmLFUq
---
Title: Language Models are Symbolic Learners in Arithmetic
Abstract: The prevailing question about LMs performing arithmetic is whether these models learn to truly compute or simply master superficial pattern matching. In this paper, we argue for the latter, presenting evidence that LMs act as greedy symbolic learners, prioritizing the simplest possible shortcuts that fit the statistics of the dataset to solve arithmetic tasks. To investigate this, we introduce \textbf{subgroup induction}, a practical framework adapted from Solomonoff Induction (SI), one of the most powerful universal predictors. Our framework analyzes arithmetic problems by breaking them down into ``subgroups'': minimal mappings between a few input digits and a single output digit. Our primary metric, subgroup quality, measures the viability of these shortcuts. Experiments reveal a distinct U-shaped accuracy pattern in multi-digit multiplication: LMs quickly master the first and last output digits while struggling with those in the middle. We demonstrate this U-shape is not coincidental; it perfectly mirrors the quality of the simplest possible subgroups, those requiring the fewest input tokens. This alignment suggests a core learning mechanism: LMs first learn easy, low-token shortcuts and only incorporate more complex, multi-token patterns as training progresses. They do not learn the algorithm of multiplication but rather a hierarchy of increasingly complex symbol-to-symbol mappings. Ultimately, our findings suggest that the path to arithmetic mastery for LMs is not paved with algorithms, but with a cascade of simple, hierarchically-learned symbolic shortcuts.
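As a concrete instance of the kind of low-token shortcut discussed above (our toy check, not the paper's subgroup-induction framework): the units digit of a product is a function of the operands' units digits alone, whereas middle digits additionally depend on carries from many digit pairs.

    # The units digit of a*b depends only on (a mod 10, b mod 10): a minimal two-input-digit
    # "subgroup" mapping to a single output digit.
    import random

    table = {(a, b): (a * b) % 10 for a in range(10) for b in range(10)}
    random.seed(0)
    for _ in range(1000):
        a, b = random.randrange(100, 1000), random.randrange(100, 1000)
        assert table[(a % 10, b % 10)] == (a * b) % 10
    print("last-digit shortcut holds for all sampled products; middle digits admit no such small mapping")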
URL: https://openreview.net/forum?id=QSblPg1xUM
---
Title: DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations
Abstract: Direct Preference Optimization (DPO) has shown significant promise in reducing hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO methods suffer from overfitting due to difficulty-level imbalance in preference data. Our analysis reveals that MLLMs tend to overfit on easily distinguishable pairs, which limits their ability to remove hallucinations in a fine-grained manner and impairs the model’s comprehensive ability.
To address this challenge, we introduce Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework comprising two key components: (1)\textit{Difficulty Estimation}, where we leverage pre-trained vision-language models with complementary generative and contrastive objectives, integrating their outputs through a distribution-aware voting strategy to obtain robust difficulty scores without additional training; and (2) \textit{Difficulty-Aware Training}, where we reweight preference data according to the estimated difficulty, down-weighting easy samples while emphasizing harder ones to mitigate overfitting.
This paradigm enhances preference optimization by efficiently exploiting challenging examples without requiring new data or additional fine-tuning stages.
Extensive experiments demonstrate that DA-DPO significantly improves multimodal preference optimization, achieving stronger robustness against hallucinations and better generalization on standard benchmarks, all in a cost-efficient manner.
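A generic sketch of difficulty-aware reweighting applied to a DPO-style log-sigmoid objective (the weighting function and the numbers below are assumptions, not DA-DPO's exact scheme):

    import numpy as np

    def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, difficulty, beta=0.1):
        """Per-pair DPO-style loss with difficulty-based reweighting.
        logp_* / ref_logp_*: policy and reference log-likelihoods of chosen (w) and rejected (l) responses.
        difficulty: scores in [0, 1]; easier pairs receive smaller weights."""
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        per_pair = np.log1p(np.exp(-margin))          # -log sigmoid(margin)
        weights = 0.5 + 0.5 * difficulty              # illustrative reweighting rule
        return float(np.mean(weights * per_pair))

    logp_w, logp_l = np.array([-10.0, -12.0]), np.array([-15.0, -12.5])
    ref_w, ref_l = np.array([-11.0, -12.0]), np.array([-14.0, -12.4])
    print(weighted_dpo_loss(logp_w, logp_l, ref_w, ref_l, difficulty=np.array([0.2, 0.9])))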
URL: https://openreview.net/forum?id=M52CgPcgGx
---
Title: More of Less: A Rashomon Algorithm for Sparse Model Sets
Abstract: The current paradigm of machine learning consists in finding a single best model to deliver predictions and, if possible, interpretations for a specific problem. This paradigm has, however, been strongly challenged in recent years through the study of the Rashomon Effect, a term initially coined by Leo Breiman. This phenomenon occurs when there exist many good predictive models for a given dataset/problem, with considerable practical implications in terms of interpretation, usability, variable importance, replicability and many others. The set of models (within a specific class of functions) which respect this definition is referred to as the Rashomon set, and a substantial amount of recent work has focused on ways of finding these sets as well as studying their properties. Developed in parallel to current research on the Rashomon Effect and motivated by sparse latent representations for high-dimensional problems, we present a heuristic procedure that aims to find sets of sparse models with good predictive power through a greedy forward-search that explores the low-dimensional variable space. Throughout this algorithm, good low-dimensional models identified in the previous steps are used to build models with more variables in the following steps. While preserving almost-equal performance with respect to a single reference model in a given class (i.e. a Rashomon set), the sparse model sets from this algorithm include diverse models which can be combined into networks that deliver additional layers of interpretation and new insights into how variable combinations can explain the Rashomon Effect.
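A minimal sketch of a greedy forward search that, at each sparsity level, keeps every OLS model whose fit lies within a tolerance of that level's best model (our simplification for illustration, not the authors' procedure):

    import numpy as np

    def r2(X, y, cols):
        Xs = X[:, sorted(cols)]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

    def forward_rashomon(X, y, max_size=3, eps=0.02):
        """At each sparsity level, keep every model whose R^2 is within eps of the level's best."""
        level = {frozenset([j]): r2(X, y, [j]) for j in range(X.shape[1])}
        best = max(level.values())
        levels = [{s: v for s, v in level.items() if v >= best - eps}]
        for _ in range(max_size - 1):
            cand = {s | {j}: None for s in levels[-1] for j in range(X.shape[1]) if j not in s}
            cand = {s: r2(X, y, s) for s in cand}               # extend kept models by one variable
            best = max(cand.values())
            levels.append({s: v for s, v in cand.items() if v >= best - eps})
        return levels

    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 8))
    y = X[:, 0] + 0.9 * X[:, 1] + 0.9 * X[:, 2] + 0.3 * rng.standard_normal(300)
    for size, models in enumerate(forward_rashomon(X, y), start=1):
        print(f"size {size}: {len(models)} near-best model(s) kept")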
URL: https://openreview.net/forum?id=KfHXF1r6Cz
---
Title: A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond
Abstract: Generative modelling has become the standard approach for synthesising tabular data. However, different use cases demand synthetic data to comply with different requirements to be useful in practice. In this survey, we review deep generative modelling approaches for tabular data from the perspective of four types of requirements: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities. We group the approaches along two levels of granularity: (i) based on the primary type of requirements they address and (ii) according to the underlying model they utilise. Additionally, we summarise the appropriate evaluation methods for each requirement and the specific characteristics of each model type. Finally, we discuss future directions for the field, along with opportunities to improve the current evaluation methods. Overall, this survey can be seen as a user guide to tabular data generation: helping readers navigate available models and evaluation methods to find those best suited to their needs.
URL: https://openreview.net/forum?id=RoShSRQQ67
---
Title: COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
Abstract: Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To address this, we introduce COMPASS (COntinual Multilingual PEFT with Adaptive Semantic Sampling), a novel data-centric framework for adapting LLMs to target languages. COMPASS leverages parameter-efficient fine-tuning (PEFT) by training lightweight, language-specific adapters on a judiciously selected subset of auxiliary multilingual data. The core of our method is a distribution-aware sampling strategy that uses multilingual embeddings and clustering to identify semantic gaps between existing training data and a target usage distribution. By prioritizing auxiliary data from under-represented semantic clusters, COMPASS maximizes positive cross-lingual transfer while minimizing interference. We extend this into a continual learning framework, COMPASS-ECDA, which monitors for data distribution shifts in production and dynamically updates adapters to prevent model staleness, balancing adaptation to new data with the preservation of existing knowledge. Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.
URL: https://openreview.net/forum?id=oapsbIO1Bd
---
Title: The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why - A Survey from MARL to Emergent Language and LLMs
Abstract: Multi-agent sequential decision-making underlies many real-world systems, from autonomous vehicles and robotics to collaborative AI assistants. In dynamic and partially observable environments, effective communication is crucial for reducing uncertainty and enabling coordination. While research in multi-agent communication (MA-Comm) spans diverse methods and paradigms, its central challenges can often be understood through the guiding lens of the Five Ws of communication: who talks to whom, when to speak, what to convey, and why communication is beneficial. These questions provide an intuitive thread across different approaches, even when not used as explicit section divisions. Progress in this field has been rapid. Within the Multi-Agent Reinforcement Learning (MARL) framework, early work emphasized static, hand-designed protocols, while later approaches introduced trainable, end-to-end communication models optimized with deep learning. This shift sparked interest in emergent language, where agents develop symbolic or structured messaging strategies through interaction. More recently, large language models (LLMs) have opened new possibilities, enabling natural language as a medium for reasoning, planning, and collaboration in more open-ended environments. Despite this momentum, there is still no dedicated survey that brings together these different lines of work. Most existing reviews focus narrowly on MARL, without fully addressing how communication is evolving from simple message passing to symbolic reasoning and language use. This paper aims to fill that gap. We provide a structured survey of MA-Comm, spanning traditional MARL approaches and emergent language studies. In light of growing interest in agentic and embodied AI, we also examine how LLMs are reshaping communication in both MARL contexts and broader multi-agent ecosystems. By using the Five Ws as a conceptual lens, our goal is to clarify the landscape, highlight key trends, and provide a foundation for future research at the intersection of communication, coordination, and learning in multi-agent systems.
URL: https://openreview.net/forum?id=LGsed0QQVq
---
Title: Implicit Reasoning in Large Language Models: A Comprehensive Survey
Abstract: Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Reasoning with LLMs is central to solving multi-step problems and complex decision-making. To support efficient reasoning, recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning, where reasoning occurs silently via latent structures without emitting intermediate textual steps. Implicit reasoning brings advantages such as lower generation cost, faster inference, and better alignment with internal computation. Although prior surveys have discussed latent representations in the context of reasoning, a dedicated and mechanism-level examination of how reasoning unfolds internally within LLMs remains absent. This survey fills that gap by introducing a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies. We organize existing methods into three execution paradigms based on how and where internal computation unfolds: latent optimization, signal-guided control, and layer-recurrent execution. We also review structural, behavioral and representation-based evidence that supports the presence of implicit reasoning in LLMs. We further provide a structured overview of the evaluation metrics and benchmarks used in existing works to assess the effectiveness and reliability of implicit reasoning. We maintain a continuously updated project at: https://anonymous.4open.science/r/TMLR-LLM-Implicit-Reasoning-Survey-E4D6.
URL: https://openreview.net/forum?id=mPpJlp4lyU
---
Title: Understanding Class Bias Amplification in Graph Representation Learning
Abstract: Recent research reveals that GNN-based graph representation learning may inadvertently introduce various structural biases. In this work, we discover a phenomenon of structural bias in graph representation learning called class bias amplification, which refers to the exacerbation of performance bias between different classes by the GNN encoder. We conduct an in-depth theoretical study of this phenomenon from a novel spectral perspective. Our analysis suggests that structural disparities between nodes in different classes result in varying local convergence speeds for node embeddings. This phenomenon leads to bias amplification in the classification results of downstream tasks. Based on the theoretical insights, we propose random graph coarsening, which is proved to be effective in dealing with the above issue. Finally, we propose an unsupervised graph contrastive learning model called Random Graph Coarsening Contrastive Learning (RGCCL), which utilizes random coarsening as data augmentation and mitigates class bias amplification by contrasting the coarsened graph with the original graph. Extensive experiments on various datasets demonstrate the advantage of our method when dealing with class bias amplification.
URL: https://openreview.net/forum?id=SqpgDUdRE9
---
Title: Decoding Generalization from Memorization in Deep Neural Networks
Abstract: Overparameterized Deep Networks that generalize well have been key to the dramatic success of Deep Learning in recent years. The reasons for their remarkable ability to generalize are not well understood yet. It has also been known that Deep Networks possess the ability to memorize training data, as evidenced by perfect or high training accuracies on models trained with corrupted data that have class labels shuffled to varying degrees. Concomitantly, such models are known to generalize poorly, i.e. they suffer from poor test accuracies, due to which it is thought that the act of memorizing substantially degrades the innate ability to generalize. It has, however, been unclear why the poor generalization that accompanies such memorization, comes about. One possibility is that during training, the layers of the network irretrievably re-organize their representations in a manner that makes generalization difficult. The other possibility is that the network retains significant latent ability to generalize, but the trained network somehow “chooses” to readout in a manner that is detrimental to generalization. Here, we provide evidence for the latter possibility by demonstrating, empirically, that such models possess information in their representations for substantially-improved generalization. Furthermore, such generalization abilities can be easily decoded from the internals of the trained model, and we build a technique to do so. We demonstrate results on multiple models trained with standard datasets.
URL: https://openreview.net/forum?id=BeT6jaD6ao
---
Title: Forget Less, Retain More: A Lightweight Regularizer for Rehearsal-Based Continual Learning
Abstract: Deep neural networks suffer from catastrophic forgetting, where performance on previous tasks degrades after training on a new task. This issue arises due to the model’s tendency to overwrite previously acquired knowledge with new information. We present a novel approach to address this challenge, focusing on the intersection of memory-based methods and regularization approaches. We formulate a regularization strategy, termed Information Maximization (IM) regularizer, for memory-based continual learning methods, which is based exclusively on the expected label distribution, thus making it class-agnostic. As a consequence, IM regularizer can be directly integrated into various rehearsal-based continual learning methods, reducing forgetting and favoring faster convergence. Our empirical validation shows that, across datasets and regardless of the number of tasks, our proposed regularization strategy consistently improves baseline performance at the expense of a minimal computational overhead. The lightweight nature of IM ensures that it remains a practical and scalable solution, making it applicable to real-world continual learning scenarios where efficiency is paramount. Finally, we demonstrate the data-agnostic nature of our regularizer by applying it to video data, which presents additional challenges due to its temporal structure and higher memory requirements. Despite the significant domain gap, our experiments show that IM regularizer also improves the performance of video continual learning methods.
URL: https://openreview.net/forum?id=CJw1ZjkJMG
---
Title: LGGBench: a Holistic Benchmark for Large Graph Generation
Abstract: The escalating demand for robust graph data sharing between organizations has propelled the development of methodologies that assess the efficacy and privacy of these shared graphs. We present LGGBench, a comprehensive benchmark designed to evaluate large graph generation methods across multiple dimensions crucial to proprietary data sharing. This benchmark integrates a diverse array of large graph datasets, sophisticated graph generation techniques, and comprehensive evaluation schemes to address the current shortcomings in graph data sharing. Our benchmark evaluates the generated graphs in terms of fidelity, utility, privacy, and scalability. Fidelity is assessed through graph statistical metrics, while utility measures the practical applicability of synthetic graphs in real-world tasks. Privacy is ensured through robust mechanisms against various inference attacks, and scalability is demonstrated through the benchmark's ability to handle extensive graph datasets efficiently. Through extensive experiments, we compare existing graph generation methods, highlighting their strengths and limitations across different types of graphs and evaluation metrics. The benchmark provides a holistic approach to evaluate and improve graph generation techniques, facilitating safer and more effective data sharing practices.
URL: https://openreview.net/forum?id=xW7u1Mlwds
---
Title: MixTraining: A Better Trade-Off Between Compute and Performance
Abstract: Integrating self-supervised learning (SSL) prior to supervised learning (SL) is a prevalent strategy for enhancing model performance, especially in scenarios with limited labeled data. Nonetheless, this approach inherently introduces a trade-off between computational efficiency and performance gains. Although SSL significantly improves representation learning, it necessitates an additional and often computationally expensive training phase, posing substantial overhead in resource-constrained environments. To mitigate these limitations, we propose MixTraining, a novel training framework designed to interleave multiple epochs of SSL and SL within a unified $\textit{mixtraining phase}$. This phase enables a seamless transition between self-supervised and supervised objectives, facilitating enhanced synergy and improved overall accuracy. Additionally, MixTraining consolidates shared computational steps, thereby reducing redundant computations and lowering overall training latency. Comprehensive experimental evaluations demonstrate that MixTraining provides a superior trade-off between computational efficiency and model performance compared to conventional training pipelines. Specifically, on the TinyImageNet dataset using the ViT-Tiny model, MixTraining achieves an absolute accuracy improvement of 8.81% (a relative gain of 18.89%) while concurrently accelerating training by 1.29$\times$.
URL: https://openreview.net/forum?id=NVpS2g9KRo
---
Title: Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners
Abstract: Recent advancements have demonstrated that the performance of large language models (LLMs) can be significantly enhanced by scaling computational resources at test time.
A common strategy involves generating multiple Chain-of-Thought (CoT) trajectories and aggregating their outputs through various selection mechanisms.
This raises a fundamental question: can models with lower complexity leverage their superior generation throughput to outperform similarly sized Transformers for a fixed computational budget?
To address this question and overcome the lack of strong subquadratic reasoners, we distill pure and hybrid Mamba models from pretrained Transformers.
Trained on only 8 billion tokens, our distilled models exhibit strong performance and scaling on mathematical reasoning datasets while being much faster at inference for large batches and long sequences.
Despite the zero-shot performance hit due to distillation, both pure and hybrid Mamba models can scale their coverage and accuracy performance past their Transformer teacher models under fixed time budgets, opening a new direction for scaling inference compute.
URL: https://openreview.net/forum?id=lQpNAwWM9f
---
Title: HyPE-GT: where Graph Transformers meet Hyperbolic Positional Encodings
Abstract: Graph Transformers (GTs) facilitate the comprehension of complex relationships on graph-structured data by leveraging self-attention of the possible pairs of nodes. The structural information or inductive bias of the input graph is provided as positional encodings into the GT. The positional encodings are mostly Euclidean and are not able to capture the complex hierarchical relationships of the corresponding nodes. To address the limitation, we introduce a novel and efficient framework, HyPE, that generates learnable positional encodings in the non-Euclidean hyperbolic space that captures the intricate hierarchical relationships of the underlying graphs. Unlike existing methods, HyPE can generate a set of hyperbolic positional encodings, empowering us to explore diverse options for the optimal selection of PEs for specific downstream tasks. Additionally, we repurpose the generated hyperbolic positional encodings to mitigate the impact of oversmoothing in deep Graph Neural Networks (GNNs). Furthermore, we provide extensive theoretical underpinnings to offer insights into the working mechanism of the HyPE framework. Comprehensive experiments on four molecular benchmarks, including the four large-scale Open Graph Benchmark (OGB) datasets, substantiate the effectiveness of hyperbolic positional encodings in enhancing the performance of Graph Transformers. We also consider Coauthor and Copurchase networks to establish the efficacy of HyPE in controlling oversmoothing in deep GNNs.
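Hyperbolic positional encodings are commonly obtained by mapping Euclidean vectors into a hyperbolic model via the exponential map at the origin; the sketch below shows this standard construction on the Poincaré ball (a generic recipe, not necessarily HyPE's learnable parameterization):

    import numpy as np

    def expmap0(v, c=1.0, eps=1e-9):
        """Exponential map at the origin of the Poincare ball with curvature -c:
        maps Euclidean (tangent-space) vectors to points strictly inside the unit ball."""
        norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
        return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

    # e.g. map Euclidean positional encodings (random stand-ins here) into hyperbolic space
    pe_euclidean = np.random.default_rng(0).standard_normal((5, 8))
    pe_hyperbolic = expmap0(pe_euclidean)
    print(np.linalg.norm(pe_hyperbolic, axis=-1))   # every norm is strictly below 1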
URL: https://openreview.net/forum?id=nOdgz1DojX
---
Title: Robust Reinforcement Learning in a Sample-Efficient Setting
Abstract: The performance of reinforcement learning (RL) in real-world applications can be hindered by the absence of robustness and safety in the learned policies. More specifically, an RL agent that trains in a certain Markov decision process (MDP) often struggles to perform well in MDPs that slightly deviate. To address this issue, we employ the framework of Robust MDPs (RMDPs) in a model-based setting and introduce a second learned transition model. Our method specifically incorporates an auxiliary pessimistic model, updated adversarially, to estimate the worst-case MDP within a Kullback-Leibler uncertainty set. In comparison to several existing works, our method does not impose any additional conditions on the training environment, such as the need for a parametric simulator. To test the effectiveness of the proposed pessimistic model in enhancing policy robustness, we integrate it into a practical RL algorithm, called Robust Model-Based Policy Optimization (RMBPO). Our experimental results indicate a notable improvement in policy robustness on high-dimensional control tasks, with the auxiliary model enhancing the performance of the learned policy in distorted MDPs, while maintaining the data-efficiency of the base algorithm. Our methodology is also compared against various other robust RL approaches. We further examine how pessimism is achieved by exploring the learned deviation between the proposed auxiliary world model and the nominal model. By introducing a pessimistic world model and demonstrating its role in improving policy robustness, our research presents a general methodology for robust reinforcement learning in a model-based setting.
URL: https://openreview.net/forum?id=iij6nLYLjF
---
Title: From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications
Abstract: Keypoint detection underpins many vision pipelines, from human-pose estimation and viewpoint recovery to 3D reconstruction. Yet, modern neural models remain vulnerable to subtle input variations. Despite its importance, robustness verification for keypoint detection remains largely unexplored due to the high dimensionality of input spaces and the complexity of deep models. In this work, we verify a property that bounds the joint deviation across all keypoints, capturing interdependencies among keypoints specified by system designers or derived from downstream performance requirements (e.g., pose-based error budgets). A few existing approaches reformulate the problem by decoupling each keypoint (or its neighboring pixels) into independent classification tasks, leading to overly conservative guarantees and failing to account for the collective role keypoints play in downstream tasks. We address this gap with the first coupled robustness verification framework for heatmap-based keypoint detectors under joint specifications. Our method supports any backbone architecture (e.g., CNN, ResNet, Transformer) that produces per-keypoint heatmaps, followed by a max-activation operation to extract coordinates. To do so, we combine the reachability and optimization methodologies by formulating robustness verification as a property falsification problem using a Mixed-Integer Linear Program (MILP) that combines (i) reachable sets of heatmap outputs, obtained via existing reachability analysis tools, and (ii) a polytope encoding the joint keypoint deviation constraint. Infeasibility of the MILP certifies robustness, while feasibility yields a potential counterexample. We prove that our method is sound, that is, it certifies robustness only when the property truly holds. Experiments demonstrate that our coupled method achieves a verified rate comparable to the testing-based method when the keypoint error thresholds are not tight. Moreover, under stricter keypoint error thresholds, our method maintains a high verified rate, whereas the decoupled approach fails to verify the robustness of any image in these scenarios.
URL: https://openreview.net/forum?id=Os6UTM8yRT
---
Title: Budget-Optimized Crowdworker Allocation
Abstract: Due to concerns about human error in crowdsourcing, it is standard practice to collect labels for the same data point from multiple internet workers. We show that the resulting budget can be used more effectively with a flexible worker assignment strategy that asks fewer workers to analyze data that are easy to label and more workers to analyze data that require extra scrutiny. Our main contribution is to show how the worker label aggregation can be formulated using a probabilistic approach, and how the allocation of the number of workers to a task can be computed optimally based on task difficulty alone, without using worker profiles. Our representative target task is identifying entailment between sentences. To illustrate the proposed methodology, we conducted simulation experiments that utilize a machine learning system as a proxy for workers and demonstrate its advantages over a state-of-the-art commercial optimizer.
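One way to make the trade-off concrete (our toy model, not the paper's probabilistic formulation): if each worker labels a task correctly with a task-specific probability, the accuracy of a majority vote of n workers can be computed exactly, and additional workers can be assigned greedily to the tasks where they raise that accuracy the most.

    from math import comb

    def p_majority_correct(n, p):
        """Probability that a strict majority of n workers (n odd) is correct,
        assuming independent workers, each correct with probability p."""
        return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

    def allocate(task_acc, budget):
        """Start with one worker per task; repeatedly give two more workers (keeping n odd)
        to the task whose majority-vote accuracy improves the most."""
        alloc, spent = [1] * len(task_acc), len(task_acc)
        while spent + 2 <= budget:
            gains = [p_majority_correct(n + 2, p) - p_majority_correct(n, p)
                     for n, p in zip(alloc, task_acc)]
            i = max(range(len(gains)), key=gains.__getitem__)
            alloc[i] += 2
            spent += 2
        return alloc

    # under this toy model, tasks where extra votes help most receive the larger share of the budget
    print(allocate([0.95, 0.90, 0.70, 0.60], budget=16))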
URL: https://openreview.net/forum?id=hVpAgznRmp
---
Title: Toward Efficient Influence Function: Dropout as a Compression Tool
Abstract: Assessing the impact of training data on machine learning models is crucial for understanding model behavior, enhancing transparency, and selecting training data. The influence function provides a theoretical framework for quantifying the effect of training data points on a model’s performance for a specific test point. However, the computational and memory costs of the influence function present significant challenges, especially for large-scale models, even when using approximation methods, since the gradients involved in the computation are as large as the model itself. In this work, we introduce a novel approach that leverages dropout as a gradient compression mechanism to compute the influence function more efficiently. Our method significantly reduces computational and memory overhead, not only during the influence function computation but also in the gradient compression process. Through theoretical analysis and empirical validation, we demonstrate that our method preserves critical components of the data influence and enables its application to modern large-scale models.
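A schematic of the classical influence-function estimate $-\nabla_\theta L(z_{\text{test}})^\top H^{-1} \nabla_\theta L(z_{\text{train}})$ on a small ridge-regression model, with a dropout-style random mask applied to the gradients as a stand-in for compression (the masking scheme is illustrative, not the paper's method):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 200, 20, 1e-2
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    H = X.T @ X / n + lam * np.eye(d)                        # Hessian of the ridge objective
    w = np.linalg.solve(H, X.T @ y / n)                      # fitted ridge weights

    def grad(x, t):                                          # per-example gradient of the squared loss at w
        return (x @ w - t) * x

    def influence(g_test, g_train, keep=1.0):
        """Classical influence estimate, optionally with dropout-style gradient masking."""
        if keep < 1.0:
            mask = rng.random(d) < keep
            g_test, g_train = g_test * mask / keep, g_train * mask / keep
        return -g_test @ np.linalg.solve(H, g_train)

    g_te, g_tr = grad(X[0], y[0]), grad(X[1], y[1])
    print("full influence    :", influence(g_te, g_tr))
    print("masked, keep=0.5  :", influence(g_te, g_tr, keep=0.5))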
URL: https://openreview.net/forum?id=rapeA5Ha3C
---
Title: Benchmark on Drug Target Interaction Modeling from a Drug Structure Perspective
Abstract: The prediction modeling of drug-target interactions is crucial to drug discovery and design, which has seen rapid advancements owing to deep learning technologies. Recently developed methods, such as those based on graph neural networks (GNNs) and Transformers, demonstrate exceptional performance across various datasets by effectively extracting structural information. However, the benchmarking of these novel methods often varies significantly in terms of hyperparameter settings and datasets, which limits algorithmic progress. In view of this, we conducted a comprehensive survey and benchmark for drug-target interaction modeling from a structural perspective by integrating dozens of explicit (i.e., GNN-based) and implicit (i.e., Transformer-based) structure learning algorithms. We conducted a macroscopic comparison between these two classes of encoding strategies as well as the different featurization techniques that inform molecules' chemical and physical properties. We then carried out a microscopic comparison of all the integrated models across the six datasets by comprehensively benchmarking their effectiveness and efficiency. To comprehensively assess fairness, we investigate model performance under two experimental scenarios: one with unified hyperparameter settings and the other with individually optimized configurations. Remarkably, the summarized insights from the benchmark studies lead to the design of model combos. We demonstrate that our combos can achieve new state-of-the-art performance on various datasets associated with cost-effective memory and computation.
URL: https://openreview.net/forum?id=5B6QNLHTvC
---
Title: Improved Sample Complexity for Full Coverage in Compact and Continuous Spaces via Probabilistic Analysis
Abstract: Verifying uniform conditions over continuous spaces through random sampling is fundamental in machine learning and control theory, yet classical coverage analyses often yield conservative bounds, particularly at small failure probabilities. We study uniform random sampling on the $d$-dimensional unit hypercube and analyze the number of uncovered subcubes after discretization. By applying a concentration inequality to the uncovered-count statistic, we derive a sample complexity bound with a logarithmic dependence on the failure probability ($\delta$), i.e., $M =O( \tilde{C}\ln(\frac{2\tilde{C}}{\delta}))$, which contrasts sharply with the classical linear $1/\delta$ dependence. Under standard Lipschitz and uniformity assumptions, we present a self-contained derivation and compare our result with classical coupon-collector rates. Numerical studies across dimensions, precision levels, and confidence targets indicate that our bound tracks practical coverage requirements more tightly and scales favorably as $\delta \to 0$. Our findings offer a sharper theoretical tool for algorithms that rely on grid-based coverage guarantees, enabling more efficient sampling, especially in high-confidence regimes.
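A quick empirical check of the quantity being analysed (our toy experiment, not the paper's derivation): discretize the unit hypercube into subcubes, draw M uniform samples, and count the uncovered subcubes for a few sample sizes.

    import numpy as np

    def uncovered_count(M, d=2, bins=10, seed=0):
        """Number of subcubes of the d-dimensional unit cube left uncovered by M uniform samples."""
        x = np.random.default_rng(seed).random((M, d))
        cells = np.minimum((x * bins).astype(int), bins - 1)
        covered = np.unique(np.ravel_multi_index(cells.T, (bins,) * d))
        return bins ** d - len(covered)

    C = 10 ** 2                                          # total number of subcubes for d=2, 10 bins per axis
    for M in [C, int(C * np.log(C)), int(3 * C * np.log(C))]:
        print(f"M = {M:5d} samples  ->  {uncovered_count(M)} uncovered subcubes")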
URL: https://openreview.net/forum?id=5i7qPRrtgz
---
Title: Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore
Abstract: Recently, Large Vision-Language Models (LVLMs) have shown remarkable performance across various domains. However, these models suffer from object hallucination. This study revisits the previous claim that the cause of such hallucinations lies in the limited representational capacity of the vision encoder. Our analysis implies that the capacity of the vision encoder is not necessarily a major limiting factor in detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore significantly outperforms conventional CLIPScore in accuracy by a large margin of \textbf{39.6\%} without additional training. We further demonstrate that F-CLIPScore-based data filtering reduces object hallucination in LVLMs (4.9\% on POPE).
URL: https://openreview.net/forum?id=JTua6tDPgZ
---
Title: Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Abstract: Scaling the input context length of a large language model (LLM) incurs a significant increase in computation cost and memory footprint to maintain the attention key-value (KV) cache.
Existing KV cache compression methods suffer from inefficient compression strategies and limited memory reduction effects, making it difficult for LLMs to conduct long-context inference on consumer-grade devices, especially when inferring long-context stream input.
Such obstacles prevent consumer-grade devices from supporting more complex applications, creating challenges for the democratization of LLMs.
To overcome this, we propose Locret, a framework to create an eviction policy compatible with chunked prefill. By evaluating the causal importance of KV cache units using \textit{retaining heads}, Locret enables precise eviction of cache units, facilitating efficient long-context inference.
In our empirical studies, Locret outperforms recent popular and competitive approaches in terms of memory efficiency and generation quality: it achieves up to a $20\times$ KV cache compression ratio with less than $10\%$ performance loss.
Furthermore, Locret achieves 128K+ long-context inference on a single NVIDIA 4090 GPU without compromising generation quality and only costs $<1$ GPU hour of additional training.
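The eviction step itself is easy to picture: given per-unit importance scores (in Locret, produced by the trained retaining heads), keep only the highest-scoring KV cache units up to a memory budget. The sketch below shows just this selection step with random stand-in scores (not Locret's scoring or chunked-prefill logic):

    import numpy as np

    def evict_kv_cache(keys, values, scores, budget):
        """Keep the `budget` highest-scoring KV units (original order preserved), drop the rest."""
        keep = np.sort(np.argsort(scores)[-budget:])
        return keys[keep], values[keep]

    rng = np.random.default_rng(0)
    seq_len, head_dim = 1024, 64
    keys = rng.standard_normal((seq_len, head_dim))
    values = rng.standard_normal((seq_len, head_dim))
    scores = rng.random(seq_len)                 # stand-in for retaining-head importance scores
    k_kept, v_kept = evict_kv_cache(keys, values, scores, budget=128)
    print(k_kept.shape, v_kept.shape)            # (128, 64) each: an 8x smaller cache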
URL: https://openreview.net/forum?id=YPVBCTBqHE
---
Title: deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language Models
Abstract: Novel materials drive advancements in fields ranging from energy storage to electronics, with crystal structure characterization forming a crucial yet challenging step in materials discovery. In this work, we introduce \emph{deCIFer}, an autoregressive language model designed for powder X-ray diffraction (PXRD)-conditioned crystal structure prediction (PXRD-CSP). Unlike traditional CSP methods that rely primarily on composition or symmetry constraints, deCIFer explicitly incorporates PXRD data, directly generating crystal structures in the widely adopted Crystallographic Information File (CIF) format. The model is trained on nearly 2.3 million crystal structures, with PXRD conditioning augmented by basic forms of synthetic experimental artifacts, specifically Gaussian noise and instrumental peak broadening, to reflect fundamental real-world conditions. Validated across diverse synthetic datasets representative of challenging inorganic materials, deCIFer achieves a 94\% structural match rate. The evaluation is based on metrics such as the residual weighted profile ($R_{wp}$) and structural match rate (MR), chosen explicitly for their practical relevance in this inherently underdetermined problem. deCIFer establishes a robust baseline for future expansion toward more complex experimental scenarios, bridging the gap between computational predictions and experimental crystal structure determination.
URL: https://openreview.net/forum?id=LftFQ35l47
---