Daily TMLR digest for May 11, 2025


TMLR

May 11, 2025, 12:06:06 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: Ctrl-V: Higher Fidelity Autonomous Vehicle Video Generation with Bounding-Box Controlled Object Motion

Authors: Ge Ya Luo, ZhiHao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal

Abstract: Controllable video generation has attracted significant attention, largely due to advances in video diffusion models. In domains such as autonomous driving, developing highly accurate predictions for object motions is essential. This paper addresses the key challenge of enabling fine-grained control over object motion in the context of driving video synthesis. To accomplish this, we 1) employ a distinct, specialized model to forecast the trajectories of object bounding boxes, 2) adapt and enhance a separate video diffusion network to create video content conditioned on these high-quality trajectory forecasts, and 3) exert precise control over object positions and movements using bounding boxes in both 2D and 3D spaces. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory forecasting and video generation. Extensive experiments conducted on the KITTI, Virtual-KITTI 2, BDD100k, and nuScenes datasets validate the effectiveness of our approach in producing realistic and controllable videos. Project page: https://oooolga.github.io/ctrl-v.github.io/
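
A minimal sketch of the two-stage interface this abstract describes; both callables below are hypothetical stand-ins for the paper's fine-tuned SVD models, not the released code:

# Hypothetical shape of the Ctrl-V-style two-stage pipeline.
def generate_controlled_video(initial_frame, past_boxes, horizon,
                              box_forecaster, video_diffusion):
    # Stage 1: a dedicated model forecasts 2D/3D bounding-box trajectories.
    future_boxes = box_forecaster(past_boxes, horizon)
    # Stage 2: a video diffusion model renders frames conditioned on the
    # initial frame and the forecast box trajectories.
    return video_diffusion(initial_frame, condition=future_boxes)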

URL: https://openreview.net/forum?id=BMGikHBjlx

---

Title: Node Feature Forecasting in Temporal Graphs: an Interpretable Online Algorithm

Authors: Aniq Ur Rahman, Justin Coon

Abstract: In this paper, we propose an online algorithm, mspace, for forecasting node features in temporal graphs, which captures spatial cross-correlation among different nodes as well as temporal auto-correlation within a node. The algorithm can be used for both probabilistic and deterministic multi-step forecasting, making it applicable to both estimation and generation tasks. Evaluations against various baselines, including temporal graph neural network (TGNN) models and classical Kalman filters, demonstrate that mspace performs comparably to the state of the art and even surpasses it on some datasets. Importantly, mspace demonstrates consistent performance across datasets with varying training sizes, a notable advantage over TGNN models, which require abundant training samples to effectively learn the spatiotemporal trends in the data. Employing mspace is therefore advantageous in scenarios where training samples are limited. Additionally, we establish theoretical bounds on the multi-step forecasting error of mspace and show that it scales linearly with the number of forecast steps $q$, i.e., $\mathcal{O}(q)$. For an asymptotically large number of nodes $n$ and timesteps $T$, the computational complexity of mspace grows linearly with both $n$ and $T$, i.e., $\mathcal{O}(nT)$, while its space complexity remains constant, $\mathcal{O}(1)$. We compare the performance of various mspace variants against ten recent TGNN baselines and two classical baselines, ARIMA and the Kalman filter, across ten real-world datasets. Lastly, we investigate the interpretability of different mspace variants by analyzing model parameters alongside dataset characteristics to jointly derive model-centric and data-centric insights.
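
To illustrate the complexity claims only (this is a generic schematic, not the mspace algorithm), here is an online forecaster whose memory does not grow with the number of timesteps $T$ and whose linear-extrapolation error grows with the number of forecast steps $q$:

import numpy as np

# Schematic online node-feature forecaster: keep the last observation and
# a running mean of per-step increments. O(nT) total work over T steps,
# memory independent of T.
class OnlineForecaster:
    def __init__(self, n_nodes, dim):
        self.prev = np.zeros((n_nodes, dim))   # last observed features
        self.delta = np.zeros((n_nodes, dim))  # running mean of increments
        self.t = 0

    def update(self, x):
        if self.t > 0:
            # incremental mean update: O(n) work per timestep
            self.delta += (x - self.prev - self.delta) / self.t
        self.prev, self.t = x.copy(), self.t + 1

    def forecast(self, q):
        # q-step extrapolation; the error of such a scheme grows O(q)
        return np.stack([self.prev + (k + 1) * self.delta for k in range(q)])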

URL: https://openreview.net/forum?id=Teu1Blr2YJ

---

Title: Integrating Large Language Models in Causal Discovery: A Statistical Causal Approach

Authors: MASAYUKI TAKAYAMA, Tadahisa OKUDA, Thong Pham, Tatsuyoshi Ikenoue, Shingo Fukuma, Shohei Shimizu, Akiyoshi Sannai

Abstract: In practical statistical causal discovery (SCD), embedding domain-expert knowledge as constraints in the algorithm is important for obtaining reasonable causal models that reflect the broad knowledge of domain experts, despite the challenges of systematically acquiring such background knowledge. To overcome these challenges, this paper proposes a novel method for causal inference in which SCD and knowledge-based causal inference (KBCI) with a large language model (LLM) are synthesized through "statistical causal prompting (SCP)" for LLMs and prior-knowledge augmentation for SCD. The experiments in this work reveal that the results of LLM-KBCI, and of SCD augmented with LLM-KBCI, approach the ground truth more closely than SCD without prior knowledge. They also reveal that the SCD result can be further improved when the LLM undergoes SCP. Furthermore, on an unpublished real-world dataset, we demonstrate that background knowledge provided by the LLM can improve SCD even though the dataset was never included in the LLM's training data. Looking toward practical application of the proposed method in important domains such as healthcare, we also thoroughly discuss its limitations, the risks of critical errors, expected improvements in techniques around LLMs, and realistic integration of expert checks into this automatic process, with SCP simulations under various conditions covering both success and failure scenarios. Careful and appropriate application of the proposed approach, with improvement and customization for each domain, can thus address challenges such as dataset biases and limitations, illustrating the potential of LLMs to improve data-driven causal inference across diverse scientific domains.


The code used in this work is publicly available at: https://github.com/mas-takayama/LLM-and-SCD.
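
A hypothetical outline of the SCP loop the abstract describes: run SCD once, show its output to an LLM, and feed the LLM's judgements back into SCD as prior-knowledge constraints. The function names (run_scd, query_llm) and the edge-query prompt are placeholders, not the released code's API:

def statistical_causal_prompting(data, variables, query_llm, run_scd):
    # 1) Unconstrained SCD pass.
    first_pass = run_scd(data, prior_knowledge=None)

    # 2) Ask the LLM about each candidate edge, including the SCD result
    #    in the prompt ("statistical causal prompting"). first_pass.effect
    #    is a hypothetical accessor for the estimated causal effect.
    prior = {}
    for a in variables:
        for b in variables:
            if a == b:
                continue
            prompt = (f"SCD estimated the effect of {a} on {b} as "
                      f"{first_pass.effect(a, b):.3f}. Based on domain "
                      f"knowledge, does {a} cause {b}? Answer yes/no.")
            prior[(a, b)] = query_llm(prompt).strip().lower() == "yes"

    # 3) Re-run SCD with the LLM-derived background knowledge as constraints.
    return run_scd(data, prior_knowledge=prior)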

URL: https://openreview.net/forum?id=Reh1S8rxfh

---


New submissions
===============


Title: Flow Models for Unbounded and Geometry-Aware Distributional Reinforcement Learning

Abstract: We introduce a new architecture for Distributional Reinforcement Learning (DistRL) that models return distributions using normalizing flows. This approach enables flexible, unbounded support for return distributions, in contrast to categorical approaches like C51 that rely on fixed or bounded representations. It also offers richer modeling capacity for multi-modality, skewness, and tail behavior than quantile-based approaches. Our method is significantly more parameter-efficient than categorical approaches. Standard metrics used to train existing models, such as the KL divergence or the Wasserstein distance, are either scale-insensitive or have biased sample gradients, especially when return supports do not overlap. To address this, we propose a novel surrogate for the Cramér distance that is geometry-aware and computable directly from the return distribution's PDF, avoiding the costly CDF computation. We test our model on the ATARI-5 sub-benchmark and show that our approach outperforms PDF-based models while remaining competitive with quantile-based methods.
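
For reference, the squared Cramér distance between distributions $P$ and $Q$ with CDFs $F_P$ and $F_Q$ is

$\ell_2^2(P, Q) = \int_{-\infty}^{\infty} \left( F_P(x) - F_Q(x) \right)^2 \, dx,$

which is why a naive implementation requires the CDFs; the surrogate proposed here is stated to keep the distance's geometry-awareness while working directly with the PDF.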

URL: https://openreview.net/forum?id=baH15Glivu

---

Title: ComAlign: Compositional Alignment in Vision-Language Models

Abstract: Vision-language models (VLMs) like CLIP have showcased a remarkable ability to extract transferable features for downstream tasks. Nonetheless, the training process of these models is usually based on a coarse-grained contrastive loss between the global embeddings of images and texts, which may lose the compositional structure of these modalities. Many recent studies have shown that VLMs lack compositional understanding, such as attribute binding and identifying object relationships. Although some recent methods have tried to achieve finer-level alignment, they either are not based on extracting meaningful components of the proper granularity or do not properly utilize the modalities' correspondence (especially in image-text pairs with more constituents). Addressing these limitations, we introduce Compositional Alignment (ComAlign), a fine-grained approach to discovering more exact correspondences between text and image components using only weak supervision in the form of image-text pairs. Our methodology emphasizes that the compositional structure (including entities and relations) extracted from the text modality must also be retained in the image modality. To enforce correspondence of fine-grained concepts across the image and text modalities, we train a lightweight network on top of existing visual and language encoders using a small dataset. The network is trained to align the entity and relational components across the modalities. Experimental results on various VLMs and datasets demonstrate significant improvements on retrieval and compositional benchmarks, affirming the effectiveness of our plugin model.
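
A minimal sketch of the kind of lightweight alignment head this describes, sitting on top of frozen encoders; the component extraction and projection design are assumptions for illustration, not ComAlign's actual architecture:

import torch

# Projects pre-extracted image and text components into a shared space and
# scores fine-grained (entity/relation) correspondences between them.
class AlignmentHead(torch.nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.img_proj = torch.nn.Linear(dim, dim)
        self.txt_proj = torch.nn.Linear(dim, dim)

    def forward(self, img_components, txt_components):
        # img_components: (n_img, dim); txt_components: (n_txt, dim)
        a = torch.nn.functional.normalize(self.img_proj(img_components), dim=-1)
        b = torch.nn.functional.normalize(self.txt_proj(txt_components), dim=-1)
        return a @ b.T  # component-level similarity matrix

Trained contrastively, such a head pulls each text component toward its image counterpart while leaving the underlying encoders frozen.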

URL: https://openreview.net/forum?id=rO8fzpihmM

---

Title: Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design

Abstract: The rapid discovery of new chemical compounds is essential for advancing global health and developing treatments. While generative models show promise for creating novel molecules, challenges remain in ensuring the real-world applicability of these molecules and in finding them efficiently. To address this challenge, we introduce Conditional Latent Space Molecular Scaffold Optimization (CLaSMO), which integrates a Conditional Variational Autoencoder (CVAE) with Latent Space Bayesian Optimization (LSBO) to strategically modify molecules while preserving similarity to the original input, effectively framing the task as constrained optimization. Our LSBO setting improves the sample efficiency of molecular optimization, and our modification approach yields molecules with a higher chance of real-world applicability. CLaSMO explores substructures of molecules in a sample-efficient manner by performing BO in the latent space of a CVAE conditioned on the atomic environment of the molecule to be optimized. Our extensive evaluations across diverse optimization tasks, including rediscovery, docking score, and multi-property optimization, show that CLaSMO efficiently enhances target properties under molecular-similarity constraints, delivers the sample efficiency crucial for resource-limited applications, achieves state-of-the-art performance, and maintains practical synthetic accessibility. We also provide an open-source web application that enables chemical experts to apply CLaSMO in a human-in-the-loop setting.
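
A schematic latent-space Bayesian optimization loop of the kind described above; the CVAE interface (latent_dim, decode), the conditioning on an atomic environment, and score_fn are hypothetical placeholders, not CLaSMO's components:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def lsbo_optimize(cvae, condition, score_fn, n_iters=50):
    Z, scores = [], []
    for _ in range(n_iters):
        if len(scores) < 5:                    # warm-up: random latents
            z = np.random.randn(cvae.latent_dim)
        else:
            gp = GaussianProcessRegressor().fit(np.array(Z), np.array(scores))
            cand = np.random.randn(256, cvae.latent_dim)
            mu, sd = gp.predict(cand, return_std=True)
            z = cand[np.argmax(mu + sd)]       # crude UCB-style acquisition
        mol = cvae.decode(z, condition)        # condition = atomic environment
        Z.append(z)
        scores.append(score_fn(mol))
    return cvae.decode(Z[int(np.argmax(scores))], condition)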

URL: https://openreview.net/forum?id=KhxVc9RBJv

---

Title: Diffusion-RainbowPA: Improvements Integrated Preference Alignment for Diffusion-based Text-to-Image Generation

Abstract: Although the rapidly increasing capabilities of text-to-image (T2I) models have profound implications across various industries, these models concurrently suffer from numerous shortcomings, necessitating effective strategies for alignment with human preferences. Diffusion-DPO and SPO have emerged as robust approaches for aligning diffusion-based T2I models with human preference feedback, but they tend to suffer from text-image misalignment, aesthetic overfitting, and low-quality generation. To tackle these issues, we improve the alignment paradigm from three perspectives: calibration enhancement (Calibration Enhanced Preference Alignment), overfitting mitigation (Identical Preference Alignment, Jensen-Shannon Divergence Constraint), and performance optimization (Margin Strengthened Preference Alignment, SFT-like Regularization). Combining these with the step-aware preference alignment paradigm, we propose Diffusion-RainbowPA, a suite of six improvements that collectively strengthen the alignment performance of Diffusion-DPO. Comprehensive evaluation and comparison demonstrate that Diffusion-RainbowPA outperforms current state-of-the-art methods. Ablation studies on the introduced components reveal that each positively contributes to alignment performance.
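
For context, the DPO objective that Diffusion-DPO adapts to diffusion models compares a preferred sample $x^w$ against a dispreferred sample $x^l$ under the trained model $\pi_\theta$ and a frozen reference $\pi_{\mathrm{ref}}$:

$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(x^w)}{\pi_{\mathrm{ref}}(x^w)} - \beta \log \frac{\pi_\theta(x^l)}{\pi_{\mathrm{ref}}(x^l)}\right)\right].$

The six improvements named in the abstract can be read as modifications built around this preference term; their exact forms are given in the paper.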

URL: https://openreview.net/forum?id=KY0TSY2bx8

---

Title: Understanding Sparse Feature Updates in Deep Networks using Iterative Linearisation

Abstract: Larger and deeper neural networks generalise well despite their increased capacity to overfit the data. Understanding why this happens is theoretically and practically important. One recent approach investigates infinitely wide limits of neural networks through their corresponding Neural Tangent Kernels (NTKs), demonstrating their equivalence to kernel regression with a fixed kernel derived from the network's architecture and initialisation. However, this "lazy training" cannot explain feature learning: such regimes correspond to linearised training in the neural network's weight space, which implies a constant NTK throughout training and hence no feature learning. In practice, the empirical NTK of finite networks can change substantially, particularly during the initial phase of stochastic gradient descent (SGD), highlighting the importance of feature learning. In this work, we derive iterative linearisation, an interpolation between SGD and NTK-based kernel regression. Iterative linearisation enables us to precisely quantify the frequency of feature learning, and we show it is equivalent to NTK-based kernel regression under specific conditions. Empirically, only a surprisingly small amount of feature learning is required to achieve performance comparable to SGD; however, disabling feature learning entirely harms generalisation. We further justify the validity of iterative linearisation by showing that, with large periodicity, it is a special variant of the Gauss-Newton optimisation algorithm, and we use this connection to provide novel insights into the role of damping in feature learning and generalisation in Gauss-Newton.
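
A sketch of the linearised forward pass that such an interpolation uses, under the assumption that `anchor` and `params` are dicts mapping parameter names to tensors; this is schematic, not the paper's implementation:

import torch
from torch.func import functional_call, jvp

# f_lin(x; w) = f(x; w0) + J_w f(x; w0) (w - w0), computed with a
# Jacobian-vector product at the anchor point w0.
def linearised_forward(model, anchor, params, x):
    names = list(anchor)
    def f(*values):
        return functional_call(model, dict(zip(names, values)), (x,))
    primals = tuple(anchor[n] for n in names)
    tangents = tuple(params[n] - anchor[n] for n in names)
    out, delta = jvp(f, primals, tangents)  # f(x; w0) and J (w - w0)
    return out + delta

# Training-loop idea: optimise the loss of linearised_forward w.r.t.
# `params`; every K steps copy `params` into `anchor` (a "feature
# update"). K = 1 recovers SGD; K -> infinity recovers lazy training.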

URL: https://openreview.net/forum?id=3mPidxpdIb

---

Title: Node-Level Data Valuation on Graphs

Abstract: How much is a node worth? We answer this question using an emerging set of data valuation techniques, in which the value of a data point is measured via its marginal contribution when added to the (training) dataset. Data valuation has primarily been studied in the i.i.d. setting, giving rise to methods like influence functions, leave-one-out estimation, data Shapley, and data Banzhaf. We conduct a comprehensive study of data valuation approaches applied to graph-structured models such as graph neural networks in a semi-supervised, transductive setting. Since all nodes (labeled and unlabeled) influence both training and inference, we construct various scenarios to understand the diverse mechanisms by which nodes can impact learning. We show that the resulting node values can be used to identify (positively and negatively) influential nodes, quantify model brittleness, detect poisoned data, and accurately predict counterfactuals.
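
As a concrete reference point, leave-one-out estimation (the simplest scheme mentioned above) values a node by the drop in validation performance when it is removed before training; `train_gnn` and `evaluate` below are hypothetical stand-ins for a GNN training routine and a validation metric:

import networkx as nx

def leave_one_out_values(graph, train_nodes, val_nodes, train_gnn, evaluate):
    base = evaluate(train_gnn(graph, train_nodes), graph, val_nodes)
    values = {}
    for v in train_nodes:
        g = graph.copy()
        g.remove_node(v)  # removing v also affects inference (transductive)
        rest = [u for u in train_nodes if u != v]
        values[v] = base - evaluate(train_gnn(g, rest), g, val_nodes)
    return values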

URL: https://openreview.net/forum?id=tNyApIqDSJ

---

Title: Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

Abstract: Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models, and its complexity and time cost have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, or produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically produces human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine a candidate-prompt distribution built upon reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
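
A schematic of the black-box refinement loop the abstract describes; llm_propose, t2i_generate, and clip_similarity are placeholders for an in-context LLM, any T2I model, and an image-similarity scorer, not the PRISM implementation:

def prism_loop(reference_images, llm_propose, t2i_generate,
               clip_similarity, n_rounds=10):
    best_prompt, best_score, history = None, float("-inf"), []
    for _ in range(n_rounds):
        # the LLM sees prior (prompt, score) pairs in context and
        # proposes a refined candidate prompt for the reference images
        prompt = llm_propose(reference_images, history)
        image = t2i_generate(prompt)          # black-box T2I access only
        score = clip_similarity(image, reference_images)
        history.append((prompt, score))
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt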

URL: https://openreview.net/forum?id=IVYVDN6pJ6

---

Title: How Can Knowledge of a Task’s Modular Structure Improve Generalization and Training Efficiency?

Abstract: Many real-world learning tasks have an underlying hierarchical and modular structure, composed of smaller sub-functions. Traditional neural networks (NNs) often disregard this structure, leading to inefficiencies in learning and generalization. Prior work has demonstrated that leveraging known structural information can enhance performance by aligning NN architectures with the task’s inherent modularity. However, the extent of prior structural knowledge required to achieve these performance improvements remains unclear. In this work, we investigate how modular NNs can outperform traditional dense NNs on tasks with a simple but known modular structure by systematically varying the degree of structural knowledge incorporated. We compare architectures ranging from monolithic dense NNs, which assume no prior knowledge, to hierarchically modular NNs with shared modules that leverage sparsity, modularity, and module reusability. Our experiments demonstrate that module reuse in modular NNs significantly improves learning efficiency and generalization. Furthermore, we find that module reuse enables modular NNs to excel in data-scarce scenarios by promoting functional specialization within modules and reducing redundancy.
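
A toy illustration of "module reuse": one small module is applied (with shared weights) to several input groups, mirroring a task whose sub-functions repeat. The shapes and two-level composition are hypothetical examples, not the paper's architectures:

import torch

class SharedModuleNet(torch.nn.Module):
    def __init__(self, group_dim=4, n_groups=3, hidden=16):
        super().__init__()
        # one module, reused across all input groups (weight sharing)
        self.sub = torch.nn.Sequential(
            torch.nn.Linear(group_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1))
        self.combine = torch.nn.Linear(n_groups, 1)

    def forward(self, x):  # x: (batch, n_groups, group_dim)
        parts = [self.sub(x[:, g]) for g in range(x.shape[1])]
        return self.combine(torch.cat(parts, dim=-1))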

URL: https://openreview.net/forum?id=46hFTOUox7

---
