Daily TMLR digest for Aug 16, 2025


TMLR

Aug 16, 2025, 12:06:06 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives

Authors: Armin Saghafian, Amirmohammad Izadi, Negin Hashemi Dijujin, Mahdieh Soleymani Baghshah

Abstract: Grounding the instruction in the environment is a key step in solving language-guided goal-reaching reinforcement learning problems. In automated reinforcement learning, a key concern is to enhance the model's ability to generalize across various tasks and environments. In goal-reaching scenarios, the agent must comprehend the different parts of the instructions within the environmental context in order to complete the overall task successfully. In this work, we propose \textbf{CAREL} (\textit{\textbf{C}ross-modal \textbf{A}uxiliary \textbf{RE}inforcement \textbf{L}earning}) as a new framework to solve this problem using auxiliary loss functions inspired by video-text retrieval literature and a novel method called instruction tracking, which automatically keeps track of progress in an environment. The results of our experiments suggest superior sample efficiency and systematic generalization for this framework in multi-modal reinforcement learning problems.
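
The abstract does not spell out the auxiliary objectives, so below is a minimal sketch (an assumption on our part, not the paper's exact loss) of the kind of video-text-retrieval-style objective such a framework could use: a symmetric InfoNCE loss aligning trajectory and instruction embeddings. All function and variable names are hypothetical.

    import torch
    import torch.nn.functional as F

    def cross_modal_infonce(traj_emb, instr_emb, temperature=0.07):
        """Symmetric InfoNCE between trajectory and instruction embeddings.

        traj_emb, instr_emb: (batch, dim) tensors; row i of each is a matched pair.
        Pulls matched trajectory/instruction pairs together and pushes mismatched
        pairs apart, as in video-text retrieval.
        """
        traj_emb = F.normalize(traj_emb, dim=-1)
        instr_emb = F.normalize(instr_emb, dim=-1)
        logits = traj_emb @ instr_emb.t() / temperature          # (batch, batch) similarities
        targets = torch.arange(traj_emb.size(0), device=traj_emb.device)
        loss_t2i = F.cross_entropy(logits, targets)              # trajectory -> instruction
        loss_i2t = F.cross_entropy(logits.t(), targets)          # instruction -> trajectory
        return 0.5 * (loss_t2i + loss_i2t)

    # Example integration (hypothetical): total_loss = rl_loss + aux_weight * cross_modal_infonce(traj_emb, instr_emb)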

URL: https://openreview.net/forum?id=zJUEYr5X1X

---

Title: Unifying Self-Supervised Clustering and Energy-Based Models

Authors: Emanuele Sansone, Robin Manhaeve

Abstract: Self-supervised learning excels at learning representations from large amounts of data. At the same time, generative models offer the complementary property of learning information about the underlying data generation process. In this study, we aim to establish a principled connection between these two paradigms and highlight the benefits of their complementarity. In particular, we perform an analysis of self-supervised learning objectives, elucidating the underlying probabilistic graphical models and presenting a standardized methodology for their derivation from first principles. The analysis suggests a natural means of integrating self-supervised learning with likelihood-based generative models. We instantiate this concept within the realm of cluster-based self-supervised learning and energy models, introducing a lower bound proven to reliably penalize the most important failure modes and unlocking full unification. Our theoretical findings are substantiated through experiments on synthetic and real-world data, including SVHN, CIFAR10, and CIFAR100, demonstrating that our objective function allows us to jointly train a backbone network in a discriminative and generative fashion, consequently outperforming existing self-supervised learning strategies in terms of clustering, generation and out-of-distribution detection performance by a wide margin. We also demonstrate that the solution can be integrated into a neuro-symbolic framework to tackle a simple yet non-trivial instantiation of the symbol grounding problem.
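
As a rough illustration of how a single backbone can be trained discriminatively and generatively at once (a sketch in our own notation, not the paper's lower bound), the joint log-likelihood over inputs and cluster assignments can be split as

    \log p_\theta(x, y) = \log p_\theta(y \mid x) + \log p_\theta(x), \qquad p_\theta(x) \propto \exp\big(-E_\theta(x)\big),

where the first term is a cluster-based discriminative objective and the second an energy-based generative objective defined by the same network; the paper's contribution is a lower bound on a joint objective of this flavour, proven to penalize the most important failure modes.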

URL: https://openreview.net/forum?id=NW0uKe6IZa

---

Title: MESSI: A Multi-Elevation Semantic Segmentation Image Dataset of an Urban Environment

Authors: Barak Pinkovich, Boaz Matalon, Ehud Rivlin, Hector Rotstein

Abstract: This paper presents a Multi-Elevation Semantic Segmentation Image (MESSI) dataset. A reduced version of the dataset has been published at https://github.com/messi-dataset/ for reviewing purposes (due to the anonymity requirement). The full dataset will be made available at the time of the decision. MESSI comprises 2525 images taken by a drone flying over dense urban environments. MESSI is unique in two main respects. First, it contains images from various altitudes (with both horizontal and vertical trajectories), allowing us to investigate the effect of depth on semantic segmentation. Second, it includes images taken from several different urban regions (at different altitudes). This is important since the variety covers the visual richness captured by a drone's 3D flight, performing horizontal and vertical maneuvers. MESSI contains images annotated with location, orientation, and the camera's intrinsic parameters. It can be used to train a deep neural network for semantic segmentation or other applications of interest. This paper describes the dataset and provides annotation details. It also explains how semantic segmentation was performed using several neural network models and shows several relevant statistics. MESSI will be published in the public domain to serve as an evaluation benchmark for semantic segmentation using images captured by a drone or similar vehicle flying over a dense urban environment.

URL: https://openreview.net/forum?id=ayWqZ1wyIv

---


New submissions
===============


Title: LLM4FL: Multi-Agent Repository-Level Software Fault Localization via Graph-Based Retrieval and Iterative Refinement

Abstract: Locating and fixing software faults is a time-consuming and resource-intensive task in software development. Traditional fault localization methods, such as Spectrum-Based Fault Localization (SBFL), rely on statistical analysis of test coverage data but often lack accuracy. While more effective, learning-based techniques require large training datasets and can be computationally intensive. Recent advancements in Large Language Models (LLMs) have shown potential for improving fault localization by enhancing code comprehension and reasoning. LLMs are typically pretrained and can be leveraged for fault localization without additional training. However, these LLM-based techniques face challenges, including token limitations, performance degradation with long inputs, and difficulties managing large-scale projects with complex, interacting components. We introduce LLM4FL, a multi-LLM-agent-based fault localization approach that addresses these challenges. LLM4FL utilizes three agents. First, the Context Extraction Agent uses an order-aware division strategy to divide extensive coverage data into small groups within the LLM's token limit, analyze them to identify the failure reason, and prioritize failure-related methods. The prioritized methods are sent to the Debugger Agent, which uses graph-based retrieval to identify failure reasons and rank suspicious methods in the codebase. Then, the Reviewer Agent re-evaluates and re-ranks buggy methods using verbal reinforcement learning and self-criticism. Evaluated on the Defects4J (V2.0.0) benchmark of 675 faults from 14 Java projects, LLM4FL outperforms AutoFL by 18.55% in Top-1 accuracy and surpasses supervised methods like DeepFL and Grace, all without task-specific training. Coverage splitting and prompt chaining further improve performance, boosting Top-1 accuracy by up to 22%.
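
To make the three-agent pipeline concrete, here is a minimal orchestration skeleton (our own sketch; prompts, retrieval tools, and helper names are hypothetical, and the actual agents are considerably more involved):

    def llm(prompt: str) -> str:
        """Placeholder for a chat-completion call; any LLM client could back this."""
        raise NotImplementedError

    def context_extraction_agent(coverage_chunks):
        # Order-aware division: analyze coverage data chunk by chunk within the token limit,
        # then merge the notes into a prioritized list of failure-related methods.
        notes = [llm(f"Analyze this coverage slice and list failure-related methods:\n{c}")
                 for c in coverage_chunks]
        return llm("Merge these notes; output a prioritized method list:\n" + "\n".join(notes))

    def debugger_agent(prioritized_methods, call_graph_context):
        # Graph-based retrieval: supply code gathered by walking the call graph around the
        # prioritized methods and ask for a ranked list of suspicious methods.
        return llm(f"Given these methods and their call-graph context, rank suspects:\n"
                   f"{prioritized_methods}\n{call_graph_context}")

    def reviewer_agent(ranked_methods, feedback_so_far=""):
        # Self-criticism / verbal reinforcement: re-evaluate and re-rank the candidate list.
        return llm(f"Critique and re-rank this suspicious-method list:\n{ranked_methods}\n"
                   f"Previous feedback:\n{feedback_so_far}")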

URL: https://openreview.net/forum?id=z91EvZbSI1

---

Title: Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction

Abstract: We explore the off-policy value prediction problem in the reinforcement learning setting, where one estimates the value function of the target policy using sample trajectories obtained from a behaviour policy. Importance sampling based methods are typically the go-to approach for obtaining such estimates, but they tend to suffer high error in long-horizon problems since they can only correct single-step discrepancies and fail to address steady-state bias - the skewed state visitation induced by the behaviour policy. In this paper, we present an algorithm for alleviating this bias in off-policy value prediction with linear function approximation by correcting the state visitation distribution discrepancies. We establish rigorous theoretical guarantees, proving asymptotic convergence under Markov noise with ergodicity and demonstrating that the spectral properties of the corrected update matrix ensure stability. Most significantly, we derive an error decomposition showing that the total estimation error is bounded by a constant multiple of the best achievable approximation within the function class, where this constant transparently depends on distribution estimation quality and feature design. Empirical evaluation across multiple benchmark domains demonstrates that our method effectively mitigates steady-state bias and can be a viable alternative to existing methods in scenarios where distributional shift is critical.
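
The abstract does not give the exact estimator, but the basic shape of a distribution-corrected TD(0) update with linear features can be sketched as follows (how the state-distribution ratio is estimated is the paper's contribution and is not shown here; names are hypothetical):

    import numpy as np

    def corrected_td0_update(w, phi_s, phi_s_next, reward, gamma, alpha,
                             state_dist_ratio, policy_ratio):
        """One distribution-corrected TD(0) step with linear function approximation.

        state_dist_ratio ~ d_target(s) / d_behaviour(s): reweights the update so that,
        in expectation, states are weighted as under the target policy's stationary
        distribution rather than the behaviour policy's (mitigating steady-state bias).
        policy_ratio ~ pi_target(a|s) / pi_behaviour(a|s): the usual per-step correction.
        """
        td_error = reward + gamma * phi_s_next @ w - phi_s @ w
        return w + alpha * state_dist_ratio * policy_ratio * td_error * phi_s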

URL: https://openreview.net/forum?id=QLZAHgiowr

---

Title: Measuring Superposition with Sparse Autoencoders — Does Superposition Cause Adversarial Vulnerability?

Abstract: Neural networks achieve remarkable performance through \textit{superposition}—encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This phenomenon fundamentally challenges interpretability: when neurons respond to multiple unrelated concepts, understanding network behavior becomes intractable. Yet despite its central importance, we lack principled methods to measure superposition. We present an information-theoretic framework that measures the effective number of features through the exponential of the Shannon entropy of sparse autoencoder activations. This threshold-free metric, grounded in rate-distortion theory and an analogy to quantum entanglement, provides the first universal measure of superposition applicable to any neural network.
Our approach demonstrates strong empirical validation: the metric's correlation with ground truth exceeds 0.94 in toy models, it accurately detects minimal superposition in algorithmic tasks (feature count approximately equals neuron count), and it reveals systematic feature reduction under capacity constraints (up to 50\% reduction with dropout). Layer-wise analysis of Pythia-70M reveals that feature counts peak in early-middle layers at 20 times the number of neurons before declining—mirroring patterns observed in intrinsic dimensionality studies. The metric also captures developmental dynamics, detecting sharp reorganization during grokking phase transitions where models shift from superposed memorization to compact algorithmic solutions.
Surprisingly, adversarial training can increase feature counts by up to 4× while improving robustness, contradicting the hypothesis that superposition causes vulnerability. The effect depends on task complexity and network capacity: simple tasks and ample capacity enable feature expansion, while complex tasks or limited capacity force feature reduction.
By providing a principled, threshold-free measure of superposition, this work enables quantitative study of neural information organization.
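
The headline quantity is easy to state in code. A minimal sketch, assuming one reasonable normalization of the SAE activations (the paper's exact averaging/normalization choice may differ):

    import numpy as np

    def effective_feature_count(sae_activations, eps=1e-12):
        """Exponential of the Shannon entropy of averaged, normalized SAE activations.

        sae_activations: (num_examples, num_features) non-negative activations from a
        sparse autoencoder. Returns a threshold-free 'effective number of features':
        1 if a single feature carries all mass, num_features if mass is uniform.
        """
        mass = np.abs(sae_activations).mean(axis=0)       # average activation mass per feature
        p = mass / (mass.sum() + eps)                     # normalize to a distribution
        entropy = -np.sum(p * np.log(p + eps))            # Shannon entropy (nats)
        return float(np.exp(entropy))                     # perplexity-style effective count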

URL: https://openreview.net/forum?id=qaNP6o5qvJ

---

Title: Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs

Abstract: Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with four strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context independently from its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty, and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting.
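
Of the four strategies, RouteDP reduces to a particularly simple routing rule. A hedged sketch with hypothetical model-client callables (the actual prompts and thresholds are not in the abstract):

    def route_forecast(task, small_llm, large_llm, difficulty_threshold=7):
        """RouteDP-style routing sketch: a cheap model scores task difficulty (1-10);
        only tasks above the threshold are sent to the larger, more expensive model."""
        score_text = small_llm(f"On a 1-10 scale, how difficult is this context-aided "
                               f"forecasting task? Answer with a number only.\n{task}")
        try:
            difficulty = float(score_text.strip())
        except ValueError:
            difficulty = 10.0  # unparsable score: be conservative and escalate
        model = large_llm if difficulty >= difficulty_threshold else small_llm
        return model(f"Forecast the target series using the context below.\n{task}")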

URL: https://openreview.net/forum?id=CLuE0GL5xE

---

Title: Amortized Bayesian Workflow

Abstract: Bayesian inference often faces a trade-off between computational speed and sampling accuracy. We propose an adaptive workflow that integrates rapid amortized inference with gold-standard MCMC techniques to achieve a favorable combination of both speed and accuracy when performing inference on many observed datasets. Our approach uses principled diagnostics to guide the choice of inference method for each dataset, moving along the Pareto front from fast amortized sampling via generative neural networks to slower but guaranteed-accurate MCMC when needed. By reusing computations across steps, our workflow synergizes amortized and MCMC-based inference. We demonstrate the effectiveness of this integrated approach on several synthetic and real-world problems with tens of thousands of datasets, showing efficiency gains while maintaining high posterior quality.
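
A minimal sketch of the adaptive per-dataset loop (the concrete diagnostics and the way computations are reused are the paper's contributions; the helper names below are placeholders):

    def amortized_bayesian_workflow(datasets, amortized_sampler, diagnostic_ok, run_mcmc):
        """Per-dataset adaptive inference: try fast amortized sampling first, fall back
        to slower but asymptotically exact MCMC only when a diagnostic flags the draws.

        datasets: mapping from dataset name to observed data
        amortized_sampler(data) -> posterior draws from a trained generative network
        diagnostic_ok(draws, data) -> bool, e.g. an importance-sampling / calibration check
        run_mcmc(data, init=None) -> gold-standard draws, optionally warm-started
        """
        results = {}
        for name, data in datasets.items():
            draws = amortized_sampler(data)
            if diagnostic_ok(draws, data):
                results[name] = ("amortized", draws)
            else:
                # Reuse the amortized draws (e.g. as initialization) so work is not wasted.
                results[name] = ("mcmc", run_mcmc(data, init=draws))
        return results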

URL: https://openreview.net/forum?id=osV7adJlKD

---

Title: Decentralized Policy Gradients for Optimizing Generalizable Policies in Multi-Agent Reinforcement Learning

Abstract: Parameter Sharing (PS) is a widely used practice in Multi-Agent Reinforcement Learning (MARL), where a single neural network is shared among all agents. Despite its efficiency and effectiveness, PS can occasionally result in suboptimal performance. While prior research has primarily addressed this issue from the perspective of update conflicts among different agents, we investigate it from an optimization standpoint. Specifically, we point out the analogy between PS in MARL and Centralized SGD (CSGD) in distributed learning and hypothesize that PS may inherit similar convergence and generalization issues as CSGD, such as lower convergence levels of key metrics and larger generalization gaps. To address these issues, we propose Decentralized Policy Gradients (DecPG), which leverages the principles of Decentralized SGD. We use an environment with additional noise injected into the observation and action spaces to evaluate the generalization of DecPG. Empirical results show that DecPG outperforms its centralized counterpart, PS, across various aspects---achieving higher rewards, smaller generalization gaps, and flatter reward landscapes. The results confirm that PS suffers from convergence and generalization issues similar to those of CSGD, and show that our DSGD-based method, DecPG, effectively mitigates these problems---offering a new optimization perspective on MARL algorithm performance.
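
The Decentralized SGD analogy suggests an update of roughly the following shape: each agent mixes its policy parameters with its neighbours' and then takes a local policy-gradient step. A schematic sketch, not the paper's exact algorithm:

    import numpy as np

    def decentralized_pg_step(params, grads, mixing_matrix, lr=1e-3):
        """One decentralized update in the spirit of Decentralized SGD.

        params: (num_agents, dim) per-agent policy parameters
        grads:  (num_agents, dim) per-agent gradients of the policy-gradient surrogate loss
        mixing_matrix: (num_agents, num_agents) doubly stochastic gossip weights;
                       the identity gives fully independent learners, while an
                       all-ones / num_agents matrix recovers centralized averaging.
        """
        mixed = mixing_matrix @ params   # average parameters with neighbours
        return mixed - lr * grads        # then take a local gradient step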

URL: https://openreview.net/forum?id=utpzisYFqd

---

Title: Robust Clustering using Gaussian Mixtures in the Presence of Cellwise Outliers

Abstract: In this paper, we propose a novel algorithm for robust estimation of Gaussian Mixture Model (GMM) parameters and clustering that explicitly accounts for cellwise outliers. To achieve this, the proposed algorithm minimizes a penalized negative log-likelihood function where the penalty term is derived via the false discovery rate principle. The penalized negative log-likelihood function is cyclically minimized over outlier positions and the GMM parameters. Furthermore, the minimization over the GMM parameters is done using the majorization-minimization framework: specifically, we minimize a tight upper bound on the negative log-likelihood function which decouples into simpler optimization subproblems that can be solved efficiently.
We present several numerical simulation studies that evaluate the performance of the proposed method on synthetic as well as real-world data and systematically compare it with state-of-the-art robust techniques in different scenarios. The simulation studies demonstrate that our approach effectively addresses the challenges inherent in GMM parameter estimation and clustering in contaminated data environments.
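
For concreteness, one schematic way to write such a cellwise-penalized objective (an illustration in our notation only; the paper's FDR-derived penalty and its majorizer are not specified in the abstract) is

    \min_{\theta,\, O}\; -\sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k\, \mathcal{N}\big(\mathbf{x}_i - \mathbf{o}_i;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\big)\Big) \;+\; \sum_{i,j} \lambda_{ij}\, \mathbb{1}\{o_{ij} \neq 0\},

where O = (o_{ij}) collects per-cell outlier shifts, \theta = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\} are the GMM parameters, and the thresholds \lambda_{ij} play the role of the FDR-derived penalty; the objective is then minimized cyclically over O and \theta.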

URL: https://openreview.net/forum?id=oVHPEgjdWk

---

Title: TimeAutoDiff: A Unified Framework for Generation, Imputation, Forecasting, and Time-Varying Metadata Conditioning of Heterogeneous Time Series Tabular Data

Abstract: We present \texttt{TimeAutoDiff}, a unified latent–diffusion framework that addresses four fundamental time-series tasks—unconditional generation, missing-data imputation, forecasting, and time-varying-metadata conditional generation—within a single model that natively handles heterogeneous features (continuous, binary, and categorical).
We unify these tasks through a simple masked-modeling strategy: a binary mask specifies which time–feature cells are observed and which must be generated.
To make this work on mixed data types, we pair a lightweight variational autoencoder—which maps continuous, categorical, and binary variables into a continuous latent sequence—with a diffusion model that learns dynamics in that latent space, avoiding separate likelihoods for each data type while still capturing temporal and cross-feature structure.
Two design choices give \texttt{TimeAutoDiff} clear speed and scalability advantages.
First, the diffusion process samples a single latent trajectory for the full horizon \(1{:}T\) rather than denoising one timestep at a time; this whole-sequence sampling drastically reduces reverse-diffusion calls and yields an order-of-magnitude throughput gain.
Second, the VAE compresses along the feature axis, so very wide tables are modeled in a lower-dimensional latent space, further reducing computational load.
Across six real-world datasets, \texttt{TimeAutoDiff} matches or surpasses strong baselines in synthetic sequence fidelity (discriminative, temporal-correlation, and predictive metrics) and consistently lowers MAE/MSE for imputation and forecasting tasks.
Time-varying-metadata conditioning unlocks real-world scenario exploration: by editing metadata sequences (e.g., regime labels, environmental or policy indicators), practitioners can generate coherent families of counterfactual trajectories that track intended directional changes, preserve cross-feature dependencies, and remain conditionally calibrated—making ``what-if'' analysis practical.
Ablations attribute performance gains to whole-sequence sampling, latent compression, and mask conditioning, while a distance-to-closest-record audit indicates strong generalization with limited memorization.
Code implementations of \texttt{TimeAutoDiff} are provided at https://anonymous.4open.science/r/TimeAutoDiff-TMLR-7BA8/README.md.
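
The masked-modeling unification can be illustrated by how the binary observation mask is constructed for each task; the single diffusion model then generates only the cells the mask leaves uncovered. A sketch with hypothetical helper names (the paper's actual mask handling lives in latent space and is more involved):

    import numpy as np

    def task_mask(T, F, task, observed=None, history_len=None):
        """Binary mask over the (time, feature) grid: 1 = observed/conditioned-on cell,
        0 = cell to be generated. All four tasks reuse the same model by changing
        only this mask."""
        if task == "generation":                 # nothing observed: sample the whole table
            return np.zeros((T, F), dtype=int)
        if task == "imputation":                 # observed = indicator of non-missing cells
            return observed.astype(int)
        if task == "forecasting":                # condition on the first history_len steps
            mask = np.zeros((T, F), dtype=int)
            mask[:history_len, :] = 1
            return mask
        if task == "metadata_conditioning":      # observe metadata columns, generate the rest
            return observed.astype(int)
        raise ValueError(task)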

URL: https://openreview.net/forum?id=bkUd1Dg46c

---

Title: Algorithmic Recourse in Abnormal Multivariate Time Series

Abstract: Algorithmic recourse provides actionable recommendations to alter unfavorable predictions of machine learning models, enhancing transparency through counterfactual explanations. While significant progress has been made in algorithmic recourse for static data, such as tabular and image data, limited research explores recourse for multivariate time series, particularly for reversing abnormal time series. This paper introduces Recourse in time series Anomaly Detection (RecAD), a framework for addressing anomalies in multivariate time series using backtracking counterfactual reasoning. By modeling the causes of anomalies as external interventions on exogenous variables, RecAD predicts recourse actions to restore normal status as counterfactual explanations, where the recourse function, responsible for generating actions based on observed data, is trained using an end-to-end approach. Experiments on synthetic and real-world datasets demonstrate its effectiveness.

URL: https://openreview.net/forum?id=kzxFc2Suo5

---

Title: Uncovering Language Model Processing Strategies with Non-Negative Per-Example Fisher Factorization

Abstract: Understanding the heuristics and algorithms that comprise a model's behavior is important for safe and reliable deployment.
While gradient clustering has been used for this purpose, gradients of a single log probability capture only a slice of the model's behavior, and clustering can only assign a single factor to each behavior.
We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that overcomes these limitations by decomposing per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices.
Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to heuristics used by language models on a variety of text processing tasks.
We find that NPEFF excels at decomposing behaviors composed of multiple factors compared to the baselines of gradient clustering and activation sparse autoencoders.
We also show how NPEFF can be adapted to be more efficient on tasks with few classes.
We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing.
Along with conducting extensive ablation studies, we include experiments using NPEFF to study in-context learning.
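
Schematically, the decomposition described above approximates each per-example Fisher matrix by a non-negative combination of shared rank-1 positive semi-definite components (notation is ours, not the paper's):

    F_n \;\approx\; \sum_{k=1}^{K} w_{nk}\, \mathbf{g}_k \mathbf{g}_k^{\top}, \qquad w_{nk} \ge 0,

where F_n is the (approximate) Fisher matrix of example n, each \mathbf{g}_k \mathbf{g}_k^{\top} is a learned rank-1 PSD component shared across examples, and the non-negative coefficient w_{nk} indicates how strongly component k (a candidate heuristic) is active on that example.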

URL: https://openreview.net/forum?id=DUFvZXrQr7

---

Title: When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models

Abstract: Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID. Our study spans a breadth of aggregation rules using Deep Ensembles, Monte Carlo Dropout, and Random Forests on CIFAR-10, FFHQ, and tabular data. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance).
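
The simplest aggregation rule, averaging member scores, already hints at why ensembling diffusion models differs from ensembling classifiers: a weighted sum of scores is the score of a (normalized) product of the member densities rather than of their mixture. A minimal sketch (one of many possible aggregation rules studied):

    def ensemble_score(score_fns, x, t, weights=None):
        """Weighted sum of member score estimates at (x, t).

        Each member outputs s_m(x, t) ~ grad_x log p_m(x, t); a weighted sum of scores
        equals the score of prod_m p_m^{w_m} up to normalization, i.e. a geometric-mean
        style combination of the member densities.
        """
        weights = weights or [1.0 / len(score_fns)] * len(score_fns)
        return sum(w * f(x, t) for w, f in zip(weights, score_fns))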

URL: https://openreview.net/forum?id=4iRx9b0Csu

---

Title: Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts

Abstract: Many recent papers have studied the development of superforecaster-level event forecasting LLMs. While methodological problems with early studies cast doubt on the use of LLMs for event forecasting, recent studies with improved evaluation methods have shown that state-of-the-art LLMs are gradually reaching superforecaster-level performance, and reinforcement learning has also been reported to improve future forecasting. Additionally, the unprecedented success of recent reasoning models and Deep Research-style models suggests that technology capable of greatly improving forecasting performance has been developed. Therefore, based on these positive recent trends, we argue that the time is ripe for research on large-scale training of superforecaster-level event forecasting LLMs.
We discuss two key research directions: training methods and data acquisition. For training, we first introduce three difficulties of LLM-based event forecasting training: noisiness-sparsity, knowledge cut-off, and simple reward structure problems. Then, we present related ideas to mitigate these problems: hypothetical event Bayesian networks, utilizing poorly-recalled and counterfactual events, and auxiliary reward signals. For data, we propose aggressive use of market, public, and crawling datasets to enable large-scale training and evaluation. Finally, we explain how these technical advances could enable AI to provide predictive intelligence to society in broader areas. This position paper presents promising specific paths and considerations for getting closer to superforecaster-level AI technology, aiming to draw researchers' attention to these directions.

URL: https://openreview.net/forum?id=OeAyoB47dS

---
