Accepted papers
===============
Title: Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Authors: Or Tal, Felix Kreuk, Yossi Adi
Abstract: Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices.
This diversity complicates efforts to evaluate models fairly and identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm.
We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems.
Specifically, we compare arguably the two most common modeling paradigms: auto-regressive decoding and conditional flow-matching.
We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures.
Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting.
This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation.
Sampled audio examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
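For readers unfamiliar with the two paradigms being compared, here is a minimal sketch of their training objectives: next-token cross-entropy for auto-regressive decoding over discrete audio tokens, and the linear-path (rectified-flow style) variant of conditional flow-matching over continuous latents. Shapes, the model signature, and the specific flow-matching variant are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ar_loss(logits, target_tokens):
    # Auto-regressive objective: next-token cross-entropy over discrete audio tokens.
    # logits: (B, T, V) predictions for each position; target_tokens: (B, T).
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1))

def flow_matching_loss(model, x1, text_cond):
    # Conditional flow matching (linear-path variant): regress the constant velocity
    # of a straight path from noise x0 to data x1 at a uniformly sampled time t.
    # x1: (B, C, T) continuous audio latents; text_cond: text-conditioning embedding.
    x0 = torch.randn_like(x1)                              # noise endpoint
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)     # per-example time in [0, 1]
    xt = (1 - t) * x0 + t * x1                             # point on the straight path
    v_target = x1 - x0                                     # velocity of that path
    v_pred = model(xt, t.view(-1), text_cond)              # hypothetical model signature
    return F.mse_loss(v_pred, v_target)
```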
URL: https://openreview.net/forum?id=xXc5DeaBYw
---
Title: Multiple Noises in Diffusion Model for Semi-Supervised Multi-Domain Translation
Authors: Tsiry Mayet, Simon Bernard, Romain HÉRAULT, Clement Chatelain
Abstract: In this work, we address the challenge of multi-domain translation, where the objective is to learn mappings between arbitrary configurations of domains within a defined set (such as $(D_1, D_2)\rightarrow{}D_3$, $D_2\rightarrow{}(D_1, D_3)$, $D_3\rightarrow{}D_1$, etc. for three domains) without the need for separate models for each specific translation configuration, enabling more efficient and flexible domain translation.
We introduce Multi-Domain Diffusion (MDD), a method with dual purposes: i) reconstructing any missing views for new data objects, and
ii) enabling learning in semi-supervised contexts with arbitrary supervision configurations. MDD achieves these objectives by exploiting the noise formulation of diffusion models, specifically modeling one noise level per domain.
Similar to existing domain translation approaches, MDD learns the translation between any combination of domains. However, unlike prior work, our formulation inherently handles semi-supervised learning without modification by representing missing views as noise in the diffusion process.
We evaluate our approach through domain translation experiments on BL3NDT, a multi-domain synthetic dataset designed for challenging semantic domain inversion, the BraTS2020 dataset, and the CelebAMask-HQ dataset.
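A minimal sketch of the core idea stated in the abstract (one noise level per domain, with missing views represented as noise), assuming a DDPM-style forward process; the schedule, shapes, and placeholder handling are assumptions rather than the authors' implementation.

```python
import torch

def forward_noise(x, t, alphas_cumprod):
    # Standard DDPM-style forward process applied to one domain at timestep t.
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x.dim() - 1)))
    return a_bar.sqrt() * x + (1 - a_bar).sqrt() * torch.randn_like(x)

def noise_multi_domain(views, observed, alphas_cumprod, T):
    # views: dict domain -> tensor (a zero placeholder when the view is missing);
    # observed: dict domain -> bool. Each domain gets its own timestep, so any
    # supervision configuration is expressed by which domains sit at maximum noise.
    noisy, timesteps = {}, {}
    for d, x in views.items():
        if observed[d]:
            t = torch.randint(0, T, (x.size(0),), device=x.device)   # per-domain noise level
            noisy[d] = forward_noise(x, t, alphas_cumprod)
        else:
            t = torch.full((x.size(0),), T - 1, device=x.device)     # missing view = pure noise
            noisy[d] = torch.randn_like(x)
        timesteps[d] = t
    return noisy, timesteps
```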
URL: https://openreview.net/forum?id=vYdT26kDYM
---
Title: Continuous Language Model Interpolation yields Dynamic and Controllable Text Generation
Authors: Sara Kangaslahti, David Alvarez-Melis
Abstract: As large language models (LLMs) have gained popularity for a variety of use cases, making them adaptable and controllable has become increasingly important, especially for user-facing applications. In particular, linear interpolation between model parameters forms the backbone for many recent approaches to adapting models to user preferences. While the existing literature on LLM adaptation primarily focuses on finding methods that optimize for some set of performance criteria or user preferences, here we instead seek to better understand and characterize the behavior of dense, continuous interpolation between models. Specifically, we use low-rank updates to fine-tune a base model to various different domains, yielding a set of anchor models with distinct generation profiles. Then, we use the weight updates of these anchor models to parametrize the entire (infinite) class of models contained within their convex hull. We empirically show that varying the interpolation weights yields predictable and consistent change in the model outputs with respect to all of the controlled attributes simultaneously. We find that there is little entanglement between most attributes and identify and discuss the pairs of attributes for which this is not the case. Our results suggest that parameter merging facilitates flexible model adaptation due to its predictable behavior within the full interpolation region.
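A minimal sketch of the parameter-interpolation step described above, assuming the fine-tuned anchor updates have already been materialized as per-parameter deltas; names and the merging granularity are illustrative.

```python
import torch

def interpolate_model(base_state, anchor_deltas, alphas):
    # base_state: dict name -> tensor, the base model parameters.
    # anchor_deltas: list of dicts, anchor_deltas[i][name] = (theta_i - theta_base),
    #   e.g. materialized low-rank (LoRA) updates for each fine-tuned anchor.
    # alphas: convex interpolation weights selecting a point in the anchors' convex hull.
    assert all(a >= 0 for a in alphas) and abs(sum(alphas) - 1.0) < 1e-6
    merged = {}
    for name, w in base_state.items():
        delta = sum(a * d[name] for a, d in zip(alphas, anchor_deltas) if name in d)
        merged[name] = w + delta
    return merged
```

Sweeping the interpolation weights continuously over the simplex traces out the dense, infinite class of models whose output behavior the paper characterizes.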
URL: https://openreview.net/forum?id=xD9Nu2Wah4
---
New submissions
===============
Title: Supervised score aggregation for active anomaly detection
Abstract: Detecting rare anomalies in batches of multidimensional data is challenging.
We propose a novel supervised active-learning framework that sends a small number of data points from each batch to an expert for labeling as 'anomaly' or 'nominal' via two mechanisms: (i) points most likely to be anomalies according to a supervised classifier trained on previously labeled data; and (ii) points suggested by an active learner. Instead of training the supervised classifier directly on the currently labeled raw data, we use the scores computed by an ensemble of $M$ user-defined unsupervised anomaly detectors as the classifier's input features. Our approach generalizes earlier attempts to linearly aggregate unsupervised anomaly-detector scores, and broadens the scope of these methods from unordered bags of data to ordered data such as time series. Experiments on simulated and real data suggest that this method usually outperforms linear strategies, often significantly.
The Python library acanag implements our proposed method.
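A minimal sketch of the aggregation and query-selection step described in the abstract, assuming scikit-learn-style detectors that expose score_samples, a logistic-regression aggregator, and a labeled set containing both classes; the acanag library's actual API and selection rules may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def score_features(X, detectors):
    # Stack the scores of M user-defined unsupervised detectors into an (n, M) matrix;
    # these scores, not the raw data, are the supervised learner's input features.
    return np.column_stack([d.score_samples(X) for d in detectors])

def select_queries(S, labeled_idx, labels, k_exploit=5, k_explore=5):
    # (i) points most likely to be anomalies under a classifier trained on labeled scores,
    # (ii) points the classifier is most uncertain about (active-learning suggestions).
    clf = LogisticRegression(max_iter=1000).fit(S[labeled_idx], labels)
    p_anom = clf.predict_proba(S)[:, 1]
    unlabeled = np.setdiff1d(np.arange(len(S)), labeled_idx)
    exploit = unlabeled[np.argsort(-p_anom[unlabeled])][:k_exploit]
    explore = unlabeled[np.argsort(np.abs(p_anom[unlabeled] - 0.5))][:k_explore]
    return np.union1d(exploit, explore)   # indices to send to the expert for labeling
```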
URL: https://openreview.net/forum?id=nrmJD3XMA3
---
Title: Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
Abstract: Large Language Models (LLMs) show remarkable proficiency in natural language tasks, yet their frequent overconfidence (a misalignment between predicted confidence and true correctness) poses significant risks in critical decision-making applications. We present a comprehensive analysis of calibration across nine LLMs and three factual Question-Answering (QA) datasets, systematically comparing standard free-generation settings against structured distractor-augmented prompts. Our evaluation reveals that explicitly incorporating distractors can substantially mitigate miscalibration, achieving relative accuracy improvements of up to 460% and ECE reductions of up to 90%. Beyond these general trends, we uncover nuanced findings: large RLHF-tuned models display inherent calibration strengths but can paradoxically suffer increased miscalibration on easier queries, whereas smaller models benefit disproportionately from distractor prompts yet remain significantly miscalibrated. Through detailed analyses across question types, we identify persistent calibration failures, particularly on person-based queries. We conclude with concrete recommendations (targeted fine-tuning, structured prompting, and strategic model choice) to ensure reliable, trustworthy LLM deployments.
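For reference, since the abstract reports ECE reductions: a minimal expected calibration error computation with equal-width confidence bins, which is one standard definition and not necessarily the paper's exact binning or protocol.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # ECE: bin predictions by confidence, then take the weighted average of
    # |mean confidence - accuracy| across bins.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

# Example: confidently wrong answers inflate ECE even when accuracy is moderate.
# expected_calibration_error([0.95, 0.9, 0.6, 0.55], [1, 0, 1, 0])
```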
URL: https://openreview.net/forum?id=lyaHnHDdZl
---
Title: Unifying Linear-Time Attention via Latent Probabilistic Modelling
Abstract: Transformers have achieved state-of-the-art results across a range of domains, but their quadratic attention mechanism poses significant challenges for long-sequence modelling. Recent efforts to design linear-time attention mechanisms have yielded more scalable alternatives, yet often at the cost of performance, particularly on discrete data such as language. In this work, we revisit linear attention through the lens of probabilistic graphical models. We first show that standard linear attention can be interpreted as an undirected latent variable model, revealing a key limitation: the absence of directionality. To address this, we propose a novel directed parameterisation of linear attention that introduces an asymmetric structure, enabling an interpretation aligned with the causal and sequential nature of language. Our formulation integrates global latent-variable attention with local standard attention in a fully probabilistic framework. Additionally, we introduce a recurrent parameterisation of queries and keys that avoids reliance on relative positional encodings, which are often incompatible with linear attention. Experiments on language modelling benchmarks demonstrate that our model achieves performance competitive with standard attention and outperforms existing linear attention variants.
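For context, a minimal sketch of the standard kernelized linear attention the abstract revisits, in its causal (cumulative-sum) form; this is the common baseline formulation, not the directed latent-variable parameterisation proposed in the paper.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(Q, K, V, eps=1e-6):
    # Linear-time causal attention: out_t = phi(q_t) S_t / (phi(q_t) z_t), where
    # S_t = sum_{s<=t} phi(k_s) v_s^T and z_t = sum_{s<=t} phi(k_s) are running sums.
    # Q, K, V: (B, T, d); phi(x) = elu(x) + 1 keeps the feature map positive.
    phi = lambda x: F.elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    S = torch.cumsum(torch.einsum('btd,bte->btde', Kf, V), dim=1)    # (B, T, d, d_v)
    z = torch.cumsum(Kf, dim=1)                                      # (B, T, d)
    num = torch.einsum('btd,btde->bte', Qf, S)
    den = torch.einsum('btd,btd->bt', Qf, z).unsqueeze(-1) + eps
    return num / den   # O(T) in sequence length, no T x T attention matrix
```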
URL: https://openreview.net/forum?id=TDFIjR7ynG
---
Title: SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
Abstract: Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel methods like repeated sampling are often inefficient and quickly saturate, while sequential methods like SELF-REFINE struggle to improve after a few rounds. Although combining these approaches shows promise, current methods require fine-tuned reward and revision models. This paper proposes Self-Enhanced Test-Time Scaling (SETS), a simple yet effective approach that overcomes these limitations by strategically combining parallel and sequential techniques and fully leveraging LLMs' self-improvement abilities. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This facilitates efficient and scalable test-time computation for enhanced performance on complex tasks without any model training. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
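A minimal sketch of the sample / self-verify / self-correct loop described in the abstract, with a generic llm callable and placeholder prompts; the actual prompting, stopping rules, and aggregation in SETS may differ.

```python
from collections import Counter

def sets_answer(llm, question, n_samples=8, max_corrections=2):
    # Parallel sampling, each candidate passed through self-verification and,
    # if it fails, self-correction; the final answer is chosen by majority vote.
    candidates = []
    for _ in range(n_samples):
        answer = llm(f"Solve the following problem:\n{question}")
        for _ in range(max_corrections):
            verdict = llm(f"Problem: {question}\nProposed answer: {answer}\n"
                          "Is this answer correct? Start your reply with 'yes' or 'no'.")
            if verdict.strip().lower().startswith("yes"):      # self-verification passed
                break
            answer = llm(f"Problem: {question}\nPrevious answer: {answer}\n"
                         f"Critique: {verdict}\nProvide a corrected answer.")  # self-correction
        candidates.append(answer.strip())
    return Counter(candidates).most_common(1)[0][0]            # aggregate verified attempts
```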
URL: https://openreview.net/forum?id=Wv9NMJoKww
---
Title: Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift
Abstract: We present Zero-Direction Probing (ZDP), a theoretical framework that characterizes model drift from null directions of transformer activations, requiring no task labels or output evaluations.
Under explicit assumptions (A1–A6), we prove: (i) the Variance–Leak Theorem (Thm. 1), (ii) Fisher Null-Conservation (Thm. 3), (iii) a Rank–Leak bound for low-rank updates (Thm. 5), and (iv) a logarithmic-regret guarantee for online null-space trackers (Thm. 4).
We further derive a Spectral Null-Leakage (SNL) metric with a non-asymptotic Laurent–Massart tail bound and an MP-edge-style concentration inequality, providing a priori thresholds for drift under a Gaussian null model.
Together, these results establish that “listening to silence” (monitoring the right/left null spaces of layer activations and their Fisher geometry) yields concrete, testable guarantees on representational change.
The manuscript is intentionally theory-only; empirical validation and benchmarking are deferred to companion work.
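The theorems and the precise SNL definition are in the manuscript; purely as an illustration of what "monitoring null directions of activations" can mean in practice, here is a sketch that estimates the null space of a reference activation matrix and measures how much new activations leak into it. The tolerance and normalisation are illustrative assumptions, not the paper's metric.

```python
import numpy as np

def null_directions(A, tol=1e-8):
    # A: (n_samples, d) reference layer activations. Return an orthonormal basis of the
    # right null space, i.e. directions along which the reference representation is silent.
    _, s, Vt = np.linalg.svd(A, full_matrices=True)
    rank = int(np.sum(s > tol * s.max()))
    return Vt[rank:].T                      # (d, d - rank)

def null_leakage(A_new, N):
    # Fraction of the new activations' energy lying in the old null directions;
    # staying near zero suggests no change along those directions, growth suggests drift.
    total = np.linalg.norm(A_new) ** 2
    leaked = np.linalg.norm(A_new @ N) ** 2
    return leaked / max(total, 1e-12)
```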
URL: https://openreview.net/forum?id=GICx8NVNNC
---
Title: More Rigorous Software Engineering Would Improve Reproducibility in Machine Learning Research
Abstract: While experimental reproduction remains a pillar of the scientific method, we observe that the software best practices supporting the reproduction of Machine Learning (ML) research are often undervalued or overlooked, leading both to poor reproducibility and to damaged trust in the ML community. We quantify these concerns by surveying the usage of software best practices in software repositories associated with publications at major ML conferences and journals, such as NeurIPS, ICML, ICLR, TMLR, and MLOSS, within the last decade. We report the results of this survey, identifying areas where software best practices are lacking and areas with potential for growth in the ML community. Finally, we discuss the implications and present concrete recommendations on how we, as a community, can improve reproducibility in ML research.
URL: https://openreview.net/forum?id=t3FcjU0xwf
---
Title: CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?
Abstract: Analyzing Open Source Intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive Cyber Threat Intelligence (CTI) reports. This process usually follows a three-stage workflow: triage, deep search, and TI drafting. While Large Language Models (LLMs) offer a promising route toward automation, existing benchmarks still have limitations. They often consist of tasks that do not reflect real-world analyst workflows; for example, human analysts rarely receive tasks in the form of multiple-choice questions. They also often rely on model-centric metrics that emphasize lexical overlap rather than the actionable, detailed insights essential for security analysts. Moreover, they typically fail to cover the complete three-stage workflow. To address these issues, we introduce CyberThreat-Eval, a benchmark collected from the daily CTI workflow of a world-leading company. This expert-annotated benchmark assesses LLMs on practical tasks across all three stages described above, using analyst-centric metrics that measure factual accuracy, content quality, and operational costs. Our evaluation with this benchmark reveals important limitations of current LLMs; for example, they often lack the nuanced expertise required to handle complex details and struggle to distinguish correct from incorrect information. To address these challenges, the CTI workflow incorporates both external ground-truth databases and human expert knowledge. TRA allows human experts to iteratively provide feedback for continuous improvement. The CyberThreat-Eval benchmark will be made available.
URL: https://openreview.net/forum?id=tiFtZHwr7O
---
Title: Domain Translation with Monolingual Lexical Distribution
Abstract: Neural machine translation (NMT) often demands a large amount of high-quality training data when adapting to a new domain with a carefully designed fine-tuning strategy. However, constructing a sufficient amount of parallel data for training poses challenges even for fine-tuning. This work proposes to fine-tune a generic NMT model using only the monolingual lexical distribution estimated from a small amount of in-domain data in the target language. Word frequency plays a critical role in analyzing the differences among corpora in various fields, e.g., psycholinguistics and language education, and our challenge lies in whether an NMT model can be fitted using only such naive statistics collected from the target-language domain. We leverage a variant of energy-based models (EBMs) based on Conditional Distributional Policy Gradients (CDPG), using a large number of EBMs to constrain the fine-tuning process with the lexical distribution. We conduct experiments across four translation directions and four domain datasets, totaling 16 domain adaptation scenarios. The results demonstrate that our method enables robust domain shift while mitigating catastrophic forgetting, achieving effective domain adaptation using only a small amount of monolingual resources.
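As a rough illustration (not the CDPG/EBM fine-tuning procedure itself): the "naive statistics" in question are essentially a unigram lexical distribution estimated from a small in-domain monolingual corpus, which can then serve as a constraint target, for instance via a divergence between it and the word frequencies of model outputs. Tokenization, smoothing, and function names here are assumptions.

```python
from collections import Counter
import math

def lexical_distribution(sentences, smoothing=1e-6):
    # Unigram lexical distribution of a small in-domain monolingual corpus:
    # smoothed, normalized word frequencies.
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = list(counts)
    return {w: (counts[w] + smoothing) / (total + smoothing * len(vocab)) for w in vocab}

def kl_to_target(model_outputs, target_dist):
    # KL(target || model outputs) over the target vocabulary: one simple way to quantify
    # how far generated text is from the desired in-domain lexical distribution.
    model_dist = lexical_distribution(model_outputs)
    return sum(p * math.log(p / model_dist.get(w, 1e-12)) for w, p in target_dist.items())
```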
URL: https://openreview.net/forum?id=UKLBobrFCR
---