Daily TMLR digest for Nov 05, 2025


TMLR

Nov 5, 2025, 12:30:07 AM
to tmlr-anno...@googlegroups.com


New certifications
==================

J2C Certification: Testing with Non-identically Distributed Samples

Shivam Garg, Chirag Pabbaraju, Kirankumar Shiragur, Gregory Valiant

https://openreview.net/forum?id=FUzvztzBlW

---


Accepted papers
===============


Title: CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection

Authors: Byeongchan Lee, John Won, Seunghyun Lee, Jinwoo Shin

Abstract: Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types (e.g., local and global defects), and the scarcity of training data. As such, it necessitates a comprehensive model capable of capturing both low-level and high-level features, even with limited data. To address this, we propose CLIPFUSION, a method that leverages both discriminative and generative foundation models. Given the CLIP-based discriminative model's limited capacity to capture fine-grained local details, we incorporate a diffusion-based generative model to complement its features. This integration yields a synergistic solution for anomaly detection. To this end, we propose using diffusion models as feature extractors for anomaly detection, and introduce carefully designed strategies to extract informative cross-attention and feature maps. Experimental results on benchmark datasets (MVTec-AD, VisA) demonstrate that CLIPFUSION consistently outperforms baseline methods in both anomaly segmentation and classification under zero-shot and few-shot settings. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection, providing a scalable solution for real-world applications.

URL: https://openreview.net/forum?id=WpFzZNuQmg
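
A minimal sketch of the fusion idea the abstract describes, with random tensors standing in for the CLIP and diffusion feature extractors; the shapes, the two-prompt CLIP scoring, and the convex-combination fusion rule are illustrative assumptions, not the paper's exact design:

    import torch
    import torch.nn.functional as F

    H = W = 16   # patch-grid resolution
    D = 512      # feature dimension

    # Stand-ins for CLIP patch features and two text embeddings ("normal"/"defect").
    patch_feats = F.normalize(torch.randn(H * W, D), dim=-1)
    text_normal = F.normalize(torch.randn(D), dim=-1)
    text_defect = F.normalize(torch.randn(D), dim=-1)

    # Discriminative branch: per-patch softmax over the two prompts -> P(defect).
    logits = torch.stack([patch_feats @ text_normal, patch_feats @ text_defect], dim=-1)
    clip_map = logits.softmax(dim=-1)[:, 1].reshape(H, W)

    # Generative branch: per-location distance between features of the test image
    # and features of a denoised (anomaly-free) counterpart from a diffusion model.
    feat_test, feat_ref = torch.randn(H, W, D), torch.randn(H, W, D)
    diff_map = (feat_test - feat_ref).norm(dim=-1)
    diff_map = (diff_map - diff_map.min()) / (diff_map.max() - diff_map.min() + 1e-8)

    # Fusion: a simple convex combination of the two anomaly maps.
    anomaly_map = 0.5 * clip_map + 0.5 * diff_map
    print(anomaly_map.shape, float(anomaly_map.max()))  # pixel map + image-level score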

---

Title: The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

Authors: Tianshi Zheng, Yixiang Chen, Chengxi Li, Chunyang Li, Qing Zong, Haochen Shi, Baixuan Xu, Yangqiu Song, Ginny Wong, Simon See

Abstract: Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing perspective within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT’s performance in pattern-based ICL: while explicit reasoning falters due to LLMs’ struggles to infer underlying patterns from demonstrations, implicit reasoning—disrupted by the increased contextual distance of CoT rationales—often compensates, delivering correct answers despite flawed rationales. This hybrid mechanism explains CoT’s relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.

URL: https://openreview.net/forum?id=7SIrvcYNYj
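
To make the two conditions concrete, here is a hedged sketch of direct-answering vs. CoT prompts on a toy pattern-based ICL task (the task, demonstrations, and rationale wording are invented for illustration); per the abstract, the inserted rationales increase the demonstration-to-answer contextual distance that implicit pattern matching relies on:

    demos = [("aab", "b"), ("ccd", "d"), ("eef", "f")]  # pattern: answer = last char
    query = "ggh"

    def direct_prompt(demos, query):
        lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
        lines.append(f"Input: {query}\nOutput:")
        return "\n\n".join(lines)

    def cot_prompt(demos, query):
        # CoT inserts a rationale between each demonstration's input and answer.
        lines = [
            f"Input: {x}\nReasoning: the output repeats the final character of the input.\nOutput: {y}"
            for x, y in demos
        ]
        lines.append(f"Input: {query}\nReasoning:")
        return "\n\n".join(lines)

    print(direct_prompt(demos, query))
    print("---")
    print(cot_prompt(demos, query))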

---

Title: Scalable Generative Modeling of Weighted Graphs

Authors: Richard Williams, Eric Nalisnick, Andrew Holbrook

Abstract: Weighted graphs are ubiquitous throughout biology, chemistry, and the social sciences, motivating the development of generative models for abstract weighted graph data using deep neural networks. However, most current deep generative models are designed for unweighted graphs and cannot be easily extended to weighted topologies. Among those that do incorporate edge weights, few consider a joint distribution with the topology of the graph. Furthermore, learning a distribution over weighted graphs must account for complex nonlocal dependencies between both the edges of the graph and the corresponding weights of each edge. We develop BiGG-E, an autoregressive model that nontrivially extends the BiGG model and learns a joint distribution over weighted graphs while exploiting sparsity to generate a weighted graph with $n$ nodes and $m$ edges in $O((n + m)\log n)$ time. Simulation studies and experiments on a variety of benchmark datasets demonstrate that BiGG-E best captures distributions over weighted graphs while remaining scalable and computationally efficient.

URL: https://openreview.net/forum?id=yWKkBOcD18
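
A toy sketch of the joint factorization such a model learns, $p(G)=\prod_t p(\mathrm{edge}_t \mid \mathrm{history})\,p(\mathrm{weight}_t \mid \mathrm{edge}_t, \mathrm{history})$, with constant stub conditionals in place of the learned network; note this naive $O(n^2)$ loop does not reproduce BiGG-E's binary-tree decomposition, which is what yields the $O((n+m)\log n)$ runtime:

    import random

    def sample_weighted_graph(n, edge_prob=0.2, seed=0):
        rng = random.Random(seed)
        edges = {}
        history = []  # generation history the real model would condition on
        for i in range(n):
            for j in range(i + 1, n):
                # p(edge | history): stub conditional, a constant here.
                if rng.random() < edge_prob:
                    # p(weight | edge, history): stub conditional, log-normal here.
                    w = rng.lognormvariate(0.0, 1.0)
                    edges[(i, j)] = w
                    history.append((i, j, w))
        return edges

    g = sample_weighted_graph(8)
    print(len(g), "edges;", {e: round(w, 2) for e, w in list(g.items())[:3]})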

---

Title: Testing with Non-identically Distributed Samples

Authors: Shivam Garg, Chirag Pabbaraju, Kirankumar Shiragur, Gregory Valiant

Abstract: We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $p_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities --- individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $p_{\mathrm{avg}}$ to within error $\varepsilon$ in $\ell_1$ distance. To test uniformity or identity --- distinguishing the case that $p_{\mathrm{avg}}$ is equal to some reference distribution from the case that it has $\ell_1$ distance at least $\varepsilon$ from the reference distribution --- we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear-sample testing guarantees of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $p_i$) can perform uniformity testing. We further extend our techniques to the problem of testing "closeness" of two distributions: given $c=3$ independent draws from each of $p_1, p_2,\ldots,p_T$ and $q_1, q_2,\ldots,q_T$, one can distinguish the case that $p_{\mathrm{avg}}=q_{\mathrm{avg}}$ from the case that they have $\ell_1$ distance at least $\varepsilon$ using $O(k^{2/3}/\varepsilon^{8/3})$ total samples, where $k$ is an upper bound on the support size, matching the optimal sample complexity of the i.i.d. setting up to the $\varepsilon$-dependence.

URL: https://openreview.net/forum?id=FUzvztzBlW
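
For intuition on why $c \ge 2$ helps, here is a hedged sketch of a textbook-style collision statistic consistent with the setup (not necessarily the paper's tester): with two draws $(x_i, y_i)$ from each $p_i$, within-group collisions estimate $\|p_i\|_2^2$ and cross-group collisions estimate $\langle p_i, p_j \rangle$, which together estimate $\|p_{\mathrm{avg}}\|_2^2$, a quantity that equals $1/k$ exactly when $p_{\mathrm{avg}}$ is uniform. Note the statistic uses which samples share a group, consistent with the abstract's result that multiset testers fail:

    import itertools
    import random

    def estimate_pavg_sqnorm(pairs):
        """pairs: list of (x_i, y_i), two independent draws from each p_i."""
        T = len(pairs)
        diag = sum(x == y for x, y in pairs)  # unbiased for sum_i ||p_i||_2^2
        cross = 0.0
        for (x1, y1), (x2, y2) in itertools.combinations(pairs, 2):
            # average of four cross-group collisions, each unbiased for <p_i, p_j>
            cross += (int(x1 == x2) + (x1 == y2) + (y1 == x2) + (y1 == y2)) / 4.0
        return (diag + 2.0 * cross) / (T * T)

    # Toy check: heterogeneous p_i (uniform on evens or on odds) whose
    # average is exactly uniform on {0, ..., k-1}.
    k, T = 10, 2000
    rng = random.Random(0)
    pairs = []
    for i in range(T):
        support = list(range(i % 2, k, 2))
        pairs.append((rng.choice(support), rng.choice(support)))
    print(estimate_pavg_sqnorm(pairs), "vs uniform value", 1.0 / k)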

---


New submissions
===============


Title: Handling Missing Data in Downstream Tasks With Distribution-Preserving Guarantees

Abstract: Missing feature values are a significant hurdle for downstream machine-learning tasks such as classification. However, imputation methods for classification might be time-consuming for high-dimensional data, and offer few theoretical guarantees on the preservation of the data distribution and imputation quality, especially for not-missing-at-random mechanisms. First, we propose an imputation approach named F3I based on the iterative improvement of a K-nearest neighbor imputation, where neighbor-specific weights are learned through the optimization of a novel concave, differentiable objective function related to the preservation of the data distribution on non-missing values. F3I can then be chained to and jointly trained with any classifier architecture. Second, we provide a theoretical analysis of imputation quality and data distribution preservation by F3I for several types of missing mechanisms. Finally, we demonstrate the superior performance of F3I on several imputation and classification tasks, with applications to drug repurposing and handwritten-digit recognition data.

URL: https://openreview.net/forum?id=7Wj1rZ7mJ4
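
A minimal sketch of one weighted K-nearest-neighbor imputation pass of the kind the abstract says F3I iteratively improves; the inverse-distance weights here are fixed stand-ins for the neighbor-specific weights that F3I learns by optimizing its concave distribution-preservation objective:

    import numpy as np

    def knn_impute_once(X, K=3):
        X = X.copy()
        mask = np.isnan(X)
        filled = np.where(mask, np.nanmean(X, axis=0), X)  # crude initial fill
        for i in range(X.shape[0]):
            if not mask[i].any():
                continue
            d = np.linalg.norm(filled - filled[i], axis=1)
            d[i] = np.inf                       # exclude the row itself
            nbrs = np.argsort(d)[:K]
            w = 1.0 / (d[nbrs] + 1e-8)          # fixed weights; learned in F3I
            w /= w.sum()
            X[i, mask[i]] = w @ filled[nbrs][:, mask[i]]
        return X

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 4))
    X[rng.random(X.shape) < 0.2] = np.nan       # inject missing values
    print(knn_impute_once(X).round(2))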

---

Title: Learning to Defer with an Uncertain Rejector via Conformal Prediction

Abstract: Learning to defer (L2D) allows prediction tasks to be allocated to a human or machine decision maker, thus combining the strengths of both. This allocation decision crucially depends on a ‘rejector’ function. In practice, the rejector could be poorly fit or otherwise misspecified. In this work, we perform uncertainty quantification for the rejector sub-component of the L2D framework. We use conformal prediction to allow the rejector to output prediction sets or intervals at a user-defined confidence level (with distribution-free guarantees), instead of just the binary outcome of ‘defer’ or not. On tasks ranging from image to hate speech classification, we demonstrate that the uncertainty in the rejector translates to safer decisions via two forms of selective prediction.

URL: https://openreview.net/forum?id=SZQJ8K2DUe
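
A minimal sketch of split conformal prediction applied to a binary rejector, so it outputs a set over {keep, defer} with $1-\alpha$ coverage rather than a hard decision; the synthetic calibration data, the nonconformity score, and the convention of acting conservatively on ambiguous sets are illustrative assumptions, not the paper's exact selective-prediction procedures:

    import numpy as np

    rng = np.random.default_rng(0)

    # Calibration data: rejector probability p(defer | x) plus a label for
    # whether deferring was the right call (1 = defer, 0 = keep).
    p_cal = rng.uniform(size=500)
    y_cal = (p_cal + 0.2 * rng.normal(size=500) > 0.5).astype(int)

    # Nonconformity: 1 minus the probability assigned to the true option.
    scores = np.where(y_cal == 1, 1 - p_cal, p_cal)
    alpha = 0.1
    n = len(scores)
    qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

    def conformal_reject_set(p_defer):
        """Return the set of options whose nonconformity is within qhat."""
        s = set()
        if 1 - p_defer <= qhat:
            s.add("defer")
        if p_defer <= qhat:
            s.add("keep")
        return s

    for p in (0.05, 0.5, 0.95):
        print(p, conformal_reject_set(p))  # ambiguous sets -> act conservatively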

---

Title: Differentially private and decentralized randomized power method

Abstract: The randomized power method has gained significant interest due to its simplicity and efficient handling of large-scale spectral analysis and recommendation tasks. However, its application to large datasets containing personal user information (e.g., web interactions, search history, personal tastes) raises critical privacy concerns. This paper addresses these issues by proposing enhanced privacy-preserving variants of the method. First, we propose a variant that reduces the variance of the noise required by current techniques to achieve Differential Privacy (DP). More precisely, we modify the algorithm and privacy analysis so that the Gaussian noise variance no longer grows linearly with the target rank, achieving the same $(\varepsilon,\delta)$-DP guarantees with lower noise variance. Second, we adapt our method to a decentralized framework in which data is distributed among multiple user devices, strengthening privacy guarantees with no accuracy penalty and low computational and communication overhead. Our results also include tighter convergence bounds for both the centralized and decentralized versions, and an empirical comparison with previous work on real recommendation datasets.

URL: https://openreview.net/forum?id=TkzhY8RB2K
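
For context, a hedged sketch of the noisy power iteration that Gaussian-mechanism DP variants of the method build on: calibrated noise is added to each matrix product before orthonormalization. The noise scale below is an uncalibrated placeholder; the paper's contribution is precisely a tighter calibration (variance not growing linearly with the target rank) and a decentralized variant, neither of which this sketch reproduces:

    import numpy as np

    def dp_power_method(A, rank, iters=20, noise_std=0.01, seed=0):
        rng = np.random.default_rng(seed)
        n = A.shape[1]
        X = rng.normal(size=(n, rank))
        for _ in range(iters):
            # Gaussian noise on each product is what provides the DP guarantee.
            Y = A @ X + noise_std * rng.normal(size=(A.shape[0], rank))
            X, _ = np.linalg.qr(A.T @ Y + noise_std * rng.normal(size=(n, rank)))
        return X  # approximate top right singular subspace

    A = np.random.default_rng(1).normal(size=(50, 30))
    V = dp_power_method(A, rank=3)
    # Compare against the exact top-3 right singular vectors.
    _, _, Vt = np.linalg.svd(A)
    print(np.linalg.svd(Vt[:3] @ V, compute_uv=False))  # all near 1 when aligned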

---

Title: Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning

Abstract: We study offline off-dynamics reinforcement learning (RL), which utilizes data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on Decision Transformer (DT) type frameworks, which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works address the dynamics-shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy is not directly applicable to RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented (REAG) method for DT-type frameworks, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide a theoretical analysis demonstrating that the RCSL policy learned from REAG achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations, $\mathrm{REAG}^{*}_{\mathrm{Dara}}$ and $\mathrm{REAG}^{*}_{\mathrm{MV}}$. Thorough experiments on D4RL datasets and various DT-type baselines demonstrate that our methods consistently enhance the performance of DT-type frameworks in off-dynamics RL.

URL: https://openreview.net/forum?id=QDVOr5J9Xp
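
The abstract specifies distribution alignment of returns but not the mechanism; one simple, plausible instantiation is empirical quantile (CDF) matching of source returns onto the target return distribution, sketched below. This is an illustration only, not necessarily how $\mathrm{REAG}^{*}_{\mathrm{Dara}}$ or $\mathrm{REAG}^{*}_{\mathrm{MV}}$ are built:

    import numpy as np

    def align_returns(source_returns, target_returns):
        src = np.asarray(source_returns, dtype=float)
        tgt = np.sort(np.asarray(target_returns, dtype=float))
        # Empirical CDF value of each source return within the source sample...
        ranks = np.argsort(np.argsort(src))
        u = (ranks + 0.5) / len(src)
        # ...mapped through the target distribution's empirical quantile function.
        return np.quantile(tgt, u)

    rng = np.random.default_rng(0)
    src = rng.normal(10.0, 2.0, size=1000)   # returns under source dynamics
    tgt = rng.normal(6.0, 1.0, size=200)     # limited target-domain returns
    aligned = align_returns(src, tgt)
    print(aligned.mean().round(2), aligned.std().round(2))  # ~ (6.0, 1.0)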

---

Title: Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation

Abstract: Knowledge distillation compresses a larger neural model (teacher) into smaller, faster student models by training them to match teacher outputs. However, the internal computational transformations that occur during this process remain poorly understood. We apply techniques from mechanistic interpretability to analyze how internal circuits, representations, and activation patterns differ between teachers and students. Focusing on GPT2 and its distilled counterpart DistilGPT2, and generalizing our findings to both bidirectional architectures and larger model pairs, we find that student models can reorganize, compress, and discard teacher components, often resulting in a stronger reliance on fewer individual components. To quantify functional alignment beyond output similarity, we introduce an alignment metric based on influence-weighted component similarity, validated across multiple tasks. Our findings reveal that while knowledge distillation preserves broad functional behaviors, it also causes significant shifts in internal computation, with important implications for the robustness and generalization capacity of distilled models.

URL: https://openreview.net/forum?id=S1KJE2ZW64
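
A minimal sketch of an influence-weighted component-similarity score in the spirit of the metric the abstract introduces; the component matching, the cosine similarity, and the hand-set influence weights are illustrative assumptions (in practice the influences would come from ablations, and the paper's metric may differ):

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def alignment_score(teacher_acts, student_acts, influences):
        """teacher_acts/student_acts: dict component -> activation vector
        (already matched across models); influences: component -> weight."""
        total = sum(influences.values())
        return sum(
            influences[c] * cosine(teacher_acts[c], student_acts[c])
            for c in teacher_acts
        ) / total

    rng = np.random.default_rng(0)
    comps = ["head_0", "head_1", "mlp_0"]
    teacher = {c: rng.normal(size=64) for c in comps}
    student = {c: teacher[c] + 0.3 * rng.normal(size=64) for c in comps}
    infl = {"head_0": 0.7, "head_1": 0.2, "mlp_0": 0.1}  # ablation influence
    print(round(alignment_score(teacher, student, infl), 3))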

---

Title: ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models

Abstract: Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, the federated block coordinate descent scheme has become a popular option for training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, posing substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning of LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as standard federated block coordinate descent methods. Empirical evaluations of fine-tuning LLMs on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.

URL: https://openreview.net/forum?id=Hnf7eCdBeV
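
A minimal sketch of the communication/computation overlap described in the abstract: one thread uploads the previous block's update while the main thread trains the next block. Sleeps stand in for local training and network transfer; ParaBlock's actual scheduling and convergence-preserving details may differ:

    import threading
    import time

    def train_block(b):
        time.sleep(0.2)                     # local computation on block b
        return f"update_{b}"

    def upload(update):
        time.sleep(0.2)                     # network transfer to the server

    start = time.time()
    pending = None                          # in-flight upload thread
    for b in range(4):
        update = train_block(b)             # compute thread: block b
        if pending is not None:
            pending.join()                  # ensure the previous upload finished
        pending = threading.Thread(target=upload, args=(update,))
        pending.start()                     # overlap this upload with training b+1
    pending.join()
    print(f"overlapped wall time: {time.time() - start:.2f}s vs 1.60s sequential")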

---
