Daily TMLR digest for Feb 14, 2026


TMLR

Feb 14, 2026, 12:30:09 AM
to tmlr-anno...@googlegroups.com


New certifications
==================

Survey Certification: A Survey on Federated Fine-Tuning of Large Language Models

Yebo Wu, Chunlin Tian, Jingguang Li, He Sun, KaHou Tam, Zhanting Zhou, Haicheng Liao, Jing Xiong, Zhijiang Guo, Li Li, Cheng-zhong Xu

https://openreview.net/forum?id=rnCqbuIWnn

---


Survey Certification: A Survey of Token Compression for Efficient Multimodal Large Language Models

Kele Shao, Keda TAO, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

https://openreview.net/forum?id=G2od9JVHkE

---


Accepted papers
===============


Title: DiffusionRollout: Uncertainty-Aware Rollout Planning in Long-Horizon PDE Solving

Authors: Seungwoo Yoo, Juil Koo, Daehyeon Choi, Minhyuk Sung

Abstract: We propose DiffusionRollout, a novel selective rollout planning strategy for autoregressive diffusion models, aimed at mitigating error accumulation in long-horizon predictions of physical systems governed by partial differential equations (PDEs). Building on the recently validated probabilistic approach to PDE solving, we further explore its ability to quantify predictive uncertainty and demonstrate a strong correlation between prediction errors and standard deviations computed over multiple samples, supporting their use as a proxy for the model's predictive confidence. Based on this observation, we introduce a mechanism that adaptively selects step sizes during autoregressive rollouts, improving long-term prediction reliability by reducing the compounding effect of conditioning on inaccurate prior outputs. Extensive evaluation on long-trajectory PDE prediction benchmarks validates the effectiveness of the proposed uncertainty measure and adaptive planning strategy, as evidenced by lower prediction errors and longer predicted trajectories that retain a high correlation with their ground truths.

URL: https://openreview.net/forum?id=OCzcGOzgzz
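The core mechanism described above, ensemble spread as a confidence proxy that gates the rollout step size, can be sketched in a few lines. Everything here (the tolerance, the candidate step sizes, the mean-of-std aggregate) is an illustrative assumption, not the paper's actual rule:

```python
import numpy as np

def adaptive_rollout_step(samples, tol=0.05, steps=(4, 2, 1)):
    """Pick a rollout step size from ensemble spread.

    samples: array (n_samples, *state_shape) of candidate next states
    drawn from a probabilistic surrogate. High spread means low
    confidence, so a smaller step. `tol` and `steps` are illustrative
    knobs, not the paper's hyperparameters.
    """
    uncertainty = samples.std(axis=0).mean()  # proxy for predictive error
    for k, step in enumerate(steps):
        # coarse steps are allowed only when uncertainty is small enough
        if uncertainty <= tol * (k + 1):
            return step
    return steps[-1]
```

With zero spread the coarsest step is taken; with large spread the rollout falls back to single steps, which is the qualitative behavior the abstract describes.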

---

Title: On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

Authors: Nicholas E. Corrado, Josiah P. Hanna

Abstract: On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This \textit{sampling error} leads to high-variance gradient estimates that yield data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can~\citep{zhong2022robust}. Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a \textit{behavior policy} that increases the probability of sampling actions that are under-sampled w.r.t. the current policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks and discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) increases the data efficiency of on-policy policy gradient algorithms.

URL: https://openreview.net/forum?id=nCoyFp8uO1
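A toy version of the idea above, reweighting a discrete action distribution toward actions that are under-sampled relative to the current policy, might look like this. The correction rule and its temperature are assumptions for illustration; PROPS itself learns the behavior policy with a proximal objective rather than applying a closed-form reweighting:

```python
import numpy as np

def props_style_weights(target_probs, counts, temperature=1.0):
    """Hypothetical sketch in the spirit of PROPS: boost actions whose
    empirical frequency falls short of the current policy's
    probabilities. Not the authors' implementation.
    """
    total = counts.sum()
    empirical = counts / total if total > 0 else np.zeros_like(target_probs)
    # mass by which each action is under-sampled w.r.t. the target policy
    deficit = np.clip(target_probs - empirical, 0.0, None)
    logits = np.log(target_probs + 1e-8) + deficit / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()
```

When the empirical counts already match the target distribution the deficit vanishes and the behavior policy reduces to the current policy, which is the fixed point one would want.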

---

Title: A Survey on Federated Fine-Tuning of Large Language Models

Authors: Yebo Wu, Chunlin Tian, Jingguang Li, He Sun, KaHou Tam, Zhanting Zhou, Haicheng Liao, Jing Xiong, Zhijiang Guo, Li Li, Cheng-zhong Xu

Abstract: Large Language Models (LLMs) have demonstrated impressive success across various tasks. Integrating LLMs with Federated Learning (FL), a paradigm known as FedLLM, offers a promising avenue for collaborative model adaptation while preserving data privacy. This survey provides a systematic and comprehensive review of FedLLM. We begin by tracing the historical development of both LLMs and FL, summarizing relevant prior research to set the context. Subsequently, we delve into an in-depth analysis of the fundamental challenges inherent in deploying FedLLM. Addressing these challenges often requires efficient adaptation strategies; therefore, we conduct an extensive examination of existing Parameter-Efficient Fine-tuning (PEFT) methods and explore their applicability within the FL framework. To rigorously evaluate the performance of FedLLM, we undertake a thorough review of existing fine-tuning datasets and evaluation benchmarks. Furthermore, we discuss FedLLM's diverse real-world applications across multiple domains. Finally, we identify critical open challenges and outline promising research directions to foster future advancements in FedLLM. This survey aims to serve as a foundational resource for researchers and practitioners, offering valuable insights into the rapidly evolving landscape of federated fine-tuning for LLMs. It also establishes a roadmap for future innovations in privacy-preserving AI. We actively maintain a GitHub repo to track cutting-edge advancements in this field.

URL: https://openreview.net/forum?id=rnCqbuIWnn

---

Title: A Survey of Token Compression for Efficient Multimodal Large Language Models

Authors: Kele Shao, Keda TAO, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain.

URL: https://openreview.net/forum?id=G2od9JVHkE

---

Title: Understanding Guidance Scale in Diffusion Models from a Geometric Perspective

Authors: Zhiyuan Zhan, Liuzhuozheng Li, Masashi Sugiyama

Abstract: Conditional diffusion models have become a leading approach for generating condition-consistent samples, such as class-specific images. In practice, the guidance scale is a key hyperparameter in conditional diffusion models, used to adjust the strength of the guidance term. While empirical studies have demonstrated that appropriately choosing the scale can significantly enhance generation quality, the theoretical understanding of its role remains limited. In this work, we analyze the probabilistic guidance term from a geometric view under the linear manifold assumption and, based on this analysis, construct a geometric guidance model that enables tractable theoretical study. To address regularity issues arising from multi-modal data, we introduce a mollification technique that ensures well-posed dynamics. Our theoretical results show that increasing the guidance scale improves alignment with the target data manifold, thereby enhancing generation performance. We further extend our framework to nonlinear manifolds, and empirical results on real-world datasets validate the effectiveness of the proposed model and are consistent with our theories.

URL: https://openreview.net/forum?id=nfHimL6g8G
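For context, the guidance term analyzed above is the one used in standard classifier-free guidance, where the scale amplifies the difference between conditional and unconditional score predictions:

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance combination. At scale = 1 this reduces
    to the plain conditional prediction; larger scales push the update
    further along the guidance direction (eps_cond - eps_uncond), the
    quantity whose geometry the paper studies."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```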

---

Title: Bayesian Network Structure Discovery Using Large Language Models

Authors: Yinghuan Zhang, Yufei Zhang, Parisa Kordjamshidi, Zijun Cui

Abstract: Understanding probabilistic dependencies among variables is central to analyzing complex systems. Traditional structure learning methods often require extensive observational data or are limited by manual, error-prone incorporation of expert knowledge. Recent studies have explored using large language models (LLMs) for structure learning, but most treat LLMs as auxiliary tools for pre-processing or post-processing, leaving the core learning process data-driven. In this work, we introduce a unified framework for Bayesian network structure discovery that places LLMs at the center, supporting both data-free and data-aware settings. In the data-free regime, we introduce \textbf{PromptBN}, which leverages LLM reasoning over variable metadata to generate a complete directed acyclic graph (DAG) in a single call. PromptBN effectively enforces global consistency and acyclicity through dual validation, achieving constant $\mathcal{O}(1)$ query complexity. When observational data are available, we introduce \textbf{ReActBN} to further refine the initial graph. ReActBN combines statistical evidence with LLM reasoning, integrating a novel ReAct-style loop with configurable structure scores (e.g., Bayesian Information Criterion). Experiments demonstrate that our method outperforms prior data-only, LLM-only, and hybrid baselines, particularly in low- or no-data regimes and on out-of-distribution datasets.

URL: https://openreview.net/forum?id=G4mrO8LVix
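One of the two validations mentioned above, acyclicity, is easy to illustrate: before accepting an LLM-proposed edge list, a PromptBN-style pipeline can verify that the graph admits a topological order (Kahn's algorithm). The check below is a generic sketch, not the paper's dual-validation procedure:

```python
def is_dag(edges, nodes):
    """Return True iff the directed graph (nodes, edges) is acyclic.

    Kahn's algorithm: repeatedly remove nodes with in-degree zero; the
    graph is a DAG iff every node gets removed. Illustrative check of
    the kind an LLM-proposed structure could be validated against.
    """
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [n for n, d in indeg.items() if d == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)
```

(Python's standard library offers `graphlib.TopologicalSorter` for the same check; the explicit version is shown for clarity.)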

---


New submissions
===============


Title: FreeEyeglass: Training-free and Target-mask-free Eyeglass Transfer for Facial Videos

Abstract: The rise of e-commerce and short-video platforms has fueled demand for realistic video-based virtual try-on. Unlike virtual try-on of clothing, which has been actively studied to date, virtual try-on of eyeglasses is uniquely challenging: they physically interact with facial geometry and strongly affect facial identity, making the faithful preservation of unedited regions especially important. Existing generative editing approaches, such as GAN- and diffusion-based methods, lack reconstruction objectives and often rely on inpainting, which fails to ensure identity consistency. We argue that semantic editing requires not only plausible generation but also faithful reconstruction, making autoencoder-based latent spaces particularly suitable. We introduce a training-free, reference-guided framework for video eyeglass transfer built on Diffusion Autoencoders (DiffAE). By blending semantic features in the encoder and incorporating spatial-temporal self-attention, our method achieves realistic, identity-preserving, and temporally consistent results, and points to the potential of autoencoder-based latent spaces for local video editing. Our implementations and datasets will be released upon acceptance.

URL: https://openreview.net/forum?id=6aFRoQcm3H

---

Title: LibMoE: A Library for Comprehensive Research on Mixture of Experts in Large Language Models

Abstract: Mixture of experts (MoE) architectures have become a cornerstone of model scaling and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, leaving large-scale studies inaccessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations.

URL: https://openreview.net/forum?id=PB2ju8tq0n

---

Title: Reranker Optimization via Geodesic Distances on k-NN Manifolds

Abstract: Current neural reranking approaches for retrieval-augmented generation (RAG) rely on cross-encoders or large language models (LLMs), requiring substantial computational resources and exhibiting latencies of 3-5 seconds per query. We propose Maniscope, a geometric reranking method that computes geodesic distances on k-nearest neighbor (k-NN) manifolds constructed over retrieved document candidates. This approach combines global cosine similarity with local manifold geometry to capture semantic structure that flat Euclidean metrics miss. Evaluated on eight BEIR benchmark datasets (1,233 queries), the method outperforms an HNSW graph-based baseline on the three hardest datasets (NFCorpus: +7.0%, TREC-COVID: +1.6%, AorB: +2.8% NDCG@3) while being 3.2x faster (4.7 ms vs. 14.8 ms average). Compared to cross-encoder rerankers, it achieves accuracy within 2% at 10-45x lower latency. On TREC-COVID, an LLM reranker provides only a +0.5% NDCG@3 improvement over our method at 840x higher latency, positioning Maniscope as a practical alternative for real-time RAG deployment. The approach has $O(ND + M^2D + Mk\log k)$ complexity with $M \ll N$, enabling sub-10 ms latency. We plan to release code and data in an open-source repository.

URL: https://openreview.net/forum?id=HvzgEt51f2
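The geometric core of the abstract, ranking candidates by shortest-path distance on a cosine k-NN graph rather than by direct similarity, can be sketched in plain NumPy. Floyd-Warshall stands in for whatever shortest-path routine the authors use, and `k` and the graph-construction details are assumptions:

```python
import numpy as np

def geodesic_rerank(query_vec, doc_vecs, k=3):
    """Illustrative geodesic reranker over a k-NN manifold (not the
    authors' code). Build a k-NN graph over the query + candidate
    embeddings with cosine distances as edge weights, then rank
    documents by graph shortest-path distance from the query.
    """
    pts = np.vstack([query_vec[None, :], np.asarray(doc_vecs, float)])
    unit = pts / np.linalg.norm(pts, axis=1, keepdims=True)
    d = 1.0 - unit @ unit.T                 # cosine distance matrix
    n = len(pts)
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]    # k nearest, excluding self
        graph[i, nbrs] = graph[nbrs, i] = d[i, nbrs]
    for m in range(n):                      # Floyd-Warshall geodesics
        graph = np.minimum(graph, graph[:, m:m + 1] + graph[m:m + 1, :])
    return np.argsort(graph[0, 1:])         # doc indices, nearest first
```

Floyd-Warshall is O(n^3) and only sensible for the small candidate set M mentioned in the abstract; over the full corpus one would use the approximate-neighbor machinery the paper compares against.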

---

Title: Graph State Networks (GSNs): Persistent Nodewise Selective State Space Models

Abstract: Temporal graphs are often observed as streams of timestamped interactions, where accurate prediction requires retaining and selectively using nodes' historical information. Existing temporal graph models either (i) recompute representations from a sliding neighborhood/history at query time, or (ii) maintain a memory module but offer limited control and limited theory for what is retained over long horizons. We propose Graph State Networks (GSNs), a bucketed temporal-graph framework that maintains a persistent hidden state per node and updates it online using a content- and time-dependent selective state space update. Concretely, GSNs store node states in an explicit id-indexed state table and, for each bucket, read the current state, update it with a time-aware Mamba-like mechanism, and commit the state back via an exponential moving average controlled by commit-rate $\alpha$. This commit mechanism provides an explicit "retention dial" and enables a clean analysis of forgetting. We develop a capacity/recall theory for persistent node memory and show that, under an affine approximation of blank-bucket dynamics, the influence of a single past event decays geometrically at a rate governed by $\alpha$ and the induced linearized update. Empirically, GSNs are competitive on standard dynamic link prediction benchmarks. We validate the theory with controlled synthetic write--wait--read probes: measured influence is close to exponential in delay, and fitting short-delay dynamics predicts long-horizon recall across commit rates.

URL: https://openreview.net/forum?id=zMEuBQfeT6
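The state-table-plus-EMA-commit loop described above is compact enough to sketch. The candidate-update function here is a stand-in (a tanh of state plus event) for the paper's time-aware Mamba-like mechanism; only the id-indexed table and the alpha-controlled commit follow the abstract:

```python
import numpy as np

class StateTable:
    """Toy per-node persistent state table with an EMA commit,
    sketching the 'retention dial' alpha from the abstract. Names and
    the candidate update are illustrative, not the paper's."""

    def __init__(self, dim, alpha=0.5):
        self.dim, self.alpha = dim, alpha
        self.states = {}                       # node id -> hidden state

    def update(self, node_id, event_vec):
        prev = self.states.get(node_id, np.zeros(self.dim))
        # stand-in for the selective state space step
        candidate = np.tanh(prev + event_vec)
        # EMA commit: alpha controls how fast old state is overwritten
        new = (1.0 - self.alpha) * prev + self.alpha * candidate
        self.states[node_id] = new
        return new
```

At alpha = 1 the state is fully replaced each bucket; at alpha = 0 nothing is ever written, which is the intuition behind the geometric-forgetting analysis.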

---

Title: Concept Realization Manifolds for Multi-Concept Activation and its (Dis)Entanglement in Large Language Models

Abstract: This work extends the Bias-CAV framework by introducing Concept Realization Manifolds (CRMs) as a geometric foundation for analyzing multi-concept activations and their entanglement in large language models. A theoretical framework is presented that reframes concepts as operational geometric regularities rather than latent variables. Multi-Concept Activation Subspaces (MCAS) are introduced to jointly model multiple bias-related concepts, addressing limitations of single-concept approaches identified in prior work. The operational limits of disentanglement are formally characterized through the Irreducible Measure Entanglement Theorem, which establishes that while directional entanglement can be reduced or removed, measure entanglement (activation distribution overlap) may persist due to data correlations and model optimization objectives. Conditional disentanglement methods are developed to operationalize partial concept separation. A comprehensive terminology hierarchy is established, including Concept Entanglement Fields, Conditional Concept Manifolds, and Intersectional Concept Regions. The framework is applied to bias analysis through multi-concept intervention mechanisms with formal fidelity guarantees. Examination of layer-wise entanglement patterns reveals structured relationships between concepts across transformer layers. Multi-axis evaluation demonstrates that MCAS reduces cross-dimension spillover effects by 2.4--3.6× compared to baseline methods in the evaluated settings, addressing concerns about unintended consequences in targeted bias mitigation. For practitioners, the framework provides operational methods for analyzing intersectional bias patterns (e.g., gender $\times$ profession interactions) and improving model interpretability through conditional disentanglement in the tested scenarios, even when perfect concept separation is theoretically impossible.

URL: https://openreview.net/forum?id=U8YU8dvm4A

---

Title: A Mechanistic View of Catastrophic Overfitting

Abstract: Adversarial Training (AT) suffers from a critical failure mode known as Catastrophic Overfitting (CO), where robustness to weak single-step adversaries does not translate to strong multi-step adversaries. Despite progress in mitigating CO, its underlying mechanisms remain poorly understood. In this work, we address two central questions: (1) Why does CO appear? and (2) What role do the number of Projected Gradient Descent (PGD) steps and PGD initialization play in CO? Using mathematically tractable models, we reveal a phase transition in the adversarial budget $\epsilon$, above which non-robust solutions become optimal. Furthermore, we show that CO exists for any well separated dataset, any number of PGD steps $S$, $\epsilon$ as small as desired, and randomized initialization. Our insights align with empirical observations in the community and help explain the difficulties in avoiding CO at larger scales. We believe our results deepen the understanding of CO and provide a foundation for developing future-proof solutions.

URL: https://openreview.net/forum?id=BQEZ3ZZBt3

---

Title: Sliding Window Recurrences for Sequence Models

Abstract: Multi-hybrid architectures are poised to take over language modeling due to better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that allows us to develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences (SWR). We focus specifically on truncating recurrences to hardware-aligned windows, which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B-parameter multi-hybrid models, Phalanx achieves 10-40% speedups across 4K to 32K context lengths over optimized Transformers while matching perplexity.

URL: https://openreview.net/forum?id=V09uO70ouz

---

Title: Re-evaluating Minimum Bayes Risk Decoding for Automated Speech Recognition Tasks

Abstract: While sample-based Minimum Bayes Risk (MBR) decoding has been shown to outperform beam search in many text-to-text generation tasks with modern LLMs, beam search remains the dominant approach for Automatic Speech Recognition (ASR) and Speech Translation (ST). To date, the efficacy of MBR decoding within modern speech systems lacks comprehensive evaluation. Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST on English and Japanese using Whisper and its derivative models. We observe that MBR decoding outperforms beam search in accuracy in most of the experimental settings we evaluated. The results show that MBR decoding is a promising method for ASR and ST tasks that require high accuracy.

URL: https://openreview.net/forum?id=I6iLWhRIsf
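Sample-based MBR itself is simple to state: draw hypotheses from the model, then return the one with the lowest average risk against the others (a Monte Carlo estimate of the Bayes risk). The sketch below uses a bag-of-words F1 as a stand-in utility; real ASR evaluations would use WER-derived or learned metrics, and the sampling step is omitted:

```python
from collections import Counter

def word_overlap_risk(hyp, ref):
    """Toy risk: 1 minus bag-of-words F1 between two transcripts
    (a stand-in for the utility metrics used in practice)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 1.0
    p, rc = overlap / sum(h.values()), overlap / sum(r.values())
    return 1.0 - 2 * p * rc / (p + rc)

def mbr_decode(samples, risk=word_overlap_risk):
    """Return the sampled hypothesis with the lowest average risk
    against all samples."""
    def expected_risk(h):
        return sum(risk(h, s) for s in samples) / len(samples)
    return min(samples, key=expected_risk)
```

Hypotheses that agree with the bulk of the sample set win, which is why MBR tends to be more robust than picking the single highest-probability beam.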

---

Title: BandAid: A Plug-in Patch for Backdoor Defenses against Clean-Label Attacks in NLP

Abstract: Recent state-of-the-art defenses against backdoor attacks on text classifiers have shown strong performance. A common approach is to analyze the feature space of the poisoned model to detect and mitigate suspicious samples during inference time. However, most existing defenses target “dirty-label” attacks, in which a poisoned sample’s content is inconsistent with its assigned label. In contrast, very few defenses have been evaluated against “clean-label” attacks, where the text content correctly matches the label but still triggers the backdoor. Yet, clean-label backdoors are particularly concerning, as they remain highly stealthy while being equally harmful. We find that many defenses fail to identify the decision boundary between clean and poisoned samples precisely. To this end, we investigate the performance of three inference-time defenses (DAN, BadActs, and MDP) against both insertion-based and paraphrase-based clean-label backdoor attacks, and discuss their limitations. We then propose a universal and simple plug-in module, BandAid, to strengthen existing defenses. BandAid significantly reduces the attack effectiveness in 99 out of 102 cases, with effectiveness reduced by up to 99.8%, while improving clean data accuracy by 7.0% on average. At its core, BandAid fine-tunes a lightweight classifier using suspicious samples flagged by existing defenses along with a small clean validation set. In this way, BandAid transforms an anomaly-detection task (identifying unusual examples) into a discriminative classification task (identifying patterns among suspicious samples), which leads to a substantially more effective defense. BandAid proves to be robust under stress tests across a range of attack types and datasets, providing strong improvements in both security and generalization.

URL: https://openreview.net/forum?id=F2cvY4xZmE
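The core move above, recasting anomaly detection as discrimination, amounts to fitting a small classifier on defense-flagged versus known-clean samples. A minimal sketch under assumptions (logistic regression on fixed feature vectors; BandAid's actual module and training recipe are not specified here):

```python
import numpy as np

def train_patch_classifier(flagged_feats, clean_feats, lr=0.5, epochs=200):
    """Fit a tiny logistic classifier on features of defense-flagged
    (label 1) vs. known-clean (label 0) samples; returns a function
    mapping a feature vector to an estimated P(poisoned). Illustrative
    only, not the BandAid module itself."""
    X = np.vstack([flagged_feats, clean_feats])
    y = np.r_[np.ones(len(flagged_feats)), np.zeros(len(clean_feats))]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        g = p - y                               # logistic-loss gradient
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return lambda f: 1.0 / (1.0 + np.exp(-(f @ w + b)))
```

The point of the exercise is the label source: the "positives" come from an existing defense's flags rather than ground truth, turning its noisy anomaly scores into a supervised decision boundary.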

---
