Daily TMLR digest for Feb 06, 2026

TMLR

Feb 6, 2026, 12:30:06 AM
to tmlr-anno...@googlegroups.com


New certifications
==================

Featured Certification, J2C Certification: T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Synthesis in Controllable Concept Art Generation

Zhenhong Sun, Yifu Wang, Yonhon Ng, Yongzhi Xu, Daoyi Dong, Hongdong Li, Pan Ji

https://openreview.net/forum?id=lyn2BgKQ8F

---


Accepted papers
===============


Title: T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Synthesis in Controllable Concept Art Generation

Authors: Zhenhong Sun, Yifu Wang, Yonhon Ng, Yongzhi Xu, Daoyi Dong, Hongdong Li, Pan Ji

Abstract: 2D concept art generation for 3D scenes is a crucial yet challenging task in computer graphics, as creating natural, intuitive environments still demands extensive manual effort in concept design. While generative AI has simplified 2D concept design via text-to-image synthesis, it struggles with complex multi-instance scenes and offers limited support for structured terrain layout. In this paper, we propose Training-free Triplet Tuning for Sketch-to-Scene (T3-S2S) generation, informed by a review of the entire cross-attention mechanism. This scheme revitalizes the ControlNet model for detailed multi-instance generation via three key modules: Prompt Balance ensures keyword representation and minimizes the risk of missing critical instances; Characteristic Priority emphasizes sketch-based features by highlighting TopK indices in feature channels; and Dense Tuning refines contour details within instance-related regions of the attention map. Leveraging the controllability of T3-S2S, we also introduce a feature-sharing strategy with dual prompt sets to generate layer-aware isometric and terrain-view representations for the terrain layout. Experiments show that our sketch-to-scene workflow consistently produces multi-instance 2D scenes with details aligned with input prompts.

URL: https://openreview.net/forum?id=lyn2BgKQ8F
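
As a rough illustration of the Characteristic Priority idea (highlighting TopK indices in feature channels of a cross-attention map), here is a minimal NumPy sketch. The tensor shapes, the boost factor, and the per-token application are assumptions for illustration, not the authors' implementation:

    import numpy as np

    def characteristic_priority(attn, k=8, boost=1.5):
        """Emphasize the top-k strongest channels of a cross-attention map.

        attn: (tokens, channels) attention scores for one sketch region.
        Returns a copy in which the k largest channels per token are scaled
        up, loosely mirroring "highlighting TopK indices in feature channels".
        """
        out = attn.copy()
        # Indices of the k largest entries along the channel axis.
        topk = np.argpartition(attn, -k, axis=-1)[..., -k:]
        rows = np.arange(attn.shape[0])[:, None]
        out[rows, topk] *= boost  # amplify dominant, sketch-aligned channels
        return out

    # Toy usage: 4 tokens, 16 channels.
    attn = np.random.rand(4, 16)
    boosted = characteristic_priority(attn, k=4)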

---

Title: Provable Domain Adaptation for Offline Reinforcement Learning with Limited Samples

Authors: Weiqin Chen, Xinjie Zhang, Sandipan Mishra, Santiago Paternain

Abstract: Offline reinforcement learning (RL) learns effective policies from a static target dataset. Despite the strong performance of state-of-the-art offline RL algorithms, that performance relies on the size of the target dataset and degrades when only limited target samples are available, which is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. However, establishing the optimal way to trade off the limited target dataset against the large-but-biased source dataset while ensuring provable theoretical guarantees remains an open challenge. To the best of our knowledge, this paper proposes the first framework that theoretically explores the impact of the weights assigned to each dataset on the performance of offline RL. In particular, we establish performance bounds and the existence of the optimal weight, which can be computed in closed form under simplifying assumptions. We also provide algorithmic guarantees in terms of convergence to a neighborhood of the optimum. Notably, these results depend on the quality of the source dataset and the number of samples in the target dataset. Our empirical results on the well-known Procgen and MuJoCo benchmarks substantiate the theoretical contributions in this work.

URL: https://openreview.net/forum?id=xog8ThcXwy
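
To make the dataset-weighting question concrete, here is a minimal sketch of weighted sampling between a small target dataset and a large, biased source dataset. The sampling scheme and the weight semantics are an illustrative stand-in, not the paper's algorithm:

    import numpy as np

    def mixed_batch(target, source, w, batch_size, rng):
        """Sample a batch mixing target and source transitions.

        w in [0, 1] is the weight on the small, unbiased target dataset;
        1 - w goes to the large-but-biased source dataset. The paper
        studies how such a weight affects offline RL performance and when
        an optimal weight exists in closed form.
        """
        n_target = int(round(w * batch_size))
        idx_t = rng.integers(0, len(target), size=n_target)
        idx_s = rng.integers(0, len(source), size=batch_size - n_target)
        return np.concatenate([target[idx_t], source[idx_s]])

    # Toy usage with 1-D "transitions" for brevity.
    rng = np.random.default_rng(0)
    target = rng.normal(size=100)      # limited target samples
    source = rng.normal(size=10000)    # large but biased source
    batch = mixed_batch(target, source, w=0.3, batch_size=64, rng=rng)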

---

Title: Scalable physical source-to-field inference with hypernetworks

Authors: Berian James, Stefan Pollok, Ignacio Peis, Elizabeth Louise Baker, Jes Frellsen, Rasmus Bjørk

Abstract: We present a generative model that amortises computation for the field and potential around e.g. gravitational or electromagnetic sources. Exact numerical calculation has either computational complexity $\mathcal{O}(M\times{}N)$ in the number of sources $M$ and evaluation points $N$, or requires a fixed evaluation grid to exploit fast Fourier transforms. Using an architecture where a hypernetwork produces an implicit representation of the field or potential around a source collection, our model instead performs as $\mathcal{O}(M + N)$, achieves relative error of $\sim\!4\%-6\%$, and allows evaluation at arbitrary locations for arbitrary numbers of sources, greatly increasing the speed of e.g. physics simulations. We compare with existing models and develop two-dimensional examples, including cases where sources overlap or have more complex geometries, to demonstrate its application.

URL: https://openreview.net/forum?id=EvfwGpo135
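
The $\mathcal{O}(M + N)$ structure can be sketched in a few lines: encode the $M$ sources once into a latent code, then evaluate a conditioned field network at the $N$ query points. In the paper a hypernetwork maps sources to the weights of an implicit field network; here fixed random projections stand in, and all dimensions are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    d_src, d_pt, h = 3, 2, 16                  # illustrative dimensions
    W_enc = rng.normal(size=(d_src, h))
    W_pt = rng.normal(size=(d_pt, h))
    W_lat = rng.normal(size=(h, h))
    w_out = rng.normal(size=h)

    def encode_sources(sources):
        """O(M): pool per-source features into one latent code.
        A real hypernetwork would map this code to the weights of an
        implicit field network; a fixed projection stands in here."""
        feats = np.tanh(sources @ W_enc)       # (M, h)
        return feats.mean(axis=0)              # (h,) permutation-invariant

    def eval_field(points, latent):
        """O(N): evaluate the implicit field at arbitrary query points."""
        hidden = np.tanh(points @ W_pt + latent @ W_lat)
        return hidden @ w_out                  # scalar field per point

    sources = rng.normal(size=(50, d_src))     # M sources
    points = rng.normal(size=(1000, d_pt))     # N evaluation points
    field = eval_field(points, encode_sources(sources))  # cost ~ O(M + N)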

---


New submissions
===============


Title: Explainable Image-Centric Forgery Detection: A Survey

Abstract: The rapid growth of AI-driven image manipulation technologies poses critical challenges for verifying content authenticity. While many forgery detection systems achieve high accuracy, their black-box nature limits deployment in high-stakes domains that demand transparency and explainability. This survey presents the first comprehensive review of explainable image-centric forgery detection, introducing a novel taxonomy structured around three dimensions: Forgery Localization (FL), which pinpoints manipulated regions; Forgery Attribution (FA), which identifies manipulation sources; and Forgery Judgment Basis (FJB), which clarifies decision reasoning. We systematically analyze 48 state-of-the-art methods across single-modal and multi-modal settings, examining architectural innovations and interpretability mechanisms. Four feature-driven strategies (RGB, frequency-domain, noise-texture, and representation learning) are reviewed in detail, highlighting their complementary strengths. Benchmark datasets and evaluation protocols are also compared, and open challenges are identified, including the need for standardized explanation formats, uncertainty quantification, and broader dataset coverage. By establishing this taxonomy and synthesizing recent progress, this survey lays a foundation for developing transparent and trustworthy forgery detection systems, supporting real-world applications in forensic analysis, news verification, and regulatory compliance.

URL: https://openreview.net/forum?id=KdAbuKGN2m

---

Title: SpurLens: Automatic Detection of Spurious Cues in Multimodal LLMs

Abstract: Unimodal vision models are known to rely on spurious correlations, but it remains unclear to what extent Multimodal Large Language Models (MLLMs) exhibit similar biases despite language supervision. In this paper, we investigate spurious bias in MLLMs and introduce SpurLens, a pipeline that leverages GPT-4 and open-set object detectors to automatically identify spurious visual cues without human supervision. Our findings reveal that spurious correlations cause two major failure modes in MLLMs: (1) over-reliance on spurious cues for object recognition, where removing these cues reduces accuracy, and (2) object hallucination, where spurious cues amplify the hallucination by over 10x.
We investigate various MLLMs and datasets, and validate our findings with multiple robustness checks. Beyond diagnosing these failures, we explore potential mitigation strategies, such as prompt ensembling and reasoning-based prompting, and conduct ablation studies to examine the root causes of spurious bias in MLLMs. By exposing the persistence of spurious correlations, our study calls for more rigorous evaluation methods and mitigation strategies to enhance the reliability of MLLMs.

URL: https://openreview.net/forum?id=jxw6t5VVNL
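
The "removing the cue reduces accuracy" diagnostic can be sketched as a gap metric. The mllm wrapper and the prompt format are hypothetical; the actual pipeline uses GPT-4 and open-set detectors to propose and validate cues automatically:

    def spurious_gap(mllm, images_with_cue, images_without_cue, target):
        """Score a candidate spurious cue by the recognition gap it causes.

        mllm(image, prompt) -> str is a hypothetical wrapper around any
        multimodal LLM. A large accuracy drop when the cue is absent
        suggests the model relies on the cue rather than the object.
        """
        prompt = f"Is there a {target} in this image? Answer yes or no."

        def acc(images):
            hits = sum("yes" in mllm(im, prompt).lower() for im in images)
            return hits / max(len(images), 1)

        return acc(images_with_cue) - acc(images_without_cue)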

---

Title: TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

Abstract: Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem -- false negatives -- where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments rule-based methods with a compact LLM to dynamically detect false negatives and recover valid trajectories, thereby producing more accurate reward signals during RL training. Across multiple math-reasoning benchmarks, integrating TinyV improves final model performance by up to 10%, and reaches the peak performance of the rule-based verifier using fewer than 50% of the training steps. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL tuning of LLMs.

URL: https://openreview.net/forum?id=HMGsqApBM3
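
The two-tier verification idea reduces to a cheap-first cascade: only answers the rules reject are escalated to the compact LLM, so most reward calls stay cheap while rule-based false negatives can be recovered. A minimal sketch, with both checkers passed in as hypothetical callables:

    def verify(answer, gold, rule_based_check, tiny_llm_judge):
        """Two-tier verification in the spirit of TinyV (illustrative).

        rule_based_check(answer, gold) -> bool: cheap exact/symbolic match.
        tiny_llm_judge(answer, gold) -> bool: hypothetical compact LLM that
        re-examines rejections to catch false negatives.
        """
        if rule_based_check(answer, gold):
            return 1.0                 # rules already accept: reward 1
        if tiny_llm_judge(answer, gold):
            return 1.0                 # rule rejection was a false negative
        return 0.0                     # both tiers reject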

---

Title: Hierarchical Filtering and Refinement Classification for Few-Shot Class-Incremental Learning

Abstract: Few-shot class-incremental learning (FSCIL) aims at recognizing novel classes continually with limited novel class samples. A mainstream baseline for FSCIL is first to train the whole model in the base session, then freeze the feature extractor in the incremental sessions. Despite achieving high overall accuracy, most methods exhibit notably low accuracy on incremental classes. While some recent methods have recognized this issue, their strategies remain constrained by a unified classification objective across all samples, making it difficult to simultaneously satisfy the performance requirements of both base and incremental classes. In this paper, considering that base and incremental classes play different yet both critical roles in FSCIL, we approach FSCIL from a more structured perspective by decomposing the overall classification objective into three sub-objectives. Building on this insight, we propose a novel classification framework called Hierarchical Filtering and Refinement Classification (HFRC) to hierarchically decompose and address the classification task. Extensive experiments demonstrate that our method effectively balances the classification accuracy between base and incremental classes, and achieves superior performance compared to state-of-the-art methods.

URL: https://openreview.net/forum?id=7MXra1JSh8
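
For context, the frozen-extractor baseline that the abstract builds on can be sketched as a nearest-class-prototype classifier extended one class at a time. This is the mainstream baseline only; HFRC's hierarchical decomposition sits on top of such a classifier and is not reproduced here:

    import numpy as np

    class PrototypeClassifier:
        """Frozen-extractor FSCIL baseline (sketch). After base-session
        training the feature extractor is frozen; each incremental class
        is added as the mean of its few-shot features, and prediction is
        nearest class prototype."""

        def __init__(self):
            self.prototypes, self.labels = [], []

        def add_class(self, label, features):
            self.prototypes.append(features.mean(axis=0))  # class mean
            self.labels.append(label)

        def predict(self, feature):
            protos = np.stack(self.prototypes)
            dists = np.linalg.norm(protos - feature, axis=1)
            return self.labels[int(dists.argmin())]

    clf = PrototypeClassifier()
    clf.add_class("base_cat", np.random.randn(100, 64))   # many base samples
    clf.add_class("novel_owl", np.random.randn(5, 64))    # 5-shot novel class
    print(clf.predict(np.random.randn(64)))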

---

Title: Finding Landmarks of Covariate Shift with the Max-Sliced Kernel Wasserstein Distance

Abstract: To detect distribution shifts caused by localized changes, we propose an interpretable kernel-based max-sliced Wasserstein divergence, which is computationally efficient for two-sample testing. The max landmark kernel Wasserstein distance (MLW) seeks a single data point whose kernel embedding acts as a slice of the kernel Hilbert space, such that the two samples' resulting projections (kernel evaluations between each point and the landmark) have maximal Wasserstein distance. This landmark, or multiple landmarks chosen via a greedy algorithm, provides an interpretation of localized divergences. We investigate MLW's ability to detect and localize distribution shifts corresponding to over- or under-representation of one class. Results on the MNIST and CIFAR-10 datasets demonstrate MLW's competitive statistical power and accurate landmark selection. Using the mean embedding (ME) test statistic with multiple MLW landmarks enables state-of-the-art power on the Higgs dataset.

URL: https://openreview.net/forum?id=sO6W3M0K04
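
A minimal sketch of the single-landmark case: each candidate landmark z defines a 1-D projection x -> k(x, z), and the landmark maximizing the closed-form 1-D Wasserstein-1 distance between the two projected samples is kept. Searching over the pooled sample, the Gaussian kernel, and equal sample sizes are assumptions of this sketch, not necessarily the paper's choices:

    import numpy as np

    def mlw(X, Y, sigma=1.0):
        """Max landmark kernel Wasserstein distance (illustrative sketch).
        Assumes equal sample sizes so the 1-D Wasserstein-1 distance is
        the mean absolute difference of sorted samples."""

        def w1(a, b):                   # closed-form 1-D Wasserstein-1
            return np.abs(np.sort(a) - np.sort(b)).mean()

        def project(data, z):           # kernel evaluations k(., z)
            return np.exp(-np.sum((data - z) ** 2, axis=1)
                          / (2 * sigma ** 2))

        candidates = np.vstack([X, Y])  # search landmarks in pooled sample
        scores = [w1(project(X, z), project(Y, z)) for z in candidates]
        best = int(np.argmax(scores))
        return scores[best], candidates[best]

    X = np.random.randn(200, 2)
    Y = np.random.randn(200, 2) + np.array([1.5, 0.0])  # shifted sample
    dist, landmark = mlw(X, Y)   # the landmark localizes the divergence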

---

Title: LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Abstract: In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.

URL: https://openreview.net/forum?id=XnDCNgv7sF
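
The core reorganization step, turning preference labels into a verifiable reward, can be sketched as follows. The answer-format convention and the parser are assumptions for illustration, not the paper's exact protocol:

    def parse_choice(critique):
        """Extract the final A/B verdict from a free-form critique.
        Assumes the model ends its critique with 'Answer: A' or
        'Answer: B' (a prompt-format assumption)."""
        tail = critique.strip().upper()
        return "A" if tail.endswith("A") else "B" if tail.endswith("B") else None

    def preference_reward(critique, labeled_winner):
        """Verifiable reward: 1 if the generated judgment matches the
        recorded preference label, else 0, so standard RL can train a
        generative critic directly on preference-labeled critic data."""
        return 1.0 if parse_choice(critique) == labeled_winner else 0.0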

---

Title: Faster Approximate Top-K: Harnessing the Full Power of Two-Stages

Abstract: We consider the Top-K selection problem, which aims to identify the largest K elements in an array. Top-K selection arises in many machine learning algorithms and often becomes a bottleneck on accelerators, which are optimized for dense matrix multiplications. To address this problem, Chern et al. (2022) proposed a fast two-stage approximate Top-K algorithm that: (i) partitions the input array into equal-sized chunks and selects the top-1 element from each partition; and (ii) sorts the resulting smaller subset and returns the top K elements. In this paper, we generalize the first stage so that each partition selects the top K′ elements (for 1 ≤ K′ ≤ K). Our contributions include: (i) an expression for the expected recall of this generalized algorithm under random partitioning, and a demonstration that choosing K′ > 1 with fewer partitions in the first stage more effectively reduces the input size to the second stage while maintaining the same expected recall as the original algorithm; (ii) a bound on the expected recall of the original algorithm as a function of the algorithm parameters that is provably tighter by a factor of 2 than the bound reported by Chern et al. (2022); and (iii) an implementation of our algorithm on Cloud TPUv5e that achieves approximately an order of magnitude speedup over the original algorithm without sacrificing recall on real-world tasks.

URL: https://openreview.net/forum?id=izqZ1Crpjz
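
The generalized two-stage algorithm is easy to state in NumPy. Stage 1 takes the top-K′ of each partition (the paper's generalization of top-1 per partition); stage 2 runs exact top-K on the surviving candidates. This sketch assumes the array length divides evenly into partitions and uses sequential partitioning rather than the random partitioning analyzed in the paper:

    import numpy as np

    def approx_topk(x, k, k_prime, n_partitions):
        """Generalized two-stage approximate top-k (illustrative sketch)."""
        chunks = x.reshape(n_partitions, -1)
        # Stage 1: unordered top-k_prime per chunk via argpartition.
        idx = np.argpartition(chunks, -k_prime, axis=1)[:, -k_prime:]
        candidates = np.take_along_axis(chunks, idx, axis=1).ravel()
        # Stage 2: exact top-k on the much smaller candidate set.
        return np.sort(candidates)[-k:]

    x = np.random.randn(1 << 16)
    approx = approx_topk(x, k=64, k_prime=4, n_partitions=256)
    exact = np.sort(x)[-64:]
    recall = np.isin(approx, exact).mean()  # recall the paper analyzes

Choosing K′ > 1 with fewer partitions shrinks the stage-2 input (n_partitions * k_prime candidates) at matched expected recall, which is the trade-off the paper exploits.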

---

Title: Self-Questioning Language Models

Abstract: Can large language models improve without external data -- by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.

URL: https://openreview.net/forum?id=c4A91xyD7e
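
The two reward signals described in the abstract can be sketched directly from majority voting over sampled solver answers. The proposer's difficulty thresholds below are assumptions for illustration, not the paper's values:

    from collections import Counter

    def solver_reward(answers):
        """Reward each sampled answer for agreeing with the majority,
        a proxy for correctness absent ground-truth answers."""
        majority, _ = Counter(answers).most_common(1)[0]
        return [1.0 if a == majority else 0.0 for a in answers]

    def proposer_reward(answers, low=0.2, high=0.8):
        """Reward the proposer for intermediate difficulty: questions
        where all samples agree (too easy) or no consensus forms
        (too hard) both score 0."""
        majority_frac = max(Counter(answers).values()) / len(answers)
        return 1.0 if low < majority_frac < high else 0.0

    print(solver_reward(["12", "12", "15", "12"]))    # [1.0, 1.0, 0.0, 1.0]
    print(proposer_reward(["12", "12", "15", "12"]))  # 1.0 (0.75 agreement)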

---
