Daily TMLR digest for Dec 22, 2025

TMLR

Dec 22, 2025, 12:30:08 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching

Authors: Weimin Bai, Yubo Li, Wenzheng Chen, Weijian Luo, He Sun

Abstract: Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on Score Distillation Sampling (SDS) loss, which involves asymmetric KL divergence—a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. Such reformulation, together with SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across various 2D-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark, comparing against nine state-of-the-art baselines. Dive3D also achieves strong results on quantitative metrics, including text–asset alignment, 3D plausibility, text–geometry consistency, texture quality, and geometric detail.

URL: https://openreview.net/forum?id=OUYMueHLMf
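
The Dive3D abstract above attributes limited diversity to the mode-seeking behavior of an asymmetric KL objective. The following is a minimal toy sketch of that asymmetry on a four-bin discrete distribution, not the paper's method: the distributions `target`, `one_mode`, and `spread` are made-up values chosen for illustration, and the mode-seeking direction is assumed to be the reverse KL, KL(q || p), with q the approximating distribution.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical bimodal target with two well-separated modes (bins 0 and 3).
target = [0.49, 0.01, 0.01, 0.49]

# Two candidate approximations: one collapses onto a single mode,
# the other spreads its mass evenly over all bins.
one_mode = [0.97, 0.01, 0.01, 0.01]
spread = [0.25, 0.25, 0.25, 0.25]

# Reverse KL, KL(q || target): penalizes q for placing mass where the target has little.
reverse_prefers_one_mode = kl(one_mode, target) < kl(spread, target)

# Forward KL, KL(target || q): penalizes q for missing mass the target has.
forward_prefers_spread = kl(target, spread) < kl(target, one_mode)

print("reverse KL prefers the single-mode fit:", reverse_prefers_one_mode)
print("forward KL prefers the spread-out fit:", forward_prefers_spread)
```

In this toy setting the reverse direction scores the collapsed candidate better even though it ignores half of the target's mass, which is the kind of behavior the abstract describes as limiting diversity.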

---

Title: ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Authors: Xu Ouyang, Shengzhuang Chen, Michael Arthur Leopold Pearce, Thomas Hartvigsen, Jonathan Richard Schwarz

Abstract: Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization is a well-established class of appropriate algorithms. Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy-) models and final mixture performance. Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost. We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of benchmarks, showing speed-ups of over 500% in determining the best data mixture on our largest experiments relative to recent baselines. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs reproducible post-training pipelines worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area. Finally, we highlight rich opportunities for future research in this area, helping bridge the gap towards a comprehensive understanding of the broader effects of training data on model generalization.

URL: https://openreview.net/forum?id=0Euvm9zDpu

---

Title: Physics of Language Models: Part 1, Learning Hierarchical Language Structures

Authors: Zeyuan Allen-Zhu, Yuanzhi Li

Abstract: Transformer-based language models are effective but complex, and understanding their inner workings and reasoning mechanisms remains a significant challenge. Previous research has primarily explored how these models handle simple tasks such as name copying or selection; we extend this line of work by investigating how they perform recursive language structure reasoning defined by context-free grammars (CFGs). We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating long (e.g., hundreds of tokens), locally ambiguous sentences that require dynamic programming to parse. Despite this complexity, we demonstrate that autoregressive language models such as GPT can accurately learn and reason over these CFG-defined hierarchical languages and generate valid continuations. Analyzing model internals in this controlled setting, we reveal that hidden states linearly encode CFG parse structure, and that attention patterns align closely with the information flow of dynamic-programming parsing algorithms.

This paper also presents several corollary findings, including: why absolute positional embeddings are inferior to relative and rotary embeddings; why uniform attention alone is surprisingly effective (motivating our follow-up work on Canon layers); why encoder-only models (e.g., BERT, DeBERTa) struggle with *deep* structural reasoning on CFGs compared to autoregressive models (e.g., GPT); and why injecting structural or syntactic noise into pretraining data markedly improves robustness to corrupted language prompts.

URL: https://openreview.net/forum?id=mPQKyzkA1K
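
The abstract above refers to sentences that require dynamic programming to parse. As a rough illustration of what such parsing looks like, here is a minimal CYK recognizer sketch. The grammar is a hypothetical toy in Chomsky normal form (it generates a^n b^n) and is not one of the paper's synthetic CFGs; the names `TERMINAL_RULES`, `BINARY_RULES`, and `cyk_accepts` are introduced only for this example.

```python
# Hypothetical toy grammar in Chomsky normal form, generating a^n b^n (n >= 1):
#   S -> A B | A X,   X -> S B,   A -> 'a',   B -> 'b'
TERMINAL_RULES = {"a": {"A"}, "b": {"B"}}
BINARY_RULES = {("A", "B"): {"S"}, ("A", "X"): {"S"}, ("S", "B"): {"X"}}


def cyk_accepts(tokens, start="S"):
    """Return True if the token sequence is derivable from `start` (CYK algorithm)."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive tokens[i : i + j + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = set(TERMINAL_RULES.get(tok, ()))
    # Fill in longer spans from shorter ones: the dynamic-programming step.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for l_sym in left:
                    for r_sym in right:
                        table[i][length - 1] |= BINARY_RULES.get((l_sym, r_sym), set())
    return start in table[0][n - 1]


print(cyk_accepts(list("aabb")))  # derivable under the toy grammar
print(cyk_accepts(list("abab")))  # not derivable under the toy grammar
```

Each table cell depends on all ways of splitting its span into two shorter spans, so deciding membership cannot be done by looking at local context alone.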

---


New submissions
===============


Title: A Counterfactual-style Diagnostic Framework for Spurious Correlations in Text-to-Image Models

Abstract: Text-to-image diffusion models often encode correlations between demographic prompts and non-demographic attributes, some of which may be expected (e.g., gray hair with older age) while others may raise fairness concerns (e.g., cultural markers appearing only for certain ethnicities). Existing analyses of such correlations have been largely qualitative. In this work, we present a counterfactual-style diagnostic framework for stress-testing diffusion models. Inspired by stress-testing approaches (e.g., Veitch et al.), our method uses image-conditioned generation to approximately preserve facial features while systematically varying demographic variables in prompts (gender, ethnicity, age). This setup enables controlled observation of how non-demographic attributes (e.g., facial hair, accessories, hairstyles) shift under demographic changes. We introduce Counterfactual-style Invariance (CIV), along with positive and negative variance metrics (PCV, NCV), to quantify attribute stability and directional changes. Applying this framework across multiple text-to-image models reveals pervasive, prompt-dependent entanglements: for example, bushy eyebrows co-occur in 62.5% of generations with “Middle Eastern” prompts, and black hair is amplified in 64.8% of “East Asian” generations. These findings show that generative models can amplify or introduce associations between the demographic variables and observed attributes. This highlights the need for systematic diagnostic evaluations to better understand and mitigate fairness risks in text-to-image generation.

URL: https://openreview.net/forum?id=7HjGoHEoAA

---

Title: Riemannian Generative Decoder

Abstract: Riemannian representation learning typically relies on an encoder to estimate densities on chosen manifolds. This involves optimizing numerically brittle objectives, potentially harming model training and quality. To completely circumvent this issue, we introduce the Riemannian generative decoder, a unifying approach for finding manifold-valued latents on any Riemannian manifold. Latents are learned with a Riemannian optimizer while jointly training a decoder network. By discarding the encoder, we vastly simplify the manifold constraint compared to current approaches, which often handle only a few specific manifolds. We validate our approach on three case studies (a synthetic branching diffusion process, human migrations inferred from mitochondrial DNA, and cells undergoing a cell division cycle), each showing that learned representations respect the prescribed geometry and capture intrinsic non-Euclidean structure. Our method requires only a decoder, is compatible with existing architectures, and yields interpretable latent spaces aligned with data geometry. A temporarily anonymized codebase is available at: https://anonymous.4open.science/r/rgd-4gkL.

URL: https://openreview.net/forum?id=vuPMXg1FDT

---

Title: Reduced-Rank Outcome Compression for Causal Policy Optimization

Abstract: Evaluating the causal impacts of possible interventions is crucial for informing decision-making, especially towards improving access to opportunity. If causal effects are heterogeneous and predictable from covariates, then personalized treatment decisions can improve individual outcomes and contribute to both efficiency and equity. In practice, however, causal researchers do not have a single outcome in mind a priori and often collect multiple outcomes of interest that are noisy estimates of the true target of interest. For example, in government-assisted social benefit programs, policymakers collect many outcomes to understand the multidimensional nature of poverty. The ultimate goal is to learn an optimal treatment policy that in some sense maximizes multiple outcomes simultaneously. To address such issues, we present a data-driven dimensionality-reduction methodology for multiple outcomes in the context of optimal policy learning with multiple objectives. We learn a low-dimensional representation of the true outcome from the observed outcomes using reduced rank regression. We develop a suite of estimates that use the model to denoise observed outcomes, including commonly-used index weightings. These methods improve estimation error in policy evaluation and optimization, including on a case study of real-world cash transfer and social intervention data. Reducing the variance of noisy social outcomes can improve the performance of algorithmic allocations.

URL: https://openreview.net/forum?id=WQhOaY4yPC

---

Title: Divide and Conquer: Selective Value Learning and Policy Optimization for Offline Safe Reinforcement Learning

Abstract: Offline safe reinforcement learning (RL) aims to learn policies that maximize reward while satisfying safety constraints from a fixed dataset. Existing methods extend offline RL with primal–dual value learning and behavior-regularized policy optimization, but in safety-critical tasks they struggle: uniform updates across all states ignore the difference between safety-preserving and unsafe states, leading to inaccurate value estimates, infeasible solutions when constraints conflict, and strong sensitivity to dataset quality. We propose SEVPO (SElective Value Learning and Policy Optimization), a divide-and-conquer framework that separates updates based on state safety. SEVPO learns conservative cost values to identify safe states, applying reward-constrained optimization with selective regularization there, and switches to cost-minimization outside to compute least-cost escape paths. Extensive experiments show SEVPO achieves high reward and strict safety guarantees, outperforming state-of-the-art offline safe RL across diverse dataset qualities. We further validate SEVPO by training a Unitree Go2 quadruped robot in dynamic environments using only offline data, demonstrating its potential for safety-critical robotics (https://youtu.be/tDpWq2EV_Ig).

URL: https://openreview.net/forum?id=4KYrv6qYMl

---

Title: Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

Abstract: Neuroscientific research has revealed that the brain encodes complex behaviors by leveraging structured, low-dimensional manifolds and dynamically fusing multiple sources of information through adaptive gating mechanisms. Inspired by these principles, we propose a novel reinforcement learning (RL) framework that encourages the disentanglement of dynamics-specific and reward-specific features, drawing direct parallels to how neural circuits separate and integrate information for efficient decision-making. Our approach leverages locally linear embeddings (LLEs) to capture the intrinsic, locally linear structure inherent in many environments—mirroring the local smoothness observed in neural population activity—while concurrently deriving reward-specific features through the standard RL objective. An attention mechanism, analogous to cortical gating, adaptively fuses these complementary representations on a per-state basis. Experimental results on benchmark tasks demonstrate that our method, grounded in neuroscientific principles, improves learning efficiency and overall performance compared to conventional RL approaches, highlighting the benefits of explicitly modeling local state structures and adaptive feature selection as observed in biological systems.

URL: https://openreview.net/forum?id=p7p3iuah0G

---

Title: GENIE: A Visual-Only Diffusion Framework for Task-Agnostic Image Transformation

Abstract: Designing a unified vision model capable of handling diverse visual transformation tasks without task-specific modifications remains a significant challenge, particularly in scaling and generalizing beyond narrowly defined objectives. We propose GENIE, a novel ControlNet-Diffusion framework that performs task-based image generation solely through visual exemplars, eliminating dependence on textual prompts or auxiliary metadata. Unlike conventional prompt-driven diffusion models, GENIE employs a dual visual conditioning mechanism—combining implicit guidance via ControlNet and explicit task encoding through CLIP-based visual arithmetic—to infer task intent directly from reference input-output pairs. To improve semantic alignment between visual exemplars and generated outputs, we introduce a lightweight task consistency loss, which encourages representational coherence in the embedding space across transformed pairs. While not a multitask learner in the classical sense, GENIE enables task switching across multiple tasks without any task-specific modifications in architecture or task-specific loss functions. Evaluations across seven vision tasks—inpainting, colorization, edge detection, deblurring, denoising, semantic segmentation, and depth estimation—and two out-of-distribution (OOD) tasks—deraining and contrast enhancement—demonstrate that GENIE achieves an average performance gain of 10% over visual-conditioned baselines, showcasing its effectiveness for scalable and text-free visual generation.

URL: https://openreview.net/forum?id=vtth9hOwoP

---
