Weekly TMLR digest for Dec 28, 2025

TMLR

Dec 28, 2025, 12:00:10 AM

to tmlr-annou...@googlegroups.com

New certifications
==================

J2C Certification: Rethinking Memory in Continual Learning: Beyond a Monolithic Store of the Past

Yaqian Zhang, Bernhard Pfahringer, Eibe Frank, Albert Bifet

https://openreview.net/forum?id=wgjVUIYyOD

---

J2C Certification: Amortized Inference of Causal Models via Conditional Fixed-Point Iterations

Divyat Mahajan, Jannes Gladrow, Agrin Hilmkil, Cheng Zhang, Meyer Scetbon

https://openreview.net/forum?id=D9pq25PGc5

---

J2C Certification: Understanding Embedding Scaling in Collaborative Filtering

Yicheng He, Zhou Kaiyu, Haoyue Bai, Fengbin ZHU, Yonghui Yang

https://openreview.net/forum?id=3f5HtLqnaY

---

Survey Certification: A Survey on Generative Modeling with Limited Data, Few Shots, and Zero Shot

Milad Abdollahzadeh, Guimeng Liu, Touba Malekzadeh, Christopher T.H Teo, Keshigeyan Chandrasegaran, Ngai-Man Cheung

https://openreview.net/forum?id=u7GTHazuRp

---

Accepted papers
===============

Title: Rethinking Memory in Continual Learning: Beyond a Monolithic Store of the Past

Authors: Yaqian Zhang, Bernhard Pfahringer, Eibe Frank, Albert Bifet

Abstract: Memory is a critical component in replay-based continual learning (CL). Prior research has largely treated CL memory as a monolithic store of past data, focusing on how to select and store representative past examples. However, this perspective overlooks the higher-level memory architecture that governs the interaction between old and new data. In this work, we identify and characterize a dual-memory system that is inherently present in both online and offline CL settings. This system comprises a short-term memory, which temporarily buffers recent data for immediate model updates, and a long-term memory, which maintains a carefully curated subset of past experiences for future replay and consolidation. We propose the memory capacity ratio (MCR), the ratio between short-term memory and long-term memory capacities, to characterize online and offline CL. Based on this framework, we systematically investigate how MCR influences generalization, stability, and plasticity. Across diverse CL settings (class-incremental, task-incremental, and domain-incremental) and multiple data modalities (e.g., image and text classification), we observe that a smaller MCR, characteristic of online CL, can yield comparable or even superior performance relative to a larger one, characteristic of offline CL, when both are evaluated under equivalent computational and data storage budgets. This advantage holds consistently across several state-of-the-art replay strategies, such as ER, DER, and SCR. Theoretical analysis further reveals that a reduced MCR yields a better trade-off between stability and plasticity by lowering a bound on generalization error when learning from non-stationary data streams with limited memory. These findings offer new insights into the role of memory allocation in continual learning and underscore the underexplored potential of online CL approaches.
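
The MCR definition in the abstract above is a plain ratio, and a minimal sketch may make the online/offline contrast concrete. The function name and the capacities below are hypothetical (measured here in number of stored examples); they are not taken from the paper.

```python
def memory_capacity_ratio(short_term_capacity: int, long_term_capacity: int) -> float:
    """Ratio of short-term to long-term memory capacity (MCR), as described in the abstract."""
    if short_term_capacity < 0:
        raise ValueError("short_term_capacity must be non-negative")
    if long_term_capacity <= 0:
        raise ValueError("long_term_capacity must be positive")
    return short_term_capacity / long_term_capacity


# Hypothetical budgets, chosen only for illustration.
# A small buffer of recent data relative to the replay memory gives a small MCR
# (the regime the abstract associates with online CL).
online_like = memory_capacity_ratio(short_term_capacity=10, long_term_capacity=1000)

# Buffering a large chunk of new data before updating gives a large MCR
# (the regime the abstract associates with offline CL).
offline_like = memory_capacity_ratio(short_term_capacity=5000, long_term_capacity=1000)

print(online_like, offline_like)
```

Under this reading, comparing the two regimes fairly means holding the total storage and compute budget fixed while varying only how capacity is split between the two memories.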

URL: https://openreview.net/forum?id=wgjVUIYyOD

---

Title: DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation

Authors: Giorgio Franceschelli, Mirco Musolesi

Abstract: Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or increase the likelihood of unlikely tokens, compromising output accuracy and correctness. In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we also propose two variations of the proposed method that aim to correct the subtle inconsistencies of common sampling strategies.
Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, despite sampling from a larger set of tokens.
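
The truncation idea in the abstract above (using the difference between consecutive sorted probabilities) can be sketched as follows. This is one plausible reading, assuming the cut is placed at the single largest drop; the function names are hypothetical and this is not the authors' implementation or either of their two variations.

```python
import random


def truncate_at_largest_drop(probs):
    """Keep the tokens ranked before the largest drop between consecutive sorted probabilities.

    Returns a dict mapping kept token index -> renormalized probability.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    # Token indices sorted by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if len(order) == 1:
        return {order[0]: 1.0}
    sorted_p = [probs[i] for i in order]
    # drops[k] is the gap between the k-th and (k+1)-th most probable tokens.
    drops = [sorted_p[k] - sorted_p[k + 1] for k in range(len(sorted_p) - 1)]
    # Position of the largest gap (first one on ties).
    cut = max(range(len(drops)), key=lambda k: drops[k])
    kept = order[: cut + 1]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_token(probs, rng=random):
    """Sample one token index from the truncated, renormalized distribution."""
    kept = truncate_at_largest_drop(probs)
    tokens = list(kept)
    weights = [kept[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution: two plausible tokens followed by a sharp drop.
toy_probs = [0.05, 0.40, 0.35, 0.10, 0.10]
print(truncate_at_largest_drop(toy_probs))
```

In the toy distribution the largest gap separates the two most probable tokens from the rest, so only those two survive truncation, and sampling then chooses between them in proportion to their original probabilities.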

URL: https://openreview.net/forum?id=kXjHbMvdIi

---

Title: CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?

Authors: Xiangsen Chen, Xuan Feng, Shuo Chen, Matthieu Maitre, Sudipto Rakshit, Diana Duvieilh, Ashley Picone, Nan Tang

Abstract: Analyzing Open Source Intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive Cyber Threat Intelligence (CTI) reports. This process usually follows a three-stage workflow: triage, deep search, and TI drafting. While Large Language Models (LLMs) offer a promising route toward automation, existing benchmarks still have limitations. These benchmarks often consist of tasks that do not reflect real-world analyst workflows. For example, human analysts rarely receive tasks in the form of multiple-choice questions. Also, existing benchmarks often rely on model-centric metrics that emphasize lexical overlap rather than the actionable, detailed insights essential for security analysts. Moreover, they typically fail to cover the complete three-stage workflow. To address these issues, we introduce CyberThreat-Eval, which is collected from the daily CTI workflow of a world-leading company. This expert-annotated benchmark assesses LLMs on practical tasks across all three stages mentioned above. It utilizes analyst-centric metrics that measure factual accuracy, content quality, and operational costs. Our evaluation using this benchmark reveals important insights into the limitations of current LLMs. For example, LLMs often lack the nuanced expertise required to handle complex details and struggle to distinguish between correct and incorrect information. To address these challenges, the CTI workflow incorporates both external ground-truth databases and human expert knowledge. TRA allows human experts to iteratively provide feedback for continuous improvement. The code of the CyberThreat-Eval benchmark is available at https://github.com/secintelligence/CyberThreat-Eval.

URL: https://openreview.net/forum?id=tiFtZHwr7O

---

Title: Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

Authors: Jiaxu Qian, Chendong Wang, Yifan Yang, Chaoyun Zhang, Huiqiang Jiang, Xufang Luo, Yu Kang, Qingwei Lin, Anlan Zhang, Shiqi Jiang, Ting Cao, Tianjun Mao, Suman Banerjee, Guyue Liu, Saravan Rajmohan, Dongmei Zhang, Yuqing Yang, Qi Zhang, Lili Qiu

Abstract: Multimodal large language models (MLLMs) such as GPT-4o, Gemini Pro, and Claude 3.5 have enabled unified reasoning over text and visual inputs, yet they often hallucinate in real-world scenarios, especially when small objects or fine spatial context are involved. We pinpoint two core causes of this failure: the absence of region-adaptive attention and inflexible token budgets that force uniform downsampling, leading to critical information loss. To overcome these limitations, we introduce Zoomer, a visual prompting framework that delivers token-efficient, detail-preserving image representations for black-box MLLMs. Zoomer integrates (1) a prompt-aware emphasis module to highlight semantically relevant regions, (2) a spatial-preserving orchestration schema to maintain object relationships, and (3) a budget-aware strategy to optimally allocate tokens between global context and local details. Extensive experiments on nine benchmarks and three commercial MLLMs demonstrate that Zoomer boosts accuracy by up to 27% while cutting image token usage by up to 67%. Our approach establishes a principled methodology for robust, resource-aware multimodal understanding in settings where model internals are inaccessible.

URL: https://openreview.net/forum?id=u7RPDvumdF

---

Title: SoundnessBench: A Soundness Benchmark for Neural Network Verifiers

Authors: Xingjian Zhou, Keyi Shen, Andy Xu, Hongji Xu, Cho-Jui Hsieh, Huan Zhang, Zhouxing Shi

Abstract: Neural network (NN) verification aims to formally verify properties of NNs, which is crucial for ensuring the behavior of NN-based models in safety-critical applications. In recent years, the community has developed many NN verifiers and benchmarks to evaluate them. However, existing benchmarks typically lack ground-truth for hard instances where no current verifier can verify the property and no counterexample can be found. This makes it difficult to validate the soundness of a verifier, when it claims verification on such challenging instances that no other verifier can handle. In this work, we develop a new benchmark for NN verification, named "SoundnessBench", specifically for testing the soundness of NN verifiers. SoundnessBench consists of instances with deliberately inserted counterexamples that are hidden from adversarial attacks commonly used to find counterexamples. Thereby, it can identify false verification claims when hidden counterexamples are known to exist. We design a training method to produce NNs with hidden counterexamples and systematically construct our SoundnessBench with instances across various model architectures, activation functions, and input data. We demonstrate that our training effectively produces hidden counterexamples and our SoundnessBench successfully identifies bugs in state-of-the-art NN verifiers. Our code is available at https://github.com/mvp-harry/SoundnessBench and our dataset is available at https://huggingface.co/datasets/SoundnessBench/SoundnessBench.

URL: https://openreview.net/forum?id=UuYYldVLH3

---

Title: Amortized Inference of Causal Models via Conditional Fixed-Point Iterations

Authors: Divyat Mahajan, Jannes Gladrow, Agrin Hilmkil, Cheng Zhang, Meyer Scetbon

Abstract: Structural Causal Models (SCMs) offer a principled framework to reason about interventions and support out-of-distribution generalization, which are key goals in scientific discovery. However, the task of learning SCMs from observed data poses formidable challenges, and often requires training a separate model for each dataset. In this work, we propose an amortized inference framework that trains a single model to predict the causal mechanisms of SCMs conditioned on their observational data and causal graph. We first use a transformer-based architecture for amortized learning of dataset embeddings, and then extend the Fixed-Point Approach (FiP) to infer the causal mechanisms conditionally on their dataset embeddings. As a byproduct, our method can generate observational and interventional data from novel SCMs at inference time, without updating parameters. Empirical results show that our amortized procedure performs on par with baselines trained specifically for each dataset on both in-distribution and out-of-distribution problems, and also outperforms them in scarce data regimes.

URL: https://openreview.net/forum?id=D9pq25PGc5

---

Title: Are Data Embeddings Effective in Time Series Forecasting?

Authors: Reza Nematirad, Anil Pahwa, Balasubramaniam Natarajan

Abstract: Time series forecasting plays a crucial role in many real-world applications, and numerous complex forecasting models have been proposed in recent years. Despite their architectural innovations, most state-of-the-art models report only marginal improvements—typically just a few thousandths in standard error metrics. These models often incorporate complex data embedding layers, which typically transform raw inputs into higher-dimensional representations to enhance accuracy. But are data embedding techniques actually effective in time series forecasting? Through extensive ablation studies across fifteen state-of-the-art models on multiple benchmark datasets, we find that removing data embedding layers from many state-of-the-art models does not degrade forecasting performance—in many cases, it improves both accuracy and computational efficiency. The gains from removing embedding layers often exceed the performance differences typically reported between competing state-of-the-art models.

URL: https://openreview.net/forum?id=yeu44ZRvZZ

---

Title: SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision

Authors: Maxime Poli, Mahi Luthra, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Jiayi Shen, Robin Algayres, Yu-An Chung, Mido Assran, Juan Pino, Emmanuel Dupoux

Abstract: The parallel advances in language modeling and speech representation learning have raised the prospect of learning language directly from speech without textual intermediates. This requires extracting semantic representations directly from speech. Our contributions are threefold. First, we introduce SpidR, a self-supervised speech representation model that efficiently learns representations with highly accessible phonetic information, which makes it particularly suited for textless spoken language modeling. It is trained on raw waveforms using a masked prediction objective combined with self-distillation and online clustering. The intermediate layers of the student model learn to predict assignments derived from the teacher's intermediate layers. This learning objective stabilizes the online clustering procedure compared to previous approaches, resulting in higher quality codebooks. SpidR outperforms wav2vec 2.0, HuBERT, WavLM, and DinoSR on downstream language modeling benchmarks (sWUGGY, sBLIMP, tSC). Second, we systematically evaluate across models and layers the correlation between speech unit quality (ABX, PNMI) and language modeling performance, validating these metrics as reliable proxies. Finally, SpidR significantly reduces pretraining time compared to HuBERT, requiring only one day of pretraining on 16 GPUs, instead of a week. This speedup is enabled by the pretraining method and an efficient codebase, which allows faster iteration and easier experimentation. We open-source the training code and model checkpoints at https://github.com/facebookresearch/spidr.

URL: https://openreview.net/forum?id=E7XAFBpfZs

---

Title: Understanding Embedding Scaling in Collaborative Filtering

Authors: Yicheng He, Zhou Kaiyu, Haoyue Bai, Fengbin ZHU, Yonghui Yang

Abstract: Scaling recommendation models into large recommendation models has become one of the most widely discussed topics. Recent efforts focus on scaling components other than the embedding dimension, as it is believed that scaling embeddings may lead to performance degradation. Although there have been some initial observations on embeddings, the root cause of their non-scalability remains unclear. Moreover, whether performance degradation occurs across different types of models and datasets is still an unexplored area. Regarding the effect of embedding dimensions on performance, we conduct large-scale experiments across 10 datasets with varying sparsity levels and scales, using 4 representative classical architectures. We surprisingly observe two novel phenomena: double-peak and logarithmic. For the former, as the embedding dimension increases, performance first improves, then declines, rises again, and eventually drops. For the latter, it exhibits a perfect logarithmic curve. Our contributions are threefold. First, we discover two novel phenomena when scaling collaborative filtering models. Second, we gain an understanding of the underlying causes of the double-peak phenomenon. Lastly, we theoretically analyze the noise robustness of collaborative filtering models, with results matching empirical observations.

URL: https://openreview.net/forum?id=3f5HtLqnaY

---

Title: Learning to Coordinate with Experts

Authors: Mohamad H. Danesh, Khanh Xuan Nguyen, Tu Trinh, Benjamin Plaut

Abstract: When deployed in the real world, AI agents will inevitably face challenges that exceed their individual capabilities. A critical component of AI safety is an agent’s ability to recognize when it is likely to fail in a novel situation and to yield control to a more capable expert system. Leveraging such expert assistance can significantly improve safety and performance in such situations. Since expert assistance is costly, a central challenge is determining when to consult an expert. In this paper, we explore a novel variant of this problem, termed YRC-0, in which an agent must learn to collaborate with an expert in new environments in an unsupervised manner, that is, without interacting with the expert during training. This setting motivates the development of low-cost, robust approaches for training expert-leveraging agents. To support research in this area, we introduce YRC-Bench, an open-source benchmark that instantiates YRC-0 across diverse environments. YRC-Bench provides a standardized Gym-like API, simulated experts, an evaluation pipeline, and implementations of popular baselines. Toward tackling YRC-0, we propose a validation strategy and use a proposer-validator decomposition as a diagnostic framework to evaluate a range of learning methods, offering insights that can inform future research. Codebase: https://github.com/modanesh/YRC-Bench

URL: https://openreview.net/forum?id=YOE0TRK8oU

---

Title: LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction

Authors: Robert Joseph George, Suozhi Huang, Peiyang Song, Anima Anandkumar

Abstract: Mathematical reasoning remains a significant challenge for Large Language Models (LLMs) due to hallucinations. When combined with formal proof assistants like Lean, these hallucinations can be eliminated through rigorous verification, making theorem proving reliable. However, even with formal verification, LLMs still struggle with long proofs and complex mathematical formalizations. While Lean with LLMs offers valuable assistance with retrieving lemmas, generating tactics, or even complete proofs, it lacks a crucial capability: providing a sense of proof progress. This limitation particularly impacts the overall development efficiency in large formalization projects. We introduce LeanProgress, a method that predicts the progress in a proof, that is, how many steps remain to complete it. We train and evaluate our models on a large corpus of Lean proofs from Lean Workbook Plus and Mathlib4, and we employ data preprocessing and balancing techniques to handle the skewed distribution of proof lengths. Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.8% in predicting the amount of progress and, hence, the remaining number of steps. When integrated into a best-first search framework using Reprover, our method shows a 3.8% improvement on Mathlib4 compared to baseline performances of 41.4%, particularly for longer proofs. These results demonstrate how proof progress prediction can enhance both automated and interactive theorem proving, enabling users to make more informed decisions about proof strategies.
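
To illustrate how a remaining-steps estimate could guide best-first search, here is a generic sketch over a toy state space. Everything in it is an assumption for illustration: the `predicted_remaining_steps` callback stands in for a learned predictor, the toy states are integers rather than Lean proof states, and the abstract does not say how LeanProgress combines its prediction with Reprover's own scores.

```python
import heapq
from itertools import count


def best_first_search(start, is_goal, expand, predicted_remaining_steps, max_expansions=1000):
    """Expand the state with the smallest predicted number of remaining steps first.

    Returns the path from start to a goal state, or None if the budget runs out.
    """
    tie_breaker = count()  # keeps heap entries comparable when priorities tie
    frontier = [(predicted_remaining_steps(start), next(tie_breaker), start, [start])]
    seen = {start}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        expansions += 1
        for nxt in expand(state):
            if nxt not in seen:
                seen.add(nxt)
                priority = predicted_remaining_steps(nxt)
                heapq.heappush(frontier, (priority, next(tie_breaker), nxt, path + [nxt]))
    return None


# Toy problem (hypothetical): reach 0 from 5, where each step subtracts 1 or 2.
def toy_expand(n):
    return [m for m in (n - 1, n - 2) if m >= 0]


# Stand-in for a learned progress predictor: here simply the current value.
path = best_first_search(5, lambda n: n == 0, toy_expand, lambda n: n)
print(path)
```

The design point is only that the priority queue is ordered by the predicted distance to completion, so branches that appear closer to a finished proof are explored first.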

URL: https://openreview.net/forum?id=eTmOwvvRu9

---

Title: PROPS: Progressively Private Self-alignment of Large Language Models

Authors: Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon

Abstract: Alignment is a key step in developing Large Language Models (LLMs) using human feedback to ensure adherence to human values and societal norms. Dependence on human feedback raises privacy concerns about how much a labeler’s preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches, such as Differentially Private SGD (DP-SGD), provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment but can provide more privacy than necessary as human preferences are tied only to labels of (prompt, response) pairs and can degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy preserving alignment framework where privately aligned models in previous stages can serve as labelers for supplementing training data in the subsequent stages of alignment. We present theoretical guarantees for PROPS as well as comprehensive validation using multiple models (Pythia and GPT) and datasets (AlpacaEval, Anthropic HH-RLHF, truthy-dpo-v0.1) to demonstrate the utility of PROPS over existing methods while still providing high privacy. For the same privacy budget, alignment via PROPS can achieve up to 3x higher win-rates compared to DP-SGD, and 2.5x higher win-rates compared to Randomized Response (RR) based alignment.
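
For context on the Randomized Response (RR) baseline named above, a minimal sketch of standard binary randomized response applied to a preference label follows. It shows the textbook mechanism only, not PROPS itself; the function names and the 0/1 label encoding (which of two responses the labeler preferred) are assumptions.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true label under binary randomized response.

    Equals e^eps / (1 + e^eps), written in a numerically stable form.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    return 1.0 / (1.0 + math.exp(-epsilon))


def randomized_response(label: int, epsilon: float, rng=random) -> int:
    """Return the true binary preference label with keep_probability(epsilon), else flip it."""
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    if rng.random() < keep_probability(epsilon):
        return label
    return 1 - label


# Hypothetical usage: privatize a small batch of preference labels.
rng = random.Random(0)
true_labels = [1, 0, 1, 1, 0]
noisy_labels = [randomized_response(y, epsilon=1.0, rng=rng) for y in true_labels]
print(noisy_labels)
```

A smaller epsilon pushes the keep probability toward 0.5, so the reported labels carry less information about the labeler's true preference; this is the privacy-utility tension the abstract refers to.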

URL: https://openreview.net/forum?id=phbRwhaeBo

---

Title: A Survey on Generative Modeling with Limited Data, Few Shots, and Zero Shot

Authors: Milad Abdollahzadeh, Guimeng Liu, Touba Malekzadeh, Christopher T.H Teo, Keshigeyan Chandrasegaran, Ngai-Man Cheung

Abstract: Generative modeling in machine learning aims to synthesize new data samples that are statistically similar to those observed during training. While conventional generative models such as GANs and diffusion models typically assume access to large and diverse datasets, many real-world applications (e.g., in medicine, satellite imaging, and artistic domains) operate under limited data availability and strict constraints. In this survey, we examine Generative Modeling under Data Constraint (GM-DC), which includes limited-data, few-shot, and zero-shot settings. We present a unified perspective on the key challenges in GM-DC, including overfitting, frequency bias, and incompatible knowledge transfer, and discuss how these issues impact model performance.
To systematically analyze this growing field, we introduce two novel taxonomies: one categorizing GM-DC tasks (e.g., unconditional vs. conditional generation, cross-domain adaptation, and subject-driven modeling), and another organizing methodological approaches (e.g., transfer learning, data augmentation, meta-learning, and frequency-aware modeling).
Our study reviews over 230 papers, offering a comprehensive view across generative model types and constraint scenarios. We further analyze task-approach-method interactions using a Sankey diagram and highlight promising directions for future work, including adaptation of foundation models, holistic evaluation frameworks, and data-centric strategies for sample selection.
This survey provides a timely and practical roadmap for researchers and practitioners aiming to advance generative modeling under limited data. Project website: https://sutd-visual-computing-group.github.io/gmdc-survey/.

URL: https://openreview.net/forum?id=u7GTHazuRp

---

Title: Multi-BK-Net: Multi-Branch Multi-Kernel Convolutional Neural Networks for Clinical EEG Analysis

Authors: Ann-Kathrin Kiessner, Tonio Ball, Joschka Boedecker

Abstract: Classifying an electroencephalography (EEG) recording as pathological or non-pathological is an important first step in diagnosing and managing neurological diseases and disorders. As manual EEG classification is costly, time-consuming and requires highly trained experts, deep learning methods for automated classification of general EEG pathology offer a promising option to assist clinicians in screening EEGs. Convolutional neural networks (CNNs) are well-suited for classifying pathological EEG signals due to their ability to perform end-to-end learning. In practice, however, current CNN solutions suffer from limited classification performance due to I) a single-scale network design that cannot fully capture the high intra- and inter-subject variability of the EEG signal, the diversity of the data, and the heterogeneity of pathological EEG patterns and II) the small size and limited diversity of the dataset commonly used to train and evaluate the networks. These challenges result in a low sensitivity score and a performance drop on more diverse patient populations, further hindering their reliability for real-world applications.
Here, we propose a novel multi-branch, multi-scale CNN called Multi-BK-Net (Multi-Branch Multi-Kernel Network), comprising five parallel branches that incorporate temporal convolution, spatial convolution, and pooling layers, with temporal kernel sizes defined by five clinically relevant frequency bands in its first block.
Evaluation is based on two public datasets with predefined test sets: the Temple University Hospital (TUH) Abnormal EEG Corpus and the TUH Abnormal Expansion Balanced EEG Corpus.
Our Multi-BK-Net outperforms five baseline architectures and state-of-the-art end-to-end approaches in terms of accuracy and sensitivity on these datasets, setting a new benchmark. Furthermore, ablation experiments highlight the importance of the multi-branch, multi-scale input block of the Multi-BK-Net. Overall, our findings indicate the efficacy of multi-branch, multi-scale CNNs in accurately and reliably classifying EEG pathology, demonstrating advantages in handling data heterogeneity compared to other deep learning approaches. Thus, this study contributes to the ongoing development of deep end-to-end methods for general EEG pathology classification.

URL: https://openreview.net/forum?id=IsG10xZAaA

---

Title: Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization

Authors: Shivam Pal, Aishwarya Gupta, Saqib Sarwar, Piyush Rai

Abstract: Federated Learning (FL) has emerged as a promising method to collaboratively learn from decentralized and heterogeneous data available at different clients without the requirement of data ever leaving the clients. Recent works on FL have advocated taking a Bayesian approach to FL as it offers a principled way to account for the model and predictive uncertainty by learning a posterior distribution for the client and/or server models. Moreover, Bayesian FL also naturally enables personalization in FL to handle data heterogeneity across the different clients by having each client learn its own distinct personalized model. In particular, the hierarchical Bayesian approach enables all the clients to learn their personalized models while also taking into account the commonalities via a prior distribution provided by the server. However, despite their promise, Bayesian approaches for FL can be computationally expensive and can have high communication costs as well because of the requirement of computing and sending the posterior distributions. We present a novel Bayesian FL method using an efficient second-order optimization approach, with a computational cost that is similar to first-order optimization methods like Adam, but also provides the various benefits of the Bayesian approach for FL (e.g., uncertainty, personalization), while also being significantly more efficient and accurate than SOTA Bayesian FL methods (both for standard as well as personalized FL settings). Our method achieves improved predictive accuracies as well as better uncertainty estimates as compared to the baselines which include both optimization based as well as Bayesian FL methods.

URL: https://openreview.net/forum?id=TzhCnGBK4F

---

Title: TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding

Authors: Zhanyu Wang, Chen Tang, Haoyu He, Kuan Feng, Chao Wang, Bingni Zhang, Xiaolei XU, SHEN WANG, Luping Zhou

Abstract: Multimodal large language models (MLLMs) have made significant progress across vision-language tasks, yet many designs still suffer from two core limitations. (i) Excessive visual tokens and broken global context: Tiled Patch Encoding fragments high-resolution images, leading to token overload and disrupting global attention modeling. (ii) Lack of temporal reasoning: Most models process video as independent frames using static image encoders, failing to capture temporal dynamics. We present TempFlex-VL, a token-efficient and temporally aware MLLM that addresses both issues through lightweight architectural enhancements. First, we introduce a resolution-agnostic visual encoder that directly processes full images without tiling, preserving global context while substantially reducing visual tokens. Second, we propose Temporal Fiber Fusion (TFF), a plug-and-play module with three complementary pathways: (1) a dynamic local-convolution branch for fine-grained motion, (2) a gated memory accumulator for long-term dependencies, and (3) a periodic encoder for modeling cyclic patterns. These signals are softly fused, enabling the model to adapt to diverse temporal structures without overfitting. To support large-scale video-language pretraining, we curate TempFlex-2M, a high-quality synthetic video–text corpus generated in a single stage via GPT-4o with direct visual prompting. We instantiate TempFlex-VL using two different language backbones, Gemma3-4B and Qwen3-4B, demonstrating the generality of our design across architectures. Both variants achieve state-of-the-art or competitive results on a wide range of image and video benchmarks while markedly improving token efficiency. Code is publicly available at: https://github.com/wang-zhanyu/TempFlex.

URL: https://openreview.net/forum?id=ietYdtRB3h

---

Title: Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies

Authors: Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng

Abstract: Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. For example, suggesting clocks as gifts for a baby’s birthday in China may invoke associations with death, leading to user discomfort and undermining trust. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual visually grounded queries from 16 countries, three everyday domains (i.e., shopping, meal planning, and outdoor activities), and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts models (e.g., Llama-4-Maverick) and reasoning models (e.g., o1 and Gemini-2.5-Pro). Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models achieve performance better than or comparable to GPT-4o, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors.
These methods substantially improve GPT-4o’s cultural awareness (+60.14%) and compliance (+55.2%), while preserving general multimodal capabilities with minimal performance reduction on general multimodal understanding benchmarks. This work establishes a framework for evaluating and improving cultural safety in vision-language systems across diverse global contexts.

URL: https://openreview.net/forum?id=mkFBmxgnRh

---

Title: Inherently Robust Control through Maximum-Entropy Learning-Based Rollout

Authors: Felix Bok, Atanas Mirchev, Baris Kayalibay, Ole Jonas Wenzel, Patrick van der Smagt, Justin Bayer

Abstract: Reinforcement Learning has recently proven extremely successful in the context of robot control. One of the major reasons is massively parallel simulation in conjunction with controlling for the so-called ``sim to real'' gap: training on a distribution of environments, which is assumed to contain the real one, is sufficient for finding neural policies that successfully transfer from computer simulations to real robots. Often, this is accompanied by a layer of system identification during deployment to close the gap further. Still, the efficacy of these approaches hinges on reasonable simulation capabilities with an adequately rich task distribution containing the real environment. This work aims to provide a complementary solution in cases where the aforementioned criteria may prove challenging to satisfy. We combine two approaches, $\textit{maximum-entropy reinforcement learning}$ (MaxEntRL) and $\textit{rollout}$, into an inherently robust control method called $\textbf{Maximum-Entropy Learning-Based Rollout (MELRO)}$. Both promise increased robustness and adaptability on their own. While MaxEntRL has been shown to be an adversarially-robust approach in disguise, rollout greatly improves over parametric models through an implicit Newton step on a model of the environment. We find that our approach works excellently in the vast majority of cases on both the Real World Reinforcement Learning (RWRL) benchmark and on our own environment perturbations of the popular DeepMind Control (DMC) suite, which move beyond simple parametric noise. We also show its success in ``sim to real'' transfer with the Franka Panda robot arm.

URL: https://openreview.net/forum?id=Ho4XUDn21D

---

Title: Let Your Light Shine: Foreground Portrait Matting via Deep Flash Priors

Authors: Tianyi Xiang, Yangyang Xu, Qingxuan Hu, Chenyi Zi, Nanxuan Zhao, Junle Wang, Shengfeng He

Abstract: In this paper, we take a new perspective on image matting by revealing the foreground with flash priors. Previous Background Matting frameworks require a clean background as input and, although they have proven powerful, they are not practical for real-world scenarios with dynamic camera or background movement. We introduce the flash/no-flash image pair to portray the foreground object while eliminating the influence of the dynamic background. The rationale is that the foreground object is closer to the camera and thus receives more light than the background. We propose a cascaded end-to-end network to integrate flash prior knowledge into the alpha matte estimation process. In particular, a transformer-based Foreground Correlation Module is presented to connect foregrounds exposed under different lighting, which effectively filters out the perturbation from the dynamic background and is also robust to foreground motion. The initial prediction is concatenated with a Boundary Matting Network to polish the details of previous predictions. To supplement the training and evaluation of our flash/no-flash framework, we construct the first flash/no-flash portrait image matting dataset with 3,025 well-annotated alpha mattes. Experimental evaluations show that our proposed model significantly outperforms existing trimap-free matting methods on scenes with dynamic backgrounds. Moreover, we discuss and analyze in detail the effects of different prior knowledge on static and dynamic backgrounds. In contrast to the restricted scenarios of Background Matting, we demonstrate a flexible and reliable solution for real-world cases with camera or background movement.

URL: https://openreview.net/forum?id=vxUiVJp2eM

---

Title: Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching

Authors: Weimin Bai, Yubo Li, Wenzheng Chen, Weijian Luo, He Sun

Abstract: Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on Score Distillation Sampling (SDS) loss, which involves asymmetric KL divergence—a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. Such reformulation, together with SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across various 2D-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark, comparing against nine state-of-the-art baselines. Dive3D also achieves strong results on quantitative metrics, including text–asset alignment, 3D plausibility, text–geometry consistency, texture quality, and geometric detail.

URL: https://openreview.net/forum?id=OUYMueHLMf

---

Title: ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Authors: Xu Ouyang, Shengzhuang Chen, Michael Arthur Leopold Pearce, Thomas Hartvigsen, Jonathan Richard Schwarz

Abstract: Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization is a well-established class of appropriate algorithms. Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy-) models and final mixture performance. Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost. We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of benchmarks, showing speed-ups of over 500% in determining the best data mixture on our largest experiments relative to recent baselines. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs reproducible post-training pipelines worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area.
Finally, we highlight rich opportunities for future research in this area, helping bridge the gap towards a comprehensive understanding of the broader effects of training data on model generalization.

URL: https://openreview.net/forum?id=0Euvm9zDpu

---

Title: Physics of Language Models: Part 1, Learning Hierarchical Language Structures

Authors: Zeyuan Allen-Zhu, Yuanzhi Li

Abstract: Transformer-based language models are effective but complex, and understanding their inner workings and reasoning mechanisms remains a significant challenge. Previous research has primarily explored how these models handle simple tasks such as name copying or selection; we extend this line of work by investigating how they perform recursive language structure reasoning defined by context-free grammars (CFGs). We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating long (e.g., hundreds of tokens), locally ambiguous sentences that require dynamic programming to parse. Despite this complexity, we demonstrate that autoregressive language models such as GPT can accurately learn and reason over these CFG-defined hierarchical languages and generate valid continuations. Analyzing model internals in this controlled setting, we reveal that hidden states linearly encode CFG parse structure, and that attention patterns align closely with the information flow of dynamic-programming parsing algorithms.

This paper also presents several corollary findings, including: why absolute positional embeddings are inferior to relative and rotary embeddings; why uniform attention alone is surprisingly effective (motivating our follow-up work on Canon layers); why encoder-only models (e.g., BERT, DeBERTa) struggle with *deep* structural reasoning on CFGs compared to autoregressive models (e.g., GPT); and why injecting structural or syntactic noise into pretraining data markedly improves robustness to corrupted language prompts.

URL: https://openreview.net/forum?id=mPQKyzkA1K

---

New submissions
===============

Title: A Counterfactual-style Diagnostic Framework for Spurious Correlations in Text-to-Image Models

Abstract: Text-to-image diffusion models often encode correlations between demographic prompts and non-demographic attributes, some of which may be expected (e.g., gray hair with older age) while others may raise fairness concerns (e.g., cultural markers appearing only for certain ethnicities). Existing analyses of such correlations have been largely qualitative. In this work, we present a counterfactual-style diagnostic framework for stress-testing diffusion models. Inspired by stress-testing approaches (e.g., Veitch et al.), our method uses image-conditioned generation to approximately preserve facial features while systematically varying demographic variables in prompts (gender, ethnicity, age). This setup enables controlled observation of how non-demographic attributes (e.g., facial hair, accessories, hairstyles) shift under demographic changes. We introduce Counterfactual-style Invariance (CIV), along with positive and negative variance metrics (PCV, NCV), to quantify attribute stability and directional changes. Applying this framework across multiple text-to-image models reveals pervasive, prompt-dependent entanglements—for example, bushy eyebrows co-occur in 62.5% of generations with “Middle Eastern” prompts, and Black hair is amplified in 64.8% of “East Asian” generations. These findings show that generative models can amplify or introduce associations between the demographic variables and observed attributes. This highlights the need for systematic diagnostic evaluations to better understand and mitigate fairness risks in text-to-image generation.

URL: https://openreview.net/forum?id=7HjGoHEoAA

---

Title: Riemannian Generative Decoder

Abstract: Riemannian representation learning typically relies on an encoder to estimate densities on chosen manifolds. This involves optimizing numerically brittle objectives, potentially harming model training and quality. To completely circumvent this issue, we introduce the Riemannian generative decoder, a unifying approach for finding manifold-valued latents on any Riemannian manifold. Latents are learned with a Riemannian optimizer while jointly training a decoder network. By discarding the encoder, we vastly simplify the manifold constraint compared to current approaches, which often handle only a few specific manifolds. We validate our approach on three case studies --- a synthetic branching diffusion process, human migrations inferred from mitochondrial DNA, and cells undergoing a cell division cycle --- each showing that learned representations respect the prescribed geometry and capture intrinsic non-Euclidean structure. Our method requires only a decoder, is compatible with existing architectures, and yields interpretable latent spaces aligned with data geometry. A temporarily anonymized codebase is available at: https://anonymous.4open.science/r/rgd-4gkL.

URL: https://openreview.net/forum?id=vuPMXg1FDT

---

Title: Reduced-Rank Outcome Compression for Causal Policy Optimization

Abstract: Evaluating the causal impacts of possible interventions is crucial for informing decision-making, especially towards improving access to opportunity. If causal effects are heterogeneous and predictable from covariates, then personalized treatment decisions can improve individual outcomes and contribute to both efficiency and equity. In practice, however, causal researchers do not have a single outcome in mind a priori and often collect multiple outcomes of interest that are noisy estimates of the true target of interest. For example, in government-assisted social benefit programs, policymakers collect many outcomes to understand the multidimensional nature of poverty. The ultimate goal is to learn an optimal treatment policy that in some sense maximizes multiple outcomes simultaneously. To address such issues, we present a data-driven dimensionality-reduction methodology for multiple outcomes in the context of optimal policy learning with multiple objectives. We learn a low-dimensional representation of the true outcome from the observed outcomes using reduced rank regression. We develop a suite of estimates that use the model to denoise observed outcomes, including commonly-used index weightings. These methods improve estimation error in policy evaluation and optimization, including on a case study of real-world cash transfer and social intervention data. Reducing the variance of noisy social outcomes can improve the performance of algorithmic allocations.

URL: https://openreview.net/forum?id=WQhOaY4yPC

---

Title: Divide and Conquer: Selective Value Learning and Policy Optimization for Offline Safe Reinforcement Learning

Abstract: Offline safe reinforcement learning (RL) aims to learn policies that maximize reward while satisfying safety constraints from a fixed dataset. Existing methods extend offline RL with primal–dual value learning and behavior-regularized policy optimization, but in safety-critical tasks they struggle: uniform updates across all states ignore the difference between safety-preserving and unsafe states, leading to inaccurate value estimates, infeasible solutions when constraints conflict, and strong sensitivity to dataset quality. We propose SEVPO ($\textbf{SE}$lective $\textbf{V}$alue Learning and $\textbf{P}$olicy $\textbf{O}$ptimization), a divide-and-conquer framework that separates updates based on state safety. SEVPO learns conservative cost values to identify safe states, applying reward-constrained optimization with selective regularization there, and switches to cost-minimization outside to compute least-cost escape paths. Extensive experiments show SEVPO achieves high reward and strict safety guarantees, outperforming state-of-the-art offline safe RL across diverse dataset qualities. We further validate SEVPO by training a Unitree Go2 quadruped robot in dynamic environments using only offline data, demonstrating its potential for safety-critical robotics (https://youtu.be/tDpWq2EV_Ig).

URL: https://openreview.net/forum?id=4KYrv6qYMl

---

Title: Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

Abstract: Neuroscientific research has revealed that the brain encodes complex behaviors by leveraging structured, low-dimensional manifolds and dynamically fusing multiple sources of information through adaptive gating mechanisms. Inspired by these principles, we propose a novel reinforcement learning (RL) framework that encourages the disentanglement of dynamics-specific and reward-specific features, drawing direct parallels to how neural circuits separate and integrate information for efficient decision-making. Our approach leverages locally linear embeddings (LLEs) to capture the intrinsic, locally linear structure inherent in many environments—mirroring the local smoothness observed in neural population activity—while concurrently deriving reward-specific features through the standard RL objective. An attention mechanism, analogous to cortical gating, adaptively fuses these complementary representations on a per-state basis. Experimental results on benchmark tasks demonstrate that our method, grounded in neuroscientific principles, improves learning efficiency and overall performance compared to conventional RL approaches, highlighting the benefits of explicitly modeling local state structures and adaptive feature selection as observed in biological systems.

URL: https://openreview.net/forum?id=p7p3iuah0G

---

Title: GenReview: A Large-scale Dataset of AI-Generated (and Human-written) Peer Reviews

Abstract: How does the progressive adoption of Large Language Models (LLMs) affect scientific peer reviewing? This multifaceted question is fundamental to the effectiveness---as well as to the integrity---of the scientific process. Recent evidence suggests that LLMs may have already been tacitly used in peer reviewing, e.g., at the 2024 International Conference on Learning Representations (ICLR). Furthermore, various editorial boards (including that of ICLR'25) have made efforts to explicitly integrate LLMs into peer reviewing. To fully understand the utility and the implications of LLMs' deployment for scientific reviewing, a comprehensive relevant dataset is strongly desirable.
Despite some previous research on this topic, such a dataset has been lacking so far. We fill this gap by presenting GenReview, the largest dataset to date containing LLM-written reviews. Our dataset includes 81K reviews generated for all submissions to the 2018--2025 editions of ICLR by providing the LLM with three independent prompts: a negative, a positive, and a neutral one. GenReview is also linked to the respective papers and their original reviews, thereby enabling a broad range of investigations. To illustrate the value of GenReview, we explore a sample of intriguing research questions, namely: whether LLMs exhibit bias in reviewing (they do); whether LLM-written reviews can be automatically detected (so far, they can); whether LLMs can rigorously follow reviewing instructions (not always); and whether LLM-provided ratings align with decisions on paper acceptance or rejection (holds true only for accepted papers).
GenReview can be accessed at the following link: https://anonymous.4open.science/r/gen_review/

URL: https://openreview.net/forum?id=hryrQ69amw

---

Title: Private Sketches for Linear Regression

Abstract: Linear regression is frequently applied in a variety of domains, some of which might contain sensitive information. This necessitates that the application of these methods does not reveal private information. Differentially private (DP) linear regression methods, developed for this purpose, compute private estimates of the solution. These techniques typically involve computing a noisy version of the solution vector. Instead, we propose releasing private sketches of the datasets, which can then be used to compute an approximate solution to the regression problem. This is motivated by the \emph{sketch-and-solve} paradigm, where the regression problem is solved on a smaller sketch of the dataset instead of on the original problem space. The solution obtained on the sketch can also be shown to have good approximation guarantees to the original problem. Various sketching methods have been developed for improving the computational efficiency of linear regression problems under this paradigm. We adopt this paradigm for the purpose of releasing private sketches of the data. We construct differentially private sketches for the problems of least squares regression, as well as least absolute deviations regression. We show that the privacy constraints lead to sketched versions of regularized regression. We compute the bounds on the regularization parameter required for guaranteeing privacy. The availability of these private sketches facilitates the application of commonly available solvers for regression, without the risk of privacy leakage.

URL: https://openreview.net/forum?id=2R0INa6R6h

---

Title: GENIE: A Visual-Only Diffusion Framework for Task-Agnostic Image Transformation

Abstract: Designing a unified vision model capable of handling diverse visual transformation tasks without task-specific modifications remains a significant challenge, particularly in scaling and generalizing beyond narrowly defined objectives. We propose GENIE, a novel ControlNet-Diffusion framework that performs task-based image generation solely through visual exemplars, eliminating dependence on textual prompts or auxiliary metadata. Unlike conventional prompt-driven diffusion models, GENIE employs a dual visual conditioning mechanism—combining implicit guidance via ControlNet and explicit task encoding through CLIP-based visual arithmetic—to infer task intent directly from reference input-output pairs. To improve semantic alignment between visual exemplars and generated outputs, we introduce a lightweight task consistency loss, which encourages representational coherence in the embedding space across transformed pairs. While not a multitask learner in the classical sense, GENIE enables task switching across multiple tasks without any task-specific modifications in architecture or task-specific loss functions. Evaluations across seven vision tasks—inpainting, colorization, edge detection, deblurring, denoising, semantic segmentation, and depth estimation—and two out-of-distribution (OOD) tasks—deraining and contrast enhancement—demonstrate that GENIE achieves an average performance gain of 10% over visual-conditioned baselines, showcasing its effectiveness for scalable and text-free visual generation.

URL: https://openreview.net/forum?id=vtth9hOwoP

---

Title: Flex-Act: Why Learn when you can Pick?

Abstract: Learning activation functions has emerged as a promising direction in deep learning, allowing networks to adapt activation mechanisms to task-specific demands. In this work, we introduce a novel framework that employs the Gumbel-Softmax trick to enable discrete yet differentiable selection among a predefined set of activation functions during training. Our method dynamically learns the optimal activation function independently of the input, thereby enhancing both predictive accuracy and architectural flexibility. Experiments on synthetic datasets show that our model consistently selects the most suitable activation function, underscoring its effectiveness. These results connect theoretical advances with practical utility, paving the way for more adaptive and modular neural architectures in complex learning scenarios.

URL: https://openreview.net/forum?id=HQGis83pM2

---

Title: Multi-Marginal Schrödinger Bridge Matching

Abstract: Understanding the continuous evolution of populations from discrete temporal snapshots is a critical research challenge, particularly in fields like developmental biology and systems medicine where longitudinal tracking of individual entities is often impossible. Such trajectory inference is vital for unraveling the mechanisms of dynamic processes. While Schrödinger Bridges (SBs) offer a potent framework, their traditional application to pairwise time points can be insufficient for systems defined by multiple intermediate snapshots. This paper introduces Multi-Marginal Schrödinger Bridge Matching (MSBM), a novel algorithm specifically designed for the multi-marginal SB problem. MSBM extends iterative Markovian fitting (IMF) to effectively handle multiple marginal constraints. This technique ensures robust enforcement of all intermediate marginals while preserving the continuity of the learned global dynamics across the entire trajectory. Empirical validations on synthetic data and real-world single-cell RNA sequencing datasets demonstrate the competitive or superior performance of MSBM in capturing complex trajectories and respecting intermediate distributions, all with notable computational efficiency.

URL: https://openreview.net/forum?id=AbOuxBMTog

---
