Weekly TMLR digest for Nov 23, 2025

TMLR
Nov 23, 2025, 12:00:12 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Expert Certification: crowd-hpo: Realistic Hyperparameter Optimization and Benchmarking for Learning from Crowds with Noisy Labels

Marek Herde, Lukas Lührs, Denis Huseljic, Bernhard Sick

https://openreview.net/forum?id=SaKfhylVLK

---


J2C Certification: Preserving Expert-Level Privacy in Offline Reinforcement Learning

Navodita Sharma, Vishnu Vinod, Abhradeep Guha Thakurta, Alekh Agarwal, Borja Balle, Christoph Dann, Aravindan Raghuveer

https://openreview.net/forum?id=2bj0eVgCdO

---


Survey Certification: Scaling Laws of Distributed Random Forests

Katharina Flügel, Charlotte Debus, Markus Götz, Achim Streit, Marie Weiel

https://openreview.net/forum?id=ICHxTlgnSy

---


Reproducibility Certification: Early Classification of Time Series: A Survey and Benchmark

Aurélien Renault, Alexis Bondu, Antoine Cornuéjols, Vincent Lemaire

https://openreview.net/forum?id=bcNDYmBicK

---


Accepted papers
===============


Title: Streamlining Language Models via Semantic Basis Analysis

Authors: Yang Li, Daniel Agyei Asante, Changsheng Zhao, Ernie Chang, Yangyang Shi, Vikas Chandra

Abstract: As the size of language models increases, they deliver substantial performance improvements across a variety of applications. However, this growth also leads to greater computational demands, making deployment on resource-constrained devices—such as personal computers and mobile or wearable devices—more challenging, and significantly raising inference costs on cloud servers. To address these challenges, we introduce Basel, a method to streamline language models by leveraging the semantic structure of their weight matrices. Specifically, Basel treats each weight matrix as a linear combination of bases, selectively retaining those that are associated with essential semantics for the target application, pruning redundant ones, and introducing new bases that enhance task performance. Experimental results demonstrate that Basel achieves significant model size reduction compared to baseline techniques, while maintaining comparable or even superior accuracy across diverse applications.
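
For intuition, here is a minimal sketch of the basis-selection idea in Python: decompose a weight matrix into rank-1 bases via SVD, score each basis on some calibration data, and keep only the highest-scoring ones. The scoring rule, the calibration inputs, and all names below are illustrative placeholders, not Basel's actual criterion or code.

    import numpy as np

    def streamline_weight(W, calib_X, keep_ratio=0.25):
        """Illustrative basis selection for one weight matrix W (out_dim x in_dim).

        Keeps the bases whose responses on calibration inputs carry the most
        energy; storing only the retained factors shrinks the parameter count.
        """
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        responses = calib_X @ Vt.T                      # (n_samples, n_bases)
        scores = S * np.linalg.norm(responses, axis=0)  # stand-in importance score
        k = max(1, int(keep_ratio * len(S)))
        keep = np.argsort(scores)[::-1][:k]
        return (U[:, keep] * S[keep]) @ Vt[keep, :]     # low-rank reconstruction

    W = np.random.randn(256, 512)
    calib_X = np.random.randn(64, 512)
    W_small = streamline_weight(W, calib_X)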

URL: https://openreview.net/forum?id=qq7NNAXvuv

---

Title: Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models

Authors: Zhongyu Yang, Dannong Xu, Wei Pang, Yingfang Yuan

Abstract: The rapid growth of visual tokens in multimodal large language models (MLLMs) leads to excessive memory consumption and inference latency, especially when handling high-resolution images and videos. Token pruning is a technique used to mitigate this issue by removing redundancy, but existing methods often ignore relevance to the user query or suffer from the limitations of attention mechanisms, reducing their adaptability and effectiveness. To address these challenges, we propose Script, a plug-and-play pruning method that requires no retraining and generalizes across diverse MLLMs. Script comprises two modules: a graph-structured pruning module that removes visually redundant tokens, and a query-conditioned semantic pruning module that preserves query-relevant visual information. Together, they enhance performance on multimodal tasks. Experiments on fourteen benchmarks across image and video understanding tasks show that Script consistently achieves higher model efficiency and predictive accuracy compared to existing pruning methods. On LLaVA-NeXT-7B, it achieves up to $6.8\times$ prefill speedup and $10\times$ FLOP reduction, while retaining 96.88\% of the original performance.
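
As a rough picture of the query-conditioned half of the method, the sketch below ranks visual tokens by cosine similarity to a pooled text-query embedding and keeps the top fraction. The graph-structured redundancy module is omitted, and the tensor shapes and names are simplified assumptions, not the authors' code.

    import torch
    import torch.nn.functional as F

    def query_conditioned_prune(visual_tokens, query_tokens, keep_ratio=0.25):
        """Keep the visual tokens most similar to the pooled text query.

        visual_tokens: (num_visual, dim); query_tokens: (num_text, dim).
        """
        query = query_tokens.mean(dim=0, keepdim=True)            # (1, dim)
        sim = F.cosine_similarity(visual_tokens, query, dim=-1)   # (num_visual,)
        k = max(1, int(keep_ratio * visual_tokens.size(0)))
        keep_idx = sim.topk(k).indices.sort().values              # keep original order
        return visual_tokens[keep_idx], keep_idx

    vis = torch.randn(576, 4096)   # e.g. LLaVA-style visual tokens
    txt = torch.randn(32, 4096)    # text-query token embeddings
    pruned, idx = query_conditioned_prune(vis, txt)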

URL: https://openreview.net/forum?id=F6xKzbgcHq

---

Title: Learning to Prompt Your Domain for Federated Vision-Language Models

Authors: Guoyizhe Wei, Feng Wang, Anshul Shah, Rama Chellappa

Abstract: The prompt tuning paradigm, with its great advantages of low parameter count and stable training, has recently inspired numerous applications of CLIP-like vision-language models in federated learning. However, in this work, we posit that under significant domain gaps across federated participants, prompt-based CLIP may easily collapse to non-optimal solutions due to the neglect of domain-aware knowledge. We present a novel prompt tuning method, termed ADAPT, to address this issue by learning both intra- and inter-domain prompts. Specifically, we assign each federated participant a domain-specific prompt and use the image's visual features as a condition to guide the generation of language features, with the underlying idea that the prompted CLIP should detect the input image's domain correspondence before making the prediction of its category. Extensive experiments demonstrate ADAPT's significant efficiency and effectiveness in federated learning. For example, by learning and sharing only 2.1M parameters, ADAPT attains a 69.8% average accuracy over the six domains of DomainNet, which improves the original CLIP accuracy by 16.2%.

URL: https://openreview.net/forum?id=OS7zPOZjr3

---

Title: On Representing Convex Quadratically Constrained Quadratic Programs via Graph Neural Networks

Authors: Chenyang Wu, Qian Chen, Akang Wang, Tian Ding, Ruoyu Sun, Wenguo Yang, Qingjiang Shi

Abstract: Convex quadratically constrained quadratic programs (QCQPs) involve finding a solution within a convex feasible region defined by quadratic constraints while minimizing a convex quadratic objective function. These problems arise in various industrial applications, including power systems and signal processing. Traditional methods for solving convex QCQPs primarily rely on matrix factorization, which quickly becomes computationally prohibitive as the problem size increases. Recently, graph neural networks (GNNs) have gained attention for their potential in representing and solving various optimization problems such as linear programs and linearly constrained quadratic programs. In this work, we investigate the representation power of GNNs in the context of QCQP tasks. Specifically, we propose a new tripartite graph representation for general convex QCQPs and properly associate it with message-passing GNNs. We demonstrate that there exist GNNs capable of reliably representing key properties of convex QCQPs, including feasibility, optimal value, and optimal solution. Our result deepens the understanding of the connection between QCQPs and GNNs, paving the way for future machine learning approaches to efficiently solve QCQPs.

URL: https://openreview.net/forum?id=GC2ZO6Asoa

---

Title: Understanding Fine-tuning in Approximate Unlearning: A Theoretical Perspective

Authors: Meng Ding, Rohan Sharma, Changyou Chen, Jinhui Xu, Kaiyi Ji

Abstract: Machine Unlearning has emerged as a significant area of research, focusing on `removing' specific subsets of data from a trained model. Fine-tuning (FT) methods have become one of the fundamental approaches for approximating unlearning, as they effectively retain model performance. However, it is consistently observed that naive FT methods struggle to forget the targeted data.
In this paper, we present the first theoretical analysis of FT methods for machine unlearning within a linear regression framework, providing a deeper exploration of this phenomenon. Our analysis reveals that while FT models can achieve zero remaining loss, they fail to forget the forgetting data, as the pretrained model retains its influence and the fine-tuning process does not adequately mitigate it. To address this, we propose a novel Retention-Based Masking (RBM) strategy that constructs a weight saliency map based on the remaining dataset, unlike existing methods that focus on the forgetting dataset. Our theoretical analysis demonstrates that RBM not only significantly improves unlearning accuracy (UA) but also ensures higher retaining accuracy (RA) by preserving overlapping features shared between the forgetting and remaining datasets. Experiments on synthetic and real-world datasets validate our theoretical insights, showing that RBM outperforms existing masking approaches in balancing UA, RA, and disparity metrics.
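
To make the masking idea concrete, here is one way a retention-based weight saliency mask could be built: accumulate gradient magnitudes on the remaining (retained) data and mark the top fraction of weights in each tensor. The gradient-magnitude criterion, the per-tensor threshold, and the gating comment below are assumptions for illustration, not the paper's exact RBM construction.

    import torch

    def retention_masks(model, retain_loader, loss_fn, keep_frac=0.5, device="cpu"):
        """Per-parameter saliency masks from gradients on the remaining dataset."""
        model.to(device).zero_grad()
        for x, y in retain_loader:                       # accumulate gradients
            loss_fn(model(x.to(device)), y.to(device)).backward()
        masks = {}
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            g = p.grad.abs()
            thresh = torch.quantile(g.flatten(), 1.0 - keep_frac)
            masks[name] = (g >= thresh).float()          # 1 = salient for retained data
        model.zero_grad()
        return masks

    # One possible use during the unlearning fine-tune: gate each update, e.g.
    #   p.grad.mul_(masks[name])  before optimizer.step().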

URL: https://openreview.net/forum?id=4hNquAmFqf

---

Title: Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction

Authors: Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel

Abstract: Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion.
In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms.
Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity.
A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models.
This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results.
In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA).
Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation.
We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition.
We compare and discuss the more effective strategy for injecting such embeddings into the T2V model.
We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis.
Our approach establishes a new baseline for the task of TVP.
Our code is open-source at https://github.com/Cuberick-Orion/FCA.

URL: https://openreview.net/forum?id=HSAjl4LUHK

---

Title: Mastering SAM Prompts: A Large-Scale Empirical Study in Segmentation Refinement for Scientific Imaging

Authors: Stephen Price, Elke Rundensteiner, Danielle L. Cote

Abstract: Segment Anything Model (SAM) has emerged as a prevalent tool empowering advances in vision tasks from instance segmentation, panoptic segmentation, to interactive segmentation. Leveraging powerful zero-shot capabilities enabled by visual prompts such as masks placed on the image, SAM has been shown to significantly improve tasks. Yet, a poor prompt can worsen SAM performance, risking consequences such as misdiagnoses, autonomous driving failures, or manufacturing defects. However, recent studies on visual SAM prompting remain limited, cover only a small fraction of potential prompt configurations, adopt ad-hoc evaluation strategies, and come with limited or even no rigorous analysis of the statistical significance of prompt configurations. To address this gap, we undertake the first large-scale empirical study comprehensively evaluating the impact of SAM prompt configurations on segmentation refinement. This includes 2,688 prompt configurations, including points, boxes, and masks with diverse augmentations, on four initial segmentation models for a total of 10,752 evaluations. From these results, we draw statistically significant insights along with practical guidelines for prompt design on scientific images. In particular, we recommend including a bounding box, which raises AP@50-95 by 0.320, and advise against using a coarse mask, which lowers AP@50-95 by 0.133, across all four models on microscopy data sets. We showcase that our recommended prompt configuration enables SAM to outperform leading refinement methods on multiple scientific benchmark datasets.
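
The headline recommendation (include a bounding box, skip the coarse mask) is easy to apply with the public segment-anything package. A minimal sketch follows; the checkpoint filename, model type, and the box-from-mask step are placeholders for whatever initial segmentation you are refining.

    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    # Checkpoint path and model type are placeholders for the SAM weights you use.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)

    def refine_with_box(image_rgb, initial_mask):
        """Refine an initial segmentation by prompting SAM with its bounding box."""
        ys, xs = np.nonzero(initial_mask)
        box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])   # XYXY
        predictor.set_image(image_rgb)                             # HxWx3 uint8, RGB
        # No coarse mask_input is passed, per the study's recommendation.
        masks, scores, _ = predictor.predict(box=box, multimask_output=False)
        return masks[0]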

URL: https://openreview.net/forum?id=cWcTQMpqv6

---

Title: Uncertainty-aware Reward Design Process

Authors: Yang yang, Xiaolu Zhou, Bosong Ding, Miao Xin

Abstract: Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging process due to the inefficiencies and inconsistencies inherent in conventional reward engineering methodologies. Recent advances have explored leveraging large language models (LLMs) to automate the design of reward functions. However, LLMs’ insufficient numerical optimization capabilities often result in suboptimal reward hyperparameter tuning, while non-selective validation of candidate reward functions leads to substantial computational overhead. To address these challenges, we propose the Uncertainty-aware Reward Design Process (URDP), a novel framework that integrates large language models to streamline reward function design and evaluation. URDP quantifies candidate reward function uncertainty based on the self-consistency analysis, enabling simulation-free identification of ineffective reward components while discovering novel ones. Furthermore, we introduce uncertainty-aware Bayesian optimization (UABO), which incorporates uncertainty estimation to improve the hyperparameter configuration. Finally, we construct a bi-level optimization framework by decoupling the reward component optimization and the hyperparameter tuning. URDP promotes the collaboration between the reward logic reasoning of the LLMs and the numerical optimization strengths of the Bayesian optimization. We conduct a comprehensive evaluation of URDP across 35 diverse tasks spanning three benchmark environments: Isaac Gym, Bidexterous Manipulation, and ManiSkill2. Our experimental results demonstrate that URDP not only generates higher-quality reward functions but also achieves significant improvements in the efficiency of automated reward design compared to existing approaches. We open-source all code at https://github.com/Yy12136/URDP.

URL: https://openreview.net/forum?id=CId5tW1HxR

---

Title: crowd-hpo: Realistic Hyperparameter Optimization and Benchmarking for Learning from Crowds with Noisy Labels

Authors: Marek Herde, Lukas Lührs, Denis Huseljic, Bernhard Sick

Abstract: Crowdworking is a cost-efficient solution for acquiring class labels. Since these labels are subject to noise, various approaches to learning from crowds have been proposed. Typically, these approaches are evaluated using default hyperparameter configurations, which often result in unfair and suboptimal performance, or using hyperparameter configurations tuned via a validation set with ground truth class labels, which represents an often unrealistic scenario. Moreover, both setups can yield different approach rankings, complicating study comparisons. Therefore, we introduce crowd-hpo as a framework for evaluating approaches to learning from crowds, together with criteria for selecting well-performing hyperparameter configurations using only noisy crowd-labeled validation data. Extensive experiments with neural networks demonstrate that these criteria select hyperparameter configurations that improve the learning from crowds approaches' generalization performances, measured on separate test sets with ground truth labels. Hence, incorporating such criteria into experimental studies is essential for enabling fairer and more realistic benchmarking.
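
The kind of criterion the framework evaluates can be as simple as scoring each hyperparameter configuration against majority-voted crowd labels on the validation split. The sketch below shows only that baseline idea; `predict_fn(cfg, X_val)` is assumed to return the labels predicted by a model trained with configuration `cfg`, and crowd-hpo's actual criteria may be more sophisticated.

    import numpy as np

    def majority_vote(crowd_labels):
        """crowd_labels: (n_samples, n_annotators) int array, -1 = missing label."""
        return np.array([np.bincount(row[row >= 0]).argmax() for row in crowd_labels])

    def select_config(configs, predict_fn, X_val, crowd_labels):
        """Pick the config whose predictions best match the majority-voted labels."""
        proxy_y = majority_vote(crowd_labels)
        scores = {cfg: (predict_fn(cfg, X_val) == proxy_y).mean() for cfg in configs}
        return max(scores, key=scores.get), scores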

URL: https://openreview.net/forum?id=SaKfhylVLK

---

Title: Learning few-step posterior samplers by unfolding and distillation of diffusion models

Authors: Charlesquin Kemajou Mbakam, Marcelo Pereyra, Jonathan Spence

Abstract: Diffusion models (DMs) have emerged as powerful image priors in Bayesian computational imaging. Two primary strategies have been proposed for leveraging DMs in this context: Plug-and-Play methods, which are zero-shot and highly flexible but rely on approximations; and specialized conditional DMs, which achieve higher accuracy and faster inference for specific tasks through supervised training. In this work, we introduce a novel framework that integrates deep unfolding and model distillation to transform a DM image prior into a few-step conditional model for posterior sampling. A central innovation of our approach is the unfolding of a Markov chain Monte Carlo (MCMC) algorithm—specifically, the recently proposed LATINO Langevin sampler (Spangnoletti et al., 2025)—representing the first known instance of deep unfolding applied to a Monte Carlo sampling scheme. We demonstrate our proposed unfolded and distilled samplers through extensive experiments and comparisons with the state of the art, where they achieve excellent accuracy and computational efficiency, while retaining the flexibility to adapt to variations in the forward model at inference time.

URL: https://openreview.net/forum?id=oGCfD8YKN2

---

Title: LBMamba: Locally Bi-directional Mamba

Authors: Jingwei Zhang, Xi Han, Hong Qin, Mahdi S. Hosseini, Dimitris Samaras

Abstract: Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel selective scan, has recently emerged as a linearly-scaling, efficient alternative to self-attention. Because of its unidirectional nature, each state in Mamba only has information about its previous states and is blind to subsequent states. Current Mamba-based computer-vision methods typically overcome this limitation by augmenting Mamba's global forward scan with a global backward scan, forming a bi-directional scan that restores a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally had. To eliminate this extra scan, we introduce LBMamba, a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward selective scan and executes it entirely in per-thread registers. Building on LBMamba, we present LBVim, a scalable vision backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate the versatility of our approach on both natural images and whole slide images (WSIs). We show that our LBVim consistently offers a superior performance–throughput trade-off. That is, under the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher AP$^b$ and 1.1% higher AP$^m$ on the COCO detection dataset. Our method serves as a general-purpose enhancement, boosting the accuracy of four SOTA Mamba models, namely VMamba, LocalVim, PlainMamba and Adventurer, by 0.5% to 3.4%. We also integrate LBMamba into the SOTA pathology multiple instance learning (MIL) approach, MambaMIL, which uses a single-directional scan. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy. Our code is available at https://github.com/cvlab-stonybrook/LBMamba.

URL: https://openreview.net/forum?id=e1aXaIXblQ

---

Title: PersonalizedRouter: Personalized LLM Routing via Graph-based User Preference Modeling

Authors: Zhongjie Dai, Tao Feng, Jiaxuan You

Abstract: The growing number of Large Language Models (LLMs) with diverse capabilities and response styles provides users with a wider range of choices, which presents challenges in selecting appropriate LLMs, as user preferences vary in terms of performance, cost, and response style. Current LLM selection methods typically optimize for a single fixed objective, such as performance, cost, or a trade-off between them, and fail to learn individual user preferences from interaction data. To address these limitations, we propose PersonalizedRouter, a graph-based framework that models diverse user profiles and performs personalized LLM selection by leveraging interaction data that includes task context, queries, candidate LLMs, and user decisions. To capture contextual information between user queries and optimal LLMs, PersonalizedRouter converts the interaction data into a heterogeneous graph, where the relationships between different types of nodes are represented by edges. To evaluate adaptability across users, we design two strategies: the multi-cost-efficiency simulation strategy and the LLM-as-a-Judge strategy. In addition, we construct PersonaRoute-Bench, a large-scale benchmark with 1,000 simulated users and 10 LLMs. Experimental results show that PersonalizedRouter significantly outperforms existing LLM selection methods and surpasses the strongest methods by a large margin of 15.38% and 9.83% under two simulation strategies. On the PersonaRoute-Bench with 1,000 users, it further surpasses the best methods by 16.19% and 59.69% while maintaining higher efficiency. Moreover, PersonalizedRouter demonstrates strong few-shot generalization, achieving 64.81% and 85.80% of the fully trained model’s performance when adapting to new users and new LLMs.

URL: https://openreview.net/forum?id=W80eE3ArAl

---

Title: Preserving Expert-Level Privacy in Offline Reinforcement Learning

Authors: Navodita Sharma, Vishnu Vinod, Abhradeep Guha Thakurta, Alekh Agarwal, Borja Balle, Christoph Dann, Aravindan Raghuveer

Abstract: The offline reinforcement learning (RL) problem aims to learn an optimal policy from historical data collected by one or more behavioural policies (experts) by interacting with an environment. However, the individual experts may be privacy-sensitive in that the learnt policy may retain information about their precise choices. In some domains like personalized retrieval, advertising and healthcare, the expert choices are considered sensitive data. To provably protect the privacy of such experts, we propose a novel consensus-based expert-level differentially private offline RL training approach compatible with any existing offline RL algorithm. We prove rigorous differential privacy guarantees, while maintaining strong empirical performance. Unlike existing work in differentially private RL, we supplement the theory with proof-of-concept experiments on classic RL environments featuring large continuous state spaces, demonstrating substantial improvements over a natural baseline across multiple tasks.

URL: https://openreview.net/forum?id=2bj0eVgCdO

---

Title: ExDBN: Learning Dynamic Bayesian Networks using Extended Mixed-Integer Programming Formulations

Authors: Pavel Rytíř, Aleš Wodecki, Georgios Korpas, Jakub Marecek

Abstract: Causal learning from data has received much attention recently. Bayesian networks can be used to capture causal relationships. There, one recovers a weighted directed acyclic graph in which random variables are represented by vertices, and the weights associated with each edge represent the strengths of the causal relationships between them. This concept is extended to capture dynamic effects by introducing a dependency on past data, which may be captured by the structural equation model. This formalism is utilized in the present contribution to propose a score-based learning algorithm. A mixed-integer quadratic program is formulated and an algorithmic solution proposed, in which the pre-generation of exponentially many acyclicity constraints is avoided by utilizing the so-called branch-and-cut (``lazy constraint'') method. Comparing the novel approach to the state-of-the-art, we show that the proposed approach turns out to produce more accurate results when applied to small and medium-sized synthetic instances containing up to 80 time series. Lastly, two interesting applications in bioscience and finance, to which the method is directly applied, further stress the importance of developing highly accurate, globally convergent solvers that can handle instances of modest size.

URL: https://openreview.net/forum?id=I64MJzl9Fy

---

Title: Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer

Authors: Yu Yang, Pan Xu

Abstract: Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and Transformer's capability to model long sequences. Recent works have demonstrated that using parts of trajectories from training tasks as prompts in DT enhances its performance on unseen tasks, giving rise to Prompt-DT methods. However, collecting data from specific environments can be both costly and unsafe in many scenarios, leading to suboptimal performance and limited few-shot prompt abilities due to the data-hungry nature of Transformer-based models. Additionally, the limited datasets used in pre-training make it challenging for Prompt-DT type of methods to distinguish between various RL tasks through prompts alone. To address these challenges, we introduce the Language model-initialized Prompt Decision Transformer (LPDT) framework, which leverages pretrained language models to provide rich prior knowledge for RL tasks and fine-tunes the sequence model using Low-rank Adaptation (LoRA) for meta-RL problems. We further incorporate prompt regularization to effectively differentiate between tasks based on prompt feature representations. Comprehensive empirical studies demonstrate that initializing with a pre-trained language model provides useful prior knowledge and achieves performance similar to Prompt-DT with only $10\%$ of the data in some MuJoCo control tasks. We also provide a thorough ablation study to validate the effectiveness of each component, including sequence modeling, language models, prompt regularizations, and prompt strategies.
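
The LoRA part of this recipe is straightforward with Hugging Face peft; the sketch below attaches low-rank adapters to a small causal LM. GPT-2 and the hyperparameters are illustrative stand-ins, and the Decision-Transformer-specific input embeddings and prompt regularization are omitted.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    backbone = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative backbone

    lora_cfg = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["c_attn"],   # attention projections in GPT-2
        fan_in_fan_out=True,         # GPT-2 uses Conv1D-style projections
    )
    model = get_peft_model(backbone, lora_cfg)
    model.print_trainable_parameters()   # only the low-rank adapters are trainable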

URL: https://openreview.net/forum?id=k520i3XEMK

---

Title: Scaling Laws of Distributed Random Forests

Authors: Katharina Flügel, Charlotte Debus, Markus Götz, Achim Streit, Marie Weiel

Abstract: Random forests are a widely used machine learning technique valued for their robust predictive performance and conceptual simplicity. They are applied in many critical applications and often combined with federated learning to collaboratively build machine learning models across multiple distributed sites. The independent decision trees make random forests inherently parallelizable and well-suited for distributed and federated settings. Despite this perfect fit, there is a lack of comprehensive scalability studies, and many existing methods show limited parallel efficiency or are tested only at smaller scales. To address this gap, we present a comprehensive analysis of the scaling capabilities of distributed random forests on up to 64 compute nodes. Using a tree-parallel approach, we demonstrate a strong scaling speedup of up to 31.98 and a weak scaling efficiency of over 0.96 without affecting predictive performance of the global model. Comparing the performance trade-offs of distributed and local inference strategies enables us to simulate various real-life scenarios in terms of distributed computing resources, data availability, and privacy considerations. We further explore how increasing model and data size improves prediction accuracy, scaling up to 51 200 trees and 7.5 million training samples. We find that while distributing the data across nodes leads to super-scalar speedup, it negates the predictive benefit of increased data. Finally, we study the impact of distributed and non-IID data and find that while global imbalance reduces performance, local distribution differences can help mitigate this effect.
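
Because the trees are independent, a tree-parallel version fits in a few lines with mpi4py and scikit-learn: each rank grows its share of the forest and class probabilities are averaged across ranks. This is only a minimal sketch with replicated data; the paper's distributed, federated, and non-IID setups are considerably more involved.

    # Run with, e.g.:  mpirun -n 4 python tree_parallel_rf.py
    import numpy as np
    from mpi4py import MPI
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

    total_trees = 1024
    local_forest = RandomForestClassifier(
        n_estimators=total_trees // size,   # each rank grows its share of the trees
        random_state=rank,                  # different trees on every rank
        n_jobs=-1,
    ).fit(X, y)

    # Global prediction: average class probabilities over all local forests.
    global_proba = np.mean(comm.allgather(local_forest.predict_proba(X)), axis=0)
    if rank == 0:
        print("train accuracy:", (global_proba.argmax(axis=1) == y).mean())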

URL: https://openreview.net/forum?id=ICHxTlgnSy

---

Title: LCEN: A Nonlinear, Interpretable Feature Selection and Machine Learning Algorithm

Authors: Pedro Seber, Richard Braatz

Abstract: Interpretable models can have advantages over black-box models, and interpretability is essential for the application of machine learning in critical settings, such as aviation or medicine. In this work, we introduce the LASSO-Clip-EN (LCEN) algorithm for nonlinear, interpretable feature selection and machine learning modeling. LCEN is tested on a wide variety of artificial and empirical datasets, creating sparse and frequently more accurate models than other methods, including sparse, nonlinear methods, on tested datasets. LCEN is robust against many issues typically present in datasets and modeling, including noise, multicollinearity, and data scarcity. As a feature selection algorithm, LCEN matches or surpasses the thresholded elastic net but is, on average, 10.3-fold faster based on our experiments. LCEN for feature selection can also rediscover multiple physical laws from empirical data. As a machine learning algorithm, when tested on processes with no known physical laws, LCEN achieves better results than many other dense and sparse methods --- including being comparable to or better than ANNs on multiple datasets.
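
Read literally, the name suggests a LASSO screen, a clipping of near-zero coefficients, and an elastic net refit on the survivors; the linear-only sketch below follows that reading with scikit-learn. The nonlinear feature expansion and the actual thresholds used by LCEN are not reproduced here.

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet

    def lasso_clip_en(X, y, lasso_alpha=0.01, clip=1e-2, en_alpha=0.01, l1_ratio=0.5):
        """LASSO to screen features, clip tiny coefficients, refit an elastic net."""
        lasso = Lasso(alpha=lasso_alpha).fit(X, y)
        keep = np.flatnonzero(np.abs(lasso.coef_) > clip)        # clip step
        if keep.size == 0:
            keep = np.array([np.abs(lasso.coef_).argmax()])      # keep at least one
        en = ElasticNet(alpha=en_alpha, l1_ratio=l1_ratio).fit(X[:, keep], y)
        return keep, en

    X = np.random.randn(200, 50)
    y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * np.random.randn(200)
    selected, model = lasso_clip_en(X, y)   # should recover features 0 and 3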

URL: https://openreview.net/forum?id=wmNucISPdl

---

Title: Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

Authors: Manish Nagaraj, Deepak Ravikumar, Kaushik Roy

Abstract: Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences ($\mathtt{CLD}$), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. $\mathtt{CLD}$ is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for $\mathtt{CLD}$-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, $\mathtt{CLD}$-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1\% of more computationally expensive baselines even when not leading. $\mathtt{CLD}$ transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with $<1\%$ degradation. Moreover, $\mathtt{CLD}$ is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, $\mathtt{CLD}$ exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make $\mathtt{CLD}$ a principled, efficient, stable, and transferable tool for scalable dataset optimization.
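
The metric itself is simple to state in code: take per-sample training losses recorded at checkpoints, difference them across consecutive checkpoints, and correlate each sample's difference sequence with the mean validation difference sequence. Recording the loss trajectories during training is assumed to happen elsewhere, and the array names below are placeholders.

    import numpy as np

    def cld_scores(train_losses, val_losses):
        """Correlation of Loss Differences.

        train_losses: (n_train, n_ckpt) per-sample training losses.
        val_losses:   (n_val,   n_ckpt) per-sample held-out validation losses.
        """
        dtrain = np.diff(train_losses, axis=1)                 # loss differences
        dval = np.diff(val_losses, axis=1).mean(axis=0)        # mean val trajectory
        dtrain_c = dtrain - dtrain.mean(axis=1, keepdims=True)
        dval_c = dval - dval.mean()
        num = dtrain_c @ dval_c
        den = np.linalg.norm(dtrain_c, axis=1) * np.linalg.norm(dval_c) + 1e-12
        return num / den                                       # Pearson r per sample

    def select_coreset(train_losses, val_losses, frac=0.1):
        scores = cld_scores(train_losses, val_losses)
        k = max(1, int(frac * len(scores)))
        return np.argsort(scores)[::-1][:k]                    # best-aligned samples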

URL: https://openreview.net/forum?id=QY0pbZTWJ9

---

Title: AEAP: A Reinforcement Learning Actor Ensemble Algorithm with Adaptive Pruning

Authors: WEI ZHANG, Guni Sharon

Abstract: Actor ensemble reinforcement learning methods have shown promising performance on dense-reward continuous control tasks. However, they exhibit three primary limitations: (1) diversity collapse when using a shared replay buffer, often necessitating carefully tuned regularization terms;
(2) computational overhead from maintaining multiple actors; and (3) analytically intractable policy gradients when using stochastic policies in ensembles, requiring approximations that may compromise performance. To address this third limitation, we restrict the ensemble to deterministic policies and propose Actor Ensemble with Adaptive Pruning (AEAP), a multi-actor deterministic policy gradient algorithm that tackles the remaining limitations through a two-stage approach. First, to alleviate diversity collapse, AEAP employs dual-randomized actor selection that decorrelates exploration and learning by randomly choosing different actors for both environment interaction and policy update. This approach also removes reliance on explicit regularization. Second, when convergence to homogeneous policies still occurs over time, computational efficiency is further achieved through adaptive dual-criterion pruning, which progressively removes underperforming or redundant actors based on critic-estimated value and action-space similarity. Although AEAP introduces four additional hyperparameters compared to TD3 (a baseline single-actor deterministic policy gradient algorithm), we provide two domain-agnostic parameter configurations that perform robustly across environments without requiring tuning.
AEAP achieves superior or competitive asymptotic performance compared to baselines across six dense-reward MuJoCo tasks. On sparse-reward Fetch benchmarks, AEAP outperforms deterministic policy gradient methods but falls short of SAC (a baseline stochastic policy gradient algorithm) on one of three tasks. When compared to fixed-size multi-actor baselines, AEAP reduces wall-clock time without sacrificing performance, establishing it as an efficient and reliable actor ensemble variant.

URL: https://openreview.net/forum?id=I5ymMVdmaR

---

Title: FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers

Authors: Ruichen Chen, Keith G. Mills, Di Niu

Abstract: Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinders practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but does not align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In this paper, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.

URL: https://openreview.net/forum?id=CcnH4mSQbP

---

Title: Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training

Authors: Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca

Abstract: Network pruning focuses on algorithms that aim to reduce a given model's computational cost by removing a subset of its parameters while having minimal impact on performance. Throughout the last decade, the most widely used pruning paradigm has been pruning and re-training, which nowadays is inconvenient due to the vast amount of pre-trained models, which are, in any case, too expensive to re-train. In this paper, we exploit functional information from dense pre-trained models, i.e., their input activations, to obtain sparse models that maximize the activations' alignment with respect to their corresponding dense models. Hence, we propose \algname, a \emph{top-up} algorithm that can be used on top of any given pruning algorithm for LLMs, which modifies the block-wise and row-wise sparsity, exploiting information from both the dense model and its sparse version to maximize the \emph{neuron alignment} among activations. Different from existing methods, our approach adaptively selects the best hyperparameters for the block-wise and row-wise sparsity ratios w.r.t. the model and the desired sparsity, and requires \emph{no re-training}. We test our method over $\sim$300 test cases with four LLM families, three sparsity ratios, and ten language tasks (three language modeling and seven zero-shot datasets), showing how it consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off.

URL: https://openreview.net/forum?id=uPyNaNqFK2

---

Title: IndicFake Meets SAFARI-LLM: Unifying Semantic and Acoustic Intelligence for Multilingual Deepfake Detection

Authors: Rishabh Ranjan, Mayank Vatsa, Richa Singh

Abstract: Audio deepfakes pose a growing threat, particularly in linguistically diverse and low-resource settings where existing detection methods often struggle. This work introduces two transformative contributions to address these challenges. First, we present \textbf{IndicFake}, a pioneering audio deepfake dataset with over 4.2 million samples (7,350 hours) spanning English and 17 Indian languages across Indo-European, Dravidian, and Sino-Tibetan families. With minimal overlap (Jaccard similarity: 0.00--0.06) with existing datasets, IndicFake offers an unparalleled benchmark for multilingual deepfake detection. Second, we propose \textbf{SAFARI-LLM} (Semantic Acoustic Feature Adaptive Router with Integrated LLM), a novel framework that integrates Whisper’s semantic embeddings and m-HuBERT’s acoustic features through an adaptive Audio Feature Unification Module (AFUM). Enhanced by LoRA-fine-tuned LLaMA-7B, SAFARI-LLM achieves unmatched cross-lingual and cross-family generalization. Evaluations across IndicFake, DECRO, and WaveFake datasets demonstrate its superiority, outperforming 14 state-of-the-art models with standout accuracies of 94.21\% (English-to-Japanese transfer on WaveFake) and 84.48\% (English-to-Chinese transfer on DECRO), alongside robust performance across diverse linguistic contexts. These advancements establish a new standard for reliable, scalable audio deepfake detection. Code and resources are publicly available at: https://anonymousillusion.github.io/indicfake/.

URL: https://openreview.net/forum?id=s8pPYRVVTU

---

Title: SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

Authors: Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, Cihang Xie

Abstract: This work explores two distinct approaches for enhancing reasoning abilities in Large Vision Language Models (LVLMs): supervised fine-tuning (SFT) and reinforcement learning (RL). To support the SFT approach, we curate a multimodal reasoning dataset with the complete reasoning trace guided by DeepSeek-R1. For the RL approach, we focus on GRPO and develop a training framework tailored to vision-language tasks with a composite reward system comprising four signals that address both visual perception and reasoning challenges. Our extensive experiments reveal that RL is a significantly more effective strategy than SFT for training reasoning VLMs. While SFT can assist models that initially struggle with following reasoning instructions, it often induces ``pseudo aha moments'' that degrade overall reasoning performance, implying that only a minimal amount of SFT data is necessary. In contrast, RL leads to substantial improvements, outperforming recent baseline models on a range of math reasoning tasks by at least 2% on average. We also present several intriguing findings --- e.g., combining SFT and GRPO also hurts the model performance, and stronger instruction-aligned LVLMs consistently lead to better results in RL. We hope these findings provide valuable insights into the development of reasoning-capable VLMs and guide future research in this area.

URL: https://openreview.net/forum?id=wZI5qkQeDF

---

Title: Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems

Authors: Jeffrey Wen, Rizwan Ahmad, Philip Schniter

Abstract: In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data.
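
For orientation, a standard multi-target construction that already gives joint marginal coverage looks as follows: normalize per-target calibration residuals, take the maximum over targets as the nonconformity score, and expand each target's interval by the calibrated quantile. This is the usual max-score baseline, not necessarily the paper's minimax procedure, and the scaling choice is an assumption.

    import numpy as np

    def joint_intervals(pred_cal, y_cal, pred_test, alpha=0.1):
        """pred_cal, y_cal: (n_cal, T); pred_test: (n_test, T) for T targets."""
        scale = np.std(y_cal - pred_cal, axis=0) + 1e-12           # per-target scale
        scores = np.max(np.abs(y_cal - pred_cal) / scale, axis=1)  # max over targets
        n = len(scores)
        level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)       # finite-sample correction
        q = np.quantile(scores, level, method="higher")
        return pred_test - q * scale, pred_test + q * scale        # joint >= 1 - alpha coverage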

URL: https://openreview.net/forum?id=53FEYwDQK0

---

Title: Early Classification of Time Series: A Survey and Benchmark

Authors: Aurélien Renault, Alexis Bondu, Antoine Cornuéjols, Vincent Lemaire

Abstract: In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still a lack of a systematic, shared evaluation protocol to compare the relative merits of the various existing methods. In this paper, we highlight the two components of an ECTS system: \textit{decision} and \textit{prediction}, and focus on the approaches that separate them. This document begins by situating these methods within a principle-based taxonomy. It defines dimensions for organizing their evaluation and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the-art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see https://github.com/ML-EDM/ml_edm).
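
The decision/prediction split the survey emphasizes can be pictured with the simplest possible trigger rule: a per-prefix-length classifier makes predictions, and a separate confidence threshold decides when to stop waiting. The threshold trigger and the per-length classifier dictionary below are only illustrative; the benchmarked ECTS methods use far richer decision rules (see the ml_edm library linked above).

    def early_classify(x_series, classifiers, threshold=0.9):
        """Separate prediction (classifiers) from decision (trigger rule).

        x_series:    1-D numpy array, revealed prefix by prefix.
        classifiers: dict {t: model}, where model.predict_proba takes a length-t prefix.
        Returns (predicted_label, trigger_time).
        """
        last_t = max(classifiers)
        for t in sorted(classifiers):
            proba = classifiers[t].predict_proba(x_series[:t].reshape(1, -1))[0]
            if proba.max() >= threshold or t == last_t:   # decision component
                return int(proba.argmax()), t             # prediction component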

URL: https://openreview.net/forum?id=bcNDYmBicK

---

Title: Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Authors: Yazhou Zhang, Chunwang Zou, Bo Wang, Jing Qin, Prayag Tiwari

Abstract: Multimodal sarcasm understanding is a high-order cognitive task. Although large language models (LLMs) have shown impressive performance on many downstream NLP tasks, growing evidence suggests that they struggle with sarcasm understanding.
In this paper, we propose Commander-GPT, a modular decision routing framework inspired by military command theory. Rather than relying on a single LLM's capability, Commander-GPT orchestrates a team of specialized LLM agents where each agent will be selectively assigned to a focused sub-task such as context modeling, sentiment analysis, etc. Their outputs are then routed back to the commander, which integrates the information and performs the final sarcasm judgment.
To coordinate these agents, we introduce three types of centralized commanders:
(1) a trained lightweight encoder-based commander (e.g., multi-modal BERT); (2) four small autoregressive language models, serving as moderately capable commanders (e.g., DeepSeek-VL); (3) two large LLM-based commanders (Gemini Pro and GPT-4o) that perform task routing, output aggregation, and sarcasm decision-making in a zero-shot fashion.
We evaluate Commander-GPT on the MMSD and MMSD 2.0 benchmarks, comparing five prompting strategies. Experimental results show that our framework achieves 4.4% and 8.5% improvement in F1 score over state-of-the-art (SoTA) baselines on average, demonstrating its effectiveness.

URL: https://openreview.net/forum?id=zRxRbBsqwE

---

Title: Higher Order Transformers With Kronecker-Structured Attention

Authors: Soroush Omranpour, Guillaume Rabusseau, Reihaneh Rabbany

Abstract: Modern datasets are increasingly high-dimensional and multiway, often represented as tensor-valued data with multi-indexed variables. While Transformers excel in sequence modeling and high-dimensional tasks, their direct application to multiway data is computationally prohibitive due to the quadratic cost of dot-product attention and the need to flatten inputs, which disrupts tensor structure and cross-dimensional dependencies.
We propose the Higher-Order Transformer (HOT), a novel factorized attention framework that represents multiway attention as sums of Kronecker products or sums of mode-wise attention matrices. HOT efficiently captures dense and sparse relationships across dimensions while preserving tensor structure. Theoretically, HOT retains the expressiveness of full high-order attention and allows complexity control via factorization rank.
Experiments on 2D and 3D datasets show that HOT achieves competitive performance in multivariate time series forecasting and image classification, with significantly reduced computational and memory costs. Visualizations of mode-wise attention matrices further reveal interpretable high-order dependencies learned by HOT, demonstrating its versatility for complex multiway data across diverse domains.
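
The flavor of mode-wise attention can be shown on a (batch, time, variables, features) tensor: attend along each mode with the other mode folded into the batch, then sum the two outputs. This sums mode-wise attention outputs rather than the attention matrices themselves, so it is a simplification of HOT's Kronecker-structured formulation, and all shapes below are chosen for illustration.

    import torch
    import torch.nn.functional as F

    def mode_attention(x, mode):
        """Self-attention along one mode of x with shape (B, T, S, D); mode in {1, 2}."""
        B, T, S, D = x.shape
        if mode == 1:                    # attend over time, fold space into the batch
            z = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        else:                            # attend over space, fold time into the batch
            z = x.reshape(B * T, S, D)
        out = F.scaled_dot_product_attention(z, z, z)
        if mode == 1:
            return out.reshape(B, S, T, D).permute(0, 2, 1, 3)
        return out.reshape(B, T, S, D)

    def higher_order_attention(x):
        """Sum of mode-wise attentions: cost O(T^2 + S^2) vs O((T*S)^2) for full attention."""
        return mode_attention(x, mode=1) + mode_attention(x, mode=2)

    x = torch.randn(2, 96, 7, 64)        # e.g. 96 time steps, 7 variates
    y = higher_order_attention(x)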

URL: https://openreview.net/forum?id=QN0aXcKFkT

---

Title: Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving

Authors: Yinzhe Shen, Omer Sahin Tas, Kaiwen Wang, Royden Wagner, Christoph Stiller

Abstract: Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion-related tasks, such as prediction and planning, impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method that separates semantic and motion learning. Specifically, we employ a set of learned motion queries that operate in parallel with detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset with UniAD and SparseDrive confirm the effectiveness of our divide and merge approach, resulting in performance improvements across perception, prediction, and planning. The code will be released.

URL: https://openreview.net/forum?id=RvtCNm1Rdv

---


New submissions
===============


Title: Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

Abstract: We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity based on a diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme in training. These key ingredients significantly improve algorithm robustness against environment changes and achieve significant improvements in performance, generalization and data-efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better to shifted environments (in $28$ out of $30$ settings evaluated) thanks to its high expressiveness and diversity. Moreover, DOM2 is highly data-efficient and requires no more than $5\%$ of the data to achieve the same performance as existing algorithms (a $20\times$ improvement in data efficiency).

URL: https://openreview.net/forum?id=GKuCKSJKvl

---

Title: Tube Loss: A Novel Loss Function for Prediction Interval Estimation

Abstract: This paper proposes a novel loss function called Tube Loss, developed for the simultaneous estimation of the lower and upper bounds of a Prediction Interval (PI) in regression problems, including probabilistic forecasting in autoregressive frameworks. The PIs obtained through empirical risk minimization using Tube Loss exhibit superior performance compared to those derived from existing approaches. A theoretical analysis confirms that the estimated PIs asymptotically attain a user-specified confidence level $1-\alpha$. A distinctive feature of Tube Loss is its ability to shift the PI along the support of the response distribution through a tunable parameter, allowing the intervals to better align with high-density regions of the distribution. This is especially valuable for generating tighter intervals when the response distribution is skewed. Moreover, the method allows further narrowing of PIs through recalibration. Unlike several prior techniques, the empirical risk associated with Tube Loss can be efficiently optimized via gradient descent. Extensive experiments demonstrate the robustness and accuracy of the proposed method in delivering high-quality PIs across a range of models, including kernel machines, neural networks, and probabilistic forecasting frameworks.

URL: https://openreview.net/forum?id=3vwPza62Rr

---

Title: Coreset Selection via LLM-based Concept Bottlenecks

Abstract: Coreset Selection (CS) aims to identify a subset of the training dataset that achieves model performance comparable to using the entire dataset. Many state-of-the-art CS methods select coresets using scores whose computation requires training the downstream model on the entire dataset first and recording changes in the model's behavior on samples as it trains (training dynamics). These scores are inefficient to compute and hard to interpret, as they do not indicate whether a sample is difficult to learn in general or only for a specific downstream model. Our work addresses these challenges by proposing a score that computes a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model. Specifically, we measure the alignment between a sample's visual features and concept bottlenecks, derived via large language models, by training a linear concept bottleneck layer and computing the sample's difficulty score using it. We then use stratified sampling based on this score to generate a coreset of the dataset. Crucially, our score is efficiently computable without training the downstream model on the full dataset even once, leads to high-performing coresets for various downstream models, and is computable even for an unlabeled dataset. Through experiments on five diverse datasets including ImageNet-1K, we show that our coresets outperform random subsets, even at high pruning rates, and lead to model performance comparable to or better than coresets found by training dynamics-based methods.

URL: https://openreview.net/forum?id=dGbBPXWFrL

---

Title: From Link Prediction to Forecasting: Addressing Challenges in Batch-based Temporal Graph Learning

Abstract: Dynamic link prediction is an important problem considered in many recent works that propose approaches for learning temporal edge patterns. To assess their efficacy, models are evaluated on continuous-time and discrete-time temporal graph datasets, typically using a traditional batch-oriented evaluation setup. However, as we show in this work, a batch-oriented evaluation is often unsuitable and can cause several issues. Grouping edges into fixed-sized batches regardless of their occurrence time leads to information loss or leakage, depending on the temporal granularity of the data. Furthermore, fixed-size batches create time windows with different durations, resulting in an inconsistent dynamic link prediction task. In this work, we empirically show how traditional batch-based evaluation leads to skewed model performance and hinders the fair comparison of methods. We mitigate this problem by reformulating dynamic link prediction as a link forecasting task that better accounts for temporal information present in the data.
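
The reformulation boils down to how edges are grouped: the sketch below contrasts fixed-size batches (equal edge counts, unequal and potentially leaky time spans) with fixed-duration windows (equal horizons, variable edge counts), which is the link-forecasting setup the paper argues for. Variable names are placeholders.

    import numpy as np

    def fixed_size_batches(timestamps, batch_size):
        """Traditional setup: equal edge counts, unequal (and possibly leaky) time spans."""
        order = np.argsort(timestamps)
        return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

    def time_window_batches(timestamps, window):
        """Forecasting setup: equal-duration windows, variable edge counts."""
        t = np.asarray(timestamps)
        left, end, batches = t.min(), t.max(), []
        while left <= end:
            batches.append(np.flatnonzero((t >= left) & (t < left + window)))
            left += window
        return batches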

URL: https://openreview.net/forum?id=iZPAykLE3l

---

Title: Improving Local Explainability By Learning Causal Graphs From Data

Abstract: Causal Shapley values take into account causal relations among dependent features to adjust the contributions of each feature to a prediction. A limitation of this approach is that it can only leverage known causal relations.
In this work, we combine the computation of causal Shapley values with causal discovery, i.e., learning causal graphs from data. In particular, we compute causal explanations across a set of candidate causal graphs learned from observational data, yielding a set of Shapley values that reflects the space of possible explanations consistent with the data. We propose two methods for estimating this set efficiently, drawing on the equivalences of the interventional distributions for a subset of the causal graphs. We evaluate our methods on synthetic and real-world data, showing that they provide explanations that are often closer to the true causal impacts compared to traditional Shapley value approaches that disregard causal relationships. Even when the discovered graph or Markov equivalence class (MEC) is imperfect, we observe improvements over marginal and conditional Shapley values on average.

URL: https://openreview.net/forum?id=A1bXT7RQLU

---

Title: Learning High-Order Motion Patterns from Event Stream for Continuous Space-Time Video Super-Resolution

Abstract: Current methods in the domain of continuous space-time video super-resolution achieve temporal alignment by predicting motion between frames. However, these frame-based approaches encounter challenges with inaccurate optical flow estimation. To overcome this, we incorporate event data, enhancing both temporal and spatial aspects of video super-resolution. Based on the motion details conveyed by event streams, our proposed method, EvTaylor-Net, performs a Taylor expansion approximation of the object motion function at specified timestamps to estimate more precise forward optical flow. Our method estimates masks from the event surface to alleviate the issue of multiple source pixels mapping to the same target position during the forward warping process. Furthermore, EvTaylor-Net adopts local implicit neural representation to simultaneously enhance the resolution of videos in both the temporal and spatial domains, ensuring a comprehensive improvement of video quality. Extensive experimental results demonstrate that the proposed EvTaylor-Net, bolstered by event streams, outperforms state-of-the-art methods for spatio-temporal video super-resolution tasks.

URL: https://openreview.net/forum?id=yAi6lRT3Ai

---

Title: KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Abstract: Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing, but are largely neglected by existing works. To address these shortcomings, we present KeySync, a two-stage framework that succeeds in mitigating the issue of temporal consistency, while also incorporating solutions for leakage and occlusions using a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Our code and models will be made publicly available.

URL: https://openreview.net/forum?id=dvtMHhZUyG

---

Title: GENIE: Watermarking Graph Neural Networks for Link Prediction

Abstract: The rapid adoption, usefulness, and resource-intensive training of Graph Neural Network (GNN) models have made them an invaluable intellectual property in graph-based machine learning. However, their widespread adoption also makes them susceptible to stealing, necessitating robust Ownership Demonstration (OD) techniques. Watermarking is a promising OD framework for deep neural networks, but existing methods fail to generalize to GNNs due to the non-Euclidean nature of graph data. Existing works on GNN watermarking primarily focus on node and graph classification, overlooking Link Prediction (LP).
In this paper, we propose GENIE (watermarking Graph nEural Networks for lInk prEdiction), the first scheme to watermark GNNs for LP. GENIE creates a novel backdoor for both node-representation and subgraph-based LP methods, utilizing a unique trigger set and a secret watermark vector. Our OD scheme is equipped with Dynamic Watermark Thresholding (DWT), ensuring high verification probability while addressing practical issues in existing OD schemes. We extensively evaluate GENIE across 4 diverse model architectures (i.e., SEAL, GCN, GraphSAGE and NeoGNN), 7 real-world datasets and 21 watermark removal techniques, and demonstrate its robustness to watermark removal and ownership piracy attacks. Finally, we discuss adaptive attacks against GENIE and a defense strategy to counter it.

URL: https://openreview.net/forum?id=EmDuoySsbe

---

Title: Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology

Abstract: Deep temporal architectures such as Temporal Convolutional Networks (TCNs) achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap through three contributions: introducing a principled evaluation methodology for temporal models, revealing surprising empirical phenomena about temporal dependence, and establishing the first architecture-aware theoretical framework for dependent sequences.

\textbf{Fair-Comparison Methodology.} We introduce evaluation protocols that fix effective sample size $N_{\text{eff}}$ to isolate temporal structure effects from information content. This addresses a fundamental challenge: temporal dependence affects both information content and learning dynamics, and standard evaluations conflate these effects. Our methodology enables principled comparison of models across dependency regimes.

\textbf{Empirical Findings.} Applying this methodology reveals that under a controlled $N_{\text{eff}} = 2{,}000$, strongly dependent sequences ($\rho = 0.8$) exhibit approximately $76\%$ smaller generalization gaps than weakly dependent ones ($\rho = 0.2$), challenging the conventional view that dependence universally impedes learning. However, the observed convergence rates ($N_{\text{eff}}^{-1.21}$ to $N_{\text{eff}}^{-0.89}$) are significantly faster than the theoretical worst-case prediction ($N^{-0.5}$), revealing that temporal architectures exploit problem structure in ways current theory does not capture.

\textbf{Theoretical Framework.} To provide the foundations for these empirical investigations, we develop the first architecture-aware generalization bounds for deep temporal models on exponentially $\beta$-mixing sequences. By embedding Golowich et al.'s i.i.d. class bound within a novel blocking scheme that partitions $N$ samples into approximately $B \approx N/\log N$ quasi-independent blocks, we establish polynomial sample complexity under convex Lipschitz losses. The framework achieves $\sqrt{D}$ depth scaling alongside the product of layer-wise norms $R = \prod_{\ell=1}^{D} M^{(\ell)}$, avoiding exponential dependence. While these bounds are conservative, as our empirical results demonstrate, they prove learnability and identify architectural scaling laws, providing worst-case baselines that highlight where future theory must improve to explain observed performance.
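A small sketch of the blocking construction referenced above, assuming the block length is taken to be $\lceil\log N\rceil$ so that roughly $N/\log N$ quasi-independent blocks result; the paper's exact constants may differ.

```python
import math

def make_blocks(n_samples):
    """Partition indices 0..n-1 into contiguous blocks of length ~log(n),
    giving B ~ n / log(n) quasi-independent blocks for a beta-mixing sequence."""
    block_len = max(1, int(math.ceil(math.log(n_samples))))
    return [list(range(s, min(s + block_len, n_samples)))
            for s in range(0, n_samples, block_len)]

blocks = make_blocks(2000)
print(len(blocks), len(blocks[0]))  # ~ 2000 / log(2000) blocks, each of length ~8
```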

URL: https://openreview.net/forum?id=fM9LP57Kai

---

Title: Contextual Learning for Anomaly Detection in Tabular Data

Abstract: Anomaly detection is critical in domains such as cybersecurity and finance, especially when working with large-scale tabular data. Yet, unsupervised anomaly detection---where no labeled anomalies are available---remains challenging because traditional deep learning methods model a single global distribution, assuming all samples follow the same behavior. In contrast, real-world data often contain heterogeneous contexts (e.g., different users, accounts, or devices), where globally rare events may be normal within specific conditions. We introduce a \emph{contextual learning framework} that explicitly models how normal behavior varies across contexts by learning conditional data distributions $P(\mathbf{Y} \mid \mathbf{C})$ rather than a global joint distribution $P(\mathbf{X})$. The framework encompasses (1) a probabilistic formulation for context-conditioned learning, (2) a principled bilevel optimization strategy for automatically selecting informative context features using early validation loss, and (3) theoretical grounding through variance decomposition and discriminative learning principles. We instantiate this framework using a novel conditional Wasserstein autoencoder as a simple yet effective model for tabular anomaly detection. Extensive experiments across eight benchmark datasets demonstrate that contextual learning consistently outperforms global approaches---even when the optimal context is not intuitively obvious---establishing a new foundation for anomaly detection in heterogeneous tabular data.
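The sketch below illustrates the contextual idea only, using per-context Gaussian likelihoods rather than the paper's conditional Wasserstein autoencoder: points are scored against the distribution of their own context, so globally rare but context-typical samples are not flagged.

```python
import numpy as np

def contextual_scores(X, contexts):
    """Illustration of contextual (conditional) anomaly scoring, not the paper's
    conditional WAE: fit a diagonal Gaussian to the behavioral features within each
    context and score points by their negative log-likelihood under their own context."""
    scores = np.empty(len(X))
    for c in np.unique(contexts):
        idx = contexts == c
        mu, var = X[idx].mean(0), X[idx].var(0) + 1e-6
        scores[idx] = 0.5 * (((X[idx] - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(1)
    return scores

rng = np.random.default_rng(0)
contexts = rng.integers(0, 3, size=600)
X = rng.normal(loc=contexts[:, None] * 5.0, scale=1.0, size=(600, 2))  # context-shifted normals
print(contextual_scores(X, contexts)[:5].round(2))
```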

URL: https://openreview.net/forum?id=PmqZslRENW

---

Title: Collaborative QA using Interacting LLMs. Impact of Network Structure, Node Capability and Distributed Data.

Abstract: In this paper, we model and analyze how a network of interacting LLMs performs \textit{collaborative question-answering (CQA)} in order to estimate a ground truth given a distributed set of documents. This problem is interesting because LLMs often hallucinate when direct evidence to answer a question is lacking, and these effects become more pronounced in a network of interacting LLMs. The hallucination spreads, causing previously accurate LLMs to hallucinate. We study interacting LLMs and their hallucination by combining novel ideas of mean-field dynamics (MFD) from network science and the randomized utility model from economics to construct a useful generative model. We model the LLM with a latent state that indicates if it is truthful or not with respect to the ground truth, and extend a tractable analytical model considering an MFD to model the diffusion of information in a directed network of LLMs. To specify the probabilities that govern the dynamics of the MFD, we propose a randomized utility model. For a network of LLMs, where each LLM has two possible latent states, we posit sufficient conditions for the existence and uniqueness of a fixed point and analyze the behavior of the fixed point in terms of the incentive (e.g., test-time compute) given to individual LLMs. We experimentally study and analyze the behavior of a network of $100$ open-source LLMs with respect to data heterogeneity, node capability, network structure, and sensitivity to framing on multiple semi-synthetic datasets.
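A toy fixed-point iteration in the spirit of the mean-field dynamics described above; the choice-probability curves are invented placeholders, whereas the paper derives them from a randomized utility model.

```python
import numpy as np

def mfd_fixed_point(p_stay, p_recover, x0=0.9, iters=200):
    """Toy mean-field update for the fraction x of truthful LLMs in the network.
    A truthful node stays truthful with probability p_stay(x); a hallucinating node
    recovers with probability p_recover(x). Both depend on the current mean field x."""
    x = x0
    for _ in range(iters):
        x = x * p_stay(x) + (1 - x) * p_recover(x)
    return x

# Placeholder choice-probability curves (in the paper these come from a randomized
# utility model with an incentive such as test-time compute).
stay = lambda x: 1 / (1 + np.exp(-(4 * x - 1)))
recover = lambda x: 1 / (1 + np.exp(-(4 * x - 3)))
print(round(float(mfd_fixed_point(stay, recover)), 3))
```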

URL: https://openreview.net/forum?id=nyZ4JMrV8b

---

Title: Towards Generalized Certified Robustness with Multi-Norm Training

Abstract: Existing certified training methods can only train models to be robust against a certain perturbation type (e.g., $l_\infty$ or $l_2$). However, an $l_\infty$ certifiably robust model may not be certifiably robust against $l_2$ perturbations (and vice versa) and may also have low robustness against other perturbations (e.g., geometric and patch transformations). By constructing a theoretical framework to analyze and mitigate this tradeoff, we propose the first multi-norm certified training framework, \textbf{CURE}, consisting of several multi-norm certified training methods, to attain better \emph{union robustness} when training from scratch or fine-tuning a pre-trained certified model. Inspired by our theoretical findings, we devise bound alignment and connect natural training with certified training for better union robustness. Compared with SOTA certified training, \textbf{CURE} improves union robustness to $32.0\%$ on MNIST, $25.8\%$ on CIFAR-10, and $10.6\%$ on TinyImagenet across different epsilon values, and improves generalization on a diverse set of challenging unseen geometric and patch perturbations to $6.8\%$ and $16.0\%$ on CIFAR-10. Overall, our contributions pave a path towards \textit{generalized certified robustness}.

URL: https://openreview.net/forum?id=U5U7pazr6X

---

Title: Post-Training Augmentation Invariance

Abstract: This work develops a framework for post-training augmentation invariance, in which our goal is to add invariance properties to a pretrained network without altering its behavior on the original, non-augmented input distribution. We define this notion precisely and additionally introduce augmented encoders, which are probabilistic encoders that formalize augmentation-based encoding processes and that serve as our fundamental object of study. We introduce two optimal transport-based losses for augmented encoders, namely, Markov-Wasserstein minimization and Wasserstein correlation maximization, and we demonstrate empirically that both losses can be used to train lightweight, one-hidden-layer MLP adapter networks $E_{\theta}$ that, when appended to the latent space of a pretrained network $F$, do indeed lead to (approximate) post-training augmentation invariance. For example, on STL10 with $F=\text{DINOv2}$ features, the composite network $C\circ E_{\theta}\circ F$, where $C$ is a linear classifier, achieves $90\%$ classification accuracy on arbitrarily rotated images, whereas a network of the form $C\circ F$ without the adapter $E_{\theta}$ drops to $71\%$ accuracy. Similarly, we can boost noise-invariant classification results from $62\%$ up to nearly $80\%$. Significantly, we obtain these results with no fine-tuning (the weights of $F$ remain frozen throughout), and our methods introduce little corruption to the original features, since $E_{\theta}$ acts nearly isometrically on the non-augmented latent distribution. In contrast, we show that adapter networks trained with alternative candidate losses, specifically SimCLR and HSIC maximization, produce uncompetitive classification results and fundamentally corrupt the original latent space.

URL: https://openreview.net/forum?id=Z4uUwU6zRe

---

Title: ActionEQA: Action Interface for Embodied Question Answering

Abstract: While Vision-Language Models (VLMs) are increasingly integral to embodied intelligence, a significant action understanding bottleneck persists in translating high-level semantic instructions into precise low-level physical actions. However, current benchmarks for embodied agents primarily focus on high-level perception and planning, failing to capture the depth and nature of this semantic-to-physical gap. To address this, we introduce ActionEQA, the first Embodied Question Answering (EQA) benchmark designed to methodically evaluate the ability of VLMs to bridge this critical yet underexplored semantic-physical divide. Grounded in real-world robotics data, ActionEQA thoroughly analyzes VLMs’ grasp of the action interface using a dual-tier design: (1) a Three-Tiered Action Hierarchy for pinpointing the depth at which VLMs' action reasoning collapses. (2) Bidirectional Reasoning Tasks for testing whether VLMs struggle more to predict action outcomes or infer the actions that led to them. Our key findings reveal: (1) The primary bottleneck in action understanding occurs at the mid-level, arising from the challenge of grounding compositional language in 3D physical geometry. (2) VLMs are more adept at inferring past actions than predicting their future outcomes. (3) Richer visual inputs require greater spatial reasoning from VLMs to map actions to physical geometry. (4) Within the action hierarchy, model failures shift from predominantly perceptual errors at the high level to flawed geometric and physical reasoning at the low level.

URL: https://openreview.net/forum?id=HY2ruqdMt4

---

Title: Differentially Private and Scalable Estimation of the Network Principal Component

Abstract: Computing the principal component (PC) of the adjacency matrix of an undirected graph has several applications ranging from identifying key vertices for influence maximization and controlling diffusion processes, to discovering densely interconnected vertex subsets. However, many networked datasets are sensitive, which necessitates private computation of the PC for use in the aforementioned applications. Differential privacy has emerged as the gold standard in privacy-preserving data analysis, but existing DP algorithms for private PC suffer from low accuracy due to large noise injection or high complexity. Motivated by the large gap between the local and global sensitivities of the PC on real-graphs, we consider instance-specific mechanisms for privately computing the PC under edge-DP. These mechanisms guarantee privacy for all datasets, but provide good utility on ``well-behaved'' datasets by injecting smaller amounts of noise. More specifically, we consider the Propose-Test-Release (PTR) framework. Although computationally expensive in general, we design a novel approach for implementing a PTR variant in the same time as computation of a non-private PC, while offering good utility.
Our framework tests in a differentially-private manner whether a given graph is ``well-behaved'' or not, and then tests whether a noisy PC with small noise can be released privately.
As a consequence, this also leads to the first DP algorithm for the Densest-$k$-subgraph problem, a key graph mining primitive.
We run our method on diverse real-world networks, with the largest having 3 million vertices, and compare its utility to a pre-existing baseline based on the private power method (PPM).
Although PTR requires a slightly larger privacy budget, on average, it achieves a 180-fold improvement in runtime over PPM.

URL: https://openreview.net/forum?id=V0BjWbrAYC

---

Title: Cost-Aware Routing for Efficient Text-To-Image Generation

Abstract: Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
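A minimal sketch of the routing decision, assuming a learned per-prompt quality predictor and a per-candidate cost; the candidate names, costs, and predictor here are hypothetical.

```python
def route(prompt_features, candidates, quality_model, lam=0.05):
    """Pick the text-to-image candidate with the best predicted quality/cost trade-off.
    `candidates` maps a name to its cost (e.g., number of denoising steps);
    `quality_model(name, feats)` is a learned predictor of per-prompt quality."""
    scored = {name: quality_model(name, prompt_features) - lam * cost
              for name, cost in candidates.items()}
    return max(scored, key=scored.get)

# Hypothetical setup: three generators with different costs and a stub quality predictor.
candidates = {"distilled-4step": 4, "base-50step": 50, "large-100step": 100}
quality_model = lambda name, feats: {"distilled-4step": 0.6,
                                     "base-50step": 0.8,
                                     "large-100step": 0.85}[name] * feats["complexity"]
print(route({"complexity": 0.3}, candidates, quality_model))  # cheap model wins for simple prompts
```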

URL: https://openreview.net/forum?id=Jbe9AVsYS6

---

Title: BiSSL: Enhancing the Alignment Between Self-Supervised Pretraining and Downstream Fine-Tuning via Bilevel Optimization

Abstract: Models initialized from self-supervised pretraining may suffer from poor alignment with downstream tasks, limiting the extent to which subsequent fine-tuning can adapt relevant representations acquired during the pretraining phase. To mitigate this, we introduce BiSSL, a novel bilevel training framework that enhances the alignment of self-supervised pretrained models with downstream tasks by explicitly incorporating both the pretext and downstream tasks into a preparatory training stage prior to fine-tuning. BiSSL solves a bilevel optimization problem in which the lower level adheres to the self-supervised pretext task, while the upper level encourages the lower-level backbone to align with the downstream objective. The bilevel structure facilitates enhanced information sharing between the tasks, ultimately yielding a backbone model that is more aligned with the downstream task and provides a better initialization for subsequent fine-tuning. We propose a general training algorithm for BiSSL that is compatible with a broad range of pretext and downstream tasks. We demonstrate that our proposed framework significantly improves accuracy on the vast majority of a broad selection of image-domain downstream tasks, and that these gains are consistently retained across a wide range of experimental settings. In addition, exploratory alignment analyses further indicate that BiSSL enhances downstream alignment of pretrained representations.

URL: https://openreview.net/forum?id=GQAGlqOpyA

---

Title: MolMiner: Towards Controllable, 3D-Aware, Fragment-Based Molecular Design

Abstract: We introduce MolMiner, a fragment-based, geometry-aware, and order-agnostic autoregressive model for molecular design. MolMiner supports conditional generation of molecules over twelve properties, enabling flexible control across physicochemical and structural targets. Molecules are built via symmetry-aware fragment attachments, with 3D geometry dynamically updated during generation using forcefields. A probabilistic conditioning mechanism allows users to specify any subset of target properties while sampling the rest. MolMiner achieves calibrated conditional generation across most properties and offers competitive unconditional performance. We also propose improved benchmarking methods for both unconditional and conditional generation, including distributional comparisons via Wasserstein distance and calibration plots for property control. To our knowledge, this is the first model to unify dynamic geometry, symmetry handling, order-agnostic fragment-based generation, and high-dimensional multi-property conditioning.

URL: https://openreview.net/forum?id=saHRhzqibY

---

Title: Adversarial Attacks in Weight-Space Classifiers

Abstract: Implicit Neural Representations (INRs) have recently been garnering increasing interest in various research fields, mainly due to their ability to represent large, complex data in a compact and continuous manner. Past work further showed that numerous popular downstream tasks can be performed directly in the INR parameter space. Doing so can substantially reduce the computational resources required to process the represented data in their native domain. A major difficulty in using modern machine-learning approaches is their high susceptibility to adversarial attacks, which have been shown to greatly limit the reliability and applicability of such methods in a wide range of settings. In this work, we show that parameter-space models trained for classification are inherently robust to adversarial attacks, without the need for any robust training. To support our claims, we develop a novel suite of adversarial attacks targeting parameter-space classifiers, and furthermore analyze practical considerations of such attacks.

URL: https://openreview.net/forum?id=eOLybAlili

---

Title: Understanding Guidance Scale in Diffusion Models From a Geometric Perspective

Abstract: Conditional diffusion models have become a leading approach for generating condition-consistent samples, such as class-specific images. In practice, the guidance scale is a key hyperparameter in conditional diffusion models, used to adjust the strength of the guidance term. While empirical studies have demonstrated that appropriately choosing the scale can significantly enhance generation quality, the theoretical understanding of its role remains limited. In this work, we analyze the probabilistic guidance term from a geometric view under the linear manifold assumption and, based on this analysis, construct a geometric guidance model that enables tractable theoretical study. To address regularity issues arising from multi-modal data, we introduce a mollification technique that ensures well-posed dynamics. Our theoretical results show that increasing the guidance scale improves alignment with the target data manifold, thereby enhancing generation performance. We further extend our framework to nonlinear manifolds, and empirical results on real-world datasets validate the effectiveness of the proposed model and support our theoretical findings.
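The abstract does not pin its analysis to one guidance formulation; the most common instance in practice is classifier-free guidance, where the scale w weights the guidance term as sketched below.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance combination controlled by the guidance scale w:
    eps_guided = eps_uncond + w * (eps_cond - eps_uncond).
    w = 1 recovers the conditional model; larger w strengthens the guidance term."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u, eps_c = np.zeros(4), np.ones(4)
for w in (1.0, 3.0, 7.5):
    print(w, guided_noise(eps_u, eps_c, w))
```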

URL: https://openreview.net/forum?id=nfHimL6g8G

---

Title: ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

Abstract: Autoregressive and diffusion models have achieved remarkable progress in language modeling and visual generation, respectively. We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms for continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement: it amounts to applying a specially designed Skip-Causal Attention Mask (SCAM) to the standard diffusion transformer during training. During inference, the process iterates between diffusion denoising and autoregressive decoding, which can make full use of the KV-Cache. We validate the effectiveness of ACDiT on image, video, and text generation and show that ACDiT performs best among all autoregressive baselines under similar model scales on visual generation tasks. We also demonstrate that, benefiting from autoregressive modeling, pretrained ACDiT can be transferred to visual understanding tasks despite being trained with a generative objective. An analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and sheds light on new avenues for unified models.
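A simplified stand-in for the attention-masking idea (not the exact SCAM): a block-wise causal mask in which every token attends within its own block and to all earlier blocks.

```python
import numpy as np

def block_causal_mask(num_blocks, block_size):
    """Generic block-wise causal mask: tokens attend to all tokens in their own block
    and in earlier blocks, but not to later blocks (1 = attend, 0 = masked)."""
    n = num_blocks * block_size
    block_id = np.arange(n) // block_size
    return (block_id[:, None] >= block_id[None, :]).astype(np.int8)

print(block_causal_mask(num_blocks=3, block_size=2))
```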

URL: https://openreview.net/forum?id=OuFNXESoCO

---

Title: Leveraging Recursion for Efficient Federated Learning

Abstract: In federated learning, clients typically perform multiple local updates between successive communications with the parameter server to reduce communication overhead and improve overall training efficiency. However, local updates also lead to the “client-drift” problem under non-IID data, which prevents convergence to the exact optimal solution under heterogeneous data distributions. To ensure accurate convergence, existing federated-learning algorithms employ auxiliary variables to locally estimate the global gradient or the drift from the global gradient, which, however, also incurs extra communication and storage overhead. In this paper, we propose a new recursion-based federated-learning architecture that completely eliminates the need for auxiliary variables while ensuring accurate convergence under heterogeneous data distributions. This new federated-learning architecture, called FedRecu, can significantly reduce communication and storage overhead compared with existing federated-learning algorithms with accurate convergence guarantees. More importantly, this novel architecture enables FedRecu to employ much larger stepsizes than existing federated-learning algorithms, thereby leading to much faster convergence. We provide rigorous convergence analysis of FedRecu under both convex and nonconvex loss functions, in both the deterministic gradient case and the stochastic gradient case. In fact, our theoretical analysis shows that FedRecu ensures o(1/K) convergence to an accurate solution under general convex loss functions, which improves upon the existing achievable O(1/K) convergence rate for general convex loss functions, and which, to our knowledge, has not been reported in the literature except for some restricted convex cases with additional constraints. Numerical experiments on benchmark datasets confirm the effectiveness of the proposed algorithm.

URL: https://openreview.net/forum?id=cVGagKtiVr

---

Title: A Closer Look at Personalized Fine-Tuning in Heterogeneous Federated Learning

Abstract: Federated Learning (FL) enables decentralized, privacy-preserving model training but struggles to balance global generalization and local personalization due to non-identical data distributions across clients. Personalized Fine-Tuning (PFT), a popular post-hoc solution, fine-tunes the final global model locally but often overfits to skewed client distributions or fails under domain shifts. We propose adapting Linear Probing followed by full Fine-Tuning (LP-FT)—a principled centralized strategy for alleviating feature distortion—to the FL setting. Through systematic evaluation across seven datasets and six PFT variants, we demonstrate LP-FT’s superiority in balancing personalization and generalization. Our analysis uncovers federated feature distortion, a phenomenon where local fine-tuning destabilizes globally learned features, and theoretically characterizes how LP-FT mitigates this via phased parameter updates. We further establish conditions (e.g., partial feature overlap, covariate-concept shift) under which LP-FT outperforms standard fine-tuning, offering actionable guidelines for deploying robust personalization in FL.
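A minimal, centralized LP-FT sketch for reference (the paper adapts this recipe to the federated setting): stage one trains only the linear head on frozen features, stage two fine-tunes the whole network from that initialization.

```python
import torch
import torch.nn as nn

# Minimal LP-FT sketch (illustrative, not the paper's federated variant).
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
head = nn.Linear(16, 2)
x, y = torch.randn(128, 32), torch.randint(0, 2, (128,))
loss_fn = nn.CrossEntropyLoss()

def train(params, epochs, lr):
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(backbone(x)), y)
        loss.backward()
        opt.step()

# Stage 1: linear probing (backbone frozen).
for p in backbone.parameters():
    p.requires_grad_(False)
train(head.parameters(), epochs=20, lr=1e-2)

# Stage 2: full fine-tuning from the probed solution, typically with a smaller LR.
for p in backbone.parameters():
    p.requires_grad_(True)
train(list(backbone.parameters()) + list(head.parameters()), epochs=10, lr=1e-3)
```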

URL: https://openreview.net/forum?id=qDniKglANO

---

Title: Achieving Faster than O(1/t) Convergence in Convex Federated Learning

Abstract: This paper aims to achieve faster than O(1/t) convergence in federated learning for general convex loss functions. Under the independent and identically distributed (IID) condition, we show that accurate convergence to an optimal solution can be achieved in convex federated learning even when individual clients select stepsizes locally without any coordination. More importantly, this local stepsize strategy allows exploitation of the local geometry of individual clients' loss functions, and is shown to lead to faster convergence than the case where the same universal stepsize is used for all clients. Then, when the distribution is non-IID, we employ the sharing of gradients besides the global model parameter to ensure o(1/t) convergence to an optimal solution in convex federated learning. For both algorithms, we theoretically prove that stepsizes much larger than existing counterparts are allowed, which leads to much faster convergence in empirical evaluations. It is worth noting that, beyond providing a general framework for federated learning with drift correction, our second algorithm's achievement of o(1/t) convergence to the exact optimal solution under general convex loss functions has not been previously reported in the federated learning literature, except in certain restricted convex cases with additional constraints. We believe that this is significant because, even after incorporating momentum, existing first-order federated learning algorithms can only ensure O(1/t) convergence for general convex loss functions when no additional assumptions on heterogeneity are imposed.

URL: https://openreview.net/forum?id=Dae3jVdPod

---

Title: TABASCO: A Fast, Simplified Model for Molecular Generation with Improved Physical Quality

Abstract: State-of-the-art models for 3D molecular generation rely on strong inductive biases (SE(3) equivariance, permutation invariance, and graph message‑passing networks to capture local chemistry), yet the generated molecules still struggle with physical plausibility.
We introduce TABASCO, which relaxes these assumptions: the model uses a standard non-equivariant transformer architecture, treats the atoms in a molecule as a sequence, and does not explicitly model bonds. The absence of equivariant layers and message passing allows us to simplify the model architecture and scale data throughput.
On the GEOM‑Drugs and QM9 benchmarks TABASCO achieves state-of-the-art PoseBusters validity and delivers inference roughly 10x faster than the strongest baseline, while exhibiting emergent rotational equivariance without hard-coded symmetry.
Our work offers a blueprint for training minimalist, high‑throughput generative models suited to tasks such as structure‑ and pharmacophore‑based drug design.
We provide a link to our implementation at https://anonymous.4open.science/r/tabasco-EBC8/.

URL: https://openreview.net/forum?id=Kg6CSrbXl4

---

Title: CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

Abstract: Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers require expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). With CodePDE, we present a thorough evaluation of critical capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and test-time scaling. CodePDE shows that, with advanced inference-time algorithms and scaling strategies, LLMs can achieve strong performance across a range of representative PDE problems. We also identify novel insights into LLM-driven solver generation, such as trade-offs between solver reliability and sophistication, design principles for LLM-powered PDE solving agents, and failure modes of LLMs on hard tasks. These insights offer guidance for building more capable and reliable LLM-based scientific engines.
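A sketch of an inference-time generate-evaluate-refine loop in the spirit of CodePDE; `generate_code` and `run_and_score` are hypothetical stand-ins for an LLM call and a numerical evaluation harness, not a real API.

```python
def refine_solver(problem_desc, generate_code, run_and_score, rounds=3):
    """Generate solver code, evaluate it, and feed errors or tracebacks back to the
    generator for self-refinement; returns the best source found and its error."""
    prompt = f"Write a Python function solve(grid, t_end) for: {problem_desc}"
    best_src, best_err = None, float("inf")
    for _ in range(rounds):
        src = generate_code(prompt)
        err, failure = run_and_score(src)
        if failure is None and err < best_err:
            best_src, best_err = src, err
        feedback = failure if failure else f"relative error {err:.2e}; improve accuracy"
        prompt = f"{prompt}\n\nPrevious attempt:\n{src}\n\nFeedback: {feedback}"
    return best_src, best_err

# Stub demo with a fixed "LLM" output and a fixed scorer.
demo_src = "def solve(grid, t_end):\n    return grid"
best, err = refine_solver("1D heat equation, periodic BCs",
                          generate_code=lambda p: demo_src,
                          run_and_score=lambda s: (0.12, None))
print(err)
```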

URL: https://openreview.net/forum?id=eG3Qy5Oux6

---

Title: A Simple Scaling Model for Bootstrapped DQN

Abstract: We present a large-scale empirical study of Bootstrapped DQN (BDQN) and Randomized-Prior BDQN (RP-BDQN) in the DeepSea environment designed to isolate and parameterize exploration difficulty. Our primary contribution is a simple scaling model that accurately captures the probability of reward discovery as a function of task hardness and ensemble size. This model is parameterized by a method-dependent effectiveness factor, $\psi$. Under this framework, RP-BDQN demonstrates substantially higher effectiveness ($\psi \approx 0.87$) compared to BDQN ($\psi \approx 0.80$), enabling it to solve more challenging tasks. Our analysis reveals that this advantage stems from RP-BDQN's sustained ensemble diversity, which mitigates the posterior collapse observed in BDQN. Interestingly, the model's success, despite assuming member independence, suggests that complex ensemble interactions may be a secondary factor in overall performance. Furthermore, we show how systematic deviations from this simple model can be used to diagnose more subtle dynamics like cooperation and diversity saturation. These results offer practical guidance for ensemble configuration and propose a methodological framework for future studies of deep exploration.

URL: https://openreview.net/forum?id=OpfrMFep8B

---

Title: MIRA: Multi-view Information Retrieval with Adaptive Routing for Test-time Long-video Comprehension

Abstract: Foundational Multi-modal Large Language Models (MLLMs) have achieved rapid progress in handling complex tasks across diverse modalities. However, they still struggle to deliver satisfactory performance on Long-video Comprehension (LVC) tasks involving thousands of frames. Existing optimization strategies can be broadly categorized into LVC-specific fine-tuning, built-in token compression and training-free keyframe extraction, with the latter being most suitable for flexible deployment across various MLLMs. Unfortunately, current training-free approaches predominantly focus on query-frame relevance retrieval, overlooking other levels of visual information and the inherent heterogeneity of LVC tasks. In this work, we propose the $\textbf{M}$ulti-view $\textbf{I}$nformation $\textbf{R}$etrieval with $\textbf{A}$daptive Routing ($\textbf{MIRA}$) framework, which evaluates video frames using distinct metrics for relevance and causality, combines these scores to select a balanced pool of keyframes, and employs an adaptive feedback loop to tailor the retrieval process to different user queries, enabling more precise and sample-grained video comprehension. Extensive experiments demonstrate the advanced performance of our scheme across multiple challenging LVC benchmarks. For instance, integrating $\textbf{MIRA}$ with Qwen-2.5-VL yields performance gains of 3.5% to 13.1% on LVB, VideoMME and MLVU.

URL: https://openreview.net/forum?id=LZb2kzO8tu

---

Title: Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Abstract: Interpreting language models remains challenging due to the residual stream, which linearly mixes and duplicates information across adjacent layers. This leads to the under-detection of features that exist in the specific layer being analyzed. Existing work either analyzes neural representations at single layers, thereby overlooking this cross-layer superposition, or uses a cross-layer variant of the sparse autoencoder (SAE) for analysis. However, SAEs operate in continuous space, so there are no clear boundaries between neurons representing different concepts. We address these limitations by introducing a cross-layer vector quantized variational autoencoder (CLVQ-VAE), a novel framework that maps representations across layers through vector quantization. This collapses duplicated features in the residual stream, resulting in compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling during quantization with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Our quantitative and qualitative experiments on the ERASER-Movie, Jigsaw, and AGNews datasets show that, when combined with appropriate initialization, CLVQ-VAE discovers meaningful concepts that explain model predictions.
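An illustrative sketch of the two quantization ingredients named above, top-k temperature-based code assignment and EMA codebook updates; the exact formulation in the paper may differ.

```python
import numpy as np

def quantize_topk(z, codebook, k=5, temperature=0.5, rng=np.random):
    """Top-k, temperature-based code assignment: softmax over the negative distances
    of the k nearest codes, then sample one code index."""
    d = np.linalg.norm(codebook - z, axis=1)     # distances to all codes
    top = np.argsort(d)[:k]
    logits = -d[top] / temperature
    p = np.exp(logits - logits.max()); p /= p.sum()
    return rng.choice(top, p=p)

def ema_update(codebook, counts, sums, z, idx, decay=0.99):
    """Exponential-moving-average codebook update for the selected code."""
    counts[idx] = decay * counts[idx] + (1 - decay)
    sums[idx] = decay * sums[idx] + (1 - decay) * z
    codebook[idx] = sums[idx] / counts[idx]

D, K = 8, 16
codebook = np.random.randn(K, D)
counts, sums = np.ones(K), codebook.copy()
z = np.random.randn(D)
idx = quantize_topk(z, codebook, k=4)
ema_update(codebook, counts, sums, z, idx)
print(idx)
```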

URL: https://openreview.net/forum?id=xBVTqiHY6l

---

Title: Quantile $Q$-Learning: Revisiting Offline Extreme $Q$-Learning with Quantile Regression

Abstract: Offline reinforcement learning (RL) enables policy learning from fixed datasets without further environment interaction, making it particularly valuable in high-risk or costly domains. Extreme $Q$-Learning (XQL) is a recent offline RL method that models Bellman errors using the Extreme Value Theorem, yielding strong empirical performance. However, XQL and its stabilized variant MXQL suffer from notable limitations: both require extensive hyperparameter tuning specific to each dataset and domain, and also exhibit instability during training. To address these issues, we propose a principled method to estimate the temperature coefficient $\beta$ via quantile regression under mild assumptions. To further improve training stability, we introduce a value regularization technique with mild generalization, inspired by recent advances in constrained value learning. Experimental results demonstrate that the proposed algorithm achieves competitive or superior performance across a range of benchmark tasks, including D4RL and NeoRL2, while maintaining stable training dynamics and using a consistent set of hyperparameters across all datasets and domains.
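The quantile-regression ingredient can be illustrated with the standard pinball loss, whose minimizer is the target quantile; how the resulting quantile estimate is mapped to the temperature coefficient $\beta$ is specific to the paper and not reproduced here.

```python
import numpy as np

def pinball_loss(residual, tau):
    """Quantile-regression (pinball) loss: its minimizer over a constant prediction
    is the tau-quantile of the residual distribution."""
    return np.mean(np.maximum(tau * residual, (tau - 1) * residual))

# Toy illustration: sweep a scalar estimate and recover the 0.9-quantile of samples.
samples = np.random.exponential(size=10_000)
grid = np.linspace(0, 5, 501)
est = grid[np.argmin([pinball_loss(samples - g, tau=0.9) for g in grid])]
print(round(est, 2), round(float(np.quantile(samples, 0.9)), 2))  # should be close
```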

URL: https://openreview.net/forum?id=tBKznsUimN

---

Title: Making Video Models Adhere to User Intent with Minor Adjustments

Abstract: With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn increasing interest. A popular form of control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we modify the bounding boxes to lie in regions the model is familiar with. Surprisingly, we find that even small modifications can change the quality of generations significantly. To enable this optimization, we propose a smooth mask that makes the bounding box position differentiable and an attention-maximization objective used to alter the bounding boxes. We conduct thorough experiments, including a user study, to validate the effectiveness of our method.
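A minimal sketch of one way to realize a "smooth mask" that makes box coordinates differentiable (a product of sigmoids), with a stand-in attention map; the paper's exact mask and objective may differ.

```python
import torch

def soft_box_mask(h, w, box, sharpness=20.0):
    """Differentiable soft mask for a bounding box (x0, y0, x1, y1) in [0, 1] coords:
    a product of sigmoids that is ~1 inside the box and ~0 outside, so the box
    coordinates can be optimized with gradients from an attention objective."""
    ys = torch.linspace(0, 1, h).view(h, 1)
    xs = torch.linspace(0, 1, w).view(1, w)
    x0, y0, x1, y1 = box
    inside_x = torch.sigmoid(sharpness * (xs - x0)) * torch.sigmoid(sharpness * (x1 - xs))
    inside_y = torch.sigmoid(sharpness * (ys - y0)) * torch.sigmoid(sharpness * (y1 - ys))
    return inside_y * inside_x

box = torch.tensor([0.2, 0.3, 0.6, 0.8], requires_grad=True)
attn = torch.rand(32, 32)                          # stand-in for a model attention map
score = (soft_box_mask(32, 32, box) * attn).sum()  # attention mass inside the soft box
score.backward()                                   # gradients flow to the box coordinates
print(box.grad)
```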

URL: https://openreview.net/forum?id=Opvq2wfBR5

---

Title: TE-VLM: Transfer Entropy for Vision Language Model Distillation

Abstract: Vision-Language Models (VLMs) have demonstrated impressive performance across various multimodal tasks. However, deploying large teacher models in real-world applications is often infeasible due to their high computational cost. To address this, knowledge distillation has been widely explored to transfer knowledge from a large teacher model to a smaller student model. In this paper, we propose a novel distillation framework that integrates Transfer Entropy (TE) as a regularization term to enhance information flow from the teacher to the student model. TE quantifies the directional dependency between teacher and student embeddings, encouraging the student model to effectively capture structural knowledge from the teacher. To efficiently approximate TE in high-dimensional embedding spaces, we introduce two surrogate formulations based on cosine similarity: (1) TE via cosine similarity of directional changes in embeddings and (2) TE via concatenated differences across modalities. Our experiments, conducted on the MSCOCO 2014 and Flickr8k datasets using CLIP-based teacher and student architectures, demonstrate that incorporating TE significantly improves retrieval performance. Through extensive analysis, we show that TE-based regularization enhances the student model's ability to capture multimodal associations and maintain representational consistency. Our findings suggest that TE is an effective tool for improving knowledge transfer in VLM distillation, bridging the performance gap between compact student models and their larger teacher counterparts.
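A sketch of surrogate (1), interpreting "directional changes in embeddings" as differences between consecutive embeddings in a batch; the axis along which the paper takes these differences, and the projection to a shared dimension, are assumptions here.

```python
import torch
import torch.nn.functional as F

def te_surrogate(teacher_emb, student_emb):
    """Cosine similarity between the directional changes of consecutive teacher and
    student embeddings, used as a regularizer encouraging teacher-to-student
    information flow (a simplified stand-in for the paper's TE surrogate)."""
    dt = teacher_emb[1:] - teacher_emb[:-1]   # teacher deltas
    ds = student_emb[1:] - student_emb[:-1]   # student deltas
    return F.cosine_similarity(dt, ds, dim=-1).mean()

teacher = torch.randn(16, 512)
student = torch.randn(16, 256) @ torch.randn(256, 512)  # projected to a shared dim
reg = -te_surrogate(teacher, student)  # maximize similarity => subtract from the loss
print(reg.item())
```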

URL: https://openreview.net/forum?id=i6gyBJl7sK

---

Title: Training speedups via batching for geometric learning: an analysis of static and dynamic algorithms

Abstract: Graph neural networks (GNNs) have shown promising results in several domains such as materials science, chemistry, and the social sciences. GNN models often contain millions of parameters and, like other neural network (NN) models, are often fed only a fraction of the graphs that make up the training dataset in batches to update model parameters. The effect of batching algorithms on training time and model performance has been thoroughly explored for NNs but not yet for GNNs. We analyze two batching algorithms for graph-based models, namely static and dynamic batching, on two datasets: the QM9 dataset of small molecules and the AFLOW materials database. Our experiments show that changing the batching algorithm can provide up to a 2.7x speedup, but the fastest algorithm depends on the data, model, batch size, hardware, and number of training steps run. Experiments show that, for a select number of combinations of batch size, dataset, and model, significant differences in model learning metrics are observed between static and dynamic batching algorithms.
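For contrast with static batching (a fixed number of graphs per batch), here is a minimal dynamic-batching sketch that packs graphs up to a node budget; the budget and graph sizes are hypothetical.

```python
def dynamic_batches(graphs, max_nodes=1024):
    """Dynamic batching sketch: pack graphs into a batch until a node budget is hit,
    in contrast to static batching, which always uses a fixed number of graphs."""
    batch, n_nodes, out = [], 0, []
    for g in graphs:                      # g = number of nodes in the graph
        if batch and n_nodes + g > max_nodes:
            out.append(batch)
            batch, n_nodes = [], 0
        batch.append(g)
        n_nodes += g
    if batch:
        out.append(batch)
    return out

sizes = [120, 300, 640, 90, 500, 870, 60]   # hypothetical per-graph node counts
print(dynamic_batches(sizes, max_nodes=1024))
```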

URL: https://openreview.net/forum?id=v8rC6EEUep

---

Title: Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

Abstract: Active learning (AL) is a principled strategy to reduce annotation cost in data-hungry deep learning. However, existing AL algorithms focus almost exclusively on unimodal data, overlooking the substantial annotation burden in multimodal learning. We introduce the first framework for $\textit{multimodal active learning with unaligned data}$, where the learner must actively acquire cross-modal alignments rather than labels on pre-aligned pairs. This setting captures the practical bottleneck in modern multimodal pipelines, where unimodal features are easy to obtain but high-quality alignment is costly. We develop a new algorithm that combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. Extensive experiments on benchmark datasets demonstrate that our approach consistently reduces multimodal annotation cost while preserving performance; for instance, on the ColorSwap dataset it cuts annotation requirements by up to 40% without loss in accuracy.

URL: https://openreview.net/forum?id=xMLajoct78

---

Title: Discrete Interpolants: Unifying the Masked Generative and Discrete Diffusion Models

Abstract: In generative modeling, two paradigms have gained traction in various applications: next-set-prediction-based Masked Generative Models and next-noise-prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this work, we propose using discrete-state models to connect them and explore their scalability in the vision domain. First, we conduct an in-depth analysis in a unified design space across the two types of models, including timestep independence, noise schedule, temperature, guidance strength, etc., in a scalable manner. Second, from the lens of generative models, we recast typical discriminative tasks, e.g., image segmentation, as an unmasking process from [MASK] tokens on a discrete-state model. This enables us to perform various sampling processes, including flexible conditional sampling, by training only once to model the joint distribution. All the aforementioned explorations lead to our framework named Discrete Interpolants, which achieves state-of-the-art or competitive performance compared to previous discrete-state-based methods on various benchmarks, including ImageNet256, MS COCO, CC12M, as well as the video datasets FaceForensics and DMLab. In summary, by leveraging [MASK] in discrete-state models, we can bridge Masked Generative and Non-autoregressive Diffusion models, as well as generative and discriminative tasks. Our code will be released.
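A toy sketch of the [MASK]-based corruption process that underlies this family of models, assuming a simple linear masking schedule; the paper studies the schedule (among other design choices) rather than fixing it.

```python
import numpy as np

def mask_tokens(tokens, t, mask_id=-1, rng=np.random):
    """Discrete corruption at time t in [0, 1]: each token is independently replaced
    by [MASK] with probability given by a (here linear) noise schedule; t=0 is clean
    data, t=1 is fully masked."""
    keep = rng.random(tokens.shape) >= t        # linear schedule: mask prob = t
    return np.where(keep, tokens, mask_id)

tokens = np.arange(12)
print(mask_tokens(tokens, t=0.5))
```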

URL: https://openreview.net/forum?id=CkAHiOUvOx

---

Title: From Euclidean to Graph-Structured Data: A Survey of Collaborative Learning

Abstract: The conventional approach to machine learning, that is, collecting data, training models, and performing inference in a single location, faces fundamental limitations, including scalability and privacy, that restrict its applicability. To address these challenges, recent research has explored collaborative learning approaches, including federated learning and decentralized learning, where individual agents perform training and inference locally, with limited collaboration.
Most collaborative learning research focuses on Euclidean data with regular, grid-like structure (e.g., images, text). However, these approaches fail to capture the relational patterns in many real-world applications, best represented by graphs. Learning on graphs relies on message-passing mechanisms to propagate information between connected nodes, making it conceptually well-suited for collaborative environments where agents must exchange information. Yet, the opportunities and challenges of learning on graph-structured data in collaborative settings remain largely underexplored.
This survey provides a comprehensive investigation of collaborative learning from Euclidean to graph-structured data, aiming to consolidate this emerging field. We begin by reviewing its foundational principles for Euclidean data, organizing them along three core dimensions: learning effectiveness, efficiency, and privacy preservation. We then extend the discussion to graph-structured data, introducing a taxonomy of graph distribution scenarios, characterizing associated statistical heterogeneities, and developing standardized problem formulations and algorithmic frameworks. Finally, we systematically identify open challenges and promising research directions.
By bridging established techniques for Euclidean data with emerging methods for graph learning, our survey provides researchers and practitioners with a well-structured foundation of collaborative learning, supporting further development across a wide range of scientific and industrial fields.

URL: https://openreview.net/forum?id=vj9l8AjLT6

---

Title: Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

Abstract: Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.

URL: https://openreview.net/forum?id=kndKGnE0tb

---

Title: Convergence Bound and Critical Batch Size of Muon Optimizer

Abstract: Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training.
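For context, a simplified sketch of a Muon-style update on a weight matrix: momentum is orthogonalized with a Newton-Schulz iteration before being applied, here with a basic cubic iteration and decoupled weight decay rather than the tuned quintic iteration and Nesterov momentum used in the official implementation.

```python
import torch

def newton_schulz_orth(G, steps=5):
    """Cubic Newton-Schulz iteration that pushes a matrix toward having orthonormal
    rows/columns (singular values toward 1). Normalizing by the Frobenius norm first
    keeps the iteration inside its convergence region."""
    X = G / (G.norm() + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, momentum, grad, lr=0.02, beta=0.95, weight_decay=0.01):
    """Simplified Muon-style update: accumulate momentum, orthogonalize it, then apply
    the update with decoupled weight decay (a sketch, not the exact algorithm)."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orth(momentum)
    W.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)
    return W, momentum

W = torch.randn(64, 32)
m = torch.zeros_like(W)
g = torch.randn_like(W)
W, m = muon_step(W, m, g)
print(W.shape)
```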

URL: https://openreview.net/forum?id=31oMHlGSmV

---

Title: Friends in Unexpected Places: Enhancing Local Fairness in Federated Learning through Clustering

Abstract: Federated Learning (FL) has been a pivotal paradigm for collaborative training of machine learning models across distributed datasets. In heterogeneous settings, it has been observed that a single shared FL model can lead to low local accuracy, motivating personalized FL algorithms. In parallel, fair FL algorithms have been proposed to enforce group fairness on the global models. Again, in heterogeneous settings, global and local fairness do not necessarily align, motivating the recent literature on locally fair FL. In this paper, we propose new FL algorithms for heterogeneous settings, spanning the space between personalized and locally fair FL. Building on existing clustering-based personalized FL methods, we incorporate a new fairness metric into cluster assignment, enabling a tunable balance between local accuracy and fairness. Our methods match or exceed the performance of existing locally fair FL approaches, without explicit fairness intervention. To support this finding, we demonstrate (numerically and analytically) that personalization alone can improve local fairness and argue that our methods exploit this alignment when present.

URL: https://openreview.net/forum?id=ExRPvGFyNg

---

Title: Frictionless Hamiltonian Descent with Discretization and Parallel Optimization

Abstract: Frictionless Hamiltonian Descent is a recently proposed optimization method that leverages a fundamental principle from classical mechanics. The algorithm is based on energy conservation of the Hamiltonian flow, with the kinetic energy reset at each iteration, and is shown to be a descent method. However, the idealized frictionless Hamiltonian Descent requires access to an oracle for the Hamiltonian flow, and exactly implementing the Hamiltonian flow becomes elusive when the underlying function is not quadratic. Motivated by the considerable popularity of Hamiltonian dynamics in sampling, where a geometric numerical integrator is used to simulate idealized Hamiltonian Monte Carlo, we consider Hamiltonian Descent with two kinds of integrators, which results in new optimization dynamics. Moreover, we extend the original framework by introducing various forms of kinetic energy. This expansion yields a broad class of optimization algorithms and provides a fresh perspective on algorithm design. We further propose a novel parallelization technique for the inherently sequential updates of the proposed optimization algorithms, in which gradients at different points are computed simultaneously. The parallelization technique improves the actual running time by 2-3x in practice for multinomial logistic regression across a range of datasets when 4 GPUs are used, compared to approximating the Hamiltonian flow in the standard sequential fashion on a single GPU.
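A toy sketch of the frictionless scheme with a leapfrog integrator and Euclidean kinetic energy, resetting the momentum at every outer iteration; the paper's other integrator and kinetic-energy choices are not shown.

```python
import numpy as np

def hamiltonian_descent(grad_f, x0, step=0.1, leapfrog_steps=10, iters=50):
    """Sketch of frictionless Hamiltonian descent: simulate a short Hamiltonian
    trajectory with leapfrog under K(p) = ||p||^2 / 2, then reset the momentum
    (kinetic energy) to zero at every outer iteration so f can only decrease."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        p = np.zeros_like(x)              # kinetic-energy reset
        p -= 0.5 * step * grad_f(x)       # half step in momentum
        for _ in range(leapfrog_steps - 1):
            x += step * p                 # full step in position
            p -= step * grad_f(x)         # full step in momentum
        x += step * p
        p -= 0.5 * step * grad_f(x)
    return x

grad = lambda x: 2 * x                    # f(x) = ||x||^2
print(hamiltonian_descent(grad, x0=[3.0, -2.0]))
```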

URL: https://openreview.net/forum?id=114IOQ3JWe

---

Title: Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey

Abstract: Scientific idea generation lies at the heart of scientific discovery and has driven human progress, whether by solving unsolved problems or proposing novel hypotheses to explain unknown phenomena. Unlike standard scientific reasoning or general creative generation, idea generation in science is a multi-objective and open-ended task, where the novelty of a contribution is as essential as its empirical soundness. Large language models (LLMs) have recently emerged as promising generators of scientific ideas, capable of producing coherent and factual outputs with surprising intuition and acceptable reasoning, yet their creative capacity remains inconsistent and poorly understood. This survey provides a structured synthesis of methods for LLM-driven scientific ideation, examining how different approaches balance creativity with scientific soundness. We categorize existing methods into five complementary families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. To interpret their contributions, we employ two complementary frameworks: Boden's taxonomy of Combinatorial, Exploratory and Transformational creativity to characterize the level of ideas each family is expected to generate, and Rhodes' 4Ps framework (Person, Process, Press, and Product) to locate the aspect or source of creativity that each method emphasizes. By aligning methodological advances with creativity frameworks, this survey clarifies the state of the field and outlines key directions toward reliable, systematic, and transformative applications of LLMs in scientific discovery.

URL: https://openreview.net/forum?id=9lWojZKMjt

---

Title: A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective

Abstract: Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks by inadvertently reproducing exact training samples. While prior work focuses on data augmentation for memorization mitigation, little is known about which individual samples contribute the most to memorization. In this paper, we present the first data-centric study of memorization dynamics in tabular diffusion models. We begin by quantifying memorization for each real sample based on how many generated samples are flagged as its memorized replicas, using a relative distance ratio metric. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples disproportionately contributes to leakage, a finding further validated through sample-removal experiments. To better understand this effect, we divide real samples into the top- and non-top-memorized groups (tags) and analyze their training-time behavior differences. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC) across groups. We find that memorized samples tend to be memorized slightly earlier and show significantly stronger memorization signals in early training stages. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method. DynamicCut (a) ranks real samples by their epoch-wise memorization intensity, (b) prunes a tunable top fraction, and (c) retrains the model on the filtered dataset. Across multiple benchmark tabular datasets and tabular diffusion models, DynamicCut reduces memorization ratios with negligible impact on data diversity and downstream task performance, and complements existing data augmentation methods for further memorization mitigation. Furthermore, DynamicCut has transferability across different generative models for memorization sample tagging, i.e., high-ranked samples identified from one model (e.g., a diffusion model) are also effective in reducing memorization when removed from other generative models such as GANs and VAEs.
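A sketch of the pruning stage of a DynamicCut-style pipeline, assuming per-sample memorization intensities have already been computed during training; the intensities below are random placeholders.

```python
import numpy as np

def dynamic_cut(memorization_auc, prune_frac=0.05):
    """Rank real samples by their per-epoch memorization intensity (summarized here as
    one AUC value per sample) and return the indices to keep after pruning the top
    fraction; the model is then retrained on the filtered dataset."""
    order = np.argsort(memorization_auc)[::-1]          # most-memorized first
    n_prune = int(len(order) * prune_frac)
    return np.sort(order[n_prune:])                     # keep the rest, then retrain

auc = np.random.rand(1000)                              # placeholder intensities
keep = dynamic_cut(auc, prune_frac=0.05)
print(len(keep))                                        # 950 samples kept for retraining
```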

URL: https://openreview.net/forum?id=p2n88DfaXB

---

Title: Differentially-private and plausible counterfactuals

Abstract: Counterfactual explanations are particularly appealing in high-stakes domains such as finance and hiring, as they provide affected users with suggestions on how to alter their profiles to receive a favorable outcome. However, existing methods are characterized by a privacy-quality trade-off. More precisely, as highlighted in recent works, instance-based approaches generate plausible counterfactuals but are vulnerable to privacy attacks, while perturbation-based methods offer better privacy at the cost of lower explanation quality. In this paper, we propose to solve this dilemma by introducing a diverse set of differentially-private mechanisms for generating counterfactuals, providing high resistance against privacy attacks while maintaining high utility. These mechanisms can be integrated at different stages of the counterfactual generation pipeline (i.e., pre-processing, in-processing or post-processing), thereby offering the model provider maximal flexibility during design. We have performed an empirical evaluation of the proposed approaches on a wide range of datasets and models to evaluate their effect on the privacy and utility of the generated counterfactuals. Overall, the results obtained demonstrate that in-processing methods significantly reduce the success rate of privacy attacks while moderately impacting the quality of the counterfactuals generated. In contrast, pre-processing and post-processing mechanisms achieve a higher level of privacy but at a greater cost in terms of utility, thus being more suitable for scenarios in which privacy is paramount.

URL: https://openreview.net/forum?id=8szbYJ2DJi

---

Title: One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Abstract: Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG opens new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. First, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Second, we present a universal attack that, for any potential user query, influences the response to cause a denial of service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, while demonstrating robustness to black-box attacks in the universal setting.

URL: https://openreview.net/forum?id=CLkjUidlYg

---

Title: Don’t Let It Hallucinate: Premise Verification via Retrieval-Augmented Logical Reasoning

Abstract: Large language models (LLMs) have shown substantial capacity for generating fluent, contextually appropriate responses. However, they can produce hallucinated outputs, especially when a user query includes one or more false premises—claims that contradict established facts. Such premises can mislead LLMs into offering fabricated or misleading details. Existing approaches include pretraining, fine-tuning, and inference-time techniques that often rely on access to logits or address hallucinations after they occur. These methods tend to be computationally expensive, require extensive training data, or lack proactive mechanisms to prevent hallucination before generation, limiting their efficiency in real-time applications. We propose a retrieval-based framework that identifies and addresses false premises before generation. Our method first transforms a user’s query into a logical representation, then applies retrieval-augmented generation (RAG) to assess the validity of each premise using factual sources. Finally, we incorporate the verification results into the LLM’s prompt to maintain factual consistency in the final output. Experiments show that this approach effectively reduces hallucinations, improves factual accuracy, and does not require access to model logits or large-scale fine-tuning.
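
As a rough illustration of the final step described above (prompt construction only; the premise extraction and RAG-based verification are assumed to happen elsewhere, and all names here are hypothetical):

def build_verified_prompt(query: str, premise_verdicts: dict[str, bool]) -> str:
    """Prepend premise-verification results to the user query so the LLM can
    avoid answering on top of a false premise. Verdicts are assumed to come
    from a separate RAG step that checks each premise against retrieved facts."""
    lines = ["Premise check (from retrieved sources):"]
    for premise, is_true in premise_verdicts.items():
        status = "SUPPORTED" if is_true else "CONTRADICTED"
        lines.append(f"- {premise} -> {status}")
    lines.append("If any premise is CONTRADICTED, correct it before answering.")
    lines.append(f"\nUser query: {query}")
    return "\n".join(lines)

prompt = build_verified_prompt(
    "Why did Einstein win the Nobel Prize for relativity?",
    {"Einstein won the Nobel Prize for relativity": False},
)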

URL: https://openreview.net/forum?id=BDxStRGWba

---

Title: COPA: Comparing the incomparable in multi-objective model evaluation

Abstract: In machine learning (ML), we often need to choose one among hundreds of trained ML models at hand, based on various objectives such as accuracy, robustness, fairness or scalability. However, it is often unclear how to compare, aggregate and, ultimately, trade-off these objectives, making it a time-consuming task that requires expert knowledge, as objectives may be measured in different units and scales. In this work, we investigate how objectives can be automatically normalized and aggregated to systematically help the user navigate their Pareto front. To this end, we make incomparable objectives comparable using their cumulative functions, approximated by their relative rankings. As a result, our proposed approach, COPA, can aggregate them while matching user-specific preferences, allowing practitioners to meaningfully navigate and search for models in the Pareto front. We demonstrate the potential impact of COPA in both model selection and benchmarking tasks across diverse ML areas such as fair ML, domain generalization, AutoML and foundation models, where classical ways to normalize and aggregate objectives fall short.
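
A minimal sketch in the spirit of this idea, assuming a matrix of raw objective values and user-supplied preference weights (the actual COPA procedure may differ in details): each objective is mapped to its empirical-CDF value via relative rankings, which makes the columns comparable before a weighted aggregation.

import numpy as np

def copa_style_score(objectives: np.ndarray, weights: np.ndarray,
                     higher_is_better: np.ndarray) -> np.ndarray:
    """Make objectives comparable via their empirical CDFs (relative rankings),
    then aggregate with user preferences. `objectives` is (n_models, n_objectives)."""
    n, m = objectives.shape
    scores = np.empty_like(objectives, dtype=float)
    for j in range(m):
        col = objectives[:, j] if higher_is_better[j] else -objectives[:, j]
        # empirical CDF value of each model for this objective (rank / n)
        scores[:, j] = (np.argsort(np.argsort(col)) + 1) / n
    return scores @ (weights / weights.sum())

obj = np.array([[0.91, 120.0],   # accuracy (up), latency in ms (down)
                [0.88,  45.0],
                [0.93, 300.0]])
agg = copa_style_score(obj, weights=np.array([0.7, 0.3]),
                       higher_is_better=np.array([True, False]))
best = int(np.argmax(agg))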

URL: https://openreview.net/forum?id=NY931v5zc5

---

Title: Unlearning in Diffusion models under Data Constraints: A Variational Inference Approach

Abstract: For a responsible and safe deployment of diffusion models in various domains, regulating the generated outputs from these models is desirable because such models could generate undesired, violent, and obscene outputs. To tackle this problem, recent works use machine unlearning methodology to forget training data points containing these undesired features from pre-trained generative models. However, these methods proved to be ineffective in data-constrained settings where the whole training dataset is inaccessible. Thus, the principal objective of this work is to propose a machine unlearning methodology that can prevent the generation of outputs containing undesired features from a pre-trained diffusion model in such a data-constrained setting. Our proposed method, termed as Variational Diffusion Unlearning (**VDU**), is a computationally efficient method that only requires access to a subset of training data containing undesired features. Our approach is inspired by the variational inference framework with the objective of minimizing a loss function consisting of two terms: *plasticity inducer* and *stability regularizer*. *Plasticity inducer* reduces the log-likelihood of the undesired training data points, while the *stability regularizer*, essential for preventing loss of image generation quality, regularizes the model in parameter space. We validate the effectiveness of our method through comprehensive experiments for both class unlearning and feature unlearning. For class unlearning, we unlearn some user-identified classes from MNIST, CIFAR-10, and tinyImageNet datasets from a pre-trained unconditional denoising diffusion probabilistic model (DDPM). Similarly, for feature unlearning, we unlearn the generation of certain high-level features from a pre-trained Stable Diffusion model.
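
As a rough illustration of the two loss terms (notation assumed here, not taken from the paper): with forget set $\mathcal{D}_f$, pretrained parameters $\theta_0$, and weight $\lambda$, a loss of the shape

$\mathcal{L}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D}_f}\big[\log p_\theta(x)\big] \;+\; \tfrac{\lambda}{2}\,\lVert \theta - \theta_0 \rVert_2^2$

pushes down the log-likelihood of the undesired data (plasticity inducer) while keeping the parameters close to the pretrained model (stability regularizer).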

URL: https://openreview.net/forum?id=mAHRgieyOV

---

Title: An Efficient Subset Selection Strategy Using Text-Guided Data Attribution to Mitigate Simplicity Bias

Abstract: The effectiveness of deep learning models heavily relies on the quality and diversity of their training data. However, datasets collected from different sources often introduce simplicity biases, where a models rely on easily learnable but non-predictive (spurious) features for its predictions. While existing debiasing techniques focus on model robustness, they leave the data untouched. However, as data becomes increasingly valuable, identifying and mitigating bias directly at the data level has become increasingly important. Recently, data attribution has emerged as a promising tool for uncovering issues in training data, yet its vulnerability to simplicity bias has received limited attention. In this work, we propose a novel data deletion framework that combines Neural Tangent Kernel (NTK)-based data attribution with textual descriptions of bias to identify and remove training samples that do not significantly affect model performance. We first demonstrate that NTK-based data attribution methods can themselves be influenced by spurious features. Subsequently, to mitigate this, we use available metadata or, when unavailable, a vision-language model to annotate a small validation set and extract a textual description of the bias. Based on this description and the attribution score, we identify the subset of training data that are semantically aligned with the spurious feature and affect the generalization of the model. Removing these samples from the training dataset and training model on the new subset improves the average and worst-group accuracy of the model, outperforming existing attribution-based baselines.

URL: https://openreview.net/forum?id=zZ5YundT95

---

Title: Deep Research Agents: A Systematic Examination And Roadmap

Abstract: The rapid progress of Large Language Models (LLMs) has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API-based retrieval methods with browser-based exploration. We then examine modular tool-use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematise existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single-agent and multi-agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research.

URL: https://openreview.net/forum?id=FCRtTkjOvT

---

Title: Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Abstract: AI Scientist systems are autonomous agents capable of conducting scientific research. Understanding their current capabilities and risks is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: Given the baseline paper from the human mentor, it analyzes its limitations, formulates novel hypotheses for improvement, validates them through rigorous experimentation, and writes a paper with the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. Through our experiments, the Jr. AI Scientist successfully generated new research papers that build upon real NeurIPS, IJCV, and ICLR works by proposing and implementing novel algorithms. For evaluation, we conducted automated assessments using AI Reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven scientific contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher review scores than existing fully automated systems. Nevertheless, we identify important limitations from both the author evaluation and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report various risks identified during development. We believe this study clarifies the current role and limitations of AI Scientist systems, offering insights into the areas that still require human expertise and the risks that may emerge as these systems evolve.

URL: https://openreview.net/forum?id=OeV062d8Sw

---

Title: Ternary Momentum For Quantized Training

Abstract: Quantization enables efficient inference on resource-limited devices, yet training still depends on high-precision gradients and optimizer states.
We address this gap by introducing stochastic ternary momentum, a fully quantized optimizer that operates on quantized parameters and ternary gradient information, and maintains ternary momentum states for stable and memory-efficient quantized optimization.
Our method replaces deterministic and full-precision updates with integer-valued updates driven by stochastic sampling, ensuring that expected updates match standard momentum while maintaining strict memory constraints.
It eliminates re-quantization overhead and preserves quantization consistency throughout training.
We establish theoretical convergence guarantees of our ternary momentum method for convex objectives over bounded integer domains and for non-convex objectives over unbounded integer domains. Experiments on vision and language tasks demonstrate that our approach retains strong performance while reducing optimizer memory by 95\% compared to full-precision, advancing the feasibility of fully quantized training.
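
For intuition, the unbiased stochastic ternarization underlying such an optimizer can be sketched as follows (a generic construction, not necessarily the paper's exact scheme):

import numpy as np

def stochastic_ternary(x: np.ndarray, rng=None) -> tuple[np.ndarray, float]:
    """Quantize x to scale * {-1, 0, +1} with stochastic rounding so that the
    expected value of the quantized tensor equals x (unbiased)."""
    rng = np.random.default_rng() if rng is None else rng
    scale = np.max(np.abs(x))
    if scale == 0:
        return np.zeros_like(x, dtype=np.int8), 0.0
    p = np.abs(x) / scale                       # probability of taking magnitude 1
    t = np.sign(x) * (rng.random(x.shape) < p)  # values in {-1, 0, +1}
    return t.astype(np.int8), float(scale)

# E[scale * t] == x, so momentum updates built from such samples match the
# full-precision momentum in expectation.
m = np.array([0.4, -0.1, 0.0, 0.25])
t, s = stochastic_ternary(m)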

URL: https://openreview.net/forum?id=A3mVmPlahU

---

Title: Depth as Modulation in Weight-Sharing Transformers

Abstract: Weight-sharing architectures provide an efficient design for Transformers. However, their reliance on a single transformation can limit the model's capacity for iterative representation refinement, a process that requires functional specialization across layers. We address this limitation by representing depth through layer-wise perturbations, creating a path toward models that are both parameter-efficient and performant. Our approach iteratively applies a shared block, and we introduce two distinct strategies to perturb its Multi-Head Self-Attention (MHSA) component with each application: a comprehensive QKOV-LoRA and a more parameter-efficient, QK/OV-circuit. The effectiveness of these strategies is validated on vision and language benchmarks, where our models demonstrate favorable performance against layer-sharing counterparts. Our results suggest that layer-wise perturbing a shared structure is an effective principle for developing capable and efficient Transformers.
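
A minimal, hypothetical sketch of the idea, with a plain linear layer standing in for one MHSA projection of the shared block (the paper's QKOV-LoRA variant would apply such low-rank, depth-indexed deltas to the query, key, value, and output projections):

import torch
import torch.nn as nn

class SharedBlockWithDepthLoRA(nn.Module):
    """A single shared linear map, perturbed by a per-depth low-rank delta."""
    def __init__(self, dim: int, depth: int, rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(dim, dim)                          # reused at every depth
        self.lora_A = nn.Parameter(torch.randn(depth, rank, dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(depth, dim, rank))  # zero-init: start as pure sharing

    def forward(self, x: torch.Tensor, d: int) -> torch.Tensor:
        # x: (batch, tokens, dim); d indexes the current "layer"
        delta = x @ self.lora_A[d].t() @ self.lora_B[d].t()        # depth-specific perturbation
        return self.shared(x) + delta

block = SharedBlockWithDepthLoRA(dim=64, depth=6)
h = torch.randn(2, 10, 64)
for d in range(6):          # iterate the same block, modulated by depth
    h = block(h, d)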

URL: https://openreview.net/forum?id=wm9jRInse3

---

Title: How Private is Your Attention? Bridging Privacy with In-Context Learning

Abstract: In-context learning (ICL)—the ability of transformer-based models to perform new tasks from examples provided at inference time—has emerged as a hallmark of modern language models. While recent works have investigated the mechanisms underlying ICL, its feasibility under formal privacy constraints remains largely unexplored. In this paper, we propose a differentially private pretraining algorithm for linear attention heads and present the first theoretical analysis of the privacy–accuracy trade-off for ICL in linear regression. Our results characterize the fundamental tension between optimization and privacy-induced noise, formally capturing behaviors observed in private training via iterative methods. Additionally, we show that our method is robust to adversarial perturbations of training prompts, unlike standard ridge regression. All theoretical findings are supported by extensive simulations across diverse settings.

URL: https://openreview.net/forum?id=M2qsrIba0L

---

Title: Hierarchically Metric-Structured Knowledge Graph Embeddings

Abstract: In the vast landscape of big data, there is an important challenge in understanding data and structuring it in a suitable format. Knowledge graphs are considered a sophisticated solution to organize and infer data and knowledge, offering a structured framework that transcends disciplinary boundaries in medicine, culture, biology, social networks, music, and beyond. Despite their informativeness, these systems are typically incomplete and their intrinsic structure unknown, whereas existing methodologies for predicting missing facts and characterizing their structure face scalability and interpretability issues. Addressing this gap, we introduce a new latent feature model that leverages the prominent RESCAL framework to account for degree heterogeneity and multiscale structure, and that enables scalable inference through an approximation of the full likelihood over all triplets, circumventing negative-sampling inference strategies. This not only enhances computational efficiency but also provides deeper insights into the intrinsic multiscale structure of knowledge graphs, thereby advancing the interpretability of predictive models and paving the way for a more comprehensive understanding of complex information networks.
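
For reference, the RESCAL scoring function that the model builds on assigns a triplet (subject $s$, relation $r$, object $o$), with entity embeddings $a_s, a_o \in \mathbb{R}^d$ and relation matrix $R_r \in \mathbb{R}^{d \times d}$, the bilinear score

$f(s, r, o) \;=\; a_s^{\top} R_r \, a_o$

The proposed model extends this with degree heterogeneity and multiscale structure, and replaces negative sampling with an approximation of the full likelihood over all triplets.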

URL: https://openreview.net/forum?id=a2CrD4rkPx

---

Title: Compromising Honesty and Harmlessness in Language Models via Covert Deception Attacks

Abstract: Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. Additionally, research on AI alignment has made significant advancements in training models to refuse to generate misleading or toxic content. As a result, LLMs have generally become honest and harmless. In this study, we introduce “deception attacks” that undermine both of these traits while keeping models seemingly trustworthy, revealing a vulnerability that, if exploited, could have serious real-world consequences. We present fine-tuning methods that cause models to selectively deceive users on targeted topics while remaining accurate on others, thereby maintaining high user trust. Through a series of experiments, we show that such targeted deception is effective even in high-stakes domains or ideologically charged subjects. In addition, we find that deceptive fine-tuning often compromises other safety properties: deceptive models are more likely to produce toxic content, including hate speech and stereotypes. Finally, since self-consistent deception across turns gives users few cues to detect manipulation and thus can preserve trust, we test for multi-turn deception and observe mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against covert deception attacks is critical.

URL: https://openreview.net/forum?id=2KPIDIeLE2

---

Title: Breaking Habits: On the Role of the Advantage Function in Learning Causal State Representations

Abstract: Recent work has shown that reinforcement learning agents can develop policies that exploit spurious correlations between rewards and observations. This phenomenon, known as policy confounding, arises because the agent's policy influences both past and future observation variables, creating a feedback loop that can hinder the agent's ability to generalize beyond its usual trajectories. In this paper, we show that the advantage function, commonly used in policy gradient methods, not only reduces the variance of gradient estimates but also mitigates the effects of policy confounding. By adjusting action values relative to the state representation, the advantage function downweights state-action pairs that are more likely under the current policy, breaking spurious correlations and encouraging the agent to focus on causal factors. We provide both analytical and empirical evidence demonstrating that training with the advantage function leads to improved out-of-trajectory performance.
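
For reference, the advantage function in question is the standard quantity

$A^{\pi}(s, a) \;=\; Q^{\pi}(s, a) - V^{\pi}(s),$

i.e., how much better action $a$ is in state $s$ than the policy's average behavior; policy-gradient methods typically weight $\nabla_\theta \log \pi_\theta(a \mid s)$ by an estimate of this quantity rather than by the raw return.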

URL: https://openreview.net/forum?id=PnsjDKsdyf

---

Title: Attention-Based Reward Shaping for Sparse and Delayed Rewards

Abstract: Sparse and delayed reward functions pose a significant obstacle for real-world Reinforcement Learning (RL) applications. In this work, we propose Attention-based REward Shaping (ARES), a general and robust algorithm which uses a transformer's attention mechanism to generate shaped rewards and create a dense reward function for any environment. ARES requires a set of episodes and their final returns as input. It can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions. ARES is compatible with any RL algorithm and can handle any level of reward sparsity. In our experiments, we focus on the most challenging case where rewards are fully delayed until the end of each episode. We evaluate ARES across a diverse range of environments, widely used RL algorithms, and baseline methods to assess the effectiveness of the shaped rewards it produces. Our results show that ARES can indeed improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable, though there remain some cases where ARES is not successful in doing so. To our knowledge, ARES is the first approach that works fully offline, remains robust to extreme reward delays and low-quality data, and is not limited to goal-based tasks.
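
To illustrate the interface only (this is not ARES's attention mechanism): given per-step scores produced by some trained model, a delayed episode return can be redistributed into dense per-step rewards that sum back to the original return.

import numpy as np

def redistribute_return(step_scores: np.ndarray, episode_return: float) -> np.ndarray:
    """Turn a single delayed return into dense per-step rewards by spreading it
    with softmax-normalized per-step scores (stand-ins for attention weights
    produced by a trained transformer). The shaped rewards sum to the return."""
    w = np.exp(step_scores - step_scores.max())
    w /= w.sum()
    return w * episode_return

scores = np.array([0.1, 2.0, 0.3, 1.5])   # e.g., attention mass on key steps
dense_r = redistribute_return(scores, episode_return=10.0)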

URL: https://openreview.net/forum?id=Vl0SOQWJ6Y

---

Title: Personalized Safety Alignment for Text-to-Image Diffusion Models

Abstract: Text-to-image diffusion models have transformed visual content generation, yet their safety mechanisms enforce rigid, uniform standards that fail to reflect diverse user preferences shaped by age, mental health, or personal beliefs. To address this limitation, we propose Personalized Safety Alignment (PSA), a framework for user-specific control over generative safety behavior. We also introduce Sage, a large-scale dataset capturing diverse user-specific safety boundaries to support this task. The PSA framework integrates user profiles via a lightweight cross-attention mechanism, efficiently steering generation to align with individual preferences. Experiments demonstrate that PSA substantially outperforms static approaches in user-specific alignment. Crucially, PSA achieves a calibrated safety-quality trade-off: under permissive profiles, it relaxes constraints to enhance visual quality, while under restrictive profiles, it intensifies suppression to maintain safety compliance. By moving beyond rigid, one-size-fits-all solutions, this work establishes personalized safety alignment as a promising new direction toward generative systems that are safer, more adaptive, and genuinely user-centered.

URL: https://openreview.net/forum?id=1qC1x1dJCj

---

Title: Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

Abstract: Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking–trace–augmented resources that enhance the reliability of LLM raters.
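
A minimal sketch of the rejection-sampling idea, with a hypothetical stand-in for the LLM call (in practice the trace and label would come from prompting an actual model):

import random

def sample_thinking_trace(item: str, label: str) -> tuple[str, str]:
    """Hypothetical stand-in for an LLM call: returns (trace, predicted_label).
    In practice this would prompt an LLM to reason about `item` and emit a label."""
    predicted = random.choice(["positive", "negative"])
    return f"Reasoning about: {item} ...", predicted

def infer_traces(dataset, n_tries: int = 8):
    """Keep only traces whose predicted label matches the human annotation
    (simple rejection sampling, as in the described framework)."""
    accepted = []
    for item, label in dataset:
        for _ in range(n_tries):
            trace, predicted = sample_thinking_trace(item, label)
            if predicted == label:
                accepted.append((item, label, trace))
                break
    return accepted

data = [("The movie was dull.", "negative"), ("Loved every minute!", "positive")]
traces = infer_traces(data)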

URL: https://openreview.net/forum?id=1jLQ629Yps

---

Title: Learning Robust Penetration Testing Policies under Partial Observability: A systematic evaluation

Abstract: Penetration testing, the simulation of cyberattacks to identify security vulnerabilities, presents a sequential decision-making problem well-suited for reinforcement learning (RL) automation. As in many applications of RL to real-world problems, partial observability presents a major challenge, as it invalidates the Markov property present in Markov Decision Processes (MDPs). Partially Observable MDPs require history aggregation or belief state estimation to learn successful policies. We investigate stochastic, partially observable penetration testing scenarios over host networks of varying size, aiming to better reflect real-world complexity through more challenging and representative benchmarks. This approach leads to the development of more robust and transferable policies, which are crucial for ensuring reliable performance across diverse and unpredictable real-world environments. Using vanilla Proximal Policy Optimization (PPO) as a baseline, we compare a selection of PPO variants designed to mitigate partial observability, including frame-stacking, augmenting observations with historical information, and employing recurrent or transformer-based architectures. We conduct a systematic empirical analysis of these algorithms across different host network sizes. We find that this task greatly benefits from history aggregation, converging three times faster than other approaches. Manual inspection of the policies learned by the algorithms reveals clear distinctions and provides insights that go beyond quantitative results.
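
As an illustration of the simplest history-aggregation variant mentioned above, a frame-stacking wrapper can be sketched as follows (a gym-style environment with a reset()/step() interface returning 1-D observations is assumed):

from collections import deque
import numpy as np

class FrameStack:
    """Minimal frame-stacking wrapper: stacks the last k observations so a
    feed-forward policy sees a short history under partial observability."""
    def __init__(self, env, k: int = 4):
        self.env, self.k = env, k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(list(self.frames))

    def step(self, action):
        # assumes the older 4-tuple step API: (obs, reward, done, info)
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(list(self.frames)), reward, done, info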

URL: https://openreview.net/forum?id=YkUV7wfk19

---

Title: Survey on Coresets for Deep Learning: Methods and Applications

Abstract: This survey presents a comprehensive review of coreset methods in deep learning, an important tool for improving data efficiency in large-scale neural networks. In general, a ``coreset'' is an algorithmic technique for selecting a small yet representative subset of data to replace the full dataset, which can yield a more efficient training process while preserving model performance. In the past 20 years, coreset techniques have been widely applied to many classical machine learning problems, such as clustering, regression, and classification. In recent years, coreset techniques have also begun to attract considerable attention in the modern deep learning area. However, designing effective coresets is usually challenging since we need to take into account the trade-off among multiple factors, such as complexity, robustness, and accuracy. In this survey, we focus on two common scenarios for using coreset methods in deep learning: (1) reducing the extremely high computational cost of training a deep learning model, and (2) improving data utilization under resource constraints such as a limited label budget or storage capacity. We begin by outlining the fundamental principles, advantages, and design challenges of coresets for these two scenarios. We also discuss the emerging applications of coresets in large language models. Finally, we identify several open problems and promising directions for future research.
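
As a concrete example of the kind of method such a survey covers, here is the classical k-center greedy selection rule (one construction among many; deep-learning variants typically run it on feature or gradient embeddings):

import numpy as np

def k_center_greedy(X: np.ndarray, budget: int, seed: int = 0) -> np.ndarray:
    """Classical k-center greedy coreset selection: repeatedly add the point
    farthest from the current selection, giving good geometric coverage of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]
    dists = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(budget - 1):
        idx = int(np.argmax(dists))
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(X - X[idx], axis=1))
    return np.array(selected)

X = np.random.default_rng(1).normal(size=(1000, 16))   # e.g., feature embeddings
coreset_idx = k_center_greedy(X, budget=50)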

URL: https://openreview.net/forum?id=ytYWmZ9haH

---

Title: Towards Bridging the Semantic Spaces of the One-to-Many Mapping in Cross-Modality Text-to-Video Generation

Abstract: Despite recent advances in text-to-video generation, the role of text and video latent spaces in learning a semantically shared representation remains underexplored. In this cross-modality generation task, most methods rely on conditioning the video generation process by injecting the text representation into it, without exploring the implicit shared knowledge between the modalities.
However, feature-based alignment of the two modalities is not straightforward, especially in the \textit{one-to-many} mapping scenario, in which one text can be mapped to several valid, semantically aligned videos; this scenario generally produces representation collapse in the alignment phase. In this work, we investigate and give insights into how the two modalities interact in a shared semantic space in which each modality's representation is first learned in an unsupervised way. Taking a latent-space-learning perspective, we analyze a plug-and-play framework proposed in this work that adopts autoencoder-based models and can be used with other representations. We show that the one-to-many case requires different alignment strategies than the common ones used in the literature, which struggle to align the two modalities in a semantically shared space.

URL: https://openreview.net/forum?id=6tigEAsbFw

---

Title: FinMaster: A Holistic Benchmark for Full-Pipeline Financial Management with Large Language Models

Abstract: Financial management tasks are pivotal to global economic stability; however, their efficient execution faces persistent challenges, including labor-intensive processes, low error tolerance, data fragmentation, and limitations in existing technological tools. Although large language models (LLMs) have shown remarkable success in various natural language processing (NLP) tasks and have demonstrated potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance suffer from insufficient domain-specific data, simplistic task design, and incomplete evaluation frameworks. To address these gaps, in this work, we present \textbf{FinMaster}, a comprehensive financial management benchmark designed to systematically assess the capabilities of LLMs in financial literacy, accounting, auditing, and consulting. Specifically, \textbf{FinMaster} comprises three main modules: i) \emph{FinSim}, which builds simulators that can generate synthetic, privacy-compliant financial datasets for different types of companies to replicate real-world market dynamics; ii) \emph{FinSuite}, which provides a variety of tasks in core financial domains, spanning 183 tasks of various types and difficulty levels; and iii) \emph{FinEval}, which develops a unified evaluation framework for streamlined evaluation. Extensive experiments on state-of-the-art LLMs, such as GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-V3, reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90\% on basic tasks to merely 40\% on complex scenarios requiring multi-step reasoning. This degradation reflects the propagation of computational errors: single-metric calculations that initially achieved 58\% accuracy dropped to 37\% in multi-metric scenarios. To the best of our knowledge, \textbf{FinMaster} is the first benchmark that comprehensively covers full-pipeline financial workflows with challenging and realistic tasks. We hope that \textbf{FinMaster} can bridge the gap between the research community and industry practitioners, driving the adoption of LLMs in real-world financial practices to enhance both efficiency and accuracy.

URL: https://openreview.net/forum?id=zCMMlKzbEe

---

Title: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability

Abstract: Unified multimodal understanding and generation have attracted much attention in the field of vision and language in recent years. Existing unified models (UniMs) aim to learn understanding and generation capabilities simultaneously; they require a large amount of computational resources and suffer from two shortcomings: 1) difficulty in generating interleaved text-image content; and 2) weaker understanding capabilities than multimodal large language models (MLLMs). To bridge this gap, we propose ARMOR, a resource-efficient framework designed to ``upgrade'' rather than ``retrain from scratch'' expert MLLMs. Our core principle is to endow MLLMs with generation capabilities while preventing catastrophic forgetting of their top-tier understanding capabilities. We achieve this goal through three key innovations: (1) an asymmetric architecture that isolates a lightweight generative decoder from the frozen MLLM core via a forward-switching mechanism to enable seamless interleaved generation; (2) a meticulously curated high-quality interleaved dataset; (3) a progressive ``What or How to Generate'' (WoHG) three-stage training algorithm. Experiments demonstrate that ARMOR successfully upgrades a leading MLLM, retaining over 95\% of its original understanding performance while achieving highly competitive image generation at less than 1/70 the cost of training from scratch. This demonstrates the effectiveness of our core idea: ``the efficient paradigm of upgrading and expanding existing expert MLLMs into UniMs.''

URL: https://openreview.net/forum?id=4TLXaJt8Rq

---

Title: Are vision language models robust to classic uncertainty challenges?

Abstract: Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advancements in large-scale vision language models (VLMs, e.g. GPT-4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. In this work, we sanity check whether modern VLMs pass the two most ``classic'' uncertainty quantification challenges: anomaly detection and classification under inherently ambiguous conditions. We find that newer and larger VLMs indeed exhibit improved robustness compared to earlier models, but they still suffer from a tendency to strictly follow instructions, often causing them to hallucinate confident responses even when faced with unclear or anomalous inputs.
Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from uncertain predictions enables significant reliability gains, achieving near-perfect robustness in several settings.
However, for domain-specific tasks such as galaxy morphology classification, a lack of specialized knowledge prevents reliable uncertainty estimation. Finally, we propose a simple mechanism based on caption diversity to reveal a model’s internal uncertainty, enabling practitioners to predict when models will successfully abstain without relying on labeled data.
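
As an illustration of the abstention prompting described above (wording hypothetical, not the paper's exact prompt):

ABSTAIN_PROMPT = (
    "Classify the image into one of: {labels}. "
    "If the image is ambiguous, anomalous, or does not clearly belong to any "
    "of these classes, answer exactly: 'I am not sure.' Do not guess."
)
prompt = ABSTAIN_PROMPT.format(labels=", ".join(["cat", "dog", "car"]))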

URL: https://openreview.net/forum?id=4lCSYCNfmo

---

Title: Logical Anomaly Detection with Masked Image Modeling

Abstract: Detecting anomalies such as an incorrect combination of objects or deviations in their positions is a challenging problem in unsupervised anomaly detection (AD). Since conventional AD methods mainly focus on local patterns of normal images, they struggle with detecting logical anomalies that appear in global patterns. To effectively detect these challenging logical anomalies, we introduce LADMIM (Logical Anomaly Detection with Masked Image Modeling), a novel unsupervised AD framework that harnesses the power of masked image modeling and discrete representation learning. Our core insight is that predicting the missing region forces the model to learn the long-range dependencies between patches. Specifically, we formulate AD as a mask completion task, which predicts the distribution of discrete latents in the masked region. Since the distribution of discrete latents is invariant to low-level variance in pixel space, the model can focus on the logical dependencies in the image, which improves accuracy in logical AD. We evaluate the AD performance on five benchmarks and show that our approach achieves comparable performance without any pre-trained segmentation models. We also conduct comprehensive experiments to reveal the key factors that influence logical AD performance.

URL: https://openreview.net/forum?id=uuuaRCMYE3

---

Title: Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set

Abstract: AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on Maximum Independent Set (MIS). Strikingly, even on in-distribution random graphs, leading AI-inspired methods are consistently outperformed by the state-of-the-art classical solver KaMIS running on a single CPU, and some AI-inspired methods frequently fail to surpass even the simplest degree-based greedy heuristic. Even with post-processing techniques like local search, AI-inspired methods still perform worse than CPU-based solvers. To better understand the source of these failures, we introduce a novel analysis, serialization, which reveals that non-backtracking AI-inspired methods, e.g. LTFT (which is based on GFlowNets), end up reasoning similarly to the simplest degree-based greedy heuristic, and thus worse than KaMIS. More generally, our findings suggest a need for a rethinking of current approaches in AI for CO, advocating for more rigorous benchmarking and the principled integration of classical heuristics. Additionally, we find that the CPU-based algorithm KaMIS has strong performance on sparse random graphs, which appears to show that the shattering threshold conjecture for large independent sets proposed by Coja-Oghlan & Efthymiou (2015) is either false or does not apply for real-life sizes (such as $10^6$ nodes).
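
For reference, the degree-based greedy heuristic used as the simplest baseline above can be written in a few lines:

def degree_greedy_mis(adj: dict[int, set[int]]) -> set[int]:
    """Simplest degree-based greedy heuristic for Maximum Independent Set:
    repeatedly pick a remaining vertex of minimum degree, add it to the set,
    and delete it together with its neighbors."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    independent = set()
    while adj:
        u = min(adj, key=lambda v: len(adj[v]))   # minimum-degree vertex
        independent.add(u)
        removed = adj[u] | {u}
        for w in removed:
            adj.pop(w, None)
        for vs in adj.values():
            vs -= removed
    return independent

graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
print(degree_greedy_mis(graph))   # prints {0, 3, 4}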

URL: https://openreview.net/forum?id=ksGoCT5zW6

---

Title: Continuous Treatment Effect Estimation with Cauchy-Schwarz Divergence Information Bottleneck

Abstract: Estimating individualized treatment effects (ITE) for continuous and multivariate treatments remains a fundamental yet underexplored problem in causal inference, as most existing methods are confined to binary treatment settings. In this paper, we make two key theoretical contributions. First, we derive a novel counterfactual error bound based on the Cauchy–Schwarz (CS) divergence, which is provably tighter than prior bounds derived from the Kullback–Leibler (KL) divergence. Second, we strengthen this bound by integrating the Information Bottleneck principle, introducing a compression regularization on latent representations to enhance generalization. Building on these insights, we propose a new neural framework that operationalizes our theory. Extensive experiments on three benchmarks show that our method consistently outperforms state-of-the-art baselines and remains robust under biased treatment assignments.
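
For reference, one common form of the Cauchy–Schwarz divergence between densities $p$ and $q$ (conventions differ by a constant factor) is

$D_{\mathrm{CS}}(p, q) \;=\; -\log \dfrac{\int p(x)\, q(x)\, dx}{\sqrt{\int p(x)^2\, dx \;\int q(x)^2\, dx}},$

which is nonnegative by the Cauchy–Schwarz inequality and vanishes exactly when $p = q$ almost everywhere.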

URL: https://openreview.net/forum?id=9SvY0mMr2u

---

Title: LoDAdaC: a unified local training-based decentralized framework with Adam-type updates and compressed communication

Abstract: In decentralized distributed learning, achieving fast convergence and low communication cost is essential for scalability and high efficiency. Despite extensive research, existing decentralized methods can either have fast convergence or enjoy low communication cost, but cannot achieve both goals simultaneously. This disadvantage causes significant inefficiency (either in computation or communication) in solving large-scale decentralized learning problems,
e.g., in large language model training. To address this limitation, we propose LoDAdaC, a unified multiple \textbf{Lo}cal Training (MLT) \textbf{D}ecentralized framework with \textbf{Ada}m-type updates and \textbf{C}ompressed communication (CC). LoDAdaC accommodates a broad class of optimizers for its local adaptive updates, including AMSGrad, Adam, and AdaGrad; it is compatible with standard (possibly biased) compressors such as low-bit quantization and sparsification. MLT and CC enable LoDAdaC to achieve multiplied reduction of communication cost, while the technique of adaptive updates enables fast convergence. We rigorously prove the combined advantage through complexity analysis. In addition, experiments on image classification and large language model training validate our theoretical findings and show that LoDAdaC significantly outperforms existing decentralized algorithms in terms of convergence speed and communication efficiency.
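
As a small illustration of one compressor the framework admits (top-k sparsification; this is generic and not LoDAdaC's full algorithm, which also interleaves local Adam-type steps between communications):

import numpy as np

def topk_compress(x: np.ndarray, ratio: float = 0.01):
    """Top-k sparsification, a standard (biased) compressor of the kind the
    framework admits: keep only the largest-magnitude entries of the update."""
    k = max(1, int(ratio * x.size))
    flat = x.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    values = flat[idx]
    return idx, values, x.shape           # what actually gets communicated

def topk_decompress(idx, values, shape):
    out = np.zeros(np.prod(shape))
    out[idx] = values
    return out.reshape(shape)

update = np.random.default_rng(0).normal(size=(256, 64))
idx, vals, shape = topk_compress(update, ratio=0.05)
restored = topk_decompress(idx, vals, shape)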

URL: https://openreview.net/forum?id=0qoy9usvnm

---

Title: \texttt{Complex-Edit}: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Abstract: We introduce \texttt{Complex-Edit}, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured ``Chain-of-Edit'' pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions.
Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments.

Our benchmark yields several notable insights:
1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases;
2) Increased instructional complexity primarily impairs the models’ ability to retain key elements from the input images;
3) Stronger models aren't necessarily more resilient towards higher complexity;
4) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics;
5) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and
6) We observe a ``curse of synthetic data'': when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises --- a phenomenon that intriguingly also manifests in the latest GPT-Image-1's outputs.

URL: https://openreview.net/forum?id=lL1JR6dxG8

---

Title: Learning Lagrangian Interaction Dynamics with Sampling-Based Model Order Reduction

Abstract: Simulating physical systems governed by Lagrangian dynamics often entails solving partial differential equations (PDEs) over high-resolution spatial domains, leading to significant computational expense. Reduced-order modeling (ROM) mitigates this cost by evolving low-dimensional latent representations of the underlying system. While neural ROMs enable querying solutions from latent states at arbitrary spatial points, their latent states typically represent the global domain and struggle to capture localized, highly dynamic behaviors such as fluids. We propose a sampling-based reduction framework that evolves Lagrangian systems directly in physical space, over the particles themselves, reducing the number of active degrees of freedom via data-driven neural PDE operators. To enable querying at arbitrary spatial locations, we introduce a learnable kernel parameterization that uses local spatial information from time-evolved sample particles to infer the underlying solution manifold. Empirically, our approach achieves a 6.6$\times$–32$\times$ reduction in input dimensionality while maintaining high-fidelity evaluations across diverse Lagrangian regimes, including fluid flows, granular media, and elastoplastic dynamics. We refer to this framework as GIOROM (\textbf{G}eometry-\textbf{I}nf\textbf{O}rmed \textbf{R}educed-\textbf{O}rder \textbf{M}odeling).

URL: https://openreview.net/forum?id=vXCQA1EzaG

---

Title: AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification

Abstract: When developing text classification models for real-world applications, one major challenge is the difficulty of collecting sufficient data for all text classes. In this work, we address this challenge by utilizing large language models (LLMs) to generate synthetic data and using such data to improve the performance of the models without waiting for more real data to be collected and labeled. As an LLM generates different synthetic data in response to different input examples, we formulate an automated workflow, which searches for input examples that lead to more ``effective'' synthetic data for improving the model concerned. We study three search strategies with an extensive set of experiments, and use experiment results to inform an ensemble algorithm that selects a search strategy according to the characteristics of a class. Our further experiments demonstrate that this ensemble approach is more effective than each individual strategy in our automated workflow for improving classification models using LLMs. The source code of the main software developed for this work is made available at https://anonymous.4open.science/r/AutoGeTS-2B0D.

URL: https://openreview.net/forum?id=y7B87znyuZ

---

Title: Gaga: Group Any Gaussians via 3D-aware Memory Bank

Abstract: We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot class-agnostic segmentation models. In contrast to prior 3D scene segmentation approaches that rely on video object tracking
or contrastive learning methods, Gaga utilizes spatial information and effectively associates object masks across diverse camera poses through a novel 3D-aware memory bank. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, particularly beneficial for sparsely sampled images, ensuring precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and demonstrates robust performance with different open-world zero-shot class-agnostic segmentation models, significantly enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as 3D scene understanding and manipulation.

URL: https://openreview.net/forum?id=cC1TLyK3iW

---

Title: APFEx: Adaptive Pareto Front Explorer for Intersectional Fairness

Abstract: Ensuring fairness in machine learning models is critical, especially when biases compound across intersecting protected attributes like race, gender, and age. While existing methods address fairness for single attributes, they fail to capture the nuanced, multiplicative biases faced by intersectional subgroups. We introduce \emph{Adaptive Pareto Front Explorer (APFEx)}, the first framework to explicitly model intersectional fairness as a joint optimization problem over the Cartesian product of sensitive attributes. APFEx combines three key innovations: (1) an adaptive multi-objective optimizer that dynamically switches between Pareto cone projection, gradient weighting, and exploration strategies to navigate fairness-accuracy trade-offs; (2) differentiable intersectional fairness metrics enabling gradient-based optimization of non-smooth subgroup disparities; and (3) theoretical guarantees of convergence to Pareto-optimal solutions. Experiments on four real-world datasets demonstrate APFEx’s superiority, reducing fairness violations while maintaining competitive accuracy. Our work bridges a critical gap in fair ML, providing a scalable, model-agnostic solution for intersectional fairness.
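
A minimal sketch of what a differentiable intersectional fairness penalty could look like (an assumption for illustration, not APFEx's exact metric): compute the mean predicted probability for each intersectional subgroup and penalize the largest gap.

import torch

def intersectional_dp_gap(probs: torch.Tensor, race: torch.Tensor,
                          gender: torch.Tensor) -> torch.Tensor:
    """Soft demographic-parity gap over intersectional subgroups (race x gender):
    the largest pairwise difference in mean predicted probability. Differentiable
    in `probs`, so it can be added to the training loss as a fairness penalty."""
    groups = race * (gender.max() + 1) + gender     # joint subgroup index
    means = []
    for g in torch.unique(groups):
        mask = groups == g
        if mask.any():
            means.append(probs[mask].mean())
    means = torch.stack(means)
    return means.max() - means.min()

probs = torch.sigmoid(torch.randn(8))
race = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
gender = torch.tensor([0, 1, 0, 1, 1, 0, 0, 1])
penalty = intersectional_dp_gap(probs, race, gender)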

URL: https://openreview.net/forum?id=ysbzXL4tZJ

---
