Weekly TMLR digest for Oct 05, 2025

TMLR

Oct 5, 2025, 12:00:12 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Reproducibility Certification: Revisiting B2T: Discovering and Mitigating Visual Biases through Keyword Explanations

Faissal El Kayouhi, Aïda Asma, Joey Laarhoven, Fiona Nagelhout

https://openreview.net/forum?id=5GS1q65pv6

---


Accepted papers
===============


Title: MGPATH: A Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot Whole Slide Pathology Classification

Authors: Anh-Tien Nguyen, Duy Minh Ho Nguyen, Nghiem Tuong Diep, Trung Quoc Nguyen, Nhat Ho, Jacqueline Michelle Metsch, Miriam Cindy Maurer, Daniel Sonntag, Hanibal Bohnenberger, Anne-Christin Hauschild

Abstract: Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that models interactions between learnable prompts and both individual image patches and groups of patches. This approach improves the model's ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach: we surpass several recent competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP. We release our implementations and pre-trained models at https://github.com/HauschildLab/MGPATH.

URL: https://openreview.net/forum?id=u7U81JLGjH

---

Title: Adversarial Robustness of Graph Transformers

Authors: Philipp Foth, Lukas Gosch, Simon Geisler, Leo Schwinn, Stephan Günnemann

Abstract: Existing studies have shown that Message-Passing Graph Neural Networks (MPNNs) are highly susceptible to adversarial attacks. In contrast, despite the increasing importance of Graph Transformers (GTs), their robustness properties are unexplored. We close this gap and design the first adaptive attacks for GTs. In particular, we provide general design principles for strong gradient-based attacks on GTs w.r.t. structure perturbations and instantiate our attack framework for five representative and popular GT architectures. Specifically, we study GTs with specialized attention mechanisms and Positional Encodings (PEs) based on pairwise shortest paths, random walks, and the Laplacian spectrum. We evaluate our attacks on multiple tasks and perturbation models, including structure perturbations for node and graph classification, and node injection for graph classification. Our results reveal that GTs can be catastrophically fragile in many cases. Addressing this vulnerability, we show how our adaptive attacks can be effectively used for adversarial training, substantially improving robustness.

URL: https://openreview.net/forum?id=4xK0vjxTWL

---

Title: Capsule Network Projectors are Equivariant and Invariant Learners

Authors: Miles Everett, Aiden Durrant, Mingjun Zhong, Georgios Leontidis

Abstract: Learning invariant representations has been the longstanding approach to self-supervised learning. However, recent progress has been made in preserving equivariant properties in representations, yet existing methods do so with highly prescribed architectures. In this work, we propose an invariant-equivariant self-supervised architecture that employs Capsule Networks (CapsNets), which have been shown to capture equivariance with respect to novel viewpoints. We demonstrate that the use of CapsNets in equivariant self-supervised architectures achieves improved downstream performance on equivariant tasks with higher efficiency and fewer network parameters. To accommodate the architectural changes of CapsNets, we introduce a new objective function based on entropy minimisation. This approach, which we name CapsIE (Capsule Invariant Equivariant Network), achieves state-of-the-art performance on the equivariant rotation tasks on the 3DIEBench dataset compared to prior equivariant SSL methods, while performing competitively against supervised counterparts. Our results demonstrate the ability of CapsNets to learn complex and generalised representations for large-scale, multi-task datasets compared to previous CapsNet benchmarks.

URL: https://openreview.net/forum?id=7owCO3qskH

---

Title: Blending adversarial training and representation-conditional purification via aggregation improves adversarial robustness

Authors: Emanuele Ballarin, Alessio Ansuini, Luca Bortolussi

Abstract: In this work, we propose a novel adversarial defence mechanism for image classification - CARSO - blending the paradigms of adversarial training and adversarial purification in a synergistic robustness-enhancing way. The method builds upon an adversarially-trained classifier, and learns to map its internal representation associated with a potentially perturbed input onto a distribution of tentative clean reconstructions. Multiple samples from such distribution are classified by the same adversarially-trained model, and a carefully chosen aggregation of its outputs finally constitutes the robust prediction of interest. Experimental evaluation on a well-established benchmark of strong adaptive attacks, across different image datasets, shows that CARSO is able to defend itself against adaptive end-to-end white-box attacks devised for stochastic defences. With a modest clean accuracy penalty, our method improves the state of the art in $\ell_\infty$ robust classification accuracy against AutoAttack by a significant margin on CIFAR-10, CIFAR-100, and TinyImageNet-200.
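
As a rough illustration of the aggregation step, here is a minimal sketch under assumed interfaces (`purifier_sample` draws one tentative clean reconstruction, `classifier` is the adversarially-trained model); the paper's "carefully chosen" aggregation rule may differ from the simple softmax averaging used here.

```python
import torch
import torch.nn.functional as F

def robust_predict(x, purifier_sample, classifier, n_samples=8):
    # Classify several tentative clean reconstructions of a (possibly perturbed)
    # input and aggregate their class probabilities; the mean of softmax outputs
    # is only one possible aggregation choice.
    probs = []
    with torch.no_grad():
        for _ in range(n_samples):
            x_rec = purifier_sample(x)                       # one sampled reconstruction
            probs.append(F.softmax(classifier(x_rec), dim=-1))
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)     # aggregated robust prediction
```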

URL: https://openreview.net/forum?id=40BXthYscW

---

Title: Does Unsupervised Domain Adaptation Improve the Robustness of Amortized Bayesian Inference? A Systematic Evaluation

Authors: Lasse Elsemüller, Valentin Pratz, Mischa von Krause, Andreas Voss, Paul-Christian Bürkner, Stefan T. Radev

Abstract: Neural networks are fragile when confronted with data that significantly deviates from their training distribution. This is true in particular for simulation-based inference methods, such as neural amortized Bayesian inference (ABI), where models trained on simulated data are deployed on noisy real-world observations. Recent robust approaches employ unsupervised domain adaptation (UDA) to match the embedding spaces of simulated and observed data. However, the lack of comprehensive evaluations across different domain mismatches raises concerns about their reliability in high-stakes applications. We address this gap by systematically testing UDA approaches across a wide range of misspecification scenarios, both in silico and in practice. We demonstrate that aligning summary spaces between domains effectively mitigates the impact of unmodeled phenomena or noise. However, the same alignment mechanism can lead to failures under prior misspecifications -- a critical finding with practical consequences. Our results underscore the need for careful consideration of misspecification types when using UDA to increase the robustness of ABI.

URL: https://openreview.net/forum?id=ewgLuvnEw6

---

Title: On diffusion posterior sampling via sequential Monte Carlo for zero-shot scaffolding of protein motifs

Authors: James Matthew Young, O. Deniz Akyildiz

Abstract: With the advent of diffusion models, new proteins can be generated at an unprecedented rate. The motif scaffolding problem requires steering this generative process to yield proteins with a desirable functional substructure called a motif. While models have been trained to take the motif as conditional input, recent techniques in diffusion posterior sampling can be leveraged as zero-shot alternatives whose approximations can be corrected with sequential Monte Carlo (SMC) algorithms. In this work, we introduce a new set of guidance potentials for describing scaffolding tasks and solve them by adapting SMC-aided diffusion posterior samplers with an unconditional model, Genie, as a prior. In single motif problems, we find that (i) the proposed potentials perform comparably, if not better, than the conventional masking approach, (ii) samplers based on reconstruction guidance outperform their replacement method counterparts, and (iii) measurement tilted proposals and twisted targets improve performance substantially. Furthermore, as a demonstration, we provide solutions to two multi-motif problems by pairing reconstruction guidance with an SE(3)-invariant potential. We also produce designable internally symmetric monomers with a guidance potential for point symmetry constraints. Our code is available at: https://github.com/matsagad/mres-project.

URL: https://openreview.net/forum?id=KXRYY7iwqh

---

Title: Latent Trajectory: A New Framework for Deep Actor-Critic Reinforcement Learning with Uncertainty Quantification

Authors: Frank Shih, Faming Liang

Abstract: Uncertainty quantification in deep learning is challenging due to the complexity of deep neural networks. This challenge is particularly pronounced in deep reinforcement learning (RL), where agents interact with stochastic environments. In deep actor-critic RL, this challenge is further exacerbated due to the interdependence between the actor and critic updates. Existing uncertainty quantification methods for RL are predominantly developed within the Bayesian framework. While these methods estimate the uncertainty of the value function, their confidence intervals are often misleading, with the coverage rate frequently falling well below the nominal level. To address this issue, we introduce a novel deep RL framework that treats transition trajectories as latent variables. Leveraging this framework, we propose an adaptive Stochastic Gradient Markov Chain Monte Carlo algorithm to train deep actor-critic models, which naturally accounts for the interdependence between the actor and critic updates. We provide theoretical guarantees for the convergence of the proposed method and offer empirical evidence for its effectiveness in uncertainty quantification of the value function. The proposed latent trajectory framework is highly flexible, allowing for the integration of advanced RL strategies to further enhance deep actor-critic learning.

URL: https://openreview.net/forum?id=8B74xdaRHa

---

Title: Empirical Comparison of Membership Inference Attacks in Deep Transfer Learning

Authors: Yuxuan Bai, Gauri Pradhan, Marlon Tobaben, Antti Honkela

Abstract: With the emergence of powerful large-scale foundation models, the training paradigm is increasingly shifting from from-scratch training to transfer learning. This enables high utility training with small, domain-specific datasets typical in sensitive applications.
Membership inference attacks (MIAs) provide an empirical estimate of the privacy leakage of machine learning models. Yet, prior assessments of MIAs against models fine-tuned with transfer learning rely on a small subset of possible attacks. We address this by comparing the performance of diverse MIAs in transfer learning settings to help practitioners identify the most efficient attacks for privacy risk evaluation. We find that the efficacy of score-based MIAs decreases as the amount of training data increases, and that no single MIA captures all privacy risks in models trained with transfer learning. While the Likelihood Ratio Attack (LiRA) demonstrates superior performance across most experimental scenarios, the Inverse Hessian Attack (IHA) proves to be more effective against models fine-tuned on the PatchCamelyon dataset in the high-data regime.

URL: https://openreview.net/forum?id=UligTUCgdt

---

Title: Sparse-to-Sparse Training of Diffusion Models

Authors: Inês Cardoso Oliveira, Decebal Constantin Mocanu, Luis A. Leiva

Abstract: Diffusion models (DMs) are a powerful class of generative models that have achieved state-of-the-art results in various image synthesis tasks and have shown potential in other domains, such as natural language processing and temporal data modeling. Despite their stable training dynamics and ability to produce diverse high-quality samples, DMs are notorious for requiring significant computational resources, both in the training and inference stages. Previous work has focused mostly on increasing the efficiency of model inference. This paper introduces, for the first time, the paradigm of sparse-to-sparse training to DMs, with the aim of improving both training and inference efficiency. We focus on unconditional generation and train sparse DMs from scratch (Latent Diffusion and ChiroDiff) on six datasets using three different methods (Static-DM, RigL-DM, and MagRan-DM) to study the effect of sparsity on model performance. Our experiments show that sparse DMs are able to match, and often outperform, their dense counterparts, while substantially reducing the number of trainable parameters and FLOPs. We also identify safe and effective values to perform sparse-to-sparse training of DMs.

URL: https://openreview.net/forum?id=iRupdoPLJa

---

Title: Unlearning Misalignment for Personalized LLM Adaptation via Instance-Response-Dependent Discrepancies

Authors: Cheng Chen, Atsushi Nitanda, Ivor Tsang

Abstract: While Large Language Models (LLMs) have revolutionized chatbot interactions, they often fall short of aligning responses with the nuanced preferences of individual users, a challenge rooted in the inherently subjective and proprietary nature of those preferences. Consequently, prompt-based learning, though effective in enhancing factual accuracy due to its emphasis on universal correctness, remains insufficient for achieving accurate personalised response alignment. Because user preferences vary widely across individuals and contexts, aligning responses requires a more personalized and context-aware approach. To address this limitation, we propose Consistent Marginalization (CM), a novel framework that aims to unlearn misalignment by constructing a personalised memory bank of instance-response-dependent discrepancies, built from a small set of user preference samples. This personalised memory bank equips LLMs with the ability to understand, recall, and adapt to individual preferences, enabling more consistent and personalized responses. Evaluated across a diverse range of domain-specific datasets and model architectures, CM yields notable improvements in response alignment and robustness. We believe Consistent Marginalization represents a valuable step toward enabling LLMs to become genuinely personable and adaptive conversational agents by understanding user preferences and generating responses that are better aligned with individual user expectations.

URL: https://openreview.net/forum?id=njE3swFBMc

---

Title: Revisiting B2T: Discovering and Mitigating Visual Biases through Keyword Explanations

Authors: Faissal El Kayouhi, Aïda Asma, Joey Laarhoven, Fiona Nagelhout

Abstract: This work aims to reproduce and extend the findings of "Discovering and Mitigating Visual Biases through Keyword Explanation" by Kim et al. (2024). The paper proposes the B2T framework, which detects and mitigates visual biases by extracting keywords from generated captions. By identifying biases in datasets, B2T contributes to the prevention of discriminatory behavior in vision-language models. We aim to investigate the five key claims from the original paper, namely that B2T (i) is able to identify whether a word represents a bias, (ii) can extract these keywords from captions of mispredicted images, (iii) outperforms other bias discovery models, (iv) can improve CLIP zero-shot prompting with the discovered keywords, and (v) identifies labeling errors in a dataset. To reproduce their results, we use the publicly available codebase and our re-implementations. Our findings confirm the first three claims and partially validate the fourth. We reject the fifth claim due to the failure to identify pertinent labeling errors. Finally, we enhance the original work by optimizing the efficiency of the implementation and assessing the generalizability of B2T on a new dataset.

URL: https://openreview.net/forum?id=5GS1q65pv6

---

Title: Spectral Clustering and Labeling for Crowdsourcing with Inherently Distinct Task Types

Authors: Saptarshi Mandal, Seo Taek Kong, Dimitrios Katselis, R. Srikant

Abstract: The Dawid-Skene model is the most widely assumed model in the analysis of crowdsourcing algorithms that estimate ground-truth labels from noisy worker responses. In this work, we are motivated by crowdsourcing applications where workers have distinct skill sets and their accuracy additionally depends on a task's type. Focusing on the case where there are two types of tasks, we propose a spectral method to partition tasks into two groups such that a worker has the same reliability for all tasks within a group. Our analysis reveals a separability condition such that task types can be perfectly recovered if the number of workers $n$ scales logarithmically with the number of tasks $d$. Numerical experiments show how clustering tasks by type before estimating ground-truth labels enhances the performance of crowdsourcing algorithms in practical applications.
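
One plausible instantiation of the task-partitioning step is sketched below; this is an illustration, not the authors' exact estimator, and the response-matrix convention is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_tasks_by_type(R, n_types=2):
    # R: (n_workers, n_tasks) matrix of responses, e.g. +/-1 with 0 for unanswered tasks.
    # Embed each task by its coordinates in the top right singular vectors of R,
    # then cluster tasks so that worker reliability is (approximately) constant per group.
    _, _, Vt = np.linalg.svd(np.asarray(R, dtype=float), full_matrices=False)
    task_embedding = Vt[:n_types].T            # (n_tasks, n_types) spectral coordinates
    return KMeans(n_clusters=n_types, n_init=10).fit_predict(task_embedding)
```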

URL: https://openreview.net/forum?id=jVQjtzcvAc

---

Title: Contextual Combinatorial Bandits With Changing Action Sets Via Gaussian Processes

Authors: Andi Nika, Sepehr Elahi, Cem Tekin

Abstract: We consider a contextual bandit problem with a combinatorial action set and time-varying base arm availability. At the beginning of each round, the agent observes the set of available base arms and their contexts and then selects an action that is a feasible subset of the set of available base arms to maximize its cumulative reward in the long run. We assume that the mean outcomes of base arms are samples from a Gaussian Process (GP) indexed by the context set ${\cal X}$, and the expected reward is Lipschitz continuous in expected base arm outcomes. For this setup, we propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB) and prove that it incurs $\tilde{O}(\sqrt{\lambda^*(K)KT\gamma_{KT}(\cup_{t\leq T}\mathcal{X}_t)} )$ regret with high probability, where $\gamma_{KT}(\cup_{t\leq T}\mathcal{X}_t)$ is the maximum information gain associated with the sets of base arm contexts $\mathcal{X}_t$ that appeared in the first $T$ rounds, $K$ is the maximum cardinality of any feasible action over all rounds, and $\lambda^*(K)$ is the maximum eigenvalue of all covariance matrices of selected actions up to time $T$, which is a function of $K$. To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both algorithms exploit inter-base arm outcome correlation and vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.

URL: https://openreview.net/forum?id=2RgfAY3jnI

---

Title: PSC: Posterior Sampling-Based Compression

Authors: Noam Elata, Tomer Michaeli, Michael Elad

Abstract: Diffusion models have transformed the landscape of image generation and now show remarkable potential for image compression. Most of the recent diffusion-based compression methods require training and are tailored for a specific bit-rate. In this work, we propose Posterior Sampling-based Compression (PSC) -- a zero-shot compression method that leverages a pre-trained diffusion model as its sole neural network component, thus enabling the use of diverse, publicly available models without additional training. Our approach is inspired by transform coding methods, which encode the image in some pre-chosen transform domain. However, PSC constructs a transform that is adaptive to the image. This is done by employing a zero-shot diffusion-based posterior sampler so as to progressively construct the rows of the transform matrix. Each new chunk of rows is chosen to reduce the uncertainty about the image given the quantized measurements collected thus far. Importantly, the same adaptive scheme can be replicated at the decoder, thus avoiding the need to encode the transform itself. We demonstrate that even with basic quantization and entropy coding, PSC's performance is comparable to established training-based methods in terms of rate, distortion, and perceptual quality, while providing greater flexibility: any desired rate or distortion can be chosen at inference time.

URL: https://openreview.net/forum?id=OsqgU6Jz4t

---

Title: Temporal horizons in forecasting: a performance-learnability trade-off

Authors: Pau Vilimelis Aceituno, Jack William Miller, Noah Marti, Youssef Farag, Victor Boussange

Abstract: When training autoregressive models to forecast dynamical systems, a critical question arises: how far into the future should the model be trained to predict for optimal performance? In this work, we address this question by analyzing the relationship between the geometry of the loss landscape and the training time horizon. Using dynamical systems theory, we prove that loss minima for long horizons generalize well to short-term forecasts, whereas minima found on short horizons result in worse long-term predictions. However, we also prove that the loss landscape becomes rougher as the training horizon grows, making long-horizon training inherently challenging. We validate our theory through numerical experiments and discuss practical implications for selecting training horizons. Our results provide a principled foundation for hyperparameter optimization in autoregressive forecasting models.

URL: https://openreview.net/forum?id=BeudQIxT1R

---

Title: Coreset-Driven Re-Labeling: Tackling Noisy Annotations with Noise-Free Gradients

Authors: Saumyaranjan Mohanty, Konda Reddy Mopuri

Abstract: Large-scale datasets invariably contain annotation noise. Various re-labeling methods have been developed to handle such noise, but they are time-consuming and computationally intensive. The required computational power and time can be drastically reduced by selecting a representative coreset. In this work, we adapt a noise-free gradient-based coreset selection method to re-labeling applications for noisy datasets with erroneous labels. We introduce a ‘confidence score’ into the coreset selection method to account for the presence of noisy labels. Through extensive evaluation on the CIFAR-100N, WebVision, and ImageNet-1K datasets, we demonstrate that our method outperforms the SOTA coreset selection for re-labeling methods (DivideMix and SOP+). We have provided the codebase at URL.

URL: https://openreview.net/forum?id=Tk78vb2Qd7

---

Title: Large-Scale Targeted Cause Discovery via Learning from Simulated Data

Authors: Jang-Hyun Kim, Claudia Skok Gibbs, Sangdoo Yun, Hyun Oh Song, Kyunghyun Cho

Abstract: We propose a novel machine learning approach for inferring causal variables of a target variable from observations. Our focus is on directly inferring a set of causal factors without requiring full causal graph reconstruction, which is computationally challenging in large-scale systems. The identified causal set consists of all potential regulators of the target variable under experimental settings, enabling efficient regulation through intervention. To achieve this, we train a neural network using supervised learning on simulated data to infer causality. By employing a subsampled-ensemble inference strategy, our approach scales with linear complexity in the number of variables, efficiently scaling up to thousands of variables. Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks, outperforming existing methods that emphasize full-graph discovery. We validate our model's generalization capability across out-of-distribution graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line.

URL: https://openreview.net/forum?id=NVgy29IQw8

---

Title: A Case for Library-Level $k$-Means Binning in Histogram Gradient-Boosted Trees

Authors: Asher Labovich

Abstract: Modern Gradient Boosted Decision Trees (GBDTs) accelerate split finding with histogram-based binning, which reduces complexity from $O(N\log N)$ to $O(N)$ by aggregating gradients into fixed-size bins. However, the predominant quantile binning strategy—designed to distribute data points evenly among bins—may overlook critical boundary values that could enhance predictive performance. In this work, we consider a novel approach that replaces quantile binning with a $k$-means discretizer initialized with quantile bins, and justify the swap with a proof showing how, for any $L$-Lipschitz function, $k$-means maximizes the worst-case explained variance of $Y$ obtained when treating all values in a given bin as equivalent. We test this swap against quantile and uniform binning on 33 OpenML datasets plus synthetics that control for modality, skew, and bin budget. Across 18 regression datasets, $k$-means shows no statistically significant losses at the 5% level and wins in three cases—most strikingly a 55% MSE drop on one particularly skewed dataset—even though $k$-means' mean reciprocal rank (MRR) is slightly lower (0.65 vs 0.72). On the 15 classification datasets the two methods are statistically tied (MRR 0.70 vs 0.68) with gaps $\leq$0.2 pp. Synthetic experiments confirm consistently large MSE gains—typically $>$20% and rising to 90% as outlier magnitude increases or bin budget drops. We find that $k$-means keeps error on par with exhaustive (no-binning) splitting when extra cuts add little value, yet still recovers key split points that quantile binning overlooks. As such, we advocate for a built-in bin_method=$k$-means flag, especially in regression tasks and in tight-budget settings such as the 32–64-bin GPU regime—because it is a "safe default" with large upside, yet adds only a one-off, cacheable overhead ($\approx$ 3.5s per feature to bin 10M rows on one Apple M1 thread).
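
A minimal sketch of the proposed swap (a hypothetical helper, not a library-level implementation): run 1D $k$-means on a feature, initialized at quantile-based centers, and split at midpoints between the sorted centers.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_bin_edges(feature, n_bins=64):
    # 1D k-means discretizer initialized from feature quantiles, returning
    # n_bins - 1 split points usable by a histogram-based GBDT.
    x = np.asarray(feature, dtype=float).reshape(-1, 1)
    quantile_init = np.quantile(x, np.linspace(0, 1, n_bins + 2)[1:-1]).reshape(-1, 1)
    km = KMeans(n_clusters=n_bins, init=quantile_init, n_init=1).fit(x)
    centers = np.sort(km.cluster_centers_.ravel())
    return (centers[:-1] + centers[1:]) / 2.0

# Example usage: bins = np.digitize(feature, kmeans_bin_edges(feature))
```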

URL: https://openreview.net/forum?id=UaTrLLspJa

---

Title: Generalized Smooth Stochastic Variational Inequalities: Almost Sure Convergence and Convergence Rates

Authors: Daniil Vankov, Angelia Nedich, Lalitha Sankar

Abstract: This paper focuses on solving a stochastic variational inequality (SVI) problem under a relaxed smoothness assumption for a class of structured non-monotone operators. The SVI problem has attracted significant interest in the machine learning community due to its immediate application to adversarial training and multi-agent reinforcement learning. In many such applications, the resulting operators do not satisfy the smoothness assumption. To address this issue, we focus on a weaker generalized smoothness assumption called $\alpha$-symmetric. Under $p$-quasi sharpness and $\alpha$-symmetric assumptions on the operator, we study clipped projection (gradient descent-ascent) and clipped Korpelevich (extragradient) methods. For these clipped methods, we provide the first almost-sure convergence results without making any assumptions on the boundedness of either the stochastic operator or the stochastic samples. We also provide the first in-expectation unbiased convergence rate results for these methods under a relaxed smoothness assumption for $\alpha \leq \frac{1}{2}$.

URL: https://openreview.net/forum?id=EjqSpbUBWU

---

Title: Beyond Marginals: Learning Joint Spatio-Temporal Patterns for Multivariate Anomaly Detection

Authors: Padmaksha Roy, Almuatazbellah Boker, Lamine Mili

Abstract: In this paper, we aim to improve anomaly detection (AD) by incorporating the time-varying non-linear spatio-temporal correlations of the multi-variate time series data in the modeling process. In multivariate AD, the simultaneous deviation of multiple nodes from their expected behavior can indicate an anomaly, even if no individual node shows a clearly abnormal pattern. In many existing approaches, time series variables are assumed to be (conditionally) independent, which oversimplifies real-world interactions. Our approach addresses this by modeling joint dependencies using a copula-based framework, which decouples the modeling of marginal distributions, temporal dynamics, and inter-variable dependencies. We use a transformer encoder to capture temporal patterns, and to model spatial (inter-variable) dependencies, we integrate a copula. Both components are trained jointly in a latent space using a self-supervised contrastive learning objective to learn meaningful feature representations to separate normal and anomaly samples.

URL: https://openreview.net/forum?id=iETTv1okjX

---

Title: k-NN as a Simple and Effective Estimator of Transferability

Authors: Moein Sorkhei, Christos Matsoukas, Johan Fredin Haslum, Emir Konuk, Kevin Smith

Abstract: How well can one expect transfer learning to work in a new setting where the domain is shifted, the task is different, and the architecture changes? Many transfer learning metrics have been proposed to answer this question. But how accurate are their predictions in a realistic new setting? We conducted an extensive evaluation involving over 42,000 experiments comparing 23 transferability metrics across 16 different datasets to assess their ability to predict transfer performance. Our findings reveal that none of the existing metrics perform well across the board. However, we find that a simple k-nearest neighbor evaluation -- as is commonly used to evaluate feature quality for self-supervision -- not only surpasses existing metrics, but also offers better computational efficiency and ease of implementation.
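
The estimator is simple enough to sketch in a few lines; the exact value of k, the distance metric, and the evaluation split used in the paper are assumptions here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_transferability(features, labels, k=20):
    # Score frozen source-model features on the target task with a k-NN classifier;
    # a higher cross-validated accuracy suggests better expected transfer performance.
    knn = KNeighborsClassifier(n_neighbors=k)
    return float(np.mean(cross_val_score(knn, features, labels, cv=5)))
```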

URL: https://openreview.net/forum?id=hGlkjP1zHc

---

Title: Balancing Utility and Privacy: Dynamically Private SGD with Random Projection

Authors: Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar

Abstract: Stochastic optimization is a pivotal enabler in modern machine learning, producing effective models for various tasks. However, several existing works have shown that model parameters and gradient information are susceptible to privacy leakage. Although Differentially Private SGD (DPSGD) addresses privacy concerns, its static noise mechanism impacts the error bounds for model performance. Additionally, with the exponential increase in model parameters, efficient learning of these models using stochastic optimizers has become more challenging. To address these concerns, we introduce the Dynamically Differentially Private Projected SGD (D2P2-SGD) optimizer. In D2P2-SGD, we combine two important ideas: (i) dynamic differential privacy (DDP) with automatic gradient clipping and (ii) random projection with SGD, allowing dynamic adjustment of the tradeoff between utility and privacy of the model. It exhibits provably sub-linear convergence rates across different objective functions, matching the best available rate. The theoretical analysis further suggests that DDP leads to better utility at the cost of privacy, while random projection enables more efficient model learning. Extensive experiments across diverse datasets show that D2P2-SGD remarkably enhances accuracy while maintaining privacy. Our code is available here.
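
A highly simplified sketch of how the named ingredients could combine in one update; the placement of the random projection and the exact form of the dynamic noise schedule are assumptions, not the paper's algorithm.

```python
import numpy as np

def d2p2_sgd_step(w, per_sample_grads, lr, clip, sigma_t, proj_dim, rng):
    # Per-sample clipping, a dynamically scaled Gaussian noise term (sigma_t may
    # vary with the iteration), and a random projection of the privatized gradient.
    batch, dim = per_sample_grads.shape
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    g = clipped.mean(axis=0) + rng.normal(0.0, sigma_t * clip / batch, size=dim)
    P = rng.normal(0.0, 1.0 / np.sqrt(proj_dim), size=(proj_dim, dim))
    return w - lr * (P.T @ (P @ g))            # update along the projected gradient
```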

URL: https://openreview.net/forum?id=u6OSRdkAwl

---

Title: Data-Driven Discovery of PDEs via the Adjoint Method

Authors: Mohsen Sadr, Tony Tohme, Kamal Youcef-Toumi

Abstract: In this work, we present an adjoint-based method for discovering the underlying governing partial differential equations (PDEs) given data. The idea is to consider a parameterized PDE in a general form and formulate a PDE-constrained optimization problem aimed at minimizing the error of the PDE solution from data. Using variational calculus, we obtain an evolution equation for the Lagrange multipliers (adjoint equations), allowing us to compute the gradient of the objective function with respect to the parameters of PDEs given data in a straightforward manner. In particular, we consider a family of temporal parameterized PDEs that encompass linear, nonlinear, and spatial derivative candidate terms, and elegantly derive the corresponding adjoint equations. We show the efficacy of the proposed approach in identifying the form of the PDE up to machine accuracy, enabling the accurate discovery of PDEs from data. We also compare its performance with the well-known PDE Functional Identification of Nonlinear Dynamics method, PDE-FIND (Rudy et al., 2017), among others, on both smooth and noisy data sets. Even though the proposed adjoint method relies on forward/backward solvers, it outperforms PDE-FIND in the limit of large data sets thanks to the analytic expressions for gradients of the cost function with respect to each PDE parameter.

URL: https://openreview.net/forum?id=Az3mJ4d1eT

---

Title: Decomposed Direct Preference Optimization for Structure-Based Drug Design

Authors: Xiwei Cheng, Xiangxin Zhou, Yuwei Yang, Yu Bao, Quanquan Gu

Abstract: Diffusion models have achieved promising results for Structure-Based Drug Design (SBDD). Nevertheless, high-quality protein subpocket and ligand data are relatively scarce, which hinders the models’ generation capabilities. Recently, Direct Preference Optimization (DPO) has emerged as a pivotal tool for aligning generative models with human preferences. In this paper, we propose DecompDpo, a structure-based optimization method that aligns diffusion models with pharmaceutical needs using multi-granularity preference pairs. DecompDpo introduces decomposition into the optimization objectives and obtains preference pairs at the molecule or decomposed substructure level based on each objective’s decomposability. Additionally, DecompDpo introduces a physics-informed energy term to ensure reasonable molecular conformations in the optimization results. Notably, DecompDpo can be effectively used for two main purposes: (1) fine-tuning pretrained diffusion models for molecule generation across various protein families, and (2) molecular optimization given a specific protein subpocket after generation. Extensive experiments on the CrossDocked2020 benchmark show that DecompDpo significantly improves model performance, achieving up to 98.5% Med. High Affinity and a 43.9% success rate for molecule generation, and 100% Med. High Affinity and a 52.1% success rate for targeted molecule optimization.

URL: https://openreview.net/forum?id=dwSpo5DRk8

---

Title: Dependency-aware Maximum Likelihood Estimation for Active Learning

Authors: Beyza Kalkanli, Tales Imbiriba, Stratis Ioannidis, Deniz Erdogmus, Jennifer Dy

Abstract: Active learning aims to efficiently build a labeled training set by strategically selecting samples to query labels from annotators.
In this sequential process, each sample acquisition influences subsequent selections, causing dependencies among samples in the labeled set. However, these dependencies are overlooked during the model parameter estimation stage when updating the model using Maximum Likelihood Estimation (MLE), a conventional method that assumes independent and identically distributed (i.i.d.) data. We propose Dependency-aware MLE (DMLE), which corrects MLE within the active learning framework by addressing sample dependencies typically neglected due to the i.i.d. assumption, ensuring consistency with active learning principles in the model parameter estimation process. This improved method achieves superior performance across multiple benchmark datasets, reaching higher performance in earlier cycles compared to conventional MLE. Specifically, we observe average accuracy improvements of 6%, 8.6%, and 10.5% for k=1, k=5, and k=10, respectively, after collecting the first 100 samples, where entropy is the acquisition function and k is the query batch size at each active learning cycle.

URL: https://openreview.net/forum?id=qDVDSXXGK1

---

Title: Aggregating Algorithm and Axiomatic Loss Aggregation

Authors: Armando J Cabrera Pacheco, Rabanus Derr, Robert Williamson

Abstract: Supervised learning has gone beyond the empirical risk minimization framework. Central to most of these developments is the introduction of more general aggregation functions for losses incurred by the learner. In this paper, we turn towards online learning under expert advice. Via easily justified assumptions we characterize a set of reasonable loss aggregation functions as quasi-sums. Based upon this insight, we suggest how to tailor Vovk's Aggregating Algorithm to these more general aggregation functions. The "change of variables" we propose lets us highlight that "weighting profiles" determine the contribution of each expert to the next prediction according to their loss, and that the multiplicative structure of the weight updates in the Aggregating Algorithm translates into the additive structure of the loss aggregation in the regret bound. In addition, we suggest that the mixability of the loss function, which is functionally necessary for the Aggregating Algorithm, is intrinsically relative to the log loss, because the standard aggregation of losses in online learning is the sum. Finally, we conceptually and empirically argue that our generalized loss aggregation functions express the attitude of the learner towards losses.
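
For orientation, here is a bare-bones exponential-weights learner with the classical additive (summed) loss aggregation; the paper's quasi-sum aggregation and Vovk's substitution function generalize this simplified weighted-average prediction.

```python
import numpy as np

def exponential_weights(expert_preds, outcomes, per_expert_loss, eta=1.0):
    # expert_preds: (T, n_experts); outcomes: (T,);
    # per_expert_loss(preds_t, y_t) -> array of per-expert losses, shape (n_experts,)
    T, n = expert_preds.shape
    log_w = np.zeros(n)
    master = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        master[t] = w @ expert_preds[t]                               # weighted master prediction
        log_w -= eta * per_expert_loss(expert_preds[t], outcomes[t])  # multiplicative weight update
    return master
```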

URL: https://openreview.net/forum?id=4bUuWtOuDx

---

Title: Designing Algorithms Empowered by Language Models: An Analytical Framework, Case Studies, and Insights

Authors: Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou

Abstract: This work presents an analytical framework for the design and analysis of LLM-based algorithms, i.e., algorithms that contain one or multiple calls of large language models (LLMs) as sub-routines and critically rely on the capabilities of LLMs. While such algorithms, ranging from basic LLM calls with prompt engineering to complicated LLM-powered agentic workflows and compound AI systems, have achieved remarkable empirical success, their design and optimization oftentimes require extensive trial and error and case-by-case analysis. Our proposed framework serves as an attempt to mitigate such headaches, offering a formal and systematic approach for analyzing how the accuracy and efficiency of an LLM-based algorithm will be impacted by critical design choices, such as the pattern and granularity of task decomposition, or the prompt for each LLM call. Through a wide range of case studies covering diverse algorithm patterns (including parallel/hierarchical/recursive task decomposition and generic directed acyclic graphs), we demonstrate the proposed framework in action and derive interesting insights that generalize across scenarios, accompanied by systematic empirical validation in synthetic settings.

URL: https://openreview.net/forum?id=nJ7RECkxQr

---

Title: Recursive SNE: Fast Prototype-Based t-SNE for Large-Scale and Online Data

Authors: Agil Aghasanli, Plamen P Angelov

Abstract: Dimensionality reduction techniques like t-SNE excel at visualizing structure in high-dimensional data but incur high computational costs that limit their use on large or streaming datasets. We introduce the Recursive SNE (RSNE) framework, which extends t-SNE with two complementary strategies: i-RSNE for real-time, point-wise updates and Bi-RSNE for efficient batch processing. Across diverse settings, including standard image benchmarks (CIFAR10/CIFAR100) with DINOv2 and CLIP features, domain-specific iROADS road scenes, neuroimaging data from the Haxby fMRI dataset, and long-term climate records, RSNE delivers substantial speedups over Barnes–Hut t-SNE while maintaining or even improving cluster separability. By combining a lightweight prototype-based initialization with localized KL-divergence refinements, RSNE offers a scalable and adaptable framework for both large-scale offline embedding and on-the-fly visualization of streaming data.

URL: https://openreview.net/forum?id=7wCPAFMDWM

---

Title: Regret Analysis of Posterior Sampling-Based Expected Improvement for Bayesian Optimization

Authors: Shion Takeno, Yu Inatsu, Masayuki Karasuyama, Ichiro Takeuchi

Abstract: Bayesian optimization is a powerful tool for optimizing an expensive-to-evaluate black-box function. In particular, the effectiveness of expected improvement (EI) has been demonstrated in a wide range of applications. However, theoretical analyses of EI are limited compared with other theoretically established algorithms. This paper analyzes a randomized variant of EI, which evaluates the EI from the maximum of the posterior sample path. We show that this posterior sampling-based random EI achieves sublinear Bayesian cumulative regret bounds under the assumption that the black-box function follows a Gaussian process. Finally, we demonstrate the effectiveness of the proposed method through numerical experiments.
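
A sketch of the randomized EI variant described above, assuming a finite candidate set; the exact acquisition optimization and tie-breaking details are assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def posterior_sampling_ei(gp: GaussianProcessRegressor, X_cand, seed=0):
    # Draw one posterior sample path on the candidate set, use its maximum as the
    # improvement threshold, and pick the candidate maximizing EI against that threshold.
    mu, std = gp.predict(X_cand, return_std=True)
    path = gp.sample_y(X_cand, n_samples=1, random_state=seed).ravel()
    tau = path.max()                                     # threshold from the posterior sample
    z = (mu - tau) / np.maximum(std, 1e-12)
    ei = (mu - tau) * norm.cdf(z) + std * norm.pdf(z)    # expected improvement over tau
    return X_cand[np.argmax(ei)]
```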

URL: https://openreview.net/forum?id=v0s9knY99c

---

Title: Stochastic Subspace Descent Accelerated via Bi-fidelity Line Search

Authors: Nuojin Cheng, Alireza Doostan, Stephen Becker

Abstract: Efficient optimization remains a fundamental challenge across numerous scientific and engineering domains, particularly when objective function evaluations are computationally expensive and gradient information is inaccessible. While zeroth-order optimization methods address the lack of gradients, their performance often suffers due to the high cost of repeated function queries. This work introduces a bi-fidelity line search scheme tailored for zeroth-order optimization. Our method constructs a temporary surrogate model by strategically combining inexpensive low-fidelity (LF) evaluations with accurate high-fidelity (HF) evaluations of the objective function. This surrogate enables an efficient backtracking line search for step size selection, significantly reducing the required number of costly HF queries. We provide theoretical convergence guarantees for this scheme under standard assumptions. Furthermore, we integrate this bi-fidelity strategy into the stochastic subspace descent framework, proposing the bi-fidelity stochastic subspace descent (BF-SSD) algorithm. A comprehensive empirical evaluation of BF-SSD is conducted across four distinct problems: a synthetic optimization benchmark, dual-form kernel ridge regression, black-box adversarial attacks on machine learning models, and transformer-based black-box language model fine-tuning. The numerical results consistently demonstrate that BF-SSD achieves superior optimization performance, particularly in terms of solution quality obtained per HF function evaluation, when compared against relevant baseline methods. This study highlights the efficacy of integrating bi-fidelity strategies within zeroth-order optimization frameworks, positioning BF-SSD as a promising and computationally efficient approach for tackling large-scale, high-dimensional optimization problems encountered in various real-world applications.
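
A generic illustration of the idea only: the paper builds an explicit bi-fidelity surrogate, whereas the sketch below simply screens step sizes with cheap low-fidelity (LF) evaluations and confirms with a single high-fidelity (HF) evaluation; the Armijo-style acceptance rule is an assumption.

```python
def bifidelity_backtracking(x, d, f_hf, f_lf, directional_deriv,
                            t0=1.0, beta=0.5, c=1e-4, max_backtracks=20):
    # Backtracking line search: LF evaluations drive the backtracking loop,
    # one extra HF evaluation confirms the sufficient-decrease condition.
    f0 = f_hf(x)                      # one HF evaluation at the current point
    t = t0
    for _ in range(max_backtracks):
        if f_lf(x + t * d) <= f0 + c * t * directional_deriv:   # Armijo check on LF model
            break
        t *= beta
    if f_hf(x + t * d) > f0 + c * t * directional_deriv:        # HF confirmation
        t *= beta                                                # shrink once more if it fails
    return t
```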

URL: https://openreview.net/forum?id=xuOQUs5YmT

---

Title: Efficient Reasoning Models: A Survey

Authors: Sicheng Feng, Gongfan Fang, Xinyin Ma, Xinchao Wang

Abstract: Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet, the emergence of this “slow-thinking” paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead, highlighting an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter – compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller – developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and reinforcement learning; and (3) faster – designing efficient decoding strategies to accelerate inference of reasoning models. A curated collection of papers discussed in this survey is available in our GitHub repository: https://github.com/fscdc/Awesome-Efficient-Reasoning-Models.

URL: https://openreview.net/forum?id=sySqlxj8EB

---

Title: Goal Recognition Design for General Behavioral Agents using Machine Learning

Authors: Robert Kasumba, Guanghui Yu, Chien-Ju Ho, Sarah Keren, William Yeoh

Abstract: Goal recognition design (GRD) aims to make limited modifications to decision-making environments to make it easier to infer the goals of agents acting within those environments. Although various research efforts have been made in goal recognition design, existing approaches are computationally demanding and often assume that agents are (near-)optimal in their decision-making. To address these limitations, we leverage machine learning methods for goal recognition design that can both improve run-time efficiency and account for agents with general behavioral models. Following existing literature, we use worst-case distinctiveness (wcd) as a measure of the difficulty in inferring the goal of an agent in a decision-making environment. Our approach begins by training a machine learning model to predict the wcd for a given environment and the agent behavior model. We then propose a gradient-based optimization framework that accommodates various constraints to optimize decision-making environments for enhanced goal recognition. Through extensive simulations, we demonstrate that our approach outperforms existing methods in reducing wcd and enhances runtime efficiency. Moreover, our approach also adapts to settings in which existing approaches do not apply, such as those involving flexible budget constraints, more complex environments, and suboptimal agent behavior. Finally, we conducted human-subject experiments that demonstrate that our method creates environments that facilitate efficient goal recognition from human decision-makers.

URL: https://openreview.net/forum?id=GDuWBhvMid

---


New submissions
===============


Title: Dealing with Uncertainty in Contextual Anomaly Detection

Abstract: Contextual anomaly detection (CAD) aims to identify anomalies in a target (behavioral) variable conditioned on a set of contextual variables. In many anomaly detection tasks, such contextual variables influence the normalcy of the target variable but are not themselves indicators of anomaly. In this work, we propose a novel framework for CAD, the normalcy score (NS), that explicitly models both the aleatoric and epistemic uncertainties. Built on heteroscedastic Gaussian process regression, our method regards the Z-score as a random variable, providing confidence intervals that reflect the reliability of the anomaly assessment. Through experiments on benchmark datasets and a real-world application in cardiology, we demonstrate that NS outperforms state-of-the-art CAD methods in both detection accuracy and interpretability. Moreover, the confidence intervals enable an adaptive, uncertainty-driven decision-making process, which may be very important in domains such as healthcare.
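
A simplified sketch of the idea, using a homoscedastic GP instead of the paper's heteroscedastic model and without the full treatment of the Z-score as a random variable; the interface names are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def contextual_z_scores(context_train, target_train, context_test, target_test):
    # Regress the behavioral (target) variable on the contextual variables and score
    # test points by the Z-score of the observation under the predictive distribution;
    # large |z| flags a contextual anomaly.
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(context_train, target_train)
    mu, std = gp.predict(context_test, return_std=True)
    return (np.asarray(target_test) - mu) / np.maximum(std, 1e-12)
```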

URL: https://openreview.net/forum?id=yLoXQDNwwa

---

Title: Uncertainty-Aware Surrogate-based Amortized Bayesian Inference for Computationally Expensive Models

Abstract: Bayesian inference typically relies on a large number of model evaluations to estimate posterior distributions. Established methods like Markov Chain Monte Carlo (MCMC) and Amortized Bayesian Inference (ABI) can become computationally challenging. While ABI enables fast inference after training, generating sufficient training data still requires thousands of model simulations, which is infeasible for expensive models. Surrogate models offer a solution by providing approximate simulations at a lower computational cost, allowing the generation of large data sets for training. However, the introduced approximation errors and uncertainties can lead to overconfident posterior estimates. To address this, we propose Uncertainty-Aware Surrogate-based Amortized Bayesian Inference (UA-SABI) -- a framework that combines surrogate modeling and ABI while explicitly quantifying and propagating surrogate uncertainties through the inference pipeline. Our experiments show that this approach enables reliable, fast, and repeated Bayesian inference for computationally expensive models, even under tight time constraints.

URL: https://openreview.net/forum?id=aVSoQXbfy1

---

Title: Efficient Dilated Squeeze and Excitation Neural Operator for Differential Equations

Abstract: Fast and accurate surrogates for physics-driven partial differential equations (PDEs) are essential in fields such as aerodynamics, porous media design, and flow control. However, many transformer-based models and existing neural operators remain parameter-heavy, resulting in costly training and sluggish deployment. We propose D-SENO (Dilated Squeeze-Excitation Neural Operator), a lightweight operator learning framework for efficiently solving a wide range of PDEs, including airfoil potential flow, Darcy flow in porous media, pipe Poiseuille flow, and incompressible Navier–Stokes vortical fields. D-SENO combines dilated convolution (DC) blocks with squeeze-and-excitation (SE) modules to jointly capture wide receptive fields and dynamics alongside channel-wise attention, enabling both accurate and efficient PDE inference. Carefully chosen dilation rates allow the receptive field to focus on critical regions, effectively modeling long-range physical dependencies. Meanwhile, the SE modules adaptively recalibrate feature channels to emphasize dynamically relevant scales. Our model trains up to $\approx 20\times$ faster than standard transformer-based models and neural operators, while also surpassing (or matching) them in accuracy across multiple PDE benchmarks. Ablation studies show that removing the SE modules leads to a slight drop in performance.
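
An illustrative building block in the spirit of the abstract (not the authors' code): a dilated convolution for a wide receptive field followed by squeeze-and-excitation channel reweighting; the residual wiring and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class DilatedSEBlock(nn.Module):
    def __init__(self, channels: int, dilation: int = 2, reduction: int = 4):
        super().__init__()
        # Dilated 3x3 convolution: padding equals dilation, so spatial size is preserved.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.GELU()
        # Squeeze-and-excitation gate: global pooling, bottleneck, channel-wise sigmoid.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.conv(x))
        return x + h * self.se(h)   # residual connection with channel recalibration
```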

URL: https://openreview.net/forum?id=Xl942THEUa

---

Title: Towards Scalable Language-Image Pre-training for 3D Medical Imaging

Abstract: The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the clinical workflow and provides a natural path to scalability. However, the unique structure of such data presents new challenges for existing model architectures, which were originally designed for 2D slices or single 3D scans. To address this, we introduce a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study. We denote our framework as Hierarchical attention for Language-Image Pre-training (HLIP). Trained on 220K studies with 3.13 million scans for brain MRI and 240K studies with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +10.5% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +8.3% and +1.7% macro AUC on head CT benchmarks CQ500 and RSNA, respectively. HLIP also exhibits strong generalizability on existing 3D medical language-image pre-training benchmarks, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. Code will be released upon acceptance.

URL: https://openreview.net/forum?id=WxHf4EcBWA

---

Title: Achieving Global Flatness in Decentralized Learning with Heterogeneous Data

Abstract: Decentralized training enables peer-to-peer on-device learning without relying on a central server, but suffers from degraded generalization performance under heterogeneous data distributions due to local overfitting. One strategy to mitigate this is to seek flatter loss landscapes during local optimization at each client. However, with extreme data heterogeneity, local objectives may diverge from the global one, yielding local flatness rather than true global flatness. To address this challenge, we introduce GFlat, a novel decentralized algorithm that enables each client to estimate and incorporate an approximation of the global update direction while seeking a flatter loss landscape locally. This lightweight strategy allows each client to directly contribute to global flatness without requiring additional communication or centralized coordination. We theoretically analyze the convergence properties of GFlat and validate its performance through extensive experiments across a range of datasets, model architectures, and communication topologies. GFlat consistently improves generalization in non-IID data settings and achieves up to 6.75\% higher test accuracy compared to state-of-the-art decentralized methods.

URL: https://openreview.net/forum?id=8G32T4RLbX

---

Title: Decoding Safety Feedback from Diverse Raters: A Data-driven Lens on Responsiveness to Severity

Abstract: Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for analyzing ordinal safety ratings in pluralistic settings. Specifically, we address the challenge of interpreting nuanced differences in safety feedback from a diverse population expressed via ordinal scales (e.g., a Likert scale). We define non-parametric responsiveness metrics that quantify how raters convey broader distinctions and granular variations in the severity of safety violations. Leveraging publicly available datasets of pluralistic safety feedback as our case studies, we investigate how raters from different demographic groups use an ordinal scale to express their perceptions of the severity of violations. We apply our metrics across violation types, demonstrating their utility in extracting nuanced insights that are crucial for aligning AI systems reliably in multi-cultural contexts. We show that our approach can inform rater selection and feedback interpretation by capturing nuanced viewpoints across different demographic groups, hence improving the quality of pluralistic data collection and in turn contributing to more robust AI alignment.

URL: https://openreview.net/forum?id=vx6qrM7VB5

---

Title: StFT: Spatio-temporal Fourier Transformer for Long-term Dynamics Prediction

Abstract: Simulating the long-term dynamics of multi-scale and multi-physics systems poses a significant challenge in understanding complex phenomena across science and engineering. The complexity arises from the intricate interactions between scales and the interplay of diverse physical processes, which manifest in PDEs through coupled, nonlinear terms that govern the evolution of multiple physical fields across scales. Neural operators have shown potential in short-term prediction of such complex spatio-temporal dynamics; however, achieving stable high-fidelity predictions and providing robust uncertainty quantification over extended time horizons remains an open and unsolved area of research. These limitations often lead to stability degradation with rapid error accumulation, particularly in long-term forecasting of systems characterized by multi-scale behaviors involving dynamics of different orders. To address these challenges, we propose an autoregressive Spatio-temporal Fourier Transformer (StFT), in which each transformer block is designed to learn the system dynamics at a distinct scale through a dual-path architecture that integrates frequency-domain and spatio-temporal representations. By leveraging a structured hierarchy of StFT blocks, the resulting model explicitly captures the underlying dynamics across both macro- and micro- spatial scales. Furthermore, a generative residual correction mechanism is introduced to learn a probabilistic refinement temporally while simultaneously quantifying prediction uncertainties, enhancing both the accuracy and reliability of long-term probabilistic forecasting. Evaluations conducted on three benchmark datasets (plasma, fluid, and atmospheric dynamics) demonstrate the advantages of our approach over state-of-the-art ML methods.

URL: https://openreview.net/forum?id=o9Cb0ri2oW

---

Title: Transformer Modeling for Both Scalability and Performance in Multivariate Time Series

Abstract: Variable count is among the main scalability bottlenecks for transformer modeling in multivariate time series (MTS) data. On top of this, a growing consensus in the field points to indiscriminate inter-variable mixing as a potential source of noise accumulation and performance degradation. This is likely exacerbated by the sparsity of informative signals characteristic of many MTS systems, coupled with representational misalignment stemming from indiscriminate information mixing between (heterogeneous) variables. While scalability and performance are often seen as competing interests in transformer design, we show that both can be improved simultaneously in MTS by strategically constraining the representational capacity of inter-variable mixing. Our proposed method, transformer with \textbf{Del}egate \textbf{T}oken \textbf{A}ttention (\textbf{DELTAformer}), constrains inter-variable modeling through what we call delegate tokens, which are then used to perform full, unconstrained, inter-temporal modeling. Delegate tokens act as an implicit regularizer that forces the model to be highly selective about what inter-variable information is allowed to propagate through the network. Our results show that DELTAformer scales linearly with variable count while actually outperforming standard transformers, achieving state-of-the-art performance across benchmarks and baselines. In addition, DELTAformer can focus on relevant signals better than standard transformers in noisy MTS environments and overall exhibits superior noise resilience. Overall, results across various experiments confirm that by aligning the model design with domain-specific challenges in MTS, DELTAformer can simultaneously achieve linear scaling while improving performance over standard, quadratic transformers.
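A generic sketch of the delegate-token mechanism described above: a small set of K learnable tokens mediates all inter-variable mixing, so the cost is O(N*K) rather than O(N^2) in the number of variables N. This is only an illustration of the general bottleneck-attention pattern (in the spirit of inducing-point attention), not the actual DELTAformer architecture; the delegate count, temporal modeling, and normalization choices here are placeholders.

    import torch
    import torch.nn as nn

    class DelegateTokenAttention(nn.Module):
        """Generic 'delegate token' bottleneck: variables exchange information
        only through K learnable tokens (a sketch, not the paper's model)."""
        def __init__(self, d_model: int, n_heads: int, n_delegates: int):
            super().__init__()
            self.delegates = nn.Parameter(torch.randn(n_delegates, d_model))
            self.collect = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.broadcast = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, x):                      # x: (batch, N_variables, d_model)
            dlg = self.delegates.unsqueeze(0).expand(x.size(0), -1, -1)
            dlg, _ = self.collect(dlg, x, x)       # delegates gather from all variables: O(N*K)
            out, _ = self.broadcast(x, dlg, dlg)   # variables read back from delegates: O(N*K)
            return out

    x = torch.randn(2, 321, 64)                    # e.g., 321 variables, 64-dim tokens
    print(DelegateTokenAttention(64, 4, n_delegates=8)(x).shape)  # torch.Size([2, 321, 64])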

URL: https://openreview.net/forum?id=x5l9BIyarl

---

Title: On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes

Abstract: It was recently shown that dynamic programming (DP) methods for finding static CVaR-optimal policies in Markov Decision Processes (MDPs) can fail when based on the dual formulation, yet the root cause of this failure remains unclear. We expand on these findings by shifting focus from policy optimization to the seemingly simpler task of policy evaluation. We show that evaluating the static CVaR of a given policy can be framed as two distinct minimization problems. We introduce a set of ``risk-assignment consistency constraints'' that must be satisfied for their solutions to match and we demonstrate that an empty intersection of these constraints is the source of previously observed evaluation errors. Quantifying the evaluation error as the \emph{CVaR evaluation gap}, we demonstrate that the issues observed when optimizing over the dual-based CVaR DP are explained by the returned policy having a non-zero CVaR evaluation gap. Finally, we leverage our proposed risk-assignment perspective to prove that the search for a single, uniformly optimal policy on the dual CVaR decomposition is fundamentally limited, identifying an MDP where no single policy can be optimal across all initial risk levels.
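For reference, the standard (Rockafellar--Uryasev) definition of the static CVaR of a random cost $Z$ at level $\alpha \in (0, 1]$, the quantity whose dual decomposition is analyzed above, is

    $$\mathrm{CVaR}_{\alpha}(Z) \;=\; \min_{b \in \mathbb{R}} \Big\{\, b + \tfrac{1}{\alpha}\, \mathbb{E}\big[(Z - b)_{+}\big] \Big\}, \qquad (z)_{+} = \max(z, 0).$$

This notation is generic background rather than the paper's own convention; for return maximization one applies it to the negated return.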

URL: https://openreview.net/forum?id=ToH8gFQqt7

---

Title: Joint Verification and Refinement of Language Models for Safety-Constrained Planning

Abstract: Large language models possess impressive capabilities in generating programs (e.g., Python) from natural language descriptions to execute robotic tasks. However, these generated programs often contain errors that violate externally given task specifications. Without an effective method to verify their correctness, the reliable deployment of language models in real-world systems is practically infeasible.
We develop a method that converts generated robot programs into an automaton-based representation and verifies them against task-relevant safety specifications. We establish a theorem that any arbitrary combination of the verified programs will also satisfy the safety specifications. Hence, the method eliminates the need to verify complex programs composed of multiple simpler ones, reducing computation complexity. We then introduce an automated fine-tuning procedure that leverages verification outcomes for supervision. By applying the theorem, this procedure only requires training the model to generate safe sub-components, thereby improving training efficiency. Empirical results on robot applications show a 30 percent increase in the probability of generating specification-compliant programs, with training time reduced by half compared to fine-tuning on generating full programs.

URL: https://openreview.net/forum?id=DasMjBILLj

---

Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

URL: https://openreview.net/forum?id=EPlpe3Fx1x

---

Title: ECLipsE-Gen-Local: Efficient Compositional Local Lipschitz Estimates for Deep Neural Networks

Abstract: The Lipschitz constant is a key measure for certifying the robustness of neural networks to input perturbations. However, computing the exact constant is NP-hard, and standard approaches to estimate the Lipschitz constant involve solving a large matrix semidefinite program (SDP) that scales poorly with network size. Further, there is potential to efficiently leverage local information on the input region to provide tighter Lipschitz estimates. We address this problem here by proposing a compositional framework that yields tight yet scalable Lipschitz estimates for deep feedforward neural networks. Specifically, we begin by developing a generalized SDP framework for Lipschitz estimation that is highly flexible, accommodating heterogeneous activation function slope bounds for each neuron on each layer, and allowing Lipschitz estimates with respect to arbitrary input-output pairs in the neural network and arbitrary choices of sub-networks of consecutive layers. We then decompose this generalized SDP into equivalent small sub-problems that can be solved sequentially, yielding the ECLipsE-Gen series of algorithms, with computational complexity that scales linearly with respect to the network depth. We also develop a variant that achieves near-instantaneous computation through closed-form solutions to each sub-problem. All our algorithms are accompanied by theoretical guarantees on feasibility and validity, with the resulting estimates serving as strict upper bounds on the true Lipschitz constant. Next, we develop a series of algorithms, termed ECLipsE-Gen-Local, that explicitly incorporate local information on the input region to provide tighter Lipschitz constant estimates. Our experiments demonstrate that our algorithms achieve substantial speedups over a multitude of benchmarks while producing significantly tighter Lipschitz bounds than global approaches. Moreover, we demonstrate that our algorithms provide strict upper bounds for the Lipschitz constant with values approaching the exact Jacobian-based constant from automatic differentiation when the input region is small enough. Finally, we demonstrate the practical utility of our approach by showing that our Lipschitz estimates closely align with network robustness. In summary, our approach considerably advances the scalability and efficiency of certifying neural network robustness, while capturing local input–output behavior to deliver provably tighter bounds, making it particularly suitable for safety-critical and adaptive learning tasks.
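For context on the baseline such compositional estimators tighten: for a feedforward network with 1-Lipschitz activations, the classical global bound is simply the product of the layer weight matrices' spectral norms. The sketch below computes that loose reference bound; it is not the ECLipsE-Gen algorithm, only the baseline that SDP-based methods improve upon.

    import numpy as np

    def naive_global_lipschitz(weight_matrices):
        # Loose global Lipschitz upper bound: product of spectral norms.
        # Valid for feedforward nets with 1-Lipschitz activations (ReLU, tanh, ...).
        return float(np.prod([np.linalg.norm(W, ord=2) for W in weight_matrices]))

    rng = np.random.default_rng(0)
    Ws = [rng.standard_normal((64, 32)),   # layer 1: R^32 -> R^64
          rng.standard_normal((32, 64)),   # layer 2: R^64 -> R^32
          rng.standard_normal((1, 32))]    # layer 3: R^32 -> R
    print(naive_global_lipschitz(Ws))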

URL: https://openreview.net/forum?id=CuqnFjeu5a

---

Title: Intra-Cluster Mixup: An Effective Data Augmentation Technique for Complementary-Label Learning

Abstract: In this paper, we investigate the challenges of complementary-label learning (CLL), a specialized form of weakly-supervised learning (WSL) where models are trained with labels indicating classes to which instances do not belong, rather than ordinary labels. This alternative supervision is appealing because collecting complementary labels is generally cheaper and less labor-intensive. Although most existing research in CLL emphasizes the development of novel loss functions, the potential of data augmentation in this domain remains largely underexplored. In this work, we uncover that the widely used Mixup data augmentation technique is ineffective when directly applied to CLL. Through in-depth analysis, we identify that the complementary-label noise generated by Mixup negatively impacts the performance of CLL models. We then propose an improved technique called Intra-Cluster Mixup (ICM), which only synthesizes augmented data from nearby examples, to mitigate the noise effect. ICM carries the benefits of encouraging complementary-label sharing among nearby examples, and leads to substantial performance improvements across synthetic and real-world labeled datasets. In particular, our wide spectrum of experimental results on both balanced and imbalanced CLL settings justifies the potential of combining ICM with state-of-the-art CLL algorithms, achieving significant accuracy increases of 30% and 10% on MNIST and CIFAR datasets, respectively.
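A minimal sketch of the Intra-Cluster Mixup idea as described above, mixing each example only with one of its k nearest neighbors; the neighborhood size, Beta parameter, and the way the corresponding complementary labels are combined are hypothetical details here, not the authors' exact procedure.

    import numpy as np

    def intra_cluster_mixup(X, k=5, alpha=1.0, seed=0):
        # Mix each example with a randomly chosen one of its k nearest neighbors.
        rng = np.random.default_rng(seed)
        n = len(X)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nbrs = np.argsort(d, axis=1)[:, :k]                 # k nearest neighbors per row
        j = nbrs[np.arange(n), rng.integers(0, k, size=n)]  # one random neighbor each
        lam = rng.beta(alpha, alpha, size=(n, 1))
        X_mix = lam * X + (1 - lam) * X[j]
        return X_mix, j, lam   # complementary labels of i and j would be mixed with the same lam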

URL: https://openreview.net/forum?id=h9PbmfznWj

---

Title: Neural Conditional Transport Maps

Abstract: We present a neural framework for learning conditional optimal transport (OT) maps between probability distributions. Conditional OT maps---transformations that adapt based on auxiliary variables such as labels, time indices, or other parameters---are essential for applications ranging from generative modeling to uncertainty quantification of black-box models. However, existing conditional OT methods face significant limitations: input convex neural networks (ICNNs) impose severe architectural constraints that reduce expressivity, while simpler conditioning strategies like concatenation fail to model fundamentally different transport behaviors across conditions. Our approach introduces a conditioning mechanism capable of simultaneously processing both categorical and continuous conditioning variables, using learnable embeddings and positional encoding. At the core of our method lies a hypernetwork that generates transport layer parameters based on these inputs, creating adaptive mappings that outperform simpler conditioning methods. We showcase the framework's practical impact through applications to global sensitivity analysis, enabling efficient computation of OT-based sensitivity indices for complex black-box models. This work advances the state-of-the-art in conditional optimal transport, enabling broader application of optimal transport principles to complex, high-dimensional domains such as generative modeling, black-box model explainability, and scientific computing.

URL: https://openreview.net/forum?id=CZvkpQc73I

---

Title: Finally outshining the Random Baseline: A simple and effective solution for Active Learning in 3D biomedical imaging

Abstract: Active learning (AL) has the potential to drastically reduce annotation costs in 3D biomedical image segmentation, where expert labeling of volumetric data is both time-consuming and expensive. Yet, existing AL methods are unable to consistently outperform improved random sampling baselines adapted to 3D data, leaving the field without a reliable solution. We introduce Class-stratified Scheduled Power Predictive Entropy (ClaSP PE), a simple and effective query strategy that addresses two key limitations of standard uncertainty-based AL methods: class imbalance and redundancy in early selections. ClaSP PE combines class-stratified querying to ensure coverage of underrepresented structures and log-scale power noising with a decaying schedule to enforce query diversity in early-stage AL and encourage exploitation later. In our evaluation on 24 experimental settings using four 3D biomedical datasets within the comprehensive nnActive benchmark, ClaSP PE is the only method that generally outperforms improved random baselines in segmentation quality, with statistically significant gains, while remaining annotation-efficient. Furthermore, we explicitly simulate real-world deployment by testing our method on four previously unseen datasets without manual adaptation, where all experiment parameters are set according to predefined guidelines. The results confirm that ClaSP PE robustly generalizes to novel tasks without requiring dataset-specific tuning. Overall, we present the first compelling evidence of an AL method that generally outperforms random baselines adapted to 3D segmentation, as measured by performance and annotation efficiency in a realistic, close-to-production scenario. Our open-source implementation and clear deployment guidelines make it readily applicable in practice. Code is at https://anonymous.4open.science/r/ClaSP-PE-4970.

URL: https://openreview.net/forum?id=UamXueEaYW

---

Title: A Systematic Evaluation of Out-of-Distribution Generalization in Climate-Aware Crop Yield Prediction

Abstract: Accurate crop yield forecasting under shifting climatic conditions is essential for food security and agricultural resilience. While recent deep learning models achieve strong performance in in-domain settings, their ability to generalize across space and time—critical for real-world deployment—remains poorly understood. In this work, we present the first systematic evaluation of temporally-aware crop yield prediction models under spatio-temporal out-of-distribution (OOD) conditions, using corn and soybean data across more than 1,200 U.S. counties. We benchmark two representative architectures, GNN-RNN and MMST-ViT, using rigorous evaluation strategies including year-ahead forecasting, leave-one-region-out validation, and stratified OOD scenarios of varying difficulty based on USDA Farm Resource Regions. Our comprehensive analysis reveals significant performance gaps across agro-ecological zones, with some models showing negative R² values under distribution shift. We uncover asymmetric transferability patterns and identify the Prairie Gateway region as consistently challenging for generalization. These findings challenge prior generalizability claims and provide practical insights for deploying agricultural AI systems under climate variability.

URL: https://openreview.net/forum?id=to4sVjsxsO

---

Title: Bayesian Ensembling: Insights from Online Optimization and Empirical Bayes

Abstract: We revisit the classical problem of Bayesian ensembles and address the challenge of learning optimal combinations of Bayesian models in an online learning setting. To this end, we reinterpret existing approaches such as Bayesian model averaging (BMA) and Bayesian stacking through a novel empirical Bayes lens, shedding new light on the limitations and pathologies of BMA. Further motivated by insights from online optimization, we propose Online Bayesian Stacking (OBS), a method that optimizes the log-score over predictive distributions to adaptively combine Bayesian models. A key contribution of our work is establishing a novel connection between OBS and portfolio selection, bridging Bayesian ensemble learning with a rich, well-studied theoretical framework that offers efficient algorithms and extensive regret analysis. We further clarify the relationship between OBS and online BMA, showing that they optimize related but distinct cost functions. Through theoretical analysis and empirical evaluation, we identify scenarios where OBS outperforms online BMA and provide principled guidance on when practitioners should prefer one approach over the other.
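The portfolio-selection connection suggests that stacking weights can be maintained online with multiplicative updates on the simplex. Below is a generic exponentiated-gradient sketch for maximizing the cumulative log-score of a mixture of predictive densities; it illustrates the flavor of such updates, not necessarily the exact OBS algorithm or its step-size choice.

    import numpy as np

    def online_stacking_eg(pred_densities, eta=0.1):
        # pred_densities: (T, K) array; entry [t, k] is model k's predictive density at the realized y_t.
        T, K = pred_densities.shape
        w = np.full(K, 1.0 / K)
        weights = []
        for t in range(T):
            p = pred_densities[t]
            mix = w @ p                 # ensemble predictive density at y_t
            grad = p / mix              # gradient of log(w @ p) with respect to w
            w = w * np.exp(eta * grad)  # exponentiated-gradient ascent step
            w /= w.sum()                # renormalize onto the simplex
            weights.append(w.copy())
        return np.array(weights)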

URL: https://openreview.net/forum?id=CCvVzmfBOn

---

Title: VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Abstract: Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4\% of the original performance. Code will be made publicly available upon acceptance.

URL: https://openreview.net/forum?id=KZYhyilFnt

---

Title: Erasing and Tampering Statistical Watermarks via Re-watermarking in Large Language Models

Abstract: The rapid development and widespread adoption of large language models have intensified concerns about copyright disputes, misinformation spread, and content authenticity. Statistical watermarking has been proposed as a potential solution for content source verification, though its reliability remains questionable. This study examines a re-watermarking attack based on text rephrasing. Our theoretical analyses and experimental results demonstrate that: (1) new watermarks can be successfully applied to already watermarked text; (2) these new watermarks effectively overwrite the originals, making them undetectable; and (3) compared to existing rephrasing-only attacks, re-watermarking causes comparable degradation in text fidelity. These findings reveal significant vulnerabilities in statistical watermarking techniques, challenging their effectiveness as reliable mechanisms for content attribution.

URL: https://openreview.net/forum?id=pTuuuGQ1zl

---

Title: Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees

Abstract: Generating novel molecules is challenging, with most representations of molecules leading to generative models producing many invalid molecules. Spanning Tree-based Graph Generation (STGG) \citep{ahn2021spanning} is a promising approach to ensure the generation of valid molecules, outperforming state-of-the-art generative models for unconditional generation. In practice, it is desirable to generate molecules conditional on one or multiple target properties rather than unconditionally. Thus, we extend STGG to multi-property conditional generation. Our approach, \highlight{\textbf{STGG+}}, incorporates a modern Transformer architecture, random masking of properties during training (enabling conditioning on \emph{any} subset of properties and classifier-free guidance), an auxiliary property-prediction loss (allowing the model to \emph{self-criticize} molecules and select the best ones), and other improvements. We show that \highlight{\textbf{STGG+}} achieves state-of-the-art performance on in-distribution and out-of-distribution conditional generation, as well as reward maximization.

URL: https://openreview.net/forum?id=QGZd5Bfb1L

---

Title: Training More Robust Classification Model via Discriminative Loss and Gaussian Noise Injection

Abstract: Robustness of deep neural networks to input noise remains a critical challenge, as naive noise injection often degrades accuracy on clean (uncorrupted) data. We propose a novel training framework that addresses this trade-off through two complementary objectives. First, we introduce a loss function applied at the penultimate layer that explicitly enforces intra-class compactness and increases the margin to analytically defined decision boundaries. This enhances feature discriminativeness and class separability for clean data. Second, we propose a class-wise feature alignment mechanism that brings noisy data clusters closer to their clean counterparts. Furthermore, we provide a theoretical analysis demonstrating that improving feature stability under additive Gaussian noise implicitly reduces the curvature of the softmax loss landscape in input space, as measured by Hessian eigenvalues. This naturally enhances robustness without explicit curvature penalties. Conversely, we also theoretically show that lower curvatures lead to more robust models. We validate the effectiveness of our method on standard benchmarks and our custom dataset. Our approach significantly reinforces model robustness to various perturbations while maintaining high accuracy on clean data, advancing the understanding and practice of noise-robust deep learning.
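One way to make the class-wise feature alignment term concrete is to pull each class's noisy-feature centroid toward its clean-feature centroid. The sketch below is a generic illustration of that idea; the noise scale sigma and weight lambda_align are free choices introduced here, and the paper's discriminative loss at the penultimate layer is not reproduced.

    import torch

    def classwise_alignment_loss(feat_clean, feat_noisy, labels):
        # Match per-class centroids of noisy features to those of clean features.
        loss = feat_clean.new_zeros(())
        classes = labels.unique()
        for c in classes:
            m = labels == c
            loss = loss + (feat_noisy[m].mean(0) - feat_clean[m].mean(0)).pow(2).sum()
        return loss / classes.numel()

    # Inside a training step (sketch):
    #   feat_clean = encoder(x)
    #   feat_noisy = encoder(x + sigma * torch.randn_like(x))   # Gaussian noise injection
    #   total = ce_loss + lambda_align * classwise_alignment_loss(feat_clean, feat_noisy, y)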

URL: https://openreview.net/forum?id=RnLfJgvST2

---

Title: Merging Memory and Space: A State Space Neural Operator

Abstract: We propose the \textit{State Space Neural Operator} (SS-NO), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). Our formulation extends structured state space models (SSMs) to joint spatiotemporal modeling, introducing two key mechanisms: \textit{adaptive damping}, which stabilizes learning by localizing receptive fields, and \textit{learnable frequency modulation}, which enables data-driven spectral selection. These components provide a unified framework for capturing long-range dependencies with parameter efficiency. Theoretically, we establish connections between SSMs and neural operators, proving a universality theorem for convolutional architectures with full field-of-view. Empirically, SS-NO achieves strong performance across diverse PDE benchmarks—including 1D Burgers' and Kuramoto–Sivashinsky equations, and 2D Navier–Stokes and compressible Euler flows—while using significantly fewer parameters than competing approaches. Our results demonstrate that state space modeling provides an effective foundation for efficient and accurate neural operator learning.

URL: https://openreview.net/forum?id=SwLxxz0x58

---

Title: Aligning time series anomaly detection research with practical applications

Abstract: The field of time series anomaly detection is hindered not by its models and algorithms, but rather by its inadequate evaluation methodologies. A growing number of researchers have claimed in recent years that various prevalent metrics, datasets, and benchmarking practices employed in the literature are flawed. In this paper, we echo this sentiment by demonstrating that widespread metrics are incongruent with desirable model behaviour in practice and that datasets are plagued by inaccurate labels and unrealistic anomaly density, amongst other issues. Furthermore, we provide suggestions and guidance on realigning theoretical research with the demands of practical applications, with the goal of establishing a stable, principled benchmarking framework within which models may be evaluated and compared fairly. Finally, we offer a perspective on the main challenges and unanswered questions in the field, alongside potential future research directions.

URL: https://openreview.net/forum?id=RyMLAr5tFU

---

Title: Understanding Embedding Scaling in Collaborative Filtering

Abstract: Scaling recommendation models into large recommendation models has become one of the most widely discussed topics. Recent efforts focus on components beyond scaling the embedding dimension, as it is believed that scaling embeddings may lead to performance degradation. Although there have been some initial observations on embeddings, the root cause of their non-scalability remains unclear. Moreover, whether performance degradation occurs across different types of models and datasets is still an unexplored area. Regarding the effect of embedding dimensions on performance, we conduct large-scale experiments across 10 datasets with varying sparsity levels and scales, using 4 representative classical architectures. We surprisingly observe two novel phenomena: double-peak and logarithmic. For the former, as the embedding dimension increases, performance first improves, then declines, rises again, and eventually drops. For the latter, performance follows a perfect logarithmic curve. Our contributions are threefold. First, we discover two novel phenomena when scaling collaborative filtering models. Second, we gain an understanding of the underlying causes of the double-peak phenomenon. Lastly, we theoretically analyze the noise robustness of collaborative filtering models, with results matching empirical observations.

URL: https://openreview.net/forum?id=3f5HtLqnaY

---

Title: Rethinking Disentanglement under Dependent Factors of Variation

Abstract: Representation learning enables the discovery and extraction of underlying factors of variation from data. A representation is typically considered disentangled when it isolates these factors in a way that is interpretable to humans. Existing definitions and metrics for disentanglement often assume that the factors of variation are statistically independent. However, this assumption rarely holds in real-world settings, limiting the applicability of such definitions and metrics in real-world applications. In this work, we propose a novel definition of disentanglement grounded in information theory, which remains valid even when the factors are dependent. We show that this definition is equivalent to requiring the representation to consist of minimal and sufficient variables. Based on this formulation, we introduce a method to quantify the degree of disentanglement that remains effective in the presence of statistical dependencies among factors. Through a series of experiments, we demonstrate that our method reliably measures disentanglement in both independent and dependent settings, where existing approaches fail under the latter.

URL: https://openreview.net/forum?id=PgwkNC63CS

---

Title: A Second Order Majorant Algorithm for Nonnegative Matrix Factorization

Abstract: Nonnegative Matrix Factorization (NMF) is a fundamental tool in unsupervised learning, widely used for tasks such as dimensionality reduction, feature extraction, representation learning, and topic modeling. Many algorithms have been developed for NMF, including the well-known Multiplicative Updates (MU) algorithm, which belongs to a broader class of majorization-minimization techniques.
In this work, we introduce a general second-order optimization framework for NMF under both quadratic and $\beta$-divergence loss functions.
This approach, called Second-Order Majorant (SOM), constructs a local quadratic majorization of the loss function by majorizing its elementwise nonnegative Hessian matrix.
It includes MU as a special case, while enabling faster variants. In particular, we propose mSOM, a new algorithm within this class that leverages a tighter local approximation to accelerate convergence. We provide a convergence analysis, showing linear convergence for individual factor updates and global convergence to a stationary point for the alternating version, AmSOM. Numerical experiments on both synthetic and real datasets demonstrate that mSOM often outperforms state-of-the-art algorithms for NMF.
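For reference, the well-known multiplicative-update (MU) special case mentioned above, for the quadratic loss min ||V - WH||_F^2 with nonnegative factors, looks as follows; this is the classical Lee--Seung scheme, not the proposed mSOM/AmSOM algorithms.

    import numpy as np

    def nmf_mu(V, rank, n_iter=200, eps=1e-10, seed=0):
        # Classical multiplicative updates for quadratic-loss NMF.
        rng = np.random.default_rng(seed)
        m, n = V.shape
        W, H = rng.random((m, rank)), rng.random((rank, n))
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H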

URL: https://openreview.net/forum?id=lm16IQmimK

---

Title: Random Projection-Induced Gaussian Latent Features for Arbitrary Style Transfer

Abstract: The style transfer technique centered around mean and variance, widely recognized as AdaIN, is the foundation of current style transfer approaches. This technique assumes that the features designated for transfer follow Gaussian distributions. However, this assumption is often difficult to meet in practice, as the features typically exhibit sparse distributions due to the significant spatial correlation inherent in natural images. To address this challenge, we propose initially projecting the sparse features into lower dimensions via random projection, and then performing style transfer on these projections. Statistically, the projections will be closer to Gaussian distributions, thereby better aligning with AdaIN's requirements and enhancing transfer performance. With the stylized projections, we can further reconstruct them back to the original feature space by leveraging compressed sensing theory, thereby obtaining the stylized features. The entire process constitutes a projection-stylization-reconstruction module, which can be seamlessly integrated into AdaIN without necessitating network retraining. Furthermore, our proposed module can also be incorporated into the recently introduced style transfer technique based on cumulative distribution functions, known as EFDM, which faces limitations when there are substantial differences in sparsity levels between content and style features. By projecting both types of features into dense, Gaussian distributions, random projection can reduce their sparsity disparity, ultimately improving performance. Experiments demonstrate that the performance improvements mentioned can be achieved on existing state-of-the-art approaches.
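To make the core operation concrete, here is a minimal numpy sketch of channel-wise AdaIN applied after a Gaussian random projection of the feature channels. The projection dimension is a hypothetical choice, and the compressed-sensing reconstruction back to the original feature space (the module's third stage) is omitted.

    import numpy as np

    def adain(content, style, eps=1e-6):
        # Standard AdaIN on features of shape (channels, positions).
        mu_c, sd_c = content.mean(1, keepdims=True), content.std(1, keepdims=True) + eps
        mu_s, sd_s = style.mean(1, keepdims=True), style.std(1, keepdims=True) + eps
        return sd_s * (content - mu_c) / sd_c + mu_s

    def projected_adain(content, style, proj_dim=64, seed=0):
        # Random projection of the channel dimension, then AdaIN on the projections.
        rng = np.random.default_rng(seed)
        P = rng.standard_normal((proj_dim, content.shape[0])) / np.sqrt(proj_dim)
        return adain(P @ content, P @ style)   # reconstruction to the original space omitted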

URL: https://openreview.net/forum?id=XBu6iqHof8

---

Title: A Concept-Centric Approach to Multi-Modality Learning

Abstract: Humans possess a remarkable ability to acquire knowledge efficiently and apply it across diverse modalities through a coherent and shared understanding of the world. Inspired by this cognitive capability, we introduce a concept-centric multi-modality learning framework built around a modality-agnostic concept space that captures structured, abstract knowledge, alongside a set of modality-specific projection models that map raw inputs onto this shared space. The concept space is decoupled from any specific modality and serves as a repository of universally applicable knowledge. Once learned, the knowledge embedded in the concept space enables more efficient adaptation to new modalities, as projection models can align with existing conceptual representations rather than learning from scratch. This efficiency is empirically validated in our experiments, where the proposed framework exhibits faster convergence compared to baseline models. In addition, the framework’s modular design supports seamless integration of new modalities, since projection models are trained independently yet produce unified outputs within the shared concept space.

We evaluate the framework on two representative downstream tasks. While the focus is not on task-specific optimization, the framework attains competitive results with a smaller training footprint, no task-specific fine-tuning, and inference performed entirely within a shared space of learned concepts that offers interpretability. These findings point toward a promising direction for developing learning systems that operate in a manner more consistent with human cognitive processes.

URL: https://openreview.net/forum?id=8WAAPP32c7

---

Title: Time series recovery from partial observations via Nonnegative Matrix Factorization

Abstract: In modern time series problems, one aims at forecasting multiple times series with possible missing and noisy values. In this paper, we introduce the Sliding Mask Method (SMM) for forecasting multiple nonnegative time series by means of nonnegative matrix completion: observed noisy values and forecast/missing values are collected into matrix form, and learning is achieved by representing its rows as a convex combination of a small number of nonnegative vectors, referred to as the archetypes. We introduce two estimates, the mask Archetypal Matrix factorization (mAMF) and the mask normalized Nonnegative Matrix Factorization (mNMF) which can be combined with the SMM method. We prove that these estimates recover the true archetypes with an error proportional to the noise. We use a proximal alternating linearized method (PALM) to compute the archetypes and the convex combination weights. We compared our estimators with state-of-the-art methods (Transformers, LSTM, SARIMAX...) in multiple time series forecasting on real data and obtain that our method outperforms them in most of the experiments.

URL: https://openreview.net/forum?id=RH7ibCiQ0i

---

Title: Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP).
However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective.
In this paper, we tackle this issue from two perspectives: \emph{view refinement} and \emph{description refinement}, termed as \textit{\textbf{Bi}-refinement for \textbf{F}ine-grained \textbf{T}ext-visual \textbf{A}lignment} (BiFTA).
\emph{View refinement} removes redundant image patches with high \emph{Intersection over Union} (IoU) ratios, resulting in more distinctive visual samples.
\emph{Description refinement} removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions.
BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity of removing redundant information in visual-text alignment.
Our code is available at: \url{https://anonymous.4open.science/r/BiFTA-TMLR-Re-submission}.
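The two refinement steps reduce to simple greedy filters over patches and descriptions; a minimal sketch under assumed thresholds (the actual selection criteria and thresholds used by BiFTA may differ).

    import numpy as np

    def iou(a, b):
        # a, b: boxes as (x1, y1, x2, y2)
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    def refine_views(boxes, iou_thresh=0.5):
        # Keep a patch only if its IoU with every already-kept patch is below the threshold.
        keep = []
        for i, b in enumerate(boxes):
            if all(iou(b, boxes[j]) < iou_thresh for j in keep):
                keep.append(i)
        return keep

    def refine_descriptions(emb, sim_thresh=0.9):
        # Keep a description only if its cosine similarity to every kept one is below the threshold.
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        keep = []
        for i in range(len(emb)):
            if all(emb[i] @ emb[j] < sim_thresh for j in keep):
                keep.append(i)
        return keep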

URL: https://openreview.net/forum?id=ZmbkzZnHO4

---

Title: S$^2$Transformer: Scalable Structured Transformers for Global Station Weather Forecasting

Abstract: Global Station Weather Forecasting (GSWF) is a key meteorological research area, critical to energy, aviation, and agriculture. Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the intrinsic nature underlying observations of the global weather system, limiting forecast performance. To address this, we propose a novel Spatial Structured Attention Block in this paper. It partitions the spatial graph into a set of subgraphs and instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph, and aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention---considering both spatial proximity and global correlation. Building on this block, we develop a multiscale spatiotemporal forecasting model S$^2$Transformer by progressively expanding subgraph scales. The resulting model is both scalable and able to produce structured spatial correlation, and meanwhile, it is easy to implement. The experimental results show that it can achieve performance improvements up to 16.8% over time series forecasting baselines at low running costs.

URL: https://openreview.net/forum?id=AL2VnKno5n

---

Title: BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts

Abstract: We investigate a failure mode of large language models (LLMs) in which benign, plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and carries concrete risks for denial-of-wallet, latency, and cross-user performance degradation. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5,000 new tokens, we evaluate BenchOverflow on nine open- and closed-source models. Across models, BenchOverflow produces pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation—a fixed conciseness reminder—attenuates right tails and lowers CSR for several strategies. Our findings reframe verbosity as a measurable risk to reliability and cost, rather than a mere stylistic quirk. BenchOverflow provides a practical, reproducible protocol for benchmarking length-control robustness in deployed LLMs.
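A minimal sketch of the cap-saturation statistic, assuming CSR@k simply denotes the fraction of generations whose token length reaches k (the benchmark's exact definition may differ).

    import numpy as np

    def cap_saturation_rate(lengths, caps=(1000, 3000, 5000)):
        # Fraction of generations whose token count reaches each cap.
        lengths = np.asarray(lengths)
        return {f"CSR@{k}": float((lengths >= k).mean()) for k in caps}

    print(cap_saturation_rate([120, 800, 1500, 3200, 5000]))
    # {'CSR@1000': 0.6, 'CSR@3000': 0.4, 'CSR@5000': 0.2}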

URL: https://openreview.net/forum?id=tiQjg5i4ii

---

Title: Hessian-aware Training for Enhancing DNN Resilience to Bitwise Corruptions in Their Parameters

Abstract: Deep neural networks are not resilient to parameter corruptions: even a single bit-flip in their parameters in memory can cause an accuracy drop of over 10%, and in the worst cases, up to 99%. This susceptibility poses great challenges in deploying models on computing platforms, where adversaries can induce random/targeted bit-flips, e.g., through software-induced fault attacks like Rowhammer. Most prior work addresses this issue with hardware or system-level approaches, such as integrating additional hardware components to verify a model’s integrity at inference. However, these methods have not been widely deployed as they require infrastructure or platform-wide modifications.

In this paper, we propose a new approach to addressing this issue: training models to be more resilient to bitwise corruptions of their parameters. Our approach, Hessian-aware training, encourages models to learn flatter loss surfaces. We show that existing training methods designed to improve generalization (e.g., through sharpness-aware minimization) do not enhance resilience to parameter corruptions. In contrast, models trained with our method demonstrate improved resilience to parameter corruptions, particularly with a 20–50% reduction in the number of bits whose individual flipping leads to a 90–100% accuracy drop. We also characterize the factors that may influence this increased resilience. Moreover, we show the synergy between our approach and existing hardware- and system-level defenses.
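As a generic illustration of the kind of curvature quantity that Hessian-aware training targets, one standard way to estimate the trace of the loss Hessian with respect to parameters is Hutchinson's estimator via Hessian-vector products. This is only a sketch of a curvature measure, not the paper's training objective.

    import torch

    def hutchinson_hessian_trace(loss, params, n_samples=1):
        # Unbiased estimate of tr(H) via Rademacher probes and Hessian-vector products.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        est = 0.0
        for _ in range(n_samples):
            vs = [torch.empty_like(p).bernoulli_(0.5) * 2 - 1 for p in params]
            hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
            est = est + sum((h * v).sum() for h, v in zip(hv, vs))
        return est / n_samples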

URL: https://openreview.net/forum?id=XxlQF4muso

---

Title: SiLVR: A Simple Language-based Video Reasoning Framework

Abstract: Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code will be made available.

URL: https://openreview.net/forum?id=mQZbh9Zlbw

---

Title: MOONSHOT: A Framework for Multi-Objective Pruning of Vision and Large Language Models

Abstract: Weight pruning is a common technique for compressing large neural networks. We focus on the challenging post-training one-shot setting, where a pre-trained model is compressed without any retraining. Existing one-shot pruning methods typically optimize a single objective, such as a layer-wise reconstruction loss or a second-order Taylor approximation of the training loss. We highlight that neither objective alone is consistently the most effective across architectures and sparsity levels. Motivated by this insight, we propose MOONSHOT, a general and flexible framework that extends any single-objective pruning method into a multi-objective formulation by jointly optimizing both the layer-wise reconstruction error and second-order Taylor approximation of the training loss. MOONSHOT acts as a wrapper around existing pruning algorithms. To enable this integration while maintaining scalability to billion-parameter models, we propose modeling decisions and introduce an efficient procedure for computing the inverse Hessian, preserving the efficiency of state-of-the-art one-shot pruners. When combined with state-of-the-art pruning methods on Llama-3.2, MOONSHOT reduces C4 perplexity by up to 32.6% at 2:4 sparsity and improves zero-shot median accuracy across seven classification benchmarks by up to 3.7 points. On Vision Transformers, it improves accuracy on ImageNet-1k by over 5 points at 70% sparsity, and on ResNet-50, it yields a 4-point gain at 90% sparsity. Our code will be made publicly available in the camera-ready version if the paper is accepted.

URL: https://openreview.net/forum?id=Ew9s7veEQU

---

Title: Automated Attention Pattern Discovery at Scale in Large Language Models

Abstract: Large language models have scaled rapidly, but interpretability methods have lagged behind, especially on real-world, noisy data that is less controlled than curated benchmarks. Existing approaches focus on fine-grained explanations of individual components, which are resource-intensive and struggle to generalize across tasks, domains, and models. To enable broader insights, we analyze and track attention patterns across predictions.

We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern Masked Autoencoder (AP-MAE), a vision transformer-based model that efficiently reconstructs masked attention patterns. Experiments on StarCoder2 models (3B–15B) show that AP-MAE (i) reconstructs masked attention patterns with high accuracy, (ii) generalizes across unseen models with minimal degradation, (iii) reveals recurring patterns across a large number of inferences, (iv) predicts whether a generation will be correct without access to ground truth, with up to 70\% accuracy, and (v) enables targeted interventions that increase accuracy by 13.6\% when applied selectively, but cause rapid collapse when applied excessively.

These results establish attention patterns as a scalable signal for interpretability and demonstrate that AP-MAE provides a transferable foundation for both analysis and intervention in large language models. Beyond its standalone value, AP-MAE can also serve as a selection procedure to guide more fine-grained mechanistic approaches toward the most relevant components. We release code and models to support future work in large-scale interpretability.

URL: https://openreview.net/forum?id=KpsUN0HAx7

---

Title: MACAW: A Causal Generative Model for Medical Imaging

Abstract: Although deep learning techniques show promising results for many neuroimaging tasks in research settings, they have not yet found widespread use in clinical scenarios. One of the reasons for this problem is that many machine learning models only identify correlations between the input images and the outputs of interest, which can lead to many practical problems, such as encoding of uninformative biases and reduced explainability. Thus, recent research is exploring whether integrating \textit{a priori} causal knowledge into deep learning models is a potential avenue to identify these problems. However, encoding causal reasoning and generating genuine counterfactuals necessitates computationally expensive invertible processes, thus restricting analyses to a small number of causal variables and rendering them infeasible for generating even 2D images. To overcome these limitations, this work introduces a new causal generative architecture named Masked Causal Flow (MACAW) for neuroimaging applications. Within this context, three main contributions are described. First, a novel approach that integrates complex causal structures into normalizing flows is proposed. Second, counterfactual prediction is performed to identify the changes in effect variables associated with a cause variable. Finally, an explicit Bayesian inference for classification is derived and implemented, providing an inherent uncertainty estimation. The feasibility of the proposed method was first evaluated using synthetic data and then using MRI brain data from more than 23,000 participants of the UK Biobank study. The evaluation results show that the proposed method can (1) accurately encode causal reasoning and generate counterfactuals highlighting the structural changes in the brain known to be associated with aging, (2) accurately predict a subject's age from a single 2D MRI slice, and (3) generate new samples assuming other values for subject-specific indicators such as age, sex, and body mass index.

URL: https://openreview.net/forum?id=eYW037oqQ4

---

Title: Close to Reality: Interpretable and Feasible Data Augmentation for Imbalanced Learning

Abstract: Many machine learning classification tasks involve imbalanced datasets, which are often subject to over-sampling techniques aimed at improving model performance. However, these over-sampling methods are prone to generating unrealistic or infeasible samples. Furthermore, they often function as black boxes, lacking interpretability in their procedures. This opacity makes it difficult to track their effectiveness and provide necessary adjustments, and they may ultimately fail to yield significant performance improvements. To bridge this gap, we introduce the Decision Predicate Graphs for Data Augmentation (DPG-da), a framework that extracts interpretable decision predicates from trained models to capture domain rules and enforce them during sample generation. This design ensures that over-sampled data remain diverse, constraint-satisfying, and interpretable. In experiments on synthetic and real-world benchmark datasets, DPG-da consistently improves classification performance over traditional over-sampling methods, while guaranteeing logical validity and offering clear, interpretable explanations of the over-sampled data.

URL: https://openreview.net/forum?id=6AUcJJKQCl

---

Title: ThinkEval: Practical Evaluation of Knowledge Leakage in LLM Editing using Thought-based Knowledge Graphs

Abstract: Robust model-editing techniques are essential for deploying large language models (LLMs) in practical applications, to enable cost-effective ways to deal with challenges such as privacy breaches, bias mitigation and misinformation spread. For example, an LLM-based healthcare assistant may need to update outdated or incorrect knowledge to prevent harmful recommendations. However, many editing techniques focus on isolated facts, which critically fail to prevent indirect knowledge leakage---the unintended reconstruction of edited-out information through persistent causal links and contextual relationships. To assist users in selecting the right editing technique, we develop and present ThinkEval, a framework to systematically quantify indirect knowledge leakage and ripple effects in model-editing. ThinkEval builds and employs specialized knowledge graphs to analyze the causal structure of facts before and after editing. To support this approach, we present KnowGIC, a benchmark dataset comprising multi-step reasoning paths that precisely measure these complex knowledge transformation effects. We evaluate five editing techniques (AlphaEdit, RECT, ROME, MEMIT, and PRUNE) across multiple LLMs. Our results show that these techniques struggle to balance indirect fact suppression with the preservation of related knowledge, compromising the contextual integrity of a model's knowledge. Our dataset is available at: https://anonymous.4open.science/r/KnowGIC.

URL: https://openreview.net/forum?id=IR2GAw90BB

---

Title: Teaching Invariance Using Privileged Mediation Information

Abstract: The performance of deep neural networks often deteriorates in out-of-distribution settings due to relying on easy-to-learn but unreliable spurious associations known as shortcuts. Recent work attempting to mitigate shortcut learning relies on a priori knowledge of the shortcuts and invariance penalties, which are difficult to enforce in practice. To address these limitations, we study two causally-motivated methods that efficiently learn models that are invariant to shortcuts by leveraging privileged mediation information. We first adapt concept bottleneck models (CBMs) to incorporate mediators -- intermediate variables that lie on the causal path between input features and target labels -- resulting in a straightforward extension we call Mediator Bottleneck Models (MBMs). One drawback of this method is that it requires two potentially large models at inference time. To address this issue, we propose Teaching Invariance using Privileged Mediation Information (TIPMI), a novel approach which distills knowledge from a counterfactually invariant teacher trained using privileged mediation information to a student predictor that uses non-privileged, easy-to-collect features. We analyze the theoretical properties of both estimators, showing that they promote invariance to multiple unknown shortcuts and can result in better finite-sample efficiency compared to commonly used regularization schemes. We empirically validate our theoretical findings by showing that TIPMI and MBM outperform several state-of-the-art methods on one language and two vision datasets.

URL: https://openreview.net/forum?id=8ZLhuo32Kz

---

Title: Improving Foundation Model Group Robustness with Auxiliary Sentence Embeddings

Abstract: This paper addresses the critical challenge of mitigating group-based biases in vision-language foundation models, a pressing issue for ensuring trustworthy AI deployment. We introduce DoubleCCA, a novel and computationally efficient framework that systematically enriches textual representations to enhance group robustness. Our key innovation is to leverage an auxiliary large sentence embedding model to capture diverse semantic perspectives, counteracting biased representations induced by limited training data. To this end, we propose a two-stage Canonical Correlation Analysis (DoubleCCA) technique: first, aligning augmented and original embeddings in a shared space; second, reconstructing invariant features to align with visual representations, thus enhancing the model's group robustness. We further propose a simple sentence augmentation approach, which aims to improve the robustness of CCA-induced subspaces. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of vision-language foundation models to group-based biases. The experiments on a variety of datasets demonstrate that our method outperforms existing methods in terms of both performance and robustness.

URL: https://openreview.net/forum?id=5rMtiB96cg

---

Title: Improving Uncertainty Quantification in Large Language Models via Semantic Embeddings

Abstract: Hallucinations remain a major safety bottleneck for large language models (LLMs), necessitating effective detection methods such as quantifying uncertainty in the model's generations. While traditional uncertainty measures based on token likelihoods fail to capture semantic uncertainty, recent approaches like Semantic Entropy (SE) and Kernel Language Entropy (KLE) focus on isolating the underlying semantic uncertainty of the LLM. However, these methods impose significant computational overhead beyond generating samples: they require numerous natural language inference (NLI) calls to compare outputs, limiting their use in latency-sensitive applications. We introduce \textbf{Semantic Embedding Uncertainty (SEU)}, a lightweight metric that directly measures semantic disagreement in embedding space. Like SE and KLE, SEU requires multiple model outputs, but crucially simplifies the subsequent analysis. SEU computes uncertainty as the average pairwise cosine distance between sentence embeddings---requiring only $M$ embedding model forward passes followed by $O(M^2)$ dot products, instead of $O(M^2)$ NLI forward passes. SEU thus facilitates real-time semantic uncertainty quantification in applications where latency is paramount. Experiments on question answering and reasoning tasks demonstrate that SEU achieves comparable or superior accuracy to SE and KLE while reducing inference latency by up to 100x, enabling its deployment in resource-constrained settings.
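The metric itself follows directly from the description: the average pairwise cosine distance over the embeddings of the M sampled answers. A minimal sketch (the choice of sentence-embedding model and any calibration are left open here):

    import numpy as np

    def semantic_embedding_uncertainty(embeddings):
        # embeddings: (M, d) sentence embeddings of M sampled answers.
        E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sims = E @ E.T                       # pairwise cosine similarities
        iu = np.triu_indices(len(E), k=1)    # each unordered pair once
        return float((1.0 - sims[iu]).mean())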

URL: https://openreview.net/forum?id=DMH79qfzsU

---

Title: Test-Time Backdoor Attacks on Multimodal Large Language Models

Abstract: Backdoor attacks typically set up a backdoor by contaminating training data or modifying parameters before the model is deployed, such that a predetermined trigger can activate harmful effects during the test phase. Can we, however, carry out test-time backdoor attacks after deploying the model? In this work, we present **AnyDoor**, a test-time backdoor attack against multimodal large language models (MLLMs), without accessing training data or modifying parameters. In AnyDoor, the burden of **setting up** backdoors is assigned to the visual modality (better capacity but worse timeliness), while the textual modality is responsible for **activating** the backdoors (better timeliness but worse capacity). This decomposition takes advantage of the characteristics of different modalities, making attacking timing more controllable compared to directly applying adversarial attacks. We empirically validate the effectiveness of AnyDoor against popular MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2, and conduct extensive ablation studies. Notably, AnyDoor can dynamically change its backdoor trigger prompts and/or harmful effects, posing a new challenge for developing backdoor defenses.

URL: https://openreview.net/forum?id=kwZsJXgMhK

---

Title: Understanding Accelerated Gradient Methods: Lyapunov Analyses and Hamiltonian-Assisted Interpretations

Abstract: We formulate two classes of first-order algorithms more general than previously studied for minimizing smooth and strongly convex or, respectively, smooth and convex functions. We establish sufficient conditions, via new discrete Lyapunov analyses, for achieving accelerated convergence rates that match Nesterov's methods in the strongly convex and general convex settings. Our results identify, for the first time, a simple and unified condition on gradient correction for accelerated convergence. Next, we study the convergence of limiting ordinary differential equations (ODEs), including high-resolution ODEs, and point out currently notable gaps between the convergence properties of the corresponding algorithms and ODEs, especially regarding the role of gradient correction. Finally, we propose a novel class of discrete algorithms, called the Hamiltonian-assisted gradient method, directly based on a Hamiltonian function and several interpretable operations, and then demonstrate meaningful and unified interpretations of our acceleration conditions in terms of the momentum variable updates.
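For reference, the classical Nesterov scheme for an $L$-smooth, $\mu$-strongly convex $f$, whose accelerated rate the conditions above recover, reads (with $\kappa = L/\mu$):

    $$x_{k+1} = y_k - \tfrac{1}{L}\nabla f(y_k), \qquad y_{k+1} = x_{k+1} + \tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,(x_{k+1} - x_k).$$

The general convex setting uses a time-varying momentum coefficient in place of the constant factor above.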

URL: https://openreview.net/forum?id=0jvg4M1W40

---
