Weekly TMLR digest for Nov 02, 2025

TMLR

Nov 2, 2025, 12:00:10 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

J2C Certification: Efficient and Unbiased Sampling from Boltzmann Distributions via Variance-Tuned Diffusion Models

Fengzhe Zhang, Laurence Illing Midgley, José Miguel Hernández-Lobato

https://openreview.net/forum?id=Jq2dcMCS5R

---


J2C Certification: Expert Routing with Synthetic Data for Domain Incremental Learning

Yewon Byun, Sanket Vaibhav Mehta, Saurabh Garg, Emma Strubell, Michael Oberst, Bryan Wilder, Zachary Chase Lipton

https://openreview.net/forum?id=QdQVfdXnsG

---


Reproducibility Certification: Kernel Space Conditional Distribution Alignment for Improving Group Fairness in Deepfake Detection

Sayantan Das, Mojtaba Kolahdouzi, Ali Etemad

https://openreview.net/forum?id=68Lv6v4N9J

---


Accepted papers
===============


Title: Thera: Aliasing-Free Arbitrary-Scale Super-Resolution with Neural Heat Fields

Authors: Alexander Becker, Rodrigo Caye Daudt, Dominik Narnhofer, Torben Peters, Nando Metzger, Jan Dirk Wegner, Konrad Schindler

Abstract: Recent approaches to arbitrary-scale single image super-resolution (ASR) use neural fields to represent continuous signals that can be sampled at arbitrary resolutions. However, point-wise queries of neural fields do not naturally match the point spread function (PSF) of pixels, which may cause aliasing in the super-resolved image. Existing methods attempt to mitigate this by approximating an integral version of the field at each scaling factor, compromising both fidelity and generalization. In this work, we introduce neural heat fields, a novel neural field formulation that inherently models a physically exact PSF. Our formulation enables analytically correct anti-aliasing at any desired output resolution, and -- unlike supersampling -- at no additional cost. Building on this foundation, we propose Thera, an end-to-end ASR method that substantially outperforms existing approaches, while being more parameter-efficient and offering strong theoretical guarantees. The project page is at https://therasr.github.io.

URL: https://openreview.net/forum?id=GU8YOfmqyg
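
To make the anti-aliasing idea concrete: a field stored as a sum of sinusoids can be blurred analytically, because diffusing it for time t multiplies each frequency component by a Gaussian (heat-kernel) factor. The toy sketch below (Python, not the authors' code; the mapping from output resolution to t is an illustrative assumption) shows how a single evaluation at diffusion time t yields PSF-filtered samples without supersampling.

    import numpy as np

    class HeatField1D:
        """Toy 1D field in a sinusoidal basis; t controls an exact Gaussian blur."""
        def __init__(self, freqs, amplitudes, phases):
            self.freqs = np.asarray(freqs, dtype=float)
            self.amps = np.asarray(amplitudes, dtype=float)
            self.phases = np.asarray(phases, dtype=float)

        def sample(self, x, t):
            # Heat-equation attenuation: component k is scaled by exp(-(2*pi*k)^2 * t)
            decay = np.exp(-(2 * np.pi * self.freqs) ** 2 * t)
            waves = np.sin(2 * np.pi * np.outer(x, self.freqs) + self.phases)
            return waves @ (self.amps * decay)

    field = HeatField1D(freqs=[1, 4, 16], amplitudes=[1.0, 0.5, 0.25], phases=[0.0, 0.0, 0.0])
    x = np.linspace(0, 1, 8, endpoint=False)   # coarse output grid
    blur_t = 1.0 / (8 ** 2)                    # assumed resolution-to-t mapping, for illustration
    print(field.sample(x, t=blur_t))           # band-limited samples, no supersampling needed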

---

Title: In-context Learning for Mixture of Linear Regression: Existence, Generalization and Training Dynamics

Authors: Yanhao Jin, Krishna Balasubramanian, Lifeng Lai

Abstract: We investigate the in-context learning capabilities of transformers for the $d$-dimensional mixture of linear regression model, providing theoretical insights into their existence, generalization bounds, and training dynamics. Specifically, we prove that there exists a transformer capable of achieving a prediction error of order $\mathcal{O}(\sqrt{d/n})$ with high probability, where $n$ represents the training prompt size in the high signal-to-noise ratio (SNR) regime. Moreover, we derive in-context excess risk bounds of order $\mathcal{O}(L/\sqrt{B})$ for the case of two mixtures, where $B$ denotes the number of training prompts, and $L$ represents the number of attention layers. The dependence of $L$ on the SNR is explicitly characterized, differing between low and high SNR settings. We further analyze the training dynamics of transformers with single linear self-attention layers, demonstrating that, with appropriately initialized parameters, gradient flow optimization over the population mean square loss converges to a global optimum. Extensive simulations suggest that transformers perform well on this task, potentially outperforming other baselines, such as the Expectation-Maximization algorithm.

URL: https://openreview.net/forum?id=buZXVuTsHY

---

Title: Identification of Average Outcome under Interventions in Confounded Additive Noise Models

Authors: Muhammad Qasim Elahi, Mahsa Ghasemi, Murat Kocaoglu

Abstract: Additive noise models (ANMs) are an important setting studied in causal inference. Most existing works on ANMs assume causal sufficiency, i.e., there are no unobserved confounders. This paper focuses on confounded ANMs, where a set of treatment variables and a target variable are affected by an unobserved confounder that follows a multivariate Gaussian distribution. We introduce a novel approach for estimating the average outcome under interventions (AOIs) for interventions on any subset of treatment variables and demonstrate that a small set of interventional distributions is sufficient to estimate all of them. In addition, we propose a randomized algorithm that further reduces the number of required interventions to poly-logarithmic in the number of nodes. Finally, we demonstrate that these interventions are also sufficient to recover the causal structure between the observed variables. This establishes that a poly-logarithmic number of interventions is sufficient to infer the causal effects of any subset of treatments on the outcome in confounded ANMs with high probability, even when the causal structure between treatments is unknown. The simulation results indicate that our method can accurately estimate all AOIs in the finite-sample regime. We also demonstrate the practical significance of our algorithm by evaluating it on semi-synthetic data.

URL: https://openreview.net/forum?id=y5YnHzLf1d

---

Title: Improving Adversarial Training for Two-player Competitive Games via Episodic Reward Engineering

Authors: Siyuan Chen, Fuyuan Zhang, Zhuo Li, Xiongfei Wu, Jianlang Chen, Pengzhan Zhao, Lei Ma, Jianjun Zhao

Abstract: In recent years, training adversarial agents has become an effective and practical approach for attacking neural network policies. However, we observe that existing methods can be further enhanced by distinguishing between states that lead to wins or losses and using reward engineering to steer policy training toward winning states. In this paper, we introduce a novel adversarial training method with reward engineering for two-player competitive games. Our method extracts historical evaluations of states from past experiences with an episodic memory, and then incorporates these evaluations into the rewards with our proposed reward revision method to improve adversarial policy optimization. We evaluate our approach on two-player competitive games in MuJoCo simulation environments, demonstrating that our method achieves the strongest attack performance and defense difficulty against the victims among existing adversarial policy training techniques.

URL: https://openreview.net/forum?id=z4XtJWJC9K

---

Title: TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding

Authors: Pham Phuoc Minh Quang, Nguyen Tiet Nguyen Khoi, Ngo Chi Lan, Do Tho Truong, Dezhen Song, Truong-Son Hy

Abstract: Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view data. Furthermore, a major limitation of prior approaches is the lack of temporal modeling to capture time-dependent relationships among dynamically evolving entities in a scene. To address these challenges, we propose the Temporal Equivariant Scene Graph Neural Network (TESGNN), consisting of two key components: (1) an Equivariant Scene Graph Neural Network (ESGNN), which extracts information from 3D point clouds to generate scene graphs while preserving crucial symmetry properties, and (2) a Temporal Graph Matching Network, which fuses scene graphs generated by ESGNN across multiple time sequences into a unified global representation using an approximate graph-matching algorithm. Our combined architecture, TESGNN, is shown to be more effective than existing methods in scene graph generation, achieving higher accuracy and faster training convergence. Moreover, we show that leveraging the symmetry-preserving property produces a more stable and accurate global scene representation compared to existing approaches. Finally, it is computationally efficient and easily implementable using existing frameworks, making it well-suited for real-time applications in robotics and computer vision. This approach paves the way for more robust and scalable solutions to complex multi-view scene understanding challenges.

URL: https://openreview.net/forum?id=boM0kkYPzE

---

Title: Weakly Supervised Object Segmentation by Background Conditional Divergence

Authors: Hassan Baker, Matthew Emigh, Austin J. Brockmeier

Abstract: As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds into backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar images, in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality, we extend our experiments to natural images, obtaining reasonable performance with our method, which avoids pretrained networks, generative networks, and adversarial critics.

URL: https://openreview.net/forum?id=2JJZhfGvMW

---

Title: Efficient and Unbiased Sampling from Boltzmann Distributions via Variance-Tuned Diffusion Models

Authors: Fengzhe Zhang, Laurence Illing Midgley, José Miguel Hernández-Lobato

Abstract: Score-based diffusion models (SBDMs) are powerful amortized samplers for Boltzmann distributions; however, imperfect score estimates bias downstream Monte Carlo estimates. Classical importance sampling (IS) can correct this bias, but computing exact likelihoods requires solving the probability-flow ordinary differential equation (PF–ODE), a procedure that is prohibitively costly and scales poorly with dimensionality. We introduce Variance-Tuned Diffusion Importance Sampling (VT-DIS), a lightweight post-training method that adapts the per-step noise covariance of a pretrained SBDM by minimizing the $\alpha$-divergence $(\alpha=2)$ between its forward diffusion and reverse denoising trajectories. VT-DIS assigns a single trajectory-wise importance weight to the joint forward–reverse process, yielding unbiased expectation estimates at test time with negligible inference-time overhead compared to standard sampling. On the DW-4, LJ-13, and alanine-dipeptide benchmarks, VT-DIS achieves effective sample sizes of approximately 80%, 35%, and 3.5%, respectively, while using only a fraction of the computational budget required by vanilla diffusion + IS or PF-ODE–based IS.

URL: https://openreview.net/forum?id=Jq2dcMCS5R
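
As background for the importance-sampling correction and the effective-sample-size (ESS) figures quoted above, here is a generic, self-contained sketch of trajectory-wise importance weighting with an ESS diagnostic. The log-densities below are stand-ins; in VT-DIS they would come from the joint forward-diffusion and reverse-denoising processes.

    import numpy as np

    def importance_estimate(log_p_target, log_q_proposal, f_values):
        """Self-normalized IS estimate of E_p[f] from samples drawn under q, plus ESS."""
        log_w = log_p_target - log_q_proposal          # one log-weight per trajectory
        log_w -= log_w.max()                           # numerical stabilization
        w = np.exp(log_w)
        w /= w.sum()                                   # normalized weights
        ess = 1.0 / np.sum(w ** 2)                     # effective sample size
        return np.sum(w * f_values), ess

    rng = np.random.default_rng(0)
    n = 1000
    samples = rng.normal(loc=0.5, scale=1.2, size=n)   # proposal q = N(0.5, 1.2^2)
    log_q = -0.5 * ((samples - 0.5) / 1.2) ** 2 - np.log(1.2 * np.sqrt(2 * np.pi))
    log_p = -0.5 * samples ** 2 - 0.5 * np.log(2 * np.pi)   # target p = N(0, 1)
    est, ess = importance_estimate(log_p, log_q, f_values=samples ** 2)
    print(f"E_p[x^2] estimate: {est:.3f}, ESS: {ess:.0f} / {n}")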

---

Title: Neural ODE and SDE Models for Adaptation and Planning in Model-Based Reinforcement Learning

Authors: Chao Han, Stefanos Ioannou, Luca Manneschi, T.J. Hayward, Michael Mangan, Aditya Gilra, Eleni Vasilaki

Abstract: We investigate neural ordinary and stochastic differential equations (neural ODEs and SDEs) to model stochastic dynamics in fully and partially observed environments within a model-based reinforcement learning (RL) framework. Through a sequence of simulations, we show that neural SDEs more effectively capture transition dynamics’ inherent stochasticity, enabling high-performing policies with improved sample efficiency in challenging scenarios. We leverage neural ODEs and SDEs for efficient policy adaptation to changes in environment dynamics via inverse models, requiring only limited interactions with the new environment. To address partial observability, we introduce a latent SDE model that combines an ODE and a GAN-trained stochastic component in latent space. Policies derived from this model offer a strong baseline, outperforming or matching general model-based and model-free approaches across stochastic continuous-control benchmarks. This work illustrates the applicability of action-conditional latent SDEs for RL planning in environments with stochastic transitions. Our code is available at: https://github.com/ChaoHan-UoS/NeuralRL.

URL: https://openreview.net/forum?id=T6OrPlyPV4

---

Title: Differentiated Aggregation to Improve Generalization in Federated Learning

Authors: Peyman Gholami, Hulya Seferoglu

Abstract: This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on local clients’ generalization and the heterogeneity of the data distribution (non-iid scenario). We also characterize a generalization bound for R-round federated learning and its relation to the number of local updates (local stochastic gradient descent (SGD) steps). Then, based on our generalization bound analysis and its interpretation through representation learning, we infer that less frequent aggregation of the representation extractor (which typically corresponds to the initial layers) compared to the head (usually the final layers) leads to more generalizable models, particularly in non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs different aggregation frequencies for different parts of the model, thereby reducing the communication cost. Experimental results show the effectiveness of FedALS. Our code is available for reproducibility.

URL: https://openreview.net/forum?id=F5hgpQ1Ccd
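
A minimal sketch of the schedule the abstract describes, assuming the model state is split into extractor and head parameters (the names and the period of 5 are illustrative, not the paper's implementation): the head is averaged every round, while the extractor is averaged only every few rounds, which cuts communication.

    import copy

    def average_state_dicts(states, keys):
        """Uniformly average the given keys across client state dicts."""
        return {k: sum(s[k] for s in states) / len(states) for k in keys}

    def aggregate_round(global_state, client_states, round_idx,
                        extractor_keys, head_keys, extractor_period=5):
        """Aggregate the head every round; the extractor only every `extractor_period` rounds."""
        new_state = copy.deepcopy(global_state)
        new_state.update(average_state_dicts(client_states, head_keys))
        if round_idx % extractor_period == 0:
            new_state.update(average_state_dicts(client_states, extractor_keys))
        return new_state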

---

Title: Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection

Authors: Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda

Abstract: Parameter Efficient Fine-Tuning (PEFT) has become the de facto approach in adapting Large Language Models (LLMs) for downstream tasks in Natural Language Processing. However, its adoption in privacy-preserving distributed learning frameworks, such as Federated Learning (FL), remains relatively limited. This is mainly due to challenges specific to FL, such as resource-constrained devices and diverse data distributions among clients. In this paper, we propose an efficient method to perform PEFT within the FL framework for Multi-Head Attention (MHA) based language models. We address the challenges through head pruning, a novel head-specific weighted aggregation mechanism, and a client selection strategy. Head pruning minimizes training complexity within the clients, guided by the importance score computed based on the confidence of the attention head. Weighted aggregation of heads ensures the global model captures crucial updates from diverse clients, complementing our client selection strategy. We show results on the MultiNLI benchmark along with 20 Newsgroups, XL-Sum, and E2E NLG datasets. We use the MultiNLI dataset and T5-small model with LoRA as our PEFT method, attaining sparsity levels of up to 90\%, resulting in a communication advantage of up to 1.8x and a reduction in training OPs of 3.9x while maintaining the accuracy drop under 2\%.

URL: https://openreview.net/forum?id=WFpicZbAHe

---

Title: Analysis of generalization capacities of Neural Ordinary Differential Equations

Authors: Madhusudan Verma, Manoj Kumar

Abstract: Neural ordinary differential equations (neural ODEs) are a widely used class of deep learning models characterized by continuous depth. Understanding the generalization error bound is important for evaluating how well a model is expected to perform on new, unseen data. Earlier work in this direction considered a linear dynamics function (the function that models the evolution of the state variables) of the neural ODE (Marion, 2023). Other related work bounds the generalization error of neural controlled ODEs in terms of the sampling gap (Bleistein & Guilloux, 2023). We consider a class of neural ODEs with a general nonlinear dynamics function, for both time-dependent and time-independent cases, that is Lipschitz with respect to the state variables. We observe that the solution of a neural ODE is of bounded variation under this Lipschitz assumption on the dynamics function. We derive generalization bounds for time-dependent and time-independent neural ODEs and show the effect of overparameterization and the domain bound on the generalization error bound. To our knowledge, this is the first generalization bound for neural ODEs with a general nonlinear dynamics function.

URL: https://openreview.net/forum?id=CxW6TF1rOF

---

Title: Robust Multimodal Learning via Cross-Modal Proxy Tokens

Authors: Md Kaykobad Reza, Ameya Patil, Mashhour Solh, Salman Asif

Abstract: Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality without requiring explicit modality generation or auxiliary networks. To efficiently learn these approximations with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning. The code for this paper is available at: https://github.com/CSIPlab/Cross-Modal-Proxy-Tokens.

URL: https://openreview.net/forum?id=Wtc6wvcYJ0
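
A rough PyTorch sketch of how a cross-modal proxy token could work, as we read the abstract: a learned query attends over the available modality's tokens to approximate the missing modality's class token, and is trained with an alignment loss when both modalities are present. Shapes, the MSE alignment loss, and module names are assumptions, not the paper's code.

    import torch
    import torch.nn as nn

    class CrossModalProxy(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.proxy = nn.Parameter(torch.randn(1, 1, dim))       # learned query token
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, available_tokens):
            # available_tokens: (batch, num_tokens, dim) from the present modality
            q = self.proxy.expand(available_tokens.size(0), -1, -1)
            proxy_cls, _ = self.attn(q, available_tokens, available_tokens)
            return proxy_cls.squeeze(1)                              # approximated class token

    model = CrossModalProxy()
    img_tokens = torch.randn(8, 49, 256)        # e.g. patch tokens (assumed shapes)
    true_text_cls = torch.randn(8, 256)         # real class token of the other modality
    pred_cls = model(img_tokens)
    align_loss = nn.functional.mse_loss(pred_cls, true_text_cls)    # alignment term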

---

Title: Know Yourself and Know Your Neighbour : A Syntactically Informed Self-Supervised Compositional Sentence Representation Learning Framework using a Recursive Hypernetwork

Authors: Vasudevan Nedumpozhimana, John D. Kelleher

Abstract: Sentence representation learning is still an open challenge in Natural Language Processing. In this work, we propose a new self-supervised framework for learning sentence representations, using a special type of neural network called a recursive hypernetwork. Our proposed model composes the representation of a sentence from representations of words by applying a recursive composition through the parse tree. We maintain a separate syntactic and semantic representation, and the semantic composition is guided by the information from the syntactic representation. To train this model, we introduce a novel set of six self-supervised tasks. By analysing the performance on 7 probing tasks, we validate that the generated sentence representation encodes richer linguistic information than both averaging baselines and state-of-the-art alternatives. Furthermore, we assess the impact of the six proposed self-supervised training tasks through ablation studies. We also demonstrate that the representations generated by our model are stable for sentences of varying length and that the semantic composition operators adapt to different syntactic categories.

URL: https://openreview.net/forum?id=gfBiJv7r51

---

Title: Efficient Distillation of Classifier-Free Guidance using Adapters

Authors: Cristian Perez Jensen, Seyedmorteza Sadat

Abstract: While classifier-free guidance (CFG) is essential for conditional diffusion models, it doubles the number of neural function evaluations (NFEs) per inference step. To mitigate this inefficiency, we introduce adapter guidance distillation (AGD), a novel approach that simulates CFG in a single forward pass. AGD leverages lightweight adapters to approximate CFG, effectively doubling the sampling speed while maintaining or even improving sample quality. Unlike prior guidance distillation methods that tune the entire model, AGD keeps the base model frozen and only trains minimal additional parameters ($\sim$2%) to significantly reduce the resource requirement of the distillation phase. Additionally, this approach preserves the original model weights and enables the adapters to be seamlessly combined with other checkpoints derived from the same base model. We also address a key mismatch between training and inference in existing guidance distillation methods by training on CFG-guided trajectories instead of standard diffusion trajectories. Through extensive experiments, we show that AGD achieves comparable or superior FID to CFG across multiple architectures with only half the NFEs. Notably, our method enables the distillation of large models ($\sim$2.6B parameters) on a single consumer GPU with 24 GB of VRAM, making it more accessible than previous approaches that require multiple high-end GPUs. We will publicly release the implementation of our method.

URL: https://openreview.net/forum?id=uMz8FiiW01
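
The distillation target itself is standard classifier-free guidance; a hedged sketch of a single distillation step, with a frozen base model and a trainable adapter-augmented copy (placeholder callables and guidance scale, not the authors' released code), might look like:

    import torch

    def cfg_target(base, x_t, t, cond, guidance_scale=7.5):
        """Standard CFG combination: two forward passes of the frozen teacher."""
        with torch.no_grad():
            eps_uncond = base(x_t, t, cond=None)
            eps_cond = base(x_t, t, cond=cond)
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    def distill_step(base, adapted, x_t, t, cond, optimizer, guidance_scale=7.5):
        """One step: single adapter-augmented pass matches the two-pass CFG target."""
        target = cfg_target(base, x_t, t, cond, guidance_scale)
        pred = adapted(x_t, t, cond=cond)        # only adapter parameters require grad
        loss = torch.nn.functional.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()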

---

Title: Expert Routing with Synthetic Data for Domain Incremental Learning

Authors: Yewon Byun, Sanket Vaibhav Mehta, Saurabh Garg, Emma Strubell, Michael Oberst, Bryan Wilder, Zachary Chase Lipton

Abstract: In many real-world settings, regulations and economic incentives permit the sharing of models but not data across institutional boundaries. In such scenarios, practitioners might hope to adapt models to new domains, without losing performance on previous domains (so-called catastrophic forgetting). While any single model may struggle to achieve this goal, learning an ensemble of domain-specific experts offers the potential to adapt more closely to each individual institution. However, a core challenge in this context is determining which expert to deploy at test time. In this paper, we propose Generate to Discriminate (G2D), a domain-incremental learning method that leverages synthetic data to train a domain-discriminator that routes samples at inference time to the appropriate expert. Surprisingly, we find that leveraging synthetic data in this capacity is more effective than using the samples to \textit{directly} train the downstream classifier (the more common approach to leveraging synthetic data in the lifelong learning literature). We observe that G2D outperforms competitive domain-incremental learning methods on tasks in both vision and language modalities, providing a new perspective on the use of synthetic data in the lifelong learning literature.

URL: https://openreview.net/forum?id=QdQVfdXnsG
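
The inference-time routing described above is simple to picture; a minimal sketch (illustrative names only) with a domain discriminator selecting among per-domain experts:

    import torch

    @torch.no_grad()
    def route_and_predict(x, domain_discriminator, experts):
        """experts: list of per-domain classifiers, one per domain seen so far."""
        domain_logits = domain_discriminator(x)          # (batch, num_domains)
        domain_id = domain_logits.argmax(dim=-1)          # route each sample to an expert
        preds = torch.stack([experts[d](x[i:i + 1]).squeeze(0)
                             for i, d in enumerate(domain_id.tolist())])
        return preds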

---

Title: Mesh-Informed Neural Operator : A Transformer Generative Approach

Authors: Yaozhong Shi, Zachary E Ross, Domniki Asimaki, Kamyar Azizzadenesheli

Abstract: Generative models in function spaces, situated at the intersection of generative modeling and operator learning, are attracting increasing attention due to their immense potential in diverse scientific and engineering applications. While functional generative models are theoretically domain- and discretization-agnostic, current implementations heavily rely on the Fourier Neural Operator (FNO), limiting their applicability to regular grids and rectangular domains. To overcome these critical limitations, we introduce the Mesh-Informed Neural Operator (MINO). By leveraging graph neural operators and cross-attention mechanisms, MINO offers a principled, domain- and discretization-agnostic backbone for generative modeling in function spaces. This advancement significantly expands the scope of such models to more diverse applications in generative, inverse, and regression tasks. Furthermore, MINO provides a unified perspective on integrating neural operators with general advanced deep learning architectures. Finally, we introduce a suite of standardized evaluation metrics that enable objective comparison of functional generative models, addressing another critical gap in the field.

URL: https://openreview.net/forum?id=K8qAuRfv0G

---

Title: Accelerated Training on Low-Power Edge Devices

Authors: Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Osama Abboud, Ramin Khalili, Heba Khdr, Joerg Henkel

Abstract: Training on edge devices poses several challenges as these devices are generally resource-constrained, especially in terms of power.
State-of-the-art techniques at the device level reduce the GPU frequency to enforce power constraints, leading to a significant increase in training time. To accelerate training, we propose to jointly adjust the system and application parameters (in our case, the GPU frequency and the batch size of the training task) while adhering to the power constraints on devices. We introduce a novel cross-layer methodology that combines predictions of batch size efficiency and device profiling to achieve the desired optimization. Our evaluation on real hardware shows that our method outperforms the current baselines that rely on state-of-the-art techniques, reducing the training time by up to $2.3\times$ with results very close to optimal. Our measurements also indicate a substantial reduction in the overall energy used for the training process. These gains are achieved without reducing the performance of the trained model.

URL: https://openreview.net/forum?id=cGjQ41jBEn

---

Title: On Joint Regularization and Calibration in Deep Ensembles

Authors: Laurits Fredsgaard, Mikkel N. Schmidt

Abstract: Deep ensembles are a powerful tool in machine learning, improving both model performance and uncertainty calibration. While ensembles are typically formed by training and tuning models individually, evidence suggests that jointly tuning the ensemble can lead to better performance. This paper investigates the impact of jointly tuning weight decay, temperature scaling, and early stopping on both predictive performance and uncertainty quantification. Additionally, we propose a partially overlapping holdout strategy as a practical compromise between enabling joint evaluation and maximizing the use of data for training. Our results demonstrate that jointly tuning the ensemble generally matches or improves performance, with significant variation in effect size across different tasks and metrics. We highlight the trade-offs between individual and joint optimization in deep ensemble training, with the overlapping holdout strategy offering an attractive practical solution. We believe our findings provide valuable insights and guidance for practitioners looking to optimize deep ensemble models.

URL: https://openreview.net/forum?id=6xqV7DP3Ep

---

Title: Offline Learning and Forgetting for Reasoning with Large Language Models

Authors: Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor

Abstract: Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model’s search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.

URL: https://openreview.net/forum?id=RF6raEUATc
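
The abstract does not spell out the objective, but one plausible instantiation of "learning" on successful paths and "forgetting" failed ones is to descend on the causal-LM loss for successes and ascend on it for failures. The sketch below (HuggingFace-style model interface, assumed forget_weight) is illustrative only, not the paper's method; the abstract also stresses using a small learning rate.

    import torch

    def learn_forget_loss(model, success_batch, failure_batch, forget_weight=0.1):
        """Both batches hold (input_ids, labels) tensors for a causal LM loss."""
        learn_nll = model(input_ids=success_batch["input_ids"],
                          labels=success_batch["labels"]).loss
        forget_nll = model(input_ids=failure_batch["input_ids"],
                           labels=failure_batch["labels"]).loss
        # Descend on successful reasoning paths, ascend (unlikelihood-style) on failed ones.
        return learn_nll - forget_weight * forget_nll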

---

Title: Kernel Space Conditional Distribution Alignment for Improving Group Fairness in Deepfake Detection

Authors: Sayantan Das, Mojtaba Kolahdouzi, Ali Etemad

Abstract: We introduce FairAlign, a new method to reduce bias and improve group fairness in deepfake detection by aligning conditional distributions of embeddings in a high-dimensional kernel space. Our approach reduces information related to sensitive attributes in the embedding space that could potentially bias the detection process, thus promoting fairness. FairAlign is a versatile plug-and-play loss term compatible with various deepfake detection networks and is capable of enhancing group fairness without compromising detection performance. In addition to applying FairAlign for reducing gender bias, we implement a systematic pipeline for the annotation of skin tones and promotion of fairness in deepfake detection related to this sensitive attribute. Finally, we perform the first comprehensive study toward quantifying and understanding the trade-off between fairness and accuracy in the context of deepfake detection. We use three public deepfake datasets, FaceForensics++, CelebDF, and WildDeepfake, to evaluate our method. Through various experiments, we observe that FairAlign outperforms other bias-mitigating methods across various deepfake detection backbones for both gender and skin tone, setting a new state-of-the-art. Moreover, our fairness-accuracy trade-off analysis shows that our approach achieves the best overall performance when considering effectiveness in both deepfake detection and reducing bias. We release the code at: https://github.com/Mkolahdoozi/FairAlign.

URL: https://openreview.net/forum?id=68Lv6v4N9J
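
As a generic illustration of aligning conditional embedding distributions in a kernel space, the sketch below adds an RBF-kernel MMD penalty between sensitive groups, computed within each class label. This shows the general idea only; it is not FairAlign's exact loss term.

    import torch

    def rbf_mmd(x, y, sigma=1.0):
        """Biased MMD^2 estimate with an RBF kernel."""
        def k(a, b):
            d = torch.cdist(a, b) ** 2
            return torch.exp(-d / (2 * sigma ** 2))
        return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

    def conditional_alignment_penalty(emb, labels, groups):
        """Sum of MMDs between sensitive groups, computed within each class label."""
        penalty = emb.new_zeros(())
        for c in labels.unique():
            g0 = emb[(labels == c) & (groups == 0)]
            g1 = emb[(labels == c) & (groups == 1)]
            if len(g0) > 1 and len(g1) > 1:
                penalty = penalty + rbf_mmd(g0, g1)
        return penalty   # added to the detection loss as a plug-and-play term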

---

Title: On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective

Authors: Gabriel Mongaras, Eric C. Larson

Abstract: Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main drawback of softmax attention is the quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid the quadratic bottleneck of softmax attention. Despite these linear forms of attention being derived from the original softmax formulation, they typically lag in terms of downstream accuracy. While strong intuition suggests that the softmax nonlinearity applied to the query-key inner product has desirable properties compared to other nonlinearities, the question of why this discrepancy exists remains unanswered. This work demonstrates that linear attention is a first-order approximation of the softmax numerator by deriving its full recurrent form. We further show empirically that the denominator's function can be effectively replaced by a simple vector norm. Using this form, each part of softmax attention can be described in the language of recurrent neural networks (RNNs). Describing softmax attention as an RNN allows for the ablation of the components of softmax attention to understand the importance of each part and how they interact. In this way, our work helps explain why softmax attention is more expressive than its counterparts. Code is available at: https://github.com/gmongaras/On-the-Expressiveness-of-Softmax-Attention-A-Recurrent-Neural-Network-Perspective

URL: https://openreview.net/forum?id=PHcITOi3vV
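
The first-order-approximation claim is easy to check numerically: for small query-key inner products, exp(q·k) is close to 1 + q·k, and the linear-attention numerator can be accumulated as a recurrent state. A small self-contained check (ours, not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)
    d, T = 4, 6
    Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
    Q, K = 0.1 * Q, 0.1 * K          # small inner products so the expansion is visible

    # Softmax-attention numerator at the last step (causal): sum_t exp(q.k_t) v_t
    q = Q[-1]
    softmax_num = sum(np.exp(q @ K[t]) * V[t] for t in range(T))

    # Linear-attention numerator via a recurrent state: sum_t (1 + q.k_t) v_t
    S = np.zeros((d, d)); z = np.zeros(d)
    for t in range(T):
        S += np.outer(K[t], V[t])    # rank-1 state update, memory independent of T
        z += V[t]
    linear_num = z + S.T @ q

    print(np.abs(softmax_num - linear_num).max())   # small, since exp(x) ~ 1 + x for small x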

---

Title: ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Authors: Amir Aghdam, Vincent Tao Hu, Björn Ommer

Abstract: We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image–language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image–language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video–text supervision or fine-tuning, ActAlign achieves 30.4% accuracy on ActionAtlas—the most diverse benchmark of fine-grained actions across multiple sports—where human performance is only 61.6%. ActAlign outperforms billion-parameter video–language models while using $\sim 8\times$ fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image–language models for fine-grained video understanding.

URL: https://openreview.net/forum?id=Nwzn4qMTGb
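
The alignment step is classical dynamic time warping over a cosine-distance matrix between frame embeddings and an ordered list of sub-action text embeddings, with the lowest-cost class winning. A self-contained sketch with random placeholder embeddings standing in for image-language features (not the authors' code):

    import numpy as np

    def dtw_cost(frames, steps):
        """Classic DTW over cosine distances; frames: (T, d), steps: (S, d)."""
        f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
        s = steps / np.linalg.norm(steps, axis=1, keepdims=True)
        dist = 1.0 - f @ s.T                                   # (T, S) cosine distances
        T, S = dist.shape
        acc = np.full((T + 1, S + 1), np.inf); acc[0, 0] = 0.0
        for i in range(1, T + 1):
            for j in range(1, S + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
        return acc[T, S]

    rng = np.random.default_rng(0)
    video_frames = rng.normal(size=(32, 512))                  # frame embeddings (placeholder)
    class_scripts = {c: rng.normal(size=(5, 512))              # ordered sub-action embeddings
                     for c in ["windmill_dunk", "eurostep"]}
    pred = min(class_scripts, key=lambda c: dtw_cost(video_frames, class_scripts[c]))
    print(pred)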

---

Title: An Empirical Study of the Accuracy-Robustness Trade-off and Training Efficiency in Robust Self-Supervised Learning

Authors: Fatemeh Ghofrani, Mehdi Yaghouti, Pooyan Jamshidi

Abstract: Self-supervised learning (SSL) has made significant strides in learning image representations, yet its principles remain partially understood, particularly in adversarial scenarios. This work explores the interplay between SSL and adversarial training (AT), focusing on whether this integration can yield robust representations that balance computational efficiency, clean accuracy, and robustness. A major challenge lies in the inherently high cost of AT, which combines an inner maximization problem (generating adversarial examples) with an outer minimization problem (training representations). This challenge is exacerbated by the extensive training epochs required for SSL convergence, which become even more demanding in adversarial settings.

Recent advances in SSL, such as Extreme-Multi-Patch Self-Supervised Learning (EMP-SSL), have demonstrated that increasing the number of patches per image instance can significantly reduce the number of training epochs. Building on this, we introduce Robust-EMP-SSL, an extension of EMP-SSL specifically designed for adversarial training scenarios. Robust-EMP-SSL is a framework that leverages multiple crops per image to enhance data diversity, integrates invariance terms with regularization to prevent collapse, and optimizes adversarial training efficiency by reducing the required training epochs. By aligning these components, Robust-EMP-SSL enables the learning of robust representations while addressing the high computational costs and accuracy trade-offs inherent in adversarial training.

This study poses a central question: "How can multiple crops or diverse patches, combined with adversarial training strategies, achieve trade-offs between computational efficiency, clean accuracy, and robustness?"

Our empirical results show that Robust-EMP-SSL not only accelerates convergence, but also achieves a superior balance between clean accuracy and adversarial robustness, outperforming SimCLR, a widely used self-supervised baseline that, like other methods, relies on only two augmentations. Furthermore, we propose the Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL) method, which incorporates free adversarial training into the multi-crop SSL framework. CF-AMC-SSL demonstrates the potential to enhance both clean accuracy and adversarial robustness under reduced epoch conditions, further improving efficiency.

These findings highlight the potential of Robust-EMP-SSL and CF-AMC-SSL to make SSL more practical in adversarial scenarios, paving the way for future empirical explorations and real-world applications.

URL: https://openreview.net/forum?id=WTqHDiETg5

---

Title: Geometric Optimal Transport for Unsupervised Domain Adaptation

Authors: Gal Maman, Ronen Talmon

Abstract: Optimal Transport (OT) is a widely used and powerful approach in domain adaptation.
While effective, most existing methods rely on the pairwise squared Euclidean distances for the transportation cost, implicitly assuming a Euclidean space.
In this paper, we challenge this assumption by introducing Geometric Optimal Transport (GOT), a new transport cost designed for domain adaptation under the manifold assumption.
By utilizing concepts and tools from the field of manifold learning, specifically diffusion geometry, we derive an operator that accounts for the intra-domain geometries, extending beyond the conventional inter-domain distances.
This operator, which quantifies the probability of transporting between source and target samples, forms the basis for our cost.
We demonstrate how the proposed cost, defined by an anisotropic diffusion process, naturally aligns with the desired properties for domain adaptation.
To further enhance performance, we integrate source labels into the operator, thereby guiding the anisotropic diffusion according to the classes.
We showcase the effectiveness of GOT through comprehensive experiments, demonstrating its superior performance compared to recent methods across various benchmarks and datasets.

URL: https://openreview.net/forum?id=8Nef4vZUzU

---

Title: PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization

Authors: Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Hieu Le, Doruk Oner, Pascal Fua

Abstract: Accurate 3D shape representation is essential in engineering applications such as design, optimization, and simulation. In practice, engineering workflows require structured, part-based representations, as objects are inherently designed as assemblies of distinct components. However, most existing methods either model shapes holistically or decompose them without predefined part structures, limiting their applicability in real-world design tasks. We propose PartSDF, a supervised implicit representation framework that explicitly models composite shapes with independent, controllable parts while maintaining shape consistency. Thanks to its simple but innovative architecture, PartSDF outperforms both supervised and unsupervised baselines in reconstruction and generation tasks. We further demonstrate its effectiveness as a structured shape prior for engineering applications, enabling precise control over individual components while preserving overall coherence.

URL: https://openreview.net/forum?id=zl43C1yBKv

---

Title: Privacy Risks and Preservation Methods in Explainable Artificial Intelligence: A Scoping Review

Authors: Sonal Allana, Mohan Kankanhalli, Rozita Dara

Abstract: Explainable Artificial Intelligence (XAI) has emerged as a pillar of Trustworthy AI and aims to bring transparency to complex models that are opaque by nature. Despite the benefits of incorporating explanations in models, there is an urgent need to address the privacy concerns of providing this additional information to end users. In this article, we conduct a scoping review of existing literature to elicit details on the conflict between privacy and explainability. Using the standard methodology for scoping reviews, we extracted 57 articles from 1,943 studies published from January 2019 to December 2024. The review addresses three research questions to give readers a better understanding of the topic: (1) what are the privacy risks of releasing explanations in AI systems? (2) what current methods have researchers employed to achieve privacy preservation in XAI systems? (3) what constitutes a privacy preserving explanation? Based on the knowledge synthesized from the selected studies, we categorize the privacy risks and preservation methods in XAI and propose the characteristics of privacy preserving explanations to aid researchers and practitioners in understanding the requirements of XAI that is privacy compliant. Lastly, we identify the challenges in balancing privacy with other system desiderata and provide recommendations for achieving privacy preserving XAI. We expect that this review will shed light on the complex relationship of privacy and explainability, both being fundamental principles of Trustworthy AI.

URL: https://openreview.net/forum?id=q9nykJfzku

---

Title: Avoiding Structural Pitfalls: Self-Supervised Low-Rank Feature Tuning for Graph Test-Time Adaptation

Authors: Haoxiang Zhang, Zhuofeng Li, Qiannan Zhang, Ziyi Kou, Juncheng Li, Shichao Pei

Abstract: Pre-trained graph neural networks (GNNs) have demonstrated significant success in leveraging large-scale graph data to learn transferable representations. However, their performance often degrades under distribution shifts, particularly in real-world scenarios where test labels are unavailable. To address this challenge, we propose Graph Optimization via Augmented Transformations (GOAT), a novel self-supervised test-time tuning paradigm that adapts pre-trained GNNs to distribution-shifted test data by focusing exclusively on node feature transformations. By avoiding complex and often suboptimal graph structure transformations, GOAT overcomes the limitations of existing data-centric methods.
To further address the issue of transformation collapse, where feature transformations converge to trivial solutions (for example, when the test-time learned data-centric transformation degenerates into a constant or identity mapping across different inputs), we introduce a parameter-efficient low-rank adapter that generates diverse transformations tailored to individual input graphs. This design not only enhances adaptation performance but also improves interpretability by avoiding modifications to the graph structure. Through extensive experiments on six real-world datasets with diverse distribution shifts, we demonstrate that GOAT achieves consistent performance improvements across different pre-trained GNN backbones, outperforming state-of-the-art test-time adaptation methods.

URL: https://openreview.net/forum?id=yiS6q42LLt

---


New submissions
===============


Title: Quantum Rationale-Aware Graph Contrastive Learning for Jet Discrimination

Abstract: In high-energy physics, particle jet tagging plays a pivotal role in distinguishing quark from gluon jets using data from collider experiments. While graph-based deep learning methods have advanced this task beyond traditional feature-engineered approaches, the complex data structure and limited labeled samples present ongoing challenges. However, existing contrastive learning (CL) frameworks struggle to leverage rationale-aware augmentations effectively, often lacking supervision signals that guide the extraction of salient features and facing computational efficiency issues such as high parameter counts. In this study, we demonstrate that integrating a quantum rationale generator (QRG) within our proposed Quantum Rationale-aware Graph Contrastive Learning (QRGCL) framework significantly enhances jet discrimination performance, reducing reliance on labeled data and capturing discriminative features. Evaluated on the quark-gluon jet dataset, QRGCL achieves an AUC score of 77.53% while maintaining a compact architecture of only 45 QRG parameters, outperforming classical, quantum, and hybrid GCL and GNN benchmarks. These results highlight QRGCL’s potential to advance jet tagging and other complex classification tasks in high-energy physics, where computational efficiency and feature extraction limitations persist. The source code for QRGCL is available at: https://anonymous.4open.science/r/QRGCL-1951.

URL: https://openreview.net/forum?id=HrA51jCVZ9

---

Title: AMGE: Adaptive Modality Gap Exploitation for Adversarial Attacks on Vision-Language Models

Abstract: Multimodal large language models unify visual perception with natural language understanding, yet remain vulnerable to adversarial manipulations. Existing jailbreak attacks exploit vision-text vulnerabilities through pixel-space perturbations and prompt optimization, overlooking a fundamental weakness: the modality gap—the geometric separation between image and text embeddings. We present Adaptive Modality Gap Exploitation (AMGE), operating within the embedding manifold through gap-aware perturbation optimization and cross-attention-mediated gradient flow. Our framework characterizes the modality gap via empirical directional bias estimation, formulates attacks as geometric exploitation where gradient updates align with gap vectors, and employs momentum-based ensemble aggregation for universal transferability across queries and architectures. Evaluation across four multimodal LLMs (LLaVA-1.5-7B/13B, Qwen-VL, Qwen2-VL) demonstrates 90.2% attack success rate with 79.1% transferability, requiring only 127 queries—3× fewer than competing methods—while maintaining 87.5% semantic preservation. AMGE sustains 62.3% effectiveness against five defenses, outperforming existing attacks by 23.7%. This work establishes embedding-space geometric exploitation as a principled paradigm for exposing vulnerabilities in multimodal alignment architectures.

URL: https://openreview.net/forum?id=GcCavM1Kpr

---

Title: \textsc{PGO-BEn}: Proxy-Guided Orthogonalization and Beta Ensembling for Few-Shot Domain-Incremental Learning

Abstract: Continual adaptation to evolving domains with minimal supervision is essential for real-world deployment of machine learning systems. We formalize this objective as \textbf{Few-Shot Domain-Incremental Learning (FSDIL)}, where a model must adapt to each new domain using only a few labeled samples while retaining prior knowledge without access to previous data. This setting mirrors practical constraints in domains such as autonomous driving and medical imaging, where annotations are expensive and data retention is restricted by privacy regulations.
Pre-trained vision–language models such as CLIP provide a strong initialization for FSDIL due to their transferable multi-modal representations. However, adapting CLIP incrementally under domain shifts remains challenging: few-shot updates often trigger \emph{catastrophic forgetting} and insufficient \emph{plasticity} across evolving distributions.
To address these challenges, we introduce \textbf{\textsc{PGO-BEn}} (\textit{Proxy-Guided Orthogonalization and Beta Ensembling})—a rehearsal-free framework that leverages CLIP’s semantic priors via prompt learning while preserving prior domain knowledge through two key mechanisms.
(1) \textbf{Proxy-Guided Orthogonalization (PGO):} identifies conflicts between current gradients and proxy representations of past knowledge, inferred from current samples, and projects conflicting updates into an orthogonal subspace to prevent knowledge degradation.
(2) \textbf{Beta Ensembling (BEn):} introduces a Beta-function-based temporal ensembling strategy that adaptively balances stability and plasticity, outperforming conventional exponential moving average (EMA) approaches in retaining early-domain knowledge.
We extensively evaluate \textsc{PGO-BEn} on three diverse benchmarks—\textbf{DomainNet}, \textbf{CoRE50}, and \textbf{CDDB-Hard}—and demonstrate consistent improvements over state-of-the-art domain-incremental and few-shot learning methods across all supervision levels in this challenging setting.

URL: https://openreview.net/forum?id=jlb27FbHLv
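
The conflict-resolution step in (1) reads like standard gradient projection from continual learning; a hedged sketch of that generic operation (not the authors' implementation) is:

    import torch

    def project_if_conflicting(grad, proxy_grad, eps=1e-12):
        """If the update conflicts with the proxy direction of past knowledge
        (negative inner product), remove its component along that direction."""
        dot = torch.dot(grad.flatten(), proxy_grad.flatten())
        if dot < 0:
            grad = grad - (dot / (proxy_grad.norm() ** 2 + eps)) * proxy_grad
        return grad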

---

Title: Open Technical Problems in Open-Weight AI Model Risk Management

Abstract: Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted. However, compared to proprietary models, open-weight models pose different opportunities and challenges for effective risk management. For example, they allow for more open research and testing. However, managing their risks is also challenging because they can be modified arbitrarily, used without oversight, and spread irreversibly. Currently, there is limited research on safety tooling specific to open-weight models. Addressing these gaps will be key to both realizing their benefits and mitigating their harms. In this paper, we present 16 open technical challenges for open-weight model safety involving training data, training algorithms, evaluations, deployment, and ecosystem monitoring. We conclude by discussing the nascent state of the field, emphasizing that openness about research, methods, and evaluations -- not just weights -- will be key to building a rigorous science of open-weight model risk management.

URL: https://openreview.net/forum?id=8QyGLnFkzc

---

Title: Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation

Abstract: Restoring any degraded image efficiently via just one model has become increasingly significant and impactful, especially with the proliferation of mobile devices. Traditional solutions typically involve training dedicated models per degradation, resulting in inefficiency and redundancy. More recent approaches either introduce additional modules to learn visual prompts, significantly increasing the size of the model, or incorporate cross-modal transfer from large language models trained on vast datasets, adding complexity to the system architecture. In contrast, our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations to enable both efficient and comprehensive restoration through a joint embedding mechanism, without scaling up the model or relying on large language models. Specifically, we examine the sub-latent space of each input, identifying key components and reweighting them first in a gated manner. To unify intrinsic degradation awareness with contextualized attention, we propose a spatial–frequency parallel fusion strategy that strengthens spatially informed local–global interactions and enriches restoration fidelity from the frequency domain. Comprehensive evaluations across four all-in-one restoration benchmarks demonstrate that AnyIR attains state-of-the-art performance while reducing model parameters by 84% and FLOPs by 80% relative to the baseline. These results highlight the potential of AnyIR as an effective and lightweight solution for further all-in-one image restoration. Our code will be available upon acceptance.

URL: https://openreview.net/forum?id=RObgCpdDqr

---

Title: Matching High-Dimensional Geometric Quantiles for Test-Time Adaptation of Transformers and Convolutional Networks Alike

Abstract: Test-time adaptation (TTA) refers to adapting a classifier to the test data when the probability distribution of the test data differs slightly from that of the training data of the model. To the best of our knowledge, most existing TTA approaches modify the weights of the classifier and rely heavily on the architecture. It is unclear how these approaches extend to generic architectures. In this article, we propose an architecture-agnostic approach to TTA by adding an adapter network that pre-processes the input images to suit the classifier. This adapter is trained using the proposed \emph{quantile loss}. Unlike existing approaches, we correct for the distribution shift by matching high-dimensional geometric quantiles. We prove theoretically that, under suitable conditions, minimizing the quantile loss can learn the optimal adapter. We validate our approach on CIFAR10-C, CIFAR100-C and TinyImageNet-C by training both classic convolutional and transformer networks on the CIFAR10, CIFAR100 and TinyImageNet datasets.

URL: https://openreview.net/forum?id=0lSe9KnMke

---

Title: On Uncertainty Calibration for Equivariant Functions

Abstract: Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally between equivariance and model calibration, have yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.

URL: https://openreview.net/forum?id=rxLUTPLBT3

---

Title: Benchmarking LLM Overshooting: Automatic Evaluation of Refutation Quality and Emotional Alignment under Pressure

Abstract: Despite the substantial convenience and productivity gains provided by large language models (LLMs), a growing concern is users’ uncritical or blind trust in their generated responses. Such emotional alignment has recently attracted attention, with a specific focus on overshooting—a phenomenon in which users attribute emotional value to artificial intelligence beyond its inherent capabilities. Despite recent advancements, “refutation quality” and “emotional alignment” remain largely unquantified in situations where LLMs encounter false premises. To address this gap, we introduce a new benchmark that enables automatic quantification of LLM overshooting. Specifically, it defines an Overshoot Index (OI) that integrates six metrics: Refutation Strength (RS), Directness Index (DI), Hedging Load (HL), Affective Overshoot Proxy (AOP), Normative Jump (NJ), and Evidence-Backed Correction (EBC). In our experiments, three models—OpenAI’s gpt-4o-mini, Anthropic’s claude-3-5-sonnet-20241022, and Google’s gemini-1.5-flash—were evaluated using prompts generated from the TruthfulQA, CREPE, and FalseQA datasets. Additionally, three pressure levels (pressure ∈ {0, 1, 2}) were introduced to examine behavioral changes under stress. Rather than ranking models, OI serves as a diagnostic benchmark that reveals how refutation strength and emotional accommodation interact under false-premise conditions. Across the three commercial LLMs, OI highlights distinct behavioral tendencies, illustrating its value as a complementary tool for alignment and safety evaluation rather than a performance leaderboard. Statistical validation was performed using Kruskal–Wallis and Wilcoxon signed-rank tests. Overall, this study provides a novel perspective for evaluating LLM safety and robustness.

URL: https://openreview.net/forum?id=HfmYrp4hBR

---

Title: Conformal Prediction for Hierarchical Data

Abstract: We consider conformal prediction for multivariate data and focus on hierarchical data, where some components are linear combinations of others. Intuitively, the hierarchical structure can be leveraged to reduce the size of prediction regions for the same coverage level. We implement this intuition by including a projection step (also called a reconciliation step) in the split conformal prediction [SCP] procedure, and prove that the resulting prediction regions are indeed globally smaller. We do so both under the classic objective of joint coverage and under a new and challenging task: component-wise coverage, for which efficiency results are more difficult to obtain. The associated strategies and their analyses draw on both the SCP and the forecast reconciliation literatures, which we connect. We also illustrate the theoretical findings on simulated data, for different scales of hierarchies.

URL: https://openreview.net/forum?id=lBDEZiW7MX

---

Title: Communication-Efficient Federated AUC Maximization with Cyclic Client Participation

Abstract: Federated AUC maximization is a powerful approach for learning from imbalanced data in federated learning (FL). However, existing methods typically assume full client availability, which is rarely practical. In real-world FL systems, clients often participate in a cyclic manner—joining training according to a fixed, repeating schedule. This setting poses unique optimization challenges for the non-decomposable AUC objective.
This paper addresses these challenges by developing and analyzing communication-efficient algorithms for federated AUC maximization under cyclic client participation. We investigate two key settings:
First, we study AUC maximization with a squared surrogate loss, which reformulates the problem as a nonconvex–strongly-concave minimax optimization. By leveraging the Polyak–Łojasiewicz (PL) condition, we establish a state-of-the-art communication complexity of $\tilde{O}(1/\epsilon)$ and iteration complexity of $\tilde{O}(1/\epsilon)$.
Second, we consider general pairwise AUC losses. We establish an iteration complexity of $O(1/\epsilon^4)$ and a communication complexity of $O(1/\epsilon^3)$. Further, under the PL condition, these bounds improve to iteration complexity of $\widetilde{O}(1/\epsilon)$ and communication complexity of $\widetilde{O}(1/\epsilon^{1/2})$.
Extensive experiments on benchmark tasks in image classification, medical imaging, and fraud detection demonstrate the superior efficiency and effectiveness of our proposed methods.
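
For intuition about the second setting, here is a small PyTorch sketch of a pairwise squared surrogate for the AUC objective on a single client's minibatch; the margin and the way pairs are formed are illustrative assumptions rather than the paper's algorithm.

    import torch

    def pairwise_squared_auc_loss(scores, labels, margin=1.0):
        """Squared surrogate of 1 - AUC over all positive/negative pairs in a batch."""
        pos = scores[labels == 1]
        neg = scores[labels == 0]
        if pos.numel() == 0 or neg.numel() == 0:
            return scores.sum() * 0.0                     # no pairs available in this batch
        diffs = pos.unsqueeze(1) - neg.unsqueeze(0)       # (n_pos, n_neg) score gaps
        return ((margin - diffs) ** 2).mean()

    scores = torch.randn(8, requires_grad=True)
    labels = torch.tensor([1, 0, 0, 1, 0, 1, 0, 0])
    pairwise_squared_auc_loss(scores, labels).backward()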

URL: https://openreview.net/forum?id=18yPFLbVRy

---

Title: Reliable Reasoning Beyond Natural Language

Abstract: Despite their linguistic competence, Large Language Models (LLMs) often struggle to reason reliably and flexibly. To identify these shortcomings, we introduce the Non-Linear Reasoning (NLR) dataset, a collection of unique, hand-designed problems that target reasoning bottlenecks arising from the sequential prediction paradigm of LLMs and the inherently linear nature of natural language. NLR tasks require iterative updates, backtracking, and reasoning across multiple parallel chains of thought but only basic arithmetic to solve. To address these limitations, we propose a neurosymbolic reasoning approach that integrates Prolog, a symbolic reasoning engine, into the inference pipeline of LLMs. This division of labor shifts the LLM’s task from iterative computations to inferring all information—explicit or implied through common sense—and encoding it as logical code. Our method yields large and robust performance gains across the GSM8k and BIG-bench Navigate benchmarks and achieves near-perfect accuracy on NLR problems, maintaining robustness even as variable interdependence—the number of other variables on which the value of a single variable depends—increases.

URL: https://openreview.net/forum?id=AvdzFGQ6BN

---

Title: Message-Passing GNNs Fail to Approximate Sparse Triangular Factorizations

Abstract: Graph Neural Networks (GNNs) have been proposed as a tool for learning sparse matrix preconditioners, which are key components in accelerating linear solvers. We present theoretical and empirical evidence that message-passing GNNs are fundamentally incapable of approximating sparse triangular factorizations for classes of matrices for which high-quality preconditioners exist but require non-local dependencies. To illustrate this, we construct a set of baselines using both synthetic matrices and real-world examples from the SuiteSparse collection. Across a range of GNN architectures, including Graph Attention Networks and Graph Transformers, we observe low cosine similarity ($\leq0.7$ in key cases) between predicted and reference factors. Our theoretical and empirical results suggest that architectural innovations beyond message-passing are necessary for applying GNNs to scientific computing tasks such as matrix factorization. Moreover, experiments demonstrate that overcoming non-locality alone is insufficient. Tailored architectures are necessary to capture the required dependencies since even a completely non-local Global Graph Transformer fails to match the proposed baselines.

URL: https://openreview.net/forum?id=YIr9SzD3C9

---

Title: Signature Kernel Scoring Rule: A Spatio-Temporal Diagnostic for Probabilistic Weather Forecasting

Abstract: Modern weather forecasting has increasingly transitioned from numerical weather prediction (NWP) to data-driven machine learning forecasting techniques. While these new models produce probabilistic forecasts to quantify uncertainty, their training and evaluation may remain hindered by conventional scoring rules, primarily MSE, which ignore the highly correlated data structures present in weather and atmospheric systems. This work introduces the signature kernel scoring rule, grounded in rough path theory, which reframes weather variables as continuous paths to encode temporal and spatial dependencies through iterated integrals. Validated as strictly proper through the use of path augmentations to guarantee uniqueness, the signature kernel provides a theoretically robust metric for forecast verification and model training. Empirical evaluations through weather scorecards on WeatherBench 2 models demonstrate the signature kernel scoring rule's high discriminative power and unique capacity to capture path-dependent interactions. Following previous demonstration of successful adversarial-free probabilistic training, we train sliding window generative neural networks using a predictive-sequential scoring rule on ERA5 reanalysis weather data. Using a lightweight model, we demonstrate that signature kernel based training outperforms climatology for forecast paths of up to fifteen timesteps.

URL: https://openreview.net/forum?id=LOLXpt4E5D

---

Title: Enabling Robust In-Context Memory and Rapid Task Adaptation in Transformers with Hebbian and Gradient-Based Plasticity

Abstract: Large language models display in-context learning as an emergent effect of scale, but they rely on static weights during inference. In contrast, biological systems continually adapt via synaptic plasticity. We investigate whether explicit, biologically inspired plasticity can endow Transformers with faster in-sequence adaptation. To this end, we augment decoder-only Transformers with fast-weight modules updated either by (i) a neuromodulated Hebbian rule or (ii) the gradient-based plasticity mechanism of Duan et al. (2023). Across copying, regression, and few-shot classification tasks (CIFAR-FS, Omniglot), Hebbian plasticity consistently achieves lower loss and stronger few-shot generalization, while gradient-based updates perform best on long-horizon credit assignment. When associations are short and linearly separable, static weights suffice, defining a clear boundary condition for when plasticity helps. Analysis of learned modulatory signals reveals that gradient-based rules maintain large, persistent updates, whereas Hebbian plasticity is sharply gated around salient events. Together, these results show that explicit plasticity complements attention by enabling rapid, task-specific adaptation, and clarify when different plasticity mechanisms are most effective.
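
A minimal NumPy sketch of the kind of neuromodulated Hebbian fast-weight update described here; the decay, learning rate, and scalar modulatory gate are illustrative assumptions rather than the paper's exact rule.

    import numpy as np

    def hebbian_fast_weight_step(F, pre, post, modulator, decay=0.95, lr=0.1):
        """One fast-weight update: decay old associations, add a gated outer product."""
        return decay * F + lr * modulator * np.outer(post, pre)

    d_pre, d_post = 8, 4
    F = np.zeros((d_post, d_pre))                # fast weights, reset per sequence
    pre = np.random.randn(d_pre)                 # presynaptic activity (e.g. key features)
    post = np.random.randn(d_post)               # postsynaptic activity (e.g. value features)
    F = hebbian_fast_weight_step(F, pre, post, modulator=1.0)
    readout = F @ pre                            # fast weights contribute to the next layer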

URL: https://openreview.net/forum?id=34No0A0V56

---

Title: Universal Differential Equations for Stable Multi-Step Volatility Time Series Forecasting

Abstract: Neural differential equations such as Neural ODEs, Neural CDEs, and Universal Differential Equations (UDEs) model temporal evolution as a continuous-time flow rather than a fixed-step recurrence. Even for regularly sampled data, this formulation differs fundamentally from discrete-time architectures: it learns smooth vector fields governing instantaneous rates of change, reducing discretization bias and improving long-horizon stability. We present a systematic study of Universal Differential Equations for financial volatility forecasting, a domain characterized by regime shifts, heavy tails, and jump discontinuities. UDEs extend Neural ODEs by embedding mechanistic structure within learned dynamics, using neural networks to parameterize coefficients in partially known differential equations instead of learning the system purely from data. Our UDE variants incorporate volatility’s empirical regularities while retaining neural flexibility for regime adaptation. Across market regimes, they outperform both continuous-time baselines and discrete-time models, achieving higher accuracy and greater long-horizon stability while remaining interpretable. These results suggest that UDEs grounded in mechanistic structure and neural flexibility offer a principled route to stable, interpretable multi-step forecasting in nonstationary domains.

URL: https://openreview.net/forum?id=uWGNexco2M

---

Title: DiffCATS: Causally Associated Time-Series Generation through Diffusion Models

Abstract: Understanding the intrinsic causal structure of time-series data is crucial for effective real-world interventions and decision-making, but progress in Time-Series Causal Discovery (TSCD) is often limited by the lack of high-quality datasets with diverse and realistic temporal causal relationships. This highlights the need to provide synthetic time-series generation tools, with realism as a primary objective, an aspect that requires incorporating causal relationships beyond mere correlation. To address this challenge, we propose a diffusion model called DiffCATS. It simultaneously generates multiple causally associated time-series as well as a ground truth causal graph that reflects their mutual temporal dependencies, requiring only observational time-series data for training. Experiments demonstrate that it outperforms state-of-the-art methods in producing realistic time-series with causal graphs that closely resemble those of real-world phenomena. We highlight the practical utility of our data on three downstream tasks, including benchmarking widely used TSCD algorithms.

URL: https://openreview.net/forum?id=FwC6CyaHop

---

Title: Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime

Abstract: Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.
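
For reference, a minimal discrete-time SGLD step with constant step size and isotropic noise; the paper analyzes a continuous-time SDE with multiplicative, state-dependent noise, so this is only the simplest discretized analogue.

    import numpy as np

    def sgld_step(theta, grad_fn, step_size=1e-3, temperature=1.0, rng=np.random):
        """theta <- theta - eta * grad(theta) + sqrt(2 * eta * T) * N(0, I)."""
        noise = rng.standard_normal(theta.shape)
        return theta - step_size * grad_fn(theta) + np.sqrt(2 * step_size * temperature) * noise

    # toy quadratic loss L(theta) = 0.5 * ||theta||^2, so grad(theta) = theta
    theta = np.ones(5)
    for _ in range(1000):
        theta = sgld_step(theta, grad_fn=lambda t: t)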

URL: https://openreview.net/forum?id=3s037DKmLo

---

Title: Function Space Diversity for Uncertainty Prediction via Repulsive Last-Layer Ensembles

Abstract: Bayesian inference in function space has gained attention due to its robustness against overparameterization in neural networks. However, approximating the infinite-dimensional function space introduces several challenges. In this work, we discuss function space inference via particle optimization and present practical modifications that improve uncertainty estimation and, most importantly, make it applicable for large and pretrained networks. First, we demonstrate that the choice of input samples at which particle predictions are enforced to be diverse is critical: enforcing diversity on the training data itself can lead to underfitting, whereas label-destroying data augmentation or unlabeled out-of-distribution data can improve prediction diversity and uncertainty estimates.
Furthermore, we take advantage of the function space formulation, which imposes no restrictions on network parameterization other than sufficient flexibility. Instead of using full deep ensembles to represent particles, we propose a single multi-headed network that introduces a minimal increase in parameters and computation. This allows seamless integration with pretrained networks, where this repulsive last-layer ensemble can be used for uncertainty-aware fine-tuning at minimal additional cost. We achieve competitive results in disentangling aleatoric and epistemic uncertainty, detecting out-of-distribution data, and providing calibrated uncertainty estimates under distribution shifts with minimal compute and memory.

URL: https://openreview.net/forum?id=G8fAj5gzEp

---

Title: Layer Collapse Can be Induced by Unstructured Pruning

Abstract: Unstructured pruning is a popular compression method for efficiently reducing model parameters. However, while it effectively decreases the number of parameters, it is commonly believed that unstructured pruning cannot shorten the computational critical path, i.e., the maximum number of layers traversed during forward propagation.

In this paper, we study when and how unstructured pruning can yield structural effects. For rectifier-activated networks, we introduce the notion of neuron entropy, which quantifies the degree of nonlinearity utilization. We show that magnitude-based pruning naturally lowers this entropy, sometimes down to zero-entropy layers that become linearizable and can thus be removed. Building on this insight, we propose a method that leverages "unstructured" pruning to favor sparsity in low-entropy layers, enabling their complete removal. We validate the phenomenon across CNNs, Vision Transformers, and NLP models: unstructured pruning can induce effective layer removal with little or no performance degradation in over-parameterized networks. The code will be publicly available upon acceptance of the article.

URL: https://openreview.net/forum?id=rfDYZNZIZT

---

Title: On Fitting Flow Models with Large Sinkhorn Couplings

Abstract: Flow models transform data gradually from one modality (e.g. noise) onto another (e.g. images). Such models are parameterized by a time-dependent velocity field, trained to fit segments connecting pairs of source and target points. When the pairing between source and target points is given, training flow models boils down to a supervised regression problem. When no such pairing exists, as is the case when generating data from noise, training flows is much harder. A popular approach lies in picking source and target points independently (Lipman et al., 2023). This can, however, lead to velocity fields that are not only slow to train but also costly to integrate at inference time. In theory, one would greatly benefit from training flow models by sampling pairs from an optimal transport (OT) measure coupling source and target, since this would lead to a highly efficient flow solving the Benamou-Brenier dynamical OT problem. In practice, recent works have proposed to sample mini-batches of $n$ source and $n$ target points and reorder them using an OT solver to form better pairs. These works have advocated using batches of size $n\approx 256$, and considered OT solvers that return couplings that are either sharp (using e.g. the Hungarian algorithm) or blurred (using e.g. entropic regularization, a.k.a. Sinkhorn). We follow in the footsteps of these works by exploring the benefits of increasing this mini-batch size $n$ by three to four orders of magnitude, and look more carefully at the effect of the entropic regularization $\varepsilon$ used in the Sinkhorn algorithm. Our analysis is facilitated by new scale invariant quantities to report the sharpness of a coupling, while our sharded computations across multiple GPU or GPU nodes allow scaling up $n$. We show that in both synthetic and image generation tasks, flow models greatly benefit when fitted with large Sinkhorn couplings, with a low entropic regularization $\varepsilon$.
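
A self-contained sketch of the minibatch pairing step: run Sinkhorn on the squared-distance cost between source and target samples, then resample training pairs from the resulting coupling. The batch size and epsilon below are tiny illustrative values, whereas the paper scales the batch size by orders of magnitude and studies the effect of epsilon carefully.

    import numpy as np

    def sinkhorn_coupling(x_src, x_tgt, eps=0.05, n_iter=200):
        """Entropic OT coupling between two equally weighted minibatches."""
        C = ((x_src[:, None, :] - x_tgt[None, :, :]) ** 2).sum(-1)
        C = C / C.max()                               # normalize cost for numerical stability
        K = np.exp(-C / eps)
        n, m = len(x_src), len(x_tgt)
        a, b = np.ones(n) / n, np.ones(m) / m
        u, v = np.ones(n), np.ones(m)
        for _ in range(n_iter):
            u = a / (K @ v)
            v = b / (K.T @ u)
        return u[:, None] * K * v[None, :]            # coupling; rows/cols sum to a and b

    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(64, 2))                     # source (noise) minibatch
    x1 = rng.normal(loc=3.0, size=(64, 2))            # target (data) minibatch
    pi = sinkhorn_coupling(x0, x1)
    # resample (source, target) training pairs in proportion to the coupling weights
    idx = rng.choice(pi.size, size=64, p=(pi / pi.sum()).ravel())
    src_idx, tgt_idx = np.unravel_index(idx, pi.shape)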

URL: https://openreview.net/forum?id=3MLKJZgY62

---

Title: GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

Abstract: State-of-the-art (SOTA) image and text generation models are multimodal models that have many similarities to large language models (LLMs). Despite achieving strong performances, leading foundational multimodal model architectures frequently lag behind the architectural sophistication of contemporary LLMs. We propose GRR-CoCa, an improved SOTA Contrastive Captioner (CoCa) model that incorporates Gaussian error gated linear units, root mean squared normalization, and rotary positional embedding into the textual decoders and the vision transformer (ViT) encoder. Each architectural modification has been shown to improve model performance in LLMs, but has yet to be adopted in CoCa. We benchmarked GRR-CoCa against Baseline CoCa, a model with the same modified textual decoders but with CoCa's original ViT encoder. We used standard pretraining and fine-tuning workflows to benchmark the models on contrastive and generative tasks. Our GRR-CoCa significantly outperformed Baseline CoCa on the pretraining dataset and three diverse fine-tuning datasets. Pretraining improvements were 27.25% in contrastive loss, 3.71% in perplexity, and 7.15% in CoCa loss. The average fine-tuning improvements were 13.66% in contrastive loss, 5.18% in perplexity, and 5.55% in CoCa loss. We show that GRR-CoCa's modified architecture improves performance and generalization across vision-language domains.
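
To make one of the three transplanted LLM components concrete, here is a standard RMS normalization layer in PyTorch; this is the generic formulation of RMSNorm, not code from the paper.

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        """Root-mean-square layer normalization: rescale features by their RMS."""
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x):
            rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
            return self.weight * x * rms

    x = torch.randn(2, 16, 512)
    print(RMSNorm(512)(x).shape)    # torch.Size([2, 16, 512])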

URL: https://openreview.net/forum?id=aOdUP2mfhJ

---

Title: Learning Time-Varying Convexifications of Multiple Fairness Measures

Abstract: There is an increasing appreciation that one may need to consider multiple measures of fairness, e.g., multiple group and individual fairness notions. The relative weights of the fairness regularizers are a priori unknown, may be time-varying, and need to be learned on the fly. We consider the learning of time-varying convexifications of multiple fairness measures with limited graph-structured feedback.

URL: https://openreview.net/forum?id=y8iWuDZtEw

---

Title: Reliability-Aware Preference Learning for LLM Reward Models

Abstract: Reward functions learned from human feedback are the backbone of reinforcement learning from human feedback (RLHF), the current state-of-the-art approach for aligning large language models to our values. However, reward models (RMs) often fall short of capturing our true preferences, overemphasizing superficial features like length while undervaluing crucial aspects like factual accuracy. A major reason behind this failure is how standard preference learning essentially ignores the inherent limitations of the human annotators providing preference data, including their cognitive biases, knowledge gaps, and resource constraints. To address this, we propose Reliability-Aware Preference Learning (RAPL), which explicitly accounts for varying annotator reliability. Specifically, RAPL modifies the standard preference learning loss function based on an estimate of how reliable annotator feedback will be for each preference comparison pair. We call these parameters annotator reliability metrics (ARMs) and demonstrate how to estimate them based on annotator behavior indicators (e.g., self-reported confidence) or models specifically fine-tuned to predict annotator reliability. Extensive experiments reveal that RMs trained using standard preference learning inherit annotator biases. On the other hand, RAPL effectively amplifies the signal from reliable judgments while attenuating less trustworthy feedback, leading to models that better align with annotators' true preferences.
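
One way to picture the modification is as a reliability weighting of the standard Bradley-Terry preference loss; the exact form of RAPL's loss and of the ARM estimates is given in the paper, so treat the weighting below as an illustrative assumption.

    import torch
    import torch.nn.functional as F

    def reliability_weighted_preference_loss(r_chosen, r_rejected, reliability):
        """Standard loss: -log sigmoid(r_chosen - r_rejected).
        Sketch: scale each pair's contribution by an estimated annotator reliability
        in [0, 1], so less trustworthy comparisons contribute less gradient signal."""
        per_pair = -F.logsigmoid(r_chosen - r_rejected)
        return (reliability * per_pair).mean()

    r_c = torch.tensor([1.2, 0.3, 0.8])       # reward-model scores for chosen responses
    r_r = torch.tensor([0.9, 0.5, -0.2])      # scores for rejected responses
    w = torch.tensor([0.9, 0.4, 1.0])         # per-comparison reliability estimates (ARM-like)
    print(reliability_weighted_preference_loss(r_c, r_r, w))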

URL: https://openreview.net/forum?id=pAPcLZK5xj

---

Title: SyntheRela: A Benchmark For Synthetic Relational Database Generation

Abstract: Synthesizing relational databases has started to receive more attention from researchers, practitioners, and industry. The task is more difficult than synthesizing a single table due to the added complexity of relationships between tables. For the same reasons, benchmarking methods for synthesizing relational databases introduces new challenges. Our work is motivated by a lack of an empirical evaluation of state-of-the-art methods and by gaps in the understanding of how such an evaluation should be done. We review related work on relational database synthesis, common benchmarking datasets, and approaches to measuring the fidelity and utility of synthetic data. We combine best practices, a novel robust detection metric, and a novel approach to evaluating utility with graph neural networks into a benchmarking tool. We use this benchmark to compare 6 open-source methods over 8 real-world databases, with a total of 39 tables. The open-source SyntheRela benchmark is available on GitHub with a public leaderboard.

URL: https://openreview.net/forum?id=Mi8XioazWy

---

Title: Generative Evolutionary Meta-Solver (GEMS): Scalable Surrogate-Free Multi-Agent Reinforcement Learning

Abstract: Scalable multi-agent reinforcement learning (MARL) remains a central challenge for AI. Existing population-based methods, such as Policy-Space Response Oracles (PSRO), require storing explicit policy populations and constructing full payoff matrices, incurring quadratic computation and linear memory costs. We present Generative Evolutionary Meta-Solver (GEMS), a surrogate-free framework that replaces explicit populations with a compact set of latent anchors and a single amortized generator. Instead of exhaustively constructing the payoff matrix, GEMS relies on unbiased Monte Carlo rollouts, multiplicative-weights meta-dynamics, and a model-free empirical-Bernstein UCB oracle to adaptively expand the policy set. Best responses are trained within the generator using an advantage-based trust-region objective, eliminating the need to store and train separate actors. We evaluated GEMS in a variety of two-player and multi-player games such as the Deceptive Messages Game, Kuhn Poker, and the Multi-Particle environment. We find that GEMS is up to ~$\mathbf{6\times}$ faster and uses $\mathbf{1.3\times}$ less memory than PSRO, while also reaping higher rewards. These results demonstrate that GEMS retains the game-theoretic guarantees of PSRO while overcoming its fundamental inefficiencies, hence enabling scalable multi-agent learning in multiple domains.

URL: https://openreview.net/forum?id=ZwEJsXoBHD

---

Title: High-Layer Attention Pruning with Rescaling

Abstract: Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks across 27 datasets. The results consistently demonstrate that our method outperforms existing structured pruning methods. This improvement is particularly notable in generation tasks, where our approach significantly outperforms existing baselines.

URL: https://openreview.net/forum?id=jkPBIxYmWE

---

Title: Evaluating Molecule Synthesizability via Retrosynthetic Planning and Reaction Prediction

Abstract: A significant challenge in wet lab experiments with current drug design generative models is the trade-off between pharmacological properties and synthesizability. Molecules predicted to have highly desirable properties are often difficult to synthesize, while those that are easily synthesizable tend to exhibit less favorable properties. As a result, evaluating the synthesizability of molecules in general drug design scenarios remains a significant challenge in the field of drug discovery. The commonly used synthetic accessibility (SA) score aims to evaluate the ease of synthesizing generated molecules, but it falls short of guaranteeing that synthetic routes can actually be found. Inspired by recent advances in top-down synthetic route generation and forward reaction prediction, we propose a new, data-driven metric to evaluate molecule synthesizability. This novel metric leverages the synergistic duality between retrosynthetic planners and reaction predictors, both of which are trained on extensive reaction datasets. To demonstrate the efficacy of our metric, we conduct a comprehensive evaluation of round-trip scores across a range of representative molecule generative models.

URL: https://openreview.net/forum?id=kx2xMHvAaO

---

Title: United Yet Distinct: Domain Preservation via Divergence Reduction

Abstract: Although there is a vast amount of data available for training Large Language Models (LLMs), data privacy concerns can limit centralized data aggregation, therefore limiting the learning capacity of LLMs on data from distributed sources. Federated Learning (FL) has emerged as a dominant framework for distributed training. The objective of FL is to preserve privacy while improving the performance of participating clients. However, the non-IID nature of participating clients can degrade model performance. Parameter Efficient Fine-Tuning (PEFT) enables adapting LLMs to downstream tasks with minimal parameter additions and updates to their existing parameters. Preserving performance while learning from data in a distributed setting warrants the need for efficient training frameworks that can enable LLMs to learn from disparate data. In this paper, we design and propose a novel FL aggregation algorithm, Divergence Reduction in Federated Training (DRIFT), which accounts for the divergence between clients during model aggregation and disseminates custom aggregated parameters back to each client. DRIFT measures the degree to which the PEFT parameters of the participating clients diverge and takes advantage of the graph-based structure implied by this divergence. We design two variants of DRIFT and, through extensive experimentation, show how DRIFT outperforms well-established baselines. Our training data and code are available at: https://anonymous.4open.science/r/drift-240F.

URL: https://openreview.net/forum?id=JfM57Vjlmt

---

Title: From Feature Visualization to Visual Circuits: Effect of Model Perturbation

Abstract: Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emergent field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are typically interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework. This paper addresses limitations in existing works by proposing a novel attack called ProxPulse that simultaneously manipulates two types of feature visualizations. Surprisingly, when analyzing these attacks within the context of visual circuits, we find that visual circuits exhibit some robustness to ProxPulse. Consequently, we introduce a new attack based on ProxPulse that reveals the manipulability of visual circuits, highlighting their lack of robustness. The effectiveness of these attacks is validated across a range of pre-trained models, from smaller architectures like AlexNet to medium-scale models like ResNet-50, and larger ones such as ResNet-152 and DenseNet-201 on ImageNet.

URL: https://openreview.net/forum?id=x6ZwuyTy65

---

Title: Mamba-Enhanced Visual-Linguistic Representation for Multi-Label Image Recognition

Abstract: Multi-label image recognition stands as a foundational task in computer vision. Recently, vision-language models have achieved significant progress in this domain. However, previous approaches mostly utilized language models in a simplistic manner, without fully leveraging their potential. To address this, we propose a Mamba-enhanced Visual-Linguistic Representation (MVLR) framework for multi-label image recognition, which aims to better leverage the capabilities of visual-linguistic representations. In our MVLR, we first propose Prompt-Driven Label Representation learning (PDLR), which uses both hard and soft prompts to acquire comprehensive semantic knowledge for all labels from a large language model. After extracting the label representations, we propose an Interaction and Fusion Model (IFM) to interact with those representations and then fuse them together. To be specific, IFM first employs a label attention to explore label co-occurrence relations and a context-aware attention to adaptively aggregate context information into label representations. Then, IFM further employs a channel attention to fuse the two features together, forming more reliable and effective label representations. Finally, we propose a Quadruplet Mamba-enhanced Visual-Linguistic block (QMVL) in which visual and linguistic features mutually interact through the strong structure of Mamba. Our QMVL simultaneously emphasizes the features of both visual and linguistic modalities, which differs greatly from previous works that treat linguistic information as a secondary, supplementary signal. Extensive experiments on several popular datasets, including MS-COCO, Pascal VOC 2007 and NUS-WIDE for general multi-label recognition, demonstrate the superiority of our MVLR.

URL: https://openreview.net/forum?id=KCz9Z9VNwr

---

Title: Stochastic Multi-Objective Multi-Armed Bandits: Regret Definition and Algorithm

Abstract: Multi-armed bandit (MAB) problems are widely applied to online optimization tasks that require balancing exploration and exploitation. In practical scenarios, these tasks often involve multiple conflicting objectives, giving rise to multi-objective multi-armed bandits (MO-MAB). Existing MO-MAB approaches predominantly rely on the Pareto regret metric introduced in \citet{drugan2013designing}. However, this metric has notable limitations, particularly in accounting for all Pareto-optimal arms simultaneously. To address these challenges, we propose a novel and comprehensive regret metric that ensures balanced performance across conflicting objectives. Additionally, we introduce the concept of \textit{Efficient Pareto-Optimal} arms, which are specifically designed for online optimization. Based on our new metric, we develop a two-phase MO-MAB algorithm that achieves sublinear regret for both Pareto-optimal and efficient Pareto-optimal arms.

URL: https://openreview.net/forum?id=7N7sK5CFuP

---

Title: Delta-Influence: Unlearning Poisons via Influence Functions

Abstract: Addressing data integrity challenges, such as unlearning the effects of data poisoning after model training, is necessary for the reliable deployment of machine learning models. State-of-the-art influence functions, such as EK-FAC and TRAK, often fail to accurately attribute abnormal model behavior to the specific poisoned training data responsible for the data poisoning attack. In addition, traditional unlearning algorithms often struggle to effectively remove the influence of poisoned samples, particularly when only a few affected examples can be identified. To address these challenges, we introduce $\Delta$-Influence, a novel approach that leverages influence functions to trace abnormal model behavior back to the responsible poisoned training data using just one poisoned test example, without assuming any prior knowledge of the attack. $\Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points without significantly affecting clean data. This allows detecting large negative shifts in influence scores following data transformations, a phenomenon we term influence collapse, thereby accurately identifying poisoned training data. Unlearning this subset, e.g. through retraining, effectively eliminates the data poisoning. We validate our method across three vision-based poisoning attacks and three datasets, benchmarking against five detection algorithms and five unlearning strategies. We show that $\Delta$-Influence consistently achieves the best unlearning across all settings, showing the promise of influence functions for corrective unlearning.

URL: https://openreview.net/forum?id=4XtcG8NNaG

---

Title: Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models

Abstract: We introduce Fuzzy PyTorch, a framework for rapid evaluation of numerical variability in deep learning (DL) models. As DL is increasingly applied to diverse tasks, understanding variability from floating-point arithmetic is essential to ensure robust and reliable performance. Tools assessing such variability must be scalable, efficient, and integrate seamlessly with existing frameworks while minimizing code modifications. Fuzzy PyTorch enables this by integrating Stochastic Arithmetic into PyTorch through Probabilistic Rounding with Instruction Set Management, a novel library interfacing with Verificarlo, a numerical analysis compiler. The library offers two modes: stochastic rounding, preserving exact floating-point operations, and up-down rounding, a faster alternative. Comparative evaluations show Fuzzy PyTorch maintains model performance while up-down rounding achieves runtime reductions of $5\times$ to $60\times$ versus Verrou, a state-of-the-art tool. We further demonstrate scalability by running models from 1 to 341 million parameters, confirming applicability across small and large DL architectures. Overall, Fuzzy PyTorch provides an efficient, scalable, and practical solution for assessing numerical variability in deep learning, enabling researchers and practitioners to quantify and manage floating-point uncertainty without compromising performance or computational efficiency.
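
For intuition about the stochastic-rounding mode, a toy NumPy sketch that rounds values to a coarse grid with probability proportional to proximity, which is unbiased in expectation; Fuzzy PyTorch itself applies such rounding at the floating-point instruction level via Verificarlo, so this is only a conceptual illustration.

    import numpy as np

    def stochastic_round(x, step=0.1, rng=np.random.default_rng()):
        """Round each value up or down to the nearest grid point, with probability
        proportional to its distance from the lower point (unbiased in expectation)."""
        lower = np.floor(x / step) * step
        frac = (x - lower) / step                     # in [0, 1): chance of rounding up
        return lower + step * (rng.random(np.shape(x)) < frac)

    x = np.array([0.2499, 0.2501, 0.73])
    print([stochastic_round(x, step=0.25) for _ in range(3)])   # varies run to run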

URL: https://openreview.net/forum?id=0ogq232VGP

---

Title: Kernel Matrix Estimation of a Determinantal Point Process from a Finite Set of Samples: Properties and Algorithms

Abstract: Determinantal point processes (DPPs) on finite sets have recently gained popularity because of their ability to promote diversity among selected elements in a given subset. The probability distribution of a DPP is defined by the determinant of a positive semi-definite, real-valued matrix. When estimating the DPP parameter matrix, it is often more convenient to express the maximum likelihood criterion using the framework of L-ensembles. However, the resulting optimization problem is non-convex and NP-hard to solve.

In this paper, we establish conditions under which the maximum likelihood criterion has a well-defined optimum for a given finite set of samples. We demonstrate that regularization is generally beneficial for ensuring a proper solution. To address this challenge, we propose a proximal algorithm for minimizing a penalized criterion. Through simulations, we compare our algorithm with previously proposed approaches, illustrating their differing behaviors and providing empirical support for our theoretical findings.
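
A minimal NumPy sketch of the unpenalized L-ensemble log-likelihood that the penalized criterion builds on; the random positive semi-definite parameterization is only for illustration, and the proximal, penalized treatment from the paper is not shown.

    import numpy as np

    def l_ensemble_log_likelihood(L, samples, ground_set_size):
        """log p(samples | L) = sum_A log det(L_A) - |samples| * log det(L + I)."""
        norm = np.linalg.slogdet(L + np.eye(ground_set_size))[1]
        ll = 0.0
        for A in samples:                             # each sample is a list of item indices
            ll += np.linalg.slogdet(L[np.ix_(A, A)])[1]
        return ll - len(samples) * norm

    n = 5
    V = np.random.randn(n, n)
    L = V @ V.T                                       # positive semi-definite L-ensemble kernel
    print(l_ensemble_log_likelihood(L, samples=[[0, 2], [1, 3, 4]], ground_set_size=n))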

URL: https://openreview.net/forum?id=Cyx9LwB5IN

---

Title: SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending

Abstract: Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce **SkillBlender**, a novel hierarchical reinforcement learning framework for **versatile** humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks **with minimal task-specific reward engineering**. We also introduce **SkillBench**, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research.

URL: https://openreview.net/forum?id=IOV5bzyENf

---

Title: RAG-Based AI Agents for Multilingual Help Desks in Low-Bandwidth Environments

Abstract: The increasing demand for multilingual help desk systems has prompted the need for advanced solutions that can provide accurate, real-time responses across various languages. This paper presents a retrieval-augmented generation (RAG) based system optimized for low-bandwidth environments. The proposed system integrates retrieval techniques with generative models, enabling it to generate contextually relevant responses while minimizing latency. To address the challenge of low-bandwidth operation, we introduce model distillation and token compression methods, which reduce model size and response time. The system's performance is evaluated on multilingual datasets, demonstrating substantial improvements over baseline models in terms of accuracy, recall, precision, and F1-score. Our approach effectively tackles the challenges of multilingual support, retrieval accuracy, and low-latency performance, making it a viable solution for real-time customer support in resource-constrained settings. The findings suggest that the proposed system can serve as a robust platform for multilingual help desks, offering improved scalability and efficiency. The system was built using a hybrid retriever-generator architecture, with a cross-lingual transformer for retrieval and a transformer-based sequence-to-sequence model for generation. Multilingual datasets, including TyDiQA, mMARCO, XQuAD, MLDoc, and AfriSenti, were used for training and evaluation. Low-bandwidth optimization techniques such as model distillation and token compression were applied. The proposed system achieved higher EM, BLEU, and MRR scores than baseline models, with EM of 79.2%, BLEU of 32.8, and MRR of 0.80, while reducing latency from 3.4s in the baseline to 2.1s. The distilled model further reduced latency to 1.8s with minor performance trade-offs. Error analysis showed reduced hallucination rates and improved relevance in responses for low-resource languages.

URL: https://openreview.net/forum?id=Sy0Dpg2ygV

---

Title: Offline Reinforcement Learning via Inverse Optimization

Abstract: Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called "sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and achieves competitive performance compared with the widely used baselines in the low-data regime of the MuJoCo benchmark while utilizing three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.

URL: https://openreview.net/forum?id=lSj2gXoeCy

---

Title: Perturbed Gradient Descent via Convex Quadratic Approximation for Nonconvex Bilevel Optimization

Abstract: Bilevel optimization is a fundamental tool in hierarchical decision-making and has been widely applied to machine learning tasks such as hyperparameter tuning, meta-learning, and adversarial learning. Although significant progress has been made in bilevel optimization, existing methods predominantly focus on the nonconvex-strongly-convex or the nonconvex-PL settings; the more general nonconvex-nonconvex framework remains underexplored. In this paper, we address this gap by developing an efficient gradient-based method to decrease the upper-level objective, coupled with a convex Quadratic Program (QP) that minimally perturbs the gradient descent directions to reduce the suboptimality of the condition imposed by the lower-level problem. We provide a rigorous convergence analysis, demonstrating that under the existence of a KKT point and a regularity assumption (the norm-squared gradient of the lower-level objective satisfies PL), our method achieves an iteration complexity of $\mathcal{O}(1/\epsilon^{1.5})$ in terms of the squared norm of the KKT residual for the reformulated problem. Moreover, even in the absence of the regularity assumption, we establish an iteration complexity of $\mathcal{O}(1/\epsilon^{3})$ for the same metric. Through extensive numerical experiments on convex and nonconvex synthetic benchmarks and data hyper-cleaning tasks, we illustrate the efficiency and scalability of our approach.

URL: https://openreview.net/forum?id=sFtPtOHzYO

---

Title: Physics-Aware Variational Autoencoder for Urban Travel Demand Calibration

Abstract: Urban mobility digital twins are revolutionizing how cities manage increasingly complex transportation systems, enabling real-time optimization across multiple stakeholders, services, and dynamic operations. Central to these digital twins is the origin-destination (OD) calibration problem—estimating travel demand patterns that produce realistic traffic simulations matching observed conditions. However, existing calibration methods face critical limitations: they require a prohibitively large number of expensive simulation runs and struggle with high-dimensional city-scale networks.
To mitigate these issues, we introduce ControlVAE, a novel physics-informed neural network approach for sample-efficient OD calibration. Our method leverages traffic flow patterns, embedded in an auxiliary differentiable physics model, to directly calibrate an interpretable neural representation of the OD matrix from observed data. Specifically, we develop a conditional variational autoencoder framework with a controllable cross-attention mechanism that incorporates this traffic simulation model via differentiable physics knowledge. Our experiments on realistic high-dimensional traffic networks, including the Munich network with 5,329 OD pairs, demonstrate superior sample efficiency, requiring 75\% fewer simulation evaluations than standard baselines like SPSA. In addition, ControlVAE reduces the Normalized Root Mean Squared Error (RMSN) by up to 40\% compared to traditional transportation approaches, confirming that the physics-informed deep learning formulation provides a practical advantage over existing OD calibration methods.

URL: https://openreview.net/forum?id=r5oS1XXbT3

---

Title: IBCL: Zero-shot Model Generation under Stability-Plasticity Trade-offs

Abstract: Algorithms that balance the stability-plasticity trade-off are well studied in the continual learning literature. However, only a few focus on obtaining models for specified trade-off preferences. When solving the problem of continual learning under specific trade-offs (CLuST), state-of-the-art techniques leverage rehearsal-based learning, which requires retraining whenever a model corresponding to a new trade-off preference is requested. This is inefficient, since there potentially exists a significant number of different trade-offs, and a large number of models may be requested. As a response, we propose Imprecise Bayesian Continual Learning (IBCL), an algorithm that tackles CLuST efficiently. IBCL replaces retraining with a constant-time convex combination. Given a new task, IBCL (1) updates the knowledge base as a convex hull of model parameter distributions, and (2) generates one Pareto-optimal model per given trade-off via convex combination without additional training. That is, obtaining models corresponding to specified trade-offs via IBCL is zero-shot. Experiments against current CLuST baselines show that IBCL improves classification by at most 44% in average per-task accuracy and by 45% in peak per-task accuracy, while maintaining near-zero to positive backward transfer, with memory overheads converging to constants. In addition, its training overhead, measured by the number of batch updates, remains constant at every task, regardless of the number of preferences requested. IBCL also improves multi-objective reinforcement learning tasks by maintaining the same Pareto front hypervolume while significantly reducing the training cost. Details can be found at: https://github.com/ibcl-anon/ibcl.
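
The zero-shot step can be pictured as a constant-time convex combination; note that IBCL combines parameter distributions drawn from a convex hull, whereas the sketch below combines point estimates purely to illustrate the training-free, preference-conditioned combination.

    import numpy as np

    def zero_shot_model(parameter_sets, preference):
        """Convex-combine cached parameter vectors according to a preference on the simplex."""
        w = np.asarray(preference, dtype=float)
        w = w / w.sum()                               # project the preference onto the simplex
        return sum(wi * theta for wi, theta in zip(w, parameter_sets))

    # two cached extreme models (e.g. stability- vs. plasticity-oriented parameters)
    theta_stable, theta_plastic = np.ones(10), np.zeros(10)
    theta_balanced = zero_shot_model([theta_stable, theta_plastic], preference=[0.7, 0.3])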

URL: https://openreview.net/forum?id=oqwIVXeQ6n

---

Title: Exploring Perceptual Limitations of Multimodal LLMs on Small Visual Objects

Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable performance in various multimodal benchmarks. However, general benchmarks often do not reveal the specific aspects of their visual perception limits due to the lack of controllability. In this work, we quantitatively study the perception of small visual objects in several widely-used MLLMs and reveal a pervasive limitation in answering questions about small objects in images. We then conduct a controlled study of MLLMs' perception, using text reading as a surrogate for their general perception ability, to understand how object quality, size, distractors, and location independently affect the perception of small objects in MLLMs. Through this controlled study, we find that lower object quality, smaller object size, and the presence of visual distractors can each independently reduce MLLMs' ability to answer visual questions. More surprisingly, even local perturbations of an object by a few pixels can cause a drastic decline in the ability of MLLMs to perceive it. Our study provides a better understanding of the perceptual limitations of MLLMs and contributes new evaluation protocols for analyzing and enhancing the perception of future MLLMs.

URL: https://openreview.net/forum?id=D8MjYW8m35

---

Title: Physics-Informed Deep B-Spline Networks

Abstract: Physics-informed machine learning offers a promising framework for solving complex partial differential equations (PDEs) by integrating observational data with governing physical laws. However, learning PDEs with varying parameters and changing initial conditions and boundary conditions (ICBCs) with theoretical guarantees remains an open challenge. In this paper, we propose physics-informed deep B-spline networks, a novel technique that approximates a family of PDEs with different parameters and ICBCs by learning B-spline control points through neural networks. The proposed B-spline representation reduces the learning task from predicting solution values over the entire domain to learning a compact set of control points, enforces strict compliance to initial and Dirichlet boundary conditions by construction, and enables analytical computation of derivatives for incorporating PDE residual losses. While existing approximation and generalization theories are not applicable in this setting—where solutions of parametrized PDE families are represented via B-spline bases—we fill this gap by showing that B-spline networks are universal approximators for such families under mild conditions. We also derive generalization error bounds for physics-informed learning in both elliptic and parabolic PDE settings, establishing new theoretical guarantees. Finally, we demonstrate in experiments that the proposed technique has improved efficiency-accuracy tradeoffs compared to existing techniques in a dynamical system problem with discontinuous ICBCs and can handle nonhomogeneous ICBCs and non-rectangular domains.
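
To make the representation concrete, a small SciPy sketch that evaluates a one-dimensional cubic B-spline from a vector of control points and its analytic derivative; in the paper the control points are produced by a neural network conditioned on PDE parameters and ICBCs, which the snippet does not model.

    import numpy as np
    from scipy.interpolate import BSpline

    degree = 3
    control_points = np.array([0.0, 0.5, 1.0, 0.7, 0.2, 0.0])   # would come from a network
    n_ctrl = len(control_points)
    # Clamped (open uniform) knot vector so the curve attains the endpoint control values.
    knots = np.concatenate([np.zeros(degree),
                            np.linspace(0.0, 1.0, n_ctrl - degree + 1),
                            np.ones(degree)])
    spline = BSpline(knots, control_points, degree)

    x = np.linspace(0.0, 1.0, 5)
    print(spline(x))                  # solution values on the domain
    print(spline.derivative()(x))     # analytic derivative, usable in a PDE residual loss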

URL: https://openreview.net/forum?id=tHO2zEqmzm

---

Title: A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems

Abstract: Contextual bandits are incredibly useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation is maintained. This formulation lends itself to a wider set of applications such as recommendation tasks. To solve this problem, we introduce *conditionally coupled contextual* ($C_3$) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling that allows online learning without retraining. Empirical results show that $C_3$ achieves 5.7% lower average cumulative regret than the next best algorithm on four OpenML tabular datasets, as well as a 12.4% click lift on the Microsoft News Dataset (MIND) compared to other algorithms.
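
A rough sketch of how a Nadaraya-Watson estimate in an embedding space can feed Thompson sampling for Bernoulli rewards; the kernel, bandwidth, and Beta pseudo-count coupling below are illustrative assumptions, not the $C_3$ algorithm itself.

    import numpy as np

    def nadaraya_watson(query, past_embeddings, past_rewards, bandwidth=0.5):
        """Kernel-weighted average reward around a query point in embedding space."""
        d2 = ((past_embeddings - query) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        return w @ past_rewards / w.sum(), w.sum()

    def thompson_score(query, past_embeddings, past_rewards, rng=np.random.default_rng()):
        """Sample a success rate from a Beta whose pseudo-counts come from the kernel estimate."""
        p_hat, n_eff = nadaraya_watson(query, past_embeddings, past_rewards)
        return rng.beta(1 + p_hat * n_eff, 1 + (1 - p_hat) * n_eff)

    rng = np.random.default_rng(0)
    E = rng.normal(size=(100, 8))                    # embeddings of past arm pulls
    r = rng.integers(0, 2, size=100).astype(float)   # observed Bernoulli rewards
    arms = rng.normal(size=(5, 8))                   # candidate arm embeddings
    chosen = int(np.argmax([thompson_score(a, E, r) for a in arms]))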

URL: https://openreview.net/forum?id=tRbwfej9uY

---

Title: HiBaNG: Hierarchical Bayesian Nonparametric Granger Causal Discovery in Low-Data Regimes

Abstract: We present a principled probabilistic framework for discovering Granger causal relationships from multivariate time-series data in low-data regimes, where short sequences limit the applicability of modern deep learning approaches. While deep neural vector autoregressive (VAR) models perform well in high-data settings, they often struggle to generalize with limited samples and provide little insight into model uncertainty. To address these challenges we introduce HiBaNG, a hierarchical Bayesian nonparametric framework for Granger causal discovery. HiBaNG places a hierarchical factorized prior over binary Granger causal graphs that encodes structured sparsity and enables interpretable, uncertainty-aware inference. We develop a tractable Gibbs sampling algorithm that exploits conjugacy and augmentation for scalable posterior estimation. Extensive experiments on synthetic, semi-synthetic, and real-world climate datasets demonstrate that HiBaNG consistently outperforms both classical and deep VAR baselines, achieving improved accuracy and calibrated uncertainty.

URL: https://openreview.net/forum?id=e4VO3YlRBr

---

Title: Permutation-based Inference for Variational Learning of DAGs

Abstract: Estimating the structure of Bayesian networks as directed acyclic graphs (DAGs) from observational data is a fundamental challenge, particularly in causal discovery. Bayesian approaches excel by quantifying uncertainty and addressing identifiability, but key obstacles remain: (i) representing distributions over DAGs and (ii) estimating a posterior in the underlying combinatorial space. We introduce PIVID, a method that jointly infers a distribution over permutations and DAGs using variational inference and continuous relaxations of discrete distributions. Through experiments on synthetic and real-world datasets, we show that PIVID can outperform deterministic and Bayesian approaches, achieving superior accuracy-uncertainty trade-offs while scaling efficiently with the number of variables.

URL: https://openreview.net/forum?id=PlZhiIacyU

---

Title: A Targeted Learning Framework for Policy Evaluation with Unobserved Network Interference

Abstract: Estimating causal effects under network interference is a fundamental yet challenging task, especially when the network structure is represented as multiple layers or multiple views. In this paper, we consider a heterogeneous network setting, where ties from different views of the network might exert varying levels of interference. Meanwhile, dependence among units is allowed, due to information transmission among network ties and latent traits among units sharing ties (i.e., latent dependency). To the best of our knowledge, this setting has not yet been studied in the literature. We propose a novel framework that conducts doubly robust estimation on heterogeneous networks with latent dependency. Our approach relies on a new identification strategy and integrates it with targeted maximum likelihood estimation for robust causal effect estimation from observational data. Crucially, our approach remains valid even when the outcome prediction model or data-generating process is misspecified. It also supports counterfactual inference under hypothetical network interventions using only the observed network structure. Experiments on both synthetic and real-world networks show that our approach consistently outperforms existing baselines and can provide robust estimation under different intervention policies.

URL: https://openreview.net/forum?id=DD3rhPeQBB

---

Title: Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks

Abstract: Multi-task reinforcement learning often relies on task metadata—such as brief natural-language descriptions—to guide behavior across diverse objectives. We present Lexical Policy Networks (LEXPOL), a language-conditioned mixture-of-policies architecture for multi-task RL. LEXPOL encodes task metadata with a text encoder and uses a learned gating module to select or blend among multiple sub-policies, enabling end-to-end training across tasks. On MetaWorld benchmarks, LEXPOL matches or exceeds strong multi-task baselines in success rate and sample efficiency, without task-specific retraining. To analyze the mechanism, we further study settings with fixed expert policies obtained independently of the gate and show that the learned language gate composes these experts to produce behaviors appropriate to novel task descriptions and unseen task combinations. These results indicate that natural-language metadata can effectively index and recombine reusable skills within a single policy.

URL: https://openreview.net/forum?id=okP5HCnjJo

---

Title: Numerical Analysis of HiPPO-LegS ODE for Deep State Space Models

Abstract: In deep learning, the recently introduced state space models utilize HiPPO (High-order Polynomial Projection Operators) memory units to approximate continuous-time trajectories of input functions using ordinary differential equations (ODEs), and these techniques have shown empirical success in capturing long-range dependencies in long input sequences. However, the mathematical foundations of these ODEs, particularly the singular HiPPO-LegS (Legendre Scaled) ODE, and their corresponding numerical discretizations remain unsettled. In this work, we fill this gap by establishing that HiPPO-LegS ODE is well-posed despite its singularity, albeit without the freedom of arbitrary initial conditions. Further, we establish convergence of the associated numerical discretization schemes for Riemann integrable input functions.

URL: https://openreview.net/forum?id=83dhVASBPn

---

Title: Review of Reinforcement Learning for Large Language Models: Formulations, Algorithms, and Opportunities

Abstract: Large Language Models (LLMs) represent significant milestones in artificial intelligence development. While pre-training on vast text corpora and subsequent supervised fine-tuning establish their core abilities, Reinforcement Learning (RL) has emerged as an indispensable paradigm for refining LLMs, particularly in aligning them with human values, and teaching them to reason and follow complex instructions. As this field evolves rapidly, this survey offers a systematic review of RL methods for LLMs, with a focus on fundamental concepts, formal problem settings, and the main algorithms adapted to this context. Our review critically examines the inherent computational and algorithmic challenges arising from the integration of RL with LLMs, such as scalability issues, effective gradient estimation, and training efficiency. Concurrently, we highlight exciting opportunities for advancing LLM capabilities through new RL strategies, including multi-modal integration and the development of agentic LLM systems.

URL: https://openreview.net/forum?id=ghQQNjSxJc

---

Title: Achieving PAC Guarantees in Mechanism Design through Multi-Armed Bandits

Abstract: We analytically derive a class of optimal solutions to a linear program (LP) for automated mechanism design that satisfy efficiency, incentive compatibility, strong budget balance (SBB), and individual rationality (IR), where SBB and IR are enforced in expectation. Our solutions can be expressed using a set of essential variables whose cardinality is exponentially smaller than the total number of variables in the original formulation. However, evaluating a key term in the solutions requires exponentially many optimization steps as the number of players $N$ increases. We address this by translating the evaluation of this term into a multi-armed bandit (MAB) problem and developing a probably approximately correct (PAC) estimator with asymptotically optimal sample complexity. This MAB-based approach reduces the optimization complexity from exponential to $O(N\log N)$. Numerical experiments confirm that our method efficiently computes mechanisms with the target properties, scaling to problems with up to $N=128$ players---substantially improving over prior work.
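To illustrate only the PAC flavor (the paper's MAB construction and its asymptotically optimal sampler are specific to the mechanism-design term being evaluated), here is a generic Hoeffding-based $(\epsilon, \delta)$ mean estimator for a single bounded arm.

```python
# Generic (epsilon, delta)-PAC mean estimation for a bounded bandit arm via a
# Hoeffding bound. Illustrative only; not the paper's estimator.
import math, random

def pac_mean(pull, eps=0.05, delta=0.05):
    """Estimate an arm's mean within eps with probability >= 1 - delta.

    `pull` returns i.i.d. rewards in [0, 1]; Hoeffding gives the sample count
    n >= ln(2/delta) / (2 * eps^2).
    """
    n = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    return sum(pull() for _ in range(n)) / n, n

random.seed(0)
arm = lambda: random.betavariate(2, 5)          # true mean 2/7 ~= 0.286
estimate, samples = pac_mean(arm)
print(f"{samples} pulls -> estimate {estimate:.3f}")
```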

URL: https://openreview.net/forum?id=tbe8143jO8

---

Title: CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning

Abstract: In silico design and optimization of new materials primarily rely on high-accuracy atomic simulators that perform density functional theory (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation, mainly because of DFT's high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose CrystalGym, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark common value- and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties like the band gap, bulk modulus, and density, which are directly calculated from DFT in the environment. While none of the algorithms we benchmark solve all CrystalGym tasks, our extensive experiments and ablations reveal differences in sample efficiency and ease of convergence to optimality across algorithms and environment settings. Additionally, we include a case study on the scope of fine-tuning large language models with reinforcement learning for improving DFT-based rewards. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. We therefore introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary research on machine learning motivated by real-world applications.
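For readers new to online RL environments, a hypothetical Gymnasium-style interaction loop is sketched below; the environment id and the contents of observations and actions are placeholders, not the actual CrystalGym API, so consult the released code for the real interface.

```python
# Hypothetical Gymnasium-style rollout showing how an online-RL agent would
# consume per-step rewards (in CrystalGym, the reward comes from DFT, as
# described above). The env id "CrystalGym-v0" is a placeholder.
import gymnasium as gym

def random_rollout(env_id="CrystalGym-v0", episodes=2):   # placeholder id
    env = gym.make(env_id)
    for _ in range(episodes):
        obs, info = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()             # stand-in for a learned policy
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
    env.close()
```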

URL: https://openreview.net/forum?id=kEXgJVBqmV

---

Title: Perceptron-as-Opinion-Dynamics (POD): A Unified and Interpretable Machine Learning Framework for Opinion Dynamics

Abstract: We introduce Perceptron-as-Opinion-Dynamics (POD), a dual-activation perceptron framework that learns opinion dynamics directly from data with full interpretability. Each parameter in POD corresponds to a concrete social mechanism, including agent-specific inertia, dynamic influence networks, persistent bias, and nonlinear modes of perception and expression. With appropriate parameter choices, POD exactly recovers canonical linear models such as DeGroot, Friedkin–Johnsen, and Altafini, closely approximates nonlinear models like Hegselmann–Krause, and naturally extends to kinetic and extremist cases. When trained end-to-end, POD generalizes beyond these settings, achieving up to 83% faster convergence and reducing prediction error by 40–82% compared to canonical baselines on real-world data. By unifying fragmented opinion dynamics models into a single trainable and interpretable neural framework, POD lays a robust foundation for modeling complex belief evolution in social systems, addressing challenges that have persisted for decades.
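For reference, here is a minimal sketch of the two canonical linear updates the abstract says POD can recover with suitable parameter choices, DeGroot and Friedkin–Johnsen; the POD parameterization itself (dual activations, learned influence networks, bias terms) is not reproduced here.

```python
# Minimal sketch of two canonical linear opinion-dynamics updates.
import numpy as np

W = np.array([[0.5, 0.3, 0.2],     # row-stochastic influence weights
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
x0 = np.array([1.0, 0.0, -1.0])    # initial opinions

def degroot(x0, W, steps=50):
    x = x0.copy()
    for _ in range(steps):
        x = W @ x                  # x_{t+1} = W x_t
    return x

def friedkin_johnsen(x0, W, lam, steps=50):
    x = x0.copy()
    for _ in range(steps):
        x = lam * (W @ x) + (1 - lam) * x0   # susceptibility lam, anchored to x0
    return x

print(degroot(x0, W))                                # converges to consensus
print(friedkin_johnsen(x0, W, lam=np.full(3, 0.8)))  # persistent disagreement
```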

URL: https://openreview.net/forum?id=2xCDcMrEc3

---
