Expert Certification: Latent mixed-effect models for high-dimensional longitudinal data
Priscilla Ong, Manuel Haussmann, Otto Lönnroth, Harri Lähdesmäki
https://openreview.net/forum?id=7A96yteeF9
---
Survey Certification: Conditional Image Synthesis with Diffusion Models: A Survey
Zheyuan Zhan, Defang Chen, Jian-Ping Mei, Zhenghe Zhao, Jiawei Chen, Chun Chen, Siwei Lyu, Can Wang
https://openreview.net/forum?id=ewwNKwh6SK
---
Survey Certification: Machine Learning with Physics Knowledge for Prediction: A Survey
Joe Watson, Chen Song, Oliver Weeger, Theo Gruner, An Thai Le, Kay Hansel, Ahmed Hendawy, Oleg Arenz, Will Trojak, Miles Cranmer, Carlo D'Eramo, Fabian Buelow, Tanmay Goyal, Jan Peters, Martin W Hoffmann
https://openreview.net/forum?id=ZiJYahyXLU
---
Accepted papers
===============
Title: On the Utility of Existing Fine-Tuned Models on Data-Scarce Domains
Authors: Md Ibrahim Ibne Alam, Parikshit Ram, Soham Dan, Horst Samulowitz, Koushik Kar
Abstract: Large Language Models (LLMs) have been observed to perform well on a wide range of downstream tasks when fine-tuned on domain-specific data. However, such data may not be readily available in many applications, motivating zero-shot or few-shot approaches using existing domain or task adjacent (fine-tuned) models, which we call DAFT. While several fine-tuned models for various tasks are available, finding an appropriate DAFT model for a given task is often not straightforward. In this paper, we explore different ways of utilizing these existing DAFT models for data-scarce problems, i.e., tasks for which data is unavailable or limited. We observe that for zero-shot problems, ensembling of DAFT models provides accuracy close to that of the single best model. With few-shot problems (where a few examples from the target domain are available), this performance can be improved further by selecting, or placing more weight on, the DAFT models that are expected to perform better on the target task.
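A minimal sketch of the two strategies described above, assuming hypothetical model callables that map inputs to class logits and NumPy arrays for data; this is illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def few_shot_weights(logit_fns, x_few, y_few):
    # Weight each DAFT model by its accuracy on the few labelled target examples.
    return np.array([(f(x_few).argmax(axis=-1) == y_few).mean() for f in logit_fns])

def ensemble_predict(logit_fns, x, weights=None):
    # Zero-shot: uniform average of model predictions; few-shot: weighted average.
    w = np.ones(len(logit_fns)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    probs = [softmax(f(x)) for f in logit_fns]
    return sum(wi * pi for wi, pi in zip(w, probs))
```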
URL: https://openreview.net/forum?id=kY2fKLOGkI
---
Title: Latent mixed-effect models for high-dimensional longitudinal data
Authors: Priscilla Ong, Manuel Haussmann, Otto Lönnroth, Harri Lähdesmäki
Abstract: Modelling longitudinal data is an important yet challenging task. These datasets can be high-dimensional, contain non-linear effects and feature time-varying covariates. Gaussian process (GP) prior-based variational autoencoders (VAEs) have emerged as a promising approach due to their ability to model time-series data. However, they are costly to train and struggle to fully exploit the rich covariates characteristic of longitudinal data, making them difficult for practitioners to use effectively. In this work, we leverage linear mixed models (LMMs) and amortized variational inference to provide conditional priors for VAEs, and propose LMM-VAE, a scalable, interpretable and identifiable model. We highlight theoretical connections between it and GP-based techniques, providing a unified framework for this class of methods. Our proposal performs competitively compared to existing approaches across simulated and real-world datasets.
URL: https://openreview.net/forum?id=7A96yteeF9
---
Title: Group Fair Federated Learning via Stochastic Kernel Regularization
Authors: Huzaifa Arif, Pin-Yu Chen, Keerthiram Murugesan, Alex Gittens
Abstract: Ensuring \textbf{group fairness} in federated learning (FL) presents unique challenges due to data heterogeneity and communication constraints. We propose Kernel Fair Federated Learning (\texttt{KFFL}), a novel framework that incorporates group fairness into FL models using the Kernel Hilbert-Schmidt Independence Criterion (KHSIC) as a fairness regularizer. To address scalability, \texttt{KFFL} approximates KHSIC with Random Feature Maps (RFMs), significantly reducing computational and communication overhead while achieving \textit{group fairness}.
To address the resulting non-convex optimization problem, we propose \texttt{FedProxGrad}, a federated proximal gradient algorithm that guarantees convergence. Through experiments on standard benchmark datasets across both IID and Non-IID settings for regression and classification tasks, \texttt{KFFL} demonstrates its ability to balance accuracy and fairness effectively, outperforming existing methods by comprehensively exploring the Pareto Frontier. Furthermore, we introduce \texttt{KFFL-TD}, a time-delayed variant that further reduces communication rounds, enhancing efficiency in decentralized environments.
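A hedged sketch of the core ingredient, HSIC estimated from random feature maps and used as a fairness regularizer; the function names and kernel choices are assumptions, not the paper's exact KFFL code.

```python
import numpy as np

def rff(x, n_features=128, gamma=1.0, seed=0):
    # Random Fourier features approximating an RBF kernel exp(-gamma * ||a - b||^2).
    rng = np.random.default_rng(seed)
    x = np.atleast_2d(x).astype(float)
    w = rng.normal(scale=np.sqrt(2.0 * gamma), size=(x.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

def hsic_rff(preds, sensitive, n_features=128):
    # Biased HSIC estimate between predictions and a sensitive attribute, computed
    # from centred random feature maps (linear in n, no n x n kernel matrices).
    phi_p = rff(np.reshape(preds, (-1, 1)), n_features, seed=0)
    phi_s = rff(np.reshape(sensitive, (-1, 1)), n_features, seed=1)
    phi_p -= phi_p.mean(axis=0, keepdims=True)
    phi_s -= phi_s.mean(axis=0, keepdims=True)
    n = phi_p.shape[0]
    return np.linalg.norm(phi_p.T @ phi_s, "fro") ** 2 / n ** 2

# A fairness-regularised objective would then look like
#   loss = task_loss + lam * hsic_rff(model_outputs, sensitive_attribute)
# with lam trading off accuracy against group fairness.
```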
URL: https://openreview.net/forum?id=k8x44wVIs1
---
Title: Conditional Image Synthesis with Diffusion Models: A Survey
Authors: Zheyuan Zhan, Defang Chen, Jian-Ping Mei, Zhenghe Zhao, Jiawei Chen, Chun Chen, Siwei Lyu, Can Wang
Abstract: Conditional image synthesis based on user-specified requirements is a key component in creating complex visual content. In recent years, diffusion-based generative modeling has become a highly effective approach to conditional image synthesis, leading to exponential growth in the literature. However, the complexity of diffusion-based modeling, the wide range of image synthesis tasks, and the diversity of conditioning mechanisms make it challenging for researchers to keep up with rapid developments and to understand the core concepts of this topic. In this survey, we categorize existing works based on how conditions are integrated into the two fundamental components of diffusion-based modeling, i.e., the denoising network and the sampling process. We specifically highlight the underlying principles, advantages, and potential challenges of various conditioning approaches during the training, re-purposing, and specialization stages to construct a desired denoising network. We also summarize six mainstream conditioning mechanisms in the sampling process. All discussions are centered around popular applications. Finally, we pinpoint several critical yet still unsolved problems and suggest some possible solutions for future research.
URL: https://openreview.net/forum?id=ewwNKwh6SK
---
Title: Exploring Weak-to-Strong Generalization for CLIP-based Classification
Authors: Jinhao Li, Sarah Monazam Erfani, Lei Feng, James Bailey, Feng Liu
Abstract: Aligning large-scale commercial models with user intent is crucial to preventing harmful outputs. Current methods rely on human supervision but become impractical as model complexity increases. When models surpass human knowledge, providing accurate feedback becomes challenging and inefficient.
A novel solution proposed recently is using a weaker model to supervise a stronger model. This concept leverages the ability of weaker models to perform evaluations, thereby reducing the workload on human supervisors.
Previous work has shown the effectiveness of weak-to-strong generalization in the context of language-only models. Extending this concept to vision-language models leverages these insights, adapting the proven benefits to a multi-modal context.
In our study, we explore weak-to-strong generalization for CLIP-based classification. We propose a method, \emph{class prototype learning} (CPL), which aims to enhance the classification capabilities of the CLIP model, by learning more representative prototypes for each category.
Our findings indicate that, despite using a simple loss function under weak supervision, CPL yields robust improvements in targeted scenarios, particularly when pretraining is limited. Extensive experiments demonstrate that our approach is effective under these settings, achieving a 3.67\% improvement over strong baseline methods.
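A minimal sketch of prototype-based classification on top of frozen CLIP features, the setting CPL operates in; initialising prototypes from text embeddings of class names and the weak-supervision training loop are assumptions not shown here.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def prototype_logits(image_feats, prototypes, temperature=0.01):
    # image_feats: (n, d) frozen CLIP image embeddings.
    # prototypes:  (c, d) per-class prototypes, e.g. initialised from the text
    #              encoder's embeddings of class names and then refined.
    sims = l2_normalize(image_feats) @ l2_normalize(prototypes).T
    return sims / temperature  # argmax over classes gives the predicted label
```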
URL: https://openreview.net/forum?id=quE8gDDegf
---
Title: Deep Augmentation: Dropout as Augmentation for Self-Supervised Learning
Authors: Rickard Brüel Gabrielsson, Tongzhou Wang, Manel Baradad, Justin Solomon
Abstract: Despite dropout’s ubiquity in machine learning, its effectiveness as a form of data augmentation remains under-explored. We address two key questions: (i) When is dropout effective as an augmentation strategy? (ii) Is dropout uniquely effective under these conditions? To explore these questions, we propose Deep Augmentation, a network- and modality-agnostic method that applies dropout or PCA transformations to targeted layers in neural networks. Through extensive experiments on contrastive learning tasks in NLP, computer vision, and graph learning, we find that uniformly applying dropout across layers does not consistently improve performance. Instead, dropout proves most beneficial in deeper layers and can be matched by alternative augmentations (e.g., PCA). We also show that a stop-gradient operation is critical for ensuring dropout functions effectively as an augmentation, and that performance trends invert when moving from contrastive tasks to supervised tasks. Our analysis suggests that Deep Augmentation helps mitigate inter-layer co-adaptation---a notable issue in self-supervised learning due to the absence of labeled data. Drawing on these insights, we outline a procedure for selecting the optimal augmentation layer and demonstrate that Deep Augmentation can outperform traditional input-level augmentations. This simple yet powerful approach can be seamlessly integrated into a wide range of architectures and modalities, yielding notable gains in both performance and generalization.
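A hedged PyTorch sketch of the idea of applying dropout at a targeted intermediate layer as an augmentation, with a stop-gradient on the clean branch of an InfoNCE-style contrastive loss; the architecture and loss are illustrative, not the authors' exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepAugEncoder(nn.Module):
    """Toy encoder that can apply dropout at a chosen intermediate layer."""
    def __init__(self, d_in=128, d_hidden=256, d_out=64, p=0.5):
        super().__init__()
        self.f1 = nn.Linear(d_in, d_hidden)
        self.f2 = nn.Linear(d_hidden, d_out)
        self.drop = nn.Dropout(p)

    def forward(self, x, augment=False):
        h = F.relu(self.f1(x))
        if augment:              # "deep" augmentation at a targeted layer
            h = self.drop(h)
        return self.f2(h)

def contrastive_loss_with_stopgrad(encoder, x, temperature=0.1):
    # Clean view vs. dropout-augmented view; stop-gradient on the clean branch.
    z_clean = F.normalize(encoder(x, augment=False).detach(), dim=-1)
    z_aug = F.normalize(encoder(x, augment=True), dim=-1)
    logits = z_aug @ z_clean.t() / temperature   # positives on the diagonal
    labels = torch.arange(x.shape[0])
    return F.cross_entropy(logits, labels)
```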
URL: https://openreview.net/forum?id=OjWB2671AR
---
Title: Forecasting Company Fundamentals
Authors: Felix Divo, Eric Endress, Kevin Endler, Kristian Kersting, Devendra Singh Dhami
Abstract: Company fundamentals are key to assessing companies' financial and overall success and stability. Forecasting them is important in multiple fields, including investing and econometrics. While statistical and contemporary machine learning methods have been applied to many time series tasks, there is a lack of comparison of these approaches in this particularly challenging data regime. To this end, we bridge this gap by thoroughly evaluating the theoretical properties and practical performance of 24 deterministic and probabilistic company fundamentals forecasting models on real company data. We observe that deep learning models provide superior forecasting performance to classical models, in particular when considering uncertainty estimation. To validate the findings, we compare them to human analyst expectations and find that the analysts' accuracy is comparable to that of the automatic forecasts. We further show how these high-quality forecasts can benefit automated stock allocation. We close by presenting possible ways of integrating domain experts to further improve performance and increase reliability.
URL: https://openreview.net/forum?id=haf78jerSt
---
Title: ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control
Authors: Ehsan Futuhi, Shayan Karimi, Chao Gao, Martin Müller
Abstract: We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, \emph{$\epsilon t$-greedy}, which generates exploratory options for exploring less-visited states. We prove that search using $\epsilon t$-greedy has polynomial sample complexity under mild MDP assumptions. To more efficiently use the information provided by rewarded transitions, we develop a new dual experience replay buffer framework, \emph{GDRB}, and implement \emph{longest n-step returns}. The resulting algorithm, \emph{ETGL-DDPG}, integrates all three techniques: \bm{$\epsilon t$}-greedy, \textbf{G}DRB, and \textbf{L}ongest $n$-step, into DDPG. We evaluate ETGL-DDPG on standard benchmarks and demonstrate that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further highlight how each strategy individually enhances the performance of DDPG in this setting.
URL: https://openreview.net/forum?id=6g1WJ55N51
---
Title: Music Foundation Model as Generic Booster for Music Downstream Tasks
Authors: Wei-Hsiang Liao, Yuhta Takida, Yukara Ikemiya, Zhi Zhong, Chieh-Hsin Lai, Giorgio Fabbro, Kazuki Shimada, Keisuke Toyama, Kin Wai Cheuk, Marco A. Martínez-Ramírez, Shusuke Takahashi, Stefan Uhlich, Taketo Akama, Woosung Choi, Yuichiro Koyama, Yuki Mitsufuji
Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.
URL: https://openreview.net/forum?id=kHl4JzyNzF
---
Title: Studying Exploration in RL: An Optimal Transport Analysis of Occupancy Measure Trajectories
Authors: Reabetswe M. Nkhumise, Debabrota Basu, Tony J. Prescott, Aditya Gilra
Abstract: The rising successes of RL are propelled by combining smart algorithmic strategies and deep architectures to optimize the distribution of returns and visitations over the state-action space. A quantitative framework to compare the learning processes of these eclectic RL algorithms is currently absent but desired in practice. We address this gap by representing the learning process of an RL algorithm as a sequence of policies generated during training, and then studying the policy trajectory induced in the manifold of state-action occupancy measures. Using an optimal transport-based metric, we measure the length of the paths induced by the policy sequence yielded by an RL algorithm between an initial policy and a final optimal policy. Hence, we first define the Effort of Sequential Learning (ESL). ESL quantifies the relative distance that an RL algorithm travels compared to the shortest path from the initial to the optimal policy. Furthermore, we connect the dynamics of policies in the occupancy measure space and regret (another metric to understand the suboptimality of an RL algorithm), by defining the Optimal Movement Ratio (OMR). OMR assesses the fraction of movements in the occupancy measure space that effectively reduce an analogue of regret. Finally, we derive approximation guarantees to estimate ESL and OMR with a finite number of samples and without access to an optimal policy. Through empirical analyses across various environments and algorithms, we demonstrate that ESL and OMR provide insights into the exploration processes of RL algorithms and the hardness of different tasks in discrete and continuous MDPs.
URL: https://openreview.net/forum?id=pdC092Nn8N
---
Title: LASP: Linear Attention Sequence Parallelism
Authors: Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong
Abstract: Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take advantage of their right-product-first feature, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism to leverage the right-product kernel trick of linear attention, which sharply decreases the communication overhead compared with existing SP methods. We enhance the computation efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPUs. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very long sequences. We also discuss the generalization of LASP to other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8$\times$ longer than existing SP methods. Code is available at: \url{https://github.com/OpenNLPLab/LASP}.
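A NumPy sketch of the right-product trick that LASP exploits, shown for the simple non-causal case with an elementary positive feature map; the ring-style communication, causal masking, and kernel fusion are not shown, and the feature map is an assumption.

```python
import numpy as np

def linear_attention_right_product(Q, K, V):
    # Instead of forming the (n x n) matrix (Q K^T) and multiplying by V,
    # compute Q (K^T V), which costs O(n d^2). The small (d x d) state K^T V
    # is the quantity a sequence-parallel scheme needs to exchange.
    Q = np.maximum(Q, 0.0) + 1e-6          # simple positive feature map
    K = np.maximum(K, 0.0) + 1e-6
    kv_state = K.T @ V                     # (d, d) summary of the sequence chunk
    normalizer = Q @ K.sum(axis=0)         # (n,)
    return (Q @ kv_state) / normalizer[:, None]
```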
URL: https://openreview.net/forum?id=gG8sQUUtN7
---
Title: MOORL: A Framework for Integrating Offline-Online Reinforcement Learning
Authors: Gaurav Chaudhary, Washim Uddin Mondal, Laxmidhar Behera
Abstract: Sample efficiency and exploration remain critical challenges in Deep Reinforcement Learning (DRL), particularly in complex domains. Offline RL, which enables agents to learn optimal policies from static, pre-collected datasets, has emerged as a promising alternative. However, offline RL is constrained by issues such as out-of-distribution (OOD) actions that limit policy performance and generalization.
To overcome these limitations, we propose Meta Offline-Online Reinforcement Learning (MOORL), a hybrid framework that unifies offline and online RL for efficient and scalable learning. While previous hybrid methods rely on extensive design choices and added complexity to utilize offline data effectively, MOORL introduces a meta-policy that seamlessly adapts across offline and online trajectories. This enables the agent to leverage offline data for robust initialization while utilizing online interactions to drive efficient exploration. Importantly, MOORL addresses the key challenges of hybrid RL while remaining design-free.
Our theoretical analysis demonstrates that the hybrid approach enhances exploration by effectively combining the complementary strengths of offline and online data. Furthermore, we demonstrate that MOORL learns a stable Q-function without relying on extensive design choices. Extensive experiments on 28 tasks from the D4RL and V-D4RL benchmarks validate its effectiveness, showing consistent improvements over state-of-the-art offline and hybrid RL baselines. With minimal computational overhead, MOORL achieves strong performance, underscoring its potential for practical applications in real-world scenarios.
URL: https://openreview.net/forum?id=PHsfZnF2FC
---
Title: Time-Uniform Confidence Spheres for Means of Random Vectors
Authors: Ben Chugg, Hongjian Wang, Aaditya Ramdas
Abstract: We study sequential mean estimation in $\mathbb{R}^d$. In particular, we derive time-uniform confidence spheres---\emph{confidence sphere sequences} (CSSs)---which contain the mean of random vectors with high probability simultaneously across all sample sizes.
Our results include a dimension-free CSS for log-concave random vectors, a dimension-free CSS for sub-Gaussian random vectors, and
CSSs for sub-$\psi$ random vectors (which include sub-gamma and sub-exponential distributions). Many of our results are optimal. For sub-Gaussian distributions we also provide a CSS which tracks a time-varying mean, generalizing Robbins' mixture approach to the multivariate setting. Finally, we provide several CSSs for heavy-tailed random vectors (two moments only). Our bounds hold under a martingale assumption on the mean and do not require that the observations be iid. Our work is based on PAC-Bayesian theory and inspired by an approach of Catoni and Giulini.
URL: https://openreview.net/forum?id=2NSb3cJE03
---
Title: How far away are truly hyperparameter-free learning algorithms?
Authors: Priya Kasimbeg, Vincent Roulet, Naman Agarwal, Sourabh Medapati, Fabian Pedregosa, Atish Agarwala, George E. Dahl
Abstract: Despite major advances in methodology, hyperparameter tuning remains a crucial (and expensive) part of the development of machine learning systems. Even ignoring architectural choices, deep neural networks have a large number of optimization and regularization hyperparameters that need to be tuned carefully per workload in order to obtain the best results. In a perfect world, training algorithms would not require workload-specific hyperparameter tuning, but would instead have default settings that performed well across many workloads. Recently, there has been a growing literature on optimization methods which attempt to reduce the number of hyperparameters---particularly the learning rate and its accompanying schedule. Given these developments, how far away is the dream of neural network training algorithms that completely obviate the need for painful tuning?
In this paper, we evaluate the potential of learning-rate-free methods as components of hyperparameter-free methods. We freeze their (non-learning rate) hyperparameters to default values, and score their performance using the recently-proposed AlgoPerf: Training Algorithms benchmark. We found that literature-supplied default settings performed poorly on the benchmark, so we performed a search for hyperparameter configurations that performed well across all workloads simultaneously. The best "algoperf-calibrated" learning-rate-free methods had much improved performance but still lagged slightly behind a similarly calibrated NadamW baseline in overall benchmark score. Our results suggest that there is still much room for improvement for learning-rate-free methods, and that testing against a strong, workload-agnostic baseline is important to improve hyperparameter reduction techniques.
URL: https://openreview.net/forum?id=6BlOCx5c5T
---
Title: Hitchhiker's guide on the relation of Energy-Based Models with other generative models, sampling and statistical physics: a comprehensive review
Authors: Davide Carbone
Abstract: Energy-Based Models (EBMs) have emerged as a powerful framework in the realm of generative modeling, offering a unique perspective that aligns closely with principles of statistical mechanics. This review aims to provide physicists with a comprehensive understanding of EBMs, delineating their connection to other generative models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Normalizing Flows. We explore the sampling techniques crucial for EBMs, including Markov Chain Monte Carlo (MCMC) methods, and draw parallels between EBM concepts and statistical mechanics, highlighting the significance of energy functions and partition functions. Furthermore, we delve into state-of-the-art training methodologies for EBMs, covering recent advancements and their implications for enhanced model performance and efficiency. This review is designed to clarify the often complex interconnections between these models, which can be challenging due to the diverse communities working on the topic.
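A toy example of the sampling step the review discusses: drawing from an EBM density p(x) ∝ exp(-E(x)) with unadjusted Langevin dynamics; the double-well energy and the finite-difference gradient are illustrative choices.

```python
import numpy as np

def energy(x):
    # Toy energy function; p(x) is proportional to exp(-E(x)). Double-well potential.
    return (x ** 2 - 1.0) ** 2

def grad_energy(x, eps=1e-4):
    # Central finite-difference gradient of the energy.
    return (energy(x + eps) - energy(x - eps)) / (2 * eps)

def langevin_sample(n_steps=1000, step=1e-2, x0=0.0, seed=0):
    # Unadjusted Langevin dynamics: x <- x - (step/2) * grad E(x) + sqrt(step) * noise.
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(n_steps):
        x = x - 0.5 * step * grad_energy(x) + np.sqrt(step) * rng.normal()
    return x
```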
URL: https://openreview.net/forum?id=VTgixSbrJI
---
Title: A Local Polyak-Łojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models
Authors: Ziqing Xu, Hancheng Min, Salma Tarmoun, Enrique Mallada, Rene Vidal
Abstract: Most prior work on the convergence of gradient descent (GD) for overparameterized neural networks relies on strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (large, spectral, balanced). Recent efforts to relax these assumptions focus on two-layer linear networks trained with the squared loss.
In this work, we derive a linear convergence rate for training two-layer linear neural networks with GD for general losses and under relaxed assumptions on the step size, width, and initialization. A key challenge in deriving this result is that classical ingredients for deriving convergence rates for nonconvex problems, such as the Polyak-Łojasiewicz (PL) condition and Descent Lemma, do not hold globally for overparameterized neural networks. Here, we prove that these two conditions hold locally with local constants that depend on the weights. Then, we provide bounds on these local constants, which depend on the initialization of the weights, the current loss, and the global PL and smoothness constants of the non-overparameterized model. Based on these bounds, we derive a linear convergence rate for GD. Our convergence analysis not only improves upon prior results but also suggests a better choice for the step size, as verified through our numerical experiments.
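For reference, the standard (global) forms of the two ingredients named in the abstract; the paper's contribution is local, weight-dependent versions with constants $\mu(\theta)$ and $K(\theta)$, which are not reproduced here.

```latex
% Polyak-Lojasiewicz condition:
\tfrac{1}{2}\,\|\nabla L(\theta)\|^2 \;\ge\; \mu\,\bigl(L(\theta) - L^{*}\bigr).
% Descent lemma (K-smoothness):
L(\theta') \;\le\; L(\theta) + \langle \nabla L(\theta),\, \theta' - \theta \rangle
            + \tfrac{K}{2}\,\|\theta' - \theta\|^2 .
% Combined, gradient descent with step size \eta \le 1/K satisfies
L(\theta_{t+1}) - L^{*} \;\le\; \bigl(1 - \eta\mu\bigr)\,\bigl(L(\theta_{t}) - L^{*}\bigr),
% i.e. a linear convergence rate.
```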
URL: https://openreview.net/forum?id=VPl3T43Hxb
---
Title: Pruning Feature Extractor Stacking for Cross-domain Few-shot Learning
Authors: Hongyu Wang, Eibe Frank, Bernhard Pfahringer, Geoff Holmes
Abstract: Combining knowledge from source domains to learn efficiently from a few labelled instances in a target domain is a transfer learning problem known as cross-domain few-shot learning (CDFSL). Feature extractor stacking (FES) is a state-of-the-art CDFSL method that maintains a collection of source domain feature extractors instead of a single universal extractor. FES uses stacked generalisation to build an ensemble from extractor snapshots saved during target domain fine-tuning. It outperforms several contemporary universal model-based CDFSL methods in the Meta-Dataset benchmark. However, it incurs higher storage cost because it saves a snapshot for every fine-tuning iteration for every extractor. In this work, we propose a bidirectional snapshot selection strategy for FES, leveraging its cross-validation process and the ordered nature of its snapshots, and demonstrate that a 95% snapshot reduction can be achieved while retaining the same level of accuracy.
URL: https://openreview.net/forum?id=p499xXaclC
---
Title: Cometh: A continuous-time discrete-state graph diffusion model
Authors: Antoine Siraudin, Fragkiskos D. Malliaros, Christopher Morris
Abstract: Discrete-state denoising diffusion models have led to state-of-the-art performance in graph generation, especially in the molecular domain. Recently, they have been transposed to continuous time, allowing more flexibility in the reverse process and a better trade-off between sampling efficiency and quality. Here, to leverage the benefits of both approaches, we propose Cometh, a continuous-time discrete-state graph diffusion model, tailored to the specificities of graph data. In addition, we successfully replace the set of structural encodings previously used in the discrete graph diffusion model with a single random-walk-based encoding, providing a simple and principled way to boost the model's expressive power. Empirically, we show that integrating continuous time leads to significant improvements across various metrics over state-of-the-art discrete-state diffusion models on a large set of molecular and non-molecular benchmark datasets. In terms of VUN samples, Cometh obtains a near-perfect performance of 99.5% on the planar graph dataset and outperforms DiGress by 12.6% on the large GuacaMol dataset.
URL: https://openreview.net/forum?id=nuN1mRrrjX
---
Title: Machine Learning with Physics Knowledge for Prediction: A Survey
Authors: Joe Watson, Chen Song, Oliver Weeger, Theo Gruner, An Thai Le, Kay Hansel, Ahmed Hendawy, Oleg Arenz, Will Trojak, Miles Cranmer, Carlo D'Eramo, Fabian Buelow, Tanmay Goyal, Jan Peters, Martin W Hoffmann
Abstract: This survey examines the broad suite of methods and models for combining machine learning with physics knowledge for prediction and forecasting, with a focus on partial differential equations.
These methods have attracted significant interest due to their potential impact on advancing scientific research and industrial practices by improving predictive models with small- or large-scale datasets and expressive predictive models with useful inductive biases.
The survey has two parts.
The first considers incorporating physics knowledge on an architectural level through objective functions, structured predictive models, and data augmentation.
The second considers data as physics knowledge, which motivates looking at multi-task, meta, and contextual learning as an alternative approach to incorporating physics knowledge in a data-driven fashion.
Finally, we also provide an industrial perspective on the application of these methods and a survey of the open-source ecosystem for physics-informed machine learning.
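A minimal PyTorch sketch of one route from the survey's first part, incorporating physics knowledge through the objective function: a data-fit term plus a residual penalty for a governing equation. The ODE du/dt = -u, the network, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def physics_informed_loss(t_data, u_data, t_colloc):
    # Data-fit term on observed points plus a residual penalty for du/dt = -u
    # evaluated at collocation points.
    u_pred = net(t_data)
    data_loss = ((u_pred - u_data) ** 2).mean()

    t = t_colloc.clone().requires_grad_(True)
    u = net(t)
    du_dt = torch.autograd.grad(u, t, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    residual = du_dt + u              # should vanish wherever the ODE holds
    physics_loss = (residual ** 2).mean()
    return data_loss + physics_loss
```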
URL: https://openreview.net/forum?id=ZiJYahyXLU
---
Title: Uniform Noise Distribution and Compact Clusters: Unveiling the Success of Self-Supervised Learning in Label Noise
Authors: Pengcheng Xu, Li Yi, Gezheng Xu, Xi Chen, Ian McLeod, Charles Ling, Boyu Wang
Abstract: Label noise is ubiquitous in real-world datasets, posing significant challenges to machine learning models. While self-supervised learning (SSL) algorithms have empirically demonstrated effectiveness in learning with noisy labels, the theoretical understanding of their effectiveness remains underexplored.
In this paper, we present a theoretical framework to understand how SSL methods enhance learning with noisy labels, especially for the instance-dependent label noise.
We reveal that the uniform and compact cluster structures induced by contrastive SSL play a crucial role in mitigating the adverse effects of label noise. Specifically, we theoretically show that a classifier trained on SSL-learned representations significantly outperforms one trained using traditional supervised learning methods. This results from two key merits of SSL representations over label noise:
1. Uniform Noise Distribution: Label noise becomes uniformly distributed over SSL representations with respect to the true class labels, rather than the noisy ones, leading to an easier learning task.
2. Enhanced Cluster Structure: SSL enhances the formation of well-separated and compact categorical clusters, increasing inter-class distances while tightening intra-class clusters.
We further theoretically justify the benefits of training a classifier on such structured representations, demonstrating that it encourages the classifier trained on noisy data to be aligned with the optimal classifier. Extensive experiments validate the robustness of SSL representations in combating label noise, confirming the practical values of our theoretical findings.
URL: https://openreview.net/forum?id=LDBjgS5Ez7
---
Title: RefinedFields: Radiance Fields Refinement for Planar Scene Representations
Authors: Karim Kassab, Antoine Schnepf, Jean-Yves Franceschi, Laurent Caraffa, Jeremie Mary, Valerie Gouet-Brunet
Abstract: Planar scene representations have recently witnessed increased interest for modeling scenes from images, as their lightweight planar structure enables compatibility with image-based models. Notably, K-Planes have gained particular attention as they extend planar scene representations to support in-the-wild scenes, in addition to object-level scenes. However, their visual quality has recently lagged behind that of state-of-the-art techniques. To reduce this gap, we propose RefinedFields, a method that leverages pre-trained networks to refine K-Planes scene representations via optimization guidance using an alternating training procedure. We carry out extensive experiments and verify the merit of our method on synthetic data and real tourism photo collections. RefinedFields enhances rendered scenes with richer details and improves upon its base representation on the task of novel view synthesis. Our project page can be found at https://refinedfields.github.io .
URL: https://openreview.net/forum?id=S6JpSsYBDZ
---
Title: M3CoL: Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
Authors: Raja Kumar, Raghav Singhal, Pranamya Prashant Kulkarni, Deval Mehta, Kshitij Sharad Jadhav
Abstract: Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities, thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.
URL: https://openreview.net/forum?id=NeQYi56MFj
---
Title: Guided Discrete Diffusion for Electronic Health Record Generation
Authors: Jun Han, Zixiang Chen, Yongqian Li, Yiwen Kou, Eran Halperin, Robert E. Tillman, Quanquan Gu
Abstract: Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite their wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining less attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.
URL: https://openreview.net/forum?id=N2rWhTgits
---
Title: DyGMamba: Efficiently Modeling Long-Term Temporal Dependency on Continuous-Time Dynamic Graphs with State Space Models
Authors: Zifeng Ding, Yifeng Li, Yuan He, Antonio Norelli, Jingcheng Wu, Volker Tresp, Michael M. Bronstein, Yunpu Ma
Abstract: Learning useful representations for continuous-time dynamic graphs (CTDGs) is challenging, due to the concurrent need to span long node interaction histories and grasp nuanced temporal details. In particular, two problems emerge: (1) Encoding longer histories requires more computational resources, making it crucial for CTDG models to maintain low computational complexity to ensure efficiency; (2) Meanwhile, more powerful models are needed to identify and select the most critical temporal information within the extended context provided by longer histories. To address these problems, we propose a CTDG representation learning model named DyGMamba, originating from the popular Mamba state space model (SSM). DyGMamba first leverages a node-level SSM to encode the sequence of historical node interactions. Another time-level SSM is then employed to exploit the temporal patterns hidden in the historical graph, where its output is used to dynamically select the critical information from the interaction history. We validate DyGMamba experimentally on the dynamic link prediction task. The results show that our model achieves state-of-the-art performance in most cases. DyGMamba also maintains high efficiency in terms of computational resources, making it possible to capture long temporal dependencies with a limited computation budget.
URL: https://openreview.net/forum?id=sq5AJvVuha
---
Title: Towards identifiability of micro total effects in summary causal graphs with latent confounding: extension of the front-door criterion
Authors: Charles K. Assaad
Abstract: Conducting experiments to estimate total effects can be challenging due to cost, ethical concerns, or practical limitations. As an alternative, researchers often rely on causal graphs to determine whether these effects can be identified from observational data. Identifying total effects in fully specified causal graphs has received considerable attention, with Pearl's front-door criterion enabling the identification of total effects in the presence of latent confounding even when no variable set is sufficient for adjustment. However, specifying a complete causal graph is challenging in many domains. Extending these identifiability results to partially specified graphs is crucial, particularly in dynamic systems where causal relationships evolve over time. This paper addresses the challenge of identifying total effects using a specific and well-known partially specified graph in dynamic systems called a summary causal graph, which does not specify the temporal lag between causal relations and can contain cycles. In particular, this paper presents sufficient graphical conditions for identifying total effects from observational data, even in the presence of cycles and latent confounding, and when no variable set is sufficient for adjustment.
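For context, the classical front-door adjustment in a fully specified causal DAG (with a mediator $M$ shielding the effect of $X$ on $Y$), which the paper extends to summary causal graphs with cycles and latent confounding:

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr)
  \;=\; \sum_{m} P(m \mid x) \sum_{x'} P\bigl(y \mid x', m\bigr)\, P(x').
```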
URL: https://openreview.net/forum?id=5f7YlSKG1l
---
Title: Explaining the Behavior of Black-Box Prediction Algorithms with Causal Learning
Authors: Numair Sani, Daniel Malinsky, Ilya Shpitser
Abstract: Causal approaches to post-hoc explainability for black-box prediction models (e.g., deep neural networks trained on image pixel data) have become increasingly popular. However, existing approaches have two important shortcomings: (i) the “explanatory units” are micro-level inputs into the relevant prediction model, e.g., image pixels, rather than interpretable macro-level features that are more useful for understanding how to possibly change the algorithm’s behavior, and (ii) existing approaches assume there exists no unmeasured confounding between features and target model predictions, which fails to hold when the explanatory units are macro-level variables. Our focus is on the important setting where the analyst has no access to the inner workings of the target prediction algorithm, rather only the ability to query the output of the model in response to a particular input. To provide causal explanations in such a setting, we propose to learn causal graphical representations that allow for arbitrary unmeasured confounding among features. We demonstrate the resulting graph can differentiate between interpretable features that causally influence model predictions versus those that are merely associated with model predictions due to confounding. Our approach is motivated by a counterfactual theory of causal explanation wherein good explanations point to factors that are “difference-makers” in an interventionist sense.
URL: https://openreview.net/forum?id=ZrqLpXbXvA
---
Title: A Survey on Large Language Model Acceleration based on KV Cache Management
Authors: Haoyang LI, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole HU, Wei Dong, Li Qing, Lei Chen
Abstract: Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations.
Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments.
Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications.
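A toy single-head decoding loop with a KV cache, to make concrete the object that token-level strategies (selection, budget allocation, merging, quantization, eviction) operate on; this sketch is illustrative and not taken from the survey.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Minimal single-head KV cache: past keys/values are stored once and reused
    at every decoding step instead of being recomputed."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def step(self, q, k, v):
        # q, k, v: 1D vectors of size d for the current token.
        self.keys = np.vstack([self.keys, k])      # cache growth is what token-level
        self.values = np.vstack([self.values, v])  # selection/eviction strategies target
        attn = softmax(q @ self.keys.T / np.sqrt(q.shape[-1]))
        return attn @ self.values
```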
URL: https://openreview.net/forum?id=z3JZzu9EA3
---
Title: Investigating the Effects of Fairness Interventions Using Pointwise Representational Similarity
Authors: Camila Kolling, Till Speicher, Vedant Nanda, Mariya Toneva, Krishna P. Gummadi
Abstract: Machine learning (ML) algorithms can often exhibit discriminatory behavior, negatively affecting certain populations across protected groups. To address this, numerous debiasing methods, and consequently evaluation measures, have been proposed. Current evaluation measures for debiasing methods suffer from two main limitations: (1) they primarily provide a global estimate of unfairness, failing to provide a more fine-grained analysis, and (2) they predominantly analyze the model output on a specific task, failing to generalize the findings to other tasks. In this work, we introduce Pointwise Normalized Kernel Alignment (PNKA), a pointwise representational similarity measure that addresses these limitations by measuring how debiasing measures affect the intermediate representations of individuals. On tabular data, the use of PNKA reveals previously unknown insights: while group fairness predominantly influences a small subset of the population, maintaining high representational similarity for the majority, individual fairness constraints uniformly impact representations across the entire population, altering nearly every data point. We show that by evaluating representations using PNKA, we can reliably predict the behavior of ML models trained on these representations. Moreover, applying PNKA to language embeddings shows that existing debiasing methods may not perform as intended, failing to remove biases from stereotypical words and sentences. Our findings suggest that current evaluation measures for debiasing methods are insufficient, highlighting the need for a deeper understanding of the effects of debiasing methods, and show how pointwise representational similarity metrics can help with fairness audits.
URL: https://openreview.net/forum?id=CkVlt2Qgdb
---
Title: Statistical Error Bounds for GANs with Nonlinear Objective Functionals
Authors: Jeremiah Birrell
Abstract: Generative adversarial networks (GANs) are unsupervised learning methods for training a generator distribution to produce samples that approximate those drawn from a target distribution. Many such methods can be formulated as minimization of a metric or divergence between probability distributions. Recent works have derived statistical error bounds for GANs that are based on integral probability metrics (IPMs), e.g., WGAN which is based on the 1-Wasserstein metric. In general, IPMs are defined by optimizing a linear functional (difference of expectations) over a space of discriminators. A much larger class of GANs, which we here call $(f,\Gamma)$-GANs, can be constructed using $f$-divergences (e.g., Jensen-Shannon, KL, or $\alpha$-divergences) together with a regularizing discriminator space $\Gamma$ (e.g., $1$-Lipschitz functions). These GANs have nonlinear objective functions, depending on the choice of $f$, and have been shown to exhibit improved performance in a number of applications. In this work we derive statistical error bounds for $(f,\Gamma)$-GANs for general classes of $f$ and $\Gamma$ in the form of finite-sample concentration inequalities. These results prove the statistical consistency of $(f,\Gamma)$-GANs and reduce to the known results for IPM-GANs in the appropriate limit. Our results use novel Rademacher complexity bounds which provide new insight into the performance of IPM-GANs for distributions with unbounded support and have application to statistical learning tasks beyond GANs.
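For reference, the linear objective the abstract refers to, the integral probability metric over a discriminator class $\Gamma$; choosing $\Gamma$ as the 1-Lipschitz functions recovers the 1-Wasserstein metric used by WGAN. The nonlinear $(f,\Gamma)$-objective generalizes this and is not reproduced here.

```latex
d_{\Gamma}(P, Q) \;=\; \sup_{\gamma \in \Gamma}
  \Bigl( \mathbb{E}_{P}[\gamma] \;-\; \mathbb{E}_{Q}[\gamma] \Bigr).
```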
URL: https://openreview.net/forum?id=ZgjhykPSdU
---
Title: Closed-Form Diffusion Models
Authors: Christopher Scarvelis, Haitz Sáez de Ocáriz Borde, Justin Solomon
Abstract: Score-based generative models (SGMs) sample from a target distribution by iteratively transforming noise using the score function of the perturbed target. For any finite training set, this score function can be evaluated in closed form, but the resulting SGM memorizes its training data and does not generate novel samples. In practice, one approximates the score by training a neural network via score-matching. The error in this approximation promotes generalization, but neural SGMs are costly to train and sample, and the effective regularization this error provides is not well-understood theoretically. In this work, we instead explicitly smooth the closed-form score to obtain an SGM that generates novel samples without training. We analyze our model and propose an efficient nearest-neighbor-based estimator of its score function. Using this estimator, our method achieves competitive sampling times while running on consumer-grade CPUs.
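A hedged NumPy sketch of the closed-form score underlying this line of work: for the Gaussian-smoothed empirical distribution, the score is a softmax-weighted pull toward the training points. Restricting the sum to nearest neighbours gives a cheap approximation in the spirit of the paper's estimator; the exact smoothing used by the method is not reproduced.

```python
import numpy as np

def closed_form_score(x, train_data, sigma):
    # Score of p_sigma(x) = (1/N) * sum_i N(x; x_i, sigma^2 I) at a query point x.
    diffs = train_data - x                                   # (N, d)
    logw = -np.sum(diffs ** 2, axis=1) / (2 * sigma ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()                                             # softmax weights
    return (w[:, None] * diffs).sum(axis=0) / sigma ** 2
```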
URL: https://openreview.net/forum?id=JkMifr17wc
---
New submissions
===============
Title: Balancing Utility and Privacy: Dynamically Private SGD with Random Projection
Abstract: Stochastic optimization is a pivotal enabler in modern machine learning, producing effective models for various tasks. However, several existing works have shown that model parameters and gradient information are susceptible to privacy leakage. Although Differentially Private SGD (DPSGD) addresses privacy concerns, its static noise mechanism impacts the error bounds for model performance. Additionally, with the exponential increase in model parameters, efficient learning of these models using stochastic optimizers has become more challenging. To address these concerns, we introduce the Dynamically Differentially Private Projected SGD (D2P2-SGD) optimizer. In D2P2-SGD, we combine two important ideas: (i) dynamic differential privacy (DDP) with automatic gradient clipping and (ii) random projection with SGD, allowing dynamic adjustment of the tradeoff between utility and privacy of the model. It exhibits provably sub-linear convergence rates across different objective functions, matching the best available rate. The theoretical analysis further suggests that DDP leads to better utility at the cost of privacy, while random projection enables more efficient model learning. Extensive experiments across diverse datasets show that D2P2-SGD remarkably enhances accuracy while maintaining privacy. Our code is available here.
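A hedged sketch combining the two named ingredients, per-step clipping with step-dependent Gaussian noise and a random projection of the gradient; the placement of the projection, the noise schedule, and all names are assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def d2p2_style_step(theta, grad, lr, clip_c, sigma_t, proj_dim, rng):
    # theta, grad: flat parameter and gradient vectors.
    g = grad * min(1.0, clip_c / (np.linalg.norm(grad) + 1e-12))   # gradient clipping
    g = g + rng.normal(scale=sigma_t * clip_c, size=g.shape)       # DP noise; sigma_t may decay
    P = rng.normal(size=(proj_dim, g.size)) / np.sqrt(proj_dim)    # random projection
    g = P.T @ (P @ g)                                              # low-dimensional round trip
    return theta - lr * g

# Example schedule (assumption): sigma_t = sigma_0 / np.sqrt(t + 1) at iteration t.
```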
URL: https://openreview.net/forum?id=u6OSRdkAwl
---
Title: Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics
Abstract: Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). However, ensuring the robustness of LLMs remains a critical challenge. To address these challenges and advance the field, this survey provides a comprehensive overview of current studies in this area. First, we systematically examine the nature of robustness in LLMs, including its conceptual foundations, the importance of consistent performance across diverse inputs, and the implications of failure modes in real-world applications. Next, we analyze the sources of non-robustness, categorizing intrinsic model limitations, data-driven vulnerabilities, and external adversarial factors that compromise reliability. Following this, we review state-of-the-art mitigation strategies, and then we discuss widely adopted benchmarks, emerging metrics, and persistent gaps in assessing real-world reliability. Finally, we synthesize findings from existing surveys and interdisciplinary studies to highlight trends, unresolved issues, and pathways for future research.
URL: https://openreview.net/forum?id=Bchvaaod6g
---
Title: Robust Invariant Representation Learning by Distribution Extrapolation
Abstract: Invariant risk minimization (IRM) aims to enable out-of-distribution (OOD) generalization in deep learning by learning invariant representations. As IRM poses an inherently challenging bi-level optimization problem, most existing approaches---including IRMv1---adopt penalty-based single-level approximations. However, empirical studies consistently show that these methods often fail to outperform well-tuned empirical risk minimization (ERM), highlighting the need for more robust IRM implementations. This work theoretically identifies a key limitation common to many IRM variants: their penalty terms are highly sensitive to limited environment diversity and over-parameterization, resulting in performance degradation. To address this issue, a novel extrapolation-based framework is proposed that enhances environmental diversity by augmenting the IRM penalty through synthetic distributional shifts. Extensive experiments---ranging from synthetic setups to realistic, over-parameterized scenarios---demonstrate that the proposed method consistently outperforms state-of-the-art IRM variants, validating its effectiveness and robustness.
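For reference, the standard IRMv1 penalty the abstract refers to (the squared gradient of a per-environment risk with respect to a dummy classifier scale, following Arjovsky et al., 2019); the proposed extrapolation-based augmentation of this penalty is not reproduced here.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, y):
    # Squared gradient of the environment risk w.r.t. a dummy scale w = 1.0.
    w = torch.tensor(1.0, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * w, y)
    grad = torch.autograd.grad(loss, [w], create_graph=True)[0]
    return (grad ** 2).sum()
```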
URL: https://openreview.net/forum?id=CkzV8PBYaX
---
Title: Spurious Privacy Leakage in Neural Networks
Abstract: Neural networks are vulnerable to privacy attacks aimed at stealing sensitive data.
The risks can be amplified in a real-world scenario, particularly when models are trained on limited and biased data.
In this work, we investigate the impact of spurious correlation bias on privacy vulnerability.
We introduce _spurious privacy leakage_, a phenomenon where spurious groups are significantly more vulnerable to privacy attacks than non-spurious groups.
We further show that group privacy disparity increases in tasks with simpler objectives (e.g. fewer classes) due to the persistence of spurious features.
Surprisingly, we find that reducing spurious correlation using spurious robust methods does not mitigate spurious privacy leakage.
This leads us to introduce a memorization-based perspective on privacy disparity: mitigating spurious correlation does not reduce the memorization of spurious data and therefore does not improve the privacy level either.
Lastly, we compare the privacy of different model architectures trained with spurious data, demonstrating that, contrary to prior works, architectural choice can affect privacy outcomes.
URL: https://openreview.net/forum?id=tRXDCIgvTT
---
Title: TT-TFHE: a Torus Fully Homomorphic Encryption-Friendly Neural Network Architecture
Abstract: This paper presents TT-TFHE, a deep neural network Fully Homomorphic Encryption (FHE) framework that effectively scales Torus FHE (TFHE) usage to tabular and image datasets using the Truth-Table Neural Networks (TTnet) family of Convolutional Neural Networks. The proposed framework provides an easy-to-implement, automated TTnet-based design toolbox with an underlying (python-based) open-source Concrete implementation (CPU-based and implementing lookup tables) for inference over encrypted data. Experimental evaluation shows that TT-TFHE greatly outperforms in terms of time and accuracy all Homomorphic Encryption (HE) set-ups on three tabular datasets, all other features being equal. On image datasets such as MNIST and CIFAR-10, we show that TT-TFHE consistently and largely outperforms other TFHE set-ups and is competitive against other HE variants such as BFV or CKKS (while maintaining the same level of 128-bit encryption security guarantees). In addition, our solutions present a very low memory footprint (down to dozens of MBs for MNIST), which is in sharp contrast with other HE set-ups that typically require tens to hundreds of GBs of memory per user (in addition to their communication overheads). This is the first work presenting a fully practical and production-ready solution of private inference (i.e. a few seconds for inference time and a few dozen MBs of memory) on both tabular datasets and MNIST, that can easily scale to multiple threads and users on server side. We further show that in real-world settings, our proposals reduce costs by one to several orders of magnitude compared to existing solutions.
URL: https://openreview.net/forum?id=tV4ynvae6W
---
Title: Efficient Uncertainty Estimation via Sensitivity-Guided Subnetwork Selection for Scalable Variational Inference
Abstract: Quantifying predictive uncertainty with minimal computational overhead remains a significant challenge for reliable deep learning applications in safety-critical systems. While Bayesian neural networks (BNNs) are the gold standard for uncertainty quantification, they require considerable training time and computational resources. Although a body of work has focused on mitigating the computational cost of BNN inference via post-hoc approaches, efforts to accelerate training and convergence remain limited. This paper proposes a partial Bayesian training approach via mean-field variational inference (VI), enabling controllable uncertainty modeling through sparse gradient representations. The selection of the variational Bayesian subnetwork is guided by a first-order gradient sensitivity analysis, which is grounded in uncertainty propagation theory. Under mean-field assumptions, we demonstrate how this framework effectively informs the selection of parameters that represent the network's predictive uncertainty. This criterion is also efficiently integrated into auto-differentiation tools, avoiding additional computational burdens. The resulting model consists of a combination of deterministic and Bayesian parameters, facilitating an effective, yet efficient, representation of uncertainty. We investigate the effects of varying the proportion of Bayesian parameters (ranging from 1\% to 95\%) across diverse tasks, including regression, classification, and semantic segmentation. Experimental results on MNIST, CIFAR-10, ImageNet, and Cityscapes demonstrate that our approach achieves competitive performance and uncertainty estimates compared to ensemble methods. While using substantially fewer parameters (approximately 50\% to 80\% fewer than full VI and ensembles), our approach offers reduced training costs with faster convergence compared to full or partial VI trained from scratch. Furthermore, we assess the robustness of predictive uncertainty in the presence of covariate shifts and out-of-distribution data, demonstrating that our method effectively captures uncertainty and exhibits robustness to image corruptions.
URL: https://openreview.net/forum?id=fJzQbSgEem
---
Title: Learning Reward Machines from Partially Observed Policies
Abstract: Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy, which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach.
URL: https://openreview.net/forum?id=7bbYYNvhTE
---
Title: QC-BERT: A Quantum-Classical hybrid framework for Efficient Sentiment Analysis and Question Answering
Abstract: Transformers have revolutionized NLP but are constrained by their massive parameter counts, posing challenges for edge deployment. Quantum computing, leveraging superposition and entanglement, promises exponential efficiency gains, yet practical, scalable QNLP applications remain scarce. In this pioneering work, we propose QuantumDistilBERT (ours) and HybridTinyBERTQC (ours), the first scalable, hybrid quantum-classical transformer models designed for both core NLP tasks and resource-constrained environments. QuantumDistilBERT achieves 91.36% accuracy on IMDB—just 1.46% below DistilBERT—while reducing trainable parameters by 89.4%, demonstrating strong edge applicability. HybridTinyBERTQC, enhanced with quantum self-attention mechanisms, achieves 82.31% F1 and 73.10% EM on SQuAD 1.1, and 32.86% F1 on Adversarial QA, outperforming TinyBERT (undistilled on task-specific datasets) by over 1% (p < 0.05) on SQuAD and 3.55% on AQA. A novel complexity scoring mechanism reduces quantum circuit overhead by 20%, generalizing well to other text classification tasks. Notably, our hybrid model exhibits a 41.3% reduction in loss variance (0.1329 vs. 0.2265) and uniquely achieves perfect reproducibility across runs with the same random seed—producing identical metrics and loss values every time. This unprecedented consistency underscores the model’s reliability, a critical requirement for edge deployment. Extensive evaluations on IMDB, SQuAD, Adversarial QA, and SST-2 demonstrate the scalability and robustness of our approach. While quantum noise in NISQ hardware still limits subjective task performance, our work lays foundational groundwork for practical, reproducible, and deployable QNLP systems on edge devices.
URL: https://openreview.net/forum?id=EPm2AOD9bd
---
Title: Emergent Neural Network Mechanisms for Generalization to Objects in Novel Orientations
Abstract: The capability of Deep Neural Networks (DNNs) to recognize objects in orientations outside the training data distribution is not well understood. We investigate the limitations of DNNs’ generalization capacities by systematically inspecting DNNs' patterns of success and failure across out-of-distribution (OoD) orientations. We present evidence that DNNs (across architecture types, including convolutional neural networks and transformers) are capable of generalizing to objects in novel orientations, and we describe their generalization behaviors. Specifically, generalization strengthens when training the DNN with an increasing number of familiar objects, but only in orientations that involve 2D rotations of familiar orientations. We also hypothesize how this generalization behavior emerges from internal neural mechanisms – that neurons tuned to common features between familiar and unfamiliar objects enable out of distribution generalization – and present supporting data for this theory. The reproducibility of our findings across model architectures, as well as analogous prior studies on the brain, suggests that these orientation generalization behaviors, as well as the neural mechanisms that drive them, may be a feature of neural networks in general.
URL: https://openreview.net/forum?id=4wBQTZVSHU
---
Title: The Noise Geometry of Stochastic Gradient Descent
Abstract: In this paper, we present a comprehensive analysis of the heterogeneous structure of minibatch noise, focusing on its favorable *alignment* with the landscape's local geometry (Wu et al., 2022). Specifically, we propose two metrics, derived from analyzing the influence of the noise structure on the loss and subspace projection dynamics separately, to quantify the alignment property. To showcase the practical relevance of our noise geometry characterization, we revisit the convergence analysis of stochastic gradient descent (SGD), revealing that the favorable noise geometry is crucial for ensuring benign convergence of SGD in high-dimensional settings. We also examine the noise geometry's influence on how SGD escapes from sharp minima. It is demonstrated that, unlike gradient descent (GD), which escapes sharp regions along the sharpest directions, SGD tends to escape through flatter directions. To support our theoretical findings, both synthetic and real-dataset experiments are provided.
URL: https://openreview.net/forum?id=NubPNYTKLh
---
Title: Goal-Conditioned Data Augmentation for Offline Reinforcement Learning
Abstract: Offline reinforcement learning (RL) enables policy learning from pre-collected offline datasets, relaxing the need to interact directly with the environment. However, limited by the quality of offline datasets, it generally fails to learn well-performing policies from suboptimal datasets. To address datasets with insufficient optimal demonstrations, we introduce Goal-cOnditioned Data Augmentation (GODA), a novel goal-conditioned diffusion-based method for augmenting samples with higher quality. Leveraging recent advancements in generative modelling, GODA incorporates a novel return-oriented goal condition with various selection mechanisms. Specifically, we introduce a controllable scaling technique to provide enhanced return-based guidance during data sampling. GODA learns a comprehensive distribution representation of the original offline datasets while generating new data with selectively higher-return goals, thereby maximizing the utility of limited optimal demonstrations. Furthermore, we propose a novel adaptive gated conditioning method for processing noisy inputs and conditions, enhancing the capture of goal-oriented guidance. We conduct experiments on the D4RL benchmark and real-world challenges, specifically traffic signal control (TSC) tasks, to demonstrate GODA's effectiveness in enhancing data quality and superior performance compared to state-of-the-art data augmentation methods across various offline RL algorithms.
URL: https://openreview.net/forum?id=8K16dplpE0
---
Title: BiDoRA: Bi-level Optimization-Based Weight-Decomposed Low-Rank Adaptation
Abstract: Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) has gained considerable attention as a flexible and efficient way of adapting LLMs to downstream tasks.
Among these methods, weight-decomposed low-rank adaptation (DoRA) has emerged as a promising approach.
DoRA bridges the gap between low-rank adaptation (LoRA) and full fine-tuning (FT) by decomposing the weight matrices into magnitude and direction components, thereby maintaining learning behavior similar to FT.
Although DoRA shows encouraging performance, it is over-expressive and potentially increases the risk of overfitting.
Moreover, optimizing magnitude and direction simultaneously leads to a coupled updating pattern, limiting its learning capacity.
In this work, we propose BiDoRA, a bi-level optimization-based PEFT method.
In BiDoRA, the two components are optimized at different optimization levels, mitigating the risk of overfitting.
Additionally, the asynchronous optimization promotes a decoupled updating pattern, allowing for more flexible updates suitable for various downstream tasks.
Evaluation of BiDoRA on various tasks spanning natural language understanding, natural language generation, token classification, and extremely small datasets reveals that it significantly outperforms DoRA and other PEFT methods.
The code for BiDoRA is available at https://anonymous.4open.science/r/BiDoRA-5D31
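The weight decomposition that DoRA-style methods build on can be written compactly: each weight matrix is split into per-column magnitudes and unit-norm directions. The sketch below only illustrates that decomposition, with assumed function names; BiDoRA's bi-level optimization of the two components is not shown here.

```python
import numpy as np

def dora_decompose(W):
    """Split a weight matrix into per-column magnitudes and unit-norm
    directions, so that W == m * V_unit (column-wise)."""
    m = np.linalg.norm(W, axis=0, keepdims=True)   # (1, out_features) magnitudes
    V_unit = W / (m + 1e-12)                       # unit-norm direction columns
    return m, V_unit

def dora_recompose(m, V_unit):
    return m * V_unit

W = np.random.default_rng(1).normal(size=(8, 4))
m, V = dora_decompose(W)
assert np.allclose(dora_recompose(m, V), W)        # decomposition is lossless
```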
URL: https://openreview.net/forum?id=v2xCm3VYl4
---
Title: Adapting Chat Language Models Using Only Target Unlabeled Language Data
Abstract: Vocabulary expansion (VE) is the de-facto approach to language adaptation of large language models (LLMs) by adding new tokens and continuing pre-training on target data. While this is effective for base models trained on unlabeled data, it poses challenges for chat models trained to follow instructions through labeled conversation data. Directly adapting the latter with VE on target unlabeled data may result in forgetting chat abilities. While target chat data would be ideal, it is often unavailable or costly to create for low-resource languages, and machine-translated alternatives are not always effective. To address this issue, previous work proposed using a base and chat model from the same family. This method first adapts the base LLM with VE on target unlabeled data and then converts it to a chat model by adding a chat vector (CV) derived from the weight difference between the source base and chat models. We propose ElChat, a new language adaptation method for chat LLMs that adapts a chat model directly on target unlabeled data, without a base model. It elicits chat abilities by injecting information from the source chat model. ElChat offers more robust and competitive target language and safety performance while achieving superior English, chat, and instruction-following abilities compared to CV.
URL: https://openreview.net/forum?id=6IdoIKowfe
---
Title: CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization
Abstract: Fine-tuning large language models (LLMs) using low-rank adaptation (LoRA) has become a highly efficient approach for downstream tasks, particularly in scenarios with limited computational resources. However, applying LoRA techniques to quantized LLMs poses unique challenges due to the reduced representational precision of quantized weights. In this paper, we introduce CLoQ (Calibrated LoRA initialization for Quantized LLMs), a simple initialization strategy designed to overcome these challenges. Our approach focuses on minimizing the layer-wise discrepancy between the original LLM and its quantized counterpart with LoRA components during initialization. By leveraging a small calibration dataset, CLoQ quantizes a pre-trained LLM and determines the optimal LoRA components for each layer, ensuring a strong foundation for subsequent fine-tuning.
A key contribution of this work is a novel theoretical result that enables the accurate and closed-form construction of these optimal LoRA components. We validate the efficacy of CLoQ across multiple tasks such as language generation, arithmetic reasoning, and commonsense reasoning, demonstrating that it consistently outperforms existing LoRA fine-tuning methods for quantized LLMs, especially at 2-bit.
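The paper's contribution is a closed-form construction of the optimal LoRA components, which is not reproduced here. For intuition only, the sketch below fits a low-rank correction to the layer-wise quantization residual with a truncated SVD, a simpler related idea; all names, the rank, and the toy quantizer are assumptions.

```python
import numpy as np

def lora_init_from_residual(W, W_q, rank=4):
    """Fit A @ B (rank-r) to the quantization residual W - W_q via truncated SVD.
    Illustrative only: this is NOT the closed-form construction derived in the paper."""
    R = W - W_q
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = U[:, :rank] * s[:rank]          # (d_out, r)
    B = Vt[:rank, :]                    # (r, d_in)
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
W_q = np.round(W * 2) / 2               # crude stand-in for a weight quantizer
A, B = lora_init_from_residual(W, W_q, rank=4)
print("residual norm before:", np.linalg.norm(W - W_q).round(3),
      "after:", np.linalg.norm(W - W_q - A @ B).round(3))
```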
URL: https://openreview.net/forum?id=FHnTRAAdAZ
---
Title: On Time Series Clustering with Graph Neural Networks
Abstract: Graph clustering and pooling operators have been adopted in graph-based architectures to capture meaningful patterns in time series data by leveraging both temporal and relational structures. However, the contribution of each design choice and the behavior of different operators remain underexplored. This work introduces a streamlined deep learning framework based on a spatio-temporal graph neural network (STGNN) for clustering time series, which can leverage prior knowledge on the spatial structure of the data. The STGNN-based model flexibly identifies clusters in various data settings through an encoder-decoder architecture with a bottleneck, showing that a spatio-temporal approach can identify meaningful clusters even in datasets that do not explicitly include spatial relations. We validate the framework's qualitative performance through experiments on synthetic and real-world data, showing its effectiveness in different scenarios. We also provide a heuristic for model selection in unsupervised settings via a self-supervised forecasting loss. Code available at https://anonymous.4open.science/r/Time-Series-Clustering-with-GNNs-AB11
URL: https://openreview.net/forum?id=MHQXfiXsr3
---
Title: Bayesian information theoretic model-averaging stochastic item selection for computer adaptive testing
Abstract: The goal of Computer Adaptive Testing (CAT) is to reliably estimate an individual's ability as modeled by an item response theory (IRT) instrument using only a subset of the instrument's items. A secondary goal is to vary the items presented across different testing sessions so that the sequence of items does not become overly stereotypical -- we want all items to have an exposure rate sufficiently far from zero.
We formulate the optimization problem for CAT in terms of Bayesian information theory, where one chooses the item at each step based on the criterion of the ability model discrepancy -- the statistical distance between the ability estimate at the next step and the full-test ability estimate. This viewpoint of CAT naturally motivates a stochastic selection procedure that equates sampling the next item to Bayesian model averaging in the space of ability estimates. Using the NIH Work Disability Functional Assessment Battery (WD-FAB), we evaluate our new methods in comparison to pre-existing methods found in the literature. We find that our stochastic selector has superior properties in terms of both item exposure and test accuracy/efficiency.
URL: https://openreview.net/forum?id=5tMpMfxzCR
---
Title: Don’t Judge Before You CLIP: A Unified Approach for Perceptual Tasks
Abstract: Visual perceptual tasks aim to predict human judgment of images (e.g., emotions invoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making their data labeling difficult. The scarcity of such human-annotated data results in small datasets leading to poor generalization. Typically, specialized models were designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose a unified architectural framework for solving multiple different perceptual tasks leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP's training data, which contain not only factual image descriptions, but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.
URL: https://openreview.net/forum?id=uvQTYi6kbu
---
Title: Solving Quadratic Programs via Deep Unrolled Douglas-Rachford Splitting
Abstract: Convex quadratic programs (QPs) are fundamental to numerous applications, including finance, engineering, and energy systems. Among the various methods for solving them, the Douglas-Rachford (DR) splitting algorithm is notable for its robust convergence properties. Concurrently, the emerging field of Learning-to-Optimize offers promising avenues for enhancing algorithmic performance, with algorithm unrolling receiving considerable attention due to its computational efficiency and interpretability. In this work, we propose an approach that unrolls a modified DR splitting algorithm to efficiently learn solutions for convex QPs. Specifically, we introduce a tailored DR splitting algorithm that replaces the computationally expensive linear system-solving step with a simplified gradient-based update, while retaining convergence guarantees. Consequently, we unroll the resulting DR splitting method and present a well-crafted neural network architecture to predict QP solutions. Our method achieves up to 50% reductions in iteration counts and 40% in solve time across benchmarks on both synthetic and real-world QP datasets, demonstrating its scalability and superior performance in enhancing computational efficiency across varying sizes.
URL: https://openreview.net/forum?id=xOfOgPnbtF
---
Title: Prior Specification for Exposure-based Bayesian Matrix Factorization
Abstract: The rapid development of the Internet has resulted in a surge of information, particularly with the rise of recommender systems (RSs). One of the most significant challenges facing existing RS models is data sparsity. To address problems related to sparse data, Bayesian models have been applied to RS systems because of their effectiveness with small sample sizes. However, the performance of Bayesian models is heavily influenced by the choice of prior distributions and hyperparameters. Recent research has introduced an analytical method for specifying prior distributions in generic Bayesian models. The major concept is a statistical technique called Prior Predictive Matching~(PPM), which optimizes hyperparameters by aligning virtual statistics generated by the prior with observed data. This approach aims to reduce the need for repeated and costly posterior inference and enhance overall Bayesian model performance. However, our evaluation of this theoretical method reveals considerable deviations in prior specification estimates as data sparsity increases. In this study, we present an enhanced method for specifying priors in Bayesian matrix factorization models. We improve the estimators by implementing an exposure-based model to better simulate data scarcity. Our method demonstrates significant accuracy improvements in hyperparameter estimation during synthetic experiments. We also explore the feasibility of applying this method to real-world datasets and provide insights into how the model's behavior adapts to varying levels of data sparsity.
URL: https://openreview.net/forum?id=o5R4Hv9XqC
---
Title: nnActive: A Framework for Evaluation of Active Learning in 3D Biomedical Segmentation
Abstract: Semantic segmentation is crucial for various biomedical applications, yet its reliance on large, annotated datasets presents a significant bottleneck due to the high cost and specialized expertise required for manual labeling. Active Learning (AL) aims to mitigate this challenge by selectively querying the most informative samples, thereby reducing annotation effort. However, in the domain of 3D biomedical imaging, there remains no consensus on whether AL consistently outperforms Random sampling strategies. Current methodological assessment is hindered by the widespread occurrence of four pitfalls with respect to AL method evaluation. These are (1) restriction to too few datasets and annotation budgets, (2) training 2D models on 3D images and not incorporating partial annotations, (3) the Random baseline not being adapted to the task, and (4) measuring annotation cost only in voxels. In this work, we introduce nnActive, an open-source AL framework that systematically overcomes the aforementioned pitfalls by (1) means of a large-scale study evaluating 8 QMs on four biomedical imaging datasets and three label regimes, accompanied by four large-scale ablation studies, (2) extending the state-of-the-art 3D medical segmentation method nnU-Net by using partial annotations for training with 3D patch-based query selection, (3) proposing Foreground Aware Random sampling strategies tackling the foreground-background class imbalance commonly encountered in 3D medical images, and (4) proposing the foreground efficiency metric, which captures that the annotation cost for background regions is much lower than for foreground regions. We reveal the following key findings: (1) while all AL methods outperform standard Random sampling, none reliably surpasses an improved Foreground Aware Random sampling; (2) the benefits of AL depend on task-specific parameters like the number of classes and their locations; (3) Predictive Entropy is overall the best-performing AL method, but likely requires the most annotation effort; (4) AL performance can be improved with more compute-intensive design choices like longer training and smaller query sizes. As a holistic, open-source framework, nnActive has the potential to act as a catalyst for research and application of AL in 3D biomedical imaging. Code is at: https://anonymous.4open.science/r/nnactive-815F
URL: https://openreview.net/forum?id=AJAnmRLJjJ
---
Title: One-Shot Federated Distillation Using Monoclass Teachers: A Study of Knowledge Fragmentation and Out-of-Distribution Supervision
Abstract: The performance of machine learning models critically depends on the quality and diversity of training data. However, privacy, legal, and proprietary concerns often limit direct data sharing. Many organizations possess high-quality data for specific classes and may wish to share the knowledge derived from it without revealing the data or engaging in collaborative training. While federated learning (FL) enables distributed model training, it typically assumes mutual benefit, requires repeated communication, and produces a shared global model. Another paradigm, knowledge distillation (KD), allows a student model to learn from teacher predictions.
We propose a one-shot federated distillation method in which a single client learns from monoclass teacher models trained independently by multiple providers. Each provider shares its model once, and the client combines these with unlabeled data to distill a multi-class student model—aggregating knowledge from disjoint, class-specific sources. This unidirectional, asymmetric setup poses a key challenge: out-of-distribution (OOD) supervision, where monoclass teachers often mispredict unseen inputs, leading to noisy signals for the student.
The main contribution of this work is a systematic study of knowledge fragmentation in one-shot federated distillation with monoclass teachers. We evaluate five configurations with varying class coverage per provider and show that increasing fragmentation intensifies OOD supervision, degrading student performance. Experiments on MNIST, FashionMNIST, and CIFAR-10 confirm that fragmentation consistently reduces student accuracy. To mitigate this, we discuss three strategies: (1) exposing teachers to diverse off-class examples, (2) penalizing overconfidence, and (3) using contrastive learning to sharpen feature boundaries.
URL: https://openreview.net/forum?id=ENdm5BM7aF
---
Title: MGPATH: A Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot Whole Slide Pathology Classification
Abstract: Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels, hindering model generalization. This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification. We first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology image tiles, into a vision-language model by adding adaptors and aligning it with medical text encoders via contrastive learning on 923K image-text pairs. The model is then used to extract visual features and text embeddings from few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features using prefix embeddings or self-attention, we propose multi-granular attention that captures interactions between learnable prompts and both individual image patches and groups of them. This approach improves the model’s ability to capture both fine-grained details and broader context, enhancing its recognition of complex patterns across sub-regions. To further improve accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model robustness by mitigating perturbations that might occur during the data augmentation process. Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness of our approach; thereby, we surpass several of the latest competitors and consistently improve performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath integrated PLIP.
URL: https://openreview.net/forum?id=u7U81JLGjH
---
Title: Quasipseudometric Value Functions with Dense Rewards
Abstract: As a generalization of reinforcement learning (RL) to parametrizable goals, goal conditioned RL (GCRL) has a broad range of applications, particularly in challenging tasks in robotics. Recent work has established that the optimal value function of GCRL $Q^\ast(s, a, g)$ has a quasipseudometric structure, leading to targeted neural architectures that respect such structure. However, the relevant analyses assume a sparse reward setting—a known aggravating factor to sample complexity. We show that the key property underpinning a quasipseudometric, viz., the triangle inequality, is preserved under a dense reward setting as well, specifically identifying the key condition necessary for the triangle inequality. Contrary to earlier findings where dense rewards were shown to be detrimental to GCRL, we conjecture that dense reward functions that satisfy this condition can only improve, never worsen, sample complexity. We evaluate this proposal in 12 standard benchmark environments in GCRL featuring challenging continuous control tasks. Our empirical results confirm that training a quasipseudometric value function in our dense reward setting indeed either improves upon, or preserves, the sample complexity of training with sparse rewards. This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity.
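The property at the heart of this analysis, the triangle inequality for an asymmetric (quasipseudometric) cost structure, is easy to check numerically on a table of goal-reaching costs. The sketch below is a generic check with assumed names and a toy directed-graph example, unrelated to the paper's benchmarks.

```python
import numpy as np

def violates_triangle_inequality(D, tol=1e-9):
    """Check whether an asymmetric cost table D[i, j] (cost to reach j from i)
    violates d(i, j) <= d(i, k) + d(k, j) for some waypoint k."""
    via_waypoint = D[:, :, None] + D[None, :, :]   # [i, k, j] = D[i, k] + D[k, j]
    best_via = via_waypoint.min(axis=1)            # cheapest waypoint for each (i, j)
    return bool((D > best_via + tol).any())

# shortest-path costs on a directed cycle: asymmetric, yet a valid quasipseudometric
D = np.array([[0., 1., 2.],
              [2., 0., 1.],
              [1., 2., 0.]])
print(violates_triangle_inequality(D))   # False: triangle inequality holds
```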
URL: https://openreview.net/forum?id=4LqOl6pDUe
---
Title: Hyperspectral Gaussian Splatting
Abstract: Hyperspectral imaging (HSI) has been widely used in agricultural applications for non-destructive estimation of plant nutrient composition and precise determination of nutritional elements of samples. Recently, 3D reconstruction methods have been used to create implicit neural representations of HSI scenes, which can help localize the target object's nutrient composition spatially and spectrally. Neural Radiance Field (NeRF) is a cutting-edge implicit representation that can be used to render hyperspectral channel compositions of each spatial location from any viewing direction. However, it faces limitations in training time and rendering speed. In this paper, we propose Hyperspectral Gaussian Splatting (HS-GS), which combines the state-of-the-art 3D Gaussian Splatting (3DGS) with a diffusion model to enable 3D explicit reconstruction of the hyperspectral scenes and novel view synthesis for the entire spectral range. To enhance the model's ability to capture fine-grained reflectance variations across the light spectrum and leverage correlations between adjacent wavelengths for denoising, we introduce a wavelength encoder to generate wavelength-specific spherical harmonics offsets. We also introduce a novel Kullback–Leibler divergence-based loss to mitigate the spectral distribution gap between the rendered image and the ground truth. A diffusion model is further applied for denoising the rendered images and generating photorealistic hyperspectral images. We present extensive evaluations on five diverse hyperspectral scenes from the Hyper-NeRF dataset to show the effectiveness of our proposed HS-GS framework. The results demonstrate that HS-GS has achieved the new state-of-the-art performance among all the previously published methods. Code will be released upon publication.
URL: https://openreview.net/forum?id=MI5eUe8ZdB
---
Title: Stochastic Subspace Descent Accelerated via Bi-fidelity Line Search
Abstract: Efficient optimization remains a fundamental challenge across numerous scientific and engineering domains, particularly when objective function evaluations are computationally expensive and gradient information is inaccessible. While zeroth-order optimization methods address the lack of gradients, their performance often suffers due to the high cost of repeated function queries. This work introduces a bi-fidelity line search scheme tailored for zeroth-order optimization. Our method constructs a temporary surrogate model by strategically combining inexpensive low-fidelity (LF) evaluations with accurate high-fidelity (HF) evaluations of the objective function. This surrogate enables an efficient backtracking line search for step size selection, significantly reducing the required number of costly HF queries. We provide theoretical convergence guarantees for this scheme under standard assumptions. Furthermore, we integrate this bi-fidelity strategy into the stochastic subspace descent framework, proposing the bi-fidelity stochastic subspace descent (BF-SSD) algorithm. A comprehensive empirical evaluation of BF-SSD is conducted across four distinct problems: a synthetic optimization benchmark, dual-form kernel ridge regression, black-box adversarial attacks on machine learning models, and transformer-based black-box language model fine-tuning. The numerical results consistently demonstrate that BF-SSD achieves superior optimization performance, particularly in terms of solution quality obtained per HF function evaluation, when compared against relevant baseline methods. This study highlights the efficacy of integrating bi-fidelity strategies within zeroth-order optimization frameworks, positioning BF-SSD as a promising and computationally efficient approach for tackling large-scale, high-dimensional optimization problems encountered in various real-world applications.
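The bi-fidelity line-search idea can be sketched generically: probe candidate step sizes on a cheap low-fidelity model corrected by a high-fidelity evaluation at the current point, and spend further high-fidelity calls only on candidates the surrogate already predicts to improve. The snippet below is an illustrative backtracking variant with assumed names and a toy objective, not the paper's surrogate construction or its theoretical conditions.

```python
import numpy as np

def bifidelity_backtracking(x, direction, f_hi, f_lo, step0=1.0, shrink=0.5, max_tries=8):
    """Backtracking line search that screens step sizes with a cheap low-fidelity
    objective (additively corrected by one HF evaluation at x) and confirms
    promising steps with high-fidelity calls."""
    f0 = f_hi(x)                              # one HF evaluation at the current point
    offset = f0 - f_lo(x)                     # crude additive correction of the LF model
    step = step0
    for _ in range(max_tries):
        candidate = x + step * direction
        if f_lo(candidate) + offset < f0:     # cheap screen on the corrected surrogate
            if f_hi(candidate) < f0:          # confirm with an HF call
                return step
        step *= shrink
    return 0.0                                # no acceptable step found

# toy usage: HF objective is a quadratic, LF objective is a shifted version of it
f_hi = lambda x: float(np.sum(x ** 2))
f_lo = lambda x: float(np.sum(x ** 2)) + 3.0
x = np.array([2.0, -1.0])
print("accepted step size:", bifidelity_backtracking(x, direction=-x, f_hi=f_hi, f_lo=f_lo))
```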
URL: https://openreview.net/forum?id=xuOQUs5YmT
---
Title: Low Compute Unlearning via Sparse Representations
Abstract: Machine unlearning, which involves erasing knowledge about a \emph{forget set} from a trained model, can prove to be
costly and infeasible using existing techniques. We propose a low-compute unlearning technique based on a discrete representational bottleneck. We show that the proposed technique efficiently unlearns the forget set and incurs negligible damage to the model's performance on the rest of the dataset. We evaluate the proposed technique on the problem of class unlearning using four datasets: CIFAR-10, CIFAR-100, LACUNA-100 and ImageNet-1k. We compare the proposed technique to SCRUB, a state-of-the-art approach which uses knowledge distillation for unlearning. Across all four datasets, the proposed technique performs as well as, if not better than, SCRUB while incurring almost no computational cost.
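One reason a discrete representational bottleneck can make unlearning cheap is that the forget set's knowledge tends to be localized in a small set of codebook entries, which can then be reset without retraining the network. The sketch below illustrates that intuition with an assumed usage-ratio criterion; the paper's actual procedure may differ.

```python
import numpy as np

def unlearn_codes(codebook, forget_codes, retain_codes, rng, forget_ratio=0.9):
    """Re-initialize codebook entries used (almost) exclusively by the forget set.
    forget_codes / retain_codes: arrays of code indices selected by each set's samples."""
    n_codes = codebook.shape[0]
    f = np.bincount(forget_codes, minlength=n_codes).astype(float)
    r = np.bincount(retain_codes, minlength=n_codes).astype(float)
    usage = f / np.maximum(f + r, 1.0)              # fraction of usage from the forget set
    to_reset = usage >= forget_ratio
    codebook = codebook.copy()
    codebook[to_reset] = rng.normal(scale=0.01, size=(to_reset.sum(), codebook.shape[1]))
    return codebook, to_reset

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))                  # 8 discrete codes, dim 4
forget_codes = np.array([0, 0, 1, 0])               # codes hit by forget-set samples
retain_codes = np.array([1, 2, 3, 3, 4])            # codes hit by retained data
new_cb, reset = unlearn_codes(codebook, forget_codes, retain_codes, rng)
print("reset codes:", np.flatnonzero(reset))        # only code 0 is forget-exclusive
```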
URL: https://openreview.net/forum?id=GyKXzmk43s
---
Title: Rectified Robust Policy Optimization for Robust Constrained Reinforcement Learning without Strong Duality
Abstract: The goal of robust constrained reinforcement learning (RL) is to optimize an agent's performance under the worst-case model uncertainty while satisfying safety or resource constraints. In this paper, we demonstrate that strong duality does not generally hold in robust constrained RL, indicating that traditional primal-dual methods may fail to find optimal feasible policies. To overcome this limitation, we propose a novel primal-only algorithm called Rectified Robust Policy Optimization (RRPO), which operates directly on the primal problem without relying on dual formulations. We provide theoretical convergence guarantees for RRPO, showing that it converges to an approximately optimal policy that satisfies the constraints within a specified tolerance. Empirical results in a grid-world environment validate the effectiveness of our approach, demonstrating that RRPO achieves robust and safe performance under model uncertainties, whereas the non-robust method violates the worst-case safety constraints.
URL: https://openreview.net/forum?id=7l63xwAgAW
---
Title: Tree Search for Language Model Agents
Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments showcase the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute.
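The search itself is a standard best-first loop over environment states, with a language model supplying both candidate actions and state values. The sketch below is a generic, environment-agnostic version with assumed names and a toy integer "environment"; it is not the paper's agent or benchmark setup.

```python
import heapq
import itertools

def best_first_search(start, expand, value, is_goal, budget=50):
    """Generic best-first tree search: repeatedly expand the highest-value
    frontier state. `expand(s)` returns successor states (e.g. states reached
    by LM-proposed actions); `value(s)` scores a state (e.g. an LM-judged
    estimate of task progress); `budget` caps expansions."""
    counter = itertools.count()                        # heap tie-breaker
    frontier = [(-value(start), next(counter), start)]
    best = start
    for _ in range(budget):
        if not frontier:
            break
        neg_v, _, state = heapq.heappop(frontier)
        if -neg_v > value(best):
            best = state
        if is_goal(state):
            return state
        for child in expand(state):
            heapq.heappush(frontier, (-value(child), next(counter), child))
    return best                                        # best state found within budget

# toy usage: states are integers, the goal is 10, value prefers states near 10
result = best_first_search(
    start=0,
    expand=lambda s: [s + 1, s + 2, s + 3],
    value=lambda s: -abs(10 - s),
    is_goal=lambda s: s == 10,
)
print(result)   # 10
```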
URL: https://openreview.net/forum?id=QF0N3x2XVm
---
Title: GPT Carry-On: Language Model Customization Made Scalable by Growing-In-Depth
Abstract: Modern large language foundation models (LLMs) have now entered the daily lives of millions of users. We ask a natural question: is it possible to customize an LLM for every user or every task? From a systems and cost perspective, general continued training or fine-tuning still requires substantial computation and memory on training GPU nodes, whereas most inference nodes under deployment, possibly with lower-end GPUs, are configured to make the forward pass as fast as possible.
We propose a framework that takes full advantage of existing LLMs and online serving systems. We train an additional branch of transformer blocks on the final-layer embeddings of a pretrained LLM, which serves as the base; a carry-on module then merges the base models to compose a customized LLM. We can mix multiple layers, or multiple LLMs specialized in different domains such as chat, coding, and math, to form a new mixture of LLMs that best fits a new task. Since the base model does not need to update its parameters, we are able to outsource most of the computation of the training job to inference nodes and train only a lightweight carry-on module on training nodes, consuming less than 1GB of GPU memory to train a 100M-parameter carry-on layer on a 30B LLM. We tested the open-sourced Qwen and DeepSeek models for continued pretraining and obtained faster loss convergence. We use the approach to improve math question solving with extremely small computation and model size, using 1,000 chain-of-thought data samples and a two-layer carry-on with as little as 1 MB of parameters, and the results are promising.
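The training pattern described above, freezing the base model and training only a small add-on that consumes its final-layer embeddings, can be illustrated with a toy next-token head. In the paper the carry-on is a branch of transformer blocks; the single linear layer, vocabulary size, and learning rate below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen base LLM: maps token ids to final-layer embeddings.
# Only the small "carry-on" head below is trained; base_embed never changes.
VOCAB, DIM = 50, 16
base_embed = rng.normal(size=(VOCAB, DIM))        # frozen base features

W_carry = np.zeros((DIM, VOCAB))                  # the only trainable parameters

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

tokens = rng.integers(0, VOCAB, size=200)         # toy token stream
lr = 0.5
for step in range(100):
    h = base_embed[tokens[:-1]]                   # frozen features for context tokens
    p = softmax(h @ W_carry)                      # carry-on's next-token prediction
    targets = np.eye(VOCAB)[tokens[1:]]
    W_carry -= lr * h.T @ (p - targets) / len(h)  # SGD on cross-entropy, head only
loss = -np.log(p[np.arange(len(p)), tokens[1:]]).mean()
print(f"final cross-entropy: {loss:.3f}")
```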
URL: https://openreview.net/forum?id=LDAvIGZuFy
---
Title: Enhancing generalizability of deep networks via Fisher regularization
Abstract: The generalization ability of a deep learning classifier hinges significantly on the geometry of its loss landscape. Solutions residing near flatter areas are more robust, generalizing better than the ones present near sharp minima. In this paper, we study the effects of the loss landscape on the generalization of deep learning models and effectively leverage its geometric information to propose a novel regularization method, Fisher regularization. By dynamically penalizing weights based on their curvature across the loss landscape, we propose an adaptive regularization scheme that guides the optimization process towards flatter and more generalizable solutions. We establish a rigorous theoretical foundation for our regularization approach using PAC-Bayesian theory and empirically validate the superior performance of deep learning models trained with our proposed method over other powerful regularization techniques across a range of challenging image classification benchmarks.
URL: https://openreview.net/forum?id=oNU2rRIfIl
---
Title: FoMo-0D: A Foundation Model for Zero-shot Outlier Detection
Abstract: Outlier detection (OD) has a vast literature as it finds numerous real-world applications. Being an unsupervised task, model selection is a key bottleneck for OD without label supervision. Despite a long list of available OD algorithms with tunable hyperparameters, the lack of systematic approaches for unsupervised algorithm and hyperparameter selection limits their effective use in practice. In this paper, we present FoMo-0D, a pre-trained Foundation Model for zero/0-shot OD on tabular data, which bypasses the hurdle of model selection altogether. Having been pre-trained on synthetic data, FoMo-0D can directly predict the (outlier/inlier) label of test samples without parameter fine-tuning—requiring no labeled data, and no additional training or hyperparameter tuning when given a new task. Extensive experiments on 57 real-world datasets against 26 baselines show that FoMo-0D is highly competitive; outperforming the majority of the baselines with no statistically significant difference from the 2nd best method. Further, FoMo-0D is efficient in inference time requiring only 7.7 ms per sample on average, with at least 7x speed-up compared to previous methods. To facilitate future research, our implementations for data synthesis and pre-training as well as model checkpoints are openly available at https://anonymous.4open.science/r/PFN40D.
URL: https://openreview.net/forum?id=XCQzwpR9jE
---
Title: TP‑Blend: Textual‑Prompt Attention Pairing for Precise Object‑Style Blending in Diffusion Models
Abstract: Current text–conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin‑Prompt Attention Blend (TP‑Blend), a lightweight training‑free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP‑Blend is driven by two complementary attention processors. Cross‑Attention Object Fusion (CAOF) first averages head‑wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy‑regularised optimal transport problem that reassigns complete multi‑head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD‑XL), preserving rich cross‑head correlations while keeping memory low. Self‑Attention Style Fusion (SASF) injects style at every self‑attention layer through Detail‑Sensitive Instance Normalization. A lightweight one‑dimensional Gaussian filter separates low‑ and high‑frequency components; only the high‑frequency residual is blended back, imprinting brush‑stroke‑level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context‑aware texture modulation that remains independent of object fusion. Extensive experiments show that TP‑Blend produces high‑resolution, photo‑realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
URL: https://openreview.net/forum?id=q6M73uOBZE
---
Title: Decoding-based Regression
Abstract: Language models have recently been shown capable of performing regression wherein numeric predictions are represented as decoded strings. In this work, we provide theoretical grounds for this capability and furthermore investigate the utility of causal auto-regressive sequence models as numeric prediction heads given any feature representation. We find that, despite being trained in the usual way - for next-token prediction via cross-entropy loss - decoding-based regression is as performant as traditional approaches when benchmarked over standard regression tasks, while being flexible enough to capture arbitrary distributions, such as in the task of density estimation.
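The representational trick is simply that a numeric target becomes a short token sequence emitted autoregressively by a decoder. The sketch below shows one possible digit tokenization and its inverse; the exact scheme used in the paper may differ, and the function names are assumptions.

```python
def encode_number(y, n_digits=4):
    """Represent a value in [0, 1) as a sequence of decimal digit tokens,
    e.g. 0.3271 -> ['3', '2', '7', '1']. A decoder head can then predict
    these tokens autoregressively instead of emitting a single scalar."""
    digits = f"{y:.{n_digits}f}".split(".")[1]
    return list(digits)

def decode_number(tokens):
    """Map a decoded digit-token sequence back to a scalar prediction."""
    return sum(int(t) * 10 ** -(i + 1) for i, t in enumerate(tokens))

tokens = encode_number(0.3271)
print(tokens, decode_number(tokens))   # ['3', '2', '7', '1'] ~0.3271
```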
URL: https://openreview.net/forum?id=avUQ8jguxg
---
Title: Nonstationary Latent Bandits
Abstract: Addressing non-stationarity and latent variables in bandit algorithms presents significant challenges. This paper tackles both challenges simultaneously in Multi-Agent Multi-Armed Bandits by integrating causal inference principles with panel data methodologies. We propose Dynamic Causal Multi-Armed Bandits (DCMAB) and Dynamic Causal Contextual Bandits (DCCB), focusing on treatment effect estimation rather than direct reward modeling. Our algorithms, employing matrix completion on agent-time reward matrices, effectively leverage shared information among agents while adapting to dynamic environments. We establish sub-linear regret for the proposed algorithms and extend their applicability to scenarios with time-varying treatment effects. Through extensive simulations and a real-world application in the stock market, we validate the superiority of our proposed methods in non-stationary bandits with latent variables.
URL: https://openreview.net/forum?id=XUuQpTehya
---
Title: A Dynamical Clipping Approach with Task Feedback for Proximal Policy Optimization
Abstract: Proximal Policy Optimization (PPO) has been broadly applied to robotics learning, showcasing stable training performance. However, the fixed clipping bound setting may limit the performance of PPO. Specifically, there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process. Meanwhile, previous research suggests that a fixed clipping bound restricts the policy's ability to explore. Therefore, many past studies have aimed to dynamically adjust the PPO clipping bound to enhance PPO's performance. However, the objectives of these approaches are not directly aligned with the objective of reinforcement learning (RL) tasks, which is to maximize the cumulative Return. Unlike previous clipping approaches, we propose a bi-level proximal policy optimization objective that can dynamically adjust the clipping bound to better reflect the preference (maximizing Return) of these RL tasks. Based on this bi-level proximal policy optimization paradigm, we introduce a new algorithm named Preference-based Proximal Policy Optimization (Pb-PPO). Pb-PPO utilizes a multi-armed bandit approach to reflect the RL preference, recommending the clipping bound for PPO that maximizes the current Return. Therefore, Pb-PPO results in greater stability and improved performance compared to PPO with a fixed clipping bound. We test Pb-PPO on locomotion benchmarks across multiple environments, including Gym-Mujoco and legged-gym, and additionally validate Pb-PPO on customized navigation tasks. We also conduct comparisons with PPO using various fixed clipping bounds and with various other clipping approaches. The experimental results indicate that Pb-PPO demonstrates superior training performance compared to PPO and its variants.
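The bandit-over-clipping-bounds idea can be sketched with a handful of candidate bounds and any standard bandit rule; the snippet below uses UCB1 on episode returns, which is an illustrative choice rather than the paper's exact recipe, and the candidate bounds and reward model are assumptions.

```python
import math
import random

class ClipBoundBandit:
    """Treat each candidate PPO clipping bound as a bandit arm and favor the
    bound whose observed episode returns look best (UCB1)."""
    def __init__(self, bounds=(0.1, 0.2, 0.3)):
        self.bounds = list(bounds)
        self.counts = [0] * len(bounds)
        self.mean_return = [0.0] * len(bounds)

    def select(self):
        for i, c in enumerate(self.counts):        # play every arm once first
            if c == 0:
                return i
        total = sum(self.counts)
        ucb = [m + math.sqrt(2 * math.log(total) / c)
               for m, c in zip(self.mean_return, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, episode_return):
        self.counts[arm] += 1
        self.mean_return[arm] += (episode_return - self.mean_return[arm]) / self.counts[arm]

# toy loop: pretend a clip bound of 0.2 yields the highest return on this task
random.seed(0)
bandit = ClipBoundBandit()
for _ in range(200):
    arm = bandit.select()
    ret = {0.1: 1.0, 0.2: 1.5, 0.3: 0.8}[bandit.bounds[arm]] + random.gauss(0, 0.1)
    bandit.update(arm, ret)
best = max(range(len(bandit.bounds)), key=bandit.counts.__getitem__)
print("recommended clip bound:", bandit.bounds[best])   # expected: 0.2
```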
URL: https://openreview.net/forum?id=xOnAIaIgmC
---
Title: ContextFormer: Stitching via Latent Expert Calibration
Abstract: Offline reinforcement learning (RL) algorithms can learn better decision-making compared to behavior policies by stitching suboptimal trajectories into more optimal ones. Meanwhile, Decision Transformer (DT) abstracts RL as sequence modeling, showcasing competitive performance on offline RL benchmarks. However, recent studies demonstrate that DT lacks stitching capacity, so exploiting stitching capability for DT is vital to further improving its performance. To endow DT with stitching capability, we abstract trajectory stitching as expert matching and introduce our approach, ContextFormer, which integrates contextual information-based imitation learning (IL) and sequence modeling to stitch sub-optimal trajectory fragments by emulating the representations of a limited number of expert trajectories. To validate our approach, we conduct experiments from two perspectives: 1) we conduct extensive experiments on D4RL benchmarks under IL settings, and the results demonstrate that ContextFormer achieves competitive performance in multiple IL settings; 2) more importantly, we compare ContextFormer with various competitive DT variants using identical training datasets. The experimental results unveil ContextFormer's superiority, as it outperforms all other variants, showcasing its remarkable performance.
URL: https://openreview.net/forum?id=VGaGa50wqy
---
Title: Adversarial Bandits Against Arbitrary Strategies
Abstract: We study the adversarial bandit problem against arbitrary strategies, where the difficulty is captured by an unknown parameter $S$, which is the number of switches in the best arm in hindsight. To handle this problem, we adopt the master-base framework using the online mirror descent method (OMD). We first provide a master-base algorithm with simple OMD, achieving $\tilde{O}(S^{1/2}K^{1/3}T^{2/3})$, in which $T^{2/3}$ comes from the variance of loss estimators. To mitigate the impact of the variance, we propose using adaptive learning rates for OMD and achieve $\tilde{O}(\min\{\sqrt{SKT\rho},S\sqrt{KT}\})$, where $\rho$ is a variance term for loss estimators.
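The base learner in this line of work is OMD with a negative-entropy regularizer, i.e. exponential weights on importance-weighted loss estimates (Exp3-style). The sketch below shows that single building block with a fixed learning rate; the master-base combination and the adaptive learning rates from the abstract are not shown, and the names and toy loss model are assumptions.

```python
import numpy as np

def exp3_step(weights, observe_loss, eta, rng):
    """One round of OMD with a negative-entropy regularizer (Exp3-style):
    sample an arm from the current distribution, form an importance-weighted
    estimate of its loss, and update the weights multiplicatively."""
    probs = weights / weights.sum()
    arm = rng.choice(len(weights), p=probs)
    loss_estimate = np.zeros_like(weights)
    loss_estimate[arm] = observe_loss(arm) / probs[arm]   # unbiased loss estimator
    weights = weights * np.exp(-eta * loss_estimate)
    return weights / weights.sum(), arm

rng = np.random.default_rng(0)
K, T = 5, 2000
weights = np.ones(K) / K
eta = np.sqrt(np.log(K) / (K * T))                        # fixed (non-adaptive) rate
for _ in range(T):
    # arm 3 has the lowest mean loss; the other arms are worse
    weights, _ = exp3_step(weights, lambda a: rng.uniform(0.2 if a == 3 else 0.6, 1.0), eta, rng)
print("final arm probabilities:", np.round(weights, 3))   # mass should concentrate on arm 3
```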
URL: https://openreview.net/forum?id=x4QrOh8uCs
---
Title: Cardinality Sparsity: Applications in Matrix-Matrix Multiplications and Machine Learning
Abstract: High-dimensional data has become ubiquitous across the sciences but presents computational and statistical challenges. A common approach to addressing these challenges is through sparsity. In this paper, we introduce a new concept of sparsity, called cardinality sparsity. Broadly speaking, we define a tensor as sparse if it contains only a small number of unique values. We demonstrate that cardinality sparsity can improve deep learning and tensor regression both statistically and computationally. Along the way, we generalize recent statistical theories in these fields. Most importantly, we show that cardinality sparsity has a strikingly powerful application beyond high-dimensional data analysis: it can significantly speed up matrix-matrix multiplications. For instance, we demonstrate that cardinality sparsity leads to algorithms for binary-matrix multiplication that outperform state-of-the-art algorithms by a substantial margin. Additionally, another crucial aspect of this sparsity is minimizing memory usage. By executing matrix multiplication in the compressed domain, we can significantly lower the amount of memory needed to store the input data.
URL: https://openreview.net/forum?id=zoSRSpGu9C
---
Title: Addressing Node Integration Skewness in Graph Neural Networks Using Hop-Wise Attention
Abstract: Graph neural networks (GNNs) often suffer performance degradation as their layer count grows, typically due to the well-known problems of over-smoothing and over-squashing. In this work, we identify an additional factor contributing to this degradation, which we term the K-skewed-traversal problem: certain hop distances are disproportionately emphasized during aggregation, with this emphasis intensifying as the number of layers grows. To address this, we introduce an algorithm called Hop-wise Graph Attention Network (HGAT) that ensures uniform aggregation across hops to eliminate the K-skewed-traversal problem, and employs a hop-wise attention mechanism to adaptively prioritize specific hop distances. We theoretically prove that HGAT removes this skewness by balancing contributions from different hop distances before applying hop-wise attention. Moreover, in our extensive empirical evaluation$^*$, we observe notable improvement in terms of solution quality compared to the state-of-the-art GNN models, particularly as the number of layers increases.
* The implementation is available at https://drive.proton.me/urls/XSGJ8SJJGW#0bIyDVkZDqTi
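The two ingredients described above, uniform aggregation within each hop and attention over hop distances, can be sketched in numpy as follows. This is a simplified illustration with assumed names and a toy graph; the actual HGAT layer differs.

```python
import numpy as np

def hop_wise_attention(A, X, hop_logits):
    """Aggregate node features uniformly within each hop (row-normalized A^k X),
    then mix hops with one attention weight per hop distance, so that no single
    hop dominates as depth grows."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                                   # add self-loops
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)        # row-normalize
    hops = [X]
    for _ in range(len(hop_logits) - 1):
        hops.append(A_hat @ hops[-1])                       # k-hop aggregation
    alpha = np.exp(hop_logits) / np.exp(hop_logits).sum()   # softmax over hop distances
    return sum(a * h for a, h in zip(alpha, hops))

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
np.fill_diagonal(A, 0)
X = rng.normal(size=(6, 3))
out = hop_wise_attention(A, X, hop_logits=np.zeros(4))      # 0..3 hops, equal attention
print(out.shape)   # (6, 3)
```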
URL: https://openreview.net/forum?id=QJIf1sXMmY
---
Title: FraGNNet: A Deep Probabilistic Model for Tandem Mass Spectrum Prediction
Abstract: Compound identification from tandem mass spectrometry (MS/MS) data is a critical step in the analysis of complex mixtures. Typical solutions for the MS/MS spectrum to compound (MS2C) problem involve comparing the unknown spectrum against a library of known spectrum-molecule pairs, an approach that is limited by incomplete library coverage. Compound to MS/MS spectrum (C2MS) models can improve retrieval rates by augmenting real libraries with predicted MS/MS spectra. Unfortunately, many existing C2MS models suffer from problems with mass accuracy, generalization, or interpretability. We develop a new probabilistic method for C2MS prediction, FraGNNet, that can efficiently and accurately simulate MS/MS spectra with high mass accuracy. Our approach formulates the C2MS problem as learning a distribution over molecule fragments. FraGNNet achieves state-of-the-art performance in terms of prediction error and surpasses existing C2MS models as a tool for retrieval-based MS2C.
URL: https://openreview.net/forum?id=UsqeHx9Mbx
---
Title: Unified Wisdom: Harnessing Collaborative Learning to Improve Efficacy of Knowledge Distillation
Abstract: Knowledge distillation (KD), which involves training a smaller student model to approximate the predictions of a larger teacher model is useful in striking a balance between model accuracy and computational constraints. However, KD has been found to be ineffective when the teacher and student models have a significant capacity gap. In this work, we address this issue via "meta-collaborative distillation" (MC-Distil), where students of varying capacities collaborate during distillation. Using a "coordinator" network (C-Net), MC-Distil enables mutual learning among students as a meta-learning task. Our insight is that C-Net learns from each student’s performance and training instance characteristics, allowing students of different capacities to improve together. Our method enhances student accuracy for all students, surpassing state-of-the-art baselines, including multi-step distillation, consensus enforcement, and teacher re-training. We achieve average gains of 2.5% on CIFAR100 and 2% on TinyImageNet datasets, consistently across diverse student sizes, teacher sizes, and architectures. Notably, larger students benefiting through meta-collaboration with smaller students is a novel idea. MC-Distil excels in training superior student models under real-world conditions such as label noise and domain adaptation.
URL: https://openreview.net/forum?id=Zj9bb8aQNg
---
Title: MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification
Abstract: The field of text-to-3D content generation has made significant progress in generating realistic 3D objects, with existing methodologies like Score Distillation Sampling (SDS) offering promising guidance. However, these methods often encounter the Janus problem—multi-face ambiguities due to imprecise guidance. Additionally, while recent advancements in 3D Gaussian splatting have shown its efficacy in representing 3D volumes, optimization of this representation remains largely unexplored. This paper introduces a unified framework for text-to-3D content generation that addresses these critical gaps. Our approach utilizes multi-view guidance to iteratively form the structure of the 3D model, progressively enhancing detail and accuracy. We also introduce a novel densification algorithm that aligns Gaussians close to the surface, optimizing the structural integrity and fidelity of the generated models. Extensive experiments validate our approach, demonstrating that it produces high-quality visual outputs with minimal time cost. Notably, our method achieves high-quality results within half an hour of training, offering a substantial efficiency gain over most existing methods, which require hours of training time to achieve comparable results.
URL: https://openreview.net/forum?id=dhduuUN6vD
---