Weekly TMLR digest for Aug 03, 2025


TMLR

Aug 3, 2025, 12:00:12 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Expert Certification: Importance Weighting for Aligning Language Models under Deployment Distribution Shift

Thanawat Lodkaew, Tongtong Fang, Takashi Ishida, Masashi Sugiyama

https://openreview.net/forum?id=C7QWN4AXvp

---


Accepted papers
===============


Title: Don’t Judge Before You CLIP: A Unified Approach for Perceptual Tasks

Authors: Amit Zalcher, Navve Wasserman, Roman Beliy, Oliver Heinimann, Michal Irani

Abstract: Visual perceptual tasks aim to predict human judgment of images (e.g., emotions invoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making their data-labeling difficult. The scarcity of such human-annotated data results in small datasets leading to poor generalization. Typically, specialized models were designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose an identical architectural framework for solving multiple different perceptual tasks leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP's training data, which contain not only factual image descriptions, but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.

URL: https://openreview.net/forum?id=uvQTYi6kbu

---

Title: Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler

Authors: Aleksandr Dremov, Alexander Hägele, Atli Kosson, Martin Jaggi

Abstract: Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in achieving the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis reveals that different cooldown shapes expose a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations — comparable to those from cooldown shape selection — when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a loss landscape perspective, we provide visualizations of the landscape during cooldown, supporting the river valley loss perspective empirically. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.
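For readers unfamiliar with the WSD shape discussed above, here is a minimal sketch of such a schedule in Python; the linear cooldown and all names and defaults are illustrative assumptions, not taken from the paper (which studies several cooldown shapes):

    def wsd_lr(step, total_steps, peak_lr, warmup_steps, cooldown_steps):
        # Warmup: linear ramp from 0 to peak_lr.
        if step < warmup_steps:
            return peak_lr * step / max(1, warmup_steps)
        cooldown_start = total_steps - cooldown_steps
        # Stable phase: constant plateau at peak_lr.
        if step < cooldown_start:
            return peak_lr
        # Cooldown: anneal to zero; the *shape* of this final phase
        # (linear, 1-sqrt, cosine, ...) is what the paper analyses.
        frac = (step - cooldown_start) / max(1, cooldown_steps)
        return peak_lr * (1.0 - frac)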

URL: https://openreview.net/forum?id=ZnSYEcZod3

---

Title: SuFP: Piecewise Bit Allocation Floating-Point for Robust Neural Network Quantization

Authors: Geonwoo Ko, Sungyeob Yoo, Seri Ham, Seeyeon Kim, Minkyu Kim, Joo-Young Kim

Abstract: The rapid growth in model size and computational demand of Deep Neural Networks (DNNs) has led to significant challenges in memory and compute efficiency, necessitating the adoption of lower bit-width data types to enhance hardware performance. Floating-point 8 (FP8) has emerged as a promising solution, supported by the latest AI processors, due to its potential for reducing memory usage and computational load. However, each application often requires its own optimal FP8 configuration to achieve high performance, resulting in inconsistent performance and increased hardware complexity. To address these limitations, we introduce Super Floating-Point (SuFP), an innovative data type that integrates various floating-point configurations into a single representation through a piecewise bit allocation. This approach enables SuFP to effectively capture both dense regions near zero and sparse regions with outliers, thereby minimizing quantization errors and ensuring full-precision floating-point performance across different models. Furthermore, SuFP’s processing element design is optimized to reduce the hardware overhead. Our experimental results demonstrate the robustness and accuracy of SuFP over various neural networks in the vision and natural language processing domains. Remarkably, SuFP shows its superiority in large models such as large language model (Llama 2) and text-to-image generative model (Stable Diffusion v2). We also verify training feasibility on ResNet models and highlight the structural design of SuFP for general applicability.

URL: https://openreview.net/forum?id=7M1adi1nfX

---

Title: Disentangled and Self-Explainable Node Representation Learning

Authors: Simone Piaggesi, André Panisson, Megha Khosla

Abstract: Node embeddings are low-dimensional vectors that capture node properties, typically learned through unsupervised structural similarity objectives or supervised tasks. While recent efforts have focused on post-hoc explanations for graph models, intrinsic interpretability in unsupervised node embeddings remains largely underexplored. To bridge this gap, we introduce DiSeNE (Disentangled and Self-Explainable Node Embedding), a framework that learns self-explainable node representations in an unsupervised fashion. By leveraging disentangled representation learning, DiSeNE ensures that each embedding dimension corresponds to a distinct topological substructure of the graph, thus offering clear, dimension-wise interpretability. We introduce new objective functions grounded in principled desiderata, jointly optimizing for structural fidelity, disentanglement, and human interpretability. Additionally, we propose several new metrics to evaluate representation quality and human interpretability. Extensive experiments on multiple benchmark datasets demonstrate that DiSeNE not only preserves the underlying graph structure but also provides transparent, human-understandable explanations for each embedding dimension.

URL: https://openreview.net/forum?id=s51TQ8Eg1e

---

Title: Node Duplication Improves Cold-start Link Prediction

Authors: Zhichun Guo, Tong Zhao, Yozen Liu, Kaiwen Dong, William Shiao, Mingxuan Ju, Neil Shah, Nitesh V Chawla

Abstract: Graph Neural Networks (GNNs) are prominent in graph machine learning and have shown state-of-the-art performance in Link Prediction (LP) tasks. Nonetheless, recent studies show that GNNs struggle to produce good results on low-degree nodes despite their overall strong performance. In practical applications of LP, like recommendation systems, improving performance on low-degree nodes is critical, as it amounts to tackling the cold-start problem of improving the experiences of users with few observed interactions. In this paper, we investigate improving GNNs' LP performance on low-degree nodes while preserving their performance on high-degree nodes and propose a simple yet surprisingly effective augmentation technique called NodeDup. Specifically, NodeDup duplicates low-degree nodes and creates links between nodes and their own duplicates before following the standard supervised LP training scheme. By leveraging a ``multi-view'' perspective for low-degree nodes, NodeDup shows significant LP performance improvements on low-degree nodes without compromising any performance on high-degree nodes. Additionally, as a plug-and-play augmentation module, NodeDup can be easily applied on existing GNNs with very light computational cost. Extensive experiments show that NodeDup achieves 38.49%, 13.34%, and 6.76% relative improvements on isolated, low-degree, and warm nodes, respectively, on average across all datasets compared to GNNs and the existing cold-start methods.
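As a rough illustration of the augmentation described above, the sketch below duplicates low-degree nodes and links each one to its own duplicate. The degree threshold and the handling of node features (and whether duplicates inherit the original's edges) are assumptions left open here, not details from the paper:

    import numpy as np

    def node_dup(edge_list, num_nodes, degree_threshold=2):
        # Count node degrees from the (undirected) edge list.
        deg = np.zeros(num_nodes, dtype=int)
        for u, v in edge_list:
            deg[u] += 1
            deg[v] += 1
        new_edges = list(edge_list)
        next_id = num_nodes
        # Duplicate each low-degree node and link it to its duplicate.
        for v in range(num_nodes):
            if deg[v] <= degree_threshold:
                new_edges.append((v, next_id))
                next_id += 1
        return new_edges, next_id  # augmented edges, new total node count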

URL: https://openreview.net/forum?id=hIOTzz87N9

---

Title: LLaVA-Video: Video Instruction Tuning With Synthetic Data

Authors: Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun MA, Ziwei Liu, Chunyuan Li

Abstract: The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

URL: https://openreview.net/forum?id=EElFGvt39K

---

Title: Keep your distance: learning dispersed embeddings on $\mathbb{S}_{m}$

Authors: Evgeniia Tokarchuk, Hua Chang Bakker, Vlad Niculae

Abstract: Learning well-separated features in high-dimensional spaces, such as text or image embeddings, is crucial for many machine learning applications. Achieving such separation can be effectively accomplished through the dispersion of embeddings, where unrelated vectors are pushed apart as much as possible. By constraining features to be on a hypersphere, we can connect dispersion to well-studied problems in mathematics and physics, where optimal solutions are known for limited low-dimensional cases. However, in representation learning we typically deal with a large number of features in high-dimensional space, and moreover, dispersion is usually traded off with some other task-oriented training objective, making existing theoretical and numerical solutions inapplicable. Therefore, it is common to rely on gradient-based methods to encourage dispersion, usually by minimizing some function of the pairwise distances. In this work, we first give an overview of existing methods from disconnected literature, making new connections and highlighting similarities. Next, we introduce some new angles. We propose to reinterpret pairwise dispersion using a maximum mean discrepancy (MMD) motivation. We then propose an online variant of the celebrated Lloyd’s algorithm, of K-Means fame, as an effective alternative regularizer for dispersion on generic domains. Finally, we revise and empirically assess sliced regularizers that directly exploit properties of the hypersphere, proposing a new, simple but effective one. Our experiments show the importance of dispersion in image classification and natural language processing tasks, and how algorithms exhibit different trade-offs in different regimes.
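To make the "minimize some function of the pairwise distances" idea concrete, here is one common gradient-based dispersion penalty on the hypersphere (a log-sum-exp of pairwise cosine similarities). It is only an illustrative member of that family, not one of the regularizers proposed in the paper, and the temperature value is an arbitrary choice:

    import torch

    def dispersion_penalty(z, temperature=0.1):
        # Project embeddings onto the unit hypersphere.
        z = torch.nn.functional.normalize(z, dim=-1)
        sim = z @ z.T                      # pairwise cosine similarities
        n = z.shape[0]
        mask = ~torch.eye(n, dtype=torch.bool, device=z.device)
        # Penalising large similarities between distinct embeddings
        # pushes unrelated vectors apart when this term is minimised.
        return torch.logsumexp(sim[mask] / temperature, dim=0)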

URL: https://openreview.net/forum?id=5JIQE6HcTd

---

Title: Scaling and Distilling Transformer Models for sEMG

Authors: Nick Mehlman, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Kelvin Niu, Alexander H Miller, Shagun Sodhani

Abstract: Surface electromyography (sEMG) signals offer a promising avenue for developing innovative human-computer interfaces by providing insights into muscular activity. However, limited available training data and computational constraints during deployment have restricted the use of state-of-the-art machine learning models, such as transformers, in challenging sEMG tasks. In this paper, we demonstrate that transformer models can learn effective and generalizable representations from sEMG datasets that are small by modern deep learning standards (approximately 100 users), surpassing the performance of classical machine learning methods and older neural network architectures. Additionally, by leveraging model distillation techniques, we reduce parameter counts by up to 50x with minimal loss of performance. This results in efficient and expressive models suitable for complex real-time sEMG tasks in dynamic real-world environments.
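The abstract does not spell out the distillation objective; as background, a standard logit-distillation loss (one common choice, assumed here purely for illustration) looks like this:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft term: KL between temperature-softened teacher and student outputs.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard term: ordinary cross-entropy against the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard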

URL: https://openreview.net/forum?id=hFPWThwUiZ

---

Title: A Max-Min Approach to the Worst-Case Class Separation Problem

Authors: Mohammad Mahdi Omati, Prabhu Babu, Petre Stoica, Arash Amini

Abstract: In this paper, we propose a novel discriminative feature learning method based on a minorization-maximization framework for min-max (MM4MM) to address the long-standing “worst-case class separation (WCCS)” problem, which, in our design, refers to maximizing the minimum pairwise Chernoff distance between all class pairs in the low-dimensional subspace. The proposed algorithm relies on the relaxation of a semi-orthogonality constraint, which is proven to be tight at every iteration of the algorithm. To solve the worst-case class separation problem, we first introduce the vanilla version of the proposed algorithm, which requires solving a semi-definite program (SDP) at each iteration. We further simplify it to solving a quadratic program by formulating the dual of the surrogate maximization problem. We also then present reformulations of the worst-case class separation problem that enforce sparsity of the dimension-reducing matrix. The proposed algorithms are computationally efficient and are guaranteed to converge to optimal solutions. An important feature of these algorithms is that they do not require any hyperparameter tuning (except for the sparsity case, where a penalty parameter controlling sparsity must be chosen by the user). Experiments on several machine learning datasets demonstrate the effectiveness of the MM4MM approach.

URL: https://openreview.net/forum?id=EEmwBd4tfZ

---

Title: Solving Quadratic Programs via Deep Unrolled Douglas-Rachford Splitting

Authors: Jinxin Xiong, Xi Gao, Linxin Yang, Jiang Xue, Xiaodong Luo, Akang Wang

Abstract: Convex quadratic programs (QPs) are fundamental to numerous applications, including finance, engineering, and energy systems. Among the various methods for solving them, the Douglas-Rachford (DR) splitting algorithm is notable for its robust convergence properties. Concurrently, the emerging field of Learning-to-Optimize offers promising avenues for enhancing algorithmic performance, with algorithm unrolling receiving considerable attention due to its computational efficiency and interpretability. In this work, we propose an approach that unrolls a modified DR splitting algorithm to efficiently learn solutions for convex QPs. Specifically, we introduce a tailored DR splitting algorithm that replaces the computationally expensive linear system-solving step with a simplified gradient-based update, while retaining convergence guarantees. Consequently, we unroll the resulting DR splitting method and present a well-crafted neural network architecture to predict QP solutions. Our method achieves up to 50% reductions in iteration counts and 40% in solve time across benchmarks on both synthetic and real-world QP datasets, demonstrating its scalability and superior performance in enhancing computational efficiency across varying sizes.

URL: https://openreview.net/forum?id=xOfOgPnbtF

---

Title: Generalizable Spectral Embedding with an Application to UMAP

Authors: Nir Ben-Ari, Amitai Yacobi, Uri Shaham

Abstract: Spectral Embedding (SE) is a popular method for dimensionality reduction, applicable across diverse domains. Nevertheless, its current implementations face three prominent drawbacks which curtail its broader applicability: generalizability (i.e., out-of-sample extension), scalability, and eigenvectors separation. Existing SE implementations often address two of these drawbacks; however, they fall short in addressing the remaining one. In this paper, we introduce $\textit{Sep-SpectralNet}$ (eigenvector-separated SpectralNet), a SE implementation designed to address $\textit{all}$ three limitations. Sep-SpectralNet extends SpectralNet with an efficient post-processing step to achieve eigenvectors separation, while ensuring both generalizability and scalability. This method expands the applicability of SE to a wider range of tasks and can enhance its performance in existing applications. We empirically demonstrate Sep-SpectralNet's ability to consistently approximate and generalize SE, while maintaining SpectralNet's scalability. Additionally, we show how Sep-SpectralNet can be leveraged to enable generalizable UMAP visualization.

URL: https://openreview.net/forum?id=8cuQwztCKk

---

Title: Importance Weighting for Aligning Language Models under Deployment Distribution Shift

Authors: Thanawat Lodkaew, Tongtong Fang, Takashi Ishida, Masashi Sugiyama

Abstract: Aligning language models (LMs) with human preferences remains challenging partly because popular approaches, such as reinforcement learning from human feedback and direct preference optimization (DPO), often assume that the training data is sufficiently representative of the environment in which the model will be deployed. However, real-world applications frequently involve distribution shifts, e.g., changes in end-user behavior or preferences during usage or deployment, which pose a significant challenge to LM alignment approaches. In this paper, we propose an importance weighting method tailored for DPO, namely IW-DPO, to address distribution shifts in LM alignment. IW-DPO can be applied to joint distribution shifts in the prompts, responses, and preference labels without explicitly assuming the type of distribution shift. Our experimental results on various distribution shift scenarios demonstrate the usefulness of IW-DPO.
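A minimal sketch of how importance weighting can be attached to the standard DPO loss (all inputs are per-example log-probabilities; how the weights w are estimated is the paper's contribution and is not shown here):

    import torch.nn.functional as F

    def iw_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, w, beta=0.1):
        # Standard DPO logit: difference of policy-vs-reference log-ratios
        # for the chosen and rejected responses.
        logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
        per_example = -F.logsigmoid(logits)   # standard DPO loss per example
        return (w * per_example).mean()       # reweight by importance weights w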

URL: https://openreview.net/forum?id=C7QWN4AXvp

---

Title: Activate and Adapt: A Two-Stage Framework for Open-Set Model Adaptation

Authors: Xiasi Wang, Jiaqi Lin, Chaoqi Chen, Luyao Tang, Yi Huang, Chengsen Wang, Lei YE, Yuan Yao

Abstract: The ability of generalizing to new environments is critical for deep neural networks. Most existing works presume that the training and test data share an identical label set, overlooking the potential presence of new classes in test data. In this paper, we tackle a practical and challenging problem: Open-Set Model Adaptation (OSMA). OSMA aims to train a model on the source domain, which contains only known class data, and then adapt the trained model to the distribution-shifted target domain to classify known class data while identifying new class data. In this context, we face two challenges: (1) enabling the model to recognize new classes using only the known class data from the source domain during training, and (2) adapting the source-trained model to the target domain that contains new class data. To address these challenges, we propose a novel and universal two-stage framework named Activate and Adapt (ADA). In the training stage, we extract potential new class information hidden within the rich semantics of the source domain data to enable the model to identify new class data. Additionally, to retain source domain information while preserving data privacy, we condense the source domain data into a small dataset, facilitating the subsequent adaptation phase. In the test stage, we adaptively adjust the source-trained model to the target domain with new classes by infusing the style of target data into the condensed dataset, and decoupling domain alignment for known and new classes. Experiments across three standard benchmarks demonstrate that ADA surpasses previous methods in both online and offline settings.

URL: https://openreview.net/forum?id=2AWbwSpET9

---

Title: MUC: Machine Unlearning for Contrastive Learning with Black-box Evaluation

Authors: Yihan Wang, Yiwei Lu, Guojun Zhang, Franziska Boenisch, Adam Dziedzic, Yaoliang Yu, Xiao-Shan Gao

Abstract: Machine unlearning offers effective solutions for revoking the influence of specific training data on pre-trained model parameters. While existing approaches address unlearning for classification and generative models, they overlook an important category of machine learning models: contrastive learning (CL) methods. This paper addresses this gap by introducing the Machine Unlearning for Contrastive Learning (MUC) framework and adapting existing methods. We identify limitations in current approaches, noting that several methods perform inadequately as unlearners and that existing evaluation tools insufficiently validate unlearning effects in contrastive learning. To address these issues, we propose Alignment Calibration (AC), a novel method that explicitly considers contrastive learning properties and optimizes towards new auditing metrics for easy verification of unlearning. Through empirical comparisons with baseline methods on SimCLR, MoCo, and CLIP, we demonstrate that AC: (1) achieves state-of-the-art performance, approximating exact unlearning (retraining); (2) enables data owners to clearly visualize unlearning effects through black-box evaluation. The code is available at https://github.com/EhanW/Alignment-Calibration.

URL: https://openreview.net/forum?id=F9pjSDvuM9

---

Title: Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data Heterogeneity

Authors: Zitao Shuai, Chenwei Wu, Zhengxu Tang, Liyue Shen

Abstract: Vision-language pre-training (VLP) has emerged as an effective scheme for multimodal representation learning, but its reliance on large-scale multimodal data poses significant challenges for medical applications. Federated learning (FL) offers a promising solution to scale up the dataset for medical VLP while preserving data privacy. However, we observe that client data heterogeneity in real-world scenarios could cause models to learn biased cross-modal alignment during local pre-training. This would limit the transferability of the federally learned representation model on downstream tasks. To address this challenge, we propose Federated Distributionally Robust Alignment (FedDRA), a framework for federated VLP that achieves robust vision-language alignment under heterogeneous conditions. Based on client datasets, we construct a distribution family that encompasses potential test-time domains, and apply a distributionally robust framework to optimize the pre-trained model's performance across this distribution space. This approach bridges the gap between pre-training samples and downstream applications. To avoid over-fitting on client-specific information, we use anchor representation from the global model to guide the local training, and adopt a two-stage approach to first tune deeper layers before updating the entire network. Extensive experiments on real-world datasets demonstrate FedDRA’s effectiveness in enhancing medical federated VLP under data heterogeneity. Our method also adapts well to various medical pre-training methods.

URL: https://openreview.net/forum?id=hb3ZGvBja4

---

Title: Exploring End-to-end Differentiable Neural Charged Particle Tracking – A Loss Landscape Perspective

Authors: Tobias Kortus, Ralf Keidel, Nicolas R. Gauger

Abstract: Measurement and analysis of high-energy particles for scientific, medical or industrial applications is a complex procedure, requiring the design of sophisticated detector and data processing systems. The development of adaptive and differentiable software pipelines using a combination of conventional and machine learning algorithms is therefore becoming ever more important to optimize and operate the system efficiently while maintaining end-to-end (E2E) differentiability. In this work, we lay the groundwork for E2E differentiable decision-focused learning for the application of charged particle tracking using graph neural networks with combinatorial components solving a linear assignment problem for each detector layer. We demonstrate empirically that including differentiable variations of discrete assignment operations allows for efficient network optimization, working better or on par with approaches that lack E2E differentiability. In additional studies, we dive deeper into the optimization process and provide further insights from a loss landscape perspective, providing a robust foundation for future work. We demonstrate that while both methods converge into similar performing, globally well-connected regions, they suffer from substantial predictive instability across initialization and optimization methods, which can have unpredictable consequences on the performance of downstream tasks such as image reconstruction. We also point out a dependency between the interpolation factor of the gradient estimator and the prediction stability of the model, suggesting the choice of sufficiently small values. Given the strong global connectivity of learned solutions and the excellent training performance, we argue that E2E differentiability provides, besides the general availability of gradient information, an important tool for robust particle tracking to mitigate prediction instabilities by favoring solutions that perform well on downstream tasks.

URL: https://openreview.net/forum?id=1Pi2GwduEz

---

Title: Adversarial Subspace Generation for Outlier Detection in High-Dimensional Data

Authors: Jose Cribeiro-Ramallo, Federico Matteucci, Paul Enciu, Alexander Jenke, Vadim Arzamasov, Thorsten Strufe, Klemens Böhm

Abstract: Outlier detection in high-dimensional tabular data is challenging since data is often distributed across multiple lower-dimensional subspaces—a phenomenon known as the Multiple Views effect (MV). This effect led to a large body of research focused on mining such subspaces, known as *subspace selection*. However, as the precise nature of the MV effect was not well understood, traditional methods had to rely on heuristic-driven search schemes that struggle to accurately capture the true structure of the data. Properly identifying these subspaces is critical for unsupervised tasks such as outlier detection or clustering, where misrepresenting the underlying data structure can hinder the performance. We introduce Myopic Subspace Theory (MST), a new theoretical framework that mathematically formulates the Multiple Views effect and writes subspace selection as a stochastic optimization problem. Based on MST, we introduce V-GAN, a generative method trained to solve such an optimization problem. This approach avoids any exhaustive search over the feature space while ensuring that the intrinsic data structure is preserved. Experiments on 42 real-world datasets show that using V-GAN subspaces to build ensemble methods leads to a significant increase in one-class classification performance—compared to existing subspace selection, feature selection, and embedding methods. Further experiments on synthetic data show that V-GAN identifies subspaces more accurately while scaling better than other relevant subspace selection methods. These results confirm the theoretical guarantees of our approach and also highlight its practical viability in high-dimensional settings.

URL: https://openreview.net/forum?id=k7QsjiRE17

---

Title: Agreement-Based Cascading for Efficient Inference

Authors: Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith

Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.
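A schematic of agreement-based routing as described above; treating unanimous agreement as the exit condition and a majority vote at the final level are simplifying assumptions for illustration (models are assumed to return discrete class labels):

    def abc_predict(x, levels):
        # `levels` is a list of ensembles, ordered from smallest to largest models.
        for ensemble in levels[:-1]:
            preds = [model(x) for model in ensemble]
            if len(set(preds)) == 1:      # all members agree -> exit early
                return preds[0]
        # No early exit: defer to the largest level and take a majority vote.
        final = [model(x) for model in levels[-1]]
        return max(set(final), key=final.count)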

URL: https://openreview.net/forum?id=jn9B7LMlzk

---

Title: Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence

Authors: Adwait Datar, Nihat Ay

Abstract: The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry: the exponential family ($\theta$ coordinates) and the mixture family ($\eta$ coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the underlying statistical model. In continuous time, we prove that the convergence rates of GD in the $\theta$ and $\eta$ coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in $\eta$ and $\theta$ coordinates can be scaled to $2c$ and $\frac{2}{c}$, respectively, for any $c>0$, while NGD maintains a fixed convergence rate of $2$, remaining invariant to such transformations and sandwiched between them. Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.

URL: https://openreview.net/forum?id=h6hjjAF5Bj

---

Title: SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities

Authors: Yanis Lalou, Theo Gnassounou, Antoine Collas, Antoine de Mathelin, Ambroise Odonnat, Thomas Moreau, Alexandre Gramfort, Rémi Flamary

Abstract: Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. While many methods have been proposed in the literature, fair and realistic evaluation remains an open question, particularly due to methodological difficulties in selecting hyperparameters in the unsupervised setting. With SKADA-bench, we propose a framework to evaluate DA methods on diverse modalities, beyond the computer vision tasks that have been largely explored in the literature. We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Realistic hyperparameter selection is performed with nested cross-validation and various unsupervised model selection scores, on both simulated datasets with controlled shifts and real-world datasets across diverse modalities, such as images, text, biomedical, and tabular data. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications, with key insights into the choice and impact of model selection approaches. SKADA-bench is open-source, reproducible, and can be easily extended with novel DA methods, datasets, and model selection criteria without requiring the re-evaluation of competitors. The code is available at https://github.com/scikit-adaptation/skada-bench.

URL: https://openreview.net/forum?id=k9F63DV3Qe

---

Title: Long-Term Fairness Inquiries and Pursuits in Machine Learning: A Survey of Notions, Methods, and Challenges

Authors: Usman Gohar, Zeyu Tang, Jialu Wang, Kun Zhang, Peter Spirtes, Yang Liu, Lu Cheng

Abstract: The widespread integration of Machine Learning systems in daily life, particularly in high-stakes domains, has raised concerns about the fairness implications. While prior works have investigated static fairness measures, recent studies reveal that automated decision-making has long-term implications and that off-the-shelf fairness approaches may not serve the purpose of achieving long-term fairness. Additionally, the existence of feedback loops and the interaction between models and the environment introduces additional complexities that may deviate from the initial fairness goals. In this survey, we review existing literature on long-term fairness from different perspectives and present a taxonomy for long-term fairness studies. We highlight key challenges and consider future research directions, analyzing both current issues and potential further explorations.

URL: https://openreview.net/forum?id=mYi6EWvFlR

---

Title: Exact Recovery Guarantees for Parameterized Nonlinear System Identification Problem under Sparse Disturbances or Semi-Oblivious Attacks

Authors: Haixiang Zhang, Baturalp Yalcin, Javad Lavaei, Eduardo Sontag

Abstract: In this work, we study the problem of learning a nonlinear dynamical system by parameterizing its dynamics using basis functions. We assume that disturbances occur at each time step with an arbitrary probability $p$, which models the sparsity level of the disturbance vectors over time. These disturbances are drawn from an arbitrary, unknown probability distribution, which may depend on past disturbances, provided that it satisfies a zero-mean assumption. The primary objective of this paper is to learn the system's dynamics within a finite time and analyze the sample complexity as a function of $p$. To achieve this, we examine a LASSO-type non-smooth estimator, and establish necessary and sufficient conditions for its well-specifiedness and the uniqueness of the global solution to the underlying optimization problem. We then provide exact recovery guarantees for the estimator under two distinct conditions: boundedness and Lipschitz continuity of the basis functions. We show that finite-time exact recovery is achieved with high probability, even when $p$ approaches $1$. Unlike prior works, which primarily focus on independent and identically distributed (i.i.d.) disturbances and provide only asymptotic guarantees for system learning, this study presents the first finite-time analysis of nonlinear dynamical systems under a highly general disturbance model. Our framework allows for possible temporal correlations in the disturbances and accommodates semi-oblivious adversarial attacks, significantly broadening the scope of existing theoretical results.

URL: https://openreview.net/forum?id=c9o9UAmN3r

---

Title: SAIF: Sparse Adversarial and Imperceptible Attack Framework

Authors: Tooba Imtiaz, Morgan R Kohler, Jared F Miller, Zifeng Wang, Masih Eskandar, Mario Sznaier, Octavia Camps, Jennifer Dy

Abstract: Adversarial attacks hamper the decision-making ability of neural networks by perturbing the input signal. For instance, adding calculated small distortions to images can deceive a well-trained image classification network. In this work, we propose a novel attack technique called \textbf{S}parse \textbf{A}dversarial and \textbf{I}mperceptible Attack \textbf{F}ramework (SAIF). Specifically, we design imperceptible attacks that contain low-magnitude perturbations at a few pixels and leverage these sparse attacks to reveal the vulnerability of classifiers. We use the Frank-Wolfe (conditional gradient) algorithm to simultaneously optimize the attack perturbations for bounded magnitude and sparsity with $O(1/\sqrt{T})$ convergence. Empirical results show that SAIF computes highly imperceptible and interpretable adversarial examples, and largely outperforms state-of-the-art sparse attack methods on ImageNet and CIFAR-10.

URL: https://openreview.net/forum?id=YZL29eJ5j1

---

Title: Stochastic Block Model-Aware Topological Neural Networks for Graph Link Prediction

Authors: Yuzhou Chen, Xiao Guo, Shujie Ma

Abstract: Link prediction is an important learning task for graph-structured data and is indispensable to understanding graphs' properties. Recent works focus on designing complicated graph neural network (GNN) architectures to explore and capture various pairwise interactions among graph nodes. Most GNNs are based on combining graph structural and node feature information by iterative message-passing schemes. However, despite GNNs revolutionizing the field of graph representation learning, some thorny questions are raised concerning whether GNNs can efficiently learn the edge probabilities based on topological structures (i.e., higher-order interactions) and node features, and provide statistically rigorous uncertainty estimates. In this paper, we tackle these challenges and propose a novel stochastic block model (SBM)-aware topological neural network, called SBM-TNN, that uses SBMs to infer the latent community structure of nodes from graph structures and uses persistent homology to encode higher-order information. Furthermore, we theoretically study the entrywise bound and asymptotic normality of the estimated edge probability matrix to quantify the uncertainty in statistical inference of the edge probabilities. Our extensive experiments for link prediction on both graphs and knowledge graphs show that SBM-TNN achieves state-of-the-art performance over a set of popular baseline methods.

URL: https://openreview.net/forum?id=FBjVSPAsgs

---

Title: Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning

Authors: Aleksandar Todorov, Juan Cardenas-Cartagena, Rafael F. Cunha, Marco Zullich, Matthia Sabatelli

Abstract: Plasticity loss, a diminishing capacity to adapt as training progresses, is a critical challenge in deep reinforcement learning. We examine this issue in multi-task reinforcement learning (MTRL), where higher representational flexibility is crucial for managing diverse and potentially conflicting task demands. We systematically explore how sparsification methods, particularly Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), enhance plasticity and consequently improve performance in MTRL agents. We evaluate these approaches across distinct MTRL architectures (shared backbone, Mixture of Experts, Mixture of Orthogonal Experts) on standardized MTRL benchmarks, comparing against dense baselines, and a comprehensive range of alternative plasticity-inducing or regularization methods. Our results demonstrate that both GMP and SET effectively mitigate key indicators of plasticity degradation, such as neuron dormancy and representational collapse. These plasticity improvements often correlate with enhanced multi-task performance, with sparse agents frequently outperforming dense counterparts and achieving competitive results against explicit plasticity interventions. Our findings offer insights into the interplay between plasticity, network sparsity, and MTRL designs, highlighting dynamic sparsification as a robust but context-sensitive tool for developing more adaptable MTRL systems.

URL: https://openreview.net/forum?id=9L4Z23EfE9

---

Title: Adaptive Gradient Normalization and Independent Sampling for (Stochastic) Generalized-Smooth Optimization

Authors: Yufeng Yang, Erin E. Tripp, Yifan Sun, Shaofeng Zou, Yi Zhou

Abstract: Recent studies have shown that many nonconvex machine learning problems satisfy a generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, the existing algorithms are not fully adapted to such generalized-smooth nonconvex geometry and encounter significant technical limitations on their convergence analysis. In this work, we first analyze the convergence of adaptively normalized gradient descent under function geometries characterized by generalized-smoothness and the generalized PL condition, revealing the advantage of adaptive gradient normalization. Our results provide theoretical insights into adaptive normalization across various scenarios. For stochastic generalized-smooth nonconvex optimization, we propose the Independent-Adaptively Normalized Stochastic Gradient Descent algorithm, which leverages adaptive gradient normalization, independent sampling, and gradient clipping to achieve an $\mathcal{O}(\epsilon^{-4})$ sample complexity under relaxed noise assumptions. Experiments on large-scale nonconvex generalized-smooth problems demonstrate the fast convergence of our algorithm.

URL: https://openreview.net/forum?id=KKSQQMlEfw

---

Title: DIVINE: Diverse-Inconspicuous Feature Learning to Mitigate Abridge Learning

Authors: Saheb Chhabra, Kartik Thakral, Surbhi Mittal, Mayank Vatsa, Richa Singh

Abstract: Deep learning algorithms aim to minimize overall error and exhibit impressive performance on test datasets across various domains. However, they often struggle with out-of-distribution (OOD) data samples. We posit that deep models primarily capture prominent features beneficial for the task while neglecting subtle yet discriminative features, a phenomenon we refer to as Abridge Learning. To address this issue and encourage more comprehensive feature utilization, we introduce DIVINE (DIVerse and INconspicuous FEature Learning), a novel approach that leverages iterative feature suppression guided by dominance maps to ensure that models engage with a diverse and complementary set of discriminative features. Through extensive experiments on multiple datasets, including MNIST, CIFAR-10, CIFAR-100, TinyImageNet, and their corrupted and perturbed variants (CIFAR-10-C/P, CIFAR-100-C/P, TinyImageNet-C/P), we demonstrate that DIVINE significantly improves model robustness and generalization. On perturbation benchmarks, DIVINE achieves mean Flip Rates (mFR) of 5.36%, 3.10%, and 21.85% on CIFAR-10-P, CIFAR-100-P, and TinyImageNet-P respectively, compared to 6.53%, 11.75%, and 31.90% for standard training methods exhibiting Abridge Learning. Moreover, DIVINE attains state-of-the-art results on CIFAR-100-P, demonstrating that addressing Abridge Learning leads to more robust models against real-world distribution variations.

URL: https://openreview.net/forum?id=8NGKGTAD6F

---

Title: Variance Reduction of Stochastic Hypergradient Estimation by Mixed Fixed-Point Iteration

Authors: Naoyuki Terashita, Satoshi Hara

Abstract: Hypergradient represents how the hyperparameter of an optimization problem (or inner-problem) changes an outer-cost through the optimized inner-parameter, and it plays a crucial role in hyperparameter optimization, meta-learning, and data influence estimation. This paper studies hypergradient computation involving a stochastic inner-problem, a typical machine learning setting where the empirical loss is estimated by minibatches. Stochastic hypergradient estimation requires estimating products of Jacobian matrices of the inner iteration. Current methods struggle with large estimation variance because they depend on a specific sequence of Jacobian samples to estimate this product. This paper overcomes this problem by \emph{mixing} two different stochastic hypergradient estimation methods that use distinct sequences of Jacobian samples. Furthermore, we show that the proposed method enables almost sure convergence to the true hypergradient through the stochastic Krasnosel'ski\u{\i}-Mann iteration. Theoretical analysis demonstrates that, compared to existing approaches, our method achieves lower asymptotic variance bounds while maintaining comparable computational complexity. Empirical evaluations on synthetic and real-world tasks verify our theoretical results and demonstrate superior variance reduction over existing methods.

URL: https://openreview.net/forum?id=mkmX2ICi5c

---

Title: DiffNat: Exploiting the Kurtosis Concentration Property for Image Quality Improvement

Authors: Aniket Roy, Maitreya Suin, Anshul Shah, Ketul Shah, Jiang Liu, Rama Chellappa

Abstract: Diffusion models have significantly advanced generative AI in terms of creating and editing natural images. However, improving the image quality of generated images is still of paramount interest. In this context, we propose a generic kurtosis concentration (KC) loss that can be readily applied to any standard diffusion model pipeline to improve image quality. Our motivation stems from the projected kurtosis concentration property of natural images, which states that natural images have nearly constant kurtosis values across different band-pass filtered versions of the image. To improve the image quality of generated images, we reduce the gap between the highest and lowest kurtosis values across the band-pass filtered versions (e.g., Discrete Wavelet Transform (DWT)) of images. In addition, we also propose a novel condition-agnostic perceptual guidance strategy during inference to further improve the quality. We validate the proposed approach on four diverse tasks, viz., (1) personalized few-shot finetuning using text guidance, (2) unconditional image generation, (3) image super-resolution, and (4) blind face-restoration. Integrating the proposed KC loss and perceptual guidance has improved the perceptual quality in all these tasks in terms of FID, MUSIQ score, and user evaluation. Code: https://github.com/aniket004/DiffNat.git
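A rough numerical sketch of a kurtosis-concentration style penalty; it substitutes difference-of-Gaussian band-pass filters for the DWT mentioned in the abstract and is not the paper's KC loss:

    import numpy as np
    from scipy.ndimage import gaussian_filter
    from scipy.stats import kurtosis

    def kurtosis_gap(image, scales=((0.5, 1.0), (1.0, 2.0), (2.0, 4.0))):
        ks = []
        for s1, s2 in scales:
            # Band-pass filter the image (difference of Gaussians).
            band = gaussian_filter(image, s1) - gaussian_filter(image, s2)
            ks.append(kurtosis(band, axis=None))
        # Penalise the spread between the largest and smallest band kurtosis.
        return max(ks) - min(ks)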

URL: https://openreview.net/forum?id=HdZQ7pMPRd

---

Title: Label Smoothing is a Pragmatic Information Bottleneck

Authors: Sota Kudo

Abstract: This study revisits label smoothing via a form of information bottleneck. Under the assumption of sufficient model flexibility and no conflicting labels for the same input, we theoretically and experimentally demonstrate that the model output obtained through label smoothing explores the optimal solution of the information bottleneck. Based on this, label smoothing can be interpreted as a practical approach to the information bottleneck, enabling simple implementation. As an information bottleneck method, we experimentally show that label smoothing also exhibits the property of being insensitive to factors that do not contain information about the target, or to factors that provide no additional information about it when conditioned on another variable.
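For reference, the label smoothing operation analysed here is the standard one; a minimal version:

    import torch
    import torch.nn.functional as F

    def smooth_labels(labels, num_classes, eps=0.1):
        # Mix the one-hot target with the uniform distribution over classes.
        one_hot = F.one_hot(labels, num_classes).float()
        return (1.0 - eps) * one_hot + eps / num_classes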

URL: https://openreview.net/forum?id=Q0QEDhpbAK

---

Title: Preference Discerning with LLM-Enhanced Generative Retrieval

Authors: Fabian Paischer, Liu Yang, Linfeng Liu, Shuai Shao, Kaveh Hassani, Jiacheng Li, Ricky T. Q. Chen, Zhang Gabriel Li, Xiaoli Gao, Wei Shao, Xue Feng, Nima Noorshams, Sem Park, Bo Long, Hamid Eghbalzadeh

Abstract: In sequential recommendation, models recommend items based on user's interaction history. To this end, current models usually incorporate information such as item descriptions and user intent or preferences. User preferences are usually not explicitly given in open-source datasets, and thus need to be approximated, for example via large language models (LLMs). Current approaches leverage approximated user preferences only during training and rely solely on the past interaction history for recommendations, limiting their ability to dynamically adapt to changing preferences, potentially reinforcing echo chambers. To address this issue, we propose a new paradigm, namely *preference discerning*, which explicitly conditions a generative recommendation model on user preferences in natural language within its context. To evaluate *preference discerning*, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following. Upon evaluating current state-of-the-art methods on our benchmark, we discover that their ability to dynamically adapt to evolving user preferences is limited. To address this, we propose a new method named Mender (**M**ultimodal Prefer**en**ce **D**iscern**er**), which achieves state-of-the-art performance in our benchmark. Our results show that Mender effectively adapts its recommendation guided by human preferences, even if not observed during training, paving the way toward more flexible recommendation models.

URL: https://openreview.net/forum?id=74mrOdhvvT

---


New submissions
===============


Title: Rethinking Memory in Continual Learning: Beyond a Monolithic Store of the Past

Abstract: Memory is a critical component in replay-based continual learning (CL). Prior research has largely treated CL memory as a monolithic store of past data, focusing on how to select and store representative past examples. However, this perspective overlooks the higher-level memory architecture that governs the interaction between old and new data. In this work, we identify and characterize a dual-memory system that is inherently present in both online and offline CL settings. This system comprises: a short-term memory, which temporarily buffers recent data for immediate model updates, and a long-term memory, which maintains a carefully curated subset of past experiences for future replay and consolidation. We propose \textit{memory capacity ratio} (MCR), the ratio between short-term memory and long-term memory capacities, to characterize online and offline CL. Based on this framework, we systematically investigate how MCR influences generalization, stability, and plasticity. Across diverse CL settings—class-incremental, task-incremental, and domain-incremental—and multiple data modalities (e.g., image and text classification), we observe that a smaller MCR, characteristic of online CL, can yield comparable or even superior performance relative to a larger one, characteristic of offline CL, when both are evaluated under equivalent computational and data storage budgets. This advantage holds consistently across several state-of-the-art replay strategies, such as ER, DER, and SCR. Theoretical analysis further reveals that a reduced MCR yields a better trade-off between stability and plasticity and lowers a bound on generalization error when learning from non-stationary data streams with limited memory. These findings offer new insights into the role of memory allocation in continual learning and underscore the underexplored potential of online CL approaches.

URL: https://openreview.net/forum?id=wgjVUIYyOD

---

Title: Torque-Aware Momentum

Abstract: Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification, large language model fine-tuning and continual learning, when compared to classical momentum-based optimizers.
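One plausible reading of the damping idea, written as a single momentum update; the exact damping function used in the paper may differ, so treat this purely as an illustrative sketch:

    import torch

    def tam_momentum_update(momentum, grad, beta=0.9):
        # Angle-based damping: 1 when grad aligns with momentum, 0 when opposite.
        cos = torch.nn.functional.cosine_similarity(
            momentum.flatten(), grad.flatten(), dim=0)
        damping = (1.0 + cos) / 2.0
        # Misaligned gradients contribute less, stabilising the update direction.
        return beta * momentum + damping * grad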

URL: https://openreview.net/forum?id=BdvNbMLo5e

---

Title: Batched Nonparametric Bandits via k-Nearest Neighbor UCB

Abstract: We study sequential decision-making in batched nonparametric contextual bandits, where actions are selected over a finite horizon divided into a small number of batches. Motivated by constraints in domains such as medicine and marketing, where online feedback is limited, we propose a nonparametric algorithm that combines adaptive k-nearest neighbor (k-NN) regression with the upper confidence bound (UCB) principle. Our method, BaNk-UCB, is fully nonparametric, adapts to the context density, and is simple to implement. Unlike prior works relying on parametric or binning-based estimators, BaNk-UCB uses local geometry of the contexts to estimate rewards and adaptively balances exploration and exploitation. We provide near-optimal regret guarantees under standard Lipschitz smoothness and margin assumptions, using a theoretically motivated batch schedule that balances regret across batches and achieves minimax-optimal rates. Empirical evaluations on synthetic and real-world datasets demonstrate that BaNk-UCB consistently outperforms binning-based baselines.
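A bare-bones k-NN UCB arm-selection step, to make the idea concrete; the adaptive choice of k, the exploration constant, and the batch schedule are the paper's contributions and are not reproduced here:

    import numpy as np

    def knn_ucb_choose(context, history, num_arms, k=10, alpha=1.0):
        # history: list of (context, arm, reward) tuples from earlier batches.
        scores = []
        for a in range(num_arms):
            past = [(c, r) for c, arm, r in history if arm == a]
            if not past:
                return a                      # play each arm at least once
            dists = [np.linalg.norm(context - c) for c, _ in past]
            nearest = np.argsort(dists)[:k]   # k nearest past contexts
            rewards = np.array([past[i][1] for i in nearest])
            # Mean reward of neighbours plus a simple exploration bonus.
            scores.append(rewards.mean() + alpha / np.sqrt(len(nearest)))
        return int(np.argmax(scores))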

URL: https://openreview.net/forum?id=9gB2Eu0PXb

---

Title: IGNIS: A Robust Neural Network Framework for Constrained Parameter Estimation in Archimedean Copulas

Abstract: Classical estimators, the cornerstones of statistical inference, face insurmountable challenges when applied to important emerging classes of Archimedean copulas. These models exhibit pathological properties, including numerically unstable densities, non-monotonic parameter-to-dependence mappings, and vanishingly small likelihood gradients, rendering methods like Maximum Likelihood (MLE) and Method of Moments (MoM) inconsistent or computationally infeasible. We introduce \textbf{IGNIS}, a unified neural estimation framework that sidesteps these barriers by learning a direct, robust mapping from data-driven dependency measures to the underlying copula parameter $\theta$. IGNIS utilizes a multi-input architecture and a theory-guided output layer ($\mathrm{softplus}(z) + 1$) to automatically enforce the domain constraint $\hat{\theta} \ge 1$. Trained and validated on four families (Gumbel, Joe, and the numerically challenging A1/A2), IGNIS delivers accurate and stable estimates for real-world financial and health datasets, demonstrating its necessity for reliable inference in modern, complex dependence models where traditional methods fail.
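The constrained output layer mentioned in the abstract is straightforward to write down; a minimal PyTorch version (the hidden architecture feeding it is omitted and assumed):

    import torch.nn as nn
    import torch.nn.functional as F

    class ThetaHead(nn.Module):
        # Maps features to a copula parameter estimate with theta_hat >= 1.
        def __init__(self, in_features):
            super().__init__()
            self.linear = nn.Linear(in_features, 1)

        def forward(self, features):
            z = self.linear(features)
            return F.softplus(z) + 1.0   # softplus(z) + 1 enforces theta >= 1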

URL: https://openreview.net/forum?id=Z0hkPdCven

---

Title: Adapting Language Models to Produce Reliable Class Probabilities in Classification Tasks

Abstract: Large generative language models (GLM) provide a versatile tool for solving a wide variety of natural language processing tasks. GLM responses, though, are provided in the form of text, without an indication of the model's confidence in the answer. This limits the usability of these models on high-risk applications where decisions made based on an incorrect answer can have severe consequences. In this work, we focus on the problem of generating reliable class posterior distributions for text classification tasks, which can be used both for decision making and for producing interpretable confidence scores for the user. We show that the naive approach for computing posteriors based on the token posteriors produced by the GLM results in extremely poor posteriors. We then explore different adaptation approaches for improving the quality of posteriors, focusing on low resource scenarios where a small amount of data is available for adaptation. We show that parameter-efficient supervised fine-tuning (SFT), while providing large gains in terms of decision quality, produces suboptimal posteriors due to overfitting. To address this problem, we propose an approach that combines SFT and post-hoc calibration using a three-stage training strategy, improving the quality of both posteriors and categorical decisions.
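A common post-hoc calibration step of the kind combined with SFT here is temperature scaling; the sketch below fits a single temperature on held-out logits and is only a generic example, not the paper's three-stage recipe:

    import torch
    import torch.nn.functional as F

    def fit_temperature(logits, labels, lr=0.01, max_iter=200):
        # Learn a single temperature T so that softmax(logits / T) is calibrated.
        log_t = torch.zeros(1, requires_grad=True)
        opt = torch.optim.LBFGS([log_t], lr=lr, max_iter=max_iter)

        def closure():
            opt.zero_grad()
            loss = F.cross_entropy(logits / log_t.exp(), labels)
            loss.backward()
            return loss

        opt.step(closure)
        return log_t.exp().item()   # apply as softmax(logits / T) at inference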

URL: https://openreview.net/forum?id=VVneIp69GR

---

Title: Proper Orthogonal Decomposition for Scalable Training of Graph Neural Networks

Abstract: As large-scale graphs become ubiquitous in real-world applications, there is growing concern about the memory and time requirement to train a graph neural network (GNN) model for such datasets. Storing the entire adjacency and node embedding matrices in memory is infeasible in such a scenario. Standard sampling-based methods for addressing the memory constraint suffer from the dependence of the number of mini-batches on the graph size. Existing sketch-based methods and graph compression techniques operate at higher sketch ratios, with the graph compression techniques showing poor generalization, implying that different GNNs trained on the same synthetic graph have performance gaps. Sketch-based methods necessitate online learning of sketches, further increasing the complexity. In this paper, we propose a new sketch-based algorithm, PGNN, employing the Proper Orthogonal Decomposition (POD) method to craft update rules to train GNNs, improving the memory requirement and training time without the complication of updating the sketches during training. Experiments on standard graph datasets show that PGNN can reach much lower sketch ratios without compromising the performance. We prove the optimality of the POD update rule for the linearized GNN (SGC). Empirical findings validate our approach, demonstrating superior performance at reduced sketch ratios and adaptability across various GNN architectures.

URL: https://openreview.net/forum?id=LeL6whBoWE

---

Title: Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving

Abstract: Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion related tasks, such as prediction and planning, impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method that separates semantic and motion learning. Specifically, we employ a set of learned motion queries that operate in parallel with detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset with UniAD and SparseDrive confirm the effectiveness of our divide and merge approach, resulting in performance improvements across perception, prediction, and planning. The code will be released.

URL: https://openreview.net/forum?id=RvtCNm1Rdv

---

Title: RANa: Retrieval-Augmented Navigation

Abstract: Methods for navigation based on large-scale learning typically treat each episode as a new problem, where the agent is spawned with a clean memory in an unknown environment. While these generalization capabilities to an unknown environment are extremely important, we claim that, in a realistic setting, an agent should have the capacity of exploiting information collected during earlier robot operations. We address this by introducing a new retrieval-augmented agent, trained with RL, capable of querying a database collected from previous episodes in the same environment and learning how to integrate this additional context information. We introduce a unique agent architecture for the general navigation task, evaluated on ImageNav, Instance-ImageNav and ObjectNav. Our retrieval and context encoding methods are data-driven and employ vision foundation models (FM) for both semantic and geometric understanding. We propose new benchmarks for these settings and we show that retrieval allows zero-shot transfer across tasks and environments while significantly improving performance.

URL: https://openreview.net/forum?id=OWCJ5JfsRB

---

Title: Multi-BK-Net: Multi-Branch Multi-Kernel Convolutional Neural Networks for Clinical EEG Analysis

Abstract: Classifying an electroencephalography (EEG) recording as pathological or non-pathological is an important first step in diagnosing and managing neurological diseases and disorders. As manual EEG classification is costly, time-consuming and requires highly trained experts, deep learning methods for automated classification of general EEG pathology offer a promising option to assist clinicians in screening EEGs. Convolutional neural networks (CNNs) are well-suited for classifying pathological EEG signals due to their ability to perform end-to-end learning. In practice, however, current CNN solutions suffer from limited generalisation due to I) a single-scale network design that cannot fully capture the high intra- and inter-subject variability of the EEG signal, the diversity of the data, and the heterogeneity of pathological EEG patterns; and II) the small size and limited diversity of the dataset commonly used to train and evaluate the networks. These challenges result in a low sensitivity score and a performance drop on other datasets, further hindering their reliability for real-world applications.
Here, we propose a novel multi-branch, multi-scale CNN called Multi-BK-Net (Multi-Branch Multi-Kernel Network), comprising five parallel branches that incorporate temporal convolution, spatial convolution, and pooling layers, with temporal kernel sizes defined based on five clinically relevant frequency bands within its first block.
Evaluation is based on two public datasets with predefined test sets: the Temple University Hospital (TUH) Abnormal EEG Corpus and the TUH Abnormal Expansion Balanced EEG Corpus.
Our Multi-BK-Net outperforms five baseline architectures and state-of-the-art end-to-end approaches in terms of accuracy and sensitivity on these datasets, setting a new benchmark. Furthermore, ablation experiments highlight the effectiveness of the multi-branch, multi-scale input block of the Multi-BK-Net. Overall, our approach demonstrates the effectiveness of multi-branch, multi-scale CNNs in accurately and reliably classifying general EEG pathology, while being more effective at handling data heterogeneity, and constitutes a next step towards deep end-to-end classification of general EEG pathology.

URL: https://openreview.net/forum?id=IsG10xZAaA

---

Title: StructTest: Benchmarking LLMs’ Reasoning through Compositional Structured Outputs

Abstract: The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. We propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. The tasks in StructTest require significant reasoning skills. Assessments are conducted deterministically using rule-based evaluators, which can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains—including Summarization, Code, HTML, and Math—and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o, establishing it as a robust proxy for measuring reasoning capabilities. We believe StructTest offers a critical and complementary approach to achieving objective and comprehensive model evaluation. Our code and data are available at https://anonymous.4open.science/r/StructTest-EF37/README.md.

URL: https://openreview.net/forum?id=8SMiJtqheH

---

Title: BoSS: A Best-of-Strategies Selector as an Upper Baseline for Deep Active Learning

Abstract: Active learning (AL) aims to reduce annotation costs while maximizing model performance by iteratively selecting valuable instances.
While foundation models have made it easier to identify these instances, existing selection strategies still lack robustness across different models, annotation budgets, and datasets.
To quantify the performance gains that are still attainable and to establish a reference point for research, we explore oracle strategies, i.e., upper baseline strategies approximating the optimal selection by accessing ground truth information unavailable in practical AL scenarios. Current oracle strategies, however, fail to scale effectively to large datasets and complex deep neural networks. To tackle these limitations, we introduce the Best-of-Strategy Selector (BoSS), a scalable oracle strategy designed for large-scale AL scenarios.
BoSS constructs a set of candidate batches through an ensemble of selection strategies and then selects the batch yielding the highest performance gain. As an ensemble of selection strategies, BoSS can be easily extended with new state-of-the-art strategies as they emerge, ensuring it remains a reliable upper baseline in the future.
Our evaluation demonstrates that i) BoSS outperforms existing oracle strategies, ii) state-of-the-art AL strategies have significant room for improvement, especially in large-scale datasets with many classes, and iii) one possible solution to counteract the inconsistent performance of AL strategies is to employ an ensemble‑based approach for the selection.
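
A minimal sketch of the selection loop described above: each candidate strategy proposes a batch, and an oracle evaluation (which uses ground-truth information unavailable in practical AL) keeps the best one. The strategy interface and the `train_and_score` helper are placeholders, not the paper's implementation.

```python
def boss_select(strategies, labeled, unlabeled, budget, train_and_score):
    """Return the candidate batch with the highest oracle performance gain."""
    best_batch, best_score = None, float("-inf")
    for strategy in strategies:
        candidate = strategy(labeled, unlabeled, budget)   # batch proposed by one AL strategy
        score = train_and_score(labeled + candidate)       # oracle: retrain and evaluate on ground truth
        if score > best_score:
            best_batch, best_score = candidate, score
    return best_batch
```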

URL: https://openreview.net/forum?id=qTs6spvhOS

---

Title: Accounting for Missing Covariates in Heterogeneous Treatment Estimation

Abstract: Many applications of causal inference require using treatment effects estimated on a study population to then make decisions for a separate target population that lacks treatment and outcome data. We consider the challenging setting where there are important covariates that are observed in the target population but are missing from the original study. Our goal is to estimate the tightest possible bounds on heterogeneous treatment effects conditioned on such newly observed covariates. We introduce a novel partial identification strategy based on ideas from ecological inference; the main idea is that estimates of conditional treatment effects for the full covariate set must marginalize correctly when restricted to only the covariates observed in both populations. Furthermore, we introduce a bias-corrected estimator for these bounds and prove that it enjoys fast convergence rates and statistical guarantees (e.g., asymptotic normality). Experimental results on both real and synthetic data demonstrate that our framework can produce bounds that are much tighter than would otherwise be possible.

URL: https://openreview.net/forum?id=05AIXzU4HV

---

Title: Dataset Condensation with Color Compensation

Abstract: Dataset condensation always faces a constitutive trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficient condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. Through empirical observations, we find that a critical problem in dataset condensation is the oversight of color's dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images is beneficial for representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes a latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3, which outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, besides focusing on downstream tasks, DC3 is the first research to fine-tune pre-trained diffusion models with condensed datasets. The FID results prove that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data will be released soon.

URL: https://openreview.net/forum?id=hIdwvIOiJt

---

Title: Bags of Projected Nearest Neighbours: Competitors to Random Forests?

Abstract: In this paper we introduce a simple and intuitive adaptive k nearest neighbours classifier, and explore its utility within the context of bootstrap aggregating (“bagging”). The approach is based on finding discriminant subspaces which are computationally efficient to compute, and are motivated by enhancing the discrimination of classes through nearest neighbour classifiers. This adaptiveness promotes diversity of the individual classifiers fit across different bootstrap samples, and so further leverages the variance reducing effect of bagging. Extensive experimental results are presented documenting the strong performance of the proposed approach in comparison with Random Forest classifiers, as well as other nearest neighbours based ensembles from the literature, plus other relevant benchmarks.

URL: https://openreview.net/forum?id=ZKLj2U0CsO

---

Title: KASPER: Kolmogorov Arnold Networks for Stock Prediction and Explainable Regimes

Abstract: Forecasting in financial markets remains a significant challenge due to their nonlinear and regime-dependent dynamics. Traditional deep learning models, such as long short-term memory networks and multilayer perceptrons, often struggle to generalize across shifting market conditions, highlighting the need for a more adaptive and interpretable approach. To address this, we introduce Kolmogorov–Arnold networks for stock prediction and explainable regimes (KASPER), a novel framework that integrates regime detection, sparse spline-based function modeling, and symbolic rule extraction. The framework identifies hidden market conditions using a Gumbel-Softmax-based mechanism, enabling regime-specific forecasting. For each regime, it employs Kolmogorov–Arnold networks with sparse spline activations to capture intricate price behaviors while maintaining robustness. Interpretability is achieved through symbolic learning based on Monte Carlo Shapley values, which extracts human-readable rules tailored to each regime. Applied to real-world financial time series from Yahoo Finance, the model achieves an $R^2$ score of 0.89, a Sharpe Ratio of 12.02, and a mean squared error as low as 0.0001, outperforming existing methods. This research establishes a new direction for regime-aware, transparent, and robust forecasting in financial markets.
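
A minimal sketch of the Gumbel-Softmax-based regime selection the abstract refers to, using PyTorch's built-in `gumbel_softmax`; the feature dimensions, the regime head, and the downstream mixing are illustrative assumptions rather than KASPER's implementation.

```python
import torch
import torch.nn.functional as F

batch, n_regimes, feat = 32, 3, 8
features = torch.randn(batch, feat)                     # per-timestep market features (illustrative)
regime_head = torch.nn.Linear(feat, n_regimes)
regime_logits = regime_head(features)
regime_weights = F.gumbel_softmax(regime_logits, tau=0.5, hard=False)   # soft regime assignment
# Downstream (not shown): mix regime-specific KAN forecasts using regime_weights.
```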

URL: https://openreview.net/forum?id=PD4jGJQtL8

---

Title: Scalable Generative Modeling of Weighted Graphs

Abstract: Weighted graphs are ubiquitous throughout biology, chemistry, and the social sciences, motivating the development of generative models for abstract weighted graph data using deep neural networks. However, most current deep generative models are either designed for unweighted graphs and are not easily extended to weighted topologies or incorporate edge weights without consideration of a joint distribution with topology. Furthermore, learning a distribution over weighted graphs must account for complex nonlocal dependencies between both the edges of the graph and corresponding weights of each edge. We develop an autoregressive model BiGG-E, a nontrivial extension of the BiGG model, that learns a joint distribution over weighted graphs while still exploiting sparsity to generate a weighted graph with $n$ nodes and $m$ edges in $O((n + m)\log n)$ time. Simulation studies and experiments on a variety of benchmark datasets demonstrate that BiGG-E best captures distributions over weighted graphs while remaining scalable and computationally efficient.

URL: https://openreview.net/forum?id=yWKkBOcD18

---

Title: Estimating Expected Calibration Error for Positive-Unlabeled Learning

Abstract: The reliability of probabilistic classifiers hinges on their calibration---the property that their confidence scores accurately reflect the true class probabilities.
The expected calibration error (ECE) is a standard metric for quantifying the calibration of classifiers.
However, its estimation presumes access to ground-truth labels.
In positive-unlabeled (PU) learning, only positive and unlabeled data are available, which makes the standard ECE estimator inapplicable.
Although PU learning has been extensively studied for risk estimation and classifier training, calibration in this setting has received little attention.
In this paper, we present PU-ECE, the first ECE estimator for PU data.
We provide non-asymptotic bias bounds and prove convergence rates that match those of the fully supervised ECE with an optimal bin size.
Furthermore, we develop an information-theoretic generalization error analysis of PU-ECE by formalizing the conditional mutual information (CMI) for a PU setting.
Experiments on synthetic and real-world benchmark datasets validate our theoretical analysis and demonstrate that our PU-based ECE estimator achieves performance comparable to that of the fully-labeled ECE estimator.
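
For reference, the standard binned ECE estimator for fully labeled data, which the paper generalizes to the PU setting, can be sketched as follows; the PU-ECE estimator itself is not reproduced here.

```python
import numpy as np

def binned_ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """Standard binned ECE: weighted average gap between confidence and accuracy per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += (mask.sum() / n) * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

# Example: confidences in [0, 1] and 0/1 correctness indicators.
print(binned_ece(np.random.rand(1000), np.random.randint(0, 2, 1000)))
```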

URL: https://openreview.net/forum?id=SvoBtLIrPZ

---

Title: TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding

Abstract: Scene graphs have proven to be highly effective for various scene understanding tasks due to their compact and explicit representation of relational information. However, current methods often overlook the critical importance of preserving symmetry when generating scene graphs from 3D point clouds, which can lead to reduced accuracy and robustness, particularly when dealing with noisy, multi-view data. Furthermore, a major limitation of prior approaches is the lack of temporal modeling to capture time-dependent relationships among dynamically evolving entities in a scene. To address these challenges, we propose Temporal Equivariant Scene Graph Neural Network (TESGNN), consisting of two key components: (1) an Equivariant Scene Graph Neural Network (ESGNN), which extracts information from 3D point clouds to generate scene graphs while preserving crucial symmetry properties, and (2) a Temporal Graph Matching Network, which fuses scene graphs generated by ESGNN across multiple time sequences into a unified global representation using an approximate graph-matching algorithm. Our combined architecture TESGNN outperforms current state-of-the-art methods in scene graph generation, achieving higher accuracy and faster training convergence. Moreover, we show that leveraging the symmetry-preserving property produces a more stable and accurate global scene representation compared to existing approaches. Last but not least, it is computationally efficient and easily implementable using existing frameworks, making it well-suited for real-time applications in robotics and computer vision. This approach paves the way for more robust and scalable solutions to complex multi-view scene understanding challenges.

URL: https://openreview.net/forum?id=boM0kkYPzE

---

Title: Meta Prompting: A Framework for Agentic and Compositional Reasoning

Abstract: We introduce Meta Prompting (MP), a framework that elevates the reasoning capabilities of large language models (LLMs) by focusing on the formal structure of a task rather than content-specific examples. We establish a theoretical foundation for this paradigm, formalizing MP as a functor that maps a category of tasks to a category of structured prompts, thereby guaranteeing that compositional problem-solving strategies can be systematically decomposed into modular prompt structures. We extend this concept to Recursive Meta Prompting (RMP), an automated process where an LLM can generate and refine its own prompts. We model this self-improvement loop formally as a monad, providing a principled framework for automated prompt engineering. Our claims are validated through several experiments demonstrating that a Qwen-72B base model, guided by a single, example-agnostic meta-prompt, achieves improved results on MATH, GSM8K, and Game of 24. These results are achieved with substantial token efficiency gains over traditional few-shot methods.

URL: https://openreview.net/forum?id=lgrhcptfam

---

Title: A Survey on Generative Modeling with Limited Data, Few Shots, and Zero Shot

Abstract: Generative modeling in machine learning aims to synthesize new data samples that are statistically similar to those observed during training. While conventional generative models such as GANs and diffusion models typically assume access to large and diverse datasets, many real-world applications (e.g., in medicine, satellite imaging, and artistic domains) operate under limited data availability and strict constraints. In this survey, we examine Generative Modeling under Data Constraint (GM-DC), which includes limited-data, few-shot, and zero-shot settings. We present a unified perspective on the key challenges in GM-DC, including overfitting, frequency bias, and incompatible knowledge transfer, and discuss how these issues impact model performance.
To systematically analyze this growing field, we introduce two novel taxonomies: one categorizing GM-DC tasks (e.g., unconditional vs. conditional generation, cross-domain adaptation, and subject-driven modeling), and another organizing methodological approaches (e.g., transfer learning, data augmentation, meta-learning, and frequency-aware modeling).
Our study reviews over 230 papers, offering a comprehensive view across generative model types and constraint scenarios. We further analyze task-approach-method interactions using a Sankey diagram and highlight promising directions for future work, including adaptation of foundation models, holistic evaluation frameworks, and data-centric strategies for sample selection.
This survey provides a timely and practical roadmap for researchers and practitioners aiming to advance generative modeling under limited data. Project website: https://anonymous4mysubmission.github.io/gmdc-survey/.

URL: https://openreview.net/forum?id=u7GTHazuRp

---

Title: Pseudo-Physics-Informed Neural Operators: Enhancing Operator Learning from Limited Data

Abstract: Neural operators have shown great potential in surrogate modeling. However, training a well-performing neural operator typically requires a substantial amount of data, which can pose a major challenge in complex applications. In such scenarios, detailed physical knowledge can be unavailable or difficult to obtain, and collecting extensive data is often prohibitively expensive. To mitigate this challenge, we propose the Pseudo Physics-Informed Neural Operator (PPI-NO) framework. PPI-NO constructs a surrogate physics system for the target system using partial differential equations (PDEs) derived from simple, rudimentary physics principles, such as basic differential operators.
This surrogate system is coupled with a neural operator model, using an alternating update and learning process to iteratively enhance the model's predictive power.
While the physics derived via PPI-NO may not mirror the ground-truth underlying physical laws --- hence the term ``pseudo physics'' --- this approach significantly improves the accuracy of standard operator learning models in data-scarce scenarios, which is evidenced by extensive evaluations across five benchmark tasks and a fatigue modeling application.

URL: https://openreview.net/forum?id=5N1V25Rf7D

---

Title: Unveiling Topological Structures from Language: A Survey of Topological Data Analysis Applications in NLP

Abstract: The surge of data available on the Internet has led to the adoption of various computational methods to analyze and extract valuable insights from this wealth of information. Among these, the field of Machine Learning (ML) has thrived by leveraging data to extract meaningful insights. However, ML techniques face notable challenges when dealing with real-world data, often due to issues of imbalance, noise, insufficient labeling, and high dimensionality. To address these limitations, some researchers advocate for the adoption of Topological Data Analysis (TDA), a statistical approach that discerningly captures the intrinsic shape of data despite noise. Despite its potential, TDA has not gained as much traction within the Natural Language Processing (NLP) domain compared to structurally distinct areas like computer vision. Nevertheless, a dedicated community of researchers has been exploring the application of TDA in NLP, yielding 100 papers, which we comprehensively survey in this paper. Our findings categorize these efforts into theoretical and non-theoretical approaches. Theoretical approaches aim to explain linguistic phenomena from a topological viewpoint, while non-theoretical approaches merge TDA with ML features, utilizing diverse numerical representation techniques. We conclude by exploring the challenges and unresolved questions that persist in this niche field.

URL: https://openreview.net/forum?id=pf4UWMpTLE

---

Title: DualXDA: Towards Sparse, Efficient and Explainable Data Attribution in Large AI Models

Abstract: Contemporary deep learning models achieve remarkable performance over a wide range of domains, yet their decision-making processes often remain opaque. In response, the field of eXplainable Artificial Intelligence (XAI) has grown significantly over the last decade, primarily focusing on feature attribution methods to shed light on which input features drive model predictions. Complementing this perspective, Data Attribution (DA) has emerged as a promising paradigm that shifts the focus from features to data provenance. With the insights gained on the level of (training) data points, DA provides transparency about the model and individual predictions, e.g., for model debugging, identifying data-related causes of suboptimal performance (such as mislabelled instances), dataset distillation, or knowledge discovery. However, existing DA approaches suffer from prohibitively high computational costs and memory demands when applied to large-scale or even medium-scale datasets and models, forcing practitioners to resort to approximations that may fail to capture the true inference process of the underlying model. Additionally, current attribution methods exhibit low sparsity, resulting in non-negligible attribution scores across a high number of training examples, hindering the discovery of decisive patterns in the data.
In this work, we introduce DualXDA, a framework for sparse, efficient and explainable DA, comprising two interlinked approaches for Dual Data Attribution (DualDA) and eXplainable Data Attribution (XDA): With DualDA, we propose a novel approach for efficient and effective DA, leveraging Support Vector Machine theory to provide fast and naturally sparse data attributions for AI predictions. In extensive quantitative analyses, we demonstrate that DualDA achieves high attribution quality and excels at solving a series of evaluated downstream tasks, while at the same time improving explanation time by a factor of up to $4{,}100{,}000\times$ compared to the original Influence Functions method, and up to $11{,}000\times$ compared to the method’s most efficient approximation from the literature to date. We further introduce XDA, a method for enhancing Data Attribution with capabilities from feature attribution methods to explain why training samples are relevant for the prediction of a test sample in terms of impactful features, which we showcase and verify qualitatively in detail. Taken together, our contributions in DualXDA ultimately point towards a future of eXplainable AI applied at unprecedented scale, enabling transparent, efficient and novel analysis of even the largest neural architectures – such as LLMs – and fostering a new generation of interpretable and accountable AI systems. The implementation of our methods, as well as the full experimental protocol, is available on GitHub.

URL: https://openreview.net/forum?id=qfx81N884A

---

Title: The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

Abstract: Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing perspective within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT’s performance in pattern-based ICL: while explicit reasoning falters due to LLMs’ struggles to infer underlying patterns from demonstrations, implicit reasoning—disrupted by the increased contextual distance of CoT rationales—often compensates, delivering correct answers despite flawed rationales. This duality explains CoT’s relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.

URL: https://openreview.net/forum?id=7SIrvcYNYj

---

Title: Exploring a novel Feedback Mechanism for Convolutional Neural Networks

Abstract: Convolutional neural networks (CNNs), which have achieved significant success in various visual tasks, are inspired by the architecture of the mammalian vision system. However, unlike CNNs, the visual cortex contains a substantial number of top-down or feedback connections. Inspired by this, recent research has investigated incorporating feedback mechanisms into CNNs. In this paper, we propose a novel feedback mechanism called 'Image Specific Feature Selection (ISFS)' that leverages feedback to utilize only a relevant subset of filters for the given image. The feedback weights are learned, and thus the network learns to select features/filters tailored to each image. The feedback improves both classification accuracy and confidence. The selection of filters through the feedback is indeed image-specific and results in interesting behaviour of the network. The feedback signals produced for a given image can be viewed as a useful low-dimensional approximation of the internal representation of the image. We demonstrate that we can effectively use the feedback signals to identify when a given image has adversarial noise.

URL: https://openreview.net/forum?id=YI1iDh2nYc

---

Title: Real-Time Privacy Preservation for Robot Visual Perception

Abstract: Many robots (e.g., iRobot's Roomba) operate based on visual observations from live video streams, and such observations may inadvertently include privacy-sensitive objects, such as personal identifiers. Existing approaches for preserving privacy rely on deep learning models, differential privacy, or cryptography. They lack guarantees for the complete concealment of all sensitive objects. Guaranteeing concealment requires post-processing techniques and thus is inadequate for real-time video streams. We develop a method for privacy-constrained video streaming, PCVS, that conceals sensitive objects within real-time video streams. PCVS takes a logical specification constraining the existence of privacy-sensitive objects, e.g., never show faces when a person exists. It uses a detection model to evaluate the existence of these objects in each incoming frame. Then, it blurs out a subset of objects such that the existence of the remaining objects satisfies the specification. We then propose a conformal prediction approach to (i) establish a theoretical lower bound on the probability of the existence of these objects in a sequence of frames satisfying the specification and (ii) update the bound with the arrival of each subsequent frame. Quantitative evaluations show that PCVS achieves over 95 percent specification satisfaction rate in multiple datasets, significantly outperforming other methods. The satisfaction rate is consistently above the theoretical bounds across all datasets, indicating that the established bounds hold. Additionally, we deploy PCVS on robots in real-time operation and show that the robots operate normally without being compromised when PCVS conceals objects.

URL: https://openreview.net/forum?id=uMf2vn8396

---

Title: A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation

Abstract: Advances in architectural design, data availability, and compute have driven remarkable progress in semantic segmentation. Yet, these models often rely on relaxed Bayesian assumptions, omitting critical uncertainty information needed for robust decision-making. The resulting reliance on point estimates has fueled interest in probabilistic segmentation, but the literature remains fragmented. In response, this review consolidates and contextualizes foundational concepts in uncertainty modeling, including the non-trivial task of distinguishing between epistemic and aleatoric uncertainty and examining their roles across four key downstream segmentation tasks, highlighting Active Learning as particularly promising. By unifying theory, terminology, and applications, we provide a coherent foundation for researchers and identify critical challenges, such as strong assumptions in spatial aggregation, lack of standardized benchmarks, and pitfalls in current uncertainty quantification methods. We identify trends such as the adoption of contemporary generative models, driven by advances in the broader field of generative modeling, with segmentation-specific innovation primarily in the conditioning mechanisms. Moreover, we observe growing interest in distribution- and sampling-free approaches to uncertainty estimation. We further propose directions for advancing uncertainty-aware segmentation in deep learning, including pragmatic strategies for disentangling different sources of uncertainty, novel uncertainty modeling approaches and improved Transformer-based backbones. In this way, we aim to support the development of more reliable, efficient, and interpretable segmentation models that effectively incorporate uncertainty into real-world applications.

URL: https://openreview.net/forum?id=Yzf4anYwao

---

Title: A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Abstract: Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via, e.g., edge maps or poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling-time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency.
Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate “forgetting” and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency. Code will be released upon publication.
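
A minimal sketch of plain classifier-free guidance, the mechanism the paper extends to spatial-control inputs; the control extension itself (and softmax truncation) is not reproduced here, and the function and shapes below are illustrative.

```python
import torch

def cfg_combine(pred_uncond: torch.Tensor, pred_cond: torch.Tensor, scale: float = 3.0) -> torch.Tensor:
    """Standard CFG: push the model output away from the unconditional branch."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

guided = cfg_combine(torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32))
```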

URL: https://openreview.net/forum?id=loT6xhgLYK

---

Title: LTSM-Bundle: A Toolbox and Benchmark on Large Language Models for Time Series Forecasting

Abstract: Time Series Forecasting (TSF) has long been a challenge in time series analysis. Inspired by the success of Large Language Models (LLMs), researchers are now developing Large Time Series Models (LTSMs)—universal transformer-based models that use autoregressive prediction to improve TSF. However, training LTSMs on heterogeneous time series data poses unique challenges, including diverse frequencies, dimensions, scalability, and patterns across datasets. Recent efforts have studied and evaluated various design choices aimed at enhancing LTSM training and generalization capabilities. However, these design choices are typically studied in isolation and are not compared collectively. In this work, we introduce LTSM-Bundle, a comprehensive toolbox and benchmark for training LTSMs, spanning pre-processing techniques, model configurations, and dataset configurations. It modularizes and benchmarks LTSMs along multiple dimensions, encompassing prompting strategies, tokenization approaches, training paradigms, base model selection, data quantity, and dataset diversity. Furthermore, we combine the most effective design choices identified in our study. Empirical results demonstrate that this combination achieves superior zero-shot and few-shot performances compared to state-of-the-art LTSMs and traditional TSF methods on benchmark datasets.

URL: https://openreview.net/forum?id=rLTEpTYeiI

---

Title: Inverse Scaling in Test-Time Compute

Abstract: We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.

URL: https://openreview.net/forum?id=NXgyHW1c7M

---

Title: Higher Order Transformers With Kronecker-Structured Attention

Abstract: Modern datasets are increasingly high-dimensional and multiway, often represented as tensor-valued data with multi-indexed variables. While Transformers excel in sequence modeling and high-dimensional tasks, their direct application to multiway data is computationally prohibitive due to the quadratic cost of dot-product attention and the need to flatten inputs, which disrupts tensor structure and cross-dimensional dependencies.
We propose the Higher-Order Transformer (HOT), a novel factorized attention framework that represents multiway attention as sums of Kronecker products or sums of mode-wise attention matrices. HOT efficiently captures dense and sparse relationships across dimensions while preserving tensor structure. Theoretically, HOT retains the expressiveness of full high-order attention and allows complexity control via factorization rank.
Experiments on 2D and 3D datasets show that HOT achieves competitive performance in multivariate time series forecasting and image classification, with significantly reduced computational and memory costs. Visualizations of mode-wise attention matrices further reveal interpretable high-order dependencies learned by HOT, demonstrating its versatility for complex multiway data across diverse domains.
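
The factorization can be illustrated on a 2D grid: instead of one attention matrix over all $n_1 n_2$ flattened positions, two mode-wise attention matrices are applied in turn and their Kronecker product is never materialized. The sketch below uses illustrative shapes and is not the HOT implementation.

```python
import torch

n1, n2, d = 8, 10, 16
x = torch.randn(n1, n2, d)                          # tensor-valued input with two modes

q1, k1 = torch.randn(n1, d), torch.randn(n1, d)     # mode-1 queries/keys
q2, k2 = torch.randn(n2, d), torch.randn(n2, d)     # mode-2 queries/keys
A1 = torch.softmax(q1 @ k1.T / d**0.5, dim=-1)      # (n1, n1) mode-1 attention
A2 = torch.softmax(q2 @ k2.T / d**0.5, dim=-1)      # (n2, n2) mode-2 attention

# Equivalent to applying kron(A1, A2) to the flattened (n1*n2, d) input,
# but computed mode by mode without forming the (n1*n2) x (n1*n2) matrix.
out = torch.einsum("ij,jkd->ikd", A1, torch.einsum("kl,jld->jkd", A2, x))
```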

URL: https://openreview.net/forum?id=QN0aXcKFkT

---

Title: Feature-Based Belief Aggregation for Partially Observable Markov Decision Problems

Abstract: We consider a finite-state partially observable Markov decision problem (POMDP) with an infinite horizon and a discounted cost, and we propose a new method for computing a cost function approximation that is based on features and aggregation. In particular, using the classical belief-space formulation, we construct a related Markov decision problem (MDP) by first aggregating the unobservable states into feature states, and then introducing representative beliefs over these feature states. This two-stage aggregation approach facilitates the use of dynamic programming methods for solving the aggregate problem and provides additional design flexibility. The optimal cost function of the aggregate problem can in turn be used within an on-line approximation in value space scheme for the original POMDP. We derive a new bound on the approximation error of our scheme. In addition, we establish conditions under which the cost function approximation provides a lower bound for the optimal cost. Finally, we present a biased aggregation approach, which leverages an optimal cost function estimate to improve the quality of the approximation error of the aggregate problem.

URL: https://openreview.net/forum?id=Beg6xmckXY

---

Title: Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024.

URL: https://openreview.net/forum?id=E8c5iotM9t

---

Title: LCEN: A Nonlinear, Interpretable Feature Selection and Machine Learning Algorithm

Abstract: Interpretable models can have advantages over black-box models, and interpretability is essential for the application of machine learning in critical settings, such as aviation or medicine. In this work, we introduce the LASSO-Clip-EN (LCEN) algorithm for nonlinear, interpretable feature selection and machine learning modeling. LCEN is tested on a wide variety of artificial and empirical datasets, frequently creating more accurate, sparser models than other methods, including sparse, nonlinear methods. LCEN is robust against many issues typically present in datasets and modeling, including noise, multicollinearity, data scarcity, and hyperparameter variance. As a feature selection algorithm, LCEN matches or surpasses the thresholded elastic net (EN) but is 10-fold faster. LCEN for feature selection can also rediscover multiple physical laws from empirical data. As a machine learning algorithm, when tested on processes with no known physical laws, LCEN achieves better results than many other dense and sparse methods --- including being comparable to or better than ANNs on multiple datasets.
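
A rough sketch of the three stages the name suggests (a LASSO fit on nonlinearly expanded features, clipping of small coefficients, and an elastic net refit on the survivors), written under our own assumptions; the feature expansion, thresholds, and hyperparameters here are illustrative, and this is not the authors' algorithm.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso, ElasticNet

def lcen_like(X, y, clip=1e-2, alpha=0.01):
    Z = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # nonlinear features
    lasso = Lasso(alpha=alpha).fit(Z, y)
    keep = np.abs(lasso.coef_) > clip                  # "clip" stage: drop tiny coefficients
    if not keep.any():                                 # guard: keep at least the strongest feature
        keep = np.abs(lasso.coef_) == np.abs(lasso.coef_).max()
    final = ElasticNet(alpha=alpha).fit(Z[:, keep], y)
    return keep, final

X = np.random.rand(200, 3)
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.01 * np.random.randn(200)
selected, model = lcen_like(X, y)
```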

URL: https://openreview.net/forum?id=wmNucISPdl

---

Title: The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

Abstract: Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs. Code, dataset, and fine-tuned models will be made publicly available.

URL: https://openreview.net/forum?id=XPML8UGI04

---

Title: Graphons of Line Graphs

Abstract: We consider the problem of estimating graph limits, known as graphons, from observations of sequences of sparse finite graphs.
In this paper we show a simple method that can shed light on a subset of sparse graphs. The method involves mapping the original graphs to their \textit{line graphs}.
We show that graphs satisfying a particular property are sparse, but give rise to dense line graphs.
This property, the \textit{square-degree property}, enables us to apply results on graph limits of dense graphs to derive convergence.
In particular, star graphs satisfy the square-degree property resulting in dense line graphs and non-zero graphons of line graphs.
We demonstrate empirically that we can distinguish different numbers of stars (which are sparse) by the graphons of their corresponding line graphs, whereas in the original graphs the different numbers of stars all converge to the zero graphon due to sparsity.
Similarly, superlinear preferential attachment graphs give rise to dense line graphs almost surely. In contrast, dense graphs, including Erdős–Rényi graphs, make the line graphs sparse, resulting in the zero graphon.
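
The star-graph example above can be checked directly with networkx: a sparse star maps to a complete (dense) line graph, since every pair of star edges shares the hub.

```python
import networkx as nx

star = nx.star_graph(100)                          # 1 hub + 100 leaves, 100 edges (sparse)
L = nx.line_graph(star)                            # one node per edge of the star
print(L.number_of_nodes(), L.number_of_edges())    # 100 nodes, 100*99/2 = 4950 edges (dense)
```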

URL: https://openreview.net/forum?id=HtarQlJDZQ

---

Title: Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

Abstract: Multimodal sarcasm understanding is a high-order cognitive task. Although large language models (LLMs) have shown impressive performance on many downstream NLP tasks, growing evidence suggests that they struggle with sarcasm understanding.
In this paper, we propose Commander-GPT, a modular decision routing framework inspired by military command theory. Rather than relying on a single LLM's capability, Commander-GPT orchestrates a team of specialized LLM agents, where each agent is selectively assigned a focused sub-task such as context modeling or sentiment analysis. Their outputs are then routed back to the commander, which integrates the information and performs the final sarcasm judgment.
To coordinate these agents, we introduce three types of centralized commanders:
(1) a trained lightweight encoder-based commander (e.g., multi-modal BERT); (2) four small autoregressive language models serving as moderately capable commanders (e.g., DeepSeek-VL); and (3) two large LLM-based commanders (Gemini Pro and GPT-4o) that perform task routing, output aggregation, and sarcasm decision-making in a zero-shot fashion.
We evaluate Commander-GPT on the MMSD and MMSD 2.0 benchmarks, comparing five prompting strategies. Experimental results show that our framework achieves 4.4% and 11.7% improvement in F1 score over state-of-the-art (SoTA) baselines on average, demonstrating its effectiveness.

URL: https://openreview.net/forum?id=zRxRbBsqwE

---

Title: SIRE: SE(3) Intrinsic Rigidity Embeddings

Abstract: Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure -- highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can pave the way towards self-supervised learning of priors over geometry and motion rigidity from large-scale video data.

URL: https://openreview.net/forum?id=OZ9H0TOYMt

---

Title: Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Abstract: Language agents based on large language models (LLMs) have demonstrated great promise in automating web-based tasks. Recent work has shown that incorporating advanced planning algorithms, e.g., tree search, is advantageous over reactive planning for web agents. However, unlike simulated sandbox environments, real-world environments such as the web are rife with irreversible actions. This undermines the feasibility of backtracking, a cornerstone of (tree) search. Overly relying on test-time search also hurts efficiency. We advocate model-based planning for web agents that employs a world model to simulate and deliberate over the outcome of each candidate action before committing to one. We systematically explore this paradigm by: (1) Proposing a model-based planning framework, WebDreamer, which employs LLMs to serve as both world models and value functions; (2) Training specialized LLMs as world models with a scalable data synthesis pipeline. Empirical results demonstrate that WebDreamer achieves substantial performance improvements over reactive baselines. It is competitive with tree search in sandbox environments (VisualWebArena), while being - times more efficient, and also works effectively on real-world websites (Online-Mind2Web and Mind2Web-Live). Furthermore, our trained world model, Dreamer-7B, performs comparably to GPT-4o, highlighting the potential of specialized world models for efficient and effective planning in complex web environments. All code, models, and data will be publicly available upon acceptance.

URL: https://openreview.net/forum?id=c6l7yA0HSq

---

Title: Assortment of Attention Heads: Accelerating Federated PEFT with Head Pruning and Strategic Client Selection

Abstract: Parameter Efficient Fine-Tuning (PEFT) has become the de facto approach in adapting Large Language Models (LLMs) for downstream tasks in Natural Language Processing. However, its adoption in privacy-preserving distributed learning frameworks, such as Federated Learning (FL), remains relatively limited. This is mainly due to challenges specific to FL, such as resource-constrained devices and diverse data distributions among clients. In this paper, we propose an efficient method to perform PEFT within the FL framework for Multi-Head Attention (MHA) based language models. We address the challenges through head pruning, a novel head-specific weighted aggregation mechanism, and a client selection strategy. Head pruning minimizes training complexity within the clients, guided by the importance score computed based on the confidence of the attention head. Weighted aggregation of heads ensures the global model captures crucial updates from diverse clients, complementing our client selection strategy. We show results on the MultiNLI benchmark along with 20 Newsgroups, XL-Sum, and E2E NLG datasets. We use the MultiNLI dataset and T5-small model with LoRA as our PEFT method, attaining sparsity levels of up to 90\%, resulting in a communication advantage of up to 1.8x and a reduction in training OPs of 3.9x while maintaining the accuracy drop under 2\%.
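
A minimal sketch of confidence-guided head scoring and pruning of the kind the abstract describes; the exact importance score, the weighted aggregation, and the client selection used in the paper are not reproduced, and the confidence proxy below is an assumption.

```python
import torch

def head_importance(attn_probs: torch.Tensor) -> torch.Tensor:
    """attn_probs: (heads, queries, keys) attention distributions for one layer.
    Uses the max attention probability per query as a crude confidence proxy."""
    return attn_probs.max(dim=-1).values.mean(dim=-1)    # one score per head

def prune_heads(attn_probs: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    scores = head_importance(attn_probs)
    k = max(1, int(keep_ratio * scores.numel()))
    return scores.topk(k).indices                        # indices of heads to keep

kept = prune_heads(torch.softmax(torch.randn(12, 32, 32), dim=-1))
```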

URL: https://openreview.net/forum?id=WFpicZbAHe

---

Title: LAMBDA: Assessing Few-shot Lexical Analogical Reasoning in Language Models

Abstract: Analogical reasoning in language models is a critical yet underexplored aspect of their capability, particularly as models grow in scale and training data. This work investigates the limitations of current models in inferring latent relational structures, focusing on lexical analogies. We introduce LAMBDA, a novel dataset of 3,000 relation-hidden lexical analogies spanning synonyms, antonyms, and derivational transformations, designed for two-shot induction. Our empirical evaluation across eight models, including four open-source models from 0.1B to 17B parameters, along with four commercial models, reveals a wide performance gap, with accuracies ranging from 0.3% to 46.4%, highlighting the challenge of systematic generalization. By analyzing error patterns such as identity echo and semantic drift, we provide insights into model weaknesses. These findings suggest that large-scale pre-training alone does not guarantee strong relational reasoning abilities, offering a foundation for targeted improvements in model design. Broader implications point to the potential for refining training methodologies to enhance analogical abstraction in language models.

URL: https://openreview.net/forum?id=xsRxxm11pS

---

Title: Source-Free Domain Adaptation Using Neighborhood Signature–Based Prediction Matching

Abstract: Source-Free Domain Adaptation (SFDA) is an emerging area of research that aims to adapt a model trained on a labeled source domain to an unlabeled target domain without accessing the source data. Most of the successful methods in this area rely on the concept of neighborhood consistency but are prone to errors due to misleading neighborhood information. In this paper, we explore this approach from the point of view of learning more informative clusters and mitigating the effect of noisy neighbors using a concept called neighborhood signature, and demonstrate that adaptation can be achieved using just a single loss term tailored to optimize the similarity and dissimilarity of predictions of samples in the target domain. In particular, our proposed method outperforms existing methods in the challenging VisDA dataset while also yielding competitive results on other benchmark datasets.

URL: https://openreview.net/forum?id=deX5F5zxXU

---

Title: Unlocking the matrix form of the Quaternion Fourier Transform and Quaternion Convolution: Properties, connections, and application to Lipschitz constant bounding

Abstract: Linear transformations are ubiquitous in machine learning, and matrices are the standard way to represent them. In this paper, we study matrix forms of quaternionic versions of the Fourier Transform and Convolution operations. Quaternions offer a powerful representation unit; however, their use is complicated by difficulties that stem foremost from the non-commutativity of quaternion multiplication and from the fact that $\mu^2 = -1$ possesses infinitely many solutions in the quaternion domain. Handling of quaternionic matrices is consequently complicated in several aspects (definition of eigenstructure, determinant, etc.). Our research findings clarify the relation of the Quaternion Fourier Transform matrix to the standard (complex) Discrete Fourier Transform matrix, and the extent to which well-known complex-domain theorems carry over to quaternions. We focus especially on the relation of Quaternion Fourier Transform matrices to Quaternion Circulant matrices (representing quaternionic convolution), and the eigenstructure of the latter. A proof-of-concept application that makes direct use of our theoretical results is presented, in which we bound the Lipschitz constant of a Quaternionic Convolutional Neural Network. Code is publicly available at: \url{https://github.com/sfikas/quaternion-fourier-convolution-matrix}.

URL: https://openreview.net/forum?id=rhcpXTxb8j

---

Title: Step-Controlled DPO: Leveraging Stepwise Errors for Enhancing Mathematical Reasoning of Language Models

Abstract: Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to avoid reasoning errors and output accurate reasoning steps. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves competitive scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method. The code, models and data are released to inspire future work.

URL: https://openreview.net/forum?id=jp1AdIcKTj

---

Title: Improved CLIP Training Objective on Fine-Grained Tasks: Tackling False Negatives and Data Noise

Abstract: Despite its success in various image-text tasks like zero-shot classification on ImageNet, CLIP has been shown to overlook important details in images and captions. This limitation hinders its performance in fine-grained image-text matching tasks. In this paper, we approach this issue through the lens of false negatives (incorrect negative pairs) and data noise (i.e., mislabeled data), which can prevent the model from learning critical details, especially in downstream tasks with a limited number of classes.
To address this, we introduce a new loss term incorporating additional supervision to emphasize true negatives. Additionally, we modify the InfoNCE loss to mitigate the impact of data noise. We show that our new method is provably effective under fewer data assumptions than previous approaches, making it particularly suited to noisy multi-modal data. Using the counting task as an example and CLEVR-Count as the benchmark, we demonstrate the performance improvements achieved by our algorithm without requiring extra labeled data.
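
For context, the standard symmetric InfoNCE objective used by CLIP is sketched below; the paper's additional supervision term for true negatives and its noise-robust modification of InfoNCE are not reproduced.

```python
import torch
import torch.nn.functional as F

def clip_infonce(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                 # (batch, batch) pairwise similarities
    targets = torch.arange(len(img))                   # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = clip_infonce(torch.randn(16, 512), torch.randn(16, 512))
```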

URL: https://openreview.net/forum?id=btPILAvTyW

---
