Weekly TMLR digest for Feb 01, 2026


TMLR

Feb 1, 2026, 12:00:10 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification, J2C Certification: Retrospective Feature Estimation for Continual Learning

Nghia D. Nguyen, Hieu Trung Nguyen, Ang Li, Hoang Pham, Viet Anh Nguyen, Khoa D Doan

https://openreview.net/forum?id=9NnhVME4Q6

---


Survey Certification: A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Qihan Ren, Yiran Wu, Hongru WANG, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang

https://openreview.net/forum?id=CTr3bovS5F

---


J2C Certification: Classification of high-dimensional data with spiked covariance matrix structure

Yin-Jen Chen, Minh Tang

https://openreview.net/forum?id=6bQDtTbaQs

---


Accepted papers
===============


Title: Adapting Language Models to Produce Good Class Probabilities for Classification Tasks

Authors: Lautaro Estienne, Matias Vera, Elizabeth Fons, Elena Kochkina, Pablo Piantanida, Luciana Ferrer

Abstract: Large generative language models (GLM) provide a versatile tool for solving a wide variety of natural language processing tasks. GLM responses, though, are provided in the form of text, without an indication of the model's confidence in the answer. This limits the usability of these models in high-risk applications where decisions made based on an incorrect answer can have severe consequences. In this work, we focus on the problem of generating class posterior distributions for text classification tasks like sentiment, news category and intent classification. These posteriors can be used for decision making and as interpretable scores for the user. We show that the naive approach for computing posteriors based on the token posteriors produced by the GLM results in extremely poor posteriors. We then explore different adaptation approaches for improving the quality of posteriors, focusing on low-resource scenarios where a small amount of data is available for adaptation. We show that parameter-efficient supervised fine-tuning (SFT), while providing large gains in terms of decision quality, produces suboptimal posteriors due to overfitting. To address this problem, we propose an approach that combines SFT and post-hoc calibration (PHC) using a three-stage training strategy, improving the quality of both posteriors and categorical decisions.

URL: https://openreview.net/forum?id=VVneIp69GR
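[Digest note] The post-hoc calibration (PHC) step mentioned above is commonly instantiated as temperature scaling: divide the logits by a scalar T fitted on held-out data. A minimal sketch of that generic technique (not the authors' three-stage method; function names and the grid are illustrative):

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; larger T flattens the distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def nll(logit_rows, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, T)[y])
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature that minimizes NLL on held-out data."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(grid, key=lambda T: nll(logit_rows, labels, T))
```

An overconfident model typically yields a fitted T > 1, which flattens the posteriors without changing the argmax decision.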

---

Title: Scaling Gaussian Process Regression with Full Derivative Observations

Authors: Daniel Huang

Abstract: We present a scalable Gaussian Process (GP) method called DSoftKI that can fit and predict full derivative observations. It extends SoftKI, a method that approximates a kernel via softmax interpolation, to the setting with derivatives. DSoftKI enhances SoftKI's interpolation scheme by replacing its global temperature vector with local temperature vectors associated with each interpolation point. This modification allows the model to encode local directional sensitivity, enabling the construction of a scalable approximate kernel, including its first and second-order derivatives, through interpolation. Moreover, the interpolation scheme eliminates the need for kernel derivatives, facilitating extensions such as Deep Kernel Learning (DKL). We evaluate DSoftKI on synthetic benchmarks, a toy n-body physics simulation, standard regression datasets with synthetic gradients, and high-dimensional molecular force field prediction (100-1000 dimensions). Our results demonstrate that DSoftKI is accurate and scales to larger datasets with full derivative observations than previously possible.

URL: https://openreview.net/forum?id=fbonXp38r9

---

Title: Retrospective Feature Estimation for Continual Learning

Authors: Nghia D. Nguyen, Hieu Trung Nguyen, Ang Li, Hoang Pham, Viet Anh Nguyen, Khoa D Doan

Abstract: The intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which interferes with remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches often retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored direction for CL called Retrospective Feature Estimation (RFE). RFE learns to reverse feature changes by aligning the features from the current trained DNN backward to the feature space of the old task, where performing predictions is easier. This retrospective process utilizes a chain of small feature mapping networks called retrospector modules. Empirical experiments on several CL benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods, motivating further research into retrospective mechanisms as a principled alternative for mitigating catastrophic forgetting in CL. Code is available at: https://github.com/mail-research/retrospective-feature-estimation.

URL: https://openreview.net/forum?id=9NnhVME4Q6

---

Title: Theoretically Understanding Data Reconstruction Leakage in Federated Learning

Authors: Binghui Zhang, Zifan Wang, Meng Pang, Yuan Hong, Binghui Wang

Abstract: Federated learning (FL) is a collaborative learning paradigm that aims to protect data privacy. Unfortunately, recent works show FL algorithms are vulnerable to data reconstruction attacks (DRA), a serious type of privacy leakage. However, existing works lack a theoretical foundation on the extent to which devices' data can be reconstructed, and the effectiveness of these attacks cannot be compared fairly due to their unstable performance. To address this deficiency, we propose a theoretical framework for understanding DRAs on FL. Our framework bounds the data reconstruction error, and an attack's error bound, expressed via its Lipschitz constant, reflects its inherent effectiveness. We show that a smaller Lipschitz constant indicates a stronger attack. Under the framework, we theoretically compare the effectiveness of existing attacks (such as DLG and iDLG). We then empirically examine our results on multiple datasets, validating that the iDLG attack inherently outperforms the DLG attack.

URL: https://openreview.net/forum?id=1UfDXeYxwk

---

Title: $\texttt{C2-DPO}$: Constrained Controlled Direct Preference Optimization

Authors: Kavosh Asadi, Xingzi Xu, Julien Han, Ege Beyazit, Idan Pipano, Dominique Perrault-Joncas, Shoham Sabach, Mohammad Ghavamzadeh, Karim Bouyarmane

Abstract: Direct preference optimization (\texttt{DPO}) has emerged as a promising approach for solving the alignment problem in AI. In this paper, we make two counter-intuitive observations about \texttt{DPO}. First, we show that the \texttt{DPO} loss could be derived by starting from an alternative optimization problem that only defines the KL guardrail on in-sample responses, unlike the original RLHF problem where guardrails are defined on the entire distribution. Second, we prove a surprising property of this alternative optimization problem, where both the preferred and rejected responses tend to decrease in probability under its optimal policy, a phenomenon typically displayed by DPO in practice. To control this behavior, we propose a set of constraints designed to limit the displacement of probability mass between the preferred and rejected responses in the reference and target policies. The resulting algorithm, which we call Constrained Controlled DPO (\texttt{C2-DPO}), has a meaningful RLHF interpretation. By hedging against the displacement, \texttt{C2-DPO} provides practical improvements over vanilla \texttt{DPO} when aligning several language models using standard preference datasets.

URL: https://openreview.net/forum?id=7h5Ho9t5NL
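[Digest note] For context, the vanilla \texttt{DPO} loss that \texttt{C2-DPO} builds on scores a preference pair by the margin between policy and reference log-probability gaps. A minimal sketch of that standard loss (the paper's added displacement constraints are not reproduced here):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Vanilla DPO loss for one preference pair.

    logp_*     : policy log-prob of the preferred (w) / rejected (l) response
    ref_logp_* : the same quantities under the frozen reference policy
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; increasing the preferred response's relative log-probability drives the loss down.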

---

Title: Explaining with trees: interpreting CNNs using hierarchies

Authors: Caroline Mazini Rodrigues, Nicolas Boutry, Laurent Najman

Abstract: Challenges remain in providing interpretable explanations for neural network decision-making in explainable AI (xAI). Existing methods like Integrated Gradients produce noisy maps, and LIME, while intuitive, may deviate from the model’s internal logic. We introduce a framework that uses hierarchical segmentation techniques for faithful and interpretable explanations of Convolutional Neural Networks (CNNs). Our method constructs model-based hierarchical segmentations that maintain fidelity to the model’s decision-making process and allow both human-centric and model-centric segmentation. This approach can be combined with various xAI methods and provides multiscale explanations that help identify biases and improve understanding of neural network predictive behavior. Experiments show that our framework, xAiTrees, delivers highly interpretable and faithful model explanations, not only surpassing traditional xAI methods but also shedding new light on a novel approach to enhancing xAI interpretability.

URL: https://openreview.net/forum?id=zjyWZh5IiI

---

Title: SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Authors: Shrikant Kendre, Austin Xu, Honglu Zhou, Michael S Ryoo, Shafiq Joty, Juan Carlos Niebles

Abstract: Traditional evaluation metrics for textual and visual question answering—like ROUGE, METEOR, and Exact Match (EM)—focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.

URL: https://openreview.net/forum?id=lnpOvuQYih
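[Digest note] A composite lexical-semantic score of the general kind described can be sketched as a weighted blend of exact/keyword matching and a semantic similarity term. The weights and the `sem_sim` input below are illustrative stand-ins, not SMILE's actual formulation:

```python
def token_f1(pred, ref):
    """Token-level F1, a simple lexical-overlap score."""
    p, r = pred.lower().split(), ref.lower().split()
    common, used = 0, list(r)
    for t in p:
        if t in used:
            used.remove(t)
            common += 1
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)

def composite_qa_score(pred, ref, sem_sim, w_lex=0.3, w_sem=0.7):
    """Blend lexical exactness with a semantic similarity score.

    sem_sim is a placeholder for an embedding-based similarity in [0, 1];
    the weights here are arbitrary, for illustration only.
    """
    exact = 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0
    lex = max(exact, token_f1(pred, ref))
    return w_lex * lex + w_sem * sem_sim
```

The blend rewards answers that match lexically while still crediting paraphrases through the semantic term.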

---

Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Authors: Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu

Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

URL: https://openreview.net/forum?id=EPlpe3Fx1x

---

Title: Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention

Authors: Vaibhav Singh, Rahaf Aljundi, Eugene Belilovsky

Abstract: Foundational vision-language models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test-Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple teacher-student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.

URL: https://openreview.net/forum?id=GFrHdXzZwo

---

Title: Enhancing Concept Localization in CLIP-based Concept Bottleneck Models

Authors: Rémi Kazmierczak, Steve Azzolin, Goran Frehse, Eloïse Berthier, Gianni Franchi

Abstract: This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination—incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.

URL: https://openreview.net/forum?id=2xaOl0wluw

---

Title: Random Projection-Induced Gaussian Latent Features for Arbitrary Style Transfer

Authors: Weizhi Lu, Zhongzheng Li, Dongchen Gao, Mingrui Chen, Weiyu Li, jinglin zhang, Wei Zhang

Abstract: The feature transfer technique centered on mean and variance statistics, widely known as AdaIN, lies at the core of current style transfer research. This technique relies on the assumption that latent features for style transfer follow Gaussian distributions. In practice, however, this assumption is often hard to meet, as the features typically exhibit sparse distributions due to the significant spatial correlation inherent in natural images. To tackle this issue, we propose first performing a random projection on the sparse features, and then conducting style transfer on these projections. Statistically, the projections will satisfy or approximate Gaussian distributions, thereby better aligning with AdaIN's requirements and enhancing transfer performance. With the stylized projections, we can further reconstruct them back to the original feature space by leveraging compressed sensing theory, thereby obtaining the stylized features. The entire process constitutes a projection-stylization-reconstruction module, which can be seamlessly integrated into AdaIN without necessitating network retraining. Additionally, our proposed module can also be incorporated into another promising style transfer technique based on cumulative distribution functions, dubbed EFDM. This technique faces limitations when there are substantial differences in sparsity levels between content and style features. By projecting both types of features into dense Gaussian distributions, random projection can reduce their sparsity disparity, thereby improving performance. Experiments demonstrate that the aforementioned performance improvements can be achieved on existing state-of-the-art approaches.

URL: https://openreview.net/forum?id=XBu6iqHof8
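[Digest note] The AdaIN transfer this abstract builds on replaces the content features' mean and standard deviation with those of the style features. A one-channel sketch of that standard operation (the paper's projection-stylization-reconstruction module is not shown):

```python
import math
import statistics

def adain(content, style, eps=1e-5):
    """AdaIN on one feature channel: normalize the content statistics,
    then re-scale and shift with the style statistics."""
    mu_c, mu_s = statistics.fmean(content), statistics.fmean(style)
    sd_c, sd_s = statistics.pstdev(content), statistics.pstdev(style)
    return [(x - mu_c) / math.sqrt(sd_c**2 + eps) * sd_s + mu_s
            for x in content]
```

After the transfer, the channel's mean and variance match the style statistics, which is exactly the Gaussian assumption the paper argues random projection helps satisfy.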

---

Title: Privacy Profiles Under Tradeoff Composition

Authors: Paul Glasserman

Abstract: Privacy profiles and tradeoff functions are two frameworks for comparing differential privacy guarantees of alternative privacy mechanisms. We study connections between these frameworks. We show that the composition of tradeoff functions corresponds to a binary operation on privacy profiles we call their T-convolution. Composition of tradeoff functions characterizes group privacy guarantees, so the T-convolution provides a bridge for translating group privacy properties from one framework to the other. Composition of tradeoff functions has also been used to characterize mechanisms with log-concave additive noise; we derive a corresponding property based on privacy profiles. We also derive new bounds on privacy profiles for log-concave mechanisms based on new convexity properties. In developing these ideas, we characterize regular privacy profiles, which are privacy profiles for mutually absolutely continuous probability measures.

URL: https://openreview.net/forum?id=gRvKjXWacu

---

Title: Efficient Dilated Squeeze and Excitation Neural Operator for Differential Equations

Authors: Prajwal Chauhan, Salah Eddine Choutri, Saif Jabari

Abstract: Fast and accurate surrogates for physics-driven partial differential equations (PDEs) are essential in fields such as aerodynamics, porous media design, and flow control. However, many transformer-based models and existing neural operators remain parameter-heavy, resulting in costly training and sluggish deployment. We propose D-SENO (Dilated Squeeze-Excitation Neural Operator), a lightweight operator learning framework for efficiently solving a wide range of PDEs, including airfoil potential flow, Darcy flow in porous media, pipe Poiseuille flow, and incompressible Navier–Stokes vortical fields. D-SENO combines dilated convolution (DC) blocks with squeeze-and-excitation (SE) modules to jointly capture wide receptive fields and dynamics alongside channel-wise attention, enabling both accurate and efficient PDE inference. Carefully chosen dilation rates allow the receptive field to focus on critical regions, effectively modeling long-range physical dependencies. Meanwhile, the SE modules adaptively recalibrate feature channels to emphasize dynamically relevant scales. Our model trains up to $\approx 20\times$ faster than standard transformer-based models and neural operators, while also surpassing (or matching) them in accuracy across multiple PDE benchmarks. Ablation studies show that removing the SE modules leads to a slight drop in performance.

URL: https://openreview.net/forum?id=Xl942THEUa

---

Title: How Well Can Preference Optimization Generalize Under Noisy Feedback?

Authors: Shawn Im, Yixuan Li

Abstract: As large language models (LLMs) advance their capabilities, aligning these models with human preferences has become crucial. Preference optimization, which trains models to distinguish between preferred and non-preferred responses based on human feedback, is now a central component of LLM alignment. However, most existing works assume noise-free feedback, which is unrealistic given the inherent errors and inconsistencies in human judgments. This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions. In particular, we consider noise models that correspond to common real-world sources of noise, such as mislabeling and uncertainty. Unlike traditional analyses that assume convergence, our work focuses on finite-step preference optimization, offering new insights that are more aligned with practical LLM training. We describe how generalization decays under different types and rates of noise, as a function of the preference data distribution and the number of samples. Our analysis of noisy preference learning applies to a broad family of preference optimization losses such as DPO, IPO, and SLiC. Empirical validation on contemporary LLMs confirms the practical relevance of our findings, offering valuable insights for developing AI systems that align with human preferences.

URL: https://openreview.net/forum?id=8f5gRWwzDx

---

Title: KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities

Authors: Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C.K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang

Abstract: Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTegrated image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of recent text-to-image models, retrieval-augmented models, and unified understanding and generation models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analyses using carefully designed human evaluations, automatic metrics, and MLLMs as judges show that even advanced text-to-image and unified models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entities in creative text prompts. The dataset and evaluation code are publicly available at https://kitten-project.github.io.

URL: https://openreview.net/forum?id=wejaKS9Ps0

---

Title: Algorithmic Recourse in Abnormal Multivariate Time Series

Authors: Xiao Han, Lu Zhang, Yongkai Wu, Shuhan Yuan

Abstract: Algorithmic recourse provides actionable recommendations to alter unfavorable predictions of machine learning models, enhancing transparency through counterfactual explanations. While significant progress has been made in algorithmic recourse for static data, such as tabular and image data, limited research explores recourse for multivariate time series, particularly for reversing abnormal time series. This paper introduces Recourse in time series Anomaly Detection (RecAD), a framework for addressing anomalies in multivariate time series using backtracking counterfactual reasoning. By modeling the causes of anomalies as external interventions on exogenous variables, RecAD predicts recourse actions to restore normal status as counterfactual explanations, where the recourse function, responsible for generating actions based on observed data, is trained using an end-to-end approach. Experiments on synthetic and real-world datasets demonstrate its effectiveness.

URL: https://openreview.net/forum?id=kzxFc2Suo5

---

Title: Constant Rate Scheduling: A General Framework for Optimizing Diffusion Noise Schedule via Distributional Change

Authors: Shuntaro Okada, Kenji Doi, Ryota Yoshihashi, Hirokatsu Kataoka, Tomohiro Tanaka

Abstract: We propose a general framework for optimizing noise schedules in diffusion models, applicable to both training and sampling.
Our method enforces a constant rate of change in the probability distribution of diffused data throughout the diffusion process,
where the rate of change is quantified using a user-defined discrepancy measure.
We introduce three such measures, which can be flexibly selected or combined depending on the domain and model architecture.
While our framework is inspired by theoretical insights, we do not aim to provide a complete theoretical justification of how distributional change affects sample quality.
Instead, we focus on establishing a general-purpose scheduling framework and validating its empirical effectiveness.
Through extensive experiments, we demonstrate that our approach consistently improves the performance of both pixel-space and latent-space diffusion models,
across various datasets, samplers, and numbers of function evaluations ranging from 5 to 250.
In particular, when applied to both training and sampling schedules, our method achieves a state-of-the-art FID score of 2.03 on LSUN Horse 256$\times$256, without compromising mode coverage.

URL: https://openreview.net/forum?id=Pjq6kdvMBj
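[Digest note] The core scheduling idea, spacing timesteps so that a chosen discrepancy measure changes at a constant rate, can be sketched by inverting a cumulative discrepancy curve. The per-step discrepancies below are assumed to come from some user-defined measure, as in the paper; the selection rule itself is a generic illustration:

```python
def constant_rate_schedule(disc, n_steps):
    """Given discrepancies disc[i] between adjacent diffusion states,
    pick n_steps indices (n_steps >= 2) so that cumulative discrepancy
    is approximately equally spaced, i.e. a constant rate of change."""
    cum = [0.0]
    for d in disc:
        cum.append(cum[-1] + d)
    total = cum[-1]
    targets = [total * k / (n_steps - 1) for k in range(n_steps)]
    idx = []
    for t in targets:
        # index whose cumulative discrepancy is closest to the target
        idx.append(min(range(len(cum)), key=lambda j: abs(cum[j] - t)))
    return idx
```

With uniform discrepancies this reduces to uniform spacing; concentrated discrepancies pull more steps into the region where the distribution changes fastest.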

---

Title: CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

Authors: Takahiro Maeda, Jinkun Cao, Norimichi Ukita, Kris Kitani

Abstract: Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from poor time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4$\times$ faster than previous VAE methods and 30$\times$ faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models are available at \url{https://github.com/meaten/CacheFlow}.

URL: https://openreview.net/forum?id=icq5659pQt

---

Title: ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

Authors: Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun

Abstract: Autoregressive and diffusion models have achieved remarkable progress in language models and visual generation, respectively. We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer, that innovatively combines autoregressive and diffusion paradigms for continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement, as simple as applying a specially designed Skip-Causal Attention Mask on the standard diffusion transformer during training. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of KV-Cache. We validate the effectiveness of ACDiT on image, video, and text generation and show that ACDiT performs best among all autoregressive baselines under similar model scales on visual generation tasks. We also demonstrate that, benefiting from autoregressive modeling, pretrained ACDiT can be transferred in visual understanding tasks despite being trained with the generative objective. The analysis of the trade-off between autoregressive and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and sheds light on new avenues for unified models.

URL: https://openreview.net/forum?id=OuFNXESoCO

---

Title: \textsc{PGO-BEN}: Proxy-Guided Orthogonalization and Beta Ensembling for Few-Shot Domain-Incremental Learning

Authors: Samrat Mukherjee, Thivyanth Venkateswaran, Eric Nuertey Coleman, Luigi Quarantiello, Julio Hurtado, Vincenzo Lomonaco, Gemma Roig, Subhasis Chaudhuri, Biplab Banerjee

Abstract: Continual adaptation to evolving domains with minimal supervision is essential for real-world deployment of machine learning systems. We formalize this objective as \textbf{Few-Shot Domain-Incremental Learning (FSDIL)}, where a model must adapt to each new domain using only a few labeled samples while retaining prior knowledge without access to previous data. This setting mirrors practical constraints in domains such as autonomous driving and medical imaging, where annotations are expensive and data retention is restricted by privacy regulations.
Pre-trained vision–language models such as CLIP provide a strong initialization for FSDIL due to their transferable multi-modal representations. However, adapting CLIP incrementally under domain shifts remains challenging: few-shot updates often trigger \emph{catastrophic forgetting} and insufficient \emph{plasticity} across evolving distributions.
To address these challenges, we introduce \textbf{\textsc{PGO-BEn}} (\textit{Proxy-Guided Orthogonalization and Beta Ensembling})—a rehearsal-free framework that leverages CLIP’s semantic priors via prompt learning while preserving prior domain knowledge through two key mechanisms.
(1) \textbf{Proxy-Guided Orthogonalization (PGO):} identifies conflicts between current gradients and proxy representations of past knowledge, inferred from current samples, and projects conflicting updates into an orthogonal subspace to prevent knowledge degradation.
(2) \textbf{Beta Ensembling (BEn):} introduces a Beta-function-based temporal ensembling strategy that adaptively balances stability and plasticity, outperforming conventional exponential moving average (EMA) approaches in retaining early-domain knowledge.
We extensively evaluate \textsc{PGO-BEn} on three diverse benchmarks—\textbf{DomainNet}, \textbf{CORe50}, and \textbf{CDDB-Hard}—and demonstrate consistent improvements over state-of-the-art domain-incremental and few-shot learning methods across all supervision levels in this challenging setting.

URL: https://openreview.net/forum?id=jlb27FbHLv

---

Title: Differentially Private Conformal Prediction via Quantile Binary Search

Authors: Ogonnaya Michael Romanus, Roberto Molinari

Abstract: Differentially Private (DP) approaches have been widely explored and implemented for a broad variety of tasks, delivering corresponding privacy guarantees in these settings. While most of these DP approaches focus on limiting privacy leakage from training data, fewer approaches consider leakage when procedures involve \textit{calibration data}, which is common in uncertainty quantification through Conformal Prediction (CP). Since there are few approaches in this direction, in this work we deliver a general DP approach for CP that we call Private Conformity via Quantile Search (P-COQS). The proposed approach adapts an existing randomized binary search algorithm for computing DP quantiles in the calibration phase of CP, thereby guaranteeing privacy of the consequent prediction sets. This, however, comes at the price of marginally under-covering with respect to the desired $(1 - \alpha)$-level when using finite-sample calibration sets (although broad empirical results show that P-COQS generally targets the required level in the considered cases). Confirming properties of the adapted algorithm and quantifying the approximate coverage guarantees of the consequent CP, we conduct extensive experiments to examine the effects of privacy noise, sample size and significance level on the performance of P-COQS compared to existing alternatives. In addition, we empirically evaluate our approach on several benchmark datasets, including CIFAR-10, ImageNet and CoronaHack. Our results suggest that the proposed method is robust to privacy noise and performs favorably with respect to the current DP alternative in terms of \textit{empirical coverage}, \textit{efficiency}, and \textit{informativeness}. Specifically, the results indicate that P-COQS produces smaller conformal prediction sets while simultaneously targeting the desired coverage and privacy guarantees in all these experimental settings.

URL: https://openreview.net/forum?id=IK7tNOucJ3

---

Title: Dealing with Uncertainty in Contextual Anomaly Detection

Authors: Luca Bindini, Lorenzo Perini, Stefano Nistri, Jesse Davis, Paolo Frasconi

Abstract: Contextual anomaly detection (CAD) aims to identify anomalies in a target (behavioral) variable conditioned on a set of contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In this work, we propose a novel framework for CAD, the normalcy score (NS), that explicitly models both aleatoric and epistemic uncertainty. Built on heteroscedastic Gaussian process regression, our method regards the Z-score as a random variable, providing confidence intervals that reflect the reliability of the anomaly assessment. Through experiments on benchmark datasets and a real-world application in cardiology, we demonstrate that NS outperforms state-of-the-art CAD methods in both detection accuracy and interpretability. Moreover, the confidence intervals enable an adaptive, uncertainty-driven decision-making process, which may be very important in domains such as healthcare.
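The separation of the two uncertainty sources can be sketched in a toy form (the function name and the Gaussian interval construction are illustrative assumptions, not the paper's exact formulation):

```python
def normalcy_z(y, mu, aleatoric_std, epistemic_std):
    """Z-score of target y against the context-conditional prediction mu,
    plus a crude 95% interval reflecting epistemic uncertainty in the
    estimated mean: the Z-score itself becomes a random quantity."""
    z = (y - mu) / aleatoric_std
    half_width = 1.96 * epistemic_std / aleatoric_std
    return z, (z - half_width, z + half_width)
```

A wide interval around the Z-score signals that the anomaly assessment itself is unreliable, which is what enables the uncertainty-driven decision making mentioned above.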

URL: https://openreview.net/forum?id=yLoXQDNwwa

---

Title: On Uncertainty Calibration for Equivariant Functions

Authors: Edward Berman, Jacob Ginesin, Marco Pacini, Robin Walters

Abstract: Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally between equivariance and model calibration, have yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
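For readers unfamiliar with the calibration metrics involved, the binned expected calibration error (ECE) that the bounds concern can be computed as follows (a standard textbook sketch, not code from the paper):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| over confidence bins,
    weighted by the fraction of samples falling in each bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(acc - conf)
    return ece
```

A perfectly calibrated model (accuracy matches confidence in every bin) scores zero; systematic overconfidence, as induced by symmetry mismatch in the paper's analysis, inflates the metric.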

URL: https://openreview.net/forum?id=rxLUTPLBT3

---

Title: Disentangled Concept-Residual Models: Bridging the Interpretability–Performance Gap for Incomplete Concept Sets

Authors: Renos Zabounidis, Ini Oguntola, Konghao Zhao, Joseph Campbell, Woojun Kim, Simon Stepputtis, Katia P. Sycara

Abstract: Deploying AI in high-stakes settings requires models that are not only accurate but also interpretable and amenable to human oversight. Concept Bottleneck Models (CBMs) support these goals by structuring predictions around human-understandable concepts, enabling interpretability and post-hoc human intervenability. However, CBMs rely on a ‘complete’ concept set, requiring practitioners to define and label enough concepts to match the predictive power of black-box models. To relax this requirement, prior work introduced residual connections that bypass the concept layer and recover information missing from an incomplete concept set. While effective in bridging the performance gap, these residuals can redundantly encode concept information, a phenomenon we term \textbf{concept-residual overlap}. In this work, we investigate the effects of concept-residual overlap and evaluate strategies to mitigate it. We (1) define metrics to quantify the extent of concept-residual overlap in CRMs; (2) introduce complementary metrics to evaluate how this overlap impacts interpretability, concept importance, and the effectiveness of concept-based interventions; and (3) present \textbf{Disentangled Concept-Residual Models (D-CRMs)}, a general class of CRMs designed to mitigate this issue. Within this class, we propose a novel disentanglement approach based on minimizing mutual information (MI). Using CelebA, CIFAR100, AA2, CUB, and OAI, we show that standard CRMs exhibit significant concept-residual overlap, and that reducing this overlap with MI-based D-CRMs restores key properties of CBMs, including interpretability, functional reliance on concepts, and intervention robustness, without sacrificing predictive performance.

URL: https://openreview.net/forum?id=NKgNizwDa6

---

Title: A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Authors: Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Qihan Ren, Yiran Wu, Hongru WANG, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift --- from scaling static models to developing self-evolving agents --- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organizing the field around three foundational dimensions --- what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing more adaptive, capable, robust, and versatile agentic systems in both research and real-world deployments, and ultimately sheds light on the realization of Artificial Super Intelligence (ASI) where agents evolve autonomously and perform beyond human-level intelligence across a wide array of tasks.

URL: https://openreview.net/forum?id=CTr3bovS5F

---

Title: On the (linear) convergence of Generalized Newton Inexact ADMM

Authors: Zachary Frangella, Theo Diamandis, Bartolomeo Stellato, Madeleine Udell

Abstract: This paper presents GeNI-ADMM, a framework for large-scale composite convex optimization that facilitates theoretical analysis of both existing and new approximate ADMM schemes. GeNI-ADMM encompasses any ADMM algorithm that solves a first- or second-order approximation to the ADMM subproblem inexactly. GeNI-ADMM exhibits the usual O(1/t)-convergence rate under standard hypotheses and converges linearly under additional hypotheses such as strong convexity. Further, the GeNI-ADMM framework provides explicit convergence rates for ADMM variants accelerated with randomized linear algebra, such as NysADMM and sketch-and-solve ADMM, resolving an important open question on the convergence of these methods. This analysis quantifies the benefit of improved approximations and can aid in the design of new ADMM variants with faster convergence.
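The flavor of an inexact first-order subproblem solve can be shown on a scalar lasso instance, where the exact x-minimization is replaced by a single gradient step (an illustrative sketch under assumed step sizes, not the GeNI-ADMM algorithm itself):

```python
def soft_threshold(v, t):
    return (1.0 if v >= 0 else -1.0) * max(abs(v) - t, 0.0)

def linearized_admm_lasso(b, lam, rho=1.0, step=0.2, iters=200):
    """Solve min_x 0.5*(x - b)^2 + lam*|x| by ADMM on the split x = z,
    replacing the exact x-minimization with one gradient step on the
    augmented Lagrangian (a first-order, inexact subproblem solve)."""
    x = z = u = 0.0
    for _ in range(iters):
        grad = (x - b) + rho * (x - z + u)  # gradient of the x-subproblem
        x -= step * grad                    # inexact update: a single step
        z = soft_threshold(x + u, lam / rho)
        u += x - z
    return z
```

Despite never solving the x-subproblem exactly, the iterates converge to the soft-thresholded solution, which is the kind of behavior the GeNI-ADMM analysis quantifies at scale.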

URL: https://openreview.net/forum?id=GT3naIXBxK

---

Title: Video Prediction Transformers without Recurrence or Convolution

Authors: Yujin Tang, Lu Qi, Xiangtai Li, Chao Ma, Ming-Hsuan Yang

Abstract: Video prediction has witnessed the emergence of RNN-based models led by ConvLSTM, and CNN-based models led by SimVP. Following the significant success of ViT, recent works have integrated ViT into both RNN and CNN frameworks, achieving improved performance. While we appreciate these prior approaches, we raise a fundamental question: Is there a simpler yet more effective solution that can eliminate the high computational cost of RNNs while addressing the limited receptive fields and poor generalization of CNNs? How far can it go with a simple pure transformer model for video prediction? In this paper, we propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D Attention in the context of video prediction. Extensive experiments demonstrate that PredFormer delivers state-of-the-art performance across four standard benchmarks. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer as a strong baseline for real-world video prediction applications. The source code and trained models will be released to the public.

URL: https://openreview.net/forum?id=Afvhu9Id8m

---

Title: Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization

Authors: Caio de Próspero Iglesias, Kimberly Villalobos Carballo, Dimitris Bertsimas

Abstract: We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings, multiple candidate policies—arising from different modeling paradigms—exhibit heterogeneous performance across the covariate space, with no single policy uniformly dominating. We propose Prescribe-then-Select (PS), a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy for the observed covariates. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. Across two benchmark CSO problems—single-stage newsvendor and two-stage shipment planning—PS consistently outperforms the best single policy in heterogeneous regimes of the covariate space and converges to the dominant policy when such heterogeneity is absent. All the code to reproduce the results can be found at https://anonymous.4open.science/r/Prescribe-then-Select-TMLR.

URL: https://openreview.net/forum?id=lFEsAF2I7C

---

Title: Language Models are Symbolic Learners in Arithmetic

Authors: Chunyuan Deng, Zhiqi Li, Roy Xie, Ruidi Chang, Hanjie Chen

Abstract: The prevailing question in LMs performing arithmetic is whether these models learn to truly compute or simply master superficial pattern matching. In this paper, we argue for the latter, presenting evidence that LMs act as greedy symbolic learners, prioritizing the simplest possible shortcuts that fit the statistics of the dataset to solve arithmetic tasks. To investigate this, we introduce \textbf{subgroup induction}, a practical framework adapted from Solomonoff Induction (SI), one of the most powerful universal predictors. Our framework analyzes arithmetic problems by breaking them down into ``subgroups''—minimal mappings between a few input digits and a single output digit. Our primary metric, subgroup quality, measures the viability of these shortcuts. Experiments reveal a distinct U-shaped accuracy pattern in multi-digit multiplication: LMs quickly master the first and last output digits while struggling with those in the middle. We demonstrate this U-shape is not coincidental; it perfectly mirrors the quality of the simplest possible subgroups, those requiring the fewest input tokens. This alignment suggests a core learning mechanism: LMs first learn easy, low-token shortcuts and only incorporate more complex, multi-token patterns as training progresses. They do not learn the algorithm of multiplication but rather a hierarchy of increasingly complex symbol-to-symbol mappings. Ultimately, our findings suggest that the path to arithmetic mastery for LMs is not paved with algorithms, but with a cascade of simple, hierarchically-learned symbolic shortcuts. The code is at https://github.com/chili-lab/Symbolic-Arithmetic.
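The "subgroup" notion can be made concrete: the last output digit of a product is a function of just two input digits, while middle digits are not (a toy illustration of the framing, not the paper's code):

```python
def last_digit_subgroup(a, b):
    # The units digit of a*b is determined by the units digits alone:
    # a small, high-quality subgroup in the paper's terminology.
    return (a % 10) * (b % 10) % 10

# No analogous low-token shortcut exists for middle digits, since carries
# propagate information from every input position; for example,
# 12*13 = 156 and 22*23 = 506 share units digits (2, 3) yet differ in tens.
```

This asymmetry between boundary digits (cheap subgroups) and middle digits (no cheap subgroup) is exactly what the reported U-shaped accuracy pattern reflects.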

URL: https://openreview.net/forum?id=QSblPg1xUM

---

Title: High-Layer Attention Pruning with Rescaling

Authors: Songtao Liu, Peng Liu

Abstract: Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks across 27 datasets. The results consistently demonstrate that our method outperforms existing structured pruning methods. This improvement is particularly notable in generation tasks, where our approach significantly outperforms existing baselines. Code is available at \url{https://github.com/SongtaoLiu0823/HARP}.

URL: https://openreview.net/forum?id=jkPBIxYmWE

---

Title: Still Competitive: Revisiting Recurrent Models for Irregular Time Series Prediction

Authors: Ankitkumar Joshi, Milos Hauskrecht

Abstract: Modeling irregularly sampled multivariate time series is a persistent challenge in domains like healthcare and sensor networks. While recent works have explored a variety of complex learning architectures to solve the prediction problems for irregularly sampled time series, it remains unclear what the true benefits of some of these architectures are, and whether clever modifications of simpler and more efficient RNN-based algorithms are still competitive, i.e., on par with or even superior to these methods. In this work, we propose and study GRUwE: Gated Recurrent Unit with Exponential basis functions, which builds upon RNN-based architectures for observations made at irregular times. GRUwE supports both regression-based and event-based predictions in continuous time. GRUwE works by maintaining a Markov state representation of the time series that updates with the arrival of irregular observations. The Markov state update relies on two reset mechanisms: (i) observation-triggered reset to account for the new observation, and (ii) time-triggered reset that relies on learnable exponential decays, to support the predictions in continuous time. Our empirical evaluations across several real-world benchmarks on next-observation and next-event prediction tasks demonstrate that GRUwE can indeed achieve competitive or superior performance compared to recent state-of-the-art (SOTA) methods. Thanks to its simplicity, GRUwE offers compelling advantages: it is easy to implement, requires minimal hyper-parameter tuning effort, and significantly reduces the computational overhead in online deployment.
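The two reset mechanisms can be sketched in a few lines (the names and the exact gating form are illustrative assumptions; in GRUwE the decay rates and gates are learned, not fixed):

```python
import math

def time_reset(state, elapsed, decay_rates):
    """Time-triggered reset: each state unit relaxes toward zero under its
    own exponential decay as time passes with no observation."""
    return [s * math.exp(-r * elapsed) for s, r in zip(state, decay_rates)]

def observation_reset(state, obs, gate):
    """Observation-triggered reset: GRU-style gated convex update applied
    when an irregular observation arrives."""
    return [(1 - g) * s + g * o for s, o, g in zip(state, obs, gate)]
```

Between observations only `time_reset` acts, which is what lets the model emit predictions at arbitrary continuous time points.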

URL: https://openreview.net/forum?id=YLoZA77QzR

---

Title: Robust Conformal Prediction for Infrequent Classes

Authors: Jens-Michalis Papaioannou, Sebastian Jäger, Alexei Figueroa, David Stutz, Betty van Aken, Keno Bressem, Wolfgang Nejdl, Felix Gers, Alexander Löser, Felix Biessmann

Abstract: Many real-world classification tasks involve datasets with large and imbalanced label spaces, making class-specific uncertainty quantification particularly challenging. Conformal Prediction (CP) provides a model-agnostic framework which formally guarantees coverage, meaning that its prediction sets contain the true label with a user-defined probability (confidence level). However, standard class-conditional methods often fail when data is scarce for some classes. We propose a method that uses domain knowledge or label hierarchies to dynamically group semantically related classes to meet the desired coverage for a given confidence threshold. Our method maintains class-conditioned calibration when possible and provides group-conditioned guarantees where necessary. We evaluate our method on outcome diagnosis prediction, an important clinical task that not only benefits from robust uncertainty estimation but also presents a very imbalanced label distribution. We conduct experiments using three clinical datasets employing two medical taxonomies (ICD-10 and CCSR) and label spaces of varying sizes, reaching more than 1,000 classes. Our results show that the proposed approach successfully exploits the label hierarchy and consistently improves class-conditional coverage for infrequent diagnoses. By improving coverage for underrepresented classes, our method enhances the reliability and trustworthiness of predictive models. This improvement is especially valuable in clinical applications, where failure to detect rare but serious conditions can lead to harmful consequences.
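The fallback logic can be sketched as follows (a simplified illustration with hypothetical names and a fixed count threshold; in the paper the grouping is driven by the medical taxonomy and coverage requirements, not a hard-coded `min_n`):

```python
import math

def conformal_thresholds(cal_scores_by_class, parent, alpha=0.1, min_n=30):
    """Per-class conformal quantiles with a hierarchy fallback: classes with
    fewer than min_n calibration scores inherit the pooled quantile of their
    parent group, trading class-conditional for group-conditional coverage."""
    def finite_quantile(scores):
        n = len(scores)
        k = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
        return sorted(scores)[min(k, n) - 1]

    pooled = {}
    for c, scores in cal_scores_by_class.items():
        pooled.setdefault(parent[c], []).extend(scores)

    thresholds = {}
    for c, scores in cal_scores_by_class.items():
        if len(scores) >= min_n:
            thresholds[c] = finite_quantile(scores)       # class-conditional
        else:
            thresholds[c] = finite_quantile(pooled[parent[c]])  # group fallback
    return thresholds
```

Frequent classes keep their own calibrated threshold, while rare diagnoses borrow statistical strength from semantically related siblings.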

URL: https://openreview.net/forum?id=nJ4p8rh3Ig

---

Title: LZ Penalty: An information-theoretic repetition penalty for autoregressive language models

Authors: Tony A Ginart, Naveen Kodali, Jason Lee, Caiming Xiong, Silvio Savarese, John Emmons

Abstract: We introduce the Lempel-Ziv (LZ) penalty, a penalty specialized for reducing degenerate repetitions in autoregressive language models without loss of capability. The penalty is based on the codelengths in the LZ77 universal lossless compression algorithm. Through the lens of the prediction-compression duality, decoding with the LZ penalty has the interpretation of sampling from the residual distribution after removing the information that is highly compressible. We demonstrate the LZ penalty enables open-source reasoning models to operate with greedy decoding without loss of capability and without instances of degenerate repetition. In contrast, both the industry-standard frequency penalty and repetition penalty are ineffective, incurring degenerate repetition rates of up to 4% or more.
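The mechanism can be caricatured with a longest-match penalty (a toy sketch of the idea only; the actual penalty is derived from LZ77 codelengths, which this simplification does not reproduce):

```python
def lz_match_length(context, token):
    """Length of the longest context suffix that, extended by `token`,
    already occurs earlier in the context (an LZ77-style match): long
    matches mean the continuation is highly compressible."""
    for k in range(len(context), -1, -1):
        candidate = context[len(context) - k:] + [token]
        for i in range(len(context) - k):
            if context[i:i + k + 1] == candidate:
                return k + 1
    return 0

def lz_penalized_logits(logits, context, strength=0.5):
    """Subtract a penalty proportional to how compressible each candidate
    continuation is, discouraging degenerate repetition."""
    return {tok: lp - strength * lz_match_length(context, tok)
            for tok, lp in logits.items()}
```

Tokens that would extend a long repeated substring are pushed down, while novel tokens are untouched, which matches the "sampling from the residual distribution after removing compressible information" interpretation.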

URL: https://openreview.net/forum?id=vNzPB4YCHj

---

Title: Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess

Authors: Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson

Abstract: As humans seek to collaborate with, learn from, and better understand artificial intelligence systems, developing AIs that can accurately emulate individual decision-making becomes increasingly important. Chess, a long-standing AI benchmark with precise skill measurement, offers an ideal testbed for human-AI alignment. However, existing approaches to modeling human behavior require prohibitively large amounts of data from each individual, making them impractical for new or sparsely represented users. In this work, we introduce Maia4All, a framework designed to learn and adapt to individual decision-making styles efficiently, even with limited data. Maia4All achieves this through a two-stage optimization process: (1) an enrichment step, which bridges population and individual-level human behavior modeling with a prototype-enriched model, and (2) a democratization step, which leverages ability levels or user prototypes to initialize and refine individual embeddings with minimal data. Our experimental results show that Maia4All can accurately predict individual moves and profile behavioral patterns with high fidelity, establishing a new standard for personalized human-like AI behavior modeling in chess. Maia4All achieves individual human behavior modeling in chess with only 20 games, compared to the 5,000 games required previously, representing a significant improvement in data efficiency. Our work provides an example of how population AI systems can flexibly adapt to individual users using a prototype-enriched model as a bridge. This approach extends beyond chess, as shown in our case study on idiosyncratic LLMs, highlighting its potential for broader applications in personalized AI adaptation.

URL: https://openreview.net/forum?id=iw4kjcw319

---

Title: Classification of high-dimensional data with spiked covariance matrix structure

Authors: Yin-Jen Chen, Minh Tang

Abstract: We study the classification problem for high-dimensional data with $n$ observations on $p$ features where the $p \times p$ covariance matrix $\Sigma$ exhibits a spiked eigenvalue structure and the vector $\zeta$, given by the difference between the {\em whitened} mean vectors, is sparse. We analyze an adaptive classifier (adaptive with respect to the sparsity $s$) that first performs dimension reduction on the feature vectors prior to classification in the dimensionally reduced space, i.e., the classifier whitens the data, then screens the features by keeping only those corresponding to the $s$ largest coordinates of $\zeta$, and finally applies the Fisher linear discriminant on the selected features. Leveraging recent results on entrywise matrix perturbation bounds for covariance matrices, we show that the resulting classifier is Bayes optimal whenever $n \rightarrow \infty$ and $s \sqrt{n^{-1} \ln p} \rightarrow 0$. Notably, our theory also guarantees Bayes optimality for the corresponding quadratic discriminant analysis (QDA). Experimental results on real and synthetic data further indicate that the proposed approach is competitive with state-of-the-art methods while operating on a substantially lower-dimensional representation.
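The three-step procedure (whiten, screen, Fisher LDA) can be sketched for the diagonal-covariance special case, where whitening is coordinate-wise (an illustration with assumed names; the paper handles a general spiked $\Sigma$ via $\Sigma^{-1/2}$):

```python
import math

def screened_lda(x, mu0, mu1, var, s):
    """Whiten coordinate-wise (diagonal-covariance illustration), screen the
    s coordinates with the largest whitened mean difference, then apply the
    Fisher linear discriminant on the selected features."""
    sd = [math.sqrt(v) for v in var]
    w0 = [m / d for m, d in zip(mu0, sd)]
    w1 = [m / d for m, d in zip(mu1, sd)]
    zeta = [b - a for a, b in zip(w0, w1)]        # whitened mean difference
    keep = sorted(range(len(zeta)), key=lambda j: -abs(zeta[j]))[:s]
    score = sum(zeta[j] * (x[j] / sd[j] - (w0[j] + w1[j]) / 2) for j in keep)
    return 1 if score > 0 else 0
```

Only the $s$ screened coordinates enter the discriminant, which is what yields the substantially lower-dimensional representation noted in the abstract.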

URL: https://openreview.net/forum?id=6bQDtTbaQs

---


New submissions
===============


Title: Pruning Close to Home: Distance from Initialization impacts Lottery Tickets

Abstract: The Lottery Ticket Hypothesis (LTH) states that there exist sparse subnetworks (called 'winning' Lottery Tickets) within dense randomly initialized networks that, when trained under the same regime, achieve similar or better validation accuracy than the dense network. It has been shown that for larger networks and more complex datasets, these Lottery Tickets cannot be found at random initialization, but instead require lightly pretrained weights. More specifically, the pretrained weights need to be stable to SGD noise, but calculating this metric involves an expensive procedure. In this paper, we take a closer look at certain training hyperparameters that influence SGD noise throughout optimization. We show that by smart hyperparameter selection we can forego the pretraining step and still find winning tickets in various settings. We term these hyperparameters early-stable, as networks trained with them become stable to SGD noise early during training, and discover that the tickets they produce exhibit remarkable generalization properties. Finally, we hypothesize that a larger Learning Distance negatively impacts generalization of the resulting sparse network under iterative pruning, and devise an experiment to show this.

URL: https://openreview.net/forum?id=caBhOKTD6Z

---

Title: CUDA: Capturing Uncertainty and Diversity in Preference Feedback Augmentation

Abstract: Preference-based Reinforcement Learning (PbRL) effectively addresses reward design challenges in Reinforcement Learning and facilitates human-AI alignment by enabling agents to learn human intentions. However, optimizing PbRL critically depends on abundant, diverse, and accurate human feedback, which is costly and time-consuming to acquire. Existing feedback augmentation methods aim to alleviate the scarcity of human preference feedback. However, they often neglect diversity, primarily generating feedback for high-confidence trajectory pairs with extreme differences. This approach leads to a biased augmented set that incompletely represents human preferences. To overcome this, we introduce Capturing Uncertainty and Diversity in preference feedback Augmentation (CUDA), a novel approach that comprehensively considers both uncertainty and diversity. CUDA enhances augmentation by employing ensemble-based uncertainty estimation for filtering and extracting feedback from diverse clusters via bucket-based categorization. These two mechanisms enable CUDA to obtain diverse and accurate augmented feedback. We evaluate CUDA on MetaWorld and DMControl offline datasets, demonstrating significant performance improvements over various offline PbRL algorithms and existing augmentation methods across diverse scenarios.

URL: https://openreview.net/forum?id=KWENSE1tC4

---

Title: Grothendieck Graph Neural Networks Framework: An Algebraic Platform for Crafting Topology-Aware GNNs

Abstract: Graph Neural Networks (GNNs) are almost universally built on a single primitive: the neighborhood. Regardless of architectural variations, message passing ultimately aggregates over neighborhoods, which intrinsically limits expressivity and often yields power no stronger than the Weisfeiler–Lehman (WL) test. In this work, we challenge this primitive. We introduce the Grothendieck Graph Neural Networks (GGNN) framework, which provides a strict algebraic extension of neighborhoods to covers, and in doing so replaces neighborhoods as the fundamental objects of message passing. Neighborhoods and adjacency matrices are recovered as special cases, while covers enable a principled and flexible foundation for defining topology-aware propagation schemes.
GGNN formalizes covers and systematically translates them into matrices, analogously to how adjacency matrices encode neighborhoods, enabling both theoretical analysis and practical implementation. Within this framework, we introduce the cover of sieves, inspired by category theory, which captures rich topological structure. Based on this cover, we design Sieve Neural Networks (SNN), a canonical fixed-cover instantiation that generalizes the adjacency matrix. Experiments show that SNN achieves zero failures on challenging graph isomorphism benchmarks (SRG, CSL, BREC) and substantially improves topology-aware evaluation via a controlled label-propagation probe. These results establish GGNN as a principled foundational framework for replacing neighborhoods in GNNs.

URL: https://openreview.net/forum?id=oD3qWXgB4e

---

Title: Bridging Efficiency and Adaptability: Continual Learning of MLPs on Class-Incremental Graphs

Abstract: Compared to static graphs, class-incremental graphs place higher demands on inference latency to support timely predictions for newly emerged node classes, especially in latency-sensitive applications. However, the high inference cost of Graph Neural Networks (GNNs) limits their scalability and motivates GNN-to-MLP distillation, which transfers knowledge from a GNN to a Multi-Layer Perceptron (MLP) to enable graph-free, low-latency inference. Yet, existing efforts focus on static graphs. When directly applied to class-incremental graphs, they inevitably suffer from the high computational cost of frequent GNN updates and the MLP’s inability to retain knowledge of previously learned classes. To bridge efficiency and adaptability, we propose a novel framework featuring an asynchronous update paradigm between GNN and MLPs, allowing rapid adaptation to evolving data. The MLPs employ a progressive expansion strategy for continual adaptation and an energy-based routing mechanism for test-time inference. During GNN updates, knowledge from MLPs trained in the current cycle is distilled back into the GNN to preserve long-term knowledge. Experiments on real-world datasets demonstrate that our framework achieves superior performance on class-incremental graphs, effectively balancing adaptability to new data and inference efficiency.

URL: https://openreview.net/forum?id=3KYwHaKXcn

---

Title: Why Equivariant Networks Lose Information: Invariant Rings and the Role of Aggregation

Abstract: Equivariant neural networks exhibit fundamental expressivity limitations: rotation-equivariant networks collapse directional information to radial features, and matrix-equivariant networks show rank degeneracy. We explain these phenomena using classical invariant theory and prehomogeneous vector space (PVS) theory. For $\mathrm{SO}(3)$ on $\mathbb{R}^3$, the First Fundamental Theorem forces equivariant maps to be radial scalings; for $\mathrm{GL}(n) \times \mathrm{GL}(n)$ on matrices, PVS theory shows the invariant ring contains only constants. Our central finding is that aggregation, not depth, escapes these constraints: product representations $V^n$ have richer invariant rings with cross-invariants (e.g., dot products encoding angles) inaccessible to single-fiber processing. We connect this theory to modern architectures---SchNet, PaiNN, DimeNet, MACE---showing their body-order corresponds to which $V^n$ they access. Experiments confirm that $\mathrm{SO}(3)$- versus $\mathrm{O}(3)$-invariant networks exhibit categorically different expressivity on pseudoscalar targets ($R^2 = 1.00$ vs. $R^2 < 0$), and that cross-invariants enable learning angles while norm-only features cannot. These results provide design guidance: prioritize multi-body interactions over depth when expressivity is limited.
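The central claim, that aggregated cross-invariants like dot products carry angular information that per-vector norms cannot, is easy to verify numerically (an illustrative check, not the paper's code; `rotate_z` exercises one family of $\mathrm{SO}(3)$ rotations):

```python
import math

def rotate_z(v, theta):
    """Rotate a 3-vector about the z-axis by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Two pair configurations with identical per-vector norms but different angles:
u1, v1 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)   # orthogonal pair
u2, v2 = (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)   # parallel pair

# Norm-only (single-fiber) invariants cannot separate the configurations,
assert dot(u1, u1) == dot(u2, u2) and dot(v1, v1) == dot(v2, v2)
# but the aggregated cross-invariant dot(u, v) can, and it is rotation-invariant:
assert dot(u1, v1) != dot(u2, v2)
assert abs(dot(rotate_z(u1, 0.7), rotate_z(v1, 0.7)) - dot(u1, v1)) < 1e-12
```

This is the product-representation effect the abstract describes: the invariant ring of $V^n$ contains the angles, while that of a single fiber contains only radii.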

URL: https://openreview.net/forum?id=o1GqYCe2rS

---

Title: Plain Transformers Can be Powerful Graph Learners

Abstract: Transformers have attained outstanding performance across various modalities, owing to their simple but powerful scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers (GTs) have strayed far from plain Transformers, exhibiting major architectural differences either by integrating message-passing or incorporating sophisticated attention mechanisms. These divergences hinder the easy adoption of training advances for Transformers developed in other domains. Contrary to previous GTs, this work demonstrates that the plain Transformer architecture can be a powerful graph learner. To achieve this, we propose to incorporate three simple, minimal, and easy-to-implement modifications into the plain Transformer architecture to construct our Powerful Plain Graph Transformers (PPGT): (1) simplified $L_2$ attention for measuring the magnitude closeness among tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a simple MLP-based stem for graph positional encoding. Consistent with its theoretical expressivity, PPGT demonstrates noteworthy realized expressivity on the empirical graph expressivity benchmark, comparing favorably to more complicated alternatives such as subgraph GNNs and higher-order GNNs. Its empirical performance across various graph datasets also justifies the effectiveness of PPGT. This finding underscores the versatility of plain Transformer architectures and highlights their strong potential as a unified backbone for multimodal learning across language, vision, and graph domains.
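The first modification can be sketched generically as attention whose similarity is a negative squared $L_2$ distance rather than a dot product (an illustration of $L_2$-similarity attention only; the paper's exact parameterization and the adaptive RMS normalization are not reproduced here):

```python
import math

def l2_attention(queries, keys, values, tau=1.0):
    """Attention where similarity is the negative squared L2 distance between
    query and key, so closeness in magnitude matters, followed by a standard
    softmax over keys."""
    out = []
    for q in queries:
        sims = [-sum((a - b) ** 2 for a, b in zip(q, k)) / tau for k in keys]
        m = max(sims)                       # stabilize the softmax
        ws = [math.exp(s - m) for s in sims]
        z = sum(ws)
        out.append([sum(w * v[d] for w, v in zip(ws, values)) / z
                    for d in range(len(values[0]))])
    return out
```

Unlike dot-product attention, two tokens with large but opposite activations are dissimilar here, which is the "magnitude closeness" property the abstract highlights.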

URL: https://openreview.net/forum?id=bEmDvP0fdv

---

Title: Evaluating the Reversal Curse in Model Editing

Abstract: Large language models (LLMs) are prone to hallucinate unintended text due to false or outdated knowledge. Since retraining LLMs is resource intensive, there has been growing interest in model editing. Despite the emergence of benchmarks and approaches, these unidirectional editing and evaluation protocols have failed to examine the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation of whether edited LLMs can recall the edited knowledge bidirectionally. A metric of reversibility is introduced, and a benchmark dubbed Bidirectional Assessment for Knowledge Editing (BAKE) is constructed to evaluate whether post-edited models can recall the edited knowledge in the direction opposite to that of editing. Experimental results show that while most editing methods can accurately recall edited facts along the direction of modification, they exhibit substantial systematic deficiencies when evaluated in the reverse direction. Our findings also reveal that in-context learning (ICL) can mitigate the reversal curse to a certain extent.

URL: https://openreview.net/forum?id=jAHwodCUxP

---

Title: Stronger Approximation Guarantees for Non-Monotone $\gamma$-Weakly DR-Submodular Maximization

Abstract: We study the maximization of nonnegative, non-monotone $\gamma$-weakly diminishing-returns (DR) submodular functions over down-closed convex bodies. The weakly DR model relaxes classical diminishing returns by allowing marginal gains to decay up to a multiplicative factor $\gamma \in (0,1]$, capturing a broad class of objectives that interpolate between monotone and fully non-monotone DR submodularity. Existing methods in this regime achieve guarantees that deteriorate rapidly as $\gamma$ decreases and fail to recover the best known bounds in the fully DR case.

We develop a $\gamma$-aware algorithmic framework that combines a Frank--Wolfe guided measured continuous greedy procedure with a $\gamma$-weighted double-greedy method. Our analysis explicitly accounts for the asymmetric structure induced by weak diminishing returns, yielding $\gamma$-dependent progress certificates that remain valid across the entire weakly DR spectrum. As a result, we obtain an approximation guarantee that strictly improves upon the baseline $\gamma e^{-\gamma}$ for all $\gamma \in (0,1)$ and recovers the current best constant $0.401$ when $\gamma = 1$. The proposed algorithms are projection-free, use only first-order information and linear optimization oracles, and run in polynomial time.

URL: https://openreview.net/forum?id=yS78Cb1CnX

---

Title: Echo-GAT: Debiasing Graph Attention with Echo Nodes and Degree Diversity for Heterophilic Graphs

Abstract: Attention mechanisms have become a de facto standard for enhancing the expressivity of deep learning models, achieving remarkable success on graph data. Recent studies have shown that attention-based graph neural networks (GNNs) often perform poorly on heterophilic graphs and have attributed this degradation primarily to low levels of homophily. In contrast to this prevailing explanation, we find that on heterophilic graphs, under standard graph attention mechanisms, node-level homophily shows only a weak correlation with prediction accuracy, and nodes with lower homophily ratios can even achieve higher accuracy on average. These observations suggest that homophily alone is insufficient to explain the failure of graph attention. In this work, we show that standard graph attention networks exhibit a systematic performance imbalance across nodes with different levels of degree diversity, favoring structurally inhomogeneous nodes (i.e., those whose degrees diverge significantly from their neighbors'). To mitigate this bias, we propose a graph attention optimization framework that integrates augmented feature attention and a degree diversity-aware attention score to reduce node-level structural bias. Experiments show that the proposed method consistently outperforms strong GAT variants and state-of-the-art heterophily-oriented GNNs. Moreover, it maintains stable performance gains across nodes with varying heterophily levels, demonstrating its effectiveness on diverse graph structures.

URL: https://openreview.net/forum?id=mj1UMx1sAL

---

Title: Networked Communication for Decentralised Agents in Mean-Field Games

Abstract: Methods like multi-agent reinforcement learning struggle to scale with growing population size. Mean-field games (MFGs) are a game-theoretic approach that can circumvent this by finding a solution for an abstract infinite population, which can then be used as an approximate solution for the $N$-agent problem. However, classical mean-field algorithms usually only work under restrictive conditions. We take steps to address this by introducing networked communication to MFGs, in particular to settings that use a single, non-episodic run of $N$ decentralised agents to simulate the infinite population, as is likely to be most reasonable in real-world deployments. We prove that our architecture's sample guarantees lie between those of earlier theoretical algorithms for the centralised- and independent-learning architectures, varying depending on network structure and the number of communication rounds. However, the sample guarantees of the three theoretical algorithms do not actually result in practical convergence times. We thus contribute practical enhancements to all three algorithms, allowing us to present their first empirical demonstrations. We then show that in practical settings where the theoretical hyperparameters are not observed, giving fewer loops but poorer estimation of the Q-function, our communication scheme still respects the earlier theoretical comparison: it considerably accelerates learning over the independent case, which hardly seems to learn at all, and often performs similarly to the centralised case, while removing the restrictive assumption of the latter. We provide ablations and additional studies showing that our networked approach also has advantages over both alternatives in terms of robustness to update failures and to changes in population size.

URL: https://openreview.net/forum?id=7ALoJiEbO2

---

Title: Not All CAMs Are Complete: Completeness as the Key to Faithfulness

Abstract: Although input-gradient techniques have evolved to mitigate the challenges associated with raw gradients, modern gradient-weighted CAM approaches still rely on vanilla gradients, which are inherently susceptible to saturation phenomena. Although recent enhancements have incorporated counterfactual gradient strategies as a mitigating measure, these local explanation techniques still exhibit a lack of sensitivity to their baseline parameter. Our work proposes a gradient-weighted CAM augmentation that tackles both the saturation and sensitivity problems by reshaping the gradient computation, incorporating two well-established, theoretically grounded approaches: Expected Gradients and kernel smoothing. By revisiting the original formulation as the smoothed expectation of the perturbed integrated gradients, one can concurrently construct more faithful, localized and robust explanations that minimize infidelity. Through fine modulation of the perturbation distribution, it is possible to regulate the complexity of the explanation, selectively discriminating stable features. Unlike recent works, our technique, Expected Grad-CAM, exclusively optimizes the gradient computation, and is purposefully designed as an enhanced substitute for the foundational Grad-CAM algorithm and any method built upon it. Quantitative and qualitative evaluations have been conducted to assess the effectiveness of our method.
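The Expected Gradients ingredient can be sketched in miniature. The toy below implements plain Expected Gradients for a differentiable function with a known gradient; `f_grad` and `baselines` are illustrative names, and the kernel-smoothing step of the full method is omitted.

```python
import numpy as np

def expected_gradients(f_grad, x, baselines, n_samples=2000, seed=0):
    # Expected Gradients: integrated gradients with the baseline drawn
    # from a reference distribution and the path position alpha sampled
    # uniformly, estimated by Monte Carlo.
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x)
    for _ in range(n_samples):
        b = baselines[rng.integers(len(baselines))]
        a = rng.uniform()
        total += (x - b) * f_grad(b + a * (x - b))
    return total / n_samples

# Toy model f(x) = sum(x_i^2), gradient 2x; with a zero baseline this
# recovers the integrated-gradients attribution x_i^2 (completeness:
# attributions sum to f(x) - f(baseline)).
x = np.array([1.0, 2.0])
attr = expected_gradients(lambda z: 2 * z, x, [np.zeros(2)])
assert np.allclose(attr, [1.0, 4.0], atol=0.2)
```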

URL: https://openreview.net/forum?id=NeeGBwXNs5

---

Title: Scene Layout Generation with Rectified Flow

Abstract: We introduce SLayR, Scene Layout Generation with Rectified flow, a novel transformer-based model for text-to-layout generation, which can integrate into a complete text-to-image pipeline. SLayR addresses a domain in which current text-to-image pipelines struggle: generating scene layouts that are of significant variety and plausibility, when the given prompt is ambiguous and does not provide constraints on the scene. In this setting, SLayR surpasses existing baselines, including LLMs. To accurately evaluate the layout generation, we introduce a new benchmark suite, including numerical metrics and a carefully designed repeatable human-evaluation procedure that assesses the plausibility and variety of images that are generated. We show that our method sets a new state of the art for achieving high plausibility and variety simultaneously, while being at least 3× smaller in the number of parameters.

URL: https://openreview.net/forum?id=YGsQxG5ubd

---

Title: A Survey on Behavioral Data Representation Learning

Abstract: Behavioral data, reflecting dynamic and complex interactions among entities, are pivotal for advancing multidisciplinary research and practical applications. Effective modeling and representation of behavioral data facilitate enhanced understanding, predictive analytics, and informed decision-making across diverse domains. This paper presents a comprehensive taxonomy of behavioral data representation learning methods, categorized by data modalities: tabular data, event sequences, dynamic graphs, and natural language. Within each category, we further dissect methods based on distinct modeling strategies and capabilities, and provide detailed reviews of their developments. Additionally, we extensively discuss significant downstream applications, datasets, and benchmarks, highlighting their roles in guiding methodological development and evaluating performance. To support further exploration in behavioral data representation learning, we release a continuously maintained repository at https://anonymous.4open.science/r/BehavioralDataSurvey that curates the methods and papers covered in this survey.

URL: https://openreview.net/forum?id=NSdV2qvglw

---

Title: Computationally Sufficient Reductions for Joint Multiple Matrix Estimators with Sparsity and Fusion

Abstract: We study a broad class of methods for the joint estimation of multiple sparse symmetric matrices that incorporates group and fusion penalties for borrowing strength across related matrices. This class includes extensions of popular methods for precision and covariance matrix estimation as well as PCA. We show that these methods can be unified through the lens of computational sufficiency, a recently proposed theory that can reveal hidden commonalities between seemingly disparate methods, yielding both theoretical insights into the underlying optimization problems and practical advantages in terms of computational efficiency. We derive a universal screening rule that applies simultaneously to all methods in this class, allowing us to reduce the search space to block diagonal matrices. This enables streamlined algorithms that drastically reduce the runtime, making the methods far more scalable and practical for high-dimensional data analysis.

URL: https://openreview.net/forum?id=KK9RHgSbdp

---

Title: Activation Functions and Normalization in Deep Continual Learning

Abstract: Deep learning models often struggle to remain adaptable in continual learning scenarios, where the data distribution changes over time. Beyond the well-known challenge of catastrophic forgetting, these models also face plasticity loss, characterized as a gradual decline in their ability to learn from future data. We study plasticity loss through the lens of activation and normalization interactions.
Through a large-scale empirical study, we evaluate 26 activation functions across three normalization strategies using ResNet-18 on the class-incremental CIFAR-100 benchmark. Our findings reveal that plasticity is not determined by any single design choice, but rather is influenced by the complex interaction between activation functions and normalization layers. We uncover a link between overfitting and plasticity loss, and show that simple yet effective training strategies, such as applying soft labels, learning-rate warm-up, and excluding affine normalization parameters from L2 regularization, can significantly slow the emergence of plasticity loss. Based on these findings, we offer additional recommendations for model design and training that keep networks inherently more performant and adaptable over long time horizons without any active component.

URL: https://openreview.net/forum?id=BtymUQyqtQ

---

Title: DeltaSM: Delta-Level Contrastive Learning with Mamba for Time-Series Representation

Abstract: Self-supervised contrastive learning offers a compelling route to transferable time-series representations in label-scarce settings.
Yet existing frameworks face a persistent trade-off between preserving fine-grained local dynamics at high temporal resolution and scaling to long sequences under practical compute constraints.
Convolutional encoders often require deep stacks to retain rapid transitions, whereas Transformers incur quadratic cost in sequence length, making high-resolution long-context training expensive.
Recent selective state-space models such as \emph{Mamba} enable linear-time ($O(L)$) sequence modeling and offer a promising path to mitigate this bottleneck.
However, their potential for \emph{general-purpose} time-series representation learning remains underexplored;
to our knowledge, prior Mamba-based contrastive learners have not been evaluated on the full UCR 2018 archive (128 datasets) under a unified protocol.
We propose \textbf{DeltaSM} (\emph{Delta-selective Mamba}), a self-supervised framework for univariate time series that reconciles efficiency and expressivity.
DeltaSM integrates (i) a lightweight Mamba backbone, (ii) token-budget-constrained training, and (iii) a $\Delta$-level contrastive objective that counterbalances Mamba's smoothing tendency.
Specifically, we apply curvature-adaptive weighting to first-order differences of the latent sequence, encouraging the encoder to emphasize informative local transitions without increasing computational cost.
At inference time, we further augment the learned time-domain embeddings with explicitly extracted frequency-domain descriptors from the raw signal to improve expressivity at negligible overhead.
Across all 128 UCR datasets, under \textbf{Protocol A}---a unified compute setting with a fixed number of optimization steps and a standardized downstream classifier---DeltaSM converges in seconds and achieves classification accuracy comparable to or better than strong baselines such as TS-TCC, TS2Vec, and TimesURL, using a single global configuration and a fixed pretraining-step budget (300 optimization updates per dataset).
On a focused subset that includes long-sequence datasets under \textbf{Protocol B}---where baselines are allowed their recommended training budgets and hyperparameters while DeltaSM remains fixed as in Protocol A---DeltaSM reduces pretraining time by up to $184\times$ while remaining competitive.
Extensive ablations confirm that curvature-based weighting is crucial for suppressing noise while capturing local dynamics, and that inference-time frequency integration provides complementary gains with minimal additional cost.
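The $\Delta$-level idea of weighting first-order differences of the latent sequence by a curvature-adaptive factor can be sketched as follows. This is a hypothetical illustration: the function name, the choice of absolute second differences as the curvature proxy, and the normalization are all assumptions, not the paper's exact rule.

```python
import numpy as np

def weighted_deltas(z, eps=1e-8):
    # z: latent sequence of shape (L, D). Take first-order differences
    # and weight them by a curvature proxy (absolute second differences),
    # so that sharp local transitions contribute more to the contrastive
    # objective than slowly varying, smoothed-out regions.
    d1 = np.diff(z, axis=0)                  # deltas, shape (L-1, D)
    curv = np.abs(np.diff(z, n=2, axis=0))   # curvature proxy, (L-2, D)
    w = curv / (curv.sum(axis=0, keepdims=True) + eps)
    return d1[:-1] * w                       # curvature-weighted deltas

z = np.random.default_rng(0).normal(size=(64, 16))
out = weighted_deltas(z)
assert out.shape == (62, 16)
```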

URL: https://openreview.net/forum?id=3S1kAqCiAS

---

Title: BG-HGNN: Toward Efficient Learning for Complex Heterogeneous Graphs

Abstract: Heterogeneous graphs—comprising diverse node and edge types connected through varied relations—are ubiquitous in real-world applications. Message-passing heterogeneous graph neural networks (HGNNs) have emerged as a powerful model class for such data. However, existing HGNNs typically allocate a separate set of learnable weights for each relation type to model relational heterogeneity. Despite their promise, these models are effective primarily on simple heterogeneous graphs with only a few relation types. In this paper, we show that this standard design inherently leads to parameter explosion (the number of learnable parameters grows rapidly with the number of relation types) and relation collapse (the model loses the ability to distinguish among different relations). These issues make existing HGNNs inefficient or impractical for complex heterogeneous graphs with many relation types. To address these challenges, we propose Blend&Grind-HGNN (BG-HGNN), a unified feature-representation framework that integrates and distills relational heterogeneity into a shared low-dimensional feature space. This design eliminates the need for relation-specific parameter sets and enables efficient, expressive learning even as the number of relations grows. Empirically, BG-HGNN achieves substantial gains over state-of-the-art HGNNs—improving parameter efficiency by up to 28.96 $\times$ and training throughput by up to 110.30 $\times$—while matching or surpassing their accuracy on complex heterogeneous graphs.

URL: https://openreview.net/forum?id=pkhDICLhy7

---

Title: Graph Unitary Message Passing

Abstract: Unitarity has emerged as a fundamental principle for efficient learning in deep neural networks, from parameter initialization to advanced optimizers, and has proven effective across settings including RNNs, CNNs, Transformers, and the Muon optimizer. However, imposing unitarity on the parameters is not enough to improve the learning efficiency of graph neural networks (GNNs), due to the instability arising from the graph structure through the message passing mechanism. This data-dependent inefficiency, also known as the oversquashing or oversmoothing problem, causes information from distant nodes to decay or node representations to become indistinguishable as the number of layers increases. Motivated by the success of unitarity in stabilizing neural network training, we propose a new graph-learning paradigm called Graph Unitary Message Passing (GUMP) that improves graph learning efficiency by applying unitary adjacency matrices for message passing. GUMP introduces a graph transformation algorithm that equips general graphs with unitary adjacency matrices while preserving the original connectivity, and implements Newton-Schulz iteration for efficient unitary projection. Extensive experiments demonstrate that GUMP achieves significant performance improvements over vanilla message passing methods across various graph learning tasks.

URL: https://openreview.net/forum?id=dvNMDkSBIA

---

Title: Conflict-Averse IL-RL: Resolving Gradient Conflicts for Stable Imitation-to-Reinforcement Learning Transfer

Abstract: Reinforcement Learning (RL) and Imitation Learning (IL) offer complementary capabilities: RL can learn high-performing policies but is data-intensive, whereas IL enables rapid learning from demonstrations but is limited by the demonstrator's quality. Combining them offers the potential for improved sample efficiency in learning high-performing policies, yet naïve integrations often suffer from two fundamental issues: (1) negative transfer, where optimizing the IL loss hinders effective RL fine-tuning, and (2) gradient conflict, where differences in the scale or direction of IL and RL gradients lead to unstable updates.
We introduce Conflict-Averse IL--RL (CAIR), a general framework that addresses both challenges by combining two key components: (1) Loss Manipulation: an adaptive annealing mechanism utilizing a convex combination of IL and RL losses. This mechanism dynamically increases the weight of the RL loss when its gradient aligns with the IL gradient and decreases it otherwise, mitigating instabilities during the transition from IL to RL. (2) Gradient Manipulation: to further reduce conflict, we incorporate CAGrad to compute a joint gradient that balances IL and RL objectives while avoiding detrimental interference.
Under standard trust-region assumptions, CAIR guarantees monotonic improvement in the expected return when the loss weights are annealed monotonically. Our empirical study evaluates CAIR on four sparse-reward MuJoCo domains, where pure RL algorithms typically struggle. Compared against relevant hybrid RL baselines, CAIR improves sample efficiency in three out of four domains and asymptotic performance in two, while performing comparably on the remainder. These trends are consistent across multiple combinations of IL (BC, DAgger) and RL (DDPG, SAC, PPO) methods, demonstrating the robustness of the framework.
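The loss-manipulation component can be pictured with a toy update rule. The rule below is illustrative only (the function name, step size, and sign-based update are assumptions, not the paper's exact schedule): it nudges the RL weight in the convex combination $(1-w)\,L_{IL} + w\,L_{RL}$ upward when the IL and RL gradients align, and downward when they conflict.

```python
import numpy as np

def anneal_rl_weight(w, g_il, g_rl, step=0.05):
    # Cosine similarity between the IL and RL gradients decides whether
    # the RL loss weight grows (aligned) or shrinks (conflicting).
    cos = g_il @ g_rl / (np.linalg.norm(g_il) * np.linalg.norm(g_rl) + 1e-8)
    return float(np.clip(w + step * np.sign(cos), 0.0, 1.0))

g_il = np.array([1.0, 0.0])
# Aligned gradients increase the RL weight; conflicting ones decrease it.
assert anneal_rl_weight(0.5, g_il, np.array([1.0, 0.1])) > 0.5
assert anneal_rl_weight(0.5, g_il, np.array([-1.0, 0.1])) < 0.5
```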

URL: https://openreview.net/forum?id=S98skK3FfD

---

Title: Decoupling Planning from Control: Stable Hierarchical RL with a Learned Metric Space

Abstract: Hierarchical Reinforcement Learning (HRL) offers a promising framework for solving complex, long-horizon tasks by decomposing them into manageable subproblems. However, conventional HRL methods suffer from a critical non-stationarity problem: the high-level planner's learning process is destabilized because the low-level policy is concurrently learning and constantly changing. This issue is particularly severe in resource-constrained systems, such as edge-cloud robotics, where the low-level controller must be a computationally simple, low-capacity model.
To address this challenge, we propose a novel HRL framework that resolves the non-stationarity issue by decoupling high-level planning from low-level control. The core of our approach is to reframe the planner's task: instead of learning the planner via RL on non-stationary transitions, it learns to navigate a stable "map" of the environment. This map is represented by a critic network trained to function as a metric space, where distances reflect optimal travel costs. Planning is then simplified to finding optimal subgoals that lie along the shortest path (geodesic) between the current state and the final goal. To further improve the accuracy of this map, we introduce a novel trajectory regularization loss that enforces geometric consistency along the agent's experienced trajectories.
Experiments demonstrate that our decoupled framework is highly robust. In scenarios with resource-constrained low-level policies, our method learns to solve complex tasks effectively where standard approaches fail. This result highlights our framework's suitability for real-world systems where low-level controllers have inherently limited computational capacity.

URL: https://openreview.net/forum?id=Kmtlv8X0BN

---

Title: The Out-of-sample Extensions of t-SNE: From Gradient Descent to Fixed-point Iteration Algorithms

Abstract: This paper addresses the out-of-sample extension of the t-distributed stochastic neighbor embedding (t-SNE), namely extending the embedding to data that were not considered in the training of the t-SNE. We demonstrate the ease of deriving the out-of-sample extension of t-SNE, thanks to the specific form of its objective. Several resolution strategies are devised, from gradient descent to fixed-point iteration algorithms. Moreover, we establish several theoretical findings that allow one to understand the underlying optimization mechanism of the fixed-point iteration, such as demonstrating that its repulsion-free variant corresponds to Newton's method, and providing several appealing properties, including connections with the mean shift algorithm and the resolution of the pre-image problem in machine learning. Experimental results on three well-known real data sets show the relevance and efficiency of the proposed out-of-sample methods, with the repulsion-free fixed-point iteration outperforming the other methods.

URL: https://openreview.net/forum?id=kYwq49F8Gt

---

Title: PLGC: Pseudo-Labeled Graph Condensation

Abstract: Large graph datasets make training graph neural networks (GNNs) computationally costly. Graph condensation methods address this by generating small synthetic graphs that approximate the original data. However, existing approaches rely on clean, supervised labels, which limits their reliability when labels are scarce, noisy, or inconsistent.
We propose Pseudo-Labeled Graph Condensation (PLGC), a self-supervised framework that constructs latent pseudo-labels from node embeddings and optimizes condensed graphs to match the original graph’s structural and feature statistics—without requiring ground-truth labels.
PLGC offers three key contributions: (1) A diagnosis of why supervised condensation fails under label noise and distribution shift. (2) A label-free condensation method that jointly learns latent prototypes and node assignments. (3) Theoretical guarantees showing that pseudo-labels preserve latent structural statistics of the original graph and ensure accurate embedding alignment.
Empirically, across node classification and link prediction tasks, PLGC achieves competitive performance with state-of-the-art supervised condensation methods on clean datasets and exhibits substantial robustness under label noise, often outperforming all baselines by a significant margin.
Our findings highlight the practical and theoretical advantages of self-supervised graph condensation in noisy or weakly-labeled environments. Code: https://anonymous.4open.science/r/PLGC-0B26/

URL: https://openreview.net/forum?id=TkpewrzsnJ

---

Title: Symmetric Divergence and Normalized Similarity: A Unified Topological Framework for Representation Analysis

Abstract: Topological Data Analysis (TDA) offers a principled, intrinsic lens for comparing neural representations. However, existing paired topological divergences (e.g., RTD) are limited by heuristic asymmetry and, more critically, unbounded scores that depend on sample size, hindering reliable cross-scenario benchmarking. To address these challenges, we develop a unified topological toolkit serving two complementary needs: fine-grained structural diagnosis and robust, standardized evaluation.
First, we complete the RTD framework by introducing \textbf{Symmetric Representation Topology Divergence (SRTD)} and its efficient variant \textbf{SRTD-lite}. Beyond resolving the theoretical asymmetry of prior variants, SRTD consolidates diagnostic information into a single, comprehensive cross-barcode signature. This allows for precise localization of structural discrepancies and serves as an effective optimization objective without the overhead of dual directional computations.
Second, to enable reliable benchmarking across heterogeneous settings, we propose \textbf{Normalized Topological Similarity (NTS)}. By measuring the rank correlation of hierarchical merge orders, NTS yields a scale-invariant metric bounded between -1 and 1, effectively overcoming the scale and sample-dependence of unnormalized divergences.
Experiments across synthetic and real-world deep learning settings demonstrate that our toolkit captures functional shifts in CNNs missed by geometric measures and robustly maps LLM genealogy even under distance saturation, offering a rigorous, topology-aware perspective that complements measures like CKA.
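The bounded, scale-invariant behavior of NTS follows from its use of rank correlation. A minimal sketch of a Spearman-style score over two merge orders (representing each merge order as a plain vector here is a simplification of the hierarchical structure, and ties are not handled):

```python
import numpy as np

def rank(x):
    # Ranks via argsort (adequate for a sketch; no tie handling).
    r = np.empty(len(x), dtype=float)
    r[np.argsort(x)] = np.arange(len(x))
    return r

def normalized_similarity(order_a, order_b):
    # Spearman rank correlation of the two merge orders: bounded in
    # [-1, 1] and invariant to the scale of the underlying distances.
    ra, rb = rank(order_a), rank(order_b)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / (np.linalg.norm(ra) * np.linalg.norm(rb)))

# Identical orderings score 1 regardless of scale; reversals score -1.
assert np.isclose(normalized_similarity([1, 2, 3], [10, 20, 30]), 1.0)
assert np.isclose(normalized_similarity([1, 2, 3], [3, 2, 1]), -1.0)
```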

URL: https://openreview.net/forum?id=pGgJ9qB2Io

---

Title: Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment

Abstract: Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices due to high computational, memory, and energy demands. Optimizing these models requires addressing three key challenges: acquiring task-specific data, fine-tuning for performance, and compressing models to accelerate inference while reducing resource demands. We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process to significantly reduce model size and complexity while preserving or enhancing task-specific performance. By leveraging data distillation, knowledge distillation via Kullback-Leibler divergence, Bayesian hyperparameter optimization, and the Muon optimizer, we achieve up to 2× memory compression (e.g., reducing a 6GB model to 3GB) and enable efficient inference for specialized tasks. Empirical results demonstrate superior performance on standard LLM benchmarks compared to GPTQ quantization alone, with the Muon optimizer notably enhancing fine-tuned models' resistance to accuracy decay during quantization.

URL: https://openreview.net/forum?id=tuY7MoLyDG

---

Title: Boosting Text Encoder for Personalized Text-to-Image Generation

Abstract: In this paper, we introduce TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models. Traditional personalization methods typically involve fine-tuning extensive portions of the model, leading to substantial storage requirements and slow convergence. In contrast, we propose selectively fine-tuning only the text encoder, significantly improving computational and storage efficiency. To preserve the original semantic integrity, we develop a novel causality-preserving adaptation mechanism. Additionally, lightweight adapters are employed to locally refine text embeddings immediately before their interaction with cross-attention layers, greatly enhancing the expressiveness of text embeddings with minimal computational overhead. Empirical evaluations across diverse concepts demonstrate that TextBoost achieves faster convergence and substantially reduces storage demands by minimizing the number of trainable parameters. Furthermore, TextBoost maintains comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods. We show that our proposed method offers an efficient, scalable, and practically applicable solution for high-quality text-to-image personalization, particularly beneficial in resource-constrained environments.

URL: https://openreview.net/forum?id=hiZzk1nHuV

---

Title: Confidence-Aware Explanations for 3D Molecular Graphs via Energy-Based Masking

Abstract: Graph Neural Networks (GNNs) have become a powerful tool for modeling molecular data. To improve their reliability and interpretability, various explanation methods aim to identify key molecular substructures, typically a subset of edges, influencing the decision-making process. Early work on 2D GNNs represented molecules as graphs with atoms as nodes and bonds as edges, ignoring 3D geometric configurations. While existing explanation methods perform well for 2D GNNs, there is a growing demand for 3D explanation techniques suited to 3D GNNs, which often surpass 2D GNNs in performance. Current explanation methods struggle with 3D GNNs due to the construction of edges based on distance cutoffs, leading to an exponential increase in the number of edges a molecular graph possesses. We identify key sources of explanation errors and decompose them into two components, derived from an upper bound between optimized masks and the true explanatory subgraph. This gap is especially significant in 3D GNNs because of the dense edge structures. To improve explanation fidelity, our method assigns two energy values to each atom, representing its contribution to predictions: one for importance and one for non-importance. The explanation model becomes more confident when the distinction between importance and non-importance is clearer. Analogous to physics, lower energy values indicate greater stability, enhancing confidence in the associated scenario. By optimizing these energy values to distinguish the two cases, we minimize both components of the error bound and identify a stable subgraph with high explanation fidelity. Experiments with various 3D backbone models on widely used datasets are conducted to validate our method's effectiveness in providing accurate and reliable explanations for 3D molecular graphs.

URL: https://openreview.net/forum?id=V4oLqP0vJf

---

Title: Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation

Abstract: Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. Reinforcement learning (RL)–based fine-tuning is a key mechanism for improvement, but its effectiveness is fundamentally governed by reward design. Despite its importance, the relationship between reward modeling and core LLM challenges—such as evaluation bias, hallucination, distribution shift, and efficient learning—remains poorly understood. This survey argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment, shaping what models learn, how they generalize, and whether their outputs can be trusted. We introduce Reasoning-Aligned Reinforcement Learning (RARL), a unifying framework that systematizes diverse reward paradigms for multi-step reasoning. Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges ranging from inference-time scaling to hallucination mitigation. We further critically evaluate existing benchmarks, highlighting vulnerabilities such as data contamination and reward misalignment, and outline directions for more robust evaluation. By integrating fragmented research threads and clarifying the interplay between reward design and fundamental reasoning capabilities, this survey provides a foundational roadmap for building reasoning models that are robust, verifiable, and trustworthy.

URL: https://openreview.net/forum?id=TDfrN1TbGH

---

Title: Local MDI+: Local Feature Importances for Tree-Based Models

Abstract: Tree-based ensembles such as random forests remain the go-to for tabular data over deep learning models due to their prediction performance and computational efficiency. These advantages have led to their widespread deployment in high-stakes domains, where interpretability is essential for ensuring trustworthy predictions. This has motivated the development of popular local (i.e. sample-specific) feature importance (LFI) methods such as LIME and TreeSHAP. However, these approaches rely on approximations that ignore the model’s internal structure and instead depend on potentially unstable perturbations. These issues are addressed in the global setting by MDI+, a global feature importance method which combines tree-based and linear feature importances by exploiting an equivalence between decision trees and least squares on a transformed node basis. However, the global MDI+ scores are not able to explain predictions when faced with heterogeneous individual characteristics. To address this gap, we propose Local MDI+ (LMDI+), a novel extension of the MDI+ framework that quantifies feature importances for each particular sample. Across twelve real-world benchmark datasets, LMDI+ outperforms existing baselines at identifying instance-specific predictive features, yielding an average 10% improvement in predictive performance when using only the selected features. It further demonstrates greater stability by consistently producing similar instance-level feature importance rankings across repeated model fits with different random seeds. Ablation experiments show that each component of LMDI+ contributes to these gains, and that the improvements extend beyond random forests to gradient boosting models. Finally, we show that LMDI+ enables local interpretability use cases by identifying closely matched counterfactuals for each classification benchmark and discovering homogeneous subgroups in a commonly-used housing dataset.

URL: https://openreview.net/forum?id=TcXidnGHpA

---

Title: Active Teacher Selection for Reward Learning

Abstract: Reward learning techniques enable machine learning systems to learn objectives from human feedback. A core limitation of these systems is their assumption that all feedback comes from a single human teacher, despite gathering feedback from large and heterogeneous populations. We propose the Hidden Utility Bandit (HUB) framework to model differences in teacher rationality, expertise, and costliness, formalizing the problem of learning from multiple teachers. We develop a variety of solution algorithms and apply them to two real-world domains: paper recommendation systems and COVID-19 vaccine testing. We find that Active Teacher Selection (ATS) algorithms outperform baselines by actively selecting when and which teacher to query. Our key contributions are 1) the HUB framework: a novel mathematical framework for modeling the teacher selection problem, 2) ATS: an active-learning-based algorithmic approach that demonstrates the utility of modeling teacher heterogeneity, and 3) a proof-of-concept application of the HUB framework and ATS approaches to model and solve multiple real-world problems with complex trade-offs between reward learning and optimization.
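
The active-selection idea can be sketched as a bandit over teachers. A minimal UCB-style illustration follows; the teacher accuracies, query costs, cost weight, and bonus form are assumptions for illustration, not the paper's HUB formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teachers: (probability of a correct label, cost per query).
teachers = [(0.95, 1.0), (0.70, 0.2), (0.55, 0.1)]
costs = np.array([c for _, c in teachers])
true_answer = 1

counts = np.zeros(len(teachers))
correct = np.zeros(len(teachers))

def pick_teacher(t):
    # Query every teacher once, then trade off estimated accuracy,
    # a UCB1 exploration bonus, and query cost (cost weight is assumed).
    for k in range(len(teachers)):
        if counts[k] == 0:
            return k
    acc = correct / counts
    bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
    return int(np.argmax(acc + bonus - 0.1 * costs))

for t in range(3000):
    k = pick_teacher(t)
    label = true_answer if rng.random() < teachers[k][0] else 1 - true_answer
    counts[k] += 1
    correct[k] += float(label == true_answer)
```

Over time the selector concentrates its queries on the teacher whose accuracy best justifies its cost, which is the behavior the ATS algorithms exploit.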

URL: https://openreview.net/forum?id=9OEy68av40

---

Title: Optimal Pattern Detection Tree for Symbolic Rule-Based Classification

Abstract: Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the optimal pattern detection tree (OPDT) for binary classification, a rule-based machine learning model built on a novel mixed integer programming formulation that extracts an optimal pattern from data. This optimization-based approach discovers a hidden underlying pattern in a dataset, when one exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.

URL: https://openreview.net/forum?id=RJ6eMDcDCv

---

Title: Mitigating Social Desirability Bias in Random Silicon Sampling

Abstract: Large Language Models (LLMs) are increasingly used to simulate population responses, a method known as "Silicon Sampling." However, responses to socially sensitive questions frequently exhibit Social Desirability Bias (SDB), diverging from real human data toward socially acceptable answers. Existing studies on social desirability bias in LLM-based sampling remain limited. In this work, we investigate whether minimal, psychologically grounded prompt wording can mitigate this bias and improve alignment between silicon and human samples. We conducted a study using data from the American National Election Study (ANES) on three LLMs from two model families: the open-source Llama-3.1 series and GPT-4.1-mini. We first replicate a baseline silicon sampling study, confirming the persistent Social Desirability Bias. We then test four prompt-based mitigation methods: reformulation (neutral, third-person phrasing), reverse coding (semantic inversion), and two meta-instructions, priming and preamble, which respectively encourage analytical reasoning and sincerity. Alignment with ANES is evaluated using Jensen-Shannon Divergence with bootstrap confidence intervals. Our results demonstrate that reformulated prompts improve alignment most effectively, reducing the concentration of probability mass on socially acceptable answers and yielding distributions closer to ANES. Reverse coding produced mixed results across eligible items, while priming and preamble encouraged response uniformity and showed no systematic benefit for bias mitigation. Our findings validate the efficacy of prompt-based framing controls in mitigating inherent Social Desirability Bias in LLMs, providing a practical path toward more representative silicon samples.
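
The evaluation metric named here, Jensen-Shannon Divergence with bootstrap confidence intervals, can be sketched directly. The answer distributions below are hypothetical; only the metric and resampling procedure follow the abstract:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (value lies in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical answer distributions over a 4-option survey item.
human = np.array([0.10, 0.25, 0.40, 0.25])
silicon = np.array([0.02, 0.08, 0.70, 0.20])  # mass piled on the "desirable" option

# Bootstrap a confidence interval by resampling simulated model answers.
rng = np.random.default_rng(0)
answers = rng.choice(4, size=2000, p=silicon)
boot = []
for _ in range(500):
    resample = rng.choice(answers, size=answers.size, replace=True)
    freqs = np.bincount(resample, minlength=4) / resample.size
    boot.append(js_divergence(human, freqs))
lo, hi = np.percentile(boot, [2.5, 97.5])
```

A mitigation method helps when it shifts the silicon distribution so that this divergence (and its confidence interval) moves toward zero.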

URL: https://openreview.net/forum?id=DmwJzi99Bi

---

Title: CF-HPO: Counterfactual Explanations for Hyperparameter Optimization

Abstract: Hyperparameter optimization (HPO) is a fundamental component of studies that use technologies such as machine learning and deep learning. Regardless of the field, almost every study requires hyperparameter optimization at some level. In general, applying HPO to a developed system improves its performance by optimizing multiple parameters. However, extant HPO methods do not provide information on why specific configurations are successful, what should not be done, or what could be improved. The present study proposes a novel approach to address this gap in the literature by introducing CF-HPO, a modular framework that generates counterfactual explanations for HPO results. CF-HPO answers questions such as "what potential improvements could be made" and "what settings should be avoided," and supports what-if analysis. These outputs can serve as a guide, especially for those who are not optimization experts. The proposed system has a modular design that supports different search strategies (UCB-driven, random, restart), allowing it to perform well during optimization while also producing counterfactual explanations once optimization ends. Experiments conducted on the YAHPO benchmark package yielded validation rates of 92.2% for neural networks and 60.4% for random forests. These findings reveal that counterfactual generability depends on the geometry of the performance surface rather than on dimensionality.
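
One simple way to realize the counterfactual idea is a nearest-better-neighbor search over the evaluated configurations: "what nearby setting would have done better?" The toy response surface, margin, and distance normalization below are assumptions for illustration, not CF-HPO's actual generation method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HPO history: configs = (learning_rate, depth) with scores
# drawn from a toy response surface (peak near lr = 1e-2, depth = 6).
configs = rng.uniform([1e-4, 1.0], [1e-1, 10.0], size=(200, 2))
scores = (-(np.log10(configs[:, 0]) + 2.0) ** 2
          - 0.1 * (configs[:, 1] - 6.0) ** 2
          + rng.normal(0.0, 0.05, 200))

def counterfactual(query_idx, margin=0.2):
    """Nearest evaluated config whose score beats the query by `margin`."""
    better = np.where(scores > scores[query_idx] + margin)[0]
    if better.size == 0:
        return None
    # Normalize each dimension before measuring distance.
    span = configs.max(axis=0) - configs.min(axis=0)
    dist = np.linalg.norm((configs[better] - configs[query_idx]) / span, axis=1)
    return int(better[np.argmin(dist)])

worst = int(np.argmin(scores))
cf = counterfactual(worst)   # "what small change would have helped?"
```

Whether such a counterfactual exists nearby depends on how the score varies around the query point, which is consistent with the abstract's observation that generability is governed by the geometry of the performance surface.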

URL: https://openreview.net/forum?id=f4eQmsYiNN

---

Title: ProPINN: Demystifying Propagation Failures in Physics Informed Neural Networks

Abstract: Physics-informed neural networks (PINNs) have raised high expectations for solving partial differential equations (PDEs), but their optimization usually faces thorny challenges due to the unique derivative-dependent loss function. By analyzing the loss distribution, previous research observed the propagation failure phenomenon of PINNs, intuitively described as the inability of correct supervision for model outputs to "propagate" from initial states or boundaries to the interior domain. Going beyond this intuitive understanding, this paper provides a formal and in-depth study of propagation failure and its root cause. Based on a detailed comparison with classical finite element methods, we ascribe the failure to the conventional single-point-processing architecture of PINNs and further prove that propagation failure is essentially caused by the lower gradient correlation of PINN models on nearby collocation points. Compared to superficial loss maps, this new perspective provides a more precise quantitative criterion for identifying where and why a PINN fails. The theoretical finding also inspires a new PINN architecture, named ProPINN, which can effectively unite the gradients of region points for better propagation. ProPINN reliably resolves PINN failure modes and significantly surpasses advanced Transformer-based models, with a 46% relative improvement.
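
The gradient-correlation criterion can be illustrated on a toy problem: compute per-point residual-loss gradients at two collocation points and compare their cosine similarity. The PDE, the ansatz, and the chosen points below are assumptions for illustration only, not the paper's setup:

```python
import numpy as np

def point_loss(theta, x):
    """Squared residual of the toy ODE u'' + u = 0 at a single
    collocation point x, for the ansatz u(x) = a * sin(b * x)."""
    a, b = theta
    residual = -a * b**2 * np.sin(b * x) + a * np.sin(b * x)
    return residual**2

def grad(theta, x, eps=1e-6):
    # Central finite differences stand in for autograd in this sketch.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (point_loss(tp, x) - point_loss(tm, x)) / (2 * eps)
    return g

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

theta = np.array([1.0, 2.0])
g_ref = grad(theta, 0.50)
cos_near = cosine(g_ref, grad(theta, 0.51))  # nearby collocation point
cos_far = cosine(g_ref, grad(theta, 2.00))   # distant collocation point
```

In the paper's framing, low gradient correlation between nearby collocation points is the signature of propagation failure; this diagnostic is what ProPINN's region-level gradient coupling is designed to raise.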

URL: https://openreview.net/forum?id=Wy54lrFd46

---

Title: Source-Optimal Training is Transfer-Suboptimal

Abstract: We prove that training a source model optimally for its own task is generically suboptimal when the objective is downstream transfer. We study the source-side optimization problem in L2-SP ridge regression and show a fundamental mismatch between the source-optimal and transfer-optimal source regularization: outside of a measure-zero set, $\tau_0^* \neq \tau_S^*$. We characterize the transfer-optimal source penalty $\tau_0^*$ as a function of task alignment and identify an alignment-dependent reversal: with imperfect alignment ($0<\rho<1$), transfer benefits from stronger source regularization, while in super-aligned regimes ($\rho>1$), transfer benefits from weaker regularization. Additionally, in isotropic settings, the decision of whether transfer helps is independent of the target sample size and noise, depending only on task alignment and source characteristics. We verify the linear predictions in a synthetic ridge regression experiment, and we present experiments on MNIST, CIFAR-10, and 20 Newsgroups as evidence that the source-optimal versus transfer-optimal mismatch persists in standard nonlinear transfer learning pipelines.
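
The L2-SP ridge setting studied here has a closed-form solution, which makes the source-then-transfer pipeline easy to sketch. Below is a minimal illustration; the data, the alignment construction, and the penalty strengths are assumptions for illustration, not the paper's tau_S* or tau_0*:

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_sp(X, y, tau, w_ref):
    """Closed-form minimizer of ||y - Xw||^2 + tau * ||w - w_ref||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + tau * np.eye(d), X.T @ y + tau * w_ref)

# Source task: learn w_s with ordinary ridge (penalty toward zero).
n, d = 200, 5
w_true_src = np.array([1.0, -0.5, 0.3, 0.0, 0.8])
Xs = rng.normal(size=(n, d))
ys = Xs @ w_true_src + rng.normal(0.0, 0.1, n)
w_s = ridge_sp(Xs, ys, tau=1.0, w_ref=np.zeros(d))  # source penalty (assumed)

# Target task: related but imperfectly aligned (0 < rho < 1).
w_true_tgt = 0.7 * w_true_src + 0.3 * rng.normal(size=d)
Xt = rng.normal(size=(30, d))
yt = Xt @ w_true_tgt + rng.normal(0.0, 0.1, 30)

# Transfer with L2-SP: penalize distance to the source weights w_s.
w_t = ridge_sp(Xt, yt, tau=5.0, w_ref=w_s)
```

Sweeping the source penalty and re-running the target fit reproduces the experiment the abstract describes: the penalty that minimizes source error is generically not the one that minimizes downstream transfer error.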

URL: https://openreview.net/forum?id=CMlpokFXfA

---

Title: MEDIC: Comprehensive Evaluation of Leading Indicators for LLM Safety and Utility in Clinical Applications

Abstract: While Large Language Models (LLMs) achieve superhuman performance on standardized medical licensing exams, these static benchmarks have become saturated and increasingly disconnected from the functional requirements of clinical workflows. To bridge the gap between theoretical capability and verified utility, we introduce MEDIC, a comprehensive evaluation framework establishing leading indicators across various clinical dimensions. Beyond standard question-answering, we assess operational capabilities using deterministic execution protocols and a novel Cross-Examination Framework (CEF), which quantifies information fidelity and hallucination rates without reliance on reference texts. Our evaluation across a heterogeneous task suite exposes critical performance trade-offs: we identify a significant knowledge-execution gap, where proficiency in static retrieval does not predict success in operational tasks such as clinical calculation or SQL generation. Furthermore, we observe a divergence between passive safety (refusal) and active safety (error detection), revealing that models fine-tuned for high refusal rates often fail to reliably audit clinical documentation for factual accuracy. These findings demonstrate that no single architecture dominates across all dimensions, highlighting the necessity of a portfolio approach to clinical model deployment.

URL: https://openreview.net/forum?id=pDQe9Icwb6

---
