Reproducibility Certification: Reproducibility study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"
Tijs Wiegman, Leyla Perotti, Viktória Pravdová, Ori Brand, Maria Heuss
https://openreview.net/forum?id=VCG6j3tcAA
---
Featured Certification: What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions
Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore Sumers, Jian-Qiao Zhu, Thomas L. Griffiths
https://openreview.net/forum?id=YyMACp98Kz
---
Survey Certification: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, PeiFeng Wang, silvio savarese, Caiming Xiong, Shafiq Joty
https://openreview.net/forum?id=SlsZZ25InC
---
Featured Certification: The Geometry of Phase Transitions in Diffusion Models: Tubular Neighbourhoods and Singularities
Manato Yaguchi, Kotaro Sakamoto, Ryosuke Sakamoto, Masato Tanabe, Masatomo Akagawa, Yusuke Hayashi, Masahiro Suzuki, Yutaka Matsuo
https://openreview.net/forum?id=ahVFKFLYk2
---
Reproducibility Certification: Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models
Dante Campregher, Yanxu Chen, Sander Hoffman, Maria Heuss
https://openreview.net/forum?id=1QrB5WSWOR
---
Accepted papers
===============
Title: Between Linear and Sinusoidal: Rethinking the Time Encoder in Dynamic Graph Learning
Authors: Hsing-Huan Chung, Shravan S Chaudhari, Xing Han, Yoav Wald, Suchi Saria, Joydeep Ghosh
Abstract: Dynamic graph learning is essential for applications involving temporal networks and requires effective modeling of temporal relationships.
Seminal attention-based models like TGAT and DyGFormer rely on sinusoidal time encoders to capture temporal dependencies between edge events. Prior work justified sinusoidal encodings because their inner products depend on the time spans between events, which are crucial features for modeling inter-event relations. However, sinusoidal encodings inherently lose temporal information due to their many-to-one nature and therefore require high dimensions. In this paper, we rigorously study a simpler alternative: the linear time encoder, which avoids temporal information loss caused by sinusoidal functions and reduces the need for high-dimensional time encoders. We show that the self-attention mechanism can effectively learn to compute time spans between events from linear time encodings and extract relevant temporal patterns. Through extensive experiments on six dynamic graph datasets, we demonstrate that the linear time encoder improves the performance of TGAT and DyGFormer in most cases. Moreover, the linear time encoder can lead to significant savings in model parameters with minimal performance loss. For example, compared to a 100-dimensional sinusoidal time encoder, TGAT with a 2-dimensional linear time encoder saves 43% of parameters and achieves higher average precision on five datasets. While both encoders can be used simultaneously, our study highlights the often-overlooked advantages of linear time features in modern dynamic graph models. These findings can positively impact the design choices of various dynamic graph learning architectures and eventually benefit temporal network applications such as recommender systems, communication networks, and traffic forecasting. The experimental code is available at: https://github.com/hsinghuan/dg-linear-time.git.
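For illustration, the contrast between the two encoders can be sketched in a few lines of PyTorch (a toy sketch only; the dimensions, initialization, and learnable parameters are placeholder choices, not the paper's exact setup):

    # Illustrative sketch: sinusoidal vs. linear time encoders of the kind used
    # in attention-based dynamic graph models such as TGAT. Placeholder choices,
    # not the authors' implementation.
    import torch
    import torch.nn as nn

    class SinusoidalTimeEncoder(nn.Module):
        """Maps a time span t to [cos(w_1 t + b_1), ..., cos(w_d t + b_d)]."""
        def __init__(self, dim: int = 100):
            super().__init__()
            self.lin = nn.Linear(1, dim)  # learnable frequencies and phases

        def forward(self, t: torch.Tensor) -> torch.Tensor:  # t: (batch,)
            return torch.cos(self.lin(t.unsqueeze(-1)))

    class LinearTimeEncoder(nn.Module):
        """Low-dimensional linear encoding; attention can recover time spans."""
        def __init__(self, dim: int = 2):
            super().__init__()
            self.lin = nn.Linear(1, dim)

        def forward(self, t: torch.Tensor) -> torch.Tensor:
            return self.lin(t.unsqueeze(-1))

    t = torch.rand(8) * 1000.0                   # example event time spans
    print(SinusoidalTimeEncoder()(t).shape)      # torch.Size([8, 100])
    print(LinearTimeEncoder()(t).shape)          # torch.Size([8, 2])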
URL: https://openreview.net/forum?id=W6GQvdOGHg
---
Title: Knockout: A simple way to handle missing inputs
Authors: Minh Nguyen, Batuhan K. Karaman, Heejong Kim, Alan Q. Wang, Fengbei Liu, Mert R. Sabuncu
Abstract: Deep learning models benefit from rich (e.g., multi-modal) input features. However, multimodal models might be challenging to deploy, because some inputs may be missing at inference. Current popular solutions include marginalization, imputation, and training multiple models. Marginalization achieves calibrated predictions, but it is computationally expensive and only feasible for low dimensional inputs. Imputation may result in inaccurate predictions, particularly when high-dimensional data, such as images, are missing. Training multiple models, where each model is designed to handle different subsets of inputs, can work well but requires prior knowledge of missing input patterns. Furthermore, training and retaining multiple models can be costly. We propose an efficient method to learn both the conditional distribution using full inputs and the marginal distributions. Our method, Knockout, randomly replaces input features with appropriate placeholder values during training. We provide a theoretical justification for Knockout and show that it can be interpreted as an implicit marginalization strategy. We evaluate Knockout across a wide range of simulations and real-world datasets and show that it offers strong empirical performance.
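A minimal sketch of the idea described above, replacing input features with a placeholder value at random during training so the model implicitly learns marginal predictors alongside the full conditional (the knockout probability and placeholder value are illustrative assumptions, not the paper's exact configuration):

    # Knockout-style feature replacement during training (illustrative sketch).
    import torch

    def knockout(x: torch.Tensor, p: float = 0.3, placeholder: float = 0.0) -> torch.Tensor:
        """x: (batch, num_features). Replace each feature with the placeholder
        value with probability p, independently per sample and feature."""
        mask = torch.rand_like(x) < p
        return torch.where(mask, torch.full_like(x, placeholder), x)

    x = torch.randn(4, 5)
    x_train = knockout(x, p=0.3)   # feed x_train to the model during training
    # At inference, genuinely missing features are set to the same placeholder.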
URL: https://openreview.net/forum?id=K71y5pge84
---
Title: Reproducibility study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"
Authors: Tijs Wiegman, Leyla Perotti, Viktória Pravdová, Ori Brand, Maria Heuss
Abstract: This paper presents a reproducibility study of Ortu et al. (2024), investigating the competition of the factual recall and counterfactual in-context adaptation mechanisms in GPT-2. We extend experiments developed by the original authors with softmax-normalized logits as another metric for gauging the evolution of the scoring of tokens in the model. Our reproduced and extended experiments validate the original paper's main claims regarding the location of the competition of mechanisms in GPT-2, i.e. that the competition emerges predominantly in later layers, and is driven by the attention blocks corresponding to a subset of specialized attention heads. Additionally, we explore intervention strategies based on attention modification to increase factual accuracy. We find that boosting multiple attention heads involved in factual recall simultaneously can have a synergistic effect on factual accuracy, which is further enhanced by the suppression of copy heads. Finally, we rework how the competition of mechanisms is conceptualized and find that the specialized factual recall heads identified by Ortu et al. (2024) act as copy regulators, penalizing counterfactual in-context adaptation and rewarding the copying of factual information.
URL: https://openreview.net/forum?id=VCG6j3tcAA
---
Title: Gaussian Loss Smoothing Enables Certified Training with Tight Convex Relaxations
Authors: Stefan Balauca, Mark Niklas Mueller, Yuhao Mao, Maximilian Baader, Marc Fischer, Martin Vechev
Abstract: Training neural networks with high certified accuracy against adversarial examples remains an open challenge despite significant efforts.
While certification methods can effectively leverage tight convex relaxations for bound computation, in training, these methods, perhaps surprisingly, can perform worse than looser relaxations.
Prior work hypothesized that this phenomenon is caused by the discontinuity, non-smoothness, and perturbation sensitivity of the loss surface induced by tighter relaxations.
In this work, we theoretically show that Gaussian Loss Smoothing (GLS) can alleviate these issues.
We confirm this empirically by instantiating GLS with two variants: a zeroth-order optimization algorithm, called PGPE, which allows training with non-differentiable relaxations, and a first-order optimization algorithm, called RGS, which requires gradients of the relaxation but is much more efficient than PGPE.
Extensive experiments show that when combined with tight relaxations, these methods surpass state-of-the-art methods when training on the same network architecture for many settings.
Our results clearly demonstrate the promise of Gaussian Loss Smoothing for training certifiably robust neural networks and pave a path towards leveraging tighter relaxations for certified training.
URL: https://openreview.net/forum?id=lknvxcjuos
---
Title: What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions
Authors: Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore Sumers, Jian-Qiao Zhu, Thomas L. Griffiths
Abstract: Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We show that the embeddings from autoregressive models correspond to predictive sufficient statistics. By identifying settings where the predictive sufficient statistics are interpretable distributions over latent variables, including exchangeable models and latent state models, we show that embeddings of autoregressive models encode these explainable quantities of interest. We conduct empirical probing studies to extract information from transformers about latent generating distributions. Furthermore, we show that these embeddings generalize to out-of-distribution cases, do not exhibit token memorization, and that the information we identify is more easily recovered than other related measures. Next, we extend our analysis of exchangeable models to more realistic scenarios where the predictive sufficient statistic is difficult to identify by focusing on an interpretable subcomponent of language, topics. We show that large language models encode topic mixtures inferred by latent Dirichlet allocation (LDA) in both synthetic datasets and natural corpora.
URL: https://openreview.net/forum?id=YyMACp98Kz
---
Title: Revisiting Data Augmentation for Ultrasound Images
Authors: Adam Tupper, Christian Gagné
Abstract: Data augmentation is a widely used and effective technique to improve the generalization performance of deep neural networks. Yet, despite often facing limited data availability when working with medical images, it is frequently underutilized. This appears to come from a gap in our collective understanding of the efficacy of different augmentation techniques across different tasks and modalities. One modality where this is especially true is ultrasound imaging. This work addresses this gap by analyzing the effectiveness of different augmentation techniques at improving model performance across a wide range of ultrasound image analysis tasks. To achieve this, we introduce a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources and covering 11 body regions. Our results demonstrate that many of the augmentations commonly used for tasks on natural images are also effective on ultrasound images, even more so than augmentations developed specifically for ultrasound images in some cases. We also show that diverse augmentation using TrivialAugment, which is widely used for natural images, is also effective for ultrasound images. Moreover, our proposed methodology represents a structured approach for assessing various data augmentations that can be applied to other contexts and modalities.
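As a rough sketch of how such a pipeline might look with torchvision (illustrative only; the transform list, image size, and normalization statistics are placeholders, not the benchmark's settings):

    # Illustrative torchvision pipeline applying TrivialAugment before standard
    # preprocessing, assuming single-channel (grayscale) ultrasound images.
    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.TrivialAugmentWide(),              # diverse, parameter-free augmentation
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5], std=[0.5]),  # placeholder statistics
    ])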
URL: https://openreview.net/forum?id=iGcxlTLIL5
---
Title: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
Authors: Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, PeiFeng Wang, silvio savarese, Caiming Xiong, Shafiq Joty
Abstract: Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools and multi-agent collaboration. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts for the LLM to condition on; and (2) Output level, which covers methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning methods such as PPO and GRPO, as well as the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. Finally, we identify emerging trends, such as domain-specific reasoning systems, and open challenges, such as evaluation and data quality. This survey aims to provide AI researchers and practitioners with a comprehensive foundation for advancing reasoning in LLMs, paving the way for more sophisticated and reliable AI systems.
URL: https://openreview.net/forum?id=SlsZZ25InC
---
Title: TSkips: Efficiency Through Explicit Temporal Delay Connections in Spiking Neural Networks
Authors: Prajna G. Malettira, Shubham Negi, Wachirawit Ponghiran, Kaushik Roy
Abstract: Spiking Neural Networks (SNNs) with their bio-inspired Leaky Integrate-and-Fire (LIF) neurons inherently capture temporal information. This makes them well-suited for sequential tasks like processing event-based data from Dynamic Vision Sensors (DVS) and event-based speech tasks. Harnessing the temporal capabilities of SNNs requires mitigating vanishing spikes during training, capturing spatio-temporal patterns and enhancing precise spike timing. To address these challenges, we propose _TSkips_, augmenting SNN architectures with forward and backward skip connections that incorporate explicit temporal delays. These connections capture long-term spatio-temporal dependencies and facilitate better spike flow over long sequences. The introduction of _TSkips_ creates a vast search space of possible configurations, encompassing skip positions and time delay values. To efficiently navigate this search space, this work leverages training-free Neural Architecture Search (NAS) to identify optimal network structures and corresponding delays. We demonstrate the effectiveness of our approach on four event-based datasets: DSEC-flow for optical flow estimation, DVS128 Gesture for hand gesture recognition and Spiking Heidelberg Digits (SHD) and Spiking Speech Commands (SSC) for speech recognition. Our method achieves significant improvements across these datasets: up to 18% reduction in Average Endpoint Error (AEE) on DSEC-flow, 8% increase in classification accuracy on DVS128 Gesture, and up to ~8% and ~16% higher classification accuracy on SHD and SSC, respectively.
URL: https://openreview.net/forum?id=hwz32S06G4
---
Title: Adaptive Resolution Residual Networks — Generalizing Across Resolutions Easily and Efficiently
Authors: Léa Demeule, Mahtab Sandhu, Glen Berseth
Abstract: The majority of signal data captured in the real world uses numerous sensors with different resolutions. In practice, most deep learning architectures are fixed-resolution; they consider a single resolution at training and inference time. This is convenient to implement but fails to fully take advantage of the diverse signal data that exists. In contrast, other deep learning architectures are adaptive-resolution; they directly allow various resolutions to be processed at training and inference time. This provides computational adaptivity but either sacrifices robustness or compatibility with mainstream layers, which hinders their use. In this work, we introduce Adaptive Resolution Residual Networks (ARRNs) to surpass this tradeoff. We construct ARRNs from Laplacian residuals, which serve as generic adaptive-resolution adapters for fixed-resolution layers. We use smoothing filters within Laplacian residuals to linearly separate input signals over a series of resolution steps. We can thereby skip Laplacian residuals to cast high-resolution ARRNs into low-resolution ARRNs that are computationally cheaper yet numerically identical over low-resolution signals. We guarantee this result when Laplacian residuals are implemented with perfect smoothing kernels. We complement this novel component with Laplacian dropout, which randomly omits Laplacian residuals during training. This regularizes for robustness to a distribution of lower resolutions. This also regularizes for numerical errors that may occur when Laplacian residuals are implemented with approximate smoothing kernels. We provide a solid grounding for the advantageous properties of ARRNs through a theoretical analysis based on neural operators, and empirically show that ARRNs embrace the challenge posed by diverse resolutions with computational adaptivity, robustness, and compatibility with mainstream layers.
URL: https://openreview.net/forum?id=kTh5tFd1Mq
---
Title: The Geometry of Phase Transitions in Diffusion Models: Tubular Neighbourhoods and Singularities
Authors: Manato Yaguchi, Kotaro Sakamoto, Ryosuke Sakamoto, Masato Tanabe, Masatomo Akagawa, Yusuke Hayashi, Masahiro Suzuki, Yutaka Matsuo
Abstract: Diffusion models undergo phase transitions during the generative process where data features suddenly emerge in the final stages.
The current study aims to elucidate this critical phenomenon from a geometric perspective. We employ the concept of the "injectivity radius", a quantity that characterises the structure of the data manifold. Through theoretical and empirical evidence, we demonstrate that phase transitions in the generative process of diffusion models are closely related to the injectivity radius. Our findings offer a novel perspective on phase transitions in diffusion models, with potential implications for improving performance and sampling efficiency.
URL: https://openreview.net/forum?id=ahVFKFLYk2
---
Title: Identifying Macro Causal Effects in a C-DMG over ADMGs
Authors: Simon Matthieu Ferreira, Charles K. Assaad
Abstract: Causal effect identification using causal graphs is a fundamental challenge in causal inference. While extensive research has been conducted in this area, most existing methods assume the availability of fully specified directed acyclic graphs or acyclic directed mixed graphs. However, in complex domains such as medicine and epidemiology, complete causal knowledge is often unavailable, and only partial information about the system is accessible. This paper focuses on causal effect identification within partially specified causal graphs, with particular emphasis on cluster-directed mixed graphs (C-DMGs) which can represent many different acyclic directed mixed graphs (ADMGs). These graphs provide a higher-level representation of causal relationships by grouping variables into clusters, offering a more practical approach for handling complex systems. Unlike fully specified ADMGs, C-DMGs can contain cycles, which complicate their analysis and interpretation. Furthermore, their cluster-based nature introduces new challenges, as it gives rise to two distinct types of causal effects: macro causal effects and micro causal effects, each with different properties. In this work, we focus on macro causal effects, which describe the effects of entire clusters on other clusters. We establish that the do-calculus is both sound and complete for identifying these effects in C-DMGs over ADMGs when the cluster sizes are either unknown or of size greater than one. Additionally, we provide a graphical characterization of non-identifiability for macro causal effects in these graphs.
URL: https://openreview.net/forum?id=905LEugq6R
---
Title: Deep Autoregressive Models as Causal Inference Engines
Authors: Daniel Jiwoong Im, Kevin Zhang, Nakul Verma, Kyunghyun Cho
Abstract: Existing causal inference (CI) models are often restricted to data with low-dimensional confounders and singleton actions. We propose an autoregressive (AR) CI framework capable of handling complex confounders and sequential actions commonly found in modern applications. Our approach accomplishes this using sequencification, which transforms data from an underlying causal diagram into a sequence of tokens. Sequencification not only accommodates training with data generated from a large class of DAGs, but also extends existing CI capabilities to estimate multiple causal quantities using a single model. We can directly compute probabilities from interventional distributions, simplifying inference and improving outcome prediction accuracy. We demonstrate that an AR model adapted for CI is efficient and effective in various complex applications such as navigating mazes, playing chess endgames, and evaluating the impact of certain keywords on paper acceptance rates, where we consider causal queries beyond standard reinforcement learning-type questions.
URL: https://openreview.net/forum?id=uuREHPf2ll
---
Title: Interactive Large Language Models for Reliable Answering under Incomplete Context
Authors: Jing-Cheng Pang, Heng-Bo Fan, Pengyuan Wang, Jia-Hao Xiao, Nan Tang, Si-Hang Yang, Chengxing Jia, Ming-Kun Xie, Xiang Chen, Sheng-Jun Huang, Yang Yu
Abstract: The rise of large language models (LLMs) has revolutionized the way humans interact with artificial intelligence systems. However, their reliability in sensitive applications—such as personal consultations or clinical decision-making—remains limited. A critical shortfall lies in LLMs' inherent lack of interactivity: these models generate responses even when essential context or domain-specific knowledge is absent, risking inaccurate or misleading outputs. A potential approach to mitigate this issue is to enable LLMs to pose clarifying questions, thereby uncovering the missing information required to provide accurate responses. However, previous methods often tend to greedily prompt LLMs to ask questions. This burdens the user with potentially irrelevant questions and makes the system less flexible. In this paper, we introduce LaMSeI (Language Model with Selective Interaction), a method that enhances LLMs' ability to judge when interaction is necessary under ambiguous or incomplete contexts. The idea behind LaMSeI is to measure the LLM's uncertainty about the user query and to interact with the user only when that uncertainty is high. Additionally, we incorporate active learning techniques to select the most informative questions from question candidates, for effectively uncovering the missing context. Our empirical studies, across various challenging question answering benchmarks where LLMs are posed queries with incomplete context, demonstrate the effectiveness of LaMSeI. The method improves answer accuracy from 31.9% to 50.9%, outperforming other leading question-answering frameworks. Moreover, in experiments involving human participants, LaMSeI consistently generates answers superior to or comparable to baselines in more than 82% of the cases. Finally, we verify the performance of LaMSeI on various LLMs, such as LLAMA2, LLAMA3, Vicuna and GPT-3.5, highlighting its capability to improve interactive language models.
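One simple way to operationalize "interact only when uncertainty is high" is to sample several answers and use their disagreement as an uncertainty proxy; the sketch below does exactly that and is an illustrative assumption, not LaMSeI's actual uncertainty measure (sample_fn and the threshold are hypothetical placeholders for an LLM call and a tuned value):

    # Sketch of an uncertainty-gated interaction policy.
    from collections import Counter
    from typing import Callable

    def should_ask_clarifying_question(query: str,
                                       sample_fn: Callable[[str], str],
                                       k: int = 5,
                                       threshold: float = 0.6) -> bool:
        answers = [sample_fn(query) for _ in range(k)]
        top_count = Counter(answers).most_common(1)[0][1]
        agreement = top_count / k      # 1.0 means all sampled answers agree
        return agreement < threshold   # low agreement -> ask the user for clarification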
URL: https://openreview.net/forum?id=nnlmcxYWlV
---
Title: Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data
Authors: Ahmet H. Güzel, Ilija Bogunovic, Jack Parker-Holder
Abstract: Offline reinforcement learning (RL) offers a promising framework for training agents using pre-collected datasets without the need for further environment interaction. However, policies trained on offline data often struggle to generalize due to limited exposure to diverse states. The complexity of visual data introduces additional challenges such as noise, distractions, and spurious correlations, which can misguide the policy and increase the risk of overfitting if the training data is not sufficiently diverse. Indeed, this makes it challenging
to leverage vision-based offline data in training robust agents that can generalize to unseen environments. To solve this problem, we propose a simple approach—generating additional synthetic training data. We propose a two-step process, first augmenting the originally collected offline data to improve zero-shot generalization by introducing diversity, then using a diffusion model to generate additional data in latent space. We test our method across both continuous action spaces (Visual D4RL) and discrete action spaces (Procgen), demonstrating that it significantly improves generalization without requiring any algorithmic changes to existing model-free offline RL methods. We show that our method not only increases the diversity of the training data but also significantly reduces the generalization gap at test time while maintaining computational efficiency. We believe this approach could fuel additional progress in generating synthetic data to train more general agents in the future.
URL: https://openreview.net/forum?id=gFmSFa408D
---
Title: The Over-Certainty Phenomenon in Modern Test-Time Adaptation Algorithms
Authors: Fin Amin, Jung-Eun Kim
Abstract: When neural networks are confronted with unfamiliar data that deviate from their training set, this signifies a domain shift. While these networks output predictions on their inputs, they typically fail to account for their level of familiarity with these novel observations. Prevailing works navigate test-time adaptation with the goal of curtailing model entropy, yet they unintentionally produce models that struggle with sub-optimal calibration—a dilemma we term the over-certainty phenomenon. This over-certainty in predictions can be particularly dangerous in the setting of domain shifts, as it may lead to misplaced trust. In this paper, we propose a solution that not only maintains accuracy but also addresses calibration by mitigating the over-certainty phenomenon. To do this, we introduce a certainty regularizer that dynamically adjusts pseudo-label confidence by accounting for both backbone entropy and logit norm. Our method achieves state-of-the-art performance in terms of Expected Calibration Error and Negative Log Likelihood, all while maintaining parity in accuracy.
URL: https://openreview.net/forum?id=AGQRij8iUC
---
Title: Expressive Pooling for Graph Neural Networks
Authors: Veronica Lachi, Alice Moallemy-Oureh, Andreas Roth, Pascal Welke
Abstract: Considerable efforts have been dedicated to exploring methods that enhance the expressiveness of graph neural networks. Current endeavors primarily focus on modifying the message-passing process to overcome limitations imposed by the Weisfeiler-Leman test, often at the expense of increasing computational cost. In practical applications, message-passing layers are interleaved with pooling layers for graph-level tasks, enabling the learning of increasingly abstract and coarser representations of input graphs. In this work, we formally prove two directions that allow pooling methods to increase the expressive power of a graph neural network while keeping the message-passing method unchanged. We systematically assign eight frequently used pooling operators to our theoretical conditions for increasing expressivity and introduce a novel pooling method XP, short for eXpressive Pooling, as an additional simple method that satisfies our theoretical conditions. Experiments conducted on the Brec dataset confirm that those pooling methods that satisfy our conditions empirically increase the expressivity of graph neural networks.
URL: https://openreview.net/forum?id=xGADInGWMt
---
Title: Towards Robust Scale-Invariant Mutual Information Estimators
Authors: Cheuk Ting Leung, Rohan Ghosh, Mehul Motani
Abstract: Mutual information (MI) is hard to estimate for high dimensional data, and various estimators have been proposed over the years to tackle this problem. Here, we note that there exists another challenging problem, namely that many estimators of MI, which we denote as $I(X;T)$, are sensitive to scale, i.e., $I(X;\alpha T)\neq I(X;T)$ where $\alpha \in \mathbb{R}^{+}$. Although some normalization methods have been hinted at in previous works, there is no in-depth study of the problem. In this work, we study new normalization strategies for MI estimators to be scale-invariant, particularly for the Kraskov–Stögbauer–Grassberger (KSG) and the neural network-based MI (MINE) estimators. We provide theoretical and empirical results and show that the original un-normalized estimators are not scale-invariant and highlight the consequences of an estimator's scale-dependence. We propose new global normalization strategies that are tuned to the corresponding estimator and are scale-invariant. We compare our global normalization strategies to existing local normalization strategies and provide intuitive and empirical arguments to support the use of global normalization. Extensive experiments across multiple distributions and settings are conducted, and we find that our proposed variants KSG-Global-$L_{\infty}$ and MINE-Global-Corrected are most accurate within their respective approaches. Finally, we perform an information plane analysis of neural networks and observe clearer trends of fitting and compression using the normalized estimators compared to the original un-normalized estimators. Our work highlights the importance of scale awareness and global normalization in the MI estimation problem.
URL: https://openreview.net/forum?id=vB7Wvytko5
---
Title: Algorithmic fairness with monotone likelihood ratios
Authors: Wes Camp
Abstract: We show that inequalities of many commonly used fairness metrics (true/false positive/negative rates, predicted positive/negative rates, and positive/negative predictive values) are guaranteed for groups with different outcome rates under a monotonically calibrated model whose risk distributions have a monotone likelihood ratio, extending existing impossibility results. We further provide lower bounds on the FNR/FPR disparities and PPR/PNR disparities in the same setting, showing that either the FNR disparity or FPR disparity is at least as large as the positive outcome rate disparity (for FNR disparity) or negative outcome rate disparity (for FPR disparity), and either the PPR disparity or PNR disparity is at least as large as the positive outcome rate disparity (for PPR disparity) or negative outcome rate disparity (for PNR disparity). While incompatibilities of some combinations of these metrics have been demonstrated previously, we are unaware of any work that has demonstrated direct incompatibility of calibration with these individual equalities, equivalence of these inequalities, or lower bounds for the disparity in these values under distributional assumptions about a model's predictions.
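For background, the two distributional assumptions referenced above can be written as follows; this is one standard reading stated in conventional notation, not necessarily the paper's exact formulation. Given a risk score $R$ and groups $a, b$ with group-conditional risk densities $f_a, f_b$,
\[
\Pr(Y = 1 \mid R = r) \ \text{is nondecreasing in } r \quad \text{(monotone calibration)}, \qquad
\frac{f_a(r)}{f_b(r)} \ \text{is monotone in } r \quad \text{(monotone likelihood ratio)}.
\]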
URL: https://openreview.net/forum?id=mtoWa0gIKy
---
Title: Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models
Authors: Dante Campregher, Yanxu Chen, Sander Hoffman, Maria Heuss
Abstract: This paper presents a reproducibility study examining how Large Language Models (LLMs) manage competing factual and counterfactual information, focusing on the role of attention heads in this process. We attempt to reproduce and reconcile findings from three recent studies by Ortu et al. [13], Yu, Merullo, and Pavlick [17] and McDougall et al. [7] that investigate the competition between model-learned facts and contradictory context information through Mechanistic Interpretability tools. Our study specifically examines the relationship between attention head strength and factual output ratios, evaluates competing hypotheses about attention heads' suppression mechanisms, and investigates the domain specificity of these attention patterns. Our findings suggest that attention heads promoting factual output do so via general copy suppression rather than selective counterfactual suppression, as strengthening them can also inhibit correct facts. Additionally, we show that attention head behavior is domain-dependent, with larger models exhibiting more specialized and category-sensitive patterns.
URL: https://openreview.net/forum?id=1QrB5WSWOR
---
Title: A Survey on Future Frame Synthesis: Bridging Deterministic and Generative Approaches
Authors: Ruibo Ming, Zhewei Huang, Jingwei Wu, Zhuoxuan Ju, Daxin Jiang, Jianming HU, Lihui Peng, Shuchang Zhou
Abstract: Future Frame Synthesis (FFS), the task of generating subsequent video frames from context, represents a core challenge in machine intelligence and a cornerstone for developing predictive world models. This survey provides a comprehensive analysis of the FFS landscape, charting its critical evolution from deterministic algorithms focused on pixel-level accuracy to modern generative paradigms that prioritize semantic coherence and dynamic plausibility. We introduce a novel taxonomy organized by algorithmic stochasticity, which not only categorizes existing methods but also reveals the fundamental drivers—advances in architectures, datasets, and computational scale—behind this paradigm shift. Critically, our analysis identifies a bifurcation in the field's trajectory: one path toward efficient, real-time prediction, and another toward large-scale, generative world simulation. By pinpointing key challenges and proposing concrete research questions for both frontiers, this survey serves as an essential guide for researchers aiming to advance the frontiers of visual dynamic modeling.
URL: https://openreview.net/forum?id=ZN4rzrHlNz
---
Title: SparseDiff: Sparse Discrete Diffusion for Scalable Graph Generation
Authors: Yiming QIN, Clement Vignac, Pascal Frossard
Abstract: Graph generative models encounter significant scaling challenges due to the need to predict the presence or type of edges for every node pair, resulting in quadratic complexity. While some models attempt to support large graph generation, they often impose restrictive assumptions, such as enforcing cluster or hierarchical structures, which can limit generalizability and result in unstable generation quality across various graph types. To address this, we introduce SparseDiff, a novel diffusion framework that leverages the inherent sparsity in large graphs - a highly relaxed assumption that enables efficient sparse modeling without sacrificing generation quality for different datasets. SparseDiff reduces the complexity of the three core components in graph diffusion models. It first introduces a noising trajectory that preserves sparsity with more memory-efficient computation. During training, SparseDiff uses a denoising network based on convolutional attention layers over sparse edge subsets, combining edge-based graph attention and query edge-based random attention mechanisms, maintaining expressiveness with reduced memory usage. Finally, for inference, at each denoising step, SparseDiff generates edge subsets iteratively, progressively reconstructing the adjacency structure. SparseDiff achieves state-of-the-art results on both small and large datasets, showing its robustness across varying graph sizes and its scalability. Additionally, it ensures faster convergence for large graphs, achieving a fourfold speedup on the large-scale Ego dataset compared to dense models. SparseDiff's efficiency, combined with its effective control over space complexity, positions it as a powerful solution for scaling applications involving large graphs.
URL: https://openreview.net/forum?id=kuJ3lpxnVC
---
Title: Online Selective Conformal Inference: Errors and Solutions
Authors: Yusuf Sale, Aaditya Ramdas
Abstract: In online selective conformal inference, data arrives sequentially, and prediction intervals are constructed only when an online selection rule is met. Since online selections may break the exchangeability between the selected test datum and the rest of the data, one must correct for this by suitably selecting the calibration data. In this paper, we evaluate existing calibration selection strategies and pinpoint some fundamental errors in the associated claims that guarantee selection-conditional coverage and control of the false coverage rate (FCR). To address these shortcomings, we propose novel calibration selection strategies that provably preserve the exchangeability of the calibration data and the selected test datum. Consequently, we demonstrate that online selective conformal inference with these strategies guarantees both selection-conditional coverage and FCR control. Our theoretical findings are supported by experimental evidence examining trade-offs between valid methods.
URL: https://openreview.net/forum?id=PjIQwFyP07
---
Title: Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Authors: Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei
Abstract: We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction-tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by decomposing human knowledge and capabilities into various fields, sub-fields and, ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added by simply incorporating a new node into our taxonomy. While promising, our approach may inherit biases or inaccuracies from LLM-generated data, as in other synthetic data work, and is primarily evaluated on exam-style benchmarks. Broader evaluations and data quality control are left for future work.
URL: https://openreview.net/forum?id=PahnCreCxK
---
Title: Universal Link Predictor By In-Context Learning on Graphs
Authors: Kaiwen Dong, Haitao Mao, Zhichun Guo, Nitesh V Chawla
Abstract: Link prediction is a crucial task in graph machine learning, where the goal is to infer missing or future links within a graph. Traditional approaches leverage heuristic methods based on widely observed connectivity patterns, offering broad applicability and generalizability without the need for model training. Despite their utility, these methods are limited by their reliance on human-derived heuristics and lack the adaptability of data-driven approaches. Conversely, parametric link predictors excel at automatically learning connectivity patterns from data and achieve state-of-the-art performance, but fall short of transferring directly across different graphs. Instead, they require extensive training and hyperparameter optimization to adapt to the target graph. In this work, we introduce the Universal Link Predictor (UniLP), a novel model that combines the generalizability of heuristic approaches with the pattern-learning capabilities of parametric models. UniLP is designed to autonomously identify connectivity patterns across diverse graphs, ready for immediate application to any unseen graph dataset without targeted training. We address the challenge of conflicting connectivity patterns—arising from the unique distributions of different graphs—through the implementation of In-context Learning (ICL). This approach allows UniLP to dynamically adjust to various target graphs based on contextual demonstrations, thereby avoiding negative transfer. Through rigorous experimentation, we demonstrate UniLP's effectiveness in adapting to new, unseen graphs at test time, showcasing its ability to perform comparably to or even outperform parametric models that have been fine-tuned for specific datasets. Our findings highlight UniLP's potential to set a new standard in link prediction, combining the strengths of heuristic and parametric methods in a single, versatile framework.
URL: https://openreview.net/forum?id=EYpqmoejB8
---
New submissions
===============
Title: crowd-hpo: Realistic Hyperparameter Optimization and Benchmarking for Learning from Crowds with Noisy Labels
Abstract: Crowdworking is a cost-efficient solution for acquiring class labels. Since these labels are subject to noise, various approaches to learning from crowds have been proposed. Typically, these approaches are evaluated with default hyperparameter configurations, resulting in unfair and suboptimal performance, or with hyperparameter configurations tuned via a validation set with ground truth class labels, representing an often unrealistic scenario. Moreover, both setups can produce different approach rankings, complicating study comparisons. Therefore, we introduce crowd-hpo as a framework for evaluating approaches to learning from crowds in combination with criteria to select well-performing hyperparameter configurations with access only to noisy crowd-labeled validation data. Extensive experiments with neural networks demonstrate that these criteria select hyperparameter configurations that improve the generalization performance of learning-from-crowds approaches, measured on separate test sets with ground truth labels. Hence, incorporating such criteria into experimental studies is essential for enabling fairer and more realistic benchmarking.
URL: https://openreview.net/forum?id=SaKfhylVLK
---
Title: Tree Structure for the Categorical Wasserstein Weisfeiler-Lehman Graph Kernel
Abstract: The Wasserstein Weisfeiler-Lehman (WWL) graph kernel is a popular and efficient approach, utilized in various kernel-dependent machine learning frameworks for practical applications with graph data. It incorporates optimal transport geometry into the Weisfeiler-Lehman graph kernel to mitigate the information loss inherent in the aggregation strategies of graph kernels. While the WWL graph kernel demonstrates superior performance in many applications, it suffers a drawback in its computational complexity, i.e., at least $\mathcal{O}(n_{1} n_{2})$, where $n_{1}, n_{2}$ denote the numbers of vertices of the input graphs. Consequently, this hinders the practical applicability of the WWL graph kernel, especially in large-scale settings. In this paper, we propose the Tree Wasserstein Weisfeiler-Lehman (TWWL) graph kernel, which leverages \emph{tree structure} to scale up the exact computation of the WWL graph kernel for graph data with categorical node labels. In particular, the computational complexity of the TWWL graph kernel is $\mathcal{O}(n_{1} + n_{2})$, enabling its application to large-scale graphs. Numerical experiments demonstrate that the performance of the proposed kernel compares favorably with baseline kernels, while its computation is several orders of magnitude faster than the classic WWL graph kernel, paving the way for applications to large-scale datasets where the WWL kernel is computationally prohibitive.
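The linear-time behaviour stems from the closed form of the Wasserstein distance under a tree ground metric; in standard notation (stated here as background, possibly differing from the paper's exact construction), for measures $\mu, \nu$ supported on a tree $T$ with edge weights $w_e$,
\[
W_{d_T}(\mu, \nu) \;=\; \sum_{e \in E(T)} w_e \,\bigl| \mu(\Gamma_e) - \nu(\Gamma_e) \bigr|,
\]
where $\Gamma_e$ denotes the set of nodes in the subtree below edge $e$; the sum can be evaluated in time linear in the number of nodes.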
URL: https://openreview.net/forum?id=VwoSsFK22P
---
Title: Beyond Magnitude and Gradient: Network Pruning Inspired by Optimization Trajectories
Abstract: Deep neural networks are dramatically over-parameterized and can be pruned without affecting generalization. Existing pruning criteria inspect weights or gradients in isolation and ignore the effect of optimization dynamics on pruning. We introduce Causal Pruning (CP), a method that learns parameter importance directly from the optimization trajectory.
We exploit the causal signal hidden in SGD trajectories by treating each weight update as an intervention and measuring its effect on the loss, observed versus predicted. This view yields two insights: (i) a weight's importance is proportional to the gap between the predicted loss change (via a first-order Taylor estimate) and the observed loss change, and (ii) at convergence, weights whose removal leaves the local basin no sharper, i.e. does not reduce flatness, can be pruned without harming generalization. Empirically, we show that causal pruning is comparable to recent state-of-the-art approaches.
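The observed-versus-predicted gap can be made concrete with a short sketch (illustrative only; how Causal Pruning turns this per-step gap into per-weight importance scores is not reproduced here):

    # Compare the first-order Taylor prediction of the loss change with the
    # change actually observed after one plain SGD step (performed in place).
    import torch

    def taylor_gap(model, loss_fn, x, y, lr=0.1):
        params = [p for p in model.parameters() if p.requires_grad]
        loss_before = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss_before, params)
        with torch.no_grad():
            deltas = [-lr * g for g in grads]                       # SGD update
            predicted = sum((g * d).sum() for g, d in zip(grads, deltas))
            for p, d in zip(params, deltas):
                p.add_(d)                                           # apply the step
            loss_after = loss_fn(model(x), y)
        observed = loss_after - loss_before
        return (observed - predicted).item()                        # the "causal" gap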
URL: https://openreview.net/forum?id=EExAniiwtQ
---
Title: Linear Attention Optimized GPU Kernel Implementation
Abstract: The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the introduction of linear attention (LA) mechanisms, which offer a linear time complexity of $O(ND^2)$ and have demonstrated accuracy comparable to regular attention. However, LA in practice lags behind its theoretical efficiency. We propose a novel method for LA's forward and backward passes, along with a highly optimized CUDA implementation. Our approach outperforms the state-of-the-art by 3.3× in speed and reduces memory consumption by 3.6×. We validate these improvements in both single-layer and end-to-end settings by training a 1.4 billion parameter language model, which demonstrates similar expressivity to regular attention on major reasoning benchmarks.
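For reference, the non-causal linear attention computation that such kernels accelerate can be written in a few lines of PyTorch; this is a naive baseline (with the common elu(x)+1 feature map as an assumed choice), not the paper's fused CUDA kernel:

    # Naive reference for non-causal linear attention: O(N * D^2) instead of
    # softmax attention's O(N^2 * D). Baseline for comparison only.
    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        """q, k, v: (batch, heads, N, D)."""
        q = F.elu(q) + 1.0
        k = F.elu(k) + 1.0
        kv = torch.einsum("bhnd,bhne->bhde", k, v)              # (B, H, D, D)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
        return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

    q = k = v = torch.randn(2, 4, 128, 64)
    out = linear_attention(q, k, v)                             # (2, 4, 128, 64)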
URL: https://openreview.net/forum?id=JfiUKCPVR9
---
Title: Scaling Laws of Distributed Random Forests
Abstract: Random forests are a widely used machine learning technique valued for their robust predictive performance and interpretability. They are applied in many critical applications and often combined with federated learning to collaboratively build machine learning models across multiple distributed sites. The independent decision trees make random forests inherently parallelizable and well-suited for distributed and federated settings. Despite this perfect fit, there is a lack of comprehensive scalability studies, and many existing methods show limited parallel efficiency or are tested only at smaller scales. To address this gap, we present a comprehensive analysis of the scaling capabilities of distributed random forests on up to 64 compute nodes. Using a tree-parallel approach, we demonstrate a strong scaling speedup of up to 31.98 and a weak scaling efficiency of over 0.96 without affecting predictive performance. Comparing the performance trade-offs of distributed and local inference strategies enables us to simulate various real-life scenarios in terms of distributed computing resources, data availability, and privacy considerations. We further explore how increasing model and data size improves prediction accuracy, scaling up to 51 200 trees and 7.5 million training samples. We find that while distributing the data across nodes leads to super-scalar speedup, it negates the predictive benefit of increased data. Finally, we study the impact of distributed and non-IID data and find that while global imbalance reduces performance, local distribution differences can help mitigate this effect.
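A single-machine stand-in for the tree-parallel scheme can be sketched as follows (illustrative only, not the paper's distributed implementation; the shard count, forest sizes, and averaging of per-node class probabilities are assumptions for the sketch):

    # Each simulated "node" trains a sub-forest on its local data shard;
    # global inference averages the per-node class probabilities.
    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
    shards = np.array_split(np.arange(len(X)), 4)   # 4 simulated compute nodes

    def fit_local(idx):
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        return rf.fit(X[idx], y[idx])

    forests = Parallel(n_jobs=4)(delayed(fit_local)(idx) for idx in shards)
    proba = np.mean([rf.predict_proba(X[:100]) for rf in forests], axis=0)
    pred = proba.argmax(axis=1)                     # global (distributed) prediction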
URL: https://openreview.net/forum?id=ICHxTlgnSy
---
Title: Flows and Diffusions on the Neural Manifold
Abstract: Diffusion and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight space learning by leveraging recent techniques to incorporate structural priors derived from optimization dynamics. Central to our approach is modeling the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques towards matching a gradient flow, providing a theoretical framework for treating optimization paths as inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning by adjoint matching, the use of autoencoders for latent weight representation, conditioning on task-specific context data, and adopting informative source distributions such as Kaiming uniform. Experiments demonstrate that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning to enhance performance. Finally, we illustrate a practical application in safety-critical systems: detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.
URL: https://openreview.net/forum?id=d24Zv3QEdK
---
Title: Studying memorization of large language models using answers to Stack Overflow questions
Abstract: Large Language Models (LLMs) are capable of answering many software-related questions and supporting developers by generating code snippets. These capabilities originate from training on massive amounts of data from the Internet, including information from Stack Overflow. This raises the question of whether answers to software-related questions are simply memorized from the training data, which might raise problems, as this often requires attribution (e.g., CC-BY license), sharing with a similar license (e.g., GPL licenses) or may even be prohibited (proprietary license). To study this, we compare responses to Stack Overflow questions that were known during LLM pre-training with responses to questions that were not included in the pre-training data. We then calculate the overlap both with answers marked as accepted on Stack Overflow and with other texts we can find on the internet. We further explore the impact of the popularity of programming languages, the complexity of the prompts used, and the randomization of the text generation process on the memorization of answers to Stack Overflow questions. We find that many generated answers are to some degree collages of memorized content and that this does not depend on whether the questions were seen during training or not. However, many of the memorized snippets are common phrases or code and, therefore, not copyrightable. Still, we also have clear evidence that copyright violation happens and is likely when LLMs are used at large scales.
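One simple way to quantify this kind of overlap is word n-gram containment between a generated answer and a reference text, as sketched below; this is one possible metric, not necessarily the one used by the authors:

    # Fraction of word n-grams in a generated answer that also occur in a
    # reference text (e.g., the accepted Stack Overflow answer).
    def ngrams(text: str, n: int = 5):
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def ngram_containment(generated: str, reference: str, n: int = 5) -> float:
        gen = ngrams(generated, n)
        if not gen:
            return 0.0
        return len(gen & ngrams(reference, n)) / len(gen)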
URL: https://openreview.net/forum?id=ddocn44Kaq
---
Title: AcademicEval: Live Long-Context LLM Benchmark
Abstract: Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks. \textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, \textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on \textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. We will release the source code and data upon publication.
URL: https://openreview.net/forum?id=LjQ4voE5bs
---
Title: Thera: Aliasing-Free Arbitrary-Scale Super-Resolution with Neural Heat Fields
Abstract: Recent approaches to arbitrary-scale single image super-resolution (ASR) use neural fields to represent continuous signals that can be sampled at arbitrary resolutions. However, point-wise queries of neural fields do not naturally match the point spread function (PSF) of pixels, which may cause aliasing in the super-resolved image. Existing methods attempt to mitigate this by approximating an integral version of the field at each scaling factor, compromising both fidelity and generalization. In this work, we introduce neural heat fields, a novel neural field formulation that inherently models a physically exact PSF. Our formulation enables analytically correct anti-aliasing at any desired output resolution, and -- unlike supersampling -- at no additional cost. Building on this foundation, we propose Thera, an end-to-end ASR method that substantially outperforms existing approaches, while being more parameter-efficient and offering strong theoretical guarantees. The project page is at <link redacted for anonymity>.
URL: https://openreview.net/forum?id=GU8YOfmqyg
---
Title: Consistency Trajectory Planning: High-Quality and Efficient Trajectory Optimization for Offline Model-Based Reinforcement Learning
Abstract: This paper introduces Consistency Trajectory Planning (CTP), a novel offline model-based reinforcement learning method that leverages the recently proposed Consistency Trajectory Model (CTM) for efficient trajectory optimization. While prior work applying diffusion models to planning has demonstrated strong performance, it often suffers from high computational costs due to iterative sampling procedures. CTP supports fast, single-step trajectory generation without significant degradation in policy quality. We evaluate CTP on the D4RL benchmark and show that it consistently outperforms existing diffusion-based planning methods in long-horizon, goal-conditioned tasks. Notably, CTP achieves higher normalized returns while using significantly fewer denoising steps. In particular, CTP achieves comparable performance with over $120\times$ speedup in inference time, demonstrating its practicality and effectiveness for high-performance, low-latency offline planning.
URL: https://openreview.net/forum?id=RVGkT9ISVf
---
Title: Epitope Generation for Peptide-based Cancer Vaccine using Goal-directed Wasserstein Generative Adversarial Network with Gradient Penalty
Abstract: We introduce a novel goal-directed Wasserstein Generative Adversarial Network with Gradient Penalty (GD-WGAN-GP) for training a generator capable of producing peptide sequences with high predicted immunogenicity and strong binding affinity to the human leukocyte antigen HLA-A*0201, thereby eliciting cytotoxic T-cell immune responses. The proposed GD-WGAN-GP incorporates a critic network to guide the generator in producing peptides with a strong binding affinity similar to those in the training set and a reward network to steer the generator toward producing sequences with high predicted immunogenicity. To avoid the generator prioritizing the objective of the critic at the expense of immunogenicity, we introduce a scaling factor to balance the influence of the reward in the loss of the generator. To reduce peptide repetition, we integrate the reward into the loss of the generator using two approaches: (1) a switching mechanism that excludes the reward term when duplicated peptides are present in a batch and otherwise multiplies it by a $\gamma_{max}$ parameter to control the reward's contribution, and (2) a repetition penalty from ORGAN, where each reward is divided by the number of occurrences of its corresponding peptide within the batch. Experiments on bladder cancer epitope sequences demonstrate that GD-WGAN-GP with the switching mechanism enables a tunable trade-off between the number of unique peptides and the average immunogenicity score via varying $\gamma_{max}$. Furthermore, the generator trained with GD-WGAN-GP using the ORGAN repetition penalty achieves an optimal balance of uniqueness and immunogenicity. Across multiple datasets, GD-WGAN-GP outperforms existing methods by effectively reducing peptide redundancy while preserving high immunogenicity scores and strong binding affinity. The Python code is provided at: \url{https://github.com/AnnonymousForPapers/GP-WGAN-GP_with_switch_and_ORGAN_penalty}.
URL: https://openreview.net/forum?id=Lff5AnexHJ
---
Title: The inexact power augmented Lagrangian method for constrained nonconvex optimization
Abstract: This work introduces an unconventional inexact augmented Lagrangian method where the augmenting term is a Euclidean norm raised to a power between one and two. The proposed algorithm is applicable to a broad class of constrained nonconvex minimization problems that involve nonlinear equality constraints. In a first part of this work, we conduct a full complexity analysis of the method under a mild regularity condition, leveraging an accelerated first-order algorithm for solving the Hölder-smooth subproblems. Interestingly, this worst-case result indicates that using lower powers for the augmenting term leads to faster constraint satisfaction, albeit with a slower decrease of the dual residual. Notably, our analysis does not assume boundedness of the iterates. Thereafter, we present an inexact proximal point method for solving the weakly-convex and Hölder-smooth subproblems, and demonstrate that the combined scheme attains an improved rate that reduces to the best-known convergence rate whenever the augmenting term is a classical squared Euclidean norm. Different augmenting terms, involving a lower power, further improve the primal complexity at the cost of the dual complexity. Finally, numerical experiments validate the practical performance of unconventional augmenting terms.
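A schematic way to write such an augmented Lagrangian for the problem of minimizing $f(x)$ subject to $c(x) = 0$, using notation introduced here only for illustration (the abstract specifies just that the power lies between one and two), is

    $$\mathcal{L}_{\rho,\nu}(x, y) = f(x) + \langle y, c(x) \rangle + \tfrac{\rho}{\nu}\, \lVert c(x) \rVert^{\nu}, \qquad \nu \in (1, 2],$$

where $\nu = 2$ recovers the classical squared-Euclidean augmenting term and lower powers correspond to the unconventional augmenting terms studied in the paper.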
URL: https://openreview.net/forum?id=63ANb4r7EM
---
Title: Optimizing Time Series Forecasting Architectures: A Hierarchical Neural Architecture Search Approach
Abstract: The rapid development of time series forecasting research has brought many deep learning-based modules to this field. However, despite the increasing number of new forecasting architectures, it is still unclear if we have leveraged the full potential of these existing modules within a properly designed architecture. In this work, we propose a novel hierarchical neural architecture search space for time series forecasting tasks. With the design of a hierarchical search space, we incorporate many architecture types designed for forecasting tasks and allow for the efficient combination of different forecasting architecture modules. Results on long-term time series forecasting tasks show that our approach can search for lightweight, high-performing forecasting architectures across different forecasting tasks.
URL: https://openreview.net/forum?id=Ym2wqojm4e
---
Title: Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation
Abstract: The capacity of foundation models allows for their application to new, unseen tasks. The adaptation to such tasks is called transfer learning. An efficient transfer learning method that circumvents parameter optimization is imprinting. It has been reinvented several times, but not systematically studied. In this work, we propose the general $\texttt{IMPRINT}$ framework, identifying three main components: generation, normalization, and aggregation. Through the lens of this framework, we conduct an in-depth analysis and a comparison of the existing methods. Our findings reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. Beyond an extensive analytical grounding, our framework enables us to propose a novel variant of imprinting which outperforms previous work on transfer learning tasks by $6\%$. This variant determines proxies through clustering motivated by the neural collapse phenomenon -- a connection that we draw for the first time. We publicly release our code at (link removed for review).
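As a rough illustration of the generation/normalization/aggregation split described above, the sketch below imprints several clustered proxies per class from frozen embeddings and classifies a query by its best-matching proxy. The function names and the use of k-means are assumptions made for illustration, not the IMPRINT reference implementation.

    # Hedged sketch of multi-proxy imprinting with frozen foundation-model embeddings.
    import numpy as np
    from sklearn.cluster import KMeans

    def imprint_proxies(features, labels, proxies_per_class=3):
        """features: (n, d) frozen-backbone embeddings; labels: (n,) class ids."""
        proxies = {}
        for c in np.unique(labels):
            class_feats = features[labels == c]
            # Generation: several proxies per class via clustering.
            k = min(proxies_per_class, len(class_feats))
            centers = KMeans(n_clusters=k, n_init=10).fit(class_feats).cluster_centers_
            # Normalization: unit-norm proxies.
            centers /= np.linalg.norm(centers, axis=1, keepdims=True)
            proxies[c] = centers
        return proxies

    def classify(query, proxies):
        """Aggregation: score a query embedding by its best-matching proxy per class."""
        q = query / np.linalg.norm(query)
        scores = {c: float(np.max(p @ q)) for c, p in proxies.items()}
        return max(scores, key=scores.get)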
URL: https://openreview.net/forum?id=duU11BnQ3Y
---
Title: Regret Analysis of Posterior Sampling-Based Expected Improvement for Bayesian Optimization
Abstract: Bayesian optimization is a powerful tool for optimizing an expensive-to-evaluate black-box function. In particular, the effectiveness of expected improvement (EI) has been demonstrated in a wide range of applications. However, theoretical analyses of EI are limited compared with other theoretically established algorithms. This paper analyzes a randomized variant of EI, which evaluates the EI from the maximum of the posterior sample path. We show that this posterior sampling-based random EI achieves sublinear Bayesian cumulative regret bounds under the assumption that the black-box function follows a Gaussian process. Finally, we demonstrate the effectiveness of the proposed method through numerical experiments.
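A minimal sketch of such a randomized acquisition, assuming a scikit-learn Gaussian process surrogate and a finite candidate set (both assumptions of this illustration, not the paper's setup): the improvement threshold is the maximum of one posterior sample path rather than the best observed value.

    # Hedged sketch: randomized EI with a posterior-sample maximum as threshold.
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def posterior_sampling_ei(gp: GaussianProcessRegressor, candidates, seed=0):
        """Return the candidate (row of `candidates`) maximizing the randomized EI."""
        # One sample path from the GP posterior over the candidate set.
        path = gp.sample_y(candidates, n_samples=1, random_state=seed).ravel()
        threshold = path.max()  # randomized incumbent instead of the best observation
        mu, sigma = gp.predict(candidates, return_std=True)
        sigma = np.maximum(sigma, 1e-12)
        z = (mu - threshold) / sigma
        ei = (mu - threshold) * norm.cdf(z) + sigma * norm.pdf(z)
        return candidates[np.argmax(ei)]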
URL: https://openreview.net/forum?id=v0s9knY99c
---
Title: Multi-Teacher Knowledge Distillation Augmented Group Relative Policy Optimization
Abstract: Transfer learning, a key paradigm for leveraging pre-existing knowledge, can significantly enhance reinforcement learning agents, particularly when dealing with Large Language Models (LLMs) and Small Language Models (SLMs). Knowledge Distillation (KD) provides a potent mechanism for this transfer from expert LLM teacher models to SLM student models. Group Relative Policy Optimization (GRPO) is a robust critic-free reinforcement learning algorithm effective for policy optimization by estimating advantage via intra-group reward comparisons. Standard GRPO, however, does not inherently incorporate guidance from external expert policies and can exhibit training instability. This paper introduces a novel theoretical framework to integrate multi-teacher KD with GRPO. We propose a family of GRPO-KD objective functions; our primary formulation augments GRPO with an explicit, adaptively weighted multi-teacher distillation term to preserve stability for the SLM training. We further explore two advanced strategies: one modifying the Kullback-Leibler (KL) regularization of GRPO, and another introducing a Teacher Agreement Score to directly modulate the advantage calculation for deeper guidance from multiple LLM teachers. Experimental results on benchmark reasoning tasks demonstrate that the proposed framework not only stabilizes the training process but also significantly outperforms standard GRPO and other baseline approaches, validating the effectiveness of synergizing critic-free RL with multi-teacher guidance.
URL: https://openreview.net/forum?id=3y9nyDrJxj
---
Title: Stabilizing black-box model selection with the inflated argmax
Abstract: Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. In this paper, we present a new approach to stabilizing model selection with theoretical stability guarantees that leverages a combination of bagging and an “inflated” argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka–Volterra model selection problem focused on identifying how competition in an ecosystem influences species’ abundances, and (c) a graph subset selection problem using cell-signaling data from proteomics. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks.
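The simplified sketch below conveys the flavor of combining bagging with an inflated argmax for LASSO support selection: each bootstrap fit votes for a support, and instead of returning the single most frequent support, every support whose vote share is within eps of the top one is kept. The vote-share formulation and all names here are a simplification for intuition only; the paper's inflated argmax is a specific operation with formal stability guarantees.

    # Simplified vote-share sketch of bagging + an inflated argmax over supports.
    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.utils import resample

    def stable_support_selection(X, y, n_bags=100, alpha=0.1, eps=0.05, seed=0):
        rng = np.random.RandomState(seed)
        votes = {}
        for _ in range(n_bags):
            Xb, yb = resample(X, y, random_state=rng)
            coef = Lasso(alpha=alpha).fit(Xb, yb).coef_
            support = tuple(np.flatnonzero(np.abs(coef) > 1e-8))
            votes[support] = votes.get(support, 0) + 1
        shares = {s: v / n_bags for s, v in votes.items()}
        best = max(shares.values())
        # "Inflated" selection: keep every support close to the top vote share,
        # instead of committing to a single argmax.
        return [s for s, v in shares.items() if v >= best - eps]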
URL: https://openreview.net/forum?id=DSDWHsQLgA
---
Title: Uncertainty-aware Reward Design Process
Abstract: Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging process due to the inefficiencies and inconsistencies inherent in conventional reward engineering methodologies. Recent advances have explored leveraging large language models (LLMs) to automate reward function design. However, their suboptimal performance in numerical optimization often yields unsatisfactory reward quality, while the evolutionary search paradigm demonstrates inefficient utilization of simulation resources, resulting in prohibitively lengthy design cycles with disproportionate computational overhead. To address these challenges, we propose the Uncertainty-aware Reward Design Process (URDP), a novel framework that integrates large language models to streamline reward function design and evaluation in RL environments. URDP quantifies candidate reward function uncertainty based on the self-consistency analysis, enabling simulation-free identification of ineffective reward components while discovering novel reward components. Furthermore, we introduce uncertainty-aware Bayesian optimization (UABO), which incorporates uncertainty estimation to significantly enhance hyperparameter configuration efficiency. Finally, we construct a bi-level optimization architecture by decoupling the reward component optimization and the hyperparameter tuning. URDP orchestrates synergistic collaboration between the reward logic reasoning of the LLMs and the numerical optimization strengths of the Bayesian Optimization. We conduct a comprehensive evaluation of URDP across 35 diverse tasks spanning three benchmark environments: IsaacGym, Bidexterous Manipulation, and ManiSkill2. Our experimental results demonstrate that URDP not only generates higher-quality reward functions but also achieves significant improvements in the efficiency of automated reward design compared to existing approaches.
URL: https://openreview.net/forum?id=CId5tW1HxR
---
Title: Learning to Rank Features to Enhance Graph Neural Networks for Graph Classification
Abstract: A common strategy to enhance the predictive performance of graph neural networks (GNNs) for graph classification is to extend input graphs with node- and graph-level features. However, identifying the optimal feature set for a specific learning task remains a significant challenge, often requiring domain-specific expertise. To address this, we propose a general two-step method that automatically selects a compact, informative subset from a large pool of candidate features to improve classification accuracy. In the first step, a GNN is trained to estimate the importance of each feature for a given graph. In the second step, the model generates feature rankings for the training graphs, which are then aggregated into a global ranking. A top-ranked subset is selected from this global ranking and used to train a downstream graph classification GNN. Experiments on real-world and synthetic datasets show that our method outperforms various baselines, including models using all candidate features, and achieves state-of-the-art results on several benchmarks.
URL: https://openreview.net/forum?id=WmZGvWRAWb
---
Title: GraphGini: Fostering Individual and Group Fairness in Graph Neural Networks
Abstract: Graph Neural Networks (GNNs) have demonstrated impressive performance across various tasks, leading to their increased adoption in high-stakes decision-making systems. However, concerns have arisen about GNNs potentially generating unfair decisions for underprivileged groups or individuals when lacking fairness constraints. This work addresses this issue by introducing GraphGini, a novel approach that incorporates the Gini coefficient to enhance both individual and group fairness within the GNN framework. We rigorously establish that the Gini coefficient offers greater robustness and promotes equal opportunity among GNN outcomes, advantages not afforded by the prevailing Lipschitz constant methodology. Additionally, we employ the Nash social welfare program to ensure our solution yields a Pareto optimal distribution of group fairness. GraphGini automatically balances the three optimization objectives of utility, individual fairness, and group fairness without requiring manual tuning of weight parameters. Extensive experimentation on real-world datasets demonstrates GraphGini's efficacy in significantly improving individual fairness compared to state-of-the-art methods while maintaining utility and group fairness.
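For reference, the Gini coefficient that GraphGini builds on can be computed for any vector of non-negative scores (for example, per-node utility or disparity values); the exact quantity it is applied to inside the GNN objective follows the paper, and this helper is only an illustration of the standard coefficient.

    # Gini coefficient of a vector of non-negative scores (0 = perfect equality).
    import numpy as np

    def gini(values):
        v = np.sort(np.asarray(values, dtype=float))
        n = v.size
        if n == 0 or v.sum() == 0:
            return 0.0
        cum = np.cumsum(v)
        # O(n log n) form of the mean-absolute-difference definition.
        return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)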
URL: https://openreview.net/forum?id=IEVGBI9MiL
---
Title: Multiway Multislice PHATE: Visualizing Hidden Dynamics of RNNs through Training
Abstract: Recurrent neural networks (RNNs) are a widely used tool for sequential data analysis; however, they are still often seen as black boxes. Visualizing the internal dynamics of RNNs is a critical step in understanding the functional principles of these networks and developing ideal model architectures and optimization strategies. Previous studies typically only emphasize the network representation post-training, overlooking its evolution throughout training. Here, we present Multiway Multislice PHATE (MM-PHATE), a novel method for visualizing the evolution of RNNs' hidden states. MM-PHATE is a graph-based embedding using structured kernels across the multiple dimensions spanned by RNNs: time, training epoch, and units. We demonstrate on various datasets that MM-PHATE uniquely preserves hidden representation community structure among units and identifies information processing and compression phases during training. The embedding allows users to look under the hood of RNNs across training and provides an intuitive and comprehensive strategy for understanding the network's internal dynamics, such as why and how one model outperforms another or how specific architectures impact an RNN's learning ability.
URL: https://openreview.net/forum?id=9Yr4V7iZsq
---
Title: H-FEX: A Symbolic Learning Method for Hamiltonian Systems
Abstract: Hamiltonian systems describe a broad class of dynamical systems governed by Hamiltonian functions, which encode the total energy and dictate the evolution of the system. Data-driven approaches, such as symbolic regression and neural network-based methods, provide a means to learn the governing equations of dynamical systems directly from observational data of Hamiltonian systems. However, these methods often struggle to accurately capture complex Hamiltonian functions while preserving energy conservation. To overcome this limitation, we propose the Finite Expression Method for learning Hamiltonian Systems (H-FEX), a symbolic learning method that introduces novel interaction nodes designed to capture intricate interaction terms effectively. Our experiments, including those on highly stiff dynamical systems, demonstrate that H-FEX can recover Hamiltonian functions of complex systems that accurately capture system dynamics and preserve energy over long time horizons. These findings highlight the potential of H-FEX as a powerful framework for discovering closed-form expressions of complex dynamical systems.
URL: https://openreview.net/forum?id=ksscGE8ySb
---
Title: Offline Learning and Forgetting for Reasoning with Large Language Models
Abstract: Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model’s search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown reasoning benchmarks show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
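One way to instantiate a learning-and-forgetting objective of this kind, assuming a Hugging Face-style causal language model and treating the forgetting term as an unlikelihood-style penalty on failed paths (both assumptions of this sketch, not necessarily the paper's exact formulation):

    # Hedged learn-and-forget loss, assuming a Hugging Face-style causal LM.
    import torch
    import torch.nn.functional as F

    def learn_forget_loss(model, good_batch, bad_batch, forget_weight=0.1):
        """good_batch / bad_batch: dicts with 'input_ids' and 'labels', shape (B, T);
        padding positions in 'labels' are marked with -100, as in common trainers."""
        # Learning term: next-token cross-entropy on successful reasoning paths.
        learn_loss = model(input_ids=good_batch["input_ids"],
                           labels=good_batch["labels"]).loss
        # Forgetting term: unlikelihood-style penalty on failed reasoning paths,
        # pushing probability mass away from their tokens.
        logits = model(input_ids=bad_batch["input_ids"]).logits[:, :-1]
        targets = bad_batch["labels"][:, 1:]
        log_p = F.log_softmax(logits, dim=-1)
        tok_logp = log_p.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
        mask = (targets != -100).float()
        forget_loss = -(torch.log1p(-tok_logp.exp() + 1e-6) * mask).sum() / mask.sum()
        return learn_loss + forget_weight * forget_loss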
URL: https://openreview.net/forum?id=RF6raEUATc
---
Title: Quantized Disentanglement: A Practical Approach
Abstract: Recent theoretical work established the unsupervised identifiability of quantized factors under any diffeomorphism. The theory assumes that quantization thresholds correspond to axis-aligned discontinuities in the probability density of the latent factors. By constraining a learned map to have a density with axis-aligned discontinuities, we can recover the quantization of the factors. However, translating this high-level principle into an effective practical criterion remains challenging, especially under nonlinear maps. Here, we develop a criterion for unsupervised disentanglement by encouraging axis-aligned discontinuities. Discontinuities manifest as sharp changes in the estimated density of factors and form what we call cliffs. Following the definition of independent discontinuities from the theory, we encourage the location of the cliffs along a factor to be independent of the values of the other factors. We show that our method, Cliff, outperforms the baselines on disentanglement benchmarks, demonstrating its effectiveness in unsupervised disentanglement.
URL: https://openreview.net/forum?id=uZ0aDRxo7H
---
Title: Uncertainty-quantified Pulse Signal Recovery from Facial Video using Regularized Stochastic Interpolants
Abstract: Imaging Photoplethysmography (iPPG), an optical procedure which recovers a human’s blood volume pulse (BVP) waveform using pixel readout from a camera, is an exciting research field with many researchers performing clinical studies of iPPG algorithms. While current algorithms for the iPPG task have shown outstanding performance on benchmark datasets, no state-of-the-art algorithm, to the best of our knowledge, performs test-time sampling of the solution space, precluding an uncertainty analysis that is critical for clinical applications. We address this deficiency through a new paradigm named Regularized Interpolants with Stochasticity for iPPG (RIS-iPPG). Modeling iPPG recovery as an inverse problem, we build probability paths that evolve the camera pixel distribution to the ground-truth signal distribution by predicting the instantaneous flow and score vectors; and at test time, we sample the posterior distribution of the correct BVP waveform given the camera pixel intensity measurements by solving a stochastic differential equation. Given that physiological changes are slowly varying, we show that iPPG recovery can be improved through regularization that maximizes the correlation between the residual flow vector predictions of two adjacent time windows. Experimental results on three datasets show that RIS-iPPG provides superior reconstruction quality and uncertainty estimates of the reconstruction, a critical tool for the widespread adoption of iPPG algorithms in clinical and consumer settings.
URL: https://openreview.net/forum?id=R8oqt6JEAY
---
Title: Can Weight Regularization in the Last Layer Reduce Dimensional Collapse?
Abstract: The dimensional collapse of representations in self-supervised learning is an ever-present issue. One notable technique to prevent such collapse of representations is using a multi-layered perceptron network called the projector. In several works, the projector has been found to heavily influence the quality of representations learned in a self-supervised pre-training task. However, the question still lingers: what role does the projector play? If it does prevent the collapse of representations, then why doesn’t the last layer of the encoder take up the role of the projector in the absence of an MLP one? In this work, we study what happens inside the projector by examining the rank dynamics of both the projector and the encoder through empirical study and analyses. Through mathematical analysis, we observe that the effect of rank reduction predominantly occurs in the last layer. Furthermore, we show that applying weight regularization only in the last layer yields better performance than when it is used on the whole network (WeRank), both with and without a projector. Empirical results justify that our interpretation of the role of the projector is correct.
URL: https://openreview.net/forum?id=PD532fs8b1
---
Title: TicketLLM: Next-Generation Sparse and Low-bit Transformers with Supermask-based Method
Abstract: Strong Lottery Tickets (SLTs) are subnetworks within a randomly weighted network uncovered by a binary mask called a supermask. They offer a promising approach to model compression by eliminating the need to store weights since their effective subnetwork can be regenerated from a fixed random seed and the supermask. However, extending this approach to large language models (LLMs) is non-trivial due to the limited scalability and inefficient training dynamics of existing SLT methods. To address these challenges, we propose Adaptive Supermask (Ada-Sup), a scalable and efficient method for discovering high-quality multi-bit supermasks through an innovative quantization-based approach. Building on this method, we introduce TicketLLM, a low-bit and sparse Transformer-based LLM architecture powered by Ada-Sup. Experimental results show that Ada-Sup can discover high-quality supermasks with significantly reduced training costs compared to previous methods in both binary and multi-bit settings. Furthermore, TicketLLM outperforms BitNet b1.58 on a 1.3B parameter model with the same memory per connection, achieving 0.08 lower perplexity while operating at a higher sparsity level (50% vs. 33%). These results highlight the potential of supermask-based methods as a promising approach for building lightweight LLMs. Code will be made available upon acceptance.
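For intuition, a basic binary-supermask layer in the strong-lottery-ticket setting looks roughly like the sketch below: the random weights stay frozen and only a score tensor is trained, with the top-scoring connections kept at each forward pass. This illustrates the standard binary starting point rather than Ada-Sup's multi-bit, quantization-based supermasks; all names are assumptions of the sketch.

    # Binary-supermask linear layer (strong-lottery-ticket style), for intuition only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SupermaskLinear(nn.Module):
        """Frozen random weights; only the score tensor is trained. The forward
        pass keeps the top (1 - sparsity) fraction of connections."""
        def __init__(self, in_features, out_features, sparsity=0.5):
            super().__init__()
            self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features),
                                       requires_grad=False)  # fixed random weights
            self.scores = nn.Parameter(0.01 * torch.randn(out_features, in_features))
            self.sparsity = sparsity

        def forward(self, x):
            k = int(self.scores.numel() * (1.0 - self.sparsity))
            flat = self.scores.flatten()
            threshold = flat.kthvalue(flat.numel() - k + 1).values
            mask = (self.scores >= threshold).float()
            # Straight-through trick so gradients reach the scores.
            mask = mask + self.scores - self.scores.detach()
            return F.linear(x, self.weight * mask)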
URL: https://openreview.net/forum?id=sE69HKykQw
---
Title: EdgeMask-DG*: Learning Domain-Invariant Graph Structures via Adversarial Edge Masking
Abstract: Structural shifts pose a significant challenge for graph neural networks, as graph topology acts as a covariate that can vary across domains. Existing domain generalization methods rely on fixed structural augmentations or training on globally perturbed graphs, mechanisms that do not pinpoint which specific edges encode domain-invariant information. We argue that domain-invariant structural information is not rigidly tied to a single topology but resides in the consensus across multiple graph structures derived from topology and feature similarity. To capture this, we first propose EdgeMask-DG, a novel min-max algorithm where an edge masker learns to find worst-case continuous masks subject to a sparsity constraint, compelling a task GNN to perform effectively under these adversarial structural perturbations. Building upon this, we introduce EdgeMask-DG*, an extension that applies this adversarial masking principle to an enriched graph. This enriched graph combines the original topology with feature-derived edges, allowing the model to discover invariances even when the original topology is noisy or domain-specific. At equilibrium, the structural patterns that the task GNN relies upon are, by design, robust and generalizable. EdgeMask-DG* is the first to systematically combine adaptive adversarial topology search with feature-enriched graphs. We provide a formal justification for our approach from a robust optimization perspective. We demonstrate that EdgeMask-DG* achieves new state-of-the-art performance on diverse graph domain generalization benchmarks, including citation networks, social networks, and temporal graphs. Notably, on the Cora OOD benchmark, EdgeMask-DG* lifts the worst-case domain accuracy to 78.0%, a +3.8 pp improvement over the prior state of the art (74.2%). The source code for our experiments can be found here: \url{https://anonymous.4open.science/r/TMLR-EAEF/}
URL: https://openreview.net/forum?id=vkfe8Ke7eC
---
Title: MemeSense: An Adaptive In-Context Framework for Social Commonsense Driven Meme Moderation
Abstract: Online memes are a powerful yet challenging medium for content moderation, often masking harmful intent behind humor, irony, or cultural symbolism. Conventional moderation systems, especially those relying on explicit text, frequently fail to recognize such subtle or implicit harm. We introduce MemeSense, an adaptive framework designed to generate socially grounded interventions for harmful memes by combining visual and textual understanding with curated, semantically aligned examples enriched with commonsense cues. This enables the model to detect nuanced, complex threats like misogyny, stereotyping, or vulgarity, even in memes lacking overt language. Across multiple benchmark datasets, MemeSense outperforms state-of-the-art methods, achieving up to 35% higher semantic similarity and a 9% improvement in BERTScore for non-textual memes, and notable gains for text-rich memes as well. These results highlight MemeSense as a promising step toward safer, more context-aware AI systems for real-world content moderation. The code is available at: https://anonymous.4open.science/r/MemeSense/
URL: https://openreview.net/forum?id=ahRqI3NBiq
---
Title: Group-robust Machine Unlearning
Abstract: Machine unlearning is an emerging paradigm to remove the influence of specific training data (i.e., the forget set) from a model while preserving its knowledge of the rest of the data (i.e., the retain set). Previous approaches assume the forget data to be uniformly distributed from all training datapoints. However, if the data to unlearn is dominant in one group (e.g., ethnicity, gender), we empirically show that performance for this group degrades, leading to fairness issues. To perform unlearning while preserving fairness, this work addresses the overlooked problem of non-uniformly distributed forget sets, which we refer to as group-robust machine unlearning. We formalize the problem and present a simple and effective exact unlearning strategy that mitigates the performance loss in dominant groups via sample distribution reweighting. Moreover, we present MIU (Mutual Information-aware Machine Unlearning), the first approach for group robustness in approximate machine unlearning. MIU minimizes the mutual information between model features and group information, achieving unlearning while reducing performance degradation in the dominant group of the forget set. Additionally, MIU exploits sample distribution reweighting and mutual information calibration with the original model to preserve group robustness. We conduct experiments on three datasets and show that MIU outperforms standard methods, achieving unlearning without compromising model robustness.
URL: https://openreview.net/forum?id=StSq7mpUVw
---
Title: PixelWorld: Towards Perceiving Everything as Pixels
Abstract: Recent agentic language models increasingly accept raw camera pixels rather than tokenized text, underscoring the need for a unified perception paradigm. We explore this idea through Perceive Everything as Pixels (PEAP) and release PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a single pixel space. Experiments show that PEAP attains competitive accuracy on semantic-understanding tasks, indicating that a vision transformer can capture global textual semantics without explicit tokens. In contrast, reasoning-intensive benchmarks (math and code) exhibit sharp performance drops; however, Chain-of-Thought prompting partially mitigates this gap, hinting that explicit reasoning traces compensate for the missing token structure. We also observe that scenarios with tightly intertwined visual–text cues benefit from the unified pixel view, reducing preprocessing overhead and ambiguity relative to split-modality baselines. PixelWorld therefore provides a compact yet challenging yardstick and encourages wider adoption of PEAP for holistic evaluation of next-generation vision–language agents.
URL: https://openreview.net/forum?id=uY5eDN2bML
---
Title: The Science of Evaluating Foundation Models
Abstract: The emergent phenomena of large foundation models have revolutionized natural language processing. However, evaluating these models presents significant challenges due to their size, capabilities, and deployment across diverse applications. Existing literature often focuses on individual aspects, such as benchmark performance or specific tasks, but fails to provide a cohesive process that integrates the nuances of diverse use cases with broader ethical and operational considerations. This work focuses on three key aspects: (1) Formalizing the Evaluation Process by providing a structured framework tailored to specific use-case contexts, (2) Offering Actionable Tools and Frameworks such as checklists and templates to ensure thorough, reproducible, and practical evaluations, and (3) Surveying Recent Work with a targeted review of advancements in LLM evaluation, emphasizing real-world applications.
URL: https://openreview.net/forum?id=Ty7lSwZ2aP
---
Title: A Hierarchical Nearest Neighbour Approach to Contextual Bandits
Abstract: In this paper, we consider the contextual bandit problem in metric spaces. We design and analyse an algorithm that can handle the fully adversarial problem, in which no assumptions are made about the space itself or the generation of contexts and losses. In addition to analysing our performance on general metric spaces, we further analyse the important special case in which the space is Euclidean, as well as the i.i.d. stochastic setting. Unlike previous work, our algorithm is adaptive to the local density of contexts and the smoothness of the decision boundary of the comparator policy, as well as other quantities. Our algorithm is highly efficient, having a per-trial time polylogarithmic in both the number of trials and the number of actions when the dimensionality of the metric space is bounded. We also give the results of real-world experiments, demonstrating the excellent performance of our algorithm.
URL: https://openreview.net/forum?id=4bJMIrI5oX
---
Title: Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers
Abstract: Document summarization facilitates efficient identification and assimilation of user-relevant content, a process inherently influenced by individual subjectivity. Discerning $\textit{subjective}$ salient information within a document, particularly when it has multiple facets, poses significant challenges. This complexity underscores the necessity for $\textit{personalized summarization}$. However, training models for personalized summarization has so far been challenging, particularly because diverse training data containing both user preference history (i.e., $\textit{click-skip}$ trajectory) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a rare resource in this direction. However, the training data only contains preference history $\textit{without any target summaries}$, thereby blocking end-to-end supervised learning. Also, the diversity in terms of topic transitions along the trajectory is relatively low, thereby leaving scope for better generalization. To address this, we first introduce a novel user preference data diversity evaluation metric, called DegreeD. We then propose PerAugy, a novel cross-trajectory shuffling and summary-content perturbation-based data augmentation technique that increases the DegreeD score and thereby significantly boosts the accuracy of four state-of-the-art (SOTA) baseline user-encoders commonly used in personalized summarization frameworks (best result: $0.132\uparrow$ w.r.t. AUC). We select two such SOTA summarizer frameworks as baselines and observe that when augmented with their corresponding improved user-encoders, they consistently show an increase in personalization (avg. boost: $61.2\%\uparrow$ w.r.t. the PSE-SU4 metric). This further establishes the efficacy of PerAugy as an augmentation method to boost personalized summarizers.
URL: https://openreview.net/forum?id=JVx7Qi8tz3
---
Title: Universal Black-Box Targeted Reward Poisoning Attack Against Online Deep Reinforcement Learning
Abstract: This work proposes the first universal black-box targeted attack against online reinforcement learning through reward poisoning during training time. Our attack is universally efficient against any efficient learning algorithm training in general RL environments and requires limited attack budgets and computational resources. We generalize a common feature of the efficient learning algorithms and assume that such algorithms would mostly take the optimal actions or actions close to them during training. We quantify the efficiency of an attack and propose an attack framework where it is feasible to evaluate the efficiency of any attack instance in the framework based on the assumption. Finally, we find an instance in the framework that requires a minimal per-step perturbation, which we call the "adaptive target attack." We theoretically analyze and prove a lower bound for the attack efficiency of our attack in the general RL setting. Empirically, on a diverse set of popular DRL environments learned by state-of-the-art DRL algorithms, we verify that our attack efficiently leads the learning agent to various target policies with limited budgets.
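In the spirit of the per-step perturbation described above, a targeted reward-poisoning rule can be as simple as the following schematic; the constant delta and the interface are illustrative assumptions, not the paper's exact attack construction.

    # Schematic per-step targeted reward perturbation (illustration only).
    def poison_reward(state, action, reward, target_policy, delta=1.0):
        """Leave the reward unchanged when the agent plays the attacker's target
        action, and lower it by delta otherwise, so the target policy looks optimal."""
        if action == target_policy(state):
            return reward
        return reward - delta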
URL: https://openreview.net/forum?id=MX0aDKu8lY
---
Title: Evaluating Disparities in the Quality of Post hoc Explanations when the Explained Blackboxes are subjected to Fairness Constraints
Abstract: In recent years, the application of machine learning models in critical domains has raised significant concerns regarding the fairness and interpretability of their predictions. This study investigates the disparities in the quality of post-hoc explanations generated for complex black-box models, specifically focusing on the influence of fairness constraints on these explanations across diverse demographic groups. Utilizing datasets from ACSIncome, ACSEmployment, and COMPAS, we employ explanation methods such as LIME and KernelSHAP to evaluate metrics including Maximum Fidelity Gap from Average (MFGA), Consistency, and Stability. Our findings reveal that the imposition of fairness constraints impacts the fidelity and consistency of explanations, with notable variations observed between demographic groups. While some datasets demonstrate equitable explanation quality across genders, significant biases persist in others, particularly affecting younger individuals and racial minorities. The research highlights the necessity for robust fairness-preserving techniques in post-hoc explanations and underscores the critical need for transparency in AI-driven decision-making processes. By correlating model unfairness with disparities in explanation quality, this work aims to contribute to the ongoing discourse on ethical AI, emphasizing the importance of both accuracy and fairness in machine learning applications.
URL: https://openreview.net/forum?id=b0Uq58Ef6y
---
Title: Information-Guided Diffusion Sampling for Dataset Distillation
Abstract: Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: ($i$) \textit{prototype information} $\mathrm{I}(X;Y)$, which captures label-relevant features; and ($ii$) \textit{contextual information} $\mathrm{H}(X | Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + \beta \mathrm{H}(X | Y)$ during the DM sampling process, where $\beta$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X | Y)$ is intractable, we develop \textit{variational estimations} to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code is available at \url{https://anonymous.4open.science/r/IGDS-4C0F/}.
URL: https://openreview.net/forum?id=LwLyfyWMpk
---
Title: DiNAT-IR: Exploring Dilated Neighborhood Attention for High-Quality Image Restoration
Abstract: Transformers, with their self-attention mechanisms for modeling long-range dependencies, have become a dominant paradigm in image restoration tasks. However, the high computational cost of self-attention limits scalability to high-resolution images, making efficiency–quality trade-offs a key research focus. To address this, Restormer employs channel-wise self-attention, which computes attention across channels instead of spatial dimensions. While effective, this approach may overlook localized artifacts that are crucial for high-quality image restoration. To bridge this gap, we explore Dilated Neighborhood Attention (DiNA) as a promising alternative, inspired by its success in high-level vision tasks. DiNA balances global context and local precision by integrating sliding-window attention with mixed dilation factors, effectively expanding the receptive field without excessive overhead. However, our preliminary experiments indicate that directly applying this global-local design to the classic deblurring task hinders accurate visual restoration, primarily due to the constrained global context understanding within local attention. To address this, we introduce a channel-aware module that complements local attention, effectively integrating global context without sacrificing pixel-level precision. The proposed DiNAT-IR, a Transformer-based architecture specifically designed for image restoration, achieves competitive results across multiple benchmarks, offering a high-quality solution for diverse low-level computer vision problems.
URL: https://openreview.net/forum?id=d4EqLcWpN0
---
Title: On the Benefits of Instance Decomposition in Video Prediction Models
Abstract: Video prediction is a crucial task for intelligent agents such as robots and autonomous vehicles, since it enables them to anticipate and act early on time-critical incidents. State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects. This is challenging and potentially sub-optimal, as every object in a dynamic scene has its own pattern of movement, typically somewhat independent of the others. In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models. We conduct detailed and carefully-controlled experiments on both synthetic and real-world datasets; our results show that decomposing a dynamic scene leads to higher quality predictions compared with models of a similar capacity that lack such decomposition.
URL: https://openreview.net/forum?id=lyqhffQbS7
---
Title: Tractable Representation Learning with Probabilistic Circuits
Abstract: Probabilistic circuits (PCs) are powerful probabilistic models that enable exact and tractable inference, making them highly suitable for probabilistic reasoning and inference tasks. While representation learning is dominant in neural networks, it remains underexplored for PCs; prior approaches rely on external neural embeddings or activation-based encodings. To address this gap, we introduce autoencoding probabilistic circuits (APCs), a novel framework leveraging the tractability of PCs to model probabilistic embeddings explicitly. APCs extend PCs by jointly modeling data and embeddings, obtaining embedding representations through tractable probabilistic inference. The PC encoder allows the framework to natively handle arbitrary missing data and is seamlessly integrated with a neural decoder in a hybrid, end-to-end trainable architecture enabled by differentiable sampling. Our empirical evaluation demonstrates that APCs outperform existing PC-based autoencoding methods in reconstruction quality, generate embeddings competitive with, and exhibit superior robustness in handling missing data compared to neural autoencoders. These results highlight APCs as a powerful and flexible representation learning method that exploits the probabilistic inference capabilities of PCs, showing promising directions for robust inference, out-of-distribution detection, and knowledge distillation.
URL: https://openreview.net/forum?id=h8D75pVKja
---
Title: Proximal Regularization of Deep Residual Neural Networks with An Application to Genomic Prediction
Abstract: Residual neural networks (ResNets) have become widely used as they allow for smooth and efficient training of deep neural network architectures. However, when trained on small, noisy and high-dimensional data, ResNets may suffer from overfitting due to the large amount of parameters. As a solution, a range of regularization methods have been proposed. One promising approach relies on the proximal mapping technique which is computationally efficient since it can be directly incorporated into the optimization algorithm. However, the performance of ResNets with various convex or non-convex proximal regularizers remains under-explored on high-dimensional data. In our study, we develop a stochastic adaptive proximal gradient ResNet method that can handle both convex and non-convex regularizers that range from $L_0$ to $L_{\infty}$. Moreover, we evaluate the prediction performance in a supervised regression setting on three high-dimensional genomic data sets from mice, pigs, and wheat. Traditional sparse linear proximal gradient methods are also implemented with the same regularizers and evaluated for comparison. Experimental results demonstrate that a ResNet with 18 layers and $L_{\frac{1}{2}}$ regularization outperforms other configurations on both the mouse and pig datasets, as well as the sparse linear proximal gradient methods across all the datasets. For the wheat data, a 15-layer ResNet configuration achieves the lowest test mean squared error. These findings highlight the effectiveness of our regularized adaptive proximal gradient ResNet method and its potential for prediction tasks on high-dimensional genomic data.
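As a concrete reference point, a single stochastic proximal-gradient update with an $L_1$ regularizer reduces to a gradient step followed by soft-thresholding, as sketched below; the paper additionally covers non-convex regularizers such as $L_{\frac{1}{2}}$, which have their own proximal mappings, and adaptive step sizes, which this sketch omits.

    # One stochastic proximal-gradient step with an L1 regularizer (illustration).
    import numpy as np

    def soft_threshold(w, tau):
        """Proximal operator of tau * ||w||_1 (soft-thresholding)."""
        return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

    def proximal_sgd_step(w, grad, lr, lam):
        """Gradient step on the smooth data-fit term, then the prox of the regularizer."""
        return soft_threshold(w - lr * grad, lr * lam)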
URL: https://openreview.net/forum?id=wbO4vBl69i
---
Title: Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training
Abstract: Self-supervised learning (SSL) has emerged as a central paradigm for training foundation models by leveraging large-scale unlabeled datasets, often producing representations with strong generalization capabilities. These models are typically pre-trained on general-purpose datasets such as ImageNet and subsequently adapted to various downstream tasks through finetuning. While recent advances have explored parameter-efficient strategies for adapting pre-trained models, extending SSL pre-training itself to new domains—particularly under limited data regimes and for dense prediction tasks—remains underexplored. In this work, we address the problem of adapting vision foundation models to new domains in an unsupervised and data-efficient manner, specifically targeting downstream semantic segmentation. We propose GLARE (Global Local and Regional Enforcement), a novel continual self-supervised pre-training task designed to enhance downstream segmentation performance. GLARE introduces patch-level augmentations to encourage local consistency and incorporates a regional consistency constraint that leverages spatial semantics in the data. For efficient continual pre-training, we initialize Vision Transformers (ViTs) with weights from existing SSL models and update only lightweight adapter modules—specifically UniAdapter—while keeping the rest of the backbone frozen. Experiments across multiple semantic segmentation benchmarks on different domains demonstrate that GLARE consistently improves downstream performance with minimal computational and parameter overhead.
URL: https://openreview.net/forum?id=Ax9Y4W0g7s
---
Title: Improving Generalization in ML models via Causal Interaction Constraints
Abstract: Machine learning models are effective in identifying patterns within independently and identically distributed (i.i.d.) data. However, this assumption rarely holds in real-world applications, where violations of i.i.d. can hinder both generalization and explainability. Causal Machine Learning is an emerging discipline that addresses these limitations by integrating causal reasoning, an element typically absent from conventional approaches. In this work, we introduce a novel causal machine learning strategy that emphasizes the role of spurious variable interactions, a concept grounded in the Independent Causal Mechanisms (ICM) principle. We argue that recognizing and constraining these spurious interactions is essential for improving model robustness and interpretability. To that end, we introduce a novel approach for incorporating interaction restrictions into neural network architectures and tree-based models. When applied to real-world scenarios, our method demonstrates that predictive models explicitly constrained to avoid spurious interactions exhibit enhanced generalization performance across diverse domains, outperforming their unconstrained counterparts.
URL: https://openreview.net/forum?id=VgCOfTJDD2
---
Title: AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving
Abstract: Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs, a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives, including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs, an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems.
URL: https://openreview.net/forum?id=z2VZl6sH7T
---
Title: Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
Abstract: The remarkable success of large language pretraining and the discovery of the empirical scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold true in the scaling-centric, large language model (LLM) era, including explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed “scaling law crossover,” where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not generalize to larger ones. Together, these observations highlight two fundamental questions within this new paradigm:
• Guiding principles for scaling: if regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?
• Model comparison at scale: how can models be reliably and effectively compared at the scale where only a single experiment is feasible?
URL: https://openreview.net/forum?id=bUXYGz0OAT
---
Title: A Mean Field Reinforcement Learning Approach to Large-Scale Vehicle Routing Problems
Abstract: Solving large-scale vehicle routing problems (VRPs) is NP-hard and poses a computational challenge in numerous applications such as logistics. Meanwhile, mean field control (MFC) provides a tractable and rigorous approach to controlling many agents. We provide a solution to pickup-and-delivery VRPs via scalable MFC. In combination with reinforcement learning (RL) and clustering, our MFC approach efficiently scales to large-scale VRPs. We perform a theoretical analysis of our MFC-based approximation, giving convergence results for large VRP instances and error bounds for clustering-based approximations. We verify our algorithms on different datasets and compare them against solutions such as OR-Tools, PyVRP and heuristics, showing scalability in terms of speed for mean-field methods, for the first time in discrete optimization. Overall, our work establishes a novel synthesis of MFC-based RL techniques, vehicle routing problems and clustering approximations, to solve a hard discrete optimization problem of practical use in a scalable way.
URL: https://openreview.net/forum?id=E8JRswdyDR
---
Title: An Evolutionary Algorithm for Black-Box Adversarial Attack Against Explainable Methods
Abstract: The challenge of deep neural network (DNN) explainability continues to be a significant hurdle in developing trustworthy AI, particularly in essential fields like medical imaging. Despite progress in explainable AI (XAI), these methods remain susceptible to adversarial images, emphasizing the urgent need for robustness evaluation. While many current adversarial attack techniques focus on specific explanation strategies, emerging research has introduced black-box methods capable of targeting multiple approaches. However, such methods often necessitate a large number of queries due to the complexity of pixel-level modifications. In response, we propose an innovative attack method that employs semi-transparent, RGB-valued circles to create perturbations, optimizing their features via an evolutionary strategy, drastically reducing the number of tunable optimization parameters required. Through experiments on medical image datasets, our method demonstrates superior performance compared to current leading techniques. This study further underscores the vulnerabilities of XAI methods in critical sectors such as medical imaging, advocating for more robust solutions.
URL: https://openreview.net/forum?id=MlUP5Euj6S
---
Title: AuToMATo: An Out-Of-The-Box Persistence-Based Clustering Algorithm
Abstract: We present AuToMATo, a novel clustering algorithm based on persistent homology. While AuToMATo is not parameter-free per se, we provide default choices for its parameters that turn it into an out-of-the-box clustering algorithm that performs well across the board. AuToMATo combines the existing ToMATo clustering algorithm with a bootstrapping procedure in order to separate significant peaks of an estimated density function from non-significant ones. We perform a thorough comparison of AuToMATo (with its parameters fixed to their defaults) against many other state-of-the-art clustering algorithms. We find not only that AuToMATo compares favorably against parameter-free clustering algorithms, but also that in many instances it significantly outperforms even the best selection of parameters for other algorithms. AuToMATo is motivated by applications in topological data analysis, in particular the Mapper algorithm, where it is desirable to work with a clustering algorithm that does not need tuning of its parameters. Indeed, we provide evidence that AuToMATo performs well when used with Mapper. Finally, we provide an open-source implementation of AuToMATo in Python that is fully compatible with the standard scikit-learn architecture.
URL: https://openreview.net/forum?id=Qd7H5mAbzV
---
Title: Active Prompt Learning with Vision-Language Model Priors
Abstract: Vision-language models (VLMs) have demonstrated remarkable zero-shot performance across various classification tasks. Nonetheless, their reliance on hand-crafted text prompts for each task hinders efficient adaptation to new tasks. While prompt learning offers a promising solution, most studies focus on maximizing the utilization of given few-shot labeled datasets, often overlooking the potential of careful data selection strategies, which enable higher accuracy with fewer labeled data. This motivates us to study a budget-efficient active prompt learning framework. Specifically, we introduce a class-guided clustering that leverages the pre-trained image and text encoders of VLMs, thereby enabling our cluster-balanced acquisition function from the initial round of active learning. Furthermore, considering the substantial class-wise variance in confidence exhibited by VLMs, we propose a budget-saving selective querying based on adaptive class-wise thresholds. Extensive experiments in active learning scenarios across seven datasets demonstrate that our method outperforms existing baselines.
URL: https://openreview.net/forum?id=qBeGCzD3Ij
---
Title: Unified People Tracking with Graph Neural Networks
Abstract: This work presents a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories without relying on pre-computed tracklets. The model builds a dynamic spatiotemporal graph that aggregates spatial, contextual, and temporal information, enabling seamless information propagation across entire sequences. To improve occlusion handling, the graph can also encode scene-specific information. We also introduce a new large-scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions. Experiments show the model achieves state-of-the-art performance on public benchmarks and the new dataset, with flexibility across diverse conditions. Both the dataset and approach will be publicly released to advance research in multi-people tracking.
URL: https://openreview.net/forum?id=rt6PFpGtv1
---
Title: Learning Is a Kan Extension
Abstract: Previous work has demonstrated that efficient algorithms exist for computing Kan extensions and that some Kan extensions have interesting similarities to various machine learning algorithms. This paper closes the gap by proving that all error minimisation algorithms may be presented as a Kan extension. This result provides a foundation for future work to investigate the optimisation of machine learning algorithms through their presentation as Kan extensions. A corollary of this representation of error-minimising algorithms is a presentation of error from the perspective of lossy and lossless transformations of data.
URL: https://openreview.net/forum?id=xWKtKdeefL
---
Title: A General Constraint for Gaussian Latent Variables
Abstract: Encoder-based generative models fundamentally rely on the structure of their latent space to achieve high-quality image reconstruction, generation, and semantic manipulation. In latent spaces, a multivariate Gaussian distribution is often desirable due to its closure under linear transformations. To approximate this, most existing methods impose a standard Gaussian prior via Kullback-Leibler (KL) divergence, which assumes independence among latent components. However, real-world latent representations typically exhibit strong internal correlations, rendering the independence assumption inadequate. In this work, we apply random projection theory to analyze how latent representations differ from a target multivariate Gaussian distribution. We prove that the normalized third absolute moment in low-dimensional subspaces effectively quantifies such deviations. Building on this result, we propose a regularization method that encourages the latent space to align with a multivariate Gaussian distribution without an independence assumption across dimensions. The method is compatible with a wide range of encoder-based architectures and introduces no additional computational overhead. We validate the effectiveness of our method through extensive experiments across diverse models. The results consistently show improvements in generation quality, semantic editability, and alignment with the target latent distribution, demonstrating the practical value of the proposed regularization.
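A simplified sketch of a penalty in this spirit, using one-dimensional random projections and a plug-in estimate of the normalized third absolute moment (the subspace dimension and estimator used in the paper may differ; the Gaussian reference value E|Z|^3 = 2*sqrt(2/pi) is the standard one for a standardized normal variable):

    # Penalty on the normalized third absolute moment of random 1-D projections.
    import numpy as np

    GAUSS_ABS_M3 = 2.0 * np.sqrt(2.0 / np.pi)  # E|Z|^3 for a standard normal Z

    def third_moment_penalty(z, num_directions=64, seed=0):
        """z: (n, d) batch of latent codes; returns a scalar deviation measure."""
        rng = np.random.default_rng(seed)
        dirs = rng.standard_normal((num_directions, z.shape[1]))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        proj = z @ dirs.T                                     # (n, num_directions)
        proj = (proj - proj.mean(0)) / (proj.std(0) + 1e-8)   # standardize projections
        m3 = np.mean(np.abs(proj) ** 3, axis=0)
        return float(np.mean(np.abs(m3 - GAUSS_ABS_M3)))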
URL: https://openreview.net/forum?id=A76w7FTEXR
---
Title: Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess
Abstract: As humans seek to collaborate with, learn from, and better understand artificial intelligence systems, developing AIs that can accurately emulate individual decision-making becomes increasingly important. Chess, a long-standing AI benchmark with precise skill measurement, offers an ideal testbed for human-AI alignment. However, existing approaches to modeling human behavior require prohibitively large amounts of data from each individual, making them impractical for new or sparsely represented users. In this work, we introduce Maia4All, a framework designed to learn and adapt to individual decision-making styles efficiently, even with limited data. Maia4All achieves this through a two-stage optimization process: (1) an enrichment step, which bridges population and individual-level human behavior modeling with a prototype-enriched model, and (2) a democratization step, which leverages ability levels or user prototypes to initialize and refine individual embeddings with minimal data. Our experimental results show that Maia4All can accurately predict individual moves and profile behavioral patterns with high fidelity, establishing a new standard for personalized human-like AI behavior modeling in chess. Maia4All achieves individual human behavior modeling in chess with only 20 games, compared to the 5,000 games required previously, representing a significant improvement in data efficiency. Our work provides an example of how population AI systems can flexibly adapt to individual users using a prototype-enriched model as a bridge. This approach extends beyond chess, as shown in our case study on idiosyncratic LLMs, highlighting its potential for broader applications in personalized AI adaptation.
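A purely illustrative reading of the "democratization step": warm-start a new player's embedding from rating-band prototypes before refining it on the player's few games. The function name, the interpolation scheme, and the use of ratings are assumptions, not the authors' API.

import torch

def init_individual_embedding(prototype_embeddings, prototype_ratings, player_rating):
    # prototype_embeddings: (P, D) learned prototypes, one per ascending rating band.
    ratings = torch.tensor(prototype_ratings, dtype=torch.float)
    idx = int(torch.searchsorted(ratings, torch.tensor(float(player_rating))))
    below, above = max(idx - 1, 0), min(idx, len(ratings) - 1)
    if above == below:
        return prototype_embeddings[below].clone()
    w = (player_rating - ratings[below]) / (ratings[above] - ratings[below])
    # Interpolate between the two nearest prototypes, then fine-tune on the player's games.
    return (1 - w) * prototype_embeddings[below] + w * prototype_embeddings[above]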
URL: https://openreview.net/forum?id=iw4kjcw319
---
Title: On Verbalized Confidence Scores for LLMs
Abstract: The rise of large language models (LLMs) and their tight integration into our daily life make it essential to dedicate efforts towards their trustworthiness. Uncertainty quantification for LLMs can establish more human trust in their responses, but also allows LLM agents to make more informed decisions based on each other’s uncertainty. To estimate the uncertainty in a response, internal token logits, task-specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens, which is a promising way for prompt- and model-agnostic uncertainty quantification with low overhead. Using an extensive benchmark, we assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that it is possible to extract well-calibrated confidence scores with certain prompt methods. We argue that verbalized confidence scores can become a simple but effective and versatile uncertainty quantification method in the future. Our code is available at ***.
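A minimal, hedged illustration of eliciting a verbalized confidence score: append a confidence request to the prompt and parse the number from the reply. The exact prompt phrasings benchmarked in the paper differ, and the helper names are hypothetical.

import re

def build_prompt(question):
    return (f"{question}\n"
            "Answer the question, then on a new line write "
            "'Confidence: <number between 0 and 100>'.")

def parse_confidence(reply):
    match = re.search(r"Confidence:\s*(\d+(?:\.\d+)?)", reply)
    return float(match.group(1)) / 100.0 if match else None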
URL: https://openreview.net/forum?id=znJYiHnvWk
---
Title: Improved seeding strategies for k-means and k-GMM
Abstract: We revisit the randomized seeding techniques for k-means clustering and k-GMM (Gaussian mixture model fitting with Expectation-Maximization), formalizing their three key ingredients: the metric used for seed sampling, the number of candidate seeds, and the metric used for seed selection. This analysis yields novel families of initialization methods exploiting a lookahead principle, which conditions the seed selection on improved coherence with the final metric used to assess the algorithm, and a multipass strategy to temper the effect of randomization. Experiments show a consistent constant-factor improvement over classical contenders in terms of the final metric (SSE for k-means, log-likelihood for k-GMM), at a modest overhead. In particular, for k-means, our methods improve on the recently designed multi-swap strategy, which was the first to outperform the greedy k-means++ seeding. Our experimental analysis also sheds light on subtle, often overlooked properties of k-means, including the (lack of) correlation between the SSE upon seeding and the final SSE, the variance-reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods. Practically, our most effective seeding methods are strong candidates to become one of the standard techniques, if not the standard. From a theoretical perspective, our formalization of seeding opens the door to a new line of analytical approaches.
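The following is an illustrative sketch (not the authors' exact algorithm) of the lookahead principle: sample candidate seeds with the usual D^2 metric, but select the candidate that minimizes the final assessment metric (SSE for k-means).

import numpy as np

def lookahead_seeding(X, k, num_candidates=8, rng=None):
    rng = np.random.default_rng(rng)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        candidates = X[rng.choice(len(X), size=num_candidates, p=d2 / d2.sum())]
        # Lookahead: score each candidate by the SSE it would induce if added.
        sse = [np.sum(np.minimum(d2, np.sum((X - c) ** 2, axis=1))) for c in candidates]
        seeds.append(candidates[int(np.argmin(sse))])
    return np.array(seeds)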
URL: https://openreview.net/forum?id=4Ut2YnekhN
---
Title: UMP-Net: Uncertainty-Aware Mixture of Prompts Network for Efficient Instruction Tuning
Abstract: Instruction tuning has greatly improved how large language models (LLMs) respond to human-like instructions. However, fully fine-tuning these models is still computationally demanding, and many existing parameter-efficient methods fall short—particularly when it comes to uncertainty estimation and working effectively across different modalities. To address this, we introduce UMP-Net (Uncertainty-Aware Mixture of Prompts Network), a new approach designed to enhance the ability of LLaMA to follow instructions. UMP-Net combines a novel mixture of prompts (MoPs) technique with Latent Noise Prompting, KNN-based Heterogeneous Clustering, and Conformal Predictions to select the most reliable prompts dynamically while accounting for uncertainty. In addition, it features a CLIP-based multi-modal architecture to streamline vision-language integration. We evaluated UMP-Net on a range of benchmarks including ScienceQA, COCO Caption, and various zero-shot multi-modal tasks. The results show strong performance: an average accuracy of 88.41% on ScienceQA and a CIDEr score of 158.3 on COCO Caption—surpassing models such as LLaVA, LLaMA-Adapter, and LLaMA-Excitor. These findings suggest that UMP-Net offers both improved multi-modal capability and computational efficiency.
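A generic, hedged sketch of a mixture-of-prompts layer, showing only the gating-and-mixing core; the latent noise prompting, KNN-based heterogeneous clustering, and conformal prediction components described in the abstract are omitted, and the class below is not the authors' implementation.

import torch
import torch.nn as nn

class MixtureOfPrompts(nn.Module):
    def __init__(self, num_prompts, prompt_len, hidden_dim):
        super().__init__()
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, prompt_len, hidden_dim))
        self.gate = nn.Linear(hidden_dim, num_prompts)

    def forward(self, pooled_input):                               # (batch, hidden_dim)
        weights = torch.softmax(self.gate(pooled_input), dim=-1)   # (batch, num_prompts)
        # Weighted combination of soft prompts, to be prepended to the input sequence.
        return torch.einsum("bn,nld->bld", weights, self.prompts)  # (batch, prompt_len, hidden_dim)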
URL: https://openreview.net/forum?id=EehtvgNXAl
---
Title: Improved Localized Machine Unlearning Through the Lens of Memorization
Abstract: Machine unlearning refers to removing the influence of a specified subset of training data from a model efficiently, after it has already been trained. This is important for key applications, including making the model more accurate by removing outdated, mislabeled, or poisoned data. In this paper, we draw inspiration from prior work that attempts to identify where in the network a given example is memorized, to propose a new "localized unlearning" algorithm, Deletion by Example Localization (DEL). DEL has two components: a localization strategy that identifies critical parameters for a given set of examples, and a simple unlearning algorithm that finetunes only the critical parameters on the data we want to retain. Through extensive experiments, we find that our localization strategy outperforms prior strategies in terms of metrics of interest for unlearning and test accuracy, and pairs well with various unlearning algorithms. Our experiments on different datasets, forget sets, and metrics reveal that DEL outperforms prior work in producing better trade-offs between unlearning performance and accuracy.
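A hedged sketch of the two components described above: localize a small set of critical parameters for the forget set (scored here by accumulated gradient magnitude, which may differ from the paper's localization criterion) and fine-tune only those parameters on retained data.

import torch

def localize_critical_params(model, forget_loader, loss_fn, top_fraction=0.01):
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in forget_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.abs()
    flat = torch.cat([s.flatten() for s in scores.values()])
    threshold = torch.quantile(flat, 1.0 - top_fraction)
    return {n: (s >= threshold) for n, s in scores.items()}   # boolean masks of critical parameters

def finetune_on_retain(model, retain_loader, loss_fn, masks, lr=1e-4, epochs=1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in retain_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            for n, p in model.named_parameters():
                if p.grad is not None:
                    p.grad *= masks[n].to(p.grad.dtype)        # update only the localized parameters
            opt.step()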
URL: https://openreview.net/forum?id=zXAVdHYPIB
---
Title: Even Faster Hyperbolic Random Forests: A Beltrami-Klein Wrapper Approach
Abstract: Decision trees and models that use them as primitives are workhorses of machine learning in Euclidean spaces. Recent work has further extended these models to the Lorentz model of hyperbolic space by replacing axis-parallel hyperplanes with homogeneous hyperplanes when partitioning the input space. In this paper, we show how the HyperDT algorithm can be elegantly reexpressed in the Beltrami-Klein model of hyperbolic spaces. This preserves the thresholding operation used in Euclidean decision trees, enabling us to further rewrite HyperDT as simple pre- and post-processing steps that form a wrapper around existing tree-based models designed for Euclidean spaces. The wrapper approach unlocks many optimizations already available in Euclidean space models, improving flexibility, speed, and accuracy while offering a simpler, more maintainable, and extensible codebase.
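A minimal sketch of the wrapper idea: map points from the Lorentz (hyperboloid) model to the Beltrami-Klein ball, where the relevant decision boundaries become ordinary linear thresholds, then reuse an off-the-shelf Euclidean tree ensemble. Class and function names are illustrative, not the paper's API.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def lorentz_to_klein(X_lorentz):
    # X_lorentz: (N, d+1) points on the hyperboloid, time-like coordinate in column 0.
    return X_lorentz[:, 1:] / X_lorentz[:, :1]

class KleinWrappedForest:
    def __init__(self, **forest_kwargs):
        self.model = RandomForestClassifier(**forest_kwargs)

    def fit(self, X_lorentz, y):
        self.model.fit(lorentz_to_klein(X_lorentz), y)   # pre-processing, then a standard Euclidean forest
        return self

    def predict(self, X_lorentz):
        return self.model.predict(lorentz_to_klein(X_lorentz))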
URL: https://openreview.net/forum?id=J981sCATKv
---