Weekly TMLR digest for Dec 07, 2025

TMLR

Dec 7, 2025, 12:00:15 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification, J2C Certification: Angular Regularization for Positive-Unlabeled Learning on the Hypersphere

Vasileios Sevetlidis, George Pavlidis, Antonios Gasteratos

https://openreview.net/forum?id=XQhO0Ly6el

---


Survey Certification, Featured Certification: Cognitive Architectures for Language Agents

Theodore Sumers, Shunyu Yao, Karthik R Narasimhan, Thomas L. Griffiths

https://openreview.net/forum?id=1i6ZCvflQJ

---


Featured Certification, Outstanding Certification: Mantis: Interleaved Multi-Image Instruction Tuning

Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen

https://openreview.net/forum?id=skLtdUVaJa

---


Accepted papers
===============


Title: Causal Ordering for Structure Learning from Time Series

Authors: Pedro Sanchez, Damian Machlanski, Steven McDonagh, Sotirios A. Tsaftaris

Abstract: Predicting causal structure from time series data is crucial for understanding complex phenomena in physiology, brain connectivity, climate dynamics, and socio-economic behaviour. Causal discovery in time series is hindered by the combinatorial complexity of identifying true causal relationships, especially as the number of variables and time points grows. A common way to simplify the task is to use ordering-based methods. Traditional ordering methods inherently limit the representational capacity of the resulting model. In this work, we fix this issue by leveraging multiple valid causal orderings, instead of a single one as is standard practice. We propose DOTS (Diffusion Ordered Temporal Structure), which uses diffusion-based causal discovery for temporal data. By integrating multiple orderings, DOTS effectively recovers the transitive closure of the underlying directed acyclic graph (DAG), mitigating spurious artifacts inherent in single-ordering approaches. We formalise the problem under standard assumptions such as stationarity and the additive noise model, and leverage score matching with diffusion processes to enable efficient Hessian estimation. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks spanning $d{=}3{-}6$ variables, $T{=}200{-}5{,}000$ samples and up to three lags, DOTS improves mean window‑graph $F1$ from $0.63$ (best baseline) to $0.81$. On the CausalTime real‑world benchmark (Medical, AQI, Traffic; $d{=}20{-}36$), while baselines remain the best on individual datasets, DOTS attains the highest average summary‑graph $F1$ while halving runtime relative to graph‑optimisation methods. These results establish DOTS as a scalable and accurate solution for temporal causal discovery. Code is available at \url{https://github.com/CHAI-UK/DOTS}.

URL: https://openreview.net/forum?id=hWuTzqggSd

---

Title: The Initialization Determines Whether In-Context Learning Is Gradient Descent

Authors: Shifeng Xie, Rui Yuan, Simone Rossi, Thomas Hannagan

Abstract: In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions—specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for the ICL linear regression setup. Our experiments confirm this result and further observe that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce $y_q$-LSA, a simple generalization of single-head LSA with a trainable initial guess $y_q$. We theoretically establish the capabilities of $y_q$-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widespread LLMs augmented with initial guess capabilities, and show that their performance is improved on a semantic similarity task.
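
For readers who want a concrete picture of the ICL-as-GD construction discussed above, the sketch below (not the authors' code) shows the one-step gradient-descent predictor on in-context linear regression that linear self-attention is compared against, written for an arbitrary initialization w0, since zero initialization is precisely the assumption the paper relaxes. The function name and learning rate are illustrative.

```python
import numpy as np

def one_step_gd_prediction(X, y, x_q, w0=None, eta=0.1):
    """Prediction after one gradient-descent step on in-context linear regression.

    X: (n, d) in-context inputs, y: (n,) targets, x_q: (d,) query input.
    w0 is an arbitrary initial weight vector (the zero-initialization case is
    the restrictive special setting the abstract refers to).
    """
    n, d = X.shape
    w0 = np.zeros(d) if w0 is None else w0
    grad = -(X.T @ (y - X @ w0)) / n   # gradient of 0.5 * mean squared error
    w1 = w0 - eta * grad               # one GD step
    return x_q @ w1
```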

URL: https://openreview.net/forum?id=fvqSKLDtJi

---

Title: Towards shutdownable agents via stochastic choice

Authors: Elliott Thornley, Alexander Roman, Christos Ziakas, Louis Thomson, Leyton Ho

Abstract: The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.

URL: https://openreview.net/forum?id=j5Qv7KdWBn

---

Title: Angular Regularization for Positive-Unlabeled Learning on the Hypersphere

Authors: Vasileios Sevetlidis, George Pavlidis, Antonios Gasteratos

Abstract: Positive–Unlabeled (PU) learning addresses classification problems where only a subset of positive examples is labeled and the remaining data is unlabeled, making explicit negative supervision unavailable. Existing PU methods often rely on negative-risk estimation or pseudo-labeling, which either require strong distributional assumptions or can collapse in high-dimensional settings. We propose AngularPU, a novel PU framework that operates on the unit hypersphere using cosine similarity and angular margin. In our formulation, the positive class is represented by a learnable prototype vector, and classification reduces to thresholding the cosine similarity between an embedding and this prototype—eliminating the need for explicit negative modeling. To counteract the tendency of unlabeled embeddings to cluster near the positive prototype, we introduce an angular regularizer that encourages dispersion of the unlabeled set over the hypersphere, improving separation. We provide theoretical guarantees on the Bayes-optimality of the angular decision rule, consistency of the learned prototype, and the effect of the regularizer on the unlabeled distribution. Experiments on benchmark datasets demonstrate that AngularPU achieves competitive or superior performance compared to state-of-the-art PU methods, particularly in settings with scarce positives and high-dimensional embeddings, while offering geometric interpretability and scalability.
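
For a concrete picture of the decision rule and dispersion regularizer described above, here is a minimal PyTorch sketch; the exact loss form, margin, threshold, and weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def angular_pu_loss(z_pos, z_unl, prototype, margin=0.2, lam=0.1):
    """Illustrative PU loss on the unit hypersphere (not the paper's exact objective).

    z_pos : (P, d) embeddings of labeled positives
    z_unl : (U, d) embeddings of unlabeled points
    prototype : (d,) learnable positive prototype
    """
    p = F.normalize(prototype, dim=0)     # unit-norm prototype
    zp = F.normalize(z_pos, dim=1)        # embeddings projected onto the hypersphere
    zu = F.normalize(z_unl, dim=1)

    # Pull labeled positives within an angular margin of the prototype.
    pos_term = F.relu(margin - zp @ p).mean()

    # Angular dispersion regularizer: discourage unlabeled embeddings from
    # clustering near the prototype by penalizing high average cosine similarity.
    disp_term = (zu @ p).mean()

    return pos_term + lam * disp_term

def predict(z, prototype, tau=0.5):
    """Classify as positive when cosine similarity to the prototype exceeds tau."""
    return (F.normalize(z, dim=1) @ F.normalize(prototype, dim=0)) > tau
```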

URL: https://openreview.net/forum?id=XQhO0Ly6el

---

Title: Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor

Authors: Barys Liskavets, Shuvendu Roy, Maxim Ushakov, Mark Klibanov, Ali Etemad, Shane K. Luke

Abstract: The rise of Large Language Models (LLMs) has led to significant interest in prompt compression, a technique aimed at reducing the length of input prompts while preserving critical information. However, the prominent approaches in prompt compression often require explicit questions or handcrafted templates for compression, limiting their generalizability. We propose Task-agnostic Prompt Compression (TPC), a novel framework that generalizes compression across tasks and domains without requiring input questions or templates. TPC generates a context-relevant task description using a task descriptor trained on a curated dataset of context and query pairs, and fine-tuned via reinforcement learning with a reward function designed to capture the most relevant information. The task descriptor is then utilized to compute the relevance of each sentence in the prompt to generate the compressed prompt. We introduce three model sizes (Base, Large, and Huge), where the largest model outperforms the existing state-of-the-art methods on LongBench and ZeroSCROLLS, and our smallest model performs comparably to the existing solutions while being considerably smaller.
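
The selection step described above (score each sentence against a task description, keep the most relevant ones) can be sketched as follows. The embedding function, fixed task-description string, and keep ratio are placeholders; the paper's task descriptor is a trained, reward-tuned model rather than a fixed string.

```python
import numpy as np

def compress_prompt(sentences, task_description, embed, keep_ratio=0.5):
    """Keep the sentences most relevant to a task description (illustrative only).

    `embed` is any function mapping a list of strings to an (n, d) array of
    embeddings, standing in for the paper's context-aware sentence encoder.
    """
    sent_emb = embed(sentences)                     # (n, d)
    task_emb = embed([task_description])[0]         # (d,)

    # Cosine relevance of each sentence to the task description.
    sent_emb = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    task_emb = task_emb / np.linalg.norm(task_emb)
    scores = sent_emb @ task_emb

    k = max(1, int(len(sentences) * keep_ratio))
    keep = sorted(np.argsort(scores)[-k:])          # keep original sentence order
    return " ".join(sentences[i] for i in keep)
```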

URL: https://openreview.net/forum?id=TpGcX9UTOt

---

Title: Unifying Linear-Time Attention via Latent Probabilistic Modelling

Authors: Rares Dolga, Lucas Maystre, Marius Cobzarenco, David Barber

Abstract: Transformers have achieved state-of-the-art results across a range of domains, but their quadratic attention mechanism poses significant challenges for long-sequence modelling. Recent efforts to design linear-time attention mechanisms have yielded more scalable alternatives, yet often at the cost of performance, particularly on discrete data such as language. In this work, we revisit linear attention through the lens of probabilistic graphical models. We first show that standard linear attention can be interpreted as an undirected latent variable model, revealing a key limitation: the absence of directionality. To address this, we propose a novel directed parameterisation of linear attention that introduces an asymmetric structure, enabling an interpretation aligned with the causal and sequential nature of language. Our formulation integrates global latent-variable attention with local standard attention in a fully probabilistic framework. Additionally, we introduce a recurrent parameterisation of queries and keys that avoids reliance on relative positional encodings, often incompatible with linear attention. Experiments on language modelling benchmarks demonstrate that our model achieves competitive performance with standard attention and outperforms existing linear attention variants.
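
As background for the abstract above, this is the generic causal linear attention that such work builds on, computed in O(n) over the sequence via running sums; it is not the paper's directed latent-variable parameterisation, and the feature map is an arbitrary illustrative choice.

```python
import numpy as np

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Standard causal linear attention (generic kernelized form, not the paper's model).

    Q, K: (n, d) query and key matrices; V: (n, d_v) values.
    phi is a simple positive feature map chosen for illustration.
    """
    phi = lambda x: np.maximum(x, 0) + 1e-3    # elu-like positive feature map (assumption)
    Qf, Kf = phi(Q), phi(K)

    S = np.zeros((Qf.shape[1], V.shape[1]))    # running sum of phi(k) v^T
    z = np.zeros(Qf.shape[1])                  # running sum of phi(k)
    out = np.zeros_like(V, dtype=float)
    for t in range(Q.shape[0]):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + eps)
    return out
```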

URL: https://openreview.net/forum?id=TDFIjR7ynG

---

Title: Understanding Class Bias Amplification in Graph Representation Learning

Authors: Shengzhong Zhang, Wenjie Yang, Yimin Zhang, Hongwei Zhang, Zengfeng Huang

Abstract: Recent research reveals that GNN-based graph representation learning may inadvertently introduce various structural biases. In this work, we discover a phenomenon of structural bias in graph representation learning called class bias amplification, which refers to the exacerbation of performance bias between different classes by the GNN encoder. We conduct an in-depth theoretical study of this phenomenon from a novel spectral perspective. Our analysis suggests that structural disparities between nodes in different classes result in varying local convergence speeds for node embeddings. This phenomenon leads to bias amplification in the classification results of downstream tasks. Based on the theoretical insights, we propose random graph coarsening, which is proved to be effective in dealing with the above issue. Finally, we propose an unsupervised graph contrastive learning model called Random Graph Coarsening Contrastive Learning (RGCCL), which utilizes random coarsening as data augmentation and mitigates class bias amplification by contrasting the coarsened graph with the original graph. Extensive experiments on various datasets demonstrate the advantage of our method when dealing with class bias amplification.

URL: https://openreview.net/forum?id=SqpgDUdRE9

---

Title: Heterogeneous Knowledge for Augmented Modular Reinforcement Learning

Authors: Lorenz Wolf, Mirco Musolesi

Abstract: Existing modular Reinforcement Learning (RL) architectures are generally based on reusable components, also allowing for ``plug-and-play'' integration. However, these modules are homogeneous in nature - in fact, they essentially provide policies obtained via RL through the maximization of individual reward functions. Consequently, such solutions still lack the ability to integrate and process multiple types of information (i.e., heterogeneous knowledge representations), such as rules, sub-goals, and skills from various sources. In this paper, we discuss several practical examples of heterogeneous knowledge and propose Augmented Modular Reinforcement Learning (AMRL) to address these limitations. Our framework uses a selector to combine heterogeneous modules and seamlessly incorporate different types of knowledge representations and processing mechanisms. Our results demonstrate the performance and efficiency improvements, also in terms of generalization, which can be achieved by augmenting traditional modular RL with heterogeneous knowledge sources and processing mechanisms. Finally, we examine the safety, robustness, and interpretability issues stemming from the introduction of knowledge heterogeneity.

URL: https://openreview.net/forum?id=eme87YbiND

---

Title: HyResPINNs: A Hybrid Residual Physics-Informed Neural Network Architecture Designed to Balance Expressiveness and Trainability

Authors: Madison Cooley, Mike Kirby, Shandian Zhe, Varun Shankar

Abstract: Physics-informed neural networks (PINNs) have emerged as a powerful approach for solving partial differential equations (PDEs) by training neural networks with loss functions that incorporate physical constraints. In this work, we introduce HyResPINNs, a two-level convex-gated architecture designed to maximize approximation expressiveness for a fixed number of degrees of freedom (DoF). The first level involves a trainable, per-block combination of smooth basis functions with trainable sparsity, and deep neural networks; the second involves the ability to gate entire blocks (much like in ResNets or Highway Nets), allowing for expressivity along the depth dimension of the architecture. Our empirical evaluation on a diverse set of challenging PDE problems demonstrates that HyResPINNs consistently achieve superior accuracy to baseline methods while remaining competitive relative to training times. These results highlight the potential of HyResPINNs to combine desirable features from traditional scientific computing methods and modern machine learning, paving the way for more robust and expressive approaches to physics-informed modeling.
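
A loose sketch of the convex-gating idea described above: a block mixes a smooth basis expansion with a small network through a trainable convex gate. The RBF basis, gate granularity, and layer sizes here are assumptions for illustration; the paper's blocks, trainable sparsity, and depth-wise gating are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexGatedBlock(nn.Module):
    """Illustrative convex-gated block: out = a * basis(x) + (1 - a) * mlp(x)."""

    def __init__(self, in_dim, out_dim, n_centers=32, hidden=64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, in_dim))
        self.widths = nn.Parameter(torch.ones(n_centers))
        self.basis_head = nn.Linear(n_centers, out_dim)
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, out_dim))
        self.gate = nn.Parameter(torch.zeros(1))   # sigmoid(0) = 0.5 at initialization

    def forward(self, x):
        d2 = torch.cdist(x, self.centers) ** 2          # squared distances to RBF centers
        rbf = torch.exp(-d2 * F.softplus(self.widths))  # smooth basis features
        a = torch.sigmoid(self.gate)                    # convex combination weight
        return a * self.basis_head(rbf) + (1 - a) * self.mlp(x)
```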

URL: https://openreview.net/forum?id=et9WkjkqAw

---

Title: Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models

Authors: Parham Rezaei, Arash Marioriyad, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Abstract: Despite the ability of text-to-image models to generate high-quality, realistic, and diverse images, they face challenges in compositional generation, often struggling to accurately represent details specified in the input prompt. A prevalent issue in compositional generation is the misalignment of spatial relationships, as models often fail to faithfully generate images that reflect the spatial configurations specified between objects in the input prompts.
To address this challenge, we propose a novel probabilistic framework for modeling the relative spatial positioning of objects in a scene, leveraging the concept of Probability of Superiority (PoS). Building on this insight, we make two key contributions. First, we introduce a novel evaluation metric, PoS-based Evaluation (PSE), designed to assess the alignment of 2D and 3D spatial relationships between text and image, with improved adherence to human judgment. Second, we propose PoS-based Generation (PSG), an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning. PSG employs a PoS-based reward function that can be utilized in two distinct ways: (1) as a gradient-based guidance mechanism applied to the cross-attention maps during the denoising steps, or (2) as a search-based strategy that evaluates a set of initial noise vectors to select the best one. Extensive experiments demonstrate that the PSE metric exhibits stronger alignment with human judgment compared to traditional center-based metrics, providing a more nuanced and reliable measure of complex spatial relationship accuracy in text-image alignment. Furthermore, PSG significantly enhances the ability of text-to-image models to generate images with specified spatial configurations, outperforming state-of-the-art methods across multiple evaluation metrics and benchmarks.
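
The Probability of Superiority idea above has a simple empirical form. Below is a minimal sketch for one 2D relation ("A is to the left of B"), estimated from sampled pixel coordinates; how the coordinates are obtained (e.g., from segmentation masks) and the handling of other relations and 3D are left out.

```python
import numpy as np

def prob_superiority_left(coords_a, coords_b):
    """P(x_A < x_B): probability that a random point of object A lies to the
    left of a random point of object B (generic PoS estimate along the x-axis).

    coords_a, coords_b: (n, 2) arrays of (x, y) pixel positions.
    """
    xa = coords_a[:, 0][:, None]     # (n_a, 1)
    xb = coords_b[:, 0][None, :]     # (1, n_b)
    return float((xa < xb).mean())

# Example: a PoS close to 1 indicates "A is to the left of B" is well satisfied.
a = np.random.normal(loc=[30, 50], scale=5, size=(200, 2))
b = np.random.normal(loc=[90, 50], scale=5, size=(200, 2))
print(prob_superiority_left(a, b))   # ~1.0
```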

URL: https://openreview.net/forum?id=mFlanJKVFD

---

Title: The AI Hippocampus: How Far are We From Human Memory?

Authors: Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu

Abstract: Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs). As these models transition from static predictors to interactive systems capable of continual learning and personalized inference, the incorporation of memory mechanisms has emerged as a central theme in their architectural and functional evolution. This survey presents a comprehensive and structured synthesis of memory in LLMs and MLLMs, organizing the literature into a cohesive taxonomy comprising implicit, explicit, and agentic memory paradigms. Specifically, the survey delineates three primary memory frameworks. \textit{Implicit memory} refers to the knowledge embedded within the internal parameters of pre-trained transformers, encompassing their capacity for memorization, associative retrieval, and contextual reasoning. Recent work has explored methods to interpret, manipulate, and reconfigure this latent memory. \textit{Explicit memory} involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations—such as textual corpora, dense vectors, and graph-based structures—thereby enabling scalable and updatable interaction with information sources. \textit{Agentic memory} introduces persistent, temporally extended memory structures within autonomous agents, facilitating long-term planning, self-consistency, and collaborative behavior in multi-agent systems, with relevance to embodied and interactive AI. Extending beyond text, the survey examines the integration of memory within multi-modal settings, where coherence across vision, language, audio, and action modalities is essential. Key architectural advances, benchmark tasks, and open challenges are discussed, including issues related to memory capacity, alignment, factual consistency, and cross-system interoperability. By charting the current landscape and identifying critical research directions, this survey aims to inform the development of memory-augmented (M)LLMs that are more flexible, context-sensitive, and aligned with the requirements of real-world intelligent systems. The survey’s website is available at \url{https://github.com/bigai-nlco/LLM-Memory-Survey}.

URL: https://openreview.net/forum?id=Sk7pwmLuAY

---

Title: SIRE: SE(3) Intrinsic Rigidity Embeddings

Authors: Cameron Omid Smith, Basile Van Hoorick, Chonghyuk Song, Vincent Sitzmann, Vitor Campagnolo Guizilini, Yue Wang

Abstract: Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure -- highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can pave the way towards self-supervised learning of priors over geometry and motion rigidity from large-scale video data.

URL: https://openreview.net/forum?id=OZ9H0TOYMt

---

Title: A second-order-like optimizer with adaptive gradient scaling for deep learning

Authors: Jerome Bolte, Ryan Boustany, Edouard Pauwels, Andrei Purica

Abstract: In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory and compute requirements of standard deep learning methods such as AdamW or SGD. INNAprop is evaluated on CIFAR-10, Food101, and ImageNet with ResNets, VGG, DenseNet, and ViT. We also train GPT-2 (OpenWebText) from scratch and with LoRA fine-tuning (E2E). INNAprop consistently offers performance close to that of AdamW, while performing significantly better in our LLM training experiments, achieving faster convergence and higher accuracy with minimal hyperparameter tuning, even at large scale. Our code is public.

URL: https://openreview.net/forum?id=3khtiJDXQW

---

Title: Learning Task-Aware Abstract Representations for Meta-Reinforcement Learning

Authors: Louk van Remmerden, Zhao Yang, Shujian Yu, Mark Hoogendoorn, Vincent Francois-Lavet

Abstract: A central challenge in meta-reinforcement learning (meta-RL) is enabling agents trained on a set of environments to generalize to new, related tasks without requiring full policy retraining. Existing model-free approaches often rely on context-conditioned policies learned via encoder networks. However, these context encoders are prone to overfitting to the training environments, resulting in poor out-of-sample performance on unseen tasks. To address this issue, we adopt an alternative approach that uses an abstract representation model to learn augmented, task-aware abstract states. We achieve this by introducing a novel architecture that offers greater flexibility than existing recurrent network-based approaches. In addition, we optimize our model with multiple loss terms that encourage predictive, task-aware representations in the abstract state space. Our method simplifies the learning problem and provides a flexible framework that can be readily combined with any off-the-shelf reinforcement learning algorithm. We provide theoretical guarantees alongside empirical results, showing strong generalization performance across classical control and robotic meta-RL benchmarks, on par with state-of-the-art meta-RL methods and significantly better than non-meta RL approaches.

URL: https://openreview.net/forum?id=3CWyTh4hJ4

---

Title: State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

Authors: Xintong Duan, Yutong He, Fahim Tajwar, Wentse Chen, Ruslan Salakhutdinov, Jeff Schneider

Abstract: Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on successful trajectories generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multi-agent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.

URL: https://openreview.net/forum?id=XB1dd01Ozz

---

Title: MDTree: A Masked Dynamic Autoregressive Model for Phylogenetic Inference

Authors: Zelin Zang, ChenRui Duan, Siyuan Li, Jinlin Wu, BingoWing-Kuen Ling, Fuji Yang, Jiebo Luo, Zhen Lei, Stan Z. Li

Abstract: Phylogenetic tree inference requires optimizing both branch lengths and topologies, yet traditional MCMC-based methods suffer from slow convergence and high computational cost. Recent deep learning approaches improve scalability but remain constrained: Bayesian models are computationally intensive, autoregressive methods depend on fixed species orders, and flow-based models underutilize genomic signals. Fixed-order autoregression introduces an inductive bias misaligned with evolutionary proximity: early misplacements distort subsequent attachment probabilities and compound topology errors (exposure bias). Absent sequence-informed priors, the posterior over the super-exponential topology space remains diffuse and multimodal, yielding high-variance gradients and sluggish convergence for both MCMC proposals and neural samplers.
We propose MDTree, a masked dynamic autoregressive framework that integrates genomic priors into a Dynamic Ordering Network to learn biologically informed node sequences. A dynamic masking mechanism further enables parallel node insertion, improving efficiency without sacrificing accuracy. Experiments on standard benchmarks demonstrate that MDTree outperforms existing methods in accuracy and runtime while producing biologically coherent phylogenies, providing a scalable solution for large-scale evolutionary analysis.

URL: https://openreview.net/forum?id=dTSptQNygv

---

Title: Convergence of linear programming hierarchies for Gibbs states of spin systems

Authors: Hamza Fawzi, Omar Fawzi

Abstract: We consider the problem of computing expectation values of local functions under the Gibbs distribution of a spin system. In particular, we study two families of linear programming hierarchies for this problem. The first hierarchy imposes local spin flip equalities and has been considered in the bootstrap literature in high energy physics. For this hierarchy, we prove fast convergence under a spatial mixing (decay of correlations) condition. This condition is satisfied for example above the critical temperature for Ising models on a d-dimensional grid. The second hierarchy is based on a Markov chain having the Gibbs state as a fixed point and has been studied in the optimization literature and more recently in the bootstrap literature. For this hierarchy, we prove fast convergence provided the Markov chain mixes rapidly. Both hierarchies lead to an ε-approximation for local expectation values using a linear program of size quasi-polynomial in n/ε, where n is the total number of sites, provided the interactions can be embedded in a d-dimensional grid with constant d. Compared to standard Monte Carlo methods, an advantage of this approach is that it always (i.e., for any system) outputs rigorous upper and lower bounds on the expectation value of interest, without needing an a priori analysis of the convergence speed.

URL: https://openreview.net/forum?id=mc1dPxZsv3

---

Title: LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

Authors: Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, WenHao Wang, Tianze Wu, Zhengxi Lu, Siheng Chen, LiLinghao, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li

Abstract: With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges: (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension; we then show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents. The collection of papers reviewed in this survey will be hosted and regularly updated on the GitHub repository: \url{https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents}

URL: https://openreview.net/forum?id=yWQqoi1G1K

---

Title: MMD Two-sample Testing in the Presence of Arbitrarily Missing Data

Authors: Yijin Zeng, Niall M. Adams, Dean A. Bodenham

Abstract: In many real-world applications, it is common that a proportion of the data may be missing or only partially observed. We develop a novel two-sample testing method based on the Maximum Mean Discrepancy (MMD) which accounts for missing data in both samples, without making assumptions about the missingness mechanism. Our approach is based on deriving the mathematically precise bounds of the MMD test statistic after accounting for all possible missing values. To the best of our knowledge, it is the only two-sample testing method that is guaranteed to control the Type I error for both univariate and multivariate data where data may be arbitrarily missing. Simulation results show that the method has good statistical power, typically for cases where 5% to 10% of the data are missing. We highlight the value of this approach when the data are missing not at random, a context in which either ignoring the missing values or using common imputation methods may not control the Type I error.
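
For reference, the standard statistic that the method above builds on is the unbiased MMD^2 estimate with an RBF kernel, sketched below for fully observed data; the paper's contribution, bounding this statistic over every possible completion of the missing entries, is not reproduced here.

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 with an RBF kernel on fully observed samples.

    X: (n, d) sample from P, Y: (m, d) sample from Q, sigma: kernel bandwidth.
    """
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))

    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop diagonal terms for the unbiased estimate
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1))
            + Kyy.sum() / (m * (m - 1))
            - 2 * Kxy.mean())
```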

URL: https://openreview.net/forum?id=GfcDel1ICb

---

Title: An Efficient Sparse Fine-Tuning with Low Quantization Error via Neural Network Pruning

Authors: Cen-Jhih Li, Aditya Bhaskara

Abstract: Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SpFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SpFT framework, based on ideas from neural network pruning. At a high level, we first identify ``important'' neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Experiments on common language tasks show our method improves SpFT’s memory efficiency by 20–50% while matching the accuracy of state-of-the-art methods like LoRA’s variants.
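
The masking mechanism described above can be sketched in a few lines. Here neuron importance is approximated by plain weight norms, which is only a stand-in for the structural-pruning importance metric the paper actually uses; the keep fraction and names are illustrative.

```python
import torch

def sparse_finetune_mask(weight, keep_frac=0.1):
    """Select 'important' output neurons and return a 0/1 mask over the weight matrix.

    weight: (out_features, in_features) matrix of a linear layer.
    Importance here is the L2 norm of each neuron's incoming weights (a proxy,
    not the paper's metric).
    """
    importance = weight.norm(dim=1)                    # one score per output neuron
    k = max(1, int(keep_frac * weight.shape[0]))
    top = torch.topk(importance, k).indices
    mask = torch.zeros_like(weight)
    mask[top] = 1.0
    return mask

# During fine-tuning, gradients of frozen rows can be zeroed after backward():
# layer.weight.grad *= mask
```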

URL: https://openreview.net/forum?id=w3b67v5EzD

---

Title: Cognitive Architectures for Language Agents

Authors: Theodore Sumers, Shunyu Yao, Karthik R Narasimhan, Thomas L. Griffiths

Abstract: Recent efforts have augmented large language models (LLMs) with external resources (e.g., the Internet) or internal control flows (e.g., prompt chaining) for tasks requiring grounding or reasoning, leading to a new class of language agents. While these agents have achieved substantial empirical success, we lack a framework to organize existing agents and plan future developments. In this paper, we draw on the rich history of cognitive science and symbolic artificial intelligence to propose Cognitive Architectures for Language Agents (CoALA). CoALA describes a language agent with modular memory components, a structured action space to interact with internal memory and external environments, and a generalized decision-making process to choose actions. We use CoALA to retrospectively survey and organize a large body of recent work, and prospectively identify actionable directions towards more capable agents. Taken together, CoALA contextualizes today's language agents within the broader history of AI and outlines a path towards language-based general intelligence.

URL: https://openreview.net/forum?id=1i6ZCvflQJ

---

Title: Oscillations Make Neural Networks Robust to Quantization

Authors: Jonathan Wenshøj, Bob Pepin, Raghavendra Selvan

Abstract: We challenge the prevailing view that weight oscillations observed during Quantization Aware Training (QAT) are merely undesirable side-effects and argue instead that they are an essential part of QAT. We show in a univariate linear model that QAT results in an additional loss term that causes oscillations by pushing weights away from their nearest quantization level. Based on the mechanism from the analysis, we then derive a regularizer that induces oscillations in the weights of neural networks during training. Our empirical results on ResNet-18 and Tiny Vision Transformer, evaluated on CIFAR-10 and Tiny ImageNet datasets, demonstrate across a range of quantization levels that training with oscillations followed by post-training quantization (PTQ) is sufficient to recover the performance of QAT in most cases. With this work we provide further insight into the dynamics of QAT and offer a new explanation of the role of oscillations in QAT, which until now have been considered to have a primarily negative effect on quantization.
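
One plausible form of a regularizer that pushes weights away from their nearest quantization level is sketched below; the paper derives its own regularizer from the univariate analysis, so the exact expression, step size, and strength here are assumptions.

```python
import torch

def oscillation_regularizer(weights, step, strength=1e-4):
    """Illustrative penalty that pushes weights away from the nearest quantization level.

    weights: any tensor of trainable weights; step: uniform quantization step size.
    """
    nearest = torch.round(weights / step) * step        # nearest quantization level
    dist = torch.abs(weights - nearest)                 # distance to that level
    # Penalize being close to the grid; minimizing this drives weights toward
    # the midpoints between levels, which induces oscillation during training.
    return strength * (step / 2 - dist).clamp(min=0).sum()
```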

URL: https://openreview.net/forum?id=bPwcJ0nkDC

---

Title: Decentralized Projection-free Online Upper-Linearizable Optimization with Applications to DR-Submodular Optimization

Authors: Yiyang Lu, Mohammad Pedramfar, Vaneet Aggarwal

Abstract: We introduce a novel framework for decentralized projection-free optimization, extending projection-free methods to a broader class of upper-linearizable functions. Our approach leverages decentralized optimization techniques with the flexibility of upper-linearizable function frameworks, effectively generalizing traditional DR-submodular function optimization. We obtain the regret of $O(T^{1-\theta/2})$ with communication complexity of $O(T^{\theta})$ and number of linear optimization oracle calls of $O(T^{2\theta})$ for decentralized upper-linearizable function optimization, for any $0\le \theta \le 1$. This approach allows for the first results for monotone up-concave optimization with general convex constraints and non-monotone up-concave optimization with general convex constraints. Further, the above results for first order feedback are extended to zeroth order, semi-bandit, and bandit feedback.

URL: https://openreview.net/forum?id=bZ5WD2HUQr

---

Title: kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions

Authors: Parastoo PASHMCHI, Jérôme Benoit, Motonobu Kanagawa

Abstract: We study a missing-value imputation method, termed kNNSampler, that imputes a given unit's missing response by randomly sampling from the observed responses of the k most similar units to the given unit in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation. Unlike the popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate. Experiments illustrate the performance of kNNSampler. The code for kNNSampler is made publicly available (https://github.com/SAP/knn-sampler).
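
A minimal version of the sampling idea described above (not the released implementation, which is linked in the abstract):

```python
import numpy as np

def knn_sampler_impute(X_obs, y_obs, X_miss, k=5, n_draws=1, rng=None):
    """Impute missing responses by sampling from the responses of the k most
    similar units in covariate space.

    X_obs: (n, d) covariates with observed responses y_obs: (n,)
    X_miss: (m, d) covariates whose responses are missing
    Returns an (m, n_draws) array of sampled imputations.
    """
    rng = np.random.default_rng(rng)
    d2 = ((X_miss[:, None, :] - X_obs[None, :, :]) ** 2).sum(-1)   # (m, n) squared distances
    nn_idx = np.argsort(d2, axis=1)[:, :k]                          # k nearest neighbours
    draws = np.empty((len(X_miss), n_draws))
    for i, idx in enumerate(nn_idx):
        draws[i] = rng.choice(y_obs[idx], size=n_draws, replace=True)
    return draws
```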

URL: https://openreview.net/forum?id=4CDnIACCQG

---

Title: Learning to Rank Features to Enhance Graph Neural Networks for Graph Classification

Authors: Fouad Alkhoury, Tamas Horvath, Christian Bauckhage, Stefan Wrobel

Abstract: A common strategy to enhance the predictive performance of graph neural networks (GNNs) for graph classification is to extend input graphs with node- and graph-level features. However, identifying the optimal feature set for a specific learning task remains a significant challenge, often requiring domain-specific expertise. To address this, we propose a general two-step method that automatically selects a compact, informative subset from a large pool of candidate features to improve classification accuracy. In the first step, a GNN is trained to estimate the importance of each feature for a given graph. In the second step, the model generates feature rankings for the training graphs, which are then aggregated into a global ranking. A top-ranked subset is selected from this global ranking and used to train a downstream graph classification GNN. Experiments on real-world and synthetic datasets show that our method outperforms various baselines, including models using all candidate features, and achieves state-of-the-art results on several benchmarks.

URL: https://openreview.net/forum?id=WmZGvWRAWb

---

Title: Pushing the Limits of Sparsity: A Bag of Tricks for Extreme Pruning

Authors: Andy Li, Aiden Durrant, Milan Markovic, Tianjin Huang, Souvik Kundu, Tianlong Chen, Lu Yin, Georgios Leontidis

Abstract: Pruning of deep neural networks has been an effective technique for reducing model size while preserving most of the performance of dense networks, crucial for deploying models on memory and power-constrained devices. While recent sparse learning methods have shown promising performance up to moderate sparsity levels such as 95% and 98%, accuracy quickly deteriorates when pushing sparsities to extreme levels due to unique challenges such as fragile gradient flow. In this work, we explore network performance beyond the commonly studied sparsities, and develop techniques that encourage stable training without accuracy collapse even at extreme sparsities, including 99.90%, 99.95%, and 99.99% on ResNet architectures. We propose three complementary techniques that enhance sparse training through different mechanisms: 1) Dynamic ReLU phasing, where DyReLU initially allows for richer parameter exploration before being gradually replaced by standard ReLU, 2) weight sharing which reuses parameters within a residual layer while maintaining the same number of learnable parameters, and 3) cyclic sparsity, where both sparsity levels and sparsity patterns evolve dynamically throughout training to better encourage parameter exploration. We evaluate our method, which we term Extreme Adaptive Sparse Training (EAST) at extreme sparsities using ResNet-34 and ResNet-50 on CIFAR-10, CIFAR-100, and ImageNet, achieving competitive or improved performance compared to existing methods, with notable gains at extreme sparsity levels.

URL: https://openreview.net/forum?id=XX9JdOJD8R

---

Title: Sparse Multiple Kernel Learning: Alternating Best Response and Semidefinite Relaxations

Authors: Dimitris Bertsimas, Caio de Próspero Iglesias, Nicholas A. G. Johnson

Abstract: We study Sparse Multiple Kernel Learning (SMKL), which is the problem of selecting a sparse convex combination of prespecified kernels for support vector binary classification. Unlike prevailing $\ell_1$‐regularized approaches that approximate a sparsifying penalty, we formulate the problem by imposing an explicit cardinality constraint on the kernel weights and add an $\ell_2$ penalty for robustness. We solve the resulting non-convex minimax problem via an alternating best response algorithm with two subproblems: the $\alpha$‐subproblem is a standard kernel SVM dual solved via LIBSVM, while the $\beta$‐subproblem admits an efficient solution via the Greedy Selector and Simplex Projector algorithm. We reformulate SMKL as a mixed integer semidefinite optimization problem and derive a hierarchy of semidefinite convex relaxations which can be used to certify near-optimality of the solutions returned by our best response algorithm and also to warm start it. On ten UCI benchmarks, our method with random initialization outperforms state-of-the-art MKL approaches in out of sample prediction accuracy on average by $3.34$ percentage points (relative to the best performing benchmark) while selecting a small number of candidate kernels in comparable runtime. With warm starting, our method outperforms the best performing benchmark's out of sample prediction accuracy on average by $4.05$ percentage points. Our convex relaxations provide a certificate that in several cases, the solution returned by our best response algorithm is the globally optimal solution.

URL: https://openreview.net/forum?id=Y5icwFwkyh

---

Title: A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

Authors: Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, Mido Assran

Abstract: Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark is comprised of 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair — a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that solely rely on visual or textual biases would achieve below random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2% compared to random performance at 25%.
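
The paired scoring rule described above ("credit only when both members of a minimal-change pair are answered correctly") is simple to state in code; the data layout below is illustrative.

```python
def paired_accuracy(preds, answers, pairs):
    """Score predictions on minimal-change pairs.

    preds, answers : dicts mapping example id -> chosen / gold option
    pairs          : list of (id_a, id_b) minimal-change pairs
    A pair counts as correct only if both of its examples are answered correctly.
    """
    correct = sum(
        preds[a] == answers[a] and preds[b] == answers[b]
        for a, b in pairs
    )
    return correct / len(pairs)
```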

URL: https://openreview.net/forum?id=gvFgNJcSw1

---

Title: Outcome-based Reinforcement Learning to Predict the Future

Authors: Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR methods towards forecasting future real-world events – a challenging task for RL due to the very noisy (and delayed) outcomes involved. Using a novel dataset of recent questions from a prediction market, and accompanying relevant news headlines, we show that a compact (14B) reasoning model can be trained to match or surpass the predictive accuracy of frontier models like o1, while greatly improving probabilistic calibration. The model's performance is also practically meaningful: in a Polymarket trading simulation, we estimate that its bets would have yielded a return on investment of over 10\% across all questions in the test set. We detail and compare approaches used in training our model, including augmenting our training-data with synthetic prediction questions, guardrails for learning stability, and median prediction sampling at inference-time.

URL: https://openreview.net/forum?id=bbhdeL8EUX

---

Title: Joint Diffusion for Universal Hand-Object Grasp Generation

Authors: Jinkun Cao, Jingyuan Liu, Kris Kitani, Yi Zhou

Abstract: Predicting and generating human hand grasps over objects is critical for animation and robotic tasks. In this work, we focus on generating both the hand and objects in a grasp by a single diffusion model. Our proposed Joint Hand-Object Diffusion (JHOD) models the hand and object in a unified latent representation. It uses the hand-object grasping data to learn to accommodate hand and object to form plausible grasps. Also, to enforce the generalizability over diverse object shapes, it leverages large-scale object datasets to learn an inclusive object latent embedding. With or without a given object as an optional condition, the diffusion model can generate grasps unconditionally or conditioned on the object. Compared to the usual practice of learning object-conditioned grasp generation from only hand-object grasp data, our method benefits from more diverse object data used for training to handle grasp generation more universally. According to both qualitative and quantitative experiments, both conditional and unconditional generation of hand grasps achieves good visual plausibility and diversity. With the extra inclusiveness of object representation learned from large-scale object datasets, the proposed method generalizes well to unseen object shapes.

URL: https://openreview.net/forum?id=TZ0ztsYR6x

---

Title: Sparse-Input Neural Network using Group Concave Regularization

Authors: Bin Luo, Susan Halabi

Abstract: Simultaneous feature selection and non-linear function estimation is challenging in modeling, especially in high-dimensional settings where the number of variables exceeds the available sample size. In this article, we investigate the problem of feature selection in neural networks. Although the group least absolute shrinkage and selection operator (LASSO) has been utilized to select variables for learning with neural networks, it tends to select unimportant variables into the model to compensate for its over-shrinkage. To overcome this limitation, we propose a framework of sparse-input neural networks using group concave regularization for feature selection in both low-dimensional and high-dimensional settings. The main idea is to apply a proper concave penalty to the $l_2$ norm of weights from all outgoing connections of each input node, and thus obtain a neural net that only uses a small subset of the original variables. In addition, we develop an effective algorithm based on backward path-wise optimization to yield stable solution paths, in order to tackle the challenge of complex optimization landscapes. We provide a rigorous theoretical analysis of the proposed framework, establishing finite-sample guarantees for both variable selection consistency and prediction accuracy. These results are supported by extensive simulation studies and real data applications, which demonstrate the finite-sample performance of the estimator in feature selection and prediction across continuous, binary, and time-to-event outcomes.

URL: https://openreview.net/forum?id=m9UsLHZYeX

---

Title: Stabilizing black-box model selection with the inflated argmax

Authors: Melissa Adrian, Jake A Soloff, Rebecca Willett

Abstract: Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. In this paper, we present a new approach to stabilizing model selection with theoretical stability guarantees that leverages a combination of bagging and an ''inflated'' argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka–Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances, (c) a graph subset selection problem using cell-signaling data from proteomics, and (d) unsupervised $\kappa$-means clustering. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks.
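
One simplified reading of "bagging plus an inflated argmax" is sketched below: bag a base selector over bootstrap resamples, then return every model whose selection frequency comes within a tolerance of the most-selected model. The paper defines its inflated argmax formally, so the frequency-based rule, tolerance, and names here are assumptions.

```python
import numpy as np

def bagged_inflated_selection(select_fn, X, y, n_bags=100, eps=0.05, rng=None):
    """Stabilized model selection via bagging and a tolerance-based ('inflated') argmax.

    `select_fn(X, y)` is any base procedure (e.g. LASSO support recovery)
    returning a hashable model identifier. Returns a set of selected models.
    """
    rng = np.random.default_rng(rng)
    counts = {}
    n = len(y)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)           # bootstrap resample of the data
        model = select_fn(X[idx], y[idx])
        counts[model] = counts.get(model, 0) + 1
    freqs = {m: c / n_bags for m, c in counts.items()}
    best = max(freqs.values())
    return {m for m, f in freqs.items() if f >= best - eps}
```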

URL: https://openreview.net/forum?id=DSDWHsQLgA

---

Title: CoCoIns: Consistent Subject Generation via Contrastive Instantiated Concepts

Authors: Lee Hsin-Ying, Kelvin C.K. Chan, Ming-Hsuan Yang

Abstract: While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple generations limits their application to long-form content generation. Existing approaches require time-consuming fine-tuning, reference images for all subjects, or access to previously generated content. We introduce Contrastive Concept Instantiation (CoCoIns), a framework that effectively synthesizes consistent subjects across multiple independent generations. The framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with specific concept instances. Users can generate consistent subjects by reusing the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to distinguish between different combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining greater flexibility. We also demonstrate the potential for extending CoCoIns to multiple subjects and other object categories.

URL: https://openreview.net/forum?id=fPZ7DNlOSn

---

Title: Unlocking the matrix form of the Quaternion Fourier Transform and Quaternion Convolution: Properties, connections, and application to Lipschitz constant bounding

Authors: Giorgos Sfikas, George Retsinas

Abstract: Linear transformations are ubiquitous in machine learning, and matrices are the standard way to represent them. In this paper, we study matrix forms of quaternionic versions of the Fourier Transform and Convolution operations. Quaternions offer a powerful representation unit; however, their use involves difficulties that stem foremost from the non-commutativity of quaternion multiplication and from the fact that $\mu^2 = -1$ possesses infinitely many solutions in the quaternion domain. Handling of quaternionic matrices is consequently complicated in several aspects (definition of eigenstructure, determinant, etc.). Our research findings clarify the relation of the Quaternion Fourier Transform matrix to the standard (complex) Discrete Fourier Transform matrix, and the extent to which well-known complex-domain theorems extend to quaternions. We focus especially on the relation of Quaternion Fourier Transform matrices to Quaternion Circulant matrices (representing quaternionic convolution), and the eigenstructure of the latter. A proof-of-concept application that makes direct use of our theoretical results is presented: a method to bound the Lipschitz constant of a Quaternionic Convolutional Neural Network. Code is publicly available at: https://github.com/sfikas/quaternion-fourier-convolution-matrix.

URL: https://openreview.net/forum?id=rhcpXTxb8j

---


New submissions
===============


Title: Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding

Abstract: In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model’s inherent in-weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a \textit{task representation space} and a \textit{sample representation space}. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate that our proposed architecture, CoQE, not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across all tested conditions.

URL: https://openreview.net/forum?id=bJK7VIOWAU

---

Title: TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization

Abstract: Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.

URL: https://openreview.net/forum?id=Xt9sdzQQlJ

---

Title: Retrieval-augmented Adaptive Decoding for Open-ended Question Answering Generation

Abstract: Ensuring truthfulness in large language models (LLMs) remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning with human feedback have shown promise, they require a substantial amount of annotated data and computational resources, limiting scalability. In contrast, decoding-time interventions offer lightweight alternatives without model retraining. However, existing decoding strategies often face issues like prompt sensitivity, limited generalization, or dependence on internal model states. We propose \textbf{Retrieval-Augmented Decoding (RAD)}, a context-aware adaptive decoding method that leverages a compact reference grounding space built from \textit{as few as 10 annotated examples} and comprising pairs of context embeddings and next-token logits from truthful responses, to enable retrieval-based logit shaping during inference. At each decoding step, RAD retrieves high-quality semantically similar contexts from this grounding space and aggregates their associated next token logits to modify the model's current logits. Across three open-ended question-answering benchmarks and four LLMs, our method consistently outperforms strong baselines and shows robust cross-task generalization, underscoring the promise of context-aware decoding for enhancing factual reliability.
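
The logit-shaping step described above can be pictured with a small numpy sketch of one decoding step: retrieve the most similar stored contexts, aggregate their recorded next-token logits, and blend them with the model's current logits. The similarity weighting and blending coefficient are assumptions; the paper defines its own aggregation.

```python
import numpy as np

def rad_adjust_logits(ctx_emb, model_logits, bank_embs, bank_logits,
                      top_k=3, alpha=0.5):
    """One decoding step of retrieval-based logit shaping (illustrative only).

    ctx_emb      : (d,) embedding of the current decoding context
    model_logits : (V,) the model's next-token logits
    bank_embs    : (N, d) embeddings of the stored truthful contexts
    bank_logits  : (N, V) their recorded next-token logits
    """
    sims = bank_embs @ ctx_emb / (
        np.linalg.norm(bank_embs, axis=1) * np.linalg.norm(ctx_emb) + 1e-8)
    top = np.argsort(sims)[-top_k:]                     # most similar stored contexts
    w = np.exp(sims[top]) / np.exp(sims[top]).sum()     # softmax retrieval weights
    retrieved = (w[:, None] * bank_logits[top]).sum(0)  # weighted logit aggregate
    return (1 - alpha) * model_logits + alpha * retrieved
```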

URL: https://openreview.net/forum?id=XVeVhCKkqN

---

Title: When Test-Time Training Fails: A Critical Analysis of Robustness and Hyperparameter Sensitivity

Abstract: Test-time training (TTT) through input perplexity minimization has emerged as a promising approach for enhancing language model performance during inference. However, questions remain about its practical robustness and applicability beyond popular benchmarks. This paper presents a preliminary analysis investigating two critical questions: whether TTT is effective on unseen tasks and how sensitive it is to hyperparameter choices. We evaluate TTT on three anti-memorization datasets—Memo-Trap, GSM-Symbolic, and Math-Perturb—using six models from the Qwen 2.5 and Llama 3 families. Our findings reveal that while TTT shows effectiveness on common benchmarks such as AIME 2024, it struggles with tasks designed to counter memorization, raising questions about whether the gains stem from domain adaptation or data contamination. We identify significant performance differences among optimizers, with SGD outperforming Adam despite slower convergence. Through extensive hyperparameter sweeps over learning rates, training steps, weight decay, momentum, and gradient normalization, we demonstrate that TTT is highly sensitive to these choices, with no universal recipe across tasks and models. Notably, gradient normalization emerges as an effective technique for improving robustness by mitigating catastrophic performance drops and reducing sensitivity to the learning rate. Our analysis also reveals that tuning feed-forward networks can achieve better peak performance than full model tuning, while attention-only tuning provides more stable worst-case performance. These findings highlight the need for continued research into making test-time training more practical and reliable for real-world deployment. Since this research focuses only on one specific TTT algorithm, input perplexity minimization, our conclusions may not extend to all TTT algorithms. We call on the community to pay closer attention to TTT's sensitivity so that it can become better suited for real-world applications.

URL: https://openreview.net/forum?id=0Eh31N1Hoj

---

Title: COunterfactual Reasoning for Temporal EXplanations: Plausible and Robust Explanations for EEG-Based Seizure Detection

Abstract: Identifying the drivers of change in time-sensitive domains like healthcare is critical for reliable decision-making, yet explanations must account for both temporal dynamics and structural complexity. While counterfactual explanations are well-studied for static data, existing methods often fail in dynamic, spatio-temporal settings, producing implausible or temporally inconsistent explanations. To address this, we introduce COunterfactual Reasoning for Temporal EXplanations (CORTEX), a search-based explainer for multivariate time series modeled as spatio-temporal graphs, tailored to seizure detection from EEG recordings. CORTEX generates temporally robust and plausible counterfactuals by retrieving relevant past instances and sieving them via structural dissimilarity, temporal distance, and instability. Evaluated on clinical seizure detection data, CORTEX outperforms state-of-the-art methods with a $2.73\times$ improvement in validity and $5.32\times$ in fidelity, and achieves zero implausibility, demonstrating consistency and practical relevance. By shifting the focus from mere validity to plausible and time-consistent explanations, CORTEX enables more reliable and controllable counterfactual explanations.

URL: https://openreview.net/forum?id=FkHVmYnNS9

---

Title: On the Dynamics & Transferability of Latent Generalization during Memorization

Abstract: Deep networks have been known to have extraordinary generalization abilities, via mechanisms that aren't yet well understood. It is also known that upon shuffling labels in the training data to varying degrees, deep networks, trained with standard methods, can still achieve perfect or high accuracy on this corrupted training data. This phenomenon is called memorization, and typically comes at the cost of poorer generalization to true labels. Recent work has demonstrated, surprisingly, that the internal representations of such models retain significantly better latent generalization abilities than is directly apparent from the model. In particular, it has been shown that such latent generalization can be recovered via simple probes (called MASC probes) on the layer-wise representations of the model. However, several basic questions about this phenomenon of latent generalization remain poorly understood: (1) What is the origin and dynamics over training of latent generalization during memorization? Specifically, is it the case that model generalization and latent generalization use largely the same underlying mechanisms? (2) Is the specific nature of the probe critical for our ability to extract latent generalization from the model's layerwise outputs? (3) Does there exist a way to immediately transfer latent generalization to model generalization by suitably modifying model weights directly? On the one hand, this question is conceptually important because it establishes conclusively that the latent generalization manifested by the probe is also within reach of the model, with exactly the information that the model was provided during training, namely the corrupted training data. On the other hand -- and more pragmatically -- it also suggests the possibility of "repairing" a trained model that has memorized, without requiring expensive retraining from scratch. To address (1), we track the training dynamics, empirically, and find that latent generalization abilities largely peak early in training, with model generalization, suggesting a common origin for both. However, while model generalization degrades steeply over training thereafter, latent generalization falls more modestly & plateaus at a higher level over epochs of training. These experiments lend circumstantial evidence to the hypothesis that latent generalization uses largely similar mechanisms as those that underlie the model's generalization in the early phases of training. To investigate (2), we examine the MASC probe and show that it is a quadratic classifier. The question in (2) thus becomes whether the quadratic nature of the MASC probe underlies its remarkable effectiveness in extracting latent generalization. If this were so, a linear probe constructed along these lines would not be as effective. To investigate this, we designed a new linear probe for this setting, and find, surprisingly, that it has superior generalization performance in comparison to the quadratic probe, in most cases. This suggests that the quadratic nature of the probe is not critical in extracting latent generalization. Importantly, the effectiveness of the linear probe enables us to answer (3) in the affirmative. Specifically, using this new linear probe, we devise a way to transfer the latent generalization present in last-layer representations to the model by directly modifying the model weights. This immediately endows such models with improved generalization, i.e. without additional training. 
Our findings provide a more detailed account of the rich dynamics of latent generalization during memorization, clarify the specific role of the probe in extracting it, and demonstrate how this understanding can be leveraged to transfer that generalization directly to the model.

URL: https://openreview.net/forum?id=t024Zm0tKF

---

Title: Networked Communication for Decentralised Cooperative Agents in Mean-Field Control

Abstract: The mean-field framework has been used to find approximate solutions to problems involving very large populations of symmetric, anonymous agents, which may be intractable by other methods. The cooperative mean-field control (MFC) problem has received less attention than the non-cooperative mean-field game (MFG), despite the former potentially being more useful as a tool for engineering large-scale collective behaviours. Decentralised communication algorithms have recently been introduced to MFGs, giving benefits to learning speed and robustness. Inspired by this, we introduce networked communication to MFC - where populations arguably have broader incentive to communicate - and in particular to the setting where decentralised agents learn online from a single, non-episodic run of the empirical system. We adapt recent MFG algorithms to this new setting, as well as contributing a novel sub-routine allowing networked agents to estimate the global average reward from their local neighbourhood. Previous theoretical analysis of decentralised communication in MFGs does not extend trivially to MFC. We therefore contribute new theory proving that in MFC the networked communication scheme allows agents to increase social welfare faster than under *both* of the two typical alternative architectures, namely independent and centralised learning. We provide experiments that support this new result across different classes of cooperative game, and also give numerous ablation studies and additional experiments concerning the number of communication rounds and robustness to communication failures.

URL: https://openreview.net/forum?id=qCTg7Dv0DT

---

Title: Representation Similarity Reveals Implicit Layer Grouping in Neural Networks

Abstract: Providing human-understandable insights into the inner workings of neural networks is an important step toward achieving more explainable and trustworthy AI. Analyzing representations across neural layers has become a widely used approach for this purpose in various applications. In this work, we take a step toward a holistic understanding of neural layers by investigating the existence of distinct layer groupings within them. Specifically, we explore using representation similarity within neural networks to identify clusters of similar layers, revealing potential layer groupings. We achieve this by proposing, for the first time to our knowledge, the use of Gromov-Wasserstein distance, which overcomes challenges posed by varying distributions and dimensionalities across intermediate representations--issues that complicate direct layer-to-layer comparisons.
On algebraic, language, and vision tasks, we observe the emergence of layer groups that correspond to functional abstractions within networks. These results reveal implicit layer structure patterns, and suggest that network computations may exhibit abrupt shifts rather than smooth transitions. Through downstream applications of model compression and fine-tuning, we validate our measure and further show that the proposed approach offers meaningful insights into the internal behavior of neural networks.
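
A minimal sketch of the layer-to-layer comparison using the POT library; uniform sample weights and squared-Euclidean intra-layer costs are assumptions made for illustration, not necessarily the paper's exact setup:

    import numpy as np
    import ot  # POT: Python Optimal Transport

    def layer_gw_distance(X, Y):
        """Gromov-Wasserstein discrepancy between two layers' activations
        (n samples each, possibly different feature dimensions)."""
        # Intra-layer pairwise distance matrices; GW compares these structures,
        # so the two layers never need to share a feature space.
        C1 = ot.dist(X, X)
        C2 = ot.dist(Y, Y)
        C1 /= C1.max()
        C2 /= C2.max()
        p = ot.unif(X.shape[0])
        q = ot.unif(Y.shape[0])
        return ot.gromov.gromov_wasserstein2(C1, C2, p, q, loss_fun='square_loss')

    # Pairwise distances between all layers can then be fed to any standard
    # clustering routine to surface candidate layer groups.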

URL: https://openreview.net/forum?id=V91vAkesm7

---

Title: Interpreting Kolmogorov-Arnold Networks in Neuroimaging: A Path-Based Attribution Framework

Abstract: Explainability aspects of most classification models are learnt through instance-specific analysis. However, in understanding diseases, it is important to consider population-wide analysis in order to identify affected regions that are consistently seen across cohorts of the diseased population. In this study, we report the utility of Kolmogorov-Arnold Networks (KANs) in understanding population-wide characteristics seen in subjects affected by Alzheimer’s disease (AD). KANs offer enhanced interpretability through learnable activation functions on network edges. Thus, the learned functions reflect the characteristics of the entire span of training data. In a KAN network trained for classification, attributions through the network can be traced to understand how specific inputs influence the output label. In this study, we propose a path-based attribution framework that generates global importance maps by tracing exhaustive information flow through all potential paths. Our method scores edges using L2 norms of the learned spline and base functions. Subsequently, these scores are propagated through the network to compute path-attributions. This approach scales linearly with network depth, and is only dependent on model training and does not need further analysis on data post-hoc. Evaluations on three public AD neuroimaging datasets (OASIS, ADNI, and Mendeley, comprising 7,428 acquisitions in total) were carried out on 3D brain volumes as well as 2D brain slices. The corresponding KAN test accuracies are $93.24\%$, $81.85\%$, and $91.25\%$ on the OASIS, ADNI, and Mendeley datasets, respectively. Improved performance on metrics such as Insertion AUC, Deletion AUC, and Sufficiency is also demonstrated. The generated attribution maps identify clinically meaningful regions including the body and genu of the corpus callosum, corona radiata, bilateral caudate nuclei, medial prefrontal cortex and temporal lobe structures, aligned with established AD pathology literature. By providing voxel-level global attributions as network-intrinsic properties, our framework addresses a critical gap in medical AI interpretability and supports clinical validation of AI-assisted AD diagnosis systems.

URL: https://openreview.net/forum?id=cPtKpNdYc2

---

Title: GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Abstract: Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently delivers superior quality and offers a $>2\times$ speedup in inference rates across various datasets. We will make our implementation publicly available.
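
A minimal sketch of the grid packing idea: a subsampled frame sequence is tiled into a single grid image that a 2D generator can model directly (dimensions and layout are illustrative):

    import numpy as np

    def frames_to_grid(frames, rows, cols):
        """Pack a subsampled frame sequence into one grid image so a 2D image
        generator (e.g. a DiT) can model the whole sequence at once.

        frames: (n, h, w, c) array with n == rows * cols
        """
        n, h, w, c = frames.shape
        assert n == rows * cols
        grid = frames.reshape(rows, cols, h, w, c)
        # Interleave rows of frames with rows of pixels, then flatten to a
        # single (rows*h, cols*w, c) image.
        grid = grid.transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)
        return grid

    # The inverse reshape recovers the low-resolution frames, which are then
    # super-resolved individually in the second stage.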

URL: https://openreview.net/forum?id=QLD47Ou5lp

---

Title: Discovering Symbolic Differential Equations with Symmetry Invariants

Abstract: Discovering symbolic differential equations from data uncovers fundamental dynamical laws underlying complex systems. However, existing methods often struggle with the vast search space of equations and may produce equations that violate known physical laws. In this work, we address these problems by introducing the concept of \textit{symmetry invariants} in equation discovery. We leverage the fact that differential equations admitting a symmetry group can be expressed in terms of differential invariants of symmetry transformations. Thus, we propose to use these invariants as atomic entities in equation discovery, ensuring the discovered equations satisfy the specified symmetry. Our approach integrates seamlessly with existing equation discovery methods such as sparse regression and genetic programming, improving their accuracy and efficiency. We validate the proposed method through applications to various physical systems, such as Darcy flow and reaction-diffusion, demonstrating its ability to recover parsimonious and interpretable equations that respect the laws of physics.

URL: https://openreview.net/forum?id=9t1dEyYfPc

---

Title: Leveraging Reference Documents for Zero-Shot Ranking via Large Language Models

Abstract: Large language models (LLMs) have proven to be strong zero-shot rerankers, yet the two dominant paradigms expose a sharp accuracy-efficiency trade-off. Existing methods mainly fall into two categories: Individual-scoring (pointwise) issues $O(n)$ parallel calls but suffers from calibration drift across isolated prompts; Comparative-sorting (pairwise/listwise) alleviates drift via explicit inter-document comparison, but incurs higher-than-linear inference or long single-call latency. To address their limitations, we propose **RefRank**, a reference-anchored framework that marries the throughput of Individual-scoring with the calibration benefits of Comparative-sorting. RefRank prompts the LLM to score each candidate relative to a fixed anchor document harvested from the first-stage top-k list; all candidates are thus implicitly compared through the same anchors while parallelism is preserved. The method is training-free, adds no extra model calls, and keeps complexity at $O(n)$. Across six standard benchmarks and multiple backbones, RefRank significantly outperforms Individual-scoring baselines and surpasses Comparative-sorting competitors with only negligible overhead.
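
A minimal sketch of the anchor-relative scoring loop; llm_score is a hypothetical callable standing in for the actual prompting, and the anchor selection shown here is only illustrative:

    def refrank(query, candidates, llm_score, num_anchors=1):
        """Illustrative sketch of reference-anchored reranking.

        llm_score(query, anchor, candidate) is a hypothetical callable that asks
        the LLM how relevant `candidate` is to `query` relative to `anchor`.
        """
        # Anchors are drawn from the first-stage top-k list.
        anchors = candidates[:num_anchors]
        scores = []
        for doc in candidates:
            # Every candidate is scored against the same fixed anchor(s), so all
            # scores share a common reference point; calls are independent, O(n).
            s = sum(llm_score(query, a, doc) for a in anchors) / len(anchors)
            scores.append(s)
        order = sorted(range(len(candidates)), key=lambda i: -scores[i])
        return [candidates[i] for i in order]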

URL: https://openreview.net/forum?id=4XBTDxpLYK

---

Title: Solving Constrained Optimization Problems as ODE-based Models Using Reinforcement Learning

Abstract: Previous learning-to-optimize (L2O) methods on constrained optimization problems often treat neural networks as initializers that generate approximate solutions requiring substantial post-hoc refinements. This approach overlooks a key insight: Solving complex optimization problems often requires iterative refinement of candidate solutions, a process naturally aligned with the Markov Decision Process (MDP) and reinforcement learning (RL) framework. We show that within the MDP framework, RL and Ordinary Differential Equation (ODE)-based generative models (e.g., diffusion, flow matching) are formally equivalent, unifying them as trainable optimizers. Building on our unified perspective, we propose to train a flow-matching model within an RL paradigm as a learnable refinement mechanism, thereby incorporating constraint satisfaction directly into the optimization process. To further enhance feasibility, we introduce a minimal correction step that adjusts solutions to ensure constraint compliance. Empirical results demonstrate that our approach achieves state-of-the-art performance across a range of constrained optimization tasks, yielding improvements in efficiency, solution quality, and feasibility over prior baselines.

URL: https://openreview.net/forum?id=QW0ZX4zRC2

---

Title: Controlling Coverage of Uncertainty Sets for Batch Evaluation via Vanilla Conformal Prediction

Abstract: Conformal prediction (CP) provides provable coverage guarantees over uncertainty sets for any given black-box predictive model. The standard split CP guarantees that for a single test input, the uncertainty set contains the true output with a user-specified probability $1 - \alpha$ (say 90\%). However, in many real-world applications, practitioners evaluate the predictive model on a batch of test inputs after calibration on a fixed set. The marginal coverage guarantee of split CP does not say anything directly about the realized false-coverage proportion (FCP) across a batch of inputs. This paper develops a novel approach referred to as {\em Probably Approximately Correct FCP (PAC-FCP)}. PAC-FCP leverages the key insight that FCP over a batch of test inputs from split CP follows a Beta-Binomial distribution and inverts the Beta–Binomial tail to find the minimum level to produce a guarantee around FCP using vanilla CP methods. We provide theoretical analysis for the validity and effectiveness of PAC-FCP building on prior theoretical results. Our experimental results on 17 OpenML benchmarks for regression and ImageNet data for classification, demonstrate that PAC-FCP achieves the specified FCP rate with smaller prediction sets/intervals.
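
A minimal sketch of the tail inversion, using the standard split-CP fact that the batch miscoverage count is Beta-Binomial; the exact parameterization and search procedure used in the paper may differ:

    import math
    from scipy.stats import betabinom

    def pac_fcp_level(n_cal, m_batch, fcp_target, delta, grid=1000):
        """Largest nominal miscoverage level alpha such that the batch
        false-coverage proportion exceeds fcp_target with probability <= delta.

        Assumes the classical split-CP result: with k = ceil((n+1)(1-alpha)),
        the number of miscovered points in a batch of m test inputs follows
        BetaBinomial(m, n+1-k, k).
        """
        best = 0.0
        max_miss = math.floor(fcp_target * m_batch)  # allowed miscoverages
        for i in range(1, grid):
            alpha = i / grid
            k = math.ceil((n_cal + 1) * (1 - alpha))
            if k < 1 or k > n_cal:
                continue
            # P(more than max_miss miscovered test points in the batch)
            tail = betabinom.sf(max_miss, m_batch, n_cal + 1 - k, k)
            if tail <= delta:
                best = max(best, alpha)
        return best  # run vanilla split CP at this level

    # e.g. pac_fcp_level(n_cal=1000, m_batch=200, fcp_target=0.1, delta=0.05)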

URL: https://openreview.net/forum?id=H1dE34hmHA

---

Title: Uncertainty Regions for Multi-Target Regression via Input- Dependent Conformal Calibration

Abstract: We consider the problem of provable and effective uncertainty quantification (UQ) for multi-target regression tasks where we need to predict multiple related target variables. This is important in many safety-critical applications in domains including healthcare, engineering, and finance. Conformal prediction (CP) is a promising framework for calibrating predictive models for UQ with guaranteed finite sample coverage. There is relatively less work on multi-target CP compared to single-target CP, and existing methods tend to produce large prediction regions that are not useful in real-world applications. This paper proposes a novel approach referred to as {\em Adaptive Prediction Regions (APR)} to produce provably smaller prediction regions by exploiting heterogeneity in the input data. APR is inspired by the principle behind localized CP for single-target \cite{guan2023localized} and extends it to multi-target settings. The key idea behind APR is to perform adaptive calibration by assigning differential weights to multi-dimensional calibration examples based on their similarity to a test input. We theoretically analyze APR and show that it (a) achieves finite-sample coverage guarantees; and (b) constructs smaller prediction regions. Our experiments on diverse real-world datasets with various numbers of targets show that APR outperforms existing methods by producing significantly smaller prediction regions (achieving up to 85.51\% reduction in region area) over state-of-the-art multi-target CP methods.

URL: https://openreview.net/forum?id=O0AXPvbqG9

---

Title: PriSM: Prior-Guided Search Methods for Query Efficient Black-Box Attacks

Abstract: Deep Neural Networks are vulnerable to adversarial examples in black-box settings, requiring query-efficient attack methods. We propose PriSM (Prior-Guided Search Methods), which systematically exploits two types of transferable surrogate information: decision boundary geometry and loss landscape topography. We demonstrate their utility through complementary attacks: (1) TGEA leverages boundary geometry to initialize evolutionary optimization with surrogate evolved populations, maximizing attack success rates, and (2) SGSA leverages loss topography via multi-scale saliency guidance to direct Square Attack's perturbations, minimizing query costs. Across MNIST, CIFAR-10, and ImageNet, both methods achieve 30-60% query reductions compared to uninformed baselines, while also being competitive with state-of-the-art hybrid attacks. Our evaluation reveals a strategic trade-off: SGSA excels in query efficiency through local exploitation, whereas TGEA maximizes success rates via global exploration. Our comprehensive evaluation also demonstrates that different types of surrogate information require matched exploitation strategies, providing practical guidance for query-efficient black-box attacks.

URL: https://openreview.net/forum?id=UQsOh2kfhP

---

Title: Improved Sample Complexity Bounds For Diffusion Model Training Without Empirical Risk Minimizer Access

Abstract: Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\mathcal{O}(\epsilon^{-4})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result that achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.

URL: https://openreview.net/forum?id=CFdNqqlqOv

---

Title: OmniCache: Multidimensional Hierarchical Feature Caching for Diffusion Models

Abstract: Recent high-resolution image and video diffusion models, e.g., SD3, FLUX, Sora, have advanced generative intelligence but remain computationally expensive due to quadratic attention and multi-step inference. In this paper, we address the challenge of computational inefficiency in image & video generation by exploiting the inherent redundancy in the processed token content. We identify four primary types of redundancies: intra-frame, inter-frame, motion, and step redundancy. To mitigate these, we propose OmniCache, a novel mechanism that employs multidimensional hierarchical feature caching techniques: Frame Cache and Block Cache, together with incorporating Token Cache across transformer layers. These strategies enable us to compress spatial features in the temporal layers and temporal features in the spatial layers, significantly enhancing generation efficiency without the need for additional training. Moreover, we also study the improvements introduced by the orthogonal layered caching technique with OmniCache. OmniCache is evaluated on state-of-the-art diffusion models for both image and video generation, including SD3, SVD-XT, and Latte. It achieves up to 35% reduction in inference latency on Stable Diffusion 3 (SD3), 25% on SVD-XT, and 28% on Latte, while maintaining high visual fidelity.

URL: https://openreview.net/forum?id=5lRaQ4XAwN

---

Title: PRISM: Patch Diffusion with Dynamic Retrieval Augmented Guidance and Permutation Invariant Conditioning

Abstract: Diffusion models have achieved state-of-the-art results in image generation but often require extensive computational resources and large-scale datasets, limiting their practicality in resource-constrained settings. To address these challenges, we introduce PRISM, a retrieval-guided, patch-based method that trains solely on image patches instead of full resolution images.
PRISM achieves superior global coherence and outperforms patch-only baselines, even when trained on only a fraction of the data. For each training example, PRISM retrieves semantically related neighbors from a disjoint retrieval set using CLIP embeddings. It aggregates their unordered signals with a Set Transformer, ensuring permutation-invariant conditioning that captures higher-order relationships. A dynamic neighbor-annealing schedule optimizes the contextual guidance over time, leading to more coherent results. Experiments on unconditional image generation tasks using CIFAR-10, CelebA, ImageNet-100, and AFHQv2 datasets, along with ablation studies, validate our approach, demonstrating that retrieval-augmented, set-based conditioning closes the coherence gap in patch-only diffusion.

URL: https://openreview.net/forum?id=ru712j5D2d

---

Title: ODE-Constrained Generative Modeling of Cardiac Dynamics for 12-Lead ECG Synthesis

Abstract: Generating realistic training data for supervised learning remains a significant challenge in artificial intelligence. This is particularly true in the synthesis of electrocardiograms (ECGs), where the objective is to develop a synthetic 12-lead ECG model. The primary challenge in this task lies in accurately modeling the intricate biological and physiological interactions among different ECG leads. Although mathematical process models have shed light on these dynamics, effectively incorporating this understanding into generative models is not straightforward. We introduce an innovative method that employs ordinary differential equations (ODEs) to enhance the fidelity of 12-lead ECG data generation. This approach integrates cardiac dynamics directly into the generative optimization process via a novel Euler Loss, producing biologically plausible data that respects real-world variability and inter-lead constraints. Empirical analysis on the G12EC and PTB-XL datasets demonstrates that augmenting training data with MultiODE-GAN yields consistent, statistically significant improvements in specificity across multiple cardiac abnormalities. This highlights the value of enforcing physiological coherence in synthetic medical data.

URL: https://openreview.net/forum?id=4N56Pwwsti

---

Title: Adapt via Bayesian Nonparametric Clustering: Fine-Grained Classification for Model Recycling Under Domain and Category Shift

Abstract: Recycling pretrained classification models for new domains, known as Source-Free Domain Adaptation (SFDA), has been extensively studied under the closed-set assumption that source and target domains share identical label spaces. However, this assumption does not hold when unseen classes appear in the target domain. Addressing this category shift is challenging, as unknown target classes usually arise with no prior knowledge of their identities or number, and becomes particularly difficult in the source-free setting, where access to source data is unavailable. Most existing methods treat all unknown classes as a single group during both training and evaluation, limiting their capacity to model the underlying structure within the unknown class space. In this work, we present Adapt via Bayesian Nonparametric Clustering (ABC), a novel framework designed for SFDA scenarios where unknown target classes are present. Unlike prior methods, ABC explicitly achieves fine-grained classification of unknown target classes, offering a more structured vision of the problem. Our method first identifies high-confidence target samples likely to belong to known source classes. Using these as guidance, we develop a guided Bayesian nonparametric clustering approach that learns distinct prototypes for both known and unknown classes without requiring the number of unknown classes a priori, and assigns target samples accordingly. We further introduce a training objective that refines the source model by encouraging prototype-based discriminability and local prediction consistency. Experiments show that our method achieves competitive performance on standard benchmarks while simultaneously providing effective clustering of unknown classes.

URL: https://openreview.net/forum?id=J5B4yt7C37

---

Title: VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model

Abstract: Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM), which requires no live environment interaction during policy optimization. VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., “Does this action advance the user’s goal?”). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated across diverse benchmarks including Android-in-the-Wild for mobile apps and Multimodal-Mind2Web for web environments, VEM achieves state-of-the-art or highly competitive performance in both offline and online settings. It significantly outperforms environment-free baselines and matches or exceeds environment-based approaches, crucially without incurring interaction costs. Importantly, VEM demonstrates that robust, generalizable GUI agents can be trained efficiently using semantic-aware value estimation, proving effective across distinct interaction platforms like mobile and web. The code is available at https://anonymous.4open.science/r/VEM-Agent-51E7.

URL: https://openreview.net/forum?id=q1wLUxaBPn

---

Title: A Dual-Protection Framework for Copyright Protection and Image Editing Using Multi-Label Conformal Prediction

Abstract: Recent advances in diffusion models have significantly enhanced image editing capabilities, raising serious concerns about copyright protection. Traditional watermarks often fail to withstand diffusion-based edits, making image protection challenging. To address this, we propose a method that embeds an imperceptible perturbation in images, serving as a watermark while simultaneously disrupting the output of latent diffusion models. Our approach employs a Score Estimator trained on select latent embeddings to embed the watermark by minimizing the score function. We then apply conformal inference to compute p-values for watermark detection. To distort the output of latent diffusion models, we shift watermarked image embeddings away from the distribution mean, distorting unauthorized generations. Experiments demonstrate our framework's superior performance in watermark detection, imperceptibility, and robustness against attacks, offering a comprehensive approach to protect images against latent diffusion models.

URL: https://openreview.net/forum?id=yiOmppKOdj

---

Title: Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

Abstract: Approximate Bayesian Inference typically revolves around computing the posterior parameter distribution. The main practical interest, however, often lies in a model’s predictions rather than its parameters. In this work, we propose to bypass the posterior, focusing directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model’s predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: if a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows plugging in different prior specifications, enabling classical Bayesian sensitivity analysis (w.r.t. prior choice). In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically by employing it in models ranging from Bayesian linear models to Bayesian neural networks. Our approximations outperform classical Laplace approximations on a wide array of both simulated and real-world datasets.

URL: https://openreview.net/forum?id=T8w8L2t3JG

---

Title: Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training

Abstract: In standard large vision-language models (LVLMs) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model—a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLMs training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data. The code and data will be made public.
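
A minimal sketch of the re-weighted next-token loss; the specific weighting formula below (1 - p_ref with a temperature) is a plausible instantiation for illustration, not necessarily the paper's exact importance weight:

    import torch
    import torch.nn.functional as F

    def prior_weighted_ntp_loss(lvlm_logits, ref_logits, targets, tau=1.0):
        """Up-weight caption tokens that a text-only reference LM finds hard to
        predict (i.e., tokens likely tied to the image).

        lvlm_logits: (T, V) logits from the vision-language model (image + text)
        ref_logits:  (T, V) logits from the text-only reference LM (text only)
        targets:     (T,)   ground-truth caption token ids
        """
        # Probability the reference LM assigns to each correct token.
        ref_logp = F.log_softmax(ref_logits, dim=-1)
        p_ref = ref_logp.gather(1, targets[:, None]).squeeze(1).exp()

        # Tokens that are easy to predict without the image get small weights;
        # normalize so the mean weight is 1.
        weights = torch.softmax((1.0 - p_ref) / tau, dim=0) * targets.numel()

        # Standard next-token cross-entropy, re-weighted per token.
        nll = F.cross_entropy(lvlm_logits, targets, reduction="none")
        return (weights.detach() * nll).mean()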

URL: https://openreview.net/forum?id=jDcnL1hB1Z

---

Title: Progressive Depth Up-scaling via Optimal Transport

Abstract: Scaling Large Language Models (LLMs) yields performance gains but incurs substantial training costs. Depth up-scaling offers training efficiency by adding new layers to pre-trained models. However, most existing methods copy or average weights from base layers, neglecting neuron permutation differences. This limitation can potentially cause misalignment that harms performance. Inspired by applying Optimal Transport (OT) for neuron alignment, we propose Optimal Transport Depth Up-Scaling (OpT-DeUS). OpT-DeUS aligns and fuses Transformer blocks in adjacent base layers via OT for new layer creation, to mitigate neuron permutation mismatch between layers. OpT-DeUS achieves better overall performance and training efficiency than existing methods for continual pre-training and supervised fine-tuning across different model sizes. To further evaluate the impact of interpolation positions, our extensive analysis shows that inserting new layers closer to the top results in higher training efficiency due to shorter back-propagation time, while obtaining additional performance gains. We also find a strong correlation between depth up-scaling performance and transport matrix entropy. Code is provided in the supplementary material.
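
A minimal sketch of OT-based neuron alignment before fusing two adjacent layers, using entropic OT from the POT library; the cost function, fusion rule, and insertion position are illustrative assumptions rather than the paper's exact procedure:

    import ot  # POT: Python Optimal Transport

    def ot_fuse_layers(W_a, W_b, reg=0.05):
        """Fuse two adjacent layers' weight matrices into one new layer after
        softly aligning their output neurons with entropic OT.

        W_a, W_b: (out_dim, in_dim) weight matrices of adjacent blocks.
        """
        # Cost: squared Euclidean distance between neurons (rows) of the layers.
        C = ot.dist(W_a, W_b)
        C /= C.max()
        n = W_a.shape[0]
        p = ot.unif(n)
        q = ot.unif(n)
        T = ot.sinkhorn(p, q, C, reg)       # soft neuron-to-neuron alignment
        W_b_aligned = (n * T) @ W_b         # map W_b's neurons onto W_a's ordering
        return 0.5 * (W_a + W_b_aligned)    # fused weights for the inserted layer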

URL: https://openreview.net/forum?id=ybKrBKnxPV

---

Title: SCUT: Spectral Clustering for Unsupervised classification Trees

Abstract: The lack of annotated data often limits the training of machine learning models. In addition, during the labelling process, some data points may remain unlabelled. While unsupervised methods such as clustering can reveal the underlying structure of the data, they are typically unsuitable to place new samples into existing clusters. Here, we propose Spectral Clustering for Unsupervised decision Tree (SCUT), a novel hierarchical clustering method based on algebraic connectivity that can position new data points appropriately within the clustering structure. By leveraging a feature-splitting approach, SCUT also enables straightforward extraction of {\em ante-hoc} explanations for its clustering decisions. Formally, SCUT works by recursively splitting the data through the solution of the Normalized Cut (NCUT) problem—a graph-partitioning formulation that seeks to split a graph into balanced subsets while minimizing the total connection strength between them—on a bipartite graph. We demonstrate, both visually and quantitatively, that SCUT captures the intrinsic structure of data more effectively than existing methods, while offering competitive performance compared to common hierarchical clustering algorithms.
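
A minimal sketch of the spectral split underlying NCUT-style recursive partitioning, via the Fiedler vector of the normalized graph Laplacian; this illustrates the general relaxation, not SCUT's bipartite, feature-splitting formulation:

    import numpy as np
    from scipy.sparse.csgraph import laplacian

    def spectral_split(W):
        """One recursive splitting step (illustrative).

        W: (n, n) symmetric affinity matrix over data points.
        Returns a boolean mask assigning each point to one of two children.
        """
        # The second-smallest eigenvector of the normalized Laplacian (the
        # Fiedler vector, tied to algebraic connectivity) is the relaxed
        # solution of the normalized-cut problem.
        L = laplacian(W, normed=True)
        eigvals, eigvecs = np.linalg.eigh(L)
        fiedler = eigvecs[:, 1]
        return fiedler >= 0  # split at the sign of the Fiedler vector

    # Recursing on each side yields a hierarchical tree of clusters.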

URL: https://openreview.net/forum?id=wjTysDxSg5

---

Title: BN-Pool: a Bayesian Nonparametric Pooling for Graphs

Abstract: We introduce BN-Pool, the first clustering-based pooling method for Graph Neural Networks that adaptively determines the number of supernodes in a coarsened graph.
BN-Pool leverages a generative model based on a Bayesian non-parametric framework for partitioning graph nodes into an unbounded number of clusters. During training, the node-to-cluster assignments are learned by combining the supervised loss of the downstream task with an unsupervised auxiliary term, which encourages the reconstruction of the original graph topology while penalizing unnecessary proliferation of clusters. By automatically discovering the optimal coarsening level for each graph, BN-Pool preserves the performance of soft-clustering pooling methods while avoiding their typical redundancy by learning compact pooled graphs.
The code is available at https://anonymous.4open.science/r/BN-Pool.

URL: https://openreview.net/forum?id=3B3Zr2xfkf

---

Title: Probing Layer-wise Memorization and Generalization in Deep Neural Networks via Model Stitching

Abstract: It is well-known that deep neural networks can both memorize randomly labeled training data and generalize to unseen inputs. However, despite several prior efforts, the mechanism and dynamics of how and where memorization takes place in the network are still unclear, with contradictory findings in the literature. To address this, we aim to study the functional similarity between the layers of the memorized model to the model that generalizes. Specifically, we leverage model stitching as a tool to enable layer-wise comparison of a memorized noisy model, trained on a partially noisy-labeled dataset, to that of the generalized clean model, trained on a clean, noise-free dataset.
Our simple but effective approach guides the design of experiments that help shed light on the learning dynamics of different layers in deep neural networks and why models with harmful memorization still generalize well. Our results show that early layers are as important as deeper ones for generalization. We find that ``cleaning'' the early layers of the noisy model improves the functional similarity of its deeper layers to that of the corresponding layers in the clean model. Moreover, cleaning the noise in the early layers of the noisy model can drastically reduce memorization and improve generalization. Furthermore, noise fixation up to a certain depth results in generalization similar to that of a noise-free model. However, interestingly, the reverse may not be true. That is, if early layers are noisy but deeper layers are noise-free, then perfect memorization cannot be achieved, emphasizing the dominant role of deeper layers in memorization.
Our extensive experiments on four different architectures (a customized CNN model, ResNet-18, ResNet-34, and ResNet-50) and three datasets (SVHN, CIFAR-10, and CIFAR-100), with varying levels of noise, consistently corroborate our findings.

URL: https://openreview.net/forum?id=wWye46fXo7

---

Title: SLIP: Securing LLM’s IP Using Weights Decomposition

Abstract: Large language models (LLMs) have recently seen widespread adoption in both academia and industry. As these models grow, they become valuable intellectual property (IP), reflecting enormous investments by their owners. Moreover, the high cost of cloud-based deployment has driven interest towards deployment to edge devices, yet this risks exposing valuable parameters to theft and unauthorized use. Current methods to protect models’ IP on the edge have limitations in terms of practicality, loss in accuracy, or suitability to requirements. In this paper, we introduce a novel hybrid inference algorithm, named SLIP, designed to protect edge-deployed models from theft. SLIP is the first hybrid protocol that is both practical for real-world applications and provably IP-preserving, while having zero accuracy degradation and minimal impact on latency. It involves partitioning the model between two computing resources, one secure but expensive, and another cost-effective but untrusted. This is achieved through matrix decomposition, ensuring that the secure resource retains a maximally sensitive portion of the model’s IP while performing a minimal amount of computations, and vice versa for the untrusted resource. Importantly, the protocol includes guarantees that prevent attackers from exploiting the partition to infer the model weights. Finally, we present experimental results that show the robustness and effectiveness of our method, positioning it as a compelling solution for protecting LLMs.

URL: https://openreview.net/forum?id=3MAGV75ndV

---

Title: Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

Abstract: Effective problem solving with Large Language Models (LLMs) can be enhanced when they are paired with external search algorithms. By viewing the space of diverse ideas and their follow-up possibilities as a tree structure, the search algorithm can navigate such a search space and guide the LLM toward better solutions more efficiently. While the search algorithm enables an effective balance between exploitation and exploration of a tree-structured space, the need for an external component can complicate the overall problem-solving process. We therefore pose the following question: can LLMs or their underlying Transformer architectures fully internalize the search algorithm? To answer this question, we first introduce a simplified framework in which tree extensions and feedback signals are externally specified, allowing for controlled evaluation of search capabilities. We call this setting unknown tree search with bandit feedback. Within this setting, we show that Transformers are theoretically expressive enough to implement distinct search strategies and can be trained from scratch to approximate those strategies. Our Transformer models are capable of generalizing to unseen conditions such as longer horizons or deeper trees. Furthermore, we demonstrate that continued task-focused training unlocks the complete capabilities of a pretrained LLM, by fine-tuning the LLM on search trajectories.

URL: https://openreview.net/forum?id=Jij7zCjVfc

---

Title: On the Vulnerability of Discrete Graph Diffusion Models to Backdoor Attacks

Abstract: Diffusion models have demonstrated remarkable generative capabilities in continuous data domains such as images and videos. Recently, discrete graph diffusion models (DGDMs) have extended this success to graph generation, achieving state-of-the-art performance. However, deploying DGDMs in safety-critical applications—such as drug discovery—poses significant risks without a thorough understanding of their security vulnerabilities.
In this work, we conduct the first study of backdoor attacks on DGDMs, a potent threat that manipulates both the training and generation phases of graph diffusion. We begin by formalizing the threat model and then design a backdoor attack that enables the compromised model to: 1) generate high-quality, benign graphs when the backdoor is not activated,
2) produce effective, stealthy, and persistent backdoored graphs when triggered, and
3) preserve fundamental graph properties—permutation invariance and exchangeability—even under attack.
We validate 1) and 2) empirically, both with and without backdoor defenses, and support 3) through theoretical analysis.

URL: https://openreview.net/forum?id=Brn7lUoDtf

---

Title: Incremental3D: Incremental 3D Scene Generation with Scene Graph for Immersive Teleoperation

Abstract: Graph-based 3D scene generation aims to synthesize 3D environments conditioned on scene graphs and has been widely explored in applications such as 3D gaming and interior design. However, its potential for immersive robotic teleoperation has been largely overlooked. In this setting, transmitting lightweight incremental 3D scene graphs from the robot side to the operator side is far more bandwidth-efficient and lower-latency than streaming raw RGB or point-cloud data. At the same time, recent advances in robot-side 3D scene-graph learning now make such incremental scene-graphs readily obtainable from RGB-D inputs. Despite this opportunity, existing scene-graph-based 3D scene generation methods are fundamentally single-shot: inserting even a single new object requires regenerating the entire scene. This global re-computation incurs prohibitive latency and renders existing approaches unsuitable for real-time immersive robotic teleoperation, where the scene graph, and therefore the scene itself, is built and generated incrementally as the robot moves through the environment. To address this limitation, we propose \textit{Incremental3D}, the first framework capable of incremental graph-to-3D scene generation for teleoperation applications. \textit{Incremental3D} augments an existing scene graph with a global classification (CLS) node that maintains a holistic representation of the evolving environment. At each update step, the CLS node aggregates global context and conditions the generation of newly added objects, enabling geometry synthesis and spatial prediction without recomputing unchanged regions. Extensive experiments demonstrate that \textit{Incremental3D} achieves 38 Hz generation speed while maintaining high spatial accuracy, indicating its suitability for real-time teleoperation and other latency-sensitive 3D applications.

URL: https://openreview.net/forum?id=am8Zv3R8GW

---

Title: PRISM: PRIor from corpus Statistics for topic Modeling

Abstract: Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce \textbf{PRISM}, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings.

Code will be released upon acceptance.

URL: https://openreview.net/forum?id=454v3Xbtza

---

Title: On the Unreasonable Effectiveness of Last-layer Retraining

Abstract: Last-layer retraining (LLR) methods --- wherein the last layer of a neural network is reinitialized and retrained on a held-out set following ERM training --- have garnered interest as an efficient approach to rectify dependence on spurious correlations and improve performance on minority groups. Surprisingly, LLR has been found to improve worst-group accuracy even when the held-out set is an imbalanced subset of the training set. We initially hypothesize that this ``unreasonable effectiveness'' of LLR is explained by its ability to mitigate neural collapse through the held-out set, resulting in the implicit bias of gradient descent benefiting robustness. Our empirical investigation does not support this hypothesis. Instead, we present strong evidence for an alternative hypothesis: that the success of LLR is primarily due to better group balance in the held-out set. We conclude by showing how the recent algorithms CB-LLR and AFR perform implicit group-balancing to elicit a robustness improvement.
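
A minimal last-layer retraining sketch, assuming the network exposes its classifier head as model.fc; the attribute name and hyperparameters are placeholders, not values from the paper:

    import torch
    import torch.nn as nn

    def last_layer_retrain(model, heldout_loader, num_classes, epochs=10, lr=1e-3):
        """Freeze the ERM-trained backbone, re-initialize the classifier head,
        and retrain it on a held-out set (ideally group-balanced, per the
        paper's finding)."""
        for p in model.parameters():
            p.requires_grad_(False)

        # model.fc is assumed to be the final linear layer; adapt to your model.
        in_features = model.fc.in_features
        model.fc = nn.Linear(in_features, num_classes)

        opt = torch.optim.SGD(model.fc.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in heldout_loader:
                opt.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                opt.step()
        return model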

URL: https://openreview.net/forum?id=h81ztbrkFb

---

Title: Goal Achievement Guided Exploitation: Rethinking Maximum Entropy Reinforcement Learning

Abstract: Reinforcement learning (RL) algorithms often rely on entropy maximization to prevent premature convergence, yet this practice introduces fundamental drawbacks: it alters the optimization objective and cannot guarantee sufficient exploration in some tasks with local optima. We propose Goal Achievement Guided Exploitation (GAGE), a principled alternative that adaptively regulates exploration based on the agent's performance relative to the optimal goal. Instead of maximizing entropy, GAGE enforces hard lower bounds on policy flatness, represented by the standard deviation in continuous actions and the logit range in discrete ones, providing interpretable and controllable exploration without modifying the reward function. This mechanism ensures lower bounds on action probabilities and naturally reduces stochasticity as learning progresses. Across a suite of challenging robotic control tasks with severe local optima, GAGE consistently improves stability, robustness, and final performance over entropy-based baselines for both on-policy and off-policy algorithms by a clear margin. Our results suggest that performance-guided exploration offers a scalable and interpretable direction beyond the maximum-entropy paradigm in reinforcement learning.
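
A minimal sketch of the flatness lower bound for continuous actions; the schedule tying min_std to goal achievement is a hypothetical illustration, not the paper's exact rule:

    import torch
    import torch.nn as nn

    class FlatnessBoundedGaussianHead(nn.Module):
        """Gaussian policy head with a hard lower bound on the action std."""

        def __init__(self, hidden_dim, action_dim):
            super().__init__()
            self.mu = nn.Linear(hidden_dim, action_dim)
            self.log_std = nn.Parameter(torch.zeros(action_dim))

        def forward(self, h, min_std):
            # Clamp the std from below: the policy can never become more
            # deterministic than the current exploration floor allows.
            std = torch.clamp(self.log_std.exp(), min=min_std)
            return torch.distributions.Normal(self.mu(h), std)

    # A hypothetical performance-guided schedule for the floor:
    # min_std = max_std * (1.0 - achieved_return / optimal_return)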

URL: https://openreview.net/forum?id=uGidW0fKhK

---

Title: Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

Abstract: We unify two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical ``hard-example upweighting'' modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods.
This perspective provides a lens for RLVR beyond our original motivation of Pass@K.

URL: https://openreview.net/forum?id=R1RhBFUk8t

---

Title: GroundingBooth: Grounding Text-to-Image Customization

Abstract: Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks. The code link will be provided upon acceptance.

URL: https://openreview.net/forum?id=TRlZpHU300

---

Title: Synapse: Adaptive Arbitration of Complementary Expertise in Time Series Foundational Models

Abstract: Pre-trained Time Series Foundational Models (TSFMs) represent a significant advance, capable of forecasting diverse time series with complex characteristics, including varied seasonalities, trends, and long-range dependencies. Despite their primary goal of universal time series forecasting, their efficacy is far from uniform; divergent training protocols and data sources cause individual TSFMs to exhibit highly variable performance across different forecasting tasks, domains, and horizons. Leveraging this complementary expertise by arbitrating existing TSFM outputs presents a compelling strategy, yet this remains a largely unexplored area of research. In this paper, we conduct a thorough examination of how different TSFMs exhibit specialized performance profiles across various forecasting settings, and how we can effectively leverage this behavior in arbitration between different time series models. We specifically analyze how factors such as model selection and forecast horizon distribution can influence the efficacy of arbitration strategies. Based on this analysis, we propose Synapse, a novel arbitration framework for TSFMs. Synapse is designed to dynamically leverage a pool of TSFMs, assign and adjust predictive weights based on their relative, context-dependent performance, and construct a robust forecast distribution by adaptively sampling from the output quantiles of constituent models. Experimental results demonstrate that Synapse consistently outperforms other popular ensembling techniques as well as individual TSFMs, demonstrating Synapse's efficacy in time series forecasting.

URL: https://openreview.net/forum?id=j3HqbsCwt1

---

Title: Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures

Abstract: Achieving reliable control of Large Language Models (LLMs) requires a precise, scalable understanding of how they interpret linguistic cues. We introduce a rigorous framework using Shapley values to quantify the steering effect of individual adjectives on model performance, moving beyond anecdotal heuristics to principled attribution. Applying this method to 100 adjectives across a diverse suite of models (including o3, gpt-4o-mini, phi-3, llama-3-70b, and deepseek-r1) on the MMLU benchmark, we uncover several critical findings for AI alignment. First, we find that a small subset of adjectives act as disproportionately powerful "levers," yet their effects are not universal. Cross-model analysis reveals a "family effect": models of a shared lineage exhibit correlated sensitivity profiles, while architecturally distinct models react in a largely uncorrelated manner, challenging the notion of a one-size-fits-all prompting strategy. Second, focused follow-up studies demonstrate that the steering direction of these powerful adjectives is not intrinsic but is highly contingent on their syntactic role and position within the prompt. For larger models like gpt-4o-mini, we provide the first quantitative evidence of strong, non-additive interaction effects where adjectives can synergistically amplify, antagonistically dampen, or even reverse each other's impact. In contrast, smaller models like phi-3 exhibit a more literal and less compositional response. These results suggest that as models scale, their interpretation of prompts becomes more sophisticated but also less predictable, posing a significant challenge for robustly steering model behavior and highlighting the need for compositional and model-specific alignment techniques.

URL: https://openreview.net/forum?id=xN7NYpQeBm

---

Title: RWR-RGCN : A Novel Framework for Fraud Detection via Node Context Aggregation

Abstract: The integrity of online reviews is crucial for businesses, yet widespread review fraud poses significant risks. This paper addresses this challenge by leveraging the power of multi-relational graph convolutional networks (RGCNs) for fraud detection. We introduce RWR-RGCN, a novel framework integrating a multi-layer RGCN architecture with Random Walks with Restart (RWR). RWR plays an essential role in capturing critical connections: it generates node sequences along which node features are aggregated, enhancing the model's understanding of both the local and global context within the review graph. To further refine fraud detection, we incorporate Louvain clustering to identify high-modularity communities indicative of coordinated fraudulent activity. Evaluated on the Yelp dataset, RWR-RGCN achieved an AUC of 82.58\% and a recall of 94.56\%, surpassing state-of-the-art and baseline methods in both metrics. These results demonstrate the effectiveness of the proposed framework in detecting fraudulent activity within complex online review networks.
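
For readers unfamiliar with Random Walk with Restart, a minimal sketch of the underlying iteration is given below; the restart probability and stopping rule are generic defaults, and this is not the paper's specific integration with the RGCN layers.

    import numpy as np

    def rwr_scores(adj, seed_node, restart_prob=0.15, n_iter=100, tol=1e-8):
        # Stationary visiting distribution of a walk that restarts at seed_node.
        deg = adj.sum(axis=1, keepdims=True)
        P = np.divide(adj, deg, out=np.zeros_like(adj, dtype=float), where=deg > 0)
        r = np.zeros(adj.shape[0]); r[seed_node] = 1.0
        p = r.copy()
        for _ in range(n_iter):
            p_next = (1 - restart_prob) * P.T @ p + restart_prob * r
            if np.abs(p_next - p).sum() < tol:
                break
            p = p_next
        return p  # high scores mark nodes strongly connected to the seed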

URL: https://openreview.net/forum?id=y3mhzyu1TT

---

Title: CI-CBM: Class-Incremental Concept Bottleneck Model for Interpretable Continual Learning

Abstract: Catastrophic forgetting remains a fundamental challenge in continual learning, in which models often forget previous knowledge when fine-tuned on a new task. This issue is especially pronounced in class incremental learning (CIL), which is the most challenging setting in continual learning. Existing methods to address catastrophic forgetting often sacrifice either model interpretability or accuracy. To address this challenge, we introduce the Class-Incremental Concept Bottleneck Model (CI-CBM), which leverages novel techniques, including concept regularization and pseudo-concept generation, to maintain interpretable decision processes throughout incremental learning phases.
Through extensive evaluation on seven benchmark datasets, CI-CBM achieves comparable performance to black-box models and significantly outperforms previous interpretable approaches in CIL, with an average 36\% accuracy gain.
CI-CBM provides both interpretable decisions on individual inputs and understandable global decision rules, as shown in our experiments, thereby demonstrating that human-understandable concepts can be maintained during incremental learning without compromising model performance.
Our approach is effective in both pretrained and non-pretrained scenarios; in the latter, the backbone is trained from scratch during the first learning phase.

URL: https://openreview.net/forum?id=Wf6OpLgj2i

---

Title: A Systematic Study of Model Merging Techniques in Large Language Models

Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
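
For reference, Task Arithmetic itself is simple to state: add a scaled sum of task vectors (fine-tuned weights minus base weights) to the base model. A minimal sketch over PyTorch-style state dicts is shown below; the scaling coefficient lam is a hyperparameter and the function name is ours.

    import torch

    def task_arithmetic_merge(base_state, finetuned_states, lam=0.3):
        # base_state: dict of base-model tensors; finetuned_states: list of such dicts.
        merged = {}
        for name, base_w in base_state.items():
            task_vectors = [ft[name] - base_w for ft in finetuned_states]
            merged[name] = base_w + lam * torch.stack(task_vectors).sum(dim=0)
        return merged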

URL: https://openreview.net/forum?id=6zSIyrqS7J

---

Title: DRAW: Domain Weight Randomization with Bayesian Updating for LLM Pre-Training

Abstract: Optimal pre-training data mixture is pivotal for large language model (LLM) performance, but searching for the best domain weights is computationally expensive. We present Domain Weight Randomization with Bayesian Updating (DRAW), a principled framework treating domain weights as Dirichlet-distributed random variables whose parameters scale with model width. Informative priors are first estimated using proxy models; the main model then refines these using Bayesian inference and parameter scaling, dynamically sampling domain weights during training. Theoretically, DRAW reduces generalization error at a rate $\mathcal{O}(1/\sqrt{n})$ as model width increases, ensuring stable convergence. Empirical results on open-domain corpora and diverse benchmarks show DRAW reliably outperforms fixed and adaptive baselines in both language modeling and downstream tasks, achieving better average and worst-case performance alongside strong robustness. DRAW not only highlights valuable data domains while suppressing noisy ones, but also introduces a scalable and effective mechanism for adaptive data mixing in LLM pre-training, facilitating efficient knowledge transfer from proxy to large models.
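
A minimal sketch of the sampling step suggested by the abstract is shown below: domain weights are drawn from a Dirichlet whose concentration comes from proxy-model estimates and grows with model width; the exact prior construction and the width scaling rule are assumptions, not DRAW's update.

    import numpy as np

    def sample_domain_weights(prior_counts, width_scale=1.0, rng=None):
        # prior_counts: informative Dirichlet parameters, e.g. estimated with proxy models.
        # width_scale: grows with model width, so sampled weights concentrate around the prior mean.
        rng = rng or np.random.default_rng(0)
        alpha = np.asarray(prior_counts, dtype=float) * width_scale
        return rng.dirichlet(alpha)

    # Example: proxy estimates favour domain 0; larger width_scale -> less noisy mixtures.
    weights = sample_domain_weights([5.0, 2.0, 1.0], width_scale=4.0)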

URL: https://openreview.net/forum?id=tc8TyD7ZyD

---

Title: Domain-Invariant Hyperbolic Distillation for Robust Medical Image Analysis

Abstract: Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the superiority of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in-distribution datasets and three ViT backbones. We further propose an unsupervised, domain-invariant hyperbolic distillation strategy. Extensive experiments confirm that our hyperbolic distillation learns domain-invariant features and outperforms state-of-the-art Euclidean methods by an average of $+2.1\%$ AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-Wilds, and a cross-dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across severely different conditions. The code will be released upon acceptance.

URL: https://openreview.net/forum?id=1spGpYmDjy

---

Title: Communication-Efficient Adaptive Federated Bi-level Optimization with Data and System Heterogeneity

Abstract: Bilevel optimization is a popular nested optimization model in machine learning. Federated bilevel optimization, which extends bilevel optimization to the Federated Learning setting, faces challenges such as complex nested sub-loops, high communication overhead, and a lack of adaptive mechanisms. To address these issues, this paper proposes an Adaptive Single-loop Federated Bilevel Optimization algorithm (ASFBO) in the presence of both data heterogeneity (Non-IID client data) and system heterogeneity (partial client participation per round and varying numbers of local iterations). By replacing nested sub-iterations with a single-loop architecture, ASFBO significantly reduces communication frequency and computational costs. It employs multiple adaptive learning rate variables to dynamically adjust the step sizes of upper-level variable updates, thereby speeding up the algorithm's convergence. Furthermore, a locally accelerated version of the algorithm (LA-ASFBO) that incorporates momentum-based variance reduction techniques is proposed to mitigate hyper-gradient estimation bias across distributed nodes effectively. Theoretical analysis shows that, under the classic setting of a non-convex upper-level and strongly convex lower-level, ASFBO and LA-ASFBO achieve convergence to an $\epsilon$-stationary point with only $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity and $\tilde{\mathcal{O}}(\epsilon^{-1})$ communication complexity. Experiments on federated hyper-representation learning tasks demonstrate the superiority of the proposed algorithm.

URL: https://openreview.net/forum?id=f9LWE2bA4R

---

Title: Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.

URL: https://openreview.net/forum?id=dfc2HpDSlH

---

Title: From Clutter to Clarity: Visual Recognition through Foveated Object-Centric Learning (FocL)

Abstract: Human active vision integrates spatial attention (dorsal) and object recognition (ventral) as distinct information processing pathways. Rapid eye movements focus perception on task-relevant regions while filtering out background clutter. Mimicking this ventral specialization, we introduce FocL (Foveated Object-Centric Learning), a training strategy that biases image classification models toward label-consistent object regions by replacing full images with foveated crops. Standard training often relies on spurious correlation between label and background, increasing memorization of hard examples in the tail of the difficulty distribution. FocL simulates saccades by jittering fixation points and extracting foveated glimpses from annotated bounding boxes. This object-first restructuring reduces non-foreground contamination and lowers mean training loss. FocL reduces memorization, lowering mean cumulative sample loss by approximately 65 % and making nearly all high-memorization samples (top 1 %) easier to learn. It also increases the mean $\ell_2$ adversarial perturbation distance required to flip predictions by approximately 62 %. On ImageNet-V1, FocL achieves around 11 % higher accuracy on oracle crops. When paired with the Segment Anything Model (SAM) as a dorsal proposal generator, FocL provides around an 8 % gain on ImageNet-V1 and around 8 % under natural distribution shift (ImageNet-V2). Extending this setup to COCO, FocL improves cross-domain mAP by 3--4 points without any target-domain training. Finally, FocL reaches higher accuracy using roughly 56 % less training data, offering a simple path to more robust and efficient visual recognition.
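
A rough sketch of the foveated-crop step is shown below: the fixation point is jittered around the annotated box centre and a padded glimpse is extracted; the jitter scale and padding are illustrative assumptions rather than FocL's exact settings.

    import numpy as np

    def foveated_crop(image, box, jitter=0.1, pad=0.2, rng=None):
        # image: HxWxC array; box: (x0, y0, x1, y1) annotated bounding box in pixels.
        rng = rng or np.random.default_rng(0)
        h, w = image.shape[:2]
        x0, y0, x1, y1 = box
        bw, bh = x1 - x0, y1 - y0
        # Simulate a saccade: jitter the fixation point around the box centre.
        cx = (x0 + x1) / 2 + rng.normal(0, jitter * bw)
        cy = (y0 + y1) / 2 + rng.normal(0, jitter * bh)
        half_w, half_h = (1 + pad) * bw / 2, (1 + pad) * bh / 2
        xa, xb = int(max(0, cx - half_w)), int(min(w, cx + half_w))
        ya, yb = int(max(0, cy - half_h)), int(min(h, cy + half_h))
        return image[ya:yb, xa:xb]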

URL: https://openreview.net/forum?id=kVS7sMlv7P

---

Title: CP-POL + PPI: Conformal Guarantees in Partially-Observed Label Space

Abstract: We study Conformal Prediction (CP) in the practical and challenging regime where labeled training and calibration data observe only a subset of the label space. In this setting, classical conformal guarantees no longer control marginal risk, and naive unseen-label detection methods are either overconservative or uninformative. We introduce CP-POL, a simple operational pipeline that couples Split CP over observed labels with a calibrated novelty test and integrates Prediction-Powered Inference (PPI) for finite-sample population estimation. We provide a non-asymptotic theory that (i) proves a Le Cam-type impossibility result: novelty detection from features alone is hopeless without structural assumptions, (ii) derives tight finite-sample coverage decompositions that isolate the role of the non-conforming event $s(X)>q$, (iii) gives Dvoretzky-Kiefer-Wolfowitz (DKW)-based conservative estimators and anytime martingale analogues for the novel mass function $\pi_{nov}$, (iv) identifies practically meaningful structural conditions under which strong guarantees for novel region prediction hold, and (v) proves finite-sample PPI bounds that cleanly separate sampling fluctuation, trained model error and novel-mass effects. We validate the theory with reproducible simulations. All bounds are non-asymptotic and designed for immediate use in deployed monitoring pipelines.
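
As background, the split-conformal component over observed labels can be sketched in a few lines; the nonconformity scores and the convention of flagging an empty prediction set as a novelty are assumptions for illustration, not the CP-POL pipeline itself.

    import numpy as np

    def conformal_quantile(cal_scores, alpha=0.1):
        # Standard split-conformal quantile of calibration nonconformity scores.
        n = len(cal_scores)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        return np.sort(np.asarray(cal_scores))[min(k, n) - 1]

    def predict_set(label_scores, q):
        # label_scores: dict mapping each *observed* label to a nonconformity score.
        covered = {y for y, s in label_scores.items() if s <= q}
        # An empty set over observed labels is treated here as a novelty flag.
        return covered if covered else {"<novel>"}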

URL: https://openreview.net/forum?id=GEy2BtBQKa

---

Title: Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation

Abstract: Neural methods for Complex Query Answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that generalize beyond explicit graph structure, allowing them to infer answers that are unreachable through symbolic query processing.
In this work, we critically examine this assumption through a systematic analysis comparing neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based approaches perform similarly, with no neural model consistently outperforming the latter. Moreover, a similarity analysis reveals that their retrieved answers exhibit little overlap, and that combining their outputs consistently improves performance.
These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings highlight the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from incorporating principles of query relaxation.

URL: https://openreview.net/forum?id=YVFxB6bkeC

---

Title: A Normative Framework for Reasoning in Language Models

Abstract: Large Language Models (LLMs) increasingly exhibit advanced abilities, enabled by techniques such as chain-of-thought prompting and test-time deliberation. However, they continue to struggle with tasks that demand complex reasoning, prompting debate over whether their outputs reflect genuine reasoning processes or merely statistical pattern generation. These difficulties stem in part from the absence of a unified framework for explaining and assessing reasoning in LLMs, which limits our ability to diagnose errors, establish bounds, and design effective interventions. In this paper, we propose a normative framework that characterizes reasoning as probabilistic inference over propositions and we show how this abstraction can be instantiated in LLMs. Within this framework, we provide a typology of reasoning modes, formalise success criteria for proposition-level correctness, and derive a taxonomy of failure modes. For each class, we map model-level requirements to LLM-level implementation constraints and identify potential remedies. Finally, we outline a roadmap for improving proposition-level accuracy under tractable approximations. Our contribution is both diagnostic and prescriptive: an account of what it means for LLMs to reason, where and why current systems fail, and how to close the gap.

URL: https://openreview.net/forum?id=rexmsDzqwf

---

Title: On the Generalization Superiority of Flat Representation Manifolds for Deep Learning Machines

Abstract: While modern (deep) Neural Networks (NN) with their high number of parameters have the ability to memorize training data, they achieve surprisingly high accuracies on test sets. One theory that could explain this behavior is based on the manifold hypothesis: real-world high-dimensional input data lies near low-dimensional manifolds. A NN layer transforms the input manifold, arriving at a so-called representation manifold. The NN learns transformations which flatten and disentangle the manifolds layer by layer. In this way, the NNs learn the structure of the data instead of memorizing. Under the manifold hypothesis, we demonstrate that flat manifolds (affine linear subspaces) in the second-to-last layer of a classification network ensure perfect classification performance in the noiseless case. In regression tasks, we derive an upper bound on the generalization error which decreases as the input manifold becomes flatter. In the case of almost flat manifolds, the bound can be modified to be even lower. These results support the argument that flat input manifolds improve generalization. However, we argue that the results can also be used to show that flatter representation manifolds improve generalization. Further, we conduct numerical experiments to show that these findings apply beyond strict theoretical assumptions. Based on our results, we argue that a flatness promoting regularizer, combined with an $L_1$-regularizer, could enhance the generalization of Neural Networks.

URL: https://openreview.net/forum?id=z92WP36Vxm

---

Title: Designing Preconditioners for SGD: Local Conditioning, Noise Floors, and Basin Stability

Abstract: Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate–floor behavior.
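
For concreteness, the basic objects in such an analysis can be written out; the update rule and the $\mathbf{M}$-metric effective condition number below are standard definitions, and whether the paper's bounds use exactly this formulation is an assumption on our part.

    \[
      x_{k+1} \;=\; x_k \;-\; \eta\,\mathbf{M}^{-1} g_k,
      \qquad \mathbb{E}\big[g_k \mid x_k\big] = \nabla f(x_k),
    \]
    \[
      \kappa_{\mathbf{M}} \;=\;
      \frac{\lambda_{\max}\big(\mathbf{M}^{-1/2}\,\nabla^2 f\,\mathbf{M}^{-1/2}\big)}
           {\lambda_{\min}\big(\mathbf{M}^{-1/2}\,\nabla^2 f\,\mathbf{M}^{-1/2}\big)}.
    \]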

URL: https://openreview.net/forum?id=vo8FOBt6f6

---

Title: Diffusion Models are Secretly Zero-Shot 3DGS Harmonizers

Abstract: Gaussian Splatting has become a popular technique for various 3D Computer Vision tasks, including novel view synthesis, scene reconstruction, and dynamic scene rendering. However, the challenge of natural-looking object insertion, where the object's appearance seamlessly matches the scene, remains unsolved. In this work, we propose a method, dubbed D3DR, for inserting a 3DGS-parametrized object into a 3DGS scene while correcting its lighting, shadows, and other visual artifacts to ensure consistency. We reveal a hidden ability of diffusion models trained on large real-world datasets to implicitly understand correct scene lighting, and leverage it in our pipeline. After inserting the object, we optimize a diffusion-based Delta Denoising Score (DDS)-inspired objective to adjust its 3D Gaussian parameters for proper lighting correction. We introduce a novel diffusion personalization technique that preserves object geometry and texture across diverse lighting conditions, and utilize it to achieve consistent identity matching between original and inserted objects. Finally, we demonstrate the effectiveness of the method by comparing it to existing approaches, achieving 2.0 dB PSNR improvements in relighting quality.

URL: https://openreview.net/forum?id=1jjIitxVmM

---

Title: Models with a Cause: Causal Discovery with Language Models on Temporally Ordered Text Data

Abstract: While language models (LMs) have been proposed for causal discovery tasks, it remains unclear whether they possess the inductive biases necessary to identify causal structures in token generation processes. We investigate whether LMs can learn the causal structure governing how tokens depend on their predecessors by testing if they possess the temporal and statistical properties required for causal discovery. We prove that existing algorithms can recover a unique causal model when token sequences satisfy standard causal assumptions and have temporal ordering. LMs' sequential processing and positional encodings enable them to leverage this temporal information. Using controlled experiments on synthetic data generated by mixtures of Markov chains, we test whether LMs learn conditional independencies and Markov exchangeability properties necessary for causal discovery. We find that transformers successfully learn these properties, achieving this not by approximating exact probability distributions but by learning qualitative probability rankings. These synthetic experiments provide initial evidence that LMs possess inductive biases suitable for discovering token-level causal structures.

URL: https://openreview.net/forum?id=YJddclPGuY

---

Title: Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

Abstract: Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Fed-SB, a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix ($R$) between adapters $B$ and $A$, keeping other components fixed. Direct averaging of $R$ guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB offers a state-of-the-art, efficient, and scalable solution for both private and non-private federated fine-tuning. Our code is available anonymously at: https://anonymous.4open.science/r/fed-sb-anonymous-EF55.
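
The server-side step implied by the abstract reduces to exact averaging of the small square matrices $R$; a minimal sketch, with aggregation weights and shapes as assumptions, is given below.

    import torch

    def aggregate_R(client_R_list, client_weights=None):
        # client_R_list: per-client trained R matrices; B and A stay fixed and shared,
        # so averaging R averages the effective update delta_W = B @ R @ A exactly.
        n = len(client_R_list)
        if client_weights is None:
            client_weights = [1.0 / n] * n
        R_global = torch.zeros_like(client_R_list[0])
        for w, R in zip(client_weights, client_R_list):
            R_global += w * R
        return R_global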

URL: https://openreview.net/forum?id=87UyFEhzyP

---

Title: Character-Level Perturbations Amplify LLM Jailbreak Attacks

Abstract: Contemporary large language models (LLMs) exhibit remarkable capabilities, yet their subword tokenization mechanisms suffer from a vulnerability, whereby small character-level perturbations can re-partition text into unfamiliar subwords, degrading model performance across various tasks. Building on this, we show that this tokenization vulnerability also compromises safety mechanisms in jailbreak scenarios. We introduce a simple, model- and template-agnostic character-level jailbreak method and demonstrate that minimal character-level perturbations effectively increase the success rates of both simple and complex jailbreak attacks across multiple LLMs. We reveal that these perturbations lead to over-fragmented tokenization and token representation drift, resulting in substantial divergence in the semantic representations of words. Furthermore, our analysis using word-level semantic recovery and sentence-level spelling error detection and correction shows that models struggle to reconstruct the original semantics for perturbed content. In addition, layer-wise probe classifiers also fail to reliably detect the harmful intent of perturbed jailbreak prompts, further exposing the models' vulnerability in comprehending adversarially perturbed input. Finally, we find that in certain cases, perturbations reduce rather than increase attack success, as the corrupted spans fit less naturally into the template. Together, our findings demonstrate that tokenization-induced vulnerabilities compromise safety mechanisms, underscoring the need for investigation into mitigation strategies.
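
A toy character-level perturbation of the kind studied here (a random swap, deletion, or insertion inside a word) can be written in a few lines; the operation mix and rates are assumptions used only to make the tokenization-fragmentation effect concrete.

    import random

    def perturb_word(word, rng=None):
        # Apply one random character-level edit away from the word boundaries.
        rng = rng or random.Random(0)
        if len(word) < 3:
            return word
        i = rng.randrange(1, len(word) - 1)
        op = rng.choice(["swap", "delete", "insert"])
        if op == "swap":
            chars = list(word)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            return "".join(chars)
        if op == "delete":
            return word[:i] + word[i + 1:]
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]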

URL: https://openreview.net/forum?id=BXsOIppKEI

---

Title: Pull-to-Outlier & Contrastive Objective-level (POCO) Unlearning: A Framework for Sample and Objective Forgetting

Abstract: Current Machine Unlearning (MU) methods require full retraining or extensive fine-tuning, lack formal removal criteria, and focus only on sample-level forgetting, limiting their practicality. We address these gaps with two lightweight, projection-only techniques operating above frozen feature extractors. Pull-to-Outlier Unlearning (POU) offers a transparent, unsupervised geometric removal method by displacing embeddings of unwanted samples or entire classes into synthetic outlier regions, while preserving downstream performance and distilling knowledge of the remaining data. To the best of our knowledge, Contrastive Objective-level Unlearning (COU) is the first method to remove learned objectives. It perturbs projection weights to eliminate a target task’s influence. Then it realigns the original data manifold, which can provide the possibility for managing agentic learning behaviors. We validate POU on CIFAR10, CIFAR100, and Caltech-256 with ResNet-based backbones, showing efficient instance and class forgetting with minimal impact on retained accuracy. COU is tested on DINO and CLIP feature representations, demonstrating effective objective-level erasure while preserving all non-target tasks.

URL: https://openreview.net/forum?id=KQxEwiA0VE

---

Title: Sparse Mean Estimation in Adversarial Settings via Incremental Learning

Abstract: In this paper, we study the problem of sparse mean estimation under adversarial corruptions, where the goal is to estimate the $k$-sparse mean of a heavy-tailed distribution from samples contaminated by adversarial noise. Existing methods face two key limitations: they require prior knowledge of the sparsity level $k$ and scale poorly to high-dimensional settings. We propose a simple and scalable estimator that addresses both challenges. Specifically, it learns the $k$-sparse mean without knowing $k$ in advance and operates in near-linear time and memory with respect to the ambient dimension. Under a moderate signal-to-noise ratio, our method achieves the optimal statistical rate, matching the information-theoretic lower bound. Extensive simulations corroborate our theoretical guarantees.
At the heart of our approach is an incremental learning phenomenon: we show that a basic subgradient method applied to a nonconvex two-layer formulation with an $\ell_1$-loss can incrementally learn the $k$ nonzero components of the true mean while suppressing the rest. More broadly, our work is the first to reveal the incremental learning phenomenon of the subgradient method in the presence of heavy-tailed distributions and adversarial corruption.

URL: https://openreview.net/forum?id=S3e7ikEZfg

---

Title: Phase Transitions or Continuous Evolution? Methodological Sensitivity in Neural Network Training Dynamics

Abstract: Recent work on neural network training dynamics often identifies transitions or phase changes in weight matrices through rank-based metrics. We investigate the robustness of these detected transitions across different methodological approaches. Analyzing 55 experiments spanning Transformer, CNN, and MLP architectures (30,147 measurement points), we find that transition detection exhibits substantial sensitivity to methodological choices. Varying the detection threshold from 2$\sigma$ to 100$\sigma$ changes total detected transitions by an order of magnitude (25,513 to 1,608). When comparing threshold-based detection with the threshold-free PELT (Pruned Exact Linear Time) algorithm, we observe negligible correlation (-0.029) between methods: PELT identifies 40--52 transitions per layer while threshold methods at 5$\sigma$ detect 0.00-0.09. Cross-metric validation across participation ratio, stable rank, and nuclear norm finds no transitions that appear consistently across metrics in our experiments.

The most robust phenomenon we observe is the initial escape from random initialization, typically occurring within the first 10\% of training. Beyond this point, detected transitions appear to depend strongly on the choice of detection method and metric. While architecture-specific patterns emerge within each method, the lack of agreement across methods and metrics raises important questions about the interpretation of phase transitions in neural network training.

Our findings suggest that current detection methods cannot reliably identify phase transitions in models at the scales we studied, with training dynamics exhibiting predominantly continuous evolution beyond initialization. We propose practical guidelines for practitioners that embrace continuous monitoring approaches and discuss the implications for understanding neural network optimization. This work highlights the importance of methodological scrutiny when characterizing training dynamics and suggests that multiple perspectives—both continuous and discrete—may be needed to fully understand how neural networks learn.
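
To illustrate the methodological contrast, the sketch below pairs a simple rolling $\sigma$-threshold detector with the threshold-free PELT detector from the `ruptures` package; the window length, penalty, and cost model are illustrative assumptions, not the paper's exact configuration.

    import numpy as np
    import ruptures as rpt

    def threshold_transitions(series, n_sigma=5, window=50):
        # Flag steps where the metric jumps more than n_sigma rolling standard deviations.
        series = np.asarray(series, dtype=float)
        hits = []
        for t in range(window, len(series)):
            ref = series[t - window:t]
            if abs(series[t] - ref.mean()) > n_sigma * (ref.std() + 1e-12):
                hits.append(t)
        return hits

    def pelt_transitions(series, penalty=10):
        # PELT change-point detection: returns breakpoint indices (last entry = series length).
        return rpt.Pelt(model="rbf").fit(np.asarray(series, dtype=float)).predict(pen=penalty)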

URL: https://openreview.net/forum?id=MkZIew531l

---

Title: Inference-Time Alignment via Hypothesis Reweighting

Abstract: Chat assistants must handle diverse and often conflicting user preferences, requiring adaptability to various user needs. We propose Hypothesis Reweighting (HyRe), a method that enables real-time personalization by reweighting ensemble members based on just 1-5 labeled examples from the target user or domain. Our method builds on the key empirical observation that optimally weighting ensemble members substantially outperforms uniform averaging under distribution shift, providing a powerful inductive bias for personalization. HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then performs a simple Bayesian update to upweight heads that best match the target user's preferences. This requires only a single forward pass with negligible (<1\%) computational overhead, making it practical for inference-time alignment. We empirically validate HyRe in several target evaluation distributions. With as few as five preference pairs from each target distribution, adaptation via HyRe surpasses state-of-the-art reward models on RewardBench at both the 2B and 8B parameter scales, and improves reward model accuracy by 20\% across 32 diverse personalization tasks.
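
A minimal sketch of the reweighting step is shown below, using a Bradley-Terry-style likelihood over a handful of preference pairs to form posterior weights over heads; the likelihood choice and prior handling are assumptions, not necessarily HyRe's exact update.

    import numpy as np

    def reweight_heads(chosen_scores, rejected_scores, prior=None):
        # chosen_scores, rejected_scores: (n_pairs, n_heads) reward scores from one forward pass.
        diff = chosen_scores - rejected_scores
        log_lik = -np.logaddexp(0.0, -diff).sum(axis=0)   # Bradley-Terry log-likelihood per head
        log_post = log_lik + (np.log(prior) if prior is not None else 0.0)
        log_post -= log_post.max()                         # numerical stability
        w = np.exp(log_post)
        return w / w.sum()                                 # posterior weights over heads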

URL: https://openreview.net/forum?id=Q9p8LSEpiJ

---

Title: Accurate Split Learning on Noisy Signals

Abstract: Noise injection is applied in Split Learning to address privacy concerns about data leakage. Previous works protect Split Learning by adding noise to the intermediate results during the forward pass. Unfortunately, noisy signals significantly degrade the accuracy of Split Learning training. This paper focuses on improving the training accuracy of Split Learning over noisy signals while protecting training data from reconstruction attacks. We propose two denoising techniques, namely scaling and random masking. Our theoretical results show that both of our denoising techniques accurately estimate the intermediate variables during the forward pass of Split Learning. Moreover, our experiments with deep neural networks demonstrate that the proposed denoising approaches allow Split Learning to tolerate high noise levels while achieving almost the same accuracy as the noise-free baseline. Interestingly, we show that after applying our denoising techniques, the resultant network is more resilient against a state-of-the-art attack compared to the simple noise injection approach.

URL: https://openreview.net/forum?id=in1T4BlzG9

---

Title: Learn to Explore: Meta NAS via Bayesian Optimization Guided Graph Generation

Abstract: Neural Architecture Search (NAS) automates the design of high-performing neural networks but typically targets a single predefined task, thereby restricting its real-world applicability. To address this, Meta Neural Architecture Search (Meta-NAS) has emerged as a promising paradigm that leverages prior knowledge across tasks to enable rapid adaptation to new ones. Nevertheless, existing Meta-NAS methods often struggle with poor generalization, limited search spaces, or high computational costs. In this paper, we propose a novel Meta-NAS framework, GraB-NAS. Specifically, GraB-NAS first models neural architectures as graphs, and then a hybrid search strategy is developed to find and generate new graphs that lead to promising neural architectures. The search strategy combines global architecture search via Bayesian Optimization in the search space with local exploration for novel neural networks via gradient ascent in the latent space. Such a hybrid search strategy allows GraB-NAS to discover task-aware architectures with strong performance, even beyond the predefined search space. Extensive experiments demonstrate that GraB-NAS outperforms state-of-the-art Meta-NAS baselines, achieving better generalization and search effectiveness.

URL: https://openreview.net/forum?id=w15FmwsmKW

---

Title: ChromaFormer: A Scalable and Accurate Transformer Architecture for Land Cover Classification

Abstract: Remote sensing satellites such as Sentinel-2 provide high-resolution, multi-spectral imagery that enables dense, large-scale land cover classification. However, most deep learning models used in this domain—typically CNN-based architectures—are limited in their ability to process high-dimensional spectral data and scale with increasing dataset sizes. Moreover, while transformer architectures have recently been introduced for remote sensing tasks, their performance on large, densely labeled multi-spectral datasets remains underexplored.

In this paper, we present ChromaFormer, a scalable family of multi-spectral transformer models designed for large-scale land cover classification. We introduce a novel Spectral Dependency Module (SDM) that explicitly learns inter-band relationships through attention across spectral channels, enabling efficient spectral-spatial feature fusion. Our models are evaluated on the Biological Valuation Map (BVM) of Flanders, a large, densely labeled dataset spanning over 13,500 km² and 14 classes. ChromaFormer models achieve substantial accuracy gains over baselines: while a 23M-parameter UNet++ achieves less than 70% accuracy, a 655M-parameter ChromaFormer attains over 96% accuracy. We also analyze performance scaling trends and demonstrate generalization to standard benchmarks. Our results highlight the effectiveness of combining scalable transformer architectures with explicit spectral modeling for next-generation remote sensing tasks.

URL: https://openreview.net/forum?id=qzJVTJYEBc

---

Title: Juliet: Per-Sample Conditional Branching for Efficient Convolutional Networks

Abstract: We introduce Juliet, a dynamic, trie-augmented neural architecture that improves the efficiency of convolutional neural networks by routing each input through learned per-node branches while growing and pruning capacity on the fly. Each node pairs a lightweight sub-module with a transformer-based path selector trained end-to-end; growing and pruning based on exponential moving average (EMA) usage let the model expand or contract during training to preserve accuracy within compute and memory budgets. We graft Juliet onto ResNet-18, EfficientNet-B0, and DenseNet-121 and train on CIFAR-10 (ARCHER2), with an ImageNet/H100 check using ResNet-101. On CIFAR-10, Juliet reduces theoretical training and inference FLOPs, even when the parameter count increases. The results show inference-FLOP reductions of $\sim21\%$ (ResNet-18), $\sim68\%$ (EfficientNet-B0), and $\sim70\%$ (DenseNet-121), while staying within $\sim1\%$ Top-1 of the baseline for ResNet-18 and DenseNet-121, with a larger trade-off on EfficientNet-B0. At ImageNet scale, Juliet-101 achieves $27.1$ Top-1 per GFLOP, outscoring SkipNet, ConvNet-AIG, and BlockDrop. Ablations and hyperparameter sweeps (growth/prune thresholds, prune interval, prebuild limit) reveal nuances in Juliet's architecture, and simpler routers (e.g., a small MLP) match transformer routing, indicating the transformer router may not be a prerequisite for achieving competitive accuracy. Overall, Juliet provides a flexible, interpretable approach to conditional computation for convolutional neural networks, improving the efficiency–accuracy trade-off for the CNNs we evaluate.

URL: https://openreview.net/forum?id=ETQbfcbtjJ

---

Title: Bayesian Optimisation via Difference-of-Convex Thompson Sampling

Abstract: Thompson sampling is a method for Bayesian optimisation whereby a randomly drawn belief of the objective function is sampled at each round and then optimised, informing the next observation point.
The belief is typically maintained using a sufficiently expressive Gaussian process (GP) surrogate of the true objective function.
The sample drawn is non-convex in general and non-trivial to optimise.
Motivated by the desire to make this optimisation subproblem more tractable, we propose difference-of-convex Thompson sampling (DCTS): a scalable method for drawing GP samples that combines random neural network features with pathwise updates on the limiting kernel. The resulting samples belong to the difference-of-convex function class, which are inherently easier to optimise while retaining rich expressive power. We establish sublinear cumulative regret bounds using a simplified proof technique and demonstrate the advantages of our framework on various problems, including synthetic test functions, hyperparameter tuning, and computationally expensive physics simulations.

URL: https://openreview.net/forum?id=Ih9sJCZ0sW

---

Title: Learning Energy-Based Models by Self-Normalising the Likelihood

Abstract: Training an energy-based model (EBM) with maximum likelihood is challenging due to the intractable normalisation constant. Traditional methods rely on expensive Markov chain Monte Carlo (MCMC) sampling to estimate the gradient of the logarithm of the normalisation constant. We propose a novel objective called self-normalised log-likelihood (SNL) that, compared to the regular log-likelihood, introduces a single additional learnable parameter representing the normalisation constant. SNL is a lower bound of the log-likelihood, and its optimum corresponds to both the maximum likelihood estimate of the model parameters and the normalisation constant. We show that the SNL objective is concave in the model parameters for exponential family distributions. Unlike the regular log-likelihood, the SNL can be directly optimised using stochastic gradient techniques by sampling from a crude proposal distribution. We validate the effectiveness of our proposed method on various density estimation and parameter estimation tasks. Our results show that the proposed method, while simpler to implement and tune, outperforms existing techniques in small to moderate dimensions, but its performance starts to degrade for very high-dimensional problems. We extend this framework to handle EBMs for regression and show the usefulness of our method in this setting, as we outperform existing techniques.
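
One standard way to obtain such a self-normalised lower bound, shown here only as a plausible reading of the abstract rather than the paper's exact parameterisation, uses concavity of the logarithm together with importance sampling from a crude proposal $q$:

    % For p_\theta(x) = e^{f_\theta(x)} / Z(\theta) and any scalar b,
    % concavity of the logarithm gives \log Z \le e^{-b} Z + b - 1, hence
    \[
      \log p_\theta(x) \;\ge\; f_\theta(x) \;-\; e^{-b}\, Z(\theta) \;-\; b \;+\; 1,
      \qquad
      Z(\theta) \;=\; \mathbb{E}_{q}\!\left[ e^{f_\theta(x)} / q(x) \right],
    \]
    % with equality at b = \log Z(\theta); the right-hand side can be maximised
    % jointly in \theta and b with stochastic gradients, estimating Z(\theta)
    % by importance sampling from the crude proposal q.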

URL: https://openreview.net/forum?id=GVaPBqI6ny

---

Title: From Prompts to Perception: Auditing Stereotypes in Multimodal AI

Abstract: Multimodal large language models (MLLMs) and text-to-image (T2I) systems are pervasive, yet how stereotypes propagate across pipelines remains unclear. We present a model-agnostic auditing framework that evaluates joint stereotype formation across T2I and MLLM pipelines using four T2I models and five MLLMs. We use seven nationalities (American, Indian, Iranian, Japanese, Mexican, Nigerian, Russian) along with five gender terms (man, woman, boy, girl, person) to create a set of images, which is then evaluated across different attributes and traits. For the evaluation, we also generate a set of images as a neutral baseline along with distance and radar plots. Embeddings through t-SNE and distance plots reveal tight nationality clusters and a drift of gender neutral prompts toward “man”. We further introduce five metrics: TDS and WTD to quantify trait shifts; SDI and OM for label dominance/overlap; and MCS for corruption-induced instability. TDS and WTD show minimal deviation for American and maximal for Nigerian groups, indicating that physical traits can be nationality-specific. Frequency plots, treemaps, along with SDI and OM, indicate that there is an over-reliance on a few words. MCS shows that mild degradations yield 15-45% meaningful label changes and accuracy drops, indicating that noise affects predictions. Our framework offers actionable and reproducible tools for auditing stereotype risk in multimodal AI.

URL: https://openreview.net/forum?id=qjNoOeJVfJ

---

Title: Addition is almost all you need: Compressing neural networks with double binary factorization

Abstract: Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient approach to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in a 1-bit per weight range, DBF is better than existing binarization approaches. In a 2-bit per weight range, DBF is competitive with the best quantization methods like QuIP\# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria.
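
Reading off the abstract, a reconstruction from a double binary factorization might look like the sketch below; the placement of the scaling vectors and the intermediate dimension k are assumptions, not the paper's parameterisation.

    import torch

    def dbf_reconstruct(B1, s1, B2, s2):
        # B1: (out, k) sign matrix in {-1,+1}; B2: (k, in) sign matrix in {-1,+1};
        # s1: (out,) and s2: (k,) scaling vectors, here applied row-wise (an assumption).
        return (s1[:, None] * B1.float()) @ (s2[:, None] * B2.float())

    # At inference, multiplying by a {-1,+1} matrix needs only additions and
    # subtractions; the compression ratio is controlled by the choice of k.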

URL: https://openreview.net/forum?id=k5kUKoewdQ

---

Title: Dynamic Neural Graph Encoding of Inference Processes in Deep Weight Space

Abstract: The rapid advancements in using neural networks as implicit data representations have attracted significant interest in developing machine learning methods that analyze and process the weight spaces of other neural networks. However, efficiently handling these high-dimensional weight spaces remains challenging. Existing methods often overlook the sequential nature of layer-by-layer processing in neural network inference. In this work, we propose a novel approach using dynamic graphs to represent neural network parameters, capturing the temporal dynamics of inference. Our Dynamic Neural Graph Encoder (DNG-Encoder) processes these graphs, preserving the sequential nature of neural processing. Additionally, we leverage the DNG-Encoder to develop INR2JLS to facilitate downstream applications, such as classifying INRs. Our approach demonstrates significant improvements across multiple tasks, surpassing the state-of-the-art INR classification accuracy by approximately 10% on CIFAR-100-INR. The source code has been made available in the supplementary materials.

URL: https://openreview.net/forum?id=4fweEyVYLF

---

Title: SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Abstract: Traditional evaluation metrics for textual and visual question answering—like ROUGE, METEOR, and Exact Match (EM)—focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.

URL: https://openreview.net/forum?id=lnpOvuQYih

---

Title: StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks

Abstract: Listening to heart and lung sounds — auscultation — is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support. We present StethoLM, the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction–response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories: binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis. Through multi-stage training that combines supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data. Our work establishes a foundation for instruction-following AI systems in clinical auscultation.

URL: https://openreview.net/forum?id=i9RuUH9Jyj

---

Title: On Adversarial Attacks In Acoustic Localization

Abstract: Multi-rotor aerial vehicles (drones) are increasingly deployed across diverse domains, where accurate navigation is critical. The limitations of vision-based methods under poor lighting and occlusions have driven growing interest in acoustic sensing as an alternative. However, the security of acoustic-based localization has not been examined. Adversarial attacks pose a serious threat, potentially leading to mission-critical failures and safety risks. While prior research has explored adversarial attacks on vision-based systems, no work has addressed the acoustic setting. In this paper, we present the first comprehensive study of adversarial robustness in acoustic drone localization. We formulate white-box projected gradient descent (PGD) attacks from an external sound source and show their significant impact on localization accuracy. Furthermore, we propose a novel defense algorithm based on rotor phase modulation, capable of effectively recovering clean signals and mitigating adversarial degradation. Our results highlight both the vulnerability of acoustic localization and the potential for robust defense strategies.

URL: https://openreview.net/forum?id=Nxm5xXoLFb

---

Title: Anytime Verified Agents: Adaptive Compute Allocation for Reliable LLM Reasoning under Budget Constraints

Abstract: Large language model (LLM) agents show promising results in reasoning, planning, and tool use. However, their performance scales with the computational budget. Existing methods allocate computational resources using static strategies such as fixed search depths, constant self-consistency sampling, or uniform verification. As a result, simple problems consume as much compute as complex tasks. We present Anytime Verified Agents (AVA), a framework that dynamically allocates compute across search, tool use, and verification within a user-specified budget. AVA integrates calibrated uncertainty estimation, value-of-information-guided search expansion, and selective verification cascades with early exits. The controller allocates compute based on the predicted failure risk and marginal reliability gains, allowing the agent to achieve higher accuracy at fixed budgets or lower costs at target reliability levels. AVA is evaluated on mathematical reasoning (GSM8K), multi-hop question answering (HotpotQA), and code generation (HumanEval) benchmarks, and it is compared to fixed-depth search, self-consistency, and always-verify baselines. The results show that the adaptive allocation achieves a 20-40% cost reduction at equivalent reliability while maintaining accuracy, showing clear Pareto improvements in the compute-reliability trade-off.

URL: https://openreview.net/forum?id=JMDCMf7mlF

---

Title: CP Merging: Joint LoRA Merging using Canonical Polyadic Decomposition

Abstract: Large language models (LLMs) are often fine-tuned for specific tasks using Low-Rank Adaptation (LoRA), an efficient method that adds small, task-specific modules called LoRA adapters to a pre-trained base model. However, a major challenge arises when merging multiple LoRA adapters trained on different data sources for a specific task: it often leads to \textit{task interference}, which refers to the redundancy or sign discrepancies found in parameters across different task models, resulting in information conflict and performance loss. While SVD-based merging methods show promise by decomposing adapters into orthogonal components to reduce cross-task interference, they suffer from a critical limitation: SVD decomposition treats the LoRA adapters merely as matrices, which prevents the identification of the optimal orthogonal basis, limiting these approaches from effectively reducing the task interference. To address this, we propose a novel LoRA merging approach using joint Canonical Polyadic (CP) decomposition, which we term CP Merging. We first aggregate the LoRA adapters into a single third-order tensor. Subsequently, we apply CP decomposition to this tensor to disentangle factors that are unique to each task from those that are shared across tasks. This joint factorization inherently helps to reduce cross-task interference without sacrificing critical information. Our extensive experiments further validate this approach, demonstrating that CP merging yields superior performance compared to existing SVD-based merging approaches.
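
As a rough sketch of the pipeline described above, per-task LoRA deltas can be stacked into a third-order tensor and factorised with CP (PARAFAC) via the `tensorly` package; how the task-mode loadings are recombined into a single merged delta is an assumption, not the paper's exact rule.

    import numpy as np
    import tensorly as tl
    from tensorly.decomposition import parafac

    def cp_merge(lora_deltas, rank=16):
        # lora_deltas: list of (d_out, d_in) arrays, one Delta_W = B @ A per task.
        T = tl.tensor(np.stack(lora_deltas))                 # (n_tasks, d_out, d_in)
        weights, (task_f, out_f, in_f) = parafac(T, rank=rank, normalize_factors=True)
        # Recombine: average the task-mode loadings so shared components dominate.
        comp_scale = weights * task_f.mean(axis=0)           # (rank,)
        return (out_f * comp_scale) @ in_f.T                 # merged (d_out, d_in) delta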

URL: https://openreview.net/forum?id=2poB2149km

---

Title: Enhancing Model Robustness Against Noisy Labels via Kronecker Product Decomposition

Abstract: Deep learning models have made remarkable progress across various domains in recent years. These models heavily rely on large-scale datasets for training, and a noisy dataset can degrade the performance of the model. To train accurate deep learning models, it is crucial to develop training algorithms that are robust to noisy training data and outliers while ensuring high performance. In this work, we study the problem of model training under noisy labels/outputs and propose a method based on Kronecker product decomposition to improve robustness during training. The proposed method is easy to implement and can be readily combined with robust loss functions.
We report results from experiments conducted on both classification and regression tasks in the presence of noisy labels/outputs. Our results demonstrate that our approach outperforms existing robust loss methods in terms of model performance.

URL: https://openreview.net/forum?id=3C1JLecije

---

Title: eDQA: Efficient Deep Quantization of DNN Activations on Edge Devices

Abstract: Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce compute and memory demands during DNN inference, which can be particularly beneficial on resource-constrained edge devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or perform extensive online searches for the best hyperparameters. However, these expensive operations are impractical on edge devices with limited computational capabilities, memory capacities, and energy budgets. Furthermore, many existing methods either do not focus on sub-6-bit (or deep) quantization, or leverage mixed-precision approaches to achieve deep quantization on average but without further improving the hardware usage efficiency. To fill these gaps, in this paper we propose eDQA (Efficient Deep Quantization of DNN Activations on Edge Devices), a new method that focuses on sub-6-bit quantization of activations and leverages simple shifting-based operations and data compression techniques to achieve high efficiency and accuracy. We evaluate eDQA with 3, 4, and 5-bit quantization levels and four different DNN models on two different datasets. eDQA shows up to 75\% better accuracy compared to three existing methods: direct quantization, classic power-of-two quantization, and the state-of-the-art NoisyQuant for sub-6-bit quantization. Additionally, we compare eDQA with NoisyQuant on an edge FPGA, achieving up to $309\times$ speedup. The code is available at https://github.com/xxxx.

URL: https://openreview.net/forum?id=SEIBCdgE5W

---

Title: Domain Adaptation under Continuous Spurious Shift

Abstract: Recent advances in domain adaptation have shown promise in transferring knowledge across domains characterized by a continuous value or vector, such as varying patient ages, where “age” serves as a continuous index. However, these approaches often fail when spurious features shift continuously along with the domain index. This paper introduces the first method designed to withstand the continuous shifting of spurious features during domain adaptation. Our method enhances domain adaptation performance by aligning causally transportable encodings across continuously indexed domains. Theoretical analysis demonstrates that our approach more effectively ensures causal transportability across different domains. Empirical results, from both semi-synthetic and real-world medical datasets, indicate that our method outperforms state-of-the-art domain adaptation methods.

URL: https://openreview.net/forum?id=uYatRBQeVZ

---

Title: UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models

Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model’s internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds more than 7× faster than the previous state-of-the-art method, while requiring 4× less VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. We will release the code and dataset upon acceptance.

URL: https://openreview.net/forum?id=GoJLp3BlRV

---

Title: UFO2: The Desktop AgentOS

Abstract: Recent Computer-Using Agents (CUAs), powered by multimodal large language models (LLMs), offer a promising direction for automating complex desktop workflows through natural language. However, most existing CUAs remain conceptual prototypes, hindered by shallow OS integration, fragile screenshot-based interaction, and disruptive execution.

We present UFO2, a multiagent AgentOS for Windows desktops that elevates CUAs into practical, system-level automation. UFO2 features a centralized HostAgent for task decomposition and coordination, alongside a collection of application-specialized AppAgents equipped with native APIs, domain-specific knowledge, and a unified GUI--API action layer. This architecture enables robust task execution while preserving modularity and extensibility. A hybrid control detection pipeline fuses Windows UI Automation (UIA) with vision-based parsing to support diverse interface styles. Runtime efficiency is further enhanced through speculative multi-action planning, reducing per-step LLM overhead. Finally, a Picture-in-Picture (PiP) interface enables automation within an isolated virtual desktop, allowing agents and users to operate concurrently without interference.

We evaluate UFO2 across over 20 real-world Windows applications, demonstrating substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration unlocks a scalable path toward reliable, user-aligned desktop automation.

URL: https://openreview.net/forum?id=iAuZVWCduc

---
