Weekly TMLR digest for May 24, 2026

TMLR

May 24, 2026, 12:00:11 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Reproducibility Certification: [Re] FairDICE: A Fair Tradeoff in Multi-objective Offline RL

Peter Adema, Karim Galliamov, Aleksey Evstratovskiy, Ross Geurts

https://openreview.net/forum?id=Tr6MBt0hAj

---


Reproducibility Certification: What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

https://openreview.net/forum?id=cHZn5Gdh8e

---


Expert Certification: InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation

Faezeh Faez, Marzieh S. Tahaei, Yaochen Hu, Ali Pourranjbar, Mahdi Biparva, Mark Coates, Yingxue Zhang

https://openreview.net/forum?id=bvJlAodxEC

---


Expert Certification, J2C Certification: BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning

Denis Huseljic, Paul Hahn, Marek Herde, Christoph Sandrock, Bernhard Sick

https://openreview.net/forum?id=qTs6spvhOS

---


Accepted papers
===============


Title: BG-HGNN: Toward Efficient Learning for Complex Heterogeneous Graphs

Authors: Junwei Su, Lingjun Mao, Da Zheng, Chuan Wu

Abstract: Heterogeneous graphs, comprising diverse node and edge types connected through varied relations, are ubiquitous in real-world applications. Message-passing heterogeneous graph neural networks (HGNNs) have emerged as a powerful model class for such data. However, existing HGNNs typically allocate a separate set of learnable weights for each relation type to model relational heterogeneity. Despite their promise, these models are effective primarily on simple heterogeneous graphs with only a few relation types. In this paper, we show that this standard design inherently leads to parameter explosion (the number of learnable parameters grows rapidly with the number of relation types) and relation collapse (the model loses the ability to distinguish among different relations). These issues make existing HGNNs inefficient or impractical for complex heterogeneous graphs with many relation types. To address these challenges, we propose Blend&Grind-HGNN (BG-HGNN), a unified feature-representation framework that integrates and distills relational heterogeneity into a shared low-dimensional feature space. This design eliminates the need for relation-specific parameter sets and enables efficient, expressive learning even as the number of relations grows. Empirically, BG-HGNN achieves substantial gains over state-of-the-art HGNNs, improving parameter efficiency by up to 28.96$\times$ and training throughput by up to 110.30$\times$, while matching or surpassing their accuracy on complex heterogeneous graphs.

URL: https://openreview.net/forum?id=pkhDICLhy7

---

Title: SafeOR-Gym: A Benchmark Suite for Safe Reinforcement Learning Algorithms on Practical Operations Research Problems

Authors: Asha Ramanujam, Adam El Youmi, Hao Chen, Sai Madhukiran Kompalli, Akshdeep Singh Ahluwalia, Shraman Pal, Dimitri J Papageorgiou, Can Li

Abstract: Most existing safe reinforcement learning (RL) benchmarks focus on robotics and control tasks, offering limited relevance to high-stakes domains that involve structured constraints, mixed-integer decisions, and industrial complexity. This gap hinders the advancement and deployment of safe RL in critical areas such as energy systems, manufacturing, and supply chains. To address this limitation, we present SafeOR-Gym, a benchmark suite of nine operations research (OR) environments tailored for safe RL under complex constraints. Each environment captures a realistic planning, scheduling, or control problem characterized by cost-based constraint violations, planning horizons, and hybrid discrete-continuous action spaces. The suite integrates seamlessly with the Constrained Markov Decision Process (CMDP) interface provided by OmniSafe. We evaluate several state-of-the-art safe RL algorithms across these environments, revealing a wide range of performance: while some tasks are tractable, others expose fundamental limitations in current approaches. SafeOR-Gym provides a challenging and practical testbed that aims to catalyze future research in safe RL for real-world decision-making problems.

URL: https://openreview.net/forum?id=3EREfePUvi

---

Title: Bridging Efficiency and Adaptability: Continual Learning of MLPs on Class-Incremental Graphs

Authors: Qiannan Zhang, Shichao Pei

Abstract: Compared to static graphs, class-incremental graphs place higher demands on inference latency to support timely predictions for newly emerged node classes, especially in latency-sensitive applications. However, the high inference cost of Graph Neural Networks (GNNs) limits their scalability and motivates GNN-to-MLP distillation, which transfers knowledge from a GNN to a Multi-Layer Perceptron (MLP) to enable graph-free, low-latency inference. Yet, existing efforts focus on static graphs. When directly applied to class-incremental graphs, they inevitably suffer from the high computational cost of frequent GNN updates and the MLP’s inability to retain knowledge of previously learned classes. To bridge efficiency and adaptability, we propose a novel framework featuring an asynchronous update paradigm between GNN and MLPs, allowing rapid adaptation to evolving data. The MLPs employ a progressive expansion strategy for continual adaptation and an energy-based routing mechanism for test-time inference. During GNN updates, knowledge from MLPs trained in the current cycle is distilled back into the GNN to preserve long-term knowledge. Experiments on real-world datasets demonstrate that our framework achieves superior performance on class-incremental graphs, effectively balancing adaptability to new data and inference efficiency.

URL: https://openreview.net/forum?id=3KYwHaKXcn

---

Title: Robust Answers, Fragile Logic: Probing the Decoupling Hypothesis in LLM Reasoning

Authors: Enyi Jiang, Changming Xu, Nischay Singh, Gagandeep Singh

Abstract: While Chain-of-Thought (CoT) prompting has become a cornerstone for complex reasoning in Large Language Models (LLMs), the robustness of the generated reasoning remains an open question. We investigate the Decoupling Hypothesis: that the robustness of a model's reasoning path and the robustness of its final answer are largely independent; correct answers can coexist with arbitrarily fragile reasoning under small input perturbations. To systematically verify this, we introduce MATCHA, a novel Answer-Conditioned Probing framework. Unlike standard evaluations that focus on final output accuracy, MATCHA isolates the reasoning phase by conditioning generation on the model's predicted answer, allowing us to stress-test the stability of the rationale itself. Our experiments reveal a critical vulnerability: under imperceptible input perturbations, LLMs frequently maintain the correct answer while generating inconsistent or nonsensical reasoning - effectively being ``Right for the Wrong Reasons''. Using LLM judges to quantify this robustness gap, we find that multi-step and commonsense tasks are significantly more susceptible to this decoupling than logical tasks. Furthermore, we demonstrate that adversarial examples generated by MATCHA transfer non-trivially to black-box models. Crucially, we show that this fragility is not solely an artifact of our answer-conditioned protocol: while standard CoT-then-Answer generation does not permit strict answer-fixed isolation, it nevertheless exhibits similar patterns of reasoning degradation under analogous attacks. Our findings expose the illusion of CoT robustness and underscore the need for future architectures that enforce genuine answer-reasoning consistency rather than mere surface-level accuracy.

URL: https://openreview.net/forum?id=pMhTFUdM4G

---

Title: The Vertex-Attribute-Constrained Densest $k$-Subgraph Problem

Authors: Qiheng Lu, Nicholas D. Sidiropoulos, Aritra Konar

Abstract: Dense subgraph mining is a fundamental technique in graph mining, commonly applied in fraud detection, community detection, product recommendation, and document summarization. In such applications, we are often interested in identifying communities, recommendations, or summaries that reflect different constituencies, styles or genres, and points of view. For this task, we introduce a new variant of the Densest $k$-Subgraph (D$k$S) problem that incorporates the attribute values of vertices. The proposed {\em Vertex-Attribute-Constrained Densest $k$-Subgraph} (VAC-D$k$S) problem retains the NP-hardness and inapproximability properties of the classical D$k$S. Nevertheless, we prove that a suitable continuous relaxation of VAC-D$k$S is tight and can be efficiently tackled using a projection-free Frank--Wolfe algorithm. We also present an insightful analysis of the optimization landscape of the relaxed problem. Extensive experimental results demonstrate the effectiveness of our proposed formulation and algorithm, and its ability to scale up to large graphs. We further elucidate the properties of VAC-D$k$S versus classical D$k$S in a political network mining application, where VAC-D$k$S identifies a balanced and more meaningful set of politicians representing different ideological camps, in contrast to the classical D$k$S solution which is unbalanced and rather mundane.

URL: https://openreview.net/forum?id=ae8hda3atq

---

Title: Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

Authors: Zhuoran Li, Ling Pan, Jiatai Huang, Longbo Huang

Abstract: We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity based on a diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme in training. These key ingredients significantly improve algorithm robustness against environment changes and achieve significant improvements in performance, generalization and data-efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better to shifted environments (in $28$ out of $30$ settings evaluated) thanks to its high expressiveness and diversity. Moreover, DOM2 is ultra data-efficient and requires no more than $5\%$ of the data to match the performance of existing algorithms (a $20\times$ improvement in data efficiency).

URL: https://openreview.net/forum?id=GKuCKSJKvl

---

Title: Conformal Prediction for Hierarchical Data

Authors: Guillaume Principato, Gilles Stoltz, Yvenn Amara-Ouali, Yannig Goude, Jean-Michel Poggi

Abstract: We consider conformal prediction for multivariate data and focus on hierarchical data, where some components are linear combinations of others. Intuitively, the hierarchical structure can be leveraged to reduce the size of prediction regions for the same coverage level. We implement this intuition by including a projection step (also called a reconciliation step) in the split conformal prediction [SCP] procedure, and prove that the resulting prediction regions are indeed globally smaller. We do so both under the classic objective of joint coverage and under a new and challenging task: component-wise coverage, for which efficiency results are more difficult to obtain. The associated strategies and their analyses are based both on the literature of SCP and of forecast reconciliation, which we connect. We also illustrate the theoretical findings, for different scales of hierarchies on simulated data.

URL: https://openreview.net/forum?id=lBDEZiW7MX
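
For readers unfamiliar with the split conformal prediction (SCP) procedure the abstract builds on, here is a minimal univariate sketch in Python. It shows only standard SCP with absolute-residual scores; the projection (reconciliation) step and the hierarchical, multivariate setting that the paper contributes are not shown. The function name `split_conformal_interval` and its parameters are hypothetical, chosen for illustration only.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha=0.1):
    """Standard split conformal interval from a held-out calibration set.

    Illustrative sketch only (not the paper's method): scores are absolute
    residuals, and the interval is symmetric around the point prediction.
    """
    # Nonconformity scores on the calibration set, sorted ascending.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this level: the interval is trivial.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (test_pred - q, test_pred + q)


if __name__ == "__main__":
    preds = [0.0] * 9
    targets = [float(i) for i in range(1, 10)]
    print(split_conformal_interval(preds, targets, test_pred=5.0, alpha=0.1))
```

Under exchangeability of calibration and test points, an interval built this way covers the true value with probability at least 1 - alpha; the abstract's claim is that adding a projection step shrinks such regions at the same coverage level.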

---

Title: Process Reinforcement through Implicit Rewards

Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding

Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data. Code is available at https://github.com/PRIME-RL/PRIME.

URL: https://openreview.net/forum?id=9SkkifLopZ

---

Title: Constrained Reinforcement Learning Using Successor Representations

Authors: Michael Girstl, Alexander Mattick, Christopher Mutschler

Abstract: Real-world Reinforcement Learning depends on the ability to formulate safety constraints into a policy. A common way to model such constraints is to introduce an additional cost signal in the Markov Decision Process, which notifies the agent of unwanted behavior independently of the reward signal. Unfortunately, current methods are hard to adapt to changes in the cost function introduced by, e.g., domain shift or obstacles moving over time. The lack of adaptability means that policies are too inflexible to deal with complex real-world conditions. We propose the Safe Deep Successor Representation (SafeDSR), a novel method that allows quick retraining of policies towards new cost structures. SafeDSR extends the Deep Successor Representation (Kulkarni et al., 2016) to Constrained Reinforcement Learning by introducing a single learnable weight matrix to decouple the learned value function across dynamics, rewards, and costs. This matrix can be updated in a supervised manner instead of having to adapt the whole network if the cost structure of the environment changes. We demonstrate this ability in a freely configurable two-dimensional navigation environment and show that our method is competitive on a simple navigation task while being considerably more flexible.

URL: https://openreview.net/forum?id=6zUq7knzwA

---

Title: Continually Adding New Languages to Multilingual Language Models

Authors: Abraham Toluwase Owodunni, Sachin Kumar

Abstract: Multilingual language models are trained on a fixed set of languages, and to support new languages, the models need to be retrained from scratch. This is an expensive endeavor and is often infeasible, as model developers tend not to release their pre-training data. Naive approaches, such as continued pretraining, suffer from catastrophic forgetting; however, mitigation strategies like experience replay cannot be applied due to the lack of original pretraining data. In this work, we investigate the problem of continually adding new languages to a multilingual model, assuming access to pretraining data in only the target languages. We explore multiple approaches to address this problem and propose Layer-Selective LoRA (LayRA), which adds Low-Rank Adapters (LoRA) to selected initial and final layers while keeping the rest of the model frozen. LayRA builds on two insights: (1) LoRA reduces forgetting, and (2) multilingual models encode inputs in the source language in the initial layers, reason in English in intermediate layers, and translate back to the source language in final layers. We experiment with adding multiple combinations of Galician, Swahili, and Urdu to pretrained language models and evaluate each method on diverse multilingual tasks. We find that LayRA provides the best overall tradeoff: it preserves models’ capabilities in previously supported languages while remaining competitive with existing approaches such as LoRA in learning new languages. We also demonstrate that using model arithmetic, the adapted models can be equipped with strong instruction following abilities without access to any instruction tuning data in the target languages.

URL: https://openreview.net/forum?id=HE84ER1BNL

---

Title: Perturbed Gradient Descent via Convex Quadratic Approximation for Nonconvex Bilevel Optimization

Authors: Nazanin Abolfazli, Sina Sharifi, Mahyar Fazlyab, Erfan Yazdandoost Hamedani

Abstract: Bilevel optimization is a fundamental tool in hierarchical decision-making and has been widely applied to machine learning tasks such as hyperparameter tuning, meta-learning, and adversarial learning. Although significant progress has been made in bilevel optimization, existing methods predominantly focus on the nonconvex-strongly convex or the nonconvex-PL settings; the more general nonconvex-nonconvex framework remains underexplored. In this paper, we address this gap by developing an efficient gradient-based method to decrease the upper-level objective, coupled with a convex Quadratic Program (QP) that minimally perturbs the gradient descent directions to reduce the suboptimality of the condition imposed by the lower-level problem. We provide a rigorous convergence analysis, demonstrating that under the existence of a KKT point and a regularity assumption (norm-squared gradient of the lower-level satisfies PL), our method achieves an iteration complexity of $\mathcal{O}(1/\epsilon^{1.5})$ in terms of the squared norm of the KKT residual for the reformulated problem. Moreover, even in the absence of the regularity assumption, we establish an iteration complexity of $\mathcal{O}(1/\epsilon^{3})$ for the same metric. Through extensive numerical experiments on convex and nonconvex synthetic benchmarks and data hyper-cleaning tasks, we illustrate the efficiency and scalability of our approach.

URL: https://openreview.net/forum?id=sFtPtOHzYO

---

Title: Learnable Coreset Selection for Graph Active Learning

Authors: Xueqi Ma, Xingjun Ma, Sarah Erfani, James Bailey

Abstract: Graph Neural Networks (GNNs) have demonstrated their effectiveness in a variety of graph-based tasks. However, their performance heavily depends on the availability of a sufficient amount of labeled data, which is often costly to acquire in real-world applications. To tackle this, GNN-based Active Learning (AL) methods aim to enhance labeling efficiency by selecting the most informative nodes for labeling. However, existing methods often rely on heuristic or implicit approaches that fail to fully capture the influence of labeled data on unlabeled nodes, thereby limiting their adaptability across diverse graph types. In this paper, we propose LearnAL, a Learnable coreset labeling framework for graph Active Learning to address these limitations. Unlike traditional heuristic-based methods, LearnAL explicitly models the correlations between labeled and unlabeled nodes using an attention architecture, linking these correlations directly to prediction performance. Leveraging global influence (attention) scores, LearnAL selects and labels samples that maximize representational diversity, enhancing sample coverage. We provide theoretical analysis demonstrating that this attention-based selection reduces the covering radius bound, improving prediction performance on unlabeled data. Our experimental results show that the labeled coreset significantly enhances the generalizability of various graph models across different graph datasets, as well as CNN models in image classification tasks.

URL: https://openreview.net/forum?id=ursw3nWq5K

---

Title: Inference-Time Alignment via Hypothesis Reweighting

Authors: Yoonho Lee, Jonathan Williams, Henrik Marklund, Archit Sharma, Eric Mitchell, Anikait Singh, Chelsea Finn

Abstract: Reward models trained on aggregate preferences often fail to capture individual users' values, but existing adaptation methods such as fine-tuning or long-context conditioning are too costly for real-time personalization. We propose Hypothesis Reweighting (HyRe), which enables real-time personalization by reweighting ensemble members using just 1-5 labeled examples from the target user or domain. Our method builds on the empirical observation that when different heads capture different valid interpretations of preference data, reweighting them can substantially outperform uniform averaging. HyRe trains a single network with multiple prediction heads that capture different valid interpretations of preference data, then uses a Bayesian update to upweight the heads that best match the target user's preferences. This requires only a single forward pass with negligible (<1%) computational overhead, making it practical for inference-time personalization. We evaluate HyRe across diverse target preference distributions. With as few as five preference pairs per target distribution, HyRe surpasses state-of-the-art reward models on RewardBench at 2B and 8B scale and improves reward model accuracy by 20% across 32 personalization tasks.

URL: https://openreview.net/forum?id=Q9p8LSEpiJ
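
The reweighting idea in this abstract (a Bayesian update that upweights the heads matching a handful of labeled examples) can be illustrated with a small Python sketch. This is not the authors' implementation: it assumes each head outputs a probability for the observed label of each adaptation example, assumes a uniform prior over heads, and uses hypothetical names (`reweight_heads`, `ensemble_predict`).

```python
import math


def reweight_heads(head_probs, prior=None):
    """Posterior weights over ensemble heads after a few labeled examples.

    head_probs[k][i] is the probability head k assigned to the observed
    label of adaptation example i (an assumed input format).
    """
    k = len(head_probs)
    if prior is None:
        prior = [1.0 / k] * k  # assumed uniform prior over heads
    # Log-posterior up to a constant: log prior + summed log-likelihood.
    log_post = [
        math.log(p0) + sum(math.log(max(p, 1e-12)) for p in probs)
        for p0, probs in zip(prior, head_probs)
    ]
    # Subtract the max before exponentiating for numerical stability.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def ensemble_predict(weights, head_outputs):
    """Weighted average of per-head outputs for a new input."""
    return sum(w * o for w, o in zip(weights, head_outputs))


if __name__ == "__main__":
    # Three heads, two labeled preference pairs from the target user.
    probs = [[0.9, 0.8], [0.5, 0.5], [0.2, 0.1]]
    weights = reweight_heads(probs)
    print(weights)
    print(ensemble_predict(weights, [1.0, 0.0, -1.0]))
```

Because the update only rescales existing head outputs, it needs no gradient steps, which is consistent with the abstract's claim of negligible overhead at inference time.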

---

Title: VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Authors: Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

Abstract: Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.

URL: https://openreview.net/forum?id=mPhlTmYiyg

---

Title: Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization

Authors: Aliakbar Nafar, K. Brent Venable, Zijun Cui, Parisa Kordjamshidi

Abstract: In this work, we evaluate the potential of Large Language Models (LLMs) in building Bayesian Networks (BNs) by approximating domain expert priors. LLMs have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. We explore utilizing the probabilistic knowledge inherent in LLMs to derive probability estimates for statements regarding events and their relationships within a BN. Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Our experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from data, especially when data is scarce. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with real-world data. Additionally, we establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.

URL: https://openreview.net/forum?id=Fy3Byg3CVo

---

Title: PSAG: Projection-based Stabilized Attribution Guidance for Online Continual Learning

Authors: Hang Yu, Kun Hu, Zhuqiang Lu, Steven Qiang Lu, Zhiyong Wang, Fengxiang He

Abstract: Online Continual Learning (OCL) aims to incrementally learn from non-stationary data streams in a one-pass setting, facing the dual challenges of catastrophic forgetting and insufficient training. These challenges intensify the stability-plasticity dilemma, where preserving old knowledge conflicts with acquiring new information. In this paper, we propose Projection-based Stabilized Attribution Guidance (PSAG), a modular framework that leverages gradient-based attributions as active guidance signals to selectively preserve task-relevant representations. Our framework consists of three complementary mechanisms: (1) Attribution-Guided Feature Modulation (AGFM) that anchors critical features in the representation space; (2) Importance-Aware Loss Reweighting (IALR) that prioritizes informative samples at the loss level; and (3) Manifold-Consistent Projection (MCP) that emphasizes critical feature dimensions within a Riemannian metric space. To address the issue of attribution instability in online continual learning, we introduce a Reliable Reference Model (R-Model) that maintains consistent knowledge through exponential moving average updates. This design prevents feedback loops during attribution computation and enables reliable feature importance estimation. Extensive experiments on Split CIFAR-10, Split CIFAR-100, and Split Mini-ImageNet demonstrate that PSAG achieves consistent improvements over strong baselines, confirming the efficacy of stabilized attribution guidance in resolving the stability-plasticity dilemma.

URL: https://openreview.net/forum?id=NvXpSvMrXS

---

Title: Efficient Text-Attributed Graph Learning through Selective Annotation and Graph Alignment

Authors: Huanyi Xie, Lu Yu, Tianhao Huang, Longfei Li, Meng Li, JUN ZHOU, Lijie Hu, Di Wang

Abstract: In the realm of Text-attributed Graphs (TAGs), traditional graph neural networks (GNNs) often fall short due to the complex textual information associated with each node. Recent methods have improved node representations by leveraging large language models (LLMs) to enhance node text features, but these approaches typically require extensive annotations or fine-tuning across all nodes, which is both time-consuming and costly. To overcome these challenges, we introduce GAGA, an efficient framework for TAG representation learning. GAGA reduces annotation time and cost by focusing on annotating only representative nodes and edges. It constructs an annotation graph that captures the topological relationships among these annotations. Furthermore, GAGA employs a two-level alignment module to effectively integrate the annotation graph with the TAG, aligning their underlying structures. Experiments show that GAGA achieves classification accuracy on par with or surpassing state-of-the-art methods while requiring only 1% of the data to be annotated, demonstrating its high efficiency.

URL: https://openreview.net/forum?id=UBIPauyTYp

---

Title: [Re] FairDICE: A Fair Tradeoff in Multi-objective Offline RL

Authors: Peter Adema, Karim Galliamov, Aleksey Evstratovskiy, Ross Geurts

Abstract: Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to e.g. incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims are supported, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

URL: https://openreview.net/forum?id=Tr6MBt0hAj

---

Title: What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

Authors: Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

Abstract: A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available in supplementary material.

URL: https://openreview.net/forum?id=cHZn5Gdh8e

---

Title: Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari

Authors: Jooyeon Kim

Abstract: Developing generalist systems that retain human-like data efficiency is a central challenge. While world models (WMs) offer a promising path, existing research often conflates architectural mechanisms with the independent impact of model \emph{scale}. In this work, we use a minimalist transformer world model to analyze scaling behaviors on the Atari 100k benchmark, using fixed offline datasets derived from a presupposed expert policy. Our results reveal that environments fundamentally fall into distinct scaling regimes, even when constrained by identical offline data budgets and model capacities. For individual tasks, some environments naturally allow models to pass the interpolation threshold, yielding monotonic improvements in the overparameterized regime, while others remain trapped in the classical regime, where larger world models degrade fidelity. In the unified setting, i.e., a single transformer trained on a suite of 26 Atari environments, we uncover that joint training stabilizes scaling dynamics, ensuring monotonic gains across all environments, regardless of their distinct inherent scaling regimes. Finally, we demonstrate that improved fidelity translates directly to downstream control, with policies learned entirely within the simulated dynamics achieving a median expert-random-normalized score of 0.770. Our findings suggest that future progress lies as much in precise scaling strategies as in architectural innovation.

URL: https://openreview.net/forum?id=wVcvqtKaMY
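
Illustrative note: the abstract reports a median expert-random-normalized score. The sketch below assumes this is the usual rescaling in which random play maps to 0 and the expert policy maps to 1; the abstract does not spell out the formula, and the game names and numbers are hypothetical.

```python
from statistics import median

def expert_random_normalized(score, random_score, expert_score):
    """Rescale a raw score so that random play maps to 0 and the expert to 1."""
    return (score - random_score) / (expert_score - random_score)

# Hypothetical per-game numbers (agent, random, expert); not taken from the paper.
games = {
    "game_a": (50.0, 0.0, 100.0),
    "game_b": (30.0, 10.0, 20.0),
    "game_c": (5.0, 5.0, 25.0),
}

normalized = {name: expert_random_normalized(*v) for name, v in games.items()}
median_score = median(normalized.values())
```

A median is less sensitive than a mean to a single game where the agent far exceeds the expert (game_b here).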

---

Title: A New Kind of Network? Review and Reference Implementation of Neural Cellular Automata

Authors: Martin Spitznagel, Janis Keuper

Abstract: Stephen Wolfram proclaimed in his 2003 seminal work ``A New Kind Of Science'' that simple recursive programs in the form of Cellular Automata (CA) are a promising approach to replace currently used mathematical formalizations, e.g. differential equations, to improve the modeling of complex systems.
Over two decades later, while Cellular Automata have still been waiting for a substantial breakthrough in scientific applications, recent research showed new and promising approaches which combine Wolfram's ideas with learnable Artificial Neural Networks: So-called Neural Cellular Automata (NCA) are able to learn the complex update rules of CA from data samples, allowing them to model complex, self-organizing generative systems.
The aim of this paper is to review the existing work on NCA and provide a unified theory, as well as a reference implementation in the open-source library NCAtorch.

URL: https://openreview.net/forum?id=NRwjj0ZLq0
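
Illustrative note: for readers unfamiliar with cellular automata, the sketch below shows one update step of a classical one-dimensional binary automaton with a fixed rule table (Rule 110 as an example). It is not a neural cellular automaton and does not use the NCAtorch API; in an NCA, the fixed lookup would be replaced by a learned network.

```python
def step(cells, rule=110):
    """Apply one synchronous update of an elementary (1D, binary) cellular automaton."""
    n = len(cells)
    out = []
    for i in range(n):
        # Periodic boundary: the first and last cells are neighbors.
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right  # neighborhood as a 3-bit number
        out.append((rule >> index) & 1)  # look up the matching bit of the rule number
    return out

state = [0, 0, 0, 1, 0, 0, 0]
history = [state]
for _ in range(3):
    state = step(state)
    history.append(state)
```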

---

Title: Improved DDIM Sampling with Moment Matching Gaussian Mixtures

Authors: Prasad Gabbur

Abstract: We propose using a Gaussian Mixture Model (GMM) as the reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically, we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on the COYO700M dataset. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example, on ImageNet 256x256, using 10 sampling steps, we achieve an FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.

URL: https://openreview.net/forum?id=CdSPjfmrQN
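
Illustrative note: the idea of constraining a mixture so that its first two moments match a target Gaussian can be sketched in the scalar case. The symmetric two-component construction and the `offset` parameter below are assumptions chosen for simplicity, not the parametrization used in the paper.

```python
def matched_two_component_gmm(mean, variance, offset):
    """Return (weights, means, variances) of a symmetric 2-component GMM whose
    overall mean and variance equal those of the target Gaussian."""
    if offset ** 2 >= variance:
        raise ValueError("offset**2 must be smaller than the target variance")
    # Spreading the means by +/- offset adds offset**2 of variance,
    # so each component must carry the remainder.
    component_var = variance - offset ** 2
    return [0.5, 0.5], [mean - offset, mean + offset], [component_var, component_var]

def mixture_moments(weights, means, variances):
    """Mean and variance of a 1D Gaussian mixture."""
    m = sum(w * mu for w, mu in zip(weights, means))
    second = sum(w * (v + mu ** 2) for w, mu, v in zip(weights, means, variances))
    return m, second - m ** 2

weights, means, variances = matched_two_component_gmm(mean=1.0, variance=4.0, offset=1.5)
gmm_mean, gmm_var = mixture_moments(weights, means, variances)
```

The mixture is bimodal for a large enough offset, yet its mean and variance are indistinguishable from the Gaussian's.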

---

Title: Cross-Modal Generative Augmentation for Multimodal Biological Classification

Authors: Hyunwoo Yoo, Efstathia Soufleri, Deepak Ravikumar, Gail Rosen

Abstract: Recent advances in vision-language models have enabled cross-modal generation between text and images, achieving remarkable progress in general-domain understanding. However, their potential in scientific and biological applications remains largely unexplored, where datasets often couple complex visual observations with structured metadata or textual descriptors.
We propose a cross-modal generative framework that supports direction-agnostic generation (image-to-text or text-to-image) depending on modality availability to enrich multimodal biological classification.
Our framework integrates generative augmentation and multimodal alignment to provide complementary augmentation for visual and textual representations, enabling the synthesis of complementary modality data that may otherwise be unavailable in biological datasets.
Experimental results on the HAM10000 and EMPO500 datasets demonstrate improvements over baseline models across multiple evaluation metrics.
The proposed framework is model-agnostic and compatible with open-weight alternatives, paving the way for biologically grounded multimodal generation and analysis.

URL: https://openreview.net/forum?id=bowYeHa8dn

---

Title: Relative Translation Invariant Wasserstein Distance

Authors: Binshuai Wang, Qiwei Di, Ming Yin, Mengdi Wang, Quanquan Gu, Peng Wei

Abstract: Motivated by the Bures distance, we introduce a new family of distances, \emph{relative translation invariant Wasserstein distances}, denoted by $RW_p$, as an extension of the classical Wasserstein distances $W_p$ for $p \in [1, +\infty)$.
We establish that $RW_p$ defines a valid metric and demonstrate that this type of metric is more intrinsic than the classical Wasserstein distance.
A bi-level algorithm is designed to compute the general $RW_p$ distance between arbitrary discrete distributions.
Moreover, when $p = 2$, we show that the optimal coupling matrix is invariant under distributional translation in the discrete setting, and we further propose two algorithms, the $\mathrm{RW}_2$-LP algorithm and the $\mathrm{RW}_2$-Sinkhorn algorithm, to improve the numerical stability of computing $W_2$ distance and the optimal coupling matrix solutions.
Finally, we conduct three experiments to validate our theoretical results and algorithms.
The first two experiments report that the $\mathrm{RW}_2$-LP algorithm and the $\mathrm{RW}_2$-Sinkhorn algorithm, both with and without normalization, can significantly reduce the numerical errors compared to standard algorithms.
The third experiment shows that $RW_p$ algorithms are computationally scalable and applicable to the retrieval of similar thunderstorm patterns in practical applications.

URL: https://openreview.net/forum?id=NfhVTi2G4a
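
Illustrative note: the sketch below assumes that the relative translation invariant distance is the Wasserstein distance minimized over translations of one distribution (an interpretation of the name; the abstract does not give the definition). For p = 2 the minimizing translation aligns the means, which gives the decomposition W_2^2 = RW_2^2 + (mean shift)^2. The example checks this in one dimension with equal-size, uniformly weighted samples, where the optimal coupling matches sorted points.

```python
def w2_squared_1d(xs, ys):
    """Squared 2-Wasserstein distance between equal-size, uniformly weighted 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def rw2_squared_1d(xs, ys):
    """W_2^2 after removing the mean shift, i.e. minimized over translations of ys."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return w2_squared_1d([x - mx for x in xs], [y - my for y in ys])

xs = [0.0, 1.0, 3.0]
ys = [10.0, 12.0, 11.5]
shift = sum(xs) / len(xs) - sum(ys) / len(ys)

lhs = w2_squared_1d(xs, ys)                 # classical W_2^2
rhs = rw2_squared_1d(xs, ys) + shift ** 2   # shape term + translation term
```

Centering also keeps the cost entries small when the two distributions are far apart, which is one plausible reason a translation-invariant formulation could help numerical stability.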

---

Title: TextTeacher: What Can Language Teach About Images?

Authors: Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

Abstract: The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities.
Motivated by this, we ask:
Can the semantic knowledge of a language model efficiently improve a vision model?
As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training.
TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged.
On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to $+2.7$ percentage points (p.p.) and yields consistent transfer gains (on average $+1.0$ p.p.) under the same recipe and compute.
It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but $33\%$ faster.
Our analysis indicates that TextTeacher acts as a feature‑space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues.
TextTeacher adds negligible overhead, requires no costly multimodal pretraining and preserves the simplicity and latency of pure vision models.

Project page with code and captions: https://nauen-it.de/publications/text-teacher

URL: https://openreview.net/forum?id=Xwb0aEUwKh

---

Title: Challenges in Non-Polymeric Crystal Structure Prediction: Why a Geometric, Permutation-Invariant Loss is Needed

Authors: Emmanuel Jehanno, Romain Menegaux, Julien Mairal, Sergei Grudinin

Abstract: Crystalline structure prediction is an essential prerequisite for designing materials with targeted properties. Yet, it is still an open challenge in materials design and drug discovery. Despite recent advances in computational materials science, accurately predicting three-dimensional non-polymeric crystal structures remains elusive. In this work, we focus on the molecular assembly problem, where a set~$\mathcal{S}$ of identical rigid molecules is packed to form a crystalline structure. Such a simplified formulation provides a useful approximation to the actual problem. However, while recent state-of-the-art methods have increasingly adopted sophisticated techniques, the underlying learning objective remains ill-posed. We propose a better formulation that introduces a loss function capturing key geometric molecular properties while ensuring permutation invariance over $\mathcal{S}$. Remarkably, we demonstrate that within this framework, a simple regression model already outperforms prior approaches, including flow matching techniques, on the COD-Cluster17 benchmark, a curated non-polymeric subset of the Crystallography Open Database (COD).

URL: https://openreview.net/forum?id=MsIi78JXXZ

---

Title: A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective

Authors: Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiaoge Zhang, Kaiyu Tang, Xiao Li, Jing Li

Abstract: Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks by inadvertently reproducing exact training samples. While prior work focuses on data augmentation for memorization mitigation, little is known about which individual samples contribute the most to memorization. In this paper, we present the first data-centric study of memorization dynamics in tabular diffusion models. We begin by quantifying memorization for each real sample based on how many generated samples are flagged as its memorized replicas, using a relative distance ratio metric. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples disproportionately contributes to leakage, a finding further validated through sample-removal experiments. To better understand this effect, we divide real samples into the top- and non-top-memorized groups (tags) and analyze their training-time behavior differences. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC) across groups. We find that memorized samples tend to be memorized slightly earlier and show significantly stronger memorization signals in early training stages. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method. DynamicCut (a) ranks real samples by their epoch-wise memorization intensity, (b) prunes a tunable top fraction, and (c) retrains the model on the filtered dataset. Across multiple benchmark tabular datasets and tabular diffusion models, DynamicCut reduces memorization ratios with negligible impact on data diversity and downstream task performance, and complements existing data augmentation methods for further memorization mitigation. 
Furthermore, DynamicCut has transferability across different generative models for memorization sample tagging, i.e., high-ranked samples identified from one model (e.g., a diffusion model) are also effective in reducing memorization when removed from other generative models such as GANs and VAEs.

URL: https://openreview.net/forum?id=p2n88DfaXB

---

Title: AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

Authors: Hamza Mooraj, Georgios Pantazopoulos, Alessandro Suglia

Abstract: Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision–Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark of 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardised training and evaluation. We train and evaluate all models under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability (i.e., output parsability measured via PSR). The results reveal distinct performance profiles: CNN-based models achieve the strongest in-domain performance but exhibit pronounced degradation under domain shift; contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance; generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate performance alone. The AgriPath-LF16 benchmark and accompanying codebase are publicly available at the following links: Huggingface Dataset: https://huggingface.co/datasets/hamzamooraj99/AgriPath-LF16-30k; GitHub Repository: https://github.com/hamzamooraj99/AgriPath-TMLR

URL: https://openreview.net/forum?id=5UI1wrq5pS
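
Illustrative note: the two metrics named in the abstract can be sketched as follows. The parsing rule (exact match after trimming and lowercasing) and the choice to count unparseable outputs as wrong predictions are assumptions made for this example; the labels and outputs are hypothetical.

```python
def parse_label(text, valid_labels):
    """Return the label if the free-text output is exactly a known label, else None."""
    cleaned = text.strip().lower()
    return cleaned if cleaned in valid_labels else None

def psr_and_macro_f1(outputs, targets, valid_labels):
    """Parse Success Rate and macro-averaged F1 over all valid labels."""
    parsed = [parse_label(o, valid_labels) for o in outputs]
    psr = sum(p is not None for p in parsed) / len(parsed)
    f1s = []
    for label in sorted(valid_labels):
        tp = sum(p == label and t == label for p, t in zip(parsed, targets))
        fp = sum(p == label and t != label for p, t in zip(parsed, targets))
        # An unparseable output (None) for a true `label` counts as a miss.
        fn = sum(p != label and t == label for p, t in zip(parsed, targets))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return psr, sum(f1s) / len(f1s)

labels = {"healthy", "rust"}
outputs = ["Healthy", "rust", "I think it is rust", "healthy"]
targets = ["healthy", "rust", "rust", "rust"]
psr, macro_f1 = psr_and_macro_f1(outputs, targets, labels)
```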

---

Title: A Universal Source-Free Class Unlearning Framework via Synthetic Embeddings

Authors: Zahra Dehghani Tafti, Pablo Piantanida, Mohammadhadi Shateri

Abstract: Class unlearning in neural classifiers refers to selectively removing the model’s ability to recognize a target (forget) class by reshaping the decision boundaries. This is essential when taxonomies change, labels are corrected, or legal or ethical requirements mandate class removal. The objective is to preserve performance on the remaining (retain) classes while avoiding costly full retraining. Existing methods generally require access to the source, i.e., forget/retain data or a relevant surrogate dataset. This dependency limits their applicability in scenarios where access to source data is restricted or unavailable. Even recent source-free class unlearning methods rely on generating samples in the data space, which is computationally expensive and not essential for class unlearning. In this work, we propose a novel source-free class unlearning framework that enables existing unlearning methods to operate using only the deployed model. We show that, under assumptions on the forget loss with respect to logits, class unlearning can be performed source-free for any given neural classifier by utilizing randomly generated samples within the classifier’s intermediate space. Specifically, randomly generated embeddings pseudo-labeled by the model as belonging to the forget or retain classes can support effective source-free unlearning.
Our analysis further shows that, under conditions on the forget loss and synthetic forget embeddings, minimizing the forget loss induces expected logit shifts consistent with class unlearning, without requiring a specific parametric form of the embedding distribution. We validate our framework on four backbone architectures, ResNet-18, ResNet-50, ViT-B/16, and Swin-T, across three benchmark datasets, CIFAR-10, CIFAR-100, and TinyImageNet. Our experimental results show that existing class unlearning methods can operate within our source-free framework, with minimal impact on their forgetting efficacy and retain class accuracy. The code is available at https://github.com/Yasaman-dt/Source_Free_Class_Unlearning.

URL: https://openreview.net/forum?id=Fb2sZ1eoVe

---

Title: KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Authors: Antoni Bigata Casademunt, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

Abstract: Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing, but are largely neglected by existing works. To address these shortcomings, we present KeySync, a two-stage framework that succeeds in mitigating the issue of temporal consistency, while also incorporating solutions for leakage and occlusions using a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Our code and videos are available here: https://antonibigata.github.io/KeySync/.

URL: https://openreview.net/forum?id=dvtMHhZUyG

---

Title: On Federated Compositional Optimization: Algorithms, Analysis, and Guarantees

Authors: Prashant Khanduri, Chengyin Li, Rafi Ibn Sultan, Aditi Sarker, Yao Qiang, Joerg Kliewer, Dongxiao Zhu

Abstract: Compositional optimization (CO) has recently gained popularity due to its use in many machine learning applications. The large-scale and distributed nature of data necessitates efficient federated learning (FL) algorithms for CO, but the compositional structure of the objective poses significant challenges. Current methods either rely on large batch gradients (which are impractical), require expensive computations, or suffer from suboptimal guarantees. To address these challenges, we propose efficient FedAvg-type algorithms for solving non-convex CO in the FL setting. We first theoretically establish that standard FedAvg fails in solving the federated CO problems due to data heterogeneity, which amplifies bias in local gradient estimates. Our analysis shows that controlling this bias necessarily requires either {\em additional communication} or {\em additional structural assumptions}. To this end, we develop two algorithms for solving the federated CO problem. First, we propose FedDRO that utilizes the compositional problem structure to design a communication strategy that allows FedAvg to converge. FedDRO achieves a sample complexity of $\mathcal{O}(\epsilon^{-2})$ and a communication complexity of $\mathcal{O}(\epsilon^{-3/2})$ when the inner compositional objective is low-dimensional. When the inner objective is high-dimensional, the communication complexity increases to $\mathcal{O}(\epsilon^{-2})$, while the sample complexity remains $\mathcal{O}(\epsilon^{-2})$. Then we propose DS-FedDRO, a two-sided learning rate algorithm that leverages an additional assumption to improve upon the communication complexity of FedDRO. DS-FedDRO achieves the optimal $\mathcal{O}(\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-1})$ communication complexity irrespective of the dimensionality of the inner compositional objective. We corroborate our theoretical findings with empirical studies on large-scale CO problems.

URL: https://openreview.net/forum?id=4uRlbSNevR
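
Illustrative note: a toy calculation shows why heterogeneous clients can bias local gradients in a compositional objective. The sketch assumes an objective of the form f(mean_i g_i(x)) with a shared outer function f(u) = u^2 and linear client maps g_i(x) = a_i * x; these choices are hypothetical and are not the paper's setting or algorithm.

```python
def outer_grad(u):
    return 2.0 * u  # derivative of the outer function f(u) = u**2

# Hypothetical clients: inner map g_i(x) = a_i * x, so g_i'(x) = a_i.
slopes = [1.0, 3.0]
x = 1.0

inner_values = [a * x for a in slopes]
mean_inner = sum(inner_values) / len(slopes)
mean_slope = sum(slopes) / len(slopes)

# Gradient of the federated objective f(mean_i g_i(x)) by the chain rule.
global_grad = outer_grad(mean_inner) * mean_slope

# Average of the gradients each client would compute from its own g_i alone.
local_grads = [outer_grad(g) * a for g, a in zip(inner_values, slopes)]
avg_local_grad = sum(local_grads) / len(local_grads)

bias = avg_local_grad - global_grad
```

With identical clients (equal slopes) the bias vanishes, which is consistent with the abstract attributing the failure to data heterogeneity.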

---

Title: InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation

Authors: Faezeh Faez, Marzieh S. Tahaei, Yaochen Hu, Ali Pourranjbar, Mahdi Biparva, Mark Coates, Yingxue Zhang

Abstract: Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using our pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.

URL: https://openreview.net/forum?id=bvJlAodxEC

---

Title: Variational Graph Structure Learning for GNNs by using Marginal Likelihood

Authors: Anita Yang, Thomas Möllenhoff, Ken-Ichi Kawarabayashi, Mohammad Emtiyaz Khan

Abstract: Learning graph structures for Graph Neural Networks (GNNs) can improve their performance, but it is challenging to search over the large discrete space of graphs. Prior works often impose fixed structural constraints to promote properties such as sparsity, but these constraints can be misspecified, overly restrictive, and can also degrade performance. Here, we propose a simple alternative based on marginal likelihood, which naturally favors such properties without requiring any explicit graph constraints. We show that a variational formulation with Laplace's method automatically leads to a marginal-likelihood-based objective over discrete graph structures, which can be optimized efficiently using the Gumbel-Softmax trick. We call this approach the Laplace Approximation-based Graph Structure (LAGS) method, and show empirically that it improves recent state-of-the-art GNNs.

URL: https://openreview.net/forum?id=fVMr2sTow5
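
Illustrative note: the Gumbel-Softmax trick mentioned in the abstract draws a continuous relaxation of a categorical sample. A minimal standard-library sketch follows; treating a single edge as a two-way choice between "present" and "absent" is an assumption for illustration and is not claimed to be how LAGS parametrizes graphs.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (continuous) sample from a categorical distribution."""
    noise = []
    for _ in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise.append(-math.log(-math.log(u)))  # Gumbel(0, 1) via inverse transform
    scaled = [(l + g) / temperature for l, g in zip(logits, noise)]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one edge: [present, absent].
sample = gumbel_softmax([2.0, 0.0], temperature=0.5, rng=random.Random(0))
```

Lower temperatures push the sample toward a one-hot vector; higher temperatures make it smoother.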

---

Title: Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

Authors: Shravan S Chaudhari, Yoav Wald, Suchi Saria

Abstract: As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call \emph{the background distribution}, is fixed. In this paper we develop \ours{}, a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make \ours{} scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that \ours{} significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influence performance, an aspect that has not been extensively explored in prior work.

URL: https://openreview.net/forum?id=uAJDta7VaQ

---

Title: BoSS: A Best-of-Strategies Selector as an Oracle for Deep Active Learning

Authors: Denis Huseljic, Paul Hahn, Marek Herde, Christoph Sandrock, Bernhard Sick

Abstract: Active learning (AL) aims to reduce annotation costs while maximizing model performance by iteratively selecting valuable instances. While foundation models have made it easier to identify these instances, existing selection strategies still lack robustness across different models, annotation budgets, and datasets. To highlight the potential weaknesses of existing AL strategies and provide a reference point for research, we explore oracle strategies, i.e., strategies that approximate the optimal selection by accessing ground-truth information unavailable in practical AL scenarios. Current oracle strategies, however, fail to scale effectively to large datasets and complex deep neural networks. To tackle these limitations, we introduce the Best-of-Strategies Selector (BoSS), a scalable oracle strategy designed for large-scale AL scenarios. BoSS constructs a set of candidate batches through an ensemble of selection strategies and then selects the batch yielding the highest performance gain. As an ensemble of selection strategies, BoSS can be easily extended with new state-of-the-art strategies as they emerge, ensuring it remains a reliable oracle strategy in the future. Our evaluation demonstrates that i) BoSS outperforms existing oracle strategies, ii) state-of-the-art AL strategies still fall noticeably short of oracle performance, especially in large-scale datasets with many classes, and iii) one possible solution to counteract the inconsistent performance of AL strategies might be to employ an ensemble‑based approach for the selection.

URL: https://openreview.net/forum?id=qTs6spvhOS

---

Title: VECO: VEctor COnformity Based OOD Detection in Text and Multimodal Models

Authors: Mouïn Ben Ammar, Arturo Mendoza, Antoine Manzanera, Gianni Franchi

Abstract: Out-of-distribution (OOD) detection is critical for the reliable deployment of natural language processing and multimodal document understanding systems, where domain and semantic shifts are unavoidable. While many post-hoc OOD detection methods were developed for vision models, their direct transfer to textual and multimodal Transformer architectures remains poorly understood. We show that, unlike in vision benchmarks, the feature space provides the dominant OOD signal for text and document models, consistently outperforming logit-based and hybrid detectors.
Building on this observation, we introduce \textbf{VECO} (\emph{VEctor COnformity}), a geometry-aware, purely feature-based OOD scoring framework that implements a stable soft contrast between in-distribution conformity and residual-space deviation.
We instantiate VECO using principal-subspace conformity for multimodal document models and Mahalanobis distance conformity for text classifiers, reflecting modality-aligned representation structure.
VECO achieves state-of-the-art and consistent performance improvements on multimodal document and text classification benchmarks. These results highlight the modality-dependent nature of OOD detection and the importance of adapting score design to representation cues.

URL: https://openreview.net/forum?id=sMbGqh7Zvt

---

Title: Achieving PAC Guarantees in Mechanism Design through Multi-Armed Bandits

Authors: Takayuki Osogami, Hirota Kinoshita, Segev Wasserkrug

Abstract: We analytically derive a class of optimal solutions to a linear program (LP) for automated mechanism design that satisfies efficiency, incentive compatibility, strong budget balance (SBB), and individual rationality (IR), where SBB and IR are enforced in expectation. Our solutions can be expressed using a set of essential variables whose cardinality is exponentially smaller than the total number of variables in the original formulation. However, evaluating a key term in the solutions requires exponentially many optimization steps as the number of players $N$ increases. We address this computational bottleneck by formulating it as a best mean estimation problem in the multi-armed bandit (MAB) framework, where the goal is pure exploration to estimate expectations rather than online learning with exploration-exploitation tradeoffs. We develop a probably approximately correct (PAC) estimator with asymptotically optimal sample complexity. This MAB-based statistical estimation approach reduces the optimization complexity from exponential to $O(N\log N)$. Numerical experiments confirm that our method efficiently computes mechanisms with the target properties, scaling to problems with up to $N=128$ players---substantially improving over prior work.

URL: https://openreview.net/forum?id=tbe8143jO8
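
Illustrative note: the flavor of a PAC guarantee for mean estimation can be seen from the standard Hoeffding bound, which gives the number of bounded i.i.d. samples needed so that an empirical mean is within epsilon of the true mean with probability at least 1 - delta. This is the textbook single-mean bound, not the estimator developed in the paper.

```python
import math

def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Samples needed so the empirical mean of i.i.d. values in an interval of
    width value_range is within epsilon of the true mean with prob. >= 1 - delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

n = hoeffding_sample_size(epsilon=0.1, delta=0.05)
```

The count grows as 1/epsilon^2 but only logarithmically in 1/delta, so demanding higher confidence is much cheaper than demanding higher accuracy.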

---

Title: Super-fast Rates of Convergence for Neural Network Classifiers under the Hard Margin Condition

Authors: Nathanael Tepakbong, Xiang ZHOU, Ding-Xuan Zhou

Abstract: We study the classical binary classification problem for hypothesis spaces of Deep Neural Networks (DNNs) under Tsybakov's low-noise condition with exponent $q>0$, as well as its limit case $q=\infty$, which we refer to as the \emph{hard margin condition}. We demonstrate that, for a wide range of commonly used activation functions (including but not limited to ReLU, LeakyReLU, ELU, CELU, SELU, Softplus, GELU, SiLU, Swish, Mish, and Softmax), DNN solutions to the empirical risk minimization (ERM) problem with square loss surrogate and $\ell_p$ penalty on the weights $(0 < p < \infty)$ can achieve excess risk bounds of order $\mathcal{O}\left(n^{-\alpha}\right)$ for $\alpha$ close to $1$ under the low-noise condition, and for arbitrarily large $\alpha>1$ under the hard-margin condition, provided that the Bayes regression function $\eta$ satisfies a \emph{distribution-adapted smoothness} condition relative to the marginal data distribution $\rho_{X}$. Furthermore, when the activation function is chosen as $\tanh$ or sigmoid, we show that the same rates follow from the standard assumption that $\eta\in \mathcal{C}^s$. Finally, we establish minimax lower bounds, showing that these rates cannot be improved upon whenever $q\ge2$. Our proof relies on a novel decomposition of the excess risk for general ERM-based classifiers which might be of independent interest.

URL: https://openreview.net/forum?id=HXun3l0Feu

---

Title: Modeling Stochastic Conditional Dynamics from Sparse Observations via Kernel-Stabilized Flow Matching

Authors: Adam P. Generale, Andreas Euan Robertson, Surya Kalidindi

Abstract: Learning to transform conditional probability densities over time is a fundamental challenge spanning probabilistic modeling and the natural sciences. This task is paramount when forecasting the evolution of stochastic nonlinear dynamical systems in biological and physical domains. While flow-based models can predict the temporal evolution of probability distributions, existing approaches often assume discrete conditioning with samples that are paired across time, limiting their scientific applicability where frequently only sparse data with unpaired continuous conditioning is available. We propose Conditional Variable Flow Matching (CVFM), a framework for learning flows transforming conditional distributions with amortization across the continuous space of conditional densities. CVFM addresses the high-variance instability of prior methods by jointly sampling flows over state and conditioning variables, utilizing a conditioning mismatch kernel alongside a conditional Wasserstein distance to reweight the conditional optimal transport objective. Collectively, these advances allow for learning dynamics from sparse unpaired measurements of state-condition across time. We evaluate CVFM on conditional mapping benchmarks and a case study modeling the temporal evolution of materials internal structure during manufacturing processes, observing improved performance and convergence characteristics over existing conditional variants.

URL: https://openreview.net/forum?id=3A6oAS2TWo

---

Title: ZipAct: Zipping Interaction History into a Compact State for Efficient LLM Agents

Authors: Zhiming Pan, Junyu Luo, Zhiping Xiao, Kaize Ding, Xiao Luo, Ming Zhang

Abstract: Current large language model (LLM) agentic frameworks typically rely on the entire raw interaction history to make decisions. Despite recent remarkable progress, this paradigm notably suffers from the \textit{context snowball} effect: as the task progresses step by step, the history grows unboundedly, resulting in excessive token consumption and diluted agent attention. To this end, this paper proposes a novel and lightweight framework named ZipAct, which ``zips'' the lengthy history into a compact state during agentic reasoning. In particular, instead of feeding the full history to the model, ZipAct maintains a structured state table comprising the agent's goal, world status and key constraints, which are updated dynamically at each step. This simple design shifts agentic reasoning from a history-dependent paradigm to a state-dependent paradigm, which significantly reduces computational cost from quadratic ($O(T^2)$) to linear ($O(T)$) under a bounded-state assumption. Extensive experiments across multiple benchmark datasets demonstrate that ZipAct drastically reduces token usage while stably preserving or improving success rates compared to competing baselines. For reproducing results, our codebase can be accessed at: \url{https://github.com/Thomas-mci-21/ZipAct_TMLR}.

URL: https://openreview.net/forum?id=ZssIalqqrz

---


New submissions
===============


Title: Optimal or Greedy Decision Trees? Revisiting their Objectives, Tuning, and Performance

Abstract: Recently there has been a surge of interest in optimal decision tree (ODT) methods that globally optimize accuracy directly, in contrast to traditional approaches that locally optimize an impurity or information metric. However, the literature shows conflicting evidence on the value of ODTs, with some demonstrating superior out-of-sample performance of ODTs over greedy approaches, while others show the opposite. The value and performance of ODTs therefore remains one of several open questions regarding ODTs, most of which could not be answered before due to lack of scalability. With our experimental study---the largest to date---we examine five such open questions. Our results show (i) that a major advantage of optimal decision trees over greedy approaches is that they can optimize the target objective directly (e.g., accuracy rather than a proxy such as Gini impurity); (ii) that hyperparameter tuning of ODTs is essential; and reaffirm (iii) that optimal methods, on average, obtain smaller and more accurate trees than greedy approaches. Our results also refute two previously posited hypotheses: (iv) that the difference between optimal and greedy approaches diminishes with more data, and (v) that optimal methods are more sensitive to overfitting. Finally, our work provides insights on the value of ODTs, clear recommendations for researchers and practitioners on the usage of greedy and optimal methods, and code for future comparisons.

URL: https://openreview.net/forum?id=DvDOAtskXl

---

Title: Target-Risk Identification and Honest Inference from Weak Labels: The Observed Fiber, Not the Row Space

Abstract: We study target-population evaluation of a fixed predictor when clean target labels are unavailable and source labels are observed only through weak supervision. The standard loss-correction view says that the clean target risk is estimable when the clean loss vector lies in the row space of the weak-label channel, so that an unbiased corrected weak-label loss exists. We show that this row-space condition answers a stronger, uniform question, not the observed-law question faced by an evaluator. The correct population object is the observed weak-label fiber: the set of clean posteriors that reproduce the observed weak conditional distribution. Under exact covariate shift and overlap, the target risk is point identified exactly when the clean-loss functional is constant on this fiber almost surely; otherwise, two pointwise linear programs give the sharp identified interval. The main technical addition is a finite-sample inference layer for the realistic case in which the weak-label law, target covariate weights, and weak-label channel are estimated or sensitivity-modeled. We introduce confidence fibers, prove honest coverage of the clean target risk from joint confidence sets for these nuisance objects, give an exact linear-program formulation under polyhedral multinomial confidence sets, and show convergence to the structural fiber interval without a separation condition at the boundary between point and partial identification. The resulting audit output is deliberately conservative: it certifies point identification when weak labels justify a number, and otherwise reports an honest interval rather than a pseudo-corrected point estimate. Public WRENCH audits illustrate the warning that coarse weak labels can severely understate clean risk while confidence-fiber intervals expose the missing information.

URL: https://openreview.net/forum?id=gA0UOVhFah

---

Title: Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects

Abstract: The development of agents with emotional intelligence is becoming increasingly vital due to their significant role in human-computer interaction and the growing integration of computational systems across various sectors of society. Affective computing aims to design intelligent systems that can recognize, evoke, and express human emotions, thereby emulating human emotional intelligence. While previous reviews have focused on specific aspects of this field, there has been limited comprehensive research that encompasses emotion understanding, elicitation, and expression, along with the related challenges. This survey addresses this gap by integrating the core components of artificial emotional intelligence into one cohesive map for researchers. It covers emotion understanding through multimodal data processing, as well as affective cognition, which includes cognitive appraisal, emotion mapping, and adaptive modulation in decision-making, learning, and reasoning. Additionally, it addresses the synthesis of emotional expression across text, speech, and facial modalities to enhance human-agent interaction. This paper identifies and analyzes the key challenges and issues encountered in the development of affective systems, covering state-of-the-art methodologies designed to address them. Finally, we highlight promising future directions, with particular emphasis on the potential of generative technologies to advance affective computing.

URL: https://openreview.net/forum?id=pMBsQ9ierF

---

Title: ThinkRanker: Reasoning-Augmented LLM Text Ranking

Abstract: Large language models (LLMs) have recently shown strong reasoning abilities in domains like mathematics, coding, and scientific problem-solving, yet their potential for ranking tasks, where prime examples include retrieval, recommender systems, and LLM routing, remains underexplored. Ranking requires complex reasoning across heterogeneous candidates, but existing LLM-based rankers are often domain-specific, tied to fixed backbones, and lack iterative refinement, limiting their ability to fully exploit LLMs’ reasoning potential. To address these challenges, we propose ThinkRanker, a reasoning-incentive framework built on reinforcement learning, with two complementary designs: DRanker, which generates full rankings in one shot, and IRanker, which decomposes ranking into an iterative elimination process with step-wise rewards to encourage deeper reasoning. We evaluate unified ThinkRankers on nine datasets spanning recommendation, routing, and passage ranking, showing that IRanker-3B consistently achieves state-of-the-art performance, surpasses larger 7B models on some tasks, and yields a 15.7% average relative improvement. Ablation and generalization experiments further confirm the critical role of reinforcement learning and iterative reasoning, with IRanker-3B improving zero-shot performance by over 9% on out-of-domain tasks and reasoning traces boosting other LLMs by up to 22.87%. These results demonstrate that unifying diverse ranking tasks with a single reasoning-driven foundation model is both effective and essential for advancing LLM reasoning in ranking scenarios.

URL: https://openreview.net/forum?id=mDorbbaFCS

---

Title: On the Computational Stability of Cross-Fitted Double Machine Learning

Abstract: Double machine learning (DML) combines orthogonal score construction with cross-fitting to enable semiparametric inference using flexible machine learning (ML) nuisance estimators. While the statistical properties of DML have been studied extensively, comparatively little attention has been devoted to the computational variability induced by randomized sample splitting itself. In practical implementations, repeated executions of the same DML procedure on a fixed dataset may produce different parameter estimates solely because randomized fold assignments vary across runs. This paper investigates the finite-sample computational stability of cross-fitted DML estimators under repeated randomized sample splitting. We introduce a simple split-instability metric that quantifies estimator variability across independently generated cross-fitting realizations and study its empirical behavior through controlled simulation experiments. The experiments demonstrate that randomized cross-fitting induces non-negligible finite-sample variability, even when the underlying dataset and nuisance learners remain fixed. Repeated cross-fitting substantially stabilizes DML estimates and empirically exhibits behavior closely consistent with inverse square-root variance reduction under repeated averaging. Additional experiments show that instability decreases with increasing sample size, increases in highly correlated high-dimensional regimes and depends nontrivially on nuisance learner choice and cross-fitting configuration. Across all experiments, repeated cross-fitting primarily improves estimator performance through variance reduction rather than bias reduction. Overall, the results highlight the importance of computational reproducibility diagnostics in ML-assisted inference and suggest that repeated cross-fitting provides a simple stabilization mechanism for practical DML workflows.
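
Editor's sketch (not from the paper): one plausible reading of a split-instability metric is the standard deviation of the DML estimate across independently drawn fold assignments on a fixed dataset. The abstract does not give the exact definition, so the metric, the partially linear model, the OLS nuisance learners, and all names below are assumptions made for illustration.

```python
import numpy as np


def dml_plr_estimate(y, d, X, n_folds, rng):
    """Cross-fitted DML estimate of theta in y = theta * d + g(X) + noise."""
    n = len(y)
    folds = rng.permutation(n) % n_folds  # random, balanced fold labels
    design = np.column_stack([np.ones(n), X])
    res_y = np.empty(n)
    res_d = np.empty(n)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        # Nuisance fits (plain OLS here) use only the other folds.
        beta_y, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        beta_d, *_ = np.linalg.lstsq(design[train], d[train], rcond=None)
        res_y[test] = y[test] - design[test] @ beta_y
        res_d[test] = d[test] - design[test] @ beta_d
    # Residual-on-residual regression (orthogonal score for this model).
    return float(res_d @ res_y / (res_d @ res_d))


# Fixed synthetic dataset; only the fold assignment changes between runs.
data_rng = np.random.default_rng(0)
n, p, theta_true = 400, 5, 1.0
X = data_rng.normal(size=(n, p))
d = X @ data_rng.normal(size=p) + data_rng.normal(size=n)
y = theta_true * d + X @ data_rng.normal(size=p) + data_rng.normal(size=n)

split_rng = np.random.default_rng(1)
estimates = np.array(
    [dml_plr_estimate(y, d, X, n_folds=5, rng=split_rng) for _ in range(200)]
)

# Assumed metric: spread of the estimate across random splits.
split_instability = estimates.std(ddof=1)
# Repeated cross-fitting: average 10 splits per reported estimate.
averaged = estimates.reshape(20, 10).mean(axis=1)
averaged_instability = averaged.std(ddof=1)
print(split_instability, averaged_instability)
```

Averaging over repeated splits should shrink the spread, which is the stabilization effect the abstract describes. With OLS nuisances the raw instability is small, and more flexible learners would typically make it larger.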

URL: https://openreview.net/forum?id=y5Z1kHKwDe

---

Title: Provable Generalization of Dataset Condensation for Classification via Signal--Noise Dynamics

Abstract: Dataset condensation, particularly via gradient matching, distills massive datasets into compact synthetic sets, making it an important tool for training under severe storage or computation constraints. However, despite strong empirical performance on classification tasks, existing theory largely relies on regression surrogates or static analyses and gives limited explanation of the underlying classification dynamics. We study gradient-matching condensation for regularized hinge-loss SVMs under an additive sub-Gaussian classification model. Our analysis shows that the learned condensed samples act as signal-concentrating representatives: they aggregate class-level structure while suppressing finite-sample noise and initialization residuals. This mechanism leads to population generalization guarantees for geometry-based evaluators and yields an explicit advantage over random one-shot coresets. The dynamics also identify an early-stopping tradeoff: informative structure is encoded early, whereas overly long inner loops can weaken certified signal accumulation. We further give a multiclass one-condensed-sample-per-class extension through a classwise one-vs-rest update and nearest-prototype evaluation, and simulations on synthetic data and KMNIST corroborate the predicted geometry, schedule sensitivity, and multiclass behavior.

URL: https://openreview.net/forum?id=zTDzlcRRpC

---

Title: Collaborative Unpaired Multimodal Learning for Image Classification

Abstract: Multimodal learning typically requires expensive paired data for training and assumes all modalities are available at inference. Many real-world scenarios, however, involve unpaired and heterogeneous data distributed across institutions, making collaboration challenging. We introduce Unpaired Multimodal Learning (UML) as the problem of leveraging semantically related but unaligned data across modalities, without requiring explicit pairing or multimodal inference. This setting naturally arises in collaborative scenarios such as satellite imagery, where institutions collect data from diverse sensors (optical, multispectral, SAR), but paired acquisitions are rare, and data sharing is restricted. We propose a collaborative framework that combines modality-specific projections with a shared backbone, enabling cross-modal knowledge transfer without paired samples. A key element is post-hoc batch normalization calibration, which adapts the shared model to each modality. Our framework also extends naturally to federated training across institutions. Experiments on multiple satellite benchmarks and additional visual datasets show consistent improvements over unimodal baselines, with particularly strong gains for weaker modalities and in low-data regimes.

URL: https://openreview.net/forum?id=ojH2fVwxE2

---

Title: Never a Dull Moment: Distributional Properties as a Baseline for Time-Series Classification

Abstract: The variety of complex algorithmic approaches for tackling time-series classification problems has grown considerably over the past decades, including the development of sophisticated but challenging-to-interpret deep-learning-based methods. But without comparison to simpler methods, it can be difficult to determine whether such complexity is required to obtain strong performance on a given problem. Here, we evaluate the performance of an extremely simple classification approach: a linear classifier in the space of two basic features that ignore the sequential ordering of the data: the mean and standard deviation of time-series values. Across a large repository of 129 (after filtering) univariate time-series classification problems, this simple distributional moment-based approach outperformed chance on 71 problems and reached 100% accuracy on two problems. In an additional neuroimaging time-series classification case study, we find that a simple linear model based on the mean and standard deviation performs better at classifying individuals with schizophrenia than a model that additionally includes features of the time-series dynamics, with performance sitting within the range of current literature. We conclude that comparing the performance of simple distributional features of a time series provides important context for interpreting the performance of more complex features or methods, which may not always be required to obtain high accuracy.
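
Editor's sketch (not from the paper): the baseline amounts to reducing each series to its mean and standard deviation and fitting a linear classifier on those two features. The synthetic two-class data and the least-squares classifier below are stand-ins chosen for a dependency-light example; the abstract does not say which linear classifier the authors use.

```python
import numpy as np


def moment_features(series):
    # series has shape (n_series, length); temporal ordering is ignored.
    return np.column_stack([series.mean(axis=1), series.std(axis=1)])


def fit_linear_classifier(features, labels):
    # Least-squares fit to +1/-1 targets: a simple linear decision rule.
    design = np.column_stack([np.ones(len(features)), features])
    targets = np.where(labels == 1, 1.0, -1.0)
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights


def predict(weights, features):
    design = np.column_stack([np.ones(len(features)), features])
    return (design @ weights > 0).astype(int)


rng = np.random.default_rng(0)


def make_data(n_per_class, length):
    # Hypothetical classes that differ only in variance, not in dynamics.
    low_var = rng.normal(0.0, 1.0, size=(n_per_class, length))
    high_var = rng.normal(0.0, 2.0, size=(n_per_class, length))
    return np.vstack([low_var, high_var]), np.repeat([0, 1], n_per_class)


train_x, train_y = make_data(100, 100)
test_x, test_y = make_data(100, 100)
weights = fit_linear_classifier(moment_features(train_x), train_y)
accuracy = (predict(weights, moment_features(test_x)) == test_y).mean()
print(accuracy)
```

When classes differ mainly in level or spread, as in this toy data, the two-feature baseline separates them without any use of temporal structure.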

URL: https://openreview.net/forum?id=HHDz1Z3CSE

---

Title: High precision PINNs in unbounded domains: application to singularity formulation in PDEs

Abstract: We investigate the high-precision training of Physics-Informed Neural Networks (PINNs) in unbounded domains, with a special focus on applications to singularity formulation in PDEs. We propose a modularized approach and study the choices of neural network ansatz, sampling strategy, and optimization algorithm. When combined with rigorous computer-assisted proofs and PDE analysis, the numerical solutions identified by PINNs, provided they are of high precision, can serve as a powerful tool for studying singularities in PDEs. For the 1D Burgers equation, our framework can lead to a solution with very high precision, and for the 2D Boussinesq equation, which is directly related to the singularity formulation in 3D Euler and Navier-Stokes equations, we obtain a solution whose loss is 4 digits smaller than that obtained in \cite{wang2023asymptotic} with fewer training steps. We also discuss potential directions for pushing towards machine precision for higher-dimensional problems.

URL: https://openreview.net/forum?id=g34kZZAbT8

---

Title: Adaptive Hypergraph Pruning with Learned Threshold Control and Attention-Based Negative Mining

Abstract: Hypergraph neural networks (HGNNs) capture multi-way interactions but their quadratic cost limits scalability. Existing pruning methods rely on hand-crafted schedules and fixed per-level thresholds that require per-dataset tuning and treat structural pruning, contrastive learning, and loss balancing as independent objectives. We propose \textbf{TriPrune-HGNN}, an adaptive hypergraph pruning framework that replaces these hand-crafted components with four small learnable controllers: a compressibility predictor that estimates the achievable retention ratio from graph-level statistics before training; hierarchical soft gates with learned thresholds operating jointly over components, hyperedges, and nodes; attention-based contrastive mining that identifies false and hard negatives induced by topology change; and meta-learned loss balancing. Pruning is realised as a density curriculum so the model trains at its final inference sparsity. Our principal claim is a favourable accuracy--efficiency \emph{operating point}, not per-metric dominance. Against the strongest efficient baselines (AdaGLT, SHARP-Distill, Shaver) under matched tuning budgets on five heterogeneous hypergraph benchmarks, MAE improvements are small but consistent in sign: $0.002$--$0.006$ at fixed cross-dataset hyperparameters, narrowing to $0.002$--$0.004$ when the baselines are re-tuned per-dataset. The larger headline numbers ($75.5\%$ inference-time reduction, $69.7\%$ memory reduction) are computed against the unpruned HGNN baseline (Table~\ref{tab:main}, Avg rows) and are reported for context, not as the primary contribution. We additionally test out-of-domain generalisation on NTU2012 3D shape recognition, where the accuracy advantage shrinks to $0.6\%$ while efficiency transfers cleanly --- a useful diagnostic separation between the framework's transferable (efficiency) and non-transferable (accuracy) components. 
Three short design-motivation propositions justify the choices of gate-smoothing $\epsilon$, finite-difference perturbation $\delta$, and the three-level pruning decomposition; they are local justifications under stated regularity conditions, not end-to-end guarantees. Code, weights, the synthetic generator, and seed-level numbers are released for reproducibility.

URL: https://openreview.net/forum?id=sukyDGns84

---

Title: Graph-based Subset Selection for Efficient Training of Gene Perturbation Models

Abstract: Genomic studies face a vast hypothesis space, while interventions such as gene perturbations remain costly and time-consuming. To accelerate such experiments, gene perturbation models predict the transcriptional outcome of interventions. Since constructing the training set is challenging, active learning is often employed in a “lab-in-the-loop” process. While this strategy makes training more targeted, it is substantially slower, as it fails to exploit the inherent parallelizability of Perturb-seq experiments. Here, we focus on graph neural network–based gene perturbation models and propose a subset selection method that, unlike active learning, selects the training perturbations in one shot. Our method chooses the interventions that maximize the propagation of the supervision signal to the model, thereby enhancing generalization. The selection criterion is defined over the input knowledge graph and is optimized with submodular maximization, ensuring a near-optimal guarantee. Experimental results across multiple datasets show that, in addition to providing months of acceleration compared to active learning, the method improves the stability of perturbation choices while maintaining competitive predictive accuracy.

URL: https://openreview.net/forum?id=S5YDiO3Oox

---

Title: Partition of Unity Neural Networks for Interpretable Classification with Explicit Class Regions

Abstract: We introduce Partition of Unity Neural Networks (PUNNs), a neural architecture that constructs class probabilities directly as a partition of unity, eliminating the need for a softmax layer. PUNNs produce nonnegative functions that sum to one via a recursive product of gate functions, guaranteeing valid probability distributions by design. Our contributions are threefold. First, we prove that PUNNs are dense in the space of continuous probability maps on compact domains, establishing a universal approximation guarantee for probabilistic classification. Second, the recursive gate construction induces a hierarchical rejection chain that explicitly reveals how predictions are formed: each gate performs a sequential accept/reject decision, passing remaining probability mass onward. We demonstrate this on MNIST, where the resulting gate trace localizes model uncertainty and pinpoints specific gating failures in misclassified examples. Third, we evaluate PUNNs against multilayer perceptrons and Explainable Boosting Machines across MNIST, UCI benchmarks, and synthetic datasets. Under matched parameter budgets on MNIST, PUNNs achieve accuracy within 0.4–1.1 percentage points of MLPs, with performance stable across random class orderings; on UCI tabular benchmarks, the gap is at most one percentage point. When geometric priors align with the data structure, shape-informed gate parameterizations can achieve comparable accuracy with up to 300× fewer parameters. We relate PUNNs to stick-breaking constructions from Bayesian nonparametrics, clarifying connections to probabilistic modeling while emphasizing the deterministic, input-dependent nature of the architecture. Overall, PUNNs provide a principled alternative to softmax-based classifiers, offering transparent class probability assignments through explicit gate decompositions, with controlled accuracy trade-offs.
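
Editor's sketch (not from the paper): a recursive product of gates that yields probabilities summing to one can be written in stick-breaking form, where each gate accepts a fraction of the remaining mass and passes the rest onward. The sigmoid-of-linear gates and all names below are assumptions; the paper's actual gate parameterization may differ.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def stick_breaking_probs(gates):
    """Map gate values in [0, 1], shape (n, K-1), to K class probabilities."""
    # remaining[:, j] is the mass still unassigned after gate j rejects.
    remaining = np.cumprod(1.0 - gates, axis=1)
    # Mass available to gate j is what the earlier gates passed onward.
    available = np.concatenate(
        [np.ones((gates.shape[0], 1)), remaining[:, :-1]], axis=1
    )
    accepted = gates * available
    # The last class receives whatever every gate rejected.
    return np.concatenate([accepted, remaining[:, -1:]], axis=1)


rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))
gate_weights = rng.normal(size=(3, 4))  # hypothetical: 4 gates give 5 classes
gate_bias = rng.normal(size=4)

gates = sigmoid(inputs @ gate_weights + gate_bias)
probs = stick_breaking_probs(gates)
print(probs.sum(axis=1))
```

The rows sum to one by construction because the accepted masses telescope, so no softmax normalization is needed. The per-gate values also serve as a trace of where each example's probability mass was accepted or passed on.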

URL: https://openreview.net/forum?id=YPd3khSMhW

---

Title: Localising Failure between Representation and Readout: A Fresh-Head Probe for Parameter-Space Model Merging

Abstract: Parameter-space merging has become increasingly sophisticated about how independently fine-tuned models should be combined, but less explicit about what the resulting post-merge score is allowed to diagnose. The native end-to-end accuracy of a merged model is a valid verdict on the system delivered by the merge; it is not, by itself, evidence that failure entered through the merged representation rather than through the readout the merge also delivered. We formalise this distinction with a fresh-head probe: the merged backbone is held fixed, the union-label readout is re-estimated under matched supervision, and the native--fresh-head gap identifies the readout-recoverable component of the native shortfall. In a controlled CIFAR-100 diagnostic regime with Task Arithmetic and TIES-Merging, this component constitutes a substantial share of the native shortfall across the tested regimes, including structured Ward decompositions and random class-to-task partitions on both ViT-B/16 and ResNet-50. A separate geometry control shows that centroid-routed modular composition has a complementary boundary: it outperforms naive ensembling under structured Ward geometry on both backbones, but its advantage disappears or reverses under random class partitions. These results show that model-merging evaluation needs not only better merge operators, but a stricter evidential contract: post-merge scores should be read as delivered-model verdicts, not self-localising diagnoses of representation failure.

URL: https://openreview.net/forum?id=230T2UcWwR

---

Title: EFFEKT: Efficient Federated Knowledge Transfer to Foundation Models

Abstract: Recent data protection laws have accelerated the adoption of Federated Learning (FL) for privacy-preserving decentralized training. Nevertheless, increasing model sizes imposes substantial computational demands on client devices, limiting FL applicability in resource-constrained settings. We introduce a novel multi-domain federated learning framework in which lightweight client-side proxy models collaborate with a server-side Foundation Model (FM) to learn new concepts without sharing private data. Our approach, EFFEKT, enables efficient server-side training of domain-specific LoRA adapters while preserving feature-space alignment between the FM and proxy extractors via novel bi-directional cross-distillation strategies. Experiments on multiple real-world datasets and deployments on low-power edge devices demonstrate consistent improvements over state-of-the-art baselines while maintaining lightweight computation on the client side.

URL: https://openreview.net/forum?id=jpUDUJfE1K

---

Title: A variational approach to dimension-free self-normalized concentration

Abstract: We study the self-normalized concentration of vector-valued stochastic processes. We focus on bounds for ``sub-$\psi$'' processes, a quite general class of processes that encompasses a wide variety of well-known tail conditions (including sub-exponential, sub-Gaussian, sub-gamma, sub-Poisson, and several heavy-tailed settings without a moment generating function such as symmetric or bounded 2nd or 3rd moments). Our results recover and generalize the influential bound of de la Peña et al. (2004) (proved again in Abbasi-Yadkori et al. 2011) in the sub-Gaussian case. Further, we fill a gap in the literature between determinant-based bounds and more recent bounds based on condition numbers. As applications we prove a Bernstein inequality for random vectors satisfying a moment condition (a more general condition than boundedness), and also provide the first dimension-free self-normalized empirical Bernstein inequality. Our techniques are based on the variational (PAC-Bayes) approach to concentration.

URL: https://openreview.net/forum?id=XX5vENsuwx

---

Title: On the Information Bottleneck of VJEPA

Abstract: Joint Embedding Predictive Architectures (JEPAs) learn representations by predicting latent targets rather than reconstructing high-dimensional, pixel-level observations. Variational JEPA (VJEPA) extends this idea by replacing deterministic target regression with a probabilistic predictive model and a target-side KL regularizer. This paper provides a theoretical analysis of VJEPA through the lenses of Information Bottleneck (IB) and Predictive Information Bottleneck (PIB). We analyze two forms of VJEPA: the context-target form, naturally associated with IB, and the temporal form, naturally associated with PIB. We show that the current VJEPA objective is a partial bottleneck objective: its latent negative log-likelihood implements a Barber-Agakov lower bound on predictive mutual information, while its target KL regularizes the target-side latent distribution, rather than directly compressing the context or current state. We then derive two completed objectives: a full IB-VJEPA, which introduces a stochastic context encoder and a KL-to-prior penalty that upper-bounds the context information $I(X_C;Z_C)$, and a full PIB-VJEPA as its temporal specialization, which introduces a stochastic current-state encoder and a KL-to-prior penalty that upper-bounds the state information $I(X_{\le t};Z_t)$. The resulting analysis separates three information-theoretic roles that are conflated in standard JEPA-style objectives: predictive information maximization, target-side regularization, and explicit context/state compression. This closes the objective-level gap between probabilistic latent prediction and explicit information bottleneck control, providing a principled route to compression-controlled, uncertainty-aware VJEPA models.

URL: https://openreview.net/forum?id=S7hzZHbMwY

---

Title: LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Abstract: Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database.
Under this setting, we study the fundamental question of \emph{which tokens an SLM can and should learn} during pretraining, versus \emph{which ones it should delegate} via a \texttt{<CALL>} token.
We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, it is insufficient for identifying which predictions would actually lead to factually or semantically invalid continuations. Some high-loss tokens correspond to \emph{acceptable} alternative continuations of a pretraining document and therefore should not trigger a \texttt{<CALL>}. This suggests that learnability cannot be characterized from loss alone, but requires additional domain-specific signals about the role of a token in the sentence. In Wikipedia-like domains, we show that augmenting the loss signal with lightweight grammatical information from a spaCy parser substantially improves delegation decisions. Based on this insight, we propose LaCy, a novel pretraining method that combines loss with factuality signals to decide which tokens an SLM should learn. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and when to call for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.

URL: https://openreview.net/forum?id=JxVxBa3wO5

---

Title: Sample-efficient Multiclass Calibration under Probability-weighted $\ell_{p}$ Error

Abstract: Calibrating a multiclass predictor that outputs a distribution over labels is particularly challenging due to the exponential number of possible prediction values. In this work, we propose a new definition of calibration error that interpolates between two established calibration error notions, one with known exponential sample complexity and one with polynomial sample complexity for calibrating a given predictor. Our algorithm can calibrate any given predictor for the entire range of interpolation, except for one endpoint, using only a polynomial number of samples. At the other endpoint, we achieve nearly optimal dependence on the error parameter, improving upon previous work. A key technical contribution is a novel application of adaptive data analysis with high adaptivity but only logarithmic overhead in the sample complexity.

URL: https://openreview.net/forum?id=6gNBKKY1NX

---

Title: VideoMark: A Distortion-Free Robust Watermarking Framework for Video Diffusion Models

Abstract: This work introduces \textbf{VideoMark}, a distortion-free robust watermarking framework for video diffusion models. As diffusion models excel in generating realistic videos, reliable content attribution is increasingly critical. However, existing video watermarking methods often introduce distortion by altering the initial distribution of diffusion variables and are vulnerable to temporal attacks, such as frame deletion, due to variable video lengths. VideoMark addresses these challenges by employing a \textbf{pure pseudorandom initialization} to embed watermarks, avoiding distortion while ensuring uniform noise distribution in the latent space to preserve generation quality. To enhance robustness, we adopt a frame-wise watermarking strategy with pseudorandom error correction (PRC) codes, using a fixed watermark sequence with randomly selected starting indices for each video. For watermark extraction, we propose a Temporal Matching Module (TMM) that leverages homology distance to align decoded messages with the original watermark sequence, ensuring resilience against temporal attacks. Experimental results show that VideoMark achieves higher decoding accuracy than existing methods while maintaining video quality comparable to watermark-free generation. The watermark remains imperceptible to attackers without the secret key, offering superior invisibility compared to other frameworks. VideoMark provides a practical, training-free solution for content attribution in diffusion-based video generation.

URL: https://openreview.net/forum?id=V1KnWicf7A

---

Title: Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

Abstract: We provide a general convergence theorem for an idealized stochastic Polyak step size called \texttt{SPS}*. Besides convexity, we only assume a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to \texttt{SPS}* as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz functions, and is the first Polyak step size to have an $\mathcal{O}(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine \texttt{SPS}* with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
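
Editor's sketch (not from the paper): a stochastic Polyak step that uses the per-sample loss at a solution is commonly written as step = max(f_i(x) - f_i(x*), 0) / ||grad f_i(x)||^2. Whether this matches the paper's exact \texttt{SPS}* definition is an assumption. The noisy least-squares problem below is a toy where x* can be computed directly, which is what makes the idealized quantity available.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 50, 5
A = rng.normal(size=(n, dim))
b = A @ rng.normal(size=dim) + 0.1 * rng.normal(size=n)

# A minimizer of the average loss, available here only because the toy
# problem is small enough to solve in closed form.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)


def loss_i(x, i):
    return 0.5 * (A[i] @ x - b[i]) ** 2


def grad_i(x, i):
    return (A[i] @ x - b[i]) * A[i]


# The idealized ingredient: every per-sample loss evaluated at the solution.
loss_at_solution = np.array([loss_i(x_star, i) for i in range(n)])

x = np.zeros(dim)
initial_distance = np.linalg.norm(x - x_star)
for _ in range(2000):
    i = rng.integers(n)
    g = grad_i(x, i)
    g_sq = g @ g
    if g_sq == 0.0:
        continue  # sample already fit exactly; nothing to update
    # Polyak-type step with no learning-rate hyperparameter.
    step = max(loss_i(x, i) - loss_at_solution[i], 0.0) / g_sq
    x = x - step * g
final_distance = np.linalg.norm(x - x_star)
print(initial_distance, final_distance)
```

For convex per-sample losses, a step of this form never increases the distance to x*, which is why no step-size tuning appears in the loop. In the distillation setting the abstract mentions, the teacher's loss presumably plays the role of the loss at a solution, though the abstract does not spell this out.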

URL: https://openreview.net/forum?id=lNRBJekTFX
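
As a rough illustration of the kind of step the abstract above describes, the sketch below runs a Polyak-type stochastic update on a toy least-squares problem. The step-size formula (f_i(x) - f_i(x*)) / ||grad f_i(x)||^2, the toy data, and every function name are assumptions made for illustration; they are not taken from the paper.

```python
import random

def sps_star_step(x, grad, loss_at_x, loss_at_solution):
    """One Polyak-type step with gamma = (f_i(x) - f_i(x*)) / ||grad||^2 (assumed form)."""
    sq_norm = sum(g * g for g in grad)
    if sq_norm == 0.0:
        return list(x)  # zero gradient on this sample: nothing to update
    gamma = max(loss_at_x - loss_at_solution, 0.0) / sq_norm
    return [xi - gamma * gi for xi, gi in zip(x, grad)]

def sample_loss(x, a, b):
    # Per-sample least-squares loss f_i(x) = 0.5 * (a . x - b)^2
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    return 0.5 * residual * residual

def sample_grad(x, a, b):
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    return [residual * ai for ai in a]

def run_sps_star(data, x_star, steps=200, seed=0):
    """Run the update with uniformly sampled data points, starting from the origin."""
    rng = random.Random(seed)
    x = [0.0] * len(x_star)
    for _ in range(steps):
        a, b = rng.choice(data)
        x = sps_star_step(
            x,
            sample_grad(x, a, b),
            sample_loss(x, a, b),
            sample_loss(x_star, a, b),  # the "idealized" part: loss at a solution
        )
    return x

# Hypothetical toy problem whose exact solution is X_STAR.
X_STAR = [1.0, -2.0]
DATA = [([1.0, 0.0], 1.0), ([0.0, 1.0], -2.0), ([1.0, 1.0], -1.0)]

if __name__ == "__main__":
    print(run_sps_star(DATA, X_STAR))
```

Note that no learning rate is passed anywhere: the step length comes entirely from the loss gap and the gradient norm, which is why access to the per-batch loss at a solution is needed.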

---

Title: Rethinking channel fusion for robust multivariate time series classification under distribution shift

Abstract: In real-world applications of multivariate time series classification (TSC), distribution shift between training and testing data is common, often leading to degraded out-of-distribution (OOD) performance relative to in-distribution (ID) performance. Existing methods typically improve robustness through the training objective or augmentation, and evaluate models that use early fusion, where channels are jointly processed. However, the impact of fusing channels at later stages remains unclear. We show that later fusion structurally isolates channel-specific shifts, preventing a corrupted channel from contaminating the full feature representation as in early fusion. We evaluate this across four HAR datasets and four MI datasets under both subject-level and sensor corruption distribution shifts. Across HAR datasets, later fusion consistently reduces the ID-OOD gap, and models trained with standard ERM outperform domain generalisation algorithms, often substantially. Later fusion also exhibits strong resilience to sensor corruptions, with late fusion showing near-zero degradation even when half of all channels are corrupted. However, these gains are dataset-dependent: on MI datasets, the ID cost of later fusion outweighs its robustness benefits, while domain generalisation algorithms offer little improvement. We additionally propose a simple ID-based heuristic for selecting fusion strategies. Our findings show that fusion strategy is a critical and underexplored design choice for OOD robustness in multivariate TSC, with effects that can rival those of specialised learning algorithms. The code for this work is available at \url{https://...}.

URL: https://openreview.net/forum?id=lbAzMRlrFC

---

Title: An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

Abstract: Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are a cheap way to address some forms of hallucination, and we quantify how far short of solving the problem they will always be.

URL: https://openreview.net/forum?id=6RB9FJdqld
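
To make the idea of catching non-existent library features concrete, here is a minimal sketch of one such check using Python's standard `ast` module. It is not one of the tools evaluated in the paper. The function name is hypothetical, and the check imports the library to inspect it, which a purely static tool would avoid by consulting stubs or an index instead. It also ignores name shadowing and submodules that are only available after an explicit import.

```python
import ast
import importlib

def find_missing_library_attributes(source):
    """Report `module.attr` uses where the imported module has no such attribute."""
    tree = ast.parse(source)

    # Pass 1: map each local name bound by `import x` / `import x as y` to a module.
    modules = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname is not None:
                    modules[alias.asname] = alias.name
                else:
                    top_level = alias.name.split(".")[0]
                    modules[top_level] = top_level

    # Pass 2: check every `name.attr` whose base name is an imported module.
    missing = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            module_name = modules.get(node.value.id)
            if module_name is None:
                continue
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.add(module_name)  # the whole library does not exist
                continue
            if not hasattr(module, node.attr):
                missing.add(f"{module_name}.{node.attr}")
    return sorted(missing)

if __name__ == "__main__":
    generated = "import math\nprint(math.sqrt(4))\nprint(math.fast_sqrt(4))\n"
    print(find_missing_library_attributes(generated))
```

A check like this only sees names that are resolvable without running the generated code, which is one reason such methods cannot catch every hallucination.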

---

Title: Meta-Learned Surrogates for Clustering Model Selection

Abstract: Clustering model selection without ground-truth labels relies on Internal Validity Measures (IVMs) such as Silhouette, Calinski-Harabasz, and Davies-Bouldin. These fixed surrogates encode particular geometric assumptions and often correlate poorly with external agreement across heterogeneous datasets. We propose MetaIVM, a meta-learned surrogate for external-agreement-based clustering model selection. Trained offline on labeled benchmarks and deployed without labels, MetaIVM predicts the quality of individual (dataset, partition) pairs from observable features of partition structure, dataset statistics, and graph topology. Unlike prior meta-learning work that recommends algorithms at the dataset level, this per-partition formulation handles algorithm choice, hyperparameter selection, and cluster-count selection in a single framework. On a benchmark of 223 datasets and 16,889 clustering runs, MetaIVM reduces selection regret by 67% over the best classical IVM. Principled controls show that neither learning from IVMs alone nor dataset-level meta-selection suffices: the per-partition formulation is essential, and even linear regression with our features outperforms all IVMs. The method adapts its feature reliance to dataset geometry, is robust across four external metrics and across modified candidate pools, and transfers from synthetic to real-world domains, though performance depends on the training distribution. As a preliminary extension to graph community detection, where coordinate-based IVMs do not apply, MetaIVM outperforms modularity-based selection.

URL: https://openreview.net/forum?id=5r3WODcU47
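
The selection-regret metric mentioned above can be sketched as follows, assuming the common definition "best achievable external score minus the external score of the partition the surrogate picked". The paper may define or normalize it differently; the dictionary keys and the numbers in the usage example are made up for illustration.

```python
def selection_regret(candidates, surrogate_key, external_key):
    """Regret of picking the candidate with the highest surrogate score.

    Each candidate is a dict holding a label-free surrogate score (e.g. an IVM)
    and an external agreement score computed against ground-truth labels.
    """
    chosen = max(candidates, key=lambda c: c[surrogate_key])
    best_external = max(c[external_key] for c in candidates)
    return best_external - chosen[external_key]

if __name__ == "__main__":
    # Hypothetical candidate partitions of one dataset (illustrative values only).
    runs = [
        {"name": "kmeans_k3", "silhouette": 0.62, "ari": 0.40},
        {"name": "kmeans_k5", "silhouette": 0.55, "ari": 0.81},
        {"name": "dbscan", "silhouette": 0.30, "ari": 0.65},
    ]
    print(selection_regret(runs, "silhouette", "ari"))
```

A regret of zero means the surrogate selected a partition that is as good as the best candidate under the external metric.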

---

Title: Optimization as a Dynamical System: Generative Schedules from Latent ODEs

Abstract: We present a new meta-learning method to determine the optimal learning rate schedule for gradient descent. It leverages training runs from a hyperparameter search to learn a latent representation of the training process, which is modeled as a dynamical system. Given current training metrics, it predicts the future learning rate schedule with the best long-term validation performance. Our scheduler generalizes beyond previously observed training dynamics and creates specialized schedules that deviate noticeably from even the best-performing parametric functions. It outperforms all baselines we compare against on image classification with CNN and ResNet models as well as on next-token prediction with a transformer model. The trained models are located in flatter regions of the loss landscape and thus provide better generalization than those trained with other schedules. Our method is computationally efficient, optimizer-agnostic, and can easily be layered on top of ML experiment-tracking platforms to streamline training of neural networks from scratch.

URL: https://openreview.net/forum?id=SwmzJgB9TA

---

Title: Double Robustness Is Not a Privacy Certificate: Sensitivity Spillover in Private Policy Selection

Abstract: Doubly robust (DR) scores are a standard method in causal inference and policy evaluation: they make policy-value estimates insensitive to global nuisance estimation error. Differential privacy (DP), however, requires a different form of robustness: the released output must be stable under the replacement of any one individual. This paper shows that these two notions of robustness can differ significantly in private policy selection. We study the problem of selecting a high-value policy from a finite public library using learned DR utilities and the exponential mechanism. Although fixed or frozen-nuisance utilities have constant sensitivity, we identify a sensitivity spillover effect: replacing one record in the nuisance-training block can change the learned score map, and that changed score map is then evaluated on all records in the scoring block. We prove a separation showing that double robustness, vanishing nuisance error, and even zero DR population bias can coexist with order-$n$ realized utility sensitivity, invalidating the usual fixed-utility privacy calibration. We then give a sufficient certification route based on deterministic replace-one prediction stability of the nuisance learners, which yields a valid pure-DP exponential mechanism and a regret bound separating library approximation, concentration, DR product remainder, and certified privacy cost. Semi-synthetic experiments confirm that spillover can be large even when DR statistical diagnostics look benign, and that stability-oriented regularization controls the privacy-relevant movement. These findings highlight that private causal policy selection requires both orthogonality for statistical robustness and algorithmic stability for robustness to individual replacement.

URL: https://openreview.net/forum?id=7Z9E7axGAT
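
For readers unfamiliar with the exponential mechanism referred to above, the sketch below shows its standard form: a policy is sampled with probability proportional to exp(epsilon * utility / (2 * sensitivity)). The function names and utility values are illustrative assumptions, and nothing here reproduces the paper's DR utilities or its certification procedure. The point is only to show where the sensitivity enters the calibration.

```python
import math
import random

def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    u_max = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(scale * (u - u_max)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def select_policy(utilities, epsilon, sensitivity, rng):
    """Sample one policy index from the exponential-mechanism distribution."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(range(len(utilities)), weights=probs, k=1)[0]

if __name__ == "__main__":
    utilities = [1.0, 2.0, 3.0]  # hypothetical policy utilities
    print(exponential_mechanism_probs(utilities, epsilon=1.0, sensitivity=1.0))
    print(exponential_mechanism_probs(utilities, epsilon=1.0, sensitivity=100.0))
    print(select_policy(utilities, 1.0, 1.0, random.Random(0)))
```

The privacy guarantee holds only if `sensitivity` upper-bounds how much any utility can change when one record is replaced. Calibrating with a constant when the realized sensitivity grows with the sample size, as in the spillover effect the abstract describes, would therefore overstate the privacy actually delivered, while a correctly large sensitivity flattens the distribution toward uniform.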

---

Title: Language Guidance for Supervised Vision Training: An Empirical Study of Generalization

Abstract: Deep neural networks have achieved remarkable success on vision benchmarks, yet they continue to struggle with many generalization challenges. Supervised vision training relies on one-hot labels, which provide limited information about semantic structure and shared attributes between classes. This limited supervision can leave visual representations vulnerable to distribution shifts, spurious correlations, texture bias, adversarial perturbations, and forgetting in sequential learning settings. We study whether pretrained language models can serve as lightweight auxiliary supervision for vision training without requiring paired image-text data, prompt engineering, or contrastive objectives. Specifically, we evaluate two forms of language guidance, Explicit Language Guidance (ExLG) and Implicit Language Guidance (ImLG). We conduct a comprehensive evaluation across six generalization regimes, including in-distribution, out-of-distribution generalization, shortcut and spurious correlation resistance, texture and shape bias, adversarial robustness, and continual learning. Our analyses show that the two mechanisms have complementary strengths, with explicit guidance consistently benefiting in-distribution, low-data performance, and continual learning retention, while implicit guidance is often more useful in shortcut-sensitive settings and under stronger adversarial perturbations. Importantly, both are lightweight and add minimal parameters and training overhead. The analyses characterize when language-derived structure helps supervised vision training and provide a practical roadmap for using off-the-shelf pretrained models from another modality as auxiliary supervision.

URL: https://openreview.net/forum?id=qjb79dmZjf

---

Title: Operator-Based Generalization Bounds for Multitask Deep Learning

Abstract: We study generalization bounds for compositions of functions through an operator-theoretic (Koopman) framework. Existing analyses in this direction are primarily restricted to scalar-valued settings and to Sobolev-type reproducing kernel Hilbert spaces (RKHSs), where the resulting bounds depend on smoothness parameters. We extend this framework to vector-valued RKHSs, enabling the analysis of multi-output function classes and making explicit how task-coupling kernels enter the resulting Rademacher complexity bounds. Within this setting, we derive bounds that depend on operator norms, singular values, and determinant-based geometric quantities associated with the underlying linear maps. We further introduce a vector-valued Brownian RKHS formulation, which replaces Sobolev smoothness assumptions by a first-order Cameron--Martin-type structure. In this regime, the resulting bounds no longer depend on Sobolev smoothness exponents and instead exhibit a milder spectral dependence involving only operator norms and determinant factors. This highlights a qualitative difference between Sobolev- and Brownian-based analyses at the level of function spaces. We additionally study a shared operator-learning formulation for multitask transfer in vector-valued RKHSs, deriving an exact representer theorem, a finite-dimensional reduction of the corresponding operator-learning problem, and transfer bounds for the induced operator class. We illustrate these effects empirically on synthetic data and MNIST, comparing the behavior of Sobolev and Brownian bounds during training.

URL: https://openreview.net/forum?id=ZDKQuQMCy9

---

Title: Faithful Image Editing via Degraded Representations

Abstract: Rectified flow and diffusion-based models currently represent the state-of-the-art in image editing, leveraging powerful pre-trained generative priors to produce visually compelling modifications. Despite their impressive capabilities, maintaining faithfulness to the source image -- preserving structure and photometric characteristics while satisfying a target prompt -- remains a persistent challenge in this domain. Direct traversal between source and target distributions in rectified flow frameworks offers a promising direction for improving fidelity. However, identifying trajectories that are both semantically effective and strictly structure-preserving remains an open problem. In this work, we propose an optimization- and inversion-free image editing framework that is, in principle, agnostic to the underlying generative backbone. Our central insight is to operate within a carefully designed degraded representation space that constrains editing trajectories and suppresses unintended collateral modifications to the target. We first establish the existence of such degraded representations for generative-prior-based editing and then develop a principled method to project editing trajectories onto this space. The resulting method, Editing via Degraded Representations (EDR), systematically eliminates unfaithful trajectory deviations while preserving the flexibility required to satisfy the target text prompt. Extensive quantitative and qualitative evaluations demonstrate that EDR achieves precise, high-quality edits with superior fidelity, establishing a new state-of-the-art in faithful image editing. Code will be released upon acceptance.

URL: https://openreview.net/forum?id=U2fY7u10QY

---

Title: TAH-Quant: Effective Activation Quantization in Pipeline Parallelism over Slow Network

Abstract: Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants, but is often bottlenecked by network communication, particularly under pipeline parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited.
To address these issues, we propose TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework for pipeline parallelism. TAH-Quant integrates fine-grained tile-wise quantization, entropy-guided tile-wise adaptive bit allocation for bit usage, and a Hadamard-based transformation with pivot swapping to effectively suppress outliers. We prove that pipeline parallel training equipped with TAH-Quant maintains a convergence rate of $\mathcal{O}(1/\sqrt{T})$, matching that of vanilla stochastic gradient descent. Extensive experiments demonstrate that TAH-Quant achieves an aggressive activation quantization ratio of 3--4 bits, providing up to $4.3\times$ throughput speedup over uncompressed FP32 and up to $1.33\times$ wall-clock speedup over AQ-SGD, while preserving training convergence, avoiding AQ-SGD's activation-cache overhead, and generalizing well across various training scenarios.

URL: https://openreview.net/forum?id=6ysPGq2RVD

---

Title: Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

Abstract: Mixture policies theoretically offer greater flexibility than unimodal policies in continuous action reinforcement learning, but the practical benefits of this complexity remain elusive. Mixture policies are notably absent from most state-of-the-art algorithms, raising a fundamental question: Is the added representational overhead useful? We show that increased flexibility can theoretically enhance solution quality and entropy robustness. Yet standard algorithms like SAC do not leverage these advantages. A core issue is the lack of a low-variance reparameterization trick for mixtures, a luxury Gaussian policies enjoy. We propose a marginalized reparameterization (MRP) estimator to address this, proving it offers lower variance than the standard likelihood-ratio (LR) approach. Our experiments across Gym MuJoCo, DeepMind Control Suite, and MetaWorld show that MRP mixture policies significantly outperform their LR counterparts and match, and sometimes exceed, their Gaussian counterparts. In addition, we find several cases where MRP mixture policies exhibit clear empirical advantages. In this paper, we provide a clearer understanding of the trade-offs involved, elevating MRP mixture policies from theoretical curiosity to a practical tool.

URL: https://openreview.net/forum?id=sFXsQRd8v9

---

Title: The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

Abstract: In training a neural network with gradient descent (GD), each iteration induces a linear operator governing the first-order updates to a model's internal state variables. We define this operator as the global empirical neural tangent kernel, $\text{NTK}_S$. In finite-width networks, $\text{NTK}_S$ is typically intractable to form, leading prior work to focus on more restrictive settings, such as tracking outputs only or taking infinite-width limits. Here, we bridge this gap by studying the structure of $\text{NTK}_S$ for a broad class of models. Formulating the model state as the solution to a single global implicit constraint, we derive $\text{NTK}_S$ as a product of two operators: $\mathcal{K}$, accounting for immediate parameter-to-state interactions, and $\mathcal{P}$, describing internal state-to-state dependencies. For a broad class of weight-based models, including RNNs and transformers, we prove a universal Kronecker-core theorem showing that $\mathcal{K}$ admits an exact forward-pass-computable form given by the Gram matrix of weight-site variables. This core structure reveals that $\text{NTK}_S$ is structurally bottlenecked, constraining its effective rank and giving rise to a \textit{self-referential bias}, whereby GD preferentially learns within dominant modes of joint hidden-input activity. For recurrent models, including GRUs and RNNs, we examine the spectrum of $\text{NTK}_S$ and show when it is biased and low-rank in space or time under the proposed decomposition. We further demonstrate that the structure of the model dynamics at initialization biases $\text{NTK}_S$, restricting learning and preventing certain task components from being learned effectively. Finally, to demonstrate broader applicability, we show that the $\text{NTK}_S$ associated with a self-attention transformer is likewise structurally constrained to be low-rank. 
Overall, we show that $\text{NTK}_S$ possesses tractable structure that explains GD bias toward particular task solutions and the typical emergence of low-rank representations. To further enable the use of $\text{NTK}_S$ as a practical metric, we build a library, \texttt{kpflow}, based on randomized matrix-free numerical linear algebra.

URL: https://openreview.net/forum?id=tPjNd8S8eb

---

Title: Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Abstract: Interpreting language models remains challenging due to the residual stream, which linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce CLVQ-VAE, a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer VQ-VAE, and SAE baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.

URL: https://openreview.net/forum?id=XbBvGKSoEG

---

Title: Few-Shot Closed-Loop Neural System Identification via Meta-Learning

Abstract: We study few-shot closed-loop neural system identification in a meta-learning setting, where related source systems are used to learn a neural-network-based open-loop dynamics model for a new target system from limited feedback-controlled data.
In closed loop, inputs are generated through output feedback; consequently, the observed trajectories are shaped by both the plant dynamics and the controller.
Under feedback-dependent data generation and scarce target data, existing system identification methods are insufficient for recovering the open-loop dynamics.
Based on meta-learning and neural closed-loop identification, we propose Meta-ICI, which learns an initialization for an intermediate operator to recover open-loop dynamics from limited target closed-loop data.
We further extend Meta-ICI to fragmented target adaptation, where only scattered one-step transitions are available instead of continuous trajectories.
This extension yields Fast Meta-ICI for fully observable systems, using fragmented transitions to support accurate long-horizon rollouts.
To instantiate Fast Meta-ICI, we design a Schur-Koopman model that enforces the latent spectral-radius constraint during unconstrained optimization.
Experiments on partially observable Li\'enard systems and fully observable nonlinear pendulum systems show that Meta-ICI improves few-shot adaptation and Fast Meta-ICI enables non-divergent long-horizon rollouts from fragmented target data.

URL: https://openreview.net/forum?id=oUWRVhTrWQ

---

Title: CLAPS: Aleatoric-Epistemic Scaling via Last-Layer Laplace for Conformal Regression

Abstract: Conformal regression provides finite-sample marginal coverage, but it does not by itself determine how interval width should adapt across heterogeneous inputs. Existing locally adaptive methods mainly account for aleatoric noise, leaving uncertainty from weak training support less explicit. We propose \emph{Conformal Laplace-Aware Predictive Scaling} (CLAPS), a split conformal regression method that uses heteroscedastic last-layer Laplace uncertainty as the local normalization scale. CLAPS combines learned input-dependent noise with last-layer epistemic uncertainty, while retaining validity through standard conformal calibration. We characterize this aleatoric--epistemic scale, derive its heteroscedastic last-layer precision, and show that it reduces to aleatoric local scaling as epistemic uncertainty contracts. Experiments show nominal-level coverage with competitive interval efficiency.

URL: https://openreview.net/forum?id=RvoPfNBjwp
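
The general recipe that locally scaled split conformal regression follows can be sketched in a few lines. The version below is the standard normalized-score construction, with the scale taken as the square root of an aleatoric variance plus an epistemic variance. How CLAPS actually obtains those two variances from a last-layer Laplace approximation is not shown, and all function names are assumptions.

```python
import math

def conformal_scale(aleatoric_var, epistemic_var):
    """Local normalization scale combining both uncertainty sources (assumed form)."""
    return math.sqrt(aleatoric_var + epistemic_var)

def calibrate_quantile(residuals, scales, alpha):
    """Split-conformal quantile of normalized scores |y - y_hat| / scale."""
    scores = sorted(abs(r) / s for r, s in zip(residuals, scales))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]

def predict_interval(mean, scale, q):
    """Interval whose half-width adapts to the local scale at the test input."""
    return (mean - q * scale, mean + q * scale)

if __name__ == "__main__":
    # Hypothetical calibration residuals with unit scales.
    residuals = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9]
    q = calibrate_quantile(residuals, [1.0] * len(residuals), alpha=0.5)
    print(predict_interval(2.0, conformal_scale(0.09, 0.16), q))
```

Because validity comes from the calibration quantile rather than from the scale itself, a poor scale estimate costs interval efficiency but not marginal coverage. When the epistemic term shrinks to zero, the scale reduces to the aleatoric standard deviation alone.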

---

Title: Searching for actual causes: Approximate algorithms with adjustable precision

Abstract: Causality has gained increasing attention in recent years, notably for improving the interpretability of machine learning models. Yet the field of explainable artificial intelligence (XAI) has been criticized for emphasizing general tendencies rather than the situation-specific facts, which users typically expect as explanations. These expectations align with the notion of actual causes, which identify what made the observed outcome happen, in the specific context at hand. Halpern and Pearl provided a formal basis for actual causation, but identifying actual causes is NP-complete. Practical identification algorithms are extremely scarce, restricted to narrow classes of models, and typically identify only the shortest cause. We address this gap between the formal theory and its applicability through two main contributions. First, we introduce a baseline approximate polynomial-time algorithm with adjustable precision, together with two complementary algorithms that improve its efficiency. Second, we provide a theoretical result showing that the actual-cause identification problem can be decomposed into smaller sub-instances that preserve the set of solutions. This result directly motivates one of the complementary algorithms. Our experiments demonstrate that the baseline method can approximate the set of actual causes, notably for non-Boolean and stochastic models, and that the complementary algorithms further improve its performance.

URL: https://openreview.net/forum?id=3NlYZPCH9v

---

Title: Spectral Condition for $\mu$P under Width–Depth Scaling

Abstract: Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width–depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width–depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k=1$ to $k\geq 2$, unifying previously disparate $\mu$P formulations and identifying the $k\geq 2$ case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2 style language models show that the $\mu$P formulation derived from the $k\geq 2$ case achieves stable feature learning and robust HP transfer under width–depth scaling, whereas standard parameterization and $\mu$P in the $k=1$ case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.

URL: https://openreview.net/forum?id=5JIhRkWAxI

---

Title: Training Non-Differentiable Networks via Optimal Transport

Abstract: Neural networks increasingly embed non-differentiable components (spiking neurons, quantized layers, discrete routing, blackbox simulators, etc.) where backpropagation is inapplicable and surrogate gradients introduce bias. We present PolyStep, a gradient-free optimizer that updates parameters using only forward passes. Each step evaluates the loss at structured polytope vertices in a compressed subspace, computes softmax-weighted assignments over the resulting cost matrix, and displaces particles toward low-cost vertices via barycentric projection. This update corresponds to the one-sided limit of a regularized optimal-transport problem, inheriting its geometric structure without Sinkhorn iterations.

PolyStep trains genuinely non-differentiable models where existing gradient-free methods collapse to near-random accuracy. On hard-LIF spiking networks we reach 93.4\% test accuracy, outperforming all gradient-free baselines by over 60~pp and closing to within 4.4~pp of a surrogate-gradient Adam ceiling. Across four additional non-differentiable architectures (int8 quantization, argmax attention, staircase activations, hard MoE routing) we lead every gradient-free competitor. On MAX-SAT scaling from 100 to 1M variables, we sustain above 92\% clause satisfaction while evolution strategies drop 8--12~pp. On RL policy search, we match OpenAI-ES on classical control and retain performance under integer and binary quantization that collapses gradient-based methods. We prove convergence to conservative-stationary points at rate $O(\log T/\sqrt{T})$ on piecewise-smooth losses, upgraded to Clarke-stationary on the headline architectures and extended to the piecewise-constant regime via a hitting-time bound. These rates match the known zeroth-order query-complexity lower bounds that all forward-only methods inherit.

URL: https://openreview.net/forum?id=8mlcqTTMuU
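
The update pattern described above (evaluate the loss at polytope vertices, form softmax weights over the costs, move toward the weighted barycenter) can be illustrated with a heavily simplified sketch. This is not the PolyStep algorithm: it uses a single point instead of particles, a cross-polytope in the full parameter space instead of a compressed subspace, and a smooth toy loss. The function name, radius, and temperature are assumptions.

```python
import math

def polytope_step(x, loss_fn, radius, temperature):
    """Move x to the softmax-weighted barycenter of cross-polytope vertices around it."""
    # Vertices x +/- radius * e_i; only forward evaluations of the loss are used.
    vertices = []
    for i in range(len(x)):
        for sign in (1.0, -1.0):
            v = list(x)
            v[i] += sign * radius
            vertices.append(v)

    costs = [loss_fn(v) for v in vertices]
    c_min = min(costs)  # shift costs for numerical stability
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)

    # Barycentric projection: low-cost vertices pull the point hardest.
    return [
        sum(w * v[i] for w, v in zip(weights, vertices)) / total
        for i in range(len(x))
    ]

def quadratic(v):
    return sum(vi * vi for vi in v)

if __name__ == "__main__":
    x = [1.0, 1.0]
    for _ in range(20):
        x = polytope_step(x, quadratic, radius=0.5, temperature=0.5)
    print(x, quadratic(x))
```

No gradient of `loss_fn` is ever requested, so the same loop runs unchanged when the loss contains non-differentiable components. The cost is two loss evaluations per dimension per step in this naive form.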

---

Title: Plasticity by Precision: Exemplar-free Analytic Adaptation for Class-Incremental Learning

Abstract: Class-Incremental Learning (CIL) aims to enable models to acquire new knowledge sequentially while preserving previously learned information, emulating human-like learning capabilities. Current methods, including pre-trained foundation models and Experience Replay (ER) methods, serve as strong baselines for sequential task learning. However, these methods remain prone to catastrophic forgetting, especially in online settings with non-stationary data and blurry task boundaries. Additionally, the requirement to store historical samples in ER-based methods introduces significant memory overhead and privacy risks, limiting the practical adoption of CIL models in real-world applications. To address this, we propose an \textbf{E}xemplar-\textbf{f}ree \textbf{A}nalytic \textbf{A}daptation for \textbf{C}lass-\textbf{I}ncremental \textbf{L}earning (AACL) framework that updates the classifier in a principled probabilistic manner. Our key contribution is a closed-form Bayesian update that unifies three critical components: (1) the prior precision encapsulating knowledge from previous tasks, (2) a Fisher Information-inspired weight penalty to protect learned knowledge, and (3) the feature correlation matrix representing evidence from new data. Our framework balances plasticity and stability by integrating prior knowledge with streaming data, preserving learned representations while adapting to new tasks. We conduct comprehensive evaluations on benchmark datasets under the SI-Blurry setting, achieving $\mathcal A_{\text{AUC}}$ improvements of 8\%, 3\%, and 4\% on CIFAR-100, ImageNet-R, and Tiny-ImageNet, respectively.

URL: https://openreview.net/forum?id=qZ8ppalHKT

---

Title: A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Abstract: As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they encode concepts has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have widely reported that neural networks represent meaningful concepts as linear directions in their representation spaces and often encode diverse concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, are utilized to address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods are the backbone of modern mechanistic interpretability, yet in practice they consistently produce polysemantic features, feature absorption, and dead neurons, with very limited theoretical understanding of why these phenomena occur. Existing theoretical work is limited to tied-weight sparse autoencoders, leaving the broader family of SDL methods without formal grounding. We develop the first unified theoretical framework that casts all major SDL variants as a single piecewise biconvex optimization problem, and characterize its global solution set, non-identifiability, and spurious optima. This analysis yields principled explanations for feature absorption and dead neurons. To expose these pathologies under full ground-truth access, we introduce the Linear Representation Bench. Guided by our theory, we propose feature anchoring, a novel technique that restores SDL identifiability, substantially improving feature recovery across synthetic benchmarks and real neural representations, with direct implications for real-world applications on feature interpretation.

URL: https://openreview.net/forum?id=Pb1Ix9Hrxp

---

Title: Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Abstract: Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.

URL: https://openreview.net/forum?id=S2zGSvsZV2

---

Title: CausalPlan: Empowering Efficient LLM Multi-Agent Coordination Through Causality-Driven Planning

Abstract: Large language model (LLM) agents often generate causally invalid plans in multi-agent coordination tasks due to reliance on spurious statistical correlations rather than grounded causal reasoning, leading to poor task performance. We propose CausalPlan, a framework that enforces causal consistency in LLM planning by embedding learned structural knowledge directly into the decoding process. CausalPlan first extracts a Structural Causal Action (SCA) model, which learns a policy-level causal graph from agent trajectories to capture how prior actions and current environment states influence future decisions. The learned SCA guides planning by reweighting candidate action tokens during generation and providing grounded alternatives when causal violations are detected. By embedding causal knowledge, CausalPlan constrains planning to causal-consistent behaviors under the learned causal model without requiring fine-tuning. We evaluated CausalPlan on the Overcooked-AI benchmark across five multi-agent tasks and four LLMs: Gemma-7B, Llama-8B, Qwen-14B and Llama-70B. Experimental results show that CausalPlan consistently reduces causally invalid actions and improves task completion in both AI-AI and human-AI collaboration settings, outperforming strong LLMs and reinforcement learning baselines. Our findings demonstrate that causality-driven planning is essential for deploying efficient, interpretable, and robust multi-agent systems.

URL: https://openreview.net/forum?id=Zk87VXBW8A

---

Title: PViT: Prior-Augmented Vision Transformer for Out-of-Distribution Detection

Abstract: Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their robustness against data distribution shifts and inherent inductive biases remain underexplored. To enhance the robustness of ViT models for image Out-of-Distribution (OOD) detection, we introduce a novel and generic framework named Prior-augmented Vision Transformer (PViT). We train PViT to predict class labels while taking as input both image tokens and the prior class logits from a pre-trained model. During inference, PViT identifies OOD samples by quantifying the divergence between the predicted class logits and the prior logits obtained from pre-trained models. Unlike existing state-of-the-art (SOTA) OOD detection methods, PViT shapes the decision boundary between ID and OOD by utilizing the proposed prior-guided confidence, without requiring additional data modeling, generation methods, or structural modifications. Extensive experiments on the large-scale ImageNet benchmark, evaluated against over seven OOD datasets, demonstrate that PViT significantly outperforms existing SOTA OOD detection methods in terms of FPR95 and AUROC.
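The abstract above does not say which divergence PViT uses, so the following is only a minimal sketch of the general idea of scoring a sample by how far the predicted logits drift from the prior logits. It assumes a KL divergence between the two softmax distributions; the function names `softmax`, `kl_divergence`, and `divergence_score` are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def divergence_score(predicted_logits, prior_logits):
    """Assumed OOD score: larger means the prediction disagrees more with the prior.

    KL is an illustrative choice here, not the paper's prior-guided confidence.
    """
    return kl_divergence(softmax(prior_logits), softmax(predicted_logits))
```

Under this assumption, a sample whose predicted logits match the prior logits scores 0, and the score grows as the two distributions diverge. A threshold on the score would then separate ID from OOD inputs.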

URL: https://openreview.net/forum?id=PnLtZsLkIp

---

Title: ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling

Abstract: Formulating optimization problems demands substantial manual effort and specialized domain expertise. While Large Language Models (LLMs) have shown promise for automating this process, evaluating the correctness of their outputs remains challenging due to the lack of reliable evaluation metrics. Existing solver-based evaluation methods lack rigorous correctness guarantees, become uninformative when models are infeasible, and incur prohibitive computational costs on hard instances. To address these limitations, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs’ capabilities in formulating linear and mixed-integer linear programs (MILPs). ORGEval represents optimization instances as bipartite graphs, thereby reducing equivalence detection to graph isomorphism (GI) testing. The Weisfeiler-Lehman (WL) test is a classical heuristic for GI, but it is known to yield false positives on certain graph structures. We identify a sufficient condition, called symmetric decomposability (SD), under which the WL test is guaranteed to correctly determine isomorphism. Building on this result, ORGEval combines the WL-test for bipartite graphs with an efficient SD verification procedure to provide provably correct equivalence evaluation. We further introduce Bench4Opt, a benchmark dataset that separates models from data, to validate ORGEval and benchmark state-of-the-art LLMs on optimization modeling. Experimental results demonstrate that ORGEval reliably detects equivalence while significantly outperforming solver-based methods in runtime, particularly on computationally challenging instances. Our benchmark reveals that optimization modeling remains a challenging task for all tested LLMs, with the best-performing models achieving only 54.82\% accuracy.
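To make the reduction to graph isomorphism concrete, here is a minimal sketch of classical WL color refinement on a bipartite encoding of a linear program. The encoding is an assumption chosen for illustration: variable nodes carry objective coefficients, constraint nodes carry right-hand sides, and edges carry constraint coefficients. The function name `wl_test` is hypothetical. The sketch does not implement the symmetric decomposability check, so a `True` result means only that WL could not tell the two instances apart.

```python
from collections import Counter


def _neighbors(n_vars, n_cons, edges):
    """Build weighted adjacency lists from {(constraint, variable): coefficient}."""
    var_nb = [[] for _ in range(n_vars)]
    con_nb = [[] for _ in range(n_cons)]
    for (i, j), w in edges.items():
        con_nb[i].append((w, j))
        var_nb[j].append((w, i))
    return var_nb, con_nb


def wl_test(g1, g2, rounds=3):
    """Return True if WL color refinement cannot distinguish g1 from g2.

    Each graph is (objective_coeffs, rhs_values, edges). A shared palette
    keeps the integer colors comparable across the two graphs.
    """
    palette = {}

    def color(signature):
        return palette.setdefault(signature, len(palette))

    states = []
    for obj, rhs, edges in (g1, g2):
        var_nb, con_nb = _neighbors(len(obj), len(rhs), edges)
        vcol = [color(("var", c)) for c in obj]
        ccol = [color(("con", b)) for b in rhs]
        states.append([vcol, ccol, var_nb, con_nb])

    for _ in range(rounds):
        for state in states:
            vcol, ccol, var_nb, con_nb = state
            # New color = old color plus the sorted multiset of (weight, neighbor color).
            new_v = [color((vcol[j], tuple(sorted((w, ccol[i]) for w, i in var_nb[j]))))
                     for j in range(len(vcol))]
            new_c = [color((ccol[i], tuple(sorted((w, vcol[j]) for w, j in con_nb[i]))))
                     for i in range(len(ccol))]
            state[0], state[1] = new_v, new_c

    hist1 = Counter(states[0][0] + states[0][1])
    hist2 = Counter(states[1][0] + states[1][1])
    return hist1 == hist2


# min x0 + 2*x1  s.t.  x0 + x1 <= 4,  x0 - x1 <= 1
lp_a = ([1, 2], [4, 1], {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): -1})
# Same program with variables and constraints listed in swapped order.
lp_b = ([2, 1], [1, 4], {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): 1})
# Different right-hand side on the second constraint.
lp_c = ([1, 2], [4, 2], {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): -1})
```

Here `wl_test(lp_a, lp_b)` returns `True` and `wl_test(lp_a, lp_c)` returns `False`. Matching color histograms are necessary for isomorphism but not sufficient in general, which is the gap the abstract's SD condition is meant to close.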

URL: https://openreview.net/forum?id=HoTay0dEhg

---

Title: XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs

Abstract: The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential for this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios as MLLMs can simply memorize these events, or are artificially synthesized, failing to reflect real-world misinformation patterns. Additionally, comprehensive analyses of MLLM-based model design strategies are lacking. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further enable a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection.

URL: https://openreview.net/forum?id=8mQvwFVt1B

---

Title: Test Time Augmentations are Worth One Million Images for Out-of-Distribution Detection

Abstract: Out-of-distribution (OOD) detection is commonly improved either by storing large in-distribution (InD) reference sets (e.g., nearest-neighbor methods) or by exposing the model to auxiliary OOD data during training. Both requirements limit deployability at scale. This paper shows that carefully chosen test-time augmentations (TTA) can provide a strong, self-referential signal for OOD detection from a single test input, without any stored InD data and without OOD exposure. We first identify a practical taxonomy that separates mild, feature-preserving InD augmentations (IDAs) from aggressive OOD augmentations (OODAs), and empirically demonstrate that IDAs consistently improve detection while OODAs often degrade it. Building on this insight, we propose a simple plug-and-play detector based on sequential masking: for each test image, we generate a small set of masked views and use the k-th largest embedding similarity to the original image as an “ID-ness” score. With only 25 TTAs per input, our method surpasses competitive baselines on ImageNet that rely on the full 1.2M-image training set as a reference.
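The scoring rule described above (the k-th largest embedding similarity between the original image and its masked views) can be sketched as follows. This is a minimal illustration that assumes cosine similarity and takes precomputed embeddings as plain lists; the masking procedure and the encoder are not shown, and the names `cosine` and `kth_largest_similarity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kth_largest_similarity(original, views, k):
    """Return the k-th largest similarity between the original embedding and its views.

    A higher value is read as more in-distribution. Cosine similarity is an
    assumption; the abstract says only "embedding similarity".
    """
    if not 1 <= k <= len(views):
        raise ValueError("k must be between 1 and the number of views")
    sims = sorted((cosine(original, view) for view in views), reverse=True)
    return sims[k - 1]
```

For example, with `original = [1.0, 0.0]` and views `[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]`, the similarities are 1.0, 0.0, and about 0.707, so `kth_largest_similarity(original, views, 2)` returns about 0.707. Taking the k-th largest value instead of the maximum makes the score depend on several views agreeing with the original, not just one.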

URL: https://openreview.net/forum?id=1sESATPkdH

---

Title: Supervised Contrastive Block Disentanglement

Abstract: Real-world datasets often combine data collected under different experimental conditions. This yields larger datasets, but also introduces spurious correlations that make it difficult to model the phenomena of interest. We address this by learning two embeddings to independently represent the phenomena of interest and the spurious correlations. The embedding representing the phenomena of interest is correlated with the target variable $y$, and is invariant to the environment variable $e$. In contrast, the embedding representing the spurious correlations is correlated with $e$. The invariance to $e$ is difficult to achieve on real-world datasets. Our primary contribution is an algorithm called Supervised Contrastive Block Disentanglement (SCBD) that effectively enforces this invariance. It is based purely on Supervised Contrastive Learning, and applies to real-world data better than existing approaches. We empirically validate SCBD on two challenging problems. The first problem is domain generalization, where we achieve strong performance on a synthetic dataset, as well as on Camelyon17-WILDS. We introduce a single hyperparameter $\alpha$ to control the degree of invariance to $e$. When we increase $\alpha$ to strengthen the degree of invariance, out-of-distribution performance improves at the expense of in-distribution performance. The second problem is batch correction, in which we apply SCBD to preserve biological signal and remove inter-well batch effects when modeling single-cell perturbations from 26 million Optical Pooled Screening images.

URL: https://openreview.net/forum?id=CROGDBt1my

---

Title: A Cross-Model Study of Over-Compliance in Large Language Models

Abstract: Large language models increasingly mediate decisions in healthcare, legal advisory, and financial analysis, settings in which a model’s willingness to answer an inadequate prompt can matter as much as the accuracy of its answer. Yet systematic cross-model evidence on this behavior remains scarce. The present study examined over-compliance, understood as the generation of substantive content when the input warrants clarification, refusal, or deferral. Four frontier models from OpenAI, Google, Meta, and Anthropic were evaluated on a benchmark of 400 prompts spanning underspecification, ambiguity, contradiction, and nonsense, under two system-prompt conditions. Each of the 3,200 resulting responses was scored by a deterministic rule-based classifier that mapped outputs to a nine-category taxonomy and computed both an Over-Compliance Rate and a Terminal Refusal Rate. Over-compliance proved pervasive and model-specific. Rates ranged from 58.0 to 98.8 percent across the four models, and only GPT-4.1-mini showed a reduction under the clarification instruction. Claude Haiku 4.5 exhibited a refusal cascade on ambiguous prompts that no other model produced, visible only because the taxonomy distinguished terminal from clarifying refusals. The findings indicated that prompt-level mitigation was unreliable and that response-policy evaluation should proceed alongside capability evaluation.

URL: https://openreview.net/forum?id=LnUP74YNze

---

Title: Heat and Matérn Kernels on Matchings

Abstract: Applying kernel methods to matchings is challenging due to their discrete, non-Euclidean nature. In this paper, we develop a principled framework for constructing geometric kernels that respect the natural geometry of the space of matchings. To this end, we first provide a complete characterization of stationary kernels, i.e. kernels that respect the inherent symmetries of this space. Because the class of stationary kernels is too broad, we specifically focus on the heat and Matérn kernel families, adding an appropriate inductive bias of smoothness to stationarity. While these families successfully extend widely popular Euclidean kernels to matchings, evaluating them naively incurs a prohibitive super-exponential computational cost. To overcome this difficulty, we introduce and analyze a novel, sub-exponential algorithm leveraging zonal polynomials for efficient kernel evaluation. Finally, motivated by the known bijective correspondence between matchings and phylogenetic trees—a crucial data modality in biology—we explore whether our framework can be seamlessly transferred to the space of trees, establishing novel negative results and identifying a significant open problem.

URL: https://openreview.net/forum?id=qvWBks4du6

---
