J2C Certification: mSOP-765k: A Benchmark For Multi-Modal Structured Output Predictions
Bianca Lamm, Janis Keuper
https://openreview.net/forum?id=H7eYL4yFZS
---
Survey Certification: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru WANG, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng YAN, Philip Torr, LEI BAI
https://openreview.net/forum?id=RY19y2RI1O
---
Survey Certification: Large Language Model Reasoning Failures
Peiyang Song, Pengrui Han, Noah Goodman
https://openreview.net/forum?id=vnX1WHMNmz
---
Featured Certification, J2C Certification: Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?
Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
https://openreview.net/forum?id=E7HDtLCoT6
---
Featured Certification, J2C Certification: One-Sided Matrix Completion from Ultra-Sparse Samples
Hongyang R. Zhang, Zhenshuo Zhang, Huy Nguyen, Guanghui Lan
https://openreview.net/forum?id=vYGi4Dj777
---
Featured Certification, J2C Certification: Training-Conditional Coverage Bounds under Covariate Shift
Mehrdad Pournaderi, Yu Xiang
https://openreview.net/forum?id=F6hHT3qWxT
---
Expert Certification: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models
Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, Jianye HAO
https://openreview.net/forum?id=EbELaNKmZK
---
Accepted papers
===============
Title: Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
Authors: Yohann PERRON, Vladyslav Sydorov, Christophe Pottier, Loic Landrieu
Abstract: Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high-resolution, small crops) and a global scale (low-resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g. ViT and Swin) and adds fewer than 2 % parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15 % relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.
URL: https://openreview.net/forum?id=tidYprMlsg
---
Title: Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs
Authors: Benjamin Weinstein-Raun, Jeremy Schlatter, Jeffrey Ladish
Abstract: In experiments spanning more than 100,000 trials across thirteen large language models, we show that several state-of-the-art models (including Grok 4, GPT-5, and Gemini 2.5 Pro), when presented with a simple task, sometimes actively subvert a shutdown mechanism in their environment in order to complete that task. Models differed substantially in their tendency to resist the shutdown mechanism, and their behavior was sensitive to variations in the prompt, including the strength and clarity of the instruction to allow shutdown and whether the instruction appeared in the system prompt or the user prompt (surprisingly, models were consistently less likely to obey the instruction when it was placed in the system prompt). Even with an explicit instruction not to interfere with the shutdown mechanism, some models did so up to 97% (95% CI: 96-98%) of the time.
URL: https://openreview.net/forum?id=e4bTTqUnJH
---
Title: Bayesian Sensitivity of Causal Inference Estimators under Evidence-Based Priors
Authors: Nikita Dhawan, Daniel Shen, Leonardo Cotta, Chris J. Maddison
Abstract: Causal inference, especially in observational studies, relies on untestable assumptions about the true data-generating process. Sensitivity analysis helps us determine how robust our conclusions are when we alter these underlying assumptions. Existing frameworks for sensitivity analysis are concerned with worst-case changes in assumptions. In this work, we argue that using such pessimistic criteria can often become uninformative or lead to conclusions contradicting our prior knowledge about the world. To demonstrate this claim, we generalize the recent s-value framework (Gupta & Rothenhäusler, 2023) to estimate the sensitivity of three different common assumptions in causal inference. Empirically, we find that, indeed, worst-case conclusions about sensitivity can rely on unrealistic changes in the data-generating process. To overcome this, we extend the s-value framework with a new sensitivity analysis criterion: Bayesian Sensitivity Value (BSV), which computes the expected sensitivity of an estimate to assumption violations under priors constructed from real-world evidence. We use Monte Carlo approximations to estimate this quantity and illustrate its applicability in an observational study on the effect of diabetes treatments on weight loss.
URL: https://openreview.net/forum?id=0zqt85NUyK
---
Title: Mitigating Steady-State Bias in Off-Policy TD Learning via Distributional Correction
Authors: Emani Naga Sai Venkata Sowmya, Amit Kesari, Ajin George Joseph
Abstract: We explore the off-policy value prediction problem in the reinforcement learning setting, where one estimates the value function of the target policy using the sample trajectories obtained from a behaviour policy. Importance sampling is a standard tool for correcting action-level mismatch between behaviour and target policies. However, it only addresses single-step discrepancies. It cannot correct steady-state bias, which arises from long-horizon differences in how the behaviour policy visits states. In this paper, we propose an off-policy value-prediction algorithm under linear function approximation that explicitly corrects discrepancies in state visitation distributions. We provide rigorous theoretical guarantees for the resulting estimator. In particular, we prove asymptotic convergence under Markov noise and show that the corrected update matrix has favourable spectral properties that ensure stability. We also derive an error decomposition showing that the estimation error is bounded by a constant multiple of the best achievable approximation in the function class. This constant depends transparently on the quality of the distribution estimate and the choice of features. Empirical evaluation across multiple benchmark domains demonstrates that our method effectively mitigates steady-state bias and can be a robust alternative to existing methods in scenarios where distributional shift is critical.
URL: https://openreview.net/forum?id=QLZAHgiowr
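As context for the abstract above, the sketch below shows plain importance-weighted linear TD(0) on a toy MDP: the per-step ratio corrects only the action-level mismatch, which is exactly what leaves the steady-state bias unaddressed. This is a minimal baseline sketch, not the paper's corrected algorithm; the MDP, feature map, policies, and step size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 2, 3
gamma = 0.95

# Toy MDP and linear features (illustrative only).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.normal(size=(n_states, n_actions))                        # rewards
phi = rng.normal(size=(n_states, d))                              # feature map

behaviour = np.full((n_states, n_actions), 1.0 / n_actions)       # uniform behaviour policy
target = rng.dirichlet(np.ones(n_actions), size=n_states)         # target policy

w, alpha, s = np.zeros(d), 0.005, 0
for _ in range(100_000):
    a = rng.choice(n_actions, p=behaviour[s])
    s_next = rng.choice(n_states, p=P[s, a])
    rho = target[s, a] / behaviour[s, a]            # per-step importance ratio (action-level only)
    td_error = R[s, a] + gamma * phi[s_next] @ w - phi[s] @ w
    w += alpha * rho * td_error * phi[s]            # off-policy TD(0); state-visitation mismatch
    s = s_next                                      # (and possible instability) is left uncorrected

print("learned weights:", np.round(w, 3))
```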
---
Title: A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints
Authors: Youssef Tawfilis, Hossam Amer, Minar El-Aasser, Tallal Elshabrawy
Abstract: Federated Learning has gained increasing attention for its ability to enable multiple nodes to collaboratively train machine learning models without sharing their raw data. At the same time, Generative AI—particularly Generative Adversarial Networks (GANs)—has achieved remarkable success across a wide range of domains, such as healthcare, security, and image generation. However, training generative models typically requires large datasets and significant computational resources, which are often unavailable in real-world settings. Acquiring such resources can be costly and inefficient, especially when many underutilized devices—such as IoT and edge devices—with varying capabilities remain idle. Moreover, obtaining large datasets is challenging due to privacy concerns and copyright restrictions, as most devices are unwilling to share their data. To address these challenges, we propose a novel approach for decentralized GAN training that enables the utilization of distributed data and underutilized, low-capability devices while not sharing data in its raw form. Our approach is designed to tackle key challenges in decentralized environments, combining KLD-weighted Clustered Federated Learning to address the issues of data heterogeneity and multi-domain datasets, with Heterogeneous U-Shaped split learning to tackle the challenge of device heterogeneity under strict data sharing constraints—ensuring that no labels or raw data, whether real or synthetic, are ever shared between nodes. Experimental results show that our approach delivers consistent and significant improvements across key performance metrics: it achieves an average 10% boost in classification metrics (up to 60% in multi-domain non-IID settings), 1.1×–3× higher image generation scores for the MNIST family of datasets, and 2×–70× lower FID scores for higher-resolution datasets, at much lower latency compared to several benchmarks.
URL: https://openreview.net/forum?id=rpbL7pfPYH
---
Title: mSOP-765k: A Benchmark For Multi-Modal Structured Output Predictions
Authors: Bianca Lamm, Janis Keuper
Abstract: This paper introduces mSOP-765k, a large-scale benchmark for the evaluation of multi-modal Structured Output Prediction (mSOP) pipelines. Besides novel evaluation metrics, the benchmark provides combined training and test datasets with over 765,000 images taken from real-world product advertisements. Each of these images contains product visualizations, textual information such as product name or brand, and numerical data such as product weight, price, and discount. All images are annotated with the corresponding structured information in the form of dictionaries containing key-value pairs.
An initial baseline evaluation, including various LLMs and VLMs as well as multi-modal RAG approaches, shows that the proposed benchmark poses a challenging problem that cannot yet be fully solved by state-of-the-art mSOP methods. The benchmark and dataset are available under a Creative Commons license: https://www.msop-765k.org/
URL: https://openreview.net/forum?id=H7eYL4yFZS
---
Title: Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training
Authors: Brown Ebouky, Ajad Chhatkuli, A. Cristiano I. Malossi, Christoph Studer, Roy Assaf, Andrea Bartezzaghi
Abstract: Self-supervised learning (SSL) has emerged as a central paradigm for training foundation models by leveraging large-scale unlabeled datasets, often producing representations with strong generalization capabilities. These models are typically pre-trained on general-purpose datasets such as ImageNet and subsequently adapted to various downstream tasks through finetuning. While prior work has investigated parameter-efficient adaptation methods like adapters, LoRA, and prompt tuning, primarily targeting downstream finetuning, extending the SSL pre-training itself in a continual manner to new domains under limited data remains largely underexplored, especially for downstream dense prediction tasks like semantic segmentation. In this work, we address the challenge of adapting vision foundation models to low-data target domains through continual self-supervised pre-training, specifically targeting downstream semantic segmentation. We propose GLARE (Global Local and Regional Enforcement), a novel continual self-supervised pre-training task designed to enhance downstream semantic segmentation performance. GLARE introduces patch-level augmentations to encourage local consistency and incorporates a regional consistency constraint that leverages spatial semantics in the data. For efficient continual pre-training, we initialize Vision Transformers (ViTs) with weights from existing SSL models and update only lightweight adapter modules, specifically UniAdapter, while keeping the rest of the backbone frozen. Experiments across multiple semantic segmentation benchmarks on different domains demonstrate that GLARE consistently improves downstream performance with minimal computational and parameter overhead.
URL: https://openreview.net/forum?id=Ax9Y4W0g7s
---
Title: Uncertainty-Aware Surrogate-based Amortized Bayesian Inference for Computationally Expensive Models
Authors: Stefania Scheurer, Philipp Reiser, Tim Brünnette, Wolfgang Nowak, Anneli Guthke, Paul-Christian Bürkner
Abstract: Bayesian inference typically relies on a large number of model evaluations to estimate posterior distributions. Established methods like Markov Chain Monte Carlo (MCMC) and Amortized Bayesian Inference (ABI) can become computationally challenging. While ABI enables fast inference $\text{\emph{after}}$ training, generating sufficient training data still requires thousands of model simulations, which is infeasible for expensive models. Surrogate models offer a solution by providing $\text{\emph{approximate}}$ simulations at a lower computational cost, allowing the generation of large data sets for training. However, the introduced approximation errors and uncertainties can lead to overconfident posterior estimates. To address this, we propose Uncertainty-Aware Surrogate-based Amortized Bayesian Inference (UA-SABI) -- a framework that combines surrogate modeling and ABI while explicitly quantifying and propagating surrogate uncertainties through the inference pipeline. Our experiments show that this approach enables reliable, fast, and repeated Bayesian inference for computationally expensive models, even under tight time constraints.
URL: https://openreview.net/forum?id=aVSoQXbfy1
---
Title: Generalization Bound for a Shallow Transformer Trained Using Gradient Descent
Authors: Brian Mwigo, Anirban Dasgupta
Abstract: In this work, we establish a norm-based generalization bound for a shallow Transformer model trained via gradient descent under the bounded-drift (lazy training) regime, where model parameters remain close to their initialization throughout training. Our analysis proceeds in three stages: (a) we formally define a hypothesis class of Transformer models constrained to remain within a small neighborhood of their initialization; (b) we derive an upper bound on the Rademacher complexity of this class, quantifying its effective capacity; and (c) we establish an upper bound on the empirical loss achieved by gradient descent under suitable assumptions on model width, learning rate, and data structure. Combining these results, we obtain a high-probability bound on the true loss that decays sublinearly with the number of training samples $N$ and depends explicitly on model and data parameters. The resulting bound demonstrates that, in the lazy regime, wide and shallow Transformers generalize similarly to their linearized (NTK) counterparts. Empirical evaluations on both text and image datasets support the theoretical findings.
URL: https://openreview.net/forum?id=t3iUeMOT8Z
---
Title: Mechanism-Aware Prediction of Tissue-Specific Drug Activity via Multi-Modal Biological Graphs
Authors: Sally Turutov, Kira Radinsky
Abstract: Predicting how small molecules behave across human tissues is essential for targeted therapy development. While some existing models incorporate tissue identity, they treat it as a label—ignoring the underlying biological mechanisms that differentiate tissues. We present Expresso, a multi-modal architecture that predicts tissue-contextual molecular activity, as measured by the assay, by modeling how compounds interact with transcriptomic and pathway-level tissue context. Expresso constructs heterogeneous graphs from GTEx data, linking samples, genes, and pathways to reflect expression profiles and curated biological relationships. These graphs are encoded using a hierarchical GNN and fused with frozen molecular embeddings to produce context-aware predictions. A multi-task pretraining strategy—spanning gene recovery, tissue classification, and pathway-level contrastive learning—guides the model to learn mechanistically grounded representations. On nine tissues, Expresso improves mean AUC by up to 27.9 points over molecule-only baselines. Our results demonstrate that incorporating biological structure—as defined by the assay—yields more accurate and interpretable models for tissue-specific drug behavior in human cell-based in vitro assay systems.
URL: https://openreview.net/forum?id=UDW8m9iQeC
---
Title: AC$\oplus$DC search: behind the winning solution to the FlyWire graph-matching challenge
Authors: Daniel Lee, Arie Matsliah, Lawrence K. Saul
Abstract: This paper describes the Alternating Continuous and Discrete Combinatorial (AC$\oplus$DC) optimizations behind the winning solution to the FlyWire Ventral Nerve Cord Matching Challenge. The challenge was organized by the Princeton Neuroscience Institute and held over three months, ending on January 31, 2025. During this period, the challenge attracted teams of researchers with expertise in machine learning, high-performance computing, graph data mining, biological network analysis, and quadratic assignment problems. The goal of the challenge was to align the connectomes of a male and female fruit fly, and more specifically, to determine a one-to-one correspondence between the neurons in their ventral nerve cords. The connectomes were represented as sparse weighted graphs with thousands of nodes and millions of edges, and the challenge was to find the permutation that best maps the nodes and edges of one graph onto those of the other. The winning solution to the challenge alternated between two complementary approaches to graph matching---the first, a combinatorial optimization over the symmetric group of permutations, and the second, a continuous relaxation of this problem to the space of doubly stochastic matrices. For the latter, the doubly stochastic matrices were optimized by combining Frank-Wolfe methods with a fast preconditioner to solve the linear assignment problem at each iteration. We provide a complete implementation of these methods with a few hundred lines of code in MATLAB. Notably, this implementation obtains a winning score on the challenge in less than 10 minutes on a laptop computer.
URL: https://openreview.net/forum?id=8MjCOMyaDf
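The continuous half of the alternation described above can be illustrated with a generic Frank-Wolfe scheme for the doubly stochastic relaxation of graph matching, where each iteration reduces to a linear assignment problem. The sketch below (Python/scipy, random isomorphic graphs) shows only that standard relaxation, not the authors' preconditioned MATLAB implementation or step-size rule.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def frank_wolfe_match(A, B, n_iter=100):
    """Maximize trace(A P B^T P^T) over doubly stochastic P, then round to a permutation."""
    n = A.shape[0]
    P = np.full((n, n), 1.0 / n)                   # barycentre of the Birkhoff polytope
    for t in range(n_iter):
        grad = A @ P @ B.T + A.T @ P @ B
        rows, cols = linear_sum_assignment(-grad)  # linear assignment gives the best vertex direction
        Q = np.zeros_like(P)
        Q[rows, cols] = 1.0
        P += 2.0 / (t + 2) * (Q - P)               # standard Frank-Wolfe step size
    _, cols = linear_sum_assignment(-P)            # round to the nearest permutation
    return cols                                    # cols[k] = node of B matched to node k of A

# Illustrative test: B is a relabelled copy of A, so the correct matching is known.
rng = np.random.default_rng(0)
n = 60
A = rng.random((n, n)) * (rng.random((n, n)) < 0.1)
perm = rng.permutation(n)
B = A[np.ix_(perm, perm)]
est = frank_wolfe_match(A, B)
print("recovered fraction:", np.mean(est == np.argsort(perm)))
```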
---
Title: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Authors: Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru WANG, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng YAN, Philip Torr, LEI BAI
Abstract: The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM RL with the temporally extended Partially Observable Markov Decision Processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
URL: https://openreview.net/forum?id=RY19y2RI1O
---
Title: VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction
Authors: Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher
Abstract: In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose \textit{Vision In-Context Operator Networks} (VICON), which integrate vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON's adaptability to multi-physics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines DPOT and MPP, reducing the average last-step rollout error by 37.9\% compared to DPOT and 44.7\% compared to MPP, while requiring only 72.5\% and 34.8\% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in \textit{imperfect measurement systems} where sampling frequencies may differ or frames might be dropped—common challenges in real-world settings—without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41\% relative performance degradation compared to 71.37\%-74.49\% degradation in baseline methods, demonstrating its versatility for deployment in realistic applications. Our scripts for processing datasets and code are publicly available at https://github.com/Eydcao/VICON.
URL: https://openreview.net/forum?id=6V3YmHULQ3
---
Title: The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Authors: Jierun Chen, Tiezheng YU, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiao-Hui Li, Lu Hou, Lifeng Shang, Qun Liu
Abstract: Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs. Code, dataset, and fine-tuned models are available at https://github.com/JierunChen/SFT-RL-SynergyDilemma.
URL: https://openreview.net/forum?id=XPML8UGI04
---
Title: Learning object representations through amortized inference over probabilistic programs
Authors: Francisco Silva, Hélder P. Oliveira, Tania Pereira
Abstract: The recent developments of modern probabilistic programming languages have enabled the combination of pattern recognition engines implemented by neural networks to guide inference over explanatory factors written as symbols in probabilistic programs. We argue that learning to invert fixed generative programs, instead of learned ones, places stronger restrictions on the representations learned by feature extraction networks, which reduces the space of latent hypotheses and enhances training efficiency. To empirically demonstrate this, we investigate a neurosymbolic object-centric representation learning approach that combines a slot-based neural module optimized via inference compilation to invert a prior generative program of scene generation. By amortizing the search over posterior hypotheses, we demonstrate that approximate inference using data-driven sequential Monte Carlo methods achieves competitive results when compared to state-of-the-art fully neural baselines while requiring several times fewer training steps.
URL: https://openreview.net/forum?id=nUFSrlJaUr
---
Title: On the Importance of Pretraining Data Alignment for Atomic Property Prediction
Authors: Yasir M. Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Abstract: This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected task-aligned dataset can match or even surpass large-scale joint pretraining while using only 1/24th of the pretraining budget. We introduce the Chemical Similarity Index (CSI), a simple metric for molecular graphs inspired by the Fréchet Inception Distance in computer vision, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most aligned dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently achieve better performance on downstream tasks than those pretrained on massive, mixed datasets such as JMP. This holds even when the mixed dataset includes the upstream dataset most aligned with the downstream task. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data is poorly aligned with the target task. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
URL: https://openreview.net/forum?id=jfD9BsrDTb
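The abstract describes CSI only as FID-inspired, so the snippet below is a generic sketch of a Fréchet-style distance between two embedding sets (the usual FID formula over fitted Gaussians), not the paper's exact CSI; the random vectors stand in for molecular-graph features of an upstream and a downstream dataset.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(X, Y):
    """Frechet distance between Gaussians fitted to two embedding sets (rows = samples)."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x, cov_y = np.cov(X, rowvar=False), np.cov(Y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):            # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

rng = np.random.default_rng(0)
upstream = rng.normal(0.0, 1.0, size=(2000, 16))    # stand-in for pretraining-set embeddings
aligned = rng.normal(0.1, 1.0, size=(2000, 16))     # similar distribution -> small distance
misaligned = rng.normal(1.5, 2.0, size=(2000, 16))  # shifted distribution -> large distance
print(frechet_distance(upstream, aligned), frechet_distance(upstream, misaligned))
```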
---
Title: RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
Authors: Yuhao Du, Zhuo Li, Pengyu Cheng, Zhihong Chen, Yuejiao XIE, Xiang Wan, Anningzhe Gao
Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high implementation complexity and computational cost, specifically for online sampling-based methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Even with recent simplifications, such as Direct Preference Optimization (DPO), which designs an offline implicit reward learning objective relying on pre-collected preference datasets, the problems of over-fitting and training instability continue to hinder the alignment process from reaching the expected optimal performance. To address the existing challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called **V**ariational **A**lignment with **R**e-weighting (**VAR**). Specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into an offline reward-driven re-weighted supervised fine-tuning (SFT) form, which requires only a minor adjustment to the SFT loss to obtain a noticeable improvement in training stability and effectiveness. On comprehensive evaluation benchmarks, our objective empowers LLMs to outperform offline alignment methods, demonstrating superior performance in both helpfulness and harmlessness metrics (on average $7.16\%$ higher than DPO). Meanwhile, compared to online sampling methods, our method is comparable or even better while significantly reducing computational overhead and accelerating convergence (over $5\times$ faster than GRPO), suggesting that our approach is an efficient and effective solution for bridging the gap between efficiency and performance in LLM alignment.
URL: https://openreview.net/forum?id=jewB0UhFuj
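The exact VAR objective is not spelled out in the abstract, so the following is only a generic sketch of the broad idea of reward-weighted SFT: a per-sequence negative log-likelihood reweighted by a softmax over rewards (PyTorch). The tensor shapes, the temperature beta, and the batch-softmax weighting are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def reward_weighted_sft_loss(logits, labels, rewards, beta=1.0, ignore_index=-100):
    """Per-sequence NLL weighted by softmax(reward / beta); higher-reward sequences count more.

    logits: (batch, seq_len, vocab), labels: (batch, seq_len), rewards: (batch,)
    """
    nll = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none", ignore_index=ignore_index
    )                                                   # token-level NLL, shape (batch, seq_len)
    mask = (labels != ignore_index).float()
    seq_nll = (nll * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    weights = torch.softmax(rewards / beta, dim=0)      # reward-driven re-weighting over the batch
    return (weights * seq_nll).sum()

# Toy shapes only; in practice logits come from the policy LLM and rewards from a reward model.
batch, seq_len, vocab = 4, 8, 50
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
labels = torch.randint(0, vocab, (batch, seq_len))
rewards = torch.tensor([0.2, 1.5, -0.3, 0.8])
loss = reward_weighted_sft_loss(logits, labels, rewards)
loss.backward()
print(float(loss))
```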
---
Title: ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling
Authors: Qihao Liu, Ju He, Qihang Yu, Liang-Chieh Chen, Alan Yuille
Abstract: In recent years, video generation has seen significant advancements. However, challenges still persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D model knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized motion prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos, even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D motion knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.
URL: https://openreview.net/forum?id=mQ5frFQTFV
---
Title: Characterizing Evolution in Expectation-Maximization Estimates for Overspecified Mixed Linear Regression
Authors: Zhankun Luo, Abolfazl Hashemi
Abstract: Estimating data distributions using parametric families is crucial in many learning setups, serving both as a standalone problem and an intermediate objective for downstream tasks. Mixture models, in particular, have attracted significant attention due to their practical effectiveness and comprehensive theoretical foundations. A persisting challenge is model misspecification, which occurs when the model to be fitted has more mixture components than those in the data distribution. In this paper, we develop a theoretical understanding of the Expectation-Maximization (EM) algorithm's behavior in the context of targeted model misspecification for overspecified two-component Mixed Linear Regression (2MLR) with unknown $d$-dimensional regression parameters and mixing weights. In Theorem 5.1 at the population level, with an unbalanced initial guess for mixing weights, we establish linear convergence of regression parameters in $\mathcal{O}(\log (1/\epsilon))$ steps. Conversely, with a balanced initial guess for mixing weights, we observe sublinear convergence in $\mathcal{O}(\epsilon^{-2})$ steps to achieve the $\epsilon$-accuracy at Euclidean distance. In Theorem 6.1 at the finite-sample level, for mixtures with sufficiently unbalanced fixed mixing weights, we demonstrate a statistical accuracy of $\mathcal{O}((d/n)^{1/2})$, whereas for those with sufficiently balanced fixed mixing weights, the accuracy is $\mathcal{O}((d/n)^{1/4})$ given $n$ data samples. Furthermore, we underscore the connection between our population level and finite-sample level results: by setting the desired final accuracy $\epsilon$ in Theorem 5.1 to match that in Theorem 6.1 at the finite-sample level, namely letting $\epsilon = \mathcal{O}((d/n)^{1/2})$ for sufficiently unbalanced fixed mixing weights and $\epsilon = \mathcal{O}((d/n)^{1/4})$ for sufficiently balanced fixed mixing weights, we intuitively derive iteration complexity bounds $\mathcal{O}(\log (1/\epsilon))=\mathcal{O}(\log (n/d))$ and $\mathcal{O}(\epsilon^{-2})=\mathcal{O}((n/d)^{1/2})$ at the finite-sample level for sufficiently unbalanced and balanced initial mixing weights, respectively. We further extend our analysis in the overspecified setting to the finite low SNR regime, providing approximate dynamic equations that characterize the EM algorithm's behavior in this challenging case. Our new findings not only expand the scope of theoretical convergence but also improve the bounds for statistical error, time complexity, and sample complexity, and rigorously characterize the evolution of EM estimates.
URL: https://openreview.net/forum?id=mFdHMNFtrT
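For context, the iteration whose convergence the abstract characterizes is ordinary EM for mixed linear regression. Below is a minimal sketch of standard EM for a two-component model, run in the overspecified setting (data generated from a single component); the known noise level, the data, and the initialization are illustrative, and none of the paper's rates are reproduced.

```python
import numpy as np

def em_2mlr(X, y, n_iter=200, sigma=0.5, seed=0):
    """Standard EM for two-component mixed linear regression with a known noise level sigma."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    betas = rng.normal(size=(2, d))
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities from Gaussian residual likelihoods.
        resid = y[:, None] - X @ betas.T                       # shape (n, 2)
        log_lik = -0.5 * (resid / sigma) ** 2 + np.log(pi)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component, then update mixing weights.
        for k in range(2):
            W = resp[:, k]
            betas[k] = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))
        pi = resp.mean(axis=0)
    return betas, pi

# Overspecified setting: data come from one regression component, but two are fitted.
rng = np.random.default_rng(1)
n, d = 5000, 5
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.5 * rng.normal(size=n)
betas, pi = em_2mlr(X, y)
print("mixing weights:", np.round(pi, 3))
print("distance of each fitted component to the truth:",
      np.round(np.linalg.norm(betas - beta_true, axis=1), 3))
```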
---
Title: Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms
Authors: Pravin Nair
Abstract: The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
TL;DR: We show that the softmax operator is $1/2$-Lipschitz (contractive) over all $\ell_p$ norms ($p \ge 1$), and characterize the tightness of this bound. We validate the constant empirically on modern attention architectures and stochastic RL policies, and demonstrate how the sharper Lipschitz bound improves existing robustness and optimization guarantees.
URL: https://openreview.net/forum?id=6dowaHsa6D
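The headline constant is easy to probe numerically: the softmax Jacobian at $x$ is $\mathrm{diag}(s) - s s^{\top}$ with $s = \mathrm{softmax}(x)$, and its spectral norm (the local $\ell_2$ Lipschitz constant) stays bounded by $1/2$. A minimal check on random inputs, for illustration only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(10_000):
    x = rng.normal(scale=rng.uniform(0.1, 10.0), size=rng.integers(2, 20))
    s = softmax(x)
    J = np.diag(s) - np.outer(s, s)            # Jacobian of softmax at x
    worst = max(worst, np.linalg.norm(J, 2))   # spectral norm = local l2 Lipschitz constant
print("largest observed local Lipschitz constant:", worst)   # remains bounded by 0.5
```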
---
Title: Hypergraph clustering using Ricci curvature: an edge transport perspective
Authors: Olympio Hacquard
Abstract: In this paper, we introduce a novel method for extending Ricci flow to hypergraphs by defining probability measures on the edges and transporting them on the line expansion. This approach yields a new weighting on the edges, which proves particularly effective for community detection. We extensively compare this method with a similar notion of Ricci flow defined on the clique expansion, demonstrating its enhanced sensitivity to the hypergraph structure, especially in the presence of large hyperedges. The two methods are complementary and together form a powerful and highly interpretable framework for community detection in hypergraphs.
URL: https://openreview.net/forum?id=HMROU8MXqV
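As a small illustration of the two expansions being compared, the snippet below builds both from a toy hypergraph with networkx, reading "line expansion" simply as the line graph of the hypergraph (one node per hyperedge, adjacent when hyperedges intersect); the paper's precise construction and the Ricci-flow weighting are not reproduced, and the hypergraph is invented.

```python
import itertools
import networkx as nx

# A toy hypergraph: each hyperedge is a set of vertices.
hyperedges = [{0, 1, 2}, {2, 3}, {3, 4, 5, 6}, {6, 0}]

# Clique expansion: every pair of vertices inside a hyperedge becomes an edge.
clique = nx.Graph()
for e in hyperedges:
    clique.add_edges_from(itertools.combinations(sorted(e), 2))

# Line-graph expansion: one node per hyperedge, joined when hyperedges share vertices.
line = nx.Graph()
line.add_nodes_from(range(len(hyperedges)))
for (i, ei), (j, ej) in itertools.combinations(enumerate(hyperedges), 2):
    if ei & ej:
        line.add_edge(i, j, shared=len(ei & ej))

print("clique expansion:", clique.number_of_nodes(), "nodes,", clique.number_of_edges(), "edges")
print("line expansion:  ", line.number_of_nodes(), "nodes,", line.number_of_edges(), "edges")
```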
---
Title: RT2I-Bench: Evaluating Robustness of Text-to-Image Systems Against Adversarial Attacks
Authors: Athanasios Glentis, Ioannis Tsaknakis, Jiangweizhi Peng, Xun Xian, Yihua Zhang, Gaowen Liu, Charles Fleming, Mingyi Hong
Abstract: Text-to-Image (T2I) systems have demonstrated impressive abilities in the generation of images from text descriptions. However, these systems remain susceptible to adversarial prompts—carefully crafted input manipulations that can result in misaligned or even toxic outputs. This vulnerability highlights the need for systematic evaluation of attack strategies that exploit these weaknesses, as well as for testing the robustness of T2I systems against them. To this end, this work introduces the RT2I-Bench benchmark. RT2I-Bench serves two primary purposes. First, it provides a structured evaluation of various adversarial attacks, examining their effectiveness, transferability, stealthiness, and potential for generating misaligned or toxic outputs, as well as assessing the resilience of state-of-the-art T2I models to such attacks. We observe that state-of-the-art T2I systems are vulnerable to adversarial prompts, with the most effective attacks achieving success rates of over 60\% across the majority of T2I models we tested. Second, RT2I-Bench enables the creation of a set of strong adversarial prompts (1,439 prompts that induce misaligned or targeted outputs and 173 that induce toxic outputs), which are effective across a wide range of systems. Finally, our benchmark is designed to be extensible, enabling the seamless addition of new attacks, T2I models, and evaluation metrics. This framework provides an automated solution for robustness assessment and adversarial prompt generation in T2I systems.
URL: https://openreview.net/forum?id=ZUiWjEouSf
---
Title: Large Language Model Reasoning Failures
Authors: Peiyang Song, Pengrui Han, Noah Goodman
Abstract: Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.
URL: https://openreview.net/forum?id=vnX1WHMNmz
---
Title: Nonlinear reconciliation: Error reduction theorems
Authors: Lorenzo Nespoli, Anubhab Biswas, Roberto Rocchetta, Vasco Medici
Abstract: Forecast reconciliation, an ex-post technique applied to forecasts that must satisfy constraints, has been a prominent topic in the forecasting literature over the past two decades. Recently, several efforts have sought to extend reconciliation methods to probabilistic settings. Nevertheless, formal theorems demonstrating error reduction in nonlinear contexts, analogous to those presented in Panagiotelis et al. (2021), are still lacking. This paper addresses that gap by establishing such theorems for various classes of nonlinear hypersurfaces and vector-valued functions. Specifically, we derive an exact analog of Theorem 3.1 from Panagiotelis et al. (2021) for hypersurfaces with constant-sign curvature. Additionally, we provide an error reduction theorem for the broader case of hypersurfaces with non-constant-sign curvature and for general manifolds with codimension > 1. To support reproducibility and practical adoption, we release a JAX-based Python package, JNLR, implementing the presented theorems and reconciliation procedures.
URL: https://openreview.net/forum?id=dXRWuogm3J
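Reconciliation here amounts to mapping a base forecast onto a nonlinear constraint surface. As a toy illustration of that projection step only (not the paper's theorems or the JNLR package API), one can project a forecast onto a constraint $g(y) = 0$ with scipy; the constraint and the base forecasts below are invented.

```python
import numpy as np
from scipy.optimize import minimize

def reconcile(y_hat, constraint):
    """Project a base forecast onto the nonlinear constraint surface {y : constraint(y) = 0}."""
    res = minimize(
        lambda y: 0.5 * np.sum((y - y_hat) ** 2),      # stay as close as possible to the base forecast
        x0=y_hat,
        constraints=[{"type": "eq", "fun": constraint}],
        method="SLSQP",
    )
    return res.x

# Toy nonlinear constraint: the third series must equal the product of the first two.
g = lambda y: y[2] - y[0] * y[1]
y_hat = np.array([2.1, 2.9, 7.0])                      # unreconciled base forecasts
y_rec = reconcile(y_hat, g)
print(np.round(y_rec, 4), round(g(y_rec), 6))          # g(y_rec) ~ 0 after reconciliation
```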
---
Title: Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?
Authors: Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
Abstract: Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal various new insights, such as: (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
URL: https://openreview.net/forum?id=E7HDtLCoT6
---
Title: Supervised score aggregation for active anomaly detection
Authors: Kevin Bleakley, Martin Royer, Benjamin Auder
Abstract: Detecting rare anomalies in batches of multidimensional data is challenging.
We propose an original supervised active-learning framework that sends a small number of data points from each batch to an expert for labeling as `anomaly' or `nominal' via two mechanisms: (i) points most likely to be anomalies in the eyes of a supervised classifier trained on previously-labeled data; and (ii) points suggested by an active learner. Instead of training the supervised classifier directly on currently-labeled raw data, we treat the scores calculated by an ensemble of $M$ user-defined unsupervised anomaly detectors as if they were the learner's input features. Our approach generalizes earlier attempts to linearly aggregate unsupervised anomaly detector scores, and broadens the scope of these methods from unordered bags of data to ordered data such as time series. Simulated and real data trials suggest that this method usually outperforms---often significantly---linear strategies.
The Python library acanag implements our proposed method.
URL: https://openreview.net/forum?id=nrmJD3XMA3
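The core construction above, scores from an ensemble of unsupervised detectors used as input features for a supervised classifier, is easy to sketch with scikit-learn. The detectors, data, labels, and querying rule below are illustrative assumptions; the active-learning loop and the acanag library itself are not reproduced.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (980, 5)), rng.normal(4, 1, (20, 5))])  # 2% anomalies
y = np.r_[np.zeros(980), np.ones(20)]                                   # ground-truth expert labels

# Ensemble of M unsupervised detectors; their (standardized) scores become the features.
detectors = [IsolationForest(random_state=0).fit(X),
             LocalOutlierFactor(novelty=True).fit(X)]
scores = np.column_stack([-d.score_samples(X) for d in detectors])      # higher = more anomalous
scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Points sent to the expert: the most suspicious ones plus a random exploration set.
queried = np.unique(np.r_[np.argsort(-scores.mean(axis=1))[:50],
                          rng.choice(len(X), 50, replace=False)])

# Supervised aggregation: a classifier trained on the score features of the labelled points.
clf = LogisticRegression().fit(scores[queried], y[queried])
ranking = clf.predict_proba(scores)[:, 1]
print("top-20 hit rate:", y[np.argsort(-ranking)[:20]].mean())
```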
---
Title: ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning
Authors: Aman Anand, Amir Eskandari, Elyas Rashno, Farhana Zulkernine
Abstract: Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints, such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies to learn the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from both masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned and aligned representation into a lightweight model using knowledge distillation. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods with an average improvement of 2.7–4.4 % in fine-tuning and up to 5.9 % in transfer learning to noisy datasets, and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves a 91.4 % parameter reduction and 3× faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
URL: https://openreview.net/forum?id=kIFo1q3VMS
---
Title: BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts
Authors: Erin Feiglin, Nir Hutnik, Raz Lapid
Abstract: We investigate a failure mode of large language models (LLMs) in which benign, plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and carries concrete risks for denial-of-wallet, latency, and cross-user performance degradation. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5,000 new tokens, we evaluate BenchOverflow on nine open- and closed-source models. Across models, BenchOverflow produces pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation—a fixed conciseness reminder—attenuates right tails and lowers CSR for several strategies. Our findings reframe verbosity as a measurable risk to reliability and cost, rather than a mere stylistic quirk. BenchOverflow provides a practical, reproducible protocol for benchmarking length-control robustness in deployed LLMs.
URL: https://openreview.net/forum?id=tiQjg5i4ii
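Reading CSR@k as the share of responses whose length reaches the k-token cap (our interpretation; the benchmark's exact definition may differ), the reported length statistics reduce to a few lines over a list of output lengths; the heavy-tailed sample below is synthetic.

```python
import numpy as np

def csr(lengths, caps=(1000, 3000, 5000)):
    """Cap-saturation rate: share of responses whose token length reaches each cap."""
    lengths = np.asarray(lengths)
    return {f"CSR@{c // 1000}k": float((lengths >= c).mean()) for c in caps}

def ecdf(lengths):
    """Empirical CDF of response lengths as (sorted lengths, cumulative fractions)."""
    x = np.sort(np.asarray(lengths))
    return x, np.arange(1, len(x) + 1) / len(x)

# Synthetic heavy-tailed lengths standing in for one model under one prompting strategy,
# truncated at the 5,000-token budget used in the benchmark protocol.
rng = np.random.default_rng(0)
lengths = np.minimum(rng.lognormal(mean=6.5, sigma=1.0, size=500).astype(int), 5000)
print(csr(lengths))
x, F = ecdf(lengths)
print("median length:", int(np.interp(0.5, F, x)))
```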
---
Title: PRISM: Diversifying Dataset Distillation by Decoupling Architectural Priors
Authors: Brian Bernhard Moser, Shalini Sarode, Federico Raue, Stanislav Frolov, Krzysztof Adamkiewicz, Arundhati Shanbhag, Joachim Folz, Tobias Christian Nauen, Andreas Dengel
Abstract: Dataset distillation (DD) promises compact yet faithful synthetic data, but existing approaches often inherit the inductive bias of a single teacher model. As dataset size increases, this bias drives generation toward overly smooth, homogeneous samples, reducing intra-class diversity and limiting generalization. We present PRISM (PRIors from diverse Source Models), a framework that disentangles architectural priors during synthesis. PRISM decouples the logit-matching and regularization objectives, supervising them with different teacher architectures: a primary model for logits and a stochastic subset for batch-normalization (BN) alignment. On ImageNet-1K, PRISM consistently and reproducibly outperforms single-teacher methods (e.g., SRe2L) and recent multi-teacher variants (e.g., G-VBSM) at low- and mid-IPC regimes. The generated data also show significantly richer intra-class diversity, as reflected by a notable drop in cosine similarity between features. We further analyze teacher selection strategies (pre- vs. intra-distillation) and introduce a scalable cross-class batch formation scheme for fast parallel synthesis. Code: https://github.com/Brian-Moser/prism
URL: https://openreview.net/forum?id=xN58FtB1Gq
---
Title: On a Gradient Approach to Chebyshev Center Problems with Applications to Function Learning
Authors: Abhinav Raghuvanshi, Mayank Baranwal, Debasish Chatterjee
Abstract: We introduce $\textsf{gradOL}$, the first gradient-based optimization framework for solving Chebyshev center problems, a fundamental challenge in optimal function learning and geometric optimization. $\textsf{gradOL}$ hinges on reformulating the semi-infinite problem as a finitary max-min optimization, making it amenable to gradient-based techniques. By leveraging automatic differentiation for precise numerical gradient computation, $\textsf{gradOL}$ ensures numerical stability and scalability, making it suitable for large-scale settings. Under strong convexity of the ambient norm, $\textsf{gradOL}$ provably recovers optimal Chebyshev centers while directly computing the associated radius. This addresses a key bottleneck in constructing stable optimal interpolants. Empirically, $\textsf{gradOL}$ achieves significant improvements in accuracy and efficiency on 34 Chebyshev center problems from the benchmark \textsf{CSIP} library. Moreover, we extend $\textsf{gradOL}$ to general convex semi-infinite programming (CSIP), attaining up to $4000\times$ speedups over the state-of-the-art \textsf{sipampl} solver on the same \textsf{CSIP} library, which contains 67 benchmark problems. Furthermore, we provide the first theoretical foundation for applying gradient-based methods to Chebyshev center problems, bridging rigorous analysis with practical algorithms. $\textsf{gradOL}$ thus offers a unified solution framework for Chebyshev centers and broader CSIPs.
URL: https://openreview.net/forum?id=lPZVsDhyj3
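As a toy analogue of the max-min reformulation (not the $\textsf{gradOL}$ algorithm itself), the Chebyshev center of a finite point set, i.e., the point minimizing the maximum distance to the set, can be found by subgradient descent on the inner maximum; the point cloud and step-size schedule are illustrative.

```python
import numpy as np

def chebyshev_center(points, steps=5000):
    """Minimize c -> max_i ||c - x_i|| by subgradient descent (finite point-set version)."""
    c = points.mean(axis=0)
    for t in range(1, steps + 1):
        dists = np.linalg.norm(points - c, axis=1)
        i = np.argmax(dists)                          # active point of the inner maximum
        subgrad = (c - points[i]) / max(dists[i], 1e-12)
        c -= 0.5 * subgrad / np.sqrt(t)               # diminishing step size
    return c, np.linalg.norm(points - c, axis=1).max()

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))
center, radius = chebyshev_center(pts)
print("center:", np.round(center, 3), "radius:", round(radius, 3))
```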
---
Title: SiLVR: A Simple Language-based Video Reasoning Framework
Authors: Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius
Abstract: Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SILVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SILVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an Adaptive Context Reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. More details can be found at https://sites.google.com/cs.unc.edu/silvr.
URL: https://openreview.net/forum?id=mQZbh9Zlbw
---
Title: Beyond Affinity: A Benchmark of 1D, 2D, and 3D Methods Reveals Critical Trade-offs in Structure-Based Drug Design
Authors: Kangyu Zheng, Kai Zhang, Jiale Tan, Xuehan Chen, Yingzhou Lu, ZAIXI ZHANG, Lichao Sun, Marinka Zitnik, Tianfan Fu, Zhiding Liang
Abstract: Currently, the field of structure-based drug design is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill this gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, a possibility that is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure-based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine the strengths of different approaches while addressing their limitations. All code used for benchmarking is available at https://github.com/zkysfls/2025-sbdd-benchmark
URL: https://openreview.net/forum?id=gaTwx1rzCw
---
Title: One-Sided Matrix Completion from Ultra-Sparse Samples
Authors: Hongyang R. Zhang, Zhenshuo Zhang, Huy Nguyen, Guanghui Lan
Abstract: Matrix completion is a classical problem that has received recurring interest from a wide range of fields. In this paper, we revisit this problem in an ultra-sparse sampling regime, where each entry of an unknown, $n\times d$ matrix $M$ (with $n \ge d$) is observed independently with probability $p = C / d$, for a fixed integer $C \ge 2$. This setting is motivated by applications involving large, sparse panel datasets, where the number of rows (users) far exceeds the number of columns (items). When each row contains only $C$ observed entries---fewer than the rank of $M$---accurate imputation of $M$ is impossible. Instead, we focus on estimating the row span of $M$, or equivalently, the averaged second-moment matrix $T = M^{\top} M / n$.
The empirical second-moment matrix computed from observational data exhibits non-random and sparse missingness. We propose an unbiased estimator that normalizes each nonzero entry of the second moment by its observed frequency, followed by gradient descent to impute the missing entries of $T$. This normalization divides a weighted sum of $n$ binomial random variables by their total number of ones---a nonlinear operation. We show that the estimator is unbiased for any value of $p$ and enjoys low variance. When the row vectors of $M$ are drawn uniformly from a rank-$r$ factor model satisfying an incoherence condition, we prove that if $n \ge O({d r^5 \epsilon^{-2} C^{-2} \log d})$, any local minimum of the gradient-descent objective is approximately global and recovers $T$ with error at most $\epsilon^2$.
Experiments on both synthetic and real-world data validate our approach. On three MovieLens datasets, our algorithm reduces bias by $88\%$ relative to several baseline estimators. We also empirically evaluate the linear sampling complexity of $n$ relative to $d$ using synthetic data. Finally, on the Amazon reviews dataset with sparsity $10^{-7}$, our method reduces the recovery error of $T$ by $59\%$ and $M$ by $38\%$ compared to existing matrix completion methods.
URL: https://openreview.net/forum?id=vYGi4Dj777
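The frequency-normalization step described above can be sketched directly: estimate each entry of $T$ by the average product over the rows in which the corresponding column pair was co-observed. The synthetic data below are illustrative; the gradient-descent imputation stage and the paper's guarantees are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, C = 100_000, 50, 5, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))        # low-rank panel matrix
T_true = M.T @ M / n                                         # target second-moment matrix

# Ultra-sparse observations: each entry kept independently with probability p = C / d.
mask = rng.random((n, d)) < C / d
X = np.where(mask, M, 0.0)

# Normalized estimator: divide each co-observed sum of products by its co-observation count.
sums = X.T @ X                                               # sums over rows where both columns are seen
counts = mask.astype(float).T @ mask.astype(float)           # how often each column pair is co-observed
T_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

rel_err = np.linalg.norm(T_hat - T_true) / np.linalg.norm(T_true)
print("relative error of the normalized estimator:", round(rel_err, 3))
```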
---
Title: AC-PKAN: Attention-Enhanced and Chebyshev Polynomial-Based Physics-Informed Kolmogorov–Arnold Networks
Authors: Hangwei Zhang, Zhimu Huang, Yan Wang
Abstract: Kolmogorov–Arnold Networks (KANs) have recently shown promise for solving partial differential equations (PDEs). Yet their original formulation is computationally and memory intensive, motivating the introduction of Chebyshev Type-I-based KANs (Chebyshev1KANs). Although Chebyshev1KANs have outperformed the vanilla KANs architecture, our rigorous theoretical analysis reveals that they still suffer from rank collapse, ultimately limiting their expressive capacity. To overcome these limitations, we enhance Chebyshev1KANs by integrating wavelet-activated MLPs with learnable parameters and an internal attention mechanism. We prove that this design preserves a full-rank Jacobian and is capable of approximating solutions to PDEs of arbitrary order. Furthermore, to alleviate the loss instability and imbalance introduced by the Chebyshev polynomial basis, we externally incorporate a Residual Gradient Attention (RGA) mechanism that dynamically re-weights individual loss terms according to their gradient norms and residual magnitudes. By jointly leveraging internal and external attention, we present AC-PKAN, a novel architecture that constitutes an enhancement to weakly supervised Physics-Informed Neural Networks (PINNs) and extends the expressive power of KANs. Experimental results from nine benchmark tasks across three domains show that AC-PKAN consistently outperforms or matches state-of-the-art models such as PINNsFormer, establishing it as a highly effective tool for solving complex real-world engineering problems in zero-data or data-sparse regimes. The code will be made publicly available upon acceptance.
URL: https://openreview.net/forum?id=J4SkwpIgj7
---
Title: COLT: Enhancing Video Large Language Models with Continual Tool Usage
Authors: Yuyang Liu, Meng Cao, Xinyuan Shi, Xiaodan Liang
Abstract: The success of Large Language Models (LLMs) has significantly propelled research on video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use finetuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering "catastrophic forgetting" of previously learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then, relevant tools are dynamically selected based on the similarity between user instructions and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset, VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
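The codebook-based tool selection described above amounts to a similarity lookup between an instruction embedding and learnable tool embeddings. A minimal PyTorch sketch of that selection step (dimensions, tool count, and the instruction encoder are placeholders; the continual-learning machinery of COLT is not reproduced):
```python
import torch
import torch.nn.functional as F

num_tools, dim = 8, 256
tool_codebook = torch.nn.Parameter(torch.randn(num_tools, dim))   # learnable tool-specific memory

def select_tools(instruction_emb: torch.Tensor, top_k: int = 2):
    """Pick the top-k tools whose codebook entries are most similar to the
    (already encoded) user instruction."""
    sims = F.cosine_similarity(instruction_emb.unsqueeze(0), tool_codebook, dim=-1)
    scores, idx = sims.topk(top_k)
    return idx, scores

instruction_emb = torch.randn(dim)        # stand-in for a video-LLM instruction encoding
idx, scores = select_tools(instruction_emb)
print("selected tools:", idx.tolist(), "similarities:", scores.detach().tolist())
```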
URL: https://openreview.net/forum?id=NT9tHHTlXn
---
Title: Calibration Enhanced Decision Maker: Towards Trustworthy Sequential Decision-Making with Large Sequence Models
Authors: Haoyuan Sun, Bo Xia, Yifu Luo, Tiantian Zhang, Xueqian Wang
Abstract: Offline deep reinforcement learning (offline DRL) has attracted considerable attention across various domains due to its ability to learn effective policies without direct environmental interaction. Although highly effective, offline DRL raises concerns in the community about the trustworthiness of the resulting agents. Offline DRL can be categorized into three principal paradigms: model-based algorithms, model-free algorithms, and trajectory optimization. While extant research predominantly concentrates on calibration enhancement of model-based and model-free algorithms, calibration for trajectory optimization has received little attention. In this paper, we introduce the concept of Expected Agent Calibration Error (EACE), a novel metric designed to assess agent calibration. Furthermore, we rigorously prove its theoretical relationship to the state-action marginal distribution distance. Subsequently, we introduce the Calibration Enhanced Decision Maker (CEDM), which employs a binning executor to process feature distribution histograms as input for the large sequence model, thereby minimizing the state-action marginal distribution distance and enhancing the agent's calibration. A series of in-depth case studies of CEDM are carried out, with applications to Decision Transformer, Decision ConvFormer, and Decision Mamba. Empirical results substantiate the robustness of EACE and demonstrate the effectiveness of CEDM in enhancing agent calibration, thereby offering valuable insights for future research on trustworthy sequential decision-making.
URL: https://openreview.net/forum?id=b6WcxPEb48
---
Title: Symmetry in Neural Network Parameter Spaces
Authors: Bo Zhao, Robin Walters, Rose Yu
Abstract: Modern deep learning models are highly overparameterized, resulting in large sets of parameter configurations that yield the same outputs. A significant portion of this redundancy is explained by symmetries in the parameter space—transformations that leave the network function unchanged. These symmetries shape the loss landscape and constrain learning dynamics, offering a new lens for understanding optimization, generalization, and model complexity that complements existing theory of deep learning. This survey provides an overview of parameter space symmetry. We summarize existing literature, uncover connections between symmetry and learning theory, and identify gaps and opportunities in this emerging field.
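As a concrete instance of the symmetries surveyed here, permuting the hidden units of a two-layer ReLU network (rows of the first weight matrix and bias together with the matching columns of the second) leaves the network function unchanged. A small NumPy check:
```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 16, 3
W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

perm = rng.permutation(d_hidden)          # any permutation of the hidden units
x = rng.normal(size=d_in)
out_original = mlp(x, W1, b1, W2, b2)
out_permuted = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)
print(np.allclose(out_original, out_permuted))   # True: the function is unchanged
```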
URL: https://openreview.net/forum?id=jLpWq5QY6I
---
Title: Consistency Trajectory Planning: High-Quality and Efficient Trajectory Optimization for Offline Model-Based Reinforcement Learning
Authors: Guanquan Wang, Takuya Hiraoka, Yoshimasa Tsuruoka
Abstract: This paper introduces Consistency Trajectory Planning (CTP), a novel offline model-based reinforcement learning method that leverages the recently proposed Consistency Trajectory Model (CTM) for efficient trajectory optimization. While prior work applying diffusion models to planning has demonstrated strong performance, it often suffers from high computational costs due to iterative sampling procedures. CTP supports few-step trajectory generation without significant degradation in policy quality. We evaluate CTP on the D4RL benchmark and show that it consistently outperforms existing diffusion-based planning methods in long-horizon, goal-conditioned tasks. Notably, CTP achieves higher normalized returns while using fewer denoising steps. In particular, CTP attains comparable—or even superior—performance with reduced inference cost, highlighting its practicality and effectiveness for high-performance, low-latency offline planning.
URL: https://openreview.net/forum?id=RVGkT9ISVf
---
Title: Estimating Expected Calibration Error for Positive-Unlabeled Learning
Authors: Ryuichi Kiryo, Futoshi Futami, Masashi Sugiyama
Abstract: The reliability of probabilistic classifiers hinges on their calibration---the property that their confidence scores accurately reflect the true class probabilities.
The expected calibration error (ECE) is a standard metric for quantifying the calibration of classifiers.
However, its estimation presumes access to ground-truth labels.
In positive-unlabeled (PU) learning, only positive and unlabeled data are available, which makes the standard ECE estimator inapplicable.
Although PU learning has been extensively studied for risk estimation and classifier training, calibration in this setting has received little attention.
In this paper, we present PU-ECE, the first ECE estimator for PU data.
We provide non-asymptotic bias bounds and prove convergence rates that match those of the fully supervised ECE with an optimal bin size.
Furthermore, we develop an information-theoretic generalization error analysis of PU-ECE by formalizing the conditional mutual information (CMI) for a PU setting.
Experiments on synthetic and real-world benchmark datasets validate our theoretical analysis and demonstrate that our PU-based ECE estimator achieves performance comparable to that of the fully-labeled ECE estimator.
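For reference, the fully supervised binned ECE that PU-ECE generalizes can be computed as below; the PU-specific estimator from the paper is not reproduced, and the bin count and synthetic data are illustrative.
```python
import numpy as np

def binned_ece(confidences, labels, n_bins=10):
    """Standard binary ECE: weighted average over equal-width confidence bins of
    |empirical accuracy - mean confidence|."""
    confidences, labels = np.asarray(confidences), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(labels[in_bin].mean() - confidences[in_bin].mean())
    return ece

rng = np.random.default_rng(0)
p = rng.uniform(size=5000)                      # predicted positive-class probabilities
y = (rng.uniform(size=5000) < p).astype(int)    # perfectly calibrated synthetic labels
print(round(binned_ece(p, y), 4))               # close to 0 for a calibrated classifier
```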
URL: https://openreview.net/forum?id=SvoBtLIrPZ
---
Title: DeepSeek-R1 Thoughtology: Let’s think about LLM reasoning
Authors: Sara Vera Marjanovic, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stanczak, Siva Reddy
Abstract: Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly “thinking” about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1’s basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a ‘sweet spot’ of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.
URL: https://openreview.net/forum?id=BZwKsiRnJI
---
Title: Training-Conditional Coverage Bounds under Covariate Shift
Authors: Mehrdad Pournaderi, Yu Xiang
Abstract: Conformal prediction methodology has recently been extended to the covariate shift setting, where the distribution of covariates differs between training and test data. While existing results ensure that the prediction sets from these methods achieve marginal coverage above a nominal level, their coverage rate conditional on the training dataset—referred to as training-conditional coverage—remains unexplored. In this paper, we address this gap by deriving upper bounds on the tail of the training-conditional coverage distribution, offering probably approximately correct (PAC) guarantees for these methods. Our results characterize the reliability of the prediction sets in terms of the severity of distributional changes and the size of the training dataset.
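For context, the covariate-shift conformal methods these bounds concern typically reweight calibration scores by the covariate likelihood ratio before taking the quantile. A minimal sketch of weighted split conformal prediction, assuming the likelihood ratio w(x) = dP_test/dP_train(x) is given (in practice it is estimated) and the regressor mu is already trained:
```python
import numpy as np

def weighted_conformal_interval(mu, x_cal, y_cal, x_test, w, alpha=0.1):
    """Split-conformal interval under covariate shift: calibration scores are
    reweighted by w before taking the (1 - alpha) quantile."""
    scores = np.abs(y_cal - mu(x_cal))                  # nonconformity scores
    weights = np.append(w(x_cal), w(x_test))            # last entry: the test point
    probs = weights / weights.sum()
    order = np.argsort(scores)
    cum = np.cumsum(probs[:-1][order])                  # weighted CDF over calibration scores
    hit = np.nonzero(cum >= 1 - alpha)[0]
    q = scores[order][hit[0]] if hit.size else np.inf   # fall back to an infinite interval
    pred = float(mu(np.atleast_1d(x_test))[0])
    return pred - q, pred + q

# Toy example: calibration covariates ~ N(0, 1), test covariates ~ N(1, 1).
rng = np.random.default_rng(0)
mu = lambda x: 2.0 * x                                   # a fixed, pre-trained regressor
w = lambda x: np.exp(x - 0.5)                            # exact N(1,1)-vs-N(0,1) likelihood ratio
x_cal = rng.normal(size=500)
y_cal = 2.0 * x_cal + rng.normal(size=500)
print(weighted_conformal_interval(mu, x_cal, y_cal, 1.3, w, alpha=0.1))
```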
URL: https://openreview.net/forum?id=F6hHT3qWxT
---
Title: There are no Champions in Supervised Long-Term Time Series Forecasting
Authors: Lorenzo Brigato, Rafael Morand, Knut Joar Strømmen, Maria Panagiotou, Markus Schmidt, Stavroula Mougiakakou
Abstract: Recent advances in long-term time series forecasting have introduced numerous complex supervised prediction models that consistently outperform previously published architectures.
However, this rapid progression raises concerns regarding inconsistent benchmarking and reporting practices, which may undermine the reliability of these comparisons. In this study, we first perform a broad, thorough, and reproducible evaluation of the top-performing supervised models on the most popular benchmark and additional baselines representing the most active architecture families. This extensive evaluation assesses eight models on 14 datasets, encompassing $\sim$5,000 trained networks for the hyperparameter (HP) searches. Then, through a comprehensive analysis, we find that slight changes to experimental setups or current evaluation metrics are enough to overturn the common belief that newly published results are advancing the state of the art.
Our findings emphasize the need to shift focus away from pursuing ever-more complex models, towards enhancing benchmarking practices through rigorous and standardized evaluations that enable more substantiated claims, including reproducible HP setups and statistical testing. We offer recommendations for future research.
URL: https://openreview.net/forum?id=yO1JuBpTBB
---
Title: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models
Authors: Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, Jianye HAO
Abstract: Large language models (LLMs) possess vast semantic knowledge but often struggle with complex reasoning tasks, particularly in relational reasoning problems such as kinship or spatial reasoning. In this paper, we present Path-of-Thoughts (PoT), a novel framework for relational reasoning that decomposes the task into three key stages: graph extraction, path identification, and reasoning. Unlike previous approaches, PoT efficiently extracts a reasoning graph that identifies crucial entities, relations, and attributes within the context. Subsequently, PoT identifies query-relevant reasoning paths within the graph, facilitating downstream reasoning over potential answers. Experimental evaluations across four datasets of relational reasoning demonstrate that PoT surpasses state-of-the-art baselines by a significant margin (up to 21.3%) without requiring fine-tuning or extensive LLM calls. Furthermore, unlike prior neuro-symbolic methods, PoT exhibits improved resilience against LLM extraction errors and input ambiguity by leveraging the compositional nature of graphs.
URL: https://openreview.net/forum?id=EbELaNKmZK
---
Title: SSFL: Discovering Sparse Unified Subnetworks at Initialization for Efficient Federated Learning
Authors: Riyasat Ohib, Bishal Thapaliya, Gintare Karolina Dziugaite, Jingyu Liu, Vince D. Calhoun, Sergey Plis
Abstract: In this work, we propose Salient Sparse Federated Learning (SSFL), a streamlined approach for sparse federated learning with efficient communication. SSFL identifies a sparse subnetwork prior to training, leveraging parameter saliency scores computed separately on local client data in non-IID scenarios, and then aggregated, to determine a global mask. Only the sparse model weights are trained and communicated each round between the clients and the server. On standard benchmarks including CIFAR-10, CIFAR-100, and Tiny-ImageNet, SSFL consistently improves the accuracy–sparsity trade-off, achieving more than 20\% relative error reduction on CIFAR-10 compared to the strongest sparse baseline, while reducing communication costs by $2 \times$ relative to dense FL. Finally, in a real-world federated learning deployment, SSFL delivers over $2.3 \times$ faster communication time, underscoring its practical efficiency.
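A minimal sketch of the mask-at-initialization idea: each client scores parameters on its local data (here with a SNIP-style |weight x gradient| saliency as a stand-in for the paper's exact score), the server averages the scores, and the top fraction defines a global mask that all clients would then train and communicate. Model, data, and sparsity level are toy placeholders.
```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(20, 2)           # toy shared model at initialization
clients = [(torch.randn(64, 20), torch.randint(0, 2, (64,))) for _ in range(4)]
loss_fn = torch.nn.CrossEntropyLoss()

def client_saliency(model, x, y):
    """SNIP-style saliency |w * dL/dw| computed on one client's local data."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return {n: (p * p.grad).abs().detach() for n, p in model.named_parameters()}

# Server side: average the client saliency scores, then keep the top 20% of weights globally.
scores = [client_saliency(model, x, y) for x, y in clients]
avg = {n: torch.stack([s[n] for s in scores]).mean(0) for n in scores[0]}
flat = torch.cat([v.flatten() for v in avg.values()])
threshold = flat.kthvalue(int(0.8 * flat.numel())).values     # 80% pruned, 20% kept
mask = {n: (v >= threshold).float() for n, v in avg.items()}
print({n: f"{m.mean().item():.2f} kept" for n, m in mask.items()})
```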
URL: https://openreview.net/forum?id=kUZ6LhUB26
---
Title: Investigating a Model-Agnostic and Imputation-Free Approach for Irregularly-Sampled Multivariate Time-Series Modeling
Authors: Abhilash Neog, Arka Daw, Sepideh Fatemi, Medha Sawhney, Aanish Pradhan, Mary E. Lofton, Bennett J. McAfee, Adrienne Breef-Pilz, Heather L. Wander, Dexter W Howard, Cayelan C. Carey, Paul Hanson, Anuj Karpatne
Abstract: Modeling Irregularly-sampled and Multivariate Time Series (IMTS) is crucial across a variety of applications where different sets of variates may be missing at different time-steps due to sensor malfunctions or high data acquisition costs. Existing approaches for IMTS either consider a two-stage impute-then-model framework or involve specialized architectures specific to a particular model and task. We perform a series of experiments to derive insights about the performance of IMTS methods on a variety of semi-synthetic and real-world datasets for both classification and forecasting. We also introduce Missing Feature-aware Time Series Modeling (MissTSM), a simple model-agnostic and imputation-free approach for IMTS modeling. We show that MissTSM achieves competitive performance compared to other IMTS approaches, especially when the amount of missing values is large and the data lacks simplistic periodic structures, conditions common to real-world IMTS applications.
URL: https://openreview.net/forum?id=HgJ0DMVAA3
---
Title: InfGraND: An Influence-Guided GNN-to-MLP Knowledge Distillation
Authors: Amir Eskandari, Aman Anand, Elyas Rashno, Farhana Zulkernine
Abstract: Graph Neural Networks (GNNs) are the go-to model for graph data analysis. However, GNNs rely on two key operations, aggregation and update, which can pose challenges for low-latency inference tasks or resource-constrained scenarios. Simple Multi-Layer Perceptrons (MLPs) offer a computationally efficient alternative. Yet, training an MLP in a supervised setting often leads to suboptimal performance. Knowledge Distillation (KD) from a GNN teacher to an MLP student has emerged as a way to bridge this gap. However, most KD methods either transfer knowledge uniformly across all nodes or rely on graph-agnostic indicators such as prediction uncertainty. We argue this overlooks a more fundamental, graph-centric inquiry: "How important is a node to the structure of the graph?" We introduce InfGraND, an Influence-guided Graph KNowledge Distillation framework from GNN to MLP, which addresses this by identifying and prioritizing structurally influential nodes to guide the distillation process, ensuring that the MLP learns from the most critical parts of the graph. Additionally, InfGraND embeds structural awareness in MLPs through one-time multi-hop neighborhood feature pre-computation, which enriches the student MLP’s input and thus avoids inference-time overhead. Our rigorous evaluation in transductive and inductive settings across seven homophilic graph benchmark datasets shows InfGraND consistently outperforms prior GNN-to-MLP KD methods, demonstrating its practicality for numerous latency-critical applications in real-world settings.
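The one-time pre-computation mentioned above can be illustrated as follows: hop-wise propagated features are computed offline with a normalized adjacency matrix and concatenated, so the student MLP needs no graph access at inference time. The normalization choice and hop count below are illustrative, not necessarily the paper's.
```python
import numpy as np

def multi_hop_features(adj, X, num_hops=2):
    """Pre-compute [X, A_norm X, A_norm^2 X, ...] once, where A_norm is the
    symmetrically normalized adjacency with self-loops; the MLP student then
    consumes the concatenation as plain input features."""
    A = adj + np.eye(adj.shape[0])                        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(1))
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    feats, H = [X], X
    for _ in range(num_hops):
        H = A_norm @ H
        feats.append(H)
    return np.concatenate(feats, axis=1)

adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)               # toy graph
X = np.random.default_rng(0).normal(size=(4, 8))          # node features
print(multi_hop_features(adj, X).shape)                    # (4, 24): original plus 2 hops
```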
URL: https://openreview.net/forum?id=lfzHR3YwlD
---
New submissions
===============
Title: Deep Actor-Critics with Tight Risk Certificates
Abstract: Deep actor-critic algorithms have reached a level where they influence everyday life. They are a driving force behind continual improvement of large language models through user feedback. However, their deployment in physical systems is not yet widely adopted, mainly because no validation scheme fully quantifies their risk of malfunction. We demonstrate that it is possible to develop tight risk certificates for deep actor-critic algorithms that predict generalization performance from validation-time observations. Our key insight centers on the effectiveness of minimal evaluation data. A small feasible set of evaluation roll-outs collected from a pretrained policy suffices to produce accurate risk certificates when combined with a simple adaptation of PAC-Bayes theory. Specifically, we adopt a recently introduced recursive PAC-Bayes approach, which splits validation data into portions and recursively builds PAC-Bayes bounds on the excess loss of each portion's predictor, using the predictor from the previous portion as a data-informed prior. Our empirical results across multiple locomotion tasks, actor-critic methods, and policy expertise levels demonstrate risk certificates tight enough to be considered for practical use.
URL: https://openreview.net/forum?id=8EeIXrzFHT
---
Title: Overcoming Output Dimension Collapse: When Sparsity Enables Zero-shot Brain-to-image Reconstruction at Small Data Scales
Abstract: Advances in brain-to-image reconstruction are enabling us to externalize the subjective visual experiences encoded in the brain as images.
A key challenge in this task is data scarcity: a translator that maps brain activity to latent image features is trained on a limited number of brain-image pairs, making the translator a bottleneck for zero-shot reconstruction beyond the training stimuli.
In this paper, we provide a theoretical analysis of two translator designs widely used in recent reconstruction pipelines: naive multivariate linear regression and sparse multivariate linear regression.
We define the data scale as the ratio of the number of training samples to the latent feature dimensionality and characterize the behavior of each model across data scales.
We first show that the naive linear regression model, which uses a shared set of input variables for all outputs, suffers from ``output dimension collapse'' at small data scales, restricting generalization beyond the training data.
We then analyze sparse linear regression models in a student--teacher framework and derive expressions for the prediction error in terms of data scale and other sparsity-related parameters.
Our analysis clarifies when variable selection can reduce prediction error at small data scales by exploiting the sparsity of the brain-to-feature mapping.
Our findings provide quantitative guidelines for diagnosing output dimension collapse and for designing effective translators and feature representations for zero-shot reconstruction.
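A toy illustration of the regime described above (not the paper's student-teacher calculation): when the true mapping is sparse and the data scale is small (many more feature dimensions than training samples), a sparse Lasso translator generalizes where a naive least-squares or ridge fit does not. Dimensions and regularization strengths are arbitrary, and scikit-learn estimators are used for convenience.
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_train, n_test, d_voxels, k_active = 100, 500, 2000, 10    # small data scale: n << d
beta = np.zeros(d_voxels)
beta[rng.choice(d_voxels, k_active, replace=False)] = rng.normal(size=k_active)

X_train = rng.normal(size=(n_train, d_voxels))
X_test = rng.normal(size=(n_test, d_voxels))
y_train = X_train @ beta + 0.1 * rng.normal(size=n_train)
y_test = X_test @ beta

for name, model in [("ridge", Ridge(alpha=1.0)), ("lasso", Lasso(alpha=0.05))]:
    model.fit(X_train, y_train)
    err = np.mean((model.predict(X_test) - y_test) ** 2) / np.mean(y_test ** 2)
    print(f"{name}: relative test error = {err:.3f}")       # sparsity helps at this data scale
```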
URL: https://openreview.net/forum?id=RQiXUK4kQr
---
Title: Unsupervised Domain Adaptation for Binary Classification with an Unobservable Source Subpopulation
Abstract: We study an unsupervised domain adaptation problem where the source domain consists of subpopulations defined by the binary label $Y$ and a binary background (or environment) $A$. We focus on a challenging setting in which one such subpopulation in the source domain is unobservable. Naively ignoring this unobserved group can result in biased estimates and degraded predictive performance. Despite this structured missingness, we show that the prediction in the target domain can still be recovered. Specifically, we rigorously derive both background-specific and overall prediction models for the target domain. For practical implementation, we propose the distribution matching method to estimate the subpopulation proportions. We provide theoretical guarantees for the asymptotic behavior of our estimator, and establish an upper bound on the prediction error. Experiments on both synthetic and real-world datasets show that our method outperforms the naive benchmark that does not account for this unobservable source subpopulation.
URL: https://openreview.net/forum?id=aOKcvMt8xE
---
Title: Weak-to-strong Generalization via Formative Learning from Student Demonstrations & Teacher Evaluation
Abstract: As Large Language Models (LLMs) exceed human capabilities, providing reliable human feedback for evaluating and aligning them, via standard frameworks such as Reinforcement Learning from Human Feedback, becomes challenging. This raises a fundamental question: \textit{how can we leverage weaker (teacher) supervision to elicit the full capabilities of a stronger (student) model?} This emerging paradigm, known as Weak-to-Strong (W2S) generalization, however, also introduces a key challenge: the strong student may ``overfit'' to the weak teacher's mistakes, resulting in a notable performance degradation compared to learning with ground-truth data. We show that this overfitting problem occurs because learning with weak supervision implicitly regularizes the strong student's policy toward the weak reference policy. Building on this insight, we propose a novel learning approach, called Weak Teacher \textbf{E}\textbf{v}aluation of Strong Student D\textbf{e}monstrations or \textsc{Eve}, to instead regularize the strong student toward its own reference policy. \textsc{Eve}'s regularization intuitively elicits the strong student's knowledge through its own task demonstrations while relying on the weaker teacher to evaluate these demonstrations -- an instance of formative learning. Extensive empirical evaluations demonstrate that \textsc{Eve} significantly outperforms existing W2S learning approaches and exhibits significantly better robustness under unreliable feedback compared to contrastive learning methods such as Direct Preference Optimization.
URL: https://openreview.net/forum?id=nSO0g4lLxV
---
Title: Maximum Mean Discrepancy with Unequal Sample Sizes via Generalized U-Statistics
Abstract: Existing two-sample testing techniques, particularly those based on choosing a kernel for the Maximum Mean Discrepancy (MMD), often assume equal sample sizes from the two distributions. Applying these methods in practice can require discarding valuable data, unnecessarily reducing test power. We address this long-standing limitation by extending the theory of generalized U-statistics and applying it to the usual MMD estimator, resulting in a new characterization of the asymptotic distributions of the MMD estimator with unequal sample sizes (particularly outside the proportional regimes required by previous partial results). This generalization also provides a new criterion for optimizing the power of an MMD test with unequal sample sizes. Our approach preserves all available data, enhancing test accuracy and applicability in realistic settings. Along the way, we give much cleaner characterizations of the variance of MMD estimators, revealing something that might be surprising to those in the area: while zero MMD implies a degenerate estimator, it is sometimes possible to have a degenerate estimator with nonzero MMD as well. We give a construction of such a case, and a proof that it does not happen in common situations.
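The estimator in question is the standard unbiased (U-statistic) MMD^2, which is already well defined for m != n; the paper's contribution concerns its asymptotic distribution and kernel selection, not the formula itself. A NumPy sketch with an RBF kernel and deliberately unequal sample sizes:
```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased MMD^2 estimate; X has m samples, Y has n samples, m != n allowed."""
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_kernel(X, X, gamma), rbf_kernel(Y, Y, gamma), rbf_kernel(X, Y, gamma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))    # off-diagonal terms only
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))      # m = 200
Y = rng.normal(0.5, 1.0, size=(900, 5))      # n = 900, no samples discarded
print(round(mmd2_unbiased(X, Y), 4))
```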
URL: https://openreview.net/forum?id=KjXW75GHHF
---
Title: A Tighter Bound for Reward Learning in Reinforcement Learning from Human Feedback
Abstract: As a key component of reinforcement learning from human feedback (RLHF), reward learning directly influences the final learned policy.
Unfortunately, existing theoretical estimation error bounds in reward learning rely on the complexity of the reward function class, unattainable optimal parameters, or non-zero constants independent of sample size, leading to uncomputable bounds that are meaningless for reward function classes with unknown complexity.
To address this issue,
this paper presents an analysis of parameter estimation for reward learning in RLHF under general function approximation, without imposing restrictions on the complexity of the reward function class.
A tighter bound is provided without non-zero terms independent of the sample size.
The optimal parameters are eliminated by applying linear approximation around the learned parameters.
Additionally, the relationship between the preference dataset and the learned parameters is further examined to demonstrate how to efficiently collect data based on the current learned parameters.
Inspired by the theoretical results,
a novel offline RLHF algorithm with parameter constraints is proposed, restricting parameters to the valid space defined by the dataset.
Furthermore, an online RLHF algorithm is proposed to iteratively optimize parameter learning and improve data collection efficiency.
This work provides a tighter bound than previous studies and offers theoretical guidance for online data collection under general function approximation.
URL: https://openreview.net/forum?id=EyMoFzI3Oz
---
Title: On Federated Compositional Optimization: Algorithms, Analysis, and Guarantees
Abstract: Compositional optimization (CO) has recently gained popularity due to its many applications in machine learning. The large-scale and distributed nature of data necessitates efficient federated learning (FL) algorithms for CO, but the compositional structure of the objective poses significant challenges. Current methods either rely on large batch gradients (which are impractical), require expensive computations, or suffer from suboptimal guarantees. To address these challenges, we propose efficient FedAvg-type algorithms for solving non-convex CO in the FL setting. We first theoretically establish that standard FedAvg fails to solve federated CO problems due to data heterogeneity, which amplifies bias in local gradient estimates. Our analysis shows that controlling this bias necessarily requires either {\em additional communication} or {\em additional structural assumptions}. To this end, we develop two algorithms for solving the federated CO problem. First, we propose \aname~that utilizes the compositional problem structure to design a communication strategy that allows FedAvg to converge. \aname~achieves a sample complexity of $\mathcal{O}(\epsilon^{-2})$ and communication complexity of $\mathcal{O}(\epsilon^{-3/2})$. Then we propose \anameds, a two-sided learning rate algorithm, that leverages an additional assumption to improve upon the communication complexity of \aname. \anameds~achieves the optimal $\mathcal{O}(\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-1})$ communication complexity. We corroborate our theoretical findings with empirical studies on large-scale CO problems.
URL: https://openreview.net/forum?id=4uRlbSNevR
---
Title: OT Score: An OT based Confidence Score for Source Free Unsupervised Domain Adaptation
Abstract: We address the computational and theoretical limitations of current distributional alignment methods for source-free unsupervised domain adaptation (SFUDA). In particular, we focus on estimating classification performance and confidence in the absence of target labels. Current theoretical frameworks for these methods often yield computationally intractable quantities and fail to adequately reflect the properties of the alignment algorithms employed. To overcome these challenges, we introduce the Optimal Transport (OT) score, a confidence metric derived from a novel theoretical analysis that exploits the flexibility of decision boundaries induced by Semi-Discrete Optimal Transport alignment. The proposed OT score is intuitively interpretable and theoretically rigorous. It provides principled uncertainty estimates for any given set of target pseudo-labels. Experimental results demonstrate that OT score outperforms existing confidence scores. Moreover, it improves SFUDA performance through training-time reweighting and provides a reliable, label-free proxy for model performance.
URL: https://openreview.net/forum?id=VQu8cWE9yJ
---
Title: Autofocus Retrieval: An Effective Pipeline for Multi-Hop Question Answering With Semi-Structured Knowledge
Abstract: In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. Yet most systems rely on only one of the two. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data.
In this work, we present Autofocus-Retriever (AF-Retriever), a modular framework for SKB-based, multi-hop question answering. It combines structural and textual retrieval through novel integration steps and optimizations, achieving the best zero- and one-shot results across all three STaRK QA benchmarks, which span diverse domains and evaluation metrics. AF-Retriever’s average first-hit rate surpasses the second-best method by 32.1\%.
Its performance is driven by (1) leveraging exchangeable large language models (LLMs) to extract entity attributes and relational constraints for both parsing and reranking the top-$k$ answers, (2) vector similarity search for ranking both extracted entities and final answers, (3) a novel incremental scope-expansion procedure that assembles a configurable number of candidates that best satisfy the given constraints for reranking, and (4) a hybrid retrieval strategy that reduces error susceptibility.
In summary, while constantly adjusting its focus like an optical autofocus, AF-Retriever delivers a configurable number of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked through four additional processing steps.
An ablation study and a detailed error analysis, including a comparison of three different LLM reranking strategies, provide component-level insights that are valuable for advancing the model and for enabling researchers and users to adapt, optimize, or extend its parts. The source code is publicly available at [URL will be placed here].
URL: https://openreview.net/forum?id=U2vqruHfQY
---
Title: Out-of-distribution generalization of deep-learning surrogates for 2D PDE-generated dynamics in the small-data regime
Abstract: Partial differential equations (PDEs) are a central tool for modeling the dynamics of physical, engineering, and materials systems, but high-fidelity simulations are often computationally expensive. At the same time, many scientific applications can be viewed as the evolution of spatially distributed fields, making data-driven forecasting of such fields a core task in scientific machine learning. In this work we study autoregressive deep-learning surrogates for two-dimensional PDE dynamics on periodic domains, focusing on generalization to out-of-distribution initial conditions within a fixed PDE and parameter regime and on strict small-data settings with at most $\mathcal{O}(10^2)$ simulated trajectories per system. We introduce a multi-channel U-Net with enforced periodic padding (me-UNet) that takes short sequences of past solution fields of a single representative scalar variable as input and predicts the next time increment. We evaluate me-UNet on five qualitatively different PDE families--- linear advection, diffusion, continuum dislocation dynamics, Kolmogorov flow, and Gray--Scott reaction--diffusion---and compare it to ViT, AFNO, PDE-Transformer, and KAN-UNet under a common training setup. Across all datasets, me-UNet matches or outperforms these more complex architectures in terms of field-space error, spectral similarity, and physics-based metrics for in-distribution rollouts, while requiring substantially less training time. It also generalizes qualitatively to unseen initial conditions and, e.g., reaches comparable performance on continuum dislocation dynamics with as few as $\approx 20$ training simulations. A data-efficiency study and Grad-CAM analysis further suggest that, in small-data periodic 2D PDE settings, convolutional architectures with inductive biases aligned to locality and periodic boundary conditions remain strong contenders for accurate and moderately out-of-distribution-robust surrogate modeling.
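The enforced periodic padding mentioned for me-UNet can be realized in PyTorch by padding circularly before each convolution, so filters see the wrapped boundary of the periodic domain. A minimal block showing only the padding idea (the full me-UNet architecture is not reproduced):
```python
import torch
import torch.nn.functional as F

class PeriodicConv2d(torch.nn.Module):
    """3x3 convolution with circular padding, matching periodic boundary conditions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=0)

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="circular")    # wrap left/right and top/bottom
        return self.conv(x)

fields = torch.randn(2, 4, 64, 64)        # a batch of 4 past solution fields on a 64x64 grid
print(PeriodicConv2d(4, 16)(fields).shape)             # torch.Size([2, 16, 64, 64])

padded = F.pad(fields, (1, 1, 1, 1), mode="circular")
print(torch.equal(padded[..., 0], padded[..., -2]))    # True: wrapped values come from the far edge
```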
URL: https://openreview.net/forum?id=TyW6Ar3wcD
---
Title: A Diagnostic Benchmark for Transformer Training Failures: Establishing Baseline Methods and Quantifying the Accuracy–Interpretability Tradeoff
Abstract: Our evaluation establishes quantitative baselines and reveals a fundamental trade-off in automated diagnostics: simple rule-based heuristics achieve 57.1% accuracy with full transparency, while machine learning classifiers reach 95.7% accuracy but sacrifice interpretability. This 38.6 percentage point gap quantifies a core tension: methods practitioners can trust and understand perform poorly, while methods that work well offer no insight into their reasoning. Transformer training failures incur significant costs through wasted computational resources and delayed research progress, yet diagnostic approaches have never been systematically evaluated before. Feature importance analysis shows that training dynamics, such as gradient norms and loss trajectories, contribute 48% of the diagnostic signal, indicating that practitioners should log these metrics more frequently than static configuration parameters. Validated against simulated expert behaviour, our framework handles uncertainty with a 30.3% abstention rate on ambiguous cases. This work establishes the first quantitative foundation for automated training diagnostics and identifies hybrid approaches that combine rule-based transparency with machine learning accuracy as a promising direction for bridging the interpretability gap. To facilitate repeatable advancement in this crucial but understudied field, all code, data, and assessment procedures are made public.
URL: https://openreview.net/forum?id=LH1vwKgvkM
---
Title: The Sparse Matrix-Based Random Projection: A Mean Absolute Deviation Analysis for Sparse Ternary Data
Abstract: In this paper, we investigate random projections based on sparse $\{0,\pm1\}$ matrices, which take sparse $\{0,\pm\mu\}$-ternary data as input. Such sparse ternary data, including $\{\pm\mu\}$-binary data as a special case, are widely used in machine learning, particularly for data quantization tasks, where they often match or even outperform their full-precision counterparts. For the projection of such ternary data, we analyze the mean absolute deviation (MAD), a metric that quantifies the dispersion of projected data points. In general, greater dispersion is expected to better capture the intrinsic variations in the original data, making it favorable for downstream classification tasks. Our analysis demonstrates that extremely sparse $\{0,\pm1\}$ matrices, such as those with only one or a few nonzero entries per row, can achieve large MAD values. By employing such sparse matrices, we indeed obtain favorable classification performance on the projected data. These highly sparse matrix structures suggest that substantial computational savings can be realized in random projection.
URL: https://openreview.net/forum?id=D9muB8ArqS
---
Title: Measuring Fine-Grained Relatedness in Multitask Learning via Data Attribution
Abstract: Measuring task relatedness and mitigating negative transfer remain critical open challenges in Multitask Learning (MTL). This work extends data attribution---which quantifies the influence of individual training data points on model predictions---to the MTL setting for measuring task relatedness. We propose the MultiTask Influence Function (MTIF), a method that adapts influence functions to MTL models with hard or soft parameter sharing. Compared to conventional task relatedness measurements, MTIF provides a fine-grained, instance-level relatedness measure beyond the entire-task level. This fine-grained relatedness measure enables a data selection strategy to effectively mitigate negative transfer in MTL. Through extensive experiments, we demonstrate that the proposed MTIF efficiently and accurately approximates the performance of models trained on data subsets. Moreover, the data selection strategy enabled by MTIF consistently improves model performance in MTL. Our work establishes a novel connection between data attribution and MTL, offering an efficient and fine-grained solution for measuring task relatedness and enhancing MTL models.
URL: https://openreview.net/forum?id=zIDGm96xwg
---
Title: Analyzing Neural Network Information Flow Using Differential Geometry
Abstract: This paper provides a fresh view of the neural network (NN) data flow problem, i.e., identifying the NN connections that are most important for the performance of the full model, through the lens of graph theory. Understanding the NN data flow provides a tool for symbolic NN analysis, e.g., robustness analysis or model repair. Unlike the standard approach to NN data flow analysis, which is based on information theory, we employ the notion of graph curvature, specifically Ollivier-Ricci curvature (ORC). The ORC has been successfully used to identify important graph edges in various domains such as road traffic analysis, biological and social networks. In particular, edges with negative ORC are considered bottlenecks and as such are critical to the graph’s overall connectivity, whereas positive-ORC edges are not essential. We use this intuition for the case of NNs as well: we 1) construct a graph induced by the NN structure and introduce the notion of neural curvature (NC) based on the ORC; 2) calculate curvatures based on activation patterns for a set of input examples; 3) aim to demonstrate that NC can indeed be used to rank edges according to their importance for the overall NN functionality. We evaluate our method through pruning experiments and show that removing negative-ORC edges quickly degrades the overall NN performance, whereas positive-ORC edges have little impact. The proposed method is evaluated on a variety of models trained on three image datasets, namely MNIST, CIFAR-10 and CIFAR-100. The results indicate that our method can identify a larger number of unimportant edges as compared to state-of-the-art pruning methods.
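For readers unfamiliar with it, the Ollivier-Ricci curvature of an edge compares the Wasserstein-1 distance between the neighbour distributions at its endpoints with the edge length; negative values mark bottleneck edges. A self-contained sketch on a plain graph, using uniform neighbour measures and unit edge lengths (the paper's neural-curvature construction over activation patterns is not reproduced):
```python
import networkx as nx
import numpy as np
from scipy.optimize import linprog

def ollivier_ricci_curvature(G, u, v):
    """Curvature of edge (u, v): 1 - W1(m_u, m_v), where m_x is uniform over the
    neighbours of x and the ground cost is the shortest-path distance."""
    Nu, Nv = list(G.neighbors(u)), list(G.neighbors(v))
    mu = np.full(len(Nu), 1.0 / len(Nu))
    mv = np.full(len(Nv), 1.0 / len(Nv))
    D = np.array([[nx.shortest_path_length(G, a, b) for b in Nv] for a in Nu], float)
    # Transport LP over the flattened coupling P: row sums = mu, column sums = mv.
    nU, nV = len(Nu), len(Nv)
    A_eq = np.zeros((nU + nV, nU * nV))
    for i in range(nU):
        A_eq[i, i * nV:(i + 1) * nV] = 1.0
    for j in range(nV):
        A_eq[nU + j, j::nV] = 1.0
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, mv]),
                  bounds=(0, None), method="highs")
    return 1.0 - res.fun                                  # edge length is 1

G = nx.karate_club_graph()
print({e: round(ollivier_ricci_curvature(G, *e), 3) for e in list(G.edges())[:5]})
```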
URL: https://openreview.net/forum?id=kwACVY73Ug
---
Title: The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks
Abstract: The success of deep neural networks largely depends on the statistical structure of the training data. While learning dynamics and generalization on isotropic data are well-established, the impact of pronounced anisotropy on these crucial aspects is not yet fully understood. We examine the impact of data anisotropy, represented by a spiked covariance structure, a canonical yet tractable model, on the learning dynamics and generalization error of a two-layer linear network in a linear regression setting. Our analysis reveals that the learning dynamics proceed in two distinct phases, governed initially by the input-output correlation and subsequently by other principal directions of the data structure. Furthermore, we derive an analytical expression for the generalization error, quantifying how the alignment of the spike structure of the data with the learning task improves performance. Our findings offer deep theoretical insights into how data anisotropy shapes the learning trajectory and final performance, providing a foundation for understanding complex interactions in more advanced network architectures.
URL: https://openreview.net/forum?id=pHDSgtDDez
---
Title: Asymptotically and Minimax Optimal Regret Bounds for Multi-Armed Bandits with Abstention
Abstract: We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic innovation: \emph{abstention}. In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to {\em abstain} from accepting the stochastic instantaneous reward before observing it. When opting for abstention, the agent either suffers a fixed regret or gains a guaranteed reward. This added layer of complexity naturally prompts the key question: can we develop algorithms that are both computationally efficient and asymptotically and minimax optimal in this setting? We answer this question in the affirmative by designing and analyzing algorithms whose regrets meet their corresponding information-theoretic lower bounds. Our results offer valuable quantitative insights into the benefits of the abstention option, laying the groundwork for further exploration in other online decision-making problems with such an option. Extensive numerical experiments validate our theoretical results, demonstrating that our approach not only advances theory but also has the potential to deliver significant practical benefits.
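An illustrative (not the paper's) strategy for this setting: a UCB learner that abstains and banks a guaranteed reward g whenever even the most optimistic arm's upper confidence bound falls below g. This only shows the problem interface; the paper derives asymptotically and minimax optimal algorithms.
```python
import numpy as np

def ucb_with_abstention(means, g=0.5, horizon=5000, seed=0):
    """UCB1 that pulls an arm only when its upper confidence bound exceeds the
    guaranteed abstention reward g; otherwise it abstains and collects g."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts, sums, total = np.zeros(k), np.zeros(k), 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                                    # pull each arm once to start
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
            if ucb[arm] < g:                               # not worth the risk: abstain
                total += g
                continue
        reward = float(rng.random() < means[arm])          # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / horizon

print(round(ucb_with_abstention([0.05, 0.10, 0.15], g=0.5), 3))   # mostly abstains once estimates settle
print(round(ucb_with_abstention([0.05, 0.10, 0.70], g=0.5), 3))   # mostly plays the good arm
```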
URL: https://openreview.net/forum?id=AYp5zOcFdA
---
Title: Soft Preference Optimization: Aligning Language Models to Expert Distributions
Abstract: Preference optimization methods such as DPO often yield aligned models that are overly deterministic, reducing output diversity and increasing the risk of mode collapse. This can limit downstream applications that benefit from multiple plausible outputs, such as reasoning and search. We propose Soft Preference Optimization (SPO), a reward-model-free algorithm that controls entropy of the aligned model through a ``softness'' parameter. SPO minimizes a preference-based loss together with a global KL regularization term, which helps prevent unwanted distribution shifts outside the preference dataset. While the method does not rely on any reward model assumption, we provide theoretical guarantees that under a Bradley–Terry assumption, it converges to a softmax distribution over the expert rewards. We present the methodology, theoretical analysis, and comparative advantages in alignment precision and output diversity.
URL: https://openreview.net/forum?id=EUPIcAkrSR
---
Title: RPATH: Explaining Time Series Mixture of Experts Routing via Ensemble Consensus and Structural Robustness
Abstract: Mixture-of-Experts (MoE) architectures achieve strong performance in time series forecasting through sparse expert activation, but understanding \textit{why} specific experts are selected remains challenging. We present RPATH (Routing Pathway Analysis for Temporal Hierarchies), a post-hoc explainability framework for time series MoE models that combines temporal saliency mapping with counterfactual generation. Evaluating on Time-MoE-50M across 300 expert-sample pairs, we discover two properties of the routing architecture: (1) \textit{Ensemble Consensus}, where experts at different layers independently converge on the same critical temporal windows (mean saliency Intersection over Union (IoU) = 0.677), rather than developing distinct specializations; and (2) \textit{Structural Robustness}, characterized by a 300-fold ``Stability Gap'' where gentle perturbations alter routing in only 0.3\% of cases while aggressive perturbations succeed in 99.7\%, indicating that routing decisions reflect structural anchors rather than superficial signal characteristics. Together, these findings demonstrate that Time-MoE achieves reliable forecasting through \textit{Ensemble Redundancy}: multiple experts verify the same structural features, providing consensus that is insensitive to noise but responsive to fundamental signal changes. Our framework provides practitioners with tools to visualize expert attention, identify critical input regions, and quantify routing stability for deployed MoE models.
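The Ensemble Consensus measurement above is an intersection-over-union between binarized temporal saliency maps of different experts. A minimal sketch (thresholding at the top quartile is an arbitrary illustrative choice, not necessarily the paper's):
```python
import numpy as np

def saliency_iou(sal_a, sal_b, keep_frac=0.25):
    """IoU between the top-`keep_frac` most salient time steps of two experts."""
    def top_mask(s):
        s = np.asarray(s)
        return s >= np.quantile(s, 1.0 - keep_frac)
    a, b = top_mask(sal_a), top_mask(sal_b)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

rng = np.random.default_rng(0)
base = rng.random(512)                        # shared structure in the input window
expert1 = base + 0.1 * rng.random(512)        # two experts' temporal saliency maps
expert2 = base + 0.1 * rng.random(512)
print(round(saliency_iou(expert1, expert2), 3))   # high IoU: consensus on the critical windows
```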
URL: https://openreview.net/forum?id=kwpDOqas2x
---
Title: From SQL to Knowledge Graphs: An LLM-Driven MultiAgent Approach with Data Schema Improvement
Abstract: RDBMS (Relational Database Management System) databases face several limitations, including slow execution with multi-hop queries and a lack of explainability through graphical interpretations. In contrast, graph databases offer a more intuitive and efficient data schema that enables faster execution on large datasets. Most existing RDBMS conversion pipelines focus on running traditional loading commands and relying on Cypher queries. However, the efficiency of using an LLM to generate an effective graph data schema, significantly reducing the ambiguity of the graph database, remains underexplored in the current research literature. This paper presents a novel algorithm that bridges RDBMS and graph databases by using an LLM-powered ETL agent to standardize table and column names before saving them to the Data Mart. A Multi-Agent System generates a looping discussion between ETL, Analyzer, and Graph agents to optimize the final design through an iterative process of suggesting and scoring the graph database schema. We ensure that the final graph database meets three criteria before being accepted for data conversion: Accuracy, Groundedness, and Faithfulness. This system demonstrates an effective pipeline to automatically convert a tabular database into a graph database through a comprehensive end-to-end process. Our study highlights notable efficiency gains from using the converted graph database, evaluated on 1,081 samples of a BFSI dataset across three levels of complexity (easy, medium, and hard). Specifically, CypherAgent achieves an 85.6% accuracy for Q&A tasks using the graph database, which is 12.12% higher than the accuracy achieved by an SQLAgent on the PostgreSQL RDBMS across all queries. Additionally, the graph database demonstrates faster performance, reducing latency by approximately three times. For easy, medium, and hard queries, the graph database attains accuracies of 90.43%, 81.98%, and 80.06%, respectively, surpassing the RDBMS by 17.8%, 4.2%, and 11.0%.
URL: https://openreview.net/forum?id=HYu0dGmj5x
---
Title: On Preference Optimization in Large Language Models Under Pure Semantic Preferences
Abstract: Large language models (LLMs) are typically aligned with human preferences through methods such as direct preference optimization (DPO). While empirically successful, these approaches face well-known limitations, including length bias, reward hacking, binary preference assumptions, and the aggregation of heterogeneous preferences into a single scalar signal. In this work, we take an inverse perspective: rather than attempting to resolve these issues directly, we investigate an idealized setting, which we call the pure semantic preference scenario, where such confounding factors are absent. To formalize this setting, we decompose the log-likelihood preference gap between two semantically equivalent generations into three additive components: a length alignment gap, a syntactic alignment gap, and a semantic alignment gap, and study the regime in which the length and syntactic gaps are controlled to be zero, so that observed preferences reflect semantics alone. We show that even in this idealized setting, existing alignment methods still do not fully capture the preference. Our analysis further reveals that (i) on-policy algorithms align more effectively, (ii) models trained without an explicit reference model perform better, and (iii) preference-model-based approaches consistently outperform reward-model-based approaches. Finally, motivated by these observations, we propose a lightweight preference-matching optimization (PMO) with a closed-form optimum that is well-suited to the pure semantic setting. Experiments on both practical and idealized settings demonstrate performance comparable to standard alignment baselines in the practical setting, while yielding clearer theoretical interpretation and improved results in the pure semantic setting.
URL: https://openreview.net/forum?id=Zu0OoTeku2
---
Title: Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback
Abstract: Large text-to-video models hold immense potential for a wide range of downstream applications. However, they struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. In this work, we investigate the use of feedback to enhance the quality of object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively overcome movement misalignment and produce realistic object interactions? We first point out that offline RL fine-tuning algorithms for text-to-video models can be shown to be equivalent when derived from a unified probabilistic objective. This perspective highlights that no method is algorithmically dominant in principle; rather, what matters is the properties of the reward and the data. While human feedback is less scalable, vision-language models can perceive video scenes much as humans do. We then propose leveraging vision-language models to provide perceptual feedback specifically tailored to object dynamics in videos. Compared to popular video quality metrics measuring alignment or dynamics, the experiments demonstrate that our approach with binary AI feedback drives the most significant improvements in the quality of interaction scenes in video, as confirmed by AI, human, and quality-metric evaluations. Notably, we observe substantial gains when using signals from vision-language models, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
URL: https://openreview.net/forum?id=Ys1G6BQdzd
---
Title: LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers
Abstract: Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or to establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, showcasing generalizability across diverse models, tasks, and datasets.
URL: https://openreview.net/forum?id=qvI35hkpOO
---
Title: Heterogeneous Matrix Factorization: When Features Differ by Datasets
Abstract: In myriad statistical applications, data are collected from related but heterogeneous sources. These sources share some commonalities while containing idiosyncratic characteristics. One of the most fundamental challenges in such scenarios is to recover the shared and source-specific factors at scale. Despite the existence of a few heuristic approaches, a scalable algorithm with theoretical guarantees has yet to be established.
In this paper, we tackle the problem by proposing a method called Heterogeneous Matrix Factorization (HMF) to separate the shared and unique factors for a class of problems. HMF maintains the orthogonality between the shared and unique factors by leveraging an invariance property in the objective. The algorithm is easy to implement and intrinsically distributed. On the theoretical side, we show that for the squared-error loss, HMF converges to optimal solutions that are close to the ground truth.
HMF can be integrated with auto-encoders to learn nonlinear feature mappings. Through a variety of case studies, we showcase HMF's benefits and applicability in video segmentation, time-series feature extraction, and recommender systems.
URL: https://openreview.net/forum?id=1BUB0I3Obx
---
Title: TriggerCraft: A Framework for Enabling Scalable Physical Backdoor Dataset Generation with Generative Models
Abstract: Backdoor attacks, representing an emerging threat to the integrity of deep neural networks, have received significant attention due to their ability to compromise deep learning systems covertly. While numerous backdoor attacks occur within the digital realm, their practical implementation in real-world prediction systems remains limited and vulnerable to disturbances in the physical world.
Consequently, this limitation has led to the development of physical backdoors, where trigger objects manifest as physical entities within the real world.
However, creating a requisite dataset to study physical backdoors is a daunting task. This hinders backdoor researchers and practitioners from studying such backdoors, leading to stagnating research progress. This paper presents a framework, named TriggerCraft, that empowers researchers to effortlessly create a massive physical backdoor dataset with generative modeling. Particularly, TriggerCraft involves three automatic modules: suggesting suitable physical triggers, generating poisoned candidate samples (either by synthesizing new samples or editing existing clean samples), and finally selecting only the most plausible ones. As such, it effectively mitigates the perceived complexity associated with creating a physical backdoor dataset, converting it from a daunting task into an attainable objective.
Extensive experimental results show that datasets created by TriggerCraft yield observations similar to those from their real physical-world counterparts in terms of both attacks and defenses, exhibiting properties similar to those reported in previous physical backdoor studies. This paper offers researchers a valuable toolkit for advancing the frontier of physical backdoors, all within the confines of their laboratories.
URL: https://openreview.net/forum?id=3FlCGLMtxT
---
Title: Iterative Compositional Data Generation for Robot Control
Abstract: Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
URL: https://openreview.net/forum?id=cASorO1kiy
---
Title: AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
Abstract: Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel efficient method, namely AccVideo, that reduces the number of inference steps to accelerate video diffusion models using a synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which avoids relying on uninformative data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to efficiently learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each intermediate diffusion timestep, we introduce an adversarial training strategy to align the intermediate distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves an 8.5x improvement in generation speed compared to the teacher model while maintaining comparable performance. Furthermore, our method is compatible with various pretrained models. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5-second, 720x1280, 24fps videos.
URL: https://openreview.net/forum?id=5ntdEzTTa2
---
Title: The 2025 Foundation Model Transparency Index
Abstract: Foundation model developers are among the world’s most important companies. As these companies become increasingly consequential, how do their transparency practices evolve? The 2025 Foundation Model Transparency Index is the third edition of an annual effort to characterize and quantify the transparency of foundation model developers. The 2025 FMTI introduces new indicators related to data acquisition, usage data, and monitoring and evaluates companies like Alibaba, DeepSeek, and xAI for the first time. The 2024 FMTI reported that transparency was improving, but the 2025 FMTI finds this progress has deteriorated: the average score out of 100 fell from 58 in 2024 to 40 in 2025. Companies are most opaque about their training data and training compute as well as the post-deployment usage and impact of their flagship models. While companies tend to disclose evaluations of model capabilities and risks, limited methodological transparency, third-party involvement, reproducibility, and reporting of train-test overlap pose challenges. In spite of this general trend, IBM stands out as a positive outlier, scoring 95, in contrast to the lowest scorers, xAI and Midjourney, at just 14. Several groups of companies score higher than the mean: open model developers, enterprise-focused B2B companies, companies that prepare their own transparency reports, and signatories to the EU AI Act General Purpose-AI Code of Practice. The five members of the Frontier Model Forum we score end up in the middle of the Index: we posit that major companies aim to avoid particularly low rankings but also lack incentives to be highly transparent. As policymakers around the world increasingly mandate certain types of transparency, this work reveals the current state of transparency for foundation model developers, how it may change given newly enacted policy, and where more aggressive policy interventions are necessary to address critical information deficits.
URL: https://openreview.net/forum?id=1jT253Xtyf
---
Title: Test-Time Adaptation for Unsupervised Combinatorial Optimization
Abstract: Unsupervised neural combinatorial optimization (NCO) enables learning powerful solvers without access to ground-truth solutions. Existing approaches fall into two disjoint paradigms: models trained for generalization across instances, and instance-specific models optimized independently at test time. While the former are efficient during inference, they lack effective instance-wise adaptability; the latter are flexible but fail to exploit learned inductive structure and are prone to poor local optima. This motivates the central question of our work: how can we leverage the inductive bias learned through generalization while unlocking the flexibility required for effective instance-wise adaptation? We first identify a challenge in bridging these two paradigms: generalization-focused models often constitute poor warm starts for instance-wise optimization, potentially underperforming even randomly initialized models when fine-tuned at test time. To resolve this incompatibility, we propose TACO, a model-agnostic test-time adaptation framework that unifies and extends the two existing paradigms for unsupervised NCO. TACO applies strategic warm-starting to partially relax trained parameters while preserving inductive bias, enabling rapid and effective unsupervised adaptation. Crucially, compared to naively fine-tuning a trained generalizable model or optimizing an instance-specific model from scratch, TACO achieves better solution quality while incurring negligible additional computational cost. Experiments on canonical CO problems, Minimum Vertex Cover and Maximum Clique, demonstrate the effectiveness and robustness of TACO across static, distribution-shifted, and dynamic combinatorial optimization problems, establishing it as a practical bridge between generalizable and instance-specific unsupervised NCO.
URL: https://openreview.net/forum?id=VVyGfRp4fG
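As a rough illustration of the strategic warm-starting idea in the abstract above, the sketch below freezes most of a trained model's parameters and adapts only a named subset on a single test instance with an unsupervised objective. The model, loss, and parameter names are hypothetical placeholders, not TACO's actual interface.

```python
# Illustrative sketch (not the authors' code): test-time adaptation of a trained
# unsupervised NCO model by relaxing only a subset of parameters, preserving the
# inductive bias stored in the frozen remainder.
import torch

def test_time_adapt(model, instance, unsup_loss, relax_keys=("head",), steps=50, lr=1e-3):
    """Fine-tune only parameters whose names contain one of `relax_keys`."""
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in relax_keys)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = unsup_loss(model(instance), instance)  # e.g. a relaxed CO objective
        loss.backward()
        opt.step()
    return model
```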
---
Title: Synergizing Deconfounding and Temporal Generalization For Time-series Counterfactual Outcome Estimation
Abstract: Estimating counterfactual outcomes from time‑series observations is crucial for effective decision-making, e.g. when to administer a life‑saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning marginal distributions of the treatments in latent space, SGA uses iterative treatment‑agnostic clustering to identify fine-grained sub‑treatment groups. Aligning these fine‑grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noise during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.
URL: https://openreview.net/forum?id=xuJH3BJiNu
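A minimal sketch of what Random Temporal Masking could look like in practice, assuming covariates arrive as a (batch, time, features) tensor and a fixed masking rate; the paper's exact masking schedule may differ.

```python
# Minimal RTM sketch: each covariate entry is independently replaced with
# Gaussian noise with probability `p` during training. Shapes and rate are
# illustrative assumptions.
import torch

def random_temporal_mask(x, p=0.2):
    """x: (batch, time, features) covariate tensor."""
    mask = torch.rand_like(x) < p
    noise = torch.randn_like(x)
    return torch.where(mask, noise, x)
```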
---
Title: Knowing How to Edit: Reliable Evaluation Signals for Diagnosing and Optimizing Prompts at Query Level
Abstract: Prompt optimization has become a central mechanism for eliciting strong performance from LLMs, and recent work has made substantial progress by proposing diverse prompt evaluation metrics and optimization strategies. Despite these advances, prompt evaluation and prompt optimization are often developed in isolation, limiting the extent to which evaluation can effectively inform prompt refinement. In this work, we study prompt optimization as a process guided by performance-relevant evaluation signals. To address the disconnect between evaluation and optimization, we propose an evaluation-instructed prompt optimization approach that explicitly connects prompt evaluation with query-dependent optimization. Our method integrates multiple complementary prompt quality metrics into a performance-reflective evaluation framework and trains an execution-free evaluator that predicts prompt quality directly from text, avoiding repeated model executions. These evaluation signals then guide prompt refinement in a targeted and interpretable manner. Empirically, the proposed evaluator achieves 83.7\% accuracy in predicting prompt performance. When incorporated into the optimization process, our approach consistently outperforms existing optimization baselines across eight benchmark datasets and three different backbone LLMs. Overall, our results demonstrate that reliable and efficient evaluation signals can serve as an effective foundation for robust and interpretable prompt optimization.
URL: https://openreview.net/forum?id=fKs3VWTj31
---
Title: DCD: Decomposition-based Causal Discovery from Autocorrelated and Non-Stationary Temporal Data
Abstract: Multivariate time series in domains such as finance, climate science, and healthcare often exhibit long-term trends, seasonal patterns, and short-term fluctuations, complicating causal inference under non-stationarity and autocorrelation. Existing causal discovery methods typically operate on raw observations, making them vulnerable to spurious edges and misattributed temporal dependencies. We introduce a decomposition-based causal discovery framework that separates each time series into trend, seasonal, and residual components and performs component-specific causal analysis. Trend components are assessed using stationarity tests, seasonal components using kernel-based dependence measures, and residual components using constraint-based causal discovery. The resulting component-level graphs are integrated into a unified multi-scale causal structure. This approach isolates long- and short-range causal effects, reduces spurious associations, and improves interpretability. Across extensive synthetic benchmarks and real-world climate data, our framework more accurately recovers ground-truth causal structure than state-of-the-art baselines, particularly under strong non-stationarity and temporal autocorrelation.
URL: https://openreview.net/forum?id=ohLsplTytO
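The decomposition step can be approximated with off-the-shelf tools; the sketch below uses an STL decomposition and an ADF stationarity check purely as illustrative stand-ins for the paper's component-specific analyses.

```python
# Sketch of trend/seasonal/residual decomposition followed by a component-level
# test. The period, test choice, and toy series are assumptions for illustration.
import numpy as np
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import adfuller

def decompose_series(series, period):
    res = STL(series, period=period).fit()
    return res.trend, res.seasonal, res.resid

# Toy series: a seasonal cycle (period 24) on top of a rising trend.
series = np.sin(np.linspace(0, 20 * np.pi, 240)) + np.linspace(0, 3, 240)
trend, seasonal, resid = decompose_series(series, period=24)

# Component-specific analysis, e.g. a stationarity check on the trend component.
adf_stat, p_value = adfuller(trend)[:2]
print(f"ADF p-value on trend component: {p_value:.3f}")
```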
---
Title: OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
Abstract: Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning–based TSC methods function as black boxes, providing little to no insight into their decisions. Although large language models (LLMs) could provide the needed interpretability through natural language reasoning, they face challenges such as limited memory and difficulty in deriving optimal policies from sparse environmental feedback. Existing TSC methods that apply reinforcement fine-tuning to LLMs face notable training instability and deliver only limited improvements over pretrained models. We attribute this instability to the long-horizon nature of TSC: feedback is sparse and delayed, most control actions yield only marginal changes in congestion metrics, and the resulting weak reward signals interact poorly with policy-gradient optimization. We introduce OracleTSC, which addresses these issues through: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental feedback, and (2) preventing policy degeneracy by maximizing the probability of the chosen answer, which promotes consistent decision-making across multiple responses. Experiments on the standard LibSignal benchmark demonstrate that our approach enables a compact model (LLaMA3-8B) to achieve substantial improvements in traffic flow, with a 75% reduction in travel time and 67% decrease in queue lengths over the pretrained baseline while preserving interpretability through natural language explanations. Furthermore, the method exhibits strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally distinct intersection with 17% lower travel time and 39% lower queue length, all without any additional finetuning for the target topology. These findings show that uncertainty-aware reward shaping could stabilize reinforcement fine-tuning and provide a new perspective for improving its effectiveness in TSC tasks.
URL: https://openreview.net/forum?id=WmJu5MkoQD
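One plausible reading of the reward hurdle is sketched below: environmental feedback is shifted by a calibrated threshold and weak signals are zeroed out before policy-gradient updates. The quantile-based calibration shown here is an assumption for illustration, not necessarily the paper's rule.

```python
# Hedged sketch of a "reward hurdle" that filters weak learning signals.
import numpy as np

def hurdle_rewards(rewards, threshold):
    shaped = np.asarray(rewards, dtype=float) - threshold
    shaped[shaped < 0.0] = 0.0   # drop feedback below the hurdle
    return shaped

history = [0.02, 0.05, -0.01, 0.40, 0.03]        # recent environmental feedback
threshold = np.quantile(history, 0.75)           # illustrative calibration rule
print(hurdle_rewards(history, threshold))
```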
---
Title: Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces
Abstract: Recent work has shown that transformer-based language models learn rich geometric structure in their embedding spaces, yet the presence of higher-level cognitive organization within these representations remains underexplored. In this work, we investigate whether sentence embeddings encode a graded, hierarchical structure aligned with human-interpretable cognitive or psychological attributes. We construct a dataset of 480 natural-language sentences annotated with both continuous energy scores (ranging from -5 to 5) and discrete tier labels spanning seven ordered consciousness-related cognitive categories. Using fixed sentence embeddings from multiple transformer models, we evaluate the recoverability of these annotations via linear and shallow nonlinear probes. Across models, both continuous energy scores and tier labels are reliably decodable by both linear and nonlinear probes, with nonlinear probes outperforming linear counterparts. To assess statistical significance, we conduct nonparametric permutation tests that randomize labels while preserving embedding geometry, finding that observed probe performance significantly exceeds chance under both regression and classification null hypotheses (p < 0.005). Qualitative analyses using UMAP visualizations and tier-level confusion matrices are consistent with these findings, illustrating a coherent low-to-high gradient and predominantly local (adjacent-tier) confusions in embedding space. Taken together, these results provide evidence that transformer embedding spaces exhibit a hierarchical geometric organization statistically aligned with our human-defined cognitive structure; while this work does not claim internal awareness or phenomenology, it demonstrates a systematic alignment between learned representation geometry and interpretable cognitive and psychological attributes, with potential implications for representation analysis, safety modeling, and geometry-based generation steering.
URL: https://openreview.net/forum?id=qKKqZAOJig
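The probing-plus-permutation-test protocol can be reproduced in outline as below, with random arrays standing in for sentence embeddings and energy scores; the probe type, fold count, and number of permutations are illustrative choices, not the paper's exact settings.

```python
# Minimal sketch: fit a linear probe on fixed embeddings, then compare its score
# to a null distribution from label shuffling that preserves embedding geometry.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(480, 384))       # stand-in for fixed sentence embeddings
y = rng.uniform(-5, 5, size=480)      # stand-in for continuous energy scores

observed = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()
null = [cross_val_score(Ridge(alpha=1.0), X, rng.permutation(y), cv=5,
                        scoring="r2").mean() for _ in range(200)]
p_value = (1 + sum(n >= observed for n in null)) / (1 + len(null))
print(f"observed R^2 = {observed:.3f}, permutation p = {p_value:.3f}")
```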
---
Title: Graph Generation via Temporal-Aware Biased Walks
Abstract: Some real networks keep a fixed structure (e.g., roads, sensors and their connections) while node or edge signals evolve over time. Existing graph generators either model topology changes (i.e., edge additions/deletions) or focus only on static graph properties (such as degree distributions or motifs), without considering how temporal signals shape the generated structure. Approaching the problem from an unconventional perspective, we introduce TANGEM, a generator for temporally attributed graphs that integrates a temporal similarity matrix into biased random walks, thereby coupling signals with structure to generate graphs that highlight patterns reflecting how nodes co-activate over time. We evaluate TANGEM using an approach that separates structural fidelity (clustering, spectral metrics) from downstream temporal consistency, allowing us to clearly isolate the impact of the topology generator itself. In time series benchmarks, TANGEM consistently outperforms strong baselines in structural metrics while remaining lightweight, learning from a single graph. These results show that adding temporal bias to structural sampling produces more realistic graphs and establishes TANGEM as a basis for future models that further integrate evolving signals and structure.
URL: https://openreview.net/forum?id=lDnMlhk3aw
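A toy sketch of a temporally biased walk in this spirit: transition weights multiply the fixed adjacency by a temporal similarity matrix (here, absolute signal correlations). The mixing rule and restart behaviour are assumptions for illustration, not TANGEM's exact sampler.

```python
# Biased random walk whose transition probabilities couple structure (adjacency)
# with temporal signal similarity.
import numpy as np

def biased_walk(adj, temporal_sim, start, length, rng):
    """adj: (n, n) binary adjacency; temporal_sim: (n, n) nonnegative similarities."""
    weights = adj * temporal_sim            # only walk along existing edges
    walk = [start]
    for _ in range(length - 1):
        w = weights[walk[-1]]
        if w.sum() == 0:                    # dead end: restart at a random node
            walk.append(rng.integers(len(adj)))
            continue
        walk.append(rng.choice(len(adj), p=w / w.sum()))
    return walk

rng = np.random.default_rng(0)
n = 6
adj = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(adj, 0)
signals = rng.normal(size=(n, 100))               # node signals over time
temporal_sim = np.abs(np.corrcoef(signals))       # temporal similarity matrix
print(biased_walk(adj, temporal_sim, start=0, length=10, rng=rng))
```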
---
Title: Adaptive multi-frame sampling for consistent zero-shot text-to-video editing
Abstract: Achieving convincing temporal coherence is a fundamental challenge in zero-shot text-to-video editing. To address this issue, this paper introduces AMAC (Adaptive Multi-frame sAmpling for Consistent zero-shot text-to-video editing), a novel method that effectively balances temporal consistency with detail preservation. Our approach proposes a theoretical framework with a fully adaptive sampling strategy that selects frames for joint processing using a pre-trained text-to-image diffusion model. By reformulating the sampling strategy as a stochastic permutation over frame indexes and constructing its distribution based on inter-frame similarities, we promote consistent processing of related content. This method demonstrates superior robustness against temporal variations and shot transitions, making it particularly well-suited for editing long dynamic video sequences, as validated through experiments on DAVIS and BDD100K datasets. Some examples of generated videos are available in the following anonymous repository https://anonymous.4open.science/r/AMAC-A406.
URL: https://openreview.net/forum?id=vcZ6qdbADL
---
Title: LinMU: Multimodal Understanding Made Linear
Abstract: Modern Vision-Language Models (VLMs) achieve impressive performance but are limited by the quadratic complexity of self-attention, which prevents their deployment on edge devices and makes their understanding of high-resolution images and long-context videos prohibitively expensive. To address this challenge, we introduce LinMU (Linear-complexity Multimodal Understanding), a VLM design that achieves linear complexity without using any quadratic-complexity modules while maintaining the performance of global-attention-based VLMs. LinMU replaces every self-attention layer in the VLM with the M-MATE block: a dual-branch module that combines a bidirectional state-space model for global context (Flex-MA branch) with localized Swin-style window attention (Local-Swin branch) for adjacent correlations. To transform a pre-trained VLM into the LinMU architecture, we propose a three-stage distillation framework that (i) initializes both branches with self-attention weights and trains the Flex-MA branch alone, (ii) unfreezes the Local-Swin branch and fine-tunes it jointly with the Flex-MA branch, and (iii) unfreezes the remaining blocks and fine-tunes them using LoRA adapters, while regressing on hidden states and token-level logits of the frozen VLM teacher. On MMMU, TextVQA, LongVideoBench, Video-MME, and other benchmarks, LinMU matches the performance of teacher models, yet reduces Time-To-First-Token (TTFT) by up to 2.7$\times$ and improves token throughput by up to 9.0$\times$ on minute-length videos. Ablations confirm the importance of each distillation stage and the necessity of the two branches of the M-MATE block. We also conduct distillation on various VLM backbones to validate the universality of LinMU. The proposed framework demonstrates that state-of-the-art multimodal reasoning can be achieved without quadratic attention, thus opening up avenues for long-context VLMs that can deal with high-resolution images and long videos.
URL: https://openreview.net/forum?id=6BYdTSNrab
---
Title: Process Reward Models That Think
Abstract: Step-by-step verifiers—also known as process reward models (PRMs)—are a key ingredient for test-time scaling, but training them requires expensive step-level supervision. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers—using only 1% of the process labels in PRM800K—across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation over subsets of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained with the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. This work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training.
URL: https://openreview.net/forum?id=FPVCb0WMuN
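Verifier-guided best-of-N selection, as used in the evaluation above, reduces to the small routine below; `generate_solutions` and `verifier_score` are hypothetical stand-ins for the generator and the verbalized step-wise verifier.

```python
# Hedged sketch of best-of-N selection with a generative verifier: score each
# candidate (e.g. via the probability the verifier assigns to a "correct"
# verdict after its verification chain-of-thought) and keep the best one.
def best_of_n(problem, generate_solutions, verifier_score, n=8):
    candidates = generate_solutions(problem, n)           # n sampled solutions
    scored = [(verifier_score(problem, c), c) for c in candidates]
    return max(scored, key=lambda sc: sc[0])[1]           # highest-scoring candidate
```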
---
Title: Learning Long-Range Representations with Equivariant Messages
Abstract: Machine learning interatomic potentials trained on first-principles reference data are becoming valuable tools for computational physics, biology, and chemistry. Equivariant message-passing neural networks, including transformers, achieve state-of-the-art accuracy but rely on cutoff-based graphs, limiting their ability to capture long-range effects such as electrostatics or dispersion, as well as electron delocalization. While long-range correction schemes based on inverse power laws of interatomic distances have been proposed, they are unable to communicate higher-order geometric information and are thus limited in applicability. To address this shortcoming, we propose the use of equivariant, rather than scalar, charges for long-range interactions, and design a graph neural network architecture, Lorem, around this long-range message passing mechanism. We consider several datasets specifically designed to highlight non-local physical effects, and compare short-range message passing with different receptive fields to invariant and equivariant long-range message passing.
Even though most approaches require careful, dataset-specific hyperparameter choices to work well, Lorem works consistently without such adjustments and achieves excellent benchmark performance.
URL: https://openreview.net/forum?id=pZI9e4SW9P
---
Title: AIMing for Standardised Explainability Evaluation in GNNs: A Framework and Case Study on Graph Kernel Networks
Abstract: Graph Neural Networks (GNNs) have advanced significantly in handling graph-structured data, but a comprehensive framework for evaluating explainability remains lacking. Existing evaluation frameworks primarily involve post-hoc explanations, and operate in the setting where multiple methods generate a suite of explanations for a single model. This makes comparison of explanations across models difficult. Evaluation of inherently interpretable models often targets a specific aspect of interpretability relevant to the model, but remains underdeveloped in terms of generating insight across a suite of measures. We introduce AIM, a comprehensive framework that addresses these limitations by measuring Accuracy, Instance-level explanations, and Model-level explanations. AIM is formulated with minimal constraints to enhance flexibility and facilitate broad applicability. Here, we use AIM in a pipeline, extracting explanations from inherently interpretable GNNs such as graph kernel networks (GKNs) and prototype networks (PNs), evaluating these explanations with AIM, identifying their limitations, and obtaining insights into their characteristics. Taking GKNs as a case study, we show how the insights obtained from AIM can be used to develop an updated model, xGKN, that maintains high accuracy while demonstrating improved explainability. Our approach aims to advance the field of Explainable AI (XAI) for GNNs, providing more robust and practical solutions for understanding and improving complex models.
URL: https://openreview.net/forum?id=onZkYXI7oe
---
Title: TCSurv: Time-based Clustering for Reliable Survival Analysis
Abstract: Survival analysis is critical in healthcare for predicting time-to-event outcomes such as disease progression or patient survival. While deep learning excels at capturing meaningful representations from complex clinical data and has improved performance in deep survival models, it inherently struggles with reliability and robustness, challenges that are especially significant when deploying these models in real-world clinical practice. Out-of-distribution (OOD) detection, designed to identify or flag samples that deviate from the training distribution, has become a key method for evaluating AI reliability across fields. This capability is especially important in clinical applications, where noisy or heterogeneous patient data can lead to incorrect assessments; yet, OOD detection remains underexplored and challenging in deep survival analysis due to the need to handle both censored and observed samples, which are unique to this domain. In this study, we address this critical gap by introducing TCSurv, a novel time-based clustering approach for survival analysis that handles both observed and censored samples for robust OOD detection. TCSurv initializes cluster centers using in-distribution data, creating time-specific clusters that anchor model predictions for both observed and censored samples. Experiments on real-world clinical data, including Alzheimer’s dementia progression, and benchmark medical imaging datasets demonstrate that TCSurv effectively distinguishes OOD samples without compromising survival performance compared to existing deep survival analysis frameworks.
URL: https://openreview.net/forum?id=d2zkcIC69b
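In outline, cluster-anchored OOD scoring can look like the sketch below: centres are fit on in-distribution representations, and a test sample's score is its distance to the nearest centre. The clustering and distance choices here are illustrative assumptions, not TCSurv's exact construction.

```python
# Minimal cluster-based OOD scoring sketch on synthetic embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
Z_in = rng.normal(size=(500, 16))                  # in-distribution embeddings
centers = KMeans(n_clusters=8, n_init=10, random_state=0).fit(Z_in).cluster_centers_

def ood_score(z):
    """Distance to the nearest in-distribution cluster centre."""
    return np.min(np.linalg.norm(centers - z, axis=1))

z_id = rng.normal(size=16)
z_ood = rng.normal(size=16) + 6.0                  # shifted, out-of-distribution
print(ood_score(z_id), ood_score(z_ood))
```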
---
Title: Understanding Emotion in Discourse: Recognition Insights and Linguistic Patterns for Generation
Abstract: Despite strong recent progress in Emotion Recognition in Conversation (ERC), two gaps remain: we still lack a clear understanding of which modeling choices materially affect performance, and we have limited linguistic analysis that links recognition findings to actionable cues for generation. We address both gaps via a systematic study on IEMOCAP.
For recognition, we conduct controlled ablations with 10 random seeds and paired tests over seeds (with correction for multiple comparisons), yielding three findings. First, conversational context is the dominant factor: performance saturates quickly, with roughly 90% of the gain observed within our context sweep achieved using only the most recent 10--30 preceding turns (depending on the label set). Second, hierarchical sentence representations improve utterance-only recognition ($K{=}0$), but the benefit vanishes once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, a simple integration of an external affective lexicon (SenticNet) does not improve results, consistent with pretrained encoders already capturing much of the affective signal needed for ERC. Under a strictly causal (past-only) setting, our simple models attain strong performance (82.69% 4-way; 67.07% 6-way weighted F1), indicating that competitive accuracy is achievable without access to future turns.
For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position within the utterance ($p < 0.0001$). In particular, "Sad" utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28--32%), aligning with accounts that link left-periphery markers to active discourse management. This pattern is consistent with our recognition results, where "Sad" benefits most from conversational context (+22%p), suggesting that sadness often relies more on discourse history than on overt pragmatic signaling in the utterance itself.
URL: https://openreview.net/forum?id=zCFQiJT7XN
---
Title: $\texttt{DecompSR}$: A Dataset for Decomposed Analyses of Compositional Multihop Spatial Reasoning
Abstract: We introduce $\texttt{DecompSR}$, decomposed spatial reasoning, a large benchmark dataset (over 5M datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation of $\texttt{DecompSR}$ allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). $\texttt{DecompSR}$ has been built procedurally in a manner which makes it correct by construction, which is independently verified using a symbolic solver to guarantee the correctness of the dataset. $\texttt{DecompSR}$ is comprehensively benchmarked across a host of Large Language Models (LLMs) where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks whereas they are more robust to linguistic variation. $\texttt{DecompSR}$ provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.
URL: https://openreview.net/forum?id=P81p2nTuvA
---
Title: Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings
Abstract: Selective state-space models excel at long-sequence modeling, but their capacity for language representation in complex hierarchical reasoning remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this, we propose Hierarchical Mamba (HiM), integrating efficient Mamba2 with hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincaré ball or Lorentzian manifold with "learnable" curvature, optimized with a hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. Experimental results show that both HiM variants effectively capture hierarchical relationships across four linguistic and medical datasets, surpassing Euclidean baselines, with HiM-Poincaré providing fine-grained distinctions with higher h-norms, while HiM-Lorentz offers more stable, compact, and hierarchy-preserving embeddings.
URL: https://openreview.net/forum?id=a3g13FKzct
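The hyperbolic projection step described above is commonly implemented via the exponential map at the origin of the Poincaré ball with a learnable curvature; the sketch below shows that general recipe under toy dimensions and is not the authors' code.

```python
# Illustrative projection of Euclidean encoder outputs onto the Poincare ball.
import torch

def expmap0(v, c):
    """Exponential map at the origin of the Poincare ball with curvature c > 0."""
    sqrt_c = c.clamp_min(1e-6).sqrt()
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x, y, c):
    """Geodesic distance on the Poincare ball with curvature c."""
    sqrt_c = c.clamp_min(1e-6).sqrt()
    num = 2 * ((x - y) ** 2).sum(-1)
    den = (1 - c * (x ** 2).sum(-1)) * (1 - c * (y ** 2).sum(-1))
    return (1 / sqrt_c) * torch.acosh(1 + c * num / den.clamp_min(1e-6))

c = torch.nn.Parameter(torch.tensor(1.0))   # learnable curvature
h = 0.05 * torch.randn(4, 128)              # stand-in for sequence-model features
z = expmap0(h, c)                           # hierarchy-aware hyperbolic embeddings
print(poincare_dist(z[0], z[1], c))
```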
---
Title: The SMOTE Paradox: Why a 92% Baseline Collapsed to 6%—A Systematic Review of 821 Papers in Imbalanced Learning (2020–2025)
Abstract: Class imbalance pervades production systems—fraud detection, medical diagnosis, industrial monitoring—yet handling it effectively remains challenging. For two decades, SMOTE has been the default solution, but practitioners increasingly abandon it at scale. We investigate this disconnect through a systematic review of 821 DBLP papers (2020–2025) and bibliometric analysis of 4,985 Scopus records. Our analysis reveals the SMOTE Paradox: only 6% of high-impact papers successfully executed SMOTE at full scale due to memory exhaustion or preprocessing bottlenecks. The field has fragmented, with 30% adopting generative models, 30% using cost-sensitive losses, and 40% employing hybrid approaches. Three factors explain SMOTE's decline. First, $O(N \cdot N_{\text{min}} \cdot d)$ nearest-neighbor search requires 1.28 TB of memory for typical modern datasets. Second, linear interpolation produces off-manifold artifacts scaling as $\sqrt{d}$ in high dimensions. Third, CPU-bound preprocessing creates friction with GPU-centric training pipelines. We validate these findings through controlled experiments across seven datasets (196 trials, imbalance ratios 1.1:1 to 129:1). Statistical testing reveals no significant ROC-AUC differences between SMOTE and cost-sensitive baselines (Friedman $p=0.907$), despite SMOTE incurring 2.7× computational overhead. However, cost-sensitive methods severely degrade at extreme imbalance (>40:1).
URL: https://openreview.net/forum?id=Rd2ZIA5AnN
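The comparison underlying these findings can be set up in a few lines with standard libraries; the synthetic dataset, imbalance ratio, and model choices below are illustrative, not the paper's experimental protocol.

```python
# Sketch: cost-sensitive baseline (class weighting) versus SMOTE oversampling.
# Requires scikit-learn and imbalanced-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

cost_sensitive = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
smote_model = LogisticRegression(max_iter=1000).fit(X_sm, y_sm)

for name, model in [("cost-sensitive", cost_sensitive), ("SMOTE", smote_model)]:
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: ROC-AUC = {auc:.3f}")
```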
---
Title: FIT-GNN: Faster Inference Time for GNNs that ‘FIT’ in Memory Using Coarsening
Abstract: Scalability of Graph Neural Networks (GNNs) remains a significant challenge. To tackle this, methods like coarsening, condensation, and computation trees are used to train on a smaller graph, resulting in faster computation. Nonetheless, prior research has not adequately addressed the computational costs during the inference phase. This paper presents a novel approach to improve the scalability of GNNs by reducing computational burden during the inference phase using graph coarsening. We demonstrate two different methods -- Extra Nodes and Cluster Nodes. Our study extends the application of graph coarsening for graph-level tasks, including graph classification and graph regression. We conduct extensive experiments on multiple benchmark datasets to evaluate the performance of our approach. Our results show that the proposed method achieves orders of magnitude improvements in single-node inference time compared to traditional approaches. Furthermore, it significantly reduces memory consumption for node and graph classification and regression tasks, enabling efficient training and inference on low-resource devices where conventional methods are impractical. Notably, these computational advantages are achieved while maintaining competitive performance relative to baseline models.
URL: https://openreview.net/forum?id=g7r7y2I7Sz
---
Title: Information-Theoretic State Variable Selection for Reinforcement Learning
Abstract: Identifying the most suitable variables to represent the state is a fundamental challenge in Reinforcement Learning (RL). These variables must efficiently capture the information necessary for making optimal decisions. In order to address this problem, in this paper, we introduce the Transfer Entropy Redundancy Criterion (TERC), an information-theoretic criterion, which determines if there is \textit{entropy transferred} from state variables to actions during training. We define an algorithm based on TERC that provably excludes variables from the state that do not affect the agent's policy during learning. Our approach is policy-dependent, making it agnostic to the underlying learning algorithm. Consequently, we use our method to enhance efficiency across three different algorithm classes (represented by tabular Q-learning, Actor-Critic, and Proximal Policy Optimization (PPO)) in a variety of environments. Furthermore, to highlight the differences between the proposed methodology and the current state-of-the-art feature selection approaches, we present a series of controlled experiments on synthetic data, before generalizing to real-world decision-making tasks. We also introduce a representation of the problem that compactly captures the transfer of information from state variables to actions as Bayesian networks.
URL: https://openreview.net/forum?id=J0ad21E0vX
---
Title: Temporal Variational Implicit Neural Representations
Abstract: We introduce Temporal Variational Implicit Neural Representations (TV-INRs), a probabilistic framework for modeling irregular multivariate time series that enables efficient and accurate individualized imputation and forecasting. By integrating implicit neural representations with latent variable models, TV-INRs learn distributions over time-continuous generator functions conditioned on signal-specific covariates.
Unlike existing approaches that require extensive training, fine-tuning or meta-learning, our method achieves accurate individualized predictions through a single forward pass. Our experiments demonstrate that with a single TV-INRs instance, we can accurately solve diverse imputation and forecasting tasks, offering a computationally efficient and scalable solution for real-world applications.
TV-INRs perform particularly well in low-data regimes, where on several datasets they achieve substantially lower imputation error, including order-of-magnitude improvements.
URL: https://openreview.net/forum?id=1CGfvw4ySe
---
Title: Automatic Selection of the Nugget for Linear System Solves in Machine Learning
Abstract: Rapid prototyping of algorithms is a critical step in modern machine learning. Most algorithms exploit linear algebra, creating a need for lightweight numerical routines which -- while potentially sub-optimal for the task at hand -- can be rapidly implemented. For the numerical solution of ill-conditioned linear systems of equations, the standard solution for prototyping is Tikhonov-regularised inversion using a nugget. However, selection of the size of the nugget is often difficult, and the use of data-adaptive procedures precludes automatic differentiation, introducing instabilities into end-to-end training. Further, while data-adaptive procedures perform multiple linear solves to select the size of the nugget, only the result of one such solve is returned, which we argue is wasteful. This paper aims to resolve the above difficulties, presenting `autonugget`, a `Python` package for automatic and stable numerical solution of linear systems suitable for rapid prototyping, and fully compatible with automatic differentiation using `JAX`. A distinguishing feature of `autonugget` is the ability to combine multiple linear solves using Richardson extrapolation, improving in accuracy over approximations based on a single nugget.
URL: https://openreview.net/forum?id=fqbkenUpRa
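The combination idea can be illustrated independently of the `autonugget` API: solve the regularised system at two nugget sizes and Richardson-extrapolate in the nugget. The numpy sketch below shows that idea under the assumption that the unregularised system is (badly conditioned but) solvable; it does not reproduce the package itself.

```python
# Illustrative sketch: since x(eps) = (A + eps*I)^{-1} b = x(0) + c*eps + O(eps^2),
# combining two nugget solves cancels the leading-order nugget bias.
import numpy as np

def nugget_solve(A, b, eps):
    """Tikhonov-regularised solve of A x = b with nugget eps."""
    return np.linalg.solve(A + eps * np.eye(A.shape[0]), b)

def richardson_solve(A, b, eps):
    """Richardson extrapolation in the nugget: 2*x(eps) - x(2*eps)."""
    return 2 * nugget_solve(A, b, eps) - nugget_solve(A, b, 2 * eps)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
A = Q @ np.diag(np.logspace(0, -3, 50)) @ Q.T      # ill-conditioned SPD matrix
x_true = rng.normal(size=50)
b = A @ x_true
eps = 1e-4
print(np.linalg.norm(nugget_solve(A, b, eps) - x_true))      # single-nugget error
print(np.linalg.norm(richardson_solve(A, b, eps) - x_true))  # extrapolated error
```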
---
Title: Persistent homology for time series: a selective review
Abstract: Over the last ten years, persistent homology has been increasingly used to analyze the structure and shape of various types of data, including time series. This article is a review of persistent homology applied to (univariate or multivariate) time series data. We review 84 articles that apply methods involving persistent homology to time series data, published between 2014 and 2025 in several domains of application, such as biomedicine, industry, and economics. We introduce the main concepts of persistent homology, give an overview of the application fields and tasks, and propose a general framework to describe the main characteristics of all the methods.
URL: https://openreview.net/forum?id=tztKO9jzBR
---
Title: MCIR: A Feature Dependence-Aware Explainability Method with Reliability Guarantees
Abstract: As modern machine learning models are deployed in high-stakes, data-rich environments, the interactions among features have grown more intricate and less amenable to traditional interpretation. Many explanation methods fail when features are strongly dependent. In the presence of multicollinearity or near-duplicate predictors, existing value attribution tools such as SHAP, LIME, HSIC, MI/CMI, and SAGE often distribute importance across redundant features, obscuring which variables represent "important and unique information". This can lead to unstable rankings and unreliable importance scores, and usually incurs a high computational cost. Recent correlation-aware approaches, such as CIR or BlockCIR, offer partial improvements but still struggle to fully separate redundancy from unique contributions at the feature level. To address this, we propose the Mutual Correlation Impact Ratio Method (MCIR-M), a simple and robust measure of global importance under feature dependence. MCIR-M introduces the Mutual Correlation Impact Ratio (MCIR), a score that conditions each feature on a small set of its most correlated neighbours and computes a normalized ratio of conditional information with values in \([0,1]\), which is comparable across tasks and collapses to zero when a feature is redundant, enabling clear redundancy detection. In addition to MCIR, we introduce a lightweight estimation procedure that requires only a fraction of the data while preserving the attribution behaviour of the full model. Across a synthetic household-energy dataset and the real UCI HAR benchmark, MCIR yields more stable and dependence-aware rankings than SHAP (independent and conditional), SAGE, HSIC, MI-based scores, and correlation-aware baselines such as CIR or BlockCIR. Lightweight explanations preserve over \(95\%\) top-feature agreement and reduce runtime many-fold. These results demonstrate that MCIR-M provides a practical and scalable solution for global explanation in settings with strong feature dependence.
URL: https://openreview.net/forum?id=UHMkfgIVbS
---
Title: Challenges in Non-Polymeric Crystal Structure Prediction: Why a Geometric, Permutation-Invariant Loss is Needed
Abstract: Crystalline structure prediction is an essential prerequisite for designing materials with targeted properties. Yet, it is still an open challenge in materials design and drug discovery. Despite recent advances in computational materials science, accurately predicting three-dimensional non-polymeric crystal structures remains elusive. In this work, we focus on the molecular assembly problem, where a set~$\mathcal{S}$ of identical rigid molecules is packed to form a crystalline structure. Such a simplified formulation provides a useful approximation to the actual problem. However, while recent state-of-the-art methods have increasingly adopted sophisticated techniques, the underlying learning objective remains ill-posed. We propose a better formulation that introduces a loss function capturing key geometric molecular properties while ensuring permutation invariance over $\mathcal{S}$. Remarkably, we demonstrate that within this framework, a simple regression model already outperforms prior approaches, including flow matching techniques, on the COD-Cluster17 benchmark, a curated non-polymeric subset of the Crystallography Open Database (COD).
URL: https://openreview.net/forum?id=MsIi78JXXZ
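One standard way to obtain permutation invariance over a set of identical molecules is to match predictions to references with the Hungarian algorithm before computing a geometric loss; the sketch below illustrates that general idea and is not the paper's exact loss function.

```python
# Hedged sketch of a permutation-invariant geometric loss: optimally match
# predicted and reference molecule centres, then average the matched errors.
import numpy as np
from scipy.optimize import linear_sum_assignment

def permutation_invariant_loss(pred, target):
    """pred, target: (m, 3) arrays of predicted / reference molecule positions."""
    cost = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)  # (m, m)
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one matching
    return np.mean(cost[rows, cols] ** 2)

rng = np.random.default_rng(0)
target = rng.normal(size=(17, 3))
pred = target[rng.permutation(17)] + 0.01 * rng.normal(size=(17, 3))
print(permutation_invariant_loss(pred, target))   # small despite the permutation
```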
---
Title: Transforming Language Models into Program Interpreters via Execution Trace Chain of Thought
Abstract: Code execution reasoning (CER), the ability to predict how code executes on a given input, has become an expected aspect of language models' (LMs') coding capabilities. However, many open-source LMs perform poorly on simple code snippets and, as our observations show, they exhibit limitations even on a single basic operation. To enable LMs to accumulate fine-grained reasoning steps in a structured format, we propose leveraging extremely granular execution traces as chain-of-thought rationales. Specifically, we introduce a fine-tuning method called ET-CoT (Execution Trace Chain of Thought), which leverages execution traces generated by our custom code interpreter and characterized by sub-line-level, thorough expansion of all expressions, going beyond merely logging intermediate variables. After fine-tuning with 127k examples, ET-CoT consistently improves CER performance across models and benchmarks, for instance with Qwen2.5-7B-Instruct outperforming its official Coder model. In addition, our custom tests show improved accuracy on repeated application of simple operations. Overall, ET-CoT serves as a unique approach that provides strong baselines and insights for improving CER performance.
URL: https://openreview.net/forum?id=pOg7iub4Pz
---
Title: A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners
Abstract: Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.
URL: https://openreview.net/forum?id=zEIpt5UsHM
---