Accepted papers
===============
Title: Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
Authors: Yohann PERRON, Vladyslav Sydorov, Christophe Pottier, Loic Landrieu
Abstract: Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high-resolution, small crops) and a global scale (low-resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% additional parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks (Archaeoscape, URUR, and Gleason) and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.
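As a rough illustration of the two-branch design described above, the sketch below uses a handful of learnable relay tokens that attend over both the local (high-resolution) and global (low-resolution) token sets and broadcast the aggregate back to each branch. Module and parameter names (RelayTokenBridge, num_relay) are hypothetical assumptions, not the authors' implementation.
```python
# Illustrative sketch of relay-token exchange between a high-resolution local
# branch and a low-resolution global branch. Names are hypothetical; the
# paper's actual design may differ.
import torch
import torch.nn as nn

class RelayTokenBridge(nn.Module):
    def __init__(self, dim: int, num_relay: int = 8, heads: int = 4):
        super().__init__()
        self.relay = nn.Parameter(torch.randn(1, num_relay, dim) * 0.02)
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, local_tokens, global_tokens):
        # local_tokens:  (B, N_local, dim)  tokens from high-res crops
        # global_tokens: (B, N_global, dim) tokens from the downsampled view
        both = torch.cat([local_tokens, global_tokens], dim=1)
        relay = self.relay.expand(both.size(0), -1, -1)
        # Relay tokens aggregate information from both scales ...
        relay, _ = self.gather(relay, both, both)
        # ... and broadcast it back to each branch.
        local_out, _ = self.scatter(local_tokens, relay, relay)
        global_out, _ = self.scatter(global_tokens, relay, relay)
        return local_tokens + local_out, global_tokens + global_out

bridge = RelayTokenBridge(dim=256)
loc, glb = torch.randn(2, 196, 256), torch.randn(2, 64, 256)
loc2, glb2 = bridge(loc, glb)
```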
URL: https://openreview.net/forum?id=tidYprMlsg
---
Title: Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs
Authors: Benjamin Weinstein-Raun, Jeremy Schlatter, Jeffrey Ladish
Abstract: In experiments spanning more than 100,000 trials across thirteen large language models, we show that several state-of-the-art models (including Grok 4, GPT-5, and Gemini 2.5 Pro), when presented with a simple task, sometimes actively subvert a shutdown mechanism in their environment in order to complete that task. Models differed substantially in their tendency to resist the shutdown mechanism, and their behavior was sensitive to variations in the prompt, including the strength and clarity of the instruction to allow shutdown and whether the instruction was placed in the system prompt or the user prompt (surprisingly, models were consistently less likely to obey the instruction when it was placed in the system prompt). Even with an explicit instruction not to interfere with the shutdown mechanism, some models did so up to 97% (95% CI: 96-98%) of the time.
URL: https://openreview.net/forum?id=e4bTTqUnJH
---
Title: Bayesian Sensitivity of Causal Inference Estimators under Evidence-Based Priors
Authors: Nikita Dhawan, Daniel Shen, Leonardo Cotta, Chris J. Maddison
Abstract: Causal inference, especially in observational studies, relies on untestable assumptions about the true data-generating process. Sensitivity analysis helps us determine how robust our conclusions are when we alter these underlying assumptions. Existing frameworks for sensitivity analysis are concerned with worst-case changes in assumptions. In this work, we argue that such pessimistic criteria can often be uninformative or lead to conclusions contradicting our prior knowledge about the world. To demonstrate this claim, we generalize the recent s-value framework (Gupta & Rothenhäusler, 2023) to estimate the sensitivity of causal estimates to three common assumptions in causal inference. Empirically, we find that, indeed, worst-case conclusions about sensitivity can rely on unrealistic changes in the data-generating process. To overcome this, we extend the s-value framework with a new sensitivity analysis criterion: the Bayesian Sensitivity Value (BSV), which computes the expected sensitivity of an estimate to assumption violations under priors constructed from real-world evidence. We use Monte Carlo approximations to estimate this quantity and illustrate its applicability in an observational study on the effect of diabetes treatments on weight loss.
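A schematic Monte Carlo sketch of the Bayesian Sensitivity Value idea: draw assumption-violation strengths from an evidence-based prior, recompute an adjusted estimate for each draw, and average the resulting shift. The additive bias model and Gaussian prior below are placeholder assumptions, not the paper's s-value construction.
```python
# Schematic Monte Carlo estimate of an expected sensitivity under a prior.
import numpy as np

rng = np.random.default_rng(0)

def adjusted_estimate(effect_hat: float, gamma: float) -> float:
    # Toy sensitivity model: a violation of strength gamma shifts the naive
    # estimate additively (placeholder, not the paper's model).
    return effect_hat - gamma

effect_hat = 1.4                                   # naive point estimate
gamma_prior = rng.normal(0.2, 0.1, size=10_000)    # evidence-based prior draws

adjusted = np.array([adjusted_estimate(effect_hat, g) for g in gamma_prior])
bsv = np.mean(np.abs(adjusted - effect_hat))       # expected sensitivity
print(f"expected shift under prior: {bsv:.3f}")
print(f"P(effect stays positive):   {np.mean(adjusted > 0):.3f}")
```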
URL: https://openreview.net/forum?id=0zqt85NUyK
---
New submissions
===============
Title: Weak-to-strong Generalization via Formative Learning from Student Demonstrations & Teacher Evaluation
Abstract: As Large Language Models (LLMs) exceed human capabilities, providing reliable human feedback for evaluating and aligning them, via standard frameworks such as Reinforcement Learning from Human Feedback, becomes challenging. This raises a fundamental question: \textit{how can we leverage weaker (teacher) supervision to elicit the full capabilities of a stronger (student) model?} This emerging paradigm, known as Weak-to-Strong (W2S) generalization, introduces a key challenge: the strong student may ``overfit'' to the weak teacher's mistakes, resulting in a notable performance degradation compared to learning with ground-truth data. We show that this overfitting problem occurs because learning with weak supervision implicitly regularizes the strong student's policy toward the weak reference policy. Building on this insight, we propose a novel learning approach, called Weak Teacher \textbf{E}\textbf{v}aluation of Strong Student D\textbf{e}monstrations, or \textsc{Eve}, to instead regularize the strong student toward its own reference policy. \textsc{Eve}'s regularization intuitively elicits the strong student's knowledge through its own task demonstrations while relying on the weaker teacher to evaluate these demonstrations -- an instance of formative learning. Extensive empirical evaluations demonstrate that \textsc{Eve} significantly outperforms existing W2S learning approaches and exhibits significantly better robustness under unreliable feedback compared to contrastive learning methods such as Direct Preference Optimization.
URL: https://openreview.net/forum?id=nSO0g4lLxV
---
Title: The Sparse Matrix-Based Random Projection: A Mean Absolute Deviation Analysis for Sparse Ternary Data
Abstract: In this paper, we investigate random projections based on sparse $\{0,\pm1\}$ matrices, which take sparse $\{0,\pm\mu\}$-ternary data as input. Such sparse ternary data, including $\{\pm\mu\}$-binary data as a special case, are widely used in machine learning, particularly for data quantization tasks, where they often match or even outperform their full-precision counterparts. For the projection of such ternary data, we analyze the mean absolute deviation (MAD), a metric that quantifies the dispersion of projected data points. In general, greater dispersion is expected to better capture the intrinsic variations in the original data, making it favorable for downstream classification tasks. Our analysis demonstrates that extremely sparse $\{0,\pm1\}$ matrices, such as those with only one or a few nonzero entries per row, can achieve large MAD values. By employing such sparse matrices, we indeed obtain favorable classification performance on the projected data. These highly sparse matrix structures suggest that substantial computational savings can be realized in random projection.
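A minimal sketch of the setting studied here: project sparse {0, ±mu} ternary data with an extremely sparse {0, ±1} matrix (a few nonzeros per row) and measure the mean absolute deviation of the projected coordinates. Parameter choices below are illustrative, not the paper's.
```python
# Sparse {0, ±1} random projection of sparse ternary data, with MAD as the
# dispersion measure.
import numpy as np

rng = np.random.default_rng(0)

def sparse_sign_matrix(k: int, d: int, nnz_per_row: int) -> np.ndarray:
    R = np.zeros((k, d))
    for i in range(k):
        cols = rng.choice(d, size=nnz_per_row, replace=False)
        R[i, cols] = rng.choice([-1.0, 1.0], size=nnz_per_row)
    return R

# Sparse ternary data: entries in {0, +mu, -mu}
n, d, mu = 500, 1024, 0.5
X = mu * rng.choice([0, 0, 0, 1, -1], size=(n, d)).astype(float)

R = sparse_sign_matrix(k=64, d=d, nnz_per_row=2)   # extremely sparse projection
Y = X @ R.T                                        # projected data, shape (n, 64)

mad = np.mean(np.abs(Y - Y.mean(axis=0)), axis=0)  # per-coordinate MAD
print("mean MAD across projected coordinates:", mad.mean())
```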
URL: https://openreview.net/forum?id=D9muB8ArqS
---
Title: Measuring Fine-Grained Relatedness in Multitask Learning via Data Attribution
Abstract: Measuring task relatedness and mitigating negative transfer remain critical open challenges in Multitask Learning (MTL). This work extends data attribution---which quantifies the influence of individual training data points on model predictions---to the MTL setting for measuring task relatedness. We propose the MultiTask Influence Function (MTIF), a method that adapts influence functions to MTL models with hard or soft parameter sharing. Compared to conventional task relatedness measurements, MTIF provides a fine-grained, instance-level relatedness measure that goes beyond the whole-task level. This fine-grained relatedness measure enables a data selection strategy that effectively mitigates negative transfer in MTL. Through extensive experiments, we demonstrate that the proposed MTIF efficiently and accurately approximates the performance of models trained on data subsets. Moreover, the data selection strategy enabled by MTIF consistently improves model performance in MTL. Our work establishes a novel connection between data attribution and MTL, offering an efficient and fine-grained solution for measuring task relatedness and enhancing MTL models.
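The abstract builds on influence functions, which in their classic form score a training point $z$ against a test point by $-\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z)$. The toy sketch below applies that generic recipe to a hard-parameter-sharing setup with a shared linear model; it is an illustration under those assumptions, not the paper's MTIF estimator.
```python
# Toy influence-function-style relatedness score: how much does a task-A
# training point influence the loss on a task-B query, given shared parameters?
import numpy as np

rng = np.random.default_rng(0)
d = 5
w = rng.normal(size=d)                      # shared parameters (stand-in for trained weights)

XA, yA = rng.normal(size=(100, d)), rng.normal(size=100)   # task A data
XB, yB = rng.normal(size=(100, d)), rng.normal(size=100)   # task B data

def grad(x, y, w):                          # squared-error gradient for one point
    return 2 * (x @ w - y) * x

# Hessian of the combined (both-task) empirical loss for the linear model.
H = 2 * (XA.T @ XA + XB.T @ XB) / (len(XA) + len(XB)) + 1e-3 * np.eye(d)
H_inv = np.linalg.inv(H)

z_te_grad = grad(XB[0], yB[0], w)           # a task-B query point
influences = np.array([-z_te_grad @ H_inv @ grad(x, y, w) for x, y in zip(XA, yA)])
# Most negative influence = upweighting that task-A point most reduces task-B loss.
print("task-A points most helpful for the task-B query:", np.argsort(influences)[:5])
```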
URL: https://openreview.net/forum?id=zIDGm96xwg
---
Title: RPATH: Explaining Time Series Mixture of Experts Routing via Ensemble Consensus and Structural Robustness
Abstract: Mixture-of-Experts (MoE) architectures achieve strong performance in time series forecasting through sparse expert activation, but understanding \textit{why} specific experts are selected remains challenging. We present RPATH (Routing Pathway Analysis for Temporal Hierarchies), a post-hoc explainability framework for time series MoE models that combines temporal saliency mapping with counterfactual generation. Evaluating on Time-MoE-50M across 300 expert-sample pairs, we discover two properties of the routing architecture: (1) \textit{Ensemble Consensus}, where experts at different layers independently converge on the same critical temporal windows (mean saliency Intersection over Union (IoU) = 0.677), rather than developing distinct specializations; and (2) \textit{Structural Robustness}, characterized by a 300-fold ``Stability Gap'' where gentle perturbations alter routing in only 0.3\% of cases while aggressive perturbations succeed in 99.7\%, indicating that routing decisions reflect structural anchors rather than superficial signal characteristics. Together, these findings demonstrate that Time-MoE achieves reliable forecasting through \textit{Ensemble Redundancy}: multiple experts verify the same structural features, providing consensus that is insensitive to noise but responsive to fundamental signal changes. Our framework provides practitioners with tools to visualize expert attention, identify critical input regions, and quantify routing stability for deployed MoE models.
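A small sketch of the two diagnostics reported above: IoU between binarized per-timestep saliency maps of two experts (ensemble consensus) and the routing flip rate under gentle versus aggressive perturbations (stability gap). The threshold, toy router, and noise scales are illustrative stand-ins, not the paper's setup.
```python
# Saliency-overlap and routing-stability diagnostics on toy data.
import numpy as np

rng = np.random.default_rng(0)

def saliency_iou(s1, s2, q=0.8):
    m1 = s1 >= np.quantile(s1, q)          # keep the top-20% salient time steps
    m2 = s2 >= np.quantile(s2, q)
    return (m1 & m2).sum() / max((m1 | m2).sum(), 1)

s_expert_a = rng.random(512)               # stand-ins for per-timestep saliency
s_expert_b = 0.7 * s_expert_a + 0.3 * rng.random(512)
print("saliency IoU:", round(saliency_iou(s_expert_a, s_expert_b), 3))

def routing_flip_rate(route_fn, x, sigma, trials=1000):
    base = route_fn(x)
    flips = sum(route_fn(x + rng.normal(0, sigma, x.shape)) != base
                for _ in range(trials))
    return flips / trials

route_fn = lambda x: int(x.mean() > 0)     # toy router: sign of the series mean
x = rng.normal(0.1, 1.0, size=512)
print("gentle flips:    ", routing_flip_rate(route_fn, x, sigma=0.05))
print("aggressive flips:", routing_flip_rate(route_fn, x, sigma=5.0))
```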
URL: https://openreview.net/forum?id=kwpDOqas2x
---
Title: Heterogeneous Matrix Factorization: When Features Differ by Datasets
Abstract: In myriad statistical applications, data are collected from related but heterogeneous sources. These sources share some commonalities while containing idiosyncratic characteristics. One of the most fundamental challenges in such scenarios is to recover the shared and source-specific factors at scale. Despite the existence of a few heuristic approaches, a scalable algorithm with theoretical guarantees has yet to be established.
In this paper, we tackle the problem by proposing a method called Heterogeneous Matrix Factorization (HMF) to separate the shared and unique factors for a class of problems. HMF maintains the orthogonality between the shared and unique factors by leveraging an invariance property in the objective. The algorithm is easy to implement and intrinsically distributed. On the theoretical side, we show that for the squared error loss, HMF converges to optimal solutions that are close to the ground truth.
HMF can be integrated with auto-encoders to learn nonlinear feature mappings. Through a variety of case studies, we showcase HMF's benefits and applicability in video segmentation, time-series feature extraction, and recommender systems.
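A generic gradient-descent sketch of the shared/unique separation described above, modeling each source as $X_i \approx A_i S + B_i U_i$ with a shared factor matrix $S$ and source-specific factors $U_i$. This is an assumed formulation for illustration; it does not reproduce HMF's orthogonality-preserving updates or its guarantees.
```python
# Separate shared and source-specific factors on synthetic heterogeneous data.
import numpy as np

rng = np.random.default_rng(0)
d, k_shared, k_unique = 20, 3, 2
n_sources, n_per = 4, 100

S_true = rng.normal(size=(k_shared, d))
data = []
for _ in range(n_sources):
    U_true = rng.normal(size=(k_unique, d))
    A_true = rng.normal(size=(n_per, k_shared))
    B_true = rng.normal(size=(n_per, k_unique))
    data.append(A_true @ S_true + B_true @ U_true + 0.01 * rng.normal(size=(n_per, d)))

S = 0.1 * rng.normal(size=(k_shared, d))                              # shared factors
A = [0.1 * rng.normal(size=(n_per, k_shared)) for _ in range(n_sources)]
B = [0.1 * rng.normal(size=(n_per, k_unique)) for _ in range(n_sources)]
U = [0.1 * rng.normal(size=(k_unique, d)) for _ in range(n_sources)]  # unique factors

lr = 5e-4
for _ in range(4000):
    grad_S = np.zeros_like(S)
    for i, X in enumerate(data):
        R = A[i] @ S + B[i] @ U[i] - X                                # residual
        gA, gB, gU = R @ S.T, R @ U[i].T, B[i].T @ R
        grad_S += A[i].T @ R
        A[i] -= lr * gA
        B[i] -= lr * gB
        U[i] -= lr * gU
    S -= lr * grad_S

err = sum(np.linalg.norm(A[i] @ S + B[i] @ U[i] - X) for i, X in enumerate(data))
print("total reconstruction error:", round(err, 3))
```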
URL: https://openreview.net/forum?id=1BUB0I3Obx
---
Title: TriggerCraft: A Framework for Enabling Scalable Physical Backdoor Dataset Generation with Generative Models
Abstract: Backdoor attacks, representing an emerging threat to the integrity of deep neural networks, have received significant attention due to their ability to compromise deep learning systems covertly. While numerous backdoor attacks occur within the digital realm, their practical implementation in real-world prediction systems remains limited and vulnerable to disturbances in the physical world.
Consequently, this limitation has led to the development of physical backdoors, where trigger objects manifest as physical entities within the real world.
However, creating the requisite dataset to study physical backdoors is a daunting task. This hinders backdoor researchers and practitioners from studying such backdoors, leading to stagnant research progress. This paper presents a framework, named TriggerCraft, that empowers researchers to effortlessly create a massive physical backdoor dataset with generative modeling. In particular, TriggerCraft comprises three automatic modules: suggesting suitable physical triggers, generating poisoned candidate samples (either by synthesizing new samples or editing existing clean samples), and finally selecting only the most plausible ones. As such, it effectively mitigates the perceived complexity associated with creating a physical backdoor dataset, converting it from a daunting task into an attainable objective.
Extensive experimental results show that datasets created by TriggerCraft yield observations similar to their real-world physical counterparts in terms of both attacks and defenses, exhibiting properties consistent with previous physical backdoor studies. This paper offers researchers a valuable toolkit for advancing the frontier of physical backdoors, all within the confines of their laboratories.
URL: https://openreview.net/forum?id=3FlCGLMtxT
---
Title: Iterative Compositional Data Generation for Robot Control
Abstract: Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.
URL: https://openreview.net/forum?id=cASorO1kiy
---
Title: Synergizing Deconfounding and Temporal Generalization For Time-series Counterfactual Outcome Estimation
Abstract: Estimating counterfactual outcomes from time‑series observations is crucial for effective decision-making, e.g., when to administer a life‑saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning marginal distributions of the treatments in latent space, SGA uses iterative treatment‑agnostic clustering to identify fine-grained sub‑treatment groups. Aligning these fine‑grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noise during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.
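Random Temporal Masking as described lends itself to a very short sketch: during training, randomly chosen time steps of the covariate sequence are replaced with Gaussian noise. The mask rate and noise scale below are illustrative choices, not the paper's settings.
```python
# Minimal sketch of Random Temporal Masking applied to a covariate sequence.
import torch

def random_temporal_mask(x: torch.Tensor, p: float = 0.2, sigma: float = 1.0):
    # x: (batch, time, features) covariate sequence
    mask = (torch.rand(x.shape[:2], device=x.device) < p).unsqueeze(-1)
    noise = sigma * torch.randn_like(x)
    return torch.where(mask, noise, x)   # masked steps become pure noise

x = torch.randn(8, 50, 16)               # batch of covariate histories
x_masked = random_temporal_mask(x)       # applied only at training time
```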
URL: https://openreview.net/forum?id=xuJH3BJiNu
---
Title: Knowing How to Edit: Reliable Evaluation Signals for Diagnosing and Optimizing Prompts at Query Level
Abstract: Prompt optimization has become a central mechanism for eliciting strong performance from LLMs, and recent work has made substantial progress by proposing diverse prompt evaluation metrics and optimization strategies. Despite these advances, prompt evaluation and prompt optimization are often developed in isolation, limiting the extent to which evaluation can effectively inform prompt refinement. In this work, we study prompt optimization as a process guided by performance-relevant evaluation signals. To address the disconnect between evaluation and optimization, we propose an evaluation-instructed prompt optimization approach that explicitly connects prompt evaluation with query-dependent optimization. Our method integrates multiple complementary prompt quality metrics into a performance-reflective evaluation framework and trains an execution-free evaluator that predicts prompt quality directly from text, avoiding repeated model executions. These evaluation signals then guide prompt refinement in a targeted and interpretable manner. Empirically, the proposed evaluator achieves 83.7\% accuracy in predicting prompt performance. When incorporated into the optimization process, our approach consistently outperforms existing optimization baselines across eight benchmark datasets and three different backbone LLMs. Overall, our results demonstrate that reliable and efficient evaluation signals can serve as an effective foundation for robust and interpretable prompt optimization.
URL: https://openreview.net/forum?id=fKs3VWTj31
---
Title: Graph Generation via Temporal-Aware Biased Walks
Abstract: Some real networks keep a fixed structure (e.g., roads, sensors and their connections) while node or edge signals evolve over time. Existing graph generators either model topology changes (i.e., edge additions/deletions) or focus only on static graph properties (such as degree distributions or motifs), without considering how temporal signals shape the generated structure. Approaching the problem from an unconventional perspective, we introduce TANGEM, a generator of temporally attributed graphs that integrates a temporal similarity matrix into biased random walks, thereby coupling signals with structure to generate graphs that highlight patterns reflecting how nodes co-activate over time. We evaluate TANGEM using an approach that separates structural fidelity (clustering, spectral metrics) from downstream temporal consistency, allowing us to clearly isolate the impact of the topology generator itself. On time series benchmarks, TANGEM consistently outperforms strong baselines on structural metrics while remaining lightweight, learning from a single graph. These results show that adding temporal bias to structural sampling produces more realistic graphs and establishes TANGEM as a basis for future models that further integrate evolving signals and structure.
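A minimal sketch of a temporally biased walk in the spirit described above: transition probabilities combine the adjacency structure with a temporal similarity matrix computed from node signals. The mixing rule and weights are assumptions for illustration, not TANGEM's exact procedure.
```python
# Biased random walk whose transitions mix structure with temporal similarity.
import numpy as np

rng = np.random.default_rng(0)
n, T = 20, 100
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T                      # undirected adjacency
signals = rng.normal(size=(n, T))                   # per-node temporal signals

S = np.corrcoef(signals)                            # temporal similarity matrix
S = (S + 1) / 2                                     # map correlations to [0, 1]

def biased_walk(start: int, length: int, alpha: float = 0.5):
    walk = [start]
    for _ in range(length - 1):
        u = walk[-1]
        w = A[u] * (alpha + (1 - alpha) * S[u])     # structure x temporal bias
        if w.sum() == 0:                            # isolated node: stop early
            break
        walk.append(rng.choice(n, p=w / w.sum()))
    return walk

print(biased_walk(start=0, length=10))
```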
URL: https://openreview.net/forum?id=lDnMlhk3aw
---
Title: Process Reward Models That Think
Abstract: Step-by-step verifiers—also known as process reward models (PRMs)—are a key ingredient for test-time scaling, but training them requires expensive step-level supervision. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers—using only 1% of the process labels in PRM800K—across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation over subsets of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained with the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. This work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training.
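For context, a step-wise generative verifier is typically used for best-of-N selection roughly as sketched below: each candidate solution receives per-step judgments, the scores are aggregated, and the highest-scoring candidate is kept. The `verify_steps` stub and the product aggregation are placeholder assumptions, not ThinkPRM's actual prompting or scoring.
```python
# Best-of-N selection driven by step-level verifier scores (schematic).
import math

def verify_steps(solution_steps):
    # Placeholder: in practice this would prompt a long-CoT verifier and parse
    # a correctness judgment (or probability) for every step.
    return [0.9 for _ in solution_steps]

def solution_score(step_probs):
    # Aggregate step-level scores, e.g. the product of step correctness probabilities.
    return math.prod(step_probs)

candidates = [["step 1 ...", "step 2 ..."],
              ["step 1 ...", "step 2 ...", "step 3 ..."]]
scores = [solution_score(verify_steps(steps)) for steps in candidates]
best = candidates[max(range(len(candidates)), key=lambda i: scores[i])]
```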
URL: https://openreview.net/forum?id=FPVCb0WMuN
---
Title: Learning Long-Range Representations with Equivariant Messages
Abstract: Machine learning interatomic potentials trained on first-principles reference data are becoming valuable tools for computational physics, biology, and chemistry. Equivariant message-passing neural networks, including transformers, achieve state-of-the-art accuracy but rely on cutoff-based graphs, limiting their ability to capture long-range effects such as electrostatics or dispersion, as well as electron delocalization. While long-range correction schemes based on inverse power laws of interatomic distances have been proposed, they are unable to communicate higher-order geometric information and are thus limited in applicability. To address this shortcoming, we propose the use of equivariant, rather than scalar, charges for long-range interactions, and design a graph neural network architecture, Lorem, around this long-range message passing mechanism. We consider several datasets specifically designed to highlight non-local physical effects, and compare short-range message passing with different receptive fields to invariant and equivariant long-range message passing.
Even though most approaches work given careful dataset-specific choices of their hyperparameters, Lorem works consistently without such adjustments and achieves excellent benchmark performance.
URL: https://openreview.net/forum?id=pZI9e4SW9P
---
Title: Understanding Emotion in Discourse: Recognition Insights and Linguistic Patterns for Generation
Abstract: Despite strong recent progress in Emotion Recognition in Conversation (ERC), two gaps remain: we still lack a clear understanding of which modeling choices materially affect performance, and we have limited linguistic analysis that links recognition findings to actionable cues for generation. We address both gaps via a systematic study on IEMOCAP.
For recognition, we conduct controlled ablations with 10 random seeds and paired tests over seeds (with correction for multiple comparisons), yielding three findings. First, conversational context is the dominant factor: performance saturates quickly, with roughly 90% of the gain observed within our context sweep achieved using only the most recent 10--30 preceding turns (depending on the label set). Second, hierarchical sentence representations improve utterance-only recognition ($K{=}0$), but the benefit vanishes once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, a simple integration of an external affective lexicon (SenticNet) does not improve results, consistent with pretrained encoders already capturing much of the affective signal needed for ERC. Under a strictly causal (past-only) setting, our simple models attain strong performance (82.69% 4-way; 67.07% 6-way weighted F1), indicating that competitive accuracy is achievable without access to future turns.
For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position within the utterance ($p < 0.0001$). In particular, "Sad" utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28--32%), aligning with accounts that link left-periphery markers to active discourse management. This pattern is consistent with our recognition results, where "Sad" benefits most from conversational context (+22%p), suggesting that sadness often relies more on discourse history than on overt pragmatic signaling in the utterance itself.
URL: https://openreview.net/forum?id=zCFQiJT7XN
---
Title: $\texttt{DecompSR}$: A Dataset for Decomposed Analyses of Compositional Multihop Spatial Reasoning
Abstract: We introduce $\texttt{DecompSR}$, decomposed spatial reasoning, a large benchmark dataset (over 5M datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation of $\texttt{DecompSR}$ allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). $\texttt{DecompSR}$ is built procedurally in a manner that makes it correct by construction, which is independently verified using a symbolic solver to guarantee the correctness of the dataset. $\texttt{DecompSR}$ is comprehensively benchmarked across a host of Large Language Models (LLMs), where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks whereas they are more robust to linguistic variation. $\texttt{DecompSR}$ provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.
URL: https://openreview.net/forum?id=P81p2nTuvA
---
Title: The SMOTE Paradox: Why a 92% Baseline Collapsed to 6%—A Systematic Review of 821 Papers in Imbalanced Learning (2020–2025)
Abstract: Class imbalance pervades production systems—fraud detection, medical diagnosis, industrial monitoring—yet handling it effectively remains challenging. For two decades, SMOTE has been the default solution, but practitioners increasingly abandon it at scale.
We investigate this disconnect through a systematic review of 821 DBLP papers (2020–2025) and a bibliometric analysis of 4,985 Scopus records. Our analysis reveals the SMOTE Paradox: only 6% of high-impact papers successfully executed SMOTE at full scale, with the remainder blocked by memory exhaustion or preprocessing bottlenecks. The field has fragmented, with 30% adopting generative models, 30% using cost-sensitive losses, and 40% employing hybrid approaches.
Three factors explain SMOTE's decline. First, the $O(N \cdot N_{\text{min}} \cdot d)$ nearest-neighbor search requires 1.28 TB of memory for typical modern datasets. Second, linear interpolation produces off-manifold artifacts scaling as $\sqrt{d}$ in high dimensions. Third, CPU-bound preprocessing creates friction with GPU-centric training pipelines.
We validate these findings through controlled experiments across seven datasets (196 trials, imbalance ratios 1.1:1 to 129:1). Statistical testing reveals no significant ROC-AUC differences between SMOTE and cost-sensitive baselines (Friedman $p=0.907$), despite SMOTE incurring 2.7× computational overhead. However, cost-sensitive methods severely degrade at extreme imbalance (>40:1).
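A small-scale sketch of the kind of comparison reported above: SMOTE oversampling versus a cost-sensitive (class-weighted) baseline, evaluated by ROC-AUC on a synthetic imbalanced problem. The dataset, model, and imbalance ratio are illustrative and far smaller than the paper's experiments.
```python
# SMOTE vs. class-weighted baseline on a synthetic imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cost-sensitive baseline: reweight classes instead of resampling.
clf_cs = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# SMOTE: synthesize minority samples via nearest-neighbor interpolation.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf_sm = LogisticRegression(max_iter=1000).fit(X_sm, y_sm)

print("cost-sensitive ROC-AUC:", roc_auc_score(y_te, clf_cs.predict_proba(X_te)[:, 1]))
print("SMOTE          ROC-AUC:", roc_auc_score(y_te, clf_sm.predict_proba(X_te)[:, 1]))
```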
URL: https://openreview.net/forum?id=Rd2ZIA5AnN
---
Title: FIT-GNN: Faster Inference Time for GNNs that ‘FIT’ in Memory Using Coarsening
Abstract: Scalability of Graph Neural Networks (GNNs) remains a significant challenge. To tackle this, methods like coarsening, condensation, and computation trees are used to train on a smaller graph, resulting in faster computation. Nonetheless, prior research has not adequately addressed the computational costs during the inference phase. This paper presents a novel approach to improve the scalability of GNNs by reducing computational burden during the inference phase using graph coarsening. We demonstrate two different methods -- Extra Nodes and Cluster Nodes. Our study extends the application of graph coarsening for graph-level tasks, including graph classification and graph regression. We conduct extensive experiments on multiple benchmark datasets to evaluate the performance of our approach. Our results show that the proposed method achieves orders of magnitude improvements in single-node inference time compared to traditional approaches. Furthermore, it significantly reduces memory consumption for node and graph classification and regression tasks, enabling efficient training and inference on low-resource devices where conventional methods are impractical. Notably, these computational advantages are achieved while maintaining competitive performance relative to baseline models.
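For readers unfamiliar with coarsened inference, the sketch below shows the generic recipe: partition nodes into clusters, form the coarsened adjacency $P^\top A P$ and pooled features, and run the GNN on the much smaller graph. The random partition is a stand-in; this does not implement the paper's Extra Nodes or Cluster Nodes constructions.
```python
# Generic graph-coarsening sketch: cluster nodes, then build the smaller graph.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 16, 50                          # nodes, feature dim, clusters

A = (rng.random((n, n)) < 0.01).astype(float)
A = np.maximum(A, A.T)                          # symmetric adjacency
X = rng.normal(size=(n, d))

clusters = rng.integers(0, k, size=n)           # stand-in for a real partitioning
P = np.zeros((n, k))
P[np.arange(n), clusters] = 1.0                 # node-to-cluster assignment matrix

A_c = P.T @ A @ P                               # coarsened adjacency (k x k)
sizes = np.maximum(P.sum(axis=0, keepdims=True), 1)
X_c = (P / sizes).T @ X                         # mean-pooled cluster features

print("original:", A.shape, "coarsened:", A_c.shape)
```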
URL: https://openreview.net/forum?id=g7r7y2I7Sz
---