Weekly TMLR digest for Jun 07, 2026


TMLR

Jun 7, 2026, 12:00:12 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

J2C Certification: RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal

https://openreview.net/forum?id=kBezrKXHVS

---


Expert Certification: Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski

https://openreview.net/forum?id=QFJuVreJDC

---


Reproducibility Certification: Towards Representation Backdoor on CLIP via Concept Confusion

Junchi Liao, Weimin Lyu, Lijie Hu, Shaopeng Fu, Tianhao Huang, Shu Yang, Jie Li, Di Wang

https://openreview.net/forum?id=jQ91DETUIr

---


Accepted papers
===============


Title: LLM-Guided Search for Deletion-Correcting Codes

Authors: Franziska Weindel, Reinhard Heckel

Abstract: Finding deletion-correcting codes of maximum size has been an open problem for over 70 years, even for a single deletion. We adapt FunSearch, a large language model (LLM)-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths. For a single deletion, our search finds a function that we prove constructs the conjectured-optimal Varshamov-Tenengolts code. For multiple deletions and quaternary edit codes, the discovered functions improve on prior explicit, search-based, and neural constructions but remain empirical heuristics without new theoretical insights. We study design choices for LLM-guided evolutionary search and find that, for our problem, compute is better allocated to sampling more functions than to longer reasoning traces per function, and that co-evolving natural language descriptions with code hurts search quality. We propose deduplicating logically identical functions during evolution, which we find critical for search diversity. Our results demonstrate the potential of LLM-guided evolutionary search for information theory and code design and represent the first application of such methods for constructing error-correcting codes. However, in our current formulation, evaluating a function scales exponentially with code length, limiting the approach to short codes.

URL: https://openreview.net/forum?id=qZ69Ozpo6v
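
For readers unfamiliar with the Varshamov-Tenengolts (VT) code named in the abstract, here is a minimal sketch of its standard textbook definition together with a brute-force check of the single-deletion property. It is not the function discovered by the paper's search; the helper names are made up for illustration. The check enumerates all binary words, so it is only practical at short lengths, in line with the exponential evaluation cost the abstract mentions.

```python
from itertools import product


def vt_code(n, a=0):
    """Binary words x of length n with sum(i * x_i) == a (mod n + 1), i = 1..n."""
    return [
        x for x in product((0, 1), repeat=n)
        if sum(i * bit for i, bit in enumerate(x, start=1)) % (n + 1) == a
    ]


def deletion_ball(word):
    """All words obtained by deleting exactly one symbol from `word`."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def corrects_single_deletion(code):
    """True if the single-deletion balls of distinct codewords never overlap."""
    seen = set()
    for word in code:
        ball = deletion_ball(word)
        if seen & ball:
            return False  # two codewords share a received word
        seen |= ball
    return True
```

Calling `corrects_single_deletion(vt_code(n))` for small `n` confirms the property by exhaustive search, whereas the full set of binary words of a given length fails the same check.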

---

Title: Index2Sort: Sorting Algorithm Using Static Index Structure

Authors: Atsuki Sato, Yusuke Matsui

Abstract: We introduce Index2Sort, a general framework for deriving sorting algorithms from static indexes. Index2Sort treats the index as an opaque box that exposes only two operations: index construction and rank queries. This abstraction allows Index2Sort to be applied to various index structures, including classical and learned indexes. Our theoretical analysis shows that the computational guarantees of the index transfer directly to Index2Sort. If the index can be constructed in expected time $\mathcal{O}(nC(n))$ and can answer rank queries in expected time $\mathcal{O}(Q(n))$, then Index2Sort sorts the input in expected time $\mathcal{O}(nC(n) + nQ(n))$. In particular, when using a state-of-the-art learned index with $C(n)=Q(n)=1$, this yields an expected complexity of $\mathcal{O}(n)$, which is a strictly tighter bound than those of existing learned sorting algorithms. In contrast to recent theoretical works on learned sorting, which derive complexity guarantees by analyzing the internal structure of a learned index and designing a sorting algorithm with a similar structure, Index2Sort achieves stronger guarantees without requiring any inspection or modification of the index internals.

URL: https://openreview.net/forum?id=YUmVxjMhpm

---

Title: Twin: Tuning Learning Rate and Weight Decay of Deep Homogeneous Classifiers without Validation

Authors: Lorenzo Brigato, Stavroula Mougiakakou

Abstract: We introduce \textbf{T}une \textbf{w}ithout Validat\textbf{i}o\textbf{n} (Twin), a simple and effective pipeline for tuning learning rate and weight decay of homogeneous classifiers without validation sets, eliminating the need to hold out data and avoiding the two-step process.
Twin leverages the margin-maximization dynamics of homogeneous networks and an empirical scaling law that links training and test losses across hyper-parameter configurations.
This mathematical modeling yields a regime-dependent, validation-free selection rule: in the \emph{non-separable} regime, training loss is monotonic in test loss and therefore predictive of generalization, whereas in the \emph{separable} regime, the parameters' norm becomes a reliable indicator of generalization due to margin maximization.
Across 37 dataset-architecture configurations for image classification, we demonstrate that Twin achieves a mean absolute error of 1.28\% compared to an \textit{Oracle} baseline that selects HPs using test accuracy.
We demonstrate Twin’s benefits in scenarios where validation data is scarce, such as small-data regimes, or difficult and costly to collect, as in medical imaging.
Code available at \url{https://github.com/lorenzobrigato/twin}.

URL: https://openreview.net/forum?id=1SIP2M2HJa

---

Title: Robust Recourse via Kernel Distributionally Robust Optimization and Bayesian Posterior Predictive Modeling

Authors: Sita bissu, Navnit Kumar Yadav, APARNA MEHRA, Sandeep Kumar

Abstract: Machine learning recourse provides actionable recommendations to achieve favorable outcomes from predictive decision models. A critical limitation of current approaches is their reliance on the assumption of model stationarity, an assumption that is frequently violated in dynamic, real-world settings with distributional shifts. Robust approaches such as Robust Algorithmic Recourse (ROAR) and the Wasserstein-based DiRRAc address some uncertainties but remain limited in handling nonlinear dependencies and large-scale shifts, including concept drift and the worst-case distributional shifts within the MMD ambiguity set. We propose Kernel Distributionally Robust Recourse Action (KDRRA), a framework that defines ambiguity sets using Maximum Mean Discrepancy (MMD) in a Reproducing Kernel Hilbert Space (RKHS), enabling flexible, nonparametric modeling of complex, nonlinear discrepancies between distributions. A practical challenge for kernel DRO is that empirical kernel mean embeddings can deviate from the true distribution, inflating ambiguity radii and yielding overly conservative recommendations. To address this, we introduce Bayesian KDRRA (BKDRRA), which centers the ambiguity set on a Bayesian posterior predictive distribution constructed via posterior bootstrap. This Bayesian centering integrates sampling variability and moderate model uncertainty into the reference distribution, leading to tighter ambiguity sets and markedly lower conservatism without sacrificing robustness. Leveraging the representer theorem, we derive finite-dimensional convex reformulations of the worst-case recourse optimization for both KDRRA and BKDRRA. We conduct a comprehensive empirical evaluation across four real-world datasets that exhibit correction, temporal, geospatial, and demographic covariate shifts. 
The KDRRA consistently outperforms state-of-the-art baselines in yielding superior robustness and lower recourse cost, while BKDRRA further improves stability and calibration by integrating Bayesian uncertainty. Our research advances the frontier of distributionally robust recourse by integrating machine learning tools and optimization, offering reliable and resilient decision-making under uncertainty.

URL: https://openreview.net/forum?id=LmEDkCTY0X

---

Title: BitLogic: A Framework for Gradient-Based LUT-Native Neural Networks

Authors: Simon Jonas Bührer, Andreas Plesner, Till Aczel, Roger Wattenhofer

Abstract: Gradient-based LUT- and logic-gate-based neural networks (LUTNet, LogicNets, DiffLogic, PolyLUT, NeuraLUT, WARP-LUT, DWN, LILogicNet, LightLUT) replace multiply-accumulate arithmetic with Boolean lookups. The same trained checkpoint deploys to GPU as bitwise ops on bit-packed activations, to FPGA as LUT primitives, and to ASIC as standard-cell gates, all from one code path. Yet each method ships its own training pipeline, encoder, connectivity rule, fan-in, and hardware-reporting convention. The natural practitioner question, which of these choices actually matter for accuracy and which for hardware cost, therefore has no answer in the current literature. We release BitLogic, a unified framework that factors the field into a five-axis design space (encoder, connectivity, fan-in, node parameterization, head) and instantiates every prior method under one shared training and evaluation protocol. The framework deliberately omits method-specific procedures such as calibration, pruning, and thresholding, and all evaluations are limited to two-layer feed-forward networks. Combining the per-axis winners identifies a new best-of-space configuration that outperforms every retrained prior on every (dataset, width) cell in which every compared prior fits the shared budget, across MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. We evaluate the best-of-space model on all three backends. On MNIST the resulting two-layer network reaches ~126 MSamples/s on FPGA, ~15x the throughput of a bit-packed GPU forward path that itself processes 64 samples per 64-bit operation, at four-to-five orders of magnitude less energy.

URL: https://openreview.net/forum?id=ZbsSZAfDod
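
The abstract's remark that a bit-packed forward path processes 64 samples per 64-bit operation can be illustrated with a small sketch. It assumes a single 2-input lookup table and stores one sample per bit of a Python integer. The function names and the truth-table encoding are hypothetical and are not taken from the BitLogic code base.

```python
MASK64 = (1 << 64) - 1  # keep every intermediate within one 64-bit word


def lut2_packed(truth_table, a, b):
    """Evaluate a 2-input LUT on 64 samples at once.

    truth_table: 4-bit int; bit (2 * ai + bi) is the output for inputs (ai, bi).
    a, b: 64-bit ints; bit s holds the input value for sample s.
    """
    na, nb = ~a & MASK64, ~b & MASK64
    out = 0
    if truth_table & 0b0001:
        out |= na & nb  # inputs (0, 0)
    if truth_table & 0b0010:
        out |= na & b   # inputs (0, 1)
    if truth_table & 0b0100:
        out |= a & nb   # inputs (1, 0)
    if truth_table & 0b1000:
        out |= a & b    # inputs (1, 1)
    return out


def lut2_scalar(truth_table, ai, bi):
    """Reference implementation: one table lookup per sample."""
    return (truth_table >> (2 * ai + bi)) & 1
```

With `truth_table = 0b0110` the node computes XOR; the packed version evaluates all 64 samples with a handful of bitwise operations instead of 64 separate lookups.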

---

Title: Insights From a Data- and Space-Agnostic Approach to Zero-Cost Proxies

Authors: Timotée Ly-Manson, Mathieu Léonardon, Abdeldjalil AISSA EL BEY

Abstract: Zero-cost proxies (ZCPs) have enabled low-cost Neural Architecture Search (NAS) by removing the computational overhead from model training. However, important drawbacks of currently designed ZCPs remain unaddressed. While there is a strong correlation between ZCPs and model performance at the scale of entire search spaces, this does not necessarily translate to guiding the search to top-performing architectures. In this paper, we conduct extensive benchmarking over state-of-the-art proxies in the NAS-Bench-Suite-Zero setting and observe that the correlation decreases dramatically when reducing the space to the best architectures, demonstrating the presence of a top-rank gap. Moreover, embedded priors on search space and data make ZCPs unreliable across diverse tasks. We leverage adaptive parameter distribution statistics as a discriminator metric in the genetic framework and introduce ParaDis, a low-cost NAS algorithm that remains orthogonal to ZCP design, with potential to define a fully data- and space-agnostic search when paired with the right metric. Experiments on multiple benchmarks confirm that ParaDis reduces the top-rank gap across diverse tasks. ParaDis also achieves a test accuracy of $97.29 \pm 0.07$ \% on CIFAR10 in the DARTS space, within $0.25$\% of state-of-the-art, remaining competitive against methods with heavier priors.

URL: https://openreview.net/forum?id=sVWJczov4Q

---

Title: Transforming Language Models into Program Interpreters via Execution Trace Chain of Thought

Authors: Koshi Eguchi, Kazusato Oko, Kenshin Yamauchi, Makoto Shing, Takuya Akiba

Abstract: Code execution reasoning (CER), the ability to predict how code executes on a given input, has been added to the expected aspects of language models' (LMs') coding capabilities. However, many open-source LMs perform poorly on simple code snippets and, as our observations show, they exhibit limitations even on a single basic operation. To enable LMs to accumulate fine-grained reasoning steps in a structured format, we propose leveraging extremely granular execution traces as chain-of-thought rationales. Specifically, we introduce a fine-tuning method called ET-CoT (Execution Trace Chain of Thought), which leverages execution traces generated by our custom code interpreter and characterized by sub-line-level, thorough expansion of all expressions, going beyond merely logging intermediate variables. After fine-tuning with 127k examples, ET-CoT effectively improves CER performance, for instance with Qwen2.5-7B-Instruct outperforming its official Coder model. In addition, our custom tests show improved accuracy on repeated application of simple operations. Overall, ET-CoT serves as a unique approach that provides valuable insight into how systematically composing atomic reasoning steps improves CER performance.

URL: https://openreview.net/forum?id=pOg7iub4Pz
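
To make "sub-line-level expansion of all expressions" concrete, the sketch below logs every sub-expression of an arithmetic expression as it is evaluated. This is an illustration only: the paper's custom interpreter and its trace format are not described in the abstract, so the `name -> value` step format and the function name are assumptions. It relies on `ast.unparse`, which needs Python 3.9 or later.

```python
import ast
import operator

# Supported binary operators (a deliberately small subset).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def trace_expression(source, env):
    """Evaluate an arithmetic expression and record one step per sub-expression."""
    steps = []

    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            value = env[node.id]
            steps.append(f"{node.id} -> {value}")  # variable lookup
            return value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            left, right = ev(node.left), ev(node.right)
            value = _OPS[type(node.op)](left, right)
            steps.append(f"{ast.unparse(node)} -> {value}")  # reduced sub-expression
            return value
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    result = ev(ast.parse(source, mode="eval").body)
    return result, steps
```

For `trace_expression("(x + 2) * y", {"x": 3, "y": 4})` the recorded steps are the lookup of `x`, the reduction of `x + 2`, the lookup of `y`, and finally the reduction of the whole product, which is more granular than logging only the final value of a line.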

---

Title: LARP: Learner-Agnostic Robust Data Prefiltering

Authors: Kristian Minchev, Dimitar Iliev Dimitrov, Nikola Konstantinov

Abstract: Public datasets, crucial for modern machine learning and statistical inference, often contain low-quality or contaminated samples that can harm model performance. This creates a need for principled prefiltering procedures that a data provider can apply to protect the accuracy of a range of potential downstream statistical and learning procedures _simultaneously_. In this work, we formalize and analyze **L**earner-**A**gnostic **R**obust data **P**refiltering (LARP), the problem of designing prefiltering procedures with guarantees on the worst-case loss over a pre-specified set of learners. We establish the feasibility of LARP in two theoretical settings, by providing upper-bound guarantees on the worst-case loss. Our theoretical results indicate that protecting heterogeneous learner sets via LARP comes at the price of some performance loss compared to individual, learner-specific prefiltering; we call this gap the price of LARP. To assess this gap in performance, we empirically measure the price of LARP across image and tabular tasks. We further explore potential benefits of LARP from the perspective of saving on repeated data curation efforts, in a game-theoretic model where the downstream learners can split the cost of the single prefiltering.

URL: https://openreview.net/forum?id=gI6VOV3jfO

---

Title: RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

Authors: Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal

Abstract: Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by “uniformly” optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better quantized models can be obtained by prioritizing learning from important tokens. Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate weight outliers, (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods. Our code is available in the supplementary material.

URL: https://openreview.net/forum?id=kBezrKXHVS
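
Step (2) and the "second-order statistics computed by scaled tokens" in step (3) can be sketched as follows, assuming each token feature vector x_t is multiplied by a scalar importance r_t before the usual second-moment matrix is accumulated. How the paper maps attention scores to r_t is not stated in the abstract, so the scales are simply passed in; the function name is hypothetical and the rotation and GPTQ steps are omitted.

```python
def scaled_second_moment(tokens, importance):
    """Return H = sum_t (r_t * x_t)(r_t * x_t)^T as a nested list.

    tokens: list of feature vectors x_t (lists of floats, all the same length).
    importance: list of per-token scales r_t (assumed given, e.g. from attention).
    """
    dim = len(tokens[0])
    hessian = [[0.0] * dim for _ in range(dim)]
    for x, r in zip(tokens, importance):
        scaled = [r * value for value in x]  # scale the token feature
        for i in range(dim):
            for j in range(dim):
                hessian[i][j] += scaled[i] * scaled[j]
    return hessian
```

Setting every r_t to 1 recovers the uniform second moment, so tokens with larger scales contribute more to the statistics that drive the layer-wise reconstruction.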

---

Title: ReciNet: Reciprocal Space-Aware Long-Range Modeling for Crystalline Property Prediction

Authors: Jianan Nie, Peiyao Xiao, Kaiyi Ji, Peng Gao

Abstract: Predicting properties of crystals from their structures is a fundamental yet challenging task in materials science. Unlike molecules, crystal structures exhibit infinite periodic arrangements of atoms, requiring methods capable of capturing both local and global information effectively. However, current works fall short of capturing long-range interactions within periodic structures. To address this, we leverage reciprocal space, the natural domain for periodic crystals, and construct a Fourier series representation from fractional coordinates and reciprocal lattice vectors with learnable filters. Building on this, we introduce the reciprocal space-based geometry network (ReciNet), a novel architecture that integrates geometric GNNs and reciprocal blocks to model short-range and long-range interactions. Experiments on comprehensive benchmarks JARVIS, Materials Project, and MatBench demonstrate that ReciNet achieves state-of-the-art predictive accuracy across a range of crystal property prediction tasks. Additionally, we explore a model extension for multi-property prediction with the mixture-of-experts, which demonstrates high computational efficiency and reveals positive transfer between correlated properties. These findings highlight the potential of our model as a scalable and accurate solution for crystal property prediction.

URL: https://openreview.net/forum?id=ODlxgod5e3

---

Title: Family Matters: A Systematic Study of Spatial vs. Frequency Masking for Continual Test-Time Adaptation

Authors: Chandler Timm Cagmat Doloriel, Yunbei Zhang, Yeonguk Yu, Taki Hasan Rafi, Muhammad salman siddiqui, Tor Kristian Stevik, Fadi Al Machot, Kristian Hovde Liland, Habib Ullah

Abstract: Recent continual test-time adaptation (CTTA) methods adopt masked image modeling to stabilize learning under distribution shift, yet each treats its masking family $\mathcal{F}$ as a fixed design choice and innovates exclusively along the selection strategy $\mathcal{S}$, leaving the family axis underexplored. We present a systematic empirical study that isolates this axis. Using a controlled CTTA instantiation, Mask to Adapt (M2A), that fixes $\mathcal{S}{=}\textit{random}$ and standard losses, we vary only $\mathcal{F}$ across spatial (patch, pixel) and frequency (all-band, low-band, high-band) families while keeping every other component identical. The study's contributions are the design guidance it extracts for the CTTA settings we evaluated: (1)~\emph{the masking family determines whether adaptation compounds useful structure or compounds errors}, on patch-tokenized architectures, spatial masking accumulates stable representations over long streams while frequency masking collapses catastrophically. We characterize this instability through a \emph{structural-preservation} account, where spatial coherence maintains the broad-spectrum redundancy needed to avoid terminally overlapping with a corruption's spectral signature; (2)~\emph{the optimal family depends on architecture-task alignment}, on CNNs, whose overlapping receptive fields dilute patch occlusion, the family gap vanishes, whereas on fine-grained tasks with global cues and large-capacity ViTs, frequency masking becomes competitive. In confounded system-level comparisons, where baselines also differ in losses and auxiliary components, M2A's random selection performs comparably to heuristic strategies, though we treat this observation as suggestive context rather than a controlled quantification of $\mathcal{S}$'s relative importance.

URL: https://openreview.net/forum?id=pBI64qNXHp

---

Title: Provably Efficient Off-Policy Adversarial Imitation Learning with Convergence Guarantees

Authors: Yilei Chen, Vittorio Giammarino, James Queeney, Ioannis Paschalidis

Abstract: Adversarial Imitation Learning (AIL) faces challenges with sample inefficiency because of its reliance on sufficient on-policy data to evaluate the performance of the current policy during reward function updates. In this work, we study the convergence properties and sample complexity of off-policy AIL algorithms. We show that, even in the absence of importance sampling correction, reusing samples generated by the $o(\sqrt{K})$ most recent policies, where $K$ is the number of iterations of policy updates and reward updates, does not undermine the convergence guarantees of this class of algorithms. Furthermore, our results indicate that the distribution shift error induced by off-policy updates is dominated by the benefits of having more data available. This result provides theoretical support for the sample efficiency of off-policy AIL algorithms that has been observed in practice.

URL: https://openreview.net/forum?id=OahvMeRgKP
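
The sample-reuse scheme in the abstract, keeping data from only the most recent policies with no importance-sampling correction, can be sketched as a sliding-window buffer. The window size `K ** alpha` with `alpha < 0.5` is one illustrative way to obtain a window that grows slower than sqrt(K); the class name and the default `alpha` are assumptions, not values from the paper.

```python
from collections import deque


class RecentPolicyBuffer:
    """Keeps the sample batches produced by the most recent policies only."""

    def __init__(self, total_iterations, alpha=0.4):
        if not 0 <= alpha < 0.5:
            raise ValueError("alpha must lie in [0, 0.5) for an o(sqrt(K)) window")
        # Window of size floor(K ** alpha), at least one policy.
        self.window = max(1, int(total_iterations ** alpha))
        self._batches = deque(maxlen=self.window)

    def add(self, policy_batch):
        """Store the batch from the newest policy; the oldest batch is evicted when full."""
        self._batches.append(list(policy_batch))

    def samples(self):
        """All retained samples, oldest policy first, with no reweighting."""
        return [sample for batch in self._batches for sample in batch]
```

A reward update would then draw from `samples()` rather than from the current policy's rollouts alone.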

---

Title: When Glass Disappears at Night: A Novel NIR-RGB Multimodal Solution

Authors: Tao Yan, Yiwei Lu, Ke Xu, Hao Chen, Hui Li, Xiaojun Chang, Xiaojun Wu, Rynson W. H. Lau

Abstract: Glass surface detection (GSD) has recently been attracting research interests. However, existing GSD methods focus on modeling glass surface properties for daytime scenes only, and can easily fail in nighttime scenes due to significant lighting discrepancies. We observe that, due to the spectral differences between Near-Infrared (NIR) light sources and common LED lights, NIR and RGB cameras capture complementary visual patterns (e.g., light reflections, shadows, and edges) of glass surfaces, and cross-comparing their lighting and reflectance properties can provide reliable cues for nighttime GSD. Inspired by this observation, we propose a novel approach for nighttime GSD based on the multi-modal NIR and RGB image pairs. We first construct a nighttime GSD dataset, which contains $6,192$ RGB-NIR image pairs captured in diverse real-world nighttime scenes, with corresponding carefully-annotated glass surface masks. We then propose a novel network for the nighttime GSD task with two novel modules: (1) an RGB-NIR Guidance Enhancement (RNGE) module for extracting and enriching the NIR reflectance features with the guidance of RGB reflectance features, and (2) an RGB-NIR Fusion and Localization (RNFL) module for fusing RGB and NIR reflectance features into glass features conditioned on the multi-modal illumination discrepancy-aware features. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in nighttime scenes while generalizing well to daytime scenes. Our dataset and code are available at https://github.com/YT3DVision/NGSDNet.

URL: https://openreview.net/forum?id=hdh3vHsakv

---

Title: Implicit geometric regularization in flow matching via density weighted Stein operators

Authors: Shinto Eguchi

Abstract: Flow Matching (FM) has emerged as a powerful paradigm for continuous normalizing flows, yet standard FM implicitly performs an unweighted $L^2$ regression over the entire ambient space. In high dimensions, this leads to a fundamental inefficiency: the vast majority of the integration domain consists of low-density ``void'' regions where the target velocity fields are often chaotic or ill-defined. In this paper, we propose {$\gamma$-Flow Matching ($\gamma$-FM)}, a density-weighted variant that aligns the regression geometry with the underlying probability flow. While density weighting is desirable, naive implementations would require evaluating the intractable target density. We circumvent this by introducing a Dynamic Density-Weighting strategy that estimates the target density directly from training particles. This approach allows us to dynamically downweight the regression loss in void regions without compromising the simulation-free nature of FM. Theoretically, we formulate an ideal $\gamma$-weighted regression geometry motivated by the $\gamma$-Stein metric, derive a variance-suppression bound for low-density regions, and use a weighted Dirichlet/spectral analysis to suggest a mechanism for smoother learned vector fields. Empirically, we evaluate $\gamma$-FM under a shared-time density-estimation protocol and compare it against both standard FM and an explicit Jacobian-regularized baseline using latent-space and image-space metrics.

URL: https://openreview.net/forum?id=LBlkVBDRdu

---

Title: When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Authors: Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee

Abstract: Retrieval-Augmented Generation (RAG) is widely used to extend large language models (LLMs) beyond their parametric knowledge, yet it remains unclear when iterative retrieval-reasoning loops meaningfully outperform traditional static RAG, particularly in scientific domains where multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence impose substantial complexity. This study provides the first controlled, mechanism-level diagnostic evaluation of whether synchronized iterative retrieval and reasoning can surpass even an idealized static upper bound (Gold-Context) RAG in the scientific domain. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze model behavior through a comprehensive diagnostic suite covering retrieval coverage gaps, anchor carry-drop, query quality, composition fidelity, and control calibration. Across models, iterative RAG consistently outperforms Gold Context, yielding gains up to 25.6 percentage points, particularly for non-reasoning fine-tuned models. Our analysis shows that synchronized retrieval and reasoning reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, benefits that static evidence cannot provide.
However, we also identify limiting failure modes, including incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, our results demonstrate that the process of staged retrieval is often more influential than the mere presence of ideal evidence in our evaluation experimental setup. We provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and establish a foundation for developing more reliable, controllable iterative retrieval-reasoning frameworks. The code and evaluation results are available at https://github.com/Matroid1998/Iterative-rag.

URL: https://openreview.net/forum?id=pa5TnBdyDP

---

Title: Scalable Ensemble Federated Learning with Enhanced Open-Set Recognition

Authors: Mustafa Siddiqui, Muhammad Tahir

Abstract: Consensus-driven parameter averaging constitutes the dominant paradigm in federated learning. Although many methods incorporate auxiliary mechanisms or refinements, repeated round averaging remains their fundamental backbone. This paradigm inherently depends on repeated rounds of client–server communication to maintain consensus. The reliance on repeated communication is further amplified in regimes with high data heterogeneity and large client populations, as shown across numerous studies. This behavior arises from optimization drift in out-of-distribution settings, where client objectives differ and multi-step local SGD updates increasingly diverge, making consensus difficult to maintain. We argue that an emerging alternative, ensemble with abstention, provides a more suitable framework for addressing these issues. Rather than enforcing consensus across diverging client objectives, this approach constructs a specialized mixture-of-experts model by preserving client-specific models and selectively aggregating their predictions. As a one-shot FL method, it eliminates the need for repeated communication rounds altogether. Moreover, supported by both theoretical and empirical analysis, we show that this paradigm sidesteps cross-client drift and is inherently less sensitive to data heterogeneity. Despite these advantages, ensemble with abstention introduces two fundamental challenges. First, its performance depends on the design of the open-set recognition (OSR) task, which directly affects performance under heterogeneity. Second, and more critically, preserving client-specific models causes linear growth in model size with the number of clients, limiting scalability. As a step toward addressing these limitations, we introduce FedSOV, which incorporates improved negative sample generation to prevent shortcut cues in the OSR task and employs pruning to address the scalability problem. 
We show that pruning provides a practical and effective solution to the scalability problem while simultaneously enhancing generalization, yielding higher test accuracy. Across datasets, our method demonstrates clear improvements in highly heterogeneous regimes compared to both the ensemble baseline FedOV and the strongest parameter-averaging method, FedGF. Code is available at: https://github.com/Mustafa00124/FedSOV

URL: https://openreview.net/forum?id=QnnCYOfuUI

---

Title: Random Character-Level Perturbations Amplify LLM Jailbreak Attacks

Authors: Shuyi Yu, Zhe Cao, Kohei Tsuji, Yusuke Sakai, Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Taro Watanabe

Abstract: Contemporary large language models (LLMs) exhibit remarkable capabilities, yet their subword tokenization mechanisms suffer from a vulnerability, whereby small character-level perturbations can re-partition text into unfamiliar subwords, degrading model performance across various tasks. Building on this, we show that this tokenization vulnerability also compromises safety mechanisms in jailbreak scenarios. We demonstrate how this vulnerability can be easily exploited through simple character-level manipulations, showing that minimal word-internal perturbations effectively increase the success rates of both simple and complex jailbreak attacks across multiple LLMs. We reveal that these perturbations lead to over-fragmented tokenization and token representation drift, resulting in substantial divergence in the semantic representations of words. Furthermore, our analysis using word-level semantic recovery and sentence-level spelling error detection and correction shows that models struggle to reconstruct the original semantics for perturbed content. In addition, layer-wise probe classifiers also fail to reliably detect the harmful intent of perturbed jailbreak prompts, further exposing the models' vulnerability in comprehending adversarially perturbed input. Finally, we discuss cases where perturbations reduce rather than increase attack success, observing that character-level noise can occasionally lead models to produce off-topic or incoherent responses. Together, our findings demonstrate that tokenization-induced vulnerabilities compromise safety mechanisms, underscoring the need for investigation into mitigation strategies.

URL: https://openreview.net/forum?id=BXsOIppKEI

---

Title: MSTN: A Lightweight and Fast Model for General Time-Series Analysis

Authors: Sumit S Shevtekar, Chandresh Kumar Maurya

Abstract: Real-world time series often exhibit strong non-stationarity, complex nonlinear dynamics, and behavior expressed across multiple temporal scales, from rapid local fluctuations to slow-evolving long-range trends. However, many contemporary architectures impose rigid, fixed-scale structural priors—such as patch-based tokenization, predefined receptive fields, or frozen backbone encoders—which can over-regularize temporal dynamics and limit adaptability to abrupt high-magnitude events. To handle this, we introduce the Multi-scale Temporal Network (MSTN), a hybrid neural architecture grounded in an Early Temporal Aggregation principle. MSTN integrates three complementary components: (i) a multi-scale convolutional encoder that captures fine-grained local structure; (ii) a sequence modeling module that learns long-range dependencies through either recurrent or attention-based mechanisms; and (iii) a self-gated fusion stage incorporating squeeze–excitation and a single dense layer to dynamically reweight and fuse multi-scale representations. Importantly, MSTN applies early temporal aggregation immediately after encoding, ensuring that all subsequent refinement and prediction modules operate in constant time O(1) with respect to sequence length, while the front-end encoder retains its original complexity (O(L²) for Transformer, O(L) for BiLSTM). This design enables MSTN to flexibly model temporal patterns spanning milliseconds to extended horizons, while avoiding the computational burden typically associated with long-context models. Across extensive benchmarks covering imputation, long-term forecasting, classification, and cross-dataset generalization, MSTN achieves state-of-the-art performance, establishing new best results on 21 of 27 datasets, while remaining lightweight (∼0.40M params for MSTN-BiLSTM and ∼1.06M for MSTN-Transformer) and suitable for low-latency inference (<1 sec, often in milliseconds), resource-constrained deployment. 
Code: https://github.com/SumitPTW/MSTN

URL: https://openreview.net/forum?id=je2N2nnDry

---

Title: Uncovering Language Model Processing Strategies with Non-Negative Per-Example Fisher Factorization

Authors: Michael S Matena, Colin Raffel

Abstract: Understanding the heuristics and algorithms that comprise a model's behavior is important for safe and reliable deployment.
While gradient clustering has been used for this purpose, gradients of a single log probability capture only a slice of the model's behavior, and clustering can only assign a single factor to each behavior.
We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that overcomes these limitations by decomposing per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices.
Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to heuristics used by language models on a variety of text processing tasks.
We find that NPEFF excels at decomposing behaviors comprised of multiple factors compared to the baselines of gradient clustering and activation sparse autoencoders.
We also show how NPEFF can be adapted to be more efficient on tasks with few classes.
We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing.
Along with ablation studies, we include experiments using NPEFF to study in-context learning.
We release the code used in this work.

URL: https://openreview.net/forum?id=UjeDVujI8q

---

Title: Let Me Explain, Again: Multiplicity in Local Sufficient Explanations

Authors: Ryan Pilgrim, Beepul Bharti, Kyle Poe, Rene Vidal, Jeremias Sulam

Abstract: When asked to explain their decisions, humans can produce multiple complementary justifications. In contrast, several feature attribution methods for machine learning produce only one such attribution, despite the existence of multiple equally strong and succinct explanations. The explanations found by these methods thus offer an incomplete picture of model behavior. In this paper, we study the problem of explaining a machine learning model's prediction on a given input from the perspective of minimal feature subsets that are sufficient for the model's prediction, focusing on their non-uniqueness. We give a tour of perspectives on this non-uniqueness, in terms of Boolean logic, conditional independence, approximate sufficiency, and degenerate conditional feature distributions. To cope with the multiplicity of these explanations, we propose a wrapper methodology that can adapt and extend methods that find a single explanation into methods for finding multiple explanations of similar quality. Our experiments benchmark the proposed meta-algorithm, which we call Let Me Explain Again (LMEA), against two multi-explanation method baselines on synthetic and real-world multiple-instance learning problems for image classification and demonstrate the ability of LMEA to augment two single-explanation methods.

URL: https://openreview.net/forum?id=d6FMg4hozX

---

Title: Towards Preventing Global Knowledge Forgetting in Federated Learning with Non-IID Data

Authors: Abhijit Chunduru, Majid Morafah, Mahdi Morafah, Vishnu Pandi Chellapandi, Ang Li

Abstract: Federated learning under client-level data heterogeneity remains challenging despite extensive work on drift correction, regularization, and improved aggregation. In this paper, we argue that an important yet underexplored failure mode is catastrophic forgetting of the global decision boundary during local training: as clients optimize their local objectives, they rapidly overfit to client-specific data and erase globally useful multi-class structure, causing server aggregation to average incompatible models rather than accumulate progress. We provide empirical evidence for this phenomenon through a controlled pilot study that directly visualizes decision boundary evolution in federated learning. Our analysis reveals that standard FL methods consistently forget the global decision boundary after local updates, even when clients are initialized from a strong pretrained global model. Motivated by this observation, we propose FedProj, a federated learning framework designed to preserve global functional knowledge throughout local optimization. FedProj maintains a small public-memory buffer and enforces a hard gradient constraint that prevents local updates from increasing a memory-based distillation loss, thereby acting as a safety barrier against global knowledge erosion. At the server, we further employ ensemble distillation on the same public proxy data to consolidate the preserved knowledge into a single global model. We conduct extensive experiments across computer vision and natural language processing benchmarks, covering highly non-IID regimes and domain-shifted settings. The results show that FedProj consistently outperforms state-of-the-art federated learning methods, highlighting the practical importance of explicitly preventing global decision-boundary forgetting.

URL: https://openreview.net/forum?id=lhTWPh3Tjm
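
Illustrative sketch (not taken from the paper): one common way to realize a hard constraint of the kind the abstract describes is an A-GEM-style gradient projection. If the local gradient conflicts with the gradient of the memory loss, the conflicting component is removed so that, to first order, a descent step cannot increase the memory loss. Whether FedProj uses exactly this rule is an assumption; the function name `project_gradient` and the flat-list gradient representation are hypothetical.

```python
def project_gradient(g_local, g_memory):
    """Project g_local so a descent step does not increase the memory loss (first order).

    Assumption: gradients are flat lists of floats, and the update is
    theta <- theta - lr * g. The memory loss then changes by roughly
    -lr * <g, g_memory>, so we require <g, g_memory> >= 0.
    """
    dot = sum(a * b for a, b in zip(g_local, g_memory))
    if dot >= 0.0:
        # No conflict: the local step already does not hurt the memory loss.
        return list(g_local)
    norm_sq = sum(b * b for b in g_memory)
    if norm_sq == 0.0:
        # Degenerate memory gradient: nothing to project against.
        return list(g_local)
    scale = dot / norm_sq
    # Remove the component of g_local that points against g_memory.
    return [a - scale * b for a, b in zip(g_local, g_memory)]


if __name__ == "__main__":
    conflicting = project_gradient([1.0, -2.0], [0.0, 1.0])
    print(conflicting)  # the component along the memory gradient is removed
```

After projection the inner product with the memory gradient is zero, which is the boundary of the feasible set; non-conflicting gradients pass through unchanged.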

---

Title: Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

Authors: Melika Behjati, James Henderson

Abstract: Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.

URL: https://openreview.net/forum?id=kndKGnE0tb

---

Title: Trip-to-Gaussian: A Versatile Framework for Unconditional 3D Generation

Authors: Youngwoo Jeon, Inhyeok Choi, Jaehyeok Shim, Sangjune Park, Kyungdon Joo

Abstract: Unconditional 3D generation is a classical task which focuses on learning the underlying distribution of 3D assets by exploring 3D representations, model architectures, and pipeline design. However, most existing methods are limited in versatility, struggling to scale from object- to scene-level generation. Achieving such versatility critically depends on how 3D representations are designed in the latent and output spaces, and how these spaces are connected. In this work, we focus on leveraging the expressiveness of triplane representation together with the fast and high-fidelity 3D Gaussian Splatting (3DGS). Yet, integrating these two representations remains a challenge due to their fundamentally different natures -- the structured triplane and unstructured 3DGS. Our core idea is a coarse-to-fine generation scheme that first extracts reliable geometric priors from triplane and subsequently refines them to capture detailed geometry and textures through 3D Gaussians. To this end, we introduce Trip-to-Gaussian, a versatile 3D generation framework that seamlessly integrates two distinct representations. We propose a Gaussian indicator module (GIM) along with surface occupancy fields (SOF), which generates coarse anchor points that serve as a reliable geometric prior for 3D Gaussians. Building upon this, we present a point upsampling module (PUM) that maps discontinuous and coarse anchor points into a continuous space, densifying them to ensure fine-grained representation. Extensive experiments demonstrate that our approach outperforms recent methods in both unconditional object and scene generation, establishing a versatile paradigm for 3D generation. Project page: https://vision3d-lab.github.io/trip2gs

URL: https://openreview.net/forum?id=9uL23Jcjvj

---

Title: On the Relationship Between CoCoA and ADMM for Distributed Empirical Risk Minimization

Authors: Runxiong Wu, Andi Wang

Abstract: Distributed empirical risk minimization (ERM) is often studied through two influential yet seemingly separate families of methods: CoCoA-type algorithms, derived from distributed dual coordinate ascent, and ADMM-type algorithms, derived from consensus and proximal splitting. In this paper, we investigate the connection between the two types of algorithms from a unified primal-dual perspective. We show that consensus ADMM, linearized consensus ADMM, two distributed proximal ADMM variants, and ridge-regularized CoCoA can all be written in a common update form involving a global primal variable and block dual variables. This reformulation makes several previously hidden connections explicit: For ridge-regularized ERM, CoCoA coincides with a particular proximal ADMM scheme at the level of the dual update. Moreover, consensus ADMM on the primal problem is equivalent to proximal ADMM on the dual problem under an explicit parameter mapping together with a sign reversal of the saddle objective; similar correspondences also hold for the linearized variants. These results indicate that ADMM-type algorithms, when fine-tuned, perform at least as well as CoCoA on ridge-regularized ERM problems. The unified view also yields a natural primal-dual gap stopping criterion for consensus ADMM and a unified O(1/T) ergodic convergence analysis for the ADMM-type methods. Experiments on synthetic regression problems and real SVM datasets support the predicted relationships, clarify the role of tuning parameters, and show that suitably tuned ADMM variants can outperform CoCoA in the ridge-regularized setting.

URL: https://openreview.net/forum?id=kLhxBDa2yD

---

Title: Expected Free Energy-based Planning as Variational Inference

Authors: Wouter W. L. Nuijten, Thijs van de Laar, Bert de Vries

Abstract: Planning under uncertainty requires agents to balance goal achievement with information gathering. Active inference addresses this through the Expected Free Energy (EFE), a cost function that unifies instrumental and epistemic objectives. However, existing EFE-based methods typically employ specialized optimization procedures that are difficult to extend or analyze. In this paper, we show that EFE-based planning can be formulated as Variational Free Energy minimization on a generative model augmented with epistemic priors. Our main result demonstrates that minimizing a Variational Free Energy functional with appropriately chosen priors yields a decomposition into expected plan costs (the EFE) plus a complexity term. This formulation reinforces theoretical consistency with the Free Energy Principle by casting planning as the same inferential process that governs perception and learning. We validate our approach on three environments of increasing complexity: a deterministic T-maze, a stochastic Reactivity Maze, and a partially observable MiniGrid DoorKey-8x8 environment. The experiments demonstrate that the epistemic priors induce information-seeking behavior, that the variational formulation yields policy-based inference outperforming plan-based methods under stochastic transitions, and that temporal factorization enables scalability to environments where existing tabular active inference methods cannot operate.

URL: https://openreview.net/forum?id=Kzm8I1oS1s

---

Title: Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Authors: Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, ZEYU FU

Abstract: Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning—a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences—providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. The source codes and data required to reproduce our results are available at https://github.com/Multimodal-Intelligence-Lab-MIL/RAMF.

URL: https://openreview.net/forum?id=U9KnNiuMu1

---

Title: Universal Latent Homeomorphic Manifolds: A Framework for Cross-Domain Representation Unification

Authors: Tong Wu, Tayab Uddin Wara, Daniel Hernandez, Sidong Lei

Abstract: We present the Universal Latent Homeomorphic Manifold (ULHM), a framework that unifies semantic representations (e.g., human descriptions, diagnostic labels) and observation-driven machine representations (e.g., pixel intensities, sensor readings) into a single latent structure. Despite originating from fundamentally different pathways, both modalities capture the same underlying reality. We establish homeomorphism, a continuous bijection preserving topological structure, as the mathematical criterion for determining when latent manifolds induced by different semantic-observation pairs can be rigorously unified. When this homeomorphic criterion is satisfied, it enables three critical applications: (1) semantic-guided sparse recovery from incomplete observations, (2) cross-domain transfer learning with empirically assessed structural compatibility, and (3) transductive zero-shot compositional learning via valid transfer from semantic to observation space. Our framework learns continuous manifold-to-manifold transformations through conditional variational inference, with training objectives explicitly designed to enforce bi-Lipschitz homeomorphic properties. We develop practical verification algorithms, including trust, continuity, and Wasserstein distance metrics, that empirically indicate whether the learned representations exhibit properties consistent with homeomorphic structure from finite samples. Experiments demonstrate substantial improvements over state-of-the-art (SOTA) baselines: (1) sparse recovery from 8% of pixels with much lower MSE than SOTA on CelebA under noise, (2) cross-domain transfer achieving 86.73% MNIST$\rightarrow$Fashion-MNIST accuracy without retraining, and (3) transductive zero-shot classification achieving 78.76% on CIFAR-10, exceeding prior work by 16.66%. 
Critically, the homeomorphism criterion determines when different semantic-observation pairs share compatible latent structure, enabling principled unification into shared representations within the tested domains and suggesting a structured basis for decomposing broad models into domain-specific components.

URL: https://openreview.net/forum?id=YoZSpRWhZH

---

Title: Evaluating the Reversal Curse in Model Editing

Authors: Hao-Xiang Xu, Jun-Yu Ma, Zhen-Hua Ling, Quan Liu, Cong Liu, Jia-Chen Gu

Abstract: Large language models (LLMs) are prone to hallucinate unintended text due to false or outdated knowledge. Since retraining LLMs is resource intensive, there has been a growing interest in model editing. Despite the emergence of benchmarks and approaches, these unidirectional editing and evaluation approaches have failed to explore the reversal curse. In this paper, we study bidirectional language model editing, aiming to provide a rigorous evaluation to assess if edited LLMs can recall the edited knowledge bidirectionally. A metric of reversibility is introduced and a benchmark dubbed Bidirectional Assessment for Knowledge Editing (BAKE) is constructed to evaluate if post-edited models can recall the edited knowledge in the reverse direction of editing. Experimental results show that while most editing methods are able to accurately recall edited facts along the modification direction, they exhibit substantial systematic deficiencies when evaluated in the reverse direction. Our findings also reveal that in-context learning (ICL) can mitigate the reversal curse to a certain extent.

URL: https://openreview.net/forum?id=jAHwodCUxP

---

Title: Fairness in Link Prediction Beyond Demographic Parity: A Reproducibility Study

Authors: Valentijn Oldenburg, Floris de Kam, Stef de Wildt, Jarno Nilson Balk

Abstract: In fair ranked link prediction, demographic parity ($\Delta_\mathrm{DP}$) is a common fairness metric. Yet, Mattos et al. (2025) argue that it fails to detect exposure bias because it ignores where links appear in the ranking. In this study, we reproduce this claim by showing that $\Delta_\mathrm{DP}$ can indicate aggregate parity even when some subgroup-pair links are systematically ranked lower than others. The proposed rank-aware Normalized Discounted KL-divergence (NDKL), however, does detect such disparities. We also reproduce the effectiveness of MORAL, a post-processing method that improves exposure-based fairness while maintaining competitive utility. Beyond reproduction, we assess robustness using synthetic homophily settings, categorical sensitive attributes, and additional fairness and utility metrics, including subgroup-pair-adapted Attention-Weighted Rank Fairness (AWRF). Overall, our results show that exposure-based metrics uncover biases hidden by $\Delta_\mathrm{DP}$ and that MORAL reduces these biases with minimal utility loss across diverse settings and datasets. We release a corrected, reproducible implementation at https://github.com/Floris93100/reproducing-MORAL.

URL: https://openreview.net/forum?id=QNCZoPb9uV
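
Illustrative sketch (not taken from the paper's released code): NDKL is commonly defined as a rank-discounted average of KL divergences between the group distribution in each top-i prefix of a ranking and a target distribution, with weights 1/log2(i+1). The minimal example below follows that standard definition; the function names, the group labels "a" and "b", and the toy rankings are hypothetical. It shows two rankings with identical aggregate group counts, which an aggregate parity measure cannot tell apart, receiving different NDKL scores.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for distributions given as dicts; terms with p[g] == 0 contribute 0."""
    return sum(pg * math.log(pg / q[g]) for g, pg in p.items() if pg > 0.0)


def ndkl(ranked_groups, target):
    """Normalized discounted KL-divergence of a ranking.

    ranked_groups: group label of the item at each rank, best rank first.
    target: desired group distribution (dict label -> probability, all > 0).
    """
    if not ranked_groups:
        raise ValueError("ranking must be non-empty")
    counts = {g: 0 for g in target}
    total, normalizer = 0.0, 0.0
    for i, group in enumerate(ranked_groups, start=1):
        counts[group] += 1
        prefix_dist = {g: c / i for g, c in counts.items()}
        weight = 1.0 / math.log2(i + 1)  # higher ranks weigh more
        total += weight * kl_divergence(prefix_dist, target)
        normalizer += weight
    return total / normalizer


if __name__ == "__main__":
    target = {"a": 0.5, "b": 0.5}
    alternating = ["a", "b", "a", "b"]
    segregated = ["a", "a", "b", "b"]  # same counts, group "b" pushed down
    print(ndkl(alternating, target), ndkl(segregated, target))
```

The segregated ranking scores higher (less fair) than the alternating one even though both contain two items per group, which is the exposure effect a position-blind metric misses.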

---

Title: Towards Understanding the Transferability of Adversarial Suffixes in Large Language Models

Authors: Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, J Zico Kolter, Andrej Risteski

Abstract: Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notably, these suffixes are often transferable—succeeding on prompts and models for which they were never optimized. And yet, despite the fact that transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model’s internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.

URL: https://openreview.net/forum?id=wQZmcEZCUK

---

Title: Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation

Authors: Martin Gillis, Isaac Xu, Thomas Trappenberg

Abstract: Machine learning applications require fast and reliable per-sample uncertainty estimation. A common approach is to use predictive distributions from Bayesian or approximation methods and additively decompose uncertainty into aleatoric (data-related) and epistemic (model-related) components. However, additive decomposition has recently been questioned, with evidence that it breaks down when using finite-ensemble sampling and/or mismatched predictive distributions. This paper introduces Variance-Gated Ensembles (VGE), an intuitive, differentiable framework that injects epistemic sensitivity via a signal-to-noise gate computed from ensemble statistics. VGE provides: (i) a Variance-Gated Margin Uncertainty (VGMU) score that couples decision margins with ensemble predictive variance; and (ii) a Variance-Gated Normalization (VGN) layer that generalizes the variance-gated uncertainty mechanism to training via per-class, learnable normalization of ensemble member probabilities. We derive closed-form vector-Jacobian products enabling end-to-end training through ensemble sample mean and variance. VGE matches or exceeds state-of-the-art information-theoretic baselines while remaining computationally efficient. As a result, VGE provides a practical and scalable approach to epistemic-aware uncertainty estimation in ensemble models.

URL: https://openreview.net/forum?id=fNMZjV1gje

---

Title: Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

Authors: Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski

Abstract: Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies have sufficient information about variations in latent confounders. For imperfect proxies the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption and observed data are consistent with multiple potential predictors (set-identified). To address this, we introduce latent equivalent classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification for the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active learning (PQAL) framework, which actively queries a minimal set of diverse domains that satisfy this rank condition. PQAL can efficiently recover the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and a semi-synthetic dSprites dataset.

URL: https://openreview.net/forum?id=QFJuVreJDC

---

Title: Legal Alignment for Safe and Ethical AI

Authors: Noam Kolt, Nicholas Caputo, Jack Boeglin, Cullen O'Keefe, Rishi Bommasani, Stephen Casper, Mariano-Florentino Cuellar, Noah Feldman, Iason Gabriel, Gillian K Hadfield, Lewis Hammond, Peter Henderson, Atoosa Kasirzadeh, Seth Lazar, Anka Reuel, Kevin Wei, Jonathan Zittrain

Abstract: Alignment of artificial intelligence (AI) encompasses the normative problem of specifying how AI systems should act and the technical problem of ensuring AI systems comply with those specifications. To date, AI alignment has generally overlooked an important source of knowledge and practice for grappling with these problems: law. In this paper, we survey the emerging field of legal alignment that aims to fill this gap and systematize research that studies how legal rules, principles, and methods can be leveraged to address problems of alignment and inform the design of AI systems that operate safely and ethically. Our survey provides a taxonomy of the three core research pathways of legal alignment and explores how each can be operationalized in practice: (1) designing AI systems to comply with the content of legal rules developed through legitimate institutions and processes, (2) adapting methods from legal interpretation to guide how AI systems reason and make decisions, and (3) harnessing legal concepts as a structural blueprint for confronting challenges of reliability, trust, and cooperation in AI systems. These research pathways present new conceptual, empirical, and institutional questions, which include examining the specific set of laws that particular AI systems should follow, creating evaluations to assess their legal compliance in real-world settings, and developing governance frameworks to support the implementation of legal alignment in practice. Tackling these questions requires expertise across law, computer science, and other disciplines, offering these communities the opportunity to collaborate in designing AI for the better.

URL: https://openreview.net/forum?id=BypXEQa7mf

---

Title: Can LLMs Rank Candidates with Missing Sensitive Attributes Fairly?

Authors: Oluseun Olulana, Fabricio Murai, Elke Rundensteiner

Abstract: Large language models (LLMs) are increasingly deployed in high-stakes ranking systems used for hiring, lending, and scholarship allocation, raising concerns about fairness, accountability, and ethical use. These challenges are exacerbated in ranking settings where sensitive demographic attributes are unavailable due to legal, ethical, or practical constraints. Inferring such attributes may introduce harm by violating consent requirements, misrepresenting individuals, and reinforcing structural inequities. This work thus investigates the timely question: How is LLM-based fair re-ranking impacted when demographic information is missing? In this context, we study three alternative strategies that span different places within the pipeline where demographic inference may be deployed: (1) inferring sensitive attributes using traditional third-party services prior to ranking, (2) directly prompting LLMs to produce fair rankings without explicit mention of attribute inference, and (3) employing a chain-of-thought approach in which LLMs are first prompted to infer attributes and thereafter to perform fairness-aware re-ranking. We compare these strategies across multiple datasets using established group-fairness metrics for ranking. Across the datasets we evaluate, LLMs achieve demographic inference accuracy comparable to leading third-party services. We further observe that LLMs can produce rankings that improve group-fairness metrics without explicitly inferring sensitive attributes, suggesting a possible design space for fairness interventions that avoids direct demographic labeling. In contrast to zero-shot in-context learning, few-shot prompting improves LLMs’ ability to balance fairness and utility in our experiments. We conclude by discussing ethical and governance implications of deploying LLMs for fairness-critical ranking tasks.
While LLMs offer flexibility under demographic uncertainty, their capacity for implicit inference also raises significant risks if adopted without transparency, evaluation, and institutional oversight. To support reproducibility and continued research exploration by others, we release our source code and experimental artifacts.

URL: https://openreview.net/forum?id=VrAs5EJ11G

---

Title: MM-Eureka: Toward Stable Multimodal Reasoning via Rule-based Reinforcement Learning with Policy Drift Control

Authors: Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Tiancheng Han, Daocheng Fu, Kaipeng Zhang, Ping Luo, Yu Qiao, Jiaheng Zhang, Michael Qizhe Shieh, Qiaosheng Zhang, Wenqi Shao

Abstract: Existing rule-based reinforcement learning (RL) methods that work well for text reasoning often collapse when extended to long-horizon multimodal reasoning settings. We identify a structural instability driven by ratio-based policy objectives under sparse multimodal rewards: importance sampling ratios in PPO-style objectives can amplify policy shifts, especially under negative advantages, which can trigger catastrophic mid-training collapse.
To make multimodal rule-based RL reliably trainable, we propose CPGD (Clipped Policy Gradient Optimization with Policy Drift), a stability-oriented RL objective that removes ratio-induced amplification while maintaining proximal updates via an explicit policy drift regularizer and a numerically stable KL estimator. We provide both theoretical analysis and empirical evidence showing that ratio-based objectives can systematically amplify policy drift beyond intended bounds under sparse-reward multimodal settings, and demonstrate how CPGD addresses this through controlled policy updates.
To support diagnosis and evaluation under consistent settings, we introduce MMK12, a K12-level multimodal reasoning dataset with 15,616 training problems and 2,000 evaluation questions across mathematics, physics, chemistry, and biology, all with human-verified solutions. Using CPGD on MMK12, we train MM-Eureka models that demonstrate stable long-horizon training without collapse. CPGD achieves consistent performance improvements while maintaining training stability throughout, validating that the instability mechanism has been effectively addressed. We open-source our complete pipeline at https://anonymous.4open.science/r/MM-EUREKA-C86D

URL: https://openreview.net/forum?id=8y1ch6y24H

---

Title: You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

Authors: Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, Yiran Zhong

Abstract: Linear attention mechanisms have gained prominence in causal language models due to their linear computational complexity and enhanced speed. However, the inherent decay mechanism in linear attention presents challenges when applied to multi-dimensional sequence modeling tasks, such as image processing and multi-modal learning. In these scenarios, the utilization of sequential scanning to establish a global receptive field necessitates multiple scans for multi-dimensional data, thereby leading to inefficiencies. This paper identifies the inefficiency caused by a "multiplicative decay" linear recurrence and proposes an efficient alternative "additive decay" linear recurrence to avoid the issue, as it can handle multi-dimensional data within a single scan. We further develop an efficient multi-dimensional sequential modeling framework called LightNet based on the new recurrence. Moreover, we present two new multi-dimensional linear relative positional encoding methods, MD-TPE and MD-LRPE, to enhance the model's ability to discern positional information in multi-dimensional scenarios. Our empirical evaluations across various tasks, including image classification, image generation, bidirectional language modeling, and autoregressive language modeling, demonstrate the efficacy of LightNet, showcasing its potential as a versatile and efficient solution for multi-dimensional sequential modeling.

URL: https://openreview.net/forum?id=XG9ngiTupe

---

Title: Graph State Networks (GSNs): Persistent Nodewise Selective State Space Models

Authors: Arijit Dey, Bahman Gharesifard

Abstract: Temporal graphs are often observed as streams of timestamped interactions, where accurate prediction requires retaining and selectively using historical information about nodes. Existing temporal graph models either (i) recompute representations from a sliding neighborhood/history at query time, or (ii) maintain a memory module but offer limited control and limited theory for what is retained over long horizons. We propose Graph State Networks (GSNs), a bucketed temporal-graph framework that maintains a persistent hidden state per node and updates it online using a content- and time-dependent selective state space update. Concretely, GSNs store node states in an explicit id-indexed state table and, for each bucket, read the current state, update it with a time-aware Mamba-like mechanism, and commit the state back via an exponential moving average controlled by a commit-rate. This commit mechanism provides an explicit "retention dial" and enables a tractable analysis of forgetting. We develop a capacity/recall theory for persistent node memory and show that, under incremental-stability assumptions on blank-bucket dynamics, the influence of a single past event admits a geometric forgetting bound, with the effective decay governed by the contraction of the blank dynamics and the commit mechanism. Empirically, GSNs are competitive on standard dynamic link prediction benchmarks under Average Precision (AP), with the strongest gains appearing in several inductive settings, while AUC-ROC remains more mixed. We validate these ideas with controlled synthetic write-wait-read probes. Under shared later blank sequences and a small nonzero state-noise blank update, the measured write-vs-zero-write relative influence exhibits near-exponential decay over the main operating regime. Our simulation studies verify this overall trend, including the nonzero floor at larger commit rates. We provide an extended implementation of GSNs.

URL: https://openreview.net/forum?id=zMEuBQfeT6
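
Editor's sketch (not from the paper): the abstract describes committing a node state through an exponential moving average controlled by a commit rate, and a geometric forgetting bound. The toy below illustrates only that arithmetic. The function names are hypothetical, and it assumes a blank bucket proposes the zero vector as its candidate state, which is a simplification of the Mamba-like update the authors describe.

```python
def commit(state, candidate, rate):
    """EMA commit: blend the stored state with a newly computed candidate."""
    return [(1.0 - rate) * s + rate * c for s, c in zip(state, candidate)]


def influence_after_blanks(write, rate, n_blanks):
    """Largest absolute entry of the state after each of n_blanks blank buckets.

    Assumption (editor's, for illustration): a blank bucket proposes the zero
    vector, so every commit scales the stored state by (1 - rate).
    """
    state = list(write)
    zero = [0.0] * len(write)
    history = [max(abs(x) for x in state)]
    for _ in range(n_blanks):
        state = commit(state, zero, rate)
        history.append(max(abs(x) for x in state))
    return history


# Hypothetical id-indexed state table holding one node's written state.
state_table = {"node_7": [1.0, -2.0]}
trace = influence_after_blanks(state_table["node_7"], rate=0.25, n_blanks=3)
```

Under this assumption the surviving influence after k blank buckets is the initial magnitude times (1 - rate)^k, so a larger commit rate forgets faster. This is the sense in which the commit rate acts as a retention dial in the toy.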

---

Title: Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

Authors: Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

Abstract: We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have separable dynamics but must coordinate to satisfy global resource constraints, a setting in which, as we demonstrate empirically, independent learning fails to produce feasible solutions because agents cannot determine appropriate individual contributions toward collective constraint satisfaction. The key technical contribution is showing that lightweight neighbor-to-neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. We prove that under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, and show that this translates to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds. Unlike centralized training with decentralized execution (CTDE) approaches, whose complexity grows at least quadratically with agent count, our method scales linearly in both training and execution. Experiments on smart grid demand response demonstrate that consensus coordination is \emph{essential for feasibility}: without it, agents satisfy grid capacity constraints only by indefinitely postponing demand, a degenerate non-solution. With consensus, agents converge to a shared dual variable and satisfy both grid constraints and demand fulfillment, scaling to thousands of agents while CTDE baselines are limited to dozens.

URL: https://openreview.net/forum?id=whihxstZcO
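
Editor's sketch (not from the paper): the abstract relies on neighbor-to-neighbor consensus over Lagrange multipliers. The snippet below shows plain averaging consensus on a ring graph, to illustrate how local exchanges alone can drive per-agent multipliers toward a shared value. The ring topology, the step size, and all function names are assumptions for illustration. The dual update driven by constraint feedback is omitted.

```python
def ring(n):
    """Neighbor lists for a ring of n agents (hypothetical topology)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def consensus_round(values, neighbors, step=0.3):
    """One synchronous round: each agent moves toward its neighbors' values.

    With two neighbors and step=0.3, each agent keeps weight 0.4 on its own
    value and puts 0.3 on each neighbor, so the network average is preserved.
    """
    return [
        v + step * sum(values[j] - v for j in neighbors[i])
        for i, v in enumerate(values)
    ]


def disagreement(values):
    """Spread between the largest and smallest multiplier."""
    return max(values) - min(values)


multipliers = [0.0, 1.0, 2.0, 3.0, 4.0]  # initial per-agent dual variables
graph = ring(len(multipliers))
for _ in range(30):
    multipliers = consensus_round(multipliers, graph)
```

In this toy the disagreement shrinks with every round while the average multiplier stays fixed. This matches, in spirit only, the abstract's statement that the consensus error decreases with the number of consensus rounds.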

---

Title: HERMES: Heterogeneous Effects Representation with Matched Embeddings using Siamese Networks

Authors: Rocco Zaccagnino, Gerardo Benevento, Delfina Malandrino, Donatello Telesca, Alessia Ture

Abstract: We consider the problem of estimating heterogeneous treatment effects from observational data. Specifically, we are interested in the estimation of conditional average treatment effects (CATE) functions, i.e. functions mapping the effect of a binary treatment to the space of unit-level covariates. In the absence of a controlled randomized mechanism of treatment assignment, simple comparisons between treated and control populations can be potentially confounded by significant distributional differences in the covariate space. In this context, recent representation learning strategies aim to learn balanced latent representations in a new space where the treated and control distributions are more comparable, reducing variance. We introduce HERMES (Heterogeneous Effects Representation with Matched Embeddings using Siamese Networks), a novel framework that integrates self-supervised contrastive learning into causal representation learning. HERMES employs a Siamese architecture that dynamically pairs individuals based on similarity in estimated individual treatment effects (ITE), encouraging representations where proximity reflects treatment-response similarity rather than covariate similarity alone. Unlike representation learning approaches that rely only on covariates, HERMES injects the ITE into representation learning, improving accuracy under standard assumptions. Experiments on IHDP and JOBS benchmarks show that HERMES improves the expected Precision in MSE by 14-15% over baselines, without added inference cost.

URL: https://openreview.net/forum?id=G6F0fkBQEG

---

Title: Gradient Tree Boosting for Regression Transfer

Authors: Dag Björnberg, Jonas Nordqvist, Morgan Ericsson, Johan Lindeberg, Welf Löwe, Johan E.S. Fransson

Abstract: Many real-world modeling problems are hindered by limited data availability. In such cases, \emph{transfer learning} leverages related source domains to improve predictions in a target domain of interest. We extend the classical gradient tree boosting paradigm to a regression transfer algorithm by modeling the weak learner as a sum of two regression trees. The trees are fitted on source data and target data, respectively, and jointly optimized for the target data. We derive optimal coefficients for the model update under the least-squares, the least-absolute-deviation, and the Huber loss functions. We benchmark our approach against boosting-based regression transfer methods in twelve transfer scenarios. The results indicate that our approach constitutes a competitive alternative within the realm of boosting-based regression transfer. Moreover, we provide a theoretical justification as well as empirical validation that our approach is robust under larger domain shifts.

URL: https://openreview.net/forum?id=b29TPa8NPT

---

Title: InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

Authors: Gautam Sreekumar, Vishnu Boddeti

Abstract: Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning in inference scenarios that follow physical laws unseen during training. In such novel physical environments, humans could adapt their physical reasoning based on provided demonstrations. This inductive physical reasoning ability is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks do not evaluate inductive physical reasoning and only consider the parametric knowledge in LMMs. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs' ability to predict the outcome of collision events in algorithmically generated synthetic videos. By inspecting over 13 open-source and proprietary LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when the physical laws underlying inference scenarios were unseen during training, and (3) inductive physical reasoning in LMMs suffers from language bias and may ignore the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs.

URL: https://openreview.net/forum?id=TGmridJZQo

---

Title: Towards Representation Backdoor on CLIP via Concept Confusion

Authors: Junchi Liao, Weimin Lyu, Lijie Hu, Shaopeng Fu, Tianhao Huang, Shu Yang, Jie Li, Di Wang

Abstract: Backdoor attacks pose a serious threat to deep learning models by allowing adversaries to implant hidden behaviors that remain dormant on clean inputs but are maliciously triggered at inference. Existing backdoor attack methods typically rely on explicit triggers such as image patches or pixel perturbations, which makes them easier to detect and limits their applicability in complex settings. To address this limitation, we take a different perspective by analyzing backdoor attacks through the lens of concept-level reasoning, drawing on insights from interpretable AI. We show that traditional attacks can be viewed as implicitly manipulating the concepts activated within a model’s latent space. This motivates a natural question: can backdoors be built by directly manipulating concepts? To answer this, we propose the Concept Confusion Attack (C2Attack), a novel framework that designates human-understandable concepts as internal triggers, eliminating the need for explicit input modifications. By relabeling images that strongly exhibit a chosen concept and fine-tuning on this mixed dataset, C2Attack teaches the model to associate the concept itself with the attacker’s target label. Consequently, the presence of the concept alone is sufficient to activate the backdoor, making the attack stealthier and more resistant to existing defenses. Using CLIP as a case study, we show that C2Attack achieves high attack success rates while preserving clean-task accuracy and evading state-of-the-art defenses.

URL: https://openreview.net/forum?id=jQ91DETUIr

---


New submissions
===============


Title: Who is the Winning Algorithm? Rank Aggregation for Comparative Studies

Abstract: Consider a collection of $m$ competing machine learning algorithms. Given their performance on a benchmark of datasets, we would like to identify the best performing algorithm. Specifically, we ask which algorithm is most likely to "win" (rank highest) on a future, unseen dataset. The standard maximum likelihood approach suggests counting the number of wins for each algorithm. In this work, we argue that there is much more information in the complete rankings, that is, the number of times that each algorithm finished second, third and so forth. Yet, it is not entirely clear how to effectively utilize this information for our purpose. In this work, we study the problem of estimating future win probabilities for each of the $m$ algorithms, and propose a statistically motivated weighting scheme that effectively incorporates the complete rankings over a benchmark of datasets. Our proposed framework improves upon currently known methods in synthetic and real-world examples.

URL: https://openreview.net/forum?id=LWyOQm9kAb
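
Editor's sketch (not from the paper): the abstract contrasts the win-counting baseline with the fuller information held in complete rankings. The snippet below computes both quantities for a small made-up benchmark. The data and function names are hypothetical, and the authors' weighting scheme is not reproduced here because the abstract does not specify it.

```python
def rank_counts(rankings, m):
    """counts[a][r] = number of datasets where algorithm a finished in place r.

    Each ranking lists algorithm indices from best (place 0) to worst.
    """
    counts = [[0] * m for _ in range(m)]
    for ranking in rankings:
        for place, algo in enumerate(ranking):
            counts[algo][place] += 1
    return counts


def win_frequencies(rankings, m):
    """Baseline estimate: fraction of datasets on which each algorithm won."""
    counts = rank_counts(rankings, m)
    n = len(rankings)
    return [counts[a][0] / n for a in range(m)]


# Made-up benchmark: 4 datasets, 3 algorithms (indices 0, 1, 2).
benchmark = [[0, 1, 2], [1, 0, 2], [0, 2, 1], [0, 1, 2]]
wins = win_frequencies(benchmark, m=3)
table = rank_counts(benchmark, m=3)
```

The win frequencies use only the first column of the rank-count table. The remaining columns (second place, third place, and so on) are the extra information the abstract argues should be exploited.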

---

Title: Blade: A Derivative-free Bayesian Inversion Method using Diffusion Priors

Abstract: Derivative-free Bayesian inversion arises in science and engineering applications, particularly when the forward model is costly or infeasible to differentiate through. Existing derivative-free methods collapse the posterior to a point estimate or return severely over-confident uncertainty on high-dimensional, nonlinear problems. We introduce Blade, which produces accurate and well-calibrated posteriors using an ensemble of interacting particles. Blade leverages diffusion models as data-driven priors, and only queries the forward model through forward evaluations (i.e., derivative-free). Theoretically, we show the convergence and stability of Blade under forward model approximation and prior score estimation error. Empirically, on nonlinear fluid dynamics, Blade produces well-calibrated posterior samples that existing derivative-free methods cannot, as measured by CRPS, the spread-skill ratio, and the rank histogram. Its accuracy and calibration improve consistently with more iterations and particles, backed by our convergence and stability analysis and empirical experiments.

URL: https://openreview.net/forum?id=p73sz3ajBc

---

Title: Opening Up a New Layer: A Deeper Look into "Interpreting CLIP with Hierarchical Sparse Autoencoders"

Abstract: Sparse Autoencoders (SAEs) have become essential for decomposing model activations into interpretable concepts. However, despite their effectiveness, SAEs lack a natural ordering of features, making it difficult to prioritize important concepts under compute constraints. The Matryoshka Sparse Autoencoder (MSAE) was introduced as a means of learning nested subspaces, theoretically forcing high-level features into earlier dimensions to enable adaptive granularity. In this paper, we reproduce and analyze the MSAE framework. The reproduction of the study suffers from certain challenges, but the main claims still hold. Furthermore, this study extends the original paper by implementing feature absorption as a metric, reducing computational cost and emissions with a method inspired by the Sandwich Rule, and deeply analyzing MSAE's ability to learn hierarchical information.

URL: https://openreview.net/forum?id=l0qZkSk2aH

---

Title: Reproducibility study of "Bilinear MLPs enable weight-based mechanistic interpretability"

Abstract: This paper presents a reproducibility study of "Bilinear MLPs enable weight-based mechanistic interpretability" by Pearce et al. (2024), which proposes that bilinear architectures possess intrinsic interpretability properties accessible via eigenvalue decomposition. We verify the core empirical image classification claims. Our results confirm the findings for image classification: bilinear layers consistently exhibit an interpretable low-rank structure where the leading eigenvectors capture the majority of task-relevant information, allowing for significant truncation without performance loss. Furthermore, we validate that these eigenstructures are stable across random initializations and varying model sizes. Additionally, we explore extensions to the original work, demonstrating that adversarial training (specifically PGD) enhances the interpretability of eigenvector features on MNIST. Finally, we explored generalization on more complex RGB datasets, such as CIFAR-10 and CIFAR-100, which have generated eigenvectors with uninterpretable structures. All our code is publicly available at: https://anonymous.4open.science/r/reproduced-mech-inter-image-class

URL: https://openreview.net/forum?id=Bnx3hf3mWp

---

Title: Demystifying Spectral Bias on Real-World Data

Abstract: Machine learning models can appear to learn functions easily, yet generalize poorly out of distribution due to shortcut learning. This gap arises because statistical properties of the data are often entangled with the model’s information processing. In this work, we aim to disentangle these two aspects by introducing *cross-dataset learnability*, which measures whether a function learned from one dataset is captured when evaluated on a choice of general distribution. We show that for kernel ridge regression (KRR) and Gaussian processes (GPs), cross-dataset learnability admits an explicit upper bound and that the cross-dataset learnability sample complexity can be predicted on real-world datasets. This framework further allows us to exploit symmetries present in realistic Neural Tangent Kernels (NTKs) and Neural Network Gaussian Processes (NNGPs), enabling a characterization of their spectral bias when learned from real-world and evaluated on symmetric distributions.

URL: https://openreview.net/forum?id=l3RF3e0lVj

---

Title: TreeSMOTE: Structure-Aware Data Augmentation for Imbalanced Tabular Learning

Abstract: Class imbalance has been a critical bottleneck in classification problems, undermining a classifier's identification of minority instances. Data augmentation provides an effective solution by oversampling the minority. Extant methods often generate samples through duplication, perturbation, or interpolation, largely relying on the assumption of local smoothness of the data space to ensure synthetic data reliability. Alternatively, generative models are leveraged for data learning and synthesis. However, both approaches encounter significant limitations in tabular data, primarily due to data heterogeneity and smaller sizes. Specifically, in tabular data with heterogeneous (e.g., continuous, categorical, and ordinal) features, local smoothness may not hold, and thus spatial proximity may fail to capture the underlying data distribution, resulting in noisy or misleading synthetic instances. Moreover, generative models can overfit (memorize) smaller tabular data. To bridge the gap in tailored augmentation for imbalanced tabular classification, we propose TreeSMOTE, a structure-aware oversampling framework that shifts the focus from the feature space to a tree-induced decision space. We quantify the similarity between samples based on the depth of their lowest common ancestor in decision paths from trees and synthesize label-consistent instances based on structurally similar samples. TreeSMOTE readily combines with and improves classification models by increasing data diversity and reducing label contamination. Extensive experiments on large-scale imbalanced tabular learning benchmarks demonstrate that TreeSMOTE consistently outperforms popular oversampling methods and generative models and further improves state-of-the-art ensemble approaches. Preserving the data manifold, TreeSMOTE offers an effective and efficient solution in imbalanced tabular data environments.

URL: https://openreview.net/forum?id=OelOS8cbBY
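
Editor's sketch (not from the paper): the abstract measures similarity between samples by the depth of their lowest common ancestor (LCA) in decision paths from trees. The snippet below shows one way that quantity could be computed when each path is given as a root-to-leaf list of node ids. The path encoding, the averaging across trees, and the function names are assumptions for illustration, not the authors' implementation.

```python
def lca_depth(path_a, path_b):
    """Depth of the deepest node shared by two root-to-leaf decision paths.

    Paths are lists of node ids starting at the root, which has depth 0.
    Returns -1 if the paths do not even share a root.
    """
    depth = -1
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth


def forest_similarity(paths_a, paths_b):
    """Mean LCA depth over trees (hypothetical way to aggregate a forest)."""
    depths = [lca_depth(a, b) for a, b in zip(paths_a, paths_b)]
    return sum(depths) / len(depths)


# Two samples routed through a two-tree forest (made-up node ids).
sample_x = [[0, 1, 3], [0, 2, 5]]
sample_y = [[0, 1, 4], [0, 2, 5]]
score = forest_similarity(sample_x, sample_y)
```

A deeper LCA means the two samples satisfied the same sequence of splits for longer, so they are closer in the tree-induced decision space even when their raw feature values differ.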

---

Title: FlashBind: Towards Accurate and Efficient Structure-based Virtual Screening

Abstract: Accurate prediction of protein-ligand interactions is central to computational drug discovery. Recent foundation models such as Boltz-2 have achieved remarkable accuracy in binding affinity prediction, yet their prohibitive computational cost remains a major barrier to large-scale virtual screening. Here we introduce FlashBind, a lightweight structure-based model that achieves a 50× speedup over Boltz-2 at inference time by replacing expensive structure prediction with a fast docking model and substituting costly PairFormer modules with a streamlined EGNN architecture. FlashBind matches Boltz-2 on standard virtual screening benchmarks and demonstrates superior generalization to enzyme-substrate specificity prediction. To evaluate real-world applicability, we apply FlashBind to target-based antibiotic screening against the essential bacterial proteins in E. coli and show that FlashBind substantially outperforms Boltz-2 and other virtual screening baselines. Notably, several top-ranked candidates exhibit potent inhibition of DnaG and effective bacterial growth inhibition against E. coli in wet-lab validation. Together, these results demonstrate that FlashBind bridges the gap between accuracy and efficiency, enabling ultra-fast, high-fidelity screening of massive chemical libraries for drug discovery.

URL: https://openreview.net/forum?id=2D91AcVcMi

---

Title: GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure

Abstract: Protein structure tokenization provides a discrete interface between 3D geometry and modern learning systems, with applications in reconstruction, retrieval, and generative modeling. However, existing protein structure tokenizers are still not sufficiently accurate, robust to structural perturbations, or efficient enough for real-world use, and the field still lacks a fully open, end-to-end method that combines these properties with transparent reproducibility for the community. In this work, we introduce GCP-VQVAE, a fully open discrete protein structure tokenizer built around a chirality-aware, SE(3)-equivariant GCPNet encoder. Our design is motivated by the hypothesis that stronger geometry-aware continuous representations provide a better substrate for discrete structure tokenization. Trained on monomer protein backbone structures from the AlphaFold Protein Structure Database, GCP-VQVAE delivers the strongest reconstruction performance among the open-source baselines evaluated in this work. For example, it attains 0.5293 Å RMSD on CASP15, reducing error by 38.5% relative to the strongest prior open baseline (AIDO), and 0.8193 Å RMSD on a zero-shot benchmark of 1,938 newly deposited experimental structures, a 59.2% improvement over the same baseline. In addition, the Large and Lite variants are approximately 408× and 530× faster than SOTA, respectively, while remaining robust to structural perturbations such as rigid-body rotations and other input corruptions. To the best of our knowledge, this is the first protein structure tokenizer to release the full training pipeline, datasets, model weights, and implementation details, providing a fully transparent and reproducible foundation for the community to build on.

URL: https://openreview.net/forum?id=bLLdOqDd6k

---

Title: Continuum-marginal optimal transport: a mesh-free kernel method

Abstract: In this paper we study \emph{continuum-marginal optimal transport}. Given a time-continuous family of probability marginals, the problem is to recover the minimum-energy velocity field whose flow reproduces every marginal. This problem is the continuum limit of the classical two-marginal Benamou--Brenier formulation, and also the deterministic limit of the Nelson problem of stochastic optimal transport. We propose a practical mesh-free solver for this problem. The weak continuity equation is embedded in a reproducing kernel Hilbert space, yielding a sample-only objective that requires no spatial discretization. The velocity is parametrized by any linear-in-parameters dictionary or neural network, and is optimized by mini-batch stochastic methods. Synthetic experiments confirm that the method achieves accurate drift recovery and marginal consistency. The same computational framework also applies to the stochastic Nelson problem.

URL: https://openreview.net/forum?id=1bzgCxWYpr

---

Title: Multimodal PDE Foundation Models and Applications to Physical Systems

Abstract: We introduce MORPH, a multimodal foundation model for partial differential equations (PDEs) designed to learn across heterogeneous physics datasets with varying dimensionalities, resolutions, and fields. Its architecture combines component-wise convolutions, field-wise cross-attention, and factorized axial spatiotemporal attention, enabling scalable learning across diverse PDE modalities. We pretrain multiple model variants on a heterogeneous PDE corpus using a next-frame prediction objective and evaluate transfer across a broad set of downstream tasks. These include forward modeling tasks such as autoregressive rollouts and terminal key-frame prediction, as well as inverse tasks including composite material property estimation, damage detection in aerospace structures, inertial-confinement-fusion parameter estimation, and sparse sea-surface-temperature reconstruction. Across zero-shot evaluation, full-model fine-tuning, and parameter-efficient low-rank adaptation, pretrained MORPH models consistently outperform models trained from scratch. Our results demonstrate positive cross-modality transfer, improved data efficiency in low-data regimes, favorable model- and data-scaling behavior, and strong generalization to out-of-distribution physical systems. Collectively, these capabilities establish MORPH as a flexible backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning.

URL: https://openreview.net/forum?id=XZoYWHtXEC

---

Title: Imbalanced Tabular Data Synthesis via LLM-Seeds and Interpolation

Abstract: Imbalanced tabular data poses a persistent challenge in machine learning, as classifiers often underperform on the minority-class data when trained on skewed data. Synthetic minority data generation is a standard approach, with traditional interpolation-based methods offering efficiency but limited representational capacity and being inapplicable under severe imbalance due to their reliance on existing data points. Recent approaches that employ large language models (LLMs) address this limitation by generating synthetic samples informed by contextual knowledge, but they are computationally expensive and often impractical at scale. To bridge this gap, we propose a hybrid approach that integrates the strengths of both strategies: LLMs generate a small set of contextually meaningful seed samples to expand the observed minority support, while interpolators efficiently generate additional samples within this augmented support. As a foundational work on this hybrid approach, we use standard LLMs and interpolators in our experiments to better observe the benefits of the hybrid design and to provide baseline results for further research. Extensive experiments across 60 benchmark tabular datasets show that the hybrid approach provides a considerable efficiency gain over the LLM-only method without performance degradation, demonstrating the potential of the hybrid approach as a complementary strategy for synthetic minority data generation in imbalanced tabular learning.

URL: https://openreview.net/forum?id=hQjjq5OaiV

---

Title: Towards Practical Reproduction of Stochastic Concept Bottleneck Models

Abstract: Stochastic Concept Bottleneck Models (SCBMs) model dependencies among concept logits with a joint Gaussian distribution, enabling interventions on corrected concepts to propagate to related non-intervened concepts. We reproduce the main SCBM experiments on a synthetic correlated-concept dataset and two natural image datasets, comparing SCBM with Hard CBM, autoregressive CBM, and Concept Embedding Models. Our results broadly validate the original empirical findings: SCBM remains competitive in predictive accuracy, improves concept-probability calibration, and enables more efficient interventions, requiring fewer manual concept corrections to achieve comparable concept and target accuracy. Beyond empirical reproduction, we study the practical cost of reproducing SCBM. We identify implementation bottlenecks in the official codebase and introduce a refactored, GPU-oriented pipeline with optimized data loading, batched model execution, batch-first intervention evaluation, and a vectorized Frank-Wolfe solver for dependency-aware interventions. These changes reduce the practical reproduction cost to approximately 62 wall-clock hours on a single RTX 4090. Our optimized implementation also changes the relative runtime behavior reported in the original paper, indicating that computational-efficiency claims are sensitive to implementation choices. Our study therefore supports SCBM's core methodological contribution while suggesting that its practical value should be framed primarily around dependency-aware intervention rather than raw computational efficiency.

URL: https://openreview.net/forum?id=LT4vR5S9Af

---

Title: When Vision Needs a Second Look: Tool-Augmented Active Perception for Earth Observation

Abstract: Earth Observation (EO) uses satellite and aerial imagery to monitor the Earth’s surface, supporting critical applications in infrastructure, agriculture, and climate change. As governments and industry scale EO pipelines, reliable automation has become essential. Yet, current Vision-Language Models are limited to coarse-grained perception, struggling to execute the precise, multi-step reasoning required for operational decision-making. Recent evaluations on benchmarks like GeoBench-VLM highlight this shortcoming: even state-of-the-art models show low accuracy and frequently struggle with tasks requiring precise numerical reasoning and domain-specific knowledge, such as object counting, crop-type classification, and assessing vegetation health. These limitations stem from their static, monolithic inference pipeline, which prevents adaptive analysis and error correction. To address these limitations, we introduce GeoScout-Agent, an autonomous agentic framework designed to overcome these constraints by coupling GPT-5-mini’s tool capabilities. The system built upon LangChain dynamically invokes code execution, progressive zooming, sharpening, and external context verification, while DINOv3 and SAM3 provide zero-shot segmentation and high-quality feature extraction for richer LLM context. This coordinated framework enables the model to iteratively decompose, validate, and refine its predictions rather than relying on a single forward pass.
Evaluated on GeoBench-VLM, our approach achieves substantial gains over standard VLM baselines. GeoScout-Agent consistently resolves intermediate failures, improves geospatial understanding, and achieves a relative 17.3% improvement across the evaluated tasks over the baseline approach. We will publicly release our code upon acceptance.

URL: https://openreview.net/forum?id=7yUrnyFgEq

---

Title: Sparse Covariance Supervised Principal Component Analysis

Abstract: Principal component analysis (PCA) is one of the most well-studied machine learning methods of the last century. However, the principal components derived by PCA are not guaranteed to be response-informative and are usually dense, meaning they are hard to interpret in high dimensional settings. The former has led to the development of supervised PCA techniques where the response is usually incorporated in an objective function to guide informative projections and enhance predictive accuracy, while the latter has led to sparse PCA methods that seek to induce sparsity by shrinking non-significant variables to zero, effectively improving interpretability. Sparse supervised PCA methods seek to combine the two concepts as a means of simultaneous supervised dimensionality reduction and variable selection, but they usually depend on iteratively biconvex solutions of auxiliary objective functions, have no robust convergence guarantees, and are sensitive to initialisation. In this paper, we propose a novel sparse supervised PCA method, sparse covariance supervised PCA (SCS-PCA), that seeks to trade off prediction accuracy and sparsity performance. We impose an $L_1$ penalty on a supervised objective function and we employ manifold proximal gradient descent to solve the derived optimization problem, which guarantees global convergence to a stationary point. Numerical results from simulations and real-world microarray data illustrate that SCS-PCA provides competitive performance in prediction tasks and is able to select more features compared to existing supervised sparse methods.

URL: https://openreview.net/forum?id=c2XDHSfZt2

---

Title: From Images to Signals: Are Large Vision Models Useful for Time Series Analysis?

Abstract: Large Vision Models (LVMs) are emerging tools for transferring cross-modal knowledge to time series, but their potential for this domain is not yet fully understood. This work addresses the gap by investigating LVMs for both high-level (classification) and low-level (forecasting) time series tasks. Our aim is not only to assess whether LVMs can succeed, but also to reveal why they succeed or fall short. Through a comparative benchmark covering four representative LVMs, eight imaging methods, 18 datasets, and 21 baselines, we identify the strengths and limitations of the foundational LVMs, as well as effective strategies for adapting them to time series modeling. Our findings indicate that while the LVMs are effective for time series classification, they face notable challenges in forecasting. In particular, the best-performing LVM-based forecaster is limited to specific model types and imaging methods, exhibits biases toward periodic components in time series, and struggles to leverage long look-back windows. We hope our findings will serve as both a cornerstone and a practical guide for advancing future research on LVM- and multimodal-based solutions for different time series tasks.

URL: https://openreview.net/forum?id=aWVG2rN4OV

---

Title: Adaptive Off-Policy Inference for M-Estimators Under Model Misspecification

Abstract: When data are collected adaptively, such as in bandit algorithms, classical statistical approaches such as ordinary least squares and $M$-estimation will often fail to achieve asymptotic normality. Although recent lines of work have modified the classical approaches to ensure valid inference on adaptively collected data, most of these works assume that the model is correctly specified. The misspecified setting poses unique challenges because the parameter of interest itself may not be well-defined over a non-stationary distribution of rewards. We therefore tackle the problem of off-policy inference in adaptive settings, where we uniquely define a projected solution over a stationary evaluation policy. Our method provides valid inference for $M$-estimators that use adaptively collected bandit data with a possibly misspecified working model. A key ingredient in our approach is the use of flexible approaches to stabilize the variance induced by adaptive data collection. A major novelty is that the procedure enables the construction of valid confidence sets even in settings where treatment policies are unstable and non-converging, such as when there is no unique optimal arm and standard bandit algorithms are used. Empirical results on semi-synthetic datasets constructed from the Osteoarthritis Initiative demonstrate that the method maintains type I error control, while existing methods for inference in adaptive settings do not cover in the misspecified case.

URL: https://openreview.net/forum?id=eSroiYLFC8

---

Title: Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation

Abstract: Small open-source code models that power IDE autocomplete still emit hallucinated Fill-in-the-Middle (FIM) completions: syntactically natural calls to methods, parameters, variables, and imports that do not exist in the surrounding project. Existing mitigations either require per-language execution sandboxes that do not apply at mid-keystroke or preference-optimisation pipelines that need large human-labelled corpora. We propose an execution-free alternative: use frontier code models to synthesise plausible-but-wrong completions as hard negatives, then leverage the contrast between these synthetic hallucinations and the ground-truth developer edit as a supervised fine-tuning signal. Our pipeline scrapes multilingual FIM contexts from public GitHub across eight languages and asks a panel of three frontier generators to produce one hard negative per context for each of four hallucination types drawn from the Delulu taxonomy, a Docker-verified multilingual FIM hallucination benchmark, yielding a paired chosen/rejected dataset. Fine-tuning Qwen2.5-Coder-7B-Instruct on a 100K-row curated subset lifts Delulu exact match by +18.8 points and edit similarity by +0.22 on every language and every type, while also improving every HumanEval-Infilling split and every SAFIM subset. The same recipe at 3B lifts Delulu by +12.8 EM with a small, characterised general-FIM trade-off. Five-axis ablations (size, type mix, language coverage, base-model family, and a difficulty-aware fool rate) plus a head-to-head SFT vs. DPO/ORPO comparison map which design choices drive the gain. We release the full pipeline source code (generation, fool-rate LLM judging, curation, and the FIM fine-tuning recipe) so that the experiments in this paper can be reproduced end-to-end on any permissively licensed corpus.

URL: https://openreview.net/forum?id=OoYmjSlQap

---

Title: SPRINT: Stochastic Performative Prediction With Variance Reduction

Abstract: Performative prediction (PP) provides a framework for learning in settings where model deployment influences the data distribution on which the model is trained. Unlike conventional learning under a fixed data distribution, PP involves model-induced distribution shifts that make it challenging to design algorithms converging to stable points, known as stationary performatively stable (SPS) solutions. Existing algorithms for finding SPS solutions, including repeated gradient descent (RGD) and greedy stochastic gradient descent (SGD-GD), have largely focused on strongly convex losses. Although recent work established an $\mathcal{O}(1/\sqrt{T})$ convergence guarantee for smooth non-convex losses, it relies on a bounded-variance assumption on stochastic gradients and converges only to a non-vanishing error neighborhood whose size scales with the variance. To address these limitations, we develop a variance-reduced stochastic optimization method for non-convex performative prediction. Our algorithm, stochastic performative prediction with variance reduction (SPRINT), converges to an SPS solution at a rate of $\mathcal{O}(1/T)$. This guarantee removes the bounded-variance assumption on stochastic gradients and contains no variance-dependent residual error. Experiments on multiple real-world datasets with non-convex models demonstrate that SPRINT converges faster and more stably than SGD-GD.

URL: https://openreview.net/forum?id=q0L3a1zSro

---

Title: WorldPack: Dynamic Frame Compression for Long-context Video World Modeling

Abstract: Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions.
However, achieving temporally and spatially consistent generation over long horizons remains an open challenge: existing approaches either compress past frames at fixed rates based on temporal proximity, discarding spatially critical information, or retrieve only a handful of relevant frames without increasing the total amount of retained history.
In this paper, we propose WorldPack, a video world model that introduces spatially-aware compressed memory to address both limitations simultaneously.
The key insight is that compression rates should not be uniform or temporally determined, but should instead be dynamically allocated based on 3D spatial relevance to the current viewpoint.
WorldPack achieves this through two tightly coupled mechanisms: trajectory packing, which fits substantially more historical frames into a fixed-length context through hierarchical frame compression, and geometric selection, which leverages camera pose information and field-of-view overlap to assign lower compression to spatially important frames and higher compression to less relevant ones.
Together, these mechanisms expand the effective context from 4 to 22 frames with only 16\% additional inference time, while preserving the most informative frames for spatial reasoning with high fidelity.
We evaluate WorldPack on LoopNav, a Minecraft benchmark for long-horizon spatial consistency, and conduct comprehensive experiments on RECON, a real-world navigation dataset, across multiple evaluation protocols.
WorldPack consistently outperforms strong baselines---including Oasis, Mineworld, DIAMOND, NWM---with particularly pronounced gains in spatial reasoning tasks that require recall of distant observations.

URL: https://openreview.net/forum?id=zJuiG3PiNJ

---

Title: Mycroft: Effective and Efficient External Data Augmentation

Abstract: Machine learning (ML) models often require large amounts of data to perform well.
When the data available to the model trainer is insufficient to obtain good performance on their desired task, they may need to acquire more data from external sources. Often, useful data is held by private entities who may be unwilling to share their data due to monetary and privacy concerns. This makes it challenging and expensive for model trainers to acquire the data they need to improve their model's performance. To tackle this problem, we propose Mycroft, a data-efficient framework that enables model trainers to compare data sources.
Mycroft identifies small but informative data subsets from each data owner by leveraging subset selection methods based on (i) a sparse gradient approximation, (ii) a low-rank kernel matrix approximation, and (iii) a voting-based coverage heuristic. This allows model trainers to identify useful data owners and improve model performance with minimal data exposure.
Our experiments on vision datasets show that Mycroft converges rapidly to the performance achievable when all the data is shared. Moreover, Mycroft is robust to label and data noise, and can effectively recover a utility-based ranking of data owners.
We believe Mycroft paves the way for principled data markets which can democratize training of high performance ML models.

URL: https://openreview.net/forum?id=Lc3Xq2shHM

---

Title: LossVal: Data Valuation using Weighted Loss Functions

Abstract: Machine learning models are often limited not by how much data we have, but by how much trustworthy data we have. We introduce LossVal, a data valuation method that computes per-sample importance scores during neural network training by integrating a self-weighting mechanism into standard loss functions (e.g., cross-entropy and mean squared error). LossVal produces meaningful importance scores without repeated retraining and achieves competitive performance on common data valuation tasks such as noisy sample detection and bad point removal. Across multiple classification and regression datasets, LossVal reliably distinguishes helpful from harmful samples. Experiments with ResNet-50 and BERT indicate that LossVal can also be applied to larger architectures in our experimental setup.

URL: https://openreview.net/forum?id=TIGUqDjVMZ

---

Title: Error Bounds for a Diffusion Model-Based Drift Estimator

Abstract: Parameter estimation in stochastic differential equations is a classical statistical problem of much importance in many scientific fields. Recent work of Tapia Costa et al. (2026) introduced a novel technique for estimating the drift when the diffusion parameter is known, using discrete samples from multiple trajectories.
Their method treats drift estimation as a denoising problem, and leverages tools from (conditional) score-matching diffusion models. Although their experiments showed promising results across different drift classes, the question of theoretical guarantees for their estimator was left unanswered.
In this note, we address this gap by exploiting techniques from diffusion model theory. More concretely, we derive an explicit risk bound for the time-averaged mean-squared error of said drift estimator. Our bound decomposes the risk into contributions from (i) Euler-Maruyama discretization, (ii) score/denoiser approximation, (iii) noise initialization, and (iv) sampling variance, revealing the trade-offs between the different hyperparameters and sources of error in the estimator.

URL: https://openreview.net/forum?id=cyHG8JSTGe

---

Title: Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Abstract: Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%–2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is anonymously available at https://anonymous.4open.science/r/OGKD-02B2.

URL: https://openreview.net/forum?id=TT7pMLmKPG

---

Title: Post-Anomaly Detection Inference for Deep SVDD

Abstract: Deep Support Vector Data Description (Deep SVDD) has become a prominent framework for unsupervised anomaly detection by learning latent representations that compactly characterize normal data around a center. Despite its empirical success, anomaly decisions produced by Deep SVDD are typically made solely based on anomaly scores without rigorous statistical guarantees, thereby limiting their reliability in safety-critical and high-stakes applications where false positives must be strictly controlled. In this paper, we propose PADI (Post-Anomaly Detection Inference), a novel framework that equips a trained and frozen Deep SVDD detector with statistically valid inference by leveraging the Selective Inference framework. Specifically, PADI performs inference conditional on the event that a test instance is identified as anomalous by Deep SVDD, thereby enabling rigorous statistical assessment of anomaly decisions. Based on this formulation, we derive valid selective $p$-values that quantify the statistical significance of the detected anomaly. Using these $p$-values, we theoretically establish control of the false positive rate (FPR) at a user-specified significance level $\alpha$ (e.g., $\alpha=0.05$). Furthermore, we extend the proposed framework to Deep Semi-Supervised Anomaly Detection (Deep SAD), providing a principled approach for statistically reliable inference in semi-supervised anomaly detection settings. Extensive experiments on both synthetic and real-world benchmark datasets robustly support the theoretical findings. The results demonstrate that PADI consistently achieves proper FPR control while attaining superior true positive rates compared with existing approaches.

URL: https://openreview.net/forum?id=f8XTHjxBig

---

Title: Model-Agnostic Association Rule Learning with Tabular Foundation Models

Abstract: Association Rule Mining (ARM) is a fundamental data mining technique for knowledge discovery in tabular data and is widely used in high-stakes decision-making. Existing approaches suffer from two fundamental limitations: reliance on frequent itemset mining, which leads to rule explosion (for classical methods), and failure in low-data regimes (for recent neural approaches). We reframe ARM as a conditional probability estimation problem and introduce a model-agnostic rule learning framework that extracts association rules from any conditional probabilistic model, formally decoupling rule extraction from frequent itemset mining. We introduce TabProbe, a probing-based algorithm that instantiates this framework, and demonstrate its generality across multiple model classes, ranging from gradient-boosted trees to large pretrained foundation models. As a particularly powerful instantiation, we leverage tabular foundation models (TFMs), large models pretrained on diverse tabular data with strong in-context generalization, which can be applied out-of-the-box without task-specific training. Evaluation on datasets of varying sizes shows that TFM-learned rules are concise, non-spurious, and highly generalizable, achieving up to 15% higher association strength than state-of-the-art with full data coverage. Furthermore, TFM-learned rules have better predictive performance on unseen data and remain robust in low-data settings. Source code is available at https://anonymous.4open.science/r/tabprobe-D0C0.

URL: https://openreview.net/forum?id=Y70lnNnA6T

---

Title: Reproducibility study of FACTER: Fairness-Aware Conformal Thresholding and Prompt Engineering

Abstract: We present a reproducibility study of FACTER, a post-hoc framework that combines conformal thresholding with iterative prompt engineering to mitigate demographic bias in black-box LLM-based recommender systems. Using the released codebase and experimental setting from the original paper, we evaluate FACTER on MovieLens-1M and Amazon Movies & TV with various LLM backbones. We assess fairness using the reported violation-based criterion group and counterfactual metrics (SNSR, CFR), and measure recommendation quality via catalog-mapped ranking metrics (NDCG@10, Recall@10) alongside the validity rate of generated items (Valid@10). Across datasets and supported backbones, we reproduce FACTER’s key qualitative behavior, with fairness violations decreasing sharply and converging within a small number of calibration rounds. However, unlike the original study, we observe a collapse in recommendation quality, largely driven by low validity of generated movie titles under open-vocabulary generation, inaccurate item-mapping and evaluation assumptions. We further identify and resolve multiple implementation and reproducibility issues in the released code, providing a cleaner and easier-to-run codebase to support future replication. Overall, our findings support FACTER’s effectiveness in reducing measured fairness violations, but with a higher loss in utility.

URL: https://openreview.net/forum?id=vX5mPaLdeQ

---

Title: PULSE: Projection-based Unlearning via Linear Speedy Entropy Maximization

Abstract: Machine unlearning enables selective erasure of knowledge associated with specific data points from trained models without retraining from scratch. However, existing retain-data-free methods typically degrade retain accuracy by 13-50%, require access to retain data to preserve model utility, and incur high computational costs. Moreover, the majority of existing approximate unlearning methods are not designed for the black-box setting, where the unlearner has access only to the last few layers up to the classifier head and not to the feature extractor, which is vendor-locked. To address these limitations, we propose PULSE (Projection-based Unlearning via Linear Speedy Entropy Maximization), a retain-data-free unlearning method that performs knowledge localization in representation space. PULSE introduces a learnable projection matrix that can be trained jointly with the model (fully retain-data-free during unlearning) or attached post hoc to any pretrained network (requiring only a small subset of training data for efficient initialization). During unlearning, a forget-specific projection is optimized to maximize confidence on the forget set via entropy minimization. Subtracting a scaled copy of this matrix from the original projection induces a targeted entropy increase on forget samples while preserving global model utility through controlled geometric transformations of localized feature subspaces. Extensive experiments on CIFAR-10, CIFAR-100, CIFARSuper20, and ImageNet-1k across MobileNetV2, ResNet18/50, and ViT-B/16 demonstrate that PULSE achieves competitive forgetting performance while preserving model utility. It also runs faster than strong baselines, establishing PULSE as a scalable and practical paradigm for efficient, localized machine unlearning in both joint-training and black-box post-hoc settings.

URL: https://openreview.net/forum?id=38wtCnBTyv

---

Title: ACRL: Adaptive Control of Training-Inference Discrepancy for Stable Reinforcement Learning

Abstract: Reinforcement Learning (RL) training for Large Language Models (LLMs) often suffers from instability due to the discrepancy between training and inference. This training-inference discrepancy stems from two primary factors: an architectural separation between training and inference engines, and the use of low-precision quantization in inference versus higher-precision computation in training. To address training instability issues caused by high training-inference discrepancy, we present the principles and methods for its adaptive control. We propose Adaptive Control Reinforcement Learning (ACRL), which adaptively maintains the training-inference discrepancy within a reasonable range to ensure stable RL training. Beyond stabilization, ACRL inherently increases policy entropy, thereby enhancing exploration and improving accuracy. The experimental results show that when the inference engine utilizes FP8 quantization, ACRL consistently maintains the training-inference discrepancy within a reasonable range and stabilizes RL training. Furthermore, ACRL not only matches the accuracy of the BF16 baseline but also outperforms importance sampling (IS) fixes.

URL: https://openreview.net/forum?id=8sBZSgAkAH

---

Title: Controlling Neural Network Generalization via Constraint-Guided Weight Transformations

Abstract: Despite the success of neural networks (NN), models often reach a plateau during training where they converge to a suboptimal region. In these cases, standard gradient-based optimization often fails to escape or recover, leading to overfitting. We show that generalization can be improved by deliberately perturbing a converged model in a constraint-guided, minimal way, and resuming training. To that end, we present Controlled Misclassification (CMC), a framework that identifies a small subset of training points whose predicted labels are intentionally flipped through minimal weight perturbations. Our approach uses mixed-integer linear programming (MILP) to ensure that model changes are minimal while enforcing the desired label changes and preserving the model’s overall structure. The key insight is that targeted, constraint-guided perturbations push the model out of sharp or overfitted regions of the loss landscape. When training is resumed from this modified state, the model converges to solutions with improved generalization. We evaluate our approach on 10 multiclass image datasets and 5 binary tabular datasets; we show that CMC improves test accuracy by up to 2.8%. By using constraint optimization for generalization, our method enables more precise and interpretable model edits than gradient-based fine-tuning, offering a verifiable way to enhance performance.

URL: https://openreview.net/forum?id=qhllNrMUIi

---

Title: AffecTalk: An Efficient Framework for Real-Time Affective Talking Head Synthesis

Abstract: Audio-driven talking head synthesis faces a "Generative Trilemma", where existing methods struggle to balance precise lip-synchronization, expressive naturalness, and real-time performance. We propose AffecTalk, an efficient unified framework that achieves a strong balance across these three objectives. AffecTalk introduces a novel differentiable 3DMM bridge, which provides robust geometric guidance to ensure accurate lip-sync while avoiding the error propagation inherent in cascaded systems. To achieve nuanced expressiveness without manual emotion labels for training, we devise an annotation-free emotion learning strategy via cross-modal knowledge distillation and a cross-synthesis consistency loss. These geometric and emotional representations are hierarchically injected into a single-pass GAN generator to minimize feature interference. Extensive experiments show that AffecTalk achieves a strong overall balance among state-of-the-art methods, generating highly expressive, high-fidelity video at speeds exceeding 55 FPS.

URL: https://openreview.net/forum?id=2oiuLGtmAI

---

Title: Nearest-Neighbor Imputation with Error Guarantees and Extensions for Mixed-Type Data and Joint Learning

Abstract: Missing feature values are pervasive in real-world applications, and remain a significant hurdle for downstream machine-learning tasks such as classification. Imputation methods combined with downstream tasks are often also time-consuming for high-dimensional data, and offer few theoretical guarantees on imputation error, especially for not-missing-at-random mechanisms. We first show that (weighted) nearest-neighbor approaches remain competitive on real-world data sets compared to the state-of-the-art, while being orders of magnitude faster. Second, we derive a novel concentration inequality from which we obtain theoretically supported bounds on the imputation error for several types of missingness mechanisms in nearest-neighbor algorithms. Third, we show that nearest-neighbor algorithms can be adapted to mixed-type imputation and extended to joint training with downstream tasks by introducing a data-distribution-preserving function and tuning the weights with an online learner. We validate our theoretical bounds on synthetic data sets and our empirical results on nine real-world data sets. This paper demonstrates the strength of nearest-neighbor imputation and opens the way towards more theoretically backed approaches for imputation.

URL: https://openreview.net/forum?id=IBsRy81oaj

---

Title: Two-Stage Variance Approximation for Heteroscedastic Causal Discovery

Abstract: Learning causal structures from observational data is difficult when noise variances are unequal or depend on parent values (heteroscedasticity). We propose a two‑stage framework that decouples structure learning from variance estimation. Instead of modeling full variance functions, we use a variance‑matrix approximation: node‑wise constant variances expanded across samples, refined by a small, bounded per‑sample correction. We show that, under heteroscedastic causal models, the optimal constant surrogate equals the expected conditional variance, and the residual approximation error is a scale‑invariant gap. We develop practical centralized and federated algorithms using stabilizers, including variance clipping and a progressive variance floor. Extensive empirical studies on both synthetic and real-world data show that our proposed approach discovers more plausible causal structures than competing baselines.

URL: https://openreview.net/forum?id=Pfjb9kK9sT

---

Title: Structural Bias Beyond Homophily: A Study of Fairness in Link Prediction

Abstract: Graph link prediction (LP) plays a critical role in socially impactful applications such as job recommendation and friendship formation, making fairness a critical concern in this task. While many fairness-aware methods manipulate graph structures to mitigate prediction disparities, the topological biases inherent to social graphs remain poorly understood and are consistently conflated with homophily alone. In this work, we study the relationship between structural biases and fairness outcomes in LP. To this end, we formalize a taxonomy of topological bias measures and introduce a graph generation method producing a diverse corpus of synthetic graphs with controlled structural properties. Using this corpus, we show empirically that fairness outcomes are strongly correlated with graph topology, and that current fairness-aware methods remain sensitive to structural biases beyond homophily. These findings highlight the need for structurally grounded evaluations in fair graph learning.

URL: https://openreview.net/forum?id=xMQ5v0cxH3

---

Title: Screening Feedback for Language Models with Costly Verification

Abstract: Language model training and alignment rely on high-quality human feedback, yet platforms must incentivize valuable contributions while limiting harmful feedback from non-experts. We study a screening environment in which a platform commits to a uniform policy $(\rho,R,P)$---a verification rate, a reward for submitting feedback, and a sanction imposed when verified feedback is harmful---and heterogeneous users decide whether to participate. High-type users are more likely to produce helpful feedback, while low-type users are more likely to generate harmful feedback, and user types may differ in their effective exposure to sanctions.

We characterize platform-optimal verification, reward, and sanction policies under costly verification in robust pure-participation regimes. A key boundary condition, $\phi_H(1-\eta_H)=\phi_L(1-\eta_L)$, separates parameter regions in which incentives implement normal separation, where high types participate and low types abstain, from regions exhibiting reverse screening, where low types participate while high types are deterred. Verification is the primary policy margin: it improves the value of screened feedback and enables sanctions, but it is costly because aggregate verification costs are convex in the mass of verified feedback. Rewards are pinned down by participation constraints, while sanctions are useful only when their expected collections exceed the induced increase in reward compensation and are limited by enforcement and reputational costs.

We further show that, under optimal verification, platform profit need not increase monotonically with population quality. In our numerical illustration, this non-monotonicity appears as an inverted-U pattern, with profit peaking at intermediate shares of high-type users. The mechanism is that a larger high-type population changes the platform's optimal verification intensity, and the resulting adjustment in verification benefits and costs can make additional high-type participation less profitable at the margin. Finally, we provide an illustrative simulation using a bigram language model as a transparent calibration exercise to generate plausible magnitudes for $(\eta_H,\eta_L)$ and to visualize the model's comparative statics.

URL: https://openreview.net/forum?id=lqY1n8zekd

---

Title: [Re] Boosting the Visual Interpretability of CLIP via Adversarial Fine-Tuning

Abstract: This paper presents a reproducibility study of "Boosting the Visual Interpretability of CLIP via Adversarial Fine-Tuning" by Gong et al. (2025), published at ICLR 2025, which proposes an unsupervised adversarial fine-tuning (AFT) method with norm regularization to enhance the visual interpretability of CLIP's image encoder. We attempt to reproduce the key claims regarding improved saliency map quality, increased concept alignment, transferability to out-of-distribution datasets, and the trade-off with zero-shot accuracy. Beyond reproduction, we propose a saliency-guided regularization extension that introduces an Energy Pointing Game loss, directly supervising the spatial alignment of Simple Gradient saliency maps with target objects. We evaluate our extension across a range of saliency-loss weights and show that explicit saliency supervision improves localization metrics with only a modest reduction in adversarial robustness.

URL: https://openreview.net/forum?id=uPXRBZfkYy

---

Title: The Impact of Semantic Pairs on Self-Supervised Representation Learning

Abstract: Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs. While this encourages invariance to handcrafted transformations, same-image positives can preserve nuisance correlations such as background, texture, illumination, and object-specific details. Semantic positive pairs, i.e., different same-class instances, may reduce these correlations by presenting objects across diverse contexts. However, previous studies often combine semantic pairs with augmented positives or false neighbors (i.e., incorrectly mapped semantic pairs), making it difficult to isolate the effect of semantic pairing. We present a controlled empirical study of semantic positive pairs for self-supervised representation learning. From ImageNet-1K, we construct two matched subsets: an augmented-pair baseline and a manually curated semantic-pair dataset with the same class composition and training-pair count. We use these datasets to compare representative contrastive and non-contrastive SSL methods under matched training conditions. Across transfer learning and object detection evaluations, semantic-pair pretraining consistently improves generalisation over augmented-pair pretraining. Additional ablations show that semantic pairs induce invariances beyond the standard transformation pipeline. Among the evaluated methods, contrastive learning benefits most strongly from semantic pairs, with SimCLR showing the largest relative improvement. These results clarify the role of semantic positive pairs in SSL and provide guidance for selecting and designing frameworks that can exploit semantic pair information effectively.

URL: https://openreview.net/forum?id=7RgiccEWpN

---

Title: Chronos: A Unified Framework for Predicting Training Time and Convergence in Deep Learning

Abstract: Training deep neural networks requires extensive experimentation to estimate how long it will take, and how much it will cost, for a model to reach convergence. Chronos is the first end-to-end framework that predicts both the number of epochs to convergence and the total training time and cost before full training is performed for the target configuration. It integrates computational features that represent model behavior with early signal probes capturing initialization dynamics from a single mini-batch. Leveraging these signals, Chronos learns a cross-architecture mapping that generalizes across models and hardware configurations. It provides zero-shot estimates of convergence and cost for new target configurations without performing full training for those configurations. Across a diverse spectrum of architectures, from lightweight models such as MobileNetV2 and DeiT-Tiny to deeper and more complex networks, including ViT, ResNet-50, and DenseNet-121, Chronos achieves an average prediction error of 13.7% MAPE for iteration-level execution time and 22.1% MAPE for convergence estimation, demonstrating robust generalization across model scales and training complexities.

URL: https://openreview.net/forum?id=gDLPOEjQTH

---

Title: Decoupling Task and Behavior: A Two-Stage Reward Curriculum in Reinforcement Learning for Robotics

Abstract: Deep Reinforcement Learning is a promising tool for robotic control, yet practical application is often hindered by the difficulty of designing effective reward functions. Real-world tasks typically require optimizing multiple objectives simultaneously, necessitating precise tuning of their weights to learn a policy with the desired characteristics. To address this, we propose a two-stage reward curriculum where we decouple task-specific objectives from behavioral terms. In our method, we first train the agent on a simplified task-only reward function to ensure effective exploration before introducing the full reward that includes auxiliary behavior-related terms such as energy efficiency. Further, we analyze various transition strategies and demonstrate that reusing samples between phases is critical for training stability. We validate our approach on the DeepMind Control Suite, ManiSkill3, and a mobile robot environment, augmented to include auxiliary behavioral objectives. Our method proves to be simple yet effective, substantially outperforming baselines trained directly on the full reward while exhibiting higher robustness to specific reward weightings.
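
The two-stage idea can be pictured with a short sketch. It assumes a hard switch at a fixed step, which is only one possible transition strategy; the abstract says several were analyzed but does not specify them. The function name, the `switch_step` parameter, and the `energy` term are all hypothetical.

```python
def curriculum_reward(task_reward, behavior_terms, weights, step, switch_step):
    """Return the reward used for training at a given step.

    Stage 1 (step < switch_step): task-only reward.
    Stage 2 (step >= switch_step): task reward plus weighted behavior terms.
    The hard switch is an assumption made for illustration.
    """
    if step < switch_step:
        return task_reward
    return task_reward + sum(weights[name] * value for name, value in behavior_terms.items())


# Hypothetical values: a task reward of 1.0 and an energy penalty of -0.2.
early = curriculum_reward(1.0, {"energy": -0.2}, {"energy": 0.5}, step=10, switch_step=100)
late = curriculum_reward(1.0, {"energy": -0.2}, {"energy": 0.5}, step=100, switch_step=100)
```

Before the switch the agent sees only the task reward; afterwards the behavior terms are added with their weights.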

URL: https://openreview.net/forum?id=VoD4BzDef8

---

Title: INDEQS: Informed Neural controlled Differential EQuationS

Abstract: Neural Controlled Differential Equations (NCDE) provide a powerful continuous-time framework for forecasting time series, but standard graph-based extensions typically learn spatial structure purely from data, even in settings where a directed graph structure is known a priori. We introduce Informed Neural controlled Differential EQuationS (INDEQS), a graph-based NCDE forecasting method that incorporates prior knowledge of a directed graph at distinct architectural positions. INDEQS separates inner mixing of hidden states across graph nodes from outer mixing between vector field and control, and offers both a lightweight graph-constrained variant and a more expressive variant, learning additional graph connections from data via adaptive graph convolutions. To systematically study when graph informedness is beneficial in forecasting, we devise a continuous advection simulation on directed graphs, yielding synthetic spatio-temporal datasets with known ground-truth flow structure. We then evaluate INDEQS on two real-world tasks: river discharge forecasting on a hydrological network and traffic flow prediction on PeMS08. Across these synthetic and real-world benchmarks, outer informedness consistently improves mean absolute error over an uninformed NCDE with comparable parameter count, particularly on larger graphs, while inner informedness offers a more parameter-efficient alternative when strict adherence to a known adjacency is desired. A comparison of discrete convolutional and continuous-time decoders further shows that continuous decoders yield better accuracy and greater temporal flexibility on real-world tasks. An implementation of INDEQS and the advection simulation is available at https://anonymous.4open.science/r/indeqs_tmlr-588E.

URL: https://openreview.net/forum?id=okGwJeKlZ4

---

Title: GeoGCD: Geometry-Guided Hierarchical Learning for Generalized Category Discovery

Abstract: Generalized Category Discovery (GCD) requires learning a feature space in which both known and previously unseen categories are well organised across multiple semantic granularities. Existing hierarchical approaches rely either on fixed taxonomic priors, which are blind to the actual geometry of the learned features, or on data-driven prototypes, which discard the prior semantic structure; cross-level alignment is typically enforced through KL-based losses that become uninformative or unstable precisely when novel-class predictions are most fragile. We argue that the global geometry of semantics, the shape that taxonomic relations carve in feature and label space, is the natural object to be preserved in this setting, and we introduce a Geometry-Guided Hierarchical Learning framework for Generalized Category Discovery (GeoGCD) that operationalises this principle. GeoGCD models pairwise relations through a hybrid soft-label matrix that fuses taxonomic similarity with a diffusion similarity derived from random walks on the batch graph, and aligns predictions across granularities via a Wasserstein consistency loss for which we establish formal guarantees of finiteness, continuity, and gradient informativeness under disjoint support. On standard fine-grained GCD benchmarks, GeoGCD achieves new state-of-the-art overall accuracy on CUB-200-2011 and Stanford Cars (79.93 and 80.10, respectively), while notably improving performance on known classes through better preservation of global semantic geometry (87.73 and 93.57 Old accuracy, respectively). Code is available at: https://anonymous.4open.science/r/GeoGCD-7C66.

URL: https://openreview.net/forum?id=hP61dEp1Y1

---

Title: KOALA: Koopman Predictive Coding for WiFi-Based Anticipatory Human Motion Prediction

Abstract: WiFi Channel State Information (CSI) has emerged as a privacy-preserving alternative to cameras for human pose estimation. However, existing approaches treat pose inference as an instantaneous regression problem and do not model temporal dynamics, making future motion prediction infeasible. Naively applying vision-based prediction methods compounds the estimation noise already present in CSI-derived poses, as autoregressive rollouts amplify errors at every step. We propose KOALA, a framework for human motion prediction directly from WiFi CSI, by lifting noisy CSI-derived pose sequences into a learned Koopman latent space where nonlinear dynamics become linear, enabling multi-horizon prediction via simple matrix-vector products without autoregressive iteration or error accumulation. A residual CSI-conditioned operator resolves the identity attractor problem inherent in Koopman formulations, and an anchor-delta prediction head eliminates the degenerate shortcut of copying the current pose across all horizons. To regularise the lifting and operator jointly, we introduce a Koopman Predictive Coding (KPC) loss that operates in the temporal-encoder feature space, enforcing dynamical consistency across prediction horizons without requiring contrastive, spectral, or auxiliary losses. Experiments on MM-Fi and WiPose show that KOALA achieves robust, consistent performance across both short- and long-term prediction horizons, outperforming all baselines by a substantial margin.
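
As a rough illustration of why linear latent dynamics allow multi-horizon prediction with matrix-vector products, the sketch below applies powers of a fixed operator `K` to one latent state. It is a generic Koopman-style toy, not the KOALA architecture: the learned lifting, the residual CSI-conditioned operator, and the anchor-delta head are all omitted, and the names and the 2x2 operator are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]


def operator_powers(K, horizons):
    """Return {h: K^h} for each requested horizon h >= 1."""
    powers, current = {}, K
    for h in range(1, max(horizons) + 1):
        if h > 1:
            current = matmul(current, K)
        if h in horizons:
            powers[h] = current
    return powers


def predict_latents(K, z, horizons):
    """Predict the latent state at each horizon as K^h applied to z."""
    return {h: matvec(Kh, z) for h, Kh in operator_powers(K, horizons).items()}


# Toy operator (assumed): a 90-degree rotation of a 2-D latent state.
K = [[0, -1], [1, 0]]
predictions = predict_latents(K, [1, 0], horizons={1, 2, 4})
```

Every horizon is computed from the same starting latent, so no predicted pose is decoded and fed back in between steps.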

URL: https://openreview.net/forum?id=JQA0EfQIfj

---

Title: Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

Abstract: *Concept-based explanations* offer a promising approach for explaining the predictions of deep neural networks in terms of high-level, human-understandable concepts. However, existing methods either do not establish a causal connection between the concepts and model predictions or are limited in expressivity and only able to infer causal explanations involving single concepts. At the same time, the parallel line of work on *formal abductive and contrastive explanations* computes the minimal set of input features causally relevant for model outcomes but only considers low-level features such as pixels. Merging these two threads, in this work, we propose the notion of *concept-based abductive and contrastive explanations* that capture the minimal sets of high-level concepts causally relevant for model outcomes. We then present a family of algorithms that enumerate all minimal explanations while using *concept erasure* procedures to establish causal relationships. By appropriately aggregating such explanations, we are not only able to understand model predictions on individual images but also on collections of images where the model exhibits a user-specified, common *behavior*. We evaluate our approach on multiple models, datasets, and behaviors, and demonstrate its effectiveness in computing helpful, user-friendly explanations.

URL: https://openreview.net/forum?id=7YVclyf1Ec

---

Title: Grade Discovery as an Identifiability Diagnostic for Clifford Valued Features

Abstract: Clifford valued models can represent scalars, vectors, bivectors and pseudoscalars in one algebra, but current models usually assume the geometric type of each feature is known prior to training. This assumption is not always identifiable from the data. For example, rotation only evidence makes scalar and pseudoscalar transformation laws indistinguishable, whereas reflections reveal the difference. We study grade discovery as an identifiability diagnostic, where grade means the Clifford algebra type of a feature channel, such as scalar, vector, bivector or pseudoscalar. The method estimates this type from paired observations before and after known transformations, assigns a soft weight over candidate grades with the same coordinate dimension and fits this weight with a least-squares equivariant prediction loss. We prove that the loss recovers the correct type when the observed transformations separate the candidate transformation laws and that it remains uninformative when those laws agree on all observed transformations. In controlled three-dimensional experiments, rotation only evidence keeps the true type weight at one half, while rotations with reflections assign weight near one to the true type. Reflection frequency and loss landscape experiments show how often parity revealing transformations must appear and why they remove the flat ambiguity. Stress tests further show that inaccurate orthogonal transformation matrices and weak variance in separating directions reduce the diagnostic evidence. Finally, a differentiable soft gate and a minimal trainable prediction module show that the diagnostic can be optimized jointly with another parameter, although this experiment is not evidence of performance in full Clifford neural networks.

URL: https://openreview.net/forum?id=8Q9Xmttlum

---

Title: Reasoning as Meta-Learning: Understanding Long Chain-of-Thought Reasoning in LLMs through an Optimization Lens

Abstract: We propose a novel framework RaML for interpreting the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning. We further explore the potential of the proposed RaML to advance LLM reasoning and provide valuable insights. Our work deepens the understanding of LLM reasoning processes and provides actionable insights for enhancing these models through established meta-learning techniques.

URL: https://openreview.net/forum?id=qtvkca3UNQ

---

Title: Attention by Synchronization in Coupled Oscillator Networks

Abstract: We address transformer attention on energy-constrained physical substrates.
Softmax attention requires exponentiation and global reduction, operations
with high energy cost on von Neumann hardware and no natural physical analog.
We show that Kuramoto synchronization dynamics (which arise in electrical,
mechanical, superconducting, and charge-density-wave oscillator arrays, among
other physical systems) implement a well-defined attention operation without
either. The resulting mechanism, *fixed-query oscillator attention*,
replaces softmax's arithmetic with the equilibration of a gradient flow on the
sphere: queries are learned anchors fixed on the sphere, and free oscillators
evolve under Kuramoto–Lohe dynamics until they settle at positions encoding
attention weights via cosine similarity. Because the computation is
equilibration, it requires no exponentiation; the only global operation is an
affine normalization at readout. The fixed point is provably unique and
globally attractive from almost every initial condition, a guarantee that
holds across every physical realization. Empirically, at the minimal hardware
configuration (oscillator dimension $d_{\mathrm{osc}} = 2$), oscillator attention
outperforms softmax on keyword spotting ($+1.00$ pp) and on subject-verb
agreement ($+5.27$ pp on hard sentences, with zero training failures versus
one in five for softmax). On causal language modeling, where softmax retains
an advantage, oscillator attention closes the gap as $d_{\mathrm{osc}}$ grows: from
$+11.09$ PPL at $d_{\mathrm{osc}} = 2$ to $+2.98$ PPL at $d_{\mathrm{osc}} = 32$ on WikiText-2, and
from $+2.39$ PPL at $d_{\mathrm{osc}} = 2$ to $+0.57$ PPL at $d_{\mathrm{osc}} = 32$ on
TinyStories. The main objective of this work is not to replace softmax in
software but to provide a mathematically grounded blueprint for accurate
attention on physical substrates.

URL: https://openreview.net/forum?id=T9g7VleXf2

---

Title: OASIS: Open Agent Social Interaction Simulations with One Million Agents

Abstract: Online social platforms are increasingly becoming hybrid ecosystems in which humans, algorithmic recommenders, and LLM-based agents jointly shape collective behavior. However, controlled study of such ecosystems remains challenging because real platforms couple evolving social graphs, heterogeneous user actions, recommendation-driven exposure, and population-scale feedback loops. We introduce OASIS, a modular and scalable social media simulation framework for studying emergent dynamics in LLM-agent societies under platform-faithful interaction mechanisms. OASIS abstracts social platforms into evolving environments, platform-style action spaces, interchangeable recommendation modules, and parallel agent execution, allowing researchers to instantiate different platform templates while retaining controllability and reproducibility. The framework supports simulations ranging from thousands to up to one million agents under practical compute budgets, enabling population-level mechanism analysis and intervention studies. We instantiate OASIS as Reddit- and X-like platforms and conduct empirical studies on herding behavior and misinformation propagation. Our experiments reveal that LLM-agent societies exhibit stronger negative herding under early adverse feedback than human baselines, and that matched misinformation can achieve higher diffusion intensity than official news in large-scale X-like simulations. We further show that both model choice and recommendation policy substantially affect diffusion dynamics: smaller LLM backbones tend to amplify misinformation more strongly in our setting, while interest-based recommendation increases misinformation exposure compared with random recommendation. Finally, we demonstrate that external grounding interventions, including tool-augmented verification and human participation, consistently reduce misinformation dominance. 
These results position OASIS as a controllable, extensible, and intervention-ready testbed for reproducible research on hybrid online ecosystems.

URL: https://openreview.net/forum?id=P2BRs9uFwX

---

Title: What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

Abstract: LLM agents are known to deviate from Nash equilibria in strategic interactions, but nobody has looked inside the model to understand why,
or asked whether the deviation can be reversed. We do both. Working with four open-source models (Llama-3 and Qwen2.5, 8B to 72B
parameters) playing four canonical two-player games, we first establish the behavioral picture through self-play and cross-play experiments,
then open up the 32-layer Llama-3-8B model and examine what actually happens during a strategic decision. The mechanistic findings are clear. Opponent history is encoded with near-perfect fidelity at the very first layer (96% probe accuracy) and consumed progressively by later
ones, while Nash action encoding is weak throughout, never exceeding 56%. There is no dedicated Nash module. Instead, the model privately favors the Nash action through most of its forward pass, but a prosocial override (a bias toward cooperative, other-regarding
behavior rooted in pretraining on human text and further modulated by RLHF) concentrated in the final layers reverses this, reaching 84% probability of cooperation at layer 30. When we inject a learned Nash direction into the residual stream, the behavior shifts bidirectionally and causally, confirmed through concept clamping. The behavioral experiments surface six scale- and architecture-dependent findings in self-play, the most notable being that chain-of-thought reasoning worsens Nash play in small models but achieves near-perfect Nash play in models above 70B parameters. The cross-play experiments reveal three phenomena invisible in self-play: a small model can unravel the cooperation of any partner simply by defecting early; two large models reinforce each other's cooperative instincts indefinitely; and who moves first in a coordination game determines which Nash equilibrium the system lands on. The central finding is that LLMs do not lack Nash-playing competence. They compute it, then suppress it.

URL: https://openreview.net/forum?id=a9I1xtEXyT

---

Title: GNNUpdater: Adaptive Self-Triggered Training Framework on Dynamic Graphs

Abstract: Adapting Graph Neural Networks (GNNs) to evolving, dynamic graph data poses a difficult operational problem: when should these models be updated? At scale, retraining is expensive, so updates should be triggered only when the expected performance gain justifies the GPU cost. This problem is especially challenging in graph settings because of two factors: label delay, where ground truth arrives long after predictions are made, and hidden drift, where structural dependencies propagate changes through multiple hops and degrade performance unexpectedly. We propose GNNUpdater, an adaptive framework that decides when to trigger GNN training. It addresses these challenges with two components: (1) a performance predictor that estimates model quality from node embedding shifts, removing dependence on immediate ground-truth labels, and (2) a graph-aware update trigger that uses label propagation to detect widespread performance degradation across the graph. We implement GNNUpdater as a distributed streaming-GNN library for billion-edge dynamic graphs. Experiments show that GNNUpdater outperforms periodic, performance-based, and drift-detection baselines at comparable GPU-hour budgets, or matches their performance with substantially less training cost. The implementation can be found at the anonymous link: <https://anonymous.4open.science/r/GNNUpdater-B47D/>.

URL: https://openreview.net/forum?id=Srz5Swk9Nw

---

Title: Differentiable Plastic Recurrent Neural Networks: A Systematic Evaluation

Abstract: Plastic recurrent neural networks (RNNs) augment standard architectures with weights that adapt during inference, offering a biologically-inspired path toward more adaptive artificial systems. Despite a growing collection of architectures, the field lacks a systematic account of how these methods relate to one another, when plasticity improves performance, and whether learned plastic components are actually used at inference time. We address this gap by placing existing differentiable plastic RNNs under a unified notation, conducting a parameter matched evaluation of ten architectures across four task families, and analyzing how these architectures utilize plasticity, if at all. We also identify an underexplored design direction for plastic RNNs. Grounding plasticity in extended neuron dynamics and population-level synchronization, rather than instantaneous pre- and post-synaptic correlations, we propose the Plastic Continuous Thought Machine (Plastic CTM). On recall and continual learning tasks, plastic and non-plastic architectures perform comparably, but on in-context learning, plastic architectures consistently outperform their non-plastic counterparts, with the Plastic CTM achieving the strongest results among baselines. Our analysis reveals that architectures with similar accuracy can use plasticity in very different ways, with some relying on plastic weights for genuine in-context adaptation, whilst others achieve comparable accuracy even with their plasticity disabled. These findings highlight the need for more discriminating benchmarks, mechanistic analysis, and exploration of richer biologically inspired rules such as population level synchronization to meaningfully advance the field.

URL: https://openreview.net/forum?id=8z0trg6tku

---

Title: The Noise Geometry of Stochastic Gradient Descent

Abstract: In this paper, we present a comprehensive analysis of the heterogeneous structure of minibatch noise, focusing on its favorable *alignment* with the landscape's local geometry (Wu et al., 2022). Specifically, we propose two metrics, derived from analyzing the influence of the noise structure on the loss and subspace projection dynamics separately, to quantify the alignment property. To showcase the practical relevance of our noise geometry characterization, we revisit the convergence analysis of stochastic gradient descent (SGD), revealing that the favorable noise geometry is crucial for ensuring benign convergence of SGD in high-dimensional settings. We also examine the noise geometry's influence on how SGD escapes from sharp minima. It is demonstrated that, unlike gradient descent (GD), which escapes sharp regions along the sharpest directions, SGD tends to escape through flatter directions. To support our theoretical findings, both synthetic and real-dataset experiments are provided.

URL: https://openreview.net/forum?id=F6wmIilZXG

---

Title: VBO-MI: A Differentiable Bayesian Optimization Framework via Variational Mutual Information Estimation

Abstract: Many real-world tasks require optimizing expensive black-box functions accessible only through noisy evaluations, a setting commonly addressed with Bayesian optimization (BO). While Gaussian processes (GPs) and Bayesian neural networks (BNNs) have driven significant progress, GP-based methods scale cubically with observations, and BNN-based approaches remain burdened by expensive posterior sampling such as Hamiltonian Monte Carlo, and by inner-loop acquisition optimization in high-dimensional settings. In this work, we propose VBO-MI (Variational Bayesian Optimization with Mutual Information), a fully gradient-based BO framework that requires no assumptions on a Gaussian prior or a variational posterior family over $f$. While our framework uses gradient-based updates to train the internal action and information-estimation networks, it treats the objective function $f$ as a strict black box, requiring only zeroth-order evaluations (input-output pairs). To enable end-to-end gradient flow, we employ an actor-critic architecture consisting of an action-net to navigate the input space and a variational critic to estimate information gain. This formulation effectively eliminates the traditional inner-loop acquisition optimization bottleneck, achieving up to a $10^{2}\times$ reduction in FLOPs compared to BNN-BO baselines. Additionally, we design a lightweight surrogate that approximates the objective from observed data using a replay buffer, which reduces the total required real function queries to one batch per iteration. We provide theoretical analysis for our framework, and establish the consistency of VBO-MI estimates of the optimum point. We evaluate our method on a diverse suite of benchmarks, including high-dimensional synthetic functions (Ackley, Levy, Griewank) and complex real-world tasks such as Rover Trajectory planning optimization, the Lunar Lander, and Pest Control problems. 
Our experiments demonstrate that VBO-MI matches or exceeds the optimization performance of the various baselines.

URL: https://openreview.net/forum?id=nkj5zYqJ2p

---

Title: LLM-guided Hierarchical Search for End-to-end Reasoning Intensive Retrieval

Abstract: Search systems are increasingly used for *reasoning-intensive* queries, where what makes a document relevant requires understanding or reasoning over the query–document relation rather than relying on surface vocabulary or topical similarity. The standard recipe - a cheap embedding-based retriever followed by an LLM verifier - works only when the embedding model places the right documents in its top-*k*, an assumption that recent reasoning-intensive IR benchmarks show often fails to hold even for SOTA embedding models. Many recent works propose query-side fixes such as query rewriting and agentic loops, which have shown promise in bringing LLM reasoning to bear on the search process but remain brittle to the embedding model's effectiveness and to the LLM's ability to rewrite the query from its parametric knowledge alone. In this paper, we explore a different paradigm - *LLM-guided hierarchical search* - in which an LLM interacts with the corpus directly via a hierarchically navigable search index, with no embedding model in the loop at search time. We propose **LATTICE**, an instantiation of this paradigm with two technical contributions: 1. a top-down construction of the search index using LLM judgements over multi-level document summaries; and 2. a robust LLM-guided hierarchical search algorithm that mitigates noisy, context-dependent LLM scores via cross-branch reference nodes and path-aggregated latent scores. Through extensive experiments on the reasoning-intensive BRIGHT benchmark, base LATTICE with an off-the-shelf LLM achieves 46.7 nDCG@10 (matching the best fine-tuned ensemble baseline overall). A lightweight ensemble LATTICE++ that fuses LATTICE with cheap retrieval reaches **49.1 nDCG@10**. A controlled same-LLM comparison against sliding-window reranking shows that reranking offers a better tradeoff at low LLM token budgets, but after a moderate token budget LATTICE converges to a higher asymptote. 
We further show that LATTICE works with open-weight LLMs and remains competitive on traditional IR benchmarks (NQ, SciFact, SciDocs).

URL: https://openreview.net/forum?id=49vX1GYy4s

---

Title: Isolating Stochastic Sources of Policy Divergence in Proximal Policy Optimization

Abstract: This work presents an analysis of how Reinforcement Learning (RL) policies diverge during training, aiming to separate and investigate divergence arising from environmental randomness and from the policy’s initialization. We trained 300 policies using Proximal Policy Optimization (PPO) across three contexts in the MinAtar testbed: (1) with fixed parameter initialization, (2) with fixed sampled scenarios, and (3) with neither fixed. The resulting policies were examined for performance, feature-attribution overlaps, action disaccord, and overlaps in critical state estimates. The distributions of policies are similar across all training contexts for all examinations, except that the overlap in feature attributions increases when the initial parameters are fixed. These results show that, despite controlling for parameter initialization and the scenarios drawn from the environment, the PPO policies diverge similarly across training contexts. The results, therefore, suggest that PPO exhibits severe path dependence: the unpredictability of the final policy is deeply ingrained in PPO’s stochastic exploration-update loop. Furthermore, our investigation demonstrates that similarly trained PPO policies exhibited substantial differences in how they solved the RL tasks.

URL: https://openreview.net/forum?id=JuDAeqI49e

---

Title: MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

Abstract: In multimodal large language models (MLLMs), visual tokens are characterized by their high volume and inherent sparsity compared with their text counterparts. To achieve efficient inference with controllable token budgets, training-free token pruning techniques have emerged for their versatility and near-zero cost. Current methods typically measure token importance based on attention salience in the visual encoder or the LLM decoder, then preserve visual tokens with high attention scores while pruning others. However, attention salience is often biased by sink tokens and positional bias. These salience-based methods require extracting attention maps, which introduces implementation complexity and memory overhead, while inadequately accounting for the diversity of selected tokens. In this paper, we pursue a sound and surgical approach, called MI-Pruner, which bypasses attention collection and instead estimates Mutual Information (MI) based relevance in the projection space. This allows an explicit measure of feature-level dependency with information-theoretic motivation to identify the most informative tokens. Without reliance on internal attention maps or architectural modifications, MI-Pruner can be seamlessly applied to off-the-shelf MLLMs for inference acceleration. Extensive experiments on LLaVA1.5, Qwen-series and Video-LLaVA demonstrate that our approach achieves a favorable performance-efficiency trade-off across diverse image and video understanding benchmarks.

URL: https://openreview.net/forum?id=Bc2DZoXBus

---

Title: NSFL: A Post-Training Neuro-Symbolic Fuzzy Logic Framework for Boolean Operators in Neural Embeddings

Abstract: Standard dense retrievers lack a native calculus for multi-atom logical constraints. We introduce Neuro-Symbolic Fuzzy Logic (NSFL), a framework that adapts formal t-norms and t-conorms to neural embedding spaces without requiring retraining. NSFL operates as a first-order hybrid calculus: it anchors logical operations on isolated zero-order similarity scores while actively steering representations using Neuro-Symbolic Deltas (NS-$\Delta$), the first-order marginal differences derived from contextual fusion. This preserves pure atomic meaning while capturing domain reliance, preventing the representation collapse and manifold escape endemic to traditional geometric baselines. For scalable real-time retrieval, Spherical Query Optimization (SQO) leverages Riemannian optimization to project these fuzzy formulas into manifold-stable query vectors. Validated across six distinct encoder configurations and two modalities (including zero-shot and SOTA fine-tuned models), NSFL yields mAP improvements up to +81%. Notably, NSFL provides an additive 20% average and up to 47% boost even when applied to encoders explicitly fine-tuned for logical reasoning. By establishing a training-free, order-aware calculus for high-dimensional spaces, this framework lays the foundation for future dynamic scaling and learned manifold logic.

URL: https://openreview.net/forum?id=WYpsWBabCu

---

Title: AI Stereotypes: An Unequipartition Property for Perplexity in Generative Language Models

Abstract: We present a simple generalization of the classical Asymptotic Equipartition Property and the associated concept of typical sets for the perplexity of long texts generated by a language model. Specifically, we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines an "idiotypical set" that all long synthetic texts generated by a language model must belong to. We refine the concept of "idiotypical set" to include only grammatically correct texts. We then show that this refined idiotypical set is a vanishingly small subset of all possible grammatically correct texts. This is a general result that applies to any language model for a very broad definition of grammar and it asserts that language models are strongly constrained in the range of their possible behaviors and outputs. We make no simplifying assumptions (such as stationarity) about the statistics of language model outputs, and therefore our results are directly applicable to practical real-world models without any approximations. We also present supporting experimental evidence from open-source models and discuss possible applications of the idiotypical set concept to problems such as detecting synthetic texts and membership inference in training datasets.
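
The convergence claim can be checked numerically on a toy example. The sketch below samples a long sequence from a small, hand-made, position-dependent (hence non-stationary) next-token distribution and compares the per-token log-perplexity of the sample with the average entropy of the distributions it was drawn from. The toy model `next_token_probs`, the three-token vocabulary, and the sequence length are assumptions made purely for illustration; nothing here comes from the paper's experiments.

```python
import math
import random


def next_token_probs(prev, t):
    """Hypothetical toy 'language model' over 3 tokens.

    The distribution depends on the previous token and on the position t,
    so the generated process is not stationary.
    """
    weights = [1.0 + ((prev + k + t) % 3) + 0.5 * (t % 5 == k) for k in range(3)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_and_score(n, seed=0):
    """Sample n tokens; return (log-perplexity per token, average entropy)."""
    rng = random.Random(seed)
    prev, neg_log_lik, entropy_sum = 0, 0.0, 0.0
    for t in range(n):
        p = next_token_probs(prev, t)
        tok = rng.choices(range(3), weights=p)[0]
        neg_log_lik -= math.log(p[tok])                     # surprise of the sampled token
        entropy_sum -= sum(q * math.log(q) for q in p)      # entropy of this step's distribution
        prev = tok
    return neg_log_lik / n, entropy_sum / n


log_perplexity, avg_entropy = sample_and_score(20000)
gap = abs(log_perplexity - avg_entropy)
```

For a sequence of this length the two quantities should differ only by a small sampling fluctuation, which is the behaviour the abstract describes for long generated texts.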

URL: https://openreview.net/forum?id=CGMpdOeFqp
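
The convergence claim in the abstract above can be illustrated with a small simulation. This is a minimal sketch, not the authors' experiment: `next_token_dist` is a hypothetical toy "language model" whose next-token distribution depends on the previous token and on the position (so it is not stationary), and the vocabulary size and text length are arbitrary choices. The sketch samples one long text and compares its per-token log-perplexity with the average entropy of the distributions it was sampled from.

```python
import math
import random

def next_token_dist(prev_token, step, vocab_size=5):
    """Hypothetical toy model: a context- and position-dependent distribution."""
    weights = [1.0 + ((prev_token + 1) * (k + 1) + step) % 7 for k in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_text_stats(n_tokens, seed=0):
    """Sample one text; return (log-perplexity, average token-distribution entropy)."""
    rng = random.Random(seed)
    prev, neg_log_lik, entropy_sum = 0, 0.0, 0.0
    for step in range(n_tokens):
        probs = next_token_dist(prev, step)
        token = rng.choices(range(len(probs)), weights=probs)[0]
        neg_log_lik += -math.log(probs[token])               # surprisal of sampled token
        entropy_sum += -sum(p * math.log(p) for p in probs)  # entropy of this step
        prev = token
    return neg_log_lik / n_tokens, entropy_sum / n_tokens

log_ppl, avg_entropy = sample_text_stats(20000)
print(f"log-perplexity  = {log_ppl:.4f}")
print(f"average entropy = {avg_entropy:.4f}")
```

The two printed numbers should agree closely for long texts, because the gap between each token's surprisal and the entropy of its distribution averages out along the sequence.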

---

Title: Learning to Destroy and Repair Local Optima in Multilinear Graph QUBO Optimization

Abstract: Graph combinatorial optimization problems such as Maximum Independent Set and Maximum Cut can be formulated as pseudo-Boolean quadratic objectives with multilinear polynomial forms. Relaxing the variable domain from $\{0,1\}^n$ to $[0,1]^n$ while keeping the same multilinear objective preserves all 0-1 global minimizers, and every strict local minimizer of the relaxed objective is still binary. This exactness enables continuous gradient-based optimization for discrete graph problems, but the relaxed landscape remains nonconvex: direct optimizers may require many iterations, and amortized neural solvers can still converge to suboptimal local solutions. We propose LDR, a learned destroy-and-repair framework for improving such solutions. A recurrent repair network is trained with the multilinear QUBO relaxation to construct and repair continuous states that are decoded into feasible discrete solutions. A destroy policy then perturbs part of the repaired continuous state before repair is applied again, and is trained with post-repair improvement in the decoded task score as a graph-level reward. Because the reward is graph-level, LDR uses the learned policy primarily as a state-dependent destruction-rate estimator: at inference time, it samples a random mask over eligible vertices using the learned rate and repeatedly applies repair. We analyze why destroy-and-repair is useful for escaping local optima, especially in hard-exclusion tasks such as Maximum Independent Set, and evaluate the method on Maximum Independent Set and Maximum Cut. Experiments show that LDR improves over the base repair network under explicit runtime and fixed-budget comparisons.

URL: https://openreview.net/forum?id=mwyTnQWtPa
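
The exactness property of the multilinear relaxation described above can be sketched for Maximum Independent Set. This is an illustration only and not the LDR method: the objective form, the penalty weight of 2.0, and the function names are assumptions chosen for the example. Because the objective is linear in each single coordinate, any point in $[0,1]^n$ can be rounded one coordinate at a time to a binary point without increasing the objective.

```python
def mis_objective(x, edges, penalty=2.0):
    """Multilinear QUBO objective for Maximum Independent Set (to be minimized)."""
    return -sum(x) + penalty * sum(x[i] * x[j] for i, j in edges)

def round_coordinatewise(x, edges, penalty=2.0):
    """Round a point in [0,1]^n to {0,1}^n without increasing the objective.

    The objective is linear in each single coordinate, so moving x[i] to the
    better endpoint (0 or 1) can never make it worse.
    """
    x = list(x)
    neighbors = {i: [] for i in range(len(x))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for i in range(len(x)):
        # Partial derivative of the objective with respect to x[i].
        slope = -1.0 + penalty * sum(x[j] for j in neighbors[i])
        x[i] = 1.0 if slope <= 0 else 0.0
    return x

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]  # a 5-cycle
x_relaxed = [0.5, 0.5, 0.5, 0.5, 0.5]
x_binary = round_coordinatewise(x_relaxed, edges)
print(x_binary, mis_objective(x_relaxed, edges), mis_objective(x_binary, edges))
```

With a penalty larger than 1, this sequential rounding also never selects two adjacent vertices, so the decoded point is an independent set, though not necessarily a maximum one.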

---

Title: Forward Target Propagation: A Forward-Only Approach to Global Error Credit Assignment via Local Losses

Abstract: Training neural networks has traditionally relied on backpropagation (BP), a gradient-based algorithm that, despite its widespread success, suffers from key limitations from both biological and hardware perspectives. These include backward error propagation by symmetric weights, non-local credit assignment, update locking, and frozen activity during backward passes. We propose Forward Target Propagation (FTP), a biologically plausible and computationally efficient alternative that replaces the backward pass with a second forward pass. FTP estimates layer-wise targets using only feedforward computations, eliminating the need for symmetric feedback weights or learnable inverse functions, hence enabling modular and local learning. We evaluate FTP on fully connected networks, CNNs, and RNNs, demonstrating accuracies competitive with BP on MNIST, CIFAR-10, and CIFAR-100, as well as effective modeling of long-term dependencies in sequential tasks. FTP shows improved robustness under quantized low-precision and emerging hardware constraints while also demonstrating substantial efficiency gains over other biologically inspired methods such as target propagation variants and forward-only learning algorithms. With its minimal computational overhead, forward-only nature, and hardware compatibility, FTP provides a promising direction for energy-efficient on-device learning and neuromorphic computing.

URL: https://openreview.net/forum?id=BN9uZVb7wQ

---

Title: Analysing Extrapolation Capabilities of Modern Positional Embeddings in Vision Transformers

Abstract: Vision Transformers scale quadratically with input resolution, making high-resolution training prohibitively expensive, yet detection and segmentation demand it. A natural alternative is resolution extrapolation: train at low resolution and deploy at higher resolution without fine-tuning. Whether this works depends entirely on the positional encoding. Modern encodings such as RoPE, ALiBi, YaRN, and FIRE enable striking length extrapolation in language models, but their behaviour on 2D image grids is largely unknown. We present a systematic study, training a single ViT-Tiny on ImageNet-100 at 224 × 224 and evaluating zero-shot up to 1024 × 1024 (4.57×) on classification, COCO detection, and ADE20K segmentation. Across all three tasks, additive attention-bias methods consistently outperform rotation-based methods at large scales: at 4.57×, FIRE retains 63.1% Top-1 accuracy while RoPE collapses to 18.9%. An attention-entropy analysis shows that bias-based encodings preserve focused, semantically coherent attention, whereas rotation-based phases drift out of distribution and induce attention collapse. These results recast extrapolation robustness as the primary axis for choosing positional encodings, and yield a practical recipe for resolution-flexible Vision Transformers.

URL: https://openreview.net/forum?id=mEGr7Fm54B

---

Title: Deterministic Lexical Routing for Language Model MLPs

Abstract: We present COMPLEXITY-DEEP, a language model architecture built around deterministic lexical routing. The paper studies three components: (1) Token-Routed MLP with Zipf-balanced greedy bin-packing, which assigns tokens to experts using an estimated corpus-frequency table, eliminating learned routers and auxiliary load-balancing losses; (2) Shared Lexical Expert, a dense MLP path shared by all tokens while routed experts specialize on lexical partitions; and (3) Mu-Guided Attention, where a latent state $\mu$ flows forward between adjacent layers to modulate K, Q, and V projections after the MLP. We give a formal analysis of the conditional capacity and compute trade-off induced by deterministic lexical partitions, and we explicitly distinguish vocabulary balance from corpus-frequency balance under Zipfian token distributions. Empirically, component ablations at 187M parameters (500M tokens) show that the full Token-Routed+Shared+Mu model improves average loss over a dense SwiGLU baseline. Our main scaling result is a corrected iso-parameter comparison at $\sim$300M parameters over 8B FineWeb-Edu tokens: a residual Token-Routed model without Mu-Guidance pays an early specialization cost, first beats the dense baseline at step 740 on logged train loss and step 750 on validation loss, and finishes the budget with a smoothed train-loss advantage of $-0.0163$.

URL: https://openreview.net/forum?id=Jd9jhTnkUy
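
The frequency-balanced routing idea in component (1) above can be sketched as a greedy bin-packing pass. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name, the synthetic 1/rank frequency table, and the choice of four experts are all hypothetical. Tokens are visited from most to least frequent and each is assigned to the expert with the smallest accumulated frequency so far.

```python
import heapq

def zipf_balanced_assignment(token_freqs, num_experts):
    """Greedy bin-packing: heaviest tokens first, each to the least-loaded expert."""
    heap = [(0.0, e) for e in range(num_experts)]
    heapq.heapify(heap)
    assignment = {}
    for token, freq in sorted(token_freqs.items(), key=lambda kv: -kv[1]):
        load, expert = heapq.heappop(heap)
        assignment[token] = expert
        heapq.heappush(heap, (load + freq, expert))
    loads = [0.0] * num_experts
    counts = [0] * num_experts
    for token, expert in assignment.items():
        loads[expert] += token_freqs[token]
        counts[expert] += 1
    return assignment, loads, counts

# Hypothetical Zipfian table: the token of rank r has frequency proportional to 1/r.
vocab_size, num_experts = 1000, 4
freqs = {tok: 1.0 / (tok + 1) for tok in range(vocab_size)}
assignment, loads, counts = zipf_balanced_assignment(freqs, num_experts)
total = sum(loads)
print("frequency share per expert:", [round(l / total, 3) for l in loads])
print("vocabulary count per expert:", counts)
```

Printing both the frequency share and the vocabulary count per expert makes the abstract's distinction visible: the table is deterministic and needs no learned router, and balancing corpus frequency is a different target from giving each expert the same number of vocabulary entries.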

---

Title: A symmetry-matching approach to blind-spot elimination in sparse autoencoders

Abstract: Language models can treat semantically distinct inputs as interchangeable at the representation level, creating blind spots that no existing sparse autoencoder (SAE) training objective detects. In safety-critical settings (clinical dosage extraction, legal clause interpretation, financial amount verification), such blind spots propagate silently into downstream decisions. We show that they arise from the orientation of the feature basis, not from insufficient model capacity, and that they are eliminable. Using a vulnerability measure derived from algebraic error-detection theory, we add a differentiable regularisation term to the SAE training objective that penalises uneven perturbation sensitivity. Across three language models of different scale and architecture (GPT-2, Gemma 2, Qwen 2.5), the regularisation reduces blind-spot severity by 83–100% on six perturbation families on the smallest model and achieves near-complete elimination on the two larger ones, while alternative training objectives (JumpReLU, MDL) leave the blind spots unchanged. A single well-oriented feature basis suffices for all families simultaneously. Extending the study to sixteen perturbation families (six standard plus ten auto-generated medical-domain families), the regularisation generalises to a 99–100% reduction on GPT-2 and Qwen, while Gemma 2 layer 13 exhibits a model-specific structural floor at V ≈ 0.15 that is invariant under a 20× variation of the regularisation hyperparameters. No model retraining or additional capacity is required.

URL: https://openreview.net/forum?id=NWWpKC9CZH

---

Title: Prescriptive SVD-Inspired Attention via Spectral Energy Retention

Abstract: Self-attention is central to modern Transformer architectures, but its dense dot-product formulation makes it difficult to identify which internal directions are structurally important and which can be modified without disrupting the model. SVD-Inspired Attention (SVDA) addresses part of this problem by introducing a learned diagonal spectrum into the query-key score interaction, making latent attention directions explicitly inspectable through indicators such as spectral entropy, effective rank, sparsity, alignment, selectivity, and perturbation response. The paper examines the transition from diagnostic interpretation to operational intervention. A diagnosis--intervention--verification framework is proposed, in which the learned SVDA spectrum is used to guide targeted changes to the attention-score operator. The framework treats the learned spectrum as an intervention surface where spectral coefficients can be masked, retained, regularized, or compared across heads according to their role in score formation. This view is instantiated through spectral energy retention, which converts the learned spectrum into a score-structural intervention rule. Experiments on FashionMNIST, CIFAR-10, CIFAR-100, and Food-101 show that low-energy score directions can be removed while preserving accuracy within experimental noise and reducing the score-forming part of the attention operator. The contribution is a controlled demonstration that intrinsic spectral diagnostics can be converted into verified, structurally realizable modifications of attention, rather than remaining post-hoc descriptive indicators.

URL: https://openreview.net/forum?id=LZBWqyWNxS
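
The spectral energy retention rule mentioned above can be sketched as a simple thresholding step. This is a hedged illustration rather than the paper's procedure: it assumes that "energy" means the squared magnitude of each learned spectral coefficient and that the rule keeps the fewest coefficients reaching a target fraction of total energy. The function name, the 0.95 threshold, and the example spectrum are hypothetical.

```python
def energy_retention_mask(spectrum, retain=0.95):
    """Keep the fewest coefficients whose squared magnitudes reach `retain` of total energy."""
    energies = [s * s for s in spectrum]
    total = sum(energies)
    # Visit coefficients from highest to lowest energy.
    order = sorted(range(len(spectrum)), key=lambda i: -energies[i])
    mask, kept = [False] * len(spectrum), 0.0
    for i in order:
        if kept >= retain * total:
            break
        mask[i] = True
        kept += energies[i]
    return mask

spectrum = [3.0, 0.1, 2.0, 0.05, 1.0, 0.2]  # hypothetical learned diagonal for one head
mask = energy_retention_mask(spectrum, retain=0.95)
pruned = [s if keep else 0.0 for s, keep in zip(spectrum, mask)]
print(mask)
print(pruned)
```

Masked coefficients correspond to score directions that would be dropped from the query-key interaction; whether accuracy is preserved after such a mask is an empirical question that the abstract addresses and this sketch does not.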

---

Title: WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

Abstract: Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning.
To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning.
Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

URL: https://openreview.net/forum?id=T8HuiP3yM9

---

Title: Nonlinear Time Series Modeling Using Bernstein Polynomials and Bayesian Inference

Abstract: Modeling and forecasting nonlinear time series presents a highly non-trivial challenge to both statistical and neural network methods. This is particularly true with time series from chaotic systems with very few methods being able to provide accurate predictions along with uncertainty quantification. This article taps into fundamental theoretical properties of Bernstein polynomials to model temporal dependence structures present in nonlinear time series. The new Bernstein Polynomial Autoregressive (BPAR) model is seen to have architectural similarity to a single layer perceptron or a single hidden layer artificial neural network. Unlike deep multilayer neural network forecasters, the BPAR retains a single structured layer, making the model comparatively simple while allowing lag-specific Bernstein coefficient blocks to indicate which past observations are most responsible for prediction. Bayesian statistical inference is applied and this produces uncertainty quantification on model parameters and on predictions. Carefully designed shrinkage prior distributions are utilized in the Bayesian inference, and they assist in maintaining overall model parsimony. An extensive empirical study involving several real and simulated nonlinear time series data is conducted along with suitable model residual diagnostic checks. Results demonstrate, especially with chaotic and highly nonlinear time series, superior forecasting performance of the BPAR relative to existing statistical and neural network methods, and also point to the unique ability of BPAR to provide uncertainty quantification in these data settings.

URL: https://openreview.net/forum?id=PPEHt53UJh

---

Title: Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting

Abstract: Large Language Models (LLMs) have achieved remarkable success, underpinning diverse AI applications. However, they often suffer from performance degradation due to factors such as catastrophic forgetting during Supervised Fine-Tuning (SFT), quantization, and pruning. In this work, we introduce a performance recovery framework based on Self-Distillation Fine-Tuning (SDFT) that effectively restores model capabilities. Complementing this practical contribution, we provide a rigorous theoretical explanation for the underlying recovery mechanism. We posit that an LLM's generative capability fundamentally relies on the high-dimensional manifold constructed by its hidden layers. To investigate this, we employ Centered Kernel Alignment (CKA) to quantify the alignment between student and teacher activation trajectories, leveraging its invariance to orthogonal transformations and scaling. Our experiments demonstrate a strong correlation between performance recovery and manifold alignment, substantiating the claim that self-distillation effectively aligns the student's high-dimensional manifold with the optimal structure represented by the teacher. This study bridges the gap between practical recovery frameworks and geometric representation theory, offering new insights into the internal mechanisms of self-distillation.

URL: https://openreview.net/forum?id=ZYNq5cjwvK

---

Title: TriScore: Post-Hoc Out-of-Distribution Detection with Energy, Boundary Probes, and Transform Consistency

Abstract: Post-hoc out-of-distribution detection usually relies on a single signal extracted from a frozen classifier, such as confidence, energy, features, or gradients. This dependence becomes fragile under classifier-level adversarial shifts, where the shift can preserve the detector’s chosen cue while changing the model’s decision behavior. We investigate whether this failure mode can be mitigated by combining complementary detection criteria instead of selecting a single score. We introduce TriScore, a post-hoc OOD detector that combines three ID-calibrated signals from the same frozen backbone: an energy prior, a boundary-fragility score, and a transform-consistency score. These signals are standardized on ID validation data and fused by an input-dependent ID-residual gate, requiring no OOD labels, auxiliary model, or training. Across four ImageNet-scale backbones, TriScore achieves the best mean AUROC among 15 post-hoc baselines under a non-adaptive adversarial-shift protocol, while remaining competitive on standard OOD benchmarks. Ablations show that the gains come from combining complementary cues rather than optimizing a single detector signal. These results support multi-criterion post-hoc detection as a practical route toward more robust OOD detection under classifier-level shifts.

URL: https://openreview.net/forum?id=H5hyDKANup

---

Title: UPL: Uncertainty-aware Pseudo-labeling for Imbalanced Transductive Node Classification

Abstract: Graph-structured datasets often suffer from class imbalance, which complicates node classification tasks. In this work, we address this issue by first providing an upper bound on population risk for imbalanced transductive node classification. We then propose a simple and novel algorithm, Uncertainty-aware Pseudo-labeling (UPL). Our approach leverages pseudo-labels assigned to unlabeled nodes to mitigate the adverse effects of imbalance on classification accuracy. Furthermore, the UPL algorithm enhances the accuracy of pseudo-labeling by reducing training noise of pseudo-labels through a novel uncertainty-aware approach. We comprehensively evaluate the UPL algorithm across various benchmark datasets, demonstrating its superior performance compared to existing state-of-the-art methods.

URL: https://openreview.net/forum?id=9ovThQD3LQ

---

Title: Reproducing and Dissecting Denoising Language Models for Speech Recognition

Abstract: Denoising language models (DLMs) have been proposed as a powerful alternative to traditional language models (LMs) for improving automatic speech recognition (ASR), motivated by their ability to use bidirectional context and adapt to error patterns of ASR models. However, the complexity of the DLM training pipeline has hindered wider investigation. This paper presents the first independent, large-scale empirical study of DLMs. Using a reproducible pipeline, we evaluate dozens of configurations across data augmentation, text-to-speech systems, and decoding strategies. Our analysis reveals that while traditional LMs are more efficient at lower training compute budgets, DLMs exhibit superior scaling and surpass LMs after a distinct compute tipping point, mirroring behaviors observed in diffusion language models. Our results show that the magnitude of DLM improvement is sensitive to the baseline ASR performance and vocabulary choice, and a key factor for improving performance is to condition the DLM on richer information from the ASR's hypothesis space, rather than just a single best guess. To this end, we introduce DLM-sum, a novel method for decoding from multiple ASR hypotheses, which consistently outperforms the previously proposed DSR decoding method. The code is publicly available at https://anonymous.4open.science/r/2025-dlm/.

URL: https://openreview.net/forum?id=U3duFJ6S9t

---

Title: Multi Timescale Stochastic Approximation: Stability and Convergence

Abstract: This paper presents the first sufficient conditions that guarantee the stability and almost sure convergence of multi-timescale stochastic approximation (SA) iterates. It extends the existing results on one-timescale and two-timescale SA iterates to general $N$-timescale stochastic recursions, for any $N \geq 1$, using the ordinary differential equation (ODE) method. As an application of our results, we first study stochastic approximation algorithms augmented with heavy-ball momentum in the context of Gradient Temporal Difference (GTD) learning. The addition of momentum introduces an auxiliary state that evolves on an intermediate timescale, resulting in a three-timescale recursion. We show that when the momentum parameters are chosen appropriately, the resulting scheme fits within our framework and converges almost surely to the same fixed point as the baseline GTD algorithm. The stability and convergence of all iterates, including the momentum state, are guaranteed by our main results, without requiring ad hoc bounds. We then study off-policy actor–critic algorithms with a baseline learner, actor, and critic updated on separate timescales. In contrast to prior work, we eliminate projection steps from the actor update and instead use our convergence framework to guarantee stability and almost sure convergence of all components. Finally, we extend our analysis to constrained policy optimization in the average-reward setting, where the actor, critic, and dual variables evolve on three distinct timescales, and we verify that the resulting dynamics satisfy the conditions of our general theorem. Together, these examples demonstrate how diverse reinforcement learning algorithms spanning momentum acceleration, off-policy learning, and primal–dual methods fit naturally into the proposed multi-timescale framework.

URL: https://openreview.net/forum?id=GFC3vv4cl9

---

Title: Subspace Inference Enables Efficient Active Reward Learning from Preferences

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a powerful approach for aligning decision-making agents with human intentions, primarily through the use of reward models trained on human preferences. However, RLHF suffers from poor sample efficiency, as each preference label provides minimal information, making it necessary to collect large amounts of human feedback. Active learning addresses this by enabling agents to select informative queries, but the effective uncertainty quantification required for active learning remains a challenge. While uncertainty representation methods such as ensembles and dropout are popular for their simplicity, they are computationally expensive at scale and do not always provide a good posterior approximation.
Inspired by recent advances in approximate Bayesian inference, we develop a method that leverages Bayesian filtering in neural network subspaces to efficiently maintain a model posterior for active reward modeling in continuous control tasks. Our approach enables scalable sampling of neural network reward models to efficiently compute active learning acquisition functions. Experiments on the D4RL and V-D4RL benchmarks demonstrate that our approach achieves superior sample efficiency, scalability, and calibration compared to other Bayesian deep learning approaches, and leads to competitive offline reinforcement learning policy performance. This highlights the potential of scalable Bayesian methods for preference-based reward modeling in RLHF. Our code is anonymously available at https://github.com/preferenceEKF2025/preference_ekf.

URL: https://openreview.net/forum?id=ckWRSrY8i4

---

Title: PolyNODE: Variable-dimension Neural ODEs on M-polyfolds

Abstract: Neural ordinary differential equations (NODEs) are geometric deep learning models based on dynamical systems and flows generated by vector fields on manifolds. Despite numerous successful applications, particularly within the flow matching paradigm, all existing NODE models are fundamentally constrained to fixed-dimensional dynamics by the intrinsic nature of the manifold's dimension. In this paper, we extend NODEs to M-polyfolds (spaces that can simultaneously accommodate varying dimensions and a notion of differentiability) and introduce PolyNODEs, the first intrinsic variable-dimensional flow-based model in geometric deep learning. As an example application, we construct explicit M-polyfolds featuring dimensional bottlenecks and PolyNODE autoencoders based on parametrised vector fields that traverse these bottlenecks. We demonstrate experimentally that our PolyNODE models can be trained to solve reconstruction tasks in these spaces, and that latent representations of the input can be extracted. The code used in our experiments is publicly available at [omitted for reviewing].

URL: https://openreview.net/forum?id=CKQ1RfnHbn

---

Title: WaveletDiff: Multilevel Wavelet Diffusion For Time Series Generation

Abstract: Time series are ubiquitous in many applications that involve forecasting, classification and causal inference tasks, such as healthcare, finance, audio signal processing and climate sciences. Still, large, high-quality time series datasets remain scarce. Synthetic generation can address this limitation; however, current models confined either to the time or frequency domains struggle to reproduce the inherently multi-scaled structure of real-world time series. We introduce WaveletDiff, a new framework that trains diffusion models directly on wavelet coefficients to exploit the inherent multi-resolution structure of time series data. The model combines dedicated transformers for each decomposition level with cross-level attention mechanisms that enable selective information exchange between temporal and frequency scales through adaptive gating. It is also informed by level-specific energy constraints based on Parseval's theorem which preserve time-frequency properties throughout the diffusion process. Comprehensive tests across six real-world datasets from energy, finance, and neuroscience domains demonstrate that WaveletDiff consistently outperforms state-of-the-art time-domain and frequency-domain generative methods on both short and long time series across five diverse performance metrics. For example, WaveletDiff achieves discriminative scores and Context-FID scores that are $3\times$ smaller on average than the second-best baseline across all datasets. Our code is available at https://anonymous.4open.science/r/WaveletDiff-27E9/.

URL: https://openreview.net/forum?id=YSTJ71fDBm
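
The Parseval-based energy constraint mentioned above rests on the fact that an orthonormal wavelet transform preserves signal energy across decomposition levels. The sketch below checks this with a Haar transform. The Haar basis, the three levels, and the test signal are assumptions made for illustration; the abstract does not state which wavelet family WaveletDiff uses.

```python
import math

def haar_dwt(signal, levels):
    """Multilevel orthonormal Haar transform; returns [detail_1, ..., detail_L, approx_L]."""
    coeffs, approx = [], list(signal)
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs

def energy(values):
    """Sum of squares of a coefficient band."""
    return sum(v * v for v in values)

signal = [math.sin(0.3 * t) + 0.1 * t for t in range(16)]
bands = haar_dwt(signal, levels=3)
band_energies = [energy(b) for b in bands]
print("per-level energies:", [round(e, 4) for e in band_energies])
print("sum over levels   :", round(sum(band_energies), 6))
print("time-domain energy:", round(energy(signal), 6))
```

Since the per-level energies sum to the time-domain energy, a generator that controls the energy of each level also controls how total energy is distributed across scales.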

---

Title: Predicting Chain-of-Thought Correctness from Trajectory Geometry

Abstract: We ask whether the geometry of a reasoning trajectory, that is, how a chain-of-thought (CoT) trace moves through semantic space independent of its length, predicts whether the final answer is correct, and whether that prediction is useful in practice. Across 2,800 CoT traces spanning three reasoning benchmarks (FOLIO, GSM8K, and PrOntoQA) and five language models, we extract interpretable trajectory-level features (adjacent-step transition energy, path entropy, semantic drift, loopiness, discourse-graph spectra, and direction-sensitive drift) and predict per-trace correctness. Under problem-grouped cross-validation, in which every model’s traces for a given problem are held out together, the engineered features reach 0.748 balanced accuracy (95% confidence interval [0.733, 0.764]), against 0.609 for a length-only baseline and 0.687 for a mean-pooled sentence-embedding classifier; the margin over both is significant under a paired bootstrap (p < 0.001). A purely geometric subset of the features reaches 0.72 balanced accuracy on its own, well above the lexical-entropy and discourse-marker feature groups (both below 0.64), which isolates a specifically geometric signal rather than a generic interpretable-feature effect. The result survives label permutation, length-feature ablation, final-answer removal, and prefix truncation, and generalizes leave-one-dataset-out with a measurable discount. Used as a verifier, the classifier reranks the four to five model traces for each problem to a correct answer 88.3% of the time in distribution, against 79.8% for majority voting; out of distribution, however, the ranking inverts, and majority voting wins on two of three held-out benchmarks. The verifier is thus useful when trained on the target distribution but does not transfer across benchmarks. We report one further qualifying result, namely that on the synthetic PrOntoQA benchmark a generic embedding baseline matches the engineered features.

URL: https://openreview.net/forum?id=H9cBkEqVeY

---

Title: LoRPS: Solving High Dimensional Integral Equations With Low-Rank Polynomials Sum

Abstract: Solving high-dimensional integral equations is a core challenge in science and engineering, primarily due to the curse of dimensionality. Classical numerical solvers, which rely on discretizing the problem domain, suffer from computational costs that grow exponentially with dimension. Monte Carlo methods, often used in machine learning, sidestep this scaling by providing convergence rates independent of dimension. However, they require a large number of samples to accurately approximate integrals. We show that naive sampling can bias the loss function, resulting in inaccurate solutions, especially in high-dimensional settings. We introduce Low Rank Polynomials Sum (LoRPS), a method that can scale to high dimension and solve the challenge of accurate integral estimation. LoRPS leverages a low-rank, separable polynomial structure that allows the integrals in the loss function to be computed analytically, avoiding any sampling-induced error while maintaining minimal computational overhead. We prove that under mild assumptions, LoRPS mitigates the curse of dimensionality. On challenging high-dimensional benchmarks, it consistently achieves higher accuracy while using at least \(3.5\times\) less memory than existing methods.

URL: https://openreview.net/forum?id=5CEsFukGtf
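
The analytic-integration idea behind a low-rank, separable polynomial structure can be sketched as follows. This is not the LoRPS implementation: the integration domain $[0,1]^d$, the monomial coefficient representation, and the rank-2 example function are assumptions for illustration. For a function of the form $\sum_r \prod_k p_{r,k}(x_k)$, the $d$-dimensional integral factorizes into products of one-dimensional polynomial integrals, so it can be computed exactly at a cost linear in $d$.

```python
import random

def integrate_poly_01(coeffs):
    """Exact integral over [0, 1] of sum_j coeffs[j] * x**j."""
    return sum(c / (j + 1) for j, c in enumerate(coeffs))

def eval_poly(coeffs, x):
    return sum(c * x**j for j, c in enumerate(coeffs))

def integrate_low_rank(factors):
    """Exact integral over [0, 1]^d of sum_r prod_k p[r][k](x_k)."""
    total = 0.0
    for rank_term in factors:
        prod = 1.0
        for coeffs in rank_term:
            prod *= integrate_poly_01(coeffs)  # one 1-D integral per dimension
        total += prod
    return total

def monte_carlo(factors, n_samples, seed=0):
    """Plain Monte Carlo estimate of the same integral, for comparison."""
    rng = random.Random(seed)
    dim = len(factors[0])
    acc = 0.0
    for _ in range(n_samples):
        x = [rng.random() for _ in range(dim)]
        for rank_term in factors:
            prod = 1.0
            for coeffs, xk in zip(rank_term, x):
                prod *= eval_poly(coeffs, xk)
            acc += prod
    return acc / n_samples

dim = 20
# Hypothetical rank-2 function: prod_k (1 + x_k) + prod_k (2 * x_k**2)
factors = [[[1.0, 1.0]] * dim, [[0.0, 0.0, 2.0]] * dim]
exact = integrate_low_rank(factors)
estimate = monte_carlo(factors, n_samples=2000)
print(f"analytic    = {exact:.4f}")
print(f"Monte Carlo = {estimate:.4f}")
```

The analytic value equals $1.5^{20} + (2/3)^{20}$ and involves no sampling, whereas the Monte Carlo estimate varies with the seed and sample count.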

---

Title: Pragmatic Curiosity: A Unified Framework for Hybrid Learning and Optimization via Active Inference

Abstract: Many engineering and scientific workflows rely on expensive black-box evaluations, requiring sequential decisions that must both improve task performance and reduce uncertainty. Bayesian optimization (BO) and Bayesian experimental design (BED) provide powerful but largely separate treatments of goal-directed optimization and information-seeking experimentation, leaving limited guidance for hybrid settings in which learning and optimization are intrinsically coupled. We propose Pragmatic Curiosity (PraC), a unified framework for hybrid learning and optimization via active inference. PraC evaluates candidate queries by trading information gain about a task-relevant latent symbol against an expected regret-based potential over outcomes. This formulation foregrounds three operational design choices: which latent quantity should be clarified, how task value is encoded as regret, and how strongly information gain should be exchanged against pragmatic regret. We instantiate PraC across three regimes of increasing complexity: decision-oriented monitoring with fixed global symbols and known downstream losses, targeted active search with induced local symbols and evolving coverage goals, and composite Bayesian optimization with hierarchical regret learning under unknown preferences. Across these regimes, PraC reduces downstream decision risk, improves coverage of critical outcome regions, and jointly learns predictive and preference structures without relying on task-specific staging rules.

URL: https://openreview.net/forum?id=DHkYR1qKpS

---

Title: Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning

Abstract: Markov decision processes (MDPs) are used to model a wide variety of applications ranging from game playing and robotics to finance. Their optimal policy typically maximizes the expected sum of rewards given at each step of the decision process. However, many real-world problems do not fit straightforwardly into this framework. They are better described as non-cumulative Markov decision processes (NCMDPs), in which the expected value of an arbitrary function of the rewards is maximized instead of their expected sum. Example functions include the maximum of the rewards or their mean divided by their standard deviation. In this work, we introduce a general mapping of NCMDPs to standard MDPs. This allows all techniques developed to find optimal policies for MDPs, such as reinforcement learning or dynamic programming, to be directly applied to the larger class of NCMDPs. We demonstrate the effectiveness of our approach in diverse reinforcement learning tasks, including classical control, financial portfolio optimization, and discrete optimization. Our approach improves both final performance and training efficiency compared to relying on standard MDPs.

URL: https://openreview.net/forum?id=OJCdddDtNy
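
One way such a mapping can work is illustrated below for the maximum-of-rewards objective. The abstract does not spell out the construction, so this is an assumed instance rather than the paper's general mapping, and the function name is hypothetical. The running maximum is carried as extra state, and each step emits the increment of that running maximum as a modified reward, so the ordinary sum of modified rewards equals the maximum of the original rewards.

```python
def to_cumulative_rewards(rewards):
    """Rewrite max(rewards) as a sum by tracking the running maximum as extra state."""
    running_max, modified = None, []
    for r in rewards:
        if running_max is None:
            modified.append(r)  # first step: the maximum so far is r itself
            running_max = r
        else:
            new_max = max(running_max, r)
            modified.append(new_max - running_max)  # increment of the running maximum
            running_max = new_max
    return modified

rewards = [0.2, -1.0, 3.5, 1.0, 3.5, 4.0]
modified = to_cumulative_rewards(rewards)
print(modified)
print(sum(modified), max(rewards))
```

Because the modified rewards are summed in the usual way, a standard cumulative-reward algorithm applied to the state augmented with the running maximum optimizes the expected maximum in this example.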

---

Title: Disentanglement by means of action-induced representations

Abstract: Learning interpretable representations with variational autoencoders (VAEs) is a major goal of representation learning. The main challenge lies in obtaining disentangled representations, where each latent dimension corresponds to a distinct generative factor. This difficulty is fundamentally tied to the inability to perform nonlinear independent component analysis. Here, we introduce the framework of action-induced representations (AIRs) which models representations of physical systems given experiments (or actions) that can be performed on them. We show that, in this framework, we can provably disentangle degrees of freedom w.r.t. their action dependence. We further introduce a variational AIR architecture (VAIR) that can extract AIRs and therefore achieve provable disentanglement where standard VAEs fail. Beyond state representation, VAIR also captures the action dependence of the underlying generative factors, directly linking experiments to the degrees of freedom they influence.

URL: https://openreview.net/forum?id=ePsWvdgHu6

---

Title: Breaking BatchNorm Barriers for Noise-driven Data Free Knowledge Distillation

Abstract: Distillation using Gaussian noise is the simplest instantiation of data-free knowledge distillation: it uses no auxiliary generator, no synthesized images, and no proxy data. The idea is to sample inputs from a standard Gaussian and match teacher-student outputs. In practice, however, this approach is fragile: its behavior varies sharply across architectures and is tightly coupled to the teacher's normalization choices. In this work, we systematically study when and why Gaussian-noise distillation succeeds or fails across model families, normalization schemes, and scales from CIFAR-10 to ImageNet-100, and we identify the main factors that control its stability and effectiveness. Building on these insights, we propose NormShift-KD, a normalization-aware framework for noise-driven distillation with two instantiations tailored to the teacher's normalization: (i) for BatchNorm teachers, we pair current-statistics (CS) inference with rejection sampling to correct the class-imbalance that BN teachers exhibit on Gaussian inputs; (ii) for LayerNorm and GroupNorm teachers, we introduce a lightweight batch-alignment wrapper that restores the inter-sample coupling these per-sample normalizers lack, enabling noise-driven distillation from non-BatchNorm teachers for the first time. We further conduct a targeted BatchNorm ablation, progressively replacing BatchNorm in the teacher to map how transfer quality degrades and which components matter most, and we analyze how the student's architecture and normalization interact with the teacher (CNN/BatchNorm vs. Transformer/LayerNorm). Finally, we provide a theoretical explanation for the failure modes observed in ViT-style models under Gaussian-noise inputs, making noise-driven distillation more interpretable and more broadly usable.

URL: https://openreview.net/forum?id=x49pHahbkv

---

Title: FreeFuse: Multi-Subject LoRA Fusion via Adaptive Token-Level Routing at Test Time

Abstract: This paper proposes FreeFuse, a training-free framework for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to prior studies that focus on retraining LoRAs to alleviate feature conflicts, our analysis reveals that spatially confining each subject LoRA's output to its target region and preventing other LoRAs from intruding into this area is sufficient for effective mitigation. Accordingly, we implement Adaptive Token-Level Routing during the inference phase. However, obtaining reliable routing regions remains challenging. Existing methods that rely on text-image latent association, such as raw cross-attention or concept-level similarity matching, often suffer from sparse activations, hole artifacts, and unstable localization when handling visually similar subjects, leading to incomplete or ambiguous subject masks. To address these issues, we introduce FreeFuseAttn, a mechanism that exploits the flow matching model's intrinsic semantic alignment to dynamically match subject-specific tokens to their corresponding spatial regions at early denoising timesteps, thereby bypassing the need for external segmentors. FreeFuse distinguishes itself through high practicality: it necessitates no additional training, model modifications, or user-defined spatial constraints. Users need only provide subject activation words to achieve seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both identity preservation and compositional fidelity. Our code is available at https://anonymous.4open.science/r/FreeFuse_anno-FC99.

URL: https://openreview.net/forum?id=Bvela8OAMc

---

Title: Simulation-free Structure Learning for Stochastic Dynamics

Abstract: Modeling dynamical systems and unraveling their underlying causal relationships is central to many domains in the natural sciences. Various physical systems, such as those arising in cell biology, are inherently high-dimensional and stochastic in nature, and admit only partial, noisy state measurements. This poses a significant challenge for addressing the problems of modeling the underlying dynamics and inferring the network structure of these systems. Existing methods are typically tailored either for structure learning or modeling dynamics at the population level, but are limited in their ability to address both problems together. In this work, we address both problems simultaneously: we present StructureFlow, a novel and principled simulation-free approach for jointly learning the structure and stochastic population dynamics of physical systems. We showcase the utility of StructureFlow for the tasks of structure learning from interventions and dynamical (trajectory) inference of conditional population dynamics. We empirically evaluate our approach on high-dimensional synthetic systems, a set of biologically plausible simulated systems, and an experimental single-cell dataset. We show that StructureFlow can learn the structure of underlying systems while simultaneously modeling their conditional population dynamics — a key step toward the mechanistic understanding of systems behavior.

URL: https://openreview.net/forum?id=Kj9cibMkV3

---

Title: Emotion-JEPA: Predictive Visual Adaptation and Audio-Modulated Fusion for Multimodal Emotion Recognition

Abstract: Multimodal emotion recognition (MER) requires combining visual, acoustic, and textual cues from short, noisy, and often ambiguous emotional expressions. While large pretrained multimodal models provide strong general-purpose representations, their direct use for MER can be limited by a mismatch between generic pretraining data and fine-grained affective behavior, as well as by fusion mechanisms that do not explicitly account for modality reliability. We study a two-stage framework for MER that isolates two factors: affective visual representation adaptation and reliability-aware multimodal fusion. In the first stage, we adapt a visual encoder to the emotion domain using predictive self-supervised learning on unlabeled emotion videos, without using pseudo-labels or additional manual annotations. In the second stage, we train a supervised multimodal classifier with Audio-Modulated Hybrid Fusion (AMHF), where audio cues guide cross-modal interaction through audio spectral gating, adaptive cross-modal routing, temporal memory, uncertainty estimation, and progressive fusion. On the MER2024-SEMI benchmark, visual predictive adaptation improves performance by $+7.92$ weighted-average F1 (WAF) over the same model without domain adaptation. Under matched encoders, parameter budgets, and training protocols, AMHF improves performance by $+7.25$ WAF over a capacity-matched cross-attention fusion baseline. Component-level ablations further show that each AMHF stage contributes to the final performance. These results suggest that, for MER under limited supervision, domain-aligned representation learning and reliability-aware fusion can be as important as increasing model scale.

URL: https://openreview.net/forum?id=J5TIa3f9vd

---

Title: When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

Abstract: For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as a debugging problem than as pure one-shot generation. We study PPO-trained agents, using MiniGrid as the core sparse-structured evaluation and MuJoCo reaching/locomotion as boundary stress tests. Our audit finds two dominant one-shot failure modes: reward flooding and semantic/API misunderstanding, plus a rarer, less reliably labeled weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. On sparse structured tasks with diagnosable reward failures, refinement improves DoorKey-8x8 from 2.3% success without shaping to 97.6%, and KeyCorridor from 31.2% one-shot to 86.7% with high seed-to-seed variance. Controls indicate that these gains are not attributable to generic retrying or extra training alone: metrics-only re-prompting produces large drops (DoorKey-8x8: 97.6% to 68.6%; KeyCorridor: 86.7% to 11.5%), while a static-vocabulary control recovers much of the gap (DoorKey-8x8: 87.6%; KeyCorridor: 70.7%), showing that the taxonomy prompt itself is a major mechanism and that dynamic trigger labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons help separate refinement from selection and training-time effects. Component-removal stress tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results expose the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without producing robust locomotion gains. The low-call protocol is a protocol-cost contrast with population-based reward search, not a shared-benchmark performance comparison. 
In the four environments where we run a fully crossed variance design, variance point estimates are consistent with larger gains from reward correction when LLM reward-function variance dominates, but bootstrap intervals are wide and temper exact share claims. The method's scope is deliberately bounded to sparse structured tasks with reliable structured interfaces under PPO training; richer semantic fields such as event_text can help, hurt, or be neutral depending on alignment with the task structure.

URL: https://openreview.net/forum?id=5TmwBBn8Oh

---

Title: Adaptive Target Scheduling for Safe Offline-to-Online Deployment of Constrained Decision Transformers

Abstract: Target-conditioned decision transformers expose cost-to-go (CTG) targets that appear to offer deployment-time safety control, but whether this interface remains meaningful under environment shift is an open question. We study this problem in a source-to-target offline-to-online setting and propose Safe-CDT, which combines conservative Wilson-score target scheduling with cost-aware multiplicative sample reweighting. Under a consistent runtime-budget protocol, Safe-CDT achieves mean cost $8.56$ at budget $B=30$ on CarGoal1$\to$CarGoal2, with violation rate $14.4\%$; on PointCircle1$\to$PointCircle2, however, violation rate rises to $56\%$. We therefore frame the paper as an empirical characterization rather than a universal safety claim. Offline safe RL baselines cluster into distinct failure modes: budget saturation, reward collapse, and seed instability. CDT-only attribution reveals that the low violation rate on the main pair is largely inherited from the pre-trained backbone (zero-shot VR $15.6\%$). We then test whether the CTG interface remains useful as a deployment-time control variable, using 150 zero-shot target-response runs. A static sweep shows positive target-cost association on all three pairs, and a runtime step-change experiment moves cost in the expected direction under a seed-consistency criterion, subject to saturation, non-monotonicity, and temporal-drift caveats. All safety observations are empirical, not formal guarantees; code and full per-seed results are released.

URL: https://openreview.net/forum?id=20AFD1NyLH
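
For readers unfamiliar with the Wilson score interval mentioned in the abstract above, the sketch below computes it for an observed violation rate. It shows only the textbook interval; how Safe-CDT turns such a bound into a cost-to-go target schedule is not stated in the abstract and is not modelled here. The function name and the default `z = 1.96` are assumptions made for illustration.

```python
import math
from typing import Tuple


def wilson_interval(violations: int, episodes: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (textbook formula).

    Hypothetical helper: `violations` out of `episodes` rollouts exceeded
    the cost budget; `z` is the normal quantile for the chosen confidence.
    """
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    p_hat = violations / episodes
    denom = 1.0 + z * z / episodes
    center = (p_hat + z * z / (2 * episodes)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / episodes + z * z / (4 * episodes ** 2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


lower, upper = wilson_interval(violations=0, episodes=10)
```

A conservative scheduler could, for example, act on the upper end of the interval rather than on the raw rate, which stays cautious when few episodes have been observed: zero violations in ten episodes still leaves an upper bound well above zero.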

---

Title: Sparse Training of Neural Networks based on Multilevel Mirror Descent

Abstract: We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the naturally incurred sparsity by alternating between periods of static and dynamic sparsity pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure to enable efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guarantees by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks. We also show that the theoretical number of FLOPs compared to SGD training can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy. We additionally show a training time reduction of about 50% when using a sparsity-aware CPU implementation of our method.

URL: https://openreview.net/forum?id=c3Wgvy3FiU
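
The linearized Bregman iteration that the abstract above builds on can be sketched on a toy problem. The code below is a plain, single-level version with an L1 regularizer and a separable quadratic loss; it does not include the paper's freezing schedule or multilevel framework, and the names `linbreg`, `tau`, `lam`, and `delta` are illustrative assumptions.

```python
from typing import Callable, List


def shrink(v: float, lam: float) -> float:
    """Soft-thresholding: entries with |v| <= lam map to exactly zero."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def linbreg(
    grad: Callable[[List[float]], List[float]],
    dim: int,
    steps: int,
    tau: float = 0.1,
    lam: float = 1.0,
    delta: float = 1.0,
) -> List[float]:
    """Plain linearized Bregman iteration for an L1 regularizer (sketch)."""
    v = [0.0] * dim      # dual (subgradient) variable, updated by gradient steps
    theta = [0.0] * dim  # primal parameters, starting fully sparse
    for _ in range(steps):
        g = grad(theta)
        v = [vi - tau * gi for vi, gi in zip(v, g)]
        theta = [delta * shrink(vi, lam) for vi in v]
    return theta


# Toy loss 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = [3.0, 0.0, -2.0]


def grad(theta: List[float]) -> List[float]:
    return [t - s for t, s in zip(theta, target)]


theta = linbreg(grad, dim=3, steps=500)
```

Parameters only become nonzero once their accumulated gradient signal exceeds the threshold `lam`, which is why the iterate starts sparse and the coordinate with no gradient signal stays at exactly zero.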

---

Title: SERA: Soft Ensemble Reliability Aggregation for Robust Multi-Agent Reinforcement Learning

Abstract: Bootstrapped temporal-difference learning inherently introduces variance into value estimates, which often destabilizes learning due to value function oscillation between over- and under-estimation. Overestimation is commonly mitigated through pessimistic critic updates, but such bias-based approaches can introduce underestimation and do not address the estimation variance, which is often amplified in multi-agent reinforcement learning (MARL) due to its inherent learning complexities. To address this, we propose SERA, a soft ensemble reliability aggregation framework designed to reduce value estimation variance through reliability-aware critic aggregation. SERA constructs targets through soft reliability-weighted aggregation of critic estimates and introduces a novel decorrelation mechanism that adaptively tunes each critic’s learning rate based on temporal-difference error uncertainty and the variance of target estimation error. This leads to more stable and reliable target estimation during training. Experiments on a wide range of multi-agent continuous-control benchmarks from MuJoCo and PettingZoo show that SERA consistently outperforms strong twin-critic and ensemble baselines, achieving performance improvements of up to 41.1%. We further demonstrate that the same framework generalizes well to single-agent continuous-control tasks, providing gains of up to 31.25% over established methods.

URL: https://openreview.net/forum?id=fOTpZtq4aO

---

Title: Grid-Based Initialization Resolves Frequency Reachability in Trainable-Frequency Quantum Machine Learning

Abstract: Angle-encoded variational quantum circuits admit a truncated Fourier
series representation of their output, but approximating functions
with maximum frequency $\omega_{\max}$ using fixed unary encoding
requires $\mathcal{O}(\omega_{\max})$ encoding gates.
Trainable-frequency (TF) circuits promise a reduction by learning the
data-encoding prefactors alongside the ansatz parameters, adapting the
accessible frequency spectrum to the target during training.
We identify a practical barrier that prevents this promise from being
realized: the prefactor gradient is suppressed by the spectral gap
between the circuit's accessible frequencies and the target spectrum,
independently of the ansatz parameters, confining gradient-driven
prefactor movement to a narrow neighborhood of initialization.
We propose \emph{ternary grid initialization}---setting prefactors to
$\{1, 3, 9, \ldots, 3^{k-1}\}$---which resolves this limitation by
ensuring every target frequency within $[-\omega_{\max}, \omega_{\max}]$
lies within $\tfrac{1}{2}$ unit of a grid point at initialization,
removing the spectral gap suppression by construction.
On a synthetic benchmark with target frequencies shifted well beyond
the standard initialization range, ternary initialization achieves
median $R^2 = 0.997$ versus $0.18$ for unary initialization, with
$100\%$ of runs achieving $R^2 > 0.95$ against $0\%$.
CMA-ES with $20\times$ the evaluation budget reaches only $25\%$
success, confirming the limitation is a property of the optimization
landscape rather than of gradient-based optimization specifically.
Real-world validation on two benchmark datasets demonstrates consistent
advantages over both fixed and trainable unary baselines.

URL: https://openreview.net/forum?id=qAP88JJnF8
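
The coverage property behind ternary grid initialization in the abstract above can be checked numerically. The sketch assumes, as a simplification not stated in the abstract, that each encoding gate with prefactor w contributes one of -w, 0, or +w to an output frequency, so the accessible spectrum is the set of signed sums of the prefactors. The function name is hypothetical.

```python
from itertools import product
from typing import List, Set


def reachable_frequencies(prefactors: List[int]) -> Set[int]:
    """All signed sums sum_i c_i * w_i with c_i in {-1, 0, +1} (assumed model)."""
    return {
        sum(c * w for c, w in zip(coeffs, prefactors))
        for coeffs in product((-1, 0, 1), repeat=len(prefactors))
    }


k = 3
ternary = [3 ** i for i in range(k)]  # prefactors 1, 3, 9
unary = [1] * k                       # prefactors 1, 1, 1
omega_max = (3 ** k - 1) // 2         # largest signed sum of 1, 3, 9

ternary_spectrum = reachable_frequencies(ternary)
unary_spectrum = reachable_frequencies(unary)
```

Under this model, k gates with ternary prefactors reach every integer frequency up to (3^k - 1)/2, whereas k unary gates reach only up to k. Any real-valued target in the ternary range is therefore within 1/2 of a reachable grid point.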

---

Title: Minority Collective Action for User-Side Fairness

Abstract: Machine learning models often preserve biases present in training data, leading to unfair treatment of certain minority groups. Despite an array of existing firm-side bias-mitigation techniques, these methods typically incur utility costs and require organizational buy-in. Recognizing that many models rely on user-contributed data, end-users can induce fairness through the framework of Algorithmic Collective Action. In this setting, a coordinated minority group strategically relabels its own data to enhance fairness without altering the firm's training process. We design practical, model-agnostic, minority-only collective-action methods and validate them on real-world datasets. Our findings show that a subgroup of the minority can substantially reduce unfairness with little impact on overall prediction error.

URL: https://openreview.net/forum?id=08C5vWdBzA

---

Title: Do Vision-Language Models Actually See?

Abstract: Vision-language models (VLMs) fail systematically on counterfactual visual reasoning tasks, defaulting to canonical linguistic priors even when visual evidence directly contradicts them. Prior work documents this phenomenon but does not explain where the failure originates: in visual perception or in language-level reasoning. We address this gap through a controlled two-condition ablation study across 13 VLMs (spanning 2.2B–235B parameters, 2023–2025) on a matched dataset of 60 counterfactual images and 60 structurally equivalent scene graphs. We establish three findings. First, semantic bias is a perception failure, not a reasoning failure: all models achieve 100% accuracy on scene graph inputs with zero bias, yet show 61–100% bias on equivalent visual inputs, confirming that the reasoning capability for counterfactual recognition is intact and the bottleneck lies upstream in visual processing. Second, three qualitatively distinct failure modes exist across the model population (complete prior collapse, visual confusion without prior retrieval, and salience-gated partial grounding), and these modes are not predicted by model scale alone; a 235B-parameter model and a 2.2B-parameter model occupy the same failure mode. Third, perceptual salience governs where grounding succeeds: in the one model capable of partial grounding (Llama-4-Scout, 17B), bias rates follow a salience-stratified gradient spanning 35 percentage points across high-, mid-, and low-salience concept tiers (Spearman ρ = −0.379, p = 0.164, directional; n = 15). These findings reframe VLM bias as a perceptual alignment problem rather than a knowledge or reasoning problem, with direct implications for model evaluation, training data design, and safety-critical deployment.

URL: https://openreview.net/forum?id=o9KC3v0pvb

---

Title: MirrorTD: Constraint-Aware Diffusion Models for Mixed-Type EHR Time Series Generation

Abstract: The generation of synthetic electronic health record (EHR) data is a critical enabler for ML in healthcare. However, it remains challenging because clinical time series are mixed-type (numerical and categorical), high-dimensional, temporally structured, and subject to constraints such as data validity and patient survival status. In response to these challenges, we propose `MirrorTD`, a multi-stage score-based diffusion framework that integrates mixed-type Gaussian and discrete diffusion processes with a mirror-mapping variational autoencoder to embed constraints. Specifically, we embed constrained indicators into a continuous latent space via the mirror mapping and utilize an efficient spatio-temporal attention mechanism to capture temporal dynamics and cross-feature dependencies. Experiments on three real-world ICU datasets show that our method produces realistic, diverse, and constraint-compliant synthetic EHRs, advancing synthetic time-series generation for critical-care cohorts.

URL: https://openreview.net/forum?id=4vXf3BBRck

---

Title: Concept Drift from a Causal Perspective

Abstract: Concept drift is a common phenomenon in real-world data streams, in which changes in the data-generating distribution can degrade predictive model performance. Most existing definitions characterize drift as changes in the joint distribution $P(\mathbf{x}, y)$, without distinguishing which component of the data-generating process has changed. In this work, we introduce a causal perspective on concept drift based on Structural Causal Models (SCMs). We propose a taxonomy that categorizes drift events by their causal origin, including changes in exogenous variables, endogenous mechanisms, confounders, and target-generating processes. Building on this framework, we develop an SCM-based data stream generator that simulates controlled mechanism-level drift events. Our experiments empirically characterize the distributional effects of each drift type and show that drifts with different causal origins induce distinct patterns of distribution shift and predictive behavior. Furthermore, by integrating causal discovery methods, we use our framework to construct data streams grounded in real-world dependency structures, enabling more realistic and informative evaluation scenarios. We also demonstrate that leveraging the generated data can improve downstream performance. These results highlight the importance of accounting for causal structure when studying and evaluating adaptive learning methods, and establish a foundation for causally-aware evaluation in non-stationary environments.

URL: https://openreview.net/forum?id=B6RFKcPpEW

---

Title: Prediction Stability as a Function-Space Proxy for Flat Minima

Abstract: Flat minima in neural network loss landscapes are widely believed to support better generalization. However, many existing definitions of flatness are poorly specified, as they are highly sensitive to reparameterization and architectural design choices. This raises a fundamental question: is flatness truly a property of the parameter space, or does it reflect stability in the learned function itself? We adopt a function-centered perspective and examine flatness through behavior in output space rather than through parameter-space geometry. From this standpoint, meaningful flatness is expressed as stability in model predictions under controlled perturbations, independent of how the weights are parameterized. To investigate this idea, we introduce the Function-Space Flatness Proxy, an analytical probe designed for both empirical and theoretical evaluation of output-space stability. The proxy measures prediction stability under perturbations and incorporates stability-guided model selection along with a stability complexity metric. It is invariant to reparameterization and does not rely on second-order quantities such as the Hessian or Fisher information matrix. Crucially, FSFP pursues flatness indirectly: it optimizes for prediction stability in output space, and flat minima are the intended destination reached through that stability objective, via a principled route that never requires measuring curvature explicitly. Using this framework, we analyze test accuracy, loss dynamics, calibration, and negative log-likelihood to explore how output-space stability relates to generalization. Experiments on CIFAR benchmarks with ResNet architectures provide empirical validation of the probe. 
The observed improvements in accuracy and the divergence between absolute and normalized sharpness measures are interpreted as evidence that output-space stability yields flat minima as the intended outcome of a principled stability-driven approach, rather than serving as a performance comparison against existing approaches. Overall, these findings suggest that output-space stability offers a practical and interpretable lens for studying flatness (including sharpness geometry and generalization behavior) without reducing the concept to geometric properties alone. This function-centered perspective complements parameter-space approaches and broadens the understanding of generalization beyond traditional notions of sharpness.

URL: https://openreview.net/forum?id=CgwJ12dkiT

---

Title: Content Robust Image Generator Attribution

Abstract: Image generator attribution aims to identify what generator produced an image, if any.
Prior work often focused on identifying new generators without requiring large amounts
of labeled samples by searching for shifts in image distributions. However, these shifts
may appear in other contexts, such as a change in image content. Thus, an image may
be attributed to the wrong generator because its image content changed from what was
typically seen during training. To address this issue, we explore Content Robust imagE
generator attrIbuTion (CREdIT), where a model is evaluated on its ability to attribute an
image accurately even if the generators and/or image content differ from what was
seen during training. After a thorough analysis, we created a carefully crafted yet simple
baseline we refer to as FakesSense, which outperforms the state-of-the-art by 3-7%. This
illustrates a significant shortcoming in prior work, demonstrating a need for more complex
image generator attribution benchmarks like CREdIT.

URL: https://openreview.net/forum?id=8IRi83PW7g

---

Title: VRPAgent: LLM-Driven Discovery of Heuristic Operators for Vehicle Routing Problems

Abstract: Designing high-performing heuristics for vehicle routing problems (VRPs) is a complex task that requires both intuition and deep domain knowledge. Large language model (LLM)-based code generation has recently shown promise across many domains, but it still falls short of producing heuristics that rival those crafted by human experts. In this paper, we propose VRPAgent, a framework that integrates LLM-generated components into a metaheuristic and refines them through a novel genetic search. By using the LLM to generate problem-specific operators, embedded within a generic metaheuristic framework, VRPAgent keeps tasks manageable, guarantees correctness, and still enables the discovery of novel and powerful strategies. Across multiple problems, including the capacitated VRP, the VRP with time windows, and the prize-collecting VRP, our method discovers heuristic operators that outperform handcrafted methods and recent learning-based approaches while requiring only a single CPU core. To our knowledge, VRPAgent is one of the first LLM-based methods to advance the state-of-the-art in VRPs, highlighting a promising future for automated heuristics discovery.

URL: https://openreview.net/forum?id=2cYbzycSbz

---

Title: ASCENSION: Autoencoder-Based Latent Space Class Expansion for Time Series Data Augmentation

Abstract: Achieving effective data augmentation (DA) in time series (TS)
classification is challenging due to the complex nature of
temporal data. While state-of-the-art generative models for DA,
based on generative adversarial networks, diffusion models and variational autoencoders (VAEs), demonstrate potential,
they often fail to yield consistent performance gains across
diverse domains (e.g., ECG, power, vibration). To overcome this,
we propose $\textbf{ASCENSION}$ ($\textbf{A}$utoencoder-based latent
$\textbf{S}$pace $\textbf{C}$lass $\textbf{ExpaNSION}$), a novel
generative framework that leverages the probabilistic nature of a
VAE latent space together with a contrastive loss to promote
intra-class compactness and inter-class separability. Its key
innovation is an $\alpha$-scaling mechanism that progressively
expands per-class posterior covariances while preserving class
identity, populating under-represented neighbourhoods beyond the
training distribution. We evaluate ASCENSION on 102 univariate datasets
from the UCR benchmark using two established deep TS
classifiers and a recent TS foundation model, comparing
it against eight state-of-the-art DA methods. Empirical results
demonstrate that ASCENSION increases average classification
accuracy by approximately $2$%, while the strongest baseline
method results in a $-1.7$% decrease. Notably, ASCENSION
delivers non-negative performance gains on $73.5$% of datasets
(averaged over the three classifiers), compared to $50.0$% for
the second best-performing baseline. An ablation study
further highlights the significant impact of the $\alpha$-scaling
mechanism on these gains. These findings position ASCENSION as
the only DA method in our benchmark that delivers consistent
positive gains across all three classifier families on the
102-dataset UCR archive, without requiring prior knowledge of
method suitability.

URL: https://openreview.net/forum?id=7KFXjf7K2o

---

Title: Manifold Approximation leads to Robust Kernel Alignment

Abstract: Centered kernel alignment (CKA) is a popular metric for comparing representations, determining equivalence of networks, and neuroscience research. However, CKA does not account for the underlying manifold and relies on numerous heuristics that cause it to behave differently at different scales of data. In this work, we propose Manifold approximated Kernel Alignment (MKA), which incorporates manifold geometry into the alignment task. We derive a theoretical framework for MKA. We perform empirical evaluations on synthetic datasets and real-world examples to characterize and compare MKA to its contemporaries. Our findings suggest that manifold-aware kernel alignment provides a more robust foundation for measuring representations, with potential applications in representation learning.

URL: https://openreview.net/forum?id=AP7VqoJcLp
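
For context on the baseline discussed in the abstract above, the sketch below computes standard linear CKA between two representation matrices (rows are examples, columns are features). It is the commonly used linear-kernel form of CKA, not the proposed MKA, whose definition the abstract does not give. The helper names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def center_columns(x: Matrix) -> Matrix:
    """Subtract each feature's mean across examples."""
    n, d = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in x]


def cross_frobenius_sq(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b for matrices with the same row count."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(a[r][i] * b[r][j] for r in range(len(a)))
            total += s * s
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(yc, xc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
scaled = [[2.5 * v for v in row] for row in X]
rotated = [[-row[1], row[0]] for row in X]  # 90-degree rotation of the features
```

Linear CKA is invariant to isotropic scaling and orthogonal transformations of the features, so `linear_cka(X, scaled)` and `linear_cka(X, rotated)` both equal 1 up to floating-point error.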

---

Title: Longitudinal Evaluation of Large Language Models

Abstract: Evaluations of large language models (LLMs) commonly report aggregate accuracy on static benchmarks, collapsing heterogeneous item behavior into a single score and obscuring how true performance evolves across model releases. This limitation is acute for rapidly evolving LLM families, where successive releases may progress unevenly or even regress. To mitigate this issue, we propose a dynamic Item Response Theory (IRT) framework for longitudinal evaluation that jointly calibrates benchmark items by difficulty and discrimination while modeling each LLM release with a time-varying latent ability. Applied to four multiple-choice benchmarks (JEE Math and three MMLU-Pro subsets) across GPT, DeepSeek, and Llama model families, with GPT used as the primary longitudinal case study, we observe non-monotone ability evolution and regimes where GPT-5.2 underperforms GPT-5 despite comparable accuracy, especially in MMLU-Pro Engineering. Item-level diagnostics based on changes in model-implied correctness probability show that regressions concentrate in the moderate-difficulty, high-discrimination region, indicating strong evidence of reduced ability. These results demonstrate that dynamic IRT offers a task-sensitive, longitudinal lens for tracking progress and regressions beyond accuracy while separating true ability changes from the benchmark characteristics. Beyond model comparison, our framework offers a practical tool for benchmark designers and evaluators to identify which items carry meaningful evaluation signal, diagnose instability across model updates, and design more robust longitudinal evaluation protocols.

URL: https://openreview.net/forum?id=INuSvLC7Bq
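
To make the item-level diagnostic in the abstract above concrete, the sketch below uses the standard two-parameter logistic (2PL) IRT model, in which each item has a difficulty and a discrimination and each model release has a latent ability. The abstract does not spell out the exact parameterization or the dynamics linking releases, so the 2PL form, the function names, and the numbers are all assumptions for illustration.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """2PL item response function: sigmoid(discrimination * (ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def delta_p(ability_old: float, ability_new: float,
            difficulty: float, discrimination: float) -> float:
    """Change in model-implied correctness probability between two releases."""
    return (p_correct(ability_new, difficulty, discrimination)
            - p_correct(ability_old, difficulty, discrimination))


# Hypothetical abilities for two successive releases, and two items of equal
# difficulty but different discrimination.
drop_high = delta_p(1.0, 0.8, difficulty=0.9, discrimination=3.0)
drop_low = delta_p(1.0, 0.8, difficulty=0.9, discrimination=0.5)
```

Under this model, a small ability drop moves the predicted correctness most on items whose difficulty is near the model's ability and whose discrimination is high, which is consistent with regressions concentrating in that region.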

---

Title: RMB: Reward Model Boosting Mitigates Reward Hacking

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a powerful technique for aligning large language models (LLMs) with human preference. However, it often suffers from the reward hacking issue, where policy optimization improves the proxy reward model while actually degrading performance with respect to the true human preference, due to the imperfection of the proxy. To address this, we propose Reward Model Boosting (RMB), a novel approach that enhances the robustness and reliability of the reward signal for RLHF. RMB first trains a set of reward models with a diversity-promoting regularizer. This encourages each model to learn complementary aspects of the reward landscape. Then, RMB learns a lightweight aggregator, following the principle of boosting, to aggregate the outputs of the diverse reward models into a more accurate and robust reward signal. Our extensive experiments demonstrate that RMB significantly improves reward accuracy on both in-distribution and out-of-distribution datasets, substantially mitigating the reward hacking issue and ultimately improving RLHF performance.

URL: https://openreview.net/forum?id=bIaCQSyb4g

---

Title: The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

Abstract: A recurring pattern in "reasoning without training" is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time.
A principled way to bias inference toward such modes is power sampling, i.e., sampling from $p_\theta(x)^\alpha$ with $\alpha>1$.
Recent work makes power sampling practical by estimating a future-dependent correction factor $z_t$ via Monte Carlo rollouts, thereby replacing iterative Markov chain Monte Carlo with forward-looking estimation.
In this paper, we reframe that correction factor as a future-value selection potential in a Sequential Monte Carlo (SMC) view of power sampling: $z_t$ plays the role of a critic-like quantity, but can be estimated directly from the model by short-horizon rollouts, no verifier and no training required.

Building on this view, we introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for training-free reasoning that approximates the sequence-level power target with a bounded population of partial solutions. APPS propagates these hypotheses in parallel by proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries, so that finite compute is redistributed across competing prefixes rather than spent along a single unfolding path. This yields a transparent scaling knob in the particle count, predictable peak memory, and a compute pattern that avoids both iterative trajectory editing and dense candidate-wise rollout fan-out, while improving robustness to pivotal early decisions by keeping multiple hypotheses alive throughout decoding. We further study an amortized variant in which the rollout-based selection potential is replaced by a lightweight learned head trained offline from rollout supervision, enabling fast future-value guidance at inference time. More broadly, our results add to a growing view that a nontrivial part of the gains often attributed to post-training may also be approached through more faithful power approximation at inference time.

URL: https://openreview.net/forum?id=PqsjB61Twg
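
Illustrative sketch (not from the paper): the abstract describes sampling from $p_\theta(x)^\alpha$ and a future-dependent correction factor. The toy below enumerates all length-2 sequences of a hypothetical two-token model to show that sharpening each next-token distribution separately does not give the sequence-level power target, and that multiplying by a sum over continuations recovers the correct first-token marginal. The probabilities and the name `future_factor` are assumptions made for illustration; the paper's $z_t$ may be defined or estimated differently (the abstract says it is estimated via rollouts, whereas here it is computed exactly).

```python
from itertools import product

ALPHA = 2.0
VOCAB = "ab"

# Hypothetical toy autoregressive model over two-token sequences.
P_FIRST = {"a": 0.6, "b": 0.4}
P_SECOND = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.9, "b": 0.1}}


def normalize(weights: dict) -> dict:
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def seq_prob(x) -> float:
    return P_FIRST[x[0]] * P_SECOND[x[0]][x[1]]


def sequence_power(alpha: float) -> dict:
    """Exact sequence-level target: p(x)^alpha, renormalized over sequences."""
    return normalize({x: seq_prob(x) ** alpha for x in product(VOCAB, repeat=2)})


def tokenwise_power(alpha: float) -> dict:
    """Sharpen every next-token distribution separately (low-temperature style)."""
    first = normalize({t: p ** alpha for t, p in P_FIRST.items()})
    out = {}
    for t1 in VOCAB:
        second = normalize({t: p ** alpha for t, p in P_SECOND[t1].items()})
        for t2 in VOCAB:
            out[(t1, t2)] = first[t1] * second[t2]
    return out


def future_factor(t1: str, alpha: float) -> float:
    """Sum over continuations of p(x2 | t1)^alpha (exact here, no rollouts)."""
    return sum(p ** alpha for p in P_SECOND[t1].values())


target = sequence_power(ALPHA)
naive = tokenwise_power(ALPHA)

# First-token marginal of the target equals p(t1)^alpha * future_factor(t1), renormalized.
corrected_first = normalize(
    {t: P_FIRST[t] ** ALPHA * future_factor(t, ALPHA) for t in VOCAB}
)
target_first = {t: sum(v for x, v in target.items() if x[0] == t) for t in VOCAB}

print("most likely sequence under the power target:", max(target, key=target.get))
print("first-token marginal (target):   ", target_first)
print("first-token marginal (corrected):", corrected_first)
```

In this toy, the sequence-level target puts the most mass on ("b", "a"), while the token-wise sharpened distribution prefers sequences starting with "a"; the difference comes entirely from the continuation-dependent factor.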

---

Title: Contrastive Time Series Forecasting with Anomalies

Abstract: Time-series forecasting predicts future values from past data. In real-world settings, some anomalous events have lasting effects and influence the forecast, while others are short-lived and should be ignored. Standard forecasting models fail to make this distinction, often either overreacting to noise or missing persistent shifts. We propose Co-TSFA (Contrastive Time-Series Forecasting with Anomalies), a regularization framework that learns when to ignore anomalies and when to respond. Co-TSFA generates input-only and input–output augmentations to model forecast-irrelevant and forecast-relevant anomalies, and introduces a latent–output alignment loss that ties representation changes to forecast changes. This encourages invariance to irrelevant perturbations while preserving sensitivity to meaningful distributional shifts. Experiments on Traffic and Electricity benchmarks, as well as on a real-world cash-demand dataset, demonstrate that Co-TSFA improves performance under anomalous conditions while maintaining accuracy on normal data. An anonymized GitHub repository with the implementation of Co-TSFA is provided here and will be made public upon acceptance.

URL: https://openreview.net/forum?id=WLtcRXqHT2

---

Title: Reproducing "Towards Safer Pretraining": Public Artifacts Replicate, Closed-API Results Drift

Abstract: Closed-API safety classifiers can shift F1 by 20+ points over two weeks with no user-visible model change. We evaluate four claims from Mendu et al. (2025)'s harmful-content-detection framework using only public artefacts and find a clean split. Claims that hinge on locally-runnable artefacts reproduce within tolerance: HAVOC leakage matches the reported 26.7% almost exactly, and HarmFormer reaches 0.78 F1 on TTP-Eval, a public-artefact baseline the original paper does not report. The two TTP claims, which depend on gpt-4o through a floating alias, diverge by 21 and 25 F1 points in April and close on a byte-identical May rerun, isolating closed-API snapshot drift as the dominant variance source and extending Chen et al. (2023) into the safety-classifier setting. Two extensions complement the reproduction. A raw-output audit traces the cross-model F1 spread on TTP-Eval to each model's Intent-label emission rate rather than to parser brittleness. A six-language evaluation across two classifier benchmarks shows HarmFormer collapsing on both (mean drops of $-0.47$ F1 on TTP-Eval and $-0.31$ F1 on OpenAI Moderation vs. English) while Llama Guard 3 stays robust on its native short-prompt surface (mean drop $-0.06$), separating model-level multilingual brittleness in HarmFormer from benchmark-mismatch artefacts in Llama Guard 3. All code, Slurm job scripts, CodeCarbon traces, and per-sample TSVs are released through the anonymous repository so every reported number is independently re-runnable.

URL: https://openreview.net/forum?id=BlUr1G35Aw

---

Title: Preprocessing Robustness in Heterogeneous Tabular Learning: Sensitivity and Volatility Indices Across Model Families

Abstract: Heterogeneous tabular pipelines combine numerical, categorical, and textual fields, making upstream preprocessing a major, yet often hidden, source of performance variation. While tabular architectures are heavily benchmarked, their robustness to preprocessing choices is rarely studied systematically. We address this gap by treating preprocessing as a first-class experimental variable. By defining a modular space of common transformations, including scaling, encoding, dimensionality reduction, and feature selection, we run controlled, single-knob ablations across three model families: gradient-boosted trees, MLP-style tabular networks, and tabular foundation models (TFMs).
To quantify this impact, we introduce two model-agnostic robustness measures: the Preprocessing Sensitivity Index (Sens), which captures the effect of individual operations, and the Model Volatility Index (Vol), which aggregates sensitivity to measure overall fragility. To distinguish avoidable user errors from within-family variability, we evaluate both mismatched and "best-practice" operator spaces on text-rich public datasets.
Our results reveal that preprocessing effects are family-dependent. Tree-based models exhibit the highest volatility; MLPs are more stable overall but remain highly vulnerable to specific representation shifts; and despite their promises of automation, TFMs do not uniformly abstract away preprocessing, showing notable sensitivity to text and categorical handling. Finally, we introduce the Preprocessing Robustness Evaluation Framework (PREF), an open-source toolkit that generates Robustness Cards to summarize a model's volatility, critical failure risks, and high-payoff tuning opportunities.

URL: https://openreview.net/forum?id=1JhhSxdBS1

---

Title: A Likely Geometry of Generative Models

Abstract: The geometry of generative models serves as the basis for interpolation, model inspection, and more. Although certain generative models admit an implicit geometric structure, there is no broadly applicable framework that captures a principled notion of geometry across generative models without imposing restrictive assumptions on the model class or data dimensionality. In this paper, we show how to equip generative models with a general geometry that is compatible with different metrics and probability distributions and can be used to analyze these models. Our method does not require additional training. We consider curves analogous to geodesics constrained to a suitable data distribution aimed at targeting high-density regions learned by generative models. We formulate this as a (pseudo-)metric and prove correspondence to a Newtonian system on a Riemannian manifold. We show that shortest paths can here be characterized by a system of ordinary differential equations, which, along the optimal path, locally correspond to geodesics under a suitable Riemannian metric. Numerically, we derive a novel algorithm to efficiently compute interpolation and generalized Fréchet means. Quantitatively, we show that curves using our metric traverse regions of higher likelihood than baselines across a range of models and datasets.

URL: https://openreview.net/forum?id=qKM72YmSSw

---

Title: SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions

Abstract: Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, which often yields suboptimal compression rates when performance drops must stay acceptable.
We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance.
The key idea is to employ a spike-and-slab prior to induce sparsity and to model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision.
Due to the intractability of the objective involving spike-and-slab priors with GMMs, we derive an efficient approximation that facilitates effective compression with minimal accuracy loss.
In theory, we establish a consistency result for our proposed variational approach with respect to a sparse and quantized deep neural network.
Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops. Code implementation of SQS and baselines is available at: https://anonymous.4open.science/r/SQS_private-411C.

URL: https://openreview.net/forum?id=3nZb43fvAQ

---
