Daily TMLR digest for Feb 22, 2026


TMLR

Feb 22, 2026, 12:30:09 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: GenAI vs. Human Creators: Procurement Mechanism Design in Two-/Three-Layer Markets

Authors: Rui Ai, David Simchi-Levi, Haifeng Xu

Abstract: With the rapid advancement of generative AI (GenAI), mechanism design adapted to its unique characteristics poses new theoretical and practical challenges. Unlike traditional goods, content from one domain can enhance the training and performance of GenAI models in other domains. For example, OpenAI’s video generation model Sora (Liu et al., 2024b) relies heavily on image data to improve video generation quality. In this work, we study nonlinear procurement mechanism design under data transferability, where online platforms employ both human creators and GenAI to satisfy cross-domain content demand. We propose optimal mechanisms that maximize either platform revenue or social welfare and identify the specific properties of GenAI that make such high-dimensional design problems tractable. Our analysis further reveals which domains face stronger competitive pressure and which tend to experience overproduction. Moreover, the growing role of data intermediaries, including labeling companies such as Scale AI and creator organizations such as The Wall Street Journal, introduces a third layer into the traditional platform–creator structure. We show that this three-layer market can result in a lose-lose outcome, reducing both platform revenue and social welfare, as large pre-signed contracts distort creators’ incentives and lead to inefficiencies in the data market. These findings suggest a need for government regulation of the GenAI data ecosystem, and our theoretical insights are further supported by numerical simulations.

URL: https://openreview.net/forum?id=Eukf4TBHS7

---

Title: GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data

Authors: Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj Jha

Abstract: Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades, since a marriage of the neural and symbolic components could lead to rapid advancements in AI. Yet the field has not realized this promise, as most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of purely neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side. However, automatically deriving reliable KGs from text corpora has remained an open problem. We address the above challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. Together, GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. More concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When an off-the-shelf large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on the reliability front due to prompt sensitivity, shallow domain expertise, and hallucinated relations. Thus, practitioners should avoid employing LLM-generated KGs in high-stakes domains such as medicine, law, business, and education. On text obtained from PubMed papers related to diabetes, our KG extraction pipeline with a small 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only a 40.2% FActScore.
The GraphMERT-extracted KG also achieves a significantly higher ValidityScore of 68.7%, compared to an LLM-generated baseline (43.0%), demonstrating its ability to preserve ontology alignment. KG cleaning further improves factuality, with GraphMERT reaching 76.9% FActScore, compared to 55.6% for the LLM baseline. GraphMERT can then treat the augmented KG as the seed KG and refine it further. Finally, human experts can edit and audit the extracted KGs, further increasing their reliability. This is nearly impossible with purely neural representations. Hence, GraphMERT enables efficient, scalable, transparent (interpretable and explainable), attributable (with provenance), accountable (with governance), editable, auditable, and continually improvable state-of-the-art neurosymbolic AI. The code is available at https://github.com/jha-lab/graphmert_umls

URL: https://openreview.net/forum?id=tnXSdDhvqc

---

Title: Sociodynamics of Reinforcement Learning

Authors: Yann Bouteiller, Karthik Soma, Giovanni Beltrame

Abstract: Reinforcement Learning (RL) has emerged as a core algorithmic paradigm explicitly driving innovation in a growing number of industrial applications, including large language models and quantitative finance. Furthermore, computational neuroscience has long found evidence of natural forms of RL in biological brains. Therefore, it is crucial for the study of social dynamics to develop a scientific understanding of how RL shapes population behaviors. We leverage the framework of Evolutionary Game Theory (EGT) to provide building blocks and insights toward this objective. We propose a methodology that enables simulating large populations of RL agents in simple game theoretic interaction models. More specifically, we derive fast and parallelizable implementations of two fundamental revision protocols from multi-agent RL - Policy Gradient (PG) and Learning with Opponent-Learning Awareness (LOLA) - tailored for population simulations of random pairwise interactions in stateless normal-form games. Our methodology enables us to simulate large populations of 200,000 independent co-learning agents, yielding compelling insights into how non-stationarity-aware learners affect social dynamics.
In particular, we find that LOLA learners promote cooperation in the Stag Hunt model, delay cooperative outcomes in the Hawk-Dove model, and reduce strategy diversity in the Rock-Paper-Scissors model.
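As a toy illustration of the population setup, the sketch below simulates independent REINFORCE (policy-gradient) learners matched in random pairs to play the Stag Hunt. This is our own minimal sketch, not the paper's implementation: the payoff table, learning rate, and population size are assumptions, and the LOLA correction is omitted.

```python
import math
import random

# Stag Hunt payoffs for the row player (assumed values, not from the paper):
# mutual Stag pays best, Hare is the safe option.
PAYOFF = {("S", "S"): 4.0, ("S", "H"): 0.0, ("H", "S"): 3.0, ("H", "H"): 2.0}

def p_stag(logit):
    # Bernoulli policy: probability of playing Stag from a single logit.
    return 1.0 / (1.0 + math.exp(-logit))

def sample_action(logit, rng):
    return "S" if rng.random() < p_stag(logit) else "H"

def pg_update(logit, action, reward, lr=0.05):
    # REINFORCE for a Bernoulli policy: gradient of log pi(action) w.r.t. logit.
    p = p_stag(logit)
    grad = (1.0 - p) if action == "S" else -p
    return logit + lr * reward * grad

def simulate(n_agents=1000, rounds=100, seed=0):
    rng = random.Random(seed)
    logits = [rng.gauss(0.0, 1.0) for _ in range(n_agents)]
    for _ in range(rounds):
        idx = list(range(n_agents))
        rng.shuffle(idx)
        for i, j in zip(idx[::2], idx[1::2]):  # random pairwise matching
            ai, aj = sample_action(logits[i], rng), sample_action(logits[j], rng)
            logits[i] = pg_update(logits[i], ai, PAYOFF[(ai, aj)])
            logits[j] = pg_update(logits[j], aj, PAYOFF[(aj, ai)])
    return sum(p_stag(l) for l in logits) / n_agents  # mean P(Stag)
```

Whether the population drifts toward Stag or Hare depends on the payoffs and the initial policy mix; the function only reports the population's mean probability of playing Stag.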

URL: https://openreview.net/forum?id=Ro6Ylnx8se

---

Title: Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

Authors: Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei

Abstract: Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, where human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLMs. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking–trace–augmented resources that enhance the reliability of LLM raters.

URL: https://openreview.net/forum?id=1jLQ629Yps

---

Title: The Clever Hans Mirage: A Comprehensive Survey on Spurious Correlations in Machine Learning

Authors: Wenqian Ye, Luyang Jiang, Eric Xie, Guangtao Zheng, Yunsheng Ma, Xu Cao, Dongliang Guo, Daiqing Qi, Zeyu He, Yijun Tian, Christopher W. Porter, Megan Coffee, Zhe Zeng, Sheng Li, Ziran Wang, Ting-Hao Kenneth Huang, James Matthew Rehg, Henry Kautz, Aidong Zhang

Abstract: Back in the early 20th century, a horse named Hans appeared to perform arithmetic and other intellectual tasks during exhibitions in Germany, while it actually relied solely on involuntary cues in the body language of its human trainer. Modern machine learning models are no different. These models are known to be sensitive to spurious correlations between non-essential features of the inputs (e.g., background, texture, and secondary objects) and the corresponding labels. Such features and their correlations with the labels are known as spurious because they tend to change with shifts in real-world data distributions, which can negatively impact the model's generalization and robustness. In this paper, we provide a comprehensive survey of this emerging issue, along with a fine-grained taxonomy of existing state-of-the-art methods for addressing spurious correlations in machine learning models. Additionally, we summarize existing datasets, benchmarks, and metrics to facilitate future research. The paper concludes with a discussion of the broader impacts, the recent advancements, and future challenges in the era of generative AI, aiming to provide valuable insights for researchers in the related domains of the machine learning community.

URL: https://openreview.net/forum?id=kIuqPmS1b1

---

Title: Learning Adaptive Multi-Stage Energy-based Prior for Hierarchical Generative Model

Authors: Jiali Cui, Tian Han

Abstract: Hierarchical generative models represent data with multiple layers of latent variables organized in a top-down structure. These models typically assume Gaussian priors for multi-layer latent variables, which lack expressivity for the contextual dependencies among latents, resulting in a distribution gap between the prior and the learned posterior. Recent works have explored hierarchical energy-based prior models (EBMs) as a more expressive alternative to bridge this gap. However, most approaches learn only a single EBM, which can be ineffective when the target distribution is highly multi-modal and multi-scale across hierarchical layers of latent variables. In this work, we propose a framework that learns multi-stage hierarchical EBM priors, where a sequence of adaptive stages progressively refines the prior to match the posterior. Our method supports both joint training with the generator and a more efficient two-phase strategy for deeper hierarchies. Experiments across standard benchmarks show that our approach consistently generates higher-quality images and learns richer hierarchical representations.

URL: https://openreview.net/forum?id=W2zqUkA9Ub

---


New submissions
===============


Title: A Quotient Homology Theory of Representation in Neural Networks

Abstract: Previous research has proven that the set of maps implemented by neural networks with a ReLU activation function is identical to the set of piecewise linear continuous maps. Furthermore, such networks induce a hyperplane arrangement splitting the input domain of the network into convex polyhedra $G_J$ over which a network $\Phi$ operates in an affine manner.

In this work, we leverage these properties to define an equivalence class $\sim_\Phi$ on top of an input dataset, which can be split into two sets related to the local rank of $\Phi_J$ and the intersections $\cap \text{Im}\Phi_{J_i}$. We refer to the latter as the \textit{overlap decomposition} $\mathcal{O}_\Phi$ and prove that if the intersections between each polyhedron and an input manifold are convex, the homology groups of neural representations are isomorphic to quotient homology groups $H_k(\Phi(\mathcal{M})) \simeq H_k(\mathcal{M}/\mathcal{O}_\Phi)$. This lets us intrinsically calculate the Betti numbers of neural representations without the choice of an external metric. We develop methods to numerically compute the overlap decomposition through linear programming and a union-find algorithm.

Using this framework, we perform several experiments on toy datasets showing that, compared to standard persistent homology, our overlap homology-based computation of Betti numbers tracks purely topological rather than geometric features. Finally, we study the evolution of the overlap decomposition during training on several classification problems while varying network width and depth and discuss some shortcomings of our method.
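The union-find step can be sketched as follows. This is a simplification of ours, not the paper's algorithm: it merges sample points whose images under a map f coincide up to a tolerance, standing in for detecting intersections of the images $\text{Im}\Phi_{J_i}$; the linear-programming part of the pipeline is omitted.

```python
def find(parent, x):
    # Union-find with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def overlap_classes(points, f, tol=1e-9):
    # Merge indices of points whose images under f agree coordinate-wise up to
    # tol, then read off the resulting equivalence classes.
    n = len(points)
    parent = list(range(n))
    images = [f(p) for p in points]
    for i in range(n):
        for j in range(i + 1, n):
            if all(abs(u - v) <= tol for u, v in zip(images[i], images[j])):
                union(parent, i, j)
    classes = {}
    for i in range(n):
        classes.setdefault(find(parent, i), []).append(i)
    return list(classes.values())
```

For instance, under f(p) = (|p[0]|,) the points 1.0 and -1.0 fall into one class, mimicking two polyhedral pieces whose images overlap.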

URL: https://openreview.net/forum?id=RluspxztzS

---

Title: Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Abstract: Hate speech in online videos poses an increasingly serious threat to digital platforms, especially as video content becomes more multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.

URL: https://openreview.net/forum?id=U9KnNiuMu1

---

Title: Scalable Ensemble Federated Learning with Enhanced Open-Set Recognition

Abstract: Consensus-driven parameter averaging constitutes the dominant paradigm in federated learning. Although many methods incorporate auxiliary mechanisms or refinements, repeated round averaging remains their fundamental backbone. This paradigm inherently depends on repeated rounds of client–server communication to maintain consensus. The reliance on repeated communication is further amplified in regimes with high data heterogeneity and large client populations, as shown across numerous studies. This behavior arises from optimization drift in out-of-distribution settings, where client objectives differ and multi-step local SGD updates increasingly diverge, making consensus difficult to maintain. We argue that an emerging alternative, ensemble with abstention, provides a more suitable framework for addressing these issues. Rather than enforcing consensus across diverging client objectives, this approach constructs a specialized mixture-of-experts model by preserving client-specific models and selectively aggregating their predictions. As a one-shot FL method, it eliminates the need for repeated communication rounds altogether. Moreover, supported by both theoretical and empirical analysis, we show that this paradigm sidesteps cross-client drift and is inherently less sensitive to data heterogeneity. Despite these advantages, ensemble with abstention introduces two fundamental challenges. First, its performance depends on the design of the open-set recognition (OSR) task, which directly affects performance under heterogeneity. Second, and more critically, preserving client-specific models causes linear growth in model size with the number of clients, limiting scalability. As a step toward addressing these limitations, we introduce FedSOV, which incorporates improved negative sample generation to prevent shortcut cues in the OSR task and employs pruning to address the scalability problem. 
We show that pruning provides a practical and effective solution to the scalability problem while simultaneously enhancing generalization, yielding higher test accuracy. Across datasets, our method achieves an average gain of $18.81\%$ over the ensemble baseline FedOV in extreme label-skew settings and up to $92.43\%$ over FedGF, the best-performing parameter-averaging method. Code is available at: https://anonymous.4open.science/r/FedSOV-C7EF/
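The ensemble-with-abstention prediction rule can be sketched in a few lines. This is a generic stand-in of ours, not FedSOV itself: each client model is assumed to return a label plus a confidence, and a fixed confidence threshold plays the role of the open-set recognition test.

```python
def ensemble_predict(models, x, threshold=0.5):
    # One-shot ensemble with abstention: each preserved client model returns
    # (label, confidence); a model abstains when its confidence is below
    # `threshold` (a crude proxy for the OSR test), and the rest vote.
    votes = {}
    for model in models:
        label, conf = model(x)
        if conf >= threshold:
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return None  # every model abstained: input looks out-of-distribution
    return max(votes, key=votes.get)
```

No parameter averaging happens here, which is why a single communication round suffices; the cost is that all client models must be kept, motivating the pruning step above.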

URL: https://openreview.net/forum?id=QnnCYOfuUI

---

Title: High precision PINNs in unbounded domains: application to singularity formulation in PDEs

Abstract: We investigate the high-precision training of Physics-Informed Neural Networks (PINNs) in unbounded domains, with a special focus on applications to singularity formulation in PDEs. We propose a modularized approach and study the choices of neural network ansatz, sampling strategy, and optimization algorithm. When combined with rigorous computer-assisted proofs and PDE analysis, the numerical solutions identified by PINNs, provided they are of high precision, can serve as a powerful tool for studying singularities in PDEs. For the 1D Burgers equation, our framework can lead to a solution with very high precision, and for the 2D Boussinesq equation, which is directly related to the singularity formulation in 3D Euler and Navier-Stokes equations, we obtain a solution whose loss is 4 digits smaller than that obtained in \cite{wang2023asymptotic} with fewer training steps. We also discuss potential directions for pushing towards machine precision for higher-dimensional problems.

URL: https://openreview.net/forum?id=sF3iEJMVVQ

---

Title: Let data talk: data-regularized operator learning theory for inverse problems

Abstract: Regularization plays a critical role in incorporating prior information into inverse problems. While numerous deep learning methods have been proposed to tackle inverse problems, the strategic placement of regularization remains a crucial consideration.
In this article, we introduce an innovative approach known as the ``data-regularized operator learning'' (DaROL) method, specifically designed to address the regularization of inverse problems.
In comparison to typical methods that impose regularization through the training of neural networks, the DaROL method trains a neural network on data that are regularized through well-established techniques, including Lasso regularization and Bayesian inference.
Our DaROL method offers flexibility across various frameworks, and features a simplified structure that clearly delineates the processes of regularization and neural network training. In addition, we demonstrate that training a neural network on regularized data is equivalent to supervised learning for a regularized inverse mapping. Furthermore, we provide sufficient conditions for the smoothness of such a regularized inverse mapping and estimate the learning error with regard to neural network size and the number of training samples.
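The core idea, regularize the data rather than the network, can be shown in a one-dimensional toy of ours (not the paper's setting). For a closed form we swap in Tikhonov (ridge) regularization in place of Lasso: for the scalar forward map y = a*x, the regularized inverse is a*y / (a^2 + lam), and fitting a linear learner on the regularized pairs recovers exactly that inverse mapping.

```python
import random

def tikhonov_inverse(y, a, lam):
    # Regularized inverse of the scalar forward map y = a * x:
    # argmin_x (a*x - y)**2 + lam * x**2 has closed form a*y / (a**2 + lam).
    return a * y / (a * a + lam)

def fit_linear(pairs):
    # Zero-intercept least squares x_hat = w * y: "supervised learning of the
    # regularized inverse mapping" in this one-dimensional toy.
    num = sum(y * x for y, x in pairs)
    den = sum(y * y for y, _ in pairs)
    return num / den

def darol_toy(a=2.0, lam=0.5, n=1000, seed=0):
    rng = random.Random(seed)
    ys = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Regularize the *data* (targets), then train the learner on (y, x_reg).
    pairs = [(y, tikhonov_inverse(y, a, lam)) for y in ys]
    return fit_linear(pairs)
```

The fitted slope equals a / (a^2 + lam) exactly, illustrating the equivalence claimed above between training on regularized data and learning the regularized inverse map.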

URL: https://openreview.net/forum?id=D7iXTzFhAj

---

Title: When Glass Disappears at Night: A Novel NIR-RGB Multimodal Solution

Abstract: Glass surface detection (GSD) has recently been attracting research interest. However, existing GSD methods focus on modeling glass surface properties for daytime scenes only, and can easily fail in nighttime scenes due to significant lighting discrepancies. We observe that, due to the spectral differences between Near-Infrared (NIR) light sources and common LED lights, NIR and RGB cameras capture complementary visual patterns (e.g., light reflections, shadows, and edges) of glass surfaces, and cross-comparing their lighting and reflectance properties can provide reliable cues for nighttime GSD. Inspired by this observation, we propose a novel approach for nighttime GSD based on multi-modal NIR and RGB image pairs. We first construct a nighttime GSD dataset, which contains $6,192$ RGB-NIR image pairs captured in diverse real-world nighttime scenes, with corresponding carefully-annotated glass surface masks. We then propose a novel network for the nighttime GSD task with two novel modules: (1) an RGB-NIR Guidance Enhancement (RNGE) module for extracting and enriching the NIR reflectance features with the guidance of RGB reflectance features, and (2) an RGB-NIR Fusion and Localization (RNFL) module for fusing RGB and NIR reflectance features into glass features conditioned on the multi-modal illumination discrepancy-aware features. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in nighttime scenes while generalizing well to daytime scenes. We will release our dataset and code.

URL: https://openreview.net/forum?id=hdh3vHsakv

---

Title: The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making

Abstract: While Large Language Models (LLMs) are widely documented to be sensitive to minor prompt perturbations and prone to sycophantic alignment, their robustness in consequential, rule-bound decision-making remains under-explored. We uncover a striking "Paradox of Robustness": despite their known lexical brittleness, instruction-tuned LLMs exhibit near-total invariance to emotional framing effects. Using a controlled perturbation framework across three high-stakes domains (healthcare, finance, and education), we find a negligible effect size (Cohen's h = 0.003) compared to the substantial biases observed in analogous human contexts (h in [0.3, 0.8])--approximately two orders of magnitude smaller. This invariance persists across eight models with diverse training paradigms, suggesting the mechanisms driving sycophancy and prompt sensitivity do not translate to failures in logical constraint satisfaction. While LLMs may be "brittle" to how a query is formatted, they are notably "stable" against why a decision should be biased. We release a benchmark (9 base scenarios x 18 condition variants = 162 unique prompts), code, and data to facilitate reproducible evaluation.
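For reference, the reported effect size is Cohen's h, the difference of arcsine-transformed proportions; a couple of lines suffice to compute it.

```python
import math

def cohens_h(p1, p2):
    # Cohen's h: effect size for two proportions, the statistic the abstract
    # reports (h = 0.003 for LLMs vs. h in [0.3, 0.8] for humans).
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))
```

For example, a decision rate shifting from 50% to 65% under emotional framing gives h of roughly 0.30, the low end of the human range quoted above.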

URL: https://openreview.net/forum?id=2XPD66IiQI

---

Title: Parameterized Adverse Lens Corruptions to Probe Model Robustness to Optical Tolerances

Abstract: Deep neural networks excel at image classification on benchmarks like ImageNet, yet they remain vulnerable to adverse conditions, including environmental changes and sensor noise, such as lens blur or camera noise. Consequently, the study of these adverse noise corruptions has been extensive. At the same time, image blur, naturally introduced in optical systems, has been widely ignored as a threat to model robustness. In fact, Gaussian blur has even been considered a viable defense against adversarial attacks. In this work, we challenge the common perception of blur as a rather benign data corruption and study optics-driven, blur-based adversarial attacks. Specifically, we introduce Adverse Lens Corruption (ALC), an optical adversarial attack that identifies worst-case lens blurs, obtained by optimizing Zernike polynomial-based aberrations via gradient descent. Unlike traditional noise-based attacks, ALC provides a physically-grounded continuous search space. This enables the analysis of model robustness to optics-driven blur corruptions and complements existing noise and corruption benchmarks.

URL: https://openreview.net/forum?id=a93BmQRNxC

---

Title: When Active Learning Meets Graph Similarity: Evidential Variance for Graph Selection

Abstract: Graph Similarity Learning (GSL) is pivotal in graph data mining, yet training effective models necessitates substantial labeled pairs, which incur prohibitive annotation costs. To address this, we introduce Active Learning (AL) into the GSL paradigm. However, directly transferring existing AL strategies is non-trivial due to two unique impediments: (1) the continuous regression nature of similarity prediction complicates standard uncertainty quantification, and (2) the paired-input structure requires evaluating a graph's informational value across its pairings rather than in isolation. To bridge this gap, we propose EVGS (Evidential Variance for Graph Selection), a novel AL framework tailored for GSL. EVGS leverages evidential deep learning to impose a prior over predictions, enabling disentangled uncertainty estimation. Crucially, we identify a ``gradient shrinkage'' pathology inherent to the data-scarce regime characteristic of AL cycles. We introduce a novel MSE-anchored regularizer to mitigate this issue, ensuring discriminative uncertainty estimation even with limited labels. Furthermore, to address the paired-input challenge, we propose a graph-centric selection criterion: uncertainty variance. This metric captures a graph's holistic informational value by measuring fluctuations in its epistemic uncertainty across diverse interactions. Extensive experiments on three benchmarks with two GSL backbones demonstrate that EVGS consistently outperforms established AL baselines.
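The disentangled-uncertainty step can be sketched with the standard deep-evidential-regression formulas (assumed here; the paper's exact parameterization may differ): an NIG head outputs (gamma, nu, alpha, beta), the epistemic variance is beta / (nu * (alpha - 1)), and the proposed graph-centric score is the variance of that quantity across a graph's pairings.

```python
def epistemic(nu, alpha, beta):
    # Epistemic variance under a Normal-Inverse-Gamma evidential head
    # (standard evidential regression formula; requires alpha > 1).
    return beta / (nu * (alpha - 1.0))

def uncertainty_variance(pairings):
    # Graph-centric selection score: variance of one graph's epistemic
    # uncertainty across its pairings, each summarized by (nu, alpha, beta).
    us = [epistemic(nu, a, b) for nu, a, b in pairings]
    mean = sum(us) / len(us)
    return sum((u - mean) ** 2 for u in us) / len(us)
```

A graph whose epistemic uncertainty fluctuates strongly across pairings gets a high score and is prioritized for labeling; a graph that is uniformly certain (or uniformly uncertain) scores near zero.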

URL: https://openreview.net/forum?id=dV6UopxOjX

---

Title: Efficient DAG Learning via Modular Subgraph Integration

Abstract: Learning causal structures from observational data remains a fundamental yet computationally intensive task, particularly in high-dimensional settings where existing methods face challenges such as the super-exponential growth of the search space and increasing computational demands. To address this, we introduce VISTA (Voting-based Integration of Subgraph Topologies for Acyclicity), a modular framework that decomposes the global causal structure learning problem into local subgraphs based on Markov Blankets. The global integration is achieved through a weighted voting mechanism that penalizes low-support edges via exponential decay, filters unreliable ones with an adaptive threshold, and ensures acyclicity using a Feedback Arc Set (FAS) algorithm. The framework is model-agnostic, imposing no assumptions on the inductive biases of base learners, is compatible with arbitrary data settings without requiring specific structural forms, and fully supports parallelization. We also theoretically establish finite-sample error bounds for VISTA, and prove its asymptotic consistency under mild conditions. Extensive experiments on both synthetic and real datasets consistently demonstrate the effectiveness of VISTA, yielding notable improvements in both accuracy and efficiency over a wide range of base learners.
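The integration stage can be caricatured as follows. This sketch is ours and simplifies VISTA: edge support counts are mapped to weights with an exponential decay, a fixed threshold stands in for the adaptive one, and cycles are broken greedily by dropping each cycle's weakest edge rather than by the paper's FAS algorithm.

```python
import math

def find_cycle(nodes, edges):
    # DFS cycle detection: returns the edge list of one directed cycle, else None.
    adj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)
    color = {u: 0 for u in nodes}  # 0 = unvisited, 1 = on stack, 2 = done

    def dfs(u, stack):
        color[u] = 1
        stack.append(u)
        for v in adj[u]:
            if color[v] == 1:  # back edge closes a cycle
                cyc = stack[stack.index(v):] + [v]
                return list(zip(cyc, cyc[1:]))
            if color[v] == 0:
                found = dfs(v, stack)
                if found:
                    return found
        color[u] = 2
        stack.pop()
        return None

    for u in nodes:
        if color[u] == 0:
            found = dfs(u, [])
            if found:
                return found
    return None

def integrate_subgraphs(edge_support, n_runs, decay=2.0, tau=0.3):
    # Support counts -> weights with exponential decay penalizing low support,
    # then threshold, then greedy cycle breaking (a crude FAS heuristic).
    weights = {e: math.exp(-decay * (1.0 - c / n_runs))
               for e, c in edge_support.items()}
    edges = {e: w for e, w in weights.items() if w >= tau}
    nodes = {u for e in edge_support for u in e}
    while True:
        cyc = find_cycle(nodes, list(edges))
        if cyc is None:
            return set(edges)
        del edges[min(cyc, key=lambda e: edges[e])]
```

Edges voted for in most subgraph runs survive the threshold, and any remaining directed cycle loses its least-supported edge, so the output is a DAG.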

URL: https://openreview.net/forum?id=D5hmL01dIG

---

Title: CAPTAIN: Conformal-Prediction-Based Multi-Source Time-Series Forecasting

Abstract: Uncertainty quantification is critical for real-world forecasting applications such as predictive maintenance, patient health monitoring, and environmental sensing, where decisions must account for confidence levels. Multi-source time-series forecasting introduces additional complexity due to inter-source interactions and temporal dependencies, which existing methods struggle to capture within a unified probabilistic framework, and most previous approaches also lack theoretical guarantees, leading to miscalibrated uncertainty estimates. We propose CAPTAIN (Conformal Prediction based multi-source Time-series forecasting), a two-stage framework that first employs Normal Inverse Gamma (NIG) distributions to model source-specific uncertainties and integrates a meta-source to capture inter-source interactions, then uses temporal copulas to model the evolution of joint uncertainties over time, ensuring robust and theoretically valid uncertainty coverage. Experiments on five diverse datasets (Synthetic, Shaoxing ECG, Air Quality, NGSIM Traffic, and ETTh1) demonstrate that CAPTAIN achieves valid coverage (>=90%) across all five benchmarks while the other baselines achieve it on 4 or fewer, confirming it improves on existing state-of-the-art baselines for multi-source uncertainty quantification.
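For context, the simplest conformal construction (generic split conformal, not the paper's NIG-plus-copula pipeline) computes a residual quantile on a held-out calibration set and widens any point forecast by that radius:

```python
import math

def split_conformal_radius(cal_residuals, alpha=0.1):
    # Split conformal prediction: the ceil((n+1)(1-alpha))-th smallest
    # absolute calibration residual gives a radius q such that
    # [y_hat - q, y_hat + q] covers with probability >= 1 - alpha.
    rs = sorted(abs(r) for r in cal_residuals)
    n = len(rs)
    k = min(n - 1, math.ceil((n + 1) * (1.0 - alpha)) - 1)
    return rs[k]
```

The finite-sample validity of this quantile step is the kind of guarantee the abstract refers to; CAPTAIN's contribution is extending such coverage to multiple interacting sources over time.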

URL: https://openreview.net/forum?id=WJjlXHo4yS

---

Title: Gradient Tree Boosting for Regression Transfer

Abstract: Many real-world modeling problems are hindered by limited data availability. In such cases, *transfer learning* leverages related source domains to improve predictions in a target domain of interest. We extend the classical gradient tree boosting paradigm to a regression transfer algorithm by modeling the weak learner as a sum of two regression trees. The trees are fitted on source data and target data, respectively, and jointly optimized for the target data. We derive optimal coefficients for the model update under the least-squares, the least-absolute-deviation, and the Huber loss functions. We benchmark our approach against the widely used XGBoost algorithm in several transfer scenarios, achieving superior performance in seven out of eight cases.
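A minimal version of the weak learner can be sketched with depth-1 trees (stumps) on 1-D inputs. This is our simplification: the paper derives optimal coefficients for least-squares, least-absolute-deviation, and Huber losses, while the sketch only does a least-squares line search on the target residuals.

```python
def fit_stump(X, y):
    # Depth-1 regression tree (stump) on 1-D inputs, chosen by squared error.
    best = None
    for t in sorted(set(X)):
        left = [yi for xi, yi in zip(X, y) if xi <= t]
        right = [yi for xi, yi in zip(X, y) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def boost_transfer(Xs, ys, Xt, yt, rounds=10, lr=0.5):
    # Weak learner = source-fitted stump + target-fitted stump; its coefficient
    # comes from a least-squares line search on the *target* residuals.
    pred_s = [0.0] * len(Xs)
    pred_t = [0.0] * len(Xt)
    model = []
    for _ in range(rounds):
        rs = [y - p for y, p in zip(ys, pred_s)]
        rt = [y - p for y, p in zip(yt, pred_t)]
        h_s, h_t = fit_stump(Xs, rs), fit_stump(Xt, rt)
        h = lambda x, h_s=h_s, h_t=h_t: h_s(x) + h_t(x)
        num = sum(r * h(x) for x, r in zip(Xt, rt))
        den = sum(h(x) ** 2 for x in Xt) or 1.0
        c = lr * num / den
        model.append((c, h))
        pred_s = [p + c * h(x) for p, x in zip(pred_s, Xs)]
        pred_t = [p + c * h(x) for p, x in zip(pred_t, Xt)]
    return lambda x: sum(c * h(x) for c, h in model)
```

Because the coefficient is optimized on target residuals, the source stump only contributes to the extent that it helps the target, which is the mechanism that makes the transfer safe when source and target differ.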

URL: https://openreview.net/forum?id=b29TPa8NPT

---

Title: An Efficient Framework for Length Extension via Dynamically Growing Positional Embedding and Routing Attention

Abstract: Modeling long sequences is critical for numerous large-scale models. However, extending existing architectures to handle significantly longer sequences poses substantial technical and computational challenges. One inevitable issue is the overfitting of large models to positional encodings during pretraining, which limits their ability to generalize to unseen positional encoding scales. Additionally, extending sequence lengths requires extensive computational resources and time. Existing positional encoding methods often rely on carefully designed scaling factors but typically yield suboptimal results. To tackle these challenges, we propose Cyclic, Randomly Truncated, and Dynamically Growing NTK Positional Embedding (CRG NTK), a data-augmentation-based technique that fully explores the RoPE encoding space, enabling models to adapt to various positional scales and achieve state-of-the-art extrapolation for the extension of lengths dominated by position encoding. Furthermore, we introduce an efficient attention mechanism with a correlation-based routing strategy to enhance the fitting of the augmented positional encoding, yielding superior performance and more efficient fine-tuning. With our approach, LLaMA-7B and Mistral-7B fine-tuned at 16K context length achieve extrapolation factors of at least 128$\times$ on simple tasks, maintain stable perplexity over 32$\times$ sequence length extensions, and save at least 16 times the GPU training resources compared to the best existing method. Experiments also show that correlation routing can achieve good performance by further filtering out large amounts of noise in long sequences.

URL: https://openreview.net/forum?id=qLNYDuNYKZ

---

Title: PLA: A Principled Path from Softmax Attention to Linear Models via KV Cache Compression

Abstract: Transformers, despite their remarkable sequence modeling capabilities, are fundamentally constrained by the quadratic complexity of Softmax attention and the unbounded growth of the key–value (KV) cache. Replacing Softmax attention with linear variants has emerged as a promising direction, yet existing approaches lack a systematic functional comparison with Softmax attention, clear error analysis, and a theoretically guided roadmap for improvement.
In this work, we approach the problem from the perspective of KV cache compression and present a theoretically grounded pathway from Softmax attention to linear models.
Our analysis reveals five critical components: redundancy elimination, tokenizer-level quantization and positional information separation, positional information compression, inter-layer similarity, and multi-state decomposition. For each, we provide succinct theoretical justification, derive error bounds, and demonstrate equivalence to existing mechanisms. Building on this pathway, we introduce PLA, a linearized attention model that inherits pretrained weights and achieves state-of-the-art performance. Notably, PLA surpasses strong baselines such as MVA and GSA on multiple benchmarks while requiring only 80\% of the fine-tuning resources. Our findings provide both theoretical clarity and practical guidance for advancing linear attention, highlighting a principled route towards efficient and scalable alternatives to Softmax attention.

URL: https://openreview.net/forum?id=ohkS8NffLp

---

Title: MSTN: A Lightweight and Fast Model for General Time-Series Analysis

Abstract: Real-world time series often exhibit strong non-stationarity, complex nonlinear dynamics, and behavior expressed across multiple temporal scales, from rapid local fluctuations to slow-evolving long-range trends. However, many contemporary architectures impose rigid, fixed-scale structural priors---such as patch-based tokenization, predefined receptive fields, or frozen backbone encoders---which can over-regularize temporal dynamics and limit adaptability to abrupt high-magnitude events. To address this, we introduce the Multi-scale Temporal Network (MSTN), a hybrid neural architecture grounded in an Early Temporal Aggregation principle. MSTN integrates three complementary components: (i) a multi-scale convolutional encoder that captures fine-grained local structure; (ii) a sequence modeling module that learns long-range dependencies through either recurrent or attention-based mechanisms; and (iii) a self-gated fusion stage incorporating squeeze--excitation and multi-head attention to dynamically modulate cross-scale representations. This design enables MSTN to flexibly model temporal patterns spanning milliseconds to extended horizons, while avoiding the computational burden typically associated with long-context models. Across extensive benchmarks covering forecasting, imputation, classification, and cross-dataset generalization, MSTN achieves state-of-the-art performance, establishing new best results on 24 of 32 datasets, while remaining lightweight (≈ 1M params) and suitable for low-latency (<1 sec, often in milliseconds), resource-constrained deployment.
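The multi-scale encoder idea in component (i) can be sketched with fixed filters: parallel 1D convolutions at several receptive fields, stacked channel-wise. MSTN learns its filters; the moving-average kernels below are a simplification for illustration only.

```python
import numpy as np

def multiscale_encode(x, kernel_sizes=(3, 7, 15)):
    # Parallel convolutions at several receptive fields; wider kernels
    # capture slower trends, narrow ones keep fast local fluctuations.
    feats = []
    for k in kernel_sizes:
        padded = np.pad(x, (k // 2, k - 1 - k // 2), mode="edge")
        kernel = np.ones(k) / k          # fixed smoothing filter (assumed)
        feats.append(np.convolve(padded, kernel, mode="valid"))
    return np.stack(feats)               # shape: (num_scales, len(x))

x = np.sin(np.linspace(0.0, 2 * np.pi, 50))   # one cycle, 50 samples
F = multiscale_encode(x)
```

A downstream fusion stage (MSTN uses self-gating with squeeze--excitation and attention) would then weight these per-scale channels adaptively.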

URL: https://openreview.net/forum?id=je2N2nnDry

---

Title: DeGLIF for Label Noise Robust Node Classification using GNNs

Abstract: Noisy labelled datasets are generally inexpensive compared to clean labelled datasets, and the same is true for graph data. In this paper, we propose a denoising technique, DeGLIF: Denoising Graph Data using the Leave-One-Out Influence Function. DeGLIF uses a small set of clean data and the leave-one-out influence function to make label-noise-robust node-level predictions on graph data. The leave-one-out influence function approximates the change in the model parameters if a training point is removed from the training dataset. Recent advances propose a way to calculate the leave-one-out influence function for Graph Neural Networks (GNNs). We extend that recent work to estimate the change in validation loss if a training node is removed from the training dataset. We use this estimate and a new theoretically motivated relabelling function to denoise the training dataset. We propose two DeGLIF variants to identify noisy nodes. Neither of these variants requires any information about the noise model or the noise level in the dataset; DeGLIF also does not estimate these quantities. For one of these variants, we prove that the noisy points detected can indeed increase risk. We carry out detailed computational experiments on different datasets to investigate the effectiveness of DeGLIF. It achieves better accuracy than other baseline algorithms.
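The validation-loss influence estimate described here follows the classic form $\nabla L_{\text{val}}^\top H^{-1} \nabla L_i$. The sketch below illustrates it on a ridge-regression stand-in with an explicit Hessian, not on GNNs as in the paper; all data and shapes are fabricated for the example.

```python
import numpy as np

def loo_influence_on_val_loss(X, y, Xv, yv, lam=1e-2):
    # Estimated effect of removing train point i on the validation loss,
    # ~ grad_val^T H^{-1} grad_i, for a ridge-regression model (a sketch
    # of the kind of estimate DeGLIF builds on, not the authors' GNN code).
    n, d = X.shape
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # fitted params
    H = 2.0 * (X.T @ X) / n + 2.0 * lam * np.eye(d)           # loss Hessian
    g_val = 2.0 * Xv.T @ (Xv @ w - yv) / len(yv)              # val-loss grad
    Hinv_gval = np.linalg.solve(H, g_val)
    return np.array([Hinv_gval @ (2.0 * X[i] * (X[i] @ w - y[i]))
                     for i in range(n)])

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
y[7] += 10.0                                   # one badly mislabeled point
Xv = rng.normal(size=(20, d))
yv = Xv @ w_true                               # clean validation set
infl = loo_influence_on_val_loss(X, y, Xv, yv)
```

Points with large-magnitude influence are candidates for relabelling; in this toy setup the mislabeled point stands out clearly.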

URL: https://openreview.net/forum?id=pcs5DmBtUJ

---

Title: Protein structural superfamily classification using hand-crafted and language model features: A performance vs interpretability trade-off

Abstract: The CATH database categorizes more than 600,000 protein domain structures into superfamilies based on a hierarchy of structural similarity notions. Members of a single superfamily may share less than 35% sequence similarity. The scale of such data motivates the use of machine learning methods that can accurately predict the CATH superfamily of a protein domain and, at the same time, are interpretable, i.e., provide insights into the characteristic features of a superfamily. The recent rise of protein language models (PLMs), which leverage large-scale data and compute, has introduced an interesting conflict: a trade-off between the high predictive performance of non-interpretable features and the scientific insight that can be gained from interpretable, hand-crafted ones. In this work, we highlight and study this conflict via the task of classifying protein domains into their CATH superfamilies. We train one-vs-all (OvA) linear SVM classifiers for 45 diverse CATH superfamilies, each characterised by significant class imbalance. We address the class imbalance by using a class-balanced loss function and the arithmetic mean (AM) of specificity and sensitivity for evaluation. Our analysis compares nine feature vector types, which are either non-interpretable embeddings from PLMs or interpretable hand-crafted features. The latter include amino acid composition (AAC), di- and tri-peptide composition (DPC, TPC), and novel sequence-order (2OAAC, 3OAAC) and structure-based features (OCPC, CSIC). Our results demonstrate that PLM-based features achieve superior test AM scores of 90-99% with low variability, outperforming hand-crafted features by 20-30%. While PLM features yield high classification accuracy, their lack of interpretability obscures the underlying biological determinants. Conversely, the interpretability of hand-crafted features, despite their relatively low performance, can be leveraged to infer sequence and structural characteristics of CATH superfamilies.
We illustrate this for two superfamilies. First, we rank the components of hand-crafted features using a known method, marginal contribution feature importance (MCI). Then, based on the interpretability of the top-ranked hand-crafted feature components, we derive biological insights, such as characteristic contacts of superfamily structures. The proposed hand-crafted CSIC feature strikes a balance between predictive performance and interpretability, as it overfits less while providing rich structural information about contact sequence separation. This can be valuable for downstream applications, such as investigating protein-related diseases and guiding rational protein design.
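The imbalance-handling recipe described above, a class-balanced loss plus the arithmetic mean (AM) of sensitivity and specificity, can be sketched directly. This is a generic class-balanced hinge objective and AM metric for illustration; the authors' exact weighting and SVM solver are not reproduced here.

```python
import numpy as np

def balanced_hinge_loss(w, X, y, C=1.0):
    # One-vs-all linear SVM objective with class-balanced weighting: each
    # class contributes equally to the loss regardless of its size
    # (a sketch of the described idea, not the paper's implementation).
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    w_pos = 0.5 / max(int((y == 1).sum()), 1)
    w_neg = 0.5 / max(int((y == -1).sum()), 1)
    sample_w = np.where(y == 1, w_pos, w_neg)
    return 0.5 * w @ w + C * (sample_w * hinge).sum()

def am_score(y_true, y_pred):
    # Arithmetic mean of sensitivity and specificity: 0.5 for a trivial
    # majority-class predictor, 1.0 for a perfect one.
    sens = ((y_pred == 1) & (y_true == 1)).sum() / max(int((y_true == 1).sum()), 1)
    spec = ((y_pred == -1) & (y_true == -1)).sum() / max(int((y_true == -1).sum()), 1)
    return 0.5 * (sens + spec)
```

Under heavy imbalance, plain accuracy rewards predicting the majority class; AM does not, which is why it suits OvA superfamily classification.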

URL: https://openreview.net/forum?id=huTeyYU0yD

---

Title: Watermarking Language Models with Error Correcting Codes

Abstract: Recent progress in large language models enables the creation of realistic machine-generated content. Watermarking is a promising approach to distinguish machine-generated text from human text, embedding statistical signals in the output that are ideally undetectable to humans. We propose a watermarking framework that encodes such signals through an error correcting code. Our method, termed robust binary code (RBC) watermark, introduces no noticeable degradation in quality. We evaluate our watermark on base and instruction fine-tuned models and find that our watermark is robust to edits, deletions, and translations. We provide an information-theoretic perspective on watermarking, a powerful statistical test for detection and for generating p-values, and theoretical guarantees. Our empirical findings suggest our watermark is fast, powerful, and robust, comparing favorably to the state-of-the-art.
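Why does an error correcting code make a watermark robust to edits? A toy sketch with a repetition code shows the mechanism: each payload bit is embedded redundantly across tokens, and decoding takes a majority vote. RBC itself uses a more sophisticated code and statistical test; this is only the underlying intuition.

```python
import numpy as np

def ecc_encode(bits, r=5):
    # Repetition code as a stand-in for the paper's error correcting code:
    # each payload bit is embedded r times across generated tokens.
    return np.repeat(bits, r)

def ecc_decode(noisy, r=5):
    # Majority vote per block recovers the payload despite scattered edits.
    return (noisy.reshape(-1, r).sum(axis=1) > r // 2).astype(int)

rng = np.random.default_rng(1)
payload = rng.integers(0, 2, size=16)
code = ecc_encode(payload)
noisy = code.copy()
for pos in (0, 7, 12):          # simulate token edits as isolated bit flips
    noisy[pos] ^= 1
decoded = ecc_decode(noisy)
```

As long as fewer than half the embedded positions within a block are corrupted, the payload survives, which is the robustness-to-edits property the abstract reports.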

URL: https://openreview.net/forum?id=H6oBZxNQk2

---

Title: Favourability of Loss Landscape with Weight Decay Requires Both Large Overparametrization and Initialization

Abstract: The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the $\ell_2$-regularized training loss for two-layer ReLU networks. We show that the landscape becomes favourable -- i.e., spurious local minima represent a negligible fraction of local minima -- under large overparametrization, specifically when the network width $m$ satisfies $m \gtrsim \min(n^d, 2^n)$, where $n$ is the number of data points and $d$ the input dimension. More precisely, in this regime almost all constant-activation regions contain a global minimum and no spurious local minima. We further show that this level of overparametrization is not only sufficient but also necessary, via the example of orthogonal data. Finally, we demonstrate that such loss landscape results are primarily relevant in the large-initialization regime. In contrast, for small initializations -- corresponding to the feature learning regime -- optimization can still converge to spurious local minima, despite the favourability of the landscape.

URL: https://openreview.net/forum?id=jbU0Tjjhfg

---

Title: Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning

Abstract: Self-supervised learning (SSL) has improved empirical performance by unleashing the power of unlabeled data for practical applications. Specifically, SSL extracts representations from massive unlabeled data, which are then transferred to a variety of downstream tasks with limited data. The significant improvements across diverse applications of representation learning have attracted increasing attention, resulting in a variety of dramatically different self-supervised learning objectives for representation extraction, with an assortment of learning procedures, but without a clear and unified understanding. This absence hampers the ongoing development of representation learning, leaving theoretical understanding missing, principles for efficient algorithm design unclear, and the use of representation learning methods in practice unjustified. The urgency for a unified framework is further motivated by the rapid growth in representation learning methods. In this paper, we therefore develop a principled foundation for representation learning. We first theoretically investigate the sufficiency of the representation from a spectral-representation view, which reveals the spectral essence of existing successful SSL algorithms and paves the path to a unified framework for understanding and analysis. Such a framework also inspires the development of more efficient and easy-to-use representation learning algorithms, applied in a principled way to real-world applications.

URL: https://openreview.net/forum?id=C82ZSnEC1z

---

Title: Hierarchy-Aware Multimodal Unlearning for Medical AI

Abstract: Pretrained Multimodal Large Language Models (MLLMs) are increasingly used in sensitive domains such as medical AI, where privacy regulations like HIPAA and GDPR require the removal of specific individuals’ or institutions’ data. This motivates machine unlearning, which aims to remove the influence of target data from a trained model. However, existing unlearning benchmarks fail to reflect the hierarchical and multimodal structure of real-world medical data, limiting their ability to properly evaluate unlearning in practice.
Therefore, we introduce MedForget, a hierarchy-aware multimodal unlearning benchmark that models hospital data as a nested structure, enabling fine-grained evaluation of multimodal unlearning across retain and forget splits. Experiments show that current unlearning methods struggle to achieve effective hierarchy-aware forgetting without degrading downstream medical utility, measured by performance on clinically relevant prediction tasks. To address this limitation, we propose Cross-modal Hierarchy-Informed Projection for unlearning (CHIP), a training-free, hierarchy-aware multimodal unlearning method that deletes information by selectively removing target-specific weight subspaces while preserving sibling-shared information. Our results show that CHIP achieves the highest forget-retain performance gap across all hierarchy levels while maintaining competitive downstream utility compared to existing methods.
Overall, MedForget provides a practical, HIPAA-aligned benchmark for evaluating structured multimodal unlearning for medical data, and CHIP offers an effective and general solution for hierarchy-aware forgetting that balances deletion with utility.
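CHIP's core move, training-free removal of target-specific weight subspaces while sparing shared directions, can be sketched with linear algebra. The example below is our reading of the idea on a single weight matrix, using SVD to find forget-set directions orthogonal to the retain span; it is not the authors' algorithm, and all shapes and data are fabricated.

```python
import numpy as np

def remove_target_subspace(W, target_acts, retain_acts, k=1):
    # Orthogonalise the forget-set activation directions against the span
    # of retain-set activations, then project the top-k residual directions
    # out of W. Rows of *_acts are samples, columns are input features.
    Ur, _, _ = np.linalg.svd(retain_acts.T, full_matrices=False)
    resid = target_acts.T - Ur @ (Ur.T @ target_acts.T)
    Ut, _, _ = np.linalg.svd(resid, full_matrices=False)
    P = np.eye(W.shape[1]) - Ut[:, :k] @ Ut[:, :k].T
    return W @ P

d = 6
retain = np.eye(d)[:2]            # retain activations span e0, e1
target = np.eye(d)[3:4]           # target-specific direction e3
rng = np.random.default_rng(0)
W = rng.normal(size=(4, d))
Wp = remove_target_subspace(W, target, retain, k=1)
```

After projection the layer is blind to the target-specific direction but behaves identically on retain-shared directions, the deletion-vs-utility balance the abstract describes.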

URL: https://openreview.net/forum?id=TVSIhLqIkf

---

Title: Think2SQL: Blueprinting Reward Density and Advantage Scaling for Effective Text-to-SQL Reasoning

Abstract: While Large Language Models (LLMs) have advanced the state-of-the-art in Text-to-SQL, robust reasoning in complex, multi-table environments remains a bottleneck for parameter-efficient models. This paper presents a systematic empirical study on injecting reasoning capabilities into Text-to-SQL through the lens of Reinforcement Learning with Verifiable Rewards (RLVR). We uncover a critical interplay between reward density, advantage scaling, and model capacity. Our analysis yields four primary insights. First, we propose a novel execution-guided dense reward function that significantly outperforms binary signals and existing state-of-the-art rewards by providing granular feedback at the instance level. Second, we analyze the mechanics of advantage calculation, demonstrating that while large models thrive on sparse signals with aggressive advantage scaling, smaller models require dense rewards and conservative scaling to improve Text-to-SQL performance. Third, we evaluate the impact of cold start, showing that distillation does not always benefit RLVR performance, and supervised fine-tuned models are prone to distributional mimicry. Fourth, we map the Pareto frontier of training efficiency, providing insights for optimizing Text-to-SQL reasoning under computational constraints. Our findings culminate in the Think2SQL family: our 4B-parameter model demonstrates reasoning capabilities competitive with state-of-the-art models such as o3. We release our models, datasets, and code to create a blueprint for RLVR optimization in Text-to-SQL at https://anonymous.4open.science/r/Think2SQL-3B7F.
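The contrast between a binary execution reward and a dense, execution-guided one is easy to make concrete with sqlite. The reward below is a hypothetical stand-in (Jaccard overlap of result rows); the paper's reward function is its own design, and the schema and queries here are fabricated.

```python
import sqlite3

def execution_reward(pred_sql, gold_sql, conn, dense=True):
    # Dense variant: partial credit for row overlap between predicted and
    # gold query results, instead of an all-or-nothing binary signal.
    try:
        pred = set(map(tuple, conn.execute(pred_sql).fetchall()))
    except sqlite3.Error:
        return 0.0                       # unexecutable SQL gets no reward
    gold = set(map(tuple, conn.execute(gold_sql).fetchall()))
    if not dense:
        return 1.0 if pred == gold else 0.0
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)   # Jaccard overlap

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
conn.executemany("INSERT INTO emp VALUES (?,?,?)",
                 [("a", "x", 10), ("b", "x", 20), ("c", "y", 30)])
gold = "SELECT name FROM emp WHERE salary >= 20"
near = "SELECT name FROM emp WHERE salary > 20"   # off-by-one: misses 'b'
```

A near-miss prediction earns 0.5 under the dense reward but 0.0 under the binary one, which is exactly the granular instance-level feedback the abstract argues smaller models need.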

URL: https://openreview.net/forum?id=NxU1KWnpOG

---

Title: M3Ret: Unleashing Zero-shot Multi-Modal Medical Image Retrieval via Self-Supervision

Abstract: Medical image retrieval is essential for clinical decision-making and translational research, as it depends on discriminative visual representations. Yet current methods remain fragmented, relying on separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations.
To enable unified learning, we curate a large-scale hybrid-modality dataset comprising 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms.
Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks, despite never observing MRI during pretraining, demonstrating the generalizability of purely visual self-supervision to unseen modalities.
Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.
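The zero-shot image-to-image retrieval protocol used for evaluation amounts to ranking gallery embeddings by cosine similarity to a query embedding. The sketch below assumes some encoder (e.g. M3Ret) has already produced the embeddings; the vectors here are synthetic.

```python
import numpy as np

def zero_shot_retrieve(query_emb, gallery_embs, k=3):
    # Rank gallery items by cosine similarity to the query; no task-specific
    # training is involved, hence "zero-shot" retrieval.
    q = query_emb / np.linalg.norm(query_emb)
    G = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(G @ q))[:k]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 32))                     # synthetic embeddings
query = 2.0 * gallery[42] + 0.01 * rng.normal(size=32)   # near-duplicate of #42
top = zero_shot_retrieve(query, gallery)
```

Because cosine similarity ignores scale, the near-duplicate is retrieved first; recall@k over such rankings is the standard metric for this setting.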

URL: https://openreview.net/forum?id=VgmIrgbzkX

---

Title: Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action

Abstract: The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks.
Their performance on certain tasks can be further enhanced by incorporating test-time reasoning techniques.
These inference-time advances have been adopted into the code domain, enabling complex software engineering (SWE) tasks such as code generation, test generation and issue resolution. However, the impact of different reasoning techniques on code-centric SWE tasks has not been systematically explored. In this work, we survey code reasoning techniques that underpin these capabilities, with a focus on test-time compute and inference-time reasoning paradigms.

We examine a variety of code-specific reasoning methods and progressively build up to SWE agents, which combine planning, tool use, and multi-step interaction. We also compare the impact of different techniques on coding tasks, highlighting their relative importance and outlining open challenges and future research directions. Across commonly used models and benchmarks, we find that approaches exploiting code-specific signals (e.g., structure and execution feedback) are frequently associated with improved performance, motivating a dedicated study of code reasoning beyond natural-language reasoning.

Our contributions are: (1) to the best of our knowledge, the first dedicated survey of code reasoning for SWE tasks, highlighting overarching reasoning strategies, hybrid methods, and agentic approaches; (2) a taxonomy of inference-time techniques used to drive code reasoning, accompanied by a curated set of under-explored benchmarks with high potential for SWE evaluation; (3) a comparative analysis of reasoning design patterns across commonly used models and benchmarks; and (4) a synthesis of gaps in current methods and evaluation practices, identifying under-explored areas and concrete opportunities for future research.

URL: https://openreview.net/forum?id=zZa3u6LKwO

---

Title: Leveraging Vision-Language Models for Resource Constrained Settings

Abstract: Vision-language models (VLMs) such as CLIP have emerged as extremely strong zero-shot and few-shot image classifiers.
However, these models are often too expensive or cumbersome for resource constrained downstream applications.
In this work, we examine how to best leverage the strength of pretrained VLMs: by extracting $\textit{task-specific}$ information in order to obtain a small model that can be deployed in a very specific and low-resource setting.
We present the SIDCLIP method, a novel training pipeline which drastically improves the performance of small, efficient models, such as EfficientNet B0.
The pipeline includes three components that are critical to obtaining strong performance: 1) augmenting the classifier with $\textit{synthetic data}$ generated by leveraging CLIP itself; 2) $\textit{initializing}$ the modeling process using a smaller CLIP model pretrained on the target architecture; and 3) incorporating $\textit{knowledge distillation}$ to maximally mimic the performance of the larger model.
SIDCLIP improves the performance of an EfficientNet B0 model by an average of $50\%$ on 1-shot versions of four datasets and by an average of $26\%$ on the 8-shot versions, relative to directly trained networks, additionally approaching CLIP's linear probe performance while using a model with less than $2\%$ of the parameters of CLIP ViT-L/14's image encoder.
We hope our work can be useful as a practical guide for leveraging the power of foundation models in downstream data-scarce and budget constrained settings.
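Component 3 of the pipeline, knowledge distillation, is typically the standard Hinton-style objective: a temperature-softened cross-entropy against the teacher mixed with the hard-label loss. The sketch below shows that generic form; the exact loss SIDCLIP uses may differ.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(s_logits, t_logits, labels, T=2.0, alpha=0.5):
    # Soft-target cross-entropy at temperature T (scaled by T^2 so its
    # gradient magnitude matches the hard loss) mixed with ordinary CE.
    p_t = softmax(t_logits, T)
    log_p_s = np.log(softmax(s_logits, T))
    kd = -(p_t * log_p_s).sum(axis=-1).mean() * T * T
    ce = -np.log(softmax(s_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * kd + (1 - alpha) * ce
```

A student whose logits match the teacher's scores strictly lower loss than an uninformative one, which is the signal that lets a small EfficientNet mimic the larger CLIP model.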

URL: https://openreview.net/forum?id=cYOKSg60jC

---

Title: Towards Preventing Global Knowledge Forgetting in Federated Learning with Non-IID Data

Abstract: Federated learning under client-level data heterogeneity remains challenging despite extensive work on drift correction, regularization, and improved aggregation. In this paper, we argue that an important yet underexplored failure mode is catastrophic forgetting of the global decision boundary during local training: as clients optimize their local objectives, they rapidly overfit to client-specific data and erase globally useful multi-class structure, causing server aggregation to average incompatible models rather than accumulate progress. We provide empirical evidence for this phenomenon through a controlled pilot study that directly visualizes decision boundary evolution in federated learning. Our analysis reveals that standard FL methods consistently forget the global decision boundary after local updates, even when clients are initialized from a strong pretrained global model. Motivated by this observation, we propose FedProj, a federated learning framework designed to preserve global functional knowledge throughout local optimization. FedProj maintains a small public-memory buffer and enforces a hard gradient constraint that prevents local updates from increasing a memory-based distillation loss, thereby acting as a safety barrier against global knowledge erosion. At the server, we further employ ensemble distillation on the same public proxy data to consolidate the preserved knowledge into a single global model. We conduct extensive experiments across computer vision and natural language processing benchmarks, covering highly non-IID regimes and domain-shifted settings. The results show that FedProj consistently outperforms state-of-the-art federated learning methods, highlighting the practical importance of explicitly preventing global decision-boundary forgetting.
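The hard gradient constraint described in the abstract can be sketched in the style of gradient episodic memory: if a local gradient step would increase the memory-based distillation loss, project the gradient onto the constraint boundary. This is our reading of the mechanism, not the authors' code.

```python
import numpy as np

def project_update(g_local, g_mem):
    # g_mem is the gradient of the memory distillation loss. Stepping along
    # -g_local changes that loss by ~ -g_local @ g_mem, so the constraint
    # "do not increase it" is g_local @ g_mem >= 0; violating gradients are
    # projected onto the boundary (GEM-style sketch, our assumption).
    dot = g_local @ g_mem
    if dot >= 0:
        return g_local                    # constraint already satisfied
    return g_local - (dot / (g_mem @ g_mem)) * g_mem

g_mem = np.array([1.0, 0.0])
ok = project_update(np.array([0.5, -1.0]), g_mem)      # unchanged
fixed = project_update(np.array([-1.0, 1.0]), g_mem)   # projected
```

After projection the update is neutral with respect to the memory loss, acting as the "safety barrier" against erasing global knowledge during local training.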

URL: https://openreview.net/forum?id=lhTWPh3Tjm

---
