Weekly TMLR digest for Feb 22, 2026


TMLR

Feb 22, 2026, 12:00:12 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

J2C Certification: Probabilistic Pretraining for Improved Neural Regression

Boris N. Oreshkin, Shiv Kumar Tavker, Dmitry Efimov

https://openreview.net/forum?id=F6BTATGXaf

---


J2C Certification: CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar

https://openreview.net/forum?id=eG3Qy5Oux6

---


J2C Certification: A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

Xinjie Liu, Cyrus Neary, Kushagra Gupta, Wesley A. Suttle, Christian Ellis, Ufuk Topcu, David Fridovich-Keil

https://openreview.net/forum?id=zAo0L7Dcqt

---


J2C Certification: Single-loop Algorithms for Stochastic Non-Convex Optimization with Weakly-Convex Constraints

Ming Yang, Gang Li, Quanqi Hu, Qihang Lin, Tianbao Yang

https://openreview.net/forum?id=aCgOR2KvAI

---


Accepted papers
===============


Title: GenAI vs. Human Creators: Procurement Mechanism Design in Two-/Three-Layer Markets

Authors: Rui Ai, David Simchi-Levi, Haifeng Xu

Abstract: With the rapid advancement of generative AI (GenAI), mechanism design adapted to its unique characteristics poses new theoretical and practical challenges. Unlike traditional goods, content from one domain can enhance the training and performance of GenAI models in other domains. For example, OpenAI’s video generation model Sora (Liu et al., 2024b) relies heavily on image data to improve video generation quality. In this work, we study nonlinear procurement mechanism design under data transferability, where online platforms employ both human creators and GenAI to satisfy cross-domain content demand. We propose optimal mechanisms that maximize either platform revenue or social welfare and identify the specific properties of GenAI that make such high-dimensional design problems tractable. Our analysis further reveals which domains face stronger competitive pressure and which tend to experience overproduction. Moreover, the growing role of data intermediaries, including labeling companies such as Scale AI and creator organizations such as The Wall Street Journal, introduces a third layer into the traditional platform–creator structure. We show that this three-layer market can result in a lose-lose outcome, reducing both platform revenue and social welfare, as large pre-signed contracts distort creators’ incentives and lead to inefficiencies in the data market. These findings suggest a need for government regulation of the GenAI data ecosystem, and our theoretical insights are further supported by numerical simulations.

URL: https://openreview.net/forum?id=Eukf4TBHS7

---

Title: GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data

Authors: Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj Jha

Abstract: Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades, since a marriage of the neural and symbolic components can lead to rapid advancements in AI. Yet, the field has not realized this promise because most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of purely neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side. However, automatically deriving reliable KGs from text corpora has remained an open problem. We address the above challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. Together, GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. More concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When an off-the-shelf large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on the reliability front due to prompt sensitivity, shallow domain expertise, and hallucinated relations. Thus, practitioners should avoid employing LLM-generated KGs in high-stakes domains, e.g., medicine, law, business, and education. On text obtained from PubMed papers related to diabetes, our KG extraction pipeline with a small 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only a 40.2% FActScore.
The GraphMERT-extracted KG also achieves a significantly higher ValidityScore of 68.7%, compared to an LLM-generated baseline (43.0%), demonstrating its ability to preserve ontology alignment. KG cleaning further improves factuality, with GraphMERT reaching 76.9% FActScore, compared to 55.6% for the LLM baseline. GraphMERT can then treat the augmented KG as the seed KG and refine it further. Finally, human experts can edit and audit the extracted KGs, further increasing their reliability. This is nearly impossible with purely neural representations. Hence, GraphMERT enables efficient, scalable, transparent (interpretable and explainable), attributable (with provenance), accountable (with governance), editable, auditable, and continually improvable state-of-the-art neurosymbolic AI. The code is available at https://github.com/jha-lab/graphmert_umls

URL: https://openreview.net/forum?id=tnXSdDhvqc

---

Title: Sociodynamics of Reinforcement Learning

Authors: Yann Bouteiller, Karthik Soma, Giovanni Beltrame

Abstract: Reinforcement Learning (RL) has emerged as a core algorithmic paradigm explicitly driving innovation in a growing number of industrial applications, including large language models and quantitative finance. Furthermore, computational neuroscience has long found evidence of natural forms of RL in biological brains. Therefore, it is crucial for the study of social dynamics to develop a scientific understanding of how RL shapes population behaviors. We leverage the framework of Evolutionary Game Theory (EGT) to provide building blocks and insights toward this objective. We propose a methodology that enables simulating large populations of RL agents in simple game-theoretic interaction models. More specifically, we derive fast and parallelizable implementations of two fundamental revision protocols from multi-agent RL - Policy Gradient (PG) and Learning with Opponent-Learning Awareness (LOLA) - tailored for population simulations of random pairwise interactions in stateless normal-form games. Our methodology enables us to simulate large populations of 200,000 independent co-learning agents, yielding compelling insights into how non-stationarity-aware learners affect social dynamics.
In particular, we find that LOLA learners promote cooperation in the Stag Hunt model, delay cooperative outcomes in the Hawk-Dove model, and reduce strategy diversity in the Rock-Paper-Scissors model.

URL: https://openreview.net/forum?id=Ro6Ylnx8se

---

Title: Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

Authors: Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei

Abstract: Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, where human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLMs. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.

URL: https://openreview.net/forum?id=1jLQ629Yps

---

Title: The Clever Hans Mirage: A Comprehensive Survey on Spurious Correlations in Machine Learning

Authors: Wenqian Ye, Luyang Jiang, Eric Xie, Guangtao Zheng, Yunsheng Ma, Xu Cao, Dongliang Guo, Daiqing Qi, Zeyu He, Yijun Tian, Christopher W. Porter, Megan Coffee, Zhe Zeng, Sheng Li, Ziran Wang, Ting-Hao Kenneth Huang, James Matthew Rehg, Henry Kautz, Aidong Zhang

Abstract: Back in the early 20th century, a horse named Hans appeared to perform arithmetic and other intellectual tasks during exhibitions in Germany, while it actually relied solely on involuntary cues in the body language from the human trainer. Modern machine learning models are no different. These models are known to be sensitive to spurious correlations between non-essential features of the inputs (e.g., background, texture, and secondary objects) and the corresponding labels. Such features and their correlations with the labels are known as spurious because they tend to change with shifts in real-world data distributions, which can negatively impact the model's generalization and robustness. In this paper, we provide a comprehensive survey of this emerging issue, along with a fine-grained taxonomy of existing state-of-the-art methods for addressing spurious correlations in machine learning models. Additionally, we summarize existing datasets, benchmarks, and metrics to facilitate future research. The paper concludes with a discussion of the broader impacts, the recent advancements, and future challenges in the era of generative AI, aiming to provide valuable insights for researchers in the related domains of the machine learning community.

URL: https://openreview.net/forum?id=kIuqPmS1b1

---

Title: Learning Adaptive Multi-Stage Energy-based Prior for Hierarchical Generative Model

Authors: Jiali Cui, Tian Han

Abstract: Hierarchical generative models represent data with multiple layers of latent variables organized in a top-down structure. These models typically assume Gaussian priors for multi-layer latent variables, which lack expressivity for the contextual dependencies among latents, resulting in a distribution gap between the prior and the learned posterior. Recent works have explored hierarchical energy-based prior models (EBMs) as a more expressive alternative to bridge this gap. However, most approaches learn only a single EBM, which can be ineffective when the target distribution is highly multi-modal and multi-scale across hierarchical layers of latent variables. In this work, we propose a framework that learns multi-stage hierarchical EBM priors, where a sequence of adaptive stages progressively refines the prior to match the posterior. Our method supports both joint training with the generator and a more efficient two-phase strategy for deeper hierarchies. Experiments across standard benchmarks show that our approach consistently generates higher-quality images and learns richer hierarchical representations.

URL: https://openreview.net/forum?id=W2zqUkA9Ub

---

Title: Reproducibility Study: Understanding multi-agent LLM cooperation in the GovSim framework

Authors: Alessio Silverio, Carmen Michaela Chezan, Mathijs van Sprang, Tom Cappendijk, Martin Smit

Abstract: Governance of the Commons Simulation (GovSim) is a Large Language Model (LLM) multi-agent framework designed to study cooperation and sustainability between LLM agents in resource-sharing environments (Piatti et al., 2024). Understanding the cooperation capabilities of LLMs is vital to the real-world applicability of these models. This study reproduces and extends the original GovSim experiments using recent small-scale open-source LLMs, including newly released instruction-tuned models such as Phi-4 and DeepSeek-R1 distill variants. We evaluate three core claims from the original paper: (1) GovSim enables the study and benchmarking of emergent sustainable behavior, (2) only the largest and most powerful LLM agents achieve a sustainable equilibrium, while smaller models fail, and (3) agents using universalization-based reasoning significantly improve sustainability. Our findings support the first claim, demonstrating that GovSim remains a valid platform for studying social reasoning in multi-agent LLM systems. However, our results challenge the second claim: recent smaller-sized LLMs, particularly DeepSeek-R1-Distill-Qwen-14B, achieve sustainable equilibrium, indicating that advancements in model design and instruction tuning have narrowed the performance gap with larger models. Regarding the third claim, our results confirm that universalization-based reasoning improves performance in the GovSim environment. However, further analysis suggests that the improved performance primarily stems from the numerical instructions provided to agents rather than the principle of universalization itself. To further generalize these findings, we extended the framework to include a broader set of social reasoning strategies.
We find that reasoning strategies incorporating explicit numerical guidance consistently outperform abstract ethical prompts, highlighting the critical role of prompt specificity in influencing agent behavior.

URL: https://openreview.net/forum?id=ON8EMrNwww

---

Title: A Multilevel Low-Rank Newton Method with Super-linear Convergence Rate and its Application to Non-convex Problems

Authors: Nick Tsipinakis, Panagiotis Tigas, Panos Parpas

Abstract: Second-order methods can address the shortcomings of first-order methods for the optimization of large-scale machine learning models.
However, second-order methods have significantly higher computational costs associated with the computation of second-order information. Subspace methods that are based on randomization have addressed some of these computational costs as they compute search directions in lower dimensions. Even though super-linear convergence rates have been empirically observed, it has not been possible to rigorously show that these variants of second-order methods can indeed achieve such fast rates.
Also, it is not clear whether subspace methods are efficient for non-convex settings.
To address these shortcomings, we develop a link between multigrid optimization methods and low-rank Newton methods that enables us to prove the super-linear rates of stochastic low-rank Newton methods rigorously. Our method does not require any computations in the original model dimension. We further propose a truncated version of the method that is capable of solving high-dimensional non-convex problems. Preliminary numerical experiments show that our method has a better escape rate from saddle points compared to the state-of-the-art first-order methods and thus returns lower training errors.

URL: https://openreview.net/forum?id=PKakPzVVja

---

Title: From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

Authors: Zefan Cai, Haoyi Qiu, Haozhe Zhao, Ke Wan, Jiachen Li, Jiuxiang Gu, Wen Xiao, Nanyun Peng, Junjie Hu

Abstract: Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (verbs and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.

URL: https://openreview.net/forum?id=C0yxuS6jty

---

Title: Semi-Supervised Cross-Domain Imitation Learning

Authors: Li-Min Chu, Kai-Siang Ma, Ming-Hong Chen, Ping-Chun Hsieh

Abstract: Cross-domain imitation learning (CDIL) accelerates policy learning by transferring expert knowledge across domains, which is valuable in applications where collection of expert data is costly. Existing methods are either supervised, relying on proxy tasks and explicit alignment, or unsupervised, aligning distributions without paired data but often unstable. We introduce the Semi-Supervised CDIL (SS-CDIL) setting and propose the first algorithm for SS-CDIL with theoretical justification. Our method uses only offline data, including a small number of target expert demonstrations and some unlabeled imperfect trajectories. To handle domain discrepancy, we propose a novel cross-domain loss function for learning inter-domain state-action mappings and design an adaptive weight function to balance the source and target knowledge. Experiments on MuJoCo and Robosuite show consistent gains over the baselines, demonstrating that our approach achieves stable and data-efficient policy learning with minimal supervision.

URL: https://openreview.net/forum?id=WARXnbJawZ

---

Title: Exploring Perceptual Limitations of Multimodal LLMs on Small Visual Objects

Authors: Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun

Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable performance in various multimodal benchmarks. However, general benchmarks often do not reveal the specific aspects of their visual perception limits due to the lack of controllability. In this work, we quantitatively study the perception of small visual objects in several widely-used MLLMs and reveal a pervasive limitation in answering questions about small objects in images. We then conduct a controlled study of MLLMs' perception, using text-reading as a surrogate task for general visual perception to understand how quality, size, distractors, and location of an object can independently affect the ability of MLLMs to perceive it in images. Through this controlled study, we find that lower object quality, smaller object size, and the presence of visual distractors can each independently reduce MLLMs' ability to answer visual questions. More surprisingly, even local perturbations of an object by a few pixels can cause a drastic decline in the ability of MLLMs to perceive it. Our study provides a better understanding of the perceptual limitations of MLLMs and contributes new evaluation protocols for analyzing and enhancing the perception of future MLLMs.

URL: https://openreview.net/forum?id=D8MjYW8m35

---

Title: Probabilistic Pretraining for Improved Neural Regression

Authors: Boris N. Oreshkin, Shiv Kumar Tavker, Dmitry Efimov

Abstract: While transfer learning has revolutionized computer vision and natural language processing, its application to probabilistic regression remains underexplored, particularly for tabular data. We introduce NIAQUE (Neural Interpretable Any-Quantile Estimation), a novel permutation-invariant architecture that enables effective transfer learning across diverse regression tasks. Through extensive experiments on 101 datasets, we demonstrate that pre-training NIAQUE on multiple datasets and fine-tuning on target datasets consistently outperforms both traditional tree-based models and transformer-based neural baselines. On real-world Kaggle competitions, NIAQUE achieves competitive performance against heavily hand-crafted and feature-engineered solutions and outperforms strong baselines such as TabPFN and TabDPT, while maintaining interpretability through its probabilistic framework. Our results establish NIAQUE as a robust and scalable approach for tabular regression, effectively bridging the gap between traditional methods and modern transfer learning.

URL: https://openreview.net/forum?id=F6BTATGXaf

---

Title: CAE: Repurposing the Critic as an Explorer in Deep Reinforcement Learning

Authors: Yexin Li

Abstract: Exploration remains a fundamental challenge in reinforcement learning, as many existing methods either lack theoretical guarantees or fall short in practical effectiveness. In this paper, we propose CAE, i.e., the Critic as an Explorer, a lightweight approach that repurposes the value networks in standard deep RL algorithms to drive exploration, without introducing additional parameters. CAE leverages multi-armed bandit techniques combined with a tailored scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and strong empirical stability. Remarkably, it is simple to implement, requiring only about 10 lines of code. For complex tasks where learning reliable value networks is difficult, we introduce CAE+, an extension of CAE that incorporates an auxiliary network. CAE+ increases the parameter count by less than 1% while preserving implementation simplicity, adding roughly 10 additional lines of code. Extensive experiments on MuJoCo, MiniHack, and Habitat validate the effectiveness of CAE and CAE+, highlighting their ability to unify theoretical rigor with practical efficiency.

URL: https://openreview.net/forum?id=54MOD02xC2

---

Title: CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

Authors: Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar

Abstract: Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). With CodePDE, we present a thorough evaluation of critical capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and test-time scaling. CodePDE shows that, with advanced inference-time algorithms and scaling strategies, LLMs can achieve strong performance across a range of representative PDE problems. We also identify novel insights into LLM-driven solver generation, such as trade-offs between solver reliability and sophistication, design principles for LLM-powered PDE solving agents, and failure modes of LLMs on hard tasks. These insights offer guidance for building more capable and reliable LLM-based scientific engines.

URL: https://openreview.net/forum?id=eG3Qy5Oux6

---

Title: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Authors: Apurv Verma, Hai Phan, Shubhendu Trivedi

Abstract: Watermarking has become a practical tool for tracing language model outputs, but it modifies token probabilities at inference time, which were carefully tuned by alignment training. This creates a tension: how do watermark-induced shifts interact with the procedures intended to make models safe and useful? Experiments on several contemporary models and two representative watermarking schemes reveal that watermarking induces a nontrivial, patterned yet model-specific shift in alignment. We see two failure modes: guard attenuation, where models become more helpful but less safe, and guard amplification, where refusals become overly conservative. These effects persist even after controlling for perplexity degradation, pointing to alignment-specific distortions, not just quality loss. We address this with Alignment Resampling (AR), a procedure that samples multiple watermarked outputs and selects the most aligned response according to an external reward model. Using standard results on the expected maximum of Gaussian random variables, we derive a theoretical lower bound showing that alignment gains grow sublogarithmically with sample size. In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection. This is the first empirical study of watermarking-alignment interactions; it shows that a simple inference-time fix can recover alignment.

URL: https://openreview.net/forum?id=w2ATKQcfWx

---

Title: From Link Prediction to Forecasting: Addressing Challenges in Batch-based Temporal Graph Learning

Authors: Moritz Lampert, Christopher Blöcker, Ingo Scholtes

Abstract: Dynamic link prediction is an important problem considered in many recent works that propose approaches for learning temporal edge patterns. To assess their efficacy, models are evaluated on continuous-time and discrete-time temporal graph datasets, typically using a traditional batch-oriented evaluation setup. However, as we show in this work, a batch-oriented evaluation is often unsuitable and can cause several issues. Grouping edges into fixed-sized batches regardless of their occurrence time leads to information loss or leakage, depending on the temporal granularity of the data. Furthermore, fixed-size batches create time windows with different durations, resulting in an inconsistent dynamic link prediction task. In this work, we empirically show how traditional batch-based evaluation leads to skewed model performance and hinders the fair comparison of methods. We mitigate this problem by reformulating dynamic link prediction as a link forecasting task that better accounts for temporal information present in the data.

URL: https://openreview.net/forum?id=iZPAykLE3l

---

Title: Towards Scalable Language-Image Pre-training for 3D Medical Imaging

Authors: Chenhui Zhao, Yiwei Lyu, Asadur Zaman Chowdury, Edward S Harake, Akhil Kondepudi, Akshay T Rao, Xinhai Hou, Honglak Lee, Todd C Hollon

Abstract: The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the clinical workflow and provides a natural path to scalability. However, the unique structure of such data presents new challenges for existing model architectures, which were originally designed for 2D slices or single 3D scans. To address this, we introduce a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study. We denote our framework as Hierarchical attention for Language-Image Pre-training (HLIP). Trained on 220K studies with 3.13 million scans for brain MRI and 240K studies with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +10.5% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +8.3% and +1.7% macro AUC on head CT benchmarks CQ500 and RSNA, respectively. HLIP also exhibits strong generalizability on existing 3D medical language-image pre-training benchmarks, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/zch0414/hlip

URL: https://openreview.net/forum?id=WxHf4EcBWA

---

Title: TABASCO: A Fast, Simplified Model for Molecular Generation with Improved Physical Quality

Authors: Carlos Vonessen, Charles Harris, Miruna Cretu, Pietro Lio

Abstract: State-of-the-art models for 3D molecular generation are based on significant inductive biases: SE(3) equivariance, permutation invariance and graph message‑passing networks to capture local chemistry, yet the generated molecules struggle with physical plausibility.
We introduce TABASCO, which relaxes these assumptions: the model uses a standard non-equivariant transformer architecture, treats the atoms in a molecule as a sequence, and does not explicitly model bonds. The absence of equivariant layers and message passing allows us to simplify the model architecture and scale data throughput.
On the GEOM‑Drugs and QM9 benchmarks TABASCO achieves state-of-the-art PoseBusters validity and delivers inference roughly 10x faster than the strongest baseline, while exhibiting emergent rotational equivariance without hard-coded symmetry.
Our work offers a blueprint for training minimalist, high‑throughput, unconditional generative models and the resulting architecture is readily extensible to future conditional tasks.
We provide a link to our implementation at https://github.com/carlosinator/tabasco.

URL: https://openreview.net/forum?id=Kg6CSrbXl4

---

Title: Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models

Authors: Inés Gonzalez Pepe, Hiba Akhaddar, Tristan Glatard, Yohan Chatelain

Abstract: We introduce Fuzzy PyTorch, a framework for rapid evaluation of numerical variability in deep learning (DL) models. As DL is increasingly applied to diverse tasks, understanding variability from floating-point arithmetic is essential to ensure robust and reliable performance. Tools assessing such variability must be scalable, efficient, and integrate seamlessly with existing frameworks while minimizing code modifications. Fuzzy PyTorch enables this by integrating stochastic arithmetic into PyTorch through Probabilistic Rounding with Instruction Set Management, a novel library interfacing with Verificarlo, a numerical analysis compiler. The library offers a stochastic rounding mode and a novel up-down rounding mode.
Comparative evaluations show Fuzzy PyTorch maintains model performance and achieves runtime reductions of $5\times$ to $60\times$ versus Verrou, a state-of-the-art tool. We further demonstrate scalability by running models from 1 to 341 million parameters, confirming applicability across small and large DL architectures. Overall, Fuzzy PyTorch provides an efficient, scalable, and practical solution for assessing numerical variability in deep learning, enabling researchers and practitioners to quantify and manage floating-point uncertainty without compromising performance or computational efficiency.

URL: https://openreview.net/forum?id=0ogq232VGP

---

Title: A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

Authors: Xinjie Liu, Cyrus Neary, Kushagra Gupta, Wesley A. Suttle, Christian Ellis, ufuk topcu, David Fridovich-Keil

Abstract: Many reinforcement learning (RL) algorithms are impractical for deployment in operational systems or for training with computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators—such as reduced-order models, heuristic reward functions, or generative world models—can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a control variate formed from a large volume of low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework by developing a practical, multi-fidelity variant of the classical REINFORCE algorithm. We show that under standard assumptions, the MFPG estimator guarantees asymptotic convergence of multi-fidelity REINFORCE to locally optimal policies in the target environment, and achieves faster finite-sample convergence rates compared to training with high-fidelity data alone. We evaluate the MFPG algorithm across a suite of simulated robotics benchmark tasks in scenarios with limited high-fidelity data but abundant off-dynamics, low-fidelity data. In our baseline comparisons, for scenarios where low-fidelity data are neutral or beneficial and dynamics gaps are mild to moderate, MFPG is, among the evaluated off-dynamics RL and low-fidelity-only approaches, the only method that consistently achieves statistically significant improvements in mean performance over a baseline trained solely on high-fidelity data. When low-fidelity data become harmful, MFPG exhibits the strongest robustness against performance degradation among the evaluated methods, whereas strong off-dynamics RL methods tend to exploit low-fidelity data aggressively and fail substantially more severely.
An additional experiment in which the high- and low-fidelity environments are assigned anti-correlated rewards shows that MFPG can remain effective even when the low-fidelity environment exhibits reward misspecification. Thus, MFPG not only offers a reliable and robust paradigm for exploiting low-fidelity data, e.g., to enable efficient sim-to-real transfer, but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
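The control-variate construction the abstract describes can be sketched generically. The toy numbers, variable names, and correlations below are invented for illustration; this is not the authors' MFPG implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hi = 50

# Invented toy data: paired high-/low-fidelity samples sharing a common
# component, plus a large pool of extra low-fidelity draws.
common = rng.normal(size=n_hi)
g_hi = common + 0.1 * rng.normal(size=n_hi)   # scarce, unbiased target samples
g_lo = common + 0.5                            # cheap, biased but correlated
g_lo_pool = rng.normal(size=5000) + 0.5        # abundant low-fidelity draws

# Control variate: subtract the centered low-fidelity samples, so the
# correlated noise cancels while the mean is (approximately) preserved.
mu_lo = g_lo_pool.mean()
c = np.cov(g_hi, g_lo)[0, 1] / np.var(g_lo, ddof=1)
g_mf = g_hi - c * (g_lo - mu_lo)

naive_var = np.var(g_hi, ddof=1)
mf_var = np.var(g_mf, ddof=1)
assert mf_var < naive_var  # variance-reduced estimate of the same quantity
```

The larger the correlation between the high- and low-fidelity samples, the larger the variance reduction; with uncorrelated data the optimal coefficient `c` goes to zero and the estimator falls back to the high-fidelity-only baseline.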

URL: https://openreview.net/forum?id=zAo0L7Dcqt

---

Title: Proc-to-Spec: A Functorial Map of Network Processes

Authors: Shanfeng Hu

Abstract: The analysis of dynamic networks is central to understanding complex environmental systems in nature, yet traditional methods often focus on describing changing states rather than formalising the underlying processes of change. In this work, we introduce a category-theoretical framework, Proc-to-Spec, that provides a principled, functorial method for analysing the transformations that govern network evolution. We model resource-constrained systems, such as those commonly found in biology and ecology, within a source category Proc, where morphisms represent dissipative physical processes. We then construct a spectral functor, $\chi: Proc \to Spec$, that maps each process to a unique linear transformation between the eigenspaces of the network's symmetrised Laplacian. This framework allows us to establish a set of rigorous theorems. We prove that physical conservation laws in Proc correspond directly to spectral invariants in Spec, such as the conservation of the Laplacian's trace. We derive a spectral sensitivity theorem that formally links resource dissipation to network fragmentation via the Fiedler value. We also establish a stability-spectrum equivalence theorem, proving that a system's physical dynamics converge to a stable state if and only if its spectral geometry converges. We also derive an optimal Spec-to-Func projection to compress these transformations into interpretable, low-dimensional functional fingerprints. We validate our theory with numerical experiments and demonstrate its generality as a tool for scientific discovery across two comprehensive, contrasting case studies. 
(1) In a high-signal, high-noise, macro-timescale ecological case study of the Serengeti food web in northern Tanzania, we use a large collection of 1.2 million classified image sets of animal activity from 225 camera traps spread across 1,125 km$^2$ of the Serengeti National Park from 2010 to 2013 to show that our framework can detect the subtle, cyclical signature of seasonal change and identify the unique geometric fingerprint of the 2011 East Africa drought. (2) In a low-signal, high-noise, micro-timescale neuroscience case study, we show that our framework's functional fingerprints can detect and characterise subtle cognitive processes from human brain fMRI data, classifying 8 distinct task states with high, generalisable accuracy. Our work provides a different way of thinking about dynamic systems, shifting the focus from describing states to understanding the fundamental geometry of change. Code to reproduce all results in the paper is released at https://github.com/shanfenghu/pts

URL: https://openreview.net/forum?id=pT84Ii6igG

---

Title: Topological Inductive Bias fosters Multiple Instance Learning in Data-Scarce Scenarios

Authors: Salome Kazeminia, Carsten Marr, Bastian Rieck

Abstract: Multiple instance learning (MIL) is a framework for weakly supervised classification, where labels are assigned to sets of instances, i.e., bags, rather than to individual data points. This paradigm has proven effective in tasks where fine-grained annotations are unavailable or costly to obtain. However, the effectiveness of MIL drops sharply when training data are scarce, such as for rare disease classification. To address this challenge, we propose incorporating topological inductive biases into the data representation space within the MIL framework. This bias introduces a topology-preserving constraint that encourages the instance encoder to maintain the topological structure of the instance distribution within each bag when mapping them to MIL latent space. As a result, our Topology Guided MIL (TG-MIL) method enhances the performance and generalizability of MIL classifiers across different aggregation functions, especially under scarce-data regimes. Our evaluations show average performance improvements of 15.3% for synthetic MIL datasets, 2.8% for MIL benchmarks, and 5.5% for rare anemia classification compared to current state-of-the-art MIL models, where only 17–120 samples per class are available. We make our code publicly available at https://github.com/SalomeKaze/TGMIL.

URL: https://openreview.net/forum?id=1hZy9ZjjCc

---

Title: Policy Learning with a Language Bottleneck

Authors: Megha Srivastava, Cédric Colas, Dorsa Sadigh, Jacob Andreas

Abstract: Modern AI systems such as self-driving cars and game-playing agents achieve superhuman performance. But they often lack human-like generalization, interpretability, and interoperability with human users. This paper introduces *Policy Learning with a Language Bottleneck* (PLLB), a framework enabling AI agents to generate linguistic rules that capture the high-level strategies underlying rewarding behaviors. PLLB alternates between a *rule generation* step guided by language models, and an *update* step where agents learn new policies guided by rules. Crucially, PLLB enables this kind of language-guided learning even when a natural language rule is insufficient to completely describe the target policy. Across five diverse tasks, including a two-player signaling game, maze navigation, image reconstruction, and robot grasp planning, we show that PLLB learns more interpretable and generalizable behaviors than standard policy learning methods. In three additional human subject studies, we show that the learned rules significantly improve human task performance, enabling more effective human-AI coordination.

URL: https://openreview.net/forum?id=sK8uEqzQPv

---

Title: Denoising Hamiltonian Network for Physical Reasoning

Authors: Congyue Deng, Brandon Y. Feng, Cecilia Garraffo, Alan Garbarz, Robin Walters, William T. Freeman, Leonidas Guibas, Kaiming He

Abstract: Machine learning frameworks for physical problems must capture and enforce physical constraints that preserve the structure of dynamical systems. Many existing approaches achieve this by integrating physical operators into neural networks. While these methods offer theoretical guarantees, they face two key limitations: (i) they primarily model local relations between adjacent time steps, overlooking longer-range or higher-level physical interactions, and (ii) they focus on forward simulation while neglecting broader physical reasoning tasks. We propose the Denoising Hamiltonian Network (DHN), a novel framework that generalizes Hamiltonian mechanics operators into more flexible neural operators. DHN captures non-local temporal relationships and mitigates numerical integration errors through a denoising mechanism. DHN also supports multi-system modeling with a global conditioning mechanism. We demonstrate its effectiveness and flexibility across three diverse physical reasoning tasks with distinct inputs and outputs.

URL: https://openreview.net/forum?id=KublEgx7Hv

---

Title: Amortized Bayesian Workflow

Authors: Chengkun LI, Aki Vehtari, Paul-Christian Bürkner, Stefan T. Radev, Luigi Acerbi, Marvin Schmitt

Abstract: Bayesian inference often faces a trade-off between computational speed and sampling accuracy. We propose an adaptive workflow that integrates rapid amortized inference with gold-standard MCMC techniques to achieve a favorable combination of both speed and accuracy when performing inference on many observed datasets. Our approach uses principled diagnostics to guide the choice of inference method for each dataset, moving along the Pareto front from fast amortized sampling via generative neural networks to slower but guaranteed-accurate MCMC when needed. By reusing computations across steps, our workflow synergizes amortized and MCMC-based inference. We demonstrate the effectiveness of this integrated approach on several synthetic and real-world problems with tens of thousands of datasets, showing efficiency gains while maintaining high posterior quality.

URL: https://openreview.net/forum?id=osV7adJlKD

---

Title: Layer Collapse Can be Induced by Unstructured Pruning

Authors: Zhu LIAO, Victor Quétu, Van-Tam Nguyen, Enzo Tartaglione

Abstract: Unstructured pruning is a popular compression method for efficiently reducing model parameters. However, while it effectively decreases the number of parameters, it is commonly believed that unstructured pruning cannot shorten the computational critical path, i.e., the maximum number of layers traversed during forward propagation.

In this paper, we study when and how unstructured pruning can yield structural effects. For rectifier-activated networks, we introduce the notion of neuron entropy, which quantifies the degree of nonlinearity utilization. We show that magnitude-based pruning naturally lowers this entropy, sometimes down to zero-entropy layers that become linearizable and can thus be removed. Building on this insight, we propose a method that leverages "unstructured" pruning to favor sparsity in low-entropy layers, enabling their complete removal. We validate the phenomenon across CNNs, Vision Transformers, and NLP models: unstructured pruning can induce effective layer removal with little or no performance degradation in over-parameterized networks. Our code is available at https://github.com/ZhuLIAO001/NEPENTHE.git.
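One plausible reading of the neuron-entropy idea above can be sketched as the binary entropy of a rectifier unit's activation pattern. This is a hypothetical formula for illustration only; the paper's exact definition may differ:

```python
import numpy as np

def neuron_state_entropy(preacts):
    """Binary entropy of a ReLU unit's on/off pattern over a batch.
    Zero entropy means the unit is always active (purely linear) or
    always inactive (constant), so it contributes no nonlinearity; a
    layer of such units can be folded into its neighbors.
    (Hypothetical formulation, not necessarily the paper's.)"""
    p = float((np.asarray(preacts) > 0).mean())
    if p in (0.0, 1.0):
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

assert neuron_state_entropy([0.5, 1.0, 2.0]) == 0.0          # always on: linear
assert abs(neuron_state_entropy([-1.0, 1.0]) - 1.0) < 1e-12  # maximally nonlinear
```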

URL: https://openreview.net/forum?id=rfDYZNZIZT

---

Title: The Cost of Replicability in Active Learning

Authors: Rupkatha Hira, Dominik Kau, Jessica Sorrell

Abstract: Active learning aims to reduce the number of labeled data points required by machine learning algorithms by selectively querying labels from initially unlabeled data. Ensuring replicability, where an algorithm produces consistent outcomes across different runs, is essential for the reliability of machine learning models but often increases sample complexity. This report investigates the cost of replicability in active learning using two classical disagreement-based methods: the CAL and A\textsuperscript{2} algorithms. Leveraging random thresholding techniques, we propose two replicable active learning algorithms: one for realizable learning of finite hypothesis classes, and another for the agnostic setting. Our theoretical analysis shows that while enforcing replicability increases label complexity, CAL and A\textsuperscript{2} still achieve substantial label savings under this constraint. These findings provide key insights into balancing efficiency and stability in active learning.
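The random-thresholding trick the abstract leverages can be sketched in its simplest form, rounding a statistic to a randomly offset grid so that two runs on fresh samples agree with high probability. This is a generic illustration, not the paper's CAL/A\textsuperscript{2} constructions; names and parameters are hypothetical:

```python
import random

def replicable_mean(samples, alpha, shared_seed):
    """Random-threshold rounding: snap the empirical mean to a grid of
    spacing alpha with a random offset drawn from shared internal
    randomness. Two runs on fresh data from the same distribution land
    on the same grid point unless a grid boundary falls between their
    empirical means, which happens with low probability."""
    offset = random.Random(shared_seed).uniform(0, alpha)  # shared randomness
    v = sum(samples) / len(samples)
    return offset + alpha * round((v - offset) / alpha)

# Two runs with slightly different empirical means output the same value.
a = replicable_mean([0.501], alpha=0.1, shared_seed=42)
b = replicable_mean([0.499], alpha=0.1, shared_seed=42)
assert a == b
assert abs(a - 0.5) <= 0.1  # output stays within one grid step of the mean
```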

URL: https://openreview.net/forum?id=ZsqJu9eITd

---

Title: Single-loop Algorithms for Stochastic Non-Convex Optimization with Weakly-Convex Constraints

Authors: Ming Yang, Gang Li, Quanqi Hu, Qihang Lin, Tianbao Yang

Abstract: Constrained optimization with multiple functional inequality constraints has significant applications in machine learning. This paper examines a crucial subset of such problems where both the objective and constraint functions are weakly convex. Existing methods often face limitations, including slow convergence rates or reliance on double-loop algorithmic designs. To overcome these challenges, we introduce a novel single-loop penalty-based stochastic algorithm. Following the classical exact penalty method, our approach employs a hinge-based penalty, which permits the use of a constant penalty parameter, enabling us to achieve a state-of-the-art complexity for finding an approximate Karush-Kuhn-Tucker (KKT) solution. We further extend our algorithm to address finite-sum coupled compositional objectives, which are prevalent in artificial intelligence applications, establishing improved complexity over existing approaches. Finally, we validate our method through experiments on fair learning with receiver operating characteristic (ROC) fairness constraints and continual learning with non-forgetting constraints.

URL: https://openreview.net/forum?id=aCgOR2KvAI

---

Title: Explainable Graph Learning for Particle Accelerator Operations

Authors: Song Wang, Chris Tennant, Jundong Li

Abstract: Particle accelerators are vital tools in physics, medicine, and industry, requiring precise tuning to ensure optimal beam performance. However, real-world deviations from idealized simulations make beam tuning a time-consuming and error-prone process. In this work, we propose an explanation-driven framework for providing actionable insight into beamline operations, with a focus on the injector beamline at the Continuous Electron Beam Accelerator Facility (CEBAF). We represent beamline configurations as heterogeneous graphs, where setting nodes represent elements that human operators can actively adjust during beam tuning, and reading nodes passively provide diagnostic feedback. To identify the most influential setting nodes responsible for differences between any two beamline configurations, our approach first predicts the resulting changes in reading nodes caused by variations in settings, and then learns importance scores that capture the joint influence of multiple setting nodes. Experimental results on real-world CEBAF injector data demonstrate the framework’s ability to generate interpretable insights that can assist human operators in beamline tuning and reduce operational overhead.

URL: https://openreview.net/forum?id=jnReRk2EX1

---

Title: Robust Clustering using Gaussian Mixtures in the Presence of Cellwise Outliers

Authors: Pushpendra Rajpurohit, Petre Stoica, Prabhu babu

Abstract: In this paper we propose a novel algorithm for robust estimation of Gaussian Mixture Model (GMM) parameters and clustering that explicitly accounts for cellwise outliers. To achieve this, the proposed algorithm minimizes a penalized negative log-likelihood function where the penalty term is derived via the false discovery rate principle. The penalized negative log-likelihood function is cyclically minimized over outlier positions and the GMM parameters. Furthermore, the minimization over the GMM parameters is done using the majorization-minimization framework: specifically, we minimize a tight upper bound on the negative log-likelihood function which decouples into simpler optimization subproblems that can be solved efficiently.
We present several numerical simulation studies comprising experiments aimed at evaluating the performance of the proposed method on synthetic as well as real world data and at systematically comparing it with state-of-the-art robust techniques in different scenarios. The simulation studies demonstrate that our approach effectively addresses the challenges inherent in parameter estimation of GMM and clustering in contaminated data environments.

URL: https://openreview.net/forum?id=oVHPEgjdWk

---

Title: Enhancing Deep Consistent Graph Metric with Affinity and Alignment for Incremental Social Event Detection using Cross-Layer Attention

Authors: Shraban Kumar Chatterjee, Shubham Gupta, Suman Kundu

Abstract: Existing methods of event detection from social media (i.e., X), for instance, KPGNN, FinEvent, and CLKD, use triplet loss for feature separation. Triplet loss suffers from two notable discrepancies in the latent space: (i) inconsistency in intra-event and inter-event distances, and (ii) an inability to ensure the closeness of messages from the same event across different mini-batches. The present paper proposes two novel loss functions to improve consistency in the latent space. The first loss function guarantees consistent intra-event and inter-event distances by increasing the affinity between intra-event points. On the other hand, the alignment loss enhances the cosine similarity between the feature space and label space, thereby aligning features of the same event class across diverse mini-batches. We provide theoretical justification that the proposed loss ensures discriminative features in the latent space, like CGML, without its costly pairwise or specialised batching. In addition to the proposed loss functions, we introduce a new attention module designed to effectively address heterogeneous relations without necessitating a separate optimisation objective. Through comprehensive experimentation on two publicly available datasets, we have shown an average improvement of $24.05\%$, $27.23\%$ and $123.69\%$ in NMI, AMI and ARI, respectively, over supervised SOTA event detection methods. Our method also shows improvements over SOTA unsupervised event detection methods across both datasets. These results are supported by statistical significance tests. The generalizability of the proposed loss to general clustering problems in the graph domain is demonstrated through experiments.

URL: https://openreview.net/forum?id=vNJ7mCgDbq

---

Title: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks

Authors: Hanjiang Hu, Alexander Robey, Changliu Liu

Abstract: Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to detect and filter harmful queries emerging from evolving contexts proactively. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering, and lightweight LLM guardrail baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness, and over-refusal. Check out the website at https://sites.google.com/view/llm-nbf/home.

URL: https://openreview.net/forum?id=dcyLr9xYoI

---

Title: Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

Authors: Alberto Messina, Stefano Scotta

Abstract: Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent works consistently highlight implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this work, we formalize this behavior by introducing the notion of background temperature $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool of models from major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.

URL: https://openreview.net/forum?id=bz0he4bARF

---

Title: Density-Aware Farthest Point Sampling

Authors: Paolo Climaco, Jochen Garcke

Abstract: We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. Thus, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that linearly depends on the weighted fill distance of the training set—a quantity we can estimate simply by considering the data features. We introduce "Density-Aware Farthest Point Sampling" (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimation of the weighted fill distance, thereby aiming at minimizing our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
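Plain (unweighted) farthest point sampling, the starting point the abstract builds on, can be sketched as follows. This is the classical greedy algorithm for illustration only; the paper's DA-FPS additionally weights distances by a density estimate:

```python
import numpy as np

def farthest_point_sampling(X, k, start=0):
    """Greedy FPS: repeatedly add the point farthest from the current
    selection, shrinking the fill distance (max distance of any point to
    its nearest selected point). Unweighted version for illustration."""
    selected = [start]
    d = np.linalg.norm(X - X[start], axis=1)  # distance to selected set
    for _ in range(k - 1):
        nxt = int(np.argmax(d))
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return selected, float(d.max())  # selected indices, resulting fill distance

X = np.random.default_rng(1).normal(size=(200, 2))
idx10, fill10 = farthest_point_sampling(X, 10)
idx40, fill40 = farthest_point_sampling(X, 40)
assert len(set(idx10)) == 10
assert fill40 < fill10  # more training points => smaller fill distance
```

Since the derived error bound depends linearly on the (weighted) fill distance, each greedy step directly attacks the quantity the bound is stated in.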

URL: https://openreview.net/forum?id=vI47lgIfYc

---

Title: Delta-Influence: Identifying Poisons via Influence Functions

Authors: Wenjie Li, Jiawei Li, Pengcheng Zeng, Christian Schroeder de Witt, Ameya Prabhu, Amartya Sanyal

Abstract: Addressing data integrity challenges, such as unlearning the effects of targeted data poisoning after model training, is necessary for the reliable deployment of machine learning models. State-of-the-art influence functions, such as EK-FAC and TRAK, often fail to accurately attribute abnormal model behavior to the specific poisoned training data responsible for the data poisoning attack. In addition, traditional unlearning algorithms often struggle to effectively remove the influence of poisoned samples, particularly when only a few affected examples can be identified. To address these challenges, we introduce $\Delta$-Influence, a novel approach that leverages influence functions to trace abnormal model behavior back to the responsible poisoned training data using just one poisoned test example, without assuming any prior knowledge of the attack. $\Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points without significantly affecting clean data. This allows detecting large negative shifts in influence scores following data transformations, a phenomenon we term influence collapse, thereby accurately identifying poisoned training data. Unlearning this subset, e.g. through retraining, effectively eliminates the data poisoning. We validate our method across three vision-based poisoning attacks and three datasets, benchmarking against five detection algorithms and five unlearning strategies. We show that $\Delta$-Influence consistently achieves the best unlearning across all settings, showing the promise of influence functions for corrective unlearning.

URL: https://openreview.net/forum?id=4XtcG8NNaG

---


New submissions
===============


Title: A Quotient Homology Theory of Representation in Neural Networks

Abstract: Previous research has proven that the set of maps implemented by neural networks with a ReLU activation function is identical to the set of piecewise linear continuous maps. Furthermore, such networks induce a hyperplane arrangement splitting the input domain of the network into convex polyhedra $G_J$ over which a network $\Phi$ operates in an affine manner.

In this work, we leverage these properties to define an equivalence class $\sim_\Phi$ on top of an input dataset, which can be split into two sets related to the local rank of $\Phi_J$ and the intersections $\cap \text{Im}\Phi_{J_i}$. We refer to the latter as the \textit{overlap decomposition} $\mathcal{O}_\Phi$ and prove that if the intersections between each polyhedron and an input manifold are convex, the homology groups of neural representations are isomorphic to quotient homology groups $H_k(\Phi(\mathcal{M})) \simeq H_k(\mathcal{M}/\mathcal{O}_\Phi)$. This lets us intrinsically calculate the Betti numbers of neural representations without the choice of an external metric. We develop methods to numerically compute the overlap decomposition through linear programming and a union-find algorithm.

Using this framework, we perform several experiments on toy datasets showing that, compared to standard persistent homology, our overlap homology-based computation of Betti numbers tracks purely topological rather than geometric features. Finally, we study the evolution of the overlap decomposition during training on several classification problems while varying network width and depth and discuss some shortcomings of our method.
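The union-find step the abstract mentions for computing the overlap decomposition can be sketched with the standard structure below. How overlaps are detected (the linear-programming step) is abstracted away here; this is a generic sketch, not the paper's implementation:

```python
class UnionFind:
    """Union-find with path compression: once the linear programs report
    that two polyhedra's images intersect, union() merges them into one
    overlap class of the decomposition."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Suppose overlaps were detected between polyhedra (0,1) and (1,2):
uf = UnionFind(4)
uf.union(0, 1)
uf.union(1, 2)
assert uf.find(0) == uf.find(2)  # 0, 1, 2 form one overlap class
assert uf.find(3) != uf.find(0)  # 3 remains its own class
```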

URL: https://openreview.net/forum?id=RluspxztzS

---

Title: Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Abstract: Hate speech in online videos poses a growing threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning—a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences—providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.

URL: https://openreview.net/forum?id=U9KnNiuMu1

---

Title: Scalable Ensemble Federated Learning with Enhanced Open-Set Recognition

Abstract: Consensus-driven parameter averaging constitutes the dominant paradigm in federated learning. Although many methods incorporate auxiliary mechanisms or refinements, repeated round averaging remains their fundamental backbone. This paradigm inherently depends on repeated rounds of client–server communication to maintain consensus. The reliance on repeated communication is further amplified in regimes with high data heterogeneity and large client populations, as shown across numerous studies. This behavior arises from optimization drift in out-of-distribution settings, where client objectives differ and multi-step local SGD updates increasingly diverge, making consensus difficult to maintain. We argue that an emerging alternative, ensemble with abstention, provides a more suitable framework for addressing these issues. Rather than enforcing consensus across diverging client objectives, this approach constructs a specialized mixture-of-experts model by preserving client-specific models and selectively aggregating their predictions. As a one-shot FL method, it eliminates the need for repeated communication rounds altogether. Moreover, supported by both theoretical and empirical analysis, we show that this paradigm sidesteps cross-client drift and is inherently less sensitive to data heterogeneity. Despite these advantages, ensemble with abstention introduces two fundamental challenges. First, its performance depends on the design of the open-set recognition (OSR) task, which directly affects performance under heterogeneity. Second, and more critically, preserving client-specific models causes linear growth in model size with the number of clients, limiting scalability. As a step toward addressing these limitations, we introduce FedSOV, which incorporates improved negative sample generation to prevent shortcut cues in the OSR task and employs pruning to address the scalability problem. 
We show that pruning provides a practical and effective solution to the scalability problem while simultaneously enhancing generalization, yielding higher test accuracy. Across datasets, our method achieves an average gain of $18.81\%$ over the ensemble baseline FedOV in extreme label-skew settings and up to $92.43\%$ over FedGF, the best-performing parameter-averaging method. Code is available at: https://anonymous.4open.science/r/FedSOV-C7EF/

URL: https://openreview.net/forum?id=QnnCYOfuUI

---

Title: High precision PINNs in unbounded domains: application to singularity formulation in PDEs

Abstract: We investigate the high-precision training of Physics-Informed Neural Networks (PINNs) in unbounded domains, with a special focus on applications to singularity formulation in PDEs. We propose a modularized approach and study the choices of neural network ansatz, sampling strategy, and optimization algorithm. When combined with rigorous computer-assisted proofs and PDE analysis, the numerical solutions identified by PINNs, provided they are of high precision, can serve as a powerful tool for studying singularities in PDEs. For 1D Burgers equation, our framework can lead to a solution with very high precision, and for the 2D Boussinesq equation, which is directly related to the singularity formulation in 3D Euler and Navier-Stokes equations, we obtain a solution whose loss is 4 digits smaller than that obtained in \cite{wang2023asymptotic} with fewer training steps. We also discuss potential directions for pushing towards machine precision for higher-dimensional problems.

URL: https://openreview.net/forum?id=sF3iEJMVVQ

---

Title: Let data talk: data-regularized operator learning theory for inverse problems

Abstract: Regularization plays a critical role in incorporating prior information into inverse problems. While numerous deep learning methods have been proposed to tackle inverse problems, the strategic placement of regularization remains a crucial consideration.
In this article, we introduce the "data-regularized operator learning" (DaROL) method, specifically designed to address the regularization of inverse problems.
In comparison to typical methods that impose regularization through the training of neural networks, the DaROL method trains a neural network on data that are regularized through well-established techniques, including the Lasso regularization method and Bayesian inference.
Our DaROL method offers flexibility across various frameworks, and features a simplified structure that clearly delineates the processes of regularization and neural network training. In addition, we demonstrate that training a neural network on regularized data is equivalent to supervised learning for a regularized inverse mapping. Furthermore, we provide sufficient conditions for the smoothness of such a regularized inverse mapping and estimate the learning error with regard to neural network size and the number of training samples.

URL: https://openreview.net/forum?id=D7iXTzFhAj

---

Title: When Glass Disappears at Night: A Novel NIR-RGB Multimodal Solution

Abstract: Glass surface detection (GSD) has recently been attracting research interest. However, existing GSD methods focus on modeling glass surface properties for daytime scenes only, and can easily fail in nighttime scenes due to significant lighting discrepancies. We observe that, due to the spectral differences between Near-Infrared (NIR) light sources and common LED lights, NIR and RGB cameras capture complementary visual patterns (e.g., light reflections, shadows, and edges) of glass surfaces, and cross-comparing their lighting and reflectance properties can provide reliable cues for nighttime GSD. Inspired by this observation, we propose a novel approach for nighttime GSD based on multi-modal NIR and RGB image pairs. We first construct a nighttime GSD dataset, which contains $6,192$ RGB-NIR image pairs captured in diverse real-world nighttime scenes, with corresponding carefully-annotated glass surface masks. We then propose a novel network for the nighttime GSD task with two novel modules: (1) an RGB-NIR Guidance Enhancement (RNGE) module for extracting and enriching the NIR reflectance features with the guidance of RGB reflectance features, and (2) an RGB-NIR Fusion and Localization (RNFL) module for fusing RGB and NIR reflectance features into glass features conditioned on the multi-modal illumination discrepancy-aware features. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in nighttime scenes while generalizing well to daytime scenes. We will release our dataset and codes.

URL: https://openreview.net/forum?id=hdh3vHsakv

---

Title: The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making

Abstract: While Large Language Models (LLMs) are widely documented to be sensitive to minor prompt perturbations and prone to sycophantic alignment, their robustness in consequential, rule-bound decision-making remains under-explored. We uncover a striking "Paradox of Robustness": despite their known lexical brittleness, instruction-tuned LLMs exhibit near-total invariance to emotional framing effects. Using a controlled perturbation framework across three high-stakes domains (healthcare, finance, and education), we find a negligible effect size (Cohen's h = 0.003) compared to the substantial biases observed in analogous human contexts (h in [0.3, 0.8])--approximately two orders of magnitude smaller. This invariance persists across eight models with diverse training paradigms, suggesting the mechanisms driving sycophancy and prompt sensitivity do not translate to failures in logical constraint satisfaction. While LLMs may be "brittle" to how a query is formatted, they are notably "stable" against why a decision should be biased. We release a benchmark (9 base scenarios x 18 condition variants = 162 unique prompts), code, and data to facilitate reproducible evaluation.
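For context, Cohen's h compares two proportions on an arcsine-transformed scale, which is how an effect size like the reported 0.003 is obtained. A minimal sketch (the example rates below are illustrative, not the paper's data):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size: difference of arcsine-transformed proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return phi1 - phi2

# Two hypothetical approval rates that barely move under emotional framing:
h = cohens_h(0.501, 0.500)  # a tiny |h|, i.e. a negligible effect
```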

URL: https://openreview.net/forum?id=2XPD66IiQI

---

Title: Parameterized Adverse Lens Corruptions to Probe Model Robustness to Optical Tolerances

Abstract: Deep neural networks excel at image classification on benchmarks like ImageNet, yet they remain vulnerable to adverse conditions, including environmental changes and sensor noise, such as lens blur or camera noise. Consequently, the study of these adverse noise corruptions has been extensive. At the same time, image blur, naturally introduced in optical systems, has been widely ignored as a threat to model robustness. In fact, Gaussian blur has even been considered a viable defense against adversarial attacks. In this work, we challenge the common perception of blur as a rather benign data corruption and study optics-driven, blur-based adversarial attacks. Specifically, we introduce Adverse Lens Corruption (ALC), an optical adversarial attack that identifies worst-case lens blurs, obtained by optimizing Zernike polynomial-based aberrations via gradient descent. Unlike traditional noise-based attacks, ALC provides a physically-grounded continuous search space. This enables the analysis of model robustness to optics-driven blur corruptions and complements existing noise and corruption benchmarks.

URL: https://openreview.net/forum?id=a93BmQRNxC

---

Title: When Active Learning Meets Graph Similarity: Evidential Variance for Graph Selection

Abstract: Graph Similarity Learning (GSL) is pivotal in graph data mining, yet training effective models necessitates substantial labeled pairs, which incur prohibitive annotation costs. To address this, we introduce Active Learning (AL) into the GSL paradigm. However, directly transferring existing AL strategies is non-trivial due to two unique impediments: (1) the continuous regression nature of similarity prediction complicates standard uncertainty quantification, and (2) the paired-input structure requires evaluating a graph's informational value across its pairings rather than in isolation. To bridge this gap, we propose EVGS (Evidential Variance for Graph Selection), a novel AL framework tailored for GSL. EVGS leverages evidential deep learning to impose a prior over predictions, enabling disentangled uncertainty estimation. Crucially, we identify a "gradient shrinkage" pathology inherent to the data-scarce regime characteristic of AL cycles. We introduce a novel MSE-anchored regularizer to mitigate this issue, ensuring discriminative uncertainty estimation even with limited labels. Furthermore, to address the paired-input challenge, we propose a graph-centric selection criterion: uncertainty variance. This metric captures a graph's holistic informational value by measuring fluctuations in its epistemic uncertainty across diverse interactions. Extensive experiments on three benchmarks with two GSL backbones demonstrate that EVGS consistently outperforms established AL baselines.
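A minimal sketch of the selection criterion as we read it, assuming the standard Normal-Inverse-Gamma epistemic-uncertainty formula from evidential regression (the paper's exact parameterization may differ):

```python
import statistics

def epistemic_uncertainty(nu: float, alpha: float, beta: float) -> float:
    # Var[mu] under a Normal-Inverse-Gamma posterior (Amini et al.-style
    # evidential regression): beta / (nu * (alpha - 1)), valid for alpha > 1.
    return beta / (nu * (alpha - 1))

def uncertainty_variance(pair_params):
    """Graph-centric score: variance of one graph's epistemic uncertainty
    across its pairings (our reading of the EVGS selection criterion)."""
    us = [epistemic_uncertainty(nu, a, b) for (nu, a, b) in pair_params]
    return statistics.pvariance(us)

# Toy example: one graph evaluated in three pairings, each yielding NIG params.
score = uncertainty_variance([(1.0, 2.0, 0.5), (2.0, 3.0, 1.0), (1.0, 4.0, 0.9)])
```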

URL: https://openreview.net/forum?id=dV6UopxOjX

---

Title: Efficient DAG Learning via Modular Subgraph Integration

Abstract: Learning causal structures from observational data remains a fundamental yet computationally intensive task, particularly in high-dimensional settings where existing methods face challenges such as the super-exponential growth of the search space and increasing computational demands. To address this, we introduce VISTA (Voting-based Integration of Subgraph Topologies for Acyclicity), a modular framework that decomposes the global causal structure learning problem into local subgraphs based on Markov Blankets. The global integration is achieved through a weighted voting mechanism that penalizes low-support edges via exponential decay, filters unreliable ones with an adaptive threshold, and ensures acyclicity using a Feedback Arc Set (FAS) algorithm. The framework is model-agnostic, imposing no assumptions on the inductive biases of base learners, is compatible with arbitrary data settings without requiring specific structural forms, and fully supports parallelization. We also theoretically establish finite-sample error bounds for VISTA, and prove its asymptotic consistency under mild conditions. Extensive experiments on both synthetic and real datasets consistently demonstrate the effectiveness of VISTA, yielding notable improvements in both accuracy and efficiency over a wide range of base learners.
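The integration step described above can be sketched as follows. The voting weights, threshold, and cycle-breaking rule here are our illustrative assumptions, with a crude greedy pass standing in for a proper Feedback Arc Set routine:

```python
import math
from collections import defaultdict

def integrate_subgraphs(subgraph_edges, threshold=0.3, decay=1.0):
    """Toy weighted-vote integration in the spirit of VISTA.

    subgraph_edges: list of edge sets, one per local (Markov-blanket) learner;
    each edge (u, v) is a vote for u -> v. Low-support edges are down-weighted
    by exponential decay, thresholded, and 2-cycles broken greedily.
    """
    votes = defaultdict(int)
    for edges in subgraph_edges:
        for e in edges:
            votes[e] += 1
    n = len(subgraph_edges)
    # Exponential decay penalizes edges with little cross-subgraph support.
    scores = {e: math.exp(-decay * (1 - c / n)) for e, c in votes.items()}
    kept = {e for e, s in scores.items() if s >= threshold}
    # Greedy acyclicity repair: drop the weaker direction of any 2-cycle.
    for (u, v) in sorted(kept, key=lambda e: scores[e]):
        if (u, v) in kept and (v, u) in kept:
            kept.discard((u, v) if scores[(u, v)] <= scores[(v, u)] else (v, u))
    return kept

dag = integrate_subgraphs([{("a", "b"), ("b", "c")},
                           {("a", "b"), ("c", "b")},
                           {("a", "b")}])
```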

URL: https://openreview.net/forum?id=D5hmL01dIG

---

Title: CAPTAIN: Conformal-Prediction-Based Multi-Source Time-Series Forecasting

Abstract: Uncertainty quantification is critical for real-world forecasting applications such as predictive maintenance, patient health monitoring, and environmental sensing, where decisions must account for confidence levels. Multi-source time-series forecasting introduces additional complexity due to inter-source interactions and temporal dependencies, which existing methods struggle to capture within a unified probabilistic framework, and most previous approaches also lack theoretical guarantees, leading to miscalibrated uncertainty estimates. We propose CAPTAIN (Conformal Prediction based multi-source Time-series forecasting), a two-stage framework that first employs Normal Inverse Gamma (NIG) distributions to model source-specific uncertainties and integrates a meta-source to capture inter-source interactions, then uses temporal copulas to model the evolution of joint uncertainties over time, ensuring robust and theoretically valid uncertainty coverage. Experiments on five diverse datasets (Synthetic, Shaoxing ECG, Air Quality, NGSIM Traffic, and ETTh1) demonstrate that CAPTAIN achieves valid coverage (>=90%) across all five benchmarks, while the baselines do so on four or fewer, confirming its advantage over existing state-of-the-art approaches to multi-source uncertainty quantification.
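The coverage criterion used in the evaluation is simple to state: the fraction of targets that land inside their predicted intervals should meet the nominal level (here 90%). A minimal check, with made-up numbers:

```python
def empirical_coverage(y_true, intervals):
    """Fraction of targets falling inside their predicted (lo, hi) intervals;
    a valid conformal forecaster should reach at least the nominal level."""
    hits = sum(lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals))
    return hits / len(y_true)

y = [1.0, 2.5, 3.0, 4.2, 5.0, 0.5, 2.0, 3.3, 4.0, 1.5]
ivs = [(0.5, 1.5), (2.0, 3.0), (2.5, 3.5), (4.0, 4.5), (4.5, 5.5),
       (0.0, 1.0), (1.5, 2.5), (3.0, 3.6), (3.5, 4.5), (1.0, 2.0)]
cov = empirical_coverage(y, ivs)  # every toy target is covered here
```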

URL: https://openreview.net/forum?id=WJjlXHo4yS

---

Title: Gradient Tree Boosting for Regression Transfer

Abstract: Many real-world modeling problems are hindered by limited data availability. In such cases, *transfer learning* leverages related source domains to improve predictions in a target domain of interest. We extend the classical gradient tree boosting paradigm to a regression transfer algorithm by modeling the weak learner as a sum of two regression trees. The trees are fitted on source data and target data, respectively, and jointly optimized for the target data. We derive optimal coefficients for the model update under the least-squares, the least-absolute-deviation, and the Huber loss functions. We benchmark our approach against the widely used XGBoost algorithm in several transfer scenarios, achieving superior performance in seven out of eight cases.
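For the least-squares case, the "optimal coefficient" for a model update has a standard closed form: the scalar c minimizing sum_i (r_i - c*h(x_i))^2 over residuals r and weak-learner predictions h is <r, h>/<h, h>. A sketch of that one step (the paper's weak learner is a sum of a source tree and a target tree, which we do not reproduce here):

```python
def optimal_ls_coefficient(residuals, h_preds):
    """Closed-form line search under squared loss:
    c* = <r, h> / <h, h> minimizes sum_i (r_i - c * h(x_i))**2."""
    num = sum(r * h for r, h in zip(residuals, h_preds))
    den = sum(h * h for h in h_preds)
    return num / den

# If the weak learner predicts exactly half of each residual, c* = 2.
c = optimal_ls_coefficient([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])
```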

URL: https://openreview.net/forum?id=b29TPa8NPT

---

Title: An Efficient Framework for Length Extension via Dynamically Growing Positional Embedding and Routing Attention

Abstract: Modeling long sequences is critical for numerous large-scale models. However, extending existing architectures to handle significantly longer sequences poses substantial technical and computational challenges. One inevitable issue is the overfitting of large models to positional encodings during pretraining, which limits their ability to generalize to unseen positional encoding scales. Additionally, extending sequence lengths requires extensive computational resources and time. Existing positional encoding methods often rely on carefully designed scaling factors but typically yield suboptimal results. To tackle these challenges, we propose Cyclic, Randomly Truncated, and Dynamically Growing NTK Positional Embedding (CRG NTK), a data-augmentation-based technique that fully explores the RoPE encoding space, enabling models to adapt to various positional scales and achieve state-of-the-art extrapolation in length-extension regimes dominated by positional encoding. Furthermore, we introduce an efficient attention mechanism with a correlation-based routing strategy to enhance the fitting of the augmented positional encoding, yielding superior performance and more efficient fine-tuning. With our approach, LLaMA-7B and Mistral-7B fine-tuned at 16K context length achieve extrapolation factors of at least 128$\times$ on simple tasks, maintain stable perplexity over 32$\times$ sequence length extensions, and save at least 16 times the GPU training resources compared to the best existing method. Experiments also show that correlation routing can achieve good performance by further filtering out large amounts of noise in long sequences.
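A sketch of the underlying NTK-aware RoPE rescaling that such augmentation builds on, assuming the commonly used base-rescaling formula base' = base * scale**(d/(d-2)); the per-step random sampling of the scale is only our reading of the "randomly truncated, dynamically growing" idea, not the paper's exact schedule:

```python
import random

def rope_inv_freqs(dim, base=10000.0, scale=1.0):
    """RoPE inverse frequencies with NTK-aware base rescaling: stretching the
    base makes long contexts look to the model like shorter, familiar ones."""
    ntk_base = base * scale ** (dim / (dim - 2))
    return [ntk_base ** (-2 * i / dim) for i in range(dim // 2)]

# Data-augmentation flavor (our assumption): sample a fresh scale each
# training step so the model is exposed to many positional regimes.
random.seed(0)
step_scale = random.uniform(1.0, 8.0)
freqs = rope_inv_freqs(dim=64, scale=step_scale)
```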

URL: https://openreview.net/forum?id=qLNYDuNYKZ

---

Title: PLA: A Principled Path from Softmax Attention to Linear Models via KV Cache Compression

Abstract: Transformers, despite their remarkable sequence modeling capabilities, are fundamentally constrained by the quadratic complexity of Softmax attention and the unbounded growth of the key–value (KV) cache. Replacing Softmax attention with linear variants has emerged as a promising direction, yet existing approaches lack a systematic functional comparison with Softmax attention, clear error analysis, and a theoretically guided roadmap for improvement.
In this work, we approach the problem from the perspective of KV cache compression and present a theoretically grounded pathway from Softmax attention to linear models.
Our analysis reveals five critical components: redundancy elimination, tokenizer-level quantization and positional information separation, positional information compression, inter-layer similarity, and multi-state decomposition. For each, we provide succinct theoretical justification, derive error bounds, and demonstrate equivalence to existing mechanisms. Building on this pathway, we introduce PLA, a linearized attention model that inherits pretrained weights and achieves state-of-the-art performance. Notably, PLA surpasses strong baselines such as MVA and GSA on multiple benchmarks while requiring only 80% of the fine-tuning resources. Our findings provide both theoretical clarity and practical guidance for advancing linear attention, highlighting a principled route towards efficient and scalable alternatives to Softmax attention.

URL: https://openreview.net/forum?id=ohkS8NffLp

---

Title: MSTN: A Lightweight and Fast Model for General Time-Series Analysis

Abstract: Real-world time series often exhibit strong non-stationarity, complex nonlinear dynamics, and behavior expressed across multiple temporal scales, from rapid local fluctuations to slow-evolving long-range trends. However, many contemporary architectures impose rigid, fixed-scale structural priors---such as patch-based tokenization, predefined receptive fields, or frozen backbone encoders---which can over-regularize temporal dynamics and limit adaptability to abrupt high-magnitude events. To handle this, we introduce the Multi-scale Temporal Network (MSTN), a hybrid neural architecture grounded in an Early Temporal Aggregation principle. MSTN integrates three complementary components: (i) a multi-scale convolutional encoder that captures fine-grained local structure; (ii) a sequence modeling module that learns long-range dependencies through either recurrent or attention-based mechanisms; and (iii) a self-gated fusion stage incorporating squeeze--excitation and multi-head attention to dynamically modulate cross-scale representations. This design enables MSTN to flexibly model temporal patterns spanning milliseconds to extended horizons, while avoiding the computational burden typically associated with long-context models. Across extensive benchmarks covering forecasting, imputation, classification, and cross-dataset generalization, MSTN achieves state-of-the-art performance, establishing new best results on 24 of 32 datasets, while remaining lightweight (≈ 1M params) and suitable for low-latency (<1 sec, often in milliseconds), resource-constrained deployment.

URL: https://openreview.net/forum?id=je2N2nnDry

---

Title: DeGLIF for Label Noise Robust Node Classification using GNNs

Abstract: Noisy labelled datasets are generally inexpensive compared to clean labelled datasets, and the same is true for graph data. In this paper, we propose a denoising technique DeGLIF: Denoising Graph Data using Leave-One-Out Influence Function. DeGLIF uses a small set of clean data and the leave-one-out influence function to make node-level predictions on graph data robust to label noise. The leave-one-out influence function approximates the change in the model parameters if a training point is removed from the training dataset. Recent advances propose a way to calculate the leave-one-out influence function for Graph Neural Networks (GNNs). We extend that recent work to estimate the change in validation loss if a training node is removed from the training dataset. We use this estimate and a new theoretically motivated relabelling function to denoise the training dataset. We propose two DeGLIF variants to identify noisy nodes. Neither of these variants requires any information about the noise model or the noise level in the dataset; DeGLIF also does not estimate these quantities. For one of these variants, we prove that the noisy points detected can indeed increase risk. We carry out detailed computational experiments on different datasets to investigate the effectiveness of DeGLIF. It achieves better accuracy than other baseline algorithms.
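The leave-one-out idea can be made concrete on a toy model. Here we compute the influence exactly by retraining a 1-D least-squares fit (a stand-in for the GNN influence-function approximation the paper uses; the data are invented):

```python
def loo_influence_exact(xs, ys, val_x, val_y):
    """Exact leave-one-out influence on validation loss for y ~ w*x.
    influence_i = loss(w) - loss(w_{-i}); a large positive value means
    removing point i lowers validation loss, flagging it as noisy."""
    def fit(pairs):
        return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

    def val_loss(w):
        return (val_y - w * val_x) ** 2

    full = list(zip(xs, ys))
    base = val_loss(fit(full))
    return [base - val_loss(fit(full[:i] + full[i + 1:]))
            for i in range(len(full))]

# Clean points follow y = 2x; the last one is mislabelled.
infl = loo_influence_exact([1.0, 2.0, 3.0, 1.0], [2.0, 4.0, 6.0, -2.0],
                           val_x=1.0, val_y=2.0)
```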

URL: https://openreview.net/forum?id=pcs5DmBtUJ

---

Title: Protein structural superfamily classification using hand-crafted and language model features: A performance vs interpretability trade-off

Abstract: The CATH database categorizes more than 600,000 protein domain structures into superfamilies based on a hierarchy of structural similarity notions. Members of a single superfamily may share less than 35% sequence similarity. The scale of such data motivates the use of machine learning methods that can accurately predict the CATH superfamily of a protein domain and, at the same time, are interpretable, i.e. provide insights into the characteristic features of a superfamily. The newfound rise of protein language models (PLMs) that leverage data and compute has introduced an interesting conflict: a trade-off between the high predictive performance of non-interpretable features and the scientific insight that can be gained from interpretable, hand-crafted ones. In this work, we highlight and study this conflict via the task of classifying protein domains into their CATH superfamilies. We train one-vs-all (OvA) linear SVM classifiers for 45 diverse CATH superfamilies, each characterised by significant class imbalance. We address the class imbalance by using a class-balanced loss function and the arithmetic mean (AM) of specificity and sensitivity for evaluation. Our analysis compares nine feature vector types, which are either non-interpretable embeddings from PLMs or interpretable hand-crafted features. The latter includes amino acid composition (AAC), di- and tri-peptide composition (DPC, TPC), and novel sequence-order (2OAAC, 3OAAC) and structure-based features (OCPC, CSIC). Our results demonstrate that PLM-based features achieve superior test AM scores of 90-99% with low variability, outperforming hand-crafted features by 20-30%. While PLM features yield high classification accuracy, their lack of interpretability obscures the underlying biological determinants. Conversely, the interpretability of hand-crafted features, despite their relatively low performance, can be leveraged to infer sequence and structural characteristics of CATH superfamilies. 
We illustrate this for two superfamilies. First, we rank the components of hand-crafted features using a known method, marginal contribution feature importance (MCI). Then, based on the interpretability of the top-ranked hand-crafted feature components, we derive biological insights, such as characteristic contacts of superfamily structures. The proposed hand-crafted CSIC feature strikes a balance between predictive performance and interpretability, as it overfits less while providing rich structural information about contact sequence separation. This can be valuable for downstream applications, such as investigating protein-related diseases and guiding rational protein design.

URL: https://openreview.net/forum?id=huTeyYU0yD

---

Title: Watermarking Language Models with Error Correcting Codes

Abstract: Recent progress in large language models enables the creation of realistic machine-generated content. Watermarking is a promising approach to distinguish machine-generated text from human text, embedding statistical signals in the output that are ideally undetectable to humans. We propose a watermarking framework that encodes such signals through an error correcting code. Our method, termed robust binary code (RBC) watermark, introduces no noticeable degradation in quality. We evaluate our watermark on base and instruction fine-tuned models and find that our watermark is robust to edits, deletions, and translations. We provide an information-theoretic perspective on watermarking, a powerful statistical test for detection and for generating p-values, and theoretical guarantees. Our empirical findings suggest our watermark is fast, powerful, and robust, comparing favorably to the state-of-the-art.
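To see why an error-correcting code helps, consider the simplest ECC, a repetition code: even if edits flip some carrier bits, a majority vote recovers the message. This is a toy illustration of the principle only, not the paper's RBC construction; in a real watermark the keyed bits would bias token sampling rather than be emitted directly:

```python
import hashlib

def prf_bit(key: str, i: int) -> int:
    """Pseudorandom bit from a keyed hash; stands in for the watermark key."""
    return hashlib.sha256(f"{key}:{i}".encode()).digest()[0] & 1

def embed(message_bits, repeats, key):
    """Repetition code: repeat each message bit, then mask with keyed bits."""
    carrier = [b for b in message_bits for _ in range(repeats)]
    return [c ^ prf_bit(key, i) for i, c in enumerate(carrier)]

def decode(channel_bits, repeats, key):
    """Unmask, then majority-vote each block to correct sporadic bit flips."""
    unmasked = [b ^ prf_bit(key, i) for i, b in enumerate(channel_bits)]
    return [int(sum(unmasked[i:i + repeats]) * 2 > repeats)
            for i in range(0, len(unmasked), repeats)]

msg = [1, 0, 1, 1]
tx = embed(msg, repeats=5, key="secret")
tx[3] ^= 1  # simulate an edit corrupting one carrier bit
recovered = decode(tx, repeats=5, key="secret")  # equals msg despite the flip
```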

URL: https://openreview.net/forum?id=H6oBZxNQk2

---

Title: Favourability of Loss Landscape with Weight Decay Requires Both Large Overparametrization and Initialization

Abstract: The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the $\ell_2$-regularized training loss for two-layer ReLU networks. We show that the landscape becomes favourable -- i.e., spurious local minima represent a negligible fraction of local minima -- under large overparametrization, specifically when the network width $m$ satisfies $m \gtrsim \min(n^d, 2^n)$, where $n$ is the number of data points and $d$ the input dimension. More precisely, in this regime almost all constant activation regions contain a global minimum and no spurious local minima. We further show that this level of overparametrization is not only sufficient but also necessary via the example of orthogonal data. Finally, we demonstrate that such loss landscape results are primarily relevant in the large initialization regime. In contrast, for small initializations -- corresponding to the feature learning regime -- optimization can still converge to spurious local minima, despite the favourability of the landscape.

URL: https://openreview.net/forum?id=jbU0Tjjhfg

---

Title: Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning

Abstract: Self-supervised learning (SSL) has improved empirical performance by unleashing the power of unlabeled data for practical applications. Specifically, SSL extracts representations from massive unlabeled data, which are then transferred to a variety of downstream tasks with limited data. The significant improvement on diverse applications of representation learning has attracted increasing attention, resulting in a variety of dramatically different self-supervised learning objectives for representation extraction, with an assortment of learning procedures but no clear and unified understanding. Such an absence hampers the ongoing development of representation learning, leaving a theoretical understanding missing, principles for efficient algorithm design unclear, and the use of representation learning methods in practice unjustified. The urgency for a unified framework is further motivated by the rapid growth in representation learning methods. In this paper, we are therefore compelled to develop a principled foundation of representation learning. We first theoretically investigate the sufficiency of the representation from a spectral representation view, which reveals the spectral essence of the existing successful SSL algorithms and paves the path to a unified framework for understanding and analysis. Such a framework also inspires the development of more efficient and easy-to-use representation learning algorithms in a principled way for real-world applications.

URL: https://openreview.net/forum?id=C82ZSnEC1z

---

Title: Hierarchy-Aware Multimodal Unlearning for Medical AI

Abstract: Pretrained Multimodal Large Language Models (MLLMs) are increasingly used in sensitive domains such as medical AI, where privacy regulations like HIPAA and GDPR require the removal of specific individuals’ or institutions’ data. This motivates machine unlearning, which aims to remove the influence of target data from a trained model. However, existing unlearning benchmarks fail to reflect the hierarchical and multimodal structure of real-world medical data, limiting their ability to properly evaluate unlearning in practice.
Therefore, we introduce MedForget, a hierarchy-aware multimodal unlearning benchmark that models hospital data as a nested structure, enabling fine-grained evaluation of multimodal unlearning across retain and forget splits. Experiments show that current unlearning methods struggle to achieve effective hierarchy-aware forgetting without degrading downstream medical utility, measured by performance on clinically relevant prediction tasks. To address this limitation, we propose Cross-modal Hierarchy-Informed Projection for unlearning (CHIP), a training-free, hierarchy-aware multimodal unlearning method that deletes information by selectively removing target-specific weight subspaces while preserving sibling-shared information. Our results show that CHIP achieves the highest forget-retain performance gap across all hierarchy levels while maintaining competitive downstream utility compared to existing methods.
Overall, MedForget provides a practical, HIPAA-aligned benchmark for evaluating structured multimodal unlearning for medical data, and CHIP offers an effective and general solution for hierarchy-aware forgetting that balances deletion with utility.

URL: https://openreview.net/forum?id=TVSIhLqIkf

---

Title: Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation

Abstract: Adversarial distillation in the standard min–max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students -- a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability -- the fraction of student-crafted adversarial examples that remain effective against the teacher -- as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods.
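The reweighting step can be sketched in a few lines. The normalization scheme below is our assumption; the paper only specifies that weights follow measured transferability:

```python
def transfer_weighted_loss(per_example_losses, transferability):
    """Sample-wise reweighting in the spirit of SAAD: examples whose
    student-crafted adversarial examples also fool the teacher
    (high transferability) receive larger weight in the distillation loss."""
    total = sum(transferability)
    weights = [t / total for t in transferability]
    return sum(w * l for w, l in zip(weights, per_example_losses))

# Toy batch: the third example's attacks never transfer, so it is ignored.
loss = transfer_weighted_loss([1.0, 2.0, 4.0], [0.9, 0.6, 0.0])
```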

URL: https://openreview.net/forum?id=ek45VamPCE

---

Title: HyperDG: Hyperbolic Representation Alignment for Robust Domain Generalization via Curvature Refinement

Abstract: Domain generalization often suffers from geometric inconsistencies in representations learned across multiple source domains. Although recent approaches pursue flat minima or invariant features, they remain restricted to Euclidean space, overlooking the inherently curved nature of real data manifolds. We introduce HyperDG, a hyperbolic representation learning framework that models each domain as a Lorentz manifold with learnable negative curvature and enforces cross-domain consistency through a self-feedback mechanism alternating between local adaptation, tangent-space mapping, and global manifold adjustment, effectively unifying flat-minima consistency with non-Euclidean representation learning within a single optimization process. By jointly optimizing model parameters and manifold curvature, the framework learns a shared meta-manifold that preserves invariance across domains while maintaining hierarchical structure within each.
Extensive experiments on standard domain generalization benchmarks show consistent improvements in accuracy, robustness, and out-of-distribution performance, demonstrating that embracing hyperbolic representation spaces rather than flattening them leads to geometry-consistent and domain-resilient generalization.

URL: https://openreview.net/forum?id=TSshrjqnXu

---

Title: Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Abstract: We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $\beta \neq 0$ controls the agent's risk attitude: $\beta>0$ for risk-averse and $\beta<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state–action value function (value learning) and an optimal policy (policy learning) under recursive ERM.
We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-ERM-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|\beta|/(1-\gamma)$, where $\gamma$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|\beta|/(1-\gamma)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.
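For readers unfamiliar with ERM, one common convention consistent with the abstract's sign convention ($\beta > 0$ risk-averse) is

```latex
\mathrm{ERM}_{\beta}(X) \;=\; -\frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{-\beta X}\right]
\;\approx\; \mathbb{E}[X] \;-\; \frac{\beta}{2}\,\mathrm{Var}(X)
\quad \text{as } \beta \to 0,
```

so $\beta > 0$ penalizes variance (risk-averse) and $\beta < 0$ rewards it (risk-seeking); roughly speaking, the recursive variant applies this operator at every Bellman backup rather than once to the total return.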

URL: https://openreview.net/forum?id=TFwSG4uYwl

---

Title: Efficient Test-time Scaling via Iterative Deepening

Abstract: Recent reasoning models, such as OpenAI’s O1 series, have demonstrated exceptional performance on complex reasoning tasks and revealed new test-time scaling laws. Inspired by this, much subsequent work has studied how to train models to achieve effective self-evaluation and self-correction to further enable this scaling paradigm. However, how to efficiently scale test-time compute from a fixed model is less studied and remains a challenge. In this paper, we focus on whether LLMs can benefit from matching the pattern of correct responses. Specifically, we explore how systematically triggering a model's self-correction mechanisms can improve performance on challenging reasoning tasks. To this end, we propose a novel iterative deepening sampling algorithm framework designed to enhance self-correction and generate higher-quality samples. Through extensive experiments on Math500, AIME, and GPQA-diamond benchmarks, we demonstrate that our method achieves a higher success rate on difficult tasks and provide detailed ablation studies to analyze its effectiveness across diverse settings.
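A minimal sketch of the loop structure such a sampler might take; the trigger phrase, stopping rule, and stub model below are entirely our assumptions for illustration:

```python
def iterative_deepening_sample(generate, verify, prompt, max_rounds=4,
                               trigger="\nWait, let me re-check my answer."):
    """Re-prompt with a self-correction trigger until a response verifies
    or the sampling budget is exhausted (then return the last attempt)."""
    context = prompt
    response = ""
    for _ in range(max_rounds):
        response = generate(context)
        if verify(response):
            return response
        context = context + response + trigger
    return response

# Stub model: answers wrongly until it has been nudged twice.
def stub_generate(ctx):
    return "42" if ctx.count("re-check") >= 2 else "41"

answer = iterative_deepening_sample(stub_generate, lambda r: r == "42", "Q: ...")
```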

URL: https://openreview.net/forum?id=oSNRwIM6hU

---

Title: Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision

Abstract: Energy-based models (EBMs) are a flexible class of deep generative models and are well-suited to capture complex dependencies in multimodal data. However, learning multimodal EBM by maximum likelihood requires Markov Chain Monte Carlo (MCMC) sampling in the joint data space, where noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships. Multimodal VAEs have made progress in capturing such inter-modal dependencies by introducing a shared latent generator and a joint inference model. However, both the shared latent generator and joint inference model are parameterized as unimodal Gaussian (or Laplace), which severely limits their ability to approximate the complex structure induced by multimodal data. In this work, we study the learning problem of the multimodal EBM, shared latent generator, and joint inference model. We present a learning framework that effectively interweaves their MLE updates with corresponding MCMC refinements in both the data and latent spaces. Specifically, the generator is learned to produce coherent multimodal samples that serve as strong initial states for EBM sampling, while the inference model is learned to provide informative latent initializations for generator posterior sampling. Together, these two models serve as complementary models that enable effective EBM sampling and learning, yielding realistic and coherent multimodal EBM samples. Extensive experiments demonstrate superior performance for multimodal synthesis quality and coherence compared to various baselines. We conduct various analyses and ablation studies to validate the effectiveness and scalability of the proposed multimodal framework.

URL: https://openreview.net/forum?id=ZVD7bHNpY1

---

Title: LiteXrayNet: Bilateral Asymmetry-Aware Attention for Lightweight Pediatric Pneumonia Detection

Abstract: Pediatric pneumonia remains a major cause of mortality among children under five, with the greatest burden in resource-constrained settings where access to timely diagnosis is limited. Although deep learning methods have achieved strong performance in chest X-ray analysis, many existing approaches rely on large models that are difficult to deploy in such environments and do not explicitly account for the bilateral anatomical structure that radiologists routinely use during interpretation. We present LiteXrayNet, a lightweight convolutional neural network that incorporates Bilateral Asymmetry Attention (BAA), a geometry-guided attention mechanism designed to model left-right lung correspondence through spatial splitting, horizontal flipping, and adaptive feature gating. With only 127K parameters, LiteXrayNet achieves competitive pneumonia classification performance, attaining an F1 score of 97.31% and an accuracy of 97.90%, while supporting real-time inference on edge hardware with latencies of 4.11 ms on GPU and 14.53 ms on CPU. Feature-level bilateral asymmetry analysis indicates that BAA induces representations that differ systematically from those produced by generic attention mechanisms, while Grad-CAM visualizations suggest anatomically structured attention patterns consistent with common radiological reasoning. These results suggest that incorporating domain-specific anatomical priors as architectural constraints can support efficient and interpretable models suitable for deployment in resource-limited clinical settings.

URL: https://openreview.net/forum?id=dsu8ZAL4LJ

---

Title: An Information-Theoretic Framework for Training-Dependent Memory in Neural Sequence Models

Abstract: State-space models trained with identical architectures exhibit vastly different long-range retrieval performance depending solely on training procedure. Standard next-token prediction produces models that fail on tasks requiring precise recall, while multi-objective curricula enable the same architecture to approach Transformer-level accuracy. This training-induced capacity gap cannot be explained by existing theories, which treat representational capacity as an architectural property.
We resolve this puzzle by formalizing fixed-dimensional hidden states as communication channels where capacity depends on both bandwidth (dimension) and signal-to-noise ratio: the degree to which learned features align with task-relevant information versus interference. We prove that multi-objective training systematically increases task-aligned signal while suppressing noise (Lemmas 1-2), yielding strictly higher effective SNR without architectural modification (Theorem 1). This establishes that training can alter effective capacity within fixed dimension by reallocating representational energy across subspaces.
The framework distinguishes three architectural regimes through qualitative capacity bounds: fixed-state models as single channels, Transformers achieving bandwidth scaling through parallel storage, and training procedures amplifying SNR within fixed dimension. Observed performance patterns are quantitatively consistent with predicted scaling relationships through inverse inference. This demonstrates that representational geometry can be characterized using information-theoretic principles, formalizing how training objectives determine memory capacity in neural sequence models.

URL: https://openreview.net/forum?id=4pkynwvPtZ

---

Title: RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

Abstract: Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by “uniformly” optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better quantized models can be obtained by prioritizing learning from important tokens. Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate weight outliers, (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods. Our code is available in the supplementary material.
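The scaling step in (2)-(3) can be illustrated with importance-weighted second-order statistics of the kind GPTQ-style solvers consume. The function below is a hypothetical sketch, not the paper's implementation; `attn_scores` stands in for the attention-concentration importances:

```python
import numpy as np

def weighted_hessian(X, attn_scores):
    """Importance-weighted second-order statistics (sketch, not RSQ's code).

    X: (T, d) calibration activations; attn_scores: (T,) per-token importance.
    GPTQ-style quantization uses H = X^T X; scaling each token's features
    by the square root of its normalized importance makes important tokens
    dominate H, so quantization error is minimized preferentially for them.
    """
    w = attn_scores / attn_scores.sum()   # normalize importances to sum to 1
    Xs = X * np.sqrt(w)[:, None]          # scale token features
    return Xs.T @ Xs                      # weighted second-order statistics
```

With uniform scores this reduces to the ordinary (averaged) statistics, so the weighting is a strict generalization of the uniform objective the abstract criticizes.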

URL: https://openreview.net/forum?id=kBezrKXHVS

---

Title: SA-PEF: Step-Ahead Partial Error Feedback for Efficient Federated Learning

Abstract: Biased gradient compression with error feedback (EF) reduces communication in federated learning (FL), but under heterogeneous (non-IID) data and local updates, the compression residual can decay slowly. This induces a mismatch between where gradients are evaluated and where the (decompressed) update is effectively applied, often slowing progress in the early rounds. We propose step-ahead partial error feedback (SA-PEF), which introduces a tunable step-ahead coefficient \(\alpha_r\in[0,1]\) and previews only a fraction of the residual while carrying the remainder through standard EF. SA-PEF interpolates smoothly between EF (\(\alpha_r=0\)) and full step-ahead EF (SAEF; \(\alpha_r=1\)). For nonconvex objectives with \(\delta\)-contractive compressors, we develop a second-moment bound and a residual recursion that yield nonconvex stationarity guarantees under data heterogeneity and partial client participation. With a constant inner stepsize, the bound exhibits the standard \(\mathcal{O}\!\bigl((\eta\,\eta_0TR)^{-1}\bigr)\) optimization term and an \(R\)-independent variance/heterogeneity floor induced by biased compression. Our analysis highlights a step-ahead-controlled residual contraction factor \(\rho_r\), explaining the observed early-phase acceleration, and suggests choosing \(\alpha_r\) near a theory-predicted optimum to balance SAEF’s rapid warm-up with EF’s long-run stability. Experiments across architectures, datasets, and compressors show that SA-PEF consistently reaches target accuracy in fewer communication rounds than EF.
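Reading the abstract literally, the step-ahead coefficient previews a fraction \(\alpha_r\) of the residual while carrying the remainder through standard EF. One plausible numpy sketch of a single communication step (the paper's exact update rule may differ):

```python
import numpy as np

def topk_compress(v, k):
    """A delta-contractive compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def sa_pef_step(grad, residual, alpha, k):
    """One SA-PEF communication step (interpretation of the abstract).

    Standard EF compresses grad + residual and carries the uncompressed
    remainder. SA-PEF additionally previews a fraction alpha of that
    remainder in the applied update, interpolating between EF (alpha=0)
    and full step-ahead EF (alpha=1).
    """
    msg = topk_compress(grad + residual, k)   # transmitted (compressed) part
    new_residual = grad + residual - msg      # uncompressed remainder
    applied = msg + alpha * new_residual      # step-ahead preview
    carried = (1 - alpha) * new_residual      # remainder kept for next round
    return applied, carried
```

Note the invariant `applied + carried == grad + residual`: no gradient information is lost, only deferred, regardless of \(\alpha_r\).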

URL: https://openreview.net/forum?id=ejnVWfknCm

---

Title: Privacy Leakage via Output Label Space and Differentially Private Continual Learning

Abstract: Differential privacy (DP) is a formal privacy framework that enables training machine learning (ML) models while protecting individuals' data. As pointed out by prior work, ML models are part of larger systems, which can lead to so-called privacy side-channels even if the model training itself is DP. We identify the output label space of a classification model as such a privacy side-channel and show a concrete privacy attack that exploits it. The side-channel becomes highly relevant in continual learning (CL), where the output label space changes over time. To reason about privacy guarantees in CL, we introduce a formalisation of DP for CL, which also clarifies how our approach differs from existing approaches. We propose and evaluate two methods for eliminating this side-channel: applying an optimal DP mechanism to release the labels in the sensitive data, and using a large public label space. We explore the trade-offs of these methods through adapting pre-trained models. We demonstrate empirically that our models consistently achieve higher accuracy under DP than previous work over both Split-CIFAR-100 and Split-ImageNet-R, with a stronger privacy model.

URL: https://openreview.net/forum?id=ZshFgRQWrm

---

Title: Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping

Abstract: Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority, while significantly reducing the accuracy for others. We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression. Our method improves worst-class accuracy by over 10 percentage points on Skewed and Fashion MNIST compared to unbounded adaptive clipping, 7 points compared to Automatic clipping, and 5 points compared to constant clipping. The code is available at https://anonymous.4open.science/r/adaptive-clipping-DPDL.
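The fix described in the abstract, flooring the adaptive clipping bound, can be sketched in a few lines. The quantile-based adaptive estimate below is an assumption for illustration, since the abstract does not specify how the bound is adapted:

```python
import numpy as np

def bounded_adaptive_clip(grads, quantile, lower_bound):
    """Per-sample gradient clipping with a floored adaptive bound (sketch).

    Adaptive clipping often tracks a quantile of per-sample gradient norms;
    on a well-fitting majority that quantile can collapse toward zero and
    suppress minority-group gradients. Flooring it at lower_bound prevents
    the collapse described in the abstract.
    """
    norms = np.linalg.norm(grads, axis=1)
    C = max(np.quantile(norms, quantile), lower_bound)   # bounded bound
    scale = np.minimum(1.0, C / np.maximum(norms, 1e-12))
    return grads * scale[:, None], C
```

When the majority's gradients are tiny, the floor keeps the bound at `lower_bound`, so larger gradients from hard samples survive up to that norm instead of being scaled to near zero.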

URL: https://openreview.net/forum?id=UlzcKSHVoN

---

Title: Think2SQL: Blueprinting Reward Density and Advantage Scaling for Effective Text-to-SQL Reasoning

Abstract: While Large Language Models (LLMs) have advanced the state-of-the-art in Text-to-SQL, robust reasoning in complex, multi-table environments remains a bottleneck for parameter-efficient models. This paper presents a systematic empirical study on injecting reasoning capabilities into Text-to-SQL through the lens of Reinforcement Learning with Verifiable Rewards (RLVR). We uncover a critical interplay between reward density, advantage scaling, and model capacity. Our analysis yields four primary insights. First, we propose a novel execution-guided dense reward function that significantly outperforms binary signals and existing state-of-the-art rewards by providing granular feedback at the instance level. Second, we analyze the mechanics of advantage calculation, demonstrating that while large models thrive on sparse signals with aggressive advantage scaling, smaller models require dense rewards and conservative scaling to improve Text-to-SQL performance. Third, we evaluate the impact of cold start, showing that distillation does not always benefit RLVR performance, and supervised fine-tuned models are prone to distributional mimicry. Fourth, we map the Pareto frontier of training efficiency, providing insights for optimizing Text-to-SQL reasoning under computational constraints. Our findings culminate in the Think2SQL family: our 4B-parameter model demonstrates reasoning capabilities competitive with state-of-the-art models such as o3. We release our models, datasets, and code to create a blueprint for RLVR optimization in Text-to-SQL at https://anonymous.4open.science/r/Think2SQL-3B7F.

URL: https://openreview.net/forum?id=NxU1KWnpOG

---

Title: Modelling Complex Tabular Datasets with a Mixture of Diverse Generative Models.

Abstract: Generative models are widely used, yet they often struggle to capture the multi-modal structure of complex tabular datasets. We address this challenge by introducing a novel framework that employs mixtures of diverse generators, each specialized to different regions of the data space. Our method proceeds in two stages: first, generators are assigned to data clusters via a compute-efficient bandit-based allocation strategy; second, cluster assignments are refined through an iterative procedure inspired by the Expectation–Maximization (EM) framework. Crucially, our approach is designed for settings where the generators’ likelihoods are intractable and only generated data samples are accessible. We provide theoretical guarantees by establishing convergence rates of the mixture distribution under approximate cluster identification. Empirical evaluations on both synthetic and real-world tabular datasets demonstrate that our approach produces high-quality synthetic data, validating its effectiveness in challenging generative modeling tasks.

URL: https://openreview.net/forum?id=3y3mHAldp7

---

Title: Centrality Graph Shift Operators for Graph Neural Networks

Abstract: Graph Shift Operators (GSOs), such as the adjacency and graph Laplacian matrices, play a fundamental role in graph theory and graph representation learning. Traditional GSOs are typically constructed by normalizing the adjacency matrix by the degree matrix, a local centrality metric. In this work, we instead propose and study Centrality GSOs (CGSOs), which normalize adjacency matrices by global centrality metrics such as the PageRank, $k$-core or count of fixed length walks. We study spectral properties of the CGSOs, allowing us to get an understanding of their action on graph signals. This understanding is confirmed by defining and running the spectral clustering algorithm based on different CGSOs on several synthetic and real-world datasets. We furthermore outline how our CGSO can act as the message passing operator in any Graph Neural Network and in particular demonstrate strong performance of a variant of the Graph Convolutional Network and Graph Attention Network using our CGSOs on several real-world datasets.
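As a concrete, hypothetical instance of the idea: replacing the degree matrix in the familiar symmetric GCN normalization with a PageRank-based centrality matrix gives one possible CGSO. The paper's exact normalization exponents may differ; this is only a sketch:

```python
import numpy as np

def pagerank(A, d=0.85, iters=100):
    """Power-iteration PageRank on an adjacency matrix (no dangling nodes)."""
    n = A.shape[0]
    deg = A.sum(1, keepdims=True)
    P = A / np.maximum(deg, 1)           # row-stochastic transition matrix
    r = np.ones(n) / n
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)
    return r

def centrality_gso(A, c, eps=1e-12):
    """Symmetric centrality normalization C^{-1/2} A C^{-1/2} (sketch).

    The standard GCN operator uses the degree matrix here; a CGSO swaps
    in a global centrality vector c such as PageRank or k-core numbers.
    """
    inv_sqrt = 1.0 / np.sqrt(np.maximum(c, eps))
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]
```

The resulting operator stays symmetric for undirected graphs, so its spectrum remains real, which is what makes the spectral analysis and spectral clustering described in the abstract possible.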

URL: https://openreview.net/forum?id=Btd0SIpoO4

---

Title: Identifiable Latent Bandits: Leveraging observational data for personalized decision-making

Abstract: Sequential decision-making algorithms such as multi-armed bandits can find optimal personalized decisions, but are notoriously sample-hungry. In personalized medicine, for example, training a bandit from scratch for every patient is typically infeasible, as the number of trials required is much larger than the number of decision points for a single patient. To combat this, latent bandits offer rapid exploration and personalization beyond what context variables alone can offer, provided that a latent variable model of problem instances can be learned consistently. However, existing works give no guidance as to how such a model can be found. In this work, we propose an identifiable latent bandit framework that leads to optimal decision-making with a shorter exploration time than classical bandits by learning from historical records of decisions and outcomes. Our method is based on nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances. We verify this strategy in simulated and semi-synthetic environments, showing substantial improvement over online and offline learning baselines when identifying conditions are satisfied.

URL: https://openreview.net/forum?id=SvkZ76wKpu

---

Title: Likelihood-based Fine-tuning of Protein Language Models for Few-shot Fitness Prediction and Design

Abstract: Machine learning models trained on measurements of protein functional properties are widely used to accelerate laboratory-based protein design campaigns. To maximise the signal that can be extracted from limited experimental data, sequence embeddings produced by protein language models (PLMs) are often used as the basis of supervised fitness predictors. However, embedding-based predictors do not directly exploit the distributional information encoded in PLM likelihoods after self-supervised or generative pretraining on natural protein sequences. In contrast, likelihood-based fine-tuning approaches exploit this prior knowledge by directly updating pretrained PLM likelihoods to reflect observed fitness differences between sequences. While likelihood-based fine-tuning methods have been proposed previously, a conclusive comparison of their performance against state-of-the-art embedding-based methods has been lacking. To address this gap, we conduct a comprehensive empirical evaluation of both fine-tuning strategies on a representative set of protein fitness datasets from the ProteinGym benchmark. To ensure our evaluation is applicable across different PLM classes, we develop a simple, unified framework for likelihood-based fine-tuning that applies to models trained with various objectives. Across model classes and fitness datasets, likelihood-based fine-tuning consistently outperforms embedding-based methods previously reported as state-of-the-art, with the largest gains in low-data settings. Finally, to highlight the practical relevance of these findings, we demonstrate that the best-performing fine-tuning strategies can substantially improve the maximal fitness of designed sequences in multi-round in silico optimisation campaigns.

URL: https://openreview.net/forum?id=vfTcUT220j

---

Title: LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits

Abstract: Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks.
Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that our LoRAQuant uses significantly lower bits than other quantization methods, but achieves comparable or even higher performance.
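The SVD reparameterization can be sketched as follows; the split point `r_hi` and the bit-widths are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def quantize_sym(W, bits):
    """Uniform symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax if W.size else 1.0
    return np.round(W / max(scale, 1e-12)) * scale

def loraquant_sketch(B, A, r_hi, hi_bits=8, lo_bits=2):
    """Mixed-precision quantization of a LoRA adapter via SVD (sketch).

    SVD of the adapter product BA concentrates its energy in the leading
    singular directions; the first r_hi components are kept at hi_bits
    and the remainder at lo_bits, then the quantized factors are remultiplied.
    """
    U, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
    Bp, Ap = U * s, Vt                      # reparameterized factors
    Bq = np.concatenate([quantize_sym(Bp[:, :r_hi], hi_bits),
                         quantize_sym(Bp[:, r_hi:], lo_bits)], axis=1)
    Aq = np.concatenate([quantize_sym(Ap[:r_hi], hi_bits),
                         quantize_sym(Ap[r_hi:], lo_bits)], axis=0)
    return Bq @ Aq
```

Because LoRA updates are low-rank by construction, most of the adapter's energy sits in a handful of singular directions, which is why spending the extra bits only there can preserve quality at a very low average bitwidth.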

URL: https://openreview.net/forum?id=71svCWi178

---

Title: RPWithPrior: Label Differential Privacy in Regression

Abstract: With the wide application of machine learning techniques in practice, privacy preservation has gained increasing attention. Protecting user privacy with minimal accuracy loss is a fundamental task in the data analysis and mining community. In this paper, we focus on regression tasks under $\epsilon$-label differential privacy guarantees. Some existing methods for regression with $\epsilon$-label differential privacy, such as the RR-On-Bins mechanism, discretize the output space into finite bins and then apply the randomized response (RR) algorithm. To efficiently determine these finite bins, the authors rounded the original responses down to integer values. However, such operations do not align well with real-world scenarios. To overcome these limitations, we model both original and randomized responses as continuous random variables, avoiding discretization entirely. Our novel approach estimates an optimal interval for randomized responses and introduces new algorithms designed for scenarios where a prior is either known or unknown. Additionally, we prove that our algorithm, RPWithPrior, guarantees $\epsilon$-label differential privacy and provide an error analysis. Numerical results demonstrate that our approach outperforms the Gaussian, Laplace, Staircase, RR-On-Bins, and Unbiased mechanisms on the Communities and Crime, Criteo Sponsored Search Conversion Log, and California Housing datasets.

URL: https://openreview.net/forum?id=FiUe0OCMaj

---

Title: PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

Abstract: Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields, a fundamental task in numerous scientific domains. Neural networks, and particularly neural operators, are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations.
While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem (mapping physical parameters to solution fields), to our knowledge there are no comparably comprehensive studies and benchmark datasets for PDE inverse problems (mapping solution fields to underlying physical parameters). We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance; (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies; and (3) scaling, which we perform with respect to both model and data size.
Our experiments reveal several practical insights: (1) neural networks perform best with a two-stage training procedure, initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual; (2) incorporating PDE derivatives as input features consistently improves accuracy; and (3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and evaluation codebase freely available to facilitate reproducibility and further development of our work.

URL: https://openreview.net/forum?id=MSjhqRnNyZ

---

Title: M3Ret: Unleashing Zero-shot Multi-Modal Medical Image Retrieval via Self-Supervision

Abstract: Medical image retrieval is essential for clinical decision-making and translational research, relying on discriminative visual representations. Yet, current methods remain fragmented, relying on separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations.
To enable unified learning, we curate a large-scale hybrid-modality dataset comprising 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms.
Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks, despite never observing MRI during pretraining, demonstrating the generalizability of purely visual self-supervision to unseen modalities.
Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.

URL: https://openreview.net/forum?id=VgmIrgbzkX

---

Title: CADO: From Imitation to Cost Minimization for Heatmap-based Solvers in Combinatorial Optimization

Abstract: Heatmap-based solvers have emerged as a promising paradigm for Combinatorial Optimization (CO). However, we argue that the dominant Supervised Learning (SL) training paradigm suffers from a fundamental objective mismatch: minimizing imitation loss (e.g., cross-entropy) does not guarantee solution cost minimization. We dissect this mismatch into two deficiencies: Decoder-Blindness (being oblivious to the non-differentiable decoding process) and Cost-Blindness (prioritizing structural imitation over solution quality). We empirically demonstrate that these intrinsic flaws impose a hard performance ceiling. To overcome this limitation, we propose CADO (Cost-Aware Diffusion models for Optimization), a streamlined Reinforcement Learning fine-tuning framework that formulates the diffusion denoising process as an MDP to directly optimize the post-decoded solution cost. We introduce Label-Centered Reward, which repurposes ground-truth labels as unbiased baselines rather than imitation targets, and Hybrid Fine-Tuning for parameter-efficient adaptation. CADO achieves state-of-the-art performance across diverse benchmarks, validating that objective alignment is essential for unlocking the full potential of heatmap-based solvers.

URL: https://openreview.net/forum?id=fvxx5FOED6

---

Title: A Survey of Linear Attention: Algorithm, Theory, Application, and Infrastructure

Abstract: Large Language Models (LLMs) have proven effective in understanding and generating extremely long contexts.
Recently, linear attention mechanisms have garnered significant attention, as they reduce the quadratic computational complexity of traditional attention mechanisms to linear complexity in token sequence length, thus balancing effectiveness and efficiency in LLM training and inference. This survey focuses on a broad spectrum of linear attention techniques, including traditional linear attention methods, the state space model (SSM) series, and linear recurrent neural networks (RNNs). These methods enable implicit historical information integration via state propagation, and achieve an approximately constant memory footprint as well as linear time complexity in sequence modeling tasks. Beyond algorithmic designs and model architectures, we further explore the characteristics, challenges, and successful applications of linear attention from a more comprehensive perspective. We also discuss the essential factors for practical hybrid frameworks, robust and efficient infrastructure, and scenario-specific features of downstream tasks, which jointly contribute to the successful deployment of linear attention mechanisms.
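The constant-memory, linear-time recurrence these methods share can be sketched for the causal case. The feature map `phi` below is one common choice (a ReLU-plus-one map, similar in spirit to the elu+1 map from early linear-attention work) and is not tied to any specific method in the survey:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1):
    """Causal linear attention via a running state, O(T) time (sketch).

    Replaces exp(q.k) with phi(q).phi(k), so the entire history folds
    into a constant-size state S (and normalizer z), updated one token
    at a time instead of attending over all previous tokens.
    """
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))   # running sum of outer products k v^T
    z = np.zeros(d)         # running sum of keys, for normalization
    out = []
    for q, k, v in zip(phi(Q), phi(K), V):
        S += np.outer(k, v)
        z += k
        out.append((q @ S) / max(q @ z, 1e-12))
    return np.array(out)
```

The state `S` has shape (d, dv) regardless of sequence length, which is exactly the "approximately constant memory footprint" the survey refers to; at the first position the output is the first value vector, since only one token is in the state.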

URL: https://openreview.net/forum?id=ilkVX8aGmQ

---

Title: CauFR-TS: Causal Time-Series Identifiability via Factorized Representations

Abstract: Causal discovery from multivariate time series is a fundamental problem for interpretable modelling, causality-aware downstream analysis, and intervention-driven simulation. Recent neural approaches commonly rely on shared latent embeddings to capture temporal dynamics and utilize them for causal structure estimation and downstream prediction. We formally establish that such shared encoders entangle distinct causal mechanisms into a unified latent manifold, which exhibits fundamental theoretical limitations of structural non-identifiability and conditional independence assumptions required for Granger causality. To address these issues, we propose CauFR-TS, a recurrent variational framework that enforces mechanism modularity through dimension-wise encoders and ensures mediation of all cross-variable dependencies through structured latent aggregation. Furthermore, we address the instability of heuristic thresholding in continuous relaxation methods by proposing an adaptive, data-driven unsupervised link selection strategy based on decoder weight distribution. Empirical evaluation on synthetic and in silico biological benchmarks demonstrates that CauFR-TS outperforms recent baselines in graph recovery metrics while preserving competitive probabilistic forecasting performance.

URL: https://openreview.net/forum?id=Al4OnLoQsp

---

Title: When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning

Abstract: Contrastive Forward-Forward (CFF) learning is a layer-local alternative to backpropagation that trains Vision Transformers using supervised contrastive objectives at each layer independently. In practice, CFF can exhibit substantial seed-to-seed variability, complicating reproducibility and hyperparameter selection. We audit one implementation detail inside the supervised contrastive loss: applying the positive-pair margin via saturating similarity clamping, $\min(s + m, 1)$. We compare this against a post-log-probability subtraction reference that we prove is gradient-neutral under the mean-over-positives reduction (Proposition 4.1), thereby isolating the effect of saturation itself. On CIFAR-10 in a $2 \times 2$ factorial ablation ($n=7$ seeds per cell), the clamped variant exhibits $5.90\times$ higher pooled test-accuracy variance ($p=0.003$, bootstrap 95% CI $[1.62, 15.80]$) with no detectable difference in mean accuracy. Clamp activation rates (CAR), layerwise gradient norms, and a reduced-margin dose-response probe jointly indicate that this variance increase is associated with gradient truncation in early transformer layers. However, the effect is dataset-dependent: replication on CIFAR-100 ($\mathrm{VR} = 0.39\times$), SVHN ($\mathrm{VR} = 0.25\times$), and Fashion-MNIST ($\mathrm{VR} = 0.08\times$, $p=0.029$) reveals inverted variance ratios in all three cases. Cross-dataset analysis identifies layer-0 clamp activation rate as a necessary but insufficient condition for variance inflation: CIFAR-10's high L0 CAR (60.7%) co-occurs with the only elevated variance ratio, while CIFAR-100's low L0 CAR (29.0%) and SVHN/Fashion-MNIST's high task accuracy ($>92\%$) each independently suppress the effect. An SVHN difficulty sweep confirms this interaction: increasing augmentation difficulty on the same dataset drives the variance ratio from $0.25\times$ to $16.73\times$. These results characterize the conditions under which margin clamping destabilizes CFF training and offer practical guidance for practitioners.
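The clamping operation at issue is simple to state, and a tiny numpy illustration (with hypothetical margin and similarity values) shows why saturation truncates gradients:

```python
import numpy as np

def clamped_margin(s, m=0.2):
    """Saturating positive-pair margin: min(s + m, 1)."""
    return np.minimum(s + m, 1.0)

def clamped_margin_grad(s, m=0.2):
    """Derivative of min(s + m, 1) w.r.t. the similarity s.

    The gradient is exactly zero once s + m saturates at 1, i.e. the
    learning signal from well-separated positives is truncated, which is
    the mechanism the abstract links to early-layer variance inflation.
    """
    return (s + m < 1.0).astype(float)
```

For cosine similarities above 1 - m the clamp is active and the gradient vanishes, so layers whose positives are already well aligned stop receiving updates from those pairs.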

URL: https://openreview.net/forum?id=EmHvSp7Jm0

---

Title: Sequential Causal Discovery with Noisy Language Model Priors

Abstract: Causal discovery from observational data typically assumes access to complete data and availability of domain experts. In practice, data often arrive in batches, are subject to sampling bias, and expert knowledge is scarce. Language Models (LMs) offer a surrogate for expert knowledge but suffer from hallucinations, inconsistencies, and bias. We present a hybrid framework that bridges these gaps by adaptively integrating sequential batch data with LM-derived noisy, expert knowledge while accounting for both data-induced and LM-induced biases. We propose a representation shift from Directed Acyclic Graph (DAG) to Partial Ancestral Graph (PAG), that accommodates ambiguities within a coherent framework, allowing grounding the global LM knowledge in local observational data. To guide LM interactions, we use a sequential optimization scheme that adaptively queries the most informative edges. Across varied datasets and LMs, we outperform prior work in structural accuracy and extend to parameter estimation, showing robustness to LM noise.

URL: https://openreview.net/forum?id=wFs71JzEO7

---

Title: ExpertLens: Activation steering features are highly interpretable

Abstract: Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific concepts (e.g., ``cat'') using the ``finding experts'' method from research on activation steering and show that the ExpertLens, i.e., inspection of these neurons, provides insights about model representation. We find that ExpertLens representations are stable across models and datasets and closely align with human representations inferred from behavioral data, matching inter-human alignment levels. ExpertLens significantly outperforms the alignment captured by word/sentence embeddings. By reconstructing human concept organization through ExpertLens, we show that it enables a granular view of LLM concept representation. Our findings suggest that ExpertLens is a flexible and lightweight approach for capturing and analyzing model representations.

URL: https://openreview.net/forum?id=FBIsN6RdYO

---

Title: Hardware Acceleration for Neural Networks: A Comprehensive Survey

Abstract: Neural networks have become a dominant computational workload across cloud and edge platforms, but their rapid growth in model size and deployment diversity has exposed hardware bottlenecks that are increasingly dominated by memory movement, communication, and irregular operators rather than peak arithmetic throughput. This survey reviews the current technology landscape for hardware acceleration of deep learning, spanning Graphics Processing Units (GPUs) and tensor-core architectures, domain-specific accelerators (e.g., Tensor Processing Units (TPUs)/Neural Processing Units (NPUs)), Field-Programmable Gate Array (FPGA)-based designs, Application-Specific Integrated Circuit (ASIC) inference engines, and emerging Large Language Model (LLM)-serving accelerators such as Language Processing Units (LPUs), alongside in-/near-memory computing and neuromorphic/analog approaches. We organize the survey using a unified taxonomy across (i) workloads (Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Graph Neural Networks (GNNs), Transformers/Large Language Models (LLMs)), (ii) execution settings (training vs.\ inference; datacenter vs.\ edge), and (iii) optimization levers (reduced precision, sparsity and pruning, operator fusion, compilation and scheduling, and memory-system/interconnect design). We synthesize key architectural ideas such as systolic arrays, vector and Single Instruction, Multiple Data (SIMD) engines, specialized attention and softmax kernels, quantization-aware datapaths, and high-bandwidth memory, and we discuss how software stacks and compilers bridge model semantics to hardware. Finally, we highlight open challenges—including efficient long-context LLM inference (Key-Value (KV)-cache management), robust support for dynamic and sparse workloads, energy- and security-aware deployment, and fair benchmarking—pointing to promising directions for the next generation of neural acceleration.

URL: https://openreview.net/forum?id=Da8LO5NvDU

---

Title: Evading Protections Against Unauthorized Data Usage via Limited Fine-tuning

Abstract: Text-to-image diffusion models, such as Stable Diffusion, have demonstrated exceptional potential for generating high-quality images. However, recent studies have raised concerns about the use of unauthorized data to train these models, which can lead to intellectual property infringement or privacy violations. A promising approach to mitigating these issues is to embed a signature in the model that can be detected or verified from its generated images. Existing works also aim to prevent training on protected images by degrading generation quality, for example by injecting adversarial perturbations into the training data. In this paper, we propose RATTAN, which effectively evades such protection methods by removing protective perturbations from images and inducing catastrophic forgetting of the corresponding learned features in the model. RATTAN leverages the diffusion process to generate controlled images from the protected inputs, preserving high-level features while ignoring the low-level details used by the embedded pattern. A small number of generated images (e.g., 10) are then used to fine-tune a marked model to remove the learned features. Our experiments on four datasets, two different IP protection methods, and 300 text-to-image diffusion models reveal that, while some protections already suffer from weak memorization, RATTAN can reliably bypass stronger defenses, exposing fundamental limitations of current protections and highlighting the need for stronger defenses.

URL: https://openreview.net/forum?id=8xF5KYHRCU

---

Title: Multimodal Video Generation Models with Audio: Present and Future

Abstract: Video generation models have advanced rapidly and are now widely used across entertainment, advertising, filmmaking, and robotics applications such as world modeling and simulation. However, visual content alone is often insufficient for realistic and engaging media experiences—audio is also a key component of immersion and semantic coherence. As AI-generated videos become increasingly prevalent in everyday content, demand has grown for systems that can generate synchronized sound alongside visuals. This trend has driven rising interest in multimodal video generation, which jointly models video and audio to produce more complete, coherent, and appealing outputs. Since late 2025, a wave of multimodal video generation models has emerged, with releases including Veo 3.1, Sora 2,
Kling 2.6, Wan 2.6, OVI, and LTX 2. As multimodal generation technology advances, its impact expands across both daily consumer and industrial domains—revolutionizing daily entertainment while enabling more sophisticated world simulation for training embodied AI systems. In this paper, we provide a comprehensive overview of the multimodal video generation model literature covering the major topics: evolution and common architectures of multimodal video generation models; common post-training methods and evaluation; applications and active research areas of video generation; limitations and challenges of multimodal video generation.

URL: https://openreview.net/forum?id=8i5vInabkm

---

Title: CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Abstract: Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget.
CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard Frobenius-norm dictionary learning problem, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections.
Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
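
The Gram-orthonormalization step described above can be checked in a few lines of NumPy. This is a toy sketch: the weights, dictionary, and column-sparse coefficients below are random placeholders rather than learned factors, and it verifies only the identity that turns the data-aware objective into a standard Frobenius-norm dictionary-learning problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n_calib, d_in, d_out, n_atoms = 64, 16, 12, 8

X = rng.normal(size=(n_calib, d_in))     # calibration activations
W = rng.normal(size=(d_in, d_out))       # weight matrix to compress
D = rng.normal(size=(d_in, n_atoms))     # dense dictionary (placeholder)
C = rng.normal(size=(n_atoms, d_out))    # coefficient matrix
C[rng.random(C.shape) < 0.6] = 0.0       # make the coefficients column-sparse

# Activation-derived Gram orthonormalization: G = X^T X = L L^T (Cholesky)
L = np.linalg.cholesky(X.T @ X)

functional_err = np.linalg.norm(X @ (W - D @ C))    # error in layer outputs
frobenius_err = np.linalg.norm(L.T @ (W - D @ C))   # reformulated objective
```

Because ||X M||_F^2 = tr(M^T X^T X M) = ||L^T M||_F^2, the two quantities agree exactly, so an off-the-shelf dictionary-learning solver applied to L^T W minimizes the functional reconstruction error.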

URL: https://openreview.net/forum?id=N8WUKWDy5C

---

Title: Autoregressive Models for Knowledge Graph Generation

Abstract: Knowledge Graph (KG) generation requires models to learn complex semantic dependencies between triples while maintaining domain validity constraints. Unlike link prediction, which scores triples independently, generative models must capture interdependencies across entire subgraphs to produce semantically coherent structures. We present ARK (Auto-Regressive Knowledge Graph Generation), a family of autoregressive models that generate KGs by treating graphs as sequences of (head, relation, tail) triples. ARK learns implicit semantic constraints directly from data, including type consistency, temporal validity, and relational patterns, without explicit rule supervision. On the IntelliGraphs benchmark, our models achieve 89.2% to 100.0% semantic validity across diverse datasets while generating novel graphs not seen during training. We also introduce SAIL, a variational extension of ARK that enables controlled generation through learned latent representations, supporting both unconditional sampling and conditional completion from partial graphs. Our analysis reveals that model capacity (hidden dimensionality >= 64) is more critical than architectural depth for KG generation, with recurrent architectures achieving comparable validity to transformer-based alternatives while offering substantial computational efficiency. These results demonstrate that autoregressive models provide an effective framework for KG generation, with practical applications in knowledge base completion and query answering.
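
The core serialization idea, treating a KG as a flat sequence of (head, relation, tail) tokens that an autoregressive model consumes left to right, can be sketched as follows (the entity and relation names are invented for illustration):

```python
# Serialize a knowledge graph as a token sequence for autoregressive modeling,
# then recover the triples; entity/relation names are invented examples.
triples = [("alice", "knows", "bob"), ("bob", "works_at", "acme")]

def serialize(kg):
    tokens = []
    for h, r, t in kg:
        tokens += [h, r, t]          # flatten each (head, relation, tail) triple
    return tokens

def deserialize(tokens):
    return [tuple(tokens[i:i + 3]) for i in range(0, len(tokens), 3)]

seq = serialize(triples)             # the sequence an autoregressive model sees
```

An autoregressive model then factorizes p(seq) into next-token conditionals, which is where cross-triple dependencies such as type consistency and relational patterns can be learned without explicit rules.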

URL: https://openreview.net/forum?id=xhy0tB4uzb

---

Title: Bridging Mechanistic Interpretability and Prompt Engineering with Gradient Ascent for Interpretable Persona Control

Abstract: Controlling emergent behavioral personas (e.g., sycophancy, hallucination) in Large Language Models (LLMs) is critical for AI safety, yet remains a persistent challenge. Existing solutions face a dilemma: manual prompt engineering is intuitive but unscalable and imprecise, while automatic optimization methods are effective but operate as "black boxes" with no interpretable connection to model internals. We propose a novel framework that adapts gradient ascent to LLMs, enabling targeted prompt discovery. Specifically, we propose two methods, RESGA and SAEGA, both of which optimize randomly initialized prompts so that their representations align more closely with an identified persona direction. We introduce fluent gradient ascent to control the fluency of discovered persona-steering prompts. We demonstrate RESGA and SAEGA’s effectiveness across Llama 3.1, Qwen 2.5, and Gemma 3 for steering three different personas: sycophancy, hallucination, and myopic reward. Crucially, on sycophancy, our automatically discovered prompts achieve a significant improvement (49.90% compared with 79.24%). By grounding prompt discovery in mechanistically meaningful features, our method offers a new paradigm for controllable and interpretable behavior modification.

URL: https://openreview.net/forum?id=dcmHPxgo4c

---

Title: Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration

Abstract: Multi-expert systems, where multiple Large Language Models (LLMs) collaborate to solve complex tasks, are increasingly adopted for high-performance reasoning and generation. However, the orchestration policies governing expert interaction and sequencing remain largely opaque. We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation, enabling the decoupling of expert interaction structure, execution order, and causal attribution. We use INFORM to evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts drawn from LLaMA-3.1 8B, Qwen3 8B, and DeepSeek-R1 8B, with controlled decoding-temperature variation, and a secondary heterogeneous consortium spanning 1B-7B parameter models. Across tasks, routing dominance is a poor proxy for functional necessity. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient-based causal attribution: frequently selected experts often act as interaction hubs with limited causal influence, while sparsely routed experts can be structurally critical. Orchestration behaviors emerge asynchronously, with expert centralization preceding stable routing confidence and expert ordering remaining non-deterministic. Targeted ablations show that masking intrinsically important experts induces disproportionate collapse in interaction structure compared to masking frequent peers, confirming that INFORM exposes causal and structural dependencies beyond accuracy metrics alone.

URL: https://openreview.net/forum?id=4W7sgat04A

---

Title: Fractal Predictive Operators: Learnable Iterated Function Systems for Multi-Scale Latent Modeling

Abstract: Joint Embedding Predictive Architectures (JEPAs) rely on latent-space prediction to learn
representations without explicit reconstruction. While effective, their predictors are typically
implemented as shallow feed-forward networks, offering limited control over multi-step dynamics and
stability. We introduce Learnable Iterated Function Systems (LIFS), a contractive predictive operator
that replaces the standard JEPA predictor with a learned mixture of affine maps applied recursively
in latent space. Mixture weights are generated conditionally on the context embedding, allowing
the operator to adapt its local geometry across spatial locations and inputs. LIFS does not change
the training objective or encoder architecture, but explicitly constrains predictor dynamics through
spectral control and adaptive gating. Additionally, our analysis unifies spectral control, exponential
moving average (EMA) updates, and predictive convergence through a contraction-based perspective.
Empirically, integrating LIFS into JEPA improves training stability and yields consistent, though
moderate, gains in linear probing accuracy, particularly for ViT-based encoders and non-overlapping
prediction settings. These results highlight predictor dynamics as an important and underexplored
design axis in self-supervised learning.
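
A minimal NumPy sketch of such a contractive operator is below. It assumes a simple softmax gating over affine maps whose spectral norms are rescaled below one (the gating parameters and dimensions are illustrative, not the paper's); under these assumptions the mixture is itself a contraction, so two latents iterated under the same context collapse together.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, gamma = 4, 3, 0.9

A = rng.normal(size=(K, d, d))
for k in range(K):                        # spectral control: ||A_k||_2 <= gamma < 1
    A[k] *= gamma / np.linalg.norm(A[k], 2)
b = rng.normal(size=(K, d))
W_gate = rng.normal(size=(d, K))          # illustrative gating parameters

def lifs_step(z, context):
    logits = context @ W_gate
    w = np.exp(logits - logits.max())
    w /= w.sum()                          # context-conditioned mixture weights
    return sum(w[k] * (A[k] @ z + b[k]) for k in range(K))

context = rng.normal(size=d)
z1, z2 = rng.normal(size=d), rng.normal(size=d)
for _ in range(300):                      # recursive application in latent space
    z1, z2 = lifs_step(z1, context), lifs_step(z2, context)
gap = np.linalg.norm(z1 - z2)             # shrinks by at least gamma per step
```

Since a convex combination of gamma-contractions is itself a gamma-contraction, the two trajectories converge to the same context-dependent fixed point, which is the stability property the paper exploits.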

URL: https://openreview.net/forum?id=k2Z2gPOtlq

---

Title: A Survey on the Abstraction and Reasoning Corpus

Abstract: Chollet (2019) proposed a definition of intelligence that emphasizes efficiency in skill acquisition rather than performance on a predefined set of tasks, and introduced the Abstraction and Reasoning Corpus (ARC-v1, or ARC-AGI-1) as a challenge benchmark for machine learning research.
In the following years, ARC and the associated competitions have highlighted fundamental limitations of classical deep learning approaches and underscored the need for new ideas in abstract reasoning. This has incentivized extensive trial-and-error exploration, resulting in a wide variety of methods applied to the corpus.
With the release of ARC-v2 in March 2025, this literature survey provides a systematic breadth-first overview of the methods applied to ARC-v1 in the six years since its introduction, and covers early developments for ARC-v2 and ARC Prize 2025.
We apply a taxonomy distinguishing inductive approaches (which explicitly construct transformation rules) from transductive approaches (which directly map inputs to outputs), examine the ecosystem of enabling techniques and auxiliary datasets, and synthesize patterns, trade-offs, and underexplored areas across the research landscape.
Our goal is to provide newcomers with a comprehensive foundation for understanding existing approaches and identifying promising research directions in abstract reasoning.

URL: https://openreview.net/forum?id=qzFxBcK9Cg

---

Title: CHyLL: Learning Continuous Neural Representations of Hybrid Systems

Abstract: Learning the flows of hybrid systems with both continuous and discrete dynamics is challenging. Existing methods learn the dynamics of each discrete mode separately, and therefore struggle with the combination of mode switching and discontinuities in the flows. In this work, we propose CHyLL (Continuous Hybrid System Learning in Latent Space), which learns a continuous neural representation of a hybrid system without trajectory segmentation, event functions, or mode switching. The key insight of CHyLL is that the reset map glues the state space at the guard surface, reformulating the state space as a piecewise smooth quotient manifold where the flow becomes spatially continuous. Building upon these insights and the embedding theorems grounded in differential topology, CHyLL concurrently learns a singularity-free neural embedding in a higher-dimensional space and the continuous flow in it. We demonstrate that CHyLL predicts the flows of hybrid systems with superior accuracy and identifies their topological invariants. Finally, we apply CHyLL to the stochastic optimal control problem.

URL: https://openreview.net/forum?id=xK4WQnf7Yj

---

Title: Glocal Smoothness: Line search and adaptive sizes can help in theory too!

Abstract: Iteration complexities for optimizing smooth functions with first-order algorithms are typically stated in terms of a global Lipschitz constant of the gradient, and near-optimal results are then achieved using fixed step sizes. But many objective functions that arise in practice have regions with small Lipschitz constants where larger step sizes can be used. Many local Lipschitz assumptions have been proposed, which have led to results showing that adaptive step sizes and/or line searches yield improved convergence rates over fixed step sizes. However, these faster rates tend to depend on the iterates of the algorithm, which makes it difficult to compare the iteration complexities of different methods. We consider a simple characterization of global and local ("glocal") smoothness that only depends on properties of the function. This allows upper bounds on iteration complexities in terms of iterate-independent constants and enables us to compare iteration complexities between algorithms. Under this assumption it is straightforward to show the advantages of line searches over fixed step sizes and that, in some settings, gradient descent with line search has a better iteration complexity than accelerated methods with fixed step sizes. We further show that glocal smoothness can lead to improved complexities for the Polyak and AdGD step sizes, as well as for other algorithms including coordinate optimization, stochastic gradient methods, accelerated gradient methods, and non-linear conjugate gradient methods.
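
A toy experiment illustrates the gap the paper formalizes. On f(x) = x^4/4 the gradient's local Lipschitz constant (3x^2) varies enormously across regions, so a fixed step sized for the worst-case region crawls, while a backtracking (Armijo) line search adapts its step to the local smoothness. All constants below are illustrative choices, not from the paper.

```python
def f(x):
    return 0.25 * x ** 4        # gradient x^3 has local Lipschitz constant 3x^2

def grad(x):
    return x ** 3

def armijo_gd(x, steps=50, t0=1.0, beta=0.5, c=1e-4):
    # gradient descent with backtracking (Armijo) line search
    for _ in range(steps):
        g = grad(x)
        t = t0
        while f(x - t * g) > f(x) - c * t * g * g:
            t *= beta           # shrink until the sufficient-decrease test holds
        x -= t * g
    return x

def fixed_gd(x, steps=50, t=0.01):
    # fixed step size chosen small enough for the high-curvature region
    for _ in range(steps):
        x -= t * grad(x)
    return x

x_ls = armijo_gd(3.0)           # ends much closer to the minimizer x = 0
x_fix = fixed_gd(3.0)
```

The line search accepts large steps in the flat region near the optimum, which is exactly the regime where fixed steps, sized for the stiff region, waste iterations.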

URL: https://openreview.net/forum?id=be9PdukwEL

---

Title: Affine Invariance in Continuous-Domain Convolutional Neural Networks

Abstract: The notion of group invariance helps neural networks in recognizing patterns and features under geometric transformations. Group convolutional neural networks enhance traditional convolutional neural networks by incorporating group-based geometric structures into their design. This research studies affine invariance in continuous-domain convolutional neural networks. While other research considers isometric or similarity invariance, we focus on the full structure of affine transforms generated by the group of all invertible $2 \times 2$ real matrices (the generalized linear group $\mathrm{GL}_2(\mathbb{R})$). We introduce a new criterion to assess the invariance of two signals under affine transformations. The input image is embedded into the affine Lie group $G_2 = \mathbb{R}^2 \ltimes \mathrm{GL}_2(\mathbb{R})$ to facilitate group convolution operations that respect affine invariance. Then, we analyze the convolution of embedded signals over $G_2$. In sum, our research could eventually extend the scope of geometric transformations that standard deep-learning pipelines can handle.
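
The group structure underlying this embedding can be sanity-checked numerically. The sketch below verifies the defining property of the semidirect product $G_2 = \mathbb{R}^2 \ltimes \mathrm{GL}_2(\mathbb{R})$: acting on a point by a composite element equals acting twice in sequence (the random elements are illustrative; invertibility holds almost surely for random matrices).

```python
import numpy as np

rng = np.random.default_rng(4)

def act(g, x):
    # affine action of g = (t, A) in R^2 x GL_2(R) on a point x: x -> A x + t
    t, A = g
    return A @ x + t

def compose(g1, g2):
    # semidirect-product law: (t1, A1) * (t2, A2) = (t1 + A1 t2, A1 A2)
    (t1, A1), (t2, A2) = g1, g2
    return (t1 + A1 @ t2, A1 @ A2)

g1 = (rng.normal(size=2), rng.normal(size=(2, 2)))
g2 = (rng.normal(size=2), rng.normal(size=(2, 2)))
x = rng.normal(size=2)

lhs = act(compose(g1, g2), x)    # act once by the composite element
rhs = act(g1, act(g2, x))        # act twice in sequence
```

Expanding rhs gives A1(A2 x + t2) + t1 = (A1 A2) x + (t1 + A1 t2), confirming the composition rule that group convolution over $G_2$ relies on.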

URL: https://openreview.net/forum?id=d4ZNyIAtXt

---

Title: Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

Abstract: The recent advancement of Large Language Models (LLMs) offers new opportunities to generate user preference data to warm-start bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can significantly lower early regret. However, these findings assume that LLM-generated choices are reasonably aligned with actual user preferences. In this paper, we systematically examine how LLM-generated preferences perform when random and label-flipping noise is injected into the synthetic training data. For aligned domains, we find that warm-starting remains effective up to 30\% corruption, loses its advantage around 40\%, and degrades performance beyond 50\%. When there is systematic misalignment, even without added noise, LLM-generated priors can lead to higher regret than a cold-start bandit. To explain these behaviors, we develop a theoretical analysis that decomposes the effect of random label noise and systematic misalignment on the prior error driving the bandit’s regret, and derive a sufficient condition under which LLM-based warm starts are provably better than a cold-start bandit. We validate these results across multiple conjoint datasets and LLMs, showing that estimated alignment reliably tracks when warm-starting improves or degrades recommendation quality.
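
The effect of label-flipping noise on the warm-start prior can be seen in closed form. Under a flip rate f, the expected LLM-estimated preference for an arm with true reward probability p is (1-f)p + f(1-p), so the prior error grows linearly in f and the prior becomes uninformative at f = 0.5. The arm probabilities below are illustrative, and this is only a sketch of the noise model, not the paper's full regret decomposition.

```python
import numpy as np

p_true = np.array([0.8, 0.5, 0.2])     # illustrative true arm reward probabilities

def expected_llm_prior(flip_rate):
    # expected prior under label-flipping noise: each preference label is
    # flipped with probability `flip_rate` before the bandit is warm-started
    return (1 - flip_rate) * p_true + flip_rate * (1 - p_true)

# prior error: the quantity that drives warm-start regret as noise increases
errors = [np.abs(expected_llm_prior(f) - p_true).max() for f in (0.0, 0.3, 0.5)]
```

At f = 0.5 every arm's expected prior collapses to 0.5, matching the abstract's observation that warm-starting degrades once corruption passes that level.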

URL: https://openreview.net/forum?id=tojKjqIOBd

---

Title: Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action

Abstract: The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks.
Their performance on certain tasks can be further enhanced by incorporating test-time reasoning techniques.
These inference-time advances have been adopted into the code domain, enabling complex software engineering (SWE) tasks such as code generation, test generation and issue resolution. However, the impact of different reasoning techniques on code-centric SWE tasks has not been systematically explored. In this work, we survey code reasoning techniques that underpin these capabilities, with a focus on test-time compute and inference-time reasoning paradigms.

We examine a variety of code-specific reasoning methods and progressively build up to SWE agents, which combine planning, tool use, and multi-step interaction. We also compare the impact of different techniques on coding tasks, highlighting their relative importance and outlining open challenges and future research directions. Across commonly used models and benchmarks, we find that approaches exploiting code-specific signals (e.g., structure and execution feedback) are frequently associated with improved performance, motivating a dedicated study of code reasoning beyond natural-language reasoning.

Our contributions are: (1) to the best of our knowledge, the first dedicated survey of code reasoning for SWE tasks, highlighting overarching reasoning strategies, hybrid methods, and agentic approaches; (2) a taxonomy of inference-time techniques used to drive code reasoning, accompanied by a curated set of under-explored benchmarks with high potential for SWE evaluation; (3) a comparative analysis of reasoning design patterns across commonly used models and benchmarks; and (4) a synthesis of gaps in current methods and evaluation practices, identifying under-explored areas and concrete opportunities for future research.

URL: https://openreview.net/forum?id=zZa3u6LKwO

---

Title: CS-pFedTM: Communication-Efficient and Similarity-based Personalised Federated Learning with Tsetlin Machine

Abstract: Federated Learning (FL) has emerged as a promising framework for privacy-preserving collaborative model training across decentralised data sources. However, data heterogeneity remains a major challenge, adversely affecting both the performance and efficiency of FL systems. To address this issue, we propose CS-pFedTM (Communication-Efficient and Similarity-based Personalised Federated Learning with Tsetlin Machine), a method that jointly incorporates communication-aware resource allocation and heterogeneity-driven personalisation. CS-pFedTM enforces communication budget constraints through adaptive clause allocation and tailors personalisation by using similarity between clients’ model parameters as a proxy for data heterogeneity. To further enhance scalability, the proposed framework integrates confidence-based aggregation and class-specific weight masking. Extensive experiments show that CS-pFedTM achieves reductions in communication and runtime costs, with up to $1352\times$ and $210\times$ reductions in upload and download communication respectively, and at least $1.43\times$ improvements in runtime efficiency, while maintaining performance comparable to state-of-the-art personalised FL approaches.

URL: https://openreview.net/forum?id=sdwGiofszZ

---

Title: PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

Abstract: With the rapid improvement in the general capabilities of Large Language Models (LLMs), LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's ability to generate responses tailored to explicit personas.
PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty in distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks could fall short on the hard tier of PersonaFeedback where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.

URL: https://openreview.net/forum?id=Q5HRUJuy9g

---

Title: Rectified Flows for Fast Multiscale Fluid Flow Modeling

Abstract: Statistical surrogate modeling of fluid flows is challenging due to multiscale dynamics and strong sensitivity to initial conditions. Conditional diffusion models can achieve high fidelity, but typically require hundreds of stochastic steps at inference.
We introduce a rectified-flow surrogate that learns a time-dependent conditional velocity field transporting input-to-output laws along nearly straight trajectories. Sampling reduces to solving a deterministic ODE along this learned transport, so each function evaluation is substantially more effective: on multi-scale 2D benchmarks we match diffusion-class posterior statistics with as few as $8$ ODE steps versus $\ge\!128$ steps for score-based diffusion.

On the theory side, we develop a law-level analysis tailored to conditional PDE forecasts.
First, we formalize the link between our evaluation criterion—one-point Wasserstein distances on fields—and the $k\!=\!1$ correlation-marginal viewpoint in statistical solutions.
Second, we provide a one-step error decomposition for the learned pushforward law into a \emph{coverage} (high-frequency tail) term controlled by structure functions (equivalently, by spectral decay), and a \emph{fit} term controlled directly by the training objective.
Third, we show how \emph{straightness} in rectification time governs local truncation error for ODE sampling, yielding step-count requirements and explaining why rectified transports admit large, stable steps.

Guided by this picture, we propose a curvature-aware sampler that tracks an EMA-based straightness proxy and adaptively blends and steps the velocity during inference.
Across multiscale incompressible and compressible 2D flows, our method matches diffusion models in Wasserstein statistics and energy spectra, preserves fine-scale structure missed by MSE baselines, and delivers high-resolution conditional samples at a fraction of the inference cost.
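
The few-step advantage of straight transports can be reproduced on a 1D toy problem. Under the monotone coupling between two Gaussians the interpolation paths are straight and non-crossing, so the marginal velocity is constant along each trajectory and Euler integration with very few steps is already exact. The closed-form velocity below is specific to this toy setup, not a learned model.

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, s0, mu1, s1 = 0.0, 1.0, 5.0, 2.0
a, b = s1 / s0, mu1 - (s1 / s0) * mu0     # monotone coupling: x1 = a*x0 + b

def velocity(x, t):
    # marginal velocity of the straight-path interpolation x_t = (1-t)x0 + t*x1
    # under the monotone coupling; paths never cross, so v is well defined
    coef = (1 - t) + t * a
    x0 = (x - t * b) / coef               # invert the interpolation to recover x0
    return (a - 1) * x0 + b               # v = x1 - x0, constant along each path

x = rng.normal(mu0, s0, size=10_000)      # samples from the source law
n_steps = 8                               # few ODE steps, as in the abstract
dt = 1.0 / n_steps
t = 0.0
for _ in range(n_steps):
    x = x + dt * velocity(x, t)           # deterministic Euler step
    t += dt
```

Because v is constant in t along each straight trajectory, Euler steps land exactly on the line, so the transported samples match the target Gaussian N(5, 2^2); curvature in learned velocity fields is what forces more steps, which is the quantity the curvature-aware sampler tracks.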

URL: https://openreview.net/forum?id=2tMD6YXgkp

---

Title: FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Abstract: Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases processing cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50\%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.
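
As a minimal stand-in for the paper's LLM retriever, the sketch below prunes AxTree lines by word overlap with the task goal. The scoring scheme and example tree are invented for illustration; the actual method uses a lightweight LLM to select relevant lines.

```python
import re

def focus_axtree(axtree, goal, keep=3):
    # score each AxTree line by word overlap with the task goal and keep the
    # top-`keep` lines in their original order (toy stand-in for an LLM retriever)
    goal_words = set(re.findall(r"\w+", goal.lower()))
    lines = [l for l in axtree.splitlines() if l.strip()]
    ranked = sorted(
        lines,
        key=lambda l: -len(goal_words & set(re.findall(r"\w+", l.lower()))),
    )
    kept = set(ranked[:keep])
    return "\n".join(l for l in lines if l in kept)

axtree = """button 'Submit order'
link 'Privacy policy'
textbox 'Shipping address'
image 'logo'
button 'Cancel order'"""

pruned = focus_axtree(axtree, "submit the order with my shipping address")
```

Even this crude scorer drops goal-irrelevant lines (the privacy link, the logo) while preserving the actionable elements, shrinking the observation the agent must reason over.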

URL: https://openreview.net/forum?id=mINaJKSy7A

---

Title: Universal Latent Homeomorphic Manifolds: A Framework for Cross-Domain Representation Unification

Abstract: We present the Universal Latent Homeomorphic Manifold (ULHM), a framework that unifies semantic representations (e.g., human descriptions, diagnostic labels) and observation-driven machine representations (e.g., pixel intensities, sensor readings) into a single latent structure. Despite originating from fundamentally different pathways, both modalities capture the same underlying reality. We establish \emph{homeomorphism}, a continuous bijection preserving topological structure, as the mathematical criterion for determining when latent manifolds induced by different semantic-observation pairs can be rigorously unified. When this homeomorphic criterion is satisfied, it enables three critical applications: (1) semantic-guided sparse recovery from incomplete observations, (2) cross-domain transfer learning with verified structural compatibility, and (3) zero-shot compositional learning via valid transfer from semantic to observation space. Our framework learns continuous manifold-to-manifold transformations through conditional variational inference, with training objectives explicitly designed to enforce bi-Lipschitz homeomorphic properties. We develop practical verification algorithms, including trust, continuity, and Wasserstein distance metrics, that empirically validate whether the learned representations achieve homeomorphic structure from finite samples. Experiments demonstrate substantial improvements over state-of-the-art (SOTA) baselines: (1) sparse recovery from 8\% of pixels with much lower MSE than SOTA on CelebA under noise, (2) cross-domain transfer achieving 86.73\% MNIST$\rightarrow$Fashion-MNIST accuracy without retraining, and (3) zero-shot classification achieving 78.76\% on CIFAR-10, exceeding prior work by 16.66\%. 
Critically, the homeomorphism criterion determines when different semantic-observation pairs share compatible latent structure, enabling principled unification into universal representations and providing a mathematical foundation for decomposing general foundation models into domain-specific components.

URL: https://openreview.net/forum?id=YoZSpRWhZH

---

Title: Leveraging Vision-Language Models for Resource Constrained Settings

Abstract: Vision-language models (VLMs) such as CLIP have emerged as extremely strong zero-shot and few-shot image classifiers.
However, these models are often too expensive or cumbersome for resource constrained downstream applications.
In this work, we examine how to best leverage the strength of pretrained VLMs: extracting $\textit{task-specific}$ information to obtain a small model that can be deployed in a very specific, low-resource setting.
We present the SIDCLIP method, a novel training pipeline which drastically improves the performance of small, efficient models, such as EfficientNet B0.
The pipeline includes three components that are critical to obtaining strong performance: 1) augmenting the classifier with $\textit{synthetic data}$ generated by leveraging CLIP itself; 2) $\textit{initializing}$ the modeling process using a smaller CLIP model pretrained on the target architecture; and 3) incorporating $\textit{knowledge distillation}$ to maximally mimic the performance of the larger model.
SIDCLIP improves the performance of an EfficientNet B0 model by an average of $50\%$ on 1-shot versions of four datasets and by an average of $26\%$ on the 8-shot versions, relative to directly trained networks. It additionally approaches CLIP's linear probe performance while using a model with less than $2\%$ of the parameters of CLIP ViT-L/14's image encoder.
We hope our work can be useful as a practical guide for leveraging the power of foundation models in downstream data-scarce and budget constrained settings.
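
Component 3 above, knowledge distillation, is typically implemented as a temperature-scaled KL divergence between teacher (CLIP) and student logits. The sketch below shows that standard form; the temperature and logits are illustrative, not the paper's settings.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, float) / T           # temperature-scaled logits
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=4.0):
    # KL(teacher || student) on temperature-softened distributions, the
    # standard knowledge-distillation objective (T=4 is an illustrative choice)
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

same = distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])   # perfect mimicry
diff = distill_loss([0.0, 0.0, 0.0], [2.0, 0.5, -1.0])    # uninformed student
```

The loss is zero exactly when the student reproduces the teacher's softened distribution, which is the "maximally mimic" behavior the pipeline targets.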

URL: https://openreview.net/forum?id=cYOKSg60jC

---

Title: VoiceAgentBench: Are Voice Assistants Ready For Agentic Tasks?

Abstract: Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks largely focus on isolated capabilities such as transcription or question answering and do not systematically evaluate agentic behavior or adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark for evaluating SpeechLMs in realistic spoken agentic settings, comprising 6,000+ synthetic spoken queries spanning single-tool invocations, multi-tool workflows, multi-turn dialogue, and safety evaluations across English and six Indic languages. To ensure speaker diversity, we further simulate speaker variability using a novel sampling strategy that selects audios for TTS voice conversion based on speaker embeddings to maximize acoustic diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Across agentic tasks, ASR-LLM pipelines outperform end-to-end SpeechLMs, achieving up to 60.6\% average parameter-filling accuracy on English, while SpeechLMs exhibit lower performance and sharper degradation on Indic languages. All models struggle in sequential workflows and safety evaluations, highlighting persistent limitations in tool orchestration, multilingual generalization, and safety robustness.

URL: https://openreview.net/forum?id=mi9q49AR3d

---

Title: Constraint-Aware Flow Matching via Randomized Exploration

Abstract: We consider the problem of designing constraint-aware flow matching (FM) models that address the issue of constraint violations commonly observed in vanilla generative models. We consider two scenarios, viz.: (a) when a differentiable distance function to the constraint set is given, and (b) when the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works that require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but with only the second stage probing the constraints via randomization, is more computationally efficient. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used for training an adversarial example generator, using queries to a hard-label black-box classifier. We conclude with several future research directions.
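
For case (a), the penalized objective can be illustrated with a toy sketch. The ball-shaped constraint set, the sample values, and the weight `lam` are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def dist_to_ball(x, radius=1.0):
    # Differentiable distance to the constraint set {||x|| <= radius}; zero inside.
    return np.maximum(np.linalg.norm(x, axis=-1) - radius, 0.0)

def penalized_fm_loss(v_pred, v_target, x_gen, lam=10.0):
    # Standard flow-matching regression term plus a penalty on constraint
    # violation by the generated samples (case (a) above).
    fm = np.mean(np.sum((v_pred - v_target) ** 2, axis=-1))
    penalty = np.mean(dist_to_ball(x_gen) ** 2)
    return fm + lam * penalty

v_pred = rng.normal(size=(64, 2))
v_target = rng.normal(size=(64, 2))
x_inside = 0.1 * rng.normal(size=(64, 2))   # samples satisfying the constraint
x_outside = 5.0 * rng.normal(size=(64, 2))  # samples mostly violating it

loss_in = penalized_fm_loss(v_pred, v_target, x_inside)
loss_out = penalized_fm_loss(v_pred, v_target, x_outside)
print(loss_in < loss_out)  # True: violating samples incur a larger loss
```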

URL: https://openreview.net/forum?id=OR4h9WPJhV

---

Title: Process Reinforcement through Implicit Rewards

Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (\underline{\textbf{P}}rocess \underline{\textbf{R}}einforcement through \underline{\textbf{IM}}plicit r\underline{\textbf{E}}wards), which enables online PRM updates using only policy rollouts and outcome labels through \textit{implicit process rewards}. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1\% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10\% of its training data.
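
An implicit process reward of this kind is the per-token log-likelihood ratio between the policy and a frozen reference model. A minimal sketch with made-up log-probabilities (PRIME's actual parameterization may differ in detail):

```python
import numpy as np

# Hypothetical per-token log-probs from the policy being trained and a
# frozen reference model, for one sampled rollout (values are illustrative).
logp_policy = np.array([-0.2, -1.1, -0.5, -2.0])
logp_ref    = np.array([-0.3, -1.0, -0.9, -1.5])
beta = 0.05

# Implicit per-token process reward: the scaled log-likelihood ratio.
# Summing the process rewards recovers beta * total log-ratio, so the
# dense rewards stay consistent with a single outcome-level quantity.
process_rewards = beta * (logp_policy - logp_ref)
outcome_reward = beta * (logp_policy.sum() - logp_ref.sum())

print(np.isclose(process_rewards.sum(), outcome_reward))  # True
```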

URL: https://openreview.net/forum?id=9SkkifLopZ

---

Title: The Hidden Cost of Modeling P(x): Vulnerability to Membership Inference Attacks in Generative Text Classifiers

Abstract: Membership Inference Attacks (MIAs) pose a critical privacy threat by enabling adversaries to determine whether a specific sample was included in a model's training dataset. Despite extensive research on MIAs, systematic comparisons between generative and discriminative classifiers remain limited. This work addresses this gap by first providing theoretical motivation for why generative classifiers exhibit heightened susceptibility to MIAs, then validating these insights through comprehensive empirical evaluation.
Our study encompasses discriminative, generative, and pseudo-generative text classifiers across varying training data volumes, evaluated on nine benchmark datasets. Employing a diverse array of MIA strategies, we consistently demonstrate that fully generative classifiers which explicitly model the joint likelihood $P(X,Y)$ are most vulnerable to membership leakage. Furthermore, we observe that the canonical inference approach commonly used in generative classifiers significantly amplifies this privacy risk.
These findings reveal a fundamental utility-privacy trade-off inherent in classifier design, underscoring the critical need for caution when deploying generative classifiers in privacy-sensitive applications. Our results motivate future research directions in developing privacy-preserving generative classifiers that can maintain utility while mitigating membership inference vulnerabilities.
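
A common baseline MIA thresholds per-sample loss. The sketch below uses purely synthetic loss distributions to show why a train/test loss gap leaks membership; the distributions and magnitudes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy per-sample losses: training members typically have lower loss than
# held-out non-members (shapes and scales here are illustrative only).
member_loss = rng.gamma(shape=2.0, scale=0.5, size=500)
nonmember_loss = rng.gamma(shape=2.0, scale=1.0, size=500)

def attack_auc(member_loss, nonmember_loss):
    # Score = -loss; AUC = P(a random member outscores a random non-member).
    wins = (member_loss[:, None] < nonmember_loss[None, :]).mean()
    ties = (member_loss[:, None] == nonmember_loss[None, :]).mean()
    return wins + 0.5 * ties

auc = attack_auc(member_loss, nonmember_loss)
print(auc > 0.5)  # True: the larger the loss gap, the stronger the attack
```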

URL: https://openreview.net/forum?id=SHMC01wdVM

---

Title: On The Scalability Of Forward Gradients, Evolutionary Strategies, And Control Variates

Abstract: Stochastic gradient estimation methods such as Forward Gradients (FG) and Evolutionary Strategies (ES) have been proposed to overcome drawbacks of computing gradients with backpropagation (BP). However, FG and ES have large variance in high dimensions, connections between these methods have previously remained unclear, and while pure FG is guaranteed to be unbiased, proposed improvements have typically abandoned this property. We illuminate connections between FG and a popular variant of ES by proving mathematical equivalence on all quadratic objective functions. On an illustrative problem, we demonstrate theoretically how optimal convergence and learning rates scale unfavourably with intrinsic dimensionality and population size. We show that popular gradient descent techniques such as momentum and Adam do not address these fundamental scalability problems. We explore using control variates to reduce variance of FG while maintaining unbiasedness, and while we find limited success in improving over baselines, we also identify challenges that need to be overcome for these methods to scale effectively. Lastly we consider a biased method for variance reduction, and on a particular problem we show that this significantly outperforms the unbiased variance reduction methods that we consider. Assuming access to an asymptotically unbiased control variate, our results suggest that maintaining unbiasedness is not necessarily advantageous for variance reduction techniques, however we leave open the possibility that unbiasedness may be helpful when the control variate is asymptotically biased. Our code is publicly available at https://github.com/anon908bp2zy/forward_grad_public .
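
The forward-gradient estimator and a control-variate variant can be sketched on a quadratic objective; the gradient guess `h` and all scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
A = np.diag(np.arange(1.0, d + 1))    # quadratic objective f(x) = 0.5 x^T A x

def grad(x):
    return A @ x

def forward_gradient(x, h=None):
    # Plain FG: g_hat = (grad f . v) v with v ~ N(0, I); unbiased since
    # E[v v^T] = I.  With a control variate h approximating grad f,
    # subtracting the zero-mean term (h . v) v - h keeps unbiasedness
    # while shrinking variance as h approaches the true gradient.
    v = rng.normal(size=d)
    g = grad(x)
    est = (g @ v) * v
    if h is not None:
        est -= (h @ v) * v - h
    return est

x = rng.normal(size=d)
g_true = grad(x)
h = 0.9 * g_true                      # a good (hypothetical) gradient guess

plain = np.array([forward_gradient(x) for _ in range(5000)])
cv = np.array([forward_gradient(x, h) for _ in range(5000)])

rel_err = np.linalg.norm(plain.mean(0) - g_true) / np.linalg.norm(g_true)
var_plain = plain.var(0).sum()
var_cv = cv.var(0).sum()
print(rel_err < 0.1, var_cv < var_plain)
```

Averaged plain-FG estimates approach the true gradient (unbiasedness), and the control variate cuts variance without giving that up.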

URL: https://openreview.net/forum?id=s6g8yZimHE

---

Title: On Rate-Optimal Partitioning Classification from Observable and from Privatised Data

Abstract: In this paper we revisit the classical method of partitioning classification and prove novel convergence rates under relaxed conditions, both for observable (non-privatised) and for privatised data. We consider the problem of classification in a $d$ dimensional Euclidean space. Previous results on the partitioning classifier worked with the strong density assumption (SDA), which is restrictive, as we demonstrate through simple examples. Here, we study the problem under much milder assumptions. We presuppose that the distribution of the inputs is a mixture of an absolutely continuous and a discrete distribution, such that the absolutely continuous component is concentrated to a $d_a$ dimensional subspace. In addition to the standard Lipschitz and margin conditions, a novel characteristic of the absolutely continuous component is introduced, by which the convergence rate of the classification error probability is computed, both for the binary and for the multi-class cases. This bound can reach the minimax optimal convergence rate achievable using SDA, but under much milder distributional assumptions. Interestingly, this convergence rate depends only on the intrinsic dimension of the continuous inputs, $d_a$, and not on $d$. Under privacy constraints, the data cannot be directly observed, and the constructed classifiers are functions of the randomised outcome of a suitable local differential privacy mechanism. In this paper we add Laplace-distributed noise to the discontinuations of all possible locations of the feature vector and to its label. Again, tight upper bounds on the convergence rate of the classification error probability can be derived, without using SDA, such that this rate depends on $2d_a$.
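
The Laplace mechanism behind such local-DP releases, in a minimal sketch (the sensitivity and epsilon values are illustrative, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_ldp(value, eps, sensitivity=1.0):
    # Local DP release of a bounded scalar: Laplace noise with scale
    # sensitivity / eps yields an eps-differentially-private output.
    return value + rng.laplace(scale=sensitivity / eps)

# Releasing a binary label (sensitivity 1) at two privacy levels.
priv_strict = [laplace_ldp(1.0, eps=0.5) for _ in range(5000)]
priv_loose = [laplace_ldp(1.0, eps=5.0) for _ in range(5000)]
print(np.std(priv_strict) > np.std(priv_loose))  # True: smaller eps, more noise
```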

URL: https://openreview.net/forum?id=KYYvIrtgK0

---

Title: PAC-Bayesian Meta-Learning for Few-Shot Identification of Linear Dynamical Systems

Abstract: Identifying linear time-invariant (LTI) dynamical systems from data is especially challenging when trajectories are short, noisy, or high-dimensional. Traditional system identification methods typically treat each system in isolation and therefore discard shared information that may exist across related systems. We propose a PAC-Bayesian Meta-Learning framework for LTI system identification (PBML-LTI) that explicitly leverages cross-task structure while preserving task-level heterogeneity. Each task corresponds to an unknown LTI system, and a meta-learner uses a collection of training trajectories to learn a data-dependent prior over system parameters. Given a new system with limited trajectory data, the method performs Bayesian inference to produce a posterior distribution over the new system’s parameters, enabling calibrated uncertainty quantification and principled adaptation in the few-shot regime.

A key technical challenge is temporal dependence: trajectories generated by LTI systems violate i.i.d. assumptions underlying standard learning theory. To address this, we develop generalization guarantees for meta-learned priors under sequential dependence using martingale-based PAC-Bayes analysis with sub-normalized concentration tools. The resulting bounds characterize how the quality of the learned prior controls expected identification error on unseen systems, with explicit dependence on trajectory length, noise, and the divergence between task posteriors and the meta-prior. This connects uncertainty-aware meta-identification with finite-sample theory for dependent dynamical data.
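
Few-shot Bayesian identification with a data-dependent prior can be sketched for a scalar LTI system. The Gaussian prior standing in for the meta-learned prior, and all constants, are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar LTI system x_{t+1} = a x_t + noise; the "meta-learned" prior
# over a is a Gaussian N(mu0, s0^2) (values here are illustrative).
a_true, noise_std = 0.8, 0.1
mu0, s0 = 0.7, 0.3

T = 8                                    # short, few-shot trajectory
x = np.zeros(T + 1)
x[0] = 1.0
for t in range(T):
    x[t + 1] = a_true * x[t] + noise_std * rng.normal()

# Conjugate Gaussian posterior over a (Bayesian linear regression on pairs
# (x_t, x_{t+1})): precisions add, and the data term uses sum x_t^2 / sigma^2.
xt, xt1 = x[:-1], x[1:]
prec = 1 / s0**2 + (xt @ xt) / noise_std**2
mu_post = (mu0 / s0**2 + (xt @ xt1) / noise_std**2) / prec
s_post = np.sqrt(1 / prec)

print(s_post < s0)  # True: the posterior is sharper than the prior
```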

URL: https://openreview.net/forum?id=CiGFpSLzFv

---

Title: SynQuE: Estimating Synthetic Dataset Quality Without Annotations

Abstract: We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE) problem: ranking synthetic datasets by their expected real-world task performance using only limited unannotated real data. This addresses a critical and open challenge where data is scarce due to collection costs or privacy constraints. We establish the first comprehensive benchmarks for this problem by introducing and evaluating proxy metrics that choose synthetic data for training to maximize task performance on real data. We introduce the first proxy metrics for SynQuE by adapting distribution and diversity-based distance measures to our context via embedding models. To address the shortcomings of these metrics on complex planning tasks, we propose Lens, a novel proxy that leverages large language model reasoning. Our results show that SynQuE proxies correlate with real task performance across diverse tasks, including sentiment analysis, Text2SQL, web navigation, and image classification, with Lens consistently outperforming others on complex tasks by capturing nuanced characteristics. For instance, on text-to-SQL parsing, training on the top-3 synthetic datasets selected via SynQuE proxies can raise accuracy from 30.4% to 38.4% (+8.1) on average compared to selecting data indiscriminately. This work establishes SynQuE as a practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation model-based data characterization and fine-grained data selection.
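
A minimal distribution-distance proxy in the spirit of SynQuE's embedding-based metrics; the mean-embedding distance here is a simple stand-in, not Lens or any specific metric from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_embedding_distance(real, synth):
    # A minimal distribution-distance proxy: distance between the mean
    # embeddings of real and synthetic samples (lower = better match).
    return np.linalg.norm(real.mean(axis=0) - synth.mean(axis=0))

real = rng.normal(loc=0.0, size=(200, 16))          # unannotated real embeddings
synth_close = rng.normal(loc=0.1, size=(200, 16))   # well-matched synthetic set
synth_far = rng.normal(loc=2.0, size=(200, 16))     # poorly matched synthetic set

scores = {"close": mean_embedding_distance(real, synth_close),
          "far": mean_embedding_distance(real, synth_far)}
ranking = sorted(scores, key=scores.get)
print(ranking)  # ['close', 'far']: pick the best-matched set for training
```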

URL: https://openreview.net/forum?id=W4Pwb4SX3P

---

Title: DRESS: Disentangled Representation-based Self-Supervised Meta-Learning for Diverse Tasks

Abstract: Meta-learning represents a strong class of approaches for solving few-shot learning tasks. Nonetheless, recent research suggests that simply pre-training a generic encoder can potentially surpass meta-learning algorithms. In this paper, we hypothesize that the reason meta-learning fails to stand out in popular few-shot learning benchmarks is the lack of diversity among the few-shot learning tasks. We propose DRESS, a task-agnostic Disentangled REpresentation-based Self-Supervised meta-learning approach that enables fast model adaptation on highly diversified few-shot learning tasks. Specifically, DRESS utilizes disentangled representation learning to create self-supervised tasks that can fuel the meta-training process. We validate the effectiveness of DRESS through experiments on datasets with multiple factors of variation and varying complexity. The results suggest that DRESS is able to outperform competing methods on the majority of the datasets and task setups. Through this paper, we advocate for a re-examination of how task adaptation studies are conducted, and aim to reignite interest in the potential of meta-learning for solving few-shot learning tasks via disentangled representations.

URL: https://openreview.net/forum?id=TSjDJYKLmu

---

Title: On Theoretical Identifiability of Discrete Latent Causal Graphical Models

Abstract: This paper considers a challenging problem of identifying a causal graphical model under the presence of latent variables. While various identifiability conditions have been proposed in the literature, they often require multiple pure children per latent variable or restrictions on the latent causal graph. Furthermore, it is common for all observed variables to exhibit the same modality. Consequently, the existing identifiability conditions are often too stringent for complex real-world data. We consider a general nonparametric measurement model with arbitrary observed variable types and binary latent variables, and propose a double triangular graphical condition that guarantees identifiability of the entire causal graphical model. The proposed condition significantly relaxes the popular pure children condition. We also establish necessary conditions for identifiability and provide valuable insights into fundamental limits of identifiability. Simulation studies verify that latent structures satisfying our conditions can be accurately estimated from data. We also illustrate the practicality of our conditions with a real data example.

URL: https://openreview.net/forum?id=KiiSlAsLuN

---

Title: Tumor-anchored deep feature random forests for out-of-distribution detection in lung cancer segmentation

Abstract: Accurate segmentation of cancerous lesions from 3D computed tomography (CT) scans is essential for automated treatment planning and response assessment. However, even state-of-the-art models combining self-supervised learning (SSL) pretrained transformers with convolutional decoders are susceptible to out-of-distribution (OOD) inputs, generating confidently incorrect tumor segmentations, posing risks to safe clinical deployment. Existing logit-based methods suffer from task-specific model biases, while architectural enhancements to explicitly detect OOD increase parameters and computational costs. Hence, we introduce a lightweight, plug-and-play post-hoc random forests-based OOD detection framework called RF-Deep that leverages deep features with limited outlier exposure. RF-Deep enhances generalization to imaging variations by repurposing the hierarchical features from the pretrained-then-finetuned backbone, providing task-relevant OOD detection by extracting the features from multiple regions of interest anchored to the predicted tumor segmentations. We compared RF-Deep against existing OOD detection methods using 2,056 CT scans across near-OOD (pulmonary embolism, negative COVID-19) and far-OOD (kidney cancer, healthy pancreas) datasets. RF-Deep achieved AUROC > 93.50 for the challenging near-OOD datasets and near-perfect detection (AUROC > 99.00) for the far-OOD datasets, substantially outperforming logit-based and radiomics approaches. RF-Deep maintained consistent performance across networks of different depths and pretraining strategies, demonstrating its effectiveness as a lightweight, architecture-agnostic approach to enhance the reliability of tumor segmentation from CT volumes.

URL: https://openreview.net/forum?id=XmjYlBxFxn

---

Title: Nested Slice Sampling: Vectorized Nested Sampling for GPU-Accelerated Inference

Abstract: Model comparison and calibrated uncertainty quantification often require integrating over parameters, but scalable inference can be challenging for complex, multimodal targets. Nested Sampling is a robust alternative to standard MCMC, yet its typically sequential structure and hard constraints make efficient accelerator implementations difficult. This paper introduces Nested Slice Sampling (NSS), a GPU-friendly, vectorized formulation of Nested Sampling that uses Hit-and-Run Slice Sampling for constrained updates. A tuning analysis yields a simple near-optimal rule for setting the slice width, improving high-dimensional behavior and making per-step compute more predictable for parallel execution. Experiments on challenging synthetic targets, high dimensional Bayesian inference, and Gaussian process hyperparameter marginalization show that NSS maintains accurate evidence estimates and high-quality posterior samples, and is particularly robust on difficult multimodal problems where current state-of-the-art methods such as tempered SMC baselines can struggle. An open-source implementation is released to facilitate adoption and reproducibility.
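
The constrained update at the heart of such samplers can be sketched as a hit-and-run step with shrinkage under the nested-sampling hard constraint; the toy Gaussian likelihood and bracket width are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_like(x):
    return -0.5 * np.sum(x ** 2)   # toy Gaussian log-likelihood

def hit_and_run_step(x, log_lstar, width=2.0):
    # One constrained update: draw a random direction, bracket a segment of
    # half-width `width`, then shrink toward x until the proposal satisfies
    # the nested-sampling hard constraint logL(x') > log_lstar.
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d)
    lo, hi = -width, width
    while True:
        t = rng.uniform(lo, hi)
        x_new = x + t * d
        if log_like(x_new) > log_lstar:
            return x_new
        if t < 0:                  # shrink the bracket toward the current point
            lo = t
        else:
            hi = t

x = np.zeros(3)
log_lstar = log_like(np.full(3, 0.5))   # constraint: stay inside this contour
samples = []
for _ in range(200):
    x = hit_and_run_step(x, log_lstar)
    samples.append(x)
print(all(log_like(s) > log_lstar for s in samples))  # True
```

Because the bracket shrinks toward a point that already satisfies the constraint, each step terminates, which is part of what makes the per-step compute predictable.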

URL: https://openreview.net/forum?id=5mF2eRl3gt

---

Title: Disjoint Generation of Synthetic Data

Abstract: We propose a new framework for generating tabular synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the design choices that one may make. The advantages achieved by the disjoint generation include: i) An observed increase in the empirical measurement of privacy. ii) Increased computational feasibility of certain model types. iii) Ability to generate synthetic data using a mixture of different generative models. Specifically, mixed-model synthesis bridges the gap between privacy and utility performance, providing state-of-the-art performance on Accuracy and Area Under the Curve for downstream tasks while significantly lowering the empirical re-identification risk.
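
The disjoint-generation pipeline can be sketched with bootstrap resampling standing in for the per-subset generative models (a deliberate simplification); the post-hoc join is a random pairing, since no shared identifier exists:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy table split column-wise into two disjoint subsets; each subset is
# "synthesized" by bootstrap resampling, a stand-in for a real generative
# model, then joined post hoc by random pairing.
n = 1000
real = {"age": rng.integers(18, 80, n).astype(float),
        "income": rng.normal(50.0, 15.0, n)}

synth_age = rng.choice(real["age"], size=n, replace=True)        # model 1
synth_income = rng.choice(real["income"], size=n, replace=True)  # model 2

perm = rng.permutation(n)                                        # joining step
synthetic = {"age": synth_age, "income": synth_income[perm]}

# Marginals are preserved, but cross-column links to real records are broken,
# which is one intuition for the observed privacy gain.
print(abs(synthetic["age"].mean() - real["age"].mean()) < 3.0)
```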

URL: https://openreview.net/forum?id=LSzXkAWBKI

---

Title: Inducing Disagreement in Multi-Agent LLM Executive Teams: Only the Devil’s Advocate Works

Abstract: Multi-agent large language model (LLM) systems for strategic decision-making suffer from premature convergence, limiting the benefits of multiple perspectives. While several techniques for inducing disagreement have been proposed, no systematic comparison exists, particularly for strategic decisions without objectively correct answers. We compare five prompting techniques across 20 business scenarios with four-agent executive teams (CEO, CFO, CMO, COO), analyzing 480 team decisions and $1{,}920$ individual agent responses. Our key finding is stark: Devil's Advocate assignment achieves $99.2\%$ disagreement rates, while baseline conditions show only $48.3\%$ disagreement. Critically, "soft" techniques, namely Strong Role Framing ($61.7\%$), Explicit Dissent Instructions ($55.0\%$), and their combination ($63.3\%$), are statistically indistinguishable from baseline. Only Devil's Advocate produces significant improvement. We also discover consistent coalition patterns: $80.3\%$ of 2-2 splits follow a CEO+CMO versus CFO+COO alignment, suggesting functional perspective differentiation. Analysis of confidence allocations reveals that soft techniques create "nuanced agreement", where agents express lower conviction but reach the same conclusions, while Devil's Advocate produces "inauthentic dissent" where $4.9\%$ of agents recommend options they privately rate lower. These findings demonstrate that explicit behavioral assignment ("you must oppose") succeeds where implicit instructions ("think critically") fail, with implications for practitioners designing multi-agent deliberation systems.
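
The contrast between implicit instructions and explicit behavioral assignment can be sketched as prompt templates; the wording below is illustrative, not taken from the study's materials:

```python
# Hypothetical system prompts contrasting a "soft" dissent instruction with
# the explicit behavioral assignment the abstract reports as effective.
ROLES = ["CEO", "CFO", "CMO", "COO"]

def soft_dissent_prompt(role):
    # Implicit instruction ("think critically"): found indistinguishable
    # from baseline in the study.
    return (f"You are the {role}. Think critically and voice any "
            f"disagreements you have with the team's emerging consensus.")

def devils_advocate_prompt(role):
    # Explicit behavioral assignment ("you must oppose"): the only
    # technique reported to reliably induce disagreement.
    return (f"You are the {role}, assigned as devil's advocate this round. "
            f"You must argue against the option the rest of the team "
            f"currently favors, even if you privately agree with it.")

prompts = {r: devils_advocate_prompt(r) for r in ROLES}
print("must argue against" in prompts["CFO"])  # True
```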

URL: https://openreview.net/forum?id=mxBmj5LYU2

---

Title: A Causal Testbed for Disentangling Skill from Aggregate Game Statistics in Chess

Abstract: A long-standing objective in human-AI interaction is to create personalized AI coaching systems that enhance human skill without tainting quantifiable behavioral patterns. We hypothesize that the common problem of style drift in AI coaching results from a failure to recognize the underlying causal structure, namely the collision between skill and behavioral patterns. We propose a methodological testbed for formalizing, quantifying, and addressing skill-behavioral pattern disentanglement under a particular causal structure. Instead of concentrating on holistic chess style, we specifically target a tractable proxy problem: decoupling skill from six interpretable aggregate play statistics. Our contribution is positioned as methodological rather than a comprehensive solution to chess coaching because this simplified feature space allows controlled testing of the collider hypothesis with known ground truth. We evaluate our approach on 30,000 real-world chess games, demonstrating that unsupervised disentanglement models ($\beta$-VAE, InfoGAN) fail on our testbed (MIG $\approx$ 0), while our causally informed architecture achieves strong disentanglement (MIG = 0.89, HSIC $\approx$ 0.00016). Our model produces statistically independent latent representations while maintaining excellent predictive accuracy. While we achieve statistical disentanglement on our defined features, we cannot validate whether the learned representations capture meaningful strategic concepts or enable effective coaching without human evaluation by chess domain experts. Our contribution demonstrates the statistical mechanism by which collider bias prevents disentanglement and how HSIC regularization addresses it.
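
The HSIC measure referenced here has a standard empirical estimator; a self-contained sketch with RBF kernels on toy variables (bandwidths and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_gram(x, sigma=1.0):
    # Pairwise RBF kernel matrix for samples in rows of x.
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(x, y):
    # Biased empirical HSIC: trace(K H L H) / (n-1)^2, with centering H.
    # Values near zero indicate (kernel-)independence between x and y.
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    k, l = rbf_gram(x), rbf_gram(y)
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2

n = 200
skill = rng.normal(size=(n, 1))
independent = rng.normal(size=(n, 1))
dependent = skill + 0.1 * rng.normal(size=(n, 1))

print(hsic(skill, dependent) > hsic(skill, independent))  # True
```

Penalizing this quantity between latent blocks is the kind of regularization the abstract credits with enforcing statistically independent representations.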

URL: https://openreview.net/forum?id=X3s31GOYPz

---

Title: A Survey on Efficient Protein Language Models

Abstract: Protein language models (pLMs) have become indispensable tools in computational biology, driving advances in variant effect prediction, functional annotation, structure prediction, and engineering. However, their rapid expansion from millions to tens of billions of parameters introduces significant computational, accessibility, and sustainability challenges that limit practical application in environments constrained by GPU memory, hardware availability, and energy budgets. This survey presents the first comprehensive review of efficient pLMs, synthesizing recent advancements across four key dimensions. We first examine (1) dataset efficiency through meta-learning-based few-shot and scaling-law-guided data allocation; and (2) architecture efficiency via lightweight alternatives including quantized transformers, embedding compression, and convolution-based designs. Furthermore, we review (3) training efficiency through scaling-law-informed pretraining, structure-integrated multimodal approaches, and low-rank adaptations with diverse distillation strategies; and (4) inference efficiency via quantization, dense-retrieval, and structure-search methods. By providing a structured taxonomy and practical guidance, this survey enables the development of high-performance, scalable, yet sustainable next-generation pLMs.

URL: https://openreview.net/forum?id=PTReuOwsXz

---

Title: WAREX: Web Agent Reliability Evaluation on Existing Benchmarks

Abstract: Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. However, in the real world, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side issues, server-side issues, or broader system failures. Moreover, live websites are prone to web attacks such as Cross-Site Scripting, as well as general site modifications which can cause unexpected or malicious pop-ups or improper functionality. To address this gap, we present WAREX, a plug-and-play, network-layer tool that integrates with existing web agent benchmarks by simulating common website failures. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents. We demonstrate that WAREX serves as more than a diagnostic tool. By fine-tuning an open-source model (Qwen3-8B) on WAREX-generated "failure-recovery" trajectories, we achieve an 88.9% relative improvement in error recovery rates, validating WAREX as a core component for training the next generation of reliable web agents.
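
Failure injection of this kind can be sketched as a wrapper around a page fetch; the class name, failure rate, and timeout-only failure mode are illustrative assumptions, far simpler than WAREX itself:

```python
import random

random.seed(0)

class FlakyNetwork:
    # Minimal stand-in for network-layer failure injection: wraps a page
    # fetch and randomly simulates timeouts (rate is illustrative).
    def __init__(self, fetch, timeout_rate=0.3):
        self.fetch = fetch
        self.timeout_rate = timeout_rate

    def get(self, url):
        if random.random() < self.timeout_rate:
            raise TimeoutError(f"simulated network failure for {url}")
        return self.fetch(url)

net = FlakyNetwork(lambda url: f"<html>content of {url}</html>")
results = []
for _ in range(100):
    try:
        results.append(net.get("https://example.com"))
    except TimeoutError:
        results.append(None)    # an agent would need to recover here

failures = results.count(None)
print(0 < failures < 100)  # True: some fetches fail, most succeed
```

An agent benchmark run through such a wrapper must handle recovery explicitly, which is the behavior the fine-tuning experiment targets.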

URL: https://openreview.net/forum?id=o4pXVP8RCD

---

Title: Unlocking The Power Of Layer-By-Layer Training And Fine-Tuning

Abstract: Layer-wise (LW) training of deep neural networks has long been associated with memory and
parallelism advantages, yet it suffers from information degradation and poor convergence
in deep architectures. Recent work attributes these issues to the loss of input information
and the lack of layer-role differentiation, as measured by the Hilbert-Schmidt Independence
Criterion (HSIC).
In this paper, we present a novel algorithm that enables full end-to-end training of ResNet-
18/ResNet-50 and end-to-end fine-tuning of Large Language Models (LLMs) using a modified
LW approach, while minimizing performance degradation. Our fundamental contribution
lies in the discovery that strategically reintroducing the final layers during LW training
mitigates the convergence degradation typically observed during LW when compared to
conventional end-to-end fine-tuning.
We introduce Segmented Propagation (SegProp), a training paradigm that seamlessly integrates
the computational efficiency of LW optimization with the representational power
of global supervision. Quantitative results demonstrate substantial improvements in convergence
compared to standard LW fine-tuning of LLMs and compared to LW training of
ResNet-18/ResNet-50. SegProp improves ResNet-50 accuracy on CIFAR-10 from 90.0%
(LW) to 94.3%, approaching E2E training at 95.5%. On ResNet-18, SegProp improves
CIFAR-10 accuracy from 93.7% (LW) to 95.2%, closely matching E2E at 95.5%. On Mistral-
Nemo-Instruct-2407, SegProp segmented fine-tuning matches E2E MMLU (5-shot) performance
(69.3%), and for Llama3.1-8B-Instruct it achieves 78.9% on Winogrande (5-shot),
closely matching E2E fine-tuning at 79.1%.

URL: https://openreview.net/forum?id=p5ObETPuTi

---

Title: Efficient and Programmable Exploration of Synthesizable Chemical Space

Abstract: The constrained nature of synthesizable chemical space poses a significant challenge for sampling molecules that are both synthetically accessible and possess desired properties. In this work, we present PrexSyn, an efficient and programmable model for molecular discovery within synthesizable chemical space. PrexSyn is based on a decoder-only transformer trained on a billion-scale datastream of synthesizable pathways paired with molecular properties, enabled by a real-time, high-throughput C++-based data generation engine. The large-scale training data allows PrexSyn to reconstruct the synthesizable chemical space nearly perfectly at a high inference speed and learn the association between properties and synthesizable molecules. Based on its learned property-pathway mappings, PrexSyn can generate synthesizable molecules that satisfy not only single-property conditions but also composite property queries joined by logical operators, thereby allowing users to ``program'' generation objectives. Moreover, by exploiting this property-based querying capability, PrexSyn can efficiently optimize molecules against black-box oracle functions via iterative query refinement, achieving higher sampling efficiency than even synthesis-agnostic baselines, making PrexSyn a powerful general-purpose molecular optimization tool. Overall, PrexSyn pushes the frontier of synthesizable molecular design by setting a new state of the art in synthesizable chemical space coverage, molecular sampling efficiency, and inference speed.

URL: https://openreview.net/forum?id=xDlIer2UnI

---

Title: Primus: Enforcing Attention Usage for 3D Medical Image Segmentation

Abstract: Transformers have achieved remarkable success across multiple fields, yet their impact on 3D medical image segmentation remains limited with convolutional networks still dominating major benchmarks. In this work, (A) we analyze current Transformer-based segmentation models and identify critical shortcomings, particularly their over-reliance on convolutional blocks. Further, we demonstrate that in some architectures, performance is unaffected by the absence of the Transformer, thereby showing their limited effectiveness. To address these challenges, we move away from hybrid architectures and (B) introduce Transformer-centric segmentation architectures, termed Primus and PrimusV2. Primus leverages high-resolution tokens, combined with advances in positional embeddings and block design, to make full use of its Transformer blocks, while PrimusV2 expands on this through an iterative patch embedding. Through these adaptations, Primus surpasses current Transformer-based methods and competes with a default nnU-Net while PrimusV2 exceeds it and is on par with state-of-the-art CNNs such as the ResEnc-L and MedNeXt architectures across nine public datasets. In doing so, we introduce the first competitive Transformer-centric model, making Transformers state-of-the-art in 3D medical segmentation. Our code will be published.

URL: https://openreview.net/forum?id=x4vZE4PDEu

---

Title: Embryology of a Language Model

Abstract: Understanding how language models develop their internal computational structure is a central problem in the science of deep learning. We study this development through an embryological lens, applying UMAP to susceptibility vectors to visualize structural organization over training. We observe the emergence of a striking ``body plan''---the rainbow serpent---with an anterior-posterior axis defined by global expression versus suppression, dorsal-ventral stratification corresponding to the induction circuit, and a novel ``spacing fin'' structure. This body plan is reproducible across random seeds, suggesting that high-level functional organization is determined by architecture and data rather than initialization. Our work demonstrates that the relationship between data and internal structure is legible and developmental, with implications for both understanding and guiding model development.

URL: https://openreview.net/forum?id=1sgL0GrY4l

---

Title: Differential Privacy for Transformer Embeddings of Text with Nonparametric Variational Information Bottleneck

Abstract: We propose a privacy-preserving method for sharing text data by sharing noisy versions of their transformer embeddings.
It has been shown that hidden representations learned by deep models can encode sensitive information from the input, making it possible for adversaries to recover the input data with considerable accuracy. This problem is exacerbated in transformer embeddings because they consist of multiple vectors, one per token. To mitigate this risk, we propose Nonparametric Variational Differential Privacy (NVDP), which ensures both useful data sharing and strong privacy protection. We take a differential privacy (DP) approach, integrating a nonparametric variational information bottleneck (NVIB) layer into the transformer architecture to inject noise into its multivector embeddings and thereby hide information, and measuring privacy protection with Rényi Divergence (RD) and its corresponding Bayesian Differential Privacy (BDP) guarantee. Training the NVIB layer calibrates the noise level according to the utility of the downstream task. We test NVDP on the General Language Understanding Evaluation (GLUE) benchmark and show that varying the noise level gives us a useful trade-off between privacy and accuracy. With lower noise levels, our model maintains high accuracy while offering strong privacy guarantees, effectively balancing privacy and utility.
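The NVIB layer itself is not reproduced here, but the Gaussian-mechanism idea the abstract describes can be sketched generically: inject isotropic noise into each token embedding before sharing, and bound the privacy loss via the closed-form Rényi divergence between two equal-variance Gaussians, D_alpha = alpha * ||mu1 - mu2||^2 / (2 * sigma^2). A minimal illustration (all numbers and shapes are assumptions, not the paper's setup):

```python
import numpy as np

# Generic Gaussian mechanism on multivector (per-token) embeddings.
# Privacy is quantified by the Renyi divergence between the noisy
# distributions induced by two adjacent inputs.
def noisy_embedding(emb, sigma, rng):
    """Add isotropic Gaussian noise to every token vector before sharing."""
    return emb + rng.normal(scale=sigma, size=emb.shape)

def renyi_divergence(mu1, mu2, sigma, alpha=2.0):
    """Closed form for N(mu1, sigma^2 I) vs N(mu2, sigma^2 I)."""
    return alpha * np.sum((mu1 - mu2) ** 2) / (2 * sigma ** 2)

rng = np.random.default_rng(0)
emb = rng.standard_normal((5, 16))           # 5 token vectors, 16-dim each
neighbor = emb.copy(); neighbor[0] += 0.1    # adjacent input: one token changed
shared = noisy_embedding(emb, sigma=2.0, rng=rng)

# More noise -> smaller divergence -> stronger privacy, lower utility.
eps_low_noise = renyi_divergence(emb, neighbor, sigma=0.5)
eps_high_noise = renyi_divergence(emb, neighbor, sigma=2.0)
```

This mirrors the privacy/utility trade-off the abstract reports: raising sigma shrinks the Rényi divergence (and hence the BDP guarantee's epsilon) at the cost of noisier embeddings.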

URL: https://openreview.net/forum?id=Y5rKWT4e6G

---

Title: Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

Abstract: A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive "alignment gap", where most models fail to generatively outperform the linear separability of their representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual alignment issue. Our method augments standard next-token prediction with a contrastive objective to restructure the visual manifold into a more one-dimensionally linear geometry, improving image-to-image comparison and enabling models to significantly surpass the LSC on abstract binary classification tasks.

URL: https://openreview.net/forum?id=3uX4p80bN0

---

Title: Legal Alignment for Safe and Ethical AI

Abstract: Alignment of artificial intelligence (AI) encompasses the normative problem of specifying how AI systems should act and the technical problem of ensuring AI systems comply with those specifications. To date, AI alignment has generally overlooked an important source of knowledge and practice for grappling with these problems: law. In this paper, we aim to fill this gap by exploring how legal rules, principles, and methods can be leveraged to address problems of alignment and inform the design of AI systems that operate safely and ethically. This emerging field -- legal alignment -- focuses on three research directions: (1) designing AI systems to comply with the content of legal rules developed through legitimate institutions and processes, (2) adapting methods from legal interpretation to guide how AI systems reason and make decisions, and (3) harnessing legal concepts as a structural blueprint for confronting challenges of reliability, trust, and cooperation in AI systems. These research directions present new conceptual, empirical, and institutional questions, which include examining the specific set of laws that particular AI systems should follow, creating evaluations to assess their legal compliance in real-world settings, and developing governance frameworks to support the implementation of legal alignment in practice. Tackling these questions requires expertise across law, computer science, and other disciplines, offering these communities the opportunity to collaborate in designing AI for the better.

URL: https://openreview.net/forum?id=BypXEQa7mf

---

Title: Concatenated Matrix SVD: Compression Bounds, Incremental Approximation, and Error-Constrained Clustering

Abstract: Large collections of matrices arise throughout modern machine learning, signal processing, and scientific computing, where they are commonly compressed by concatenation followed by truncated singular value decomposition (SVD). This strategy enables parameter sharing and efficient reconstruction and has been widely adopted across domains ranging from multi-view learning and signal processing to neural network compression. However, it leaves a fundamental question unanswered: which matrices can be safely concatenated and compressed together under explicit reconstruction error constraints? Existing approaches rely on heuristic or architecture-specific grouping and provide no principled guarantees on the resulting SVD approximation error. In the present work, we introduce a theory-driven framework for compression-aware clustering of matrices under SVD compression constraints. Our analysis establishes new spectral bounds for horizontally concatenated matrices, deriving global upper bounds on the optimal rank-$r$ SVD reconstruction error from lower bounds on singular value growth. The first bound follows from Weyl-type monotonicity under blockwise extensions, while the second leverages singular values of incremental residuals to yield tighter, per-block guarantees. We further develop an efficient approximate estimator based on incremental truncated SVD that tracks dominant singular values without forming the full concatenated matrix. Therefore, we propose three clustering algorithms that merge matrices only when their predicted joint SVD compression error remains below a user-specified threshold. The algorithms span a trade-off between speed, provable accuracy, and scalability, enabling compression-aware clustering with explicit error control.
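The core operation this abstract builds on, concatenating matrices horizontally and compressing with a truncated SVD, can be sketched in a few lines. This is a generic illustration of the error-constrained merge decision, not the paper's bounds or clustering algorithms:

```python
import numpy as np

def truncated_svd_error(M, r):
    """Frobenius-norm error of the best rank-r approximation of M,
    i.e. the root-sum-of-squares of the discarded singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sqrt(np.sum(s[r:] ** 2)))

def safe_to_merge(A, B, r, tol):
    """Merge A and B into one rank-r compressed block only if the joint
    reconstruction error stays below a user-specified threshold."""
    err = truncated_svd_error(np.hstack([A, B]), r)
    return err <= tol, err

rng = np.random.default_rng(0)
U = rng.standard_normal((20, 3))
A = U @ rng.standard_normal((3, 15))   # rank-3 block
B = U @ rng.standard_normal((3, 15))   # shares A's column space
ok, err = safe_to_merge(A, B, r=3, tol=1e-8)
# ok is True: the concatenation is still rank 3, so the error is ~0
```

When B's column space overlaps A's, concatenation costs nothing at rank r; when it does not, the joint error grows, which is exactly what an error-constrained clustering must detect (the paper's contribution is bounding this without forming the full concatenated matrix).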

URL: https://openreview.net/forum?id=E9n35dehqx

---

Title: MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs’ superior capabilities, researchers lack a comprehensive understanding of their compositionality – the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs’ compositionality. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o’s compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training.

URL: https://openreview.net/forum?id=aWO15tpSH8

---

Title: Self-Improvement as Coherence Optimization: A Theoretical Account

Abstract: Can language models improve their accuracy without external supervision? Methods such as debate, bootstrap, and internal coherence maximization achieve this surprising feat, even matching golden finetuning performance. Yet why they work remains theoretically unclear. We show that they are all special cases of coherence optimization, i.e., finding a context-to-behavior mapping that's most compressible and jointly predictable. We prove that coherence optimization is equivalent to description-length regularization, and that among all such regularization schemes, it is optimal for semi-supervised learning when the regularizer is derived from a pretrained model. Our theory, supported by preliminary experiments, explains why feedback-free self-improvement works and predicts when it should succeed or fail.

URL: https://openreview.net/forum?id=nR47qAX9oL

---

Title: FedIndex: Federated Domain Adaptation with Continuous Domain Indices

Abstract: Federated domain adaptation incorporates source clients’ knowledge to improve model performance on the target client under the coordination of the server, mitigating the impact of data insufficiency and domain shift. Existing federated domain adaptation (FDA) methods focus on domain adaptation with categorical domain indices (e.g., “source” and “target”), while many real-world tasks involve domains with continuous domain indices. For instance, hospitals need to adapt disease analysis and prediction across patients by age, a continuous domain index in medical applications that captures the underlying relation between patient information and disease. Prior FDA methods struggle with such tasks because they ignore continuous domain indices. This paper proposes FedIndex to enable FDA with continuous domain indices. FedIndex performs adversarial domain adaptation across clients with the help of a global discriminator, aligning all domains’ distributions. Our theoretical analysis demonstrates the capability of FedIndex to generate domain-invariant features across clients using continuous domain indices without accessing data on clients, while maintaining privacy preservation. Our empirical results show that FedIndex outperforms state-of-the-art FDA methods on synthetic and real-world datasets.

URL: https://openreview.net/forum?id=fnbGFH0330

---

Title: Control-oriented Energy-Based Actionable World Model for Decision-Making and Process Control

Abstract: We introduce the \emph{Energy-Based Actionable World Model} (EBAWM), a hybrid world-modeling framework for industrial process forecasting and control that combines deterministic state-space dynamics with an energy-based transition critic. EBAWM is designed for long-horizon, high-stakes decision-making, where reliable recursive prediction requires both stable state evolution and principled uncertainty awareness. In contrast to modern deep time-series models (such as CNNs, RNNs, and Transformers) that operate primarily as input-output predictors, EBAWM maintains an explicit, recursively propagated state tied to physically meaningful system variables. This structure enables state correction, long-horizon simulation, and direct integration with receding horizon control, model predictive control, and model-based reinforcement learning. The deterministic transition model provides a strong inductive bias for system identification by favoring explicit, Markovian, action-conditioned state transitions, thereby mitigating representation collapse, a common failure mode in energy-based learning. Uncertainty is captured through an energy function that evaluates the plausibility of action-conditioned state transitions, rather than by injecting stochasticity into the dynamics or relying on model ensembles. High-energy regions naturally indicate dynamically inconsistent or out-of-distribution behavior, yielding an interpretable uncertainty-aware signal without assuming a parametric noise model. Our contributions are: (i) we show that the geometry of the learned energy landscape encodes dynamical structure and stability-related properties, enabling uncertainty-aware forecasting and implicit control; (ii) we introduce a control-oriented world model that combines recursive, action-conditioned physical state propagation with energy-based transition evaluation, supporting online optimization and closed-loop decision-making; and (iii) we propose a simple and stable energy-based modeling design that avoids representation collapse by operating on a latent space shaped by a deterministic forecaster.
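The pairing of a deterministic forecaster with an energy-based transition critic can be illustrated in miniature. The linear dynamics and squared-residual energy below are assumptions for illustration only, not EBAWM's learned model:

```python
import numpy as np

# A deterministic forecaster predicts the next state, and an energy over
# transitions scores how consistent an observed (s, a, s') triple is with
# the learned dynamics. High energy flags out-of-distribution behavior.
A = np.array([[0.95, 0.05], [0.0, 0.9]])   # assumed linear state dynamics
B = np.array([[0.0], [0.1]])               # assumed action effect

def forecast(s, a):
    """Recursive, action-conditioned state propagation."""
    return A @ s + B @ a

def energy(s, a, s_next):
    """Squared residual of s_next against the forecast: 0 = fully
    consistent, large = dynamically implausible transition."""
    return float(np.sum((s_next - forecast(s, a)) ** 2))

s = np.array([1.0, 0.0]); a = np.array([0.5])
e_in = energy(s, a, forecast(s, a))          # consistent transition -> 0
e_out = energy(s, a, np.array([5.0, -3.0]))  # inconsistent -> high energy
```

In a control loop, this energy would serve as the uncertainty-aware signal the abstract describes: candidate action sequences whose simulated transitions land in high-energy regions can be rejected or penalized during online optimization.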

URL: https://openreview.net/forum?id=JLXdpnjEU3

---

Title: White-Box Sensitivity Auditing with Steering Vectors

Abstract: Algorithmic audits are essential tools for examining systems for properties required by regulators or desired by operators. Current audits of large language models (LLMs) primarily rely on black-box evaluations that assess model behavior only through input–output testing. These methods are limited to tests constructed in the input space, often generated by heuristics. In addition, many socially relevant model properties (e.g., gender bias) are abstract and difficult to measure through text-based inputs alone. To address these limitations, we propose a white-box sensitivity auditing framework for LLMs that leverages activation steering to conduct more rigorous assessments through model internals. Our auditing method conducts internal sensitivity tests by manipulating key concepts relevant to the model's intended function for the task. We demonstrate its application to bias audits in four simulated high-stakes LLM decision tasks. Our method consistently reveals substantial dependence on protected attributes in model predictions, even in settings where standard black-box evaluations suggest little or no bias.

URL: https://openreview.net/forum?id=EfinGGyQRz

---

Title: Towards Preventing Global Knowledge Forgetting in Federated Learning with Non-IID Data

Abstract: Federated learning under client-level data heterogeneity remains challenging despite extensive work on drift correction, regularization, and improved aggregation. In this paper, we argue that an important yet underexplored failure mode is catastrophic forgetting of the global decision boundary during local training: as clients optimize their local objectives, they rapidly overfit to client-specific data and erase globally useful multi-class structure, causing server aggregation to average incompatible models rather than accumulate progress. We provide empirical evidence for this phenomenon through a controlled pilot study that directly visualizes decision boundary evolution in federated learning. Our analysis reveals that standard FL methods consistently forget the global decision boundary after local updates, even when clients are initialized from a strong pretrained global model. Motivated by this observation, we propose FedProj, a federated learning framework designed to preserve global functional knowledge throughout local optimization. FedProj maintains a small public-memory buffer and enforces a hard gradient constraint that prevents local updates from increasing a memory-based distillation loss, thereby acting as a safety barrier against global knowledge erosion. At the server, we further employ ensemble distillation on the same public proxy data to consolidate the preserved knowledge into a single global model. We conduct extensive experiments across computer vision and natural language processing benchmarks, covering highly non-IID regimes and domain-shifted settings. The results show that FedProj consistently outperforms state-of-the-art federated learning methods, highlighting the practical importance of explicitly preventing global decision-boundary forgetting.

URL: https://openreview.net/forum?id=lhTWPh3Tjm

---

Title: Molecule Meets Protein Pocket: 3D-Aware Molecular Optimization for Protein Targets

Abstract: Lead optimization, refining drug candidates to improve binding to protein targets, is a key challenge in drug discovery. We introduce a 3D-aware generative framework that performs fragment-level molecular optimization conditioned on the geometry of the protein's binding pocket. Our model represents the molecule-protein complex as a sparse 3D graph and applies grouped vector attention to learn spatial interactions. It decomposes the molecule into a stable scaffold and generates new fragments using a Variational Autoencoder (VAE) and a SMILES-based transformer guided by local pocket structure. To handle the imbalance in fragment sizes, we incorporate a focal loss. On the CrossDock2020 benchmark, our method outperforms prior approaches in generating diverse, novel, and chemically valid candidates with improved Vina scores-while generalizing to unseen proteins.

URL: https://openreview.net/forum?id=0irSt7bUGw

---

Title: Multitask Transformer Models for Demographic and Industry Profiling on Long-Form Blog Texts

Abstract: We address the challenge of multitask author profiling on long-form blog text by developing four transformer-based models that jointly predict gender, age group, and industry. Using a cleaned version of the Blog Authorship Corpus, we explore document-length handling strategies that span input ranges from 192 to 500 tokens, including long-context encoding, BART-based summarization, and chunked processing with prediction fusion. Our experiments show that multitask learning consistently outperforms strong single-task baselines, with the largest gains for industry. We further find that broader input context yields more reliable predictions, while alternative representations emphasize complementary stylistic and topical cues. Taken together, these findings provide a comprehensive analysis of text-length effects in multitask author profiling and highlight the importance of contextual breadth for robust demographic inference. The dataset was preprocessed by merging industry tags into fourteen categories and applying standard text normalization.

URL: https://openreview.net/forum?id=WtFwcCvt9i

---

Title: Constrained Reinforcement Learning Using Successor Representations

Abstract: Real-world Reinforcement Learning depends on the ability to incorporate safety constraints into a policy. Unfortunately, current methods are hard to adapt to changes in the cost function introduced by, e.g., domain shift or obstacles moving over time. This lack of adaptability means that policies are too inflexible to deal with complex real-world conditions. We propose SafeDSR, a novel method that allows quick retraining of policies towards new cost structures by decoupling the dynamics, reward structure, and costs through a single learnable weight matrix. This matrix can be updated in a supervised manner, instead of having to adapt the whole network when the cost structure of the environment changes. We demonstrate this ability in a freely configurable environment and show that our method is competitive with the state of the art while being considerably more flexible. The source code will be made publicly available upon acceptance.
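The decoupling described here follows the classic successor-representation idea, which can be shown in a tabular sketch. The two-state dynamics, one-hot features, and penalty weighting below are illustrative assumptions, not SafeDSR's architecture:

```python
import numpy as np

# With features phi(s), the successor features psi satisfy V = psi @ w,
# where w maps features to (reward - penalty * cost). When the cost
# structure changes, only w is refit by least squares; psi, which encodes
# the dynamics, is reused unchanged.
gamma = 0.9
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # fixed 2-state transition matrix
phi = np.eye(2)                          # one-hot state features
psi = np.linalg.solve(np.eye(2) - gamma * P, phi)  # closed-form successor features

def values(reward, cost, penalty):
    """Supervised refit of the weight vector for a new cost structure."""
    w = np.linalg.lstsq(phi, reward - penalty * cost, rcond=None)[0]
    return psi @ w

v_old = values(np.array([1.0, 0.0]), np.array([0.0, 0.0]), penalty=1.0)
v_new = values(np.array([1.0, 0.0]), np.array([0.0, 1.0]), penalty=1.0)
# Introducing a cost on state 1 lowers its value without relearning psi.
```

The appeal for safety-constrained RL is exactly this separation: a moved obstacle changes only the cost vector, so only the cheap supervised fit of w needs to be redone.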

URL: https://openreview.net/forum?id=6zUq7knzwA

---

Title: Causally Fair Node Classification on Non-IID Graph Data

Abstract: Fair machine learning seeks to identify and mitigate biases in predictions against unfavorable populations characterized by demographic attributes, such as race and gender. Recent research has extended fairness to graph data, such as social networks, but many neglect the causal relationships among data instances. This paper addresses the prevalent challenge in fair ML algorithms, which typically assume Independent and Identically Distributed (IID) data, from the causal perspective. We base our research on the Network Structural Causal Model (NSCM) framework and develop a Message Passing Variational Autoencoder for Causal Inference (MPVA) framework to compute interventional distributions and facilitate causally fair node classification through estimated interventional distributions. Theoretical soundness of the proposed method is established under two general and practical conditions: Decomposability and Graph Independence. These conditions formalize when interventional distributions can be computed using do-calculus in non-IID settings, thereby grounding the framework in rigorous causal inference theory rather than imposing ad hoc constraints. Empirical evaluations on semi-synthetic and real-world datasets demonstrate that MPVA outperforms conventional methods by effectively approximating interventional distributions and mitigating bias. The implications of our findings underscore the potential of causality-based fairness in complex ML applications, setting the stage for further research into relaxing the initial assumptions to enhance model fairness.

URL: https://openreview.net/forum?id=AwptwzGld5

---

Title: RA-CoA: Training-free Fashion Image Captioning via Retrieval-Augmented Chain-of-Attributes

Abstract: Fashion Image Captioning (FIC) plays a vital role in enhancing user experience and product search in e-commerce platforms. Unlike natural scene image captioning, FIC requires fine-grained visual reasoning and knowledge of domain-specific terminology to capture subtle attributes such as neckline and closure types, graphic patterns, and dress silhouettes. Moreover, as fashion inventories evolve rapidly with new trends, styles, and frequently emerging vocabulary, developing a training-free captioning solution becomes essential for scalability and real-world adaptability. Instruction-tuned vision-language models (VLMs) offer a promising solution to fashion image captioning due to their strong zero-shot capabilities and natural language fluency. However, these general-purpose models often lack attribute-level coverage and precision, and tend to hallucinate or misidentify fine-grained fashion details, making them less suitable for high-fidelity applications like product cataloging or personalized recommendations. To address this, we propose RA-CoA (Retrieval-Augmented Chain-of-Attributes), a novel, training-free framework that disentangles fashion image captioning into two interpretable stages: (i) retrieval of relevant attribute sets from a product knowledge base, and (ii) attribute-level reasoning to generate the final caption. RA-CoA is a model-agnostic approach that works with frozen VLMs to improve fine-grained attribute precision in product captions without the need for fine-tuning. Extensive evaluations across diverse VLM model families under different prompting paradigms demonstrate that RA-CoA significantly improves caption quality, achieving an average gain of 26.3% METEOR score over zero-shot captioning. We shall make our code publicly available upon acceptance.

URL: https://openreview.net/forum?id=PpkOrVUpJ6

---

Title: Foundations and Frontiers of Multimodal Agentic Frameworks

Abstract: Advances in large language models (LLMs) have fueled a wave of research into agency: the ability to reason, plan, and act. This effort has produced agentic frameworks that orchestrate perception, memory, and decision-making around powerful LLM backbones. With the advent of large multimodal models (LMMs), these systems can process and integrate diverse modalities, including images, audio, and video, thereby improving their real-world applicability. Yet, while surveys of LLM-based agents exist, the role of multimodality in shaping agency has not been systematically examined in recent years. This survey fills the gap by analyzing the impact of multimodality across the core functional modules of the agentic framework: perception, reasoning, planning, memory, and action. Using this lens, we trace the evolution from text-centric agents to multimodal frameworks, examine how modalities are integrated through delegated, late-fusion, and early-fusion architectures, and assess the emergence of agentic behaviors enabled by grounded perception and multimodal reasoning. We organize existing work through a modality-centric taxonomy that links architectural design choices to agent capabilities. Moreover, we review multimodal agentic systems across various application domains, including Robotics, GUI & Web Navigation, Multimedia Content Generation & Editing, and Long-form Video Understanding & Retrieval. Beyond capabilities, we analyze performance across these settings and discuss efficiency-scalability trade-offs, including training and inference costs, latency, and deployment constraints. By focusing on the impact of multimodality in agentic design, we aim to identify key gaps and chart a roadmap toward robust and general-purpose intelligent systems.

URL: https://openreview.net/forum?id=eaVoaI7f8v

---

Title: Multi-Level Spatial Embedding Sharing for Enhanced Online Trajectory-User Linking

Abstract: Trajectory-User Linking (TUL) is a critical task in mobility applications that links unlabeled spatial trajectories to the users or entities that generated them. In these applications, data often arrives as a continuous stream and may experience distributional shifts over time. While adapting TUL models via online learning could address these challenges, this approach remains unexplored in current research. Our work bridges this gap by conducting comprehensive evaluations of common TUL techniques in an online learning context. To improve the performance of existing TUL techniques in this setting, we further introduce a novel embedding approach called Multi-Level Spatial Embedding Sharing (MiLES). MiLES operates by partially sharing embeddings for locations within neighborhoods of multiple size levels. This design enables faster adaptation via frequently-updated shared embeddings, while maintaining fine-grained discrimination through more location-specific representations. MiLES also significantly reduces the number of embedding parameters leading to lower memory usage and more computationally efficient model updates. We further incorporate learnable weighting parameters for each embedding level, allowing the model to dynamically adjust the influence of different levels based on incoming data. Our experimental results on several real-world datasets show that integrating MiLES into state-of-the-art TUL models significantly improves their performance in online learning scenarios, yielding relative gains in top-1 accuracy of up to 24%. To demonstrate its general applicability, we also evaluate MiLES on the task of destination prediction, where it also provides consistent performance improvements, confirming its value as a domain-general embedding technique. Our code is available at \url{https://anonymous.4open.science/r/MiLES-3D20}.
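The multi-level sharing idea can be sketched with hashed grid tables: each location draws one embedding per grid level, coarse levels are shared by many nearby locations (few parameters, fast online updates), fine levels stay location-specific, and learnable per-level weights mix them. The cell sizes, table size, and dimensions below are illustrative assumptions, not the MiLES configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
cell_sizes = [1.0, 0.1, 0.01]    # coarse -> fine grid levels (degrees)
tables = [rng.standard_normal((997, dim)) for _ in cell_sizes]  # hashed tables
level_w = np.ones(len(cell_sizes))  # learnable per-level mixing weights

def embed(lat, lon):
    """Sum of per-level embeddings, looked up by hashed grid cell.
    Nearby locations hit the same coarse cells, so coarse rows are
    shared and updated frequently during online learning."""
    vecs = []
    for size, table in zip(cell_sizes, tables):
        cell = (int(lat // size), int(lon // size))   # grid cell index
        vecs.append(table[hash(cell) % len(table)])   # shared lookup
    return np.tensordot(level_w, np.stack(vecs), axes=1)

a = embed(40.7128, -74.0060)   # two nearby points share coarse-level rows
b = embed(40.7130, -74.0055)
```

A gradient step on a coarse row touches every location in that cell at once, which is the mechanism behind the faster adaptation the abstract reports; the learnable weights let the model shift emphasis between shared and location-specific levels as data arrives.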

URL: https://openreview.net/forum?id=LGflWbxAuP

---

Title: MatchEx: Model-Level GNN Explanations with Multi-Granular Insights

Abstract: Graph Neural Networks (GNNs) are increasingly deployed in high-stakes domains where interpretability is crucial. Existing model-level explanation methods largely rely on generative models, which often produce motifs that fail to resemble real instances, cannot account for the diversity of discriminative motifs recognized by the classifier for a target class, and lack mechanisms for translating global explanations to instance-level insights. We present MatchEx, a framework that discovers discriminative motifs directly from real instances by optimizing a novel matching objective. Unlike isomorphism, which can only recover identical motifs that rarely occur in real-world graphs, this objective extends beyond exact matches to provably recover semantically similar motifs, allowing generalizable explanations. The matching mechanism also enables projection of class-level rationales onto individual graphs for faithful instance-level insights. When a single motif fails to explain all instances, MatchEx adaptively partitions the instances in a class into coherent subgroups with distinct rationales. Extensive experiments across six real and synthetic datasets show that MatchEx consistently outperforms state-of-the-art baselines, delivering coherent, generalizable, and multi-granular explanations.

URL: https://openreview.net/forum?id=YMETLG2WvM

---
