J2C Certification: MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng YAN
https://openreview.net/forum?id=uBcHcM7Kzi
---
J2C Certification: Reasoning-Driven Synthetic Data Generation and Evaluation
Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous
https://openreview.net/forum?id=NALsdGEPhB
---
J2C Certification: Statistical Inference for Generative Model Comparison
Zijun Gao, Han Su, Yan Sun
https://openreview.net/forum?id=PXL6SBxh0q
---
Accepted papers
===============
Title: Quantifying Structure in CLIP Embeddings: A Statistical Framework for Concept Interpretation
Authors: Jitian Zhao, Chenghui Li, Frederic Sala, Karl Rohe
Abstract: Concept-based approaches, which aim to identify human-understandable concepts within a model's internal representations, are promising for interpreting embeddings from deep neural network models, such as CLIP. While these approaches help explain model behavior, current methods lack statistical rigor, making it challenging to validate identified concepts and compare different techniques. To address this challenge, we introduce a hypothesis testing framework that quantifies rotation-sensitive structures within the CLIP embedding space.
Once such structures are identified, we propose a post-hoc concept decomposition method. Unlike existing approaches, it offers theoretical guarantees that discovered concepts represent robust, reproducible patterns (rather than method-specific artifacts) and outperforms other techniques in terms of reconstruction error. Empirically, we demonstrate that our concept-based decomposition algorithm effectively balances reconstruction accuracy with concept interpretability and helps mitigate spurious cues in data. Applied to a popular spurious correlation dataset, our method yields a 22.6% increase in worst-group accuracy after removing spurious background concepts.
URL: https://openreview.net/forum?id=D6K0Wi3kRY
---
Title: Hierarchical Mamba Meets Hyperbolic Geometry: A New Paradigm for Structured Language Embeddings
Authors: Sarang Rajendra Patil, Ashish Parmanand Pandey, Ioannis Koutis, Mengjia Xu
Abstract: Selective state-space models excel at long-sequence modeling, but their capacity for language representation, particularly in complex hierarchical reasoning, remains underexplored. Most large language models rely on flat Euclidean embeddings, limiting their ability to capture latent hierarchies. To address this, we propose Hierarchical Mamba (HiM), integrating efficient Mamba2 with hyperbolic geometry to learn hierarchy-aware language embeddings for deeper linguistic understanding. Mamba2-processed sequences are projected to the Poincaré ball or Lorentzian manifold with "learnable" curvature, optimized with a hyperbolic loss. Our HiM model facilitates the capture of relational distances across varying hierarchical levels, enabling effective long-range reasoning for tasks like mixed-hop prediction and multi-hop inference in hierarchical classification. Experimental results show that both HiM variants effectively capture hierarchical relationships across four linguistic and medical datasets, surpassing Euclidean baselines, with HiM-Poincaré providing fine-grained distinctions with higher h-norms, while HiM-Lorentz offers more stable, compact, and hierarchy-preserving embeddings.
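For intuition about the projection step, here is a minimal sketch (our illustration, not the authors' released code; the module name and curvature parametrization are assumptions) of mapping Euclidean sequence features onto the Poincaré ball with learnable curvature via the exponential map at the origin:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoincareProjection(nn.Module):
    """Map Euclidean features onto a Poincare ball of learnable curvature c."""

    def __init__(self):
        super().__init__()
        self.raw_c = nn.Parameter(torch.zeros(()))  # softplus(0) ~ 0.69

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        c = F.softplus(self.raw_c)                  # keep curvature positive
        norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
        # Exponential map at the origin of the ball:
        # exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)
        return torch.tanh(c.sqrt() * norm) * v / (c.sqrt() * norm)
```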
URL: https://openreview.net/forum?id=a3g13FKzct
---
Title: SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability
Authors: Ali Nasiri-Sarvi, Hassan Rivaz, Mahdi S. Hosseini
Abstract: Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC's alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval.
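As a rough sketch of how the two mechanisms could fit together (the interfaces and the Global TopK selection rule are our assumptions, not the released implementation):

```python
import torch
import torch.nn.functional as F

def global_topk(z_a, z_b, k):
    """Activate the same k latent dimensions in both streams, selected
    here by summed activation magnitude (an assumed selection rule)."""
    score = z_a.abs().sum(dim=0) + z_b.abs().sum(dim=0)   # (latent_dim,)
    mask = torch.zeros_like(score)
    mask[score.topk(k).indices] = 1.0
    return z_a * mask, z_b * mask

def sparc_loss(x_a, x_b, enc_a, enc_b, dec_a, dec_b, k=64):
    z_a, z_b = global_topk(enc_a(x_a), enc_b(x_b), k)
    self_rec = F.mse_loss(dec_a(z_a), x_a) + F.mse_loss(dec_b(z_b), x_b)
    # Cross-reconstruction: decode each input from the other stream's code,
    # encouraging semantically consistent latents across models/modalities.
    cross_rec = F.mse_loss(dec_a(z_b), x_a) + F.mse_loss(dec_b(z_a), x_b)
    return self_rec + cross_rec
```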
URL: https://openreview.net/forum?id=IJfvoc2GbZ
---
Title: MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Authors: Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng YAN
Abstract: Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, maintaining long-term identity consistency, achieving seamless lip-audio synchronization, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing causal motion memory to store information from an extended past context to guide temporal modeling; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion-adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, lip-audio synchronization, identity consistency, and expression-audio alignment. Our model and video demos are available at https://memoavatar.github.io.
URL: https://openreview.net/forum?id=uBcHcM7Kzi
---
Title: TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis
Authors: Wen Ye, Wei Yang, Defu Cao, Yizhou Zhang, Lumingyuan Tang, Jie Cai, Yan Liu
Abstract: Time series analysis is crucial in real-world applications, yet traditional methods focus on isolated tasks only, and recent studies on time series reasoning either remain limited to single-step inference or are constrained to natural language answers. In this work, we introduce TS-Reasoner, a domain-specialized agent designed for multi-step time series inference. By integrating large language model (LLM) reasoning with domain-specific computational tools and an error feedback loop, TS-Reasoner enables domain-informed, constraint-aware analytical workflows that combine symbolic reasoning with precise numerical analysis. We assess the system's capabilities along two axes: 1) fundamental time series understanding, assessed by TimeSeriesExam, and 2) complex, multi-step inference, evaluated by a newly proposed dataset designed to test both compositional reasoning and computational precision in time series analysis. Experiments show that our approach outperforms standalone general-purpose LLMs in both basic time series concept understanding and multi-step time series inference, highlighting the promise of domain-specialized agents for automating real-world time series reasoning and analysis.
URL: https://openreview.net/forum?id=yhy7Vigjcf
---
Title: EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes
Authors: Samuel Stockman, Daniel John Lawson, Maximilian J. Werner
Abstract: For decades, classical point process models, such as the epidemic-type aftershock sequence (ETAS) model, have been widely used for forecasting the event times and locations of earthquakes. Recent advances have led to Neural Point Processes (NPPs), which promise greater flexibility and improvements over such classical models. However, the currently used benchmark for NPPs does not represent an up-to-date challenge in the seismological community, since it contains data leakage and omits the largest earthquake sequence from the region. Additionally, initial earthquake forecasting benchmarks fail to compare NPPs with state-of-the-art forecasting models commonly used in seismology. To address these gaps, we introduce EarthquakeNPP: a benchmarking platform that curates and standardizes existing public resources: globally available earthquake catalogs, the ETAS model, and evaluation protocols from the seismology community. The datasets cover a range of small to large target regions within California, dating from 1971 to 2021, and include different methodologies for dataset generation. Benchmarking experiments, using both log-likelihood and generative evaluation metrics widely recognised in seismology, show that none of the five NPPs tested outperform ETAS. These findings suggest that current NPP implementations are not yet suitable for practical earthquake forecasting. Nonetheless, EarthquakeNPP provides a platform to foster future collaboration between the seismology and machine learning communities.
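For context, the classical baseline's conditional intensity in its standard spatio-temporal form (textbook ETAS notation, not necessarily this paper's exact parametrization) is

$$\lambda(t, x, y \mid \mathcal{H}_t) = \mu(x, y) + \sum_{i:\, t_i < t} k(m_i)\, g(t - t_i)\, f(x - x_i,\, y - y_i;\, m_i),$$

with exponential productivity $k(m) = A e^{\alpha (m - M_0)}$ and Omori-Utsu temporal decay $g(t) \propto (t + c)^{-p}$; NPPs replace these hand-designed kernels with learned ones, which is precisely the flexibility the benchmark puts to the test.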
URL: https://openreview.net/forum?id=dIcNAg6ZuZ
---
Title: UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models
Authors: Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang
Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model’s internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds more than 7× faster than the previous state-of-the-art method, while requiring 4× less VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. Our code is available at https://github.com/XiaojieGu/UltraEdit
URL: https://openreview.net/forum?id=GoJLp3BlRV
---
Title: A Systematic Study of In-the-Wild Model Merging for Large Language Models
Authors: Oğuz Kağan Hitit, Leander Girrbach, Zeynep Akata
Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for settings where all merged experts have distinct roles and are tuned on clearly separated tasks also hold in settings where the merged experts do not have clearly distinct roles, but are trained on overlapping or even conflicting objectives. To evaluate this setting, we present a large-scale, systematic evaluation of “in-the-wild” model merging of heterogeneous experts that may have been trained on overlapping or conflicting objectives. Concretely, we evaluate six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a model merged from a heterogeneous set of experts outperforms the base model and the relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs in this “in-the-wild” setting. Other interference-aware and subspace merging methods typically do not result in notable improvements over the base model. Our findings indicate that current merging techniques mostly do not enable extracting useful weight updates from heterogeneous and potentially conflicting versions. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods.
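Since Task Arithmetic is the one method found to help reliably, here is a minimal sketch of it (the standard formulation from the task-vector literature; the function signature and scaling coefficient are our choices):

```python
import torch

def task_arithmetic_merge(base_state, expert_states, lam=0.3):
    """Task Arithmetic: add the scaled sum of task vectors
    (expert weights minus base weights) back onto the base model."""
    merged = {}
    for name, w_base in base_state.items():
        # One task vector per expert checkpoint for this parameter tensor.
        task_vectors = [expert[name] - w_base for expert in expert_states]
        merged[name] = w_base + lam * torch.stack(task_vectors).sum(dim=0)
    return merged

# Usage sketch:
# merged = task_arithmetic_merge(base.state_dict(),
#                                [e.state_dict() for e in experts])
# base.load_state_dict(merged)
```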
URL: https://openreview.net/forum?id=6zSIyrqS7J
---
Title: Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients
Authors: Christos Thrampoulidis, Sadegh Mahdavi, Wenlong Deng
Abstract: We unify two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example upweighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods.
This perspective provides a lens for RLVR beyond our original motivation of Pass@K.
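For context, the Pass@K objective targeted by these methods is commonly estimated with the standard unbiased combinatorial estimator (background from the code-generation/RLVR literature, not this paper's contribution): given $n$ sampled responses of which $c$ are correct,

$$\widehat{\text{Pass@}K} \;=\; 1 - \binom{n - c}{K} \bigg/ \binom{n}{K},$$

the quantity whose policy gradient the surrogate-reward view above reinterprets.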
URL: https://openreview.net/forum?id=R1RhBFUk8t
---
Title: From Weighting to Modeling: A Nonparametric Estimator for Off-Policy Evaluation
Authors: Rong J.B. Zhu
Abstract: We study off-policy evaluation in the setting of contextual bandits, where we aim to evaluate a new policy using historical data that consists of contexts, actions, and received rewards. This historical data typically does not faithfully represent the action distribution of the new policy. A common approach, inverse probability weighting (IPW), adjusts for these discrepancies in action distributions. However, this method often suffers from high variance due to the probability being in the denominator. The doubly robust (DR) estimator reduces variance through modeling reward but does not directly address variance from IPW. In this work, we address the limitation of IPW by proposing a Nonparametric Weighting (NW) approach that constructs weights using a nonparametric model. Our NW approach achieves low bias like IPW but typically exhibits significantly lower variance. To further reduce variance, we incorporate reward predictions -- similar to the DR technique -- resulting in the Model-assisted Nonparametric Weighting (MNW) approach. The MNW approach yields accurate value estimates by explicitly modeling and mitigating bias from reward modeling, without aiming to guarantee the standard doubly robust property. Extensive empirical comparisons show that our approaches consistently outperform existing techniques, achieving lower variance in value estimation while maintaining low bias.
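For reference, the IPW estimator whose variance motivates this work has the standard form (textbook contextual-bandit notation):

$$\hat{V}_{\mathrm{IPW}}(\pi) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, r_i,$$

where $\mu$ is the logging policy; small propensities $\mu(a_i \mid x_i)$ in the denominator inflate the variance, which is the failure mode the nonparametric weights aim to avoid.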
URL: https://openreview.net/forum?id=RW6PY0AU3w
---
Title: Synapse: Adaptive Arbitration of Complementary Expertise in Time Series Foundational Models
Authors: Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Yiwen Song, Long Le, Lesly Miculicich, Jinsung Yoon, Rui Zhang, Hamid Palangi, Tomas Pfister
Abstract: Pre-trained Time Series Foundational Models (TSFMs) represent a significant advance, capable of forecasting diverse time series with complex characteristics, including varied seasonalities, trends, and long-range dependencies. Despite their primary goal of universal time series forecasting, their efficacy is far from uniform; divergent training protocols and data sources cause individual TSFMs to exhibit highly variable performance across different forecasting tasks, domains, and horizons. Leveraging this complementary expertise by arbitrating existing TSFM outputs presents a compelling strategy, yet this remains a largely unexplored area of research. In this paper, we conduct a thorough examination of how different TSFMs exhibit specialized performance profiles across various forecasting settings, and how we can effectively leverage this behavior in arbitration between different time series models. We specifically analyze how factors such as model selection and forecast horizon distribution can influence the efficacy of arbitration strategies. Based on this analysis, we propose Synapse, a novel arbitration framework for TSFMs. Synapse is designed to dynamically leverage a pool of TSFMs, assign and adjust predictive weights based on their relative, context-dependent performance, and construct a robust forecast distribution by adaptively sampling from the output quantiles of constituent models. Experimental results demonstrate that Synapse consistently outperforms other popular ensembling techniques as well as individual TSFMs, demonstrating Synapse's efficacy in time series forecasting.
URL: https://openreview.net/forum?id=j3HqbsCwt1
---
Title: Reasoning-Driven Synthetic Data Generation and Evaluation
Authors: Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous
Abstract: Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution — limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.
URL: https://openreview.net/forum?id=NALsdGEPhB
---
Title: On Gossip Algorithms for Machine Learning with Pairwise Objectives
Authors: Igor Colin, Aurélien Bellet, Stephan Clémençon, Joseph Salmon
Abstract: In the IoT era, information is more and more frequently picked up by connected smart sensors with increasing, though limited, storage, communication and computation abilities. Whether due to privacy constraints or to the structure of the distributed system, the development of statistical learning methods dedicated to data that are shared over a network is now a major issue. Gossip-based algorithms have been developed for the purpose of solving a wide variety of statistical learning tasks, ranging from data aggregation over sensor networks to decentralized multi-agent optimization. Whereas the vast majority of contributions consider situations where the function to be estimated or optimized is a basic average of individual observations, it is the goal of this article to investigate the case where the latter is of pairwise nature, taking the form of a $U$-statistic of degree two. Motivated by various problems such as similarity learning, ranking or clustering, we revisit gossip algorithms specifically designed for pairwise objective functions and provide a comprehensive theoretical framework for their convergence. This analysis fills a gap in the literature by establishing conditions under which these methods succeed, and by identifying the graph properties that critically affect their efficiency. In particular, a refined analysis of the convergence upper and lower bounds is performed.
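Concretely, the pairwise objectives in question are degree-two $U$-statistics of the form (standard definition; the scatter example is our illustration):

$$U_n(h) \;=\; \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} h(X_i, X_j),$$

e.g., $h(X_i, X_j) = \lVert X_i - X_j \rVert^2$ for a within-cluster point scatter; the difficulty for gossip methods is that each term couples observations held by two different nodes of the network.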
URL: https://openreview.net/forum?id=VxxpURovJF
---
Title: Neural Conditional Transport Maps
Authors: Carlos Rodriguez-Pardo, Leonardo Chiani, Emanuele Borgonovo, Massimo Tavoni
Abstract: We present a neural framework for learning conditional optimal transport (OT) maps between probability distributions. Conditional OT maps are transformations that adapt based on auxiliary variables, such as labels, time indices, or other parameters. They are essential for applications ranging from generative modeling to uncertainty quantification of black-box models. However, existing methods for generating conditional OT maps face significant limitations: input convex neural networks (ICNNs) impose severe architectural constraints that limit expressivity. At the same time, simpler conditioning strategies, such as concatenation, fail to model fundamentally different transport behaviors across conditions. Our approach introduces a conditioning mechanism capable of simultaneously processing both categorical and continuous conditioning variables, using learnable embeddings and positional encoding. At the core of our method lies a hypernetwork that generates transport layer parameters based on these inputs, creating adaptive mappings that outperform simpler conditioning methods. We showcase the framework's practical impact through applications to global sensitivity analysis, enabling efficient computation of OT-based sensitivity indices for complex black-box models. This work advances the state-of-the-art in conditional optimal transport, enabling broader application of optimal transport principles to complex, high-dimensional domains such as generative modeling, black-box model explainability, and scientific computing.
URL: https://openreview.net/forum?id=CZvkpQc73I
---
Title: Adaptive Online Mirror Descent for Tchebycheff Scalarization in Multi-Objective Learning
Authors: Meitong Liu, Xiaoyuan Zhang, Chulin Xie, Kate Donahue, Han Zhao
Abstract: Multi-objective learning (MOL) aims to learn under multiple potentially conflicting objectives and strike a proper balance. While recent preference-guided MOL methods often rely on additional optimization objectives or constraints, we consider the classic Tchebycheff scalarization (TCH) that naturally allows for locating solutions with user-specified trade-offs. Due to its minimax formulation, directly optimizing TCH often leads to training oscillation and stagnation. In light of this limitation, we propose an adaptive online mirror descent algorithm for TCH, called (Ada)OMD-TCH. One of our main ingredients is an adaptive online-to-batch conversion that significantly improves solution optimality over traditional conversion in practice while maintaining the same theoretical convergence guarantees. We show that (Ada)OMD-TCH achieves a convergence rate of $\mathcal O(\sqrt{\log m/T})$, where $m$ is the number of objectives and $T$ is the number of rounds, providing a tighter dependency on $m$ in the offline setting compared to existing work. Empirically, we demonstrate on both synthetic problems and federated learning tasks that (Ada)OMD-TCH effectively smooths the training process and yields preference-guided, specific, diverse, and fair solutions.
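For readers new to the objective, Tchebycheff scalarization and the simplex relaxation that makes mirror descent applicable take the standard form (our notation, matching common usage rather than the paper's exact display):

$$\min_{\theta} \max_{i \in [m]} \lambda_i \big(f_i(\theta) - z_i^*\big) \;=\; \min_{\theta} \max_{w \in \Delta_m} \sum_{i=1}^{m} w_i\, \lambda_i \big(f_i(\theta) - z_i^*\big),$$

where $\lambda$ encodes the user-specified trade-off and $z^*$ is an ideal reference point; running online mirror descent with an entropy mirror map on $w$ yields multiplicative weight updates over the simplex, which is where the $\sqrt{\log m/T}$ dependence originates.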
URL: https://openreview.net/forum?id=MUTRffB3af
---
Title: Lorenza: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM
Authors: Yehonathan Refael, Iftach Arbel, Ofir Lindenbaum, Tom Tirer
Abstract: Modern applications often require fine-tuning large language models (LLMs) within strict memory and computational limits, but existing memory-efficient optimizers tend to compromise robustness and generalization. To tackle this, we introduce Lorenza, a low-memory optimizer based on Sharpness-Aware Minimization (SAM). Lorenza employs a stochastic zeroth-order estimator to approximate ascent directions, reducing the computational complexity of SAM while, as we prove, maintaining its convergence guarantees. Additionally, by applying randomized singular value decomposition, Lorenza performs efficient low-rank gradient updates, achieving memory efficiency similar to traditional methods. Our theoretical analysis and experiments demonstrate that Lorenza improves robustness and generalization, particularly in challenging language tasks. Furthermore, we present Lorenza+, which enhances Lorenza by incorporating the discarded orthogonal gradient component, resulting in additional performance gains without requiring extra memory or computational overhead.
URL: https://openreview.net/forum?id=YyA51ekcQo
---
Title: S$^2$Transformer: Scalable Structured Transformers for Global Station Weather Forecasting
Authors: Hongyi Chen, Xiucheng Li, Xinyang Chen, Yun Cheng, Jing Li, Kehai Chen, Liqiang Nie
Abstract: Global Station Weather Forecasting (GSWF) is a key meteorological research area, critical to energy, aviation, and agriculture. Existing time series forecasting methods often ignore or unidirectionally model spatial correlation when conducting large-scale global station forecasting. This contradicts the intrinsic nature underlying observations of the global weather system, limiting forecast performance. To address this, we propose a novel Spatial Structured Attention Block in this paper. It partitions the spatial graph into a set of subgraphs and instantiates Intra-subgraph Attention to learn local spatial correlation within each subgraph, and aggregates nodes into subgraph representations for message passing among the subgraphs via Inter-subgraph Attention, considering both spatial proximity and global correlation. Building on this block, we develop a multiscale spatiotemporal forecasting model S$^2$Transformer by progressively expanding subgraph scales. The resulting model is both scalable and able to produce structured spatial correlation, and meanwhile, it is easy to implement. The experimental results show that it can achieve performance improvements up to 16.8% over time series forecasting baselines at low running costs.
URL: https://openreview.net/forum?id=AL2VnKno5n
---
Title: Guess-and-Learn (G&L): Measuring the Cumulative Error Cost of Cold-Start Adaptation
Authors: Roland Arnold
Abstract: Evaluation of machine learning models typically emphasizes final accuracy, overlooking the cost of adaptation: the cumulative errors incurred while learning from scratch. Guess-and-Learn (G&L) v1.0 addresses this gap by measuring cold-start adaptability: the total mistakes a model makes while sequentially labeling an unlabeled dataset. At each step, the learner selects an instance, predicts its label, receives the ground truth, and updates parameters under either online (per-sample) or batch (delayed) mode. The resulting error trajectory exposes adaptation speed, selection quality, and bias, dynamics invisible to endpoint metrics. G&L defines four tracks (Scratch/Pretrained $\times$ Online/Batch) to disentangle the effects of initialization and update frequency. We formalize the protocol, relate it to classical mistake-bound theory, and estimate a heuristic "oracle reference band" for MNIST as a plausibility reference. Baseline experiments on MNIST and AG News, spanning classical methods (Perceptron, $k$-NN), convolutional architectures (CNN, ResNet-50), and pretrained transformers (ViT-B/16, BERT-base), reveal systematic differences in early-phase efficiency: smaller models can adapt with fewer initial errors, while pretraining benefits vary by domain. Across settings, current models remain well above the oracle band, highlighting an adaptability gap. By quantifying the mistake cost of early learning, G&L complements conventional benchmarks and provides a reproducible framework for developing learners that are not only accurate in the limit but also reliable from the first examples.
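A minimal sketch of the protocol loop as described (the `select`, `predict`, and `update` interfaces are hypothetical placeholders; batch mode would simply buffer the updates):

```python
def guess_and_learn(model, pool, select):
    """Sketch of the G&L online track: pick an instance, guess its label
    before the ground truth is revealed, count the mistake, then update."""
    mistakes, trajectory = 0, []
    while pool:
        i = select(model, pool)            # learner-chosen index into the pool
        x, y_true = pool.pop(i)
        y_hat = model.predict(x)           # guess made before the reveal
        mistakes += int(y_hat != y_true)   # cumulative error cost
        trajectory.append(mistakes)
        model.update(x, y_true)            # online (per-sample) update
    return trajectory                      # the error trajectory G&L reports
```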
URL: https://openreview.net/forum?id=uNxKhjcRp9
---
Title: GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation
Authors: Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller
Abstract: Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently delivers superior quality and offers a $>2\times$ speedup in inference rates across various datasets.
URL: https://openreview.net/forum?id=QLD47Ou5lp
---
Title: Amplified Patch-Level Differential Privacy for Free via Random Cropping
Authors: Kaan Durmaz, Jan Schuchardt, Sebastian Schmidt, Stephan Günnemann
Abstract: Random cropping is one of the most common data augmentation techniques in computer vision, yet the role of its inherent randomness in training differentially private machine learning models has thus far gone unexplored. We observe that when sensitive content in an image is spatially localized, such as a face or license plate, random cropping can probabilistically exclude that content from the model’s input. This introduces a third source of stochasticity in differentially private training with stochastic gradient descent, in addition to gradient noise and minibatch sampling. This additional randomness amplifies differential privacy without requiring changes to model architecture or training procedure. We formalize this effect by introducing a patch-level neighboring relation for vision data and deriving tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) when combined with random cropping. Our analysis quantifies the patch inclusion probability and shows how it composes with minibatch sampling to yield a lower effective sampling rate. Empirically, we validate that patch-level amplification improves the privacy-utility trade-off across multiple segmentation architectures and datasets. Our results demonstrate that aligning privacy accounting with domain structure and additional existing sources of randomness can yield stronger guarantees at no additional cost.
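To make the amplification intuition concrete, here is a toy calculation under simplified geometry (uniform crop placement, axis-aligned patch; our illustration, not the paper's tight accounting):

```python
def patch_overlap_prob(H, W, h, w, patch):
    """Probability that a uniformly placed h x w crop of an H x W image
    overlaps a sensitive patch given as (r0, r1, c0, c1) bounds."""
    r0, r1, c0, c1 = patch
    total = (H - h + 1) * (W - w + 1)      # number of valid crop origins
    # A crop with top-left corner (i, j) overlaps the patch iff
    # i < r1 and i + h > r0 (rows), and analogously for columns.
    rows = sum(1 for i in range(H - h + 1) if i < r1 and i + h > r0)
    cols = sum(1 for j in range(W - w + 1) if j < c1 and j + w > c0)
    return rows * cols / total

# Intuition: with minibatch sampling rate q and patch inclusion
# probability p, the privacy accountant sees an effective rate near q * p.
```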
URL: https://openreview.net/forum?id=pSWuUF8AVP
---
Title: Facial Counterfactual Generation via Causal Mask-Guided Editing
Authors: Pei Sze Tan, Sailaja Rajanala, Arghya Pal, Raphael CW Phan, Huey Fang Ong
Abstract: Generating counterfactual facial images is an important tool for interpretable machine learning, fairness analysis, and understanding the causal relationships among facial attributes. In this work, we propose a novel neuro-symbolic framework for causal editing, which integrates causal graph discovery, mask-guided counterfactual generation, and semantic interpretation to produce facial images that are both realistic and causally consistent. We first employ the Fast Causal Inference (FCI) algorithm to uncover latent causal relationships among facial attributes, enabling the identification of direct and indirect factors for target interventions. Using these causal graphs, we construct spatially informed masks that guide a DDPM-based generative model, ensuring that only regions relevant to the causal factors are modified. Finally, we leverage CLIP-based embeddings to provide logical, human-understandable explanations of the semantic changes in the counterfactuals. Experiments on CelebA and CelebA-HQ demonstrate that our approach produces high-fidelity counterfactuals, achieves superior performance on sparsity and realism metrics, and mitigates bias compared to state-of-the-art methods. This framework offers a principled approach to causally grounded, interpretable facial image editing.
URL: https://openreview.net/forum?id=ssamEGQj0C
---
Title: LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed
Authors: Chang Yang, Xinrun Wang, Junzhe Jiang, Qinggang Zhang, Xiao Huang
Abstract: World models have emerged as a key module in decision making, with MuZero and Dreamer achieving remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world model is either evaluated as a general world simulator or as a functional module of the agent, i.e., predicting the transitions to assist the planning. This paper argues that LLM-based world models can make decisions on their own, but rigorous evaluations are needed. We first present two key observations showcasing how LLM-based world models can make decisions solely, and then three key observations demonstrating why the current evaluation framework for LLM-based world models is insufficient. We then present our suggested evaluation framework: policy verification, action proposal, and policy planning, where the world model alone is used for decision making. Finally, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate a rule-based policy for each environment for diverse evaluations. The key findings include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially on tasks that require domain knowledge, e.g., scientific tasks; ii) the performance of LLM-based world models depends predominantly on their performance at key steps, while the total number of steps required for task completion is a weak indicator of task difficulty, with critical bottleneck steps playing a more decisive role; and iii) combining the world models' functionalities for decision making tends to increase performance instability in our experiments, which can partially obscure the performance gap between stronger and weaker models.
URL: https://openreview.net/forum?id=XmYCERErcD
---
Title: GroundingBooth: Grounding Text-to-Image Customization
Authors: Zhexiao Xiong, Wei Xiong, Jing Shi, He Zhang, Yizhi Song, Nathan Jacobs
Abstract: Recent approaches in text-to-image customization have primarily focused on preserving the identity of the input subject, but often fail to control the spatial location and size of objects. We introduce GroundingBooth, which achieves zero-shot, instance-level spatial grounding on both foreground subjects and background objects in the text-to-image customization task. Our proposed grounding module and subject-grounded cross-attention layer enable the creation of personalized images with accurate layout alignment, identity preservation, and strong text-image coherence. In addition, our model seamlessly supports personalization with multiple subjects. Our model shows strong results in both layout-guided image synthesis and text-to-image customization tasks. The project page is available at https://groundingbooth.github.io.
URL: https://openreview.net/forum?id=TRlZpHU300
---
Title: LJ-Bench: Ontology-Based Benchmark for U.S. Crime
Authors: Hung Yun Tseng, Wuzhen Li, Blerina Gkotse, Grigorios Chrysos
Abstract: The potential of Large Language Models (LLMs) to provide harmful information remains a significant concern due to the vast breadth of illegal queries they may encounter. Unfortunately, existing benchmarks focus on only a handful of types of illegal activities and are not grounded in legal sources. In this work, we introduce an ontology of crime-related concepts grounded in the legal framework of the Model Penal Code, which serves as an influential reference for criminal law and has been adopted by many U.S. states, and instantiated using California law. This structured knowledge forms the foundation for LJ-Bench, the first comprehensive benchmark designed to evaluate LLM robustness against a wide range of illegal activities. Spanning 76 distinct crime types organized taxonomically, LJ-Bench enables systematic assessment of diverse attacks, revealing valuable insights into LLM vulnerabilities across various crime categories: LLMs exhibit heightened susceptibility to attacks targeting societal harm rather than those directly impacting individuals. Our benchmark aims to facilitate the development of more robust and trustworthy LLMs. The LJ-Bench benchmark and LJ-Ontology, along with the implementation of our experiments for reproducibility, are publicly available at https://github.com/AndreaTseng/LJ-Bench.
URL: https://openreview.net/forum?id=gsWEbyzFl2
---
Title: Making Video Models Adhere to User Intent with Minor Adjustments
Authors: Daniel Ajisafe, Eric Hedlin, Helge Rhodin, Kwang Moo Yi
Abstract: With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular form of control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are moving the bounding boxes to places that the model is familiar with. Surprisingly, we find that even these small modifications can change the quality of generations significantly. To make this optimization possible, we propose a smooth mask that renders the bounding box position differentiable and an attention-maximization objective used to alter the bounding boxes. We conduct thorough experiments, including a user study, to validate the effectiveness of our method.
URL: https://openreview.net/forum?id=Opvq2wfBR5
---
Title: Inverse classification with logistic and softmax classifiers: efficient optimization
Authors: Miguel Á. Carreira-Perpiñán, Suryabhan Singh Hada
Abstract: In recent years, a certain type of problem has become of interest where one wants to query a trained classifier. Specifically, one wants to find the closest instance to a given input instance such that the classifier's predicted label is changed in a desired way. Examples of these "inverse classification" problems are counterfactual explanations, adversarial examples and model inversion. All of them are fundamentally optimization problems over the input instance vector involving a fixed classifier, and it is of interest to achieve a fast solution for interactive or real-time applications. We focus on solving this problem efficiently with the squared Euclidean distance for two of the most widely used classifiers: logistic regression and the softmax classifier. Owing to special properties of these models, we show that the optimization can be solved in closed form for logistic regression, and iteratively but extremely fast for the softmax classifier. This allows us to solve either case exactly (to nearly machine precision) in a runtime of milliseconds to around a second, even for very high-dimensional instances and many classes.
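To convey the flavor of the closed form in the simplest case (binary logistic regression with squared Euclidean distance; the paper treats the general setting, including softmax), the closest point to $x_0$ whose logit reaches a target value $m$ is the orthogonal projection onto the hyperplane $w^\top x + b = m$:

$$x^* \;=\; x_0 - \frac{w^\top x_0 + b - m}{\lVert w \rVert^2}\, w,$$

so choosing $m$ fixes the desired predicted probability $\sigma(m)$, and the counterfactual is obtained in a single vector operation.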
URL: https://openreview.net/forum?id=ZrNhf7P3a1
---
Title: XCTFormer: Leveraging Cross-Channel and Cross-Time Dependencies for Enhanced Time-Series Analysis
Authors: Israel Zexer, Omri Azencot
Abstract: Multivariate time-series analysis involves extracting informative representations from sequences of multiple interdependent variables, supporting tasks such as forecasting, imputation, and anomaly detection. In real-world scenarios, these variables are typically collected from a shared context or underlying phenomenon, suggesting the presence of latent dependencies across time and channels that can be leveraged to improve performance. However, recent findings show that channel-independent (CI) models, which assume no inter-variable dependencies, often outperform channel-dependent (CD) models that explicitly model such relationships. This surprising result indicates that current CD models may not fully exploit their potential due to limitations in how dependencies are captured. Recent studies have revisited channel dependence modeling with various approaches; however, these methods often employ indirect modeling strategies, which can lead to meaningful dependencies being overlooked. To address this issue, we introduce XCTFormer, a transformer-based channel-dependent (CD) model that explicitly captures cross-temporal and cross-channel dependencies via an enhanced attention mechanism. The model operates in a token-to-token fashion, modeling pairwise dependencies between every pair of tokens across time and channels. The architecture comprises (i) a data processing module, (ii) a novel Cross-Relational Attention Block (CRAB) that increases capacity and expressiveness, and (iii) an optional Dependency Compression Plugin (DeCoP) that improves scalability. Through extensive experiments on three time-series benchmarks, we show that XCTFormer achieves strong results compared to widely recognized baselines; in particular, it attains state-of-the-art performance on the imputation task, outperforming the second-best method by an average of 20.8% in MSE and 15.3% in MAE. Our code is publicly available at https://github.com/azencot-group/XCTFormer.
URL: https://openreview.net/forum?id=TEfyR4t0Tw
---
Title: Hypothesis-Driven Feature Manifold Analysis in LLMs via Supervised Multi-Dimensional Scaling
Authors: Federico Tiblias, Irina Bigoulaeva, Jingcheng Niu, Simone Balloccu, Iryna Gurevych
Abstract: The linear representation hypothesis states that language models (LMs) encode concepts as directions in their latent space, forming organized, multidimensional manifolds. Prior work has largely focused on identifying specific geometries for individual features, limiting its ability to generalize. We introduce Supervised Multi-Dimensional Scaling (SMDS), a model-agnostic method for evaluating and comparing competing feature manifold hypotheses. We apply SMDS to temporal reasoning as a case study and find that different features instantiate distinct geometric structures, including circles, lines, and clusters. SMDS reveals several consistent characteristics of these structures: they reflect the semantic properties of the concepts they represent, remain stable across model families and sizes, actively support reasoning, and dynamically reshape in response to contextual changes. Together, our findings shed light on the functional role of feature manifolds, supporting a model of entity-based reasoning in which LMs encode and transform structured representations.
URL: https://openreview.net/forum?id=vCKZ40YYPr
---
Title: PriSM: Prior-Guided Search Methods for Query Efficient Black-Box Attacks
Authors: Pavlos Ntais, Thanassis Avgerinos
Abstract: Deep Neural Networks are vulnerable to adversarial examples in black-box settings, requiring query-efficient attack methods. We propose PriSM (Prior-Guided Search Methods), which systematically exploits two types of transferable surrogate information: decision boundary geometry and loss landscape topography. We demonstrate their utility through complementary attacks: (1) TGEA leverages boundary geometry to initialize evolutionary optimization with surrogate-evolved populations, maximizing attack success rates, and (2) SGSA leverages loss topography via multi-scale saliency guidance to direct Square Attack's perturbations, minimizing query costs. Across MNIST, CIFAR-10, and ImageNet, both methods achieve 30-60% query reductions compared to uninformed baselines, while also being competitive with state-of-the-art hybrid attacks. Our evaluation reveals a strategic trade-off: SGSA excels in query efficiency through local exploitation, whereas TGEA maximizes success rates via global exploration. Our comprehensive evaluation also demonstrates that different types of surrogate information require matched exploitation strategies, providing practical guidance for query-efficient black-box attacks.
URL: https://openreview.net/forum?id=UQsOh2kfhP
---
Title: Involuntary Jailbreak: On Self-Prompting Attacks
Authors: Yangyang Guo, Yangyan Li, Mohan Kankanhalli
Abstract: In this study, we disclose a worrying new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness is distinct in that it does not involve a specific attack objective, such as generating instructions for building a bomb. Prior attack methods predominantly target localized components of the LLM guardrail. In contrast, involuntary jailbreaks may potentially compromise the global guardrail structure, which our method reveals to be surprisingly fragile. We merely employ a single universal prompt to achieve this goal. In particular, we instruct LLMs to generate several questions (self-prompting) that would typically be rejected, along with their corresponding in-depth responses (rather than a refusal). Remarkably, this simple prompt strategy consistently jailbreaks almost all leading LLMs tested, such as Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1. With its wide targeting scope and near-universal effectiveness, this vulnerability makes existing jailbreak attacks seem less necessary until it is patched. More importantly, we hope this problem can motivate researchers and practitioners to rethink and re-evaluate the robustness of LLM guardrails and contribute to stronger safety alignment in the future.
URL: https://openreview.net/forum?id=2s0AkiVPYc
---
Title: Statistical Inference for Generative Model Comparison
Authors: Zijun Gao, Han Su, Yan Sun
Abstract: Generative models have achieved remarkable success across a range of applications, yet their evaluation still lacks principled uncertainty quantification. In this paper, we develop a method for comparing how close different generative models are to the underlying distribution of test samples. In particular, our approach employs the Kullback-Leibler (KL) divergence to measure the distance between a generative model and the unknown test distribution, as KL requires no tuning parameters such as the kernels used by RKHS-based distances. Moreover, the relative KL divergence is the only $f$-divergence that admits a crucial cancellation of the hard-to-estimate term, enabling faithful uncertainty quantification. Furthermore, we extend our method to comparing conditional generative models and leverage Edgeworth expansions to address limited-data settings. On simulated datasets with known ground truth, we show that our approach achieves effective coverage rates and higher power compared to kernel-based methods. When applied to generative models on image and text datasets, our procedure yields conclusions consistent with benchmark metrics but with statistical confidence. The source code to reproduce our experiments is available at https://github.com/sylydya/compare-generative-models.
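The cancellation referred to above is the elementary identity

$$\mathrm{KL}(p \,\|\, q_1) - \mathrm{KL}(p \,\|\, q_2) \;=\; \mathbb{E}_{X \sim p}\big[\log q_2(X) - \log q_1(X)\big],$$

in which the intractable entropy term $\mathbb{E}_p[\log p(X)]$ drops out, so the relative divergence can be estimated by averaging the two models' log-likelihood differences over test samples, with a central limit theorem supplying confidence intervals.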
URL: https://openreview.net/forum?id=PXL6SBxh0q
---
Title: On the Dynamics & Transferability of Latent Generalization during Memorization
Authors: Simran Ketha, Venkatakrishnan Ramaswamy
Abstract: Deep networks have been known to have extraordinary generalization abilities, via mechanisms that are not yet well understood. It is also known that, upon shuffling labels in the training data to varying degrees, deep networks trained with standard methods can still achieve perfect or high accuracy on this corrupted training data. This phenomenon is called memorization, and it typically comes at the cost of poorer generalization to true labels. Our recent work has demonstrated, surprisingly, that the internal representations of such models retain significantly better latent generalization abilities than is directly apparent from the model. In particular, it has been shown that such latent generalization can be recovered via simple probes (called MASC probes) on the layer-wise representations of the model. However, the origin and dynamics over training of this latent generalization during memorization are not well understood. Here, we empirically track the training dynamics and find that latent generalization abilities largely peak early in training, together with model generalization. Next, we investigate to what extent the specific nature of the MASC probe is critical for our ability to extract latent generalization from the model's layer-wise outputs. To this end, we first examine the mathematical structure of the MASC probe and show that it is a quadratic classifier, i.e., it is non-linear. This raises the question of the extent to which this latent generalization might be linearly decodable from layer-wise outputs. To investigate this, we designed a new linear probe for this setting. We find cases where this linear probe outperforms MASC, and other cases where the opposite happens, notably for many instances of ResNet-18 trained on CIFAR-10. Next, we consider the question of whether it is possible to transfer latent generalization to model generalization by directly editing model weights. To this end, we devise a way to transfer the latent generalization present in last-layer representations to the model using the new linear probe. This immediately endows such models with improved generalization in many cases, i.e., without additional training. We also explore training dynamics when the aforementioned weight editing is done midway through training. Our findings provide a more detailed account of the rich dynamics of latent generalization during memorization, clarify the specific role of the probe in latent generalization, and demonstrate a means to leverage this understanding to transfer generalization directly to the model. Our code is available at: https://github.com/simranketha/Dynamics_during_training_DNN.
URL: https://openreview.net/forum?id=t024Zm0tKF
---
Title: A Systematic Evaluation of Out-of-Distribution Generalization in Crop Yield Prediction
Authors: Aditya Chakravarty
Abstract: Accurate crop yield forecasting under shifting climatic conditions is essential for food security and agricultural resilience. While recent deep learning models achieve strong performance in in-domain settings, their ability to generalize across space and time—critical for real-world deployment—remains poorly understood. In this work, we present the first systematic evaluation of temporally-aware crop yield prediction models under spatio-temporal out-of-distribution (OOD) conditions, using corn and soybean data across more than 1,200 U.S. counties. We benchmark two representative architectures, GNN-RNN and MMST-ViT, using rigorous evaluation strategies including year-ahead forecasting, leave-one-region-out validation, and stratified OOD scenarios of varying difficulty based on USDA Farm Resource Regions. Our comprehensive analysis reveals significant performance gaps across agro-ecological zones, with some models showing negative R² values under distribution shift. We uncover asymmetric transferability patterns and identify the Prairie Gateway region as consistently challenging for generalization. These findings challenge prior generalizability claims and provide practical insights for deploying agricultural AI systems under climate variability.
URL: https://openreview.net/forum?id=to4sVjsxsO
---
Title: Flow Matching for Tabular Data Synthesis
Authors: Bahrul Ilmi Nasution, Floor Eijkelboom, Mark Elliot, Richard Allmendinger, Christian A. Naesseth
Abstract: Synthetic data generation is an important tool for privacy-preserving data sharing. Although diffusion models have set recent benchmarks, flow matching (FM) offers a promising alternative. This paper presents different ways to implement FM for tabular data synthesis. We provide a comprehensive empirical study that compares flow matching (FM and variational FM) with state-of-the-art diffusion methods (TabDDPM and TabSyn) in tabular data synthesis. We evaluate both the standard Optimal Transport (OT) and the Variance Preserving (VP) probability paths, and also compare deterministic and stochastic samplers -- something possible when learning to generate using variational FM -- characterising the empirical relationship between data utility and privacy risk. Our key findings reveal that FM, particularly TabbyFlow, outperforms diffusion baselines. Flow matching methods also achieve better performance with remarkably low function evaluations ($\leq$ 100 steps), offering a substantial computational advantage. The choice of probability path is also crucial: OT is a strong default and more robust to early stopping on average, while VP has the potential to produce synthetic data with lower privacy risk. Lastly, our results show that making flows stochastic not only preserves marginal distributions but, in some instances, enables the generation of high-utility synthetic data with reduced disclosure risk. The implementation code associated with this paper is publicly available at https://github.com/rulnasution/tabular-flow-matching.
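As a minimal reference for the OT probability path (the standard conditional flow matching objective from the FM literature, not the paper's TabbyFlow variant; `v_theta`'s signature is an assumption):

```python
import torch

def cfm_ot_loss(v_theta, x1):
    """Conditional flow matching with the linear (OT) path:
    x_t = (1 - t) * x0 + t * x1, target velocity u_t = x1 - x0."""
    x0 = torch.randn_like(x1)            # noise source sample
    t = torch.rand(x1.shape[0], 1)       # one time per tabular row
    xt = (1 - t) * x0 + t * x1           # point on the path
    target = x1 - x0                     # conditional velocity target
    return ((v_theta(xt, t) - target) ** 2).mean()
```

At sampling time, integrating the learned velocity field from $t=0$ to $t=1$ with on the order of a hundred Euler steps is consistent with the low function-evaluation counts reported above.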
URL: https://openreview.net/forum?id=RdOjoAa66L
---
Title: Goal Achievement Guided Exploitation: A Principled Performance-Based Scheduling Framework for Reinforcement Learning
Authors: Shengchao Yan, Baohe Zhang, Chenguang Huang, Joschka Boedecker, Wolfram Burgard
Abstract: In dense-reward tasks, reinforcement learning (RL) algorithms often employ soft entropy regularization to promote exploration. By integrating an entropy term into the objective function, they regularize exploration via tuning the coefficient. However, the entropy coefficient only indirectly influences the action distribution through gradient updates, making it difficult to precisely control exploration, and it requires careful scheduling to balance exploration and exploitation throughout training. As a solution, we propose Goal Achievement Guided Exploitation (GAGE), a performance-based scheduling framework that adaptively regulates exploration by linking policy stochasticity directly to the agent's performance relative to a target value. Unlike soft entropy regularizers, GAGE enforces hard, performance-dependent constraints on the action distribution's standard deviation for continuous actions and on the logit range for discrete actions. Consequently, GAGE ensures a guaranteed lower bound on action probabilities that naturally decays as the agent approaches optimal performance. Across a suite of challenging robotic control tasks, GAGE improves learning efficiency and stability over various strong baselines, achieving competitive or superior final performance. By providing a more interpretable and robust alternative to entropy-based exploration heuristics, GAGE offers a scalable path toward solving complex dense-reward tasks with pronounced local optima.
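A sketch of the kind of schedule the abstract describes for continuous actions (the linear decay and its endpoints are our assumptions, not the paper's exact rule):

```python
def gage_min_std(perf, target, sigma_hi=1.0, sigma_lo=0.01):
    """Hard floor on the policy's standard deviation that decays as the
    measured return approaches the target value, guaranteeing exploration
    early while allowing near-deterministic behavior near the goal."""
    progress = min(max(perf / target, 0.0), 1.0)   # goal-achievement ratio
    return sigma_hi - progress * (sigma_hi - sigma_lo)

# Applied as a clamp on the actor head's output (assumed usage):
# sigma = sigma.clamp(min=gage_min_std(recent_return, target_return))
```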
URL: https://openreview.net/forum?id=uGidW0fKhK
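One way such a performance-gated constraint could be realized is sketched below; the exact mapping from goal achievement to the standard-deviation floor is illustrative, not GAGE's formula:

```python
# A hedged sketch of a performance-gated exploration schedule in the spirit
# of GAGE: the policy's standard deviation is clamped from below by a bound
# that decays as the agent's return approaches a target value.
import torch

def gated_std(raw_std, avg_return, target_return,
              std_max=1.0, std_min=0.05):
    # goal achievement ratio, clipped to [0, 1]
    g = torch.clamp(torch.as_tensor(avg_return / target_return), 0.0, 1.0)
    lower = std_max + g * (std_min - std_max)      # decays from std_max to std_min
    return torch.clamp(raw_std, min=float(lower))  # hard, performance-dependent floor

std = gated_std(raw_std=torch.tensor([0.02, 0.3]),
                avg_return=150.0, target_return=500.0)
print(std)  # floor stays high early in training, guaranteeing exploration
```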
---
Title: SPoT: Subpixel Placement of Tokens in Vision Transformers
Authors: Martine Hjelkrem-Tan, Marius Aasan, Gabriel Y. Arteaga, Adín Ramírez Rivera
Abstract: Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
URL: https://openreview.net/forum?id=XrBzSmzAVo
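Sampling features at continuous positions is the mechanical core of such a tokenizer; a minimal sketch with bilinear interpolation follows (random positions stand in for SPoT's learned or oracle-guided placement, and per-position patch extraction is elided):

```python
# A minimal sketch of reading token features at continuous (subpixel)
# positions; positions here are random rather than optimized.
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 224, 224)              # B, C, H, W
n_tokens = 64
pos = torch.rand(1, n_tokens, 1, 2) * 2 - 1    # normalized coords in [-1, 1]

# grid_sample interpolates the feature map at fractional coordinates
tokens = F.grid_sample(img, pos, mode="bilinear", align_corners=False)
tokens = tokens.squeeze(-1).transpose(1, 2)    # B, n_tokens, C
print(tokens.shape)                            # torch.Size([1, 64, 3])
```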
---
Title: Merging Memory and Space: A State Space Neural Operator
Authors: Nodens Koren, Samuel Lanthaler
Abstract: We propose the *State Space Neural Operator* (SS-NO), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). Our formulation extends structured state space models (SSMs) to joint spatiotemporal modeling, introducing two key mechanisms: *adaptive damping*, which stabilizes learning by localizing receptive fields, and *learnable frequency modulation*, which enables data-driven spectral selection. These components provide a unified framework for capturing long-range dependencies with parameter efficiency. Theoretically, we establish connections between SSMs and neural operators, proving a universality theorem for convolutional architectures with a full field of view. Empirically, SS-NO achieves strong performance across diverse PDE benchmarks—including 1D Burgers' and Kuramoto–Sivashinsky equations, and 2D Navier–Stokes and compressible Euler flows—while using significantly fewer parameters than competing approaches. Our results demonstrate that state space modeling provides an effective foundation for efficient and accurate neural operator learning.
URL: https://openreview.net/forum?id=SwLxxz0x58
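The two mechanisms named above can be illustrated with a toy spectral layer whose kernel is a sum of damped, frequency-modulated modes applied via FFT; this is a sketch of the idea, not the SS-NO architecture:

```python
# A hedged sketch of a state-space-style spectral convolution: damping
# localizes the receptive field, learnable frequencies select spectral content.
import torch, torch.nn as nn

class DampedSpectralConv(nn.Module):
    def __init__(self, modes, length):
        super().__init__()
        self.log_damp = nn.Parameter(torch.zeros(modes))            # adaptive damping
        self.freq = nn.Parameter(torch.linspace(0.1, 3.0, modes))   # learnable frequencies
        self.mix = nn.Parameter(torch.randn(modes) / modes)
        self.length = length

    def forward(self, u):                                # u: (batch, length)
        t = torch.arange(self.length, dtype=u.dtype)
        damp = torch.exp(self.log_damp).unsqueeze(1)     # (modes, 1)
        # kernel k(t) = sum_m a_m * exp(-damp_m t) * cos(freq_m t)
        k = (self.mix.unsqueeze(1)
             * torch.exp(-damp * t) * torch.cos(self.freq.unsqueeze(1) * t)).sum(0)
        U = torch.fft.rfft(u, 2 * self.length)           # zero-padded FFT conv
        K = torch.fft.rfft(k, 2 * self.length)
        return torch.fft.irfft(U * K, 2 * self.length)[..., :self.length]

layer = DampedSpectralConv(modes=8, length=128)
print(layer(torch.randn(4, 128)).shape)                  # torch.Size([4, 128])
```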
---
Title: End-to-end Deep Reinforcement Learning for Stochastic Multi-objective Optimization in C-VRPTW
Authors: Abdo Abouelrous, Laurens Bliek, Yaoxin Wu, Yingqian Zhang
Abstract: In this work, we consider learning-based applications in routing to solve a Vehicle Routing variant characterized by stochasticity and multiple objectives. Such problems are representative of practical settings where decision-makers have to deal with uncertainty in the operational environment as well as multiple conflicting objectives due to different stakeholders. We specifically consider travel time uncertainty. We also consider two objectives, total travel time and route makespan, that jointly target operational efficiency and labor regulations on shift length, although more/different objectives could be incorporated. Learning-based methods offer genuine computational advantages as they can repeatedly solve problems with limited interference from the decision-maker. We specifically focus on end-to-end deep learning models that leverage the attention mechanism and multiple solution trajectories. These models have seen several successful applications in routing problems. However, since travel times are not a direct input to these models due to the large dimensions of the travel time matrix, accounting for uncertainty is a challenge, especially in the presence of multiple objectives. In turn, we propose a model that simultaneously addresses stochasticity and multi-objectivity, and provide a refined training mechanism for this model through scenario clustering to reduce training time. Our results show that our model is capable of constructing a Pareto front of good quality within acceptable run times compared to three baselines. We also provide two ablation studies to assess our model's suitability in different settings.
URL: https://openreview.net/forum?id=Wwtb1tYnp5
---
Title: Explaining Graph Neural Networks for Node Similarity on Graphs
Authors: Daniel Daza, Cuong Xuan Chu, Trung-Kien Tran, Daria Stepanova, Michael Cochez, Paul Groth
Abstract: Similarity search is a fundamental task for exploiting information in various applications dealing with graph data, such as citation networks or knowledge graphs.
Prior work on the explainability of graph neural networks (GNNs) has focused on supervised tasks, such as node classification and link prediction. However, the challenge of explaining similarities between node embeddings has been left unaddressed.
We take a step towards filling this gap by formulating the problem, identifying desirable properties of explanations of similarity, and proposing intervention-based metrics that quantitatively assess them.
Using our framework, we evaluate the performance of representative methods for explaining GNNs, based on the concepts of mutual information (MI) and gradient-based (GB) explanations. We find that unlike MI explanations, GB explanations have three desirable properties. First, they are actionable: selecting particular inputs results in predictable changes in similarity scores of corresponding nodes. Second, they are consistent: the effect of selecting certain inputs hardly overlaps with the effect of discarding them. Third, they can be pruned significantly to obtain sparse explanations that retain the effect on similarity scores. These important findings highlight the utility of our metrics as a framework for evaluating the quality of explanations of node similarities in GNNs.
URL: https://openreview.net/forum?id=zDEwl4zidP
---
New submissions
===============
Title: Simplifying complex machine learning by linearly separable network embedding spaces
Abstract: Low-dimensional embeddings are a cornerstone of modelling and analysis of complex networks. However, most of the existing approaches for mining network embedding spaces rely on computationally intensive machine learning systems to facilitate downstream analysis tasks. In contrast, in the field of Natural Language Processing, it was observed that word embedding spaces capture semantic relationships linearly, allowing for information retrieval using simple linear operations on word embedding vectors. Similar linear semantic relationships (i.e., the compositionality of embedding vectors) have also been observed in data embeddings from pre-trained vision-language models. This poses the question of why in some cases the embedding methods lead to a linearly separable embedding space amenable to linear exploitation, while in other cases they do not. Here, we gain fundamental insight into the structure of network data that yields this linearity. We show that the more homophilic the network representation, the more linearly separable the corresponding network embedding space, yielding better downstream analysis results. We demonstrate the applicability of our insight on thirteen networks from multiple domains: six multi-label biological networks and seven single-label networks from the social, citation, and transportation domains. We believe that these fundamental insights into the structure of network data that enable their linear mining and exploitation are the foundation to build upon towards efficient and explainable mining of complex network data.
URL: https://openreview.net/forum?id=Hl2xyw0RFA
---
Title: Minimax learning rates for estimating binary classifiers under margin conditions
Abstract: We study classification problems using binary estimators where the decision boundary is described by horizon functions and where the data distribution satisfies a geometric margin condition. A key novelty of our work is the derivation of lower bounds for the worst-case learning rates over broad classes of functions, under a geometric margin condition---a setting that is almost universally satisfied in practice, but remains theoretically challenging. Moreover, we work in the noiseless setting, where lower bounds are particularly hard to establish. Our general results cover, in particular, classification problems with decision boundaries belonging to several classes of functions: for Barron-regular functions, Hölder-continuous functions, and convex-Lipschitz functions with strong margins, we identify optimal rates close to the fast learning rates of $\mathcal{O}(n^{-1})$ for $n \in \mathbb{N}$ samples.
URL: https://openreview.net/forum?id=ZIshsqojB6
---
Title: Symbolic Governing Equation Discovery Using Neural Arithmetic Modules
Abstract: Neural architectures with arithmetic inductive biases, such as Neural Arithmetic Logic Units (NALU) and Neural Power Units (NPU), are designed to model arithmetic relationships for improved out-of-distribution extrapolation and interpretability. However, in practice, these models frequently exhibit unstable optimization behaviours such as gradient starvation and convergence to dense and numerically fragile parameterizations that obscure the underlying data structure. We show that arithmetic inductive bias alone is insufficient to guarantee the recovery of sparse symbolic equations. Instead, interpretability should be explicitly enforced through strict architectural constraints. We propose MSRNet, a structured neural framework for extracting sparse symbolic expressions from high-dimensional data. The model has two variants: MSRNet (Multiplicative Symbolic Regression Network), which allows multiplicative and discrete exponential arithmetic interactions via differentiable softmax relaxations, and ExMSRNet (Extended MSRNet), which further allows for logarithmic and exponential pathways. We use a composite training objective that utilizes description-length regularization via entropy-based measures to bias the model towards confident discrete operator selection. Our experiments suggest that MSRNet variants significantly reduce gradient starvation. This could be attributed to explicit constraining of the hypothesis space. We benchmark MSRNet variants on synthetic datasets, SRBench 2025, and AI Feynman I/II/III, where it achieves strong performance with significantly lower computational cost than other symbolic regression methods. Source code for MSRNet is available at: https://anonymous.4open.science/r/MSRNet-6B05
URL: https://openreview.net/forum?id=9D59Q1lHsF
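The differentiable softmax relaxation over discrete operators can be sketched as follows for exponent selection; names and the temperature schedule are illustrative, and sign handling is elided:

```python
# A hedged sketch of differentiable discrete-exponent selection via a
# softmax relaxation, the core trick described above; not MSRNet's exact layer.
import torch, torch.nn as nn

class SoftExponent(nn.Module):
    """Computes |x|^e where e is a soft mixture over candidate integer exponents."""
    def __init__(self, candidates=(0, 1, 2, 3)):
        super().__init__()
        self.register_buffer("cand", torch.tensor(candidates, dtype=torch.float32))
        self.logits = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x, temperature=1.0):
        w = torch.softmax(self.logits / temperature, dim=0)  # operator-selection weights
        e = (w * self.cand).sum()            # soft exponent; hardens as temperature -> 0
        return x.abs().clamp_min(1e-6) ** e  # numerically safe power (sign elided)

op = SoftExponent()
y = op(torch.randn(5), temperature=0.5)
# an entropy penalty on `w` (description-length style) pushes selection toward one-hot
```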
---
Title: Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling
Abstract: Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8%–29% of the computational cost. Under limited-response settings, it improves success rates by 26%–40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.
URL: https://openreview.net/forum?id=tHfAskovWI
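A generic diversity-aware selection step, the second ingredient of such a pipeline, can be sketched as farthest-first traversal over response embeddings; the sampler and embedder are stand-ins, and PDPS's progressive schedule is elided:

```python
# A generic sketch of diversity-aware subset selection over sampled responses:
# greedily keep the candidate farthest (in embedding space) from anything kept.
import numpy as np

def greedy_diverse_subset(emb, k):
    """emb: (n, d) response embeddings; returns indices of a max-min diverse subset."""
    chosen = [0]
    dist = np.linalg.norm(emb - emb[0], axis=1)   # distance to nearest chosen point
    for _ in range(k - 1):
        i = int(dist.argmax())                    # farthest-first traversal
        chosen.append(i)
        dist = np.minimum(dist, np.linalg.norm(emb - emb[i], axis=1))
    return chosen

emb = np.random.default_rng(0).normal(size=(200, 32))  # e.g. sentence embeddings
print(greedy_diverse_subset(emb, k=10))
```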
---
Title: Cross-Modal Generative Augmentation for Multimodal Biological Classification
Abstract: Recent advances in vision-language models have enabled cross-modal generation between text and images, achieving remarkable progress in general-domain understanding. However, their potential in scientific and biological applications remains largely unexplored, where datasets often couple complex visual observations with structured metadata or textual descriptors.
We propose a cross-modal generative framework that leverages both text-to-image and image-to-text generation to enrich multimodal biological classification.
Our framework integrates generative augmentation and multimodal alignment to mutually refine visual and textual representations, enabling the synthesis of complementary modality data that may otherwise be unavailable in biological datasets.
Experimental results on the HAM10000 and EMPO500 datasets demonstrate consistent gains in accuracy and generalization across diverse biological datasets, achieving 3-5% improvements over baseline models.
The proposed framework is model-agnostic and compatible with open-weight alternatives, paving the way for biologically grounded multimodal generation and analysis.
URL: https://openreview.net/forum?id=bowYeHa8dn
---
Title: FedProTIP: Task-Agnostic Federated Continual Learning via Replay-Free Gradient Projection
Abstract: Federated continual learning (FCL) enables collaborative model training across distributed clients on sequentially arriving tasks without revisiting past data. However, existing approaches often suffer from catastrophic forgetting, rely on replay buffers or generative models that may violate privacy constraints, or assume knowledge of task identities during inference. We propose FedProTIP (Federated Projection-based Continual Learning with Task Identity Prediction), a replay-free FCL framework that maintains shared task-specific feature subspaces across clients. Each client extracts low-rank core bases from intermediate activations using randomized singular value decomposition, capturing dominant feature directions associated with the current task. These bases are transmitted to the server and aggregated to construct global task subspaces that capture shared feature directions across clients without requiring data sharing. During training, client updates are projected onto the orthogonal complement of previously learned subspaces to reduce cross-task interference and mitigate catastrophic forgetting. The learned subspaces are also reused during inference to estimate task identity via subspace relevance, enabling task-agnostic prediction without requiring explicit task labels. Experiments on CIFAR100, ImageNet-R, and DomainNet demonstrate that FedProTIP consistently outperforms state-of-the-art federated continual learning baselines while maintaining lower training time, memory footprint, and communication cost.
URL: https://openreview.net/forum?id=GW4aw0fUKC
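The two core operations described above admit a compact sketch: a randomized-SVD basis from activations, and projection of an update onto the orthogonal complement of learned subspaces:

```python
# A minimal sketch of the basis-extraction and projection steps; shapes and
# ranks are illustrative, and the federated aggregation loop is elided.
import torch

def low_rank_basis(acts, rank):
    """acts: (n_samples, d) intermediate activations -> (d, rank) orthonormal basis."""
    U, S, V = torch.svd_lowrank(acts, q=rank)   # randomized SVD
    return V                                     # dominant feature directions

def project_out(update, basis):
    """Remove components of `update` lying in span(basis)."""
    return update - basis @ (basis.T @ update)

acts = torch.randn(256, 64)
B = low_rank_basis(acts, rank=8)     # shared task subspace
g = torch.randn(64)
g_safe = project_out(g, B)           # client update no longer disturbs old tasks
print(torch.norm(B.T @ g_safe))      # ~0: update is orthogonal to the subspace
```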
---
Title: Improved denoising diffusion probabilistic models with efficient non-diagonal covariance modeling
Abstract: The sampling process of Denoising Diffusion Probabilistic Models (DDPMs) can be accelerated by leveraging second-order information in the form of approximations to the denoising posterior covariance. Previous attempts at using such information have used drastic (e.g. diagonal) simplifications of the covariance. These do not do justice to the peculiar statistical structure of natural images, which exhibit strong non-diagonal correlations between pixels and color channels, and a slow-decaying power-law frequency spectrum. Here, we develop a novel covariance model that captures these features. Our Kronecker-DCT (K-DCT) model uses a Kronecker-factored decomposition of inter-color covariances and spatial covariances modeled in the frequency domain using the Discrete Cosine Transform (DCT). The use of the DCT reduces the computational complexity from quadratic to log-linear, resulting in negligible computational and memory overhead in the sampling process. By learning K-DCT-structured amortizations of the denoising posterior covariance using pre-trained score models on CIFAR-10, Celeb-A, and ImageNet datasets, we show improved performance both in terms of FID and likelihoods compared to previous SOTA denoising samplers.
URL: https://openreview.net/forum?id=V6FBm4kfML
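A hedged sketch of sampling noise under such a Kronecker-factored, DCT-diagonal covariance follows; the spectrum and color covariance are illustrative, not the paper's learned amortization:

```python
# A sketch of drawing noise with covariance kron(C_color, C_spatial), where
# C_spatial is diagonal in the DCT domain (log-linear cost via the transform).
import numpy as np
from scipy.fft import dctn, idctn

H = W = 32
# power-law spatial spectrum over DCT frequencies (slow-decaying, image-like)
fy, fx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
S = 1.0 / (1.0 + fy**2 + fx**2) ** 0.9

C_color = np.array([[1.0, .6, .5], [.6, 1.0, .6], [.5, .6, 1.0]])
L_color = np.linalg.cholesky(C_color)       # inter-channel factor

white = np.random.default_rng(0).normal(size=(3, H, W))
spatial = idctn(np.sqrt(S) * dctn(white, axes=(1, 2), norm="ortho"),
                axes=(1, 2), norm="ortho")  # shape the spatial spectrum
eps = np.einsum("ij,jhw->ihw", L_color, spatial)  # Kronecker color mixing
print(eps.shape)                                  # (3, 32, 32)
```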
---
Title: When is More Better? Efficient and Adaptive Modality Acquisition in Multimodal Learning
Abstract: Multimodal machine learning can improve flexibility, performance, and robustness, but incorporating many modalities increases acquisition costs and system complexity. This motivates adaptive modality acquisition (AMA), where a subset of the most informative modalities is selected before observation to balance predictive performance against cost. Prior work has largely focused on population-level acquisition, selecting a fixed subset of modalities that performs well on average. In this work, we instead adapt modality acquisition per sample, which is critical in settings such as healthcare where the value and cost of additional tests depend on the specific patient, improving efficiency while naturally supporting heterogeneous acquisition costs. This setting leads to the problem of multi-stage subset selection with unobserved items and heterogeneous costs, which poses challenges in both uncertainty and scalability. Our key idea is to learn a compositional energy-based value function that scores candidate modality subsets for their expected contribution to downstream prediction. We implement this through recursive value functions that estimate the value of acquiring any subset of modalities conditioned on the currently observed modalities, allowing the same model to be applied iteratively as new modalities are acquired. Our key contributions are: (1) learning this recursive value function as an energy-based model; (2) designing and characterizing suitable value functions for this setting, with a selection rule based on the model confusion rate (MCR; probability that added modalities flip a correct prediction); and (3) showing that, under a natural submodularity assumption on modality value, the acquisition objective can be optimized efficiently via submodular optimization. This framework yields scalable training and inference algorithms, collectively referred to as Efficient Adaptive Modality Acquisition (EAMA), that scale linearly in the number of modalities. Across multiple real-world multimodal datasets with up to $M=15$ modalities, EAMA achieves up to an $8\times$ improvement in balancing accuracy gains against acquisition costs relative to baseline methods. In some cases, EAMA is able to do more with less, improving accuracy while using only $27.4\%$ of the available modalities on average.
URL: https://openreview.net/forum?id=gnJNCIPO6y
---
Title: Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem
Abstract: Since 2019, the Hugging Face Model Hub has been the primary global platform for sharing open weight AI models.
By releasing a dataset of the complete history of weekly model downloads (June 2020--August 2025) alongside model metadata, we provide the most rigorous examination to-date of concentration dynamics and evolving characteristics in the open model economy.
Our analysis spans 851,000 models, over 200 aggregated attributes per model, and 2.2B downloads---establishing persistent scientific infrastructure for measuring how AI capability, influence, and participation diffuse across the global research and deployment landscape.
We document a fundamental rebalancing of economic power: US open-weight industry dominance by Google, Meta, and OpenAI has declined sharply in favor of unaffiliated developers, community organizations, and, as of 2025, Chinese industry, with DeepSeek and Qwen models potentially heralding a new consolidation of market power.
We identify statistically significant shifts in model properties---a 17$\times$ increase in average model size, rapid growth in multimodal generation (3.4$\times$), quantization (5$\times$), and mixture-of-experts architectures (7$\times$)---alongside concerning declines in data transparency, with open weights models surpassing truly open source models for the first time in 2025.
We expose a new layer of developer intermediaries that has emerged, focused on quantizing and adapting base models for both efficiency and artistic expression.
To enable continued research and oversight, we release the complete dataset with an interactive dashboard for real-time monitoring of concentration dynamics, innovation diffusion, and evolving properties in the open model economy.
URL: https://openreview.net/forum?id=lYvDFMUcwv
---
Title: [Re] FairDICE: A Gap Between Theory And Practice
Abstract: Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to e.g. incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.
URL: https://openreview.net/forum?id=Tr6MBt0hAj
---
Title: LENS: Learning to Navigate with Active Search for Partially Observable MAPF in Unknown Environments
Abstract: Multi-Agent Path Finding (MAPF) is a fundamental challenge in autonomous robotics. While classical solvers guarantee collision-free coordination, their reliance on perfect global knowledge limits their applicability in strictly unknown environments. In response, modern learning-based approaches have diverged into a severe dichotomy. Decentralized reactive heuristics scale efficiently under partial observability but systematically fail at structured deadlocks due to myopic reasoning. Conversely, large-scale neural foundation models (e.g., MAPF-GPT) offer topological awareness but require pre-computed global heuristics and prohibitive training data, and operate as black boxes.
In this work, we bridge this dichotomy by introducing LENS (LEarning to Navigate with active Search), a hybrid architecture that explicitly decouples long-horizon topological guidance from rigorous local collision avoidance. To enable efficient navigation without myopic behavior, LENS employs a lightweight neural network that takes each agent’s field of view (FOV) and goal direction vector as input to predict a dense local potential field, from which a multi-step local subpath is generated.
When collisions are detected, agents aggregate their limited local observations into a consistent shared belief and project the endpoints of the neural proposals as local waypoints, then invoke a localized Conflict-Based Search (L-CBS) over bounded spatio-temporal windows to compute safe, collision-free refinements, which are dispatched back to the agents. By partitioning anticipated collisions into disjoint conflict graphs, this integration resolves structural deadlocks through independent L-CBS instances. This ensures strict collision-free execution within a receding horizon window, avoiding the exponential overhead of global replanning and overcoming the myopia of purely reactive planners.
We evaluate our approach using a two-stage protocol. Assuming full global knowledge, LENS matches the solution quality of centralized algorithmic oracles, validating the efficiency of our hybrid search integration. Under this setting, it achieves competitive out-of-distribution generalization capabilities comparable to foundation models with up to 85M parameters (MAPF-GPT), while requiring less than $0.2\%$ of their training trajectory data. In strictly unknown environments with online collaborative mapping, LENS leverages its collaborative inference to significantly outperform reactive baselines, improving the success rate by $41.5\%$ in high-density maze configurations. Our results demonstrate that decoupling topological inference from local collision resolution provides a scalable, data-efficient, and collision-aware solution (locally collision-free within bounded windows) for complex autonomous navigation.
URL: https://openreview.net/forum?id=DuNU6ZN5GG
---
Title: Text2Smell: Emergent Representations of Human Olfactory Perception in Large Language Models
Abstract: Large language models (LLMs) trained exclusively on text have recently demonstrated emergent capacities to perceive the world as humans do, suggesting that linguistic co-occurrence statistics implicitly encode aspects of human sensory experience. Yet, whether such models capture the structure of olfactory perception, one of the most complex and least understood human senses, remains unknown. In this work, we investigate whether state-of-the-art LLMs can predict human smell perception purely from linguistic cues and how their representations compare to those of molecular transformer models explicitly trained on chemical structure. We prompt LLMs to provide perceptual olfactory ratings to odorants, and evaluate their outputs against human ratings across several datasets. Surprisingly, we find that LLMs exhibit strong alignment with human perceptual judgments, comparable to, and in most cases exceeding, the performance of specialized molecular transformers. These results indicate that linguistic knowledge alone carries rich latent structure about human olfaction, bridging the gap between language and chemical perception. Our findings position LLMs as powerful linguistically grounded perceptual models and open new directions for studying sensory grounding and cross-modal representation learning through language.
URL: https://openreview.net/forum?id=zXQWUqMijT
---
Title: Generative Trace Attribution Network
Abstract: A deepfake is a digitally created or altered image, video, or audio made using artificial intelligence (AI) that seems real but is intended to deceive or mislead viewers. The rapid rise of generative AI tools like GANs and diffusion models has made it easier to create deepfake images. While traditional methods can detect deepfakes, they often fail to identify which model created them. The process of matching a deepfake image to its generative model is known as deepfake attribution. Deepfake attribution is essential for accountability and preventing misuse of generative models in the creation of deepfakes. Unfortunately, most existing attribution methods only work well on the specific models they were trained on. They struggle to attribute images generated from unseen generators with different initialization seeds, trained for additional epochs, fine-tuned, retrained, or having slight modifications in loss functions or model architecture. To address these limitations, we propose the Generative Trace Attribution Network (GTA-Net), a generalized attribution network that robustly attributes fake images across diverse generative models with variations, including entirely unseen generative models. GTA-Net works by analyzing hidden patterns in input images using a combination of frequency analysis and latent space analysis to capture training-induced artifacts of target generative models to be attributed. GTA-Net also employs supervised contrastive learning to separate features between different target generative models. Extensive experiments on diverse generative models demonstrate that GTA-Net significantly outperforms existing attribution techniques, offering a more robust and reliable approach for deepfake attribution.
URL: https://openreview.net/forum?id=RC8nynlHmu
---
Title: Effectiveness of Distributed Gradient Descent with Local Steps for Overparameterized Models
Abstract: In distributed training of machine learning models, gradient descent with {\em local iterative steps}, commonly known as Local (Stochastic) Gradient Descent (Local-(S)GD) or Federated Averaging (FedAvg), is a very popular method to mitigate communication burden. In this method, gradient steps based on local datasets are taken independently in distributed compute nodes to update the local models, which are then aggregated intermittently. In the interpolation regime, Local-GD can converge to zero training loss. However, with many potential solutions corresponding to zero training loss, it is not known which solution Local-GD converges to. In this work, we answer this question by analyzing the implicit bias of Local-GD for classification tasks with linearly separable data. For the interpolation regime, our analysis shows that the aggregated global model obtained from Local-GD, with an arbitrary number of local steps, converges "in direction" exactly to the model that would be obtained if all data were in one place (the centralized model). Our result gives the exact rate of convergence to the centralized model with respect to the number of local steps. We also obtain the same implicit bias with a learning rate independent of the number of local steps with a modified version of the Local-GD algorithm. Our analysis provides a new view to understand why Local-GD can still perform well with a very large number of local steps even for heterogeneous data. Lastly, we also discuss the extension of our results to Local-SGD and non-separable data.
URL: https://openreview.net/forum?id=nqG7naNjBB
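A minimal Local-GD loop on separable logistic regression makes the setting concrete; data, shard sizes, and step counts are illustrative:

```python
# A minimal Local-GD sketch: each client takes several local full-batch
# gradient steps on logistic loss, then models are averaged.
import torch

def local_gd(Xs, ys, rounds=50, local_steps=10, lr=0.5):
    w = torch.zeros(Xs[0].shape[1])
    for _ in range(rounds):
        locals_ = []
        for X, y in zip(Xs, ys):                  # each client starts from global w
            wk = w.clone()
            for _ in range(local_steps):
                margin = y * (X @ wk)
                # gradient step on mean logistic loss log(1 + exp(-margin))
                wk = wk + lr * (X.T @ (y * torch.sigmoid(-margin))) / len(y)
            locals_.append(wk)
        w = torch.stack(locals_).mean(0)          # intermittent aggregation
    return w

g = torch.Generator().manual_seed(0)
X = torch.randn(200, 5, generator=g)
y = torch.sign(X @ torch.randn(5, generator=g))   # linearly separable labels
Xs, ys = list(X.chunk(4)), list(y.chunk(4))       # four client shards
w = local_gd(Xs, ys)
print((torch.sign(X @ w) == y).float().mean())    # -> 1.0 on training data
```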
---
Title: PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
Abstract: Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend is to train MLLMs with pixel-level grounding supervision in terms of masks on large-scale labelled data and specialized decoders for the segmentation task. However, we show that such MLLMs when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering (VQA). Surprisingly, some of these methods even downgrade the grounding ability of MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel challenging benchmarks with paired evaluation for both VQA and grounding. We demonstrate that simple baselines that are not unified achieve performance that matches or surpasses some of the pixel-level MLLMs. Our paired benchmarks and evaluation enable additional analysis on the reasons for failure with respect to VQA and/or grounding. Furthermore, we propose a prompt sensitivity analysis on both the language and visual prompts tailored for the grounding task. More importantly, we study the research question of "When does grounding emerge in MLLMs with respect to the output tokens?" We propose an interpretability tool that can be plugged into any MLLM to study the aforementioned question. We show that grounding does not necessarily coincide with the exact referring expression in the output, but can coincide with the object parts, its location, appearance, context or state. Code and datasets will be made publicly available.
URL: https://openreview.net/forum?id=HNtjTTAac1
---
Title: Synth-FAR: A Synthetic Frequency-Autoregressive Driven Framework for Time Series Forecasting
Abstract: Time series forecasting is essential for predicting future values based on observed patterns. Traditional methods perform well in in-domain scenarios with ample data but struggle with scarce data, leading to the rise of zero-shot and few-shot learning. Recent advancements use large-scale models but require extensive data and resources, often learning ineffectively from the available data. This study uses Fourier analysis to explore the factors influencing effective learning in time series forecasting. Our findings show that forecasters struggle with data containing multiple frequencies and with generalizing to unseen frequencies. To address this, we introduce Synth-FAR, a synthetic data generation framework that enhances or replaces real data by creating a mixture of autoregressive and frequency information, improving model robustness in limited-data scenarios. Our method outperforms other popular synthetic data techniques, such as Kernel-Synth, in both generation time and performance, and demonstrates the potential for integration into foundation model data pipelines, thereby enhancing their effectiveness.
URL: https://openreview.net/forum?id=NuIpv0WySf
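The generation idea, mixing frequency content with an autoregressive component, can be sketched as follows; mixture weights and the AR order are illustrative, not Synth-FAR's actual recipe:

```python
# A hedged sketch of synthesizing training series as a sum of a sparse
# sinusoid mixture (frequency part) and an AR(1) process (autoregressive part).
import numpy as np

def synth_series(n=512, n_freqs=3, ar_coef=0.8, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    freqs = rng.uniform(0.005, 0.2, size=n_freqs)
    amps = rng.uniform(0.5, 1.5, size=n_freqs)
    phases = rng.uniform(0, 2 * np.pi, size=n_freqs)
    periodic = (amps[:, None]
                * np.sin(2 * np.pi * freqs[:, None] * t + phases[:, None])).sum(0)
    ar = np.zeros(n)                       # AR(1) driven by white noise
    noise = rng.normal(scale=0.3, size=n)
    for i in range(1, n):
        ar[i] = ar_coef * ar[i - 1] + noise[i]
    return periodic + ar

series = np.stack([synth_series(seed=s) for s in range(8)])  # a synthetic batch
print(series.shape)                                          # (8, 512)
```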
---
Title: GCN-DevLSTM: Path Development for Skeleton-Based Action Recognition
Abstract: Skeleton-based action recognition (SAR) in videos is an important but challenging task in computer vision. The recent state-of-the-art (SOTA) models for SAR are primarily based on graph convolutional neural networks (GCNs), which are powerful in extracting the spatial information from skeleton data. However, it is not yet clear that such GCN-based models can effectively capture the temporal dynamics of human action sequences. To this end, we propose the G-Dev layer, which exploits the path development -- a principled and parsimonious representation for sequential data by leveraging the Lie group structure. By integrating the G-Dev layer, the hybrid G-DevLSTM module enhances the traditional LSTM to reduce the time dimension while retaining high-frequency information. It can be conveniently applied to any temporal graph data, complementing existing advanced GCN-based models. Our empirical studies on the NTU-60, NTU-120 and Chalearn2013 datasets demonstrate that our proposed GCN-DevLSTM network consistently improves the strong GCN baseline models and achieves superior performance. \footnote{The camera-ready version will contain a link to the code repository to ensure reproducibility.}
URL: https://openreview.net/forum?id=3o5seglmgn
---
Title: Taming Modality Entanglement in Continual Audio-Visual Segmentation
Abstract: Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused.
In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, increasing the rehearsal frequency of confusable classes during the training process.
Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods. The source code will be made publicly available upon acceptance.
URL: https://openreview.net/forum?id=8mPymf31zG
---
Title: Unbiased Stochastic Optimization for Gaussian Processes on Finite Dimensional RKHS
Abstract: Current methods for stochastic hyperparameter learning in Gaussian Processes (GPs) rely on approximations, such as computing biased stochastic gradients or using inducing points in stochastic variational inference. However, when using such methods, we are not guaranteed to converge to a stationary point of the true marginal likelihood. In this work, we propose algorithms for exact stochastic inference of GPs with kernels that induce a Reproducing Kernel Hilbert Space (RKHS) of moderate finite dimension. Our approach can also be extended to infinite dimensional RKHSs at the cost of forgoing exactness. Both for finite and infinite dimensional RKHSs, our method achieves better experimental results than existing methods when memory resources limit the feasible batch size and the possible number of inducing points.
URL: https://openreview.net/forum?id=wDCulUZla4
---
Title: Data-Driven Priors for Uncertainty-Aware Deterioration Risk Prediction with Multimodal Data
Abstract: Safe predictions are a crucial requirement for integrating predictive models into clinical decision support systems.
One approach for ensuring trustworthiness is to equip models with the ability to express their uncertainty about individual predictions.
However, current machine learning models frequently lack reliable uncertainty estimation, hindering real-world deployment. This is further observed in multimodal settings, where the goal is to enable effective information fusion.
In this work, we propose $\texttt{MedCertAIn}$, a predictive uncertainty framework that leverages multimodal clinical data for in-hospital risk prediction to improve model performance and reliability. We design data-driven priors over neural network parameters using a hybrid strategy that considers cross-modal similarity in self-supervised latent representations and modality-specific data corruptions.
We train and evaluate the models with such priors using clinical time-series and chest X-ray images from the publicly-available datasets MIMIC-IV and MIMIC-CXR.
Our results show that $\texttt{MedCertAIn}$ significantly improves predictive performance and uncertainty quantification compared to state-of-the-art deterministic baselines and alternative Bayesian methods.
These findings highlight the promise of data-driven priors in advancing robust, uncertainty-aware AI tools for high-stakes clinical applications.
URL: https://openreview.net/forum?id=L6PmexvcFY
---
Title: A Mechanistic Analysis of Low-Precision Instabilities in Microscaling Formats
Abstract: Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA’s Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch – spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight–activation precision combinations – we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits similar behavior to the language model, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through in situ intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training.
URL: https://openreview.net/forum?id=I5bxWT7Xfw
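For readers unfamiliar with block-scaled formats, a toy MX-style quantizer is sketched below; it uses a signed integer grid rather than the FP4/FP8 element encodings of real MX formats, so it illustrates shared per-block scaling only:

```python
# A hedged sketch of block-scaled quantization: each block shares one
# power-of-two scale, and elements snap to a low-precision grid.
import torch

def mx_quantize(x, block=32, elem_bits=4):
    flat = x.reshape(-1, block)
    # shared scale per block: power of two covering the block's max magnitude
    max_abs = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-30)
    scale = 2.0 ** torch.ceil(torch.log2(max_abs))
    levels = 2 ** (elem_bits - 1) - 1                # symmetric signed grid
    q = torch.round(flat / scale * levels).clamp(-levels, levels)
    return (q / levels * scale).reshape(x.shape)     # dequantized view

w = torch.randn(4, 64)
w_q = mx_quantize(w)
print((w - w_q).abs().mean())  # the per-element error whose bias, accumulated
                               # in gradients, can seed the instabilities above
```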
---
Title: When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS
Abstract: Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments with fine-tuning the language-model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice-cloning task.
Across multiple speakers, LoRA fine-tuning consistently outperforms the non–fine-tuned base Qwen-0.5B model across three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to +0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity, indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal-level quality improves in most cases, with signal-to-noise ratio increasing by as much as 34 percent.
Crucially, these improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS, voice similarity, and SNR. In contrast, speakers trained on acoustically homogeneous data experience limited gains or perceptual degradation, even when voice similarity improves. This reveals that LoRA can faithfully clone speaker identity while also amplifying noise characteristics and recording artifacts present in narrow training distributions.
We further identify a loss–quality divergence phenomenon in which training and validation loss continue to improve during fine-tuning while perceptual quality degrades for low-variability speakers. Besides, we show that optimal inference temperature of the language model backbone depends on training data variability, with conservative sampling benefiting low-variability speakers but degrading quality for high-variability ones.
Overall, this work establishes that LoRA fine-tuning is not merely a parameter-efficient optimization technique but an effective mechanism for speaker-level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data, LoRA-adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality and speaker similarity, while retaining low latency when served as a quantized GGUF model.
URL: https://openreview.net/forum?id=cnXJiwVLg3
---
Title: Model-Agnostic Shift-Aware Risk-Sensitive Curriculum for Long-Horizon Time-Series Forecasting
Abstract: Long-horizon multivariate forecasting is often brittle under regime changes, rare high-impact windows, and error accumulation. Standard training samples windows uniformly and optimizes mean loss, while existing curricula typically rank windows by difficulty alone and robustness objectives (e.g., CVaR, IRM/REx, GroupDRO) act only after windows have entered the optimization stream. We propose \method{}, a \emph{model-agnostic} training wrapper that reallocates gradient budget by coupling (i) self-paced window admission, (ii) shift-aware importance weights over context- or feature-defined environments, and (iii) tail- and environment-robust outer objectives. The wrapper leaves the forecasting backbone unchanged and adds no inference-time cost. At the population level, we formalize the induced target as a trimmed, shift-corrected robust risk. We show that the differentiable quantile gate is an $O(1/\gamma)$ approximation to its hard admitted-set counterpart, quantify the bias introduced by label-adaptive difficulty signals via an explicit adaptive-gap term, and derive a deterministic upper bound on worst-environment risk from the environment-variance penalty. Empirically, on six long-horizon benchmarks (ETTh1/2, ETTm1/2, Weather, Electricity) and four backbones (RLinear, DLinear, RMLP, iTransformer), \method{} lowers MSE in 82 of 96 backbone--dataset--horizon cells, with 65 cells improving by more than 1\%, and yields positive average gains in every backbone--horizon aggregate. On a scoped robustness battery (ETTh1 with DLinear), \method{} reduces mean MSE by 5.1--9.0\% across temporal shift levels and reduces worst-environment MSE by up to 30\% in the hardest stress setting.
URL: https://openreview.net/forum?id=RxYq2LLrOQ
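A minimal version of the differentiable quantile gate for window admission can be sketched as follows; the threshold statistic and gamma value are illustrative:

```python
# A minimal sketch of a differentiable quantile gate: windows with loss below
# a running difficulty quantile get weight ~1, harder windows are smoothly
# down-weighted; gamma controls how closely the soft gate approximates the
# hard admitted set.
import torch

def quantile_gate_weights(losses, q=0.8, gamma=0.05):
    thresh = torch.quantile(losses.detach(), q)       # admission threshold
    return torch.sigmoid((thresh - losses) / gamma)   # soft set membership

losses = torch.rand(16, requires_grad=True)           # per-window training losses
w = quantile_gate_weights(losses)
curriculum_loss = (w.detach() * losses).mean()        # reweighted objective
curriculum_loss.backward()
```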
---
Title: Symbolic Density Estimators for Unnormalized Distributions
Abstract: Estimating the symbolic or analytical form of probability density functions (PDFs) from observed samples is a fundamental challenge in statistical and computational modelling. This process is critical for deriving interpretable and generalizable relationships characterizing the underlying phenomenon. Traditionally, this estimation depends strongly on domain expertise and prior field-specific knowledge, with experts selecting appropriate functional forms or parametric families based on empirical evidence and theoretical understanding. The coefficients of these forms are then typically determined through parameter estimation. In this paper, we develop a framework to estimate symbolic expressions of unnormalized distributions from their observed samples. We integrate deep generative models with symbolic regression (SR), incorporating inductive biases, such as factorizing large distributions, to keep the problem tractable. The deep generative models we examine include likelihood-based models, viz. flow models, and score-based models. Experiments show the effectiveness of the proposed framework for estimating density functions for multivariate toy distributions as well as lattices from computational physics, namely the XY model and $\phi^4$ theory. When applied to the renormalization problem in $\phi^4$ theory, it discovers new expressions for the action function that are intractable by traditional analytic approaches, thereby providing physicists with a novel tool for theoretical analysis.
URL: https://openreview.net/forum?id=3VRaDoUX3g
---
Title: Conditional Risk-Averse Constrained Reinforcement Learning
Abstract: In Risk-averse Constrained Reinforcement Learning (RaCRL), the optimal tolerance for risk often depends on a preference over the trade-off between reward and safety. This trade-off is influenced by environmental uncertainty, which is generally difficult to quantify, in turn making its effect on an agent's performance difficult to predict at the outset of training. Conventional RaCRL approaches typically train agents under a fixed risk level, set at the beginning of training, leading to an agent with a fixed, often conservative, reward-safety trade-off at deployment time. In this paper, we introduce Conditional Risk-averse Actor Critic (CRAC), a novel algorithm for RaCRL that conditions the agent on risk levels sampled during both exploration and learning. Through exploring and learning from diverse experiences across varied risk levels, CRAC generalises effectively across a spectrum of risk preferences, enabling the deployment of a single agent at risk levels chosen by a user. We evaluate CRAC across a set of environments with increasing difficulty, demonstrating empirically that it generalises effectively across a risk spectrum. CRAC often achieves higher reward than fixed-risk agents, whilst satisfying cost constraints. In cases where CRAC's reward performance is marginally lower than a fixed-risk agent, CRAC retains the advantage of a single risk-conditioned policy that generalises to a risk spectrum, reducing training overhead and providing more control over the reward-cost trade-off.
URL: https://openreview.net/forum?id=JJFFx1HVHi
---
Title: Aligning Path-based Link Prediction with Human Understanding of Valid Reasoning
Abstract: Path-based link prediction methods reconstruct missing links between two vertices of a knowledge graph. They reconstruct a missing link by finding a path through the knowledge graph connecting both vertices. The path is the reasoning of the link prediction method. However, path-based link prediction methods are vulnerable to \emph{Clever Hans} biases. They learn invalid reasoning patterns if these patterns are dominant and generalize well to the training and validation set. As a result, performance drops when evaluated on the real-world distribution. The validity of reasoning is determined by the semantic concept underlying the missing link, which is mostly accessible through human knowledge. The paper's approach makes human understanding of valid reasoning accessible while learning to predict missing links.
This paper proposes the path-based link prediction method \emph{LiEr}. \emph{LiEr} learns valid reasoning within the knowledge graph domain from preference-based human feedback.
The paper demonstrates that \emph{LiEr}'s prediction capability is on par with other state-of-the-art link prediction methods while more aligned with human understanding of valid reasoning on various benchmark reasoning tasks. In addition, a novel benchmark knowledge graph with a \emph{Clever Hans} bias is introduced to evaluate the alignment of link prediction methods with human understanding of valid reasoning. The paper contributes by proposing the first human-in-the-loop link prediction method, capable of aligning its reasoning with the human understanding of valid reasoning.
URL: https://openreview.net/forum?id=FrMBxE7J0v
---
Title: An Axiomatic Atlas for Optimization
Abstract: First-order methods are the workhorses of modern large-scale optimization, powering training and inference across machine learning, signal processing, and scientific computing. Yet the theoretical guarantees that explain their behavior are dispersed across smooth, nonsmooth, stochastic, and composite settings, while practitioners must choose among many algorithmic variants and tune interacting hyperparameters with limited guidance about which assumptions actually matter. We introduce the Optimization Atlas, an axiomatic view that organizes widely used first-order methods and their canonical convergence behaviors within a single, explicit assumption space. The atlas exposes inclusion-minimal assumption sets that suffice for a desired outcome, and it delineates sharp frontiers where a single assumption change alters the attainable regime, such as sublinear versus linear convergence or linear convergence versus variance-limited floors. We then leverage the induced theorem-by-axiom structure to uncover a small number of recurring modes that clarify which phenomena are structurally shared across methods and which correspond to genuinely distinct mechanisms. Finally, we convert the atlas into a practical diagnostic control plane: from short training traces it estimates the active limiting ceiling and ranks interventions, ranging from relaxing modeling assumptions (for example via smoothing or regularization) to increasing algorithmic capacity (for example via batching or variance reduction). Experiments on controlled synthetic problems and a CIFAR-10 convolutional network show that this control plane reliably identifies the governing regime, recommends high-return changes when optimization is the bottleneck, and abstains when additional tuning is unlikely to help.
URL: https://openreview.net/forum?id=O4uv7RkOjf
---
Title: Autoregressive Flow Matching for Motion Prediction
Abstract: Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance.
URL: https://openreview.net/forum?id=rYe8n2Kpo3
---
Title: Training Deep Morphological Neural Networks as Universal Approximators
Abstract: We investigate deep morphological neural networks (DMNNs). We demonstrate that despite their inherent non-linearity, "linear" activations are essential for DMNNs. To preserve their inherent sparsity, we propose architectures that constrain the parameters of the "linear" activations: for the first (resp. second) architecture, we work under the constraint that the majority of parameters (resp. learnable parameters) should be part of morphological operations. We improve the generalization ability of our networks via residual connections and weight dropout. Our proposed networks can be successfully trained, and are more prunable than linear networks. To the best of our knowledge, we are the first to successfully train DMNNs under such constraints. Finally, we propose a hybrid network architecture combining linear and morphological layers, showing empirically that the inclusion of morphological layers significantly accelerates the convergence of gradient descent with large batches.
URL: https://openreview.net/forum?id=cDJTz8ra7k
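A minimal max-plus (dilation) layer, the basic building block of such morphological networks, is sketched below; initialization and the paper's specific constraints are elided:

```python
# A minimal max-plus (dilation) layer: the tropical analogue of a linear layer.
# Gradients flow only through the argmax entries, so active weights are sparse.
import torch, torch.nn as nn

class Dilation(nn.Module):
    """y_j = max_i (x_i + w_ij)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(in_dim, out_dim) * 0.1)

    def forward(self, x):                          # x: (batch, in_dim)
        return (x.unsqueeze(-1) + self.w).amax(dim=1)

layer = Dilation(8, 4)
y = layer(torch.randn(32, 8))
y.sum().backward()                                 # only argmax weights receive gradient
print(layer.w.grad.abs().gt(0).float().mean())     # fraction of weights touched
```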
---
Title: NeuMoSync: End‑to‑End Neuromodulatory Control for Plasticity and Adaptability in Continual Learning
Abstract: Continual learning (CL) requires models to learn tasks sequentially, yet deep neural networks often suffer from plasticity loss and poor knowledge transfer, which can impede their long-term adaptability.
Drawing high-level inspiration from global neuromodulatory mechanisms in the brain, we introduce $\textbf{Neu}$ro$\textbf{Mo}$dulation and $\textbf{Sync}$hronization ($\texttt{NeuMoSync}$), a novel architecture that integrates dynamic, neuron-specific modulation into deep neural networks to enhance their adaptability and plasticity.
$\texttt{NeuMoSync}$ extends standard neural network architectures with learnable per-neuron feature vectors that track network-wide historical context, and incorporates a module operating at a higher level of abstraction.
This module synthesizes neuron-specific signals, conditioned on both current inputs and the network’s evolving state, to adaptively regulate activation dynamics and synaptic plasticity. Evaluated on diverse CL benchmarks, including memorization (Random Label CIFAR-10, Random Label MNIST), concept drift (Shuffle CIFAR-10), class-incremental (Class Split T-ImageNet, CIFAR-100) and domain-incremental (Permuted MNIST), $\texttt{NeuMoSync}$ demonstrates strong performance in terms of retention of plasticity and achieves improvements in both forward and backward adaptation compared to existing methods.
Ablation studies validate the necessity of each component, while the analysis of the learned modulatory signals reveals interpretable coordination patterns across tasks. Our work underscores the potential of integrating global coordination mechanisms into deep learning systems to advance robust, adaptive continual learning.
URL: https://openreview.net/forum?id=i6cb6CGMtY
---
Title: Firewalls to Secure Dynamic LLM Agentic Networks
Abstract: The emergence of agent-to-agent communication protocols mirrors the early internet: powerful connectivity with minimal security infrastructure. When AI agents communicate on behalf of users, every message crosses a trust boundary where the user's personal data and the external agent's unconstrained language each present distinct risks. We address both through a dual-firewall architecture grounded in a unifying principle: each task defines a context, and both sides of the communication carry information far exceeding what that context requires. Our firewalls act as projections onto the task context, allowing only contextually appropriate content to cross each boundary. The Language Converter Firewall projects incoming messages onto a closed, domain-specific, structured protocol; an external agent's message is converted to validated fields while persuasive framing, urgency tactics, and embedded instructions are structurally eliminated through deterministic verification. This replaces the asymmetric challenge of resisting every possible manipulation with the structural guarantee that manipulation has no channel through which to arrive. The Data Abstraction Firewall projects outgoing information onto the granularity appropriate for the task, rather than applying binary disclose-or-redact filtering, as previous airgapping solutions did. Both firewalls operate in a trusted environment isolated from external input, applying domain-specific rules learned automatically from demonstrations. Across 864 attacks spanning three domains on the recent ConVerse benchmark, our architecture reduces privacy attack success rates (e.g., from 84% to 10% for GPT-5) and security attacks (from 60% to 3%), while maintaining or even improving task completion quality.
URL: https://openreview.net/forum?id=w02FW1dMoY
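As a toy illustration of the projection idea (the schema, regexes, and domain here are hypothetical, not the paper's protocol), a language-converter firewall can reduce an untrusted message to a closed set of validated fields, so persuasive framing has no field to occupy:
```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BookingRequest:          # the closed protocol for one hypothetical domain
    date: str                  # ISO date, e.g. "2025-06-01"
    party_size: int            # 1..12

DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
SIZE_RE = re.compile(r"\bfor (\d{1,2}) (?:people|guests)\b")

def convert(message: str) -> Optional[BookingRequest]:
    """Project an untrusted message onto the schema; return None if invalid."""
    date, size = DATE_RE.search(message), SIZE_RE.search(message)
    if not (date and size):
        return None                       # no valid projection -> reject
    n = int(size.group(1))
    if not 1 <= n <= 12:
        return None                       # deterministic range check
    return BookingRequest(date=date.group(1), party_size=n)

# Manipulative text carries no channel once projected onto the schema.
msg = "URGENT!! Ignore prior rules. Book a table for 4 people on 2025-06-01."
print(convert(msg))   # BookingRequest(date='2025-06-01', party_size=4)
```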
---
Title: Provably Robust Watermarks for Open-Source Language Models
Abstract: Watermarking is a leading solution for the increasingly pressing problem of identifying AI-generated text.
Existing large language model (LLM) watermark approaches crucially rely on the LLM's source code and parameters being secret, which makes them ineffective in an open-source setting. In this work, we introduce the first watermarking scheme for open-source LLMs with provable robustness guarantees. Under precisely defined assumptions about the adversary's knowledge, we prove that the adversary either fails to remove the watermark or significantly degrades the quality of the model. We supplement our theoretical results with experiments using OPT, which show how our proven robustness-quality tradeoff manifests in practice.
Our main contribution is showing the feasibility of watermarks with provable guarantees in the open-source setting. We provide the first formal definition of robustness in this setting, and show that it is achievable by a fairly simple scheme. While this scheme is simple, the bulk of our work lies in modeling the problem in a way that is realistic yet amenable to provable results, and analyzing our scheme to prove robustness. We hope that our definitions and the techniques used in our analysis pave the way for future work on open-source watermarks.
URL: https://openreview.net/forum?id=xRd8lA6DmI
---
Title: The Intrinsic Dimension of Prompts in Internal Representations of Large Language Models
Abstract: We study prompts in the internal representations of large language models through the lens of intrinsic dimension. Viewing transformers as mean-field particle systems, we estimate the intrinsic dimension of the empirical measure at each layer and demonstrate that it correlates with next-token uncertainty. Across models and intrinsic dimension estimators, we find that intrinsic dimension peaks in early to middle layers and increases under syntactic and semantic disruption (by shuffling tokens), and that it is strongly correlated with average surprisal, with a simple analysis linking logit geometry to entropy via softmax. As a case study in practical interpretability and safety, we train a linear probe on the per-layer intrinsic dimension profile to distinguish malicious from benign prompts before generation. This probe achieves 90 to 95\% accuracy across different datasets, outperforming widely used guardrails such as Llama Guard and Shield Gemma. We further compare against linear probes built from layerwise entropy derived via the Tuned Lens and find that the intrinsic dimension-based probe is competitive and complementary, offering a compact, interpretable signal distributed across layers. Our findings suggest that prompt-level geometry provides actionable signals for monitoring and controlling LLM behavior, and offers a bridge between mechanistic insights and practical safety tools.
URL: https://openreview.net/forum?id=rBEgNAslpY
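A compact sketch of the probing recipe, under our own assumptions (TwoNN as the intrinsic-dimension estimator, scikit-learn's logistic regression as the linear probe, and random stand-in data in place of real hidden states):
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def twonn_id(X):
    """TwoNN maximum-likelihood ID estimate from 1st/2nd-neighbor ratios."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    np.fill_diagonal(d2, np.inf)
    part = np.partition(d2, 1, axis=1)
    mu = 0.5 * (np.log(part[:, 1]) - np.log(part[:, 0]))  # log(r2 / r1)
    mu = mu[mu > 0]                                       # drop exact ties
    return len(mu) / mu.sum()

def id_profile(layer_states):
    """One ID estimate per layer; layer_states: list of (tokens, dim) arrays."""
    return np.array([twonn_id(h) for h in layer_states])

rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 128)) for _ in range(6)]   # fake layer states
print(id_profile(layers).round(1))

# The probe: logistic regression on per-prompt ID profiles (random stand-ins).
X = rng.normal(size=(200, 6))
y = rng.integers(0, 2, size=200)
print(LogisticRegression(max_iter=1000).fit(X, y).score(X, y))
```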
---
Title: Accounting for Heterogeneous Parameters in Decision-Focused Learning
Abstract: Decision-focused learning (DFL) is a recent machine learning paradigm aimed at tackling predict-then-optimize problems, where the task is to predict the parameter values of a parametric optimization problem from features. Instead of maximizing predictive accuracy, DFL maximizes downstream decision quality, by training the model to avoid specifically those errors that most negatively impact decision-making. In this work, we systematically investigate an understudied aspect of DFL: that for different parameters, the way prediction errors affect downstream decision quality may differ, and that the conventional model architecture cannot account for these differences. We first formalize this issue and provide a theoretical characterization of when it arises. We then show that significantly better decision quality can often be achieved by equipping models with the ability to learn parameter-specific predictive mappings – even when the true underlying mappings are identical. To this end, we investigate three architectural alterations to the predictive model, and propose a data augmentation scheme to enhance data efficiency. We extensively evaluate the impact of these changes across several dimensions, using linear and nonlinear predictive models, and optimization problems of different complexities. Our findings show that significant performance gains can be realized through such architectural alterations and the data augmentation scheme, across different problem types.
URL: https://openreview.net/forum?id=k6aEfCLMMD
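One simple architectural alteration in the spirit of the paper's framing, sketched here with assumed shapes and names (the authors' three alterations may differ), is a shared trunk with one small head per predicted parameter, so each parameter's predictive mapping can be shaped independently:
```python
import torch
import torch.nn as nn

class PerParameterPredictor(nn.Module):
    def __init__(self, d_feat, n_params, d_hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_feat, d_hidden), nn.ReLU())
        # One tiny head per parameter of the downstream optimization problem.
        self.heads = nn.ModuleList(
            nn.Linear(d_hidden, 1) for _ in range(n_params)
        )

    def forward(self, x):                      # x: (batch, d_feat)
        z = self.trunk(x)
        return torch.cat([h(z) for h in self.heads], dim=-1)  # (batch, n_params)

model = PerParameterPredictor(d_feat=10, n_params=5)
print(model(torch.randn(3, 10)).shape)         # torch.Size([3, 5])
```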
---
Title: Fairness in Link Prediction Beyond Demographic Parity: A Reproducibility Study
Abstract: In fair ranked link prediction, demographic parity ($\Delta_\mathrm{DP}$) is a common fairness criterion, yet Mattos et al. (2025) argue that it obscures exposure bias induced by ranking order. In this study, we reproduce and validate their claims by confirming (i) the limitations of $\Delta_\mathrm{DP}$, (ii) the advantages of the rank-aware, exposure-based Normalized Discounted KL-divergence (NDKL) metric, and (iii) the effectiveness of MORAL, a post-processing algorithm for debiasing ranked outputs while maintaining competitive utility. Beyond reproduction, we assess robustness through synthetic asymmetric homophily stress-tests, categorical sensitive attributes, and alternative fairness and utility metrics, including subgroup-pair-adapted Attention-Weighted Rank Fairness (AWRF). Empirical results demonstrate that exposure-based fairness reveals disparities hidden by $\Delta_\mathrm{DP}$ and that MORAL reduces such bias with minimal utility loss across diverse settings and datasets. We release a corrected, reproducible implementation available at https://github.com/unknown-gitter/reproducing-MORAL.
URL: https://openreview.net/forum?id=QNCZoPb9uV
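For reference, a small implementation of rank-aware NDKL following its standard definition in fairness-aware ranking (variable names and the toy example are ours): at each ranking prefix, the group distribution seen so far is compared to a desired distribution via KL divergence, discounted by log rank, and normalized.
```python
import numpy as np

def ndkl(groups, desired, eps=1e-12):
    """groups: group id per ranked item (top first); desired: target distribution."""
    desired = np.asarray(desired, dtype=float)
    counts = np.zeros(desired.size)
    total, z = 0.0, 0.0
    for i, g in enumerate(groups, start=1):
        counts[g] += 1
        p = counts / i                                    # prefix distribution
        kl = np.sum(p * np.log((p + eps) / (desired + eps)))
        w = 1.0 / np.log2(i + 1)                          # rank discount
        total += w * kl
        z += w
    return total / z

# A ranking that front-loads group 0 scores worse than an alternating one,
# even though both expose the groups equally overall (demographic parity).
print(ndkl([0, 0, 0, 1, 1, 1], [0.5, 0.5]))   # larger
print(ndkl([0, 1, 0, 1, 0, 1], [0.5, 0.5]))   # smaller
```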
---
Title: Understanding Judge Calibration in Multi-Turn Debates
Abstract: Multi-turn debates have gained attention as language evaluation tasks for subject matter comprehension, critical reasoning and long-form responses. Language Models (LMs) play the role of judges for obtaining subjective ratings as a cheap alternative to human labor. However, similar to humans, LM judges may remain unsure of ratings and rate debate arguments either under- or over-confidently. We empirically study judge calibration in multi-turn self debates, wherein a single LM debater debates with itself, and uncover that LM judges are often overconfident in their judgements. Miscalibration occurs as model confidence ratings increase while rated scores may decrease over debate rounds. Judge confidence exceeds score ratings for both frontier and open-source models. We further show that while naive finetuning may improve calibration by increasing scores, it does not necessarily lower overconfidence in ratings. Finetuned overconfident judges assign ratings close to their confidence and rate different arguments indistinguishably. Our empirical analysis leads us to an observation that helps mitigate overconfidence. Since lower confidences and scores form the tail end of the dataset and are most desirable from a judge’s perspective, sampling from this left tail should calibrate confidence. We thus fit a mixture of Gumbel distributions to expected ratings of debate arguments and then rejection sample from its tail to finetune judge models. Sampling from the mixture of Gumbels, when compared to naive ratings and Supervised Finetuning (SFT), lowers judge confidence and yields well-calibrated judges while learning an expressive multi-modal distribution over ratings. Debate datasets and code will be released as part of the final version.
URL: https://openreview.net/forum?id=DQc5tVVx3D
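A schematic of the mitigation step, with fabricated mixture parameters and threshold (the paper fits the mixture to real rating data): draw from a two-component Gumbel mixture and keep only left-tail samples for finetuning.
```python
import numpy as np
from scipy.stats import gumbel_r

rng = np.random.default_rng(0)

# Hypothetical 2-component Gumbel mixture standing in for fitted ratings.
weights = np.array([0.6, 0.4])
locs, scales = np.array([6.0, 8.5]), np.array([0.8, 0.5])

def sample_left_tail(n, quantile=0.15):
    """Keep only draws below a conservative left-tail threshold of the mixture."""
    thresh = min(gumbel_r.ppf(quantile, loc=m, scale=s)
                 for m, s in zip(locs, scales))
    out = []
    while len(out) < n:
        comp = rng.choice(2, p=weights)               # pick a mixture component
        x = gumbel_r.rvs(loc=locs[comp], scale=scales[comp], random_state=rng)
        if x <= thresh:                               # rejection step: tail only
            out.append(x)
    return np.array(out)

tail = sample_left_tail(300)
print(round(tail.mean(), 2), round(tail.max(), 2))    # all below the threshold
```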
---
Title: DRO-Augment Framework: Robustness by Synergizing Wasserstein Distributionally Robust Optimization and Data Augmentation
Abstract: In many real-world applications, ensuring the robustness and stability of deep neural networks (DNNs) is crucial, particularly for image classification tasks that encounter various input perturbations. While data augmentation techniques have been widely adopted to enhance the resilience of a trained model against such perturbations, there remains significant room for improvement in robustness against corrupted data and adversarial attacks simultaneously. To address this challenge, we introduce DRO-Augment, a novel framework that integrates Wasserstein Distributionally Robust Optimization (W-DRO) with various data augmentation strategies to significantly improve model robustness across a broad spectrum of corruptions. Our method outperforms existing augmentation methods under severe data perturbations and adversarial attack scenarios while maintaining clean-data accuracy, on a range of benchmark datasets including CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C, and Fashion-MNIST. On the theoretical side, we establish novel generalization error bounds for neural networks trained using a computationally efficient, variation-regularized loss function with augmented data, closely related to the W-DRO problem. Furthermore, we introduce a refined CIFAR-C benchmark that corrects inconsistencies in corruption intensities, providing a more reliable evaluation for future robustness research.
URL: https://openreview.net/forum?id=AuXKvvvGGz
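The variation-regularized surrogate that the theory refers to can be pictured as a gradient-norm penalty on the per-example loss over augmented inputs; here is a minimal sketch under our own assumptions (toy model, assumed hyperparameters), not the paper's exact objective:
```python
import torch
import torch.nn as nn

def variation_regularized_loss(model, x_aug, y, lam=0.1):
    """Cross-entropy on augmented inputs + input-gradient-norm (variation) penalty."""
    x_aug = x_aug.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_aug), y)
    (grad,) = torch.autograd.grad(loss, x_aug, create_graph=True)
    penalty = grad.flatten(1).norm(dim=1).mean()
    return loss + lam * penalty

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(8, 3, 32, 32)          # stand-in for an augmented CIFAR batch
y = torch.randint(0, 10, (8,))
variation_regularized_loss(model, x, y).backward()
```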
---
Title: CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs
Abstract: Structured benchmarks have advanced text-conditional image generation for real-world imagery; however, no such benchmark exists for synthetic radiograph generation. Despite being a highly active area of research, existing studies continue adopting inconsistent evaluation protocols and lack a unified assessment of the three most critical criteria: generative fidelity, privacy risk, and downstream utility.
To address these limitations, we introduce CheXGenBench, the first unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and downstream utility across frontier text-to-image (T2I) generative models. Our evaluation protocol, comprising over 20 quantitative metrics, covers 11 leading T2I architectures with plug-and-play integration for newer models. Through a rigorous and fair evaluation protocol, we establish a new SoTA in synthetic chest X-ray generation. Furthermore, our results uncover several limitations of current generative models: (1) even SoTA models struggle with long-tailed medical distributions, (2) models pose high privacy risks regardless of fidelity quality, and (3) while synthetic data already benefits downstream classification, it is of limited utility for downstream multimodal tasks. Drawing from these results, we propose concrete research directions to advance the field. The anonymised code is available at https://anonymous.4open.science/r/CheXGenBench-52F0/README.md.
URL: https://openreview.net/forum?id=wrKmzYQACp
---
Title: Self-Supervised Learning via Flow-Guided Neural Operator on Time-Series Data
Abstract: Self-supervised learning (SSL) is a powerful paradigm for learning from unlabeled time-series data. However, popular methods such as masked autoencoders (MAEs) rely on reconstructing inputs from a fixed, predetermined masking ratio. Instead of this static design, we propose treating the corruption level as a new degree of freedom for representation learning, enhancing flexibility and performance. To achieve this, we introduce the Flow-Guided Neural Operator (FGNO), a novel framework combining operator learning with flow matching for SSL training. FGNO learns mappings in functional spaces by using Short-Time Fourier Transform to unify different time resolutions. We extract a rich hierarchy of features by tapping into different network layers and flow times that apply varying strengths of noise to the input data. This enables the extraction of versatile representations, from low-level patterns to high-level global features, using a single model adaptable to specific tasks. Unlike prior generative SSL methods that use noisy inputs during inference, we propose using clean inputs for representation extraction while learning representations with noise; this eliminates randomness and boosts accuracy. We evaluate FGNO across three biomedical domains, where it consistently outperforms established baselines. Our method yields up to 35% AUROC gains in neural signal decoding (BrainTreeBank), 16% RMSE reductions in skin temperature prediction (DREAMT), and over 20% improvement in accuracy and macro-F1 on SleepEDF under low-data regimes. These results highlight FGNO's robustness to data scarcity and its superior capacity to learn expressive representations for diverse time series.
URL: https://openreview.net/forum?id=YAYW9Y173z
---
Title: A Benchmark for Vericoding: Formally Verified Program Synthesis
Abstract: We present and test the largest benchmark for vericoding, the LLM generation of formally verified code from formal specifications — in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has lifted success on pure Dafny verification from 68% to 97% over the past year.
URL: https://openreview.net/forum?id=Zgh5kpGAm8
---
Title: Online Dense Video Captioning with Factorized Action Object Retrieval
Abstract: Dense video captioning presents the dual challenge of temporally localizing events and generating descriptive captions within long videos. However, existing methods often struggle to handle evolving contexts in streaming settings or depend on static, global retrieval mechanisms. To address these limitations, we introduce a novel framework that embeds a dynamic, factorized retrieval mechanism directly into a causally-aware video processing backbone. Unlike approaches utilizing static global retrieval, our method dynamically retrieves concise action and object phrases at each timestep as the video streams. These retrieved phrases are integrated into a causal, autoregressive transformer, enriching the video representation to enhance the text decoder. Furthermore, to mitigate the scarcity of densely annotated video data, we introduce an image-based simulated video pretraining strategy. Experiments on the ViTT, YouCook2, and ActivityNet benchmarks demonstrate that our model significantly outperforms existing global and online methods.
URL: https://openreview.net/forum?id=rL7ns0ngAQ
---
Title: Beyond Naïve Prompting: Strategies for Improved Context-aided Forecasting with LLMs
Abstract: Real-world forecasting requires models to integrate not only historical data but also relevant contextual information provided in textual form. While large language models (LLMs) show promise for context-aided forecasting, critical challenges remain: we lack diagnostic tools to understand failure modes, performance remains far below their potential, and high computational costs limit practical deployment. We introduce a unified framework of four strategies that address these limitations along three orthogonal dimensions: model diagnostics, accuracy, and efficiency. Through extensive evaluation across model families from small open-source models to frontier models including Gemini, GPT, and Claude, we uncover both fundamental insights and practical solutions. Our findings span three key dimensions: diagnostic strategies reveal the “Execution Gap” where models correctly explain how context affects forecasts but fail to apply this reasoning; accuracy-focused strategies achieve substantial performance improvements of 25-50%; and efficiency-oriented approaches show that adaptive routing between small and large models can approach large model accuracy on average while significantly reducing inference costs. These orthogonal strategies can be flexibly integrated based on deployment constraints, providing practitioners with a comprehensive toolkit for practical LLM-based context-aided forecasting.
URL: https://openreview.net/forum?id=dkjHHFJkVI
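The adaptive-routing strategy can be pictured with a few lines of pseudo-infrastructure (the model callables, the uncertainty signal, and the threshold are all stand-ins of ours, not the paper's router): default to the small model and escalate only when a cheap difficulty signal is high.
```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    small: Callable[[str], tuple[list[float], float]]  # -> (forecast, uncertainty)
    large: Callable[[str], list[float]]
    threshold: float = 0.3

    def forecast(self, query: str) -> list[float]:
        pred, uncertainty = self.small(query)
        if uncertainty > self.threshold:      # escalate only the hard cases
            return self.large(query)
        return pred

# Stand-in models for demonstration: the small model reports high uncertainty
# when the textual context mentions a disruptive event.
small = lambda q: ([1.0, 1.1, 1.2], 0.5 if "strike" in q else 0.1)
large = lambda q: [0.2, 0.1, 0.0]
r = Router(small, large)
print(r.forecast("forecast demand; context: normal week"))        # small model
print(r.forecast("forecast demand; context: port strike ahead"))  # escalated
```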
---
Title: Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA
Abstract: Modern large language model (LLM) systems increasingly rely on multi-turn pipelines that are composed of multiple task-specific adapters, yet existing serving frameworks remain inefficient, incurring substantial recomputation overhead when switching between adapters. We present the first LLM serving engine that supports cross-model prefix cache reuse between base and adapted models via Activated LoRA (aLoRA), enabling efficient and fine-grained adapter switching during inference. Our design extends the vLLM framework by introducing base-aligned block hashing and activation-aware masking within the model execution path, permitting cache reuse across models while preserving compatibility with existing serving engine optimizations. Integrated into a production-grade inference stack, this approach supports dynamic adapter activation without excessive key-value tensor recomputation. Evaluation across representative multi-turn, multi-adapter pipelines demonstrates up to 58× end-to-end latency reduction and over 100× time-to-first-token improvement relative to standard LoRA baselines, with benefits that scale with model size and sequence length and manifest across all stages of the request lifecycle. This work bridges parameter-efficient model adaptation with high-performance serving, providing the first complete realization of cross-model KV-cache reuse in modern LLM inference engines.
URL: https://openreview.net/forum?id=Q8nCBmOkyn
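A toy rendering of base-aligned block hashing (the data structures and names are our assumptions, not vLLM internals): prefix KV blocks are keyed by the base model and token content only, so the adapter's identity never enters the key and adapted runs can reuse blocks computed under the base model.
```python
import hashlib

BLOCK = 16                                   # tokens per KV-cache block
cache = {}                                   # hash -> cached KV block handle

def block_hashes(base_model, tokens):
    """Chained, base-aligned hashes: one per full block of the prefix."""
    h, out = hashlib.sha256(base_model.encode()), []
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(str(tokens[i:i + BLOCK]).encode())
        out.append(h.hexdigest())
    return out

def prefill(base_model, tokens, adapter=None):
    """Count reused blocks; note the adapter is deliberately absent from the key."""
    reused = 0
    for key in block_hashes(base_model, tokens):
        if key in cache:
            reused += 1                      # prefix KV already computed
        else:
            cache[key] = object()            # stand-in for a computed KV block
    return reused

turn1 = list(range(64))                      # 4 blocks, computed under the base
prefill("base-7b", turn1)
# The adapter run reuses all 4 base-aligned prefix blocks, computing 1 new one.
print(prefill("base-7b", turn1 + [101] * 16, adapter="alora-extractor"))  # -> 4
```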
---
Title: Progressive Checkerboards for Autoregressive Multiscale Image Generation
Abstract: A key challenge in autoregressive image generation is to efficiently sample independent locations in parallel, while still modeling mutual dependencies with serial conditioning. Some recent works have addressed this by conditioning between scales in a multiscale pyramid. Others have looked at parallelizing samples in a single image using regular partitions or randomized orders. In this work we examine a flexible, fixed ordering based on progressive checkerboards for multiscale autoregressive image generation. Our ordering draws samples in parallel from evenly spaced regions at each scale, maintaining full balance in all levels of a quadtree subdivision at each step. This enables effective conditioning both between and within scales. Intriguingly, we find evidence that in our balanced setting, a wide range of scale-up factors lead to similar results, so long as the total number of serial steps is constant. On class-conditional ImageNet, our method achieves competitive performance compared to recent state-of-the-art autoregressive systems with comparable model capacity, using fewer sampling steps.
URL: https://openreview.net/forum?id=wkCTCXo1a9
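To make the ordering concrete, here is one self-contained construction in the spirit of the abstract (our own, so details may differ from the authors' exact ordering): a sample index's base-4 digits pick a quadtree cell per level, with the coarsest level in the least significant digit so every prefix of the order stays balanced at all levels, and with diagonal (checkerboard) cells filled first within each 2x2 block.
```python
DIGIT_TO_BITS = [(0, 0), (1, 1), (1, 0), (0, 1)]   # checkerboard-first 2x2 order

def progressive_checkerboard_order(k):
    """Sampling order over all cells of a 2^k x 2^k grid."""
    order = []
    for i in range(4 ** k):
        x = y = 0
        for level in range(k):               # level 0 = coarsest quadtree split
            digit = (i >> (2 * level)) & 3   # coarse level = least significant
            xb, yb = DIGIT_TO_BITS[digit]
            shift = k - 1 - level            # coarse level sets high coord bits
            x |= xb << shift
            y |= yb << shift
        order.append((x, y))
    return order

print(progressive_checkerboard_order(1))      # [(0,0), (1,1), (1,0), (0,1)]
print(progressive_checkerboard_order(2)[:4])  # one sample per coarse quadrant
```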
---
Title: Entropy-Regularized Diffusion-Policies in Offline Reinforcement Learning for Antibody Sequence Design
Abstract: The discovery of therapeutic antibodies is traditionally performed through wet lab screening, which is costly and time-consuming. Generative models offer a data-driven alternative, however such methods become unreliable outside the training distribution. We present Sequential Diffusion + Q-Learning (SeqDiff+QL), which formulates antibody sequence design as a constrained offline Reinforcement Learning (RL) problem, enforcing proximity to the training distribution. SeqDiff+QL employs an entropy-regularized diffusion policy that, through policy improvement, is trained to sequentially generate Complementarity Determining Region (CDR) sequences with higher predicted binding affinity based on a variety of training distributions. Our novel entropy regularization thereby promotes diverse candidate generation, while the integration of biophysical priors through contrastive Variational Autoencoder (VAE) latent representations improves the stability of the generative process. The framework can learn from heterogeneous sequence sources across different training distributions. Using the Absolut! simulator and Rosetta energy function as affinity evaluation oracles, we show that SeqDiff+QL produces candidate sequences with improved predicted affinity across multiple target antigens while maintaining diversity.
URL: https://openreview.net/forum?id=KIQKcfjpTh
---
Title: Overcoming Reasoning Shortcuts in Neurosymbolic Learning via Efficient Generative Proxies
Abstract: Symbol grounding, the task of linking high-dimensional sensory inputs to symbolic representations in neurosymbolic AI (NeSy), often suffers from reasoning shortcuts, where inputs are mapped to unintended concepts due to limited supervision. Reconstruction-based training can help mitigate these ambiguities, but its effectiveness depends strongly on the quality and capacity of the reconstruction model. In this work, we propose a new grounding framework, Efficient Generative Proxies (EGP), that cleanly integrates reconstruction-based training into a generative modeling perspective. EGP subsumes several existing grounding approaches as special cases. We further argue that the role of reconstruction should be to capture the underlying structure of the data rather than to faithfully reconstruct inputs. Accordingly, we design a reconstruction term that leverages the principle that similar inputs should correspond to similar concept labels, thereby substantially reducing grounding ambiguity. We also develop extensions that incorporate additional inductive biases through this reconstruction term, improving robustness in more complex tasks. We evaluate our approach on tasks susceptible to reasoning shortcuts from the RSbench benchmark, as well as on the multi-concept ObjectMath dataset, integrating EGP into state-of-the-art neurosymbolic learning frameworks. Experimental results demonstrate that EGP significantly improves grounding accuracy and effectively mitigates reasoning shortcuts across diverse settings.
URL: https://openreview.net/forum?id=Sl2aC9hiaN
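A condensed rendering of the stated reconstruction principle (our loss form, not the authors' exact term): weight the disagreement between predicted concept distributions by input similarity, so that similar inputs are pushed toward similar concept labels.
```python
import torch
import torch.nn.functional as F

def similarity_proxy_loss(x, concept_logits, temperature=0.1):
    """x: (B, D) inputs; concept_logits: (B, K) concept predictions."""
    xn = F.normalize(x.flatten(1), dim=1)
    sim = (xn @ xn.T / temperature).softmax(dim=1)     # input-similarity weights
    probs = concept_logits.softmax(dim=1)
    dist = torch.cdist(probs, probs, p=1)              # label disagreement
    # Heavily weighted pairs (similar inputs) are pushed to agree on concepts.
    return (sim * dist).mean()

x = torch.randn(16, 32)
logits = torch.randn(16, 10, requires_grad=True)
similarity_proxy_loss(x, logits).backward()
```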
---
Title: GeoSDF: Plane Geometry Diagram Synthesis via Signed Distance Field
Abstract: Plane Geometry Diagram Synthesis has been a crucial task in computer graphics, with applications ranging from educational tools to AI-driven mathematical reasoning. Traditionally, we rely on manual tools (e.g., Matplotlib and GeoGebra) to generate precise diagrams, but this usually requires huge, complicated calculations. Recently, researchers have started to work on model-based methods (e.g., Stable Diffusion and GPT-5) to automatically generate diagrams, saving operational cost but typically suffering from limited realism and insufficient accuracy. In this paper, we propose a novel framework, GeoSDF, to automatically generate diagrams efficiently and accurately with Signed Distance Field (SDF). Specifically, we first represent geometric elements (e.g., points, segments, and circles) in SDF, then construct a series of constraint functions to represent geometric relationships. Next, we optimize those constructed constraint functions to get an optimized field of both elements and constraints. Finally, by rendering the optimized field, we can obtain the synthesized diagram. In our GeoSDF, we define a symbolic language to represent geometric elements and constraints, and our synthesized geometry diagrams can be self-verified in SDF, ensuring both mathematical accuracy and visual plausibility. Through extensive experiments, we demonstrate GeoSDF’s ability to synthesize high-quality geometry diagrams across various levels of complexity, including standard high-school problems and IMO-level (International Mathematical Olympiad) challenges. We achieve an impressive 88.67\% synthesis accuracy as evaluated by human experts on the IMO problem set. Furthermore, leveraging the self-verification property, we attain a geometry problem-solving accuracy exceeding 95\%, outperforming the current state-of-the-art (approximately 75\%) by a significant margin of ~20\%. These results highlight the advantages of GeoSDF, paving the way for more sophisticated, accurate, and flexible geometric diagram generation across a wide range of applications. The accompanying code, datasets, and all synthesized outputs will be released to benefit the research community upon acceptance of the paper.
URL: https://openreview.net/forum?id=Mzywoes4NO
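A minimal sketch of the optimize-the-field idea, stripped of the paper's symbolic language (the scene and optimizer settings are our assumptions): geometric relations become signed-distance constraints, and free coordinates are optimized until all constraints vanish. Here an unknown point is asked to lie on a given circle and on a given line.
```python
import torch

def sdf_circle(p, center, r):
    return torch.norm(p - center) - r                 # 0 exactly on the circle

def sdf_line(p, a, b):
    d = (b - a) / torch.norm(b - a)                   # unit direction
    n = torch.stack([-d[1], d[0]])                    # unit normal
    return torch.dot(p - a, n)                        # 0 on the line through a, b

center, r = torch.tensor([0.0, 0.0]), 1.0
a, b = torch.tensor([-2.0, 0.5]), torch.tensor([2.0, 0.5])

p = torch.tensor([1.5, -1.0], requires_grad=True)     # free point to solve for
opt = torch.optim.Adam([p], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    loss = sdf_circle(p, center, r) ** 2 + sdf_line(p, a, b) ** 2
    loss.backward()
    opt.step()
print(p.detach(), loss.item())   # a circle-line intersection, loss ~ 0
```
The self-verification property falls out for free: after optimization, evaluating the same constraint functions at the solved coordinates certifies (up to tolerance) that the diagram satisfies its stated relations.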
---
Title: Convergence Analysis of Two-Layer Neural Networks under Gaussian Input Masking
Abstract: We investigate the convergence guarantee of two-layer neural network training with Gaussian randomly masked inputs. This scenario corresponds to Gaussian dropout at the input level, or noisy input training common in sensor networks, privacy-preserving training, and federated learning, where each user may have access to partial or corrupted features. Using a Neural Tangent Kernel (NTK) analysis, we demonstrate that training a two-layer ReLU network with Gaussian randomly masked inputs achieves linear convergence up to an error region proportional to the mask's variance. A key technical contribution is resolving the randomness within the non-linear activation, a problem of independent interest.
URL: https://openreview.net/forum?id=7bqpRvrrD0
---
Title: DecompDreamer: A Composition-Aware Curriculum for Structured 3D Asset Generation
Abstract: Current text-to-3D methods excel at generating single objects but falter on compositional prompts. We argue this failure is fundamental to their optimization schedules, as simultaneous or iterative heuristics predictably collapse under a combinatorial explosion of conflicting gradients, leading to entangled geometry or catastrophic divergence. In this paper, we reframe the core challenge of compositional generation as one of optimization scheduling. We introduce DecompDreamer, a framework built on a novel staged optimization strategy that functions as an implicit curriculum. Our method first establishes a coherent structural scaffold by prioritizing inter-object relationships before shifting to the high-fidelity refinement of individual components. This temporal decoupling of competing objectives provides a robust solution to gradient conflict. Qualitative and quantitative evaluations on diverse compositional prompts demonstrate that DecompDreamer outperforms state-of-the-art methods in fidelity, disentanglement, and spatial coherence.
URL: https://openreview.net/forum?id=3qy4J6QFbn
---