Reproducibility Certification, J2C Certification: ARROW: Augmented Replay for RObust World models
Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo
https://openreview.net/forum?id=3FK2tFwNwK
---
Expert Certification: Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping
Linzh Zhao, Aki Rehn, Mikko A. Heikkilä, Razane Tajeddine, Antti Honkela
https://openreview.net/forum?id=UlzcKSHVoN
---
Accepted papers
===============
Title: Symmetric Divergence and Normalized Similarity: A Unified Topological Framework for Representation Analysis
Authors: Yan Wang, Tianyang Hu
Abstract: Topological Data Analysis (TDA) offers a principled, intrinsic lens for comparing neural representations. However, existing paired topological divergences (e.g., RTD) are limited by heuristic asymmetry and, more critically, unbounded scores that depend on sample size, hindering reliable cross-scenario benchmarking. To address these challenges, we develop a unified topological toolkit serving two complementary needs: fine-grained structural diagnosis and robust, standardized evaluation.
First, we complete the RTD framework by introducing Symmetric Representation Topology Divergence (SRTD) and its efficient variant SRTD-lite. Beyond resolving the theoretical asymmetry of prior variants, SRTD consolidates diagnostic information into a single, comprehensive cross-barcode signature. This allows for precise localization of structural discrepancies and serves as an effective optimization objective without the overhead of dual directional computations.
Second, to enable reliable benchmarking across heterogeneous settings, we propose Normalized Topological Similarity (NTS). By measuring the rank correlation of hierarchical merge orders, NTS yields a scale-invariant metric bounded between -1 and 1, effectively overcoming the scale and sample-dependence of unnormalized divergences.
Experiments across synthetic and real-world deep learning settings demonstrate that our toolkit captures functional shifts in CNNs missed by geometric measures and robustly maps LLM genealogy even under distance saturation, offering a rigorous, topology-aware perspective that complements measures like CKA.
URL: https://openreview.net/forum?id=pGgJ9qB2Io
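As a rough illustration of the idea behind NTS described in the abstract above (rank correlation of hierarchical merge orders), the sketch below compares two point clouds by the Spearman correlation of their single-linkage merge heights. The choice of single linkage, the Euclidean metric, and all function names are assumptions made for this example; the abstract does not specify the authors' exact construction.

```python
import math
from itertools import combinations


def merge_heights(points):
    """Single-linkage merge height for every pair of points.

    The height at which two points join the same cluster equals the
    minimax path distance between them, computed here with a
    Floyd-Warshall-style update.
    """
    n = len(points)
    h = [[math.dist(p, q) for q in points] for p in points]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                h[i][j] = min(h[i][j], max(h[i][k], h[k][j]))
    return [h[i][j] for i, j in combinations(range(n), 2)]


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def merge_order_similarity(rep_a, rep_b):
    """Hypothetical NTS-like score for two representations of the same items."""
    return spearman(merge_heights(rep_a), merge_heights(rep_b))


rep_a = [(0.0,), (1.0,), (3.0,), (7.0,)]
rep_scaled = [(3.0 * x,) for (x,) in rep_a]       # same structure, larger scale
rep_other = [(0.0,), (4.0,), (6.0,), (7.0,)]      # different merge order
print(merge_order_similarity(rep_a, rep_scaled))
print(merge_order_similarity(rep_a, rep_other))
```

Because only ranks enter the score, rescaling one representation leaves it unchanged, which is the scale-invariance property the abstract attributes to NTS.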
---
Title: Attention-Head Binding as a Term-Conditioned Mechanistic Marker of Accessibility Concept Emergence in Language Models
Authors: Khanh-Dung Tran
Abstract: Assessing when language models develop specific capabilities remains challenging, as behavioral evaluations are expensive and internal representations are opaque. We introduce attention-head binding ($EB^*$), a lightweight mechanistic metric that tracks how attention heads bind multi-token technical terms, such as accessibility concepts ("screen reader," "alt text"), into coherent units during training. Using seven models across five architectures, including Pythia (160M, 1B, 2.8B), OLMo-1B, CRFM GPT-2 Small (5 seeds), SmolLM3-3B, and Qwen2.5-1.5B, we evaluate on 41 canonical accessibility terms ($N=205$ prompts) and the 9-term pilot set, reporting five empirical findings. Discriminant validity validates $EB^*$ against token co-occurrence baselines (nonsense $0.26 \to$ real terms $0.74$, all $p<0.001$, $d=1.2$--$2.9$). The relationship between binding and behavior shifts markedly over the course of training. Early in training, the two are tightly coupled ($\rho=+0.57$, $p<0.001$). Later, this pattern reverses into a decoupled regime ($\rho=-0.20$, $p=0.01$). Cross-architecture replication confirms C1-B: OLMo-1B achieves 90% $EB^*$-leads ($p<0.0001$), CRFM 72.7% ($p \ll 0.001$). This gives rise to a two-factor model. First, a parameter threshold around 1B parameters controls how deeply decoupling occurs. Second, a training-step threshold near 300K steps determines when the temporal ordering between binding and behavior emerges (C1/C4). High-binding/mid-accuracy checkpoints contain unlockable latent knowledge, yielding few-shot gains up to 61 percentage points (a 183% relative improvement), replicated at 18--37 points across six of seven models (CRFM shows weak unlockability at +7.6 pp due to undertraining). Modern models such as SmolLM3 and Qwen show headroom compression where they reach the same absolute ceiling near 0.72, but display smaller nominal gains because their zero-shot baselines are already high (C3).
Causal ablation reveals opposite regimes across scales. At 160M, binding heads remain necessary for performance. Removing them impairs accuracy by 16.7 percentage points. At 2.8B, these same heads have become functionally superseded; ablating them improves performance by 33.3 points. Cross-architecture C5 reveals three distinct patterns. First, OLMo and Qwen achieve near-perfect recognition ceiling with negligible ablation effects. Second, SmolLM3 operates in a distributed regime with negative specificity ($-0.043$). Third, CRFM displays striking initialization sensitivity, with four of five random seeds showing coupled behavior and one seed exhibiting suppressor dynamics (C5). Beyond establishing attention binding as a diagnostic for concept emergence, these findings demonstrate a qualitative shift in how mechanistic structures map to behavioral competence across model scales, a phenomenon we term the "binding-behavior decoupling effect".
Code: https://github.com/RayoHQ/attention-binding-a11y
URL: https://openreview.net/forum?id=QG7mfCy9mu
---
Title: VT-DUDA: Visual Token Conditioning for Diffusion-guided Unsupervised Domain Adaptation
Authors: Xuan Qi, Daniele Berardini, Dario Serez, Vito Paolo Pastore, Vittorio Murino
Abstract: Unsupervised domain adaptation (UDA) aims to learn a target-domain classifier from labeled source data and unlabeled target data under distribution shift. Recent diffusion-based UDA methods approach this problem by synthesizing labeled target-style images and training on the resulting synthetic data. However, their performance depends heavily on the conditioning design: class prompts provide only coarse guidance, while domain adaptation modules mainly control appearance, which may leave target-style synthesis insufficiently specified. We propose VT-DUDA, a visual-token conditioning framework for diffusion-guided UDA. Instead of relying only on text prompts, VT-DUDA uses source images to provide additional instance-level visual context for target-style synthesis. Specifically, VT-DUDA maps each source image to a compact sequence of visual tokens and forms a hybrid conditioning context by concatenating these tokens with the corresponding text embeddings along the cross-attention context dimension of a latent diffusion model. This provides instance-dependent conditioning beyond text alone, while synthesis is performed with the target-domain adapter branch. Because guidance is represented explicitly as a token sequence, the same interface also permits inference-time manipulation of the conditioning signal through token selection and token-strength adjustment. The proposed method preserves the standard diffusion objective and can be integrated into existing adapter-based diffusion frameworks without modifying the backbone. Across Office-31, Office-Home, and VisDA-2017, VT-DUDA improves average target-domain accuracy over strong discriminative and diffusion-based UDA baselines. The results suggest that, in generation-based UDA, a stronger conditioning interface can improve the downstream usefulness of synthetic target-style data. The project page is available at https://xuanqi99.github.io/VT-DUDA/.
URL: https://openreview.net/forum?id=Y956680PCe
---
Title: Rethinking Dataset Quantization: Efficient Coreset Selection via Semantically-Aware Data Augmentation
Authors: Yangze Liu, Hong Liu
Abstract: Coreset selection aims to reduce the computational burden of training large-scale deep learning models by identifying representative subsets from massive datasets. However, existing state-of-the-art methods face a fundamental accessibility dilemma: they either require extensive training on the target dataset to compute selection metrics, or depend heavily on large pre-trained models, undermining the core purpose of coreset selection in resource-constrained scenarios. Dataset Quantization (DQ) avoids full dataset training but relies on expensive pre-trained models, introducing computational overhead and domain-specific biases that limit generalization. In this work, we comprehensively redesign the DQ framework to establish a more accessible and domain-robust paradigm for coreset selection. Through rigorous analysis, we identify that: (1) MAE functions primarily as biased data augmentation leveraging memorized ImageNet patterns; (2) MAE benefits ImageNet-related datasets but harms out-of-distribution performance; (3) the original pipeline suffers from feature inconsistency between selection and training phases. We propose DQ_v2, which: (1) eliminates pre-trained model dependencies via Semantically-Aware Data Augmentation (SDA) using randomly initialized CNNs; (2) restructures the pipeline by performing augmentation before selection, ensuring feature consistency. Extensive experiments demonstrate that DQ_v2 achieves superior performance across diverse domains (such as ImageNet-1k, CUB-200, Food-101, and medical imaging) while reducing end-to-end coreset construction cost by 41% on ImageNet-1k (95% in the augmentation phase alone), establishing a robust and practical solution for resource-constrained scenarios.
URL: https://openreview.net/forum?id=Mb2nn1yx66
---
Title: Let's Measure Information Step-by-Step: AI-Based Evaluation Beyond Vibes
Authors: Zachary Robertson, Sanmi Koyejo
Abstract: We evaluate artificial intelligence (AI) systems without ground truth by exploiting a link between strategic gaming and information loss. Building on established information theory, we analyze which mechanisms resist adversarial manipulation. This motivates mutual evaluation, where the overseer is treated as a strategic player estimating mutual information by prompting, making truthful agent reporting an optimal strategy. We show that certain f-divergences, such as total variation distance (TVD), maintain polynomial guarantees under attack, building on an established exponential barrier for estimating mutual information (MI) in worst-case certification settings. Under adversarial attacks, TVD-MI maintains effectiveness (area under the curve 0.70--0.77) while other approaches can decay toward chance, demonstrating that prompting the same system for information relationships rather than quality judgments can improve robustness. The mechanisms decompose pairwise evaluations into reliable item-level detection scores without ground truth, addressing a key limitation of standard peer prediction. Pre-registration: https://osf.io/c7pum.
URL: https://openreview.net/forum?id=i7T1tFvFQM
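For readers unfamiliar with the quantity, a TVD-based mutual information can be written for a known discrete joint distribution as the total variation distance between the joint and the product of its marginals. The sketch below computes that textbook quantity; it assumes the joint table is given, whereas the paper estimates it by prompting, and the function name is hypothetical.

```python
def tvd_mutual_information(joint):
    """TVD between a discrete joint p(x, y) and the product p(x) p(y).

    `joint` is a list of rows whose entries sum to 1.
    """
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    return 0.5 * sum(
        abs(joint[i][j] - row_marginal[i] * col_marginal[j])
        for i in range(len(row_marginal))
        for j in range(len(col_marginal))
    )


independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y carry no shared information
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y is a copy of X
print(tvd_mutual_information(independent))   # 0.0
print(tvd_mutual_information(identical))     # 0.5
```

Unlike Shannon mutual information, this quantity is bounded by 1 for any pair of variables.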
---
Title: Expressivity Saturation: Reduced Affine Region Usage Under Increasing Task Complexity
Authors: Xuan Qi, Yi Wei, Fanqi Yu, Manuel Lecha
Abstract: Piecewise-affine neural networks (e.g., with ReLU or LeakyReLU activations) implement continuous piecewise-affine maps, and the number of affine regions provides a natural proxy for expressive capacity. However, the gap between theoretical region capacity and the affine regions realized after training remains insufficiently understood. We study this gap from two complementary perspectives. First, we give a rigorous, architecture-dependent theorem for affine line-segment probes: for multilayer perceptrons with piecewise-affine activations, the number of affine pieces realized along an affine line-segment probe is upper bounded by an explicit product of layer-wise width terms (and activation breakpoint factors). This yields a neuron-threshold lower bound for representing target functions with prescribed one-dimensional piece complexity, formalizing the minimal region budget required for complex signals. Second, we exactly enumerate affine regions realized within bounded 2D and higher-dimensional domains under controlled task complexity. Under fixed architectures and training protocols, increasing input--label complexity yields trained solutions with markedly fewer realized regions in the evaluation domain, even though worst-case architectural capacity is unchanged; we call this reduced region usage expressivity saturation. Moreover, in the most challenging regimes, 2D visualizations show that region-usage collapse often coincides with degraded decision boundaries. Finally, we visualize the training dynamics of affine-region partitions and decision boundaries, revealing a consistent refinement process during optimization.
URL: https://openreview.net/forum?id=JiyZE3yKv8
---
Title: Signature Kernel Scoring Rule: A Spatio-Temporal Diagnostic for Probabilistic Weather Forecasting
Authors: Archer Dodson, Ritabrata Dutta
Abstract: Modern weather forecasting has increasingly transitioned from numerical weather prediction (NWP) to data-driven machine learning forecasting techniques. While these new models produce probabilistic forecasts to quantify uncertainty, their training and evaluation may remain hindered by conventional scoring rules, primarily MSE, which are designed for single time point predictions and ignore the highly correlated data structures present in weather behaviour. This work introduces the signature kernel scoring rule to the domain of weather forecasting, which reframes weather variables as continuous paths to encode temporal and spatial dependencies through iterated integrals. Validated as strictly proper through the use of path augmentations to guarantee uniqueness, the signature kernel provides a theoretically robust metric for forecast verification and model training. Empirical evaluations through weather scorecards on WeatherBench 2 models demonstrate the signature kernel scoring rule's high discriminative power and unique capacity to capture path-dependent interactions. Following a previous demonstration of successful adversarial-free probabilistic training, we train sliding window generative neural networks using a predictive-sequential scoring rule on ERA5 reanalysis weather data. Using a lightweight model, we demonstrate that signature kernel based training outperforms climatology for forecast paths of up to fifteen timesteps.
URL: https://openreview.net/forum?id=LOLXpt4E5D
---
Title: Iterative Preference Optimization with Proximal Policy Regularization for Large Language Model Alignment
Authors: Siyuan Xu, Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Yue Mao, Shicheng Liu
Abstract: Aligning large language models (LLMs) with human preferences is commonly achieved via supervised fine-tuning followed by preference optimization. While direct preference optimization (DPO) offers a simple and efficient alternative to RLHF, its offline and off-policy nature can induce a distribution shift between the policy used to sample preference pairs and the continually updated policy being optimized, reducing data efficiency and limiting alignment gains. We propose Iterative Proximal Policy Regularized Preference Optimization (Iterative PRPO), which introduces a proximal regularization that explicitly constrains the optimized policy to remain close to the sampling policy within each iteration, thereby mitigating distribution shift while preserving the efficiency of DPO-style updates. Starting from an RLHF objective with a KL constraint to the sampling policy, we derive an equivalent direct preference optimization formulation that requires offline preference pairs under the sampling policy. Across summarization and dialogue alignment benchmarks, Iterative PRPO consistently improves win rates over offline DPO and iterative DPO baselines under both reward-model and GPT-4o evaluations, with comparable computational cost. Moreover, the same proximal regularization principle generalizes to advanced preference optimization objectives, including Identity Preference Optimization (IPO), self-play preference optimization (SPPO), and efficient exact optimization (EXO), yielding Iterative PR-IPO, PR-SPPO, and PR-EXO variants that further strengthen alignment across model scales.
URL: https://openreview.net/forum?id=xoxO5Tr4Vh
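To make the "KL constraint to the sampling policy" concrete, the sketch below evaluates the standard DPO logistic loss for one preference pair, with the log-probabilities of the current iteration's sampling policy used in the reference slot. This substitution is an assumption made for illustration: the abstract does not state the exact PRPO objective, and the function and parameter names are hypothetical.

```python
import math


def proximal_dpo_loss(logp_chosen, logp_rejected,
                      sample_logp_chosen, sample_logp_rejected, beta=0.1):
    """DPO-style loss for one preference pair.

    logp_*        : sequence log-probabilities under the policy being trained
    sample_logp_* : the same quantities under the sampling policy (assumed
                    here to play the role of the DPO reference policy)
    beta          : strength of the implicit KL regularization
    """
    margin = beta * ((logp_chosen - sample_logp_chosen)
                     - (logp_rejected - sample_logp_rejected))
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the trained policy equals the sampling policy the margin is zero
# and the loss is log(2).
at_start = proximal_dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen response's log-probability lowers the loss.
after_update = proximal_dpo_loss(-10.0, -15.0, -12.0, -15.0)
print(at_start, after_update)
```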
---
Title: Conditional Local Importance by Quantile Expectations
Authors: Kelvyn K. Bladen, Adele Cutler, D. Richard Cutler, Kevin R. Moon
Abstract: Global variable importance measures are commonly used to interpret the results of machine learning models. Local variable importance techniques assess how variables contribute to individual observations. Currently popular methods, including LIME and SHAP, provide useful measures of feature contribution in the prediction space, while leaving opportunities for improved characterization of local structure in the model loss space. Additionally, they are not natively adapted for multi-class classification problems. We propose a new model-agnostic method for calculating local variable importance, CLIQUE, that highlights locally dependent relationships, provides improved stability over permutation-based methods, and can be directly applied to multi-class classification problems. Simulated and real-world examples show that CLIQUE emphasizes locally dependent information, captures interaction behavior beyond what can be evaluated by correlations, and assigns zero importance in regions where the response is invariant to changes in variables.
URL: https://openreview.net/forum?id=gsuZFPDRqE
---
Title: Test-Time Adaptation of Vision-Language Models with Low-Rank Pseudo-Consistency
Authors: Shuvendu Roy, Ali Etemad
Abstract: While test-time adaptation (TTA) methods enable vision-language models (VLMs) to adapt under distribution shifts, they typically rely on simple feature transformations following frozen encoders while learning from potentially noisy pseudo-labels. This approach may limit adaptation under significant domain shifts. In this paper, we propose PseudoAdapter, a novel TTA framework for VLMs that introduces low-rank adapters into early layers of the encoder to enable domain-specific feature adaptation while maintaining generalization. To ensure effective learning from noisy and low-confidence predictions, PseudoAdapter combines confidence-calibrated pseudo-labelling with unsupervised consistency learning across augmented views. We further extend our approach with PseudoAdapter+, which integrates selective teacher supervision to improve adaptation with minimal overhead. Extensive evaluations on four out-of-distribution and ten cross-domain benchmarks demonstrate that our method outperforms prior state-of-the-art TTA approaches by an average of 6.84% and 3.25%, respectively. Ablation studies confirm the effectiveness of each proposed component.
URL: https://openreview.net/forum?id=GDw4pvX9aG
---
Title: Structured Noise Adaptation for Sequential Bayesian Filtering with Embedded Latent Transfer Operators
Authors: NAICHANG KE, Pongpisit Thanasutives, Yoshinobu Kawahara
Abstract: Kalman filters based on Embedded Latent Transfer Operators (ELTO) have emerged as novel statistical tools for sequential state estimation. However, a critical limitation stems from their use of simplified noise models, which fail to dynamically adapt to non-stationary processes. To address this limitation, we introduce an ELTO-based Bayesian filtering approach with a new structured parameterization for the filter's noise model. This parameterization enables structured noise adaptation, which couples the data-driven learning of an optimal time-invariant noise model with dynamic parameter adaptation that responds to changes in dynamics within non-stationary processes. Empirical results show that our structured noise adaptation improves the filter's dynamic state estimation performance in noisy, time-varying environments.
URL: https://openreview.net/forum?id=smFAyzvh5r
---
Title: Latent learning: episodic memory complements parametric learning by enabling flexible reuse of experiences
Authors: Andrew Kyle Lampinen, Martin Engelcke, Yuxuan Li, Arslan Chaudhry, James McClelland
Abstract: When do machine learning systems fail to generalize, and what mechanisms could improve their generalization? Here, we draw inspiration from cognitive science to argue that one weakness of parametric machine learning systems is their failure to exhibit latent learning---learning information that is not relevant to the task at hand, but that might be useful in a future task. Using controlled, synthetic benchmarks, we show how this perspective links failures ranging from the reversal curse in language modeling to new findings on agent-based navigation. We then highlight how cognitive science points to episodic memory as a potential part of the solution to these issues. Correspondingly, we show that a system with an oracle retrieval mechanism can use learning experiences more flexibly to generalize better across many of these challenges---thus motivating episodic memory as an important direction for research in AI. We also identify some of the essential components for effectively using retrieval, including the importance of \emph{within-experience} in-context learning for acquiring the ability to use information across retrieved experiences. In summary, our results illustrate one possible contributor to the relative data inefficiency of current machine learning systems compared to natural intelligence, and help to understand how retrieval methods might complement parametric learning to improve generalization. We close by discussing some of the links between our work and findings in cognitive science and neuroscience---including a possible perspective on hippocampal contributions to generalization---and the broader implications.
URL: https://openreview.net/forum?id=RuWGeX5ZiB
---
Title: Fallback-Enabled Closed-Set Classification: Cross-Modal Consistency in Vision-Language Models
Authors: Sijia Wang, Ricardo Henao
Abstract: Vision-Language Models (VLMs) can describe and label images; however, this does not imply that they truly process what they are perceiving. Recent studies show that, despite their breadth of training, VLMs are surprisingly unreliable as classifiers, in either closed-world or open-world settings. In this work, we explore a deeper question: can a VLM recognize when an image falls outside the set of categories it is asked to choose from? Our results reveal a surprising failure mode: even when the notion of in-set versus out-of-set is explicitly defined, VLMs often assign plausible in-set labels to out-of-set images, violating the task’s explicit constraint. Motivated by this, we propose a cross-modal consistency framework that reasons over both the visual and textual arms of the model and accepts an answer only when they agree. Experiments on three well-known datasets (DomainNet, VisDA, and iNaturalist-2021) demonstrate that this approach consistently improves balanced known vs. unknown detection over Source-Free Universal Domain Adaptation (SF-UniDA) baselines, showing that cross-modal consistency improves a VLM’s ability to follow the task logic and distinguish when an image falls outside the intended label space. Our results suggest that, with strong VLMs, fallback behavior need not rely exclusively on specialized SF-UniDA adaptation pipelines: a lightweight cross-modal consistency decision rule can be competitive with representative SF-UniDA baselines on standard benchmarks.
URL: https://openreview.net/forum?id=tOKG6sSk3I
---
Title: A Unified Theory of Sinusoidal Activation Families for Implicit Neural Representations
Authors: Alireza Morsali, MohammadJavad Vaez, Mohammad Hossein Soltani, Amirhossein Kazerouni, Babak Taati, Morteza Mohammad-Noori
Abstract: Implicit Neural Representations (INRs) model continuous signals with compact neural networks and have become a standard tool in vision, graphics, and signal processing. A central challenge is accurately capturing fine detail without heavy hand-crafted encodings or brittle training heuristics. Across the literature, periodic activations have emerged as a compelling remedy: from SIREN, which uses a single sinusoid with a fixed global frequency, to more recent architectures employing multiple sinusoids and, in some cases, trainable frequencies and phases. We study this family of sinusoidal activations and develop a principled theoretical and practical framework for trainable sinusoidal activations in INRs. Concretely, we instantiate this framework with Sinusoidal Trainable Activation Functions (STAF), a Fourier-like activation whose amplitudes, frequencies, and phases are learned. Our analysis (i) establishes a Kronecker-equivalence construction that expresses trainable sinusoidal activations with standard sine networks and quantifies expressive growth, (ii) characterizes how the Neural Tangent Kernel (NTK) spectrum changes under trainable sinusoidal parameterization, and (iii) provides an initialization that yields standard normal post-activations without asymptotic central limit theorem (CLT) arguments. Empirically, on images, audio, shapes, inverse problems (super-resolution, denoising) and NeRF, STAF is competitive and often stronger on distortion-oriented reconstruction metrics such as PSNR/SSIM across the evaluated INR tasks, with favorable parameter efficiency under layer-wise sharing. While periodic activations can alleviate practical manifestations of spectral bias, our results indicate they do not eliminate it; instead, trainable sinusoids can improve the observed capacity–optimization trade-off in the evaluated settings.
URL: https://openreview.net/forum?id=ZDmBPYptbL
---
Title: CURS: An exact method for sampling on Riemannian manifolds
Authors: Isabella Costa Maia, Pedro L. C. Rodrigues, Marco Congedo, Salem SAID
Abstract: The present work introduces curvature-based rejection sampling (CURS). This is a method for sampling from a general class of probability densities defined on Riemannian manifolds. It can be used to sample from any probability density which "depends only on distance". The idea is to combine the statistical principle of rejection sampling with the geometric principle of volume comparison. CURS is an exact sampling method, and (assuming the underlying Riemannian manifold satisfies certain technical conditions) it has a particularly moderate computational cost. The aim of the present work is to show that there are many applications where CURS should be the user's method of choice for dealing with relatively low-dimensional scenarios.
URL: https://openreview.net/forum?id=LY9ecALVDm
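The rejection-sampling half of this idea can be illustrated on the simplest manifold, the unit circle, for a density that depends only on geodesic distance to a base point. The sketch below is plain rejection sampling with a uniform proposal; it does not implement the curvature-based proposal or the volume-comparison bound that define CURS, and the target density and all names are assumptions chosen for the example.

```python
import math
import random


def sample_circle_by_distance(n, sigma, rng):
    """Draw n exact samples from the density on the circle proportional to
    exp(-d(theta, 0)^2 / (2 sigma^2)), where d is geodesic distance.

    Angles are returned in [-pi, pi], with the base point at 0.
    """
    samples = []
    while len(samples) < n:
        theta = rng.uniform(-math.pi, math.pi)   # uniform proposal on the circle
        d = abs(theta)                           # geodesic distance to the base point
        # The unnormalized target is at most 1, so it can serve directly as
        # the acceptance probability against the uniform proposal.
        if rng.random() < math.exp(-d * d / (2.0 * sigma * sigma)):
            samples.append(theta)
    return samples


rng = random.Random(0)
draws = sample_circle_by_distance(2000, sigma=0.5, rng=rng)
print(sum(draws) / len(draws))   # close to the base point 0
```

Every accepted draw follows the target exactly. The cost is the acceptance rate, which is what a better-matched proposal would improve.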
---
Title: Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints
Authors: Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, M.C. Schut
Abstract: Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We formalize the problem and introduce two efficient methods designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. These methods are lightweight, requiring tuning only two parameters or a small neural network trained to capture soft constraints while maintaining the original ranking structure. To evaluate the task, we extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that our methods can capture soft constraints while maintaining robust query answering performance and adding very little overhead. With our work, we explore a new and flexible way to interact with graph databases that allows users to specify their preferences by providing examples interactively.
URL: https://openreview.net/forum?id=Qb6vIM7MxE
---
Title: Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos
Authors: Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi, Shenlong Wang, Wei-Chiu Ma
Abstract: The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.
URL: https://openreview.net/forum?id=ZEmv4DhaGL
---
Title: Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
Authors: Haotian Xue, Yunhao Ge, Yu Zeng, Zhaoshuo Li, Ming-Yu Liu, Yongxin Chen, Jiaojiao Fan
Abstract: Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. However, existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations -- for example, selecting which trajectory better describes an event in the image. In this work, we introduce the Point-It-Out benchmark, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding. We propose a hierarchical evaluation protocol spanning three stages (S1: referred-object localization, S2: task-driven pointing, and S3: visual trace prediction), with data collected from critical domains for embodied intelligence, including indoor, kitchen, driving, and robotic manipulation scenarios. Extensive experiments with over ten state-of-the-art VLMs reveal several interesting findings. For example, strong general-purpose models such as GPT-4o, while excelling on many benchmarks (e.g., language, perception, and reasoning), underperform compared to some open-source models in precise visual grounding; models such as MoLMO perform well in S1 and S2 but struggle in S3, which requires grounding combined with visual trace planning.
URL: https://openreview.net/forum?id=9e0hRhFsal
---
Title: Taming Modality Entanglement in Continual Audio-Visual Segmentation
Authors: Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang
Abstract: Recently, significant progress has been made in multi-modal continual learning, aiming to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods mainly focus on coarse-grained tasks, with limitations in addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to continuously segment new classes guided by audio. Through comprehensive analysis, two critical challenges are identified: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused.
In this work, a Collision-based Multi-modal Rehearsal (CMR) framework is designed to address these challenges. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, which increases the rehearsal frequency of samples from confusable classes during training.
Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods. The source code will be made publicly available upon acceptance.
URL: https://openreview.net/forum?id=8mPymf31zG
---
Title: Concatenated Matrix SVD: Compression Bounds, Incremental Approximation, and Error-Constrained Clustering
Authors: Maksym Shamrai
Abstract: Large collections of matrices arise throughout modern machine learning, signal processing, and scientific computing, where they are commonly compressed by concatenation followed by truncated singular value decomposition (SVD). This strategy enables parameter sharing and efficient reconstruction and has been widely adopted across domains ranging from multi-view learning and signal processing to neural network compression. However, it leaves a fundamental question unanswered: which matrices can be safely concatenated and compressed together under explicit reconstruction error constraints? Existing approaches rely on heuristic or architecture-specific grouping and provide no principled guarantees on the resulting SVD approximation error. In the present work, we introduce a theory-driven framework for compression-aware clustering of matrices under SVD compression constraints. Our analysis establishes new spectral bounds for horizontally concatenated matrices, deriving global upper bounds on the optimal rank-$r$ SVD reconstruction error from lower bounds on singular value growth. The first bound follows from Weyl-type monotonicity under blockwise extensions, while the second leverages singular values of incremental residuals to yield tighter, per-block guarantees. We further develop an efficient approximate estimator based on incremental truncated SVD that tracks dominant singular values without forming the full concatenated matrix. Building on these results, we propose three clustering algorithms that merge matrices only when their predicted joint SVD compression error remains below a user-specified threshold. The algorithms span a trade-off between speed, provable accuracy, and scalability, enabling compression-aware clustering with explicit error control.
URL: https://openreview.net/forum?id=E9n35dehqx
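The two ingredients named in this abstract, singular-value monotonicity under horizontal concatenation and an error-thresholded merge decision, can be illustrated numerically. The following is a minimal sketch using NumPy, not the paper's algorithms: the helper names `rank_r_error` and `should_merge` are hypothetical, and the merge test computes the exact joint error with a full SVD rather than the paper's bounds or incremental estimator.

```python
import numpy as np

def rank_r_error(M, r):
    """Frobenius error of the best rank-r approximation (Eckart-Young)."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sqrt(np.sum(s[r:] ** 2)))

def should_merge(A, B, r, tol):
    """Hypothetical merge test: accept only if the exact joint error is within tol."""
    return rank_r_error(np.hstack([A, B]), r) <= tol

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))  # rank-3 block
B = A @ rng.standard_normal((15, 10))  # lies in A's column space
C = rng.standard_normal((20, 10))      # unrelated full-rank block

# Appending columns cannot decrease any singular value of A.
s_A = np.linalg.svd(A, compute_uv=False)
s_AB = np.linalg.svd(np.hstack([A, B]), compute_uv=False)
monotone = bool(np.all(s_AB[: len(s_A)] >= s_A - 1e-9))

merge_AB = should_merge(A, B, r=3, tol=1e-6)  # shared subspace: cheap to compress jointly
merge_AC = should_merge(A, C, r=3, tol=1e-6)  # unrelated block: joint rank-3 error is large
```

In this toy setup the block that shares a column space with `A` passes the merge test, while the unrelated block does not.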
---
Title: A Survey on Benchmarks of LLM-based GUI Agents
Authors: Yihong Chen, Shuai Wang, Yaqing Wang, Quanming Yao
Abstract: LLM-based GUI agents have made rapid progress in understanding visual interfaces, interpreting user intentions, and executing multi-step operations across web, mobile, and desktop environments. As these agents become more capable, systematic and reproducible evaluation has become essential for measuring progress and identifying remaining weaknesses. This survey provides a comprehensive overview of benchmarks for LLM-based GUI agents, covering three major categories: grounding and QA tasks, navigation and multi-step reasoning tasks, and open-world environments that reflect realistic and dynamic software usage. We examine how existing benchmarks evaluate both component-level abilities, such as intent understanding, GUI grounding, navigation, and context tracking, and system-level abilities, such as adaptation, personalization, privacy protection, safety, and computational efficiency. By comparing datasets, environments, and evaluation metrics, the survey reveals clear trends in benchmark design, along with persistent gaps including limited adaptability, vulnerability to malicious interfaces and prompt attacks, lack of interpretability, and significant computational overhead. We highlight emerging directions such as safety-aware evaluation, human-centered evaluation, personalization, lightweight deployment, and zero-shot generalization. This survey aims to serve as a practical guide for researchers who design GUI agents, build benchmarks, or study LLM-driven user interface automation. We further provide a cross-benchmark audit that clarifies capability coverage and comparison boundaries, arguing that GUI-agent benchmark scores should be interpreted within benchmark families rather than as a single universal leaderboard.
URL: https://openreview.net/forum?id=ri3yPWE21Q
---
Title: Design Criteria for SGD Preconditioners: Local Conditioning, Noise Floors, and Basin Stability
Authors: Mitchell T. Scott, Tianshi Xu, Ziyuan Tang, Alexandra Pichette-Emmons, Qiang Ye, Yousef Saad, Yuanzhe Xi
Abstract: Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$. Our bounds make explicit how both the convergence rate and the stochastic noise floor depend on $\mathbf{M}$. For nonconvex objectives, we establish a basin-stability guarantee in a local $\mathbf{M}$-metric neighborhood around a minimizer set: under local smoothness and a local PL condition, we give an explicit lower bound on the probability that the iterates remain in the basin up to a time horizon. This perspective is particularly relevant in Scientific Machine Learning (SciML), where reaching small training losses under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. Our framework covers both diagonal/adaptive and curvature-aware preconditioners and yields a practical criterion: choose $\mathbf{M}$ to improve local conditioning while attenuating noise in the $\mathbf{M}^{-1}$-norm. Experiments on a quadratic diagnostic and three SciML benchmarks support the predicted rate--floor behavior.
URL: https://openreview.net/forum?id=vo8FOBt6f6
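The effect of a preconditioner on local conditioning can be seen on a toy quadratic. The sketch below is an illustration under stated assumptions, not the paper's experiments: the objective, the diagonal preconditioner `m_diag`, the step sizes, and the Gaussian gradient-noise model are all chosen here for demonstration. Each run uses a step size near the stability limit of its own geometry.

```python
import random

H = (100.0, 1.0)  # diagonal Hessian of f(x) = 0.5 * sum(h_i * x_i^2)

def loss(x):
    return 0.5 * sum(h * xi * xi for h, xi in zip(H, x))

def run_sgd(m_diag, lr, steps=100, noise_std=0.0, seed=0):
    """SGD with update x <- x - lr * M^{-1} (grad + noise), M = diag(m_diag)."""
    rng = random.Random(seed)
    x = [1.0, 1.0]
    for _ in range(steps):
        grad = [h * xi + rng.gauss(0.0, noise_std) for h, xi in zip(H, x)]
        x = [xi - lr * g / m for xi, g, m in zip(x, grad, m_diag)]
    return loss(x)

# Noise-free runs isolate the convergence rate.
plain_loss = run_sgd(m_diag=(1.0, 1.0), lr=0.01)  # ill-conditioned: slow along x_2
precond_loss = run_sgd(m_diag=H, lr=0.5)          # M = Hessian: condition number 1

# With gradient noise the iterates stall at a nonzero noise floor.
noisy_precond_loss = run_sgd(m_diag=H, lr=0.5, noise_std=0.1)
```

Without noise, the preconditioned run converges far faster because both coordinates contract at the same rate. With noise, the loss settles at a floor that depends on the step size and on how the preconditioner scales the noise.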
---
Title: Transform-Triggered Adversarial Examples
Authors: Yaoteng Tan, Zikui Cai, Salman Asif
Abstract: Deep neural networks are vulnerable to adversarial attacks, yet most existing attack research focuses on adversarial examples that induce fixed, static mispredictions. In this work, we instead exploit a dynamical adversarial manifold that depends on image transforms, which are a group of functions commonly used for data augmentation, preprocessing, and deployment. We incorporate image transforms into the adversarial optimization process, such that at test-time the same transforms, when applied under malicious conditions, act as triggers that induce diverse adversarial behaviors. We show that a single bounded perturbation can encode behaviors that are selectively activated under different transforms. Our study shows that this transform-dependent property consistently exists across multiple deep network architectures (e.g., CNNs and transformers), computer vision tasks (e.g., image classification and object detection), and a broad range of commonly used image transforms. We further characterize how the number of embeddable targets scales with the transform, the victim architecture, and the perturbation budget. Additionally, to further motivate its real-world relevance, we extend our transform-dependent formulation to a hardware-in-the-loop setting, demonstrating its effectiveness under challenging physical conditions. In summary, we introduce a novel and controllable paradigm for adversarial attack deployment, exposing a previously uncharacterized property in deep neural networks.
URL: https://openreview.net/forum?id=If2OGZMJxs
---
Title: SSM-PixNav: State Space Models for Pixel-Guided Embodied Navigation
Authors: Athira Krishnan R, Sumohana S. Channappayya
Abstract: While navigating a robot toward a specified pixel in an image, ensuring precise spatial grounding remains challenging. Existing paradigms, including object-goal, image-goal, goal-instance, and pixel-goal navigation, address this problem at different levels of granularity. We focus on pixel-goal navigation, which provides pixel-level supervision for more precise localization. Prior approaches primarily rely on RGB observations, limiting geometric awareness in scenarios where visually similar regions differ in navigability. Transformer-based policies improve temporal modeling but incur high computational cost, and there is a lack of standardized open benchmarks for reproducible evaluation. We address these limitations in three ways. First, we propose RGBD-PixNav, which integrates depth directly into the policy. Second, we introduce a Mamba-based State Space Model (SSM) for efficient temporal modeling, along with causal SSM policies and a depth-gating mechanism for adaptive fusion of RGB and depth features. Third, we construct the PixNav Trajectories dataset using HM3D scenes in Habitat-Sim to enable a reproducible benchmark. Experiments show that the Causal SSM-RGB and RGBD variants outperform strong baselines, improving the success rate by approximately 0.4 while reducing the parameter count to $\approx$ 27M (half). The models also demonstrate robustness to observation noise and varying history lengths. Code and dataset are available at \href{https://github.com/lfovia/SSM-PixNav}{https://github.com/lfovia/SSM-PixNav}.
URL: https://openreview.net/forum?id=RmsMd5vdBf
---
Title: Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies
Authors: Kealan Dunnett, Reza Arablouei, Volkan Dedeoglu, Dimity Miller, Raja Jurdak
Abstract: The widespread adoption of deep learning across various industries has introduced substantial challenges, particularly in terms of model explainability and security. The inherent complexity of deep learning models, while contributing to their effectiveness, also renders them susceptible to adversarial attacks. Among these, backdoor attacks are especially concerning, as they involve surreptitiously embedding specific triggers within training data, causing the model to exhibit aberrant behavior when presented with input containing the triggers. Such attacks often exploit vulnerabilities in outsourced processes, compromising model integrity without affecting performance on clean (trigger-free) input data. In this paper, we present a review of prominent existing mitigation strategies designed to counter backdoor attacks in image recognition. We provide an in-depth analysis of the theoretical foundations, practical efficacy, and limitations of these approaches. In addition, we conduct an extensive benchmarking of sixteen prominent approaches against eight distinct backdoor attacks, utilizing three datasets, four model architectures, and three poisoning ratios. Our results, derived from 122,236 individual experiments, indicate that while many approaches provide some level of protection, their performance can vary considerably. Furthermore, when compared to two seminal approaches, most newer approaches do not demonstrate substantial improvements in overall performance or consistency across diverse settings. Drawing from these findings, we propose potential directions for developing more effective and generalizable defensive mechanisms in the future.
URL: https://openreview.net/forum?id=OysA7cuCUh
---
Title: Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence
Authors: Cheonkam Jeong, Adeline Nyamathi
Abstract: Despite strong recent progress in Emotion Recognition in Conversation (ERC), two gaps remain: we still lack a clear understanding of which modeling choices materially affect performance, and we have limited linguistic analysis that connects recognition findings to interpretable discourse-level patterns. We address both gaps via a systematic study on IEMOCAP, with a cross-dataset validation on MELD that supports the saturation framing while clarifying which effects are corpus-specific.
For recognition, we conduct controlled ablations with 10 random seeds and paired tests over seeds, with correction for multiple comparisons, yielding three findings. First, conversational context is the dominant factor: performance saturates quickly, with roughly 90% of the gain observed within our context sweep achieved using only the most recent 10–30 preceding turns, depending on the label set. Second, hierarchical sentence representations are most useful in utterance-only settings, with a clear advantage on MELD, but the benefit vanishes once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, a simple integration of an external affective lexicon, SenticNet, does not improve results, consistent with pretrained encoders already capturing much of the affective signal needed for ERC. Under a strictly causal, past-only setting, our simple models attain strong performance, 82.69% 4-way and 67.07% 6-way weighted F1, indicating that competitive accuracy is achievable without access to future turns.
For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position within the utterance, p < .0001. In particular, Sad utterances show reduced left-periphery marker usage, 21.9%, relative to other emotions, 28–32%, aligning with accounts that link left-periphery markers to active discourse management. This pattern is consistent with our recognition results, where Sad benefits most from conversational context, +22 percentage points, suggesting that sadness may be more context-dependent in this corpus than emotions with stronger local pragmatic cues.
URL: https://openreview.net/forum?id=zCFQiJT7XN
---
Title: Efficient Zeroth-Order Federated Finetuning of Language Models on Resource-Constrained Devices
Authors: Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Ramin Khalili, Heba Khdr, Joerg Henkel
Abstract: Federated Learning (FL) is a promising paradigm for finetuning Large Language Models (LLMs) across distributed data sources while preserving data privacy. However, finetuning such large models is challenging on edge devices due to its high resource demand. Zeroth-order Optimization (ZO) estimates gradients through finite-difference approximations, which rely on function evaluations under random perturbations of the model parameters. Consequently, ZO with task alignment provides a potential solution, allowing finetuning using only forward passes with inference-level memory requirements and low communication overhead, but it suffers from slow convergence and higher computational demand. In this paper, we propose a new ZO-based method that applies a more efficient technique to reduce the computational demand associated with using a large number of perturbations while preserving their convergence benefits. This is achieved by splitting the model into consecutive blocks and allocating a higher number of perturbations to the second block, enabling efficient reuse of intermediate activations to update the full network with fewer forward evaluations. Our evaluation on RoBERTa-large, OPT1.3B, LLaMa-3-3.2B models shows up to $3\times$ reduction in computation compared to the other ZO-based techniques, while retaining the memory and communication benefits over first-order federated learning techniques.
URL: https://openreview.net/forum?id=nVmz9Q2l7L
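The finite-difference gradient estimate described in this abstract can be sketched in a few lines. The version below assumes a two-point (central) difference with Gaussian perturbations averaged over many directions. The function name `zo_gradient` and all parameter values are illustrative, and the paper's block-splitting and activation-reuse scheme is not reproduced here.

```python
import random

def zo_gradient(f, theta, num_perturbations=4000, eps=1e-3, seed=0):
    """Estimate grad f(theta) from forward evaluations only."""
    rng = random.Random(seed)
    d = len(theta)
    grad = [0.0] * d
    for _ in range(num_perturbations):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = f([t + eps * zi for t, zi in zip(theta, z)])
        minus = f([t - eps * zi for t, zi in zip(theta, z)])
        # Directional derivative along z, estimated by a central difference.
        scale = (plus - minus) / (2.0 * eps)
        for i in range(d):
            grad[i] += scale * z[i] / num_perturbations
    return grad

def quadratic(theta):
    return sum((t - 1.0) ** 2 for t in theta)

theta = [0.0, 2.0, -1.0]
estimate = zo_gradient(quadratic, theta)
true_grad = [2.0 * (t - 1.0) for t in theta]
```

The estimate is noisy for any finite number of perturbations. That is why more perturbations improve convergence at the price of more forward evaluations, which is the trade-off the abstract targets.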
---
Title: LiSeCo: Linear Semantic Control for Language Generation
Authors: Emily Cheng, Carmen Amo Alonso
Abstract: The prevalence of Large Language Models (LLMs) in critical applications highlights the need for controlled language generation methods that are both computationally efficient and enjoy performance guarantees. To address this need, we use a common model of concept semantics as linearly represented in an LLM’s latent space. In particular, we take the view that natural language generation traces a trajectory in this continuous semantic space, realized by the language model’s hidden activations. This view permits a control-theoretic treatment of text generation in latent space, in which we propose Linear Semantic Control (LiSeCo), a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings. In particular, we propose to directly intervene, in an online fashion, on the activations of the token that is being generated in embedding space. Crucially, LiSeCo does not simply steer activations towards a desirable region. Instead, it relies on classical techniques from control theory to precisely control activations in a context-dependent way, and guarantees that they are brought into a specific pre-defined region of embedding space that corresponds to allowed semantics. The intervention is computed in closed form according to an optimal controller formulation, minimally impacting generation time. This control of the activations in embedding space allows for fine-grained steering of attributes of the generated sequence. We demonstrate that our approach is effective on different tasks, namely toxicity, sentiment, and language (English/Spanish) steering, while maintaining text quality.
URL: https://openreview.net/forum?id=a3o2pzZuvE
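One way to picture a closed-form, minimal intervention on a linearly represented concept is a projection onto a half-space. The sketch below is an assumption made for illustration, not the paper's controller: it supposes a linear probe `w`, bias `b`, and threshold `tau` (all hypothetical names) that define the allowed region as `w·x + b <= tau`, and it applies the smallest Euclidean change that moves an activation into that region.

```python
def steer(x, w, b, tau):
    """Return the closest point to x (in Euclidean norm) with w.x + b <= tau."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score <= tau:
        return list(x)  # already in the allowed region: no intervention
    scale = (score - tau) / sum(wi * wi for wi in w)
    return [xi - scale * wi for xi, wi in zip(x, w)]

w, b, tau = [1.0, 0.0], 0.0, 1.0
steered = steer([3.0, 1.0], w, b, tau)    # probe score 3.0 exceeds tau
untouched = steer([0.5, 2.0], w, b, tau)  # probe score 0.5 is allowed
```

The intervention is context-dependent in the sense that its size depends on the current activation. Activations already inside the allowed region are left unchanged.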
---
Title: ARROW: Augmented Replay for RObust World models
Authors: Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo
Abstract: Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling.
We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
URL: https://openreview.net/forum?id=3FK2tFwNwK
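The two-buffer idea in this abstract can be sketched with a FIFO short-term buffer and a long-term buffer that keeps a uniform sample of the whole stream. Using reservoir sampling for the long-term buffer is an assumption made here for illustration. The abstract only says the buffer preserves task diversity, and the class name `TwoBufferReplay` and the half-and-half sampling split are hypothetical.

```python
import random
from collections import deque

class TwoBufferReplay:
    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque(maxlen=short_capacity)  # FIFO: recent experience only
        self.long = []                             # uniform sample over the full stream
        self.long_capacity = long_capacity
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.short.append(item)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            # Reservoir sampling: keep the new item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long[j] = item

    def sample(self, batch_size):
        half = batch_size // 2
        recent = self.rng.choices(list(self.short), k=half)
        old = self.rng.choices(self.long, k=batch_size - half)
        return recent + old

buffer = TwoBufferReplay(short_capacity=100, long_capacity=100)
for task_id in range(4):          # four tasks presented one after another
    for step in range(1000):
        buffer.add((task_id, step))

tasks_in_short = {task for task, _ in buffer.short}
tasks_in_long = {task for task, _ in buffer.long}
batch = buffer.sample(32)
```

After the stream ends, the short-term buffer holds only the last task, while the long-term buffer still contains items from every task at a fixed memory cost.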
---
Title: Dynamic Neural Graph Encoding of Inference Processes in Deep Weight Space
Authors: Di Wu, Huan Liu, Zhixiang Chi, YUANHAO YU, Konstantinos N. Plataniotis, Yang Wang
Abstract: The rapid advancements in using neural networks as implicit data representations have attracted significant interest in developing machine learning methods that analyze and process the weight spaces of other neural networks. However, efficiently handling these high-dimensional weight spaces remains challenging. Existing methods often overlook the sequential nature of layer-by-layer processing in neural network inference. In this work, we propose a novel approach using dynamic graphs to represent neural network parameters, capturing the temporal dynamics of inference. Our Dynamic Neural Graph Encoder (DNG-Encoder) processes these graphs, preserving the sequential nature of neural processing. Additionally, we leverage DNG-Encoder to develop INR2JLS (Implicit Neural Representation to Joint Latent Space) to facilitate downstream applications, such as classifying Implicit Neural Representations (INRs). Our approach demonstrates significant improvements across multiple tasks, surpassing the state-of-the-art INR classification accuracy by approximately 10\% on CIFAR-100-INR. Our code is available at \url{https://github.com/dddiowww/DNG}.
URL: https://openreview.net/forum?id=4fweEyVYLF
---
Title: Efficient Block Bi-clustering by Alternating Semidefinite Programming Relaxation
Authors: Yuxin Ma, Weiguo Gao, Xiang ZHOU
Abstract: The bi-clustering problem is a common task in data mining, often formulated as a challenging non-convex optimization problem. In this paper, we address the block bi-clustering problem using a novel formulation with semi-definite programming (SDP) relaxation and two low-rank matrix approximations. Our method alternates between optimizing the row and column membership matrices in a sequential manner, freezing one matrix while solving the subproblem for the other in each step. We prove that the numerical membership matrices generated by our algorithm achieve an error in the Frobenius norm bounded by $O(1/\sqrt{n})$ and $O(1/\sqrt{m})$, ensuring accuracy and scalability as the data dimensions grow. Through experiments on both simulated and real datasets, we demonstrate that our algorithm performs comparably to or better than existing bi-clustering methods in terms of both accuracy and efficiency.
URL: https://openreview.net/forum?id=xPA4Xg0IvL
---
Title: Persistent homology for time series: a selective review
Authors: Alexandre Bois, Hugo Henneuse, Brian Tervil, Laurent Oudre
Abstract: Over the last ten years, persistent homology has been increasingly used to analyze the structure and shape of various types of data, including time series. This article is a review of persistent homology applied to (univariate or multivariate) time series data. We review 87 articles that apply methods involving persistent homology to time series data, published between 2014 and 2025 in several domains of application, such as biomedicine, industry, and economics. We introduce the main concepts of persistent homology, give an overview of the application fields and tasks, and propose a general framework to describe the main characteristics of all the methods.
URL: https://openreview.net/forum?id=tztKO9jzBR
---
Title: Encoding Without Influence: Dissociating Demographic Representation from Causal Effect in Large Language Models
Authors: Aarushi Sharma, Phong Le
Abstract: Large language models are increasingly deployed in settings that require normative judgment, yet the internal pathway by which demographic context shapes their outputs remains uncharacterized. We apply sparse autoencoder feature extraction and causal interventions (activation patching, feature steering, and targeted ablation) to Gemma 2 9B, Qwen 2.5 7B, and Llama 3.1 8B, tracing how demographic information is represented and used during survey responses across five policy domains. We find that demographic representations and demographic influence are localized in different parts of the network: early layers encode demographic identity but exert no measurable effect on outputs, while interventions on late-layer features recover 68.7–75.8% of behavioral effects across architectures. Variance-matched null baselines confirm that these effects are specific to demographic features rather than a generic consequence of perturbation. We further show that demographic influence is domain-modulated, with the ranking of influential demographics shifting across policy areas. The dissociation is demonstrated across two architectures (Gemma 2 9B, Qwen 2.5 7B) with partial replication on a third (Llama 3.1 8B), with different encoding profiles and alignment procedures. These results suggest that representational detection alone is insufficient for bias auditing, as the most detectable demographic encodings are not the ones driving outputs, and that fairness evaluation must be both causally validated and domain-specific.
URL: https://openreview.net/forum?id=TQbXHsI3Lm
---
Title: Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping
Authors: Linzh Zhao, Aki Rehn, Mikko A. Heikkilä, Razane Tajeddine, Antti Honkela
Abstract: Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority, while significantly reducing the accuracy for others. We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression. Our method improves worst-class accuracy by over 10 percentage points on Skewed and Fashion MNIST compared to unbounded adaptive clipping, 7 points compared to Automatic clipping, and 5 points compared to constant clipping. The code is available at https://github.com/TrustworthyMLHelsinki/adaptive-clipping-fairness.
URL: https://openreview.net/forum?id=UlzcKSHVoN
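The failure mode and the fix described in this abstract can be reproduced on synthetic gradient norms. The sketch below assumes a quantile-tracking geometric update for the clipping bound, a common form of adaptive clipping that is not necessarily the exact rule used in the paper. It also omits the DP noise that a private implementation would add to the quantile estimate. The names `update_clip` and `clip_gradient` and all constants are illustrative.

```python
import math

def update_clip(clip, grad_norms, target_quantile=0.5, lr=0.2, lower_bound=0.0):
    """Move clip toward the target quantile of the norms, never below lower_bound."""
    frac_below = sum(n <= clip for n in grad_norms) / len(grad_norms)
    new_clip = clip * math.exp(-lr * (frac_below - target_quantile))
    return max(new_clip, lower_bound)

def clip_gradient(grad, clip):
    """Rescale grad so that its Euclidean norm is at most clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, clip / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

# A well-fitted majority with tiny gradients and a minority with large ones.
norms = [0.01] * 90 + [5.0] * 10
unbounded = bounded = 1.0
for _ in range(200):
    unbounded = update_clip(unbounded, norms)
    bounded = update_clip(bounded, norms, lower_bound=0.5)

clipped = clip_gradient([3.0, 4.0], bounded)
```

In this toy run the unbounded rule collapses to the scale of the majority's gradients, so minority gradients are scaled down by a factor of several hundred. The lower bound keeps the clipping threshold at 0.5.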
---
Title: Video Generation Models: A Survey of Post-Training and Alignment
Authors: Chaoyu Li, Xiaoyi Gu, Yogesh Kulkarni, Eun Woo Im, Mohammadmahdi Honarmand, Zeyu Wang, Juntong Song, Fei Du, Xilin Jiang, Kexin Zheng, Tianzhi Li, Fei Tao, Pooyan Fazli
Abstract: Video generation has rapidly progressed from short, low-quality clips to high-resolution, long-duration sequences with complex spatiotemporal dynamics. Despite strong generative priors learned through large-scale pretraining, pretrained video models often fail to reliably follow human intent, maintain temporal coherence, or satisfy physical and safety constraints. Compared with image and text generation, alignment in video generation presents unique challenges, including error accumulation over time, motion-appearance coupling, multi-objective trade-offs, and limited supervision for temporal properties. These challenges motivate systematic post-training strategies that adapt pretrained models without retraining them from scratch. In this survey, we present the first comprehensive review of post-training and alignment in video generation models. We frame post-training as a unifying framework and distinguish between implicit alignment and explicit alignment based on how alignment signals are enforced. From this perspective, we organize existing approaches into four broad categories: (1) supervised fine-tuning methods, (2) self-training and distillation methods, (3) preference- and reward-based methods, and (4) inference-time methods. This taxonomy provides a coherent view of how alignment signals shape model behavior across both training and deployment. Beyond methodological advances, we review commonly used datasets, benchmarks, and evaluation practices, and discuss open challenges such as scalable reward design, long-horizon temporal consistency, stability-expressiveness trade-offs, and safety-aware generation. This survey aims to provide a structured conceptual foundation and practical guidance for advancing controllable and reliable video generation models.
URL: https://openreview.net/forum?id=YlUEWLESIu
---
Title: Querying Kernel Methods Suffices for Reconstructing their Training Data
Authors: Daniel Barzilai, Yuval Margalit, Eitan Gronich, Gilad Yehudai, Meirav Galun, Ronen Basri
Abstract: Over-parameterized models have raised concerns about their potential to memorize training data, even when achieving strong generalization. The privacy implications of such memorization are generally unclear, particularly in scenarios where only model outputs are accessible. We study this question in the context of kernel methods, and demonstrate both empirically and theoretically that querying kernel models at various points suffices to reconstruct their training data, even without access to model parameters. Our results hold for a range of kernel methods, including kernel regression, support vector machines, and kernel density estimation. Our hope is that this work can shed light on potential privacy concerns associated with such models.
URL: https://openreview.net/forum?id=qikuoG0Th2
---
Title: Diffusion-based Cumulative Adversarial Purification for Vision Language Models
Authors: Jia Fu, Yongtao Wu, Yihang Chen, Kunyu Peng, Xiao Zhang, Volkan Cevher, Sepideh Pashami, Anders Holst
Abstract: Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We theoretically establish a provable recovery region in the forward diffusion process and quantify the convergence rate of semantic variation with respect to VLMs. These findings show that adversarial effects monotonically fade as diffusion unfolds. Guided by this principle, DiffCAP leverages noise injection with a similarity threshold of VLM embeddings as an adaptive criterion, before reverse diffusion restores a clean and reliable representation for VLM inference. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with theorems and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments. The source code is available at https://github.com/JasonFu1998/DiffCAP.
URL: https://openreview.net/forum?id=kpuV3mzwqw
---
Title: A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data
Authors: Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou
Abstract: Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs alleviate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research focuses primarily on generation methodologies, paying limited attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the \textbf{LLM Data Auditor framework}. In this framework, we first characterize how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data from two perspectives: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we audit the experimental evaluations of representative generation methods for each modality and highlight under-covered evaluation dimensions within this representative sample. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.
Our repository has been released: https://github.com/z76316/Awesome-LLM-Data-Generation.
URL: https://openreview.net/forum?id=f2gS9Ly6tA
---
Title: CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs
Authors: Raman Dutt, Pedro Sanchez, Yongcheng Yao, Steven McDonagh, Sotirios A. Tsaftaris, Timothy Hospedales
Abstract: Structured benchmarks have advanced text-conditional image generation for real-world imagery; however, no such benchmark exists for synthetic radiograph generation. Despite being a highly active area of research, existing studies continue adopting inconsistent evaluation protocols and lack a unified assessment of the three most critical criteria: generative fidelity, privacy risk, and downstream utility.
To address these limitations, we introduce CheXGenBench, the first unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and downstream utility across frontier text-to-image (T2I) generative models. Our evaluation protocol, comprising over 20 quantitative metrics, covers 11 leading T2I architectures with plug-and-play integration for newer models. Through a rigorous and fair evaluation protocol, we establish comprehensive baseline state-of-the-art (SoTA) performances across all dimensions to guide future research. Furthermore, our results uncover several limitations of current generative models: first, even SoTA models struggle with long-tailed medical distributions; second, models pose high privacy risks regardless of fidelity quality; and third, while synthetic data already benefits downstream classification, it is of limited utility for downstream multimodal tasks. Drawing from these results, we propose concrete research directions to advance the field. The code is available at https://github.com/Raman1121/CheXGenBench
URL: https://openreview.net/forum?id=wrKmzYQACp
---
Title: Amnesia: A Stealthy Replay Attack on Continual Learning Dreams
Authors: Ahmed Sharshar, K Naveen Kumar, Mohsen Guizani
Abstract: Continual learning (CL) models rely on experience replay to mitigate catastrophic forgetting, yet their robustness to replay sampling interference is largely unexplored. Existing CL attacks mostly modify inputs or update pipelines (poisoning/backdoors) and lack explicit \emph{auditable} constraints, limiting their realism. Here, \emph{auditability} means that a monitor can verify compliance using sampler-visible telemetry, e.g., logged replay index/label statistics, by checking that the realized replay class histogram stays close to a nominal baseline and that the replay rate is unchanged (per-batch and/or over a rolling window). We study a limited-privilege insider controlling only the replay \emph{index selection}, not pixels, labels, or model parameters, while staying within such auditable limits (e.g., queue priorities). We introduce \textbf{Amnesia}, a replay composition attack maximizing model degradation under two auditable budgets: a visibility budget $\delta$ bounding the $\mathrm{TV}/\mathrm{KL}$ divergence from a nominal class histogram $p_0$, and a mass budget $f$ fixing the replay rate. Amnesia uses a two-step procedure: (i) compute lightweight class utilities (e.g., EMA loss/confidence) to tilt $p_0$ toward harmful classes; (ii) project the tilt back into the $\delta$-ball using efficient $\mathrm{KL}$ (\emph{exponential tilt}) or $\mathrm{TV}$ (\emph{balanced mass redistribution}) optimizers. A windowed scheduler enforces rolling audits. Across challenging CL benchmarks (Split CIFAR-10/100, CORe50, Tiny-ImageNet) and strong replay baselines (ER, ER-ACE, SCR, DER++), Amnesia consistently depresses final accuracy (ACC$\downarrow$) and worsens backward transfer ($-\mathrm{BWT}\uparrow$). 
The $\mathrm{KL}$ variant achieves high impact while remaining largely undetected by audits, as confirmed empirically under multiple audit schemes (per-batch and rolling-window checks), whereas the $\mathrm{TV}$ variant is more damaging but more easily detected, especially under tight per-class constraints. These results expose \emph{index-only} replay control as a practical, auditable threat surface in CL systems and establish a principled impact-visibility-budget trade-off. Code is available anonymously via \href{https://github.com/ahmed-sharshar/Amensia}{GitHub}.
URL: https://openreview.net/forum?id=QSTg7z06GH
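The KL step described in the Amnesia abstract (tilt $p_0$ toward high-utility classes, then stay inside the $\delta$-ball) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of $\mathrm{KL}(q \,\|\, p_0)$ as the direction of the divergence, the bisection search, and the toy utilities are all assumptions made for illustration.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions with p > 0 everywhere."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def exponential_tilt(p0, utilities, lam):
    """Return q proportional to p0 * exp(lam * utility), normalized."""
    top = max(utilities)  # subtract the max for numerical stability
    weights = [pi * math.exp(lam * (ui - top)) for pi, ui in zip(p0, utilities)]
    total = sum(weights)
    return [w / total for w in weights]

def tilt_within_kl_budget(p0, utilities, delta, lam_max=50.0, iters=60):
    """Largest tilt (up to lam_max) whose KL from p0 stays within delta.

    Hypothetical helper: KL(q_lam || p0) is non-decreasing in lam >= 0,
    so a bisection on lam finds the boundary of the delta-ball.
    """
    if kl_divergence(exponential_tilt(p0, utilities, lam_max), p0) <= delta:
        return exponential_tilt(p0, utilities, lam_max)
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_divergence(exponential_tilt(p0, utilities, mid), p0) <= delta:
            lo = mid  # still inside the budget, tilt harder
        else:
            hi = mid  # budget exceeded, back off
    return exponential_tilt(p0, utilities, lo)

# Toy example: four classes, uniform nominal histogram, class 3 deemed most useful to replay.
p0 = [0.25, 0.25, 0.25, 0.25]
utilities = [0.0, 0.0, 0.0, 1.0]
q = tilt_within_kl_budget(p0, utilities, delta=0.05)
```

In this toy setup the returned histogram shifts mass toward class 3 while its KL divergence from `p0` stays at or just below the budget `delta`.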
---
Title: CAPTAIN: Conformal-Prediction-Based Multi-Source Time-Series Forecasting
Authors: Shuaicheng Zhang, Tuo Wang, Adithya Kulkarni, Stephen Adams, Sanmitra Bhattacharya, Sunil Reddy Tiyyagura, Edward Bowen, Balaji Veeramani, Dawei Zhou
Abstract: Uncertainty quantification is critical for real-world forecasting applications such as predictive maintenance, patient health monitoring, and environmental sensing, where decisions must account for confidence levels. Multi-source time-series forecasting introduces additional complexity due to inter-source interactions and temporal dependencies, which existing methods struggle to capture within a unified probabilistic framework. Most previous approaches also lack theoretical guarantees, leading to miscalibrated uncertainty estimates. We propose CAPTAIN, short for Conformal Prediction based multi-source Time-series forecAstiNg, a two-stage framework for uncertainty quantification in multi-source time-series forecasting. First, CAPTAIN employs Normal Inverse Gamma distributions to model source-specific uncertainties and integrates a meta-source to capture inter-source interactions. Next, temporal copulas model the evolution of joint uncertainties over time, ensuring robust and theoretically valid uncertainty coverage. Experiments on five diverse datasets, including Synthetic, Shaoxing ECG, Air Quality, NGSIM Traffic, and ETTh1, demonstrate that CAPTAIN achieves valid coverage of at least 90% across all five benchmarks, while other baselines struggle to achieve valid coverage. Our empirical results on coverage validity and interval width, together with ablation studies, show that CAPTAIN provides a better approach for multi-source uncertainty quantification than existing state-of-the-art baselines.
URL: https://openreview.net/forum?id=WJjlXHo4yS
---
Title: Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification
Authors: Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla
Abstract: Convolutional Neural Networks (CNNs) have shown remarkable performance in image classification. However, interpreting their predictions is challenging due to the size and complexity of these models. State-of-the-art saliency methods generate local explanations highlighting the area in the input image where a class is identified but cannot explain how a concept of interest contributes to the prediction. On the other hand, concept-based methods, such as TCAV, provide insights into how sensitive the network is to a human-defined concept but cannot compute its attribution in a specific prediction nor show its location within the input image. We introduce Visual-TCAV, a novel explainability framework aiming to bridge the gap between these methods by providing both local and global explanations. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate class-agnostic saliency maps that show where the network recognizes a certain concept. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. We evaluate the method's faithfulness via a controlled experiment where the ground truth for explanations is known, showing better ground truth alignment than TCAV. Our code is available at https://github.com/DataSciencePolimi/Visual-TCAV.
URL: https://openreview.net/forum?id=SLh00W5rhu
---
Title: Mitigating Embedding Leakage via Latent Disruption with Controlled Reconstruction
Authors: Zhiyuan Wu, Changkyu Choi, Shujian Yu, Robert Jenssen, Ali Ramezani-Kebrya
Abstract: Pre-trained encoders produce semantically rich latent embeddings, which, however, may expose unintended information through malicious inference or exploitation. We propose SEAL, a framework that mitigates embedding leakage by disrupting latent representations based on information-theoretic principles. It reduces the risk of potential misuse while enabling controlled reconstruction for trusted users. SEAL learns to encode controlled perturbations by minimizing the Matrix Norm-based Quadratic Mutual Information (MQMI) functional between original and perturbed embeddings within a hyperspherical latent space. Meanwhile, a private decoder is jointly trained with the SEAL encoder to reconstruct the original data that is accessible only to authorized users under an access-controlled setting. Extensive experiments on vision and text datasets demonstrate that SEAL reduces latent leakage, weakens the effectiveness of evaluated inference attacks, and preserves reconstruction under the considered setting.
URL: https://openreview.net/forum?id=nZWBrxJyrS
---
Title: RPWithPrior: Label Differential Privacy in Regression
Authors: Haixia Liu, Ruifan Huang
Abstract: With the wide application of machine learning techniques in practice, privacy preservation has gained increasing attention. Protecting user privacy with minimal accuracy loss is a fundamental task in the data analysis and mining community. In this paper, we focus on regression tasks under $\epsilon$-label differential privacy guarantees. Some existing methods fundamentally convert a regression problem into a classification problem within the framework of Label Differential Privacy. However, such operations do not align well with real-world scenarios. To overcome these limitations, we model both original and randomized responses as continuous random variables, avoiding discretization entirely. Our novel approach estimates an optimal interval for randomized responses and introduces new algorithms designed for scenarios where a prior is either known or unknown. Additionally, we prove that our algorithm, RPWithPrior, guarantees $\epsilon$-label differential privacy. Numerical results show that our method is always the best on the Communities and Crime dataset. On the Criteo Sponsored Search Conversion Log and California Housing datasets, the performance of our approach remains comparable.
URL: https://openreview.net/forum?id=FiUe0OCMaj
---
Title: It depends: Incorporating correlations for joint aleatoric and epistemic uncertainties of high-dimensional output spaces
Authors: Leonhard F. Feiner, Manuel Nickel, Martin J. Menten, Laurin Lux, Rickmer Braren, Daniel Rueckert, Georgios Kaissis, Raphael Rehms, Johannes C. Paetzold
Abstract: Uncertainty Quantification plays a vital role in enhancing the reliability of deep learning model predictions, especially in scenarios with high-dimensional output spaces. This paper addresses the dual nature of uncertainty — aleatoric and epistemic — focusing on their joint integration in high-dimensional regression tasks. For example, in applications like medical image segmentation or restoration, aleatoric uncertainty captures inherent data noise, while epistemic uncertainty quantifies the model's confidence in unfamiliar conditions. Modeling both jointly enables more reliable predictions by reflecting both unavoidable variability and knowledge gaps, whereas modeling only one limits transparency and robustness. We propose a novel approach that approximates the resulting joint uncertainty using a low-rank plus diagonal covariance structure, capturing essential output correlations while avoiding the computational burdens of full covariance matrices. Unlike prior work, our method explicitly combines aleatoric and epistemic uncertainties into a unified second-order distribution that supports robust downstream analyses like sampling and log-likelihood evaluation. We further introduce stabilization strategies for efficient training and inference, achieving superior Uncertainty Quantification in the tasks of image inpainting, colorization, optical flow, and depth estimation.
URL: https://openreview.net/forum?id=zw5EuUnBny
---
Title: Learning Fine-grained Parameter Sharing via Sparse Tensor Decomposition
Authors: Cem Üyük, Mike Lasby, Mohamed Yassin, Utku Evci, Yani Ioannou
Abstract: Large neural networks achieve state-of-the-art performance on many tasks, yet their sheer size hinders deployment on resource-constrained devices. Among existing compression approaches, cross-layer parameter sharing remains relatively unexplored for transformer models. In this paper, we introduce Fine-grained Parameter Sharing (FiPS), a unified framework for compressing transformer Multi-Layer Perceptrons (MLPs) that combines cross-block parameter sharing, low-rank factorization, and sparsity in a single optimization. FiPS concatenates MLP weight matrices across a group of transformer blocks and factorizes them into a shared basis and sparse, layer-specific projection matrices. Both factors are initialized via singular value decomposition (SVD) and jointly optimized by block-wise reconstruction error minimization. FiPS compresses Vision Transformers (ViTs) by up to 33% with less than 1% top-1 accuracy loss on ImageNet-1k, and by up to 57% when combined with fine-tuning. It also compresses Large Language Models (LLMs) by up to 20% while outperforming existing SVD-based methods in perplexity and downstream benchmarks at matched compression. Combined with Quantization-Aware Training (QAT), 3-bit FiPS on Gemma-2-2B achieves lower perplexity than 2-bit QAT alone while matching the same 8x compression. These results establish fine-grained parameter sharing as a practical and effective approach for transformer MLP compression.
URL: https://openreview.net/forum?id=vbS7Z8Zswe
---
Title: Rectified Flows for Fast Multiscale Fluid Flow Modeling
Authors: Victor Armegioiu, Yannick Ramic, Siddhartha Mishra
Abstract: We introduce \emph{ReFlow}, a conditional rectified-flow surrogate for PDE forecasting. Given an initial state $u_i$, ReFlow transports Gaussian noise $\xi$ to a sample from the conditional final-state law $p(u_f\mid u_i)$ by integrating a learned deterministic ODE. Unlike diffusion surrogates, which require many stochastic denoising steps, the rectified transport is close to straight in sampling time: on multiscale 2D flow benchmarks, ReFlow matches diffusion-level posterior statistics with as few as $8$ ODE steps, compared with $\ge 128$ network evaluations for score-based diffusion.
We also give a law-level analysis for conditional PDE surrogates. We formulate the ideal conditional rectified velocity as a barycentric transport field and show that it pushes the reference law to the target conditional law. At fixed spatial resolution, we decompose the one-step law error into a coverage term, controlled by unresolved high-frequency content via structure functions or spectral tails, and a fit term measuring approximation of the ideal velocity field. We further show that ODE discretization error is governed by the variation of the learned velocity along sampled rectified trajectories. This motivates a curvature-aware sampler that uses an EMA proxy for trajectory-wise velocity variation to stabilize inference, especially out of distribution.
Across incompressible and compressible 2D flows, ReFlow matches diffusion baselines in one-point Wasserstein statistics and energy spectra, preserves fine-scale structure missed by deterministic MSE models, and produces high-resolution conditional samples at substantially lower inference cost.
URL: https://openreview.net/forum?id=2tMD6YXgkp
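The claim in the ReFlow abstract that a nearly straight transport needs few ODE steps can be illustrated with a toy sampler. This is a minimal sketch, not the ReFlow model: the learned velocity network is replaced by a hypothetical closed-form velocity for a single known target (`make_straight_velocity`), and forward Euler is assumed as the integrator.

```python
def euler_sample(velocity, x0, n_steps=8):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x

def make_straight_velocity(target):
    """Hypothetical stand-in for a learned velocity field.

    For the straight path x_t = (1 - t) * noise + t * target, the velocity
    at (x_t, t) is (target - x_t) / (1 - t), which is constant along the path.
    """
    def velocity(x, t):
        return [(yi - xi) / (1.0 - t) for xi, yi in zip(x, target)]
    return velocity

noise = [0.3, -1.2, 0.8]    # stands in for the Gaussian draw xi
target = [1.0, 2.0, -0.5]   # stands in for a final state u_f
sample = euler_sample(make_straight_velocity(target), noise, n_steps=8)
```

Because the velocity is constant along a straight path, Euler integration incurs no discretization error here, so the step count does not matter in this toy case. A learned field that is only approximately straight would incur an error that grows with how much the velocity varies along the trajectory.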
---
Title: Adaptive and Stratified Subsampling for High-Dimensional Robust Estimation
Authors: Prateek Mittal, Joohi Chauhan
Abstract: We study robust high-dimensional sparse regression under finite-variance heavy-tailed noise, ε-contamination, and α-mixing dependence via two subsampling estimators: Adaptive Importance Sampling (AIS) and Stratified Sub-sampling (SS). Under sub-Gaussian design whose scope is precisely delimited and finite-variance noise, a subsample of size $m=\Omega(s\log p)$ achieves the minimax-optimal rate $O(\sqrt{s\log p/m})$. We close the theory-algorithm gap: Theorem 4.6 applies to AIS at termination conditional on stabilized weights (Proposition 4.1), and SS fits the median-of-means M-estimation framework of Lecué and Lerasle (Proposition 4.3). The de-biasing step is fully specified via the nodewise-Lasso precision estimator under a new sparse-precision assumption, yielding valid coordinate-wise CIs (Theorem 4.14). The α-mixing extension uses a calendar-time block protocol that guarantees temporal separation (Theorem 4.12). Empirically, AIS achieves 3.1× lower error than uniform subsampling at 20% contamination, and 29.5% lower test MSE on Riboflavin (p=4,088 ≫ n=71).
URL: https://openreview.net/forum?id=R8y19hU9Ab
---
Title: Learning Materials Interatomic Potentials via Hybrid Invariant-Equivariant Architectures
Authors: Keqiang Yan, Montgomery Bohde, Andrii Kryvenko, Ziyu Xiang, Kaiji Zhao, Siya Zhu, Saagar Kolachina, Doğuhan Sarıtürk, Jianwen Xie, Raymundo Arroyave, Xiaoning Qian, Xiaofeng Qian, Shuiwang Ji
Abstract: Machine learning interatomic potentials (MLIPs) can predict energy, force, and stress of materials and enable a wide range of downstream discovery tasks. A key design choice in MLIPs involves the trade-off between invariant and equivariant architectures. Invariant models offer computational efficiency but may not perform as well, especially when predicting high-order outputs. In contrast, equivariant models can capture high-order symmetries, but are computationally expensive. In this work, we propose HIENet, a \underline{h}ybrid \underline{i}nvariant-\underline{e}quivariant materials interatomic potential model that integrates both invariant and equivariant message passing layers. Furthermore, we show that HIENet provably satisfies key physical constraints. HIENet achieves superior performance with considerable computational speedups over prior models. Experimental results on both common benchmarks and downstream materials discovery tasks demonstrate the efficiency and effectiveness of HIENet. Finally, additional ablations further demonstrate that our hybrid invariant-equivariant approach scales well across model sizes and works with different equivariant model architectures, providing powerful insights into future MLIP designs.
URL: https://openreview.net/forum?id=fq3nrVqNmL
---
Title: Disjoint Generation of Synthetic Data
Authors: Anton Danholt Lautrup, Muhammad Rajabinasab, Tobias Hyrup, Arthur Zimek, Peter Schneider-Kamp
Abstract: We propose a new framework for generating tabular synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the design choices that one may make. The advantages achieved by the disjoint generation include: i) An observed increase in the empirical measurement of privacy. ii) Increased computational feasibility of certain model types. iii) Ability to generate synthetic data using a mixture of different generative models. Specifically, mixed-model synthesis bridges the gap between privacy and utility performance, providing highly competitive performance on Accuracy and Area Under the Curve for downstream tasks while significantly lowering the empirical re-identification risk.
URL: https://openreview.net/forum?id=LSzXkAWBKI
---
New submissions
===============
Title: A Comprehensive Survey on 3D Deep Point Cloud Models
Abstract: Recently, point cloud data has attracted the attention of researchers as a promising data representation model for a wide range of applications. Because point clouds, unlike 2D data, are unordered, irregular, and often large in scale, they pose severe challenges when designing deep learning models. Over the past decade, substantial progress has been made in proposing architectures that address permutation invariance, geometric reasoning, scalability, and robustness, leading to rapid expansion across diverse 3D-data-oriented applications. The main aims of this paper are to present a comprehensive survey of existing literature and to analyze how different 3D representations have shaped the design and performance of deep learning models. In contrast to prior surveys that have emphasized limited task subsets or specific model families, this survey reviews deep point cloud models through a representation- and architecture-centric perspective. As such, beyond (1) core tasks such as classification, segmentation, detection, and tracking, this survey systematically provides insight into recent progress in broader directions, including (2) geometric modeling, alignment, and pose estimation, (3) foundation models and scene understanding, and (4) robustness, generalization, and reliability. Furthermore, this survey presents commonly used datasets and evaluation metrics, and finally summarizes challenges and future directions toward robustness, efficiency, and generalizability of 3D point cloud systems.
URL: https://openreview.net/forum?id=WpQdfOC36s
---
Title: Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks
Abstract: Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.
URL: https://openreview.net/forum?id=1ExV3HqQjN
---
Title: JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in LLMs
Abstract: Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning abilities. However, existing deductive reasoning benchmarks suffer from significant constraints that restrict their utility, i.e., the lack of task complexity, the presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par with or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average. All code and data are available at https://anonymous.4open.science/r/JustLogic
URL: https://openreview.net/forum?id=wsDUuu6P38
---
Title: GENERIC-FNO: Embedding Energy Conservation and Entropy Production into Fourier Neural Operators
Abstract: We introduce GENERIC-FNO, the first neural operator to embed the full GENERIC
(metriplectic) structure of nonequilibrium thermodynamics---reversible,
energy-conserving dynamics and irreversible, entropy-producing dynamics coupled
through the degeneracy conditions---directly in function space. Existing
structure-preserving neural operators enforce at most a single conservation law
or a purely reversible (Hamiltonian) structure, while thermodynamically
consistent learning has so far been confined to finite-dimensional, graph, or
particle systems. GENERIC-FNO closes this gap: it learns the energy and entropy
functionals as neural operators and parameterizes the Poisson and friction
operators as diagonal Fourier multipliers sandwiched between rank-one
projections that enforce the degeneracy conditions \emph{exactly, by
construction}---with no penalty term, no projection of the predicted update, and
no free residual. The degeneracy identities therefore hold to machine precision (residuals
$\sim\!10^{-13}$) for any initialization, spatial dimension, or grid resolution,
so the continuous-time dynamics conserve the learned energy and produce entropy
exactly; the explicit time stepping we deploy adds only a small
$\mathcal{O}(\Delta t^2)$ drift (per-step energy residual $\sim\!10^{-6}$). We further observe that the $(E,S,L,M)$
decomposition realizing a given flow is not unique, and introduce a
gauge-invariant dissipation diagnostic that separates genuinely reversible from
dissipative dynamics independently of the learned functionals. Across three
operator backbones (1D and 2D Fourier neural operators and DeepONet) and four
PDEs spanning reversible, dissipative, and mixed regimes, GENERIC-FNO preserves
its exact structural guarantees zero-shot across a $4\times$ super-resolution
range ($64\!\to\!256$), recovers the correct ground-truth ordering of physical
dissipation, and remains competitive with strong unconstrained and
energy-penalized baselines---outperforming them on several dissipative and mixed
problems at comparable or fewer parameters.
URL: https://openreview.net/forum?id=gbpEYDdfDB
---
Title: Entropy-Based Dimension-Free Convergence and Loss-Adaptive Schedules for Diffusion Models
Abstract: Diffusion generative models synthesize samples by discretizing reverse-time dynamics driven by a learned score or denoiser. Existing convergence analyses often exhibit explicit dependence on the ambient dimension, while dimension-free guarantees typically require structural or geometric assumptions on the target distribution. We develop an information-theoretic approach to reverse-diffusion discretization that avoids such assumptions. We decompose the pathwise KL error into initialization, denoiser approximation and time-discretization terms, and express the discretization term exactly through the MMSE curve of the associated Gaussian channel. Under finite second moment and finite R\'enyi entropy of order $1/2$, we obtain a dimension-free discretization bound controlled by the R\'enyi entropy and the number of sampling steps. Motivated by the same decomposition, we propose a Loss-Adaptive Schedule (LAS), an algorithmic scheduling rule that uses training-loss information to allocate sampling steps across noise levels. Experiments show that LAS improves sampling quality over standard heuristic schedules, especially in low-step regimes.
URL: https://openreview.net/forum?id=5CNcyOBV3j
---
Title: Offline Meta-Reinforcement Learning in Piecewise Stationary Environments
Abstract: Adapting policies in piecewise stationary environments - where the underlying properties remain stable for periods but abruptly change at unknown points - remains a challenge in reinforcement learning (RL). Addressing this problem using context-based offline meta-RL, which enables generalization to new online tasks from offline data, is particularly appealing, as it avoids the risks associated with online exploration. These methods encode transition history (the context) into a task representation and condition the policy and value function to enable generalization. We show that existing approaches relying on a fixed-length context window face an inherent trade-off between rapid adaptation and inferring a stable task representation in piecewise stationary settings. We overcome this limitation by detecting task changes online from the temporal evolution of task representations and selectively retaining relevant transitions, yielding an adaptive context length. Experiments on continuous control benchmarks demonstrate that our approach enables faster adaptation and stable task identification, resulting in higher-performing policies compared to baselines.
URL: https://openreview.net/forum?id=gp1mAySr25
---
Title: Private but Biased? Exploring Fairness in Federated Recommendation
Abstract: Bias and fairness are central concerns in machine learning, particularly in recommendation systems that may reinforce gender, age, or occupation stereotypes.
In parallel, federated recommendation has emerged as a privacy-preserving alternative to centralized systems, particularly in cross-device settings where data remains on user devices.
Despite extensive studies of bias in centralized recommendation, its behavior under federated training remains largely underexplored.
Indeed, the transition to a decentralized architecture introduces additional sources of statistical skew and uneven representation across users, making the impact of federated learning on bias dynamics unclear.
Furthermore, most existing bias mitigation techniques rely on sharing sensitive user or item attributes, which conflicts with the privacy constraints inherent to federated learning.
In this work, we investigate how federated training influences the emergence of gender and user activity bias in cross-device federated recommendation systems.
We further adapt an existing bias mitigation approach to the federated setting and propose a privacy-aware framework for bias mitigation that does not require sharing sensitive attributes.
Our results show that, under certain conditions, federated training can introduce less bias than its centralized counterpart.
Across datasets and model families, we observe a consistent reduction in gender-based bias under FL settings, while popularity (activity) bias exhibits model-dependent behavior and can increase in graph-based user-expansion methods.
Moreover, we demonstrate that effective bias mitigation is feasible in federated recommendation while preserving user privacy. Our source code, datasets, and trained models are available at \url{https://anonymous.4open.science/r/Bias-and-cross-device-Federated-recommendation-50D5/}
URL: https://openreview.net/forum?id=w67jPkWASo
---
Title: Concept Mediation Enables Robust Fine-Grained Visual Understanding
Abstract: Large vision-language models exhibit strong general multimodal understanding, yet training-free prompting strategies often fail on fine-grained visual recognition, where correct predictions depend on subtle and localized visual attributes. Existing approaches such as chain-of-thought reasoning and in-context learning often produce fluent explanations or contextual cues without reliably grounding decisions in discriminative visual evidence. To address this issue, we introduce Concept-Mediated In-Context Learning (CM-ICL), a training-free prompting strategy that first extracts visual attribute concepts from the input image and then uses them as structured context for classification. Without training the model, CM-ICL provides an explicit intermediate representation that re-expresses image-derived cues for fine-grained decision making. To evaluate the extracted concepts without manual concept annotations, we combine promptable-segmentation-based perceptual grounding metrics with task-coupled diagnostics that examine how visual localizability relates to downstream prediction behavior. Experiments on six fine-grained datasets show that CM-ICL improves accuracy over training-free approaches, produces more concise and visually localizable concepts, and substantially reduces generation failures. The results demonstrate that concept mediation provides an effective and interpretable route for training-free fine-grained visual recognition.
URL: https://openreview.net/forum?id=bpJCnrg2ft
---
Title: ORCD: ODE-residual causal decomposition for heterogeneous dynamical systems
Abstract: Causal discovery in multivariate time-series data is fundamentally limited by the entanglement of slow-scale invariant physical dynamics with fast-scale instantaneous causal forcing, a regime where existing approaches either require non-Gaussian noise, an assumption that is frequently violated in environmental and health systems, or lack a mechanism to decouple the two generative processes. We propose \emph{ODE residual causal decomposition} (ORCD), a two-step framework for multi-environment time-series data. A single global SINDy ODE is fit on pooled data; the resulting structured residuals are heteroscedastic across environments by construction, as local dynamics that the parsimonious model cannot represent are forced into the residual. We prove formally that this heteroscedasticity suffices for unique DAG identifiability under Gaussian noise, bypassing the non-Gaussianity assumption that restricts LiNGAM-type methods, and that the global fit is the theoretically necessary mechanism for generating it. Causal discovery is then performed on the augmented space of original variables and whitened residuals, which serve as observed proxy covariates for environment-specific forcing. ORCD is algorithm-agnostic: across four discovery algorithms, it strictly reduces SHD in every synthetic benchmark setting (48/48), with a median reduction of~21\% and up to~43\% (Tabu) and~72\% (LiNGAM) in the most challenging settings. On real-world data linking air pollution and cardiovascular mortality across six English regions (1981--2014), ORCD substantially improves structural accuracy under non-i.i.d.\ temporal hold-out (F1 = 0.973--0.991 vs.\ 0.895--0.963) while recovering richer causal graphs that generalise spatially: subtracting the residuals from parameter learning collapses CVD60 Spearman correlation from $r = 0.62$--$0.66$ to near zero ($r \approx 0.01$--$0.04$), confirming that the residuals encode genuine causal signals that transfer across unseen geographical environments.
URL: https://openreview.net/forum?id=nLOy6s7CfO
---
Title: Implicit Dynamical Flow Fusion (IDFF) for Generative Modeling
Abstract: Conditional Flow Matching (CFM) generates high-quality samples by learning a deterministic transport from noise to data, but typically requires over a hundred network function evaluations (NFEs) per sample, especially in time-series settings. We introduce Implicit Dynamical Flow Fusion (IDFF), which augments the CFM vector field with learnable momentum terms derived from higher-order derivatives of the log-density. IDFF comes with two clearly separated theoretical guarantees. At first order, with our default Langevin-enhanced schedule, IDFF preserves the CFM marginal density exactly in continuous time. At higher orders, a single Girsanov-to-Pinsker argument bounds the endpoint deviation by a closed-form expression that depends only on the weighted score-matching loss our training objective already minimizes; consequently, the endpoint deviation vanishes as the number of NFEs grows. The practical enabler behind both regimes is a re-parameterization identity: every higher-order marginal derivative can be obtained from the learned first-order score by automatic differentiation, so no additional networks are trained. Empirically, IDFF reduces NFEs by an order of magnitude with no loss in sample quality. On CIFAR-10 it achieves an FID of 2.78 at 10 NFEs, outperforming existing CFM variants and matching methods that need over a hundred evaluations. For time-series modelling, IDFF performs strongly on molecular-dynamics simulation and sea-surface-temperature forecasting at a fraction of the compute. Overall, momentum-augmented flows offer a principled and efficient route to generative modelling across both static and dynamic domains.
URL: https://openreview.net/forum?id=RWnohJJKbq
---
Title: Wavenumber-Resolved Spectral Gating for Parameter-Efficient Generative Modeling of Two-Dimensional Turbulence
Abstract: We study unconditional generative modeling of two-dimensional Kolmogorov flow with denoising diffusion. A plain U-Net diffusion model reproduces the point-wise statistics of turbulent vorticity fields but distorts the inertial-range spectrum and the spectral fluxes that carry energy and enstrophy across scales. We add a wavenumber-resolved spectral gate: a small Fourier-domain bottleneck that learns a per-channel multiplicative correction over a coarse radial-wavenumber grid, conditioned on the diffusion noise level and a continuous log-viscosity parameter. The gate is paired with physics losses on enstrophy, modal spectral amplitude, vorticity structure functions, integral length, and spectral flux. On forced 2D Kolmogorov flow at seven viscosity regimes spanning a factor of nine in Reynolds number, and against four unconditional baselines (a plain U-Net, a squeeze-and-excite bottleneck, an FNO-block bottleneck at 2.8× the parameters, and a standalone Fourier neural operator), the gated model (WRSD, 2.51M parameters) reduces the aggregate log-spectral distance to direct numerical simulation (DNS) by 59% relative to the plain U-Net, beats the larger FNO block on that metric, and raises inverse-energy and forward-enstrophy cascade recovery from 10% and 21% to 49% and 89%. Adding an FNO branch to the gate (WRSD+FNO, 7.27M) is best overall, reaching 84% and 110% cascade recovery and the lowest flux error. A factorial ablation separates the gate, which routes spectral power and corrects the large scales and cascade transport, from the losses, which correct small-scale point statistics, and shows a strong positive interaction: the gate alone barely lowers the spectral distance and the losses alone raise it, but together they cut it by more than half. A one-parameter conformal rescaling raises empirical 90% coverage from 0.70–0.84 to 0.85–0.89. 
We report two negative results: the physics losses worsen the large-scale integral length relative to the gate alone, and no variant fully recovers the DNS cascade peaks.
URL: https://openreview.net/forum?id=MhYd4okrNa
---
Title: Merging Smarter, Generalizing Better: Enhancing Model Merging on OOD Data
Abstract: Multi-task learning (MTL) concurrently trains a model on diverse task datasets to exploit common features, thereby improving overall performance across the tasks. Recent studies have dedicated efforts to merging multiple independent model parameters into a unified model for MTL, thus circumventing the need for training data and expanding the scope of applicable scenarios of MTL. However, current approaches to model merging predominantly concentrate on enhancing performance within in-domain (ID) datasets, often overlooking their efficacy on out-of-domain (OOD) datasets. In this work, we propose LwPTV (Layer-wise Pruning Task Vector), which builds a salience score that measures the redundancy of parameters in task vectors. With this score, our method obtains a mask vector for each task and performs layer-wise pruning on the task vectors, keeping only the pre-trained model parameters at the corresponding layers of the merged model. Owing to its flexibility, our method can be seamlessly integrated with most existing model merging methods to improve their performance on OOD tasks. Extensive experiments demonstrate that the application of our method results in substantial enhancements in OOD performance while preserving the ability on ID tasks.
URL: https://openreview.net/forum?id=msiXRyQgxO
---
Title: SG-MSM: SNR-Guided Data Augmentation for Mitigating Domain Shift in Avian Bioacoustics
Abstract: Passive acoustic monitoring (PAM) for bird classification suffers from a severe domain shift between focal training data and passive deployment environments, a challenge exacerbated by the prohibitive cost of annotating target data. Extensive benchmarking has shown that standard architectures and data augmentations struggle to generalize across this focal-to-passive gap. We propose SNR-Guided Multi-Signal Mix (SG-MSM), a multi-stage data augmentation pipeline that explicitly mitigates the focal-to-passive domain shift. We evaluate SG-MSM across six diverse PAM environments using an EfficientNet-B3, achieving a mean ROC-AUC of 0.884. This outperforms the comprehensive augmentation pipeline established by the BirdSet benchmark and the Perch 1.0 foundation model (860K recordings) across all domains, despite training on orders of magnitude less data (3.6K–28K samples). SG-MSM proves highly competitive with the state-of-the-art Perch 2.0 and BirdMAE models. Component-wise ablation studies isolate the impact of each synthesis stage, demonstrating relative gains of +17.7% in ROC-AUC, +68.5% in cmAP, and +69.1% in Top-1 Accuracy over focal-only training. Source code will be made publicly available upon acceptance.
URL: https://openreview.net/forum?id=qUH4rZhABl
---
Title: Implicit Bias and Invariance: How Hopfield Networks Efficiently Learn Graph Orbits
Abstract: Many learning problems involve symmetries, and while invariance can be built into neural architectures, it can also emerge implicitly when training on group-structured data. We study this phenomenon in classical Hopfield networks and illustrate how they can infer the isomorphism class of a graph from a small, random sample. Our results reveal that: (i) graph isomorphism classes can be represented within a three-dimensional invariant subspace, (ii) using gradient descent to minimize energy flow (MEF) has an implicit bias toward norm-efficient solutions, which underpins a polynomial sample complexity bound for learning isomorphism classes, and (iii) across multiple learning rules, parameters converge toward the invariant subspace as sample sizes grow. Together, these findings highlight a unifying mechanism for generalization in Hopfield networks: a bias toward norm efficiency in learning drives the emergence of approximate invariance under group-structured data.
URL: https://openreview.net/forum?id=Fptb4mpPS2
---
Title: Collaborative Synthetic Data Generation for Knowledge Transfer in Federated Learning
Abstract: One-shot federated learning (OSFL) addresses the communication overhead of federated learning by limiting training to a single round, but doing so without sacrificing model quality is non-trivial, particularly when client data distributions diverge. Recent work has addressed this challenge by aggregating client knowledge on the server through the construction of transferable synthetic datasets or distillates. However, most of these methods lack formal privacy guarantees, leaving a gap in jointly achieving low communication, robustness to heterogeneity, and rigorous privacy. We propose FedKT-CSD (Federated Knowledge Transfer via Collaborative Synthetic Data), a framework inspired by neural image compression that closes this gap by leveraging publicly pretrained autoencoders as a shared latent space. Each client encodes its private data in a single forward pass, computes class-conditional latent statistics, and transmits these to the server. The server aggregates these statistics via secure aggregation, adds calibrated differential privacy noise, and decodes a synthetic dataset for training a global model and further downstream tasks. This design provides formal $(\varepsilon,\delta)$-differential privacy by construction, while keeping client-side computation and communication lightweight. Despite operating under privacy constraints, FedKT-CSD is competitive with and even outperforms non-private baselines across diverse datasets and heterogeneity settings, and scales to a large number of clients. Our code is available at: \url{https://github.com/an7123/FedKT-CSD}.
URL: https://openreview.net/forum?id=9H3b5MX87m
---
Title: Perplexity Can Miss SAE Feature Damage Under Quantization
Abstract: Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding. We test this directly by using a frozen SAE as a fixed measurement basis, encoding full-precision and round-to-nearest (RTN) quantized activations on identical tokens, and measuring per-feature survival by Pearson correlation across bit-widths from INT8 to INT4 on Pythia-70M and Gemma-2-2B. Our central finding is that perplexity can miss feature damage: on Gemma-2-2B, INT7 improves perplexity while degrading 18.7\% of active SAE features, and under sliding-window evaluation INT6 also improves perplexity while only 51.3\% of active features survive. Feature survival is graded rather than cliff-like, with 62.4\% of active Pythia features and 51.3\% of active Gemma features surviving at INT6; most non-surviving features are blurred rather than fully damaged. Survival is also predictable from full-precision feature statistics alone, with cross-validated AUC 0.92--0.97 and peak activation as the strongest marginal predictor. Finally, RTN quantization and matched-perplexity magnitude pruning damage strongly overlapping feature sets, with Jaccard overlap 0.79--0.86 and damage-score Spearman correlation 0.98. These results show that behavioral metrics alone are insufficient evidence that full-precision interpretability findings transfer to quantized models, motivating feature-level audits of compression.
URL: https://openreview.net/forum?id=mT6TyCcN53
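The per-feature survival measurement described in this abstract can be illustrated with a minimal sketch. It assumes activations are already available as one list of values per SAE feature over identical tokens. The function names, the survival threshold of 0.9, and the definition of an "active" feature (non-zero on at least one token in full precision) are hypothetical choices for illustration, not values taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def survival_fraction(fp_acts, q_acts, threshold=0.9):
    """Fraction of active features whose quantized activations track full precision.

    fp_acts[i] and q_acts[i] hold feature i's activations on the same tokens.
    The threshold and the notion of "active" are assumptions of this sketch.
    """
    active = [i for i, acts in enumerate(fp_acts) if any(a != 0.0 for a in acts)]
    if not active:
        return 0.0
    survived = [i for i in active if pearson(fp_acts[i], q_acts[i]) >= threshold]
    return len(survived) / len(active)


# Toy data: feature 0 is preserved, feature 1 is scrambled, feature 2 never fires.
fp = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
q = [[0.1, 1.1, 1.9, 3.2], [0.0, 1.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]]
frac = survival_fraction(fp, q)
print(frac)
```

Because the SAE is frozen and the tokens are identical, any drop in correlation in such a comparison is attributable to the change in the model's activations rather than to a change of measurement basis.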
---
Title: A Reproduction Study of Weight-Based Mechanistic Interpretability in Bilinear MLPs
Abstract: Mechanistic interpretability typically relies on post-hoc analysis of model activations, but bilinear MLPs offer an alternative: architectures whose weights are directly interpretable through eigendecomposition of interaction tensors.
We reproduce both main experiments from Pearce et al. (2025): their Section 4 (Vision) on MNIST/Fashion-MNIST, and their Section 5 (Language) discovering sentiment negation circuits via Sparse Autoencoder analysis.
Vision results reproduce cleanly: weight decay reduces effective rank from 38.5 to 15.5 while maintaining 97--98% accuracy, and our ablation shows that weight decay $-$ not noise augmentation $-$ is the primary driver of low-rank structure.
In language, we confirm the AND-gate negation circuit (two semantically contrasting negation features, cosine similarity $-0.16$), but do \emph{not} fully reproduce the low-rank interaction claim: the fraction of features achieving $>$0.75 rank-2 correlation varies from 32\% (ts-medium) to 65\% (fw-small); only fw-small meets this threshold.
We provide threshold sensitivity analysis and trace the gap to SAE training duration (correlation improves 2.6$\times$ over five checkpoints) and model compute (tokens/parameter); the fw-medium configuration required 8$\times$ rather than 4$\times$ expansion SAEs, making exact reproduction impossible $-$ language results constitute constrained replication under publicly available artifacts.
In extensions, regularized bilinear MLPs transfer structurally across digit and letter datasets: MNIST-trained models classify geometrically similar EMNIST letters (O$\to$0, I$\to$1, Z$\to$2, S$\to$5) at $87-100\%$ accuracy.
We propose Quadratic Form Similarity, which separates similar from dissimilar digit-letter pairs (QFS $0.40$ vs. $-0.06$, $p<10^{-4}$) where cosine similarity fails (0.358 vs. 0.339).
Finally, we explore CP-decomposition as an architectural constraint, achieving 93.8% accuracy with effective rank 17.5 at $\sim$30$\times$ faster training, with CP factors that appear qualitatively more localized than dense eigenvectors $-$ though interpretability gains remain preliminary.
URL: https://openreview.net/forum?id=6k7qRdz7rD
---
Title: Revisiting fairGNN-WOD: A Reproduction and Analysis of Fair Graph Learning Without Demographics
Abstract: Graph Neural Networks are widely used for high-impact prediction tasks where fairness is important. fairGNN-WOD (Wang et al., 2025b) proposes a two-stage framework to make fair predictions without relying on sensitive demographic information, which is often unavailable. In this study, we implement fairGNN-WOD from scratch due to the lack of a publicly available code base. Ultimately, while we reproduce utility results, we fail to reproduce the reported fairness improvements, which are the main contribution of the original paper, because baselines are substantially more fair than originally reported. Additionally, we find no measurable contribution of stage 1 of the original framework, which is a key architectural component of fairGNN-WOD.
URL: https://openreview.net/forum?id=tdMfKP8E5i
---
Title: Gradient-based Sample Selection for Faster Bayesian Optimization
Abstract: Bayesian optimization (BO) is an effective technique for black-box optimization. However, its applicability is typically limited to moderate-budget problems due to the cubic complexity of fitting the Gaussian process (GP) surrogate model. In large-budget scenarios, directly employing the standard GP model faces significant challenges in computational time and resource requirements. In this paper, we propose Gradient-based Sample Selection Bayesian Optimization (GSSBO), a subset-maintenance approach designed to enhance the computational efficiency of BO. Here, ``gradient-based'' refers to response-gradient sensitivity embeddings induced by the GP log marginal likelihood, not derivative observations of the objective function. The GP model is constructed on a selected set of samples instead of the whole dataset. These samples are selected using GP-specific sensitivity embeddings and a cosine-diversity rule to encourage embedding-space diversity and reduce redundancy in the retained subset. We provide a conditional theoretical analysis of the subset-fitted GP surrogate induced by a fixed retained subset and derive posterior-approximation bounds in terms of subset-dependent residual quantities. Experiments on synthetic and real-world tasks show empirical reductions in GP-fitting cost while achieving competitive optimization performance in the reported settings.
URL: https://openreview.net/forum?id=Ysr1zUeuxz
---
Title: The Price of Justice in Machine Learning: Fair Division with Subjective Value under Bounded Rationality
Abstract: Statistical fairness criteria constrain observable rates, but do not determine how error burdens are allocated. This paper studies fairness as a conditional harm-allocation problem: given an evaluation target and an adopted false-positive/false-negative cost representation, a policy induces cost-weighted error burdens. We introduce fair-division diagnostics for these allocative bads, characterize when statistical criteria can serve as burden surrogates, and show how heterogeneous adopted costs and restricted policy classes create surrogate failures and approximation floors. Experiments on benchmark prediction tasks provide controlled stress tests showing that statistical improvements can diverge from harm-allocation diagnostics.
URL: https://openreview.net/forum?id=GdnJrYDied
---
Title: Retry Policy Gradients in Continuous Action Spaces
Abstract: Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor--critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.
URL: https://openreview.net/forum?id=qmCljotj08
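To make the retry objective concrete, the following minimal sketch estimates max@K from n sampled returns by averaging the best return over every size-K subset, computed in closed form from the sorted returns. This is a standard combinatorial estimator shown for illustration only; it is not the pathwise derivative estimator introduced in the paper, and the function name `max_at_k` is hypothetical.

```python
from math import comb


def max_at_k(returns, k):
    """Average of the maximum return over all size-k subsets of the samples.

    After sorting in ascending order, the element at 0-based index i is the
    maximum of exactly comb(i, k - 1) subsets, so no enumeration is needed.
    """
    n = len(returns)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(returns)")
    ordered = sorted(returns)
    weighted = sum(r * comb(i, k - 1) for i, r in enumerate(ordered))
    return weighted / comb(n, k)


samples = [1.0, 2.0, 3.0]
print(max_at_k(samples, 1))  # k = 1 reduces to the ordinary mean return
print(max_at_k(samples, 2))
print(max_at_k(samples, 3))  # k = n reduces to the best sampled return
```

The two limiting cases show why such objectives reward spread in the return distribution: as K grows, the estimate moves from the mean toward the best sample.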
---
Title: DP-SGD with weight clipping
Abstract: Recently, due to the popularity of deep neural networks and other methods whose training typically relies on the optimization of an objective function, and due to concerns for data privacy, there is a lot of interest in differentially private gradient descent methods. To achieve differential privacy guarantees with a minimum amount of noise, it is important to be able to bound precisely the sensitivity of the information which the participants will observe. In this study, we present a novel approach that mitigates the bias arising from traditional gradient clipping. By leveraging a public upper bound of the Lipschitz value of the current model, we can achieve refined noise level adjustments. We present a new algorithm with improved differential privacy guarantees and a systematic empirical evaluation, showing that our new approach outperforms existing approaches also in practice.
URL: https://openreview.net/forum?id=j46Eroqz6B
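For context, the traditional gradient-clipping step that this abstract contrasts with can be sketched as follows. This is the standard per-example clipping and Gaussian noise recipe of DP-SGD, not the weight-clipping algorithm proposed in the paper. The function names and the `noise_multiplier` parameter are hypothetical, and the sketch does no privacy accounting.

```python
import math
import random


def clip(grad, clip_norm):
    """Rescale a gradient so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Clipping bounds each example's contribution to the sum by clip_norm, so the
    noise standard deviation is set proportional to that bound.
    """
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # norms 5.0 and 0.5
print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0)))
```

In this toy batch only the first gradient is rescaled, which changes the direction of the averaged update relative to the unclipped mean. That distortion is the clipping bias the abstract refers to.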
---
Title: SPACR: Single-Pass Adaptive Training of Uncertainty-Aware Conformal Regressors
Abstract: Conformal Prediction (CP) provides robust uncertainty guarantees for predictive models, but is typically applied post hoc, which misaligns model training with the conformal goal of producing efficient (i.e., narrow) intervals. We propose SPACR (Single-Pass Adaptive Conformal Regressor), a novel method for directly training uncertainty-aware regressors within a differentiable loss. SPACR jointly optimizes efficiency and validity without batch-splitting or predefined confidence levels during training. As a result, a single SPACR model yields valid prediction intervals at multiple confidence levels during inference, avoiding the costly retraining required by methods like DOICR. Experiments on diverse datasets show that SPACR consistently gives tighter intervals and better coverage-efficiency trade-offs compared to standard CP and DOICR, while significantly reducing computational costs.
URL: https://openreview.net/forum?id=dScMsfkwoC
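The post hoc procedure that this abstract calls standard CP can be sketched as split conformal regression: absolute residuals on a held-out calibration set give a single radius that is added around each point prediction. This is a minimal illustration of the baseline, not SPACR itself, and the function names are hypothetical.

```python
import math


def conformal_radius(cal_residuals, alpha):
    """Split-conformal radius from absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual, the usual
    finite-sample correction. If that rank exceeds n, the interval is unbounded.
    """
    n = len(cal_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_residuals)[rank - 1]


def interval(prediction, radius):
    """Symmetric prediction interval around a point prediction."""
    return (prediction - radius, prediction + radius)


residuals = [5.0, 1.0, 3.0, 2.0, 4.0, 9.0, 7.0, 8.0, 6.0]
radius = conformal_radius(residuals, alpha=0.5)
print(interval(10.0, radius))
```

Every test point receives the same width, fixed after training. The inefficiency this causes is what motivates methods that optimize interval width during training.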
---
Title: Goal-Oriented Sequential Bayesian Experimental Design for Causal Learning
Abstract: We present GO-CBED, a goal-oriented Bayesian framework for sequential causal experimental design. Unlike conventional approaches that select interventions aimed at inferring the full causal model, GO-CBED directly maximizes the expected information gain (EIG) on user-specified causal quantities of interest, enabling more targeted and efficient experimentation. The framework is both non-myopic, optimizing over entire intervention sequences, and goal-oriented, targeting only model aspects relevant to the causal query. To address the intractability of exact EIG computation, we introduce a variational lower bound estimator, optimized jointly through a transformer-based policy network and normalizing flow-based variational posteriors. The resulting policy enables real-time decision-making via an amortized network. We demonstrate that GO-CBED consistently outperforms existing baselines across various causal reasoning and discovery tasks—including synthetic structural causal models and semi-synthetic gene regulatory networks—particularly in settings with limited experimental budgets and complex causal mechanisms. Our results highlight the benefits of aligning experimental design objectives with specific research goals and of forward-looking sequential planning.
URL: https://openreview.net/forum?id=6dVCZT7pO2
---
Title: OpInf-LLM: Parametric PDE Solving with LLMs via Operator Inference
Abstract: Solving diverse partial differential equations (PDEs) is fundamental in science and engineering. Large language models (LLMs) have demonstrated strong capabilities in code generation, symbolic reasoning, and tool use, but reliably solving PDEs across heterogeneous settings remains challenging. Prior work on LLM-based code generation and transformer-based foundation models for PDE learning has shown promising advances. However, a persistent trade-off between execution success rate and numerical accuracy arises, particularly when generalization to unseen parameters and boundary conditions is required. In this work, we propose OpInf-LLM, an LLM-based framework for solving parametric PDEs via operator inference. The proposed framework leverages small amounts of solution data to enable accurate prediction of diverse PDE instances, including unseen parameters and configurations, and provides seamless integration with LLMs for natural language task specification and physics-based reasoning about appropriate feature parameterization. Its low computational demands and unified solution pipeline further enable a high execution success rate across heterogeneous settings, opening new possibilities for generalizable reduced-order modeling in LLM-based PDE solving.
URL: https://openreview.net/forum?id=rCTRgstMr1
---
Title: Judges Hallucinate, Embeddings Don’t: Retrieval-Augmented Latent Regression for Dialogue Optimization
Abstract: Aligning dialogue systems with human quality standards typically relies on either large-scale reward models or Large Language Model (LLM)-based critics. Reward models demand substantial annotation effort and tend toward reward hacking. LLM-based critics have a different problem: they generate scores through autoregressive language modelling, a process susceptible to self-preference bias and confident misevaluation of mediocre outputs. We introduce JADE (Judge-free Alignment via Data Embeddings), a framework that avoids both failure modes by replacing generative evaluation with retrieval-augmented regression. Given a query–response pair, JADE retrieves human-rated examples from a rating-partitioned embedding index and scores the response through a compact Multi-Layer Perceptron (MLP) trained by regression on the embedding manifold. No language model produces the score, so the reward is free of sampling variance and the kind of confident miscalibration that characterises autoregressive judges. On the optimisation side, JADE pairs this scorer with Group Relative Policy Optimization (GRPO), curriculum scheduling that transitions from imitation to quality optimisation, and a contrastive penalty for low-quality examples. Evaluated on Recipe4U and HelpSteer2 with a 0.5B-parameter actor, JADE lifts 1-star responses by up to +0.77 Optimised Response Latent Inference (ORLI) points while changing 5-star responses by at most +0.12. Frontier LLM judges can match or exceed these low-tier gains, but only at the cost of systematically degrading high-rated outputs by 1–2 points on average. Ablation experiments identify curriculum scheduling as the largest single contributor; contrastive regularisation specifically benefits the lowest-quality tail.
URL: https://openreview.net/forum?id=aD6GYaBU0n
---
Title: Aletheia: What Makes RLVR For Code Verifiers Tick?
Abstract: Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind that of execution feedback due to the prohibitive costs of the full RLVR pipeline. In this work, we ablate three primary choices along the performance-cost trade-off in RLVR: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce **Aletheia**, a controlled, execution-grounded testbed to facilitate a contamination-free analysis of code verifier training recipes across disparate model sizes and covariate shifts across two common verifier application scenarios. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas the thinking budget becomes the most vital factor at larger scales. While leveraging negative samples has a consistent impact on top-1 selection accuracy across sizes, their contribution to ranking reconstruction increases monotonically with scale and plays a key role in stabilizing training at large sizes. Our Pareto optimality analysis demonstrates that eliminating on-policy training at larger model scales yields a verifier that performs comparably to the full RLVR recipe. Furthermore, we find that eschewing thinking traces serves as a compute-efficient strategy at lower budgets, offering a strong trade-off between training cost and verifier accuracy. Ultimately, our work provides the empirical foundation necessary to efficiently deploy robust code verifiers, thereby enabling their wider adoption in post-training pipelines for large code generation models.
URL: https://openreview.net/forum?id=3rVrBGp0mr
---
Title: ‘12-Angry LLMs’ - Divergence from Deliberation as Signal for Complex Stance Detection
Abstract: This study introduces ``12-Angry LLMs,'' a novel annotation and classification model that leverages annotator disagreement to improve complex stance detection. Departing from traditional methods that average out divergence, we deploy a diverse panel of 12 LLMs that engage in a two-stage process: independent voting (Round A) followed by collective deliberation (Round B) when disagreement occurs. We demonstrate that the rationales generated during deliberation serve as critical signals for fine-tuning the Judge model. On the RUStance-2023 dataset, this Judge model achieves performance (F1 $\approx$ 0.81) compared with single-model baselines and standard aggregations. The approach also proves highly transferable, achieving an F1 score of 0.94 on the out-of-domain PStance dataset using few-shot prompting with jury rationale. We contribute a new dataset containing expert labels alongside full jury deliberation traces, establishing a paradigm in which model divergence is utilized as a diagnostic tool for uncertainty and interpretability rather than noise.
URL: https://openreview.net/forum?id=I4AgbIlHtw
---
Title: Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making
Abstract: Reliable models should not only predict correctly, but also base their decisions on acceptable evidence. However, conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy by exploiting shortcut correlations rather than intended decision evidence. Human priors, such as bounding boxes or target interface elements, can help constrain such behavior, but aligning model evidence with these priors remains challenging because learned decision evidence often diverges from human perception. In this work, we study attribution-guided human-prior alignment with subset-selection-based attribution. Motivated by prior deletion and insertion evaluations showing that subset-selection attribution can identify compact decision-supporting regions, we use it as a training-time signal to expose the model’s decision evidence. When the top-attributed evidence deviates substantially from the prior region, we penalize off-prior reliance and encourage the model to shift its evidence toward the intended regions. This yields a selective prior-constrained objective that avoids uniformly suppressing all non-prior regions. We validate our method on both image classification and click decision tasks in MLLM-based GUI agents. Across discriminative classification and autoregressive decision-making settings, our method improves task accuracy while enhancing attribution reasonability.
URL: https://openreview.net/forum?id=2katjvBbIC
---
Title: Reference-Guided Machine Unlearning
Abstract: Machine unlearning aims to remove the influence of specific training data from a model while preserving its general utility. In vision, many approximate unlearning methods pursue this goal through degradation-based heuristics, such as loss maximization or random labeling. Yet making a model worse on forget samples is not the same as making it behave as if those examples had never been seen: these signals can be poorly conditioned, destabilize optimization, and harm generalization. We argue that approximate unlearning should instead prioritize distributional indistinguishability, aligning the model's predictive behavior on forget data with that on truly unseen data. Motivated by this principle, we propose Reference-Guided Unlearning (ReGUn), a vision unlearning framework that uses disjoint held-out data to construct a principled, class-conditioned reference distribution for distillation. Rather than explicitly degrading predictions on forget examples, ReGUn guides them toward non-member behavior through held-out supervision. Across multiple architectures, natural image datasets, and forget fractions, ReGUn consistently improves the forgetting-utility trade-off over standard approximate baselines while closely matching retrain-like membership inference behavior. As one instantiation of this principle, the results suggest that simple objectives designed around indistinguishability can provide a stronger and more stable alternative to complex degradation-based unlearning procedures.
URL: https://openreview.net/forum?id=Vuo68BTxEc
---
Title: Variational Set Operator Networks: Uncertainty-Aware Meta-Learning via Probabilistic Neural Operators
Abstract: We introduce a probabilistic neural operator framework for learning conditional distributions over functions from sample observations. The proposed model, the Variational Set Operator Network (VSON), extends set-based operator learners by incorporating an amortised latent representation of the branch outputs that induces predictive distributions over function values conditioned on arbitrary sets of input–output pairs. Uncertainty is represented through a learned variational latent structure implemented with expressive normalising flows, allowing the model to capture non-Gaussian behaviour. The resulting operator produces smooth function predictions and coherent joint samples over target sets. VSON improves predictive uncertainty calibration across benchmarks while remaining competitive on accuracy, outperforming the baselines on the regression and optimisation tasks.
URL: https://openreview.net/forum?id=rLgrHnZtzh
---
Title: A Survey of the OpenClaw Ecosystem: From Platform Extensibility to Constraint Design
Abstract: Large language models have evolved into autonomous agents capable of invoking tools, using memory, and taking actions in real-world environments. Yet despite this progress, many agent systems remain difficult for ordinary users to adopt directly. OpenClaw, an open-source and self-hosted agent platform, addresses this gap through a local-first, messaging-native, and skill-extensible design that connects everyday messaging applications to LLM-powered agents. This design makes OpenClaw one of the first open personal-agent ecosystems, with ClawHub for shared Skills, Heartbeat for proactive background turns, and Moltbook as an agent-only social network. We survey this emerging ecosystem and show that the literature repeatedly highlights the same tradeoff: the openness that makes OpenClaw extensible also creates new governance, security, social, and deployment challenges. We organize the survey around four dimensions that trace this tradeoff from platform design to its ecosystem-level consequences: \emph{Platform}, where open Skills enable rapid capability growth but create new governance problems; \emph{Security}, where open Tools, Skills, Memory, and background execution expand the attack surface; \emph{Societies}, where Moltbook reveals a gap between social appearance and reliable collective intelligence; and \emph{Deployment}, where trustworthy use in robotics, healthcare, and scientific research depends on limiting agent freedom rather than expanding it. We also organize OpenClaw benchmarks into a lifecycle view of open-agent evaluation and outline future directions for treating constraints as core parts of open agent platform design. Companion repository: https://anonymous.4open.science/r/Awesome-OpenClaw-Papers/
URL: https://openreview.net/forum?id=2jyFogNCx7
---
Title: AQUA: Revisiting Attention Approximation through Spectral Alignment and Query-Adaptive Feature Selection
Abstract: As Large Language Models (LLMs) scale to longer contexts, the high dimensionality of the attention mechanism imposes increasing computational and representational overhead. Prior works have explored approximating attention scores to enable efficient token selection, typically using orthogonal spectral transformations or magnitude-based heuristics, followed by recomputation of exact attention on a reduced set of tokens. In this work, we take a complementary perspective and investigate whether such approximations can be used directly for attention computation. We introduce AQUA (Attention via Query-Adaptive Spectral SUbspace Approximation), a framework that combines offline spectral projection with online, query-dependent magnitude-based dimension selection, drawing on complementary insights from prior approximation approaches. By rotating representations into a calibrated basis and dynamically selecting salient dimensions, AQUA approximates attention scores using only a dynamically selected subset of rotated features. Our analysis shows that attention representations exhibit substantial spectral redundancy: retaining only 75% of dimensions preserves near-baseline accuracy across a wide range of benchmarks on models such as Llama-3.1-8B. These results suggest that attention representations exhibit substantial compressibility under calibrated spectral projections, and that accurate attention scores can be recovered from a reduced subspace through a combination of spectral alignment and dynamic feature selection.
URL: https://openreview.net/forum?id=Hqm2y0Snga
---
Title: T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining
Abstract: Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a key representational challenge in thermal imaging where global scene context and object-level heat signatures conflict when learned together in a single embedding space. To address these, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset providing complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines across three thermal benchmarks in cross-modal retrieval, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.
URL: https://openreview.net/forum?id=vlzqdDvDcu
---
Title: Mixture of Complementary Agents for Robust LLM Ensemble
Abstract: Multi-AI collaboration\textemdash such as ensembling or debating large language models (LLMs)\textemdash is a promising paradigm for aggregating information and boosting performance. A foundational step in these pipelines is to feed the responses of several \emph{proposer} LLMs into a \emph{summarizer} LLM, which synthesizes a better answer. However, choosing which proposers to include is non-trivial. Existing approaches primarily focus either on accuracy (picking the strongest models) or diversity (ensuring variety), and often overlook the interactions among proposers and with the summarizer.
We reframe proposer selection as a combinatorial selection problem akin to feature selection, where the value of an LLM lies in its \emph{complementarity} with others. However, directly applying standard feature-selection algorithms is impractical in the LLM setting due to prohibitive time complexity. Motivated by this limitation, we explore an extensive range of computationally feasible, greedy-style selection algorithms that assess complementarity using a small labeled set. Our experiments validate complementarity as a guiding principle for proposer selection and identify methods that achieve the best performance–cost trade-offs in practice.
URL: https://openreview.net/forum?id=AokFBRcCBQ
---
Title: AnyDepth: Efficient Zero-Shot Depth Estimation with Simple Decoding and Data-Quality Filtering
Abstract: Recent monocular depth estimation systems increasingly benefit from strong pretrained visual encoders and large-scale training data. However, once high-quality dense features are available, two remaining sources of cost become especially important: the complexity of Dense Prediction Transformer (DPT)-style multi-branch decoders and the quality of the depth samples used for training. We present AnyDepth, an efficient framework for zero-shot monocular depth estimation that studies these two design choices under controlled settings. AnyDepth uses a frozen DINOv3 encoder and replaces the conventional reassemble-then-fuse DPT head with a Simple Depth Transformer (SDT), a single-path decoder that fuses projected multi-layer tokens before spatial reconstruction. SDT further combines lightweight local refinement with learnable progressive upsampling to improve detail preservation without introducing multi-branch feature alignment. In parallel, we introduce two depth-specific sample quality scores, based on depth distribution and gradient continuity, to filter low-quality training samples before optimization. Across standard indoor, outdoor, synthetic, and robot-scene benchmarks, SDT improves the efficiency-accuracy trade-off relative to DPT under matched encoder and training settings, reducing decoder parameters by 86.6\%--89.2\% while lowering computational cost and edge-device latency. The filtering strategy reduces the merged training set from 584K to 369K samples and preserves or improves several metrics under controlled comparisons. These results suggest that, in the era of strong frozen visual encoders, decoder simplicity and data quality remain practical control points for reproducible and deployable zero-shot depth estimation.
URL: https://openreview.net/forum?id=VNyVDClGhJ
---
Title: Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations
Abstract: While LLMs appear robustly safety-aligned in English, we uncover a catastrophic, overlooked weakness: attributional collapse under code-mixed perturbations. Our systematic evaluation of open models shows that the linguistic camouflage of code-mixing, blending languages within a single conversation, can cause safety guardrails to fail dramatically. Attack success rates (ASR) spike from a benign ~9% in monolingual English to ~69% under code-mixed inputs, with rates exceeding 90% in non-Western contexts such as Arabic and Hindi. These effects hold not only on controlled synthetic datasets but also on real-world social media traces, revealing a serious risk for billions of users. To explain why this happens, we introduce saliency drift attribution (SDA), an interpretability framework that shows how, under code-mixing, the model’s internal attention drifts away from safety-critical tokens (e.g., "violence" or "corruption"), effectively blinding it to harmful intent.
URL: https://openreview.net/forum?id=5q6S8E7Brx
---
Title: DriveE2E: An Infrastructure-Grounded Ego-Closed-Loop Replay Benchmark for End-to-End Autonomous Driving
Abstract: Closed-loop evaluation is important for end-to-end autonomous driving, but existing CARLA-based benchmarks often rely on manually designed scenarios whose traffic patterns may differ from real-world urban driving. We present DriveE2E, an infrastructure-grounded ego-closed-loop replay benchmark for evaluating end-to-end autonomous driving models in reconstructed real-world intersection scenarios. DriveE2E uses high-mounted infrastructure sensors to extract traffic trajectories from 100 hours of urban intersection data, constructs CARLA-compatible digital twins for 15 real intersections, and imports 800 curated traffic scenarios into simulation. In DriveE2E, the tested model controls the ego vehicle and receives simulation-generated observations from its current simulated state, while non-ego agents replay trajectories extracted from real-world traffic. This protocol does not model fully reactive multi-agent behavior; instead, it provides a reproducible intermediate regime between open-loop log replay and fully reactive simulation. We instantiate the benchmark with representative E2EAD baselines and analyze their open-loop and ego-closed-loop replay performance across behavior categories. The results suggest that DriveE2E can expose differences between open-loop trajectory accuracy and policy behavior under ego feedback in dense intersection scenarios. Code is included in the supplementary material, and will be publicly released upon acceptance.
URL: https://openreview.net/forum?id=9G8Y6jW6zC
---
Title: Evolutionary Context Search for Skill Acquisition
Abstract: Large Language Models cannot reliably acquire new knowledge post-deployment—even when relevant text resources exist, models fail to transform them into actionable knowledge without retraining. Retrieval-Augmented Generation attempts to bridge this gap by surfacing relevant documents at inference time, yet similarity-based retrieval often fails to identify context that actually improves task performance. We introduce Evolutionary Context Search (ECS), an evolutionary method that searches context combinations using accuracy on a small development set, requiring only inference calls without weight updates. ECS moves beyond semantic similarity to discover non-obvious context pairings that significantly boost performance. Our empirical results show that ECS improves BackendBench by 27% and τ2-bench airline by 5%. The evolved contexts are model-agnostic, as those evolved with Gemini-3-Flash transfer effectively to Claude-4.5-Sonnet and DeepSeek-V3.2. This suggests that ECS opens a path toward automated context discovery for skill acquisition—an efficient alternative to manual prompt engineering or costly fine-tuning.
URL: https://openreview.net/forum?id=nedfqQFmbH
---
Title: Towards a Practical Understanding of Lagrangian Methods in Safe Reinforcement Learning
Abstract: Safe reinforcement learning addresses constrained optimization problems where maximizing performance must be balanced against safety constraints, and Lagrangian methods are a widely used approach for this purpose. However, the effectiveness of Lagrangian methods depends crucially on the choice of the Lagrange multiplier λ, which governs the trade-off between return and cost. A common approach is to update the multiplier automatically during training. Although this approach is standard in practice, there remains limited evidence on the variance in practical performance introduced by the choice of λ, or on how the over- or undershooting of the cost limit, frequently exhibited by automated multiplier updates, affects the return. Therefore, we study (i) the practical variance exhibited by λ for a range of widely studied safety tasks, and show that Lagrange multiplier update methods are sensitive to the choice of cost limit within the same task. We present empirical Pareto frontiers that offer a complete visualization of the return-cost trade-off in the underlying optimization problem. Our results reveal the highly sensitive nature of λ and further show that the performance of λ-update mechanisms does not generalize across cost limits within the same task, meaning that evaluation at a single cost limit risks biased conclusions. We therefore urge the safe RL community to adopt testing algorithms across multiple cost limits as standard practice, and provide (ii) recommendations for benchmarking in the form of a recommended set of cost limits for each evaluated task, and offer an open-source code base: https://github.com/anonymous.
URL: https://openreview.net/forum?id=LHZNCWmIEA
---
Title: Efficient Image Restoration with State-Dependent Forward Diffusion
Abstract: This paper proposes to perform image restoration through a state-dependent mean-reverting forward diffusion (FoD) process. In contrast to traditional diffusion-based approaches that rely on a coupled forward-backward diffusion scheme, FoD directly learns data generation through a single forward diffusion process, yielding a simple yet efficient generative framework. The core of FoD is a state-dependent stochastic differential equation (SDE) that involves a mean-reverting term in both the drift and diffusion functions. This mean-reverting structure guarantees the convergence to fixed clean points, simulating a stochastic interpolation between source and target distributions. More importantly, FoD is analytically tractable and is trained using a simple stochastic flow matching objective, enabling few-step sampling during inference. The proposed FoD model, despite its simplicity, achieves superior performance on various image restoration tasks compared to representative diffusion, diffusion bridge, and flow matching approaches.
URL: https://openreview.net/forum?id=Eq9k6Va3hY
---
Title: Aligning Diffusion Language Models via Unpaired Preference Optimization
Abstract: Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an Evidence Lower Bound (ELBO) surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman--Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields 65.9% and 62.3% adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs. Code is available in the supplementary material and will be released publicly upon publication.
URL: https://openreview.net/forum?id=fvrsQLEjPJ
---
Title: TraCER: Offline Reinforcement Learning through Trajectory Clustering and Exclusive Regularisation
Abstract: In this paper, we propose Offline Reinforcement Learning through Trajectory Clustering and Exclusive Regularisation (TraCER), a value regularisation framework that accounts for out-of-distribution (OOD) actions. Unlike most existing methods, which avoid direct reasoning about OOD regions due to their inherent difficulty, TraCER traces and delineates OOD regions in the action space, potentially non-convex, using a trajectory clustering-based behaviour cloning algorithm. This approach assumes that each trajectory in the offline dataset was rolled out by a single behaviour policy, an assumption commonly satisfied in practice when datasets are collected from distinct sources or agents. Conditioned on this delineation, we introduce a Bellman-type operator that constrains value estimates for OOD actions to a tight lower bound while leaving in-distribution action-value estimates unchanged. The resulting value function supports standard policy extraction procedures. Experiments on multiple offline RL benchmarks demonstrate that TraCER consistently outperforms existing approaches.
URL: https://openreview.net/forum?id=SGkQ5GiKDZ
---
Title: InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU
Abstract: In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce *InfiniteHiP*, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of up to 3 million tokens on a single L40s 48GB GPU -- 3x larger -- without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.
URL: https://openreview.net/forum?id=TMI8Q3eIVU
---
Title: Retrieval-augmented code generation: A survey with focus on repository-level approaches
Abstract: Recent advances in large language models (LLMs) have significantly improved automated code generation. While existing approaches have achieved strong performance at the function and file levels, real-world software engineering requires reasoning over entire repositories, including cross-file dependencies, evolving execution environments, and global semantic consistency. This challenge has led to the emergence of Repository-Level Code Generation (RLCG), where models must retrieve, organize, and utilize repository-scale context to generate coherent and executable code changes. To address these challenges, Retrieval-Augmented Generation (RAG) has become an increasingly important paradigm for repository-level code intelligence. In this survey, we present a comprehensive review of Retrieval-Augmented Code Generation (RACG), with a particular focus on repository-level approaches. Rather than viewing RACG as a static "retrieve-then-generate" pipeline, we characterize it as a coupled and evolving process involving context construction, retrieval optimization, generation, and environment interaction. We organize existing methods through a unified analytical framework spanning retrieval substrate, control regime, and evaluation setting. Based on this framework, we systematically examine retrieval strategies, graph-based and non-graph-based retrieval paradigms, training-driven optimizations, and autonomous agent architectures. We further summarize widely used datasets, benchmarks, and system configurations, and discuss key challenges including scalability, reliability, efficiency, and the necessity boundary between RACG and long-context LLMs. Through this survey, we aim to provide a structured understanding of the rapidly evolving RACG landscape and highlight promising directions for future AI-powered software engineering research.
URL: https://openreview.net/forum?id=f688AcQuQg
---
Title: Text Rationalization for Robust Causal Effect Estimation
Abstract: Recent advances in natural language processing have enabled the increasing use of text data in causal inference, particularly for adjusting confounding factors in treatment effect estimation. Although high-dimensional text can encode rich contextual information, it also poses unique challenges for causal identification and estimation. In particular, the positivity assumption, which requires sufficient treatment overlap across confounder values, is often violated at the observational level, when massive text is represented in feature spaces. Redundant or spurious textual features inflate dimensionality, producing extreme propensity scores, unstable weights, and inflated variance in effect estimates. We address these challenges with Confounding-Aware Token Rationalization (CATR), a framework that selects a sparse necessary subset of tokens using a residual-independence diagnostic designed to preserve confounding information for unconfoundedness. By discarding irrelevant texts while retaining key signals, CATR mitigates observational-level positivity violations and stabilizes downstream causal effect estimators. Experiments on semi-synthetic data and a real-world study using the MIMIC-III database demonstrate that CATR yields more accurate, stable, and interpretable causal effect estimates than existing baselines.
URL: https://openreview.net/forum?id=m8Vtjiu7SL
---
Title: Evolutionary Self-Supervised Contradiction Detection for Biomedical NLI
Abstract: Identifying conflicting claims in biomedical literature is critical for advancing scientific understanding, yet the scarcity of high-quality training data remains a significant challenge. We introduce EvoNLI, an evolutionary algorithm that learns how to transform entailing sentence pairs into challenging contradictions by mutating words until a frozen teacher model confidently flips its prediction, while preserving topical coherence. EvoNLI, applied to PubMed randomized controlled trials (RCTs), generates SciCon, a dataset of premise–hypothesis pairs whose labels achieve 94.4\% agreement across expert judgments in an audit by five domain experts. Fine-tuning large language models on SciCon improves contradiction ROC-AUC consistently across eight biomedical NLI benchmarks. EvoNLI and SciCon are publicly available to support evidence synthesis and robust biomedical natural language inference, and to advance robust domain-specific contradiction detection.
URL: https://openreview.net/forum?id=WVsNnpVse8
---
Title: Detection without Expression: A Geometric perspective of Language Model Hallucination
Abstract: Language models often respond fluently and confidently to questions for which the appropriate response would be to abstain. We study cases where the prompt is underspecified, has a false premise, or is outside the model's reliable knowledge. Such errors are usually treated as failures of factual access. We argue that they also reflect a failure of routing. A model may internally represent that an input should not be answered while failing to transform that representation into output behavior. Cross-entropy training creates prediction-aligned directions through which token commitments are expressed, because each example supplies a sharp gradient toward a vocabulary target. Answerability, however, is not given an equally stable target unless the training distribution explicitly rewards abstention. It can therefore be encoded as an input-aligned feature of the residual stream without becoming a prediction-aligned control variable. In this view, hallucination can be understood as a mismatch between the geometry that detects uncertainty and the geometry that expresses decisions. Across autoregressive transformer families, we find that factual and uncertain prompts are strongly separated in hidden states, while standard output-side uncertainty measures expose only a weak trace of this distinction. The answerability boundary is concentrated in the principal input geometry and only inconsistently aligned with the prediction geometry defined by the unembedding. Causal interventions confirm that this geometry is not merely diagnostic: routing the hidden answerability signal directly to refusal logits produces selective abstention, boundary steering produces large direction-dependent shifts in decoded responses, and linear projection onto the factual subspace does not repair uncertain states.
These results suggest that reducing hallucination requires mechanisms that explicitly connect internal answerability representations to the output pathways where linguistic commitments are made.
URL: https://openreview.net/forum?id=2rVcTae3zo
---
Title: eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
Abstract: We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank \emph{shared context} component and a full-rank \emph{per-token} residual, well described by the spiked random matrix model. This observation leads to eOptShrinkQ, a two-stage compression pipeline: optimal singular value shrinkage (eOptShrink) automatically extracts the shared structure, and the residual---which satisfies the \emph{thin shell property} with delocalized coordinates---is quantized by TurboQuant~\citep{zandieh2025turboquant}, a recently proposed per-vector scalar quantizer with near-optimal distortion guarantees. By restoring the isotropy that scalar quantization assumes, spectral denoising eliminates the need for both outlier handling and dedicated inner product bias correction, freeing those bits for improved reconstruction.
The theoretical grounding in random matrix theory provides three guarantees: automatic rank selection via the BBP phase transition, provably near-zero inner product bias on the residual, and coordinate delocalization ensuring near-optimal quantization distortion. Experimentally, we validate eOptShrinkQ on Llama-3.1-8B and Ministral-8B across three levels: per-head MSE and inner product fidelity, where eOptShrinkQ saves nearly one bit per entry over TurboQuant at equivalent quality; end-to-end on LongBench (16 tasks), where eOptShrinkQ at $\sim$2.2 bits per entry outperforms TurboQuant at 3.0 bits; and multi-needle retrieval, where eOptShrinkQ at 2.2 bits closely matches or exceeds uncompressed FP16, suggesting that spectral denoising can act as a beneficial regularizer for retrieval-intensive tasks.
URL: https://openreview.net/forum?id=aFxOW9r228
---
Title: The GNN Trilemma in Recommender Systems: A Survey
Abstract: Graph Neural Networks (GNNs) have become the standard choice for modeling collaborative interactions in recommender systems via message passing. However, as industrial deployments scale, traditional static GNNs face fundamental limitations, including noise propagation, semantic rigidity, and computational bottlenecks. Recent advances (2024-2026) reveal a convergence of generative refinement (e.g., diffusion models) and semantic hybridization (e.g., Large Language Models) to address these challenges. In this survey, we systematically analyze this architectural shift. We introduce an orthogonal three-axis taxonomy that categorizes models along Information Source, Learning Paradigm, and System Objectives. In doing so, we capture the transition from heuristic structural augmentations toward dynamic, privacy-aware, and objective-aligned frameworks. To analyze these design trade-offs, we introduce the GNN Trilemma, a structural framework that examines how improvements in accuracy, scalability, and explainability often compete with one another in practical recommender architectures. Finally, we argue that a growing evaluation crisis exists: as models optimize for complex human-centric objectives, traditional static benchmarks and simplistic heuristic baselines increasingly obscure true system-level trade-offs.
URL: https://openreview.net/forum?id=jIuJQ1Mrjb
---
Title: LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks
Abstract: Adding LLM-generated node features to graph neural networks is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when the LLM features are introduced through pure input concatenation, rather than joint training, distillation, or prompt-conditioning, they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. Under an MLP backbone with the Planetoid public split and BoW original features ($F_\text{orig}$), concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by $-17.0 \pm 0.3$ pp and Cora by $-4.3 \pm 0.6$ pp, with CiteSeer's $-0.6 \pm 0.8$ pp inside seed noise. The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders), and reverses on medium-homophily WikiCS ($+4.4$ pp) and ogbn-arxiv ($+11.7$ pp).
To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability $\Delta_\text{sig}$. Across 9 datasets, $\Delta_\text{sig}$ correlates with the concat cost more strongly than homophily at point estimate ($r^2 = 0.38$ vs. $0.06$; $N=9$ bootstrap CIs overlap). The bootstrap-best change-point is $\tau = 13.8$ pp (95% CI $[0, 13.8]$), and the rule "$\Delta_\text{sig} \leq \tau$ predicts non-positive concat cost" classifies 7/9 datasets correctly. Because 60% of bootstrap samples place $\tau$ inside $[5, 30]$ pp, we treat $\Delta_\text{sig}$ as an interpretive lens for the helping vs. hurting regimes rather than a precision pre-A/B filter.
A dim-controlled ablation on PubMed places the LLM-feature drop between same-source PCA ($-2.3$ pp) and same-dim Gaussian noise ($-37.3$ pp), ruling out dimensionality and weight-decay artifacts. Nine PubMed configurations (seven training sizes $\times$ two encoder dimensions) fit a power-law profile $|\Delta_\text{concat}| \propto (\sqrt{d_l / n})^{1.31}$ with $r^2 = 0.97$ (PubMed-internal; Cora and CiteSeer have different slopes). The $\sqrt{d_l/n}$ profile and the $\Delta_\text{sig}$ threshold jointly describe a two-axis surface; the low-$\Delta_\text{sig}$, small-$n$ corner is exactly where the headline $-17$ pp PubMed deficit appears.
In the low-$\Delta_\text{sig}$ regime, the most effective remediation is to drop the LLM channel entirely: the $F_\text{orig}$-only baseline strictly dominates every learned cheap fix at $p \approx 0.008$. A learnable scalar gate closes 89% of the raw-concat gap and is a useful second-line option when downstream pipelines structurally require $F_\text{LLM}$. The findings do not contradict the aggregate accuracy gains reported for end-to-end LLM pipelines such as TAPE and GLEM; they identify the specific design choice (pure concatenation) under which the sign flips.
URL: https://openreview.net/forum?id=h9z7R84CsQ
---
Title: Hypothesize and Verify: Natural-Language Explanations of Vision Model Errors
Abstract: LLM- and agent-based assistants now bring non-experts into direct ML work, where they probe model failures by asking the assistant in plain language. When such a classifier misclassifies an image, the non-expert needs a faithful account of \emph{why}. Two obstacles stand in the way. No benchmark scores free-form natural-language explanations of such errors, and existing retrieval-based methods can only return sentences from a fixed error corpus. We close both. NEMO is a task and benchmark of 1{,}200 misclassified images across ImageNet-R, ObjectNet, and ImageNet-D, each varying along a controlled factor (artistic style, viewpoint, low-level attributes), scored by an LLM-as-a-Judge (LLM Match) protocol that asks whether the explanation describes that factor. SciTX is a generation-based method that emulates the scientific method: retrieve observations, hypothesize candidate causes, verify each with a counterfactual intervention, and retain the hypothesis whose intervention shifts the model's prediction farthest toward the ground-truth class. The shift is captured by our Counterfactual Explanation Impact (CEI), which serves as both SciTX's selection signal and a complementary evaluation metric. On NEMO, SciTX outperforms retrieval-based and MLLM-augmented baselines on both LLM Match and CEI, and 30 AI practitioners rank it first across all five helpfulness dimensions, including factuality, specificity, and actionability.
URL: https://openreview.net/forum?id=TVO84EFoTu
---
Title: Convergence of optimizers implies eigenvalue filtering at equilibrium
Abstract: Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard–Perron stable manifold theorem and applies to general definable $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
URL: https://openreview.net/forum?id=3tuxYlFWTy
---
Title: Self-Supervised Representation Learning as Mutual Information Maximization
Abstract: Self-supervised representation learning (SSRL) has demonstrated remarkable empirical success, yet its underlying principles remain insufficiently understood. While recent works attempt to unify SSRL methods by examining their information-theoretic objectives or summarizing their heuristics for preventing representation collapse, architectural elements like predictor networks, stop-gradient operations, and statistical regularizers are often viewed as empirically motivated additions. In this paper, we adopt a first-principles approach and investigate whether the learning objective of an SSRL algorithm dictates its possible optimization strategies and model design choices. In particular, by starting from a variational mutual information (MI) lower bound, we derive two training paradigms, namely Self-Distillation MI (SDMI) and Joint MI (JMI), each imposing distinct structural constraints and covering a set of existing SSRL algorithms. SDMI relies on alternating optimization, in which stop-gradient operations serve as a principled mechanism for realizing the alternating updates. In contrast, JMI admits joint optimization through symmetric architectures without such components. Under the proposed formulation, predictor networks in SDMI and statistical regularizers in JMI emerge as tractable surrogates for the MI objective. We show that many existing SSRL methods are specific instances or approximations of these two paradigms. This paper provides a theoretical explanation for the choices of different architectural components of existing SSRL methods, going beyond heuristic conveniences.
URL: https://openreview.net/forum?id=hlNAYdhUi6
---
Title: Can LLMs Reason over Graphs? Formal Expressiveness Bounds and a Hybrid GNN–LLM Framework
Abstract: Large language models read graphs as text and do well on local queries (degree, neighbour walks) but fail on combinatorial questions like graph isomorphism, chromatic number, or long-range connectivity. We supply the missing structural-complexity account. The core is a tight Weisfeiler--Leman ceiling: any log-precision transformer reading a serialised graph through a permutation-averaged readout has distinguishing power at most $1$-WL (Theorem 1). Two witness families pin it on real graphs --- $\mathrm{Rook}(4,4)$ vs.\ Shrikhande and a treewidth-parameterised Cai--Fürer--Immerman ladder (Proposition 2). Concatenating a substructure-counting encoder strictly breaks the bound on a fixed witness pair via an explicit three-layer in-weights construction (Theorem 3); the same recipe extends to bounded-treewidth classes with a per-$\mathrm{MSO}_2$-formula decidability result whose cost on $|\varphi|$ scales non-elementarily (Theorem 4). Six experiments and seven ablations across five LLM providers (Llama-3.1-8B, Qwen-2.5-14B, Gemini-2.5-Flash, gpt-4o-mini, gpt-4o) confirm every prediction: LLM-only at the $0.50$ chance line on the CFI ladder for all five providers and on SRG for four of five (Llama-3.1-8B's $1.00$ is a memorisation outlier on the textbook $\mathrm{Rook}(4,4)$/Shrikhande pair, isolated by four control cells); the hybrid hits $1.00$ on Gemini Flash and gpt-4o via the induced-$C_4$ feature; on the $\mathrm{MSO}_2$ ladder at $w = 2$, the $3$-colourability column reads $0.00 \to 0.00 \to 0.33 \to 1.00 \to 1.00$ across providers in capability order --- the qualitative shape Theorem 4(c) predicts, with the $0.33$ midpoint flagged as individually Bonferroni-non-significant. Every claim is pinned by a passing test in the deterministic suite at \texttt{src/theory/}.
URL: https://openreview.net/forum?id=TXIWVGeele
---
Title: uTECH-GenUrban: A Generative Agent-Based Framework for Urban Planning and Mobility Analytics
Abstract: Urban mobility systems exhibit complex, emergent behavior that is difficult to capture with traditional top-down demand models, which often rely on simplified behavioral assumptions and limited adaptability. We present uTECH-GenUrban, a hybrid generative-predictive framework that integrates large language model (LLM) reasoning with heterogeneous urban data to synthesize demographic cohorts, hourly activity schedules, and origin--destination (OD) flows. The framework consists of three sequential stages: PopulationAgent, which constructs interpretable demographic cohorts from American Community Survey (ACS) distributions; ActivityAgent, which generates hourly participation profiles across five broad activity domains derived from the American Time Use Survey (ATUS), namely sleep, work, meal, errand, and leisure; and PlannerAgent, which converts validated activity schedules into mobility-force signals and OD flow estimates using TomTom mobility records, Overture Maps building data, and a downstream Gradient Boosting Regressor. To improve reliability, each stage is wrapped in a hybrid verification pipeline combining deterministic checks, a self-correction step, and structured multi-agent debate. We evaluate uTECH-GenUrban across four heterogeneous U.S. urban testbeds—Las Vegas, Minneapolis, New York City, and San Francisco. The generated cohorts exhibit coherent and socially interpretable demographic differentiation. Aggregated evaluation against observed activity distributions demonstrates the strong capability of the mobility synthesis approach in reproducing real-world patterns. uTECH-GenUrban generalizes well to unseen temporal windows, with an $R^2$ ranging from 0.80 to 0.98 across four cities. These results suggest that LLM-driven, verification-aware urban simulation offers a promising pathway toward digital twins for planning and urban analysis.
URL: https://openreview.net/forum?id=tFam4451mK
---
Title: Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
Abstract: Vision Transformers and their variants have achieved remarkable success in diverse visual perception tasks. Despite their effectiveness, they suffer from two significant limitations: first, the quadratic computational complexity of multi-head self-attention (MHSA), which restricts scalability to large token counts, and second, a high dependency on large-scale training data to attain competitive performance. In this paper, to address these challenges, we propose a novel sparse self-attention mechanism named Fibottention. Fibottention employs structured sparsity patterns derived from the Wythoff array, enabling an O(N log N) computational complexity in self-attention. By design, its sparsity patterns vary across attention heads, which provably reduces redundant pairwise interactions while ensuring sufficient and diverse coverage. This leads to an inception-like functional diversity in the attention heads, and promotes more informative and disentangled representations. We integrate Fibottention into standard Transformer architectures and conduct extensive experiments across multiple domains, including image classification, video understanding, and robot learning. Results demonstrate that models equipped with Fibottention either significantly outperform or achieve on-par performance with their dense MHSA counterparts, while leveraging only 2% of all pairwise interactions across self-attention heads in typical settings, resulting in substantial computational savings. Moreover, when compared to existing sparse attention mechanisms, Fibottention consistently achieves superior results on a FLOP-equivalency basis. Finally, we provide an in-depth analysis of the enhanced feature diversity resulting from our attention design and discuss its implications for efficient representation learning.
URL: https://openreview.net/forum?id=GAbMtuKoFW
---
Title: Hierarchy-aligned Language Modeling in Hyperbolic Space for mRNA Coding Sequences
Abstract: Language models are increasingly applied to biological sequences such as proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry provides a better alternative for accommodating hierarchical data, it has yet to find a way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a novel framework that implements masked language model pre-training in hyperbolic space for coding (CDS) regions of mRNA sequences. Using a hybrid design with hyperbolic layers atop a Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 tasks involving property prediction, with 10\% improvement on average, and excels in out-of-distribution generalization to long and low-GC content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3\% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of the CDS regions of mRNA sequences.
URL: https://openreview.net/forum?id=q8y7ggWidF
---
Title: Unified Semantic and ID Representation Learning for Deep Recommenders
Abstract: Effective recommendation is crucial for large-scale online platforms. Traditional recommendation systems primarily rely on ID tokens to uniquely identify items, which can effectively capture specific item relationships but suffer from issues such as redundancy and poor performance in cold-start scenarios. Recent approaches have explored using semantic tokens as an alternative, yet they face challenges, including item duplication and inconsistent performance gains, leaving the potential advantages of semantic tokens inadequately examined. To address these limitations, we propose a Unified Semantic and ID Representation Learning framework that leverages the complementary strengths of both token types. In our framework, ID tokens capture unique item attributes, while semantic tokens represent shared, transferable characteristics. Additionally, we analyze the role of cosine similarity and Euclidean distance in embedding search, revealing that cosine similarity is more effective in decoupling accumulated embeddings, while Euclidean distance excels in distinguishing unique items. Our framework integrates cosine similarity in earlier layers and Euclidean distance in the final layer to optimize representation learning. Experiments on three benchmark datasets show that our method significantly outperforms state-of-the-art baselines, with improvements ranging from 6% to 17% and a reduction in token size by over 80%. These results demonstrate the effectiveness of combining ID and semantic tokenization to enhance the generalization ability of recommender systems.
URL: https://openreview.net/forum?id=8xEXo9D5rg
---
Title: Alignment of Diffusion Model and Flow Matching for Text-to-Image Generation
Abstract: Diffusion models and flow matching have demonstrated remarkable success in text-to-image generation. While many existing alignment methods primarily focus on fine-tuning pre-trained generative models to maximize a given reward function, these approaches require extensive computational resources and may not generalize well across different objectives. In this work, we propose a novel alignment framework by leveraging the underlying nature of the alignment problem---sampling from reward-weighted distributions---and show that it applies to both diffusion models (via score guidance) and flow matching models (via velocity guidance). We show that the score function (velocity field) required for the reward-weighted distribution can be decomposed into the pre-trained score (velocity field) plus a conditional expectation of the reward. For the alignment on the diffusion model, we identify a fundamental challenge: the adversarial nature of the guidance term can introduce undesirable artifacts in the generated images. Therefore, we propose a finetuning-free framework that trains a guidance network to estimate the conditional expectation of the reward. We achieve comparable performance to finetuning-based models with one-step generation with at least a 60\% reduction in computational cost. For the alignment on flow matching, we propose a training-free framework that improves the generation quality without additional computational cost.
URL: https://openreview.net/forum?id=axRaqhyahn
---
Title: ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
Abstract: Large language models (LLMs) with advanced cognitive capabilities are emerging as agents for various reasoning and planning tasks. Traditional evaluations often focus on specific reasoning or planning questions within controlled environments. Recent studies have explored travel planning as a medium to integrate various verbal reasoning tasks into real-world contexts. However, reasoning tasks extend beyond verbal reasoning alone, and a comprehensive evaluation of LLMs requires a testbed that incorporates tasks from multiple cognitive domains. To address this gap, we introduce ItinBench, a benchmark that integrates one spatial reasoning task, i.e., route optimization, into trip itinerary planning while keeping the traditional verbal reasoning tasks. ItinBench evaluates various LLMs across diverse tasks simultaneously, including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and the GPT family. Our findings reveal that LLMs struggle to maintain high and consistent performance when concurrently handling multiple cognitive dimensions. By incorporating tasks from distinct human-level cognitive domains, ItinBench provides new insights into building more comprehensive reasoning testbeds that better reflect real-world challenges. The code and dataset are attached.
URL: https://openreview.net/forum?id=Idp60dibCE
---
Title: FlowReasoner: Reinforcing Query-Level Meta-Agents
Abstract: This paper proposes a query-level meta-agent named FlowReasoner to automate the design of query-level multi-agent systems, i.e., one system per user query.
Our core idea is to incentivize a reasoning-based meta-agent via external execution feedback.
Concretely, by distilling DeepSeek R1, we first endow FlowReasoner with the basic ability to reason about generating multi-agent systems.
Then, we further enhance it via reinforcement learning (RL) with external execution feedback.
A multi-purpose reward is designed to guide the RL training from aspects of performance, complexity, and efficiency.
In this manner, FlowReasoner is enabled to generate a personalized multi-agent system for each user query via deliberative reasoning.
Experiments on both engineering and competition code benchmarks demonstrate the superiority of FlowReasoner.
Remarkably, it surpasses o1-mini by 10.52% accuracy across three benchmarks. All the code is included in the Supplemental Material.
URL: https://openreview.net/forum?id=0qklDI3t58
---
Title: Treatment Effects in Extreme Regimes
Abstract: Understanding treatment effects in extreme regimes is important for characterizing risks associated with different interventions. This is hindered by the unavailability of counterfactual outcomes and the rarity and difficulty of collecting extreme data in practice. To address this issue, we propose a new framework based on extreme value theory for estimating treatment effects in extreme regimes. We quantify these effects using variations in tail decay rates of potential outcomes in the presence and absence of treatments. We establish algorithms for calculating these quantities and develop related theoretical results. We demonstrate the efficacy of our approach on various standard synthetic and semi-synthetic datasets.
URL: https://openreview.net/forum?id=u8ZCcv0gTV
---
Title: Constraint-Aware Tabular Score Attacks with Attribution Guided Boundary Search
Abstract: Adversarial robustness in structured data remains less developed than in vision and language, despite the central role of tabular models in high-stakes decision systems. We propose a constraint-aware, query-efficient black-box attack framework for tabular and hybrid feature spaces that couples feature prioritization, boundary localization, and feasibility-preserving refinement into a single attack procedure. Given score-based oracle access, the framework identifies influential mutable features through low-budget attribution-guided queries, searches for nearby decision-boundary crossings using score margins, and refines adversarial examples while enforcing immutability, categorical validity, range constraints, and perturbation-cost budgets. Across multiple public datasets and heterogeneous model families, the proposed framework achieves targeted attack success rates above 90% under strict query and feasibility constraints. We further analyze query cost, sparsity, feasibility, and component-level contributions, supported by analysis linking attribution-guided ranking to local sensitivity and boundary localization to logarithmic query complexity after bracketing.
URL: https://openreview.net/forum?id=SMPpLoKGUi
---
Title: Towards Generalist Game Players: A Survey of Foundation Models in the Game Multiverse
Abstract: The real world unfolds along a single set of physics laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where the agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within a theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field, and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.
URL: https://openreview.net/forum?id=SDkEq8Fzvl
---
Title: Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms
Abstract: We study the challenging exploration incentive problem in both bandit and reinforcement learning, where the rewards are scale-free and potentially unbounded, driven by real-world scenarios and differing from existing work. Past works in reinforcement learning either assume costly interactions with an environment or propose algorithms finding potentially low quality local maxima. Motivated by EXP-type methods that integrate multiple agents (experts) for exploration in bandits with the assumption that rewards are bounded, we propose new algorithms, namely EXP4.P and EXP4-RL for exploration in the unbounded reward case, and demonstrate their effectiveness in these new settings. Unbounded rewards introduce challenges as the regret cannot be limited by the number of trials, and selecting suboptimal arms may lead to infinite regret. Specifically, we establish EXP4.P's regret upper bounds in both bounded and unbounded linear and stochastic contextual bandits. Surprisingly, we also find that by including one sufficiently competent expert, EXP4.P can achieve global optimality in the linear case. This unbounded reward result is also applicable to a revised version of EXP3.P in the Multi-armed Bandit scenario. In EXP4-RL, we extend EXP4.P from bandit scenarios to reinforcement learning to incentivize exploration by multiple agents, including one high-performing agent, for both efficiency and excellence. This algorithm has been tested on difficult-to-explore games and shows significant improvements in exploration compared to state-of-the-art.
URL: https://openreview.net/forum?id=W5o9ax5m8D
---
Title: A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
Abstract: Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey provides a comprehensive overview of the field, with a particular focus on the paradigm shift from discriminative mapping to modern generative modeling. We first review early discriminative deep neural network (DNN) models, which formulate BWE/SR as a deterministic mapping problem and are prone to regression-to-the-mean effects and spectral over-smoothing.
We then systematically review generative approaches, including autoregressive (AR) models, variational autoencoders (VAEs), generative adversarial networks (GANs), diffusion and score-based models, flow-based methods, and Schrödinger bridges. Across these approaches, we examine key design aspects, including representation domain, architecture, conditioning mechanisms, and trade-offs among reconstruction fidelity, perceptual quality, robustness, and computational efficiency.
Furthermore, we discuss emerging directions involving large language models (LLMs) and multimodal foundation models, and highlight open challenges in perceptual evaluation, phase modeling, and real-world generalization. By providing a structured taxonomy and unified perspective, this survey establishes a comprehensive foundation and offers a practical roadmap for advancing BWE/SR from deterministic point estimation toward distribution-aware generative modeling.
URL: https://openreview.net/forum?id=NdL2H5aoBn
---
Title: Parametrizing Convex Sets Using Sublinear Neural Networks
Abstract: We propose a neural parameterization of convex sets by learning sublinear (positively homogeneous and convex) functions. Our networks implicitly represent both the support and gauge functions of a convex body. We prove a universal approximation theorem for convex sets under this parametrization. Empirically, we demonstrate the method on shape optimization and inverse design tasks, achieving accurate reconstruction of target shapes.
URL: https://openreview.net/forum?id=N8lACR71ya
---
Title: Stochastic Difference-of-Convex Optimization with Momentum
Abstract: We study the online stochastic difference-of-convex (DC) program $\min_{x \in \mathbb{R}^n} F(x) = G(x) - H(x) + r_1(x) - r_2(x)$, a general formulation capturing numerous non-convex machine learning tasks, including robust regression, sparse learning, and fair classification. We propose \emph{momentum-based DCA} (MDCA), an algorithm that combines a damped DCA step with a momentum estimator for $\nabla H$. Crucially, MDCA operates with a constant per-iteration batch size. We establish that MDCA with Polyak momentum requires $\mathcal{O}( \epsilon^{-4})$ stochastic gradient evaluations to reach $\epsilon$-stationarity, while momentum-based variance reduction (MVR) achieves $\mathcal{O}(\epsilon^{-3})$ under averaged smoothness, matching DCA-PAGE without the need for periodic large anchor batches. Furthermore, we introduce a single-loop variant (S-MDCA) that replaces exact subproblem solves with a single prox-gradient step; it reaches $\epsilon$-stationarity of a smoothed objective in $\mathcal{O}(\epsilon^{-4})$ iterations. We isolate the algorithmic benefit of momentum via a \emph{separation lemma}, demonstrating how it provides a tunable mechanism to trade estimator variance against iterate drift bias. Convergence is proven under both critical-distance and gap-function criteria, and validated through experiments on real classification datasets.
URL: https://openreview.net/forum?id=7aIxLa5L9E
---
Title: Simulating Cryo-EM: Cycle-Consistent Predictor–Corrector Diffusion with Biophysical Modeling
Abstract: Single-particle cryo-electron microscopy (cryo-EM) has become a cornerstone of structural biology, enabling near-atomic resolution analysis of macromolecules through advanced computational methods. However, the development of cryo-EM processing tools is constrained by the scarcity of high-quality annotated datasets. Synthetic data generation offers a promising alternative, but existing approaches lack thorough biophysical modeling of heterogeneity and fail to reproduce the complex noise observed in real imaging. To address these limitations, we present CryoCCD, a comprehensive simulator for cryo-EM, unifying versatile biophysical modeling with the first diffusion model for realistic noise generation. The biophysical engine provides multi-functional generation capabilities to capture authentic biological organization, and the diffusion model is enhanced with cycle consistency and predictor–corrector sampling to improve realism and structural fidelity. Extensive experiments demonstrate that CryoCCD generates structurally faithful micrographs, enhances particle picking and pose estimation, and achieves superior performance over state-of-the-art baselines, while also generalizing effectively to held-out protein families.
URL: https://openreview.net/forum?id=oBBBg2MbGB
---
Title: Lightweight Consistency Memory: Training-Time Regularization for Logical Consistency in Small Language Models
Abstract: Small language models (60M--220M parameters) lack logical consistency for reliable reasoning, while most existing consistency methods operate at inference time (chain-of-thought, self-correction) requiring 2-10$\times$ forward passes per query. We introduce Lightweight Consistency Memory (LCM), a training-time regularization module that jointly optimizes consistency checking alongside the primary language modeling objective. Through multi-objective training ($\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{transformer}} + \alpha(t) \cdot \mathcal{L}_{\text{LCM}}$), LCM provides a consistency-supervised auxiliary signal to the backbone during training, with linear $O(nd)$ complexity in the auxiliary head itself. Crucially, for T5 and GPT-2, the LCM module is discarded at inference---the trained model's improved accuracy on consistency-targeted benchmarks must therefore be attributable to the fine-tuned backbone weights, adding zero overhead at inference time.
T5-small achieves +7.5% accuracy (p$<$0.001) while maintaining baseline latency, and T5-base (+7.2%) and GPT-2 (+1.9%) confirm cross-architecture applicability. LCM outperforms inference-time baselines (self-consistency, best-of-N) by up to +9.9% at 1$\times$ cost, and matches a dedicated DeBERTa-v3 NLI classifier (184M params, 80.6%) using only 60M parameters.
Cross-scale evaluation suggests a pattern of difficulty-dependent effectiveness: DeBERTa-v3's contrasting results on standard (+0.2%) versus adversarial benchmarks (+5.5% ANLI) are consistent with the hypothesis that highly-optimized models may benefit primarily when dataset difficulty exceeds base capabilities; this single-model observation should be confirmed by future work across additional model families. Code and dataset will be publicly released at \url{https://github.com/xxxx/LightweightConsistencyMemory} upon acceptance.
URL: https://openreview.net/forum?id=82AdtbbK5t
---
Title: Loss Landscape Diagnosis for Gradient-Based Gray-Scott System Inversion: Disentangling the Roles of PINN Components
Abstract: Gradient-based inversion of reaction-diffusion systems is typically approached via surrogate models or physics-informed neural networks (PINNs), while the most direct route, backpropagation through the PDE's structure itself, has largely been avoided. We pursue this direct route as a diagnostic probe, backpropagating a steady-state loss through unrolled Gray-Scott simulation to recover its parameters, with no surrogate or neural-network augmentation. Optimization fails to converge, and plotting the landscape directly locates the failure in its geometry—flat plateaus with no gradient signal, bounded by sharp cliffs that align with bifurcation boundaries—a structure that recurs across loss functions and is inherited however the gradients are routed to parameters. Reading this minimal setup as an ablation of PINN, we disentangle each component's role: with the neural network fixed, the residual loss is quadratic in the PDE parameters and yields a smooth landscape, so it alone already avoids the pathology, by implicitly encoding the full PDE dynamics across all initial conditions. The neural network, for its part, cannot repair an ill-posed parameter subspace, and so serves only to complete the observed data—a division of labor not previously made explicit. These findings carry concrete design implications for PINN-type methods and a broader heuristic on when added dimensions actually help.
URL: https://openreview.net/forum?id=EyAzyr4DFZ
---
Title: Learning Inter-Atomic Potentials without Explicit Equivariance
Abstract: Accurate and scalable machine-learned inter-atomic potentials (MLIPs) are essential for molecular simulations ranging from drug discovery to new material design. Current state-of-the-art models enforce roto-translational symmetries through equivariant neural network architectures, a hard-wired inductive bias that can often lead to reduced flexibility, computational efficiency, and scalability. In this work, we introduce \textbf{TransIP}: \textbf{Trans}former-based \textbf{I}nter-Atomic \textbf{P}otentials, a novel training paradigm for interatomic potentials achieving symmetry compliance without explicit architectural constraints. Our approach guides a generic non-equivariant Transformer-based model to learn $\mathrm{SO}(3)$-equivariance by optimizing its representations in the embedding space. Trained on the recent Open Molecules (OMol25) collection, a large and diverse molecular dataset built specifically for MLIPs and covering different types of molecules (including small organics, biomolecular fragments, and electrolyte-like species), TransIP attains comparable performance in machine-learning force fields versus state-of-the-art equivariant baselines. Further, compared to a data augmentation baseline, TransIP achieves 40\% to 60\% improvement in performance across varying OMol25 dataset sizes. More broadly, our work shows that learned equivariance can be a powerful and efficient alternative to equivariant or augmentation-based MLIP models.
URL: https://openreview.net/forum?id=g4ccBRnZas
---
Title: A Dual-Branch Disentanglement Diffusion for ID-Attribute Conditional Face Generation
Abstract: Face identity customization, i.e., face generation with specified identity, has received increasing attention owing to its extensive applications in personalized content creation. Although existing methods achieve high consistency in identity with reference faces, they still struggle to precisely manipulate fine-grained facial attributes. We attribute this issue to the inherent entanglement of identity and attribute information, as well as the lack of attribute-specific supervision. Accordingly, to address this issue, we propose AttPortrait, a high-quality identity-attribute conditional face generation framework. Based on a foundational face diffusion model, we introduce an extra disentanglement branch alongside the conventional denoising branch during the training stage. This extra branch employs explicit attribute supervision to encourage the model to capture the attribute information from the text prompts, effectively disentangling the identity and attributes and achieving precise attribute manipulation with high identity consistency. Comprehensive experiments demonstrate that our method substantially improves attribute accuracy by 34%, while maintaining identity similarity on par with state-of-the-art methods and achieving competitive FID scores across both real and synthetic datasets.
URL: https://openreview.net/forum?id=LyZL8UuATf
---
Title: Enhancing Large Language Models for Constraint-Driven Molecular Generation and Beyond
Abstract: Most de novo molecule generators attempt to satisfy hard chemical constraints in a single forward pass, offering little guidance when outputs fall short. We introduce Code-Driven Molecular Synthesis (CDMS) -- an iterative, model-agnostic framework that embeds a formal self-improving feedback loop into large language models (LLMs). At the start of each task, the LLM uses the chemist’s request as input to generate a snippet of executable code, referred to as an \emph{inspector}, which formalizes the evaluation logic for guiding molecular refinement. This inspector remains fixed throughout the refinement process and is executed on every candidate molecule at each iteration. It produces natural-language critiques describing how to improve the molecule to better meet user-defined constraints (e.g., “add a para-hydroxyl group”). These \emph{Programmatic Feedback Gradients} are appended to subsequent prompts, guiding the LLM toward progressively refined outputs until all structural and functional requirements are satisfied. CDMS achieves state-of-the-art success rates in constraint satisfaction using only a few feedback iterations and without any model retraining. To encourage further research, we release a benchmark dataset curated for code-generated, feedback-driven molecular design\footnote{\url{https://anonymous.4open.science/r/CDMS-C08D/}}.
URL: https://openreview.net/forum?id=J7eaGcUAGj
---
Title: Compute Optimal Tokenization
Abstract: Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information granularity of tokens, controlled by the compression rate (i.e., average bytes of text per token), affects scaling trends. We train 988 latent tokenized models (BLT) ranging from 50M to 7B parameters that enable setting the desired compression rate. This flexibility allows us to study the role of compression rate well beyond 4.57 bytes per token obtained with a popular BPE tokenizer. Our experiments reveal that in compute-optimal configurations, model parameter counts scale proportionally to data size measured in bytes, not in tokens as commonly perceived (Kaplan et al., 2020; Hoffmann et al., 2022). Furthermore, we discover that the optimal compression rate differs from the one obtained with BPE and decreases with compute. These findings generalize to both latent and subword tokenization, as well as to languages other than English, guiding language model developers on tokenization scheme selection for maximal compute efficiency.
URL: https://openreview.net/forum?id=zLjMDBaOek
---
Title: Reinforcement Learning in GUI Agents: A Survey Toward Digital Inhabitants
Abstract: Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi-tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world-model-based training, which can yield substantial performance gains; and the spontaneous emergence of System-2-style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent-native infrastructure.
URL: https://openreview.net/forum?id=cLFnzcCYx6
---
Title: Knowledge Graph-Augmented Language Models for Knowledge-Grounded Dialogue Generation
Abstract: Language models have achieved impressive performances on dialogue generation tasks. However, when generating responses for a conversation that requires factual knowledge, they are far from perfect, due to an absence of mechanisms to retrieve, encode, and reflect the knowledge in the generated responses. Some knowledge-grounded dialogue generation methods tackle this problem by leveraging facts from Knowledge Graphs (KGs); however, they do not guarantee that the model utilizes a relevant piece of knowledge from the KG. To overcome this limitation, we propose SUbgraph Retrieval-augmented GEneration (SURGE), a framework for generating context-relevant and knowledge-grounded dialogues with the KG. Specifically, our SURGE framework first retrieves the relevant subgraph from the KG, and then enforces consistency across facts by perturbing their word embeddings conditioned by the retrieved subgraph. Then, we utilize contrastive learning to ensure that the generated texts have high similarity to the retrieved subgraphs. We validate our SURGE framework on OpendialKG and KOMODIS datasets, showing that it generates high-quality dialogues that faithfully reflect the knowledge from KG.
URL: https://openreview.net/forum?id=pvVEXyNvG2
---