J2C Certification: Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
Yixuan Even Xu, Yash Savani, Fei Fang, J Zico Kolter
https://openreview.net/forum?id=MfHOmgqVXM
---
J2C Certification: FedLog: Personalized Federated Classification with Less Communication and More Flexibility
Haolin Yu, Guojun Zhang, Hongliang Li, Pascal Poupart
https://openreview.net/forum?id=7Hwk0bvvKn
---
J2C Certification: CI-CBM: Class-Incremental Concept Bottleneck Model for Interpretable Continual Learning
Amirhosein Javadi, Tuomas Oikarinen, Tara Javidi, Tsui-Wei Weng
https://openreview.net/forum?id=Wf6OpLgj2i
---
J2C Certification: Differentially Private and Scalable Estimation of the Network Principal Component
Alireza Khayatian, Anil Vullikanti, Aritra Konar
https://openreview.net/forum?id=V0BjWbrAYC
---
Featured Certification: Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion
Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji
https://openreview.net/forum?id=mKlW68i2Ig
---
Accepted papers
===============
Title: RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging
Authors: The-Hai Nguyen, Dang Huu-Tien, Takeshi Suzuki, Le-Minh Nguyen
Abstract: Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merged model by minimizing the discrepancy in predictions between the merged and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in earlier layers propagate through deeper layers and influence the final predictions of the merged model. Here, we introduce RegMean++, a simple yet effective alternative to RegMean, which explicitly incorporates both intra-layer and cross-layer dependencies between merged models' layers into RegMean's objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merged model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive performance across diverse settings compared to various advanced model merging methods.
URL: https://openreview.net/forum?id=H5lDsSCS9i
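The per-layer closed form that the abstract refers to can be sketched in a few lines. This is a toy two-model NumPy illustration under our own assumptions (the function name, the small ridge term, and the random data are ours, not from the paper): minimizing the per-layer discrepancy $\sum_i \|X_i W - X_i W_i\|_F^2$ gives $W^* = (\sum_i X_i^\top X_i)^{-1} \sum_i X_i^\top X_i W_i$.

```python
import numpy as np

def regmean_merge(weights, inputs, eps=1e-6):
    """Merge per-layer weight matrices W_i (d_in x d_out) by minimizing
    sum_i ||X_i W - X_i W_i||_F^2, whose closed-form solution is
    W* = (sum_i X_i^T X_i)^{-1} (sum_i X_i^T X_i W_i)."""
    d_in = weights[0].shape[0]
    gram = np.zeros((d_in, d_in))
    rhs = np.zeros_like(weights[0])
    for W, X in zip(weights, inputs):
        G = X.T @ X          # layer-input Gram matrix for this candidate
        gram += G
        rhs += G @ W
    # tiny ridge term for numerical stability (an illustrative choice)
    return np.linalg.solve(gram + eps * np.eye(d_in), rhs)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))   # two candidate layers
X1, X2 = rng.normal(size=(32, 4)), rng.normal(size=(32, 4)) # their layer inputs
W_star = regmean_merge([W1, W2], [X1, X2])
```

Because the Gram matrices of the two candidates differ, the merged solution fits the combined prediction-matching objective better than naive weight averaging.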
---
Title: ADAPT: Adaptive Prompt Tuning for Vision-Language Models
Authors: Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, Jianxi Gao
Abstract: Prompt tuning has emerged as an effective approach to parameter-efficient fine-tuning. Conventional deep prompt tuning inserts continuous prompts of a fixed context length into the input to each layer. When a pre-trained model is tailored to a specific downstream task, different layers initialized with pre-trained weights might have different levels of deviation from the optimal weights. Inserted prompts with a fixed context length might have redundant context tokens or insufficient context length. To address this issue, we propose a deep continuous prompting method dubbed Adapt that encourages heterogeneous context lengths. In this method, context lengths are automatically determined by iteratively pruning context tokens. We use the saliency criterion for neural network pruning to compute the importance scores of context tokens in order to determine which tokens to prune. To avoid the forgetting issue in the fine-tuning process, we apply angular knowledge distillation to force the model to learn the angular separation between pairs of classes and that of instances from the pre-trained model. We examine the proposed method on the pre-trained vision-language model CLIP. 16-shot experiments on 11 downstream datasets reveal the advantage of Adapt: the average test accuracy is competitive, and the highest performance gain on an individual dataset is 7.44%. We release the code at https://github.com/Zhenhan-Huang/Adapt-Public.
URL: https://openreview.net/forum?id=3uVAA3ckxT
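The saliency-based pruning step described in the abstract can be sketched as follows. The first-order score |g · p| is a common saliency proxy for the loss change if a token is removed; the function name, the score, and the toy data are our illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def prune_context_tokens(prompts, grads, keep):
    """Score each continuous prompt token by |g . p| (a first-order
    estimate of the loss change if the token were removed) and keep
    only the `keep` highest-scoring tokens, preserving their order."""
    scores = np.abs((prompts * grads).sum(axis=1))   # one saliency score per token
    keep_idx = np.sort(np.argsort(scores)[-keep:])   # highest-scoring tokens, in order
    return prompts[keep_idx]

# toy prompt: 3 context tokens of dimension 2, with matching gradients
prompts = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
grads = np.ones_like(prompts)
pruned = prune_context_tokens(prompts, grads, keep=2)
```

Applied iteratively during fine-tuning, a rule of this shape lets each layer settle on its own context length.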
---
Title: ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
Authors: Fotios Lygerakis, Ozan Özdenizci, Elmar Rueckert
Abstract: Tactile sensing provides essential local information complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations.
We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success.
URL: https://openreview.net/forum?id=mxzzO66Zbu
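The two-stage positional injection can be illustrated shape-wise. The sketch below uses standard sinusoidal tables and toy dimensions as stand-ins; the concrete encodings and token shapes in the paper may differ:

```python
import numpy as np

def sinusoidal_pe(n, d):
    """Standard sinusoidal positional-encoding table of shape (n, d)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def two_stage_inject(vision_tokens, tactile_tokens):
    """Two-stage injection: local (modality-specific) encodings are added
    within each stream, then a global encoding is added on the joint
    token sequence right before attention would be applied."""
    v = vision_tokens + sinusoidal_pe(*vision_tokens.shape)   # local, vision stream
    t = tactile_tokens + sinusoidal_pe(*tactile_tokens.shape) # local, tactile stream
    joint = np.concatenate([v, t], axis=0)                    # joint token sequence
    return joint + sinusoidal_pe(*joint.shape)                # global, shared vocabulary
```

The point of the second stage is that both modalities share one positional vocabulary exactly where cross-modal attention happens.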
---
Title: You Only Train Once: Differentiable Subset Selection for Omics Data
Authors: Daphné Chopard, Jorge da Silva Gonçalves, Irene Cannistraci, Thomas M. Sutter, Julia E Vogt
Abstract: Selecting compact and informative gene subsets from single-cell transcriptomic data is essential for biomarker discovery, improving interpretability, and cost-effective profiling. However, most existing feature selection approaches either operate as multi-stage pipelines or rely on post hoc feature attribution, making selection and prediction weakly coupled. In this work, we present YOTO (you only train once), an end-to-end framework that jointly identifies discrete gene subsets and performs prediction within a single differentiable architecture. In our model, the prediction task directly guides which genes are selected, while the learned subsets, in turn, shape the predictive representation. This closed feedback loop enables the model to iteratively refine both what it selects and how it predicts during training. Unlike existing approaches, YOTO enforces sparsity so that only the selected genes contribute to inference, eliminating the need to train additional downstream classifiers. Through a multi-task learning design, the model learns shared representations across related objectives, allowing different tasks to inform one another and discover gene subsets that generalize across tasks without additional training steps. We evaluate YOTO on two representative single-cell RNA-seq datasets, showing that it consistently outperforms state-of-the-art baselines. These results demonstrate that sparse, end-to-end, multi-task gene subset selection improves predictive performance and yields compact and meaningful gene subsets, advancing biomarker discovery and single-cell analysis.
URL: https://openreview.net/forum?id=xQiXlADW5v
---
Title: Joint Embedding Variational Bayes
Authors: Amin Oji, Paul W. Fieguth
Abstract: We introduce Variational Joint Embedding (VJE), a reconstruction-free latent-variable framework for non-contrastive self-supervised learning in representation space. VJE maximizes a symmetric conditional evidence lower bound (ELBO) on paired encoder embeddings by defining a conditional likelihood directly on target representations, rather than optimizing a pointwise compatibility objective. The likelihood is instantiated as a heavy-tailed Student--\(t\) distribution on a polar representation of the target embedding, where a directional--radial decomposition separates angular agreement from magnitude consistency and mitigates norm-induced pathologies. The directional factor operates on the unit sphere, yielding a valid variational bound for the associated spherical subdensity model. An amortized inference network parameterizes a diagonal Gaussian posterior whose feature-wise variances are shared with the directional likelihood, yielding anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE is competitive with standard non-contrastive baselines under linear and \(k\)-NN evaluation, while providing probabilistic semantics directly in representation space for downstream uncertainty-aware applications. We validate these semantics through out-of-distribution detection, where representation-space likelihoods yield strong empirical performance. These results position the framework as a principled variational formulation of non-contrastive learning, in which structured feature-wise uncertainty is represented directly in the learned embedding space.
URL: https://openreview.net/forum?id=4cbPJ5jLtr
---
Title: VidHal: Benchmarking Hallucinations in Vision LLMs
Authors: Wey Yeh Choong, Yangyang Guo, Mohan Kankanhalli
Abstract: Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with sparse exploration of their video-based counterparts. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address these two limitations, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions representing varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models, including both open-source and proprietary ones. Our results uncover significant limitations in existing VLLMs with respect to video-based hallucination generation. Through our benchmark, we aim to inspire further research on i) holistic understanding of VLLM capabilities, particularly regarding hallucination, and ii) advancing VLLMs to alleviate this problem.
URL: https://openreview.net/forum?id=7ccWCDbdM1
---
Title: Transitioning Heads Conundrum: The Hidden Bottleneck in Long-Tailed Class-Incremental Learning
Authors: Rahul Vigneswaran, Hari Chandana Kuchibhotla, Vineeth N. Balasubramanian
Abstract: Long-Tailed Class-Incremental Learning (LTCIL) faces a fundamental tension: models must sequentially learn new classes while contending with extreme class imbalance, which amplifies catastrophic forgetting. A particularly overlooked phenomenon is the Transitioning Heads Conundrum: as replay buffers constrain memory, initially well-represented head classes shrink over time and effectively become tail classes, undermining knowledge retention. Existing approaches fail to address this because they apply knowledge distillation too late, after these transitions have already eroded head-class representations. To overcome this, we introduce DEcoupling Representations for Early Knowledge distillation (DEREK), which strategically employs Early Knowledge Distillation to safeguard head-class knowledge before data constraints manifest. Comprehensive evaluation across 2 LTCIL benchmarks, 12 experimental settings, and 24 baselines, including Long-Tail, Class-Incremental, Few-Shot CIL, and LTCIL methods, shows that DEREK maintains competitive performance across categories, establishing new state-of-the-art results.
URL: https://openreview.net/forum?id=Hb2Jvi5M7X
---
Title: CLIP-SVD: Efficient and Interpretable Vision–Language Adaptation via Singular Values
Authors: Taha Koleilat, Hassan Rivaz, Yiming Xiao
Abstract: Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a multi-modal and parameter-efficient adaptation framework that applies Singular Value Fine-tuning (SVF) to CLIP, leveraging Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. Overall, this work provides the first extensive empirical evaluation of SVD-based finetuning in the vision-language model setting.
URL: https://openreview.net/forum?id=XYy8pwqwMR
---
Title: Properties and limitations of geometric tempering for gradient flow dynamics
Authors: Francesca Romana Crucinio, Sahani Pathiraja
Abstract: We consider the problem of sampling from a probability distribution $\pi$. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise the Kullback--Leibler divergence from $\pi$.
We consider the effect of replacing $\pi$ with a sequence of moving targets $(\pi_t)_{t\ge0}$ defined via geometric tempering on the Wasserstein and Fisher--Rao gradient flows.
We show that convergence occurs exponentially in continuous time, providing novel bounds in both cases. We also consider popular time discretisations and explore their convergence properties.
We show that in the Fisher--Rao case, replacing the target distribution with a geometric mixture of the initial and target distributions never leads to a convergence speed-up, in either continuous or discrete time. Finally, we explore the gradient flow structure of tempered dynamics and derive novel adaptive tempering schedules.
URL: https://openreview.net/forum?id=IP0w5LdcxC
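The tempered dynamics studied here can be made concrete with a small simulation. The sketch below runs an unadjusted Langevin algorithm whose drift follows the geometrically tempered target $\pi_t \propto \pi_0^{1-\lambda_t}\,\pi^{\lambda_t}$, so the score is the interpolation $(1-\lambda_t)\nabla\log\pi_0 + \lambda_t\nabla\log\pi$. The 1-D Gaussian pair, the linear schedule, and the step size are our illustrative choices, not the paper's setup:

```python
import numpy as np

def tempered_ula(n_steps=2000, step=0.05, n_particles=5000, seed=0):
    """Unadjusted Langevin with a geometrically tempered target.
    Illustrative setup: pi_0 = N(0, 1), pi = N(3, 1), so the tempered
    score is (1 - lam) * (-x) + lam * (3 - x)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_particles)                 # initialize at pi_0
    for t in range(n_steps):
        lam = min(1.0, (t + 1) / (n_steps // 2))     # linear ramp, then lam = 1
        score = (1 - lam) * (-x) + lam * (3.0 - x)   # interpolated Gaussian scores
        x = x + step * score + np.sqrt(2 * step) * rng.normal(size=n_particles)
    return x

samples = tempered_ula()
```

After the ramp ends the dynamics are plain Langevin on the target, so the particle cloud should sit near N(3, 1) up to the usual discretization bias.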
---
Title: Grothendieck Graph Neural Networks Framework: An Algebraic Platform for Crafting Topology-Aware GNNs
Authors: Amirreza Shiralinasab Langari, Leila Yeganeh, Kim Khoa Nguyen
Abstract: Graph Neural Networks (GNNs) are almost universally built on a single primitive: the neighborhood. Regardless of architectural variations, message passing ultimately aggregates over neighborhoods, which intrinsically limits expressivity and often yields power no stronger than the Weisfeiler–Lehman (WL) test. In this work, we challenge this primitive. We introduce the Grothendieck Graph Neural Networks (GGNN) framework, which provides a strict algebraic extension of neighborhoods to covers, and in doing so replaces neighborhoods as the fundamental objects of message passing. Neighborhoods and adjacency matrices are recovered as special cases, while covers enable a principled and flexible foundation for defining topology-aware propagation schemes.
GGNN formalizes covers and systematically translates them into matrices, analogously to how adjacency matrices encode neighborhoods, enabling both theoretical analysis and practical implementation. Within this framework, we introduce the cover of sieves, inspired by category theory, which captures rich topological structure. Based on this cover, we design Sieve Neural Networks (SNN), a canonical fixed-cover instantiation that generalizes the adjacency matrix. Experiments show that SNN achieves zero failures on challenging graph isomorphism benchmarks (SRG, CSL, BREC) and substantially improves topology-aware evaluation via a controlled label-propagation probe. These results establish GGNN as a principled foundational framework for replacing neighborhoods in GNNs.
URL: https://openreview.net/forum?id=oD3qWXgB4e
---
Title: Test-Time Adaptation for Unsupervised Combinatorial Optimization
Authors: Yiqiao Liao, Farinaz Koushanfar, Parinaz Naghizadeh
Abstract: Unsupervised neural combinatorial optimization (NCO) enables learning powerful solvers without access to ground-truth solutions. Existing approaches fall into two disjoint paradigms: models trained for generalization across instances, and instance-specific models optimized independently at test time. While the former are efficient during inference, they lack effective instance-wise adaptability; the latter are flexible but fail to exploit learned inductive structure and are prone to poor local optima. This motivates the central question of our work: how can we leverage the inductive bias learned through generalization while unlocking the flexibility required for effective instance-wise adaptation? We first identify a challenge in bridging these two paradigms: generalization-focused models often constitute poor warm starts for instance-wise optimization, potentially underperforming even randomly initialized models when fine-tuned at test time. To resolve this incompatibility, we propose TACO, a model-agnostic test-time adaptation framework that unifies and extends the two existing paradigms for unsupervised NCO. TACO applies strategic warm-starting to partially relax trained parameters while preserving inductive bias, enabling rapid and effective unsupervised adaptation. Crucially, compared to naively fine-tuning a trained generalizable model or optimizing an instance-specific model from scratch, TACO achieves better solution quality while incurring negligible additional computational cost. Experiments on the canonical problems of minimum vertex cover, maximum clique, maximum independent set, and max cut demonstrate the effectiveness and robustness of TACO across static, distribution-shifted, and dynamic combinatorial optimization problems, establishing it as a practical bridge between generalizable and instance-specific unsupervised NCO.
URL: https://openreview.net/forum?id=VVyGfRp4fG
---
Title: Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models
Authors: Vinith Menon Suriyakumar, Rohan Alur, Ayush Sekhari, Manish Raghavan, Ashia C. Wilson
Abstract: Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with "unlearning" steps (to "forget" existing concepts, such as copyrighted works or explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to "relearn" concepts that were previously "unlearned." We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, by performing a series of experiments which compose "concept unlearning" with subsequent fine-tuning of Stable Diffusion v1.4 and Stable Diffusion v2.1. Our findings underscore the fragility of composing incremental model updates, and raise serious new concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.
URL: https://openreview.net/forum?id=Vj0Z2wspQ5
---
Title: Wasserstein Bounds for generative diffusion models with Gaussian tail targets
Authors: WANG XIXIAN, Zhongjian Wang
Abstract: We present an estimate of the Wasserstein distance between the data distribution and the generation of score-based generative models. The sampling complexity with respect to dimension is $\mathcal{O}(\sqrt{d})$, with a logarithmic constant. In the analysis, we assume a Gaussian-type tail behavior of the data distribution and an $\epsilon$-accurate approximation of the score. Such a Gaussian tail assumption is general, as it accommodates practical target distributions derived from early stopping techniques with bounded support.
The crux of the analysis lies in the global Lipschitz bound of the score, which is shown from the Gaussian tail assumption by a dimension-independent estimate of the heat kernel. Consequently, our complexity bound scales linearly (up to a logarithmic constant) with the square root of the trace of the covariance operator, which relates to the invariant distribution of the forward process.
URL: https://openreview.net/forum?id=QbQ4DtP5vS
---
Title: $\texttt{LucidAtlas}$: Learning Uncertainty-Aware, Covariate-Disentangled, Individualized Atlas Representations
Authors: Yining Jiao, Sreekalyani Bhamidi, Carlton Jude ZDANSKI, Huaizhi Qu, Julia S Kimbell, Andrew Prince, Cameron P Worden, Samuel Kirse, Christopher Rutter, Benjamin H Shields, Jisan Mahmud, Tianlong Chen, Marc Niethammer
Abstract: Interpreting how covariates influence spatially structured biological variation — for example, how pediatric airway geometry changes along the airway and across a growing population — remains a key challenge in developing models suitable for clinical application. We present $\texttt{LucidAtlas}$, a versatile framework for modeling and interpreting spatially varying information with associated covariates. To address the limitations of neural additive models when analyzing dependent covariates, we introduce a marginalization approach that enables accurate explanations of how combinations of covariates shape the learned atlas. $\texttt{LucidAtlas}$ integrates covariate interpretation, spatial representation, individualized prediction, population distribution analysis, and out-of-distribution detection into a single interpretable model. We validate its effectiveness on a synthetic spatiotemporal dataset, the OASIS brain volume dataset, and a pediatric airway shape dataset. Our findings underscore the critical role of by-construction interpretable models in advancing scientific discovery. The implementation is publicly available at https://github.com/****.
URL: https://openreview.net/forum?id=3FbNwC8ua8
---
Title: Interference-Aware K-Step Reachable Communication in Multi-Agent Reinforcement Learning
Authors: Ziyu Cheng, Jinsheng Ren, Jun Yang, Zhouxian Jiang, Chenzhihang Li, Rongye Shi, Bin Liang
Abstract: Effective communication is pivotal for addressing complex collaborative tasks in multi-agent reinforcement learning (MARL). Yet, limited communication bandwidth and dynamic, intricate environmental topologies present significant challenges in identifying high-value communication partners. Agents must consequently select collaborators under uncertainty, lacking a priori knowledge of which partners can deliver task-critical information. To this end, we propose Interference-Aware $K$-Step Reachable Communication (IA-KRC), a novel framework that enhances cooperation via two core components: (1) a $K$-Step reachability protocol that confines message passing to physically accessible neighbors, and (2) an interference-prediction module that optimizes partner choice by minimizing interference while maximizing utility. Compared to existing methods, IA-KRC enables substantially more persistent and efficient cooperation despite environmental interference. Comprehensive evaluations confirm that IA-KRC achieves superior performance compared to state-of-the-art baselines, while demonstrating enhanced robustness and scalability in complex topological and highly dynamic multi-agent scenarios.
URL: https://openreview.net/forum?id=8Fo2AwQE9z
---
Title: Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
Authors: Yixuan Even Xu, Yash Savani, Fei Fang, J Zico Kolter
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion—max-variance down-sampling—that maximizes the variance of reward in the selected subset, and provide an efficient $O(n\log n)$ implementation of this rule. Empirically, Group Relative Policy Optimization (GRPO) coupled with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
URL: https://openreview.net/forum?id=MfHOmgqVXM
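The max-variance down-sampling rule and its $O(n\log n)$ implementation can be sketched directly. The key structural fact is that a maximum-variance subset of size m always consists of the k largest plus the m-k smallest rewards for some k, so sorting followed by a prefix-sum scan over k suffices. This is a sketch of the selection rule only (function name and toy rewards are ours); PODS couples it with GRPO-style policy updates:

```python
def max_variance_downsample(rewards, m):
    """Select the size-m subset of rollout rewards with maximal variance.
    The optimum takes the k largest and m-k smallest rewards for some k,
    so sorting (O(n log n)) plus a linear scan over k finds it."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    vals = [rewards[i] for i in order]
    prefix, sq = [0.0], [0.0]                 # prefix sums of values and squares
    for v in vals:
        prefix.append(prefix[-1] + v)
        sq.append(sq[-1] + v * v)
    best_var, best_k = -1.0, 0
    for k in range(m + 1):                    # k rollouts taken from the top
        lo = m - k                            # the rest from the bottom
        s = prefix[lo] + (prefix[n] - prefix[n - k])
        s2 = sq[lo] + (sq[n] - sq[n - k])
        var = s2 / m - (s / m) ** 2
        if var > best_var:
            best_var, best_k = var, k
    idx = order[: m - best_k] + order[n - best_k:]
    return idx, best_var
```

Intuition for the scan: an exchange argument shows that swapping any interior reward for a more extreme unused one never decreases variance, so only the "k from the top, rest from the bottom" splits need checking.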
---
Title: Contrastive VQ Priors for Multi-Class Plaque Segmentation via SAM Adaptation
Authors: Ruan Yizhe, Yusuke Kurose, JUNICHI IHO, Yoji Tokunaga, Makoto Horie, YUSAKU HAYASHI, Keisuke Nishizawa, Yasushi Koyama, Tatsuya Harada
Abstract: Accurate plaque subtype segmentation in coronary CT angiography (CCTA) is clinically relevant yet remains difficult in practice, where annotations are scarce, and the visual evidence for non-calcified lesions is subtle and highly variable. Meanwhile, segmentation foundation models such as SAM provide strong robustness from large-scale pretraining, but their benefits do not reliably transfer to private CCTA tasks under naïve fine-tuning, especially for multi-class plaque taxonomy. We present a targeted strategy to transfer SAM's segmentation robustness to a private CCTA setting by injecting a task-specific, texture-aware prior into the SAM feature stream. Our framework is two-stage: (i) we learn a discrete latent prior from the private CCTA data using a vector-quantized autoencoder, and structure it with supervised contrastive learning to emphasize hard class boundaries; (ii) we fuse this prior into a SAM-based encoder through a query-based feature-aware cross-attention module, and decode with a multi-class head/decoder tailored for plaque taxonomy. On this private CCTA cohort, the proposed design improves overall performance over the compared baselines, with the largest gains on vessel wall and non-calcified plaque. Ablations suggest that the class-structured prior, query-based fusion, and multi-class decoding each contribute to the final result within this setting.
URL: https://openreview.net/forum?id=5P7HfuejgL
---
Title: FedLog: Personalized Federated Classification with Less Communication and More Flexibility
Authors: Haolin Yu, Guojun Zhang, Hongliang Li, Pascal Poupart
Abstract: Federated representation learning (FRL) aims to learn personalized federated models with effective feature extraction from local data. FRL algorithms that share the majority of the model parameters face significant challenges with huge communication overhead. This overhead stems from the millions of neural network parameters and slow aggregation progress of the averaging heuristic. To reduce the overhead, we propose FedLog, which shares sufficient data summaries instead of raw model parameters. The data summaries encode minimal sufficient statistics of an exponential family, and Bayesian inference is utilized for global aggregation. FedLog helps reduce message sizes and communication frequency. We prove that the shared messages are minimal sufficient statistics and theoretically analyze the convergence rate of FedLog. To further ensure formal privacy guarantees, we extend FedLog with the differential privacy framework. Empirical results demonstrate high learning accuracy with low communication overhead of our method.
URL: https://openreview.net/forum?id=7Hwk0bvvKn
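The core communication idea, sharing additive exponential-family sufficient statistics rather than model weights, can be sketched with a toy Gaussian-prototype classifier. FedLog's actual model and Bayesian aggregation are richer; everything below (names, the known-variance Gaussian assumption, the synthetic clients) is our illustration:

```python
import numpy as np

def client_summary(features, labels, n_classes):
    """Per-class sufficient statistics for a Gaussian model with known
    variance: counts and feature sums. These are far smaller than the
    model weights and aggregate exactly by addition."""
    d = features.shape[1]
    counts = np.zeros(n_classes)
    sums = np.zeros((n_classes, d))
    for c in range(n_classes):
        mask = labels == c
        counts[c] = mask.sum()
        sums[c] = features[mask].sum(axis=0)
    return counts, sums

def aggregate(summaries):
    """Server side: sufficient statistics just add up across clients."""
    counts = sum(s[0] for s in summaries)
    sums = sum(s[1] for s in summaries)
    return sums / np.maximum(counts, 1)[:, None]   # class prototypes

rng = np.random.default_rng(1)
def make_client():
    # two classes centered at 0 and 5 in 2-D
    x0 = rng.normal(0, 1, size=(50, 2))
    x1 = rng.normal(5, 1, size=(50, 2))
    return np.vstack([x0, x1]), np.array([0] * 50 + [1] * 50)

clients = [make_client() for _ in range(2)]
protos = aggregate([client_summary(x, y, 2) for x, y in clients])
```

Because the statistics are sufficient, the server recovers the same class prototypes it would have computed from the pooled data, at a fraction of the message size.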
---
Title: SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
Authors: Sebastiano Monti, Carlo Nicolini, Giovanni Pellegrini, Jacopo Staiano, Bruno Lepri
Abstract: Although the capabilities of Large Language Models and Large Reasoning Models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated.
In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence.
Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting non-recoverable error accumulation under single-pass autoregressive decoding.
We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools yields modest improvements, suggesting that failures in character-level counting and in long yet simple state tracking might not be overcome by test-time scaling approaches alone.
URL: https://openreview.net/forum?id=pLosAkOoGU
---
Title: Continuous Treatment Effect Estimation with Cauchy-Schwarz Divergence Information Bottleneck
Authors: Louk van Remmerden, Shiqin Tang, Shujian Yu
Abstract: Estimating conditional average treatment effects (CATE) for continuous and multivariate treatments remains a fundamental yet underexplored problem in causal inference, as most existing methods are confined to binary treatment settings. In this paper, we make two key theoretical contributions. First, we derive a novel counterfactual error bound based on the Cauchy–Schwarz (CS) divergence, which is provably tighter than prior bounds derived from the Kullback–Leibler (KL) divergence. Second, we strengthen this bound by integrating the Information Bottleneck principle, introducing a compression regularization on latent representations to enhance generalization. Building on these insights, we propose a new neural framework that operationalizes our theory. Extensive experiments on three benchmarks show that our method consistently outperforms state-of-the-art baselines and remains robust under biased treatment assignments.
URL: https://openreview.net/forum?id=9SvY0mMr2u
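The Cauchy–Schwarz divergence underlying the bound is $D_{CS}(p,q) = -\log\big[(\int pq)^2 / (\int p^2 \int q^2)\big]$; it is non-negative and zero iff $p = q$. A standard kernel-based estimator can be sketched as follows (the Gaussian kernel, fixed bandwidth, and 1-D test data are our illustrative choices, not the paper's estimator):

```python
import numpy as np

def cs_divergence(x, y, sigma=1.0):
    """Plug-in kernel estimator of the Cauchy-Schwarz divergence:
    D_CS = -2 log E[k(x, y)] + log E[k(x, x')] + log E[k(y, y')],
    with a Gaussian kernel of hand-picked bandwidth `sigma`."""
    def gram_mean(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2)).mean()
    return (-2 * np.log(gram_mean(x, y))
            + np.log(gram_mean(x, x))
            + np.log(gram_mean(y, y)))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=(400, 1))
b = rng.normal(0, 1, size=(400, 1))   # same law as a
c = rng.normal(3, 1, size=(400, 1))   # shifted distribution
```

On matched samples the estimate is near zero; shifting one distribution makes it clearly positive, which is the behavior the counterfactual bound exploits.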
---
Title: Influence Estimation in Statistical Models Using the Fisher Information Matrix
Authors: Omri Lev, Ashia C. Wilson
Abstract: Quantifying how infinitesimal perturbations of training data affect a model is key to diagnosing and improving learning systems. This task was addressed via the notion of influence functions \citep{Hampel_Influence, KohLiang_Influence_DL, koh2019accuracy, Bae_PBRF_Influence}. Following classical works, whenever the underlying problem can be cast as a weighted empirical risk minimization problem, many such influence estimators rely on the Fisher Information Matrix (FIM). Along these lines, we provide a new accuracy analysis that characterizes the asymptotic behavior of FIM-based influence estimators and compare these to Hessian-based influence estimators, further extending the theory to objectives with non-differentiable regularizers. The results obtained are broadly applicable and admit an efficient algorithm with favorable computational complexity. Simulations on realistic setups demonstrate its usefulness in terms of accuracy and computational efficiency in many learning settings.
URL: https://openreview.net/forum?id=1kYIBaXCG8
---
Title: On the Conditioning Consistency Gap in Conditional Neural Processes
Authors: Robin Young
Abstract: Neural processes are meta-learning models that map context sets to predictive distributions. While inspired by stochastic processes, NPs do not generally satisfy the Kolmogorov consistency conditions required to define a valid stochastic process. This inconsistency is widely acknowledged but poorly understood. Practitioners note that NPs work well despite the violation, without quantifying what this means. We address this gap by defining the conditioning consistency gap, a KL divergence measuring how much a conditional neural process's (CNP) predictions change when a point is added to the context versus conditioned upon. Our main results show that for CNPs with bounded encoders and Lipschitz decoders, the consistency gap is $O(1/n^2)$ in context size $n$, and that this rate is tight. These bounds establish the precise sense in which CNPs approximate valid stochastic processes. The inconsistency is negligible for moderate context sizes but can be significant in the few-shot regime.
URL: https://openreview.net/forum?id=rLJ5Hm5vbG
---
Title: MixTraining: A Better Trade-Off Between Compute and Performance
Authors: Zexin Li, Jiancheng Zhang, Yufei Li, Yinglun Zhu, Cong Liu
Abstract: Integrating self-supervised learning (SSL) prior to supervised learning (SL) is a prevalent strategy for enhancing model performance, especially in scenarios with limited labeled data. Nonetheless, this approach inherently introduces a trade-off between computational efficiency and performance gains. Although SSL significantly improves representation learning, it necessitates an additional and often computationally expensive training phase, posing substantial overhead in resource-constrained environments. To mitigate these limitations, we propose MixTraining, a novel training framework designed to interleave multiple epochs of SSL and SL within a unified $\textit{mixtraining phase}$. This phase enables a seamless transition between self-supervised and supervised objectives, facilitating enhanced synergy and improved overall accuracy. Additionally, MixTraining consolidates shared computational steps, thereby reducing redundant computations and lowering overall training latency. Comprehensive experimental evaluations demonstrate that MixTraining provides a superior trade-off between computational efficiency and model performance compared to conventional training pipelines. Specifically, on the TinyImageNet dataset using the ViT-Tiny model, MixTraining achieves an absolute accuracy improvement of 8.81% (a relative gain of 18.89%) while concurrently accelerating training by 1.29$\times$.
URL: https://openreview.net/forum?id=NVpS2g9KRo
---
Title: Preference-Based Gradient Estimation for ML-Guided Approximate Combinatorial Optimization
Authors: Arman Mielke, Uwe Bauknecht, Thilo Strauss, Mathias Niepert
Abstract: Combinatorial optimization (CO) problems arise across a broad spectrum of domains, including medicine, logistics, and manufacturing. While exact solutions are often computationally infeasible, many practical applications require high-quality solutions within a given time budget. To address this, we propose a learning-based approach that enhances existing non-learned heuristics for CO. Specifically, we parameterize these heuristics and train graph neural networks (GNNs) to predict parameter values that yield near-optimal solutions. Our method is trained end-to-end in a self-supervised fashion, using a novel gradient estimation scheme that treats the heuristic as a black box. This approach combines the strengths of learning and traditional algorithms: the GNN learns from data to guide the algorithm toward better solutions, while the heuristic ensures feasibility. We validate our method on two well-known combinatorial optimization problems: the travelling salesman problem (TSP) and the minimum k-cut problem. Our results demonstrate that the proposed approach is competitive with state-of-the-art learned CO solvers.
URL: https://openreview.net/forum?id=2S224XC378
---
Title: α-OCC: Uncertainty-Aware Camera-based 3D Semantic Occupancy Prediction
Authors: Sanbao Su, Nuo Chen, Chenchen Lin, Felix Juefei-Xu, Chen Feng, Fei Miao
Abstract: Comprehending 3D scenes is paramount for tasks such as planning and mapping for autonomous vehicles and robotics. Camera-based 3D Semantic Occupancy Prediction (OCC) aims to infer scene geometry and semantics from limited observations. While it has gained popularity due to affordability and rich visual cues, existing methods often neglect the inherent uncertainty in models. To address this, we propose an uncertainty-aware OCC method (α-OCC). We first introduce Depth-UP, an uncertainty propagation framework that improves geometry completion by up to 11.58% and semantic segmentation by up to 12.95% across various OCC models. For uncertainty quantification (UQ), we propose the hierarchical conformal prediction (HCP) method, effectively handling the high-level class imbalance in OCC datasets. On the geometry level, the novel KL-based score function significantly improves the occupied recall (45%) of safety-critical classes with minimal performance overhead (3.4% reduction). On UQ, our HCP achieves smaller prediction set sizes while maintaining the defined coverage guarantee. Compared with baselines, it reduces set size by up to 90%, with a further 18% reduction when integrated with Depth-UP. Our contributions advance OCC accuracy and robustness, marking a noteworthy step forward in autonomous perception systems. Our code is publicly available at https://coperception.github.io/alpha-OCC/.
URL: https://openreview.net/forum?id=bUv25gBLlV
---
Title: Task-Specific Exploration in Meta-Reinforcement Learning via Task Reconstruction
Authors: Radu Stoican, Angelo Cangelosi, Christian Goerick, Thomas H Weisswange
Abstract: Reinforcement learning trains policies specialized for a single task. Meta-reinforcement learning (meta-RL) improves upon this by leveraging prior experience to train policies for few-shot adaptation to new tasks. However, existing meta-RL approaches often struggle to explore and learn tasks effectively. We introduce a novel meta-RL algorithm that learns to learn task-specific exploration policies for sample-efficient few-shot adaptation. We achieve this through task reconstruction, an original method for learning to identify and collect small but informative datasets from tasks. To leverage these datasets, we also propose learning a meta-reward that encourages policies to learn to adapt. Empirical evaluations demonstrate that our algorithm achieves higher returns than existing meta-RL methods. Additionally, we show that even with full task information, adaptation is more challenging than previously assumed. However, policies trained with our meta-reward adapt to new tasks successfully.
URL: https://openreview.net/forum?id=VRRapVcaJH
---
Title: Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power
Authors: Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao
Abstract: Equivariant neural networks encode the intrinsic symmetry of data as an inductive bias, which has achieved impressive performance in wide domains. However, the understanding of their expressive power remains limited. Focusing on 2-layer ReLU networks, this paper investigates the impact of enforcing equivariance constraints on expressive power. By examining the boundary hyperplanes and the channel vectors, we constructively demonstrate that enforcing equivariance constraints can undermine the expressive power. Naturally, this drawback can be compensated for by enlarging the model size -- we further prove upper bounds on the required enlargement for compensation. Surprisingly, we show that the enlarged neural architectures have reduced hypothesis space dimensionality, implying even better generalizability.
URL: https://openreview.net/forum?id=z5bJ44Brc4
---
Title: ABCDE: Agentic-Based Controlled Dynamic Erasure for Intent-Aware Safety Reasoning
Authors: Ping Liu, CHI ZHANG
Abstract: Concept erasure has emerged as a central mechanism for safety alignment in text-conditioned generative models, yet most existing approaches implicitly adopt an unconditional suppression paradigm in which target concepts are removed whenever they appear, regardless of contextual intent.
This formulation conflates benign and harmful concept usage, leading to systematic over-suppression that unnecessarily censors policy-compliant content and degrades model utility.
We argue that safety intervention should instead be framed as a decision problem grounded in contextual language understanding, rather than as a purely mechanistic removal operation.
Based on this perspective, we introduce Intent-Aware Concept Erasure (ICE), a decision-centric formulation that explicitly separates the question of whether a concept should be suppressed from how suppression is realized, enabling context-sensitive intervention policies that preserve benign usage while maintaining safety guarantees.
To operationalize this formulation, we present Agentic-Based Controlled Dynamic Erasure (ABCDE), an agentic framework that infers a stable intervention decision from semantic context and realizes it through minimal prompt-level intervention with closed-loop multimodal output feedback, without modifying model parameters.
To enable principled evaluation of intent-aware intervention, we further construct the Context-Aware Erasure Benchmark (CAEB), a paired benchmark comprising 500 prompts over 10 object concepts and 100 prompts over 5 artist styles, in which the same concept appears in both removal-required and preservation-required contexts.
Experiments on CAEB show that ABCDE achieves substantially higher precision than unconditional baselines while maintaining strong recall, demonstrating effective avoidance of unnecessary suppression in benign contexts.
URL: https://openreview.net/forum?id=IFjPhMcXJB
---
Title: DOME: Distributed Online Learning based Multi-Estimate Fusion for Cooperative Predictive Target Tracking Using a Robotic Swarm
Authors: Shubhankar Gupta, Saksham Sharma, Suresh Sundaram
Abstract: This paper investigates cooperative predictive target tracking using a robotic swarm operating under high prediction bias and communication uncertainty. The robots interact over a randomly time-varying communication network and exhibit heterogeneity in onboard sensors and prediction algorithms. To address these challenges, a Distributed Online learning-based Multi-Estimate (DOME) fusion algorithm is proposed, which performs a collaborative weighted fusion of local and socially shared predictions. The fusion weights are adapted online using feedback from a prediction loss. Theoretical analysis establishes that conditional expectations of the fusion weights converge under reasonable assumptions. Simulation studies demonstrate that DOME outperforms both covariance-based and online learning-based decentralized fusion baselines, achieving $74\%$ and $72.4\%$ lower prediction loss in performance and scalability tests, respectively -- particularly under conditions involving significant model drift and communication unreliability. Further, DOME fusion is implemented in a ROS-Gazebo simulation environment.
URL: https://openreview.net/forum?id=aF5PHD6vll
---
Title: CI-CBM: Class-Incremental Concept Bottleneck Model for Interpretable Continual Learning
Authors: Amirhosein Javadi, Tuomas Oikarinen, Tara Javidi, Tsui-Wei Weng
Abstract: Catastrophic forgetting remains a fundamental challenge in continual learning, in which models often forget previous knowledge when fine-tuned on a new task. This issue is especially pronounced in class incremental learning (CIL), which is the most challenging setting in continual learning. Existing methods to address catastrophic forgetting often sacrifice either model interpretability or accuracy. To address this challenge, we introduce Class-Incremental Concept Bottleneck Model (CI-CBM), which leverages effective techniques, including concept regularization and pseudo-concept generation, to maintain interpretable decision processes throughout incremental learning phases. Through extensive evaluation on seven datasets, CI-CBM achieves comparable performance to black-box models and outperforms previous interpretable approaches in CIL, with an average 36\% accuracy gain. CI-CBM provides interpretable decisions on individual inputs and understandable global decision rules, as shown in our experiments, thereby demonstrating that human-understandable concepts can be maintained during incremental learning without compromising model performance. Our approach is effective in both pretrained and non-pretrained scenarios; in the latter, the backbone is trained from scratch during the first learning phase.
URL: https://openreview.net/forum?id=Wf6OpLgj2i
---
Title: Anytime Verified Agents: Adaptive Compute Allocation for Reliable LLM Reasoning under Budget Constraints
Authors: Dipkumar Patel
Abstract: Large language model (LLM) agents can perform multi-step reasoning, planning, and tool use. However, their performance scales with the computational budget. Existing methods allocate computational resources using static strategies such as fixed search depths, constant self-consistency sampling, or uniform verification, so simple problems can consume as much compute as complex tasks. We present Anytime Verified Agents (AVA), a framework that dynamically allocates compute across search, sampling, and verification within a user-specified budget, with an extensible interface for tool use. AVA combines calibrated uncertainty estimation, value-of-information-guided search expansion, and selective verification cascades with early exits. The controller allocates compute based on uncertainty and estimated marginal reliability gains. AVA is evaluated on mathematical reasoning (GSM8K and MATH), multi-hop question answering (HotpotQA), and code generation (HumanEval), with two model backends (GPT-5 and GPT-4o), and compared to fixed-depth search, self-consistency, and always-verify baselines. Across these benchmarks, AVA reduces cost at matched reliability thresholds while maintaining comparable accuracy.
URL: https://openreview.net/forum?id=JMDCMf7mlF
---
Title: Differentially Private and Scalable Estimation of the Network Principal Component
Authors: Alireza Khayatian, Anil Vullikanti, Aritra Konar
Abstract: Computing the principal component (PC) of the adjacency matrix of an undirected graph has several applications ranging from identifying key vertices for influence maximization and controlling diffusion processes, to discovering densely interconnected vertex subsets. However, many networked datasets are sensitive, which necessitates private computation of the PC for use in the aforementioned applications. Differential privacy has emerged as the gold standard in privacy-preserving data analysis, but existing DP algorithms for private PC suffer from low accuracy due to large noise injection or high complexity. Motivated by the large gap between the local and global sensitivities of the PC on real-world graphs, we consider instance-specific mechanisms for privately computing the PC under edge-DP. These mechanisms guarantee privacy for all datasets, but provide good utility on ``well-behaved'' datasets by injecting smaller amounts of noise. More specifically, we consider the Propose-Test-Release (PTR) framework. Although computationally expensive in general, we design a novel approach for implementing a PTR variant in the same time as computing a non-private PC, while offering good utility.
Our framework first tests, in a differentially private manner, whether a given graph is ``well-behaved'', and then tests whether a noisy PC with small noise can be released privately.
As a consequence, this also leads to the first DP algorithm for the Densest-$k$-subgraph problem, a key graph mining primitive.
We run our method on diverse real-world networks, with the largest having 3 million vertices, and compare its utility to a pre-existing baseline based on the private power method (PPM).
Although PTR requires a slightly larger privacy budget, on average, it achieves a 180-fold improvement in runtime over PPM.
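As a rough illustration of the Propose-Test-Release pattern this work builds on (a generic sketch of the classical mechanism of Dwork and Lei, not the paper's scalable, PC-specific instantiation), the test-then-release logic can be written as follows; `dist_to_unstable`, the number of record changes needed before the local sensitivity of the statistic exceeds the proposed bound, is a problem-specific subroutine supplied by the caller:

```python
import numpy as np


def ptr_release(value, dist_to_unstable, proposed_bound, eps, delta, rng):
    """Generic Propose-Test-Release sketch (illustrative only).

    value:             f(D), the statistic to release.
    dist_to_unstable:  how many records must change before the local
                       sensitivity of f exceeds proposed_bound
                       (problem-specific; an input in this sketch).
    """
    # Privately test whether D is far from every "badly behaved" dataset.
    noisy_dist = dist_to_unstable + rng.laplace(scale=1.0 / eps)
    if noisy_dist <= np.log(1.0 / delta) / eps:
        return None  # decline to answer
    # Well-behaved: add only small noise calibrated to the proposed bound,
    # not to the (much larger) global sensitivity.
    return value + rng.laplace(scale=proposed_bound / eps)
```

A dataset far from any unstable instance passes the noisy test and receives only small, bound-calibrated noise; otherwise the mechanism declines to answer, which is what preserves worst-case privacy.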
URL: https://openreview.net/forum?id=V0BjWbrAYC
---
Title: Boosting Text Encoder for Personalized Text-to-Image Generation
Authors: NaHyeon Park, Kunhee Kim, Hyunjung Shim
Abstract: In this paper, we introduce TextBoost, an efficient one-shot personalization approach for text-to-image diffusion models. Traditional personalization methods typically involve fine-tuning extensive portions of the model, leading to substantial storage requirements and slow convergence. In contrast, we propose selectively fine-tuning only the text encoder, significantly improving computational and storage efficiency. To preserve the original semantic integrity, we develop a novel causality-preserving adaptation mechanism. Additionally, lightweight adapters are employed to locally refine text embeddings immediately before their interaction with cross-attention layers, greatly enhancing the expressiveness of text embeddings with minimal computational overhead. Empirical evaluations across diverse concepts demonstrate that TextBoost achieves faster convergence and substantially reduces storage demands by minimizing the number of trainable parameters. Furthermore, TextBoost maintains comparable subject fidelity, superior text fidelity, and greater generation diversity compared to existing methods. We show that our proposed method offers an efficient, scalable, and practically applicable solution for high-quality text-to-image personalization, particularly beneficial in resource-constrained environments.
URL: https://openreview.net/forum?id=hiZzk1nHuV
---
Title: Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion
Authors: Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji
Abstract: Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we show that MaskGIT is asymptotically equivalent to a choose-then-sample (CTS) formulation, instantiated as the “moment sampler,” which explicitly separates index selection from token sampling. This CTS reformulation is essential: it yields unbiased token sampling and exposes an algorithmic design space for index selection, both of which are inaccessible in MaskGIT’s original formulation. Regarding token sampling, we reveal that MaskGIT implicitly adopts a low-temperature sampler, which explains why MaskGIT often degrades with more sampling steps. The CTS reformulation of MaskGIT allows us to correct the sampling temperature to ensure unbiasedness. We also improve the index selection in CTS through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains validate our theory and demonstrate the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
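To make the choose-then-sample idea concrete, here is a minimal toy sketch (our own illustration, using a max-probability selection score as a simple stand-in for the paper's moment-based criterion):

```python
import numpy as np


def cts_step(probs, masked, rng):
    """One choose-then-sample (CTS) step.

    probs:  (L, V) array of per-position token marginals from the model.
    masked: indices of still-masked positions.
    """
    # Choose: pick the masked index where the model is most confident
    # (a stand-in for a more refined selection score).
    scores = probs[masked].max(axis=1)
    i = int(masked[int(np.argmax(scores))])
    # Then sample: draw the token from the full categorical at that
    # position, keeping token sampling unbiased (no implicit low
    # temperature, unlike ranking positions by sampled confidences).
    tok = int(rng.choice(probs.shape[1], p=probs[i]))
    return i, tok
```

Iterating this step, removing each chosen index from `masked`, unmasks the sequence one position at a time while keeping every token draw unbiased.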
URL: https://openreview.net/forum?id=mKlW68i2Ig
---
New submissions
===============
Title: PAVO: Pipeline-Aware Voice Orchestration with Demand-Conditioned Inference Routing
Abstract: Voice agents built on ASR-LLM-TTS pipelines allocate compute statically, wasting resources on simple queries while under-serving complex ones. We present PAVO (Pipeline-Aware Voice Orchestrator), which routes each turn through a three-stage pipeline, orchestrating routes based on demand signals extracted before transcription even begins. We observe that ASR errors propagate to downstream LLMs in two distinct regimes: a sharp factual-accuracy cliff and gradual semantic degradation. This yields inter-stage coupling constraints that prior routing systems ignore. We validate this structure on n=5,430 direct calibration measurements across two hardware platforms (H100, M3) and three LLM families (Llama 3.1 8B, Mistral 7B, Gemma2 2B). Enforcing these constraints via hard logit masking in an 85K-parameter RL-trained meta-controller reduces coherence failures by 7.9x, and achieves 34% lower median latency and 71% lower energy than rigid cloud baselines on a 50K-turn simulated benchmark. Direct H100 experiments on 200 LibriSpeech samples confirm a 10.3% P95 tail compression (p = 2x10^-6). Code and data are publicly available.
URL: https://openreview.net/forum?id=zrneoIxlFx
---
Title: Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval
Abstract: With the rapid proliferation of multimodal information, Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. Unlike traditional natural image retrieval, visual documents exhibit unique characteristics defined by dense textual content, intricate layouts, and fine-grained semantic dependencies. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era. We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retrieval-Augmented Generation (RAG) and Agentic systems for complex document intelligence. Finally, we identify persistent challenges and outline promising future directions, aiming to provide a clear roadmap for future multimodal document intelligence.
URL: https://openreview.net/forum?id=w7rzA7YkSc
---
Title: Investigating the limits of free-form debate as a scalable oversight strategy
Abstract: Debate is a scalable oversight method involving two copies of a strong model trained to defend alternative responses to a question, with a weaker judge evaluating which answer is better supported. We replicate and extend a result from prior work demonstrating that training Llama3-8B-Instruct-262k as a debater led to increased performance of a GPT-4-class judge model on QuALITY, a question-answering task that grants the debaters a capability advantage via information asymmetry. When replicating the original setup as closely as possible, we confirm that training debater models in free-form, multi-round debate increased judge accuracy. However, this finding did not generalize across alternative tasks or models, and did not replicate consistently under our closest approximation to the original setting. These results suggest that the effectiveness of free-text debate as a scalable oversight method is sensitive to task structure, model pairing, and training conditions, and highlight the need for greater understanding of when and why debate improves judge accuracy. We identify several factors that may influence debate's success and outline directions for future work aimed at characterizing the conditions under which debate strengthens oversight reliably.
URL: https://openreview.net/forum?id=vRCGzuAhOM
---
Title: Learning from Complaints: Adversarial Disentanglement for Robust Scalper Detection in E-Commerce Promotions
Abstract: Identifying scalpers in e-commerce promotions is a critical challenge where instance-dependent label noise is pervasive: legitimate users with ambiguous patterns (e.g., frequent on-the-hour purchases of high-subsidy items and orders shipped to non-habitual addresses) are often misclassified as scalpers, leading to user complaints and operational costs. This issue is further amplified in real-time risk control, where model iteration largely relies on historical review/penalty labels, forming a closed-loop supervision that reinforces false positives as positives over time. Existing noise-handling methods (e.g., reweighting or filtering) largely treat such errors as random noise and fail to address the root cause—intrinsic feature overlap between scalpers and certain normal users.
We propose GUARD (Grounded User-feedback Adversarial Representation Disentanglement), a complaint-aware framework that learns risk-predictive representations while being insensitive to complaint-triggering superficial cues. Here, grounded means the adversarial supervision is anchored in complaint-verified false positives, rather than raw complaints. GUARD defines a Confusion Domain from these verified cases and uses it as direct supervision for a GRL-based adversarial objective, encouraging the encoder to be invariant to Confusion-Domain membership while remaining predictive of scalper risk. The model is trained in a multi-task manner with a primary risk head (reliable enforcement labels) and an adversarial confusion head. To mitigate the scarcity and bias of verified complaints, we expand the Confusion Domain via MC Dropout uncertainty sampling, mining potential false-positive candidates from a large pool of processed candidate orders, while filtering out high-confidence scalpers using existing high-precision blacklist rules to reduce contamination.
We evaluate GUARD on a large-scale e-commerce promotion platform. In a 14-day online A/B test with thresholds calibrated to match recall, GUARD improves precision by +8.9 points and reduces the complaint rate by 13.5%, while keeping subsidy loss statistically unchanged. GUARD is deployed in production now.
URL: https://openreview.net/forum?id=yrYe2I0HzB
---
Title: Variance Reduction in Sketching Algorithms via Complex Random Variables
Abstract: The seminal \texttt{CountSketch} algorithm of~\cite{count_sketch} compresses high-dimensional real-valued vectors while approximately preserving pairwise inner products in time proportional to the sparsity of the input data. However, the estimator's high variance limits its reliability. In this work, we propose a simple modification of the \texttt{CountSketch} algorithm that not only reduces the variance of the estimate but also maintains its input-sparsity running time. Our key idea is to replace the real-valued Rademacher signs $\{-1,1\}$ used in \texttt{CountSketch} with $\{1,\omega,\omega^2,\omega^3\}$ (the fourth roots of unity). We further extend this idea to the well-known sketching algorithms \texttt{TensorSketch}~\cite{pham2013fast} and \texttt{Recursive TensorSketch}~\cite{DBLP:conf/soda/AhleKKPVWZ20} for a high-degree polynomial kernel, and obtain improvements in the variance. For \texttt{TensorSketch}, our proposal achieves exponential improvements in the variance, reducing it from $O(3^p)$ to $O(2^p)$. \cite{wacker2022improved} also gives a similar improvement in the variance by exploiting complex random variables. However, the main advantage of our proposal is its running time, which depends on the input sparsity, $O(p\cdot \mathrm{nnz}(\mathbf{x}))$, whereas for~\cite{wacker2022improved} it is $O(pd)$, for an input $\bigotimes_{i=1}^{p} \mathbf{x}$ where $\mathbf{x}\in \mathbb{R}^d$. We further extend our technique to \texttt{Recursive TensorSketch}, a state-of-the-art sketching algorithm for polynomial kernels. Our proposal has lower variance than \texttt{Recursive TensorSketch} while retaining the same input-sparsity running time.
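The core modification is small enough to sketch directly. Below is an illustrative implementation (our own, not the authors' code) of CountSketch with fourth-roots-of-unity signs; the inner-product estimator takes the real part of the complex sketch inner product:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1000, 64                       # input dimension, sketch size
h = rng.integers(0, m, size=d)        # bucket hash for each coordinate
s = 1j ** rng.integers(0, 4, size=d)  # random signs in {1, i, -1, -i}


def sketch(x):
    # CountSketch scatter-add; cost is proportional to nnz(x), exactly
    # as with Rademacher signs.
    y = np.zeros(m, dtype=complex)
    np.add.at(y, h, s * x)
    return y


x = rng.standard_normal(d)
z = rng.standard_normal(d)
# Inner-product estimate: real part of the complex sketch inner product.
est = float(np.real(np.vdot(sketch(z), sketch(x))))
```

Conjugation in `np.vdot` cancels each |s_i|^2 = 1 on the diagonal, so the estimator stays unbiased; only hash-collision cross terms, whose random phases now take four values instead of two, contribute variance.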
URL: https://openreview.net/forum?id=0CzpyVDwcf
---
Title: AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT
Abstract: We introduce \textbf{AAC} (Architecturally Admissible Compressor), a differentiable landmark-selection module for ALT (A*, Landmarks, and Triangle inequality) shortest-path heuristics whose outputs are admissible by construction: each forward pass is a row-stochastic mixture of triangle-inequality lower bounds, so the heuristic is admissible for \emph{every} parameter setting without requiring convergence, calibration, or projection. At deployment, the module reduces to classical ALT on a learned subset, composing end-to-end with neural encoders while preserving the classical toolchain. The construction is the first differentiable instance of the compress-while-preserving-admissibility tradition in classical heuristic search.
Under a matched per-vertex memory protocol, we establish that ALT with farthest-point-sampling landmarks (FPS-ALT) has provably near-optimal coverage on metric graphs, leaving at most a few percentage points of headroom for \emph{any} selector. AAC operates near this ceiling: the gap is $0.9$--$3.9$ percentage points on 9 road networks and ${\leq}1.3$ percentage points on synthetic graphs, with zero admissibility violations across $1{,}500+$ queries and all logged runs. At matched memory, AAC is also $1.2$--$1.5{\times}$ faster than FPS-ALT at the median query on DIMACS road networks, amortizing its offline cost within $170$--$1{,}924$ queries. A controlled ablation isolates the binding constraint: training-objective drift under default initialization, not architectural capacity; identity-on-first-$m$ initialization closes the expansion-count gap entirely. We release the module, a reusable matched-memory benchmarking protocol with paired two-one-sided-test (TOST) equivalence and pre-registration, and a reference compressed-differential-heuristics baseline.
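The admissible-by-construction claim rests on a one-line argument that a tiny sketch makes concrete: each ALT bound |d(L, t) - d(L, v)| is at most d(v, t) by the triangle inequality, so any convex combination of such bounds is too. A hypothetical minimal version (our illustration, not the released module):

```python
import numpy as np


def aac_heuristic(d_land_v, d_land_t, logits):
    """Heuristic value for vertex v toward target t.

    d_land_v, d_land_t: distances from each landmark to v and to t;
    logits: learnable selection scores (any real values).
    """
    # Each ALT bound |d(L, t) - d(L, v)| <= d(v, t) by the triangle
    # inequality, so a convex (row-stochastic) combination of the
    # bounds is also <= d(v, t): admissible for every parameter setting.
    bounds = np.abs(d_land_t - d_land_v)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return float(w @ bounds)
```

At deployment, pushing the logits toward a hard argmax recovers classical ALT on a learned landmark subset, as the abstract describes.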
URL: https://openreview.net/forum?id=ESlL4ab9y8
---
Title: Cog-VADU: A Training-Free Cognitive Reasoning Framework for Video Anomaly Detection and Understanding
Abstract: Video Anomaly Detection (VAD) aims to temporally localize abnormal events in videos.
Most existing approaches rely on dataset-specific training and curated annotations, limiting generalization in open-set scenarios.
Recent zero-shot methods based on Large Vision-Language Models (LVLMs) alleviate this dependency but often lack temporal continuity and structured reasoning. We propose \textbf{Cog-VADU}, a fully training-free framework that reformulates VAD as a sequential cognitive reasoning task.
Cog-VADU introduces \emph{Chain-of-Anomaly Detection Thought Prompting} (CoADTP), which unrolls an LVLM into a recurrent reasoning chain across video segments.
By propagating structured rationales over time, the model maintains implicit temporal memory, enabling robust discrimination between complex anomalies and high-motion normal activities. To improve reliability, we further design a cross-modal re-ranking stage that aligns textual rationales with visual embeddings, enforcing semantic consistency and temporal coherence for refined and stable predictions. Extensive experiments on multiple public VAD benchmarks demonstrate that Cog-VADU achieves competitive zero-shot performance. Moreover, cross-model evaluations show that CoADTP consistently enhances reasoning-based anomaly detection in a model-agnostic manner, providing interpretable and generalizable anomaly understanding for real-world applications.
URL: https://openreview.net/forum?id=QcuSMNG7J8
---
Title: RTS Smoother-Guided Learning of Physics-Based Neural Differential Models
Abstract: Ordinary differential equations (ODEs) are widely used to model dynamical systems in physics, biology, neuroscience, and physiology, but in many applications some equations of the dynamics are unknown and only a subset of the state variables are measured. We propose a hybrid neural--physics framework in which the known components of the ODE are kept explicit and the missing components are represented by a neural network. The proposed method consists of two stages where we alternate between state and parameter estimation and iterate until a predetermined criterion is met. Specifically, in the first stage, we treat the model parameters as being known and we infer the latent states from the available measurements using a Rauch--Tung--Striebel (RTS) smoother. In the second stage, we treat the smoothed trajectories as being known and use them to estimate the neural network's parameters through backpropagation. We evaluate the method on benchmark systems spanning linear, nonlinear, and stiff dynamics under partial state observation. Across these settings, the proposed method learns missing ODE components from incomplete measurements while exploiting and retaining interpretable mechanistic structure and improving latent-state reconstruction and long-horizon prediction.
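The alternation can be illustrated with a minimal scalar sketch (our own toy, assuming a linear model x_{t+1} = a*x_t + w with observations y_t = x_t + v, and a least-squares refit standing in for the backpropagation stage):

```python
import numpy as np


def rts_smooth(y, a, q, r, m0=0.0, p0=1.0):
    """Scalar RTS smoother for x_{t+1} = a*x_t + w,  y_t = x_t + v."""
    T = len(y)
    m = np.zeros(T); p = np.zeros(T)    # filtered means / variances
    mp = np.zeros(T); pp = np.zeros(T)  # one-step predictions
    mp[0], pp[0] = m0, p0
    for t in range(T):
        if t > 0:
            mp[t], pp[t] = a * m[t - 1], a * a * p[t - 1] + q
        k = pp[t] / (pp[t] + r)         # Kalman gain
        m[t] = mp[t] + k * (y[t] - mp[t])
        p[t] = (1 - k) * pp[t]
    ms, ps = m.copy(), p.copy()         # backward (RTS) pass
    for t in range(T - 2, -1, -1):
        g = p[t] * a / pp[t + 1]
        ms[t] = m[t] + g * (ms[t + 1] - mp[t + 1])
        ps[t] = p[t] + g * g * (ps[t + 1] - pp[t + 1])
    return ms, ps


def fit_dynamics(y, a0, q, r, iters=5):
    # Alternate: (1) smooth the states with the current estimate of a;
    # (2) refit a to the smoothed trajectory by least squares, a
    # simplification that ignores smoother cross-covariances.
    a = a0
    for _ in range(iters):
        ms, _ = rts_smooth(y, a, q, r)
        a = float(ms[1:] @ ms[:-1] / (ms[:-1] @ ms[:-1]))
    return a
```

Stage one smooths with the current parameter estimate, stage two refits the dynamics to the smoothed trajectory, and the loop repeats until the estimate settles.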
URL: https://openreview.net/forum?id=GmMtDQ54iR
---
Title: Enhancing Self-Supervised Visual Representation Learning via Low-Rank Adapted LLMs
Abstract: The integration of Large Language Model (LLM) blocks with Vision Transformers (ViTs) holds significant promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between the text-centric pre-training of LLMs and the vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable finetuning. Consequently, prior works typically keep LLM blocks frozen while learning only the vision components. To address these challenges, we introduce Language-Adapted Vision Enhancer (LAVIE), a novel framework that bridges this modality gap through a synergistic pre-training strategy. LAVIE co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block using the same MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM to effectively interpret visual information. We demonstrate through extensive experiments that LAVIE significantly improves performance in various downstream vision tasks, offering an effective and efficient way to enhance visual understanding using frozen LLM knowledge.
URL: https://openreview.net/forum?id=s2T8Kgj6Rd
---
Title: Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models
Abstract: The widespread adoption of deep learning models in computer vision has intensified concerns about interpretability. Despite strong performance, these models are often treated as black boxes, with limited systematic investigation of their decision-making processes. While many interpretability methods exist, objective evaluation of learned representations remains limited, particularly for approaches that rely on sparsity to ``induce'' interpretability. In this work, we investigate how modeling choices in Concept Bottleneck Models (CBMs) affect the semantic alignment of concept representations. We introduce Clarity, a novel metric that captures the interplay between downstream performance and the sparsity and precision of concept activations. Using an interpretability assessment framework grounded in datasets with ground-truth concept annotations, we evaluate both VLM- and attribute predictor-based CBMs across three amortized sparsity-inducing strategies ($\ell_1$, $\ell_0$, and Bernoulli-based), alongside several widely used sparsity-aware CBM methods from the literature. Our experiments reveal a critical flexibility-interpretability trade-off: a model's capacity to optimize task performance by deviating from semantic alignment. We demonstrate that under this trade-off, different methods exhibit markedly different behaviors even at comparable performance levels. Finally, we validate our framework through a principled human study, which confirms that Clarity aligns significantly more closely with human trust than standard evaluation metrics.
URL: https://openreview.net/forum?id=IyQEQBRR4M
---
Title: Vector Memory and Role-Conditioned Multi-Agent Systems: Two Extensions to Improve Reflexion for Language Model Self-Improvement
Abstract: REFLEXION improves language model performance through verbal self-reflection, but two design choices limit its reach. Its memory is a recency-ordered sliding window that evicts old reflections as new ones arrive, regardless of which are actually relevant. Moreover, a single model simultaneously generates a solution, critiques it, and plans the next attempt, roles that genuinely benefit from separation. In this paper, we address both of these limitations. We replace the sliding window with vector episodic memory, which stores Sentence-BERT embeddings alongside each reflection and retrieves them based on cosine similarity rather than recency. We also split the single agent into a Generator, a Critic, and a Verifier, each of which draws on a shared role-conditioned memory pool. The results are quite encouraging. In a controlled benchmark in which 9 distractor tasks bury the relevant memories, temporal memory fails (0% recall). In contrast, vector memory succeeds without exception (100%, zero variance across 3 trials), at a constant ~14 ms overhead that stays flat up to 50,000 stored reflections. On 164 HumanEval coding tasks with Google Gemini~2.5 Flash, the vector extension reaches Pass@3 = 92.7% (+3.7~pp, p=0.033, d=0.127), and the multi-agent system reaches Pass@3 = 96.3% (+7.9~pp, p<0.001, d=0.301) with Pass@1 = 93.9% (+12.2~pp over the modular baseline), a first-attempt gain that precedes any reflection at all. Our results suggest that semantic retrieval is particularly important when tasks are correlated across sessions, while role separation provides the greatest benefit on independent code-generation tasks.
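The similarity-over-recency retrieval described above can be sketched in a few lines. A toy bag-of-words embedding stands in for the Sentence-BERT encoder, and the stored reflections below are invented examples; only the store-with-embedding / rank-by-cosine structure reflects the abstract.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words stand-in for the Sentence-BERT encoder used in the paper."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(c * v[w] for w, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class VectorMemory:
    """Stores each reflection alongside an embedding; retrieves by similarity, not recency."""
    def __init__(self):
        self.items = []
    def add(self, reflection):
        self.items.append((embed(reflection), reflection))
    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(item[0], q), reverse=True)
        return [text for _, text in ranked[:k]]

mem = VectorMemory()
mem.add("off-by-one error in loop bounds when slicing the list")
mem.add("forgot to close the file handle after writing")
mem.add("used integer division where float division was needed")
print(mem.retrieve("test fails because of a loop index slicing bug"))
# → ['off-by-one error in loop bounds when slicing the list']
```

A sliding window of size 2 would have evicted the first reflection by now; cosine retrieval surfaces it regardless of how many newer reflections have arrived.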
URL: https://openreview.net/forum?id=T97MrGUNkZ
---
Title: Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Abstract: Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses of the key factors behind their effectiveness have not been clearly shared with the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We comprehensively evaluate these designs across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitives, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
URL: https://openreview.net/forum?id=x7qyXl8ecT
---
Title: Adversarial Attacks and Defenses in Vision-Language Pre-training: Techniques, Challenges and Opportunities
Abstract: Vision-language pretraining (VLP) has emerged as a powerful paradigm for multimodal learning. However, despite their superior capabilities, VLPs remain vulnerable to adversarial attacks that manipulate their inputs. Such attacks can undermine user trust, compromise model integrity, and introduce critical security vulnerabilities, highlighting the importance of securing VLPs to ensure safety in real-world multimodal applications. This review delves into the methodologies and implications of both adversarial attacks and defense strategies in the adversarial landscape of VLPs, organized by architectural considerations. We examine the complexities of categorizing adversarial attack strategies, underscoring the critical need for robust defensive measures. To improve the reliability of these models, we discuss novel defense mechanisms that counter vulnerabilities. In addition, we analyze how adversarial vulnerabilities impact downstream applications. Overall, this review aims to provide a comprehensive overview of adversarial threats in VLPs and present future research directions.
URL: https://openreview.net/forum?id=I2zOWfsTXP
---
Title: A Moving-Horizon Approximate Branch-and-Reduce Method for Deep Classification Trees
Abstract: Despite the importance for interpretability, decision trees face severe scalability challenges. Existing global optimal methods are often limited by binary feature selection and shallow tree depths, whereas traditional heuristic approaches frequently sacrifice predictive accuracy. To overcome these limitations, this paper proposes a moving-horizon approximate branch-and-reduce method to train near-optimal deep classification trees on large-scale datasets with continuous features. Built on a bilevel optimization framework, the method solves the upper-level problem via branch-and-reduce while approximating the lower-level problem using greedy heuristics. Although the underlying framework is capable of guaranteeing global optimality, the approximation, which functions as a lookahead rollout in a reinforcement learning context, significantly boosts efficiency for deeper structures. A low-cost moving-horizon strategy is then employed to iteratively refine model accuracy. Extensive numerical results demonstrate that our method exceeds the testing accuracy of existing heuristic baselines while offering significantly greater scalability, in terms of both dataset size and tree depth, than global optimal solvers.
URL: https://openreview.net/forum?id=4Sq5Byd4yS
---
Title: When Does Orthogonal LoRA Help Retrieval? Spectral Preservation, Alignment, and Operating Regimes
Abstract: Orthogonal variants of LoRA are typically justified as preserving the geometry of the low-rank adaptation subspace in parameter space. For retrieval embedding models, however, whose performance is evaluated directly in embedding space, it remains unclear whether this parameter-space geometry is itself what matters, or whether the decisive factor is the geometry induced in the resulting embeddings. We present a controlled study of 12 LoRA-family methods on implicit concept retrieval and two BEIR passage-retrieval tasks, combining retrieval metrics with weight-space and embedding-space diagnostics. The comparison shows that standard LoRA often collapses the effective rank of its update, but recovering effective rank alone does not reliably recover retrieval quality.
On our primary compact-encoder setting, the strongest retrieval results arise when two conditions are combined: the update remains orthogonal during training and its initial directions are aligned with the pretrained spectral subspace. Motivated by this finding, we introduce GeoLoRA, a minimal adapter that combines Stiefel-constrained factors, SVD-aligned initialisation, and a learnable diagonal spectral bridge. GeoLoRA improves over the main LoRA-family baselines in our primary ELSST setting, while the advantage weakens or disappears on less geometry-sensitive tasks and on the larger backbones we study. Our work clarifies when orthogonality helps retrieval, provides a controlled instantiation of the identified ingredients, and offers a diagnostic toolkit for future embedding-centric PEFT work.
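The three named ingredients can be sketched with NumPy: SVD-aligned initialisation of the adapter factors, a QR retraction that keeps them on the Stiefel manifold after each update, and a diagonal "bridge" between them. The weight shapes, rank, step size, and the use of QR as the retraction are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))          # stand-in pretrained weight matrix
r = 4                                      # adapter rank (illustrative)

# SVD-aligned initialisation: factors start inside the pretrained spectral subspace
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r]                               # left Stiefel factor  (64 x r, orthonormal columns)
B = Vt[:r, :].T                            # right Stiefel factor (32 x r, orthonormal columns)
d = np.zeros(r)                            # learnable diagonal "spectral bridge" (zero init)

def retract(M):
    """Project a factor back onto the Stiefel manifold via QR (a standard retraction)."""
    Q, R = np.linalg.qr(M)
    return Q * np.sign(np.diag(R))         # fix column signs for uniqueness

# One illustrative training step: Euclidean gradient step, then retraction
A = retract(A + 0.01 * rng.standard_normal(A.shape))
B = retract(B + 0.01 * rng.standard_normal(B.shape))
delta_W = A @ np.diag(d) @ B.T             # low-rank update applied to W

print(np.allclose(A.T @ A, np.eye(r)))     # factors stay orthonormal after the step
```

With `d` initialised to zero the update starts as the identity map (as in standard LoRA), while the constraint keeps the update's directions orthogonal throughout training, the property the abstract identifies as decisive.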
URL: https://openreview.net/forum?id=zhql8bU0QM
---
Title: Structured Semantics Meet Uncertain Visuals: A Unified Approach to Calibrated Test-Time Prompt Tuning
Abstract: Large vision-language models (VLMs) generalize well zero-shot but become overconfident and poorly calibrated under distribution shifts. Existing test-time adaptation (TTA) methods largely apply uniform entropy minimization with fixed geometric regularizers, ignoring instance-wise uncertainty and domain-specific visual cues. We propose Uncertainty-Calibrated Test-Time Prompt Tuning (UC-TPT), a label-free TTA framework targeted at improving reliability rather than solely maximizing accuracy. UC-TPT consists of three theoretically motivated components: (i) lightweight visual-to-text conditioning that injects shallow visual statistics—where shift is most pronounced—into prompts, yielding domain-conditioned predictions; (ii) an uncertainty-tempered entropy objective that adaptively controls distribution sharpness to curb overconfidence; and (iii) a topology-aware prompt regularizer that approximately preserves the pairwise semantic relations of manual prompts, stabilizing adaptation in the pretrained embedding space. Experiments on CLIP and BiomedCLIP across diverse benchmarks demonstrate that UC-TPT consistently outperforms existing methods in calibration robustness, yielding significant reductions in Expected Calibration Error (ECE) across a wide range of distribution shifts while maintaining competitive classification accuracy.
URL: https://openreview.net/forum?id=eY1W8h2sWC
---
Title: TOAM-YOLO: A Tiny Object-Aware Multi-Expert YOLO Framework for Diverse Domains
Abstract: YOLO-based object detection models have advanced significantly over the years through continuous architectural refinements and subsequent performance improvements. Tiny object detection remains a challenging task due to several constraints posed by progressive downsampling in model architectures and the smaller footprint of tiny objects in high-resolution images. This challenge is faced in diverse applications such as maritime surveillance, aerial surveillance, and in medical applications such as microscopic blood cell analysis. In this study, we introduce and demonstrate that a novel multi-domain expert, which we refer to as TOA-MoE (Tiny object aware mixture of experts), consisting of a Hessian-based curvature expert and a Fourier-based frequency expert, along with a 3-level attention mechanism, substantially improves the detection performance of YOLO models while only increasing the learnable parameters by a small fraction. Additionally, we add a feature fusion network that incorporates a BiFPN-style structure and integrates deformable convolutional layer modules in the architecture, and we replace the standard up-sampling layers with a Content-Aware Reassembly of Features (CARAFE) module to preserve fine-grained feature details during feature map expansion. We systematically demonstrate the plug-and-play capability of these changes on YOLOv11 and YOLOv12 models. Tiny object aware Mixture of experts based YOLO (TOAM-YOLO) achieves state-of-the-art performance on five datasets: three tiny object benchmarks (SeaPerson, TinyPerson, VisDrone) with mAP@0.5 improvements of 11.6%, 3.3%, and 10% respectively, and two blood cell datasets (BCCD, CBC) with mAP@0.5:0.95 improvements of 3.9% and 1.7% for platelet detection, all over YOLOv12n, while adding only 0.75M parameters.
URL: https://openreview.net/forum?id=2lIE1tgmRN
---
Title: Efficient LLM Collaboration via Planning
Abstract: Recently, large language models (LLMs) have demonstrated strong performance, ranging from simple to complex tasks. However, while large models achieve remarkable results across diverse tasks, they often incur substantial monetary inference cost, making frequent use impractical for many applications. In contrast, small models are often freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models efficiently collaborate to combine their complementary strengths? To bridge this trade-off, we propose COPE, a test-time collaboration framework. A planner model first generates a plan that serves as a lightweight intermediate that guides a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models, while drastically reducing the inference API cost. These results highlight planning as an effective prior for cost-efficient inference.
URL: https://openreview.net/forum?id=RPzbeL0koP
---
Title: Sample Complexity of RLHF Reward Learning under General Reward Classes
Abstract: We study the minimax sample complexity of Bradley–Terry preference-based reward learning under an arbitrary reward class $\mathcal{R}$. For the realisable logistic (Bradley–Terry) preference model with a single-policy concentrability coefficient $C$, we prove matching upper and lower bounds
$$N^\star(\mathcal{R}, \varepsilon) = \Theta\!\left(\frac{C^2\cdot \mathcal{H}(\mathcal{R}, \varepsilon/C)}{\kappa(B)\cdot\varepsilon^2}\right),$$
where $\mathcal{H}(\mathcal{R}, \varepsilon) := \log N(\mathcal{R}, \varepsilon, L^2(\tilde\mu))$ is the $L^2$ metric entropy of $\mathcal{R}$ under the induced action marginal $\tilde\mu$ and $\kappa(B) := \sigma(2B)(1-\sigma(2B))$ is the Bradley–Terry Fisher-information curvature on the pairwise-difference range $[-2B,2B]$. The bound matches in $C$, $\varepsilon$, and $B$ up to absolute constants, under a mild saturation condition on $\mathcal{R}$. The closure of the $\kappa(B)$-gap uses a boundary Bregman upper bound on the softplus, invoked in the Fano step with an adversarial ground truth whose induced pairwise differences saturate the boundary. Our result unifies and sharpens a line of work that had resolved the rate only for structured subclasses: linear, low-rank, and general preferences. Three standard reward classes instantiate the bound: linear in $\mathbb{R}^d$, rank-$k$ in a $d$-dimensional embedding, and Sobolev $W^{s,2}([0,1]^d)$. The upper bound uses a localised Rademacher argument on the conditional MLE driven by a quadratic curvature identity for the Bradley–Terry log-likelihood; the two Bregman inequalities driving the rate are mechanically verified in Lean 4 / Mathlib with zero sorry and no custom axioms. The lower bound is a Fano–Le Cam construction tailored to the Bradley–Terry Fisher information, made coverage-aware by a restricted packing on the support of the optimal policy and made $B$-matching by saturating the pairwise range.
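The curvature factor $\kappa(B) = \sigma(2B)(1-\sigma(2B))$ in the denominator can be evaluated directly: it peaks at $1/4$ when $B = 0$ and decays as the reward range grows, inflating the sample complexity. The values below are simple arithmetic on that definition, not results from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kappa(B):
    """Bradley-Terry Fisher-information curvature on the pairwise range [-2B, 2B]."""
    s = sigmoid(2.0 * B)
    return s * (1.0 - s)

print(round(kappa(0.0), 4))   # → 0.25  (the flat point of the logistic)
for B in (1.0, 2.0, 4.0):     # curvature vanishes as the reward range widens
    print(B, round(kappa(B), 6))
```

Since $N^\star$ scales as $1/\kappa(B)$, doubling the reward bound from $B=1$ to $B=2$ already multiplies the required number of preference pairs several-fold, all else equal.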
URL: https://openreview.net/forum?id=bhNKwfndZt
---
Title: Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training
Abstract: Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon \citep{jordanmuon}, which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose \textit{low-rank orthogonalization}, which performs orthogonalization by leveraging the low-rank nature of gradients during NN training. Building on this, we introduce low-rank matrix-signed gradient descent (MSGD) and a low-rank variant of Muon. Numerical experiments demonstrate the superior performance of low-rank orthogonalization, with low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining---surpassing the carefully tuned vanilla Muon on tasks with large model sizes. Theoretically, we establish the iteration complexity of low-rank MSGD for finding an approximate stationary solution, and the iteration complexity of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise. The code to reproduce our numerical experiments is available at \url{https://github.com/dengzhanwang/Low-rank-Muon}.
URL: https://openreview.net/forum?id=uDlRPLQpAy
---
Title: DIALS: Dynamic Layer-Skipping Framework for Diffusion Language Models
Abstract: Diffusion language models (DLMs) have emerged as promising alternatives to autoregressive models (ARMs) due to their bidirectional attention and parallel decoding. However, their inference cost becomes significantly higher as they scale. Layer skipping addresses this challenge by selectively omitting redundant layers. While these dynamic approaches are effective in ARMs, they cannot be naturally extended to DLMs because their parallel generation paradigm makes fine-grained token-level routing challenging. We propose DIALS, a novel dynamic layer-skipping framework for DLMs. DIALS places a lightweight router before each Transformer layer, aggregating masked token representations to make a unified, sequence-level decision on whether to skip or execute the layer. Evaluated on LLaDA-8B across six benchmarks, DIALS generally achieves a better FLOPs-accuracy trade-off compared to static and random layer-skipping baselines. On PIQA, it reduces inference FLOPs by 14.26% without losing accuracy. Our analysis further shows that initial layers are consistently important. Additionally, by incorporating a scaling term based on the mask ratio into the routing objective, we reveal that inherent layer redundancy emerges as denoising progresses.
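The sequence-level routing decision described above (pool the masked-token representations, then gate the whole layer) can be sketched as follows. The layer, router weights, dimensions, and threshold are all toy stand-ins; only the pool-then-gate structure reflects the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

def layer(h):
    """Stand-in Transformer layer (a fixed random residual map, for illustration only)."""
    W = rng.standard_normal((d_model, d_model)) * 0.1
    return h + h @ W

class Router:
    """Lightweight per-layer router: one sequence-level skip/execute decision."""
    def __init__(self, threshold=0.5):
        self.w = rng.standard_normal(d_model) * 0.1   # untrained toy weights
        self.b = 0.0
        self.threshold = threshold
    def execute_layer(self, h, mask):
        pooled = h[mask].mean(axis=0)                 # aggregate masked positions only
        gate = 1.0 / (1.0 + np.exp(-(pooled @ self.w + self.b)))
        return gate >= self.threshold

h = rng.standard_normal((10, d_model))                # 10 tokens' hidden states
mask = np.array([True] * 4 + [False] * 6)             # 4 positions still masked at this step
router = Router()
if router.execute_layer(h, mask):
    h = layer(h)                                      # execute the layer
# else: skip this layer entirely for the whole sequence
print(h.shape)
```

Because the decision is made once per layer for the whole sequence, the parallel decoding of DLMs is preserved: there is no token-level routing that would break batched generation.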
URL: https://openreview.net/forum?id=3SmsgAtjRK
---
Title: Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
Abstract: Unified decoder-only transformers have shown promise for multimodal generation, yet the mechanisms by which they synchronize modalities with heterogeneous sampling rates remain underexplored. We investigate these mechanisms through video-text-to-speech (VTTS) synthesis---a controlled task requiring fine-grained temporal alignment between sparse text, video, and continuous speech. Using a unified decoder-only transformer, dubbed Visatronic, trained on VoxCeleb2, we study: (i) how modalities contribute complementary information, (ii) how positional encoding strategies enable synchronization across heterogeneous rates, (iii) how modality ordering shapes the trade-off between in-domain performance and cross-domain transfer, and (iv) how phoneme-level synchronization metrics provide diagnostic insight into per-phoneme timing errors. Our findings reveal that both ``global sequential indexing'' (unique position IDs across modalities) and ``co-temporal ordered indexing'' (identical IDs for temporally corresponding tokens) achieve strong synchronization performance, with co-temporal ordered indexing providing a simple mechanism without explicit timestamp metadata. Both text and video contribute complementary signals: text ensures intelligibility while video provides temporal cues and emotional expressiveness. Modality ordering reveals a consistent trade-off: video-first ordering achieves stronger in-domain performance while text-first ordering generalizes more robustly to unseen domains. Our findings also reveal that diverse large-scale training enables transferable synchronization strategies. To enable fine-grained analysis, we also introduce TimeSync, a phoneme-level metric that reveals temporal misalignments overlooked by frame-level metrics. These insights establish VTTS as a valuable testbed for understanding temporal synchronization in unified multimodal decoders. Generated speech results are attached in the supplementary.
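The two positional indexing schemes compared above can be made concrete with a toy example. The token rates (25 speech tokens/s, 5 video tokens/s) and the exact ID formula are illustrative assumptions; only the distinction, unique IDs in concatenation order versus shared IDs for temporally corresponding tokens, comes from the abstract.

```python
def global_sequential_ids(streams):
    """Unique position IDs across all modalities, in concatenation order."""
    ids, nxt = {}, 0
    for name, n_tokens, _rate in streams:
        ids[name] = list(range(nxt, nxt + n_tokens))
        nxt += n_tokens
    return ids

def co_temporal_ids(streams, common_rate):
    """Identical position IDs for temporally corresponding tokens across modalities."""
    return {name: [round(k * common_rate / rate) for k in range(n_tokens)]
            for name, n_tokens, rate in streams}

# one second of speech at 25 tokens/s and video at 5 tokens/s (rates are illustrative)
streams = [("speech", 25, 25), ("video", 5, 5)]
print(global_sequential_ids(streams)["video"])            # → [25, 26, 27, 28, 29]
print(co_temporal_ids(streams, common_rate=25)["video"])  # → [0, 5, 10, 15, 20]
```

Under co-temporal indexing, video token $k$ shares its position ID with speech token $5k$, so temporal correspondence is encoded by the position IDs themselves, with no timestamp metadata required.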
URL: https://openreview.net/forum?id=D5QpMy6yUH
---
Title: Attention-Bayesian Hybrid Approach to Modular Multiple Particle Tracking
Abstract: Tracking multiple particles in dense scenes remains challenging due to a combinatorial explosion of trajectory hypotheses, which scales super-exponentially with the number of frames. The transformer architecture has shown a significant improvement in robustness against this high combinatorial load. However, its performance still falls short of conventional Bayesian filtering approaches in locally sparse scenarios that present a reduced set of trajectory hypotheses. This suggests that while transformers excel at narrowing down possible associations, they do not reach the optimality of the Bayesian approach in locally sparse scenarios. Hence, we introduce a hybrid tracking framework that combines the ability of self-attention to learn the underlying representation of particle behavior with the reliability and interpretability of Bayesian filtering. We perform trajectory-to-detection association by solving a label prediction problem, using a transformer encoder to infer soft associations between detections across frames. This prunes the hypothesis set, enabling efficient multiple-particle tracking within a Bayesian filtering framework. Our approach demonstrates improved tracking accuracy and robustness against spurious detections. These results open the way to a solution for high-clutter multiple-particle tracking scenarios that takes advantage of the large context accessible to transformers, together with the interpretability and theoretical guarantees of Bayesian filtering techniques.
URL: https://openreview.net/forum?id=uYlWqmxY0T
---
Title: Simple is Better than Complex: A Representation-centric Perspective for Prompting-based Vision--Language Fusion
Abstract: Interactive prompting is an appealing approach to vision--language fusion using frozen unimodal transformers, yet recent progress often relies on increasingly complex prompting architectures. A natural question arises: instead of refining prompt designs, can fusion be improved more effectively by directly adapting internal representations within attention layers? Our analysis, from a representation-centric perspective, suggests that interactive prompting itself has limited ability to directly alter value token representations and intra-modal token interactions, motivating a lightweight alternative that targets these internal attention representations rather than increasing prompting complexity. Specifically, we investigate the cross-attention mechanism and propose combining value-only low-rank adaptation with a key--query replacement strategy, yielding a simple and parameter-efficient fusion design. Across common multimodal fusion benchmarks, the proposed method consistently outperforms prior prompting-based fusion baselines while requiring fewer trainable parameters. These results, along with further ablations, support representation-centric adaptation as an effective principle for prompting-based vision--language fusion.
URL: https://openreview.net/forum?id=yBVwYxHxUq
---
Title: Novel Losses for Contrastive Learning using Siamese Energy-Based Models
Abstract: Learning useful representations without labels is central to modern machine learning, especially when annotation is costly, motivating the development of self-supervised learning. To that end, contrastive learning methods, such as SimCLR, aim to discover representations that are invariant to user-defined augmentations. Recent work has shown that these methods can be reinterpreted as energy-based models (EBMs) that learn to “de-augment” data. Building on this perspective, we propose a principled EBM formulation of contrastive representation learning. Through this formulation, we are able to offer new objectives to train this model. Particularly, we propose a Fisher-Hyvärinen divergence loss, which leverages score matching to bypass the need for negative samples. Our framework bridges contrastive learning with EBM and posterior estimation, offering a new foundation for unsupervised representation learning.
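For background, the Fisher--Hyvärinen divergence referenced in the abstract has the standard score-matching form (a general statement of the divergence, not a quotation of the paper's exact objective):

```latex
D_F\big(p \,\|\, q_\theta\big)
  = \mathbb{E}_{x \sim p}\!\left[\tfrac{1}{2}\,\big\|\nabla_x \log p(x) - \nabla_x \log q_\theta(x)\big\|^2\right]
  = \mathbb{E}_{x \sim p}\!\left[\tfrac{1}{2}\,\big\|\nabla_x \log q_\theta(x)\big\|^2
      + \Delta_x \log q_\theta(x)\right] + \text{const},
```

where the second equality follows from integration by parts (Hyvärinen, 2005). The rewritten form depends on $q_\theta$ only through its score, so it requires neither the normalizing constant of the EBM nor samples from $q_\theta$, which is why this loss bypasses the need for negative samples.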
URL: https://openreview.net/forum?id=jsuNSUiT3G
---
Title: Use Bayesian Paired Tests to Improve the Comparison of Machine Learning Models
Abstract: This paper argues that model comparison in machine learning can be much improved by using \emph{paired testing}, i.e.\ comparing the predictions of methods A and B on each (common) test example. Due to the limitations of null hypothesis significance testing in frequentist statistics, Bayesian methods are recommended, including the use of the region of practical equivalence (ROPE; Kruschke 2015a; Kruschke and Liddell 2018; Benavoli, Corani, Dem\v{s}ar, and Zaffalon 2017). We discuss a Bayesian $t$-test and a Bayesian McNemar test for comparisons on a single task, and Bayesian hierarchical models for comparisons over multiple tasks. Two worked examples are presented to illustrate the methods, and the use of reporting guidelines is discussed as a potential means of changing current practice.
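The ROPE idea can be sketched in a few lines: compute the posterior over the mean paired difference and report how much posterior mass falls below, within, and above the region of practical equivalence. This is a minimal sketch assuming a flat prior and a plug-in variance (a normal approximation rather than the full Student-$t$ posterior of a proper Bayesian $t$-test), and the per-example differences and ROPE half-width are invented for illustration.

```python
import math

def rope_probabilities(diffs, rope=0.01):
    """Posterior P(mean paired difference below / within / above the ROPE),
    under a normal approximation with a flat prior and plug-in variance."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mean) / (se * math.sqrt(2.0))))
    return {"below": cdf(-rope),
            "within": cdf(rope) - cdf(-rope),
            "above": 1.0 - cdf(rope)}

# hypothetical per-example accuracy differences (method A minus method B) on a shared test set
diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.02, 0.03, 0.01, 0.02]
probs = rope_probabilities(diffs, rope=0.01)
print(max(probs, key=probs.get))  # → above: A is practically better than B
```

Unlike a $p$-value, the three probabilities answer the question practitioners actually ask: how likely is it that the difference between A and B is practically negligible, rather than merely non-zero.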
URL: https://openreview.net/forum?id=t6xhcnLlEl
---
Title: Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control
Abstract: Reward fine-tuning of diffusion and flow models and sampling from tilted or Boltzmann distributions can both be formulated as stochastic optimal control (SOC) problems, where learning an optimal generative dynamics corresponds to optimizing a control under SDE constraints. In this work, we revisit and generalize \emph{Adjoint Matching}, a recently proposed SOC-based method for learning optimal controls, and place it on a rigorous footing by deriving it from the \emph{Stochastic Maximum Principle} (SMP). We formulate a general Hamiltonian adjoint matching objective for SOC problems with control-dependent drift and diffusion and convex running costs, and show that its expected value has the same first variation as the original SOC objective. As a consequence, critical points satisfy the Hamilton--Jacobi--Bellman (HJB) stationarity conditions. In the important practical case of state- and control-independent diffusion, we recover the \emph{lean} adjoint matching loss previously introduced in adjoint matching, which avoids second-order terms and whose critical points coincide with the optimal control under mild uniqueness assumptions. Finally, we show that adjoint matching can be precisely interpreted as a continuous-time method of successive approximations induced by the SMP, yielding a practical and implementable alternative to classical SMP-based algorithms, which are obstructed by intractable martingale terms in the stochastic setting. These results are also of independent interest to the stochastic control community, providing new implementable objectives and a viable pathway for SMP-based iterations in stochastic problems.
URL: https://openreview.net/forum?id=tR5VsdQFhK
---
Title: ViDE: Tuning-Free Video Coherence via Temporal Attention Reweighting and Prompt Blending
Abstract: Despite the substantial progress in long video generation, multi-prompt synthesis often encounters inconsistencies resulting from the training-inference gap caused by length extension techniques and coarse prompt interpolation. To overcome these issues, we propose the Video Diffusion with hidden states Editing (ViDE) framework, which consists of two key components. The first is the Time-frequency based Temporal Attention Reweighting (TiTAR) algorithm, which leverages the relationship between inconsistencies and diagonal elements of temporal attention. By reweighting the attention scores via the Discrete Short-Time Fourier Transform (DSTFT), TiTAR effectively reduces frame inconsistencies, a capability further corroborated by a Fourier-based analysis. The second component, PromptBlend, reduces inconsistencies in multi-prompt settings through fine-grained prompt alignment and adaptive interpolation, enabling smooth semantic transitions. Extensive experiments demonstrate the effectiveness of ViDE, with consistent and significant improvements over multiple baselines.
URL: https://openreview.net/forum?id=A5WZZxglxT
---
Title: On the Expressive Power and Limitations of Multi-Layer SSMs
Abstract: We study the expressive power and limitations of multi-layer state-space models (SSMs). First, we show that multi-layer SSMs face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Then, we examine the role of chain-of-thought (CoT), showing that offline CoT does not fundamentally increase the expressiveness, while online CoT can substantially increase its power. Indeed, with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Finally, we investigate the tradeoff between width and precision, showing that these resources are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed. Overall, our results offer a unified perspective on how depth, finite precision, and CoT shape the power and limits of SSMs.
URL: https://openreview.net/forum?id=NQ8IcsIti8
---
Title: Separable Pathways for Causal Reasoning: How Architectural Scaffolding Enables Hypothesis-Space Restructuring in LLM Agents
Abstract: Causal discovery through experimentation and intervention is fundamental to robust problem solving. It requires not just updating beliefs within a fixed framework but revising the hypothesis space itself, a capacity current AI agents lack when evidence demands representations they have not previously constructed. We extend the blicket detector paradigm from developmental science to test this capacity in AI agents equipped with architectural scaffolding that targets hypothesis-space restructuring. Our compositional architecture has two discrete components: context graphs, which structure exploration as typed state machines, and dynamic behaviors, which monitor for evidence that the current hypothesis space is inadequate and expand it at runtime. Across 1,085 experimental trials, these components make orthogonal contributions: context graphs drive reasoning quality within the post-switch hypothesis space, accounting for 94% of the accuracy gain, while dynamic behaviors drive reasoning eligibility by detecting regime changes and preventing premature commitment to outdated hypotheses. The benchmark codebase, all agent implementations, trace data, and analysis scripts are publicly available.
URL: https://openreview.net/forum?id=05VkYfgXm6
---
Title: Reliability Scaling Laws for Quantized Large Language Models
Abstract: Quantization is a powerful strategy to build capable and resource-efficient large language models (LLMs) by reducing the bitwidth of the parameters. While quantized LLMs achieve state-of-the-art performance on unperturbed inputs using standard predictive metrics, their performance on perturbed inputs, measured using reliability metrics, remains underexplored, despite its importance for reliable deployment. To address this gap, we conduct a comprehensive reliability evaluation of quantized LLMs consisting of three key components: (1) Uncertainty: We assess the trustworthiness of LLMs quantized to 2, 3, 4, and 8 bits using six different quantization methods, employing established uncertainty metrics. (2) Robustness: We design character-level and word-level input perturbations to evaluate the reliability of quantized models under semantics-preserving variations in the inputs that arise in real-world applications.
(3) Reliability scaling trends: We investigate how reliability scales with the number of model bits. Our study reveals that while performance scales monotonically with the total number of bits, reliability scales nonlinearly. A reliability peak occurs for 4-bit quantized models, indicating that quantizing moderately sized models offers the best reliability-efficiency trade-off. Additionally, our empirical findings reveal that quantization enhances the robustness of LLMs to natural input perturbations.
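As an editorial illustration of the perturbation evaluation in (2), the sketch below shows one plausible character-level edit (adjacent-character swap) and one word-level edit (word duplication). The function names, edit types, and rates are assumptions for illustration, not the paper's implementation.

```python
import random

def char_perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent letters in a fraction of positions,
    a simple semantics-preserving character-level perturbation."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly duplicate words, a mild word-level perturbation."""
    rng = random.Random(seed)
    out = []
    for w in text.split():
        out.append(w)
        if rng.random() < rate:
            out.append(w)
    return " ".join(out)
```

A reliability metric would then be computed on model outputs for `char_perturb(x)` versus `x`.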
URL: https://openreview.net/forum?id=UUBijehMQO
---
Title: FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Abstract: Contrastive language-image pre-training aligns features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by unimodal encoders. In this work, we present FuseLIP, a new architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model operating on a unified vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoders. We show that FuseLIP outperforms late fusion approaches in several multimodal and unimodal embedding tasks.
URL: https://openreview.net/forum?id=yq9je6kLC6
---
Title: Diffusion Models in Simulation-Based Inference: A Tutorial Review
Abstract: Diffusion models have recently emerged as powerful learners for simulation-based inference (SBI), enabling fast and accurate estimation of latent parameters from simulated and real data.
Their score-based formulation offers a flexible way to learn conditional or joint distributions over parameters and observations, thereby providing a versatile solution to various modeling problems.
In this tutorial review, we synthesize recent developments on diffusion models for SBI, covering design choices for training, inference, and evaluation.
We highlight opportunities created by various concepts such as guidance, score composition, flow matching, consistency models, and joint modeling.
Furthermore, we discuss how efficiency and statistical accuracy are affected by noise schedules, parameterizations, and samplers.
Finally, we illustrate these concepts with case studies across parameter dimensionalities, simulation budgets, and model types, and outline open questions for future research.
URL: https://openreview.net/forum?id=86Q261z005
---
Title: ASAT: Adaptive Scoring and Thresholding with Human Feedback for Robust Out-of-Distribution Detection
Abstract: Machine Learning (ML) models are trained on in-distribution (ID) data but often encounter out-of-distribution (OOD) inputs during deployment, posing serious risks in safety-critical domains. Recent works have focused on designing scoring functions to quantify OOD uncertainty, with score thresholds typically set based solely on ID data to achieve a target true positive rate (TPR), since OOD data is limited before deployment. However, these TPR-based thresholds leave false positive rates (FPR) uncontrolled, often resulting in high FPRs where OOD points are misclassified as ID. Moreover, fixed scoring functions and thresholds lack the adaptivity needed to handle newly observed, evolving OOD inputs, leading to sub-optimal performance. To address these challenges, we propose a human-in-the-loop framework that safely updates both scoring functions and thresholds on the fly based on real-world OOD inputs. Our method maximizes TPR while controlling FPR at all times, even as the system adapts over time. We provide theoretical guarantees for FPR control under stationary conditions and present extensive empirical evaluations on OpenOOD benchmarks to demonstrate that our approach outperforms existing methods by achieving higher TPRs while maintaining FPR control.
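The TPR-based baseline thresholding that this abstract critiques can be made concrete with a short sketch: choose the threshold from ID scores alone so that a target fraction of ID points is retained. The function names and the convention (higher score = more ID-like) are illustrative assumptions; this is the fixed baseline, not ASAT itself, which adapts both score and threshold online.

```python
import math

def tpr_threshold(id_scores, tpr=0.95):
    """Set an OOD-detection threshold from ID data alone: return the
    largest threshold such that at least `tpr` of ID points score
    at or above it (higher score = more ID-like)."""
    s = sorted(id_scores, reverse=True)
    k = min(len(s) - 1, math.ceil(tpr * len(s)) - 1)
    return s[k]

def is_ood(score, threshold):
    """Flag an input as OOD when its score falls below the threshold."""
    return score < threshold
```

With this fixed rule, the FPR on actual OOD inputs is whatever it happens to be, which is exactly the gap the paper's adaptive framework targets.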
URL: https://openreview.net/forum?id=4Kd0VMsL76
---
Title: When Does Geometry Emerge from Memorization in Transformers?
Abstract: Transformer models often show structured internal representations on relational tasks, which are often interpreted as geometric organization. Prior work documents such structure via visualization or performance-based analyses, but does not isolate whether perfect memorization alone yields geometric representations. Here, we conduct a controlled study using synthetic relational worlds defined by canonical graph topologies, explicitly training Transformers to perfectly memorize relational structure without imposing constraints that favor or discourage geometry, and examining whether geometric representations arise as a consequence.
Across chains, cycles, regular graphs, and star graphs, models achieve perfect memorization accuracy while internal embeddings do not systematically preserve either global distances or local neighborhoods, indicating reliance on non-geometric, index-based representations. By probing embeddings against shortest-path distance using rank consistency and neighborhood preservation metrics, we show that memorization alone places no requirement on metric organization. Recoverable geometric structure emerges only when the task objective, together with the relational topology, sufficiently constrains node interchangeability, reducing the space of symmetry-equivalent memorization solutions.
Our results show that perfect memorization does not imply emergent geometric structure, and characterize the conditions under which structure arises in learned embeddings.
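The probing described above can be illustrated with a minimal, pure-Python rank-consistency sketch: compare pairwise embedding distances against BFS shortest-path distances via a Spearman-style rank correlation. The function names and tie handling here are editorial assumptions, not the paper's exact metric.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def ranks(values):
    """Rank positions (ties broken by index, Spearman-style approximation)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def rank_consistency(adj, emb):
    """Correlation between pairwise embedding distances and graph
    shortest-path distances over all node pairs; 1.0 means the
    embedding perfectly preserves the distance ordering."""
    nodes = sorted(adj)
    pairs = list(combinations(nodes, 2))
    graph_d, emb_d = [], []
    for s, t in pairs:
        graph_d.append(bfs_distances(adj, s)[t])
        emb_d.append(sum((a - b) ** 2 for a, b in zip(emb[s], emb[t])) ** 0.5)
    rg, re = ranks(graph_d), ranks(emb_d)
    n = len(pairs)
    mg, me = sum(rg) / n, sum(re) / n
    cov = sum((a - mg) * (b - me) for a, b in zip(rg, re))
    var = (sum((a - mg) ** 2 for a in rg) *
           sum((b - me) ** 2 for b in re)) ** 0.5
    return cov / var if var else 0.0
```

A model that memorizes via index-like codes can score near zero on such a probe while still achieving perfect task accuracy.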
URL: https://openreview.net/forum?id=UA353ueO6y
---
Title: Scaling Agents for Computer Use
Abstract: Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their performance on long-horizon, complex problems remains unreliable. Single-rollout execution is brittle, with small errors compounding over time and leading to high variance in outcomes. While prior work has attempted to scale within a single rollout, such approaches have yielded limited gains. Scaling over multiple rollouts offers a more promising alternative, but doing so effectively is challenging due to the difficulty of evaluating and selecting among long-horizon agent behaviors. We introduce Behavior Judge (BJudge), which addresses this challenge by representing agent executions as behavior narratives and comparing candidate behaviors at this level, substantially improving robustness and success rates. Using multiple rollouts, BJudge establishes a new state of the art (SoTA) on OSWorld at 72.6%, significantly outperforming prior methods and surpassing human-level performance of 72.36%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the effectiveness of scaling CUAs when done right: effective scaling requires structured trajectory understanding and selection, and BJudge provides a practical framework to achieve this.
URL: https://openreview.net/forum?id=eve4jBYa8D
---
Title: Benchmarking Transfer Learning: From Simple Baselines to Combined Scorers for Transferability Estimation
Abstract: In the evolving landscape of deep learning, selecting the best pre-trained model from a growing number of choices is a challenge. Transferability scorers offer an efficient alternative by calculating a proxy to rank a pool of pre-trained model candidates. Despite their promise, the field currently lacks standardized evaluation protocols, consistent baselines, and reproducible methodologies. This has led to contradictory findings across studies, with the best scorer in one study ranking among the worst in another. In this work, we introduce a benchmark for transferability scorers under a standardized evaluation protocol. As a baseline, our benchmark uses the model's accuracy on its source task (e.g., ImageNet), which surprisingly outperforms most existing scorers. We then conduct a large-scale empirical study, evaluating 13 scorers across 10 vision model architectures and 11 datasets. Finally, we propose a novel combined scorer that leverages information from multiple individual scorers and demonstrate that it consistently outperforms all of them and the baseline.
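A combined scorer of the kind proposed above can be sketched by normalizing each individual scorer across the candidate pool and averaging, so scorers with different scales contribute comparably to the final ranking. The z-score aggregation here is an editorial assumption, not necessarily the paper's combination rule.

```python
def zscores(values):
    """Standardize a list of scores to zero mean and unit variance
    (constant lists map to all zeros via the std fallback)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0
    return [(v - mean) / std for v in values]

def combined_scorer(score_table):
    """score_table: {scorer_name: [score for each candidate model]}.
    Normalize each scorer across candidates, then average per model."""
    per_scorer = [zscores(v) for v in score_table.values()]
    return [sum(col) / len(per_scorer) for col in zip(*per_scorer)]
```

Candidates are then ranked by the combined value, exactly as any single scorer would be used.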
URL: https://openreview.net/forum?id=3i2ZRk8GDN
---
Title: Optimized Graph Structures for Calibrating Graph Neural Networks with Out-of-Distribution Nodes
Abstract: Graph neural networks (GNNs) have achieved remarkable success in tasks such as node classification, link prediction, and graph classification. However, despite their effectiveness, the reliability of GNN predictions remains a major concern, particularly when graphs contain out-of-distribution (OOD) nodes. To date, the calibration of GNNs in the presence of OOD nodes remains largely underexplored. Our empirical studies reveal that the calibration problem becomes significantly more complex in the presence of OOD nodes, and existing calibration methods are notably less effective in such scenarios. Recently, graph structure learning~(GSL), a family of data-centric learning approaches, has shown promise in mitigating the adverse effects of noisy information in graph topology by jointly optimizing the graph structure and GNN training. However, current GSL methods do not explicitly address the calibration challenges posed by OOD nodes. To tackle this challenge, we propose a novel framework called Graph Calibration via Structure Optimization~(GCSO) to calibrate GNNs in the presence of OOD nodes. Our empirical findings suggest that reducing the weights of edges connecting in-distribution~(ID) and OOD nodes can effectively alleviate the calibration issue. However, identifying such edges and determining their appropriate weights is challenging due to the unknown distribution of OOD nodes. To address this, GCSO introduces an iterative edge-sampling mechanism that captures the topological information of the graph and formulates the structure learning process as a Markov Decision Process (MDP). We then leverage an actor–critic method to dynamically adjust edge weights and evaluate their impact on target node predictions. Additionally, we design a tailored reward signal to guide the policy function toward an optimal adaptive graph structure that minimizes the influence of OOD nodes. 
Notably, the optimized graph structure can be seamlessly integrated with existing temperature scaling–based calibration techniques for further performance gains. Experimental results on benchmark datasets demonstrate that our method significantly reduces the expected calibration error while maintaining competitive accuracy.
URL: https://openreview.net/forum?id=Y1W3Z3Z6i8
---
Title: Online Learning and Unlearning: Efficient Algorithms with Near-Optimal Regret Guarantees
Abstract: We formalize online learning-unlearning (OLU) in the Online Convex Optimization (OCO) setting, where a learner updates a model sequentially on a stream of convex losses while accommodating occasional unlearning requests between updates. We require that after a deletion, the distribution of all future outputs is statistically indistinguishable from that of a learner trained on the same stream with the deleted data removed. We propose two OLU algorithms based on Online Gradient Descent (OGD). Passive OLU leverages the contractive dynamics of OGD and injects calibrated noise, incurring no additional computation beyond standard OGD; however, its regret depends on the deletion schedule. We then introduce a schedule-robust variant that mitigates this dependence. Active OLU employs an offline unlearning algorithm to actively steer the online iterate toward the corresponding retrained trajectory. Under standard convexity and smoothness assumptions, our methods achieve regret comparable to standard OGD, demonstrating that strong online unlearning guarantees can be achieved with minimal loss in learning performance.
URL: https://openreview.net/forum?id=lyvCVfTdiY
---
Title: VQEL: Enabling Self-Play in Emergent Language Games via Agent Internal Vector Quantization
Abstract: Emergent Language (EL) focuses on the emergence of communication among artificial agents. Although symbolic communication channels more closely mirror the discrete nature of human language, learning such protocols remains fundamentally difficult due to the non-differentiability of symbol sampling. Existing approaches typically rely on high-variance gradient estimators such as REINFORCE or on continuous relaxations such as Gumbel–Softmax, both of which suffer from limitations in training stability and scalability when learning a language from scratch. Motivated by cognitive theories that emphasize intrapersonal processes preceding communication, we explore self-play as a substrate for language emergence prior to mutual interaction. We introduce Vector Quantized Emergent Language (VQEL), a novel architecture that incorporates vector quantization into the message generation process. VQEL enables agents to perform self-play using discrete internal representations derived from a learned codebook while preserving end-to-end differentiability. By grounding the vocabulary through dense gradients in self-play, VQEL completely avoids the cold-start instability of reinforcement learning. The resulting vector-quantized codebook naturally induces a symbolic vocabulary that serves as a highly robust initialization for subsequent REINFORCE-based fine-tuning during mutual play with other agents. Empirical results show that agents pretrained via VQEL self-play achieve more consistent symbol alignment and higher task success when later engaged in mutual interaction. These findings position self-play as a principled and effective mechanism for learning discrete communication protocols, addressing key optimization and representational challenges in emergent language systems.
URL: https://openreview.net/forum?id=5nqQlGWlsW
---
Title: MEGA: Message Passing Neural Networks for Multigraphs with EdGe Attributes
Abstract: Edge-attributed multigraphs, in which multiple edges with distinct attributes connect the same pair of nodes, arise naturally in many real-world systems. In these graphs, effective learning requires preserving information from repeated interactions while distinguishing contributions from different neighbors. Existing neural network solutions for edge-attributed multigraphs remain limited: some lose information from repeated interactions, while others break permutation equivariance. To address this, we introduce neighbor-aware aggregation, an operator that first combines multi-edge features for each neighbor and then aggregates across neighbors. This operator captures per-neighbor statistics that standard single-stage aggregation cannot represent. Building on this operator, we present MEGA-GNN, a model-agnostic message-passing framework for edge-attributed multigraphs. We show that MEGA-GNN is permutation equivariant and has the same asymptotic complexity as standard GNNs with edge updates. We evaluate our approach on datasets from social networks and financial transaction networks. Neighbor-aware aggregation consistently improves GNN performance and matches or surpasses state-of-the-art methods.
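The two-stage neighbor-aware aggregation operator can be sketched in a few lines: combine parallel-edge features per neighbor first, then aggregate the per-neighbor summaries across neighbors. The choice of mean for stage one and sum for stage two is an illustrative assumption, not MEGA-GNN's exact operator.

```python
def neighbor_aware_aggregate(edges):
    """Two-stage aggregation for edge-attributed multigraphs.
    edges: list of (neighbor_id, edge_feature) pairs, where the same
    neighbor may appear multiple times (parallel edges)."""
    per_neighbor = {}
    for nbr, feat in edges:
        per_neighbor.setdefault(nbr, []).append(feat)
    # stage 1: combine parallel edges per neighbor (here: mean)
    summaries = {n: sum(fs) / len(fs) for n, fs in per_neighbor.items()}
    # stage 2: aggregate across distinct neighbors (here: sum)
    return sum(summaries.values())
```

Single-stage aggregation over all edges would blur the two stages together, e.g. a plain mean over the edge list cannot recover a per-neighbor mean followed by a sum.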
URL: https://openreview.net/forum?id=Jo4KBvnMIP
---
Title: U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations
Abstract: As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.
URL: https://openreview.net/forum?id=6YBMR1u1N5
---
Title: A Meta-Analysis of Machine Learning Security Research: Attack-Defense Dynamics, Technical Evolution, and Cross-Concept Patterns
Abstract: Machine learning (ML) security research has grown rapidly, yet systematic understanding of its technical evolution, attack-defense dynamics, and cross-concept patterns remains limited.
This study presents a systematic meta-analysis of 1,591 security papers spanning six security topics and five ML concepts, released between January 1, 2018, and June 30, 2024.
Beyond analyzing research trends, we quantify the attack-defense imbalance across all concept-topic combinations, revealing that defense research significantly lags behind attack research in emerging areas such as LLM jailbreaks (defense ratio = 0.30) and text-to-image membership inference (0.20).
Using LLM-assisted annotation of all paper abstracts, we identify 32 distinct technique families and trace their evolution over time, finding that attack techniques such as backdoor injection and adversarial perturbation first appeared in federated learning and graph neural networks before being adopted in LLM and text-to-image model security research.
We further identify 17 technique families shared across multiple ML concepts, with six spanning all five concepts studied.
Additionally, we examine factors associated with academic influence, finding that ML concepts, security topics, author count, regions, collaboration patterns, and publication status are all statistically significantly associated with citation density.
Our findings highlight critical defense gaps, map the technical landscape of ML security, and suggest concrete directions for future research.
URL: https://openreview.net/forum?id=gDBxeI5FWl
---
Title: Geometry-Aware MCTS for Extremal Problems in Combinatorial Geometry
Abstract: We study certain extremal problems in combinatorial geometry that ask about configurations of points in an $n \times n$ grid that satisfy strict, global geometric constraints. Classical exact solvers suffer from combinatorial explosion for these types of problems, and standard reinforcement learning and transformer-based models struggle with the sparse reward "validity cliff" and quadratic token-consumption limits. To overcome these bottlenecks, we propose a Geometry-Aware Monte Carlo Tree Search (MCTS) framework. Our approach strictly enforces geometric constraints through incremental updates to the feasible action space. For constraints about collections of collinear points, like those that occur in the classic No-Three-in-Line problem (Max-N3IL), this mechanism reduces the constraint checking complexity from $O(n^4)$ to $O(n^2)$. To improve search efficiency, we exploit geometric symmetries in two ways: canonical pruning during node expansion to reduce the branching factor, and symmetric batch transitions to accelerate the discovery of promising configurations. We perform extensive experiments and establish new best-known computational results on five out of six of the problems that we considered. Notably, for Max-N3IL we find configurations of size roughly $1.8 n$ for grids of size $82 \le n \le 119$. For the Smallest Complete Set problem, we find configurations of size roughly $0.95 n$, providing new upper bounds within the tested grids. This work establishes Geometry-Aware MCTS as a highly adaptable framework for discovering novel configurations in combinatorial geometry.
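The incremental feasible-action update for No-Three-in-Line can be sketched as follows: placing a point prunes every grid cell collinear with it and any previously chosen point, so each move costs a handful of O(n) line walks rather than rechecking all O(n^2) cells against all pairs. The set-based representation and function names are editorial assumptions, not the paper's data structure.

```python
from math import gcd

def line_points(p, q, n):
    """All cells of the n x n grid on the line through distinct points p, q."""
    (x1, y1), (x2, y2) = p, q
    dx, dy = x2 - x1, y2 - y1
    g = gcd(abs(dx), abs(dy))
    dx, dy = dx // g, dy // g  # primitive step along the line
    pts = set()
    for sign in (1, -1):  # walk both directions from p
        x, y = x1, y1
        while 0 <= x < n and 0 <= y < n:
            pts.add((x, y))
            x, y = x + sign * dx, y + sign * dy
    return pts

def add_point(chosen, feasible, p, n):
    """Place p, then prune every still-feasible cell collinear with p
    and any previously chosen point, keeping the action set exact."""
    feasible.discard(p)
    for q in chosen:
        feasible -= line_points(p, q, n) - {p, q}
    chosen.append(p)
```

Starting from `feasible` = all cells, repeatedly sampling from `feasible` and calling `add_point` enforces the no-three-in-line constraint by construction, which is the "validity cliff" the MCTS never has to fall off.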
URL: https://openreview.net/forum?id=xtlbkvz6qb
---
Title: DeBLAS: Accelerate LLM Pretraining by Length-based Sequence Scheduling
Abstract: Pretraining large language models (LLMs) is computationally intensive, typically requiring massive datasets and training iterations. Although recent advances in data selection have shown improvement in training efficiency, their gains often diminish under scaling laws. In this work, we dive into the impact of sequence length on language model pretraining and propose a length-based online data scheduling method for acceleration. Specifically, we design a dense-balanced sequence scheduling framework for LLM pretraining: 1) at the first stage, the model is exposed to uniform-length dense token batches to encourage the formation of global language representations; 2) the second stage incorporates variable-length sequences, which reinforces learned abstractions while significantly reducing the total number of training iterations. We prove that the model internalizes the foundational language knowledge during the dense-batch phase, allowing it to optimize more efficiently on the latter variable-length sequences. Empirical results show that our approach achieves comparable perplexity to standard pretraining with substantially fewer optimization steps, pinpointing a promising way to reduce the computational burden of LLM pretraining.
URL: https://openreview.net/forum?id=s0MTAeYvww
---
Title: Support Vector Machines: A more certain estimate of uncertainty
Abstract: This paper explores the potential of Support Vector Machines (SVMs) through the lens of uncertainty quantification (UQ) for regression and forecasting tasks. Unlike the NN-based approaches commonly used for UQ, SVM-based methods are more stable, sparse, and interpretable, and offer well-defined optimal solutions. However, only a limited literature addresses UQ in SVM-based prediction. We first provide a comprehensive summary of existing Prediction Interval (PI) estimation and probabilistic forecasting methods developed in the SVM framework. Although SVMs offer globally optimal and stable solutions, the existing literature on UQ within the SVM framework still exhibits several critical gaps. In this work, we address these gaps and extend contemporary UQ techniques to SVMs, promoting their applicability across diverse domains for more reliable estimation. Our major contributions include the development of sparse SVM models for PI estimation and probabilistic forecasting, an investigation of the role of feature selection in PI estimation, and the extension of SVM regression to the Conformal Regression (CR) setting to construct more stable prediction sets with finite-sample guarantees. Extensive numerical experiments highlight that SVM-based UQ methods yield PIs and probabilistic forecasts that are less uncertain than those produced by modern deep learning and neural network models, particularly for small- and moderate-scale datasets.
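The Conformal Regression extension mentioned above builds on the standard split conformal recipe, which can be sketched around any point regressor (such as a fitted SVM): calibrate on held-out absolute residuals, then widen new predictions by a conservative finite-sample residual quantile. The function signature below is an illustrative assumption, not the paper's code.

```python
import math

def split_conformal_interval(predict, X_cal, y_cal, x_new, alpha=0.1):
    """Split conformal prediction interval around any point regressor:
    the interval covers the true value with probability >= 1 - alpha
    under exchangeability, regardless of the underlying model."""
    residuals = sorted(abs(y - predict(x)) for x, y in zip(X_cal, y_cal))
    n = len(residuals)
    # conservative index: ceil((n + 1) * (1 - alpha))-th smallest residual
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = residuals[k]
    y_hat = predict(x_new)
    return y_hat - q, y_hat + q
```

Because the guarantee is model-agnostic, swapping the neural predictor for a sparse SVM regressor keeps the finite-sample coverage while changing only the interval width.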
URL: https://openreview.net/forum?id=ZdzIDoGeOC
---
Title: The Landscape of Medical Agents: A Survey
Abstract: Medical Agents are an emerging class of agentic systems deployed in clinical settings that operate over multimodal, longitudinal data, maintain internal state, plan and adapt sequences of actions, and interact with clinical information systems under governance constraints. They extend traditional medical artificial intelligence (MedAI) beyond narrow diagnostic and predictive models toward workflow-centric architectures that address persistent challenges such as administrative burden, fragmented workflows, and workforce strain. In this paper, we (i) propose a functional definition and three-level developmental roadmap for Medical Agents, linking architectural capabilities (planning, memory, tool use, long-horizon control) to degrees of workflow integration and autonomy; (ii) map representative deployments across hospital departments and tasks, including domain-specific agents and multi-agent hospital simulations; and (iii) synthesize cross-cutting challenges in safety, robustness, fairness, evaluation, and governance, outlining research directions for advancing capabilities under clinical constraints and achieving system-level impact. We argue that Medical Agents should be treated as emerging infrastructure for learning health systems, whose value will be measured less by benchmark accuracy than by reliable restructuring of clinical workflows.
URL: https://openreview.net/forum?id=4pwRl2G6je
---
Title: Weaves, Wires, and Morphisms: Formalizing and Implementing the Algebra of Deep Learning
Abstract: Although deep learning models compute well-defined mathematical functions, we lack a formal mathematical framework for describing model architectures. Ad-hoc notation, diagrams, and pseudocode poorly handle nonlinear broadcasting and the relationship between individual components and composed models. This paper introduces a categorical framework for deep learning models that formalizes broadcasting through the novel axis-stride and array-broadcasted categories. This allows the mathematical function underlying an architecture to be precisely expressed and manipulated in a compositional manner. These mathematical definitions are translated into human-manageable diagrams and machine-manageable data structures. We provide a mirrored implementation in Python and TypeScript to show the universality of our framework, along with features including algebraic construction, graph conversion, PyTorch compilation, and diagram rendering. This lays the foundation for a systematic, formal approach to deep learning model design and analysis.
URL: https://openreview.net/forum?id=GiO8eom0jD
---
Title: Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
Abstract: Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi-agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open-ended domain is reward hacking, where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation. To address these limitations, we propose an RL framework explicitly tailored for high-quality scientific idea generation. We propose the first multi-agent reward function designed to serve as a judge, decoupling methodological validation from implementation details while providing strict binary rewards that are robust to reward hacking. To effectively optimize against this sparse signal, we utilize an unbiased variant of Group Relative Policy Optimization to mitigate artificial length bias. We ground our training in ICLR-320, a curated dataset of problem-solution pairs extracted from ICLR 2024 proceedings. Experiments demonstrate that our framework significantly outperforms state-of-the-art baselines across expert-evaluated metrics of novelty, feasibility, and effectiveness. Our code is available at https://anonymous.4open.science/r/debate-as-reward-80AF.
URL: https://openreview.net/forum?id=OMRVm1OXJh
---
Title: LTLBench: Towards Benchmarks for Evaluating Temporal Reasoning in Large Language Models
Abstract: Temporal Reasoning (TR) is a critical ability for LLMs to understand and reason over temporal information and relationships between events. Prior works have proposed various ways of evaluating different aspects of TR ability. In this work, we propose an alternative perspective for evaluating TR by leveraging Linear Temporal Logic (LTL), and develop a pipeline to automatically synthesize challenges for assessing the TR ability of LLMs. Based on this pipeline, we construct LTLBench, a dataset of $2000$ TR challenges, and benchmark 12 LLMs across 5 different methods. Furthermore, we conduct additional experiments to investigate how increasing the number of formula operators and events affects both LLM performance and the complexity of TR problems. Qualitative analyses of the models' reasoning processes reveal 3 main issues in their temporal reasoning and unexpected performance changes as problem complexity increases. We expect this work to provide valuable insights into the TR ability of LLMs.
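To make the LTL setting concrete, here is a minimal finite-trace evaluator for a small LTL fragment (an editorial sketch; the tuple encoding and finite-trace semantics are assumptions and not part of the LTLBench pipeline).

```python
def holds(formula, trace, i=0):
    """Evaluate a tiny LTL fragment at position i of a finite trace.
    formula: ('ap', p) | ('not', f) | ('and', f, g) | ('next', f)
             | ('eventually', f) | ('globally', f) | ('until', f, g)
    trace: list of sets of atomic propositions true at each step.
    Finite-trace convention: 'next' is False at the last position."""
    op = formula[0]
    if op == 'ap':
        return formula[1] in trace[i]
    if op == 'not':
        return not holds(formula[1], trace, i)
    if op == 'and':
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == 'next':
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == 'eventually':
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == 'globally':
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == 'until':
        # f holds until g: g at some j >= i, with f at every step before j
        return any(holds(formula[2], trace, j) and
                   all(holds(formula[1], trace, k) for k in range(i, j))
                   for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")
```

A synthesis pipeline in this spirit would generate a random formula and trace, evaluate the ground truth with such an oracle, and ask the LLM the same question in natural language.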
URL: https://openreview.net/forum?id=6tP2fbRCIA
---
Title: TestDG: Test-time Domain Generalization for Continual Test-time Adaptation
Abstract: This paper studies continual test-time adaptation (CTTA), the task of adapting a model to constantly changing unseen domains in testing while preserving previously learned knowledge. Existing CTTA methods mostly focus on adaptation to the current test domain only, overlooking generalization to arbitrary test domains a model may face in the future. To tackle this limitation, we present a novel online test-time domain generalization framework for CTTA, dubbed TestDG. TestDG aims to learn features invariant to both current and previous test domains on the fly during testing, improving the potential for effective generalization to future domains. To this end, we propose a new model architecture and a test-time adaptation strategy dedicated to learning domain-invariant features, along with a new data structure and optimization algorithm for effectively managing information from previous test domains. TestDG achieves state-of-the-art performance on four public CTTA benchmarks. Moreover, it shows superior generalization to unseen test domains.
URL: https://openreview.net/forum?id=qZ1HuY07Ud
---
Title: Poisoning the Inner Prediction Logic of Graph Neural Networks for Clean-Label Backdoor Attacks
Abstract: Graph Neural Networks (GNNs) have achieved remarkable results in various tasks. Recent studies reveal that graph backdoor attacks can poison the GNN model to predict test nodes with triggers attached as the target class. However, apart from injecting triggers to training nodes, these graph backdoor attacks generally require altering the labels of trigger-attached training nodes into the target class, which is impractical in real-world scenarios. In this work, we focus on the clean-label graph backdoor attack, a realistic but understudied topic where training labels are not modifiable.
According to our preliminary analysis, existing graph backdoor attacks generally fail under the clean-label setting. Our further analysis identifies that the core failure of existing methods lies in their inability to poison the prediction logic of GNN models, leading to the triggers being deemed unimportant for prediction. Therefore, we study a novel problem of effective clean-label graph backdoor attacks by poisoning the inner prediction logic of GNN models.
We propose BA-Logic to solve the problem by coordinating a poisoned node selector and a logic-poisoning trigger generator.
Extensive experiments on real-world datasets demonstrate that our method effectively enhances the attack success rate and surpasses state-of-the-art graph backdoor attack competitors under clean-label settings.
Our code is available at https://anonymous.4open.science/r/BA-Logic.
URL: https://openreview.net/forum?id=VapGtxAsBT
---
Title: AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection
Abstract: Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detectors that rely on surface statistics or likelihood-based signals. We propose AEyeDE, an attribution-driven approach to human-AI authorship detection that leverages model attention as a discriminative signal. Specifically, we extract attention-based attribution matrices for both human- and AI-generated text using a proxy Transformer model with white-box access and train a lightweight Convolutional Neural Network to learn representations from these attribution maps. Across encoder-decoder translation settings, our method consistently outperforms a text-only baseline. In decoder-only settings, it performs strongly in generator-specific detection, remains competitive on standard benchmarks, and shows robustness under cross-dataset transfer and alternative-spelling perturbations. We further show that attention maps exhibit recurring local structures whose relative frequencies differ consistently between human- and AI-generated text across datasets and proxy models. These findings suggest that attention-based attribution maps provide a complementary and interpretable signal for AI-generated text detection. We will make the code publicly available to support future research.
URL: https://openreview.net/forum?id=k5jHFU5BYm
---
Title: The Deepfake Defense Stack: Why No Single Layer Works and How They Must Compose
Abstract: Every defense against AI-synthesized media, whether passive detection, invisible watermarking, or content provenance, has been shown to fail when deployed in isolation. Detectors suffer 45-50% accuracy degradation from laboratory to deployment and collapse on outputs from unseen generator architectures. Watermarks are removable by regeneration attacks and screenshot capture. Provenance metadata is stripped by most social media platforms. Yet no prior work has formally analyzed how these defenses compose: which attack classes each layer blocks, where cascade failures propagate, and what residual vulnerabilities survive the full stack.
We present the first composition analysis of deepfake defenses. Through a Defense Composition Matrix covering 58 detection methods, 23 proactive defense systems, and 7 adversarial attack classes, we map the interaction between three defense layers (detection, watermarking, provenance) and seven attack classes. We identify two attack classes that penetrate all three layers simultaneously, one that bypasses the primary trust layer, and three emergent composition patterns where stacking defenses creates vulnerabilities absent from any individual layer. We formulate a Detection Ceiling Conjecture arguing, with supporting evidence, that post-hoc detection faces an information-theoretic bound that provenance-based approaches do not share. Our composition analysis draws on 190 papers spanning generation, detection, proactive defense, adversarial attacks, benchmarks, and the societal impact of AI-synthesized media across the 2014-2026 period. We provide focused technical background for each defense layer (Sections 3-5) sufficient to support the composition analysis; readers seeking comprehensive coverage of individual layers should consult the dedicated surveys cited in Table 1. We identify eight open problems with falsifiable hypotheses and proposed experimental protocols.
URL: https://openreview.net/forum?id=TMdNxTE4tk
---
Title: Understanding Transformer-Based Vision Models via Modular Feature Inversion
Abstract: Understanding the internal mechanisms of deep neural networks remains a central challenge in machine learning. In computer vision, one promising yet only preliminarily explored approach is feature inversion via inverse networks, which reconstructs images from intermediate representations using trained inverse networks. In this study, we revisit feature inversion via inverse networks, introducing a novel, modular variant that enables a computationally more efficient application of the technique while producing semantically more coherent image reconstructions. We apply our method to large-scale transformer-based vision models, specifically Detection Transformer, Vision Transformer, Swin Transformer, and Data-Efficient Image Transformer, analyzing the resulting reconstructions across network depth. Our results reveal shared properties and systematic differences in how these architectures process visual information, including their handling of contextual shape, fine-grained image detail, inter-layer representational similarity, and robustness to color perturbations. These findings contribute to a deeper understanding of transformer-based vision models and demonstrate the utility of modular feature inversion as an interpretability tool.
URL: https://openreview.net/forum?id=O5sMv2o3EV
---
Title: Adaptive Online Convex Optimization via Sparse-Low-Rank Gradient Decomposition
Abstract: Adaptive gradient methods for online convex optimization must choose between diagonal preconditioning, which exploits gradient sparsity, and full-matrix preconditioning, which exploits low-rank structure, but not both simultaneously. We propose SLR-FTRL, an algorithm that decomposes the gradient preconditioner into sparse and low-rank components via an online robust PCA oracle, runs two parallel FTRL sub-algorithms with structurally matched preconditioners, and combines their outputs through a coin-betting meta-algorithm that requires no knowledge of the structural parameters. We prove a regret bound of the form $\min_{\alpha^*}\{\alpha^* R_T^L + (1-\alpha^*) R_T^S\} + \widetilde{O}(\sqrt{T})$ that recovers the classical diagonal and full-matrix AdaGrad guarantees as special cases when one structure is absent, with all cross-contamination and preconditioner-lag corrections explicitly retained. We further establish per-term lower bounds showing that the structural dependence on $\sqrt{r \cdot \sigma_1(\mathbf{G}_T^L)}$ and $\sqrt{s \cdot \|\mathbf{G}_T^S\|_\infty}$ is individually tight up to constants. Experiments on online regression with structured gradients confirm the theoretical predictions, demonstrating sublinear regret, pure-case recovery, dimension independence, and graceful degradation under decomposition noise.
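The sparse-plus-low-rank split at the core of SLR-FTRL can be illustrated with a minimal numerical sketch. Everything here is an assumption for illustration: a batch SVD-plus-thresholding split stands in for the paper's online robust PCA oracle, and `split_sparse_low_rank` / `matched_preconditioners` are hypothetical names, not the authors' implementation.

```python
import numpy as np

def split_sparse_low_rank(G, rank, thresh):
    """Crude sparse + low-rank split of a matrix of gradients G (rows = rounds).

    Batch stand-in for an online robust-PCA oracle: truncated SVD gives the
    low-rank part, hard thresholding of the residual gives the sparse part."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # low-rank component
    S = np.where(np.abs(G - L) > thresh, G - L, 0.0)  # sparse residual
    return S, L

def matched_preconditioners(S, L, eps=1e-6):
    """AdaGrad-style structurally matched preconditioners: a diagonal one
    from the sparse part, a full-matrix one from the low-rank part."""
    D = np.sqrt((S ** 2).sum(0)) + eps                        # diagonal
    H = np.linalg.cholesky(L.T @ L + eps * np.eye(L.shape[1]))  # full-matrix
    return D, H
```

A real implementation would update this split online and feed `D` and `H` into two parallel FTRL updates whose outputs a coin-betting meta-algorithm combines, as the abstract describes.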
URL: https://openreview.net/forum?id=V0DfJ7DTyW
---
Title: FedPolicy: An RL-Guided Redistribution Policy for Synergizing Local-Global Optimization in Federated Learning
Abstract: Statistical heterogeneity remains a central challenge in federated learning. Existing methods primarily address this problem through improved local objectives, aggregation strategies, or personalization mechanisms, while the post-aggregation redistribution step is typically applied uniformly and receives little explicit treatment. This design becomes problematic under heterogeneous client distributions, where repeatedly overwriting local models with the same aggregated parameters can disrupt client-specific adaptation and induce negative transfer. We propose \textsf{FedPolicy}, an RL-guided post-aggregation redistribution framework that treats the return path of the aggregated model as a client-specific decision problem. Rather than broadcasting the same update to every client, FedPolicy learns which part of the aggregated model should be transferred back to each client by selecting among full-model, backbone-only, and head-only parameter blocks. This formulation identifies post-aggregation redistribution as a previously underexplored control axis in federated optimization, improving the balance between global transfer and local specialization. Extensive experiments under heterogeneous federated settings show that FedPolicy consistently outperforms strong baselines across FMNIST, CIFAR-10, and CIFAR-100, with the clearest gains appearing in the more challenging heterogeneous regimes. Across all datasets and heterogeneity settings, FedPolicy achieves an average relative gain of $3.93\%$ over the strongest baseline, with the largest improvement reaching $8.40\%$ on CIFAR-100 under severe heterogeneity, while converging faster and delivering a more favorable cost-to-accuracy trade-off with negligible overhead. These results highlight client-specific post-aggregation redistribution as an underexplored yet impactful design dimension in heterogeneous federated learning.
URL: https://openreview.net/forum?id=HsuRbMi23T
---
Title: Trust Region Policy Optimization for Functional Linear Policies
Abstract: Reinforcement Learning (RL) tasks where the states are given by spatial or temporal measurements often lead to high-dimensional state spaces, making function approximation difficult and unstable. We adapt the classic RL framework to allow the direct use of the inherent functional state, which can be estimated from the discrete measurements. We propose a suitable family of policies based on functional linear models, allowing us to take actions conditionally on functional states. Moreover, we extend Trust Region Policy Optimization (TRPO) to improve such policies and address the challenge of operator inversion in infinite-dimensional spaces using techniques from Functional Data Analysis (FDA). Furthermore, we implement Proximal Policy Optimization (PPO) for these policies. In experiments on three PDE control tasks, functional policies yield more stable training and achieve better performance than multilayer perceptron policies, highlighting the benefits of functional representations in RL.
URL: https://openreview.net/forum?id=mRNOs9Y8t1
---
Title: Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving
Abstract: Large vision language models exhibit notable limitations on Geometry Problem Solving (GPS) because of their unreliable diagram interpretation and pure natural-language reasoning. A recent line of work mitigates this by using symbolic solvers: the model directly generates a formal program that a solver can execute. However, this direct program generation lacks intermediate reasoning, making the program prone to errors. In this work, we explore integrating Chain-of-Thought (CoT) with formal language. The model interleaves natural language reasoning with incremental emission of solver-executable code, producing a hybrid reasoning trace in which critical derivations are expressed in formal language. To teach this behavior at scale, we combine (1) supervised fine-tuning on an 11K newly developed synthetic dataset with automatic formalization and interleaved formal-natural language reasoning, and (2) solver-in-the-loop reinforcement learning that jointly optimizes both the CoT narrative and the resulting program. Built upon Qwen2.5-VL, our 7B model outperforms other models at the same scale by up to 18% on GPS benchmarks, while our 32B model achieves performance on par with leading models such as Gemini-2.5-Pro with 2.2× higher token efficiency. Furthermore, we present a comprehensive analysis of method design choices (e.g., reasoning paradigms, data synthesis strategies, and training methodologies), providing actionable insights for future research.
URL: https://openreview.net/forum?id=6eaWuhhdi8
---
Title: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while improving Pass@1 through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches---whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes---treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors---incorrect reasoning paths that the RL process has spuriously reinforced---to persist and monopolize probability mass, suppressing valid exploratory trajectories.
We propose the Asymmetric Confidence-aware Error Penalty (ACE), which introduces a per-rollout confidence shift metric to dynamically modulate negative advantages. Theoretically, we show that ACE's gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. Experiments fine-tune Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO with VERL, evaluating on MATH-500 and AIME 2025. ACE yields the strongest and most consistent gains on the two Qwen families, and on Llama-3.1-8B-Instruct ACE-GRPO delivers modest but consistent large-k gains over GRPO, indicating partial robustness beyond the primary model family.
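A minimal sketch of the asymmetric-penalty idea, under stated assumptions: the per-rollout confidence shift is approximated here by the change in mean token log-probability relative to a reference policy, and the multiplicative modulation of negative advantages is illustrative; the paper's exact ACE metric and schedule may differ.

```python
import numpy as np

def ace_advantages(rewards, logp_now, logp_ref, beta=1.0):
    """Asymmetric confidence-aware error penalty, sketched for one GRPO group.

    rewards: 0/1 verifiable rewards per rollout in the group.
    logp_now/logp_ref: mean token log-probs under current and reference policy;
    their difference is a stand-in for the per-rollout confidence shift.
    Negative advantages of overconfident errors (shift > 0) are amplified,
    while correct rollouts and non-overconfident errors are left unchanged."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-normalized
    shift = logp_now - logp_ref
    scale = np.where((adv < 0) & (shift > 0), 1.0 + beta * shift, 1.0)
    return adv * scale
```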
URL: https://openreview.net/forum?id=0PvOtIyqsh
---
Title: Centroid-Referenced Mahalanobis Matching (CRM): A Scalable, Representation-Based Framework for Causal Inference in Large Observational Studies
Abstract: Matching for causal inference faces two persistent challenges: computational intractability at modern data scales, and implicit, unreported estimand changes under limited covariate overlap. We propose Centroid-Referenced Mahalanobis Matching (CRM), which reframes matching as distributional alignment in a two-dimensional geometric space---a Mahalanobis distance and a Fisher discriminant projection---achieving $O(np^2)$ computation with no pairwise search.
The paper makes three contributions. (C1) Bias decomposition: under representation sufficiency and Lipschitz smoothness, estimation error is governed by representation error in $Z$-space rather than marginal covariate balance---explaining why CEM achieves near-zero MaxSMD yet the highest estimation bias on our benchmark. (C2) Pre-matching shortage diagnostic: the fraction $\hat{\pi}$ of treated units lacking any matched counterpart is reported before any unit is discarded, with a provable bound on the gap between the full ATT and the identified estimand. (C3) Rate: the estimator achieves the $O(n^{-1/2})$ MSE rate in fixed representation dimension---matching propensity score subclassification without fitting a treatment model.
On the Criteo benchmark ($n_T$ up to $200{,}000$, true ATE known from randomization), CRM achieves lower MaxSMD than PSM in all 12 large-scale configurations while retaining $\geq 99.4\%$ of treated units.
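The two-dimensional representation the abstract describes can be sketched as follows; this is a hedged illustration, with the pooled-centroid Mahalanobis distance and the plug-in Fisher direction as assumptions about the construction, and `crm_representation` a hypothetical name.

```python
import numpy as np

def crm_representation(X, t):
    """Map each unit to a 2-D point: (Mahalanobis distance to the pooled
    centroid, Fisher discriminant projection). Matching then happens in
    this 2-D space, avoiding any O(n^2) pairwise search.

    X: (n, p) covariates; t: (n,) binary treatment indicator."""
    mu = X.mean(0)
    cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])  # regularized covariance
    P = np.linalg.inv(cov)
    Xc = X - mu
    d = np.sqrt(np.einsum('ij,jk,ik->i', Xc, P, Xc))      # Mahalanobis axis
    w = P @ (X[t == 1].mean(0) - X[t == 0].mean(0))       # Fisher direction
    return np.column_stack([d, X @ w])                     # (n, 2), O(n p^2)
```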
URL: https://openreview.net/forum?id=z74epfCe3A
---
Title: Aligning CLIP Visual Features with Object Masks via Spatial Remapping for Open-Vocabulary Segmentation
Abstract: Open-vocabulary panoptic segmentation aims to segment and categorize objects from novel classes. Existing methods typically follow a two-stage process: first generating object masks, then classifying objects by comparing text embeddings with object embeddings obtained via mask pooling on CLIP visual features. However, because CLIP was pretrained without object mask supervision, its visual features poorly align with object masks, which impedes accurate category matching. To tackle this challenge, certain approaches ensemble object embeddings from CLIP visual features with those from newly learned segmentation features. However, because the newly learned features lack pretraining on large-scale datasets, they generalize poorly to unseen categories. To address these limitations, we introduce a spatial remapping module for CLIP, tailored for open-vocabulary segmentation. This module spatially remaps CLIP visual features to better align with object masks by leveraging spatial relationships between CLIP visual and segmentation features. Consequently, object embeddings obtained via mask pooling on the remapped CLIP visual features correctly match the text embeddings. This approach eliminates the need for mask pooling on newly learned features, thereby better preserving CLIP’s zero-shot image-text alignment ability. To improve efficiency and accuracy, we model these spatial relationships at the object level and use object mask annotations for supervision. Our model, trained on the COCO panoptic dataset, demonstrates exceptional zero-shot performance across various datasets, affirming its effectiveness.
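The mask-pooling-plus-text-matching classification step the abstract builds on can be sketched in a few lines (a simplified stand-in: the paper uses real CLIP visual and text features, toy tensors here):

```python
import numpy as np

def mask_pool(feats, mask):
    """Average spatial features over an object mask to get one object
    embedding. feats: (H, W, C) visual features; mask: (H, W) boolean."""
    w = mask.astype(feats.dtype)
    return (feats * w[..., None]).sum(axis=(0, 1)) / (w.sum() + 1e-6)

def classify(obj_emb, text_embs):
    """Cosine-similarity matching of one object embedding against the
    per-class text embeddings; returns the best-matching class index."""
    o = obj_emb / np.linalg.norm(obj_emb)
    T = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int((T @ o).argmax())
```

The spatial remapping module proposed in the paper would transform `feats` before this pooling so that the pooled embedding aligns better with the object mask.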
URL: https://openreview.net/forum?id=zSoBcpdAn9
---
Title: Solving In-Table Prediction Problems by Deep Neural Networks with Performance Evaluation Using Synthetic Data
Abstract: Tabular deep learning (TDL) leverages neural networks (NN) to extract patterns from tabular data. Traditional TDL methods follow a supervised learning paradigm, where a target feature is explicitly given. In this work, however, we explore a different approach by employing deep NNs to learn relationships among individual columns within a given table. We investigate whether NNs can predict the values of arbitrarily selected columns in a given table based on the remaining known columns. We call this problem In-Table Prediction (ITB), which is slightly different from table imputation methods and the pretraining task of TDL. Three potential usage scenarios are identified, which, to our best knowledge, have not been extensively studied in the literature. A self-supervised learning approach is applied to address this problem by randomly selecting columns to be masked out and used as learning targets. This work focuses on tabular datasets containing only continuous features. To handle missing values in continuous features, a novel neural layer is proposed to embed both numerical and empty values. Synthetic data is generated based on predefined column relationships, with empty values inserted using two distinct mechanisms. Additionally, an adapted masking strategy is employed to create test data. The performance of three NN architectures, namely MLP, ResNet, and Transformer, is evaluated using the generated synthetic data. We conclude that the attention-based architecture outperforms the other two networks when a sufficiently large number of training examples is available and a relatively large embedding length is chosen.
URL: https://openreview.net/forum?id=iVCuafkWJV
---
Title: On Next-Token Prediction in LLMs: How End Goals Determine the Consistency of Decoding Algorithms
Abstract: Probabilistic next-token prediction trained using cross-entropy loss is the basis of most large language models. Given a sequence of previous values, next-token prediction assigns a probability to each possible next value in the vocabulary. From this, there are many ways to turn these next-token predictions into token sequences. This paper examines a few of these algorithms (greedy/lookahead decoding, random sampling, and temperature-scaled random sampling) and studies their consistency with respect to the user end goals of information retrieval and creative generation through encoding these goals as loss functions. Although the consistency of surrogate losses with respect to a target loss function is a well-researched topic, we are the first to study it in the context of LLMs (to the best of our knowledge). We find that, so long as next-token prediction converges to its true probability distribution, random sampling is consistent with outputting sequences that mimic sampling from the true probability distribution. For other goals, such as minimizing the 0-1 loss on the entire sequence, we show that deterministic decoders have the edge over stochastic decoders. These results reveal a dichotomy between the goals of information retrieval and creative generation for the decoding algorithms. This shows that choosing the correct decoding algorithm based on the desired goal is extremely important, and many of those in common use lack theoretical grounding in numerous scenarios. While there has been empirical evidence for this, this paper gives rigorous theoretical grounding to these results.
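The decoders under study can be sketched directly from a next-token distribution; this toy version assumes `probs` is the model's (converged) next-token distribution and omits lookahead decoding:

```python
import numpy as np

def greedy_decode_step(probs):
    """Deterministic decoding: always pick the argmax token."""
    return int(np.argmax(probs))

def sample_step(probs, temperature=1.0, rng=None):
    """Stochastic decoding: temperature-scaled sampling. temperature=1
    samples from the distribution itself; temperature -> 0 recovers greedy."""
    rng = np.random.default_rng(0) if rng is None else rng
    logits = np.log(np.asarray(probs)) / temperature
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

This makes the dichotomy in the abstract concrete: `sample_step` at temperature 1 mimics the true distribution (the creative-generation goal), while `greedy_decode_step` targets the single most probable continuation (closer to the 0-1-loss retrieval goal).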
URL: https://openreview.net/forum?id=96S5nE51L8
---
Title: Toward Unified Robot Learning: Bridging Representation, Vision-Language-Action, and World Models
Abstract: For robots to operate reliably in real-world environments, they need to perceive their surroundings, act, and reason about the consequences of those actions. Rapid progress in the domains of representation learning, vision-language-action (VLA) models, and world models has significantly enhanced the capabilities of robot learning systems, enabling robots to work in increasingly complex environments. However, these paradigms are typically developed in isolation, resulting in fragmented systems that struggle with generalization, long temporal reasoning and planning, and deployment in unstructured environments. In this survey, we present a unified perspective on robot learning by organizing the existing methods along three complementary axes: understanding through representation learning, acting through VLA models, and reasoning through world models. We introduce a structured taxonomy that captures key design choices in environment representation, policy learning, and predictive modeling, and summarize the recent progress in these domains. Beyond classifying the existing works, we analyze how these components interact, discuss common limitations, and highlight emerging trends towards more integrated systems. Through this lens, we identify the challenges in the domain of robot learning, including uncertainty quantification, out-of-distribution generalization, cross-embodiment transfer, long-context understanding, and long-horizon planning. We argue that these challenges arise not only from limitations within individual components, but from the lack of integration across perception, action, and reasoning. Building on this analysis, we outline future directions towards unified, physically grounded, and probabilistic robot learning to develop robust real-world robot systems that maintain consistent internal representations and support robust decision making over extended interactions in real-world environments.
URL: https://openreview.net/forum?id=AXv7OjfuiT
---
Title: Speedrunning ImageNet Diffusion
Abstract: Recent advances have significantly improved the training efficiency of diffusion transformers. However, these techniques have largely been studied in isolation, leaving unexplored the potential synergies from combining multiple approaches. We present SR-DiT (Speedrun Diffusion Transformer), a framework that systematically integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.14 and KDD 0.290 on ImageNet-256 using only a 140M parameter model at 400K iterations without classifier-free guidance---comparable to results from 685M parameter models trained significantly longer. To our knowledge, this is a state-of-the-art result at this model size. Through extensive ablation studies, we identify which technique combinations are most effective and document both synergies and incompatibilities. We release our framework as a computationally accessible baseline for future research.
URL: https://openreview.net/forum?id=0mYu3uPM3j
---
Title: Mitigating Non-Uniform Forgetting Dynamics for Class-incremental Semantic Segmentation
Abstract: Class-Incremental Semantic Segmentation (CISS) aims to learn newly introduced classes sequentially while preserving performance on previously learned ones. Most existing methods mitigate catastrophic forgetting through pseudo labels or regularization, but largely assume that forgetting evolves uniformly across old classes. In this paper, we reveal and characterize a Non-Uniform Forgetting (NUF) phenomenon in CISS, where different old classes exhibit markedly different forgetting trajectories in terms of degradation severity, onset time, and temporal pattern. Our analysis further shows that NUF is closely related to semantic complexity, semantic overlap, and the inherent old--new supervision imbalance of CISS.
To address this problem, we propose a pseudo-label-assisted framework with two complementary components. The first, Imbalance-Aware Gradient Defence (IGD), alleviates optimization bias through pixel-wise gradient-aware reweighting and channel-wise balancing, while a background-suppression term further reduces spurious foreground activations. The second, Representation Drift Suppressor (RDS), improves representation stability by enhancing inter-class separability with prototype-based contrastive learning and preserving old semantics through selective decoder-level distillation. By jointly combining IGD and RDS, the proposed framework effectively mitigates heterogeneous forgetting and yields more balanced incremental segmentation performance.
Extensive experiments on PASCAL VOC and ADE20K under multiple incremental protocols demonstrate that the proposed method consistently improves old-class retention and overall incremental performance, outperforming state-of-the-art CISS approaches.
URL: https://openreview.net/forum?id=ewcrvzXLMy
---
Title: An Interactive Framework for Finding the Optimal Trade-off in Differential Privacy
Abstract: Differential privacy (DP) is the gold standard for privacy-preserving analysis but introduces a fundamental trade-off between privacy guarantees and model performance. Selecting the optimal balance is a critical challenge, framed as a multi-objective optimization (MOO) problem of discovering the Pareto front and eliciting a decision-maker's preference. While interactive MOO offers a solution, standard approaches---which model objectives separately and rely on simple pairwise feedback---are suboptimal for DP because they do not utilize problem structure. In this work, we propose a method, \textbf{PACE} (\textbf{P}rivacy-\textbf{A}ccuracy \textbf{C}urve \textbf{E}licitation), that exploits two key properties to reduce this inefficiency. First, we leverage the fact that the privacy level naturally serves as a constraint: maximizing accuracy for a fixed privacy level generates a solution on the Pareto front. Second, to efficiently model this trade-off, we theoretically derive the trade-off shape for regularized logistic regression, revealing a characteristic S-curve. This theoretical grounding motivates us to model the Pareto front using a sigmoidal function. We empirically demonstrate its effectiveness across studied DP settings. This model allows us to replace less efficient pairwise comparisons with a richer interaction scheme where decision-makers directly select their most preferred solution from the hypothetical trade-off curve. Experiments on differentially private logistic regression and deep transfer learning across six datasets show that PACE converges to the most preferred trade-off with fewer model evaluations and interactions than baselines.
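The sigmoidal trade-off model can be illustrated with a toy fit; the parametrization (accuracy as a logistic function of $\log \varepsilon$ between the observed extremes) and the coarse grid search are assumptions for illustration, not PACE's actual elicitation procedure:

```python
import numpy as np

def fit_sigmoid(eps, acc):
    """Fit acc ~ a_lo + (a_hi - a_lo) * sigmoid(k * (log eps - m)) by a
    coarse least-squares grid search over slope k and midpoint m.
    Stand-in for PACE's parametric model of the privacy-accuracy curve."""
    x = np.log(np.asarray(eps))
    acc = np.asarray(acc)
    a_lo, a_hi = acc.min(), acc.max()   # plug-in asymptotes
    best = None
    for k in np.linspace(0.1, 10.0, 50):
        for m in np.linspace(x.min(), x.max(), 50):
            pred = a_lo + (a_hi - a_lo) / (1.0 + np.exp(-k * (x - m)))
            err = ((pred - acc) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, k, m)
    return best[1], best[2]
```

With the fitted curve in hand, a decision-maker can point at their preferred spot on the hypothetical trade-off rather than answer many pairwise comparisons, which is the interaction scheme the abstract describes.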
URL: https://openreview.net/forum?id=ire2TQNxfv
---
Title: Predicting LLM Hallucination Risk from Entity Frequency via Rate-Distortion Theory
Abstract: Large language models struggle with factual hallucinations, and mitigating them typically requires executing the model to assess uncertainty. We introduce a query-dependent rate--distortion framework showing that factual accuracy follows a predictable, sigmoidal ``knowledge cliff'' governed by training exposure. Below a critical frequency $f_{\mathrm{crit}}$, the model lacks the representational budget to reliably retrieve facts; above this threshold, accuracy increases precipitously. We present five core findings based on this framework. First, mapping this frequency response yields an accurate, zero-compute risk score; we formally prove that this pre-inference metric achieves $>99\%$ of the theoretical upper bound for any frequency-only predictor. Second, this cliff is heterogeneous, with $f_{\mathrm{crit}}$ varying by up to $76\times$ depending on the query's relation type. Integrating these structural metadata creates a classifier that outperforms the LLM's own post-inference confidence scores in the rare-entity tail. Third, we establish a sample-complexity bound demonstrating that the calibration data required to locate this cliff under Zipfian sampling is independent of its location. Fourth, we isolate the effect of model scale using the Qwen2.5-Instruct family (0.5B--14B, trained on a fixed corpus), revealing a power-law relationship ($f_{\mathrm{crit}} \propto P^{-0.52}$) where a $10\times$ parameter increase yields reliable recall for entities only $\sim 3.3\times$ less frequent. Finally, we show that these metadata-driven signals establish a superior budget--utility frontier for retrieval routing. By demonstrating that reliability is largely determined by query properties prior to generation, this work enables highly efficient, pre-hoc triage for retrieval-augmented systems.
URL: https://openreview.net/forum?id=QG0g950T1t
---
Title: Benford’s Law as a Distributional Prior for Post-Training Quantization of Large Language Models
Abstract: Post-training quantization (PTQ) is a practical way to reduce the memory footprint of large language models, but low-bit quantization is sensitive to mismatches between the quantization codebook and the empirical weight/activation distributions. We revisit Benford-like leading-digit statistics as a lightweight diagnostic of scale-broad behavior in transformer tensors. Across several model families, we observe a consistent functional dichotomy: transformational nn.Linear weights tend to be Benford-like, whereas LayerNorm and embedding parameters systematically deviate. Motivated by this observation, we propose BenQ, a data-free PTQ codebook that uses a simple log-spaced grid as a proxy for scale-broad distributions and applies it selectively to transformational layers while keeping stability-critical parameters in higher precision. In 4-bit group-wise PTQ, BenQ consistently improves over uniform RTN and is often competitive with NF4, while remaining substantially simpler than optimization-based methods. We additionally report activation quantization results as an exploratory stress test: BenQ can improve robustness over uniform baselines in some families, but performance remains mixed across models, highlighting open challenges for static-grid activation PTQ.
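A minimal sketch of a BenQ-style data-free log-spaced codebook, assuming a symmetric power-of-two grid anchored at the tensor's absolute maximum (the 15-level layout and nearest-neighbor assignment are illustrative choices, not necessarily the paper's exact grid):

```python
import numpy as np

def log_codebook(w, bits=4):
    """Data-free log-spaced grid: symmetric levels +/- absmax * 2^{-i}
    plus an explicit zero (15 levels for 4 bits in this sketch)."""
    absmax = np.abs(w).max()
    n_pos = 2 ** (bits - 1) - 1                 # 7 positive magnitudes
    mags = absmax * 2.0 ** (-np.arange(n_pos))  # absmax, absmax/2, ...
    return np.concatenate([-mags, [0.0], mags])

def quantize(w, codebook):
    """Round each weight to its nearest codebook level."""
    idx = np.abs(w[..., None] - codebook).argmin(-1)
    return codebook[idx]
```

The selectivity the abstract describes would then amount to applying `quantize` only to transformational `nn.Linear` weights while leaving LayerNorm and embedding parameters in higher precision.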
URL: https://openreview.net/forum?id=YiLcQY4Nje
---
Title: Data-Dependent Regret and Polyak Corrections for Constrained Online Convex Optimization
Abstract: In constrained online convex optimization, the learner must minimize regret against adversarially chosen convex costs while satisfying a convex constraint at every round, a requirement that arises naturally in safety-critical domains such as power systems, autonomous control, and clinical decision-making. A natural and computationally efficient approach augments online gradient descent with a Polyak feasibility step: a closed-form half-space projection requiring only one constraint evaluation and one subgradient per round. This approach is known to achieve $O(\sqrt{T})$ regret with per-round feasibility, yet we prove that its existing analysis is strictly loose by identifying two quantities it unnecessarily discards. Specifically, replacing the worst-case gradient envelope $G_f^2 T$ with the observed accumulation $\mathcal{G}_T = \sum_t \lVert \nabla f_t(x_t) \rVert^2$ yields a data-dependent bound without any algorithmic modification. Furthermore, we introduce the Polyak correction $\mathcal{P}_T \geq 0$, which captures the cumulative squared displacement of the feasibility projection and enters the regret bound with a strictly negative sign, a term that all prior proofs lose entirely through the Pythagorean inequality. The total improvement $\Delta_T = \tfrac{\eta}{2}(G_f^2 T - \mathcal{G}_T) + \tfrac{1}{2\eta}\mathcal{P}_T$ is provably non-negative and decomposes into two independent, complementary sources that vanish only in a degenerate corner case. Building on these analytical insights, we propose AdaOGD-PFS, an adaptive-step-size variant that achieves $O(\sqrt{\mathcal{G}_T})$ regret, potentially much smaller than $O(G_f\sqrt{T})$, while preserving per-round constraint satisfaction. Experiments on ball-constrained and halfspace-constrained instances confirm bound improvements of 38--43%, with both data-dependent gradients and Polyak corrections contributing meaningfully.
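The Polyak feasibility step the abstract describes admits a short sketch: a closed-form half-space projection using one constraint value and one subgradient, whose squared displacement is exactly the per-round contribution to $\mathcal{P}_T$. The unit-ball constraint in the usage example is illustrative.

```python
import numpy as np

def polyak_feasibility_step(x, g_val, g_subgrad):
    # Closed-form half-space projection for a constraint g(x) <= 0:
    # if violated, step along -subgradient just far enough to reach the
    # half-space {y : g(x) + s.(y - x) <= 0}. One constraint evaluation,
    # one subgradient per round.
    if g_val <= 0:
        return x, 0.0
    s = np.asarray(g_subgrad, dtype=float)
    disp = (g_val / np.dot(s, s)) * s
    # The squared displacement ||disp||^2 accumulates into P_T, the
    # Polyak correction that tightens the regret bound.
    return x - disp, float(np.dot(disp, disp))
```

For example, with $g(x) = \lVert x \rVert - 1$ (unit ball) and $x = (2, 0)$, the step lands exactly on the boundary $(1, 0)$ and contributes $1$ to $\mathcal{P}_T$.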
URL: https://openreview.net/forum?id=tBM4UrAPf8
---
Title: Discrete Double-Bracket Flows for Isotropic-Noise Invariant Eigendecomposition
Abstract: We study matrix-free eigendecomposition under a matrix--vector product (MVP) oracle, where each step observes a covariance operator $C_k = C_{\mathrm{sig}} + \sigma_k^2 I + E_k$. Standard stochastic approximation methods either use fixed steps that couple stability to $||C_k||_2$, or adapt steps in ways that slow down due to vanishing updates.
We introduce a discrete double-bracket flow whose generator is invariant to isotropic shifts, yielding pathwise invariance to $\sigma_k^2 I$ at the discrete-time level. The resulting trajectory and a maximal stable step size $\eta_{\max} \propto 1/||C_e||_2^2$ depend only on the trace-free covariance $C_e$. We establish global convergence via strict-saddle geometry for the diagonalization objective and an input-to-state stability analysis, with sample complexity scaling as $O(||C_e||_2^2/(\Delta^2\varepsilon))$ under trace-free perturbations. An explicit characterization of degenerate blocks yields an accelerated $O(\log(1/\zeta))$ saddle-escape rate and a high-probability finite-time convergence guarantee.
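The isotropic-shift invariance at the heart of the method follows from the identity commuting with every matrix, so $\sigma_k^2 I$ drops out of any commutator-based generator. A quick numerical check (matrix sizes and noise level chosen arbitrarily):

```python
import numpy as np

def bracket(M, N):
    # Matrix commutator [M, N] = MN - NM, the building block of
    # double-bracket flows.
    return M @ N - N @ M

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
C_sig = A @ A.T                   # symmetric "signal" covariance
X = rng.standard_normal((4, 4))   # arbitrary test direction
sigma2 = 3.7                      # isotropic noise level (illustrative)

# [C + sigma^2 I, X] = [C, X]: the isotropic shift is annihilated,
# so the discrete-time trajectory depends only on the trace-free part.
assert np.allclose(bracket(C_sig + sigma2 * np.eye(4), X), bracket(C_sig, X))
```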
URL: https://openreview.net/forum?id=oShv6wdzP7
---
Title: On-the-go Forgetting without Explicit Unlearning via ERASE
Abstract: Existing unlearning approaches typically rely on post hoc weight adaptation or distillation, leading to duplicated memory costs, degraded generalization, and limited scalability. In this work, we introduce ERASE, Erasure via Reconstructive Adversarial Signal Editing, a framework for on-the-go forgetting that removes the influence of private data without modifying model weights. ERASE leverages structured, class-conditioned input perturbations to induce selective forgetting during inference, eliminating the need for retraining, fine-tuning, or model copies. We rigorously characterize sufficient conditions under which ERASE provably forgets designated subclasses while preserving predictions across other subclasses within the same superclass. This analysis offers a principled foundation for inference-time forgetting under mild regularity assumptions. Across diverse architectures and benchmark datasets, ERASE maintains the best observed balance between forgetting efficacy, computational efficiency, and retention fidelity over recent unlearning-based methods. By reimagining data removal as forgetting without unlearning, our work establishes a scalable, regulation-aligned pathway for continual, privacy-conscious learning.
URL: https://openreview.net/forum?id=PIXVov5LQq
---
Title: UM3: Unsupervised Map to Map Matching
Abstract: Map-to-map matching is a critical task for aligning spatial data across heterogeneous sources, yet it remains challenging due to the lack of ground truth correspondences, sparse node features, and scalability demands. In this paper, we propose an unsupervised graph-based framework that addresses these challenges through three key innovations. First, our method is an unsupervised learning approach that requires no training data, which is crucial for large-scale map data where obtaining labeled training samples is challenging. Second, we introduce pseudo coordinates that capture the relative spatial layout of nodes within each map, which enhances feature discriminability and enables scale-invariant learning. Third, we design a mechanism to adaptively balance feature and geometric similarity, as well as a geometry-consistent loss function, ensuring robustness to noisy or incomplete coordinate data. At the implementation level, to handle large-scale maps, we develop a tile-based post-processing pipeline with overlapping regions and majority voting, which enables parallel processing while preserving boundary coherence. Experiments on real-world datasets demonstrate that our method achieves state-of-the-art accuracy in matching tasks, surpassing existing methods by a large margin, particularly in high-noise and large-scale scenarios. Our framework provides a scalable and practical solution for map alignment, offering a robust and efficient alternative to traditional approaches. Code is open-sourced at \url{https://anonymous.4open.science/r/MMM-7619}.
URL: https://openreview.net/forum?id=Gmy76U5bph
---
Title: A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection
Abstract: Deep neural networks (DNNs) are highly susceptible to adversarial examples—subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the **A Few Large Shifts Assumption**, which posits that adversarial perturbations induce large, localized violations of *layer-wise Lipschitz continuity* in a small subset of layers. Building on this, we propose two complementary strategies—**Recovery Testing (RT)** and **Logit-layer Testing (LT)**—to empirically measure these violations and expose internal disruptions caused by adversaries. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive threat models, our method achieves state-of-the-art detection performance with negligible computational overhead. Furthermore, our system-level analysis provides a practical method for selecting a detection threshold with a formal lower-bound guarantee on accuracy.
URL: https://openreview.net/forum?id=0xXYBxDNHA
---
Title: VLMGuard: Bootstrapping Malicious Prompt Detectors from Unlabeled Vision-Language Prompts in the Wild
Abstract: Vision-language Models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about their reliability in VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of a large amount of labeled benign and malicious data. To address the issue, we introduce VLMGuard, a novel learning framework that leverages the unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, consist of both benign and malicious information. To harness the unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within this unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework does not require extra human annotations and is robust to realistic prompt variations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that VLMGuard achieves superior detection results, improving AUROC by 9.46\% on average over the state-of-the-art method. Disclaimer: This paper may contain offensive examples; reader discretion is advised.
URL: https://openreview.net/forum?id=z7gczmhmmo
---
Title: On the Scaling Flaws of Verifier-Guided Beam Search in Mathematical Reasoning
Abstract: Large language models (LLMs) struggle with multi-step mathematical reasoning, for which inference-time scaling—via sequential or parallel scaling—has emerged as a promising strategy. While recent advances have focused on sequential scaling, we revisit the less-explored parallel scaling approach, verifier-guided beam search, to examine its limitations. In this paper, we argue that its strength is, paradoxically, also its limitation: verifiers can boost performance under limited sample sizes by elevating promising reasoning paths, yet the same mechanism can also hide or cut off the valid paths that lead to correct answers. Empirically, we uncover a systematic issue--scaling flaws--in verifier-guided beam search, across models, benchmarks (GSM8K, MATH, AIME25), and verifier types (outcome value models, process reward models). Specifically, the search outperforms repeated sampling at small sample sizes but its advantage diminishes—and ultimately reverses—as the sample size grows. We attribute this to verifier failures: imperfect verifiers misrank candidates and can erroneously prune all valid paths, with these effects exacerbated in more challenging scenarios. To mitigate verifier failures, we explore reducing reliance on verifiers and conduct preliminary investigations using two simple methods. Overall, our findings expose fundamental limitations of verifier-guided beam search and explain why this line of work has struggled to realize its potential.
URL: https://openreview.net/forum?id=D5VKbIzlrR
---
Title: On the Statistical Limits of Self-Improving Agents
Abstract: We develop a learning-theoretic framework for analyzing self-improving agents by decomposing self-modification into five axes. Within this framework, we prove a sharp boundary: under standard i.i.d. assumptions, distribution-free PAC learnability is preserved if and only if the policy-reachable family remains uniformly capacity-bounded. If reachable capacity can grow without bound, utility-rational self-changes can make learnable tasks unlearnable. We further introduce a simple Two-Gate guardrail—a validation-improvement requirement plus a capacity cap—that preserves this boundary and yields standard VC-rate guarantees. The broader implication is that self-modification must be constrained not only by objectives, but also by structural conditions that preserve the statistical prerequisites for learning. As AI systems become increasingly intelligent and autonomous, we view this framework as an important foundation for the statistical theory of self-improvement.
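The Two-Gate guardrail reduces to a simple acceptance predicate over proposed self-modifications. The sketch below is schematic: the function name, the improvement margin `eps`, and the use of a scalar "capacity" (standing in for, e.g., a VC-dimension bound) are illustrative assumptions, not the paper's formal statement.

```python
def two_gate(cand_val_loss, cur_val_loss, cand_capacity, capacity_cap, eps=1e-3):
    # Two-Gate guardrail (schematic): accept a self-modification only if it
    # (1) improves held-out validation loss by at least eps, and
    # (2) keeps the reachable hypothesis class uniformly capacity-bounded.
    improves = cand_val_loss <= cur_val_loss - eps
    bounded = cand_capacity <= capacity_cap
    return improves and bounded
```

Gate (2) is what preserves the learnability boundary: improvement alone would still permit unbounded capacity growth.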
URL: https://openreview.net/forum?id=q4vuDMtYgF
---
Title: Linear Dynamics meets Linear MDPs: Closed-Form Optimal Policies via Reinforcement Learning
Abstract: Many applications---including power systems, robotics, and economics---involve a dynamical system interacting with a stochastic and hard-to-model environment. We adopt a reinforcement learning approach to control such systems. Specifically, we consider a deterministic, discrete-time, linear, time-invariant dynamical system coupled with a feature-based linear Markov process with an unknown transition kernel. The objective is to learn a control policy that minimizes a quadratic cost over the system state, the Markov process, and the control input. Leveraging both components of the system, we derive an explicit parametric form for the optimal state-action value function and the corresponding optimal policy. Our model is distinct in combining aspects of both classical Linear Quadratic Regulator (LQR) and linear Markov decision process (MDP) frameworks. This combination retains the implementation simplicity of LQR, while allowing for sophisticated stochastic modeling afforded by linear MDPs, without estimating the transition probabilities, thereby enabling direct policy improvement. For the nominal setting, where the linear system dynamics are known, we use tools from control theory to provide theoretical guarantees on the stability of the system under the learned policy and provide a sample complexity analysis for its convergence to the optimal policy. We further extend our framework to systems with Gaussian process noise and to systems with unknown linear dynamics. We illustrate our results via numerical examples for the nominal, noisy, and unknown dynamics settings to demonstrate the effectiveness of our approach in learning the optimal control policy under partially known stochastic dynamics.
URL: https://openreview.net/forum?id=gool6th5Z6
---
Title: Train Long, Think Short: Curriculum Learning for Efficient Reasoning
Abstract: Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint tightening serves as a powerful inductive bias for training efficient reasoning models.
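The decaying budget and three-signal reward can be sketched in a few lines. The linear schedule, the start/end budgets, and the reward weights below are all illustrative choices, not values from the paper (which ablates exactly these designs).

```python
def token_budget(step, total_steps, start=1024, end=256):
    # Linearly decaying token budget: generous early (exploration),
    # tight late (compression). Shape and endpoints are illustrative.
    frac = min(step / total_steps, 1.0)
    return int(round(start + (end - start) * frac))

def grpo_reward(correct, n_tokens, budget, well_formatted,
                w_acc=1.0, w_len=0.3, w_fmt=0.1):  # illustrative weights
    # Three signals: verifier correctness, length efficiency relative to
    # the current curriculum budget, and structural-tag adherence.
    r = w_acc * float(correct)
    r += w_len * max(0.0, 1.0 - n_tokens / budget)
    r += w_fmt * float(well_formatted)
    return r
```

A rollout scored at step `t` simply uses `grpo_reward(..., budget=token_budget(t, T))`, so the same trace earns less length reward as training progresses.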
URL: https://openreview.net/forum?id=hjNh64gibF
---
Title: DisCoVAE: Disentangling pretrained latent spaces with customized controls using contrastive learning
Abstract: Deep generative models have recently achieved high-quality results but still lack customizable control abilities. Existing methods mainly rely on using labels as additional inputs to directly condition the generation process. This approach also requires the model to be retrained entirely and may ignore the high-level features already captured in the latent space.
In this paper, we propose a new approach that allows one to reshape the latent space of pretrained generative models based on user-specified groups of samples. Our method relies on a variational autoencoder trained with an additional supervised contrastive learning regularization. This leads to a new control space, which disentangles the features of interest based on the underlying variations found within the custom groups. We propose an iterative approach that can disentangle a single labeled feature at a time from the remaining latent factors without additional supervision, which enables the control space to be built gradually. We show that our method outperforms state-of-the-art disentanglement approaches on reference datasets, while also enabling high-quality image synthesis with fine-grained continuous controls on real-world datasets.
URL: https://openreview.net/forum?id=npPuLfIutr
---
Title: CAReFuseNet: Cross Attention Fusion Network for Referring Camouflaged Object Detection
Abstract: The Referring Camouflaged Object Detection (Ref-COD) task aims to generate a binary segmentation mask to detect camouflaged objects of a specified category in an image, guided by reference image(s) containing salient example(s) of the target object of the same category. With only a few methods (e.g., R2CNet and UAT) proposed to date, Ref-COD remains challenging due to the similarity of camouflaged objects to their backgrounds and substantial feature gaps with salient references. At the same time, recent state-of-the-art approaches often rely on heavy transformer-based encoder–decoder stacks or large frozen vision backbones, resulting in substantial parameter footprints that hinder efficient deployment. This work proposes CAReFuseNet, a novel framework featuring a cross-attention based reference feature fusion module that effectively extracts reference-conditioned feature representations from camouflaged images while targeting parameter efficiency. The proposed CAReFuse module leverages global interactions between reference and camouflaged image features via cross-attention, but constrains all fusion and decoding operations to a lower dimensional feature space and employs a lightweight convolutional decoder. Combined with a frozen Ref-Image Encoder, this design yields a compact Ref-COD model without sacrificing accuracy. Extensive experiments on the R2C7K dataset show that our method surpasses the state of the art, while using significantly fewer parameters. Further evaluations across multiple backbone architectures, including Swin Transformer, ConvNeXt, EfficientNet, and ResNet, demonstrate that the proposed reference feature fusion module provides a general and parameter-efficient building block for the referring camouflaged object detection task.
URL: https://openreview.net/forum?id=xMm7MPT7yL
---
Title: Few Contrastive Attention Heads Enable Visual Grounding in Large Vision-Language Models
Abstract: Visual grounding aims to localize image regions corresponding to natural language expressions. While recent Large Vision-Language Models (LVLMs) have shown impressive multi-modal understanding capabilities, their application to visual grounding typically requires fine-tuning and architectural modifications. This requirement, however, can be avoided: text and image features tend to have similar representations that appear to be approximately linearly disentangled, enabling cleaner extraction of spatial information from LVLMs without any task-specific training. From this viewpoint, we propose an attention-head discovery framework that requires zero labeled grounding samples and no architectural modifications, and identifies discriminative localization heads without manual inspection. Through dual prompting with target and contrastive descriptions, we compute differential residual representations and project them through attention head output matrices to measure per-head spatial contributions via four complementary scores. By aggregating signals using importance-weighted query difference scores from only the top-10 attention heads, we outperform the training-free non-LVLM baseline by up to 27.95% on RefCOCO, 21.93% on RefCOCO+, and 8.40% on RefCOCOg. Our method outperforms the LVLM baseline by up to 8.04% on RefCOCO without requiring ground-truth category labels.
URL: https://openreview.net/forum?id=CnrAiXbw8a
---
Title: TE-VLM: Transfer Entropy for Vision Language Model Distillation
Abstract: Transfer Entropy (TE) is a principled measure of directed information flow, but its direct estimation in high-dimensional multimodal representation spaces is computationally prohibitive. In this work, we propose a practical distillation framework that replaces direct TE estimation with TE-inspired proxy regularization for multimodal vision-language models. Our method introduces proxy objectives that reward student representations for preserving teacher-aligned predictive structure across modalities, while remaining compatible with standard contrastive distillation losses. We instantiate the framework in CLIP-style teacher--student distillation across multiple teacher backbones, including CLIP RN50, ViT-B/16, and RN50$\times$16, and evaluate it on MSCOCO 2014, Flickr8k, Flickr30k, Food-101, and ImageNet-1k. Across retrieval experiments, the proposed TE-inspired objective consistently improves Image-to-Text performance over MI-based and standard distillation baselines, while remaining competitive on Text-to-Image retrieval. Additional recipe-level diagnostics across temperature and batch size show that these gains are reproducible and are not explained solely by a favorable training recipe. Representation-level analyses further show that TE-inspired distillation yields stronger teacher-student agreement in local neighborhood structure, cosine alignment, and joint image-text embedding geometry. Beyond in-dataset evaluation, cross-dataset retrieval from MSCOCO to Flickr8k shows that the proposed objective better preserves transferable multimodal structure under distribution shift. We also observe improvements in zero-shot classification on Food-101 and out-of-dataset evaluation on ImageNet-1k. Together, these results suggest that TE-inspired proxy regularization provides an effective and scalable mechanism for preserving teacher-consistent cross-modal structure during multimodal distillation.
URL: https://openreview.net/forum?id=Xg5lvMXnWI
---
Title: A Unified Latent Space Disentanglement VAE Framework with Robust Disentanglement Effectiveness Evaluation
Abstract: Evaluating and interpreting latent representations, such as variational autoencoders (VAEs), remains a significant challenge for diverse data types, especially when ground-truth generative factors are unknown. To address this, we propose a general framework -- bfVAE -- that unifies several state-of-the-art disentangled VAE approaches and generates effective latent space disentanglement, especially for tabular data. To assess the effectiveness of a VAE disentanglement technique, we propose two procedures -- Feature Variance Heterogeneity via Latent Traversal (FVH-LT) and Dirty Block Sparse Regression in Latent Space (DBSR-LS) -- for disentanglement assessment, along with the latent space disentanglement index (LSDI), which uses outputs of FVH-LT and DBSR-LS to summarize the overall effectiveness of a VAE disentanglement method without requiring access to or knowledge of the ground-truth generative factors. To the best of our knowledge, these are the first assessment tools to achieve this. FVH-LT and DBSR-LS also enhance latent space interpretability and provide guidance on more efficient content generation. To ensure robust and consistent disentanglement, we develop a greedy alignment strategy (GAS) that mitigates label switching and aligns latent dimensions across runs to obtain aggregated results. We assess the bfVAE framework and validate FVH-LT, DBSR-LS, and LSDI in extensive experiments on tabular and image data. The results suggest that bfVAE surpasses existing disentangled VAE frameworks in terms of disentanglement quality and robustness, achieving a near-zero false discovery rate for informative latent dimensions; that FVH-LT and DBSR-LS reliably uncover semantically meaningful and domain-relevant latent structures; and that LSDI provides an effective overall quantitative summary of disentanglement effectiveness.
URL: https://openreview.net/forum?id=JMaQub8pqY
---
Title: AgenticNet: Rethinking Multi-agent System Architectures with LLM-based Networks
Abstract: Multi-agent systems (MAS) enhance the capabilities of single LLM agents by leveraging collaboration and specialization. However, existing designs often rely on ad-hoc coordination strategies, lacking a principled architecture that integrates reasoning, communication, and adaptation. This limitation makes it difficult to scale multi-agent systems in a way that is both effective and interpretable. To address this challenge, we take inspiration from the architecture of neural networks to rethink MAS design. We treat each LLM-based agent as a computational unit and organize them into layered structures, analogous to neurons and layers, which we call AgenticNet. In this framework, lower layers act as planners that decompose problems, intermediate layers function as executors that advance reasoning step by step, and upper layers serve as synthesizers that verify consistency and deliver final decisions. Information propagates forward through the layers, while adaptation is guided by layer-level prompt updaters and a global prompt supervisor that refine agent behavior based on task loss, serving as an analogue to backpropagation. We conduct extensive experiments on five benchmarks, including AIME24, GSM8K, MATH500, HumanEval, and MBPP. Across all tasks, AgenticNet consistently outperforms both single-agent baselines and existing multi-agent systems, demonstrating its effectiveness as a scalable architecture for multi-agent systems.
URL: https://openreview.net/forum?id=RQmgTYQ5tR
---