Daily TMLR digest for Feb 07, 2026

TMLR
Feb 7, 2026, 12:30:09 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: Hierarchical Time Series Forecasting with Robust Reconciliation

Authors: Shuhei Aikawa, Aru Suzuki, Kei Yoshitake, Kanata Teshigawara, Iwabuchi Akira, Ken Kobayashi, Kazuhide Nakata

Abstract: This paper focuses on forecasting hierarchical time-series data, where each higher-level observation equals the sum of its corresponding lower-level time series. In such contexts, the forecast values should be coherent, meaning that the forecast value of each parent series exactly matches the sum of the forecast values of its child series. Existing hierarchical forecasting methods typically generate base forecasts independently for each series and then apply a reconciliation procedure to adjust them so that the resulting forecast values are coherent across the hierarchy. These methods generally derive an optimal reconciliation using a covariance matrix of the forecast errors. In practice, however, the true covariance matrix is unknown and has to be estimated from finite samples in advance. This gap between the true and estimated covariance matrices may degrade forecast performance. To address this issue, we propose a robust optimization framework for hierarchical reconciliation that accounts for uncertainty in the estimated covariance matrix. We first introduce an uncertainty set for the estimated covariance matrix and formulate a reconciliation problem that minimizes the worst-case average of weighted squared residuals over this uncertainty set. We show that our problem can be cast as a semidefinite optimization problem. Numerical experiments demonstrate that the proposed robust reconciliation method achieves better forecast performance than existing hierarchical forecasting methods, which indicates the effectiveness of integrating uncertainty into the reconciliation process.
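
As background, here is a minimal NumPy sketch of the classical weighted-least-squares reconciliation step that such methods build on; the toy hierarchy, covariance estimate, and base forecasts are illustrative assumptions, and the paper's robust semidefinite formulation is not reproduced here:

    import numpy as np

    # Toy 3-series hierarchy: total = A + B. Rows of S map bottom series to all series.
    S = np.array([[1.0, 1.0],   # total
                  [1.0, 0.0],   # child A
                  [0.0, 1.0]])  # child B

    y_hat = np.array([10.0, 6.5, 4.2])   # incoherent base forecasts (top != sum)
    W = np.diag([2.0, 1.0, 1.0])         # estimated forecast-error covariance (toy)

    # Classical reconciliation; the paper instead minimizes the worst case of this
    # objective over an uncertainty set around W, cast as a semidefinite program.
    Winv = np.linalg.inv(W)
    G = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)
    y_tilde = S @ (G @ y_hat)

    print(y_tilde)  # coherent by construction
    assert np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2])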

URL: https://openreview.net/forum?id=XHPLjF52gY

---

Title: Algorithms for the preordering problem and their application to the task of jointly clustering and ordering the accounts of a social network

Authors: Jannik Irmai, Maximilian Moeller, Bjoern Andres

Abstract: The NP-hard maximum value preordering problem is both a joint relaxation and a hybrid of the clique partition problem (a clustering problem) and the partial ordering problem. Toward approximate solutions and lower bounds, we introduce a linear-time 4-approximation algorithm that constructs a maximum dicut of a subgraph and define local search heuristics. Toward upper bounds, we tighten a linear program relaxation by the class of odd closed walk inequalities that define facets, as we show, of the preorder polytope. We contribute implementations of the algorithms, apply these to the task of jointly clustering and partially ordering the accounts of published social networks, and compare the output and efficiency qualitatively and quantitatively.

URL: https://openreview.net/forum?id=cBsUnv7Cb3

---

Title: Fast Graph Generation via Autoregressive Noisy Filtration Modeling

Authors: Markus Krimmel, Jenna Wiens, Karsten Borgwardt, Dexiong Chen

Abstract: Existing graph generative models often face a critical trade-off between sample quality and generation speed. We introduce Autoregressive Noisy Filtration Modeling (ANFM), a flexible autoregressive framework that addresses both challenges. ANFM leverages filtration, a concept from topological data analysis, to transform graphs into short sequences of subgraphs. We identify exposure bias as a potential hurdle in autoregressive graph generation and propose noise augmentation and reinforcement learning as effective mitigation strategies, which allow ANFM to learn both edge addition and deletion operations. This unique capability enables ANFM to correct errors during generation by modeling non-monotonic graph sequences. Our results show that ANFM matches state-of-the-art diffusion models in quality while offering over 100 times faster inference, making it a promising approach for high-throughput graph generation. The source code is publicly available at https://github.com/BorgwardtLab/anfm.
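
The filtration idea is easy to picture: a graph becomes a short nested sequence of subgraphs by sweeping a threshold over a filtration function. A hedged sketch with networkx, using an edge-weight threshold as an illustrative filtration (not necessarily the one ANFM uses):

    import networkx as nx

    def filtration_sequence(G, num_steps=4, weight="weight"):
        """Return nested subgraphs G_1 <= ... <= G_T induced by weight thresholds."""
        weights = sorted(nx.get_edge_attributes(G, weight).values())
        # Evenly spaced thresholds over the observed weight range.
        cuts = [weights[int(i * (len(weights) - 1) / (num_steps - 1))]
                for i in range(num_steps)]
        seq = []
        for c in cuts:
            H = nx.Graph()
            H.add_nodes_from(G.nodes)
            H.add_edges_from((u, v) for u, v, w in G.edges(data=weight) if w <= c)
            seq.append(H)
        return seq  # short sequence ending in the full graph

    G = nx.karate_club_graph()
    for u, v in G.edges:                      # synthetic weights for illustration
        G[u][v]["weight"] = (u + v) % 5 + 1
    for H in filtration_sequence(G):
        print(H.number_of_edges())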

URL: https://openreview.net/forum?id=3Up81Zq728

---

Title: Are Time-Indexed Foundation Models the Future of Time Series Imputation?

Authors: Etienne Le Naour, Tahar Nabil, Adrien Petralia, Ghislain Agoua

Abstract: Foundation models for time series imputation remain largely unexplored. Recently, two such models, TabPFN-TS and MoTM, have emerged. These models share a common philosophy that places them within the family of time-indexed foundation models. This paper presents the first large-scale empirical study of these models for zero-shot imputation, which enables missing value recovery without retraining across a wide range of scenarios. We conduct extensive univariate experiments across 33 out-of-domain datasets ($\approx$ 1.3M imputation windows) and evaluate their ability to integrate covariates at inference time to improve accuracy without fine-tuning. Our results demonstrate that time-indexed foundation models are a powerful and practical step toward achieving general-purpose, zero-shot imputation for real-world time series. Code is available at https://github.com/taharnbl/tsfm_imputation.

URL: https://openreview.net/forum?id=cTk56KpsP5

---


New submissions
===============


Title: Adaptive Hypergraph Pruning with Learned Threshold Control and Attention-Based Contrastive Mining

Abstract: Hypergraph neural networks (HGNNs) effectively model multi-way interactions but suffer from severe scalability limitations due to quadratic computational costs across multiple behavioral contexts. Existing pruning approaches reduce computation using fixed, hand-crafted heuristics, which fail to adapt to diverse graph structures and often introduce representation distortions by removing semantically related nodes or creating spurious similarities that degrade contrastive learning. We propose \textbf{TriPrune-HGNN}, an adaptive hypergraph pruning framework with learnable mechanisms that reduces manual hyperparameter tuning by over $80\%$ while achieving a superior accuracy--efficiency tradeoff. TriPrune-HGNN learns pruning schedules from graph statistics and training dynamics, adaptively mines informative contrastive pairs, and automatically balances competing learning objectives via meta-optimization. Extensive experiments on five benchmarks show that TriPrune-HGNN achieves state-of-the-art performance across all 15 evaluation metrics, while reducing inference time by $72.3\%$ and memory usage by $81.1\%$ compared to unpruned models. Compared with efficient baselines of similar memory footprint, TriPrune-HGNN attains up to $5.6\%$ lower error, demonstrating the effectiveness of adaptive pruning for large-scale hypergraph learning.

URL: https://openreview.net/forum?id=b0BLuYYJlc

---

Title: Joint Embedding Variational Bayes

Abstract: We introduce Variational Joint Embedding (VJE), a framework that synthesizes joint embedding and variational inference to enable self-supervised learning of probabilistic representations in a reconstruction-free, non-contrastive setting. Compared to energy-based predictive objectives that optimize pointwise discrepancies, VJE maximizes a symmetric conditional evidence lower bound (ELBO) for a latent-variable model defined directly on encoder embeddings. We instantiate the conditional likelihood with a heavy-tailed Student--$t$ model using a polar decomposition that explicitly decouples directional and radial factors to prevent norm-induced instabilities during training. VJE employs an amortized inference network to parameterize a diagonal Gaussian variational posterior whose feature-wise variances are shared with the likelihood scale to capture anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE achieves performance comparable to standard non-contrastive baselines under linear and k-NN evaluation. We further validate these probabilistic semantics through one-class CIFAR-10 anomaly detection, where likelihood-based scoring under the proposed model outperforms comparable self-supervised baselines.

URL: https://openreview.net/forum?id=4cbPJ5jLtr

---

Title: Revisiting Learning-based Video Motion Magnification for Real-time Processing

Abstract: Video motion magnification is a technique to capture and amplify subtle motion in a video that is invisible to the naked eye. Prior deep learning-based work achieves markedly better quality than conventional signal processing-based methods, but it still falls short of real-time performance, which prevents its extension to various online systems. In this paper, we revisit the first learning-based model and present experimental analyses, in particular on the identification of redundant components, the insertion of spatial bottlenecks, and the trade-off relationship between channel reduction and layer addition. By integrating the findings of each experiment, we present a real-time, deep learning-based motion magnification model that achieves a computational speed ranging from a minimum of 2.7 times to a maximum of 34.9 times faster than existing learning-based methods, while maintaining generation quality comparable to prior art. To the best of our knowledge, this is the first learning-based motion magnification model that runs in real-time on Full-HD resolution videos even without ad hoc quantization.

URL: https://openreview.net/forum?id=TAmmPuExE1

---

Title: Predictive Feature Caching for Training-free Acceleration of Molecular Geometry Generation

Abstract: Flow matching models generate high-fidelity molecular geometries but incur significant computational costs during inference, requiring hundreds of neural network evaluations. This inference cost becomes the primary bottleneck when such models are employed in practice to sample large numbers of molecular candidates. This work presents a training-free caching strategy that accelerates molecular geometry generation by predicting intermediate hidden states across solver steps. This caching scheme operates directly on the SE(3)-equivariant backbone, is compatible with pretrained models, and is orthogonal to existing training-based accelerations and system-level optimizations. Experiments on molecular geometry generation demonstrate that caching achieves a twofold reduction in wall-clock inference time at matched sample quality and a speedup of up to 3× with minimal sample quality degradation. Because these gains compound with other optimizations, applying caching alongside other general, lossless optimizations yields as much as a 7× speedup.
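
A minimal sketch of the general pattern, assuming a placeholder backbone and a simple linear extrapolation rule for skipped steps (the paper's actual predictor may differ):

    import torch

    class CachedBackbone:
        """Toy solver loop: run the expensive backbone only every `refresh` steps
        and linearly extrapolate its features in between (illustrative rule)."""
        def __init__(self, backbone, refresh=2):
            self.backbone, self.refresh = backbone, refresh
            self.prev, self.curr = None, None

        def features(self, x, step):
            if step % self.refresh == 0 or self.curr is None:
                self.prev, self.curr = self.curr, self.backbone(x)
            elif self.prev is not None:
                # Predict the skipped evaluation from the last two cached states.
                return self.curr + (self.curr - self.prev)
            return self.curr

    backbone = torch.nn.Linear(8, 8)   # stand-in for an SE(3)-equivariant network
    cached = CachedBackbone(backbone)
    x = torch.randn(4, 8)
    for t in range(6):
        h = cached.features(x, t)      # half the calls hit the real backbone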

URL: https://openreview.net/forum?id=NaLVutCHCI

---

Title: Revisiting "Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing"

Abstract: Recent advances in diffusion-based image editing have enabled highly realistic and accessible manipulation of facial images, raising serious concerns about biometric privacy and malicious misuse. FaceLock, introduced in Edit Away and My Face Will Not Stay: Personal Biometric Defense against Malicious Generative Editing, proposes an optimization-based defense that embeds subtle perturbations into images at publication time to induce identity distortion in downstream generative edits. The method claims prompt-agnostic effectiveness and strong performance across multiple editing scenarios, supported by open-source code. In this paper, we present a systematic reproducibility study of FaceLock that evaluates its technical, quantitative, and qualitative reproducibility. We assess whether the reported results can be obtained using the released codebase, analyze the correspondence between the paper’s algorithmic description and its implementation, and document ambiguities that impact reproducibility. We further examine quantitative reproducibility by attempting to recover the reported performance trends and relative ranking against baselines. We, however, were not able to reproduce the originally reported performance trends, and our outputs were generally worse than those presented in the original paper. Beyond that, we expand the qualitative analysis to a broader set of image–prompt pairs and an additional, harder facial dataset to better test generalization behavior. While we obtained some successful outputs, only a small fraction of our qualitative results matched the consistently high quality reported by the authors. Finally, we introduce an extension to the FaceLock method that helps with robustness, and we critically examine the evaluation criteria used to measure defense effectiveness, highlighting limitations of prompt fidelity as a primary metric and arguing for a more explicit consideration of the trade-off between identity protection and preservation of the original image. We provide a link to our (currently anonymous and private) GitHub repository: https://anonymous.4open.science/r/revisiting_facelock-F4D7

URL: https://openreview.net/forum?id=5Q1gr80AXU

---

Title: Methods and Open Problems in Differentiable Social Choice: Learning Mechanisms, Decisions, and Alignment

Abstract: Social choice is no longer a peripheral concern of political theory or economics: it has become a foundational component of modern machine learning systems. From auctions and resource allocation to federated learning, participatory governance, and the alignment of large language models, machine learning pipelines increasingly aggregate heterogeneous preferences, incentives, and judgments into collective decisions. In effect, many contemporary machine learning systems already implement social choice mechanisms, often implicitly and without explicit normative scrutiny.

This Review surveys differentiable social choice: an emerging paradigm that formulates voting rules, mechanisms, and aggregation procedures as learnable, differentiable models optimized from data. We synthesize work across auctions, voting, budgeting, liquid democracy, decentralized aggregation, and inverse mechanism learning, showing how classical axioms and impossibility results reappear as objectives, constraints, and optimization trade-offs. We conclude by identifying 36 open problems defining a new research agenda at the intersection of machine learning, economics, and social choice theory.
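
To make the paradigm concrete, a hedged toy example: a voting rule written as a differentiable map from voter utilities to a winner distribution, so welfare objectives or axiom-violation penalties can be optimized by gradient descent. Softmax-plurality here is an illustrative choice, not a rule taken from the Review:

    import torch

    def soft_plurality(utilities, tau=0.5):
        """Differentiable relaxation of plurality: each voter casts a softmax
        'ballot' over candidates; the rule returns the mean vote share."""
        ballots = torch.softmax(utilities / tau, dim=-1)  # (voters, candidates)
        return ballots.mean(dim=0)                        # winner distribution

    utilities = torch.randn(100, 4, requires_grad=True)   # 100 voters, 4 candidates
    shares = soft_plurality(utilities)
    # Any scalar objective now backpropagates through the voting rule itself.
    loss = -torch.log(shares[0])
    loss.backward()
    print(shares.detach(), utilities.grad.shape)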

URL: https://openreview.net/forum?id=Ek0nvon7kV

---

Title: On the Convergence of Adam-Type Algorithm for Bilevel Optimization under Unbounded Smoothness

Abstract: Adam has become one of the most popular optimizers for training modern deep neural networks, such as transformers. However, its applicability is largely restricted to single-level optimization problems. In this paper, we aim to extend vanilla Adam to tackle bilevel optimization problems, which have important applications in machine learning, such as meta-learning. In particular, we study stochastic bilevel optimization problems where the lower-level function is strongly convex and the upper-level objective is nonconvex with potentially unbounded smoothness. This unbounded smooth objective function covers a broad class of neural networks, including transformers, which may exhibit non-Lipschitz gradients. In this work, we introduce AdamBO, a single-loop Adam-type method that achieves $\widetilde{O}(\epsilon^{-4})$ oracle complexity to find $\epsilon$-stationary points, where the oracle calls involve stochastic gradient or Hessian/Jacobian-vector product evaluations. The key to our analysis is a novel randomness decoupling lemma that provides refined control over the lower-level variable. We conduct extensive experiments on various machine learning tasks involving bilevel formulations with recurrent neural networks (RNNs) and transformers, demonstrating the effectiveness of our proposed Adam-type algorithm.

URL: https://openreview.net/forum?id=cPnmtVnhk4

---

Title: Robust Recourse via Kernel Distributionally Robust Optimization and Bayesian Posterior Predictive Modeling

Abstract: Machine learning recourse provides actionable recommendations to achieve favorable outcomes from predictive decision models. A critical limitation of current approaches is their reliance on the assumption of model stationarity, an assumption that is frequently violated in dynamic, real-world settings with distributional shifts. Robust approaches such as Robust Algorithmic Recourse (ROAR) and the Wasserstein-based DiRRAc address some uncertainties but remain limited in handling nonlinear dependencies and large-scale shifts, including concept drift and adversarial perturbations.

We propose Kernel Distributionally Robust Recourse Action (KDRRA), a framework that defines ambiguity sets using Maximum Mean Discrepancy (MMD) in a Reproducing Kernel Hilbert Space (RKHS), enabling flexible, nonparametric modeling of complex, nonlinear discrepancies between distributions. A practical challenge for kernel DRO is that empirical kernel mean embeddings can deviate from the true distribution, inflating ambiguity radii and yielding overly conservative recommendations. To address this, we introduce Bayesian KDRRA (BKDRRA), which centers the ambiguity set on a Bayesian posterior predictive distribution constructed via posterior bootstrap. This Bayesian centering integrates sampling variability and moderate model uncertainty into the reference distribution, leading to tighter ambiguity sets and markedly lower conservatism without sacrificing robustness.

Leveraging the representer theorem, we derive finite-dimensional convex reformulations of the worst-case recourse optimization for both KDRRA and BKDRRA. We conduct a comprehensive empirical evaluation across three real-world datasets that exhibit correction, temporal, and geospatial shifts. The KDRRA consistently outperforms state-of-the-art baselines in yielding superior robustness and lower recourse cost, while BKDRRA further improves stability and calibration by integrating Bayesian uncertainty. Our research advances the frontier of distributionally robust recourse by integrating machine learning tools and optimization, offering reliable and resilient decision-making under uncertainty.
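
For reference, the MMD at the heart of these ambiguity sets has a simple sample estimate; a NumPy sketch with an RBF kernel (the kernel choice and bandwidth are illustrative, and the ambiguity-radius calibration is not shown):

    import numpy as np

    def rbf_kernel(X, Y, sigma=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def mmd2(X, Y, sigma=1.0):
        """Biased (V-statistic) estimate of squared MMD between samples X and Y."""
        return (rbf_kernel(X, X, sigma).mean()
                - 2 * rbf_kernel(X, Y, sigma).mean()
                + rbf_kernel(Y, Y, sigma).mean())

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(200, 2))
    Y = rng.normal(0.5, 1.0, size=(200, 2))   # shifted distribution
    print(mmd2(X, X[::-1]), mmd2(X, Y))       # exactly zero vs. clearly positive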

URL: https://openreview.net/forum?id=LmEDkCTY0X

---

Title: The Shape of Attraction in UMAP: Exploring the Embedding Forces in Dimensionality Reduction

Abstract: Uniform manifold approximation and projection (UMAP) is among the most popular neighbor embedding methods. The method samples pairs of point indices according to similarities in the high-dimensional space, and applies attractive and repulsive forces to their coordinates in the low-dimensional embedding. In this paper, we analyze the forces to reveal their effects on cluster formations and visualization, and compare UMAP to its contemporaries. Repulsion emphasizes differences, controlling cluster boundaries and inter-cluster distance. Attraction is more subtle, as attractive tension between points can manifest simultaneously as attraction and repulsion in the lower-dimensional mapping. This explains the need for learning rate annealing and motivates the different treatments between attractive and repulsive terms. Moreover, by modifying attraction, we improve the consistency of cluster formation under random initialization. Overall, our analysis makes UMAP and similar embedding methods more interpretable, more robust, and more accurate.
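
The attractive and repulsive terms can be written out explicitly; a sketch under the common simplification a = b = 1, where the low-dimensional similarity is q = 1/(1 + d^2) (gradients derived from -log q and -log(1 - q); a didactic form, not the library's implementation):

    import numpy as np

    def umap_pair_grads(yi, yj, eps=1e-3):
        """Gradients w.r.t. yi of the attractive term -log q and the repulsive
        term -log(1 - q), with q = 1 / (1 + d^2), i.e. a = b = 1."""
        diff = yi - yj
        d2 = (diff ** 2).sum()
        g_attract = 2.0 * diff / (1.0 + d2)                  # descent pulls yi toward yj
        g_repulse = -2.0 * diff / ((d2 + eps) * (1.0 + d2))  # descent pushes yi away
        return g_attract, g_repulse

    yi, yj = np.array([0.0, 0.0]), np.array([1.0, 0.0])
    ga, gr = umap_pair_grads(yi, yj)
    print(ga, gr)   # the repulsive gradient blows up as d -> 0, hence eps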

URL: https://openreview.net/forum?id=fdPNhqav5G

---

Title: Derivative-Controlled Compact Surrogates for Predictable Sensitivity

Abstract: Compact neural models are frequently deployed as surrogates inside larger pipelines, where failures are driven less by raw accuracy than by instability and excessive sensitivity. This paper develops a derivative-controlled training approach for low-capacity models, treating derivatives as a primary interface for shaping behavior. We introduce a compact parameterization paired with a derivative-aware objective that discourages brittle sensitivity across depth. We evaluate the approach with property-driven tests—training stability, sensitivity diagnostics, and downstream settings where shape-consistent behavior matters—showing that derivative control can improve robustness while preserving useful predictive performance.
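
A minimal sketch of one derivative-aware objective of this general shape: a task loss plus an input-sensitivity penalty computed by double backpropagation (the penalty form and weighting are assumptions, not the paper's exact objective):

    import torch

    def derivative_controlled_loss(model, x, y, lam=0.1):
        """Task loss plus a penalty on input sensitivity ||df/dx||^2."""
        x = x.requires_grad_(True)
        pred = model(x)
        task = torch.nn.functional.mse_loss(pred, y)
        grads, = torch.autograd.grad(pred.sum(), x, create_graph=True)
        return task + lam * grads.pow(2).sum(dim=-1).mean()

    model = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.Tanh(),
                                torch.nn.Linear(16, 1))
    x, y = torch.randn(32, 3), torch.randn(32, 1)
    loss = derivative_controlled_loss(model, x, y)
    loss.backward()   # penalty gradient reaches the weights via create_graph=True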

URL: https://openreview.net/forum?id=h3IRuSlONZ

---

Title: Modeling Stochastic Conditional Dynamics from Sparse Observations via Kernel-Stabilized Flow Matching

Abstract: Learning to transform conditional probability densities over time is a fundamental challenge spanning probabilistic modeling and the natural sciences. This task is paramount when forecasting the evolution of stochastic nonlinear dynamical systems in biological and physical domains. While flow-based models can predict the temporal evolution of probability distributions, existing approaches often assume discrete conditioning with samples that are paired across time, limiting their scientific applicability where frequently only sparse data with unpaired continuous conditioning is available. We propose Conditional Variable Flow Matching (CVFM), a framework for learning flows transforming conditional distributions with amortization across the continuous space of conditional densities. CVFM addresses the high-variance instability of prior methods by jointly sampling flows over state and conditioning variables, utilizing a conditioning mismatch kernel alongside a conditional Wasserstein distance to reweight the conditional optimal transport objective. Collectively, these advances allow for learning dynamics from sparse unpaired measurements of state-condition across time. We evaluate CVFM on conditional mapping benchmarks and a case study modeling the temporal evolution of materials internal structure during manufacturing processes, observing improved performance and convergence characteristics over existing conditional variants.
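
As a reference point, the basic conditional flow matching objective that CVFM extends, in a hedged PyTorch sketch with a linear interpolation path (the conditioning-mismatch kernel and conditional-OT reweighting from the paper are not shown):

    import torch

    def cfm_loss(v_theta, x0, x1, cond):
        """Flow matching with a linear path: x_t = (1 - t) x0 + t x1,
        target velocity u = x1 - x0."""
        t = torch.rand(x0.shape[0], 1)
        xt = (1 - t) * x0 + t * x1
        target = x1 - x0
        pred = v_theta(torch.cat([xt, cond, t], dim=-1))
        return ((pred - target) ** 2).mean()

    dim, cdim = 2, 1
    v_theta = torch.nn.Sequential(torch.nn.Linear(dim + cdim + 1, 64),
                                  torch.nn.SiLU(), torch.nn.Linear(64, dim))
    x0, x1 = torch.randn(128, dim), torch.randn(128, dim) + 3.0
    cond = torch.rand(128, cdim)          # continuous conditioning variable
    loss = cfm_loss(v_theta, x0, x1, cond)
    loss.backward()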

URL: https://openreview.net/forum?id=3A6oAS2TWo

---

Title: A Universal Source-Free Class Unlearning Framework via Synthetic Embeddings

Abstract: Class unlearning in neural classifiers refers to selectively removing the model’s ability to recognize a target (forget) class by reshaping the decision boundaries. This is essential when taxonomies change, labels are corrected, or legal or ethical requirements mandate class removal. The objective is to preserve performance on the remaining (retain) classes while avoiding costly full retraining. Existing methods generally require access to the source, i.e., forget/retain data or a relevant surrogate dataset. This dependency limits their applicability in scenarios where access to source data is restricted or unavailable. Even the recent source-free class unlearning methods rely on generating samples in the data space, which is computationally expensive and not even essential for doing class unlearning. In this work, we propose a novel source-free class unlearning framework that enables existing unlearning methods to operate using only the deployed model. We show that, under weak assumptions on the forget loss with respect to logits, class unlearning can be performed source-free for any given neural classifier by utilizing randomly generated samples within the classifier’s intermediate space. Specifically, randomly generated embeddings classified by the model as belonging to the forget or retain classes are sufficient for effective unlearning, regardless of their marginal distribution. We validate our framework on four backbone architectures, ResNet-18, ResNet-50, ViT-B/16, and Swin-T, across three benchmark datasets, CIFAR-10, CIFAR-100, and TinyImageNet. Our experimental results show that existing class unlearning methods can operate within our source-free framework, with minimal impact on their forgetting efficacy and retain class accuracy.

URL: https://openreview.net/forum?id=Fb2sZ1eoVe

---

Title: Trip-to-Gaussian: A Versatile Framework for Unconditional 3D Generation

Abstract: Unconditional 3D generation is a classical task that explores effective network architectures to learn the underlying distribution of 3D assets. However, most existing methods are limited in versatility, struggling to scale from object- to scene-level generation. Achieving such versatility critically depends on how 3D representations are designed in the latent and output spaces, and how these spaces are connected. In this work, we focus on leveraging the expressiveness of the triplane together with the fast and high-fidelity 3D Gaussian Splatting (3DGS). Yet, integrating these two representations remains a challenge due to their fundamentally different natures – the structured triplane and the unstructured 3DGS. Our core idea is a coarse-to-fine generation scheme that first extracts reliable geometric priors from the triplane and subsequently refines them to capture detailed geometry and textures through 3D Gaussians. To this end, we introduce Trip-to-Gaussian, a versatile 3D generation framework that seamlessly integrates two distinct representations. We propose a Gaussian indicator module (GIM) along with surface occupancy fields (SOF), generating coarse anchor points, which serve as a reliable geometric prior for 3D Gaussians. Building upon this, we present a point upsampling module (PUM) that maps discontinuous and coarse anchor points into a continuous space, densifying them to ensure a fine-grained representation. Extensive experiments demonstrate that our approach outperforms recent methods in both unconditional object and scene generation, establishing a versatile paradigm for 3D generation.

URL: https://openreview.net/forum?id=9uL23Jcjvj

---

Title: VideoScore2: Think Before You Score In Generated Video Evaluation

Abstract: Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoFeedback2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling.

URL: https://openreview.net/forum?id=MpkVh4jH44

---

Title: k∗means: A Parameter-free Clustering Algorithm

Abstract: Clustering is a widely used and powerful machine learning technique, but its effectiveness is often limited by the need to specify the number of clusters,~$k$, or by relying on thresholds that implicitly determine~$k$. We introduce k∗means, a novel clustering algorithm that eliminates the need to set $k$~or any other parameters. Instead, it formulates the clustering problem as minimising a three-part encoding of the data. It uses this formulation to determine the optimal number of clusters, $k^*$, by splitting and merging clusters while also optimising the standard $k$-means objective. We prove that k∗means is guaranteed to converge and demonstrate experimentally that it significantly outperforms existing methods in scenarios where~$k$ is unknown. We also show that it accurately estimates~$k$ and that, empirically, its runtime is competitive with existing methods and scales well with dataset size.

URL: https://openreview.net/forum?id=dgrtDwpMAW

---

Title: Amnesia: A Stealthy Replay Attack on Continual Learning Dreams

Abstract: Continual learning (CL) models rely on experience replay to mitigate catastrophic forgetting, yet their robustness to replay sampling interference is largely unexplored. Existing CL attacks mostly modify inputs or update pipelines (poisoning/backdoors) and lack explicit \emph{auditable} constraints, limiting their realism. Here, \emph{auditability} means that a monitor can verify compliance using sampler-visible telemetry, e.g., logged replay index/label statistics, by checking that the realized replay class histogram stays close to a nominal baseline and that the replay rate is unchanged (per-batch and/or over a rolling window). We study a limited-privilege insider controlling only the replay \emph{index selection}, not pixels, labels, or model parameters, while staying within such auditable limits (e.g., queue priorities). We introduce \textbf{Amnesia}, a replay composition attack maximizing model degradation under two auditable budgets: a visibility budget $\delta$ bounding the $\mathrm{TV}/\mathrm{KL}$ divergence from a nominal class histogram $p_0$, and a mass budget $f$ fixing the replay rate. Amnesia uses a two-step procedure: (i) compute lightweight class utilities (e.g., EMA loss/confidence) to tilt $p_0$ toward harmful classes; (ii) project the tilt back into the $\delta$-ball using efficient $\mathrm{KL}$ (\emph{exponential tilt}) or $\mathrm{TV}$ (\emph{balanced mass redistribution}) optimizers. A windowed scheduler enforces rolling audits. Across challenging CL benchmarks (Split CIFAR-10/100, CORe50, Tiny-ImageNet) and strong replay baselines (ER, ER-ACE, SCR, DER++), Amnesia consistently depresses final accuracy (ACC$\downarrow$) and worsens backward transfer ($-\mathrm{BWT}\uparrow$). The $\mathrm{KL}$ variant achieves high impact while remaining largely undetected by audits, as confirmed empirically under multiple audit schemes (per-batch and rolling-window checks), whereas the $\mathrm{TV}$ variant is more damaging but more easily detected, especially under tight per-class constraints. These results expose \emph{index-only} replay control as a practical, auditable threat surface in CL systems and establish a principled impact-visibility-budget trade-off. Code is available anonymously at https://anonymous.4open.science/r/9124_Amensia/README.md
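
The KL variant's two-step recipe can be sketched compactly: tilt the nominal histogram toward high-utility classes, then shrink the tilt strength until the result re-enters the δ-ball. A NumPy sketch (the utility vector and budget below are toy values):

    import numpy as np

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    def exponential_tilt(p0, utility, delta, iters=50):
        """Tilt p0 toward high-utility classes, bisecting the tilt strength eta
        so that KL(p || p0) <= delta (the auditable visibility budget)."""
        lo, hi = 0.0, 50.0
        for _ in range(iters):
            eta = (lo + hi) / 2
            p = p0 * np.exp(eta * utility)
            p /= p.sum()
            if kl(p, p0) > delta:
                hi = eta
            else:
                lo = eta
        p = p0 * np.exp(lo * utility)
        return p / p.sum()

    p0 = np.full(10, 0.1)                           # nominal uniform replay histogram
    utility = np.random.default_rng(0).random(10)   # e.g., EMA loss per class
    p = exponential_tilt(p0, utility, delta=0.05)
    print(p, kl(p, p0))                             # stays inside the KL ball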

URL: https://openreview.net/forum?id=QSTg7z06GH

---

Title: Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Abstract: We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench and outperforms the previous SoTA by 3.9 F1-score, showcasing its robust capability to accurately assess the correctness of the reasoning step. Additionally, utilizing Athena-PRM as the reward model, we develop Athena-7B with reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.

URL: https://openreview.net/forum?id=unWmplHccF

---

Title: Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation

Abstract: Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvement across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that canonical search agents trained via Reinforcement Learning from Verifiable Reward (RLVR) --- including Search-R1 and ReSearch --- have significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve better task performance compared to baselines trained against a pure outcome-based reward.

URL: https://openreview.net/forum?id=mZ0gGlXelF

---

Title: DuoShapley: Adaptive and Scalable Shapley Value Approximation for Federated Learning

Abstract: Federated Learning (FL) enables collaborative model training across decentralized users while preserving data privacy, but it also raises a fundamental challenge: how to efficiently and reliably quantify individual user contributions to the global model. The Shapley value (SV) provides a principled game-theoretic framework for contribution valuation, yet its exact computation is prohibitively expensive in realistic FL systems. Existing SV approximation methods face a trade-off between scalability and estimation fidelity, particularly under heterogeneous data distributions. In this work, we propose DuoShapley, an efficient and adaptive SV approximation tailored to large-scale FL that adaptively balances two complementary orders: Solo, capturing individual contributions, and Leave-One-Out (LOO), capturing marginal contributions relative to the full coalition. By adaptively weighting them during training based on the alignment between local and global model updates, DuoShapley achieves both computational efficiency and accurate contribution valuation across diverse FL scenarios, from independent and identically distributed (IID) to non-IID. Beyond contribution measurement, DuoShapley enables downstream applications such as robust user selection in the presence of users with noisy data, by prioritizing users with high estimated contributions. Such selective participation leads to enhanced robustness to noisy and low-quality updates, and reduced communication overhead. Extensive experiments show that DuoShapley is both computationally efficient and effective across diverse data distributions. Hence, DuoShapley provides a practical and scalable solution for evaluating and leveraging user contributions in FL.
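
The two building blocks are cheap to state; a plain-Python sketch with a toy coalition utility (the adaptive weighting from update alignment is reduced to a fixed lam here):

    def solo_and_loo(users, utility):
        """Two cheap contribution estimates: Solo v({i}) - v(empty) and
        Leave-One-Out v(N) - v(N without i); DuoShapley adaptively blends them."""
        full = utility(frozenset(users))
        empty = utility(frozenset())
        solo = {i: utility(frozenset({i})) - empty for i in users}
        loo = {i: full - utility(frozenset(users) - {i}) for i in users}
        return solo, loo

    def blend(solo, loo, lam):
        """lam in [0, 1]; the paper adapts it from local/global update alignment."""
        return {i: lam * solo[i] + (1 - lam) * loo[i] for i in solo}

    # Toy utility with diminishing returns in coalition size.
    utility = lambda S: len(S) ** 0.5
    users = [0, 1, 2, 3]
    solo, loo = solo_and_loo(users, utility)
    print(blend(solo, loo, lam=0.5))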

URL: https://openreview.net/forum?id=zjgZFNEEHn

---

Title: GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

Abstract: Graphical user interface (GUI) agents face severe efficiency bottlenecks when processing long sequences of high-resolution screenshots, making inference costly and memory-bound. Existing KV cache compression methods, designed for natural images, remain suboptimal as they fail to exploit the unique spatial and temporal redundancies of GUIs. In this work, we first demonstrate that unlike natural images, GUI attention sparsity is uniformly high (>0.99) across all transformer layers, invalidating complex layer-varying budget strategies. Building on this insight, we introduce GUI-KV, a training-free compression method that allocates a uniform budget driven by two novel mechanisms: (1) spatial saliency guidance, which augments attention with residual stream L2 norms to preserve semantic visual tokens; and (2) temporal redundancy scoring, which employs subspace projection to identify and prune historical frames that are linearly redundant with the current view. Across six benchmarks, GUI-KV outperforms competitive baselines, often recovering near-full-cache accuracy at 10-20% budgets. Notably, on AgentNetBench, it reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline.

URL: https://openreview.net/forum?id=qaJECugPzr

---

Title: Maximizing Confidence Alone Improves Reasoning

Abstract: Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization -- a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the model's entropy of its underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly-used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and models of varying sizes from the Qwen, Mistral, and Llama families. The generality of our unsupervised learning method lends itself to applicability in a wide range of domains where external supervision is unavailable.
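
The intrinsic reward itself is one line of algebra: the negative mean token entropy of the model's output distribution. A hedged PyTorch sketch (tensor shapes and the masking convention are assumptions):

    import torch

    def entropy_reward(logits, mask):
        """Negative mean token entropy over generated positions: higher
        confidence gives higher reward, a minimal stand-in for RENT's signal."""
        logp = torch.log_softmax(logits, dim=-1)
        ent = -(logp.exp() * logp).sum(-1)           # (batch, seq) token entropies
        return -(ent * mask).sum(-1) / mask.sum(-1)  # per-sequence reward

    logits = torch.randn(2, 5, 32000)                # (batch, seq, vocab)
    mask = torch.ones(2, 5)                          # 1 on generated tokens
    print(entropy_reward(logits, mask))              # feed into any policy-gradient update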

URL: https://openreview.net/forum?id=gInznr8EsQ

---

Title: ADiff4TPP: Asynchronous Diffusion Models for Temporal Point Processes

Abstract: This work introduces a diffusion model-based approach to modeling temporal point processes via an asynchronous noise schedule. At each step of the diffusion process, the noise schedule injects noise of varying scales into different parts of the data. With a careful design of the noise schedules, earlier events are generated faster than later ones, thus providing stronger conditioning for forecasting the more distant future. We derive an objective to effectively train these models for a general family of noise schedules based on conditional flow matching. Our method models the joint distribution of the latent representations of events in a sequence and achieves state-of-the-art results in predicting both the next inter-event time and event type on benchmark datasets. Additionally, it flexibly accommodates varying lengths of observation and prediction windows in different forecasting settings by adjusting the starting and ending points of the generation process. Finally, our method shows superior performance in long horizon prediction tasks, outperforming existing baseline methods.

URL: https://openreview.net/forum?id=bwnZW4wXh4

---

Title: Advances in Temporal Point Processes: Bayesian, Neural, and LLM Approaches

Abstract: Temporal point processes (TPPs) are stochastic process models used to characterize event sequences occurring in continuous time. Traditional statistical TPPs have a long-standing history, with numerous models proposed and successfully applied across diverse domains. In recent years, advances in deep learning have spurred the development of neural TPPs, enabling greater flexibility and expressiveness in capturing complex temporal dynamics. The emergence of large language models (LLMs) has further sparked excitement, offering new possibilities for modeling and analyzing event sequences by leveraging their rich contextual understanding.
This survey presents a comprehensive review of recent research on TPPs from three perspectives: Bayesian, deep learning, and LLM approaches. We begin with a review of the fundamental concepts of TPPs, followed by an in-depth discussion of model design and parameter estimation techniques in these three frameworks. We also revisit classic application areas of TPPs to highlight their practical relevance. Finally, we outline challenges and promising directions for future research.
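
As a refresher on the fundamentals the survey reviews, the classic exponential Hawkes process illustrates a TPP through its conditional intensity; a small NumPy example (parameter values are arbitrary):

    import numpy as np

    def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
        """Exponential Hawkes intensity, a canonical statistical TPP:
        lambda(t) = mu + sum over t_i < t of alpha * exp(-beta * (t - t_i))."""
        history = np.asarray(history)
        past = history[history < t]
        return mu + alpha * np.exp(-beta * (t - past)).sum()

    events = [1.0, 1.3, 2.7]
    for t in [0.5, 1.5, 3.0]:
        print(t, hawkes_intensity(t, events))   # self-excitation after each event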

URL: https://openreview.net/forum?id=SXgGKkShhT

---

Title: SpecEval: Evaluating Model Adherence to Behavior Specifications

Abstract: Companies that develop foundation models often publish behavioral guidelines they pledge their models will follow, but it remains unclear whether models actually do so, as there has been no systematic audit of adherence to these guidelines. We propose a simple but imperative baseline: at minimum, a foundation model should consistently satisfy its developer's own behavioral specifications when judged by the developer's own evaluator models. We forefront \emph{three-way consistency} between a provider's specification, the provider's model outputs, and adherence scores from the provider model as a judge; an extension of prior two-way generator-validator consistency. We introduce an automated framework that audits models against their providers' specifications by (i) parsing statements that delineate desired behaviors, (ii) generating targeted prompts to elicit the aforementioned behaviors, and (iii) using the responses as inputs to models to judge adherence. We apply our framework to 16 models from six developers across 100+ behavioral statements, finding three-way consistency gaps of up to 20\% across providers.

URL: https://openreview.net/forum?id=VzLIQ3Lqm9

---

Title: Back to the Basics: Revisiting the Median for Out-of-Distribution Detection from Unlabeled Data

Abstract: Out-of-distribution (OOD) detection plays a crucial role in ensuring the robustness and reliability of machine learning systems deployed in real-world applications. Recent approaches have explored the use of unlabeled data, showing potential for enhancing OOD detection capabilities. However, effectively utilizing unlabeled in-the-wild data remains challenging due to the mixed nature of both in-distribution (InD) and OOD samples. The lack of a distinct set of OOD samples complicates the task of training an optimal OOD classifier. In this work, we introduce Medix, a novel framework designed to identify potential outliers from unlabeled data using the median operation. We use the median as an OOD detection mechanism because it provides a stable estimate of central tendency and is robust to noise and outliers. Using these identified outliers, along with labeled InD data, we train a robust OOD classifier. From a theoretical perspective, we derive error bounds that demonstrate Medix achieves a low error rate. Empirical results further substantiate our claims, as Medix outperforms existing methods across the board in open-world settings, confirming the validity of our theoretical insights.
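
A stripped-down sketch of the median idea on synthetic features (the scoring rule and threshold below are illustrative, not Medix's full procedure):

    import numpy as np

    def median_outlier_scores(features):
        """Score each unlabeled sample by its distance to the feature-wise
        median, a robust center that mixed-in OOD points barely shift."""
        center = np.median(features, axis=0)
        return np.linalg.norm(features - center, axis=1)

    rng = np.random.default_rng(1)
    ind = rng.normal(0, 1, size=(950, 16))   # in-distribution features
    ood = rng.normal(4, 1, size=(50, 16))    # hidden outliers in the wild set
    wild = np.vstack([ind, ood])
    scores = median_outlier_scores(wild)
    candidates = np.argsort(scores)[-50:]    # flagged as potential OOD
    print((candidates >= 950).mean())        # fraction of true OOD recovered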

URL: https://openreview.net/forum?id=jFjA24PBJx

---

Title: Cross-Domain Feature Alignment for Federated Domain Generalization

Abstract: Learning a robust global model that generalizes well under domain skew is crucial for federated learning (FL). Feature alignment enhances domain-invariant representation learning, thereby aligning inconsistent feature spaces caused by domain skew. However, we find two key problems that limit feature alignment. (1) Mismatched batch normalization (BN) statistics and insufficient inter-class separation lead to divergent local prototypes under domain skew, preventing global prototypes from representing global information. (2) Existing feature alignment methods often introduce aggregation bias under domain skew, causing the feature space to favor domains with more clients. Building on these findings, we propose a novel federated learning approach with cross-domain feature alignment (FedCoda), which calibrates feature alignment and ensures fairness across domains. To learn domain-invariant features with feature alignment, FedCoda calibrates batch normalization and local prototypes to generate consistent representations across domains. To enhance the fairness of feature alignment across domains, FedCoda optimizes prototype aggregation and produces fair global prototypes. Extensive experiments show that FedCoda outperforms relevant baselines.

URL: https://openreview.net/forum?id=2KUupiubOW

---

Title: ChatAni: Language-Driven Multi-Actor Animation Generation in Street Scenes

Abstract: Generating interactive and realistic traffic participant animations from instructions is essential for autonomous driving simulations. Existing methods, however, fail to comprehensively address the diverse participants and their dynamic interactions in street scenes. In this paper, we present ChatAni, the first system capable of generating interactive, realistic, and controllable multi-actor animations based on language instructions. To produce fine-grained, realistic animations, ChatAni introduces two novel animators: PedAnimator, a unified multi-task animator that generates interaction-aware pedestrian animations under varying task plans, and VehAnimator, a kinematics-based policy that generates physically plausible vehicle animations. For precise control through complex language, ChatAni employs a multi-LLM-agent role-playing approach, using natural language to plan the trajectories and behaviors of different participants. Extensive experiments demonstrate that ChatAni can generate realistic street scenes with interacting vehicles and pedestrians, benefiting tasks like prediction and understanding. All related code, data, and checkpoints will be open-sourced.

URL: https://openreview.net/forum?id=Z9fDUbm9iP

---

Title: Temporal Preference Optimization of Large Multimodal Models

Abstract: Despite recent advancements in video large multimodal models (video-LMMs), accurate temporal grounding remains a key challenge. In this work, we introduce Temporal Preference Optimization (TPO)—a post-training framework that unlocks superior temporal reasoning in video-LMMs without requiring human annotations. TPO enables preference modeling by manipulating video inputs to generate contrastive responses, ensuring that preferred responses are more temporally grounded than dis-preferred ones. Through preference learning, TPO enhances the model’s capability for more comprehensive video understanding with better temporal reasoning. Extensive experiments on LongVideoBench, MLVU, and Video-MME demonstrate that TPO significantly improves temporal grounding across multiple video-LMMs. Notably, LLaVA-Video-TPO achieves state-of-the-art performance among 7B models on Video-MME, establishing TPO as a scalable and effective solution for advancing temporal understanding in video analysis.

URL: https://openreview.net/forum?id=2NXL3ZhPci

---

Title: Topology- and Gradient-Guided Knowledge Distillation for Point Cloud Semantic Segmentation

Abstract: Point cloud processing has gained significant attention due to its critical role in applications such as autonomous driving and 3D object recognition. However, deploying high-performance models like Point Transformer V3 in resource-constrained environments remains challenging due to their high computational and memory demands. This work introduces a novel distillation framework that leverages topology-aware representations and gradient-guided knowledge distillation to effectively transfer knowledge from a high-capacity teacher to a lightweight student model. Our approach captures the underlying geometric structures of point clouds while selectively guiding the student model's learning process through gradient-based feature alignment. Experimental results on the NuScenes, SemanticKITTI, and Waymo datasets demonstrate that the proposed method achieves competitive performance, with an approximately 16$\times$ reduction in model size and up to 1.9$\times$ decrease in inference time compared to its teacher model. Notably, on NuScenes, our method achieves competitive performance among knowledge distillation techniques trained solely on LiDAR data, surpassing prior knowledge distillation baselines in segmentation performance. Our implementation is available anonymously at https://anonymous.4open.science/r/PTv3-distill-4E9E

URL: https://openreview.net/forum?id=lP2phGa5af

---

Title: Diffusion-based Annealed Boltzmann Generators: benefits, pitfalls and hopes

Abstract: Sampling configurations at thermodynamic equilibrium is a central challenge in statistical physics. Boltzmann Generators (BGs) address this problem by pairing a generative model with a Monte Carlo (MC) correction scheme, yielding asymptotically consistent samples from an unnormalized target density. However, most existing BGs rely on classic MC mechanisms such as importance sampling, which (i) impose strong constraints on the backbone model (typically requiring exact and efficient likelihood evaluation) and (ii) suffer from severe scalability issues in high-dimensional, multi-modal settings. This work investigates BGs built around annealed Monte Carlo (aMC) schemes, which mitigate the limitations of classic MC by bridging a simple reference distribution to the target through a sequence of intermediate densities. In this context, diffusion models (DMs) are particularly appealing backbones: they are powerful generative models and naturally induce density paths that have been leveraged in prior aMC-based methods. We provide an empirical meta-analysis of this DM-based aMC-BG design choice on controlled yet challenging synthetic benchmarks based on multi-modal Gaussian mixtures, varying inter-mode separation, number of modes, and dimensionality. To disentangle learning effects from inference effects, we first study an idealized setting in which the DM is perfectly learned, and then turn to realistic settings where the DM is trained from data. Even in the idealized regime, we find that standard aMC integrations of DMs that rely only on first-order stochastic denoising kernels systematically fail in the proposed scenarios. In contrast, incorporating second-order denoising kernels can substantially improve performance when the required covariance information is available. Motivated by this gap, we propose an alternative aMC integration based on deterministic first-order transport maps derived from DMs; empirically, this approach consistently outperforms its stochastic first-order counterpart, albeit at increased computational cost. Overall, while results in the perfect-learning regime suggest that exploiting DM-induced dynamics within aMC is a promising route to building effective BGs, our experiments with learned DMs show that DM–aMC combinations still struggle to produce accurate BGs in practice. We attribute this limitation primarily to inaccuracies in DM log-density estimation.

URL: https://openreview.net/forum?id=la4FDaeIbw

---

Title: Some Robustness Properties of Label Cleaning

Abstract: We demonstrate that learning procedures that rely on aggregated labels, e.g., label information distilled from noisy responses, enjoy robustness properties impossible without data cleaning. This robustness appears in several ways. In the context of risk consistency---when one takes the standard approach in machine learning of minimizing a surrogate (typically convex) loss in place of a desired task loss (such as the zero-one mis-classification error)---procedures using label aggregation obtain stronger consistency guarantees than those even possible using raw labels. And while classical statistical scenarios of fitting perfectly-specified models suggest that incorporating all possible information---modeling uncertainty in labels---is statistically efficient, consistency fails for ``standard'' approaches as soon as a loss to be minimized is even slightly mis-specified. Yet procedures leveraging aggregated information still converge to optimal classifiers, highlighting how incorporating a fuller view of the data analysis pipeline, from collection to model-fitting to prediction time, can yield a more robust methodology by refining noisy signals.

URL: https://openreview.net/forum?id=O2ORErbcBy

---

Title: VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

Abstract: Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated due to the possibility of guessing the correct answer; Second, a significant portion of questions in these benchmarks have strong priors to allow models to answer directly without even reading the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy given a random frame from a long video on Video-MME. We also observe that increasing the number of frames does not necessarily lead to improvement on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs’ long‑video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark containing questions with open‑ended short‑answer, which truly require understanding the entire video. VideoEval-Pro assesses both segment‑level and full‑video understanding through perception and reasoning tasks. By evaluating 27 proprietary and open-source video LMMs, we conclude the following findings: (1) video LMMs show drastic performance (>25%) drops on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain. Our benchmark and evaluation code will be fully released.

URL: https://openreview.net/forum?id=2BCfis3jZA

---

Title: Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

Abstract: A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.
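
A hedged sketch of the underlying pattern: a frozen base layer plus a library of low-rank (LoRA-style) experts blended by a lightweight router (dimensions, rank, and gating are illustrative, not the paper's exact architecture):

    import torch

    class LoRAExpertMixture(torch.nn.Module):
        """Frozen base layer plus low-rank experts combined by a router:
        y = W x + sum_k g_k(x) * B_k A_k x (illustrative)."""
        def __init__(self, dim, rank=4, num_experts=3):
            super().__init__()
            self.base = torch.nn.Linear(dim, dim)
            for p in self.base.parameters():      # pretrained weights stay frozen
                p.requires_grad_(False)
            self.A = torch.nn.Parameter(torch.randn(num_experts, rank, dim) * 0.02)
            self.B = torch.nn.Parameter(torch.zeros(num_experts, dim, rank))
            self.router = torch.nn.Linear(dim, num_experts)

        def forward(self, x):
            gates = torch.softmax(self.router(x), dim=-1)        # (batch, K)
            delta = torch.einsum("krd,bd->bkr", self.A, x)       # down-projection
            delta = torch.einsum("kdr,bkr->bkd", self.B, delta)  # up-projection
            return self.base(x) + (gates.unsqueeze(-1) * delta).sum(1)

    layer = LoRAExpertMixture(dim=16)
    y = layer(torch.randn(8, 16))
    print(y.shape)   # replaying stored router coefficients counters forgetting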

URL: https://openreview.net/forum?id=MHVBrjS8cG

---

Title: Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting

Abstract: Time series forecasting (TSF) is critical in domains like energy, finance, healthcare, and logistics, requiring models that generalize across diverse datasets. Large pre-trained models such as Chronos and Time-MoE show strong zero-shot (ZS) performance but suffer from high computational costs. In this work, we introduce Super-Linear, a lightweight and scalable mixture-of-experts (MoE) model for general forecasting. It replaces deep architectures with simple frequency-specialized linear experts, trained on resampled data across multiple frequency regimes. A lightweight spectral gating mechanism dynamically selects relevant experts, enabling efficient, accurate forecasting. Despite its simplicity, Super-Linear demonstrates strong performance across benchmarks, while substantially improving efficiency, robustness to sampling rates, and interpretability.
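
A minimal sketch of spectral gating over frequency-specialized linear experts, assuming a hand-crafted FFT-energy gate in place of the paper's learned gating network; all names and shapes here are illustrative.

```python
import numpy as np

def spectral_gate(window, n_experts):
    # Toy gate: weight experts by spectral energy in coarse FFT bands.
    # The paper's gating mechanism is learned; this is only illustrative.
    spec = np.abs(np.fft.rfft(window - window.mean())) ** 2
    energy = np.array([b.sum() for b in np.array_split(spec, n_experts)]) + 1e-8
    return energy / energy.sum()                     # mixture weights, sum to 1

def forecast(window, experts):
    # Each "expert" is a plain linear map from context window to horizon.
    gates = spectral_gate(window, len(experts))
    preds = np.stack([W @ window for W in experts])  # (n_experts, horizon)
    return (gates[:, None] * preds).sum(axis=0)

# Usage: two random linear experts, 96-step context, 24-step horizon.
rng = np.random.default_rng(0)
experts = [rng.normal(size=(24, 96)) * 0.01 for _ in range(2)]
y_hat = forecast(rng.normal(size=96), experts)
```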

URL: https://openreview.net/forum?id=av8niDGYMk

---

Title: Identifying Invariant Physical Dynamics Across Multiple Environments

Abstract: Data-driven machine learning methods have been widely employed in dynamical systems, but they fail to generalize to unseen dynamical environments where data is sparse and noisy. While a few recent works have proposed to address this problem by introducing domain adaptation techniques based on deep neural networks (DNNs), these black-box methods fail to explain and understand the underlying system behaviors. In this work, we propose an Invariant PhysicAl Dynamics identification framework (IPAD), designed to extract common physical laws from data collected from multiple environments. Specifically, IPAD combines Monte Carlo Tree Search (MCTS) and a multi-environment reward to effectively uncover physical dynamics from imbalanced data across multiple environments. Moreover, it incorporates a variational formulation (VF) loss function to enhance robustness in noisy conditions and then introduces a post-hoc purification approach to further refine discovered equations. We also theoretically prove the convergence rate of VF for symbolic regression. The evaluation results demonstrate that IPAD significantly outperforms existing methods in discovering invariant physical dynamics across both simulated and real-world datasets. The source code will be publicly available upon publication.

URL: https://openreview.net/forum?id=xvQYvYEGhj

---

Title: Variational Graph Structure Learning for GNNs by using Marginal Likelihood

Abstract: Learning graph structures for Graph Neural Networks (GNNs) can improve their performance, but it is challenging to search over the large discrete space of graphs. Prior works often impose fixed structural constraints to promote properties such as sparsity, but these constraints can be misspecified and overly restrictive, potentially degrading performance. Here, we propose a simpler alternative based on marginal likelihood, which naturally favors such properties without requiring any explicit graph constraints. We show that a variational formulation with Laplace's method automatically leads to a marginal-likelihood based objective over discrete graph structures, which can be optimized efficiently using the Gumbel-Softmax trick. We call this approach the Laplace Approximation-based Graph Structure (LAGS) method, and show empirically that it improves the recent state-of-the-art GNNs.
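
The Gumbel-Softmax step can be pictured as follows: per-edge logits are relaxed into (approximately) discrete on/off samples that still pass gradients. The objective below is a deliberate placeholder; LAGS optimizes a marginal-likelihood objective derived via Laplace's method, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

n = 5
logits = torch.zeros(n, n, 2, requires_grad=True)    # per-edge {off, on} logits

def sample_adjacency(logits, tau=0.5):
    # Straight-through Gumbel-Softmax: hard 0/1 edges in the forward pass,
    # differentiable soft samples in the backward pass.
    edges = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1]
    return (edges + edges.T) / 2                     # symmetrize (undirected sketch)

opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(100):
    A = sample_adjacency(logits)
    loss = A.sum()                                   # placeholder objective (sparsity)
    opt.zero_grad()
    loss.backward()
    opt.step()
```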

URL: https://openreview.net/forum?id=fVMr2sTow5

---

Title: Towards Understanding Neural Collapse: The Effects of Batch Normalization and Weight Decay

Abstract: Neural Collapse (NC) is a geometric structure recently observed at the terminal phase of training deep neural networks, which states that last-layer feature vectors for the same class "collapse" to a single point, while features of different classes become equally separated. We demonstrate that batch normalization (BN) and weight decay (WD) critically influence the emergence of NC. In the near-optimal loss regime, we establish an asymptotic lower bound on the emergence of NC that depends only on the WD value, training loss, and the presence of last-layer BN. Our experiments substantiate theoretical insights by showing that models demonstrate a stronger presence of NC with BN, appropriate WD values and lower loss. Our findings offer a novel perspective in studying the role of BN and WD in shaping neural network features.
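
For concreteness, the within-class variability component of NC (often called NC1) is commonly quantified as tr(Sigma_W Sigma_B^+)/K. A short sketch, assuming this standard metric rather than any quantity specific to the paper:

```python
import numpy as np

def nc1_within_class_collapse(features, labels):
    """Standard NC1 metric: trace(Sigma_W @ pinv(Sigma_B)) / K.
    Smaller values indicate stronger within-class variability collapse."""
    classes = np.unique(labels)
    mu_g = features.mean(axis=0)                    # global feature mean
    d = features.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        Sw += (fc - mu_c).T @ (fc - mu_c) / len(features)   # within-class scatter
        Sb += np.outer(mu_c - mu_g, mu_c - mu_g) * len(fc) / len(features)
    return np.trace(Sw @ np.linalg.pinv(Sb)) / len(classes)
```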

URL: https://openreview.net/forum?id=eKqgCPDBFg

---

Title: From Models to Systems: A Comprehensive Survey of Efficient Multimodal Learning

Abstract: The rapid expansion of multimodal models has surfaced formidable bottlenecks in computation, memory, and deployment, catalyzing the rise of Efficient Multimodal Learning (EML) as a pivotal research frontier. Despite intensive progress, a cohesive understanding of what, how, and where efficiency is manifested across the learning stack remains fragmented. This survey systematizes the EML landscape by introducing the first structured, model-to-system taxonomy. We distill insights from over 300 seminal works into three hierarchical levels—model, algorithm, and system—addressing architectural parsimony, execution refinement, and hardware-aware orchestration, respectively. Moving beyond a purely categorical review, we offer a methodological synthesis of the vertical synergies between these layers, elucidating how cross-layer co-design resolves the fundamental "Efficiency-Utility-Privacy" trilemma. Through an integrative case study of Multimodal Large Language Models (MLLMs), we trace the field's evolutionary trajectory from initial structural adjustments to modern full-stack resource orchestration. Furthermore, we provide a holistic discussion and application-specific optimization blueprints for diverse domains, and posit a paradigm shift toward self-regulating intelligence, where efficiency is an intrinsic, emergent property of the model's fundamental design rather than a post-hoc constraint. Finally, we present open challenges and future directions that will define the trajectory of EML research. This survey establishes a formal foundation for multimodal systems that are not only high-performing and generalizable but natively efficient and ready for ubiquitous deployment. We also maintain a GitHub repository that is continuously updated with related work for the research community.

URL: https://openreview.net/forum?id=yfTU8FTS2Z

---

Title: RTGen: Real-Time Generative Detection Transformer

Abstract: Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose a Real-Time Generative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region–Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting from these designs, RTGen-R34 achieves 131.3 FPS on T4 GPUs, over 270× faster than GenerateU. Moreover, our models learn to generate category names directly from detection labels, without relying on external supervision such as CLIP or pretrained language models, achieving efficient and flexible open-ended detection.

URL: https://openreview.net/forum?id=ZhUOAZhhc7

---

Title: From Tables to Time: Extending TabPFN-v2 to Time Series Forecasting

Abstract: Recent progress in foundation models has enabled strong zero-shot performance for time series forecasting. In this work, we show that such capabilities can also emerge from tabular foundation models. We introduce TabPFN-TS, a simple method that treats forecasting as a tabular regression problem by combining lightweight temporal featurization with the pretrained TabPFN-v2. This formulation requires no time-series–specific pretraining and naturally supports both univariate and covariate-informed forecasting. Despite its compact size (11M parameters), TabPFN-TS achieves state-of-the-art performance on covariate-informed forecasting and competitive accuracy on univariate forecasting across the GIFT-Eval and fev-bench benchmarks. We further provide controlled analyses examining how the model interprets temporal structure, how featurization choices affect accuracy, and how forecasts change under alternative tabular backbones. Together, our results demonstrate that tabular foundation models—when paired with suitable temporal features—offer an efficient and versatile alternative for forecasting, bridging tabular and time-series learning within a unified framework.
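
A minimal sketch of the recipe, with a hypothetical calendar featurization and a gradient-boosting regressor standing in for the pretrained TabPFN-v2 backbone; the paper's exact feature set and model wrapper may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

def featurize(timestamps):
    # Hypothetical lightweight calendar features; illustrative only.
    ts = pd.to_datetime(timestamps)
    return pd.DataFrame({
        "hour": ts.hour, "dayofweek": ts.dayofweek, "month": ts.month,
        "sin_doy": np.sin(2 * np.pi * ts.dayofyear / 365.25),
        "cos_doy": np.cos(2 * np.pi * ts.dayofyear / 365.25),
    })

# Forecasting as tabular regression: fit on past rows, predict future rows.
hist = pd.date_range("2024-01-01", periods=500, freq="h")
y = np.sin(np.arange(500) * 2 * np.pi / 24) + np.random.default_rng(0).normal(0, 0.1, 500)
future = pd.date_range(hist[-1] + pd.Timedelta(hours=1), periods=48, freq="h")

model = HistGradientBoostingRegressor().fit(featurize(hist), y)  # TabPFN-v2 in the paper
y_hat = model.predict(featurize(future))
```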

URL: https://openreview.net/forum?id=KIkQj8VOUY

---

Title: Discontinuity-Preserving Image Super-Resolution using MRF-Based MAP-Optimized One-Step Diffusion

Abstract: We propose a real-world image super-resolution framework that leverages a pretrained text-to-image Stable Diffusion model optimized for single-step sampling. Unlike traditional multi-step diffusion-based methods, which are computationally intensive, our approach enables fast inference while preserving high perceptual quality. To this end, we integrate a lightweight image enhancement module trained jointly with the diffusion model under a Maximum A Posteriori (MAP) formulation. The optimization includes a compound Markov Random Field (MRF) prior, derived from the anticipated discontinuity line field energy, which functions as a structural regularizer to preserve fine image details and facilitate deblurring. Existing single-step diffusion approaches often rely on distillation or noise map estimation, which limits their ability to generate rich pixel-space details. In contrast, our method explicitly models high-frequency line field consistency between the low- and high-resolution domains, guiding the image enhancer to reconstruct sharp outputs. By preserving and enhancing structural features such as edges and textures, our framework effectively handles complex degradations commonly encountered in real-world scenarios. Experimental results demonstrate that our method achieves performance that is comparable to or exceeds that of state-of-the-art single-step and multi-step diffusion-based image super-resolution methods qualitatively, quantitatively, and computationally.

URL: https://openreview.net/forum?id=CLrWXyyL5c

---

Title: Approaching the Harm of Gradient Attacks While Only Flipping Labels

Abstract: Machine learning systems deployed in distributed or federated environments are highly susceptible to adversarial manipulations, particularly availability attacks, which render the trained model unavailable. Prior research in distributed ML has demonstrated such adversarial effects through the injection of gradients or data poisoning. In this study, we aim to better understand the potential of weaker (action-wise) adversaries by asking: can availability attacks be inflicted solely by flipping a subset of training labels, without altering features, and under a strict flipping budget? We analyze the extent of damage caused by constrained label flipping attacks against federated learning under mean aggregation—the dominant baseline in research and production. Focusing on a distributed classification problem, (1) we propose a novel formalization of label flipping attacks on logistic regression models and derive a greedy algorithm that is provably optimal at each training step. (2) To demonstrate that availability attacks can be approached by label flipping alone, we show that a budget of only $0.1\%$ of labels at each training step can reduce the accuracy of the model by $6\%$, and that some models can perform worse than random guessing when up to $25\%$ of labels are flipped. (3) We shed light on an interesting interplay between what the attacker gains from more write-access versus what they gain from more flipping budget. (4) We define and compare the power of a targeted label flipping attack to that of an untargeted one.
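
To make the setting concrete, here is one hedged sketch of a greedy per-step flip rule for logistic regression: score each candidate flip by how much it moves the aggregated gradient along an attack direction and take the top of the budget. The scoring rule and attack direction below are illustrative assumptions, not the paper's provably optimal criterion.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_label_flips(X, y, w, budget, direction):
    # Gradient contribution of example i is (sigmoid(w.x_i) - y_i) * x_i,
    # so flipping y_i in {0, 1} shifts it by (2*y_i - 1) * x_i.
    delta = (2 * y - 1)[:, None] * X                 # per-example flip effect
    scores = delta @ direction                       # alignment with attack goal
    idx = np.argsort(scores)[-budget:]               # best `budget` flips this step
    y_adv = y.copy()
    y_adv[idx] = 1 - y_adv[idx]
    return y_adv, idx

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)
w = rng.normal(size=5)
# Push the averaged update toward loss ascent: aim opposite the clean gradient.
direction = -((sigmoid(X @ w) - y) @ X)
y_adv, flipped = greedy_label_flips(X, y, w, budget=10, direction=direction)
```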

URL: https://openreview.net/forum?id=1NQHfbECpm

---

Title: Real-Time Autonomous Systems for Tracking and Responding to Uncontrolled Fires in the Ambient Environment: A Review

Abstract: Fire-induced air pollution—originating from wildfires and industrial fires—poses a rising threat to public health and environmental systems. These episodic but increasingly frequent events release hazardous mixtures of particulate matter and gases, often overwhelming existing monitoring and response infrastructures. Traditional approaches to air quality sensing, health risk modelling, and emergency coordination are limited in spatial resolution, real-time responsiveness, and system integration. This literature review investigates how artificial intelligence (AI) and autonomous systems can address these limitations by enabling more adaptive, predictive, and interconnected fire pollution management strategies. Using a structured thematic synthesis, the review analyses 128 papers across four domains: (1) risks and impacts of fire-induced air pollution, (2) real-time autonomous systems for sensing, forecasting, and simulation, (3) AI-enhanced health risk modelling, and (4) governance and policy frameworks. Key findings reveal strong potential for UAV-based plume tracking, multi-agent learning systems, and data-driven health forecasting, but also highlight persistent gaps in regulatory readiness, system interoperability, and equity. The review argues for a coordinated, AI-centric framework to improve environmental sensing, health protection, and governance in fire-prone contexts.

URL: https://openreview.net/forum?id=70KyPTKcLA

---

Title: SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models

Abstract: Although the capabilities of large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools allows for modest improvements, suggesting inherent architectural limitations which might not be overcome by test-time scaling approaches alone.

URL: https://openreview.net/forum?id=pLosAkOoGU

---

Title: Bounding Spill-over Effect under Structural Uncertainty

Abstract: Causal inference on graphs has attracted increasing attention due to possible interactions among units. One main challenge is spill-over effects, i.e., the influence of neighboring nodes' treatments on the target outcome. However, the observed graphs may suffer from inconsistency between the local network structure and the interference mechanism, which invalidates the identifiability guarantees of existing spill-over estimators. To address this challenge, we propose learning to bound spill-over effects under local structural uncertainty. We start by introducing a structure proposal network that maps the ego-graph of uncertain nodes to probably consistent candidate graphs, together with a spill-over effect estimator that explores the upper and lower limits of the spill-over effect by traversing the feasible ego-graph space defined by the generator. The generated ego-graphs are constrained to preserve the incomplete information (closeness) and to be indistinguishable from the consistent ego-graphs sampled from stable nodes (consistency), whereas the spill-over estimator is constrained to be compatible with the observed outcomes on the network (faithfulness). We formulate the above objectives as a constrained, bi-level adversarial learning framework, where an efficient and stable EM-based objective is proposed to solve the optimization problem. Experiments on both semi-simulated and real-world datasets show the effectiveness of the proposed method.

URL: https://openreview.net/forum?id=kjZudggBJk

---

Title: SafeOR-Gym: A Benchmark Suite for Safe Reinforcement Learning Algorithms on Practical Operations Research Problems

Abstract: Most existing safe reinforcement learning (RL) benchmarks focus on robotics and control tasks, offering limited relevance to high-stakes domains that involve structured constraints, mixed-integer decisions, and industrial complexity. This gap hinders the advancement and deployment of safe RL in critical areas such as energy systems, manufacturing, and supply chains. To address this limitation, we present SafeOR-Gym, a benchmark suite of nine operations research (OR) environments tailored for safe RL under complex constraints. Each environment captures a realistic planning, scheduling, or control problem characterized by cost-based constraint violations, planning horizons, and hybrid discrete-continuous action spaces. The suite integrates seamlessly with the Constrained Markov Decision Process (CMDP) interface provided by OmniSafe. We evaluate several state-of-the-art safe RL algorithms across these environments, revealing a wide range of performance: while some tasks are tractable, others expose fundamental limitations in current approaches. SafeOR-Gym provides a challenging and practical testbed that aims to catalyze future research in safe RL for real-world decision-making problems.
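
For readers unfamiliar with the CMDP interface, such environments expose a scalar constraint cost alongside the reward. The toy inventory environment below is a schematic of that pattern; all dynamics, names, and bounds are invented here and it is not one of the nine SafeOR-Gym environments.

```python
import numpy as np
import gymnasium as gym

class ToyInventoryCMDP(gym.Env):
    """Schematic CMDP: reward for tracking demand, constraint cost for
    exceeding storage capacity. Illustrative only; see the SafeOR-Gym
    repository for the suite's actual environments and OmniSafe wrappers."""

    def __init__(self, capacity=10.0, horizon=30):
        self.observation_space = gym.spaces.Box(0.0, capacity, shape=(1,))
        self.action_space = gym.spaces.Box(0.0, 5.0, shape=(1,))  # order quantity
        self.capacity, self.horizon = capacity, horizon

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.stock, self.t = 5.0, 0
        return np.array([self.stock], np.float32), {}

    def step(self, action):
        demand = self.np_random.uniform(0.0, 3.0)
        self.stock += float(action[0]) - demand
        cost = max(0.0, self.stock - self.capacity)   # constraint violation cost
        reward = -abs(demand - float(action[0]))      # track demand
        self.stock = float(np.clip(self.stock, 0.0, self.capacity))
        self.t += 1
        obs = np.array([self.stock], np.float32)
        return obs, reward, self.t >= self.horizon, False, {"cost": cost}
```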

URL: https://openreview.net/forum?id=3EREfePUvi

---

Title: On the Conditioning Consistency Gap in Conditional Neural Processes

Abstract: Neural processes (NPs) are meta-learning models that map context sets to predictive distributions. While inspired by stochastic processes, NPs do not generally satisfy the Kolmogorov consistency conditions required to define a valid stochastic process. This inconsistency is widely acknowledged but poorly understood: practitioners note that NPs work well despite the violation, without quantifying what this means. We address this gap by defining the conditioning consistency gap, a KL divergence measuring how much a conditional neural process's (CNP's) predictions change when a point is added to the context versus conditioned upon. Our main results show that for CNPs with bounded encoders and Lipschitz decoders, the consistency gap is $O(1/n^2)$ in the context size $n$, and that this rate is tight. These bounds explain why CNPs behave approximately consistently for moderate context sizes while potentially exhibiting inconsistency in the few-shot regime.
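
Read directly from this description, one plausible formalization of the gap is the KL divergence between the CNP predictive after appending a point to the context and the predictive without it (for a CNP, Bayesian conditioning on an extra observation leaves the factorized predictive unchanged). The paper's precise definition may differ in its conditioning details.

```latex
% One plausible formalization; the paper's exact definition may differ.
\Delta_n(x^{*}) \;=\; \mathrm{KL}\Big(
    p_\theta\big(y^{*} \mid \mathcal{C}_n \cup \{(x', y')\}\big)
    \;\Big\|\;
    p_\theta\big(y^{*} \mid \mathcal{C}_n\big)
\Big),
\qquad
\Delta_n(x^{*}) = O(1/n^2)
\;\text{ for bounded encoders and Lipschitz decoders.}
```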

URL: https://openreview.net/forum?id=rLJ5Hm5vbG

---

Title: Consistent Spectral Clustering under Hyperbolic Geometry

Abstract: Spectral clustering is a widely used unsupervised learning method that partitions data by analyzing the spectrum of a similarity graph, where the classical formulations implicitly assume Euclidean geometry. This assumption becomes inadequate, however, when data exhibit a hierarchical or tree-like structure: in such settings, Euclidean distances distort geodesic relationships, leading to unstable spectral embeddings and degraded clustering performance. Motivated by this limitation, we study spectral clustering under hyperbolic geometry, a natural model for hierarchical data, and propose an intrinsically hyperbolic spectral clustering framework in which the similarity operator is defined using hyperbolic distances after estimating a latent hierarchical root. This construction yields a hyperbolic graph Laplacian whose spectrum better reflects the underlying geometry of the data. We provide a rigorous theoretical analysis establishing the weak consistency of the proposed method under a hyperbolic latent variable model, with convergence rates at least as fast as those of classical spectral clustering in Euclidean space. Empirical results on real-world hierarchical datasets demonstrate improved robustness to curvature and hierarchy depth relative to existing deep and hierarchical clustering baselines, highlighting the importance of geometric modeling in spectral methods and positioning hyperbolic geometry as a principled foundation for clustering complex structured data.
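
A minimal sketch of the intrinsic construction, assuming the Poincaré-ball distance (one standard hyperbolic model) and skipping the paper's root-estimation step; the kernel bandwidth and data here are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def poincare_dist(u, v, eps=1e-9):
    # Geodesic distance in the Poincare ball: d(u, v) =
    # arccosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))).
    uu, vv = (u * u).sum(-1), (v * v).sum(-1)
    duv = ((u - v) ** 2).sum(-1)
    arg = 1 + 2 * duv / ((1 - uu) * (1 - vv) + eps)
    return np.arccosh(np.maximum(arg, 1.0))

# Build a hyperbolic affinity, then run off-the-shelf spectral clustering on it.
rng = np.random.default_rng(0)
X = rng.uniform(-0.6, 0.6, size=(200, 2))            # points inside the Poincare disk
D = poincare_dist(X[:, None, :], X[None, :, :])      # pairwise geodesic distances
W = np.exp(-D**2 / np.median(D) ** 2)                # Gaussian affinity on hyperbolic distance
labels = SpectralClustering(n_clusters=3, affinity="precomputed").fit_predict(W)
```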

URL: https://openreview.net/forum?id=VUHVCWmTFH

---

Title: Do Visual Bias Mitigation Methods Generalize? A Cross-Domain Study

Abstract: Spurious correlations, defined as predictive but non-causal relationships within training data, constitute a significant challenge for deep learning. When such shortcuts exist in a dataset, models tend to exploit them instead of learning the intended, task-relevant features, resulting in biased predictions and poor generalization. Although numerous bias mitigation methods have been developed and evaluated on natural images, their ability to generalize to other modalities and domains, such as text (e.g., occupational gender imbalance and lexical bias), audio (e.g., demographic disparities and device signatures), medical imaging (e.g., hospital-level biases such as scanner or protocol differences), and video (e.g., scene background bias), remains largely unexplored. In this work, we conduct the first comprehensive cross-domain benchmark study, evaluating 11 bias mitigation methods across 6 datasets spanning text, audio, medical imaging, and video. For each dataset, we introduce tailored configurations designed to assess bias mitigation performance. Our findings show that several methods provide consistent improvements across modalities, with a subset exhibiting statistically significant bias mitigation in all domains. This study offers the first systematic evidence of cross-modal generalization for visual bias mitigation approaches and establishes a benchmark resource aimed at encouraging the development of bias mitigation methods that extend beyond the natural images domain. Code and data will be released publicly upon acceptance.

URL: https://openreview.net/forum?id=BHZuHFelDx

---

Title: Accelerating Inference of Discrete Autoregressive Normalizing Flows by Selective Jacobi Decoding

Abstract: Discrete normalizing flows are promising generative models with advantages such as analytical log-likelihood computation and end-to-end training. However, the architectural constraints needed to ensure invertibility and tractable Jacobian computation limit their expressive power and practical usability. Recent advancements utilize autoregressive modeling, significantly enhancing expressive power and generation quality. Nevertheless, such sequential modeling inherently restricts parallel computation during inference, leading to slow generation that impedes practical deployment. In this paper, we first identify that strict sequential dependency in inference is unnecessary for generating high-quality samples: sub-variables in sequential modeling can be well approximated without strictly conditioning on all preceding sub-variables. Moreover, the models tend to exhibit low dependency redundancy in the initial layer and higher redundancy in subsequent layers. Leveraging these observations, we propose a selective Jacobi decoding strategy that accelerates autoregressive inference through parallel iterative optimization. Theoretical analyses demonstrate the method's superlinear convergence rate and guarantee that the number of iterations required is no greater than that of the original sequential approach. Empirical evaluations across multiple datasets validate the generality and effectiveness of our acceleration technique, achieving up to 4.7 times faster inference on modern normalizing flow models while preserving generation quality.
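
The mechanics of Jacobi decoding can be seen on a toy autoregressive map: iterate all positions in parallel until the sequence is a fixed point, which matches the sequential output within at most as many updates as there are sequential steps. This generic sketch omits the paper's selective, layer-wise application.

```python
import numpy as np

def jacobi_decode(step_fn, length, max_iters=64):
    """Parallel fixed-point (Jacobi) decoding of an autoregressive map.
    step_fn(x) emits, for every position t, the value the model would
    produce given the current guesses at positions < t."""
    x = np.zeros(length)                       # arbitrary initialization
    for it in range(max_iters):
        x_new = step_fn(x)
        if np.allclose(x_new, x):              # fixed point = sequential output
            return x_new, it
        x = x_new
    return x, max_iters

# Toy "model": each token is a fixed affine function of its predecessor.
def step_fn(x):
    out = np.empty_like(x)
    out[0] = 1.0
    out[1:] = 0.5 * x[:-1] + 1.0
    return out

y, n_iters = jacobi_decode(step_fn, length=8)  # fixed point within `length` updates
```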

URL: https://openreview.net/forum?id=xYATz9HpE7

---

Title: Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Abstract: This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. This is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.

URL: https://openreview.net/forum?id=JedrMCZC6l

---

Title: Generative Modeling with Continuous Flows: Sample Complexity of Flow Matching

Abstract: Flow matching has recently emerged as a promising alternative to diffusion-based generative models, offering faster sampling and simpler training by learning continuous flows governed by ordinary differential equations. Despite growing empirical success, the theoretical understanding of flow matching remains limited, particularly in terms of sample complexity results. In this work, we provide the first analysis of the sample complexity for flow-matching based generative models without assuming access to the empirical risk minimizer (ERM) of the loss function for estimating the velocity field. Under standard assumptions on the loss function for velocity field estimation and boundedness of the data distribution, we show that a sufficiently expressive neural network can learn, with $\mathcal{O}(\epsilon^{-4})$ samples, a velocity field such that the Wasserstein-2 distance between the learned and the true distribution is less than $\mathcal{O}(\epsilon)$. The key technical idea is to decompose the velocity field estimation error into neural-network approximation error, statistical error due to the finite sample size, and optimization error due to the finite number of optimization steps for estimating the velocity field. Each of these terms is then handled via techniques that may be of independent interest.
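
For orientation, the velocity-field estimation loss being analyzed is, in its standard conditional flow matching form with a linear interpolation path, roughly the following; this is a generic sketch of the objective, not the paper's exact setup.

```python
import torch

def flow_matching_loss(v_theta, x1):
    """Conditional flow matching with the linear (rectified) path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], 1)                   # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                       # point on the path
    target = x1 - x0                                 # path velocity
    return ((v_theta(xt, t) - target) ** 2).mean()

# Usage with a small MLP velocity field on 2D data.
net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
v = lambda x, t: net(torch.cat([x, t], dim=-1))
loss = flow_matching_loss(v, torch.randn(128, 2))
loss.backward()
```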

URL: https://openreview.net/forum?id=oBc4oWAlcs

---

Title: LAMBDA: Assessing Few-shot Lexical Analogical Reasoning in Language Models

Abstract: Analogical reasoning in language models is a critical yet underexplored aspect of their capability, particularly as models grow in scale and training data. This work investigates the limitations of current models in inferring latent relational structures, focusing on lexical analogies. We introduce LAMBDA, a novel dataset of 3,000 relation-hidden lexical analogies spanning synonyms, antonyms, and derivational transformations, designed for two-shot induction. Our empirical evaluation across nine models, including four open-source models from 0.1B to 17B parameters, along with five commercial models, reveals a wide performance gap, with accuracies ranging from 4.9% to 49.3%, highlighting the challenge of systematic generalization. By analyzing error patterns such as identity echo and semantic drift, we provide insights into model weaknesses. These findings suggest that large-scale pre-training alone does not guarantee strong relational reasoning abilities, offering a foundation for targeted improvements in model design. Broader implications point to the potential for refining training methodologies to enhance analogical abstraction in language models.

URL: https://openreview.net/forum?id=WrWu3UWXZG

---

Title: SelfPrompt: Confidence-Aware Semi-Supervised Tuning for Improved Vision-Language Model Adaptation

Abstract: We present SelfPrompt, a novel prompt-tuning approach for vision-language models (VLMs) in a semi-supervised learning setup. Existing methods for tuning VLMs in semi-supervised setups struggle with the negative impact of the miscalibrated VLMs on pseudo-labelling, and the accumulation of noisy pseudo-labels. SelfPrompt addresses these challenges by introducing a cluster-guided pseudo-labelling method that improves pseudo-label accuracy, and a confidence-aware semi-supervised learning module that maximizes the utilization of unlabelled data by combining supervised learning and weakly-supervised learning. Additionally, we investigate our method in an active semi-supervised learning setup, where the labelled set is strategically selected to ensure the best utilization of a limited labelling budget. To this end, we propose a weakly-supervised sampling technique that selects a diverse and representative labelled set, which can be seamlessly integrated into existing methods to enhance their performance. We conduct extensive evaluations across 13 datasets, significantly surpassing state-of-the-art performances with average improvements of 6.23% in standard semi-supervised learning, 6.25% in active semi-supervised learning, and 4.9% in base-to-novel generalization, using a 2-shot setup. Furthermore, SelfPrompt shows excellent generalization in single-shot settings, achieving an average improvement of 11.78%.

URL: https://openreview.net/forum?id=cP6USDUjK8

---

Title: State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models

Abstract: As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. First, we find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured-output priors, such as JSON schemas. Third, while image inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.

URL: https://openreview.net/forum?id=sKoazMNH84

---

Title: When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Abstract: Retrieval Augmented Generation (RAG) is widely used to extend large language models (LLMs) beyond their parametric knowledge, yet it remains unclear when iterative retrieval-reasoning loops meaningfully outperform traditional static RAG, particularly in scientific domains where multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence impose substantial complexity. This study provides the first controlled, mechanism-level diagnostic evaluation of whether synchronized iterative retrieval and reasoning can surpass even an idealized static upper bound, Gold-Context RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze model behavior through a comprehensive diagnostic suite covering retrieval coverage gaps, anchor carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, yielding gains of up to 25.6 percentage points, particularly for non-reasoning fine-tuned models. Our analysis shows that synchronized retrieval and reasoning reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, benefits that static evidence cannot provide. However, we also identify limiting failure modes, including incomplete hop coverage, distractor-latch trajectories, early-stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, our results demonstrate that the process of staged retrieval is often more influential than the mere presence of ideal evidence. We provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and establish a foundation for developing more reliable, controllable iterative retrieval-reasoning frameworks. The code and evaluation results are available at https://anonymous.4open.science/r/Iterative-rag-095E/.
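
The controller in regime (iii) can be pictured as the loop below; the prompts, retriever, and DONE-token stopping rule are invented placeholders, not the paper's implementation.

```python
def iterative_rag(question, retrieve, llm, max_hops=4):
    """Schematic retrieve-reason loop. `retrieve` maps a query to passages;
    `llm` maps a prompt string to a completion. Both are caller-supplied."""
    evidence, hypothesis = [], ""
    for hop in range(max_hops):
        query = llm(f"Question: {question}\nKnown: {hypothesis}\n"
                    "Write the next search query, or DONE if answerable.")
        if query.strip() == "DONE":                  # evidence-aware stopping
            break
        evidence += retrieve(query)                  # fetch new passages
        hypothesis = llm(f"Question: {question}\nEvidence: {evidence}\n"
                         "Update the working hypothesis.")
    return llm(f"Question: {question}\nEvidence: {evidence}\nFinal answer:")
```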

URL: https://openreview.net/forum?id=pa5TnBdyDP

---

Title: Semantic-Drive: Trustworthy and Efficient Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus

Abstract: The development of robust Autonomous Vehicles (AVs) is currently hampered by a critical scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events, specifically scenarios like erratic jaywalking or complex construction diversions, remains a manual process that is often cost-prohibitive. Existing automated solutions rely either on coarse metadata search, which lacks semantic precision, or on cloud-based Vision-Language Models (VLMs) that introduce significant privacy risks and high latency overheads. In this work, we introduce Semantic-Drive, a local-first, neuro-symbolic framework designed for trustworthy semantic data mining. Our approach decouples perception into two distinct stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis, where a Reasoning VLM performs forensic scene analysis. To effectively mitigate hallucination and reliability issues common in generative models, we implement a "System 2" inference-time alignment strategy that utilizes a multi-model "Judge-Scout" consensus mechanism. When benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP). Notably, the system reduces Risk Assessment Error by 40% compared to single-model baselines. Crucially, the entire pipeline runs on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving and efficient alternative to cloud-native architectures.

URL: https://openreview.net/forum?id=qN2oN36L3k

---

Title: HERMES: Heterogeneous Effects Representation with Matched Embeddings using Siamese Networks

Abstract: A growing literature has focused on representation learning as it relates to the estimation and analysis of heterogeneous treatment effects in both experimental and observational settings. We are specifically interested in the estimation of conditional average treatment effect (CATE) functions, i.e., functions mapping the space of unit-level covariates to the effect of a binary treatment. In the absence of a controlled randomized mechanism of treatment assignment, simple comparisons between treated and control populations can be confounded by significant distributional differences in the covariate space. In this context, recent representation learning strategies aim to learn balanced latent representations in a new space where the treated and control distributions are more comparable, addressing confounding and reducing global bias. In this work, we leverage self-supervision and contrastive learning and propose a novel contrastive loss function that structures the latent space according to the similarity of estimated individual treatment effects (ITEs). We integrate this contrastive learning approach in HERMES (Heterogeneous Effects Representation with Matched Embeddings using Siamese networks), a Siamese neural network that learns a structured latent space by dynamically pairing samples whose estimated ITEs are similar. Unlike representation learning approaches that rely only on covariates, HERMES injects the ITE into representation learning, improving accuracy under standard assumptions. Experiments on the IHDP and JOBS benchmarks show that HERMES improves the expected Precision in MSE by 14–15% over baselines, without added inference cost.
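
A hedged sketch of the contrastive idea: treat pairs with similar estimated ITEs as positives and dissimilar pairs as negatives, in the classic siamese contrastive-loss form. The threshold, margin, and pairing schedule here are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ite_contrastive_loss(z, ite_hat, thresh=0.1, margin=1.0):
    """Pull together embeddings of samples with similar estimated ITEs and
    push apart dissimilar ones. `z`: (n, d) embeddings; `ite_hat`: (n,)."""
    d = torch.cdist(z, z)                            # pairwise embedding distances
    sim = (torch.cdist(ite_hat[:, None], ite_hat[:, None]) < thresh).float()
    pos = sim * d.pow(2)                             # similar ITEs: small distance
    neg = (1 - sim) * F.relu(margin - d).pow(2)      # dissimilar: at least `margin` apart
    mask = 1 - torch.eye(len(z))                     # drop self-pairs
    return ((pos + neg) * mask).sum() / mask.sum()
```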

URL: https://openreview.net/forum?id=G6F0fkBQEG

---

Title: TopicSummRAG: A Topic-Enhanced RAG model for Improving Query-Focused Summarization from Long Documents

Abstract: Despite larger context windows, large language models (LLMs) continue to struggle with answering queries over long, unstructured documents. Retrieval-Augmented Generation (RAG) mitigates this limitation by retrieving relevant text before generation; however, its effectiveness depends critically on how documents are segmented into retrievable units. Existing chunking strategies—such as fixed-size or sliding windows—often ignore topical coherence, frequently merging unrelated content or fragmenting coherent discourse, which degrades retrieval precision and downstream generation quality. We propose TopicSummRAG, a framework that aligns retrieval units with the latent topical structure of documents. TopicSummRAG first segments documents into contextually coherent topical chunks using a boundary-supervision-free contrastive text segmentation model, and then summarises each segment to form compact retrieval metadata. The segmentation component is evaluated independently on the QMSum and TIAGE benchmarks, where it consistently improves boundary detection and placement over strong baselines. During retrieval, segment-level summaries are matched against the query, and an entropy- and dominance-based filtering strategy adaptively selects relevant segments by measuring the concentration of relevance mass, avoiding brittle fixed cutoffs. We evaluate TopicSummRAG on ODSum-Story, ODSum-Meeting, and QMSum across multiple retrievers (BM25, SBERT, Situated) and LLMs (Qwen3-8B, LLaMA-3.1-8B, Gemma-3-12B). TopicSummRAG yields improvements of up to 13% ROUGE-1, 25% ROUGE-2, and 20% ROUGE-L, alongside 10–15 point gains in LENS and up to 10-point improvements in BLANC. These results demonstrate that topic-aware segmentation and adaptive retrieval substantially improve factual grounding, coherence, and retrieval robustness, providing a retriever- and model-agnostic framework for long-document query-focused summarisation and question answering with RAG.
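
The adaptive selection step might look like the sketch below, which keys the cutoff off the entropy of the normalized relevance scores; the thresholds and the exact rule are illustrative guesses, not the paper's criteria.

```python
import numpy as np

def select_segments(scores, max_entropy=1.5, min_dominance=0.3):
    """Adaptive segment selection: read how concentrated the relevance
    mass is instead of applying a fixed top-k cutoff."""
    p = np.exp(scores - scores.max())
    p /= p.sum()                                     # softmax over segment scores
    entropy = -(p * np.log(p + 1e-12)).sum()
    order = np.argsort(p)[::-1]                      # segments by relevance, descending
    if entropy > max_entropy:                        # diffuse relevance: keep 90% of mass
        keep = order[np.cumsum(p[order]) <= 0.9]
    else:                                            # concentrated: keep dominant segments
        keep = order[p[order] >= min_dominance * p[order[0]]]
    return keep if len(keep) else order[:1]          # always keep at least one
```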

URL: https://openreview.net/forum?id=bLrztOvp8i

---

Title: Aligning Self-Interested Agents with Welfare Maximization in Non-cooperative Equilibrium via Mechanism Design

Abstract: Self-interested multi-agent reinforcement learning has attracted growing attention for its applicability in real-world scenarios. In such settings, social dilemmas often arise, where agents prioritize individual gains over social welfare. Therefore, addressing these social dilemmas is critical for improving social welfare. However, prior work has notable limitations: (1) opponent modeling and incentive design approaches rely heavily on access to other agents' internal parameters and detailed information; as the number of agents increases or access to such information becomes limited, accurately modeling others' impact becomes difficult, leading to degraded performance; (2) centralized training is often ineffective, as relying on a single global training signal fails to capture the heterogeneous objectives and behaviors of self-interested agents, limiting effective individual policy learning. To overcome these limitations, we propose a mechanism design approach that leverages centralized information rather than centralized learning, without requiring access to other agents' internal parameters. This mechanism dynamically reshapes each agent's reward to align individual incentives with social welfare. Building on this mechanism, we develop a value iteration algorithm that integrates a counterfactual critic and a maximized return predictor, further improving learning effectiveness. Extensive experiments on social dilemma environments demonstrate that our method achieves higher social welfare compared with existing baselines.

URL: https://openreview.net/forum?id=vgKRAIAnfQ

---

Title: Graph reinforcement learning resistant to sparsity scaling

Abstract: We investigate the impact of graph sparsity on the NP-hard combinatorial optimization problem of delivery route optimization by combining proximal policy optimization with graph convolutional neural networks. Sparsity poses a critical challenge for consistent routing policies, as limited connectivity can significantly impact solution strategies by changing the intrinsic structure of possible paths. To address this challenge, we relate robustness in different graph sparsity regimes to learning dynamics in a sequential decision-making, graph-structured environment. Our experiments systematically consider graphs with up to 20 nodes, demonstrating the robustness of the algorithm to various degrees of graph sparsity, from low-degree nodes that impose inefficient paths to highly connected structures that require extensive exploration. To ensure consistent evaluation across different graph topologies, we normalize the return function by the length of the DRL episode and the number of nodes in the graph (see the sketch below). The algorithm is evaluated on metrics such as cumulative reward, path length, and the number of steps required to complete a DRL episode. Learning maintains stable performance across a wide range of graph densities, demonstrating its effectiveness. By systematically characterizing the role of sparsity in graph-based reinforcement learning for route optimization, this work provides insights into the challenges posed by real-world transportation logistics networks.
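
The normalization described above presumably scales the cumulative reward by episode length and graph size; a one-function sketch under that assumption (the exact formula is the authors'):

```python
def normalized_return(rewards, n_nodes):
    # Scale cumulative reward so runs on graphs of different sizes and
    # episodes of different lengths are comparable.
    episode_len = len(rewards)
    return sum(rewards) / (episode_len * n_nodes)
```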

URL: https://openreview.net/forum?id=gUxOupQu7H

---

Title: Pairwise Matching of Intermediate Representations for Fine-grained Explainability

Abstract: The differences between images belonging to fine-grained categories are often subtle and highly localized, and existing explainability techniques for deep learning models are often too diffuse to provide useful and interpretable explanations. We propose a new explainability method (PAIR-X) that leverages both intermediate model activations and backpropagated relevance scores to generate fine-grained, highly-localized pairwise visual explanations. We use animal and building re-identification (re-ID) as a primary case study of our method, and we demonstrate qualitatively improved results over a diverse set of explainability baselines on 35 public re-ID datasets. In interviews, animal re-ID experts were in unanimous agreement that PAIR-X was an improvement over existing baselines for deep model explainability, and suggested that its visualizations would be directly applicable to their work. We also propose a novel quantitative evaluation metric for our method, and demonstrate that PAIR-X visualizations appear more plausible for correct image matches than incorrect ones even when the model similarity score for the pairs is the same. By improving interpretability, PAIR-X enables humans to better distinguish correct and incorrect matches. Our code is available at: https://github.com/pairx-explains/pairx

URL: https://openreview.net/forum?id=Fv6ZD4P34v

---

Title: Bayesian Spectral Clustering

Abstract: We introduce Bayesian Spectral Clustering (BSC), a probabilistic reformulation of spectral clustering. Classical spectral clustering relies on a hand-crafted affinity graph (e.g., Gaussian kernel with $k$-NN sparsification) that is then treated as fixed, and recent improvements typically optimize that graph jointly with the clustering objective. However, these approaches still output a single graph and a single hard partition, providing neither principled quantification of uncertainty nor a guaranteed notion of when the learned affinities are reliable. BSC addresses this by treating the affinity matrix $W$ itself as a latent variable with sparsity- and locality-promoting priors, linking $W$ to the observed data through a Laplacian-smoothness likelihood, and performing variational inference to obtain a joint posterior over $W$ and the cluster assignments. We prove that (i) the standard Gaussian affinity emerges as the maximum a posteriori edge weight, giving a probabilistic justification for the classical kernel, and (ii) the posterior-mean graph is the unique minimizer of a strictly convex objective and is automatically sparse. Empirically, BSC attains state-of-the-art clustering quality while producing calibrated per-sample assignment confidence.

URL: https://openreview.net/forum?id=VKlclfQsAa

---

Title: LLM2Prune: Using LLMs as Domain Experts for Search Space Reduction

Abstract: Combinatorial optimization problems defined over graphs involve large discrete search spaces where many candidates contribute little due to redundancy or low value. Pruning the ground set to a smaller pool of promising candidates makes heuristics and exact solvers practical for large real-world instances. Classical submodularity-based pruning algorithms do not scale efficiently, while learning-based approaches depend on handcrafted features that require domain expertise and limit generalization. We propose LLM2Prune, a framework that uses large language models (LLMs) to generate features from a task description, which are then used by a downstream classifier to prune the search space. We guide the feature discovery process with feature-importance scores and performance metrics. Across diverse graph optimization tasks, LLM2Prune prunes over 90% of the ground set while retaining near-optimal solutions, achieving orders-of-magnitude speedups over existing approaches. Code, data, and pre-trained models are available at: https://anonymous.4open.science/r/LLM_Pruning-798E.

URL: https://openreview.net/forum?id=i56Pxr3btq

---

Title: Convex Optimization with Local Label Differential Privacy: Tight Bounds in All Privacy Regimes

Abstract: We study the problem of Stochastic Convex Optimization (SCO) under the constraint of local Label Differential Privacy (L-LDP). In this setting, the features are considered public, but the corresponding labels are sensitive and must be randomized by each user locally before being sent to an untrusted analyzer. Prior work for SCO under L-LDP (Ghazi et al., 2021) established an excess population risk bound with a linear dependence on the size of the label space, $K$: $O\left(\frac{K}{\epsilon\sqrt{n}}\right)$ in the high-privacy regime ($\epsilon \leq 1$) and $O\left(\frac{K}{e^{\epsilon} \sqrt{n}}\right)$ in the medium-privacy regime ($1 \leq \epsilon \leq \ln K$). This left open whether this linear cost is fundamental to the L-LDP model. In this note, we resolve this question. First, we present a novel and efficient non-interactive L-LDP algorithm that achieves an excess risk of $O\left(\sqrt{\frac{K}{\epsilon n}}\right)$ in the high-privacy regime ($\epsilon \leq 1$) and $O\left(\sqrt{\frac{K}{e^{\epsilon} n}}\right)$ in the medium-privacy regime ($1 \leq \epsilon \leq \ln K$). This quadratically improves the dependency on the label space size from $O(K)$ to $O(\sqrt{K})$. Second, we prove a matching information-theoretic lower bound across all privacy regimes for any sufficiently large $n$.
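
As background on the model only (not the paper's improved algorithm), the canonical epsilon-LDP mechanism for a K-valued label is k-ary randomized response:

```python
import numpy as np

def k_rr(label, K, eps, rng):
    """k-ary randomized response: keep the true label with probability
    e^eps / (e^eps + K - 1), otherwise report one of the K - 1 other
    labels uniformly. This satisfies eps-local DP for the label."""
    p_true = np.exp(eps) / (np.exp(eps) + K - 1)
    if rng.random() < p_true:
        return label
    others = [k for k in range(K) if k != label]
    return int(rng.choice(others))

rng = np.random.default_rng(0)
noisy = [k_rr(3, K=10, eps=1.0, rng=rng) for _ in range(5)]
```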

URL: https://openreview.net/forum?id=8Sjm0FrV2u

---

Title: Privacy-Aware Visual Language Models

Abstract: As Visual Language Models (VLMs) become increasingly embedded in everyday applications, ensuring that they can recognise and appropriately handle privacy-sensitive content is essential to protect users. To this end, we conduct a comprehensive evaluation of ten state-of-the-art VLMs and identify limitations in their understanding of visual privacy. However, existing privacy-related datasets often suffer from label inconsistencies, limiting their reliability. To address this, we introduce two compact, high-quality benchmarks, PrivBench and PrivBench-H, that focus on commonly recognised visual privacy categories aligned with the General Data Protection Regulation (GDPR). Additionally, we present PrivTune, an instruction-tuning dataset specifically curated to improve privacy sensitivity. We obtain a Privacy VLM by fine-tuning an off-the-shelf VLM on only 100 samples from PrivTune, which leads to substantial gains on all benchmarks, surpassing even GPT-4, while maintaining strong performance on other tasks. Our findings show that privacy awareness in VLMs can be substantially improved with minimal data and careful dataset design, setting the stage for safer, more privacy-aligned AI systems.

URL: https://openreview.net/forum?id=ntLnq05sBA

---

Title: Relative Translation Invariant Wasserstein Distance

Abstract: Motivated by the Bures-Wasserstein distance, we introduce a new family of relative translation invariant Wasserstein distances, denoted $RW_p$, as an extension of the classical Wasserstein distances $W_p$ for $p \in [1, +\infty)$. We establish that $RW_p$ defines a valid metric and demonstrate that this type of metric is more robust to perturbation than the classical Wasserstein distances. A bi-level algorithm is designed to compute the general $RW_p$ distances between arbitrary discrete distributions. Additionally, when $p = 2$, we show that the optimal coupling solutions are invariant under distributional translation in discrete settings, and we further propose two algorithms, the $RW_2$-Sinkhorn algorithm and the $RW_2$-LP algorithm, to improve the numerical stability of computing $W_2$ distances and the optimal coupling solutions. Finally, we conduct three experiments to validate our theoretical results and algorithms. The first two experiments show that the $RW_2$-Sinkhorn and $RW_2$-LP algorithms significantly reduce numerical errors compared to standard algorithms. The third experiment shows that $RW_p$ algorithms are computationally scalable and applicable to the retrieval of similar thunderstorm patterns in practical applications.

URL: https://openreview.net/forum?id=NfhVTi2G4a

---
