Accepted papers
===============
Title: VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Authors: Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Haitao Mi, Dong Yu
Abstract: Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4\% of the original performance. Code will be made publicly available upon acceptance.
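A minimal sketch of attention-score-based visual token pruning, the general mechanism this family of methods builds on (the scoring rule, keep ratio, and shapes are illustrative assumptions, not VScan's released algorithm):

    # Keep the visual tokens that receive the highest importance scores
    # (e.g. text-to-image attention at an intermediate LLM layer).
    import torch

    def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
        """tokens: (B, N, D); scores: (B, N) per-token importance."""
        n_keep = max(1, int(tokens.shape[1] * keep_ratio))
        idx = scores.topk(n_keep, dim=1).indices.sort(dim=1).values  # preserve order
        return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

    tokens = torch.randn(2, 576, 1024)   # e.g. 24x24 patch tokens from the encoder
    scores = torch.rand(2, 576)          # hypothetical importance scores
    print(prune_visual_tokens(tokens, scores).shape)  # torch.Size([2, 144, 1024])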
URL: https://openreview.net/forum?id=KZYhyilFnt
---
Title: Towards Fast Safe Online Reinforcement Learning via Policy Finetuning
Authors: Keru Chen, Honghao Wei, Zhigang Deng, Sen Lin
Abstract: High costs and risks involved in extensive environmental interactions hinder the practical application of current online safe reinforcement learning (RL) methods. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online learning, a direction that has yet to be fully investigated. To fill this gap, we first show that naively applying existing O2O algorithms from standard RL would not work well in safe RL due to two unique challenges: \emph{erroneous Q-estimations}, resulting from offline-online objective mismatch and offline cost sparsity, and \emph{Lagrangian mismatch}, resulting from difficulties in aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce \textbf{Marvel}, the first policy-finetuning based framework for O2O safe RL, comprising two key components that work in concert: \emph{Value Pre-Alignment} to align the learned Q-functions with the online objective before finetuning, and \emph{Adaptive PID Control} to effectively adjust the Lagrange multipliers during finetuning. Extensive experiments demonstrate the superior performance of Marvel over related baselines.
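A minimal sketch of PID control of a Lagrange multiplier in constrained RL, the mechanism that Marvel's Adaptive PID Control component builds on (gains, clipping, and the update rule are illustrative assumptions, not the paper's exact scheme):

    # The multiplier rises when episode cost exceeds the limit and decays
    # otherwise; the policy loss would use it as: -reward_term + lam * cost_term.
    class PIDLagrangian:
        def __init__(self, kp=0.05, ki=0.005, kd=0.1, cost_limit=25.0):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.cost_limit = cost_limit
            self.integral = 0.0
            self.prev_cost = 0.0
            self.lam = 0.0

        def update(self, episode_cost):
            error = episode_cost - self.cost_limit           # > 0: constraint violated
            self.integral = max(0.0, self.integral + error)  # anti-windup at zero
            derivative = max(0.0, episode_cost - self.prev_cost)
            self.prev_cost = episode_cost
            self.lam = max(0.0, self.kp * error + self.ki * self.integral
                                + self.kd * derivative)
            return self.lam

    pid = PIDLagrangian()
    for cost in [40.0, 35.0, 28.0, 24.0]:                    # successive rollouts
        print(f"cost={cost:5.1f}  lambda={pid.update(cost):.3f}")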
URL: https://openreview.net/forum?id=1SO7vmLFUq
---
Title: Game-Theoretic Defenses for Adversarially Robust Conformal Prediction
Authors: Rui Luo, Jie Bao, Suqun Cao, Chuangyin Dang, Zhixin Zhou
Abstract: Adversarial attacks pose major challenges to the reliability of deep learning models in safety-critical domains such as medical imaging and autonomous driving. In such high-stakes applications, providing reliable uncertainty quantification alongside adversarial robustness becomes crucial for safe deployment. Although conformal prediction can provide certain guarantees for model performance under such conditions, unknown attacks may violate the exchangeability assumption, resulting in the loss of coverage guarantees or excessively large predictive uncertainty. To address this, we propose a synergistic framework that integrates conformal prediction with game-theoretic defense strategies by modeling the adversarial interaction as a discrete, zero-sum game between attacker and defender. Our framework yields a Nash Equilibrium defense strategy, which we prove maintains valid coverage while minimizing the worst-case prediction set size against an optimal adversary operating within the defined attack space. Experimental results on CIFAR-10, CIFAR-100, and ImageNet further demonstrate that, under Nash equilibrium, defense models within our framework achieve valid coverage and minimal prediction set size. By bridging adversarial robustness and uncertainty quantification from a game-theoretic perspective, this work provides a verifiable defense paradigm for deploying safety-critical deep learning systems, particularly when adversarial distributions are unknown or dynamically evolving but contained within a known attack space.
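A minimal sketch of how the defender's Nash strategy of a finite zero-sum game can be computed with linear programming (the payoff matrix is a toy stand-in for "expected prediction-set size under attack j and defense i"; the paper's game construction is richer):

    import numpy as np
    from scipy.optimize import linprog

    def minimax_strategy(A):
        """Rows = defenses, cols = attacks; defender minimizes the payoff A."""
        m, n = A.shape
        # Variables [x_1..x_m, v]: minimize v s.t. A^T x <= v, sum(x) = 1, x >= 0.
        c = np.concatenate([np.zeros(m), [1.0]])
        res = linprog(c,
                      A_ub=np.hstack([A.T, -np.ones((n, 1))]), b_ub=np.zeros(n),
                      A_eq=np.concatenate([np.ones(m), [0.0]])[None, :], b_eq=[1.0],
                      bounds=[(0, None)] * m + [(None, None)])
        return res.x[:m], res.x[-1]       # mixed strategy and game value

    A = np.array([[1.8, 3.0],             # set sizes of defense 1 under attacks 1, 2
                  [2.5, 2.0]])            # and of defense 2
    x, v = minimax_strategy(A)
    print(x.round(3), round(v, 3))        # ~[0.294, 0.706], worst-case size ~2.294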
URL: https://openreview.net/forum?id=SjsVobIlwL
---
New submissions
===============
Title: Neural Logic Networks for Interpretable Classification
Abstract: Traditional neural networks have impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks, on the other hand, have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data, and we develop a rigorous logical and probabilistic model in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model, as well as a modified learning algorithm. Our method improves the state-of-the-art in Boolean network discovery and is able to learn relevant, interpretable rules in tabular classification, notably on examples from the medical and industrial fields where interpretability has tangible value.
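A minimal sketch of the differentiable fuzzy logic gates such networks learn, using the product t-norm (the paper's parameterization, biases for unobserved data, and factorized IF-THEN structure are richer than this):

    import torch

    def soft_not(x, negate):              # negate in [0,1]: 1 -> NOT x, 0 -> x
        return negate * (1 - x) + (1 - negate) * x

    def soft_and(x, member):              # member in [0,1]: which inputs join the AND
        return torch.prod(1 - member * (1 - x), dim=-1)

    def soft_or(x, member):               # OR via De Morgan: 1 - AND(NOT x)
        return 1 - torch.prod(1 - member * x, dim=-1)

    x = torch.tensor([0.9, 0.1, 0.8])       # truth degrees of three input concepts
    negate = torch.tensor([0.0, 1.0, 0.0])  # learned: negate the second input
    member = torch.tensor([1.0, 1.0, 0.0])  # learned: the rule uses inputs 1 and 2
    print(float(soft_and(soft_not(x, negate), member)))  # ~0.81: "x1 AND NOT x2"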
URL: https://openreview.net/forum?id=CmJXxeimUk
---
Title: D-Garment: Physically Grounded Latent Diffusion for Dynamic Garment Deformations
Abstract: We present a method to dynamically deform 3D garments, in the form of a 3D polygon mesh, based on body shape, motion, and physical cloth material properties. Taking physical cloth properties into account yields a physically grounded model, with the advantage of being more accurate in terms of physically inspired metrics such as strain or curvature.
Existing work studies pose-dependent garment modeling to generate garment deformations from example data, as well as data-driven dynamic cloth simulation to generate realistic garments in motion. We propose *D-Garment*, a learning-based approach trained on new data generated with a physics-based simulator. Compared to prior work, our 3D generative model learns garment deformations conditioned on physical material properties, which makes it possible to model loose cloth geometry, especially for large deformations and dynamic wrinkles driven by body motion. Furthermore, the model can be efficiently fitted to observations captured with vision sensors, such as 3D point clouds. We leverage the capability of diffusion models to learn flexible and powerful generative priors by modeling the 3D garment in a 2D parameter space and learning a latent diffusion model in this representation, independently of the mesh resolution. This makes it possible to condition global and local geometry on body and cloth material information.
We quantitatively and qualitatively evaluate *D-Garment* on both simulations and data captured with a multi-view acquisition platform. Compared to recent baselines, our method is more realistic and accurate in terms of shape similarity and physical validity metrics. Code and data will be shared upon acceptance.
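A minimal sketch of one conditional denoising training step, the core mechanism of a latent diffusion model like the one described (the network, latent size, and conditioning vector are placeholders; the paper's 2D garment parameterization is not reproduced here):

    import torch
    import torch.nn as nn

    latent_dim, cond_dim, T = 64, 8, 1000
    denoiser = nn.Sequential(nn.Linear(latent_dim + cond_dim + 1, 256),
                             nn.SiLU(), nn.Linear(256, latent_dim))
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1 - betas, dim=0)

    def training_step(z0, cond):
        """z0: clean garment latent (B, latent_dim); cond: body/material code."""
        t = torch.randint(0, T, (z0.shape[0],))
        eps = torch.randn_like(z0)
        ab = alphas_bar[t].unsqueeze(-1)
        zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps        # forward noising
        inp = torch.cat([zt, cond, t.float().unsqueeze(-1) / T], dim=-1)
        return nn.functional.mse_loss(denoiser(inp), eps)  # predict the noise

    loss = training_step(torch.randn(16, latent_dim), torch.randn(16, cond_dim))
    loss.backward()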
URL: https://openreview.net/forum?id=NrPyio1aUK
---
Title: Temporal Energy Transformer for Long Range Propagation in Continuous Time Dynamic Graphs
Abstract: Representation learning on temporal graphs is crucial for understanding dynamically varying real-world systems such as social media platforms, financial transactions, transportation networks, and communication systems. Existing self-attention based models encounter limitations in capturing long-range dependencies and lack clear theoretical foundations. Energy-based models offer a promising alternative, with a well-established theoretical foundation that avoids reliance on pseudo-losses. However, their application in this domain remains largely unexplored, primarily due to the challenge of designing energy functionals. In this work, we introduce the Temporal Energy Transformer (TET), a novel energy-based architecture that integrates with the Temporal Graph Network (TGN) framework. Our approach centres on a novel energy-based graph propagation module that leverages a specially designed energy functional to capture and preserve spatio-temporal information. This is achieved by modelling the temporal dynamics of irregular data streams with a continuous-time differential equation. Our TET layer employs a series of temporal energy attention layers together with a dense associative memory model (a modern Hopfield network). This design demonstrably minimizes the tailored energy functional, enabling efficient retention of historical context while assimilating incoming data. The efficacy of the model is comprehensively validated across a diverse range of temporal graph datasets, including those with long-range dependencies, demonstrating superior performance in both transductive and inductive scenarios for dynamic link prediction.
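A minimal sketch of the dense-associative-memory (modern Hopfield) retrieval update such a layer builds on; each iteration provably decreases the associated energy (shapes and the inverse temperature are illustrative):

    import torch

    def hopfield_retrieve(query, memories, beta=4.0, steps=3):
        """query: (D,); memories: (N, D). Iterate xi <- M^T softmax(beta * M xi)."""
        xi = query
        for _ in range(steps):
            xi = memories.T @ torch.softmax(beta * memories @ xi, dim=0)
        return xi

    memories = torch.randn(128, 32)                # stored temporal patterns
    noisy = memories[7] + 0.3 * torch.randn(32)
    restored = hopfield_retrieve(noisy, memories)
    print(float(torch.cosine_similarity(restored, memories[7], dim=0)))  # ~1.0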
URL: https://openreview.net/forum?id=zg3bi0GRJk
---
Title: Rethinking Smoothness in Node Features Learned by Graph Convolutional Networks
Abstract: The pioneering works of Oono and Suzuki (ICLR 2020) and Cai and Wang (arXiv:2006.13318) initiated the analysis of feature smoothness in graph convolutional networks (GCNs), uncovering a strong empirical connection between node classification accuracy and the ratio of smooth to non-smooth feature components. However, it remains unclear how to effectively control this ratio in learned node features to enhance classification performance. Furthermore, deep GCNs with ReLU or leaky ReLU activations tend to suppress non-smooth feature components. In this paper, we introduce a novel strategy to enable GCNs to learn node features with {\bf controllable smoothness}, thereby improving node classification accuracy. Our method comprises three core components: (1) deriving a geometric relationship between the inputs and outputs of ReLU and leaky ReLU activations; (2) augmenting the standard message-passing mechanism in graph convolutional layers with a learnable term for efficient smoothness modulation; and (3) theoretically analyzing the attainable smooth-to-non-smooth ratios under the proposed augmented propagation. Extensive experiments demonstrate that our approach substantially enhances node classification performance across GCNs and related architectures.
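A hypothetical sketch of the general idea of a learnable term that modulates smoothness: mixing the low-pass GCN aggregation with its high-pass residual through a learned gate (our illustration of the concept, not the paper's augmented propagation):

    import torch
    import torch.nn as nn

    class SmoothnessGatedGCNLayer(nn.Module):
        def __init__(self, d_in, d_out):
            super().__init__()
            self.lin = nn.Linear(d_in, d_out)
            self.gamma = nn.Parameter(torch.zeros(1))   # learnable smoothness gate

        def forward(self, A_hat, H):
            """A_hat: normalized adjacency (N, N); H: node features (N, d_in)."""
            low = A_hat @ H                 # smooth (low-pass) component
            high = H - low                  # non-smooth (high-pass) component
            return torch.relu(self.lin(low + torch.tanh(self.gamma) * high))

    A_hat = torch.softmax(torch.randn(5, 5), dim=1)  # stand-in for D^-1/2(A+I)D^-1/2
    layer = SmoothnessGatedGCNLayer(16, 16)
    print(layer(A_hat, torch.randn(5, 16)).shape)    # torch.Size([5, 16])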
URL: https://openreview.net/forum?id=XFOcJJgmdc
---
Title: Causally-Aware Information Bottleneck for Domain Adaptation
Abstract: We address a common domain-adaptation setting in causal systems. In this setting, the target variable is observed in the source domain but is entirely missing in the target domain. We aim to impute the target variable in the target domain from the remaining observed variables under various shifts. We frame this as learning a compact, mechanism-stable representation. This representation preserves information relevant for predicting the target while discarding spurious variation. For linear Gaussian causal models, we derive a closed-form Gaussian Information Bottleneck (GIB) solution. This solution reduces to a canonical correlation analysis (CCA)–style projection and offers Directed Acyclic Graph (DAG)-aware options when desired. For nonlinear or non-Gaussian data, we introduce a Variational Information Bottleneck (VIB) encoder–predictor. This approach scales to high dimensions and can be trained on source data and deployed zero-shot to the target domain. Across synthetic and real datasets, our approach consistently attains accurate imputations, supporting practical use in high-dimensional causal models and furnishing a unified, lightweight toolkit for causal domain adaptation.
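A minimal sketch of the closed-form Gaussian IB directions under the linear-Gaussian assumption: eigenvectors of Sigma_x^{-1} Sigma_{x|y} with the smallest eigenvalues, which coincide with CCA directions (the row scaling prescribed by the full solution as a function of the IB tradeoff parameter is omitted):

    import numpy as np

    def gib_directions(X, Y, k):
        """X: (n, dx), Y: (n, dy); returns k unscaled GIB/CCA projection rows."""
        X = X - X.mean(0); Y = Y - Y.mean(0)
        n = len(X)
        Sxx, Sxy, Syy = X.T @ X / n, X.T @ Y / n, Y.T @ Y / n
        Sx_given_y = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)  # residual covariance
        evals, evecs = np.linalg.eig(np.linalg.solve(Sxx, Sx_given_y))
        order = np.argsort(evals.real)       # small eigenvalue = most predictive
        return evecs.real[:, order[:k]].T

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(500, 2))            # shared latent cause
    X = Z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
    Y = Z @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(500, 3))
    print(gib_directions(X, Y, k=2).shape)   # (2, 6)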
URL: https://openreview.net/forum?id=TbcqPEgJ9z
---
Title: CURS: An exact method for sampling on Riemannian manifolds
Abstract: The present work introduces curvature-based rejection sampling (CURS), a method for sampling from a general class of probability densities defined on Riemannian manifolds: it can be used to sample from any probability density which ``depends only on distance''. The idea is to combine the statistical principle of rejection sampling with the geometric principle of volume comparison. CURS is an exact sampling method and, assuming the underlying Riemannian manifold satisfies certain technical conditions, it has a modest computational cost. The aim of the present work is to show that there are many applications where CURS should be the user's method of choice in relatively low-dimensional scenarios.
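An illustrative sketch of the rejection-sampling principle on a concrete manifold, the unit sphere S^2, where a distance-only density has radial law p(r) sin(r) (the paper's envelope comes from curvature-based volume comparison; a simple grid bound stands in for it here):

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_radius(p, r_max=np.pi, n=10_000):
        """Exact sampling of r with density proportional to p(r) * sin(r)."""
        grid = np.linspace(1e-6, r_max, 512)
        M = 1.05 * np.max(p(grid) * np.sin(grid))   # envelope (grid approximation)
        out = []
        while len(out) < n:
            r = rng.uniform(0, r_max, size=n)
            u = rng.uniform(0, M, size=n)
            out.extend(r[u < p(r) * np.sin(r)])     # accept points below the target
        return np.array(out[:n])

    # A Gaussian-like density in the geodesic distance from a pole:
    samples = sample_radius(lambda r: np.exp(-r**2 / (2 * 0.3**2)))
    print(samples.mean(), samples.std())            # concentrated near the pole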
URL: https://openreview.net/forum?id=LY9ecALVDm
---
Title: Insights From a Data- and Space-Agnostic Approach to Zero-Cost Proxies
Abstract: Zero-cost proxies (ZCPs) have enabled low-cost Neural Architecture Search (NAS) by removing the computational overhead of model training. However, important drawbacks of currently designed ZCPs remain unaddressed. While there is a strong correlation between ZCPs and model performance at the scale of entire search spaces, this does not necessarily translate to guiding the search to top-performing architectures. In this paper, we conduct extensive benchmarking of state-of-the-art proxies in the NAS-Bench-Suite-Zero setting and observe that the correlation decreases dramatically when restricting the space to the best architectures, demonstrating the presence of a top-rank gap. Moreover, embedded priors on search space and data make ZCPs unreliable across diverse tasks. We leverage adaptive parameter-distribution statistics as a discriminator metric in a genetic framework and introduce ParaDis, a low-cost NAS algorithm that remains orthogonal to ZCP design, with the potential to define a fully data- and space-agnostic search when paired with the right metric. Experiments on multiple benchmarks confirm that ParaDis reduces the top-rank gap across diverse tasks and remains competitive against methods with heavier priors.
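A hypothetical sketch of a proxy-guided genetic search loop of the kind ParaDis instantiates (the scoring function is a toy placeholder, not the paper's adaptive parameter-distribution statistic, and architectures are toy bit-strings):

    import random

    random.seed(0)

    def proxy_score(arch):
        # Placeholder metric; ParaDis would score parameter-distribution
        # statistics of the candidate network here.
        return sum(arch) - 0.5 * abs(sum(arch[:4]) - sum(arch[4:]))

    def mutate(arch, p=0.2):
        return [1 - g if random.random() < p else g for g in arch]

    population = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
    for generation in range(30):
        population.sort(key=proxy_score, reverse=True)
        parents = population[:5]                    # truncation selection
        population = parents + [mutate(random.choice(parents)) for _ in range(15)]
    best = max(population, key=proxy_score)
    print(best, proxy_score(best))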
URL: https://openreview.net/forum?id=sVWJczov4Q
---
Title: ARGen-Dexion: Autoregressive Image Generation Made Stronger by Vision Decoder
Abstract: Autoregressive models (ARGen) have emerged as a cornerstone for image generation within multimodal large language models (MLLMs), yet their visual outputs remain stubbornly underwhelming. Traditional efforts, such as scaling AR models or re-engineering architectures, yield diminishing returns at exorbitant cost, straining infrastructure without resolving core limitations. In this work, we challenge the status quo, asserting that vision decoders must shoulder greater responsibility for image synthesis, liberating autoregressive models from undue burden. We present ARGen-Dexion, a systematic overhaul of the vision decoder that redefines autoregressive image generation without modifying pre-trained AR models or visual encoders. Our approach delivers transformative gains through three innovations: (1) a scaled, fine-tuned decoder achieving unprecedented reconstruction fidelity, (2) a bi-directional Transformer-based token refiner that infuses global context to refine the AR model's outputs, shattering the inherent constraints of causal inference, and (3) a resolution-aware training strategy enabling seamless multi-resolution and multi-aspect-ratio synthesis. Extensive scaling studies unveil deep insights into decoder design, challenging long-held assumptions. Empirically, ARGen-Dexion boosts LlamaGen by a striking 9\% in VQAScore on the GenAI-Benchmark and 4\% in GenEval performance. Moreover, it can be applied to various discrete MLLMs. This work compels a bold rethinking of the interplay between MLLMs and vision decoders, paving the way for efficient and visually superior multimodal systems.
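A minimal sketch of the token-refiner idea: a non-causal Transformer encoder re-reads the complete AR token sequence so every position sees global context before the vision decoder renders pixels (sizes and wiring are our assumptions, not the paper's architecture):

    import torch
    import torch.nn as nn

    class TokenRefiner(nn.Module):
        def __init__(self, vocab=16384, dim=512, layers=4):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(block, num_layers=layers)
            self.out = nn.Linear(dim, dim)    # features fed to the vision decoder

        def forward(self, token_ids):
            h = self.embed(token_ids)         # (B, N, dim)
            return self.out(self.encoder(h))  # full bidirectional attention

    refiner = TokenRefiner()
    ar_tokens = torch.randint(0, 16384, (2, 256))  # e.g. a 16x16 grid of AR tokens
    print(refiner(ar_tokens).shape)                # torch.Size([2, 256, 512])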
URL: https://openreview.net/forum?id=SEvoIzwBGb
---
Title: Variance Matters: Improving Domain Adaptation via Stratified Sampling
Abstract: Domain shift remains a key challenge in deploying machine learning models to the real world. Unsupervised domain adaptation (UDA) aims to address this by minimising domain discrepancy during training, but the discrepancy estimates suffer from high variance in stochastic settings, which can stifle the theoretical benefits of the method. This paper proposes Variance-Reduced Domain Adaptation via Stratified Sampling (VaRDASS), the first specialised stochastic variance reduction technique for UDA. We consider two specific discrepancy measures -- correlation alignment and the maximum mean discrepancy (MMD) -- and derive ad hoc stratification objectives for these terms. We then present expected and worst-case error bounds, and prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions. Finally, a practical k-means style optimisation algorithm is introduced and analysed. Experiments on three domain shift datasets demonstrate improved discrepancy estimation accuracy and target domain performance.
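A minimal sketch of the variance-reduction principle at work: a stratified mean estimate with proportional allocation versus plain Monte Carlo (the strata here are quantile bins of a toy statistic, not the paper's derived strata for MMD or correlation alignment):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.lognormal(size=100_000)                  # heavy-tailed population
    strata = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75, 0.95]))

    def plain_estimate(n=200):
        return rng.choice(x, n).mean()

    def stratified_estimate(n=200):
        est = 0.0
        for s in np.unique(strata):
            xs = x[strata == s]
            w = len(xs) / len(x)                     # stratum weight
            est += w * rng.choice(xs, max(1, round(n * w))).mean()
        return est

    plain = [plain_estimate() for _ in range(2_000)]
    strat = [stratified_estimate() for _ in range(2_000)]
    print(np.std(plain), np.std(strat))              # stratified std is far smaller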
URL: https://openreview.net/forum?id=MVwgedTIUs
---
Title: Supervised Quadratic Feature Analysis: Information Geometry for Dimensionality Reduction
Abstract: Supervised dimensionality reduction maps labeled data into a low-dimensional feature space while preserving class discriminability. A common approach is to maximize a statistical measure of dissimilarity between classes in the feature space. Information geometry provides an alternative framework for measuring class dissimilarity, with the potential for improved insights and novel applications. This framework, grounded in Riemannian geometry, uses the Fisher information metric, a local measure of discriminability that induces the Fisher-Rao distance. Here, we present Supervised Quadratic Feature Analysis (SQFA), a linear dimensionality reduction method that maximizes Fisher-Rao distances between class-conditional distributions, under Gaussian assumptions. We motivate the Fisher-Rao distance as a good proxy for discriminability. We show that SQFA features support good classification performance with Quadratic Discriminant Analysis (QDA) on three real-world datasets. SQFA provides a novel framework for supervised dimensionality reduction, motivating future research in applying information geometry to machine learning and neuroscience.
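A minimal sketch of the affine-invariant distance between class covariance matrices, which equals the Fisher-Rao distance between zero-mean Gaussians up to a constant factor and illustrates the kind of class dissimilarity SQFA maximizes (toy data; the full method also accounts for class means):

    import numpy as np
    from scipy.linalg import fractional_matrix_power, logm

    def airm_distance(S1, S2):
        """d(S1, S2) = || logm(S1^{-1/2} S2 S1^{-1/2}) ||_F."""
        S1_inv_sqrt = fractional_matrix_power(S1, -0.5)
        return np.linalg.norm(logm(S1_inv_sqrt @ S2 @ S1_inv_sqrt))

    rng = np.random.default_rng(0)
    A = rng.normal(size=(200, 3))                       # class 1 features
    B = rng.normal(size=(200, 3)) @ np.diag([1, 2, 4])  # class 2: stretched
    print(airm_distance(np.cov(A.T), np.cov(B.T)))      # grows with discriminability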
URL: https://openreview.net/forum?id=jwNJiLphnZ
---
Title: EgoPlan: Towards Effective Embodied Agents via Egocentric Planning
Abstract: We explore leveraging large multi-modal models (LMMs) and text-to-image models to build a more general embodied agent. LMMs excel at planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images; a bridge is needed to connect LMMs to the physical world. This paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. The pipeline leverages a diffusion model to simulate the fundamental dynamics between states and actions, and integrates computer-vision techniques such as style transfer and optical flow to enhance its ability to model spatial states and to generalize across different environmental dynamics. The LMM serves as a planner, breaking down instructions into sub-goals and selecting actions based on their alignment with these sub-goals, thus enabling more generalized and effective decision-making. Using the LMM, we output text actions and employ mechanisms such as reflection to perform high-level task decomposition and low-level action output end-to-end. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.
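A hypothetical pseudo-structure of such a planning loop: the LMM decomposes the instruction into sub-goals, a diffusion world model imagines the outcome of each candidate action, and the LMM picks the best-aligned action (every callable below is a placeholder we introduce for illustration, not the paper's API):

    def egocentric_plan(instruction, obs, lmm, world_model, env, max_steps=50):
        subgoals = lmm.decompose(instruction)       # high-level task decomposition
        for subgoal in subgoals:
            for _ in range(max_steps):
                candidates = lmm.propose_actions(obs, subgoal)   # text actions
                imagined = {a: world_model.rollout(obs, a) for a in candidates}
                action = lmm.select(subgoal, imagined)  # alignment with the sub-goal
                obs, done = env.step(action)
                if done or lmm.subgoal_reached(obs, subgoal):
                    break
        return obs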
URL: https://openreview.net/forum?id=KPKqTH0LTi
---
Title: InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation
Abstract: Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using our pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.
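A minimal sketch of the inverted data-generation idea: sample a subgraph from a knowledge base, filter it, and have an LLM verbalize it, yielding (text, KG) pairs for supervised fine-tuning (the `llm` callable and the toy triple store are placeholders; the paper's filtering is more elaborate):

    import random

    random.seed(0)
    KB = [("Marie Curie", "award", "Nobel Prize in Physics"),
          ("Marie Curie", "field", "radioactivity"),
          ("Pierre Curie", "spouse", "Marie Curie"),
          ("Nobel Prize in Physics", "country", "Sweden")]

    def sample_subgraph(kb, seed_entity, k=3):
        """Toy 1-hop expansion: up to k triples touching the seed entity."""
        related = [t for t in kb if seed_entity in (t[0], t[2])]
        return random.sample(related, min(k, len(related)))

    def make_training_pair(subgraph, llm):
        prompt = ("Write a short paragraph stating these facts:\n" +
                  "\n".join(f"({s}; {r}; {o})" for s, r, o in subgraph))
        text = llm(prompt)                           # KG -> text: the easy direction
        return {"input": text, "target": subgraph}   # SFT pair for text -> KG

    sub = sample_subgraph(KB, "Marie Curie")
    print(make_training_pair(sub, llm=lambda p: "[generated description]"))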
URL: https://openreview.net/forum?id=bvJlAodxEC
---