Expert Certification
====================
Title: Single-positive Multi-label Learning with Label Cardinality
Authors: Shayan Gharib, Pierre-Alexandre Murena, Arto Klami
URL: https://openreview.net/forum?id=XEPPXH2nKu
---
Accepted papers
===============
Title: Beyond Instance Consistency: Investigating View Diversity in Self-supervised Learning
Authors: Huaiyuan Qin, Muli Yang, Siyuan Hu, Peng Hu, Yu Zhang, Chen Gong, Hongyuan Zhu
Abstract: Self-supervised learning (SSL) conventionally relies on the instance consistency paradigm, assuming that different views of the same image can be treated as positive pairs. However, this assumption breaks down for non-iconic data, where different views may contain distinct objects or semantic information. In this paper, we investigate the effectiveness of SSL when instance consistency is not guaranteed. Through extensive ablation studies, we demonstrate that SSL can still learn meaningful representations even when positive pairs lack strict instance consistency. Furthermore, our analysis reveals that increasing view diversity, by enforcing zero overlap or using smaller crop scales, can enhance downstream performance on classification and dense prediction tasks. However, excessive diversity is found to reduce effectiveness, suggesting an optimal range for view diversity. To quantify this, we adopt the Earth Mover's Distance (EMD) as an estimator of the mutual information between views, finding that moderate EMD values correlate with improved SSL learning, providing insights for future SSL framework design. We validate our findings across a range of settings, highlighting their robustness and applicability to diverse data sources.
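To make the EMD estimator concrete, below is a minimal sketch of EMD between two views represented as equal-size sets of patch features, computed as an optimal assignment with uniform weights. The patch-feature framing and the Euclidean ground cost are illustrative assumptions, not necessarily the paper's exact estimator.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_between_views(feats_a, feats_b):
    """EMD between two equal-size point sets with uniform weights,
    which reduces to an optimal one-to-one assignment."""
    # Pairwise Euclidean cost between patch embeddings of the two views.
    cost = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
view1 = rng.normal(size=(49, 128))  # e.g., a 7x7 patch grid with 128-dim features
view2 = rng.normal(size=(49, 128))
print(emd_between_views(view1, view2))
```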
URL: https://openreview.net/forum?id=urWCU3YMA0
---
Title: Understanding Emergent In-Context Learning from a Kernel Regression Perspective
Authors: Chi Han, Ziqi Wang, Han Zhao, Heng Ji
Abstract: Large language models (LLMs) have initiated a paradigm shift in transfer learning. In contrast to the classic pretraining-then-finetuning procedure, to use LLMs for downstream prediction tasks one only needs to provide a few demonstrations, known as in-context examples, without adding new model parameters or updating existing ones. This in-context learning (ICL) capability of LLMs is intriguing, and it is not yet fully understood how pretrained LLMs acquire it. In this paper, we investigate why a transformer-based language model can accomplish in-context learning after pre-training on a general language corpus, proposing a kernel-regression perspective on LLMs' ICL behavior when faced with in-context examples. More concretely, we first prove that Bayesian inference on in-context prompts can be asymptotically understood as kernel regression $\hat y = \sum_i y_i K(x, x_i)/\sum_i K(x, x_i)$ as the number of in-context demonstrations grows. Then, we empirically investigate the in-context behaviors of language models. We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression. Finally, our theory provides insights into multiple phenomena observed in the ICL field: why retrieving demonstrative samples similar to test samples can help, why ICL performance is sensitive to the output formats, and why ICL accuracy benefits from selecting in-distribution and representative samples.
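The kernel-regression form in the abstract can be written out directly. The sketch below assumes a Gaussian kernel and toy numeric features for illustration; in the paper, the kernel emerges from the model's attention rather than being hand-chosen.

```python
import numpy as np

def kernel_regression(x_query, X_demos, y_demos, bandwidth=1.0):
    """Kernel-regression estimator from the abstract:
    y_hat = sum_i y_i K(x, x_i) / sum_i K(x, x_i)."""
    # Similarity between the query and each in-context demonstration.
    dists = np.linalg.norm(X_demos - x_query, axis=1)
    K = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))
    return np.dot(K, y_demos) / K.sum()

# Toy usage: 5 demonstrations with 3-dim features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
print(kernel_regression(X[0] + 0.1, X, y))
```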
URL: https://openreview.net/forum?id=6rD50Q6yYz
---
Title: Unveiling Multiple Descents in Unsupervised Autoencoders
Authors: Kobi Rahimi, Yehonathan Refael, Tom Tirer, Ofir Lindenbaum
Abstract: The phenomenon of double descent has challenged the traditional bias-variance trade-off in supervised learning but remains unexplored in unsupervised learning, with some studies arguing for its absence. In this study, we first demonstrate analytically that double descent does not occur in linear unsupervised autoencoders (AEs). In contrast, we show for the first time that both double and triple descent can be observed with nonlinear AEs across various data models and architectural designs. We examine the effects of partial sample and feature noise and highlight the critical role of bottleneck size in shaping the double descent curve. Through extensive experiments on both synthetic and real datasets, we uncover model-wise, epoch-wise, and sample-wise double descent across several data types and architectures. Our findings indicate that over-parameterized models not only improve reconstruction but also enhance performance in downstream tasks such as anomaly detection and domain adaptation, highlighting their practical value in complex real-world scenarios.
URL: https://openreview.net/forum?id=FqfHDs6unx
---
Title: Uncertainty Quantification in Retrieval Augmented Question Answering
Authors: Laura Perez-Beltrachini, Mirella Lapata
Abstract: Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful for answering correctly. In this work, we propose to quantify the uncertainty of a QA model by estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that, while simple information-theoretic metrics can predict answer correctness to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods.
URL: https://openreview.net/forum?id=JLkgI0h7wy
---
Title: Unifi3D: A Study on 3D Representations for Generation and Reconstruction in a Common Framework
Authors: Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, Kai Yuan
Abstract: Following rapid advancements in text and image generation, research has increasingly shifted towards 3D generation. Unlike the well-established pixel-based representation in images, 3D representations remain diverse and fragmented, encompassing a wide variety of approaches such as voxel grids, neural radiance fields, signed distance functions, point clouds, or octrees, each offering distinct advantages and limitations.
In this work, we present a unified evaluation framework designed to assess the performance of 3D representations in reconstruction and generation. We compare these representations based on multiple criteria: quality, computational efficiency, and generalization performance. Beyond standard model benchmarking, our experiments aim to derive best practices over all steps involved in the 3D generation pipeline, including preprocessing, mesh reconstruction, compression with autoencoders, and generation. Our findings highlight that reconstruction errors significantly impact overall performance, underscoring the need to evaluate generation and reconstruction jointly.
We provide insights that can inform the selection of suitable 3D models for various applications, facilitating the development of more robust and application-specific solutions in 3D generation.
The code for our framework is available at https://github.com/isl-org/unifi3d.
URL: https://openreview.net/forum?id=GQpTWpXILA
---
Title: Solution Augmentation for ARC Problems Using GFlowNet: A Probabilistic Exploration Approach
Authors: Sanha Hwang, Seungpil Lee, Sejin Kim, Sundong Kim
Abstract: One of the core challenges in building general reasoning systems lies in generating diverse, human-aligned solution trajectories—different yet valid paths by which a problem can be solved. Prior approaches often rely on handcrafted templates, rule-based augmentations, or human demonstrations, which are limited in scalability and stylistic diversity. To address this, we explore the use of Generative Flow Networks (GFlowNets) for automated solution augmentation in reasoning tasks. We propose a framework that learns to generate diverse reasoning trajectories with probabilities proportional to their quality, guided by a human-inspired reward function and a novel geometric forward policy. This enables the generation of multiple plausible solution paths without relying on manual supervision. Moreover, our method supports efficient test-time augmentation from input-output examples alone, without access to ground-truth programs or external demonstrations—making it suitable for zero-shot settings. We evaluate our framework on the Abstraction and Reasoning Corpus (ARC-AGI), a benchmark designed to test compositional and abstract reasoning. Our results show that GFlowNets can effectively explore the space of valid reasoning processes, producing a variety of plausible reasoning trajectories, similar to how different individuals might solve the same problem using different intermediate steps. These trajectories are generated at scale (over 100k per task in under an hour) and follow a logarithmic yield trend, enabling practical tradeoffs between augmentation volume and novelty. Furthermore, fine-tuning a large language model (LLaMA 3.1 Instruct 8B) on these synthetic trajectories leads to a 28.6% improvement in reasoning accuracy on ARC tasks, demonstrating the downstream utility of our method. These findings suggest that GFlowNets offer a promising foundation for modeling structured reasoning in automated trajectory generation.
Our code is here: https://github.com/GIST-DSLab/GFN_to_ARC
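For readers unfamiliar with GFlowNet training, the standard trajectory-balance objective sketched below is one common way to obtain sampling probabilities proportional to reward; the paper's human-inspired reward function and geometric forward policy are specific to ARC and not reproduced here, so all quantities in the toy usage are hypothetical.

```python
import torch

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Trajectory-balance objective for GFlowNets: minimizing
    (log Z + sum_t log P_F - log R(x) - sum_t log P_B)^2 drives the
    sampler toward generating trajectories proportionally to reward."""
    return (log_Z + log_pf.sum(-1) - log_reward - log_pb.sum(-1)) ** 2

# Toy batch: 4 trajectories of 6 steps each.
log_Z = torch.zeros(1, requires_grad=True)           # learned log-partition
log_pf = torch.log_softmax(torch.randn(4, 6), dim=-1)  # forward log-probs per step
log_pb = torch.log_softmax(torch.randn(4, 6), dim=-1)  # backward log-probs per step
log_reward = torch.randn(4)                          # log R(x) for each trajectory
loss = trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward).mean()
loss.backward()
print(loss.item())
```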
URL: https://openreview.net/forum?id=ULCOhBgGzy
---
Title: M4GN: Mesh-based Multi-segment Hierarchical Graph Network for Dynamic Simulations
Authors: Bo Lei, Victor M Castillo, Yeping Hu
Abstract: Mesh-based graph neural networks (GNNs) have become effective surrogates for PDE simulations, yet their deep message passing incurs high cost and over-smoothing on large, long-range meshes; hierarchical GNNs shorten propagation paths but still face two key obstacles: (i) building coarse graphs that respect mesh topology, geometry, and physical discontinuities, and (ii) maintaining fine-scale accuracy without sacrificing the speed gained from coarsening. We tackle these challenges with M4GN, a three-tier, segment-centric hierarchical network. M4GN begins with a hybrid segmentation strategy that pairs a fast graph partitioner with a superpixel-style refinement guided by modal-decomposition features, producing contiguous segments of dynamically consistent nodes. These segments are encoded by a permutation-invariant aggregator, avoiding the order sensitivity and quadratic cost of aggregation approaches used in prior works. The resulting information bridges a micro-level GNN, which captures local dynamics, and a macro-level transformer that reasons efficiently across segments, achieving a principled balance between accuracy and efficiency. Evaluated on multiple representative benchmark datasets, M4GN improves prediction accuracy by up to 56% while achieving up to 22% faster inference than state-of-the-art baselines.
URL: https://openreview.net/forum?id=R3vDbqWa1v
---
Title: DNR-Pruning: Sparsity-Aware Pruning via Dying Neuron Reactivation in Convolutional Neural Networks
Authors: Boyuan Wang, Richard Jiang
Abstract: In this paper, we challenge the conventional view of dead neurons (neurons that cease to activate) in deep neural network training. Traditionally regarded as problematic due to their association with optimization challenges and reduced model adaptability over training epochs, dead neurons are often seen as a hindrance. However, we present a novel perspective, demonstrating that they can be effectively leveraged to enhance network sparsity. Specifically, we propose DNR-Pruning, a dying-neuron-reactivation-based, sparsity-aware pruning approach for convolutional neural networks (CNNs) that exploits the behavior of individual neurons during training. Through a systematic exploration of hyperparameter configurations, we show that dying neurons can be harnessed to improve pruning algorithms. Our method dynamically monitors the occurrence of dying neurons, enabling adaptive sparsification throughout CNN training. Extensive experiments on diverse datasets demonstrate that DNR-Pruning outperforms existing sparsity-aware pruning techniques while achieving competitive results compared to state-of-the-art methods. These findings suggest that dying neurons can serve as an efficient mechanism for network compression and resource optimization in CNNs, opening new avenues for more efficient and high-performance deep learning models.
URL: https://openreview.net/forum?id=ymUjGCNPYa
---
Title: FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models
Authors: Kai Yi, Georg Meinhardt, Laurent Condat, Peter Richtárik
Abstract: Federated Learning (FL) has garnered increasing attention due to its unique characteristic of allowing heterogeneous clients to process their private data locally and interact with a central server, while preserving privacy. A critical bottleneck in FL is the communication cost. A pivotal strategy to mitigate this burden is Local Training, which involves running multiple local stochastic gradient descent iterations between communication phases. Our work is inspired by the innovative Scaffnew algorithm, which has considerably advanced the reduction of communication complexity in FL. We introduce FedComLoc (Federated Compressed and Local Training), integrating practical and effective compression into Scaffnew to further enhance communication efficiency. Extensive experiments, using the popular Top-K compressor and quantization, demonstrate that it substantially reduces communication overheads in heterogeneous settings.
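As a concrete example of the compression step, here is a minimal Top-K sparsifier of the kind used in the experiments; the surrounding Scaffnew/FedComLoc client logic is omitted.

```python
import torch

def top_k_compress(grad: torch.Tensor, k: int) -> torch.Tensor:
    """Top-K sparsifier: keep the k largest-magnitude entries, zero the rest."""
    flat = grad.flatten()
    # Indices of the k entries with largest absolute value.
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(grad)

g = torch.randn(4, 4)
print(top_k_compress(g, k=3))  # only 3 nonzero entries survive
```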
URL: https://openreview.net/forum?id=vYQPLytQsj
---
Title: Efficient Object-Centric Representation Learning using Masked Generative Modeling
Authors: Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo
Abstract: Learning object-centric representations from visual inputs in an unsupervised manner has drawn focus to solve more complex tasks, such as reasoning and reinforcement learning. However, current state-of-the-art methods, relying on autoregressive transformers or diffusion models to generate scenes from object-centric representations, suffer from computational inefficiency due to their sequential or iterative nature. This computational bottleneck limits their practical application and hinders scaling to more complex downstream tasks. To overcome this, we propose MOGENT, an efficient object-centric learning framework based on masked generative modeling. MOGENT conditions a masked bidirectional transformer on learned object slots and employs a parallel iterative decoding scheme to generate scenes, enabling efficient compositional generation. Experiments show that MOGENT significantly improves computational efficiency, accelerating the generation process by up to 67x and 17x compared to autoregressive models and diffusion-based models, respectively. Importantly, the efficiency is attained while maintaining strong or competitive performance on object segmentation and compositional generation tasks.
URL: https://openreview.net/forum?id=t9KvOYPeL3
---
Title: COMMA: A Communicative Multimodal Multi-Agent Benchmark
Authors: Timothy Ossowski, Danyal Maqbool, Jixuan Chen, Zefan Cai, Tyler J. Bradshaw, Junjie Hu
Abstract: Rapid advances in multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain-of-thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform even a random baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.
URL: https://openreview.net/forum?id=TIGQIem1na
---
Title: Label Embedding via Low-Coherence Matrices
Authors: Jianxin Zhang, Clayton Scott
Abstract: Label embedding is a framework for multiclass classification problems where each label is represented by a distinct vector of some fixed dimension, and training involves matching model output to the vector representing the correct label. While label embedding has been successfully applied in extreme classification and zero-shot learning, and offers both computational and statistical advantages, its theoretical foundations remain poorly understood. This work presents an analysis of label embedding in the context of extreme multiclass classification, where the number of classes $C$ is very large. We present an excess risk bound that reveals a trade-off between computational and statistical efficiency, quantified via the coherence of the embedding matrix. We further show that under the Massart noise condition, the statistical penalty for label embedding vanishes with sufficiently low coherence. Our analysis supports an algorithm that is simple, scalable, and easily parallelizable, and experimental results demonstrate its effectiveness in large-scale applications.
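The coherence quantity that drives the trade-off can be computed directly. The sketch below uses the usual mutual-coherence definition (maximum absolute inner product between distinct normalized label vectors), with columns-as-labels as an illustrative convention.

```python
import numpy as np

def coherence(E: np.ndarray) -> float:
    """Mutual coherence of a label-embedding matrix E (d x C):
    max absolute inner product between distinct unit-norm label vectors."""
    E = E / np.linalg.norm(E, axis=0, keepdims=True)  # normalize each label vector
    G = E.T @ E                                       # Gram matrix of label vectors
    np.fill_diagonal(G, 0.0)                          # ignore self-similarity
    return float(np.abs(G).max())

rng = np.random.default_rng(0)
# Random Gaussian embeddings of C=1000 labels in d=64 dimensions are a
# classic low-coherence construction.
print(coherence(rng.normal(size=(64, 1000))))
```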
URL: https://openreview.net/forum?id=vrcWXcr4On
---
Title: Studying memorization of large language models using answers to Stack Overflow questions
Authors: Laura Caspari, Alexander Trautsch, Michael Granitzer, Steffen Herbold
Abstract: Large Language Models (LLMs) are capable of answering many software related questions and supporting developers by generating code snippets. These capabilities originate from training on massive amounts of data from the Internet, including information from Stack Overflow. This raises the question of whether answers to software related questions are simply memorized from the training data, which might be problematic, as reuse often requires attribution (e.g., CC-BY license), sharing under a similar license (e.g., GPL licenses), or may even be prohibited (proprietary license). To study this, we compare responses to Stack Overflow questions that were known during LLM pre-training with responses to questions that were not included in the pre-training data. We then calculate the overlap both with answers marked as accepted on Stack Overflow and with other texts we can find on the internet. We further explore the impact of the popularity of programming languages, the complexity of the prompts used, and the randomization of the text generation process on the memorization of answers to Stack Overflow. We find that many generated answers are to some degree collages of memorized content and that this does not depend on whether the questions were seen during training or not. However, many of the memorized snippets are common phrases or code and, therefore, not copyrightable. Still, we also have clear evidence that copyright violation happens and is likely when LLMs are used at large scales.
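A simple way to operationalize the overlap computation is word n-gram overlap between a generated answer and a reference text, as sketched below; the paper's exact overlap metric may differ.

```python
def ngram_overlap(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of word n-grams in the generated answer that also
    occur in the reference text (a simple memorization proxy)."""
    def ngrams(text):
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    gen = ngrams(generated)
    return len(gen & ngrams(reference)) / max(len(gen), 1)

answer = "use a list comprehension to filter items like [x for x in xs if x]"
accepted = "you can use a list comprehension to filter items like [x for x in xs if x]"
print(ngram_overlap(answer, accepted, n=5))
```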
URL: https://openreview.net/forum?id=ddocn44Kaq
---
Title: Doubly Robust Uncertainty Quantification for Quantile Treatment Effects in Sequential Decision Making
Authors: Yang Xu, Chengchun Shi, Shikai Luo, Lan Wang, Rui Song
Abstract: We consider multi-stage sequential decision making, where the treatment at any stage may depend on the subject's entire treatment and covariate history. We introduce a general framework for doubly robust uncertainty quantification for the quantiles of cumulative outcomes under a sequential treatment rule. While previous studies focused on mean effects, quantile effects offer unique insights into the distributional properties and are more robust for heavy-tailed outcomes. It is known that doubly robust inference is significantly more challenging and largely unexplored for quantile treatment effects. More importantly, for mean effects, doubly robust estimation does not ensure doubly robust inference. Our approach first provides a doubly robust estimator for any quantile of interest based on pre-collected data, achieving semi-parametric efficiency. We then propose a novel doubly robust estimator for the asymptotic variance, enabling the construction of a doubly robust confidence interval. To overcome the challenges of parameter-dependent nuisance functions, we leverage deep conditional generative learning techniques. We demonstrate the advantages of our approach via both simulation and real data from a short video platform. Additionally, we observe that our proposed approach leads to another mean effect estimator that outperforms existing estimators with heavy-tailed outcomes.
URL: https://openreview.net/forum?id=F0BwbieVws
---
Title: HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
Authors: Chen Bao, Jiarui Xu, Xiaolong Wang, Abhinav Gupta, Homanga Bharadhwaj
Abstract: How can we predict future interaction trajectories of human hands in a scene given high-level colloquial task specifications in the form of natural language? In this paper, we extend the classic hand trajectory prediction task to several tasks involving explicit and implicit language queries. Our proposed tasks require extensive understanding of human daily activities and reasoning abilities about what is happening next given cues from the current scene. We also develop new benchmarks to evaluate the two proposed tasks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP). We enable solving these tasks by integrating the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level ego-centric hand trajectories. Our model, HandsOnVLM, is a novel VLM that can generate textual responses and produce future hand trajectories through natural-language conversations. Our experiments show that HandsOnVLM outperforms existing task-specific methods and other VLM baselines on the proposed tasks, and demonstrates its ability to effectively utilize world knowledge for reasoning about low-level human hand trajectories based on the provided context. More details can be found at https://www.chenbao.tech/handsonvlm/.
URL: https://openreview.net/forum?id=ehhMFjKnWm
---
Title: FlowKac: An Efficient Neural Fokker-Planck solver using Temporal Normalizing flows and the Feynman-Kac Formula
Authors: Naoufal EL BEKRI, Lucas Drumetz, Franck Vermet
Abstract: Solving the Fokker-Planck equation for high-dimensional complex dynamical systems remains a pivotal yet challenging task due to the intractability of analytical solutions and the limitations of traditional numerical methods. In this work, we present FlowKac, a novel approach that reformulates the Fokker-Planck equation using the Feynman-Kac formula, allowing the solution at a given point to be queried via expected values over stochastic paths. A key innovation of FlowKac lies in its adaptive stochastic sampling scheme, which significantly reduces the computational complexity while maintaining high accuracy. This sampling technique, coupled with a time-indexed normalizing flow designed for capturing time-evolving probability densities, enables robust sampling of collocation points, resulting in a flexible and mesh-free solver. This formulation mitigates the curse of dimensionality and enhances computational efficiency and accuracy, which is particularly crucial for applications that inherently require dimensions beyond the conventional three. We validate the robustness and scalability of our method through various experiments on a range of stochastic differential equations, demonstrating significant improvements over existing techniques.
URL: https://openreview.net/forum?id=paeyQFa5or
---
Title: Mean-Field RL for Large-Scale Unit-Capacity Pickup-and-Delivery Problems
Authors: Kai Cui, Sharif Azem, Christian Fabian, Kirill Kuroptev, Ramin Khalili, Osama Abboud, Florian Steinke, Heinz Koeppl
Abstract: Solving large-scale vehicle routing problems (VRPs) is NP-hard and poses a computational challenge in numerous applications such as logistics. Meanwhile, mean-field control (MFC) provides a tractable and rigorous approach to controlling many agents. We provide a solution to pickup-and-delivery VRPs via scalable MFC. In combination with reinforcement learning (RL) and clustering, our MFC approach efficiently scales to large-scale VRPs. We perform a theoretical analysis of our MFC-based approximation, giving convergence results for large VRP instances and error bounds for clustering-based approximations. We verify our algorithms on different datasets and compare them against solutions such as OR-Tools, PyVRP and heuristics, showing scalability in terms of speed for mean-field methods, for the first time in discrete optimization. Overall, our work establishes a novel synthesis of MFC-based RL techniques, vehicle routing problems and clustering approximations, to solve a hard discrete optimization problem of practical use in a scalable way.
URL: https://openreview.net/forum?id=E8JRswdyDR
---
Title: Solving Inverse Problems using Diffusion with Iterative Colored Renoising
Authors: Matthew C Bendel, Saurav K Shastri, Rizwan Ahmad, Philip Schniter
Abstract: Imaging inverse problems can be solved in an unsupervised manner using pre-trained diffusion models, but doing so requires approximating the gradient of the measurement-conditional score function in the diffusion reverse process.
We show that the approximations produced by existing methods are relatively poor, especially early in the reverse process, and so
we propose a new approach that iteratively reestimates and "renoises" the estimate several times per diffusion step.
This iterative approach, which we call Fast Iterative REnoising (FIRE), injects colored noise that is shaped to ensure that the pre-trained diffusion model always sees white noise, in accordance with how it was trained.
We then embed FIRE into the DDIM reverse process and show that the resulting "DDfire" offers state-of-the-art accuracy and runtime on several linear inverse problems, as well as phase retrieval.
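A simplified view of the renoising loop: re-estimate the clean image, then map it back to the current noise level with fresh noise, several times per step. FIRE's key refinement, shaping colored noise so that injected noise plus residual estimation error looks white to the model, is noted but not implemented in this sketch; the `denoiser` is a stand-in for a pretrained diffusion model.

```python
import torch

def fire_step(x_t, t, alpha_bar, denoiser, n_iters=4):
    """Simplified renoising loop in the spirit of FIRE. The paper shapes
    *colored* noise so that the model always sees white noise; this
    sketch uses plain white noise for brevity."""
    ab = alpha_bar[t]
    for _ in range(n_iters):
        x0_hat = denoiser(x_t, t)                           # current clean estimate
        noise = torch.randn_like(x_t)                       # fresh noise
        x_t = ab.sqrt() * x0_hat + (1 - ab).sqrt() * noise  # renoise to level t
    return x_t, x0_hat

denoiser = lambda x, t: 0.9 * x  # toy stand-in for a pretrained model
alpha_bar = torch.linspace(0.99, 0.01, 100)
x, x0 = fire_step(torch.randn(1, 3, 8, 8), t=50, alpha_bar=alpha_bar, denoiser=denoiser)
print(x.shape, x0.shape)
```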
URL: https://openreview.net/forum?id=RZv8FcQDPW
---
Title: Understanding the learned look-ahead behavior of chess neural networks
Authors: Diogo Cruz
Abstract: We investigate the look-ahead capabilities of chess-playing neural networks, specifically focusing on the Leela Chess Zero policy network. We build on the work of Jenner et al. (2024) by analyzing the model's ability to consider future moves and alternative sequences beyond the immediate next move. Our findings reveal that the network's look-ahead behavior is highly context-dependent, varying significantly based on the specific chess position. We demonstrate that the model can process information about board states up to seven moves ahead, utilizing similar internal mechanisms across different future time steps. Additionally, we provide evidence that the network considers multiple possible move sequences rather than focusing on a single line of play. These results offer new insights into the emergence of sophisticated look-ahead capabilities in neural networks trained on strategic tasks, contributing to our understanding of AI reasoning in complex domains. Our work also showcases the effectiveness of interpretability techniques in uncovering cognitive-like processes in artificial intelligence systems.
URL: https://openreview.net/forum?id=np4Bg2zIxL
---
Title: The Overcooked Generalisation Challenge: Evaluating Cooperation with Novel Partners in Unknown Environments Using Unsupervised Environment Design
Authors: Constantin Ruhdorfer, Matteo Bortoletto, Anna Penzkofer, Andreas Bulling
Abstract: We introduce the Overcooked Generalisation Challenge (OGC) – a new benchmark for evaluating reinforcement learning (RL) agents on their ability to cooperate with unknown partners in unfamiliar environments.
Existing work has typically evaluated cooperative RL agents only in their training environments or with their training partners, seriously limiting our ability to understand agents’ generalisation capacity – an essential requirement for future collaboration with humans.
The OGC extends Overcooked-AI to support dual curriculum design (DCD).
It is fully GPU-accelerated, open-source, and integrated into the minimax DCD benchmark suite.
Compared to prior DCD benchmarks, where designers manipulate only minimal elements of the environment, OGC introduces a significantly richer design space: full kitchen layouts with multiple objects that require the designer to account for interaction dynamics between agents.
We evaluate state-of-the-art DCD algorithms alongside scalable neural architectures and find that current methods fail to produce agents that generalise effectively to novel layouts and unfamiliar partners.
Our results indicate that both agents and curriculum designers struggle with the joint challenge of partner and environment generalisation.
These findings establish OGC as a demanding testbed for cooperative generalisation and highlight key directions for future research.
We open-source our code.
URL: https://openreview.net/forum?id=K2KtcMlW6j
---
Title: Single-positive Multi-label Learning with Label Cardinality
Authors: Shayan Gharib, Pierre-Alexandre Murena, Arto Klami
Abstract: We study learning a multi-label classifier from partially labeled data, where each instance has only a single positive label. We explain how auxiliary information available on the label cardinality, the number of positive labels per instance, can be used for improving such methods. We consider auxiliary information of varying granularity, ranging from knowing just the maximum number of labels over all instances to knowledge on the distribution of label cardinalities and even the exact cardinality of each instance. We introduce methods leveraging the different types of auxiliary information, study how close to the fully labeled accuracy we can get under different scenarios, and show that an easy-to-implement method only assuming the knowledge of the maximum cardinality is comparable to the state-of-the-art single-positive multi-label learning methods when using the same base model. Our implementation is publicly available at https://github.com/shayangharib/SPMLL_with_Label_Cardinality.
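As a purely hypothetical illustration of how maximum-cardinality information could enter a training loss (this is not necessarily the paper's method), one can combine a BCE term on the observed positive with a penalty whenever the expected number of predicted positives exceeds the known maximum:

```python
import torch
import torch.nn.functional as F

def single_positive_loss_with_kmax(logits, pos_idx, k_max):
    """Hypothetical loss using max-cardinality side information: BCE on
    the single observed positive plus a penalty when the expected label
    cardinality exceeds k_max. Illustrative only, not the paper's method."""
    probs = torch.sigmoid(logits)                      # (batch, n_labels)
    pos = F.binary_cross_entropy_with_logits(
        logits.gather(1, pos_idx[:, None]),
        torch.ones(logits.size(0), 1))
    # Expected cardinality should not exceed the known maximum.
    card_penalty = F.relu(probs.sum(dim=1) - k_max).mean()
    return pos + card_penalty

logits = torch.randn(8, 20)
pos_idx = torch.randint(0, 20, (8,))
print(single_positive_loss_with_kmax(logits, pos_idx, k_max=3))
```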
URL: https://openreview.net/forum?id=XEPPXH2nKu
---
Title: LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning
Authors: Pingcheng Jian, Xiao Wei, Yanbaihui Liu, Samuel A. Moore, Michael M. Zavlanos, Boyuan Chen
Abstract: We introduce Large Language Model-Assisted Preference Prediction (LAPP), a novel framework for robot learning that enables efficient, customizable, and expressive behavior acquisition with minimum human effort. Unlike prior approaches that rely heavily on reward engineering, human demonstrations, motion capture, or expensive pairwise preference labels, LAPP leverages large language models (LLMs) to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL). These labels are used to train an online preference predictor, which in turn guides the policy optimization process toward satisfying high-level behavioral specifications provided by humans. Our key technical contribution is the integration of LLMs into the RL feedback loop through trajectory-level preference prediction, enabling robots to acquire complex skills including subtle control over gait patterns and rhythmic timing. We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning.
URL: https://openreview.net/forum?id=cq76wx7T9F
---
Title: Text to Stealthy Adversarial Face Masks
Authors: Ben Lewis, Thomas Moyse, James Parkinson, Elizabeth Telford, Callum Whitfield, Ranko Lazic
Abstract: Recent studies have demonstrated that modern facial recognition systems, which are based on deep neural networks, are vulnerable to adversarial attacks, including the use of accessories, makeup patterns, or precision lighting. However, developing attacks that are both robust (resilient to changes in viewing angles and environmental conditions) and stealthy (do not attract suspicion by, for example, incorporating obvious facial features) remains a significant challenge. In this context, we introduce a novel diffusion-based method (DAFR) capable of generating robust and stealthy face masks for dodging recognition systems (where the system fails to identify the attacker). Specifically, our approach is capable of producing high-fidelity printable textures, using the guidance of textual prompts to determine the style. This method can also be adapted for impersonation purposes, where the system misidentifies the attacker as a specific other individual. Finally, we address a gap in the existing literature by presenting a comprehensive benchmark (FAAB) for evaluating adversarial accessories in three dimensions, assessing their robustness and stealthiness.
URL: https://openreview.net/forum?id=XYqCx026AI
---
Title: Discrete Audio Tokens: More Than a Survey!
Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
Abstract: Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder architecture, quantization technique, training paradigm, streamability, and application domain. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
URL: https://openreview.net/forum?id=eqNchtvc6v
---
New submissions
===============
Title: Are We Really Learning the Score Function? Reinterpreting Diffusion Models Through Wasserstein Gradient Flow Matching
Abstract: Diffusion models are commonly interpreted as learning the score function, i.e., the gradient of the log-density of noisy data. However, this assumption implies that the target of learning is a conservative vector field, which is not enforced by the neural network architectures used in practice. We present numerical evidence that trained diffusion networks violate both integral and differential constraints required of true score functions, demonstrating that the learned vector fields are not conservative. Despite this, the models perform remarkably well as generative mechanisms. To explain this apparent paradox, we advocate a new theoretical perspective: diffusion training is better understood as flow matching to the velocity field of a Wasserstein Gradient Flow (WGF), rather than as score learning for a reverse-time stochastic differential equation. Under this view, the "probability flow" arises naturally from the WGF framework, eliminating the need to invoke reverse-time SDE theory and clarifying why generative sampling remains successful even when the neural vector field is not a true score. We further show that non-conservative errors from neural approximation do not necessarily harm density transport. Our results advocate for adopting the WGF perspective as a principled, elegant, and theoretically grounded framework for understanding diffusion generative models.
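The differential constraint is easy to check numerically: a true score is a gradient field, so its Jacobian must be symmetric. The sketch below measures Jacobian asymmetry, with toy closed-form fields standing in for the trained network at a fixed noise level.

```python
import torch

def jacobian_asymmetry(field, x):
    """A conservative vector field (e.g., a true score, the gradient of a
    log-density) has a symmetric Jacobian. Returns ||J - J^T|| / ||J||."""
    J = torch.autograd.functional.jacobian(field, x)
    return (J - J.T).norm() / J.norm()

# A gradient field (conservative): score of a standard Gaussian, s(x) = -x.
print(jacobian_asymmetry(lambda x: -x, torch.randn(4)))   # ~0
# A rotational (non-conservative) field in 2D.
rot = lambda x: torch.stack([-x[1], x[0]])
print(jacobian_asymmetry(rot, torch.randn(2)))            # > 0
```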
URL: https://openreview.net/forum?id=CzyJqXQRhJ
---
Title: IPA: An Information-Preserving Input Projection Framework for Efficient Foundation Model Adaptation
Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, reduce adaptation cost by injecting low-rank updates into pretrained weights. However, LoRA’s down-projection is randomly initialized and data-agnostic, discarding potentially useful information. Prior analyses show that this projection changes little during training, while the up-projection carries most of the adaptation, making the random input compression a performance bottleneck. We propose IPA, a feature-aware projection framework that explicitly preserves information in the reduced hidden space. In the linear case, we instantiate IPA with algorithms approximating top principal components, enabling efficient projector pretraining with negligible inference overhead. Across language and vision benchmarks, IPA consistently improves over LoRA and DoRA, achieving on average 1.5 points higher accuracy on commonsense reasoning and 2.3 points on VTAB-1k, while matching full LoRA performance with roughly half the trainable parameters when the projection is frozen.
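In the linear case described above, one natural instantiation is initializing the down-projection with the top principal directions of the layer's input features; the sketch below is a minimal version of that idea, with hypothetical activation data (the paper's exact algorithm may differ).

```python
import numpy as np

def pca_down_projection(H: np.ndarray, r: int) -> np.ndarray:
    """Feature-aware down-projection: take the top-r principal directions
    of the layer's input features H (n_samples x d), so the compressed
    hidden space preserves the dominant feature variance."""
    Hc = H - H.mean(axis=0, keepdims=True)
    # Right singular vectors = principal directions of the features.
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Vt[:r]  # shape (r, d): the rank-r down-projection

rng = np.random.default_rng(0)
H = rng.normal(size=(512, 768))  # hypothetical hidden activations
A = pca_down_projection(H, r=16)
print(A.shape, np.allclose(A @ A.T, np.eye(16), atol=1e-6))  # orthonormal rows
```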
URL: https://openreview.net/forum?id=aLmQeZx2pR
---
Title: Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge
Abstract: The Schrödinger bridge (SB) has evolved into a universal class of probabilistic generative models. In practice, however, estimated learning signals are innately uncertain, and the reliability promised by existing methods is often based on speculative optimal case scenarios. Recent studies regarding the Sinkhorn algorithm through mirror descent (MD) have gained attention, revealing geometric insights into solution acquisition of the SB problems. In this paper, we propose a variational online MD (OMD) framework for the SB problems, which provides further stability to SB solvers. We formally prove convergence and a regret bound for the novel OMD formulation of SB acquisition. As a result, we propose a simulation-free SB algorithm called Variational Mirrored Schrödinger Bridge (VMSB) by utilizing the Wasserstein-Fisher-Rao geometry of the Gaussian mixture parameterization for Schrödinger potentials. Based on the Wasserstein gradient flow theory, the algorithm offers tractable learning dynamics that precisely approximate each OMD step. In experiments, we validate the performance of the proposed VMSB algorithm across an extensive suite of benchmarks. VMSB consistently outperforms contemporary SB solvers on a wide range of SB problems, demonstrating the robustness as well as generality predicted by our OMD theory.
URL: https://openreview.net/forum?id=g3SsM9FLpm
---
Title: A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models
Abstract: The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers will be made public.
URL: https://openreview.net/forum?id=Ysslwdjb6L
---
Title: LiteXrayNet: Quantum-Inspired Deep Learning Framework for Scalable Pneumonia Diagnosis
Abstract: Pneumonia is a serious global health problem that weighs most heavily on low-resource areas, where timely diagnostic facilities are vital. This paper presents liteXrayNet, an advanced convolutional neural network (CNN) tailored to detect pneumonia on chest radiographs with high accuracy while running under conditions of limited computing resources. The network uses the inverted residual MBConv blocks of MobileNetV3 for effective feature extraction, a quantum-inspired phase shift layer to enhance the detection of complex patterns, and a simplified recognizer that provides robust binary classification. With 179,646 trainable parameters, a model size of 0.7 MB, and an inference latency of 0.60 ms/sample, liteXrayNet achieves a test accuracy of 97%, enabling real-time diagnosis on resource-constrained systems. The model keeps computing requirements minimal with little impact on diagnostic quality by integrating depthwise separable convolutions, hard-swish activations, and quantum-inspired feature modulation. Thanks to its lightweight construction and high diagnostic accuracy, liteXrayNet offers an efficient solution for scalable, point-of-care pneumonia diagnosis, expanding access to healthcare and reducing diagnostic disparities in underserved populations globally.
URL: https://openreview.net/forum?id=yfJCllstyT
---
Title: Trajectory-Based Neural Darwinism in CNNs: Variation, Competition, and Selective Retention
Abstract: Understanding how neural networks develop and stabilize their internal representations remains a central challenge in deep learning. Inspired by Edelman’s theory of Neural Darwinism, we investigate whether competitive dynamics analogous to neuronal group selection emerge in artificial neural networks during training. Through detailed trajectory analyses of neuron activations, weights, and cumulative representational change across convolutional neural networks (CNNs) including a three-layer MLP-Net, ResNet-18, VGG-16, and ResNet-50, we uncover consistent patterns of variation, competition, and selective retention. Ablation studies reveal that networks tolerate removal of large fractions of neurons without accuracy degradation, indicating high redundancy; however, beyond a critical threshold, performance collapses as the core subset of task-critical neurons is disrupted. Across multiple datasets and architectures, neuron trajectory dynamics show that surviving neurons sustain longer, more coherent representational paths, stronger weight norms, and higher activations, while eliminated neurons stagnate or fade toward representational silence. Overall, our findings are consistent with a Darwinian view of representation learning: CNNs exhibit robustness through redundancy at early stages, followed by selective consolidation of highly specialized neurons in deeper layers.
URL: https://openreview.net/forum?id=GQZKCS4Enz
---
Title: FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models
Abstract: Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.
URL: https://openreview.net/forum?id=UtE1YnPNgZ
---
Title: Political-LLM: Large Language Models in Political Science
Abstract: Political science is undergoing a significant transition as large language models (LLMs) gain traction in tasks such as election forecasting, policy assessment, and misinformation detection. While LLMs advance political research, they also pose challenges, including but not limited to societal biases (e.g., partisan skew in political sentiment analysis), ethical concerns (e.g., misinformation propagation in automated legislative summarization), and scalability limitations (e.g., inefficiencies in adapting general LLMs for real-time election forecasting). In this work, we, an interdisciplinary team bridging computer science and political science, take an initial step towards systematically understanding how LLMs can be integrated into political science by introducing the principled conceptual framework named Political-LLM. Specifically, our approach begins with a taxonomy that divides normative political science (NPS) and positive political science (PPS), a classification deeply rooted in the foundations of classical political science research. By grounding the framework in this perspective, we provide a structured view for organizing previous work, pinpointing critical challenges, and uncovering opportunities to promote both empirical research and responsible applications of LLMs. As a case study, we perform empirical experiments using the ANES benchmark to evaluate state-of-the-art LLMs through a voting simulation task, focusing on their abilities to generate relevant political features and expose inherent biases. This study highlights how to employ our principled taxonomy as guidance for specific research problems in this interdisciplinary field, while also providing a vivid and accessible example that helps a general audience deepen their comprehension of the Political-LLM framework. Finally, we outline key challenges and future directions, emphasizing domain-specific dataset development, careful attention to issues such as bias and opaque modeling processes, acknowledgment of non-scalability constraints, the value of expert involvement, and the importance of proprietary evaluation criteria that meet the needs of this field. Political-LLM is intended as a guidebook for researchers seeking to apply artificial intelligence in political science with care and impact.
URL: https://openreview.net/forum?id=M2sZKXL8Cr
---
Title: Probabilistic Residual User Clustering
Abstract: Modern recommender systems are typically based on deep learning (DL) models, where a dense encoder learns representations of users and items. As a result, these systems often suffer from the black-box nature and computational complexity of the underlying models, making it difficult to systematically interpret their outputs and enhance their recommendation capabilities. To address this problem, we propose Probabilistic Residual User Clustering (PRUC), a causal Bayesian recommendation model based on user clustering. Specifically, we address this problem by (1) dividing users into clusters in an unsupervised manner and identifying causal confounders that influence latent variables, (2) developing sub-models for each confounder given the observable variables, and (3) generating recommendations by aggregating the rating residuals under each confounder using do-calculus. Experiments demonstrate that our plug-and-play PRUC is compatible with various base DL recommender systems, significantly improving their performance while automatically discovering meaningful user clusters.
URL: https://openreview.net/forum?id=9jxjGJa4E4
---
Title: Synergistic Benefits of Joint Molecule Generation and Property Prediction
Abstract: Modeling the joint distribution of data samples and their properties makes it possible to construct a single model for both data generation and property prediction, with synergistic benefits reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mechanism and a joint pre-training scheme. We show that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction and representation learning. Finally, we demonstrate the benefits of joint learning in a drug design use case of discovering novel antimicrobial peptides.
URL: https://openreview.net/forum?id=jnzCOLyGOA
---
Title: LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction
Abstract: Mathematical reasoning remains a significant challenge for Large Language Models (LLMs) due to hallucinations. When combined with formal proof assistants like Lean, these hallucinations can be eliminated through rigorous verification, making theorem proving reliable. However, even with formal verification, LLMs still struggle with long proofs and complex mathematical formalizations. While Lean with LLMs offers valuable assistance with retrieving lemmas, generating tactics, or even complete proofs, it lacks a crucial capability: providing a sense of proof progress. This limitation particularly impacts the overall development efficiency in large formalization projects. We introduce LeanProgress, a method that predicts the progress of a proof, i.e., how many steps remain to complete it. We train and evaluate our models on a large corpus of Lean proofs from Lean Workbook Plus and Mathlib4, employing data preprocessing and balancing techniques to handle the skewed distribution of proof lengths. Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1% in predicting the amount of progress and, hence, the remaining number of steps. When integrated into a best-first search framework using Reprover, our method shows a 3.8% improvement on Mathlib4 over a baseline performance of 41.2%, particularly for longer proofs. These results demonstrate how proof progress prediction can enhance both automated and interactive theorem proving, enabling users to make more informed decisions about proof strategies.
URL: https://openreview.net/forum?id=eTmOwvvRu9
---
Title: Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes
Abstract: We evaluate AI systems without ground truth by exploiting a link between strategic gaming and information loss. Building on established information theory, we analyze which mechanisms resist adversarial manipulation. We extend finite-sample bounds to show that certain f-divergences (e.g., total variation distance) maintain polynomial guarantees under attacks while other measures (e.g., KL divergence) degrade exponentially. We implement these mechanisms by modeling the overseer as an agent and characterize incentive-compatible scoring rules as f-information objectives. Under adversarial attacks, TVD-MI maintains effectiveness (area under curve 0.70-0.77) while other approaches can decay towards chance, demonstrating that when we query the same LLM for information relationships rather than quality judgments, we achieve both theoretical and practical robustness. The mechanisms decompose pairwise evaluations into reliable item-level quality scores without ground truth, addressing a key limitation of standard peer prediction. Note: Supplementary material including pre-registration details and experimental code is provided in the submission package.
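For the discrete case, the TVD-based information quantity named in the abstract has a one-line form: the total variation distance between a joint distribution and the product of its marginals. A minimal sketch:

```python
import numpy as np

def tvd_mi(joint: np.ndarray) -> float:
    """f-information with f = total variation: TVD between the joint
    distribution of two signals and the product of its marginals."""
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    return 0.5 * np.abs(joint - p_x * p_y).sum()

# Perfectly correlated binary signals -> 0.5; independent signals -> 0.
print(tvd_mi(np.array([[0.5, 0.0], [0.0, 0.5]])))
print(tvd_mi(np.array([[0.25, 0.25], [0.25, 0.25]])))
```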
URL: https://openreview.net/forum?id=i7T1tFvFQM
---
Title: Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
Abstract: Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high-resolution, small crops) and a global scale (low-resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 13% relative mIoU improvement. Code and pretrained models will be released.
URL: https://openreview.net/forum?id=tidYprMlsg
---
Title: Incorporating Token Usage into Prompting Strategy Evaluation
Abstract: In recent years, large language models have demonstrated remarkable performance across diverse tasks. However, their task effectiveness is heavily dependent on the prompting strategy used to elicit output, which can vary widely in both performance and token usage. While task performance is often used to determine prompting strategy success, we argue that efficiency—balancing performance and token usage—can be a more practical metric for real-world utility. To enable this, we propose Big-$O_{tok}$, a theoretical framework for describing the token usage growth of prompting strategies, and analyze Token Cost, an empirical measure of tokens per performance. We apply these to several common prompting strategies to demonstrate their utility and observe that increased token usage leads to drastically diminishing performance returns. Our results validate the Big-$O_{tok}$ and Token Cost analyses and reinforce the need for efficiency-aware evaluations.
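Token Cost as described admits a direct reading, tokens consumed per unit of performance; the numbers below are hypothetical and only show how the comparison works (the paper's exact normalization may differ).

```python
def token_cost(total_tokens: int, performance: float) -> float:
    """Token Cost: tokens consumed per unit of performance (lower is better)."""
    return total_tokens / performance

# Hypothetical comparison on the same 100-item task:
print(token_cost(total_tokens=48_000, performance=0.82))  # chain-of-thought
print(token_cost(total_tokens=9_000, performance=0.74))   # direct prompting
```

The comparison makes explicit when extra tokens stop paying for themselves: a strategy with slightly higher accuracy can still have a far worse Token Cost.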
URL: https://openreview.net/forum?id=lgZCAku55O
---
Title: FusionProt: Fusing Sequence and Structural Information for Unified Protein Representation Learning
Abstract: Accurate protein representations that integrate sequence and three-dimensional (3D) structure are critical to many biological and biomedical tasks. Most existing models either ignore structure or combine it with sequence through a single, static fusion step. Here we present FusionProt, a unified model that learns representations via iterative, bidirectional fusion between a protein language model and a structure encoder. A single learnable token serves as a carrier, alternating between sequence attention and spatial message passing across layers. FusionProt is evaluated on Enzyme Commission (EC), Gene Ontology (GO), and mutation stability prediction tasks. It improves $F_{\max}$ by a median of +1.3 points (up to +2.0) across EC and GO benchmarks, and boosts AUROC by +3.6 points over the strongest baseline on mutation stability. Inference cost remains practical, with only about 2-5% runtime overhead.
Beyond state-of-the-art performance, we further demonstrate FusionProt’s practical relevance through representative biological case studies, indicating that the model captures biologically relevant features.
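A hypothetical sketch of the single-carrier-token idea: the token attends over sequence states, then fuses with the structure branch. The interfaces and the pooling step are our assumptions (the paper alternates these steps across layers):

```python
import torch
import torch.nn as nn

class FusionToken(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.seq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.struct_fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, seq_states, node_states):
        tok = self.token.expand(seq_states.size(0), -1, -1)
        # Sequence step: the carrier attends over residue embeddings.
        tok, _ = self.seq_attn(tok, seq_states, seq_states)
        # Structure step: fuse with a pooled view of the 3D-graph nodes.
        pooled = node_states.mean(dim=1, keepdim=True)
        return self.struct_fuse(torch.cat([tok, pooled], dim=-1))
```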
URL: https://openreview.net/forum?id=imcinaOHod
---
Title: Toward Explanations for Large Language Models in Natural Language
Abstract: Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their extensive internal knowledge and reasoning capabilities. However, the black-box nature of these models complicates the task of explaining their decision-making processes. While recent advancements demonstrate the potential of leveraging LLMs to self-explain their predictions through natural language (NL) explanations, these explanations or chains-of-thought may not accurately reflect the LLMs' decision-making process, since they need not involve the model's true decision-making pivots. Measuring the fidelity of NL explanations is a challenging but important problem: it is difficult to manipulate the input context to mask the semantics of these explanations, yet such measurement is what allows the quality of explanations to be assessed. To this end, we introduce FaithLM for explaining the decisions of LLMs with NL explanations. Specifically, FaithLM evaluates the fidelity of NL explanations by incorporating contrary explanations into the query process. Moreover, FaithLM iteratively improves the fidelity of the derived explanations. Experimental results on three datasets from multiple domains demonstrate that FaithLM significantly improves the fidelity of derived explanations and yields better alignment with ground-truth explanations.
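A rough sketch of the contrary-explanation probe as we read it from the abstract; `query_llm` is a placeholder, not an API from the paper, and the actual procedure may differ:

```python
# If swapping in a contrary explanation never changes the answer, the
# explanation likely played no causal role in the decision.
def fidelity_probe(query_llm, question, explanation, contrary_explanation):
    answer_with = query_llm(f"{question}\nExplanation: {explanation}")
    answer_against = query_llm(f"{question}\nExplanation: {contrary_explanation}")
    return float(answer_with != answer_against)
```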
URL: https://openreview.net/forum?id=7GyPqIOodP
---
Title: A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift -- from scaling static models to developing self-evolving agents -- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organizing the field around three foundational dimensions -- $\textit{what to evolve, when to evolve, and how to evolve}$. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive, robust, and versatile agentic systems in both research and real-world deployments, ultimately shedding light on the path toward the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.
URL: https://openreview.net/forum?id=CTr3bovS5F
---
Title: Nonvacuous Generalization Bounds For Deep Networks With Improved Size Dependence
Abstract: Despite being massively overparameterized, deep neural networks exhibit a remarkable ability to generalize well to unseen data. Existing generalization bounds fail to explain this phenomenon, often becoming vacuous due to their strong dependence on network depth and width. To address this, we introduce novel nonvacuous generalization bounds for deep networks, offering tighter estimates of their Rademacher complexity by introducing a new analysis of covering number, which exhibits much milder depth dependence. Our bounds grow at a much slower rate of $O(\sqrt{Dpr})$, with network depth $D$, width $p$, and weight rank $r$, compared to previous works that scale at a rate of $O(\sqrt{D^3pr})$. Moreover, under certain plausible assumptions on the norms of network weights, we establish bounds that grow at a sublogarithmic rate of $O(\sqrt{\log D})$ with depth. This novel bound is much tighter and represents a substantial improvement over prior bounds that scale at a polynomial rate with depth. We provide rigorous empirical validation, demonstrating that our bounds offer consistently tighter estimates compared to the state-of-the-art results. Thus, our bounds offer improved insight into the excellent generalization capabilities of deep overparameterized networks.
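For scale, the gap between the two stated rates works out to exactly a factor of the depth $D$ (the numeric example is ours, not from the paper):

```latex
\[
\frac{\sqrt{D^{3}\,p\,r}}{\sqrt{D\,p\,r}} \;=\; D,
\qquad \text{e.g., } D = 64 \;\Rightarrow\; 64\times \text{ tighter depth dependence.}
\]
```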
URL: https://openreview.net/forum?id=aSv29Xh81q
---
Title: DynaVect: Context-Aware Modulation of Global Edit Directions for Controllable GAN Editing
Abstract: Text-guided editing of generative models like StyleGAN has become a popular method for image manipulation, but current approaches face a trade-off: optimization-based methods produce edits that are too subtle to meet the user's intent, while methods that use a single global edit vector often cause unwanted attribute entanglement and identity loss. In this work, we propose DynaVect, a hybrid approach that attempts to resolve this trade-off. At its core is a lightweight Dynamic Contextual Modulator (DCM), a network trained to predict a personalized correction (or delta) based on the source image's features. At inference time, this learned delta steers the global edit direction, producing visually distinct changes while attempting to preserve the original identity. We train our modulator using an optimization-distillation technique, creating a fast feed-forward model that approximates the quality of slow, per-image optimization. We demonstrate that our method produces qualitatively superior results that better align with users' expectations as compared to traditional metrics.
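A hypothetical sketch of the modulation described above; the names, shapes, and the additive steering rule are our assumptions:

```python
import torch
import torch.nn as nn

class DynamicContextualModulator(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, w, image_features, global_direction, strength=1.0):
        # Predict a personalized correction from the source image's
        # features, then steer (not replace) the shared global direction.
        delta = self.net(image_features)
        return w + strength * (global_direction + delta)
```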
URL: https://openreview.net/forum?id=PclTF8BjyZ
---
Title: SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection
Abstract: In recent years, vision-language models such as CLIP and VideoLLaMA have demonstrated the ability to express visual data in semantically rich textual representations, making them highly effective for downstream tasks. Given their cross-modal semantic representation power, leveraging such models for video anomaly detection (VAD) holds significant promise. In this work, we introduce Semantic VAD (SemVAD), a novel methodology for weakly supervised video anomaly detection (wVAD) that effectively fuses visual and semantic features obtained from pretrained vision-language models, specifically VideoLLaMA 3 and CLIP. Our approach enhances performance and explainability in anomaly detection. Additionally, we analyze the sensitivity of recent state-of-the-art models to randomness in training initialization and introduce a more comprehensive evaluation framework to assess their robustness to small changes in training, such as the seed of the random number generator. This framework aims to provide a more rigorous and holistic assessment of model performance, ensuring a deeper understanding of their reliability and reproducibility in wVAD.
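A minimal sketch of the seed-sensitivity protocol the abstract argues for: rerun training under several seeds and report the spread rather than a single score. `train_and_eval` is a placeholder for any wVAD pipeline:

```python
import statistics

def seed_robustness(train_and_eval, seeds=(0, 1, 2, 3, 4)):
    scores = [train_and_eval(seed=s) for s in seeds]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "worst": min(scores),
    }
```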
URL: https://openreview.net/forum?id=6tkvxrHidI
---
Title: Learning to reconstruct from saturated data: audio declipping and high-dynamic range imaging
Abstract: Learning-based methods are now ubiquitous for solving inverse problems, but their deployment in real-world applications is often hindered by the lack of ground truth references for training. Recent self-supervised learning strategies offer a promising alternative, avoiding the need for ground truth. However, most existing methods are limited to linear inverse problems. This work extends self-supervised learning to the non-linear problem of recovering audio and images from clipped measurements, by assuming that the signal distribution is approximately invariant to changes in amplitude. We provide sufficient conditions for learning to reconstruct from saturated signals alone and a self-supervised loss that can be used to train reconstruction networks. Experiments on both audio and image data show that the proposed approach is almost as effective as fully supervised approaches, despite relying solely on clipped measurements for training.
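Our reading of the amplitude-invariance idea, sketched below; this is not the paper's exact loss, only an illustration of the principle that shrinking a signal and re-clipping it should yield a consistent reconstruction:

```python
import torch

def hard_clip(x, tau=1.0):
    return x.clamp(-tau, tau)

def self_supervised_declip_loss(net, y, tau=1.0, a=0.7):
    x_hat = net(y)                                 # reconstruction from clipped input
    # Agree with the measurement wherever it is unsaturated.
    data_term = ((hard_clip(x_hat, tau) - y) ** 2).mean()
    # Shrink the reconstruction and re-clip it into a synthetic measurement;
    # the network should recover the scaled signal from it.
    y_scaled = hard_clip(a * x_hat.detach(), tau)
    inv_term = ((a * x_hat - net(y_scaled)) ** 2).mean()
    return data_term + inv_term
```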
URL: https://openreview.net/forum?id=AkJWgglkLd
---
Title: Efficient and Accurate Likelihood Estimation via Learning Amortized Adaptive Proposal Distributions
Abstract: Recent advancements in probabilistic modeling have driven significant progress in deep learning, particularly through the development of generative models based on variational inference. These models optimize a tractable lower bound of the log-likelihood, rather than the log-likelihood itself. However, they often encounter trade-offs between approximation accuracy and computational efficiency. To address these limitations, we propose a novel generative model grounded in importance sampling. Central to our approach is the Amortized Adaptive Proposal Distribution (AAPD), which simultaneously serves as both the proposal distribution for importance sampling and an approximation to the posterior. Extensive evaluations on both synthetic and real-world datasets demonstrate the superior performance and versatility of our method in latent variable modeling. Additionally, we extend our model to mixed-effects settings, effectively addressing some limitations of traditional statistical approaches.
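For grounding, the textbook importance-sampling likelihood estimate with an amortized proposal $q(z\mid x)$, as the abstract describes; this is not necessarily the paper's exact objective, and `proposal(x)` is assumed to return a torch Distribution:

```python
import torch

def log_likelihood_is(log_joint, proposal, x, num_samples=64):
    """log p(x) ~ logsumexp_k [log p(x, z_k) - log q(z_k | x)] - log K."""
    q = proposal(x)
    z = q.rsample((num_samples,))            # z_k ~ q(z | x), reparameterized
    log_w = log_joint(x, z) - q.log_prob(z)  # per-sample log importance weights
    return torch.logsumexp(log_w, dim=0) - torch.log(
        torch.tensor(float(num_samples))
    )
```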
URL: https://openreview.net/forum?id=sz0KRrQLip
---
Title: tensorFM: Low-Rank Approximations of Cross-Order Feature Interactions
Abstract: We address prediction problems on tabular categorical data, where each instance is defined by multiple categorical attributes, each taking values from a finite set. These attributes are often referred to as fields, and their categorical values as features. Such problems frequently arise in practical applications, including click-through rate prediction and social sciences. We introduce and analyze tensorFM, a new model that efficiently captures high-order interactions between attributes via a low-rank tensor approximation representing the strength of these interactions. Our model generalizes field-weighted factorization machines. Empirically, tensorFM demonstrates competitive performance with state-of-the-art methods. Additionally, its low latency makes it well-suited for time-sensitive applications, such as online advertising.
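For grounding, here is the classic second-order factorization machine that tensorFM generalizes, written with the standard $O(dk)$ pairwise identity; the paper's cross-order low-rank tensor extension is not reproduced here:

```python
import torch

def fm_forward(w0, w, V, x):
    """w0: scalar bias; w: (d,) linear weights; V: (d, k) factors; x: (d,)."""
    linear = w0 + w @ x
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * (||V^T x||^2 - sum_i ||v_i||^2 x_i^2)
    Vx = V.T @ x
    pairwise = 0.5 * (Vx @ Vx - ((V ** 2).T @ (x ** 2)).sum())
    return linear + pairwise
```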
URL: https://openreview.net/forum?id=oXmnm0RRi9
---