Weekly TMLR digest for May 04, 2025


TMLR

May 4, 2025, 12:00:11 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Survey Certification: Graph Theory-Based Deep Graph Similarity Learning: A Unified Survey of Pipeline, Techniques, and Challenges

Zhouyang LIU, Ning Liu, Yixin Chen, Ziqing Wen, Jiezhong He, Dongsheng Li

https://openreview.net/forum?id=fHf4jbIfex

---


Featured Certification, Reproducibility Certification: Faithful Interpretation for Graph Neural Networks

Lijie Hu, Tianhao Huang, Lu Yu, Wanyu Lin, Tianhang Zheng, Di Wang

https://openreview.net/forum?id=Y8EspxaksH

---


Featured Certification: Random Policy Enables In-Context Reinforcement Learning within Trust Horizons

Weiqin Chen, Santiago Paternain

https://openreview.net/forum?id=mAiMKnr9r5

---


Accepted papers
===============


Title: Personalized Layer Selection for Graph Neural Networks

Authors: Kartik Sharma, Vineeth Rakesh, Yingtong Dou, Srijan Kumar, Mahashweta Das

Abstract: Graph Neural Networks (GNNs) combine node attributes over a fixed granularity of the local graph structure around a node to predict its label. However, different nodes may relate to a node-level property at a different granularity of their local neighborhood, and using the same level of smoothing for all nodes can be detrimental to their classification. In this work, we challenge the common assumption that a single GNN layer can classify all nodes of a graph by training GNNs with a distinct personalized layer for each node. Inspired by metric learning, we propose a novel algorithm, MetSelect, to select the optimal representation layer to classify each node. In particular, we identify a prototype representation of each class in a transformed GNN layer and then classify each node using the layer in which its distance to a class prototype, normalized by that layer's variance, is smallest. Results on 10 datasets and 3 different GNNs show that we significantly improve the node classification accuracy of GNNs in a plug-and-play manner. We also find that using variable layers for prediction enables GNNs to be deeper and more robust to poisoning attacks. We hope this work can inspire future works to learn more adaptive and personalized graph representations.

URL: https://openreview.net/forum?id=JyjTJAG9yZ
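
The per-node layer selection described above can be illustrated with a small sketch (the function name `select_layers`, the toy tensors, and the nearest-prototype rule below are illustrative assumptions, not the authors' MetSelect implementation):

```python
# Hypothetical sketch: pick, for each node, the GNN layer whose variance-normalized
# distance to the nearest class prototype is smallest (illustrative only).
import torch

def select_layers(layer_embs, labels, train_mask):
    """layer_embs: list of [N, d_l] tensors (one per GNN layer);
    labels: [N] int tensor; train_mask: [N] bool tensor of training nodes."""
    n = layer_embs[0].shape[0]
    num_classes = int(labels.max()) + 1
    best_dist = torch.full((n,), float("inf"))
    best_layer = torch.zeros(n, dtype=torch.long)
    for l, H in enumerate(layer_embs):
        # Class prototypes computed from training nodes at this layer.
        protos = torch.stack([H[train_mask & (labels == c)].mean(0)
                              for c in range(num_classes)])          # [C, d_l]
        scale = H[train_mask].var(unbiased=False).sqrt() + 1e-8      # layer-wise scale
        d = torch.cdist(H, protos).min(dim=1).values / scale         # nearest-prototype distance
        better = d < best_dist
        best_dist[better], best_layer[better] = d[better], l
    return best_layer  # index of the layer used to classify each node

# Toy usage with random features standing in for per-layer GNN outputs.
N, C = 50, 3
embs = [torch.randn(N, 16) for _ in range(4)]
labels = torch.arange(N) % C
train_mask = torch.arange(N) < N // 2
print(select_layers(embs, labels, train_mask)[:10])
```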

---

Title: Graph Theory-Based Deep Graph Similarity Learning: A Unified Survey of Pipeline, Techniques, and Challenges

Authors: Zhouyang LIU, Ning Liu, Yixin Chen, Ziqing Wen, Jiezhong He, Dongsheng Li

Abstract: Graph similarity computation, which measures the resemblance between graphs, is a crucial operation in fields such as graph search. Recent advances in graph neural networks have enabled the embedding of graphs into low-dimensional vector spaces, where the similarity or distance between graphs can be efficiently quantified. However, these methods are often tailored to specific tasks and function as black boxes, limiting both generalization and interpretability. To address these challenges, there is growing interest in incorporating domain-agnostic and interpretable concepts from graph theory—such as subgraph isomorphism, maximum common subgraph, and graph edit distance—into graph similarity learning as training objectives. This survey presents a comprehensive review of recent advancements in deep graph similarity learning, focusing on models that integrate these graph theory concepts. Despite the different training objectives of these approaches, they share significant commonalities in the training pipeline, techniques, and challenges. We analyze them within a unified lens referred to as graph theory-based deep similarity learning (GTDGSL) methods. We systematically compare existing GTDGSL methods alongside their common training pipeline, highlighting the technique trend and discussing key challenges, applications, and future research directions in this domain. We organize the papers included in this survey and their open-source implementations at https://github.com/liuzhouyang/Graph-Theory-Based-Deep-Graph-Similarity-Learning-Survey.

URL: https://openreview.net/forum?id=fHf4jbIfex

---

Title: HyperVQ: MLR-based Vector Quantization in Hyperbolic Space

Authors: Nabarun Goswami, Yusuke Mukuta, Tatsuya Harada

Abstract: The success of models operating on tokenized data has heightened the need for effective tokenization methods, particularly in vision and auditory tasks where inputs are naturally continuous. A common solution is to employ Vector Quantization (VQ) within VQ Variational Autoencoders (VQVAEs), transforming inputs into discrete tokens by clustering embeddings in Euclidean space. However, Euclidean embeddings not only suffer from inefficient packing and limited separation—due to their polynomial volume growth—but are also prone to codebook collapse, where only a small subset of codebook vectors are effectively utilized. To address these limitations, we introduce HyperVQ, a novel approach that formulates VQ as a hyperbolic Multinomial Logistic Regression (MLR) problem, leveraging the exponential volume growth in hyperbolic space to mitigate collapse and improve cluster separability. Additionally, HyperVQ represents codebook vectors as geometric representatives of hyperbolic decision hyperplanes, encouraging disentangled and robust latent representations. Our experiments demonstrate that HyperVQ matches traditional VQ in generative and reconstruction tasks, while surpassing it in discriminative performance and yielding a more efficient and disentangled codebook.

URL: https://openreview.net/forum?id=WgJgIULL9Q

---

Title: A Gold Standard Dataset for the Reviewer Assignment Problem

Authors: Ivan Stelmakh, John Frederick Wieting, Yang Xi, Graham Neubig, Nihar B Shah

Abstract: Many peer-review venues are using algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the “similarity score” — a numerical estimate of the expertise of a reviewer in reviewing a paper — and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose an algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better ones is the lack of publicly available gold-standard data that would be needed to perform reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously.

We use this data to compare several popular algorithms currently employed in computer science conferences and come up with recommendations for stakeholders. Our four main findings are:
- All algorithms make a non-trivial amount of error. For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases, thereby highlighting the vital need for more research on the similarity-computation problem.
- Most specialized algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter2 algorithm performs best.
- The classical TF-IDF algorithm which can use full texts of papers is on par with Specter2 that uses only titles and abstracts.
- The performance of off-the-shelf LLMs is worse than the specialized algorithms.
We encourage researchers to participate in our survey and contribute their data to the dataset here: https://forms.gle/SP1Rh8eivGz54xR37

URL: https://openreview.net/forum?id=XofMHO5yVY

---

Title: Optimizing Cycle Life Prediction of Lithium-ion Batteries via a Physics-Informed Model

Authors: Nathan Sun, Daniel Nicolae, Sara Sameer, Karena Yan

Abstract: Accurately measuring the cycle lifetime of commercial lithium-ion batteries is crucial for performance and technology development. We introduce a novel hybrid approach combining a physics-based equation with a self-attention model to predict the cycle lifetimes of commercial lithium iron phosphate/graphite cells from early-cycle data. After fitting capacity loss curves to this physics-based equation, we then use a self-attention layer to reconstruct entire battery capacity loss curves. Our model exhibits comparable performance to existing models while predicting more information: the entire capacity loss curve instead of only the cycle life. This provides more robustness and interpretability: our model does not need to be retrained for a different notion of end-of-life and is backed by physical intuition.

URL: https://openreview.net/forum?id=1weZ9Wsajk

---

Title: When SNN meets ANN: Error-Free ANN-to-SNN Conversion for Extreme Edge Efficiency

Authors: Gourav Datta, Zeyu Liu, James Diffenderfer, Bhavya Kailkhura, Peter Anthony Beerel

Abstract: Spiking Neural Networks (SNN) are now demonstrating comparable accuracy to convolutional neural networks (CNN), thanks to advanced ANN-to-SNN conversion techniques, all while delivering remarkable energy and latency efficiency when deployed on neuromorphic hardware. However, these conversion techniques incur a large number of time steps, and consequently, high spiking activity. In this paper, we propose a novel ANN-to-SNN conversion framework that incurs an exponentially lower number of time steps compared to that required by existing conversion approaches. Our framework modifies the standard integrate-and-fire (IF) neuron model used in SNNs with no change in computational complexity and shifts the bias term of each batch normalization (BN) layer in the trained ANN. To reduce spiking activity, we propose training the source ANN with a fine-grained $\ell_1$ regularizer with surrogate gradients that encourages high spike sparsity in the converted SNN. Our proposed framework thus yields lossless SNNs with low latency and low compute energy, thanks to the few time steps and high spike sparsity, and high test accuracy, for example, $75.12$% with only $4$ time steps on the ImageNet dataset. Code is available at https://github.com/godatta/SNN_meets_ANN.

URL: https://openreview.net/forum?id=WOwQKguWT0
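
As a point of reference for the mechanism the abstract builds on, here is a minimal integrate-and-fire (IF) simulation with a simple L1-style spike-sparsity penalty (a generic illustration of the ingredients named above; the authors' modified neuron, BN bias shift, and surrogate-gradient training are not reproduced, and the threshold and sizes are made-up values):

```python
# Generic IF neuron unrolled over T time steps, plus a spike-sparsity penalty term.
import torch

def if_neuron(inputs, threshold=1.0):
    """inputs: [T, N] per-step input currents; returns [T, N] binary spikes."""
    v = torch.zeros(inputs.shape[1])
    spikes = []
    for x_t in inputs:
        v = v + x_t                      # integrate
        s = (v >= threshold).float()     # fire when the membrane potential crosses threshold
        v = v - s * threshold            # reset by subtraction
        spikes.append(s)
    return torch.stack(spikes)

T, N = 4, 8
spk = if_neuron(torch.rand(T, N))
# An L1-style penalty on spiking activity; in practice it needs surrogate gradients
# to be trainable, since the thresholding above is non-differentiable.
sparsity_penalty = spk.abs().mean()
print(spk.sum().item(), sparsity_penalty.item())
```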

---

Title: ASTRA: A Scene-aware Transformer-based Model for Trajectory Prediction

Authors: Izzeddin Teeti, Aniket Thomas, Munish Monga, Sachin Kumar Giroh, Uddeshya Singh, Andrew Bradley, Biplab Banerjee, Fabio Cuzzolin

Abstract: We present ASTRA (A Scene-aware Transformer-based model for trajectory prediction), a lightweight pedestrian trajectory forecasting model that integrates scene context, spatial dynamics, social inter-agent interactions and temporal progressions for precise forecasting. We utilised a U-Net-based feature extractor, via its latent vector representation, to capture scene representations and a graph-aware transformer encoder for capturing social interactions. These components are integrated to learn an agent-scene-aware embedding, enabling the model to learn spatial dynamics and forecast the future trajectory of pedestrians. The model is designed to produce both deterministic and stochastic outcomes, with the stochastic predictions being generated by incorporating a Conditional Variational Auto-Encoder (CVAE). ASTRA also proposes a simple yet effective weighted penalty loss function, which helps to yield predictions that outperform a wide array of state-of-the-art deterministic and generative models. ASTRA demonstrates average improvements of 27%/10% in deterministic/stochastic settings on the ETH-UCY dataset and 26% on the PIE dataset, along with seven times fewer parameters than the existing state-of-the-art model (see Figure 1). Additionally, the model's versatility allows it to generalize across different perspectives, such as Bird's Eye View (BEV) and Ego-Vehicle View (EVV).

URL: https://openreview.net/forum?id=fqSVqPcaVi

---

Title: Leveraging Unlabeled Data Sharing through Kernel Function Approximation in Offline Reinforcement Learning

Authors: Yen Ru Lai, Fu-Chieh Chang, Pei-Yuan Wu

Abstract: Offline reinforcement learning (RL) learns policies from a fixed dataset but often requires large amounts of data. The challenge arises when labeled datasets are expensive, especially when rewards have to be provided by human labelers for large datasets. In contrast, unlabeled data tends to be less expensive. This situation highlights the importance of finding effective ways to use unlabeled data in offline RL, especially when labeled data is limited or expensive to obtain. In this paper, we present an algorithm that utilizes unlabeled data in offline RL with kernel function approximation and provide a theoretical guarantee. We characterize various eigenvalue decay conditions of $\mathcal{H}_k$ that determine the complexity of the algorithm. In summary, our work provides a promising approach for exploiting the advantages offered by unlabeled data in offline RL, whilst maintaining theoretical assurances.

URL: https://openreview.net/forum?id=78N9tCL6Ly

---

Title: Variational Stochastic Gradient Descent for Deep Neural Networks

Authors: Anna Kuzina, Haotian Chen, Babak Esmaeili, Jakub M. Tomczak

Abstract: Optimizing deep neural networks is one of the main tasks in successful deep learning. Current state-of-the-art optimizers are adaptive gradient-based optimization methods such as Adam. Recently, there has been an increasing interest in formulating gradient-based optimizers in a probabilistic framework for better modeling the uncertainty of the gradients. Here, we propose to combine both approaches, resulting in the Variational Stochastic Gradient Descent (VSGD) optimizer. We model gradient updates as a probabilistic model and utilize stochastic variational inference (SVI) to derive an efficient and effective update rule. Further, we show how our VSGD method relates to other adaptive gradient-based optimizers like Adam. Lastly, we carry out experiments on two image classification datasets and four deep neural network architectures, where we show that VSGD outperforms Adam and SGD.

URL: https://openreview.net/forum?id=xu4ATNjcdy

---

Title: Non-Myopic Multi-Objective Bayesian Optimization

Authors: Syrine Belakaria, Alaleh Ahmadian, Barbara E Engelhardt, Stefano Ermon, Jana Doppa

Abstract: We consider the problem of finite-horizon sequential experimental design to solve multi-objective optimization (MOO) of expensive black-box objective functions. This problem arises in many real-world applications, including materials design, where we have a small resource budget to make and evaluate candidate materials in the lab. We solve this problem using the framework of Bayesian optimization (BO) and propose the first set of non-myopic methods for MOO problems. Prior work on non-myopic BO for single-objective problems relies on the Bellman optimality principle to handle the lookahead reasoning process. However, this principle does not hold for most MOO problems because the reward function needs to satisfy some conditions: scalar variable, monotonicity, and additivity. We address this challenge by using hypervolume improvement (HVI) as our scalarization approach, which allows us to use a lower-bound on the Bellman equation to approximate the finite-horizon using a batch expected hypervolume improvement (EHVI) acquisition function (AF) for MOO. Our formulation naturally allows us to use other improvement-based scalarizations and compare their efficacy to HVI. We derive three non-myopic AFs for MOBO: 1) the Nested AF, which is based on the exact computation of the lower bound, 2) the Joint AF, which is a lower bound on the nested AF, and 3) the BINOM AF, which is a fast and approximate variant based on batch multi-objective acquisition functions. Our experiments on multiple diverse real-world MO problems demonstrate that our non-myopic AFs substantially improve performance over the existing myopic AFs for MOBO.

URL: https://openreview.net/forum?id=2e1aZZd88C
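
Since the scalarization the abstract relies on is hypervolume improvement (HVI), a tiny two-objective example may help fix ideas (toy points and reference point; this is a plain HVI computation, not the paper's nested, joint, or BINOM acquisition functions):

```python
# Hypervolume and hypervolume improvement for a 2-objective maximization problem.
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by `points` (maximization) above the reference point `ref`."""
    pts = np.array([p for p in points if (p > ref).all()])
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]            # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                           # only non-dominated slabs add area
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

front = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])
ref = np.array([0.0, 0.0])
candidate = np.array([2.5, 2.5])
hvi = hypervolume_2d(np.vstack([front, candidate]), ref) - hypervolume_2d(front, ref)
print(hvi)   # positive HVI: the candidate enlarges the dominated region
```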

---

Title: Investigating Continual Pretraining in Large Language Models: Insights and Implications

Authors: Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis

Abstract: Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Since existing works concentrate mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect the knowledge transfer within these models.
Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B-parameter models studied in this work and is also superior to domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting, (iv) continual pretraining boosts downstream task performance of the GPT-2 family, (v) continual pretraining enables LLMs to specialize better when the sequence of domains shows semantic similarity, while randomizing the training domains leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.

URL: https://openreview.net/forum?id=aKjJoEVKgO

---

Title: Efficient Exploration in Multi-Agent Reinforcement Learning via Farsighted Self-Direction

Authors: Tiancheng Lao, Xudong Guo, Mengge Liu, Junjie Yu, Yi Liu, Wenhui Fan

Abstract: Multi-agent reinforcement learning faces greater challenges with efficient exploration than its single-agent counterpart, primarily due to the exponential growth in state and action spaces. Methods based on intrinsic rewards have been shown to effectively enhance exploration efficiency in multi-agent scenarios. However, these methods are plagued by instability during training and biases in exploration direction. To address these challenges, we propose Farsighted Self-Direction (FSD), a novel model-free method that utilizes a long-term exploration bonus to achieve coordinated exploration. Since prediction error against individual Q-values indicates a potential bonus for committed exploration, it is taken into account in action selection to directly guide the coordinated exploration. Further, we also use clipped double Q-learning to reduce noise in the prediction error. We validate the method on didactic examples and demonstrate that it outperforms existing methods on challenging StarCraft II micromanagement tasks.

URL: https://openreview.net/forum?id=NUV8THrLZC

---

Title: Faithful Interpretation for Graph Neural Networks

Authors: Lijie Hu, Tianhao Huang, Lu Yu, Wanyu Lin, Tianhang Zheng, Di Wang

Abstract: Currently, attention mechanisms have garnered increasing attention in Graph Neural Networks (GNNs), such as Graph Attention Networks (GATs) and Graph Transformers (GTs). This is due to not only the commendable boost in performance they offer but also their capacity to provide a more lucid rationale for model behaviors, which are often viewed as inscrutable. However, Attention-based GNNs have demonstrated instability in interpretability when subjected to various sources of perturbations during both training and testing phases, including factors like additional edges or nodes. In this paper, we propose a solution to this problem by introducing a novel notion called Faithful Graph Attention-based Interpretation (FGAI). In particular, FGAI has four crucial properties in terms of stability and sensitivity to interpretation and the final output distribution. Built upon this notion, we propose an efficient methodology for obtaining FGAI, which can be viewed as an ad hoc modification to the canonical Attention-based GNNs. To validate our proposed solution, we introduce two novel metrics tailored for graph interpretation assessment. Experimental results demonstrate that FGAI exhibits superior stability and preserves the interpretability of attention under various forms of perturbations and randomness, which makes FGAI a more faithful and reliable explanation tool.

URL: https://openreview.net/forum?id=Y8EspxaksH

---

Title: Random Policy Enables In-Context Reinforcement Learning within Trust Horizons

Authors: Weiqin Chen, Santiago Paternain

Abstract: Pretrained foundation models (FMs) have exhibited extraordinary in-context learning performance, allowing zero-shot (or few-shot) generalization to new environments/tasks not encountered during pretraining. In the case of reinforcement learning (RL), in-context RL (ICRL) emerges when pretraining FMs on decision-making problems in an autoregressive-supervised manner. Nevertheless, the current state-of-the-art ICRL algorithms, such as Algorithm Distillation, Decision Pretrained Transformer and Decision Importance Transformer, impose stringent requirements on the pretraining dataset concerning the behavior (source) policies, context information, and action labels. Notably, these algorithms either demand optimal policies or require varying degrees of well-trained behavior policies for all pretraining environments. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial volume of real-world training environments can be prohibitively expensive or even intractable. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that makes it possible to generate an effective pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling the outstanding state-action pairs from the entire state and action spaces using random policies within a trust horizon, and then inherits the classical autoregressive-supervised mechanism during pretraining. To the best of our knowledge, this is the first work that enables effective ICRL under (e.g., uniform) random policies and random contexts. We also establish a quantitative analysis of the trustworthiness as well as the performance guarantees of our SAD approach. Moreover, our empirical results across multiple popular ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 236.3% in the offline evaluation and by 135.2% in the online evaluation.

URL: https://openreview.net/forum?id=mAiMKnr9r5
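
The labeling idea described above, scoring actions with short random-policy rollouts inside a trust horizon, can be sketched on a toy chain MDP; the environment, horizon, rollout count, and function names below are placeholders and not the paper's environments or exact SAD procedure:

```python
# Toy illustration: label a query state with the action whose short random-rollout
# return is highest (a rough analogue of the "trust horizon" idea; assumptions only).
import numpy as np

N_STATES = 10

def step(s, a):
    """Chain MDP: action 1 moves right, action 0 moves left; reward only at the right end."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s2, float(s2 == N_STATES - 1)

def label_action(s, horizon=5, rollouts=20, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for a in (0, 1):
        total = 0.0
        for _ in range(rollouts):
            s2, r = step(s, a)
            ret = r
            for _ in range(horizon - 1):         # continue with a uniformly random policy
                s2, r = step(s2, int(rng.integers(2)))
                ret += r
            total += ret
        scores.append(total / rollouts)
    return int(np.argmax(scores))

# States closer to the goal tend to be labeled 1 ("move right").
print([label_action(s) for s in range(N_STATES)])
```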

---

Title: Semantic-Syntactic Discrepancy in Images (SSDI): Learning Meaning and Order of Features from Natural Images

Authors: Chun Tao, Timur Ibrayev, Kaushik Roy

Abstract: Despite considerable progress in image classification tasks, classification models seem unaffected by images that significantly deviate from those that appear natural to the human eye. Specifically, while human perception can easily identify abnormal appearances or compositions in images, classification models overlook alterations in the arrangement of object parts as long as the parts are present in some order, even an unnatural one. Hence, this work exposes the vulnerability to semantic-syntactic discrepancy in images (SSDI), in the form of corruptions that remove or shuffle image patches or present images as puzzles. To address this vulnerability, we propose the concept of "image grammar", comprising "image semantics" and "image syntax". Image semantics pertains to the interpretation of parts or patches within an image, whereas image syntax refers to the arrangement of these parts to form a coherent object. We present a semi-supervised two-stage method for learning the image grammar of visual elements and environments solely from natural images. While the first stage learns the semantic meaning of individual object parts, the second stage learns how their relative arrangement constitutes an entire object. The efficacy of the proposed approach is then demonstrated by achieving SSDI detection rates ranging from 70% to 90% on corruptions generated from the CelebA and SUN-RGBD datasets. Code is publicly available at: https://github.com/ChunTao1999/SSDI/.

URL: https://openreview.net/forum?id=8otbGorZK2
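
One of the SSDI corruptions mentioned above, shuffling image patches while keeping all parts present, is easy to reproduce in a few lines (patch size, seed, and the toy array are illustrative; this is not the authors' corruption pipeline):

```python
# Shuffle non-overlapping patches of an image: semantics (parts) preserved, syntax (order) broken.
import numpy as np

def shuffle_patches(img, patch=8, seed=0):
    """img: [H, W, C] array with H and W divisible by `patch`."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    patches = (img.reshape(gh, patch, gw, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(gh * gw, patch, patch, c))
    rng = np.random.default_rng(seed)
    patches = patches[rng.permutation(len(patches))]      # reorder the parts
    return (patches.reshape(gh, gw, patch, patch, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(h, w, c))

img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
corrupted = shuffle_patches(img)
# Same pixel values, different arrangement.
print(np.allclose(np.sort(img, axis=None), np.sort(corrupted, axis=None)))
```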

---

Title: Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)

Authors: Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, John A. Doucette, David Rabinowitz, Leslie Barrett, Tom Ault, Hai Phan

Abstract: Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.

URL: https://openreview.net/forum?id=sSAp8ITBpC

---

Title: Reinforcement Learning from Bagged Reward

Authors: Yuting Tang, Xin-Qiang Cai, Yao-Xiang Ding, Qiyu Wu, Guoqing Liu, Masashi Sugiyama

Abstract: In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent, helping the agent maximize cumulative rewards to obtain the optimal policy. However, in many real-world scenarios, designing immediate reward signals is difficult; instead, agents receive a single reward that is contingent upon a partial sequence or a complete trajectory. In this work, we define this challenging problem as RL from Bagged Reward (RLBR), where sequences of data are treated as bags with non-Markovian bagged rewards, leading to the formulation of Bagged Reward Markov Decision Processes (BRMDPs). Theoretically, we demonstrate that RLBR can be addressed by solving a standard MDP with properly redistributed bagged rewards allocated to each instance within a bag. Empirically, we find that reward redistribution becomes more challenging as the bag length increases, due to reduced informational granularity. Existing reward redistribution methods are insufficient to address these challenges. Therefore, we propose a novel reward redistribution method equipped with a bidirectional attention mechanism, enabling the accurate interpretation of contextual nuances and temporal dependencies within each bag. We experimentally demonstrate that the proposed method consistently outperforms existing approaches.

URL: https://openreview.net/forum?id=bXUipBbZDA
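
For context on the problem setup, the simplest reward-redistribution baseline, spreading each bagged reward uniformly over the transitions in its bag to recover a standard per-step reward signal, looks like this (illustrative data; the paper's method instead learns the redistribution with a bidirectional attention model):

```python
# Naive uniform redistribution of bagged rewards into per-step rewards.
def redistribute_uniform(bags):
    """bags: list of (transitions, bagged_reward) pairs; returns a flat list of
    (transition, per_step_reward) pairs usable as a standard MDP reward signal."""
    out = []
    for transitions, bag_reward in bags:
        r = bag_reward / len(transitions)
        out.extend((t, r) for t in transitions)
    return out

bags = [([("s0", "a0"), ("s1", "a1")], 1.0),   # one reward for a two-step bag
        ([("s2", "a0")], -0.5)]                # a single-step bag
print(redistribute_uniform(bags))
```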

---

Title: RS-Reg: Probabilistic and Robust Certified Regression through Randomized Smoothing

Authors: Aref Miri Rekavandi, Olga Ohrimenko, Benjamin I. P. Rubinstein

Abstract: Randomized smoothing has shown promising certified robustness against adversaries in classification tasks. Despite such success with only zeroth-order access to base models, randomized smoothing has not been extended to a general form of regression. By defining robustness in regression tasks flexibly through probabilities, we demonstrate how to establish upper bounds on input data point perturbation (using the $\ell_2$ norm) for a user-specified probability of observing valid outputs. Furthermore, we showcase the asymptotic property of a basic averaging function in scenarios where the regression model operates without any constraint. We then derive a certified upper bound of the input perturbations when dealing with a family of regression models where the outputs are bounded. Our simulations verify the validity of the theoretical results and reveal the advantages and limitations of simple smoothing functions, i.e., averaging, in regression tasks. The code is publicly available at \url{https://github.com/arekavandi/Certified_Robust_Regression}.

URL: https://openreview.net/forum?id=AcLlg4J52H
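
The basic averaging smoothing function discussed in the abstract has a simple Monte Carlo form, g(x) = E[f(x + eps)] with eps ~ N(0, sigma^2 I); the snippet below is an illustrative estimate with a stand-in base model, not the paper's certification procedure:

```python
# Monte Carlo estimate of an averaging-based smoothed regressor.
import numpy as np

def smoothed_prediction(f, x, sigma=0.25, n=1000, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n, x.shape[0]))
    return float(np.mean([f(x + z) for z in noise]))

f = lambda x: np.sin(x).sum()            # stand-in (bounded) regression model
x = np.array([0.3, -1.2, 2.0])
print(f(x), smoothed_prediction(f, x))   # base prediction vs. smoothed prediction
```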

---

Title: A functional framework for nonsmooth autodiff with {\it maxpooling} functions

Authors: Bruno Després

Abstract: We comment on the recent work by Boustany by showing that the Murat-Trombetti theorem provides a simple and efficient mathematical framework for nonsmooth automatic differentiation of {\it maxpooling} functions. In particular it gives a chain rule formula which correctly defines the composition of Lipschitz-continuous functions which are piecewise $C^1$. The formalism is applied to four basic examples, with some tests in PyTorch. A self-contained proof of an important Stampacchia formula is given in the appendix.

URL: https://openreview.net/forum?id=qahoztvThX
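
The kind of nonsmooth point the note analyzes is easy to probe directly in PyTorch; the snippet below simply inspects what reverse-mode autodiff returns for a max at a tie (an illustration, not the author's test suite, and the exact tie-breaking is implementation-dependent):

```python
# Autodiff through a max (i.e., global maxpooling) at a tie point.
import torch

x = torch.tensor([1.0, 1.0, 0.5], requires_grad=True)   # tie between the first two entries
y = x.max()                                              # maxpooling over the whole vector
y.backward()
# PyTorch returns one particular selection from the generalized gradient at the tie.
print(x.grad)
```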

---

Title: ∇QDARTS: Quantization as an Elastic Dimension to Differentiable NAS

Authors: Payman Behnam, Uday Kamal, Sanjana Vijay Ganesh, Zhaoyi Li, Michael Andrew Jurado, Alind Khare, Igor Fedorov, Gaowen Liu, Alexey Tumanov

Abstract: Differentiable Neural Architecture Search methods efficiently find high-accuracy architectures using gradient-based optimization in a continuous domain, saving computational resources. Mixed-precision search helps optimize precision within a fixed architecture. However, applying it to a NAS-generated network does not assure optimal performance as the optimized quantized architecture may not emerge from a standalone NAS method. In light of these considerations, this paper introduces ∇QDARTS, a novel approach that combines differentiable NAS with mixed-precision search for both weights and activations. ∇QDARTS aims to identify the optimal mixed-precision neural architecture capable of achieving remarkable accuracy while operating with minimal computational requirements in a single-shot, end-to-end differentiable framework, obviating the need for pretraining and proxy methods. Compared to fp32, ∇QDARTS shows impressive performance on CIFAR10 with (2,4) bit precision, reducing bit operations by 160× with a slight 1.57% accuracy drop. Increasing the capacity enables ∇QDARTS to match fp32 accuracy while reducing bit operations by 18×. For the ImageNet dataset, with just (2,4) bit precision, ∇QDARTS outperforms state-of-the-art methods such as APQ, SPOS, OQA, and MNAS by 2.3%, 2.9%, 0.3%, and 2.7% in terms of accuracy. By incorporating (2,4,8) bit precision, ∇QDARTS further minimizes the accuracy drop to 1% compared to fp32, alongside a substantial reduction of 17× in required bit operations and 2.6× in memory footprint. In terms of bit operations (memory footprint), ∇QDARTS excels over APQ, SPOS, OQA, and MNAS at similar accuracy by 2.3× (12×), 2.4× (3×), 13% (6.2×), and 3.4× (37%), respectively. ∇QDARTS also enhances the overall search and training efficiency, achieving a 3.1× and 1.54× improvement over APQ and OQA, respectively.

URL: https://openreview.net/forum?id=ubrOSWyTS8

---

Title: Piecewise Constant Spectral Graph Neural Network

Authors: Vahan Martirosyan, Jhony H. Giraldo, Fragkiskos D. Malliaros

Abstract: Graph Neural Networks (GNNs) have achieved significant success across various domains by leveraging graph structures in data. Existing spectral GNNs, which use low-degree polynomial filters to capture graph spectral properties, may not fully identify the graph's spectral characteristics because of the polynomial's small degree. However, increasing the polynomial degree is computationally expensive and, beyond certain thresholds, leads to performance plateaus or degradation. In this paper, we introduce the Piecewise Constant Spectral Graph Neural Network (PieCoN) to address these challenges. PieCoN combines constant spectral filters with polynomial filters to provide a more flexible way to leverage the graph structure. By adaptively partitioning the spectrum into intervals, our approach increases the range of spectral properties that can be effectively learned. Experiments on nine benchmark datasets, including both homophilic and heterophilic graphs, demonstrate that PieCoN is particularly effective on heterophilic datasets, highlighting its potential for a wide range of applications.

URL: https://openreview.net/forum?id=sTdVnDW0HX

---

Title: Global Graph Counterfactual Explanation: A Subgraph Mapping Approach

Authors: Yinhan He, Wendy Zheng, Yaochen Zhu, Jing Ma, Saumitra Mishra, Natraj Raman, Ninghao Liu, Jundong Li

Abstract: Graph Neural Networks (GNNs) have been widely deployed in various real-world applications. However, most GNNs are black-box models that lack explanations. One strategy to explain GNNs is through counterfactual explanation, which aims to find minimum perturbations on input graphs that change the GNN predictions. Existing works on GNN counterfactual explanations primarily concentrate on the local-level perspective (i.e., generating counterfactuals for each individual graph), which suffers from information overload and lacks insights into the broader cross-graph relationships. To address such issues, we propose GlobalGCE, a novel global-level graph counterfactual explanation method. GlobalGCE aims to identify a collection of subgraph mapping rules as counterfactual explanations for the target GNN. According to these rules, substituting certain significant subgraphs with their counterfactual subgraphs will change the GNN prediction to the desired class for most graphs (i.e., maximum coverage). Methodologically, we design a significant subgraph generator and a counterfactual subgraph autoencoder in our GlobalGCE, where the subgraphs and the rules can be effectively generated. Extensive experiments demonstrate the superiority of our GlobalGCE compared to existing baselines. Our code can be found at \url{https://github.com/YinhanHe123/GlobalGCE}.

URL: https://openreview.net/forum?id=KQzJYI6eo0

---

Title: Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers

Authors: Peter Ochieng

Abstract: This work proposes a novel setup where a neural network is trained to predict multiple steps of the reverse diffusion process in an unrolled manner, with successive layers corresponding to equally spaced steps in the diffusion schedule. Each layer progressively denoises the input during the reverse process until the final layer estimates the original input, $x_0$. Additionally, we introduce a new learning target by using latent variables, rather than the conventional approach of predicting the original input $x_0$ or source error $\epsilon_0$. In speech synthesis, using $x_0$ or $\epsilon_0$ often leads to large prediction errors in the early stages of the denoising process, causing distortion in the recovered speech. Our method mitigates this issue and, through extensive evaluation, demonstrates the generation of high-fidelity speech in competitive time, outperforming current state-of-the-art techniques. Moreover, the proposed approach generalizes well to unseen speech. Sample audio is available at \url{https://onexpeters.github.io/UDPNet/}.

URL: https://openreview.net/forum?id=F6l3BBPElY

---

Title: Gaussian Pre-Activations in Neural Networks: Myth or Reality?

Authors: Pierre Wolinski, Julyan Arbel

Abstract: The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. A very common assumption is that the pre-activations are Gaussian. Although this convenient *Gaussian hypothesis* can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental work for finite-width neural networks. Our main contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network depth, even in narrow neural networks, under the assumption that the pre-activations are independent. In the process, we discover a set of constraints that a neural network should satisfy to ensure Gaussian pre-activations. In addition, we provide a critical review of the claims of the Edge of Chaos line of work and construct a non-asymptotic Edge of Chaos analysis. We also propose a unified view on the propagation of pre-activations, encompassing the framework of several well-known initialization procedures. More generally, our work provides a principled framework for addressing the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are guaranteed to be Gaussian? Our code is available on GitHub: https://github.com/p-wol/gaussian-preact/ .

URL: https://openreview.net/forum?id=goe6fv6iSh

---

Title: ReDistill: Residual Encoded Distillation for Peak Memory Reduction of CNNs

Authors: Fang Chen, Gourav Datta, Mujahid Al Rafi, Hyeran Jeon, Meng Tang

Abstract: The expansion of neural network sizes and the enhanced resolution of modern image sensors result in heightened memory and power demands to process modern computer vision models. In order to deploy these models in extremely resource-constrained edge devices, it is crucial to reduce their peak memory, which is the maximum memory consumed during the execution of a model. A naive approach to reducing peak memory is aggressive down-sampling of feature maps via pooling with large stride, which often results in unacceptable degradation in network performance. To mitigate this problem, we propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling. We apply our distillation method to multiple problems in computer vision, including image classification and diffusion-based image generation. For image classification, our method yields 4x-5x theoretical peak memory reduction with less degradation in accuracy for most CNN-based architectures. For diffusion-based image generation, our proposed distillation method yields a denoising network with 4x lower theoretical peak memory while maintaining decent diversity and fidelity for image generation.
Experiments demonstrate our method's superior performance compared to other feature-based and response-based distillation methods when applied to the same student network. The code is available at https://github.com/mengtang-lab/ReDistill.

URL: https://openreview.net/forum?id=akumIxQjNN

---

Title: Heterophily-informed Message Passing

Authors: Haishan Wang, Arno Solin, Vikas K Garg

Abstract: Graph neural networks (GNNs) are known to be vulnerable to oversmoothing due to their implicit homophily assumption. We mitigate this problem with a novel scheme that regulates the aggregation of messages, modulating the type and extent of message passing locally thereby preserving both the low and high-frequency components of information. Our approach relies solely on learnt embeddings, obviating the need for auxiliary labels, thus extending the benefits of heterophily-aware embeddings to broader applications, e.g. generative modelling. Our experiments, conducted across various data sets and GNN architectures, demonstrate performance enhancements and reveal heterophily patterns across standard classification benchmarks. Furthermore, application to molecular generation showcases notable performance improvements on chemoinformatics benchmarks.

URL: https://openreview.net/forum?id=9fPinz1iH2

---

Title: Robust Model Selection of Gaussian Graphical Models

Authors: Abrar Zahin, Rajasekhar Anguluri, Lalitha Sankar, Oliver Kosut, Gautam Dasarathy

Abstract: In Gaussian graphical model selection, noise-corrupted samples present significant challenges. It is known that even minimal amounts of noise can obscure the underlying structure, leading to fundamental identifiability issues. A recent line of work addressing this “robust model selection” problem narrows its focus to tree-structured graphical models. Even within this specific class of models, exact structure recovery is shown to be impossible. However, several algorithms have been developed that are known to provably recover the underlying tree structure up to an (unavoidable) equivalence class.
In this paper, we extend these results beyond tree-structured graphs. We first characterize the equivalence class up to which general graphs can be recovered in the presence of noise. Despite the inherent ambiguity (which we prove is unavoidable), the structure that can be recovered reveals local clustering information and global connectivity patterns in the underlying model. Such information is useful in a range of real-world problems, including power grids, social networks, protein-protein interactions, and neural structures. We then propose an algorithm which provably recovers the underlying graph up to the identified ambiguity. We further provide finite sample guarantees in the high-dimensional regime for our algorithm and validate our results through numerical simulations.

URL: https://openreview.net/forum?id=AIby9MQXbu

---

Title: Sample, estimate, aggregate: A recipe for causal discovery foundation models

Authors: Menghua Wu, Yujia Bao, Regina Barzilay, Tommi Jaakkola

Abstract: Causal discovery, the task of inferring causal structure from data, has the potential to uncover mechanistic insights from biological experiments, especially those involving perturbations. However, causal discovery algorithms over larger sets of variables tend to be brittle against misspecification or when data are limited. For example, single-cell transcriptomics measures thousands of genes, but the nature of their relationships is not known, and there may be as few as tens of cells per intervention setting. To mitigate these challenges, we propose a foundation model-inspired approach: a supervised model trained on large-scale, synthetic data to predict causal graphs from summary statistics — like the outputs of classical causal discovery algorithms run over subsets of variables and other statistical hints like inverse covariance. Our approach is enabled by the observation that typical errors in the outputs of a discovery algorithm remain comparable across datasets. Theoretically, we show that the model architecture is well-specified, in the sense that it can recover a causal graph consistent with graphs over subsets. Empirically, we train the model to be robust to misspecification and distribution shift using diverse datasets. Experiments on biological and synthetic data confirm that this model generalizes well beyond its training set, runs on graphs with hundreds of variables in seconds, and can be easily adapted to different underlying data assumptions.

URL: https://openreview.net/forum?id=h434zx5SX0

---

Title: A Learning-Based Framework for Fair and Scalable Solution Generation in Kidney Exchange Problems

Authors: William St-Arnaud, Margarida Carvalho, Golnoosh Farnadi

Abstract: Reinforcement learning and Generative Flow Networks, known as GFlowNets, present an exciting possibility for neural networks to model distributions across various data structures. In this paper, we broaden their applicability to data structures consisting of optimal solutions for a combinatorial problem. Concretely, we propose using Q-learning and various policy gradient methods, as well as GFlowNets to learn the distribution of optimal solutions for kidney exchange problems (KEPs). This could provide a useful tool for decision-making authorities, policymakers and clinicians, as it offers them multiple optimal or near-optimal solutions, and provides a complementary landscape to their traditional integer programming-based toolbox for promoting fairness and societal benefits. Our reinforcement learning-based framework trained on KEP instances provides an effective addition to computationally expensive exact approaches, notably mixed-integer programming. Our experiments thoroughly evaluate the quality of the solution sets sampled from the trained neural networks in terms of optimality, their scalability when dealing with real-sized KEP instances, and their capability to generate a diverse pool of solutions. We also cover the use of their efficient solution generation capabilities to improve fairness and simulate the evolution of the KEP pool in a dynamic setting. Our contribution is thus: 1) methodological, as it introduces a novel setting for reinforcement learning in addition to GFlowNets, 2) implementational, as it delves beyond the theory and details how to use conditional information, and 3) of practical significance, as it considers a specific combinatorial problem in the healthcare domain.

URL: https://openreview.net/forum?id=IizmQoF86Y

---


New submissions
===============


Title: The Logical Impossibility of Artificial General Intelligence (AGI)

Abstract: This paper proves the logical impossibility of Artificial General Intelligence (AGI) through eight proofs: seven are independent, while the eighth builds on four of the others but carries its own importance. Based on the logic by which this paper proves AGI to be logically impossible, we give pointers to research directions in Artificial Intelligence (AI).

URL: https://openreview.net/forum?id=J1fECDCy9V

---

Title: HDCS: Hierarchy Discovery and Critic Shaping for Reinforcement Learning with Automaton Specification

Abstract: Training reinforcement learning (RL) agents with scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. A deterministic finite-state automaton (DFA) provides a streamlined way to specify tasks in RL that surpasses the limitations of traditional discounted-return formulations. However, existing RL algorithms designed to address DFA tasks face unresolved challenges that hinder their practical application. One key issue is that subgoals in the DFA may exhibit hidden hierarchical structure, with some macro-subgoals comprising multiple micro-subgoals in certain orders. Without understanding this hierarchy, RL algorithms may struggle to efficiently solve tasks involving such macro-subgoals. Additionally, the sparse-reward problem remains inadequately addressed: previous approaches, such as potential-based reward shaping, often encounter inefficiencies or result in suboptimal solutions.
To address these challenges, we propose a novel RL framework, HDCS, designed to uncover the hierarchical structure of subgoals and accelerate the solving of DFA tasks without changing the original optimal policies. The framework operates in two phases: first, a hierarchical RL method is used to identify the prerequisites of subgoals and build the hierarchy; second, given any task specification (DFA), the subgoal hierarchy is incorporated into the task DFA to form a product DFA, and a simple and novel critic-shaping approach is proposed to accelerate the satisfaction of the product DFA without changing the optimal policies of the original problem. The effectiveness of HDCS is demonstrated through experiments conducted across various domains; in particular, compared with representative baselines, critic shaping can accelerate task solving by 2x or 3x.

URL: https://openreview.net/forum?id=BGoRme2MfG

---

Title: RILe: Reinforced Imitation Learning

Abstract: Acquiring complex behaviors is essential for artificially intelligent agents, yet learning these behaviors in high-dimensional settings poses a significant challenge due to the vast search space. Traditional reinforcement learning (RL) requires extensive manual effort for reward function engineering. Inverse reinforcement learning (IRL) uncovers reward functions from expert demonstrations but relies on an iterative process that is often computationally expensive. Imitation learning (IL) provides a more efficient alternative by directly comparing an agent’s actions to expert demonstrations; however, in high-dimensional environments, such direct comparisons often offer insufficient feedback for effective learning. We introduce RILe (Reinforced Imitation Learning), a framework that combines the strengths of imitation learning and inverse reinforcement learning to learn a dense reward function efficiently and achieve strong performance in high-dimensional tasks. RILe employs a novel trainer–student framework: the trainer learns an adaptive reward function, and the student uses this reward signal to imitate expert behaviors. By dynamically adjusting its guidance as the student evolves, the trainer provides nuanced feedback across different phases of learning. Our framework produces high-performing policies in high-dimensional tasks where direct imitation fails to replicate complex behaviors. We validate RILe in challenging robotic locomotion tasks, demonstrating that it significantly outperforms existing methods and achieves near-expert performance across multiple settings.

URL: https://openreview.net/forum?id=0pPTDOxLq7

---

Title: Local distribution-based adaptive oversampling for imbalanced regression

Abstract: Imbalanced regression occurs when continuous target variables have skewed distributions, creating sparse regions that are difficult for machine learning models to predict accurately. This issue particularly affects neural networks, which often struggle with imbalanced data. While class imbalance in classification has been extensively studied, imbalanced regression remains relatively unexplored, with few effective solutions. Existing approaches often rely on arbitrary thresholds to categorize samples as rare or frequent, ignoring the continuous nature of target distributions. These methods can produce synthetic samples that fail to improve model performance and may discard valuable information through undersampling. To address these limitations, we propose LDAO (Local Distribution-based Adaptive Oversampling), a novel data-level approach that avoids categorizing individual samples as rare or frequent. Instead, LDAO learns the global distribution structure by decomposing the dataset into a mixture of local distributions, each preserving its statistical characteristics. LDAO then models and samples from each local distribution independently before merging them into a balanced training set. LDAO achieves a balanced representation across the entire target range while preserving the inherent statistical structure within each local distribution. In extensive evaluations on 45 imbalanced datasets, LDAO outperforms state-of-the-art oversampling methods on both frequent and rare target values, demonstrating its effectiveness for addressing the challenge of imbalanced regression.

URL: https://openreview.net/forum?id=6qYTR9iJdm
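
A rough sketch in the spirit of local-distribution-based oversampling (not the LDAO algorithm itself: the cluster count, the per-cluster Gaussian model, and the equal-allocation rule are all illustrative assumptions) could look like this:

```python
# Cluster the joint (features, target) data, fit a Gaussian per local cluster,
# and draw an equal number of synthetic samples from each cluster.
import numpy as np
from sklearn.cluster import KMeans

def oversample_local(X, y, k=5, per_cluster=200, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.column_stack([X, y])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    samples = []
    for c in range(k):
        Zc = Z[labels == c]
        mu = Zc.mean(axis=0)
        cov = np.cov(Zc, rowvar=False) + 1e-6 * np.eye(Z.shape[1])   # regularize for stability
        samples.append(rng.multivariate_normal(mu, cov, size=per_cluster))
    Znew = np.vstack(samples)
    return Znew[:, :-1], Znew[:, -1]   # synthetic features and targets, balanced across clusters

X = np.random.randn(500, 3)
y = X[:, 0] ** 2 + 0.1 * np.random.randn(500)    # skewed target distribution
Xs, ys = oversample_local(X, y)
print(Xs.shape, ys.shape)
```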

---

Title: Keep your distance: learning dispersed embeddings on $\mathbb{S}_{m}$

Abstract: Learning well-separated features in high-dimensional spaces, such as text or image embeddings, is crucial for many machine learning applications. Achieving such separation can be effectively accomplished through the dispersion of embeddings, where unrelated vectors are pushed apart as much as possible. By constraining features to be on a hypersphere, we can connect dispersion to well-studied problems in mathematics and physics, where optimal solutions are known for limited low-dimensional cases. However, in representation learning we typically deal with a large number of features in high-dimensional space, and moreover, dispersion is usually traded off with some other task-oriented training objective, making existing theoretical and numerical solutions inapplicable. Therefore, it is common to rely on gradient-based methods to encourage dispersion, usually by minimizing some function of the pairwise distances. In this work, we first give an overview of existing methods from disconnected literature, making new connections and highlighting similarities. Next, we introduce some new angles. We propose to reinterpret pairwise dispersion using a maximum mean discrepancy (MMD) motivation. We then propose an online variant of the celebrated Lloyd's algorithm, of K-Means fame, as an effective alternative regularizer for dispersion on generic domains. Finally, we derive a novel dispersion method that directly exploits properties of the hypersphere. Our experiments show the importance of dispersion in image classification and natural language processing tasks, and how algorithms exhibit different trade-offs in different regimes.

URL: https://openreview.net/forum?id=5JIQE6HcTd
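
As a concrete instance of the gradient-based dispersion objectives surveyed above, here is a common pairwise-similarity regularizer on the unit hypersphere (the log-sum-exp form and temperature are illustrative choices, not the paper's proposed MMD, online-Lloyd, or hypersphere-specific methods):

```python
# Minimize a log-sum-exp of pairwise cosine similarities to spread points on S^{m-1}.
import torch

def dispersion_loss(W, t=0.1):
    """W: [n, m] embeddings; lower loss means more dispersed points on the sphere."""
    Z = torch.nn.functional.normalize(W, dim=1)          # project onto the hypersphere
    sim = Z @ Z.t()
    mask = ~torch.eye(len(Z), dtype=torch.bool)          # drop self-similarities
    return torch.logsumexp(sim[mask] / t, dim=0)

W = torch.randn(64, 16, requires_grad=True)
opt = torch.optim.SGD([W], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    dispersion_loss(W).backward()
    opt.step()
# The optimized embeddings are typically more dispersed than a fresh random draw.
print(dispersion_loss(torch.randn(64, 16)).item(), dispersion_loss(W).item())
```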

---

Title: Clustering-Based Validation Splits for Model Selection under Domain Shift

Abstract: This paper considers the problem of model selection under domain shift. Motivated by principles from distributionally robust optimisation (DRO) and domain adaptation theory, it is proposed that the training-validation split should maximise the distribution mismatch between the two sets. By adopting the maximum mean discrepancy (MMD) as the measure of mismatch, it is shown that the partitioning problem reduces to kernel k-means clustering. A constrained clustering algorithm, which leverages linear programming to control the size, label, and (optionally) group distributions of the splits, is presented. The algorithm does not require additional metadata, and comes with convergence guarantees. In experiments, the technique consistently outperforms alternative splitting strategies across a range of datasets and training algorithms, for both domain generalisation (DG) and unsupervised domain adaptation (UDA) tasks. Analysis also shows the MMD between the training and validation sets to be significantly rank-correlated ($\rho=0.63$) with test domain accuracy, further substantiating the validity of this approach.

URL: https://openreview.net/forum?id=Q692C0WtiD
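
The mismatch measure at the heart of this approach, the MMD between the training and validation splits, can be estimated in a few lines (RBF bandwidth and toy data are illustrative; this is the plain biased estimator, not the paper's constrained kernel k-means partitioning):

```python
# Biased RBF-kernel MMD^2 estimate between two splits.
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 5))
val_same = rng.normal(0.0, 1.0, size=(100, 5))       # same distribution as train
val_shifted = rng.normal(0.7, 1.0, size=(100, 5))    # mean-shifted split
print(mmd2_rbf(train, val_same), mmd2_rbf(train, val_shifted))   # the shifted split has larger MMD
```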

---

Title: What Matters for Model Merging at Scale?

Abstract: Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors—like the base model quality and the number of expert models—to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale for transformer-based models to examine the impact of these different factors. We experiment with merging fully fine-tuned models using four popular merging methods—Averaging, Task Arithmetic, Dare-TIES, and TIES-Merging—across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our wide range of experiments provides several new insights about merging transformer-based language models at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance, compared to pre-trained ones. Second, larger models perform better when merged. Third, merging consistently improves generalization capabilities. Notably, when merging eight large expert models, the merged models often generalize better than multitask-trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations.

URL: https://openreview.net/forum?id=9sbetmvNpW
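
Two of the merging methods named above have very short reference forms, parameter averaging and task arithmetic over fine-tuned checkpoints that share a base model; the state dicts and scaling coefficient below are toy placeholders, and TIES-style sign resolution and pruning are not shown:

```python
# Simplest forms of two merging methods: parameter averaging and task arithmetic.
import torch

def average_merge(expert_state_dicts):
    keys = expert_state_dicts[0].keys()
    return {k: torch.stack([sd[k] for sd in expert_state_dicts]).mean(0) for k in keys}

def task_arithmetic_merge(base_sd, expert_state_dicts, lam=0.3):
    # Sum the experts' task vectors (expert minus base) and add them back, scaled by lam.
    return {k: base_sd[k] + lam * sum(sd[k] - base_sd[k] for sd in expert_state_dicts)
            for k in base_sd}

base = {"w": torch.zeros(3)}
experts = [{"w": torch.tensor([1.0, 0.0, 0.0])},
           {"w": torch.tensor([0.0, 2.0, 0.0])}]
print(average_merge(experts)["w"], task_arithmetic_merge(base, experts)["w"])
```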

---

Title: Optimal transport based dimensionality reduction

Abstract: This paper investigates whether modeling image and text data as probability measures and applying optimal transport (OT)-based dimensionality reduction techniques leads to improved performance in downstream machine learning tasks. We compare OT-based neighbor embedding methods to their Euclidean counterparts across both classification and clustering tasks using benchmark datasets: MNIST, Fashion MNIST, Coil-20, Yale Face, and 20-Newsgroups. Our methodology involves computing distance matrices using Wasserstein or Euclidean metrics, applying dimensionality reduction techniques such as MDS, Isomap, t-SNE, and Laplacian eigenmaps, and evaluating performance using standard classifiers and clustering algorithms. Experimental results show that OT-based embeddings often yield better performance, although there is some variance in datasets with textures like Fashion MNIST. For all experiments, we perform a statistical hypothesis test to support the findings.

URL: https://openreview.net/forum?id=NRd1Hhmkgj

---

Title: Identifying Macro Causal Effects in a C-DMG over ADMGs

Abstract: Causal effect identification using causal graphs is a fundamental challenge in causal inference. While extensive research has been conducted in this area, most existing methods assume the availability of fully specified directed acyclic graphs or acyclic directed mixed graphs. However, in complex domains such as medicine and epidemiology, complete causal knowledge is often unavailable, and only partial information about the system is accessible. This paper focuses on causal effect identification within partially specified causal graphs, with particular emphasis on cluster-directed mixed graphs (C-DMGs) which can represent many different acyclic directed mixed graphs (ADMGs). These graphs provide a higher-level representation of causal relationships by grouping variables into clusters, offering a more practical approach for handling complex systems. Unlike fully specified ADMGs, C-DMGs can contain cycles, which complicate their analysis and interpretation. Furthermore, their cluster-based nature introduces new challenges, as it gives rise to two distinct types of causal effects, macro causal effects and micro causal effects, with different properties. In this work, we focus on macro causal effects, which describe the effects of entire clusters on other clusters. We establish that the do-calculus is both sound and complete for identifying these effects in C-DMGs over ADMGs when the cluster sizes are either unknown or of size greater than one. Additionally, we provide a graphical characterization of non-identifiability for macro causal effects in these graphs.

URL: https://openreview.net/forum?id=905LEugq6R

---

Title: DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Abstract: We present Distributional RewArds for Generative OptimizatioN (DRAGON), a versatile framework for fine-tuning media generation models towards a desired outcome. Compared with traditional reinforcement learning with human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or distributions of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). Then, DRAGON gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Frechet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON exhibits a new approach to designing and optimizing reward functions for improving human-perceived quality. Sound examples at https://ml-dragon-tmlr.github.io/web.

URL: https://openreview.net/forum?id=gobhDku03J

---

Title: Hitchhikers' Guide to Masked Latent Semantic Modeling

Abstract: Masked Latent Semantic Modeling (MLSM) is a recent pre-training objective which offers a sample-efficient alternative to Masked Language Modeling (MLM) for training encoder language models. In this paper, we identify and carefully evaluate previously unexplored important properties of MLSM pre-training. Based on the results of our rigorous experiments, we formulate a series of recommendations and best practices regarding MLSM pre-training. With these experiments, we also aim to advance the understanding and proper use of MLSM pre-training by filling in important gaps left by previous empirical investigations.
We release our code for reproducing our experiments at \url{github.com/[MASK]}.

URL: https://openreview.net/forum?id=Tx9Qkgc49I

---

Title: VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition

Abstract: The use of large-scale, web-scraped datasets to train face recognition models has raised significant privacy and bias concerns. Synthetic methods mitigate these concerns and provide scalable and controllable face generation to enable fair and accurate face recognition. However, existing synthetic datasets display limited intraclass and interclass diversity and do not match the face recognition performance obtained using real datasets. Here, we propose VariFace, a two-stage diffusion-based pipeline to create fair and diverse synthetic face datasets to train face recognition models. Specifically, we introduce three methods: Face Recognition Consistency to refine demographic labels, Face Vendi Score Guidance to improve interclass diversity, and Divergence Score Conditioning to balance the identity preservation-intraclass diversity trade-off. When constrained to the same dataset size, VariFace considerably outperforms previous synthetic datasets (0.9200 $\rightarrow$ 0.9405) and achieves comparable performance to face recognition models trained with real data (Real Gap = -0.0065). In an unconstrained setting, VariFace not only consistently achieves better performance compared to previous synthetic methods across dataset sizes but also, for the first time, outperforms the real dataset (CASIA-WebFace) across six evaluation datasets. This sets a new state-of-the-art performance with an average face verification accuracy of 0.9567 (Real Gap = +0.0097) across LFW, CFP-FP, CPLFW, AgeDB, and CALFW datasets and 0.9366 (Real Gap = +0.0380) on the RFW dataset.
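The Vendi score used above to steer interclass diversity has a simple closed form (the exponential of the entropy of the eigenvalues of a normalized similarity matrix); the sketch below shows the standard, unguided computation on a generic similarity matrix, not the paper's guidance mechanism or its face embeddings.

```python
import numpy as np

def vendi_score(K):
    """Vendi Score of n samples, given an (n, n) similarity matrix K with unit diagonal."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]          # drop numerically zero eigenvalues
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```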

URL: https://openreview.net/forum?id=tAqxAMeWNX

---

Title: Variance Reduced Smoothed Functional REINFORCE Policy Gradient Algorithms

Abstract: We revisit the REINFORCE policy gradient algorithm from the literature that works with reward (or cost) returns obtained over episodes or trajectories. We propose a major enhancement to the basic algorithm where we estimate the policy gradient using a smoothed functional (random perturbation) gradient estimator obtained from direct function measurements. To handle the issue of high variance that is typical of REINFORCE, we propose two independent enhancements to the basic scheme: (i) use the sign of the increment instead of the original (full) increment, which results in smoother convergence, and (ii) use clipped gradient estimates as proposed in the Proximal Policy Optimization (PPO) based scheme. We prove the asymptotic convergence of all algorithms and show the results of several experiments on various MuJoCo locomotion tasks wherein we compare the performance of our algorithms with the recently proposed and well-studied ARS algorithms from the literature. Our algorithms are seen to be competitive when compared to ARS.
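A minimal sketch of a two-sided smoothed functional (random perturbation) gradient estimate of the kind described above is given below; the perturbation size, the number of perturbations, and the `episode_return` callable are placeholders, and the sign-based variant from enhancement (i) is shown only as a comment.

```python
import numpy as np

def sf_gradient_estimate(theta, episode_return, delta=0.05, num_perturbations=8):
    """Two-sided smoothed functional gradient estimate from return measurements.

    episode_return(theta) is assumed to roll out an episode under the policy
    parameterized by theta and return the total reward.
    """
    d = theta.shape[0]
    grad = np.zeros(d)
    for _ in range(num_perturbations):
        u = np.random.randn(d)                       # random perturbation direction
        g = (episode_return(theta + delta * u) -
             episode_return(theta - delta * u)) / (2 * delta)
        grad += g * u
    return grad / num_perturbations

# Enhancement (i) from the abstract would replace the raw increment with its sign:
# theta += step_size * np.sign(sf_gradient_estimate(theta, episode_return))
```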

URL: https://openreview.net/forum?id=yagxqSJbiY

---

Title: SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks

Abstract: Vision-Language Models (VLMs) have great potential in medical tasks, like Visual Question Answering (VQA), where they could act as interactive assistants for both patients and clinicians. Yet their robustness to distribution shifts on unseen data remains a key concern for safe deployment. Evaluating such robustness requires a controlled experimental setup that allows for systematic insights into the model's behavior. However, we demonstrate that current setups fail to offer sufficiently thorough evaluations.
To address this gap, we introduce a novel framework, called SURE-VQA, centered around three key requirements to overcome current pitfalls and systematically analyze VLM robustness:
1) Since robustness on synthetic shifts does not necessarily translate to real-world shifts, it should be measured on real-world shifts that are inherent to the VQA data; 2) Traditional token-matching metrics often fail to capture underlying semantics, necessitating the use of large language models (LLMs) for more accurate semantic evaluation; 3) Model performance often lacks interpretability due to missing sanity baselines, thus meaningful baselines should be reported that allow assessing the multimodal impact on the VLM.
To demonstrate the relevance of this framework, we conduct a study on the robustness of various Fine-Tuning (FT) methods across three medical datasets with four types of distribution shifts.
Our study highlights key insights into robustness: 1) No FT method consistently outperforms others in robustness, and 2) robustness trends are more stable across FT methods than across distribution shifts. Additionally, we find that simple sanity baselines that do not use the image data can perform surprisingly well and confirm LoRA as the best-performing FT method on in-distribution data.
Code is provided at https://github.com/KOFRJO/sure-vqa.

URL: https://openreview.net/forum?id=qjNdGpgpV8

---

Title: Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions

Abstract: We suggest the use of hash functions to cut down the communication costs when counting subgraphs under edge local differential privacy. While various algorithms exist for computing graph statistics --- including the count of subgraphs --- under edge local differential privacy, many suffer from high communication costs, making them less efficient for large graphs. Though data compression is a typical approach in differential privacy, its application in local differential privacy requires a form of compression that every node can reproduce. In our study, we introduce linear congruence hashing. Leveraging amplification by sub-sampling, with a sampling size of $s$, our method can cut communication costs by a factor of $s^2$, albeit at the cost of increasing the variance of the published graph statistic by a factor of $s$. The experimental results indicate that, when matched for communication costs, our method achieves a reduction in the $\ell_2$-error of up to $10^3$ times for triangle counts and up to $10^3$ times for 4-cycle counts compared to the performance of leading algorithms.
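The abstract does not spell out the hash construction, but a generic linear congruential hash that every node can recompute from shared public constants might look like the sketch below; the parameters `a`, `b`, `p`, and the bucket count `s` are illustrative assumptions, and this is not the paper's algorithm.

```python
def linear_congruence_hash(x, a, b, p, s):
    """Map an integer identifier x into one of s buckets via ((a*x + b) mod p) mod s."""
    return ((a * x + b) % p) % s

def keep_edge(v, a, b, p, s, bucket=0):
    """Sub-sampling rule every node can reproduce: only report edges whose other
    endpoint v hashes to an agreed bucket, shrinking the reported adjacency list
    (and hence communication) by roughly a factor of s."""
    return linear_congruence_hash(v, a, b, p, s) == bucket
```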

URL: https://openreview.net/forum?id=N1J236mepp

---

Title: GGFlow: A Graph Flow Matching Method with Efficient Optimal Transport

Abstract: Generating graph-structured data is crucial in various domains but remains challenging due to the complex interdependencies between nodes and edges. While diffusion models have demonstrated their superior generative capabilities, they often suffer from unstable training and inefficient sampling. To enhance generation performance and training stability, we propose GGFlow, a discrete flow matching generative model incorporating an efficient optimal transport for graph structures and it incorporates an edge-augmented graph transformer to enable direct communications among edges. Additionally, GGFlow introduces a novel goal-guided generation framework to control the generative trajectory of our model towards desired properties. GGFlow demonstrates superior performance on both unconditional and conditional generation tasks, outperforming existing baselines and underscoring its effectiveness and potential for wider application.

URL: https://openreview.net/forum?id=K8RlXtMgzo

---

Title: Slicing the Gaussian Mixture Wasserstein Distance

Abstract: Gaussian mixture models (GMMs) are widely used in machine learning for tasks such as clustering, classification, image reconstruction, and generative modeling. A key challenge in working with GMMs is defining a computationally efficient and geometrically meaningful metric. The mixture Wasserstein (MW) distance adapts the Wasserstein metric to GMMs and has been applied in various domains, including domain adaptation, dataset comparison, and reinforcement learning. However, its high computational cost—arising from repeated Wasserstein distance computations involving matrix square root estimations and an expensive linear program—limits its scalability to high-dimensional and large-scale problems. To address this, we propose multiple novel slicing-based approximations to the MW distance that significantly reduce computational complexity while preserving key optimal transport properties. From a theoretical viewpoint, we establish several weak and strong equivalences between the introduced metrics, and show the relations to the original MW distance and the well-established sliced Wasserstein distance. Furthermore, we validate the effectiveness of our approach through numerical experiments, demonstrating computational efficiency and applications in clustering, perceptual image comparison, and GMM minimization.
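As background on the slicing idea referenced above, the sketch below computes a plain Monte Carlo sliced Wasserstein distance between two equal-size empirical samples; the paper's MW-specific constructions operate on Gaussian mixture components rather than raw samples, so this only illustrates the project-and-sort principle.

```python
import numpy as np

def sliced_wasserstein(X, Y, num_projections=100, p=2, seed=None):
    """Monte Carlo sliced p-Wasserstein distance between two (n, d) samples of equal size."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(num_projections):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)           # random unit direction
        x_proj = np.sort(X @ theta)              # sorted 1-D projections
        y_proj = np.sort(Y @ theta)
        total += np.mean(np.abs(x_proj - y_proj) ** p)
    return (total / num_projections) ** (1.0 / p)
```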

URL: https://openreview.net/forum?id=yPBtJ4JPwi

---

Title: Causal Dynamic Variational Autoencoder for Counterfactual Regression in Longitudinal Data

Abstract: Accurately estimating treatment effects over time is crucial in fields such as precision medicine, epidemiology, economics, and marketing. Many current methods for estimating treatment effects over time assume that all confounders are observed or attempt to infer unobserved ones. In contrast, our approach focuses on unobserved adjustment variables—variables that specifically have a causal effect on the outcome sequence. Under the assumption of unconfoundedness, we address estimating Individual Treatment Effects (ITEs) while accounting for unobserved heterogeneity in response to treatment due to these unobserved adjustment variables. Our proposed Causal Dynamic Variational Autoencoder (CDVAE) is grounded in theoretical guarantees concerning the validity of latent adjustment variables and generalization bounds on ITEs estimation error. Extensive evaluations on synthetic and real-world datasets show that CDVAE outperforms existing baselines. Moreover, we demonstrate that state-of-the-art models significantly improve their ITE estimates when augmented with the latent substitutes learned by CDVAE—approaching oracle-level performance without direct access to the true adjustment variables.

URL: https://openreview.net/forum?id=atf9q49DeF

---

Title: Neuro-mimetic Task-free Unsupervised Online Learning with Continual Self-Organizing Maps

Abstract: An intelligent system capable of continual learning is one that can process and extract knowledge from potentially infinitely long streams of pattern vectors. The major challenge that makes crafting such a system difficult is known as catastrophic forgetting -- an agent, such as one based on artificial neural networks (ANNs), struggles to retain previously acquired knowledge when learning from new samples. Furthermore, ensuring that knowledge is preserved for previous tasks becomes more challenging when input is not supplemented with task boundary information. Although forgetting in the context of ANNs has been studied extensively, there still exists far less work investigating it in terms of unsupervised architectures such as the venerable self-organizing map (SOM), a neural model often used in clustering and dimensionality reduction. While the internal mechanisms of SOMs could, in principle, yield sparse representations that improve memory retention, we observe that, when a fixed-size SOM processes continuous data streams, it experiences concept drift. In light of this, we propose a generalization of the SOM, the continual SOM (CSOM), which is capable of online unsupervised learning under a low memory budget. Our results on benchmarks including MNIST, Kuzushiji-MNIST, and Fashion-MNIST show an almost two-fold increase in accuracy, and our results on CIFAR-10 demonstrate state-of-the-art performance in the (online) unsupervised class-incremental learning setting.
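For readers unfamiliar with the base model being generalized here, the standard online SOM update is sketched below; the CSOM's continual-learning machinery (memory budget, drift handling) is not reproduced, and the learning rate and neighborhood width are placeholders.

```python
import numpy as np

def som_step(weights, grid_coords, x, lr=0.1, sigma=1.0):
    """One online self-organizing map update for a single input vector x.

    weights:      (num_units, dim) codebook vectors
    grid_coords:  (num_units, 2) fixed lattice coordinates of the units
    """
    dists = np.linalg.norm(weights - x, axis=1)
    bmu = np.argmin(dists)                                    # best-matching unit
    grid_dist2 = np.sum((grid_coords - grid_coords[bmu]) ** 2, axis=1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))                # neighborhood kernel
    weights += lr * h[:, None] * (x - weights)                # pull units toward x
    return weights
```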

URL: https://openreview.net/forum?id=PV6V03HUbf

---

Title: On Sparsity and Sub-Gaussianity in the Johnson-Lindenstrauss Lemma

Abstract: We provide a simple proof of the Johnson-Lindenstrauss lemma for sub-Gaussian variables. We extend the analysis to identify how sparse projections can be, and what the cost of sparsity is on the target dimension. The Johnson-Lindenstrauss lemma is the theoretical core of the dimensionality reduction methods based on random projections. While its original formulation involves matrices with Gaussian entries, the computational cost of random projections can be drastically reduced by the use of simpler variables, especially if they vanish with a high probability. In this paper, we propose a simple and elementary analysis of random projections under classical assumptions that emphasizes the key role of sub-Gaussianity. Furthermore, we show how to extend it to sparse projections, emphasizing the limits induced by the sparsity of the data itself.
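One standard example of the kind of sparse sub-Gaussian projection the abstract alludes to is an Achlioptas-style matrix whose entries vanish with high probability; the sketch below is illustrative background, not the paper's construction, and the sparsity parameter s is a placeholder.

```python
import numpy as np

def sparse_projection_matrix(d, k, s=3, seed=None):
    """Sparse sub-Gaussian JL matrix: entries are +/- sqrt(s/k) with probability
    1/(2s) each and 0 with probability 1 - 1/s, so that E[||Rx||^2] = ||x||^2."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 0.0, 1.0], size=(k, d),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s / k) * signs

# Usage: with k on the order of log(n) / eps^2, pairwise distances among n points
# in R^d are preserved up to a (1 +/- eps) factor with high probability.
x_low = sparse_projection_matrix(d=10000, k=256) @ np.random.randn(10000)
```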

URL: https://openreview.net/forum?id=Znaty8V3a3

---
