Weekly TMLR digest for Jun 16, 2024


TMLR

Jun 16, 2024, 12:00:09 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: What Has Been Overlooked in Contrastive Source-Free Domain Adaptation: Leveraging Source-Informed Latent Augmentation within Neighborhood Context

Jing Wang, Wonho Bae, Jiahong Chen, Kuangen Zhang, Leonid Sigal, Clarence W. de Silva

https://openreview.net/forum?id=iulMde3dP1

---


Featured Certification: Gradient Scarcity in Graph Learning with Bilevel Optimization

Hashem Ghanem, Samuel Vaiter, Nicolas Keriven

https://openreview.net/forum?id=10YJTIsVYq

---


Featured Certification: Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments

Ali Devran Kara, Serdar Yuksel

https://openreview.net/forum?id=1Yp6xpTV55

---


Featured Certification: Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement

Jonathan Pirnay, Dominik G. Grimm

https://openreview.net/forum?id=agT8ojoH0X

---


Accepted papers
===============


Title: Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

Authors: Yuqing Qian, Ziyu Zheng, Prayag Tiwari, Yijie Ding, Quan Zou

Abstract: Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the related data can be described from various perspectives. To process these kinds of data, a multi-view method, called Multiple Kronecker RLS fusion-based link propagation (MKronRLSF-LP), is proposed. MKronRLSF-LP extends Kron-RLS by finding consensus partitions and multiple graph Laplacian constraints in the multi-view setting. Both of these multi-view components contribute to a higher-quality result. Extensive experiments have been conducted on drug-side effect datasets, and our empirical results provide evidence that our approach is effective and robust.

URL: https://openreview.net/forum?id=LCPzaR9mML
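For readers unfamiliar with the single-view Kron-RLS building block that MKronRLSF-LP extends, here is a minimal NumPy sketch of its closed-form solution (the kernels, label matrix, and regularization strength are toy placeholders; the paper's multi-view fusion, consensus partition, and graph Laplacian constraints are not shown):

```python
import numpy as np

def kron_rls(K_drug, K_side, Y, lam=1.0):
    """Single-view Kronecker RLS link prediction.

    Solves vec(F) = K (K + lam*I)^{-1} vec(Y) with K = K_side (x) K_drug,
    using eigendecompositions of the two small kernels instead of ever
    forming the Kronecker product explicitly.
    """
    wa, Va = np.linalg.eigh(K_drug)   # drug kernel
    wb, Vb = np.linalg.eigh(K_side)   # side-effect kernel
    W = np.outer(wa, wb)              # eigenvalue products
    W = W / (W + lam)                 # spectral filter factors
    # Rotate labels into the joint eigenbasis, shrink, rotate back.
    return Va @ (W * (Va.T @ Y @ Vb)) @ Vb.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((30, 5)); K_drug = A @ A.T    # toy PSD kernels
    B = rng.standard_normal((20, 5)); K_side = B @ B.T
    Y = (rng.random((30, 20)) < 0.1).astype(float)        # sparse known links
    print(kron_rls(K_drug, K_side, Y).shape)              # (30, 20) scores
```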

---

Title: Knowledge Accumulation in Continually Learned Representations and the Issue of Feature Forgetting

Authors: Timm Hess, Eli Verwimp, Gido M van de Ven, Tinne Tuytelaars

Abstract: Continual learning research has shown that neural networks suffer from catastrophic forgetting "at the output level", but it is debated whether this is also the case at the level of learned representations. Multiple recent studies ascribe representations a certain level of innate robustness against forgetting - that they only forget minimally in comparison with forgetting at the output level. We revisit and expand upon the experiments that revealed this difference in forgetting and illustrate the coexistence of two phenomena that affect the quality of continually learned representations: knowledge accumulation and feature forgetting. Taking both aspects into account, we show that, even though forgetting in the representation (i.e. feature forgetting) can be small in absolute terms, when measured relative to how much was learned during a task, forgetting in the representation tends to be just as catastrophic as forgetting at the output level. Next, we show that this feature forgetting is problematic as it substantially slows down the incremental learning of good general representations (i.e. knowledge accumulation). Finally, we study how feature forgetting and knowledge accumulation are affected by different types of continual learning methods.

URL: https://openreview.net/forum?id=aHtZuZfHcf

---

Title: Gradient Scarcity in Graph Learning with Bilevel Optimization

Authors: Hashem Ghanem, Samuel Vaiter, Nicolas Keriven

Abstract: Gradient scarcity emerges when learning graphs by minimizing a loss on a subset of nodes under the semi-supervised setting. It refers to edges between unlabeled nodes that are far from the labeled ones receiving zero gradients. The phenomenon was first described when jointly optimizing the graph and the parameters of a shallow Graph Neural Network (GNN) using a single loss function. In this work, we give a precise mathematical characterization of this phenomenon, and prove that it also emerges in bilevel optimization. While for GNNs gradient scarcity occurs due to their finite receptive field, we show that it also occurs with Laplacian regularization as gradients decrease exponentially in amplitude with distance to labeled nodes, despite the infinite receptive field of this model. We study several solutions to this issue, including latent graph learning using a Graph-to-Graph model (G2G), graph regularization to impose a prior structure on the graph, and reducing the graph diameter by optimizing for a larger set of edges. Our empirical results validate our analysis and show that this issue also occurs with the Approximate Personalized Propagation of Neural Predictions (APPNP), which approximates a model of infinite receptive field.

URL: https://openreview.net/forum?id=10YJTIsVYq

---

Title: Estimating class separability of text embeddings with persistent homology.

Authors: Kostis Gourgoulias, Najah Ghalyan, Maxime Labonne, yash satsangi, Sean Moran, Joseph Sabelja

Abstract: This paper introduces an unsupervised method to estimate the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training can inform about class separability. More specifically, we show how this technique can be applied to detect when the training process stops improving the separability of the embeddings. Our results, validated across binary and multi-class text classification tasks, show that the proposed method's estimates of class separability align with those obtained from supervised methods. This approach offers a novel perspective on monitoring and improving the fine-tuning of sentence transformers for classification tasks, particularly in scenarios where labeled data is scarce. We also discuss how tracking these quantities can provide additional insights into the properties of the trained classifier.

URL: https://openreview.net/forum?id=8DWrIMuLya
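As background for the topological quantities involved, the 0-dimensional persistence of a point cloud (how long connected components live) can be read off a single-linkage dendrogram, since H0 death times coincide with minimum-spanning-tree edge lengths. The sketch below is only this H0 special case on synthetic embeddings, not the authors' estimator:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def h0_death_times(points):
    """H0 persistence of a Euclidean point cloud: every point is born at 0,
    and components die at the single-linkage merge heights (MST edge lengths)."""
    Z = linkage(points, method="single", metric="euclidean")
    return np.sort(Z[:, 2])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    separated = np.vstack([rng.normal(0, 0.3, (50, 8)), rng.normal(5, 0.3, (50, 8))])
    overlapping = np.vstack([rng.normal(0, 1.5, (50, 8)), rng.normal(1, 1.5, (50, 8))])
    # A long-lived component (large final death time) hints at separable classes.
    print("separated  :", h0_death_times(separated)[-1])
    print("overlapping:", h0_death_times(overlapping)[-1])
```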

---

Title: Exploring validation metrics for offline model-based optimisation with diffusion models

Authors: Christopher Beckham, Alexandre Piché, David Vazquez, Christopher Pal

Abstract: In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle, which is expensive to compute since it involves executing a real world process. In offline MBO we wish to do so without assuming access to such an oracle during training or validation, which makes evaluation non-straightforward. While an approximation to the ground truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples. Measuring the mean reward of generated candidates over this approximation is one such `validation metric', whereas we are interested in a more fundamental question which is finding which validation metrics correlate the most with the ground truth. This involves proposing validation metrics and quantifying them over many datasets for which the ground truth is known, for instance simulated environments. This is encapsulated under our proposed evaluation framework, which is also designed to measure extrapolation, the ultimate goal behind leveraging generative models for MBO. While our evaluation framework is model agnostic, we specifically evaluate denoising diffusion models due to their state-of-the-art performance, and we derive interesting insights such as a ranking of the most effective validation metrics and a discussion of important hyperparameters.

URL: https://openreview.net/forum?id=wC4ZID0H9a

---

Title: Solving the Tree Containment Problem Using Graph Neural Networks

Authors: Arkadiy Dushatskiy, Esther Julien, Leen Stougie, Leo van Iersel

Abstract: Tree containment is a fundamental problem in phylogenetics useful for verifying a proposed phylogenetic network, representing the evolutionary history of certain species. Tree containment asks whether the given phylogenetic tree (for instance, constructed from a DNA fragment showing tree-like evolution) is contained in the given phylogenetic network. In the general case, this is an NP-complete problem. We propose to solve it approximately using Graph Neural Networks. In particular, we propose to combine the given network and the tree and apply a Graph Neural Network to this network-tree graph. This way, we achieve the capability of solving the tree containment instances representing a larger number of species than the instances contained in the training dataset (i.e., our algorithm has the inductive learning ability). Our algorithm demonstrates an accuracy of over $95\%$ in solving the tree containment problem on instances with up to 100 leaves.

URL: https://openreview.net/forum?id=nK5MazeIpn

---

Title: A Simple Video Segmenter by Tracking Objects Along Axial Trajectories

Authors: Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Alan Yuille, Xiaohui Shen, Liang-Chieh Chen

Abstract: Video segmentation requires consistently segmenting and tracking objects over time. Due to the quadratic dependency on input size, directly applying self-attention to video segmentation with high-resolution input features poses significant challenges, often leading to GPU Out-Of-Memory errors. Consequently, modern video segmenters either extend an image segmenter without incorporating any temporal attention or resort to window space-time attention in a naive manner. In this work, we present Axial-VS, a general and simple framework that enhances video segmenters by tracking objects along axial trajectories. The framework tackles video segmentation through two sub-tasks: short-term within-clip segmentation and long-term cross-clip tracking. In the first step, Axial-VS augments an off-the-shelf clip-level video segmenter with the proposed axial-trajectory attention, sequentially tracking objects along the height- and width-trajectories within a clip, thereby enhancing temporal consistency by capturing motion trajectories. The axial decomposition significantly reduces the computational complexity for dense features, and outperforms the window space-time attention in segmentation quality. In the second step, we further employ axial-trajectory attention to the object queries in clip-level segmenters, which are learned to encode object information, thereby aiding object tracking across different clips and achieving consistent segmentation throughout the video. Without bells and whistles, Axial-VS showcases state-of-the-art results on video segmentation benchmarks, emphasizing its effectiveness in addressing the limitations of modern clip-level video segmenters. Code will be made available.

URL: https://openreview.net/forum?id=Sy6ZOStz5v
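To make the axial factorization concrete, here is a generic axial-attention sketch over clip features in plain PyTorch: attention is applied along a time-height trajectory for each width column, then along a time-width trajectory for each height row. The module name, tensor layout, and exact factorization are assumptions for illustration, not the Axial-VS implementation:

```python
import torch
import torch.nn as nn

class AxialTrajectoryAttention(nn.Module):
    """Factorized self-attention over (time, height) then (time, width).

    Input: video features of shape (B, T, H, W, C)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn_h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, T, H, W, C = x.shape
        # Height trajectories: one sequence of length T*H per (batch, width) column.
        h_seq = x.permute(0, 3, 1, 2, 4).reshape(B * W, T * H, C)
        h_out, _ = self.attn_h(h_seq, h_seq, h_seq)
        x = h_out.reshape(B, W, T, H, C).permute(0, 2, 3, 1, 4)
        # Width trajectories: one sequence of length T*W per (batch, height) row.
        w_seq = x.permute(0, 2, 1, 3, 4).reshape(B * H, T * W, C)
        w_out, _ = self.attn_w(w_seq, w_seq, w_seq)
        return w_out.reshape(B, H, T, W, C).permute(0, 2, 1, 3, 4)

if __name__ == "__main__":
    feats = torch.randn(2, 4, 16, 16, 64)   # (B, T, H, W, C) toy clip features
    out = AxialTrajectoryAttention(dim=64)(feats)
    print(out.shape)  # torch.Size([2, 4, 16, 16, 64])
```

Each axial pass attends over T*H (or T*W) tokens per column (or row), instead of T*H*W tokens jointly, which is the source of the quadratic-cost reduction the abstract mentions.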

---

Title: Targeted Active Learning for Bayesian Decision-Making

Authors: Louis Filstroff, Iiris Sundin, Petrus Mikkola, Aleksei Tiulpin, Juuso Kylmäoja, Samuel Kaski

Abstract: Active learning is usually applied to acquire labels of informative data points in supervised learning, to maximize accuracy in a sample-efficient way. However, maximizing the supervised learning accuracy is not the end goal when the results are used for decision-making, for example in personalized medicine or economics. We argue that when acquiring samples sequentially, the common practice of separating learning and decision-making is sub-optimal, and we introduce an active learning strategy that takes the down-the-line decision problem into account. Specifically, we adopt a Bayesian experimental design approach, in which the proposed acquisition criterion maximizes the expected information gain on the posterior distribution of the optimal decision. We compare our targeted active learning strategy to existing alternatives on both simulated and real data and show improved performance in decision-making accuracy.

URL: https://openreview.net/forum?id=KxPjuiMgmm

---

Title: Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments

Authors: Ali Devran Kara, Serdar Yuksel

Abstract: As a primary contribution, we present a convergence theorem for stochastic iterations, and in particular, Q-learning iterates, under a general, possibly non-Markovian, stochastic environment. Our conditions for convergence involve an ergodicity and a positivity criterion. We provide a precise characterization on the limit of the iterates and conditions on the environment and initializations for convergence. As our second contribution, we discuss the implications and applications of this theorem to a variety of stochastic control problems with non-Markovian environments involving (i) quantized approximations of fully observed Markov Decision Processes (MDPs) with continuous spaces (where quantization breaks down the Markovian structure), (ii) quantized approximations of belief-MDP-reduced partially observable MDPs (POMDPs) with weak Feller continuity and a mild version of filter stability (which requires the knowledge of the model by the controller), (iii) finite window approximations of POMDPs under a uniform controlled filter stability (which does not require the knowledge of the model), and (iv) multi-agent models, where convergence of learning dynamics to a new class of equilibria, subjective Q-learning equilibria, is studied. In addition to the convergence theorem, some implications of the theorem above are new to the literature and others are interpreted as applications of the convergence theorem. Some open problems are noted.

URL: https://openreview.net/forum?id=1Yp6xpTV55
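Item (iii) above concerns Q-learning over finite windows of past observations as an approximation for non-Markovian environments. A bare-bones tabular sketch of that idea (the toy environment, window length, and hyperparameters are placeholders of my own; none of the paper's convergence machinery is reproduced):

```python
import random
from collections import defaultdict, deque

def finite_window_q_learning(env_reset, env_step, n_actions, window=3,
                             episodes=500, gamma=0.95, alpha=0.1, eps=0.1):
    """Tabular Q-learning whose 'state' is the window of the last few
    observation-action pairs: a finite-memory approximation for a
    partially observed / non-Markovian environment."""
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        hist = deque([(env_reset(), None)] * window, maxlen=window)
        state, done = tuple(hist), False
        while not done:
            a = (random.randrange(n_actions) if random.random() < eps
                 else max(range(n_actions), key=lambda i: Q[state][i]))
            obs, r, done = env_step(a)
            hist.append((obs, a))
            nxt = tuple(hist)
            target = r + (0.0 if done else gamma * max(Q[nxt]))
            Q[state][a] += alpha * (target - Q[state][a])
            state = nxt
    return Q

if __name__ == "__main__":
    # Toy partially observed 2-state chain with a noisy observation channel.
    world = {"s": 0, "t": 0}
    def env_reset():
        world["s"], world["t"] = 0, 0
        return int(world["s"] != (random.random() < 0.1))
    def env_step(a):
        world["s"] = (world["s"] + a) % 2
        world["t"] += 1
        obs = int(world["s"] != (random.random() < 0.1))   # 10% flipped bit
        return obs, float(world["s"]), world["t"] >= 20
    Q = finite_window_q_learning(env_reset, env_step, n_actions=2)
    print(len(Q), "window-states visited")
```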

---

Title: ***FastDoc***: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy

Authors: Abhilash Nandy, Manav Nitin Kapadnis, Sohan Patnaik, Yash Parag Butala, Pawan Goyal, Niloy Ganguly

Abstract: In this paper, we propose FastDoc (Fast Continual Pre-training Technique using Document Level Metadata and Taxonomy), a novel, compute-efficient framework that utilizes document metadata and a domain-specific taxonomy as supervision signals to continually pre-train a transformer encoder on a domain-specific corpus. The main innovation is that during domain-specific pretraining, an open-domain encoder is continually pre-trained using sentence-level embeddings as inputs (to accommodate long documents); however, fine-tuning is done with token-level embeddings as inputs to this encoder. We perform such domain-specific pre-training on three different domains, namely customer support, scientific, and legal domains, and compare performance on 6 different downstream tasks and 9 different datasets. The novel use of document-level supervision along with sentence-level embedding input for pre-training reduces pre-training compute by around 1,000, 4,500, and 500 times compared to MLM and/or NSP in the Customer Support, Scientific, and Legal Domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact, we show that FastDoc either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the Customer Support, Scientific, and Legal Domains. Moreover, reduced training aids in mitigating the risk of catastrophic forgetting. Thus, unlike baselines, FastDoc shows a negligible drop in performance on open-domain tasks.

URL: https://openreview.net/forum?id=RA4yRhjoXw

---

Title: Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement

Authors: Jonathan Pirnay, Dominik G. Grimm

Abstract: Current methods for end-to-end constructive neural combinatorial optimization usually train a policy using behavior cloning from expert solutions or policy gradient methods from reinforcement learning. While behavior cloning is straightforward, it requires expensive expert solutions, and policy gradient methods are often computationally demanding and complex to fine-tune. In this work, we bridge the two and simplify the training process by sampling multiple solutions for random instances using the current model in each epoch and then selecting the best solution as an expert trajectory for supervised imitation learning. To achieve progressively improving solutions with minimal sampling, we introduce a method that combines round-wise Stochastic Beam Search with an update strategy derived from a provable policy improvement. This strategy refines the policy between rounds by utilizing the advantage of the sampled sequences with almost no computational overhead. We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The models trained with our method achieve comparable performance and generalization to those trained with expert data. Additionally, we apply our method to the Job Shop Scheduling Problem using a transformer-based architecture and outperform existing state-of-the-art methods by a wide margin.

URL: https://openreview.net/forum?id=agT8ojoH0X
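A heavily stripped-down version of the training loop described above, on toy TSP instances: sample several tours from the current policy, keep the shortest as a pseudo-expert trajectory, and imitate it with a cross-entropy loss. The tiny policy, plain multinomial sampling (instead of the paper's round-wise Stochastic Beam Search), and the omitted policy-improvement update are all simplifications of my own:

```python
import torch
import torch.nn as nn

class TinyTSPPolicy(nn.Module):
    """Scores each candidate city given the current city (toy stand-in model)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, coords, cur, visited):
        pair = torch.cat([coords[cur].expand_as(coords), coords], dim=-1)  # (n, 4)
        return self.net(pair).squeeze(-1).masked_fill(visited, float("-inf"))

def sample_tour(policy, coords):
    n = coords.shape[0]
    visited = torch.zeros(n, dtype=torch.bool); visited[0] = True
    tour, cur = [0], 0
    for _ in range(n - 1):
        probs = torch.softmax(policy(coords, cur, visited), dim=-1)
        cur = int(torch.multinomial(probs, 1))
        visited[cur] = True; tour.append(cur)
    return tour

def tour_length(coords, tour):
    pts = coords[tour + [tour[0]]]
    return (pts[1:] - pts[:-1]).norm(dim=-1).sum()

policy = TinyTSPPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for step in range(100):
    coords = torch.rand(20, 2)                               # a random TSP instance
    with torch.no_grad():                                    # sample candidate tours
        tours = [sample_tour(policy, coords) for _ in range(16)]
    best = min(tours, key=lambda t: tour_length(coords, t))  # pseudo-expert tour
    visited = torch.zeros(20, dtype=torch.bool); visited[0] = True
    loss = torch.zeros(())
    for cur, nxt in zip(best[:-1], best[1:]):                # imitate the best tour
        log_p = torch.log_softmax(policy(coords, cur, visited), dim=-1)
        loss = loss - log_p[nxt]
        visited[nxt] = True
    opt.zero_grad(); loss.backward(); opt.step()
```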

---

Title: Smoothed Robustness Analysis: Bridging worst- and average-case robustness analyses via smoothed analysis

Authors: Thomas Rodrigues Crespo, Jun-nosuke Teramae

Abstract: The sensitivity to adversarial attacks and noise is a significant drawback of neural networks, and understanding and certifying their robustness has attracted much attention. Studies have attempted to bridge two extreme analyses of robustness; one is the worst-case analysis, which often gives too pessimistic certification, and the other is the average-case analysis, which often fails to give a tight guarantee of robustness. Among them, Randomized Smoothing became prominent by certifying a worst-case region of a classifier under input noise. However, the method still suffers from several limitations, probably due to the lack of a larger underlying framework to locate it. Here, inspired by the Smoothed Analysis of algorithmic complexity, which bridges the worst-case and average-case analyses of algorithms, we provide a theoretical framework for robustness analyses of classifiers, which contains Randomized Smoothing as a special case. Using the framework, we also propose a novel robustness analysis that works even in the small noise regime and thus provides a more confident robustness certification than Randomized Smoothing. To validate the approach, we evaluate the robustness of fully connected and convolutional neural networks on the MNIST and CIFAR-10 datasets, respectively, and find that it indeed improves both adversarial and noise robustness.

URL: https://openreview.net/forum?id=BogwFMz5tU

---

Title: [Re] CUDA: Curriculum of Data Augmentation for Long‐tailed Recognition

Authors: Barath Chandran.C

Abstract: In this reproducibility study, we present our results and experience in replicating the paper titled CUDA: Curriculum of Data Augmentation for Long-Tailed Recognition (Ahn et al., 2023). Traditional datasets used in image recognition, such as ImageNet, are often synthetically balanced, meaning each class has an equal number of samples. In practical scenarios, datasets frequently exhibit significant class imbalances, with certain classes having a disproportionately larger number of samples compared to others. This discrepancy poses a challenge for traditional image recognition models, as they tend to favor classes with larger sample sizes, leading to poor performance on minority classes. CUDA proposes a class-wise data augmentation technique which can be used on top of any existing model to improve the accuracy for LTR (Long-Tailed Recognition). We successfully replicated all of the results pertaining to the long-tailed CIFAR-100-LT dataset and extended our analysis to provide deeper insights into how CUDA efficiently tackles class imbalance. The code and the readings are available at https://anonymous.4open.science/r/CUDA-org--C2FD/README.md

URL: https://openreview.net/forum?id=Wm6d44I8St

---

Title: Hybrid Active Learning with Uncertainty-Weighted Embeddings

Authors: Yinan He, Lile Cai, Jingyi Liao, Chuan-Sheng Foo

Abstract: We introduce a hybrid active learning method that simultaneously considers uncertainty and diversity for sample selection. Our method consists of two key steps: computing a novel uncertainty-weighted embedding, then applying distance-based sampling for sample selection. Our proposed uncertainty-weighted embedding is computed by weighting a sample's feature representation by an uncertainty measure. We show how this embedding generalizes the gradient embedding of BADGE so it can be used with arbitrary loss functions and be computed more efficiently, especially for dense prediction tasks and network architectures with large numbers of parameters in the final layer. We extensively evaluate the proposed hybrid active learning method on image classification, semantic segmentation and object detection tasks, and demonstrate that it achieves state-of-the-art performance.

URL: https://openreview.net/forum?id=jD761b5OaE
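The two steps described (an uncertainty-weighted embedding followed by distance-based selection) can be illustrated with a few lines of NumPy. The entropy weighting and the greedy k-center-style selection below are generic stand-ins; the paper's exact uncertainty measure and sampling rule may differ:

```python
import numpy as np

def uncertainty_weighted_embeddings(features, probs):
    """Scale each sample's feature vector by its predictive entropy."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # (N,)
    return features * entropy[:, None]                         # (N, D)

def greedy_distance_sampling(emb, budget):
    """Greedy farthest-point (k-center style) selection in embedding space."""
    selected = [int(np.argmax(np.linalg.norm(emb, axis=1)))]   # start: largest norm
    d = np.linalg.norm(emb - emb[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(d))
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((1000, 32))                    # penultimate-layer features
    logits = rng.standard_normal((1000, 10))
    probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    emb = uncertainty_weighted_embeddings(feats, probs)
    print(greedy_distance_sampling(emb, budget=10))
```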

---

Title: Nuisances via Negativa: Adjusting for Spurious Correlations via Data Augmentation

Authors: Aahlad Manas Puli, Nitish Joshi, Yoav Wald, He He, Rajesh Ranganath

Abstract: In prediction tasks, there exist features that are related to the label in the same way across different settings for that task; these are semantic features or semantics. Features with varying relationships to the label are nuisances. For example, in detecting cows from natural images, the shape of the head is semantic, but because images of cows often have grass backgrounds but not always, the background is a nuisance. Models that exploit nuisance-label relationships face performance degradation when these relationships change. Building models robust to such changes requires additional knowledge beyond samples of the features and labels. For example, existing work uses annotations of nuisances or assumes ERM-trained models depend on nuisances. Approaches to integrate new kinds of additional knowledge enlarge the settings where robust models can be built. We develop an approach to use knowledge about the semantics via data augmentations. These data augmentations corrupt semantic information to produce models that identify and adjust for where nuisances drive predictions. We study semantic corruptions in powering different spurious-correlation-avoiding methods on multiple out-of-distribution (OOD) tasks like classifying waterbirds, natural language inference (NLI), and detecting cardiomegaly in chest X-rays.

URL: https://openreview.net/forum?id=RIFJsSzwKY

---

Title: Making Translators Privacy-aware on the User's Side

Authors: Ryoma Sato

Abstract: We propose PRISM to enable users of machine translation systems to preserve the privacy of data on their own initiative. There is a growing demand to apply machine translation systems to data that require privacy protection. While several machine translation engines claim to prioritize privacy, the extent and specifics of such protection are largely ambiguous. First, there is often a lack of clarity on how and to what degree the data is protected. Even if service providers believe they have sufficient safeguards in place, sophisticated adversaries might still extract sensitive information. Second, vulnerabilities may exist outside of these protective measures, such as within communication channels, potentially leading to data leakage. As a result, users are hesitant to utilize machine translation engines for data demanding high levels of privacy protection, thereby missing out on their benefits. PRISM resolves this problem. Instead of relying on the translation service to keep data safe, PRISM provides the means to protect data on the user's side. This approach ensures that even machine translation engines with inadequate privacy measures can be used securely. For platforms already equipped with privacy safeguards, PRISM acts as an additional protection layer, further reinforcing their security. PRISM adds these privacy features without significantly compromising translation accuracy. We prove that PRISM enjoys the theoretical guarantee of word-level differential privacy. Our experiments demonstrate the effectiveness of PRISM using real-world translators, T5 and ChatGPT (GPT-3.5-turbo), and datasets in two languages. PRISM balances privacy protection with translation accuracy more effectively than other user-side privacy protection protocols and helps users grasp the content written in a foreign language without leaking the original content.

URL: https://openreview.net/forum?id=A6eqDMttcs

---

Title: Convergences for Minimax Optimization Problems over Infinite-Dimensional Spaces Towards Stability in Adversarial Training

Authors: Takashi Furuya, Satoshi Okuda, Kazuma Suetake, Yoshihide Sawada

Abstract: Training neural networks that require adversarial optimization, such as generative adversarial networks (GANs) and unsupervised domain adaptations (UDAs), suffers from instability. This instability problem comes from the difficulty of the minimax optimization, and there have been various approaches in GANs and UDAs to overcome this problem. In this study, we tackle this problem theoretically through a functional analysis. Specifically, we show the convergence property of the minimax problem by the gradient descent over the infinite-dimensional spaces of continuous functions and probability measures under certain conditions. Using this setting, we can discuss GANs and UDAs comprehensively, which have been studied independently. In addition, we show that the conditions necessary for the convergence property are interpreted as stabilization techniques of adversarial training such as the spectral normalization and the gradient penalty.

URL: https://openreview.net/forum?id=6LePXHr2f3

---

Title: CoMIX: A Multi-agent Reinforcement Learning Training Architecture for Efficient Decentralized Coordination and Independent Decision-Making

Authors: Giovanni Minelli, Mirco Musolesi

Abstract: Robust coordination skills enable agents to operate cohesively in shared environments, together towards a common goal and, ideally, individually without hindering each other's progress. To this end, this paper presents Coordinated QMIX (CoMIX), a novel training framework for decentralized agents that enables emergent coordination through flexible policies while at the same time allowing independent decision-making at the individual level. CoMIX models selfish and collaborative behavior as incremental steps in each agent's decision process. This allows agents to dynamically adapt their behavior to different situations, balancing independence and collaboration. Experiments using a variety of simulation environments demonstrate that CoMIX outperforms baselines on collaborative tasks. The results validate our incremental approach as an effective technique for improving coordination in multi-agent systems.

URL: https://openreview.net/forum?id=JoU9khOwwr

---

Title: [Re] Reproducibility Study of “Explaining Temporal Graph Models Through an Explorer-Navigator Framework"

Authors: Helia Ghasemi, Christina Isaicu, Jesse Wonnink, Andreas Berentzen

Abstract: This paper seeks to reproduce and extend the results of the paper “Explaining Temporal Graph Models Through an Explorer-Navigator Framework” by Xia et al. (2023). The main contribution of the original authors is a novel explainer for temporal graph networks, the Temporal GNN Explainer (T-GNNExplainer), which finds a subset of preceding events that “explain” a prediction made by a temporal graph model. The explainer is tested on two temporal graph models that are trained on two real-world and two synthetic datasets. The explainer is evaluated using a newly proposed metric for explanatory graph models. The authors compare the performance of their explainer to three baseline explainer methods, either adapted from a GNN explainer or developed by the authors. The authors claim that T-GNNExplainer achieves superior performance compared to the baselines when evaluated with their proposed metric. This work reproduces the original experiments by using the code (with minor adjustments), model specifications, and hyperparameters provided by the original authors. To evaluate the robustness of these claims, the method was extended to one new dataset (MOOC). Results show that the T-GNNExplainer performs best on some, but not all, of the metrics reported in the original findings. We conclude that the main lines of this paper hold up even though all results are less pronounced than claimed. Results show that the T-GNNExplainer does not perform similarly across different T-GNN models, precise dataset specifications are needed to obtain high performance, and there are simpler, less computationally costly explainer methods (like PBONE) that could offer competitive results.

URL: https://openreview.net/forum?id=9M2XqvH2SB

---

Title: Achieving the Asymptotically Minimax Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach

Authors: Yue Wang, Jinjun Xiong, Shaofeng Zou

Abstract: Offline reinforcement learning aims to learn from pre-collected datasets without active exploration. This problem faces significant challenges, including limited data availability and distributional shifts. Existing approaches adopt a pessimistic stance towards uncertainty by penalizing rewards of under-explored state-action pairs to estimate value functions conservatively. In this paper, we show that the distributionally robust optimization (DRO) based approach can also address these challenges and is {asymptotically minimax optimal}. Specifically, we directly model the uncertainty in the transition kernel and construct an uncertainty set of statistically plausible transition kernels. We then show that the policy that optimizes the worst-case performance over this uncertainty set has a near-optimal performance in the underlying problem. We first design a metric-based distribution-based uncertainty set such that with high probability the true transition kernel is in this set. We prove that to achieve a sub-optimality gap of $\epsilon$, the sample complexity is $\mathcal{O}(S^2C^{\pi^*}\epsilon^{-2}(1-\gamma)^{-4})$, where $\gamma$ is the discount factor, $S$ is the number of states, and $C^{\pi^*}$ is the single-policy clipped concentrability coefficient which quantifies the distribution shift. To achieve the optimal sample complexity, we further propose a less conservative value-function-based uncertainty set, which, however, does not necessarily include the true transition kernel. We show that an improved sample complexity of $\mathcal{O}(SC^{\pi^*}\epsilon^{-2}(1-\gamma)^{-3})$ can be obtained, which asymptotically matches with the minimax lower bound for offline reinforcement learning, and thus is asymptotically minimax optimal.

URL: https://openreview.net/forum?id=Y7FbGcjOuD

---

Title: Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Authors: Pranshu Malviya, Goncalo Mordido, Aristide Baratin, Reza Babanezhad Harikandeh, Jerry Huang, Simon Lacoste-Julien, Razvan Pascanu, Sarath Chandar

Abstract: Adaptive gradient-based optimizers, notably Adam, have left their mark in training large-scale deep learning models, offering fast convergence and robustness to hyperparameter settings. However, they often struggle with generalization, attributed to their tendency to converge to sharp minima in the loss landscape. To address this, we propose a new memory-augmented version of Adam that encourages exploration towards flatter minima by incorporating a buffer of critical momentum terms during training. This buffer prompts the optimizer to overshoot beyond narrow minima, promoting exploration. Through comprehensive analysis in simple settings, we illustrate the efficacy of our approach in increasing exploration and bias towards flatter minima. We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn Treebank, and online learning tasks on TinyImageNet and 5-dataset. Our code is available at https://github.com/chandar-lab/CMOptimizer.

URL: https://openreview.net/forum?id=sHSkJqyQgW

---

Title: Physics Informed Distillation for Diffusion Models

Authors: Joshua Tian Jin Tee, Kang Zhang, Hee Suk Yoon, Dhananjaya Nagaraja Gowda, Chanwoo Kim, Chang D. Yoo

Abstract: Diffusion models have recently emerged as a potent tool in generative modeling. However, their inherent iterative nature often results in sluggish image generation due to the requirement for multiple model evaluations. Recent progress has unveiled the intrinsic link between diffusion models and Probability Flow Ordinary Differential Equations (ODEs), thus enabling us to conceptualize diffusion models as ODE systems. Simultaneously, Physics Informed Neural Networks (PINNs) have substantiated their effectiveness in solving intricate differential equations through implicit modeling of their solutions. Building upon these foundational insights, we introduce Physics Informed Distillation (PID), which employs a student model to represent the solution of the ODE system corresponding to the teacher diffusion model, akin to the principles employed in PINNs. Through experiments on CIFAR 10 and ImageNet 64x64, we observe that PID achieves performance comparable to recent distillation methods. Notably, it demonstrates predictable trends concerning method-specific hyperparameters and eliminates the need for synthetic dataset generation during the distillation process, both of which contribute to its easy-to-use nature as a distillation approach for Diffusion Models.

URL: https://openreview.net/forum?id=rOvaUsF996

---

Title: Understanding and Improving Transfer Learning of Deep Models via Neural Collapse

Authors: Xiao Li, Sheng Liu, Jinxin Zhou, Xinyu Lu, Carlos Fernandez-Granda, Zhihui Zhu, Qing Qu

Abstract: With the ever-increasing complexity of large-scale pre-trained models coupled with a shortage of labeled data for downstream training, transfer learning has become the primary approach in many fields, including natural language processing, computer vision, and multi-modal learning. Despite recent progress, the fine-tuning process for large-scale pre-trained models in vision still mostly relies on trial and error. This work investigates the relationship between neural collapse (NC) and transfer learning for classification problems. NC is an intriguing yet prevalent phenomenon that has been recently discovered in terms of the final-layer features and linear classifiers of trained neural networks. Specifically, during the terminal phase of training, NC implies that the variability of the features within each class diminishes to zero, while the means of features between classes are maximally and equally distanced. In this work, we examine the NC attributes of pre-trained models on both downstream and training data for transfer learning, and we find a strong correlation between feature collapse and downstream performance. In particular, we discovered a systematic pattern that emerges when linear probing pre-trained models on downstream training data: the more feature collapse of pre-trained models on downstream data, the higher the transfer accuracy. Additionally, we also studied the relationship between NC and transfer accuracy on the training data. Moreover, these findings allow us to develop a principled, parameter-efficient fine-tuning method that employs skip-connections to induce last-layer feature collapse on downstream data. Our proposed fine-tuning methods deliver good performance while reducing fine-tuning parameters by at least 90\% and mitigating overfitting, especially when the downstream data is scarce.

URL: https://openreview.net/forum?id=o8r84MzTQB
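For context, within-class variability collapse (the NC property referenced here) is commonly quantified by comparing within-class and between-class feature covariances. Below is a sketch of the standard NC1 metric, tr(Sigma_W pinv(Sigma_B)) / K, which may differ from the exact measure the authors use:

```python
import numpy as np

def nc1_metric(features, labels):
    """Within-class variability collapse: tr(Sigma_W @ pinv(Sigma_B)) / K.

    Smaller values mean features within each class have collapsed more
    tightly around their class mean, relative to the between-class spread."""
    classes = np.unique(labels)
    K, d = len(classes), features.shape[1]
    global_mean = features.mean(axis=0)
    Sigma_W, Sigma_B = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = features[labels == c]
        mu_c = Xc.mean(axis=0)
        Sigma_W += (Xc - mu_c).T @ (Xc - mu_c) / len(features)
        diff = (mu_c - global_mean)[:, None]
        Sigma_B += diff @ diff.T / K
    return np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / K

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(5), 100)
    means = 3.0 * rng.standard_normal((5, 16))
    feats = means[labels] + 0.1 * rng.standard_normal((500, 16))  # near-collapsed
    print(nc1_metric(feats, labels))   # small value -> strong collapse
```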

---

Title: AGALE: A Graph-Aware Continual Learning Evaluation Framework

Authors: Tianqi Zhao, Alan Hanjalic, Megha Khosla

Abstract: In recent years, continual learning (CL) techniques have made significant progress in learning from streaming data while preserving knowledge across sequential tasks, particularly in the realm of Euclidean data. To foster fair evaluation and recognize challenges in CL settings, several evaluation frameworks have been proposed, focusing mainly on the single- and multi-label classification task on Euclidean data. However, these evaluation frameworks are not trivially applicable when the input data is graph-structured, as they do not consider the topological structure inherent in graphs. Existing continual graph learning (CGL) evaluation frameworks have predominantly focussed on single-label scenarios in the node classification (NC) task. This focus has overlooked the complexities of multi-label scenarios, where nodes may exhibit affiliations with multiple labels, simultaneously participating in multiple tasks. We develop a graph-aware evaluation (AGALE) framework that accommodates both single-labeled and multi-labeled nodes, addressing the limitations of previous evaluation frameworks. In particular, we define new incremental settings and devise data partitioning algorithms tailored to CGL datasets. We perform extensive experiments comparing methods from the domains of continual learning, continual graph learning, and dynamic graph learning (DGL). We theoretically analyze AGALE and provide new insights about the role of homophily in the performance of compared methods. We release our framework at https://github.com/Tianqi-py/AGALE.

URL: https://openreview.net/forum?id=xDTKRLyaNN

---


New submissions
===============


Title: Video Diffusion Models: A Survey

Abstract: Diffusion generative models have recently become a robust technique for producing and modifying coherent, high-quality video. This survey offers a systematic overview of critical elements of diffusion models for video generation, covering applications, architectural choices, and the modeling of temporal dynamics. Recent advancements in the field are summarized and grouped into development trends. The survey concludes with an overview of remaining challenges and an outlook on the future of the field.

URL: https://openreview.net/forum?id=rJSHjhEYJx

---

Title: Learned feature representations are biased by complexity, learning order, position, and more

Abstract: Representation learning, and interpreting learned representations, are key areas of focus in machine learning and neuroscience. Both fields generally use representations as a means to understand or improve a system's computations. In this work, however, we explore surprising dissociations between representation and computation that may pose challenges for such efforts. We create datasets in which we attempt to match the computational role that different features play, while manipulating other properties of the features or the data. We train various deep learning architectures to compute these multiple abstract features about their inputs. We find that their learned feature representations are systematically biased towards representing some features more strongly than others, depending upon extraneous properties such as feature complexity, the order in which features are learned, and the distribution of features over the inputs. For example, features that are simpler to compute or learned first tend to be represented more strongly and densely than features that are more complex or learned later, even if all features are learned equally well. We also explore how these biases are affected by architectures, optimizers, and training regimes (e.g., in transformers, features decoded earlier in the output sequence also tend to be represented more strongly). Our results help to characterize the inductive biases of gradient-based representation learning. These results also highlight a key challenge for interpretability---or for comparing the representations of models and brains---disentangling extraneous biases from the computationally important aspects of a system's internal representations.

URL: https://openreview.net/forum?id=aY2nsgE97a

---

Title: Adversarial Fine-tuning of Compressed Neural Networks for Joint Improvement of Robustness and Efficiency

Abstract: As deep learning (DL) models are increasingly being integrated into our everyday lives, ensuring their safety by making them robust against adversarial attacks has become increasingly critical. DL models have been found to be susceptible to adversarial attacks which can be achieved by introducing small, targeted perturbations to disrupt the input data. Adversarial training has been presented as a mitigation strategy which can result in more robust models. This adversarial robustness comes with additional computational costs required to design adversarial attacks during training. The two objectives -- adversarial robustness and computational efficiency -- then appear to be in conflict with each other. In this work, we explore the effects of two different model compression methods -- structured weight pruning and quantization -- on adversarial robustness. We specifically explore the effects of fine-tuning on compressed models, and present the trade-off between standard fine-tuning and adversarial fine-tuning. Our results show that compression does not inherently lead to loss in model robustness, and adversarial fine-tuning of a compressed model can yield large improvements in its robustness performance. We present experiments on two benchmark datasets showing that adversarial fine-tuning of compressed models can achieve robustness performance comparable to adversarially trained models, while also improving computational efficiency.

URL: https://openreview.net/forum?id=PJQ4b2zvvF
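A hedged sketch of the workflow the abstract describes, compress first and then adversarially fine-tune, using PyTorch's built-in structured pruning and a vanilla L-infinity PGD attack. The pruning amount, attack budget, and optimizer are illustrative guesses, not the paper's settings:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard L-infinity PGD used to craft training-time adversarial examples."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()

def adversarial_finetune(model, loader, epochs=1, lr=1e-3):
    # 1) Compress: structured L2 pruning of 50% of the output channels of every conv.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            prune.ln_structured(m, name="weight", amount=0.5, n=2, dim=0)
    # 2) Adversarially fine-tune the compressed model.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            x_adv = pgd_attack(model, x, y)
            loss = nn.functional.cross_entropy(model(x_adv), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```

Because pruning re-parameterizes each weight as a masked tensor, the pruned channels stay zero throughout the adversarial fine-tuning step, so the compression and the robustness objective are optimized together rather than sequentially undone.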

---

Title: EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval

Abstract: Dense embedding-based retrieval is widely used for semantic search and ranking. However, conventional two-stage approaches, involving contrastive embedding learning followed by approximate nearest neighbor search (ANNS), can suffer from misalignment between these stages. This mismatch degrades retrieval performance. We propose End-to-end Hierarchical Indexing (EHI), a novel method that directly addresses this issue by jointly optimizing embedding generation and ANNS structure. EHI leverages a dual encoder for embedding queries and documents while simultaneously learning an inverted file index (IVF)-style tree structure. To facilitate the effective learning of this discrete structure, EHI introduces dense path embeddings that encode the path traversed by queries and documents within the tree. Extensive evaluations on standard benchmarks, including MS MARCO (Dev set) and TREC DL19, demonstrate EHI's superiority over traditional ANNS indexes. Under the same computational constraints, EHI outperforms existing state-of-the-art methods by +1.45% in MRR@10 on MS MARCO (Dev) and +8.2% in nDCG@10 on TREC DL19, highlighting the benefits of our end-to-end approach.

URL: https://openreview.net/forum?id=GeLLOGsHV9

---

Title: ProFeAT: Projected Feature Adversarial Training for Self-Supervised Learning of Robust Representations

Abstract: The need for abundant labelled data in supervised Adversarial Training (AT) has prompted the use of Self-Supervised Learning (SSL) techniques with AT. However, the direct application of existing SSL methods to adversarial training has been sub-optimal due to the increased training complexity of combining SSL with AT. A recent approach, DeACL, mitigates this by utilizing supervision from a standard SSL teacher in a distillation setting, to mimic supervised AT. However, we find that there is still a large performance gap when compared to supervised adversarial training, specifically on larger models. In this work, we investigate the key reason for this gap and propose Projected Feature Adversarial Training (ProFeAT) to bridge it. We show that the sub-optimal distillation performance is a result of a mismatch in the training objectives of the teacher and student, and propose to use a projection head at the student, which allows it to leverage weak supervision from the teacher while also being able to learn adversarially robust representations that are distinct from the teacher. We further propose appropriate attack and defense losses at the feature and projector, alongside a combination of weak and strong augmentations for the teacher and student respectively, to improve the training data diversity without increasing the training complexity. Through extensive experiments on several benchmark datasets and models, we demonstrate significant improvements in both clean and robust accuracy when compared to existing SSL-AT methods, setting a new state-of-the-art. We further report on-par or improved performance when compared to TRADES, a popular supervised-AT method.

URL: https://openreview.net/forum?id=AUC0Kmn70N

---

Title: A Bag of Tricks for Few-Shot Class-Incremental Learning

Abstract: We present a bag of tricks framework for few-shot class-incremental learning (FSCIL), which is a challenging form of continual learning that involves continuous adaptation to new tasks with limited samples. FSCIL requires both stability and adaptability, i.e., preserving proficiency in previously learned tasks while learning new ones. Our proposed bag of tricks brings together eight key and highly influential techniques that improve stability, adaptability, and overall performance under a unified framework for FSCIL. We organize these tricks into three categories: stability tricks, adaptability tricks, and training tricks. Stability tricks aim to mitigate the forgetting of previously learned classes by enhancing the separation between the embeddings of learned classes and minimizing interference when learning new ones. On the other hand, adaptability tricks focus on the effective learning of new classes. Finally, training tricks improve the overall performance without compromising stability or adaptability. We perform extensive experiments on three benchmark datasets, CIFAR-100, CUB-200, and miniImageNet, to evaluate the impact of our proposed framework. Our detailed analysis shows that our approach substantially improves both stability and adaptability, establishing a new state-of-the-art by outperforming prior works in the area. We believe our method provides a go-to solution and establishes a robust baseline for future research in this area.

URL: https://openreview.net/forum?id=DiyYf1Kcdt

---

Title: Target-conditioned GFlowNet for Structure-based Drug Design

Abstract: Searching the vast chemical space for drug-like molecules that bind with a protein pocket is a challenging task in drug discovery. Recently, structure-based generative models have been introduced which promise to be more efficient by learning to generate molecules for any given protein structure. However, since they learn the distribution of a limited protein-ligand complex dataset, structure-based methods do not yet outperform optimization-based methods that generate binding molecules for just one pocket. To overcome limitations on data while leveraging learning across protein targets, we choose to model the reward distribution conditioned on pocket structure, instead of the training data distribution. We introduce TacoGFN, a Generative Flow Network conditioned on any protein pocket structure to generate molecules with probabilities proportional to its reward. In the generative setting for CrossDocked2020 benchmark, TacoGFN attains a state-of-the-art success rate of 56.0% and -8.44 kcal/mol in median Vina Dock score while improving the generation time by multiple orders of magnitude. Fine-tuning TacoGFN further improves the median Vina Dock score to -10.93 kcal/mol and the success rate to 88.8%, outperforming all optimization-based methods.

URL: https://openreview.net/forum?id=N8cPv95zOU

---

Title: A Survey of Lottery Ticket Hypothesis

Abstract: The Lottery Ticket Hypothesis (LTH) states that a dense neural network model contains a highly sparse subnetwork (i.e., winning tickets) that can achieve even better performance than the original model when trained in isolation. While LTH has been proved both empirically and theoretically in many works, there are still some open issues, such as efficiency and scalability, to be addressed. Also, the lack of open-source frameworks and consensus experimental settings poses a challenge to future research on LTH. For the first time, we examine previous research and studies on LTH from different perspectives. We also discuss issues in existing works and list potential directions for further exploration. This survey provides an in-depth look at the state of LTH.

URL: https://openreview.net/forum?id=wnpuy827Yv

---

Title: The Journey, Not the Destination: How Data Guides Diffusion Models

Abstract: Diffusion models trained on large datasets can synthesize photo-realistic images of remarkable quality and diversity. However, attributing these images back to the training data, that is, identifying specific training examples which caused an image to be generated, remains a challenge. In this paper, we propose a framework that: (i) provides a formal notion of data attribution in the context of diffusion models, and (ii) allows us to counterfactually validate such attributions. Then, we provide a method for computing these attributions efficiently. Finally, we apply our method to find (and evaluate) such attributions for denoising diffusion probabilistic models trained on CIFAR-10 and latent diffusion models trained on MS COCO.

URL: https://openreview.net/forum?id=xBEqNJ605v

---

Title: HiFE: Hierarchical Feature Ensemble Framework for Few-shot Hypotheses Adaptation

Abstract: The process of transferring knowledge from a source domain to a target domain in the absence of source data constitutes a formidable obstacle within the field of source-free domain adaptation, often termed hypothesis adaptation. Conventional methodologies have been dependent on a robustly trained (strong) source hypothesis to encapsulate the knowledge pertinent to the source domain. However, this strong hypothesis is prone to overfitting the source domain, resulting in diminished generalization performance when applied to the target domain. To mitigate this issue, we advocate for the augmentation of transferable source knowledge via the integration of multiple (weak) source models that are underfitting. Furthermore, we propose a novel architectural framework, designated as the Hierarchical Feature Ensemble (HiFE) framework for Few-Shot Hypotheses Adaptation, which amalgamates features from both the strong and intentionally underfit source models. Empirical evidence from our experiments indicates that these weaker models, while not optimal within the source domain context, contribute to an enhanced generalization capacity of the resultant model for the target domain. Moreover, the HiFE framework we introduce demonstrates superior performance, surpassing other leading baselines across a spectrum of few-shot hypothesis adaptation scenarios.

URL: https://openreview.net/forum?id=B6RS6DN0Gt

---

Title: Principal Graph Encoder Embedding and Principal Community Detection

Abstract: In this paper, we introduce the concept of principal communities and design a principal graph encoder embedding method to concurrently detect these communities and achieve vertex embedding. Given a graph adjacency matrix with vertex labels, the method computes a sample score for each community, providing a ranking to measure community importance and estimate a set of principal communities. It then produces a vertex embedding by retaining only the dimensions corresponding to the principal communities. We characterize the theoretical properties of the principal graph encoder embedding on the random graph model and prove that the proposed method preserves sufficient information about the vertex labels. The numerical performance of the proposed method is demonstrated through comprehensive simulated and real-data experiments.

URL: https://openreview.net/forum?id=9hihbE9udx
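For background, the basic (one-hot) graph encoder embedding that this method builds on is only a few lines; the community-scoring step in the sketch below is a placeholder of my own to show where a ranking would plug in, not the paper's sample score:

```python
import numpy as np

def graph_encoder_embedding(A, labels):
    """Basic graph encoder embedding: Z = A @ W, where column k of W is the
    indicator of community k scaled by 1/n_k."""
    n, K = A.shape[0], labels.max() + 1
    W = np.zeros((n, K))
    for k in range(K):
        idx = labels == k
        W[idx, k] = 1.0 / idx.sum()
    return A @ W        # (n, K): vertex i's average connectivity to each community

def principal_communities(Z, labels, top=3):
    """Placeholder ranking: score community k by how much dimension k separates
    class k from the rest (difference of means)."""
    scores = [Z[labels == k, k].mean() - Z[labels != k, k].mean()
              for k in range(Z.shape[1])]
    return np.argsort(scores)[::-1][:top]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(4), 50)
    P = np.full((4, 4), 0.05) + np.diag([0.3, 0.3, 0.05, 0.05])  # 2 strong communities
    A = (rng.random((200, 200)) < P[labels][:, labels]).astype(float)
    A = np.triu(A, 1); A = A + A.T
    Z = graph_encoder_embedding(A, labels)
    print(principal_communities(Z, labels, top=2))
```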

---

Title: A Probabilistic Model behind Self-Supervised Learning

Abstract: In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels. A common task is to classify augmentations or different modalities of the data, which share semantic _content_ (e.g. an object in an image) but differ in _style_ (e.g. the object's location). Many approaches to self-supervised learning have been proposed, e.g. SimCLR, CLIP and VICReg, which have recently gained much attention for their representations achieving downstream performance comparable to supervised learning. However, a theoretical understanding of the mechanism behind self-supervised methods remains elusive. Addressing this, we present a generative latent variable model for self-supervised learning and show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations, providing a unifying theoretical framework for these methods. The proposed model also justifies connections drawn to mutual information and the use of a ``projection head''. Learning representations by fitting the model generatively (termed SimVAE) improves performance over discriminative and other VAE-based methods on simple image benchmarks and significantly narrows the gap between generative and discriminative representation learning in more complex settings. Importantly, as our analysis predicts, SimVAE outperforms self-supervised learning where style information is required, taking an important step toward understanding self-supervised methods and achieving task-agnostic representations.

URL: https://openreview.net/forum?id=QEwz7447tR

---

Title: Intrinsic Biologically-Plausible Adversarial Robustness

Abstract: Artificial Neural Networks (ANNs) trained with Backpropagation (BP) excel in a variety of everyday tasks but have a dangerous vulnerability: inputs with small targeted perturbations, also known as adversarial samples, can drastically disrupt their performance. Adversarial training, a technique in which the training dataset is augmented with exemplary adversarial samples, is proven to mitigate this problem but comes at a high computational cost. In contrast to ANNs, humans are not susceptible to misclassifying these same adversarial samples. Thus, one can postulate that ANNs trained with biologically plausible learning rules might be more robust against adversarial attacks. In this work, we chose the biologically plausible learning algorithm Present the Error to Perturb the Input To modulate Activity (PEPITA) as a case study and investigated this question through a comparative analysis with BP-trained ANNs on various computer vision tasks. We observe that PEPITA has a higher intrinsic adversarial robustness and, when adversarially trained, also has a more favorable natural-vs-adversarial performance trade-off. In particular, for the same natural accuracies on the MNIST task, PEPITA's adversarial accuracies decrease on average only by 0.26% while BP's decrease by 8.05%.

URL: https://openreview.net/forum?id=ESLVKKUBZq

---

Title: Representation Norm Amplification for Out-of-Distribution Detection in Long-Tail Learning

Abstract: Detecting out-of-distribution (OOD) samples is a critical task for reliable machine learning. However, it becomes particularly challenging when the models are trained on long-tailed datasets, as the models often struggle to distinguish tail-class in-distribution samples from OOD samples. We examine the main challenges in this problem by identifying the trade-offs between OOD detection and in-distribution (ID) classification, faced by existing methods. We then introduce our method, called *Representation Norm Amplification* (RNA), which solves this challenge by decoupling the two problems. The main idea is to use the norm of the representation as a new dimension for OOD detection, and to develop a training method that generates a noticeable discrepancy in the representation norm between ID and OOD data, while not perturbing the feature learning for ID classification. Our experiments show that RNA achieves superior performance in both OOD detection and classification compared to the state-of-the-art methods, by 1.70\% and 9.46\% in FPR95 and 2.43\% and 6.87\% in classification accuracy on CIFAR10-LT and ImageNet-LT, respectively.

URL: https://openreview.net/forum?id=z4b4WfvooX

---

Title: Non-backtracking Graph Neural Networks

Abstract: The celebrated message-passing updates for graph neural networks allow large-scale graphs to be represented with local and computationally tractable updates. However, the updates suffer from backtracking, i.e., a message flowing through the same edge twice and revisiting the previously visited node. Since the number of message flows increases exponentially with the number of updates, this redundancy in local updates prevents the graph neural network from accurately recognizing a particular message flow relevant for downstream tasks. In this work, we propose to resolve this redundancy issue via the non-backtracking graph neural network (NBA-GNN), which updates a message without incorporating the message from the previously visited node. We theoretically investigate how NBA-GNN alleviates over-squashing in GNNs, and establish a connection between NBA-GNN and the impressive performance of non-backtracking updates for stochastic block model recovery. Furthermore, we empirically verify the effectiveness of NBA-GNN on the long-range graph benchmark and transductive node classification problems.
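
To illustrate the non-backtracking update in isolation (a toy sketch, not the paper's NBA-GNN architecture), messages can be kept on directed edges, and the message on edge (u, v) aggregates only messages arriving at u from nodes other than v:

```python
import numpy as np

def non_backtracking_step(edges, msgs, node_feat):
    """One toy non-backtracking message-passing step.

    edges: list of directed edges (u, v); msgs has one row per directed edge.
    The new message on (u, v) aggregates messages on edges (w, u) with w != v,
    so a message never flows straight back along the edge it arrived on.
    """
    new_msgs = np.zeros_like(msgs)
    for i, (u, v) in enumerate(edges):
        incoming = [j for j, (w, t) in enumerate(edges) if t == u and w != v]
        agg = msgs[incoming].sum(axis=0) if incoming else np.zeros(msgs.shape[1])
        new_msgs[i] = np.tanh(agg + node_feat[u])   # simple nonlinearity as a stand-in
    return new_msgs
```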

URL: https://openreview.net/forum?id=64HdQKnyTc

---

Title: Set Features for Anomaly Detection

Abstract: This paper proposes to use set features for detecting anomalies in samples that consist of unusual combinations of normal elements. Many leading methods discover anomalies by detecting an unusual part of a sample. For example, state-of-the-art segmentation-based approaches first classify each element of the sample (e.g., image patch) as normal or anomalous and then classify the entire sample as anomalous if it contains anomalous elements. However, such approaches do not extend well to scenarios where the anomalies are expressed by an unusual combination of normal elements. In this paper, we overcome this limitation by proposing set features that model each sample by the distribution of its elements. We compute the anomaly score of each sample with a simple density estimation method over fixed features. Our approach outperforms the previous state-of-the-art in image-level logical anomaly detection and sequence-level time series anomaly detection.
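
The following is a minimal sketch of the set-feature idea under simplifying assumptions (the element features and reference centers are placeholders; the paper's exact feature extraction and density estimator may differ): each sample is summarized by the distribution of its elements over reference centers, and scored with a Gaussian density fit on normal data.

```python
import numpy as np

def set_descriptor(element_feats, centers):
    """Summarize a sample by the empirical distribution of its elements:
    a histogram of nearest reference centers (e.g., k-means centroids
    computed from element-level features of normal training data)."""
    d2 = ((element_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    hist = np.bincount(assign, minlength=len(centers)).astype(float)
    return hist / hist.sum()

def fit_normal_density(train_descriptors):
    """Fit a simple Gaussian to descriptors of normal samples."""
    mu = train_descriptors.mean(axis=0)
    cov = np.cov(train_descriptors, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])          # regularize for invertibility
    return mu, np.linalg.inv(cov)

def anomaly_score(desc, mu, cov_inv):
    """Mahalanobis distance of a sample's set descriptor from the normal density."""
    diff = desc - mu
    return float(diff @ cov_inv @ diff)
```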

URL: https://openreview.net/forum?id=aukOnn7J4M

---

Title: Enhancing Contrastive Clustering with Negative Pair-guided Regularization

Abstract: Contrastive Learning (CL) aims to create effective embeddings for input data by minimizing the distance between positive pairs, i.e., different augmentations or views of the same sample. To avoid degeneracy, CL also employs an auxiliary loss to maximize the discrepancy between negative pairs formed from views of distinct samples. As a self-supervised learning strategy, CL inherently attempts to cluster input data into natural groups. However, an often improper trade-off between the attractive and repulsive forces, induced by positive and negative pairs respectively, can lead to deformed clustering, particularly when the number of clusters $k$ is unknown. To address this, we propose NRCC, a CL-based deep clustering framework that generates cluster-friendly embeddings. NRCC repurposes Stochastic Gradient Hamiltonian Monte Carlo sampling as an approximately invariant data augmentation to curate hard negative pairs that judiciously enhance and balance the two adversarial forces through a regularizer. By preserving the cluster structure in the CL embedding, NRCC retains local density landscapes in lower dimensions through neighborhood-conserving projections. This enables the application of mode-seeking clustering algorithms, typically hindered by high-dimensional CL feature spaces, to achieve exceptional accuracy without needing a predetermined $k$. NRCC's superiority is demonstrated across various datasets with different scales and cluster structures, outperforming 20 state-of-the-art methods.
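
For context only, the attractive/repulsive trade-off mentioned above can be made explicit in a generic InfoNCE-style loss with a weight on the repulsive term; this is an illustrative sketch, not NRCC's SGHMC-based regularizer, and `repulsion_weight` is a hypothetical knob.

```python
import torch
import torch.nn.functional as F

def weighted_infonce(z1, z2, temperature=0.5, repulsion_weight=1.0):
    """Generic contrastive loss on two views z1, z2 of the same batch.
    The diagonal holds positive (attractive) pairs; the log-sum-exp over all
    pairs acts as the repulsive term, scaled here by repulsion_weight."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / temperature      # (n, n) cross-view similarities
    attract = sim.diag()                 # positive pairs pull views together
    repel = torch.logsumexp(sim, dim=1)  # negatives push distinct samples apart
    return (-attract + repulsion_weight * repel).mean()
```

With `repulsion_weight=1.0` this reduces to the standard InfoNCE objective over cross-view pairs.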

URL: https://openreview.net/forum?id=y4VYzqQ4Me

---

Title: Explainability of Vision Transformers: A Comprehensive Review and New Perspectives

Abstract: Transformers have had a significant impact on natural language processing and have recently demonstrated their potential in computer vision. They have shown promising results over convolutional neural networks in fundamental computer vision tasks. However, the scientific community has not fully grasped the inner workings of vision transformers, nor the basis for their decision-making, which underscores the importance of explainability methods. Understanding how these models arrive at their decisions not only improves their performance but also builds trust in AI systems. This study explores different explainability methods proposed for vision transformers and presents a taxonomy for organizing them according to their motivations, structures, and application scenarios. In addition, it provides a comprehensive review of evaluation criteria that can be used for comparing explanation results, as well as explainability tools and frameworks. Finally, the paper highlights essential but unexplored aspects that can enhance the explainability of vision transformers, and suggests promising directions for future research.

URL: https://openreview.net/forum?id=MkPuV8qw8A

---

Title: Invariance & Causal Representation Learning: Prospects and Limitations

Abstract: Learning causal representations without assumptions is known to be fundamentally impossible, thus establishing the need for suitable inductive biases. At the same time, the invariance of causal mechanisms has emerged as a promising principle to address the challenge of out-of-distribution prediction which machine learning models face. In this work, we explore this invariance principle as a candidate assumption to achieve identifiability of causal representations. While invariance has been utilized for inference in settings where the causal variables are observed, theoretical insights into this principle in the context of causal representation learning are largely missing. We assay the connection between invariance and causal representation learning by establishing impossibility results which show that invariance alone is insufficient to identify latent causal variables. Together with practical considerations, we use our results to reflect more generally on the commonly used notion of identifiability in causal representation learning and on potential adaptations of this goal moving forward.

URL: https://openreview.net/forum?id=lpOC6s4BcM

---

Title: DrGNN: Deep Residual Graph Neural Network with Contrastive Learning

Abstract: Recent studies reveal that deep representation learning models without proper regularization can suffer from the _dimensional collapse_ problem, i.e., representation vectors span only a lower-dimensional space. In deep graph representation learning, the phenomenon in which node representations become indistinguishable and even shrink to a constant vector is called _oversmoothing_. Based on an analysis of the rank of node representations, we find that representation oversmoothing and dimensional collapse are closely related in deep graph neural networks, and that the oversmoothing problem can be interpreted as dimensional collapse of the node representation matrix. To address dimensional collapse and oversmoothing together in deep graph neural networks, we first show that vanilla _residual connections_ and _contrastive learning_ produce sub-optimal outcomes because they ignore the structural constraints of graph data. Motivated by this, we propose a novel graph neural network named DrGNN to alleviate the oversmoothing issue from the perspective of addressing dimensional collapse. Specifically, in DrGNN we design a topology-preserving residual connection for graph neural networks that keeps the low-rank hidden representations close to the full-rank input features. We also propose structure-guided contrastive learning to ensure that only close neighbors sharing similar local connections have similar representations. Empirical experiments on multiple real-world datasets demonstrate that DrGNN outperforms state-of-the-art deep graph representation baseline algorithms.
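
As a rough illustration of a residual connection back to the input features, a common remedy for oversmoothing (sketched here under our own simplifications rather than DrGNN's topology-preserving design):

```python
import torch
import torch.nn as nn

class ResidualGraphLayer(nn.Module):
    """Illustrative graph convolution layer that mixes the propagated hidden
    state with the full-rank input features X0 to counteract rank collapse;
    DrGNN's topology-preserving residual connection is more elaborate."""
    def __init__(self, dim, alpha=0.1):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.alpha = alpha              # weight of the residual to X0

    def forward(self, A_norm, H, X0):
        # A_norm: (n, n) normalized adjacency; H: current node features; X0: input features.
        propagated = self.lin(A_norm @ H)
        return torch.relu((1 - self.alpha) * propagated + self.alpha * X0)
```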

URL: https://openreview.net/forum?id=frb6sLbACS

---

Title: Risk Bounds for Mixture Density Estimation on Compact Domains via the $h$-Lifted Kullback--Leibler Divergence

Abstract: We consider the problem of estimating probability density functions based on sample data, using a finite mixture of densities from some component class. To this end, we introduce the $h$-lifted Kullback--Leibler (KL) divergence as a generalization of the standard KL divergence and a criterion for conducting risk minimization. Under a compact support assumption, we prove an $O(1/\sqrt{n})$ bound on the expected estimation error when using the $h$-lifted KL divergence, which extends the results of Rakhlin et al. (2005, ESAIM: Probability and Statistics, Vol. 9) and Li and Barron (1999, Advances in Neural Information Processing Systems, Vol. 12) to permit the risk bounding of density functions that are not strictly positive. We develop a procedure for the computation of the corresponding maximum $h$-lifted likelihood estimators ($h$-MLLEs) using the Majorization-Maximization framework and provide experimental results in support of our theoretical bounds.
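
For reference, the Majorization-Maximization step mentioned above follows the standard MM template (sketched below in general form; the paper's specific surrogate for the $h$-lifted likelihood is not reproduced here):

```latex
% Generic MM template for maximizing an objective \ell(\theta):
% build a surrogate Q that lower-bounds \ell and is tight at the current iterate,
% then maximize the surrogate; the objective is guaranteed not to decrease.
\begin{align*}
  &Q(\theta \mid \theta^{(t)}) \le \ell(\theta) \quad \forall\, \theta,
  \qquad Q(\theta^{(t)} \mid \theta^{(t)}) = \ell(\theta^{(t)}),\\
  &\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})
  \;\Longrightarrow\;
  \ell(\theta^{(t+1)}) \ge Q(\theta^{(t+1)} \mid \theta^{(t)})
  \ge Q(\theta^{(t)} \mid \theta^{(t)}) = \ell(\theta^{(t)}).
\end{align*}
```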

URL: https://openreview.net/forum?id=lAKvQO4vHj

---

Title: An Online Caching Mechanism for Deep Neural Networks

Abstract: Deep Neural Networks (DNNs) have become an essential component in many application domains, including web-based services. A variety of these services require high throughput and (close to) real-time responses, for instance, to react to users' requests or to process a stream of incoming data on time. However, the trend in DNN design is towards larger models with many layers and parameters to achieve more accurate results. Although these models are often pre-trained, the computational complexity of such large models can still be significant, hindering low-latency inference.
In this paper, we propose an end-to-end automated caching solution to improve the performance of DNN-based services in terms of computational complexity and inference latency. Our method adopts the ideas of self-distillation of DNN models and early exits. The proposed solution is an automated online layer-caching mechanism that allows a large model to exit early at inference time if the cache model at one of the early exits is confident enough for the final prediction. One of the main contributions of this paper is that we implement the idea as online caching, meaning that the cache models do not need access to training data and operate solely on the incoming data at run-time, making the approach suitable for applications that use pre-trained models.
The results of our experiments on two downstream tasks (image classification and object detection) show that, on average, caching can reduce the computational complexity of these services by up to 58\% (in terms of FLOP count) and improve their inference latency by up to 46\%, with little to no reduction in accuracy. Our approach also outperforms existing approaches, particularly when applied to complex models and larger datasets, achieving remarkable latency reductions of 51.6\% and 30.4\%, exceeding the Gati and BranchyNet methods on CIFAR100-ResNet50. This improvement is accompanied by increases of 2.92\% and 0.87\% in mean precision, further highlighting the superiority of our approach in demanding scenarios.
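
A minimal sketch of the early-exit mechanism underlying such caching (our own simplification; `backbone_blocks`, `cache_heads`, `final_head`, and `tau` are assumed placeholders rather than the paper's implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cached_inference(backbone_blocks, cache_heads, final_head, x, tau=0.9):
    """Run the backbone block by block; after each block, a lightweight cache
    head predicts from the intermediate activation. If its softmax confidence
    exceeds tau for the whole batch, exit early and skip the remaining layers."""
    h = x
    for block, head in zip(backbone_blocks, cache_heads):
        h = block(h)
        probs = F.softmax(head(h), dim=1)
        conf, pred = probs.max(dim=1)
        if conf.min() >= tau:        # every sample is confident enough
            return pred
    return final_head(h).argmax(dim=1)   # fall back to the full model
```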

URL: https://openreview.net/forum?id=h4rUKKfl5S

---

Title: Highway Graph to Accelerate Reinforcement Learning

Abstract: Reinforcement Learning (RL) algorithms often suffer from low training efficiency. A strategy to mitigate this issue is to employ a model-based planning algorithm, such as Monte Carlo Tree Search (MCTS) or Value Iteration (VI), on top of an environment model. The major limitation of VI is the need to iterate over a large tensor of shape $|\mathcal{S}|\times |\mathcal{A}| \times |\mathcal{S}|$, where $\mathcal{S}/\mathcal{A}$ denotes the state/action space. This process iteratively updates the value of the preceding state $s_{t-1}$ based on the state $s_t$, one step at a time via value propagation, which still leads to intensive computation. We focus on improving the training efficiency of RL algorithms by improving the efficiency of the value learning process. For deterministic environments with discrete state and action spaces, a non-branching sequence of transitions on the sampled empirical state-transition graph can bring the agent directly from $s_0$ to $s_T$ without deviating at intermediate states; we call such a sequence a \textit{highway}. On such non-branching highways, the value updates can be merged into a single step instead of iterating the value step by step. Based on this observation, we propose a novel graph structure, named the \textit{highway graph}, to model state transitions. The highway graph compresses the transition model into a concise graph in which edges can represent multiple state transitions, supporting value propagation across multiple time steps in each iteration. We thus obtain a more efficient value learning approach by running the VI algorithm on highway graphs. By integrating the highway graph into RL (as a model-based off-policy RL method), training can be remarkably accelerated in the early stages (within 1 million frames). Moreover, a deep neural network-based agent can be trained using the highway graph, resulting in better generalization and lower storage costs. Comparisons against various baselines on four categories of environments reveal that our method outperforms both representative and novel model-free and model-based RL algorithms, demonstrating 10 to more than 150 times greater training efficiency while maintaining an equal or superior expected return, as confirmed by carefully conducted analyses.
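
A toy sketch of value iteration over highway-style edges (our own illustration under simplifying assumptions, not the paper's implementation): each edge summarizes a non-branching $k$-step transition with its accumulated discounted reward, so a single backup propagates value across $k$ time steps.

```python
def highway_value_iteration(edges, num_states, gamma=0.99, iters=100):
    """Value iteration on a deterministic highway-style graph.

    edges: list of (s, s_next, reward, k) where one edge summarizes a
    non-branching k-step transition sequence from s to s_next with the
    reward accumulated (already discounted) along the way.
    """
    V = [0.0] * num_states
    for _ in range(iters):
        new_V = [float("-inf")] * num_states
        for s, s_next, reward, k in edges:
            # One backup jumps k time steps at once instead of k separate updates.
            new_V[s] = max(new_V[s], reward + (gamma ** k) * V[s_next])
        # States with no outgoing edges keep their previous value.
        V = [nv if nv != float("-inf") else v for nv, v in zip(new_V, V)]
    return V
```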

URL: https://openreview.net/forum?id=3mJZfL77WM

---

Title: Data Valuation in the Absence of a Reliable Validation Set

Abstract: Data valuation plays a pivotal role in ensuring data quality and equitably compensating data contributors. Existing game-theoretic data valuation techniques mostly rely on the availability of a high-quality validation set for their efficacy. However, obtaining a clean validation set drawn from the test distribution may not be feasible in practice. In this work, we show that the choice of validation set can significantly impact the final data value scores. To mitigate this, we introduce a general paradigm that converts a traditional validation-based game-theoretic data valuation method into a validation-free alternative. Specifically, we utilize the cross-validation error as a surrogate for the model's performance on a validation set. As computing the cross-validation error can be computationally expensive, we propose using the cross-validation error of a kernel regression model as an effective and efficient surrogate for the true performance score on the population. We compare the performance of the validation-free variants of existing data valuation techniques with their original validation-based counterparts. Our results indicate that the validation-free variants generally match, and often significantly surpass, the performance of their validation-based counterparts.
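
As a hedged sketch of the validation-free utility described above (placeholder model and hyperparameters, not the paper's exact estimator), the k-fold cross-validation score of a kernel regression model trained on a candidate data subset can stand in for held-out validation performance inside a game-theoretic valuation scheme:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

def validation_free_utility(X_subset, y_subset, folds=5):
    """Utility of a data subset without a validation set: the cross-validation
    score of a kernel regression model fit only on that subset."""
    model = KernelRidge(kernel="rbf", alpha=1.0)
    scores = cross_val_score(model, X_subset, y_subset,
                             cv=folds, scoring="neg_mean_squared_error")
    return scores.mean()   # higher (less negative MSE) means a more valuable subset
```

Such a utility can then replace validation-set accuracy inside, e.g., a Shapley-style valuation over data subsets.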

URL: https://openreview.net/forum?id=xBORyL316c

---
