Weekly TMLR digest for Jan 21, 2024


TMLR

Jan 20, 2024, 7:00:12 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

Yi Shen, Pan Xu, Michael Zavlanos

https://openreview.net/forum?id=NmpjDHWIvg

---


Accepted papers
===============


Title: Variational autoencoder with weighted samples for high-dimensional non-parametric adaptive importance sampling

Authors: Julien Demange-Chryst, Francois Bachoc, Jérôme Morio, Timothé Krauth

Abstract: Adaptive importance sampling is a well-known family of algorithms for density approximation, generation and Monte Carlo integration including rare event estimation. The main common denominator of this family of algorithms is to perform density estimation with weighted samples at each iteration. However, the classical existing methods to do so, such as kernel smoothing or approximation by a Gaussian distribution, suffer from the curse of dimensionality and/or a lack of flexibility. Both are limitations in high dimension and when we do not have any prior knowledge on the form of the target distribution, such as its number of modes. Variational autoencoders are probabilistic tools able to represent high-dimensional data with fidelity in a lower-dimensional space. They constitute a parametric family of distributions that is robust to the dimension and, since they are based on deep neural networks, flexible enough to be considered as non-parametric models. In this paper, we propose to use a variational autoencoder as the auxiliary importance sampling distribution by extending the existing framework to weighted samples. We integrate the proposed procedure into existing adaptive importance sampling algorithms and illustrate its practical interest on diverse examples.
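
The weighted density-estimation step the abstract refers to can be illustrated with the classical Gaussian-auxiliary baseline (the approach the paper generalizes with a VAE). A minimal sketch on a toy rare-event problem; the target, threshold, and settings are hypothetical and chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
threshold = 3.0  # rare event: P(X > 3) for X ~ N(0, 1), about 1.3499e-3

mu, sigma = 0.0, 1.0  # Gaussian auxiliary distribution (classical baseline)
for _ in range(10):   # adaptation iterations
    x = rng.normal(mu, sigma, size=10_000)
    # importance weight p(x)/q(x) times the rare-event indicator
    log_w = -0.5 * x**2 + 0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma)
    w = np.exp(log_w) * (x > threshold)
    # density-estimation step: refit the auxiliary from the weighted samples
    mu = np.average(x, weights=w + 1e-300)
    sigma = np.sqrt(np.average((x - mu) ** 2, weights=w + 1e-300)) + 1e-6

estimate = w.mean()
print(estimate)  # approximately the true tail probability 1.3499e-3
```

The refit concentrates the auxiliary on the rare-event region; the paper replaces this Gaussian refit with a VAE trained on weighted samples to stay flexible in high dimension.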

URL: https://openreview.net/forum?id=nzG9KGssSe

---

Title: Separability Analysis for Causal Discovery in Mixture of DAGs

Authors: Burak Varici, Dmitriy Katz, Dennis Wei, Prasanna Sattigeri, Ali Tajer

Abstract: Directed acyclic graphs (DAGs) are effective for compactly representing causal systems and specifying the causal relationships among the system's constituents. Specifying such causal relationships in some systems requires a mixture of multiple DAGs -- a single DAG is insufficient. Some examples include time-varying causal systems or aggregated subgroups of a population. Recovering the causal structure of the systems represented by single DAGs is investigated extensively, but it remains mainly open for the systems represented by a mixture of DAGs. A major difference between single- versus mixture-DAG recovery is the existence of node pairs that are separable in the individual DAGs but become inseparable in their mixture. This paper provides the theoretical foundations for analyzing such inseparable node pairs. Specifically, the notion of \emph{emergent edges} is introduced to represent such inseparable pairs that do not exist in the single DAGs but emerge in their mixtures. Necessary conditions for identifying the emergent edges are established. Operationally, these conditions serve as sufficient conditions for separating a pair of nodes in the mixture of DAGs. These results are further extended, and matching necessary and sufficient conditions for identifying the emergent edges in tree-structured DAGs are established. Finally, a novel graphical representation is formalized to specify these conditions, and an algorithm is provided for inferring the learnable causal relations.

URL: https://openreview.net/forum?id=ALRWXT1RLZ

---

Title: Unleashing the Potential of Acquisition Functions in High-Dimensional Bayesian Optimization

Authors: Jiayu Zhao, Renyu Yang, SHENGHAO QIU, Zheng Wang

Abstract: Bayesian optimization (BO) is widely used to optimize expensive-to-evaluate black-box functions. It first builds a surrogate for the objective and quantifies the uncertainty in that surrogate. It then decides where to sample by maximizing an acquisition function (AF) defined by the surrogate model. However, when dealing with high-dimensional problems, finding the global maximum of the AF becomes increasingly challenging. In such cases, the manner in which the AF maximizer is initialized plays a pivotal role. An inappropriate initialization can severely limit the potential of the AF.

This paper investigates a largely understudied problem concerning the impact of AF maximizer initialization on exploiting AFs' capability. Our large-scale empirical study shows that the widely used random initialization strategy may fail to harness the potential of an AF. Based on this observation, we propose a better initialization approach by employing multiple heuristic optimizers to leverage the historical data of black-box optimization to generate initial points for an AF maximizer. We evaluate our approach with a variety of heavily studied synthetic test functions and real-world applications. Experimental results show that our techniques, while simple, can significantly enhance the standard BO and outperform state-of-the-art methods by a large margin in most test cases.
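
As a toy illustration of why AF-maximizer initialization matters, the following sketch seeds the maximizer with both random restarts and the best historical observations, then keeps the best local optimum. The hand-rolled 1-D GP surrogate and expected-improvement AF are assumptions for illustration, not the authors' setup:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def f(x):
    # toy multimodal black-box objective (maximization)
    return np.sin(3 * x) + 0.5 * np.cos(7 * x)

# Fit a small GP surrogate (RBF kernel) to historical observations.
X = rng.uniform(0, 4, size=8)
y = f(X)

def k(a, b, ls=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

L = np.linalg.cholesky(k(X, X) + 1e-8 * np.eye(len(X)))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

def gp(xq):
    Ks = k(np.atleast_1d(xq), X)
    v = np.linalg.solve(L, Ks.T)
    var = np.clip(1.0 - np.sum(v**2, axis=0), 1e-12, None)
    return Ks @ alpha, np.sqrt(var)

def neg_ei(xq, best):
    # negative expected improvement: the AF to be maximized
    m, s = gp(xq)
    z = (m - best) / s
    return -(s * (z * norm.cdf(z) + norm.pdf(z)))[0]

# Initialization of the AF maximizer: mix random restarts with starts
# at the top historical observations, then keep the best local optimum.
starts = np.concatenate([rng.uniform(0, 4, 5), X[np.argsort(y)[-3:]]])
res = min((minimize(neg_ei, s0, args=(y.max(),), bounds=[(0, 4)])
           for s0 in starts), key=lambda r: r.fun)
x_next = float(res.x[0])
```

Purely random starts can all land in low-EI basins; mixing in history-informed starts is the intuition behind the paper's heuristic-optimizer initialization.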

URL: https://openreview.net/forum?id=0CM7Hfsy61

---

Title: On the Adversarial Robustness of Camera-based 3D Object Detection

Authors: Shaoyuan Xie, Zichao Li, Zeyu Wang, Cihang Xie

Abstract: In recent years, camera-based 3D object detection has gained widespread attention for its ability to achieve high performance with low computational cost. However, the robustness of these methods to adversarial attacks has not been thoroughly examined, especially when considering their deployment in safety-critical domains like autonomous driving. In this study, we conduct the first comprehensive investigation of the robustness of leading camera-based 3D object detection approaches under various adversarial conditions. We systematically analyze the resilience of these models under two attack settings: white-box and black-box; focusing on two primary objectives: classification and localization. Additionally, we delve into two types of adversarial attack techniques: pixel-based and patch-based. Our experiments yield four interesting findings: (a) bird's-eye-view-based representations exhibit stronger robustness against localization attacks; (b) depth-estimation-free approaches have the potential to show stronger robustness; (c) accurate depth estimation effectively improves robustness for depth-estimation-based methods; (d) incorporating multi-frame benign inputs can effectively mitigate adversarial attacks. We hope our findings can steer the development of future camera-based object detection models with enhanced adversarial robustness. The code is available at: https://github.com/Daniel-xsy/BEV-Attack.
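
A minimal sketch of the pixel-based white-box setting on a toy logistic "detector head" (purely illustrative, with hypothetical weights and inputs; not the paper's camera-based models or attack code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a detector's classification head: a logistic model on
# a flattened "image". FGSM perturbs each pixel by eps in the direction of
# the sign of the loss gradient (a single-step pixel-based white-box attack).
w = rng.normal(size=64)
b = 0.0
x = rng.uniform(0, 1, size=64)  # clean input with true label y = 1
y = 1.0

def prob(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# gradient of the logistic loss with respect to the input pixels
grad_x = (prob(x) - y) * w
eps = 0.05
x_adv = np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)  # stay in pixel range

print(prob(x), prob(x_adv))  # the model's confidence in y=1 drops
```

Patch-based attacks differ only in restricting the perturbation to a contiguous region instead of every pixel.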

URL: https://openreview.net/forum?id=6SofFlwhEv

---

Title: AmbientFlow: Invertible generative models from incomplete, noisy measurements

Authors: Varun A. Kelkar, Rucha Deshpande, Arindam Banerjee, Mark Anastasio

Abstract: Generative models have gained popularity for their potential applications in imaging science, such as image reconstruction, posterior sampling and data sharing. Flow-based generative models are particularly attractive due to their ability to tractably provide exact density estimates along with fast, inexpensive and diverse samples. Training such models, however, requires a large, high quality dataset of objects. In applications such as computed imaging, it is often difficult to acquire such data due to requirements such as long acquisition time or high radiation dose, while acquiring noisy or partially observed measurements of these objects is more feasible. In this work, we propose AmbientFlow, a framework for learning flow-based generative models directly from noisy and incomplete data. Using variational Bayesian methods, a novel framework for establishing flow-based generative models from noisy, incomplete data is proposed. Extensive numerical studies demonstrate the effectiveness of AmbientFlow in learning the object distribution. The utility of AmbientFlow in a downstream inference task of image reconstruction is demonstrated.

URL: https://openreview.net/forum?id=txpYITR8oa

---

Title: Multi-Horizon Representations with Hierarchical Forward Models for Reinforcement Learning

Authors: Trevor McInroe, Lukas Schäfer, Stefano V Albrecht

Abstract: Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may cause learning inefficiencies if important environmental changes take many steps to manifest. We propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns multiple representations via a hierarchy of forward models that learn to communicate and an ensemble of $n$-step critics that all operate at varying magnitudes of step skipping. We evaluate HKSL in a suite of 30 robotic control tasks with and without distractors and a task of our creation. We find that HKSL either converges to higher or optimal episodic returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency.

URL: https://openreview.net/forum?id=fovUNTilp9

---

Title: Neural Task Synthesis for Visual Programming

Authors: Victor-Alexandru Pădurean, Georgios Tzannetos, Adish Singla

Abstract: Generative neural models hold great promise in enhancing programming education by synthesizing new content. We seek to design neural models that can automatically generate programming tasks for a given specification in the context of visual programming domains. Despite the recent successes of large generative models like GPT-4, our initial results show that these models are ineffective in synthesizing visual programming tasks and struggle with logical and spatial reasoning. We propose a novel neuro-symbolic technique, NeurTaskSyn, that can synthesize programming tasks for a specification given in the form of desired programming concepts exercised by its solution code and constraints on the visual task. NeurTaskSyn has two components: the first component is trained via imitation learning procedure to generate possible solution codes, and the second component is trained via reinforcement learning procedure to guide an underlying symbolic execution engine that generates visual tasks for these codes. We demonstrate the effectiveness of NeurTaskSyn through an extensive empirical evaluation and a qualitative study on reference tasks taken from the Hour of Code: Classic Maze challenge by Code.org and the Intro to Programming with Karel course by CodeHS.com.

URL: https://openreview.net/forum?id=aYkYajcJDN

---

Title: Federated Classification in Hyperbolic Spaces via Secure Aggregation of Convex Hulls

Authors: Saurav Prakash, Jin Sima, Chao Pan, Eli Chien, Olgica Milenkovic

Abstract: Hierarchical and tree-like data sets arise in many relevant applications, including language processing, graph data mining, phylogeny and genomics. It is known that tree-like data cannot be embedded into Euclidean spaces of finite dimension with small distortion, and that this problem can be mitigated through the use of hyperbolic spaces. When such data also has to be processed in a distributed and privatized setting, it becomes necessary to work with new federated learning methods tailored to hyperbolic spaces. As an initial step towards the development of the field of federated learning in hyperbolic spaces, we propose the first known approach to federated classification in hyperbolic spaces. Our contributions are as follows. First, we develop distributed versions of convex SVM classifiers for Poincar\'e discs. In this setting, the information conveyed from clients to the global classifier consists of the convex hulls of clusters present in individual client data. Second, to avoid label switching issues, we introduce a number-theoretic approach for label recovery based on the so-called integer $B_h$ sequences. Third, we compute the complexity of the convex hulls in hyperbolic spaces to assess the extent of data leakage; at the same time, in order to limit the communication cost for the hulls, we propose a new quantization method for the Poincar\'e disc coupled with Reed-Solomon-like encoding. Fourth, at the server level, we introduce a new approach for aggregating convex hulls of the clients based on balanced graph partitioning. We test our method on a collection of diverse data sets, including hierarchical single-cell RNA-seq data from different patients distributed across different repositories that have stringent privacy constraints. The classification accuracy of our method is up to $\sim11\%$ better than its Euclidean counterpart, demonstrating the importance of privacy-preserving learning in hyperbolic spaces. Our implementation of the proposed method is available at \url{https://github.com/sauravpr/hyperbolic_federated_classification}.
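
For readers unfamiliar with the geometry involved, a small sketch of the standard Poincaré-disc distance (the textbook formula, not code from the paper) shows why near-boundary points behave so differently from Euclidean intuition:

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points inside the open unit disc."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    num = 2.0 * np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u**2)) * (1.0 - np.sum(v**2))
    return np.arccosh(1.0 + num / den)

# Points near the boundary are far apart hyperbolically even when their
# Euclidean gap is small -- the property that lets tree-like data embed
# with low distortion.
a, b = np.array([0.95, 0.0]), np.array([0.0, 0.95])
print(poincare_distance(a, b))  # large hyperbolic distance
print(np.linalg.norm(a - b))    # Euclidean distance, about 1.34
```

Classifiers, convex hulls, and quantizers on the disc must all respect this metric rather than the Euclidean one.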

URL: https://openreview.net/forum?id=umggDfMHha

---

Title: Personalized Algorithmic Recourse with Preference Elicitation

Authors: Giovanni De Toni, Paolo Viappiani, Stefano Teso, Bruno Lepri, Andrea Passerini

Abstract: Algorithmic Recourse (AR) is the problem of computing a sequence of actions that -- once performed by a user -- overturns an undesirable machine decision. It is paramount that the sequence of actions does not require too much effort for users to implement. Yet, most approaches to AR assume that actions cost the same for all users, and thus may recommend unfairly expensive recourse plans to certain users. Prompted by this observation, we introduce PEAR, the first human-in-the-loop approach capable of providing personalized algorithmic recourse tailored to the needs of any end-user. PEAR builds on insights from Bayesian Preference Elicitation to iteratively refine an estimate of the costs of actions by asking choice set queries to the target user. The queries themselves are computed by maximizing the Expected Utility of Selection, a principled measure of information gain accounting for uncertainty on both the cost estimate and the user's responses. PEAR integrates elicitation into a Reinforcement Learning agent coupled with Monte Carlo Tree Search to quickly identify promising recourse plans. Our empirical evaluation on real-world datasets highlights how PEAR produces high-quality personalized recourse in only a handful of iterations.

URL: https://openreview.net/forum?id=8sg2I9zXgO

---

Title: Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

Authors: Yi Shen, Pan Xu, Michael Zavlanos

Abstract: Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy values assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined based on the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with varying support and lacks awareness of the geometry of the distribution support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to over-fitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally computationally more expensive compared to KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite sample complexity and iteration complexity for our proposed method. We further validate our approach using a public dataset that was recorded in a randomized stroke trial.
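
The support issue with KL uncertainty sets can be seen in a few lines (illustrative only; scipy's 1-D `wasserstein_distance` is used as a stand-in for the paper's DRO machinery):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two empirical distributions with disjoint supports: their KL divergence
# is infinite (no overlap), so a KL ball around p can never contain q,
# while the Wasserstein distance stays finite and reflects geometry.
p_support, p_mass = np.array([0.0, 1.0]), np.array([0.5, 0.5])
q_support, q_mass = np.array([0.1, 1.1]), np.array([0.5, 0.5])

w_near = wasserstein_distance(p_support, q_support, p_mass, q_mass)
print(w_near)  # 0.1: each atom moves by 0.1

# A farther-away q is correspondingly farther in Wasserstein distance,
# even though KL is equally (infinitely) large in both cases.
w_far = wasserstein_distance(p_support, np.array([5.0, 6.0]), p_mass, q_mass)
print(w_far)  # 5.0
```

This geometry-awareness is the motivation for swapping the KL ball for a Wasserstein ball in the uncertainty set.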

URL: https://openreview.net/forum?id=NmpjDHWIvg

---

Title: A Fully Decentralized Surrogate for Multi-Agent Policy Optimization

Authors: Kefan Su, Zongqing Lu

Abstract: The study of fully decentralized learning or independent learning in cooperative multi-agent reinforcement learning has a history of decades. Recent empirical studies have shown that independent PPO (IPPO) can achieve good performance, comparable to or even better than the methods of centralized training with decentralized execution, in several benchmarks. However, a decentralized actor-critic algorithm with a convergence guarantee is still an open problem. In this paper, we propose decentralized policy optimization (DPO), a decentralized actor-critic algorithm with monotonic improvement and convergence guarantee. We derive a novel decentralized surrogate for policy optimization such that the monotonic improvement of joint policy can be guaranteed by each agent independently optimizing the surrogate. For practical implementation, this decentralized surrogate can be realized by two adaptive coefficients for policy optimization at each agent. Empirically, we evaluate DPO, IPPO, and independent Q-learning (IQL) in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces, as well as fully and partially observable environments. The results show DPO outperforms both IPPO and IQL in most tasks, which serves as evidence for our theoretical results. The code is available at https://github.com/PKU-RL/DPO.

URL: https://openreview.net/forum?id=MppUW90uU2

---

Title: A Globally Convergent Algorithm for Neural Network Parameter Optimization Based on Difference-of-Convex Functions

Authors: Daniel Tschernutter, Mathias Kraus, Stefan Feuerriegel

Abstract: We propose an algorithm for optimizing the parameters of single hidden layer neural networks. Specifically, we derive a blockwise difference-of-convex (DC) functions representation of the objective function. Based on the latter, we propose a block coordinate descent (BCD) approach that we combine with a tailored difference-of-convex functions algorithm (DCA). We prove global convergence of the proposed algorithm. Furthermore, we mathematically analyze the convergence rate of parameters and the convergence rate in value (i.e., the training loss). We give conditions under which our algorithm converges linearly or even faster depending on the local shape of the loss function. We confirm our theoretical derivations numerically and compare our algorithm against state-of-the-art gradient-based solvers in terms of both training loss and test loss.
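
A toy 1-D instance of the DCA linearize-and-solve step (not the paper's blockwise algorithm; the functions g, h and the closed-form subproblem are chosen purely for illustration):

```python
import numpy as np

# DCA on f(x) = g(x) - h(x) with g(x) = x**4 and h(x) = 2*x**2 (both convex,
# so f is a difference of convex functions). Each iteration linearizes h at
# the current iterate and solves the resulting convex subproblem exactly:
#   x_{k+1} = argmin_x x**4 - h'(x_k) * x = (h'(x_k) / 4) ** (1/3)
x = 0.5
for _ in range(50):
    c = 4.0 * x           # h'(x_k)
    x = np.cbrt(c / 4.0)  # closed-form solution of the convex subproblem

f_val = x**4 - 2 * x**2
print(x, f_val)  # converges to the stationary point x = 1 with f = -1
```

Each DCA step decreases f monotonically, which is the mechanism behind the global convergence guarantee described in the abstract.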

URL: https://openreview.net/forum?id=EDqCY6ihbr

---

Title: Online Reference Tracking For Linear Systems with Unknown Dynamics and Unknown Disturbances

Authors: Nariman Niknejad, Farnaz Adib Yaghmaie, Hamidreza Modares

Abstract: This paper presents an online learning mechanism to address the challenge of state tracking for unknown linear systems under general adversarial disturbances. The reference trajectory is assumed to be generated by unknown exosystem dynamics, which relaxes the common assumption of known dynamics for exosystems. Learning a tracking control policy for unknown systems with unknown exosystem dynamics under general disturbances is challenging and surprisingly unsettled. To face this challenge, the presented online learning algorithm has two stages: In the first stage, an algorithm identifies the dynamics of the uncertain system, and in the second stage, an online parametrized memory-augmented controller accounts for the identification error, unknown exosystem dynamics as well as disturbances. The controller's parameters are learned to optimize a general convex cost function, and learning the control parameters is formulated as an online convex optimization problem. This approach uses the memory of previous disturbances and reference values to capture their effects on performance over time. Moreover, it implicitly learns the dynamics of the exosystems. The algorithm enables online tuning of controller parameters to achieve state tracking and disturbance rejection while minimizing general convex costs. It is shown that the algorithm achieves a policy regret of $\mathcal{O}({T}^{\frac{2}{3}})$. In the simulation results, the performance of the presented tracking algorithm is compared with the certainty equivalent $H_{\infty}$-control and linear quadratic regulator.

URL: https://openreview.net/forum?id=pfbVayaUMc

---

Title: DINOv2: Learning Robust Visual Features without Supervision

Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski

Abstract: The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP, on most of the benchmarks at image and pixel levels.

URL: https://openreview.net/forum?id=a68SUt6zFt

---

Title: Boomerang: Local sampling on image manifolds using diffusion models

Authors: Lorenzo Luzi, Paul M Mayer, Josue Casco-Rodriguez, Ali Siahkoohi, Richard Baraniuk

Abstract: The inference stage of diffusion models can be seen as running a reverse-time diffusion stochastic differential equation, where samples from a Gaussian latent distribution are transformed into samples from a target distribution that usually reside on a low-dimensional manifold, e.g., an image manifold. The intermediate values between the initial latent space and the image manifold can be interpreted as noisy images, with the amount of noise determined by the forward diffusion process noise schedule. We utilize this interpretation to present Boomerang, an approach for local sampling of image manifolds exploiting the reverse diffusion process dynamics. As implied by its name, Boomerang local sampling involves adding noise to an input image, moving it closer to the latent space, and then mapping it back to the image manifold through a partial reverse diffusion process. Thus, Boomerang generates images on the manifold that are ``similar,'' but nonidentical, to the original input image. We can control the proximity of the generated images to the original by adjusting the amount of noise added. Furthermore, due to the stochastic nature of the partial reverse diffusion process in Boomerang, the generated images display a certain degree of stochasticity, allowing us to obtain ample local samples from the manifold without encountering any duplicates. Boomerang offers the flexibility to work seamlessly with any pretrained diffusion model, such as Stable Diffusion, without necessitating any adjustments to the reverse diffusion process. We present three applications for local sampling using Boomerang. First, we provide a framework for constructing privacy-preserving datasets having controllable degrees of anonymity. Second, we show that using Boomerang for data augmentation increases generalization performance and outperforms state-of-the-art synthetic data augmentation. Lastly, we introduce a perceptual image enhancement framework powered by Boomerang, which enables resolution enhancement.

URL: https://openreview.net/forum?id=NYdThkjNW1

---

Title: Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

Authors: Fang Kong, XiangCheng Zhang, Baoxiang Wang, Shuai Li

Abstract: Learning Markov decision processes (MDP) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation since the underlying structure of the loss function and transition kernel are especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDP achieve a regret of $\tilde{\mathcal{O}}({K^{6/7}})$ ($K$ denotes the number of episodes), which admits a large room for improvement.
In this paper, we propose a novel explore-exploit algorithm framework and investigate the problem with a new view, which reduces linear MDP into linear optimization by subtly setting the feature maps of the bandit arms of linear optimization. This new technique, under an exploratory assumption, yields an improved bound of $\tilde{\mathcal{O}}({K^{4/5}})$ for linear adversarial MDP without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.

URL: https://openreview.net/forum?id=KcmWZSk53y

---

Title: Neural Circuit Diagrams: Robust Diagrams for the Communication, Implementation, and Analysis of Deep Learning Architectures

Authors: Vincent Abbott

Abstract: Diagrams matter. Unfortunately, the deep learning community has no standard method for diagramming architectures. The current combination of linear algebra notation and ad-hoc diagrams fails to offer the necessary precision to understand architectures in all their detail. However, this detail is critical for faithful implementation, mathematical analysis, further innovation, and ethical assurances. I present neural circuit diagrams, a graphical language tailored to the needs of communicating deep learning architectures. Neural circuit diagrams naturally keep track of the changing arrangement of data, precisely show how operations are broadcast over axes, and display the critical parallel behavior of linear operations. A lingering issue with existing diagramming methods is the inability to simultaneously express the detail of axes and the free arrangement of data, which neural circuit diagrams solve. Their compositional structure is analogous to code, creating a close correspondence between diagrams and implementation.

In this work, I introduce neural circuit diagrams for an audience of machine learning researchers. After introducing neural circuit diagrams, I cover a host of architectures to show their utility and build familiarity. This includes the transformer architecture, convolution (and its difficult-to-explain extensions), residual networks, the U-Net, and the vision transformer. I include a Jupyter notebook that provides evidence for the close correspondence between diagrams and code. Finally, I examine backpropagation using neural circuit diagrams. I show their utility in providing mathematical insight and analyzing algorithms' time and space complexities.

URL: https://openreview.net/forum?id=RyZB4qXEgt

---

Title: DyG2Vec: Efficient Representation Learning for Dynamic Graphs

Authors: Mohammad Alomrani, Mahdi Biparva, Yingxue Zhang, Mark Coates

Abstract: Temporal graph neural networks have shown promising results in learning inductive representations by automatically extracting temporal patterns. However, previous works often rely on complex memory modules or inefficient random walk methods to construct temporal representations. To address these limitations, we present an efficient yet effective attention-based encoder that leverages temporal edge encodings and window-based subgraph sampling to generate task-agnostic embeddings. Moreover, we propose a joint-embedding architecture using non-contrastive SSL to learn rich temporal embeddings without labels. Experimental results on 7 benchmark datasets indicate that on average, our model outperforms SoTA baselines on the future link prediction task by 4.23% for the transductive setting and 3.30% for the inductive setting while only requiring 5-10x less training/inference time. Lastly, different aspects of the proposed framework are investigated through experimental analysis and ablation studies. The code is publicly available at https://github.com/huawei-noah/noah-research/tree/master/graph_atlas.

URL: https://openreview.net/forum?id=YRKS2J0x36

---

Title: A Survey on Out-of-Distribution Detection in NLP

Authors: Hao Lang, Yinhe Zheng, Yixuan Li, Jian SUN, Fei Huang, Yongbin Li

Abstract: Out-of-distribution (OOD) detection is essential for the reliable and safe deployment of machine learning systems in the real world. Great progress has been made over the past years. This paper presents the first review of recent advances in OOD detection with a particular focus on natural language processing approaches. First, we provide a formal definition of OOD detection and discuss several related fields. We then categorize recent algorithms into three classes according to the data they used: (1) OOD data available, (2) OOD data unavailable + in-distribution (ID) label available, and (3) OOD data unavailable + ID label unavailable. Third, we introduce datasets, applications, and metrics. Finally, we summarize existing work and present potential future research topics.
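
As a concrete instance of the settings the survey categorizes, the maximum-softmax-probability baseline (a standard OOD score that needs no OOD data at test time) can be sketched as follows, with illustrative logits not taken from the paper:

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability: a standard OOD-detection baseline.
    Higher scores suggest in-distribution; low scores suggest OOD."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

# A confident, peaked prediction scores high; a flat distribution
# (typical of OOD inputs) scores near 1 / num_classes.
id_logits = np.array([8.0, 0.5, 0.2, 0.1])
ood_logits = np.array([1.1, 1.0, 0.9, 1.0])
print(msp_score(id_logits), msp_score(ood_logits))
```

Thresholding this score turns any softmax classifier into a simple OOD detector, which is the common starting point the surveyed methods improve on.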

URL: https://openreview.net/forum?id=nYjSkOy8ij

---

Title: Proximal Mean Field Learning in Shallow Neural Networks

Authors: Alexis Teter, Iman Nodozi, Abhishek Halder

Abstract: We propose a custom learning algorithm for shallow over-parameterized neural networks, i.e., networks with a single hidden layer of infinite width. The infinite width of the hidden layer serves as an abstraction for the over-parameterization. Building on the recent mean field interpretations of learning dynamics in shallow neural networks, we realize mean field learning as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn regularized proximal algorithm to approximate the distributional flow for the learning dynamics over weighted point clouds. In this setting, a contractive fixed point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent of the free energy associated with the risk functional.
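
The Sinkhorn scaling at the heart of such regularized proximal schemes can be sketched on weighted point clouds. This is a generic entropic-optimal-transport iteration under assumed toy data, not the paper's full proximal recursion:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=500):
    """Entropic-OT coupling between weight vectors a and b with cost
    matrix C, computed by Sinkhorn's alternating matrix scaling."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)  # enforce column marginals b
        u = a / (K @ v)    # enforce row marginals a
    return u[:, None] * K * v[None, :]  # transport plan

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))         # source weighted point cloud
y = rng.normal(size=(7, 2)) + 1.0   # target weighted point cloud
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
C = C / C.max()                     # normalize cost to avoid underflow in K
a = np.full(5, 1 / 5)               # particle weights
b = np.full(7, 1 / 7)
P = sinkhorn(a, b, C)               # coupling with marginals a and b
```

Because everything operates on particles and weights, the computation is meshless, matching the appeal noted in the abstract.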

URL: https://openreview.net/forum?id=vyRBsqj5iG

---


New submissions
===============


Title: Representation Learning Dynamics of Self-Supervised Models

Abstract: Self-Supervised Learning (SSL) is an important paradigm for learning representations from unlabelled data, and SSL with neural networks has been highly successful in practice. However, current theoretical analysis of SSL is mostly restricted to generalisation error bounds. In contrast, learning dynamics often provide a precise characterisation of the behaviour of neural-network-based models but, so far, are mainly known in supervised settings. In this paper, we study the learning dynamics of SSL models, specifically representations obtained by minimising contrastive and non-contrastive losses. We show that a naive extension of the dynamics of multivariate regression to SSL leads to learning trivial scalar representations that demonstrate dimension collapse. Consequently, we formulate SSL objectives with orthogonality constraints on the weights, and derive the exact (network width independent) learning dynamics of the SSL models trained using gradient descent on the Grassmannian manifold. We also argue that the infinite width approximation of SSL models deviates significantly from the neural tangent kernel approximations of supervised models. We numerically illustrate the validity of our theoretical findings, and discuss how the presented results provide a framework for further theoretical analysis of contrastive and non-contrastive SSL.

URL: https://openreview.net/forum?id=QXLKnrymE1

---

Title: Recent Link Classification on Temporal Graphs Using Graph Profiler

Abstract: The performance of Temporal Graph Learning (TGL) methods is typically evaluated on the future link prediction task, i.e., whether two nodes will get connected, and the dynamic node classification task, i.e., whether a node's class will change. Comparatively, recent link classification, i.e., to which class an emerging edge belongs, is investigated much less even though it arises in many industrial settings. In this work, we first formalize recent link classification on temporal graphs as a benchmark downstream task and introduce corresponding benchmark datasets. Secondly, we evaluate the performance of state-of-the-art methods with a statistically meaningful metric, the Matthews Correlation Coefficient, which is more robust to imbalanced datasets, in addition to the commonly used average precision and area under the curve. We propose several design principles for tailoring models to the specific requirements of the task and the dataset, including modifications of the message aggregation schema, readout layer, and time encoding strategy, which obtain significant improvement on benchmark datasets. Finally, we propose an architecture that we call Graph Profiler, which is capable of encoding previous events' class information on source and destination nodes. The experiments show that our proposed model achieves an improved Matthews Correlation Coefficient in most cases of interest. We believe the introduction of recent link classification as a benchmark task for temporal graph learning will be useful for the evaluation of prospective methods within the field.
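For reference, the Matthews Correlation Coefficient used as the evaluation metric can be computed directly from the binary confusion matrix. This minimal sketch follows the standard definition; the paper's experiments presumably rely on a library implementation.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Binary Matthews Correlation Coefficient from 0/1 label lists.

    Returns a value in [-1, 1]; 0 when the denominator vanishes
    (e.g. a constant predictor), which is what makes MCC robust
    to class imbalance.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Note that a majority-class predictor on an imbalanced set scores 0, whereas accuracy or average precision can still look deceptively high.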

URL: https://openreview.net/forum?id=BTgHh0gSSc

---

Title: GPT-FL: Generative Pre-Trained Model-Assisted Federated Learning

Abstract: In this work, we propose GPT-FL, a generative pre-trained model-assisted federated learning (FL) framework. At its core, GPT-FL leverages generative pre-trained models to generate diversified synthetic data. These generated data are used to train a downstream model on the server, which is then fine-tuned with private client data under the standard FL framework. We show that GPT-FL consistently outperforms state-of-the-art FL methods in terms of model test accuracy, communication efficiency, and client sampling efficiency. Through comprehensive ablation analysis, we discover that the downstream model generated by synthetic data plays a crucial role in controlling the direction of gradient diversity during FL training, which enhances convergence speed and contributes to the notable accuracy boost observed with GPT-FL. Also, regardless of whether the target data falls within or outside the domain of the pre-trained generative model, GPT-FL consistently achieves significant performance gains, surpassing the results obtained by models trained solely with FL or synthetic data.

URL: https://openreview.net/forum?id=GyTkMIuYZq

---

Title: Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation

Abstract: Gaze is an essential prompt for analyzing human behavior and attention. Recently, there has been an increasing interest in determining gaze direction from facial videos. However, video gaze estimation faces significant challenges, such as understanding the dynamic evolution of gaze in video sequences, dealing with static backgrounds, and adapting to variations in illumination. To address these challenges, we propose a simple and novel deep learning model designed to estimate gaze from videos, incorporating a specialized attention module. Our method employs a spatial attention mechanism that tracks spatial dynamics within videos. This technique enables accurate gaze direction prediction through a temporal sequence model, adeptly transforming spatial observations into temporal insights, thereby significantly improving gaze estimation accuracy. Additionally, our approach integrates Gaussian processes to include individual-specific traits, facilitating the personalization of our model with just a few labeled samples. Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings. Specifically, our proposed approach achieves state-of-the-art performance on the Gaze360 dataset, improving by 2.5 degrees without personalization. Further, by personalizing the model with just three samples, we achieve an additional improvement of 0.8 degrees.

URL: https://openreview.net/forum?id=PZbkQ7VGvj

---

Title: Scoring Rule Training for Simulation-Based Inference

Abstract: Bayesian Simulation-Based Inference yields posterior approximations for simulator models with intractable likelihood. Recently, normalizing flows were used to approximate either the likelihood or the posterior directly.
Normalizing flows are invertible neural networks which transform samples from a latent distribution; the probability density of the transformed samples is accessible and the normalizing flow can be trained via maximum likelihood on simulated parameter-observation pairs. In contrast, Ramesh et al. (2022) approximated the posterior with generative networks, which drop invertibility and are thus more flexible, hence scaling to high-dimensional and structured data. As generative networks merely allow sampling from the parametrized distribution, Ramesh et al. (2022) exploited adversarial training, where the generative network plays a game against a ``discriminator'' network.
This procedure is unstable and can lead to a learned distribution underestimating the uncertainty. Here, we apply Scoring Rule minimization, an adversarial-free training approach for generative networks, to approximate the posterior for simulator models. We term the resulting method ScoRuTSBI (Scoring Rule Training for Simulation-Based Inference).
On high-dimensional examples, we found ScoRuTSBI to perform better with shorter training time than the adversarial framework, which was previously found to outperform traditional methods (such as Approximate Bayesian Computation) and normalizing-flow-based inference methods.
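Scoring Rule minimization replaces the discriminator with a proper scoring rule evaluated directly on simulator draws. As a hedged illustration only, the energy score below is one widely used strictly proper scoring rule for sample-based generative models; the paper's exact choice of rule and estimator may differ.

```python
def energy_score(samples, y, beta=1.0):
    """Monte Carlo energy score of 1-D samples against observation y.

    ES(P, y) = E|X - y|^beta - 0.5 * E|X - X'|^beta, estimated from a
    finite list of draws. Lower is better; strictly proper for beta in (0, 2).
    """
    m = len(samples)
    # Fidelity term: average distance from samples to the observation.
    term1 = sum(abs(x - y) ** beta for x in samples) / m
    # Spread term: average pairwise distance among distinct samples.
    term2 = sum(abs(samples[i] - samples[j]) ** beta
                for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    return term1 - 0.5 * term2
```

Because the score is an average over samples, it can be minimized by gradient descent through the generative network without any adversarial game.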

URL: https://openreview.net/forum?id=kxUPVi0nod

---

Title: Linear Bandits with Memory

Abstract: Nonstationary phenomena, such as satiation effects in recommendations, have mostly been modeled using bandits with finitely many arms. However, the richer action space provided by linear bandits is often preferred in practice. In this work, we introduce a novel nonstationary linear bandit model, where current rewards are influenced by the learner's past actions in a fixed-size window. Our model, which recovers stationary linear bandits as a special case, leverages two parameters: the window size $m \ge 0$, and an exponent $\gamma$ that captures the rotting ($\gamma < 0$) or rising ($\gamma > 0$) nature of the phenomenon. When both $m$ and $\gamma$ are known, we propose and analyze a variant of OFUL which minimizes regret against cyclic policies. By choosing the cycle length so as to trade-off approximation and estimation errors, we then prove a bound of order $\sqrt{d}\,(m+1)^{\frac{1}{2}+\max\{\gamma,0\}}\,T^{3/4}$ (ignoring log factors) on the regret against the optimal sequence of actions, where $T$ is the horizon and $d$ is the dimension of the linear action space. Through a bandit model selection approach, our results are then extended to the case where both $m$ and $\gamma$ are unknown. Finally, we complement our theoretical results with experiments comparing our approach to natural baselines.

URL: https://openreview.net/forum?id=CrpDwMFgxr

---

Title: Robust Algorithmic Recourse Design Under Model Shifts

Abstract: Algorithmic recourse offers users recommendations for actions that can help alter unfavorable outcomes in practical decision-making systems. Although many methods have been proposed to design easily implementable recourses, model updates or shifts may render previously generated recourses invalid. To assess the robustness of recourses against model shifts, we propose an uncertainty quantification method to calculate a theoretical upper bound of the recourse invalidation rate for any counterfactual plan and any prediction model, without requiring distributional assumptions about the feature space. Furthermore, given the inherent trade-off between recourse cost and recourse robustness, users should be empowered to manage the implementation cost versus robustness trade-off. To this end, we propose a novel framework that leverages the derived invalidation rate bounds to generate model-agnostic recourses that satisfy the user's specified invalidation needs. Numerical results on multiple datasets demonstrate the effectiveness of the derived theoretical bounds and the efficacy of the proposed algorithms.

URL: https://openreview.net/forum?id=M04lLePtRF

---

Title: Initial value problem enhanced sampling for closed-loop optimal control design with deep neural networks

Abstract: Closed-loop optimal control design for high-dimensional nonlinear systems has been a long-standing challenge. Traditional methods, such as solving the associated Hamilton-Jacobi-Bellman equation, suffer from the curse of dimensionality. Recent literature proposed a new promising approach based on supervised learning, by leveraging powerful open-loop optimal control solvers to generate training data and neural networks as efficient high-dimensional function approximators to fit the closed-loop optimal control. This approach successfully handles certain high-dimensional optimal control problems but still performs poorly on more challenging problems. One of the crucial reasons for the failure is the so-called distribution mismatch phenomenon brought by the controlled dynamics. In this paper, we investigate this phenomenon and propose the initial value problem enhanced sampling method to mitigate this problem. We theoretically prove that this sampling strategy improves over the vanilla strategy on the classical linear-quadratic regulator by a factor proportional to the total time duration. We further numerically demonstrate that the proposed sampling strategy significantly improves the performance on tested control problems, including the optimal landing problem of a quadrotor and the optimal reaching problem of a 7 DoF manipulator.

URL: https://openreview.net/forum?id=T3hYDAgsa0

---

Title: Understanding Sparse Neural Networks from their Topology via Multipartite Graph Representations

Abstract: Pruning-at-Initialization (PaI) algorithms provide Sparse Neural Networks (SNNs) which are computationally more efficient than their dense counterparts, and try to avoid performance degradation. While lots of emphasis has been directed towards \emph{how} to prune, we still do not know \emph{what topological metrics} of the SNNs characterize \emph{good performance}. From prior work, we have layer-wise topological metrics by which SNN performance can be predicted: the Ramanujan-based metrics. To exploit these metrics, proper ways to represent network layers via Graph Encodings (GEs) are needed, with Bipartite Graph Encodings (BGEs) being the \emph{de-facto} standard at the current stage. Nevertheless, existing BGEs neglect the impact of the inputs, and do not characterize the SNN in an end-to-end manner. Additionally, thanks to a thorough study of the Ramanujan-based metrics, we discover that, when paired with BGEs, they are only as good as the \emph{layer-wise density} as performance predictors. To close both gaps, we design a comprehensive topological analysis for SNNs with both linear and convolutional layers, via (i) a new input-aware Multipartite Graph Encoding (MGE) for SNNs and (ii) the design of new end-to-end topological metrics over the MGE. With these novelties, we show the following: (a) the proposed MGE allows us to extract topological metrics that are much better predictors of the accuracy drop than metrics computed from current input-agnostic BGEs; (b) which metrics are important at different sparsity levels and for different architectures; (c) a mixture of our topological metrics can rank PaI algorithms more effectively than Ramanujan-based metrics.

URL: https://openreview.net/forum?id=Egb0tUZnOY

---

Title: TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality

Abstract: This paper addresses a cross-modal learning framework, where the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach for missing information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can further be expressed as a kernelized cross-attention module (under linear transformation). This expression lays the foundation for introducing The Attention Patch (TAP), a simple neural network add-on that can be trained to allow data-level knowledge transfer from the unlabeled modality. We provide extensive numerical simulations using four real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.
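The Nadaraya-Watson estimator underlying TAP is a kernel-weighted average of targets, which makes the connection to (softmax-normalized) cross-attention concrete. A minimal 1-D sketch; the Gaussian kernel and bandwidth are illustrative choices, not the paper's exact configuration.

```python
import math

def nw_regression(x_query, xs, ys, bandwidth=1.0):
    """Nadaraya-Watson estimate at x_query from pairs (xs, ys).

    The prediction is a normalized kernel-weighted sum of targets:
    structurally the same query-key-value pattern as cross-attention,
    with the kernel playing the role of the (unnormalized) attention score.
    """
    weights = [math.exp(-((x_query - x) ** 2) / (2 * bandwidth ** 2))
               for x in xs]
    total = sum(weights)  # normalization, analogous to the softmax denominator
    return sum(w * y for w, y in zip(weights, ys)) / total
```

With a small bandwidth the estimate collapses onto the nearest training target; with a large one it blends all targets, exactly as attention temperature trades off sharpness against smoothing.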

URL: https://openreview.net/forum?id=73uyerai53

---

Title: A Survey on Transferability of Adversarial Examples Across Deep Neural Networks

Abstract: The emergence of Deep Neural Networks (DNNs) has revolutionized various domains, enabling the resolution of complex tasks spanning image recognition, natural language processing, and scientific problem-solving. However, this progress has also exposed a concerning vulnerability: adversarial examples. These crafted inputs, imperceptible to humans, can manipulate machine learning models into making erroneous predictions, raising concerns for safety-critical applications. An intriguing property of this phenomenon is the transferability of adversarial examples, where perturbations crafted for one model can deceive another, often with a different architecture. This property enables "black-box" attacks, circumventing the need for detailed knowledge of the target model. This survey explores the landscape of adversarial transferability. We categorize existing methodologies to enhance adversarial transferability and discuss the fundamental principles guiding each approach. While the predominant body of research primarily concentrates on image classification, we also extend our discussion to encompass other vision tasks and beyond. Challenges and future prospects are discussed, highlighting the importance of fortifying DNNs against adversarial vulnerabilities in an evolving landscape.

URL: https://openreview.net/forum?id=AYJ3m7BocI

---

Title: Self-Training: A Survey

Abstract: Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this framework is relevant in many applications, these algorithms have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have undoubtedly attracted greater attention in recent years. These models are designed to find the decision boundary in low-density regions without making additional assumptions about the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and to train a new classifier in conjunction with the labeled training set. In this paper, we present self-training methods for binary and multi-class classification as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.
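The working principle of self-training can be sketched as a generic loop. The `fit` and `predict_margin` callables, the fixed round budget, and the margin threshold below are illustrative assumptions, not a specific method from the survey.

```python
def self_train(labeled, unlabeled, fit, predict_margin, threshold=0.8, rounds=5):
    """Generic self-training loop (sketch).

    labeled: list of (x, y) pairs; unlabeled: list of x.
    fit(data) -> model; predict_margin(model, x) -> (label, margin).
    Each round, examples whose margin exceeds `threshold` are
    pseudo-labeled and moved into the training set.
    """
    pool = list(unlabeled)
    train = list(labeled)
    for _ in range(rounds):
        model = fit(train)
        confident, rest = [], []
        for x in pool:
            label, margin = predict_margin(model, x)
            (confident if margin >= threshold else rest).append((x, label))
        if not confident:  # nothing crosses the margin threshold: stop early
            break
        train.extend(confident)
        pool = [x for x, _ in rest]
    return fit(train)
```

The threshold controls the usual self-training trade-off: too low and noisy pseudo-labels pollute the training set, too high and the unlabeled pool is never exploited.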

URL: https://openreview.net/forum?id=hzsZY8ElG2

---

Title: Persistent Local Homology in Graph Learning

Abstract: In this study, we introduce Persistent Local Homology (PLH) for graphs, a novel method that synergizes persistent homology with local homology to analyze graph structures. We begin by mathematically formalizing PLH, defining it as the application of persistent homology to annular local subgraphs. This foundation paves the way for the development of a computational pipeline, specifically tailored for PLH, which we explore in various graph learning contexts. Despite its utility, a complexity analysis reveals potential computational bottlenecks in PLH application. To address this, we propose Reduced PLH (rPLH), an efficient variant designed to significantly lower computational complexity. Experimental evaluations with rPLH demonstrate its capability to retain the effectiveness of the original PLH while substantially reducing computational demands. The practical utility of PLH and rPLH is further corroborated through comprehensive experiments on both synthetic and real-world datasets, highlighting their broad applicability and potential in diverse analytical scenarios.

URL: https://openreview.net/forum?id=qunyX9WYr6

---

Title: The Cross-entropy of Piecewise Linear Probability Density Functions

Abstract: The cross-entropy and its related terms from information theory (e.g. entropy, Kullback-Leibler divergence) are found throughout artificial intelligence/machine learning. This includes many of the major successes, both current and historic, where they commonly appear as the natural objective of an optimisation procedure for learning model parameters, or their distributions. This paper presents a derivation of the differential cross-entropy between two 1D probability density functions represented as piecewise linear functions. Previously, this quantity had to be approximated via numerical integration, or an equivalent scheme, for which calculating gradients is impractical. Implementation details are explored and experimental validation is presented. This paper contributes the necessary theory for the practical optimisation of information theoretic objectives when dealing with piecewise linear distributions directly; applications will be explored in future papers.
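For context, the numerical-integration baseline that a closed form would replace looks roughly like the sketch below (midpoint rule over a bounded support). The knot representation and sample count are illustrative assumptions, not the paper's implementation.

```python
import math

def pl_density(knots):
    """Piecewise linear PDF from sorted (x, density) knots; zero outside."""
    def p(x):
        for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
            if x0 <= x <= x1:
                t = (x - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)  # linear interpolation on the segment
        return 0.0
    return p

def cross_entropy_numeric(p, q, lo, hi, n=50_000):
    """H(p, q) = -integral p(x) log q(x) dx, approximated by the midpoint rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        px, qx = p(x), q(x)
        if px > 0.0:  # convention: contribution is 0 where p vanishes
            total -= px * math.log(qx) * h
    return total
```

Setting q = p recovers the differential entropy; for the symmetric triangular density on [0, 2] this is 1/2 + ln((2-0)/2) = 0.5, which the midpoint rule reproduces to roughly three decimals.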

URL: https://openreview.net/forum?id=AoOi9Zgdsv

---

Title: Adversarially Robust Spiking Neural Networks Through Conversion

Abstract: Spiking neural networks (SNNs) provide an energy-efficient alternative to a variety of artificial neural network (ANN) based AI applications. As the progress in neuromorphic computing with SNNs expands their use in applications, the problem of adversarial robustness of SNNs becomes more pronounced. In contrast to the widely explored end-to-end adversarial training based solutions, we address the limited progress in scalable robust SNN training methods by proposing an adversarially robust ANN-to-SNN conversion algorithm. Our method provides an efficient approach to embrace various computationally demanding robust learning objectives that have been proposed for ANNs. During a post-conversion robust finetuning phase, our method adversarially optimizes both layer-wise firing thresholds and synaptic connectivity weights of the SNN to maintain transferred robustness gains from the pre-trained ANN. We perform experimental evaluations in a novel setting proposed to rigorously assess the robustness of SNNs, where numerous adaptive adversarial attacks that account for the spike-based operation dynamics are considered. Results show that our approach yields a scalable state-of-the-art solution for adversarially robust deep SNNs with low latency.

URL: https://openreview.net/forum?id=I8FMYa2BdP

---

Title: Restricted Random Pruning at Initialization for High Compression Range

Abstract: Pruning at Initialization (PaI) makes training overparameterized neural networks more efficient by reducing the overall computational cost from training to inference. Recent PaI studies showed that random pruning is more effective than ranking-based pruning, which learns connectivity. However, the effectiveness of each pruning method depends on the existence of skip connections and the compression ratio (the before-after pruning parameter ratio). While random pruning performs better than ranking-based pruning on architectures with skip connections, this advantage is reversed in the high compression range for architectures without skip connections. This paper proposes Minimum Connection Assurance (MiCA), which achieves higher accuracy than conventional PaI methods for architectures with and without skip connections, regardless of the compression ratio. MiCA preserves the random connection between the layers and maintains the performance at high compression ratios without the costly connection learning that ranking-based pruning requires. Experiments on CIFAR-10 and CIFAR-100 show that MiCA enhances the compression ratio and accuracy trade-offs compared to existing PaI methods.
In VGG-16 with CIFAR-10, MiCA improves the accuracy of random pruning by $27.0\%$ at $10^{4.7}\times$ compression ratio. Furthermore, experimental analysis reveals that increasing the utilization of the nodes through which information flows from the first layer is essential for maintaining high performance at a high compression ratio.

URL: https://openreview.net/forum?id=yf4ciZcgrg

---

Title: Offline Reinforcement Learning via Tsallis Regularization

Abstract: Offline reinforcement learning (RL) focuses on learning a good policy from a fixed dataset. The dataset is generated by an unknown behavior policy through interactions with the environment and contains only a subset of the state-action spaces. Standard off-policy algorithms often perform poorly in this setting, suffering from erroneously optimistic values incurred by out-of-distribution (OOD) actions not present in the dataset. The optimism cannot be corrected, as no further interaction with the environment is possible. Imposing divergence regularization and in-sample constraints are among the most popular methods for overcoming this issue, ensuring that the learned policy stays close to the behavior policy to minimize the occurrence of OOD actions. This paper proposes Tsallis regularization for offline RL, which aligns the induced sparsemax policies with the in-sample constraint. Sparsemax interpolates between existing methods utilizing hard-max and softmax policies, in that only a subset of actions contributes non-zero action probability, as compared to softmax (all actions) and hard-max (a single action). We leverage this property to model the behavior policy and show that under several assumptions the learned sparsemax policies may have sparsity-conditional KL divergence to the behavior policy, making Tsallis regularization especially suitable for Behavior Cloning methods. We propose two actor-critic algorithms, Tsallis In-sample Actor-Critic (Tsallis InAC) and Tsallis Advantage Weighted Actor-Critic (Tsallis AWAC), respectively generalizing InAC and AWAC, and analyze their performance on standard MuJoCo benchmarks.
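For reference, the sparsemax transformation (Martins & Astudillo, 2016) that Tsallis regularization induces projects logits onto the probability simplex and can assign exactly zero probability to actions; a minimal sketch of the standard sort-based algorithm:

```python
def sparsemax(z):
    """Sparsemax projection of logits z onto the probability simplex.

    Unlike softmax, outputs can be exactly zero: only logits above a
    data-dependent threshold tau receive probability mass.
    """
    zs = sorted(z, reverse=True)
    running, k, k_sum = 0.0, 0, 0.0
    for j, zj in enumerate(zs, start=1):
        running += zj
        # Support condition: keep extending k while 1 + j*z_(j) > sum of top j.
        if 1 + j * zj > running:
            k, k_sum = j, running
    tau = (k_sum - 1) / k
    return [max(zi - tau, 0.0) for zi in z]
```

With well-separated logits the output is a hard one-hot (like hard-max); with near-uniform logits it spreads mass over all actions (like softmax), which is exactly the interpolation the abstract describes.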

URL: https://openreview.net/forum?id=HNqEKZDDRc

---

Title: Investigations on Modularity and Invariance for Compositional Robustness

Abstract: By default, neural networks are not robust to changes in data distribution. This has been demonstrated with simple image corruptions, such as blurring or adding noise, which degrade image classification performance. Many methods have been proposed to mitigate these issues, but for the most part models are evaluated on single corruptions. In reality, visual space is compositional in nature: as well as robustness to elemental corruptions, robustness to compositions of corruptions is also needed. In this work we develop a compositional image classification task where, given a few elemental corruptions, models are asked to generalize to compositions of these corruptions. That is, to achieve _compositional robustness_. We experimentally compare empirical risk minimization with an invariance-building pairwise contrastive loss and, counter to common intuitions in domain generalization, achieve only marginal improvements in compositional robustness by encouraging invariance. To move beyond invariance, following previously proposed inductive biases that model architectures should reflect data structure, we introduce a modular architecture whose structure replicates the compositional nature of the task. We then show that this modular approach consistently achieves better compositional robustness than non-modular approaches. We additionally find empirical evidence that the degree of invariance between representations of 'in-distribution' elemental corruptions fails to correlate with robustness to 'out-of-distribution' compositions of corruptions.

URL: https://openreview.net/forum?id=dlkEw3syMg

---

Title: AGALE: A Graph-Aware Continual Learning Evaluation Framework

Abstract: In recent years, continual learning (CL) techniques have made significant progress in learning from streaming data while preserving knowledge across sequential tasks, particularly in the realm of Euclidean data. To foster fair evaluation and recognize challenges in CL settings, several evaluation frameworks have been proposed, focusing mainly on the single- and multi-label classification task on Euclidean data. However, these evaluation frameworks are not trivially applicable when the input data is graph-structured, as they do not consider the topological structure inherent in graphs. Existing continual graph learning (CGL) evaluation frameworks have predominantly focused on single-label scenarios in the node classification (NC) task. This focus has overlooked the complexities of multi-label scenarios, where nodes may exhibit affiliations with multiple labels, simultaneously participating in multiple tasks. We develop a graph-aware evaluation (AGALE) framework that accommodates both single-labeled and multi-labeled nodes, addressing the limitations of previous evaluation frameworks. In particular, we define new incremental settings and devise data partitioning algorithms tailored to CGL datasets. We perform extensive experiments comparing methods from the domains of continual learning, continual graph learning, and dynamic graph learning (DGL). We theoretically analyze AGALE and provide new insights about the role of homophily in the performance of compared methods. We release our framework at \url{https://anonymous.4open.science/r/AGALE-FBCC/}.

URL: https://openreview.net/forum?id=xDTKRLyaNN

---

Title: AdaEmbed: Semi-supervised Domain Adaptation in the Embedding Space

Abstract: Semi-supervised domain adaptation (SSDA) presents a critical hurdle in computer vision, especially given the frequent scarcity of labeled data in real-world settings. This scarcity often causes foundation models, trained on extensive datasets, to underperform when applied to new domains. AdaEmbed, our newly proposed methodology for SSDA, offers a promising solution to these challenges. Leveraging the potential of unlabeled data, AdaEmbed facilitates the transfer of knowledge from a labeled source domain to an unlabeled target domain by learning a shared embedding space. By generating accurate and uniform pseudo-labels based on the established embedding space, the model overcomes the limitations of conventional SSDA, thus enhancing performance significantly. Our method's effectiveness is validated through extensive experiments on benchmark datasets such as DomainNet, Office-Home, and VisDA-C, where AdaEmbed consistently outperforms all the baselines, setting a new state of the art for SSDA. With its straightforward implementation and high data efficiency, AdaEmbed stands out as a robust and pragmatic solution for real-world scenarios, where labeled data is scarce. To foster further research and application in this area, we are sharing the codebase of our unified framework for semi-supervised domain adaptation.

URL: https://openreview.net/forum?id=S1CpH6aoBI

---

Title: E(n)-equivariant Graph Neural Cellular Automata

Abstract: Cellular automata (CAs) are notable computational models exhibiting rich dynamics emerging from the local interaction of cells arranged in a regular lattice.
Graph CAs (GCAs) generalise standard CAs by allowing for arbitrary graphs rather than regular lattices, similar to how Graph Neural Networks (GNNs) generalise Convolutional NNs.
Recently, Graph Neural CAs (GNCAs) have been proposed as models built on top of standard GNNs that can be trained to approximate the transition rule of any arbitrary GCA.
We note that existing GNCAs can violate the locality principle of CAs by leveraging global information and, furthermore, are anisotropic in the sense that their transition rules are not equivariant to rigid transformations of the nodes' spatial locations.
However, it is desirable for instances related by such transformations to be treated identically by the model.
By replacing standard graph convolutions with E(n)-equivariant ones, we avoid anisotropy by design and propose a class of isotropic automata that we call E(n)-GNCAs.
These models are lightweight, but can nevertheless handle large graphs, capture complex dynamics and exhibit emergent self-organising behaviours.
We showcase the broad and successful applicability of E(n)-GNCAs on three different tasks: (i) isotropic pattern formation, (ii) graph auto-encoding, and (iii) simulation of E(n)-equivariant dynamical systems.

URL: https://openreview.net/forum?id=7PNJzAxkij

---

Title: Distributionally Robust Policy Evaluation under General Co-variate Shift in Contextual Bandits

Abstract: We introduce a distributionally robust approach that enhances the reliability of offline policy evaluation in contextual bandits under general covariate shifts. Our method aims to deliver robust policy evaluation results in the presence of discrepancies in both context and policy distribution between logging and target data. Central to our methodology is the application of robust regression, a distributionally robust technique tailored here to improve the estimation of the conditional reward distribution from logging data. Utilizing the reward model obtained from robust regression, we develop a comprehensive suite of policy value estimators by integrating our reward model into established evaluation frameworks, namely direct methods and doubly robust methods. Through theoretical analysis, we further establish that the proposed policy value estimators offer a finite sample upper bound for the bias, providing a clear advantage over traditional methods, especially when the shift is large. Finally, we design an extensive range of policy evaluation scenarios, covering diverse magnitudes of shifts and a spectrum of logging and target policies. Our empirical results indicate that our approach significantly outperforms baseline methods, most notably in 90% of the cases under the policy shift-only settings and 72% of the scenarios under the general covariate shift settings.

URL: https://openreview.net/forum?id=R7PReNELww

---

Title: Low-Rank Tensor-Network Encodings for Video-to-Action Behavioral Cloning

Abstract: We describe a tensor-network latent-space encoding approach for increasing the scalability of behavioral cloning of a video game player’s actions entirely from video streams of the gameplay. Specifically, we address challenges associated with the high computational requirements of traditional deep-learning based encoders such as variational autoencoders that prohibit their use in widely available hardware or for large scale data. Our approach uses tensor networks instead of deep variational autoencoders for this purpose, and it yields significant speedups with no loss of accuracy. Empirical results on ATARI games demonstrate that our approach leads to a speedup in the time it takes to encode data and train a predictor using the encodings (between 2.6× to 9.6× compared to autoencoders or variational autoencoders). Furthermore, the tensor train encoding can be efficiently trained on CPU as well, which leads to comparable or better training times than the autoencoder and variational autoencoder trained on GPU (0.9× to 5.4× faster). These results suggest significant possibilities in mitigating the need for cost and time-intensive hardware for training deep-learning architectures for behavioral cloning.
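A tensor-train encoding of the kind referenced above can be sketched with the standard TT-SVD procedure (a generic NumPy sketch; the authors' exact encoding pipeline and rank choices are not reproduced here).

```python
import numpy as np

def tt_decompose(tensor, max_rank):
    """Tensor-train (TT) decomposition via sequential truncated SVD.

    Returns a list of 3-way cores G_k of shape (r_{k-1}, n_k, r_k),
    the kind of compact encoding used here in place of a deep
    (variational) autoencoder.
    """
    dims = tensor.shape
    cores, rank = [], 1
    mat = tensor.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        mat = mat.reshape(rank * dims[k], -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))               # truncate to max_rank
        cores.append(u[:, :r].reshape(rank, dims[k], r))
        mat = s[:r, None] * vt[:r]              # carry remainder forward
        rank = r
    cores.append(mat.reshape(rank, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract TT cores back into the full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([out.ndim - 1], [0]))
    return out.reshape([c.shape[1] for c in cores])
```

With `max_rank` large enough the decomposition is exact; smaller ranks trade accuracy for a much smaller latent representation.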

URL: https://openreview.net/forum?id=w4DXLzBPPw

---

Title: Efficient Sparse-Reward Goal-Conditioned Reinforcement Learning with a High Replay Ratio and Regularization

Abstract: Reinforcement learning (RL) methods with a high replay ratio (RR) and regularization have gained interest due to their superior sample efficiency.
However, these methods have mainly been developed for dense-reward tasks.
In this paper, we aim to extend these RL methods to sparse-reward goal-conditioned tasks.
We use Randomized Ensemble Double Q-learning (REDQ) (Chen et al., 2021), an RL method with a high RR and regularization.
To apply REDQ to sparse-reward goal-conditioned tasks, we make the following modifications to it:
(i) using hindsight experience replay and (ii) bounding target Q-values.
We evaluate REDQ with these modifications on 12 sparse-reward goal-conditioned tasks of Robotics (Plappert et al., 2018), and show that it achieves about $2 \times$ better sample efficiency than previous state-of-the-art (SoTA) RL methods.
Furthermore, we reconsider the necessity of specific components of REDQ and simplify it by removing unnecessary ones.
The simplified REDQ with our modifications achieves $\sim 8 \times$ better sample efficiency than the SoTA methods in 4 Fetch tasks of Robotics.
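Modification (ii), bounding target Q-values, can be sketched as follows. With sparse rewards in $\{-1, 0\}$ and discount factor $\gamma$, every true Q-value lies in $[-1/(1-\gamma), 0]$, so TD targets can be clipped to that range before the update. The helper below is a hypothetical illustration, not the authors' code.

```python
import numpy as np

GAMMA = 0.99
# With r in {-1, 0}, returns are bounded below by -1/(1 - gamma).
Q_MIN, Q_MAX = -1.0 / (1.0 - GAMMA), 0.0

def bounded_td_target(reward, done, next_q_min):
    """Clipped TD target, given the ensemble-minimum next-state Q-value."""
    target = reward + GAMMA * (1.0 - done) * next_q_min
    return float(np.clip(target, Q_MIN, Q_MAX))
```

Clipping prevents the bootstrapped target from drifting outside the range any true Q-value can take, which is what stabilizes high-replay-ratio training under sparse rewards.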

URL: https://openreview.net/forum?id=7uml1PLgAN

---

Title: Interpretable Additive Tabular Transformer Networks

Abstract: Attention based Transformer networks have not only revolutionized Natural Language Processing but have also achieved state-of-the-art results for tabular data modeling. The attention mechanism, in particular, has proven to be highly effective in accurately modeling categorical variables. Although deep learning models have recently outperformed tree-based models, their opaque nature often prevents a complete understanding of the individual impact of features. In contrast, additive neural network structures have proven to be both predictive and interpretable. Within the context of explainable deep learning, we propose Neural Additive Tabular Transformer Networks (NATT), a modeling framework that combines the intelligibility of additive neural networks with the predictive power of Transformer models. NATT offers inherent intelligibility while achieving similar performance to complex deep learning models. To validate its efficacy, we conduct experiments on multiple datasets and find that NATT performs on par with state-of-the-art methods on tabular data and surpasses other interpretable approaches.

URL: https://openreview.net/forum?id=TdJ7lpzAkD

---

Title: Understanding Fairness Surrogate Functions in Algorithmic Fairness

Abstract: It has been observed that machine learning algorithms exhibit biased predictions against certain population groups. To mitigate such bias while achieving comparable accuracy, a promising approach is to introduce surrogate functions of the concerned fairness definition and solve a constrained optimization problem. However, previous work has made the intriguing observation that such fairness surrogate functions may yield unfair results and high instability. In this work, in order to understand them more deeply, taking a widely used fairness definition—demographic parity—as an example, we show that there is a surrogate-fairness gap between the fairness definition and the fairness surrogate function. Moreover, the theoretical analysis and experimental results about the “gap” suggest that fairness and stability are affected by points far from the decision boundary, the large margin points issue investigated in this paper. To address it, we propose the general sigmoid surrogate, which simultaneously reduces both the surrogate-fairness gap and the variance, and we offer a rigorous fairness and stability upper bound. Interestingly, the theory also yields two further insights: dealing with the large margin points and obtaining a more balanced dataset are both beneficial to fairness and stability. Furthermore, we elaborate a novel and general algorithm called Balanced Surrogate, which iteratively reduces the “gap” to mitigate unfairness. Finally, we provide empirical evidence showing that our methods consistently improve fairness and stability while maintaining accuracy comparable to the baselines on three real-world datasets.

URL: https://openreview.net/forum?id=iBgmoMTlaz

---

Title: ZigZag: Universal Sampling-free Uncertainty Estimation Through Two-Step Inference

Abstract: Whereas the ability of deep networks to produce useful predictions on many kinds of data has been amply demonstrated, estimating the reliability of these predictions remains challenging. Sampling approaches such as MC-Dropout and Deep Ensembles have emerged as the most popular ones for this purpose. Unfortunately, they require many forward passes at inference time, which slows them down. Sampling-free approaches can be faster but often suffer from other drawbacks, such as lower reliability of uncertainty estimates, difficulty of use, and limited applicability to different types of tasks and data.

In this work, we introduce a sampling-free approach that is generic and easy to deploy, while producing reliable uncertainty estimates on par with state-of-the-art methods at a significantly lower computational cost. It is predicated on training the network to produce the same output with and without additional information about it. At inference time, when no prior information is given, we use the network's own prediction as the additional information. We then take the distance between the predictions with and without prior information as our uncertainty measure.

We demonstrate our approach on several classification and regression tasks. We show that it delivers results on par with those of Ensembles but at a much lower computational cost.
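The two-step scheme can be sketched with a generic callable. The signature `model(x, y)` is an assumption made for illustration; the paper applies this idea to full deep networks trained so that the output is consistent with and without the extra input.

```python
import numpy as np

def zigzag_uncertainty(model, x, blank):
    """Two-step, sampling-free uncertainty in the style of ZigZag.

    `model(x, y)` is trained to produce the same prediction whether
    `y` is a blank placeholder or the true label/output (hypothetical
    interface for this sketch).
    """
    first = model(x, blank)      # pass 1: no prior information
    second = model(x, first)     # pass 2: feed the prediction back in
    uncertainty = float(np.linalg.norm(np.asarray(first) - np.asarray(second)))
    return second, uncertainty
```

A model that is perfectly self-consistent returns identical predictions in both passes and therefore zero uncertainty; disagreement between the two passes signals unreliable inputs.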

URL: https://openreview.net/forum?id=QSvb6jBXML

---

Title: Efficient Large Language Models: A Survey

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in important tasks such as natural language understanding, language generation, and complex reasoning and have the potential to make a substantial impact on our society. Such capabilities, however, come with the considerable resources they demand, highlighting the strong need to develop effective techniques for addressing their efficiency challenges. In this survey, we provide a systematic and comprehensive review of efficient LLMs research. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient LLMs topics from model-centric, data-centric, and framework-centric perspectives, respectively. We have also created a repository where we compile the papers featured in this survey at https://anonymous.4open.science/r/Efficient_LLM-paper-list-5847, and will actively maintain this repository and incorporate new research as it emerges. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of the research developments in efficient LLMs and inspire them to contribute to this important and exciting field.

URL: https://openreview.net/forum?id=bsCCJHbO8A

---

Title: Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Abstract: Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models have achieved strong performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples.

URL: https://openreview.net/forum?id=WYGiqSVstK

---

Title: BP($\mathbf{\lambda}$): Online Learning via Synthetic Gradients

Abstract: Training recurrent neural networks typically relies on backpropagation through time (BPTT). BPTT requires both the forward and backward passes to complete, locking the network to these computations before loss gradients become available. Recently, Jaderberg et al. proposed synthetic gradients to alleviate the need for full BPTT. In their implementation, synthetic gradients are learned through a mixture of backpropagated gradients and bootstrapped synthetic gradients, analogous to the temporal difference (TD) algorithm in Reinforcement Learning (RL). However, as in TD learning, heavy use of bootstrapping can result in bias, which leads to poor synthetic gradient estimates. Inspired by accumulate $\mathrm{TD}(\lambda)$ in RL, we propose a fully online method for learning synthetic gradients which avoids the use of BPTT altogether: \emph{accumulate} $\mathrm{BP}(\lambda)$. As in accumulate $\mathrm{TD}(\lambda)$, we show analytically that accumulate $\mathrm{BP}(\lambda)$ can control the level of bias by using a mixture of temporal difference errors and recursively defined eligibility traces. We next demonstrate empirically that our model outperforms the original implementation for learning synthetic gradients in a variety of tasks, and is particularly suited for capturing longer timescales. Finally, building on recent work, we reflect on accumulate $\mathrm{BP}(\lambda)$ as a principle for learning in biological circuits. In summary, inspired by RL principles, we introduce an algorithm capable of bias-free online learning via synthetic gradients.
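For readers unfamiliar with the RL analogy, tabular accumulate $\mathrm{TD}(\lambda)$ — the algorithm whose eligibility-trace mechanism inspires accumulate $\mathrm{BP}(\lambda)$ — can be sketched as follows. This is the RL precursor, not the paper's algorithm.

```python
import numpy as np

def accumulate_td_lambda(states, rewards, n_states,
                         alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular accumulate TD(lambda): value learning via eligibility traces.

    `states` is a trajectory of length len(rewards) + 1.
    """
    v = np.zeros(n_states)
    e = np.zeros(n_states)          # eligibility trace
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        delta = rewards[t] + gamma * v[s_next] - v[s]   # TD error
        e *= gamma * lam            # decay all traces
        e[s] += 1.0                 # accumulate at the visited state
        v += alpha * delta * e      # credit recent states via traces
    return v
```

The trace `e` spreads each TD error backwards over recently visited states, with `lam` interpolating between pure bootstrapping (`lam=0`) and Monte Carlo returns (`lam=1`) — the same bias-control knob BP($\lambda$) applies to synthetic gradients.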

URL: https://openreview.net/forum?id=3kYgouAfqk

---

Title: Diffusion Model-Augmented Behavioral Cloning

Abstract: Imitation learning addresses the challenge of learning by observing an expert’s demonstrations without access to reward signals from environments. Most existing imitation learning methods that do not require interacting with environments either model the expert distribution as the conditional probability p(a|s) (e.g., behavioral cloning, BC) or the joint probability p(s,a). Despite the simplicity of modeling the conditional probability with BC, it usually struggles with generalization. While modeling the joint probability can improve generalization performance, the inference procedure is often time-consuming, and the model can suffer from manifold overfitting. This work proposes an imitation learning framework that benefits from modeling both the conditional and joint probability of the expert distribution. Our proposed diffusion model-augmented behavioral cloning (DBC) employs a diffusion model trained to model expert behaviors and learns a policy to optimize both the BC loss (conditional) and our proposed diffusion model loss (joint). DBC outperforms baselines in various continuous control tasks in navigation, robot arm manipulation, dexterous manipulation, and locomotion. We design additional experiments to verify the limitations of modeling either the conditional probability or the joint probability of the expert distribution, as well as compare different generative models. Ablation studies justify the effectiveness of our design choices.
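The two-term objective can be sketched schematically. All names below are hypothetical placeholders: a DDPM-style denoising loss stands in for the paper's diffusion model loss on expert (s, a) pairs, combined with the standard BC regression loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(denoiser, state, action, alpha_bar):
    """Schematic DDPM-style denoising loss on an expert (s, a) pair.

    `denoiser(x_noisy, t)` predicts the injected noise; it stands in
    for the diffusion model the paper trains on expert behaviour.
    `alpha_bar` is the cumulative noise schedule.
    """
    x0 = np.concatenate([state, action])          # joint (s, a) sample
    t = rng.integers(len(alpha_bar))              # random timestep
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise
    return float(np.mean((denoiser(x_t, t) - noise) ** 2))

def dbc_objective(policy_action, expert_action, diff_term, lam=1.0):
    """BC (conditional) loss plus a weighted diffusion (joint) term."""
    bc = float(np.mean((policy_action - expert_action) ** 2))
    return bc + lam * diff_term
```

The key design choice the abstract describes is exactly this combination: the BC term fits p(a|s) directly, while the diffusion term scores how plausible the policy's (s, a) pairs are under the joint expert distribution.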

URL: https://openreview.net/forum?id=TdyNw0b7f1

---

Title: Understanding Smoothness of Vector Gaussian Processes on Product Spaces

Abstract: Vector Gaussian processes are becoming increasingly important in machine learning and statistics, with applications to many branches of applied sciences. Recent efforts have allowed us to understand smoothness in scalar Gaussian processes defined over manifolds as well as over product spaces involving manifolds. This paper addresses the problem of quantifying smoothness for vector Gaussian processes that are defined over non-Euclidean product manifolds. After noting that a constructive RKHS approach is unsuitable for this specific task, we proceed through the analysis of spectral properties. Specifically, we find a spectral representation to quantify smoothness through Sobolev spaces that are adapted to certain measure spaces of product measures obtained through the tensor product of Haar measures with multivariate Gaussian measures. Our results allow smoothness to be measured in a simple way, and open the way for the study of foundational properties of certain machine learning techniques over product spaces.

URL: https://openreview.net/forum?id=WHAmxfLjeJ

---

Title: Distributed Training of Large Graph Neural Networks with Variable Communication Rates

Abstract: Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements.
Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs.
However, as the graph cannot generally be decomposed into small non-interacting components, data communication between the training machines quickly limits training speeds.
Compressing the communicated node activations by a fixed amount improves the training speeds, but lowers the accuracy of the trained GNN.
In this paper, we introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model.
Based on our theoretical analysis, we derive a variable compression method that converges to a solution that is equivalent to the full communication case.
Our empirical results show that our method attains performance comparable to full communication and that, for any given communication budget, it outperforms compression at any fixed ratio.
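A variable compression scheme of this general flavour can be sketched with uniform quantization of node activations whose bit-width grows over training. The linear schedule below is purely illustrative; the paper derives its rate from a convergence analysis rather than a fixed ramp.

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantization of activations to the given bit-width."""
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)   # integer codes
    return lo + q / levels * (hi - lo)            # dequantized values

def bit_schedule(epoch, total_epochs, min_bits=2, max_bits=8):
    """Variable rate: communicate coarsely early, finely late.

    Hypothetical linear schedule for illustration only.
    """
    frac = epoch / max(1, total_epochs - 1)
    return int(round(min_bits + frac * (max_bits - min_bits)))
```

The intuition matches the abstract: early in training, coarse (cheap) activations suffice because the model is far from convergence; the rate is increased only when finer gradients actually matter.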

URL: https://openreview.net/forum?id=nrzxJEvvHv

---

Title: Personalised Federated Learning On Heterogeneous Feature Spaces

Abstract: Personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common space, \emph{i.e.} all clients store their data according to the same schema. For real-world applications, this assumption is restrictive, as clients, having their own systems to collect and then store data, may use {\em heterogeneous} data representations. To bridge the gap between the assumption of a shared subspace and the more realistic situation of client-specific spaces, we propose a general framework coined FLIC that maps clients' data onto a common feature space via local embedding functions, in a federated manner. Preservation of class information in the latent space is ensured by a distribution alignment with respect to a learned reference distribution. We provide the algorithmic details of FLIC as well as theoretical insights supporting the relevance of our methodology. We compare its performance against FL benchmarks involving heterogeneous input feature spaces. Notably, we are the first to present a successful application of FL to Brain-Computer Interface signals acquired with differing numbers of sensors.

URL: https://openreview.net/forum?id=uCZJaqJchs

---

Title: Machine Learning for the Multi-Dimensional Bin Packing Problem: Literature Review and Empirical Evaluation

Abstract: The Bin Packing Problem (BPP) is a well-established combinatorial optimization (CO) problem. Since it has many applications in our daily life, e.g. logistics and resource allocation, people are seeking efficient bin packing algorithms. On the other hand, researchers have been making constant advances in machine learning (ML), which is famous for its efficiency. In this article, we first formulate BPP, introducing its variants and practical constraints. Then, a comprehensive survey on ML for multi-dimensional BPP is provided. We further collect some public benchmarks of 3D BPP, and evaluate some online methods on the Cutting Stock Dataset. Finally, we share our perspective on challenges and future directions in BPP. To the best of our knowledge, this is the first systematic review of ML-related methods for BPP.
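As a point of reference for the classical heuristics that ML-based packers are typically compared against, the First-Fit Decreasing rule for 1-D bin packing can be sketched as follows (the survey itself covers multi-dimensional variants beyond this sketch).

```python
def first_fit_decreasing(items, capacity):
    """First-Fit Decreasing: a classical 1-D bin packing heuristic.

    Sort items largest-first, then place each into the first bin
    with enough remaining capacity, opening a new bin if none fits.
    """
    bins = []       # remaining capacity per open bin
    packing = []    # list of items placed in each bin
    for item in sorted(items, reverse=True):
        for i, free in enumerate(bins):
            if item <= free:
                bins[i] -= item
                packing[i].append(item)
                break
        else:
            bins.append(capacity - item)   # open a new bin
            packing.append([item])
    return packing
```

FFD is guaranteed to use at most 11/9 OPT + 2/3 bins in one dimension, which is the kind of baseline that learned multi-dimensional packers are benchmarked against.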

URL: https://openreview.net/forum?id=wLObDCp4ca

---

Title: Unsupervised Federated Domain Adaptation for Segmentation of MRI Images

Abstract: Automatic semantic segmentation of magnetic resonance imaging (MRI) images using deep neural networks greatly assists in evaluating and planning treatments for various clinical applications. However, training these models is conditioned on the availability of abundant annotated data to implement the end-to-end supervised learning procedure. Even if we annotate enough data, MRI images display considerable variability due to factors such as differences in patients, MRI scanners, and imaging protocols. This variability necessitates retraining neural networks for each specific application domain, which, in turn, requires manual annotation by expert radiologists for all new domains. To relax the need for persistent data annotation, we develop a method for unsupervised federated domain adaptation using multiple annotated source domains. Our approach enables the transfer of knowledge from several annotated source domains to adapt a model for effective use in an unannotated target domain. Initially, we ensure that the target domain data shares similar representations with each source domain in a latent embedding space, modeled as the output of a deep encoder, by minimizing the pair-wise distances of the distributions for the target domain and the source domains. We then employ an ensemble approach to leverage the knowledge obtained from all domains. We provide theoretical analysis and perform experiments on the MICCAI 2016 multi-site dataset to demonstrate our method is effective.

URL: https://openreview.net/forum?id=ii6heihoen

---

Title: Optimal Transport Perturbations for Safe Reinforcement Learning with Robustness Guarantees

Abstract: Robustness and safety are critical for the trustworthy deployment of deep reinforcement learning. Real-world decision making applications require algorithms that can guarantee robust performance and safety in the presence of general environment disturbances, while making limited assumptions on the data collection process during training. In order to accomplish this goal, we introduce a safe reinforcement learning framework that incorporates robustness through the use of an optimal transport cost uncertainty set. We provide an efficient implementation based on applying Optimal Transport Perturbations to construct worst-case virtual state transitions, which does not impact data collection during training and does not require detailed simulator access. In experiments on continuous control tasks with safety constraints, our approach demonstrates robust performance while significantly improving safety at deployment time compared to standard safe reinforcement learning.

URL: https://openreview.net/forum?id=cgSXpAR4Gl

---

Title: Worst-Case Optimal Multi-Armed Gaussian Best Arm Identification with a Fixed Budget

Abstract: Experimental design is crucial in evidence-based decision-making with multiple treatment arms, such as online advertisements and medical treatments. This study investigates the problem of identifying the treatment arm with the highest expected outcome, referred to as \emph{the best treatment arm}, while minimizing the probability of misidentification. This problem has been studied across numerous research fields, including \emph{best arm identification} (BAI) and ordinal optimization. In our experiments, the number of treatment-allocation rounds is fixed. During each round, a decision-maker allocates a treatment arm to an experimental unit and observes a corresponding outcome, which follows a Gaussian distribution with variances that can differ among the treatment arms. At the end of the experiment, we recommend one of the treatment arms as an estimate of the best treatment arm based on the observations. To design an experiment, we first discuss lower bounds for the probability of misidentification through an information-theoretic approach. Our analysis highlights that the available information on the outcome distribution for each treatment arm, such as means (expected outcomes), variances, and the choice of the best treatment arm, significantly influences the lower bounds. In scenarios where available information is limited, we develop a lower bound that is valid under unknown means and an unknown choice of the best treatment arm, which we refer to as the worst-case lower bound. We demonstrate that the worst-case lower bound depends solely on the variances of the outcomes. Then, under the assumption that the variances are known, we propose the \emph{Generalized-Neyman-Allocation (GNA)-empirical-best-arm (EBA) strategy}, an extension of the Neyman allocation proposed by \citet{Neyman1934OnTT}.
We show that the GNA-EBA strategy is asymptotically optimal in the sense that its probability of misidentification aligns with the lower bounds as the sample size increases indefinitely and the differences between the expected outcomes of the best and other suboptimal arms converge to a uniform value. We refer to such strategies as asymptotically worst-case optimal.
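The allocation idea can be sketched in its classical form. The snippet below implements the textbook Neyman allocation (draws proportional to standard deviations) together with an empirical-best-arm recommendation; the paper's multi-armed GNA weights differ from this classical two-arm rule, so treat this strictly as an illustrative sketch.

```python
import numpy as np

def neyman_allocation(sigmas, budget):
    """Classical Neyman allocation: draws proportional to std devs.

    This is the rule the paper's GNA strategy generalizes to the
    multi-armed worst-case setting; the GNA weights themselves are
    not reproduced here.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    w = sigmas / sigmas.sum()
    counts = np.floor(w * budget).astype(int)
    counts[0] += budget - counts.sum()   # hand rounding leftovers to arm 0
    return counts

def empirical_best_arm(samples_per_arm):
    """EBA recommendation: the arm with the highest sample mean."""
    return int(np.argmax([np.mean(s) for s in samples_per_arm]))
```

Noisier arms receive proportionally more draws, which is what makes the resulting misidentification probability depend only on the variances, matching the worst-case lower bound the abstract describes.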

URL: https://openreview.net/forum?id=AJ8mPi10ws

---

Title: Efficient Variational Continual Learning with optimal parameter trajectory in parameter hyperspace

Abstract: Continual learning, a foundational challenge in machine learning, grapples with critical issues of efficient parameter storage and robust regularization, especially in Bayesian neural networks. Our research addresses these challenges with significant contributions. To address the storage complexity of fully connected layer parameters, we propose an efficient method that substantially reduces memory requirements. In convolutional neural networks, tailored parameter storage for Bayesian networks becomes essential, countering the parameter escalation caused by uncertainty inclusion. In variational continual learning, our work introduces an enhanced regularization term that preserves Kullback-Leibler divergence strengths while overcoming associated challenges. We also present an augmented Evidence Lower Bound term, crucial for capturing correlations between data and network parameters. This enables the storage of common and distinctive parameter hyperspace bases, vital in continual learning. Our approach strategically divides the parameter subspace into common and distinctive subspaces, with conditions for effective backward and forward knowledge transfer, elucidating the network-parameter dataset correspondence. In summary, our contributions advance efficient and effective continual learning, offering insights into parameter storage and regularization techniques for Bayesian neural networks.

URL: https://openreview.net/forum?id=OFQIKBlyvq

---

Title: Scaling Up Bayesian Neural Networks with Neural Networks

Abstract: Bayesian Neural Networks (BNNs) offer a more principled, robust, and interpretable framework for analyzing high-dimensional data. They address the typical challenges associated with conventional deep learning methods, such as data insatiability, ad-hoc nature, and susceptibility to overfitting. However, their implementation typically relies on Markov chain Monte Carlo (MCMC) methods that are characterized by their computational intensity and inefficiency in a high-dimensional space. To address this issue, we propose a novel Calibration-Emulation-Sampling (CES) strategy to significantly enhance the computational efficiency of BNNs. In this CES framework, during the initial calibration stage, we collect a small set of samples from the parameter space. These samples serve as training data for the emulator. Here, we employ a Deep Neural Network (DNN) emulator to approximate the forward mapping, i.e., the process by which input data pass through various layers to generate predictions. The trained emulator is then used for sampling from the posterior distribution at substantially higher speed compared to the original BNN. Using simulated and real data, we demonstrate that our proposed method improves the computational efficiency of BNNs, while maintaining similar performance in terms of prediction accuracy and uncertainty quantification.
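The three CES stages can be illustrated end to end on a toy target. Everything below is a stand-in: the "expensive" model is a simple log-posterior, the DNN emulator is replaced by a polynomial fit, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def expensive_log_post(theta):
    # Stand-in for the costly exact log-posterior (a BNN in the paper).
    return -0.5 * theta ** 2

# 1) Calibration: evaluate the expensive model at a few points.
cal_theta = np.linspace(-3.0, 3.0, 25)
cal_logp = expensive_log_post(cal_theta)

# 2) Emulation: fit a cheap surrogate (polynomial here, a DNN in the paper).
coeffs = np.polyfit(cal_theta, cal_logp, deg=4)
emulator = lambda t: np.polyval(coeffs, t)

# 3) Sampling: random-walk Metropolis driven only by the emulator.
def metropolis(log_post, n_steps=2000, step=0.5):
    theta, lp, samples = 0.0, log_post(0.0), []
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            theta, lp = prop, lp_prop
        samples.append(theta)
    return np.array(samples)

samples = metropolis(emulator)
```

Only stage 1 touches the expensive model; the MCMC chain in stage 3 queries the cheap emulator exclusively, which is where the claimed speedup comes from.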

URL: https://openreview.net/forum?id=LjeF7DDm8Z

---

Title: Causal Discovery from Time Series with Hybrids of Constraint-Based and Noise-Based Algorithms

Abstract: Constraint-based methods and noise-based methods are two distinct families of methods proposed for uncovering causal graphs from observational data. However, both operate under strong assumptions that may be challenging to validate or could be violated in real-world scenarios. In response to these challenges, there is a growing interest in hybrid methods that amalgamate principles from both methods, showing robustness to assumption violations. This paper introduces a novel comprehensive framework for hybridizing constraint-based and noise-based methods designed to uncover causal graphs from observational time series. The framework is structured into two classes. The first class employs a noise-based strategy to identify a super graph, containing the true graph, followed by a constraint-based strategy to eliminate unnecessary edges. In the second class, a constraint-based strategy is applied to identify a skeleton, which is then oriented using a noise-based strategy. The paper provides theoretical guarantees for each class under the condition that all assumptions are satisfied, and it outlines some properties when assumptions are violated. To validate the efficacy of the framework, two algorithms from each class are experimentally tested on simulated data, realistic ecological data, and real datasets sourced from diverse applications.
Notably, two novel datasets related to Information Technology monitoring are introduced within the set of considered real datasets.
The experimental results underscore the robustness and effectiveness of the hybrid approaches across a broad spectrum of datasets.

URL: https://openreview.net/forum?id=PGLbZpVk2n

---

Title: High-Performance Machine Learning for FinTech

Abstract: The use of machine learning techniques in the FinTech sector, such as in portfolio selection, requires a high-performance compute engine to test a large number of candidate strategies. The higher the runtime performance of the engine, the larger the number of strategies that can be considered and the wider the scope of testing that could be conducted. This leads to better quality strategies. Thus, runtime performance should directly impact strategy performance in the market. This paper presents a compute engine for processing market data and implementing highly customisable trading strategies. We use it as the core of our differential-evolution-based machine learning framework for portfolio selection. We then discuss a range of key techniques that improve its runtime performance and show that it outperforms other extant engines.

URL: https://openreview.net/forum?id=mU0MWw4w3G

---

Title: Best-of-Both-Worlds Linear Contextual Bandits

Abstract: This study investigates the problem of $K$-armed linear contextual bandits, an instance of the multi-armed bandit problem, under adversarial corruption. At each round, a decision-maker observes an independent and identically distributed context and then selects an arm based on the context and past observations. After selecting an arm, the decision-maker incurs a loss corresponding to the selected arm. The decision-maker aims to minimize the cumulative loss over the trial. The goal of this study is to develop a strategy that is effective in both stochastic and adversarial environments, with theoretical guarantees. We first formulate the problem by introducing a novel setting of bandits with adversarial corruption, referred to as the contextual adversarial regime with a self-bounding constraint. We assume linear models for the relationship between the loss and the context. Then, we propose a strategy that extends the {\tt RealLinExp3} by \citet{Neu2020} and Follow-The-Regularized-Leader (FTRL). The regret of our proposed algorithm is shown to be upper-bounded by $O\left(\min\left\{\frac{(\log(T))^3}{\Delta_{*}} + \sqrt{\frac{C(\log(T))^3}{\Delta_{*}}},\ \ \sqrt{T}(\log(T))^2\right\}\right)$, where $T \in\mathbb{N}$ is the number of rounds, $\Delta_{*} > 0$ is the constant minimum gap between the best and suboptimal arms for any context, and $C\in[0, T] $ is an adversarial corruption parameter. This regret upper bound implies $O\left(\frac{(\log(T))^3}{\Delta_{*}}\right)$ in a stochastic environment and $O\left( \sqrt{T}(\log(T))^2\right)$ in an adversarial environment. We refer to our strategy as the {\tt Best-of-Both-Worlds (BoBW) RealFTRL}, due to its theoretical guarantees in both stochastic and adversarial regimes.

URL: https://openreview.net/forum?id=aIG2RAtNuX

---

Title: FedorAS: Federated Architecture Search under system heterogeneity

Abstract: Federated learning (FL) has gained considerable attention due to its ability to learn on decentralised data while preserving client privacy. However, it also poses additional challenges related to the heterogeneity of the participating devices, both in terms of their computational capabilities and contributed data. Meanwhile, Neural Architecture Search (NAS) has been successfully used with centralised datasets, producing state-of-the-art results in constrained or unconstrained settings. Such centralised datasets may not always be available for training, though. Most recent work at the intersection of NAS and FL attempts to alleviate this issue in a cross-silo federated setting, which assumes homogeneous compute environments with datacenter-grade hardware. In this paper we explore the question of whether we can design architectures of different footprints in a cross-device federated setting, where the device landscape, availability and scale are very different. To this end, we design our system, FedorAS, to discover and train promising architectures in a resource-aware manner when dealing with devices of varying capabilities holding non-IID distributed data. We present empirical evidence of its effectiveness across different settings, spanning three different modalities (vision, speech, text), and showcase its superior performance compared to state-of-the-art federated solutions, while maintaining resource efficiency.

URL: https://openreview.net/forum?id=oomnhFFrrS

---

Title: Towards generalizing deep-audio fake detection networks

Abstract: Today's generative neural networks allow the creation of high-quality synthetic speech at scale. While we welcome the creative use of this new technology, we must also recognize the risks. As synthetic speech is abused for monetary and identity theft, we require a broad set of deepfake identification tools. Furthermore, previous work reported a limited ability of deep classifiers to generalize to unseen audio generators. We study the frequency domain fingerprints of current audio generators. Building on top of the discovered frequency footprints, we train excellent lightweight detectors that generalize. We report improved results on the WaveFake dataset and an extended version. To account for the rapid progress in the field, we extend the WaveFake dataset by additionally considering samples drawn from the novel Avocodo and BigVGAN networks.
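As a toy illustration of such frequency-domain fingerprints (our own sketch, not the paper's pipeline; the moving-average "generator" below is a made-up artifact standing in for a vocoder), one can compare average log-magnitude spectra of real and synthetic clips:

```python
import numpy as np

def avg_log_spectrum(clips, n_fft=256):
    """Average log-magnitude rFFT spectrum over a batch of 1-D clips."""
    return np.mean([np.log(np.abs(np.fft.rfft(c, n=n_fft)) + 1e-8) for c in clips], axis=0)

def fingerprint(fake_clips, real_clips, n_fft=256):
    """Frequency-domain fingerprint: average fake spectrum minus average real spectrum."""
    return avg_log_spectrum(fake_clips, n_fft) - avg_log_spectrum(real_clips, n_fft)

rng = np.random.default_rng(0)
real = [rng.normal(size=1024) for _ in range(32)]
# Hypothetical "generator" artifact: a moving average that suppresses high frequencies.
fake = [np.convolve(c, np.ones(8) / 8.0, mode="same") for c in real]
fp = fingerprint(fake, real)   # strongly negative in the upper frequency bins
```

A real generator leaves subtler footprints, but the same averaged-spectrum comparison reveals them.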

URL: https://openreview.net/forum?id=RGewtLtvHz

---

Title: A Gold Standard Dataset for the Reviewer Assignment Problem

Abstract: Many peer-review venues are either using or looking to use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score"---a numerical estimate of the expertise of a reviewer in reviewing a paper---and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose an algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better ones is the lack of publicly available gold-standard data that would be needed to perform reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously.

We use this data to compare several popular algorithms currently employed in computer science conferences and come up with recommendations for stakeholders. Our three main findings are:
- All algorithms make a non-trivial amount of error. For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases, thereby highlighting the vital need for more research on the similarity-computation problem.
- Most existing algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter+MFR algorithm performs best.
- To improve performance, it may be important to develop modern deep-learning based algorithms that can make use of the full texts of papers: the classical TF-IDF algorithm enhanced with full texts of papers is on par with the deep-learning based Specter+MFR, which cannot make use of this information.

We encourage researchers to use this dataset for evaluating and developing better similarity-computation algorithms.
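As a rough illustration of the TF-IDF baseline discussed above (our own sketch, not the evaluation code used for the dataset), similarity between a submission and a reviewer's past papers can be computed as TF-IDF cosine similarity over bags of tokens:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

papers = [["bandit", "regret", "bound"],       # submission
          ["bandit", "policy", "regret"],      # reviewer A's past paper
          ["vision", "transformer", "image"]]  # reviewer B's past paper
vecs = tfidf_vectors(papers)
```

Here the submission scores higher against reviewer A's paper (shared vocabulary) than against reviewer B's (no overlap).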

URL: https://openreview.net/forum?id=gEjE89BhcI

---

Title: Revealing an Overlooked Challenge in Class-Incremental Graph Learning

Abstract: Graph Neural Networks (GNNs), which effectively learn from static graph-structured data, become ineffective when directly applied to streaming data in a continual learning (CL) scenario. In CL, historical data are not available during the current stage for a number of reasons, such as limited storage and GDPR data retention policies, to name a few. A few recent works study this problem; however, they overlook the uniqueness of continual graph learning (CGL) compared to well-studied continual image classification: the unavailability of previous training data further poses challenges to inference in CGL, in addition to the well-known catastrophic forgetting problem. While existing works make the strong assumption that full access to historical data is unavailable during training but provided during inference, which potentially contradicts the continual learning paradigm (Van de Ven & Tolias, 2019), we study continual graph learning without this strong and contradictory assumption. In this case, without being re-inserted into previous training graphs for inference, streaming test nodes are often more sparsely connected, which makes inference more difficult due to insufficient neighborhood information. In this work, we propose ReplayGNN (ReGNN) to jointly solve the above two challenges without memory buffers: catastrophic forgetting and poor neighbor information during inference. Extensive experiments demonstrate the effectiveness of our model over baseline models and its effectiveness in different cases with different levels of neighbor information available.

URL: https://openreview.net/forum?id=ScAc73Y1oJ

---

Title: Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization

Abstract: Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically the CLIP model, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains.
Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module. This module seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions.
We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.

URL: https://openreview.net/forum?id=4KLwep6mA1

---

Title: Diverse Diffusion: Enhancing Image Diversity in Text-to-Image Generation

Abstract: Latent diffusion models excel at producing high-quality images from text. Yet, concerns have been raised about the lack of diversity in the generated imagery. To tackle this, we introduce Diverse Diffusion, a method for boosting image diversity beyond gender and ethnicity, extending to richer attributes such as color diversity.

Diverse Diffusion is a general unsupervised technique that can be applied to existing text-to-image models. Our approach focuses on finding vectors in the Stable Diffusion latent space that are distant from each other. We generate multiple vectors in the latent space until we find a set of vectors that meets the desired distance requirements and the required batch size.

To evaluate the effectiveness of our diversity methods, we conduct experiments examining various characteristics, including color diversity, LPIPS metric, and ethnicity/gender representation in images featuring humans.

The results of our experiments emphasize the significance of diversity in generating realistic and varied images, offering valuable insights for improving text-to-image models. Through the enhancement of image diversity, our approach contributes to the creation of more inclusive and representative AI-generated art.
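The latent-space search described above can be sketched as simple rejection sampling (our illustrative version; the plain Gaussian latent and the `min_dist` threshold are assumptions, not the paper's exact procedure):

```python
import numpy as np

def diverse_latents(batch_size, dim, min_dist, rng, max_tries=10000):
    """Rejection-sample latent vectors whose pairwise L2 distances
    all exceed min_dist (min_dist is a hypothetical threshold)."""
    latents = []
    for _ in range(max_tries):
        z = rng.standard_normal(dim)
        if all(np.linalg.norm(z - v) >= min_dist for v in latents):
            latents.append(z)
            if len(latents) == batch_size:
                return np.stack(latents)
    raise RuntimeError("could not satisfy the distance constraint")

zs = diverse_latents(batch_size=4, dim=16, min_dist=4.0, rng=np.random.default_rng(0))
```

Each accepted vector is guaranteed to be at least `min_dist` away from every other vector in the batch, which is the distance requirement the abstract refers to.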

URL: https://openreview.net/forum?id=F0J1N6M5N9

---

Title: DIGNet: Learning Decomposed Patterns in Representation Balancing for Treatment Effect Estimation

Abstract: Estimating treatment effects from observational data is often subject to a covariate shift problem incurred by selection bias. Recent research has sought to mitigate this problem by leveraging representation balancing methods that aim to extract balancing patterns from observational data and utilize them for outcome prediction. The underlying theoretical rationale is that minimizing the unobserved counterfactual error can be achieved through two principles: (I) reducing the risk associated with predicting factual outcomes and (II) mitigating the distributional discrepancy between the treated and control samples. However, an inherent trade-off between the two principles can lead to a potential over-balancing issue, resulting in the loss of valuable information for factual outcome predictions and, consequently, deteriorating treatment effect estimations. To overcome this challenge, we propose a novel representation balancing model, DIGNet, for treatment effect estimation. DIGNet incorporates two key components, PDIG and PPBR, which effectively mitigate the trade-off problem by improving one aforementioned principle without sacrificing the other. Specifically, PDIG captures more effective balancing patterns (Principle II) without affecting factual outcome predictions (Principle I), while PPBR enhances factual outcome prediction (Principle I) without affecting the learning of balancing patterns (Principle II). Our comprehensive ablation studies confirm the effectiveness of PDIG and PPBR in improving treatment effect estimation, and experimental results on benchmark datasets demonstrate the superior performance of our DIGNet model compared to baseline models.

URL: https://openreview.net/forum?id=Z20FInfWlm

---

Title: Incremental Extractive Opinion Summarization Using Cover Trees

Abstract: Extractive opinion summarization involves automatically producing a summary of text about an entity (e.g., a product’s reviews) by extracting representative sentences that capture prevalent opinions in the review set. Typically, in online marketplaces user reviews accrue over time, and opinion summaries must be updated periodically to provide customers with up-to-date information. In this work, we study the task of extractive opinion summarization in an incremental setting, where the underlying review set evolves over time. Many of the state-of-the-art extractive opinion summarization approaches are centrality-based, such as CentroidRank (Radev et al., 2004; Chowdhury et al., 2022). CentroidRank performs extractive summarization by selecting a subset of review sentences closest to the centroid in the representation space as the summary. However, these methods are not capable of operating efficiently in an incremental setting, where reviews arrive one at a time. In this paper, we present an efficient algorithm for accurately computing the CentroidRank summaries in an incremental setting. Our approach, CoverSumm, relies on indexing review representations in a cover tree and maintaining a reservoir of candidate summary review sentences. CoverSumm’s efficacy is supported by a theoretical and empirical analysis of running time. Empirically, on a diverse collection of data (both real and synthetically created to illustrate scaling considerations), we demonstrate that CoverSumm is up to 25x faster than baseline methods, and capable of adapting to nuanced changes in data distribution. We also conduct human evaluations of the generated summaries and find that CoverSumm is capable of producing informative summaries consistent with the underlying review set.
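The centroid-based selection that CoverSumm accelerates can be sketched as follows (our illustration, with a naive linear scan in place of the paper's cover-tree index, and toy 2-D embeddings in place of learned sentence representations):

```python
import numpy as np

class IncrementalCentroidSumm:
    """Incremental CentroidRank: keep a running sum of sentence embeddings
    and return the k sentences closest to the centroid. A naive O(n) scan
    stands in for CoverSumm's cover-tree index (illustration only)."""

    def __init__(self, k):
        self.k = k
        self.vecs = []
        self.total = None

    def add(self, v):
        """Incorporate one newly arrived review-sentence embedding."""
        v = np.asarray(v, dtype=float)
        self.vecs.append(v)
        self.total = v.copy() if self.total is None else self.total + v

    def summary_indices(self):
        """Indices of the k sentences nearest the current centroid."""
        centroid = self.total / len(self.vecs)
        dists = [np.linalg.norm(v - centroid) for v in self.vecs]
        return sorted(np.argsort(dists)[: self.k].tolist())

summ = IncrementalCentroidSumm(k=2)
for v in [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1]]:   # toy embeddings
    summ.add(v)
```

The cover tree replaces the linear scan here, which is how CoverSumm avoids recomputing nearest neighbors from scratch as each review arrives.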

URL: https://openreview.net/forum?id=IzmLJ1t49R

---

Title: NysReg-Gradient: Regularized Nystr\"om-Gradient for Large-Scale Unconstrained Optimization and its Application

Abstract: We develop a regularized Nystr\"om method for solving unconstrained optimization problems with high-dimensional feature spaces. While conventional second-order approximation methods such as quasi-Newton methods rely on first-order derivatives, our method leverages actual Hessian information. Newton-sketch based methods employ a sketch matrix to approximate the Hessian, which requires a thick embedding matrix with a large sketch size, whereas the randomized subspace Newton method projects the Hessian onto a lower-dimensional subspace and thus utilizes only limited Hessian information. In contrast, we propose a balanced approach by introducing the regularized Nystr\"om approximation, which leverages partial Hessian information in the form of a thin column matrix to approximate the Hessian. We integrate the approximated Hessian with gradient descent and stochastic gradient descent. To further reduce computational complexity per iteration, we compute the approximated inverse-Hessian-gradient product directly, without forming the inverse of the approximated Hessian. We provide a convergence analysis, discuss certain theoretical aspects, and report numerical experiments on strongly convex functions and deep learning. The experiments on strongly convex functions demonstrate that our method notably outperforms the randomized subspace Newton and the Newton-sketch approximation, showing considerable advancement in optimization with high-dimensional feature spaces. Moreover, we report numerical results on an application to brain tumor detection, which show that the proposed method is competitive with existing quasi-Newton methods, showcasing its potential impact on tangible applications in critical domains.
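A minimal sketch of the core idea, in our own notation (not the authors' code): approximate the Hessian from $m$ sampled columns as $C W^{+} C^{\top}$ and apply the Woodbury identity, so the regularized direction needs only an $m \times m$ solve rather than an $n \times n$ inverse:

```python
import numpy as np

def nystrom_newton_direction(H, grad, cols, rho=1e-3):
    """Solve (H_nys + rho*I) d = grad, with H_nys = C W^+ C^T the Nystrom
    approximation built from the sampled columns `cols`. The Woodbury
    identity reduces the work to an m x m solve (m = len(cols))."""
    C = H[:, cols]                     # n x m sampled columns of the Hessian
    W = H[np.ix_(cols, cols)]          # m x m intersection block
    evals, evecs = np.linalg.eigh(W)   # W^+ via its eigendecomposition
    keep = evals > 1e-10
    Z = C @ (evecs[:, keep] / np.sqrt(evals[keep]))   # H_nys = Z @ Z.T
    inner = rho * np.eye(Z.shape[1]) + Z.T @ Z
    # Woodbury: (Z Z^T + rho I)^{-1} g = (g - Z (rho I + Z^T Z)^{-1} Z^T g) / rho
    return (grad - Z @ np.linalg.solve(inner, Z.T @ grad)) / rho

# Sanity check: with every column sampled, the Nystrom approximation of an SPD
# matrix is exact, so d solves (H + rho*I) d = g.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
H = A @ A.T + np.eye(6)
g = rng.normal(size=6)
d = nystrom_newton_direction(H, g, cols=list(range(6)))
```

In practice `cols` would be a small random subset, trading approximation quality for per-iteration cost.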

URL: https://openreview.net/forum?id=02laS2ZM6g

---

Title: Large Language Models can be Guided to Evade AI-generated Text Detection

Abstract: Large Language Models (LLMs) have demonstrated exceptional performance in a variety of tasks, such as essay writing and question answering, and have been widely employed by the public. However, rising concern over LLM misuse, such as plagiarism and spamming, has led to the development of multiple detectors, including fine-tuned classifiers and various statistical methods. In this study, we leverage LLMs with automatically constructed prompts to investigate the vulnerability of these detection systems. We introduce a novel Substitution-based In-Context example Optimization method (SICO) to create prompts automatically for detection evasion. Our evaluation across three real-world tasks indicates that SICO successfully enables GPT-3.5 to evade six detectors, resulting in an average 0.54 AUC drop. Furthermore, a comprehensive human evaluation, along with a validation experiment in the wild, shows that the texts generated by SICO achieve human-level readability and task-completion rates. Additionally, SICO is cost-efficient and flexible to implement, requiring only 40 human-generated examples and a limited number of LLM inferences. Once a task-specific prompt is generated, it can be employed against various detectors across different task inputs. Finally, the outstanding performance of SICO highlights its potential as a reliable evaluation protocol for future detectors.

URL: https://openreview.net/forum?id=lLE0mWzUrr

---

Title: An Invitation to Deep Reinforcement Learning

Abstract: Training a deep neural network to maximize a target objective has become the standard recipe for successful machine learning over the last decade. These networks can be optimized with supervised learning, if the target objective is differentiable.
For many interesting problems, this is however not the case. Common objectives like intersection over union (IoU), bilingual evaluation understudy (BLEU) score or rewards cannot be optimized with supervised learning.
A common workaround is to define differentiable surrogate losses, leading to suboptimal solutions with respect to the actual objective.
Reinforcement learning (RL) has emerged as a promising alternative for optimizing deep neural networks to maximize non-differentiable objectives in recent years.
Examples include aligning large language models via human feedback, code generation, object detection or control problems.
This makes RL techniques relevant to the larger machine learning audience. The subject is, however, time intensive to approach due to the large range of methods, as well as the often very theoretical presentation.
In this introduction, we take an alternative approach, different from classic reinforcement learning textbooks. Rather than focusing on tabular problems, we introduce reinforcement learning as a generalization of supervised learning, which we first apply to non-differentiable objectives and later to temporal problems.
Assuming only basic knowledge of supervised learning, the reader will be able to understand state-of-the-art deep RL algorithms like proximal policy optimization (PPO) after reading this tutorial.
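To illustrate the central point that RL can optimize non-differentiable objectives, here is a minimal score-function (REINFORCE) example on a toy two-armed bandit (our example, not from the tutorial): the reward is a step function of the sampled action, yet the policy parameter still receives a gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                                   # logit of choosing arm 1

def reward(arm):
    return 1.0 if arm == 1 else 0.0           # non-differentiable; arm 1 is better

for _ in range(2000):
    p1 = 1.0 / (1.0 + np.exp(-theta))         # pi(arm = 1)
    arm = int(rng.random() < p1)              # sample an action from the policy
    grad_log_pi = arm - p1                    # d/dtheta log pi(arm), Bernoulli(sigmoid(theta))
    theta += 0.1 * reward(arm) * grad_log_pi  # REINFORCE update

p1 = 1.0 / (1.0 + np.exp(-theta))             # the policy now strongly prefers arm 1
```

The update never differentiates through `reward`; it only needs reward samples, which is exactly what makes RL applicable to objectives like IoU or BLEU.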

URL: https://openreview.net/forum?id=DcFN89waLY

---

Title: Variational Learning ISTA

Abstract: Compressed sensing combines the power of convex optimization techniques with a sparsity-inducing prior on the signal space to solve an underdetermined system of equations. For many problems, the sparsifying dictionary is not directly given, nor can its existence be assumed. Moreover, the sensing matrix can change across different scenarios. Addressing these issues requires solving a sparse representation learning problem, namely dictionary learning, taking into account the epistemic uncertainty of the learned dictionaries and, finally, jointly learning sparse representations and reconstructions under varying sensing matrix conditions. We address both concerns by proposing a variant of the LISTA architecture. First, we introduce Augmented Dictionary Learning ISTA (A-DLISTA), which incorporates an augmentation module to adapt parameters to the current measurement setup. Then, we propose to learn a distribution over dictionaries via a variational approach, dubbed Variational Learning ISTA (VLISTA). VLISTA exploits A-DLISTA as the likelihood model and approximates a posterior distribution over the dictionaries as part of an unfolded LISTA-based recovery algorithm. As a result, VLISTA provides a probabilistic way to jointly learn the dictionary distribution and the reconstruction algorithm with varying sensing matrices. We provide theoretical and experimental support for our architecture and show that our model learns calibrated uncertainties.
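For reference, the classical ISTA iteration that LISTA-style architectures unroll (the standard algorithm, not the paper's A-DLISTA/VLISTA code) is a gradient step on the quadratic data term followed by soft-thresholding:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Classical ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - step * A.T @ (A @ x - y), step * lam)
    return x

# With A = I the fixed point is simply the soft-thresholded observation.
y = np.array([3.0, -0.05, 0.0, 1.0, -2.0])
x_hat = ista(np.eye(5), y, lam=0.1, n_iter=10)
```

LISTA replaces the fixed matrices and step size here with learned, per-layer parameters; A-DLISTA additionally adapts them to the measurement setup.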

URL: https://openreview.net/forum?id=AQk0UsituG

---

Title: Mildly Constrained Evaluation Policy for Offline Reinforcement Learning

Abstract: Offline reinforcement learning (RL) methodologies enforce constraints on the policy to adhere closely to the behavior policy, thereby stabilizing value learning and mitigating the selection of out-of-distribution (OOD) actions at test time. Conventional approaches apply identical constraints for both value learning and test-time inference. However, our findings indicate that the constraints suitable for value estimation may in fact be excessively restrictive for action selection at test time. To address this issue, we propose a Mildly Constrained Evaluation Policy (MCEP) for test-time inference alongside a more constrained target policy for value estimation. Since the target policy has been adopted in various prior approaches, MCEP can be seamlessly integrated with them as a plug-in. We instantiate MCEP based on the TD3BC (Fujimoto & Gu, 2021), AWAC (Nair et al., 2020) and DQL (Wang et al., 2023) algorithms. The empirical results on D4RL MuJoCo locomotion, high-dimensional humanoid and a set of 16 robotic manipulation tasks show that MCEP brings significant performance improvements to classic offline RL methods and can further improve SOTA methods. The code is open-sourced at \url{link}.

URL: https://openreview.net/forum?id=imAROs79Pb

---

Title: Self-Guided Semantic Alignment for Text Supervised Segmentation

Abstract: Learning to segment images by purely relying on the image-text alignment from web data can lead to sub-optimal performance due to noise in the training data. The noise comes from samples whose associated text does not describe, or only partially describes, the image's visual content. To address this, this work proposes a novel loss function termed SimCon, which compares an image jointly to images and texts while accounting for intra-modal similarities to determine the appropriate set of semantic positives. Further, combining the SimCon loss with multiple synthetically created views of the image makes training more robust; this version of the loss is termed MV-SimCon. The empirical results demonstrate that using the proposed loss function leads to consistent improvements on zero-shot, text supervised semantic segmentation and outperforms state-of-the-art by $+3.0\%$, $+3.3\%$ and $+6.9\%$ on PASCAL VOC, PASCAL Context and MSCOCO, respectively. With test time augmentations, we set a new record by improving these results further to $58.7\%$, $26.6\%$, and $33.3\%$ on PASCAL VOC, PASCAL Context, and MSCOCO, respectively. In addition, using the proposed loss function leads to robust training and faster convergence.

URL: https://openreview.net/forum?id=COIIuIaufb

---

Title: Geometrical aspects of lattice gauge equivariant convolutional neural networks

Abstract: Lattice gauge equivariant convolutional neural networks (L-CNNs) are a framework for convolutional neural networks that can be applied to non-Abelian lattice gauge theories without violating gauge symmetry. We demonstrate how L-CNNs can be equipped with global group equivariance. This allows us to extend the formulation to be equivariant not just under translations but under global lattice symmetries such as rotations and reflections. Additionally, we provide a geometric formulation of L-CNNs and show how convolutions in L-CNNs arise as a special case of gauge equivariant neural networks on SU(N) principal bundles.

URL: https://openreview.net/forum?id=zO4aAVHxPe

---

Title: Boosting Data-Driven Mirror Descent with Randomization, Equivariance, and Acceleration

Abstract: Learning-to-optimize (L2O) is an emerging research area in large-scale optimization with applications in data science. Recently, researchers have proposed a novel L2O framework called learned mirror descent (LMD), based on the classical mirror descent (MD) algorithm with learnable mirror maps parameterized by input-convex neural networks. The LMD approach has been shown to significantly accelerate convex solvers while inheriting the convergence properties of the classical MD algorithm. This work proposes several practical extensions of the LMD algorithm, addressing its instability, scalability, and feasibility for high-dimensional problems. We first propose accelerated and stochastic variants of LMD, leveraging classical momentum-based acceleration and stochastic optimization techniques for improving the convergence rate and per-iteration computational complexity. Moreover, for the particular application of training neural networks, we derive and propose a novel and efficient parameterization for the mirror potential, exploiting the equivariant structure of the training problems to significantly reduce the dimensionality of the underlying problem. We provide theoretical convergence guarantees for our schemes under standard assumptions and demonstrate their effectiveness in various computational imaging and machine learning applications such as image inpainting, and the training of support vector machines and deep neural networks.
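For reference, the classical MD update that LMD parameterizes (shown here with the fixed negative-entropy mirror map rather than a learned input-convex network) can be sketched as:

```python
import numpy as np

def mirror_descent_simplex(grad_fn, x0, step, n_iter):
    """Classical mirror descent with the (fixed) negative-entropy mirror map,
    i.e. multiplicative-weights updates on the probability simplex."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x * np.exp(-step * grad_fn(x))   # gradient step in the dual space
        x /= x.sum()                         # map back to the simplex
    return x

# Minimize a linear loss <c, x> over the simplex: mass concentrates on argmin(c).
c = np.array([1.0, 0.5, 2.0])
x = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, step=0.5, n_iter=200)
```

LMD keeps this iteration but learns the mirror map itself, which is what the accelerated and stochastic variants in the paper then build on.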

URL: https://openreview.net/forum?id=r2dx1s1lqG

---

Title: On Formal Feature Attribution And Its Approximation

Abstract: Recent years have witnessed the widespread use of artificial intelligence (AI) algorithms and machine learning (ML) models. Despite their tremendous success, a number of vital problems, like ML model brittleness, fairness, and the lack of interpretability, warrant the need for active development in explainable artificial intelligence (XAI) and formal ML model verification. The two major lines of work in XAI include feature selection methods, e.g. Anchors, and feature attribution techniques, e.g. LIME and SHAP. Despite their promise, most existing feature selection and attribution approaches are susceptible to a range of critical issues, including explanation unsoundness and out-of-distribution sampling. A recent formal approach to XAI (FXAI), although serving as an alternative to the above and free of these issues, suffers from a few other limitations. For instance, besides its scalability limitation, the formal approach is unable to tackle the feature attribution problem. Additionally, a formal explanation, despite being formally sound, is typically quite large, which hampers its applicability in practical settings. Motivated by the above, this paper proposes a way to apply the apparatus of formal XAI to the case of feature attribution based on formal explanation enumeration. Formal feature attribution (FFA) is argued to be advantageous over the existing methods, both formal and non-formal. Given the practical complexity of the problem, the paper then proposes an efficient technique for approximating exact FFA. Finally, it offers experimental evidence of the effectiveness of the proposed approximate FFA in comparison to existing feature attribution algorithms, not only in terms of feature importance but also in terms of their relative order.

URL: https://openreview.net/forum?id=JESCxuSMmi

---

Title: D3: Data Diversity Design for Systematic Generalization in Visual Question Answering

Abstract: Systematic generalization is a crucial aspect of intelligence, which refers to the ability to generalize to novel tasks by combining known subtasks and concepts. One critical factor that has been shown to influence systematic generalization is the diversity of training data. However, diversity can be defined in various ways, as data have many factors of variation. A more granular understanding of how different aspects of data diversity affect systematic generalization is lacking. We present new evidence in the problem of Visual Question Answering (VQA) that reveals that the diversity of simple tasks (i.e. tasks formed by a few subtasks and concepts) plays a key role in achieving systematic generalization. This implies that it may not be essential to gather a large and varied number of complex tasks, which could be costly to obtain. We demonstrate that this result is independent of the similarity between the training and testing data and applies to well-known families of neural network architectures for VQA (i.e. monolithic architectures and neural module networks). Additionally, we observe that neural module networks leverage all forms of data diversity we evaluated, while monolithic architectures require more extensive amounts of data to do so. These findings provide a first step towards understanding the interactions between data diversity design, neural network architectures, and systematic generalization capabilities.

URL: https://openreview.net/forum?id=ZAin13msOp

---

Title: Few Shot Image Generation Using Conditional Set-Based GANs

Abstract: While there have been tremendous advances made in few-shot and zero-shot image generation in recent years, one area that remains comparatively underexplored is few-shot generation of images conditioned on sets of unseen images. Existing methods typically condition on a single image only and require strong assumptions about the similarity of the latent distribution of unseen classes relative to training classes. In contrast, we propose SetGAN - a conditional, set-based GAN that learns to generate sets of images conditioned on reference sets from unseen classes. SetGAN can combine information from multiple reference images, as well as generate diverse sets of images which mimic the factors of variation within the reference class. We also identify limitations of existing performance metrics for few-shot image generation, and discuss alternative performance metrics that can mitigate these problems.

URL: https://openreview.net/forum?id=IJagnIlhpk

---

Title: Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation

Abstract: Deep learning models can encounter unexpected failures, especially when dealing with challenging sub-populations. One common reason for these failures is the occurrence of objects in backgrounds that are rarely seen during training. To gain a better understanding of these failure modes, human-interpretable descriptions are crucial for further analysis and improvement, but obtaining them manually is expensive. In this study, we propose an end-to-end framework that utilizes the capabilities of large language models (ChatGPT) and vision-language deep models (CLIP) to generate text descriptions of failure modes associated with spurious correlations (e.g. rarely seen backgrounds) without human-in-the-loop intervention. These descriptions can be used to generate synthetic data using generative models, such as diffusion models. The model can then use this generated data to learn from its weaknesses and enhance its performance on backgrounds that are uncommon for each class of data. Our approach serves as a broad solution, promising progress in comprehending model failure modes and strengthening deep learning models across a wide range of failure scenarios (e.g. backgrounds, colors) automatically in a few-shot manner. Our experiments show remarkable \textbf{improvements in accuracy ($\sim \textbf{21\%}$)} on hard sub-populations (particularly for wrong background associations) across $40$ different models, such as ResNets, EfficientNets, DenseNets, Vision Transformer (ViT), SwAVs, MoCos, DINOs, and CLIPs, on various datasets such as ImageNet-1000, CIFAR-10, and CIFAR-100.

URL: https://openreview.net/forum?id=YZvlgVJXZC

---

Title: GLASU: A Communication-Efficient Algorithm for Federated Learning with Vertically Distributed Graph Data

Abstract: Vertical federated learning (VFL) is a distributed learning paradigm, where computing clients collectively train a model based on the partial features of the same set of samples they possess. Current research on VFL focuses on the case when samples are independent, but it rarely addresses an emerging scenario in which samples are interrelated through a graph. In this work, we train a graph neural network (GNN) through VFL, where each client owns a part of the node features and a different edge set. This data scenario incurs a significant communication overhead, not only because of the handling of distributed features but also due to neighborhood aggregation in a GNN. Moreover, the training analysis is faced with a challenge caused by biased stochastic gradients. We propose a model-splitting method that splits a backbone GNN across the clients and the server, together with a communication-efficient algorithm, GLASU, to train such a model. GLASU adopts lazy aggregation and stale updates to skip communication in neighborhood aggregation and in model updates, respectively, greatly reducing communication while enjoying convergence guarantees. We conduct extensive numerical experiments on real-world datasets, showing that GLASU effectively trains a GNN that matches the accuracy of centralized training, while using only a fraction of the time, thanks to the communication savings.

URL: https://openreview.net/forum?id=LHl2I2rWZa

---

Title: Hybrid Federated Learning for Feature \& Sample Heterogeneity: Algorithms and Implementation

Abstract: Federated learning (FL) is a popular distributed machine learning paradigm dealing with distributed and private data sets. Based on the data partition pattern, FL is often categorized into horizontal, vertical, and hybrid settings. All three settings have many applications, but hybrid FL remains relatively underexplored, because it deals with the challenging situation where {\it both} the feature space and the data samples are {\it heterogeneous}.
This work designs a novel mathematical model that effectively allows the clients to aggregate distributed data with heterogeneous, and possibly overlapping, features and samples. Our main idea is to partition each client's model into a feature extractor part and a classifier part, where the former processes the input data and the latter performs learning from the extracted features. The heterogeneous feature aggregation is done by building a server model, which assimilates local classifiers and feature extractors through a carefully designed matching mechanism. A communication-efficient algorithm is then designed to train both the client and server models. Finally, we conduct numerical experiments on multiple image classification data sets to validate the performance of the proposed algorithm. To our knowledge, this is the first formulation and algorithm developed for hybrid FL.

URL: https://openreview.net/forum?id=qc2lmWkvk4

---

Title: Merging by Matching Models in Task Subspaces

Abstract: Model merging aims to cheaply combine individual task-specific models into a single multitask model. In this work, we view past merging methods as leveraging different notions of a ''task subspace'' in which models are matched before being merged. We connect the task subspace of a given model to its loss landscape and formalize how this approach to model merging can be seen as solving a linear system of equations. While past work has generally been limited to linear systems that have a closed-form solution, we consider using the conjugate gradient method to find a solution. We show that using the conjugate gradient method can outperform closed-form solutions, enables merging via linear systems that are otherwise intractable to solve, and flexibly allows choosing from a wide variety of initializations and estimates for the ''task subspace''. We ultimately demonstrate that our merging framework, ''Matching Models in their Task Subspace'' (MATS), achieves state-of-the-art results in multitask and intermediate-task model merging. We include all of the code used in our work.
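The merging-as-linear-system view can be sketched as follows: given per-task positive-definite "subspace" matrices A_t (e.g. curvature estimates) and task parameters theta_t, a merged parameter vector solves (sum_t A_t) theta = sum_t A_t theta_t, which a matrix-free conjugate gradient solver can handle without forming an inverse. The random A_t here and the hand-rolled CG are illustrative assumptions, not the authors' MATS code.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve matvec(x) = b for a symmetric positive-definite operator."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if np.sqrt(rs) < tol:       # converged: stop before dividing by ~0
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(2)
d = 6
# Random SPD matrices standing in for per-task curvature estimates.
As = [(lambda M: M @ M.T + np.eye(d))(rng.normal(size=(d, d))) for _ in range(2)]
thetas = [rng.normal(size=d) for _ in range(2)]

A_sum = sum(As)
b = sum(A @ th for A, th in zip(As, thetas))
# Only matrix-vector products are needed, which is what makes large,
# closed-form-intractable systems feasible.
merged = conjugate_gradient(lambda v: A_sum @ v, b)

print(np.allclose(A_sum @ merged, b, atol=1e-6))  # True
```

Because CG only needs matrix-vector products, the same recipe extends to implicitly defined A_t that would be too large to invert directly, which is the practical advantage the abstract highlights over closed-form merging.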

URL: https://openreview.net/forum?id=qNGo6ghWFB

---

Title: Stochastic Direct Search Methods for Blind Resource Allocation

Abstract: Motivated by programmatic advertising optimization, we consider the task of sequentially allocating budget across a set of resources. At every time step, a feasible allocation is chosen and only a corresponding random return is observed. The goal is to maximize the cumulative expected sum of returns. This is a realistic model for budget allocation across subdivisions of marketing campaigns, with the objective of maximizing the number of conversions. We study direct search (also known as pattern search) methods for linearly constrained and derivative-free optimization in the presence of noise, which apply in particular to sequential budget allocation. These algorithms, which do not rely on hierarchical partitioning of the resource space, are easy to implement; they respect the operational constraints of resource allocation by avoiding evaluation outside of the feasible domain; and they are also compatible with warm start by being (approximate) descent algorithms. However, they have not yet been analyzed from the perspective of cumulative regret. We show that direct search methods achieve finite regret in the deterministic and unconstrained case. In the presence of evaluation noise and linear constraints, we propose a simple extension of direct search that achieves a regret upper bound of the order of $T^{2/3}$. We also propose an accelerated version of the algorithm, relying on repeated sequential testing, that significantly improves the practical behavior of the approach.
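A toy version of the pattern-search loop described above: poll candidate allocations obtained by moving budget between pairs of resources, skip any candidate that leaves the feasible simplex (so infeasible points are never evaluated), and shrink the step when no poll improves. The objective and polling scheme are toy assumptions, not the paper's algorithm.

```python
import numpy as np

def direct_search(f, x0, step=0.25, shrink=0.5, iters=100, min_step=1e-6):
    """Minimize f over the simplex by polling pairwise budget transfers."""
    x = x0.copy()
    n = len(x)
    while step > min_step and iters > 0:
        iters -= 1
        improved = False
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                cand = x.copy()
                cand[i] += step   # move `step` budget from resource j to i;
                cand[j] -= step   # the total budget is preserved exactly
                if cand[j] < 0:   # infeasible candidate: never evaluated
                    continue
                if f(cand) < f(x):
                    x = cand
                    improved = True
        if not improved:
            step *= shrink        # mesh refinement when polling stalls
    return x

# Toy concave return: maximize sum_i r_i * sqrt(a_i) over the simplex
# (minimized here as its negative).
returns = np.array([1.0, 3.0, 2.0])
f = lambda a: -returns @ np.sqrt(a)
x = direct_search(f, np.ones(3) / 3)
print(np.isclose(x.sum(), 1.0))  # the allocation stays on the simplex
```

Note that every polled point is feasible by construction, matching the abstract's point that direct search respects operational constraints by avoiding evaluation outside the feasible domain; the uniform start also illustrates warm-start compatibility.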

URL: https://openreview.net/forum?id=m1OXBLH0dH

---
